5 Transforming Data


Summary

  • A tibble is similar to a data frame in R but has default behavior that is slightly easier to work with.

  • The dplyr package gives us several tools for transforming our tibbles.

  • Different variable types get abbreviated in R. The type <int> is short for integer, <dbl> (which stands for double) is for floating point data, <chr> (which stands for character) is for text (string) data, and <dttm> is a date and time format.

  • filter retains observations with the requested properties.

  • The equality logical operator is == in R.

  • The operator logical and (& in R) is true if and only if all the logical expressions it connects are true.

  • The operator logical or (| in R) is true if and only if one or more of the logical expressions it connects are true.

  • The operator logical not (! in R) changes true to false and false to true.

  • arrange allows us to reorder the data points by their values from low to high.

  • The helper function desc, when applied to a variable that is a parameter of arrange, causes the arrangement to be from high to low.

  • Helper functions for select include starts_with, ends_with, contains, and num_range.

  • The operators & and |, when applied to a pair of equal-length vectors, apply logical and or logical or component by component. The related operators && and || only accept single logical values and are needed primarily for program control using if and while. For filtering, always use the single-character versions.


As usual, it is helpful to have the entire tidyverse ready to go.

library(tidyverse)

To illustrate the ideas in this chapter, a dataset from California detailing revenue sources from transit operators from 2002 to 2021 will be used. This set can be downloaded from https://s3.us-west-1.amazonaws.com/markhuber-datascience-resources.org/Data_sets/Transit_Operators_-_Revenues.csv. Once placed in the datasets folder of the current directory, the command

transrev <- read_csv("datasets/Transit_Operators_-_Revenues.csv")
## Rows: 196218 Columns: 11
## ── Column specification ───────────────────────────────────
## Delimiter: ","
## chr (8): Entity Name, Type, Form/Table, Category, Subcategory, Line Descript...
## dbl (3): Fiscal Year, Value, Row Number
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

will load it into main memory. This is a reasonably large dataset, with 196,218 observations and 11 variables. However, it is nowhere near big data size for a modern computer.

Since read_csv can read directly from a web address, why download first to local storage and then read? The reason is that reading from local storage is much faster, and every time a new R session is started, the data needs to be read into main memory again. So when working with a file, if possible, the best course of action is to download it once to a local hard drive, and then load it into main memory from that drive.
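For the record, reading straight from the web address looks like this. It works, but the file gets downloaded again every time the line is run:

transrev <- read_csv("https://s3.us-west-1.amazonaws.com/markhuber-datascience-resources.org/Data_sets/Transit_Operators_-_Revenues.csv")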

Consider the transit revenue tibble.

transrev
## # A tibble: 196,218 × 11
##    `Entity Name`     `Fiscal Year` Type  `Form/Table` Category
##    <chr>                     <dbl> <chr> <chr>        <chr>   
##  1 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  2 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  3 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  4 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  5 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  6 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  7 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  8 Access Services …          2017 Reve… TO_INCOME_S… Operati…
##  9 Access Services …          2017 Reve… TO_INCOME_S… Operati…
## 10 Access Services …          2017 Reve… TO_INCOME_S… Operati…
## # ℹ 196,208 more rows
## # ℹ 6 more variables: Subcategory <chr>,
## #   `Line Description` <chr>, Value <dbl>,
## #   `City, State` <chr>, `Zip Code` <chr>, `Row Number` <dbl>

The first line gives the variable names, such as "Entity Name" and "Fiscal Year". Right below the headings are abbreviations like <chr> and <dbl>. These tell us the type of variable we are dealing with. The four most commonly seen variable types are <int>, <dbl>, <chr>, and <dttm>.

  • <int> is an integer valued variable.

  • <dbl> is a floating point number. It is meant to represent a real number that has been rounded so that 64 bits can represent it in a computer. The abbreviation stands for double: early floating point numbers used 32 bits, and modern floating point numbers use double that space.

  • <chr> stands for character and is used for strings of characters like “UA” or “AA”.

  • <dttm> stands for date and time and records both the date and the time for the data point.

Recall that select can give us specific variables from the dataset. This function can also be used to rename variables. It turns out that spaces and special characters are difficult to use with later functions, so it helps to have names that are just one word.

transrev |> select("FY" = "Fiscal Year", "Category", "City-State" = "City, State")
## # A tibble: 196,218 × 3
##       FY Category          `City-State`
##    <dbl> <chr>             <chr>       
##  1  2017 Operating Revenue El Monte, CA
##  2  2017 Operating Revenue El Monte, CA
##  3  2017 Operating Revenue El Monte, CA
##  4  2017 Operating Revenue El Monte, CA
##  5  2017 Operating Revenue El Monte, CA
##  6  2017 Operating Revenue El Monte, CA
##  7  2017 Operating Revenue El Monte, CA
##  8  2017 Operating Revenue El Monte, CA
##  9  2017 Operating Revenue El Monte, CA
## 10  2017 Operating Revenue El Monte, CA
## # ℹ 196,208 more rows

Often, the goal is to understand a subset of the data. The filter command can be used to look at observations with specific properties.

5.1 The filter function

The filter function takes two or more arguments. The first is the dataset (often supplied via a pipe), and the second, third, et cetera are logical statements that must be true in order for an observation to be kept.

For instance, suppose that the goal is to keep observations from fiscal year 2013 coming from West Covina, CA. The following does this.

transrev |> 
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  filter(FY == 2013) |>
  filter(CityState == "West Covina, CA")
## # A tibble: 41 × 3
##       FY Category          CityState      
##    <dbl> <chr>             <chr>          
##  1  2013 Operating Revenue West Covina, CA
##  2  2013 Operating Revenue West Covina, CA
##  3  2013 Operating Revenue West Covina, CA
##  4  2013 Operating Revenue West Covina, CA
##  5  2013 Operating Revenue West Covina, CA
##  6  2013 Operating Revenue West Covina, CA
##  7  2013 Operating Revenue West Covina, CA
##  8  2013 Operating Revenue West Covina, CA
##  9  2013 Operating Revenue West Covina, CA
## 10  2013 Operating Revenue West Covina, CA
## # ℹ 31 more rows

These two filtering operations could have been boiled down to one filter call.

transrev |> 
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  filter(FY == 2013, CityState == "West Covina, CA")
## # A tibble: 41 × 3
##       FY Category          CityState      
##    <dbl> <chr>             <chr>          
##  1  2013 Operating Revenue West Covina, CA
##  2  2013 Operating Revenue West Covina, CA
##  3  2013 Operating Revenue West Covina, CA
##  4  2013 Operating Revenue West Covina, CA
##  5  2013 Operating Revenue West Covina, CA
##  6  2013 Operating Revenue West Covina, CA
##  7  2013 Operating Revenue West Covina, CA
##  8  2013 Operating Revenue West Covina, CA
##  9  2013 Operating Revenue West Covina, CA
## 10  2013 Operating Revenue West Covina, CA
## # ℹ 31 more rows

5.2 Comparison operators

Note that == is the logical operator that tests for equality. That is because a single = is used for parameter assignment in R. This use of the double equals == for checking equality appears across a wide variety of programming languages.

There are six commonly used comparison operators.

In words                     In R
greater than                 >
greater than or equal to     >=
less than                    <
less than or equal to        <=
equal to                     ==
not equal to                 !=
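
Any of these comparison operators can go inside filter. For instance, to keep only observations with a fiscal year of 2018 or later (using the renamed variables from before):

transrev |> 
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  filter(FY >= 2018)   # keeps rows whose fiscal year is 2018 or later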

One wrinkle in using == is that floating point numbers are not exact real numbers. For instance, consider the square root of 2.

sqrt(2)
## [1] 1.414214

This number is irrational, so the digits do not form a repeating pattern. That means that at some point, the computer simply truncates the remaining digits. Normally this is not a problem, until an operation is performed that mathematically would return an exact value.

sqrt(2) * sqrt(2)
## [1] 2

This result looks like a 2. But when == is used with 2, the result is FALSE!

sqrt(2) * sqrt(2) == 2
## [1] FALSE

This is because sqrt(2) * sqrt(2) does not return exactly 2; instead it returns a number so close to 2 that it prints as 2, but differs in a far-off digit.
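
The gap itself can be seen by subtracting; on a typical machine it is on the order of \(10^{-16}\).

sqrt(2) * sqrt(2) - 2
## [1] 4.440892e-16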

To deal with this floating point phenomenon, there is a command called near designed for exactly this situation. The command

near(sqrt(2)^2, 2)
## [1] TRUE

returns TRUE as desired, because the numbers are within a small tolerance based on machine precision. They are close enough that near treats them as equal, hence it returns TRUE.

5.3 Finding observations with minimum value

The filter and min functions together with the comparison operator == can be used to find the observation (row) with the minimum value in a particular column.

For instance, to find the observation in the iris dataset that has the smallest Sepal.Length:

iris |>
  filter(Sepal.Length == min(Sepal.Length))
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          4.3           3          1.1         0.1  setosa

Compare this to the summarize function, which only returns the minimum Sepal.Length value, and none of the other columns of the observation.

iris |>
  summarize(min(Sepal.Length))
##   min(Sepal.Length)
## 1               4.3
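
As an aside, newer versions of dplyr (1.0.0 and later) also provide the slice_min function, which combines the filter and min steps into a single call; a minimal sketch:

iris |>
  slice_min(Sepal.Length)   # row(s) with the smallest Sepal.Length

This returns the same row (or rows, in the case of ties) as the filter approach above.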

5.4 More about logic

So the filter command uses logical expressions, which are expressions that evaluate to either TRUE or FALSE. The rules for dealing with logical values of this type are called two-valued logic, which is the most important type of logic for purposes of data science.

Sometimes the need is for multiple conditions where at least one condition needs to be true. This is called logical or. Sometimes all the conditions need to be true. This is a logical and requirement.

The logical and of two logical statements is defined by \[\begin{align*} \textsf{T}\wedge \textsf{T}&= \textsf{T}\\ \textsf{T}\wedge \textsf{F}&= \textsf{F}\\ \textsf{F}\wedge \textsf{T}&= \textsf{F}\\ \textsf{F}\wedge \textsf{F}&= \textsf{F}. \end{align*}\]

In other words, the logical and of two logical statements is true if and only if both statements are true. The logical or of the statements is true if and only if at least one of the statements is true.

The logical or of two logical statements is defined by \[\begin{align*} \textsf{T}\vee \textsf{T}&= \textsf{T}\\ \textsf{T}\vee \textsf{F}&= \textsf{T}\\ \textsf{F}\vee \textsf{T}&= \textsf{T}\\ \textsf{F}\vee \textsf{F}&= \textsf{F}. \end{align*}\]

There is also logical not, which flips true to false and false to true.

The logical not of a logical statement is defined as \[\begin{align*} \neg \textsf{T}&= \textsf{F}\\ \neg \textsf{F}&= \textsf{T}. \end{align*}\]

In R, logical and (\(p \wedge q\)) uses the syntax p & q, logical or (\(p \vee q\)) uses the syntax p | q, and logical not (\(\neg p\)) uses the syntax !p.
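
These rules can be checked directly at the console:

TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE
!TRUE
## [1] FALSE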

So if the goal is to get observations from 2013 or 2009, use

transrev |> 
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  filter(FY == 2013 | FY == 2009)
## # A tibble: 21,976 × 3
##       FY Category          CityState     
##    <dbl> <chr>             <chr>         
##  1  2013 Operating Revenue Clovis, CA    
##  2  2013 Operating Revenue Downey, CA    
##  3  2013 Operating Revenue Downey, CA    
##  4  2013 Operating Revenue Exeter, CA    
##  5  2013 Operating Revenue Lompoc, CA    
##  6  2013 Operating Revenue Mariposa, CA  
##  7  2009 Operating Revenue Downey, CA    
##  8  2009 Operating Revenue Downey, CA    
##  9  2009 Operating Revenue Downey, CA    
## 10  2009 Operating Revenue Montebello, CA
## # ℹ 21,966 more rows

Note that there are no observations where the Fiscal Year is both 2009 and 2013.

transrev |> 
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  filter(FY == 2013 & FY == 2009)
## # A tibble: 0 × 3
## # ℹ 3 variables: FY <dbl>, Category <chr>, CityState <chr>

5.4.1 Missing values

One thing that often appears in data is a missing value, where a data value is simply not there. For instance, if the recipient of a census survey did not fill out their age, the age would appear in the data frame as NA.

Such missing values are impossible to compare with actual values, and so comparisons and arithmetic involving them tend to result in NA. For instance, the commands

NA > 8
## [1] NA
-3 == NA
## [1] NA
NA + 0
## [1] NA
NA / 2
## [1] NA
NA == NA
## [1] NA

all return NA.

The special command is.na determines if a value is missing or not:

x <- c(NA, 3, NA)
is.na(x)
## [1]  TRUE FALSE  TRUE

Now the filter command only returns rows where the condition is TRUE; if the condition is either FALSE or NA, the row is eliminated by the filter. So to keep the missing values as well, they must be asked for explicitly with is.na. For instance,

df <- tibble(x = c(1, NA, 3))
df |> filter(x > 1)
## # A tibble: 1 × 1
##       x
##   <dbl>
## 1     3

does not return the NA value in row 2. Whereas

df |> filter(is.na(x) | x > 1)
## # A tibble: 2 × 1
##       x
##   <dbl>
## 1    NA
## 2     3

does return rows where either \(x > 1\) or the value is NA.
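
Conversely, to drop the rows with missing values, negate is.na inside the filter:

df |> filter(!is.na(x))
## # A tibble: 2 × 1
##       x
##   <dbl>
## 1     1
## 2     3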

5.5 Using arrange to order rows

The arrange command will take the rows and sort them by numerical value. For instance,

transrev |>
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  arrange(FY)
## # A tibble: 196,218 × 3
##       FY Category          CityState    
##    <dbl> <chr>             <chr>        
##  1  2003 Operating Revenue Corcoran, CA 
##  2  2003 Operating Revenue King City, CA
##  3  2003 Operating Revenue Mariposa, CA 
##  4  2003 Operating Revenue Albany, CA   
##  5  2003 Operating Revenue Alameda, CA  
##  6  2003 Operating Revenue Alameda, CA  
##  7  2003 Operating Revenue Alameda, CA  
##  8  2003 Operating Revenue Alameda, CA  
##  9  2003 Operating Revenue Alameda, CA  
## 10  2003 Operating Revenue Alameda, CA  
## # ℹ 196,208 more rows

arranges the rows from the lowest year to the highest.

Note that arrange can sort by more than one variable. So to also arrange (alphabetically) by city:

transrev |>
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  arrange(FY, CityState)
## # A tibble: 196,218 × 3
##       FY Category          CityState  
##    <dbl> <chr>             <chr>      
##  1  2003 Operating Revenue Alameda, CA
##  2  2003 Operating Revenue Alameda, CA
##  3  2003 Operating Revenue Alameda, CA
##  4  2003 Operating Revenue Alameda, CA
##  5  2003 Operating Revenue Alameda, CA
##  6  2003 Operating Revenue Alameda, CA
##  7  2003 Operating Revenue Alameda, CA
##  8  2003 Operating Revenue Alameda, CA
##  9  2003 Operating Revenue Alameda, CA
## 10  2003 Operating Revenue Alameda, CA
## # ℹ 196,208 more rows

5.5.1 Helper functions

A helper function is applied to the parameters of another function to modify that function's behavior.

Suppose instead of arranging low to high, the goal is to arrange high to low. Then the helper function desc is placed around the variable inside the arrange function.

transrev |>
  select("FY" = "Fiscal Year", "Category", "CityState" = "City, State") |>
  arrange(desc(FY), CityState)
## # A tibble: 196,218 × 3
##       FY Category              CityState  
##    <dbl> <chr>                 <chr>      
##  1  2021 Operating Revenues    Alameda, CA
##  2  2021 Operating Revenues    Alameda, CA
##  3  2021 Operating Revenues    Alameda, CA
##  4  2021 Operating Revenues    Alameda, CA
##  5  2021 Operating Revenues    Alameda, CA
##  6  2021 Operating Revenues    Alameda, CA
##  7  2021 Operating Revenues    Alameda, CA
##  8  2021 Operating Revenues    Alameda, CA
##  9  2021 Nonoperating Revenues Alameda, CA
## 10  2021 Nonoperating Revenues Alameda, CA
## # ℹ 196,208 more rows

Here the year is going from 2021 downwards, but the cities are still being arranged alphabetically. Helper functions make existing functions much more powerful.

There are also several helper functions for the select function.

  • starts_with("start") matches all names that begin with "start".

  • ends_with("end") matches all names that end with "end".

  • contains("middle") matches all names that have the string "middle" somewhere inside them.

  • num_range("a", 1:4) would match a1, a2, a3, or a4.

transrev |>
  select(starts_with("Fiscal") | ends_with("State"))
## # A tibble: 196,218 × 2
##    `Fiscal Year` `City, State`
##            <dbl> <chr>        
##  1          2017 El Monte, CA 
##  2          2017 El Monte, CA 
##  3          2017 El Monte, CA 
##  4          2017 El Monte, CA 
##  5          2017 El Monte, CA 
##  6          2017 El Monte, CA 
##  7          2017 El Monte, CA 
##  8          2017 El Monte, CA 
##  9          2017 El Monte, CA 
## 10          2017 El Monte, CA 
## # ℹ 196,208 more rows

For more general matching of variable names, there is the matches helper, which uses what are called regular expressions. Regular expressions will be discussed in detail later on.
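
As a small preview (the pattern here is just for illustration), the matches helper below keeps every column whose name contains either "Year" or "Name":

transrev |>
  select(matches("Year|Name"))   # columns whose names contain "Year" or "Name"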

5.6 Vector operators in R

Note that the logical operators ==, & and | are vector operators. For instance, consider the command

c(2,1,-6) == c(2,7,-6)
## [1]  TRUE FALSE  TRUE

Because == looks at each component of the vectors and checks whether it is a match, the result is a vector of three Boolean values. Similarly, consider

c(TRUE, FALSE, FALSE) & c(TRUE, TRUE, FALSE)
## [1]  TRUE FALSE FALSE

This also returns a vector of three Boolean values.

Typically, this is exactly the behavior needed when using filter to find data with certain properties.

There are other logical operators, however: && and ||. These are not vector operators; they expect a single logical value on each side.

c(TRUE, FALSE, FALSE) && c(TRUE, TRUE, FALSE)
## Error in c(TRUE, FALSE, FALSE) && c(TRUE, TRUE, FALSE): 'length = 3' in coercion to 'logical(1)'

In current versions of R, the result is an error rather than a single Boolean, because && and || insist on length-one logical inputs.

These double-symbol operators are intended for program control using if and while, which is why they are needed. However, be careful: for filter, the need is almost always for the vector operators ==, & and |.
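
Here is a minimal sketch of && in program control; the values are made up purely for illustration:

x <- 5
if (x > 0 && x < 10) {          # && works here because each side is a single logical value
  print("x is strictly between 0 and 10")
}
## [1] "x is strictly between 0 and 10"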

5.7 How this relates to SQL

Those who have learned a bit about data science might have heard of the Structured Query Language (SQL). This language is designed to perform tasks similar to the ones in this chapter. Later, details will be given for how to build queries from a relational database with SQL that accomplish the types of tasks done here with dplyr. A nice feature of the tidyverse is that even when dealing with an SQL database, the dplyr commands can be used instead of SQL to query the database.

Questions

Consider the following dataset.

simple_example <- tibble(
  change = c(-5, 3, 4, -1),
  season = c("Winter", "Summer", "Summer", "Fall")  
)
simple_example
## # A tibble: 4 × 2
##   change season
##    <dbl> <chr> 
## 1     -5 Winter
## 2      3 Summer
## 3      4 Summer
## 4     -1 Fall
  1. Write code to order the observations by change in descending order.

  2. Write code to only keep the variable change.

The iris dataset built into R contains petal and sepal lengths (in cm) of 150 samples of irises drawn from three species.

  1. Use summarize to find the mean petal length of all 150 flowers.

  2. Use filter to keep only those whose petal length is at least 5.5.

  3. Use both summarize and filter to find the mean of those flowers with petal length at least 5.5.

In the iris dataset, the lengths of petals are measured in centimeters. One inch is 2.54 centimeters. Use mutate, filter, summarize and n to count the number of flowers in the dataset that have petals of length at least 2 inches.

The quakes dataset gives the locations of 1000 seismic events of magnitude greater than 4.0 in a cube near Fiji since 1964.

The number of observations can be verified with summarize and the n function.

quakes |>
  summarize(n())
##    n()
## 1 1000
  1. Find how many earthquakes there are in the data set of magnitude at least 5.0.

  2. Find how many earthquakes have magnitude at least 5.0 and depth at least 100 km.

Load in the nycflights13 package with the following code.

library(nycflights13)

This contains a dataset called flights.

flights |> select(year:dep_delay) |> 
  head() |> kable() |> kable_styling()
year month day dep_time sched_dep_time dep_delay
2013     1   1      517            515         2
2013     1   1      533            529         4
2013     1   1      542            540         2
2013     1   1      544            545        -1
2013     1   1      554            600        -6
2013     1   1      554            558        -4

Sort the flights data set in this package to find the flight that had the longest departure delay. What day was that flight on?

Continuing the last problem, consider the flights data set in package nycflights13.

  1. What command would filter the observations where arr_time is not available?

  2. How many such observations are there?

Consider a data set that just has angles in degrees from 0 to 360.

df_angles <- tibble(angle = 0:360)
  1. Use mutate to add columns that correspond to the cosine and sine of the angle.

  2. Filter your previous answer to get only those rows with angle between 175 and 185 degrees inclusive.

In the flights data set, the variable tailnum tells us the exact plane that took a flight.

  1. Find the plane that flew the greatest number of flights.

  2. How many planes flew at least 100 flights?

Which carrier had the highest percentage of on-time flights?