3 The Tidyverse

Summary

A library or package is a collection of functions and datasets. Packages can also be collections of other packages.
The package dplyr has functions for manipulating and transforming data.
The slice function picks out a subset of observations from the whole dataset.
The mutate function creates new variables as functions of existing ones.
The select function picks out variables with certain properties.
The pipe operator |> allows the output of a function to be given as the first input for another function. This makes composition of several functions easy, and makes code much more readable.
The tidyverse is a collection of packages in R for doing all aspects of data science.

3.1 Packages

So far the commands and data used have been part of base R, which consists of things that are available when the minimal version of R is installed on a computer.

Like most programming languages, R can be extended through the use of libraries, also known as packages.

A library or package is a collection of functions and datasets. Packages can also be collections of other packages.

A particular library typically has a theme. For instance, the dplyr library is made for transforming and manipulating data. Before a library can be used, it must be installed on the system. This can be accomplished with the install.packages command. It takes a single argument, the name of the package in quotes. So install.packages("dplyr") will install the dplyr package.

Installing a package only needs to be done once for any particular R installation. If you use the install.packages command again, R will download and install the latest available version of the package, which updates it if a newer version has been released.
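Because installation only has to happen once, it can be handy to check whether a package is already present before installing it. The following is a minimal sketch of one common way to do that, using the base R function requireNamespace, which is not discussed in the text above. The variable names pkg and is_installed are hypothetical and chosen only for illustration.

```r
# Name of the package to look for (any package name could be used here)
pkg <- "dplyr"

# requireNamespace() returns TRUE if the package can be found and FALSE otherwise
is_installed <- requireNamespace(pkg, quietly = TRUE)

# Only suggest installing when the package is missing
if (!is_installed) {
  message("Run install.packages(\"", pkg, "\") to install it.")
}
```

This sketch deliberately prints a reminder instead of calling install.packages itself, so running it never triggers a download.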

3.2 Built-in datasets

R contains a number of built-in datasets to assist in learning how functions work. For instance, consider the cars dataset. Using ?cars reveals that this is a data frame with 50 observations on 2 variables from around 1920. The first variable, speed, is the speed of the car in miles per hour, mph. The second variable, dist, is the number of feet the car needed to stop from that speed.

To get a glimpse of the data, the head function can be used to look at the first few observations. This function takes as its first argument any data frame, and by default, it will show the first six observations.

head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10
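As a small sketch building on this, head also accepts a second argument n that sets how many observations to show, and the base R function tail does the same thing counting from the end of the dataset. Neither n nor tail appears in the text above. Both are part of base R, and the names first_three and last_two are hypothetical.

```r
# Show only the first three observations instead of the default six
first_three <- head(cars, n = 3)
first_three

# tail() works like head(), but counts from the end of the data frame
last_two <- tail(cars, n = 2)
last_two

# nrow() reports how many observations the full dataset has
nrow(cars)
```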

3.3 The slice function

While head is useful, it always starts with the first observation. To go beyond this, the slice function can be used. This function is part of the dplyr package. It picks out observations by their position (row number) in the dataset. For instance, the third, seventh, and eleventh observations in the cars dataset can be found as follows.

dplyr::slice(cars, 3, 7, 11)

##   speed dist
## 1     7    4
## 2    10   18
## 3    11   28

The dplyr:: before the function name indicates that the slice function is part of the dplyr package. Here the :: separates the package name from the function.

The : symbol can also be used to construct sequences. For instance 4:6 translates to the three numbers 4, 5, 6. The : notation for sequences can be used inside of the slice function. The following uses this notation to place the first three observations in their own dataset first_cars.

first_cars <- dplyr::slice(cars, 1:3)
first_cars

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
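To illustrate the sequence notation a little further, here is a short sketch. It assumes the base R function seq, which is not covered in the text above, for sequences whose step size is not 1. The name every_tenth is hypothetical.

```r
# The colon operator builds a sequence of consecutive integers
4:6

# seq() generalizes this by allowing a step size other than 1
every_tenth <- seq(1, 41, by = 10)
every_tenth

# Either kind of sequence can be passed to slice()
dplyr::slice(cars, every_tenth)
```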

The slice function can also be used to remove observations by using negative indices: each observation that we wish to remove is given as a negative index. For instance,

dplyr::slice(first_cars, -2)

##   speed dist
## 1     4    2
## 2     7    4

removes the second observation from the data set first_cars.
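For readers who have seen base R's square bracket indexing, the following sketch compares it with slice. The bracket form is not used in the text above and is shown only as a point of comparison. The names by_slice and by_bracket are hypothetical.

```r
# Pick rows 3, 7, and 11 with slice()
by_slice <- dplyr::slice(cars, 3, 7, 11)

# Pick the same rows with base R bracket indexing
by_bracket <- cars[c(3, 7, 11), ]

# Both approaches select the same values
identical(by_slice$speed, by_bracket$speed)
identical(by_slice$dist, by_bracket$dist)

# Bracket indexing keeps the original row labels
rownames(by_bracket)

# Negating a whole sequence needs parentheses: -(1:47) drops rows 1 through 47
last_three <- dplyr::slice(cars, -(1:47))
last_three
```

The parentheses in -(1:47) matter, since -1:47 would instead build the sequence running from -1 up to 47.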

3.4 The library function

It gets old pretty quickly typing dplyr:: before every function from the package dplyr. The library command can be used to bring all the functions and datasets from a package into the environment.

library(dplyr)

You only need to use install.packages once to download and install a library in your local R installation. On the other hand, you need to use library every time you restart R or at the beginning of an R Markdown file to use the functions in a library.

Now that the library has been loaded, slice (or any other function in dplyr) can be used whenever we would like without the dplyr:: prefix.

slice(cars, 2:4)

##   speed dist
## 1     4   10
## 2     7    4
## 3     7   22
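One way to confirm that a package has been loaded is to look at the list of attached packages. The sketch below assumes the base R function search, which is not mentioned in the text above, and also checks that the short name and the prefixed name refer to the same function.

```r
library(dplyr)

# search() lists everything currently attached, including loaded packages
"package:dplyr" %in% search()

# After loading, the short name and the prefixed name are the same function
identical(slice, dplyr::slice)
```

When dplyr is loaded, R prints a message noting that a few of its functions (such as filter) mask functions of the same name from other packages. In those cases the :: prefix remains a useful way to say exactly which version is meant.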

3.5 The mutate command

Another useful function in the dplyr package is the mutate function. This allows the creation of a new variable (or overwriting an existing variable) in the dataset as a function of other variables. For instance, the speed variable in cars uses miles per hour, mph. To convert to kilometers per hour, kph, multiply by approximately 1.6. Using mutate to do this works as follows.

cars_kph <- mutate(cars, speed_kph = speed * 1.6)

Note that a big advantage of using the mutate command is that once the dataset is input as cars, there is no need for the $ operator. This can be helpful when dealing with many variable names, as here we can be sure that the speed variable belonging to the cars dataset is being used.
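To make the comparison with the $ operator concrete, here is a sketch that builds the same variable both ways. The names cars_base, cars_dplyr, cars_metric, and dist_m are hypothetical. The factor 0.3048 converts feet to meters.

```r
library(dplyr)

# Base R: the dataset name must be repeated with $ each time a variable is used
cars_base <- cars
cars_base$speed_kph <- cars_base$speed * 1.6

# dplyr: variables are referred to by name alone inside mutate()
cars_dplyr <- mutate(cars, speed_kph = speed * 1.6)

# The two results agree
all.equal(cars_base, cars_dplyr)

# mutate() can also create several variables in one call
cars_metric <- mutate(cars, speed_kph = speed * 1.6, dist_m = dist * 0.3048)
head(cars_metric, 3)
```

Note that mutate returns a new dataset, so the original cars is left unchanged.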

3.6 The select command

The select command in package dplyr can be used to keep some of the columns/variables in the dataset. For instance, the following code keeps speed_kph and dist as the only variables in the dataset, and then keeps only the first three observations.

slice(select(cars_kph, speed_kph, dist), 1:3)

##   speed_kph dist
## 1       6.4    2
## 2       6.4   10
## 3      11.2    4

The notation here is that of nested functions. To figure out what happens, start at the inside and work towards the outside. Unfortunately, this is not a very natural way to think. A better way to approach such problems is to think about starting with the cars_kph dataset, then applying select, then applying slice. An operator called a pipe can be used to code in this way.

3.7 Pipes

The pipe operator |> allows easy application of more than one function to a dataset. It changes the way the code is written so as to start with our dataset, and then apply one function after another.

A pipe works by moving the first argument to a function to the left of the pipe symbol. For example, consider the following simple function that adds together its two arguments.

add <- function(a, b) return(a + b)
add(3, 8)

## [1] 11

Then with the pipe symbol, the first argument can be moved to the left hand side of the |>. That is:

3 |> add(8)

## [1] 11
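One way to see that the pipe simply moves the first argument is to ask R to show the code without running it. The sketch below assumes R 4.1 or later (where |> is available) and uses the base R function quote, which is not covered in the text above. The function subtract is hypothetical and is included because, unlike add, the order of its arguments matters.

```r
add <- function(a, b) return(a + b)

# quote() captures code without running it; the pipe has already been rewritten
quote(3 |> add(8))   # prints add(3, 8)

# A hypothetical function where argument order matters
subtract <- function(a, b) return(a - b)

# The value on the left of the pipe always becomes the first argument
10 |> subtract(4)    # same as subtract(10, 4)
4 |> subtract(10)    # same as subtract(4, 10)
```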

At this point it might be hard to see the usefulness of pipes. With one or two functions pipes are not really necessary, but pipes really shine when many functions are being applied one after the other. For instance, consider the following code.

square <- function(a) return(a * a)
square(4)

## [1] 16

Now consider applying several iterations of add and square to some numbers.

x <- 3
add(square(add(x, 3)), -4)

## [1] 32

It is kind of hard to parse what is going on. Now look at the same code written using pipes.

x |>
  add(3) |>
  square() |>
  add(-4)

## [1] 32

Each application of a function gets its own line, which improves readability. Moreover, the operations now occur in order instead of inside-out. First take x, add 3, square it, then add -4. For transformations of datasets, this type of notation usually makes things much simpler to read and understand.
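For comparison, the same calculation can also be written with intermediate variables, one step per line. This is only a sketch of an alternative style, and the names step1, step2, and result are hypothetical.

```r
add <- function(a, b) return(a + b)
square <- function(a) return(a * a)

x <- 3

# Each step stores its result so the next step can use it
step1 <- add(x, 3)        # 3 + 3 = 6
step2 <- square(step1)    # 6 * 6 = 36
result <- add(step2, -4)  # 36 - 4 = 32
result
```

This reads in order just as the piped version does, but it leaves extra variables behind that are never needed again. The pipe avoids having to invent those names.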

Consider the earlier code for selecting two variables and three observations from the cars_kph dataset.

slice(select(cars_kph, speed_kph, dist), 1:3)

##   speed_kph dist
## 1       6.4    2
## 2       6.4   10
## 3      11.2    4

With pipes, this becomes:

cars_kph |> select(speed_kph, dist) |> slice(1:3)

##   speed_kph dist
## 1       6.4    2
## 2       6.4   10
## 3      11.2    4

It is the same result, but the pipe expression is more easily translated to human language (start with cars_kph, keep the variables speed_kph and dist, keep the first three observations) because it moves left to right instead of inside out.

To assign the result of several pipes to another variable, the assignment operator <- can be used. Put the new variable name first, then <-, then all of the piped together commands. (There is also a -> assignment that assigns whatever is on the left to the name on the right, but it is almost never used in practice and should be avoided.)

For instance, to do the previous changes to cars and store it in a new variable cars2, use the following.

cars2 <-
  cars |>
  mutate(speed_kph = speed * 1.6) |>
  select(-speed)
cars2 |> head()

##   dist speed_kph
## 1    2       6.4
## 2   10       6.4
## 3    4      11.2
## 4   22      11.2
## 5   16      12.8
## 6   10      14.4

Note here in select instead of listing out the variables to keep, a - sign was put in front of the variable to be gotten rid of.
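The following sketch shows a few more ways of telling select which variables to keep or drop. It assumes the helper starts_with, which is made available when dplyr is loaded but is not discussed in the text above. The names dropped_one, dropped_two, and speed_only are hypothetical.

```r
library(dplyr)

cars_kph <- mutate(cars, speed_kph = speed * 1.6)

# Drop a single variable with a minus sign
dropped_one <- select(cars_kph, -speed)
names(dropped_one)

# Drop several variables at once by wrapping them in c()
dropped_two <- select(cars_kph, -c(speed, dist))
names(dropped_two)

# Keep every variable whose name starts with "speed"
speed_only <- select(cars_kph, starts_with("speed"))
names(speed_only)
```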

One could make the changes to cars and overwrite the variable cars instead of creating a new variable cars2. This, however, is poor practice, as it could break existing code that did not expect these changes to cars and relied on the original version. Especially with small datasets that have at most a few million observations, major alterations to the data should be stored in a new variable instead of overwriting the old.

3.8 The summarize command

Now consider the problem of applying a function like mean to a particular variable that is embedded within a data frame or tibble. The summarize command can be used to accomplish this.

The first parameter (supplied either directly or through a pipe) is the dataset to be used. Each remaining parameter has the form name = expression, where the expression applies a function like mean, max, or min to a variable, that is, to a vector of numerical data. The result is a dataset with a single row that holds each named statistic in its own column.

For example, the following code finds the maximum speed in the cars dataset.

cars |>
  summarize(max_speed = max(speed))

##   max_speed
## 1        25

More than one statistic can be found with a single summarize command.

cars |>
  summarize(max_speed = max(speed), avg_dist = mean(dist))

##   max_speed avg_dist
## 1        25    42.98
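As a check on what summarize is doing, the sketch below compares its output with the same statistics computed directly in base R using the $ operator. It also uses the dplyr helper n, which counts observations and is not introduced in the text above. The names speed_summary and counts are hypothetical.

```r
library(dplyr)

speed_summary <- cars |>
  summarize(max_speed = max(speed), avg_dist = mean(dist))

# The same statistics computed directly on the columns
speed_summary$max_speed == max(cars$speed)
speed_summary$avg_dist == mean(cars$dist)

# n() counts the observations; it takes no arguments
counts <- cars |>
  summarize(n_obs = n(), median_dist = median(dist))
counts
```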

3.9 The tidyverse package

Data is said to be tidy when it is in a table where each row contains an observation, and each column contains a variable that is something that can be measured. For instance, in the cars dataset in R, there are 50 observations, which means there are 50 rows. There are two variables (speed and dist), which means there are two columns.
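The row and column counts described here can be checked directly. This short sketch uses the base R functions nrow, ncol, dim, and names, which are standard but not introduced in the text above.

```r
# Number of observations (rows)
nrow(cars)

# Number of variables (columns)
ncol(cars)

# Both at once, rows first and then columns
dim(cars)

# The names of the variables
names(cars)
```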

Variables in a dataset should not be confused with a variable in a programming language. In statistics, a variable is just anything that can be measured, such as height, weight, color, speed, education level, et cetera. This type of variable is also sometimes called a factor.

The tidyverse package is a collection of packages that accomplish the tasks needed in data science. These include the following that will be discussed in the rest of this text.

dplyr Transformation and manipulation of data.
readr Reading and writing data from a website or hard drive to main memory.
ggplot2 Visualization of data.
tidyr For putting data into tidy form.
stringr Dealing with text data (also known as strings).
forcats Dealing with categorical data.
modelr Modeling data.
purrr Applying functions across data in place of writing loops.

If you use install.packages("tidyverse"), R will install every package in the tidyverse at once. Make sure you have a good Internet connection, then put your feet up and relax, because this could take a while!

Questions

Consider the following dataset.

simple_example <- tibble(
  change = c(-5, 3, 4, -1),
  season = c("Winter", "Summer", "Summer", "Fall")  
)
simple_example

## # A tibble: 4 × 2
##   change season
##    <dbl> <chr> 
## 1     -5 Winter
## 2      3 Summer
## 3      4 Summer
## 4     -1 Fall

Write code to add a variable abs_change that is the absolute value of the change value.
Write code to add a variable positive which has value TRUE if the value of change is greater than 0, and FALSE otherwise.

This can be done with

simple_example |> mutate(abs_change = abs(change))

## # A tibble: 4 × 3
##   change season abs_change
##    <dbl> <chr>       <dbl>
## 1     -5 Winter          5
## 2      3 Summer          3
## 3      4 Summer          4
## 4     -1 Fall            1

This can be done with

simple_example |>
  mutate(positive = (change > 0))

## # A tibble: 4 × 3
##   change season positive
##    <dbl> <chr>  <lgl>   
## 1     -5 Winter FALSE   
## 2      3 Summer TRUE    
## 3      4 Summer TRUE    
## 4     -1 Fall   FALSE

Consider the following set of high and low temperature forecasts for Claremont, California during a few days in September 2022.

temps_claremont <- tibble(
  dates = c("2022-09-07",
            "2022-09-08",
            "2022-09-09",
            "2022-09-10"),
  high  = c(105, 102, 104, 77),
  low   = c(75, 77, 77, 71)
)

Write code to only keep from temps_claremont the first and last observation.
Write code to only keep from temps_claremont the high and low temperature data.
The temperatures are given using the Fahrenheit temperature scale. Using the formula $C = (F - 32)(5 / 9)$ to convert from Fahrenheit to Celsius, add a new variable low_celsius that holds the low forecasts in Celsius.

This can be done with

temps_claremont |> slice(c(1, 4))

## # A tibble: 2 × 3
##   dates       high   low
##   <chr>      <dbl> <dbl>
## 1 2022-09-07   105    75
## 2 2022-09-10    77    71

This can be done with

temps_claremont |> select(high, low)

## # A tibble: 4 × 2
##    high   low
##   <dbl> <dbl>
## 1   105    75
## 2   102    77
## 3   104    77
## 4    77    71

This can be done with

temps_claremont |> mutate(low_celsius = (low - 32) * (5 / 9))

## # A tibble: 4 × 4
##   dates       high   low low_celsius
##   <chr>      <dbl> <dbl>       <dbl>
## 1 2022-09-07   105    75        23.9
## 2 2022-09-08   102    77        25  
## 3 2022-09-09   104    77        25  
## 4 2022-09-10    77    71        21.7

Consider the mpg dataset, which is part of the ggplot2 package. The package can be loaded with the following code.

library(ggplot2)

Currently, engine displacement in mpg is measured in liters. Convert this to cubic centimeters with the mutate command.

The median command in R calculates the sample median of a dataset. This is the middle value in a vector of values if the length of the vector is odd, and the arithmetic average of the two middle values in a vector of values if the length of the vector is even.

For instance,

median(c(3, 7, 17))

## [1] 7

and

median(c(3, 7, 10, 17))

## [1] 8.5

illustrate this sample median.

Use this command together with summarize to find the sample median of the mpg variable in the mtcars dataset built into R.

Use the summarize command to create a tibble that contains the average mpg, the median mpg, and the average of the wt variable that measures the weight of the vehicle in thousands of pounds.

Given a vector that consists of boolean values, TRUE and FALSE, when you use sum, every TRUE gets turned into a 1, and every FALSE into a 0.

The rest of this problem uses

x <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)

Try applying sum to this vector x.
Try applying mean to this vector x.
Try applying max to this vector x.
Try applying min to this vector x.

Consider the dataset flights in the package nycflights13. Say that a particular flight is on time when arr_delay is zero or negative.

Use mutate to add a new boolean variable that indicates whether or not a flight is on time.
What percentage of the flights were on time?

The dataset uspop is a time series that holds the population of the United States (in millions) as recorded by each United States Census from 1790 to 1970. You can convert it to a tibble using the tibble function.

us_census <- tibble(uspop)

Add to the tibble a year variable that runs from 1790 to 1970 in steps of ten years.
Add to the tibble from the previous part another variable log_pop that holds the natural logarithm of the population.