11 Shaping Data


Summary

In tidy data,

  1. each row corresponds to an observation,
  2. each column corresponds to a variable, and
  3. each entry only contains a single value.

There are several commands for tidying data when these are not already true. Because these commands can change the number of columns and/or rows, this is often called reshaping the data.

commands description
pivot_wider Turns entries into variable names (fewer rows, more columns)
pivot_longer Turns variables names into entries (fewer columns, more rows)
separate When entry is two values, separates into two variables (columns)
unite Combines two variables (columns) into one variable

There are also functions for dealing with missing values. A missing value NA is explicit if we write it out directly in the data table, and implicit if we remove the row that has an NA value. Also, sometimes NA values appear in data to indicate that the entry is the same as the one above it. We have functions to deal with these situations.

commands description
complete Finds missing observations and explicitly makes them NA
fill Changes NA entries to value of the entry above

11.1 What is tidy data?

So far we’ve learned how to visualize data, transform data, and import data. All of these tools such as ggplot, filter, and summarize expect to be given the input data in a tidy form.

Recall that for a statistician, a variable or factor is something that can be measured such as the latitude or temperature. A level is the different values that a factor can take on. In tidy data, each column is entitled with the factor name, and then the elements of that column have values drawn from the levels.

However, in tabulating data, often the levels are used for the column names instead! But this makes the data difficult to analyze. In particular, a command like ggplot will not work on data that is organized in this way. A better way is to have each column correspond to a variable, each row to an observation, and then the entries at each row and column should be a level value.

When this is how the data is organized, we call the data tidy.

A dataset is in tidy form when

  1. Each row corresponds to an observation.

  2. Each column corresponds to a variable.

  3. Each entry only contains a single value.

When these properties do not hold, the tidyr package has several useful functions for manipulating our tables and getting them into tidy form. For instance, consider the following variable table1 that is built in to the tidyr package.

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Each of the six observations occupies its own row. Each column corresponds to a unique variable, and each of the \(24 = 6 \cdot 4\) entries of the table corresponds to a single value.

In contrast, here is the same data, but organized differently. Here

table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Here the two variables case and population have been turned into different values for the type variable, and the actual data values have been put into a count variable.

This table contains the same information, but table2 is far more difficult to work with than table1.

  • The format of table2 obscures the fact that there are 6 data points. By taking two different types of data and conflating them, the fact that we are working with countries at various years is lost.

  • It becomes more difficult to analyze. If we look just at the mean of the counts the overall result conflates the cases and the population values. A single column (if possible) should always have the same units and be measuring the same thing. Here two different things (with different units) are being measured. In tidy data, the entries in each column are all measuring the same thing.

There are many reasons why the data you encounter in practice is not tidy. The two main reasons are the following.

  1. Unless you have trained in analyzing tidy data, you simply might not think to organize your data in a tidy fashion. Despite its simplicity, the tidy principle is not self-apparent.

  2. Data is often organized in a way to make recording the data as efficient as possible, not for analysis of the data.

So that means we have to learn tools that take data sets that might be untidy as recorded and turn them into tidy data sets.

11.2 Turning entries into column names

The first tool we will introduce is pivot_wider, which deals with situations like those found in table2. We wish to take the 12 rows, and change them back to the six observations that we know exist by combining rows with different type entries.

The function to do this is called pivot_wider because it will make our table wider (as well as shorter) by removing the count variable and introducing an equivalent of a cases variable and a population variable.

The pivot_wider function has two main parameters. The first, names_from, is the variable that holds the entries that we will turn into separate variables. The second, values_from, is the name of the existing variable with names for each of the variables created under the key variable. The result is something like:

table2 |> pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Now each variable has its own column, and each row in each variable contains a value rather than a variable name.

11.3 Turning column names into entries

We use pivot_wider when a column contains factor names rather than values. What about when a column name is a level rather than a factor name? In that case, we use pivot_longer. This turns column names into entries, which removes columns and adds observations.

This often happens when dealing with numerical data. Suppose we reorganize our table of data in yet another fashion.

table4a
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

You can see that 1999 and 2000 are not true variable names because there is nothing to measure here. Instead, they are really values that should have been assigned to a variable named something like year. Unlike with pivot_wider, we are going to have to tell pivot_longer exactly which of the existing variable names are actually values. Then use the key parameter to say what the name of the new variable should be, and finally values tells us what the name of the new value variable should be.

g4a <- table4a |> 
  pivot_longer('1999':'2000', 
               names_to = "year", 
               values_to = "cases")
g4a
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Of course, this table only contains the numbers of cases of TB, it does not contain the population information. The rest of the data is contained in a variable table4a:

table4b
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

The pivot_longer function can be used to tidy this data as was done with table4a.

g4b <- table4b |> 
  pivot_longer('1999':'2000', 
               names_to = "year", 
               values_to = "population")
g4b
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <dbl>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583

After tidying up this data in the same way as for table4a, we want to combine the two resulting tables into one single table. A command for combining multiple tables into one is the left_join.

Use left_join to bring things together and recreate our tidy data set.

left_join(g4a, g4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <dbl>      <dbl>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

11.4 Separate

Another problem that can prevent data from being tidy is when the table tries to hold values for two variables inside one entry. For example:

table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

This is pretty uncommon, the widespread use of spreadsheets tends to discourage this sort of thing. Still, it is often the case that a name and ID number, or first and last name, get combined into one variable, and we often want to separate entries into their different variables.

In the case of table3, the entries under the variable rate actually contain two values, not one. We can use the separate function to do this. The syntax is straightforward: we provide separate with the variable name to split and the new names of the variables.

table3 |> 
  separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Note that separate figured out what delimiting character separated the two values, /. If you want to specify your own delimiter, that works too.

The default call to separate will treat the two entries as character strings. This is a problem if they are actually another data type. By setting the convert to TRUE, the user tells the function to guess at what type of variables are appearing in each of the two separated columns.

table3 |> 
  separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

11.5 Unite

The unite function does precisely the opposite of separate, it brings two variables together into one. This can be useful when, for instance, the century and two digit year within the century have been separated into separate variables. Consider table5:

table5
## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

The function unite also has a straightforward syntax: tell it the new name of the variable, followed by one or more variables you wish to unite.

table5 |> unite(col = year, century, year)
## # A tibble: 6 × 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

To denote the combination, by default unite uses an underscore character (_) to separate values. For the year, we don’t want that. We can use the parameter sep to change the parameter, or eliminate it entirely by giving it a blank string ("").

table5 |> unite(col = year, century, year, sep = "")
## # A tibble: 6 × 3
##   country     year  rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

11.6 Missing Values

Missing data can be denoted in two different ways.

  • Explicitly. This is when we give in the value NA.
  • Implicitly. This is when we have an observation missing from the table.

For instance, suppose our original table had only had the first five observations:

table1 |> slice(1:5)
## # A tibble: 5 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272

There is no entry with country equal to China and year equal to 2000. Because no observation with these values exists, that data is implicitly missing. The complete command can be used to try to figure out what data is missing. For instance, consider:

table1 |> slice(1:5) |> complete(country, year)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000     NA         NA

Now the China in 2000 data is explicitly said to be NA. How did complete know that this was missing? It looked at the various values for country and year and discovered that one of the \(2 \cdot 3 = 6\) possible combinations was not there, in this case for China. So it added it in explicitly as an observation with missing data.

11.6.1 Blank lines in spreadsheets

Another issue we encounter is when a cell is blank in a spreadsheet, it often means to use the value of the cell above it. Then when this is read into R using read_csv or another data import function, the blank cells get changed to NA. For instance, if this type of technique was used for the table1 data, we might end up with the following.

table1b <- table1
table1b[c(2, 4, 6), 1] <- c(NA, NA, NA)
table1b
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 <NA>         2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 <NA>         2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 <NA>         2000 213766 1280428583

This can be difficult to fix, fortunately tidyr has a specific function to solve this problem: fill.

table1b |> fill(country)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

11.7 Cleaning data

Even when data is in tidy form, it might end up being somewhat dirty. This happens when there are errors in the data file or misnamed variables. Here are two extra tools that are useful for cleaning up such data sets.

  • names returns the variable names of the input tibble as a vector of strings.

  • rename allows us to efficiently change the name or names of a variable in a tibble. For example: rename(“newvariablename” = oldvariablename)

  • str_replace takes a string and replaces it with another string. This is often used either before or after a call to separate or unite to prepare the data. For example:

    str_replace("tunabake", "bake", "melt")
    ## [1] "tunamelt"

    can be used to quickly change one or more names to something else.

Questions

State which of the following

  • pivot_wider

  • pivot_longer

  • separate

  • unite

is the right function to use in the situation.

  1. Two variable values need to be combined into one variable entry.

  2. The entries of a column should be column names instead.

  3. An entry contains values for two or more variables.

  4. The column names are measurements rather than something you measure.

Consider the following dataset that gives the number of days of rainfall for five cities over three months.

df1 <- read_csv2('City;January;February;March
  Atlanta, Georgia;11;10;10
  Austin, Texas;7;7;9
  Baltimore, Maryland;10;9;10
  Birmingham, Alabama;11;10;10
  Boston, Massachusetts;11;10;12')
df1
## # A tibble: 5 × 4
##   City                  January February March
##   <chr>                   <dbl>    <dbl> <dbl>
## 1 Atlanta, Georgia           11       10    10
## 2 Austin, Texas               7        7     9
## 3 Baltimore, Maryland        10        9    10
## 4 Birmingham, Alabama        11       10    10
## 5 Boston, Massachusetts      11       10    12
  1. Tidy this data so that there is a column for the month and one for the rainfall.

  2. Find the mean number of days of rainfall from January through March for each of the five cities.

Now suppose the data from the last problem was presented in the following CSV file:

df2 <- read_csv2("
  Atlanta, Georgia;Jan/11
  Atlanta, Georgia;Feb/10
  Atlanta, Georgia;Mar/10
  Austin, Texas;Jan/7
  Austin, Texas;Feb/7
  Austin, Texas;Mar/9
  Baltimore, Maryland;Jan/10
  Baltimore, Maryland;Feb/9
  Baltimore, Maryland;Mar/10
  Birmingham, Alabama;Jan/11
  Birmingham, Alabama;Feb/10
  Birmingham, Alabama;Mar/10
  Boston, Massachusetts;Jan/11
  Boston, Massachusetts;Feb/10
  Boston, Massachusetts;Mar/12",
  col_names = c("City", "Days_rain")
)
  1. Tidy this data.

  2. Create a horizontal bar plot that for each of the three months, making five bars for the five cities where the height of each bar is the total number of days of rain for that city from January through March.

Consider the following data set.

state_data <- tribble(
  ~state, ~type, ~number,
  "Oregon", "year_founded", 1859,
  "Oregon", "population_2020_census", 4237256,
  "Oregon", "population_2010_census", 3831074,
  "California", "year_founded", 1850,
  "California", "population_2020_census", 39538223,
  "Washington", "population_2020_census", 7705281,
  "Washington", "population_2010_census", 6724540,
)
  1. Are all the values under number measuring the same thing?

  2. Tidy this data.

  3. How many NA values are in your table?

Consider the following dataset. Bring together the first and last names into a single field name, where the last name goes first, followed by a comma, then a space, and finally the first name. For instance if the first name was Pete and the last name Aguilar, the result should be Aguilar, Pete.

jan6_committee <-
  tribble(
  ~first_name, ~last_name, ~membership,
  "Bennie", "Thompson", "Chairperson",
  "Zoe", "Lofgren", "Majority",
  "Adam", "Schiff", "Majority",
  "Pete", "Aguilar", "Majority",
  "Stephanie", "Murphy", "Majority",
  "Jamie", "Raskin", "Majority",
  "Elaine", "Luria", "Majority",
  "Liz", "Cheney", "Minority",
  "Adam", "Kinzinger", "Minority"
  )