10 Combining datasets with keys


Summary

  • If \(V\) is the set of variables for a dataset, a subset \(K \subseteq V\) is a key if for two observations \(x\) and \(y\), \(x[K] = y[K]\) implies that \(x[V] = y[V]\). That is, the measurements of the columns in the key uniquely determine the observation.

  • Given a first table and a second table of data, left_join creates a table that has every observation from the first table, and draws measurements from both tables.

  • Given a first table and second table of data, a full_join creates a table where all observations and measurements from either one or two tables appear.

  • Given a first table and second table of data, an inner_join is used to create a table where each observation in the output appears in two tables, with measurements from both tables.

  • Given a first table and second table of data, a semi_join is used to create a table where each observation in the output is in the first and second tables, with measurements from the first table.

  • Given a first table and second table of data, an anti_join is used to create a table where each observation in the output is in the first table but not the second table, with measurements from the first table.

  • The row_number function can be used with mutate to create an artifical key for any dataset.


10.1 Keys

Recall that the observations in a data set are unique. Because of this, if all of the variable values are known for every observation, it is possible to tell which observation is which. Consider the following data set.
latitude longitude temperature weather
34.1 117.7 26.3 sunny
41.9 -87.6 22.1 sunny
34.1 112.2 16.7 rainy

Consider the set of variables \(V = \{\text{latitude}, \text{longitude}, \text{temperature}, \text{weather} \}\). Note that the set is surrounded by curly braces \(\{\) and \(\}\) to indicate that it is a set, and the order of the objects in the set is not important.

If the value of the observation for all elements of \(V\) is known, then it is possible to determine exactly which observation is being discussed. For instance, if the observation for \(V\) is known to be \(34.1, 117.7, 26.3, \text{sunny}\), then it must be the first observation.

On the other hand, if only \(\{ \text{latitude} \}\) is known, that is not enough to determine the observation. For instance, knowing the latitude is 34.1 means that either the first or third observation is being discussed. Therefore, say that \(\{\text{latitude}, \text{longitude}, \text{temperature}, \text{weather} \}\) is a key, while \(\{ \text{latitude} \}\) is not.

To formally define a key, it helps recall set notation. For a set \(S\), write \(s \in S\) to mean that \(s\) is in \(S\) is a true statement. For instance, \[ \text{temperature} \in \{\text{latitude}, \text{longitude}, \text{temperature}, \text{weather} \} \] is a true statement.

A key consists of some of the variables being measured. When a set consists of some (or all) of the elements of another set, it is called a subset of the other set. So a key is a subset of the variables. But more is needed!

For a subset of variables \(K \subset V\), let \(x[K]\) denote the values of an observation for those \(K\) variables. For instance, \[ x[\{ \text{latitude}, \text{longitude} \}] = (34.1, 117.7) \] for the first observation. Then a set of variables is a key if knowing the values of these variables is enough to pinpoint the exact observation.

Given a relation with variables \(V = \{v_1, \ldots, v_n\}\), a subset \(K \subseteq V\) is a key if for all observations \(x\) and \(y\), \(x[K] = y[K] \rightarrow x = y\).

In the example above, there are many keys.

\(K\) Is it a key?
\(\{ \text{latitude} \}\) No
\(\{ \text{latitude}, \text{longitude} \}\) Yes
\(\{ \text{weather} \}\) No
\(\{ \text{weather}, \text{latitude} \}\) Yes

10.2 Joining relations using keys

Keys are important because they allow data from more than one data set to be combined. The simplest type of combination of data sets is called a left join. In this combination, two different relations have spread out the data over more than one table.

For instance, the following code creates a tibble temp_data that holds the temperature data that has been considered so far, but not the weather data.

temp_data <- tibble(
  latitude    = c(34.1, 41.9, 34.1),
  longitude   = c(117.7, -87.6, 112.2),
  temperature = c(26.3, 22.1, 16.7)
)
temp_data |> kable() |> kable_styling()
latitude longitude temperature
34.1 117.7 26.3
41.9 -87.6 22.1
34.1 112.2 16.7

The next tibble weather_data holds a measurement called weather, but not the temperature measurement for some locations.

weather_data <- tibble(
  latitude    = c(34.1, 41.9, 32.5, 35.6),
  longitude   = c(117.7, -87.6, 90.4, 110),
  weather     = c("sunny", "sunny", "rainy", "rainy")
)
weather_data |> kable() |> kable_styling()
latitude longitude weather
34.1 117.7 sunny
41.9 -87.6 sunny
32.5 90.4 rainy
35.6 110.0 rainy

In both tables, \(\{\text{latitude}, \text{longitude} \}\) form a key that uniquely identifies the observation. And so using this key, it is possible to join the weather data to the temperature data. The command to do so is left_join, which is part of the package dplyr.

left_join(temp_data, weather_data, by = c("longitude", "latitude"))
## # A tibble: 3 × 4
##   latitude longitude temperature weather
##      <dbl>     <dbl>       <dbl> <chr>  
## 1     34.1     118.         26.3 sunny  
## 2     41.9     -87.6        22.1 sunny  
## 3     34.1     112.         16.7 <NA>

Note that the observations in the left join match those of the first (left) table. Because the third observation in temp_data does not appear in weather_data, the weather measurement was replaced with an NA

Suppose that the keys for the two tables were unknown to the user? In that case, the left_join command tries its best to figure out a key for joining the data together.

temp_data |>
  left_join(weather_data)
## Joining with `by = join_by(latitude, longitude)`
## # A tibble: 3 × 4
##   latitude longitude temperature weather
##      <dbl>     <dbl>       <dbl> <chr>  
## 1     34.1     118.         26.3 sunny  
## 2     41.9     -87.6        22.1 sunny  
## 3     34.1     112.         16.7 <NA>

The above code results in the same new table as before. It gives an informational message “Joining, by = c(”latitude”, “longitude”)” to indicate that it used latitude and longitude as the variables to make a key.

10.3 The full/outer join

The left join is an example of a mutating join, because it can add columns to the first table. Other examples of mutating joins include full/outer joins and inner joins.

In a mutating join the output columns can come from either input table, and always include the key columns.

A full join is similar to a left join, except that observations from both tables are always included. This can lead to many more NA values.

In the following outer join, there are two observations in common, one belonging only to temp_data, and two belonging only to weather_data.

full_join(temp_data, weather_data)
## Joining with `by = join_by(latitude, longitude)`
## # A tibble: 5 × 4
##   latitude longitude temperature weather
##      <dbl>     <dbl>       <dbl> <chr>  
## 1     34.1     118.         26.3 sunny  
## 2     41.9     -87.6        22.1 sunny  
## 3     34.1     112.         16.7 <NA>   
## 4     32.5      90.4        NA   rainy  
## 5     35.6     110          NA   rainy

The result is two observations with both a temperature measurement and a weather measurement, one observation with only a temperature measurement, and two observations with only a weather measurement.

Note that full_join is similar to union in that observations appear if they show up in either of the two tables.

10.4 Inner joins

In an inner join, only observations that appear in both tables are included. This is similar to the intersect function from earlier.

inner_join(temp_data, weather_data)
## Joining with `by = join_by(latitude, longitude)`
## # A tibble: 2 × 4
##   latitude longitude temperature weather
##      <dbl>     <dbl>       <dbl> <chr>  
## 1     34.1     118.         26.3 sunny  
## 2     41.9     -87.6        22.1 sunny

10.5 Filtering joins

In a filtering join the columns from the first table are kept as is, and the output rows form a subset of the first table rows.

The two filtering joins are semi_join and anti_join. These roughly correspond to intersect and setdiff.

In a semi join the columns from the first table are kept as is, and the output rows are those observations in the first table and in the second.

semi_join(temp_data, weather_data)
## Joining with `by = join_by(latitude, longitude)`
## # A tibble: 2 × 3
##   latitude longitude temperature
##      <dbl>     <dbl>       <dbl>
## 1     34.1     118.         26.3
## 2     41.9     -87.6        22.1

In a anti join the columns from the first table are kept as is, and the output rows are those observations in the first table but not the second.

anti_join(temp_data, weather_data)
## Joining with `by = join_by(latitude, longitude)`
## # A tibble: 1 × 3
##   latitude longitude temperature
##      <dbl>     <dbl>       <dbl>
## 1     34.1      112.        16.7

Since an observation from the first table is either in the second table or not, the union of these two tibbles returns the original table.

union(
  semi_join(temp_data, weather_data),
  anti_join(temp_data, weather_data)
)
## Joining with `by = join_by(latitude, longitude)`
## Joining with `by = join_by(latitude, longitude)`
## # A tibble: 3 × 3
##   latitude longitude temperature
##      <dbl>     <dbl>       <dbl>
## 1     34.1     118.         26.3
## 2     41.9     -87.6        22.1
## 3     34.1     112.         16.7

10.6 Rows and columns

It helps to remember the number of rows and columns output by the various join commands to understand what they do.

  • left_join. Number of rows equals the number of rows in the first table. Number of columns is the number of columns in the first table plus any columns added from the second table.

  • full_join. Number of rows equals the number of observations in either one or two tables. Number of columns is the number of columns in the first table plus any columns added from the second table.

  • inner_join. Number of rows equals the number of observations in two tables. Number of columns is the number of columns in the first table plus any columns added from the second table.

  • semi_join. Number of rows equals the number of observations in the first and second table. Number of columns is the number of columns in the first table.

  • anti_join. Number of rows equals the number of rows in the first table but not in the second table. Number of columns is the number of columns in the first table.

10.7 What if something other than a key is used to join?

So far our examples have used the key \(\{ \text{latitude}, \text{longitude} \}\) in order to identify observations. What if it was forced to use something not a key to join the tables?

For instance, \(\{ \text{latitude} \}\) by itself is not a key for temp_data, since the first and third observations in this dataset with the same latitude value.

temp_data
## # A tibble: 3 × 3
##   latitude longitude temperature
##      <dbl>     <dbl>       <dbl>
## 1     34.1     118.         26.3
## 2     41.9     -87.6        22.1
## 3     34.1     112.         16.7

So if an attempt was made to do a left join with weather_data, a problem would occur, and extra columns would have to be added to make it into a key.

left_join(temp_data, weather_data,
          by = "latitude")
## # A tibble: 3 × 5
##   latitude longitude.x temperature longitude.y weather
##      <dbl>       <dbl>       <dbl>       <dbl> <chr>  
## 1     34.1       118.         26.3       118.  sunny  
## 2     41.9       -87.6        22.1       -87.6 sunny  
## 3     34.1       112.         16.7       118.  sunny

An extra column longitude.x was added to break the single value 34.1 into two observations. This is because it was not a key to start with. So if this type of column addition happens, be sure to check if the original joining variables form a key to begin with.

10.8 Artificial keys

So far the keys used have come from the data itself. However, sometimes it is useful to create an artificial key, which is a new variable that can serve as a key by assigning each observation a different value. The row_number function from dplyr can be used to accomplish this task.

temp_data |>
  mutate(id = row_number(), .before = 1)
## # A tibble: 3 × 4
##      id latitude longitude temperature
##   <int>    <dbl>     <dbl>       <dbl>
## 1     1     34.1     118.         26.3
## 2     2     41.9     -87.6        22.1
## 3     3     34.1     112.         16.7

An artificial or surrogate key is a variable deliberately added to a dataset whose entries are all unique, thereby making it a key.

Questions

Consider the following tibbles.

t1 <- tibble(
  emp_id = c(1001, 1116, 1239),
  class = c("temp", "full", "full")
)
t2 <- tibble(
  emp_id = c(1001, 1211, 1239, 1543),
  comp = c(16232, 42003, 51522, 44023)
)
  1. What is the primary key in t1?

  2. How can we add the information from the comp variable of t2 to the three observations in t1?

  3. How do we remove the observations from t1 where we do not have comp values in t2, and does not add the comp variable to the result?

  1. The class variable contains repeats, so is not a key. The emp_id is clearly designed to be a key in the data set, and so it makes sense that
emp_id

is the primary key.

  1. We can use left_join to add this information. Of course, we do not have the comp value for employee number 1116, so that will result in a NA value.
t1 |> left_join(t2, by = "emp_id")
## # A tibble: 3 × 3
##   emp_id class  comp
##    <dbl> <chr> <dbl>
## 1   1001 temp  16232
## 2   1116 full     NA
## 3   1239 full  51522
  1. This is done via a semi_join, which is a filtering join that only keeps those observations where the key appears in both tables.
t1 |> semi_join(t2, by = "emp_id")
## # A tibble: 2 × 2
##   emp_id class
##    <dbl> <chr>
## 1   1001 temp 
## 2   1239 full

(Note the difference between this and an inner_join: the semi_join only filters observations, it does not add the comp variable to t1.)

Write code to add a surrogate key to the iris dataset built into R, calling the key number. Print out the first few rows of the new table.

This can be done with

iris |>
  mutate(number = row_number()) |>
  select(number, everything()) |>
  head()
##   number Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1      1          5.1         3.5          1.4         0.2  setosa
## 2      2          4.9         3.0          1.4         0.2  setosa
## 3      3          4.7         3.2          1.3         0.2  setosa
## 4      4          4.6         3.1          1.5         0.2  setosa
## 5      5          5.0         3.6          1.4         0.2  setosa
## 6      6          5.4         3.9          1.7         0.4  setosa