9 Combining datasets as sets

Summary

Say that \(s\) is an element of the set \(S\) if the statement \(s \in S\) is true.
Say that \(A\) is a subset of the set \(S\) if for every \(a \in A\) it also holds that \(a \in S\). Write \(A \subseteq S\).
A data point can be thought of as an ordered \(n\)-tuple, where each component corresponds to the value of one of the \(n\) variables. This is also called an observation.
A dataset is a set of data points.
union combines two datasets with the same variables by including all observations that occur in at least one of the two datasets.
intersect combines two datasets with the same variables by including all observations that occur in both of the two datasets.
setdiff combines two datasets with the same variables by including all observations that occur in the first dataset but not the second.

In order to understand how to combine different datasets, a mathematical model of data will be needed.

9.1 What is a data point?

What exactly is a data point, also called an observation? Mathematically, a data point can be modeled by what mathematicians call an ordered \(n\)-tuple.

For instance, if equipment that is sitting outside at latitude 34.1° N and longitude 117.7° W measures the air temperature as \(26.3^\circ\textsf{C}\), this could be represented as a data point that is an ordered 3-tuple: \[ (34.1, 117.7, 26.3). \] The adjective ordered here means that the order of the components matters. The first value must be latitude, the second value must be longitude, and the third value must be temperature. A data point of \((34.1, 26.3, 117.7)\) means something quite different with this ordering!

The 3 in ordered 3-tuple means that there are three values in the data point. If further information is collected indicating if the day was sunny or overcast, then there might be a 4-tuple data point: \[ (34.1, 117.7, 26.3, \text{sunny}). \]

Note that the first three values are real numbers, while the last component takes one of two values: sunny or overcast.

The quantities being measured are typically called variables or factors. In this case, the factors are latitude, longitude, temperature, and weather.
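As a minimal sketch of why the ordering matters, the two 3-tuples from above can be stored as R vectors and compared. The names `point_a`, `point_b`, and `point_c` are made up for this illustration.

```r
# Two 3-tuples with the same numbers in a different order.
# The names point_a, point_b, point_c are illustrative only.
point_a <- c(34.1, 117.7, 26.3)  # (latitude, longitude, temperature)
point_b <- c(34.1, 26.3, 117.7)  # same numbers, different order

# identical() compares component by component, so order matters.
identical(point_a, point_b)

# A 4-tuple that mixes numbers with a category is stored here as a list,
# because an atomic vector would coerce every component to character.
point_c <- list(34.1, 117.7, 26.3, "sunny")
length(point_c)
```

Here `identical()` returns `FALSE`, and `length(point_c)` is 4, matching the 4 in 4-tuple.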

9.2 What is a dataset?

There are many different types of datasets, but the most common mathematical model is called a relation, which is simply a set of \(n\)-tuples. A set is a collection of objects in which order does not matter. An object in a set is called an element of the set, and an element cannot appear in a set more than once. For instance, the dataset

latitude	longitude	temperature	weather
34.1	117.7	26.3	sunny
41.9	-87.6	22.1	sunny

and

latitude	longitude	temperature	weather
41.9	-87.6	22.1	sunny
34.1	117.7	26.3	sunny

both consist of the same two data points. Listing the data points in a different order does not change the dataset.
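One way to check this in R, sketched under the assumption that the dplyr and tibble packages are installed, is the `setequal` function, which compares two tables as sets of rows. The table names `weather_a` and `weather_b` are invented for this example.

```r
library(dplyr)   # data-frame method for setequal()
library(tibble)  # tibble()

# The two-row dataset from the text (names are illustrative).
weather_a <- tibble(
  latitude = c(34.1, 41.9),
  longitude = c(117.7, -87.6),
  temperature = c(26.3, 22.1),
  weather = c("sunny", "sunny")
)

# The same rows, listed in the opposite order.
weather_b <- weather_a[c(2, 1), ]

# Compares the tables as sets of rows, ignoring row order.
setequal(weather_a, weather_b)
```

The result is `TRUE`, since both tables hold the same two observations.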

It is important that observations not be repeated in the relation. Consider the following data set.

latitude	longitude	temperature	weather
41.9	-87.6	22.1	sunny
34.1	117.7	26.3	sunny
34.1	117.7	26.3	sunny

The last two observations are the same, and so really count only once. A properly cleaned-up dataset would replace these last two observations with a single observation.

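If two otherwise identical observations were made at different times, and that difference matters, then time should be one of the factors measured in each observation. The two observations would then be distinct data points.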
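One possible way to carry out this cleanup in R is the `distinct` function from dplyr, which keeps a single copy of each repeated row. This is a sketch that assumes dplyr and tibble are installed; the names `readings` and `cleaned` are made up for illustration.

```r
library(dplyr)   # distinct()
library(tibble)  # tibble()

# The three-row table from the text, with one repeated observation.
readings <- tibble(
  latitude = c(41.9, 34.1, 34.1),
  longitude = c(-87.6, 117.7, 117.7),
  temperature = c(22.1, 26.3, 26.3),
  weather = c("sunny", "sunny", "sunny")
)

# With no columns named, distinct() compares entire rows
# and keeps one copy of each.
cleaned <- distinct(readings)
nrow(cleaned)
```

After the cleanup, `cleaned` has 2 rows, so it can be treated as a relation.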

9.3 Set notation

Set notation helps describe how objects belong to sets. For a set \(S\), the statement \(s \in S\) says that the object \(s\) is in \(S\). For instance, \[ \text{temperature} \in \{\text{latitude}, \text{longitude}, \text{temperature}, \text{weather} \} \] is a true statement.

Write \(s \in S\) (read as “\(s\) is an element of \(S\)”) to mean that it is true that the object \(s\) is part of the set \(S\).

Using this terminology, a data point is an element of a dataset.

When a set consists of some (or all) of the elements of another set, it is called a subset of the other set.

Write \(A \subseteq S\) (read as “\(A\) is a subset of \(S\)”) if for every \(s\) such that \(s \in A\), it holds that \(s \in S\).
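In R, a character vector with no repeated entries can stand in for a small set. The following sketch uses the base operator `%in%` for the element test. The helper `is_subset` and the value `"humidity"` are hypothetical, introduced only for this illustration.

```r
# A vector standing in for the set of variable names.
variables <- c("latitude", "longitude", "temperature", "weather")

# Element test: is s in S?
"temperature" %in% variables
"humidity" %in% variables

# Hypothetical helper: A is a subset of S when every element of A is in S.
is_subset <- function(a, s) all(a %in% s)

location <- c("latitude", "longitude")
is_subset(location, variables)
is_subset(variables, location)
```

The two element tests return `TRUE` and `FALSE`, and the two subset tests return `TRUE` and `FALSE`, in that order.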

9.4 Joining two sets

The operation of joining the elements of two sets together is called taking the union of the two sets.

For sets \(A\) and \(B\), the union of the sets is written \(A \cup B\), and consists of those elements that are in at least one of the sets.
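For a quick illustration, base R has a `union` function that works on vectors treated as sets. The vectors `a` and `b` are example values chosen for this sketch.

```r
# Two small sets, stored as character vectors.
a <- c("a", "b", "c")
b <- c("b", "c", "d")

# Elements in at least one of the two sets; shared elements appear once.
union(a, b)
```

The result contains the four elements a, b, c, and d. Elements b and c belong to both sets but are not repeated in the union.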

9.5 Joining two datasets with the same variables

If two datasets have the same set of variables in the same order, then each observation in each of the two tables is an \(n\)-tuple with the same ordering. So to combine the tables, simply create a new dataset that contains all observations that are in at least one of the existing tables. The function to do so is called union.

For example, consider two tables, each with variables x and y.

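# tibble() and the table versions of union(), intersect(), and setdiff()
# come with the tidyverse; kable() comes from knitr.
library(tidyverse)
library(knitr)

df1 <- tibble(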
  x = c(3, 4, 5),
  y = c(-4, 0, 10)
)
df1 |> kable()

x	y
3	-4
4	0
5	10

df2 <- tibble(
  x = c(3, 8, 2),
  y = c(-4, 0, 10)
)
df2 |> kable()

x	y
3	-4
8	0
2	10

The first observation in each table is the same, but the other four observations are all different. Hence the union of the two tables will be a single dataset with \(1 + 4 = 5\) rows.

union(df1, df2)

## # A tibble: 5 × 2
##       x     y
##   <dbl> <dbl>
## 1     3    -4
## 2     4     0
## 3     5    10
## 4     8     0
## 5     2    10
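To see that union really does drop the shared observation, it can be compared with simply stacking the two tables. The sketch below assumes dplyr and tibble are installed and uses dplyr's `union_all`, which keeps duplicates. The tables are redefined so that the example stands on its own.

```r
library(dplyr)   # union(), union_all() for data frames
library(tibble)  # tibble()

df1 <- tibble(x = c(3, 4, 5), y = c(-4, 0, 10))
df2 <- tibble(x = c(3, 8, 2), y = c(-4, 0, 10))

# Stacking keeps the shared observation (3, -4) twice.
stacked <- union_all(df1, df2)
nrow(stacked)

# The set union keeps it once.
combined <- union(df1, df2)
nrow(combined)
```

The stacked table has 6 rows, while the union has 5, in agreement with the count \(1 + 4 = 5\).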

9.6 Joining sets and tables through intersection

For two sets \(A\) and \(B\), the intersection of the two sets consists of elements that appear in both sets.

The intersection of \(A\) and \(B\), written \(A \cap B\), consists of elements that are in both sets.

Similarly, the intersection of two datasets with the same variables consists of those data points that appear in both datasets. For df1 and df2, this is only the first observation. The intersect function finds this element.

intersect(df1, df2)

## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     3    -4
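The same idea can be tried on plain sets using base R's `intersect` on vectors. The vectors `a` and `b` below are example values for this sketch, which also checks that swapping the arguments gives the same set.

```r
# Two small sets, stored as character vectors.
a <- c("a", "b", "c")
b <- c("b", "c", "d")

# Elements that appear in both sets.
intersect(a, b)

# Intersection does not depend on which set is listed first.
setequal(intersect(a, b), intersect(b, a))
```

The intersection contains b and c, and the final comparison returns `TRUE`.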

9.7 Set difference

When the difference of numbers is considered, the second number is “taken away” from the first. Hence \(7 - 3\) takes away 3 from 7, leaving 4.

For sets, the set difference of \(B\) from \(A\) means to take away from \(A\) any element which appears in \(B\).

For sets \(A\) and \(B\), the set difference, written \(A \setminus B\), is those elements in \(A\) that are not in \(B\).

For example, if \(A = \{a, b, c\}\), and \(B = \{b, c, d \}\), then \(A \setminus B = \{a \}\), since elements \(b\) and \(c\) are taken away because they are also in \(B\). Element \(d\) is also in \(B\), but it was never in \(A\) to begin with, so taking it away does nothing.
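This example can be reproduced with base R's `setdiff` on vectors. The sketch below also reverses the arguments to show that, unlike union and intersection, set difference depends on which set comes first.

```r
# The sets A and B from the example, stored as character vectors.
a <- c("a", "b", "c")
b <- c("b", "c", "d")

# Elements of A that are not in B.
setdiff(a, b)

# Reversing the order gives the elements of B that are not in A.
setdiff(b, a)
```

The first call returns a, and the second returns d.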

For datasets, the setdiff function takes away any observations from the first dataset that appear in the second dataset. For instance, suppose a town wishes to remove from its list of residents those who have passed away or moved out of town. Set difference can accomplish this task.

residents <- tibble(
  name = c("Brenda Starr", "Beetle Bailey", "Mark Trail")
)
left_town <- tibble(
  name = c("Brenda Starr", "Beetle Bailey", "Andy Capp")
)
setdiff(residents, left_town)

## # A tibble: 1 × 1
##   name      
##   <chr>     
## 1 Mark Trail

Questions

Consider the following two tibbles:

df1 <- tibble(
  name = c("Sophia", "Olivia", "Emma", "Ava", "Isabella"), 
  gender = c("F", "F", "F", "F", "F")
  )
df2 <- tibble(
  name = c("Jackson", "Liam", "Noah", "Aiden", "Caden"),  
  gender = c("M", "M", "M", "M", "M")
  )

Give a command that combines the observations from df1 and df2.