16 Joining tables in SQL

Summary

SQL was designed to use relational databases, and so has many commands for drawing data from multiple tables.

SQL Command	Description
JOIN	Brings two tables together.
OUTER	Modifies `JOIN` to be an outer join.
LEFT	Modifies `OUTER` to be a left outer join.
UNION	Union of observations from tables with same variables.
INTERSECT	Intersection of observations from tables with same variables.
MINUS	Set difference of observations from tables with same variables.

Summarizing data and combining strings can also be done in SQL.

SQL Command	Description
SUM	Adds together the non NULL values.
COUNT	Counts non NULL values.
AVG	Averages non NULL values.
MIN	Minimum of non NULL values.
MAX	Maximum of non NULL values.
GROUP_CONCAT	concatenate strings.

Libraries needed for this chapter include:

library(DBI)
library(RSQLite)

The SQL examples so far have been working with one table (relation) at a time, but the point of having more than one table is that we should have the ability to collect data from multiple tables to get the report that we are after.

For this chapter data is stored in an SQLite file called sales.sqlite. This file is available from https://s3.us-west-1.amazonaws.com/markhuber-datascience-resources.org/Data_sets/SQLite/sales.sqlite. Form a connection to that database using:

db <- dbConnect(SQLite(),
dbname = "datasets/sales.sqlite")

16.1 Inner Joins

An inner join of two datasets brings together observations that appear in both tables.

In the table sales_teams, each value of sales_agent appears only once, since each agent is part of only one team. Therefore it is a key in this table.

SELECT sales_agent, manager
  FROM sales_teams

Table 16.1: Displaying records 1 - 10
sales_agent	manager
Anna Snelling	Dustin Brinkmann
Cecily Lampkin	Dustin Brinkmann
Versie Hillebrand	Dustin Brinkmann
Lajuana Vencill	Dustin Brinkmann
Moses Frase	Dustin Brinkmann
Jonathan Berthelot	Melvin Marxen
Marty Freudenburg	Melvin Marxen
Gladys Colclough	Melvin Marxen
Niesha Huffines	Melvin Marxen
Darcel Schlecht	Melvin Marxen

The table sales_pipeline tells us what deals a particular agent has in their pipeline. In this table each agent might be working on more than one deal at a time.

SELECT account, sales_agent
  FROM sales_pipeline

Table 16.2: Displaying records 1 - 10
Account	Sales_Agent
Cancity	Moses Frase
Isdom	Darcel Schlecht
Cancity	Darcel Schlecht
Codehow	Moses Frase
Hatfan	Zane Levy
Ron-tech	Anna Snelling
J-Texon	Vicki Laflamme
Cheers	Markita Hansen
Zumgoity	Niesha Huffines
NA	James Ascencio

So sales_agent is not a key in this table. Since it is a key in sales_teams, it provides a foreign key for that table. This simplest kind of inner join can be accomplished just by using the WHERE command.

SELECT sales_teams.manager,
       sales_pipeline.sales_agent,
       sales_pipeline.account
  FROM sales_teams, sales_pipeline
 WHERE (sales_pipeline.sales_agent = sales_teams.sales_agent)
       AND sales_pipeline.deal_stage = "Won"

Table 16.3: Displaying records 1 - 10
manager	Sales_Agent	Account
Dustin Brinkmann	Moses Frase	Cancity
Melvin Marxen	Darcel Schlecht	Isdom
Melvin Marxen	Darcel Schlecht	Cancity
Dustin Brinkmann	Moses Frase	Codehow
Summer Sewald	Zane Levy	Hatfan
Dustin Brinkmann	Anna Snelling	Ron-tech
Celia Rouche	Vicki Laflamme	J-Texon
Celia Rouche	Markita Hansen	Cheers
Melvin Marxen	Niesha Huffines	Zumgoity
Dustin Brinkmann	Anna Snelling	Bioholding

So what exactly happened here?

In FROM, there are two tables sales_teams and sales_pipeline. So data can be drawn from both.
In SELECT, it is possible to include variables from either sales_teams, sales_pipeline, or both. Use a tablename.variablename formulation for such selections.
In WHERE, we indicate how the join works by declaring which of the variables from the two tables must be equal. Then AND was used to place an extra condition on data from one of the tables.

The AS keyword can be used to relabel the variables for extra readability. It can also be used to relabel the tables. Note that although the AS in the FROM command relabels the tables, the relabels can actually be used earlier in the SELECT command.

SELECT teams.manager,
       pipeline.sales_agent AS agent,
       pipeline.account
  FROM sales_teams AS teams, sales_pipeline AS pipeline
 WHERE (pipeline.sales_agent = teams.sales_agent)
       AND pipeline.deal_stage = "Won"

Table 16.4: Displaying records 1 - 10
manager	agent	Account
Dustin Brinkmann	Moses Frase	Cancity
Melvin Marxen	Darcel Schlecht	Isdom
Melvin Marxen	Darcel Schlecht	Cancity
Dustin Brinkmann	Moses Frase	Codehow
Summer Sewald	Zane Levy	Hatfan
Dustin Brinkmann	Anna Snelling	Ron-tech
Celia Rouche	Vicki Laflamme	J-Texon
Celia Rouche	Markita Hansen	Cheers
Melvin Marxen	Niesha Huffines	Zumgoity
Dustin Brinkmann	Anna Snelling	Bioholding

Using WHERE in this way to make a join works, but SQL does also have an explicit JOIN command so we can separate conceptually the join from the filtering condition. The USING keyword then explicitly tells us what variable to use to bring the tables together.

SELECT DISTINCT teams.manager,
                pipeline.sales_agent AS agent,
                pipeline.account
           FROM sales_teams AS teams
                JOIN sales_pipeline AS pipeline USING (sales_agent)
          WHERE pipeline.deal_stage = "Won"

Table 16.5: Displaying records 1 - 10
manager	agent	Account
Dustin Brinkmann	Moses Frase	Cancity
Melvin Marxen	Darcel Schlecht	Isdom
Melvin Marxen	Darcel Schlecht	Cancity
Dustin Brinkmann	Moses Frase	Codehow
Summer Sewald	Zane Levy	Hatfan
Dustin Brinkmann	Anna Snelling	Ron-tech
Celia Rouche	Vicki Laflamme	J-Texon
Celia Rouche	Markita Hansen	Cheers
Melvin Marxen	Niesha Huffines	Zumgoity
Dustin Brinkmann	Anna Snelling	Bioholding

What if our foreign key had a different name in the second table? Then instead of using USING, join the tables with ON which allows the specification of names in both tables.

16.2 Outer Joins

Inner joins only return observations where the value for a particular column appears in both tables. Outer joins return observations where the value for a particular column appears at least once. For left outer joins the value has to appear in the left table. For full outer joins it could appear in either table.

In a left outer join (or just left join) all observations in the first table appear in the output.

In a full outer join (or just full join) all observations in both tables appear in the output.

In SQL, OUTER is a keyword that modifies JOIN, which can then be further modified by LEFT, RIGHT, or FULL.

For instance, to do a left outer join on the tables:

SELECT DISTINCT sales_teams.sales_agent AS agent,
                sales_pipeline.deal_stage
           FROM sales_teams
LEFT OUTER JOIN sales_pipeline
             ON sales_teams.sales_agent = sales_pipeline.sales_agent
          WHERE sales_pipeline.deal_stage = "In Progress"

Table 16.6: Displaying records 1 - 10
agent	Deal_Stage
James Ascencio	In Progress
Maureen Marcano	In Progress
Kami Bicknell	In Progress
Cassey Cress	In Progress
Hayden Neloms	In Progress
Corliss Cosme	In Progress
Elease Gluck	In Progress
Markita Hansen	In Progress
Daniell Hammack	In Progress
Zane Levy	In Progress

A left outer join is appropriate here since the goal is not to list all sales agents, only those that have a deal at some stage of completion.

16.3 Self Join

The need for a self join arises when the values of a key for a table are also used as entries in another column. For instance, suppose that I have a variable which is the employee ID. Then this ID number might be used to also indicate a manager.

SELECT *
  FROM employees

Table 16.7: Displaying records 1 - 10
emp_id	name	mgr_id	Regional_Office	Status
10001	Anna Snelling	10036	Central	Current
10002	Cecily Lampkin	10036	Central	Current
10003	Versie Hillebrand	10036	Central	Current
10004	Lajuana Vencill	10036	Central	Current
10005	Moses Frase	10036	Central	Current
10006	Jonathan Berthelot	10037	Central	Current
10007	Marty Freudenburg	10037	Central	Current
10008	Gladys Colclough	10037	Central	Current
10009	Niesha Huffines	10037	Central	Current
10010	Darcel Schlecht	10037	Central	Current

Note that the mgr_id is using the emp_id number to identify the manager of each person. So to pull out the name of the manager, the mgr_id is being used as the key in the second table, while the emp_id is the key in the first.

SELECT emp.name AS employee,
       mgr.name AS manager
  FROM employees AS emp
  JOIN employees AS mgr ON emp.mgr_id = mgr.emp_id

Table 16.8: Displaying records 1 - 10
employee	manager
Anna Snelling	Dustin Brinkmann
Cecily Lampkin	Dustin Brinkmann
Versie Hillebrand	Dustin Brinkmann
Lajuana Vencill	Dustin Brinkmann
Moses Frase	Dustin Brinkmann
Jonathan Berthelot	Melvin Marxen
Marty Freudenburg	Melvin Marxen
Gladys Colclough	Melvin Marxen
Niesha Huffines	Melvin Marxen
Darcel Schlecht	Melvin Marxen

16.4 Aggregation

To apply functions to variable values in the tidyverse, we used mutate or summarize. In the SQL framework, these are called aggregations.

SUM adds together the non NULL values. \
COUNT counts non NULL values.
AVG averages non NULL values.
MIN finds the minimum of non NULL values.
MAX finds the maximum of non NULL values.
GROUP_CONCAT concatenates strings.

For example, to get the total monetary value of deals closed:

SELECT SUM(close_value)
  FROM sales_pipeline

Table 16.9: 1 records
SUM(close_value)
10005534

The total number of won deals:

SELECT COUNT(*)
  FROM sales_pipeline
 WHERE deal_stage = "Won"

Table 16.10: 1 records
COUNT(*)
4238

The average value of the deals won:

SELECT AVG(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"

Table 16.11: 1 records
AVG(close_value)
2360.909

The smallest deal won:

SELECT MIN(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"

Table 16.12: 1 records
MIN(close_value)
38

16.5 GROUP BY

The GROUP BY command in SQL performs the same function as group_by in the tidyverse: it partitions the observations by the values of a particular variable. For instance, the following finds the average deal size for each sales agent. It first groups the data by sales_agent, then calculates the average value using AVG.

  SELECT sales_agent,
         AVG(close_value)
    FROM sales_pipeline
   WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_agent
ORDER BY AVG(close_value) DESC

Table 16.13: Displaying records 1 - 10
Sales_Agent	AVG(close_value)
Elease Gluck	3614.938
Darcel Schlecht	3304.338
Rosalina Dieter	3269.486
Daniell Hammack	3194.991
James Ascencio	3063.207
Rosie Papadopoulos	2950.885
Wilburn Farren	2866.182
Reed Clapper	2827.974
Donn Cantrell	2821.899
Corliss Cosme	2806.907

Once two tables are joined, then GROUP BY treats them as a single table. This is useful, for instance, if the goal is to obtain the average deal by manager.

  SELECT sales_teams.manager,
         AVG(sales_pipeline.close_value)
    FROM sales_teams
         JOIN sales_pipeline ON (sales_teams.sales_agent = sales_pipeline.sales_agent)
         WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_teams.manager

Table 16.14: 6 records
manager	AVG(sales_pipeline.close_value)
Cara Losch	2354.269
Celia Rouche	2629.339
Dustin Brinkmann	1465.011
Melvin Marxen	2553.209
Rocco Neubert	2837.258
Summer Sewald	2372.886

Filtering observations used WHERE, but if the plan is to use these filtered observations within a GROUP BY, then WHERE must be surrounded by a FILTER. For instance, to get the number of deals won that had a value greater than 1000, use the following.

  SELECT sales_agent,
         COUNT(sales_pipeline.close_value) AS total,
         COUNT(sales_pipeline.close_value)
FILTER(WHERE(sales_pipeline.close_value > 1000)) AS `over 1000`
    FROM sales_pipeline
   WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent

Table 16.15: Displaying records 1 - 10
Sales_Agent	total	over 1000
Anna Snelling	208	68
Boris Faz	101	70
Cassey Cress	163	112
Cecily Lampkin	107	63
Corliss Cosme	150	97
Daniell Hammack	114	91
Darcel Schlecht	349	272
Donn Cantrell	158	106
Elease Gluck	80	32
Garret Kinder	75	46

To filter observations after aggregation has occurred, use the HAVING

  SELECT sales_agent,
         COUNT(sales_pipeline.close_value) AS `number won`
    FROM sales_pipeline
   WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent
  HAVING COUNT(sales_pipeline.close_value) > 200

Table 16.16: 4 records
Sales_Agent	number won
Anna Snelling	208
Darcel Schlecht	349
Kary Hendrixson	209
Vicki Laflamme	221

16.6 Set operations in SQL

In the tidyverse, the set operations union, intersect and setdiff find the union, intersection, and set difference respectively of observations that belong to different tables, but have the same variables. The corresponding commands in SQL are UNION, INTERSECT, and MINUS.

Up until now, we have been using SQL queries that only have one SELECT command. Each time the SELECT command is used it creates a table. Set operations can then be used to combine these tables.

For instance, suppose that international accounts were located in one table, and domestic in another. Two SELECT commands to put the data from both tables into the same form, then UNION could be used to combine them.

SELECT intl_accounts.account,
       intl_accounts.office_location AS location
  FROM intl_accounts
   UNION
SELECT accounts.account,
       "USA" AS location
  FROM accounts

Table 16.17: Displaying records 1 - 10
account	location
AWOLEX	USA
Acme Corporation	USA
Betasoloin	USA
Betatech	Kenya
Betatech	USA
Bioholding	Philipines
Bioholding	USA
Bioplex	USA
Blackzim	USA
Bluth Company	USA

Now that the database is not needed, disconnect.

dbDisconnect(db)

Questions

Suppose the iris data table was in an SQL database

Write an SQL query to find the setosa flower observations with petal length greater than 1.5 cm. Return only the petal length values.
Write an SQL query that first finds the setosa flower observations with petal length greater than 1.5 cm. Return the sample average of the petal length values.

Write an SQL query for the iris SQL database that does the following: returns the largest sepal length and the species that attains it.
Write an SQL query for the iris SQL database that does the following: returns the smallest sepal length and the species that attains it.

Write an SQL query for the iris SQL database that does the following: for each different species, return the largest sepal length under the variable name max_sepal_length together with the species name.

Suppose the mtcars dataset built into R was an SQL table.

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Write an SQL query that for each of the different gear values, returns the number of gears together with the sample average of the mpg values.

Write SQL code to perform an inner join between the two tables, giving a report with variables State, Capital, and Population, and using State as the key to join the tables.

Suppose that the relational database includes a table Region that has variables State, Mining, and Population. The Mining variable is either high, med, or low.

Write SQL code that accomplishes the following.

Removes all observations where the population is less than one million.
For each of the mining levels, computes the average population.
Reports the Mining level, the average population for that level, and orders in decreasing order of average population.

Suppose the PlantGrowth dataset built in to R was in an SQL database.

weight	group
4.17	ctrl
5.58	ctrl
5.18	ctrl
6.11	ctrl
4.50	ctrl
4.61	ctrl

Write an SQL query that finds the unique values that group takes on.
Write an SQL query that finds the average value of the weight.
Write an SQL query that finds the average value of the weight for each of the three values group takes on.

Suppose that the table Orders has two columns, OrderID and CustomerID. Here OrderID is a key, but since the same customer might make more than one order, CustomerID is not. The whole table has 930 rows, here are the first ten rows.

tribble(
  ~OrderID, ~CustomerID,
10643,  "ALFKI",            
10692,  "ALFKI",        
10702,  "ALFKI",            
10835,  "ALFKI",            
10952,  "ALFKI",            
11011,  "ALFKI",            
10308,  "ANATR",            
10625,  "ANATR",            
10759,  "ANATR",            
10926,  "ANATR",
) |> kable() |> kable_styling()

OrderID	CustomerID
10643	ALFKI
10692	ALFKI
10702	ALFKI
10835	ALFKI
10952	ALFKI
11011	ALFKI
10308	ANATR
10625	ANATR
10759	ANATR
10926	ANATR

Write an SQL query for the Orders table where the reported data is the the CustomerID together with the number of orders that customer made. So the first row of the report will be

tribble(
  ~CustomerID, ~Num_orders,
  "ALFKI", 6
) |> kable() |> kable_styling()

CustomerID	Num_orders
ALFKI	6

The tables in the Northwind database include Suppliers and Customers. Both these tables include the variables CompanyName and City.

Write an SQL query that creates a report that shows that name of the company and the city where all the suppliers are based.
Write an SQL query that creates a report that shows that name of the company and the city where all the customers are based.
Write a single SQL query that combines your answers from a. and b. into a single report.