16 Joining tables in SQL


Summary

SQL was designed to use relational databases, and so has many commands for drawing data from multiple tables.

SQL Command Description
JOIN Brings two tables together.
OUTER Modifies JOIN to be an outer join.
LEFT Modifies OUTER to be a left outer join.
UNION Union of observations from tables with same variables.
INTERSECT Intersection of observations from tables with same variables.
MINUS Set difference of observations from tables with same variables.

Summarizing data and combining strings can also be done in SQL.

SQL Command Description
SUM Adds together the non NULL values.
COUNT Counts non NULL values.
AVG Averages non NULL values.
MIN Minimum of non NULL values.
MAX Maximum of non NULL values.
GROUP_CONCAT concatenate strings.

Libraries needed for this chapter include:

library(DBI)
library(RSQLite)

The SQL examples so far have been working with one table (relation) at a time, but the point of having more than one table is that we should have the ability to collect data from multiple tables to get the report that we are after.

For this chapter data is stored in an SQLite file called sales.sqlite. This file is available from https://s3.us-west-1.amazonaws.com/markhuber-datascience-resources.org/Data_sets/SQLite/sales.sqlite. Form a connection to that database using:

db <- dbConnect(SQLite(),
dbname = "datasets/sales.sqlite")

16.1 Inner Joins

An inner join of two datasets brings together observations that appear in both tables.

In the table sales_teams, each value of sales_agent appears only once, since each agent is part of only one team. Therefore it is a key in this table.

SELECT sales_agent, manager
  FROM sales_teams
Table 16.1: Displaying records 1 - 10
sales_agent manager
Anna Snelling Dustin Brinkmann
Cecily Lampkin Dustin Brinkmann
Versie Hillebrand Dustin Brinkmann
Lajuana Vencill Dustin Brinkmann
Moses Frase Dustin Brinkmann
Jonathan Berthelot Melvin Marxen
Marty Freudenburg Melvin Marxen
Gladys Colclough Melvin Marxen
Niesha Huffines Melvin Marxen
Darcel Schlecht Melvin Marxen

The table sales_pipeline tells us what deals a particular agent has in their pipeline. In this table each agent might be working on more than one deal at a time.

SELECT account, sales_agent
  FROM sales_pipeline
Table 16.2: Displaying records 1 - 10
Account Sales_Agent
Cancity Moses Frase
Isdom Darcel Schlecht
Cancity Darcel Schlecht
Codehow Moses Frase
Hatfan Zane Levy
Ron-tech Anna Snelling
J-Texon Vicki Laflamme
Cheers Markita Hansen
Zumgoity Niesha Huffines
NA James Ascencio

So sales_agent is not a key in this table. Since it is a key in sales_teams, it provides a foreign key for that table. This simplest kind of inner join can be accomplished just by using the WHERE command.

SELECT sales_teams.manager,
       sales_pipeline.sales_agent,
       sales_pipeline.account
  FROM sales_teams, sales_pipeline
 WHERE (sales_pipeline.sales_agent = sales_teams.sales_agent)
       AND sales_pipeline.deal_stage = "Won"
Table 16.3: Displaying records 1 - 10
manager Sales_Agent Account
Dustin Brinkmann Moses Frase Cancity
Melvin Marxen Darcel Schlecht Isdom
Melvin Marxen Darcel Schlecht Cancity
Dustin Brinkmann Moses Frase Codehow
Summer Sewald Zane Levy Hatfan
Dustin Brinkmann Anna Snelling Ron-tech
Celia Rouche Vicki Laflamme J-Texon
Celia Rouche Markita Hansen Cheers
Melvin Marxen Niesha Huffines Zumgoity
Dustin Brinkmann Anna Snelling Bioholding

So what exactly happened here?

  • In FROM, there are two tables sales_teams and sales_pipeline. So data can be drawn from both.

  • In SELECT, it is possible to include variables from either sales_teams, sales_pipline, or both. Use a tablename.variablename formulation for such selections.

  • In WHERE, we indicate how the join works by declaring which of the variables from the two tables must be equal. Then AND was used to place an extra condition on data from one of the tables.

The AS keyword can be used to relabel the variables for extra readability. It can also be used to relabel the tables. Note that although the AS in the FROM command relabels the tables, the relabels can actually be used earlier in the SELECT command.

SELECT teams.manager,
       pipeline.sales_agent AS agent,
       pipeline.account
  FROM sales_teams AS teams, sales_pipeline AS pipeline
 WHERE (pipeline.sales_agent = teams.sales_agent)
       AND pipeline.deal_stage = "Won"
Table 16.4: Displaying records 1 - 10
manager agent Account
Dustin Brinkmann Moses Frase Cancity
Melvin Marxen Darcel Schlecht Isdom
Melvin Marxen Darcel Schlecht Cancity
Dustin Brinkmann Moses Frase Codehow
Summer Sewald Zane Levy Hatfan
Dustin Brinkmann Anna Snelling Ron-tech
Celia Rouche Vicki Laflamme J-Texon
Celia Rouche Markita Hansen Cheers
Melvin Marxen Niesha Huffines Zumgoity
Dustin Brinkmann Anna Snelling Bioholding

Using WHERE in this way to make a join works, but SQL does also have an explicit JOIN command so we can separate conceptually the join from the filtering condition. The USING keyword then explicitly tells us what variable to use to bring the tables together.

SELECT DISTINCT teams.manager,
                pipeline.sales_agent AS agent,
                pipeline.account
           FROM sales_teams AS teams
                JOIN sales_pipeline AS pipeline USING (sales_agent)
          WHERE pipeline.deal_stage = "Won"
Table 16.5: Displaying records 1 - 10
manager agent Account
Dustin Brinkmann Moses Frase Cancity
Melvin Marxen Darcel Schlecht Isdom
Melvin Marxen Darcel Schlecht Cancity
Dustin Brinkmann Moses Frase Codehow
Summer Sewald Zane Levy Hatfan
Dustin Brinkmann Anna Snelling Ron-tech
Celia Rouche Vicki Laflamme J-Texon
Celia Rouche Markita Hansen Cheers
Melvin Marxen Niesha Huffines Zumgoity
Dustin Brinkmann Anna Snelling Bioholding

What if our foreign key had a different name in the second table? Then instead of using USING, join the tables with ON which allows the specification of names in both tables.

16.2 Outer Joins

Inner joins only return observations where the value for a particular column appears in both tables. Outer joins return observations where the value for a particular column appears at least once. For left outer joins the value has to appear in the left table. For full outer joins it could appear in either table.

In a left outer join (or just left join) all observations in the first table appear in the output.

In a full outer join (or just full join) all observations in both tables appear in the output.

In SQL, OUTER is a keyword that modifies JOIN, which can then be further modified by LEFT, RIGHT, or FULL.

For instance, to do a left outer join on the tables:

SELECT DISTINCT sales_teams.sales_agent AS agent,
                sales_pipeline.deal_stage
           FROM sales_teams
LEFT OUTER JOIN sales_pipeline
             ON sales_teams.sales_agent = sales_pipeline.sales_agent
          WHERE sales_pipeline.deal_stage = "In Progress"
Table 16.6: Displaying records 1 - 10
agent Deal_Stage
James Ascencio In Progress
Maureen Marcano In Progress
Kami Bicknell In Progress
Cassey Cress In Progress
Hayden Neloms In Progress
Corliss Cosme In Progress
Elease Gluck In Progress
Markita Hansen In Progress
Daniell Hammack In Progress
Zane Levy In Progress

A left outer join is appropriate here since the goal is not to list all sales agents, only those that have a deal at some stage of completion.

16.3 Self Join

The need for a self join arises when the values of a key for a table are also used as entries in another column. For instance, suppose that I have a variable which is the employee ID. Then this ID number might be used to also indicate a manager.

SELECT *
  FROM employees
Table 16.7: Displaying records 1 - 10
emp_id name mgr_id Regional_Office Status
10001 Anna Snelling 10036 Central Current
10002 Cecily Lampkin 10036 Central Current
10003 Versie Hillebrand 10036 Central Current
10004 Lajuana Vencill 10036 Central Current
10005 Moses Frase 10036 Central Current
10006 Jonathan Berthelot 10037 Central Current
10007 Marty Freudenburg 10037 Central Current
10008 Gladys Colclough 10037 Central Current
10009 Niesha Huffines 10037 Central Current
10010 Darcel Schlecht 10037 Central Current

Note that the mgr_id is using the emp_id number to identify the manager of each person. So to pull out the name of the manager, the mgr_id is being used as the key in the second table, while the emp_id is the key in the first.

SELECT emp.name AS employee,
       mgr.name AS manager
  FROM employees AS emp
  JOIN employees AS mgr ON emp.mgr_id = mgr.emp_id
Table 16.8: Displaying records 1 - 10
employee manager
Anna Snelling Dustin Brinkmann
Cecily Lampkin Dustin Brinkmann
Versie Hillebrand Dustin Brinkmann
Lajuana Vencill Dustin Brinkmann
Moses Frase Dustin Brinkmann
Jonathan Berthelot Melvin Marxen
Marty Freudenburg Melvin Marxen
Gladys Colclough Melvin Marxen
Niesha Huffines Melvin Marxen
Darcel Schlecht Melvin Marxen

16.4 Aggregation

To apply functions to variable values in the tidyverse, we used mutate or summarize. In the SQL framework, these are called aggregations.

  • SUM adds together the non NULL values. \

  • COUNT counts non NULL values.

  • AVG averages non NULL values.

  • MIN finds the minimum of non NULL values.

  • MAX finds the maximum of non NULL values.

  • GROUP_CONCAT concatenates strings.

For example, to get the total monetary value of deals closed:

SELECT SUM(close_value)
  FROM sales_pipeline
Table 16.9: 1 records
SUM(close_value)
10005534

The total number of won deals:

SELECT COUNT(*)
  FROM sales_pipeline
 WHERE deal_stage = "Won"
Table 16.10: 1 records
COUNT(*)
4238

The average value of the deals won:

SELECT AVG(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
Table 16.11: 1 records
AVG(close_value)
2360.909

The smallest deal won:

SELECT MIN(close_value)
  FROM sales_pipeline
 WHERE sales_pipeline.deal_stage = "Won"
Table 16.12: 1 records
MIN(close_value)
38

16.5 GROUP BY

The GROUP BY command in SQL performs the same function as group_by in the tidyverse: it partitions the observations by the values of a particular variable. For instance, the following finds the average deal size for each sales agent. It first groups the data by sales_agent, then calculates the average value using AVG.

  SELECT sales_agent,
         AVG(close_value)
    FROM sales_pipeline
   WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_agent
ORDER BY AVG(close_value) DESC
Table 16.13: Displaying records 1 - 10
Sales_Agent AVG(close_value)
Elease Gluck 3614.938
Darcel Schlecht 3304.338
Rosalina Dieter 3269.486
Daniell Hammack 3194.991
James Ascencio 3063.207
Rosie Papadopoulos 2950.885
Wilburn Farren 2866.182
Reed Clapper 2827.974
Donn Cantrell 2821.899
Corliss Cosme 2806.907

Once two tables are joined, then GROUP BY treats them as a single table. This is useful, for instance, if the goal is to obtain the average deal by manager.

  SELECT sales_teams.manager,
         AVG(sales_pipeline.close_value)
    FROM sales_teams
         JOIN sales_pipeline ON (sales_teams.sales_agent = sales_pipeline.sales_agent)
         WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_teams.manager
Table 16.14: 6 records
manager AVG(sales_pipeline.close_value)
Cara Losch 2354.269
Celia Rouche 2629.339
Dustin Brinkmann 1465.011
Melvin Marxen 2553.209
Rocco Neubert 2837.258
Summer Sewald 2372.886

Filtering observations used WHERE, but if the plan is to use these filtered observations within a GROUP BY, then WHERE must be surrounded by a FILTER. For instance, to get the number of deals won that had a value greater than 1000, use the following.

  SELECT sales_agent,
         COUNT(sales_pipeline.close_value) AS total,
         COUNT(sales_pipeline.close_value)
FILTER(WHERE(sales_pipeline.close_value > 1000)) AS `over 1000`
    FROM sales_pipeline
   WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent
Table 16.15: Displaying records 1 - 10
Sales_Agent total over 1000
Anna Snelling 208 68
Boris Faz 101 70
Cassey Cress 163 112
Cecily Lampkin 107 63
Corliss Cosme 150 97
Daniell Hammack 114 91
Darcel Schlecht 349 272
Donn Cantrell 158 106
Elease Gluck 80 32
Garret Kinder 75 46

To filter observations after aggregation has occurred, use the HAVING

  SELECT sales_agent,
         COUNT(sales_pipeline.close_value) AS `number won`
    FROM sales_pipeline
   WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent
  HAVING COUNT(sales_pipeline.close_value) > 200
Table 16.16: 4 records
Sales_Agent number won
Anna Snelling 208
Darcel Schlecht 349
Kary Hendrixson 209
Vicki Laflamme 221

16.6 Set operations in SQL

In the tidyverse, the set operations union, intersect and setdiff find the union, intersection, and set difference respectively of observations that belong to different tables, but have the same variables. The corresponding commands in SQL are UNION, INTERSECT, and MINUS.

Up until now, we have been using SQL queries that only have one SELECT command. Each time the SELECT command is used it creates a table. Set operations can then be used to combine these tables.

For instance, suppose that international accounts were located in one table, and domestic in another. Two SELECT commands to put the data from both tables into the same form, then UNION could be used to combine them.

SELECT intl_accounts.account,
       intl_accounts.office_location AS location
  FROM intl_accounts
   UNION
SELECT accounts.account,
       "USA" AS location
  FROM accounts
Table 16.17: Displaying records 1 - 10
account location
AWOLEX USA
Acme Corporation USA
Betasoloin USA
Betatech Kenya
Betatech USA
Bioholding Philipines
Bioholding USA
Bioplex USA
Blackzim USA
Bluth Company USA

Now that the database is not needed, disconnect.

dbDisconnect(db)

Questions

Suppose the iris data table was in an SQL database

  1. Write an SQL query to find the setosa flower observations with petal length greater than 1.5 cm. Return only the petal length values.

  2. Write an SQL query that first finds the setosa flower observations with petal length greater than 1.5 cm. Return the sample average of the petal length values.

  1. Write an SQL query for the iris SQL database that does the following: returns the largest sepal length and the species that attains it.

  2. Write an SQL query for the iris SQL database that does the following: returns the smallest sepal length and the species that attains it.

Write an SQL query for the iris SQL database that does the following: for each different species, return the largest sepal length under the variable name max_sepal_length together with the species name.

Suppose the mtcars dataset built into R was an SQL table.

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Write an SQL query that for each of the different gear values, returns the number of gears together with the sample average of the mpg values.

Suppose that a relational database has two tables. The first table Territory has variables State, Level, Capital, and the second table Region has variables State, Mining, and Population.

Write SQL code to perform an inner join between the two tables, giving a report with variables State, Capital, and Population, and using State as the key to join the tables.

Suppose that the relational database includes a table Region has variables State, Mining, and Population. The Mining variable is either high, med, or low.

Write SQL code that accomplishes the following.

  • Removes all observations where the population is less than one million.

  • For each of the mining levels, computes the average population.

  • Reports the Mining level, the average population for that level, and orders in decreasing order of average population.

Suppose the PlantGrowth dataset built in to R was in an SQL database.

weight group
4.17 ctrl
5.58 ctrl
5.18 ctrl
6.11 ctrl
4.50 ctrl
4.61 ctrl
  1. Write an SQL query that finds the unique values that group takes on.

  2. Write an SQL query that finds the average value of the petal length.

  3. Write an SQL query that finds the average value of the petal length for each of the three values group takes on.

Suppose that the table Orders has two columns, OrderID and CustomerID. Here OrderID is a key, but since the same customer might make more than one order, CustomerID is not. The whole table has 930 rows, here are the first ten rows.

tribble(
  ~OrderID, ~CustomerID,
10643,  "ALFKI",            
10692,  "ALFKI",        
10702,  "ALFKI",            
10835,  "ALFKI",            
10952,  "ALFKI",            
11011,  "ALFKI",            
10308,  "ANATR",            
10625,  "ANATR",            
10759,  "ANATR",            
10926,  "ANATR",
) |> kable() |> kable_styling()
OrderID CustomerID
10643 ALFKI
10692 ALFKI
10702 ALFKI
10835 ALFKI
10952 ALFKI
11011 ALFKI
10308 ANATR
10625 ANATR
10759 ANATR
10926 ANATR

Write an SQL query for the Orders table where the reported data is the the CustomerID together with the number of orders that customer made. So the first row of the report will be

tribble(
  ~CustomerID, ~Num_orders,
  "ALFKI", 6
) |> kable() |> kable_styling()
CustomerID Num_orders
ALFKI 6

The tables in the Northwind database include Suppliers and Customers. Both these tables include the variables CompanyName and City.

  1. Write an SQL query that creates a report that shows that name of the company and the city where all the suppliers are based.

  2. Write an SQL query that creates a report that shows that name of the company and the city where all the customers are based.

  3. Write a single SQL query that combines your answers from a. and b. into a single report.