16 Joining tables in SQL
Summary
SQL was designed to use relational databases, and so has many commands for drawing data from multiple tables.
| SQL Command | Description |
|---|---|
| JOIN | Brings two tables together. |
| OUTER |
Modifies JOIN to be an outer join.
|
| LEFT |
Modifies OUTER to be a left outer join.
|
| UNION | Union of observations from tables with same variables. |
| INTERSECT | Intersection of observations from tables with same variables. |
| MINUS | Set difference of observations from tables with same variables. |
Summarizing data and combining strings can also be done in SQL.
| SQL Command | Description |
|---|---|
| SUM | Adds together the non NULL values. |
| COUNT | Counts non NULL values. |
| AVG | Averages non NULL values. |
| MIN | Minimum of non NULL values. |
| MAX | Maximum of non NULL values. |
| GROUP_CONCAT | concatenate strings. |
Libraries needed for this chapter include:
The SQL examples so far have been working with one table (relation) at a time, but the point of having more than one table is that we should have the ability to collect data from multiple tables to get the report that we are after.
For this chapter data is stored in an SQLite file called sales.sqlite. This file is available from https://s3.us-west-1.amazonaws.com/markhuber-datascience-resources.org/Data_sets/SQLite/sales.sqlite. Form a connection to that database using:
16.1 Inner Joins
An inner join of two datasets brings together observations that appear in both tables.
In the table sales_teams, each value of
sales_agent appears only once, since each agent is part
of only one team. Therefore it is a key in this table.
| sales_agent | manager |
|---|---|
| Anna Snelling | Dustin Brinkmann |
| Cecily Lampkin | Dustin Brinkmann |
| Versie Hillebrand | Dustin Brinkmann |
| Lajuana Vencill | Dustin Brinkmann |
| Moses Frase | Dustin Brinkmann |
| Jonathan Berthelot | Melvin Marxen |
| Marty Freudenburg | Melvin Marxen |
| Gladys Colclough | Melvin Marxen |
| Niesha Huffines | Melvin Marxen |
| Darcel Schlecht | Melvin Marxen |
The table sales_pipeline tells us what deals a
particular agent has in their pipeline. In this table each agent might be
working on more than one deal at a time.
| Account | Sales_Agent |
|---|---|
| Cancity | Moses Frase |
| Isdom | Darcel Schlecht |
| Cancity | Darcel Schlecht |
| Codehow | Moses Frase |
| Hatfan | Zane Levy |
| Ron-tech | Anna Snelling |
| J-Texon | Vicki Laflamme |
| Cheers | Markita Hansen |
| Zumgoity | Niesha Huffines |
| NA | James Ascencio |
So sales_agent is not a key in this table. Since
it is a key in sales_teams, it provides a foreign key
for that table. This simplest kind of inner join can be accomplished
just by using the WHERE command.
SELECT sales_teams.manager,
sales_pipeline.sales_agent,
sales_pipeline.account
FROM sales_teams, sales_pipeline
WHERE (sales_pipeline.sales_agent = sales_teams.sales_agent)
AND sales_pipeline.deal_stage = "Won"| manager | Sales_Agent | Account |
|---|---|---|
| Dustin Brinkmann | Moses Frase | Cancity |
| Melvin Marxen | Darcel Schlecht | Isdom |
| Melvin Marxen | Darcel Schlecht | Cancity |
| Dustin Brinkmann | Moses Frase | Codehow |
| Summer Sewald | Zane Levy | Hatfan |
| Dustin Brinkmann | Anna Snelling | Ron-tech |
| Celia Rouche | Vicki Laflamme | J-Texon |
| Celia Rouche | Markita Hansen | Cheers |
| Melvin Marxen | Niesha Huffines | Zumgoity |
| Dustin Brinkmann | Anna Snelling | Bioholding |
So what exactly happened here?
In FROM, there are two tables
sales_teamsandsales_pipeline. So data can be drawn from both.In SELECT, it is possible to include variables from either
sales_teams,sales_pipline, or both. Use atablename.variablenameformulation for such selections.In WHERE, we indicate how the join works by declaring which of the variables from the two tables must be equal. Then AND was used to place an extra condition on data from one of the tables.
The AS keyword can be used to relabel the variables for extra readability. It can also be used to relabel the tables. Note that although the AS in the FROM command relabels the tables, the relabels can actually be used earlier in the SELECT command.
SELECT teams.manager,
pipeline.sales_agent AS agent,
pipeline.account
FROM sales_teams AS teams, sales_pipeline AS pipeline
WHERE (pipeline.sales_agent = teams.sales_agent)
AND pipeline.deal_stage = "Won"| manager | agent | Account |
|---|---|---|
| Dustin Brinkmann | Moses Frase | Cancity |
| Melvin Marxen | Darcel Schlecht | Isdom |
| Melvin Marxen | Darcel Schlecht | Cancity |
| Dustin Brinkmann | Moses Frase | Codehow |
| Summer Sewald | Zane Levy | Hatfan |
| Dustin Brinkmann | Anna Snelling | Ron-tech |
| Celia Rouche | Vicki Laflamme | J-Texon |
| Celia Rouche | Markita Hansen | Cheers |
| Melvin Marxen | Niesha Huffines | Zumgoity |
| Dustin Brinkmann | Anna Snelling | Bioholding |
Using WHERE in this way to make a join works, but SQL does also have an explicit JOIN command so we can separate conceptually the join from the filtering condition. The USING keyword then explicitly tells us what variable to use to bring the tables together.
SELECT DISTINCT teams.manager,
pipeline.sales_agent AS agent,
pipeline.account
FROM sales_teams AS teams
JOIN sales_pipeline AS pipeline USING (sales_agent)
WHERE pipeline.deal_stage = "Won"| manager | agent | Account |
|---|---|---|
| Dustin Brinkmann | Moses Frase | Cancity |
| Melvin Marxen | Darcel Schlecht | Isdom |
| Melvin Marxen | Darcel Schlecht | Cancity |
| Dustin Brinkmann | Moses Frase | Codehow |
| Summer Sewald | Zane Levy | Hatfan |
| Dustin Brinkmann | Anna Snelling | Ron-tech |
| Celia Rouche | Vicki Laflamme | J-Texon |
| Celia Rouche | Markita Hansen | Cheers |
| Melvin Marxen | Niesha Huffines | Zumgoity |
| Dustin Brinkmann | Anna Snelling | Bioholding |
What if our foreign key had a different name in the second table? Then instead of using USING, join the tables with ON which allows the specification of names in both tables.
16.2 Outer Joins
Inner joins only return observations where the value for a particular column appears in both tables. Outer joins return observations where the value for a particular column appears at least once. For left outer joins the value has to appear in the left table. For full outer joins it could appear in either table.
In a left outer join (or just left join) all observations in the first table appear in the output.
In a full outer join (or just full join) all observations in both tables appear in the output.
In SQL, OUTER is a keyword that modifies JOIN, which can then be further modified by LEFT, RIGHT, or FULL.
For instance, to do a left outer join on the tables:
SELECT DISTINCT sales_teams.sales_agent AS agent,
sales_pipeline.deal_stage
FROM sales_teams
LEFT OUTER JOIN sales_pipeline
ON sales_teams.sales_agent = sales_pipeline.sales_agent
WHERE sales_pipeline.deal_stage = "In Progress"| agent | Deal_Stage |
|---|---|
| James Ascencio | In Progress |
| Maureen Marcano | In Progress |
| Kami Bicknell | In Progress |
| Cassey Cress | In Progress |
| Hayden Neloms | In Progress |
| Corliss Cosme | In Progress |
| Elease Gluck | In Progress |
| Markita Hansen | In Progress |
| Daniell Hammack | In Progress |
| Zane Levy | In Progress |
A left outer join is appropriate here since the goal is not to list all sales agents, only those that have a deal at some stage of completion.
16.3 Self Join
The need for a self join arises when the values of a key for a table are also used as entries in another column. For instance, suppose that I have a variable which is the employee ID. Then this ID number might be used to also indicate a manager.
| emp_id | name | mgr_id | Regional_Office | Status |
|---|---|---|---|---|
| 10001 | Anna Snelling | 10036 | Central | Current |
| 10002 | Cecily Lampkin | 10036 | Central | Current |
| 10003 | Versie Hillebrand | 10036 | Central | Current |
| 10004 | Lajuana Vencill | 10036 | Central | Current |
| 10005 | Moses Frase | 10036 | Central | Current |
| 10006 | Jonathan Berthelot | 10037 | Central | Current |
| 10007 | Marty Freudenburg | 10037 | Central | Current |
| 10008 | Gladys Colclough | 10037 | Central | Current |
| 10009 | Niesha Huffines | 10037 | Central | Current |
| 10010 | Darcel Schlecht | 10037 | Central | Current |
Note that the mgr_id is using the emp_id number to identify the manager of each person. So to pull out the name of the manager, the mgr_id is being used as the key in the second table, while the emp_id is the key in the first.
SELECT emp.name AS employee,
mgr.name AS manager
FROM employees AS emp
JOIN employees AS mgr ON emp.mgr_id = mgr.emp_id| employee | manager |
|---|---|
| Anna Snelling | Dustin Brinkmann |
| Cecily Lampkin | Dustin Brinkmann |
| Versie Hillebrand | Dustin Brinkmann |
| Lajuana Vencill | Dustin Brinkmann |
| Moses Frase | Dustin Brinkmann |
| Jonathan Berthelot | Melvin Marxen |
| Marty Freudenburg | Melvin Marxen |
| Gladys Colclough | Melvin Marxen |
| Niesha Huffines | Melvin Marxen |
| Darcel Schlecht | Melvin Marxen |
16.4 Aggregation
To apply functions to variable values in the tidyverse, we used mutate or summarize. In the SQL framework, these are called aggregations.
SUM adds together the non
NULLvalues. \COUNT counts non
NULLvalues.AVG averages non
NULLvalues.MIN finds the minimum of non
NULLvalues.MAX finds the maximum of non
NULLvalues.GROUP_CONCAT concatenates strings.
For example, to get the total monetary value of deals closed:
| SUM(close_value) |
|---|
| 10005534 |
The total number of won deals:
| COUNT(*) |
|---|
| 4238 |
The average value of the deals won:
| AVG(close_value) |
|---|
| 2360.909 |
The smallest deal won:
| MIN(close_value) |
|---|
| 38 |
16.5 GROUP BY
The GROUP BY command in SQL performs the same function as group_by in the tidyverse: it partitions the observations by the values of a particular variable. For instance, the following finds the average deal size for each sales agent. It first groups the data by sales_agent, then calculates the average value using AVG.
SELECT sales_agent,
AVG(close_value)
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_agent
ORDER BY AVG(close_value) DESC| Sales_Agent | AVG(close_value) |
|---|---|
| Elease Gluck | 3614.938 |
| Darcel Schlecht | 3304.338 |
| Rosalina Dieter | 3269.486 |
| Daniell Hammack | 3194.991 |
| James Ascencio | 3063.207 |
| Rosie Papadopoulos | 2950.885 |
| Wilburn Farren | 2866.182 |
| Reed Clapper | 2827.974 |
| Donn Cantrell | 2821.899 |
| Corliss Cosme | 2806.907 |
Once two tables are joined, then GROUP BY treats them as a single table. This is useful, for instance, if the goal is to obtain the average deal by manager.
SELECT sales_teams.manager,
AVG(sales_pipeline.close_value)
FROM sales_teams
JOIN sales_pipeline ON (sales_teams.sales_agent = sales_pipeline.sales_agent)
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_teams.manager| manager | AVG(sales_pipeline.close_value) |
|---|---|
| Cara Losch | 2354.269 |
| Celia Rouche | 2629.339 |
| Dustin Brinkmann | 1465.011 |
| Melvin Marxen | 2553.209 |
| Rocco Neubert | 2837.258 |
| Summer Sewald | 2372.886 |
Filtering observations used WHERE, but if the plan is to use these filtered observations within a GROUP BY, then WHERE must be surrounded by a FILTER. For instance, to get the number of deals won that had a value greater than 1000, use the following.
SELECT sales_agent,
COUNT(sales_pipeline.close_value) AS total,
COUNT(sales_pipeline.close_value)
FILTER(WHERE(sales_pipeline.close_value > 1000)) AS `over 1000`
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent| Sales_Agent | total | over 1000 |
|---|---|---|
| Anna Snelling | 208 | 68 |
| Boris Faz | 101 | 70 |
| Cassey Cress | 163 | 112 |
| Cecily Lampkin | 107 | 63 |
| Corliss Cosme | 150 | 97 |
| Daniell Hammack | 114 | 91 |
| Darcel Schlecht | 349 | 272 |
| Donn Cantrell | 158 | 106 |
| Elease Gluck | 80 | 32 |
| Garret Kinder | 75 | 46 |
To filter observations after aggregation has occurred, use the HAVING
SELECT sales_agent,
COUNT(sales_pipeline.close_value) AS `number won`
FROM sales_pipeline
WHERE sales_pipeline.deal_stage = "Won"
GROUP BY sales_pipeline.sales_agent
HAVING COUNT(sales_pipeline.close_value) > 200| Sales_Agent | number won |
|---|---|
| Anna Snelling | 208 |
| Darcel Schlecht | 349 |
| Kary Hendrixson | 209 |
| Vicki Laflamme | 221 |
16.6 Set operations in SQL
In the tidyverse, the set operations union, intersect and setdiff find the union, intersection, and set difference respectively of observations that belong to different tables, but have the same variables. The corresponding commands in SQL are UNION, INTERSECT, and MINUS.
Up until now, we have been using SQL queries that only have one SELECT command. Each time the SELECT command is used it creates a table. Set operations can then be used to combine these tables.
For instance, suppose that international accounts were located in one table, and domestic in another. Two SELECT commands to put the data from both tables into the same form, then UNION could be used to combine them.
SELECT intl_accounts.account,
intl_accounts.office_location AS location
FROM intl_accounts
UNION
SELECT accounts.account,
"USA" AS location
FROM accounts| account | location |
|---|---|
| AWOLEX | USA |
| Acme Corporation | USA |
| Betasoloin | USA |
| Betatech | Kenya |
| Betatech | USA |
| Bioholding | Philipines |
| Bioholding | USA |
| Bioplex | USA |
| Blackzim | USA |
| Bluth Company | USA |
Now that the database is not needed, disconnect.
Questions
Suppose the iris data table was in an SQL database
Write an SQL query to find the setosa flower observations with petal length greater than 1.5 cm. Return only the petal length values.
Write an SQL query that first finds the setosa flower observations with petal length greater than 1.5 cm. Return the sample average of the petal length values.
Write an SQL query for the
irisSQL database that does the following: returns the largest sepal length and the species that attains it.Write an SQL query for the
irisSQL database that does the following: returns the smallest sepal length and the species that attains it.
Write an SQL query for the iris SQL database that does the following: for each different species, return the largest sepal length under the variable name max_sepal_length together with the species name.
Suppose the mtcars dataset built into R was an SQL table.
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Write an SQL query that for each of the different gear values, returns the number of gears together with the sample average of the mpg values.
Suppose that a relational database has two tables. The first table Territory has variables State, Level, Capital, and the second table Region has variables State, Mining, and Population.
Write SQL code to perform an inner join between the two tables, giving a report with variables State, Capital, and Population, and using State as the key to join the tables.
Suppose that the relational database includes a table Region has variables State, Mining, and Population. The Mining variable is either high, med, or low.
Write SQL code that accomplishes the following.
Removes all observations where the population is less than one million.
For each of the mining levels, computes the average population.
Reports the
Mininglevel, the average population for that level, and orders in decreasing order of average population.
Suppose the PlantGrowth dataset built in to R was in an SQL database.
| weight | group |
|---|---|
| 4.17 | ctrl |
| 5.58 | ctrl |
| 5.18 | ctrl |
| 6.11 | ctrl |
| 4.50 | ctrl |
| 4.61 | ctrl |
Write an SQL query that finds the unique values that
grouptakes on.Write an SQL query that finds the average value of the petal length.
Write an SQL query that finds the average value of the petal length for each of the three values
grouptakes on.
Suppose that the table Orders has two columns, OrderID and CustomerID. Here OrderID is a key, but since the same customer might make more than one order, CustomerID is not. The whole table has 930 rows, here are the first ten rows.
tribble(
~OrderID, ~CustomerID,
10643, "ALFKI",
10692, "ALFKI",
10702, "ALFKI",
10835, "ALFKI",
10952, "ALFKI",
11011, "ALFKI",
10308, "ANATR",
10625, "ANATR",
10759, "ANATR",
10926, "ANATR",
) |> kable() |> kable_styling()| OrderID | CustomerID |
|---|---|
| 10643 | ALFKI |
| 10692 | ALFKI |
| 10702 | ALFKI |
| 10835 | ALFKI |
| 10952 | ALFKI |
| 11011 | ALFKI |
| 10308 | ANATR |
| 10625 | ANATR |
| 10759 | ANATR |
| 10926 | ANATR |
Write an SQL query for the Orders table where the reported data is the the CustomerID together with the number of orders that customer made. So the first row of the report will be
| CustomerID | Num_orders |
|---|---|
| ALFKI | 6 |
The tables in the Northwind database include Suppliers and Customers. Both these tables include the variables CompanyName and City.
Write an SQL query that creates a report that shows that name of the company and the city where all the suppliers are based.
Write an SQL query that creates a report that shows that name of the company and the city where all the customers are based.
Write a single SQL query that combines your answers from a. and b. into a single report.