In this section, you will learn about several types of JOIN
statements. Joins are the primary mechanism for combining data from multiple tables. In order to do this, you define the common attribute(s) between tables in order for them to be combined.
You will be able to:
- Write SQL queries that make use of various types of joins
- Compare and contrast the various types of joins
- Discuss how primary and foreign keys are used in SQL
- Decide and perform whichever type of join is best for retrieving desired data
In almost all industry cases, rather than just working with a single table you will generally need data from multiple tables. Doing this requires the use of joins using shared columns from the two tables. For example, here's a diagram of a mock customer relationship management (CRM) database.
As usual, you'll start by connecting to the database.
In the cell below, type the code to import
sqlite
andpandas
with the standard alias. Then in the next cell create a connection to the databasedata.sqlite
and asign it to a variable:
CLICK to Reveal Code
import sqlite3 import pandas as pd
conn = sqlite3.connect('data.sqlite')
# replace this comment with the code to import the libraries
# replace this comment with the code to create a connection to the database data.database
import sqlite3
import pandas as pd
conn = sqlite3.connect('data.sqlite')
Let's say you need to generate a report that includes details about products from orders. To do that, we would need to take data from multiple tables in a single statement. To do this we will use JOIN
.
In the cell below, type the query to select all records from
orderdetails
andproducts
and join them using thier common keyproductCode
and display the first 10.
CLICK to Reveal Code
q = """
SELECT *
FROM orderdetails
JOIN products
ON orderdetails.productCode = products.productCode
LIMIT 10;
"""
pd.read_sql(q, conn)
# replace None with the query to join orderdetails and proucts on productCode
query = None
pd.read_sql(query, conn)
a DataFrame with 10 rows and 14 columns
In the cell below, type the code to select all records from
orderdetails
and display the first10
CLICK to Reveal Code
query = """
SELECT *
FROM orderdetails LIMIT 10;
"""
pd.read_sql(query, conn)
# replace None with the query to display the first 10 records in orderdetails
query = None
pd.read_sql(query, conn)
the first 10 records in orderdetails
In the cell below, type the code to select all records from
products
and display the first10
CLICK to Reveal Code
query = """
SELECT *
FROM products LIMIT 10;
"""
pd.read_sql(query, conn)
# replace None with the query to display the first 10 records in products
query = None
pd.read_sql(query, conn)
the first 10 records in products
A more concise way to join the tables, if the column name is identical, is the USING
clause. Rather then saying ON tableA.column = tableB.column
we can simply say USING(column)
. Again, this only works if the column is identically named for both tables.
In the cell below, type the query to select all records in
orderdetails
andproducts
and join them onproductCode
with theUSING()
clause, and return the first 10 records:
CLICK to Reveal Code
query = """
SELECT *
FROM orderdetails
JOIN products
USING(productCode)
LIMIT 10;
"""
pd.read_sql(query, conn)
# replace None with the query to join orderdetails and proucts on productCode with the using() clause
query = None
pd.read_sql(query, conn)
a DataFrame with 10 rows and 14 columns
You can also assign tables an alias by entering an alternative shorthand name. This is slightly different than the previous lesson where we introduced aliases for column names, since now we are aliasing tables.
When aliasing columns the goal is usually to improve readability by giving something a more specific or easier-to-read name. For example, name AS employee_name
, AVG(AVG) AS average_batting_average
, or COUNT(*) AS num_products
.
When aliasing tables the goal is usually to shorten the name, in order to shorten the overall query. So typically you'll see examples that alias a longer table name to a one-character or two-character shorthand. For example, orderdetails AS od
or products AS p
. (It is also possible to use aliases to clarify what exactly is in a table, like how aliases are used for columns, just less common.)
The following query produces the same result as the previous ones, using aliases od
and p
for orderdetails
and products
, respectively:
In the following cell, type the following code to demonstrate the use of aliasing:
CLICK to Reveal Code
query = """
SELECT *
FROM orderdetails AS od
JOIN products AS p
ON od.productCode = p.productCode
LIMIT 10;
"""
# replace None with the query to demonstrate aliasing
query = None
pd.read_sql(query, conn)
a DataFrame with 10 rows and 14 columns
Note that just like with column aliases, the AS
keyword is optional in SQLite. So, instead of FROM orderdetails AS od
you could write FROM orderdetails od
with the same outcome.
It is somewhat more common to see AS
used with column aliases and skipped with table aliases, but again, you'll want to check the syntax rules of your particular type of SQL as well as style guidelines from your employer to know which syntax to use in a professional setting.
By default a JOIN
is an INNER JOIN
, or the intersection between two tables. In other words, the JOIN
between orders and products is only for productCodes
that are in both the orderdetails
and products
tables. If a product had yet to be ordered (and wasn't in the orderdetails
table) then it would also not be in the result of the JOIN
.
The LEFT JOIN
keyword returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side if there is no match.
There are many other types of joins, displayed below. Of these, SQLite does not support outer joins, but it is good to be aware of as more powerful versions of SQL such as PostgreSQL support these additional functions.
For example, the statement
SELECT * FROM products LEFT JOIN orderdetails
would return all products, even those that hadn't been ordered. You can imagine that all products in inventory should have a description in the product table, but perhaps not every product is represented in the orderdetails table.
In the cell below, type the query to select all records from
products
and join them with all records inorderdetails
onproductcode
usingLEFT JOIN
, then execute the query and store it in a dataframe nameddf
:
CLICK to Reveal Code
query = """ SELECT * FROM products LEFT JOIN orderdetails USING(productCode); """
df = pd.read_sql(query, conn)
# replace this comment with the code to create the specified query
# replace this comment with the code to execute the query and store it in a dataframe named df
print("Number of records returned:", len(df))
print("Number of records where order details are null:", len(df[df.orderNumber.isnull()]))
Number of records returned: 2997
Number of records where order details are null: 1
Let's take a look at the one record that has null values in the order details:
# run this cell with no changes to view the one record with null values
df[df.orderNumber.isnull()]
a dataframe with one row and 14 columns
As you can see, it's a rare occurrence, but there is one product that has yet to be ordered.
Another important consideration when performing joins is to think more about the key or column you are joining on. As you'll see in upcoming lessons, this can lead to interesting behavior if the join value is not unique in one or both of the tables. In all of the above examples, you joined two tables using the primary key. The primary key(s) of a table are those column(s) which uniquely identify a row. You'll also see this designated in our schema diagram with the asterisk (*).
You can also join tables using foreign keys which are not the primary key for that particular table, but rather another table. For example, employeeNumber
is the primary key for the employees table and corresponds to the salesRepEmployeeNumber
of the customers table. In the customers table, salesRepEmployeeNumber
is only a foreign key, and is unlikely to be a unique identifier, as it is likely that an employee serves multiple customers. As such, in the resulting view employeeNumber
would no longer be a unique field.
In the cell below, type the query to join
customers
using the aliasc
withemployees
using the aliase
on the foreign keyssalesTepEmoloyeeNumber
andemployeeNumber
and order the result byemployeeNumber
, then type the code to execute the query:
CLICK to Reveal Code
query = """
SELECT *
FROM customers AS c
JOIN employees AS e
ON c.salesRepEmployeeNumber = e.employeeNumber
ORDER By employeeNumber;
"""
pd.read_sql(query, conn)
# replace None with the query to select the desired records
# replace this comment with the code to execute the query
a section of the df DataFrame with 100 rows and 21 columns
Notice that this also returned both columns: salesRepEmployeeNumber
and employeeNumber
. These columns contain identical values so you would probably actually only want to select one or the other.
In this lesson, you investigated joins. This included implementing the ON
and USING
clauses, aliasing table names, implementing LEFT JOIN
, and using primary vs. foreign keys.