Combining tables#
When working with many tables in a relational database, it often makes sense to combine tables in summary tables and store them before further processing. Just as a reminder, we acquired images
during the experiments
and corresponding measurements
. We now would like to extract an overview
table that contains the most important information from image
, experiments
and measurements
.
import pandas as pd
from pandasql import sqldf
We again start by loading our tables from disc.
experiment = pd.read_csv("../../data/experiments.csv")
image = pd.read_csv("../../data/images.csv")
measurement = pd.read_csv("../../data/measurements.csv")
Again, we define a function that can query the database.
query_database = lambda q: sqldf(q, globals())
Inner Join#
SQL also allows combining tables using the JOIN
statement. Most common is the INNER JOIN
which allows to combine two tables in a way that rows are matched according to a defined connection ON
which the tables are combined. The INNER JOIN
makes sure that only rows are shown to which both tables have entries. That’s why experiment ID 4
is not shown in the following table. There were no images stored during experiment 4.
query = """
SELECT *
FROM experiment
INNER JOIN image ON experiment.ID = image.experiment_ID
"""
query_database(query)
Unnamed: 0 | ID | Comment | Concentration | Unnamed: 0 | ID | Filename | experiment_ID | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | High dose | 50 | 0 | 1 | image0.tif | 1 |
1 | 0 | 1 | High dose | 50 | 1 | 2 | image1.tif | 1 |
2 | 0 | 1 | High dose | 50 | 2 | 3 | image2.tif | 1 |
3 | 1 | 2 | Medium dose | 20 | 3 | 4 | image3.tif | 2 |
4 | 1 | 2 | Medium dose | 20 | 4 | 5 | image4.tif | 2 |
5 | 1 | 2 | Medium dose | 20 | 5 | 6 | image5.tif | 2 |
6 | 2 | 3 | Control | 0 | 6 | 7 | image6.tif | 3 |
7 | 2 | 3 | Control | 0 | 7 | 8 | image7.tif | 3 |
8 | 2 | 3 | Control | 0 | 8 | 9 | image8.tif | 3 |
9 | 2 | 3 | Control | 0 | 9 | 10 | image9.tif | 3 |
Outer Join#
If one wanted to combine the tables in a way that also experiment 4 is listed, we need to use an OUTER JOIN
. In this case, as the experiments table is on the left of our SQL statement, we use the LEFT OUTER JOIN
. Experiment 4 is then listed in our table with some NaNs
in the columns of the images.
query = """
SELECT *
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
"""
query_database(query)
Unnamed: 0 | ID | Comment | Concentration | Unnamed: 0 | ID | Filename | experiment_ID | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | High dose | 50 | 0.0 | 1.0 | image0.tif | 1.0 |
1 | 0 | 1 | High dose | 50 | 1.0 | 2.0 | image1.tif | 1.0 |
2 | 0 | 1 | High dose | 50 | 2.0 | 3.0 | image2.tif | 1.0 |
3 | 1 | 2 | Medium dose | 20 | 3.0 | 4.0 | image3.tif | 2.0 |
4 | 1 | 2 | Medium dose | 20 | 4.0 | 5.0 | image4.tif | 2.0 |
5 | 1 | 2 | Medium dose | 20 | 5.0 | 6.0 | image5.tif | 2.0 |
6 | 2 | 3 | Control | 0 | 6.0 | 7.0 | image6.tif | 3.0 |
7 | 2 | 3 | Control | 0 | 7.0 | 8.0 | image7.tif | 3.0 |
8 | 2 | 3 | Control | 0 | 8.0 | 9.0 | image8.tif | 3.0 |
9 | 2 | 3 | Control | 0 | 9.0 | 10.0 | image9.tif | 3.0 |
10 | 3 | 4 | Super high dose | 1000 | NaN | NaN | None | NaN |
Combining many tables#
The JOIN
statements above can also be used in chains to combine more than two tables. For example, we might be interested in the number of cells over time for every image in every experiment
query = """
SELECT *
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""
query_database(query)
Unnamed: 0 | ID | Comment | Concentration | Unnamed: 0 | ID | Filename | experiment_ID | Unnamed: 0 | image_ID | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | High dose | 50 | 0 | 1 | image0.tif | 1 | 0 | 1 | 0 | 96 |
1 | 0 | 1 | High dose | 50 | 0 | 1 | image0.tif | 1 | 1 | 1 | 1 | 47 |
2 | 0 | 1 | High dose | 50 | 0 | 1 | image0.tif | 1 | 2 | 1 | 2 | 32 |
3 | 0 | 1 | High dose | 50 | 0 | 1 | image0.tif | 1 | 3 | 1 | 3 | 24 |
4 | 0 | 1 | High dose | 50 | 0 | 1 | image0.tif | 1 | 4 | 1 | 4 | 19 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
95 | 2 | 3 | Control | 0 | 9 | 10 | image9.tif | 3 | 95 | 10 | 5 | 142 |
96 | 2 | 3 | Control | 0 | 9 | 10 | image9.tif | 3 | 96 | 10 | 6 | 148 |
97 | 2 | 3 | Control | 0 | 9 | 10 | image9.tif | 3 | 97 | 10 | 7 | 144 |
98 | 2 | 3 | Control | 0 | 9 | 10 | image9.tif | 3 | 98 | 10 | 8 | 141 |
99 | 2 | 3 | Control | 0 | 9 | 10 | image9.tif | 3 | 99 | 10 | 9 | 140 |
100 rows × 12 columns
Selecting columns#
When a table becomes too wide and has columns we do not really need for our analysis, it is recommended to replace the *
in the SELECT
statement with column names.
query = """
SELECT experiment.Comment, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""
query_database(query)
Comment | Filename | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|
0 | High dose | image0.tif | 0 | 96 |
1 | High dose | image0.tif | 1 | 47 |
2 | High dose | image0.tif | 2 | 32 |
3 | High dose | image0.tif | 3 | 24 |
4 | High dose | image0.tif | 4 | 19 |
... | ... | ... | ... | ... |
95 | Control | image9.tif | 5 | 142 |
96 | Control | image9.tif | 6 | 148 |
97 | Control | image9.tif | 7 | 144 |
98 | Control | image9.tif | 8 | 141 |
99 | Control | image9.tif | 9 | 140 |
100 rows × 4 columns
Renaming columns#
Using the AS
statement, we can also rename columns and make our new table easier to read and understand.
query = """
SELECT experiment.Comment AS experiment_name, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""
query_database(query)
experiment_name | Filename | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|
0 | High dose | image0.tif | 0 | 96 |
1 | High dose | image0.tif | 1 | 47 |
2 | High dose | image0.tif | 2 | 32 |
3 | High dose | image0.tif | 3 | 24 |
4 | High dose | image0.tif | 4 | 19 |
... | ... | ... | ... | ... |
95 | Control | image9.tif | 5 | 142 |
96 | Control | image9.tif | 6 | 148 |
97 | Control | image9.tif | 7 | 144 |
98 | Control | image9.tif | 8 | 141 |
99 | Control | image9.tif | 9 | 140 |
100 rows × 4 columns
As this statement is quite complicated and we should not copy past it over and over, we can store it in a pandas DataFrame overview
.
query = """
SELECT experiment.Comment AS experiment_name, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""
overview = query_database(query)
The overview
table is part of our database. Note: The overview was not saved to disc yet.
query = """
SELECT *
FROM overview
"""
query_database(query)
experiment_name | Filename | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|
0 | High dose | image0.tif | 0 | 96 |
1 | High dose | image0.tif | 1 | 47 |
2 | High dose | image0.tif | 2 | 32 |
3 | High dose | image0.tif | 3 | 24 |
4 | High dose | image0.tif | 4 | 19 |
... | ... | ... | ... | ... |
95 | Control | image9.tif | 5 | 142 |
96 | Control | image9.tif | 6 | 148 |
97 | Control | image9.tif | 7 | 144 |
98 | Control | image9.tif | 8 | 141 |
99 | Control | image9.tif | 9 | 140 |
100 rows × 4 columns
For reuse in later lessons, we save the summary to disc.
overview.to_csv("../../data/overview.csv")