Combining tables#

When workging with many tables in a relational database, it often makes sense to combine tables in summary tables and store them before further processing. Just as a reminder, we aquired images during the experiments and corresponding measurements. We now would like to extract an overview table that contains the most important information from image, experiments and measurements.

import pandas as pd
from pandasql import sqldf

We again start by loading our tables from disc.

experiment = pd.read_csv("../../data/experiments.csv")
image = pd.read_csv("../../data/images.csv")
measurement = pd.read_csv("../../data/measurements.csv")

Again, we define a function that can query the database.

query_database = lambda q: sqldf(q, globals())

Inner Join#

SQL also allows combining tables using the JOIN statement. Most common is the INNER JOIN which allows to combine two tables in a way that rows are matched according to a defined connection ON which the tables are combined. The INNER JOIN makes sure that only rows are shown to which both tables have entries. That’s why experiment ID 4 is not shown in the following table. There were no images stored during experiment 4.

query = """
SELECT * 
FROM experiment
INNER JOIN image ON experiment.ID = image.experiment_ID
"""

query_database(query)
Unnamed: 0 ID Comment Concentration Unnamed: 0 ID Filename experiment_ID
0 0 1 High dose 50 0 1 image0.tif 1
1 0 1 High dose 50 1 2 image1.tif 1
2 0 1 High dose 50 2 3 image2.tif 1
3 1 2 Medium dose 20 3 4 image3.tif 2
4 1 2 Medium dose 20 4 5 image4.tif 2
5 1 2 Medium dose 20 5 6 image5.tif 2
6 2 3 Control 0 6 7 image6.tif 3
7 2 3 Control 0 7 8 image7.tif 3
8 2 3 Control 0 8 9 image8.tif 3
9 2 3 Control 0 9 10 image9.tif 3

Outer Join#

If one wanted to combine the tables in a way that also experiment 4 is listed, we need to use an OUTER JOIN. In this case, as the experiments table is on the left of our SQL statement, we use the LEFT OUTER JOIN. Experiment 4 is then listed in our table with some NaNs in the columns of the images.

query = """
SELECT * 
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
"""

query_database(query)
Unnamed: 0 ID Comment Concentration Unnamed: 0 ID Filename experiment_ID
0 0 1 High dose 50 0.0 1.0 image0.tif 1.0
1 0 1 High dose 50 1.0 2.0 image1.tif 1.0
2 0 1 High dose 50 2.0 3.0 image2.tif 1.0
3 1 2 Medium dose 20 3.0 4.0 image3.tif 2.0
4 1 2 Medium dose 20 4.0 5.0 image4.tif 2.0
5 1 2 Medium dose 20 5.0 6.0 image5.tif 2.0
6 2 3 Control 0 6.0 7.0 image6.tif 3.0
7 2 3 Control 0 7.0 8.0 image7.tif 3.0
8 2 3 Control 0 8.0 9.0 image8.tif 3.0
9 2 3 Control 0 9.0 10.0 image9.tif 3.0
10 3 4 Super high dose 1000 NaN NaN None NaN

Combining many tables#

The JOIN statements above can also be used in chains to combine more than two tables. For example, we might be interested in the number of cells over time for every image in every experiment

query = """
SELECT * 
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""

query_database(query)
Unnamed: 0 ID Comment Concentration Unnamed: 0 ID Filename experiment_ID Unnamed: 0 image_ID time_after_exposure_in_s number_of_cells
0 0 1 High dose 50 0 1 image0.tif 1 0 1 0 96
1 0 1 High dose 50 0 1 image0.tif 1 1 1 1 47
2 0 1 High dose 50 0 1 image0.tif 1 2 1 2 32
3 0 1 High dose 50 0 1 image0.tif 1 3 1 3 24
4 0 1 High dose 50 0 1 image0.tif 1 4 1 4 19
... ... ... ... ... ... ... ... ... ... ... ... ...
95 2 3 Control 0 9 10 image9.tif 3 95 10 5 142
96 2 3 Control 0 9 10 image9.tif 3 96 10 6 148
97 2 3 Control 0 9 10 image9.tif 3 97 10 7 144
98 2 3 Control 0 9 10 image9.tif 3 98 10 8 141
99 2 3 Control 0 9 10 image9.tif 3 99 10 9 140

100 rows × 12 columns

Selecting columns#

When a table becomes too wide and has columns we do not really need for our analysis, it is recommended to replace the * in the SELECT statement with column names.

query = """
SELECT experiment.Comment, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""

query_database(query)
Comment Filename time_after_exposure_in_s number_of_cells
0 High dose image0.tif 0 96
1 High dose image0.tif 1 47
2 High dose image0.tif 2 32
3 High dose image0.tif 3 24
4 High dose image0.tif 4 19
... ... ... ... ...
95 Control image9.tif 5 142
96 Control image9.tif 6 148
97 Control image9.tif 7 144
98 Control image9.tif 8 141
99 Control image9.tif 9 140

100 rows × 4 columns

Renaming columns#

Using the AS statement, we can also rename columns and make our new table easier to read and understand.

query = """
SELECT experiment.Comment AS experiment_name, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""

query_database(query)
experiment_name Filename time_after_exposure_in_s number_of_cells
0 High dose image0.tif 0 96
1 High dose image0.tif 1 47
2 High dose image0.tif 2 32
3 High dose image0.tif 3 24
4 High dose image0.tif 4 19
... ... ... ... ...
95 Control image9.tif 5 142
96 Control image9.tif 6 148
97 Control image9.tif 7 144
98 Control image9.tif 8 141
99 Control image9.tif 9 140

100 rows × 4 columns

As this statement is quite complicated and we should not copy past it over and over, we can store it in a pandas DataFrame overview.

query = """
SELECT experiment.Comment AS experiment_name, image.Filename, measurement.time_after_exposure_in_s, measurement.number_of_cells
FROM experiment
LEFT OUTER JOIN image ON experiment.ID = image.experiment_ID
INNER JOIN measurement ON image.ID = measurement.image_ID
"""

overview = query_database(query)

The overview table is part of our database. Note: The overview was not saved to disc yet.

query = """
SELECT * 
FROM overview
"""

query_database(query)
experiment_name Filename time_after_exposure_in_s number_of_cells
0 High dose image0.tif 0 96
1 High dose image0.tif 1 47
2 High dose image0.tif 2 32
3 High dose image0.tif 3 24
4 High dose image0.tif 4 19
... ... ... ... ...
95 Control image9.tif 5 142
96 Control image9.tif 6 148
97 Control image9.tif 7 144
98 Control image9.tif 8 141
99 Control image9.tif 9 140

100 rows × 4 columns

For reuse in later lessons, we save the summary to disc.

overview.to_csv("../../data/overview.csv")