Summarizing measurements#

From an SQL-accessible database containing information about experiments, images acquired during the experiments and corresponding measurements, we would like to extract summary statistics.

import pandas as pd
from pandasql import sqldf
experiment = pd.read_csv("../../data/experiments.csv")
image = pd.read_csv("../../data/images.csv")
overview = pd.read_csv("../../data/overview.csv")
query_database = lambda q: sqldf(q, globals())

Example data#

Just a reminder, our overview table contains the key information retrieved from our database about the number of cells counted in images which were acquired during experiments.

query = """
SELECT *
FROM overview
"""

query_database(query)
Unnamed: 0 experiment_name Filename time_after_exposure_in_s number_of_cells
0 0 High dose image0.tif 0 96
1 1 High dose image0.tif 1 47
2 2 High dose image0.tif 2 32
3 3 High dose image0.tif 3 24
4 4 High dose image0.tif 4 19
... ... ... ... ... ...
95 95 Control image9.tif 5 142
96 96 Control image9.tif 6 148
97 97 Control image9.tif 7 144
98 98 Control image9.tif 8 141
99 99 Control image9.tif 9 140

100 rows × 5 columns

Grouping rows#

The GROUP BY statement allows to group long tables to shorter ones and it allows us to do descriptive statistics. We can for example determine the smallest and largest number of cells observed during an experiment using the so called aggregators MIN() and MAX().

query = """
SELECT experiment_name, MIN(number_of_cells), MAX(number_of_cells)
FROM overview
GROUP BY experiment_name
"""

query_database(query)
experiment_name MIN(number_of_cells) MAX(number_of_cells)
0 Control 113 159
1 High dose 9 151
2 Medium dose 7 158

Groups can also be formed using multiple columns. Here we also see that it is common to specify columns used for grouping twice, once after GROUP BY and once after SELECT because we want to see the grouping in the table.

query = """
SELECT experiment_name, Filename, MIN(number_of_cells), MAX(number_of_cells)
FROM overview
GROUP BY experiment_name, Filename
"""

query_database(query)
experiment_name Filename MIN(number_of_cells) MAX(number_of_cells)
0 Control image6.tif 147 156
1 Control image7.tif 113 122
2 Control image8.tif 145 159
3 Control image9.tif 140 149
4 High dose image0.tif 9 96
5 High dose image1.tif 15 151
6 High dose image2.tif 9 97
7 Medium dose image3.tif 15 158
8 Medium dose image4.tif 7 76
9 Medium dose image5.tif 14 147

A similar strategy uses the COUNT aggregator to count the number of images acquired per experiment.

query = """
SELECT experiment_name, COUNT(Filename)
FROM overview
GROUP BY experiment_name
"""

query_database(query)
experiment_name COUNT(Filename)
0 Control 40
1 High dose 30
2 Medium dose 30

Combining statistics#

Assume the measurements in our database result from a timelapse analysis. A quick look into a single image file and corresponding measurements shows that the number of cells is decreasing over time.

query = """
SELECT *
FROM overview
WHERE Filename = "image0.tif"
"""

query_database(query)
Unnamed: 0 experiment_name Filename time_after_exposure_in_s number_of_cells
0 0 High dose image0.tif 0 96
1 1 High dose image0.tif 1 47
2 2 High dose image0.tif 2 32
3 3 High dose image0.tif 3 24
4 4 High dose image0.tif 4 19
5 5 High dose image0.tif 5 16
6 6 High dose image0.tif 6 13
7 7 High dose image0.tif 7 11
8 8 High dose image0.tif 8 10
9 9 High dose image0.tif 9 9

We can now determine the average number of cells at the beginning of the experiment by averaging the first 3 timepoints.

query = """
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s < 3
GROUP BY Filename
"""

query_database(query)
Filename number_of_cells
0 image0.tif 58.333333
1 image1.tif 93.000000
2 image2.tif 58.000000
3 image3.tif 97.333333
4 image4.tif 46.333333
5 image5.tif 89.666667
6 image6.tif 150.666667
7 image7.tif 113.666667
8 image8.tif 151.000000
9 image9.tif 147.666667
query = """
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s < 3
GROUP BY Filename
"""
cell_count_early = query_database(query)
query = """ 
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s > 6
GROUP BY Filename;
"""
cell_count_late = query_database(query)
query = """
SELECT cell_count_early.Filename, (cell_count_late.number_of_cells / cell_count_early.number_of_cells) AS cell_count_ratio
FROM cell_count_early
INNER JOIN cell_count_late ON cell_count_early.Filename = cell_count_late.Filename
"""

cell_count_ratio_overview = query_database(query)
cell_count_ratio_overview
Filename cell_count_ratio
0 image0.tif 0.171429
1 image1.tif 0.182796
2 image2.tif 0.172414
3 image3.tif 0.171233
4 image4.tif 0.172662
5 image5.tif 0.174721
6 image6.tif 0.986726
7 image7.tif 1.055718
8 image8.tif 0.991170
9 image9.tif 0.959368

We know now the fraction of cells which survived in which file. We need to use the JOIN statement again to know in which experiment these survival ratios were observed.

query = """
SELECT experiment.Comment, cell_count_ratio_overview.cell_count_ratio
FROM cell_count_ratio_overview
INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename
INNER JOIN experiment ON image.experiment_ID = experiment.ID
"""

query_database(query)
Comment cell_count_ratio
0 High dose 0.171429
1 High dose 0.182796
2 High dose 0.172414
3 Medium dose 0.171233
4 Medium dose 0.172662
5 Medium dose 0.174721
6 Control 0.986726
7 Control 1.055718
8 Control 0.991170
9 Control 0.959368

Also these can be summarized using the GROUP BY statement. Good scientists do not just report the average of these measurements but also the number of images which have been analyzed.

query = """
SELECT 
    experiment.Comment as experiment_name, 
    AVG(cell_count_ratio_overview.cell_count_ratio) as mean_cell_count_ratio, 
    COUNT(cell_count_ratio_overview.cell_count_ratio) as num_images
FROM cell_count_ratio_overview
INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename
INNER JOIN experiment ON image.experiment_ID = experiment.ID
GROUP BY experiment.Comment
"""

query_database(query)
experiment_name mean_cell_count_ratio num_images
0 Control 0.998246 4
1 High dose 0.175546 3
2 Medium dose 0.172872 3

Exercise#

Determine the average cell count over the entire timelapse for all images individually.

Summarize these measurements for individual images to present the average for each experiment.