Summarizing measurements#
From an SQL-accessible database containing information about experiments, images acquired during the experiments and corresponding measurements, we would like to extract summary statistics.
import pandas as pd
from pandasql import sqldf
experiment = pd.read_csv("../../data/experiments.csv")
image = pd.read_csv("../../data/images.csv")
overview = pd.read_csv("../../data/overview.csv")
query_database = lambda q: sqldf(q, globals())
Example data#
Just a reminder, our overview table contains the key information retrieved from our database about the number of cells counted in images which were acquired during experiments.
query = """
SELECT *
FROM overview
"""
query_database(query)
Unnamed: 0 | experiment_name | Filename | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|---|
0 | 0 | High dose | image0.tif | 0 | 96 |
1 | 1 | High dose | image0.tif | 1 | 47 |
2 | 2 | High dose | image0.tif | 2 | 32 |
3 | 3 | High dose | image0.tif | 3 | 24 |
4 | 4 | High dose | image0.tif | 4 | 19 |
... | ... | ... | ... | ... | ... |
95 | 95 | Control | image9.tif | 5 | 142 |
96 | 96 | Control | image9.tif | 6 | 148 |
97 | 97 | Control | image9.tif | 7 | 144 |
98 | 98 | Control | image9.tif | 8 | 141 |
99 | 99 | Control | image9.tif | 9 | 140 |
100 rows × 5 columns
Grouping rows#
The GROUP BY
statement allows to group long tables to shorter ones and it allows us to do descriptive statistics. We can for example determine the smallest and largest number of cells observed during an experiment using the so called aggregators MIN()
and MAX()
.
query = """
SELECT experiment_name, MIN(number_of_cells), MAX(number_of_cells)
FROM overview
GROUP BY experiment_name
"""
query_database(query)
experiment_name | MIN(number_of_cells) | MAX(number_of_cells) | |
---|---|---|---|
0 | Control | 113 | 159 |
1 | High dose | 9 | 151 |
2 | Medium dose | 7 | 158 |
Groups can also be formed using multiple columns. Here we also see that it is common to specify columns used for grouping twice, once after GROUP BY
and once after SELECT
because we want to see the grouping in the table.
query = """
SELECT experiment_name, Filename, MIN(number_of_cells), MAX(number_of_cells)
FROM overview
GROUP BY experiment_name, Filename
"""
query_database(query)
experiment_name | Filename | MIN(number_of_cells) | MAX(number_of_cells) | |
---|---|---|---|---|
0 | Control | image6.tif | 147 | 156 |
1 | Control | image7.tif | 113 | 122 |
2 | Control | image8.tif | 145 | 159 |
3 | Control | image9.tif | 140 | 149 |
4 | High dose | image0.tif | 9 | 96 |
5 | High dose | image1.tif | 15 | 151 |
6 | High dose | image2.tif | 9 | 97 |
7 | Medium dose | image3.tif | 15 | 158 |
8 | Medium dose | image4.tif | 7 | 76 |
9 | Medium dose | image5.tif | 14 | 147 |
A similar strategy uses the COUNT
aggregator to count the number of images acquired per experiment.
query = """
SELECT experiment_name, COUNT(Filename)
FROM overview
GROUP BY experiment_name
"""
query_database(query)
experiment_name | COUNT(Filename) | |
---|---|---|
0 | Control | 40 |
1 | High dose | 30 |
2 | Medium dose | 30 |
Combining statistics#
Assume the measurements in our database result from a timelapse analysis. A quick look into a single image file and corresponding measurements shows that the number of cells is decreasing over time.
query = """
SELECT *
FROM overview
WHERE Filename = "image0.tif"
"""
query_database(query)
Unnamed: 0 | experiment_name | Filename | time_after_exposure_in_s | number_of_cells | |
---|---|---|---|---|---|
0 | 0 | High dose | image0.tif | 0 | 96 |
1 | 1 | High dose | image0.tif | 1 | 47 |
2 | 2 | High dose | image0.tif | 2 | 32 |
3 | 3 | High dose | image0.tif | 3 | 24 |
4 | 4 | High dose | image0.tif | 4 | 19 |
5 | 5 | High dose | image0.tif | 5 | 16 |
6 | 6 | High dose | image0.tif | 6 | 13 |
7 | 7 | High dose | image0.tif | 7 | 11 |
8 | 8 | High dose | image0.tif | 8 | 10 |
9 | 9 | High dose | image0.tif | 9 | 9 |
We can now determine the average number of cells at the beginning of the experiment by averaging the first 3 timepoints.
query = """
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s < 3
GROUP BY Filename
"""
query_database(query)
Filename | number_of_cells | |
---|---|---|
0 | image0.tif | 58.333333 |
1 | image1.tif | 93.000000 |
2 | image2.tif | 58.000000 |
3 | image3.tif | 97.333333 |
4 | image4.tif | 46.333333 |
5 | image5.tif | 89.666667 |
6 | image6.tif | 150.666667 |
7 | image7.tif | 113.666667 |
8 | image8.tif | 151.000000 |
9 | image9.tif | 147.666667 |
query = """
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s < 3
GROUP BY Filename
"""
cell_count_early = query_database(query)
query = """
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s > 6
GROUP BY Filename;
"""
cell_count_late = query_database(query)
query = """
SELECT cell_count_early.Filename, (cell_count_late.number_of_cells / cell_count_early.number_of_cells) AS cell_count_ratio
FROM cell_count_early
INNER JOIN cell_count_late ON cell_count_early.Filename = cell_count_late.Filename
"""
cell_count_ratio_overview = query_database(query)
cell_count_ratio_overview
Filename | cell_count_ratio | |
---|---|---|
0 | image0.tif | 0.171429 |
1 | image1.tif | 0.182796 |
2 | image2.tif | 0.172414 |
3 | image3.tif | 0.171233 |
4 | image4.tif | 0.172662 |
5 | image5.tif | 0.174721 |
6 | image6.tif | 0.986726 |
7 | image7.tif | 1.055718 |
8 | image8.tif | 0.991170 |
9 | image9.tif | 0.959368 |
We know now the fraction of cells which survived in which file. We need to use the JOIN
statement again to know in which experiment these survival ratios were observed.
query = """
SELECT experiment.Comment, cell_count_ratio_overview.cell_count_ratio
FROM cell_count_ratio_overview
INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename
INNER JOIN experiment ON image.experiment_ID = experiment.ID
"""
query_database(query)
Comment | cell_count_ratio | |
---|---|---|
0 | High dose | 0.171429 |
1 | High dose | 0.182796 |
2 | High dose | 0.172414 |
3 | Medium dose | 0.171233 |
4 | Medium dose | 0.172662 |
5 | Medium dose | 0.174721 |
6 | Control | 0.986726 |
7 | Control | 1.055718 |
8 | Control | 0.991170 |
9 | Control | 0.959368 |
Also these can be summarized using the GROUP BY
statement. Good scientists do not just report the average of these measurements but also the number of images which have been analyzed.
query = """
SELECT
experiment.Comment as experiment_name,
AVG(cell_count_ratio_overview.cell_count_ratio) as mean_cell_count_ratio,
COUNT(cell_count_ratio_overview.cell_count_ratio) as num_images
FROM cell_count_ratio_overview
INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename
INNER JOIN experiment ON image.experiment_ID = experiment.ID
GROUP BY experiment.Comment
"""
query_database(query)
experiment_name | mean_cell_count_ratio | num_images | |
---|---|---|---|
0 | Control | 0.998246 | 4 |
1 | High dose | 0.175546 | 3 |
2 | Medium dose | 0.172872 | 3 |
Exercise#
Determine the average cell count over the entire timelapse for all images individually.
Summarize these measurements for individual images to present the average for each experiment.