Summarizing measurements

Summarizing measurements#

From an SQL-accessible database containing information about experiments, images acquired during the experiments and corresponding measurements, we would like to extract summary statistics.

import pandas as pd
from pandasql import sqldf

experiment = pd.read_csv("../../data/experiments.csv")
image = pd.read_csv("../../data/images.csv")
overview = pd.read_csv("../../data/overview.csv")

query_database = lambda q: sqldf(q, globals())

Example data#

Just a reminder, our overview table contains the key information retrieved from our database about the number of cells counted in images which were acquired during experiments.

query = """
SELECT *
FROM overview
"""

query_database(query)

	Unnamed: 0	experiment_name	Filename	time_after_exposure_in_s	number_of_cells
0	0	High dose	image0.tif	0	96
1	1	High dose	image0.tif	1	47
2	2	High dose	image0.tif	2	32
3	3	High dose	image0.tif	3	24
4	4	High dose	image0.tif	4	19
...	...	...	...	...	...
95	95	Control	image9.tif	5	142
96	96	Control	image9.tif	6	148
97	97	Control	image9.tif	7	144
98	98	Control	image9.tif	8	141
99	99	Control	image9.tif	9	140

100 rows × 5 columns

Grouping rows#

The GROUP BY statement allows to group long tables to shorter ones and it allows us to do descriptive statistics. We can for example determine the smallest and largest number of cells observed during an experiment using the so called aggregators MIN() and MAX().

query = """
SELECT experiment_name, MIN(number_of_cells), MAX(number_of_cells)
FROM overview
GROUP BY experiment_name
"""

query_database(query)

	experiment_name	MIN(number_of_cells)	MAX(number_of_cells)
0	Control	113	159
1	High dose	9	151
2	Medium dose	7	158

Groups can also be formed using multiple columns. Here we also see that it is common to specify columns used for grouping twice, once after GROUP BY and once after SELECT because we want to see the grouping in the table.

query = """
SELECT experiment_name, Filename, MIN(number_of_cells), MAX(number_of_cells)
FROM overview
GROUP BY experiment_name, Filename
"""

query_database(query)

	experiment_name	Filename	MIN(number_of_cells)	MAX(number_of_cells)
0	Control	image6.tif	147	156
1	Control	image7.tif	113	122
2	Control	image8.tif	145	159
3	Control	image9.tif	140	149
4	High dose	image0.tif	9	96
5	High dose	image1.tif	15	151
6	High dose	image2.tif	9	97
7	Medium dose	image3.tif	15	158
8	Medium dose	image4.tif	7	76
9	Medium dose	image5.tif	14	147

A similar strategy uses the COUNT aggregator to count the number of images acquired per experiment.

query = """
SELECT experiment_name, COUNT(Filename)
FROM overview
GROUP BY experiment_name
"""

query_database(query)

	experiment_name	COUNT(Filename)
0	Control	40
1	High dose	30
2	Medium dose	30

Combining statistics#

Assume the measurements in our database result from a timelapse analysis. A quick look into a single image file and corresponding measurements shows that the number of cells is decreasing over time.

query = """
SELECT *
FROM overview
WHERE Filename = "image0.tif"
"""

query_database(query)

	Unnamed: 0	experiment_name	Filename	time_after_exposure_in_s	number_of_cells
0	0	High dose	image0.tif	0	96
1	1	High dose	image0.tif	1	47
2	2	High dose	image0.tif	2	32
3	3	High dose	image0.tif	3	24
4	4	High dose	image0.tif	4	19
5	5	High dose	image0.tif	5	16
6	6	High dose	image0.tif	6	13
7	7	High dose	image0.tif	7	11
8	8	High dose	image0.tif	8	10
9	9	High dose	image0.tif	9	9

We can now determine the average number of cells at the beginning of the experiment by averaging the first 3 timepoints.

query = """
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s < 3
GROUP BY Filename
"""

query_database(query)

	Filename	number_of_cells
0	image0.tif	58.333333
1	image1.tif	93.000000
2	image2.tif	58.000000
3	image3.tif	97.333333
4	image4.tif	46.333333
5	image5.tif	89.666667
6	image6.tif	150.666667
7	image7.tif	113.666667
8	image8.tif	151.000000
9	image9.tif	147.666667

query = """
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s < 3
GROUP BY Filename
"""
cell_count_early = query_database(query)

query = """ 
SELECT Filename, AVG(number_of_cells) as number_of_cells
FROM overview
WHERE time_after_exposure_in_s > 6
GROUP BY Filename;
"""
cell_count_late = query_database(query)

query = """
SELECT cell_count_early.Filename, (cell_count_late.number_of_cells / cell_count_early.number_of_cells) AS cell_count_ratio
FROM cell_count_early
INNER JOIN cell_count_late ON cell_count_early.Filename = cell_count_late.Filename
"""

cell_count_ratio_overview = query_database(query)
cell_count_ratio_overview

	Filename	cell_count_ratio
0	image0.tif	0.171429
1	image1.tif	0.182796
2	image2.tif	0.172414
3	image3.tif	0.171233
4	image4.tif	0.172662
5	image5.tif	0.174721
6	image6.tif	0.986726
7	image7.tif	1.055718
8	image8.tif	0.991170
9	image9.tif	0.959368

We know now the fraction of cells which survived in which file. We need to use the JOIN statement again to know in which experiment these survival ratios were observed.

query = """
SELECT experiment.Comment, cell_count_ratio_overview.cell_count_ratio
FROM cell_count_ratio_overview
INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename
INNER JOIN experiment ON image.experiment_ID = experiment.ID
"""

query_database(query)

	Comment	cell_count_ratio
0	High dose	0.171429
1	High dose	0.182796
2	High dose	0.172414
3	Medium dose	0.171233
4	Medium dose	0.172662
5	Medium dose	0.174721
6	Control	0.986726
7	Control	1.055718
8	Control	0.991170
9	Control	0.959368

Also these can be summarized using the GROUP BY statement. Good scientists do not just report the average of these measurements but also the number of images which have been analyzed.

query = """
SELECT 
    experiment.Comment as experiment_name, 
    AVG(cell_count_ratio_overview.cell_count_ratio) as mean_cell_count_ratio, 
    COUNT(cell_count_ratio_overview.cell_count_ratio) as num_images
FROM cell_count_ratio_overview
INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename
INNER JOIN experiment ON image.experiment_ID = experiment.ID
GROUP BY experiment.Comment
"""

query_database(query)

	experiment_name	mean_cell_count_ratio	num_images
0	Control	0.998246	4
1	High dose	0.175546	3
2	Medium dose	0.172872	3

Exercise#

Determine the average cell count over the entire timelapse for all images individually.

Summarize these measurements for individual images to present the average for each experiment.