{ "cells": [ { "cell_type": "markdown", "id": "85b63768-dc58-4f62-8297-1a90c26f5c36", "metadata": {}, "source": [ "## Summarizing measurements\n", "\n", "From an SQL-accessible database containing information about experiments, images acquired during the experiments and corresponding measurements, we would like to extract summary statistics." ] }, { "cell_type": "code", "execution_count": 1, "id": "f9f790d9-22b3-4160-9785-75aa4c661c1d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from pandasql import sqldf" ] }, { "cell_type": "code", "execution_count": 2, "id": "805125bc-d88c-4a4b-9a92-001e2ff098e6", "metadata": {}, "outputs": [], "source": [ "experiment = pd.read_csv(\"../../data/experiments.csv\")\n", "image = pd.read_csv(\"../../data/images.csv\")\n", "overview = pd.read_csv(\"../../data/overview.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "8580ae00-7a1d-4239-b9cc-946bfc9bf705", "metadata": {}, "outputs": [], "source": [ "query_database = lambda q: sqldf(q, globals())" ] }, { "cell_type": "markdown", "id": "ba23795e-9f8e-4e08-a719-4e708514443b", "metadata": {}, "source": [ "## Example data\n", "\n", "Just a reminder, our overview table contains the key information retrieved from our database about the number of cells counted in images which were acquired during experiments." ] }, { "cell_type": "code", "execution_count": 4, "id": "8db92849-9040-4b7b-8b9d-5c19c0dc08b3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0experiment_nameFilenametime_after_exposure_in_snumber_of_cells
00High doseimage0.tif096
11High doseimage0.tif147
22High doseimage0.tif232
33High doseimage0.tif324
44High doseimage0.tif419
..................
9595Controlimage9.tif5142
9696Controlimage9.tif6148
9797Controlimage9.tif7144
9898Controlimage9.tif8141
9999Controlimage9.tif9140
\n", "

100 rows × 5 columns

\n", "
" ], "text/plain": [ " Unnamed: 0 experiment_name Filename time_after_exposure_in_s \\\n", "0 0 High dose image0.tif 0 \n", "1 1 High dose image0.tif 1 \n", "2 2 High dose image0.tif 2 \n", "3 3 High dose image0.tif 3 \n", "4 4 High dose image0.tif 4 \n", ".. ... ... ... ... \n", "95 95 Control image9.tif 5 \n", "96 96 Control image9.tif 6 \n", "97 97 Control image9.tif 7 \n", "98 98 Control image9.tif 8 \n", "99 99 Control image9.tif 9 \n", "\n", " number_of_cells \n", "0 96 \n", "1 47 \n", "2 32 \n", "3 24 \n", "4 19 \n", ".. ... \n", "95 142 \n", "96 148 \n", "97 144 \n", "98 141 \n", "99 140 \n", "\n", "[100 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT *\n", "FROM overview\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "3c3e5fa2-5aa8-4d59-b137-5317e9054bc5", "metadata": {}, "source": [ "## Grouping rows\n", "The `GROUP BY` statement allows to group long tables to shorter ones and it allows us to do descriptive statistics. We can for example determine the smallest and largest number of cells observed during an experiment using the so called aggregators `MIN()` and `MAX()`." ] }, { "cell_type": "code", "execution_count": 5, "id": "57f5aa4c-31c0-4a5d-8605-b23b0c62679e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
experiment_nameMIN(number_of_cells)MAX(number_of_cells)
0Control113159
1High dose9151
2Medium dose7158
\n", "
" ], "text/plain": [ " experiment_name MIN(number_of_cells) MAX(number_of_cells)\n", "0 Control 113 159\n", "1 High dose 9 151\n", "2 Medium dose 7 158" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT experiment_name, MIN(number_of_cells), MAX(number_of_cells)\n", "FROM overview\n", "GROUP BY experiment_name\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "0949d398-13a4-4835-b326-5ecd4ba80a6b", "metadata": {}, "source": [ "Groups can also be formed using multiple columns. Here we also see that it is common to specify columns used for grouping twice, once after `GROUP BY` and once after `SELECT` because we want to see the grouping in the table." ] }, { "cell_type": "code", "execution_count": 6, "id": "7e856deb-b824-4a8f-b0cd-4daa1a4a9824", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
experiment_nameFilenameMIN(number_of_cells)MAX(number_of_cells)
0Controlimage6.tif147156
1Controlimage7.tif113122
2Controlimage8.tif145159
3Controlimage9.tif140149
4High doseimage0.tif996
5High doseimage1.tif15151
6High doseimage2.tif997
7Medium doseimage3.tif15158
8Medium doseimage4.tif776
9Medium doseimage5.tif14147
\n", "
" ], "text/plain": [ " experiment_name Filename MIN(number_of_cells) MAX(number_of_cells)\n", "0 Control image6.tif 147 156\n", "1 Control image7.tif 113 122\n", "2 Control image8.tif 145 159\n", "3 Control image9.tif 140 149\n", "4 High dose image0.tif 9 96\n", "5 High dose image1.tif 15 151\n", "6 High dose image2.tif 9 97\n", "7 Medium dose image3.tif 15 158\n", "8 Medium dose image4.tif 7 76\n", "9 Medium dose image5.tif 14 147" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT experiment_name, Filename, MIN(number_of_cells), MAX(number_of_cells)\n", "FROM overview\n", "GROUP BY experiment_name, Filename\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "b07b09a7-0f7e-4198-a971-6dd8c43d8d5b", "metadata": {}, "source": [ "A similar strategy uses the `COUNT` aggregator to count the number of images acquired per experiment." ] }, { "cell_type": "code", "execution_count": 7, "id": "c592c42b-e90a-4318-9407-128d6f4d9055", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
experiment_nameCOUNT(Filename)
0Control40
1High dose30
2Medium dose30
\n", "
" ], "text/plain": [ " experiment_name COUNT(Filename)\n", "0 Control 40\n", "1 High dose 30\n", "2 Medium dose 30" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT experiment_name, COUNT(Filename)\n", "FROM overview\n", "GROUP BY experiment_name\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "b4c24928-ffee-4c62-898c-a913041fe0f2", "metadata": {}, "source": [ "## Combining statistics\n", "\n", "Assume the measurements in our database result from a timelapse analysis. A quick look into a single image file and corresponding measurements shows that the number of cells is decreasing over time." ] }, { "cell_type": "code", "execution_count": 8, "id": "7a017ff6-1cf3-45ef-9b66-09c5a07679da", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0experiment_nameFilenametime_after_exposure_in_snumber_of_cells
00High doseimage0.tif096
11High doseimage0.tif147
22High doseimage0.tif232
33High doseimage0.tif324
44High doseimage0.tif419
55High doseimage0.tif516
66High doseimage0.tif613
77High doseimage0.tif711
88High doseimage0.tif810
99High doseimage0.tif99
\n", "
" ], "text/plain": [ " Unnamed: 0 experiment_name Filename time_after_exposure_in_s \\\n", "0 0 High dose image0.tif 0 \n", "1 1 High dose image0.tif 1 \n", "2 2 High dose image0.tif 2 \n", "3 3 High dose image0.tif 3 \n", "4 4 High dose image0.tif 4 \n", "5 5 High dose image0.tif 5 \n", "6 6 High dose image0.tif 6 \n", "7 7 High dose image0.tif 7 \n", "8 8 High dose image0.tif 8 \n", "9 9 High dose image0.tif 9 \n", "\n", " number_of_cells \n", "0 96 \n", "1 47 \n", "2 32 \n", "3 24 \n", "4 19 \n", "5 16 \n", "6 13 \n", "7 11 \n", "8 10 \n", "9 9 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT *\n", "FROM overview\n", "WHERE Filename = \"image0.tif\"\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "beb4fa47-1bd5-498e-b69d-832a143fb77d", "metadata": {}, "source": [ "We can now determine the average number of cells at the beginning of the experiment by averaging the first 3 timepoints." ] }, { "cell_type": "code", "execution_count": 9, "id": "c8193b5c-485c-4a8b-a4ca-6a572c860237", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Filenamenumber_of_cells
0image0.tif58.333333
1image1.tif93.000000
2image2.tif58.000000
3image3.tif97.333333
4image4.tif46.333333
5image5.tif89.666667
6image6.tif150.666667
7image7.tif113.666667
8image8.tif151.000000
9image9.tif147.666667
\n", "
" ], "text/plain": [ " Filename number_of_cells\n", "0 image0.tif 58.333333\n", "1 image1.tif 93.000000\n", "2 image2.tif 58.000000\n", "3 image3.tif 97.333333\n", "4 image4.tif 46.333333\n", "5 image5.tif 89.666667\n", "6 image6.tif 150.666667\n", "7 image7.tif 113.666667\n", "8 image8.tif 151.000000\n", "9 image9.tif 147.666667" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT Filename, AVG(number_of_cells) as number_of_cells\n", "FROM overview\n", "WHERE time_after_exposure_in_s < 3\n", "GROUP BY Filename\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "code", "execution_count": 10, "id": "2805d88b-7b2b-48bf-9d79-3f4896199990", "metadata": {}, "outputs": [], "source": [ "query = \"\"\"\n", "SELECT Filename, AVG(number_of_cells) as number_of_cells\n", "FROM overview\n", "WHERE time_after_exposure_in_s < 3\n", "GROUP BY Filename\n", "\"\"\"\n", "cell_count_early = query_database(query)" ] }, { "cell_type": "code", "execution_count": 11, "id": "86b4e49f-a9ec-4fb7-9ce1-0a43cc40feb2", "metadata": {}, "outputs": [], "source": [ "query = \"\"\" \n", "SELECT Filename, AVG(number_of_cells) as number_of_cells\n", "FROM overview\n", "WHERE time_after_exposure_in_s > 6\n", "GROUP BY Filename;\n", "\"\"\"\n", "cell_count_late = query_database(query)" ] }, { "cell_type": "code", "execution_count": 12, "id": "ae17ddb7-4fa8-4eef-a79a-ff3562e989c2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Filenamecell_count_ratio
0image0.tif0.171429
1image1.tif0.182796
2image2.tif0.172414
3image3.tif0.171233
4image4.tif0.172662
5image5.tif0.174721
6image6.tif0.986726
7image7.tif1.055718
8image8.tif0.991170
9image9.tif0.959368
\n", "
" ], "text/plain": [ " Filename cell_count_ratio\n", "0 image0.tif 0.171429\n", "1 image1.tif 0.182796\n", "2 image2.tif 0.172414\n", "3 image3.tif 0.171233\n", "4 image4.tif 0.172662\n", "5 image5.tif 0.174721\n", "6 image6.tif 0.986726\n", "7 image7.tif 1.055718\n", "8 image8.tif 0.991170\n", "9 image9.tif 0.959368" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT cell_count_early.Filename, (cell_count_late.number_of_cells / cell_count_early.number_of_cells) AS cell_count_ratio\n", "FROM cell_count_early\n", "INNER JOIN cell_count_late ON cell_count_early.Filename = cell_count_late.Filename\n", "\"\"\"\n", "\n", "cell_count_ratio_overview = query_database(query)\n", "cell_count_ratio_overview" ] }, { "cell_type": "markdown", "id": "2701323c-8ef0-459a-b10f-6a15b133eccb", "metadata": {}, "source": [ "We know now the fraction of cells which survived in which file. We need to use the `JOIN` statement again to know in which experiment these survival ratios were observed." ] }, { "cell_type": "code", "execution_count": 13, "id": "bf78f4fc-0783-4f39-b0fd-d2376a2f0a8f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Commentcell_count_ratio
0High dose0.171429
1High dose0.182796
2High dose0.172414
3Medium dose0.171233
4Medium dose0.172662
5Medium dose0.174721
6Control0.986726
7Control1.055718
8Control0.991170
9Control0.959368
\n", "
" ], "text/plain": [ " Comment cell_count_ratio\n", "0 High dose 0.171429\n", "1 High dose 0.182796\n", "2 High dose 0.172414\n", "3 Medium dose 0.171233\n", "4 Medium dose 0.172662\n", "5 Medium dose 0.174721\n", "6 Control 0.986726\n", "7 Control 1.055718\n", "8 Control 0.991170\n", "9 Control 0.959368" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT experiment.Comment, cell_count_ratio_overview.cell_count_ratio\n", "FROM cell_count_ratio_overview\n", "INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename\n", "INNER JOIN experiment ON image.experiment_ID = experiment.ID\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "aa20382d-11f4-4d74-948d-dd3397eaa0c0", "metadata": {}, "source": [ "Also these can be summarized using the `GROUP BY` statement. Good scientists do not just report the average of these measurements but also the number of images which have been analyzed." ] }, { "cell_type": "code", "execution_count": 14, "id": "3294fd6a-14f2-46f5-8436-6d5079fed30c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
experiment_namemean_cell_count_rationum_images
0Control0.9982464
1High dose0.1755463
2Medium dose0.1728723
\n", "
" ], "text/plain": [ " experiment_name mean_cell_count_ratio num_images\n", "0 Control 0.998246 4\n", "1 High dose 0.175546 3\n", "2 Medium dose 0.172872 3" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query = \"\"\"\n", "SELECT \n", " experiment.Comment as experiment_name, \n", " AVG(cell_count_ratio_overview.cell_count_ratio) as mean_cell_count_ratio, \n", " COUNT(cell_count_ratio_overview.cell_count_ratio) as num_images\n", "FROM cell_count_ratio_overview\n", "INNER JOIN image ON cell_count_ratio_overview.Filename = image.Filename\n", "INNER JOIN experiment ON image.experiment_ID = experiment.ID\n", "GROUP BY experiment.Comment\n", "\"\"\"\n", "\n", "query_database(query)" ] }, { "cell_type": "markdown", "id": "94d78d2f-444c-4dc4-b968-e1c960d3a40d", "metadata": {}, "source": [ "## Exercise\n", "\n", "Determine the average cell count over the entire timelapse for all images individually." ] }, { "cell_type": "code", "execution_count": null, "id": "cdebfca7-6763-472e-a7f7-0e210327d02f", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "8a87efed-deeb-4e50-a983-f2f2f98bbae7", "metadata": {}, "source": [ "Summarize these measurements for individial images to present the average for each experiment." ] }, { "cell_type": "code", "execution_count": null, "id": "6e04c721-174c-480f-a146-cf6426b717fb", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }