{
"cells": [
{
"cell_type": "markdown",
"id": "731039f0-a993-430b-996a-cad1b809ac02",
"metadata": {},
"source": [
"# Summarizing subsets of data"
]
},
{
"cell_type": "markdown",
"id": "f34f881f-7f9b-4e51-9ee7-8c50b877390c",
"metadata": {},
"source": [
"Assume we want to summarize our data, e.g. by splitting it into groups according to filename and computing mean intensity measurements for these groups. This will give us a smaller table with summarized measurements per file.\n",
"\n",
"See also\n",
"* [Split-apply-combine](https://biapol.github.io/Quantitative_Bio_Image_Analysis_with_Python_2022/day3a_Tabular_Data_and_Descriptive_Statistics/07_split_apply_combine.html)\n",
"* [Pandas' Group by](https://pandas.pydata.org/docs/user_guide/groupby.html)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "6ad14083-e065-44e9-a5ef-69db286bfc67",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"id": "f1da98ac-e6f4-4dd1-afc2-4bcc10d54492",
"metadata": {},
"source": [
"To demonstate the example, we load a table which contains shape measurements of many objects that have been segmented in multiple files of the [Broad Bioimage Benchmark Collection BBB0007 dataset from Jones et al., Proc. ICCV Workshop on Computer Vision for Biomedical Image Applications, 2005)](https://bbbc.broadinstitute.org/BBBC007)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "8e6ddf36-5f87-415e-a983-364168dee4c3",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" area | \n",
" intensity_mean | \n",
" major_axis_length | \n",
" minor_axis_length | \n",
" aspect_ratio | \n",
" file_name | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 139 | \n",
" 96.546763 | \n",
" 17.504104 | \n",
" 10.292770 | \n",
" 1.700621 | \n",
" 20P1_POS0010_D_1UL | \n",
"
\n",
" \n",
" 1 | \n",
" 360 | \n",
" 86.613889 | \n",
" 35.746808 | \n",
" 14.983124 | \n",
" 2.385805 | \n",
" 20P1_POS0010_D_1UL | \n",
"
\n",
" \n",
" 2 | \n",
" 43 | \n",
" 91.488372 | \n",
" 12.967884 | \n",
" 4.351573 | \n",
" 2.980045 | \n",
" 20P1_POS0010_D_1UL | \n",
"
\n",
" \n",
" 3 | \n",
" 140 | \n",
" 73.742857 | \n",
" 18.940508 | \n",
" 10.314404 | \n",
" 1.836316 | \n",
" 20P1_POS0010_D_1UL | \n",
"
\n",
" \n",
" 4 | \n",
" 144 | \n",
" 89.375000 | \n",
" 13.639308 | \n",
" 13.458532 | \n",
" 1.013432 | \n",
" 20P1_POS0010_D_1UL | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 106 | \n",
" 305 | \n",
" 88.252459 | \n",
" 20.226532 | \n",
" 19.244210 | \n",
" 1.051045 | \n",
" 20P1_POS0007_D_1UL | \n",
"
\n",
" \n",
" 107 | \n",
" 593 | \n",
" 89.905565 | \n",
" 36.508370 | \n",
" 21.365394 | \n",
" 1.708762 | \n",
" 20P1_POS0007_D_1UL | \n",
"
\n",
" \n",
" 108 | \n",
" 289 | \n",
" 106.851211 | \n",
" 20.427809 | \n",
" 18.221452 | \n",
" 1.121086 | \n",
" 20P1_POS0007_D_1UL | \n",
"
\n",
" \n",
" 109 | \n",
" 277 | \n",
" 100.664260 | \n",
" 20.307965 | \n",
" 17.432920 | \n",
" 1.164920 | \n",
" 20P1_POS0007_D_1UL | \n",
"
\n",
" \n",
" 110 | \n",
" 46 | \n",
" 70.869565 | \n",
" 11.648895 | \n",
" 5.298003 | \n",
" 2.198733 | \n",
" 20P1_POS0007_D_1UL | \n",
"
\n",
" \n",
"
\n",
"
111 rows × 6 columns
\n",
"
"
],
"text/plain": [
" area intensity_mean major_axis_length minor_axis_length aspect_ratio \\\n",
"0 139 96.546763 17.504104 10.292770 1.700621 \n",
"1 360 86.613889 35.746808 14.983124 2.385805 \n",
"2 43 91.488372 12.967884 4.351573 2.980045 \n",
"3 140 73.742857 18.940508 10.314404 1.836316 \n",
"4 144 89.375000 13.639308 13.458532 1.013432 \n",
".. ... ... ... ... ... \n",
"106 305 88.252459 20.226532 19.244210 1.051045 \n",
"107 593 89.905565 36.508370 21.365394 1.708762 \n",
"108 289 106.851211 20.427809 18.221452 1.121086 \n",
"109 277 100.664260 20.307965 17.432920 1.164920 \n",
"110 46 70.869565 11.648895 5.298003 2.198733 \n",
"\n",
" file_name \n",
"0 20P1_POS0010_D_1UL \n",
"1 20P1_POS0010_D_1UL \n",
"2 20P1_POS0010_D_1UL \n",
"3 20P1_POS0010_D_1UL \n",
"4 20P1_POS0010_D_1UL \n",
".. ... \n",
"106 20P1_POS0007_D_1UL \n",
"107 20P1_POS0007_D_1UL \n",
"108 20P1_POS0007_D_1UL \n",
"109 20P1_POS0007_D_1UL \n",
"110 20P1_POS0007_D_1UL \n",
"\n",
"[111 rows x 6 columns]"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv('../../data/BBBC007_analysis.csv')\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "464afbc4-287e-4ebc-9b29-dc80fe4f8805",
"metadata": {},
"source": [
"## Grouping by filename"
]
},
{
"cell_type": "markdown",
"id": "ef04e5ff-02a2-4b72-aab6-e596dcb63eda",
"metadata": {},
"source": [
"We will now group the table by image filename."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "37aaf674-fcc8-4ff1-96e2-216a0135fe10",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"grouped_df = df.groupby('file_name')\n",
"grouped_df"
]
},
{
"cell_type": "markdown",
"id": "00eac35a-50a6-4085-b94e-6c79592c5ef5",
"metadata": {},
"source": [
"From this `grouped_df` object we can derive basic statistics, for example the mean of all numeric columns."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "14983422-5c93-431e-a105-7eb90f739372",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" area | \n",
" intensity_mean | \n",
" major_axis_length | \n",
" minor_axis_length | \n",
" aspect_ratio | \n",
"
\n",
" \n",
" file_name | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 20P1_POS0007_D_1UL | \n",
" 300.859375 | \n",
" 95.889956 | \n",
" 22.015742 | \n",
" 17.132505 | \n",
" 1.316197 | \n",
"
\n",
" \n",
" 20P1_POS0010_D_1UL | \n",
" 253.361702 | \n",
" 96.745373 | \n",
" 20.120268 | \n",
" 15.330923 | \n",
" 1.402934 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" area intensity_mean major_axis_length \\\n",
"file_name \n",
"20P1_POS0007_D_1UL 300.859375 95.889956 22.015742 \n",
"20P1_POS0010_D_1UL 253.361702 96.745373 20.120268 \n",
"\n",
" minor_axis_length aspect_ratio \n",
"file_name \n",
"20P1_POS0007_D_1UL 17.132505 1.316197 \n",
"20P1_POS0010_D_1UL 15.330923 1.402934 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary_df = grouped_df.mean(numeric_only = True)\n",
"summary_df"
]
},
{
"cell_type": "markdown",
"id": "339a3a54-7d67-427b-9a74-c62239040959",
"metadata": {},
"source": [
"The outputted data frame has the mean values of all quantities, including the intensities that we wanted. Note that this data frame has 'filename' as the name of the row index. To convert it back to a normal table with a numeric index columm, we can use the reset_index() method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "103b54c4-24a1-487a-a169-4f4fc535ddf7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" file_name | \n",
" area | \n",
" intensity_mean | \n",
" major_axis_length | \n",
" minor_axis_length | \n",
" aspect_ratio | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 20P1_POS0007_D_1UL | \n",
" 300.859375 | \n",
" 95.889956 | \n",
" 22.015742 | \n",
" 17.132505 | \n",
" 1.316197 | \n",
"
\n",
" \n",
" 1 | \n",
" 20P1_POS0010_D_1UL | \n",
" 253.361702 | \n",
" 96.745373 | \n",
" 20.120268 | \n",
" 15.330923 | \n",
" 1.402934 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" file_name area intensity_mean major_axis_length \\\n",
"0 20P1_POS0007_D_1UL 300.859375 95.889956 22.015742 \n",
"1 20P1_POS0010_D_1UL 253.361702 96.745373 20.120268 \n",
"\n",
" minor_axis_length aspect_ratio \n",
"0 17.132505 1.316197 \n",
"1 15.330923 1.402934 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary_df.reset_index()"
]
},
{
"cell_type": "markdown",
"id": "c01ca9fd-eb54-4ebd-ba73-e0318b2d143d",
"metadata": {},
"source": [
"Note, though, that this was not done in-place. `summary_df` still has an index labeled `round`. If you want to update your table, you have to explicitly do so with an assignment operator."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "199fe5d5-e7ea-4956-be6c-76fece4ae370",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" file_name | \n",
" area | \n",
" intensity_mean | \n",
" major_axis_length | \n",
" minor_axis_length | \n",
" aspect_ratio | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 20P1_POS0007_D_1UL | \n",
" 300.859375 | \n",
" 95.889956 | \n",
" 22.015742 | \n",
" 17.132505 | \n",
" 1.316197 | \n",
"
\n",
" \n",
" 1 | \n",
" 20P1_POS0010_D_1UL | \n",
" 253.361702 | \n",
" 96.745373 | \n",
" 20.120268 | \n",
" 15.330923 | \n",
" 1.402934 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" file_name area intensity_mean major_axis_length \\\n",
"0 20P1_POS0007_D_1UL 300.859375 95.889956 22.015742 \n",
"1 20P1_POS0010_D_1UL 253.361702 96.745373 20.120268 \n",
"\n",
" minor_axis_length aspect_ratio \n",
"0 17.132505 1.316197 \n",
"1 15.330923 1.402934 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"summary_df = summary_df.reset_index()\n",
"summary_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9293eae3-a69b-4caa-b0c0-75434f20e39a",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}