Summarizing subsets of data
Assume we want to summarize our data, e.g. by splitting it into groups according to filename and computing mean intensity measurements for these groups. This will give us a smaller table with summarized measurements per file.
import pandas as pd
import numpy as np
To demonstrate this, we load a table containing shape measurements of many objects that have been segmented in multiple files of the Broad Bioimage Benchmark Collection BBBC007 dataset (Jones et al., Proc. ICCV Workshop on Computer Vision for Biomedical Image Applications, 2005).
df = pd.read_csv('../../data/BBBC007_analysis.csv')
df
| | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name |
|---|---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 | 20P1_POS0010_D_1UL |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 | 20P1_POS0010_D_1UL |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 | 20P1_POS0010_D_1UL |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 | 20P1_POS0010_D_1UL |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 | 20P1_POS0010_D_1UL |
... | ... | ... | ... | ... | ... | ... |
106 | 305 | 88.252459 | 20.226532 | 19.244210 | 1.051045 | 20P1_POS0007_D_1UL |
107 | 593 | 89.905565 | 36.508370 | 21.365394 | 1.708762 | 20P1_POS0007_D_1UL |
108 | 289 | 106.851211 | 20.427809 | 18.221452 | 1.121086 | 20P1_POS0007_D_1UL |
109 | 277 | 100.664260 | 20.307965 | 17.432920 | 1.164920 | 20P1_POS0007_D_1UL |
110 | 46 | 70.869565 | 11.648895 | 5.298003 | 2.198733 | 20P1_POS0007_D_1UL |
111 rows × 6 columns
Grouping by filename
We will now group the table by image filename.
grouped_df = df.groupby('file_name')
grouped_df
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002DC95CF2520>
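The DataFrameGroupBy object printed above does not contain any results yet; it only records which rows belong to which group. As a minimal sketch, assuming a small made-up table (toy_df and its values are invented for illustration and are not taken from the BBBC007 data), the groups can be inspected before any statistic is computed:

```python
import pandas as pd

# Hypothetical toy table; all values are invented for illustration only
toy_df = pd.DataFrame({
    "file_name": ["image_a", "image_a", "image_b", "image_b", "image_b"],
    "area": [100, 200, 50, 150, 250],
    "intensity_mean": [90.0, 100.0, 80.0, 100.0, 120.0],
})

toy_grouped = toy_df.groupby("file_name")

# Number of rows (here: segmented objects) in each group
group_sizes = toy_grouped.size()
print(group_sizes)

# Iterating over the grouped object yields (group name, sub-table) pairs
for name, sub_table in toy_grouped:
    print(name, len(sub_table))
```

Checking group sizes like this can be a quick sanity check that every file contributed the expected number of objects.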
From this grouped_df object we can derive basic statistics, for example the mean of all numeric columns.
summary_df = grouped_df.mean(numeric_only=True)
summary_df
file_name | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
---|---|---|---|---|---|
20P1_POS0007_D_1UL | 300.859375 | 95.889956 | 22.015742 | 17.132505 | 1.316197 |
20P1_POS0010_D_1UL | 253.361702 | 96.745373 | 20.120268 | 15.330923 | 1.402934 |
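If only one measurement is of interest, such as the mean intensity per file, a single column can be selected from the grouped object before aggregating. The following is a sketch under the assumption of a small invented table (toy_df is hypothetical, not the BBBC007 data); it also shows agg(), which computes several statistics in one call:

```python
import pandas as pd

# Hypothetical toy table; all values are invented for illustration only
toy_df = pd.DataFrame({
    "file_name": ["image_a", "image_a", "image_b", "image_b", "image_b"],
    "area": [100, 200, 50, 150, 250],
    "intensity_mean": [90.0, 100.0, 80.0, 100.0, 120.0],
})

# Mean of a single column per file (returns a pandas Series)
mean_intensity = toy_df.groupby("file_name")["intensity_mean"].mean()
print(mean_intensity)

# Several statistics at once for the same column (returns a DataFrame)
intensity_stats = toy_df.groupby("file_name")["intensity_mean"].agg(
    ["mean", "std", "count"]
)
print(intensity_stats)
```

The remainder of this section continues with summary_df, the table of means over all numeric columns.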
The data frame summary_df contains the mean values of all numeric quantities, including the intensities that we wanted. Note that this data frame has 'file_name' as the name of the row index. To convert it back to a normal table with a numeric index column, we can use the reset_index() method.
summary_df.reset_index()
| | file_name | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|
0 | 20P1_POS0007_D_1UL | 300.859375 | 95.889956 | 22.015742 | 17.132505 | 1.316197 |
1 | 20P1_POS0010_D_1UL | 253.361702 | 96.745373 | 20.120268 | 15.330923 | 1.402934 |
Note, though, that this was not done in-place. summary_df still has an index labeled file_name. If you want to update your table, you have to explicitly do so with an assignment operator.
summary_df = summary_df.reset_index()
summary_df
| | file_name | area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio |
|---|---|---|---|---|---|---|
0 | 20P1_POS0007_D_1UL | 300.859375 | 95.889956 | 22.015742 | 17.132505 | 1.316197 |
1 | 20P1_POS0010_D_1UL | 253.361702 | 96.745373 | 20.120268 | 15.330923 | 1.402934 |
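If the grouping column should remain a regular column from the start, groupby() also accepts the argument as_index=False, which makes the separate reset_index() call unnecessary. This is an alternative to the approach shown above, not what the original example does, and the toy table below is again invented for illustration:

```python
import pandas as pd

# Hypothetical toy table; all values are invented for illustration only
toy_df = pd.DataFrame({
    "file_name": ["image_a", "image_a", "image_b", "image_b", "image_b"],
    "area": [100, 200, 50, 150, 250],
    "intensity_mean": [90.0, 100.0, 80.0, 100.0, 120.0],
})

# as_index=False keeps 'file_name' as a column and gives the result a numeric index
toy_summary = toy_df.groupby("file_name", as_index=False).mean(numeric_only=True)
print(toy_summary)
```

Which variant to prefer is largely a matter of taste: reset_index() is convenient when a grouped result already exists, while as_index=False avoids the intermediate indexed table.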