Appending tables#
When processing multiple images, potentially using multiple image processing libraries, a common task is to combine tables.
We start with two small tables of measurements that could have been obtained from different functions or different libraries.
import pandas as pd
table1 = pd.DataFrame({
"label": [1, 2, 3],
"circularity": [0.3, 0.5, 0.7],
"elongation": [2.3, 3.4, 1.2],
})
table1
label | circularity | elongation | |
---|---|---|---|
0 | 1 | 0.3 | 2.3 |
1 | 2 | 0.5 | 3.4 |
2 | 3 | 0.7 | 1.2 |
table2 = pd.DataFrame({
"label": [3, 2, 1, 4],
"area": [22, 32, 25, 18],
"skewness": [0.5, 0.6, 0.3, 0.3],
})
table2
label | area | skewness | |
---|---|---|---|
0 | 3 | 22 | 0.5 |
1 | 2 | 32 | 0.6 |
2 | 1 | 25 | 0.3 |
3 | 4 | 18 | 0.3 |
Combining columns of tables#
According to the pandas documentation there are multiple ways for combining tables. We first use a wrong example to highlight pitfalls when combining tables.
In the following example, measurements of label 1 and 3 are mixed. Furthermore, one of our tables did not contain measurements for label 4.
wrongly_combined_tables = pd.concat([table1, table2], axis=1)
wrongly_combined_tables
label | circularity | elongation | label | area | skewness | |
---|---|---|---|---|---|---|
0 | 1.0 | 0.3 | 2.3 | 3 | 22 | 0.5 |
1 | 2.0 | 0.5 | 3.4 | 2 | 32 | 0.6 |
2 | 3.0 | 0.7 | 1.2 | 1 | 25 | 0.3 |
3 | NaN | NaN | NaN | 4 | 18 | 0.3 |
A better way for combining tables is the merge
command. It allows to explicitly specify on
which column the tables should be combined. Data scientists speak of the ‘index’ or ‘identifier’ of rows in the tables.
correctly_combined_tables1 = pd.merge(table1, table2, how='inner', on='label')
correctly_combined_tables1
label | circularity | elongation | area | skewness | |
---|---|---|---|---|---|
0 | 1 | 0.3 | 2.3 | 25 | 0.3 |
1 | 2 | 0.5 | 3.4 | 32 | 0.6 |
2 | 3 | 0.7 | 1.2 | 22 | 0.5 |
You may note that in the above example, label 4 is missing. We can also get it in the out table by performing an outer join
.
correctly_combined_tables2 = pd.merge(table1, table2, how='outer', on='label')
correctly_combined_tables2
label | circularity | elongation | area | skewness | |
---|---|---|---|---|---|
0 | 1 | 0.3 | 2.3 | 25 | 0.3 |
1 | 2 | 0.5 | 3.4 | 32 | 0.6 |
2 | 3 | 0.7 | 1.2 | 22 | 0.5 |
3 | 4 | NaN | NaN | 18 | 0.3 |
correctly_combined_tables2 = pd.merge(table1, table2, how='right', on='label')
correctly_combined_tables2
label | circularity | elongation | area | skewness | |
---|---|---|---|---|---|
0 | 3 | 0.7 | 1.2 | 22 | 0.5 |
1 | 2 | 0.5 | 3.4 | 32 | 0.6 |
2 | 1 | 0.3 | 2.3 | 25 | 0.3 |
3 | 4 | NaN | NaN | 18 | 0.3 |
Suppose there is a common measurement name from different tables. For example, table3 below also contains “elongation”.
table3 = pd.DataFrame({
"label": [3, 2, 1, 4],
"area": [22, 32, 25, 18],
"skewness": [0.5, 0.6, 0.3, 0.3],
"elongation": [2.3, 3.4, 1.2, 1.1]
})
table3
label | area | skewness | elongation | |
---|---|---|---|---|
0 | 3 | 22 | 0.5 | 2.3 |
1 | 2 | 32 | 0.6 | 3.4 |
2 | 1 | 25 | 0.3 | 1.2 |
3 | 4 | 18 | 0.3 | 1.1 |
Applying merge still preserves both measurements in different columns.
correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label')
correctly_combined_tables3
label | circularity | elongation_x | area | skewness | elongation_y | |
---|---|---|---|---|---|---|
0 | 1 | 0.3 | 2.3 | 25 | 0.3 | 1.2 |
1 | 2 | 0.5 | 3.4 | 32 | 0.6 | 3.4 |
2 | 3 | 0.7 | 1.2 | 22 | 0.5 | 2.3 |
3 | 4 | NaN | NaN | 18 | 0.3 | 1.1 |
We can change ‘x’ and ‘y’ by passing other suffixes.
correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label', suffixes=('_method1', '_method2'))
correctly_combined_tables3
label | circularity | elongation_method1 | area | skewness | elongation_method2 | |
---|---|---|---|---|---|---|
0 | 1 | 0.3 | 2.3 | 25 | 0.3 | 1.2 |
1 | 2 | 0.5 | 3.4 | 32 | 0.6 | 3.4 |
2 | 3 | 0.7 | 1.2 | 22 | 0.5 | 2.3 |
3 | 4 | NaN | NaN | 18 | 0.3 | 1.1 |
Combining measurements from multiple image files#
When applying a workflow to many images, you would get tables with the same column names, but with a variable number of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.
Let’s open two tables generated from applying the same worflow to different files.
df1 = pd.read_csv('../../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.head()
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 |
1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 |
2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 |
3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 |
4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 |
df2 = pd.read_csv('../../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.head()
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
0 | 139 | 96.546763 | 17.504104 | 10.292770 | 1.700621 |
1 | 360 | 86.613889 | 35.746808 | 14.983124 | 2.385805 |
2 | 43 | 91.488372 | 12.967884 | 4.351573 | 2.980045 |
3 | 140 | 73.742857 | 18.940508 | 10.314404 | 1.836316 |
4 | 144 | 89.375000 | 13.639308 | 13.458532 | 1.013432 |
In this particular case where we know we have the same columns, we could concatenate them into a single big table.
big_df = pd.concat([df1, df2], axis=0)
big_df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | |
---|---|---|---|---|---|
0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 |
1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 |
2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 |
3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 |
4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 |
... | ... | ... | ... | ... | ... |
42 | 315 | 91.133333 | 20.927095 | 19.209283 | 1.089426 |
43 | 206 | 94.262136 | 23.381879 | 11.669668 | 2.003646 |
44 | 45 | 68.377778 | 9.406371 | 6.276445 | 1.498678 |
45 | 33 | 76.727273 | 10.724275 | 4.174568 | 2.568955 |
46 | 16 | 76.750000 | 7.745967 | 2.783882 | 2.782433 |
111 rows × 5 columns
The problem is that we lose their source identity. An easy fix for that is to add a new column with the file name before concatenating them. This will facilitate splitting them back and plotting later on.
When we give a single value to a new column, it is assigned to all rows.
df1['file_name'] = 'BBBC007_20P1_POS0007_D_1UL'
df2['file_name'] = 'BBBC007_20P1_POS0010_D_1UL'
big_df = pd.concat([df1, df2], axis=0)
big_df
area | intensity_mean | major_axis_length | minor_axis_length | aspect_ratio | file_name | |
---|---|---|---|---|---|---|
0 | 256 | 93.250000 | 19.995017 | 17.021559 | 1.174688 | BBBC007_20P1_POS0007_D_1UL |
1 | 90 | 82.488889 | 15.939969 | 7.516326 | 2.120713 | BBBC007_20P1_POS0007_D_1UL |
2 | 577 | 90.637782 | 35.324458 | 21.759434 | 1.623409 | BBBC007_20P1_POS0007_D_1UL |
3 | 270 | 95.640741 | 20.229431 | 17.669052 | 1.144908 | BBBC007_20P1_POS0007_D_1UL |
4 | 153 | 84.908497 | 15.683703 | 12.420475 | 1.262730 | BBBC007_20P1_POS0007_D_1UL |
... | ... | ... | ... | ... | ... | ... |
42 | 315 | 91.133333 | 20.927095 | 19.209283 | 1.089426 | BBBC007_20P1_POS0010_D_1UL |
43 | 206 | 94.262136 | 23.381879 | 11.669668 | 2.003646 | BBBC007_20P1_POS0010_D_1UL |
44 | 45 | 68.377778 | 9.406371 | 6.276445 | 1.498678 | BBBC007_20P1_POS0010_D_1UL |
45 | 33 | 76.727273 | 10.724275 | 4.174568 | 2.568955 | BBBC007_20P1_POS0010_D_1UL |
46 | 16 | 76.750000 | 7.745967 | 2.783882 | 2.782433 | BBBC007_20P1_POS0010_D_1UL |
111 rows × 6 columns
Now, we can safely distinguish the source of each row.