Appending tables#

When processing multiple images, potentially using multiple image processing libraries, a common task is to combine tables.

We start with two small tables of measurements that could have been obtained from different functions or different libraries.

import pandas as pd
table1 = pd.DataFrame({
    "label":       [1,   2,   3],
    "circularity": [0.3, 0.5, 0.7],
    "elongation":  [2.3, 3.4, 1.2],
    })
table1
label circularity elongation
0 1 0.3 2.3
1 2 0.5 3.4
2 3 0.7 1.2
table2 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    })
table2
label area skewness
0 3 22 0.5
1 2 32 0.6
2 1 25 0.3
3 4 18 0.3

Combining columns of tables#

According to the pandas documentation there are multiple ways for combining tables. We first use a wrong example to highlight pitfalls when combining tables.

In the following example, measurements of label 1 and 3 are mixed. Furthermore, one of our tables did not contain measurements for label 4.

wrongly_combined_tables = pd.concat([table1, table2], axis=1)
wrongly_combined_tables
label circularity elongation label area skewness
0 1.0 0.3 2.3 3 22 0.5
1 2.0 0.5 3.4 2 32 0.6
2 3.0 0.7 1.2 1 25 0.3
3 NaN NaN NaN 4 18 0.3

A better way for combining tables is the merge command. It allows to explicitly specify on which column the tables should be combined. Data scientists speak of the ‘index’ or ‘identifier’ of rows in the tables.

correctly_combined_tables1 = pd.merge(table1, table2, how='inner', on='label')
correctly_combined_tables1
label circularity elongation area skewness
0 1 0.3 2.3 25 0.3
1 2 0.5 3.4 32 0.6
2 3 0.7 1.2 22 0.5

You may note that in the above example, label 4 is missing. We can also get it in the out table by performing an outer join.

correctly_combined_tables2 = pd.merge(table1, table2, how='outer', on='label')
correctly_combined_tables2
label circularity elongation area skewness
0 1 0.3 2.3 25 0.3
1 2 0.5 3.4 32 0.6
2 3 0.7 1.2 22 0.5
3 4 NaN NaN 18 0.3
correctly_combined_tables2 = pd.merge(table1, table2, how='right', on='label')
correctly_combined_tables2
label circularity elongation area skewness
0 3 0.7 1.2 22 0.5
1 2 0.5 3.4 32 0.6
2 1 0.3 2.3 25 0.3
3 4 NaN NaN 18 0.3

Suppose there is a common measurement name from different tables. For example, table3 below also contains “elongation”.

table3 = pd.DataFrame({
    "label":    [3,   2,   1,   4],
    "area":     [22,  32,  25,  18],
    "skewness": [0.5, 0.6, 0.3, 0.3],
    "elongation":  [2.3, 3.4, 1.2, 1.1]
    })
table3
label area skewness elongation
0 3 22 0.5 2.3
1 2 32 0.6 3.4
2 1 25 0.3 1.2
3 4 18 0.3 1.1

Applying merge still preserves both measurements in different columns.

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label')
correctly_combined_tables3 
label circularity elongation_x area skewness elongation_y
0 1 0.3 2.3 25 0.3 1.2
1 2 0.5 3.4 32 0.6 3.4
2 3 0.7 1.2 22 0.5 2.3
3 4 NaN NaN 18 0.3 1.1

We can change ‘x’ and ‘y’ by passing other suffixes.

correctly_combined_tables3 = pd.merge(table1, table3, how='outer', on='label', suffixes=('_method1', '_method2'))
correctly_combined_tables3
label circularity elongation_method1 area skewness elongation_method2
0 1 0.3 2.3 25 0.3 1.2
1 2 0.5 3.4 32 0.6 3.4
2 3 0.7 1.2 22 0.5 2.3
3 4 NaN NaN 18 0.3 1.1

Combining measurements from multiple image files#

When applying a workflow to many images, you would get tables with the same column names, but with a variable number of rows. To calculate statistics for whole folders or to perform machine learning, we usually need to concatenate those tables, but it is important to keep track of the source files.

Let’s open two tables generated from applying the same worflow to different files.

df1 = pd.read_csv('../../data/BBBC007_20P1_POS0007_D_1UL.csv')
df1.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 256 93.250000 19.995017 17.021559 1.174688
1 90 82.488889 15.939969 7.516326 2.120713
2 577 90.637782 35.324458 21.759434 1.623409
3 270 95.640741 20.229431 17.669052 1.144908
4 153 84.908497 15.683703 12.420475 1.262730
df2 = pd.read_csv('../../data/BBBC007_20P1_POS0010_D_1UL.csv')
df2.head()
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 139 96.546763 17.504104 10.292770 1.700621
1 360 86.613889 35.746808 14.983124 2.385805
2 43 91.488372 12.967884 4.351573 2.980045
3 140 73.742857 18.940508 10.314404 1.836316
4 144 89.375000 13.639308 13.458532 1.013432

In this particular case where we know we have the same columns, we could concatenate them into a single big table.

big_df = pd.concat([df1, df2], axis=0)
big_df
area intensity_mean major_axis_length minor_axis_length aspect_ratio
0 256 93.250000 19.995017 17.021559 1.174688
1 90 82.488889 15.939969 7.516326 2.120713
2 577 90.637782 35.324458 21.759434 1.623409
3 270 95.640741 20.229431 17.669052 1.144908
4 153 84.908497 15.683703 12.420475 1.262730
... ... ... ... ... ...
42 315 91.133333 20.927095 19.209283 1.089426
43 206 94.262136 23.381879 11.669668 2.003646
44 45 68.377778 9.406371 6.276445 1.498678
45 33 76.727273 10.724275 4.174568 2.568955
46 16 76.750000 7.745967 2.783882 2.782433

111 rows × 5 columns

The problem is that we lose their source identity. An easy fix for that is to add a new column with the file name before concatenating them. This will facilitate splitting them back and plotting later on.

When we give a single value to a new column, it is assigned to all rows.

df1['file_name'] = 'BBBC007_20P1_POS0007_D_1UL'

df2['file_name'] = 'BBBC007_20P1_POS0010_D_1UL'
big_df = pd.concat([df1, df2], axis=0)
big_df
area intensity_mean major_axis_length minor_axis_length aspect_ratio file_name
0 256 93.250000 19.995017 17.021559 1.174688 BBBC007_20P1_POS0007_D_1UL
1 90 82.488889 15.939969 7.516326 2.120713 BBBC007_20P1_POS0007_D_1UL
2 577 90.637782 35.324458 21.759434 1.623409 BBBC007_20P1_POS0007_D_1UL
3 270 95.640741 20.229431 17.669052 1.144908 BBBC007_20P1_POS0007_D_1UL
4 153 84.908497 15.683703 12.420475 1.262730 BBBC007_20P1_POS0007_D_1UL
... ... ... ... ... ... ...
42 315 91.133333 20.927095 19.209283 1.089426 BBBC007_20P1_POS0010_D_1UL
43 206 94.262136 23.381879 11.669668 2.003646 BBBC007_20P1_POS0010_D_1UL
44 45 68.377778 9.406371 6.276445 1.498678 BBBC007_20P1_POS0010_D_1UL
45 33 76.727273 10.724275 4.174568 2.568955 BBBC007_20P1_POS0010_D_1UL
46 16 76.750000 7.745967 2.783882 2.782433 BBBC007_20P1_POS0010_D_1UL

111 rows × 6 columns

Now, we can safely distinguish the source of each row.