Introduction to working with DataFrames#
In basic Python, we often use dictionaries that hold our measurements as lists. While these basic structures are handy for collecting data, they are suboptimal for further data processing. For that, we introduce pandas DataFrames, which are more convenient for the next steps. In Python, scientists often call tables “DataFrames”.
import pandas as pd
Creating DataFrames from a dictionary of lists#
Assume we did some image processing and have some results available in a dictionary that contains lists of numbers:
measurements = {
"labels": [1, 2, 3],
"area": [45, 23, 68],
"minor_axis": [2, 4, 4],
"major_axis": [3, 4, 5],
}
This data structure can be nicely visualized using a DataFrame:
df = pd.DataFrame(measurements)
df
| | labels | area | minor_axis | major_axis |
---|---|---|---|---|
0 | 1 | 45 | 2 | 3 |
1 | 2 | 23 | 4 | 4 |
2 | 3 | 68 | 4 | 5 |
Using these DataFrames, data modification is straightforward. For example, one can append a new column and compute its values from existing columns:
df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]
df
| | labels | area | minor_axis | major_axis | aspect_ratio |
---|---|---|---|---|---|
0 | 1 | 45 | 2 | 3 | 1.50 |
1 | 2 | 23 | 4 | 4 | 1.00 |
2 | 3 | 68 | 4 | 5 | 1.25 |
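To illustrate why this is more convenient than a plain dictionary of lists, here is a minimal sketch that computes the same aspect ratio both ways. The variable name `aspect_ratio_list` is introduced only for this comparison.

```python
import pandas as pd

measurements = {
    "labels": [1, 2, 3],
    "area": [45, 23, 68],
    "minor_axis": [2, 4, 4],
    "major_axis": [3, 4, 5],
}

# Plain Python: element-wise division needs an explicit loop over both lists
aspect_ratio_list = [
    major / minor
    for major, minor in zip(measurements["major_axis"], measurements["minor_axis"])
]

# pandas: the same computation is a single expression on whole columns
df = pd.DataFrame(measurements)
df["aspect_ratio"] = df["major_axis"] / df["minor_axis"]

print(aspect_ratio_list)
print(df["aspect_ratio"].tolist())
```

Both print statements show the same three values as the `aspect_ratio` column in the table above.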
Saving data frames#
We can also save this table to a CSV file so that we can continue working with it later.
df.to_csv("../../data/short_table.csv")
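By default, `to_csv` also writes the row index as an unnamed first column. The following sketch, which uses a small example table and returns the CSV text as a string instead of writing a file, shows the difference when passing `index=False`. Whether you want the index in the file depends on your data.

```python
import pandas as pd

# Small example table, for illustration only
df = pd.DataFrame({"labels": [1, 2, 3], "area": [45, 23, 68]})

# Without a file name, to_csv returns the CSV content as a string
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# Compare the header lines of both variants
print(csv_with_index.splitlines()[0])
print(csv_without_index.splitlines()[0])
```

The first header line starts with an empty field for the index, while the second one contains only the column names.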
Loading data frames#
Tables can also be read from CSV files.
df_csv = pd.read_csv('../../data/blobs_statistics.csv')
df_csv
| | Unnamed: 0 | area | mean_intensity | minor_axis_length | major_axis_length | eccentricity | extent | feret_diameter_max | equivalent_diameter_area | bbox-0 | bbox-1 | bbox-2 | bbox-3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 422 | 192.379147 | 16.488550 | 34.566789 | 0.878900 | 0.586111 | 35.227830 | 23.179885 | 0 | 11 | 30 | 35 |
1 | 1 | 182 | 180.131868 | 11.736074 | 20.802697 | 0.825665 | 0.787879 | 21.377558 | 15.222667 | 0 | 53 | 11 | 74 |
2 | 2 | 661 | 205.216339 | 28.409502 | 30.208433 | 0.339934 | 0.874339 | 32.756679 | 29.010538 | 0 | 95 | 28 | 122 |
3 | 3 | 437 | 216.585812 | 23.143996 | 24.606130 | 0.339576 | 0.826087 | 26.925824 | 23.588253 | 0 | 144 | 23 | 167 |
4 | 4 | 476 | 212.302521 | 19.852882 | 31.075106 | 0.769317 | 0.863884 | 31.384710 | 24.618327 | 0 | 237 | 29 | 256 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
56 | 56 | 211 | 185.061611 | 14.522762 | 18.489138 | 0.618893 | 0.781481 | 18.973666 | 16.390654 | 232 | 39 | 250 | 54 |
57 | 57 | 78 | 185.230769 | 6.028638 | 17.579799 | 0.939361 | 0.722222 | 18.027756 | 9.965575 | 248 | 170 | 254 | 188 |
58 | 58 | 86 | 183.720930 | 5.426871 | 21.261427 | 0.966876 | 0.781818 | 22.000000 | 10.464158 | 249 | 117 | 254 | 139 |
59 | 59 | 51 | 190.431373 | 5.032414 | 13.742079 | 0.930534 | 0.728571 | 14.035669 | 8.058239 | 249 | 228 | 254 | 242 |
60 | 60 | 46 | 175.304348 | 3.803982 | 15.948714 | 0.971139 | 0.766667 | 15.033296 | 7.653040 | 250 | 67 | 254 | 82 |
61 rows × 13 columns
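The `Unnamed: 0` column typically appears when a CSV file was saved together with its row index. As a minimal sketch, assuming the first column of the file really is such an index, `index_col=0` tells `read_csv` to use it as the index instead of a data column. The CSV text below is a shortened stand-in with three rows and two columns taken from the table above, read from memory rather than from the actual file.

```python
import io
import pandas as pd

# Shortened stand-in for a CSV file that was saved with its index
csv_text = """,area,mean_intensity
0,422,192.379147
1,182,180.131868
2,661,205.216339
"""

# Default: the unnamed first column becomes a regular data column
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```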
Typically, we don’t need all the information in these tables, so it makes sense to reduce them. To do so, we first print out the column names.
df_csv.keys()
Index(['Unnamed: 0', 'area', 'mean_intensity', 'minor_axis_length',
'major_axis_length', 'eccentricity', 'extent', 'feret_diameter_max',
'equivalent_diameter_area', 'bbox-0', 'bbox-1', 'bbox-2', 'bbox-3'],
dtype='object')
We can then copy and paste the column names we’re interested in and create a new data frame.
df_analysis = df_csv[['area', 'mean_intensity']]
df_analysis
| | area | mean_intensity |
---|---|---|
0 | 422 | 192.379147 |
1 | 182 | 180.131868 |
2 | 661 | 205.216339 |
3 | 437 | 216.585812 |
4 | 476 | 212.302521 |
... | ... | ... |
56 | 211 | 185.061611 |
57 | 78 | 185.230769 |
58 | 86 | 183.720930 |
59 | 51 | 190.431373 |
60 | 46 | 175.304348 |
61 rows × 2 columns
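Selecting the columns to keep is one way to reduce a table; dropping the columns you do not need is another. The sketch below uses a small example table (`df_demo`, a hypothetical name with values taken from the first rows above) to show that both approaches can lead to the same result.

```python
import pandas as pd

# Small example table, for illustration only
df_demo = pd.DataFrame({
    "area": [422, 182, 661],
    "mean_intensity": [192.379147, 180.131868, 205.216339],
    "eccentricity": [0.878900, 0.825665, 0.339934],
})

# Option 1: list the columns to keep
selected = df_demo[["area", "mean_intensity"]]

# Option 2: list the columns to remove
dropped = df_demo.drop(columns=["eccentricity"])

# Both tables have the same columns and values
print(selected.equals(dropped))
```

Listing the columns to keep is usually more readable when only a few of many columns are needed, as in the table loaded above.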
You can then access columns and add new columns. Since df_analysis was selected from df_csv, we make an explicit copy first; otherwise pandas can emit a SettingWithCopyWarning when a new column is assigned.
df_analysis = df_analysis.copy()
df_analysis['total_intensity'] = df_analysis['area'] * df_analysis['mean_intensity']
df_analysis
| | area | mean_intensity | total_intensity |
---|---|---|---|
0 | 422 | 192.379147 | 81184.0 |
1 | 182 | 180.131868 | 32784.0 |
2 | 661 | 205.216339 | 135648.0 |
3 | 437 | 216.585812 | 94648.0 |
4 | 476 | 212.302521 | 101056.0 |
... | ... | ... | ... |
56 | 211 | 185.061611 | 39048.0 |
57 | 78 | 185.230769 | 14448.0 |
58 | 86 | 183.720930 | 15800.0 |
59 | 51 | 190.431373 | 9712.0 |
60 | 46 | 175.304348 | 8064.0 |
61 rows × 3 columns
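When a new column is added to a table that was itself selected from another table, pandas may emit a SettingWithCopyWarning, depending on the version. As an alternative sketch, `assign` returns a new table with the extra column and leaves the selected table untouched. The small table `df_source` is a hypothetical stand-in for the loaded CSV data.

```python
import pandas as pd

# Small stand-in for the loaded table, for illustration only
df_source = pd.DataFrame({
    "area": [422, 182, 661],
    "mean_intensity": [192.379147, 180.131868, 205.216339],
    "eccentricity": [0.878900, 0.825665, 0.339934],
})

# Select the columns of interest, then derive a new table with one more column
df_selected = df_source[["area", "mean_intensity"]]
df_with_total = df_selected.assign(
    total_intensity=lambda table: table["area"] * table["mean_intensity"]
)

print(list(df_selected.columns))    # the selection itself is unchanged
print(list(df_with_total.columns))  # the new table has the extra column
```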
Exercise#
For the loaded CSV file, create a table that only contains these columns:
minor_axis_length
major_axis_length
aspect_ratio
df_shape = pd.read_csv('../../data/blobs_statistics.csv')