Tidy-Data

Tidy-Data#

Hadley Wickham wrote a great article in favor of “tidy data.” Tidy data frames follow the rules:

  • Each variable is a column.

  • Each observation is a row.

  • Each type of observation has its own separate data frame.

This is less pretty to visualize as a table, but we rarely look at data in tables. Indeed, the representation of data which is convenient for visualization is different from that which is convenient for analysis. A tidy data frame is almost always much easier to work with than non-tidy formats.

Let’s import a saved table with measurements. Is this table tidy?

import pandas as pd
df = pd.read_csv('../../data/Multi_analysis.csv', header = [0,1], sep=';')
df
Before After
channel_1 channel_2 channel_1 channel_2
0 13.250000 21.000000 15.137984 42.022776
1 44.954545 24.318182 43.328836 48.661610
2 13.590909 18.772727 11.685995 37.926184
3 85.032258 19.741935 86.031461 40.396353
4 10.731707 25.268293 10.075421 51.471865
5 94.625000 36.450000 95.180900 73.347843
6 89.836735 34.693878 89.857864 69.902829
7 100.261905 34.904762 101.989852 70.156432
8 29.615385 52.115385 31.516654 104.525198
9 15.868421 24.921053 16.086932 50.563301
10 12.475000 25.450000 11.529924 51.381594
11 87.875000 28.050000 89.745522 56.543107
12 58.800000 22.600000 59.646229 45.215405
13 91.061224 40.367347 89.935893 81.326111
14 23.500000 117.333333 21.676993 235.067654
15 82.566667 34.566667 84.097735 69.820702
16 36.120000 29.600000 37.688676 59.870177
17 70.687500 33.843750 72.569112 68.493363
18 102.021277 33.297872 100.419746 67.379506
19 72.318182 103.909091 70.843134 207.956510
20 18.100000 29.166667 17.865201 58.361239
21 5.217391 36.347826 6.961346 73.286439
22 19.925926 72.814815 18.607102 145.900739
23 26.673077 57.403846 27.611368 115.347217
24 13.340000 30.400000 14.160543 61.225962
25 15.028571 38.400000 14.529963 77.490249

The most useful function for tidying data is pd.melt.

df.melt()
variable_0 variable_1 value
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

We can specify names for the value and for the variables. In this case, our measurements are of intensity and our variables are Intervention (before or after) and channel.

df_tidy = df.melt(value_name='intensity', var_name=['Intervention', 'Channel'])
df_tidy
Intervention Channel intensity
0 Before channel_1 13.250000
1 Before channel_1 44.954545
2 Before channel_1 13.590909
3 Before channel_1 85.032258
4 Before channel_1 10.731707
... ... ... ...
99 After channel_2 73.286439
100 After channel_2 145.900739
101 After channel_2 115.347217
102 After channel_2 61.225962
103 After channel_2 77.490249

104 rows × 3 columns

It may not look better for interpreting, but it becomes easier to manipulate, because now we can more easily mask by columns. Here we select intensity measurement rows for “channel_2” and “After” intervention.

row_mask = (df_tidy['Channel'] == 'channel_2') & (df_tidy['Intervention'] == 'After')
df_tidy.loc[row_mask, :]['intensity']
78      42.022776
79      48.661610
80      37.926184
81      40.396353
82      51.471865
83      73.347843
84      69.902829
85      70.156432
86     104.525198
87      50.563301
88      51.381594
89      56.543107
90      45.215405
91      81.326111
92     235.067654
93      69.820702
94      59.870177
95      68.493363
96      67.379506
97     207.956510
98      58.361239
99      73.286439
100    145.900739
101    115.347217
102     61.225962
103     77.490249
Name: intensity, dtype: float64