Handling NaN values#
When analysing tabular data, tables sometimes contain cells without data. In Python, such a value is typically represented as Not a Number (NaN). We cannot simply assume these values are 0, -1, or any other placeholder, because that would distort descriptive statistics such as the mean. We need to deal with these NaN entries differently, and this notebook introduces how.
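To see why, here is a minimal sketch with made-up numbers: treating a NaN as 0 changes the mean of the measurements.
import numpy as np
measurements = np.asarray([3.0, 4.0, np.nan, 5.0])
print(np.nanmean(measurements))              # 4.0: the NaN is ignored
print(np.mean(np.nan_to_num(measurements)))  # 3.0: the NaN is counted as 0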
To get a first view of where NaNs play a role, we load an example table again and sort it.
import numpy as np
import pandas as pd
We sort the table by the column Area using the function sort_values to understand where NaNs play a role.
data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data.sort_values(by="Area", ascending=False)
| | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 190 | 2755.0 | 859.928 | 235.458 | 539.0 | 3880.0 | 108.710 | 302.158 | 110.999 | 300.247 | 144.475 | 24.280 | 39.318 | 100 | C |
| 81 | 2295.0 | 765.239 | 96.545 | 558.0 | 1431.0 | 375.003 | 134.888 | 374.982 | 135.359 | 65.769 | 44.429 | 127.247 | 100 | B |
| 209 | 1821.0 | 847.761 | 122.074 | 600.0 | 1510.0 | 287.795 | 321.115 | 288.074 | 321.824 | 55.879 | 41.492 | 112.124 | 100 | A |
| 252 | 1528.0 | 763.777 | 83.183 | 572.0 | 1172.0 | 191.969 | 385.944 | 192.487 | 385.697 | 63.150 | 30.808 | 34.424 | 100 | B |
| 265 | 1252.0 | 793.371 | 117.139 | 579.0 | 1668.0 | 262.071 | 394.497 | 262.268 | 394.326 | 60.154 | 26.500 | 50.147 | 100 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113 | 1.0 | 587.000 | 0.000 | 587.0 | 587.0 | 399.500 | 117.500 | 399.500 | 117.500 | 1.128 | 1.128 | 0.000 | 100 | A |
| 310 | 1.0 | 866.000 | 0.000 | 866.0 | 866.0 | 343.500 | 408.500 | 343.500 | 408.500 | 1.128 | 1.128 | 0.000 | 100 | A |
| 219 | 1.0 | 763.000 | 0.000 | 763.0 | 763.0 | 411.500 | 296.500 | 411.500 | 296.500 | 1.128 | 1.128 | 0.000 | 100 | A |
| 3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
| 5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
391 rows × 14 columns
We could also use sort_values to sort along the other axis, i.e. to sort columns instead of rows, via its axis parameter.
As you can see, the rows containing NaNs ended up at the bottom of the table: when sorting, pandas places NaN values last by default.
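If you would rather have those rows at the top, sort_values also accepts a na_position parameter, for example:
data.sort_values(by="Area", ascending=False, na_position="first")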
A quick check whether there are NaNs anywhere in a table is an important quality check and good scientific practice:
data.isnull().values.any()
True
Now we know that there are NaNs in our table. We can also get deeper insight into the columns in which these NaN values are located.
data.isnull().sum()
Area 2
Mean 5
StdDev 3
Min 3
Max 3
X 2
Y 3
XM 3
YM 5
Major 8
Minor 3
Angle 1
%Area 0
Type 0
dtype: int64
To get a feeling for whether we can further process the table, we may want to know the percentage of NaNs in each column:
data.isnull().mean().sort_values(ascending=False) * 100
Major 2.046036
Mean 1.278772
YM 1.278772
StdDev 0.767263
Min 0.767263
Max 0.767263
Y 0.767263
XM 0.767263
Minor 0.767263
Area 0.511509
X 0.511509
Angle 0.255754
%Area 0.000000
Type 0.000000
dtype: float64
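These percentages can also be used to filter the table, for example to keep only columns below a chosen NaN threshold. A minimal sketch, assuming a 1% threshold:
# keep only columns in which less than 1% of the entries are NaN
threshold = 0.01
data_filtered = data.loc[:, data.isnull().mean() < threshold]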
Dropping rows that contain NaNs#
Depending on what kind of data analysis should be performed, it might make sense to ignore columns that contain NaN values. Alternatively, it is possible to delete rows that contain NaNs.
Which approach is appropriate depends on your project and on what is important for the analysis; there is no easy general answer.
data_no_nan = data.dropna(how="any")
data_no_nan
| | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
| 2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
| 4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
| 6 | 669.0 | 697.164 | 72.863 | 539.0 | 957.0 | 471.696 | 26.253 | 471.694 | 26.197 | 36.656 | 23.237 | 124.340 | 100 | A |
| 7 | 5.0 | 658.600 | 49.161 | 607.0 | 710.0 | 28.300 | 8.100 | 28.284 | 8.103 | 3.144 | 2.025 | 161.565 | 100 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 383 | 94.0 | 746.617 | 85.198 | 550.0 | 1021.0 | 194.032 | 498.223 | 194.014 | 498.239 | 17.295 | 6.920 | 52.720 | 100 | B |
| 387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
| 389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
| 390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
| 391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
374 rows × 14 columns
At the bottom of the table, you can see that it still contains 374 of the original 391 rows. If you remove rows, you should document in your later scientific publication how many out of how many datasets were analysed.
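These numbers can be retrieved programmatically, for example like this:
print(len(data_no_nan), "of", len(data), "rows remain after dropping NaNs")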
We can now also check again if NaNs are present.
data_no_nan.isnull().values.any()
False
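The dropna function offers more fine-grained control than dropping every row with any NaN. As a sketch, how="all" removes only rows in which all values are NaN, and the subset parameter restricts the check to selected columns:
# drop rows only if all of their values are NaN
data.dropna(how="all")

# drop rows only if the Area or Mean column contains NaN
data.dropna(subset=["Area", "Mean"])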
Determining rows that contain NaNs#
In some use cases it might be useful to have a list of row indices where NaN values are present.
data = {
'A': [0, 1, 22, 21, 12, 23],
'B': [2, 3, np.nan, 2, 12, 22],
'C': [2, 3, 44, 2, np.nan, 52],
}
table = pd.DataFrame(data)
table
| | A | B | C |
|---|---|---|---|
| 0 | 0 | 2.0 | 2.0 |
| 1 | 1 | 3.0 | 3.0 |
| 2 | 22 | NaN | 44.0 |
| 3 | 21 | 2.0 | 2.0 |
| 4 | 12 | 12.0 | NaN |
| 5 | 23 | 22.0 | 52.0 |
np.max(table.isnull().values, axis=1)
array([False, False, True, False, True, False])
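This gives a boolean mask marking the rows that contain at least one NaN. If the actual row indices are needed, one way (a sketch using pandas' any) is:
# indices of rows that contain at least one NaN
table.index[table.isnull().any(axis=1)]  # -> rows 2 and 4 in this example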
Dropping columns that contain NaNs#
As mentioned above, sometimes it also makes sense to drop columns, for example when a column is full of NaN values. To demonstrate this, we create such a column:
data['difficult_measurement'] = np.nan
nan_table = pd.DataFrame(data)
nan_table
| | A | B | C | difficult_measurement |
|---|---|---|---|---|
| 0 | 0 | 2.0 | 2.0 | NaN |
| 1 | 1 | 3.0 | 3.0 | NaN |
| 2 | 22 | NaN | 44.0 | NaN |
| 3 | 21 | 2.0 | 2.0 | NaN |
| 4 | 12 | 12.0 | NaN | NaN |
| 5 | 23 | 22.0 | 52.0 | NaN |
We can now drop the column like this:
table_dropped = nan_table.drop('difficult_measurement', axis=1)
The column has now been removed from our table:
table_dropped
| | A | B | C |
|---|---|---|---|
| 0 | 0 | 2.0 | 2.0 |
| 1 | 1 | 3.0 | 3.0 |
| 2 | 22 | NaN | 44.0 |
| 3 | 21 | 2.0 | 2.0 |
| 4 | 12 | 12.0 | NaN |
| 5 | 23 | 22.0 | 52.0 |
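Instead of naming the column explicitly, we could also let pandas detect and drop it. A minimal sketch using dropna along the column axis; how="all" ensures that only columns consisting entirely of NaNs are removed:
# drop every column whose values are all NaN
nan_table.dropna(axis=1, how="all")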