Handling NaN values#
When analysing tabular data, tables sometimes contain cells without data. In Python, such a value is typically represented as Not a Number (NaN). We cannot simply assume these values are 0, -1, or any other placeholder, because that would distort descriptive statistics such as the mean. We need to deal with these NaN entries differently, and this notebook introduces how.
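To see why, here is a minimal sketch with made-up numbers: treating a NaN as 0 changes the mean of the measurements.
import numpy as np
measurements = np.asarray([3.0, 4.0, np.nan, 5.0])
print(np.nanmean(measurements))              # 4.0: the NaN is ignored
print(np.mean(np.nan_to_num(measurements)))  # 3.0: the NaN is counted as 0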
To get a first view of where NaNs play a role, we load an example table again and sort it.
import numpy as np
import pandas as pd
We sort the table by the column Area using the function sort_values to understand where NaNs play a role.
data = pd.read_csv('../../data/Results.csv', index_col=0, delimiter=';')
data.sort_values(by="Area", ascending=False)
| | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 190 | 2755.0 | 859.928 | 235.458 | 539.0 | 3880.0 | 108.710 | 302.158 | 110.999 | 300.247 | 144.475 | 24.280 | 39.318 | 100 | C |
| 81 | 2295.0 | 765.239 | 96.545 | 558.0 | 1431.0 | 375.003 | 134.888 | 374.982 | 135.359 | 65.769 | 44.429 | 127.247 | 100 | B |
| 209 | 1821.0 | 847.761 | 122.074 | 600.0 | 1510.0 | 287.795 | 321.115 | 288.074 | 321.824 | 55.879 | 41.492 | 112.124 | 100 | A |
| 252 | 1528.0 | 763.777 | 83.183 | 572.0 | 1172.0 | 191.969 | 385.944 | 192.487 | 385.697 | 63.150 | 30.808 | 34.424 | 100 | B |
| 265 | 1252.0 | 793.371 | 117.139 | 579.0 | 1668.0 | 262.071 | 394.497 | 262.268 | 394.326 | 60.154 | 26.500 | 50.147 | 100 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 113 | 1.0 | 587.000 | 0.000 | 587.0 | 587.0 | 399.500 | 117.500 | 399.500 | 117.500 | 1.128 | 1.128 | 0.000 | 100 | A |
| 310 | 1.0 | 866.000 | 0.000 | 866.0 | 866.0 | 343.500 | 408.500 | 343.500 | 408.500 | 1.128 | 1.128 | 0.000 | 100 | A |
| 219 | 1.0 | 763.000 | 0.000 | 763.0 | 763.0 | 411.500 | 296.500 | 411.500 | 296.500 | 1.128 | 1.128 | 0.000 | 100 | A |
| 3 | NaN | NaN | NaN | 608.0 | 964.0 | NaN | NaN | NaN | 7.665 | 7.359 | NaN | 101.121 | 100 | A |
| 5 | NaN | NaN | 69.438 | 566.0 | 792.0 | 348.500 | 7.500 | NaN | 7.508 | NaN | 3.088 | NaN | 100 | A |
391 rows × 14 columns
We could also use sort_values to sort along the other axis, i.e. to sort columns instead of rows, via its axis parameter.
As you can see, the rows containing NaNs ended up at the bottom of the table: when sorting, pandas places NaN values last by default.
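If you would rather have those rows at the top, sort_values also accepts a na_position parameter, for example:
data.sort_values(by="Area", ascending=False, na_position="first")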
A quick check whether there are NaNs anywhere in a table is an important quality check and good scientific practice:
data.isnull().values.any()
True
Now we know that there are NaNs in our table. We can also get deeper insight into the columns in which these NaN values are located.
data.isnull().sum()
Area 2
Mean 5
StdDev 3
Min 3
Max 3
X 2
Y 3
XM 3
YM 5
Major 8
Minor 3
Angle 1
%Area 0
Type 0
dtype: int64
To get a feeling for whether we can further process the table, we may want to know the percentage of NaNs in each column:
data.isnull().mean().sort_values(ascending=False) * 100
Major 2.046036
Mean 1.278772
YM 1.278772
StdDev 0.767263
Min 0.767263
Max 0.767263
Y 0.767263
XM 0.767263
Minor 0.767263
Area 0.511509
X 0.511509
Angle 0.255754
%Area 0.000000
Type 0.000000
dtype: float64
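These percentages can also be used to filter the table, for example to keep only columns below a chosen NaN threshold. A minimal sketch, assuming a 1% threshold:
# keep only columns in which less than 1% of the entries are NaN
threshold = 0.01
data_filtered = data.loc[:, data.isnull().mean() < threshold]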
Dropping rows that contain NaNs#
Depending on what kind of data analysis should be performed, it might make sense to ignore columns that contain NaN values. Alternatively, it is possible to delete rows that contain NaNs.
Which approach is appropriate depends on your project and on what is important for the analysis; there is no easy general answer.
data_no_nan = data.dropna(how="any")
data_no_nan
| | Area | Mean | StdDev | Min | Max | X | Y | XM | YM | Major | Minor | Angle | %Area | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 18.0 | 730.389 | 103.354 | 592.0 | 948.0 | 435.000 | 4.722 | 434.962 | 4.697 | 5.987 | 3.828 | 168.425 | 100 | A |
| 2 | 126.0 | 718.333 | 90.367 | 556.0 | 1046.0 | 388.087 | 8.683 | 388.183 | 8.687 | 16.559 | 9.688 | 175.471 | 100 | A |
| 4 | 68.0 | 686.985 | 61.169 | 571.0 | 880.0 | 126.147 | 8.809 | 126.192 | 8.811 | 15.136 | 5.720 | 168.133 | 100 | A |
| 6 | 669.0 | 697.164 | 72.863 | 539.0 | 957.0 | 471.696 | 26.253 | 471.694 | 26.197 | 36.656 | 23.237 | 124.340 | 100 | A |
| 7 | 5.0 | 658.600 | 49.161 | 607.0 | 710.0 | 28.300 | 8.100 | 28.284 | 8.103 | 3.144 | 2.025 | 161.565 | 100 | A |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 383 | 94.0 | 746.617 | 85.198 | 550.0 | 1021.0 | 194.032 | 498.223 | 194.014 | 498.239 | 17.295 | 6.920 | 52.720 | 100 | B |
| 387 | 152.0 | 801.599 | 111.328 | 582.0 | 1263.0 | 348.487 | 497.632 | 348.451 | 497.675 | 17.773 | 10.889 | 11.829 | 100 | A |
| 389 | 60.0 | 758.033 | 77.309 | 601.0 | 947.0 | 259.000 | 499.300 | 258.990 | 499.289 | 9.476 | 8.062 | 90.000 | 100 | A |
| 390 | 12.0 | 714.833 | 67.294 | 551.0 | 785.0 | 240.167 | 498.167 | 240.179 | 498.148 | 4.606 | 3.317 | 168.690 | 100 | A |
| 391 | 23.0 | 695.043 | 67.356 | 611.0 | 846.0 | 49.891 | 503.022 | 49.882 | 502.979 | 6.454 | 4.537 | 73.243 | 100 | A |
374 rows × 14 columns
At the bottom of the table, you can see that it still contains 374 of the original 391 rows. If you remove rows, you should document in your later scientific publication how many out of how many datasets were analysed.
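These numbers can be retrieved programmatically, for example like this:
print(len(data_no_nan), "of", len(data), "rows remain after dropping NaNs")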
We can now also check again if NaNs are present.
data_no_nan.isnull().values.any()
False
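The dropna function offers more fine-grained control than dropping every row with any NaN. As a sketch, how="all" removes only rows in which all values are NaN, and the subset parameter restricts the check to selected columns:
# drop rows only if all of their values are NaN
data.dropna(how="all")

# drop rows only if the Area or Mean column contains NaN
data.dropna(subset=["Area", "Mean"])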
Determining rows that contain NaNs#
In some use cases it might be useful to have a list of row indices where NaN values are present.
data = {
'A': [0, 1, 22, 21, 12, 23],
'B': [2, 3, np.nan, 2, 12, 22],
'C': [2, 3, 44, 2, np.nan, 52],
}
table = pd.DataFrame(data)
table
| | A | B | C |
|---|---|---|---|
| 0 | 0 | 2.0 | 2.0 |
| 1 | 1 | 3.0 | 3.0 |
| 2 | 22 | NaN | 44.0 |
| 3 | 21 | 2.0 | 2.0 |
| 4 | 12 | 12.0 | NaN |
| 5 | 23 | 22.0 | 52.0 |
np.max(table.isnull().values, axis=1)
array([False, False, True, False, True, False])
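This gives a boolean mask marking the rows that contain at least one NaN. If the actual row indices are needed, one way (a sketch using pandas' any) is:
# indices of rows that contain at least one NaN
table.index[table.isnull().any(axis=1)]  # -> rows 2 and 4 in this example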
Dropping columns that contain NaNs#
As mentioned above, sometimes it also makes sense to drop columns, for example when a column is full of NaN values. To demonstrate this, we create such a column:
data['difficult_measurement'] = np.nan
nan_table = pd.DataFrame(data)
nan_table
| | A | B | C | difficult_measurement |
|---|---|---|---|---|
| 0 | 0 | 2.0 | 2.0 | NaN |
| 1 | 1 | 3.0 | 3.0 | NaN |
| 2 | 22 | NaN | 44.0 | NaN |
| 3 | 21 | 2.0 | 2.0 | NaN |
| 4 | 12 | 12.0 | NaN | NaN |
| 5 | 23 | 22.0 | 52.0 | NaN |
We can now drop the column like this:
table_dropped = nan_table.drop('difficult_measurement', axis=1)
The column has now been removed from our table:
table_dropped
| | A | B | C |
|---|---|---|---|
| 0 | 0 | 2.0 | 2.0 |
| 1 | 1 | 3.0 | 3.0 |
| 2 | 22 | NaN | 44.0 |
| 3 | 21 | 2.0 | 2.0 |
| 4 | 12 | 12.0 | NaN |
| 5 | 23 | 22.0 | 52.0 |
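Instead of naming the column explicitly, we could also let pandas detect and drop it. A minimal sketch using dropna along the column axis; how="all" ensures that only columns consisting entirely of NaNs are removed:
# drop every column whose values are all NaN
nan_table.dropna(axis=1, how="all")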