Plotting Distributions with Seaborn

Seaborn also makes it practical to visualize data distributions using boxplots, bar graphs, histograms and kernel density estimation plots.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We start by loading a table of measurements into a pandas DataFrame.

df = pd.read_csv("../../data/BBBC007_analysis.csv")
df
     area  intensity_mean  major_axis_length  minor_axis_length  aspect_ratio           file_name
0     139       96.546763          17.504104          10.292770      1.700621  20P1_POS0010_D_1UL
1     360       86.613889          35.746808          14.983124      2.385805  20P1_POS0010_D_1UL
2      43       91.488372          12.967884           4.351573      2.980045  20P1_POS0010_D_1UL
3     140       73.742857          18.940508          10.314404      1.836316  20P1_POS0010_D_1UL
4     144       89.375000          13.639308          13.458532      1.013432  20P1_POS0010_D_1UL
..    ...             ...                ...                ...           ...                 ...
106   305       88.252459          20.226532          19.244210      1.051045  20P1_POS0007_D_1UL
107   593       89.905565          36.508370          21.365394      1.708762  20P1_POS0007_D_1UL
108   289      106.851211          20.427809          18.221452      1.121086  20P1_POS0007_D_1UL
109   277      100.664260          20.307965          17.432920      1.164920  20P1_POS0007_D_1UL
110    46       70.869565          11.648895           5.298003      2.198733  20P1_POS0007_D_1UL

111 rows × 6 columns

Boxplots

The axes-level function for plotting boxplots is boxplot.

Seaborn identifies file_name as a categorical variable and intensity_mean as a numerical variable, so it draws one boxplot of the intensity values per file. If we swap x and y, we get the same graph, but with vertical boxplots.

sns.boxplot(data=df, x="intensity_mean", y="file_name")
<AxesSubplot: xlabel='intensity_mean', ylabel='file_name'>
../_images/ef77a01adaa5c5887770a0a98f71164d6620643aa593ca6bf55edbdfd7b50d1a.png
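To illustrate the swap of x and y mentioned above, here is a minimal sketch. It uses a small synthetic table named `demo` (a hypothetical stand-in with made-up values and file names, not the BBBC007 measurements) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the measurements table (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "file_name": np.repeat(["file_A", "file_B"], 50),
    "intensity_mean": np.concatenate([
        rng.normal(90, 10, 50),
        rng.normal(100, 12, 50),
    ]),
})

# Categorical variable on x, numerical variable on y: vertical boxes
ax = sns.boxplot(data=demo, x="file_name", y="intensity_mean")
```

With the real table, the same call with `data=df` should give the vertical version of the plot above.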

The figure-level, more general function for this family of categorical plots is catplot. We only have to pass kind="box".

sns.catplot(data=df, x="intensity_mean", y="file_name", kind="box")
<seaborn.axisgrid.FacetGrid at 0x27775d754f0>
../_images/d0ce25765ccd86b0241683da191a9c0bcdbe80e78664dfa812569071361af1f2.png

There are other kinds available, like a bar graph.

sns.catplot(data=df, x="file_name", y="intensity_mean", kind="bar")
<seaborn.axisgrid.FacetGrid at 0x2777b1abb80>
../_images/9274917e3ee7b0c960318ae80f6988c4c99a917678a0f03cdfba37fd5323d30d.png
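As a further sketch of the kind parameter, the snippet below draws a violin plot, another kind that catplot accepts. The `demo` table is a hypothetical synthetic stand-in with made-up values, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the measurements table (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "file_name": np.repeat(["file_A", "file_B"], 50),
    "intensity_mean": np.concatenate([
        rng.normal(90, 10, 50),
        rng.normal(100, 12, 50),
    ]),
})

# Same figure-level call as before, only the kind changes
g = sns.catplot(data=demo, x="file_name", y="intensity_mean", kind="violin")
```

Note that the bar kind summarizes each group by its mean with an error bar, whereas box and violin plots show the spread of the individual values.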

Histograms and Distribution Plots

The axes-level function for plotting histograms is histplot.

sns.histplot(data=df, x="intensity_mean", hue="file_name")
<AxesSubplot: xlabel='intensity_mean', ylabel='Count'>
../_images/7ce39dc708a5046c5f7cbba2d8cb2aad51d385670a372a6e7b7fc4fa246a55bd.png

We can instead plot the kernel density estimate (KDE) with the kdeplot function. Be careful when interpreting these plots, because the smoothing can hide or distort features of the data.

sns.kdeplot(data=df, x="intensity_mean", hue="file_name")
<AxesSubplot: xlabel='intensity_mean', ylabel='Density'>
../_images/7050b3f1cafd17da2e1bfa96497351c411ab7fc4490ccbaca39decac64afb868.png

The figure-level function for distributions is displot. With it, you can combine a histogram and a KDE in the same plot, or draw other kinds of plots, such as the empirical cumulative distribution function (ECDF).

sns.displot(data=df, x="intensity_mean", hue="file_name", kde=True)
<seaborn.axisgrid.FacetGrid at 0x2777b77c910>
../_images/c29cb3f2d77132bfcb047fab8e390e2c0ac229b4cadf9294ccad160db8357a37.png

Exercise

Plot two empirical cumulative distribution functions for ‘area’ from different files on the same graph with different colors.

Repeat this for the property ‘intensity_mean’ in a second figure. From the plots, judge whether you would expect these properties to differ between the files.

Hint: look for the kind parameter of displot.