Clustering  a quick walkthrough#
The term clustering stands for grouping objects according to their properties without providing any annotation. Algorithms in this field are also reffered to as unsupervised machine learning. The algorithms do receive input from the programmer though, e.g. by selecting specific measurements, which might render the term unsupervised in this context a bit misleading.
When clustering data retrieved from images, we use terms such as standardscaling, dimensionality reduction and finally algorithms such as kmeans clustering. This notebook is a quick walk through using these techniques before the methods are demonstrated in more detail in the following notebooks.
See also
import pyclesperanto_prototype as cle
from napari_simpleitk_image_processing import label_statistics
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
import pandas as pd
import seaborn as sns
import seaborn
import numpy as np
import umap
import matplotlib.pyplot as plt
from skimage.data import human_mitosis
First we start by loading a 3D dataset showing a Tribolium castaneum embryo undergoing gastrulation (curtesy: Daniela Vorkel, Myers lab, MPICBG / CSBD Dresden). The dataset shows dense nuclei marked with nucleiGFP on the left forming the embryo and less dense nuclei, called serosa, on the right surrounding the embryo.
image = cle.imread("../../data/Lund25MB.tif")
image
cle._ image

We segment the nuclei as shown in earlier sections using tophatfiltering for background removal and VoronoiOtsuLabeling.
background_subtracted = cle.top_hat_box(image, radius_x=5, radius_y=5)
nuclei_labels = cle.voronoi_otsu_labeling(background_subtracted, spot_sigma=1)
nuclei_labels
cle._ image

Feature extraction#
We next measure properties such as intensity, size and shape from the labeled nuclei using napariSimpleITKimageprocessing.
statistics = label_statistics(image, nuclei_labels,
intensity=True,
perimeter=True,
shape=True)
statistics.keys()
Index(['label', 'maximum', 'mean', 'median', 'minimum', 'sigma', 'sum',
'variance', 'elongation', 'feret_diameter', 'flatness', 'roundness',
'equivalent_ellipsoid_diameter_0', 'equivalent_ellipsoid_diameter_1',
'equivalent_ellipsoid_diameter_2', 'equivalent_spherical_perimeter',
'equivalent_spherical_radius', 'number_of_pixels',
'number_of_pixels_on_border', 'perimeter', 'perimeter_on_border',
'perimeter_on_border_ratio'],
dtype='object')
statistics.head()
label  maximum  mean  median  minimum  sigma  sum  variance  elongation  feret_diameter  ...  equivalent_ellipsoid_diameter_0  equivalent_ellipsoid_diameter_1  equivalent_ellipsoid_diameter_2  equivalent_spherical_perimeter  equivalent_spherical_radius  number_of_pixels  number_of_pixels_on_border  perimeter  perimeter_on_border  perimeter_on_border_ratio  

0  1  143.0  117.489451  117.041016  93.0  9.489786  27845.0  90.056032  1.228690  8.774964  ...  6.517200  7.518360  9.237736  185.203713  3.839016  237  13  191.790349  13.0  0.067782 
1  2  113.0  83.052219  82.177734  65.0  9.699808  31809.0  94.086271  1.325096  13.152946  ...  7.202178  8.754764  11.600904  255.044898  4.505089  383  74  311.446414  74.0  0.237601 
2  3  130.0  108.930403  108.076172  92.0  7.557057  29738.0  57.109109  1.565911  12.884099  ...  5.449251  7.816819  12.240444  203.513187  4.024309  273  74  252.130963  74.0  0.293498 
3  4  129.0  94.576991  93.134766  70.0  11.433116  53436.0  130.716136  1.227027  14.352700  ...  7.665557  10.710899  13.142567  330.508847  5.128456  565  65  396.766310  65.0  0.163824 
4  5  149.0  119.454545  119.033203  89.0  12.017958  32850.0  144.431321  1.429829  10.723805  ...  6.109627  7.753855  11.086684  204.505937  4.034113  275  0  234.611278  0.0  0.000000 
5 rows × 22 columns
Feature selection#
Selecting the right features for differentiating objects in an art. We will select some features here manually and will explain in the next sections how a good educated guess for selecting features can be made.
selected_statistics = statistics[
[
'mean',
'variance',
'number_of_pixels',
'elongation',
'feret_diameter',
]
].values
Standard scaling#
The selected features need to be scaled so that all values range from 1 to 1. This is necessary for the following algorithms as they could misinterpret a perimeter of 5 microns less than a perimeter of 50 pixels, even if both might be exactly the same length physically (Read more).
scaled_statistics = StandardScaler().fit_transform(selected_statistics)
type(scaled_statistics), scaled_statistics.shape
(numpy.ndarray, (1200, 5))
Dimensionality reduction#
As the measured statistics are a large table with many columns, we cannot easily get a picture of the distribution of the data points. For a clustering algorithm it might also be challenging. Thus, we reduce the number of dimensions, e.g. using the Uniform Manifold Approximation Projection (UMAP).
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(scaled_statistics)
type(embedding), embedding.shape
(numpy.ndarray, (1200, 2))
seaborn.scatterplot(x=embedding[:, 0],
y=embedding[:, 1])
plt.xlabel('UMAP1')
plt.ylabel('UMAP2')
Text(0, 0.5, 'UMAP2')
Kmeans clustering#
Eventually, we will automatically annotate all data points with a class. In the following we will separate data points into two classes. Looking at the UMAP visualized above, we need an algorithm that takes relationsships of datapoints locally into account. Kmeans clustering has the capability to group data points in a way so that the distance between the data points to the cluster centers is minimal. It basically splits the data points along a region in the plot where the density of the data points is lower than in cluster centers, e.g. in the plot above where UMAP1 is approximately 5.
num_classes = 2
kmeans = KMeans(n_clusters=num_classes, random_state=42).fit(embedding)
kmeans_prediction = kmeans.predict(embedding)
C:\Users\rober\miniconda3\envs\bio_39\lib\sitepackages\sklearn\cluster\_kmeans.py:1332: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=5.
warnings.warn(
seaborn.scatterplot(x=embedding[:, 0],
y=embedding[:, 1],
hue=kmeans_prediction)
<AxesSubplot:>
We can also project the cluster identifier (0 or 1) back into the 3D image space. We just need to make sure we cannot mix up these two classes with background.
# suffix [0] represents background
# we add 1 to the measurement so that there is no
# cluster with ID 0, which corresponds to background.
cluster_id = [0] + (kmeans_prediction + 1).tolist()
kmeans_prediction_map = cle.replace_intensities(nuclei_labels, cluster_id)
cle.imshow(kmeans_prediction_map, labels=True)
Interpreting clustering results#
For technical reasons it is illegal to claim that kmeans clustering can differentiate serosa and embryo. The algorithm was not informed about what to differentiate and thus, just differentiates objects according to their properties. It is now the task of the scientists to figure out which parameters influenced the decision making. For example, if shapebased parameters dominated the decision making, a followup questions would be why so. It might be possible that nuclei within the embryo are differently shaped than nuclei within the serosa. But it is also possible that the nuclei segmentation algorithm caused objects to be segmented wrongly. Thus, clustering is just a tool for generating new hypotheses which need to be investigated further, before an algorithm can be designed that differentiates nuclei in serosa and embryo. Such an algorithm then also needs to be validated, e.g. by processing new, unseen datasets, and measuring the quality of the clustering quantitatively, e.g. by providing groundtruth annotations and comparing with them.