Clustering with UMAPs

Clustering with UMAPs#

Clustering objects can be challenging when working with many parameters, in particular when interacting with data manually. To reduce the number of parameters, dimensionality reduction techniques such as the Uniform Manifold Approximation Projection (UMAP) have been developed. In this notebook we use the technique to differentiate nuclei in an image which are mitotic from those which are not mitotic.

Example data#

First, we load the example image (taken from the scikit-image examples) showing human cell nuclei.

image = cle.asarray(human_mitosis())
image

Downloading file 'data/mitosis.tif' from 'https://gitlab.com/scikit-image/data/-/raw/2cdc5ce89b334d28f06a58c9f0ca21aa6992a5ba/AS_09125_050116030001_D03f00d0.tif' to 'C:\Users\haase\AppData\Local\scikit-image\scikit-image\Cache\0.20.0'.

cle._ image

shape	(512, 512)
dtype	float32
size	1024.0 kB
min	7.0
max	255.0

labels = cle.voronoi_otsu_labeling(image, spot_sigma=2.5, outline_sigma=0)
labels

cle._ image

shape	(512, 512)
dtype	uint32
size	1024.0 kB
min	0.0
max	320.0

Feature extraction#

We then measure properties of the nuclei such as intensity, size, shape, perimeter and moments.

nuclei_statistics = label_statistics(image, labels, 
                                     intensity=True, 
                                     size=True, 
                                     shape=True, 
                                     perimeter=True,
                                     moments=True)
nuclei_statistics.head()

	label	maximum	mean	median	minimum	sigma	sum	variance	elongation	feret_diameter	...	number_of_pixels_on_border	perimeter	perimeter_on_border	perimeter_on_border_ratio	principal_axes0	principal_axes1	principal_axes2	principal_axes3	principal_moments0	principal_moments1
0	1	83.0	69.886364	73.359375	48.0	9.890601	3075.0	97.823996	1.858735	9.486833	...	10	25.664982	10.0	0.389636	0.998995	-0.044825	0.044825	0.998995	1.926448	6.655680
1	2	46.0	42.125000	43.328125	38.0	2.748376	337.0	7.553571	0.000000	7.000000	...	8	15.954349	8.0	0.501431	1.000000	0.000000	0.000000	1.000000	0.000000	5.250000
2	3	53.0	44.916667	45.265625	38.0	4.461111	539.0	19.901515	3.609511	6.000000	...	7	14.843629	7.0	0.471583	1.000000	0.000000	0.000000	1.000000	0.243056	3.166667
3	4	92.0	65.603175	65.609375	38.0	14.475273	4133.0	209.533538	1.212933	10.000000	...	7	29.687257	7.0	0.235791	0.914511	0.404561	-0.404561	0.914511	4.227271	6.219189
4	5	59.0	46.315068	46.234375	38.0	5.065890	3381.0	25.663242	1.249507	11.401754	...	9	34.264893	9.0	0.262660	0.363116	-0.931744	0.931744	0.363116	5.001247	7.808286

5 rows × 27 columns

# list all columns we measured
nuclei_statistics.keys()

Index(['label', 'maximum', 'mean', 'median', 'minimum', 'sigma', 'sum',
       'variance', 'elongation', 'feret_diameter', 'flatness', 'roundness',
       'equivalent_ellipsoid_diameter_0', 'equivalent_ellipsoid_diameter_1',
       'equivalent_spherical_perimeter', 'equivalent_spherical_radius',
       'number_of_pixels', 'number_of_pixels_on_border', 'perimeter',
       'perimeter_on_border', 'perimeter_on_border_ratio', 'principal_axes0',
       'principal_axes1', 'principal_axes2', 'principal_axes3',
       'principal_moments0', 'principal_moments1'],
      dtype='object')

Feature selection#

In the image it is obvious that dividing nuclei are brighter than others. Furthermore, they apper elongated. Thus, we select intensity and shape-based features.

selected_table = nuclei_statistics[
    [
        "mean",
        "variance",
        "elongation",
    ]
]

selected_statistics = selected_table.values

Standard scaling#

We then scale those measurements so that intensity levels and distances can be interpreted in a balanced way (Read more).

scaled_statistics = StandardScaler().fit_transform(selected_statistics)

type(scaled_statistics), scaled_statistics.shape

(numpy.ndarray, (320, 3))

Plotting#

For demonstration purposes, we plot the three selected features against each other.

def hide_current_axis(*args, **kwds):
    plt.gca().set_visible(False)

g = seaborn.PairGrid(selected_table)
g.map(seaborn.scatterplot)

<seaborn.axisgrid.PairGrid at 0x240b01e9c40>

../_images/436ad93dd3e8a4f79a687c69a00ee08da589fc93aeb8ded27ad8fd95386fed44.png

From these plots, one could presume that datapoints with high mean intensity, variance and/or elongation are mitotic. However, there is no clear group of data points that are characteristically different from others, which could be easily differentiated.

Dimensionality reduction#

To demonstrate the UMAP algorithm, we now reduce these three dimensions to two.

reducer = umap.UMAP()
embedding = reducer.fit_transform(scaled_statistics)
type(embedding), embedding.shape

(numpy.ndarray, (320, 2))

seaborn.scatterplot(x=embedding[:, 0], 
                    y=embedding[:, 1])

<AxesSubplot: >

../_images/8a71619bdc701615a02e9d6c6f035f3a8273676cfb89e1026ed95b754032f675.png

A note on repeatability#

The algorithm behind the UMAP is partially a non-deterministic. Thus, if you run the same code twice, the result might look slightly different.

reducer = umap.UMAP()
embedding2 = reducer.fit_transform(scaled_statistics)

seaborn.scatterplot(x=embedding2[:, 0], 
                    y=embedding2[:, 1])

<AxesSubplot: >

../_images/69ad30fb43b1e6e6c994962827b5b70ff8805bd4039033a071146062089346ef.png

reducer = umap.UMAP()
embedding2 = reducer.fit_transform(scaled_statistics)

seaborn.scatterplot(x=embedding2[:, 0], 
                    y=embedding2[:, 1])

<AxesSubplot: >

../_images/8c953c05dd0d397f71032670c59dd73f15de7323ed9b6a27b8a59e2d9923f23d.png

This limitation can be circumvented by providing a non-random seed random_state. However, it does not solve the general limitation. If our input data is slightly different, e.g. coming from a different image showing different cells, we may not receive the same UMAP result.

reducer = umap.UMAP(random_state=42)
embedding3 = reducer.fit_transform(scaled_statistics)

seaborn.scatterplot(x=embedding3[:, 0], 
                    y=embedding3[:, 1])

<AxesSubplot: >

../_images/36e1037604bdea0390c95dbebfb0d1cc8874c51bda436933f753eb8fd16c34f9.png

reducer = umap.UMAP(random_state=42)
embedding4 = reducer.fit_transform(scaled_statistics)

seaborn.scatterplot(x=embedding4[:, 0], 
                    y=embedding4[:, 1])

<AxesSubplot: >

nuclei_statistics["UMAP1"] = embedding4[:, 0]
nuclei_statistics["UMAP2"] = embedding4[:, 1]

Manual clustering#

We can mark regions in the UMAP plot interactively (e.g. using the napari-clusters-plotter). To mimik this in a notebook, we set a manual threshold on a single UMAP axis to mark a region of data points we would like to investigate further. As mentioned above, as the UMAP result may not be 100% repeatable, we might need to adapt this threshold after generating a new UMAP on a different dataset.

def manual_threshold(x):
    if x < 9:
        return 1
    else:
        return 2
    
nuclei_statistics["MANUAL_CLUSTER_ID"] = [
    manual_threshold(x) 
    for x in nuclei_statistics["UMAP1"]
]

seaborn.scatterplot(
    x=nuclei_statistics["UMAP1"],
    y=nuclei_statistics["UMAP2"],
    hue=nuclei_statistics["MANUAL_CLUSTER_ID"],
)

<AxesSubplot: xlabel='UMAP1', ylabel='UMAP2'>

../_images/5e76ddced09708f3fb245f82a29aebc19496baf05a149ec47450c499dd4988e9.png

Cluster visualization in image space#

# put a 0 for background in front
new_values = [0] + nuclei_statistics["MANUAL_CLUSTER_ID"].tolist()

print(new_values[:10])

[0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

cluster_id_image = cle.replace_intensities(labels, new_values)
cle.imshow(cluster_id_image, labels=True)

../_images/ef8b6341414a6a0bfaac7755c4c75dd1dc4b32e88b4475bb5b73cda82e394fd4.png

label_edges = cle.detect_label_edges(labels)
coloured_label_edges = label_edges * cluster_id_image

def show_crop(x=0, y=0, size=100):
    fig, axs = plt.subplots(1,3, figsize=(10,10))
    
    cle.imshow(image[x:x+size, y:y+size], plot=axs[0], 
               min_display_intensity=0,
               max_display_intensity=255)
    
    cle.imshow(cluster_id_image[x:x+size, y:y+size], labels=True, plot=axs[1], 
               min_display_intensity=0,
               max_display_intensity=3)

    cle.imshow(image[x:x+size, y:y+size], continue_drawing=True, plot=axs[2], 
               min_display_intensity=0,
               max_display_intensity=255)
    cle.imshow(coloured_label_edges[x:x+size, y:y+size], labels=True, alpha=0.5, 
               min_display_intensity=0,
               max_display_intensity=3,
                  plot=axs[2])
    
    axs[0].set_title("Image crop")
    axs[1].set_title("Cluster")
    axs[2].set_title("Overlay")
    

show_crop()
show_crop(400, 0)
show_crop(0, 400)

../_images/de5372ab76589e3511486579a7fac7367fcacd143ae774f24d86107cfe04f816.png

../_images/5a792b20b95c174752d24173dc13348aa5875d6a8175b83b61c1730c960e2f8f.png

../_images/2ce939c12d3e4fb1e4fd8e1d1b3afe44b0a8679eb42a1e882710fa16e7b254fb.png

Feature selection#

When drawing UMAPs, it matters which parameters are selected. The procedure above demonstrates that it is possible to produce a UMAP and select a region in it which represents cells with a given phenontype. When choosing wrong parameters, a UMAP can still be generated but may not allow differentiating phenotypes. To demonstrate this, we take the result from the manual clustering step above as ground-truth for the next demonstration. We draw a UMAP from parameters which are supposedly not useful for differentiating phenotypes. In this UMAP, we will color the objects’s manual cluster ID. Principal moments and axes are describing how cells are oriented in space. The phenotype mitotic vs. non-mitotic should not be related with those parameters.

selected_statistics2 = nuclei_statistics[
    [
        "principal_moments0",
        "principal_moments1",
        "principal_axes0",
        "principal_axes1",
        "principal_axes2",
        "principal_axes3",
    ]
].values

scaled_statistics2 = StandardScaler().fit_transform(selected_statistics2)

type(scaled_statistics2), scaled_statistics2.shape

(numpy.ndarray, (320, 6))

reducer = umap.UMAP()
embedding5 = reducer.fit_transform(scaled_statistics2)

seaborn.scatterplot(embedding5[:, 0], embedding5[:, 1], 
                    hue=nuclei_statistics["MANUAL_CLUSTER_ID"])

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[22], line 4
      1 reducer = umap.UMAP()
      2 embedding5 = reducer.fit_transform(scaled_statistics2)
----> 4 seaborn.scatterplot(embedding5[:, 0], embedding5[:, 1], 
      5                     hue=nuclei_statistics["MANUAL_CLUSTER_ID"])

TypeError: scatterplot() takes from 0 to 1 positional arguments but 2 positional arguments (and 1 keyword-only argument) were given

Thus, when experimenting with dimensionality reduction, it is always useful to have some annota:ed datasets available. Without annotations and without the possibility to mark regions in the a UMAP, it may be hard to interpret these plots.

selected_statistics3 = nuclei_statistics[
    [
        "label",
        "minimum"
    ]
].values

scaled_statistics3 = StandardScaler().fit_transform(selected_statistics3)

type(scaled_statistics3), scaled_statistics3.shape

UMAPs of random and non-informative measurements#

Our measurement table above also contains columns which may not at all be related to the phenotype. The column label is a subsequent enumeration of all object. Assuming the objects are enumerated in random order makes this effectively a random number. The used segmentation algorithm uses a thresholding technique under the hood. Thus, the minimum intensity of all segmented objects is presumably close to this threshold. This leads to the column minimum containing almost constant values with some random variation related to noise in the image. If we use such columns to generate a UMAP from, the algorithm will generate a snake-like arrangement of data points.

reducer = umap.UMAP()
embedding6 = reducer.fit_transform(scaled_statistics3)

seaborn.scatterplot(embedding6[:, 0], embedding6[:, 1], 
                    hue=nuclei_statistics["MANUAL_CLUSTER_ID"])

Interpreting this UMAP may not lead to reasonable results. Just as comparison, we draw the plot of the two selected columns.

seaborn.scatterplot(x=nuclei_statistics["label"], 
                    y=nuclei_statistics["minimum"], 
                    hue=nuclei_statistics["MANUAL_CLUSTER_ID"])

Clustering with UMAPs

Contents

Clustering with UMAPs#

Example data#

Feature extraction#

Feature selection#

Standard scaling#

Plotting#

Dimensionality reduction#

A note on repeatability#

Manual clustering#

Cluster visualization in image space#

Feature selection#

UMAPs of random and non-informative measurements#