Quantitative labeling comparison#

Segmentation algorithms may produce different results. These differences may or may not be crucial, depending on the purpose of the scientific analysis these algorithms are used for.

In this notebook we will check if the number of segmented objects is different using Otsu’s thresholding method on blobs.tif and we will check if the area measurements are different between these two algorithms. The visual comparison performed before suggests that there should be a difference in area measurements.

import numpy as np
from skimage.io import imread, imsave
import matplotlib.pyplot as plt
import pandas as pd
from skimage.measure import regionprops
from pyclesperanto_prototype import imshow
from scipy.stats import describe
from scipy.stats import ttest_ind
from statsmodels.stats.weightstats import ttost_ind

Just as a recap, we take a quick look at the two label images. One was produced in ImageJ, the other using scikit-image.

blobs_labels_imagej = imread("blobs_labels_imagej.tif")
blobs_labels_skimage = imread("blobs_labels_skimage.tif")

imshow(blobs_labels_imagej, labels=True)
imshow(blobs_labels_skimage, labels=True)
../_images/cdc69e90459a3a8cba2befb95af73d5516bbe4e985521e69b09df76db2dc5308.png ../_images/5a68c38f79e7493551a8dc8b23f1f4a06090b9f51adb83488f404269f4575869.png

Comparing label counts#

First, we will count the number of objects in the two images. If the images are labeled subsequently, which means every integer label between 0 and the maximum of the labels exits, the maximum intensity in these label images corresponds to the number of labels present in the image.

blobs_labels_imagej.max(), blobs_labels_skimage.max()
(63, 64)

If the images are not subsequently labeled, we should first determine the unique sets of labels and count them. If background intensity (0) is present, these two numbers will be higher by one than the maximum.

len(np.unique(blobs_labels_imagej)), len(np.unique(blobs_labels_skimage))
(64, 65)

Comparing label counts from one single image gives limited insights. It shall be recommended to compare counts from multiple images and apply statistical tests as shown below. With these error analysis methods one can get deeper insights into how different the algorithms are.

Quantitative comparison#

Depending on the desired scientific analysis, the found number of objects may not be relevant, but the area of the found objects might be. Hence, we shoud compare how different area measurements between the algorithms are. Also this should actually be done using multiplel images. We demonstrate it with the single image to make it easisly reproducible.

First, we derive area measurements from the label images and take a quick look.

imagej_statistics = regionprops(blobs_labels_imagej)
imagej_areas = [p.area for p in imagej_statistics]
[415, 181, 646, 426, 465, 273, 74, 264, 221, 25, 485, 639, 92, 216, 432, 387, 503, 407, 257, 345, 149, 399, 407, 245, 494, 272, 659, 171, 350, 527, 581, 10, 611, 184, 584, 20, 260, 871, 461, 228, 158, 387, 398, 233, 364, 629, 364, 558, 62, 158, 450, 598, 525, 197, 544, 828, 267, 201, 1, 87, 73, 49, 46]
skimage_statistics = regionprops(blobs_labels_skimage)
skimage_areas = [p.area for p in skimage_statistics]
[433, 185, 658, 434, 477, 285, 81, 278, 231, 30, 501, 660, 99, 228, 448, 401, 520, 425, 271, 350, 159, 412, 426, 260, 506, 289, 676, 175, 361, 545, 610, 14, 641, 195, 593, 22, 268, 902, 473, 239, 167, 413, 415, 244, 377, 652, 379, 578, 69, 170, 472, 613, 543, 204, 555, 858, 281, 215, 3, 1, 81, 90, 53, 49]

Just to confirm our insights from above, we check the number of measurements

len(imagej_areas), len(skimage_areas)
(63, 64)

A simple and yet powerful approach for comparing quantitative measurements visually, is to draw histograms of the measurements, e.g. using matplotlib’s hist method. This method works for non-paired data and for paired datasets.

(array([10.,  5., 11.,  5., 12.,  6.,  7.,  5.,  0.,  2.]),
 array([  1.,  88., 175., 262., 349., 436., 523., 610., 697., 784., 871.]),
 <BarContainer object of 10 artists>)
(array([11.,  5., 11.,  6., 11.,  6.,  7.,  5.,  0.,  2.]),
 array([  1. ,  91.1, 181.2, 271.3, 361.4, 451.5, 541.6, 631.7, 721.8,
        811.9, 902. ]),
 <BarContainer object of 10 artists>)

The histograms look very similar, and small differences can be identified.

We can nicer to read overview by using scipy’s describe method.

DescribeResult(nobs=63, minmax=(1, 871), mean=339.8888888888889, variance=43858.100358422926, skewness=0.2985582216973995, kurtosis=-0.5512673189985389)
DescribeResult(nobs=64, minmax=(1, 902), mean=347.546875, variance=47422.15649801587, skewness=0.29237468285874324, kurtosis=-0.5465739129172253)

A bit easier to read is the output of pandas’ describe mwethod. In order to make it work for our data, we need to create to pandas DataFrames and concatenate them. This is necessary because we have measurements with different lengths (read more).

table1 = {
    "ImageJ": imagej_areas
table2 = {
    "scikit-image": skimage_areas

df = pd.concat([pd.DataFrame(table1), pd.DataFrame(table2)], axis=1)
ImageJ scikit-image
count 63.000000 64.000000
mean 339.888889 347.546875
std 209.423256 217.766289
min 1.000000 1.000000
25% 182.500000 182.500000
50% 350.000000 355.500000
75% 489.500000 502.250000
max 871.000000 902.000000

Student’s t-test - testing for differences#

We now know that the mean of the measurements are different. We should determine if the differences between the measurements are significant.

We can use the Student’s t-test for that using the null-hypothesis: Means of measurements are different. We use the ttest_ind method because we do not have paired datasets.

ttest_ind(imagej_areas, skimage_areas)
Ttest_indResult(statistic=-0.20194436015007275, pvalue=0.8402885093667958)

From the printed p-value we can not conclude that differences are insignificant. We can only say that according to the given sample, significance could not be shown.

Two-sided t-test for equivalence testing#

For proving that two algorithms perform similarly and means are different less that a given threshold, we can use a two-sided t-test, e.g. using statsmodels’ ttost_ind method. Our null-hypothesis: Means of measurements are more than 5% different.

five_percent_error_threshold = 0.05 * (np.mean(imagej_areas) + np.mean(skimage_areas)) / 2
ttost_ind(imagej_areas, skimage_areas, -five_percent_error_threshold, five_percent_error_threshold)
 (0.25125499758118736, 0.40101477051276024, 125.0),
 (-0.6551437178813329, 0.2567895351853574, 125.0))

Note to self: I’m not sure if I interpret the result correctly. I’m also not sure if I use this test correctly. If anyone reads this, and understands why the p-value here is 0.4, please get in touch: robert.haase@tu-dresden.de

    weights=(None, None),
test of (non-)equivalence for two independent samples

TOST: two one-sided t tests

null hypothesis:  m1 - m2 < low or m1 - m2 > upp
alternative hypothesis:  low < m1 - m2 < upp

where m1, m2 are the means, expected values of the two samples.

If the pvalue is smaller than a threshold, say 0.05, then we reject the
hypothesis that the difference between the two samples is larger than the
the thresholds given by low and upp.

x1 : array_like, 1-D or 2-D
    first of the two independent samples, see notes for 2-D case
x2 : array_like, 1-D or 2-D
    second of the two independent samples, see notes for 2-D case
low, upp : float
    equivalence interval low < m1 - m2 < upp
usevar : str, 'pooled' or 'unequal'
    If ``pooled``, then the standard deviation of the samples is assumed to be
    the same. If ``unequal``, then Welch ttest with Satterthwait degrees
    of freedom is used
weights : tuple of None or ndarrays
    Case weights for the two samples. For details on weights see
transform : None or function
    If None (default), then the data is not transformed. Given a function,
    sample data and thresholds are transformed. If transform is log, then
    the equivalence interval is in ratio: low < m1 / m2 < upp

pvalue : float
    pvalue of the non-equivalence test
t1, pv1 : tuple of floats
    test statistic and pvalue for lower threshold test
t2, pv2 : tuple of floats
    test statistic and pvalue for upper threshold test

The test rejects if the 2*alpha confidence interval for the difference
is contained in the ``(low, upp)`` interval.

This test works also for multi-endpoint comparisons: If d1 and d2
have the same number of columns, then each column of the data in d1 is
compared with the corresponding column in d2. This is the same as
comparing each of the corresponding columns separately. Currently no
multi-comparison correction is used. The raw p-values reported here can
be correction with the functions in ``multitest``.
File:      c:\users\rober\miniconda3\envs\bio_39\lib\site-packages\statsmodels\stats\weightstats.py
Type:      function