# Selecting features

When selecting the right features, there are some rules of thumb one can take into account. For example, if objects should be segmented very precisely, small radius / sigma values should be used. If a rough outline is enough, or if individual pixels on the borders of objects should be eliminated, it makes sense to use larger radius and sigma values. However, this topic can also be approached using statistics.
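To build an intuition for this rule of thumb, the following minimal sketch (using the blobs.tif example image that is also loaded below; the sigma values are arbitrary choices for demonstration) compares a small and a large Gaussian blur:

from skimage.io import imread
import pyclesperanto_prototype as cle
import matplotlib.pyplot as plt

image = imread('../../data/blobs.tif')

fig, axs = plt.subplots(1, 2)

# small sigma: edges stay sharp, fine structures are preserved
cle.imshow(cle.gaussian_blur(image, sigma_x=1, sigma_y=1), plot=axs[0])

# large sigma: only the rough outline of the objects remains
cle.imshow(cle.gaussian_blur(image, sigma_x=10, sigma_y=10), plot=axs[1])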

from skimage.io import imread
import pyclesperanto_prototype as cle
import apoc
import matplotlib.pyplot as plt
import pandas as pd

# load the example image and a sparse manual annotation of it
image = imread('../../data/blobs.tif')
manual_annotation = imread('../../data/blobs_annotations.tif')

# visualize both side by side
fig, axs = plt.subplots(1, 2)

cle.imshow(image, plot=axs[0])
cle.imshow(manual_annotation, labels=True, plot=axs[1])
[Figure: the blobs example image (left) next to the sparse manual annotation (right)]

## Training - with too many features

We now train an object segmenter and provide many features. We also configure deep decision trees (max_depth=5) and many trees (num_ensembles=1000). This is necessary so that the next step, deriving statistics, has enough statistical power. Afterwards, we take a look at the result for a quick sanity check.

# define features
features = apoc.PredefinedFeatureSet.small_dog_log.value + " " + \
           apoc.PredefinedFeatureSet.medium_dog_log.value + " " + \
           apoc.PredefinedFeatureSet.large_dog_log.value

# this is where the model will be saved
cl_filename = '../../data/blobs_object_segmenter2.cl'

apoc.erase_classifier(cl_filename)
classifier = apoc.ObjectSegmenter(opencl_filename=cl_filename,
                                  positive_class_identifier=2,
                                  max_depth=5,
                                  num_ensembles=1000)
classifier.train(features, manual_annotation, image)

segmentation_result = classifier.predict(features=features, image=image)
cle.imshow(segmentation_result, labels=True)
[Figure: segmentation result of the classifier trained with many features]

## Classifier statistics

After training, we can print out some statistics of the classifier. It gives us a table of the used features and how important each feature was for making the pixel classification decision. The columns correspond to the depth levels of the decision trees.

shares, counts = classifier.statistics()

def colorize(styler):
    styler.background_gradient(axis=None, cmap="rainbow")
    return styler

df = pd.DataFrame(shares).T
df.style.pipe(colorize)
| feature | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| original | 0.138000 | 0.046423 | 0.042312 | 0.037281 | 0.062112 |
| gaussian_blur=1 | 0.228000 | 0.092846 | 0.074303 | 0.105263 | 0.055901 |
| difference_of_gaussian=1 | 0.000000 | 0.108828 | 0.095975 | 0.074561 | 0.086957 |
| laplace_box_of_gaussian_blur=1 | 0.000000 | 0.105784 | 0.089783 | 0.081140 | 0.099379 |
| gaussian_blur=5 | 0.096000 | 0.064688 | 0.118679 | 0.096491 | 0.130435 |
| difference_of_gaussian=5 | 0.254000 | 0.182648 | 0.112487 | 0.120614 | 0.118012 |
| laplace_box_of_gaussian_blur=5 | 0.209000 | 0.194064 | 0.121775 | 0.118421 | 0.124224 |
| gaussian_blur=25 | 0.004000 | 0.061644 | 0.113519 | 0.127193 | 0.080745 |
| difference_of_gaussian=25 | 0.032000 | 0.072298 | 0.122807 | 0.127193 | 0.130435 |
| laplace_box_of_gaussian_blur=25 | 0.039000 | 0.070776 | 0.108359 | 0.111842 | 0.111801 |

In this visualization you can see that on the first depth level (level 0), the features gaussian_blur=1, difference_of_gaussian=5 and laplace_box_of_gaussian_blur=5 make about 70% of the decision. If these three features are crucial, we can train another classifier that takes only these features into account. Furthermore, we see that the shares with which the features are used on the deeper levels are distributed more uniformly. These levels may not make a big difference when classifying pixels. Hence, we can train the next classifier with a lower max_depth.
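To read such numbers programmatically rather than by eye, one could for example sort the shares of the first depth level. A minimal sketch, reusing the shares dictionary retrieved above:

# rank features by their usage share on the first depth level (level 0)
level0_shares = pd.DataFrame(shares).T[0].sort_values(ascending=False)
print(level0_shares)

# combined share of the three most-used features
print("Top three features:", level0_shares.head(3).sum())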

# define features
features = "gaussian_blur=1 difference_of_gaussian=5 laplace_box_of_gaussian_blur=5"

# this is where the model will be saved
cl_filename = '../../data/blobs_object_segmenter3.cl'

apoc.erase_classifier(cl_filename)
classifier = apoc.ObjectSegmenter(opencl_filename=cl_filename,
                                  positive_class_identifier=2,
                                  max_depth=3,
                                  num_ensembles=1000)
classifier.train(features, manual_annotation, image)

segmentation_result = classifier.predict(features=features, image=image)
cle.imshow(segmentation_result, labels=True)
[Figure: segmentation result of the classifier trained with only three features]

The new classifier still produces a very similar result. It takes fewer features into account, which makes it faster, but potentially also less robust against differences between images and imaging conditions. We take another look at the classifier statistics:

shares, counts = classifier.statistics()
df = pd.DataFrame(shares).T
df.style.pipe(colorize)
| feature | 0 | 1 | 2 |
|---|---|---|---|
| gaussian_blur=1 | 0.331000 | 0.349194 | 0.344620 |
| difference_of_gaussian=5 | 0.356000 | 0.329839 | 0.337096 |
| laplace_box_of_gaussian_blur=5 | 0.313000 | 0.320968 | 0.318284 |
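To quantify the speed difference mentioned above, one could time the prediction of both classifiers. A minimal sketch, assuming the two classifier files written above are still on disk; absolute timings depend on hardware and the available OpenCL device:

import time

for filename in ['../../data/blobs_object_segmenter2.cl',
                 '../../data/blobs_object_segmenter3.cl']:
    # load the stored classifier and time a single prediction
    segmenter = apoc.ObjectSegmenter(opencl_filename=filename)
    start = time.perf_counter()
    segmenter.predict(image=image)
    print(filename, ':', time.perf_counter() - start, 's')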

For demonstration purposes, we will now train another classifier with very similar features.

# define features
features = "gaussian_blur=1 difference_of_gaussian=2 difference_of_gaussian=3 difference_of_gaussian=4 difference_of_gaussian=5 difference_of_gaussian=6 laplace_box_of_gaussian_blur=5"

# this is where the model will be saved
cl_filename = '../../data/blobs_object_segmenter3.cl'

apoc.erase_classifier(cl_filename)
classifier = apoc.ObjectSegmenter(opencl_filename=cl_filename,
                                  positive_class_identifier=2,
                                  max_depth=3,
                                  num_ensembles=1000)
classifier.train(features, manual_annotation, image)

segmentation_result = classifier.predict(features=features, image=image)
cle.imshow(segmentation_result, labels=True)
[Figure: segmentation result of the classifier trained with very similar features]

Again, the segmentation result looks very similar, but the classifier statistics are different.

shares, counts = classifier.statistics()
df = pd.DataFrame(shares).T
df.style.pipe(colorize)
| feature | 0 | 1 | 2 |
|---|---|---|---|
| gaussian_blur=1 | 0.200000 | 0.093750 | 0.162829 |
| difference_of_gaussian=2 | 0.000000 | 0.120888 | 0.148026 |
| difference_of_gaussian=3 | 0.053000 | 0.064967 | 0.125000 |
| difference_of_gaussian=4 | 0.236000 | 0.162829 | 0.097039 |
| difference_of_gaussian=5 | 0.178000 | 0.222862 | 0.125822 |
| difference_of_gaussian=6 | 0.134000 | 0.120066 | 0.194901 |
| laplace_box_of_gaussian_blur=5 | 0.199000 | 0.214638 | 0.146382 |

In this way, one can also fine-tune the radius and sigma parameters to use for the specified features.
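For example, averaging each feature's share over all depth levels gives a rough overall ranking, which can guide removing rarely used features or testing neighbouring sigma values. A minimal sketch, again reusing the shares dictionary:

# average usage share per feature across all depth levels
mean_shares = pd.DataFrame(shares).T.mean(axis=1).sort_values(ascending=False)
print(mean_shares)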

The hints given in this section are not solid rules for selecting the right features. The provided tools may help, though, to look a bit behind the features and to measure the influence that provided feature lists and parameters have.