Random forest decision making statistics#

After training a random forest classifier, we can study its internal mechanics. APOC allows retrieving the number of decisions in the forest that are based on the given features.

from skimage.io import imread, imsave
import pyclesperanto_prototype as cle
import pandas as pd
import numpy as np
import apoc
import matplotlib.pyplot as plt

cle.select_device('RTX')
<NVIDIA GeForce RTX 3050 Ti Laptop GPU on Platform: NVIDIA CUDA (1 refs)>

For demonstration purposes, we use an image from David Legland, shared under CC-BY 4.0, available in the mathematical_morphology_with_MorphoLibJ repository.

We also add a label image that was generated in an earlier chapter.

image = cle.push(imread('../../data/maize_clsm.tif'))
labels = cle.push(imread('../../data/maize_clsm_labels.tif'))

fig, axs = plt.subplots(1, 2, figsize=(10,10))
cle.imshow(image, plot=axs[0])
cle.imshow(labels, plot=axs[1], labels=True)

We previously created an object classifier and now apply it to the pair of intensity and label images.

classifier = apoc.ObjectClassifier("../../data/maize_cslm_object_classifier.cl")
classification_map = classifier.predict(labels=labels, image=image)

cle.imshow(classification_map, labels=True, min_display_intensity=0)

Classifier statistics#

The loaded classifier can give us statistical information about its inner structure. The random forest classifier consists of many decision trees, and every decision tree consists of binary decisions on multiple levels. For example, a forest with 10 trees makes 10 decisions on the first level, as every tree makes at least this one decision. On the second level, every tree can make up to 2 decisions, which results in a maximum of 20 decisions on this level. We can now visualize how many decisions on every level take specific features into account. The statistics are given as two dictionaries, which can be visualized using pandas.

shares, counts = classifier.statistics()
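
As a quick plausibility check for the counting described above, the maximum number of decisions per level depends only on the number of trees. A minimal sketch using the illustrative 10-tree forest from the explanation above (not the classifier loaded here):

n_trees = 10  # illustrative forest size, as in the explanation above

# every tree contributes at most 2**level binary decisions on a given level
for level in range(3):
    print(f"level {level}: at most {n_trees * 2 ** level} decisions")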

First, we display the number of decisions on every level. As described above, the total number of decisions increases from lower to higher levels; in the table below, the levels go from left to right.

pd.DataFrame(counts).T
                                            0   1
area                                        4  33
mean_intensity                             32  44
standard_deviation_intensity               37  44
touching_neighbor_count                     8  28
average_distance_of_n_nearest_neighbors=6  19  34

The table above tells us that on the first level, the standard_deviation_intensity was taken into account in 37 decisions and the mean_intensity in 32 decisions, the two highest counts on this level. On the second level, both intensity features were taken into account in 44 decisions each, followed by the average distance of the n nearest neighbors with 34. You could argue that intensity and centroid distances between neighbors were the crucial parameters for differentiating objects.
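
The same reading can be done programmatically; a small sketch that looks up the most frequently used feature per depth level, assuming counts is the dictionary returned by statistics() above:

counts_df = pd.DataFrame(counts).T  # features as rows, depth levels as columns

# feature with the most decisions on each depth level
counts_df.idxmax(axis=0)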

Next, we look at the normalized shares, which are the counts divided by the total number of decisions made per depth level. We visualize this in colour to highlight features with high and low values.

def colorize(styler):
    styler.background_gradient(axis=None, cmap="rainbow")
    return styler

df = pd.DataFrame(shares).T
df.style.pipe(colorize)
                                                   0         1
area                                        0.040000  0.180328
mean_intensity                              0.320000  0.240437
standard_deviation_intensity                0.370000  0.240437
touching_neighbor_count                     0.080000  0.153005
average_distance_of_n_nearest_neighbors=6   0.190000  0.185792
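
As a cross-check, these shares can be reproduced by normalizing the counts column-wise, i.e. per depth level; a minimal sketch:

counts_df = pd.DataFrame(counts).T

# dividing every column by its sum gives the share of decisions per feature and level
counts_df / counts_df.sum(axis=0)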

Adding to the insights described above, we can also see here that the distribution of decisions across the features becomes more uniform on higher levels. Hence, one could consider training a classifier with just two depth levels.
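
If one wanted to follow up on that idea, APOC classifiers accept the tree depth and the number of trees when they are constructed. The following is only a sketch: the file name, the feature list and the my_annotation annotation image are placeholders, and the exact training call should be checked against the APOC documentation.

# a shallower forest with 2 depth levels and 100 trees
shallow_classifier = apoc.ObjectClassifier("shallow_object_classifier.cl",
                                            max_depth=2,
                                            num_ensembles=100)

# training would require a sparse annotation label image (my_annotation is a placeholder)
# shallow_classifier.train("area,mean_intensity,standard_deviation_intensity",
#                          labels, my_annotation, image)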

Feature importance#

A more common concept for studying the relevance of extracted features is the feature importance, which is computed from the classifier statistics shown above. It may be easier to interpret, as it is a single number describing each feature.

feature_importance = classifier.feature_importances()
feature_importance = {k:[v] for k, v in feature_importance.items()}
feature_importance
{'area': [0.1023460967511782],
 'mean_intensity': [0.27884719464885743],
 'standard_deviation_intensity': [0.34910187501327306],
 'touching_neighbor_count': [0.09231893555382481],
 'average_distance_of_n_nearest_neighbors=6': [0.1773858980328665]}
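
To get a quick ranking, the dictionary from above can also be sorted by its values; a minimal sketch:

# rank the features, most informative first
sorted(feature_importance.items(), key=lambda kv: kv[1][0], reverse=True)
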
def colorize(styler):
    styler.background_gradient(axis=None, cmap="rainbow")
    return styler

df = pd.DataFrame(feature_importance).T
df.style.pipe(colorize)
                                                   0
area                                        0.102346
mean_intensity                              0.278847
standard_deviation_intensity                0.349102
touching_neighbor_count                     0.092319
average_distance_of_n_nearest_neighbors=6   0.177386
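
Since matplotlib is already imported, the same numbers can also be drawn as a bar chart; a minimal sketch:

pd.DataFrame(feature_importance).T.plot.bar(legend=False)
plt.ylabel("feature importance")
plt.tight_layout()
plt.show()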