Introduction

In this part, we:

  • Motivate the importance of creating performance tests for subpopulations.
  • Use different components of the performance test creation page to explore subpopulations, namely the suggested subpopulations, the error heatmap, and filters.
  • Create a performance test for a subpopulation.

Why subpopulations?

Aggregate metrics computed over the whole validation set are a natural first performance test, but they provide a low-resolution picture of what’s going on. Our model’s performance is likely not uniform across different cohorts of the data, as in the image below.

A better and more realistic path to ultimately achieving high model performance is to improve the model one slice of data at a time. That is why creating performance tests for subpopulations matters.

What are subpopulations?

Subpopulations are data cohorts. They can be defined by feature values (e.g., the data cohort where Age < 30 and CreditScore > 90) or by other criteria (e.g., a data cohort known to be critical from domain expertise).
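Conceptually, a feature-value subpopulation is just a boolean mask over the dataset. A minimal sketch with pandas (the DataFrame, its values, and its column names are hypothetical, chosen to match the example above):

```python
import pandas as pd

# Hypothetical validation set; values and column names are made up
# to match the example in the text.
validation_df = pd.DataFrame({
    "Age": [25, 45, 29, 52],
    "CreditScore": [95, 80, 91, 60],
    "Gender": ["Female", "Male", "Female", "Male"],
})

# The cohort where Age < 30 and CreditScore > 90, expressed as a mask.
subpopulation = validation_df[
    (validation_df["Age"] < 30) & (validation_df["CreditScore"] > 90)
]
print(len(subpopulation))  # → 2
```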

The components of the performance test creation page exist to help us break down the validation set into subpopulations. Among the components are suggested subpopulations, the error heatmap, and filters.

Find subpopulations

Suggested subpopulations

The suggested subpopulations are on the sidebar of the performance test creation page. They are automatically found by the Openlayer platform and represent feature value combinations that result in particularly high error rates.

Actionable insights

Note that despite a high performance on the whole validation set, there are subpopulations with much higher error rates. For example, there is a data slice with an error rate of 82%.

By clicking on “Apply filters”, we can explore a suggested subpopulation.

Click “Explore subpopulation” for the subpopulation with an error rate of 82%.

After doing so, the other page components are updated. Furthermore, the filters that define this subpopulation are shown at the top of the page.

This is an interesting subpopulation. From the “Metrics” section, we can see that there is a significant gap in performance if we compare it to the validation and training sets. Let’s see if we can break down this subpopulation even further.

Error heatmap

The error heatmap shows at a glance the error rate for data buckets within the subpopulation being explored. These buckets are defined by the two features selected in the dropdowns.
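Under the hood, this kind of heatmap can be reproduced by bucketing one feature and pivoting on the other. A rough sketch with pandas, using synthetic data and a made-up per-row correctness column (this is an illustration of the computation, not the Openlayer API):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Age and Gender features, plus a per-row
# flag saying whether the model's prediction was correct.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n),
    "Gender": rng.choice(["Female", "Male"], size=n),
    "correct": (rng.random(n) > 0.3).astype(int),
})

# Bucket the numeric feature, then compute the error rate
# (1 - accuracy) for each (age bucket, Gender) cell.
df["age_bucket"] = pd.cut(df["Age"], bins=5)
heatmap = df.pivot_table(
    index="age_bucket",
    columns="Gender",
    values="correct",
    aggfunc=lambda s: 1 - s.mean(),
    observed=True,
)
print(heatmap.round(2))
```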

Select features in the error heatmap.

First, remove the filter on Gender at the top of the page by clicking the “x” button next to it and then “Apply filters”.

Select Age as “Feature 1” and Gender as “Feature 2.” Then, click “Apply” to see the error heatmap for these two features.

Actionable insights

The error heatmap shows the model’s error rate for each bucket. For instance, for female users aged between 38.13 and 41, the error rate is 68%.

Note how the model performs much worse for female users than for male users across all age groups shown.

Click one of the buckets.

After clicking a bucket, you can see more information about it, such as the number of rows within it. There is also the option to add it to the set of filters.

Filtering

We can also add filters using the feature values and labels to find subpopulations.

Since it seems like our model has issues when Gender is equal to Female for the subpopulation we are exploring, let’s add a filter for it.

Add back a filter for Gender equal to Female.

Then, click “Apply filters” to explore the resulting subpopulation.

Actionable insights

Our model clearly has issues with this subpopulation. Not only are the aggregate metrics low (an F1 of 0.29, compared to 0.51 on the full validation set), but the confusion matrix shows that the mistakes are always of the same type: predicting Exited instead of Retained.
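To make the confusion-matrix reading concrete: when every error is a Retained row predicted as Exited, tallying (true, predicted) pairs leaves only one kind of mismatch. A small sketch with invented labels (not the actual subpopulation data):

```python
from collections import Counter

# Invented labels for six rows of the subpopulation;
# "Retained"/"Exited" follow the labels used in this guide.
y_true = ["Retained", "Retained", "Retained", "Retained", "Exited", "Exited"]
y_pred = ["Exited", "Exited", "Exited", "Retained", "Exited", "Exited"]

# Tally (true label, predicted label) pairs; off-diagonal pairs are errors.
confusion = Counter(zip(y_true, y_pred))
errors = {pair: count for pair, count in confusion.items() if pair[0] != pair[1]}
print(errors)  # every mistake is ("Retained", "Exited")
```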

Once we are satisfied with a subpopulation, we can create a test for it.

Click “Create test” under the filters to create a subpopulation performance test.

We can set a threshold of 0.7 for the accuracy score.
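The test that gets created is equivalent to a simple thresholded assertion on the subpopulation’s accuracy. A hypothetical sketch of that logic (not the platform’s implementation):

```python
def subpopulation_accuracy_test(y_true, y_pred, threshold=0.7):
    """Return True iff accuracy on the subpopulation meets the threshold."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) >= threshold

# Accuracy here is 3/4 = 0.75, so the test passes at a 0.7 threshold.
print(subpopulation_accuracy_test([1, 1, 0, 0], [1, 1, 0, 1]))  # → True
```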

The poor performance for this subpopulation seems to be related to the fact that there are very few rows in the training set from this subpopulation (e.g., female users younger than 41.5 with fewer than 1.5 products). We can arrive at this conclusion by inspecting the feature value histograms and the explainability scores for this subpopulation.
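We can sanity-check this data-scarcity hypothesis outside the platform by counting training rows that satisfy the same filters. A sketch with a hypothetical training DataFrame (the column name NumOfProducts and all values are assumptions):

```python
import pandas as pd

# Hypothetical training set; NumOfProducts is an assumed column name.
training_df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Age": [30, 25, 50, 60, 35],
    "NumOfProducts": [1, 2, 1, 2, 3],
})

# Training rows matching the underperforming cohort:
# female, younger than 41.5, fewer than 1.5 products.
mask = (
    (training_df["Gender"] == "Female")
    & (training_df["Age"] < 41.5)
    & (training_df["NumOfProducts"] < 1.5)
)
print(f"{mask.sum()} of {len(training_df)} training rows")  # → 1 of 5
```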

New Colab notebook

Here is a new Colab notebook where we strive to solve this issue and commit the new version to the platform.