Performance goals - subpopulations
Aggregate metrics computed over the whole validation set are a natural first performance goal, but they provide a low-resolution picture of what’s going on. Our model’s performance is likely not uniform across different cohorts of the data, as in the image below.
A better and more realistic way to ultimately achieve high model performance is to focus on improving the model one slice of data at a time. That’s why creating performance goals for subpopulations matters.
What are subpopulations?
Subpopulations are data cohorts. They can be defined by feature values (e.g., the data cohort where Age < 30 and CreditScore > 90) or by other criteria (e.g., a data cohort known to be critical from domain expertise).
The components of the performance goal creation page exist to help us break down the validation set into subpopulations. Among these components are the suggested subpopulations, the error heatmap, and the filters.
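Outside the platform, a subpopulation is simply a boolean mask over a dataset. Here is a minimal pandas sketch of the first example above, assuming a validation DataFrame with the feature columns mentioned (the file name and column names are illustrative):

```python
import pandas as pd

# Hypothetical validation set with the feature columns mentioned above.
val_df = pd.read_csv("validation.csv")

# A subpopulation defined by feature values: Age < 30 and CreditScore > 90.
mask = (val_df["Age"] < 30) & (val_df["CreditScore"] > 90)
subpopulation = val_df[mask]

print(f"{len(subpopulation)} of {len(val_df)} validation rows are in this cohort")
```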
Suggested subpopulations
The suggested subpopulations are in the sidebar of the performance goal creation page. They are found automatically by the Openlayer platform and represent feature value combinations that result in particularly high error rates.

Actionable insights
Note that despite the high aggregate performance on the whole validation set, there are subpopulations with much higher error rates. For example, the data slice defined by NumOfProducts <= 1.5 and Age <= 41.5 has an error rate of 46%.
The option to explore that data slice more deeply appears when we hover over the subpopulation component.
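The error rate reported for a slice is just the fraction of misclassified rows within it. A quick sketch, reusing the val_df from the snippet above and assuming it also carries a label column and a prediction column (both column names are assumptions):

```python
# The suggested slice: NumOfProducts <= 1.5 and Age <= 41.5.
slice_mask = (val_df["NumOfProducts"] <= 1.5) & (val_df["Age"] <= 41.5)
slice_df = val_df[slice_mask]

# Error rate = fraction of rows where the prediction disagrees with the label.
error_rate = (slice_df["prediction"] != slice_df["label"]).mean()
print(f"Slice error rate: {error_rate:.0%}")  # ~46% in the example above
```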
Explore a suggested subpopulation
Click “Explore subpopulation” for the subpopulation defined by NumOfProducts <= 1.5 and Age <= 41.5 (the one with an error rate of 46%). After doing so, the other page components are updated. Furthermore, the filters that define this subpopulation are added to the “Subpopulation explorer” sidebar.

This is an interesting subpopulation. From the “Metrics” section, we can see a significant performance gap compared to the full validation and training sets. Let’s see if we can break down this subpopulation even further.
Error heatmap
The error heatmap shows at a glance the error rate for data buckets within the subpopulation being explored. These buckets are defined by the two features selected in the dropdowns.
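Conceptually, the heatmap bins a numeric feature into buckets, crosses it with the second feature, and computes the error rate per cell. A rough pandas equivalent of what the component displays, continuing the sketch from above (the bin count is an assumption):

```python
# Bin the numeric feature; keep the categorical feature as-is.
age_buckets = pd.cut(val_df["Age"], bins=8)

# Error rate per (Age bucket, Gender) cell, mirroring the heatmap.
errors = val_df["prediction"] != val_df["label"]
heatmap = errors.groupby([age_buckets, val_df["Gender"]]).mean().unstack("Gender")
print(heatmap)  # rows: Age buckets; columns: Gender; values: error rates
```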
Select features in the error heatmap
Select Age as “Feature 1” and Gender as “Feature 2.” Then, click “Apply” to see the error heatmap for these two features.

Actionable insights
The error heatmap shows the model’s error rate for each bucket. For instance, for female users in the 38.12 to 41 age bucket, the error rate is 68%.
Note how the model performs much more poorly for females than for males across all the age groups shown.
Click one of the buckets
After clicking a bucket, you can see more information about it, such as the number of rows within it. There is also the option to add it to the set of filters.

Filtering
We can also add filters on feature values and labels to find subpopulations.
Since our model seems to have issues when Gender is equal to Female within the subpopulation we are exploring, let’s add a filter for it.
Adding filters
Add a filter for Gender is equal to Female. Then, click “Apply filters” to explore the resulting subpopulation.
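Under the hood, stacking filters amounts to intersecting boolean masks. Continuing the sketch, the subpopulation we have built up so far is:

```python
# Explored subpopulation filters plus the new Gender filter.
cohort_mask = (
    (val_df["NumOfProducts"] <= 1.5)
    & (val_df["Age"] <= 41.5)
    & (val_df["Gender"] == "Female")
)
cohort = val_df[cohort_mask]
```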

Actionable insights
Our model clearly has issues with this subpopulation. Not only are the aggregate metrics low (an F1 of 0.16, contrasting with 0.65 on the full validation set), but the confusion matrix shows that the mistakes are always of the same type: predicting Exited instead of Retained.
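These checks can be reproduced outside the platform with scikit-learn, reusing the cohort from the previous snippet and assuming Exited is the positive class:

```python
from sklearn.metrics import confusion_matrix, f1_score

# F1 on the subpopulation vs. the full validation set.
f1_slice = f1_score(cohort["label"], cohort["prediction"], pos_label="Exited")
f1_full = f1_score(val_df["label"], val_df["prediction"], pos_label="Exited")
print(f"F1 on slice: {f1_slice:.2f} vs. {f1_full:.2f} on the validation set")

# The confusion matrix reveals whether the errors all share one direction,
# e.g., predicting Exited when the true label is Retained.
print(confusion_matrix(cohort["label"], cohort["prediction"],
                       labels=["Retained", "Exited"]))
```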
Once we are satisfied with a subpopulation, we can create a goal for it.
Create a subpopulation performance goal
Click “Create goal” under the filters to create a subpopulation performance goal.
We can set a threshold of 0.7 for the accuracy score.
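Conceptually, the goal amounts to an assertion like the one below, evaluated against every new commit (a sketch of the idea, not the platform’s API):

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(cohort["label"], cohort["prediction"])
assert accuracy >= 0.7, f"Subpopulation goal failing: accuracy = {accuracy:.2f}"
```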

Solving failing goals
The poor performance on this subpopulation seems to be related to the fact that there are very few rows in the training set from this subpopulation (i.e., female users with Age <= 41.5 and NumOfProducts <= 1.5). We can arrive at this conclusion by inspecting the feature value histograms and the explainability scores for this subpopulation.
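One quick way to check the underrepresentation hypothesis is to compare how much of each split the failing subpopulation covers, assuming a train_df loaded the same way as val_df (the file name is hypothetical):

```python
train_df = pd.read_csv("training.csv")  # hypothetical file, same columns as val_df

def cohort_share(df: pd.DataFrame) -> float:
    """Fraction of rows that fall in the failing subpopulation."""
    mask = (
        (df["NumOfProducts"] <= 1.5)
        & (df["Age"] <= 41.5)
        & (df["Gender"] == "Female")
    )
    return mask.mean()

print(f"Training set coverage:   {cohort_share(train_df):.1%}")
print(f"Validation set coverage: {cohort_share(val_df):.1%}")
```

A much lower share on the training side supports the conclusion above.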
Here is a new Colab notebook where we strive to solve this issue and commit the new version to the platform.