In this part, we:

  • Define what performance tests are.
  • Create a performance test for the whole validation set.
  • Explore the performance test report, which includes row-level explainability.

Performance tests

Let’s start with the definition.

What are performance tests?

Performance tests define the expected level of model performance for the entire validation set or specific subpopulations.

For instance, we may aim for our model to achieve a minimum F1 score of 0.8 on the validation set. We can also establish more specific tests, such as a precision target of 0.75 for individuals aged 25-35 from Spain.
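At its core, a performance test is a threshold check on a metric computed over a dataset (or a slice of it). Here is a minimal sketch of that idea; the `f1_score` helper and the toy labels and predictions are illustrative, not the tutorial's data or the platform's implementation:

```python
# Illustrative sketch: a performance test passes when a metric computed on a
# dataset meets a threshold. Toy data, not the tutorial's validation set.
def f1_score(y_true, y_pred):
    """F1 for binary labels, computed from precision and recall."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def performance_test(y_true, y_pred, threshold=0.8):
    """Pass when the F1 score on the given set meets the threshold."""
    score = f1_score(y_true, y_pred)
    return score >= threshold, score

y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
passed, score = performance_test(y_true, y_pred, threshold=0.8)
print(f"F1 = {score:.2f} -> {'PASS' if passed else 'FAIL'}")  # F1 = 0.89 -> PASS
```

A subpopulation test works the same way, except the rows are first filtered by a condition (such as age 25-35 and country equal to Spain) before the metric is computed.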

Create a performance test

Click on “Performance” to go to the performance test creation page.

“Performance” is under “Create tests” on the left sidebar.

A good first performance test sets the expected model performance for the whole validation set. To create it, let’s first interpret the information displayed in the “Metrics” section.

Actionable insights

Our model performs much better on the training set than on the validation set.

Even though higher training performance is expected, the gap between training and validation performance helps us understand whether the model suffers from a bias or a variance issue. In our case, the model seems to overfit the training data, suggesting that variance-reduction strategies (such as regularization, gathering more data, or dropping features) could be beneficial.
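The diagnosis described above can be expressed as a simple rule. This is a hedged sketch: the example scores and the 0.1 gap tolerance are illustrative assumptions, not values the platform uses.

```python
# Sketch: diagnosing bias vs. variance from the train/validation gap.
# The 0.1 gap tolerance and 0.7 cutoff are illustrative assumptions.
def diagnose_fit(train_score, val_score, max_gap=0.1, min_train=0.7):
    gap = train_score - val_score
    if gap > max_gap:
        return "high variance (overfitting): try regularization, more data, or fewer features"
    if train_score < min_train:
        return "high bias (underfitting): try a more expressive model or better features"
    return "train/validation gap looks acceptable"

print(diagnose_fit(train_score=0.85, val_score=0.62))  # flags high variance
```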

The “Metrics” section also has a graph view with additional information.

You can click “Create test” on the left-hand panel to create a performance test for the whole validation set. By doing so, the test creation modal will show up and ask for a metric threshold.

Click “Create test” to create a performance test for the whole validation set.

The information from the “Metrics” section comes in handy when choosing a threshold for the whole validation set.

The training performance is almost an upper bound for the performance we can expect by regularizing this modeling approach. Therefore, starting with a threshold slightly below the training performance is a reasonable choice.

Let’s use an F1 threshold of 0.7. You can also add multiple metric thresholds to the same test.
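Conceptually, a test with multiple metric thresholds passes only when every threshold is met. Here is a minimal sketch of that logic; the metric names and values are placeholders for what the platform computes:

```python
# Sketch: a test with multiple metric thresholds. All thresholds must be met
# for the test to pass. Metric values below are placeholders.
def run_test(metrics, thresholds):
    per_metric = {name: metrics[name] >= minimum for name, minimum in thresholds.items()}
    return all(per_metric.values()), per_metric

validation_metrics = {"f1": 0.72, "precision": 0.78, "recall": 0.67}
thresholds = {"f1": 0.7, "precision": 0.75}

passed, per_metric = run_test(validation_metrics, thresholds)
print(passed, per_metric)  # True {'f1': True, 'precision': True}
```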

Explore the performance test report

After creating the test, we can see the test card on the tests page. Let’s explore the information shown in the performance test overview.

Click the newly created test to open the test overview page.

Scroll down to understand the supporting information available for diagnostics.

As usual, the test overview provides information that helps us understand and resolve the failed test. In this case, not only are the metrics and confusion matrix available, but also histograms of the model’s confidence distribution and of each feature’s values. Comparing such distributions can help us understand the model’s behavior. Often, underrepresented feature value ranges are the culprits for poor validation performance.
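To make the idea of underrepresented ranges concrete, here is a toy sketch that buckets a feature (age) into histogram bins and lists the bins that appear in validation but never in training. The data, the feature, and the bin width are all illustrative assumptions:

```python
# Sketch: spotting feature value ranges present in validation but absent from
# training, via simple histograms. Toy data; bin width of 10 is an assumption.
from collections import Counter

def bucket(values, width=10):
    """Histogram of values grouped into fixed-width bins keyed by bin start."""
    return Counter((v // width) * width for v in values)

train_ages = [25, 27, 31, 33, 28, 30, 26, 34]  # training rows: all 20s-30s
val_ages = [26, 29, 55, 58, 61, 32, 57, 60]    # validation rows: many 50s-60s

train_hist, val_hist = bucket(train_ages), bucket(val_ages)
underrepresented = sorted(b for b in val_hist if train_hist.get(b, 0) == 0)
print(underrepresented)  # [50, 60] -> bins the model never saw in training
```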

Another interesting piece of information available inside performance test reports is row-level explainability. In broad strokes, explainability techniques help justify our model’s predictions.
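To give a flavor of what row-level explainability means, here is a toy sketch that attributes a single prediction of a linear scoring model to its features by zeroing each one out. The weights, the row, and the zero baseline are illustrative assumptions; real platforms typically rely on techniques such as SHAP or LIME.

```python
# Toy sketch of row-level explainability: attribute one prediction to features
# by measuring how the score drops when each feature is zeroed out.
# Weights and row values are illustrative assumptions.
weights = {"age": 0.03, "income": 0.5, "num_purchases": -0.2}

def score(row):
    """A toy linear model: weighted sum of feature values."""
    return sum(weights[f] * v for f, v in row.items())

def explain_row(row):
    """Contribution of each feature = score change when that feature is zeroed."""
    base = score(row)
    return {f: base - score({**row, f: 0}) for f in row}

row = {"age": 30, "income": 1.2, "num_purchases": 4}
contributions = explain_row(row)
print(contributions)  # for a linear model, each contribution is weight * value
```

For this linear model the attribution is exact (each contribution equals weight times value); for non-linear models, dedicated explainability techniques approximate the same kind of per-row breakdown.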

Click on the “Row analysis” tab.

It’s right below the description field in the test overview page.

Now that we understand performance tests, we can start breaking down the validation set into subpopulations and creating individual tests for them. That’s what we will do in the next part of the tutorial!