Aggregate metric tests allow you to define the expected level of model performance for the entire validation set or specific subpopulations. You can use any of the available metrics for the task type you are working on.

To compute most of the aggregate metrics supported, your data must contain ground truths.

For monitoring use cases, if your data is not labeled during publish/stream time, you can update ground truths later on. Check out the Set up monitoring guide for details.


  • Category: Performance.
  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: and .

Why it matters

  • Aggregate metrics are a straightforward way to measure model performance.
  • Overall aggregate metrics (i.e., computed on the entire validation set or production data) are useful to get a high-level view of the model performance. However, we encourage you to go beyond them and also define tests for specific subpopulations.
  • The performance of our model is, likely, not uniform across different cohorts of the data, as in the image below. A better and more realistic approach to ultimately achieve a high model performance is to focus on improving the model one slice of data at a time.