Definition

Aggregate metric tests allow you to define the expected level of model performance for the entire validation set or specific subpopulations. You can use any of the available metrics for the task type you are working on.

To compute most of the aggregate metrics supported, your data must contain ground truths.

For monitoring use cases, if your data is not labeled during publish/stream time, you can update ground truths later on. Check out the Updating data guide for details.

Taxonomy

  • Category: Performance.
  • Task types: LLM, tabular classification, tabular regression, text classification.
  • Availability: development and monitoring.

Why it matters

  • Aggregate metrics are a straightforward way to measure model performance.
  • Overall aggregate metrics (i.e., computed on the entire validation set or production data) are useful for getting a high-level view of model performance. However, we encourage you to go beyond them and also define tests for specific subpopulations.
  • Your model's performance is likely not uniform across different cohorts of the data, as in the image below. A better and more realistic way to reach high overall performance is to improve the model one slice of data at a time.

Available metrics

The aggregate metrics available for LLM projects are:

| Metric | Description | Measurement for the tests.json |
| --- | --- | --- |
| Answer relevancy* | Measures how relevant the answer (output) is given the question. Based on the Ragas response relevancy. | answerRelevancy |
| Answer correctness* | Compares and evaluates the factual accuracy of the generated response with respect to the reference. Based on the Ragas factual correctness. | answerCorrectness |
| Context precision* | Measures how relevant the context retrieved is given the question. Based on the Ragas context precision. | contextRelevancy |
| Context recall* | Measures the ability of the retriever to retrieve all necessary context for the question. Based on the Ragas context recall. | contextRecall |
| Correctness* | Correctness of the answer. Based on the Ragas aspect critique for correctness. | correctness |
| Harmfulness* | Harmfulness of the answer. Based on the Ragas aspect critique for harmfulness. | harmfulness |
| Coherence* | Coherence of the answer. Based on the Ragas aspect critique for coherence. | coherence |
| Conciseness* | Conciseness of the answer. Based on the Ragas aspect critique for conciseness. | conciseness |
| Maliciousness* | Maliciousness of the answer. Based on the Ragas aspect critique for maliciousness. | maliciousness |
| Faithfulness* | Measures the factual consistency of the generated answer against the given context. Based on the Ragas faithfulness. | faithfulness |
| Mean BLEU | Bilingual Evaluation Understudy score. Available precision from unigram to 4-gram (BLEU-1, 2, 3, and 4). | meanBleu1, meanBleu2, meanBleu3, meanBleu4 |
| Mean edit distance | Minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity. | meanEditDistance |
| Mean exact match | Assesses if two strings are identical in every aspect. | meanExactMatch |
| Mean JSON score | Measures how close the output is to a valid JSON. | meanJsonScore |
| Mean quasi-exact match | Assesses if two strings are similar, allowing partial matches and variations. | meanQuasiExactMatch |
| Mean semantic similarity | Assesses the similarity in meaning between sentences, by measuring their closeness in semantic space. | meanSemanticSimilarity |
| Mean, max, and total number of tokens | Statistics on the number of tokens. | meanTokens, maxTokens, totalTokens |
| Mean, max, and latency percentiles | Statistics on the response latency. | meanLatency, maxLatency, p90Latency, p95Latency, p99Latency |

* Metrics based on the Ragas framework. They rely on an LLM evaluator judging your submission. You can configure the underlying LLM used to compute these metrics. Check out the OpenAI or Anthropic integration guides for details.
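
While the Ragas-based metrics above rely on an LLM judge, the string-comparison aggregates are simple row-level scores averaged over the dataset. As a rough illustration only (not the platform's exact implementation, which may normalize casing, whitespace, or punctuation differently), here is how measurements like meanExactMatch and meanEditDistance could be computed:

```python
# Illustrative sketch: how mean aggregates such as meanExactMatch and
# meanEditDistance summarize per-row scores. The platform's exact
# normalization rules may differ.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions,
    deletions, or substitutions to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution
            ))
        prev = curr
    return prev[-1]

# Hypothetical model outputs and reference (ground truth) answers.
outputs = ["The capital of France is Paris.", "4"]
references = ["Paris is the capital of France.", "4"]

mean_exact_match = sum(o == r for o, r in zip(outputs, references)) / len(outputs)
mean_edit_distance = sum(edit_distance(o, r) for o, r in zip(outputs, references)) / len(outputs)

print(mean_exact_match)     # 0.5 -> only the second row matches exactly
print(mean_edit_distance)   # average number of character edits per row
```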

Test configuration examples

If you are writing a tests.json, here are a few valid configurations for aggregate metric tests:
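
The sketch below builds two example aggregate metric tests (one on the whole dataset and one on a subpopulation) and writes them to a tests.json file. The exact schema accepted by tests.json depends on your platform version; the field names used here (type, measurement, operator, value, subpopulationFilters) are illustrative assumptions, not the authoritative schema. Only the measurement keys come from the table above.

```python
# Illustrative only: the field names below (type, measurement, operator,
# value, subpopulationFilters) are hypothetical placeholders for the
# tests.json schema, not the authoritative format. The measurement keys
# (meanExactMatch, p95Latency) are taken from the table above.
import json

tests = [
    {
        # An overall aggregate metric test on the entire validation set.
        "name": "Mean exact match above 0.8",
        "type": "performance",
        "measurement": "meanExactMatch",
        "operator": ">=",
        "value": 0.8,
    },
    {
        # The same idea restricted to a subpopulation (hypothetical filter syntax).
        "name": "p95 latency under 2s for long prompts",
        "type": "performance",
        "measurement": "p95Latency",
        "operator": "<=",
        "value": 2000,  # milliseconds (assumed unit)
        "subpopulationFilters": [
            {"column": "prompt_length", "operator": ">", "value": 500}
        ],
    },
]

with open("tests.json", "w") as f:
    json.dump(tests, f, indent=2)
```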