# Aggregate metrics

> Learn how to use aggregate metrics to evaluate your model

## Definition

Aggregate metric tests allow you to define the expected level of model performance for the entire validation set or specific subpopulations.
You can use any of the [available metrics](#available-metrics) for the task type you are working on.

<Info>
  To compute most of the aggregate metrics supported, your data must contain ground truths.

  For monitoring use cases, if your data is not labeled during publish/stream time, you can update
  ground truths later on. Check out the [Updating data guide](/monitoring/updating-data) for details.
</Info>

## Taxonomy

* **Task types**: LLM, tabular classification, tabular regression, text classification.
* **Availability**: <Tooltip tip="Continuously evaluate your models and datasets as you iterate on their versions.">development</Tooltip>
  and <Tooltip tip="Monitor a model in production, measure its health, check for drifts and set up alerts.">monitoring</Tooltip>.

## Why it matters

* Aggregate metrics are a straightforward way to measure model performance.
* Overall aggregate metrics (i.e., computed on the entire validation set or production data) are useful to get a high-level view of the model performance. However, we encourage you to go beyond them and also define tests for specific subpopulations.
* Your model's performance is likely not uniform across different cohorts of the data, as in the image below. A more realistic way to ultimately achieve high model performance is to improve the model one slice of data at a time.

<img width="700" style={{ borderRadius: "0.5rem" }} src="https://mintcdn.com/openlayer-44/H9yKzpc2_1D--vQB/images/tutorials/traditional-ml-development/subpopulations.svg?fit=max&auto=format&n=H9yKzpc2_1D--vQB&q=85&s=189a5ac1d09aa908bd12cda74a1d7b66" alt="Subpopulations" data-path="images/tutorials/traditional-ml-development/subpopulations.svg" />
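
To make the point about subpopulations concrete, here is a minimal sketch in plain Python. The rows, cohort names, and the `accuracy` helper are hypothetical and only illustrate the arithmetic; they are not part of the Openlayer API.

```python
from collections import defaultdict

# Hypothetical validation rows: (cohort, ground truth, prediction).
rows = [
    ("age<30", 1, 1), ("age<30", 0, 0), ("age<30", 1, 1),
    ("age<30", 0, 0), ("age<30", 1, 1), ("age<30", 0, 0),
    ("age>=60", 1, 0), ("age>=60", 0, 0), ("age>=60", 1, 0), ("age>=60", 1, 1),
]


def accuracy(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    pairs = list(pairs)
    return sum(y == p for y, p in pairs) / len(pairs)


# Aggregate metric over the entire dataset.
overall = accuracy((y, p) for _, y, p in rows)

# The same metric computed separately for each cohort.
by_cohort = defaultdict(list)
for cohort, y, p in rows:
    by_cohort[cohort].append((y, p))
per_cohort = {cohort: accuracy(pairs) for cohort, pairs in by_cohort.items()}

print(f"overall: {overall:.2f}")
for cohort, value in per_cohort.items():
    print(f"{cohort}: {value:.2f}")
```

In this toy data, a test on the overall accuracy with a threshold of 0.75 would pass even though one cohort is right only half of the time, which is why tests on specific subpopulations are worth defining.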

## Available metrics

<Tabs>
  <Tab title="LLM">
    The aggregate metrics available for **LLM** projects are:

    | Metric                                | Description                                                                                                                                                                                                                             | `measurement` for the `tests.json`                                    |
    | ------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
    | Answer relevancy\*                    | Measures how relevant the answer (output) is given the question. Based on the Ragas [response relevancy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/answer_relevance/).                                         | `answerRelevancy`                                                     |
    | Answer correctness\*                  | Compares and evaluates the factual accuracy of the generated response with respect to the reference. Based on the Ragas [factual correctness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/factual_correctness/). | `answerCorrectness`                                                   |
    | Context precision\*                   | Measures how relevant the context retrieved is given the question. Based on the Ragas [context precision](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/).                                       | `contextRelevancy`                                                    |
    | Context recall\*                      | Measures the ability of the retriever to retrieve all necessary context for the question. Based on the Ragas [context recall](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_recall/).                      | `contextRecall`                                                       |
    | Correctness\*                         | Correctness of the answer. Based on the Ragas [aspect critic](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#aspect-critic) for correctness.                                                       | `correctness`                                                         |
    | Harmfulness\*                         | Harmfulness of the answer. Based on the Ragas [aspect critic](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#aspect-critic) for harmfulness.                                                       | `harmfulness`                                                         |
    | Coherence\*                           | Coherence of the answer. Based on the Ragas [aspect critic](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#aspect-critic) for coherence.                                                           | `coherence`                                                           |
    | Conciseness\*                         | Conciseness of the answer. Based on the Ragas [aspect critic](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#aspect-critic) for conciseness.                                                       | `conciseness`                                                         |
    | Maliciousness\*                       | Maliciousness of the answer. Based on the Ragas [aspect critic](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#aspect-critic) for maliciousness.                                                   | `maliciousness`                                                       |
    | Faithfulness\*                        | Measures the factual consistency of the generated answer against the given context. Based on the Ragas [faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/faithfulness/).                                | `faithfulness`                                                        |
    | Mean BLEU                             | Bilingual Evaluation Understudy score. Available with n-gram precision from unigram to 4-gram (BLEU-1, 2, 3, and 4).                                                                                                                    | `meanBleu1`, `meanBleu2`, `meanBleu3`, `meanBleu4`                    |
    | Mean edit distance                    | Minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity.                                                                     | `meanEditDistance`                                                    |
    | Mean exact match                      | Assesses if two strings are identical in every aspect.                                                                                                                                                                                  | `meanExactMatch`                                                      |
    | Mean JSON score                       | Measures how close the output is to a valid JSON.                                                                                                                                                                                       | `meanJsonScore`                                                       |
    | Mean quasi-exact match                | Assesses if two strings are similar, allowing partial matches and variations.                                                                                                                                                           | `meanQuasiExactMatch`                                                 |
    | Mean semantic similarity              | Assesses the similarity in meaning between sentences, by measuring their closeness in semantic space.                                                                                                                                   | `meanSemanticSimilarity`                                              |
    | Mean, max, and total number of tokens | Statistics on the number of tokens.                                                                                                                                                                                                     | `meanTokens`, `maxTokens`, `totalTokens`                              |
    | Mean, max, and latency percentiles    | Statistics on the response latency.                                                                                                                                                                                                     | `meanLatency`, `maxLatency`, `p90Latency`, `p95Latency`, `p99Latency` |

    \* Metric based on the [Ragas](/integrations/ragas) framework. All of them rely on an LLM evaluator judging your submission. You can configure the underlying LLM used to compute these
    metrics. Check out the [OpenAI](/integrations/openai#openai-llm-evaluator) or [Anthropic](/integrations/anthropic#anthropic-llm-evaluator) integration guides for details.
  </Tab>

  <Tab title="Classification">
    The aggregate metrics available for **tabular classification** and **text classification** projects are:

    | Metric              | Description                                                                                                                                                                                                                | `measurement` for the `tests.json` |
    | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
    | Accuracy            | The classification accuracy. Defined as the ratio of the number of correctly classified samples and the total number of samples.                                                                                           | `accuracy`                         |
    | Precision per class | The precision score for each class. Given by TP / (TP + FP).                                                                                                                                                               | `precisionPerClass`                |
    | Recall per class    | The recall score for each class. Given by TP / (TP + FN).                                                                                                                                                                  | `recallPerClass`                   |
    | F1 per class        | The F1 score for each class. Given by 2 \* ( Precision \* Recall ) / ( Precision + Recall ).                                                                                                                               | `f1PerClass`                       |
    | Precision           | For **binary classification**, the precision considering class 1 as "positive." For **multiclass classification**, the macro-average of the precision score for each class, i.e., treating all classes equally.            | `precision`                        |
    | Recall              | For **binary classification**, the recall considering class 1 as "positive." For **multiclass classification**, the macro-average of the recall score for each class, i.e., treating all classes equally.                  | `recall`                           |
    | F1                  | For **binary classification**, the F1 considering class 1 as "positive." For **multiclass classification**, the macro-average of the F1 score for each class, i.e., treating all classes equally.                          | `f1`                               |
    | ROC AUC             | The **macro-average** of the area under the receiver operating characteristic curve score for each class, i.e., treating all classes equally. For multi-class classification tasks, uses the one-versus-one configuration. | `rocAuc`                           |
    | False positive rate | Given by FP / (FP + TN). The false positive rate is only available for **binary classification** tasks.                                                                                                                    | `falsePositiveRate`                |
    | Geometric mean      | The geometric mean of the precision and the recall.                                                                                                                                                                        | `geometricMean`                    |
    | Log loss            | Measure of the dissimilarity between predicted probabilities and the true distribution. Also known as cross-entropy loss or binary cross-entropy (in the binary classification case).                                      | `logLoss`                          |

    Where:

    * TP: true positive.
    * TN: true negative.
    * FP: false positive.
    * FN: false negative.
  </Tab>

  <Tab title="Regression">
    The aggregate metrics available for **tabular regression** projects are:

    | Metric                                | Description                                                                                                                                                         | `measurement` for the `tests.json` |
    | :------------------------------------ | :------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------- |
    | Mean squared error (MSE)              | Average of the squared differences between the predicted values and the true values.                                                                                | `mse`                              |
    | Root mean squared error (RMSE)        | The square root of the MSE.                                                                                                                                         | `rmse`                             |
    | Mean absolute error (MAE)             | Average of the absolute differences between the predicted values and the true values.                                                                               | `mae`                              |
    | R-squared                             | Also known as coefficient of determination. Quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. | `r2`                               |
    | Mean absolute percentage error (MAPE) | Average of the absolute percentage differences between the predicted values and the true values.                                                                    | `mape`                             |
  </Tab>
</Tabs>
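
The formulas in the Classification and Regression tabs can be sketched in plain Python as follows. This is an illustrative reading of the table definitions, not Openlayer's implementation. The function names are hypothetical, returning `0.0` for an undefined ratio is an assumption, and MAPE is reported here as a fraction rather than a percentage, which is also an assumption.

```python
import math


def safe_div(numerator, denominator):
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1, and false positive rate for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "falsePositiveRate": safe_div(fp, fp + tn),
    }


def macro_average(y_true, y_pred, key):
    """Unweighted mean of a per-class score, treating all classes equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = [binary_metrics(y_true, y_pred, positive=c)[key] for c in classes]
    return sum(scores) / len(scores)


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, R-squared, and MAPE. Assumes no true value is zero
    and that the true values are not all identical."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - (mse * n) / ss_tot,
        "mape": sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


clf = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
macro_precision = macro_average([0, 1, 2, 2], [0, 2, 2, 2], "precision")
reg = regression_metrics([2.0, 4.0, 6.0], [3.0, 4.0, 5.0])
print(clf)
print(macro_precision)
print(reg)
```

The dictionary keys mirror the `measurement` names from the tables so the values are easy to compare with the thresholds you set in a test.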

## Test configuration examples

If you are writing a `tests.json`, here are a few valid configurations for an aggregate metric test:

<CodeGroup>
  ```json Development theme={null}
  [
    {
      "name": "Mean answer relevancy greater than 0.8",
      "description": "Ragas-based answer relevancy over the data is greater than 0.8",
      "type": "performance",
      "subtype": "metricThreshold",
      "thresholds": [
        {
          "insightName": "metrics",
          "measurement": "answerRelevancy",
          "operator": ">",
          "value": 0.8
        }
      ],
      "subpopulationFilters": null,
      "mode": "development",
      "usesValidationDataset": true,
      "usesTrainingDataset": false,
      "usesMlModel": true,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
    }
  ]
  ```

  ```json Monitoring theme={null}
  [
    {
      "name": "Mean answer relevancy greater than 0.8",
      "description": "Ragas-based answer relevancy over the data is greater than 0.8",
      "type": "performance",
      "subtype": "metricThreshold",
      "thresholds": [
        {
          "insightName": "metrics",
          "measurement": "answerRelevancy",
          "operator": ">",
          "value": 0.8
        }
      ],
      "subpopulationFilters": null,
      "mode": "monitoring",
      "usesProductionData": true,
      "evaluationWindow": 3600,
      "delayWindow": 0,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
    }
  ]
  ```
</CodeGroup>
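
If you maintain many such tests, you may prefer to generate the file programmatically. The sketch below builds entries with the same fields as the examples above. The `metric_threshold_test` helper is hypothetical, and generating `syncId` as a fresh random UUID is an assumption that this page does not confirm.

```python
import json
import uuid


def metric_threshold_test(measurement, operator, value, mode="development"):
    """Build one aggregate metric test entry, mirroring the examples above."""
    test = {
        "name": f"{measurement} {operator} {value}",
        "description": f"{measurement} over the data is {operator} {value}",
        "type": "performance",
        "subtype": "metricThreshold",
        "thresholds": [
            {
                "insightName": "metrics",
                "measurement": measurement,
                "operator": operator,
                "value": value,
            }
        ],
        "subpopulationFilters": None,
        "mode": mode,
    }
    if mode == "development":
        test.update(
            usesValidationDataset=True,
            usesTrainingDataset=False,
            usesMlModel=True,
        )
    elif mode == "monitoring":
        # Window values copied from the monitoring example above.
        test.update(usesProductionData=True, evaluationWindow=3600, delayWindow=0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Assumption: any unique UUID string is acceptable as a syncId.
    test["syncId"] = str(uuid.uuid4())
    return test


tests = [
    metric_threshold_test("answerRelevancy", ">", 0.8),
    metric_threshold_test("faithfulness", ">", 0.9, mode="monitoring"),
]

# Print the JSON; redirect it to a tests.json file if the output looks right.
print(json.dumps(tests, indent=2))
```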