> ## Documentation Index
> Fetch the complete documentation index at: https://docs.openlayer.com/llms.txt
> Use this file to discover all available pages before exploring further.

# BLEU score

> Learn how to use the BLEU score test

## Definition

The BLEU (Bilingual Evaluation Understudy) score test measures the quality of machine-generated text by comparing it to reference text. BLEU scores are available for unigram to 4-gram precision (BLEU-1, BLEU-2, BLEU-3, and BLEU-4).

## Taxonomy

* **Task types**: LLM.
* **Availability**: <Tooltip tip="Continuously evaluate your models and datasets as you iterate on their versions.">development</Tooltip>
  and <Tooltip tip="Monitor a model in production, measure its health, check for drifts and set up alerts.">monitoring</Tooltip>.

## Why it matters

* BLEU scores provide a standardized way to evaluate the quality of generated text against reference translations or expected outputs.
* Different n-gram levels capture different aspects of text quality: BLEU-1 focuses on word choice, while higher n-grams (BLEU-2 to BLEU-4) capture phrase structure and fluency.
* This metric is particularly useful for translation tasks, text summarization, and other text generation applications where you have reference outputs.

## Required columns

To compute this metric, your dataset must contain the following columns:

* **Outputs**: The generated text from your LLM
* **Ground truths**: The reference/expected text to compare against

## Test configuration examples

If you are writing a `tests.json`, here are a few valid configurations for the BLEU score test:

<CodeGroup>
  ```json Development - BLEU-1 theme={null}
  [
    {
      "name": "Mean BLEU-1 score above 0.6",
      "description": "Ensure that the mean BLEU-1 score is above 0.6",
      "type": "performance",
      "subtype": "metricThreshold",
      "thresholds": [
        {
          "insightName": "metrics",
          "insightParameters": null,
          "measurement": "meanBleu1",
          "operator": ">",
          "value": 0.6
        }
      ],
      "subpopulationFilters": null,
      "mode": "development",
      "usesValidationDataset": true,
      "usesTrainingDataset": false,
      "usesMlModel": false,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
    }
  ]
  ```

  ```json Development - BLEU-4 theme={null}
  [
    {
      "name": "Mean BLEU-4 score above 0.4",
      "description": "Ensure that the mean BLEU-4 score is above 0.4",
      "type": "performance",
      "subtype": "metricThreshold",
      "thresholds": [
        {
          "insightName": "metrics",
          "insightParameters": null,
          "measurement": "meanBleu4",
          "operator": ">",
          "value": 0.4
        }
      ],
      "subpopulationFilters": null,
      "mode": "development",
      "usesValidationDataset": true,
      "usesTrainingDataset": false,
      "usesMlModel": false,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
    }
  ]
  ```

  ```json Monitoring - BLEU-2 theme={null}
  [
    {
      "name": "Mean BLEU-2 score above 0.5",
      "description": "Ensure that the mean BLEU-2 score is above 0.5",
      "type": "performance",
      "subtype": "metricThreshold",
      "thresholds": [
        {
          "insightName": "metrics",
          "insightParameters": null,
          "measurement": "meanBleu2",
          "operator": ">",
          "value": 0.5
        }
      ],
      "subpopulationFilters": null,
      "mode": "monitoring",
      "usesProductionData": true,
      "evaluationWindow": 3600,
      "delayWindow": 0,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
    }
  ]
  ```
</CodeGroup>

## Available measurements

* `meanBleu1` - Mean BLEU-1 score (unigram precision)
* `meanBleu2` - Mean BLEU-2 score (bigram precision)
* `meanBleu3` - Mean BLEU-3 score (trigram precision)
* `meanBleu4` - Mean BLEU-4 score (4-gram precision)

## Related

* [Edit distance test](/tests/catalog/edit-distance) - Measure character-level similarity.
* [Exact match test](/tests/catalog/exact-match) - Assess identical string matches.
* [Semantic similarity test](/tests/catalog/semantic-similarity) - Measure meaning similarity.
* [Aggregate metrics](/tests/performance/aggregate-metrics) - Overview of all available metrics.
