Definition

The BLEU (Bilingual Evaluation Understudy) score test evaluates the quality of machine-generated text by measuring its n-gram overlap with reference text. Scores are available from unigram to 4-gram precision (BLEU-1, BLEU-2, BLEU-3, and BLEU-4).
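
For intuition, the sketch below computes the four scores for a single output/reference pair with NLTK's sentence_bleu. The whitespace tokenization and smoothing choice are illustrative assumptions; the exact implementation behind this test may differ.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# One generated output and one reference, both tokenized into words.
output = "the cat sat on the mat".split()
references = ["the cat is sitting on the mat".split()]  # list of reference token lists

# Smoothing avoids zero scores when a higher n-gram order has no matches.
smooth = SmoothingFunction().method1

# Cumulative n-gram weights for BLEU-1 through BLEU-4.
weights = {
    "BLEU-1": (1.0,),
    "BLEU-2": (0.5, 0.5),
    "BLEU-3": (1 / 3, 1 / 3, 1 / 3),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}

for name, w in weights.items():
    score = sentence_bleu(references, output, weights=w, smoothing_function=smooth)
    print(f"{name}: {score:.3f}")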

Taxonomy

  • Task types: LLM.

Why it matters

  • BLEU scores provide a standardized way to evaluate the quality of generated text against reference translations or expected outputs.
  • Different n-gram levels capture different aspects of text quality: BLEU-1 focuses on word choice, while higher n-grams (BLEU-2 to BLEU-4) capture phrase structure and fluency, as illustrated in the sketch after this list.
  • This metric is particularly useful for translation tasks, text summarization, and other text generation applications where you have reference outputs.
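
For example, a candidate that reuses the right words in a scrambled order keeps a high BLEU-1 but loses most of its BLEU-4, because few 4-gram sequences survive the reordering. A rough illustration with NLTK, under the same assumptions as the sketch above:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = ["the quick brown fox jumps over the lazy dog".split()]
fluent = "the quick brown fox jumps over the lazy dog".split()
scrambled = "dog lazy the over jumps fox brown quick the".split()

smooth = SmoothingFunction().method1
bleu1_weights = (1.0,)
bleu4_weights = (0.25, 0.25, 0.25, 0.25)

for name, candidate in [("fluent", fluent), ("scrambled", scrambled)]:
    s1 = sentence_bleu(references, candidate, weights=bleu1_weights, smoothing_function=smooth)
    s4 = sentence_bleu(references, candidate, weights=bleu4_weights, smoothing_function=smooth)
    print(f"{name}: BLEU-1={s1:.2f}  BLEU-4={s4:.2f}")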

Required columns

To compute this metric, your dataset must contain the following columns:
  • Outputs: The generated text from your LLM
  • Ground truths: The reference/expected text to compare against
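
For example, two rows of such a dataset might look like the following (the column names output and ground_truth are illustrative; use whatever names your dataset actually has):

# Illustrative rows pairing each generated output with its reference text.
rows = [
    {"output": "the cat sat on the mat", "ground_truth": "the cat is sitting on the mat"},
    {"output": "paris is the capital of france", "ground_truth": "the capital of france is paris"},
]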

Test configuration examples

If you are writing a tests.json, here is a valid configuration for the BLEU score test:
[
  {
    "name": "Mean BLEU-1 score above 0.6",
    "description": "Ensure that the mean BLEU-1 score is above 0.6",
    "type": "performance",
    "subtype": "metricThreshold",
    "thresholds": [
      {
        "insightName": "metrics",
        "insightParameters": null,
        "measurement": "meanBleu1",
        "operator": ">",
        "value": 0.6
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
  }
]
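
This configuration asserts that the mean BLEU-1 score (the meanBleu1 measurement), computed on the validation dataset, stays above 0.6. To threshold a different n-gram level, swap in one of the other measurements listed below.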

Available measurements

  • meanBleu1 - Mean BLEU-1 score (unigram precision)
  • meanBleu2 - Mean BLEU-2 score (bigram precision)
  • meanBleu3 - Mean BLEU-3 score (trigram precision)
  • meanBleu4 - Mean BLEU-4 score (4-gram precision)
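
As a rough sketch of how these means can be reproduced outside the test, you could average per-row sentence-level BLEU scores. The snippet below uses NLTK with cumulative n-gram weights and whitespace tokenization, which may differ from the exact implementation behind these measurements.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative (output, ground truth) pairs; real data comes from your dataset's
# output and ground-truth columns.
rows = [
    ("the cat sat on the mat", "the cat is sitting on the mat"),
    ("paris is the capital of france", "the capital of france is paris"),
]

smooth = SmoothingFunction().method1
weights = {
    "meanBleu1": (1.0,),
    "meanBleu2": (0.5, 0.5),
    "meanBleu3": (1 / 3, 1 / 3, 1 / 3),
    "meanBleu4": (0.25, 0.25, 0.25, 0.25),
}

for name, w in weights.items():
    # Score each row, then average across the dataset.
    scores = [
        sentence_bleu([truth.split()], output.split(), weights=w, smoothing_function=smooth)
        for output, truth in rows
    ]
    print(f"{name}: {sum(scores) / len(scores):.3f}")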