Definition
The BLEU (Bilingual Evaluation Understudy) score test measures the quality of machine-generated text by comparing it to reference text. BLEU scores are available for unigram to 4-gram precision (BLEU-1, BLEU-2, BLEU-3, and BLEU-4).
Taxonomy
- Task types: LLM.
- Availability:
Why it matters
- BLEU scores provide a standardized way to evaluate the quality of generated text against reference translations or expected outputs.
- Different n-gram levels capture different aspects of text quality: BLEU-1 focuses on word choice, while higher n-grams (BLEU-2 to BLEU-4) capture phrase structure and fluency (see the formula after this list).
- This metric is particularly useful for translation tasks, text summarization, and other text generation applications where you have reference outputs.
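For reference, the standard BLEU formulation (Papineni et al., 2002) combines modified n-gram precisions up to order N with a brevity penalty; the smoothing applied to short outputs can vary by implementation:

$$\text{BLEU-}N = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \text{BP} = \min\!\left(1,\ e^{\,1 - r/c}\right)$$

where $p_n$ is the modified n-gram precision, $w_n = 1/N$ are uniform weights, $r$ is the reference length, and $c$ is the length of the generated text.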
Required columns
To compute this metric, your dataset must contain the following columns:
- Outputs: The generated text from your LLM
- Ground truths: The reference/expected text to compare against
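For illustration only, a single dataset row could pair the two columns as shown below; the key names output and ground_truth are placeholders, so map them to whatever your dataset's output and ground-truth columns are actually called:

```json
{
  "output": "the cat sat on the mat",
  "ground_truth": "the cat is sitting on the mat"
}
```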
Test configuration examples
If you are writing a tests.json, here are a few valid configurations for the BLEU score test:
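As a sketch of the general shape such an entry might take, a configuration could assert that a BLEU measurement stays at or above a threshold. The field names used here (name, metric, measurement, operator, threshold) are illustrative assumptions, not the confirmed tests.json schema:

```json
[
  {
    "name": "Mean BLEU-4 stays above 0.3",
    "metric": "bleu",
    "measurement": "meanBleu4",
    "operator": ">=",
    "threshold": 0.3
  },
  {
    "name": "Mean BLEU-1 stays above 0.5",
    "metric": "bleu",
    "measurement": "meanBleu1",
    "operator": ">=",
    "threshold": 0.5
  }
]
```

Check the tests.json reference for the exact field names and accepted operators before adapting a sketch like this.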
Available measurements
- meanBleu1: Mean BLEU-1 score (unigram precision)
- meanBleu2: Mean BLEU-2 score (bigram precision)
- meanBleu3: Mean BLEU-3 score (trigram precision)
- meanBleu4: Mean BLEU-4 score (4-gram precision)
Related
- Edit distance test - Measure character-level similarity.
- Exact match test - Assess identical string matches.
- Semantic similarity test - Measure meaning similarity.
- Aggregate metrics - Overview of all available metrics.