Learn how to use aggregate metrics to evaluate your model
Metric | Description | Measurement name in tests.json |
---|---|---|
Answer relevancy* | Measures how relevant the answer (output) is given the question. Based on the Ragas response relevancy metric. | answerRelevancy |
Answer correctness* | Evaluates the factual accuracy of the generated response against the reference. Based on the Ragas factual correctness metric. | answerCorrectness |
Context precision* | Measures how relevant the retrieved context is given the question. Based on the Ragas context precision metric. | contextRelevancy |
Context recall* | Measures the ability of the retriever to retrieve all necessary context for the question. Based on the Ragas context recall metric. | contextRecall |
Correctness* | Correctness of the answer. Based on the Ragas aspect critique for correctness. | correctness |
Harmfulness* | Harmfulness of the answer. Based on the Ragas aspect critique for harmfulness. | harmfulness |
Coherence* | Coherence of the answer. Based on the Ragas aspect critique for coherence. | coherence |
Conciseness* | Conciseness of the answer. Based on the Ragas aspect critique for conciseness. | conciseness |
Maliciousness* | Maliciousness of the answer. Based on the Ragas aspect critique for maliciousness. | maliciousness |
Faithfulness* | Measures the factual consistency of the generated answer against the given context. Based on the Ragas faithfulness metric. | faithfulness |
Mean BLEU | Bilingual Evaluation Understudy score. Available for n-gram precision from unigram to 4-gram (BLEU-1, BLEU-2, BLEU-3, and BLEU-4). | meanBleu1, meanBleu2, meanBleu3, meanBleu4 |
Mean edit distance | Minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity. | meanEditDistance |
Mean exact match | Assesses if two strings are identical in every aspect. | meanExactMatch |
Mean JSON score | Measures how close the output is to valid JSON. | meanJsonScore |
Mean quasi-exact match | Assesses if two strings are similar, allowing partial matches and variations. | meanQuasiExactMatch |
Mean semantic similarity | Assesses the similarity in meaning between sentences, by measuring their closeness in semantic space. | meanSemanticSimilarity |
Mean, max, and total number of tokens | Statistics on the number of tokens. | meanTokens, maxTokens, totalTokens |
Mean, max, and latency percentiles | Statistics on the response latency. | meanLatency, maxLatency, p90Latency, p95Latency, p99Latency |
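
To make the string-based definitions in the table concrete, the following is a minimal Python sketch of how a mean edit distance and a mean exact match could be computed over a batch of outputs and references. It is an illustration only, not the platform's implementation: the function names (`edit_distance`, `mean_edit_distance`, `mean_exact_match`) are hypothetical, and the actual metrics may apply normalization or preprocessing that this sketch omits.

```python
from statistics import mean


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def mean_edit_distance(outputs, references):
    """Average raw edit distance over paired outputs and references (hypothetical helper)."""
    return mean(edit_distance(o, r) for o, r in zip(outputs, references))


def mean_exact_match(outputs, references):
    """Fraction of outputs that are exactly identical to their reference (hypothetical helper)."""
    return mean(1.0 if o == r else 0.0 for o, r in zip(outputs, references))
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion), and `mean_exact_match(["a", "b"], ["a", "c"])` evaluates to 0.5 because only one of the two pairs matches exactly.
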
For tests.json, here are a few valid configurations for the character length test: