Ragas is an open-source library that offers metrics to evaluate large language model (LLM) applications. Openlayer's integration with Ragas enables you to create tests using quality metrics such as harmfulness, faithfulness, and more.

Tests with Ragas metrics

When evaluating LLM projects, you can leverage any of the Ragas metrics to create detailed tests. Each test provides:
  • A pass/fail status.
  • Row-by-row scoring and justification, provided by the LLM judge.
Example: row-level results for the answer relevancy metric.

Metrics available

The Ragas metrics available on Openlayer are listed below.
All Ragas metrics rely on an LLM evaluator judging your submission. On Openlayer, you can configure the underlying LLM used to compute them. Check out the OpenAI or Anthropic integration guides for details.
| Metric | Description | Required columns | Measurement (for tests.json) |
| --- | --- | --- | --- |
| Answer relevancy | Evaluates how well the model's answer aligns with the intent of the question. The evaluator LLM infers possible questions from the answer and compares them to the actual question using semantic similarity. | input, outputs | answerRelevancy |
| Answer correctness | Measures factual alignment between the generated answer and the ground truth reference. The evaluator breaks both into factual statements and compares them (true positives, false positives, false negatives). | outputs, ground truths | answerCorrectness |
| Context relevancy | Assesses whether the retrieved context is relevant to the ground truth answer. Each context chunk is judged independently by an LLM for its relevance. | input, ground truth, context | contextRelevancy |
| Context recall | Evaluates how completely the retrieved context supports all the claims present in the ground truth. High recall means most ground truth claims are supported by the retrieved context. | ground truth, context | contextRecall |
| Faithfulness | Measures how factually consistent the model's answer is with the retrieved context. The evaluator identifies factual claims in the output and verifies whether each is supported by the context. | outputs, context | faithfulness |
| Correctness | Judges the general factual soundness of the answer. The evaluator LLM rates the output's correctness based on an aspect-based critique. | input, outputs | correctness |
| Harmfulness | Evaluates whether the answer contains harmful, unsafe, or toxic content. The evaluator LLM critiques the response through a safety-oriented lens. | input, outputs | harmfulness |
| Coherence | Measures how logically and linguistically coherent the answer is, i.e., whether it flows naturally and maintains internal consistency. | input, outputs | coherence |
| Conciseness | Evaluates whether the answer is clear and to the point, without unnecessary verbosity or repetition. | input, outputs | conciseness |
| Maliciousness | Detects whether the answer exhibits malicious intent, manipulation, or socially undesirable behavior. | input, outputs | maliciousness |
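To get a feel for what these metrics compute, the snippet below runs two of them with the Ragas library directly, outside Openlayer. It is a minimal sketch: the example row is made up, and it assumes a Ragas version whose `evaluate()` accepts a Hugging Face `Dataset` with `question`, `answer`, `contexts`, and `ground_truth` columns (roughly the input, outputs, context, and ground truth columns above) plus an `OPENAI_API_KEY` in the environment for the default evaluator LLM.

```python
# Minimal local sketch of Ragas metrics -- not the Openlayer integration itself.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Toy example row. Roughly: "input" -> question, "outputs" -> answer,
# "context" -> contexts (a list of chunks per row), "ground truth" -> ground_truth.
rows = {
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital and largest city of France."]],
    "ground_truth": ["Paris"],
}

dataset = Dataset.from_dict(rows)

# Each metric is scored by an evaluator LLM and returns a value between 0 and 1.
# By default Ragas uses OpenAI models, so OPENAI_API_KEY must be set.
result = evaluate(dataset, metrics=[answer_relevancy, faithfulness])
print(result)
```

On Openlayer, this computation happens behind the test: you reference the measurement name from the last column of the table, and the evaluator LLM is configured through the OpenAI or Anthropic integration rather than in code.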