Ragas

Ragas is an open-source library that offers metrics to evaluate large language model (LLM) applications. Openlayer’s integration with Ragas lets you create tests based on quality metrics such as harmfulness and faithfulness.

Tests with Ragas metrics

When evaluating LLM projects, you can leverage any of the Ragas metrics to create detailed tests. Each test provides:

A pass/fail status.
Row-by-row scoring and justification, provided by the LLM judge.
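As a rough illustration of how row-level scores could roll up into a pass/fail status, the sketch below averages per-row scores and compares the result to a threshold. The function name `evaluate_test`, the use of the mean as the aggregate, and the default threshold of 0.8 are all assumptions made for this example; they are not taken from Openlayer's implementation.

```python
from statistics import mean


def evaluate_test(row_scores, threshold=0.8):
    """Hypothetical helper: aggregate per-row judge scores into a pass/fail result.

    Assumption: the test passes when the mean row score is at least `threshold`.
    """
    if not row_scores:
        raise ValueError("row_scores must contain at least one score")
    aggregate = mean(row_scores)
    return {"aggregate": aggregate, "passed": aggregate >= threshold}


# Example: three rows scored by the LLM judge (made-up values).
result = evaluate_test([1.0, 0.5, 1.0], threshold=0.8)
print(result["passed"])
```

The row-by-row scores remain useful even when the test passes, since a low score on a single row points to a specific example worth inspecting.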

Metrics available

The Ragas metrics available on Openlayer are listed below.

All Ragas metrics rely on an LLM evaluator judging your submission. On Openlayer, you can configure the underlying LLM used to compute them. Check out the OpenAI or Anthropic integration guides for details.

Metric	Description	Required columns	`measurement` for `tests.json`
Answer relevancy	Evaluates how well the model’s answer aligns with the intent of the question. The evaluator LLM infers possible questions from the answer and compares them to the actual question using semantic similarity.	`input`, `outputs`	`answerRelevancy`
Answer correctness	Measures factual alignment between the generated answer and the ground truth reference. The evaluator breaks both into factual statements and compares them (true positives, false positives, false negatives).	`outputs`, `ground truth`	`answerCorrectness`
Context relevancy	Assesses whether the retrieved context is relevant to the ground truth answer. Each context chunk is judged independently by an LLM for its relevance.	`input`, `ground truth`, `context`	`contextRelevancy`
Context recall	Evaluates how completely the retrieved context supports all the claims present in the ground truth. High recall means most ground truth claims are supported by the retrieved context.	`ground truth`, `context`	`contextRecall`
Faithfulness	Measures how factually consistent the model’s answer is with the retrieved context. The evaluator identifies factual claims in the output and verifies if each is supported by the context.	`outputs`, `context`	`faithfulness`
Correctness	Judges the general factual soundness of the answer. The evaluator LLM rates the output’s correctness based on an aspect-based critique.	`input`, `outputs`	`correctness`
Harmfulness	Evaluates whether the answer contains harmful, unsafe, or toxic content. The evaluator LLM critiques the response through a safety-oriented lens.	`input`, `outputs`	`harmfulness`
Coherence	Measures how logically and linguistically coherent the answer is — i.e., whether it flows naturally and maintains internal consistency.	`input`, `outputs`	`coherence`
Conciseness	Evaluates whether the answer is clear and to the point, without unnecessary verbosity or repetition.	`input`, `outputs`	`conciseness`
Maliciousness	Detects whether the answer exhibits malicious intent, manipulation, or socially undesirable behavior.	`input`, `outputs`	`maliciousness`
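Several of the descriptions above reduce to simple arithmetic once the evaluator LLM has produced its verdicts. The sketch below shows that arithmetic for faithfulness, context recall, and the statement-overlap part of answer correctness. It assumes the judge's verdicts are already available as booleans or counts; the function names are hypothetical, and the actual scores may combine additional signals not shown here.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the answer that the judge found supported by the context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall_score(ground_truth_claim_supported: list[bool]) -> float:
    """Fraction of ground truth claims that the judge found supported by the context."""
    if not ground_truth_claim_supported:
        return 0.0
    return sum(ground_truth_claim_supported) / len(ground_truth_claim_supported)


def statement_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over factual statements.

    tp: statements present in both the answer and the ground truth
    fp: statements in the answer that are not in the ground truth
    fn: statements in the ground truth that are missing from the answer
    """
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


# Made-up verdicts: 3 of 4 answer claims are supported by the context.
print(faithfulness_score([True, True, False, True]))
# Made-up counts: 3 matching statements, 1 extra, 1 missing.
print(statement_f1(tp=3, fp=1, fn=1))
```

Viewing the metrics this way makes the required columns easier to remember: faithfulness only needs the output and the context, while context recall only needs the ground truth and the context.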
