
Definition

The LLM-as-a-judge test lets you evaluate model or agent outputs using another LLM as an evaluator (or “judge”). Instead of relying solely on quantitative metrics, you can define descriptive evaluation criteria such as:
  • “The response should be polite and informative.”
  • “Ensure the output is written in Portuguese.”
  • “Verify that the model provides factual information about the query.”
Openlayer sends your model’s outputs and the specified criteria to an evaluator LLM of your choice and asks it to grade each example. For each evaluation, the judge provides both a score and an explanation.
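
To make this concrete, here is a hypothetical sketch of what a single judge evaluation might look like. The field names are illustrative assumptions, not Openlayer's internal schema.

# Hypothetical shape of one judge evaluation (illustrative only; not Openlayer's schema).
evaluation = {
    "criterion": "The response should be polite and informative.",
    "input": "How do I reset my password?",
    "output": "You can reset it from the account settings page. Happy to help!",
    "score": 1,  # 1 = Yes for a binary rubric, or a value in the 0-1 range
    "explanation": "The response is courteous and directly answers the question.",
}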

Taxonomy

  • Task types: LLM.
  • Availability: development and monitoring.

Why it matters

Traditional metrics often fail to capture qualitative expectations. The LLM-as-a-judge test enables:
  • Subjective or stylistic evaluations (e.g., tone, helpfulness, coherence).
  • Explainability: every evaluation includes a rationale from the LLM.
  • Consistency: the same rubric can be reused across model versions or production evaluations.

How it works

Behind the scenes, the LLM-as-a-judge test goes through the following steps:
1. You define the rubric

You specify the evaluation criteria in natural language (e.g., “Ensure the text is polite and factual”) and the scoring method (e.g., binary or a score within the 0-1 range).
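
A rubric boils down to a criterion described in natural language plus a scoring method. The sketch below mirrors the criteria_list entry format used in the test configuration shown later on this page.

# A rubric entry, in the same shape as a criteria_list item from tests.json.
criterion = {
    "name": "Polite and factual",                         # name of the criterion
    "criteria": "Ensure the text is polite and factual",  # instruction for the judge
    "scoring": "Yes or No",                               # or "0-1" for a score in the 0-1 range
}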

2. Openlayer builds an evaluation prompt

For each data point, Openlayer constructs a prompt for the evaluator LLM by combining the following (a conceptual sketch follows the list):
  • An internal base prompt that instructs the evaluator LLM to grade the data point based on the rubric
  • The original input and model output
  • Your rubric
  • The scoring format (Yes/No or 0-1)
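
The exact base prompt is internal to Openlayer, so the sketch below is only a conceptual illustration of how these pieces might be combined; the function name and wording are assumptions, not Openlayer's implementation.

# Conceptual sketch only: the real base prompt used by Openlayer is internal
# and will differ from this illustration.
def build_judge_prompt(input_text: str, output_text: str, criteria: str, scoring: str) -> str:
    return (
        "You are an evaluator. Grade the model output against the rubric below.\n"
        f"Rubric: {criteria}\n"
        f"Scoring format: {scoring}\n\n"
        f"Input: {input_text}\n"
        f"Model output: {output_text}\n\n"
        "Return a score in the requested format and a brief explanation."
    )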

3. The evaluator LLM grades the data point

Based on this prompt, the evaluator LLM returns a score and an explanation for the data point.

4. Openlayer aggregates the scores and explanations

The scores are aggregated to compute the test's measurement (e.g., the mean score per criterion), and the explanations are stored so you can review the judge's rationale for each row.
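
As a rough illustration of the aggregation, assuming a binary rubric where “Yes” maps to 1 and “No” maps to 0 (the measurement name criteria0MeanScore in the configuration below suggests a per-criterion mean):

# Illustrative aggregation sketch, assuming Yes/No grades mapped to 1/0.
scores = [1, 1, 0, 1]                   # one score per evaluated data point
mean_score = sum(scores) / len(scores)  # 0.75

# The test then compares the aggregate against the configured threshold,
# e.g., "operator": ">=" with "value": 1.0 requires every row to pass.
test_passes = mean_score >= 1.0         # False in this example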

Choosing the LLM judge

You can configure which LLM acts as the evaluator (“judge”) for the test. This can be done either at the project level (default for all tests) or on a per-test basis. You can choose from the following LLM providers:
  • OpenAI
  • Anthropic
  • Azure OpenAI
  • Amazon Bedrock
  • Cohere
  • Google
  • Groq
  • Mistral
If a provider is not connected, you’ll see a ⚠️ API not connected indicator. Follow the respective integration guide (e.g., OpenAI, Anthropic) to add credentials.
When deployed on-prem, Openlayer can also be configured to use an internal gateway instead of direct API calls. This enables centralized routing, caching, and compliance controls for evaluator LLMs.

Test configuration examples

If you are writing a tests.json, here is a valid configuration for the LLM-as-a-judge test:
[
  {
    "name": "The output of the model is polite and informative",
    "description": "Uses another LLM to check if the output of the model is polite and informative",
    "type": "performance",
    "subtype": "llmRubricThresholdV2",
    "thresholds": [
      {
        "insightName": "llmRubricV2",
        "insightParameters": [
          {
            "name": "criteria_list",
            "value": [
              {
                "name": "Polite and informative", // Name of the criterion
                "criteria": "Ensure outputs are polite and informative", // Prompt for the LLM evaluator
                "scoring": "Yes or No" // Must be either 'Yes or No' (for a binary evaluation) or '0-1' (for a score within the 0-1 range)
              }
            ]
          }
        ],
        "measurement": "criteria0MeanScore",
        "operator": ">=",
        "value": 1.0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  }
]