
Definition

The LLM-as-a-judge test lets you evaluate model or agent outputs using another LLM as an evaluator (or “judge”). Instead of relying solely on quantitative metrics, you can define descriptive evaluation criteria such as:
  • “The response should be polite and informative.”
  • “Ensure the output is written in Portuguese.”
  • “Verify that the model provides factual information about the query.”
Openlayer sends your model’s outputs and the specified criteria to an evaluator LLM of your choice and asks it to grade each example. For each evaluation, the judge provides both a score and an explanation.
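
To make this concrete, here is a hypothetical sketch of what a single judge evaluation might look like. The field names are illustrative assumptions, not Openlayer's internal schema.

# Hypothetical shape of one judge evaluation (illustrative only; not Openlayer's schema).
evaluation = {
    "criterion": "The response should be polite and informative.",
    "input": "How do I reset my password?",
    "output": "You can reset it from the account settings page. Happy to help!",
    "score": 1,  # 1 = Yes for a binary rubric, or a value in the 0-1 range
    "explanation": "The response is courteous and directly answers the question.",
}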

Taxonomy

  • Task types: LLM.
  • Availability: development and monitoring.

Why it matters

Traditional metrics often fail to capture qualitative expectations. The LLM-as-a-judge test enables:
  • Subjective or stylistic evaluations (e.g., tone, helpfulness, coherence).
  • Explainability: every evaluation includes a rationale from the LLM.
  • Consistency: the same rubric can be reused across model versions or production evaluations.

How it works

Behind the scenes, the LLM-as-a-judge test goes through the following steps:
1. You define the rubric

You specify the evaluation criteria in natural language (e.g., “Ensure the text is polite and factual”) and the scoring method (e.g., binary or a score within the 0-1 range).
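
A rubric boils down to a criterion described in natural language plus a scoring method. The sketch below mirrors the criteria_list entry format used in the test configuration shown later on this page.

# A rubric entry, in the same shape as a criteria_list item from tests.json.
criterion = {
    "name": "Polite and factual",                         # name of the criterion
    "criteria": "Ensure the text is polite and factual",  # instruction for the judge
    "scoring": "Yes or No",                               # or "0-1" for a score in the 0-1 range
}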

2. Openlayer builds an evaluation prompt

For each data point, Openlayer constructs a prompt for the evaluator LLM by combining the following (a conceptual sketch follows the list):
  • An internal base prompt that instructs the evaluator LLM to grade the data point based on the rubric
  • The original input and model output
  • Your rubric
  • The scoring format (Yes/No or 0-1)
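
The exact base prompt is internal to Openlayer, so the sketch below is only a conceptual illustration of how these pieces might be combined; the function name and wording are assumptions, not Openlayer's implementation.

# Conceptual sketch only: the real base prompt used by Openlayer is internal
# and will differ from this illustration.
def build_judge_prompt(input_text: str, output_text: str, criteria: str, scoring: str) -> str:
    return (
        "You are an evaluator. Grade the model output against the rubric below.\n"
        f"Rubric: {criteria}\n"
        f"Scoring format: {scoring}\n\n"
        f"Input: {input_text}\n"
        f"Model output: {output_text}\n\n"
        "Return a score in the requested format and a brief explanation."
    )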

3. The evaluator LLM grades the data point

Based on this prompt, the evaluator LLM returns a score and an explanation for the data point.

4. Openlayer aggregates the scores and explanations

The scores are aggregated to compute the test's measurement (e.g., the mean score per criterion), and the explanations are stored so you can review the judge's rationale for each row.
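
As a rough illustration of the aggregation, assuming a binary rubric where “Yes” maps to 1 and “No” maps to 0 (the measurement name criteria0MeanScore in the configuration below suggests a per-criterion mean):

# Illustrative aggregation sketch, assuming Yes/No grades mapped to 1/0.
scores = [1, 1, 0, 1]                   # one score per evaluated data point
mean_score = sum(scores) / len(scores)  # 0.75

# The test then compares the aggregate against the configured threshold,
# e.g., "operator": ">=" with "value": 1.0 requires every row to pass.
test_passes = mean_score >= 1.0         # False in this example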

Choosing the LLM judge

You can configure which LLM acts as the evaluator (“judge”) for the test. This can be done either at the project level (default for all tests) or on a per-test basis. You can choose from the following LLM providers:
  • OpenAI
  • Anthropic
  • Azure OpenAI
  • Amazon Bedrock
  • Cohere
  • Google
  • Groq
  • Mistral
If a provider is not connected, you’ll see a ⚠️ API not connected indicator. Follow the respective integration guide (e.g., OpenAI, Anthropic) to add credentials.
When deployed on-prem, Openlayer can also be configured to use an internal gateway instead of direct API calls. This enables centralized routing, caching, and compliance controls for evaluator LLMs.

Test configuration examples

If you are writing a tests.json, here is a valid configuration for the LLM-as-a-judge test:
[
  {
    "name": "The output of the model is polite and informative",
    "description": "Uses another LLM to check if the output of the model is polite and informative",
    "type": "performance",
    "subtype": "llmRubricThresholdV2",
    "thresholds": [
      {
        "insightName": "llmRubricV2",
        "insightParameters": [
          {
            "name": "criteria_list",
            "value": [
              {
                "name": "Polite and informative", // Name of the criterion
                "criteria": "Ensure outputs are polite and informative", // Prompt for the LLM evaluator
                "scoring": "Yes or No" // Must be either 'Yes or No' (for a binary evaluation) or '0-1' (for a score within the 0-1 range)
              }
            ]
          }
        ],
        "measurement": "criteria0MeanScore",
        "operator": ">=",
        "value": 1.0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  }
]