# LLM-as-a-judge

> Learn how to use the LLM evaluation test

## Definition

The **LLM-as-a-judge** test lets you evaluate model or agent outputs using another LLM as an evaluator (or “judge”).
Instead of relying solely on quantitative metrics, you can define **descriptive evaluation criteria** such as:

* “The response should be polite and informative.”
* “Ensure the output is written in Portuguese.”
* “Verify that the model provides factual information about the query.”

Openlayer sends your model’s outputs and the specified criteria to an evaluator **LLM of
your choice** and asks it to grade each example.

For each evaluation, the judge provides both a **score** and an **explanation**.

<img width="700" style={{ borderRadius: "0.5rem" }} src="https://mintcdn.com/openlayer-44/GldoX8d6bHTGtfaU/images/documentation/llm_as_a_judge.png?fit=max&auto=format&n=GldoX8d6bHTGtfaU&q=85&s=65cb3f84a4de04780921c423b08974a4" data-path="images/documentation/llm_as_a_judge.png" />

## Taxonomy

* **Task types**: LLM.
* **Availability**: <Tooltip tip="Continuously evaluate your models and datasets as you iterate on their versions.">development</Tooltip>
  and <Tooltip tip="Monitor a model in production, measure its health, check for drifts and set up alerts.">monitoring</Tooltip>.

## Why it matters

Traditional metrics often fail to capture qualitative expectations. The LLM-as-a-judge test enables:

* **Subjective or stylistic evaluations** (e.g., tone, helpfulness, coherence).
* **Explainability**: every evaluation includes a rationale from the LLM.
* **Consistency**: the same rubric can be reused across model versions and production evaluations.

## How it works

Behind the scenes, the LLM-as-a-judge test goes through the following steps:

<Steps>
  <Step title="You define the rubric">
    You specify the evaluation criteria in natural language (e.g., “Ensure
    the text is polite and factual”) and the scoring method (e.g., binary or
    a score within the 0-1 range).

    <img width="700" style={{ borderRadius: "0.5rem" }} src="https://mintcdn.com/openlayer-44/GldoX8d6bHTGtfaU/images/documentation/llm_config.png?fit=max&auto=format&n=GldoX8d6bHTGtfaU&q=85&s=273fef266d2cdc2585d1bbe6c9d4db09" data-path="images/documentation/llm_config.png" />
  </Step>

  <Step title="Openlayer builds an evaluation prompt">
    For each data point, Openlayer constructs a prompt for the evaluator LLM,
    combining:

    * An internal base prompt that instructs the evaluator LLM to grade the data point based on the rubric
    * The original input and model output
    * Your rubric
    * The scoring format (`Yes/No` or `0-1`)
  </Step>

  <Step title="The evaluator LLM grades the data point">
    The evaluator LLM grades the data point based on the prompt and returns a
    score and an explanation.
  </Step>

  <Step title="Openlayer aggregates the scores and explanations">
    The scores are aggregated (e.g., into a mean score per criterion) to compute the test result, and the explanations are stored.

    <img width="700" style={{ borderRadius: "0.5rem" }} src="https://mintcdn.com/openlayer-44/GldoX8d6bHTGtfaU/images/documentation/llm_as_a_judge.png?fit=max&auto=format&n=GldoX8d6bHTGtfaU&q=85&s=65cb3f84a4de04780921c423b08974a4" data-path="images/documentation/llm_as_a_judge.png" />
  </Step>
</Steps>
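The steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not Openlayer's implementation: `BASE_PROMPT`, `Judgement`, `build_prompt`, and `aggregate` are hypothetical names, the real internal base prompt is not shown in this page, and the aggregation is assumed to be a simple mean.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical stand-in for Openlayer's internal base prompt (assumption).
BASE_PROMPT = (
    "You are an impartial evaluator. Grade the output against the rubric "
    "and explain your reasoning."
)


@dataclass
class Judgement:
    """One grade returned by the evaluator LLM for a single data point."""
    score: float       # 1.0 / 0.0 for binary scoring, or any value in [0, 1]
    explanation: str


def build_prompt(user_input: str, model_output: str, rubric: str, scoring: str) -> str:
    """Combine the base prompt, input, output, rubric, and scoring format."""
    if scoring not in ("Yes or No", "0-1"):
        raise ValueError(f"unsupported scoring format: {scoring!r}")
    return "\n\n".join([
        BASE_PROMPT,
        f"Input:\n{user_input}",
        f"Output:\n{model_output}",
        f"Rubric:\n{rubric}",
        f"Scoring format: {scoring}",
    ])


def aggregate(judgements: list[Judgement]) -> float:
    """Aggregate per-row scores into a single mean score (assumed aggregation)."""
    return mean(j.score for j in judgements)


prompt = build_prompt(
    "Where is my order?",
    "Happy to help! Your order shipped yesterday.",
    "Ensure outputs are polite and informative",
    "Yes or No",
)
# Hard-coded judgements stand in for real evaluator LLM responses.
judgements = [
    Judgement(1.0, "Polite and answers the question."),
    Judgement(0.0, "Curt and missing details."),
]
mean_score = aggregate(judgements)
```

In this sketch the explanations travel with the scores in each `Judgement`, so they remain available for inspection after aggregation.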

## Choosing the LLM judge

You can configure which LLM acts as the evaluator (“judge”) for the test. This can be
done either at the **project level** (default for all tests) or on a **per-test basis**.

You can choose from the following LLM providers:

* OpenAI
* Anthropic
* Azure OpenAI
* Amazon Bedrock
* Cohere
* Google
* Groq
* Mistral

If a provider is not connected, you’ll see a `⚠️ API not connected` indicator.
Follow the respective integration guide (e.g., [OpenAI](/integrations/openai#using-openai-llms-as-the-llm-judge),
[Anthropic](/integrations/anthropic#using-anthropic-llms-as-the-llm-judge)) to add credentials.

<Info>
  When deployed on-prem, Openlayer can also be configured to use an **internal
  gateway** instead of direct API calls. This enables centralized routing,
  caching, and compliance controls for evaluator LLMs.
</Info>

## Test configuration examples

If you are writing a `tests.json`, here are a few valid configurations for the LLM-as-a-judge test:

<CodeGroup>
  ```json Development theme={null}
  [
    {
      "name": "The output of the model is polite and informative",
      "description": "Uses another LLM to check if the output of the model is polite and informative",
      "type": "performance",
      "subtype": "llmRubricThresholdV2",
      "thresholds": [
        {
          "insightName": "llmRubricV2",
          "insightParameters": [
            {
              "name": "criteria_list",
              "value": [
                {
                  "name": "Polite and informative", // Name of the criterion
                  "criteria": "Ensure outputs are polite and informative", // Prompt for the LLM evaluator
                  "scoring": "Yes or No" // Must be either 'Yes or No' (for a binary evaluation) or '0-1' (for a score within the 0-1 range)
                }
              ]
            }
          ],
          "measurement": "criteria0MeanScore",
          "operator": ">=",
          "value": 1.0
        }
      ],
      "subpopulationFilters": null,
      "mode": "development",
      "usesValidationDataset": true, // Apply test to the validation set
      "usesTrainingDataset": false,
      "usesMlModel": false,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
    }
  ]
  ```

  ```json Monitoring theme={null}
  [
    {
      "name": "The output of the model is polite and informative",
      "description": "Uses another LLM to check if the output of the model is polite and informative",
      "type": "performance",
      "subtype": "llmRubricThresholdV2",
      "thresholds": [
        {
          "insightName": "llmRubricV2",
          "insightParameters": [
            {
              "name": "criteria_list",
              "value": [
                {
                  "name": "Polite and informative", // Name of the criterion
                  "criteria": "Ensure outputs are polite and informative", // Prompt for the LLM evaluator
                  "scoring": "Yes or No" // Must be either 'Yes or No' (for a binary evaluation) or '0-1' (for a score within the 0-1 range)
                }
              ]
            }
          ],
          "measurement": "criteria0MeanScore",
          "operator": ">=",
          "value": 1.0
        }
      ],
      "subpopulationFilters": null,
      "mode": "monitoring",
      "usesProductionData": true,
      "evaluationWindow": 3600, // 1 hour
      "delayWindow": 0,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
    }
  ]
  ```
</CodeGroup>
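To reason about the threshold in the configurations above, the sketch below evaluates a `measurement`/`operator`/`value` triple against a list of per-row scores. It rests on assumptions not confirmed by this page: that `criteria0MeanScore` is the mean score of the first criterion, that "Yes" maps to 1 and "No" to 0, and that operators other than `>=` exist. `OPERATORS` and `threshold_passes` are hypothetical names.

```python
import operator

# ">=" is the only operator shown in the examples above; the rest are assumptions.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def threshold_passes(scores: list[float], op: str, value: float) -> bool:
    """Compare the mean of the per-row scores against the threshold value."""
    if not scores:
        raise ValueError("at least one score is required")
    mean_score = sum(scores) / len(scores)
    return OPERATORS[op](mean_score, value)


# Binary scoring, assuming Yes -> 1.0 and No -> 0.0.
all_yes = [1.0, 1.0, 1.0]
one_no = [1.0, 0.0, 1.0]

all_yes_passes = threshold_passes(all_yes, ">=", 1.0)  # True
one_no_passes = threshold_passes(one_no, ">=", 1.0)    # False
```

Under these assumptions, a threshold of `>= 1.0` with `Yes or No` scoring passes only when every row is graded "Yes". A lower value, such as `0.8`, would tolerate some failing rows.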
