LLM evaluation
Definition
The LLM evaluation test allows you to create tests using an LLM as a judge. You can write descriptive evaluations like “Make sure the outputs are in Portuguese,” and Openlayer will use an LLM to grade your agent or model given this criterion. Besides producing a score, the LLM will also explain its evaluation.
Taxonomy
- Category: Performance.
- Task types: LLM.
Why it matters
- Sometimes, it is hard to evaluate a model’s performance using only a metric. For example, if you are building a chatbot, you might want to make sure that the bot does not use profanity. You can encode these more subjective criteria using the LLM evaluation test.
Test configuration examples
If you are writing a tests.json, here are a few valid configurations for the LLM evaluation test:
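The concrete examples were not preserved in this copy, so the snippet below is an illustrative sketch only. The field names (`type`, `description`, `threshold`) are assumptions for illustration, not the verified Openlayer schema — consult the Openlayer documentation for the exact format:

```json
{
  "tests": [
    {
      "type": "llmEvaluation",
      "description": "Make sure the outputs are in Portuguese",
      "threshold": 0.8
    }
  ]
}
```

The idea is that the `description` field carries the natural-language criterion the judge LLM grades against, and a threshold-like field sets the minimum passing score.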