Definition

The LLM evaluation test lets you create tests that use an LLM as a judge. You write a descriptive criterion such as “Make sure the outputs are in Portuguese,” and Openlayer uses an LLM to grade your agent or model against it. Besides producing a score, the LLM also explains its evaluation.

To use this test, you must select the underlying LLM used as the evaluator and provide the required API credentials. You can check the OpenAI and Anthropic integration guides for details.

Taxonomy

  • Category: Performance.
  • Task types: LLM.
  • Availability: development and monitoring.

Why it matters

  • Sometimes, it is hard to evaluate a model’s performance with quantitative metrics alone. For example, if you are building a chatbot, you might want to make sure that the bot never uses profanity. The LLM evaluation test lets you encode these more subjective criteria.

Test configuration examples

If you are writing a tests.json, here are a few valid configurations for the LLM evaluation test:
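The sketch below shows two entries, building on the Portuguese and profanity examples above. It assumes a schema where each entry carries a name, a type/subtype pair, the evaluation criterion, the evaluator model, and a score threshold; the field names used here (`subtype`, `evaluationCriterion`, `evaluatorModel`, `threshold`) and the model identifiers are illustrative assumptions, not Openlayer’s confirmed schema, so check the tests.json reference for the exact keys.

```json
[
  {
    "name": "Responses are in Portuguese",
    "type": "performance",
    "subtype": "llmEvaluation",
    "evaluationCriterion": "Make sure the outputs are in Portuguese.",
    "evaluatorModel": "gpt-4o",
    "threshold": 0.8
  },
  {
    "name": "No profanity in responses",
    "type": "performance",
    "subtype": "llmEvaluation",
    "evaluationCriterion": "The chatbot's responses must not contain profanity.",
    "evaluatorModel": "claude-3-5-sonnet-latest",
    "threshold": 1.0
  }
]
```

Since JSON does not allow comments, the assumptions are spelled out here instead: the evaluator model strings follow the naming used by the OpenAI and Anthropic integrations mentioned above, and `threshold` is read as the minimum acceptable judge score for the test to pass.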