LLM evaluation
Definition
The LLM evaluation test allows you to create tests using an LLM as a judge. You can write descriptive evaluations like “Make sure the outputs are in Portuguese,” and Openlayer will use an LLM to grade your agent or model given this criterion. Besides producing a score, the LLM will also explain its evaluation.
Taxonomy
- Category: Performance.
- Task types: LLM.
Why it matters
- Sometimes, it is hard to evaluate a model’s performance using only a metric. For example, if you are building a chatbot, you might want to make sure that the bot does not use profanity. You can encode these more subjective criteria using the LLM evaluation test.
Test configuration examples
If you are writing a tests.json, here are a few valid configurations for the LLM evaluation test:
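The concrete examples were not preserved in this copy, so the snippet below is an illustrative sketch only. The field names (`type`, `description`, `threshold`) are assumptions for illustration, not the verified Openlayer schema — consult the Openlayer documentation for the exact format:

```json
{
  "tests": [
    {
      "type": "llmEvaluation",
      "description": "Make sure the outputs are in Portuguese",
      "threshold": 0.8
    }
  ]
}
```

The idea is that the `description` field carries the natural-language criterion the judge LLM grades against, and a threshold-like field sets the minimum passing score.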