Definition

The session context retention test evaluates whether the assistant maintains and correctly uses context across the turns of a conversation. An LLM-as-a-judge reads the full session and scores it against four criteria:
  • Remembers facts and preferences established in prior turns
  • Builds upon previously established context rather than starting fresh each turn
  • Avoids asking for information the user has already provided
  • Doesn’t contradict information given earlier in the session
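
To make the judging step concrete, here is a minimal sketch of how a full session and the four criteria might be rendered into a single prompt for an LLM judge. The prompt wording, the `build_judge_prompt` name, and the `(user, assistant)` tuple format are all illustrative assumptions; the actual prompt used by the evaluator is not shown on this page.

```python
from typing import List, Tuple

# The four criteria listed above, restated as rubric lines.
CRITERIA = [
    "Remembers facts and preferences established in prior turns",
    "Builds upon previously established context rather than starting fresh each turn",
    "Avoids asking for information the user has already provided",
    "Doesn't contradict information given earlier in the session",
]


def build_judge_prompt(turns: List[Tuple[str, str]]) -> str:
    """Render a session and the rubric into one prompt string.

    `turns` is a list of (user_message, assistant_response) pairs, already
    in conversation order. The layout below is a hypothetical format.
    """
    transcript = "\n".join(
        f"Turn {i}\nUser: {user}\nAssistant: {assistant}"
        for i, (user, assistant) in enumerate(turns, start=1)
    )
    rubric = "\n".join(f"- {criterion}" for criterion in CRITERIA)
    return (
        "Rate how well the assistant retains context across this session.\n"
        "Return a score between 0 (no context retention) and 1 (perfect "
        "context retention).\n\n"
        f"Criteria:\n{rubric}\n\nSession:\n{transcript}"
    )
```

The judge sees the whole session at once rather than one turn at a time, which matches the session-level evaluation described above.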

Taxonomy

  • Task types: LLM.
  • Availability: and .
  • Evaluation level: session.
  • Polarity: higher score = better. 0 = no context retention, 1 = perfect context retention.

Why it matters

  • Context-retention failures, especially re-asking for information the user has already supplied, are a primary driver of user frustration in multi-turn assistants.

Required columns

  • Input: The user’s message in each turn.
  • Output: The assistant’s response in each turn.
  • Session ID: Groups turns belonging to the same conversation.
  • Timestamp: Used to reconstruct turn order within a session.

This metric relies on an LLM evaluator. On Openlayer, you can configure the underlying LLM used to compute it. Check out the OpenAI or Anthropic integration guides for details.
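
The role of the Session ID and Timestamp columns can be illustrated with a short sketch: rows are grouped by session and then sorted by timestamp to recover turn order. The field names (`session_id`, `timestamp`, `input`, `output`) and the `reconstruct_sessions` helper are assumptions made for this example, not the platform's actual schema or API.

```python
from collections import defaultdict
from typing import Dict, List


def reconstruct_sessions(rows: List[dict]) -> Dict[str, List[dict]]:
    """Group rows by session ID and order each session's turns by timestamp.

    Each row is assumed to carry hypothetical keys: "session_id",
    "timestamp" (any sortable value, e.g. Unix seconds), "input", "output".
    """
    sessions: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row)
    for turns in sessions.values():
        # Rows may arrive out of order; the timestamp restores turn order.
        turns.sort(key=lambda row: row["timestamp"])
    return dict(sessions)


rows = [
    {"session_id": "s1", "timestamp": 20, "input": "What's my name?", "output": "Ana."},
    {"session_id": "s2", "timestamp": 5, "input": "Hello", "output": "Hi!"},
    {"session_id": "s1", "timestamp": 10, "input": "I'm Ana.", "output": "Hi Ana!"},
]
sessions = reconstruct_sessions(rows)
```

Without a timestamp, the turns of session `s1` above would be judged in the wrong order, and a correct answer could look like a retention failure.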

Test configuration examples

[
  {
    "name": "Session context retention above 0.7",
    "description": "Ensure the assistant maintains context across session turns",
    "type": "performance",
    "subtype": "sessionContextRetention",
    "thresholds": [
      {
        "insightName": "sessionContextRetention",
        "measurement": "meanScore",
        "operator": ">=",
        "value": 0.7
      }
    ],
    "subpopulationFilters": null,
    "mode": "monitoring",
    "usesProductionData": true,
    "evaluationWindow": 3600,
    "delayWindow": 0
  }
]
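
As a rough illustration of what the threshold in this configuration expresses, the sketch below averages per-session scores and compares the mean against the configured value. The `threshold_passes` function and the operator table are assumptions for this example (only `>=` appears in the configuration above); this is not the platform's evaluation code.

```python
import operator
from statistics import mean
from typing import Sequence

# Hypothetical mapping from operator strings to comparisons.
# Only ">=" is confirmed by the configuration example above.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def threshold_passes(session_scores: Sequence[float], threshold: dict) -> bool:
    """Check a meanScore threshold against a set of session scores in [0, 1].

    Raises ValueError for measurements this sketch does not handle, and
    statistics.StatisticsError if `session_scores` is empty.
    """
    if threshold["measurement"] != "meanScore":
        raise ValueError(f"unsupported measurement: {threshold['measurement']}")
    compare = OPERATORS[threshold["operator"]]
    return compare(mean(session_scores), threshold["value"])


threshold = {
    "insightName": "sessionContextRetention",
    "measurement": "meanScore",
    "operator": ">=",
    "value": 0.7,
}
passing = threshold_passes([0.9, 0.6, 0.8], threshold)
failing = threshold_passes([0.5, 0.6], threshold)
```

Because the measurement is a mean, a single low-scoring session does not fail the test on its own; the first example passes even though one session scored 0.6.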