Training-validation leakage

On this page

Definition
Taxonomy
Why it matters
Test configuration examples

Definition

The training-validation lekage test allows you to detect training rows that are also present in the validation dataset.

Taxonomy

Category: Consistency.
Task types: LLM, tabular classification, tabular regression, text classification.
Availability: .

Why it matters

The training and validation datasets must be completely disjoint. Otherwise, all evaluation insights extracted from the validation set are unreliable and overly optimistic.

Test configuration examples

If you are writing a tests.json, here are a few valid configurations for the character length test:

[
  {
    "name": "No training-validation leakage",
    "description": "Asserts that no rows from the validation set are present in the training set",
    "type": "consistency",
    "subtype": "trainValLeakageRowCount",
    "thresholds": [
      {
        "insightName": "trainValLeakageRowCount",
        "insightParameters": null,
        "measurement": "trainValLeakageRowCount",
        "operator": "<=",
        "value": 0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": true,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  }
]

Size ratio Aggregate metrics

Get started

Set up tests

Test your system offline

Monitor your live system

Other resources

Training-validation leakage

Definition

Taxonomy

Why it matters

Test configuration examples

Get started

Set up tests

Test your system offline

Monitor your live system

Other resources

​Definition

​Taxonomy

​Why it matters

​Test configuration examples

Definition

Taxonomy

Why it matters

Test configuration examples