Integrity
Duplicate rows
Definition
The duplicate rows test checks whether the dataset contains rows that are identical to one another.
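To make the definition concrete, here is a minimal sketch of how such a check could be computed. It is not the platform's implementation: the function name `count_duplicate_rows` and the sample data are hypothetical, and it assumes each row is a sequence of hashable values compared for exact equality.

```python
from collections import Counter


def count_duplicate_rows(rows):
    """Count rows that are exact copies of another row in the dataset.

    Hypothetical helper for illustration: the first occurrence of a row
    is not counted, only its additional copies.
    """
    counts = Counter(tuple(row) for row in rows)
    return sum(n - 1 for n in counts.values() if n > 1)


# Illustrative data, not taken from any real dataset.
rows = [
    ("alice", 34, "yes"),
    ("bob", 27, "no"),
    ("alice", 34, "yes"),  # exact copy of the first row
    ("carol", 41, "no"),
]

duplicate_count = count_duplicate_rows(rows)
print(duplicate_count)  # prints 1
```

Counting only the extra copies means a dataset with no repeated rows yields 0, which is a natural value to compare against a threshold.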
Taxonomy
- Category: Integrity.
- Task types: LLM, tabular classification, tabular regression, text classification.
- Availability: and .
Why it matters
- Duplicate rows in the training set can lead the model to overfit to the duplicated examples.
- Duplicate rows in the validation set can distort the aggregate metrics, making them overly optimistic or pessimistic.
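The effect on aggregate metrics can be illustrated with a small worked example. This is a sketch with made-up labels and predictions; the `accuracy` helper is hypothetical and simply computes the fraction of correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Four unique validation rows: the first two are predicted correctly,
# the last two are not.
y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 1]
baseline = accuracy(y_true, y_pred)  # 2 of 4 correct

# Two extra copies of the first row (a correct prediction)
# push the metric up.
optimistic = accuracy(y_true + [1, 1], y_pred + [1, 1])  # 4 of 6 correct

# Two extra copies of the last row (a wrong prediction)
# pull the metric down.
pessimistic = accuracy(y_true + [0, 0], y_pred + [1, 1])  # 2 of 6 correct

print(baseline, optimistic, pessimistic)
```

The model and the set of unique examples are the same in all three cases; only the duplicates change the reported accuracy.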
Test configuration examples
If you are writing a tests.json, here are a few valid configurations for the duplicate rows test: