Column statistics

Definition

The column statistics test lets you set thresholds on statistical measures of individual columns in your dataset. You select a column and a statistic (such as mean, median, or variance), then define the acceptable value or range for that statistic. The test computes the statistic for the chosen column and compares the result against your threshold.
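
To make the compute-then-compare flow concrete, here is a minimal Python sketch. It is not the platform's implementation: the helper name `column_statistic_passes` and the `OPERATORS` table are hypothetical, only two statistics are handled for brevity, and the set of supported operators beyond `>` and `<=` is an assumption.

```python
import operator
import statistics

# Assumed operator set; only ">" and "<=" appear in the configuration examples.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def column_statistic_passes(values, statistic, op, threshold):
    """Compute one statistic over a column and compare it with a threshold."""
    # Ignore missing values before computing the statistic.
    non_null = [v for v in values if v is not None]
    if statistic == "mean":
        measured = statistics.mean(non_null)
    elif statistic == "median":
        measured = statistics.median(non_null)
    else:
        raise ValueError(f"statistic not handled in this sketch: {statistic}")
    return measured, OPERATORS[op](measured, threshold)


ages = [22, 31, 45, None, 38]
measured, passed = column_statistic_passes(ages, "mean", ">", 25)
print(f"mean age = {measured}, test passed = {passed}")
```

The test passes when the comparison between the measured statistic and the threshold holds, and fails otherwise.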

Taxonomy

  • Task types: LLM, tabular classification, tabular regression.
  • Availability: and .

Why it matters

  • Column statistics tests help ensure that your data maintains expected statistical properties over time.
  • They can detect data quality issues, distribution shifts, or unusual patterns in individual features.
  • These tests are essential for monitoring data consistency and ensuring that model inputs remain within expected ranges.
  • Statistical validation helps identify potential data pipeline issues or changes in data collection processes.

Available statistics

The following statistical measures are supported:
Statistic | Description                        | Typical Use Cases
----------|------------------------------------|-------------------------------------------------------
mean      | Average value of the column        | Monitor if average values stay within expected ranges
median    | Middle value when data is sorted   | Detect shifts in central tendency, robust to outliers
min       | Minimum value in the column        | Ensure no values fall below acceptable minimums
max       | Maximum value in the column        | Detect outliers or values exceeding acceptable maximums
std       | Standard deviation of the column   | Monitor data variability and spread
sum       | Sum of all values in the column    | Useful for totals, counts, or aggregate validations
count     | Number of non-null values          | Monitor data completeness
variance  | Variance of the column values      | Alternative measure of data spread
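
As a rough illustration of what each statistic listed above measures, the following Python sketch computes all eight for a small sample column using only the standard library. The `STATISTICS` mapping and `describe_column` helper are hypothetical names. Whether the platform uses the sample (n - 1) or population (n) form for `std` and `variance` is not stated here, so the sample form below is an assumption.

```python
import statistics

# Map each statistic name to a standard-library function.
STATISTICS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
    "std": statistics.stdev,          # assumption: sample standard deviation
    "sum": sum,
    "count": len,                     # applied after nulls are dropped
    "variance": statistics.variance,  # assumption: sample variance
}


def describe_column(values):
    """Return every statistic for a column, ignoring null values."""
    non_null = [v for v in values if v is not None]
    return {name: fn(non_null) for name, fn in STATISTICS.items()}


income = [40_000, 52_000, None, 61_000, 47_000]
summary = describe_column(income)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Running a summary like this on a reference dataset is one way to pick sensible thresholds before writing the test.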

Test configuration examples

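If you are writing a tests.json, here are two valid configurations for the column statistics test. The // comments are annotations for the reader; standard JSON does not allow comments, so remove them if your parser rejects them: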
[
  {
    "name": "Average age within expected range",
    "description": "Ensures the average age in the dataset is greater than 25",
    "type": "integrity",
    "subtype": "columnStatistic",
    "thresholds": [
      {
        "insightName": "columnStatistic",
        "insightParameters": [
          { "name": "column_name", "value": "age" }, // Select the column
          { "name": "statistic", "value": "mean" }   // Select the statistic
        ],
        "measurement": "columnStatistic",
        "operator": ">",
        "value": 25
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "Income variance stability check",
    "description": "Ensures income variance doesn't exceed threshold, indicating stable distribution",
    "type": "integrity",
    "subtype": "columnStatistic",
    "thresholds": [
      {
        "insightName": "columnStatistic",
        "insightParameters": [
          { "name": "column_name", "value": "income" },
          { "name": "statistic", "value": "variance" }
        ],
        "measurement": "columnStatistic",
        "operator": "<=",
        "value": 1000000 // Maximum acceptable variance
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
  }
]
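
If you prefer to generate these entries rather than write them by hand, a small script can fill in the repeated fields. The Python sketch below is one possible approach, not an official tool: `make_column_statistic_test` is a hypothetical helper, the field names are copied from the configurations above, and using `uuid.uuid4()` for `syncId` assumes that any unique UUID string is acceptable.

```python
import json
import uuid


def make_column_statistic_test(name, description, column, statistic, op, value):
    """Build one column statistics test entry with the fields shown above."""
    return {
        "name": name,
        "description": description,
        "type": "integrity",
        "subtype": "columnStatistic",
        "thresholds": [
            {
                "insightName": "columnStatistic",
                "insightParameters": [
                    {"name": "column_name", "value": column},
                    {"name": "statistic", "value": statistic},
                ],
                "measurement": "columnStatistic",
                "operator": op,
                "value": value,
            }
        ],
        "subpopulationFilters": None,
        "mode": "development",
        "usesValidationDataset": True,
        "usesTrainingDataset": False,
        "usesMlModel": False,
        "syncId": str(uuid.uuid4()),  # assumption: any unique UUID works
    }


tests = [
    make_column_statistic_test(
        "Average age within expected range",
        "Ensures the average age in the dataset is greater than 25",
        column="age",
        statistic="mean",
        op=">",
        value=25,
    ),
]

# json.dumps emits standard JSON (no comments), ready to save as tests.json.
tests_json = json.dumps(tests, indent=2)
print(tests_json)
```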