Anomalous column count

Definition

The anomalous column count test automatically learns time series patterns for each column in your dataset and detects when values fall outside predicted bounds. For numeric columns, it tracks statistical measures (like averages) over time, while for categorical columns, it monitors category counts. The test continuously learns expected ranges for each column and counts how many columns exhibit anomalous behavior on each evaluation, comparing this count against your specified threshold.

Taxonomy

  • Task types: Tabular classification, tabular regression.
  • Availability: monitoring only.
This test is only available in monitoring mode as it requires historical data to learn time series patterns and establish baseline expectations for each column.

Why it matters

  • Automated monitoring: Provides comprehensive data quality monitoring with minimal configuration required
  • Early anomaly detection: Identifies unusual patterns across all columns simultaneously before they impact model performance
  • Time series learning: Adapts to natural variations and trends in your data over time
  • Comprehensive coverage: Monitors both numeric and categorical columns automatically
  • Minimal setup: No need to configure thresholds for individual columns by hand, because the test learns appropriate bounds for each one

How it works

The test operates in five phases:
  1. Learning phase: Analyzes historical data to establish time series patterns for each column
    • Numeric columns: Tracks statistical measures (averages, medians, etc.) over time
    • Categorical columns: Monitors counts of each category over time
  2. Prediction: Uses learned patterns to predict expected upper and lower bounds for each column’s current values
  3. Anomaly detection: Compares current column values against predicted bounds
    • Values outside the confidence interval are flagged as anomalous
  4. Counting: Counts the total number of columns exhibiting anomalous behavior
  5. Threshold comparison: Compares the anomalous column count against your specified threshold
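
The five phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the forecasting model the test actually uses is not described here, so a plain "historical mean plus or minus z standard deviations" rule stands in for it, and the helper names (summarize, predicted_bounds, anomalous_column_count) are hypothetical rather than part of any API.

```python
from collections import Counter
from statistics import NormalDist, mean, stdev


def summarize(window):
    """Reduce one window (column -> list of values) to one number per tracked series."""
    series = {}
    for column, values in window.items():
        if all(isinstance(v, (int, float)) for v in values):
            series[(column, "mean")] = mean(values)      # numeric: track the average
        else:
            for category, n in Counter(values).items():  # categorical: track counts
                series[(column, category)] = n
    return series


def predicted_bounds(history, interval_width=0.95):
    """Stand-in forecaster (an assumption): historical mean +/- z * stdev."""
    z = NormalDist().inv_cdf(0.5 + interval_width / 2)
    center, spread = mean(history), stdev(history)
    return center - z * spread, center + z * spread


def anomalous_column_count(past_windows, current_window, interval_width=0.95):
    """Count columns with at least one tracked series outside its predicted bounds."""
    past = [summarize(w) for w in past_windows]
    current = summarize(current_window)
    keys = set(current).union(*past)
    anomalous = set()
    for key in keys:
        history = [p.get(key, 0) for p in past]  # a category not seen counts as 0
        lower, upper = predicted_bounds(history, interval_width)
        if not lower <= current.get(key, 0) <= upper:
            anomalous.add(key[0])                # key[0] is the column name
    return len(anomalous)


# Made-up data: four past windows and one current window.
past_windows = [
    {"age": [30, 32, 31], "plan": ["free", "free", "pro"]},
    {"age": [29, 33, 31], "plan": ["free", "pro", "free"]},
    {"age": [31, 30, 32], "plan": ["free", "free", "pro"]},
    {"age": [30, 31, 33], "plan": ["pro", "free", "free"]},
]
current_window = {"age": [58, 61, 60], "plan": ["free", "free", "pro"]}

count = anomalous_column_count(past_windows, current_window)
threshold = 0                # mirrors a "<= 0" threshold
passed = count <= threshold
print(count, passed)
```

In this made-up data the average age jumps far outside its historical range while the plan counts stay the same, so one column is counted as anomalous and a "<= 0" threshold fails.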

Configuration parameters

The test supports an optional interval_width parameter that sets the width of the confidence interval used as the predicted bounds:
  • interval_width: Confidence interval width (default: 0.95)
    • 0.95 = 95% confidence interval (narrower bounds, so the test is stricter and flags more anomalies)
    • 0.99 = 99% confidence interval (wider bounds, so the test is more lenient and flags fewer anomalies)
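
To see why a wider interval flags fewer values, here is a minimal sketch that assumes normally distributed forecast errors. That assumption, the z_score and bounds helpers, and the numbers used (expected value 100, standard error 10, observed value 122) are illustrative only and do not describe the test's internal model.

```python
from statistics import NormalDist


def z_score(interval_width):
    """Two-sided z multiplier for a given interval width, assuming normal errors."""
    return NormalDist().inv_cdf(0.5 + interval_width / 2)


def bounds(center, spread, interval_width=0.95):
    """Lower and upper bound around a forecast with the given standard error."""
    z = z_score(interval_width)
    return center - z * spread, center + z * spread


# Hypothetical forecast for one column: expected value 100, standard error 10.
observed = 122
for width in (0.95, 0.99):
    lower, upper = bounds(100, 10, width)
    flagged = not lower <= observed <= upper
    print(f"interval_width={width}: [{lower:.1f}, {upper:.1f}] flagged={flagged}")
```

Under these assumptions the 95% interval is roughly [80.4, 119.6], so 122 is flagged, while the 99% interval is roughly [74.2, 125.8], so the same value passes.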

Test configuration examples

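If you are writing a tests.json, here is an example configuration for the anomalous column count test. The // annotations are explanatory only and must be removed before use, since JSON does not permit comments: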
[
  {
    "name": "No anomalous columns detected",
    "description": "Alerts when any column shows anomalous behavior based on learned patterns",
    "type": "integrity",
    "subtype": "anomalousColumnCount",
    "thresholds": [
      {
        "insightName": "anomalousColumnCount",
        "measurement": "anomalousColumnCount",
        "operator": "<=",
        "value": 0  // No anomalous columns allowed
      }
    ],
    "subpopulationFilters": null,
    "mode": "monitoring",
    "usesProductionData": true,
    "evaluationWindow": 86400, // 24 hours (daily evaluation)
    "delayWindow": 0,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  }
]
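
If you would rather generate the file than edit it by hand, the following sketch builds the same entry in Python and serializes it, which avoids the // annotations (not legal in JSON) and produces a fresh syncId. The field names and values are copied from the example above; the function name build_anomalous_column_count_test is hypothetical, and reading evaluationWindow as seconds is inferred from the "24 hours" annotation.

```python
import json
import uuid

SECONDS_PER_DAY = 24 * 60 * 60  # 86400, matching the example's evaluationWindow


def build_anomalous_column_count_test():
    """Build one tests.json entry using the field names from the example above."""
    return {
        "name": "No anomalous columns detected",
        "description": "Alerts when any column shows anomalous behavior based on learned patterns",
        "type": "integrity",
        "subtype": "anomalousColumnCount",
        "thresholds": [
            {
                "insightName": "anomalousColumnCount",
                "measurement": "anomalousColumnCount",
                "operator": "<=",
                "value": 0,  # no anomalous columns allowed
            }
        ],
        "subpopulationFilters": None,  # serialized as null
        "mode": "monitoring",
        "usesProductionData": True,
        "evaluationWindow": SECONDS_PER_DAY,  # daily evaluation
        "delayWindow": 0,
        "syncId": str(uuid.uuid4()),  # some unique id
    }


config_text = json.dumps([build_anomalous_column_count_test()], indent=2)
print(config_text)
```

Writing config_text to tests.json yields a comment-free file that a standard JSON parser accepts.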