# Group by column statistic

> Learn how to use the group by column statistic test to validate statistical properties across data groups

## Definition

The group by column statistic test allows you to measure a statistical property of
one column grouped by the unique values of another column, and then set thresholds
on how many groups fail to meet your criteria.

For each unique value in the grouping column, the test calculates the specified
statistic on the target column and checks whether it meets your defined condition. The
test then counts how many groups fail this condition and compares that count, or the
percentage of failing groups, against your threshold.

## Taxonomy

* **Task types**: LLM, tabular classification, tabular regression.
* **Availability**: <Tooltip tip="Continuously evaluate your models and datasets as you iterate on their versions.">development</Tooltip>
  and <Tooltip tip="Monitor a model in production, measure its health, check for drifts and set up alerts.">monitoring</Tooltip>.

## Why it matters

* This test helps ensure statistical consistency across different segments or categories in your data.
* It can detect bias, inconsistencies, or quality issues that affect specific subgroups differently.
* It's essential for fairness validation, ensuring that model inputs have similar
  statistical properties across different demographics or categories.
* It helps identify data collection issues that might affect certain groups disproportionately.

## How it works

The test follows these steps:

1. **Group the data** by unique values in the specified grouping column
2. **Calculate the statistic** (mean, median, etc.) on the target column for each group
3. **Apply the condition** to each group's statistic (e.g., `mean >= 25`)
4. **Count failing groups** that don't meet the condition
5. **Compare** the count or percentage of failing groups against your threshold
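
The steps above can be sketched in plain Python. This is a minimal illustration, not Openlayer's implementation: the function `failing_groups`, the `OPERATORS` table, and the sample rows are hypothetical, and only the `>=` and `<=` operators appear in the configuration examples on this page.

```python
import operator
import statistics
from collections import defaultdict

# Assumption: this operator set is illustrative; only ">=" and "<=" appear
# in the configuration examples on this page.
OPERATORS = {
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def failing_groups(rows, group_by, target, statistic, op, value):
    """Return the sorted group keys whose statistic does not satisfy `op value`."""
    # Step 1: collect the target values for each unique grouping value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[target])
    # Steps 2-4: compute the statistic per group and keep the groups that fail.
    compare = OPERATORS[op]
    return sorted(
        key for key, values in groups.items()
        if not compare(statistic(values), value)
    )


# Hypothetical sample data: three geographies with a handful of ages each.
rows = [
    {"geography": "north", "age": 31},
    {"geography": "north", "age": 40},
    {"geography": "south", "age": 22},
    {"geography": "south", "age": 24},
    {"geography": "east", "age": 29},
]

failed = failing_groups(rows, "geography", "age", statistics.mean, ">=", 25)
total_groups = len({row["geography"] for row in rows})

# Step 5: compare the count (or percentage) of failing groups to the threshold.
failing_count = len(failed)
failing_percentage = 100 * failing_count / total_groups
test_passes = failing_count <= 1
```

Here only `south` has a mean age below 25, so one of the three groups fails. A threshold of "at most 1 failing group" passes, while a percentage threshold of 10% would not.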

## Available statistics

The following statistical measures are supported for the target column:

| Statistic  | Description                       | Example Use Case                     |
| ---------- | --------------------------------- | ------------------------------------ |
| `sum`      | Sum of all values in each group   | Total sales by region                |
| `mean`     | Average value for each group      | Average age by geography             |
| `median`   | Median value for each group       | Median income by job category        |
| `min`      | Minimum value in each group       | Minimum score by demographic         |
| `max`      | Maximum value in each group       | Maximum transaction by customer type |
| `count`    | Number of records in each group   | Sample size validation by segment    |
| `variance` | Variance of values in each group  | Consistency check by category        |
| `std`      | Standard deviation for each group | Variability assessment by group      |
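
To make the table concrete, the sketch below maps each statistic name to a Python standard-library function and applies all of them to one group's values. The `STATISTICS` mapping is hypothetical, and using the sample (rather than population) form for `variance` and `std` is an assumption; this page does not say which form the platform computes.

```python
import statistics

# Hypothetical mapping from the statistic names in the table to stdlib functions.
STATISTICS = {
    "sum": sum,
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
    "count": len,
    # Assumption: sample variance and standard deviation (n - 1 denominator).
    # Both raise statistics.StatisticsError for groups with fewer than two values.
    "variance": statistics.variance,
    "std": statistics.stdev,
}

# Values of the target column for a single group.
values = [10, 20, 30, 40]

# Compute every supported statistic for that group.
summary = {name: fn(values) for name, fn in STATISTICS.items()}
```

Whichever statistic you choose yields one number per group, and the test's condition is then applied to that number.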

## Test configuration examples

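If you are writing a `tests.json`, here are a few example configurations for the group by column statistic test. The `//` comments are explanatory only; strict JSON does not allow comments, so remove them if your tooling rejects them.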

<CodeGroup>
  ```json Development
  [
    {
      "name": "Average age consistency across geographies",
      "description": "Ensures that average age in each geography is at least 25, with max 1 failing geography allowed",
      "type": "integrity",
      "subtype": "groupByColumnStatsCheck",
      "thresholds": [
        {
          "insightName": "groupByColumnStatsCheck",
          "insightParameters": [
            { "name": "target_column_statistic", "value": "mean" },     // Statistic to calculate
            { "name": "target_column_name", "value": "age" },          // Column to analyze
            { "name": "operator", "value": ">=" },                    // Condition for each group
            { "name": "value", "value": 25 },                         // Threshold for each group
            { "name": "group_by_column_name", "value": "geography" }   // Column to group by
          ],
          "measurement": "failingGroupCount",  // Count of groups that fail the condition
          "operator": "<=",
          "value": 1  // Allow at most 1 geography to fail
        }
      ],
      "subpopulationFilters": null,
      "mode": "development",
      "usesValidationDataset": true,
      "usesTrainingDataset": false,
      "usesMlModel": false,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
    },
    {
      "name": "Income distribution fairness check",
      "description": "Ensures no more than 10% of job categories have median income below $40K",
      "type": "integrity",
      "subtype": "groupByColumnStatsCheck",
      "thresholds": [
        {
          "insightName": "groupByColumnStatsCheck",
          "insightParameters": [
            { "name": "target_column_statistic", "value": "median" },
            { "name": "target_column_name", "value": "income" },
            { "name": "operator", "value": ">=" },
            { "name": "value", "value": 40000 },
            { "name": "group_by_column_name", "value": "job_category" }
          ],
          "measurement": "failingGroupPercentage",  // Percentage of groups that fail
          "operator": "<=",
          "value": 10.0  // Allow at most 10% of job categories to fail
        }
      ],
      "subpopulationFilters": null,
      "mode": "development",
      "usesValidationDataset": true,
      "usesTrainingDataset": false,
      "usesMlModel": false,
      "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
    }
  ]
  ```

  ```json Monitoring
  [
    {
      "name": "Transaction volume consistency by region",
      "description": "Monitors that all regions maintain minimum transaction counts",
      "type": "integrity",
      "subtype": "groupByColumnStatsCheck",
      "thresholds": [
        {
          "insightName": "groupByColumnStatsCheck",
          "insightParameters": [
            { "name": "target_column_statistic", "value": "count" },
            { "name": "target_column_name", "value": "transaction_id" },
            { "name": "operator", "value": ">=" },
            { "name": "value", "value": 100 }, // Each region should have at least 100 transactions
            { "name": "group_by_column_name", "value": "region" }
          ],
          "measurement": "failingGroupCount",
          "operator": "<=",
          "value": 0 // No regions should fall below minimum
        }
      ],
      "subpopulationFilters": null,
      "mode": "monitoring",
      "usesProductionData": true,
      "evaluationWindow": 3600, // 1 hour
      "delayWindow": 0,
      "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
    },
    {
      "name": "Revenue consistency across customer segments",
      "description": "Ensures no more than 20% of customer types have an average transaction amount below 50",
      "type": "integrity",
      "subtype": "groupByColumnStatsCheck",
      "thresholds": [
        {
          "insightName": "groupByColumnStatsCheck",
          "insightParameters": [
            { "name": "target_column_statistic", "value": "mean" },
            { "name": "target_column_name", "value": "transaction_amount" },
            { "name": "operator", "value": ">=" },
            { "name": "value", "value": 50.0 },
            { "name": "group_by_column_name", "value": "customer_type" }
          ],
          "measurement": "failingGroupPercentage",
          "operator": "<=",
          "value": 20.0 // Allow up to 20% of customer types to have lower averages
        }
      ],
      "subpopulationFilters": null,
      "mode": "monitoring",
      "usesProductionData": true,
      "evaluationWindow": 3600, // 1 hour
      "delayWindow": 0,
      "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
    }
  ]
  ```
</CodeGroup>
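
The examples above carry `//` comments, which a strict JSON parser rejects. If you keep such comments in a working copy, one option is to strip them before parsing. The helper below is a small sketch with a hypothetical name (`load_commented_json`); it is not part of any Openlayer tooling, and it handles only `//` line comments, not `/* ... */` blocks.

```python
import json
import re

# Match either a complete JSON string (kept as-is) or a // comment (dropped).
# Matching strings first keeps "//" inside values such as URLs intact.
_STRING_OR_COMMENT = re.compile(r'("(?:\\.|[^"\\])*")|//[^\n]*')


def load_commented_json(text):
    """Parse JSON text after removing // line comments outside of strings."""
    stripped = _STRING_OR_COMMENT.sub(lambda m: m.group(1) or "", text)
    return json.loads(stripped)


# A threshold fragment in the style of the examples above.
raw = """
{
  "measurement": "failingGroupCount",  // Count of groups that fail the condition
  "operator": "<=",
  "value": 1  // Allow at most 1 geography to fail
}
"""

threshold = load_commented_json(raw)
```

After parsing, `threshold` is an ordinary dictionary that can be written back out with `json.dumps` as comment-free JSON.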
