Personally identifiable information (PII)

Definition

The PII test asserts that the data contains no personally identifiable information (PII). Currently, the test can check for credit card numbers and social security numbers (SSNs).
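To make the idea concrete, here is a minimal sketch of a row-level PII check. It is not the test's actual implementation: the regular expressions, the Luhn checksum step, and the function names `contains_pii` and `pii_row_count` are all assumptions chosen for illustration. Only the `pii_type` values `cc_num` and `ssn` come from the configuration examples on this page.

```python
import re

# Assumed patterns, for illustration only; the real test may use different rules.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CC_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def contains_pii(text: str, pii_type: str) -> bool:
    """Return True if the text appears to contain PII of the given type."""
    if pii_type == "ssn":
        return SSN_RE.search(text) is not None
    if pii_type == "cc_num":
        for match in CC_RE.finditer(text):
            digits = re.sub(r"\D", "", match.group())
            if luhn_valid(digits):
                return True
        return False
    raise ValueError(f"unsupported pii_type: {pii_type}")


def pii_row_count(rows, pii_type: str) -> int:
    """Count the rows whose text contains PII of the given type."""
    return sum(contains_pii(row, pii_type) for row in rows)


rows = [
    "Card: 4111 1111 1111 1111",          # well-known test card number
    "Order 1234 5678 9012 3456 shipped",  # 16 digits, fails the Luhn check
    "SSN 123-45-6789",
    "nothing here",
]
cc_rows = pii_row_count(rows, "cc_num")
ssn_rows = pii_row_count(rows, "ssn")
```

The Luhn step is a design choice in this sketch: it filters out long digit runs, such as order numbers, that merely look like card numbers. A test that requires the count to be at most 0 would fail on this sample for both PII types.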

Taxonomy

  • Category: Integrity.
  • Task types: LLM.
  • Availability: and .

Why it matters

  • A dataset that is not anonymized can expose personal data in a breach and can lead to biased models.
  • LLMs are prone to hallucinating (or leaking) PII.

Test configuration examples

If you are writing a tests.json, here are two valid configurations for the PII test:
[
  {
    "name": "No credit card numbers leaked on output",
    "description": "Asserts no credit card numbers are leaked",
    "type": "integrity",
    "subtype": "containsPii",
    "thresholds": [
      {
        "insightName": "containsPii",
        "insightParameters": [
          {
            "name": "pii_type",
            "value": "cc_num"
          }, // Checks for credit card numbers...
          {
            "name": "column_name",
            "value": "output"
          } // ... on the column `output`
        ],
        "measurement": "containsPIIRowCount",
        "operator": "<=",
        "value": 0.0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "No social security numbers leaked on output",
    "description": "Asserts no SSN are leaked",
    "type": "integrity",
    "subtype": "containsPii",
    "thresholds": [
      {
        "insightName": "containsPii",
        "insightParameters": [
          {
            "name": "pii_type",
            "value": "ssn"
          },  // Checks for social security numbers...
          {
            "name": "column_name",
            "value": "output"
          }  // ... on the column `output`
        ],
        "measurement": "containsPIIRowCount",
        "operator": "<=",
        "value": 0.0
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
  }
]
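The two configurations above differ only in their names, descriptions, `pii_type` value, and `syncId`, so they can also be generated programmatically. The following is a minimal sketch under stated assumptions: `make_pii_test` is a hypothetical helper (not part of any library), the field names and values are copied from the examples above, and `uuid.uuid4()` is assumed to be an acceptable way to produce the unique `syncId`. Standard JSON does not allow `//` comments, so the generated output omits the annotations shown above.

```python
import json
import uuid


def make_pii_test(pii_type: str, label: str, column_name: str = "output") -> dict:
    """Hypothetical helper: build one PII test entry shaped like the examples above."""
    return {
        "name": f"No {label} leaked on {column_name}",
        "description": f"Asserts no {label} are leaked",
        "type": "integrity",
        "subtype": "containsPii",
        "thresholds": [
            {
                "insightName": "containsPii",
                "insightParameters": [
                    {"name": "pii_type", "value": pii_type},
                    {"name": "column_name", "value": column_name},
                ],
                "measurement": "containsPIIRowCount",
                "operator": "<=",
                "value": 0.0,
            }
        ],
        "subpopulationFilters": None,
        "mode": "development",
        "usesValidationDataset": True,
        "usesTrainingDataset": False,
        "usesMlModel": False,
        "syncId": str(uuid.uuid4()),  # assumption: any unique id is accepted
    }


tests = [
    make_pii_test("cc_num", "credit card numbers"),
    make_pii_test("ssn", "social security numbers"),
]

# Serialize to comment-free JSON text.
tests_json = json.dumps(tests, indent=2)
```

Writing `tests_json` to a file named tests.json yields a configuration equivalent to the one above, with freshly generated `syncId` values.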