Treat datasets like code. Generate 5KB statistical signatures of massive datasets to monitor drift, prevent model decay, and automate data integrity—all without moving your data.
Open-sourced under the Apache 2.0 License.
| Feature | Drift (PSI) | Type | Status |
|---|---|---|---|
| user_age | 0.02 | float64 | PASS |
| income_bracket | 0.45 ↑ | category | DRIFT DETECTED |
| email_address | - | string | PII LEAK |
| transaction_vol | 0.08 | float64 | PASS |
--anonymize flag suggested to redact email_address sample.
Zero-ETL Integrations
StatGit replaces fragile DAGs and heavy ETL pipelines with a lightweight, cloud-native validation engine.
Snapshot data directly from S3, GCS, or Snowflake. Computation happens natively in your cloud warehouse using push-down queries. Only the 5KB mathematical signature is stored locally; the raw data never leaves your warehouse.
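To make the idea of a compact signature concrete, here is a minimal sketch that reduces a numeric column to fixed-width bin counts plus a few moments. The `make_signature` function, its fields, and the JSON encoding are illustrative assumptions, not StatGit's actual format. The sketch computes the counts in Python for readability; in a push-down setup the same counts could come from a single aggregate query run inside the warehouse.

```python
import json
import random
import statistics

def make_signature(values, lo, hi, bins=20):
    """Summarize a numeric column as fixed-width bin counts plus basic moments.

    Hypothetical format for illustration only.
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = min(max(int((v - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return {
        "n": len(values),
        "lo": lo,
        "hi": hi,
        "counts": counts,
        "mean": statistics.fmean(values),
        "stdev": statistics.pstdev(values),
    }

# Synthetic column standing in for a large table.
random.seed(0)
column = [random.gauss(40, 12) for _ in range(100_000)]

signature = make_signature(column, lo=0, hi=100)
payload = json.dumps(signature).encode("utf-8")
print(len(payload), "bytes for", signature["n"], "rows")
```

The payload size depends on the number of bins, not the number of rows, which is why a signature of this kind can stay small however large the dataset grows.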
Automated PII scanning blocks sensitive columns. Differential Privacy (DP) noise injection means you share metrics, not raw data.
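As a rough illustration of noise injection, the sketch below releases a noisy mean using the Laplace mechanism, one standard way to achieve differential privacy. The function names, the clipping bounds, and the choice of mechanism are assumptions made for the example; the page does not say which mechanism StatGit uses. The sketch also assumes the row count is public.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_mean(values, lo, hi, epsilon, rng):
    """Release the mean of values clipped to [lo, hi] with epsilon-DP.

    Changing one row moves the clipped mean by at most (hi - lo) / n,
    so that is the sensitivity used to scale the noise.
    """
    clipped = [min(max(v, lo), hi) for v in values]
    n = len(clipped)
    sensitivity = (hi - lo) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [rng.uniform(18, 90) for _ in range(10_000)]  # synthetic data

true_mean = sum(ages) / len(ages)
noisy_mean = dp_mean(ages, lo=0, hi=100, epsilon=1.0, rng=rng)
print(f"true={true_mean:.3f} noisy={noisy_mean:.3f}")
```

A smaller `epsilon` gives stronger privacy and more noise; with many rows the noise on an aggregate like the mean is small, so the shared metric stays useful.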
Prioritize alerts based on feature importance. Stop waking up for noise in unused columns; start acting on what matters.
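One simple way to express this prioritization is to weight each feature's drift score by its importance to the model. The sketch below uses the PSI values from the table above; the importance weights and the `rank_alerts` helper are hypothetical, and StatGit's actual scoring rule is not described here.

```python
def rank_alerts(psi_by_feature, importance):
    """Order features by drift weighted by model importance, highest first."""
    scored = [
        (name, psi * importance.get(name, 0.0))
        for name, psi in psi_by_feature.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# PSI values taken from the table above.
psi_by_feature = {"user_age": 0.02, "income_bracket": 0.45, "transaction_vol": 0.08}

# Hypothetical importance weights, e.g. normalized feature importances.
importance = {"user_age": 0.50, "income_bracket": 0.05, "transaction_vol": 0.45}

ranked = rank_alerts(psi_by_feature, importance)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

With these made-up weights, transaction_vol ranks above income_bracket even though its raw PSI is lower, because the model leans on it far more heavily.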
Connect StatGit to your CI/CD. Automatically quarantine bad data, block model deployments, and trigger Airflow retraining workflows via Webhooks or GitHub Actions.
No heavy infrastructure. Just a CLI designed for the modern ML stack.
Global averages lie. When you average out features across a massive dataset, dangerous drifts cancel each other out—a phenomenon related to Simpson's Paradox.
StatGit automatically slices your data across categorical dimensions (like Region or Age Group) and runs statistical tests on every segment.
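The effect is easy to reproduce. The sketch below computes the Population Stability Index, PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ) over category proportions, once globally and once per segment. The data is synthetic and the helper names are illustrative assumptions, not StatGit's API.

```python
import math
from collections import Counter, defaultdict

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two lists of categorical values."""
    e_counts, a_counts = Counter(expected), Counter(actual)
    total = 0.0
    for cat in set(expected) | set(actual):
        # Floor proportions at eps so empty categories do not hit log(0).
        e = max(e_counts[cat] / len(expected), eps)
        a = max(a_counts[cat] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total

def group_by(rows, segment_key, feature_key):
    """Collect feature values per segment."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[segment_key]].append(row[feature_key])
    return groups

def psi_by_segment(baseline, current, segment_key, feature_key):
    """PSI for each segment present in both snapshots."""
    base = group_by(baseline, segment_key, feature_key)
    cur = group_by(current, segment_key, feature_key)
    return {seg: psi(base[seg], cur[seg]) for seg in sorted(base.keys() & cur.keys())}

def rows(region, low, high):
    """Build synthetic rows for one region."""
    return ([{"region": region, "income_bracket": "low"}] * low
            + [{"region": region, "income_bracket": "high"}] * high)

# The two regions swap distributions, so the global mix is unchanged.
baseline = rows("North", 80, 20) + rows("South", 20, 80)
current = rows("North", 20, 80) + rows("South", 80, 20)

global_psi = psi([r["income_bracket"] for r in baseline],
                 [r["income_bracket"] for r in current])
segment_psi = psi_by_segment(baseline, current, "region", "income_bracket")

print(f"global: {global_psi:.3f}")
for region, value in segment_psi.items():
    print(f"{region}: {value:.3f}")
```

Here the global PSI is zero because both snapshots are half low and half high overall, while each region individually scores about 1.66, far above the 0.25 level often used as a rule of thumb for significant drift.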