Quality Metrics
Functions for profiling and measuring data quality.
Overview
Quality metrics provide insight into the health and usability of your data. They answer fundamental questions:
- How complete is my data?
- How unique are my records?
- How fresh is my data?
- Are my columns consistent?
Core Metrics (7 functions)
anofox_metric_nullness
Measure percentage of NULL values in a column.
Signature:
anofox_metric_nullness(column) → DECIMAL
Returns: Percentage of NULLs (0.0-100.0)
Speed: O(n) full table scan
Example:
SELECT
'email' as column_name,
anofox_metric_nullness(email) as null_percentage
FROM customers;
-- Output: 2.5 (2.5% of emails are NULL)
Interpretation:
- 0%: No missing values (complete)
- 1-5%: Mostly complete
- 5-20%: Significant gaps
- > 20%: Unusable column
anofox_metric_distinctness
Measure percentage of unique values in a column.
Signature:
anofox_metric_distinctness(column) → DECIMAL
Returns: Percentage of unique values (0.0-100.0)
Speed: O(n) with dedup
Example:
SELECT
anofox_metric_distinctness(customer_id) as id_uniqueness,
anofox_metric_distinctness(country) as country_uniqueness
FROM customers;
-- Output:
-- id_uniqueness: 100.0 (each customer has unique ID)
-- country_uniqueness: 5.2 (only 5 countries represented)
Interpretation:
- 100%: All unique (key column)
- 50-100%: High diversity
- 10-50%: Moderate categorization
- < 10%: Heavily concentrated
anofox_metric_freshness
Measure days since most recent update.
Signature:
anofox_metric_freshness(date_column) → DECIMAL
Returns: Age in days since MAX(date_column)
Speed: O(n)
Example:
SELECT
anofox_metric_freshness(updated_at) as days_since_update
FROM customers;
-- Output: 3 (most recent update was 3 days ago)
Interpretation:
- 0-1 days: Real-time (fresh)
- 1-7 days: Current (acceptable)
- 7-30 days: Stale (aging)
- > 30 days: Very stale (likely unused)
anofox_metric_volume
Get row count of table.
Signature:
anofox_metric_volume(table_name VARCHAR) → INTEGER
Returns: Number of rows
Speed: O(1)
Example:
SELECT
anofox_metric_volume('customers') as customer_count,
anofox_metric_volume('transactions') as transaction_count;
-- Output:
-- customer_count: 125000
-- transaction_count: 8500000
anofox_metric_consistency
Measure format consistency across column.
Signature:
anofox_metric_consistency(column) → DECIMAL
Returns: Percentage of values matching dominant format (0.0-100.0)
Speed: O(n)
Example:
SELECT
anofox_metric_consistency(phone) as phone_format_consistency
FROM customers;
-- Output: 75.5 (75.5% of phones follow same format)
Interpretation:
- > 95%: Highly consistent
- 80-95%: Mostly consistent
- 50-80%: Inconsistent (may need normalization)
- < 50%: Very inconsistent (data quality issue)
anofox_metric_schema_match
Check if two tables have compatible schemas.
Signature:
anofox_metric_schema_match(table1 VARCHAR, table2 VARCHAR) → BOOLEAN
Returns: TRUE if schemas match
Speed: O(1)
Example:
SELECT anofox_metric_schema_match('customers', 'customers_backup');
-- Output: true (both have same columns and types)
Use Case: Validate before ETL or migration
anofox_metric_row_count
Get row count (alias for volume).
Signature:
anofox_metric_row_count(table_name VARCHAR) → INTEGER
Returns: Number of rows
Speed: O(1)
Example:
SELECT anofox_metric_row_count('transactions');
-- Output: 8500000
Data Quality Profile
Create a comprehensive health check:
CREATE VIEW data_quality_profile AS
SELECT
'customers' as table_name,
anofox_metric_row_count('customers') as row_count,
anofox_metric_nullness(email) as email_nullness_pct,
anofox_metric_nullness(phone) as phone_nullness_pct,
anofox_metric_distinctness(customer_id) as id_uniqueness_pct,
anofox_metric_distinctness(country) as country_diversity_pct,
anofox_metric_freshness(updated_at) as days_since_update,
anofox_metric_consistency(phone) as phone_format_consistency_pct,
CURRENT_TIMESTAMP as profiled_at
FROM customers;
SELECT * FROM data_quality_profile;
Output:
table_name | row_count | email_nullness | phone_nullness | id_uniqueness | country_diversity | days_since_update | phone_consistency | profiled_at
customers | 125000 | 2.5 | 3.1 | 100.0 | 5.2 | 3 | 75.5 | 2025-01-22 14:30:00
Quality Monitoring Dashboard
Track quality over time:
-- Daily quality tracking
CREATE TABLE quality_metrics_daily (
profile_date DATE,
table_name VARCHAR,
row_count INTEGER,
null_rate_pct DECIMAL,
uniqueness_pct DECIMAL,
age_days DECIMAL,
consistency_pct DECIMAL
);
-- Insert daily snapshots
INSERT INTO quality_metrics_daily
SELECT
CURRENT_DATE as profile_date,
'customers' as table_name,
anofox_metric_row_count('customers') as row_count,
anofox_metric_nullness(email) as null_rate_pct,
anofox_metric_distinctness(customer_id) as uniqueness_pct,
anofox_metric_freshness(updated_at) as age_days,
anofox_metric_consistency(phone) as consistency_pct
FROM customers
LIMIT 1;
-- Monitor trends
SELECT
profile_date,
row_count,
null_rate_pct,
LAG(row_count) OVER (ORDER BY profile_date) as prior_row_count,
LAG(row_count) OVER (ORDER BY profile_date) - row_count as row_count_change
FROM quality_metrics_daily
ORDER BY profile_date DESC;
Quality Scoring
Create a composite quality score:
-- Calculate quality score (0-100)
CREATE VIEW data_quality_score AS
SELECT
'customers' as table_name,
ROUND(
(
-- Completeness: 25 points
CASE WHEN anofox_metric_nullness(email) < 5 THEN 25 ELSE 0 END +
-- Uniqueness: 25 points
CASE WHEN anofox_metric_distinctness(customer_id) > 95 THEN 25 ELSE 0 END +
-- Freshness: 25 points
CASE WHEN anofox_metric_freshness(updated_at) < 7 THEN 25 ELSE 0 END +
-- Consistency: 25 points
CASE WHEN anofox_metric_consistency(phone) > 80 THEN 25 ELSE 0 END
) / 100 * 100,
1
) as quality_score
FROM customers;
-- Grade the data
SELECT
table_name,
quality_score,
CASE
WHEN quality_score >= 90 THEN 'A (Excellent)'
WHEN quality_score >= 80 THEN 'B (Good)'
WHEN quality_score >= 70 THEN 'C (Fair)'
WHEN quality_score >= 60 THEN 'D (Poor)'
ELSE 'F (Failing)'
END as grade
FROM data_quality_score;
Alerting Rules
Set thresholds for automated alerts:
-- Create alerts for quality degradation
CREATE VIEW quality_alerts AS
SELECT
CASE
WHEN anofox_metric_nullness(email) > 10 THEN 'Alert: Email nullness > 10%'
ELSE 'OK'
END as email_alert,
CASE
WHEN anofox_metric_freshness(updated_at) > 30 THEN 'Alert: Data is > 30 days old'
ELSE 'OK'
END as freshness_alert,
CASE
WHEN anofox_metric_consistency(phone) < 70 THEN 'Alert: Phone format < 70% consistent'
ELSE 'OK'
END as consistency_alert
FROM customers
WHERE 1=1
AND anofox_metric_nullness(email) > 10 OR anofox_metric_freshness(updated_at) > 30
OR anofox_metric_consistency(phone) < 70;
Quality Thresholds
Recommended thresholds:
| Metric | Green | Yellow | Red |
|---|---|---|---|
| Nullness | < 5% | 5-10% | > 10% |
| Uniqueness (ID) | > 99% | 95-99% | < 95% |
| Freshness | < 1 day | 1-7 days | > 7 days |
| Consistency | > 90% | 70-90% | < 70% |
| Row Count Growth | 0-10% daily | 10-50% daily | > 50% daily |
Best Practices
✅ Profile data before analysis
✅ Track metrics over time (trend analysis)
✅ Set quality thresholds appropriate for your domain
✅ Alert on degradation (not just absolute values)
✅ Review quality scores before using data for decisions
Next Steps
- Data Profiling — Understand profiling methodology
- Anomaly Detection — Detect outliers beyond quality metrics
- Basic Workflow — Implement quality checks in pipeline