# Understanding Data Quality
Data quality is the foundation of reliable business decisions.
## What is Data Quality?
Data quality measures how well your data serves its intended purpose. Poor data leads to poor decisions: bad forecasts, wrong analyses, failed campaigns, and regulatory failures.
The six dimensions of data quality:
### 1. Accuracy

**Definition:** Data correctly represents the real-world entity it describes.

**Examples:**

- Email "john@exampl.com" (typo) = inaccurate
- Phone "+1-555-0123" (valid format, carrier verified) = accurate
- VAT "DE999999999" (syntactically correct but not issued) = inaccurate

**Check with Tabular:**

```sql
SELECT email, anofox_email_validate(email, 'smtp') AS smtp_verified
FROM customers;
```
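To act on the result, the same check can filter out the failing records. A minimal sketch in standard SQL, assuming `anofox_email_validate` returns a boolean (as the `smtp_verified` alias above suggests) and that `customers` has a `customer_id` column:

```sql
-- Customers whose email failed SMTP verification,
-- e.g. to route into a manual review or re-capture flow.
SELECT customer_id, email
FROM customers
WHERE NOT anofox_email_validate(email, 'smtp');
```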
### 2. Completeness

**Definition:** All required data is present; no critical fields are missing.

**Examples:**

- Customer table missing 20% of phone numbers = 80% complete
- All invoices have amounts = 100% complete
- Required email field is NULL = incomplete

**Check with Tabular:**

```sql
SELECT
  COUNT(*) AS total,
  COUNT(*) FILTER (WHERE phone IS NOT NULL) AS phone_count,
  ROUND(COUNT(*) FILTER (WHERE phone IS NOT NULL) * 100.0 / COUNT(*), 1) AS completeness_pct
FROM contacts;
```
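The same pattern extends to every required column, since `COUNT(column)` skips NULLs. A minimal sketch, assuming `contacts` also has `email` and `address` columns (hypothetical names):

```sql
-- One completeness percentage per required column.
SELECT
  ROUND(COUNT(phone)   * 100.0 / COUNT(*), 1) AS phone_pct,
  ROUND(COUNT(email)   * 100.0 / COUNT(*), 1) AS email_pct,
  ROUND(COUNT(address) * 100.0 / COUNT(*), 1) AS address_pct
FROM contacts;
```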
### 3. Consistency

**Definition:** Data is uniform in format and structure across the system.

**Examples:**

- Phone numbers in different formats: "+1 555 0123", "555-0123", "5550123" = inconsistent
- Currency always in EUR, but one record in USD = inconsistent
- Customer IDs sometimes stored with leading zeros, sometimes without = inconsistent

**Check with Tabular:**

```sql
SELECT phone, anofox_phone_format(phone, 'US') AS normalized
FROM contacts
GROUP BY phone;
```
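Normalization also makes inconsistency measurable: any row whose stored value differs from its normalized form uses a non-canonical format. A sketch building on the query above, assuming `anofox_phone_format` returns the canonical representation:

```sql
SELECT
  COUNT(*) AS total,
  -- NULL-safe comparison: rows stored in a non-canonical format.
  COUNT(*) FILTER (
    WHERE phone IS DISTINCT FROM anofox_phone_format(phone, 'US')
  ) AS non_canonical_count
FROM contacts;
```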
### 4. Validity

**Definition:** Data conforms to required syntax and business rules.

**Examples:**

- Email "john@example.com" (valid syntax, valid domain) = valid
- Phone "+1-555-0123" (valid US number) = valid
- VAT "DE123456789" (passes EU validation rules) = valid
- Amount "-100.00" when amounts must be positive = invalid

**Check with Tabular:**

```sql
SELECT
  vat_id,
  anofox_vat_is_valid(vat_id) AS vat_valid
FROM vendors;
```
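Business rules such as "amounts must be positive" (the last example above) need no validation function, just a predicate. A minimal sketch, assuming an `invoices` table with `invoice_number` and `amount` columns:

```sql
-- Rows violating the "amounts must be positive" rule.
SELECT invoice_number, amount
FROM invoices
WHERE amount <= 0;
```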
### 5. Timeliness

**Definition:** Data is fresh, current, and available when needed.

**Examples:**

- Customer address updated yesterday = timely
- Inventory count from 6 months ago = stale, not timely
- Real-time stock prices vs. end-of-day prices = different timeliness requirements

**Check with Tabular:**

```sql
SELECT
  COUNT(*) AS total,
  COUNT(*) FILTER (WHERE last_updated > now() - INTERVAL '7 days') AS updated_past_week,
  MAX(last_updated) AS most_recent
FROM customer_data;
```
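Beyond the freshest record, it often helps to see how record age is distributed. A sketch in standard SQL that buckets rows by staleness; the thresholds are illustrative:

```sql
SELECT
  CASE
    WHEN last_updated > now() - INTERVAL '7 days'  THEN 'fresh (< 1 week)'
    WHEN last_updated > now() - INTERVAL '30 days' THEN 'aging (1-4 weeks)'
    ELSE 'stale (> 30 days)'
  END AS age_bucket,
  COUNT(*) AS row_count
FROM customer_data
GROUP BY age_bucket
ORDER BY age_bucket;
```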
### 6. Uniqueness

**Definition:** No unwanted duplicates; records that should be unique appear exactly once.

**Examples:**

- Customer table with duplicate customer_id rows = not unique
- Two invoices with the same invoice_number = not unique
- Deduplicated email list (one email per person) = unique

**Check with Tabular:**

```sql
SELECT
  COUNT(*) AS total,
  COUNT(DISTINCT customer_id) AS unique_ids,
  COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_count
FROM customers;
```
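Counting duplicates sizes the problem; listing them shows where to start cleaning. A sketch that surfaces each duplicated `customer_id` with its multiplicity:

```sql
SELECT customer_id, COUNT(*) AS occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
```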
## The Cost of Poor Data

### Business Impact
| Dimension | Impact | Example |
|---|---|---|
| Accuracy | Wrong decisions, failed campaigns | Marketing spend on fake emails → 0 ROI |
| Completeness | Biased analysis, missing segments | 20% missing phone numbers → incomplete customer profile |
| Consistency | Integration failures, schema mismatches | Phone format inconsistency → matching failures |
| Validity | Regulatory failures, system errors | Invalid VAT IDs → can't invoice EU customers |
| Timeliness | Stale insights, missed opportunities | Old inventory data → stockouts or overstock |
| Uniqueness | Reporting errors, double-counting | Duplicate customer records → inflated churn metrics |
### The Cascade Effect

```text
Poor Data → Bad Forecasts   → Wrong Inventory → Stockouts/Overstock
          → Bad Analysis    → Wrong Strategy  → Lost Revenue
          → System Failures → Lost Trust      → Churn
```
## The Data Quality Workflow

Prevention → Detection → Correction → Monitoring

- **Prevention:** Validate data at the source (regex, DNS, VAT rules)
- **Detection:** Detect anomalies after ingestion (Z-score, Isolation Forest; sketched below)
- **Correction:** Flag and quarantine bad records
- **Monitoring:** Track quality metrics over time (sketched below)

Tabular helps with all four stages.
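For the detection stage, the Z-score method mentioned above can be expressed directly in SQL. A minimal sketch, assuming an `invoices` table with an `amount` column; the three-sigma threshold is illustrative:

```sql
-- Flag amounts more than 3 standard deviations from the mean.
WITH stats AS (
  SELECT AVG(amount) AS mu, STDDEV(amount) AS sigma
  FROM invoices
)
SELECT i.invoice_number, i.amount,
       (i.amount - s.mu) / NULLIF(s.sigma, 0) AS z_score
FROM invoices AS i
CROSS JOIN stats AS s
WHERE ABS((i.amount - s.mu) / NULLIF(s.sigma, 0)) > 3;
```

For the monitoring stage, one common pattern is to snapshot quality metrics into a history table on a schedule, so quality becomes a time series rather than a one-off report. A sketch; the `dq_metrics` table and metric names are hypothetical:

```sql
CREATE TABLE IF NOT EXISTS dq_metrics (
  measured_at  TIMESTAMP,
  metric       VARCHAR,
  metric_value DOUBLE
);

-- Run on a schedule (e.g. daily) to append the latest readings.
INSERT INTO dq_metrics
SELECT now(), 'phone_completeness_pct',
       COUNT(phone) * 100.0 / COUNT(*)
FROM contacts
UNION ALL
SELECT now(), 'customer_id_duplicate_count',
       COUNT(*) - COUNT(DISTINCT customer_id)
FROM customers;
```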
## Next Steps
- Validation Strategies — Syntax vs. semantic vs. external
- Anomaly Detection Methods — Statistical and ML approaches
- Data Profiling — Essential metrics for data health