Validation Strategies
Choose the right validation approach for your data and constraints.
Three Types of Validation
All validation falls into three categories, each with different guarantees and costs.
1. Syntax Validation (Format Checking)
Definition: Does the data match the expected format/pattern?
Cost: <1ms per record (instant)
Guarantee: Format is correct, content may not be
Use when: Need fast filtering of obviously bad data
Examples
Email Syntax:
SELECT anofox_email_validate('john@example.com', 'regex') as is_valid;
-- Output: true (matches RFC 5322 pattern)
Phone Syntax:
SELECT anofox_phone_format('+1-555-0123', 'US') as is_valid;
-- Output: true (valid US number format)
VAT Syntax:
SELECT anofox_vat_is_valid('DE123456789') as is_valid;
-- Output: true (matches German VAT format)
Currency Syntax:
SELECT anofox_money_is_valid_amount(100.50, 0.01, 9999999.99) as in_range;
-- Output: true (within valid range)
2. Semantic Validation (External Data Checking)
Definition: Does the data exist and make sense in the external system?
Cost: ~100-500ms per record (network-dependent)
Guarantee: Format is correct AND the data exists/is issued
Use when: Can afford network latency; data correctness is critical
Examples
Email DNS Check:
SELECT anofox_email_validate('john@example.com', 'dns') as domain_exists;
-- Checks: Does example.com have MX records?
-- Output: true (domain has mail servers)
Email SMTP Check:
SELECT anofox_email_validate('john@example.com', 'smtp') as mailbox_exists;
-- Checks: Does the mailbox exist on the mail server?
-- Output: true (mailbox is real)
-- WARNING: Some servers don't allow SMTP checks; timeout is ~500ms
Phone Carrier Check:
SELECT anofox_phone_validate('+1-555-0123', 'carrier') as carrier_verified;
-- Checks: Is this phone number issued to a carrier?
-- Output: depends on carrier database
VAT Verification (EU):
SELECT anofox_vat_verify_eu('DE123456789') as is_eu_registered;
-- Checks: Is this VAT ID registered in the EU VIES system?
-- Output: true (VAT is active and registered)
3. Progressive Validation (Tiered Approach)
Definition: Apply cheap checks first, expensive checks only if cheap checks pass.
Cost: Adaptive (fast path for obviously bad data, thorough path for good data)
Guarantee: Balanced speed and accuracy
Use when: Validating large datasets with mixed data quality
Example: Email Validation Pipeline
-- Step 1: Syntax check (instant)
CREATE TABLE step1_syntax AS
SELECT
customer_id,
email,
CASE
WHEN anofox_email_validate(email, 'regex') THEN 'syntax_pass'
ELSE 'syntax_fail'
END as validation_step
FROM raw_customers;
-- Step 2: DNS check (only on syntax-passing emails)
CREATE TABLE step2_dns AS
SELECT
customer_id,
email,
CASE
WHEN validation_step = 'syntax_fail' THEN 'syntax_fail'
WHEN anofox_email_validate(email, 'dns') THEN 'dns_pass'
ELSE 'dns_fail'
END as validation_step
FROM step1_syntax;
-- Step 3: SMTP check (only on DNS-passing emails)
CREATE TABLE step3_smtp AS
SELECT
customer_id,
email,
CASE
WHEN validation_step = 'dns_fail' THEN 'dns_fail'
WHEN anofox_email_validate(email, 'smtp') THEN 'smtp_pass'
ELSE 'smtp_fail'
END as validation_step
FROM step2_dns;
-- Final result: only emails that pass all stages
SELECT * FROM step3_smtp WHERE validation_step = 'smtp_pass';
Comparison Matrix
| Strategy | Speed | Accuracy | Cost | Use Case |
|---|---|---|---|---|
| Syntax Only | Instant | Low | Free | Initial filter, high volume |
| Syntax + DNS | ~100ms | High | $$ | Email validation, moderate volume |
| Syntax + SMTP | ~500ms | Very High | $$$ | Critical emails, small volume |
| Progressive | Fast path | Balanced | $ then $$ | Mixed quality data |
Decision Tree
Start: New data to validate
↓
Is speed critical (< 1ms)?
├─ YES → Use Syntax only (regex)
└─ NO → Continue
↓
Can you afford 100-500ms per record?
├─ NO → Use Syntax only
└─ YES → Continue
↓
Is data correctness critical?
├─ NO → Use Syntax + DNS
└─ YES → Use Syntax + DNS + SMTP (or Progressive)
Performance Tips
Batch Processing
Don't validate one record at a time. Batch and parallelize:
-- Fast: Validate in SQL, vectorized
SELECT
customer_id,
email,
anofox_email_validate(email, 'regex') as is_valid
FROM customers; -- All at once, vectorized
-- Slow: Validate in application, one by one
for customer in customers:
is_valid = validate_email(customer.email) # Network round-trip per record
Separate Valid/Invalid Early
-- Good: Separate valid/invalid early, handle each path
CREATE TABLE valid_emails AS
SELECT * FROM raw_customers
WHERE anofox_email_validate(email, 'dns') IS TRUE;
CREATE TABLE invalid_emails AS
SELECT * FROM raw_customers
WHERE anofox_email_validate(email, 'dns') IS NOT TRUE;
-- Then process each:
-- - Valid emails: proceed to forecasting/analytics
-- - Invalid emails: manual review, re-contact customer
Cache Validation Results
-- Don't re-validate the same email multiple times
CREATE TABLE email_validation_cache AS
SELECT
DISTINCT email,
anofox_email_validate(email, 'dns') as is_valid
FROM raw_customers;
-- Then join to get cached results
SELECT
customers.*,
cache.is_valid
FROM customers
LEFT JOIN email_validation_cache cache USING (email);
Next Steps
- Anomaly Detection Methods — Detect statistical outliers
- Data Profiling — Understand your data landscape
- Email Validation Guide — Deep dive into email checks