Skip to main content

Validation Strategies

Choose the right validation approach for your data and constraints.


Three Types of Validation

All validation falls into three categories, each with different guarantees and costs.


1. Syntax Validation (Format Checking)

Definition: Does the data match the expected format/pattern?

Cost: <1ms per record (instant)
Guarantee: Format is correct, content may not be
Use when: Need fast filtering of obviously bad data

Examples

Email Syntax:

SELECT anofox_email_validate('john@example.com', 'regex') as is_valid;
-- Output: true (matches RFC 5322 pattern)

Phone Syntax:

SELECT anofox_phone_format('+1-555-0123', 'US') as is_valid;
-- Output: true (valid US number format)

VAT Syntax:

SELECT anofox_vat_is_valid('DE123456789') as is_valid;
-- Output: true (matches German VAT format)

Currency Syntax:

SELECT anofox_money_is_valid_amount(100.50, 0.01, 9999999.99) as in_range;
-- Output: true (within valid range)

2. Semantic Validation (External Data Checking)

Definition: Does the data exist and make sense in the external system?

Cost: ~100-500ms per record (network-dependent)
Guarantee: Format is correct AND the data exists/is issued
Use when: Can afford network latency; data correctness is critical

Examples

Email DNS Check:

SELECT anofox_email_validate('john@example.com', 'dns') as domain_exists;
-- Checks: Does example.com have MX records?
-- Output: true (domain has mail servers)

Email SMTP Check:

SELECT anofox_email_validate('john@example.com', 'smtp') as mailbox_exists;
-- Checks: Does the mailbox exist on the mail server?
-- Output: true (mailbox is real)
-- WARNING: Some servers don't allow SMTP checks; timeout is ~500ms

Phone Carrier Check:

SELECT anofox_phone_validate('+1-555-0123', 'carrier') as carrier_verified;
-- Checks: Is this phone number issued to a carrier?
-- Output: depends on carrier database

VAT Verification (EU):

SELECT anofox_vat_verify_eu('DE123456789') as is_eu_registered;
-- Checks: Is this VAT ID registered in the EU VIES system?
-- Output: true (VAT is active and registered)

3. Progressive Validation (Tiered Approach)

Definition: Apply cheap checks first, expensive checks only if cheap checks pass.

Cost: Adaptive (fast path for obviously bad data, thorough path for good data)
Guarantee: Balanced speed and accuracy
Use when: Validating large datasets with mixed data quality

Example: Email Validation Pipeline

-- Step 1: Syntax check (instant)
CREATE TABLE step1_syntax AS
SELECT
customer_id,
email,
CASE
WHEN anofox_email_validate(email, 'regex') THEN 'syntax_pass'
ELSE 'syntax_fail'
END as validation_step
FROM raw_customers;

-- Step 2: DNS check (only on syntax-passing emails)
CREATE TABLE step2_dns AS
SELECT
customer_id,
email,
CASE
WHEN validation_step = 'syntax_fail' THEN 'syntax_fail'
WHEN anofox_email_validate(email, 'dns') THEN 'dns_pass'
ELSE 'dns_fail'
END as validation_step
FROM step1_syntax;

-- Step 3: SMTP check (only on DNS-passing emails)
CREATE TABLE step3_smtp AS
SELECT
customer_id,
email,
CASE
WHEN validation_step = 'dns_fail' THEN 'dns_fail'
WHEN anofox_email_validate(email, 'smtp') THEN 'smtp_pass'
ELSE 'smtp_fail'
END as validation_step
FROM step2_dns;

-- Final result: only emails that pass all stages
SELECT * FROM step3_smtp WHERE validation_step = 'smtp_pass';

Comparison Matrix

StrategySpeedAccuracyCostUse Case
Syntax OnlyInstantLowFreeInitial filter, high volume
Syntax + DNS~100msHigh$$Email validation, moderate volume
Syntax + SMTP~500msVery High$$$Critical emails, small volume
ProgressiveFast pathBalanced$ then $$Mixed quality data

Decision Tree

Start: New data to validate

Is speed critical (&lt; 1ms)?
├─ YES → Use Syntax only (regex)
└─ NO → Continue

Can you afford 100-500ms per record?
├─ NO → Use Syntax only
└─ YES → Continue

Is data correctness critical?
├─ NO → Use Syntax + DNS
└─ YES → Use Syntax + DNS + SMTP (or Progressive)

Performance Tips

Batch Processing

Don't validate one record at a time. Batch and parallelize:

-- Fast: Validate in SQL, vectorized
SELECT
customer_id,
email,
anofox_email_validate(email, 'regex') as is_valid
FROM customers; -- All at once, vectorized

-- Slow: Validate in application, one by one
for customer in customers:
is_valid = validate_email(customer.email) # Network round-trip per record

Separate Valid/Invalid Early

-- Good: Separate valid/invalid early, handle each path
CREATE TABLE valid_emails AS
SELECT * FROM raw_customers
WHERE anofox_email_validate(email, 'dns') IS TRUE;

CREATE TABLE invalid_emails AS
SELECT * FROM raw_customers
WHERE anofox_email_validate(email, 'dns') IS NOT TRUE;

-- Then process each:
-- - Valid emails: proceed to forecasting/analytics
-- - Invalid emails: manual review, re-contact customer

Cache Validation Results

-- Don't re-validate the same email multiple times
CREATE TABLE email_validation_cache AS
SELECT
DISTINCT email,
anofox_email_validate(email, 'dns') as is_valid
FROM raw_customers;

-- Then join to get cached results
SELECT
customers.*,
cache.is_valid
FROM customers
LEFT JOIN email_validation_cache cache USING (email);

Next Steps

🍪 Cookie Settings