Data Operations
Dataset comparison and diffing functions.
Overview
Data operations allow you to compare two tables to find differences, identify changes, and validate data consistency. Use these for:
- Schema validation (before migration)
- Change tracking (audit trails)
- Data sync verification
- Dataset comparison
Diffing Functions (2)
anofox_table_diff_hash
Hash-based table diff for change detection.
Signature:
anofox_table_diff_hash(table1 VARCHAR, table2 VARCHAR) → TABLE
Parameters:
table1: Source table nametable2: Comparison table name
Returns: Rows in table1 but not in table2, and vice versa
Speed: O(n) per table, hash-based comparison
Method:
- Computes hash of each row
- Compares hash sets to find differences
- Returns only rows with different hashes
Example:
SELECT * FROM anofox_table_diff_hash('customers', 'customers_backup');
Output:
table_name | row_number | change_type | reason
customers | 1 | ADDED | Row exists in table1 but not in table2
customers | 5 | MODIFIED | Row hash differs
customers_backup | 3 | DELETED | Row exists in table2 but not in table1
Pros:
- Fast (hash-based)
- Works with any column types
- Simple to use
- Good for "what changed?" queries
Cons:
- Doesn't show which columns changed
- Requires rows to have stable order
- Can't see detailed differences
anofox_table_diff_join
Join-based table diff with detailed column-level differences.
Signature:
anofox_table_diff_join(table1 VARCHAR, table2 VARCHAR, key_columns ARRAY[VARCHAR]) → TABLE
Parameters:
table1: Source table nametable2: Comparison table namekey_columns: Column(s) to join on (e.g., ARRAY['id', 'customer_id'])
Returns: Detailed row-by-row comparison
Speed: O(n log n) join-based
Method:
- Joins table1 and table2 on key columns
- Compares all columns row-by-row
- Returns matches, mismatches, and rows only in one table
Example:
SELECT * FROM anofox_table_diff_join(
'customers',
'customers_backup',
ARRAY['customer_id']
);
Output:
customer_id | email_match | phone_match | amount_match | status
1 | TRUE | TRUE | FALSE | DIFFERENT
2 | FALSE | TRUE | TRUE | DIFFERENT
3 | TRUE | TRUE | TRUE | SAME
4 | NULL | NULL | NULL | ONLY_IN_TABLE1
Pros:
- Column-level detail
- See exactly which columns differ
- Identifies additions and deletions
- Better for data validation
Cons:
- Slower (full join)
- Requires defining key columns
- More verbose output
Practical Patterns
Pattern 1: Validate Before ETL
Ensure source matches expectations:
-- Check if source data matches known schema
CREATE TABLE validation_schema AS
SELECT * FROM anofox_table_diff_join(
'raw_import',
'customers', -- Reference schema
ARRAY['customer_id']
)
WHERE status != 'SAME';
-- Count mismatches
SELECT COUNT(*) as mismatch_count FROM validation_schema;
Pattern 2: Change Tracking
Track what changed between exports:
-- Compare today's export to yesterday's
CREATE TABLE change_log AS
SELECT
customer_id,
'EMAIL_CHANGED' as change_type
FROM anofox_table_diff_join(
'customers_today',
'customers_yesterday',
ARRAY['customer_id']
)
WHERE email_match = FALSE;
-- Combine with other changes
INSERT INTO change_log
SELECT
customer_id,
'PHONE_CHANGED' as change_type
FROM anofox_table_diff_join(
'customers_today',
'customers_yesterday',
ARRAY['customer_id']
)
WHERE phone_match = FALSE;
Pattern 3: Data Sync Verification
Ensure replicas are in sync:
-- Verify read replica matches primary
CREATE TABLE sync_health AS
SELECT
COUNT(*) as total_rows,
COUNT(*) FILTER (WHERE status = 'SAME') as matching_rows,
COUNT(*) FILTER (WHERE status = 'DIFFERENT') as different_rows,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE1') as only_in_primary,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE2') as only_in_replica,
ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'SAME') / COUNT(*), 2) as sync_percentage
FROM anofox_table_diff_join(
'customers_primary',
'customers_replica',
ARRAY['customer_id']
);
SELECT * FROM sync_health;
Pattern 4: Migration Validation
Compare before/after migration:
-- Ensure old and new systems have same data
CREATE TABLE migration_validation AS
SELECT * FROM anofox_table_diff_join(
'customers_legacy_system',
'customers_new_system',
ARRAY['customer_id']
)
WHERE status != 'SAME';
-- Flag failures
SELECT
COUNT(*) as total_differences,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE1') as lost_in_migration,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE2') as added_in_migration,
COUNT(*) FILTER (WHERE status = 'DIFFERENT') as data_corruption
FROM migration_validation;
Pattern 5: Audit Trail
Create immutable change log:
-- Append-only change audit
CREATE TABLE audit_table_changes (
audit_id SERIAL PRIMARY KEY,
table_name VARCHAR,
column_name VARCHAR,
row_key VARCHAR,
old_value VARCHAR,
new_value VARCHAR,
changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
changed_by VARCHAR
);
-- Log changes detected
INSERT INTO audit_table_changes
(table_name, column_name, row_key, old_value, new_value, changed_by)
SELECT
'customers' as table_name,
'email' as column_name,
CAST(customer_id AS VARCHAR) as row_key,
old.email as old_value,
new.email as new_value,
'system' as changed_by
FROM anofox_table_diff_join(
'customers_yesterday',
'customers_today',
ARRAY['customer_id']
) diff
JOIN customers_yesterday old USING (customer_id)
JOIN customers_today new USING (customer_id)
WHERE diff.email_match = FALSE;
Performance Considerations
Hash-Based (Faster)
-- Use when you just need "are they different?"
SELECT * FROM anofox_table_diff_hash('t1', 't2');
Best for:
- Large tables (10M+ rows)
- Any column types
- Just checking equality
Join-Based (More Detail)
-- Use when you need to know what changed
SELECT * FROM anofox_table_diff_join('t1', 't2', ARRAY['id']);
Best for:
- Medium tables (< 10M rows)
- Detailed change reports
- Column-level validation
Combining with Other Functions
Diff + Validation
-- Find differences AND validate them
SELECT
diff.*,
anofox_vat_is_valid(new.vat_id) as vat_valid
FROM anofox_table_diff_join(
'customers_old',
'customers_new',
ARRAY['customer_id']
) diff
LEFT JOIN customers_new new USING (customer_id)
WHERE diff.status = 'DIFFERENT'
AND NOT anofox_vat_is_valid(new.vat_id);
Diff + Quality Metrics
-- Compare quality before/after transformation
SELECT
'before' as phase,
anofox_metric_nullness(email) as null_rate,
anofox_metric_freshness(updated_at) as age_days
FROM customers_old
UNION ALL
SELECT
'after' as phase,
anofox_metric_nullness(email) as null_rate,
anofox_metric_freshness(updated_at) as age_days
FROM customers_new;
Best Practices
✅ Use hash diff for quick checks
✅ Use join diff for detailed analysis
✅ Audit all table modifications
✅ Keep before/after snapshots
✅ Schedule regular sync checks
✅ Alert on unexpected differences
Next Steps
- Basic Workflow — Use diffing in validation pipeline
- Production Deployment — Integrate diffing into CI/CD
- Quality Metrics — Track quality alongside changes