Skip to main content

Data Operations

Dataset comparison and diffing functions.


Overview

Data operations allow you to compare two tables to find differences, identify changes, and validate data consistency. Use these for:

  • Schema validation (before migration)
  • Change tracking (audit trails)
  • Data sync verification
  • Dataset comparison

Diffing Functions (2)

anofox_table_diff_hash

Hash-based table diff for change detection.

Signature:

anofox_table_diff_hash(table1 VARCHAR, table2 VARCHAR)TABLE

Parameters:

  • table1: Source table name
  • table2: Comparison table name

Returns: Rows in table1 but not in table2, and vice versa

Speed: O(n) per table, hash-based comparison

Method:

  • Computes hash of each row
  • Compares hash sets to find differences
  • Returns only rows with different hashes

Example:

SELECT * FROM anofox_table_diff_hash('customers', 'customers_backup');

Output:

table_name | row_number | change_type | reason
customers | 1 | ADDED | Row exists in table1 but not in table2
customers | 5 | MODIFIED | Row hash differs
customers_backup | 3 | DELETED | Row exists in table2 but not in table1

Pros:

  • Fast (hash-based)
  • Works with any column types
  • Simple to use
  • Good for "what changed?" queries

Cons:

  • Doesn't show which columns changed
  • Requires rows to have stable order
  • Can't see detailed differences

anofox_table_diff_join

Join-based table diff with detailed column-level differences.

Signature:

anofox_table_diff_join(table1 VARCHAR, table2 VARCHAR, key_columns ARRAY[VARCHAR])TABLE

Parameters:

  • table1: Source table name
  • table2: Comparison table name
  • key_columns: Column(s) to join on (e.g., ARRAY['id', 'customer_id'])

Returns: Detailed row-by-row comparison

Speed: O(n log n) join-based

Method:

  • Joins table1 and table2 on key columns
  • Compares all columns row-by-row
  • Returns matches, mismatches, and rows only in one table

Example:

SELECT * FROM anofox_table_diff_join(
'customers',
'customers_backup',
ARRAY['customer_id']
);

Output:

customer_id | email_match | phone_match | amount_match | status
1 | TRUE | TRUE | FALSE | DIFFERENT
2 | FALSE | TRUE | TRUE | DIFFERENT
3 | TRUE | TRUE | TRUE | SAME
4 | NULL | NULL | NULL | ONLY_IN_TABLE1

Pros:

  • Column-level detail
  • See exactly which columns differ
  • Identifies additions and deletions
  • Better for data validation

Cons:

  • Slower (full join)
  • Requires defining key columns
  • More verbose output

Practical Patterns

Pattern 1: Validate Before ETL

Ensure source matches expectations:

-- Check if source data matches known schema
CREATE TABLE validation_schema AS
SELECT * FROM anofox_table_diff_join(
'raw_import',
'customers', -- Reference schema
ARRAY['customer_id']
)
WHERE status != 'SAME';

-- Count mismatches
SELECT COUNT(*) as mismatch_count FROM validation_schema;

Pattern 2: Change Tracking

Track what changed between exports:

-- Compare today's export to yesterday's
CREATE TABLE change_log AS
SELECT
customer_id,
'EMAIL_CHANGED' as change_type
FROM anofox_table_diff_join(
'customers_today',
'customers_yesterday',
ARRAY['customer_id']
)
WHERE email_match = FALSE;

-- Combine with other changes
INSERT INTO change_log
SELECT
customer_id,
'PHONE_CHANGED' as change_type
FROM anofox_table_diff_join(
'customers_today',
'customers_yesterday',
ARRAY['customer_id']
)
WHERE phone_match = FALSE;

Pattern 3: Data Sync Verification

Ensure replicas are in sync:

-- Verify read replica matches primary
CREATE TABLE sync_health AS
SELECT
COUNT(*) as total_rows,
COUNT(*) FILTER (WHERE status = 'SAME') as matching_rows,
COUNT(*) FILTER (WHERE status = 'DIFFERENT') as different_rows,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE1') as only_in_primary,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE2') as only_in_replica,
ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'SAME') / COUNT(*), 2) as sync_percentage
FROM anofox_table_diff_join(
'customers_primary',
'customers_replica',
ARRAY['customer_id']
);

SELECT * FROM sync_health;

Pattern 4: Migration Validation

Compare before/after migration:

-- Ensure old and new systems have same data
CREATE TABLE migration_validation AS
SELECT * FROM anofox_table_diff_join(
'customers_legacy_system',
'customers_new_system',
ARRAY['customer_id']
)
WHERE status != 'SAME';

-- Flag failures
SELECT
COUNT(*) as total_differences,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE1') as lost_in_migration,
COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE2') as added_in_migration,
COUNT(*) FILTER (WHERE status = 'DIFFERENT') as data_corruption
FROM migration_validation;

Pattern 5: Audit Trail

Create immutable change log:

-- Append-only change audit
CREATE TABLE audit_table_changes (
audit_id SERIAL PRIMARY KEY,
table_name VARCHAR,
column_name VARCHAR,
row_key VARCHAR,
old_value VARCHAR,
new_value VARCHAR,
changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
changed_by VARCHAR
);

-- Log changes detected
INSERT INTO audit_table_changes
(table_name, column_name, row_key, old_value, new_value, changed_by)
SELECT
'customers' as table_name,
'email' as column_name,
CAST(customer_id AS VARCHAR) as row_key,
old.email as old_value,
new.email as new_value,
'system' as changed_by
FROM anofox_table_diff_join(
'customers_yesterday',
'customers_today',
ARRAY['customer_id']
) diff
JOIN customers_yesterday old USING (customer_id)
JOIN customers_today new USING (customer_id)
WHERE diff.email_match = FALSE;

Performance Considerations

Hash-Based (Faster)

-- Use when you just need "are they different?"
SELECT * FROM anofox_table_diff_hash('t1', 't2');

Best for:

  • Large tables (10M+ rows)
  • Any column types
  • Just checking equality

Join-Based (More Detail)

-- Use when you need to know what changed
SELECT * FROM anofox_table_diff_join('t1', 't2', ARRAY['id']);

Best for:

  • Medium tables (< 10M rows)
  • Detailed change reports
  • Column-level validation

Combining with Other Functions

Diff + Validation

-- Find differences AND validate them
SELECT
diff.*,
anofox_vat_is_valid(new.vat_id) as vat_valid
FROM anofox_table_diff_join(
'customers_old',
'customers_new',
ARRAY['customer_id']
) diff
LEFT JOIN customers_new new USING (customer_id)
WHERE diff.status = 'DIFFERENT'
AND NOT anofox_vat_is_valid(new.vat_id);

Diff + Quality Metrics

-- Compare quality before/after transformation
SELECT
'before' as phase,
anofox_metric_nullness(email) as null_rate,
anofox_metric_freshness(updated_at) as age_days
FROM customers_old

UNION ALL

SELECT
'after' as phase,
anofox_metric_nullness(email) as null_rate,
anofox_metric_freshness(updated_at) as age_days
FROM customers_new;

Best Practices

✅ Use hash diff for quick checks
✅ Use join diff for detailed analysis
✅ Audit all table modifications
✅ Keep before/after snapshots
✅ Schedule regular sync checks
✅ Alert on unexpected differences


Next Steps

🍪 Cookie Settings