Data Operations

Dataset comparison and diffing functions.

Overview

Data operations allow you to compare two tables to find differences, identify changes, and validate data consistency. Use these for:

Schema validation (before migration)
Change tracking (audit trails)
Data sync verification
Dataset comparison

Diffing Functions (2)

anofox_table_diff_hash

Hash-based table diff for change detection.

Signature:

anofox_table_diff_hash(table1 VARCHAR, table2 VARCHAR) → TABLE

Parameters:

table1: Source table name
table2: Comparison table name

Returns: Rows in table1 but not in table2, and vice versa

Speed: O(n) per table, hash-based comparison

Method:

Computes hash of each row
Compares hash sets to find differences
Returns only rows with different hashes

Example:

SELECT * FROM anofox_table_diff_hash('customers', 'customers_backup');

Output:

table_name | row_number | change_type | reason
customers  | 1          | ADDED       | Row exists in table1 but not in table2
customers  | 5          | MODIFIED    | Row hash differs
customers_backup | 3    | DELETED     | Row exists in table2 but not in table1

Pros:

Fast (hash-based)
Works with any column types
Simple to use
Good for "what changed?" queries

Cons:

Doesn't show which columns changed
Requires rows to have stable order
Can't see detailed differences

anofox_table_diff_join

Join-based table diff with detailed column-level differences.

Signature:

anofox_table_diff_join(table1 VARCHAR, table2 VARCHAR, key_columns ARRAY[VARCHAR]) → TABLE

Parameters:

table1: Source table name
table2: Comparison table name
key_columns: Column(s) to join on (e.g., ARRAY['id', 'customer_id'])

Returns: Detailed row-by-row comparison

Speed: O(n log n) join-based

Method:

Joins table1 and table2 on key columns
Compares all columns row-by-row
Returns matches, mismatches, and rows only in one table

Example:

SELECT * FROM anofox_table_diff_join(
    'customers',
    'customers_backup',
    ARRAY['customer_id']
);

Output:

customer_id | email_match | phone_match | amount_match | status
         | TRUE        | TRUE        | FALSE        | DIFFERENT
         | FALSE       | TRUE        | TRUE         | DIFFERENT
         | TRUE        | TRUE        | TRUE         | SAME
         | NULL        | NULL        | NULL         | ONLY_IN_TABLE1

Pros:

Column-level detail
See exactly which columns differ
Identifies additions and deletions
Better for data validation

Cons:

Slower (full join)
Requires defining key columns
More verbose output

Practical Patterns

Pattern 1: Validate Before ETL

Ensure source matches expectations:

-- Check if source data matches known schema
CREATE TABLE validation_schema AS
SELECT * FROM anofox_table_diff_join(
    'raw_import',
    'customers',  -- Reference schema
    ARRAY['customer_id']
)
WHERE status != 'SAME';

-- Count mismatches
SELECT COUNT(*) as mismatch_count FROM validation_schema;

Pattern 2: Change Tracking

Track what changed between exports:

-- Compare today's export to yesterday's
CREATE TABLE change_log AS
SELECT
    customer_id,
    'EMAIL_CHANGED' as change_type
FROM anofox_table_diff_join(
    'customers_today',
    'customers_yesterday',
    ARRAY['customer_id']
)
WHERE email_match = FALSE;

-- Combine with other changes
INSERT INTO change_log
SELECT
    customer_id,
    'PHONE_CHANGED' as change_type
FROM anofox_table_diff_join(
    'customers_today',
    'customers_yesterday',
    ARRAY['customer_id']
)
WHERE phone_match = FALSE;

Pattern 3: Data Sync Verification

Ensure replicas are in sync:

-- Verify read replica matches primary
CREATE TABLE sync_health AS
SELECT
    COUNT(*) as total_rows,
    COUNT(*) FILTER (WHERE status = 'SAME') as matching_rows,
    COUNT(*) FILTER (WHERE status = 'DIFFERENT') as different_rows,
    COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE1') as only_in_primary,
    COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE2') as only_in_replica,
    ROUND(100.0 * COUNT(*) FILTER (WHERE status = 'SAME') / COUNT(*), 2) as sync_percentage
FROM anofox_table_diff_join(
    'customers_primary',
    'customers_replica',
    ARRAY['customer_id']
);

SELECT * FROM sync_health;

Pattern 4: Migration Validation

Compare before/after migration:

-- Ensure old and new systems have same data
CREATE TABLE migration_validation AS
SELECT * FROM anofox_table_diff_join(
    'customers_legacy_system',
    'customers_new_system',
    ARRAY['customer_id']
)
WHERE status != 'SAME';

-- Flag failures
SELECT
    COUNT(*) as total_differences,
    COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE1') as lost_in_migration,
    COUNT(*) FILTER (WHERE status = 'ONLY_IN_TABLE2') as added_in_migration,
    COUNT(*) FILTER (WHERE status = 'DIFFERENT') as data_corruption
FROM migration_validation;

Pattern 5: Audit Trail

Create immutable change log:

-- Append-only change audit
CREATE TABLE audit_table_changes (
    audit_id SERIAL PRIMARY KEY,
    table_name VARCHAR,
    column_name VARCHAR,
    row_key VARCHAR,
    old_value VARCHAR,
    new_value VARCHAR,
    changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    changed_by VARCHAR
);

-- Log changes detected
INSERT INTO audit_table_changes
    (table_name, column_name, row_key, old_value, new_value, changed_by)
SELECT
    'customers' as table_name,
    'email' as column_name,
    CAST(customer_id AS VARCHAR) as row_key,
    old.email as old_value,
    new.email as new_value,
    'system' as changed_by
FROM anofox_table_diff_join(
    'customers_yesterday',
    'customers_today',
    ARRAY['customer_id']
) diff
JOIN customers_yesterday old USING (customer_id)
JOIN customers_today new USING (customer_id)
WHERE diff.email_match = FALSE;

Performance Considerations

Hash-Based (Faster)

-- Use when you just need "are they different?"
SELECT * FROM anofox_table_diff_hash('t1', 't2');

Best for:

Large tables (10M+ rows)
Any column types
Just checking equality

Join-Based (More Detail)

-- Use when you need to know what changed
SELECT * FROM anofox_table_diff_join('t1', 't2', ARRAY['id']);

Best for:

Medium tables (< 10M rows)
Detailed change reports
Column-level validation

Combining with Other Functions

Diff + Validation

-- Find differences AND validate them
SELECT
    diff.*,
    anofox_vat_is_valid(new.vat_id) as vat_valid
FROM anofox_table_diff_join(
    'customers_old',
    'customers_new',
    ARRAY['customer_id']
) diff
LEFT JOIN customers_new new USING (customer_id)
WHERE diff.status = 'DIFFERENT'
  AND NOT anofox_vat_is_valid(new.vat_id);

Diff + Quality Metrics

-- Compare quality before/after transformation
SELECT
    'before' as phase,
    anofox_metric_nullness(email) as null_rate,
    anofox_metric_freshness(updated_at) as age_days
FROM customers_old

UNION ALL

SELECT
    'after' as phase,
    anofox_metric_nullness(email) as null_rate,
    anofox_metric_freshness(updated_at) as age_days
FROM customers_new;

Best Practices

✅ Use hash diff for quick checks
✅ Use join diff for detailed analysis
✅ Audit all table modifications
✅ Keep before/after snapshots
✅ Schedule regular sync checks
✅ Alert on unexpected differences

Next Steps

Basic Workflow — Use diffing in validation pipeline
Production Deployment — Integrate diffing into CI/CD
Quality Metrics — Track quality alongside changes

Overview​

Diffing Functions (2)​

anofox_table_diff_hash​

anofox_table_diff_join​

Practical Patterns​

Pattern 1: Validate Before ETL​

Pattern 2: Change Tracking​

Pattern 3: Data Sync Verification​

Pattern 4: Migration Validation​

Pattern 5: Audit Trail​

Performance Considerations​

Hash-Based (Faster)​

Join-Based (More Detail)​

Combining with Other Functions​

Diff + Validation​

Diff + Quality Metrics​

Best Practices​

Next Steps​