
Data Profiling

Understand your data's characteristics before validating and analyzing it.


What is Data Profiling?

Data profiling is the systematic examination of data to understand its structure, content, and quality. It answers:

  • How much data do I have?
  • How complete is it?
  • How diverse is it?
  • How fresh is it?
  • What are its normal ranges and distributions?

The Six Essential Metrics


1. Volume

Question: How much data do I have?

Metric: Row count and size

SELECT
COUNT(*) as total_rows,
-- approximate size: cast each full row (referenced via the table name, DuckDB-style) to VARCHAR and sum the lengths
ROUND(SUM(length(CAST(customer_data AS VARCHAR))) / 1024 / 1024, 2) as size_mb
FROM customer_data;

Why it matters:

  • Determines validation strategy (small datasets → thorough checks; large → sampled checks)
  • Affects performance and cost

Threshold considerations:

  • < 1M rows: Can do intensive checks
  • 1M-100M rows: Balance speed and accuracy
  • > 100M rows: Use sampling and streaming checks
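
For tables in the > 100M row range, profiling a random sample is usually enough to establish baselines. A minimal sketch, assuming DuckDB-style sampling syntax (adapt USING SAMPLE / TABLESAMPLE to your engine):

-- Profile a 1% sample instead of the full table (DuckDB-style USING SAMPLE; syntax varies by engine)
SELECT
COUNT(*) as sampled_rows,
COUNT(DISTINCT customer_id) as sampled_unique_customers
FROM customer_data USING SAMPLE 1% (system);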

2. Null Rate (Completeness)

Question: How much data is missing?

Metric: Percentage of NULL values per column

SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE email IS NOT NULL) as email_non_null,
ROUND(100.0 * COUNT(*) FILTER (WHERE email IS NOT NULL) / COUNT(*), 1) as email_completeness_pct
FROM customers;

Why it matters:

  • Identifies which analyses are possible
  • Reveals data collection failures
  • Guides imputation strategy

Threshold considerations:

  • > 95% complete: Acceptable
  • 80-95% complete: Investigate gaps
  • < 80% complete: May not be usable
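
NULL is not the only form of missingness: empty strings and placeholder values can hide gaps that a pure NULL check misses. A minimal sketch that also counts blank emails (how placeholders are encoded is an assumption about your data):

-- Count literal NULLs plus "effectively missing" values (empty or whitespace-only strings)
SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE email IS NULL) as email_null,
COUNT(*) FILTER (WHERE email IS NOT NULL AND trim(email) = '') as email_blank,
ROUND(100.0 * COUNT(*) FILTER (WHERE email IS NOT NULL AND trim(email) <> '') / COUNT(*), 1) as email_usable_pct
FROM customers;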

3. Distinct Count (Cardinality & Uniqueness)

Question: How many unique values are there?

Metric: COUNT(DISTINCT column) / COUNT(*) = cardinality ratio

SELECT
COUNT(*) as total,
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(DISTINCT country) as unique_countries,
ROUND(100.0 * COUNT(DISTINCT customer_id) / COUNT(*), 1) as uniqueness_pct
FROM customers;

Why it matters:

  • Reveals if column should be a key (ID columns should be 100% unique)
  • Shows diversity (high cardinality = diverse; low = concentrated)
  • Detects duplicates

Threshold considerations:

  • ID columns: Should be ~100% unique
  • Category columns: 5-100 distinct values typical
  • Numeric columns: Look for unexpected clustering
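
To confirm that an ID column really is ~100% unique, list the values that repeat; any rows returned are duplicates to investigate. A minimal sketch against the customers table from the example above:

-- Each row returned is a customer_id that appears more than once
SELECT
customer_id,
COUNT(*) as occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;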

4. Freshness (Temporal Quality)

Question: How current is my data?

Metric: Age of most recent record

SELECT
MAX(updated_at) as most_recent_update,
now() - MAX(updated_at) as age_since_update,
COUNT(*) FILTER (WHERE updated_at > now() - interval '7 days') as updated_past_week,
ROUND(100.0 * COUNT(*) FILTER (WHERE updated_at > now() - interval '7 days') / COUNT(*), 1) as recent_pct
FROM customer_data;

Why it matters:

  • Indicates if data is useful for decisions
  • Shows update frequency
  • Guides caching and refresh strategy

Threshold considerations:

  • Real-time data: Should be < 1 minute old
  • Daily batch data: Should be < 24 hours old
  • Reference data: Can be weeks old
  • Customer profiles: Should be < 30 days old
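
Beyond the age of the newest record, bucketing updates by day shows whether the feed arrives on schedule or has gaps. A minimal sketch using date_trunc (available in DuckDB and PostgreSQL):

-- Records updated per day over the last two weeks; missing days suggest pipeline gaps
SELECT
date_trunc('day', updated_at) as update_day,
COUNT(*) as records_updated
FROM customer_data
WHERE updated_at > now() - interval '14 days'
GROUP BY 1
ORDER BY 1 DESC;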

5. Schema & Data Types

Question: Is the data structured as expected?

Metric: Column types, constraints, and ranges

-- Check schema
DESCRIBE customer_data;

-- Check ranges
SELECT
MIN(amount) as min_amount,
MAX(amount) as max_amount,
AVG(amount) as avg_amount,
STDDEV(amount) as stddev_amount
FROM transactions;

Why it matters:

  • Catches type mismatches (string instead of numeric)
  • Reveals out-of-range values
  • Detects schema drift

Threshold considerations:

  • Numeric columns: Check min/max make sense
  • String columns: Check length distribution
  • Dates: Check min/max are reasonable
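
Two quick checks for the issues above: values in a text column that should be numeric but fail to cast, and the length distribution of a string column. A minimal sketch, assuming DuckDB's TRY_CAST and a hypothetical raw_transactions staging table where amount_raw is stored as text:

-- TRY_CAST returns NULL instead of erroring, so non-castable values are easy to count
-- (raw_transactions and amount_raw are hypothetical names)
SELECT
COUNT(*) FILTER (WHERE amount_raw IS NOT NULL AND TRY_CAST(amount_raw AS DECIMAL(18, 2)) IS NULL) as non_numeric_values
FROM raw_transactions;

-- Length distribution of a string column
SELECT
MIN(length(email)) as min_len,
MAX(length(email)) as max_len,
ROUND(AVG(length(email)), 1) as avg_len
FROM customers;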

6. Distribution & Skewness

Question: Is my data normally distributed or skewed?

Metric: Percentiles, skewness, kurtosis

SELECT
QUANTILE_DISC(amount, 0.25) as q1,
QUANTILE_DISC(amount, 0.50) as median,
QUANTILE_DISC(amount, 0.75) as q3,
QUANTILE_DISC(amount, 0.95) as p95,
QUANTILE_DISC(amount, 0.99) as p99
FROM transactions;

Why it matters:

  • Informs anomaly detection method (Z-Score works for normal; IQR better for skewed)
  • Guides statistical analysis assumptions
  • Reveals multimodal distributions

Threshold considerations:

  • Normal distribution: IQR ≈ 1.35 * STDDEV
  • Right-skewed: p99 >> p95 >> median (long upper tail)
  • Left-skewed: median >> p5 >> p1 (long lower tail)
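
If your engine exposes skewness and kurtosis aggregates (DuckDB does), the rules of thumb above can be computed directly; comparing the IQR to 1.35 * STDDEV is a quick normality sanity check. A minimal sketch:

-- Near-zero skewness and IQR ≈ 1.35 * stddev both suggest a roughly normal distribution
SELECT
skewness(amount) as skew,
kurtosis(amount) as kurt,
QUANTILE_DISC(amount, 0.75) - QUANTILE_DISC(amount, 0.25) as iqr,
1.35 * STDDEV(amount) as expected_iqr_if_normal
FROM transactions;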

The Data Quality Dashboard

Create a single view for quick data health checks:

CREATE VIEW data_quality_profile AS
SELECT
'customers' as table_name,
COUNT(*) as row_count,
CURRENT_TIMESTAMP as profiled_at,
ROUND(100.0 * COUNT(*) FILTER (WHERE email IS NOT NULL) / COUNT(*), 1) as email_completeness_pct,
ROUND(100.0 * COUNT(*) FILTER (WHERE phone IS NOT NULL) / COUNT(*), 1) as phone_completeness_pct,
COUNT(DISTINCT customer_id) as unique_customers,
MAX(updated_at) as most_recent_update,
date_diff('day', MAX(updated_at), now()) as age_days,
MIN(created_at) as oldest_record,
ROUND(100.0 * COUNT(*) FILTER (WHERE created_at > now() - interval '90 days') / COUNT(*), 1) as created_last_90days_pct
FROM customers
GROUP BY 1;
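
Because it is a view, data_quality_profile always reflects the current state of the table. To track quality over time, snapshot it into a history table on a schedule (profiled_at timestamps each run); the history table name here is an assumption:

-- One-time setup: seed a history table from the current profile
CREATE TABLE data_quality_history AS
SELECT * FROM data_quality_profile;

-- On each scheduled run: append the latest snapshot
INSERT INTO data_quality_history
SELECT * FROM data_quality_profile;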

Profiling Workflow

Step 1: Quick Scan

-- Get basic stats
DESCRIBE table_name;
SELECT COUNT(*) as rows, COUNT(DISTINCT id) as unique_ids FROM table_name;

Step 2: Completeness Check

-- Find which columns have nulls
SELECT
COUNT(*) FILTER (WHERE col1 IS NULL) as col1_nulls,
COUNT(*) FILTER (WHERE col2 IS NULL) as col2_nulls,
COUNT(*) FILTER (WHERE col3 IS NULL) as col3_nulls
FROM table_name;

Step 3: Distribution Analysis

-- Understand normal ranges
SELECT
COUNT(*) as total,
MIN(numeric_col) as min_val,
QUANTILE_DISC(numeric_col, 0.25) as q1,
QUANTILE_DISC(numeric_col, 0.50) as median,
QUANTILE_DISC(numeric_col, 0.75) as q3,
MAX(numeric_col) as max_val
FROM table_name;

Step 4: Anomaly Baseline

-- Understand normal vs. anomalous
SELECT
numeric_col,
COUNT(*) as frequency,
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM table_name), 1) as pct
FROM table_name
GROUP BY 1
ORDER BY frequency DESC
LIMIT 20;
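
The frequency query above suits low-cardinality columns; for continuous values, bin first so each bucket has enough rows to be meaningful. A minimal sketch with a fixed bucket width of 10 (choose a width that matches the column's scale):

-- Histogram with a bucket width of 10; adjust the width to the column's range
SELECT
FLOOR(numeric_col / 10) * 10 as bucket_start,
COUNT(*) as frequency,
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM table_name), 1) as pct
FROM table_name
GROUP BY 1
ORDER BY bucket_start;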

Setting Quality Thresholds

Use profiling to establish baselines:

Metric           | Acceptable | Investigate  | Critical
Completeness     | > 95%      | 80-95%       | < 80%
Uniqueness (IDs) | 99-100%    | 95-99%       | < 95%
Freshness        | < 24h old  | 1-7 days old | > 7 days old
Nulls            | < 5%       | 5-20%        | > 20%
Anomaly Rate     | < 1%       | 1-5%         | > 5%
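
These thresholds can be applied mechanically to the data_quality_profile view defined earlier. A minimal sketch for the completeness rule:

-- Classify email completeness using the thresholds from the table above
SELECT
table_name,
email_completeness_pct,
CASE
WHEN email_completeness_pct > 95 THEN 'acceptable'
WHEN email_completeness_pct >= 80 THEN 'investigate'
ELSE 'critical'
END as completeness_status
FROM data_quality_profile;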
