
Data Profiling

Understand your data's characteristics before validating and analyzing it.


What is Data Profiling?

Data profiling is the systematic examination of data to understand its structure, content, and quality. It answers:

  • How much data do I have?
  • How complete is it?
  • How diverse is it?
  • How fresh is it?
  • What are its normal ranges and distributions?

The Six Essential Metrics


1. Volume

Question: How much data do I have?

Metric: Row count and size

SELECT
COUNT(*) as total_rows,
-- approximate size: cast each full row (referenced via the table name, DuckDB-style) to VARCHAR and sum the lengths
ROUND(SUM(length(CAST(customer_data AS VARCHAR))) / 1024 / 1024, 2) as size_mb
FROM customer_data;

Why it matters:

  • Determines validation strategy (small datasets → thorough checks; large → sampled checks)
  • Affects performance and cost

Threshold considerations:

  • < 1M rows: Can do intensive checks
  • 1M-100M rows: Balance speed and accuracy
  • > 100M rows: Use sampling and streaming checks
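
For tables in the > 100M row range, profiling a random sample is usually enough to establish baselines. A minimal sketch, assuming DuckDB-style sampling syntax (adapt USING SAMPLE / TABLESAMPLE to your engine):

-- Profile a 1% sample instead of the full table (DuckDB-style USING SAMPLE; syntax varies by engine)
SELECT
COUNT(*) as sampled_rows,
COUNT(DISTINCT customer_id) as sampled_unique_customers
FROM customer_data USING SAMPLE 1% (system);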

2. Null Rate (Completeness)

Question: How much data is missing?

Metric: Percentage of NULL values per column

SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE email IS NOT NULL) as email_non_null,
ROUND(100.0 * COUNT(*) FILTER (WHERE email IS NOT NULL) / COUNT(*), 1) as email_completeness_pct
FROM customers;

Why it matters:

  • Identifies which analyses are possible
  • Reveals data collection failures
  • Guides imputation strategy

Threshold considerations:

  • > 95% complete: Acceptable
  • 80-95% complete: Investigate gaps
  • < 80% complete: May not be usable
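
NULL is not the only form of missingness: empty strings and placeholder values can hide gaps that a pure NULL check misses. A minimal sketch that also counts blank emails (how placeholders are encoded is an assumption about your data):

-- Count literal NULLs plus "effectively missing" values (empty or whitespace-only strings)
SELECT
COUNT(*) as total,
COUNT(*) FILTER (WHERE email IS NULL) as email_null,
COUNT(*) FILTER (WHERE email IS NOT NULL AND trim(email) = '') as email_blank,
ROUND(100.0 * COUNT(*) FILTER (WHERE email IS NOT NULL AND trim(email) <> '') / COUNT(*), 1) as email_usable_pct
FROM customers;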

3. Distinct Count (Cardinality & Uniqueness)

Question: How many unique values are there?

Metric: COUNT(DISTINCT column) / COUNT(*) = cardinality ratio

SELECT
COUNT(*) as total,
COUNT(DISTINCT customer_id) as unique_customers,
COUNT(DISTINCT country) as unique_countries,
ROUND(100.0 * COUNT(DISTINCT customer_id) / COUNT(*), 1) as uniqueness_pct
FROM customers;

Why it matters:

  • Reveals if column should be a key (ID columns should be 100% unique)
  • Shows diversity (high cardinality = diverse; low = concentrated)
  • Detects duplicates

Threshold considerations:

  • ID columns: Should be ~100% unique
  • Category columns: 5-100 distinct values typical
  • Numeric columns: Look for unexpected clustering
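
To confirm that an ID column really is ~100% unique, list the values that repeat; any rows returned are duplicates to investigate. A minimal sketch against the customers table from the example above:

-- Each row returned is a customer_id that appears more than once
SELECT
customer_id,
COUNT(*) as occurrences
FROM customers
GROUP BY customer_id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC
LIMIT 20;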

4. Freshness (Temporal Quality)

Question: How current is my data?

Metric: Age of most recent record

SELECT
MAX(updated_at) as most_recent_update,
now() - MAX(updated_at) as age_since_update,
COUNT(*) FILTER (WHERE updated_at > now() - interval '7 days') as updated_past_week,
ROUND(100.0 * COUNT(*) FILTER (WHERE updated_at > now() - interval '7 days') / COUNT(*), 1) as recent_pct
FROM customer_data;

Why it matters:

  • Indicates if data is useful for decisions
  • Shows update frequency
  • Guides caching and refresh strategy

Threshold considerations:

  • Real-time data: Should be < 1 minute old
  • Daily batch data: Should be < 24 hours old
  • Reference data: Can be weeks old
  • Customer profiles: Should be < 30 days old
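
Beyond the age of the newest record, bucketing updates by day shows whether the feed arrives on schedule or has gaps. A minimal sketch using date_trunc (available in DuckDB and PostgreSQL):

-- Records updated per day over the last two weeks; missing days suggest pipeline gaps
SELECT
date_trunc('day', updated_at) as update_day,
COUNT(*) as records_updated
FROM customer_data
WHERE updated_at > now() - interval '14 days'
GROUP BY 1
ORDER BY 1 DESC;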

5. Schema & Data Types

Question: Is the data structured as expected?

Metric: Column types, constraints, and ranges

-- Check schema
DESCRIBE customer_data;

-- Check ranges
SELECT
MIN(amount) as min_amount,
MAX(amount) as max_amount,
AVG(amount) as avg_amount,
STDDEV(amount) as stddev_amount
FROM transactions;

Why it matters:

  • Catches type mismatches (string instead of numeric)
  • Reveals out-of-range values
  • Detects schema drift

Threshold considerations:

  • Numeric columns: Check min/max make sense
  • String columns: Check length distribution
  • Dates: Check min/max are reasonable
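
Two quick checks for the issues above: values in a text column that should be numeric but fail to cast, and the length distribution of a string column. A minimal sketch, assuming DuckDB's TRY_CAST and a hypothetical raw_transactions staging table where amount_raw is stored as text:

-- TRY_CAST returns NULL instead of erroring, so non-castable values are easy to count
-- (raw_transactions and amount_raw are hypothetical names)
SELECT
COUNT(*) FILTER (WHERE amount_raw IS NOT NULL AND TRY_CAST(amount_raw AS DECIMAL(18, 2)) IS NULL) as non_numeric_values
FROM raw_transactions;

-- Length distribution of a string column
SELECT
MIN(length(email)) as min_len,
MAX(length(email)) as max_len,
ROUND(AVG(length(email)), 1) as avg_len
FROM customers;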

6. Distribution & Skewness

Question: Is my data normally distributed or skewed?

Metric: Percentiles, skewness, kurtosis

SELECT
QUANTILE_DISC(amount, 0.25) as q1,
QUANTILE_DISC(amount, 0.50) as median,
QUANTILE_DISC(amount, 0.75) as q3,
QUANTILE_DISC(amount, 0.95) as p95,
QUANTILE_DISC(amount, 0.99) as p99
FROM transactions;

Why it matters:

  • Informs anomaly detection method (Z-Score works for normal; IQR better for skewed)
  • Guides statistical analysis assumptions
  • Reveals multimodal distributions

Threshold considerations:

  • Normal distribution: IQR ≈ 1.35 * STDDEV
  • Right-skewed: p99 >> p95 >> median (long upper tail)
  • Left-skewed: median >> p5 >> p1 (long lower tail)
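
If your engine exposes skewness and kurtosis aggregates (DuckDB does), the rules of thumb above can be computed directly; comparing the IQR to 1.35 * STDDEV is a quick normality sanity check. A minimal sketch:

-- Near-zero skewness and IQR ≈ 1.35 * stddev both suggest a roughly normal distribution
SELECT
skewness(amount) as skew,
kurtosis(amount) as kurt,
QUANTILE_DISC(amount, 0.75) - QUANTILE_DISC(amount, 0.25) as iqr,
1.35 * STDDEV(amount) as expected_iqr_if_normal
FROM transactions;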

The Data Quality Dashboard

Create a single view for quick data health checks:

CREATE VIEW data_quality_profile AS
SELECT
'customers' as table_name,
COUNT(*) as row_count,
CURRENT_TIMESTAMP as profiled_at,
ROUND(100.0 * COUNT(*) FILTER (WHERE email IS NOT NULL) / COUNT(*), 1) as email_completeness_pct,
ROUND(100.0 * COUNT(*) FILTER (WHERE phone IS NOT NULL) / COUNT(*), 1) as phone_completeness_pct,
COUNT(DISTINCT customer_id) as unique_customers,
MAX(updated_at) as most_recent_update,
date_diff('day', MAX(updated_at), now()) as age_days,
MIN(created_at) as oldest_record,
ROUND(100.0 * COUNT(*) FILTER (WHERE created_at > now() - interval '90 days') / COUNT(*), 1) as created_last_90days_pct
FROM customers
GROUP BY 1;
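
Because it is a view, data_quality_profile always reflects the current state of the table. To track quality over time, snapshot it into a history table on a schedule (profiled_at timestamps each run); the history table name here is an assumption:

-- One-time setup: seed a history table from the current profile
CREATE TABLE data_quality_history AS
SELECT * FROM data_quality_profile;

-- On each scheduled run: append the latest snapshot
INSERT INTO data_quality_history
SELECT * FROM data_quality_profile;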

Profiling Workflow

Step 1: Quick Scan

-- Get basic stats
DESCRIBE table_name;
SELECT COUNT(*) as rows, COUNT(DISTINCT id) as unique_ids FROM table_name;

Step 2: Completeness Check

-- Find which columns have nulls
SELECT
COUNT(*) FILTER (WHERE col1 IS NULL) as col1_nulls,
COUNT(*) FILTER (WHERE col2 IS NULL) as col2_nulls,
COUNT(*) FILTER (WHERE col3 IS NULL) as col3_nulls
FROM table_name;

Step 3: Distribution Analysis

-- Understand normal ranges
SELECT
COUNT(*) as total,
MIN(numeric_col) as min_val,
QUANTILE_DISC(numeric_col, 0.25) as q1,
QUANTILE_DISC(numeric_col, 0.50) as median,
QUANTILE_DISC(numeric_col, 0.75) as q3,
MAX(numeric_col) as max_val
FROM table_name;

Step 4: Anomaly Baseline

-- Understand normal vs. anomalous
SELECT
numeric_col,
COUNT(*) as frequency,
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM table_name), 1) as pct
FROM table_name
GROUP BY 1
ORDER BY frequency DESC
LIMIT 20;
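
The frequency query above suits low-cardinality columns; for continuous values, bin first so each bucket has enough rows to be meaningful. A minimal sketch with a fixed bucket width of 10 (choose a width that matches the column's scale):

-- Histogram with a bucket width of 10; adjust the width to the column's range
SELECT
FLOOR(numeric_col / 10) * 10 as bucket_start,
COUNT(*) as frequency,
ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM table_name), 1) as pct
FROM table_name
GROUP BY 1
ORDER BY bucket_start;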

Setting Quality Thresholds

Use profiling to establish baselines:

Metric           | Acceptable | Investigate  | Critical
Completeness     | > 95%      | 80-95%       | < 80%
Uniqueness (IDs) | 99-100%    | 95-99%       | < 95%
Freshness        | < 24h old  | 1-7 days old | > 7 days old
Nulls            | < 5%       | 5-20%        | > 20%
Anomaly Rate     | < 1%       | 1-5%         | > 5%
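
These thresholds can be applied mechanically to the data_quality_profile view defined earlier. A minimal sketch for the completeness rule:

-- Classify email completeness using the thresholds from the table above
SELECT
table_name,
email_completeness_pct,
CASE
WHEN email_completeness_pct > 95 THEN 'acceptable'
WHEN email_completeness_pct >= 80 THEN 'investigate'
ELSE 'critical'
END as completeness_status
FROM data_quality_profile;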
