Data Preparation & Cleaning
Prepare your real-world data for forecasting by handling missing values, outliers, and inconsistencies.
Why Data Preparation Matters
Bad data → Bad forecasts
Good data → Good forecasts
Most forecasting projects spend 60-70% of their time on data preparation.
Common Data Quality Issues
Issue 1: Missing Values (NULLs)
-- Check for missing values
SELECT
COUNT(*) as total_rows,
COUNT(CASE WHEN date IS NULL THEN 1 END) as null_dates,
COUNT(CASE WHEN sales IS NULL THEN 1 END) as null_sales
FROM sales_data;
Solutions:
-- 1. Remove rows with nulls (if few)
CREATE TABLE cleaned AS
SELECT * FROM sales_data
WHERE date IS NOT NULL AND sales IS NOT NULL;
-- 2. Forward fill (repeat last value)
CREATE TABLE filled AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'forward_fill'}
);
-- 3. Linear interpolation (estimate between points)
CREATE TABLE interpolated AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'linear'}
);
-- 4. Mean imputation (use average)
CREATE TABLE mean_filled AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'mean'}
);
When to use each:
- Remove: <5% missing, scattered randomly
- Forward fill: Stable data, missing at end
- Linear: Trending data, short gaps
- Mean: Seasonal data, random gaps
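The same four strategies map onto one-liners in pandas. This sketch uses a hypothetical toy series; `dropna`, `ffill`, `interpolate`, and `fillna` play the roles of the `TS_FILL_GAPS` methods above:

```python
import pandas as pd
import numpy as np

# Hypothetical daily series with two missing values
s = pd.Series([10.0, 12.0, np.nan, 16.0, np.nan, 20.0],
              index=pd.date_range("2024-01-01", periods=6))

dropped = s.dropna()               # 1. remove rows with NULLs
ffilled = s.ffill()                # 2. forward fill (repeat last value)
linear  = s.interpolate("linear")  # 3. linear interpolation
meanfix = s.fillna(s.mean())       # 4. mean imputation

print(linear.tolist())  # → [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

Note how the three fill strategies disagree on the same gap: forward fill repeats 12.0, interpolation estimates 14.0, and the mean fill uses 14.5.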
Issue 2: Outliers (Extreme Values)
Unusual spikes that don't reflect normal patterns.
-- Detect outliers using the 3-sigma rule (|value - mean| > 3 std devs)
-- (the flag alias can't be referenced in WHERE, so wrap it in a subquery)
SELECT date, sales, flag
FROM (
SELECT
date,
sales,
CASE
WHEN ABS(sales - (SELECT AVG(sales) FROM sales_data)) >
3 * (SELECT STDDEV(sales) FROM sales_data)
THEN 'OUTLIER'
ELSE 'OK'
END as flag
FROM sales_data
) flagged
WHERE flag = 'OUTLIER'
ORDER BY date;
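The same 3-sigma flag in pandas, on toy data with one injected spike. One caveat worth knowing: a single huge outlier inflates the standard deviation, so on very short series the 3-sigma rule can mask the very spike you're looking for.

```python
import pandas as pd

# Hypothetical month of stable daily sales with one bad reading
sales = pd.Series([100, 105, 95, 102, 98] * 6)
sales.iloc[15] = 500  # injected spike

mu, sigma = sales.mean(), sales.std()
outliers = sales[(sales - mu).abs() > 3 * sigma]
print(outliers)  # only the injected spike at position 15
```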
Solutions:
-- 1. Remove extreme outliers
CREATE TABLE no_outliers AS
SELECT * FROM sales_data
WHERE sales > (SELECT AVG(sales) - 3 * STDDEV(sales) FROM sales_data)
AND sales < (SELECT AVG(sales) + 3 * STDDEV(sales) FROM sales_data);
-- 2. Use robust model (handles outliers)
SELECT * FROM TS_FORECAST(
'sales_data',
'date',
'sales',
'MFLES', -- Median-based, outlier-robust
30,
{}
);
-- 3. Cap at reasonable limits
CREATE TABLE capped AS
SELECT
date,
LEAST(GREATEST(sales, 50), 200) as sales -- Cap between 50-200
FROM sales_data;
-- 4. Replace outliers with the local (±7-day) average
CREATE TABLE replaced AS
SELECT
date,
CASE
WHEN ABS(sales - AVG(sales) OVER w) > 3 * STDDEV(sales) OVER w
THEN AVG(sales) OVER w
ELSE sales
END as sales
FROM sales_data
WINDOW w AS (ORDER BY date ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING);
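The rolling-window replacement, sketched in pandas with a centered 15-point window. This version swaps flagged points for the rolling median rather than the rolling mean — a slightly more robust center than the SQL above uses:

```python
import pandas as pd

sales = pd.Series([100, 105, 95, 102, 98] * 6)
sales.iloc[15] = 500  # injected spike

roll = sales.rolling(15, center=True, min_periods=5)
med, std = roll.median(), roll.std()

# Keep points within 3 rolling std devs; replace the rest with the rolling median
cleaned = sales.where((sales - med).abs() <= 3 * std, med)
```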
Issue 3: Time Gaps (Missing Dates)
Data not at regular intervals.
-- Find gaps (window functions aren't allowed in WHERE, so use a subquery)
SELECT date, next_date, next_date - date as gap_days
FROM (
SELECT
date,
LEAD(date) OVER (ORDER BY date) as next_date
FROM sales_data
) gaps
WHERE next_date - date > INTERVAL '1 day'
ORDER BY date;
Solutions:
-- Fill gaps with interpolated values
CREATE TABLE filled_gaps AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'linear', 'max_gap': 7} -- Fill gaps up to 7 days
);
-- Or aggregate to coarser frequency
CREATE TABLE weekly AS
SELECT
DATE_TRUNC('week', date) as week,
SUM(sales) as weekly_sales
FROM sales_data
GROUP BY DATE_TRUNC('week', date)
ORDER BY week;
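In pandas, finding gaps and regularising the grid is a `diff` plus `resample` (hypothetical series with a multi-day gap; `limit=7` plays the role of the `max_gap` parameter above):

```python
import pandas as pd

# Hypothetical daily series with a gap between Jan 2 and Jan 6
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-06", "2024-01-07"])
s = pd.Series([10.0, 12.0, 14.0, 16.0], index=idx)

# Find gaps larger than one day
gaps = s.index.to_series().diff()
print(gaps[gaps > pd.Timedelta("1D")])

# Resample to a daily grid, interpolating gaps up to 7 days long
daily = s.resample("D").mean().interpolate("linear", limit=7)

# Or aggregate to a coarser (weekly) frequency
weekly = s.resample("W").sum()
```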
Issue 4: Duplicate Dates
Multiple rows for the same date.
-- Find duplicates
SELECT date, COUNT(*) as cnt
FROM sales_data
GROUP BY date
HAVING COUNT(*) > 1;
Solutions:
-- 1. Aggregate duplicates
CREATE TABLE aggregated AS
SELECT
date,
AVG(sales) as sales -- Or SUM, MAX, MIN depending on context
FROM sales_data
GROUP BY date
ORDER BY date;
-- 2. Keep the lowest value per date
CREATE TABLE no_dups AS
SELECT DISTINCT ON (date) date, sales
FROM sales_data
ORDER BY date, sales;
-- 3. Keep the highest value per date
-- (to keep the truly first or latest record, order by a load timestamp instead)
CREATE TABLE highest AS
SELECT DISTINCT ON (date) date, sales
FROM sales_data
ORDER BY date, sales DESC;
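The same dedup strategies in pandas on a toy frame; `sort_values` plus `drop_duplicates` mirrors `DISTINCT ON` with an ORDER BY:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    "sales": [100.0, 110.0, 95.0],
})

# 1. Aggregate duplicates (AVG here; SUM/MAX/MIN depending on meaning)
agg = df.groupby("date", as_index=False)["sales"].mean()

# 2./3. Keep the lowest or highest value per date
lowest  = df.sort_values(["date", "sales"]).drop_duplicates("date", keep="first")
highest = df.sort_values(["date", "sales"]).drop_duplicates("date", keep="last")
```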
Issue 5: Inconsistent Time Intervals
Data collected at irregular frequencies.
-- Check interval consistency
SELECT
MIN(interval_days) as min_interval,
MAX(interval_days) as max_interval,
COUNT(DISTINCT interval_days) as distinct_intervals -- >1 means irregular
FROM (
SELECT
CAST((LEAD(date) OVER (ORDER BY date) - date) AS INT) as interval_days
FROM sales_data
) intervals
WHERE interval_days IS NOT NULL;
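A quick pandas equivalent: `diff` on the index gives the intervals, and `pd.infer_freq` returns `None` when no single frequency fits the whole index:

```python
import pandas as pd

# Hypothetical index with mixed 1-day and 2-day spacing
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"])

deltas = pd.Series(idx).diff().dropna()
print(deltas.value_counts())   # how often each interval occurs

print(pd.infer_freq(idx))      # None: no single consistent frequency
```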
Solutions:
-- Resample to consistent frequency
-- Aggregate daily → weekly
CREATE TABLE weekly_sales AS
SELECT
DATE_TRUNC('week', date) as week,
SUM(sales) as weekly_sales
FROM sales_data
GROUP BY DATE_TRUNC('week', date);
-- Or disaggregate weekly → daily (if possible with external data)
Issue 6: Extreme Seasonality or Multiple Patterns
Data with very strong or conflicting patterns.
-- Detect seasonality strength
SELECT * FROM TS_DETECT_SEASONALITY(
'sales_data',
'date',
'sales',
{}
);
Solutions:
-- Use model that handles complexity
SELECT * FROM TS_FORECAST(
'sales_data',
'date',
'sales',
'TBATS', -- Handles multiple seasonality
30,
{'seasonal_periods': [7, 365]}
);
-- Or decompose and forecast components separately
SELECT * FROM TS_STL_DECOMPOSITION(
'sales_data',
'date',
'sales',
{'seasonal_period': 7}
);
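A crude stand-in for seasonality detection is to look for the lag with the strongest autocorrelation. This NumPy sketch is a minimal illustration — `TS_DETECT_SEASONALITY` presumably does considerably more:

```python
import numpy as np

def dominant_period(y, max_lag=30):
    """Return the lag (>= 2) with the highest autocorrelation."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    denom = (y * y).sum()
    acf = [(y[:-k] * y[k:]).sum() / denom for k in range(2, max_lag + 1)]
    return int(np.argmax(acf)) + 2

# Synthetic series with a period-7 (weekly) pattern
t = np.arange(140)
y = 100 + 10 * np.sin(2 * np.pi * t / 7)
print(dominant_period(y))  # → 7
```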
Complete Data Validation Script
-- Comprehensive data quality check
CREATE TABLE data_quality_report AS
SELECT
'Total Rows' as metric,
CAST(COUNT(*) AS VARCHAR) as value
FROM sales_data
UNION ALL
SELECT 'Null Dates', CAST(COUNT(CASE WHEN date IS NULL THEN 1 END) AS VARCHAR)
FROM sales_data
UNION ALL
SELECT 'Null Values', CAST(COUNT(CASE WHEN sales IS NULL THEN 1 END) AS VARCHAR)
FROM sales_data
UNION ALL
SELECT 'Duplicate Dates', CAST(SUM(CASE WHEN cnt > 1 THEN 1 ELSE 0 END) AS VARCHAR)
FROM (SELECT date, COUNT(*) as cnt FROM sales_data GROUP BY date) counts
UNION ALL
SELECT 'Date Range', MIN(date)::VARCHAR || ' to ' || MAX(date)::VARCHAR
FROM sales_data
UNION ALL
SELECT 'Min Sales', ROUND(MIN(sales), 2)::VARCHAR
FROM sales_data
UNION ALL
SELECT 'Max Sales', ROUND(MAX(sales), 2)::VARCHAR
FROM sales_data
UNION ALL
SELECT 'Mean Sales', ROUND(AVG(sales), 2)::VARCHAR
FROM sales_data
UNION ALL
SELECT 'StdDev Sales', ROUND(STDDEV(sales), 2)::VARCHAR
FROM sales_data;
SELECT * FROM data_quality_report;
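The same report is a few lines of pandas (toy frame with one NULL date, one NULL value, and one duplicate date):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02", None]),
    "sales": [100.0, 110.0, np.nan, 95.0],
})

report = {
    "total_rows": len(df),
    "null_dates": int(df["date"].isna().sum()),
    "null_sales": int(df["sales"].isna().sum()),
    "duplicate_dates": int(df["date"].duplicated().sum()),
    "min_sales": df["sales"].min(),
    "max_sales": df["sales"].max(),
}
print(report)
```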
Data Cleaning Pipeline
Recommended sequence for production:
-- 1. Remove/flag obvious errors
CREATE TABLE step1_errors_removed AS
SELECT *
FROM sales_data
WHERE sales >= 0 -- No negative sales (keep zeros: they're valid demand)
AND date IS NOT NULL;
-- 2. Fill small gaps
CREATE TABLE step2_gaps_filled AS
SELECT * FROM TS_FILL_GAPS(
'step1_errors_removed',
'date',
'sales',
{'method': 'linear', 'max_gap': 7}
);
-- 3. Handle outliers
CREATE TABLE step3_outliers_handled AS
SELECT * FROM TS_REMOVE_OUTLIERS(
'step2_gaps_filled',
'sales',
{'method': 'iqr', 'multiplier': 3.0}
);
-- 4. Remove duplicates
CREATE TABLE step4_deduped AS
SELECT DISTINCT ON (date) *
FROM step3_outliers_handled
ORDER BY date, sales;
-- 5. Final validation
SELECT * FROM step4_deduped
ORDER BY date;
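The whole pipeline compresses into a single pandas chain (hypothetical frame with a negative value, a duplicate date, and a gap; order of steps follows the sequence above):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-02",
                            "2024-01-04", "2024-01-05"]),
    "sales": [100.0, 110.0, 110.0, -5.0, 120.0],
})

clean = (
    df[df["sales"] >= 0]                # 1. drop obvious errors (negatives)
      .drop_duplicates("date")          # 2. dedup, keeping the first row
      .set_index("date")
      .resample("D").mean()             # 3. regularise to a daily grid
      .interpolate("linear", limit=7)   # 4. fill small gaps
)
print(clean)                            # 5. final inspection
```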
Handling Special Cases
Zero Values
Common in intermittent demand (inventory, spare parts):
-- Don't remove zeros - they're valid!
-- Use intermittent demand models
SELECT * FROM TS_FORECAST(
'sales_data',
'date',
'sales',
'Croston', -- Designed for sparse data
30,
{}
);
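Croston's method itself is simple enough to sketch in NumPy: smooth the non-zero demand sizes and the intervals between them separately with exponential smoothing, then forecast their ratio. This is a minimal sketch — the initialisation choices here are assumptions, and production implementations differ in detail:

```python
import numpy as np

def croston(y, alpha=0.1, h=30):
    """Minimal Croston sketch: flat forecast = smoothed size / smoothed interval."""
    y = np.asarray(y, dtype=float)
    nz = np.flatnonzero(y)          # indices of non-zero demand
    if len(nz) == 0:
        return np.zeros(h)
    z_hat = y[nz[0]]                # initial demand-size estimate
    p_hat = nz[0] + 1.0             # initial inter-demand interval estimate
    interval = 1.0
    for t in range(nz[0] + 1, len(y)):
        if y[t] > 0:
            z_hat += alpha * (y[t] - z_hat)
            p_hat += alpha * (interval - p_hat)
            interval = 1.0
        else:
            interval += 1.0
    return np.full(h, z_hat / p_hat)

# Demand of 3 units every third period → flat forecast of 1 unit/period
print(croston([0, 0, 3, 0, 0, 3, 0, 0, 3, 0, 0, 3], h=5))
```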
Large Jumps (Price Changes)
Structural breaks in the data:
-- Detect and handle changepoints
SELECT * FROM TS_DETECT_CHANGEPOINTS(
'sales_data',
'date',
'sales',
{}
);
-- Model before and after changepoint separately
CREATE TABLE pre_change AS
SELECT * FROM sales_data WHERE date < '2023-06-01';
CREATE TABLE post_change AS
SELECT * FROM sales_data WHERE date >= '2023-06-01';
-- Forecast each separately
SELECT * FROM TS_FORECAST('pre_change', ...) UNION ALL
SELECT * FROM TS_FORECAST('post_change', ...);
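If a changepoint detector is not at hand, a single level shift can be located by brute force: try every split point and keep the one that minimises the within-segment sum of squares. A minimal sketch, not necessarily the algorithm `TS_DETECT_CHANGEPOINTS` uses:

```python
import numpy as np

def best_split(y):
    """Index that best splits y into two constant-mean segments."""
    y = np.asarray(y, dtype=float)
    best_k, best_cost = None, np.inf
    for k in range(2, len(y) - 1):
        left, right = y[:k], y[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

y = [10.0] * 20 + [50.0] * 20   # level shift at index 20
print(best_split(y))            # → 20
```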
Seasonal Adjustments
Remove seasonality to see true trend:
-- Deseasonalize
CREATE TABLE deseasonalized AS
SELECT
date,
sales,
seasonal,
sales - seasonal as sales_deseasonalized
FROM TS_STL_DECOMPOSITION(
'sales_data',
'date',
'sales',
{'seasonal_period': 7}
);
-- Forecast deseasonalized, then add seasonality back
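With a simple weekly pattern, the seasonal component can even be estimated by hand as a day-of-week index. A pandas sketch, assuming a stable additive pattern and whole weeks of history — STL does this more carefully:

```python
import pandas as pd

# Hypothetical daily series: flat level plus a day-of-week effect
idx = pd.date_range("2024-01-01", periods=28)        # 4 full weeks
pattern = [0, 1, 2, 3, 4, 5, 6]                      # additive weekly effect
s = pd.Series([100 + pattern[d.dayofweek] for d in idx], index=idx)

# Seasonal index: mean per day-of-week, centred on zero
dow = s.index.dayofweek
seasonal = s.groupby(dow).transform("mean") - s.mean()

deseasonalized = s - seasonal   # what remains is the level/trend
```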
Data Quality Checklist
Before forecasting:
- No NULL values in date or value columns
- No duplicate dates
- Dates are in chronological order
- Regular time intervals (no large gaps)
- Reasonable value range (check for typos)
- At least 2 full seasonal cycles of history (e.g., 60+ daily points for a monthly pattern)
- Outliers identified and handled
- Data validated with domain expert
Common Mistakes
❌ Removing All Outliers
-- WRONG: Legitimate spikes are part of data
-- RIGHT: Only remove measurement errors
-- Use robust model (MFLES) instead of removing
❌ Forward-filling Long Gaps
-- WRONG: Filling 6-month gap with last value
-- RIGHT: Use shorter max_gap parameter or use external data
SELECT * FROM TS_FILL_GAPS(..., {'max_gap': 30}); -- Max 30 days
❌ Ignoring Zero Values
-- WRONG: Removing zero sales (common in spare parts)
-- RIGHT: Use Croston or intermittent demand model
Next Steps
- Detecting Patterns — Find seasonality and trends
- Basic Workflow — Complete forecasting workflow
- Production Deployment — Scale to production
Key Takeaways
- ✅ Data preparation = 60-70% of forecasting project
- ✅ Check for nulls, outliers, duplicates, gaps
- ✅ Handle missing values appropriately (don't just remove)
- ✅ Use robust models for noisy data
- ✅ Validate data with domain expert
- ✅ Document cleaning decisions
- ✅ Keep before/after records