
Data Preparation & Cleaning

Prepare your real-world data for forecasting by handling missing values, outliers, and inconsistencies.

Why Data Preparation Matters

Bad data → Bad forecasts
Good data → Good forecasts

Most forecasting projects spend 60-70% of their time on data preparation.


Common Data Quality Issues

Issue 1: Missing Values (NULLs)

-- Check for missing values
SELECT
COUNT(*) as total_rows,
COUNT(CASE WHEN date IS NULL THEN 1 END) as null_dates,
COUNT(CASE WHEN sales IS NULL THEN 1 END) as null_sales
FROM sales_data;

Solutions:

-- 1. Remove rows with nulls (if few)
CREATE TABLE cleaned AS
SELECT * FROM sales_data
WHERE date IS NOT NULL AND sales IS NOT NULL;

-- 2. Forward fill (repeat last value)
CREATE TABLE filled AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'forward_fill'}
);

-- 3. Linear interpolation (estimate between points)
CREATE TABLE interpolated AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'linear'}
);

-- 4. Mean imputation (use average)
CREATE TABLE mean_filled AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'mean'}
);

When to use each:

  • Remove: <5% missing, scattered randomly
  • Forward fill: Stable data, missing at end
  • Linear: Trending data, short gaps
  • Mean: Seasonal data, random gaps
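The three fill strategies above are easy to sketch outside SQL. Here is a minimal Python illustration of the same logic (this is not the `TS_FILL_GAPS` implementation), operating on a list where `None` marks a missing value:

```python
from statistics import mean

def forward_fill(values):
    """Repeat the last observed value into each gap."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

def linear_fill(values):
    """Linearly interpolate interior gaps between known neighbours."""
    out = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (values[b] - values[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = values[a] + step * (i - a)
    return out

def mean_fill(values):
    """Replace every gap with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

series = [10, None, 14, None, None, 20]
print(forward_fill(series))  # [10, 10, 14, 14, 14, 20]
print(linear_fill(series))   # [10, 12.0, 14, 16.0, 18.0, 20]
print(mean_fill(series))     # gaps replaced by the mean of 10, 14, 20
```

The choice of method matters exactly as the list above says: forward fill assumes the series is locally flat, linear fill assumes a local trend, and mean fill ignores both.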

Issue 2: Outliers (Extreme Values)

Unusual spikes that don't reflect normal patterns.

-- Detect outliers using the 3-sigma rule
SELECT date, sales
FROM (
SELECT
date,
sales,
CASE
WHEN ABS(sales - (SELECT AVG(sales) FROM sales_data)) >
3 * (SELECT STDDEV(sales) FROM sales_data)
THEN 'OUTLIER'
ELSE 'OK'
END as flag
FROM sales_data
) flagged
WHERE flag = 'OUTLIER'
ORDER BY date;
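For intuition, both the 3-sigma rule above and the IQR rule (used later by `TS_REMOVE_OUTLIERS`) fit in a few lines of Python. Note that on short series a single huge spike inflates the mean and standard deviation, so the IQR rule is often the more reliable detector:

```python
from statistics import mean, stdev, quantiles

def flag_three_sigma(values, k=3.0):
    """True where a point lies more than k standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [abs(v - m) > k * s for v in values]

def flag_iqr(values, multiplier=1.5):
    """True where a point falls outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [v < lo or v > hi for v in values]

data = [100, 102, 98, 101, 99, 500]  # 500 is an obvious spike
print([v for v, f in zip(data, flag_iqr(data)) if f])  # [500]
```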

Solutions:

-- 1. Remove extreme outliers
CREATE TABLE no_outliers AS
SELECT * FROM sales_data
WHERE sales > (SELECT AVG(sales) - 3 * STDDEV(sales) FROM sales_data)
AND sales < (SELECT AVG(sales) + 3 * STDDEV(sales) FROM sales_data);

-- 2. Use robust model (handles outliers)
SELECT * FROM TS_FORECAST(
'sales_data',
'date',
'sales',
'MFLES', -- Median-based, outlier-robust
30,
{}
);

-- 3. Cap at reasonable limits
CREATE TABLE capped AS
SELECT
date,
LEAST(GREATEST(sales, 50), 200) as sales -- Cap between 50-200
FROM sales_data;

-- 4. Replace outliers with a local rolling mean
CREATE TABLE replaced AS
SELECT
date,
CASE
WHEN ABS(sales - AVG(sales) OVER w) > 3 * STDDEV(sales) OVER w
THEN AVG(sales) OVER w
ELSE sales
END as sales
FROM sales_data
WINDOW w AS (ORDER BY date ROWS BETWEEN 7 PRECEDING AND 7 FOLLOWING);

Issue 3: Time Gaps (Missing Dates)

Data not at regular intervals.

-- Find gaps (window functions can't appear in WHERE, so wrap in a subquery)
SELECT date, next_date, gap_days
FROM (
SELECT
date,
LEAD(date) OVER (ORDER BY date) as next_date,
LEAD(date) OVER (ORDER BY date) - date as gap_days
FROM sales_data
) gaps
WHERE gap_days > 1 -- DATE subtraction yields days
ORDER BY date;
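The gap check boils down to sorting the dates and diffing each pair of neighbours; a self-contained Python sketch:

```python
from datetime import date, timedelta

def find_gaps(dates, max_step=timedelta(days=1)):
    """Return (start, end, gap_in_days) wherever neighbours are too far apart."""
    ordered = sorted(dates)
    return [(a, b, (b - a).days)
            for a, b in zip(ordered, ordered[1:])
            if b - a > max_step]

days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5), date(2024, 1, 6)]
print(find_gaps(days))  # [(datetime.date(2024, 1, 2), datetime.date(2024, 1, 5), 3)]
```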

Solutions:

-- Fill gaps with interpolated values
CREATE TABLE filled_gaps AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'linear', 'max_gap': 7} -- Fill gaps up to 7 days
);

-- Or aggregate to coarser frequency
CREATE TABLE weekly AS
SELECT
DATE_TRUNC('week', date) as week,
SUM(sales) as weekly_sales
FROM sales_data
GROUP BY DATE_TRUNC('week', date)
ORDER BY week;

Issue 4: Duplicate Dates

Multiple values recorded for the same date.

-- Find duplicates
SELECT date, COUNT(*) as cnt
FROM sales_data
GROUP BY date
HAVING COUNT(*) > 1;

Solutions:

-- 1. Aggregate duplicates
CREATE TABLE aggregated AS
SELECT
date,
AVG(sales) as sales -- Or SUM, MAX, MIN depending on context
FROM sales_data
GROUP BY date
ORDER BY date;

-- 2. Keep one row per date (lowest value wins)
CREATE TABLE no_dups AS
SELECT DISTINCT ON (date) date, sales
FROM sales_data
ORDER BY date, sales;

-- 3. Keep one row per date (highest value wins)
-- To keep the truly first or latest record, order by a load timestamp instead
CREATE TABLE latest AS
SELECT DISTINCT ON (date) date, sales
FROM sales_data
ORDER BY date, sales DESC;
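Plain SQL tables have no inherent row order, which is why the queries above break ties on the value. When rows do carry a load order (e.g. read sequentially from a file), "keep first" and "keep last" are straightforward, as this Python sketch shows:

```python
def dedupe(rows, keep="last"):
    """Collapse duplicate dates in (date, value) rows, using load order."""
    seen = {}
    for d, v in rows:
        if keep == "last" or d not in seen:
            seen[d] = v  # later rows overwrite earlier ones only if keep="last"
    return sorted(seen.items())

rows = [("2024-01-01", 100), ("2024-01-01", 120), ("2024-01-02", 90)]
print(dedupe(rows, keep="first"))  # [('2024-01-01', 100), ('2024-01-02', 90)]
print(dedupe(rows, keep="last"))   # [('2024-01-01', 120), ('2024-01-02', 90)]
```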

Issue 5: Inconsistent Time Intervals

Data collected at irregular frequencies.

-- Check interval consistency
SELECT
MIN(interval_days) as min_interval,
MAX(interval_days) as max_interval,
COUNT(DISTINCT interval_days) as distinct_intervals -- 1 means perfectly regular
FROM (
SELECT
CAST((LEAD(date) OVER (ORDER BY date) - date) AS INT) as interval_days
FROM sales_data
) t
WHERE interval_days IS NOT NULL;

Solutions:

-- Resample to consistent frequency
-- Aggregate daily → weekly
CREATE TABLE weekly_sales AS
SELECT
DATE_TRUNC('week', date) as week,
SUM(sales) as weekly_sales
FROM sales_data
GROUP BY DATE_TRUNC('week', date);

-- Or disaggregate weekly → daily (if possible with external data)
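The weekly aggregation is just bucketing each row by the Monday of its week. A Python sketch mirroring `DATE_TRUNC('week', ...)` (illustrative, not the database's implementation):

```python
from datetime import date, timedelta
from collections import defaultdict

def week_start(d):
    """Monday of the week containing d, mirroring DATE_TRUNC('week', ...)."""
    return d - timedelta(days=d.weekday())

def resample_weekly(rows):
    """Sum daily (date, value) rows into weekly buckets."""
    buckets = defaultdict(float)
    for d, v in rows:
        buckets[week_start(d)] += v
    return sorted(buckets.items())

rows = [(date(2024, 1, 1), 1), (date(2024, 1, 3), 2), (date(2024, 1, 8), 5)]
print(resample_weekly(rows))  # two weekly totals: 3.0 and 5.0
```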

Issue 6: Extreme Seasonality or Multiple Patterns

Data with very strong or conflicting patterns.

-- Detect seasonality strength
SELECT * FROM TS_DETECT_SEASONALITY(
'sales_data',
'date',
'sales',
{}
);

Solutions:

-- Use model that handles complexity
SELECT * FROM TS_FORECAST(
'sales_data',
'date',
'sales',
'TBATS', -- Handles multiple seasonality
30,
{'seasonal_periods': [7, 365]}
);

-- Or decompose and forecast components separately
SELECT * FROM TS_STL_DECOMPOSITION(
'sales_data',
'date',
'sales',
{'seasonal_period': 7}
);

Complete Data Validation Script

-- Comprehensive data quality check
CREATE TABLE data_quality_report AS
SELECT
'Total Rows' as metric,
CAST(COUNT(*) AS VARCHAR) as value
FROM sales_data

UNION ALL

SELECT 'Null Dates', CAST(COUNT(CASE WHEN date IS NULL THEN 1 END) AS VARCHAR)
FROM sales_data

UNION ALL

SELECT 'Null Values', CAST(COUNT(CASE WHEN sales IS NULL THEN 1 END) AS VARCHAR)
FROM sales_data

UNION ALL

SELECT 'Duplicate Dates', CAST(SUM(CASE WHEN cnt > 1 THEN 1 ELSE 0 END) AS VARCHAR)
FROM (SELECT date, COUNT(*) as cnt FROM sales_data GROUP BY date)

UNION ALL

SELECT 'Date Range', MIN(date)::VARCHAR || ' to ' || MAX(date)::VARCHAR
FROM sales_data

UNION ALL

SELECT 'Min Sales', ROUND(MIN(sales), 2)::VARCHAR
FROM sales_data

UNION ALL

SELECT 'Max Sales', ROUND(MAX(sales), 2)::VARCHAR
FROM sales_data

UNION ALL

SELECT 'Mean Sales', ROUND(AVG(sales), 2)::VARCHAR
FROM sales_data

UNION ALL

SELECT 'StdDev Sales', ROUND(STDDEV(sales), 2)::VARCHAR
FROM sales_data;

SELECT * FROM data_quality_report;
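The same report as a single Python function over `(date, value)` pairs, computing the metrics the SQL script produces (illustrative; `None` marks a null):

```python
from statistics import mean, stdev
from collections import Counter

def quality_report(rows):
    """Compute the SQL report's metrics over (date, value) pairs."""
    dates = [d for d, _ in rows]
    values = [v for _, v in rows if v is not None]
    date_counts = Counter(d for d in dates if d is not None)
    return {
        "total_rows": len(rows),
        "null_dates": sum(d is None for d in dates),
        "null_values": sum(v is None for _, v in rows),
        "duplicate_dates": sum(1 for c in date_counts.values() if c > 1),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 2),
        "stddev": round(stdev(values), 2) if len(values) > 1 else None,
    }

print(quality_report([("2024-01-01", 10), ("2024-01-01", 12),
                      ("2024-01-02", None), (None, 5)]))
```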

Data Cleaning Pipeline

Recommended sequence for production:

-- 1. Remove/flag obvious errors
CREATE TABLE step1_errors_removed AS
SELECT *
FROM sales_data
WHERE sales >= 0 -- No negative sales (keep zeros: see "Zero Values" below)
AND date IS NOT NULL;

-- 2. Fill small gaps
CREATE TABLE step2_gaps_filled AS
SELECT * FROM TS_FILL_GAPS(
'step1_errors_removed',
'date',
'sales',
{'method': 'linear', 'max_gap': 7}
);

-- 3. Handle outliers
CREATE TABLE step3_outliers_handled AS
SELECT * FROM TS_REMOVE_OUTLIERS(
'step2_gaps_filled',
'sales',
{'method': 'iqr', 'multiplier': 3.0}
);

-- 4. Remove duplicates (the ORDER BY tiebreaker decides which row survives)
CREATE TABLE step4_deduped AS
SELECT DISTINCT ON (date) *
FROM step3_outliers_handled
ORDER BY date, sales;

-- 5. Final validation
SELECT * FROM step4_deduped
ORDER BY date;
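The five steps, condensed into one illustrative Python function over `(day_index, value)` pairs. The thresholds and the choice of linear gap-filling and 3-sigma replacement are arbitrary stand-ins for the SQL steps above:

```python
from statistics import mean, stdev

def clean_pipeline(rows):
    """Mirror the five SQL cleaning steps on (day_index, value) pairs."""
    # 1. Drop obvious errors: missing keys and negative values (zeros stay)
    rows = [(d, v) for d, v in rows if d is not None and v is not None and v >= 0]
    # 2. Deduplicate: the last value seen per day wins
    rows = sorted({d: v for d, v in rows}.items())
    if not rows:
        return []
    # 3. Fill gaps between observed days by linear interpolation
    filled = []
    for (d0, v0), (d1, v1) in zip(rows, rows[1:]):
        filled.append((d0, v0))
        for d in range(d0 + 1, d1):
            filled.append((d, v0 + (v1 - v0) * (d - d0) / (d1 - d0)))
    filled.append(rows[-1])
    # 4. Replace points beyond 3 sigma with the global mean
    vals = [v for _, v in filled]
    m = mean(vals)
    s = stdev(vals) if len(vals) > 1 else 0.0
    # 5. Return the validated, chronologically ordered series
    return [(d, m if s and abs(v - m) > 3 * s else v) for d, v in filled]

print(clean_pipeline([(1, 10), (1, 11), (2, -5), (4, 14)]))
```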

Handling Special Cases

Zero Values

Common in intermittent demand (inventory, spare parts):

-- Don't remove zeros - they're valid!
-- Use intermittent demand models
SELECT * FROM TS_FORECAST(
'sales_data',
'date',
'sales',
'Croston', -- Designed for sparse data
30,
{}
);
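Croston's method itself is small enough to sketch: apply simple exponential smoothing separately to the nonzero demand sizes and to the intervals between them, then forecast their ratio as a flat per-period rate. This is a generic sketch of the algorithm, not necessarily how the extension implements it:

```python
def croston(series, alpha=0.1):
    """Flat per-period forecast = smoothed demand size / smoothed interval."""
    size = interval = None      # exponentially smoothed size and interval
    periods_since = 1           # periods since the last nonzero demand
    for y in series:
        if y > 0:
            if size is None:    # initialise on the first observed demand
                size, interval = y, periods_since
            else:
                size += alpha * (y - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    return 0.0 if size is None else size / interval

# 4 units every 3rd period -> about 4/3 units per period
print(croston([0, 0, 4] * 5))
```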

Large Jumps (Price Changes)

Structural breaks in the data:

-- Detect and handle changepoints
SELECT * FROM TS_DETECT_CHANGEPOINTS(
'sales_data',
'date',
'sales',
{}
);

-- Model before and after changepoint separately
CREATE TABLE pre_change AS
SELECT * FROM sales_data WHERE date < '2023-06-01';

CREATE TABLE post_change AS
SELECT * FROM sales_data WHERE date >= '2023-06-01';

-- Forecast each separately
SELECT * FROM TS_FORECAST('pre_change', ...) UNION ALL
SELECT * FROM TS_FORECAST('post_change', ...);

Seasonal Adjustments

Remove seasonality to see true trend:

-- Deseasonalize
CREATE TABLE deseasonalized AS
SELECT
date,
sales,
seasonal,
sales - seasonal as sales_deseasonalized
FROM TS_STL_DECOMPOSITION(
'sales_data',
'date',
'sales',
{'seasonal_period': 7}
);

-- Forecast deseasonalized, then add seasonality back
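A crude stand-in for the decompose-forecast-recompose loop: estimate the seasonal component as the mean at each position in the cycle, subtract it, forecast the adjusted series, then add the component back. (STL does the estimation far more robustly; this only illustrates the round trip.)

```python
from statistics import mean

def deseasonalize(values, period=7):
    """Estimate the seasonal component as the mean at each cycle position."""
    seasonal = [mean(values[i::period]) for i in range(period)]
    adjusted = [v - seasonal[i % period] for i, v in enumerate(values)]
    return seasonal, adjusted

def reseasonalize(adjusted, seasonal):
    """Add the seasonal component back after forecasting."""
    return [v + seasonal[i % len(seasonal)] for i, v in enumerate(adjusted)]

seasonal, adjusted = deseasonalize([10, 20, 10, 20], period=2)
print(seasonal, adjusted)  # [10, 20] [0, 0, 0, 0]
```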

Data Quality Checklist

Before forecasting:

  • No NULL values in date or value columns
  • No duplicate dates
  • Dates are in chronological order
  • Regular time intervals (no large gaps)
  • Reasonable value range (check for typos)
  • At least 2 full seasonal cycles (60+ points)
  • Outliers identified and handled
  • Data validated with domain expert

Common Mistakes

❌ Removing All Outliers

-- WRONG: Legitimate spikes are part of data
-- RIGHT: Only remove measurement errors
-- Use robust model (MFLES) instead of removing

❌ Forward-filling Long Gaps

-- WRONG: Filling 6-month gap with last value
-- RIGHT: Use shorter max_gap parameter or use external data
SELECT * FROM TS_FILL_GAPS(..., {'max_gap': 30}); -- Max 30 days

❌ Ignoring Zero Values

-- WRONG: Removing zero sales (common in spare parts)
-- RIGHT: Use Croston or intermittent demand model

Next Steps

  1. Detecting Patterns — Find seasonality and trends
  2. Basic Workflow — Complete forecasting workflow
  3. Production Deployment — Scale to production

Key Takeaways

  • ✅ Data preparation = 60-70% of forecasting project
  • ✅ Check for nulls, outliers, duplicates, gaps
  • ✅ Handle missing values appropriately (don't just remove)
  • ✅ Use robust models for noisy data
  • ✅ Validate data with domain expert
  • ✅ Document cleaning decisions
  • ✅ Keep before/after records