Basic Forecasting Workflow
A complete end-to-end guide to forecasting, from data preparation through evaluation and deployment.
The Workflow
Raw Data
↓
Data Preparation (cleaning, validation)
↓
Exploratory Analysis (detect patterns)
↓
Train/Test Split
↓
Model Training
↓
Forecast Generation
↓
Evaluation (metrics)
↓
Model Comparison
↓
Deploy Best Model
Step 1: Create or Load Your Data
Start with a table containing time-series data:
-- Example: Load sales data
CREATE TABLE sales_data AS
SELECT * FROM read_parquet('s3://bucket/sales.parquet');
-- Or create synthetic data for testing
CREATE TABLE sales_data AS
SELECT
DATE '2023-01-01' + INTERVAL (d) DAY AS date,
100 + (d / 3.65) + -- Trend
20 * SIN(2 * PI() * (d % 7) / 7) + -- Weekly seasonality
RANDOM() * 10 AS sales -- Noise
FROM generate_series(0, 364) t(d)
ORDER BY date;
Data requirements (see the sanity check below):
- Time column (DATE, TIMESTAMP, or INTEGER)
- Value column (DOUBLE, INTEGER)
- Regular time intervals (daily, hourly, etc.)
- At least 30 observations (2+ seasonal cycles recommended)
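A minimal sanity check for these requirements, assuming a daily series (adapt the gap test for other frequencies):
-- Quick requirements check (daily data assumed)
SELECT
COUNT(*) >= 30 as enough_rows,
COUNT(DISTINCT date) = COUNT(*) as no_duplicate_dates,
MAX(date) - MIN(date) + 1 = COUNT(*) as no_gaps -- holds only for daily intervals
FROM sales_data;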
Step 2: Data Validation
Check data quality before forecasting:
-- Check for nulls and basic statistics
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT date) as unique_dates,
COUNT(CASE WHEN sales IS NULL THEN 1 END) as null_count,
MIN(sales) as min_sales,
MAX(sales) as max_sales,
ROUND(AVG(sales), 2) as avg_sales,
ROUND(STDDEV(sales), 2) as std_dev
FROM sales_data;
-- Check for duplicates
SELECT date, COUNT(*) as cnt
FROM sales_data
GROUP BY date
HAVING COUNT(*) > 1;
-- Check for gaps in dates (window functions are not allowed in WHERE, so use QUALIFY)
SELECT
date,
LEAD(date) OVER (ORDER BY date) - date as days_gap
FROM sales_data
QUALIFY days_gap > 1;
Step 3: Data Cleaning
Handle missing values, outliers, and gaps:
-- Fill time gaps (optional)
CREATE TABLE sales_cleaned AS
SELECT * FROM TS_FILL_GAPS(
'sales_data',
'date',
'sales',
{'method': 'linear'} -- linear, forward_fill, or mean
);
-- Remove outliers (optional)
SELECT * FROM TS_REMOVE_OUTLIERS(
'sales_cleaned',
'sales',
{'method': 'iqr', 'multiplier': 3.0} -- IQR method with 3× threshold
);
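To confirm the gap fill worked, compare row counts against the calendar span (a simple check, assuming daily data):
-- The cleaned table should have one row per day between the min and max date
SELECT
(SELECT COUNT(*) FROM sales_data) as rows_before,
(SELECT COUNT(*) FROM sales_cleaned) as rows_after,
(SELECT MAX(date) - MIN(date) + 1 FROM sales_cleaned) as expected_rows;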
Step 4: Exploratory Analysis
Understand patterns before modeling:
-- Detect seasonality
SELECT * FROM TS_DETECT_SEASONALITY(
'sales_data',
'date',
'sales',
{'method': 'auto'}
);
-- Detect changepoints (structural breaks)
SELECT * FROM TS_DETECT_CHANGEPOINTS(
'sales_data',
'date',
'sales',
{}
);
-- Decompose series to see components
SELECT
date,
sales,
trend,
seasonal,
remainder
FROM TS_STL_DECOMPOSITION(
'sales_data',
'date',
'sales',
{'seasonal_period': 7}
);
Key questions to answer (the query below helps quantify the first two):
- Is there seasonality? How strong?
- Is there a trend (increasing or decreasing)?
- Are there structural breaks or regime changes?
- How much noise/volatility is there?
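One way to quantify seasonality and trend is the strength-of-component measure from Hyndman & Athanasopoulos, computed here from the STL output above; this query is illustrative and not part of the extension:
-- Strength scores in [0, 1]; values near 1 indicate a strong component
SELECT
GREATEST(0, 1 - VAR_SAMP(remainder) / VAR_SAMP(seasonal + remainder)) as seasonal_strength,
GREATEST(0, 1 - VAR_SAMP(remainder) / VAR_SAMP(trend + remainder)) as trend_strength
FROM TS_STL_DECOMPOSITION('sales_data', 'date', 'sales', {'seasonal_period': 7});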
Step 5: Train/Test Split
Split data for unbiased evaluation:
-- Time series must be split by time, never randomly
-- Here: first ~10 months train (~83%), last 2 months test
CREATE TABLE train_data AS
SELECT * FROM sales_data
WHERE date < DATE '2023-11-01';
CREATE TABLE test_data AS
SELECT * FROM sales_data
WHERE date >= DATE '2023-11-01';
-- Verify split
SELECT
'train' as dataset,
COUNT(*) as rows,
MIN(date) as start_date,
MAX(date) as end_date
FROM train_data
UNION ALL
SELECT
'test' as dataset,
COUNT(*) as rows,
MIN(date) as start_date,
MAX(date) as end_date
FROM test_data;
Step 6: Train Models
Fit multiple models for comparison:
-- Model 1: Simple baseline
CREATE TABLE forecast_naive AS
SELECT * FROM TS_FORECAST(
'train_data',
'date',
'sales',
'SeasonalNaive',
(SELECT COUNT(*) FROM test_data),
{'seasonal_period': 7}
);
-- Model 2: Statistical (AutoETS)
CREATE TABLE forecast_ets AS
SELECT * FROM TS_FORECAST(
'train_data',
'date',
'sales',
'AutoETS',
(SELECT COUNT(*) FROM test_data),
{'seasonal_period': 7, 'confidence_level': 0.95}
);
-- Model 3: ARIMA
CREATE TABLE forecast_arima AS
SELECT * FROM TS_FORECAST(
'train_data',
'date',
'sales',
'AutoARIMA',
(SELECT COUNT(*) FROM test_data),
{'seasonal_period': 7, 'confidence_level': 0.95}
);
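Before combining results, it can help to peek at one model's raw output; as Step 7 shows, each forecast table carries forecast_step, date_col, point_forecast, and the interval bounds:
-- Inspect the first few forecast rows
SELECT * FROM forecast_ets LIMIT 5;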
Step 7: Generate Forecasts
Create predictions for the test period:
-- All models in one unified table
CREATE TABLE all_forecasts AS
SELECT
forecast_step,
date_col as forecast_date,
'SeasonalNaive' as model_name,
point_forecast,
lower_95,
upper_95
FROM forecast_naive
UNION ALL
SELECT
forecast_step,
date_col,
'AutoETS',
point_forecast,
lower_95,
upper_95
FROM forecast_ets
UNION ALL
SELECT
forecast_step,
date_col,
'AutoARIMA',
point_forecast,
lower_95,
upper_95
FROM forecast_arima
ORDER BY forecast_date, model_name;
Step 8: Compare Actual vs. Forecast
Join predictions with actual values:
-- Side-by-side comparison
CREATE TABLE comparison AS
SELECT
t.date,
t.sales as actual,
f.model_name,
ROUND(f.point_forecast, 2) as forecast,
ROUND(ABS(t.sales - f.point_forecast), 2) as absolute_error,
ROUND(ABS(t.sales - f.point_forecast) / t.sales * 100, 1) as error_pct
FROM test_data t
LEFT JOIN all_forecasts f ON t.date = f.forecast_date
ORDER BY t.date, f.model_name;
-- View the comparison
SELECT * FROM comparison LIMIT 20;
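If you prefer one forecast column per model, DuckDB's PIVOT statement can reshape the comparison table (plain DuckDB, not an extension feature):
-- One row per date, one forecast column per model
SELECT *
FROM (PIVOT comparison ON model_name USING FIRST(forecast) GROUP BY date, actual)
ORDER BY date;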
Step 9: Calculate Evaluation Metrics
Quantify forecast accuracy:
-- Aggregate metrics by model
-- ORDER BY inside LIST keeps actual/forecast pairs in time order (required for MASE)
CREATE TABLE model_metrics AS
SELECT
model_name,
ROUND(TS_MAE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as MAE,
ROUND(TS_RMSE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as RMSE,
ROUND(TS_MAPE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as MAPE_pct,
ROUND(TS_MASE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date), 7), 2) as MASE,
ROUND(TS_ME(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as ME,
ROUND(TS_R_SQUARED(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 4) as R_squared,
ROUND(TS_COVERAGE(LIST(actual ORDER BY date), LIST(lower_95 ORDER BY date), LIST(upper_95 ORDER BY date)), 3) as coverage_95
FROM (
SELECT
t.date,
f.model_name,
t.sales as actual,
f.point_forecast as forecast,
f.lower_95,
f.upper_95
FROM test_data t
JOIN all_forecasts f ON t.date = f.forecast_date -- inner join: keep only dates with forecasts
)
GROUP BY model_name
ORDER BY MAPE_pct; -- Sort by MAPE (typical metric)
-- View results
SELECT * FROM model_metrics;
Output example (coverage_95 omitted for brevity):
model_name      MAE    RMSE   MAPE_pct   MASE    ME     R_squared
SeasonalNaive   15.2   18.5   12.3       1.15    0.8    0.7832
AutoETS         8.3    11.2   6.8        0.63   -0.2    0.9234
AutoARIMA       9.1    12.8   7.5        0.69    0.5    0.9045
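For reference, the headline metrics follow the standard definitions below (the extension's implementations may differ in detail; in particular, MASE's scaling term is conventionally computed on the training data):
\mathrm{MAE} = \tfrac{1}{n}\sum_{t=1}^{n}\lvert y_t - \hat{y}_t\rvert, \quad
\mathrm{RMSE} = \sqrt{\tfrac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}, \quad
\mathrm{MAPE} = \tfrac{100}{n}\sum_{t=1}^{n}\left\lvert\tfrac{y_t - \hat{y}_t}{y_t}\right\rvert, \quad
\mathrm{MASE} = \frac{\tfrac{1}{n}\sum_{t=1}^{n}\lvert y_t - \hat{y}_t\rvert}{\tfrac{1}{n-m}\sum_{t=m+1}^{n}\lvert y_t - y_{t-m}\rvert}
where m is the seasonal period (7 for the weekly pattern here) and n the number of test observations.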
Step 10: Choose Best Model
Select a model based on the metrics:
-- Get the best model (lowest MAPE)
SELECT
model_name,
MAE,
RMSE,
MAPE_pct
FROM model_metrics
ORDER BY MAPE_pct
LIMIT 1;
-- In this example: AutoETS with 6.8% MAPE is best
Step 11: Deploy for Production
Use the best model to forecast the future:
-- Retrain on full historical data
CREATE TABLE production_forecast AS
SELECT * FROM TS_FORECAST(
'sales_data', -- Use all historical data, not just training
'date',
'sales',
'AutoETS', -- Best model from Step 10
90, -- Forecast 90 days ahead
{'seasonal_period': 7, 'confidence_level': 0.95}
);
-- View the forecast for the next 7 days
SELECT
forecast_step,
date_col as forecast_date,
ROUND(point_forecast, 2) as forecast,
ROUND(lower_95, 2) as lower,
ROUND(upper_95, 2) as upper
FROM production_forecast
WHERE forecast_step <= 7
ORDER BY forecast_step;
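To hand the forecast to downstream systems, standard DuckDB COPY works; the file path here is only an example:
-- Export the production forecast (path is illustrative)
COPY production_forecast TO 'production_forecast.parquet' (FORMAT PARQUET);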
Step 12: Monitor & Retrain
Track model performance over time:
-- Weekly monitoring query (assumes actuals are appended to sales_data as they arrive)
SELECT
DATE_TRUNC('week', date) as week,
COUNT(*) as forecast_count,
ROUND(TS_MAE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as weekly_MAE
FROM (
SELECT
f.date_col as date,
t.sales as actual,
f.point_forecast as forecast
FROM production_forecast f
JOIN sales_data t ON f.date_col = t.date -- inner join: only dates with known actuals
WHERE f.date_col >= TODAY() - INTERVAL '30 days'
)
GROUP BY DATE_TRUNC('week', date)
ORDER BY week;
-- If metrics degrade, retrain!
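One way to make "metrics degrade" concrete is to compare recent error against the backtest MAE from Step 9. Both the 1.5x threshold and the weekly_monitoring view (the monitoring query above saved as a view) are illustrative assumptions:
-- Flag weeks whose live error exceeds 1.5x the deployed model's backtest MAE
SELECT
week,
weekly_MAE,
weekly_MAE > 1.5 * (SELECT MAE FROM model_metrics WHERE model_name = 'AutoETS') as needs_retrain
FROM weekly_monitoring
ORDER BY week;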
Complete Workflow Script
Copy and paste to run the entire workflow:
LOAD anofox_forecast;
-- 1. Create sample data
CREATE TABLE sales_data AS
SELECT
DATE '2023-01-01' + INTERVAL (d) DAY AS date,
100 + (d / 3.65) + 20 * SIN(2 * PI() * (d % 7) / 7) + RANDOM() * 10 AS sales
FROM generate_series(0, 364) t(d);
-- 2. Split data
CREATE TABLE train_data AS SELECT * FROM sales_data WHERE date < '2023-11-01';
CREATE TABLE test_data AS SELECT * FROM sales_data WHERE date >= '2023-11-01';
-- 3. Generate forecasts (3 models; horizon 61 = rows in test_data)
CREATE TABLE forecast_results AS
SELECT * FROM TS_FORECAST('train_data', 'date', 'sales', 'SeasonalNaive', 61, {'seasonal_period': 7})
UNION ALL
SELECT * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoETS', 61, {'seasonal_period': 7})
UNION ALL
SELECT * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoARIMA', 61, {'seasonal_period': 7});
-- 4. Evaluate
SELECT
model_name,
ROUND(TS_MAE(LIST(t.sales), LIST(f.point_forecast)), 2) as MAE,
ROUND(TS_RMSE(LIST(t.sales), LIST(f.point_forecast)), 2) as RMSE,
ROUND(TS_MAPE(LIST(t.sales), LIST(f.point_forecast)), 2) as MAPE_pct
FROM test_data t
LEFT JOIN forecast_results f ON t.date = f.date_col
GROUP BY model_name
ORDER BY MAPE_pct;
Common Issues
Issue: "Not enough data"
Solution: Ensure at least 2 seasonal cycles (e.g., 14 days for weekly seasonality)
Issue: "Forecast values are unrealistic"
Solution:
- Check seasonal_period parameter
- Verify data quality
- Try different model
Issue: "Model too slow"
Solution:
- Use faster model (Naive, Theta)
- Use parallel evaluation with GROUP BY
- Reduce data if possible
Next Steps
- Data Preparation — Cleaning real-world data
- Model Comparison — Advanced comparison techniques
- Production Deployment — Scale to millions of series
Key Takeaways
- ✅ Split data before evaluation (train/test)
- ✅ Try multiple models and compare metrics
- ✅ Use MAE, RMSE, MAPE for evaluation
- ✅ Deploy best model to production
- ✅ Monitor performance weekly
- ✅ Retrain when metrics degrade