Model Comparison & Selection
Systematically compare multiple forecasting models to choose the best one.
The Comparison Strategy
1. Select candidate models
↓
2. Train on same data
↓
3. Evaluate on same test set
↓
4. Compare using multiple metrics
↓
5. Choose winner
↓
6. Deploy to production
Selecting Candidate Models
Rule of Thumb
For any dataset, compare:
- 1 baseline (SeasonalNaive or Naive)
- 1 simple model (ETS or Theta)
- 1 complex model (AutoARIMA or TBATS)
-- Three core models to always compare
'SeasonalNaive' -- Baseline
'AutoETS' -- Simple statistical
'AutoARIMA' -- More complex
Complete Comparison Example
Step 1: Prepare Data
LOAD anofox_forecast;
-- Create sample data
CREATE TABLE sales_data AS
SELECT
DATE '2023-01-01' + INTERVAL (d) DAY AS date,
100 + (d / 3.65) + 20 * SIN(2 * PI() * (d % 7) / 7) + RANDOM() * 10 AS sales
FROM generate_series(0, 364) t(d);
-- Split into train and test (roughly 80/20)
CREATE TABLE train AS SELECT * FROM sales_data WHERE date < '2023-11-01';
CREATE TABLE test AS SELECT * FROM sales_data WHERE date >= '2023-11-01';
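A quick sanity check on the split before training:
-- Expect 304 training rows and 61 test rows
SELECT
    (SELECT COUNT(*) FROM train) as train_rows,
    (SELECT COUNT(*) FROM test) as test_rows;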
Step 2: Train Models
-- Model 1: Baseline (request intervals too, so every fc_* table has the same columns for the UNION ALL below)
CREATE TABLE fc_baseline AS
SELECT 'SeasonalNaive' as model, * FROM TS_FORECAST(
'train', 'date', 'sales', 'SeasonalNaive',
(SELECT COUNT(*) FROM test),
{'seasonal_period': 7, 'confidence_level': 0.95}
);
-- Model 2: ETS
CREATE TABLE fc_ets AS
SELECT 'AutoETS' as model, * FROM TS_FORECAST(
'train', 'date', 'sales', 'AutoETS',
(SELECT COUNT(*) FROM test),
{'seasonal_period': 7, 'confidence_level': 0.95}
);
-- Model 3: ARIMA
CREATE TABLE fc_arima AS
SELECT 'AutoARIMA' as model, * FROM TS_FORECAST(
'train', 'date', 'sales', 'AutoARIMA',
(SELECT COUNT(*) FROM test),
{'seasonal_period': 7, 'confidence_level': 0.95}
);
-- Model 4: Theta
CREATE TABLE fc_theta AS
SELECT 'Theta' as model, * FROM TS_FORECAST(
'train', 'date', 'sales', 'Theta',
(SELECT COUNT(*) FROM test),
{'seasonal_period': 7, 'confidence_level': 0.95}
);
-- Model 5: MFLES
CREATE TABLE fc_mfles AS
SELECT 'MFLES' as model, * FROM TS_FORECAST(
'train', 'date', 'sales', 'MFLES',
(SELECT COUNT(*) FROM test),
{'seasonal_period': 7, 'confidence_level': 0.95}
);
Step 3: Combine Forecasts
-- Unified forecast table
CREATE TABLE all_forecasts AS
SELECT * FROM fc_baseline
UNION ALL SELECT * FROM fc_ets
UNION ALL SELECT * FROM fc_arima
UNION ALL SELECT * FROM fc_theta
UNION ALL SELECT * FROM fc_mfles;
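Before joining with the actuals, verify that every model produced one row per test date:
-- Each model should contribute 61 rows (one per test date)
SELECT model, COUNT(*) as forecast_rows
FROM all_forecasts
GROUP BY model;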
Step 4: Join with Actuals
CREATE TABLE comparison AS
SELECT
t.date,
t.sales as actual,
f.model,
ROUND(f.point_forecast, 2) as forecast,
ROUND(ABS(t.sales - f.point_forecast), 2) as abs_error,
ROUND(f.lower_95, 2) as lower,
ROUND(f.upper_95, 2) as upper
FROM test t
LEFT JOIN all_forecasts f ON t.date = f.date_col
ORDER BY t.date, f.model;
Step 5: Calculate Metrics
CREATE TABLE metrics AS
SELECT
model,
COUNT(*) as num_forecasts,
ROUND(TS_MAE(LIST(actual), LIST(forecast)), 2) as MAE,
ROUND(TS_RMSE(LIST(actual), LIST(forecast)), 2) as RMSE,
ROUND(TS_MAPE(LIST(actual), LIST(forecast)), 2) as MAPE_pct,
ROUND(TS_SMAPE(LIST(actual), LIST(forecast)), 2) as SMAPE_pct,
ROUND(TS_MASE(LIST(actual), LIST(forecast), 7), 2) as MASE,
ROUND(TS_ME(LIST(actual), LIST(forecast)), 2) as ME,
ROUND(TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)), 3) as coverage_95
FROM comparison
GROUP BY model
ORDER BY MAPE_pct;
Output example:
model          | MAE   | RMSE  | MAPE_pct | SMAPE_pct | MASE | ME    | coverage_95
---------------|-------|-------|----------|-----------|------|-------|------------
MFLES          | 7.9   | 10.8  | 6.4      | 6.2       | 0.60 | -0.2  | 0.962
AutoETS        | 8.3   | 11.2  | 6.8      | 6.6       | 0.63 | -0.1  | 0.969
AutoARIMA      | 9.1   | 12.8  | 7.5      | 7.3       | 0.69 | 0.4   | 0.954
Theta          | 10.8  | 13.2  | 8.9      | 8.7       | 0.82 | -0.3  | 0.977
SeasonalNaive  | 15.2  | 18.5  | 12.3     | 12.1      | 1.15 | 0.8   | 0.985
Analyzing Results
Winner Criteria
Primary metric: MAPE (easy to explain)
- MFLES wins with 6.4% MAPE
Validation:
- Check RMSE (penalizes large errors)
  - MFLES: 10.8 ✓ (best)
- Check MASE (relative to naive)
  - MFLES: 0.60 ✓ (best, well below the 1.0 naive baseline)
- Check coverage
  - MFLES: 0.962 ✓ (close to the 0.95 target)
Winner: MFLES
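The same call can be made programmatically from the metrics table created in Step 5, assuming MAPE stays the primary metric:
-- Pick the winner by the primary metric
SELECT model, MAPE_pct, RMSE, MASE
FROM metrics
ORDER BY MAPE_pct
LIMIT 1;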
Detailed Model Breakdown
Baseline (SeasonalNaive)
- MAE: 15.2
- Purpose: Benchmark
- Interpretation: Repeat last year's value
Simple Models (Theta, AutoETS)
- MAE: 8.3-10.8
- Improvement over baseline: 29-45%
- Good balance of simplicity and accuracy
Complex Models (AutoARIMA, MFLES)
- MAE: 7.9-9.1
- Improvement over baseline: 40-48% (see the query after this list)
- Highest accuracy, slightly slower
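The improvement figures in this breakdown can be reproduced from the metrics table, using the SeasonalNaive row as the baseline:
-- Percent MAE improvement over the SeasonalNaive baseline
SELECT
    m.model,
    m.MAE,
    ROUND(100.0 * (b.MAE - m.MAE) / b.MAE, 1) as improvement_pct
FROM metrics m
CROSS JOIN (SELECT MAE FROM metrics WHERE model = 'SeasonalNaive') b
ORDER BY improvement_pct DESC;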
Comparison Metrics Explained
MAE (Mean Absolute Error)
Lower is better
Reported in the same units as the data; a good default for model selection
RMSE (Root Mean Squared Error)
Penalizes large errors
Watch out if RMSE >> MAE (signals a few large errors or outliers)
MAPE (Mean Absolute Percentage Error)
Percentage error
Easy to report to business stakeholders
SMAPE (Symmetric MAPE)
Similar to MAPE but handles zeros better
Compare with MAPE for consistency
MASE (Mean Absolute Scaled Error)
< 1: Better than naive
= 1: Same as naive
> 1: Worse than naive
Quick comparison: Is this better than baseline?
ME (Mean Error)
Positive: Overforecasting (forecast too high on average)
Negative: Underforecasting (forecast too low on average)
Zero: Unbiased forecast
Coverage
Expected: ~0.95 for 95% confidence
Too low (e.g., 0.90): Intervals too narrow (overconfident)
Too high (e.g., 0.99): Intervals too wide (overly cautious)
Ideal: Within 0.93-0.97 range
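As a cross-check, coverage can be recomputed by hand from the comparison table as the share of actuals that fall inside the interval:
-- Empirical 95% interval coverage per model
SELECT
    model,
    ROUND(AVG(CASE WHEN actual BETWEEN lower AND upper THEN 1 ELSE 0 END), 3) as empirical_coverage
FROM comparison
GROUP BY model;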
Statistical Testing
Win by Significant Margin
-- Is Model A significantly better than Model B?
WITH model_comparison AS (
    SELECT
        model,
        ABS(actual - forecast) as error
    FROM comparison
)
SELECT
    model,
    COUNT(*) as sample_size,
    ROUND(AVG(error), 2) as avg_error,
    ROUND(STDDEV(error), 2) as std_error,
    ROUND(STDDEV(error) / SQRT(COUNT(*)), 2) as std_err_of_mean
FROM model_comparison
GROUP BY model
ORDER BY avg_error;
-- Rough rule: if one model's avg_error is lower than another's by more than
-- about 2x the standard error of the mean, the gap is unlikely to be noise
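For a sharper head-to-head comparison of two models, a paired per-date check (in the spirit of a paired t-test) looks at the error difference on each test date; the model names below assume the example results above:
-- Paired error difference: MFLES vs AutoETS on the same test dates
WITH paired AS (
    SELECT
        a.date,
        a.abs_error - b.abs_error as err_diff
    FROM comparison a
    JOIN comparison b ON a.date = b.date
    WHERE a.model = 'MFLES' AND b.model = 'AutoETS'
)
SELECT
    COUNT(*) as n,
    ROUND(AVG(err_diff), 2) as mean_diff,
    ROUND(STDDEV(err_diff) / SQRT(COUNT(*)), 2) as std_err
FROM paired;
-- If ABS(mean_diff) is more than ~2x std_err, the gap between the two
-- models is unlikely to be noise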
Speed Comparison
-- Illustrative training times for a single daily series
-- (measure on your own hardware; see the timer tip below)
CREATE TABLE timing AS
SELECT
'SeasonalNaive' as model,
'<100ms' as train_time,
'Very fast' as comment
UNION ALL SELECT 'AutoETS', '~1s', 'Fast'
UNION ALL SELECT 'Theta', '~500ms', 'Fast'
UNION ALL SELECT 'AutoARIMA', '~5s', 'Slow'
UNION ALL SELECT 'MFLES', '~2s', 'Medium'
UNION ALL SELECT 'TBATS', '~10s', 'Very slow';
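These timings are placeholders; to get real measurements, the DuckDB CLI's built-in timer can be switched on before re-running the training statements (a minimal sketch):
-- In the DuckDB CLI: enable per-statement timing
.timer on
-- ...then re-run any of the CREATE TABLE fc_* statements above;
-- the elapsed time is printed after each statement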
Ranking Models
Create a score based on multiple criteria:
-- Weighted scoring: rank the models on each metric, then combine the ranks
-- (MAPE weight: 0.5, RMSE: 0.3, MASE: 0.2); lower composite score is better
WITH ranked AS (
SELECT
model,
RANK() OVER (ORDER BY MAPE_pct) as mape_rank,
RANK() OVER (ORDER BY RMSE) as rmse_rank,
RANK() OVER (ORDER BY MASE) as mase_rank
FROM metrics
)
SELECT
model,
ROUND(
(mape_rank * 0.5 + rmse_rank * 0.3 + mase_rank * 0.2),
2
) as composite_score,
mape_rank,
rmse_rank,
mase_rank
FROM ranked
ORDER BY composite_score;
Handling Tied Models
When two models are very close:
-- Model A: MAE = 8.3, RMSE = 11.2, Time = 1s
-- Model B: MAE = 8.4, RMSE = 11.1, Time = 5s
-- Decision: Use Model A (slightly better, much faster)
-- Or: Run a simple ensemble (average the two forecasts)
-- Example using the ETS and ARIMA forecast tables created above
SELECT
    a.date_col as date,
    (a.point_forecast + b.point_forecast) / 2 as ensemble_forecast
FROM fc_ets a
JOIN fc_arima b ON a.date_col = b.date_col;
Special Comparison Cases
Very Short Data (< 60 observations)
-- Use simpler models
-- SeasonalNaive often wins
-- Avoid AutoARIMA
Very Noisy Data
-- MFLES often wins (robust to outliers)
-- Check for outliers and consider cleaning
Multiple Seasonality
-- Use TBATS or MSTL
-- Expect single-seasonality models to miss one of the patterns and underperform
Forecast Intervals Matter
-- Check coverage metric
-- Wider intervals (from higher confidence levels) sacrifice precision
-- Choose based on risk tolerance
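If narrower bands suit your risk tolerance, the confidence_level parameter used earlier can be lowered; a sketch (the interval column names in the output may then differ from lower_95/upper_95):
-- 80% intervals: narrower bands, lower coverage target
CREATE TABLE fc_ets_80 AS
SELECT 'AutoETS' as model, * FROM TS_FORECAST(
    'train', 'date', 'sales', 'AutoETS',
    (SELECT COUNT(*) FROM test),
    {'seasonal_period': 7, 'confidence_level': 0.80}
);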
Model Comparison Workflow
-- Complete workflow script
-- 1. SETUP
LOAD anofox_forecast;
-- ... create train/test ...
-- 2. TRAIN (all models)
-- ... create all FC tables ...
-- 3. EVALUATE
-- ... calculate metrics ...
-- 4. COMPARE
SELECT * FROM metrics ORDER BY MAPE_pct;
-- 5. WINNER
-- MFLES wins with lowest MAPE
-- 6. DECISION
-- Deploy MFLES to production
-- Monitor for 4 weeks
-- If metrics degrade, retrain or try another model
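For step 6, a minimal monitoring sketch; production_forecasts is a hypothetical table that stores the deployed model's forecasts alongside the actuals as they arrive:
-- Hypothetical table: production_forecasts(date, actual, forecast)
SELECT
    DATE_TRUNC('week', date) as week,
    ROUND(TS_MAPE(LIST(actual), LIST(forecast)), 2) as weekly_mape
FROM production_forecasts
GROUP BY 1
ORDER BY 1;
-- Investigate or retrain if weekly_mape drifts well above the test-set value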
Common Comparison Mistakes
❌ Only Using One Metric
-- WRONG: "Model A has lowest RMSE, so it's best"
-- RIGHT: Compare MAE, RMSE, MAPE, MASE, coverage
❌ Not Using Test Data
-- WRONG: Evaluating on training data
-- RIGHT: Always use hold-out test data
❌ Different Test Sets
-- WRONG: Model A tested on Jan, Model B on Feb
-- RIGHT: All models tested on same dates
❌ Ignoring Speed
-- WRONG: "TBATS has best metrics, deploy it"
-- RIGHT: Weigh that TBATS takes ~10s per series, which may be impractical at scale
-- Consider the trade-off between accuracy and speed
Next Steps
- Production Deployment — Deploy winning model
- Evaluating Accuracy — Deep dive on metrics
- Reference: Models — Details on each model
Key Takeaways
- ✅ Always compare multiple models (baseline + simple + complex)
- ✅ Evaluate on same test data
- ✅ Use multiple metrics (MAE, RMSE, MAPE, MASE)
- ✅ Check prediction interval coverage
- ✅ Consider speed vs. accuracy trade-off
- ✅ Prefer statistically significant gains over marginal differences
- ✅ Monitor deployed model weekly