
Model Comparison & Selection

Systematically compare multiple forecasting models to choose the best one.

The Comparison Strategy

1. Select candidate models

2. Train on same data

3. Evaluate on same test set

4. Compare using multiple metrics

5. Choose winner

6. Deploy to production

Selecting Candidate Models

Rule of Thumb

For any dataset, compare:

  • 1 baseline (SeasonalNaive or Naive)
  • 1 simple model (ETS or Theta)
  • 1 complex model (AutoARIMA or TBATS)
-- Three core models to always compare
'SeasonalNaive' -- Baseline
'AutoETS' -- Simple statistical
'AutoARIMA' -- More complex

Complete Comparison Example

Step 1: Prepare Data

LOAD anofox_forecast;

-- Create sample data: upward trend + weekly seasonality + noise
CREATE TABLE sales_data AS
SELECT
    DATE '2023-01-01' + INTERVAL (d) DAY AS date,
    100 + (d / 3.65)                        -- gradual upward trend
        + 20 * SIN(2 * PI() * (d % 7) / 7)  -- weekly seasonal pattern
        + RANDOM() * 10 AS sales            -- random noise
FROM generate_series(0, 364) t(d);

-- Hold out the last two months as a test set (roughly an 80/20 split)
CREATE TABLE train AS SELECT * FROM sales_data WHERE date < '2023-11-01';
CREATE TABLE test AS SELECT * FROM sales_data WHERE date >= '2023-11-01';
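
An optional sanity check before training confirms the split sizes; with this example data there are 304 training rows and 61 test rows:

-- Optional sanity check: confirm the train/test split sizes
SELECT
    (SELECT COUNT(*) FROM train) AS train_rows,
    (SELECT COUNT(*) FROM test)  AS test_rows;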

Step 2: Train Models

-- Model 1: Baseline
CREATE TABLE fc_baseline AS
SELECT 'SeasonalNaive' AS model, * FROM TS_FORECAST(
    'train', 'date', 'sales', 'SeasonalNaive',
    (SELECT COUNT(*) FROM test),
    {'seasonal_period': 7}
);

-- Model 2: ETS
CREATE TABLE fc_ets AS
SELECT 'AutoETS' AS model, * FROM TS_FORECAST(
    'train', 'date', 'sales', 'AutoETS',
    (SELECT COUNT(*) FROM test),
    {'seasonal_period': 7, 'confidence_level': 0.95}
);

-- Model 3: ARIMA
CREATE TABLE fc_arima AS
SELECT 'AutoARIMA' AS model, * FROM TS_FORECAST(
    'train', 'date', 'sales', 'AutoARIMA',
    (SELECT COUNT(*) FROM test),
    {'seasonal_period': 7, 'confidence_level': 0.95}
);

-- Model 4: Theta
CREATE TABLE fc_theta AS
SELECT 'Theta' AS model, * FROM TS_FORECAST(
    'train', 'date', 'sales', 'Theta',
    (SELECT COUNT(*) FROM test),
    {'seasonal_period': 7, 'confidence_level': 0.95}
);

-- Model 5: MFLES
CREATE TABLE fc_mfles AS
SELECT 'MFLES' AS model, * FROM TS_FORECAST(
    'train', 'date', 'sales', 'MFLES',
    (SELECT COUNT(*) FROM test),
    {'seasonal_period': 7, 'confidence_level': 0.95}
);

Step 3: Combine Forecasts

-- Unified forecast table
CREATE TABLE all_forecasts AS
SELECT * FROM fc_baseline
UNION ALL SELECT * FROM fc_ets
UNION ALL SELECT * FROM fc_arima
UNION ALL SELECT * FROM fc_theta
UNION ALL SELECT * FROM fc_mfles;

Step 4: Join with Actuals

CREATE TABLE comparison AS
SELECT
    t.date,
    t.sales AS actual,
    f.model,
    ROUND(f.point_forecast, 2) AS forecast,
    ROUND(ABS(t.sales - f.point_forecast), 2) AS abs_error,
    ROUND(f.lower_95, 2) AS lower,
    ROUND(f.upper_95, 2) AS upper
FROM test t
LEFT JOIN all_forecasts f ON t.date = f.date_col
ORDER BY t.date, f.model;
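
Before computing metrics, it is worth confirming that every model produced a forecast for every test date. A minimal sketch against the comparison table:

-- Each model should appear once per test date (61 rows here);
-- a NULL model group would indicate missing forecasts from the LEFT JOIN
SELECT
    model,
    COUNT(*) AS rows_matched,
    COUNT(forecast) AS non_null_forecasts
FROM comparison
GROUP BY model
ORDER BY model;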

Step 5: Calculate Metrics

CREATE TABLE metrics AS
SELECT
    model,
    COUNT(*) AS num_forecasts,
    ROUND(TS_MAE(LIST(actual), LIST(forecast)), 2) AS MAE,
    ROUND(TS_RMSE(LIST(actual), LIST(forecast)), 2) AS RMSE,
    ROUND(TS_MAPE(LIST(actual), LIST(forecast)), 2) AS MAPE_pct,
    ROUND(TS_SMAPE(LIST(actual), LIST(forecast)), 2) AS SMAPE_pct,
    ROUND(TS_MASE(LIST(actual), LIST(forecast), 7), 2) AS MASE,
    ROUND(TS_ME(LIST(actual), LIST(forecast)), 2) AS ME,
    ROUND(TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)), 3) AS coverage_95
FROM comparison
GROUP BY model
ORDER BY MAPE_pct;

Output example:

model         | MAE  | RMSE | MAPE_pct | SMAPE_pct | MASE | ME   | coverage_95
--------------|------|------|----------|-----------|------|------|------------
MFLES         |  7.9 | 10.8 |      6.4 |       6.2 | 0.60 | -0.2 |       0.962
AutoETS       |  8.3 | 11.2 |      6.8 |       6.6 | 0.63 | -0.1 |       0.969
AutoARIMA     |  9.1 | 12.8 |      7.5 |       7.3 | 0.69 |  0.4 |       0.954
Theta         | 10.8 | 13.2 |      8.9 |       8.7 | 0.82 | -0.3 |       0.977
SeasonalNaive | 15.2 | 18.5 |     12.3 |      12.1 | 1.15 |  0.8 |       0.985

Analyzing Results

Winner Criteria

Primary metric: MAPE (easy to explain)

  • MFLES wins with 6.4% MAPE

Validation:

  • Check RMSE (penalizes large errors)
    • MFLES: 10.8 ✓ (best)
  • Check MASE (relative to naive)
    • MFLES: 0.60 ✓ (best, much better than 1.0 baseline)
  • Check coverage
    • MFLES: 0.962 ✓ (close to 0.95, good)

Winner: MFLES
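
If you prefer to pick the winner programmatically rather than by eye, a minimal sketch is to select the model with the lowest MAPE from the metrics table, using MASE as a tie-breaker:

-- Pick the winner automatically: lowest MAPE, ties broken by MASE
SELECT model, MAPE_pct, MASE
FROM metrics
ORDER BY MAPE_pct, MASE
LIMIT 1;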


Detailed Model Breakdown

Baseline (SeasonalNaive)

  • MAE: 15.2
  • Purpose: Benchmark
  • Interpretation: Repeat last year's value

Simple Models (Theta, AutoETS)

  • MAE: 8.3 (AutoETS) to 10.8 (Theta)
  • Improvement over baseline: 29-45%
  • Good balance of simplicity and accuracy

Complex Models (AutoARIMA, MFLES)

  • MAE: 7.9 (MFLES) to 9.1 (AutoARIMA)
  • Improvement over baseline: 40-48%
  • Highest accuracy, slightly slower

Comparison Metrics Explained

MAE (Mean Absolute Error)

Lower is better
Primary metric for model selection

RMSE (Root Mean Squared Error)

Penalizes large errors
Watch out when RMSE >> MAE (it indicates outliers or a few large errors)

MAPE (Mean Absolute Percentage Error)

Percentage error
Easy to report to business stakeholders

SMAPE (Symmetric MAPE)

Similar to MAPE but handles zeros better
Compare with MAPE for consistency

MASE (Mean Absolute Scaled Error)

< 1: Better than naive
= 1: Same as naive
> 1: Worse than naive
Quick comparison: Is this better than baseline?
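
MASE scales each model's errors by the errors of a seasonal naive forecast, so values below 1 mean the model beats that benchmark. A rough related check (not identical to MASE, which uses naive errors as its scaling term) is to divide each model's test MAE by the baseline's test MAE:

-- Rough relative check: each model's test MAE vs. the SeasonalNaive baseline
SELECT
    m.model,
    ROUND(m.MAE / b.MAE, 2) AS mae_vs_baseline
FROM metrics m
CROSS JOIN (SELECT MAE FROM metrics WHERE model = 'SeasonalNaive') b
ORDER BY mae_vs_baseline;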

ME (Mean Error)

Positive: Overforecasting (forecast too high on average)
Negative: Underforecasting (forecast too low on average)
Zero: Unbiased forecast

Coverage

Expected: ~0.95 for 95% confidence
Too low (e.g., 0.90): Intervals too narrow
Too high (e.g., 0.99): Intervals too wide
Ideal: Within the 0.93-0.97 range
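
Coverage can also be computed by hand from the comparison table as a cross-check of TS_COVERAGE; the sketch below simply counts how often the actual value falls inside the interval:

-- Manual coverage check: share of actuals inside the 95% interval
SELECT
    model,
    ROUND(AVG(CASE WHEN actual BETWEEN "lower" AND "upper" THEN 1.0 ELSE 0.0 END), 3) AS empirical_coverage
FROM comparison
GROUP BY model
ORDER BY empirical_coverage DESC;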

Statistical Testing

Win by Significant Margin

-- Is Model A better than Model B?
WITH model_comparison AS (
    SELECT
        model,
        ABS(actual - forecast) AS error
    FROM comparison
)
SELECT
    model,
    COUNT(*) AS sample_size,
    ROUND(AVG(error), 2) AS avg_error,
    ROUND(STDDEV(error), 2) AS std_error
FROM model_comparison
GROUP BY model;

-- Compare standard errors
-- If Model A error is 2+ std dev lower, it's better
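
A slightly more concrete sketch is to pair two models on the same test dates and compare the mean difference in absolute error to its standard error. This is a rough heuristic, not a formal Diebold-Mariano test; the two model names are just examples from this guide:

-- Paired comparison of two models on identical test dates (example: MFLES vs. AutoETS)
WITH paired AS (
    SELECT
        a.date,
        a.abs_error - b.abs_error AS error_diff  -- negative means MFLES is better on that date
    FROM comparison a
    JOIN comparison b ON a.date = b.date
    WHERE a.model = 'MFLES' AND b.model = 'AutoETS'
)
SELECT
    COUNT(*) AS n,
    ROUND(AVG(error_diff), 3) AS mean_diff,
    ROUND(STDDEV(error_diff) / SQRT(COUNT(*)), 3) AS std_err_of_mean
    -- If ABS(mean_diff) exceeds ~2x std_err_of_mean, the gap is unlikely to be noise
FROM paired;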

Speed Comparison

-- Time to train each model
-- (Add timing code to your script)

CREATE TABLE timing AS
SELECT
    'SeasonalNaive' AS model,
    '<100ms' AS train_time,
    'Very fast' AS comment
UNION ALL SELECT 'AutoETS', '~1s', 'Fast'
UNION ALL SELECT 'Theta', '~500ms', 'Fast'
UNION ALL SELECT 'AutoARIMA', '~5s', 'Slow'
UNION ALL SELECT 'MFLES', '~2s', 'Medium'
UNION ALL SELECT 'TBATS', '~10s', 'Very slow';
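
The durations above are illustrative. To measure real training time on your data, you can switch on the DuckDB CLI's per-statement timer before re-running a forecast (the table name fc_arima_timed below is just for illustration):

-- In the DuckDB CLI, enable per-statement timing
.timer on

CREATE TABLE fc_arima_timed AS
SELECT * FROM TS_FORECAST(
    'train', 'date', 'sales', 'AutoARIMA',
    (SELECT COUNT(*) FROM test),
    {'seasonal_period': 7}
);
-- The CLI prints the elapsed wall-clock time after the statement completes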

Ranking Models

Create a score based on multiple criteria:

-- Weighted rank scoring
-- (weights applied to metric ranks: MAPE 0.5, RMSE 0.3, MASE 0.2; lower score is better)

WITH ranked AS (
    SELECT
        model,
        RANK() OVER (ORDER BY MAPE_pct) AS mape_rank,
        RANK() OVER (ORDER BY RMSE) AS rmse_rank,
        RANK() OVER (ORDER BY MASE) AS mase_rank
    FROM metrics
)
SELECT
    model,
    ROUND(
        mape_rank * 0.5 + rmse_rank * 0.3 + mase_rank * 0.2,
        2
    ) AS composite_score,
    mape_rank,
    rmse_rank,
    mase_rank
FROM ranked
ORDER BY composite_score;

Handling Tied Models

When two models are very close:

-- Model A: MAE = 8.3, RMSE = 11.2, Time = 1s
-- Model B: MAE = 8.4, RMSE = 11.1, Time = 5s

-- Decision: Use Model A (slightly better, much faster)

-- Or: Run ensemble (combine both)
SELECT
date,
(forecast_a + forecast_b) / 2 as ensemble_forecast
FROM ...;
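
Using the all_forecasts table built earlier in this guide (with the column names from Step 4), a concrete version of that ensemble could look like this sketch, averaging the AutoETS and MFLES point forecasts as an example:

-- Simple two-model ensemble: average the AutoETS and MFLES point forecasts
SELECT
    date_col AS date,
    ROUND(AVG(point_forecast), 2) AS ensemble_forecast
FROM all_forecasts
WHERE model IN ('AutoETS', 'MFLES')
GROUP BY date_col
ORDER BY date_col;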

Special Comparison Cases

Very Short Data (< 60 observations)

-- Use simpler models
-- SeasonalNaive often wins
-- Avoid AutoARIMA

Very Noisy Data

-- MFLES often wins (robust to outliers)
-- Check for outliers and consider cleaning

Multiple Seasonality

-- Use TBATS or MSTL
-- Single-seasonality models will miss the additional patterns, so comparing against them is less informative

Forecast Intervals Matter

-- Check coverage metric
-- Wider intervals (higher confidence levels) sacrifice precision
-- Choose based on risk tolerance

Model Comparison Workflow

-- Complete workflow script

-- 1. SETUP
LOAD anofox_forecast;
-- ... create train/test ...

-- 2. TRAIN (all models)
-- ... create all FC tables ...

-- 3. EVALUATE
-- ... calculate metrics ...

-- 4. COMPARE
SELECT * FROM metrics ORDER BY MAPE_pct;

-- 5. WINNER
-- MFLES wins with lowest MAPE

-- 6. DECISION
-- Deploy MFLES to production
-- Monitor for 4 weeks
-- If metrics degrade, retrain or try another model

Common Comparison Mistakes

❌ Only Using One Metric

-- WRONG: "Model A has lowest RMSE, so it's best"
-- RIGHT: Compare MAE, RMSE, MAPE, MASE, coverage

❌ Not Using Test Data

-- WRONG: Evaluating on training data
-- RIGHT: Always use hold-out test data

❌ Different Test Sets

-- WRONG: Model A tested on Jan, Model B on Feb
-- RIGHT: All models tested on same dates

❌ Ignoring Speed

-- WRONG: "TBATS has best metrics, deploy it"
-- RIGHT: TBATS takes 10s per model, impractical
-- Consider trade-off between accuracy and speed

Next Steps

  1. Production Deployment — Deploy winning model
  2. Evaluating Accuracy — Deep dive on metrics
  3. Reference: Models — Details on each model

Key Takeaways

  • ✅ Always compare multiple models (baseline + simple + complex)
  • ✅ Evaluate on same test data
  • ✅ Use multiple metrics (MAE, RMSE, MAPE, MASE)
  • ✅ Check prediction interval coverage
  • ✅ Consider speed vs. accuracy trade-off
  • ✅ Statistical significance > marginal improvement
  • ✅ Monitor deployed model weekly