
Evaluating Forecast Accuracy

Measure and interpret forecast quality to choose the best models and identify improvements.

Why Evaluation Matters

Bad evaluation → Wrong model chosen → Terrible production performance

Good evaluation → Confident model selection → Reliable forecasts

You must evaluate forecasts before deployment to ensure quality.


Core Evaluation Approach

1. Split Data: Train vs. Test

Historical data (365 days)

├─ Training set (300 days) - Use to fit model

└─ Test set (65 days) - Use to evaluate

Why test on different data?

  • Training metrics are optimistic (model has seen this data)
  • Test metrics show real-world performance (new data)
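As a minimal sketch of the split (assuming a sales_data table with date and sales columns, as in the complete example further down), hold out the most recent 65 days:

-- Training set: everything except the last 65 days
CREATE TABLE training_data AS
SELECT * FROM sales_data
WHERE date < (SELECT MAX(date) - INTERVAL '65 days' FROM sales_data);

-- Test set: the last 65 days
CREATE TABLE test_data AS
SELECT * FROM sales_data
WHERE date >= (SELECT MAX(date) - INTERVAL '65 days' FROM sales_data);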

2. Generate Forecast on Test Set

-- Fit model on training data
CREATE TABLE forecast AS
SELECT * FROM TS_FORECAST(
    'training_data',
    'date',
    'sales',
    'AutoETS',
    65,
    {'seasonal_period': 7}
);

-- Compare against actual test data
SELECT
    t.date,
    t.sales as actual,
    f.point_forecast as predicted,
    ABS(t.sales - f.point_forecast) as absolute_error
FROM test_data t
LEFT JOIN forecast f ON t.date = f.date_col
ORDER BY t.date;

3. Calculate Accuracy Metrics

-- Aggregate errors into metrics
SELECT
    TS_MAE(...) as MAE,
    TS_RMSE(...) as RMSE,
    TS_MAPE(...) as MAPE
FROM ...;

4. Compare Models

-- Run evaluation for multiple models
-- Choose the one with best (lowest) metrics

Evaluation Metrics

AnoFox Forecast provides 12 built-in metrics:

Level 1: Basic Metrics (Start Here)

MAE (Mean Absolute Error)

Average of absolute errors

TS_MAE(actual_list, predicted_list)

Formula: Σ|actual - predicted| / n

Example:

Actuals:     [100, 110, 95, 105]
Predictions: [102, 108, 98, 104]
Errors: [2, 2, 3, 1]

MAE = (2 + 2 + 3 + 1) / 4 = 2.0
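The same numbers can be checked directly, assuming the metric functions accept list literals just as they accept the LIST() aggregates used later:

SELECT TS_MAE([100, 110, 95, 105], [102, 108, 98, 104]) as MAE;
-- Should return 2.0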

Interpretation:

  • MAE = 2.0 means average forecast error is 2 units
  • Lower is better
  • Same units as original data (interpretable)

Use when: Easy interpretation needed, outliers shouldn't dominate

Pros: ✅ Simple, interpretable
Cons: ❌ Doesn't penalize large errors heavily


RMSE (Root Mean Squared Error)

Square root of average squared errors

TS_RMSE(actual_list, predicted_list)

Formula: √(Σ(actual - predicted)² / n)

Example:

Actuals:     [100, 110, 95, 105]
Predictions: [102, 108, 98, 104]
Errors: [2, 2, 3, 1]

RMSE = √((4 + 4 + 9 + 1) / 4) = √4.5 = 2.12
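Checked the same way (again assuming list literals are accepted):

SELECT TS_RMSE([100, 110, 95, 105], [102, 108, 98, 104]) as RMSE;
-- Should return approximately 2.12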

Interpretation:

  • RMSE = 2.12 means typical error is ~2.12 units
  • Penalizes large errors more than MAE
  • Same units as original data

Use when: Large errors are especially costly and big misses should be penalized

Pros: ✅ Penalizes large errors
Cons: ❌ Less interpretable, affected by outliers


MAPE (Mean Absolute Percentage Error)

Average absolute error as percentage

TS_MAPE(actual_list, predicted_list)

Formula: Σ(|actual - predicted| / |actual|) × 100 / n

Example:

Actuals:     [100, 200, 50, 150]
Predictions: [105, 190, 55, 145]
Errors: [5%, 5%, 10%, 3.3%]

MAPE = (5 + 5 + 10 + 3.3) / 4 = 5.8%
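And with the metric function (list literals assumed; the MAPE_pct alias in the complete example below suggests the result is already a percentage):

SELECT TS_MAPE([100, 200, 50, 150], [105, 190, 55, 145]) as MAPE;
-- Should return roughly 5.8 if the function reports percentages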

Interpretation:

  • MAPE = 5.8% means average error is 5.8% of actual value
  • Scale-independent (good for comparing across products)
  • Percentage error (easier to understand)

Use when: Comparing across different scales/products, management reporting

Pros: ✅ Scale-independent, percentage-based
Cons: ❌ Undefined for zero actuals; asymmetric (penalizes over-forecasts more heavily than under-forecasts)


Level 2: Specialized Metrics

SMAPE (Symmetric Mean Absolute Percentage Error)

Symmetric version of MAPE

TS_SMAPE(actual_list, predicted_list)

Use when: MAPE is unreliable (many zero or very small actual values)
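Computed the same way as MAPE, here on the joined table built in the complete evaluation example below:

SELECT TS_SMAPE(LIST(actual), LIST(predicted)) as SMAPE
FROM joined;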


MASE (Mean Absolute Scaled Error)

Compares against naive forecast

TS_MASE(actual_list, predicted_list, seasonal_period)

Interpretation:

  • MASE < 1: Better than naive forecast
  • MASE = 1: Same as naive forecast
  • MASE > 1: Worse than naive forecast

Use when: Want performance relative to baseline
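For example, on the joined table from the complete evaluation example below, with a weekly seasonal period:

SELECT TS_MASE(LIST(actual), LIST(predicted), 7) as MASE
FROM joined;
-- MASE < 1.0 means the model beats a seasonal naive forecast with period 7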


ME (Mean Error)

Average signed error (positive or negative)

TS_ME(actual_list, predicted_list)

Interpretation:

  • ME = 0: Forecast is unbiased (neither over nor under forecasting)
  • ME > 0: Forecast tends to overestimate
  • ME < 0: Forecast tends to underestimate

Use when: Checking for systematic bias


Bias

Tendency to over/under forecast

TS_BIAS(actual_list, predicted_list)

Interpretation:

  • Positive: Forecast is optimistic (predicts higher)
  • Negative: Forecast is pessimistic (predicts lower)
  • Zero: Unbiased

Use when: Understanding if model consistently overestimates
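Both bias checks can be run together on the joined table from the complete evaluation example below:

SELECT
    TS_ME(LIST(actual), LIST(predicted)) as mean_error,
    TS_BIAS(LIST(actual), LIST(predicted)) as bias
FROM joined;
-- Values near zero indicate an unbiased forecast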


Level 3: Coverage Metrics

Coverage

Percentage of actual values within prediction intervals

TS_COVERAGE(actual_list, lower_list, upper_list)

Expected vs Actual:

  • Set confidence_level=0.95 → Expect 95% coverage
  • If actual coverage = 93% → Good (close to 95%)
  • If actual coverage = 80% → Bad (intervals too narrow)
  • If actual coverage = 99% → Too conservative (intervals wider than necessary)

Use when: Validating prediction intervals are appropriate
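Using the joined table (with lower and upper interval columns) from the complete evaluation example below:

SELECT TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)) as coverage
FROM joined;
-- With confidence_level = 0.95, expect a value close to the nominal 95% level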


Interval Width

Average width of prediction intervals

Interpretation:

  • Narrow intervals = confident, but risky if wrong
  • Wide intervals = conservative, safe but less useful

Trade-off:

Narrow intervals + Low coverage = Problem!
(Intervals are too tight for confidence level)

Wide intervals + High coverage = Good
(Conservative prediction intervals)
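No dedicated width function is shown here, so as a sketch a plain AVG over the interval bounds in the joined table (built in the complete example below) works alongside coverage:

SELECT
    TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)) as coverage,
    AVG(upper - lower) as avg_interval_width
FROM joined;
-- A narrow avg_interval_width combined with low coverage signals intervals that are too tight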

Level 4: Aggregate Metrics

R² (Coefficient of Determination)

Percentage of variance explained

TS_R_SQUARED(actual_list, predicted_list)

Range: up to 1 (negative values are possible for very poor forecasts)

Interpretation:

  • R² = 1.0: Perfect forecast
  • R² = 0.8: Explains 80% of variance (good)
  • R² = 0.5: Explains 50% of variance (fair)
  • R² = 0.0: No better than predicting mean
  • R² < 0: Worse than predicting mean

Use when: Want overall goodness-of-fit measure
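For instance, on the joined table from the complete evaluation example below:

SELECT TS_R_SQUARED(LIST(actual), LIST(predicted)) as R_squared
FROM joined;
-- Values close to 1.0 indicate the forecast tracks most of the variation in the actuals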


Metric Selection Guide

Situation                       | Use This Metric | Why
--------------------------------|-----------------|-----------------------------------------------
Inventory forecasting           | MASE            | Performance relative to a naive baseline
Financial forecasting           | RMSE            | Can't afford large errors
Multiple products               | MAPE            | Scale-independent
Executive reporting             | MAPE            | Easy to explain
Validating prediction intervals | Coverage        | Ensure the 95% CI actually contains ~95% of actuals
Model comparison                | MAE & RMSE      | Industry standard
Bias check                      | ME or Bias      | Detect systematic over/under forecasting
Data with zeros                 | RMSE or SMAPE   | MAPE undefined for zero actuals

Complete Evaluation Example

-- 1. Split data
CREATE TABLE train_data AS
SELECT * FROM sales_data WHERE date < '2024-01-01';

CREATE TABLE test_data AS
SELECT * FROM sales_data WHERE date >= '2024-01-01';

-- 2. Forecast test period
CREATE TABLE forecast AS
SELECT * FROM TS_FORECAST(
    'train_data', 'date', 'sales', 'AutoETS',
    (SELECT COUNT(*) FROM test_data),
    {'seasonal_period': 7}
);

-- 3. Join forecast with actual
CREATE TABLE joined AS
SELECT
    t.date,
    t.sales as actual,
    f.point_forecast as predicted,
    f.lower_95 as lower,
    f.upper_95 as upper
FROM test_data t
LEFT JOIN forecast f ON t.date = f.date_col;

-- 4. Calculate metrics
SELECT
    ROUND(TS_MAE(LIST(actual), LIST(predicted)), 2) as MAE,
    ROUND(TS_RMSE(LIST(actual), LIST(predicted)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(actual), LIST(predicted)), 2) as MAPE_pct,
    ROUND(TS_MASE(LIST(actual), LIST(predicted), 7), 2) as MASE,
    ROUND(TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)), 2) as coverage,
    ROUND(TS_ME(LIST(actual), LIST(predicted)), 2) as ME,
    ROUND(TS_R_SQUARED(LIST(actual), LIST(predicted)), 4) as R_squared
FROM joined;

Comparing Multiple Models

-- Compare AutoETS vs AutoARIMA
CREATE TABLE model_comparison AS
SELECT
    'AutoETS' as model,
    ROUND(TS_MAE(LIST(t.sales), LIST(f1.point_forecast)), 2) as MAE,
    ROUND(TS_RMSE(LIST(t.sales), LIST(f1.point_forecast)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(t.sales), LIST(f1.point_forecast)), 2) as MAPE
FROM test_data t
LEFT JOIN (
    SELECT * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoETS', 30, {})
) f1 ON t.date = f1.date_col

UNION ALL

SELECT
    'AutoARIMA' as model,
    ROUND(TS_MAE(LIST(t.sales), LIST(f2.point_forecast)), 2) as MAE,
    ROUND(TS_RMSE(LIST(t.sales), LIST(f2.point_forecast)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(t.sales), LIST(f2.point_forecast)), 2) as MAPE
FROM test_data t
LEFT JOIN (
    SELECT * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoARIMA', 30, {})
) f2 ON t.date = f2.date_col

ORDER BY MAPE; -- Choose model with lowest MAPE

Interpreting Metrics

Typical Metric Ranges

Metric            | Excellent | Good      | Fair      | Poor
------------------|-----------|-----------|-----------|---------------
MAPE              | <5%       | 5-10%     | 10-20%    | >20%
RMSE (% of mean)  | <10%      | 10-20%    | 20-30%    | >30%
R²                | 0.95+     | 0.80-0.95 | 0.60-0.80 | <0.60
MASE              | <0.8      | 0.8-1.0   | 1.0-1.5   | >1.5
Coverage (95% CI) | 93-97%    | 90-98%    | 85-99%    | <85% or >99%

Diagnostic Plots

Visualize forecast vs actual:

-- Forecast comparison (copy to Excel for visualization)
SELECT
    t.date,
    t.sales as actual,
    f.point_forecast as forecast,
    t.sales - f.point_forecast as error,
    -- Substitute the RMSE value computed in step 4 for TS_RMSE(...)
    CASE WHEN ABS(t.sales - f.point_forecast) > 2 * TS_RMSE(...)
         THEN 'OUTLIER' ELSE 'OK' END as flag
FROM test_data t
LEFT JOIN forecast f ON t.date = f.date_col
ORDER BY t.date;

Look for:

  • Are errors randomly distributed? (Good)
  • Are there patterns in errors? (Bad - model missing something)
  • Are there unusual spikes? (Outliers to investigate)

Cross-Validation

Stronger evaluation using multiple train/test splits:

-- Example: 5-fold rolling cross-validation
-- Split year into 5 periods, test on each

FOR fold IN 1 TO 5:
    train_data = data before fold
    test_data  = fold

    forecast = TS_FORECAST(train_data, ...)
    metrics  = calculate_metrics(test_data, forecast)
    save_results(metrics)

average_metrics = mean(all_fold_metrics)

This gives a more robust estimate of true performance.
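A minimal two-fold version of this idea in SQL, reusing the TS_FORECAST pattern from earlier (the cutoff dates, 30-day horizon, and the table names sales_data, train_fold1, and metrics_fold1 are illustrative assumptions):

-- Fold 1: train on data before the cutoff, test on the following 30 days
CREATE TABLE train_fold1 AS
SELECT * FROM sales_data WHERE date < '2023-11-01';

CREATE TABLE metrics_fold1 AS
SELECT
    1 as fold,
    TS_MAE(LIST(t.sales), LIST(f.point_forecast)) as MAE,
    TS_MAPE(LIST(t.sales), LIST(f.point_forecast)) as MAPE
FROM sales_data t
LEFT JOIN (
    SELECT * FROM TS_FORECAST('train_fold1', 'date', 'sales', 'AutoETS', 30, {'seasonal_period': 7})
) f ON t.date = f.date_col
WHERE t.date >= '2023-11-01' AND t.date < '2023-12-01';

-- Repeat for folds 2-5 with later cutoff dates, then average across folds:
-- SELECT AVG(MAE) as avg_MAE, AVG(MAPE) as avg_MAPE
-- FROM (SELECT * FROM metrics_fold1 UNION ALL SELECT * FROM metrics_fold2);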


Improving Forecast Accuracy

When Metrics Are Poor

MAPE > 15%?

├─ Is data noisy?
│ └─ Try MFLES (robust model)

├─ Multiple seasonal patterns?
│ └─ Try TBATS or MSTL

├─ Data has trend changes?
│ └─ Try AutoARIMA or retrain weekly

├─ Is forecast too simple?
│ └─ Try ensemble (average multiple models)

└─ Do you have enough data?
   └─ Collect more history or use a different model

Steps to Improve

  1. Data Quality: Clean outliers, handle missing values
  2. Feature Engineering: Add external variables if possible
  3. Model Selection: Try more complex models
  4. Ensemble: Combine predictions from multiple models (see the sketch after this list)
  5. Retraining: Retrain regularly as new data arrives
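A minimal ensemble sketch, assuming the two model forecasts have already been materialized as forecast_ets and forecast_arima (hypothetical table names) with matching date_col values:

-- Simple ensemble: average the point forecasts of two models
SELECT
    e.date_col,
    (e.point_forecast + a.point_forecast) / 2.0 as ensemble_forecast
FROM forecast_ets e
JOIN forecast_arima a ON e.date_col = a.date_col;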

Production Evaluation

Once deployed, continue monitoring:

-- Weekly model performance check
SELECT
    DATE_TRUNC('week', forecast_date) as week,
    ROUND(TS_MAE(LIST(actual), LIST(predicted)), 2) as weekly_MAE,
    ROUND(TS_MAPE(LIST(actual), LIST(predicted)), 2) as weekly_MAPE
FROM forecast_results
WHERE forecast_date >= TODAY() - INTERVAL '4 weeks'
GROUP BY DATE_TRUNC('week', forecast_date)
ORDER BY week DESC;

-- If metrics degrade, retrain model!

Quick Reference: Metric Cheat Sheet

Metric   | Formula                          | Ideal          | Lower Is Better?       | Meaning
---------|----------------------------------|----------------|------------------------|----------------------------
MAE      | Σ|e| / n                         | 0              | Yes                    | Avg absolute error
RMSE     | √(Σe² / n)                       | 0              | Yes                    | Penalizes large errors
MAPE     | Σ(|e| / |a|) × 100 / n           | 0              | Yes                    | % error relative to actual
SMAPE    | Σ(2|e| / (|a| + |p|)) × 100 / n  | 0              | Yes                    | Symmetric % error
MASE     | MAE / MAE_naive                  | <1             | Yes                    | Ratio to naive forecast
ME       | Σe / n                           | 0              | No (aim for 0)         | Bias (over/under)
Coverage | % of actuals within interval     | 0.95 (95% CI)  | No (aim for nominal)   | % of actuals inside CI
R²       | 1 − Σe² / Σ(y − ȳ)²              | 1              | No (higher is better)  | Variance explained

(e = actual − predicted, a = actual, p = predicted)

Common Evaluation Mistakes

❌ Evaluating on Training Data

-- WRONG: Training data known to model
SELECT TS_MAE(LIST(train_sales), LIST(train_predictions));

-- RIGHT: Test on unseen data
SELECT TS_MAE(LIST(test_sales), LIST(test_predictions));

❌ Using Only One Metric

-- WRONG: Only checking MAPE
MAPE = 3% -- Looks great!
But: RMSE = 1000 -- Actually bad!

-- RIGHT: Use multiple metrics
MAPE, RMSE, MASE all tell different stories

❌ Not Accounting for Seasonality

-- WRONG: Testing during off-season only
Test MAPE = 2%, but seasonal peak coming

-- RIGHT: Test across full seasonal cycle
Include peaks, valleys, normal periods

❌ Single Random Split

-- WRONG: One train/test split
Could be lucky with this particular split

-- RIGHT: Cross-validation or rolling windows
Multiple splits give more stable estimate

Next Steps

  1. Model Comparison Guide — Practical comparison workflow
  2. Production Deployment — Monitor deployed models
  3. Metrics Reference — All 12 metrics detailed

Key Takeaways

  • ✅ Always evaluate on test data, not training data
  • ✅ Use multiple metrics (MAE, RMSE, MAPE all tell different stories)
  • ✅ MAE = simple, RMSE = penalizes big errors, MAPE = percentage
  • ✅ Coverage validates prediction intervals
  • ✅ Compare models using same test data
  • ✅ Monitor production forecasts weekly
  • ✅ Retrain when metrics degrade
  • ✅ Use cross-validation for robust evaluation