
Evaluating Forecast Accuracy

Measure and interpret forecast quality to choose the best models and identify improvements.

Why Evaluation Matters

Bad evaluation → Wrong model chosen → Terrible production performance

Good evaluation → Confident model selection → Reliable forecasts

You must evaluate forecasts before deployment to ensure quality.


Core Evaluation Approach

1. Split Data: Train vs. Test

Historical data (365 days)

├─ Training set (300 days) - Use to fit model

└─ Test set (65 days) - Use to evaluate

Why test on different data?

  • Training metrics are optimistic (model has seen this data)
  • Test metrics show real-world performance (new data)
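As a minimal sketch of the split (assuming a sales_data table with date and sales columns, as in the complete example further down), hold out the most recent 65 days:

-- Training set: everything except the last 65 days
CREATE TABLE training_data AS
SELECT * FROM sales_data
WHERE date < (SELECT MAX(date) - INTERVAL '65 days' FROM sales_data);

-- Test set: the last 65 days
CREATE TABLE test_data AS
SELECT * FROM sales_data
WHERE date >= (SELECT MAX(date) - INTERVAL '65 days' FROM sales_data);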

2. Generate Forecast on Test Set

-- Fit model on training data
CREATE TABLE forecast AS
SELECT * FROM TS_FORECAST(
    'training_data',
    'date',
    'sales',
    'AutoETS',
    65,
    {'seasonal_period': 7}
);

-- Compare against actual test data
SELECT
    t.date,
    t.sales as actual,
    f.point_forecast as predicted,
    ABS(t.sales - f.point_forecast) as absolute_error
FROM test_data t
LEFT JOIN forecast f ON t.date = f.date_col
ORDER BY t.date;

3. Calculate Accuracy Metrics

-- Aggregate errors into metrics
SELECT
    TS_MAE(...) as MAE,
    TS_RMSE(...) as RMSE,
    TS_MAPE(...) as MAPE
FROM ...;

4. Compare Models

-- Run evaluation for multiple models
-- Choose the one with best (lowest) metrics

Evaluation Metrics

AnoFox Forecast provides 12 built-in metrics:

Level 1: Basic Metrics (Start Here)

MAE (Mean Absolute Error)

Average of absolute errors

TS_MAE(actual_list, predicted_list)

Formula: Σ|actual - predicted| / n

Example:

Actuals:     [100, 110, 95, 105]
Predictions: [102, 108, 98, 104]
Errors: [2, 2, 3, 1]

MAE = (2 + 2 + 3 + 1) / 4 = 2.0
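The same numbers can be checked directly, assuming the metric functions accept list literals just as they accept the LIST() aggregates used later:

SELECT TS_MAE([100, 110, 95, 105], [102, 108, 98, 104]) as MAE;
-- Should return 2.0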

Interpretation:

  • MAE = 2.0 means average forecast error is 2 units
  • Lower is better
  • Same units as original data (interpretable)

Use when: Easy interpretation needed, outliers shouldn't dominate

Pros: ✅ Simple, interpretable
Cons: ❌ Doesn't penalize large errors heavily


RMSE (Root Mean Squared Error)

Square root of average squared errors

TS_RMSE(actual_list, predicted_list)

Formula: √(Σ(actual - predicted)² / n)

Example:

Actuals:     [100, 110, 95, 105]
Predictions: [102, 108, 98, 104]
Errors: [2, 2, 3, 1]

RMSE = √((4 + 4 + 9 + 1) / 4) = √4.5 = 2.12
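Checked the same way (again assuming list literals are accepted):

SELECT TS_RMSE([100, 110, 95, 105], [102, 108, 98, 104]) as RMSE;
-- Should return approximately 2.12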

Interpretation:

  • RMSE = 2.12 means typical error is ~2.12 units
  • Penalizes large errors more than MAE
  • Same units as original data

Use when: Large errors are especially costly and big misses should be penalized

Pros: ✅ Penalizes large errors
Cons: ❌ Less interpretable, affected by outliers


MAPE (Mean Absolute Percentage Error)

Average absolute error as percentage

TS_MAPE(actual_list, predicted_list)

Formula: Σ(|actual - predicted| / |actual|) × 100 / n

Example:

Actuals:     [100, 200, 50, 150]
Predictions: [105, 190, 55, 145]
Errors: [5%, 5%, 10%, 3.3%]

MAPE = (5 + 5 + 10 + 3.3) / 4 = 5.8%
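And with the metric function (list literals assumed; the MAPE_pct alias in the complete example below suggests the result is already a percentage):

SELECT TS_MAPE([100, 200, 50, 150], [105, 190, 55, 145]) as MAPE;
-- Should return roughly 5.8 if the function reports percentages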

Interpretation:

  • MAPE = 5.8% means average error is 5.8% of actual value
  • Scale-independent (good for comparing across products)
  • Percentage error (easier to understand)

Use when: Comparing across different scales/products, management reporting

Pros: ✅ Scale-independent, percentage-based
Cons: ❌ Undefined for zero actuals; asymmetric (penalizes over-forecasts more heavily than under-forecasts)


Level 2: Specialized Metrics

SMAPE (Symmetric Mean Absolute Percentage Error)

Symmetric version of MAPE

TS_SMAPE(actual_list, predicted_list)

Use when: MAPE is unreliable (many zero or very small actual values)
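Computed the same way as MAPE, here on the joined table built in the complete evaluation example below:

SELECT TS_SMAPE(LIST(actual), LIST(predicted)) as SMAPE
FROM joined;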


MASE (Mean Absolute Scaled Error)

Compares against naive forecast

TS_MASE(actual_list, predicted_list, seasonal_period)

Interpretation:

  • MASE < 1: Better than naive forecast
  • MASE = 1: Same as naive forecast
  • MASE > 1: Worse than naive forecast

Use when: Want performance relative to baseline
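For example, on the joined table from the complete evaluation example below, with a weekly seasonal period:

SELECT TS_MASE(LIST(actual), LIST(predicted), 7) as MASE
FROM joined;
-- MASE < 1.0 means the model beats a seasonal naive forecast with period 7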


ME (Mean Error)

Average signed error (positive or negative)

TS_ME(actual_list, predicted_list)

Interpretation:

  • ME = 0: Forecast is unbiased (neither over nor under forecasting)
  • ME > 0: Forecast tends to overestimate
  • ME < 0: Forecast tends to underestimate

Use when: Checking for systematic bias


Bias

Tendency to over/under forecast

TS_BIAS(actual_list, predicted_list)

Interpretation:

  • Positive: Forecast is optimistic (predicts higher)
  • Negative: Forecast is pessimistic (predicts lower)
  • Zero: Unbiased

Use when: Understanding if model consistently overestimates
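Both bias checks can be run together on the joined table from the complete evaluation example below:

SELECT
    TS_ME(LIST(actual), LIST(predicted)) as mean_error,
    TS_BIAS(LIST(actual), LIST(predicted)) as bias
FROM joined;
-- Values near zero indicate an unbiased forecast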


Level 3: Coverage Metrics

Coverage

Percentage of actual values within prediction intervals

TS_COVERAGE(actual_list, lower_list, upper_list)

Expected vs Actual:

  • Set confidence_level=0.95 → Expect 95% coverage
  • If actual coverage = 93% → Good (close to 95%)
  • If actual coverage = 80% → Bad (intervals too narrow)
  • If actual coverage = 99% → Too conservative (intervals wider than necessary)

Use when: Validating prediction intervals are appropriate
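Using the joined table (with lower and upper interval columns) from the complete evaluation example below:

SELECT TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)) as coverage
FROM joined;
-- With confidence_level = 0.95, expect a value close to the nominal 95% level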


Interval Width

Average width of prediction intervals

Interpretation:

  • Narrow intervals = confident, but risky if wrong
  • Wide intervals = conservative, safe but less useful

Trade-off:

Narrow intervals + Low coverage = Problem!
(Intervals are too tight for confidence level)

Wide intervals + High coverage = Good
(Conservative prediction intervals)
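No dedicated width function is shown here, so as a sketch a plain AVG over the interval bounds in the joined table (built in the complete example below) works alongside coverage:

SELECT
    TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)) as coverage,
    AVG(upper - lower) as avg_interval_width
FROM joined;
-- A narrow avg_interval_width combined with low coverage signals intervals that are too tight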

Level 4: Aggregate Metrics

R² (Coefficient of Determination)

Percentage of variance explained

TS_R_SQUARED(actual_list, predicted_list)

Range: up to 1 (negative values are possible for very poor forecasts)

Interpretation:

  • R² = 1.0: Perfect forecast
  • R² = 0.8: Explains 80% of variance (good)
  • R² = 0.5: Explains 50% of variance (fair)
  • R² = 0.0: No better than predicting mean
  • R² < 0: Worse than predicting mean

Use when: Want overall goodness-of-fit measure
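For instance, on the joined table from the complete evaluation example below:

SELECT TS_R_SQUARED(LIST(actual), LIST(predicted)) as R_squared
FROM joined;
-- Values close to 1.0 indicate the forecast tracks most of the variation in the actuals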


Metric Selection Guide

Situation                       | Use This Metric | Why
--------------------------------|-----------------|-----------------------------------------------
Inventory forecasting           | MASE            | Performance relative to a naive baseline
Financial forecasting           | RMSE            | Can't afford large errors
Multiple products               | MAPE            | Scale-independent
Executive reporting             | MAPE            | Easy to explain
Validating prediction intervals | Coverage        | Ensure the 95% CI actually contains ~95% of actuals
Model comparison                | MAE & RMSE      | Industry standard
Bias check                      | ME or Bias      | Detect systematic over/under forecasting
Data with zeros                 | RMSE or SMAPE   | MAPE undefined for zero actuals

Complete Evaluation Example

-- 1. Split data
CREATE TABLE train_data AS
SELECT * FROM sales_data WHERE date < '2024-01-01';

CREATE TABLE test_data AS
SELECT * FROM sales_data WHERE date >= '2024-01-01';

-- 2. Forecast test period
CREATE TABLE forecast AS
SELECT * FROM TS_FORECAST(
    'train_data', 'date', 'sales', 'AutoETS',
    (SELECT COUNT(*) FROM test_data),
    {'seasonal_period': 7}
);

-- 3. Join forecast with actual
CREATE TABLE joined AS
SELECT
    t.date,
    t.sales as actual,
    f.point_forecast as predicted,
    f.lower_95 as lower,
    f.upper_95 as upper
FROM test_data t
LEFT JOIN forecast f ON t.date = f.date_col;

-- 4. Calculate metrics
SELECT
    ROUND(TS_MAE(LIST(actual), LIST(predicted)), 2) as MAE,
    ROUND(TS_RMSE(LIST(actual), LIST(predicted)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(actual), LIST(predicted)), 2) as MAPE_pct,
    ROUND(TS_MASE(LIST(actual), LIST(predicted), 7), 2) as MASE,
    ROUND(TS_COVERAGE(LIST(actual), LIST(lower), LIST(upper)), 2) as coverage,
    ROUND(TS_ME(LIST(actual), LIST(predicted)), 2) as ME,
    ROUND(TS_R_SQUARED(LIST(actual), LIST(predicted)), 4) as R_squared
FROM joined;

Comparing Multiple Models

-- Compare AutoETS vs AutoARIMA
CREATE TABLE model_comparison AS
SELECT
    'AutoETS' as model,
    ROUND(TS_MAE(LIST(t.sales), LIST(f1.point_forecast)), 2) as MAE,
    ROUND(TS_RMSE(LIST(t.sales), LIST(f1.point_forecast)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(t.sales), LIST(f1.point_forecast)), 2) as MAPE
FROM test_data t
LEFT JOIN (
    SELECT * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoETS', 30, {})
) f1 ON t.date = f1.date_col

UNION ALL

SELECT
    'AutoARIMA' as model,
    ROUND(TS_MAE(LIST(t.sales), LIST(f2.point_forecast)), 2) as MAE,
    ROUND(TS_RMSE(LIST(t.sales), LIST(f2.point_forecast)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(t.sales), LIST(f2.point_forecast)), 2) as MAPE
FROM test_data t
LEFT JOIN (
    SELECT * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoARIMA', 30, {})
) f2 ON t.date = f2.date_col

ORDER BY MAPE; -- Choose model with lowest MAPE

Interpreting Metrics

Typical Metric Ranges

Metric            | Excellent | Good      | Fair      | Poor
------------------|-----------|-----------|-----------|---------------
MAPE              | <5%       | 5-10%     | 10-20%    | >20%
RMSE (% of mean)  | <10%      | 10-20%    | 20-30%    | >30%
R²                | 0.95+     | 0.80-0.95 | 0.60-0.80 | <0.60
MASE              | <0.8      | 0.8-1.0   | 1.0-1.5   | >1.5
Coverage (95% CI) | 93-97%    | 90-98%    | 85-99%    | <85% or >99%

Diagnostic Plots

Visualize forecast vs actual:

-- Forecast comparison (copy to Excel for visualization)
SELECT
    t.date,
    t.sales as actual,
    f.point_forecast as forecast,
    t.sales - f.point_forecast as error,
    -- Substitute the RMSE value computed in step 4 for TS_RMSE(...)
    CASE WHEN ABS(t.sales - f.point_forecast) > 2 * TS_RMSE(...)
         THEN 'OUTLIER' ELSE 'OK' END as flag
FROM test_data t
LEFT JOIN forecast f ON t.date = f.date_col
ORDER BY t.date;

Look for:

  • Are errors randomly distributed? (Good)
  • Are there patterns in errors? (Bad - model missing something)
  • Are there unusual spikes? (Outliers to investigate)

Cross-Validation

Stronger evaluation using multiple train/test splits:

-- Example: 5-fold rolling cross-validation
-- Split year into 5 periods, test on each

FOR fold IN 1 TO 5:
    train_data = data before fold
    test_data  = fold

    forecast = TS_FORECAST(train_data, ...)
    metrics  = calculate_metrics(test_data, forecast)
    save_results(metrics)

average_metrics = mean(all_fold_metrics)

This gives a more robust estimate of true performance.
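A minimal two-fold version of this idea in SQL, reusing the TS_FORECAST pattern from earlier (the cutoff dates, 30-day horizon, and the table names sales_data, train_fold1, and metrics_fold1 are illustrative assumptions):

-- Fold 1: train on data before the cutoff, test on the following 30 days
CREATE TABLE train_fold1 AS
SELECT * FROM sales_data WHERE date < '2023-11-01';

CREATE TABLE metrics_fold1 AS
SELECT
    1 as fold,
    TS_MAE(LIST(t.sales), LIST(f.point_forecast)) as MAE,
    TS_MAPE(LIST(t.sales), LIST(f.point_forecast)) as MAPE
FROM sales_data t
LEFT JOIN (
    SELECT * FROM TS_FORECAST('train_fold1', 'date', 'sales', 'AutoETS', 30, {'seasonal_period': 7})
) f ON t.date = f.date_col
WHERE t.date >= '2023-11-01' AND t.date < '2023-12-01';

-- Repeat for folds 2-5 with later cutoff dates, then average across folds:
-- SELECT AVG(MAE) as avg_MAE, AVG(MAPE) as avg_MAPE
-- FROM (SELECT * FROM metrics_fold1 UNION ALL SELECT * FROM metrics_fold2);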


Improving Forecast Accuracy

When Metrics Are Poor

MAPE > 15%?

├─ Is data noisy?
│ └─ Try MFLES (robust model)

├─ Multiple seasonal patterns?
│ └─ Try TBATS or MSTL

├─ Data has trend changes?
│ └─ Try AutoARIMA or retrain weekly

├─ Is forecast too simple?
│ └─ Try ensemble (average multiple models)

└─ Do you have enough data?
   └─ Collect more history or use a different model

Steps to Improve

  1. Data Quality: Clean outliers, handle missing values
  2. Feature Engineering: Add external variables if possible
  3. Model Selection: Try more complex models
  4. Ensemble: Combine predictions from multiple models (see the sketch after this list)
  5. Retraining: Retrain regularly as new data arrives
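A minimal ensemble sketch, assuming the two model forecasts have already been materialized as forecast_ets and forecast_arima (hypothetical table names) with matching date_col values:

-- Simple ensemble: average the point forecasts of two models
SELECT
    e.date_col,
    (e.point_forecast + a.point_forecast) / 2.0 as ensemble_forecast
FROM forecast_ets e
JOIN forecast_arima a ON e.date_col = a.date_col;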

Production Evaluation

Once deployed, continue monitoring:

-- Weekly model performance check
SELECT
    DATE_TRUNC('week', forecast_date) as week,
    ROUND(TS_MAE(LIST(actual), LIST(predicted)), 2) as weekly_MAE,
    ROUND(TS_MAPE(LIST(actual), LIST(predicted)), 2) as weekly_MAPE
FROM forecast_results
WHERE forecast_date >= TODAY() - INTERVAL '4 weeks'
GROUP BY DATE_TRUNC('week', forecast_date)
ORDER BY week DESC;

-- If metrics degrade, retrain model!

Quick Reference: Metric Cheat Sheet

Metric   | Formula                          | Ideal          | Lower Is Better?       | Meaning
---------|----------------------------------|----------------|------------------------|----------------------------
MAE      | Σ|e| / n                         | 0              | Yes                    | Avg absolute error
RMSE     | √(Σe² / n)                       | 0              | Yes                    | Penalizes large errors
MAPE     | Σ(|e| / |a|) × 100 / n           | 0              | Yes                    | % error relative to actual
SMAPE    | Σ(2|e| / (|a| + |p|)) × 100 / n  | 0              | Yes                    | Symmetric % error
MASE     | MAE / MAE_naive                  | <1             | Yes                    | Ratio to naive forecast
ME       | Σe / n                           | 0              | No (aim for 0)         | Bias (over/under)
Coverage | % of actuals within interval     | 0.95 (95% CI)  | No (aim for nominal)   | % of actuals inside CI
R²       | 1 − Σe² / Σ(y − ȳ)²              | 1              | No (higher is better)  | Variance explained

(e = actual − predicted, a = actual, p = predicted)

Common Evaluation Mistakes

❌ Evaluating on Training Data

-- WRONG: Training data known to model
SELECT TS_MAE(LIST(train_sales), LIST(train_predictions));

-- RIGHT: Test on unseen data
SELECT TS_MAE(LIST(test_sales), LIST(test_predictions));

❌ Using Only One Metric

-- WRONG: Only checking MAPE
MAPE = 3% -- Looks great!
But: RMSE = 1000 -- Actually bad!

-- RIGHT: Use multiple metrics
MAPE, RMSE, MASE all tell different stories

❌ Not Accounting for Seasonality

-- WRONG: Testing during off-season only
Test MAPE = 2%, but seasonal peak coming

-- RIGHT: Test across full seasonal cycle
Include peaks, valleys, normal periods

❌ Single Random Split

-- WRONG: One train/test split
Could be lucky with this particular split

-- RIGHT: Cross-validation or rolling windows
Multiple splits give more stable estimate

Next Steps

  1. Model Comparison Guide — Practical comparison workflow
  2. Production Deployment — Monitor deployed models
  3. Metrics Reference — All 12 metrics detailed

Key Takeaways

  • ✅ Always evaluate on test data, not training data
  • ✅ Use multiple metrics (MAE, RMSE, MAPE all tell different stories)
  • ✅ MAE = simple, RMSE = penalizes big errors, MAPE = percentage
  • ✅ Coverage validates prediction intervals
  • ✅ Compare models using same test data
  • ✅ Monitor production forecasts weekly
  • ✅ Retrain when metrics degrade
  • ✅ Use cross-validation for robust evaluation