
Basic Forecasting Workflow

A complete end-to-end guide to forecasting, from data preparation through evaluation and deployment.

The Workflow

Raw Data
  → Data Preparation (cleaning, validation)
  → Exploratory Analysis (detect patterns)
  → Train/Test Split
  → Model Training
  → Forecast Generation
  → Evaluation (metrics)
  → Model Comparison
  → Deploy Best Model

Step 1: Create or Load Your Data

Start with a table containing time-series data:

-- Example: Load sales data
CREATE TABLE sales_data AS
SELECT * FROM read_parquet('s3://bucket/sales.parquet');

-- Or create synthetic data for testing
CREATE TABLE sales_data AS
SELECT
    DATE '2023-01-01' + INTERVAL (d) DAY AS date,
    100 + (d / 3.65) +                    -- Trend
    20 * SIN(2 * PI() * (d % 7) / 7) +    -- Weekly seasonality
    RANDOM() * 10 AS sales                -- Noise
FROM generate_series(0, 364) t(d)
ORDER BY date;

Data requirements:

  • Time column (DATE, TIMESTAMP, or INTEGER)
  • Value column (DOUBLE, INTEGER)
  • Regular time intervals (daily, hourly, etc.; see the aggregation sketch after this list if yours are irregular)
  • At least 30 observations (2+ seasonal cycles recommended)
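
If your raw data is transaction-level or irregularly timestamped, aggregate it onto a fixed grid first. A minimal sketch, assuming a hypothetical raw_transactions table with ts (TIMESTAMP) and amount columns:

-- Collapse irregular timestamps to one row per day
CREATE TABLE sales_data AS
SELECT
    DATE_TRUNC('day', ts)::DATE as date,  -- one bucket per calendar day
    SUM(amount) as sales                  -- daily total
FROM raw_transactions
GROUP BY 1
ORDER BY 1;

Days with no transactions are still absent after this step; TS_FILL_GAPS in Step 3 closes those holes.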

Step 2: Data Validation

Check data quality before forecasting:

-- Check for nulls and basic statistics
SELECT
    COUNT(*) as total_rows,
    COUNT(DISTINCT date) as unique_dates,
    COUNT(CASE WHEN sales IS NULL THEN 1 END) as null_count,
    MIN(sales) as min_sales,
    MAX(sales) as max_sales,
    ROUND(AVG(sales), 2) as avg_sales,
    ROUND(STDDEV(sales), 2) as std_dev
FROM sales_data;

-- Check for duplicates
SELECT date, COUNT(*) as cnt
FROM sales_data
GROUP BY date
HAVING COUNT(*) > 1;

-- Check for gaps in dates
-- (window functions are not allowed in WHERE; filter them with QUALIFY)
SELECT
    date,
    LEAD(date) OVER (ORDER BY date) - date as days_gap  -- DATE - DATE yields days
FROM sales_data
QUALIFY days_gap != 1;

Step 3: Data Cleaning

Handle missing values, outliers, and gaps:

-- Fill time gaps (optional)
CREATE TABLE sales_cleaned AS
SELECT * FROM TS_FILL_GAPS(
    'sales_data',
    'date',
    'sales',
    {'method': 'linear'}  -- linear, forward_fill, or mean
);

-- Remove outliers (optional)
SELECT * FROM TS_REMOVE_OUTLIERS(
    'sales_cleaned',
    'sales',
    {'method': 'iqr', 'multiplier': 3.0}  -- IQR method with 3× threshold
);
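
To see which rows would be flagged before deleting anything, the standard IQR fences can be reproduced in plain SQL. A sketch using the same 3.0 multiplier as the call above (whether TS_REMOVE_OUTLIERS computes exactly these fences is an assumption):

-- Preview rows that fall outside the IQR fences
WITH bounds AS (
    SELECT
        quantile_cont(sales, 0.25) as q1,
        quantile_cont(sales, 0.75) as q3
    FROM sales_cleaned
)
SELECT s.date, s.sales
FROM sales_cleaned s, bounds b
WHERE s.sales < b.q1 - 3.0 * (b.q3 - b.q1)
   OR s.sales > b.q3 + 3.0 * (b.q3 - b.q1)
ORDER BY s.date;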

Step 4: Exploratory Analysis

Understand patterns before modeling:

-- Detect seasonality
SELECT * FROM TS_DETECT_SEASONALITY(
    'sales_data',
    'date',
    'sales',
    {'method': 'auto'}
);

-- Detect changepoints (structural breaks)
SELECT * FROM TS_DETECT_CHANGEPOINTS(
    'sales_data',
    'date',
    'sales',
    {}
);

-- Decompose series to see components
SELECT
    date,
    sales,
    trend,
    seasonal,
    remainder
FROM TS_STL_DECOMPOSITION(
    'sales_data',
    'date',
    'sales',
    {'seasonal_period': 7}
);

Key questions to answer:

  • Is there seasonality? How strong? (quantified in the sketch after this list)
  • Is there a trend (increasing/decreasing)?
  • Are there structural breaks or regime changes?
  • How much noise/volatility?
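
The first two questions can be answered quantitatively from the STL components, using the strength-of-trend and strength-of-seasonality measures from Hyndman & Athanasopoulos (both range from 0 for no structure to 1 for very strong structure):

-- Strength of trend and seasonality, computed from the STL components
WITH stl AS (
    SELECT trend, seasonal, remainder
    FROM TS_STL_DECOMPOSITION('sales_data', 'date', 'sales', {'seasonal_period': 7})
)
SELECT
    ROUND(GREATEST(0, 1 - VAR_SAMP(remainder) / VAR_SAMP(trend + remainder)), 3)    as trend_strength,
    ROUND(GREATEST(0, 1 - VAR_SAMP(remainder) / VAR_SAMP(seasonal + remainder)), 3) as seasonal_strength
FROM stl;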

Step 5: Train/Test Split

Split data for unbiased evaluation:

-- Time series must be split chronologically, never randomly shuffled.
-- Hold out the most recent observations: here Nov-Dec (~17% of the year) for testing.

CREATE TABLE train_data AS
SELECT * FROM sales_data
WHERE date < DATE '2023-11-01';

CREATE TABLE test_data AS
SELECT * FROM sales_data
WHERE date >= DATE '2023-11-01';

-- Verify split
SELECT
    'train' as dataset,
    COUNT(*) as rows,
    MIN(date) as start_date,
    MAX(date) as end_date
FROM train_data
UNION ALL
SELECT
    'test' as dataset,
    COUNT(*) as rows,
    MIN(date) as start_date,
    MAX(date) as end_date
FROM test_data;

Step 6: Train Models

Fit multiple models for comparison:

-- Model 1: Simple baseline
CREATE TABLE forecast_naive AS
SELECT * FROM TS_FORECAST(
    'train_data',
    'date',
    'sales',
    'SeasonalNaive',
    (SELECT COUNT(*) FROM test_data),
    {'seasonal_period': 7}
);

-- Model 2: Statistical (AutoETS)
CREATE TABLE forecast_ets AS
SELECT * FROM TS_FORECAST(
    'train_data',
    'date',
    'sales',
    'AutoETS',
    (SELECT COUNT(*) FROM test_data),
    {'seasonal_period': 7, 'confidence_level': 0.95}
);

-- Model 3: ARIMA
CREATE TABLE forecast_arima AS
SELECT * FROM TS_FORECAST(
    'train_data',
    'date',
    'sales',
    'AutoARIMA',
    (SELECT COUNT(*) FROM test_data),
    {'seasonal_period': 7, 'confidence_level': 0.95}
);

Step 7: Generate Forecasts

Collect the predictions from Step 6 into a single table for the test period:

-- All models in one unified table
CREATE TABLE all_forecasts AS
SELECT
    forecast_step,
    date_col as forecast_date,
    'SeasonalNaive' as model_name,
    point_forecast,
    lower_95,
    upper_95
FROM forecast_naive

UNION ALL

SELECT
    forecast_step,
    date_col,
    'AutoETS',
    point_forecast,
    lower_95,
    upper_95
FROM forecast_ets

UNION ALL

SELECT
    forecast_step,
    date_col,
    'AutoARIMA',
    point_forecast,
    lower_95,
    upper_95
FROM forecast_arima

ORDER BY forecast_date, model_name;

Step 8: Compare Actual vs. Forecast

Join predictions with actual values:

-- Side-by-side comparison
CREATE TABLE comparison AS
SELECT
    t.date,
    t.sales as actual,
    f.model_name,
    ROUND(f.point_forecast, 2) as forecast,
    ROUND(ABS(t.sales - f.point_forecast), 2) as absolute_error,
    ROUND(ABS(t.sales - f.point_forecast) / NULLIF(t.sales, 0) * 100, 1) as error_pct  -- NULLIF guards zero actuals
FROM test_data t
LEFT JOIN all_forecasts f ON t.date = f.forecast_date
ORDER BY t.date, f.model_name;

-- View the comparison
SELECT * FROM comparison LIMIT 20;
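
To read the models side by side, one forecast column per model, recent DuckDB versions can pivot the comparison table directly. A convenience sketch; the long format above remains the input for the metric step:

-- One row per test date, one forecast column per model
PIVOT comparison
ON model_name
USING first(forecast)
GROUP BY date, actual
ORDER BY date;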

Step 9: Calculate Evaluation Metrics

Quantify forecast accuracy:

-- Aggregate metrics by model
-- (ORDER BY inside LIST keeps the pairs in time order; aggregation order is
--  otherwise unspecified, which matters for order-sensitive metrics like MASE)
CREATE TABLE model_metrics AS
SELECT
    model_name,
    ROUND(TS_MAE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as MAE,
    ROUND(TS_RMSE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as MAPE_pct,
    ROUND(TS_MASE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date), 7), 2) as MASE,
    ROUND(TS_ME(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as ME,
    ROUND(TS_R_SQUARED(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 4) as R_squared,
    ROUND(TS_COVERAGE(LIST(actual ORDER BY date), LIST(lower_95 ORDER BY date), LIST(upper_95 ORDER BY date)), 3) as coverage_95
FROM (
    SELECT
        t.date,
        f.model_name,
        t.sales as actual,
        f.point_forecast as forecast,
        f.lower_95,
        f.upper_95
    FROM test_data t
    LEFT JOIN all_forecasts f ON t.date = f.forecast_date
)
GROUP BY model_name
ORDER BY MAPE_pct;  -- Sort by MAPE (typical headline metric)

-- View results
SELECT * FROM model_metrics;

Output example:

model_name     MAE   RMSE  MAPE_pct  MASE  ME    R_squared
SeasonalNaive  15.2  18.5  12.3      1.15  0.8   0.7832
AutoETS        8.3   11.2  6.8       0.63  -0.2  0.9234
AutoARIMA      9.1   12.8  7.5       0.69  0.5   0.9045
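
To sanity-check what the TS_* aggregates report, the headline metrics can be recomputed with nothing but built-in aggregates, using the standard definitions (MAE = mean absolute error, RMSE = root of the mean squared error, MAPE = mean absolute percentage error):

-- Recompute MAE / RMSE / MAPE from first principles
SELECT
    f.model_name,
    ROUND(AVG(ABS(t.sales - f.point_forecast)), 2) as mae_check,
    ROUND(SQRT(AVG(POW(t.sales - f.point_forecast, 2))), 2) as rmse_check,
    ROUND(100 * AVG(ABS(t.sales - f.point_forecast) / NULLIF(ABS(t.sales), 0)), 2) as mape_pct_check
FROM test_data t
JOIN all_forecasts f ON t.date = f.forecast_date
GROUP BY f.model_name
ORDER BY mape_pct_check;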

Step 10: Choose Best Model

Select model based on metrics:

-- Get the best model (lowest MAPE)
SELECT
    model_name,
    MAE,
    RMSE,
    MAPE_pct
FROM model_metrics
ORDER BY MAPE_pct
LIMIT 1;

-- In this example: AutoETS with 6.8% MAPE is best

Step 11: Deploy for Production

Use the best model to forecast the future:

-- Retrain on full historical data
CREATE TABLE production_forecast AS
SELECT * FROM TS_FORECAST(
    'sales_data',  -- Use all historical data, not just training
    'date',
    'sales',
    'AutoETS',     -- Best model from Step 10
    90,            -- Forecast 90 days ahead
    {'seasonal_period': 7, 'confidence_level': 0.95}
);

-- View next 7 days forecast
SELECT
    forecast_step,
    date_col as forecast_date,
    ROUND(point_forecast, 2) as forecast,
    ROUND(lower_95, 2) as lower,
    ROUND(upper_95, 2) as upper
FROM production_forecast
WHERE forecast_step <= 7
ORDER BY forecast_step;
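
If a downstream system consumes the forecast, persist it with DuckDB's COPY. A local Parquet file is shown; point the target at your own storage:

-- Export the production forecast for downstream consumers
COPY production_forecast TO 'production_forecast.parquet' (FORMAT PARQUET);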

Step 12: Monitor & Retrain

Track model performance over time:

-- Weekly monitoring query (actuals accumulate in sales_data over time)
SELECT
    DATE_TRUNC('week', date) as week,
    COUNT(*) as forecast_count,
    ROUND(AVG(ABS(actual - forecast)), 2) as avg_error,
    ROUND(TS_MAE(LIST(actual ORDER BY date), LIST(forecast ORDER BY date)), 2) as weekly_MAE
FROM (
    SELECT
        f.date_col as date,
        t.sales as actual,
        f.point_forecast as forecast
    FROM production_forecast f
    JOIN sales_data t ON f.date_col = t.date  -- inner join: only dates with actuals
    WHERE f.date_col >= TODAY() - INTERVAL '30 days'
)
GROUP BY DATE_TRUNC('week', date)
ORDER BY week;

-- If metrics degrade, retrain!
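
To make "retrain when metrics degrade" concrete, compare the latest week against a baseline of the preceding weeks. A sketch; the 1.5× factor is an arbitrary threshold to tune, not a recommendation:

-- Flag the model for retraining when last week's MAE jumps above baseline
WITH weekly AS (
    SELECT
        DATE_TRUNC('week', f.date_col) as week,
        AVG(ABS(t.sales - f.point_forecast)) as weekly_mae
    FROM production_forecast f
    JOIN sales_data t ON f.date_col = t.date  -- only dates with actuals
    GROUP BY 1
),
tagged AS (
    SELECT *, MAX(week) OVER () as latest_week FROM weekly
)
SELECT
    MAX(latest_week) as latest_week,
    ROUND(MAX(CASE WHEN week = latest_week THEN weekly_mae END), 2) as latest_mae,
    ROUND(MEDIAN(CASE WHEN week < latest_week THEN weekly_mae END), 2) as baseline_mae,
    MAX(CASE WHEN week = latest_week THEN weekly_mae END)
        > 1.5 * MEDIAN(CASE WHEN week < latest_week THEN weekly_mae END) as needs_retrain
FROM tagged;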

Complete Workflow Script

Copy and paste to run entire workflow:

LOAD anofox_forecast;

-- 1. Create sample data
CREATE TABLE sales_data AS
SELECT
    DATE '2023-01-01' + INTERVAL (d) DAY AS date,
    100 + (d / 3.65) + 20 * SIN(2 * PI() * (d % 7) / 7) + RANDOM() * 10 AS sales
FROM generate_series(0, 364) t(d);

-- 2. Split data (time-based: last 61 days held out)
CREATE TABLE train_data AS SELECT * FROM sales_data WHERE date < '2023-11-01';
CREATE TABLE test_data AS SELECT * FROM sales_data WHERE date >= '2023-11-01';

-- 3. Generate forecasts (3 models; label each row, since the outputs are unioned)
--    Horizon 61 = length of the test period
CREATE TABLE forecast_results AS
SELECT 'SeasonalNaive' as model_name, * FROM TS_FORECAST('train_data', 'date', 'sales', 'SeasonalNaive', 61, {'seasonal_period': 7})
UNION ALL
SELECT 'AutoETS' as model_name, * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoETS', 61, {'seasonal_period': 7})
UNION ALL
SELECT 'AutoARIMA' as model_name, * FROM TS_FORECAST('train_data', 'date', 'sales', 'AutoARIMA', 61, {'seasonal_period': 7});

-- 4. Evaluate
SELECT
    f.model_name,
    ROUND(TS_MAE(LIST(t.sales ORDER BY t.date), LIST(f.point_forecast ORDER BY t.date)), 2) as MAE,
    ROUND(TS_RMSE(LIST(t.sales ORDER BY t.date), LIST(f.point_forecast ORDER BY t.date)), 2) as RMSE,
    ROUND(TS_MAPE(LIST(t.sales ORDER BY t.date), LIST(f.point_forecast ORDER BY t.date)), 2) as MAPE_pct
FROM test_data t
LEFT JOIN forecast_results f ON t.date = f.date_col
GROUP BY f.model_name
ORDER BY MAPE_pct;

Common Issues

Issue: "Not enough data"

Solution: Ensure at least 2 seasonal cycles (e.g., 14 days for weekly seasonality)

Issue: "Forecast values are unrealistic"

Solution:

  1. Check the seasonal_period parameter
  2. Verify data quality (Steps 2-3)
  3. Try a different model

Issue: "Model too slow"

Solution:

  1. Use a faster model (Naive, Theta)
  2. Evaluate series in parallel with GROUP BY
  3. Reduce the amount of data if possible

Next Steps

  1. Data Preparation — Cleaning real-world data
  2. Model Comparison — Advanced comparison techniques
  3. Production Deployment — Scale to millions of series

Key Takeaways

  • ✅ Split data before evaluation (train/test)
  • ✅ Try multiple models and compare metrics
  • ✅ Use MAE, RMSE, MAPE for evaluation
  • ✅ Deploy best model to production
  • ✅ Monitor performance weekly
  • ✅ Retrain when metrics degrade