Quickstart - 5-Minute Regression
Fit your first statistical regression model in DuckDB in under 5 minutes.
What You'll Learn
- ✅ Load the Statistics extension
- ✅ Create sample data
- ✅ Fit a linear regression model
- ✅ Interpret coefficients and R²
- ✅ Test statistical significance
- ✅ Make predictions with confidence intervals
Time estimate: 5 minutes
Step 1: Load Extension (30 seconds)
LOAD anofox_statistics;
Step 2: Create Sample Data (1 minute)
Let's create a marketing spend vs. revenue dataset:
CREATE TABLE marketing_data AS
SELECT
month::INT as month_id,
(RANDOM() * 50000 + 20000)::DOUBLE as marketing_spend,
(marketing_spend * 3.5 + RANDOM() * 100000)::DOUBLE as revenue,
CASE WHEN RANDOM() > 0.7 THEN 1 ELSE 0 END as had_campaign
FROM (
SELECT month FROM range(1, 48)
) t(month);
-- Check data
SELECT * FROM marketing_data LIMIT 5;
Step 3: Run Simple Regression (1 minute)
Fit a model predicting revenue from marketing spend:
SELECT
coefficient[1] as intercept,
coefficient[2] as spend_effect,
std_error[1] as intercept_se,
std_error[2] as spend_effect_se,
t_statistic[2] as spend_t_stat,
p_value[2] as spend_p_value,
r_squared,
rmse,
aic,
bic
FROM anofox_statistics_ols(
'marketing_data',
'revenue',
ARRAY['marketing_spend']
);
Output:
intercept | spend_effect | intercept_se | spend_effect_se | spend_t_stat | spend_p_value | r_squared | rmse | aic | bic
-----------|--------------|--------------|-----------------|--------------|---------------|-----------|-----------|-----------|----------
50000 | 3.2 | 15000 | 0.08 | 40.0 | 0.000001 | 0.72 | 250000 | 450.2 | 455.1
Step 4: Interpret Results (1 minute)
Coefficients
- intercept (50000): Base revenue with $0 marketing spend
- spend_effect (3.2): For each $1 increase in marketing spend, revenue increases by $3.20
Statistical Significance
- spend_p_value (0.000001): The effect is highly significant (p < 0.05)
- spend_t_stat (40.0): Large t-statistic confirms strong evidence
Model Fit
- r_squared (0.72): Marketing spend explains 72% of revenue variance
- rmse (250000): Typical prediction error is $250K
Step 5: Multiple Regression (1 minute)
Add campaign flag as additional predictor:
SELECT
coefficient[1] as intercept,
coefficient[2] as spend_effect,
coefficient[3] as campaign_effect,
p_value[2] as spend_pvalue,
p_value[3] as campaign_pvalue,
r_squared,
adjusted_r_squared
FROM anofox_statistics_ols(
'marketing_data',
'revenue',
ARRAY['marketing_spend', 'had_campaign']
);
Step 6: Prediction with Intervals (1 minute)
Make predictions with 95% confidence intervals:
SELECT
test_row.marketing_spend,
pred.point_estimate as predicted_revenue,
pred.lower_95 as ci_lower,
pred.upper_95 as ci_upper,
(pred.upper_95 - pred.lower_95) as interval_width
FROM (
SELECT ARRAY[10000.0, 25000.0, 50000.0] as spend_values
) test_data,
LATERAL (
SELECT
marketing_spend,
point_forecast,
lower_95,
upper_95
FROM anofox_statistics_ols_predict_interval(
'marketing_data',
'revenue',
ARRAY['marketing_spend'],
STRUCT_PACK(
marketing_spend := test_data.spend_values[idx]
),
0.95
)
) pred
CROSS JOIN (
SELECT * FROM range(1, 4)
) idx(idx);
Common Variations
Weighted Regression (Heteroscedastic Data)
-- WLS: When variance differs across observations
SELECT * FROM anofox_statistics_wls(
'marketing_data',
'revenue',
ARRAY['marketing_spend'],
'weight_column'
);
Ridge Regression (Multicollinearity)
-- Ridge: Handle correlated predictors
SELECT * FROM anofox_statistics_ridge(
'marketing_data',
'revenue',
ARRAY['marketing_spend', 'had_campaign'],
MAP_CREATE(ARRAY['lambda'], ARRAY['0.1'])
);
Per-Group Analysis
-- Analyze each month separately
SELECT
month_id,
result.coefficients[2] as monthly_spend_effect
FROM marketing_data
GROUP BY month_id
APPLY anofox_statistics_ols_agg(revenue, ARRAY[marketing_spend]);
Troubleshooting
Q: "Insufficient observations" error?
A: Need at least p+1 rows where p = number of predictors
-- Check row count
SELECT COUNT(*) FROM marketing_data;
-- Should be > number of predictors
Q: "Unknown function" error?
A: Extension not loaded in this session
-- Load again
LOAD anofox_statistics;
Q: How do I interpret p-values?
A: p < 0.05 = statistically significant (95% confidence)
- p < 0.01 = highly significant (99% confidence)
- p > 0.05 = not significant
Q: What's the difference between confidence and prediction intervals?
A:
- Confidence: Uncertainty about predicted mean
- Prediction: Wider; includes individual variation
Next Steps
Ready to go deeper? Choose your path:
- Understanding Regression — Learn core concepts
- Basic Workflow — Complete end-to-end example
- Model Types — OLS vs. Ridge vs. WLS
- Reference — Complete API
Complete Script (Copy-Paste Ready)
LOAD anofox_statistics;
CREATE TABLE marketing_data AS
SELECT
month::INT,
(RANDOM() * 50000 + 20000)::DOUBLE as marketing_spend,
(marketing_spend * 3.5 + RANDOM() * 100000)::DOUBLE as revenue
FROM range(1, 48) t(month);
-- OLS regression
SELECT
coefficient[2] as spend_effect,
p_value[2] as p_value,
r_squared
FROM anofox_statistics_ols(
'marketing_data',
'revenue',
ARRAY['marketing_spend']
);