Skip to main content

Quickstart - 5-Minute Regression

Fit your first statistical regression model in DuckDB in under 5 minutes.

What You'll Learn

  • ✅ Load the Statistics extension
  • ✅ Create sample data
  • ✅ Fit a linear regression model
  • ✅ Interpret coefficients and R²
  • ✅ Test statistical significance
  • ✅ Make predictions with confidence intervals

Time estimate: 5 minutes

Step 1: Load Extension (30 seconds)

LOAD anofox_statistics;

Step 2: Create Sample Data (1 minute)

Let's create a marketing spend vs. revenue dataset:

CREATE TABLE marketing_data AS
SELECT
month::INT as month_id,
(RANDOM() * 50000 + 20000)::DOUBLE as marketing_spend,
(marketing_spend * 3.5 + RANDOM() * 100000)::DOUBLE as revenue,
CASE WHEN RANDOM() > 0.7 THEN 1 ELSE 0 END as had_campaign
FROM (
SELECT month FROM range(1, 48)
) t(month);

-- Check data
SELECT * FROM marketing_data LIMIT 5;

Step 3: Run Simple Regression (1 minute)

Fit a model predicting revenue from marketing spend:

SELECT
coefficient[1] as intercept,
coefficient[2] as spend_effect,
std_error[1] as intercept_se,
std_error[2] as spend_effect_se,
t_statistic[2] as spend_t_stat,
p_value[2] as spend_p_value,
r_squared,
rmse,
aic,
bic
FROM anofox_statistics_ols(
'marketing_data',
'revenue',
ARRAY['marketing_spend']
);

Output:

intercept  | spend_effect | intercept_se | spend_effect_se | spend_t_stat | spend_p_value | r_squared | rmse      | aic       | bic
-----------|--------------|--------------|-----------------|--------------|---------------|-----------|-----------|-----------|----------
50000 | 3.2 | 15000 | 0.08 | 40.0 | 0.000001 | 0.72 | 250000 | 450.2 | 455.1

Step 4: Interpret Results (1 minute)

Coefficients

  • intercept (50000): Base revenue with $0 marketing spend
  • spend_effect (3.2): For each $1 increase in marketing spend, revenue increases by $3.20

Statistical Significance

  • spend_p_value (0.000001): The effect is highly significant (p < 0.05)
  • spend_t_stat (40.0): Large t-statistic confirms strong evidence

Model Fit

  • r_squared (0.72): Marketing spend explains 72% of revenue variance
  • rmse (250000): Typical prediction error is $250K

Step 5: Multiple Regression (1 minute)

Add campaign flag as additional predictor:

SELECT
coefficient[1] as intercept,
coefficient[2] as spend_effect,
coefficient[3] as campaign_effect,
p_value[2] as spend_pvalue,
p_value[3] as campaign_pvalue,
r_squared,
adjusted_r_squared
FROM anofox_statistics_ols(
'marketing_data',
'revenue',
ARRAY['marketing_spend', 'had_campaign']
);

Step 6: Prediction with Intervals (1 minute)

Make predictions with 95% confidence intervals:

SELECT
test_row.marketing_spend,
pred.point_estimate as predicted_revenue,
pred.lower_95 as ci_lower,
pred.upper_95 as ci_upper,
(pred.upper_95 - pred.lower_95) as interval_width
FROM (
SELECT ARRAY[10000.0, 25000.0, 50000.0] as spend_values
) test_data,
LATERAL (
SELECT
marketing_spend,
point_forecast,
lower_95,
upper_95
FROM anofox_statistics_ols_predict_interval(
'marketing_data',
'revenue',
ARRAY['marketing_spend'],
STRUCT_PACK(
marketing_spend := test_data.spend_values[idx]
),
0.95
)
) pred
CROSS JOIN (
SELECT * FROM range(1, 4)
) idx(idx);

Common Variations

Weighted Regression (Heteroscedastic Data)

-- WLS: When variance differs across observations
SELECT * FROM anofox_statistics_wls(
'marketing_data',
'revenue',
ARRAY['marketing_spend'],
'weight_column'
);

Ridge Regression (Multicollinearity)

-- Ridge: Handle correlated predictors
SELECT * FROM anofox_statistics_ridge(
'marketing_data',
'revenue',
ARRAY['marketing_spend', 'had_campaign'],
MAP_CREATE(ARRAY['lambda'], ARRAY['0.1'])
);

Per-Group Analysis

-- Analyze each month separately
SELECT
month_id,
result.coefficients[2] as monthly_spend_effect
FROM marketing_data
GROUP BY month_id
APPLY anofox_statistics_ols_agg(revenue, ARRAY[marketing_spend]);

Troubleshooting

Q: "Insufficient observations" error?

A: Need at least p+1 rows where p = number of predictors

-- Check row count
SELECT COUNT(*) FROM marketing_data;
-- Should be > number of predictors

Q: "Unknown function" error?

A: Extension not loaded in this session

-- Load again
LOAD anofox_statistics;

Q: How do I interpret p-values?

A: p < 0.05 = statistically significant (95% confidence)

  • p < 0.01 = highly significant (99% confidence)
  • p > 0.05 = not significant

Q: What's the difference between confidence and prediction intervals?

A:

  • Confidence: Uncertainty about predicted mean
  • Prediction: Wider; includes individual variation

Next Steps

Ready to go deeper? Choose your path:

  1. Understanding Regression — Learn core concepts
  2. Basic Workflow — Complete end-to-end example
  3. Model Types — OLS vs. Ridge vs. WLS
  4. Reference — Complete API

Complete Script (Copy-Paste Ready)

LOAD anofox_statistics;

CREATE TABLE marketing_data AS
SELECT
month::INT,
(RANDOM() * 50000 + 20000)::DOUBLE as marketing_spend,
(marketing_spend * 3.5 + RANDOM() * 100000)::DOUBLE as revenue
FROM range(1, 48) t(month);

-- OLS regression
SELECT
coefficient[2] as spend_effect,
p_value[2] as p_value,
r_squared
FROM anofox_statistics_ols(
'marketing_data',
'revenue',
ARRAY['marketing_spend']
);
🍪 Cookie Settings