Quickstart - 5-Minute Regression

Fit your first statistical regression model in DuckDB in under 5 minutes.

What You'll Learn

✅ Load the Statistics extension
✅ Create sample data
✅ Fit a linear regression model
✅ Interpret coefficients and R²
✅ Test statistical significance
✅ Make predictions with confidence intervals

Time estimate: 5 minutes

Step 1: Load Extension (30 seconds)

LOAD anofox_statistics;

Step 2: Create Sample Data (1 minute)

Let's create a marketing spend vs. revenue dataset:

CREATE TABLE marketing_data AS
SELECT
    month::INT as month_id,
    (RANDOM() * 50000 + 20000)::DOUBLE as marketing_spend,
    (marketing_spend * 3.5 + RANDOM() * 100000)::DOUBLE as revenue,
    CASE WHEN RANDOM() > 0.7 THEN 1 ELSE 0 END as had_campaign
FROM (
    SELECT month FROM range(1, 48)
) t(month);

-- Check data
SELECT * FROM marketing_data LIMIT 5;

Step 3: Run Simple Regression (1 minute)

Fit a model predicting revenue from marketing spend:

SELECT
    coefficient[1] as intercept,
    coefficient[2] as spend_effect,
    std_error[1] as intercept_se,
    std_error[2] as spend_effect_se,
    t_statistic[2] as spend_t_stat,
    p_value[2] as spend_p_value,
    r_squared,
    rmse,
    aic,
    bic
FROM anofox_statistics_ols(
    'marketing_data',
    'revenue',
    ARRAY['marketing_spend']
);

Output:

intercept  | spend_effect | intercept_se | spend_effect_se | spend_t_stat | spend_p_value | r_squared | rmse      | aic       | bic
-----------|--------------|--------------|-----------------|--------------|---------------|-----------|-----------|-----------|----------
50000      | 3.2          | 15000        | 0.08            | 40.0         | 0.000001      | 0.72      | 250000    | 450.2     | 455.1

Step 4: Interpret Results (1 minute)

Coefficients

intercept (50000): Base revenue with $0 marketing spend
spend_effect (3.2): For each $1 increase in marketing spend, revenue increases by $3.20

Statistical Significance

spend_p_value (0.000001): The effect is highly significant (p < 0.05)
spend_t_stat (40.0): Large t-statistic confirms strong evidence

Model Fit

r_squared (0.72): Marketing spend explains 72% of revenue variance
rmse (250000): Typical prediction error is $250K

Step 5: Multiple Regression (1 minute)

Add campaign flag as additional predictor:

SELECT
    coefficient[1] as intercept,
    coefficient[2] as spend_effect,
    coefficient[3] as campaign_effect,
    p_value[2] as spend_pvalue,
    p_value[3] as campaign_pvalue,
    r_squared,
    adjusted_r_squared
FROM anofox_statistics_ols(
    'marketing_data',
    'revenue',
    ARRAY['marketing_spend', 'had_campaign']
);

Step 6: Prediction with Intervals (1 minute)

Make predictions with 95% confidence intervals:

SELECT
    test_row.marketing_spend,
    pred.point_estimate as predicted_revenue,
    pred.lower_95 as ci_lower,
    pred.upper_95 as ci_upper,
    (pred.upper_95 - pred.lower_95) as interval_width
FROM (
    SELECT ARRAY[10000.0, 25000.0, 50000.0] as spend_values
) test_data,
LATERAL (
    SELECT
        marketing_spend,
        point_forecast,
        lower_95,
        upper_95
    FROM anofox_statistics_ols_predict_interval(
        'marketing_data',
        'revenue',
        ARRAY['marketing_spend'],
        STRUCT_PACK(
            marketing_spend := test_data.spend_values[idx]
        ),
        0.95
    )
) pred
CROSS JOIN (
    SELECT * FROM range(1, 4)
) idx(idx);

Common Variations

Weighted Regression (Heteroscedastic Data)

-- WLS: When variance differs across observations
SELECT * FROM anofox_statistics_wls(
    'marketing_data',
    'revenue',
    ARRAY['marketing_spend'],
    'weight_column'
);

Ridge Regression (Multicollinearity)

-- Ridge: Handle correlated predictors
SELECT * FROM anofox_statistics_ridge(
    'marketing_data',
    'revenue',
    ARRAY['marketing_spend', 'had_campaign'],
    MAP_CREATE(ARRAY['lambda'], ARRAY['0.1'])
);

Per-Group Analysis

-- Analyze each month separately
SELECT
    month_id,
    result.coefficients[2] as monthly_spend_effect
FROM marketing_data
GROUP BY month_id
APPLY anofox_statistics_ols_agg(revenue, ARRAY[marketing_spend]);

Troubleshooting

Q: "Insufficient observations" error?

A: Need at least p+1 rows where p = number of predictors

-- Check row count
SELECT COUNT(*) FROM marketing_data;
-- Should be > number of predictors

Q: "Unknown function" error?

A: Extension not loaded in this session

-- Load again
LOAD anofox_statistics;

Q: How do I interpret p-values?

A: p < 0.05 = statistically significant (95% confidence)

p < 0.01 = highly significant (99% confidence)
p > 0.05 = not significant

Q: What's the difference between confidence and prediction intervals?

Confidence: Uncertainty about predicted mean
Prediction: Wider; includes individual variation

Next Steps

Ready to go deeper? Choose your path:

Understanding Regression — Learn core concepts
Basic Workflow — Complete end-to-end example
Model Types — OLS vs. Ridge vs. WLS
Reference — Complete API

Complete Script (Copy-Paste Ready)

LOAD anofox_statistics;

CREATE TABLE marketing_data AS
SELECT
    month::INT,
    (RANDOM() * 50000 + 20000)::DOUBLE as marketing_spend,
    (marketing_spend * 3.5 + RANDOM() * 100000)::DOUBLE as revenue
FROM range(1, 48) t(month);

-- OLS regression
SELECT
    coefficient[2] as spend_effect,
    p_value[2] as p_value,
    r_squared
FROM anofox_statistics_ols(
    'marketing_data',
    'revenue',
    ARRAY['marketing_spend']
);

What You'll Learn​

Step 1: Load Extension (30 seconds)​

Step 2: Create Sample Data (1 minute)​

Step 3: Run Simple Regression (1 minute)​

Step 4: Interpret Results (1 minute)​

Coefficients​

Statistical Significance​

Model Fit​

Step 5: Multiple Regression (1 minute)​

Step 6: Prediction with Intervals (1 minute)​

Common Variations​

Weighted Regression (Heteroscedastic Data)​

Ridge Regression (Multicollinearity)​

Per-Group Analysis​

Troubleshooting​

Q: "Insufficient observations" error?​

Q: "Unknown function" error?​

Q: How do I interpret p-values?​

Q: What's the difference between confidence and prediction intervals?​

Next Steps​

Complete Script (Copy-Paste Ready)​