Understanding Regression
Learn the fundamentals of linear regression and how to interpret statistical results.
What is Regression?
Regression is a statistical method for modeling the relationship between a target variable (y) and one or more predictors (x₁, x₂, ..., xₚ).
The Regression Equation
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
- y: Target variable (what you're predicting)
- x₁, x₂, ..., xₚ: Predictor variables
- β₀: Intercept (expected value of y when all predictors are 0)
- βⱼ: Coefficient (expected change in y per one-unit increase in xⱼ, holding the other predictors constant)
- ε: Error term (unobserved noise)
Simple Example
Predicting house price from square footage:
Price = $50,000 + $200 × Sq Ft + ε
- $50,000: Baseline price (intercept); formally the predicted price at 0 sq ft, so an extrapolation rather than a real house
- $200: Price per additional square foot (slope)
- ε: Individual variations (bad location, needs repairs, etc.)
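Once fitted, the equation can be applied directly in SQL. A minimal sketch, assuming a hypothetical houses table with a sqft column:

SELECT
    sqft,
    50000 + 200 * sqft AS predicted_price  -- β₀ + β₁ × sqft
FROM houses;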
Interpreting Coefficients
β₁ = 3.5 (Marketing Spend → Revenue)
"For each $1 increase in marketing spend, revenue increases by $3.50"
This is the marginal effect while holding other variables constant.
β₂ = 15,000 (Campaign Flag)
"Campaigns generate $15,000 additional revenue on average"
This is a shift up (or down) from the baseline.
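These interpretations reduce to back-of-the-envelope arithmetic. A small sketch using the example coefficients above (the column aliases are illustrative only):

SELECT
    3.5 * 10000 AS lift_from_10k_extra_spend,  -- β₁ × Δspend = $35,000
    15000       AS lift_from_running_campaign; -- β₂ × (1 − 0) = $15,000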
Goodness-of-Fit Metrics
R² (Coefficient of Determination)
R² = 1 - Σ(actual - predicted)² / Σ(actual - mean)²
Interpretation:
- R² = 0.72: 72% of the variance in y is explained by the model
- R² = 0.95: very strong fit
- R² = 0.30: weak fit (though what counts as "good" is domain-dependent; 0.30 can be respectable for noisy data such as human behavior)
RMSE (Root Mean Squared Error)
RMSE = √(Σ(actual - predicted)² / n)
Interpretation:
- "Typical prediction error is $250,000"
- Same units as y
- Lower is better
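RMSE is equally direct to compute; this sketch reuses the same hypothetical predictions table:

SELECT SQRT(AVG(POWER(actual - predicted, 2))) AS rmse  -- √(Σ(actual − predicted)² / n)
FROM predictions;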
Adjusted R²
Adjusted R² = 1 - [(1 - R²)(n-1)/(n-p-1)]
Penalizes for adding unnecessary predictors. Use when comparing models.
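As a quick worked example (the n, p, and R² values here are purely illustrative): with n = 48 observations, p = 3 predictors, and R² = 0.85, the penalty is small:

SELECT 1 - (1 - 0.85) * (48 - 1) / (48 - 3 - 1) AS adjusted_r2;  -- ≈ 0.840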
Regression Assumptions
1. Linearity
The relationship between y and x is linear.
Check: Plot actual vs. predicted values (points should hug the 45° line; systematic curvature suggests nonlinearity)
2. Independence
Observations are independent (no autocorrelation).
Check: With time-series data, residuals are often autocorrelated; consider RLS instead.
3. Homoscedasticity
Constant variance of errors across x.
Check: Residual plot should show random scatter, with no funnel shape (a query for this appears after this list)
4. Normality
Errors are normally distributed.
Check: Jarque-Bera test (p > 0.05 means no evidence against normality)
5. No Multicollinearity
Predictors should not be perfectly correlated.
Check: VIF < 10 for each predictor
When Regression Works Well
✅ Predict house prices from square footage, bedrooms, location
✅ Estimate demand based on price, advertising, seasonality
✅ Analyze ROI of marketing spend by channel
✅ Model financial risk (portfolio returns vs. market factors)
When Regression Fails
❌ Nonlinear relationships (e.g., exponential growth) → Transform variables (e.g., take logs; see the sketch after this list) or use nonlinear models
❌ Outliers (extreme values distort the fit) → Use robust regression, or WLS to downweight suspect observations
❌ Multicollinearity (correlated predictors) → Use Ridge regression (L2 regularization)
❌ Missing data patterns → Impute or use only complete cases
❌ Categorical targets (classification) → Use logistic regression (future release)
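For the nonlinearity case, one common fix stays within OLS: regress on a transformed predictor. A sketch using the function signature documented below; the log_marketing_spend column is assumed to be precomputed (e.g., as LN(marketing_spend)):

SELECT * FROM anofox_statistics_ols(
    'data',
    'revenue',
    ARRAY['log_marketing_spend']  -- hypothetical precomputed log-transformed column
);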
Simple vs. Multiple Regression
Simple Regression (1 predictor)
SELECT * FROM anofox_statistics_ols(
    'data',                   -- source table
    'revenue',                -- target variable (y)
    ARRAY['marketing_spend']  -- predictor column (x)
);
Multiple Regression (many predictors)
SELECT * FROM anofox_statistics_ols(
    'data',                                                -- source table
    'revenue',                                             -- target variable (y)
    ARRAY['marketing_spend', 'team_size', 'had_campaign']  -- predictor columns
);
Real-World Example
Problem: Estimate Profit from Ad Spend
Data: 48 months of marketing spend and profit
Model:
Profit = β₀ + β₁ × Spend + ε
Results:
| Coefficient | Value | Interpretation |
|---|---|---|
| β₀ | $50,000 | Base profit (no ads) |
| β₁ | 2.5 | Each $1 of spend → $2.50 of profit |
| p-value | 0.00001 | Slope is highly significant |
| R² | 0.85 | Spend explains 85% of profit variance |
Decision: Invest in marketing (ROI = 2.5x)
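With the function documented above, this model is a one-line call. A sketch; the table and column names (marketing_monthly, profit, spend) are hypothetical:

SELECT * FROM anofox_statistics_ols(
    'marketing_monthly',  -- hypothetical table: 48 months of spend and profit
    'profit',
    ARRAY['spend']
);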
Next Steps
- Model Types — OLS vs Ridge vs WLS
- Inference & Testing — Statistical significance
- Basic Workflow — Hands-on example