
Understanding Regression

Learn the fundamentals of linear regression and how to interpret statistical results.

What is Regression?

Regression is a statistical method for modeling the relationship between a target variable (y) and one or more predictors (x₁, x₂, ..., xₚ).

The Regression Equation

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
  • y: Target variable (what you're predicting)
  • x₁, x₂, ..., xₚ: Predictor variables
  • β₀: Intercept (y-value when all x=0)
  • βⱼ: Coefficient (marginal effect of xⱼ)
  • ε: Error term (unobserved noise)

Simple Example

Predicting house price from square footage:

Price = $50,000 + $200 × Sq Ft + ε
  • $50,000: Baseline price (intercept; the literal 0 sq ft reading is an extrapolation beyond the data)
  • $200: Price per additional square foot (slope)
  • ε: Individual variations (bad location, needs repairs, etc.)
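The slope and intercept above can be recovered with the closed-form OLS formulas. A minimal Python sketch, using made-up square-footage and price data that follow the equation exactly:

```python
# Hypothetical data constructed so that price = 50,000 + 200 * sqft
sqft = [1000, 1500, 2000, 2500, 3000]
price = [250_000, 350_000, 450_000, 550_000, 650_000]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

# Closed-form OLS: slope = Cov(x, y) / Var(x), intercept = y_bar - slope * x_bar
beta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price)) / \
        sum((x - x_bar) ** 2 for x in sqft)
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)  # → 50000.0 200.0
```

With noisy real-world data the recovered coefficients would not be exact; the ε term absorbs the difference.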

Interpreting Coefficients

β₁ = 3.5 (Marketing Spend → Revenue)

"For each $1 increase in marketing spend, revenue increases by $3.50"

This is the marginal effect while holding other variables constant.

β₂ = 15,000 (Campaign Flag)

"Campaigns generate $15,000 additional revenue on average"

This is a shift up (or down) from the baseline.


Goodness-of-Fit Metrics

R² (Coefficient of Determination)

R² = 1 - (Σ(actual - predicted)²) / Σ(actual - mean)²

Interpretation:

  • R² = 0.72 → "72% of the variance in y is explained by the model"
  • R² = 0.95 → excellent fit in most settings
  • R² = 0.30 → weak fit (though typical in noisy domains such as behavioral data)
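The R² formula above translates directly to code. A short sketch with illustrative actual/predicted values:

```python
# R² = 1 - SS_res / SS_tot, with made-up example values
actual    = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.5, 14.0, 16.5, 17.0]

mean_y = sum(actual) / len(actual)
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
ss_tot = sum((a - mean_y) ** 2 for a in actual)                # total sum of squares

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # → 0.938
```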

RMSE (Root Mean Squared Error)

RMSE = √(Σ(actual - predicted)² / n)

Interpretation:

  • "Typical prediction error is $250,000"
  • Same units as y
  • Lower is better
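RMSE is the same computation without the normalization by total variance. A sketch with illustrative values:

```python
import math

# RMSE = sqrt(mean squared error), made-up example values
actual    = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]

rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
print(round(rmse, 2))  # → 14.14
```

Because the errors are squared before averaging, RMSE weights large misses more heavily than small ones.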

Adjusted R²

Adjusted R² = 1 - [(1 - R²)(n-1)/(n-p-1)]

Penalizes the addition of unnecessary predictors. Use it when comparing models with different numbers of predictors.
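Plugging in the numbers from the worked example further down (R² = 0.85 over 48 months with a single predictor) shows how small the penalty is when n is comfortably larger than p:

```python
# Adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1)
r2, n, p = 0.85, 48, 1   # values taken from the ad-spend example in this page

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))  # → 0.8467
```

The penalty grows as p approaches n, which is exactly when overfitting becomes a risk.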


Regression Assumptions

1. Linearity

The relationship between y and x is linear.

Check: Plot actual vs. predicted values

2. Independence

Observations are independent (no autocorrelation).

Check: Time-series data? Use recursive least squares (RLS) instead.

3. Homoscedasticity

Constant variance of errors across x.

Check: Residual plot should show random scatter

4. Normality

Errors are normally distributed.

Check: Jarque-Bera test (p > 0.05 → no evidence against normality)
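The Jarque-Bera statistic itself is simple to compute from the residuals: JB = (n/6)(S² + (K − 3)²/4), where S is sample skewness and K sample kurtosis. A sketch with made-up residuals:

```python
# Jarque-Bera statistic from scratch; resid values are illustrative only
resid = [-1.2, 0.5, 0.3, -0.4, 0.9, -0.1, 0.6, -0.6]
n = len(resid)

mean = sum(resid) / n
m2 = sum((r - mean) ** 2 for r in resid) / n   # variance
m3 = sum((r - mean) ** 3 for r in resid) / n
m4 = sum((r - mean) ** 4 for r in resid) / n

S = m3 / m2 ** 1.5        # skewness (0 for a symmetric distribution)
K = m4 / m2 ** 2          # kurtosis (3 for a normal distribution)
jb = n / 6 * (S ** 2 + (K - 3) ** 2 / 4)
print(round(jb, 3))
```

Small JB values are consistent with normality; the p-value comes from a chi-squared distribution with 2 degrees of freedom.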

5. No Multicollinearity

Predictors should not be perfectly correlated.

Check: VIF < 10 for each predictor
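VIF for predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing xⱼ on the other predictors. For the two-predictor case R²ⱼ is just the squared correlation between them, which gives a compact sketch (data values are made up):

```python
# Two-predictor VIF: with p = 2, R²_j is the squared correlation between x1 and x2
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 8.0, 9.9]   # nearly collinear with x1 (roughly 2 * x1)

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
cov  = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
var1 = sum((a - m1) ** 2 for a in x1)
var2 = sum((b - m2) ** 2 for b in x2)

r2  = cov ** 2 / (var1 * var2)   # squared correlation
vif = 1 / (1 - r2)
print(vif > 10)  # → True (near-perfect collinearity inflates VIF far past 10)
```

With more than two predictors, each R²ⱼ requires a full auxiliary regression, which is what statistics packages do internally.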


When Regression Works Well

✅ Predict house prices from square footage, bedrooms, location
✅ Estimate demand based on price, advertising, seasonality
✅ Analyze ROI of marketing spend by channel
✅ Model financial risk (portfolio returns vs. market factors)


When Regression Fails

Nonlinear relationships (e.g., exponential growth) → Transform variables (e.g., model log y), add polynomial terms, or use nonlinear models

Outliers (extreme values distort fit) → Use WLS or robust regression

Multicollinearity (correlated predictors) → Use Ridge regression (L2 regularization)

Missing data patterns → Impute or use only complete cases

Categorical targets (classification) → Use logistic regression (future release)


Simple vs. Multiple Regression

Simple Regression (1 predictor)

SELECT * FROM anofox_statistics_ols(
  'data',
  'revenue',
  ARRAY['marketing_spend']
);

Multiple Regression (many predictors)

SELECT * FROM anofox_statistics_ols(
  'data',
  'revenue',
  ARRAY['marketing_spend', 'team_size', 'had_campaign']
);

Real-World Example

Problem: Estimate Profit from Ad Spend

Data: 48 months of marketing spend and profit

Model:

Profit = β₀ + β₁ × Spend + ε

Results:

Coefficient    Value       Interpretation
β₀             $50,000     Base profit (no ads)
β₁             2.5         Each $1 spend → $2.50 profit
p-value        0.00001     Highly significant
R²             0.85        Spend explains 85% of profit variance

Decision: Invest in marketing (ROI = 2.5x)
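Plugging the fitted coefficients from the table into the model gives point predictions directly; the spend figure below is a hypothetical input:

```python
# Fitted model from the table: Profit = 50,000 + 2.5 * Spend
beta0, beta1 = 50_000, 2.5

def predicted_profit(spend):
    """Point prediction for a given ad spend (same units as the training data)."""
    return beta0 + beta1 * spend

print(predicted_profit(20_000))  # → 100000.0
```

Predictions far outside the observed spend range (48 months of historical data) should be treated with caution, since linearity is only verified within the data.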


Next Steps
