Understanding Regression
Learn the fundamentals of linear regression and how to interpret statistical results.
What is Regression?
Regression is a statistical method for modeling the relationship between a target variable (y) and one or more predictors (x₁, x₂, ..., xₚ).
The Regression Equation
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
- y: Target variable (what you're predicting)
- x₁, x₂, ..., xₚ: Predictor variables
- β₀: Intercept (expected value of y when all predictors are 0)
- βⱼ: Coefficient (expected change in y per one-unit increase in xⱼ, holding the other predictors constant)
- ε: Error term (unobserved noise)
Simple Example
Predicting house price from square footage:
Price = $50,000 + $200 × Sq Ft + ε
- $50,000: Baseline price (intercept); formally the predicted price at 0 sq ft, so an extrapolation rather than a real house
- $200: Price per additional square foot (slope)
- ε: Individual variations (bad location, needs repairs, etc.)
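Once fitted, the equation can be applied directly in SQL. A minimal sketch, assuming a hypothetical houses table with a sqft column:

SELECT
    sqft,
    50000 + 200 * sqft AS predicted_price  -- β₀ + β₁ × sqft
FROM houses;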
Interpreting Coefficients
β₁ = 3.5 (Marketing Spend → Revenue)
"For each $1 increase in marketing spend, revenue increases by $3.50"
This is the marginal effect while holding other variables constant.
β₂ = 15,000 (Campaign Flag)
"Campaigns generate $15,000 additional revenue on average"
This is a shift up (or down) from the baseline.
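These interpretations reduce to back-of-the-envelope arithmetic. A small sketch using the example coefficients above (the column aliases are illustrative only):

SELECT
    3.5 * 10000 AS lift_from_10k_extra_spend,  -- β₁ × Δspend = $35,000
    15000       AS lift_from_running_campaign; -- β₂ × (1 − 0) = $15,000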
Goodness-of-Fit Metrics
R² (Coefficient of Determination)
R² = 1 - Σ(actual - predicted)² / Σ(actual - mean)²
Interpretation:
- R² = 0.72: 72% of the variance in y is explained by the model
- R² = 0.95: very strong fit
- R² = 0.30: weak fit (though what counts as "good" is domain-dependent; 0.30 can be respectable for noisy data such as human behavior)
RMSE (Root Mean Squared Error)
RMSE = √(Σ(actual - predicted)² / n)
Interpretation:
- "Typical prediction error is $250,000"
- Same units as y
- Lower is better
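RMSE is equally direct to compute; this sketch reuses the same hypothetical predictions table:

SELECT SQRT(AVG(POWER(actual - predicted, 2))) AS rmse  -- √(Σ(actual − predicted)² / n)
FROM predictions;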
Adjusted R²
Adjusted R² = 1 - [(1 - R²)(n-1)/(n-p-1)]
Penalizes for adding unnecessary predictors. Use when comparing models.
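As a quick worked example (the n, p, and R² values here are purely illustrative): with n = 48 observations, p = 3 predictors, and R² = 0.85, the penalty is small:

SELECT 1 - (1 - 0.85) * (48 - 1) / (48 - 3 - 1) AS adjusted_r2;  -- ≈ 0.840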
Regression Assumptions
1. Linearity
The relationship between y and x is linear.
Check: Plot actual vs. predicted values (points should hug the 45° line; systematic curvature suggests nonlinearity)
2. Independence
Observations are independent (no autocorrelation).
Check: With time-series data, residuals are often autocorrelated; consider RLS instead.
3. Homoscedasticity
Constant variance of errors across x.
Check: Residual plot should show random scatter, with no funnel shape (a query for this appears after this list)
4. Normality
Errors are normally distributed.
Check: Jarque-Bera test (p > 0.05 means no evidence against normality)
5. No Multicollinearity
Predictors should not be perfectly correlated.
Check: VIF < 10 for each predictor
When Regression Works Well
✅ Predict house prices from square footage, bedrooms, location
✅ Estimate demand based on price, advertising, seasonality
✅ Analyze ROI of marketing spend by channel
✅ Model financial risk (portfolio returns vs. market factors)
When Regression Fails
❌ Nonlinear relationships (e.g., exponential growth) → Transform variables (e.g., take logs; see the sketch after this list) or use nonlinear models
❌ Outliers (extreme values distort the fit) → Use robust regression, or WLS to downweight suspect observations
❌ Multicollinearity (correlated predictors) → Use Ridge regression (L2 regularization)
❌ Missing data patterns → Impute or use only complete cases
❌ Categorical targets (classification) → Use logistic regression (future release)
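For the nonlinearity case, one common fix stays within OLS: regress on a transformed predictor. A sketch using the function signature documented below; the log_marketing_spend column is assumed to be precomputed (e.g., as LN(marketing_spend)):

SELECT * FROM anofox_statistics_ols(
    'data',
    'revenue',
    ARRAY['log_marketing_spend']  -- hypothetical precomputed log-transformed column
);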
Simple vs. Multiple Regression
Simple Regression (1 predictor)
SELECT * FROM anofox_statistics_ols(
    'data',                   -- source table
    'revenue',                -- target variable (y)
    ARRAY['marketing_spend']  -- predictor column (x)
);
Multiple Regression (many predictors)
SELECT * FROM anofox_statistics_ols(
    'data',                                                -- source table
    'revenue',                                             -- target variable (y)
    ARRAY['marketing_spend', 'team_size', 'had_campaign']  -- predictor columns
);
Real-World Example
Problem: Estimate Profit from Ad Spend
Data: 48 months of marketing spend and profit
Model:
Profit = β₀ + β₁ × Spend + ε
Results:
| Coefficient | Value | Interpretation |
|---|---|---|
| β₀ | $50,000 | Base profit (no ads) |
| β₁ | 2.5 | Each $1 of spend → $2.50 of profit |
| p-value | 0.00001 | Slope is highly significant |
| R² | 0.85 | Spend explains 85% of profit variance |
Decision: Invest in marketing (ROI = 2.5x)
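With the function documented above, this model is a one-line call. A sketch; the table and column names (marketing_monthly, profit, spend) are hypothetical:

SELECT * FROM anofox_statistics_ols(
    'marketing_monthly',  -- hypothetical table: 48 months of spend and profit
    'profit',
    ARRAY['spend']
);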
Next Steps
- Model Types — OLS vs Ridge vs WLS
- Inference & Testing — Statistical significance
- Basic Workflow — Hands-on example