Diagnostics & Validation
Verify regression assumptions and validate model quality.
Regression Assumptions
1. Linearity
Is the relationship linear?
Check: Plot actual vs. predicted
- Linear pattern ✅
- Curved pattern ❌ → Add polynomial terms (with Ridge if the added terms are strongly correlated) or transform the variables
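The curved-pattern case can be illustrated with a small sketch. It is written in plain Python rather than SQL and makes no claim about how the extension works; `fit_line` and `residuals` are hypothetical helper names, and the data is made up. A straight line fitted to curved data leaves residuals that are positive at both ends and negative in the middle.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def residuals(xs, ys):
    """Actual minus predicted for a straight-line fit."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical curved data: y = x^2
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

# A U-shaped sign pattern (+, -, -, -, +) signals a missed curve.
print(residuals(xs, ys))
```

A systematic sign pattern like this one, rather than random scatter, is what the "curved pattern" check is looking for.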
2. Independence
Are observations independent?
Check: Is the data a time series?
- No time dependency ✅
- Time-ordered data ❌ → Use RLS (recursive least squares)
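When the data is ordered in time, a numeric check can supplement the yes/no question above. The sketch below computes the Durbin-Watson statistic on residuals. This statistic is not mentioned in the source and is offered only as one common option; `durbin_watson` is a hypothetical helper name. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals, listed in time order.
alternating = [1, -1, 1, -1]   # each residual flips sign: negative autocorrelation
persistent = [1, 1, 1, 1]      # each residual repeats: positive autocorrelation

print(durbin_watson(alternating))  # well above 2
print(durbin_watson(persistent))   # well below 2
```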
3. Homoscedasticity
Is the residual variance constant across x?
SELECT
y_actual,
y_predicted,
residual,
leverage,
cooks_distance
FROM anofox_statistics_residual_diagnostics(
actual_values,
predicted_values
);
Check: Plot residuals against predicted values
- Random scatter ✅
- Funnel pattern (spread grows or shrinks) ❌ → Use WLS (weighted least squares)
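A funnel shape can also be spotted numerically. The following is a rough sketch, not the extension's method: it sorts residuals by x, splits them into a lower and an upper half, and compares the mean squared residual of each half. The helper name `spread_ratio` and the data are hypothetical. A ratio far from 1 hints at non-constant variance.

```python
def spread_ratio(xs, residuals):
    """Mean squared residual in the upper half of x divided by the lower half."""
    ordered = [e for _, e in sorted(zip(xs, residuals), key=lambda pair: pair[0])]
    half = len(ordered) // 2          # the middle point is dropped when n is odd
    lower = ordered[:half]
    upper = ordered[-half:]
    lower_ms = sum(e ** 2 for e in lower) / half
    upper_ms = sum(e ** 2 for e in upper) / half
    return upper_ms / lower_ms


# Hypothetical residuals whose spread grows with x.
xs = [1, 2, 3, 4]
res = [1, -1, 3, -3]
print(spread_ratio(xs, res))  # much larger than 1: spread widens as x grows
```

This is only a screening heuristic; a residual plot remains the primary check.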
4. Normality
Are errors normally distributed?
SELECT * FROM anofox_statistics_normality_test(
residuals,
0.05
);
Jarque-Bera Test (at significance level 0.05):
- p ≥ 0.05 → No evidence against normality ✅
- p < 0.05 → Normality rejected ❌
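For readers who want to see what the test measures, here is a minimal sketch of the textbook Jarque-Bera statistic in plain Python. It is an illustration under the usual definition and may differ from what the extension computes; `jarque_bera` is a hypothetical helper name. The statistic combines sample skewness and kurtosis, and the p-value uses the chi-square distribution with 2 degrees of freedom, whose upper tail probability is exp(-JB/2).

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n   # variance (biased)
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2                            # equals 3 for a normal distribution
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-jb / 2)                        # chi-square upper tail, 2 df
    return jb, p_value


# Hypothetical symmetric residuals.
jb, p = jarque_bera([-2, -1, 0, 1, 2])
print(jb, p)
```

The chi-square approximation is only reliable for reasonably large samples, so results on a handful of residuals should be read with caution.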
5. No Multicollinearity
Are the predictors free of strong linear relationships with each other?
SELECT
variable,
vif
FROM anofox_statistics_vif(
ARRAY['x1', 'x2', 'x3']
);
VIF Interpretation:
- VIF < 5 → OK ✅
- VIF 5-10 → Caution ⚠️
- VIF > 10 → Multicollinearity ❌ → Use Ridge
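To make the thresholds concrete, the sketch below computes VIF for the simplest case of exactly two predictors, where VIF = 1 / (1 - r²) and r is the Pearson correlation between them. With more predictors, r² is replaced by the R² from regressing one predictor on all the others; that general case is not shown. `pearson_r` and `vif_two_predictors` are hypothetical helper names, and the data is made up.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    a_mean = sum(a) / n
    b_mean = sum(b) / n
    sab = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    saa = sum((x - a_mean) ** 2 for x in a)
    sbb = sum((y - b_mean) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r_squared = pearson_r(x1, x2) ** 2
    if r_squared >= 1.0:
        return float("inf")           # perfectly collinear predictors
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictors with correlation 0.8.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(vif_two_predictors(x1, x2))  # below 5, so within the "OK" band
```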
Residual Diagnostics
SELECT
y_actual,
y_predicted,
residual,
leverage,
cooks_distance,
is_outlier
FROM anofox_statistics_residual_diagnostics(
actual_values,
predicted_values,
outlier_threshold := 3.0
);
Cook's Distance
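Measures how much the fitted coefficients change when an observation is removed.
- D < 0.5 → Normal observation
- D between 0.5 and 1.0 → Worth a closer look
- D > 1.0 → Influential observation → Investigate, and consider removing it if it is an error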
Leverage
Measures how far an observation's x values lie from the center of the predictor data.
- High leverage + large residual = Problem (the point can pull the fit toward itself)
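The two measures can be sketched for a one-predictor model with an intercept, using the standard formulas: leverage h_i = 1/n + (x_i - x̄)² / Sxx, and Cook's distance D_i = e_i² / (p × s²) × h_i / (1 - h_i)², with p = 2 and s² = RSS / (n - p). This is a plain Python illustration with made-up data, not the extension's implementation; `leverage_and_cooks` is a hypothetical helper name.

```python
def leverage_and_cooks(xs, ys):
    """Return (leverages, Cook's distances) for a straight-line OLS fit."""
    n = len(xs)
    p = 2                                   # intercept + slope
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    s2 = sum(e ** 2 for e in resid) / (n - p)          # residual variance estimate
    lev = [1 / n + (x - x_mean) ** 2 / sxx for x in xs]
    cooks = [e ** 2 / (p * s2) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]
    return lev, cooks


# Hypothetical data: the last point sits far from the other x values.
xs = [1, 2, 3, 4, 10]
ys = [1, 2, 3, 4, 20]
lev, cooks = leverage_and_cooks(xs, ys)
print(lev)    # the last point has by far the highest leverage
print(cooks)  # and by far the largest Cook's distance
```

In this example the far-out point has only a moderate residual because it drags the line toward itself, yet its Cook's distance is well above 1. That is why leverage and residual size are best read together.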
Model Selection Criteria
AIC (Akaike Information Criterion)
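AIC = 2p + n×ln(RSS/n)
where p is the number of estimated parameters, n is the number of observations, and RSS is the residual sum of squares.
Penalizes model complexity through the 2p term
BIC (Bayesian Information Criterion)
BIC = p×ln(n) + n×ln(RSS/n)
Stronger penalty than AIC once n ≥ 8, because ln(n) then exceeds 2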
Rule
Lower AIC/BIC = Better model (compare only models fitted to the same data)
SELECT
model_name,
r_squared,
aic,
bic
FROM model_comparison
ORDER BY aic; -- Choose lowest AIC model
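The two formulas above translate directly into code. The sketch below is plain Python with hypothetical function names and made-up numbers; it shows how a more complex model with a slightly lower RSS can still lose on both criteria.

```python
import math


def aic(rss, n, p):
    """AIC = 2p + n*ln(RSS/n), as defined above."""
    return 2 * p + n * math.log(rss / n)


def bic(rss, n, p):
    """BIC = p*ln(n) + n*ln(RSS/n), as defined above."""
    return p * math.log(n) + n * math.log(rss / n)


# Hypothetical comparison on the same 50 observations.
n = 50
simple_model = {"rss": 120.0, "p": 2}
complex_model = {"rss": 115.0, "p": 5}

for name, m in [("simple", simple_model), ("complex", complex_model)]:
    print(name, aic(m["rss"], n, m["p"]), bic(m["rss"], n, m["p"]))
# The simple model scores lower on both: the small drop in RSS
# does not pay for three extra parameters.
```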
Diagnostic Workflow
1. Fit Model
↓
2. Check Residuals
- Normal? (Jarque-Bera test)
- Homoscedastic? (plot residuals)
- Any outliers? (Cook's distance)
↓
3. Check Multicollinearity
- VIF < 10?
↓
4. If Issues Found
- Outliers → Investigate, then use robust regression or down-weight with WLS
- Heteroscedastic → Use WLS
- Multicollinearity → Use Ridge
- Non-normal → Check data quality, or transform the response
↓
5. Validate on Test Data
- Compare predictions to holdout
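Step 5 of the workflow can be sketched as follows: fit on training data only, then measure the error on data the model never saw. This is a plain Python illustration for a one-predictor model; `fit_line` and `holdout_rmse` are hypothetical helper names, RMSE is one reasonable error metric among several, and the data is made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def holdout_rmse(train_x, train_y, test_x, test_y):
    """Fit on the training split, return root mean squared error on the test split."""
    intercept, slope = fit_line(train_x, train_y)
    errors = [y - (intercept + slope * x) for x, y in zip(test_x, test_y)]
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


# Hypothetical split: the training data follows y = 1 + 2x exactly.
train_x, train_y = [0, 1, 2, 3], [1, 3, 5, 7]
test_x, test_y = [4, 5], [10, 10]
print(holdout_rmse(train_x, train_y, test_x, test_y))
```

A holdout error much larger than the training error suggests the model does not generalize, even if every assumption check passed.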
Next Steps
- Basic Workflow — End-to-end example
- Handling Multicollinearity — Ridge regression
- Model Selection — Comparing models