Diagnostics & Validation

Verify regression assumptions and validate model quality.

Regression Assumptions

1. Linearity

Is the relationship linear?

Check: Plot actual vs. predicted

  • Linear pattern ✅
  • Curved pattern ❌ → Add polynomial terms or transform the variables

2. Independence

Are observations independent?

Check: Is the data a time series?

  • No time dependency ✅
  • Time series ❌ → Use RLS (recursive least squares)

3. Homoscedasticity

Is the residual variance constant across x?

SELECT
    y_actual,
    y_predicted,
    residual,
    leverage,
    cooks_distance
FROM anofox_statistics_residual_diagnostics(
    actual_values,
    predicted_values
);

Check: Residual plot

  • Random scatter ✅
  • Funnel pattern (spread grows or shrinks with x) ❌ → Use WLS (weighted least squares)
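A funnel in the residual plot can also be summarized as a single number. The sketch below is plain Python and is not part of the extension; `spread_ratio` is a hypothetical helper that sorts residuals by x and compares the mean squared residual in the upper half of x with that in the lower half. A ratio near 1 is consistent with constant variance, while a ratio far from 1 suggests funneling. Treat it as a rough screen under these assumptions, not a formal test.

```python
def spread_ratio(x, residuals):
    """Ratio of mean squared residuals: upper half of x over lower half of x.

    Hypothetical helper for illustration; not an extension function.
    """
    # Order the residuals by their x value.
    ordered = [e for _, e in sorted(zip(x, residuals), key=lambda pair: pair[0])]
    half = len(ordered) // 2          # with an odd count the middle point is dropped
    low, high = ordered[:half], ordered[-half:]

    def mean_square(values):
        return sum(v * v for v in values) / len(values)

    low_ms = mean_square(low)
    if low_ms == 0.0:
        raise ValueError("lower-half residuals are all zero; ratio is undefined")
    return mean_square(high) / low_ms


# Made-up residuals whose spread grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
print(round(spread_ratio(x, residuals), 2))
```

A ratio this far above 1 is the numeric counterpart of a funnel that opens to the right.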

4. Normality

Are errors normally distributed?

SELECT * FROM anofox_statistics_normality_test(
    residuals,
    0.05
);

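Jarque-Bera Test (compares the skewness and kurtosis of the residuals with those of a normal distribution):

  • p > 0.05 → No evidence against normality ✅
  • p < 0.05 → Normality rejected ❌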
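To make the p-value rule concrete, here is a minimal sketch of the textbook Jarque-Bera statistic in plain Python. It is an illustration under stated assumptions, not the extension's implementation: the function name `jarque_bera` is hypothetical, moments use the divide-by-n convention, and the p-value relies on the asymptotic chi-square distribution with 2 degrees of freedom, which can be inaccurate for small samples.

```python
import math


def jarque_bera(residuals):
    """Return (JB statistic, asymptotic p-value) for a sequence of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments with the population (divide-by-n) convention.
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2           # equals 3 for a normal distribution
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Chi-square with 2 degrees of freedom has survival function exp(-x / 2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Made-up, strongly right-skewed residuals: nineteen zeros and one large value.
jb, p = jarque_bera([0.0] * 19 + [10.0])
print(p < 0.05)  # True: normality is rejected for this sample
```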

5. No Multicollinearity

Are the predictors free of strong linear relationships with each other?

SELECT
    variable,
    vif
FROM anofox_statistics_vif(
    ARRAY['x1', 'x2', 'x3']
);

VIF Interpretation:

  • VIF < 5 → OK ✅
  • VIF 5-10 → Caution ⚠️
  • VIF > 10 → Multicollinearity ❌ → Use Ridge
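For readers who want to see where a VIF comes from, the sketch below computes it in plain Python using a standard identity: the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix, which is the same as 1 / (1 - R²) from regressing predictor j on the others. The helper names (`correlation_matrix`, `invert`, `vif`) are hypothetical and this is not how the extension is necessarily implemented.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length predictor columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


# Made-up predictors with correlation 0.8, so each VIF is 1 / (1 - 0.8**2).
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
print([round(v, 3) for v in vif([x1, x2])])
```

With only two predictors both VIFs are equal; with three or more, each predictor gets its own value.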

Residual Diagnostics

SELECT
    y_actual,
    y_predicted,
    residual,
    leverage,
    cooks_distance,
    is_outlier
FROM anofox_statistics_residual_diagnostics(
    actual_values,
    predicted_values,
    outlier_threshold := 3.0
);

Cook's Distance

Measures how much removing each observation changes coefficients

  • D < 0.5 → Normal observation
  • 0.5 ≤ D ≤ 1.0 → Worth a closer look
  • D > 1.0 → Influential observation → Investigate before removing

Leverage

Measures how far an observation's x values lie from the center of the predictor data

  • High leverage + large residual = Influential point that can distort the fit
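The two measures above can be computed by hand for a one-predictor model, which shows how leverage and residual size combine. This is a minimal sketch in plain Python, assuming simple linear regression with an intercept (so p = 2 parameters); the function name `influence` is hypothetical and the data are made up. It uses the textbook formulas h_i = 1/n + (x_i - x̄)² / Sxx and D_i = e_i² / (p·s²) × h_i / (1 - h_i)².

```python
def influence(x, y):
    """Return (leverage, cooks_distance) lists for y = intercept + slope * x."""
    n = len(x)
    p = 2                                   # intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)   # residual variance estimate
    leverage = [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]
    cooks = [e * e / (p * s2) * h / (1.0 - h) ** 2
             for e, h in zip(residuals, leverage)]
    return leverage, cooks


# Made-up data: the last point is far from the other x values and off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]
leverage, cooks = influence(x, y)
worst = cooks.index(max(cooks))
print(worst, cooks[worst] > 1.0)  # 5 True
```

The leverages always sum to p, so an average value is p/n; the last point here sits far above that average and also has a sizeable residual, which is why its Cook's distance dominates.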

Model Selection Criteria

AIC (Akaike Information Criterion)

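AIC = 2p + n×ln(RSS/n)

where p = number of estimated parameters, n = number of observations, RSS = residual sum of squares (Gaussian-error form, up to an additive constant)

Penalizes model complexity: each additional parameter adds 2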

BIC (Bayesian Information Criterion)

BIC = p×ln(n) + n×ln(RSS/n)

Stronger penalty than AIC once ln(n) > 2, i.e. for n ≥ 8

Rule

Lower AIC/BIC = Better model (compare only models fit to the same data)

SELECT
    model_name,
    r_squared,
    aic,
    bic
FROM model_comparison
ORDER BY aic; -- Choose lowest AIC model
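The two formulas above are easy to evaluate directly, which helps when checking the values in a comparison table. The sketch below is plain Python with hypothetical function names and made-up inputs: a 2-parameter model with RSS = 50 and a 5-parameter model with RSS = 48, both fit to n = 100 observations.

```python
import math


def aic(rss, n, p):
    """AIC = 2p + n * ln(RSS / n), as defined above."""
    return 2 * p + n * math.log(rss / n)


def bic(rss, n, p):
    """BIC = p * ln(n) + n * ln(RSS / n), as defined above."""
    return p * math.log(n) + n * math.log(rss / n)


n = 100
small = {"p": 2, "rss": 50.0}   # made-up simpler model
large = {"p": 5, "rss": 48.0}   # made-up model with three extra parameters

for name, m in (("small", small), ("large", large)):
    print(name, round(aic(m["rss"], n, m["p"]), 2), round(bic(m["rss"], n, m["p"]), 2))
```

In this made-up case the larger model lowers RSS only slightly, so both criteria prefer the smaller model, and BIC prefers it by a wider margin because ln(100) is greater than 2.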

Diagnostic Workflow

1. Fit Model

2. Check Residuals
- Normal? (Jarque-Bera test)
- Homoscedastic? (plot residuals)
- Any outliers? (Cook's distance)

3. Check Multicollinearity
- VIF < 10?

4. If Issues Found
- Heteroscedasticity → Use WLS
- Outliers → Investigate; consider robust regression
- Multicollinearity → Use Ridge
- Non-normal → Check data quality

5. Validate on Test Data
- Compare predictions to holdout
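The final step can be sketched as follows: fit on one part of the data, then measure prediction error on rows the model never saw. This is a minimal illustration in plain Python with a one-predictor least-squares fit; `fit_line` and `rmse` are hypothetical helpers, the data are made up, and the split simply holds out the last two rows (for data without a natural order, a random split is the usual choice).

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def rmse(x, y, intercept, slope):
    """Root mean squared prediction error on the given rows."""
    errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up data; the last two rows are held out as the test set.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]
split = 6
intercept, slope = fit_line(x[:split], y[:split])
train_rmse = rmse(x[:split], y[:split], intercept, slope)
test_rmse = rmse(x[split:], y[split:], intercept, slope)
print(round(train_rmse, 3), round(test_rmse, 3))
```

A test error far above the training error points to overfitting or to a shift between the two samples; similar values suggest the model generalizes to the holdout.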
