AnoFox Statistics Extension
In-Database Regression Analysis & Statistical Inference in SQL
Build predictive models directly in DuckDB with 5 regression types, comprehensive diagnostics, and hypothesis testing—zero Python overhead.
30-Second Example
-- Load the extension
LOAD anofox_statistics;

-- Linear regression of revenue on two predictors
SELECT
    coefficient,
    std_error,
    t_statistic,
    p_value,
    r_squared,
    rmse
FROM anofox_statistics_ols(
    'sales_data',
    'revenue',
    ARRAY['marketing_spend', 'team_size']
);
Output: Coefficients with statistical significance tests, goodness-of-fit metrics, and standard errors.
What's Included
| Component | Coverage | Highlights |
|---|---|---|
| Regression Models | 5 types | OLS, Ridge, WLS, RLS, Elastic Net |
| Inference | 10+ functions | t-tests, p-values, confidence intervals, prediction intervals |
| Diagnostics | 5 types | Residuals, VIF, normality tests, outlier detection, AIC/BIC |
| Aggregates | GROUP BY & OVER | Per-group models, rolling regression, expanding windows |
| Utilities | Basic metrics | R², RMSE, MSE, information criteria |
Function Finder
What I Want to Do → Function to Use
| Goal | Function | Guide |
|---|---|---|
| Fit simple linear regression | anofox_statistics_ols | Basic Workflow |
| Test coefficient significance | anofox_statistics_ols_inference | Inference & Testing |
| Detect multicollinearity | anofox_statistics_vif | Handling Multicollinearity |
| Make predictions with uncertainty | anofox_statistics_ols_predict_interval | Prediction Intervals |
| Analyze per segment/group | anofox_statistics_ols_agg | Grouped Analysis |
| Rolling regression over time | anofox_statistics_ols_agg OVER (...) | Rolling Regression |
| Handle correlated predictors | anofox_statistics_ridge | Model Types |
| Adaptive online learning | anofox_statistics_rls | Model Types |
| Select best model | anofox_statistics_information_criteria | Model Selection |
| Check regression assumptions | Diagnostics functions | Diagnostics |
Why AnoFox Statistics?
Native In-Database Processing
- No data export/import cycles
- Direct SQL integration with your data pipeline
- Zero Python dependency overhead
Production-Ready Algorithms
- OLS: The statistical workhorse (best linear unbiased estimator under the Gauss-Markov assumptions)
- Ridge: Handle multicollinearity with L2 regularization
- WLS: Handle heteroscedastic data by weighting observations
- RLS: Online adaptive learning with a forgetting factor
- Elastic Net: Combined L1+L2 regularization for feature selection
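
To make the effect of L2 regularization concrete, here is a minimal plain-Python sketch of ridge shrinkage for a single predictor with an unpenalized intercept. It illustrates the underlying math only and is not the extension's implementation; the function name `ridge_slope`, the penalty values, and the data are hypothetical, and the extension may scale its penalty differently.

```python
from statistics import fmean

def ridge_slope(x, y, lam):
    """Slope minimizing sum((y - b0 - b1*x)^2) + lam * b1^2 (intercept unpenalized)."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # The penalty adds lam to the denominator, pulling the slope toward zero
    return s_xy / (s_xx + lam)

# Hypothetical data generated from y = 2 + 3x with no noise
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 11.0, 14.0]
ols_slope = ridge_slope(x, y, lam=0.0)     # lam = 0 recovers ordinary least squares
shrunk_slope = ridge_slope(x, y, lam=5.0)  # a positive penalty shrinks the slope
```

With `lam=0.0` the estimate equals the OLS slope; increasing `lam` trades a little bias for lower variance, which is what stabilizes coefficients when predictors are correlated.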
Comprehensive Statistical Inference
- Coefficient significance testing (t-tests, p-values)
- Confidence intervals for parameters
- Prediction intervals (individual vs. mean)
- Hypothesis testing framework built-in
- Multiple comparison corrections
Enterprise Diagnostics
- Residual analysis (leverage, Cook's distance)
- Multicollinearity detection (VIF)
- Normality testing (Jarque-Bera)
- Information criteria (AIC, BIC)
- Per-group analysis via GROUP BY
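
The information criteria listed above balance goodness of fit against model size. The sketch below uses one common Gaussian form (dropping additive constants) to show the idea; whether the extension uses exactly this form is an assumption, and `aic_bic` and the numbers are hypothetical.

```python
from math import log

def aic_bic(rss, n, k):
    """AIC and BIC for a Gaussian linear model, up to an additive constant.

    rss: residual sum of squares, n: observations, k: estimated parameters.
    """
    fit_term = n * log(rss / n)
    aic = fit_term + 2 * k
    bic = fit_term + k * log(n)
    return aic, bic

# Hypothetical comparison: a third parameter lowers the RSS only slightly
aic_small, bic_small = aic_bic(rss=50.0, n=100, k=2)
aic_large, bic_large = aic_bic(rss=49.5, n=100, k=3)
```

Lower values are better. Here the small gain in fit does not offset the penalty for the extra parameter, so both criteria favor the smaller model.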
Getting Started Paths
👤 New to Statistics?
Start with Understanding Regression → Quickstart → Basic Workflow
👨‍💼 Business Analyst
Quickstart → Basic Workflow → Production Deployment
👨‍🔬 Data Scientist
Model Types → Grouped Analysis → Advanced Guides
🏭 Production Focus
Installation → Quickstart → Production Deployment
Key Concepts at a Glance
Linear Regression Equation
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
- β₀: Intercept
- βⱼ: Coefficient (marginal effect of xⱼ, holding the other predictors fixed)
- ε: Error term (unobserved noise)
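
For a single predictor, the equation above reduces to y = β₀ + β₁x + ε, and the least-squares estimates have a closed form. The following plain-Python sketch illustrates that math; it is not the extension's implementation, and the name `fit_simple_ols` and the sample data are made up for the example.

```python
from statistics import fmean

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx                # estimate of β₁
    intercept = y_bar - slope * x_bar  # estimate of β₀
    return intercept, slope

# Hypothetical data generated from y = 2 + 3x with no noise
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 11.0, 14.0]
intercept, slope = fit_simple_ols(x, y)
```

Because the sample data contain no noise, the fit recovers the generating intercept and slope; with real data the estimates carry sampling error, which is what the standard errors quantify.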
R² (Coefficient of Determination)
Proportion of the variance in y explained by the model: 0 (no better than predicting the mean) to 1 (perfect fit)
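
As a small illustration of the definition, R² can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is plain Python with hypothetical names and data, intended only to show the arithmetic.

```python
from statistics import fmean

def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    y_bar = fmean(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)       # predictions match every observation
mean_only = r_squared(y_true, [2.5] * 4)  # always predicting the mean of y
```

Exact predictions give an R² of 1, and always predicting the mean gives 0. Predictions that are worse than the mean produce a negative value, which can happen on held-out data.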
p-Value
Probability of observing a test statistic at least as extreme as the one computed, assuming the null hypothesis (H₀: β = 0) is true. By convention, p < 0.05 is treated as statistically significant.
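
The sketch below shows how a two-sided p-value follows from a coefficient and its standard error. To stay within the Python standard library it uses the normal distribution as a large-sample approximation to the t distribution; that substitution is an assumption of this example, and the function name and inputs are hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value_normal(coefficient, std_error):
    """Test statistic and two-sided p-value for H0: coefficient = 0.

    Uses the standard normal in place of the t distribution, which is
    only reasonable when the residual degrees of freedom are large.
    """
    t_stat = coefficient / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value

# Hypothetical estimate: coefficient 1.5 with standard error 0.5
t_stat, p_value = two_sided_p_value_normal(coefficient=1.5, std_error=0.5)
```

A coefficient three standard errors from zero yields a p-value well below 0.05. With small samples the t distribution has heavier tails, so the exact p-value would be somewhat larger than this approximation.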
Prediction Intervals vs. Confidence Intervals
- Confidence Interval: Uncertainty about the mean response at given predictor values
- Prediction Interval: Wider; also includes the variation of an individual observation
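
The difference between the two intervals comes down to one extra term in the standard error. This plain-Python sketch for a single predictor computes both standard errors at a new point; each interval is then the fitted value plus or minus a t critical value (n - 2 degrees of freedom) times the corresponding standard error. The function name and data are hypothetical.

```python
from math import sqrt
from statistics import fmean

def interval_std_errors(x, y, x_new):
    """Standard errors behind the confidence and prediction intervals at x_new."""
    n = len(x)
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(ss_res / (n - 2))
    leverage = 1.0 / n + (x_new - x_bar) ** 2 / s_xx
    se_mean = s * sqrt(leverage)        # uncertainty in the fitted mean
    se_pred = s * sqrt(1.0 + leverage)  # adds the noise of one new observation
    return se_mean, se_pred

# Hypothetical noisy data around y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
se_mean, se_pred = interval_std_errors(x, y, x_new=3.0)
```

The prediction standard error is always the larger of the two, and it does not shrink to zero as the sample grows, because the noise of a single new observation remains.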
Multicollinearity
Correlated predictors inflate standard errors and weaken inference. A VIF above 10 is a common rule of thumb for problematic multicollinearity.
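
For predictor j, VIF is 1 / (1 - Rⱼ²), where Rⱼ² comes from regressing xⱼ on the other predictors. With exactly two predictors, Rⱼ² is the squared correlation between them, which the following plain-Python sketch uses. The function name and data are hypothetical and only illustrate the calculation.

```python
from statistics import fmean

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    m1, m2 = fmean(x1), fmean(x2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    var1 = sum((a - m1) ** 2 for a in x1)
    var2 = sum((b - m2) ** 2 for b in x2)
    r_sq = cov ** 2 / (var1 * var2)  # squared Pearson correlation
    return 1.0 / (1.0 - r_sq)

x1 = [1.0, 2.0, 3.0, 4.0]
uncorrelated = vif_two_predictors(x1, [1.0, -1.0, -1.0, 1.0])
near_duplicate = vif_two_predictors(x1, [1.1, 1.9, 3.2, 3.9])
```

Uncorrelated predictors give a VIF of 1 (no inflation), while a near-duplicate predictor pushes the VIF far above the rule-of-thumb threshold of 10.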
Next Steps
- Installation — Get the extension running
- Quickstart — Fit your first model in 5 minutes
- Concepts — Understand regression fundamentals
- Function Reference — Complete API documentation
- GitHub Repository — Source code and discussions
Key Takeaways
- ✅ 5 regression models for different data conditions (multicollinearity, heteroscedasticity, streaming updates)
- ✅ Full statistical inference (tests, intervals, significance)
- ✅ Comprehensive diagnostics (residuals, VIF, normality)
- ✅ GROUP BY and window functions for complex analyses
- ✅ Production-grade in-database processing
- ✅ Zero Python/R integration friction