Anofox Statistics

Anofox Statistics delivers 9 regression model types, 30+ hypothesis tests, and a complete diagnostics toolkit, all running natively inside DuckDB as a C++ extension. From OLS and Ridge to Poisson GLM and the Augmented Linear Model (ALM) with 24 error distributions, every statistical method executes in-database with zero data movement. The extension supports 4 SQL integration patterns (scalar fit, GROUP BY aggregates, window functions, and batch predict), so a single query can fit separate models per region, product line, or customer segment.

What is Anofox Statistics?

Anofox Statistics is a native DuckDB extension that brings professional-grade statistical analysis directly to your database. Statistical inference refers to the process of drawing conclusions about populations from sample data, and this extension makes it accessible through pure SQL. Build predictive models, run hypothesis tests, and validate assumptions - no Python overhead, no data export/import cycles.

Key Features

  • Regression Analysis - OLS, Ridge, WLS, RLS, Elastic Net with full inference
  • Hypothesis Testing - 30+ tests including t-test, ANOVA, chi-square, correlation
  • Model Diagnostics - VIF, residual analysis, normality tests, information criteria
  • Specialized Models - Poisson GLM, ALM (24 distributions), constrained optimization
  • Aggregates & Windows - GROUP BY and OVER patterns for rolling/expanding analysis

| Documentation | Description |
| --- | --- |
| Installation | Setup and prerequisites |
| Function Finder | Find the right function for your task |
| Regression | OLS, Ridge, WLS, RLS, Elastic Net, GLM, ALM, BLS, NNLS |
| Demand Analysis | AID classification and anomaly detection |
| Hypothesis Tests | Parametric, nonparametric, correlation |
| Diagnostics | VIF, residuals, normality, model selection |
| DuckDB Patterns | Aggregates, windows, helpers |

Basic Usage

-- Load the extension
LOAD anofox_statistics;

-- Fit a linear regression model
SELECT
    (model).coefficients,
    (model).r_squared,
    (model).p_values
FROM (
    SELECT anofox_stats_ols_fit_agg(
        revenue,
        [marketing_spend, team_size]
    ) AS model
    FROM sales_data
);

This query returns the fitted coefficients, the R-squared goodness-of-fit metric, and a p-value for each coefficient, all computed in-database. The fitted model also carries standard errors alongside these significance tests.

The hypothesis testing suite spans 4 categories: parametric tests (t-test, ANOVA, proportion tests, equivalence TOST), nonparametric tests (Mann-Whitney U, Kruskal-Wallis, Wilcoxon, permutation tests with up to 10,000 iterations), correlation analysis (Pearson, Spearman, Kendall, distance correlation, ICC), and categorical tests (chi-square, Fisher exact, G-test, McNemar, Cohen's Kappa). Each test returns structured output with test statistics, p-values, confidence intervals, and effect sizes -- ready for downstream decision logic in SQL.

Frequently Asked Questions

How does Anofox Statistics differ from running Python statsmodels or R?

Anofox Statistics executes natively inside DuckDB as a C++ extension, so your data never leaves the database. There is no serialization overhead, no Python GIL, and no need to export data to CSV or Parquet for analysis. For large datasets, this can be orders of magnitude faster than pulling data into Python or R.

Can I fit separate models per group (e.g., per region or product)?

Yes. Every regression function has an _agg variant that works with GROUP BY. For example, anofox_stats_ols_fit_agg(revenue, [price]) ... GROUP BY region fits a separate OLS model for each region in a single query. Window variants (_fit_predict) support rolling and expanding models using OVER (...).
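As a minimal sketch of the per-group pattern described above, the query below fits one OLS model per region. It assumes the extension is already loaded and that sales_data has region, revenue, and price columns (a hypothetical schema borrowed from the examples on this page). The struct fields are the same ones used under Basic Usage.

```sql
-- Sketch only: assumes LOAD anofox_statistics; has already run and that
-- sales_data(region, revenue, price) exists (hypothetical schema).
SELECT
    region,
    (model).coefficients,
    (model).r_squared
FROM (
    SELECT
        region,
        -- One model is fitted per GROUP BY key
        anofox_stats_ols_fit_agg(revenue, [price]) AS model
    FROM sales_data
    GROUP BY region
)
ORDER BY region;
```

Each output row holds the coefficients and R-squared for one region, so the result can be joined or filtered like any other table.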

How do I choose between OLS, Ridge, and Elastic Net?

Start with OLS for interpretability and full inference. If you detect multicollinearity (VIF > 5), switch to Ridge. If you have many predictors and need automatic feature selection, use Elastic Net. Check the Diagnostics page for VIF and model comparison tools.
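For context on the VIF > 5 rule of thumb, the textbook definition of the variance inflation factor is sketched below. Here R_j^2 denotes the R-squared obtained by regressing predictor j on all the other predictors. This is the standard definition, and it is an assumption that the extension's VIF function follows it exactly.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{VIF}_j > 5 \iff R_j^2 > 0.8
```

Under this definition, the threshold flags any predictor for which more than 80% of the variance is explained by the remaining predictors.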

What does the MAP syntax for options mean?

Optional parameters use DuckDB's MAP type: MAP {'option_name': 'value'}. All values are strings, even for booleans and numbers (e.g., 'true', '0.95'). Pass MAP{} for defaults. This approach avoids positional argument ambiguity and makes queries self-documenting.
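The string-valued convention can be illustrated with plain DuckDB MAP literals, without calling any extension function. The option names below (confidence_level, intercept) are hypothetical placeholders, not documented option names; check the function reference for the real ones.

```sql
-- Sketch only: option names are hypothetical placeholders.
SELECT
    -- Numbers and booleans are written as strings
    MAP {'confidence_level': '0.95', 'intercept': 'true'} AS opts,
    -- An empty map requests the defaults
    MAP {} AS defaults;
```

Because every key is named, the options can be listed in any order without changing their meaning.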