Skip to main content

DuckDB Patterns

DuckDB Patterns refers to the set of SQL integration conventions that determine how AnoFox Statistics functions compose with standard SQL clauses like GROUP BY, OVER (...), and scalar expressions.

How the API Works

Every statistics function follows a consistent naming pattern:

anofox_stats_<method>_<variant>

Variants determine how the function integrates with SQL:

SuffixSQL PatternUse Case
_fitScalar functionFit a single model on your entire dataset
_fit_aggGROUP BYFit separate models per group (e.g., per region, per product)
_fit_predictOVER (...)Rolling or expanding window models (e.g., 60-day rolling regression)
_predict_aggGROUP BYBatch predictions per group

All model functions return a STRUCT containing coefficients, p-values, R², and diagnostics. Extract fields using (result).field_name syntax:

SELECT
(model).r_squared,
(model).coefficients[2] as slope,
(model).p_values[2] as p_value
FROM (
SELECT anofox_stats_ols_fit_agg(sales, [advertising]) as model
FROM data
);

Quick Reference

CategoryPatternDescription
Aggregates_agg suffixPer-group models with GROUP BY
Windows_fit_predict suffixRolling/expanding with OVER
Predictionanofox_stats_predictGenerate predictions from coefficients
Info Criteriaanofox_stats_aic, anofox_stats_bicModel selection metrics

Aggregate Function Pattern

All _agg functions work with GROUP BY for per-group models.

Basic GROUP BY

SELECT
group_column,
anofox_stats_ols_fit_agg(y, [x1, x2]) as model
FROM data
GROUP BY group_column;

Extracting Results

SELECT
region,
(model).r_squared as fit,
(model).coefficients[1] as intercept,
(model).coefficients[2] as slope,
(model).p_values[2] as p_value
FROM (
SELECT
region,
anofox_stats_ols_fit_agg(sales, [marketing_spend]) as model
FROM regional_data
GROUP BY region
);

Available Aggregate Functions

Regression Aggregates

FunctionDescription
anofox_stats_ols_fit_aggOLS regression
anofox_stats_ridge_fit_aggRidge regression
anofox_stats_wls_fit_aggWeighted least squares
anofox_stats_rls_fit_aggRecursive least squares
anofox_stats_elasticnet_fit_aggElastic Net
anofox_stats_poisson_fit_aggPoisson GLM
anofox_stats_alm_fit_aggAugmented linear model
anofox_stats_bls_fit_aggBounded least squares
anofox_stats_nnls_fit_aggNon-negative least squares

Prediction Aggregates

FunctionDescription
anofox_stats_ols_predict_aggBatch OLS predictions
anofox_stats_ridge_predict_aggBatch Ridge predictions
anofox_stats_wls_predict_aggBatch WLS predictions
anofox_stats_rls_predict_aggBatch RLS predictions
anofox_stats_elasticnet_predict_aggBatch Elastic Net predictions

Statistical Test Aggregates

FunctionDescription
anofox_stats_t_test_aggt-test
anofox_stats_one_way_anova_aggANOVA
anofox_stats_mann_whitney_u_aggMann-Whitney U
anofox_stats_kruskal_wallis_aggKruskal-Wallis
anofox_stats_pearson_aggPearson correlation
anofox_stats_spearman_aggSpearman correlation
anofox_stats_chisq_test_aggChi-square test

Diagnostic Aggregates

FunctionDescription
anofox_stats_vif_aggVariance Inflation Factor
anofox_stats_shapiro_wilk_aggShapiro-Wilk normality
anofox_stats_jarque_bera_aggJarque-Bera normality
anofox_stats_residuals_diagnostics_aggResidual analysis

Demand Analysis Aggregates

FunctionDescription
anofox_stats_aid_aggAID classification
anofox_stats_aid_anomaly_aggAID anomaly detection

Window Functions

Use _fit_predict functions with OVER for rolling/expanding analyses.

Rolling Regression

SELECT
date,
value,
anofox_stats_ols_fit_predict(y, [x1, x2]) OVER (
ORDER BY date
ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
) as rolling_model
FROM time_series_data;

Expanding Window

SELECT
date,
anofox_stats_ols_fit_predict(y, [x1]) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as expanding_model
FROM data;

Partitioned Rolling

SELECT
region,
date,
anofox_stats_ridge_fit_predict(y, [x1, x2], 0.5) OVER (
PARTITION BY region
ORDER BY date
ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
) as regional_rolling
FROM multi_region_data;

Window Function Variants

FunctionDescription
anofox_stats_ols_fit_predictRolling/expanding OLS
anofox_stats_ridge_fit_predictRolling/expanding Ridge
anofox_stats_wls_fit_predictRolling/expanding WLS
anofox_stats_rls_fit_predictRolling/expanding RLS
anofox_stats_elasticnet_fit_predictRolling/expanding Elastic Net

Window Specifications

Rolling Window

Fixed-size moving window:

ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
-- Uses last 61 observations (including current)

Expanding Window

Growing window from start:

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- Uses all observations up to current

Range-Based Window

Time-based window:

RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
-- Uses observations within last 30 days

Prediction

Generate predictions from fitted coefficients.

Parameters

ParameterTypeRequiredDefaultDescription
xLIST(LIST(DOUBLE))Yes-New predictor values
coefficientsLIST(DOUBLE)Yes-Fitted coefficients
interceptDOUBLEYes-Intercept term

Returns: LIST(DOUBLE) - predicted values

Example

WITH fitted AS (
SELECT anofox_stats_ols_fit_agg(revenue, [marketing, seasonality]) as model
FROM historical_data
)
SELECT
scenario_name,
anofox_stats_predict(
[[marketing_plan, seasonal_factor]],
model.coefficients,
model.intercept
)[1] as predicted_revenue
FROM scenarios, fitted;

Information Criteria

AIC (Akaike Information Criterion)

SELECT anofox_stats_aic(
rss, -- DOUBLE: residual sum of squares
n, -- BIGINT: number of observations
k -- BIGINT: number of parameters (including intercept)
) as aic;

BIC (Bayesian Information Criterion)

SELECT anofox_stats_bic(
rss, -- DOUBLE: residual sum of squares
n, -- BIGINT: number of observations
k -- BIGINT: number of parameters
) as bic;

Example

WITH residuals AS (
SELECT sum(power(actual - predicted, 2)) as rss,
count(*) as n
FROM predictions
)
SELECT
anofox_stats_aic(rss, n, 3) as aic, -- 3 = intercept + 2 predictors
anofox_stats_bic(rss, n, 3) as bic
FROM residuals;

Working with STRUCT Results

All model functions return STRUCT types. Here's how to extract values.

Direct Field Access

SELECT
(model).r_squared,
(model).coefficients[2],
(model).p_values[2]
FROM (
SELECT anofox_stats_ols_fit_agg(y, [x1, x2]) as model
FROM data
);

Named Extraction

SELECT
result.coefficients[1] as intercept,
result.coefficients[2] as x1_effect,
result.coefficients[3] as x2_effect,
result.p_values[2] as x1_pvalue,
result.r_squared as fit
FROM (
SELECT anofox_stats_ols_fit_agg(y, [x1, x2]) as result
FROM data
);

Unnesting Arrays

WITH model AS (
SELECT anofox_stats_ols_fit_agg(y, [x1, x2, x3]) as m
FROM data
)
SELECT
idx,
m.coefficients[idx] as coefficient,
m.std_errors[idx] as std_error,
m.p_values[idx] as p_value
FROM model, generate_series(1, array_length(m.coefficients)) as idx;

Common Patterns

Filtering by Significance

SELECT
predictor_name,
coefficient,
p_value
FROM model_results
WHERE p_value < 0.05;

Model Comparison Table

WITH models AS (
SELECT 'Simple' as name, anofox_stats_ols_fit_agg(y, [x1]) as m FROM data
UNION ALL
SELECT 'Full', anofox_stats_ols_fit_agg(y, [x1, x2, x3]) FROM data
)
SELECT
name,
round(m.r_squared, 4) as r2,
round(m.adj_r_squared, 4) as adj_r2,
round(m.aic, 2) as aic,
round(m.bic, 2) as bic,
round(m.rmse, 4) as rmse
FROM models
ORDER BY m.bic;

Coefficient Summary Table

WITH model AS (
SELECT anofox_stats_ols_fit_agg(
revenue,
[marketing, seasonality, competitor],
true, true, 0.95
) as m
FROM sales_data
),
predictors AS (
SELECT * FROM (VALUES
(1, 'Intercept'),
(2, 'Marketing'),
(3, 'Seasonality'),
(4, 'Competitor')
) AS t(idx, name)
)
SELECT
p.name,
round(m.coefficients[p.idx], 4) as estimate,
round(m.std_errors[p.idx], 4) as std_error,
round(m.t_statistics[p.idx], 3) as t_stat,
round(m.p_values[p.idx], 4) as p_value,
CASE WHEN m.p_values[p.idx] < 0.001 THEN '***'
WHEN m.p_values[p.idx] < 0.01 THEN '**'
WHEN m.p_values[p.idx] < 0.05 THEN '*'
WHEN m.p_values[p.idx] < 0.1 THEN '.'
ELSE '' END as sig
FROM model, predictors p
ORDER BY p.idx;

Examples

Per-Product Price Elasticity

SELECT
product_id,
product_name,
(model).coefficients[2] as price_elasticity,
(model).r_squared as model_fit,
CASE
WHEN (model).p_values[2] < 0.05 THEN 'Significant'
ELSE 'Not Significant'
END as significance
FROM (
SELECT
product_id,
product_name,
anofox_stats_ols_fit_agg(
log(quantity_sold),
[log(price), log(competitor_price)]
) as model
FROM sales_data
GROUP BY product_id, product_name
)
ORDER BY abs((model).coefficients[2]) DESC;

Rolling Beta Estimation

SELECT
date,
stock_ticker,
(rolling_model).coefficients[2] as beta,
(rolling_model).r_squared as r_squared
FROM (
SELECT
date,
stock_ticker,
anofox_stats_ols_fit_predict(
stock_return,
[market_return]
) OVER (
PARTITION BY stock_ticker
ORDER BY date
ROWS BETWEEN 252 PRECEDING AND CURRENT ROW
) as rolling_model
FROM stock_returns
)
WHERE date >= '2024-01-01';

Regional A/B Test Analysis

SELECT
region,
(result).mean_diff as lift,
(result).p_value,
(result).cohens_d as effect_size,
CASE
WHEN (result).p_value < 0.05 AND (result).mean_diff > 0 THEN 'Winner'
WHEN (result).p_value < 0.05 AND (result).mean_diff < 0 THEN 'Loser'
ELSE 'Inconclusive'
END as decision
FROM (
SELECT
region,
anofox_stats_t_test_agg(conversion_rate, treatment_group) as result
FROM ab_test_data
GROUP BY region
);

Time-Varying Correlation

SELECT
date,
(corr).r as correlation,
(corr).p_value,
CASE
WHEN abs((corr).r) > 0.7 THEN 'Strong'
WHEN abs((corr).r) > 0.4 THEN 'Moderate'
ELSE 'Weak'
END as strength
FROM (
SELECT
date,
anofox_stats_pearson_agg(x, y) OVER (
ORDER BY date
ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
) as corr
FROM paired_series
);

Type Reference

Common Output Types

FieldTypeDescription
coefficientsLIST(DOUBLE)Regression coefficients
std_errorsLIST(DOUBLE)Standard errors
t_statisticsLIST(DOUBLE)t-statistics
p_valuesLIST(DOUBLE)p-values
confidence_intervalsLIST(STRUCT)CI bounds per coefficient
r_squaredDOUBLER² (0-1)
adj_r_squaredDOUBLEAdjusted R²
rmseDOUBLERoot mean squared error
aicDOUBLEAkaike Information Criterion
bicDOUBLEBayesian Information Criterion
nBIGINTNumber of observations
kBIGINTNumber of predictors

Hypothesis Test Output Types

FieldTypeDescription
statisticDOUBLETest statistic
p_valueDOUBLEp-value
dfDOUBLE/BIGINTDegrees of freedom
effect_sizeDOUBLEEffect size measure
ci_lowerDOUBLECI lower bound
ci_upperDOUBLECI upper bound

Frequently Asked Questions

What is the difference between _fit_agg and _fit_predict variants?

The _fit_agg variant works with GROUP BY to fit one model per group and returns a single STRUCT result per group. The _fit_predict variant works with OVER (...) window functions to perform rolling or expanding computations, returning a result for every row in the window. Use _fit_agg for batch analysis and _fit_predict for time-varying analysis.

How do I extract individual fields from the STRUCT result?

Use parenthesized dot notation: (model).r_squared, (model).coefficients[2]. Array indices start at 1, where index 1 is the intercept (if included) and index 2 is the first predictor's coefficient. You can also alias the result and use result.field_name syntax.

Can I use window functions with a time-based range instead of row count?

Yes. DuckDB supports range-based windows like RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW. This uses all observations within the last 30 calendar days rather than a fixed row count. This is useful when your data has irregular time spacing.

How do I compare multiple models using information criteria?

Fit each model and extract the AIC and BIC values from the result STRUCT. Lower AIC favors prediction accuracy while lower BIC favors simpler models. Use UNION ALL to combine results and ORDER BY m.bic to rank them. See the Model Comparison Table example above.

🍪 Cookie Settings