DuckDB Patterns
DuckDB Patterns refers to the set of SQL integration conventions that determine how AnoFox Statistics functions compose with standard SQL clauses like GROUP BY, OVER (...), and scalar expressions.
How the API Works
Every statistics function follows a consistent naming pattern:
anofox_stats_<method>_<variant>
Variants determine how the function integrates with SQL:
| Suffix | SQL Pattern | Use Case |
|---|---|---|
_fit | Scalar function | Fit a single model on your entire dataset |
_fit_agg | GROUP BY | Fit separate models per group (e.g., per region, per product) |
_fit_predict | OVER (...) | Rolling or expanding window models (e.g., 60-day rolling regression) |
_predict_agg | GROUP BY | Batch predictions per group |
All model functions return a STRUCT containing coefficients, p-values, R², and diagnostics. Extract fields using (result).field_name syntax:
SELECT
(model).r_squared,
(model).coefficients[2] as slope,
(model).p_values[2] as p_value
FROM (
SELECT anofox_stats_ols_fit_agg(sales, [advertising]) as model
FROM data
);
Quick Reference
| Category | Pattern | Description |
|---|---|---|
| Aggregates | _agg suffix | Per-group models with GROUP BY |
| Windows | _fit_predict suffix | Rolling/expanding with OVER |
| Prediction | anofox_stats_predict | Generate predictions from coefficients |
| Info Criteria | anofox_stats_aic, anofox_stats_bic | Model selection metrics |
Aggregate Function Pattern
All _agg functions work with GROUP BY for per-group models.
Basic GROUP BY
SELECT
group_column,
anofox_stats_ols_fit_agg(y, [x1, x2]) as model
FROM data
GROUP BY group_column;
Extracting Results
SELECT
region,
(model).r_squared as fit,
(model).coefficients[1] as intercept,
(model).coefficients[2] as slope,
(model).p_values[2] as p_value
FROM (
SELECT
region,
anofox_stats_ols_fit_agg(sales, [marketing_spend]) as model
FROM regional_data
GROUP BY region
);
Available Aggregate Functions
Regression Aggregates
| Function | Description |
|---|---|
anofox_stats_ols_fit_agg | OLS regression |
anofox_stats_ridge_fit_agg | Ridge regression |
anofox_stats_wls_fit_agg | Weighted least squares |
anofox_stats_rls_fit_agg | Recursive least squares |
anofox_stats_elasticnet_fit_agg | Elastic Net |
anofox_stats_poisson_fit_agg | Poisson GLM |
anofox_stats_alm_fit_agg | Augmented linear model |
anofox_stats_bls_fit_agg | Bounded least squares |
anofox_stats_nnls_fit_agg | Non-negative least squares |
Prediction Aggregates
| Function | Description |
|---|---|
anofox_stats_ols_predict_agg | Batch OLS predictions |
anofox_stats_ridge_predict_agg | Batch Ridge predictions |
anofox_stats_wls_predict_agg | Batch WLS predictions |
anofox_stats_rls_predict_agg | Batch RLS predictions |
anofox_stats_elasticnet_predict_agg | Batch Elastic Net predictions |
Statistical Test Aggregates
| Function | Description |
|---|---|
anofox_stats_t_test_agg | t-test |
anofox_stats_one_way_anova_agg | ANOVA |
anofox_stats_mann_whitney_u_agg | Mann-Whitney U |
anofox_stats_kruskal_wallis_agg | Kruskal-Wallis |
anofox_stats_pearson_agg | Pearson correlation |
anofox_stats_spearman_agg | Spearman correlation |
anofox_stats_chisq_test_agg | Chi-square test |
Diagnostic Aggregates
| Function | Description |
|---|---|
anofox_stats_vif_agg | Variance Inflation Factor |
anofox_stats_shapiro_wilk_agg | Shapiro-Wilk normality |
anofox_stats_jarque_bera_agg | Jarque-Bera normality |
anofox_stats_residuals_diagnostics_agg | Residual analysis |
Demand Analysis Aggregates
| Function | Description |
|---|---|
anofox_stats_aid_agg | AID classification |
anofox_stats_aid_anomaly_agg | AID anomaly detection |
Window Functions
Use _fit_predict functions with OVER for rolling/expanding analyses.
Rolling Regression
SELECT
date,
value,
anofox_stats_ols_fit_predict(y, [x1, x2]) OVER (
ORDER BY date
ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
) as rolling_model
FROM time_series_data;
Expanding Window
SELECT
date,
anofox_stats_ols_fit_predict(y, [x1]) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) as expanding_model
FROM data;
Partitioned Rolling
SELECT
region,
date,
anofox_stats_ridge_fit_predict(y, [x1, x2], 0.5) OVER (
PARTITION BY region
ORDER BY date
ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
) as regional_rolling
FROM multi_region_data;
Window Function Variants
| Function | Description |
|---|---|
anofox_stats_ols_fit_predict | Rolling/expanding OLS |
anofox_stats_ridge_fit_predict | Rolling/expanding Ridge |
anofox_stats_wls_fit_predict | Rolling/expanding WLS |
anofox_stats_rls_fit_predict | Rolling/expanding RLS |
anofox_stats_elasticnet_fit_predict | Rolling/expanding Elastic Net |
Window Specifications
Rolling Window
Fixed-size moving window:
ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
-- Uses last 61 observations (including current)
Expanding Window
Growing window from start:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- Uses all observations up to current
Range-Based Window
Time-based window:
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
-- Uses observations within last 30 days
Prediction
Generate predictions from fitted coefficients.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
x | LIST(LIST(DOUBLE)) | Yes | - | New predictor values |
coefficients | LIST(DOUBLE) | Yes | - | Fitted coefficients |
intercept | DOUBLE | Yes | - | Intercept term |
Returns: LIST(DOUBLE) - predicted values
Example
WITH fitted AS (
SELECT anofox_stats_ols_fit_agg(revenue, [marketing, seasonality]) as model
FROM historical_data
)
SELECT
scenario_name,
anofox_stats_predict(
[[marketing_plan, seasonal_factor]],
model.coefficients,
model.intercept
)[1] as predicted_revenue
FROM scenarios, fitted;
Information Criteria
AIC (Akaike Information Criterion)
SELECT anofox_stats_aic(
rss, -- DOUBLE: residual sum of squares
n, -- BIGINT: number of observations
k -- BIGINT: number of parameters (including intercept)
) as aic;
BIC (Bayesian Information Criterion)
SELECT anofox_stats_bic(
rss, -- DOUBLE: residual sum of squares
n, -- BIGINT: number of observations
k -- BIGINT: number of parameters
) as bic;
Example
WITH residuals AS (
SELECT sum(power(actual - predicted, 2)) as rss,
count(*) as n
FROM predictions
)
SELECT
anofox_stats_aic(rss, n, 3) as aic, -- 3 = intercept + 2 predictors
anofox_stats_bic(rss, n, 3) as bic
FROM residuals;
Working with STRUCT Results
All model functions return STRUCT types. Here's how to extract values.
Direct Field Access
SELECT
(model).r_squared,
(model).coefficients[2],
(model).p_values[2]
FROM (
SELECT anofox_stats_ols_fit_agg(y, [x1, x2]) as model
FROM data
);
Named Extraction
SELECT
result.coefficients[1] as intercept,
result.coefficients[2] as x1_effect,
result.coefficients[3] as x2_effect,
result.p_values[2] as x1_pvalue,
result.r_squared as fit
FROM (
SELECT anofox_stats_ols_fit_agg(y, [x1, x2]) as result
FROM data
);
Unnesting Arrays
WITH model AS (
SELECT anofox_stats_ols_fit_agg(y, [x1, x2, x3]) as m
FROM data
)
SELECT
idx,
m.coefficients[idx] as coefficient,
m.std_errors[idx] as std_error,
m.p_values[idx] as p_value
FROM model, generate_series(1, array_length(m.coefficients)) as idx;
Common Patterns
Filtering by Significance
SELECT
predictor_name,
coefficient,
p_value
FROM model_results
WHERE p_value < 0.05;
Model Comparison Table
WITH models AS (
SELECT 'Simple' as name, anofox_stats_ols_fit_agg(y, [x1]) as m FROM data
UNION ALL
SELECT 'Full', anofox_stats_ols_fit_agg(y, [x1, x2, x3]) FROM data
)
SELECT
name,
round(m.r_squared, 4) as r2,
round(m.adj_r_squared, 4) as adj_r2,
round(m.aic, 2) as aic,
round(m.bic, 2) as bic,
round(m.rmse, 4) as rmse
FROM models
ORDER BY m.bic;
Coefficient Summary Table
WITH model AS (
SELECT anofox_stats_ols_fit_agg(
revenue,
[marketing, seasonality, competitor],
true, true, 0.95
) as m
FROM sales_data
),
predictors AS (
SELECT * FROM (VALUES
(1, 'Intercept'),
(2, 'Marketing'),
(3, 'Seasonality'),
(4, 'Competitor')
) AS t(idx, name)
)
SELECT
p.name,
round(m.coefficients[p.idx], 4) as estimate,
round(m.std_errors[p.idx], 4) as std_error,
round(m.t_statistics[p.idx], 3) as t_stat,
round(m.p_values[p.idx], 4) as p_value,
CASE WHEN m.p_values[p.idx] < 0.001 THEN '***'
WHEN m.p_values[p.idx] < 0.01 THEN '**'
WHEN m.p_values[p.idx] < 0.05 THEN '*'
WHEN m.p_values[p.idx] < 0.1 THEN '.'
ELSE '' END as sig
FROM model, predictors p
ORDER BY p.idx;
Examples
Per-Product Price Elasticity
SELECT
product_id,
product_name,
(model).coefficients[2] as price_elasticity,
(model).r_squared as model_fit,
CASE
WHEN (model).p_values[2] < 0.05 THEN 'Significant'
ELSE 'Not Significant'
END as significance
FROM (
SELECT
product_id,
product_name,
anofox_stats_ols_fit_agg(
log(quantity_sold),
[log(price), log(competitor_price)]
) as model
FROM sales_data
GROUP BY product_id, product_name
)
ORDER BY abs((model).coefficients[2]) DESC;
Rolling Beta Estimation
SELECT
date,
stock_ticker,
(rolling_model).coefficients[2] as beta,
(rolling_model).r_squared as r_squared
FROM (
SELECT
date,
stock_ticker,
anofox_stats_ols_fit_predict(
stock_return,
[market_return]
) OVER (
PARTITION BY stock_ticker
ORDER BY date
ROWS BETWEEN 252 PRECEDING AND CURRENT ROW
) as rolling_model
FROM stock_returns
)
WHERE date >= '2024-01-01';
Regional A/B Test Analysis
SELECT
region,
(result).mean_diff as lift,
(result).p_value,
(result).cohens_d as effect_size,
CASE
WHEN (result).p_value < 0.05 AND (result).mean_diff > 0 THEN 'Winner'
WHEN (result).p_value < 0.05 AND (result).mean_diff < 0 THEN 'Loser'
ELSE 'Inconclusive'
END as decision
FROM (
SELECT
region,
anofox_stats_t_test_agg(conversion_rate, treatment_group) as result
FROM ab_test_data
GROUP BY region
);
Time-Varying Correlation
SELECT
date,
(corr).r as correlation,
(corr).p_value,
CASE
WHEN abs((corr).r) > 0.7 THEN 'Strong'
WHEN abs((corr).r) > 0.4 THEN 'Moderate'
ELSE 'Weak'
END as strength
FROM (
SELECT
date,
anofox_stats_pearson_agg(x, y) OVER (
ORDER BY date
ROWS BETWEEN 60 PRECEDING AND CURRENT ROW
) as corr
FROM paired_series
);
Type Reference
Common Output Types
| Field | Type | Description |
|---|---|---|
coefficients | LIST(DOUBLE) | Regression coefficients |
std_errors | LIST(DOUBLE) | Standard errors |
t_statistics | LIST(DOUBLE) | t-statistics |
p_values | LIST(DOUBLE) | p-values |
confidence_intervals | LIST(STRUCT) | CI bounds per coefficient |
r_squared | DOUBLE | R² (0-1) |
adj_r_squared | DOUBLE | Adjusted R² |
rmse | DOUBLE | Root mean squared error |
aic | DOUBLE | Akaike Information Criterion |
bic | DOUBLE | Bayesian Information Criterion |
n | BIGINT | Number of observations |
k | BIGINT | Number of predictors |
Hypothesis Test Output Types
| Field | Type | Description |
|---|---|---|
statistic | DOUBLE | Test statistic |
p_value | DOUBLE | p-value |
df | DOUBLE/BIGINT | Degrees of freedom |
effect_size | DOUBLE | Effect size measure |
ci_lower | DOUBLE | CI lower bound |
ci_upper | DOUBLE | CI upper bound |
Frequently Asked Questions
What is the difference between _fit_agg and _fit_predict variants?
The _fit_agg variant works with GROUP BY to fit one model per group and returns a single STRUCT result per group. The _fit_predict variant works with OVER (...) window functions to perform rolling or expanding computations, returning a result for every row in the window. Use _fit_agg for batch analysis and _fit_predict for time-varying analysis.
How do I extract individual fields from the STRUCT result?
Use parenthesized dot notation: (model).r_squared, (model).coefficients[2]. Array indices start at 1, where index 1 is the intercept (if included) and index 2 is the first predictor's coefficient. You can also alias the result and use result.field_name syntax.
Can I use window functions with a time-based range instead of row count?
Yes. DuckDB supports range-based windows like RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW. This uses all observations within the last 30 calendar days rather than a fixed row count. This is useful when your data has irregular time spacing.
How do I compare multiple models using information criteria?
Fit each model and extract the AIC and BIC values from the result STRUCT. Lower AIC favors prediction accuracy while lower BIC favors simpler models. Use UNION ALL to combine results and ORDER BY m.bic to rank them. See the Model Comparison Table example above.