Anomaly Detection

Statistical and machine learning methods for outlier detection.

Quick Reference

Function	Description	SQL Signature
`anofox_tab_metric_zscore`	Z-score outlier	`(table, column, threshold) -> TABLE`
`anofox_tab_metric_iqr`	IQR outlier	`(table, column, multiplier) -> TABLE`
`anofox_tab_metric_isolation_forest`	Univariate ML	`(table, column, ...) -> TABLE`
`anofox_tab_metric_isolation_forest_multivariate`	Multivariate ML	`(table, columns, ...) -> TABLE`
`anofox_tab_metric_dbscan`	Density-based	`(table, column, eps, min_pts, mode) -> TABLE`
`anofox_tab_outlier_tree`	Explainable outliers	`(table, columns, mode) -> TABLE`

Statistical Methods (2)

anofox_tab_metric_zscore

Detect outliers using Z-score method.

Parameters

Parameter	Type	Required	Default	Description
`table_name`	VARCHAR	Yes	-	Table to analyze
`column_name`	VARCHAR	Yes	-	Column to analyze
`threshold`	DOUBLE	No	3.0	Z-score threshold

Method: z = (value - mean) / std_dev, anomaly = |z| > threshold

Thresholds:

Threshold	% in Distribution	Outliers
2.0	95%	5% of data
2.5	98.8%	1.2% of data
3.0	99.7%	0.3% of data

Example

SELECT * FROM anofox_tab_metric_zscore('transactions', 'amount', 3.0)
WHERE is_outlier = true;

anofox_tab_metric_iqr

Detect outliers using Interquartile Range.

Parameters

Parameter	Type	Required	Default	Description
`table_name`	VARCHAR	Yes	-	Table to analyze
`column_name`	VARCHAR	Yes	-	Column to analyze
`multiplier`	DOUBLE	No	1.5	IQR multiplier

Method:

IQR = Q3 - Q1
lower = Q1 - multiplier * IQR
upper = Q3 + multiplier * IQR
anomaly = value < lower OR value > upper

Example

SELECT * FROM anofox_tab_metric_iqr('transactions', 'amount', 1.5);

Pros over Z-score:

Robust to extreme outliers
Works with any distribution
Doesn't require normality assumption

Isolation Forest (Enhanced)

Industry-grade anomaly detection with advanced capabilities.

Core Capabilities

Categorical Support - Auto-detect VARCHAR columns with random subset splitting
Extended IF (ndim) - Hyperplane splits for diagonal/curved anomaly patterns
Density Scoring - Alternative metric based on points-to-volume ratio
Sample Weights - Weighted sampling for imbalanced datasets
SCiForest - Information-gain guided splitting

anofox_tab_metric_isolation_forest

Univariate Isolation Forest for single column anomaly detection.

Full Signature

anofox_tab_metric_isolation_forest(
    table_name VARCHAR,
    column_name VARCHAR,
    n_trees BIGINT,           -- 1-500, default 100
    sample_size BIGINT,       -- 1-10000, default 256
    contamination DOUBLE,     -- 0.0-0.5, default 0.1
    output_mode VARCHAR,      -- 'summary' or 'scores'
    ndim BIGINT,              -- 1-N, default 1 (Extended IF)
    coef_type VARCHAR,        -- 'uniform' or 'normal'
    scoring_metric VARCHAR,   -- 'depth', 'density', or 'adj_depth'
    weight_column VARCHAR,    -- Column for sample weights (NULL = uniform)
    ntry BIGINT,              -- 1-100, default 1 (SCiForest)
    prob_pick_avg_gain DOUBLE -- 0.0-1.0, default 0.0
) -> TABLE

Parameters

Parameter	Range	Default	Description
`n_trees`	1-500	100	Number of isolation trees
`sample_size`	1-10000	256	Subsample size per tree
`contamination`	0.0-0.5	0.1	Expected anomaly fraction
`output_mode`	-	'scores'	`'summary'` or `'scores'`
`ndim`	1-N	1	Hyperplane dimensions (Extended IF)
`coef_type`	-	'uniform'	`'uniform'` or `'normal'`
`scoring_metric`	-	'depth'	`'depth'`, `'density'`, `'adj_depth'`
`weight_column`	-	NULL	Sample weight column name
`ntry`	1-100	1	Split candidates (SCiForest)
`prob_pick_avg_gain`	0.0-1.0	0.0	Gain-based selection probability

Contamination guidance:

Value	Expectation	Use Case
0.01	1% anomalies	High quality data
0.05	5% anomalies	Normal data
0.10	10% anomalies	Messy data

Examples

-- Basic usage (backward compatible)
SELECT * FROM anofox_tab_metric_isolation_forest(
    'sales_data', 'amount', 100, 256, 0.1, 'scores'
) WHERE is_anomaly = true;

-- Extended IF with hyperplane splits
SELECT * FROM anofox_tab_metric_isolation_forest(
    'transactions', 'amount', 100, 256, 0.1, 'scores',
    3,          -- ndim: 3-dimensional hyperplanes
    'normal',   -- coef_type: normal distribution
    'depth'     -- scoring_metric
);

-- SCiForest with gain-based selection
SELECT * FROM anofox_tab_metric_isolation_forest(
    'events', 'value', 100, 256, 0.05, 'scores',
    1, 'uniform', 'depth',
    NULL,       -- weight_column
    10,         -- ntry: evaluate 10 candidates
    0.5         -- prob_pick_avg_gain: 50% gain-based
);

anofox_tab_metric_isolation_forest_multivariate

Multivariate Isolation Forest for multiple column anomaly detection.

Parameters

Same as univariate, but column_name becomes columns (comma-separated).

Example

-- Basic multivariate
SELECT * FROM anofox_tab_metric_isolation_forest_multivariate(
    'customer_events', 'purchase_amount, session_duration, page_views',
    100, 256, 0.1, 'scores'
) ORDER BY anomaly_score DESC LIMIT 10;

-- Extended IF with density scoring
SELECT * FROM anofox_tab_metric_isolation_forest_multivariate(
    'transactions', 'amount, quantity, duration',
    100, 256, 0.05, 'scores',
    3,          -- ndim
    'normal',   -- coef_type
    'density'   -- scoring_metric: density-based
);

DBSCAN Clustering

anofox_tab_metric_dbscan

Detect outliers using Density-Based Clustering.

Parameters

Parameter	Type	Required	Default	Description
`table_name`	VARCHAR	Yes	-	Table to analyze
`column_name`	VARCHAR	Yes	-	Column to analyze
`eps`	DOUBLE	No	0.5	Neighborhood radius
`min_pts`	BIGINT	No	5	Min points for dense region
`output_mode`	VARCHAR	No	'clusters'	`'summary'` or `'clusters'`

Point Classifications:

CORE - Dense region centers (low anomaly score)
BORDER - Cluster edges (moderate anomaly score)
NOISE - Isolated outliers (high anomaly score: 1.0)

Example

SELECT * FROM anofox_tab_metric_dbscan(
    'transactions', 'amount', 10.0, 5, 'clusters'
) WHERE point_type = 'NOISE';

anofox_tab_metric_dbscan_multivariate

Multivariate DBSCAN for multiple columns.

SELECT * FROM anofox_tab_metric_dbscan_multivariate(
    'customer_events', 'amount, frequency',
    0.5, 10, 'clusters'
) WHERE point_type = 'NOISE';

OutlierTree (Explainable Anomaly Detection)

Unlike Isolation Forest which provides anomaly scores, OutlierTree generates human-readable explanations for why specific values are outliers based on conditional distributions.

Key Features

Detects outliers in context (e.g., "salary is high for a Junior Developer")
Returns natural language explanations with statistical backing
Uses robust statistics (median + MAD) resistant to outliers
Supports both numeric and categorical columns

anofox_tab_outlier_tree

Signatures

-- Simple (3 params, uses defaults)
anofox_tab_outlier_tree(table_name, columns, output_mode) -> TABLE

-- Full (9 params)
anofox_tab_outlier_tree(
    table_name VARCHAR,
    columns VARCHAR,
    output_mode VARCHAR,
    max_depth INTEGER,
    max_perc_outliers DOUBLE,
    min_size_numeric INTEGER,
    min_size_categ INTEGER,
    z_norm DOUBLE,
    z_outlier DOUBLE
) -> TABLE

Parameters

Parameter	Type	Default	Description
`table_name`	VARCHAR	required	Source table name
`columns`	VARCHAR	required	Comma-separated column names
`output_mode`	VARCHAR	'summary'	`'summary'` or `'outliers'`
`max_depth`	INTEGER	4	Maximum tree depth
`max_perc_outliers`	DOUBLE	0.01	Max fraction of outliers per cluster
`min_size_numeric`	INTEGER	25	Min cluster size for numeric targets
`min_size_categ`	INTEGER	75	Min cluster size for categorical targets
`z_norm`	DOUBLE	2.67	Z-threshold for confidence intervals
`z_outlier`	DOUBLE	8.0	Z-threshold for outlier flagging

Output Modes

Summary Mode Returns:

Column	Type	Description
`status`	VARCHAR	`'pass'` or `'fail'`
`total_rows`	BIGINT	Rows analyzed
`outlier_count`	BIGINT	Outliers detected
`columns_analyzed`	INTEGER	Columns analyzed
`clusters_evaluated`	BIGINT	Clusters evaluated
`message`	VARCHAR	Summary message

Outliers Mode Returns:

Column	Type	Description
`row_id`	BIGINT	Row index (1-indexed)
`column_name`	VARCHAR	Column with outlier
`outlier_value`	VARCHAR	The anomalous value
`cluster_mean`	DOUBLE	Mean in cluster
`cluster_sd`	DOUBLE	SD in cluster
`z_score`	DOUBLE	Robust z-score
`lower_bound`	DOUBLE	Lower CI bound
`upper_bound`	DOUBLE	Upper CI bound
`conditions`	VARCHAR	JSON array of split conditions
`explanation`	VARCHAR	Human-readable explanation
`outlier_score`	DOUBLE	Rarity score

Examples

-- Summary mode: quick pass/fail check
SELECT * FROM anofox_tab_outlier_tree('employees', 'department,salary', 'summary');

-- Outliers mode: get detailed explanations
SELECT row_id, column_name, outlier_value, explanation
FROM anofox_tab_outlier_tree('employees', 'job_title,salary,years_exp', 'outliers');

-- Returns explanations like:
-- "Value 150000 for column 'salary' is unusually high (expected: 52333 +/- 7413)
--  when job_title = 'Junior Developer'"

-- With custom parameters for small datasets
SELECT * FROM anofox_tab_outlier_tree(
    'test_data', 'category,value', 'outliers',
    4,          -- max_depth
    0.5,        -- max_perc_outliers
    3,          -- min_size_numeric
    2,          -- min_size_categ
    2.67,       -- z_norm
    3.0         -- z_outlier
);

Method Comparison

Method	Type	Speed	Interpretation	Use Case
Z-Score	Statistical	Fast	Easy	Single column, normal dist
IQR	Statistical	Fast	Easy	Single column, robust
Isolation Forest	ML	Medium	Hard	Multiple columns, complex patterns
Extended IF	ML	Medium	Hard	Diagonal/curved anomalies
DBSCAN	ML	Slow	Medium	Density-based clusters
OutlierTree	ML	Medium	Easy	Explainable, contextual outliers

Decision Tree

How many features?
+-- ONE column
|   +-- Normal distribution? -> Z-Score
|   +-- Any distribution? -> IQR
|
+-- MULTIPLE columns
    +-- Need explanations? -> OutlierTree
    +-- Detect complex patterns? -> Extended IF (ndim > 1)
    +-- Find density clusters? -> DBSCAN
    +-- Maximize detection? -> Isolation Forest

Performance Characteristics

Function	Speed	Notes
Z-Score	O(n)	Single pass
IQR	O(n log n)	Sorting for quartiles
Isolation Forest	O(n log n)	Tree construction
Extended IF	O(n log n)	Additional hyperplane calc
DBSCAN	O(n^2)	Distance calculations
OutlierTree	O(n log n)	Tree building + scoring

Quick Reference​

Statistical Methods (2)​

anofox_tab_metric_zscore​

Parameters​

Example​

anofox_tab_metric_iqr​

Parameters​

Example​

Isolation Forest (Enhanced)​

Core Capabilities​

anofox_tab_metric_isolation_forest​

Full Signature​

Parameters​

Examples​

anofox_tab_metric_isolation_forest_multivariate​

Parameters​

Example​

DBSCAN Clustering​

anofox_tab_metric_dbscan​

Parameters​

Example​

anofox_tab_metric_dbscan_multivariate​

OutlierTree (Explainable Anomaly Detection)​

Key Features​

anofox_tab_outlier_tree​

Signatures​

Parameters​

Output Modes​

Examples​

Method Comparison​

Decision Tree​

Performance Characteristics​

Quick Reference

Statistical Methods (2)

anofox_tab_metric_zscore

Parameters

Example

anofox_tab_metric_iqr

Parameters

Example

Isolation Forest (Enhanced)

Core Capabilities

anofox_tab_metric_isolation_forest

Full Signature

Parameters

Examples

anofox_tab_metric_isolation_forest_multivariate

Parameters

Example

DBSCAN Clustering

anofox_tab_metric_dbscan

Parameters

Example

anofox_tab_metric_dbscan_multivariate

OutlierTree (Explainable Anomaly Detection)

Key Features

anofox_tab_outlier_tree

Signatures

Parameters

Output Modes

Examples

Method Comparison

Decision Tree

Performance Characteristics