Anomaly Detection
Statistical and machine learning methods for outlier detection.
Quick Reference
| Function | Description | SQL Signature |
|---|---|---|
anofox_tab_metric_zscore | Z-score outlier | (table, column, threshold) -> TABLE |
anofox_tab_metric_iqr | IQR outlier | (table, column, multiplier) -> TABLE |
anofox_tab_metric_isolation_forest | Univariate ML | (table, column, ...) -> TABLE |
anofox_tab_metric_isolation_forest_multivariate | Multivariate ML | (table, columns, ...) -> TABLE |
anofox_tab_metric_dbscan | Density-based | (table, column, eps, min_pts, mode) -> TABLE |
anofox_tab_outlier_tree | Explainable outliers | (table, columns, mode) -> TABLE |
Statistical Methods (2)
anofox_tab_metric_zscore
Detect outliers using Z-score method.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
table_name | VARCHAR | Yes | - | Table to analyze |
column_name | VARCHAR | Yes | - | Column to analyze |
threshold | DOUBLE | No | 3.0 | Z-score threshold |
Method: z = (value - mean) / std_dev, anomaly = |z| > threshold
Thresholds:
| Threshold | % in Distribution | Outliers |
|---|---|---|
| 2.0 | 95% | 5% of data |
| 2.5 | 98.8% | 1.2% of data |
| 3.0 | 99.7% | 0.3% of data |
Example
SELECT * FROM anofox_tab_metric_zscore('transactions', 'amount', 3.0)
WHERE is_outlier = true;
anofox_tab_metric_iqr
Detect outliers using Interquartile Range.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
table_name | VARCHAR | Yes | - | Table to analyze |
column_name | VARCHAR | Yes | - | Column to analyze |
multiplier | DOUBLE | No | 1.5 | IQR multiplier |
Method:
- IQR = Q3 - Q1
- lower = Q1 - multiplier * IQR
- upper = Q3 + multiplier * IQR
- anomaly = value < lower OR value > upper
Example
SELECT * FROM anofox_tab_metric_iqr('transactions', 'amount', 1.5);
Pros over Z-score:
- Robust to extreme outliers
- Works with any distribution
- Doesn't require normality assumption
Isolation Forest (Enhanced)
Industry-grade anomaly detection with advanced capabilities.
Core Capabilities
- Categorical Support - Auto-detect VARCHAR columns with random subset splitting
- Extended IF (ndim) - Hyperplane splits for diagonal/curved anomaly patterns
- Density Scoring - Alternative metric based on points-to-volume ratio
- Sample Weights - Weighted sampling for imbalanced datasets
- SCiForest - Information-gain guided splitting
anofox_tab_metric_isolation_forest
Univariate Isolation Forest for single column anomaly detection.
Full Signature
anofox_tab_metric_isolation_forest(
table_name VARCHAR,
column_name VARCHAR,
n_trees BIGINT, -- 1-500, default 100
sample_size BIGINT, -- 1-10000, default 256
contamination DOUBLE, -- 0.0-0.5, default 0.1
output_mode VARCHAR, -- 'summary' or 'scores'
ndim BIGINT, -- 1-N, default 1 (Extended IF)
coef_type VARCHAR, -- 'uniform' or 'normal'
scoring_metric VARCHAR, -- 'depth', 'density', or 'adj_depth'
weight_column VARCHAR, -- Column for sample weights (NULL = uniform)
ntry BIGINT, -- 1-100, default 1 (SCiForest)
prob_pick_avg_gain DOUBLE -- 0.0-1.0, default 0.0
) -> TABLE
Parameters
| Parameter | Range | Default | Description |
|---|---|---|---|
n_trees | 1-500 | 100 | Number of isolation trees |
sample_size | 1-10000 | 256 | Subsample size per tree |
contamination | 0.0-0.5 | 0.1 | Expected anomaly fraction |
output_mode | - | 'scores' | 'summary' or 'scores' |
ndim | 1-N | 1 | Hyperplane dimensions (Extended IF) |
coef_type | - | 'uniform' | 'uniform' or 'normal' |
scoring_metric | - | 'depth' | 'depth', 'density', 'adj_depth' |
weight_column | - | NULL | Sample weight column name |
ntry | 1-100 | 1 | Split candidates (SCiForest) |
prob_pick_avg_gain | 0.0-1.0 | 0.0 | Gain-based selection probability |
Contamination guidance:
| Value | Expectation | Use Case |
|---|---|---|
| 0.01 | 1% anomalies | High quality data |
| 0.05 | 5% anomalies | Normal data |
| 0.10 | 10% anomalies | Messy data |
Examples
-- Basic usage (backward compatible)
SELECT * FROM anofox_tab_metric_isolation_forest(
'sales_data', 'amount', 100, 256, 0.1, 'scores'
) WHERE is_anomaly = true;
-- Extended IF with hyperplane splits
SELECT * FROM anofox_tab_metric_isolation_forest(
'transactions', 'amount', 100, 256, 0.1, 'scores',
3, -- ndim: 3-dimensional hyperplanes
'normal', -- coef_type: normal distribution
'depth' -- scoring_metric
);
-- SCiForest with gain-based selection
SELECT * FROM anofox_tab_metric_isolation_forest(
'events', 'value', 100, 256, 0.05, 'scores',
1, 'uniform', 'depth',
NULL, -- weight_column
10, -- ntry: evaluate 10 candidates
0.5 -- prob_pick_avg_gain: 50% gain-based
);
anofox_tab_metric_isolation_forest_multivariate
Multivariate Isolation Forest for multiple column anomaly detection.
Parameters
Same as univariate, but column_name becomes columns (comma-separated).
Example
-- Basic multivariate
SELECT * FROM anofox_tab_metric_isolation_forest_multivariate(
'customer_events', 'purchase_amount, session_duration, page_views',
100, 256, 0.1, 'scores'
) ORDER BY anomaly_score DESC LIMIT 10;
-- Extended IF with density scoring
SELECT * FROM anofox_tab_metric_isolation_forest_multivariate(
'transactions', 'amount, quantity, duration',
100, 256, 0.05, 'scores',
3, -- ndim
'normal', -- coef_type
'density' -- scoring_metric: density-based
);
DBSCAN Clustering
anofox_tab_metric_dbscan
Detect outliers using Density-Based Clustering.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
table_name | VARCHAR | Yes | - | Table to analyze |
column_name | VARCHAR | Yes | - | Column to analyze |
eps | DOUBLE | No | 0.5 | Neighborhood radius |
min_pts | BIGINT | No | 5 | Min points for dense region |
output_mode | VARCHAR | No | 'clusters' | 'summary' or 'clusters' |
Point Classifications:
- CORE - Dense region centers (low anomaly score)
- BORDER - Cluster edges (moderate anomaly score)
- NOISE - Isolated outliers (high anomaly score: 1.0)
Example
SELECT * FROM anofox_tab_metric_dbscan(
'transactions', 'amount', 10.0, 5, 'clusters'
) WHERE point_type = 'NOISE';
anofox_tab_metric_dbscan_multivariate
Multivariate DBSCAN for multiple columns.
SELECT * FROM anofox_tab_metric_dbscan_multivariate(
'customer_events', 'amount, frequency',
0.5, 10, 'clusters'
) WHERE point_type = 'NOISE';
OutlierTree (Explainable Anomaly Detection)
Unlike Isolation Forest which provides anomaly scores, OutlierTree generates human-readable explanations for why specific values are outliers based on conditional distributions.
Key Features
- Detects outliers in context (e.g., "salary is high for a Junior Developer")
- Returns natural language explanations with statistical backing
- Uses robust statistics (median + MAD) resistant to outliers
- Supports both numeric and categorical columns
anofox_tab_outlier_tree
Signatures
-- Simple (3 params, uses defaults)
anofox_tab_outlier_tree(table_name, columns, output_mode) -> TABLE
-- Full (9 params)
anofox_tab_outlier_tree(
table_name VARCHAR,
columns VARCHAR,
output_mode VARCHAR,
max_depth INTEGER,
max_perc_outliers DOUBLE,
min_size_numeric INTEGER,
min_size_categ INTEGER,
z_norm DOUBLE,
z_outlier DOUBLE
) -> TABLE
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
table_name | VARCHAR | required | Source table name |
columns | VARCHAR | required | Comma-separated column names |
output_mode | VARCHAR | 'summary' | 'summary' or 'outliers' |
max_depth | INTEGER | 4 | Maximum tree depth |
max_perc_outliers | DOUBLE | 0.01 | Max fraction of outliers per cluster |
min_size_numeric | INTEGER | 25 | Min cluster size for numeric targets |
min_size_categ | INTEGER | 75 | Min cluster size for categorical targets |
z_norm | DOUBLE | 2.67 | Z-threshold for confidence intervals |
z_outlier | DOUBLE | 8.0 | Z-threshold for outlier flagging |
Output Modes
Summary Mode Returns:
| Column | Type | Description |
|---|---|---|
status | VARCHAR | 'pass' or 'fail' |
total_rows | BIGINT | Rows analyzed |
outlier_count | BIGINT | Outliers detected |
columns_analyzed | INTEGER | Columns analyzed |
clusters_evaluated | BIGINT | Clusters evaluated |
message | VARCHAR | Summary message |
Outliers Mode Returns:
| Column | Type | Description |
|---|---|---|
row_id | BIGINT | Row index (1-indexed) |
column_name | VARCHAR | Column with outlier |
outlier_value | VARCHAR | The anomalous value |
cluster_mean | DOUBLE | Mean in cluster |
cluster_sd | DOUBLE | SD in cluster |
z_score | DOUBLE | Robust z-score |
lower_bound | DOUBLE | Lower CI bound |
upper_bound | DOUBLE | Upper CI bound |
conditions | VARCHAR | JSON array of split conditions |
explanation | VARCHAR | Human-readable explanation |
outlier_score | DOUBLE | Rarity score |
Examples
-- Summary mode: quick pass/fail check
SELECT * FROM anofox_tab_outlier_tree('employees', 'department,salary', 'summary');
-- Outliers mode: get detailed explanations
SELECT row_id, column_name, outlier_value, explanation
FROM anofox_tab_outlier_tree('employees', 'job_title,salary,years_exp', 'outliers');
-- Returns explanations like:
-- "Value 150000 for column 'salary' is unusually high (expected: 52333 +/- 7413)
-- when job_title = 'Junior Developer'"
-- With custom parameters for small datasets
SELECT * FROM anofox_tab_outlier_tree(
'test_data', 'category,value', 'outliers',
4, -- max_depth
0.5, -- max_perc_outliers
3, -- min_size_numeric
2, -- min_size_categ
2.67, -- z_norm
3.0 -- z_outlier
);
Method Comparison
| Method | Type | Speed | Interpretation | Use Case |
|---|---|---|---|---|
| Z-Score | Statistical | Fast | Easy | Single column, normal dist |
| IQR | Statistical | Fast | Easy | Single column, robust |
| Isolation Forest | ML | Medium | Hard | Multiple columns, complex patterns |
| Extended IF | ML | Medium | Hard | Diagonal/curved anomalies |
| DBSCAN | ML | Slow | Medium | Density-based clusters |
| OutlierTree | ML | Medium | Easy | Explainable, contextual outliers |
Decision Tree
How many features?
+-- ONE column
| +-- Normal distribution? -> Z-Score
| +-- Any distribution? -> IQR
|
+-- MULTIPLE columns
+-- Need explanations? -> OutlierTree
+-- Detect complex patterns? -> Extended IF (ndim > 1)
+-- Find density clusters? -> DBSCAN
+-- Maximize detection? -> Isolation Forest
Performance Characteristics
| Function | Speed | Notes |
|---|---|---|
| Z-Score | O(n) | Single pass |
| IQR | O(n log n) | Sorting for quartiles |
| Isolation Forest | O(n log n) | Tree construction |
| Extended IF | O(n log n) | Additional hyperplane calc |
| DBSCAN | O(n^2) | Distance calculations |
| OutlierTree | O(n log n) | Tree building + scoring |