Skip to main content

Anomaly Detection

Statistical and machine learning methods for outlier detection.

Quick Reference

FunctionDescriptionSQL Signature
anofox_tab_metric_zscoreZ-score outlier(table, column, threshold) -> TABLE
anofox_tab_metric_iqrIQR outlier(table, column, multiplier) -> TABLE
anofox_tab_metric_isolation_forestUnivariate ML(table, column, ...) -> TABLE
anofox_tab_metric_isolation_forest_multivariateMultivariate ML(table, columns, ...) -> TABLE
anofox_tab_metric_dbscanDensity-based(table, column, eps, min_pts, mode) -> TABLE
anofox_tab_outlier_treeExplainable outliers(table, columns, mode) -> TABLE

Statistical Methods (2)

anofox_tab_metric_zscore

Detect outliers using Z-score method.

Parameters

ParameterTypeRequiredDefaultDescription
table_nameVARCHARYes-Table to analyze
column_nameVARCHARYes-Column to analyze
thresholdDOUBLENo3.0Z-score threshold

Method: z = (value - mean) / std_dev, anomaly = |z| > threshold

Thresholds:

Threshold% in DistributionOutliers
2.095%5% of data
2.598.8%1.2% of data
3.099.7%0.3% of data

Example

SELECT * FROM anofox_tab_metric_zscore('transactions', 'amount', 3.0)
WHERE is_outlier = true;

anofox_tab_metric_iqr

Detect outliers using Interquartile Range.

Parameters

ParameterTypeRequiredDefaultDescription
table_nameVARCHARYes-Table to analyze
column_nameVARCHARYes-Column to analyze
multiplierDOUBLENo1.5IQR multiplier

Method:

  • IQR = Q3 - Q1
  • lower = Q1 - multiplier * IQR
  • upper = Q3 + multiplier * IQR
  • anomaly = value < lower OR value > upper

Example

SELECT * FROM anofox_tab_metric_iqr('transactions', 'amount', 1.5);

Pros over Z-score:

  • Robust to extreme outliers
  • Works with any distribution
  • Doesn't require normality assumption

Isolation Forest (Enhanced)

Industry-grade anomaly detection with advanced capabilities.

Core Capabilities

  • Categorical Support - Auto-detect VARCHAR columns with random subset splitting
  • Extended IF (ndim) - Hyperplane splits for diagonal/curved anomaly patterns
  • Density Scoring - Alternative metric based on points-to-volume ratio
  • Sample Weights - Weighted sampling for imbalanced datasets
  • SCiForest - Information-gain guided splitting

anofox_tab_metric_isolation_forest

Univariate Isolation Forest for single column anomaly detection.

Full Signature

anofox_tab_metric_isolation_forest(
table_name VARCHAR,
column_name VARCHAR,
n_trees BIGINT, -- 1-500, default 100
sample_size BIGINT, -- 1-10000, default 256
contamination DOUBLE, -- 0.0-0.5, default 0.1
output_mode VARCHAR, -- 'summary' or 'scores'
ndim BIGINT, -- 1-N, default 1 (Extended IF)
coef_type VARCHAR, -- 'uniform' or 'normal'
scoring_metric VARCHAR, -- 'depth', 'density', or 'adj_depth'
weight_column VARCHAR, -- Column for sample weights (NULL = uniform)
ntry BIGINT, -- 1-100, default 1 (SCiForest)
prob_pick_avg_gain DOUBLE -- 0.0-1.0, default 0.0
) -> TABLE

Parameters

ParameterRangeDefaultDescription
n_trees1-500100Number of isolation trees
sample_size1-10000256Subsample size per tree
contamination0.0-0.50.1Expected anomaly fraction
output_mode-'scores''summary' or 'scores'
ndim1-N1Hyperplane dimensions (Extended IF)
coef_type-'uniform''uniform' or 'normal'
scoring_metric-'depth''depth', 'density', 'adj_depth'
weight_column-NULLSample weight column name
ntry1-1001Split candidates (SCiForest)
prob_pick_avg_gain0.0-1.00.0Gain-based selection probability

Contamination guidance:

ValueExpectationUse Case
0.011% anomaliesHigh quality data
0.055% anomaliesNormal data
0.1010% anomaliesMessy data

Examples

-- Basic usage (backward compatible)
SELECT * FROM anofox_tab_metric_isolation_forest(
'sales_data', 'amount', 100, 256, 0.1, 'scores'
) WHERE is_anomaly = true;

-- Extended IF with hyperplane splits
SELECT * FROM anofox_tab_metric_isolation_forest(
'transactions', 'amount', 100, 256, 0.1, 'scores',
3, -- ndim: 3-dimensional hyperplanes
'normal', -- coef_type: normal distribution
'depth' -- scoring_metric
);

-- SCiForest with gain-based selection
SELECT * FROM anofox_tab_metric_isolation_forest(
'events', 'value', 100, 256, 0.05, 'scores',
1, 'uniform', 'depth',
NULL, -- weight_column
10, -- ntry: evaluate 10 candidates
0.5 -- prob_pick_avg_gain: 50% gain-based
);

anofox_tab_metric_isolation_forest_multivariate

Multivariate Isolation Forest for multiple column anomaly detection.

Parameters

Same as univariate, but column_name becomes columns (comma-separated).

Example

-- Basic multivariate
SELECT * FROM anofox_tab_metric_isolation_forest_multivariate(
'customer_events', 'purchase_amount, session_duration, page_views',
100, 256, 0.1, 'scores'
) ORDER BY anomaly_score DESC LIMIT 10;

-- Extended IF with density scoring
SELECT * FROM anofox_tab_metric_isolation_forest_multivariate(
'transactions', 'amount, quantity, duration',
100, 256, 0.05, 'scores',
3, -- ndim
'normal', -- coef_type
'density' -- scoring_metric: density-based
);

DBSCAN Clustering

anofox_tab_metric_dbscan

Detect outliers using Density-Based Clustering.

Parameters

ParameterTypeRequiredDefaultDescription
table_nameVARCHARYes-Table to analyze
column_nameVARCHARYes-Column to analyze
epsDOUBLENo0.5Neighborhood radius
min_ptsBIGINTNo5Min points for dense region
output_modeVARCHARNo'clusters''summary' or 'clusters'

Point Classifications:

  • CORE - Dense region centers (low anomaly score)
  • BORDER - Cluster edges (moderate anomaly score)
  • NOISE - Isolated outliers (high anomaly score: 1.0)

Example

SELECT * FROM anofox_tab_metric_dbscan(
'transactions', 'amount', 10.0, 5, 'clusters'
) WHERE point_type = 'NOISE';

anofox_tab_metric_dbscan_multivariate

Multivariate DBSCAN for multiple columns.

SELECT * FROM anofox_tab_metric_dbscan_multivariate(
'customer_events', 'amount, frequency',
0.5, 10, 'clusters'
) WHERE point_type = 'NOISE';

OutlierTree (Explainable Anomaly Detection)

Unlike Isolation Forest which provides anomaly scores, OutlierTree generates human-readable explanations for why specific values are outliers based on conditional distributions.

Key Features

  • Detects outliers in context (e.g., "salary is high for a Junior Developer")
  • Returns natural language explanations with statistical backing
  • Uses robust statistics (median + MAD) resistant to outliers
  • Supports both numeric and categorical columns

anofox_tab_outlier_tree

Signatures

-- Simple (3 params, uses defaults)
anofox_tab_outlier_tree(table_name, columns, output_mode) -> TABLE

-- Full (9 params)
anofox_tab_outlier_tree(
table_name VARCHAR,
columns VARCHAR,
output_mode VARCHAR,
max_depth INTEGER,
max_perc_outliers DOUBLE,
min_size_numeric INTEGER,
min_size_categ INTEGER,
z_norm DOUBLE,
z_outlier DOUBLE
) -> TABLE

Parameters

ParameterTypeDefaultDescription
table_nameVARCHARrequiredSource table name
columnsVARCHARrequiredComma-separated column names
output_modeVARCHAR'summary''summary' or 'outliers'
max_depthINTEGER4Maximum tree depth
max_perc_outliersDOUBLE0.01Max fraction of outliers per cluster
min_size_numericINTEGER25Min cluster size for numeric targets
min_size_categINTEGER75Min cluster size for categorical targets
z_normDOUBLE2.67Z-threshold for confidence intervals
z_outlierDOUBLE8.0Z-threshold for outlier flagging

Output Modes

Summary Mode Returns:

ColumnTypeDescription
statusVARCHAR'pass' or 'fail'
total_rowsBIGINTRows analyzed
outlier_countBIGINTOutliers detected
columns_analyzedINTEGERColumns analyzed
clusters_evaluatedBIGINTClusters evaluated
messageVARCHARSummary message

Outliers Mode Returns:

ColumnTypeDescription
row_idBIGINTRow index (1-indexed)
column_nameVARCHARColumn with outlier
outlier_valueVARCHARThe anomalous value
cluster_meanDOUBLEMean in cluster
cluster_sdDOUBLESD in cluster
z_scoreDOUBLERobust z-score
lower_boundDOUBLELower CI bound
upper_boundDOUBLEUpper CI bound
conditionsVARCHARJSON array of split conditions
explanationVARCHARHuman-readable explanation
outlier_scoreDOUBLERarity score

Examples

-- Summary mode: quick pass/fail check
SELECT * FROM anofox_tab_outlier_tree('employees', 'department,salary', 'summary');

-- Outliers mode: get detailed explanations
SELECT row_id, column_name, outlier_value, explanation
FROM anofox_tab_outlier_tree('employees', 'job_title,salary,years_exp', 'outliers');

-- Returns explanations like:
-- "Value 150000 for column 'salary' is unusually high (expected: 52333 +/- 7413)
-- when job_title = 'Junior Developer'"

-- With custom parameters for small datasets
SELECT * FROM anofox_tab_outlier_tree(
'test_data', 'category,value', 'outliers',
4, -- max_depth
0.5, -- max_perc_outliers
3, -- min_size_numeric
2, -- min_size_categ
2.67, -- z_norm
3.0 -- z_outlier
);

Method Comparison

MethodTypeSpeedInterpretationUse Case
Z-ScoreStatisticalFastEasySingle column, normal dist
IQRStatisticalFastEasySingle column, robust
Isolation ForestMLMediumHardMultiple columns, complex patterns
Extended IFMLMediumHardDiagonal/curved anomalies
DBSCANMLSlowMediumDensity-based clusters
OutlierTreeMLMediumEasyExplainable, contextual outliers

Decision Tree

How many features?
+-- ONE column
| +-- Normal distribution? -> Z-Score
| +-- Any distribution? -> IQR
|
+-- MULTIPLE columns
+-- Need explanations? -> OutlierTree
+-- Detect complex patterns? -> Extended IF (ndim > 1)
+-- Find density clusters? -> DBSCAN
+-- Maximize detection? -> Isolation Forest

Performance Characteristics

FunctionSpeedNotes
Z-ScoreO(n)Single pass
IQRO(n log n)Sorting for quartiles
Isolation ForestO(n log n)Tree construction
Extended IFO(n log n)Additional hyperplane calc
DBSCANO(n^2)Distance calculations
OutlierTreeO(n log n)Tree building + scoring

🍪 Cookie Settings