Skip to main content

Anomaly Detection

Statistical and machine learning methods for outlier detection.


Overview

Anomaly detection identifies unusual values that deviate from the norm. Methods are categorized as:

  1. Statistical (univariate): Work on single columns
  2. Machine Learning (multivariate): Work across multiple columns

Statistical Methods (2 functions)

anofox_zscore_anomaly

Detect outliers using Z-score method.

Signature:

anofox_zscore_anomaly(value, threshold) OVER ()BOOLEAN

Parameters:

  • value: Column to analyze
  • threshold: Z-score threshold (typically 2.0, 2.5, or 3.0)

Returns: TRUE if outlier, FALSE otherwise

Speed: O(n) full scan + aggregate

Method:

z = (value - mean) / std_dev
anomaly = |z| > threshold

Example:

SELECT
transaction_id,
amount,
anofox_zscore_anomaly(amount, 3.0) OVER () as is_outlier
FROM transactions
WHERE anofox_zscore_anomaly(amount, 3.0) OVER () = TRUE
ORDER BY amount DESC;

Thresholds & Interpretation:

Threshold% in DistributionOutliersUse Case
2.095%5% of dataLoose detection
2.598.8%1.2% of dataModerate detection
3.099.7%0.3% of dataStrict detection

Pros:

  • Very fast (simple calculation)
  • Easy to interpret and explain
  • Works well with normal distributions

Cons:

  • Assumes normal distribution
  • Sensitive to extreme outliers (they skew the mean)
  • Can't handle multimodal data

anofox_iqr_anomaly

Detect outliers using Interquartile Range (IQR) method.

Signature:

anofox_iqr_anomaly(value, multiplier) OVER ()BOOLEAN

Parameters:

  • value: Column to analyze
  • multiplier: IQR multiplier (typically 1.5, 2.0, 3.0)

Returns: TRUE if outlier, FALSE otherwise

Speed: O(n log n) for sorting/percentiles

Method:

IQR = Q3 - Q1
lower = Q1 - multiplier * IQR
upper = Q3 + multiplier * IQR
anomaly = value < lower OR value > upper

Example:

SELECT
transaction_id,
amount,
anofox_iqr_anomaly(amount, 1.5) OVER () as is_outlier
FROM transactions
WHERE anofox_iqr_anomaly(amount, 1.5) OVER () = TRUE
ORDER BY amount DESC;

Multiplier Sensitivity:

MultiplierStrictnessTypical Use
1.0Very looseOnly extreme values
1.5StandardMost use cases (box plot default)
2.0StrictConservative filtering
3.0Very strictCritical systems only

Pros:

  • Robust to extreme outliers
  • Works with any distribution
  • Well-known method (box plots)
  • Doesn't require normality assumption

Cons:

  • Slower than Z-score (needs sorting)
  • May miss subtle patterns
  • Less sensitive to gradual drift

Machine Learning Methods (2 functions)

anofox_isolation_forest_anomaly

Detect outliers using Isolation Forest algorithm.

Signature:

anofox_isolation_forest_anomaly(features, contamination) OVER ()BOOLEAN

Parameters:

  • features: ARRAY of numeric columns to analyze (multivariate)
  • contamination: Expected proportion of anomalies (0.0-1.0)

Returns: TRUE if outlier, FALSE otherwise

Speed: O(n log n) per feature

Method:

  • Isolation Forest isolates anomalies by randomly selecting features and split values
  • Anomalies are easier to isolate, so they have shorter paths
  • Anomaly score based on path length

Example:

-- Detect unusual customers (combining multiple features)
SELECT
customer_id,
amount,
frequency,
days_since_purchase,
anofox_isolation_forest_anomaly(
[amount, frequency, days_since_purchase],
0.05 -- Expect 5% anomalies
) OVER () as is_anomaly
FROM customers
WHERE anofox_isolation_forest_anomaly(
[amount, frequency, days_since_purchase],
0.05
) OVER () = TRUE
ORDER BY amount DESC;

Contamination Guidance:

ContaminationExpectationUse Case
0.011% anomaliesVery high quality data
0.055% anomaliesNormal data
0.1010% anomaliesRough/messy data
0.2020% anomaliesVery loose detection

Pros:

  • Works with multiple dimensions naturally
  • No distribution assumptions
  • Fast O(n log n)
  • Handles complex patterns

Cons:

  • Less interpretable ("why" is harder)
  • Requires contamination rate tuning
  • Needs multiple features to be useful

anofox_dbscan_anomaly

Detect outliers using Density-Based Clustering.

Signature:

anofox_dbscan_anomaly(features, eps, min_pts) OVER ()BOOLEAN

Parameters:

  • features: ARRAY of numeric columns (multivariate)
  • eps: Distance threshold for neighbors (0.1-2.0 typical)
  • min_pts: Minimum points to form a cluster (5-20 typical)

Returns: TRUE if outlier (in sparse region), FALSE otherwise

Speed: O(n²) in worst case, O(n log n) with spatial index

Method:

  • Groups points by density
  • Points in sparse regions = anomalies
  • Points not in any cluster = noise/anomalies

Example:

-- Detect sparse clusters in behavior
SELECT
customer_id,
amount,
frequency,
anofox_dbscan_anomaly(
[amount, frequency],
eps := 0.5, -- Distance threshold
min_pts := 10 -- Min cluster members
) OVER () as is_anomaly
FROM customers
WHERE anofox_dbscan_anomaly(
[amount, frequency],
0.5, 10
) OVER () = TRUE;

Parameter Tuning:

  • eps (small → large): Stricter → looser anomaly detection

    • 0.1-0.3: Very strict (few clusters)
    • 0.5-1.0: Moderate (typical)
    • 1.5-2.0: Loose (large clusters)
  • min_pts (small → large): Looser → stricter definition of "cluster"

    • 3-5: Loose (small clusters ok)
    • 10-20: Moderate (typical)
    • 50+: Strict (must be large cluster)

Pros:

  • Finds density-based anomalies naturally
  • No contamination rate tuning needed
  • Works with irregular cluster shapes
  • Can find clusters within outliers

Cons:

  • Two parameters to tune (eps + min_pts)
  • Slower than Isolation Forest
  • Sensitive to parameter choices
  • Works better with 2-3 features

Comparison & Selection

MethodTypeSpeedInterpretationUse Case
Z-ScoreStatistical⚡⚡⚡EasySingle column, normal dist
IQRStatistical⚡⚡EasySingle column, robust
Isolation ForestML⚡⚡HardMultiple columns, patterns
DBSCANMLMediumDensity-based, clusters

Decision Tree

Which method should I use?

1. How many features?
├─ ONE column → Continue
│ ├─ Normal distribution? YES → Z-Score
│ └─ Any distribution? YES → IQR

└─ MULTIPLE columns → Continue
├─ Need interpretability? YES → DBSCAN (with tuning)
└─ Maximize detection? YES → Isolation Forest

Practical Examples

Example 1: Sales Anomalies (Single Column)

-- Detect unusual sales amounts
SELECT
date,
region,
sales_amount,
anofox_zscore_anomaly(sales_amount, 3.0) OVER (PARTITION BY region) as is_anomaly_zscore,
anofox_iqr_anomaly(sales_amount, 1.5) OVER (PARTITION BY region) as is_anomaly_iqr
FROM daily_sales
WHERE anofox_zscore_anomaly(sales_amount, 3.0) OVER (PARTITION BY region) = TRUE
ORDER BY sales_amount DESC;

Example 2: Customer Behavior (Multiple Columns)

-- Detect unusual customer behavior
SELECT
customer_id,
purchase_amount,
purchase_frequency,
days_since_last_purchase,
anofox_isolation_forest_anomaly(
[purchase_amount, purchase_frequency, days_since_last_purchase],
0.05
) OVER () as is_anomaly
FROM customer_summary
WHERE anofox_isolation_forest_anomaly(
[purchase_amount, purchase_frequency, days_since_last_purchase],
0.05
) OVER () = TRUE
ORDER BY purchase_amount DESC;

Example 3: Multi-Stage Detection

-- Combine methods for robust detection
CREATE TABLE anomalies_detected AS
SELECT
transaction_id,
amount,
CASE
WHEN anofox_zscore_anomaly(amount, 3.0) OVER () THEN 'z_score'
WHEN anofox_iqr_anomaly(amount, 1.5) OVER () THEN 'iqr'
ELSE NULL
END as anomaly_method_univariate,
CASE
WHEN anofox_isolation_forest_anomaly([amount], 0.05) OVER () THEN 'forest'
ELSE NULL
END as anomaly_method_ml
FROM transactions;

-- Flag anything detected by any method
SELECT * FROM anomalies_detected
WHERE anomaly_method_univariate IS NOT NULL
OR anomaly_method_ml IS NOT NULL;

Window Function Usage

All anomaly detection functions work with OVER clauses:

-- Global detection (all data)
anofox_zscore_anomaly(amount, 3.0) OVER ()

-- Partition by group
anofox_zscore_anomaly(amount, 3.0) OVER (PARTITION BY region)

-- Partition with ordering (time-series)
anofox_zscore_anomaly(amount, 3.0) OVER (
PARTITION BY customer_id
ORDER BY date
ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
)

Best Practices

✅ Start simple (Z-Score or IQR)
✅ Visualize results before using for action
✅ Validate thresholds on test data
✅ Combine methods for robustness
✅ Tune parameters on historical anomalies
✅ Monitor anomaly rates over time


Next Steps

🍪 Cookie Settings