Anomaly Detection Methods
Find unusual values in your data—statistical outliers and distributional anomalies.
What is an Anomaly?
An anomaly is a data point that significantly deviates from the normal distribution or pattern.
Examples:
- Sales amount $5,000 when typical is $100 = anomalous
- Customer age 150 when valid range is 18-120 = anomalous
- 10 orders in 1 second when typical is 0.1 per second = anomalous
Two Detection Approaches
1. Statistical Methods (Univariate)
Detect: Single-variable outliers based on distribution.
Cost: O(n) time, very fast
Assumption: Data is roughly normally distributed
Use when: Looking for obvious outliers in single numeric columns
Z-Score Method
z = (x - mean) / std_dev
Flag x as an outlier if |z| > threshold. The typical threshold is 3.0: roughly 99.7% of normally distributed data falls within 3 standard deviations of the mean, so anything beyond is rare.
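For intuition, take the $5,000 sale from the examples above and assume (hypothetical numbers) mean = $100 and std_dev = $50: z = (5000 - 100) / 50 = 98, far beyond any reasonable threshold.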
Example:
-- Detect sales amounts > 3 standard deviations from mean
SELECT
transaction_id,
amount,
anofox_zscore_anomaly(amount, 3.0) OVER () as is_anomaly
FROM transactions;
Interpretation:
- z = 0: Value equals mean
- |z| ≤ 1: Within 1 std_dev (68% of data)
- |z| ≤ 2: Within 2 std_dev (95% of data)
- |z| ≤ 3: Within 3 std_dev (99.7% of data)
- |z| > 3: Extreme outlier (0.3% of data)
Pros:
- Fast, simple, interpretable
- Works well for normally distributed data
- Easy to explain to stakeholders
Cons:
- Assumes normal distribution (breaks with skewed data)
- Doesn't account for clusters or multimodal distributions
- Sensitive to extreme outliers that skew the mean
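The same check can also be written without the extension. A minimal sketch in plain SQL, assuming the transactions table from the example above and a dialect that supports AVG and STDDEV as window functions (e.g. PostgreSQL, DuckDB):

-- Manual z-score check: flag rows more than 3 std_devs from the mean
SELECT
  transaction_id,
  amount,
  ABS(amount - AVG(amount) OVER ())
    / NULLIF(STDDEV(amount) OVER (), 0) > 3.0 AS is_anomaly  -- NULLIF guards against zero std_dev
FROM transactions;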
IQR (Interquartile Range) Method
Outlier if x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR
IQR = Q3 - Q1 (middle 50% of data)
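For intuition, with hypothetical quartiles Q1 = $80 and Q3 = $120: IQR = $40, so the fences sit at 80 - 1.5*40 = $20 and 120 + 1.5*40 = $180, and the $5,000 sale is flagged.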
Example:
-- Detect amounts outside typical range
SELECT
transaction_id,
amount,
anofox_iqr_anomaly(amount, 1.5) OVER () as is_anomaly
FROM transactions;
Interpretation:
- Flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (beyond the box-plot whiskers)
- Robust to extreme outliers (quartiles are barely affected by them)
Pros:
- Robust to extreme values
- Works with skewed data
- Easy to explain (whiskers on box plots)
Cons:
- Less sensitive to gradual drift
- Computing quartiles requires sorting, so it is O(n log n) rather than O(n)
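As with the z-score, the check can be expressed in plain SQL. A sketch assuming a dialect with the percentile_cont ordered-set aggregate (e.g. PostgreSQL):

-- Manual IQR check: compute quartiles once, then apply the fences
WITH quartiles AS (
  SELECT
    percentile_cont(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY amount) AS q3
  FROM transactions
)
SELECT
  t.transaction_id,
  t.amount,
  t.amount < q.q1 - 1.5 * (q.q3 - q.q1)
    OR t.amount > q.q3 + 1.5 * (q.q3 - q.q1) AS is_anomaly
FROM transactions t CROSS JOIN quartiles q;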
2. Machine Learning Methods (Multivariate)
Detect: Anomalies in multiple dimensions simultaneously.
Cost: O(n log n) time or worse, slower than the statistical methods
Assumption: None (data-driven)
Use when: Looking for patterns across multiple columns
Isolation Forest
Builds random trees by repeatedly picking a random feature and a random split value. Anomalies sit in sparse regions, so they are separated from the rest in fewer splits; a short average path length marks a point as anomalous.
Example:
-- Detect unusual customers (combining multiple features)
SELECT
customer_id,
amount,
frequency,
days_since_purchase,
anofox_isolation_forest_anomaly(
[amount, frequency, days_since_purchase],
0.1 -- Contamination rate: expect 10% anomalies
) OVER () as is_anomaly
FROM customers;
Parameters:
- features (array): Columns to analyze
- contamination (0.0-1.0): Expected proportion of anomalies (0.05 = 5%)
Pros:
- Detects complex patterns
- Handles multiple dimensions naturally
- Doesn't assume distribution
- Fast O(n log n)
Cons:
- Less interpretable ("why" is harder to explain)
- Requires tuning contamination rate
- Needs multiple features to be useful
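To act on the result, wrap the scored query and keep only the flagged rows. A sketch reusing the customers example above:

-- Pull only flagged customers for manual review
WITH scored AS (
  SELECT
    customer_id, amount, frequency, days_since_purchase,
    anofox_isolation_forest_anomaly(
      [amount, frequency, days_since_purchase],
      0.05 -- stricter than the 0.1 above: expect 5% anomalies
    ) OVER () AS is_anomaly
  FROM customers
)
SELECT * FROM scored WHERE is_anomaly;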
DBSCAN (Density-Based Clustering)
Groups points by density. Points in low-density regions = anomalies.
Example:
-- Detect sparse clusters in customer behavior
SELECT
customer_id,
amount,
frequency,
anofox_dbscan_anomaly(
[amount, frequency],
eps := 0.5, -- Distance threshold
min_pts := 10 -- Min points in neighborhood
) OVER () as is_anomaly
FROM customers;
Parameters:
- eps (distance): How far to search for neighbors
- min_pts: How many neighbors make a valid cluster
Pros:
- Finds density-based anomalies (naturally sparse points)
- Doesn't require setting contamination rate
- Works with irregular cluster shapes
Cons:
- Requires parameter tuning (eps, min_pts)
- Slower for high-dimensional data
- Less predictable than Isolation Forest
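Because DBSCAN is distance-based, features on different scales distort eps (the guideline of eps = 0.5 below assumes normalized data). A sketch that z-score-normalizes the features first, reusing the customers example:

-- Normalize features so one eps value applies to both dimensions
WITH normalized AS (
  SELECT
    customer_id,
    (amount - AVG(amount) OVER ())
      / NULLIF(STDDEV(amount) OVER (), 0) AS amount_z,
    (frequency - AVG(frequency) OVER ())
      / NULLIF(STDDEV(frequency) OVER (), 0) AS frequency_z
  FROM customers
)
SELECT
  customer_id,
  anofox_dbscan_anomaly(
    [amount_z, frequency_z],
    eps := 0.5,
    min_pts := 10
  ) OVER () AS is_anomaly
FROM normalized;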
Comparison Matrix
| Method | Speed | Multivariate | Interpretable | Use Case |
|---|---|---|---|---|
| Z-Score | ⚡⚡⚡ | No | ✅✅✅ | Single column, normal distribution |
| IQR | ⚡⚡⚡ | No | ✅✅ | Single column, robust, visual |
| Isolation Forest | ⚡⚡ | ✅ | ⚠️ | Multiple columns, complex patterns |
| DBSCAN | ⚡ | ✅ | ⚠️ | Density-based, irregular clusters |
Decision Tree
Start: Need to detect anomalies
↓
Single column or multiple columns?
├─ Single → Continue
│ ↓
│ Is data normally distributed?
│ ├─ YES → Use Z-Score
│ └─ NO → Use IQR
│
└─ Multiple → Continue
↓
Do you know the expected anomaly rate?
├─ YES → Use Isolation Forest (set contamination)
└─ NO → Use DBSCAN (tune eps/min_pts)
Practical Example: Multi-Stage Detection
-- Stage 1: Quick syntactic check
CREATE TABLE stage1_syntax AS
SELECT
transaction_id,
amount,
customer_id
FROM transactions
WHERE amount > 0 AND amount < 1000000;
-- Stage 2: Statistical outliers
CREATE TABLE stage2_statistical AS
SELECT
*,
anofox_zscore_anomaly(amount, 3.0) OVER () as z_anomaly,
anofox_iqr_anomaly(amount, 1.5) OVER () as iqr_anomaly
FROM stage1_syntax;
-- Stage 3: Multivariate patterns
CREATE TABLE stage3_multivariate AS
SELECT
*,
anofox_isolation_forest_anomaly(
[amount, customer_id], -- customer_id is a stand-in here; prefer real numeric features
0.05 -- expect 5% anomalies
) OVER () as forest_anomaly
FROM stage2_statistical;
-- Final: Flag anything flagged by any method
SELECT
transaction_id,
amount,
CASE
WHEN z_anomaly OR iqr_anomaly OR forest_anomaly THEN 'ANOMALY'
ELSE 'NORMAL'
END as status
FROM stage3_multivariate
ORDER BY status; -- 'ANOMALY' sorts before 'NORMAL'
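A useful follow-up is to compare how many rows each method flags; large disagreement usually means a threshold needs tuning. A sketch, assuming a dialect with the FILTER clause (e.g. PostgreSQL, DuckDB):

-- Count flags per method to sanity-check the thresholds
SELECT
  COUNT(*) FILTER (WHERE z_anomaly) AS z_flags,
  COUNT(*) FILTER (WHERE iqr_anomaly) AS iqr_flags,
  COUNT(*) FILTER (WHERE forest_anomaly) AS forest_flags,
  COUNT(*) AS total_rows
FROM stage3_multivariate;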
Parameter Guidelines
Contamination Rate:
- Start with 0.05 (5% anomalies)
- Adjust based on domain knowledge
- 0.01 for strict quality (1% anomalies)
- 0.10 for loose detection (10% anomalies)
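One way to choose between these rates is to run several side by side and inspect the rows where the flags disagree. A sketch reusing the documented function signature (the columns are illustrative):

-- Compare strict, default, and loose contamination rates
SELECT
  customer_id,
  anofox_isolation_forest_anomaly([amount, frequency], 0.01) OVER () AS strict_flag,
  anofox_isolation_forest_anomaly([amount, frequency], 0.05) OVER () AS default_flag,
  anofox_isolation_forest_anomaly([amount, frequency], 0.10) OVER () AS loose_flag
FROM customers;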
DBSCAN eps:
- Start with 0.5 for normalized data
- Try: 0.3, 0.5, 1.0, 2.0 and observe results
- Smaller eps = stricter grouping
Z-Score threshold:
- 2.0: 95% confidence (catches ~5% of data)
- 3.0: 99.7% confidence (catches ~0.3% of data)
- 4.0: 99.99% confidence (catches ~0.01% of data)
Next Steps
- Data Profiling — Understand your data's normal patterns
- Basic Workflow — End-to-end validation pipeline
- Anomaly Detection Reference — Function API details