
Anomaly Detection Methods

Find unusual values in your data—statistical outliers and distributional anomalies.


What is an Anomaly?

An anomaly is a data point that significantly deviates from the normal distribution or pattern.

Examples:

  • Sales amount $5,000 when typical is $100 = anomalous
  • Customer age 150 when valid range is 18-120 = anomalous
  • 10 orders in 1 second when typical is 0.1 per second = anomalous

Two Detection Approaches


1. Statistical Methods (Univariate)

Detect: Single-variable outliers based on distribution.

Cost: O(n) time, very fast
Assumption: Data is roughly normally distributed
Use when: Looking for obvious outliers in single numeric columns

Z-Score Method

Outlier if |z| > threshold (typically 3.0, which covers 99.7% of normally distributed data)

z = (x - mean) / std_dev

Example:

-- Detect sales amounts > 3 standard deviations from the mean
SELECT
    transaction_id,
    amount,
    anofox_zscore_anomaly(amount, 3.0) OVER () as is_anomaly
FROM transactions;

Interpretation:

  • z = 0: Value equals mean
  • z = ±1: Within 1 std_dev (68% of data)
  • z = ±2: Within 2 std_dev (95% of data)
  • z = ±3: Within 3 std_dev (99.7% of data)
  • |z| > 3: Extreme outlier (0.3% of data)

Pros:

  • Fast, simple, interpretable
  • Works well for normally distributed data
  • Easy to explain to stakeholders

Cons:

  • Assumes normal distribution (breaks with skewed data)
  • Doesn't account for clusters or multimodal distributions
  • Sensitive to extreme outliers that skew the mean
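
For intuition, the z-score rule is easy to reproduce outside the database. A minimal pure-Python sketch (illustrative only; in SQL you would use `anofox_zscore_anomaly` as shown above):

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Flag values whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    return [abs((x - mu) / sigma) > threshold for x in values]

# 30 typical amounts around $100 plus one $5,000 outlier
amounts = [95, 98, 100, 102, 105] * 6 + [5000]
flags = zscore_anomalies(amounts, threshold=3.0)  # only the $5,000 value is flagged
```

Note the last con above in action: a single extreme value inflates both the mean and the standard deviation, so with very small samples even a huge outlier may not reach |z| > 3.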

IQR (Interquartile Range) Method

Outlier if x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR

IQR = Q3 - Q1  (middle 50% of data)

Example:

-- Detect amounts outside typical range
SELECT
    transaction_id,
    amount,
    anofox_iqr_anomaly(amount, 1.5) OVER () as is_anomaly
FROM transactions;

Interpretation:

  • Flags values beyond the whiskers: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
  • Robust to extreme outliers (doesn't use them in calculation)

Pros:

  • Robust to extreme values
  • Works with skewed data
  • Easy to explain (whiskers on box plots)

Cons:

  • Less sensitive to gradual drift
  • Requires computing quartiles, which involves sorting (slower on large datasets)
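
The same fences can be sketched in pure Python (illustrative only; the SQL function `anofox_iqr_anomaly` above does this for you):

```python
from statistics import quantiles

def iqr_anomalies(values, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (exclusive method)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x < lo or x > hi for x in values]

# 20 typical amounts plus one extreme value
amounts = [95, 98, 100, 102, 105] * 4 + [5000]
flags = iqr_anomalies(amounts)  # only the extreme value is flagged
```

Because the quartiles come from the middle of the sorted data, the $5,000 value barely moves the fences, which is exactly the robustness the pros above describe.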

2. Machine Learning Methods (Multivariate)

Detect: Anomalies in multiple dimensions simultaneously.

Cost: O(n log n) time (slower)
Assumption: None (data-driven)
Use when: Looking for patterns across multiple columns

Isolation Forest

Builds random trees by repeatedly choosing a random feature and a random split value. Anomalies sit in sparse regions, so they are isolated in fewer splits than normal points.

Example:

-- Detect unusual customers (combining multiple features)
SELECT
    customer_id,
    amount,
    frequency,
    days_since_purchase,
    anofox_isolation_forest_anomaly(
        [amount, frequency, days_since_purchase],
        0.1 -- Contamination rate: expect 10% anomalies
    ) OVER () as is_anomaly
FROM customers;

Parameters:

  • features (array): Columns to analyze
  • contamination (0.0-1.0): Expected proportion of anomalies (0.05 = 5%)

Pros:

  • Detects complex patterns
  • Handles multiple dimensions naturally
  • Doesn't assume distribution
  • Fast O(n log n)

Cons:

  • Less interpretable ("why" is harder to explain)
  • Requires tuning contamination rate
  • Needs multiple features to be useful
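
To see why isolation works, here is a deliberately simplified pure-Python sketch (not the extension's implementation: it re-walks a random tree per point per tree rather than building shared trees, but the scoring formula is the standard one):

```python
import math
import random

def _c(n):
    """Normalization term: average path length of an unsuccessful
    search in a binary search tree of n points."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def _path_length(point, data, depth, limit, rng):
    """Depth at which `point` is isolated under random axis-aligned splits."""
    if depth >= limit or len(data) <= 1:
        return depth + _c(len(data))
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth + _c(len(data))
    split = rng.uniform(lo, hi)
    # Keep only the rows on the same side of the split as `point`
    side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return _path_length(point, side, depth + 1, limit, rng)

def isolation_scores(data, n_trees=100, sample_size=16, seed=0):
    """Anomaly score in (0, 1] per point; higher = more anomalous."""
    rng = random.Random(seed)
    limit = math.ceil(math.log2(sample_size))
    scores = []
    for point in data:
        avg = sum(
            _path_length(point, rng.sample(data, sample_size), 0, limit, rng)
            for _ in range(n_trees)
        ) / n_trees
        scores.append(2.0 ** (-avg / _c(sample_size)))
    return scores

# A dense cluster plus one far-away point
pts = [(100 + i % 7, 5 + i % 3) for i in range(30)] + [(5000.0, 200.0)]
scores = isolation_scores(pts)
```

The far-away point gets isolated after very few random splits, so its average path is short and its score is the highest in the list.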

DBSCAN (Density-Based Clustering)

Groups points by density. Points in low-density regions = anomalies.

Example:

-- Detect sparse clusters in customer behavior
SELECT
    customer_id,
    amount,
    frequency,
    anofox_dbscan_anomaly(
        [amount, frequency],
        eps := 0.5, -- Distance threshold
        min_pts := 10 -- Min points in neighborhood
    ) OVER () as is_anomaly
FROM customers;

Parameters:

  • eps (distance): How far to search for neighbors
  • min_pts: How many neighbors make a valid cluster

Pros:

  • Finds density-based anomalies (naturally sparse points)
  • Doesn't require setting contamination rate
  • Works with irregular cluster shapes

Cons:

  • Requires parameter tuning (eps, min_pts)
  • Slower for high-dimensional data
  • Less predictable than Isolation Forest
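
The underlying algorithm is compact enough to sketch in pure Python (illustrative only, with a naive O(n²) neighbor search; the extension's `anofox_dbscan_anomaly` handles this in SQL):

```python
import math

def dbscan_noise(points, eps=0.5, min_pts=10):
    """Return one flag per point: True if the point is noise
    (not density-reachable from any core point)."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n)
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None = unvisited, -1 = noise, >=0 = cluster id
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
        cluster += 1
    return [lab == -1 for lab in labels]

# A dense 5x5 grid plus one isolated point
pts = [(x / 10, y / 10) for x in range(5) for y in range(5)] + [(5.0, 5.0)]
flags = dbscan_noise(pts, eps=0.5, min_pts=10)  # only the isolated point is noise
```

Every grid point has at least `min_pts` neighbors within `eps`, so the grid forms one dense cluster, while the isolated point has no neighborhood and stays flagged as noise.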

Comparison Matrix

| Method           | Speed | Multivariate | Interpretable | Use Case                           |
|------------------|-------|--------------|---------------|------------------------------------|
| Z-Score          | ⚡⚡⚡ | No           | ✅✅✅        | Single column, normal distribution |
| IQR              | ⚡⚡⚡ | No           | ✅✅          | Single column, robust, visual      |
| Isolation Forest | ⚡⚡   | Yes          | ⚠️            | Multiple columns, complex patterns |
| DBSCAN           | ⚡     | Yes          | ⚠️            | Density-based, irregular clusters  |

Decision Tree

Start: Need to detect anomalies
│
├─ Single column
│   │
│   Is the data roughly normally distributed?
│   ├─ YES → Use Z-Score
│   └─ NO  → Use IQR
│
└─ Multiple columns
    │
    Do you need interpretability?
    ├─ YES → Use Isolation Forest
    └─ NO  → Use DBSCAN (fine-tune eps/min_pts)

Practical Example: Multi-Stage Detection

-- Stage 1: Quick syntactic check
CREATE TABLE stage1_syntax AS
SELECT
    transaction_id,
    amount,
    customer_id
FROM transactions
WHERE amount > 0 AND amount < 1000000;

-- Stage 2: Statistical outliers
CREATE TABLE stage2_statistical AS
SELECT
    *,
    anofox_zscore_anomaly(amount, 3.0) OVER () as z_anomaly,
    anofox_iqr_anomaly(amount, 1.5) OVER () as iqr_anomaly
FROM stage1_syntax;

-- Stage 3: Multivariate patterns
CREATE TABLE stage3_multivariate AS
SELECT
    *,
    anofox_isolation_forest_anomaly(
        [amount, customer_id],
        0.05
    ) OVER () as forest_anomaly
FROM stage2_statistical;

-- Final: Flag anything flagged by any method
SELECT
    transaction_id,
    amount,
    CASE
        WHEN z_anomaly OR iqr_anomaly OR forest_anomaly THEN 'ANOMALY'
        ELSE 'NORMAL'
    END as status
FROM stage3_multivariate
ORDER BY status; -- 'ANOMALY' sorts before 'NORMAL'

Parameter Guidelines

Contamination Rate:

  • Start with 0.05 (5% anomalies)
  • Adjust based on domain knowledge
  • 0.01 for strict quality (1% anomalies)
  • 0.10 for loose detection (10% anomalies)
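
The contamination rate is just a score cutoff: the top fraction of anomaly scores gets flagged. A hypothetical sketch of that thresholding step (the score values below are made up for illustration):

```python
def flag_by_contamination(scores, contamination=0.05):
    """Flag the top `contamination` fraction of anomaly scores."""
    k = max(1, round(len(scores) * contamination))
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [s >= cutoff for s in scores]

# 10 example scores; contamination 0.10 flags the single highest one
scores = [0.31, 0.29, 0.33, 0.30, 0.92, 0.28, 0.32, 0.30, 0.29, 0.31]
flags = flag_by_contamination(scores, contamination=0.10)
```

This is why the rate matters: set it too high and ordinary points get flagged, too low and real anomalies slip through.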

DBSCAN eps:

  • Start with 0.5 for normalized data
  • Try: 0.3, 0.5, 1.0, 2.0 and observe results
  • Smaller eps = stricter grouping

Z-Score threshold:

  • 2.0: covers ~95% of normal data (flags ~5%)
  • 3.0: covers ~99.7% of normal data (flags ~0.3%)
  • 4.0: covers ~99.994% of normal data (flags ~0.006%)
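
These percentages follow from the standard normal distribution: the two-sided tail probability is P(|z| > t) = erfc(t / √2), which you can check directly:

```python
import math

def two_sided_tail(t):
    """P(|Z| > t) for a standard normal variable Z."""
    return math.erfc(t / math.sqrt(2))

for t in (2.0, 3.0, 4.0):
    # threshold 3.0 flags about 0.27% of normally distributed data
    print(f"threshold {t}: flags ~{two_sided_tail(t):.4%} of normal data")
```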
