Anomaly Detection Methods
Find unusual values in your data—statistical outliers and distributional anomalies.
What is an Anomaly?
An anomaly is a data point that significantly deviates from the normal distribution or pattern.
Examples:
- Sales amount $5,000 when typical is $100 = anomalous
- Customer age 150 when valid range is 18-120 = anomalous
- 10 orders in 1 second when typical is 0.1 per second = anomalous
Two Detection Approaches
1. Statistical Methods (Univariate)
Detect: Single-variable outliers based on distribution.
Cost: O(n) time, very fast
Assumption: Data is roughly normally distributed
Use when: Looking for obvious outliers in single numeric columns
Z-Score Method
z = (x - mean) / std_dev
Flag x as an outlier if |z| > threshold. The typical threshold is 3.0: roughly 99.7% of normally distributed data falls within 3 standard deviations of the mean, so anything beyond is rare.
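For intuition, take the $5,000 sale from the examples above and assume (hypothetical numbers) mean = $100 and std_dev = $50: z = (5000 - 100) / 50 = 98, far beyond any reasonable threshold.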
Example:
-- Detect sales amounts > 3 standard deviations from mean
SELECT
transaction_id,
amount,
anofox_zscore_anomaly(amount, 3.0) OVER () as is_anomaly
FROM transactions;
Interpretation:
- z = 0: Value equals mean
- |z| ≤ 1: Within 1 std_dev (68% of data)
- |z| ≤ 2: Within 2 std_dev (95% of data)
- |z| ≤ 3: Within 3 std_dev (99.7% of data)
- |z| > 3: Extreme outlier (0.3% of data)
Pros:
- Fast, simple, interpretable
- Works well for normally distributed data
- Easy to explain to stakeholders
Cons:
- Assumes normal distribution (breaks with skewed data)
- Doesn't account for clusters or multimodal distributions
- Sensitive to extreme outliers that skew the mean
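The same check can also be written without the extension. A minimal sketch in plain SQL, assuming the transactions table from the example above and a dialect that supports AVG and STDDEV as window functions (e.g. PostgreSQL, DuckDB):

-- Manual z-score check: flag rows more than 3 std_devs from the mean
SELECT
  transaction_id,
  amount,
  ABS(amount - AVG(amount) OVER ())
    / NULLIF(STDDEV(amount) OVER (), 0) > 3.0 AS is_anomaly  -- NULLIF guards against zero std_dev
FROM transactions;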
IQR (Interquartile Range) Method
Outlier if x < Q1 - 1.5*IQR or x > Q3 + 1.5*IQR
IQR = Q3 - Q1 (middle 50% of data)
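For intuition, with hypothetical quartiles Q1 = $80 and Q3 = $120: IQR = $40, so the fences sit at 80 - 1.5*40 = $20 and 120 + 1.5*40 = $180, and the $5,000 sale is flagged.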
Example:
-- Detect amounts outside typical range
SELECT
transaction_id,
amount,
anofox_iqr_anomaly(amount, 1.5) OVER () as is_anomaly
FROM transactions;
Interpretation:
- Flags values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (beyond the box-plot whiskers)
- Robust to extreme outliers (quartiles are barely affected by them)
Pros:
- Robust to extreme values
- Works with skewed data
- Easy to explain (whiskers on box plots)
Cons:
- Less sensitive to gradual drift
- Computing quartiles requires sorting, so it is O(n log n) rather than O(n)
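As with the z-score, the check can be expressed in plain SQL. A sketch assuming a dialect with the percentile_cont ordered-set aggregate (e.g. PostgreSQL):

-- Manual IQR check: compute quartiles once, then apply the fences
WITH quartiles AS (
  SELECT
    percentile_cont(0.25) WITHIN GROUP (ORDER BY amount) AS q1,
    percentile_cont(0.75) WITHIN GROUP (ORDER BY amount) AS q3
  FROM transactions
)
SELECT
  t.transaction_id,
  t.amount,
  t.amount < q.q1 - 1.5 * (q.q3 - q.q1)
    OR t.amount > q.q3 + 1.5 * (q.q3 - q.q1) AS is_anomaly
FROM transactions t CROSS JOIN quartiles q;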
2. Machine Learning Methods (Multivariate)
Detect: Anomalies in multiple dimensions simultaneously.
Cost: O(n log n) time or worse, slower than the statistical methods
Assumption: None (data-driven)
Use when: Looking for patterns across multiple columns
Isolation Forest
Builds random trees by repeatedly picking a random feature and a random split value. Anomalies sit in sparse regions, so they are separated from the rest in fewer splits; a short average path length marks a point as anomalous.
Example:
-- Detect unusual customers (combining multiple features)
SELECT
customer_id,
amount,
frequency,
days_since_purchase,
anofox_isolation_forest_anomaly(
[amount, frequency, days_since_purchase],
0.1 -- Contamination rate: expect 10% anomalies
) OVER () as is_anomaly
FROM customers;
Parameters:
- features (array): Columns to analyze
- contamination (0.0-1.0): Expected proportion of anomalies (0.05 = 5%)
Pros:
- Detects complex patterns
- Handles multiple dimensions naturally
- Doesn't assume distribution
- Fast O(n log n)
Cons:
- Less interpretable ("why" is harder to explain)
- Requires tuning contamination rate
- Needs multiple features to be useful
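To act on the result, wrap the scored query and keep only the flagged rows. A sketch reusing the customers example above:

-- Pull only flagged customers for manual review
WITH scored AS (
  SELECT
    customer_id, amount, frequency, days_since_purchase,
    anofox_isolation_forest_anomaly(
      [amount, frequency, days_since_purchase],
      0.05 -- stricter than the 0.1 above: expect 5% anomalies
    ) OVER () AS is_anomaly
  FROM customers
)
SELECT * FROM scored WHERE is_anomaly;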
DBSCAN (Density-Based Clustering)
Groups points by density. Points in low-density regions = anomalies.
Example:
-- Detect sparse clusters in customer behavior
SELECT
customer_id,
amount,
frequency,
anofox_dbscan_anomaly(
[amount, frequency],
eps := 0.5, -- Distance threshold
min_pts := 10 -- Min points in neighborhood
) OVER () as is_anomaly
FROM customers;
Parameters:
- eps (distance): How far to search for neighbors
- min_pts: How many neighbors make a valid cluster
Pros:
- Finds density-based anomalies (naturally sparse points)
- Doesn't require setting contamination rate
- Works with irregular cluster shapes
Cons:
- Requires parameter tuning (eps, min_pts)
- Slower for high-dimensional data
- Less predictable than Isolation Forest
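Because DBSCAN is distance-based, features on different scales distort eps (the guideline of eps = 0.5 below assumes normalized data). A sketch that z-score-normalizes the features first, reusing the customers example:

-- Normalize features so one eps value applies to both dimensions
WITH normalized AS (
  SELECT
    customer_id,
    (amount - AVG(amount) OVER ())
      / NULLIF(STDDEV(amount) OVER (), 0) AS amount_z,
    (frequency - AVG(frequency) OVER ())
      / NULLIF(STDDEV(frequency) OVER (), 0) AS frequency_z
  FROM customers
)
SELECT
  customer_id,
  anofox_dbscan_anomaly(
    [amount_z, frequency_z],
    eps := 0.5,
    min_pts := 10
  ) OVER () AS is_anomaly
FROM normalized;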
Comparison Matrix
| Method | Speed | Multivariate | Interpretable | Use Case |
|---|---|---|---|---|
| Z-Score | ⚡⚡⚡ | No | ✅✅✅ | Single column, normal distribution |
| IQR | ⚡⚡⚡ | No | ✅✅ | Single column, robust, visual |
| Isolation Forest | ⚡⚡ | ✅ | ⚠️ | Multiple columns, complex patterns |
| DBSCAN | ⚡ | ✅ | ⚠️ | Density-based, irregular clusters |
Decision Tree
Start: Need to detect anomalies
↓
Single column or multiple columns?
├─ Single → Continue
│ ↓
│ Is data normally distributed?
│ ├─ YES → Use Z-Score
│ └─ NO → Use IQR
│
└─ Multiple → Continue
↓
Do you know the expected anomaly rate?
├─ YES → Use Isolation Forest (set contamination)
└─ NO → Use DBSCAN (tune eps/min_pts)
Practical Example: Multi-Stage Detection
-- Stage 1: Quick syntactic check
CREATE TABLE stage1_syntax AS
SELECT
transaction_id,
amount,
customer_id
FROM transactions
WHERE amount > 0 AND amount < 1000000;
-- Stage 2: Statistical outliers
CREATE TABLE stage2_statistical AS
SELECT
*,
anofox_zscore_anomaly(amount, 3.0) OVER () as z_anomaly,
anofox_iqr_anomaly(amount, 1.5) OVER () as iqr_anomaly
FROM stage1_syntax;
-- Stage 3: Multivariate patterns
CREATE TABLE stage3_multivariate AS
SELECT
*,
anofox_isolation_forest_anomaly(
[amount, customer_id], -- customer_id is a stand-in here; prefer real numeric features
0.05 -- expect 5% anomalies
) OVER () as forest_anomaly
FROM stage2_statistical;
-- Final: Flag anything flagged by any method
SELECT
transaction_id,
amount,
CASE
WHEN z_anomaly OR iqr_anomaly OR forest_anomaly THEN 'ANOMALY'
ELSE 'NORMAL'
END as status
FROM stage3_multivariate
ORDER BY status; -- 'ANOMALY' sorts before 'NORMAL'
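A useful follow-up is to compare how many rows each method flags; large disagreement usually means a threshold needs tuning. A sketch, assuming a dialect with the FILTER clause (e.g. PostgreSQL, DuckDB):

-- Count flags per method to sanity-check the thresholds
SELECT
  COUNT(*) FILTER (WHERE z_anomaly) AS z_flags,
  COUNT(*) FILTER (WHERE iqr_anomaly) AS iqr_flags,
  COUNT(*) FILTER (WHERE forest_anomaly) AS forest_flags,
  COUNT(*) AS total_rows
FROM stage3_multivariate;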
Parameter Guidelines
Contamination Rate:
- Start with 0.05 (5% anomalies)
- Adjust based on domain knowledge
- 0.01 for strict quality (1% anomalies)
- 0.10 for loose detection (10% anomalies)
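One way to choose between these rates is to run several side by side and inspect the rows where the flags disagree. A sketch reusing the documented function signature (the columns are illustrative):

-- Compare strict, default, and loose contamination rates
SELECT
  customer_id,
  anofox_isolation_forest_anomaly([amount, frequency], 0.01) OVER () AS strict_flag,
  anofox_isolation_forest_anomaly([amount, frequency], 0.05) OVER () AS default_flag,
  anofox_isolation_forest_anomaly([amount, frequency], 0.10) OVER () AS loose_flag
FROM customers;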
DBSCAN eps:
- Start with 0.5 for normalized data
- Try: 0.3, 0.5, 1.0, 2.0 and observe results
- Smaller eps = stricter grouping
Z-Score threshold:
- 2.0: 95% confidence (catches ~5% of data)
- 3.0: 99.7% confidence (catches ~0.3% of data)
- 4.0: 99.99% confidence (catches ~0.01% of data)
Next Steps
- Data Profiling — Understand your data's normal patterns
- Basic Workflow — End-to-end validation pipeline
- Anomaly Detection Reference — Function API details