# Anomaly Detection

Statistical and machine learning methods for outlier detection.
## Overview

Anomaly detection identifies unusual values that deviate from the norm. Methods fall into two categories:

- **Statistical (univariate)**: operate on a single column
- **Machine learning (multivariate)**: operate across multiple columns
## Statistical Methods (2 functions)
### anofox_zscore_anomaly

Detect outliers using the Z-score method.

**Signature:**

```sql
anofox_zscore_anomaly(value, threshold) OVER () → BOOLEAN
```

**Parameters:**

- `value`: Column to analyze
- `threshold`: Z-score threshold (typically 2.0, 2.5, or 3.0)

**Returns:** `TRUE` if outlier, `FALSE` otherwise

**Speed:** O(n) full scan + aggregate
**Method:**

```text
z = (value - mean) / std_dev
anomaly = |z| > threshold
```
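The rule can be sketched in a few lines of plain Python. This is a simplified stand-in for illustration, not the extension's implementation; it uses the population standard deviation:

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=3.0):
    """Flag values whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [False] * len(values)  # constant column: nothing is an outlier
    return [abs((v - mu) / sigma) > threshold for v in values]

values = [10, 12, 11, 13, 12, 11, 95]  # 95 is an obvious outlier
print(zscore_anomalies(values, threshold=2.0))
# [False, False, False, False, False, False, True]
```

Note that the outlier itself inflates both the mean and the standard deviation, which is exactly the sensitivity listed under "Cons" below.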
**Example:**

```sql
SELECT
    transaction_id,
    amount,
    anofox_zscore_anomaly(amount, 3.0) OVER () AS is_outlier
FROM transactions
QUALIFY anofox_zscore_anomaly(amount, 3.0) OVER ()
ORDER BY amount DESC;
```

(Window functions cannot appear in a `WHERE` clause; `QUALIFY` filters on window-function results.)
**Thresholds & Interpretation:**

| Threshold | % Within (normal dist.) | % Flagged | Use Case |
|---|---|---|---|
| 2.0 | 95% | ~5% of data | Loose detection |
| 2.5 | 98.8% | ~1.2% of data | Moderate detection |
| 3.0 | 99.7% | ~0.3% of data | Strict detection |
**Pros:**
- Very fast (simple calculation)
- Easy to interpret and explain
- Works well with normal distributions
**Cons:**
- Assumes normal distribution
- Sensitive to extreme outliers (they skew the mean)
- Can't handle multimodal data
### anofox_iqr_anomaly

Detect outliers using the interquartile range (IQR) method.

**Signature:**

```sql
anofox_iqr_anomaly(value, multiplier) OVER () → BOOLEAN
```

**Parameters:**

- `value`: Column to analyze
- `multiplier`: IQR multiplier (typically 1.5, 2.0, or 3.0)

**Returns:** `TRUE` if outlier, `FALSE` otherwise

**Speed:** O(n log n) for sorting/percentiles
**Method:**

```text
IQR = Q3 - Q1
lower = Q1 - multiplier * IQR
upper = Q3 + multiplier * IQR
anomaly = value < lower OR value > upper
```
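A Python sketch of the same rule. This is a hypothetical stand-in; the quartile interpolation method may differ from the extension's:

```python
from statistics import quantiles

def iqr_anomalies(values, multiplier=1.5):
    """Flag values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return [v < lower or v > upper for v in values]

values = [10, 12, 11, 13, 12, 11, 95]
print(iqr_anomalies(values, 1.5))
# [False, False, False, False, False, False, True]
```

Unlike the Z-score version, the quartiles barely move when 95 is present, which is why IQR is the robust choice.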
**Example:**

```sql
SELECT
    transaction_id,
    amount,
    anofox_iqr_anomaly(amount, 1.5) OVER () AS is_outlier
FROM transactions
QUALIFY anofox_iqr_anomaly(amount, 1.5) OVER ()
ORDER BY amount DESC;
```
**Multiplier Sensitivity:**

| Multiplier | Strictness | Typical Use |
|---|---|---|
| 1.0 | Very loose (flags many values) | Aggressive screening |
| 1.5 | Standard | Most use cases (box-plot default) |
| 2.0 | Strict | Conservative filtering |
| 3.0 | Very strict (flags only extremes) | Critical systems only |
**Pros:**
- Robust to extreme outliers
- Works with any distribution
- Well-known method (box plots)
- Doesn't require normality assumption
**Cons:**
- Slower than Z-score (needs sorting)
- May miss subtle patterns
- Less sensitive to gradual drift
## Machine Learning Methods (2 functions)
### anofox_isolation_forest_anomaly

Detect outliers using the Isolation Forest algorithm.

**Signature:**

```sql
anofox_isolation_forest_anomaly(features, contamination) OVER () → BOOLEAN
```

**Parameters:**

- `features`: ARRAY of numeric columns to analyze (multivariate)
- `contamination`: Expected proportion of anomalies (0.0-1.0)

**Returns:** `TRUE` if outlier, `FALSE` otherwise

**Speed:** O(n log n) per feature
**Method:**

- Isolation Forest isolates points by recursively choosing a random feature and a random split value
- Anomalies are easier to isolate, so they end up with shorter paths
- The anomaly score is derived from the average path length across trees
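The core idea can be sketched in pure Python. This is a toy version, assuming one tree per scored point and omitting the real algorithm's per-tree subsampling and path-length normalization:

```python
import random

def isolation_path_length(point, data, rng, max_depth=10):
    """Number of random splits needed to isolate `point` from `data`."""
    depth = 0
    while len(data) > 1 and depth < max_depth:
        dim = rng.randrange(len(point))
        lo = min(row[dim] for row in data)
        hi = max(row[dim] for row in data)
        if lo == hi:
            break  # cannot split further along any value of this feature
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains `point`
        data = [row for row in data if (row[dim] < split) == (point[dim] < split)]
        depth += 1
    return depth

def isolation_forest_anomalies(rows, contamination=0.05, n_trees=100, seed=0):
    rng = random.Random(seed)
    # Shorter average path => easier to isolate => more anomalous
    scores = [
        sum(isolation_path_length(p, rows, rng) for _ in range(n_trees)) / n_trees
        for p in rows
    ]
    n_flag = max(1, round(contamination * len(rows)))
    cutoff = sorted(scores)[n_flag - 1]  # lowest path lengths are anomalies
    return [s <= cutoff for s in scores]

# A tight cluster plus one far-away point
rows = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.1), (0.9, 0.9), (10.0, 10.0)]
print(isolation_forest_anomalies(rows, contamination=0.2))
```

The distant point is typically separated by the very first random split, so its average path length is near 1, while cluster members take several splits to isolate.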
**Example:**

```sql
-- Detect unusual customers (combining multiple features)
SELECT
    customer_id,
    amount,
    frequency,
    days_since_purchase,
    anofox_isolation_forest_anomaly(
        [amount, frequency, days_since_purchase],
        0.05  -- Expect 5% anomalies
    ) OVER () AS is_anomaly
FROM customers
QUALIFY anofox_isolation_forest_anomaly(
    [amount, frequency, days_since_purchase],
    0.05
) OVER ()
ORDER BY amount DESC;
```
**Contamination Guidance:**

| Contamination | Expectation | Use Case |
|---|---|---|
| 0.01 | 1% anomalies | Very clean data |
| 0.05 | 5% anomalies | Typical data |
| 0.10 | 10% anomalies | Rough/messy data |
| 0.20 | 20% anomalies | Very loose detection |
**Pros:**
- Works with multiple dimensions naturally
- No distribution assumptions
- Fast O(n log n)
- Handles complex patterns
**Cons:**
- Less interpretable ("why" is harder)
- Requires contamination rate tuning
- Needs multiple features to be useful
### anofox_dbscan_anomaly

Detect outliers using density-based clustering (DBSCAN).

**Signature:**

```sql
anofox_dbscan_anomaly(features, eps, min_pts) OVER () → BOOLEAN
```

**Parameters:**

- `features`: ARRAY of numeric columns (multivariate)
- `eps`: Distance threshold for neighbors (0.1-2.0 typical)
- `min_pts`: Minimum points to form a cluster (5-20 typical)

**Returns:** `TRUE` if outlier (in a sparse region), `FALSE` otherwise

**Speed:** O(n²) in the worst case, O(n log n) with a spatial index
**Method:**

- Groups points by density
- Points in sparse regions are anomalies
- Points not assigned to any cluster are treated as noise/anomalies
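A naive Python sketch of the noise-as-anomaly logic. This is a toy stand-in, and it assumes `min_pts` counts the point itself (the scikit-learn convention); the extension's conventions may differ:

```python
from math import dist

def dbscan_anomalies(points, eps, min_pts):
    """Flag points that end up in no dense cluster (DBSCAN noise)."""
    n = len(points)
    # Naive O(n^2) neighbour search; real implementations use a spatial index
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbours]  # self is counted
    labels = [None] * n  # cluster id, or None for noise
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        # Grow a new cluster outward from this core point
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] is None:
                labels[j] = cluster
                if core[j]:  # only core points expand the cluster
                    frontier.extend(neighbours[j])
        cluster += 1
    return [label is None for label in labels]  # noise = anomaly

points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
print(dbscan_anomalies(points, eps=0.5, min_pts=3))
# [False, False, False, False, True]
```

The four nearby points form a dense cluster; `(5, 5)` has no neighbours within `eps`, so it is never absorbed into a cluster and is flagged.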
**Example:**

```sql
-- Detect customers in sparse regions of behavior space
SELECT
    customer_id,
    amount,
    frequency,
    anofox_dbscan_anomaly(
        [amount, frequency],
        eps := 0.5,    -- Distance threshold
        min_pts := 10  -- Min cluster members
    ) OVER () AS is_anomaly
FROM customers
QUALIFY anofox_dbscan_anomaly(
    [amount, frequency],
    eps := 0.5,
    min_pts := 10
) OVER ();
```
**Parameter Tuning:**

- `eps` (small → large): stricter → looser anomaly detection
  - 0.1-0.3: Very strict (few clusters)
  - 0.5-1.0: Moderate (typical)
  - 1.5-2.0: Loose (large clusters)
- `min_pts` (small → large): looser → stricter definition of "cluster"
  - 3-5: Loose (small clusters OK)
  - 10-20: Moderate (typical)
  - 50+: Strict (must be a large cluster)
**Pros:**
- Finds density-based anomalies naturally
- No contamination rate tuning needed
- Works with irregular cluster shapes
- Can find clusters within outliers
**Cons:**
- Two parameters to tune (eps + min_pts)
- Slower than Isolation Forest
- Sensitive to parameter choices
- Works better with 2-3 features
## Comparison & Selection
| Method | Type | Speed | Interpretability | Use Case |
|---|---|---|---|---|
| Z-Score | Statistical | ⚡⚡⚡ | Easy | Single column, normal dist |
| IQR | Statistical | ⚡⚡ | Easy | Single column, robust |
| Isolation Forest | ML | ⚡⚡ | Hard | Multiple columns, patterns |
| DBSCAN | ML | ⚡ | Medium | Density-based, clusters |
## Decision Tree

Which method should I use?

```text
How many features?
├─ ONE column
│  ├─ Normal distribution → Z-Score
│  └─ Any distribution → IQR
│
└─ MULTIPLE columns
   ├─ Need interpretability → DBSCAN (with tuning)
   └─ Maximize detection → Isolation Forest
```
## Practical Examples
### Example 1: Sales Anomalies (Single Column)

```sql
-- Detect unusual sales amounts, per region
SELECT
    date,
    region,
    sales_amount,
    anofox_zscore_anomaly(sales_amount, 3.0) OVER (PARTITION BY region) AS is_anomaly_zscore,
    anofox_iqr_anomaly(sales_amount, 1.5) OVER (PARTITION BY region) AS is_anomaly_iqr
FROM daily_sales
QUALIFY anofox_zscore_anomaly(sales_amount, 3.0) OVER (PARTITION BY region)
ORDER BY sales_amount DESC;
```
### Example 2: Customer Behavior (Multiple Columns)

```sql
-- Detect unusual customer behavior
SELECT
    customer_id,
    purchase_amount,
    purchase_frequency,
    days_since_last_purchase,
    anofox_isolation_forest_anomaly(
        [purchase_amount, purchase_frequency, days_since_last_purchase],
        0.05
    ) OVER () AS is_anomaly
FROM customer_summary
QUALIFY anofox_isolation_forest_anomaly(
    [purchase_amount, purchase_frequency, days_since_last_purchase],
    0.05
) OVER ()
ORDER BY purchase_amount DESC;
```
### Example 3: Multi-Stage Detection

```sql
-- Combine methods for robust detection
CREATE TABLE anomalies_detected AS
SELECT
    transaction_id,
    amount,
    CASE
        WHEN anofox_zscore_anomaly(amount, 3.0) OVER () THEN 'z_score'
        WHEN anofox_iqr_anomaly(amount, 1.5) OVER () THEN 'iqr'
        ELSE NULL
    END AS anomaly_method_univariate,
    CASE
        WHEN anofox_isolation_forest_anomaly([amount], 0.05) OVER () THEN 'forest'
        ELSE NULL
    END AS anomaly_method_ml
FROM transactions;

-- Flag anything detected by any method
SELECT * FROM anomalies_detected
WHERE anomaly_method_univariate IS NOT NULL
   OR anomaly_method_ml IS NOT NULL;
```
## Window Function Usage

All anomaly detection functions work with OVER clauses:

```sql
-- Global detection (all data)
anofox_zscore_anomaly(amount, 3.0) OVER ()

-- Partition by group
anofox_zscore_anomaly(amount, 3.0) OVER (PARTITION BY region)

-- Rolling window (time series): score each value against the trailing 30 rows
anofox_zscore_anomaly(amount, 3.0) OVER (
    PARTITION BY customer_id
    ORDER BY date
    ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
)
```
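The rolling variant's semantics can be approximated in Python: each row is scored against the trailing window ending at it. This is a sketch of the window semantics, not the extension's exact behavior:

```python
from statistics import mean, pstdev

def rolling_zscore_flags(values, window=30, threshold=3.0):
    """Flag each value against the stats of its trailing window."""
    flags = []
    for i, v in enumerate(values):
        # Mirrors ROWS BETWEEN `window` PRECEDING AND CURRENT ROW
        w = values[max(0, i - window):i + 1]
        sigma = pstdev(w)
        flags.append(sigma > 0 and abs((v - mean(w)) / sigma) > threshold)
    return flags

series = [10, 11, 9, 10, 11, 9, 10, 11, 9, 10, 50]  # a late spike
print(rolling_zscore_flags(series, window=10, threshold=3.0))
# only the final spike is flagged
```

Because each row sees only its own trailing window, the same value can be normal in one part of a series and anomalous in another, which is what makes the rolling form suitable for drifting time series.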
## Best Practices

- ✅ Start simple (Z-Score or IQR)
- ✅ Visualize results before acting on them
- ✅ Validate thresholds on test data
- ✅ Combine methods for robustness
- ✅ Tune parameters on historical anomalies
- ✅ Monitor anomaly rates over time
## Next Steps
- Anomaly Detection Methods — Deep dive into methodology
- Data Profiling — Establish baselines
- Basic Workflow — Implement in production