# Anomaly Detection

Statistical and machine learning methods for outlier detection.
## Overview

Anomaly detection identifies unusual values that deviate from the norm. Methods fall into two categories:

- **Statistical (univariate)**: operate on a single column
- **Machine learning (multivariate)**: operate across multiple columns
## Statistical Methods (2 functions)
### anofox_zscore_anomaly

Detect outliers using the Z-score method.

**Signature:**

```sql
anofox_zscore_anomaly(value, threshold) OVER () → BOOLEAN
```

**Parameters:**

- `value`: Column to analyze
- `threshold`: Z-score threshold (typically 2.0, 2.5, or 3.0)

**Returns:** `TRUE` if outlier, `FALSE` otherwise

**Speed:** O(n) full scan + aggregate
**Method:**

```text
z = (value - mean) / std_dev
anomaly = |z| > threshold
```
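The rule can be sketched in a few lines of plain Python. This is a simplified stand-in for illustration, not the extension's implementation; it uses the population standard deviation:

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=3.0):
    """Flag values whose |z-score| exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [False] * len(values)  # constant column: nothing is an outlier
    return [abs((v - mu) / sigma) > threshold for v in values]

values = [10, 12, 11, 13, 12, 11, 95]  # 95 is an obvious outlier
print(zscore_anomalies(values, threshold=2.0))
# [False, False, False, False, False, False, True]
```

Note that the outlier itself inflates both the mean and the standard deviation, which is exactly the sensitivity listed under "Cons" below.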
**Example:**

```sql
SELECT
    transaction_id,
    amount,
    anofox_zscore_anomaly(amount, 3.0) OVER () AS is_outlier
FROM transactions
QUALIFY anofox_zscore_anomaly(amount, 3.0) OVER ()
ORDER BY amount DESC;
```

(Window functions cannot appear in a `WHERE` clause; `QUALIFY` filters on window-function results.)
**Thresholds & Interpretation:**

| Threshold | % Within (normal dist.) | % Flagged | Use Case |
|---|---|---|---|
| 2.0 | 95% | ~5% of data | Loose detection |
| 2.5 | 98.8% | ~1.2% of data | Moderate detection |
| 3.0 | 99.7% | ~0.3% of data | Strict detection |
**Pros:**
- Very fast (simple calculation)
- Easy to interpret and explain
- Works well with normal distributions
**Cons:**
- Assumes normal distribution
- Sensitive to extreme outliers (they skew the mean)
- Can't handle multimodal data
### anofox_iqr_anomaly

Detect outliers using the interquartile range (IQR) method.

**Signature:**

```sql
anofox_iqr_anomaly(value, multiplier) OVER () → BOOLEAN
```

**Parameters:**

- `value`: Column to analyze
- `multiplier`: IQR multiplier (typically 1.5, 2.0, or 3.0)

**Returns:** `TRUE` if outlier, `FALSE` otherwise

**Speed:** O(n log n) for sorting/percentiles
**Method:**

```text
IQR = Q3 - Q1
lower = Q1 - multiplier * IQR
upper = Q3 + multiplier * IQR
anomaly = value < lower OR value > upper
```
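A Python sketch of the same rule. This is a hypothetical stand-in; the quartile interpolation method may differ from the extension's:

```python
from statistics import quantiles

def iqr_anomalies(values, multiplier=1.5):
    """Flag values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return [v < lower or v > upper for v in values]

values = [10, 12, 11, 13, 12, 11, 95]
print(iqr_anomalies(values, 1.5))
# [False, False, False, False, False, False, True]
```

Unlike the Z-score version, the quartiles barely move when 95 is present, which is why IQR is the robust choice.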
**Example:**

```sql
SELECT
    transaction_id,
    amount,
    anofox_iqr_anomaly(amount, 1.5) OVER () AS is_outlier
FROM transactions
QUALIFY anofox_iqr_anomaly(amount, 1.5) OVER ()
ORDER BY amount DESC;
```
**Multiplier Sensitivity:**

| Multiplier | Strictness | Typical Use |
|---|---|---|
| 1.0 | Very loose (flags many values) | Aggressive screening |
| 1.5 | Standard | Most use cases (box-plot default) |
| 2.0 | Strict | Conservative filtering |
| 3.0 | Very strict (flags only extremes) | Critical systems only |
**Pros:**
- Robust to extreme outliers
- Works with any distribution
- Well-known method (box plots)
- Doesn't require normality assumption
**Cons:**
- Slower than Z-score (needs sorting)
- May miss subtle patterns
- Less sensitive to gradual drift
## Machine Learning Methods (2 functions)
### anofox_isolation_forest_anomaly

Detect outliers using the Isolation Forest algorithm.

**Signature:**

```sql
anofox_isolation_forest_anomaly(features, contamination) OVER () → BOOLEAN
```

**Parameters:**

- `features`: ARRAY of numeric columns to analyze (multivariate)
- `contamination`: Expected proportion of anomalies (0.0-1.0)

**Returns:** `TRUE` if outlier, `FALSE` otherwise

**Speed:** O(n log n) per feature
**Method:**

- Isolation Forest isolates points by recursively choosing a random feature and a random split value
- Anomalies are easier to isolate, so they end up with shorter paths
- The anomaly score is derived from the average path length across trees
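The core idea can be sketched in pure Python. This is a toy version, assuming one tree per scored point and omitting the real algorithm's per-tree subsampling and path-length normalization:

```python
import random

def isolation_path_length(point, data, rng, max_depth=10):
    """Number of random splits needed to isolate `point` from `data`."""
    depth = 0
    while len(data) > 1 and depth < max_depth:
        dim = rng.randrange(len(point))
        lo = min(row[dim] for row in data)
        hi = max(row[dim] for row in data)
        if lo == hi:
            break  # cannot split further along any value of this feature
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains `point`
        data = [row for row in data if (row[dim] < split) == (point[dim] < split)]
        depth += 1
    return depth

def isolation_forest_anomalies(rows, contamination=0.05, n_trees=100, seed=0):
    rng = random.Random(seed)
    # Shorter average path => easier to isolate => more anomalous
    scores = [
        sum(isolation_path_length(p, rows, rng) for _ in range(n_trees)) / n_trees
        for p in rows
    ]
    n_flag = max(1, round(contamination * len(rows)))
    cutoff = sorted(scores)[n_flag - 1]  # lowest path lengths are anomalies
    return [s <= cutoff for s in scores]

# A tight cluster plus one far-away point
rows = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.1), (0.9, 0.9), (10.0, 10.0)]
print(isolation_forest_anomalies(rows, contamination=0.2))
```

The distant point is typically separated by the very first random split, so its average path length is near 1, while cluster members take several splits to isolate.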
**Example:**

```sql
-- Detect unusual customers (combining multiple features)
SELECT
    customer_id,
    amount,
    frequency,
    days_since_purchase,
    anofox_isolation_forest_anomaly(
        [amount, frequency, days_since_purchase],
        0.05  -- Expect 5% anomalies
    ) OVER () AS is_anomaly
FROM customers
QUALIFY anofox_isolation_forest_anomaly(
    [amount, frequency, days_since_purchase],
    0.05
) OVER ()
ORDER BY amount DESC;
```
**Contamination Guidance:**

| Contamination | Expectation | Use Case |
|---|---|---|
| 0.01 | 1% anomalies | Very clean data |
| 0.05 | 5% anomalies | Typical data |
| 0.10 | 10% anomalies | Rough/messy data |
| 0.20 | 20% anomalies | Very loose detection |
**Pros:**
- Works with multiple dimensions naturally
- No distribution assumptions
- Fast O(n log n)
- Handles complex patterns
**Cons:**
- Less interpretable ("why" is harder)
- Requires contamination rate tuning
- Needs multiple features to be useful
### anofox_dbscan_anomaly

Detect outliers using density-based clustering (DBSCAN).

**Signature:**

```sql
anofox_dbscan_anomaly(features, eps, min_pts) OVER () → BOOLEAN
```

**Parameters:**

- `features`: ARRAY of numeric columns (multivariate)
- `eps`: Distance threshold for neighbors (0.1-2.0 typical)
- `min_pts`: Minimum points to form a cluster (5-20 typical)

**Returns:** `TRUE` if outlier (in a sparse region), `FALSE` otherwise

**Speed:** O(n²) in the worst case, O(n log n) with a spatial index
**Method:**

- Groups points by density
- Points in sparse regions are anomalies
- Points not assigned to any cluster are treated as noise/anomalies
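A naive Python sketch of the noise-as-anomaly logic. This is a toy stand-in, and it assumes `min_pts` counts the point itself (the scikit-learn convention); the extension's conventions may differ:

```python
from math import dist

def dbscan_anomalies(points, eps, min_pts):
    """Flag points that end up in no dense cluster (DBSCAN noise)."""
    n = len(points)
    # Naive O(n^2) neighbour search; real implementations use a spatial index
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbours]  # self is counted
    labels = [None] * n  # cluster id, or None for noise
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        # Grow a new cluster outward from this core point
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] is None:
                labels[j] = cluster
                if core[j]:  # only core points expand the cluster
                    frontier.extend(neighbours[j])
        cluster += 1
    return [label is None for label in labels]  # noise = anomaly

points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5)]
print(dbscan_anomalies(points, eps=0.5, min_pts=3))
# [False, False, False, False, True]
```

The four nearby points form a dense cluster; `(5, 5)` has no neighbours within `eps`, so it is never absorbed into a cluster and is flagged.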
**Example:**

```sql
-- Detect customers in sparse regions of behavior space
SELECT
    customer_id,
    amount,
    frequency,
    anofox_dbscan_anomaly(
        [amount, frequency],
        eps := 0.5,    -- Distance threshold
        min_pts := 10  -- Min cluster members
    ) OVER () AS is_anomaly
FROM customers
QUALIFY anofox_dbscan_anomaly(
    [amount, frequency],
    eps := 0.5,
    min_pts := 10
) OVER ();
```
**Parameter Tuning:**

- `eps` (small → large): stricter → looser anomaly detection
  - 0.1-0.3: Very strict (few clusters)
  - 0.5-1.0: Moderate (typical)
  - 1.5-2.0: Loose (large clusters)
- `min_pts` (small → large): looser → stricter definition of "cluster"
  - 3-5: Loose (small clusters OK)
  - 10-20: Moderate (typical)
  - 50+: Strict (must be a large cluster)
**Pros:**
- Finds density-based anomalies naturally
- No contamination rate tuning needed
- Works with irregular cluster shapes
- Can find clusters within outliers
**Cons:**
- Two parameters to tune (eps + min_pts)
- Slower than Isolation Forest
- Sensitive to parameter choices
- Works better with 2-3 features
## Comparison & Selection
| Method | Type | Speed | Interpretability | Use Case |
|---|---|---|---|---|
| Z-Score | Statistical | ⚡⚡⚡ | Easy | Single column, normal dist |
| IQR | Statistical | ⚡⚡ | Easy | Single column, robust |
| Isolation Forest | ML | ⚡⚡ | Hard | Multiple columns, patterns |
| DBSCAN | ML | ⚡ | Medium | Density-based, clusters |
## Decision Tree

Which method should I use?

```text
How many features?
├─ ONE column
│  ├─ Normal distribution → Z-Score
│  └─ Any distribution → IQR
│
└─ MULTIPLE columns
   ├─ Need interpretability → DBSCAN (with tuning)
   └─ Maximize detection → Isolation Forest
```
## Practical Examples
### Example 1: Sales Anomalies (Single Column)

```sql
-- Detect unusual sales amounts, per region
SELECT
    date,
    region,
    sales_amount,
    anofox_zscore_anomaly(sales_amount, 3.0) OVER (PARTITION BY region) AS is_anomaly_zscore,
    anofox_iqr_anomaly(sales_amount, 1.5) OVER (PARTITION BY region) AS is_anomaly_iqr
FROM daily_sales
QUALIFY anofox_zscore_anomaly(sales_amount, 3.0) OVER (PARTITION BY region)
ORDER BY sales_amount DESC;
```
### Example 2: Customer Behavior (Multiple Columns)

```sql
-- Detect unusual customer behavior
SELECT
    customer_id,
    purchase_amount,
    purchase_frequency,
    days_since_last_purchase,
    anofox_isolation_forest_anomaly(
        [purchase_amount, purchase_frequency, days_since_last_purchase],
        0.05
    ) OVER () AS is_anomaly
FROM customer_summary
QUALIFY anofox_isolation_forest_anomaly(
    [purchase_amount, purchase_frequency, days_since_last_purchase],
    0.05
) OVER ()
ORDER BY purchase_amount DESC;
```
### Example 3: Multi-Stage Detection

```sql
-- Combine methods for robust detection
CREATE TABLE anomalies_detected AS
SELECT
    transaction_id,
    amount,
    CASE
        WHEN anofox_zscore_anomaly(amount, 3.0) OVER () THEN 'z_score'
        WHEN anofox_iqr_anomaly(amount, 1.5) OVER () THEN 'iqr'
        ELSE NULL
    END AS anomaly_method_univariate,
    CASE
        WHEN anofox_isolation_forest_anomaly([amount], 0.05) OVER () THEN 'forest'
        ELSE NULL
    END AS anomaly_method_ml
FROM transactions;

-- Flag anything detected by any method
SELECT * FROM anomalies_detected
WHERE anomaly_method_univariate IS NOT NULL
   OR anomaly_method_ml IS NOT NULL;
```
## Window Function Usage

All anomaly detection functions work with OVER clauses:

```sql
-- Global detection (all data)
anofox_zscore_anomaly(amount, 3.0) OVER ()

-- Partition by group
anofox_zscore_anomaly(amount, 3.0) OVER (PARTITION BY region)

-- Rolling window (time series): score each value against the trailing 30 rows
anofox_zscore_anomaly(amount, 3.0) OVER (
    PARTITION BY customer_id
    ORDER BY date
    ROWS BETWEEN 30 PRECEDING AND CURRENT ROW
)
```
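The rolling variant's semantics can be approximated in Python: each row is scored against the trailing window ending at it. This is a sketch of the window semantics, not the extension's exact behavior:

```python
from statistics import mean, pstdev

def rolling_zscore_flags(values, window=30, threshold=3.0):
    """Flag each value against the stats of its trailing window."""
    flags = []
    for i, v in enumerate(values):
        # Mirrors ROWS BETWEEN `window` PRECEDING AND CURRENT ROW
        w = values[max(0, i - window):i + 1]
        sigma = pstdev(w)
        flags.append(sigma > 0 and abs((v - mean(w)) / sigma) > threshold)
    return flags

series = [10, 11, 9, 10, 11, 9, 10, 11, 9, 10, 50]  # a late spike
print(rolling_zscore_flags(series, window=10, threshold=3.0))
# only the final spike is flagged
```

Because each row sees only its own trailing window, the same value can be normal in one part of a series and anomalous in another, which is what makes the rolling form suitable for drifting time series.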
## Best Practices

- ✅ Start simple (Z-Score or IQR)
- ✅ Visualize results before acting on them
- ✅ Validate thresholds on test data
- ✅ Combine methods for robustness
- ✅ Tune parameters on historical anomalies
- ✅ Monitor anomaly rates over time
## Next Steps
- Anomaly Detection Methods — Deep dive into methodology
- Data Profiling — Establish baselines
- Basic Workflow — Implement in production