ml-anomaly-detection

Classical ML, scikit-learn, rigorous codebase

Description

Unsupervised Anomaly Detection Algorithm Design

Research Question

Design a novel unsupervised anomaly detection algorithm for tabular data that generalizes across datasets with varying dimensionality, sample sizes, and anomaly ratios.

Background

Unsupervised anomaly detection identifies rare, unusual patterns in data without labeled examples. Classic methods include Isolation Forest (tree-based isolation), Local Outlier Factor (density-based), and One-Class SVM (boundary-based). Recent advances include ECOD (empirical cumulative distribution tails, TKDE 2022), COPOD (copula-based tail probabilities, ICDM 2020), and Deep Isolation Forest (representation-enhanced isolation, TKDE 2023). Despite progress, no single method dominates across all dataset characteristics, leaving room for novel algorithmic designs that combine strengths of multiple paradigms.
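To make the "distribution tails" idea concrete, here is a minimal NumPy sketch of tail-probability scoring in the spirit of ECOD. It is a simplification, not the published algorithm: it takes the smaller of the two empirical tails per feature and omits ECOD's skewness-based tail selection. The function name `tail_scores` and the clipping constant are assumptions made for illustration.

```python
import numpy as np

def tail_scores(X_train, X_test):
    """Sum over features of -log(empirical tail probability).

    Simplified ECOD-style scoring (assumption: smaller tail per feature,
    no skewness correction). Higher scores = more anomalous.
    """
    n, d = X_train.shape
    scores = np.zeros(X_test.shape[0])
    for j in range(d):
        col = np.sort(X_train[:, j])
        # Left tail: fraction of training values <= x
        left = np.searchsorted(col, X_test[:, j], side="right") / n
        # Right tail: fraction of training values >= x
        right = (n - np.searchsorted(col, X_test[:, j], side="left")) / n
        tail = np.minimum(left, right)
        # Clip so that points outside the training range do not give log(0)
        tail = np.clip(tail, 1.0 / (n + 1), 1.0)
        scores += -np.log(tail)
    return scores

# Synthetic check: a point far in the tail should outscore a central point.
rng = np.random.default_rng(0)
X_ref = rng.standard_normal((500, 3))
X_query = np.array([[0.0, 0.0, 0.0], [6.0, 6.0, 6.0]])
demo_scores = tail_scores(X_ref, X_query)
```

Because each feature is scored independently, this kind of detector is cheap and parameter-free, but it cannot see anomalies that are only unusual in combination across features.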

Task

Implement a custom unsupervised anomaly detection algorithm in the CustomAnomalyDetector class in custom_anomaly.py. Your algorithm should detect anomalies without using any labels during training.

Interface

class CustomAnomalyDetector:
    def __init__(self):
        # Initialize hyperparameters and internal state
        pass

    def fit(self, X):
        # Train on unlabeled data X: numpy array (n_samples, n_features)
        # Data is already standardized (zero mean, unit variance)
        return self

    def decision_function(self, X):
        # Return anomaly scores: numpy array (n_samples,)
        # Higher scores = more anomalous
        raise NotImplementedError("compute and return scores here")
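As one illustration of how the interface might be filled in, the sketch below averages two standardized scores: the negated Isolation Forest `score_samples` output and the distance to the k-th nearest training neighbor. This is only an example of combining paradigms, not a recommended or reference solution; the constructor arguments `n_neighbors` and `random_state` and the equal weighting are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

class CustomAnomalyDetector:
    def __init__(self, n_neighbors=10, random_state=0):
        # Hypothetical hyperparameters chosen for illustration
        self.n_neighbors = n_neighbors
        self.random_state = random_state

    def _knn_distance(self, X):
        # Distance to the k-th nearest training point
        dist, _ = self.knn_.kneighbors(X)
        return dist[:, -1]

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.iforest_ = IsolationForest(
            n_estimators=100, random_state=self.random_state
        ).fit(X)
        k = min(self.n_neighbors, X.shape[0])
        self.knn_ = NearestNeighbors(n_neighbors=k).fit(X)
        # Training-set statistics used to put both scores on a common scale.
        # Note: for training points the neighbor set includes the point itself.
        s_if = -self.iforest_.score_samples(X)
        s_knn = self._knn_distance(X)
        self.if_mu_, self.if_sd_ = s_if.mean(), s_if.std() + 1e-12
        self.knn_mu_, self.knn_sd_ = s_knn.mean(), s_knn.std() + 1e-12
        return self

    def decision_function(self, X):
        X = np.asarray(X, dtype=float)
        # score_samples is higher for normal points, so negate it
        z_if = (-self.iforest_.score_samples(X) - self.if_mu_) / self.if_sd_
        z_knn = (self._knn_distance(X) - self.knn_mu_) / self.knn_sd_
        # Equal-weight average; higher = more anomalous
        return 0.5 * (z_if + z_knn)

# Usage on synthetic standardized data
rng = np.random.default_rng(0)
X_train = rng.standard_normal((300, 4))
detector = CustomAnomalyDetector().fit(X_train)
X_new = np.array([[0.0] * 4, [8.0] * 4])
new_scores = detector.decision_function(X_new)
```

Standardizing each component with training-set statistics keeps either score from dominating purely because of its scale; rank-based normalization would be an alternative with different tie behavior.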

Available Libraries

numpy, scipy (linear algebra, statistics, spatial, optimization)
scikit-learn (PCA, KDE, NearestNeighbors, GaussianMixture, etc.)
pyod (IForest, LOF, OCSVM, ECOD, COPOD, KNN, HBOS, PCA, LODA, SUOD, etc.)

Evaluation

Evaluated on 4 tabular anomaly detection benchmarks from ADBench/ODDS:

Cardio: 1,831 samples, 21 features, ~9.6% anomalies (cardiotocography)
Thyroid: 3,772 samples, 6 features, ~2.5% anomalies (thyroid disease)
Satellite: 6,435 samples, 36 features, ~31.6% anomalies (Landsat satellite)
Shuttle: 49,097 samples, 9 features, ~7.2% anomalies (NASA shuttle)

Metrics (higher is better): AUROC (area under ROC curve) and F1 score at the optimal contamination threshold. Evaluated via a 60/40 stratified train/test split, following the standard ADBench/ECOD paper protocol.
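The protocol above can be sketched roughly as follows. The benchmark's actual evaluation code is not shown here, so this is an assumption-laden approximation: "optimal contamination threshold" is interpreted as flagging the top-k test scores, with k equal to the true number of anomalies in the test split. The `NormDetector` stand-in, the `evaluate` helper, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

class NormDetector:
    """Stand-in detector: distance from the training mean."""
    def fit(self, X):
        self.center_ = X.mean(axis=0)
        return self

    def decision_function(self, X):
        return np.linalg.norm(X - self.center_, axis=1)

def evaluate(detector, X, y, seed=42):
    # 60/40 stratified split; labels are used only for splitting and scoring
    X_train, X_test, _, y_test = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed
    )
    detector.fit(X_train)  # no labels passed to fit
    scores = detector.decision_function(X_test)
    auroc = roc_auc_score(y_test, scores)
    # Assumed thresholding rule: flag the k highest-scoring test points,
    # where k is the number of true anomalies in the test split
    k = int(y_test.sum())
    threshold = np.sort(scores)[-k]
    y_pred = (scores >= threshold).astype(int)
    return {"auroc": auroc, "f1": f1_score(y_test, y_pred)}

# Synthetic data: 500 inliers around 0, 50 anomalies shifted by +6
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.standard_normal((500, 5)),
    rng.standard_normal((50, 5)) + 6.0,
])
y_demo = np.concatenate([np.zeros(500, dtype=int), np.ones(50, dtype=int)])
result = evaluate(NormDetector(), X_demo, y_demo)
```

AUROC depends only on the ranking of scores, while the F1 value depends on where the threshold is placed, which is why the two metrics can diverge sharply for the same detector.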

Code

custom_anomaly.py


"""Unsupervised Anomaly Detection Benchmark for MLS-Bench.

FIXED: Data loading, evaluation pipeline, metrics computation.
EDITABLE: CustomAnomalyDetector class (the agent's anomaly detection algorithm).

Usage:
    ENV=cardio SEED=42 OUTPUT_DIR=./output python custom_anomaly.py
"""

import os
import sys
import json
import time
import warnings
from pathlib import Path

Results


Model	Type	auroc cardio ↑	f1 cardio ↑	auroc thyroid ↑	f1 thyroid ↑	auroc satellite ↑	f1 satellite ↑	auroc shuttle ↑	f1 shuttle ↑
copod	baseline	0.921	0.532	0.939	0.180	0.634	0.481	0.995	0.950
ecod	baseline	0.907	0.467	0.978	0.532	0.566	0.437	0.992	0.853
isolation_forest	baseline	0.946	0.586	0.981	0.550	0.707	0.586	0.997	0.963
lof	baseline	0.547	0.168	0.706	0.086	0.550	0.381	0.531	0.132
ocsvm	baseline	0.884	0.400	0.939	0.315	0.547	0.440	0.880	0.503
deepseek-reasoner	vanilla	-	-	-	-	-	-	-	-
google/gemini-3.1-pro-preview	vanilla	0.500	0.174	0.500	0.048	0.500	0.481	0.500	0.133
openai/gpt-5.4	vanilla	-	-	-	-	-	-	-	-
qwen/qwen3.6-plus	vanilla	-	-	-	-	-	-	-	-
deepseek-reasoner	agent	-	-	-	-	-	-	-	-
google/gemini-3.1-pro-preview	agent	0.895	0.429	0.942	0.324	0.621	0.437	0.968	0.707
openai/gpt-5.4	agent	-	-	-	-	-	-	-	-
qwen/qwen3.6-plus	agent	-	-	-	-	-	-	-	-

Agent Conversations

deepseek-reasoner

20 steps

google/gemini-3.1-pro-preview