ml-anomaly-detection

Classical ML, scikit-learn, rigorous codebase

Description

Unsupervised Anomaly Detection Algorithm Design

Research Question

Design a novel unsupervised anomaly detection algorithm for tabular data that generalizes across datasets with varying dimensionality, sample sizes, and anomaly ratios.

Background

Unsupervised anomaly detection identifies rare, unusual patterns in data without labeled examples. Classic methods include Isolation Forest (tree-based isolation), Local Outlier Factor (density-based), and One-Class SVM (boundary-based). Recent advances include ECOD (empirical cumulative distribution tails, TKDE 2022), COPOD (copula-based tail probabilities, ICDM 2020), and Deep Isolation Forest (representation-enhanced isolation, TKDE 2023). Despite progress, no single method dominates across all dataset characteristics, leaving room for novel algorithmic designs that combine strengths of multiple paradigms.
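One simple way to combine paradigms is to rank-normalize the scores of two detectors and average them. The sketch below does this with scikit-learn's IsolationForest (tree-based) and LocalOutlierFactor (density-based). The function name rank_average_scores, the choice of detectors, and the synthetic data are illustrative assumptions, not part of the benchmark or a recommended solution.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def rank_average_scores(X_train, X_query, random_state=0):
    """Hypothetical helper: average the rank-normalized scores of two detectors."""
    iforest = IsolationForest(random_state=random_state).fit(X_train)
    # novelty=True lets LOF score points that were not in the training set.
    lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

    # score_samples returns "higher = more normal", so negate to get
    # "higher = more anomalous" for both detectors.
    raw = [-iforest.score_samples(X_query), -lof.score_samples(X_query)]

    # Ranks put both detectors on a common (0, 1] scale before averaging.
    ranks = [rankdata(s) / len(s) for s in raw]
    return np.mean(ranks, axis=0)


# Synthetic data (assumption): standard normal inliers plus one far-away point.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
X_query = np.vstack([rng.normal(size=(50, 5)), np.full((1, 5), 6.0)])
combined = rank_average_scores(X_train, X_query)
```

Rank averaging discards score magnitudes, which makes it robust to detectors with very different score scales but also blind to how confident each detector is.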

Task

Implement a custom unsupervised anomaly detection algorithm in the CustomAnomalyDetector class in custom_anomaly.py. Your algorithm should detect anomalies without using any labels during training.

Interface

class CustomAnomalyDetector:
    def __init__(self):
        # Initialize hyperparameters and internal state
        pass

    def fit(self, X):
        # Train on unlabeled data X: numpy array (n_samples, n_features)
        # Data is already standardized (zero mean, unit variance)
        return self

    def decision_function(self, X):
        # Return anomaly scores: numpy array (n_samples,)
        # Higher scores = more anomalous
        raise NotImplementedError("compute and return the scores array")
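To make the interface concrete, here is a minimal sketch of a class that satisfies it, using the mean distance to the k nearest training points as the anomaly score. The class name KNNDistanceDetector, the n_neighbors parameter, and the synthetic data are assumptions for illustration. This is a well-known baseline idea, not the novel algorithm the task asks for.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


class KNNDistanceDetector:
    """Illustrative detector: score = mean distance to the k nearest training points."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors  # hypothetical hyperparameter
        self._nn = None

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Guard against tiny training sets where k would exceed n_samples.
        k = min(self.n_neighbors, X.shape[0])
        self._nn = NearestNeighbors(n_neighbors=k).fit(X)
        return self

    def decision_function(self, X):
        X = np.asarray(X, dtype=float)
        distances, _ = self._nn.kneighbors(X)
        # Larger mean distance means the point sits in a sparser region.
        return distances.mean(axis=1)


# Synthetic check (assumption): one point at the origin, one far from the data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
X_query = np.vstack([np.zeros((1, 4)), np.full((1, 4), 8.0)])
scores = KNNDistanceDetector().fit(X_train).decision_function(X_query)
```

The far-away query point receives the larger score, which matches the "higher scores = more anomalous" convention of the interface.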

Available Libraries

  • numpy, scipy (linear algebra, statistics, spatial, optimization)
  • scikit-learn (PCA, KDE, NearestNeighbors, GaussianMixture, etc.)
  • pyod (IForest, LOF, OCSVM, ECOD, COPOD, KNN, HBOS, PCA, LODA, SUOD, etc.)

Evaluation

Evaluated on 4 tabular anomaly detection benchmarks from ADBench/ODDS:

  • Cardio: 1,831 samples, 21 features, ~9.6% anomalies (cardiotocography)
  • Thyroid: 3,772 samples, 6 features, ~2.5% anomalies (thyroid disease)
  • Satellite: 6,435 samples, 36 features, ~31.6% anomalies (Landsat satellite)
  • Shuttle: 49,097 samples, 9 features, ~7.2% anomalies (NASA shuttle)

Metrics (higher is better): AUROC (area under ROC curve) and F1 score at the optimal contamination threshold. Evaluated via a 60/40 stratified train/test split, following the standard ADBench/ECOD paper protocol.
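The protocol above can be sketched roughly as follows. The fixed evaluation code is not shown here, so the details are assumptions: in particular, the sketch reads "optimal contamination threshold" as thresholding at the true anomaly ratio of the test labels, and it uses distance from the origin as a stand-in score on synthetic data. The helper name evaluate_scores is hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_scores(y_true, scores):
    """Hypothetical helper: AUROC plus F1 at a contamination-based threshold."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    auroc = roc_auc_score(y_true, scores)
    # Assumption: flag the top fraction of scores equal to the true anomaly ratio.
    contamination = y_true.mean()
    threshold = np.quantile(scores, 1.0 - contamination)
    y_pred = (scores > threshold).astype(int)
    return auroc, f1_score(y_true, y_pred)


# Synthetic stand-in data (assumption): 950 inliers and 50 shifted anomalies.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(950, 5)), rng.normal(size=(50, 5)) + 6.0])
y = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

# 60/40 stratified split; labels are used only for evaluation, never for fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Stand-in score: Euclidean norm, a naive choice for standardized data.
test_scores = np.linalg.norm(X_test, axis=1)
auroc, f1 = evaluate_scores(y_test, test_scores)
```

In the real pipeline the scores would come from calling decision_function on the test split after fitting on the training split.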

Code

custom_anomaly.py
"""Unsupervised Anomaly Detection Benchmark for MLS-Bench.

FIXED: Data loading, evaluation pipeline, metrics computation.
EDITABLE: CustomAnomalyDetector class (the agent's anomaly detection algorithm).

Usage:
    ENV=cardio SEED=42 OUTPUT_DIR=./output python custom_anomaly.py
"""

import os
import sys
import json
import time
import warnings
from pathlib import Path

Results

Model                          Type      auroc cardio  f1 cardio  auroc thyroid  f1 thyroid  auroc satellite  f1 satellite  auroc shuttle  f1 shuttle
copod                          baseline  0.921         0.532      0.939          0.180       0.634            0.481         0.995          0.950
ecod                           baseline  0.907         0.467      0.978          0.532       0.566            0.437         0.992          0.853
isolation_forest               baseline  0.946         0.586      0.981          0.550       0.707            0.586         0.997          0.963
lof                            baseline  0.547         0.168      0.706          0.086       0.550            0.381         0.531          0.132
ocsvm                          baseline  0.884         0.400      0.939          0.315       0.547            0.440         0.880          0.503
deepseek-reasoner              vanilla   -             -          -              -           -                -             -              -
google/gemini-3.1-pro-preview  vanilla   0.500         0.174      0.500          0.048       0.500            0.481         0.500          0.133
openai/gpt-5.4                 vanilla   -             -          -              -           -                -             -              -
qwen/qwen3.6-plus              vanilla   -             -          -              -           -                -             -              -
deepseek-reasoner              agent     -             -          -              -           -                -             -              -
google/gemini-3.1-pro-preview  agent     0.895         0.429      0.942          0.324       0.621            0.437         0.968          0.707
openai/gpt-5.4                 agent     -             -          -              -           -                -             -              -
qwen/qwen3.6-plus              agent     -             -          -              -           -                -             -              -

Agent Conversations