Agent Conversation: openai/gpt-5.4
Task: ml-anomaly-detection
Unsupervised Anomaly Detection Algorithm Design
Research Question
Design a novel unsupervised anomaly detection algorithm for tabular data that generalizes across datasets with varying dimensionality, sample sizes, and anomaly ratios.
Background
Unsupervised anomaly detection identifies rare, unusual patterns in data without labeled examples. Classic methods include Isolation Forest (tree-based isolation), Local Outlier Factor (density-based), and One-Class SVM (boundary-based). Recent advances include ECOD (empirical cumulative distribution tails, TKDE 2022), COPOD (copula-based tail probabilities, ICDM 2020), and Deep Isolation Forest (representation-enhanced isolation, TKDE 2023). Despite progress, no single method dominates across all dataset characteristics, leaving room for novel algorithmic designs that combine strengths of multiple paradigms.
Task
Implement a custom unsupervised anomaly detection algorithm in the CustomAnomalyDetector class in custom_anomaly.py. Your algorithm should detect anomalies without using any labels during training.
Interface
class CustomAnomalyDetector:
    def __init__(self):
        # Initialize hyperparameters and internal state

    def fit(self, X):
        # Train on unlabeled data X: numpy array (n_samples, n_features)
        # Data is already standardized (zero mean, unit variance)
        return self

    def decision_function(self, X):
        # Return anomaly scores: numpy array (n_samples,)
        # Higher scores = more anomalous
        return scores
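A minimal detector conforming to this interface might score each point by its distance to the k-th nearest training neighbor. This is only an illustrative baseline (the class name `KNNBaselineDetector` and the choice of k are ours, not part of the task), not the submitted algorithm:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

class KNNBaselineDetector:
    """Illustrative baseline for the CustomAnomalyDetector interface:
    anomaly score = distance to the k-th nearest training neighbor."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, X):
        # Train on unlabeled data only; no labels are used anywhere.
        X = np.asarray(X, dtype=np.float64)
        self.nn_ = NearestNeighbors(n_neighbors=min(self.k, len(X))).fit(X)
        return self

    def decision_function(self, X):
        # Higher scores = more anomalous (farther from training mass).
        dist, _ = self.nn_.kneighbors(np.asarray(X, dtype=np.float64))
        return dist[:, -1]
```

Points far from the training distribution receive strictly larger scores than interior points, which is all the harness's ranking metrics require.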
Available Libraries
- numpy
- scipy (linear algebra, statistics, spatial, optimization)
- scikit-learn (PCA, KDE, NearestNeighbors, GaussianMixture, etc.)
- pyod (IForest, LOF, OCSVM, ECOD, COPOD, KNN, HBOS, PCA, LODA, SUOD, etc.)
Evaluation
Evaluated on 4 tabular anomaly detection benchmarks from ADBench/ODDS:
- Cardio: 1,831 samples, 21 features, ~9.6% anomalies (cardiotocography)
- Thyroid: 3,772 samples, 6 features, ~2.5% anomalies (thyroid disease)
- Satellite: 6,435 samples, 36 features, ~31.6% anomalies (Landsat satellite)
- Shuttle: 49,097 samples, 9 features, ~7.2% anomalies (NASA shuttle)
Metrics (higher is better): AUROC (area under ROC curve) and F1 score at the optimal contamination threshold. Evaluated via a 60/40 stratified train/test split, following the standard ADBench/ECOD paper protocol.
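The metric pair can be sketched as follows, assuming the common ADBench convention in which "contamination threshold" means flagging the top-k scored points with k equal to the true anomaly count (the function name `evaluate` is ours; the harness's exact implementation is not shown in this transcript):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def evaluate(scores, y_true):
    """AUROC plus F1 at the contamination threshold: predict the top-k
    scored points as anomalies, where k is the true number of anomalies."""
    auroc = roc_auc_score(y_true, scores)
    k = int(np.sum(y_true))
    thresh = np.sort(scores)[-k]          # k-th largest score
    y_pred = (scores >= thresh).astype(int)
    return auroc, f1_score(y_true, y_pred)
```

Under this convention a detector that ranks every anomaly above every inlier attains AUROC = 1.0 and F1 = 1.0 regardless of the absolute score scale.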
scikit-learn/custom_anomaly.py [EDITABLE — lines 160–212 only]
1: """Unsupervised Anomaly Detection Benchmark for MLS-Bench.
2:
3: FIXED: Data loading, evaluation pipeline, metrics computation.
4: EDITABLE: CustomAnomalyDetector class — the agent's anomaly detection algorithm.
5:
6: Usage:
7: ENV=cardio SEED=42 OUTPUT_DIR=./output python custom_anomaly.py
8: """
9:
10: import os
11: import sys
12: import json
13: import time
14: import warnings
15: from
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_anomaly.py",
4 "content": "class CustomAnomalyDetector:\n \"\"\"Self-purified copula ensemble for unsupervised anomaly detection.\n\n The method first maps each feature through an inductive empirical CDF,\n producing copula-normalized coordinates. It then iteratively builds a\n central \"core\" set using a preliminary tail score and fits three\n complementary detectors on that purified core: local sparsity (kNN), global\n dependence deviation (Mahalanobis in copula space), and isolation depth\n (Isolation Forest). Final scores are a reliability-weighted fusion of the\n component scores.\n \"\"\"\n\n def __init__(self):\n self.eps_ = 1e-4\n\n def _ecdf(self, X):\n X = np.asarray(X, dtype=np.float64)\n U = np.empty_like(X)\n for j in range(X.shape[1]):\n U[:, j] = np.searchsorted(self.sorted_[:, j], X[:, j], side=\"right\") / (self.n_train_ + 1.0)\n return np.clip(U, self.eps_, 1.0 - self.eps_)\n\n def _parts(self, X):\n from scipy.stats import norm\n\n U = self._ecdf(X)\n T = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n knn = self.nn_.kneighbors(Z, n_neighbors=self.q_, return_distance=True)[0][:, -1]\n dz = Z - self.mu_\n maha = np.einsum(\"ij,jk,ik->i\", dz, self.prec_, dz)\n iso = self.iforest_.decision_function(Z)\n tail = 0.6 * T.mean(axis=1) + 0.4 * T.max(axis=1)\n return np.c_[tail, knn, maha, iso], Z\n\n def _combine(self, parts):\n z = np.clip((parts - self.loc_) / (1.4826 * self.scale_), -5.0, 20.0)\n wz = z * self.weight_\n return 0.65 * wz.mean(axis=1) + 0.35 * wz.max(axis=1)\n\n def fit(self, X):\n from scipy.stats import norm\n from sklearn.neighbors import NearestNeighbors\n from pyod.models.iforest import IForest\n\n X = np.asarray(X, dtype=np.float64)\n self.n_train_ = X.shape[0]\n self.sorted_ = np.sort(X, axis=0)\n\n U = self._ecdf(X)\n T = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n score = 0.6 * T.mean(axis=1) + 0.4 * T.max(axis=1) + 0.1 * (Z * Z).mean(axis=1)\n self.core_q_ = float(np.clip(85.0 - 5.0 * np.log1p(X.shape[1]), 60.0, 78.0))\n\n 
for _ in range(2):\n core = Z[score <= np.percentile(score, self.core_q_)]\n self.q_ = min(core.shape[0], int(np.clip(3.0 * np.log1p(core.shape[0]), 10.0, 30.0)) + 1)\n self.nn_ = NearestNeighbors(n_neighbors=self.q_, algorithm=\"auto\").fit(core)\n self.mu_ = core.mean(axis=0)\n cov = np.cov(core, rowvar=False) + 1e-3 * np.eye(core.shape[1])\n self.prec_ = np.linalg.pinv(cov)\n self.iforest_ = IForest(n_estimators=200, max_samples=\"auto\", random_state=42)\n self.iforest_.fit(core)\n parts, _ = self._parts(X)\n self.loc_ = np.median(parts, axis=0)\n self.scale_ = np.median(np.abs(parts - self.loc_), axis=0) + 1e-9\n z_parts = (parts - self.loc_) / (1.4826 * self.scale_)\n self.weight_ = np.maximum(np.percentile(z_parts, 90, axis=0) - np.median(z_parts, axis=0), 0.25)\n score = self._combine(parts)\n return self\n\n def decision_function(self, X):\n parts, _ = self._parts(np.asarray(X, dtype=np.float64))\n return self._combine(parts)\n",
5 "after_line": 0,
6 "start_line": 160,
7 "end_line": 212
8}
OK: Replaced lines 160..212 with 75 line(s) in scikit-learn/custom_anomaly.py. Editable range: 160–234.
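The inductive ECDF mapping described in the docstring can be sketched in isolation. This is a simplified illustration of the idea (the helper names `fit_ecdf`/`ecdf_transform` are ours), matching the submitted `_ecdf`: rank each test value against training order statistics only, clip away 0/1, then Gaussianize with the normal quantile function:

```python
import numpy as np
from scipy.stats import norm

def fit_ecdf(X_train):
    """Store sorted training columns; 'inductive' means test points are
    ranked against training order statistics, never against themselves."""
    return np.sort(np.asarray(X_train, dtype=np.float64), axis=0)

def ecdf_transform(sorted_cols, X, eps=1e-4):
    n = sorted_cols.shape[0]
    X = np.asarray(X, dtype=np.float64)
    U = np.empty_like(X)
    for j in range(sorted_cols.shape[1]):
        U[:, j] = np.searchsorted(sorted_cols[:, j], X[:, j],
                                  side="right") / (n + 1.0)
    U = np.clip(U, eps, 1.0 - eps)   # keep the probit transform finite
    return U, norm.ppf(U)            # copula coordinates and Gaussianized view
```

Dividing by n + 1 (rather than n) keeps training points strictly inside (0, 1), so the clip only activates for test values beyond the observed range.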
[Current file: scikit-learn/custom_anomaly.py | editable: 160–234 | total: 278 lines]
160 class CustomAnomalyDetector:
161 """Self-purified copula ensemble for unsupervised anomaly detection.
162
... (69 more lines) ...
232 def decision_function(self, X):
233 parts, _ = self._parts(np.asarray(X, dtype=np.float64))
234 return self._combine(parts)
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)
### cardio (scripts/cardio.sh)
[COMMAND FAILED exit=1]
Dataset: cardio, Seed: 42
Traceback (most recent call last):
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 246, in <module>
X, y = load_dataset(DATASET_NAME)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 60, in load_dataset
data = np.load(filepath, allow_pickle=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/miniconda3/envs/mlsbench-scikit-learn/lib/python3.11/site-packages/numpy/lib/_npyio_impl.py", line 454, in load
fid = stack.enter_context(open(os.fspath(file), "rb"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No suc
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_anomaly.py",
4 "content": "class CustomAnomalyDetector:\n \"\"\"Self-purified dual-space copula ensemble.\n\n Features are converted to inductive empirical-CDF coordinates, yielding a\n copula-normalized view that is robust to marginal skew/heavy tails. The\n detector iteratively extracts a central core, then scores points using five\n complementary views: marginal tail extremeness, local sparsity, copula-space\n Mahalanobis deviation, copula-space isolation, and raw-space isolation. A\n sample-adaptive fusion shifts weight from local density to tail/global views\n when a point is already extreme in the marginal tails.\n \"\"\"\n\n def __init__(self):\n self.eps_ = 1e-4\n\n def _ecdf(self, X):\n X = np.asarray(X, dtype=np.float64)\n U = np.empty_like(X)\n for j in range(X.shape[1]):\n U[:, j] = np.searchsorted(self.sorted_[:, j], X[:, j], side=\"right\") / (self.n_train_ + 1.0)\n return np.clip(U, self.eps_, 1.0 - self.eps_)\n\n def _parts(self, X):\n from scipy.stats import norm\n\n X = np.asarray(X, dtype=np.float64)\n U = self._ecdf(X)\n T = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n knn = self.nn_.kneighbors(Z, n_neighbors=self.q_, return_distance=True)[0][:, -1]\n dz = Z - self.mu_\n maha = np.einsum(\"ij,jk,ik->i\", dz, self.prec_, dz)\n tail = 0.6 * T.mean(axis=1) + 0.4 * T.max(axis=1)\n iso_z = self.iforest_z_.decision_function(Z)\n iso_x = self.iforest_x_.decision_function(X)\n return np.c_[tail, knn, maha, iso_z, iso_x], Z\n\n def _combine(self, parts):\n z = np.clip((parts - self.loc_) / (1.4826 * self.scale_), -5.0, 20.0)\n wz = z * self.weight_\n g = 1.0 / (1.0 + np.exp(-z[:, 0]))\n iso = np.maximum(wz[:, 3], wz[:, 4])\n interior = 0.55 * wz[:, 1] + 0.25 * wz[:, 2] + 0.20 * iso\n extreme = 0.45 * wz[:, 0] + 0.20 * wz[:, 2] + 0.35 * iso\n return (1.0 - g) * interior + g * extreme + 0.15 * wz.max(axis=1)\n\n def fit(self, X):\n from scipy.stats import norm\n from sklearn.neighbors import NearestNeighbors\n from pyod.models.iforest import IForest\n\n X = 
np.asarray(X, dtype=np.float64)\n self.n_train_ = X.shape[0]\n self.sorted_ = np.sort(X, axis=0)\n\n U = self._ecdf(X)\n T = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n score = 0.6 * T.mean(axis=1) + 0.4 * T.max(axis=1) + 0.1 * (Z * Z).mean(axis=1)\n self.core_q_ = float(np.clip(85.0 - 5.0 * np.log1p(X.shape[1]), 60.0, 78.0))\n\n for _ in range(2):\n mask = score <= np.percentile(score, self.core_q_)\n core_z, core_x = Z[mask], X[mask]\n self.q_ = min(core_z.shape[0], int(np.clip(3.0 * np.log1p(core_z.shape[0]), 10.0, 30.0)) + 1)\n self.nn_ = NearestNeighbors(n_neighbors=self.q_, algorithm=\"auto\").fit(core_z)\n self.mu_ = core_z.mean(axis=0)\n cov = np.cov(core_z, rowvar=False) + 1e-3 * np.eye(core_z.shape[1])\n self.prec_ = np.linalg.pinv(cov)\n self.iforest_z_ = IForest(n_estimators=200, max_samples=\"auto\", random_state=42)\n self.iforest_x_ = IForest(n_estimators=200, max_samples=\"auto\", random_state=43)\n self.iforest_z_.fit(core_z)\n self.iforest_x_.fit(core_x)\n parts, _ = self._parts(X)\n self.loc_ = np.median(parts, axis=0)\n self.scale_ = np.median(np.abs(parts - self.loc_), axis=0) + 1e-9\n z_parts = (parts - self.loc_) / (1.4826 * self.scale_)\n self.weight_ = np.maximum(np.percentile(z_parts, 90, axis=0) - np.median(z_parts, axis=0), 0.25)\n score = self._combine(parts)\n return self\n\n def decision_function(self, X):\n parts, _ = self._parts(np.asarray(X, dtype=np.float64))\n return self._combine(parts)\n",
5 "after_line": 0,
6 "start_line": 160,
7 "end_line": 234
8}
OK: Replaced lines 160..234 with 84 line(s) in scikit-learn/custom_anomaly.py. Editable range: 160–243.
[Current file: scikit-learn/custom_anomaly.py | editable: 160–243 | total: 287 lines]
160 class CustomAnomalyDetector:
161 """Self-purified dual-space copula ensemble.
162
... (78 more lines) ...
241 def decision_function(self, X):
242 parts, _ = self._parts(np.asarray(X, dtype=np.float64))
243 return self._combine(parts)
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)
### cardio (scripts/cardio.sh)
[COMMAND FAILED exit=1]
Dataset: cardio, Seed: 42
Traceback (most recent call last):
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 255, in <module>
X, y = load_dataset(DATASET_NAME)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 60, in load_dataset
data = np.load(filepath, allow_pickle=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/miniconda3/envs/mlsbench-scikit-learn/lib/python3.11/site-packages/numpy/lib/_npyio_impl.py", line 454, in load
fid = stack.enter_context(open(os.fspath(file), "rb"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such
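The "copula-space Mahalanobis deviation" view used by these revisions can be sketched standalone (helper names are ours). It fits mean and ridge-regularized precision on the purified core in the Gaussianized space, then scores squared Mahalanobis distance:

```python
import numpy as np

def fit_mahalanobis(core_Z, ridge=1e-3):
    """Global dependence view: fit mean and precision on the inlier core.
    The ridge keeps the covariance invertible on near-degenerate cores;
    pinv mirrors the submitted code's np.linalg.pinv choice."""
    mu = core_Z.mean(axis=0)
    cov = np.cov(core_Z, rowvar=False) + ridge * np.eye(core_Z.shape[1])
    prec = np.linalg.pinv(cov)
    return mu, prec

def mahalanobis_sq(Z, mu, prec):
    # Vectorized squared Mahalanobis distance for each row of Z.
    d = Z - mu
    return np.einsum("ij,jk,ik->i", d, prec, d)
```

A point far from the core in any correlated direction gets a large score even when each marginal coordinate looks unremarkable, which is what this view contributes over the marginal tail score.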
Tool Input
1{
2 "n": 2
3}
[submit] Submitting result from test #2 as final.
### cardio (scripts/cardio.sh)
[COMMAND FAILED exit=1]
Dataset: cardio, Seed: 42
Traceback (most recent call last):
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 255, in <module>
X, y = load_dataset(DATASET_NAME)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 60, in load_dataset
data = np.load(filepath, allow_pickle=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/miniconda3/envs/mlsbench-scikit-learn/lib/python3.11/site-packages/numpy/lib/_npyio_impl.py", line 454, in load
fid = stack.enter_context(open(os.fspath(file), "rb"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/adbench/6_cardio.
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_anomaly.py",
4 "content": "class CustomAnomalyDetector:\n \"\"\"Self-purified adaptive copula graph ensemble.\n\n The detector maps features to inductive ECDF coordinates, Gaussianizes them,\n and iteratively fits detectors on a purified inlier core. It then fuses\n complementary signals: concentrated marginal tails, local graph sparsity,\n global dependence deviation, and isolation in both raw and copula spaces.\n A concentration gate emphasizes tail scores for sparse-feature anomalies and\n graph/global scores for diffuse anomalies.\n \"\"\"\n\n def __init__(self):\n self.eps_ = 1e-4\n\n def _ecdf(self, X):\n X = np.asarray(X, dtype=np.float64)\n U = np.empty_like(X)\n for j in range(X.shape[1]):\n U[:, j] = np.searchsorted(self.sorted_[:, j], X[:, j], side=\"right\") / (self.n_train_ + 1.0)\n return np.clip(U, self.eps_, 1.0 - self.eps_)\n\n def _parts(self, X):\n from scipy.stats import norm\n\n X = np.asarray(X, dtype=np.float64)\n U = self._ecdf(X)\n A = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n r = min(max(2, X.shape[1] // 6 + 1), X.shape[1])\n tail_sparse = np.partition(A, X.shape[1] - r, axis=1)[:, -r:].mean(axis=1)\n tail_dense = A.mean(axis=1)\n tail = 0.55 * tail_sparse + 0.45 * tail_dense\n conc = tail_sparse / (tail_dense + self.eps_)\n d_knn = self.nn_.kneighbors(Z, n_neighbors=self.q_, return_distance=True)[0][:, -1]\n local = d_knn / self.core_rad_\n dz = Z - self.mu_\n maha = np.einsum(\"ij,jk,ik->i\", dz, self.prec_, dz)\n iso = np.maximum(self.iforest_z_.decision_function(Z), self.iforest_x_.decision_function(X))\n return np.c_[tail, local, maha, iso], conc\n\n def _combine(self, parts, conc):\n z = np.clip((parts - self.loc_) / (1.4826 * self.scale_), -5.0, 20.0)\n wz = z * self.weight_\n g = 1.0 / (1.0 + np.exp(-2.0 * (conc - self.conc0_)))\n diffuse = 0.50 * wz[:, 1] + 0.30 * wz[:, 2] + 0.20 * wz[:, 3]\n sparse = 0.55 * wz[:, 0] + 0.20 * wz[:, 2] + 0.25 * wz[:, 3]\n return (1.0 - g) * diffuse + g * sparse + 0.12 * wz.max(axis=1)\n\n def 
fit(self, X):\n from scipy.stats import norm\n from sklearn.neighbors import NearestNeighbors\n from pyod.models.iforest import IForest\n\n X = np.asarray(X, dtype=np.float64)\n self.n_train_ = X.shape[0]\n self.sorted_ = np.sort(X, axis=0)\n U = self._ecdf(X)\n A = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n seed = 0.6 * A.mean(axis=1) + 0.4 * A.max(axis=1) + 0.05 * (Z * Z).mean(axis=1)\n core_q = float(np.clip(83.0 - 5.0 * np.log1p(X.shape[1]), 58.0, 76.0))\n\n for _ in range(2):\n mask = seed <= np.percentile(seed, core_q)\n core_x, core_z = X[mask], Z[mask]\n self.q_ = min(core_z.shape[0], int(np.clip(3.0 * np.log1p(core_z.shape[0]), 12.0, 36.0)) + 1)\n self.nn_ = NearestNeighbors(n_neighbors=self.q_, algorithm=\"auto\").fit(core_z)\n self.core_rad_ = np.median(self.nn_.kneighbors(core_z, n_neighbors=self.q_, return_distance=True)[0][:, -1]) + 1e-9\n self.mu_ = core_z.mean(axis=0)\n cov = np.cov(core_z, rowvar=False) + 1e-3 * np.eye(core_z.shape[1])\n self.prec_ = np.linalg.pinv(cov)\n self.iforest_z_ = IForest(n_estimators=160, max_samples=\"auto\", random_state=42)\n self.iforest_x_ = IForest(n_estimators=160, max_samples=\"auto\", random_state=43)\n self.iforest_z_.fit(core_z)\n self.iforest_x_.fit(core_x)\n parts, conc = self._parts(X)\n self.loc_ = np.median(parts, axis=0)\n self.scale_ = np.median(np.abs(parts - self.loc_), axis=0) + 1e-9\n z_parts = (parts - self.loc_) / (1.4826 * self.scale_)\n self.weight_ = np.maximum(np.percentile(z_parts, 90, axis=0) - np.median(z_parts, axis=0), 0.25)\n self.conc0_ = float(np.median(conc))\n seed = self._combine(parts, conc)\n return self\n\n def decision_function(self, X):\n parts, conc = self._parts(np.asarray(X, dtype=np.float64))\n return self._combine(parts, conc)\n",
5 "after_line": 0,
6 "start_line": 160,
7 "end_line": 243
8}
OK: Replaced lines 160..243 with 87 line(s) in scikit-learn/custom_anomaly.py. Editable range: 160–246.
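The concentration gate introduced in this revision can be illustrated in isolation. Given per-feature tail scores A = -log(min(U, 1-U)), it compares the mean of the r largest entries (sparse view) to the overall mean (dense view); a ratio well above 1 indicates an anomaly concentrated in few features. A minimal sketch mirroring the submitted `_parts`/`_combine` logic (function names are ours):

```python
import numpy as np

def concentration(A, eps=1e-4):
    """Ratio of top-r mean tail score to overall mean tail score.
    r scales with dimensionality, as in the submitted code."""
    d = A.shape[1]
    r = min(max(2, d // 6 + 1), d)
    tail_sparse = np.partition(A, d - r, axis=1)[:, -r:].mean(axis=1)
    tail_dense = A.mean(axis=1)
    return tail_sparse / (tail_dense + eps)

def gate(conc, conc0):
    # Logistic gate centered on the training-median concentration:
    # near 1 for sparse-feature anomalies, near 0 for diffuse ones.
    return 1.0 / (1.0 + np.exp(-2.0 * (conc - conc0)))
```

The gate then interpolates between the "diffuse" weighting (local sparsity and global deviation dominate) and the "sparse" weighting (marginal tails dominate), rather than committing to one fixed fusion.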
[Current file: scikit-learn/custom_anomaly.py | editable: 160–246 | total: 290 lines]
160 class CustomAnomalyDetector:
161 """Self-purified adaptive copula graph ensemble.
162
... (81 more lines) ...
244 def decision_function(self, X):
245 parts, conc = self._parts(np.asarray(X, dtype=np.float64))
246 return self._combine(parts, conc)
[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)
[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.
### cardio (scripts/cardio.sh)
[COMMAND FAILED exit=1]
Dataset: cardio, Seed: 42
Traceback (most recent call last):
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 258, in <module>
X, y = load_dataset(DATASET_NAME)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 60, in load_dataset
data = np.load(filepath, allow_pickle=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/miniconda3/envs/mlsbench-scikit-learn/lib/python3.11/site-packages/numpy/lib/_npyio_impl.py", line 454, in load
fid = stack.enter_context
Tool Input
1{
2 "n": 3
3}
[submit] Submitting result from test #3 as final.
### cardio (scripts/cardio.sh)
[COMMAND FAILED exit=1]
Dataset: cardio, Seed: 42
Traceback (most recent call last):
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 258, in <module>
X, y = load_dataset(DATASET_NAME)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-anomaly-detection/openai_gpt-5.4_20260414_110948/scikit-learn/custom_anomaly.py", line 60, in load_dataset
data = np.load(filepath, allow_pickle=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/srv/home/bohanlyu/miniconda3/envs/mlsbench-scikit-learn/lib/python3.11/site-packages/numpy/lib/_npyio_impl.py", line 454, in load
fid = stack.enter_context(open(os.fspath(file), "rb"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/data/adbench/6_cardio.
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_anomaly.py",
4 "content": "class CustomAnomalyDetector:\n \"\"\"Self-purified adaptive copula graph ensemble.\n\n The detector maps features to inductive ECDF coordinates, Gaussianizes them,\n and iteratively fits detectors on a purified inlier core. It fuses five\n complementary views: concentrated marginal tails, local graph sparsity,\n global dependence deviation, copula subspace reconstruction residual, and\n isolation in raw/copula spaces. A concentration gate emphasizes tail scores\n for sparse-feature anomalies and graph/global/subspace scores for diffuse\n anomalies.\n \"\"\"\n\n def __init__(self):\n self.eps_ = 1e-4\n\n def _ecdf(self, X):\n X = np.asarray(X, dtype=np.float64)\n U = np.empty_like(X)\n for j in range(X.shape[1]):\n U[:, j] = np.searchsorted(self.sorted_[:, j], X[:, j], side=\"right\") / (self.n_train_ + 1.0)\n return np.clip(U, self.eps_, 1.0 - self.eps_)\n\n def _parts(self, X):\n from scipy.stats import norm\n\n X = np.asarray(X, dtype=np.float64)\n U = self._ecdf(X)\n A = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n r = min(max(2, X.shape[1] // 6 + 1), X.shape[1])\n tail_sparse = np.partition(A, X.shape[1] - r, axis=1)[:, -r:].mean(axis=1)\n tail_dense = A.mean(axis=1)\n tail = 0.55 * tail_sparse + 0.45 * tail_dense\n conc = tail_sparse / (tail_dense + self.eps_)\n d_knn = self.nn_.kneighbors(Z, n_neighbors=self.q_, return_distance=True)[0][:, -1]\n local = d_knn / self.core_rad_\n dz = Z - self.mu_\n maha = np.einsum(\"ij,jk,ik->i\", dz, self.prec_, dz)\n recon = self.pca_.inverse_transform(self.pca_.transform(Z))\n resid = np.mean((Z - recon) ** 2, axis=1)\n iso = np.maximum(self.iforest_z_.decision_function(Z), self.iforest_x_.decision_function(X))\n return np.c_[tail, local, maha, resid, iso], conc\n\n def _combine(self, parts, conc):\n z = np.clip((parts - self.loc_) / (1.4826 * self.scale_), -5.0, 20.0)\n wz = z * self.weight_\n g = 1.0 / (1.0 + np.exp(-2.0 * (conc - self.conc0_)))\n diffuse = 0.38 * wz[:, 1] + 0.24 * wz[:, 2] + 0.18 * 
wz[:, 3] + 0.20 * wz[:, 4]\n sparse = 0.52 * wz[:, 0] + 0.16 * wz[:, 2] + 0.10 * wz[:, 3] + 0.22 * wz[:, 4]\n return (1.0 - g) * diffuse + g * sparse + 0.10 * wz.max(axis=1)\n\n def fit(self, X):\n from scipy.stats import norm\n from sklearn.decomposition import PCA\n from sklearn.neighbors import NearestNeighbors\n from pyod.models.iforest import IForest\n\n X = np.asarray(X, dtype=np.float64)\n self.n_train_ = X.shape[0]\n self.sorted_ = np.sort(X, axis=0)\n U = self._ecdf(X)\n A = -np.log(np.minimum(U, 1.0 - U))\n Z = norm.ppf(U)\n seed = 0.6 * A.mean(axis=1) + 0.4 * A.max(axis=1) + 0.05 * (Z * Z).mean(axis=1)\n core_q = float(np.clip(83.0 - 5.0 * np.log1p(X.shape[1]), 58.0, 76.0))\n\n for _ in range(2):\n mask = seed <= np.percentile(seed, core_q)\n core_x, core_z = X[mask], Z[mask]\n self.q_ = min(core_z.shape[0], int(np.clip(3.0 * np.log1p(core_z.shape[0]), 12.0, 36.0)) + 1)\n self.nn_ = NearestNeighbors(n_neighbors=self.q_, algorithm=\"auto\").fit(core_z)\n self.core_rad_ = np.median(self.nn_.kneighbors(core_z, n_neighbors=self.q_, return_distance=True)[0][:, -1]) + 1e-9\n self.mu_ = core_z.mean(axis=0)\n cov = np.cov(core_z, rowvar=False) + 1e-3 * np.eye(core_z.shape[1])\n self.prec_ = np.linalg.pinv(cov)\n nc = max(1, min(core_z.shape[1] - 1, int(np.ceil(0.6 * core_z.shape[1]))))\n self.pca_ = PCA(n_components=nc, svd_solver=\"full\").fit(core_z)\n self.iforest_z_ = IForest(n_estimators=160, max_samples=\"auto\", random_state=42)\n self.iforest_x_ = IForest(n_estimators=160, max_samples=\"auto\", random_state=43)\n self.iforest_z_.fit(core_z)\n self.iforest_x_.fit(core_x)\n parts, conc = self._parts(X)\n self.loc_ = np.median(parts, axis=0)\n self.scale_ = np.median(np.abs(parts - self.loc_), axis=0) + 1e-9\n z_parts = (parts - self.loc_) / (1.4826 * self.scale_)\n self.weight_ = np.maximum(np.percentile(z_parts, 90, axis=0) - np.median(z_parts, axis=0), 0.25)\n self.conc0_ = float(np.median(conc))\n seed = self._combine(parts, conc)\n return self\n\n def 
decision_function(self, X):\n parts, conc = self._parts(np.asarray(X, dtype=np.float64))\n return self._combine(parts, conc)\n",
5 "after_line": 0,
6 "start_line": 160,
7 "end_line": 246
8}
OK: Replaced lines 160..246 with 93 line(s) in scikit-learn/custom_anomaly.py. Editable range: 160–252.
[Current file: scikit-learn/custom_anomaly.py | editable: 160–252 | total: 296 lines]
160 class CustomAnomalyDetector:
161 """Self-purified adaptive copula graph ensemble.
162
... (87 more lines) ...
250 def decision_function(self, X):
251 parts, conc = self._parts(np.asarray(X, dtype=np.float64))
252 return self._combine(parts, conc)