ml-clustering-algorithm

Classical ML · scikit-learn · rigorous codebase

Description

Clustering Algorithm Design

Research Question

Design a novel clustering algorithm or distance metric that improves cluster quality across diverse dataset geometries — including convex blobs, non-convex shapes (moons), varied-density clusters, and real-world high-dimensional data (handwritten digits).

Background

Clustering is a fundamental unsupervised learning problem. Classic methods like K-Means assume convex, isotropic clusters; DBSCAN handles arbitrary shapes but requires careful tuning of the eps parameter. Modern advances include HDBSCAN (hierarchical density estimation that does not need the number of clusters specified up front), Spectral Clustering (uses the graph Laplacian to separate non-convex clusters), and Density Peak Clustering (DPC, which identifies centers as points with high local density that lie far from any point of higher density). No single method dominates across all dataset structures, so finding one that performs well on all of them remains an open research question.
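
To make the contrast between centroid-based and density-based methods concrete, the following minimal sketch compares K-Means and DBSCAN on a two-moons dataset. The sample count, noise level, and eps value are illustrative assumptions chosen for this example; they are not the benchmark's settings.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaving half-circles; n_samples and noise are illustrative choices.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# K-Means splits the plane with a straight boundary, which cuts through both moons.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# eps=0.3 is hand-picked for this scaled data, not a recommended default.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN  ARI: {ari_dbscan:.3f}")
```

On data like this, DBSCAN is expected to recover the moons far better than K-Means, but a different eps can change that outcome, which is the tuning sensitivity mentioned above.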

Task

Modify the CustomClustering class in scikit-learn/custom_clustering.py (lines 36-120) to implement a novel clustering algorithm. You may also modify the custom_distance function if your approach uses a custom distance metric.

Your algorithm must:

  • Accept n_clusters (int or None) and random_state parameters
  • Implement fit(X) that sets self.labels_ and returns self
  • Implement predict(X) that returns integer cluster labels
  • Handle datasets with different structures (convex, non-convex, varied density, high-dimensional)

Interface

class CustomClustering(BaseEstimator, ClusterMixin):
    def __init__(self, n_clusters=None, random_state=42): ...
    def fit(self, X): ...        # X: (n_samples, n_features) -> self
    def predict(self, X): ...    # X: (n_samples, n_features) -> labels (n_samples,)

Available imports (already in the FIXED section): numpy, sklearn.base.BaseEstimator, sklearn.base.ClusterMixin, sklearn.preprocessing.StandardScaler, sklearn.metrics.*. You may import any module from scikit-learn, numpy, or scipy.
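
As a reference point for the interface, here is a minimal sketch of a class that satisfies the required signatures by wrapping K-Means. It is a placeholder, not a novel algorithm and not the benchmark's reference implementation. The fallback to 3 clusters when n_clusters is None, and the attribute names scaler_ and model_, are assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


class CustomClustering(BaseEstimator, ClusterMixin):
    """Placeholder that satisfies the interface by wrapping K-Means."""

    def __init__(self, n_clusters=None, random_state=42):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X):
        # Assumption: fall back to 3 clusters when n_clusters is None.
        k = 3 if self.n_clusters is None else self.n_clusters
        self.scaler_ = StandardScaler().fit(X)
        X_scaled = self.scaler_.transform(X)
        self.model_ = KMeans(
            n_clusters=k, n_init=10, random_state=self.random_state
        ).fit(X_scaled)
        self.labels_ = self.model_.labels_
        return self

    def predict(self, X):
        # Assign each sample to the nearest fitted centroid.
        return self.model_.predict(self.scaler_.transform(X))


# Usage on synthetic blobs (illustrative data, not the benchmark dataset).
X, _ = make_blobs(n_samples=300, centers=5, random_state=0)
model = CustomClustering(n_clusters=5).fit(X)
labels = model.predict(X)
print(labels[:10])
```

An algorithm without natural centroids (for example, a density-based one) would need a different predict strategy, such as assigning new points to the label of their nearest training sample.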

Evaluation

  • Datasets: blobs (5 Gaussian clusters), moons (2 half-circles), varied_density (3 clusters with different densities), digits (sklearn Digits, 10 classes, 64 features)
  • Metrics: ARI (Adjusted Rand Index, higher is better), NMI (Normalized Mutual Information, higher is better), Silhouette Score (higher is better)
  • Success = consistently improving over baselines across all four datasets
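
The three metrics can be computed with scikit-learn as in the sketch below. The helper name score_clustering is hypothetical. Returning -1.0 for the Silhouette Score when fewer than two clusters are found is an assumption of this example, made because the score is undefined in that case.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)
from sklearn.preprocessing import StandardScaler


def score_clustering(X, y_true, labels):
    """Return ARI, NMI, and Silhouette Score for one clustering result."""
    n_labels = len(np.unique(labels))
    # silhouette_score requires 2 <= n_labels <= n_samples - 1.
    if 2 <= n_labels <= len(X) - 1:
        sil = float(silhouette_score(X, labels))
    else:
        sil = -1.0  # assumption: sentinel for degenerate clusterings
    return {
        "ari": float(adjusted_rand_score(y_true, labels)),
        "nmi": float(normalized_mutual_info_score(y_true, labels)),
        "silhouette": sil,
    }


# Usage: score a K-Means baseline on the sklearn Digits data.
X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)
scores = score_clustering(X, y, labels)
print(scores)
```

ARI and NMI compare predicted labels against ground truth, while the Silhouette Score uses only the data geometry, so a method can score well on one and poorly on the other.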

Code

custom_clustering.py
"""Custom clustering algorithm benchmark.

This script evaluates a clustering algorithm across multiple dataset types.
The agent should modify the EDITABLE section to implement a novel clustering
algorithm or distance metric that achieves high cluster quality.

Datasets (selected by $ENV):
  - blobs: Isotropic Gaussian blobs (varying cluster sizes)
  - moons: Two interleaving half-circles + noise
  - varied_density: Clusters with different densities and sizes
  - digits: Real-world: sklearn Digits (8x8 images of handwritten digits)

Metrics: ARI (Adjusted Rand Index), NMI (Normalized Mutual Information),
         Silhouette Score
"""
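
The docstring states that the dataset is selected by $ENV. The rest of the script is not shown here, so the following is only a sketch of how such a selection could work. The function name load_dataset, the sample counts, the noise level, the per-cluster standard deviations, and the fallback to blobs are all assumptions made for illustration.

```python
import os

from sklearn.datasets import load_digits, make_blobs, make_moons

KNOWN_DATASETS = ("blobs", "moons", "varied_density", "digits")


def load_dataset(name, random_state=42):
    """Return (X, y) for one of the four dataset names (hypothetical settings)."""
    if name == "blobs":
        # Five isotropic Gaussian clusters with different sizes.
        return make_blobs(n_samples=[100, 150, 200, 250, 300],
                          random_state=random_state)
    if name == "moons":
        return make_moons(n_samples=500, noise=0.08, random_state=random_state)
    if name == "varied_density":
        # Three clusters whose spread (and therefore density) differs.
        return make_blobs(n_samples=[100, 200, 400],
                          cluster_std=[0.3, 1.0, 2.5],
                          random_state=random_state)
    if name == "digits":
        return load_digits(return_X_y=True)
    raise ValueError(f"unknown dataset: {name!r}")


# Read the dataset name from the environment; fall back to blobs if the
# variable is unset or holds an unrecognized value (assumption of this sketch).
name = os.environ.get("ENV", "blobs")
if name not in KNOWN_DATASETS:
    name = "blobs"
X, y = load_dataset(name)
print(name, X.shape)
```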

Results

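| Model | Type | ari blobs | nmi blobs | silhouette blobs | ari moons | nmi moons | silhouette moons | ari digits | nmi digits | silhouette digits |
|---|---|---|---|---|---|---|---|---|---|---|
| dbscan | baseline | 0.692 | 0.809 | 0.625 | 0.978 | 0.953 | 0.273 | 0.001 | 0.018 | -1.000 |
| hdbscan | baseline | 0.767 | 0.858 | 0.651 | 0.999 | 0.996 | 0.372 | 0.229 | 0.576 | 0.020 |
| kmeans | baseline | 0.853 | 0.874 | 0.585 | 0.481 | 0.383 | 0.494 | 0.534 | 0.671 | 0.139 |
| anthropic/claude-opus-4.6 | vanilla | 0.936 | 0.937 | 0.665 | 1.000 | 1.000 | 0.385 | 0.643 | 0.759 | 0.136 |
| deepseek-reasoner | vanilla | 0.770 | 0.827 | 0.404 | 0.019 | 0.015 | 0.304 | 0.556 | 0.758 | 0.105 |
| google/gemini-3.1-pro-preview | vanilla | 0.942 | 0.941 | 0.664 | 1.000 | 1.000 | 0.385 | 0.404 | 0.604 | 0.106 |
| openai/gpt-5.4 | vanilla | 0.939 | 0.939 | 0.664 | 1.000 | 1.000 | 0.385 | 0.642 | 0.762 | 0.131 |
| qwen/qwen3.6-plus | vanilla | 0.939 | 0.941 | 0.666 | 1.000 | 1.000 | 0.385 | 0.658 | 0.773 | 0.137 |
| anthropic/claude-opus-4.6 | agent | 0.939 | 0.939 | 0.665 | 1.000 | 1.000 | 0.385 | 0.664 | 0.779 | 0.137 |
| deepseek-reasoner | agent | 0.743 | 0.806 | 0.404 | 0.002 | 0.002 | 0.182 | 0.649 | 0.747 | 0.085 |
| google/gemini-3.1-pro-preview | agent | 0.941 | 0.942 | 0.664 | 1.000 | 1.000 | 0.385 | 0.666 | 0.786 | 0.135 |
| openai/gpt-5.4 | agent | 0.939 | 0.939 | 0.664 | 1.000 | 1.000 | 0.385 | 0.642 | 0.762 | 0.131 |
| qwen/qwen3.6-plus | agent | 0.939 | 0.941 | 0.666 | 1.000 | 1.000 | 0.385 | 0.658 | 0.773 | 0.137 |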

Agent Conversations