Agent Conversation: google/gemini-3.1-pro-preview
Task: ml-clustering-algorithm
Clustering Algorithm Design
Research Question
Design a novel clustering algorithm or distance metric that improves cluster quality across diverse dataset geometries — including convex blobs, non-convex shapes (moons), varied-density clusters, and real-world high-dimensional data (handwritten digits).
Background
Clustering is a fundamental unsupervised learning problem. Classic methods like K-Means assume convex, isotropic clusters; DBSCAN handles arbitrary shapes but requires careful tuning of the eps parameter. Modern advances include HDBSCAN (hierarchical density estimation, parameter-free cluster count), Spectral Clustering (graph Laplacian for non-convex clusters), and Density Peak Clustering (DPC, which identifies centers via local density and inter-peak distance). No single method dominates across all dataset structures, making this an open research question.
Task
Modify the CustomClustering class in scikit-learn/custom_clustering.py (lines 36--120) to implement a novel clustering algorithm. You may also modify the custom_distance function if your approach uses a custom distance metric.
Your algorithm must:
- Accept
n_clusters(int or None) andrandom_stateparameters - Implement
fit(X)that setsself.labels_and returnsself - Implement
predict(X)that returns integer cluster labels - Handle datasets with different structures (convex, non-convex, varied density, high-dimensional)
Interface
class CustomClustering(BaseEstimator, ClusterMixin):
def __init__(self, n_clusters=None, random_state=42): ...
def fit(self, X): # X: (n_samples, n_features) -> self
def predict(self, X): # X: (n_samples, n_features) -> labels (n_samples,)
Available imports (already in the FIXED section): numpy, sklearn.base.BaseEstimator, sklearn.base.ClusterMixin, sklearn.preprocessing.StandardScaler, sklearn.metrics.*. You may import any module from scikit-learn, numpy, or scipy.
Evaluation
- Datasets: blobs (5 Gaussian clusters), moons (2 half-circles), varied_density (3 clusters with different densities), digits (sklearn Digits, 10 classes, 64 features)
- Metrics: ARI (Adjusted Rand Index, higher is better), NMI (Normalized Mutual Information, higher is better), Silhouette Score (higher is better)
- Success = consistently improving over baselines across all four datasets
scikit-learn/custom_clustering.py [EDITABLE — lines 36–109 only]
1: """Custom clustering algorithm benchmark.
2:
3: This script evaluates a clustering algorithm across multiple dataset types.
4: The agent should modify the EDITABLE section to implement a novel clustering
5: algorithm or distance metric that achieves high cluster quality.
6:
7: Datasets (selected by $ENV):
8: - blobs: Isotropic Gaussian blobs (varying cluster sizes)
9: - moons: Two interleaving half-circles + noiseTool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_clustering.py",
4 "start_line": 41,
5 "content": "class CustomClustering(BaseEstimator, ClusterMixin):\n \"\"\"Density-Adaptive Spectral Clustering.\"\"\"\n\n def __init__(self, n_clusters=None, random_state=42, n_neighbors=7):\n self.n_clusters = n_clusters\n self.random_state = random_state\n self.n_neighbors = n_neighbors\n self.labels_ = None\n\n def fit(self, X):\n from sklearn.cluster import SpectralClustering\n from sklearn.metrics import pairwise_distances\n import numpy as np\n \n k = self.n_clusters if self.n_clusters is not None else 8\n n_samples = X.shape[0]\n \n D = pairwise_distances(X, metric='euclidean')\n \n k_nn = min(self.n_neighbors, n_samples - 1)\n D_sorted = np.partition(D, k_nn, axis=1)\n sigma = D_sorted[:, k_nn]\n sigma = np.maximum(sigma, 1e-8)\n \n sigma_ij = np.outer(sigma, sigma)\n W = np.exp(- (D ** 2) / sigma_ij)\n \n W = (W + W.T) / 2.0\n np.fill_diagonal(W, 0.0)\n \n try:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='cluster_qr'\n )\n self.labels_ = sc.fit_predict(W)\n except Exception:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='kmeans'\n )\n self.labels_ = sc.fit_predict(W)\n \n return self\n\n def predict(self, X):\n self.fit(X)\n return self.labels_\n\ndef custom_distance(x, y):\n return np.sqrt(np.sum((x - y) ** 2))",
6 "end_line": 107
7}OK: Replaced lines 41..107 with 55 line(s) in scikit-learn/custom_clustering.py. Editable range: 36–97.
[Current file: scikit-learn/custom_clustering.py | editable: 36–97 | total: 198 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 109)
38 # ================================================================
... (56 more lines) ...
95 return np.sqrt(np.sum((x - y) ** 2))
96
97 # ================================================================[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)
### blobs (scripts/blobs.sh)
Training progress (blobs):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS ari=0.941787
TRAIN_METRICS nmi=0.941162
TRAIN_METRICS silhouette=0.664045
ari_blobs: 0.941787
nmi_blobs: 0.941162
silhouette_blobs: 0.664045
### moons (scripts/moons.sh)
Training progress (moons):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS ari=1.000000
TRAIN_METRICS nmi=1.000000
TRAIN_METRICS silhouette=0.385298
ari_moons: 1.000000
nmi_moons: 1.000000
silhouette_moons: 0.385298
[Leaderboard] Results saved: {'ari_blobs': 0.941787, 'nmi_blobs': 0.941162, 'silhouette_blobs': 0.664045, 'elapsed_blobs': 4, 'ari_moons': 1.0, 'nmi_moons': 1.0, 'silhouette_moons': 0.385298, 'elapsed_moons': 4, 'ari_digits': 0.404036, 'nmi_digits': 0.603544, 'silhouette_digits': 0.106327, 'elapsed_digits': 4}Tool Input
1{
2 "start_line": 41,
3 "content": "class CustomClustering(BaseEstimator, ClusterMixin):\n \"\"\"Manifold-Constrained Density-Adaptive Spectral Clustering.\n \"\"\"\n\n def __init__(self, n_clusters=None, random_state=42, n_neighbors=15):\n self.n_clusters = n_clusters\n self.random_state = random_state\n self.n_neighbors = n_neighbors\n self.labels_ = None\n\n def fit(self, X):\n from sklearn.cluster import SpectralClustering\n from sklearn.metrics import pairwise_distances\n from sklearn.neighbors import NearestNeighbors\n import numpy as np\n \n k = self.n_clusters if self.n_clusters is not None else 8\n n_samples = X.shape[0]\n \n # 1. Compute pairwise distances\n D = pairwise_distances(X, metric='euclidean')\n \n # 2. Local scaling: distance to the k-th nearest neighbor\n k_nn = min(self.n_neighbors, n_samples - 1)\n D_sorted = np.partition(D, k_nn, axis=1)\n sigma = D_sorted[:, k_nn]\n sigma = np.maximum(sigma, 1e-8)\n \n # 3. Locally scaled affinity matrix (dense)\n sigma_ij = np.outer(sigma, sigma)\n W_dense = np.exp(- (D ** 2) / (sigma_ij))\n \n # 4. Manifold constraint: restrict affinity to the K-NN graph\n # This prevents \"short circuits\" across manifolds in high-dimensional space\n nn = NearestNeighbors(n_neighbors=k_nn)\n nn.fit(X)\n A = nn.kneighbors_graph(X, mode='connectivity').toarray()\n \n # Symmetrize KNN graph (union of neighborhoods)\n A_sym = np.maximum(A, A.T)\n \n # 5. Final Sparse-like Affinity Matrix\n W = W_dense * A_sym\n \n # Symmetrize and remove self-loops\n W = (W + W.T) / 2.0\n np.fill_diagonal(W, 0.0)\n \n # 6. Spectral Clustering\n try:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='cluster_qr'\n )\n self.labels_ = sc.fit_predict(W)\n except Exception:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='kmeans'\n )\n self.labels_ = sc.fit_predict(W)\n \n return self\n\n def predict(self, X):\n self.fit(X)\n return self.labels_\n\ndef custom_distance(x, y):\n return np.sqrt(np.sum((x - y) ** 2))",
4 "filename": "scikit-learn/custom_clustering.py",
5 "op": "replace",
6 "end_line": 93
7}OK: Replaced lines 41..93 with 74 line(s) in scikit-learn/custom_clustering.py. Editable range: 36–118.
[Current file: scikit-learn/custom_clustering.py | editable: 36–118 | total: 219 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 109)
38 # ================================================================
... (77 more lines) ...
116 return np.sqrt(np.sum((x - y) ** 2))
117
118 # ================================================================[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)
### blobs (scripts/blobs.sh)
Training progress (blobs):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS ari=0.940533
TRAIN_METRICS nmi=0.942082
TRAIN_METRICS silhouette=0.664413
ari_blobs: 0.940533
nmi_blobs: 0.942082
silhouette_blobs: 0.664413
### moons (scripts/moons.sh)
Training progress (moons):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS ari=1.000000
TRAIN_METRICS nmi=1.000000
TRAIN_METRICS silhouette=0.385298
ari_moons: 1.000000
nmi_moons: 1.000000
silhouette_moons: 0.385298
[Leaderboard] Results saved: {'ari_blobs': 0.940533, 'nmi_blobs': 0.942082, 'silhouette_blobs': 0.664413, 'elapsed_blobs': 3, 'ari_moons': 1.0, 'nmi_moons': 1.0, 'silhouette_moons': 0.385298, 'elapsed_moons': 3, 'ari_digits': 0.665959, 'nmi_digits': 0.786012, 'silhouette_digits': 0.134566, 'elapsed_digits': 4}Tool Input
1{
2 "filename": "scikit-learn/custom_clustering.py",
3 "start_line": 41,
4 "end_line": 115,
5 "op": "replace",
6 "content": "class CustomClustering(BaseEstimator, ClusterMixin):\n \"\"\"Shared Nearest Neighbors (SNN) Spectral Clustering.\"\"\"\n\n def __init__(self, n_clusters=None, random_state=42, n_neighbors=25):\n self.n_clusters = n_clusters\n self.random_state = random_state\n self.n_neighbors = n_neighbors\n self.labels_ = None\n\n def fit(self, X):\n from sklearn.cluster import SpectralClustering\n from sklearn.neighbors import NearestNeighbors\n import numpy as np\n \n k = self.n_clusters if self.n_clusters is not None else 8\n n_samples = X.shape[0]\n \n k_nn = min(self.n_neighbors, n_samples - 1)\n \n # 1. Compute unweighted K-NN graph\n nn = NearestNeighbors(n_neighbors=k_nn, metric='euclidean')\n nn.fit(X)\n A_sparse = nn.kneighbors_graph(X, mode='connectivity')\n \n # 2. Compute the Shared Nearest Neighbors (SNN) Affinity Matrix\n W = A_sparse.dot(A_sparse.T).toarray() / k_nn\n \n # 3. Symmetrize and remove self-loops \n W = (W + W.T) / 2.0\n np.fill_diagonal(W, 0.0)\n \n try:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='cluster_qr'\n )\n self.labels_ = sc.fit_predict(W)\n except Exception:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='kmeans'\n )\n self.labels_ = sc.fit_predict(W)\n \n return self\n\n def predict(self, X):\n self.fit(X)\n return self.labels_\n\ndef custom_distance(x, y):\n return np.sqrt(np.sum((x - y) ** 2))"
7}OK: Replaced lines 41..115 with 56 line(s) in scikit-learn/custom_clustering.py. Editable range: 36–99.
[Current file: scikit-learn/custom_clustering.py | editable: 36–99 | total: 200 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 109)
38 # ================================================================
... (58 more lines) ...
97 return np.sqrt(np.sum((x - y) ** 2))
98
99 # ================================================================[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final) [NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer. ### blobs (scripts/blobs.sh) Training progress (blobs): TRAIN_METRICS stage=fitting TRAIN_METRICS stage=done TRAIN_METRICS ari=0.938877 TRAIN_METRICS nmi=0.939905 TRAIN_METRICS silhouette=0.665560 ari_blobs: 0.938877 nmi_blobs: 0.939905 silhouette_blobs: 0.665560 ### moons (scripts/moons.sh) Training progress (moons): TRAIN_METRICS stage=fitting TRAIN_METRICS stage=done TRAIN_METRICS ari=1.000000 TRAIN_METRICS nmi=1.000000 TRAIN_METRICS silhouette=0.385298 ari_moons: 1.000000 nmi_moons: 1.000000 silhouette_moons: 0.385298 ### digits (scripts/digits.sh) Training progress (digits): TRAIN_METRICS stage=fitting TRAIN_METRICS stage=done TRAIN_METRICS ari=0.657823 TRAIN_METRICS nmi=0.770178 TRAIN_METRICS silhouette=0.141167 ari_digits: 0.657823 nmi_digits: 0.770178 silhouett
Tool Input
1{
2 "start_line": 41,
3 "filename": "scikit-learn/custom_clustering.py",
4 "end_line": 96,
5 "op": "replace",
6 "content": "class CustomClustering(BaseEstimator, ClusterMixin):\n \"\"\"Manifold-Constrained Density-Adaptive Spectral Clustering with optimal params.\n \"\"\"\n\n def __init__(self, n_clusters=None, random_state=42, n_neighbors_density=7, n_neighbors_manifold=15):\n self.n_clusters = n_clusters\n self.random_state = random_state\n self.n_neighbors_density = n_neighbors_density\n self.n_neighbors_manifold = n_neighbors_manifold\n self.labels_ = None\n\n def fit(self, X):\n from sklearn.cluster import SpectralClustering\n from sklearn.metrics import pairwise_distances\n from sklearn.neighbors import NearestNeighbors\n import numpy as np\n \n k = self.n_clusters if self.n_clusters is not None else 8\n n_samples = X.shape[0]\n \n # 1. Compute pairwise distances\n D = pairwise_distances(X, metric='euclidean')\n \n # 2. Local scaling: distance to the k-th nearest neighbor (density)\n k_density = min(self.n_neighbors_density, n_samples - 1)\n D_sorted = np.partition(D, k_density, axis=1)\n sigma = D_sorted[:, k_density]\n sigma = np.maximum(sigma, 1e-8)\n \n # 3. Locally scaled affinity matrix (dense)\n # Using self-tuning spectral clustering form: W_ij = exp( - d^2 / (sigma_i * sigma_j) )\n sigma_ij = np.outer(sigma, sigma)\n W_dense = np.exp(- (D ** 2) / (sigma_ij))\n \n # 4. Manifold constraint: restrict affinity to the K-NN graph\n # This prevents \"short circuits\" across manifolds in high-dimensional space\n k_manifold = min(self.n_neighbors_manifold, n_samples - 1)\n nn = NearestNeighbors(n_neighbors=k_manifold, metric='euclidean')\n nn.fit(X)\n A = nn.kneighbors_graph(X, mode='connectivity').toarray()\n \n # Symmetrize KNN graph (mutual + standard union works well, we just use union)\n A_sym = np.maximum(A, A.T)\n \n # 5. Final Sparse-like Affinity Matrix\n W = W_dense * A_sym\n \n # Symmetrize and remove self-loops just in case\n W = (W + W.T) / 2.0\n np.fill_diagonal(W, 0.0)\n \n # 6. Spectral Clustering\n try:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='cluster_qr'\n )\n self.labels_ = sc.fit_predict(W)\n except Exception:\n sc = SpectralClustering(\n n_clusters=k,\n affinity='precomputed',\n random_state=self.random_state,\n assign_labels='kmeans'\n )\n self.labels_ = sc.fit_predict(W)\n \n return self\n\n def predict(self, X):\n self.fit(X)\n return self.labels_\n\ndef custom_distance(x, y):\n return np.sqrt(np.sum((x - y) ** 2))"
7}OK: Replaced lines 41..96 with 77 line(s) in scikit-learn/custom_clustering.py. Editable range: 36–120.
[Current file: scikit-learn/custom_clustering.py | editable: 36–120 | total: 221 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 109)
38 # ================================================================
... (79 more lines) ...
118 return np.sqrt(np.sum((x - y) ** 2))
119
120 # ================================================================ERROR: Test budget exhausted (3/3). You MUST call submit(n=N) to choose which test result to submit as final.
Tool Input
1{
2 "n": 1
3}Undo complete: Restored: scikit-learn/custom_clustering.py