ml-dimensionality-reduction
Description
Dimensionality Reduction: Nonlinear Embedding Method Design
Research Question
Design a novel nonlinear dimensionality reduction method that preserves data structure (both local neighborhoods and global relationships) better than existing methods when embedding high-dimensional data into 2D.
Background
Dimensionality reduction is fundamental to data analysis and visualization. PCA provides a fast linear baseline but cannot capture nonlinear manifold structure. Nonlinear methods such as t-SNE, UMAP, TriMap, and PaCMAP trade off local and global structure preservation in different ways. This task evaluates dimensionality reduction methods by neighborhood preservation across diverse data types.
Task
Modify the CustomDimReduction class (lines 14-70) in custom_dimred.py to implement a novel nonlinear dimensionality reduction algorithm. Your implementation must:
- Accept high-dimensional data X of shape (n_samples, n_features) where n_samples <= 5000 and n_features ranges from 50 to 784.
- Return a 2D embedding of shape (n_samples, 2).
- Respect the random_state parameter for reproducibility.
- Complete within a reasonable time (under 5 minutes per dataset on CPU).
You may use numpy, scipy, and scikit-learn utilities (already installed). The method is evaluated on three diverse datasets: MNIST (digit images), Fashion-MNIST (clothing images), and 20 Newsgroups (text, pre-processed to 50D via TF-IDF + SVD).
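The "TF-IDF + SVD" preprocessing mentioned for 20 Newsgroups can be pictured roughly as below. This is a minimal sketch, not the benchmark's actual pipeline: the function name `tfidf_svd`, the default vectorizer settings, and the toy corpus are all assumptions made for illustration, and the demo uses 3 components only because the toy vocabulary is too small for 50.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def tfidf_svd(docs, n_components=50, random_state=0):
    """Map raw documents to dense vectors: TF-IDF weighting, then truncated SVD."""
    pipeline = make_pipeline(
        # ASSUMPTION: default tokenization; the benchmark's settings are not stated.
        TfidfVectorizer(),
        # Truncated SVD works directly on the sparse TF-IDF matrix.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )
    return pipeline.fit_transform(docs)


# Hypothetical toy corpus, only to show the shapes involved.
toy_docs = [
    "the rocket launch was delayed by weather",
    "the shuttle reached orbit after launch",
    "the goalie blocked every shot in the game",
    "the team won the hockey game in overtime",
    "new graphics cards render images faster",
    "the driver update fixed the graphics bug",
]
X_toy = tfidf_svd(toy_docs, n_components=3)
print(X_toy.shape)
```

With the real corpus, the same call with `n_components=50` would yield the 50-dimensional input described above.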
Interface
class CustomDimReduction:
    def __init__(self, n_components: int = 2, random_state: int | None = None):
        ...

    def fit_transform(self, X: NDArray[np.float64]) -> NDArray[np.float64]:
        # X: (n_samples, n_features), returns: (n_samples, n_components)
        ...
Evaluation
Three metrics are computed on each dataset (k=7 neighbors):
- kNN accuracy: Classification accuracy of a 7-NN classifier in the 2D space (higher is better). Measures how well class structure is preserved.
- Trustworthiness: Whether points that are neighbors in the embedding are also neighbors in the original space (higher is better, max 1.0).
- Continuity: Whether points that are neighbors in the original space remain neighbors in the embedding (higher is better, max 1.0).
Success means improving on existing methods across all three datasets and all three metrics.
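The three metrics above can be approximated with scikit-learn as sketched here. This is not the harness's code: the helper name `evaluate_embedding` is hypothetical, and the 5-fold cross-validation used for kNN accuracy is an assumption, since the actual train/test protocol is not stated. Continuity is computed by calling `trustworthiness` with the two spaces swapped, which follows from the definitions of the two metrics.

```python
import numpy as np
from sklearn.manifold import trustworthiness
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def evaluate_embedding(X_high, X_2d, labels, k=7, cv=5):
    """Return kNN accuracy, trustworthiness, and continuity for one embedding."""
    # kNN accuracy: k-NN classifier evaluated on the 2D coordinates.
    # ASSUMPTION: 5-fold cross-validation stands in for the unstated protocol.
    knn = KNeighborsClassifier(n_neighbors=k)
    knn_acc = cross_val_score(knn, X_2d, labels, cv=cv).mean()
    # Trustworthiness penalizes embedding neighbors that are far apart originally.
    trust = trustworthiness(X_high, X_2d, n_neighbors=k)
    # Continuity: same formula with the roles of the two spaces swapped.
    cont = trustworthiness(X_2d, X_high, n_neighbors=k)
    return {
        "knn_acc": float(knn_acc),
        "trustworthiness": float(trust),
        "continuity": float(cont),
    }


# Synthetic stand-in data: two well-separated Gaussian blobs in 20D.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X_high = rng.standard_normal((100, 20)) + 4.0 * labels[:, None]
X_2d = X_high[:, :2]  # crude "embedding": keep the first two coordinates
scores = evaluate_embedding(X_high, X_2d, labels)
print(scores)
```

Scores from this sketch may differ from the reported numbers if the harness subsamples the data or uses a different split.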
Code
"""Custom dimensionality reduction benchmark -- agent-editable template.

The agent modifies `CustomDimReduction` to implement a novel nonlinear
dimensionality reduction method. The evaluation harness embeds three
datasets into 2D, then measures kNN accuracy, trustworthiness, and
continuity in the reduced space.
"""

import numpy as np
from numpy.typing import NDArray

# =====================================================================
# EDITABLE: implement CustomDimReduction below (lines 15-59)
# =====================================================================
class CustomDimReduction:
Results
| Model | Type | kNN acc. (MNIST) ↑ | Trustworthiness (MNIST) ↑ | Continuity (MNIST) ↑ | kNN acc. (Fashion-MNIST) ↑ | Trustworthiness (Fashion-MNIST) ↑ | Continuity (Fashion-MNIST) ↑ | kNN acc. (20 Newsgroups) ↑ | Trustworthiness (20 Newsgroups) ↑ | Continuity (20 Newsgroups) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| pacmap | baseline | 0.853 | 0.901 | 0.957 | 0.737 | 0.960 | 0.979 | 0.678 | 0.845 | 0.891 |
| pca | baseline | 0.326 | 0.671 | 0.927 | 0.507 | 0.875 | 0.968 | 0.277 | 0.559 | 0.779 |
| trimap | baseline | 0.832 | 0.890 | 0.958 | 0.733 | 0.956 | 0.983 | 0.669 | 0.849 | 0.871 |
| tsne | baseline | 0.862 | 0.961 | 0.967 | 0.791 | 0.981 | 0.984 | 0.687 | 0.939 | 0.915 |
| umap | baseline | 0.844 | 0.901 | 0.967 | 0.740 | 0.959 | 0.982 | 0.668 | 0.885 | 0.912 |
| anthropic/claude-opus-4.6 | vanilla | 0.868 | 0.963 | 0.966 | 0.780 | 0.980 | 0.984 | 0.698 | 0.937 | 0.921 |
| deepseek-reasoner | vanilla | 0.247 | 0.565 | 0.630 | 0.271 | 0.690 | 0.608 | 0.226 | 0.508 | 0.522 |
| openai/gpt-5.4 | vanilla | 0.750 | 0.799 | 0.957 | 0.699 | 0.940 | 0.983 | 0.577 | 0.659 | 0.876 |
| qwen/qwen3.6-plus | vanilla | - | - | - | - | - | - | - | - | - |
| anthropic/claude-opus-4.6 | agent | 0.868 | 0.963 | 0.966 | 0.780 | 0.980 | 0.984 | 0.698 | 0.937 | 0.922 |
| deepseek-reasoner | agent | 0.541 | 0.731 | 0.930 | 0.650 | 0.880 | 0.953 | 0.471 | 0.609 | 0.838 |
| openai/gpt-5.4 | agent | 0.875 | 0.921 | 0.953 | 0.756 | 0.976 | 0.981 | 0.651 | 0.852 | 0.853 |
| qwen/qwen3.6-plus | agent | - | - | - | - | - | - | - | - | - |