ml-dimensionality-reduction

Classical ML · scikit-learn · rigorous codebase

Description

Dimensionality Reduction: Nonlinear Embedding Method Design

Research Question

Design a novel nonlinear dimensionality reduction method that preserves data structure (both local neighborhoods and global relationships) better than existing methods when embedding high-dimensional data into 2D.

Background

Dimensionality reduction is fundamental to data analysis and visualization. PCA provides a fast linear baseline but cannot capture nonlinear manifold structure. Nonlinear methods such as t-SNE, UMAP, TriMap, and PaCMAP (the baselines listed under Results) trade off local and global structure preservation in different ways. This task evaluates dimensionality reduction methods by neighborhood preservation across diverse data types.

Task

Modify the CustomDimReduction class (lines 14-70) in custom_dimred.py to implement a novel nonlinear dimensionality reduction algorithm. Your implementation must:

  1. Accept high-dimensional data X of shape (n_samples, n_features) where n_samples <= 5000 and n_features ranges from 50 to 784.
  2. Return a 2D embedding of shape (n_samples, 2).
  3. Respect the random_state parameter for reproducibility.
  4. Complete within a reasonable time (under 5 minutes per dataset on CPU).

You may use numpy, scipy, and scikit-learn utilities (already installed). The method is evaluated on three diverse datasets: MNIST (digit images), Fashion-MNIST (clothing images), and 20 Newsgroups (text, pre-processed to 50D via TF-IDF + SVD).
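The 20 Newsgroups preprocessing mentioned above (TF-IDF followed by SVD down to 50 dimensions) might look roughly like the sketch below. The function name `tfidf_svd_features`, the English stop-word filter, and the default `random_state` are illustrative assumptions; the harness's actual preprocessing settings are not given here.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline


def tfidf_svd_features(texts, n_components=50, random_state=0):
    """Turn raw documents into dense features via TF-IDF + truncated SVD.

    Hypothetical helper: parameter choices are assumptions, not the
    benchmark's documented configuration.
    """
    pipeline = make_pipeline(
        # Sparse TF-IDF term weights, one row per document.
        TfidfVectorizer(stop_words="english"),
        # TruncatedSVD works directly on sparse input (no centering),
        # which is why it is the usual choice after TF-IDF.
        TruncatedSVD(n_components=n_components, random_state=random_state),
    )
    return pipeline.fit_transform(texts)


# Toy corpus; the real dataset would need n_components=50 and many more documents.
toy_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
    "the rocket launched into orbit",
    "astronauts orbit the earth in a station",
]
toy_features = tfidf_svd_features(toy_docs, n_components=3)
```

On the full corpus one would call the helper with the default `n_components=50`, which yields the 50-dimensional input that `CustomDimReduction` receives for this dataset.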

Interface

class CustomDimReduction:
    def __init__(self, n_components: int = 2, random_state: int | None = None):
        ...
    def fit_transform(self, X: NDArray[np.float64]) -> NDArray[np.float64]:
        # X: (n_samples, n_features), returns: (n_samples, n_components)
        ...
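
As a point of reference, here is a minimal placeholder that satisfies the interface above. It simply wraps scikit-learn's PCA, so it is a linear stand-in for checking shapes and reproducibility, not a candidate for the novel method the task asks for. The input validation is an added assumption.

```python
from __future__ import annotations

import numpy as np
from numpy.typing import NDArray
from sklearn.decomposition import PCA


class CustomDimReduction:
    """Placeholder with the required interface (linear PCA, not a novel method)."""

    def __init__(self, n_components: int = 2, random_state: int | None = None):
        self.n_components = n_components
        self.random_state = random_state

    def fit_transform(self, X: NDArray[np.float64]) -> NDArray[np.float64]:
        X = np.asarray(X, dtype=np.float64)
        if X.ndim != 2:
            raise ValueError("X must have shape (n_samples, n_features)")
        # Forwarding random_state keeps any randomized SVD solver reproducible.
        pca = PCA(n_components=self.n_components, random_state=self.random_state)
        return pca.fit_transform(X)


# Quick shape and reproducibility check on synthetic data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 50))
emb_a = CustomDimReduction(random_state=0).fit_transform(X_demo)
emb_b = CustomDimReduction(random_state=0).fit_transform(X_demo)
```

A real submission would replace the body of `fit_transform` with the nonlinear algorithm while keeping the same signature and return shape.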

Evaluation

Three metrics are computed on each dataset (k=7 neighbors):

  • kNN accuracy: Classification accuracy of a 7-NN classifier in the 2D space (higher is better). Measures how well class structure is preserved.
  • Trustworthiness: Whether points that are neighbors in the embedding are also neighbors in the original space (higher is better, max 1.0).
  • Continuity: Whether points that are neighbors in the original space remain neighbors in the embedding (higher is better, max 1.0).
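
The three metrics above can be approximated with scikit-learn as sketched below. Continuity is computed here by calling `trustworthiness` with the roles of the original and embedded data swapped, which matches the description above. The helper name `evaluate_embedding` and the 5-fold cross-validation used for kNN accuracy are assumptions; the harness may use a different train/test protocol.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def evaluate_embedding(X, Y, labels, k=7):
    """Score embedding Y of data X with the three metrics (hypothetical helper)."""
    knn = KNeighborsClassifier(n_neighbors=k)
    # Assumption: 5-fold cross-validated accuracy in the embedded space.
    knn_acc = cross_val_score(knn, Y, labels, cv=5).mean()
    # Penalizes embedding neighbors that are not neighbors in the original space.
    trust = trustworthiness(X, Y, n_neighbors=k)
    # Swapped arguments: penalizes original neighbors that are lost in the embedding.
    cont = trustworthiness(Y, X, n_neighbors=k)
    return {"knn_acc": knn_acc, "trustworthiness": trust, "continuity": cont}


# Synthetic stand-in data; the benchmark itself uses MNIST, Fashion-MNIST, 20 Newsgroups.
X_demo, y_demo = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)
Y_demo = PCA(n_components=2, random_state=0).fit_transform(X_demo)
scores = evaluate_embedding(X_demo, Y_demo, y_demo)
```

Note that `trustworthiness` builds a full pairwise distance matrix, so at 5000 samples it needs memory on the order of n_samples squared.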

Success means improving on existing methods across all three datasets and all three metrics.

Code

custom_dimred.py
 1  """Custom dimensionality reduction benchmark -- agent-editable template.
 2
 3  The agent modifies `CustomDimReduction` to implement a novel nonlinear
 4  dimensionality reduction method. The evaluation harness embeds three
 5  datasets into 2D, then measures kNN accuracy, trustworthiness, and
 6  continuity in the reduced space.
 7  """
 8
 9  import numpy as np
10  from numpy.typing import NDArray
11
12  # =====================================================================
13  # EDITABLE: implement CustomDimReduction below (lines 15-59)
14  # =====================================================================
15  class CustomDimReduction:

Results

Model                      Type      MNIST                Fashion-MNIST        20 Newsgroups
                                     kNN    Trust  Cont   kNN    Trust  Cont   kNN    Trust  Cont
pacmap                     baseline  0.853  0.901  0.957  0.737  0.960  0.979  0.678  0.845  0.891
pca                        baseline  0.326  0.671  0.927  0.507  0.875  0.968  0.277  0.559  0.779
trimap                     baseline  0.832  0.890  0.958  0.733  0.956  0.983  0.669  0.849  0.871
tsne                       baseline  0.862  0.961  0.967  0.791  0.981  0.984  0.687  0.939  0.915
umap                       baseline  0.844  0.901  0.967  0.740  0.959  0.982  0.668  0.885  0.912
anthropic/claude-opus-4.6  vanilla   0.868  0.963  0.966  0.780  0.980  0.984  0.698  0.937  0.921
deepseek-reasoner          vanilla   0.247  0.565  0.630  0.271  0.690  0.608  0.226  0.508  0.522
openai/gpt-5.4             vanilla   0.750  0.799  0.957  0.699  0.940  0.983  0.577  0.659  0.876
qwen/qwen3.6-plus          vanilla   -      -      -      -      -      -      -      -      -
anthropic/claude-opus-4.6  agent     0.868  0.963  0.966  0.780  0.980  0.984  0.698  0.937  0.922
deepseek-reasoner          agent     0.541  0.731  0.930  0.650  0.880  0.953  0.471  0.609  0.838
openai/gpt-5.4             agent     0.875  0.921  0.953  0.756  0.976  0.981  0.651  0.852  0.853
qwen/qwen3.6-plus          agent     -      -      -      -      -      -      -      -      -

kNN = kNN accuracy, Trust = trustworthiness, Cont = continuity.
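
One way to read the success criterion against the table above is to compare a candidate row with the best baseline value in each of the nine columns. The sketch below does this for one row, with the numbers copied from the table. Treating "improving" as strictly greater than the best baseline is an assumption, and the names `BASELINES` and `strict_wins` are illustrative.

```python
# Column order: MNIST (kNN, Trust, Cont), Fashion-MNIST (...), 20 Newsgroups (...).
BASELINES = {
    "pacmap": [0.853, 0.901, 0.957, 0.737, 0.960, 0.979, 0.678, 0.845, 0.891],
    "pca":    [0.326, 0.671, 0.927, 0.507, 0.875, 0.968, 0.277, 0.559, 0.779],
    "trimap": [0.832, 0.890, 0.958, 0.733, 0.956, 0.983, 0.669, 0.849, 0.871],
    "tsne":   [0.862, 0.961, 0.967, 0.791, 0.981, 0.984, 0.687, 0.939, 0.915],
    "umap":   [0.844, 0.901, 0.967, 0.740, 0.959, 0.982, 0.668, 0.885, 0.912],
}


def strict_wins(candidate, baselines=BASELINES):
    """Return one boolean per column: does the candidate beat every baseline?"""
    # Best baseline score in each column (all metrics are higher-is-better).
    best = [max(column) for column in zip(*baselines.values())]
    # Assumption: a tie with the best baseline does not count as an improvement.
    return [c > b for c, b in zip(candidate, best)]


# Row "anthropic/claude-opus-4.6 agent" from the table.
candidate_row = [0.868, 0.963, 0.966, 0.780, 0.980, 0.984, 0.698, 0.937, 0.922]
wins = strict_wins(candidate_row)
n_wins = sum(wins)
meets_criterion = all(wins)
```

For that row the check reports 4 strict improvements out of 9 columns, so under this reading it does not meet the criterion of improving on every dataset and metric.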

Agent Conversations