ml-feature-selection
Description
Feature Selection Method Design
Research Question
Design a novel univariate feature scoring method that identifies the most informative features for classification, generalizing across diverse data modalities (text, vision, tabular).
Background
Feature selection is a fundamental preprocessing step in machine learning. By removing irrelevant or redundant features, it can improve model accuracy, reduce overfitting, and speed up training. Classical univariate methods score each feature independently based on its relationship with the target variable:
- Chi-squared test: Measures departure from independence between feature and target using contingency tables. Works best with non-negative, count-like features.
- ANOVA F-value (f_classif): Computes the ratio of between-class variance to within-class variance. Effective for normally distributed features with different means per class.
- Mutual Information: Estimates the mutual information between each feature and the target via k-nearest neighbors. Captures non-linear dependencies but is computationally expensive.
Each method has strengths and weaknesses depending on the data distribution. The task is to design a scoring function that performs robustly across different data types and class structures.
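To make the trade-offs concrete, here is a minimal comparison of the three classical scorers on a small synthetic dataset. The dataset and its parameters (`make_classification`, 10 features, 3 informative) are illustrative choices, not part of the benchmark:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

# Toy dataset: 200 samples, 10 features, only 3 informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

chi2_scores, _ = chi2(X, y)                        # contingency-table statistic
f_scores, _ = f_classif(X, y)                      # between/within-class variance ratio
mi_scores = mutual_info_classif(X, y, random_state=0)  # kNN-based MI estimate

print(chi2_scores.shape, f_scores.shape, mi_scores.shape)  # all (10,)
```

All three return one non-negative score per feature, but on very different scales, which matters if you try to combine them.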
Task
Implement the score_features(X, y) function in custom_featsel.py. Given a training feature matrix X and integer class labels y, return a 1-D numpy array of non-negative importance scores (one per feature). The top-k features (by score) will be selected and used to train a LogisticRegression classifier.
Interface
```python
def score_features(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Args:
        X: (n_samples, n_features) non-negative float array
        y: (n_samples,) integer class labels

    Returns:
        scores: (n_features,) non-negative float array
    """
```
Available imports (already at top of file): numpy, scipy (via sklearn), sklearn.feature_selection (mutual_info_classif, chi2, f_classif), sklearn.preprocessing, sklearn.metrics.
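As a starting point, one simple scheme that satisfies the interface is a rank-average ensemble of two of the baseline scorers: converting each scorer's output to ranks puts their very different scales on a common footing before averaging. This is a sketch of one possible approach, not the reference solution:

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.feature_selection import chi2, f_classif

def score_features(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    chi2_scores, _ = chi2(X, y)
    f_scores, _ = f_classif(X, y)
    # Constant features can produce NaN statistics; treat them as zero.
    chi2_scores = np.nan_to_num(chi2_scores)
    f_scores = np.nan_to_num(f_scores)
    # Rank-average: robust to the scorers' incompatible scales.
    combined = rankdata(chi2_scores) + rankdata(f_scores)
    return combined / combined.max()  # non-negative, shape (n_features,)
```

Mutual information could be folded in the same way, at extra computational cost on the 10,000-feature text benchmark.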
Evaluation
Evaluated on three classification benchmarks spanning different data modalities:
- 20newsgroups: 10,000 TF-IDF text features, 20 classes, top-500 selected
- MNIST: 784 pixel intensity features, 10 digit classes, top-200 selected
- Madelon: 500 synthetic features (20 informative + 480 noisy), binary classification, top-20 selected
Metric: test classification accuracy using LogisticRegression on the selected features (higher is better).
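The fixed harness in the file handles the real datasets and splits; the selection-then-train step it performs is equivalent to the following sketch (the `evaluate` helper and its signature are illustrative, not the actual harness code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate(score_features, X_train, y_train, X_test, y_test, k):
    scores = score_features(X_train, y_train)
    top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring features
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train[:, top_k], y_train)
    return accuracy_score(y_test, clf.predict(X_test[:, top_k]))
```

Note that scoring uses only the training split, so no information from the test set leaks into feature selection.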
Code
```python
# Custom feature selection method for MLS-Bench
#
# EDITABLE section: score_features() function.
# FIXED sections: everything else (data loading, classifier, evaluation).
import os
import warnings
import numpy as np
from pathlib import Path

from sklearn.datasets import fetch_20newsgroups, fetch_openml
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import accuracy_score
```
Results
| Model | Type | accuracy 20newsgroups ↑ | accuracy mnist ↑ | accuracy madelon ↑ |
|---|---|---|---|---|
| chi2 | baseline | 0.556 | 0.889 | 0.594 |
| f_classif | baseline | 0.547 | 0.898 | 0.588 |
| mutual_info | baseline | 0.468 | 0.896 | 0.612 |
| deepseek-reasoner | vanilla | 0.521 | 0.898 | 0.610 |
| google/gemini-3.1-pro-preview | vanilla | 0.528 | 0.892 | 0.604 |
| openai/gpt-5.4 | vanilla | 0.471 | 0.890 | 0.613 |
| qwen/qwen3.6-plus | vanilla | 0.537 | 0.892 | 0.594 |
| deepseek-reasoner | agent | 0.543 | 0.892 | 0.637 |
| google/gemini-3.1-pro-preview | agent | 0.557 | 0.893 | 0.613 |
| openai/gpt-5.4 | agent | 0.553 | 0.893 | 0.623 |
| qwen/qwen3.6-plus | agent | 0.554 | 0.893 | 0.595 |