ai4bio-protein-function-prediction

Tags: AI for Biology · DeepProtein · rigorous codebase

Description

Task: Protein Sequence Encoder Design for Function Prediction

Research Question

Design a novel protein sequence encoder architecture that learns effective representations for predicting protein function from amino acid sequences. The goal is a general-purpose encoder that transfers well across diverse protein property prediction tasks.

Background

Protein function prediction from sequence is a fundamental problem in computational biology. Given a protein's amino acid sequence (up to ~1000 residues), the model must learn structural and functional patterns to predict properties like enzymatic activity, fluorescence, and solubility.

Current approaches fall into three categories:

  • Sequential models: CNN and RNN-based encoders that treat the sequence as a 1D signal (e.g., convolutional filters over one-hot amino acid encodings).
  • Attention models: Transformer-based encoders that capture long-range residue interactions via self-attention over learned token embeddings.
  • Graph models: GNN-based encoders that construct residue contact graphs and perform message passing (e.g., GCN on k-nearest neighbor sequence graphs).
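
As a concrete illustration of the first category, a minimal 1D-CNN encoder over one-hot encodings might look like the sketch below. The hyperparameters (`hidden=128`, `kernel=9`) are illustrative assumptions, not values prescribed by the template:

```python
import torch
import torch.nn as nn

class CNNSeqEncoder(nn.Module):
    """Sketch of a 'sequential' encoder: stacked 1D convolutions over
    one-hot amino acid encodings, followed by global max pooling.
    Hyperparameters are illustrative, not from the template."""
    def __init__(self, onehot_dim: int = 20, hidden: int = 128, kernel: int = 9):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(onehot_dim, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        self.output_dim = hidden

    def forward(self, onehot: torch.Tensor) -> torch.Tensor:
        # onehot: [B, L, 20]; Conv1d expects channels first, so transpose to [B, 20, L]
        h = self.conv(onehot.transpose(1, 2))  # [B, hidden, L]
        return h.max(dim=-1).values            # global max pool -> [B, hidden]
```

Global max pooling makes the representation length-invariant, which matters here because sequences vary from tens to ~1000 residues.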

Key challenges include:

  • Long sequences: Proteins can have hundreds to thousands of residues; efficient encoding of long-range dependencies is critical.
  • Sparse vocabulary: Only 20 standard amino acids, but local context and global structure both matter.
  • Cross-task generalization: The same encoder must work for regression (continuous property values) and classification tasks.

What to Implement

Implement the ProteinEncoder class in custom_protein.py (lines 107-177). You must implement:

  1. __init__(self, vocab_size, onehot_dim, max_seq_len): Initialize your encoder architecture.
  2. forward(self, batch: ProteinBatch) -> Tensor [B, output_dim]: Encode a batch of protein sequences into fixed-size representations.

Your encoder must set self.output_dim (an integer) so the downstream prediction head knows the representation size.
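
A minimal sketch of the required interface, assuming a stand-in embedding-plus-masked-mean-pooling design (the starter in the template is described as a simple MLP; `embed_dim=128` is an arbitrary illustrative choice, and the `batch` object is assumed to carry the `ProteinBatch` fields described below):

```python
import torch
import torch.nn as nn

class ProteinEncoder(nn.Module):
    """Illustrative implementation of the required interface:
    embed token ids, then mean-pool over real (non-padding) residues.
    Not the reference solution, just one shape-correct sketch."""
    def __init__(self, vocab_size: int, onehot_dim: int, max_seq_len: int):
        super().__init__()
        embed_dim = 128                                   # illustrative choice
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.output_dim = embed_dim                       # required attribute

    def forward(self, batch) -> torch.Tensor:
        h = self.embed(batch.token_ids)                   # [B, L, D]
        mask = batch.mask.unsqueeze(-1).float()           # [B, L, 1]
        summed = (h * mask).sum(dim=1)                    # [B, D]
        lengths = mask.sum(dim=1).clamp(min=1.0)          # [B, 1]
        return summed / lengths                           # masked mean -> [B, D]
```

Masked mean pooling (rather than pooling over all `max_len` positions) keeps padding tokens from diluting the representation of short sequences.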

Input Format (ProteinBatch)

class ProteinBatch:
    token_ids: Tensor     # [B, max_len] integer amino acid indices (0=pad, 1-20=amino acids)
    onehot: Tensor        # [B, max_len, 20] one-hot encoded sequences
    mask: Tensor          # [B, max_len] attention mask (1=real residue, 0=padding)
    seq_lengths: Tensor   # [B] actual sequence lengths
    targets: Tensor       # [B] regression/classification targets

You may use any combination of these inputs. The token_ids are suitable for embedding layers, the onehot tensor for convolutional or linear layers, and the mask for attention-based models or masked pooling.

Amino Acid Vocabulary

The vocabulary comprises the 20 standard amino acids (ACDEFGHIKLMNPQRSTVWY), indexed 1-20; index 0 is reserved for padding. The maximum sequence length is 1000: longer sequences are truncated, shorter ones are padded.
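
The index convention above can be sketched as a small tokenizer (the helper `encode` and its padding behavior are illustrative, not the template's actual encoding code):

```python
AA_VOCAB = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i + 1 for i, aa in enumerate(AA_VOCAB)}  # 0 is reserved for padding
MAX_LEN = 1000

def encode(seq: str, max_len: int = MAX_LEN) -> list:
    """Map an amino acid string to padded integer token ids (0 = pad)."""
    ids = [AA_TO_ID[aa] for aa in seq[:max_len]]  # truncate to max_len
    return ids + [0] * (max_len - len(ids))       # right-pad with zeros
```

For example, `encode("MKV")` begins with the indices of M, K, and V, followed by 997 padding zeros.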

Evaluation

The encoder is evaluated on 3 protein property prediction benchmarks, drawn from the TAPE and Therapeutics Data Commons (TDC) suites:

Regression (metric: Spearman correlation, higher is better):

  • Beta-lactamase: Enzyme activity prediction (~4,158 proteins, TAPE benchmark)
  • Fluorescence: GFP fluorescence intensity (~21,446 proteins, TAPE benchmark)
  • Solubility: Protein solubility prediction (~40,174 proteins, TDC benchmark)

All benchmarks use pre-defined train/valid/test splits. The primary metric is Spearman correlation.
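
For reference, Spearman correlation is the Pearson correlation of the ranks. A dependency-free sketch (a sanity-check implementation with average ranks for ties, not the evaluation harness actually used) is:

```python
def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks,
    with ties assigned their average rank."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(order):
            j = i
            # extend j over a run of tied values
            while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1          # average 1-based rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because it depends only on ranks, any strictly monotone prediction of the target scores 1.0, which is why it suits properties (activity, fluorescence) whose absolute scale is less meaningful than their ordering.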

Editable Region

Lines 107-177 of custom_protein.py are editable (between EDITABLE SECTION START and EDITABLE SECTION END markers). You may define helper classes, layers, or functions within this region. The region must contain a ProteinEncoder class with the specified interface.

Code

custom_protein.py
"""
Protein Function Prediction — Self-contained template.
Predicts protein properties from amino acid sequences:
 - Beta-lactamase activity (regression, Spearman correlation)
 - Fluorescence intensity (regression, Spearman correlation)
 - Solubility (regression, Spearman correlation)

Structure:
 Lines 1-106: FIXED — Imports, constants, amino acid encoding, data loading
 Lines 107-220: EDITABLE — ProteinEncoder class (starter: simple MLP)
 Lines 221+: FIXED — Classifier head, training loop, evaluation
"""
import os
import sys
import math

Results

| Model | Type | Spearman (Beta) | Pearson (Beta) | MSE (Beta) | Spearman (Fluorescence) | Pearson (Fluorescence) | MSE (Fluorescence) | Spearman (Solubility) | Pearson (Solubility) | MSE (Solubility) |
|---|---|---|---|---|---|---|---|---|---|---|
| cnn | baseline | 0.689 | 0.741 | 0.047 | 0.683 | 0.948 | 0.099 | 0.460 | 0.492 | 0.194 |
| dgl_gcn | baseline | 0.290 | 0.376 | 0.104 | 0.671 | 0.874 | 0.232 | 0.560 | 0.558 | 0.175 |
| dgl_gcn | baseline | - | - | - | - | - | - | - | - | - |
| transformer | baseline | 0.206 | 0.094 | 0.104 | 0.554 | 0.619 | 1.088 | 0.528 | 0.527 | 0.184 |
| transformer | baseline | 0.115 | 0.115 | 0.104 | 0.467 | 0.485 | 1.662 | 0.537 | 0.540 | 0.181 |
| transformer | baseline | 0.263 | 0.256 | 0.103 | 0.556 | 0.617 | 1.079 | 0.537 | 0.538 | 0.188 |
| transformer | baseline | 0.168 | 0.092 | 0.103 | 0.538 | 0.593 | 1.120 | 0.530 | 0.530 | 0.185 |
| transformer | baseline | 0.235 | 0.269 | 0.104 | 0.641 | 0.801 | 0.507 | 0.553 | 0.555 | 0.176 |