# ai4bio-protein-function-prediction

## Description

### Task: Protein Sequence Encoder Design for Function Prediction

#### Research Question
Design a novel protein sequence encoder architecture that learns effective representations for predicting protein function from amino acid sequences. The goal is a general-purpose encoder that transfers well across diverse protein property prediction tasks.
#### Background
Protein function prediction from sequence is a fundamental problem in computational biology. Given a protein's amino acid sequence (up to ~1000 residues), the model must learn structural and functional patterns to predict properties like enzymatic activity, fluorescence, and solubility.
Current approaches fall into three categories:
- Sequential models: CNN and RNN-based encoders that treat the sequence as a 1D signal (e.g., convolutional filters over one-hot amino acid encodings).
- Attention models: Transformer-based encoders that capture long-range residue interactions via self-attention over learned token embeddings.
- Graph models: GNN-based encoders that construct residue contact graphs and perform message passing (e.g., GCN on k-nearest neighbor sequence graphs).
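As an illustration of the first category, a sequential CNN encoder over one-hot amino acid encodings might look like the sketch below. This is not the repository's baseline implementation; the channel count and kernel size are illustrative choices.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Sketch of a sequential (CNN) encoder: 1D convolutions over
    one-hot amino acid encodings, followed by global max pooling."""
    def __init__(self, onehot_dim=20, channels=64, kernel_size=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(onehot_dim, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )

    def forward(self, onehot):                   # [B, L, 20]
        x = onehot.transpose(1, 2)               # [B, 20, L] as Conv1d expects
        return self.conv(x).max(dim=-1).values   # [B, channels] global max pool
```

Global max pooling makes the output size independent of sequence length, which matters given the variable-length inputs described below.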
Key challenges include:
- Long sequences: Proteins can have hundreds to thousands of residues; efficient encoding of long-range dependencies is critical.
- Sparse vocabulary: Only 20 standard amino acids, but local context and global structure both matter.
- Cross-task generalization: The same encoder must work for regression (continuous property values) and classification tasks.
#### What to Implement

Implement the `ProteinEncoder` class in `custom_protein.py` (lines 107-177). You must implement:

- `__init__(self, vocab_size, onehot_dim, max_seq_len)`: Initialize your encoder architecture.
- `forward(self, batch: ProteinBatch) -> Tensor [B, output_dim]`: Encode a batch of protein sequences into fixed-size representations.

Your encoder must set `self.output_dim` (an integer) so the downstream prediction head knows the representation size.
#### Input Format (ProteinBatch)

```python
class ProteinBatch:
    token_ids: Tensor    # [B, max_len] integer amino acid indices (0=pad, 1-20=amino acids)
    onehot: Tensor       # [B, max_len, 20] one-hot encoded sequences
    mask: Tensor         # [B, max_len] attention mask (1=real residue, 0=padding)
    seq_lengths: Tensor  # [B] actual sequence lengths
    targets: Tensor      # [B] regression/classification targets
```

You may use any combination of these input formats: `token_ids` is suitable for embedding layers, `onehot` for convolutional or linear layers, and `mask` for attention-based models.
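A minimal encoder satisfying the required interface could look like the following sketch, which applies a per-residue MLP to the one-hot features and mean-pools over real (unmasked) positions. The hidden size is an illustrative choice, not prescribed by the task.

```python
import torch
import torch.nn as nn

class ProteinEncoder(nn.Module):
    """Minimal sketch: per-residue MLP over one-hot features plus
    masked mean pooling. Assumes the ProteinBatch fields described above."""
    def __init__(self, vocab_size, onehot_dim, max_seq_len, hidden=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(onehot_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.output_dim = hidden  # required: tells the head the representation size

    def forward(self, batch):
        h = self.proj(batch.onehot)                      # [B, L, hidden]
        m = batch.mask.unsqueeze(-1).float()             # [B, L, 1]
        # Masked mean pooling: padded positions contribute nothing.
        return (h * m).sum(1) / m.sum(1).clamp(min=1.0)  # [B, hidden]
```

Pooling with the mask (rather than a plain mean over `max_len`) keeps the representation invariant to the amount of padding.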
#### Amino Acid Vocabulary

20 standard amino acids (`ACDEFGHIKLMNPQRSTVWY`); index 0 is padding. Maximum sequence length is 1000 (longer sequences are truncated, shorter ones padded).
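The encoding scheme implied by this vocabulary can be sketched as follows. The `encode` helper is hypothetical, for illustration only; the fixed region of the template already provides its own encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 1..20; 0 = padding
MAX_SEQ_LEN = 1000

def encode(seq: str, max_len: int = MAX_SEQ_LEN) -> list[int]:
    """Map an amino acid sequence to integer indices,
    truncating to max_len and right-padding with 0."""
    ids = [AA_TO_IDX[aa] for aa in seq[:max_len]]
    return ids + [0] * (max_len - len(ids))
```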
#### Evaluation

The encoder is evaluated on 3 protein property prediction benchmarks from the Therapeutics Data Commons (TDC). All three are regression tasks scored by Spearman correlation (higher is better):

- Beta-lactamase: enzyme activity prediction (~4,158 proteins, TAPE benchmark)
- Fluorescence: GFP fluorescence intensity (~21,446 proteins, TAPE benchmark)
- Solubility: protein solubility prediction (~40,174 proteins, TDC benchmark)

All benchmarks use pre-defined train/valid/test splits. The primary metric is Spearman correlation.
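Since Spearman correlation is the primary metric, it is worth noting that it measures rank agreement, not absolute error. A sketch of how it might be computed (using `scipy.stats.spearmanr`; the prediction values here are made up):

```python
import numpy as np
from scipy.stats import spearmanr

preds = np.array([0.1, 0.4, 0.35, 0.8])   # hypothetical model outputs
labels = np.array([0.0, 0.5, 0.3, 0.9])   # hypothetical targets
rho, _ = spearmanr(preds, labels)         # rank correlation in [-1, 1]
```

Because only ranks matter, a model whose predictions are a monotonic transform of the targets scores a perfect 1.0 even if the MSE is large.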
#### Editable Region

Lines 107-177 of `custom_protein.py` are editable (between the EDITABLE SECTION START and EDITABLE SECTION END markers). You may define helper classes, layers, or functions within this region. The region must contain a `ProteinEncoder` class with the specified interface.
## Code

```python
"""
Protein Function Prediction — Self-contained template.
Predicts protein properties from amino acid sequences:
- Beta-lactamase activity (regression, Spearman correlation)
- Fluorescence intensity (regression, Spearman correlation)
- Solubility (regression, Spearman correlation)

Structure:
Lines 1-106:   FIXED    — Imports, constants, amino acid encoding, data loading
Lines 107-220: EDITABLE — ProteinEncoder class (starter: simple MLP)
Lines 221+:    FIXED    — Classifier head, training loop, evaluation
"""
import os
import sys
import math
```
## Results

| Model | Type | Spearman Beta ↑ | Pearson Beta ↑ | MSE Beta ↓ | Spearman Fluorescence ↑ | Pearson Fluorescence ↑ | MSE Fluorescence ↓ | Spearman Solubility ↑ | Pearson Solubility ↑ | MSE Solubility ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| cnn | baseline | 0.689 | 0.741 | 0.047 | 0.683 | 0.948 | 0.099 | 0.460 | 0.492 | 0.194 |
| dgl_gcn | baseline | 0.290 | 0.376 | 0.104 | 0.671 | 0.874 | 0.232 | 0.560 | 0.558 | 0.175 |
| transformer | baseline | 0.206 | 0.094 | 0.104 | 0.554 | 0.619 | 1.088 | 0.528 | 0.527 | 0.184 |
| transformer | baseline | 0.115 | 0.115 | 0.104 | 0.467 | 0.485 | 1.662 | 0.537 | 0.540 | 0.181 |
| transformer | baseline | 0.263 | 0.256 | 0.103 | 0.556 | 0.617 | 1.079 | 0.537 | 0.538 | 0.188 |
| transformer | baseline | 0.168 | 0.092 | 0.103 | 0.538 | 0.593 | 1.120 | 0.530 | 0.530 | 0.185 |
| transformer | baseline | 0.235 | 0.269 | 0.104 | 0.641 | 0.801 | 0.507 | 0.553 | 0.555 | 0.176 |