ai4bio-protein-structure-repr
Description
Task: Protein Structure Representation Learning
Research Question
Design a novel geometric GNN encoder for learning protein structure representations from 3D alpha-carbon coordinates. The encoder must capture both local geometric patterns (bond angles, dihedral angles) and global structural motifs to produce informative per-residue and per-protein embeddings.
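As an illustration of the local geometry mentioned above, the following NumPy sketch computes a signed pseudo-dihedral angle for every run of four consecutive alpha-carbons. The function name `pseudo_dihedrals` is hypothetical, and this is not claimed to match the fixed `compute_node_features` function, whose implementation is not shown here.

```python
import numpy as np

def pseudo_dihedrals(ca):
    """Signed dihedral angle (radians) for each run of four consecutive CA atoms.

    ca: (N, 3) alpha-carbon coordinates. Returns an array of shape (N - 3,).
    Hypothetical helper for illustration only.
    """
    ca = np.asarray(ca, dtype=float)
    b0 = ca[1:-2] - ca[:-3]   # virtual bond i   -> i+1
    b1 = ca[2:-1] - ca[1:-2]  # virtual bond i+1 -> i+2
    b2 = ca[3:] - ca[2:-1]    # virtual bond i+2 -> i+3
    n1 = np.cross(b0, b1)     # normal of the plane through atoms i, i+1, i+2
    n2 = np.cross(b1, b2)     # normal of the plane through atoms i+1, i+2, i+3
    b1_unit = b1 / np.linalg.norm(b1, axis=-1, keepdims=True)
    m1 = np.cross(n1, b1_unit)
    # atan2 keeps the sign, so mirror-image conformations get opposite angles.
    return np.arctan2(np.sum(m1 * n2, axis=-1), np.sum(n1 * n2, axis=-1))
```

The angle is unchanged by rotations and translations but flips sign under reflection, which is one reason dihedral-type features can carry chirality information that pairwise distances alone cannot.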
Background
Protein function is determined by 3D structure. Geometric GNNs that operate on protein structure graphs (nodes = residues at alpha-carbon positions, edges = spatial/sequential neighbors) have emerged as powerful tools for learning protein representations. Key challenges include:
- Geometric awareness: The encoder should leverage 3D spatial information (distances, angles, orientations) beyond simple adjacency.
- Equivariance/invariance: Output representations should be invariant to rigid-body transformations (rotations, translations) of the protein; any intermediate vector-valued features should transform equivariantly.
- Multi-scale structure: Proteins exhibit hierarchical structure (secondary structure elements, domains, global fold) that the encoder should capture.
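The invariance requirement can be checked numerically. The sketch below, with hypothetical helper names, applies a random rotation and translation to a set of coordinates and confirms that pairwise distances are unchanged; any encoder built only on such quantities inherits the same invariance.

```python
import numpy as np

def pairwise_distances(pos):
    """Return the (N, N) matrix of Euclidean distances between rows of pos."""
    diff = pos[:, None, :] - pos[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def random_rigid_transform(rng):
    """Sample a proper rotation matrix (det = +1) and a translation vector."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis so that q is a rotation, not a reflection
    return q, rng.normal(size=3)

rng = np.random.default_rng(0)
pos = rng.normal(size=(10, 3))            # stand-in for alpha-carbon coordinates
rot, shift = random_rigid_transform(rng)
moved = pos @ rot.T + shift               # rigidly transformed copy

# Distances are identical up to floating-point error.
assert np.allclose(pairwise_distances(pos), pairwise_distances(moved))
```

The same pattern (transform the input, compare the output) works as a unit test for a full encoder.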
Existing approaches include:
- SchNet: Uses continuous-filter convolutions with Gaussian radial basis function distance expansion. Invariant by design.
- EGNN: E(n) equivariant message passing that jointly updates node features and coordinates.
- GearNet: Geometry-Aware Relational Graph Neural Network with multiple edge types (sequential, spatial radius, k-nearest-neighbor) and relational convolutions.
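To make the distance expansion used by SchNet-style models concrete, here is a minimal NumPy sketch of a Gaussian radial basis expansion. The function name, the range 0 to 10 (assumed to be in angstroms), and the number of centers are illustrative assumptions, not values taken from the baselines.

```python
import numpy as np

def gaussian_rbf(dist, d_min=0.0, d_max=10.0, num_centers=16):
    """Expand scalar distances into Gaussian radial basis features.

    dist: array of any shape. Returns an array of shape dist.shape + (num_centers,).
    All default values are illustrative assumptions.
    """
    dist = np.asarray(dist, dtype=float)
    centers = np.linspace(d_min, d_max, num_centers)  # evenly spaced centers mu_k
    spacing = centers[1] - centers[0]
    gamma = 0.5 / spacing**2                          # width tied to the center spacing
    return np.exp(-gamma * (dist[..., None] - centers) ** 2)
```

Each distance becomes a smooth, localized feature vector, which gives downstream filters more resolution than a single scalar distance would. Because the input is a distance, the expansion is invariant to rotations and translations.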
What to Implement
Implement the ProteinEncoder class and any helper modules in custom_protein_encoder.py. You must implement:
- __init__(self, ...): Set up the encoder architecture. The input node features have dimension SCALAR_NODE_DIM=28 (20-dim amino acid one-hot + 2-dim positional encoding + 6-dim pseudo-dihedral features).
- forward(self, pos, node_feat, batch) -> (node_emb, graph_emb): Encode the protein graph.
  - pos: (N, 3) alpha-carbon coordinates
  - node_feat: (N, 28) scalar node features (computed by the fixed compute_node_features function)
  - batch: (N,) batch assignment indices
  - Returns: node_emb (N, out_dim) per-node embeddings, graph_emb (B, out_dim) per-graph embeddings
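The interface above can be sketched in plain PyTorch as follows. This is a minimal distance-based illustration under stated assumptions, not one of the baselines and not a reference solution: the constructor arguments (`hidden_dim`, `out_dim`, `num_layers`, `k`, `num_rbf`, `cutoff`) are hypothetical, and the dense N x N distance matrix is used only for brevity.

```python
import torch
import torch.nn as nn

SCALAR_NODE_DIM = 28  # input feature size stated in the task description

class ProteinEncoder(nn.Module):
    """Invariant kNN message passing on CA distances (illustrative sketch only)."""

    def __init__(self, hidden_dim=64, out_dim=64, num_layers=2, k=8,
                 num_rbf=16, cutoff=20.0):
        super().__init__()
        self.k = k
        self.register_buffer("centers", torch.linspace(0.0, cutoff, num_rbf))
        self.gamma = 0.5 / (cutoff / (num_rbf - 1)) ** 2
        self.embed = nn.Linear(SCALAR_NODE_DIM, hidden_dim)
        self.msg = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * hidden_dim + num_rbf, hidden_dim),
                          nn.SiLU(), nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_layers)
        ])
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, pos, node_feat, batch):
        n = pos.size(0)
        dist = torch.cdist(pos, pos)                       # (N, N) distances
        # Exclude self-pairs and pairs that belong to different proteins.
        blocked = (batch[:, None] != batch[None, :]) | torch.eye(
            n, dtype=torch.bool, device=pos.device)
        dist = dist.masked_fill(blocked, float("inf"))
        k = min(self.k, n - 1)
        nbr_dist, nbr_idx = dist.topk(k, dim=1, largest=False)  # (N, k)
        finite = torch.isfinite(nbr_dist)
        valid = finite.unsqueeze(-1).to(pos.dtype)         # zero out padded neighbors
        nbr_dist = torch.where(finite, nbr_dist, torch.zeros_like(nbr_dist))
        rbf = torch.exp(-self.gamma * (nbr_dist.unsqueeze(-1) - self.centers) ** 2)

        h = self.embed(node_feat)
        for layer in self.msg:
            h_i = h.unsqueeze(1).expand(-1, k, -1)         # receiver features
            h_j = h[nbr_idx]                               # sender features
            m = layer(torch.cat([h_i, h_j, rbf], dim=-1)) * valid
            h = h + m.sum(dim=1)                           # residual update
        node_emb = self.out(h)

        # Mean-pool node embeddings into one embedding per protein.
        num_graphs = int(batch.max()) + 1
        pooled = torch.zeros(num_graphs, node_emb.size(1), dtype=node_emb.dtype,
                             device=node_emb.device).index_add(0, batch, node_emb)
        counts = torch.bincount(batch, minlength=num_graphs).clamp(min=1)
        graph_emb = pooled / counts.unsqueeze(-1).to(node_emb.dtype)
        return node_emb, graph_emb
```

Because the only geometric input is the set of pairwise distances, the outputs are invariant to rotations, translations, and reflections. For large batches the dense distance matrix would likely need to be replaced by a sparse neighbor search.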
Evaluation
The encoder is evaluated on three protein function/structure prediction benchmarks:
EC Number Prediction (384-class, multiclass)
- Predicts enzyme commission number from protein structure
- Metric: accuracy (top-1)
GO Biological Process (1943-class, multilabel)
- Predicts Gene Ontology biological process annotations
- Metric: f1_max (maximum F1 across thresholds)
Fold Classification (1195-class, multiclass)
- Predicts protein fold from the SCOPe/CATH hierarchy
- Metric: accuracy (top-1)
Higher is better for all metrics.
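The two metric types can be sketched in NumPy as below. The fixed evaluation code is not shown in this description, so the details are assumptions: in particular, `f1_max` here follows the protein-centric convention (precision averaged over proteins with at least one prediction, recall averaged over proteins with at least one label), and the threshold grid is illustrative.

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

def f1_max(probs, targets, thresholds=None):
    """Maximum protein-centric F1 over a grid of decision thresholds.

    probs:   (num_proteins, num_labels) predicted probabilities
    targets: (num_proteins, num_labels) binary ground-truth labels
    The averaging convention and threshold grid are assumptions.
    """
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    targets = np.asarray(targets).astype(bool)
    n_true = targets.sum(axis=1)
    has_true = n_true > 0
    best = 0.0
    for t in thresholds:
        pred = np.asarray(probs) >= t
        tp = np.sum(pred & targets, axis=1)      # true positives per protein
        n_pred = pred.sum(axis=1)
        has_pred = n_pred > 0
        if not has_pred.any() or not has_true.any():
            continue
        precision = np.mean(tp[has_pred] / n_pred[has_pred])
        recall = np.mean(tp[has_true] / n_true[has_true])
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)
```

Because `f1_max` picks the best threshold after the fact, it measures ranking quality rather than the calibration of any single cutoff.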
Editable Region
Lines 107-234 of custom_protein_encoder.py (the section between EDITABLE SECTION START and EDITABLE SECTION END markers). You may define any helper classes, layers, or functions within this region. The region must contain a ProteinEncoder class with the interface described above.
Code
    """
    Protein Structure Representation Learning — Self-contained template.
    Trains a geometric GNN encoder for protein structure and evaluates on
    downstream classification tasks (EC number, GO-BP, Fold classification).

    Structure:
    Lines 1-124: FIXED — Imports, constants, data loading utilities
    Lines 125-252: EDITABLE — ProteinEncoder class + helper modules
    Lines 253+: FIXED — Dataset, decoder head, training loop, evaluation
    """
    import os
    import sys
    import math
    import json
    import argparse
Results
| Model | Type | accuracy EC ↑ | test loss EC ↓ | f1 max GO-BP ↑ | test loss GO-BP ↓ | accuracy Fold ↑ | test loss Fold ↓ |
|---|---|---|---|---|---|---|---|
| egnn | baseline | 0.705 | 1.950 | 0.238 | 0.176 | 0.305 | 5.141 |
| gearnet | baseline | 0.770 | 1.740 | 0.281 | 0.158 | 0.338 | 6.273 |
| schnet | baseline | 0.584 | 2.654 | 0.250 | 0.165 | 0.167 | 5.423 |