ai4bio-protein-structure-repr

Tags: AI for Biology, Protein, Workshop, rigorous codebase

Description

Task: Protein Structure Representation Learning

Research Question

Design a novel geometric GNN encoder for learning protein structure representations from 3D alpha-carbon coordinates. The encoder must capture both local geometric patterns (pseudo-bond angles and pseudo-dihedral angles between consecutive alpha carbons) and global structural motifs to produce informative per-residue and per-protein embeddings.

Background

Protein function is determined by 3D structure. Geometric GNNs that operate on protein structure graphs (nodes = residues at alpha-carbon positions, edges = spatial/sequential neighbors) have emerged as powerful tools for learning protein representations. Key challenges include:

- Geometric awareness: The encoder should leverage 3D spatial information (distances, angles, orientations) beyond simple adjacency.
- Equivariance/invariance: Representations should be invariant to rigid body transformations (rotations, translations) of the protein.
- Multi-scale structure: Proteins exhibit hierarchical structure (secondary structure elements, domains, global fold) that the encoder should capture.
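The invariance requirement can be checked numerically. The sketch below is illustrative only (NumPy, random toy coordinates, and the helper names `pairwise_distances` and `random_rigid_transform` are assumptions, not part of the template): it applies a random rotation and translation to a point cloud and confirms that the pairwise distance matrix, a typical invariant input for a geometric GNN, does not change.

```python
import numpy as np

def pairwise_distances(pos):
    """Return the (N, N) matrix of Euclidean distances between rows of pos."""
    diff = pos[:, None, :] - pos[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def random_rigid_transform(rng):
    """Sample a proper rotation matrix (det = +1) and a translation vector."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs; q stays orthogonal
    if np.linalg.det(q) < 0:         # flip one axis to turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q, rng.normal(size=3)

rng = np.random.default_rng(0)
pos = rng.normal(size=(10, 3))       # toy stand-in for alpha-carbon coordinates
rot, shift = random_rigid_transform(rng)
pos_moved = pos @ rot.T + shift      # rotate, then translate every point

d_before = pairwise_distances(pos)
d_after = pairwise_distances(pos_moved)
invariant = np.allclose(d_before, d_after)
print("distances unchanged by rigid motion:", invariant)
```

An encoder whose geometric inputs are built only from such quantities (distances, angles, dihedrals) inherits this invariance without any further constraint on its layers.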

Existing approaches include:

- SchNet: Uses continuous-filter convolutions with Gaussian radial basis function distance expansion. Invariant by design.
- EGNN: E(n) equivariant message passing that jointly updates node features and coordinates.
- GearNet: Geometry-Aware Relational Graph Neural Network with multiple edge types (sequential, spatial, k-nearest) and relational convolutions.
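As a concrete illustration of the Gaussian radial basis function expansion mentioned for SchNet, here is a minimal NumPy sketch. The function name `gaussian_rbf`, the 10 Å range, and the 16 centers are illustrative assumptions rather than the settings of any baseline in this task; the idea is only that a scalar distance becomes a smooth feature vector that peaks at the nearest center.

```python
import numpy as np

def gaussian_rbf(dist, d_min=0.0, d_max=10.0, num_centers=16):
    """Expand distances of any shape (...) into features of shape (..., num_centers)."""
    centers = np.linspace(d_min, d_max, num_centers)   # evenly spaced Gaussian centers
    width = centers[1] - centers[0]                     # use the spacing as the std. dev.
    gamma = 0.5 / width**2
    return np.exp(-gamma * (dist[..., None] - centers) ** 2)

# 3.8 is roughly the distance in angstroms between consecutive alpha carbons.
dist = np.array([0.0, 3.8, 10.0])
feat = gaussian_rbf(dist)
print(feat.shape)               # (3, 16)
print(int(np.argmax(feat[1])))  # index of the center closest to 3.8
```

Because the expansion depends only on distances, any network consuming these features is invariant to rotations and translations of the input coordinates.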

What to Implement

Implement the ProteinEncoder class and any helper modules in custom_protein_encoder.py. You must implement:

__init__(self, ...): Set up the encoder architecture. The input node features have dimension SCALAR_NODE_DIM=28 (20-dim amino acid one-hot + 2-dim positional encoding + 6-dim pseudo-dihedral features).
forward(self, pos, node_feat, batch) -> (node_emb, graph_emb): Encode the protein graph.
- pos: (N, 3) alpha-carbon coordinates
- node_feat: (N, 28) scalar node features (computed by the fixed compute_node_features function)
- batch: (N,) batch assignment indices
- Returns: node_emb (N, out_dim) per-node embeddings, graph_emb (B, out_dim) per-graph embeddings
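
The interface above could be satisfied by something like the following PyTorch sketch. It is not one of the baselines and not the template's code: the helper `InvariantLayer`, the hyperparameters (`hidden_dim`, `out_dim`, `num_layers`, `cutoff`, `num_rbf`), the dense O(N^2) radius-graph construction, and the mean pooling are all assumptions chosen to keep the example short and dependency-free. It uses only distances as geometric input, so its outputs are invariant to rigid motions.

```python
import torch
import torch.nn as nn

SCALAR_NODE_DIM = 28  # value stated in the task description

class InvariantLayer(nn.Module):
    """Hypothetical message-passing step that sees only distance features."""
    def __init__(self, dim, num_rbf):
        super().__init__()
        self.msg = nn.Sequential(nn.Linear(2 * dim + num_rbf, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.upd = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, h, src, dst, edge_rbf):
        m = self.msg(torch.cat([h[src], h[dst], edge_rbf], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, m)    # sum messages per receiving node
        return h + self.upd(torch.cat([h, agg], dim=-1))   # residual update

class ProteinEncoder(nn.Module):
    def __init__(self, hidden_dim=64, out_dim=128, num_layers=3, cutoff=10.0, num_rbf=16):
        super().__init__()
        self.cutoff = cutoff
        self.register_buffer("centers", torch.linspace(0.0, cutoff, num_rbf))
        self.embed = nn.Linear(SCALAR_NODE_DIM, hidden_dim)
        self.layers = nn.ModuleList([InvariantLayer(hidden_dim, num_rbf) for _ in range(num_layers)])
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, pos, node_feat, batch):
        # Dense radius graph, restricted to residue pairs inside the same protein.
        dist = torch.cdist(pos, pos)
        same = batch[:, None] == batch[None, :]
        eye = torch.eye(pos.size(0), dtype=torch.bool, device=pos.device)
        src, dst = torch.nonzero(same & (dist < self.cutoff) & ~eye, as_tuple=True)
        width = self.centers[1] - self.centers[0]
        edge_rbf = torch.exp(-0.5 * ((dist[src, dst].unsqueeze(-1) - self.centers) / width) ** 2)

        h = self.embed(node_feat)
        for layer in self.layers:
            h = layer(h, src, dst, edge_rbf)
        node_emb = self.out(h)                              # (N, out_dim)

        # Mean-pool node embeddings into one vector per graph.
        num_graphs = int(batch.max()) + 1
        sums = node_emb.new_zeros(num_graphs, node_emb.size(1)).index_add_(0, batch, node_emb)
        counts = torch.bincount(batch, minlength=num_graphs).clamp(min=1).unsqueeze(1)
        graph_emb = sums / counts                           # (B, out_dim)
        return node_emb, graph_emb

# Toy batch: two "proteins" with 7 and 5 residues and random features.
torch.manual_seed(0)
enc = ProteinEncoder()
pos = torch.randn(12, 3) * 3
feat = torch.randn(12, SCALAR_NODE_DIM)
batch = torch.tensor([0] * 7 + [1] * 5)
node_emb, graph_emb = enc(pos, feat, batch)
print(node_emb.shape, graph_emb.shape)
```

A real submission would likely replace the dense neighbor search with a sparse one and add sequential edges or angular features; the sketch only fixes the input and output shapes that the surrounding training code expects.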

Evaluation

The encoder is evaluated on three protein function/structure prediction benchmarks:

EC Number Prediction (384-class, multiclass)

Predicts enzyme commission number from protein structure
Metric: accuracy (top-1)

GO Biological Process (1943-class, multilabel)

Predicts Gene Ontology biological process annotations
Metric: f1_max (maximum F1 across thresholds)

Fold Classification (1195-class, multiclass)

Predicts protein fold from the SCOPe/CATH hierarchy
Metric: accuracy (top-1)

Higher is better for all metrics.
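
To make the f1_max metric concrete, the sketch below computes a maximum F1 over a grid of decision thresholds. It assumes the protein-centric convention often used for GO prediction (precision averaged over proteins with at least one predicted label, recall averaged over proteins with at least one true label); the benchmark's own implementation and threshold grid may differ, and the function signature here is hypothetical.

```python
import numpy as np

def f1_max(scores, labels, thresholds=None):
    """Best protein-centric F1 over a grid of thresholds.

    scores: (P, C) predicted probabilities; labels: (P, C) binary ground truth.
    """
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)   # assumed grid, step 0.01
    labels = labels.astype(bool)
    n_true = labels.sum(axis=1)
    has_true = n_true > 0
    best = 0.0
    for t in thresholds:
        pred = scores >= t
        n_pred = pred.sum(axis=1)
        tp = (pred & labels).sum(axis=1)
        has_pred = n_pred > 0
        if not has_pred.any() or not has_true.any():
            continue                               # F1 undefined at this threshold
        precision = (tp[has_pred] / n_pred[has_pred]).mean()
        recall = (tp[has_true] / n_true[has_true]).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return float(best)

scores = np.array([[0.9, 0.8, 0.1],
                   [0.2, 0.7, 0.3]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])
print(round(f1_max(scores, labels), 4))  # 0.8571, reached for thresholds in (0.3, 0.7]
```

In the toy example, a threshold of 0.5 predicts labels {0, 1} for the first protein and {1} for the second, giving mean precision 0.75 and mean recall 1.0, hence F1 = 6/7.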

Editable Region

Lines 107-234 of custom_protein_encoder.py (the section between EDITABLE SECTION START and EDITABLE SECTION END markers). You may define any helper classes, layers, or functions within this region. The region must contain a ProteinEncoder class with the interface described above.

Code

custom_protein_encoder.py


"""
Protein Structure Representation Learning - Self-contained template.
Trains a geometric GNN encoder for protein structure and evaluates on
downstream classification tasks (EC number, GO-BP, Fold classification).

Structure:
  Lines 1-124:    FIXED — Imports, constants, data loading utilities
  Lines 125-252:  EDITABLE — ProteinEncoder class + helper modules
  Lines 253+:     FIXED — Dataset, decoder head, training loop, evaluation
"""
import os
import sys
import math
import json
import argparse

Results

Model	Type	accuracy EC ↑	test loss EC ↓	f1 max GO-BP ↑	test loss GO-BP ↓	accuracy Fold ↑	test loss Fold ↓
egnn	baseline	0.705	1.950	0.238	0.176	0.305	5.141
gearnet	baseline	0.770	1.740	0.281	0.158	0.338	6.273
schnet	baseline	0.584	2.654	0.250	0.165	0.167	5.423