Backbone-to-Sequence Inverse Folding

Studies how geometric structure encoding and sequence decoding recover amino-acid sequences from protein backbones.

AI for Science · ProteinInvBench
ai4bio-protein-inverse-folding

Description

Task: Protein Inverse Folding — Structure Encoder Design

Research Question

Design a novel GNN-based structure encoder for protein inverse folding: given backbone atom coordinates (N, CA, C, O), predict the amino acid sequence that would fold into that structure.

Background

Protein inverse folding (also called computational protein design or fixed-backbone design) is a central problem in structural biology. Given a protein backbone structure, the goal is to predict the amino acid sequence most likely to fold into that structure. This is the inverse of the protein folding problem (predicting structure from sequence).

The key challenge is encoding the 3D protein backbone graph into rich per-residue embeddings that capture local geometry, long-range interactions, and structural motifs. Existing approaches differ primarily in how they encode the protein structure:

  • GVP (Geometric Vector Perceptron; Jing et al., "Learning from Protein Structure with Geometric Vector Perceptrons", ICLR 2021; arXiv:2009.01411). SE(3)-equivariant message passing with both scalar and vector node/edge features. Code: https://github.com/drorlab/gvp.
  • ProteinMPNN (Dauparas et al., "Robust deep learning–based protein sequence design using ProteinMPNN", Science 2022, 378(6615):49–56; bioRxiv 2022.06.03.494563). Message-passing encoder with edge updates, followed by an autoregressive decoder with masking. Code: https://github.com/dauparas/ProteinMPNN.
  • PiFold (Gao et al., "PiFold: Toward Effective and Efficient Protein Inverse Folding", ICLR 2023; arXiv:2209.12643). PiGNN encoder with learnable virtual atoms, multi-scale distance features, and dihedral features, plus a non-autoregressive one-shot decoder. Code: https://github.com/A4Bio/PiFold.

The structure encoder is the critical component: all methods share the same input format (backbone coordinates) and output format (amino acid log-probabilities), but differ in how they transform structure into sequence-informative representations.

What to Implement

Modify the editable section of custom_invfold.py. You must implement:

  1. StructureEncoder: A GNN module that takes backbone coordinates X (B, L, 4, 3) and mask (B, L), and produces per-residue embeddings h_V (B, L, hidden_dim).
  2. InverseFoldingModel: Wraps the encoder with a decoder head that outputs amino acid log-probabilities (B, L, 20).

Interface

class StructureEncoder(nn.Module):
    def __init__(self, hidden_dim=128, ...):
        ...
    def forward(self, X, mask):
        """
        X: (B, L, 4, 3) backbone coordinates [N, CA, C, O]
        mask: (B, L) binary mask (1 for valid residues, 0 for padding)
        Returns: h_V (B, L, hidden_dim) per-residue embeddings
        """
        ...

class InverseFoldingModel(nn.Module):
    def __init__(self, hidden_dim=128, ...):
        ...
    def forward(self, X, mask):
        """
        Returns: log_probs (B, L, 20) amino acid log-probabilities
        """
        ...

Helper functions available in the FIXED section above the editable region:

  • _rbf(D, ...): Radial basis function encoding of distances.
  • _dihedrals(X): Backbone dihedral angles (phi, psi, omega) as sin/cos features.
  • _orientations(X): Local coordinate frame (forward + binormal vectors).
  • knn_graph(X_ca, mask, k): Build k-nearest neighbor graph from CA coordinates.
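As a rough illustration of what two of these helpers compute, the sketch below re-implements a Gaussian radial basis expansion and a masked k-nearest-neighbour search in plain Python. The names `rbf_encode` and `knn_indices`, the 0 to 20 Å distance range, the 16 centres, the Gaussian width, and the choice to exclude a residue from its own neighbour list are all assumptions made for this example. The scaffold's `_rbf` and `knn_graph` operate on batched tensors and may differ in signature and defaults.

```python
import math

def rbf_encode(d, d_min=0.0, d_max=20.0, num_rbf=16):
    """Expand one distance into num_rbf Gaussian bumps (range and width are assumed)."""
    step = (d_max - d_min) / (num_rbf - 1)
    centers = [d_min + i * step for i in range(num_rbf)]
    sigma = step  # width equal to the centre spacing: an assumption
    return [math.exp(-((d - c) / sigma) ** 2) for c in centers]

def knn_indices(ca_coords, mask, k):
    """For each valid residue, list its k nearest valid residues by CA distance.
    Padded residues get an empty list and are never selected as neighbours."""
    valid = [i for i, m in enumerate(mask) if m]
    neighbours = []
    for i, m in enumerate(mask):
        if not m:
            neighbours.append([])
            continue
        others = [j for j in valid if j != i]  # self excluded (assumption)
        others.sort(key=lambda j: math.dist(ca_coords[i], ca_coords[j]))
        neighbours.append(others[:k])
    return neighbours

# Toy chain: four residues on a line, 3.8 Å apart, plus one padded position.
ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0), (0.0, 0.0, 0.0)]
mask = [1, 1, 1, 1, 0]
nbrs = knn_indices(ca, mask, k=2)

# Edge feature for the first pair of residues: a 16-d soft one-hot over distance bins.
edge_feat = rbf_encode(math.dist(ca[0], ca[1]))
```

The padded position shares coordinates with residue 0 in this toy input, which is why the mask, not the distance, has to decide whether a position can take part in the graph.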

Fixed Pipeline

Datasets, train/validation/test splits, the training loop, padding/masking, optimizer schedule, loss (per-residue cross-entropy), and evaluation harness are all supplied by the scaffold and not part of the contribution.

Evaluation

The model is evaluated on three benchmarks:

  • CATH 4.2: Standard protein design benchmark (single-chain, ~18k train / 608 test).
  • CATH 4.3: Updated CATH with more diverse structures (~21k train / 1120 test).
  • TS50: 50 de novo designed proteins for out-of-distribution generalization (trained on CATH 4.2).

Primary metric: Recovery (fraction of correctly predicted amino acids, higher is better). Secondary metric: Perplexity (exponential of per-residue cross-entropy loss, lower is better).
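The two metrics can be sketched for a single protein as follows. This is a minimal plain-Python illustration, not the scaffold's evaluation harness: the function name `recovery_and_perplexity` and the helper `peaked` are hypothetical, and the sketch averages over the valid residues of one protein, whereas the harness may aggregate across proteins differently.

```python
import math

def recovery_and_perplexity(log_probs, targets, mask):
    """log_probs: one list of 20 log-probabilities per residue.
    targets: true amino-acid indices. mask: 1 for valid residues, 0 for padding."""
    n_valid, n_correct, nll_sum = 0, 0, 0.0
    for lp, t, m in zip(log_probs, targets, mask):
        if not m:
            continue  # padded positions contribute to neither metric
        pred = max(range(len(lp)), key=lambda a: lp[a])  # argmax prediction
        n_correct += int(pred == t)
        nll_sum += -lp[t]  # per-residue cross-entropy term
        n_valid += 1
    recovery = n_correct / n_valid
    perplexity = math.exp(nll_sum / n_valid)
    return recovery, perplexity

def peaked(idx, p=0.9, n=20):
    """Toy distribution: probability p on one class, the rest spread evenly."""
    rest = (1.0 - p) / (n - 1)
    return [math.log(p if a == idx else rest) for a in range(n)]

log_probs = [peaked(3), peaked(7), peaked(0), peaked(5)]
targets = [3, 7, 1, 9]  # third prediction is wrong; fourth position is padding
mask = [1, 1, 1, 0]
rec, ppl = recovery_and_perplexity(log_probs, targets, mask)

# A model that is uniform over the 20 amino acids has perplexity 20 by construction.
uniform = [[math.log(1.0 / 20)] * 20 for _ in range(2)]
_, ppl_uniform = recovery_and_perplexity(uniform, [0, 1], [1, 1])
```

Perplexity penalises confident mistakes more heavily than recovery does: the single wrong residue above costs one third of the recovery but dominates the averaged cross-entropy.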

Code

custom_invfold.py
"""
Protein Inverse Folding — Self-contained template.
Given backbone structure (N, CA, C, O coordinates), predict amino acid sequence.

Structure:
  Lines 1-75: FIXED — Imports, constants, data loading, featurization
  Lines 76-230: EDITABLE — StructureEncoder + decoder (starter: simple MPNN)
  Lines 231+: FIXED — Training loop, evaluation, metrics
"""
import os
import sys
import json
import math
import time
import argparse

Method Summary

Auto-summarized from each method's code by an LLM reviewer; this is not the model's original output.
Agents
Claude Opus 4.6 · Hybrid · low

ProteinMPNN + RLFE + GeoGate

Augments ProteinMPNN edges with relative local-frame rotations and gates messages by frame-conditioned sigmoids.

$$R^{\text{rel}}_{ij} = F_i^{\top} F_j \in \mathbb{R}^{3\times3}, \quad g_{ij} = \sigma\big(\mathrm{MLP}(\mathrm{vec}(R^{\text{rel}}_{ij}))\big), \quad m_{ij} = g_{ij} \odot \mathrm{MPNN}(h_i, h_j, e_{ij})$$
Pipeline diagram (flattened): backbone X; 25 atom-pair RBF; local frame F_i; F_i^T F_j (9-d); edge embed; geo gate (sigmoid); MPNN msg × gate; node update.
Δ vs. baseline: The code as displayed has a SyntaxError (duplicated `def compute_relative_frame_features` and `class ProteinFeaturesEnhanced` headers and a missing `__init__` line), so the final agent file did not execute and the leaderboard `is_final=true` row is empty. Intended design: extends the ProteinMPNN baseline (25 all-atom RBFs + positional edges) with a per-edge 9-d relative local-frame rotation $F_i^\top F_j$, which is concatenated into the edge features and also fed through a sigmoid MLP that element-wise gates each MPNN message.
Hyperparameters: hidden_dim=128, num_layers=3, k_neighbors=30, num_rbf=16, dropout=0.1, rel_frame_dim=9. Recovers the ProteinMPNN baseline when the geo-gate saturates to 1.
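Because the agent's file did not execute, the following is only a sketch of the summarized formula, not the agent's implementation. It builds a local frame per residue, forms the 9-d relative rotation $F_i^\top F_j$, and applies a sigmoid gate to a message. The Gram-Schmidt frame construction from N, CA, and C, the single linear layer standing in for the MLP, and all function names and parameter values are assumptions made for this example.

```python
import math

def sub(a, b):
    return [a[k] - b[k] for k in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def unit(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

def local_frame(n_xyz, ca_xyz, c_xyz):
    """Right-handed orthonormal frame from N, CA, C (assumed construction).
    Returned as a list of the three column vectors [e1, e2, e3]."""
    e1 = unit(sub(c_xyz, ca_xyz))
    v = sub(n_xyz, ca_xyz)
    e2 = unit([v[k] - dot(v, e1) * e1[k] for k in range(3)])  # Gram-Schmidt step
    return [e1, e2, cross(e1, e2)]

def relative_rotation(F_i, F_j):
    """vec(F_i^T F_j), row-major: entry (a, b) = column a of F_i . column b of F_j."""
    return [dot(F_i[a], F_j[b]) for a in range(3) for b in range(3)]

def geo_gate(rel9, weights, bias):
    """Sigmoid gate per message channel; one linear layer stands in for the MLP."""
    return [1.0 / (1.0 + math.exp(-(dot(w, rel9) + b))) for w, b in zip(weights, bias)]

def rot_z(p, theta):
    """Rotate a point about the z-axis (used only to check invariance)."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]

# Residue i sits in a reference pose; residue j is turned 90 degrees about z and shifted.
res_i = ([-0.5, 1.4, 0.0], [0.0, 0.0, 0.0], [1.5, 0.0, 0.0])  # N, CA, C
res_j = ([2.4, -0.5, 0.0], [3.8, 0.0, 0.0], [3.8, 1.5, 0.0])
rel = relative_rotation(local_frame(*res_i), local_frame(*res_j))

# The same pair after a global rotation: the relative rotation should not change.
turned_i = [rot_z(p, 0.7) for p in res_i]
turned_j = [rot_z(p, 0.7) for p in res_j]
rel_turned = relative_rotation(local_frame(*turned_i), local_frame(*turned_j))

# Hypothetical gate parameters for a 4-channel message: zero weights and a large
# bias push every gate towards 1, so the message passes through almost unchanged.
message = [0.3, -1.2, 0.8, 2.0]
gate = geo_gate(rel, weights=[[0.0] * 9] * 4, bias=[20.0] * 4)
gated = [g * m for g, m in zip(gate, message)]
```

The relative rotation depends only on how the two frames are oriented with respect to each other, so it is unchanged by a global rotation or translation of the backbone, which makes it a natural invariant edge feature.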

Results