ai4bio-antibody-cdr-design

AI for Biology | chimera-bench | rigorous codebase

Description

Task: Antibody CDR Sequence-Structure Co-Design

Research Question

Design a novel generative architecture for antibody complementarity-determining region (CDR) loop sequence-structure co-design. The model must jointly generate CDR amino acid sequences and 3D CA backbone coordinates conditioned on the antigen-antibody framework context.

Background

Antibodies are Y-shaped immune proteins that bind antigens through six hypervariable loops called CDRs (H1-H3, L1-L3). Designing CDR loops that bind specific epitopes is central to therapeutic antibody engineering. This requires jointly optimizing:

  • Sequence: amino acid identity at each CDR position (determines binding specificity)
  • Structure: 3D backbone conformation (determines shape complementarity)
  • Epitope conditioning: the generated CDR must form contacts with the target epitope

State-of-the-art approaches use diffusion models (DiffAb), equivariant GNNs (dyMEAN, MEAN), flow matching, and iterative refinement to generate CDR loops. The task evaluates generalization across three biologically motivated data splits.

What to Implement

Modify the CustomCDRModel class in custom_cdr.py (lines 178-301). You must implement:

  1. __init__(self, ...): Define your model architecture.
  2. forward(self, batch): Training forward pass. Receives a list of sample dicts, returns a dict of named losses (e.g., {'seq': tensor, 'coord': tensor}).
  3. sample(self, batch): Inference. Receives a list of sample dicts, returns a list of prediction dicts with keys: complex_id, cdr_type, pred_sequence, true_sequence, pred_coords, true_coords, ppl.

You may define helper classes/modules within the editable region. The starter code is a simple MLP that runs but performs poorly.
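The return contracts above can be checked with a small framework-free sketch. It is not part of custom_cdr.py: `check_prediction` and `total_loss` are hypothetical helper names, the `complex_id` value is made up, and the rule that `pred_coords` holds one CA triple per predicted residue is an assumption rather than something the task text states. Plain lists and floats stand in for tensors here.

```python
# Keys every prediction dict from sample() must carry, per the task description.
REQUIRED_KEYS = (
    "complex_id", "cdr_type", "pred_sequence", "true_sequence",
    "pred_coords", "true_coords", "ppl",
)


def check_prediction(pred):
    """Raise if a single prediction dict breaks the expected contract."""
    missing = [key for key in REQUIRED_KEYS if key not in pred]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    # Assumption: one CA coordinate triple per predicted residue.
    if len(pred["pred_coords"]) != len(pred["pred_sequence"]):
        raise ValueError("pred_coords and pred_sequence differ in length")
    if any(len(xyz) != 3 for xyz in pred["pred_coords"]):
        raise ValueError("each coordinate row must have 3 values")
    return True


def total_loss(loss_dict):
    """Sum the named losses returned by forward()."""
    return sum(loss_dict.values())


# Toy prediction; coordinates are spaced 3.8 apart along x purely for illustration.
toy_coords = [[3.8 * i, 0.0, 0.0] for i in range(4)]
example = {
    "complex_id": "toy_0001",  # hypothetical identifier
    "cdr_type": "H3",
    "pred_sequence": "ARDY",
    "true_sequence": "ARGY",
    "pred_coords": toy_coords,
    "true_coords": toy_coords,
    "ppl": 4.2,
}
```

Running `check_prediction` on each element of the list returned by `sample` before evaluation is a cheap way to catch a missing key or a shape mismatch early.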

Data Format

Each sample is a dict with:

  • heavy_seq / light_seq: amino acid sequences (str)
  • heavy_coords / light_coords: CA coordinates [L, 3] (Tensor)
  • ag_coords: antigen CA coordinates [L_ag, 3] (Tensor)
  • ag_surface: antigen surface chemical features [128, 6] (Tensor)
  • cdr_info: dict mapping CDR label (e.g., "H3") to {'indices': ndarray, 'seq': str, 'coords': ndarray [N,3], 'chain': str}
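To make the layout concrete, here is a toy sample and a consistency check, written as a sketch with nested lists standing in for tensors and arrays. `validate_sample` is a hypothetical helper, the sequences and coordinates are invented, the `"H"` chain value is a guess at the encoding, and the check that each coordinate array has one row per residue is an assumption about what `L` and `N` mean.

```python
def validate_sample(sample):
    """Return a list of problems; an empty list means the sample looks consistent."""
    problems = []
    for chain in ("heavy", "light"):
        seq = sample[f"{chain}_seq"]
        coords = sample[f"{chain}_coords"]
        # Assumption: one CA row per residue, so [L, 3] matches len(sequence).
        if len(coords) != len(seq):
            problems.append(f"{chain}: coords and sequence lengths disagree")
    surface = sample["ag_surface"]
    if len(surface) != 128 or any(len(row) != 6 for row in surface):
        problems.append("ag_surface is not [128, 6]")
    for label, cdr in sample["cdr_info"].items():
        n = len(cdr["seq"])
        # Assumption: indices, seq, and coords all describe the same N residues.
        if len(cdr["indices"]) != n or len(cdr["coords"]) != n:
            problems.append(f"{label}: indices/seq/coords lengths disagree")
    return problems


# Invented toy data, only meant to show the shapes.
toy = {
    "heavy_seq": "EVQLARDYW",
    "light_seq": "DIQMT",
    "heavy_coords": [[3.8 * i, 0.0, 0.0] for i in range(9)],
    "light_coords": [[3.8 * i, 5.0, 0.0] for i in range(5)],
    "ag_coords": [[3.8 * i, 10.0, 0.0] for i in range(12)],
    "ag_surface": [[0.0] * 6 for _ in range(128)],
    "cdr_info": {
        "H3": {
            "indices": [4, 5, 6, 7],
            "seq": "ARDY",
            "coords": [[3.8 * i, 0.0, 0.0] for i in range(4, 8)],
            "chain": "H",  # assumed encoding of the chain field
        },
    },
}
```

Because the check only relies on `len`, it should also work on real tensors and arrays, but that has not been verified against the benchmark's data loader.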

Evaluation

The model is tested on 3 CHIMERA-Bench splits (epitope_group, antigen_fold, temporal). For each, all 6 CDR types (H1-H3, L1-L3) are generated and evaluated using 12 CHIMERA metrics:

  • Sequence: AAR (amino acid recovery), CAAR (contact AAR), PPL
  • Structure: RMSD (Kabsch-aligned CA), TM-score
  • Binding: Fnat, iRMSD, DockQ
  • Epitope: Epitope F1
  • Designability: n_liabilities
  • Composites: CHIMERA-S (structural), CHIMERA-B (binding)

Higher is better for AAR, TM-score, Fnat, DockQ, Epitope F1, CHIMERA-S, and CHIMERA-B; lower is better for RMSD.
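As an illustration of the two simplest sequence metrics, the sketch below computes AAR as the fraction of matching positions and PPL as the exponential of the mean negative log-likelihood. These are the textbook definitions; the function names are hypothetical, and the implementation in chimera-bench/evaluation/metrics.py may differ in detail (for example in how length mismatches are handled).

```python
import math


def amino_acid_recovery(pred_seq, true_seq):
    """Fraction of positions where the predicted residue equals the true one."""
    if len(pred_seq) != len(true_seq):
        raise ValueError("sequences must have equal length")
    if not true_seq:
        return 0.0
    matches = sum(p == t for p, t in zip(pred_seq, true_seq))
    return matches / len(true_seq)


def perplexity(true_residue_probs):
    """exp(mean negative log-likelihood) of the probabilities given to the true residues."""
    nll = -sum(math.log(p) for p in true_residue_probs) / len(true_residue_probs)
    return math.exp(nll)
```

For instance, `amino_acid_recovery("ARDY", "ARGY")` gives 0.75, and a model that assigns a uniform probability of 1/20 to every true residue has a perplexity of about 20.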

Editable Region

Lines 178-301 of custom_cdr.py.

Code

custom_cdr.py
"""
Antibody CDR Design — Custom generative model for CDR loop sequence-structure co-design.

This template provides a unified training and evaluation pipeline for antibody
complementarity-determining region (CDR) design using the CHIMERA-Bench framework.

Structure:
    Lines 1-221: FIXED — Imports, data loading, utilities, constants
    Lines 222-514: EDITABLE — CustomCDRModel class (starter: simple EGNN denoiser)
    Lines 515-end: FIXED — Training loop, evaluation, CLI entry point

Interface:
    forward(batch) -> loss_dict: Dict[str, Tensor]
        Training forward pass. Returns dict of named losses to be summed.
    sample(batch) -> Dict[str, Tensor]

Additional context files (read-only):

  • chimera-bench/evaluation/metrics.py
  • chimera-bench/baselines/chimera_utils.py
  • chimera-bench/baselines/shared_config.yaml

Results

                  Epitope group                 Antigen fold                  Temporal
Model   Type      AAR       RMSD      TM-score  AAR       RMSD      TM-score  AAR       RMSD      TM-score
diffab  baseline  0.576     2.460     0.347     0.569     2.462     0.352     0.565     2.583     0.343
dymean  baseline  0.556     2.494     0.343     0.561     2.528     0.347     0.558     2.636     0.339
mean    baseline  0.571     0.877     0.449     0.567     0.888     0.434     0.572     1.045     0.405