ai4bio-antibody-cdr-design

AI for Biology | chimera-bench | rigorous codebase

Description

Task: Antibody CDR Sequence-Structure Co-Design

Research Question

Design a novel generative architecture for antibody complementarity-determining region (CDR) loop sequence-structure co-design. The model must jointly generate CDR amino acid sequences and 3D Cα (CA) backbone coordinates, conditioned on the antigen and the antibody framework.

Background

Antibodies are Y-shaped immune proteins that bind antigens through six hypervariable loops called CDRs (H1-H3, L1-L3). Designing CDR loops that bind specific epitopes is central to therapeutic antibody engineering. This requires jointly optimizing:

Sequence: amino acid identity at each CDR position (determines binding specificity)
Structure: 3D backbone conformation (determines shape complementarity)
Epitope conditioning: the generated CDR must form contacts with the target epitope

State-of-the-art approaches use diffusion models (DiffAb), equivariant GNNs (dyMEAN, MEAN), flow matching, and iterative refinement to generate CDR loops. The task evaluates generalization across three biologically motivated data splits.

What to Implement

Modify the CustomCDRModel class in custom_cdr.py (lines 178-301). You must implement:

__init__(self, ...): Define your model architecture.
forward(self, batch): Training forward pass. Receives a list of sample dicts, returns a dict of named losses (e.g., {'seq': tensor, 'coord': tensor}).
sample(self, batch): Inference. Receives a list of sample dicts, returns a list of prediction dicts with keys: complex_id, cdr_type, pred_sequence, true_sequence, pred_coords, true_coords, ppl.

You may define helper classes/modules within the editable region. The starter code is a simple MLP that runs but performs poorly.
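
The interface described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the template's starter code: the class name SketchCDRModel, the hidden_dim argument, the 20-letter alphabet, the complex_id lookup, and the perplexity definition (exponential of the mean negative log-likelihood of the native sequence) are all hypothetical. It also assumes that cdr_info[...]['indices'] indexes into the chain named by the first letter of the CDR label. The sketch ignores the antigen entirely and only shows the expected input and output structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

AA = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet; the template may define its own
AA_TO_IDX = {a: i for i, a in enumerate(AA)}


class SketchCDRModel(nn.Module):
    """Hypothetical stand-in for CustomCDRModel (CPU-only sketch)."""

    def __init__(self, hidden_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.seq_head = nn.Linear(hidden_dim, len(AA))
        self.coord_head = nn.Linear(hidden_dim, 3)

    def _predict(self, sample, label, cdr):
        # Assumption: "H*" labels index heavy_coords, "L*" labels index light_coords.
        key = "heavy_coords" if label.startswith("H") else "light_coords"
        chain = torch.as_tensor(sample[key], dtype=torch.float32)
        idx = [int(i) for i in cdr["indices"]]
        left = chain[max(idx[0] - 1, 0)]                      # framework residue before the loop
        right = chain[min(idx[-1] + 1, chain.shape[0] - 1)]   # framework residue after the loop
        # Fractional position of each loop residue between the two anchors, shape [N, 1].
        t = torch.linspace(0.0, 1.0, len(idx) + 2)[1:-1].unsqueeze(-1)
        h = self.trunk(t)
        init = left + t * (right - left)  # straight-line initial guess, shape [N, 3]
        return self.seq_head(h), init + self.coord_head(h)

    @staticmethod
    def _targets(cdr):
        # Unknown residue letters fall back to index 0 (a simplification).
        seq = torch.tensor([AA_TO_IDX.get(a, 0) for a in cdr["seq"]])
        coords = torch.as_tensor(cdr["coords"], dtype=torch.float32)
        return seq, coords

    def forward(self, batch):
        seq_losses, coord_losses = [], []
        for sample in batch:
            for label, cdr in sample["cdr_info"].items():
                logits, coords = self._predict(sample, label, cdr)
                true_seq, true_coords = self._targets(cdr)
                seq_losses.append(F.cross_entropy(logits, true_seq))
                coord_losses.append(F.mse_loss(coords, true_coords))
        return {"seq": torch.stack(seq_losses).mean(),
                "coord": torch.stack(coord_losses).mean()}

    @torch.no_grad()
    def sample(self, batch):
        preds = []
        for sample in batch:
            for label, cdr in sample["cdr_info"].items():
                logits, coords = self._predict(sample, label, cdr)
                true_seq, true_coords = self._targets(cdr)
                # Assumption: ppl = exp(mean NLL of the native sequence).
                ppl = torch.exp(F.cross_entropy(logits, true_seq)).item()
                preds.append({
                    "complex_id": sample.get("complex_id", "unknown"),  # key name assumed
                    "cdr_type": label,
                    "pred_sequence": "".join(AA[i] for i in logits.argmax(-1).tolist()),
                    "true_sequence": cdr["seq"],
                    "pred_coords": coords,
                    "true_coords": true_coords,
                    "ppl": ppl,
                })
        return preds
```

Calling the model on a batch returns the loss dict, so a training step would sum its values before calling backward(); sample() returns one prediction dict per CDR per sample. A real submission would replace the position-only trunk with something that actually reads the antigen and framework context.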

Data Format

Each sample is a dict with:

heavy_seq / light_seq: amino acid sequences (str)
heavy_coords / light_coords: CA coordinates [L, 3] (Tensor)
ag_coords: antigen CA coordinates [L_ag, 3] (Tensor)
ag_surface: antigen surface chemical features [128, 6] (Tensor)
cdr_info: dict mapping CDR label (e.g., "H3") to {'indices': ndarray, 'seq': str, 'coords': ndarray [N,3], 'chain': str}
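
To make the layout concrete, the sketch below builds a synthetic sample with the keys listed above and checks the shape relationships between them. The helper names make_toy_sample and check_sample are hypothetical, all values are random, and two details are assumptions rather than documented facts: that each sequence has the same length as its coordinate array, and that 'chain' holds a single letter such as "H".

```python
import numpy as np
import torch


def make_toy_sample(n_heavy=120, n_light=107, n_ag=50, seed=0):
    """Build a random sample with the documented keys (all values are synthetic)."""
    g = torch.Generator().manual_seed(seed)
    start, stop = 95, 103  # hypothetical H3 positions within the heavy chain
    heavy_coords = torch.randn(n_heavy, 3, generator=g)
    return {
        "heavy_seq": "A" * n_heavy,
        "light_seq": "G" * n_light,
        "heavy_coords": heavy_coords,
        "light_coords": torch.randn(n_light, 3, generator=g),
        "ag_coords": torch.randn(n_ag, 3, generator=g),
        "ag_surface": torch.randn(128, 6, generator=g),
        "cdr_info": {
            "H3": {
                "indices": np.arange(start, stop),
                "seq": "A" * (stop - start),
                "coords": heavy_coords[start:stop].numpy(),
                "chain": "H",  # assumed encoding; check the data loader
            }
        },
    }


def check_sample(sample):
    """Assert the shape relationships implied by the data format."""
    # Assumption: one CA coordinate per residue in each chain.
    assert len(sample["heavy_seq"]) == sample["heavy_coords"].shape[0]
    assert len(sample["light_seq"]) == sample["light_coords"].shape[0]
    assert sample["ag_coords"].shape[1] == 3
    assert tuple(sample["ag_surface"].shape) == (128, 6)
    for label, cdr in sample["cdr_info"].items():
        n = len(cdr["seq"])
        assert len(cdr["indices"]) == n, label
        assert tuple(cdr["coords"].shape) == (n, 3), label
```

A check like this is cheap to run on the first batch from the real loader and catches indexing mistakes before they surface as shape errors inside the model.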

Evaluation

The model is tested on 3 CHIMERA-Bench splits (epitope_group, antigen_fold, temporal). For each, all 6 CDR types (H1-H3, L1-L3) are generated and evaluated using 12 CHIMERA metrics:

Sequence: AAR (amino acid recovery), CAAR (contact AAR), PPL
Structure: RMSD (Kabsch-aligned CA), TM-score
Binding: Fnat, iRMSD, DockQ
Epitope: Epitope F1
Designability: n_liabilities
Composites: CHIMERA-S (structural), CHIMERA-B (binding)

Higher AAR, TM-score, Fnat, DockQ, Epitope F1, CHIMERA-S, and CHIMERA-B are better. Lower RMSD is better.
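
For orientation, two of the metrics above can be sketched in a few lines of NumPy. These are the textbook definitions of amino acid recovery and Kabsch-aligned RMSD, written here as an illustration; the function names aar and kabsch_rmsd are hypothetical, and the official implementations in chimera-bench/evaluation/metrics.py may differ in details such as which atoms are used for the alignment.

```python
import numpy as np


def aar(pred_seq, true_seq):
    """Fraction of positions where the predicted residue matches the native one."""
    assert len(pred_seq) == len(true_seq)
    return sum(p == t for p, t in zip(pred_seq, true_seq)) / len(true_seq)


def kabsch_rmsd(pred, true):
    """RMSD between two [N, 3] arrays after optimal rigid superposition."""
    P = pred - pred.mean(axis=0)  # center both point sets
    Q = true - true.mean(axis=0)
    H = P.T @ Q                   # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T               # apply the rotation to row vectors
    return float(np.sqrt(((P_rot - Q) ** 2).sum(axis=1).mean()))
```

Because the alignment removes any rigid-body offset, a prediction that is a rotated and translated copy of the native loop scores an RMSD of zero; only differences in loop shape contribute.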

Editable Region

Lines 178-301 of custom_cdr.py.

Code

custom_cdr.py

"""
Antibody CDR Design — Custom generative model for CDR loop sequence-structure co-design.

This template provides a unified training and evaluation pipeline for antibody
complementarity-determining region (CDR) design using the CHIMERA-Bench framework.

Structure:
  Lines 1-221:     FIXED — Imports, data loading, utilities, constants
  Lines 222-514:   EDITABLE — CustomCDRModel class (starter: simple EGNN denoiser)
  Lines 515-end:   FIXED — Training loop, evaluation, CLI entry point

Interface:
  forward(batch) -> loss_dict: Dict[str, Tensor]
    Training forward pass. Returns dict of named losses to be summed.
  sample(batch) -> Dict[str, Tensor]

Additional context files (read-only):

chimera-bench/evaluation/metrics.py
chimera-bench/baselines/chimera_utils.py
chimera-bench/baselines/shared_config.yaml

Results

Model	Type	AAR (epitope_group) ↑	RMSD (epitope_group) ↓	TM-score (epitope_group) ↑	AAR (antigen_fold) ↑	RMSD (antigen_fold) ↓	TM-score (antigen_fold) ↑	AAR (temporal) ↑	RMSD (temporal) ↓	TM-score (temporal) ↑
diffab	baseline	0.576	2.460	0.347	0.569	2.462	0.352	0.565	2.583	0.343
dymean	baseline	0.556	2.494	0.343	0.561	2.528	0.347	0.558	2.636	0.339
mean	baseline	0.571	0.877	0.449	0.567	0.888	0.434	0.572	1.045	0.405