ai4sci-mol-property-prediction
Description
Task: Molecular Property Prediction
Research Question
Design a molecular representation model for predicting chemical properties (toxicity, blood-brain barrier penetration, enzyme inhibition, etc.) from molecular structure. The goal is to learn effective molecular representations that generalize across diverse property prediction tasks.
Background
Molecular property prediction is a core task in drug discovery and materials science. Given a molecule (a SMILES string converted to a molecular graph, optionally with 3D coordinates), the model must predict one or more chemical properties. Key challenges include:
- Molecular representation: How to encode atoms, bonds, and 3D geometry into informative features.
- Multi-task learning: Some datasets have multiple targets with missing labels (e.g., Tox21 has 12 assays).
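- Scaffold generalization: The scaffold split places molecules with different core scaffolds in different subsets, so the model is evaluated on structurally novel molecules rather than on close analogues of the training set.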
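The scaffold split in the last bullet can be pictured as a grouping problem. The sketch below assumes each molecule already has a scaffold key (for example a scaffold SMILES computed elsewhere). The function name `scaffold_split`, the 80/10/10 fractions, and the largest-group-first ordering are illustrative assumptions, not necessarily the procedure behind the pre-split data used in this task.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that no scaffold is shared between subsets.

    `scaffolds` holds one scaffold key per molecule (assumed precomputed).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest scaffold groups first, so common families land in train and
    # rare scaffolds tend to end up in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy example: 10 molecules over 4 hypothetical scaffold keys.
keys = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(keys)
```

Because whole groups are assigned at once, every scaffold appears in exactly one subset, which is what makes the test set structurally novel.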
Existing approaches include:
- D-MPNN: Directed message passing on bonds (not atoms) to avoid "message collision."
- AttentiveFP: Graph attention with GRU readout and learned molecular fingerprints.
- Uni-Mol: SE(3)-invariant Transformer with 3D distance attention bias, using pre-trained weights.
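To make the idea of a 3D distance attention bias more concrete, here is a minimal sketch of one self-attention layer whose logits receive a learned bias computed from pairwise distances. It is not Uni-Mol's actual implementation: the class name `DistanceBiasAttention`, the fixed Gaussian centers, and all hyperparameters are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class DistanceBiasAttention(nn.Module):
    """Self-attention layer whose logits get a learned distance bias."""

    def __init__(self, dim, num_heads=4, num_kernels=16, max_dist=10.0):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # Fixed Gaussian centers spread over [0, max_dist] (an assumption).
        self.register_buffer("centers", torch.linspace(0.0, max_dist, num_kernels))
        self.width = max_dist / num_kernels
        self.bias_proj = nn.Linear(num_kernels, num_heads)

    def forward(self, h, dist, mask):
        # h: [B, N, dim], dist: [B, N, N], mask: [B, N] with 1 = real atom
        B, N, _ = h.shape
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        shape = (B, N, self.num_heads, self.head_dim)
        q, k, v = (t.reshape(shape).transpose(1, 2) for t in (q, k, v))
        logits = q @ k.transpose(-1, -2) / self.head_dim ** 0.5  # [B, H, N, N]

        # Expand distances in Gaussian kernels, then map to one bias per head.
        rbf = torch.exp(-((dist.unsqueeze(-1) - self.centers) / self.width) ** 2)
        logits = logits + self.bias_proj(rbf).permute(0, 3, 1, 2)

        # Padding atoms must not be attended to.
        key_mask = mask.bool()[:, None, None, :]
        logits = logits.masked_fill(~key_mask, float("-inf"))
        attn = torch.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)


torch.manual_seed(0)
layer = DistanceBiasAttention(dim=32)
h = torch.randn(2, 5, 32)
pos = torch.randn(2, 5, 3)
mask = torch.tensor([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])
out = layer(h, torch.cdist(pos, pos), mask)  # [2, 5, 32]
```

Since the layer only sees pairwise distances, rotating or translating the coordinates leaves its output unchanged, which is the invariance property the bullet refers to. Outputs at padded positions are not meaningful and should be masked out downstream.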
What to Implement
Implement the MoleculeModel class in custom_molprop.py. You must implement:
- __init__(self, atom_dim, edge_dim, num_tasks, task_type): Set up your model architecture.
- forward(self, batch) -> Tensor: Return predictions of shape [B, num_tasks].
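As a starting point, the interface can be satisfied by a very small model. The sketch below is not the starter GIN from the template: it ignores bonds and 3D information, uses only the sparse fields `x` and `batch_idx`, and infers the number of graphs from `batch_idx`. The `hidden` argument, the `"classification"` value for `task_type`, and the `SimpleNamespace` stand-in for `MolBatch` are assumptions for illustration.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class MoleculeModel(nn.Module):
    """Baseline sketch: per-atom MLP followed by sum pooling per molecule."""

    def __init__(self, atom_dim, edge_dim, num_tasks, task_type, hidden=128):
        super().__init__()
        self.task_type = task_type  # stored only; raw scores are returned
        self.atom_mlp = nn.Sequential(
            nn.Linear(atom_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.head = nn.Linear(hidden, num_tasks)

    def forward(self, batch):
        h = self.atom_mlp(batch.x)  # [total_atoms, hidden]
        # Assumes every graph in the batch has at least one atom.
        num_graphs = int(batch.batch_idx.max()) + 1
        pooled = h.new_zeros(num_graphs, h.shape[1])
        pooled.index_add_(0, batch.batch_idx, h)  # sum atoms of each graph
        return self.head(pooled)  # [B, num_tasks]


# Stand-in for MolBatch with two molecules (3 and 2 atoms); values are random.
fake_batch = SimpleNamespace(
    x=torch.randn(5, 136),
    batch_idx=torch.tensor([0, 0, 0, 1, 1]),
)
model = MoleculeModel(atom_dim=136, edge_dim=9, num_tasks=12, task_type="classification")
preds = model(fake_batch)  # shape [2, 12]
```

Sum pooling over `batch_idx` makes the output independent of atom ordering within a molecule; a message-passing or Transformer encoder would replace `atom_mlp` in a stronger model.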
Batch Format (MolBatch)
@dataclass
class MolBatch:
    # Sparse graph format (for GNN models)
    x: Tensor            # [total_atoms, atom_dim] node features
    edge_index: Tensor   # [2, total_edges] COO format
    edge_attr: Tensor    # [total_edges, edge_dim] bond features
    batch_idx: Tensor    # [total_atoms] graph assignment (0..B-1)
    # Dense format (for Transformer models)
    atom_features: Tensor  # [B, max_atoms, atom_dim] zero-padded
    positions: Tensor      # [B, max_atoms, 3] 3D coordinates
    dist_matrix: Tensor    # [B, max_atoms, max_atoms] pairwise distances
    mask: Tensor           # [B, max_atoms] 1=real atom, 0=padding
    # Uni-Mol specific (from LMDB pipeline)
    atom_tokens: Tensor  # [B, max_tokens] Uni-Mol vocabulary token ids (with [CLS]/[SEP])
    edge_types: Tensor   # [B, max_tokens, max_tokens] atom-pair type ids
    # Targets (normalized for regression tasks)
    targets: Tensor      # [B, num_tasks]
    target_mask: Tensor  # [B, num_tasks] 1=valid label, 0=missing
Additional attributes on batch (set dynamically):
- batch._unimol_dist: [B, max_tokens, max_tokens] distance matrix for Uni-Mol tokens
- batch._unimol_token_mask: [B, max_tokens] 1=valid token, 0=padding
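The `targets` and `target_mask` fields suggest how missing labels can be excluded from a multi-task loss. The training loop itself is fixed and its loss is not shown here, so the following is only an illustration of the masking idea; the function name `masked_bce_loss` and the assumption that missing entries may hold arbitrary values (even NaN) are hypothetical.

```python
import torch
import torch.nn.functional as F


def masked_bce_loss(logits, targets, target_mask):
    """Binary cross-entropy averaged over valid (mask == 1) labels only."""
    mask = target_mask.float()
    # Replace missing labels (possibly NaN) by 0 before the loss, then mask.
    safe_targets = torch.where(mask.bool(), targets.float(), torch.zeros_like(logits))
    per_label = F.binary_cross_entropy_with_logits(
        logits, safe_targets, reduction="none"
    )
    # clamp avoids a division by zero when a batch has no valid label at all
    return (per_label * mask).sum() / mask.sum().clamp(min=1.0)


# Two molecules, two tasks; the second task of the first molecule is missing.
logits = torch.tensor([[2.0, -1.0], [0.5, 0.0]])
targets = torch.tensor([[1.0, float("nan")], [0.0, 1.0]])
target_mask = torch.tensor([[1, 0], [1, 1]])
loss = masked_bce_loss(logits, targets, target_mask)
```

Normalizing by the number of valid labels rather than by `B * num_tasks` keeps the loss scale comparable between batches with many and few missing entries.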
Atom Features (ATOM_DIM = 136)
One-hot encodings of: atomic_num (118), degree (6), formal_charge (5), num_Hs (5), hybridization (5), aromatic (1), in_ring (1).
Bond Features (EDGE_DIM = 9)
One-hot encodings of: bond_type (4), stereo (3), conjugated (1), in_ring (1).
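The bond feature layout (4 + 3 + 1 + 1 = 9) can be illustrated with a small pure-Python sketch. The concrete category names and their ordering in `BOND_TYPES` and `STEREO` are assumptions; the fixed featurization code in the template may use different categories or a different order.

```python
BOND_TYPES = ["SINGLE", "DOUBLE", "TRIPLE", "AROMATIC"]  # assumed ordering
STEREO = ["NONE", "Z", "E"]  # assumed categories


def one_hot(value, choices):
    """Return a list with a single 1.0 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in choices]


def bond_features(bond_type, stereo, conjugated, in_ring):
    """Concatenate the four groups: 4 + 3 + 1 + 1 = 9 entries."""
    return (
        one_hot(bond_type, BOND_TYPES)
        + one_hot(stereo, STEREO)
        + [float(conjugated), float(in_ring)]
    )


# An aromatic ring bond, such as a C-C bond in benzene.
feat = bond_features("AROMATIC", "NONE", conjugated=True, in_ring=True)
```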
Evaluation
The model is tested on 3 MoleculeNet classification benchmarks with scaffold split (metric: ROC-AUC, higher is better):
- BBBP: Blood-brain barrier penetration (2,039 molecules, 1 task)
- BACE: Beta-secretase 1 inhibition (1,513 molecules, 1 task)
- Tox21: Toxicity across 12 assays (7,831 molecules, 12 tasks, multi-task with missing labels)
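For intuition about the metric, the sketch below computes ROC-AUC from its pairwise definition and averages it over tasks while skipping missing labels. Averaging per-task scores for Tox21 is an assumption based on common practice, since the evaluation code is not shown here; the function names are hypothetical, and a real pipeline would more likely call a library routine.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative.

    Ties count as half. Quadratic in the number of samples, which is fine
    for illustration but slower than a rank-based implementation.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_masked_roc_auc(targets, preds, target_mask):
    """Average per-task ROC-AUC, using only entries with mask == 1."""
    aucs = []
    for t in range(len(targets[0])):
        rows = [i for i in range(len(targets)) if target_mask[i][t] == 1]
        auc = roc_auc([targets[i][t] for i in rows], [preds[i][t] for i in rows])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Four molecules, two tasks; one label of the second task is missing.
targets = [[1, 0], [0, 1], [1, 0], [0, 0]]
preds = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.5], [0.7, 0.1]]
target_mask = [[1, 1], [1, 1], [1, 0], [1, 1]]
score = mean_masked_roc_auc(targets, preds, target_mask)
```

In this toy example the first task ranks 3 of 4 positive/negative pairs correctly and the second ranks both of its pairs correctly, so the per-task values are 0.75 and 1.0.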
Editable Region
Lines 115-207 of custom_molprop.py are editable (between EDITABLE SECTION START and EDITABLE SECTION END markers). You may define helper classes, layers, or functions within this region. The region must contain a MoleculeModel class with the specified interface.
Available Resources
- 3D conformers from LMDB (Uni-Mol pipeline: coordinates normalized, polar H removed)
- Uni-Mol vocabulary tokens and edge types available in batch
- Uni-Mol pre-trained weights at /data/unimol_weights/mol_pre_all_h_220816.pt
- TTA (test-time augmentation): predictions averaged over 11 conformers at val/test time
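Test-time augmentation over conformers amounts to averaging model outputs for the same molecules under different 3D geometries. The fixed evaluation code already does this, so the sketch below only illustrates the idea. The helper `predict_with_tta`, the toy model, and the choice to average raw outputs (rather than, say, probabilities) are assumptions.

```python
from types import SimpleNamespace

import torch


@torch.no_grad()
def predict_with_tta(model, conformer_batches):
    """Average model outputs over several conformer versions of one batch.

    `conformer_batches` is assumed to be a list of batches that contain the
    same molecules in the same order, each with a different 3D conformer.
    """
    model.eval()
    outputs = [model(b) for b in conformer_batches]  # each [B, num_tasks]
    return torch.stack(outputs, dim=0).mean(dim=0)


class ToyModel(torch.nn.Module):
    """Hypothetical geometry-dependent model used only for the demo."""

    def forward(self, batch):
        # Average distance of atoms from the origin, per molecule: [B, 1]
        return batch.positions.norm(dim=-1).mean(dim=1, keepdim=True)


toy = ToyModel()
# 11 random "conformers" for a batch of 2 molecules with 6 atoms each.
batches = [SimpleNamespace(positions=torch.randn(2, 6, 3)) for _ in range(11)]
avg = predict_with_tta(toy, batches)  # shape [2, 1]
```

A model that depends only on the 2D graph gains nothing from this averaging, since all conformers of a molecule produce the same output.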
Code
"""
Molecular Property Prediction: Self-contained template.
Predicts molecular properties (classification: ROC-AUC, regression: RMSE)
on MoleculeNet benchmarks (BBBP, BACE, Tox21, ESOL, FreeSolv, Lipophilicity).

Uses official Uni-Mol pre-split LMDB data with train/valid/test splits
and pre-computed multi-conformer 3D coordinates. Data pipeline mirrors
Uni-Mol: LMDB -> conformer sample/enumerate -> remove polar H -> normalize
coordinates -> Uni-Mol vocabulary tokenization -> distance matrix + edge types.

Structure:
    Lines 1-114: FIXED - Imports, constants, atom/bond featurization
    Lines 115-207: EDITABLE - MoleculeModel class (starter: simple GIN)
    Lines 208+: FIXED - Data loading, training loop, evaluation, TTA
"""
Results
| Model | Type | ROC-AUC BBBP ↑ | ROC-AUC BACE ↑ | ROC-AUC Tox21 ↑ |
|---|---|---|---|---|
| dmpnn | baseline | 0.599 | 0.688 | 0.713 |
| gin | baseline | 0.510 | 0.726 | 0.747 |
| unimol | baseline | 0.713 | 0.847 | 0.761 |