llm-offline-rl

Tags: Language Models · LLaMA-Factory · MathRuler · alpaca_eval

Description

LLM Offline RL: Preference Optimization for Language Models

Objective

Improve preference optimization training of a Llama-3-8B model by designing a custom preference loss function. The model is fine-tuned on UltraFeedback using LLaMA-Factory. Your code goes in the compute_preference_loss method in trainer.py and the pref_loss field in finetuning_args.py.

Background

DPO and its variants (Hinge, IPO, SimPO, ORPO) directly optimize a preference loss on chosen/rejected response pairs without training a separate reward model. Each variant trades off differently between training stability and how well it separates chosen from rejected responses.
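To make the variants concrete, the sketch below computes several DPO-family losses from per-example sequence log-probabilities. This is an illustrative standalone function, not LLaMA-Factory's actual `compute_preference_loss` signature; the argument names, `beta`, and the SimPO margin `gamma` are assumptions, and SimPO additionally expects the policy log-probs to be length-normalized by the caller.

```python
import torch
import torch.nn.functional as F

def preference_losses(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      loss_type="sigmoid", beta=0.1, gamma=0.5):
    """Illustrative DPO-family losses on per-example sequence log-probs.

    All inputs are 1-D tensors of shape (batch,). Returns per-example losses.
    """
    pi_ratios = policy_chosen_logps - policy_rejected_logps
    ref_ratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_ratios - ref_ratios  # implicit reward margin vs. the reference

    if loss_type == "sigmoid":   # standard DPO
        return -F.logsigmoid(beta * logits)
    if loss_type == "hinge":     # SLiC-style hinge on the scaled margin
        return torch.relu(1.0 - beta * logits)
    if loss_type == "ipo":       # IPO regresses the margin toward 1/(2*beta)
        return (logits - 1.0 / (2.0 * beta)) ** 2
    if loss_type == "simpo":     # reference-free; assumes length-normalized logps
        return -F.logsigmoid(beta * pi_ratios - gamma)
    raise ValueError(f"unknown loss_type: {loss_type}")
```

A custom loss for this task would typically follow the same pattern: compute a margin from the four log-prob tensors, then map it through a monotone penalty.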

Evaluation

  • AlpacaEval: instruction-following quality judged by DeepSeek. Metric: length_controlled_winrate.
  • MathRuler: mathematical reasoning on GSM8K. Metric: mathruler_accuracy.

Code

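The task says the loss is selected via a `pref_loss` field in `finetuning_args.py`. As a rough sketch of what registering a new option might look like, here is a trimmed-down, hypothetical dataclass in the HfArgumentParser style that LLaMA-Factory uses; the exact field set, choices, and defaults in the real `finetuning_args.py` may differ.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class FinetuningArguments:
    # Hypothetical trimmed-down stand-in for LLaMA-Factory's finetuning args;
    # only the fields relevant to preference optimization are shown.
    pref_loss: Literal["sigmoid", "hinge", "ipo", "simpo", "orpo", "custom"] = field(
        default="sigmoid",
        metadata={"help": "Which preference loss compute_preference_loss applies."},
    )
    pref_beta: float = field(
        default=0.1,
        metadata={"help": "Temperature beta scaling the implicit reward margin."},
    )
```

Adding a `"custom"` choice here and branching on it inside `compute_preference_loss` in `trainer.py` is one plausible way to wire in a new loss.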
Results

No results available yet.