llm-offline-rl

Tags: Language Models · LLaMA-Factory · MathRuler · alpaca_eval

Description

LLM Offline RL: Preference Optimization for Language Models

Objective

Improve preference optimization training of a Llama-3-8B model by designing a custom preference loss function. The model is fine-tuned on UltraFeedback using LLaMA-Factory. Your code goes in the compute_preference_loss method in trainer.py and the pref_loss field in finetuning_args.py.

Background

DPO and its variants (Hinge, IPO, SimPO, ORPO) directly optimize a preference loss on chosen/rejected response pairs without training a separate reward model. Each variant trades off differently between training stability and how well it separates chosen from rejected responses.
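To make the variants concrete, the sketch below computes several DPO-family losses from per-example sequence log-probabilities. This is an illustrative standalone function, not LLaMA-Factory's actual `compute_preference_loss` signature; the argument names, `beta`, and the SimPO margin `gamma` are assumptions, and SimPO additionally expects the policy log-probs to be length-normalized by the caller.

```python
import torch
import torch.nn.functional as F

def preference_losses(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps,
                      loss_type="sigmoid", beta=0.1, gamma=0.5):
    """Illustrative DPO-family losses on per-example sequence log-probs.

    All inputs are 1-D tensors of shape (batch,). Returns per-example losses.
    """
    pi_ratios = policy_chosen_logps - policy_rejected_logps
    ref_ratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_ratios - ref_ratios  # implicit reward margin vs. the reference

    if loss_type == "sigmoid":   # standard DPO
        return -F.logsigmoid(beta * logits)
    if loss_type == "hinge":     # SLiC-style hinge on the scaled margin
        return torch.relu(1.0 - beta * logits)
    if loss_type == "ipo":       # IPO regresses the margin toward 1/(2*beta)
        return (logits - 1.0 / (2.0 * beta)) ** 2
    if loss_type == "simpo":     # reference-free; assumes length-normalized logps
        return -F.logsigmoid(beta * pi_ratios - gamma)
    raise ValueError(f"unknown loss_type: {loss_type}")
```

A custom loss for this task would typically follow the same pattern: compute a margin from the four log-prob tensors, then map it through a monotone penalty.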

Evaluation

  • AlpacaEval: instruction-following quality judged by DeepSeek. Metric: length_controlled_winrate.
  • MathRuler: mathematical reasoning on GSM8K. Metric: mathruler_accuracy.

Code

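The task says the loss is selected via a `pref_loss` field in `finetuning_args.py`. As a rough sketch of what registering a new option might look like, here is a trimmed-down, hypothetical dataclass in the HfArgumentParser style that LLaMA-Factory uses; the exact field set, choices, and defaults in the real `finetuning_args.py` may differ.

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class FinetuningArguments:
    # Hypothetical trimmed-down stand-in for LLaMA-Factory's finetuning args;
    # only the fields relevant to preference optimization are shown.
    pref_loss: Literal["sigmoid", "hinge", "ipo", "simpo", "orpo", "custom"] = field(
        default="sigmoid",
        metadata={"help": "Which preference loss compute_preference_loss applies."},
    )
    pref_beta: float = field(
        default=0.1,
        metadata={"help": "Temperature beta scaling the implicit reward margin."},
    )
```

Adding a `"custom"` choice here and branching on it inside `compute_preference_loss` in `trainer.py` is one plausible way to wire in a new loss.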
Results

No results available yet.