llm-offline-rl
Tags: Language Models, LLaMA-Factory, MathRuler, alpaca_eval
Description
LLM Offline RL: Preference Optimization for Language Models
Objective
Improve preference optimization training of a Llama-3-8B model by designing a custom preference loss function. The model is fine-tuned on UltraFeedback using LLaMA-Factory. Implement your loss in the compute_preference_loss method in trainer.py and register it via the pref_loss field in finetuning_args.py.
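As an illustration of the second hook point, a minimal sketch of what the pref_loss field might look like as a dataclass field in finetuning_args.py. The exact field names, defaults, and choices here are assumptions for illustration, not the actual LLaMA-Factory source:

```python
from dataclasses import dataclass, field


@dataclass
class FinetuningArguments:
    # Illustrative sketch: the real finetuning_args.py defines many more fields.
    # "sigmoid" denotes the standard DPO log-sigmoid loss; extend the help
    # string with the name of your custom loss so it is selectable via CLI.
    pref_loss: str = field(
        default="sigmoid",
        metadata={
            "help": "Preference loss to use: sigmoid (DPO), hinge, ipo, "
                    "simpo, orpo, or the name of your custom loss."
        },
    )
```

The string value is then dispatched on inside compute_preference_loss to select the corresponding loss computation.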
Background
DPO and its variants (Hinge, IPO, SimPO, ORPO) directly optimize a preference loss on chosen/rejected response pairs without a separate reward model. Each variant offers different trade-offs in terms of stability and separation quality.
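To make the trade-offs concrete, here is a minimal sketch of two of these losses on a single chosen/rejected pair, assuming scalar per-sequence log-probabilities. This is plain Python for clarity, not the batched tensor implementation LLaMA-Factory uses:

```python
import math


def _log_sigmoid(x: float) -> float:
    """Numerically stable -> log(1 / (1 + exp(-x)))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO: -log sigmoid(beta * (policy margin - reference margin))."""
    logits = beta * ((policy_chosen_logp - policy_rejected_logp)
                     - (ref_chosen_logp - ref_rejected_logp))
    return -_log_sigmoid(logits)


def simpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
               chosen_len: int, rejected_len: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO: length-normalized margin with target margin gamma, no reference model."""
    logits = beta * (policy_chosen_logp / chosen_len
                     - policy_rejected_logp / rejected_len) - gamma
    return -_log_sigmoid(logits)
```

Note the structural difference: DPO needs four log-probabilities (policy and frozen reference) while SimPO drops the reference model and instead normalizes by response length, which is one reason the variants differ in stability and memory cost.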
Evaluation
- AlpacaEval: instruction-following quality judged by DeepSeek. Metric: length_controlled_winrate.
- MathRuler: mathematical reasoning on GSM8K. Metric: mathruler_accuracy.
Code
Results
No results available yet.