
Odds Ratio Preference Optimization (ORPO)

ORPO Mode

Odds Ratio Preference Optimization (ORPO) is a reference-model-free alignment method that unifies supervised fine-tuning and preference learning in a single step. It uses an odds-ratio loss to contrast the favored and disfavored outputs, without needing a separate reward or reference model. In effect, ORPO adds a small penalty on disfavored outputs during fine-tuning, which simplifies the alignment pipeline.

  • Use case: Preference data with one chosen and one rejected example per prompt, where you want the model to both follow instructions and align to preferences at the same time. ORPO lets you do instruction fine-tuning (SFT) and preference tuning in a single pass.
  • How it works: ORPO applies an odds-ratio penalty: for each prompt, it increases the model’s probability of the chosen response and decreases that of the rejected one (see the sketch after this list). It’s “monolithic” because it merges the SFT and preference losses, requiring no separate RL step.
  • Key CLI args: Use --train-mode orpo. Set --beta (a temperature for the logistic component; default 0.1), which controls the strength of the preference penalty.
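
For intuition, here is a minimal, framework-free Python sketch of the ORPO objective for a single preference pair. It is illustrative only: the function and argument names are made up for this example, and the actual mlx_lm_lora trainer computes the loss over token-level log-probabilities in MLX, so details may differ.

import math

def orpo_loss(logp_chosen, logp_rejected, beta=0.1):
    # Illustrative ORPO loss for one preference pair (not the library's implementation).
    # logp_chosen / logp_rejected: average per-token log-probabilities of the
    # chosen and rejected responses under the model being trained.

    # odds(y | x) = p / (1 - p), computed in log space for numerical stability
    log_odds_chosen = logp_chosen - math.log1p(-math.exp(logp_chosen))
    log_odds_rejected = logp_rejected - math.log1p(-math.exp(logp_rejected))
    log_odds_ratio = log_odds_chosen - log_odds_rejected

    # odds-ratio penalty: -log sigmoid(log-odds ratio)
    l_or = -math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))

    # standard SFT (negative log-likelihood) term on the chosen response
    l_sft = -logp_chosen

    # beta scales how strongly disfavored outputs are penalized
    return l_sft + beta * l_or

# Example: the chosen response is already more likely than the rejected one,
# so the odds-ratio penalty is small and the loss is dominated by the SFT term.
print(orpo_loss(logp_chosen=-0.4, logp_rejected=-1.2))

The penalty term pushes the log-odds of the chosen response above that of the rejected one, while beta trades this preference pressure off against ordinary supervised fine-tuning on the chosen response.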

Example:

mlx_lm_lora.train \
  --model facebook/opt-125m \
  --train \
  --train-mode orpo \
  --data data/chat_preferences.jsonl \
  --beta 0.1

This runs ORPO on chat preference data. The dataset format is JSONL with "chosen" and "rejected" fields, as in DPO. ORPO does not use a separate reward model; it contrasts the chosen and rejected outputs directly.
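
For reference, one line of such a dataset file might look like the example below. The "chosen" and "rejected" fields are as described above; the separate "prompt" field is an assumption for illustration, and your loader may instead expect the prompt to be embedded in the responses.

{"prompt": "What is the capital of France?", "chosen": "The capital of France is Paris.", "rejected": "France does not have a capital."}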

Cited: Hong et al. (2024) introduce ORPO as a “monolithic”, reference-model-free preference optimization method. Their abstract notes that ORPO “eliminates the necessity for an additional preference alignment phase” and is effective across model sizes.
