Contrastive Preference Optimization (CPO)
Contrastive Preference Optimization (CPO) extends DPO with a “contrastive” loss that pushes the model beyond merely adequate answers. It was originally proposed for machine translation to avoid “close but not perfect” outputs, but it can be applied to any preference data. CPO effectively adds a secondary penalty so the model not only prefers the chosen output but is also explicitly discouraged from producing mediocre outputs.
- Use case: Preference data with graded or “near-miss” comparisons (e.g. human judgments where one answer is slightly better than another). It aims to push performance further than DPO by penalizing near-misses.
- How it works: CPO uses the same input format as DPO (chosen vs. rejected pairs). It modifies the loss so that, roughly speaking, a rejected output that is merely “adequate” receives a larger penalty. The CPO loss is derived from DPO’s objective but drops the reference model, making it lighter on memory; see the sketch after this list.
- Key CLI args: Use --train-mode cpo. The same --beta, --dpo-cpo-loss-type, and --delta flags apply as in DPO mode (since CPO is derived from DPO). A full command example follows the loss sketch below.
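To make the loss modification above concrete, here is a minimal sketch of a CPO-style “sigmoid” preference loss in MLX. It is not the library’s exact implementation: the function name, the way a delta margin enters the loss, and the unweighted NLL regularizer are assumptions based on Xu et al. (2024). Note that only the policy model’s log-probabilities are needed; unlike DPO, no reference model is involved.

# Minimal sketch (not mlx-lm-lora's exact code) of a CPO-style preference loss.
# Inputs are summed log-probabilities of the chosen and rejected completions
# under the policy model only -- no reference model log-probs are required.
import mlx.core as mx

def cpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     beta=0.1, delta=0.0):
    # Scaled log-ratio between chosen and rejected completions.
    logits = beta * (policy_chosen_logps - policy_rejected_logps)
    # -log(sigmoid(x - delta)) == logaddexp(0, delta - x); delta acts as an
    # extra margin the chosen output must clear (assumed reading of --delta).
    preference_loss = mx.logaddexp(mx.zeros_like(logits), delta - logits)
    # CPO also regularizes with the NLL of the chosen output so the model
    # keeps assigning high probability to good answers (Xu et al., 2024).
    nll_loss = -policy_chosen_logps
    return mx.mean(preference_loss + nll_loss)

# Toy usage with made-up sequence log-probabilities:
chosen = mx.array([-12.3, -8.7])
rejected = mx.array([-14.1, -9.0])
print(cpo_sigmoid_loss(chosen, rejected, beta=0.1))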
mlx_lm_lora.train \
--model mlx-community/Josiefied-Qwen3-8B-abliterated-v1-4bit \
--train \
--train-mode cpo \
--data mlx-community/Human-Like-DPO \
--beta 0.1

This runs CPO on the Human-Like-DPO preference data. Again, --beta and --dpo-cpo-loss-type shape the loss. (Internally, CPO reuses the DPO framework with a contrastive term.)
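If your preference data lives locally rather than on the Hugging Face Hub, --data can typically point to a directory containing train.jsonl and valid.jsonl files, as in mlx-lm; treat that layout, and the key names ("prompt", "chosen", "rejected"), as assumptions following the common DPO convention, and check them against the repository’s dataset documentation. The snippet below writes such files:

# Hypothetical local dataset layout for --data; key names are an assumption
# based on the usual DPO pair format -- verify before training.
import json
import pathlib

pairs = [
    {
        "prompt": "Translate to German: The weather is nice today.",
        "chosen": "Das Wetter ist heute schön.",
        "rejected": "Wetter heute gut.",  # adequate but not perfect
    },
]

out_dir = pathlib.Path("data")
out_dir.mkdir(exist_ok=True)
for split in ("train.jsonl", "valid.jsonl"):
    with open(out_dir / split, "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")
# Then train with: mlx_lm_lora.train --train --train-mode cpo --data ./data ...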
Cited: Xu et al. (2024) describe CPO as training the model “to avoid generating adequate, but not perfect translations” and note that it is “derived from the DPO objective”.
⚙️ MLX-LM-LORA is proudly built on top of MLX-LM and optimized exclusively for Apple Silicon.
Train with state-of-the-art fine-tuning algorithms like LoRA, DoRA, DPO, ORPO, GRPO, and CPO.
Made with ❤️ by Gökdeniz Gülmez · Powered by MLX