CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning
[Paper] [Project] [Model] [Dataset]
Authors: Penghui Yang*, Long Xing*, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Yibin Wang, Yujie Zhou, Jiazi Bu, Jianze Liang, Qidong Huang, Jiaqi Wang, Feng Wu, Dahua Lin.
CapRL++ is a unified reinforcement learning framework for dense image and video captioning with verifiable rewards. While the original CapRL release focuses on dense image captioning, CapRL++ keeps the same central idea: train a caption model with reward signals that measure whether the generated caption preserves enough visual information for downstream question answering.
This folder contains the training and evaluation code used for the CapRL++ extension.
CapRL++ adds three practical components on top of CapRL:
- unified image and video caption RL training based on the bundled verl framework;
- a remote reward service for QA-based or VLM-judge-based verifiable reward scoring;
- a video Prism evaluation pipeline that measures caption usefulness through downstream benchmark QA.
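To make the reward idea concrete, here is a minimal sketch of a QA-based caption reward. It is not the scoring code shipped in this folder. It assumes each training sample carries multiple-choice questions with known answers, and that a text-only model answers them from the caption alone. The names `QAItem`, `caption_reward`, `answer_fn`, and `keyword_answerer` are hypothetical; in CapRL++ the answering model would sit behind the remote reward service.

```python
import string
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class QAItem:
    """One multiple-choice question about an image or video (hypothetical schema)."""
    question: str
    options: Sequence[str]
    answer: str  # ground-truth option label, e.g. "B"


def caption_reward(
    caption: str,
    qa_items: Sequence[QAItem],
    answer_fn: Callable[[str, QAItem], str],
) -> float:
    """Return the fraction of questions answered correctly from the caption alone.

    answer_fn stands in for a text-only LLM that sees the caption and the
    question but never the visual input.
    """
    if not qa_items:
        return 0.0
    correct = sum(
        answer_fn(caption, item).strip().upper() == item.answer.strip().upper()
        for item in qa_items
    )
    return correct / len(qa_items)


def keyword_answerer(caption: str, item: QAItem) -> str:
    """Toy answerer for demonstration: pick the first option mentioned in the caption."""
    for label, option in zip(string.ascii_uppercase, item.options):
        if option.lower() in caption.lower():
            return label
    return ""  # no option text appears in the caption, so give no answer
```

A caption that omits the details a question asks about scores lower, which is the signal the RL objective can then optimize.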
The training code is self-contained under train/. The evaluation code is
self-contained under eval/.
CapRL++/
├── train/
│   ├── scripts/
│   │   ├── train_caprl.sh
│   │   ├── start_reward_serve_rm.sh
│   │   ├── requirements.txt
│   │   └── README.md
│   └── verl/
│       └── recipe/video_captionrl/
└── eval/
    ├── scripts/
    ├── tools/
    ├── requirements.txt
    └── README.md
The training implementation uses the bundled train/verl source tree. Start the
reward service first, then launch RL training.
cd CapRL++/train
conda create -n caprl python=3.10 -y
conda activate caprl
pip install -r scripts/requirements.txt
pip install -e ./verl

Start the reward service:
REWARD_MODEL=/path/to/Qwen3-4B-Instruct \
CUDA_VISIBLE_DEVICES=0 \
REWARD_PORT=18889 \
bash scripts/start_reward_serve_rm.sh

Launch training:
CAPTION_MODEL=/path/to/Qwen3-VL-4B-Instruct \
DATASET=/path/to/video_train.jsonl \
SAVE_DIR=/path/to/output/checkpoints \
REWARD_NODE_IP=127.0.0.1 \
REWARD_PORT=18889 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
bash scripts/train_caprl.sh

See train/scripts/README.md for the full list of environment variables and
runtime options.
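Because training depends on the reward service being up, a small pre-flight check can save a failed launch. The sketch below only tests that something accepts TCP connections at the reward host and port. It makes no assumption about the service's request format, and the helper name `wait_for_port` is hypothetical rather than part of this repository.

```python
import socket
import time


def wait_for_port(
    host: str,
    port: int,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only shows that a listener exists,
            # not that the reward model has finished loading.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

With the values from the commands above, calling `wait_for_port("127.0.0.1", 18889)` before `bash scripts/train_caprl.sh` would return `False` if nothing is listening within a minute.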
The evaluation code implements a two-stage Prism-style pipeline (captioning, then caption-only question answering), followed by scoring:
- generate captions for videos with a caption model;
- answer benchmark questions using only the generated captions;
- score the downstream answers against benchmark annotations.
cd CapRL++/eval
pip install -r requirements.txt
source examples/env.example
bash scripts/run_vllm_prism.sh

Supported benchmarks include Video-MME, MVBench, MMVU, MMBench-Video, TOMATO,
and TimeLens-Bench. See eval/README.md for benchmark-specific paths and
evaluation options.
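The caption, answer, and score flow of the Prism-style evaluation can be sketched in a few lines. This is an illustration under stated assumptions, not the code in eval/: the sample schema (`video`, `question`, `answer` keys), the function name `prism_eval`, and the two callables are all hypothetical, and exact-match scoring stands in for whatever benchmark-specific scoring each benchmark requires.

```python
from typing import Callable, Iterable, Mapping


def prism_eval(
    samples: Iterable[Mapping[str, str]],
    caption_fn: Callable[[str], str],
    answer_fn: Callable[[str, str], str],
) -> float:
    """Run caption -> caption-only QA -> scoring, and return accuracy.

    Each sample is assumed to have "video", "question", and "answer" keys
    (a hypothetical schema, not the benchmark file format).
    """
    caption_cache: dict[str, str] = {}
    correct = 0
    total = 0
    for sample in samples:
        video = sample["video"]
        # Stage 1: caption each video once, even if it has several questions.
        if video not in caption_cache:
            caption_cache[video] = caption_fn(video)
        # Stage 2: answer from the caption only; the video is not passed along.
        prediction = answer_fn(caption_cache[video], sample["question"])
        # Scoring: exact match against the annotation after normalization.
        correct += prediction.strip().lower() == sample["answer"].strip().lower()
        total += 1
    return correct / total if total else 0.0
```

Since the answering model never sees the video, any accuracy difference between two caption models reflects how much question-relevant information their captions carry.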
CapRL++ follows the CapRL philosophy of optimizing caption models through question-answering feedback, but extends the workflow to video captioning and uses verl as the RL training backend. For image caption models, datasets, and the original CapRL training and evaluation pipeline, refer to the main repository README.
@article{yang2026caprlplusplus,
title={CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning},
author={Yang, Penghui and Xing, Long and Dong, Xiaoyi and Zang, Yuhang and Cao, Yuhang and Wang, Yibin and Zhou, Yujie and Bu, Jiazi and Liang, Jianze and Huang, Qidong and Wang, Jiaqi and Wu, Feng and Lin, Dahua},
journal={arXiv preprint arXiv:2606.09393},
year={2026}
}