MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Multi-modal Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis

📄 Paper • 🌐 Website • 💻 Code • 🤗 Dataset

Xiangyu Peng*, Can Qin*, An Yan†, Xinyi Yang†, Zeyuan Chen, Ran Xu, Chien-Sheng Wu
Salesforce AI Research

Abstract

Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents.

We propose MTA-Agent (Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples.

A 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings.

Main Results

Accuracy (%) on six deep-search benchmarks. MTA-DeepSearch-32B achieves state-of-the-art, outperforming GPT-5 and Gemini models under the same tool settings.

Model	MMSrch+	HR-MMSrch	BC-VL	MMSrch	FVQA	MTA-test	Avg
Agent Workflow (same tool setting)
GPT-5	31.61	52.13	51.63	77.65	72.28	25.84	51.86
Gemini-2.5-Pro	30.65	48.20	49.50	77.65	72.33	27.53	50.98
Gemini-3-Pro	33.51	53.20	51.78	82.94	76.67	28.65	54.46
Qwen3-VL-32B-Inst.	14.84	38.69	38.69	68.52	66.94	17.42	40.85
Our Models
MTA-DeepSearch-8B	26.77	47.54	44.36	79.41	73.06	20.79	48.66
MTA-DeepSearch-32B	31.93	53.95	53.77	82.35	76.00	29.78	54.63

Method Overview

Tool-Augmented QA Agent

A ReAct-style agent with 4 tools (web search, web reader, Google Lens, image search) iteratively gathers evidence for multi-hop QA generation.

Multi-Stage Verification

Generated QA pairs are verified for factual correctness, answer uniqueness, temporal stability, and entity dependency through independent checks.

RL Training with DAPO

Models are fine-tuned using DAPO with cached tool interactions, enabling efficient training without real-time API calls.

Key Findings

Deeper Search Behavior: Training increases average search depth from 2.27 to 4.28 steps, enabling more systematic and persistent multi-step retrieval strategies.
Structured Tool Usage: After training, Web Search usage rises to 99% and Reverse Image Search to 79%, forming a consistent two-stage retrieval pipeline.
Cost-Efficient Replay Training: Cached interaction replay enables effective RL training without real-time tool calls, significantly reducing training cost.

Dataset

The MTA-Vision-DeepSearch training dataset is publicly available on HuggingFace:

📦 Salesforce/MTA-Vision-DeepSearch

It contains 5 subsets (~3.5GB total):

Subset	Description
`fvqa_qw3vl`	FVQA-based multi-hop QA
`infoseek_qw3vl`	InfoSeek-based multi-hop QA
`infovqa_qw3vl`	InfoVQA-based multi-hop QA
`okvqa_qw3vl`	OK-VQA-based multi-hop QA
`livevqa_news_qw3vl`	LiveVQA News-based multi-hop QA

Download

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Salesforce/MTA-Vision-DeepSearch",
    repo_type="dataset",
    local_dir="data/mm_deepsearch",
)

Reproduce from Scratch

To regenerate the dataset from original VQA sources:

python examples/data_preprocess/mm_search/infovqa_qw3vl.py
python examples/data_preprocess/mm_search/infoseek_qw3vl.py
python examples/data_preprocess/mm_search/fvqa_qw3vl.py
python examples/data_preprocess/mm_search/okvqa_qw3vl.py
python examples/data_preprocess/mm_search/livevqa_news_qw3vl.py

Setup

Prerequisites

The following versions are required:

vllm: 0.11.0
verl: 0.7.0.dev0
torch: 2.8.0
transformers: 4.57.3
qwen-omni-utils: 0.0.8
qwen-vl-utils: 0.0.14
ray: 2.52.1
openai: 2.9.0
flash-attn: 2.8.3

Conda Installation (Recommended)

git submodule update --init --recursive
conda create --name verl-tool-env python=3.10
conda activate verl-tool-env
pip install -e verl
pip install -e ".[vllm,acecoder,torl,search_tool]"
pip install "flash-attn==2.8.3" --no-build-isolation

Alternative: UV Installation

git submodule update --init --recursive
uv sync
source .venv/bin/activate
uv pip install -e verl
uv pip install -e ".[vllm,acecoder,torl,search_tool]"
uv pip install "flash-attn==2.8.3" --no-build-isolation

Set API Keys

Create a .env file in the repo root (never commit this file):

TAVILY_API_KEY=your-tavily-key          # Web search
SERPAPI_API_KEY=your-serpapi-key        # Image search
SERPER_API_KEY=your-serper-key          # Google search
X_API_KEY=your-openai-gateway-key       # GPT-4 summarization (or use OPENAI_API_KEY)

Training

Setup WandB

export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
export WANDB_ENTITY=<YOUR_WANDB_ENTITY>

Launch Training

8B Model (Qwen3-VL-8B, DAPO — recommended):

conda activate verl-tool-env
bash examples/train/mm_deep_research/train/train_multimodal_8b_qwen3vl_dapo.sh

Tool Server

Training automatically launches a multimodal tool server:

python -m verl_tool.servers.serve \
    --tool_type "web_text_to_text_search,web_text_to_img_search,web_url_reader,web_image_to_text,ocr_tool,ipython_code,bash_terminal" \
    --workers_per_tool 2 \
    --use_ray True

Troubleshooting

OOM: Reduce gpu_memory_utilization or enable do_offload=True
Tool server issues: Check API keys and network connectivity
Training instability: Reduce lr to 5e-7, increase lr_warmup_steps
Debug mode: export VERL_DEBUG=1 NCCL_DEBUG=INFO VLLM_USE_V1=1

Features

🔧 Complete decoupling of actor rollout and environment interaction — Uses verl as a submodule. All tool calling is integrated via a unified API; add new tools by simply adding a Python file.
🌍 Tool-as-environment paradigm — Each tool interaction can modify the environment state. Environment states are stored and reloaded for each trajectory.
⚡ Native RL framework for tool-calling agents — Natively supports multi-turn interactive loops between agents and tool environments.
🖼️ Multimodal support — Native support for multimodal agent loops with image understanding, image search, and multimodal reasoning.
📊 User-friendly evaluation suite — Launch trained model with OpenAI API alongside the tool server; send questions and get final outputs with all interactions handled internally.

📚 Documentation

Citation

@article{peng2026mtaagent,
  title     = {MTA-Agent: An Open Recipe for Multimodal Deep Search Agents},
  author    = {Peng, Xiangyu and Qin, Can and Yan, An and Yang, Xinyi and Chen, Zeyuan and Xu, Ran and Wu, Chien-Sheng},
  journal   = {arXiv preprint arXiv:2604.06376},
  year      = {2026}
}

Contact

Xiangyu Peng — xiangyupeng1994@gmail.com

This release should not be used to compete with OpenAI.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
assets		assets
benchmarks		benchmarks
eval_service		eval_service
examples		examples
patches		patches
scripts		scripts
temp_images		temp_images
verl		verl
verl_tool		verl_tool
.gitignore		.gitignore
.gitmodules		.gitmodules
.python-version		.python-version
CODEOWNERS		CODEOWNERS
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
LICENSE.txt		LICENSE.txt
README.md		README.md
SECURITY.md		SECURITY.md
bnb_chart.png		bnb_chart.png
deploy_servers.sh		deploy_servers.sh
extract_image_search.py		extract_image_search.py
extract_image_search_inference.py		extract_image_search_inference.py
extract_ocr.py		extract_ocr.py
extract_ocr_inference.py		extract_ocr_inference.py
extract_reverse_image_search.py		extract_reverse_image_search.py
extract_reverse_image_search_inference.py		extract_reverse_image_search_inference.py
extract_web_read.py		extract_web_read.py
extract_web_read_inference.py		extract_web_read_inference.py
extract_web_search.py		extract_web_search.py
extract_web_search_inference.py		extract_web_search_inference.py
main.py		main.py
process_rollout.py		process_rollout.py
pyproject.toml		pyproject.toml
remove_error.py		remove_error.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Multi-modal Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis

Abstract

Main Results

Method Overview

Tool-Augmented QA Agent

Multi-Stage Verification

RL Training with DAPO

Key Findings

Dataset

Download

Reproduce from Scratch

Setup

Prerequisites

Conda Installation (Recommended)

Alternative: UV Installation

Set API Keys

Training

Setup WandB

Launch Training

Tool Server

Troubleshooting

Features

📚 Documentation

Citation

Contact

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Multi-modal Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis

Abstract

Main Results

Method Overview

Tool-Augmented QA Agent

Multi-Stage Verification

RL Training with DAPO

Key Findings

Dataset

Download

Reproduce from Scratch

Setup

Prerequisites

Conda Installation (Recommended)

Alternative: UV Installation

Set API Keys

Training

Setup WandB

Launch Training

Tool Server

Troubleshooting

Features

📚 Documentation

Citation

Contact

About

Resources

License

Licenses found

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages