[Feature] Add Rebellions (RBLN) NPU evaluation backend#1566

Open: rebel-hwkim wants to merge 5 commits into open-compass:main from rebellions-sw:rbln-backend
rebel-hwkim commented Jun 2, 2026

Summary

This PR adds an opt-in evaluation backend for Rebellions NPU hardware via optimum-rbln. It runs the standard VLMEvalKit benchmark suite against a model compiled for RBLN NPUs, in-process, with the same datasets / prompts / scorers as the CUDA path.

The default device is unchanged (nvidia); RBLN is reached only with --device rbln, so existing GPU users are unaffected. There is no static model registry: any HF model id or compiled directory is auto-dispatched to the right wrapper, so the backend works for arbitrary image-text-to-text models without per-model config entries.

What this adds

  • vlmeval/vlm/rbln/ — 11 model-family wrappers on a shared base (RBLNVLMBase / RBLNChatVLMBase):
    Qwen2-VL/Qwen2.5-VL, Qwen3-VL, Cosmos-Reason1, LLaVA-1.5, LLaVA-Next, Idefics3, Gemma3, Pixtral, PaliGemma, PaliGemma2, BLIP-2. Each wrapper reuses the same prompt construction as its upstream CUDA counterpart (shared *PromptMixin where one exists, otherwise a verbatim port) so RBLN and GPU feed the model byte-identical prompts.
  • run.py — three additive CLI flags:
    • --device {nvidia,rbln} (default nvidia): opt-in backend selector.
    • --limit N: truncate each dataset to the first N rows (smoke testing).
    • --rbln-kwargs '<json>': pass compile/runtime rbln_config overrides.
  • Auto-dispatch (vlmeval/vlm/rbln/auto.py): with --device rbln, the wrapper class is selected from the model's config.json architectures, and per-family compile defaults (visual.max_seq_lens, tensor_parallel_size, …, mirroring rbln-model-zoo's compile.py) are seeded automatically.
  • requirements/rbln.txt, docs/en/Quickstart.md — optional deps + usage.
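
As a rough illustration of the three flags listed above, the sketch below shows how they could be declared with argparse. It is not the code in run.py: `build_parser` and `truncate` are hypothetical names, and parsing `--rbln-kwargs` with `json.loads` is an assumption about how the JSON string is consumed.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser holding only the three additive flags."""
    parser = argparse.ArgumentParser(description="RBLN flag sketch")
    # Opt-in backend selector; the default keeps the existing GPU path.
    parser.add_argument("--device", choices=["nvidia", "rbln"], default="nvidia")
    # Smoke-testing aid: keep only the first N rows of each dataset.
    parser.add_argument("--limit", type=int, default=None)
    # JSON string of rbln_config overrides; argparse reports invalid JSON
    # as a usage error because json.JSONDecodeError is a ValueError.
    parser.add_argument("--rbln-kwargs", type=json.loads, default=None)
    return parser


def truncate(rows, limit):
    """Return the first `limit` rows, or all rows when limit is None."""
    return rows if limit is None else rows[:limit]


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--device", "rbln", "--limit", "2",
         "--rbln-kwargs", '{"tensor_parallel_size": 4}']
    )
    print(args.device, args.rbln_kwargs, truncate([10, 20, 30], args.limit))
```

Because every flag has a default, a command line that omits all three behaves as before, which is what makes the change additive.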

vlmeval/config.py is essentially untouched (whitespace-only); RBLN classes are re-exported from vlmeval/vlm/__init__.py.

Design notes

  • Zero impact on the default path. Every optimum-rbln import is lazy (inside wrapper methods), and the auto-dispatch loop is gated behind args.device == 'rbln'. Importing vlmeval or running --device nvidia never imports the RBLN runtime; a test asserts this by checking that optimum.rbln stays out of sys.modules.
  • Compile-or-load. The wrapper loads a cached *.rbln artifact if present, else compiles and caches it under ./<basename(model_path)>/ (rbln-model-zoo convention). Compile defaults come from the auto-dispatch table, so a first compile works without any manual rbln_config.
  • vllm-rbln serving. vllm-rbln exposes the same OpenAI-compatible server as vllm serve, so a served model is evaluated with VLMEvalKit's existing --base-url.
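
To make the auto-dispatch and compile-or-load notes concrete, here is a minimal sketch under stated assumptions. The table contents, wrapper names, and default values are placeholders rather than the entries in vlmeval/vlm/rbln/auto.py, and the shallow dict merge is a simplification (nested keys such as visual.max_seq_lens would need a deeper merge).

```python
import json
from pathlib import Path

# Hypothetical dispatch table: HF architecture name -> (wrapper name, compile defaults).
# The wrapper names and default values are placeholders, not the PR's real entries.
DISPATCH = {
    "Qwen2_5_VLForConditionalGeneration": ("HypotheticalQwen2VLWrapper", {"tensor_parallel_size": 8}),
    "LlavaForConditionalGeneration": ("HypotheticalLlavaWrapper", {"tensor_parallel_size": 4}),
}


def select_wrapper(model_dir, overrides=None):
    """Pick a wrapper from config.json `architectures` and merge user overrides."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    for arch in config.get("architectures", []):
        if arch in DISPATCH:
            wrapper, defaults = DISPATCH[arch]
            # User-supplied overrides win over the seeded per-family defaults.
            return wrapper, {**defaults, **(overrides or {})}
    raise ValueError(f"no RBLN wrapper for architectures {config.get('architectures')}")


def cache_dir_for(model_path):
    """./<basename(model_path)>/, the cache location described above."""
    return Path(".") / Path(model_path).name


def has_cached_artifact(compiled_dir):
    """True when the directory already holds at least one *.rbln file."""
    directory = Path(compiled_dir)
    return directory.is_dir() and any(directory.glob("*.rbln"))
```

A caller would check `has_cached_artifact(cache_dir_for(model_path))` first and compile only when it returns False, passing the merged config to the compile step.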

How to use

# install (RBLN + PyTorch CPU indexes, not PyPI)
pip install \
  --extra-index-url https://pypi.rbln.ai/simple \
  --extra-index-url https://download.pytorch.org/whl/cpu \
  -r requirements/rbln.txt

# in-process eval on NPU — pass an HF id or a compiled directory
python run.py --device rbln --model Qwen/Qwen2.5-VL-7B-Instruct --data MMBench_DEV_EN
python run.py --device rbln --model ./Qwen2.5-VL-7B-Instruct       --data ChartQA_TEST

Testing & verification

Offline unit suite (tests/rbln/, no NPU / no API key — CPU-only, CI-able): 32 test functions across 6 files (≈47 parametrized cases) — prompt parity per family, auto-dispatch & ordering, rbln_config merge + cached-load guard, lazy-import invariant, mock-model harness, and a --mode eval rule-based scorer lane. All green under pytest tests/rbln/; clean under the repo's pre-commit (flake8 / isort / yapf).
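
The lazy-import invariant mentioned above can be checked along the following lines. This is a sketch of the general technique, not the test in tests/rbln/: `leaks_module` is a hypothetical helper, and running the import in a fresh interpreter is an assumption made so that modules already loaded by the test session cannot mask a leak.

```python
import subprocess
import sys


def leaks_module(target: str, forbidden: str) -> bool:
    """Import `target` in a fresh interpreter and report whether
    `forbidden` ended up in sys.modules as a side effect."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({target!r})\n"
        f"print({forbidden!r} in sys.modules)\n"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    # The last printed line is the membership check; earlier lines may be
    # import-time output from the target package.
    return result.stdout.strip().splitlines()[-1] == "True"
```

In the RBLN suite the equivalent assertion would be `assert not leaks_module("vlmeval", "optimum.rbln")`.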

Per-family NPU inference smoke (scripts/rbln_smoke.sh): all 11 families compiled and run on real RBLN hardware (tp=1/4/8), each producing non-empty predictions.

Judge-free score parity (tests/rbln/E1_RESULTS.md): Qwen2.5-VL-7B on ChartQA_TEST (2,500 samples, relaxed_accuracy, no LLM judge):

| Setup | Overall | test_human | test_augmented | infer_fail |
|---|---|---|---|---|
| ATOM-Max FP16 (RBLN NPU) | 86.56 | 78.32 | 94.80 | 0% (0/2500) |
| Reference (Qwen2.5-VL report / A100 FP16) | 87.3–87.84 | 80.72 | 94.96 | |
| Δ (%p) | −0.7 to −1.3 | −2.4 | −0.16 | |

Every Δ falls inside a ±3–5 %p NPU/GPU tolerance band, so on this benchmark the RBLN score is consistent with the GPU/official ChartQA result.
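
The deltas can be recomputed directly from the reported scores. The snippet below is only a worked check: it uses the upper reference value (87.84) for the overall score and the tighter 3 %p end of the band as the threshold, both of which are choices made here for illustration.

```python
# Scores copied from the results above (percent).
rbln = {"overall": 86.56, "test_human": 78.32, "test_augmented": 94.80}
reference = {"overall": 87.84, "test_human": 80.72, "test_augmented": 94.96}


def deltas(measured, ref):
    """Per-split difference in percentage points, rounded to 2 decimals."""
    return {split: round(measured[split] - ref[split], 2) for split in measured}


def within_band(diffs, tolerance_pp):
    """True when every split differs by at most `tolerance_pp` points."""
    return all(abs(value) <= tolerance_pp for value in diffs.values())


if __name__ == "__main__":
    diffs = deltas(rbln, reference)
    print(diffs)
    print(within_band(diffs, tolerance_pp=3.0))
```

The largest gap is on test_human (about 2.4 %p), so the check passes even at the tighter end of the band.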

Verification gaps (honest scope)

  • Scored accuracy parity is verified for one model × one judge-free benchmark (above). Judge-based benchmarks (MMVet/MMMU/MMBench/…) and scored parity for the other 10 families are not yet measured — those need an LLM judge (OPENAI_API_KEY / --judge-base-url). Per-family inference (not scored) is verified for all 11 families.
  • No CI workflow is added for the RBLN tests; the offline suite is CPU-runnable and can be wired into CI on request (NPU smoke needs hardware).

Notes

  • Branch is based on the latest main; 5 logical commits, diff is RBLN-only (new vlmeval/vlm/rbln/, tests/rbln/, scripts/rbln_smoke.sh, requirements/rbln.txt, plus additive edits to run.py, vlmeval/vlm/__init__.py, docs/en/Quickstart.md; config.py whitespace-only).
  • Happy to split further (e.g. the generic --device/--limit CLI first, then the backend) if preferred.

rebel-hwkim and others added 5 commits June 2, 2026 04:42
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rebel-hwkim rebel-hwkim changed the title Rbln backend [Feature] Add Rebellions (RBLN) NPU evaluation backend Jun 2, 2026