PlanteAmigor/intel-gpu-stability-guide

Intel GPU AI Training & Inference Issue Logs

This directory documents issues encountered during AI training and inference on a specific Intel platform (Arrow Lake + Arc iGPU + NPU), along with mitigation strategies explored.

⚠️ Disclaimer: All observations here are based on specific hardware/software/model combinations and may not apply to other environments. See detailed disclaimers in each document.


Hardware Environment

| Component | Model |
| --- | --- |
| CPU | Intel Core Ultra 9 285H (Arrow Lake, 6P+8E, 14 cores) |
| iGPU | Intel Arc Graphics (8 Xe-core, 128 GB shared memory) |
| NPU | Intel AI Boost (/dev/dri/renderD128) |
| RAM | 128 GB DDR5 (shared with iGPU) |

Software Environment

| Software | Version |
| --- | --- |
| OS | Ubuntu 26.04 LTS (Linux Kernel 7.0) |
| PyTorch | 2.12.0+xpu |
| OpenVINO | 2026.1 |
| Python | 3.14 |
| oneAPI | 2026.0 (IntelLLVM, MKL, DNNL, TBB) |
| GPU driver | libze-intel-gpu1 26.14.37833.4 |

Documents

Use case: OpenVINO + Qwen3-series model inference on Intel Arc integrated GPU.

Key findings (on this platform):

  • Sustained GPU utilization above 90% led to a kernel panic or segfault
  • NaN output was a precursor to a driver crash (not a quantization precision issue)
  • Mitigation: small batches + frequent cooldown intervals + INT8 quantization

| File | Description |
| --- | --- |
| intel_gpu_aitrain001.md | Full analysis (bilingual) |
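
The mitigation listed above (small batches, cooldown intervals, and treating NaN output as a warning sign) can be sketched as a plain Python loop. This is a minimal sketch and not the code from intel_gpu_aitrain001.md: `run_with_cooldown`, `infer`, the batch size of 2, and the 5-second pause are hypothetical placeholders, and `infer` stands in for whatever OpenVINO call returns numeric outputs for a batch.

```python
import math
import time
from typing import Callable, List, Sequence


def run_with_cooldown(
    prompts: Sequence[str],
    infer: Callable[[Sequence[str]], List[float]],  # hypothetical inference callable
    batch_size: int = 2,        # assumed value, tune per platform
    cooldown_s: float = 5.0,    # assumed value, tune per platform
    sleep: Callable[[float], None] = time.sleep,
) -> List[float]:
    """Run inference in small batches, pausing between batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    results: List[float] = []
    for start in range(0, len(prompts), batch_size):
        if start > 0:
            # Idle between batches so the GPU is not kept under sustained load.
            sleep(cooldown_s)

        outputs = infer(prompts[start:start + batch_size])

        # NaN is treated as a warning sign: stop instead of pushing more work.
        if any(math.isnan(value) for value in outputs):
            raise RuntimeError(f"NaN output in batch starting at index {start}")

        results.extend(outputs)
    return results
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real pauses; in normal use the default `time.sleep` applies.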

Use case: PyTorch XPU backend Transformer training on Arc iGPU.

Key findings (on this platform):

  • nn.TransformerEncoderLayer / F.scaled_dot_product_attention backward pass may crash
  • Error types: RuntimeError (negative dimension / integer overflow) or IndexError (index out of range)
  • Even tiny 22M-parameter models (hidden=512, heads=8, batch=4) may crash
  • AMP BF16 may escalate to full system freeze (driver crash)
  • Attention-free architectures (Gated CNN, Conv2d) run stably

| File | Description |
| --- | --- |
| xpu_backward_issue.md | Full analysis (bilingual) |
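
Given the findings above, one cautious approach is to route attention-based models to the CPU and use the XPU only for attention-free ones. The sketch below illustrates that decision only; `pick_training_device` and `xpu_available` are hypothetical helper names, not part of PyTorch or of xpu_backward_issue.md. The only PyTorch call used is `torch.xpu.is_available()`, and the import is guarded so the helper also works where PyTorch is not installed.

```python
from typing import Optional


def xpu_available() -> bool:
    """Return True if PyTorch is installed and reports a usable XPU device."""
    try:
        import torch
    except ImportError:
        return False
    xpu = getattr(torch, "xpu", None)
    return bool(xpu is not None and xpu.is_available())


def pick_training_device(uses_attention: bool, xpu_ok: Optional[bool] = None) -> str:
    """Pick a device string for training (hypothetical policy, see lead-in).

    Attention-based models go to the CPU because their backward pass may
    crash on the XPU backend on this platform; attention-free models use
    the XPU when it is available.
    """
    if xpu_ok is None:
        xpu_ok = xpu_available()
    if uses_attention or not xpu_ok:
        return "cpu"
    return "xpu"
```

The returned string can be passed to `model.to(...)` and `tensor.to(...)` in a training script.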

Summary

| Scenario | Device | Framework | Status | Recommended Alternative |
| --- | --- | --- | --- | --- |
| Transformer training | iGPU | PyTorch XPU | ❌ Backward crash | Gated CNN / CPU training |
| CNN training | iGPU | PyTorch XPU | ✅ Stable | |
| LLM inference (Qwen3) | iGPU | OpenVINO | ⚠️ Needs cooling | INT8 + small batch + cooldown |
| LLM inference | NPU | OpenVINO | ✅ Tested | Requires static shapes |
| LLM inference | CPU | OpenVINO | ✅ Stable | Slowest performance |
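
The NPU row notes that static shapes are required. One common way to meet that kind of constraint is to pad or truncate every tokenized input to a single fixed length before inference. The sketch below illustrates only that idea; `pad_to_static`, the pad id of 0, and the lengths in the usage example are hypothetical, and the source does not say how its inputs were made static.

```python
from typing import List, Sequence, Tuple


def pad_to_static(
    token_ids: Sequence[int],
    length: int,
    pad_id: int = 0,  # assumed pad token id; depends on the tokenizer
) -> Tuple[List[int], List[int]]:
    """Pad or truncate token_ids to exactly `length` entries.

    Returns the fixed-length ids and an attention mask that marks
    real tokens with 1 and padding with 0.
    """
    if length < 1:
        raise ValueError("length must be at least 1")

    kept = list(token_ids[:length])   # truncate inputs that are too long
    padding = length - len(kept)      # number of pad slots still needed
    ids = kept + [pad_id] * padding
    mask = [1] * len(kept) + [0] * padding
    return ids, mask


ids, mask = pad_to_static([5, 6, 7], length=5)
print(ids)   # [5, 6, 7, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```

Truncation drops tokens from the end, which may not suit every prompt format; that choice is part of the sketch, not a recommendation from the source.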


These documents record observations from a specific test environment for reference only. Different hardware, driver, or framework versions may yield different results.
