Intel GPU AI Training & Inference Issue Logs

This directory documents issues encountered during AI training and inference on a specific Intel platform (Arrow Lake + Arc iGPU + NPU), along with mitigation strategies explored.

⚠️ Disclaimer: All observations here are based on specific hardware/software/model combinations and may not apply to other environments. See detailed disclaimers in each document.

🌐 Language: English · 中文

Hardware Environment

| Component | Model |
| --- | --- |
| CPU | Intel Core Ultra 9 285H (Arrow Lake, 6P+8E, 14 cores) |
| iGPU | Intel Arc Graphics (8 Xe-cores, 128 GB shared memory) |
| NPU | Intel AI Boost (`/dev/dri/renderD128`) |
| RAM | 128 GB DDR5 (shared with iGPU) |

Software Environment

| Software | Version |
| --- | --- |
| OS | Ubuntu 26.04 LTS (Linux Kernel 7.0) |
| PyTorch | 2.12.0+xpu |
| OpenVINO | 2026.1 |
| Python | 3.14 |
| oneAPI | 2026.0 (IntelLLVM, MKL, DNNL, TBB) |
| GPU driver | `libze-intel-gpu1 26.14.37833.4` |
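Because the findings in these documents are tied to exact versions, it can help to record the environment alongside any new observation. The following is a minimal sketch using only the Python standard library; the function name `report_environment` and the default package list are illustrative assumptions, and the driver package version is not covered (it would have to be queried from the system package manager).

```python
import platform
from importlib import metadata


def report_environment(packages=("torch", "openvino")):
    """Return a dict of interpreter, kernel, and package versions."""
    report = {
        "python": platform.python_version(),
        "kernel": platform.release(),
    }
    for name in packages:
        try:
            # Reads the installed distribution's metadata without importing it.
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in report_environment().items():
        print(f"{key}: {value}")
```

Reading versions from package metadata avoids importing the frameworks themselves, so the report can still be produced on a machine where loading the GPU runtime is the thing that fails.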

Documents

1. iGPU Stability Guide

Use case: OpenVINO + Qwen3-series model inference on Intel Arc integrated GPU.

Key findings (on this platform):

- Sustained GPU utilization above 90% leads to a kernel panic or segfault.
- NaN output is a precursor to a driver crash, not a quantization precision issue.
- Mitigation: small batches, frequent cooldown intervals, and INT8 quantization.
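The small-batch and cooldown parts of the mitigation can be sketched as a plain loop. This is an illustration under stated assumptions, not the code used in the full analysis: `run_with_cooldown`, `infer_fn`, and the default batch size and cooldown length are all hypothetical, and `infer_fn` is assumed to return one list of floats per prompt. INT8 quantization happens when the model is exported, so it is not shown here.

```python
import math
import time


def run_with_cooldown(prompts, infer_fn, batch_size=2, cooldown_s=5.0,
                      sleep=time.sleep):
    """Run infer_fn over small batches, pausing between batches.

    infer_fn is a hypothetical callable: it takes a list of prompts and
    returns one list of float scores per prompt.
    """
    results = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        outputs = infer_fn(batch)
        # Treat NaN as a warning sign of an impending driver crash and stop early.
        for scores in outputs:
            if any(math.isnan(x) for x in scores):
                raise RuntimeError(
                    f"NaN output in batch starting at index {start}; stopping"
                )
        results.extend(outputs)
        # Cool down between batches, but not after the last one.
        if start + batch_size < len(prompts):
            sleep(cooldown_s)
    return results
```

Stopping on the first NaN follows the observation above that NaN output precedes a driver crash; suitable values for `batch_size` and `cooldown_s` would need to be found experimentally on the target machine.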

| File | Description |
| --- | --- |
| intel_gpu_aitrain001.md | Full analysis (bilingual) |

2. iGPU Transformer Backward Crash Report

Use case: PyTorch XPU backend Transformer training on Arc iGPU.

Key findings (on this platform):

- The backward pass of `nn.TransformerEncoderLayer` / `F.scaled_dot_product_attention` may crash.
- Error types: `RuntimeError` (negative dimension / integer overflow) or `IndexError` (index out of range).
- Even tiny 22M-parameter models (hidden=512, heads=8, batch=4) may crash.
- AMP BF16 may escalate the failure to a full system freeze (driver crash).
- Attention-free architectures (Gated CNN, Conv2d) run stably.
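When the failure surfaces as one of the Python exceptions listed above, a training step can be retried on the CPU. The sketch below is a hypothetical wrapper rather than anything taken from the crash report: `run_step_with_fallback` and `step_fn` are illustrative names, and `step_fn(device)` is assumed to place the model on the given device and run one forward and backward pass. A full system freeze cannot be caught this way.

```python
def run_step_with_fallback(step_fn, devices=("xpu", "cpu")):
    """Try step_fn(device) on each device in order; return (device, result).

    step_fn is a hypothetical callable that moves the model to the given
    device and runs one forward + backward pass.
    """
    failures = []
    for device in devices:
        try:
            return device, step_fn(device)
        except (RuntimeError, IndexError) as exc:
            # These are the two exception types reported for the backward pass.
            failures.append(f"{device}: {type(exc).__name__}: {exc}")
    raise RuntimeError("all devices failed: " + "; ".join(failures))
```

In practice `step_fn` would wrap the `loss.backward()` call. Once a step has fallen back to the CPU, it is probably safer to keep the rest of the run there than to return to the iGPU, though that is a judgment call and not a tested result.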

| File | Description |
| --- | --- |
| xpu_backward_issue.md | Full analysis (bilingual) |

Summary

| Scenario | Device | Framework | Status | Recommended Alternative |
| --- | --- | --- | --- | --- |
| Transformer training | iGPU | PyTorch XPU | ❌ Backward crash | Gated CNN / CPU training |
| CNN training | iGPU | PyTorch XPU | ✅ Stable | None |
| LLM inference (Qwen3) | iGPU | OpenVINO | ⚠️ Needs cooling | INT8 + small batch + cooldown |
| LLM inference | NPU | OpenVINO | ✅ Tested | Requires static shapes |
| LLM inference | CPU | OpenVINO | ✅ Stable | Slowest performance |
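The static-shape requirement noted for the NPU means every input must have the same length. One common way to arrange that is to pad or truncate each token sequence before inference. The helper below is a minimal sketch; the name `pad_to_static_length`, the default `pad_id=0`, and the choice of right-padding and right-truncation are assumptions that depend on the tokenizer and model in use.

```python
def pad_to_static_length(token_ids, length, pad_id=0):
    """Truncate or right-pad token_ids to exactly `length`.

    Returns (ids, attention_mask), where the mask is 1 for real tokens
    and 0 for padding.
    """
    ids = list(token_ids[:length])
    mask = [1] * len(ids)
    padding = length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding
```

For example, `pad_to_static_length([5, 6, 7], 5)` returns `([5, 6, 7, 0, 0], [1, 1, 1, 0, 0])`. The fixed length itself would be chosen when the model is reshaped or compiled for the NPU.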

External Links

- Intel GPU Stability Guide (GitHub): external mirror of this repo
- PyTorch XPU Documentation
- OpenVINO Documentation

These documents record observations from a specific test environment for reference only. Different hardware, driver, or framework versions may yield different results.
