- use `min(1GB, max_global_buffer)` for memory allocation size (thanks @jerryrt)
  - now older GPUs with <1GB memory will work too
  - more reliable PCIe Gen estimate
- more robust Intel GPU core/CU detection via `CL_DEVICE_IP_VERSION_INTEL`
- set `nvidia_compute_capability` only for Nvidia GPUs, not Nvidia CPUs
- fixed TFLOPs/s estimate for AMD CDNA3/4 GPUs
- fixed Device Name and CU reporting for AMD GPUs with rusticl
- disabled zero-copy on ARM iGPUs as `CL_MEM_USE_HOST_PTR` is broken there
- updated driver download links
- cosmetics
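
The allocation-size rule in the first item can be sketched in a few lines. This is only an illustration, not the benchmark's actual source: the function name `benchmark_allocation_bytes` is hypothetical, and reading "1GB" as 2^30 bytes is an assumption.

```python
# Minimal sketch of the min(1GB, max_global_buffer) rule from the notes above.
# Assumptions: "1GB" means 2**30 bytes; the function name is hypothetical.
GIB = 1024**3
MIB = 1024**2


def benchmark_allocation_bytes(max_global_buffer: int, cap: int = GIB) -> int:
    """Return the buffer size to allocate: the cap, or less if the device limit is smaller."""
    if max_global_buffer <= 0:
        raise ValueError("max_global_buffer must be positive")
    return min(cap, max_global_buffer)


if __name__ == "__main__":
    # A device whose largest allowed buffer is 512 MB gets a 512 MB allocation.
    print(benchmark_allocation_bytes(512 * MIB) // MIB)    # 512
    # A device with a 68528 MB buffer limit is capped at 1024 MB.
    print(benchmark_allocation_bytes(68528 * MIB) // MIB)  # 1024
```

Clamping to the device limit is what would let a GPU with less than 1GB of memory run the test at all, since a fixed 1GB request would fail on it.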
Example 🖖😏
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA B300 SXM6 AC                                        |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 580.126.09 (Linux)                                         |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 148 at 2032 MHz (18944 cores, 76.988 TFLOPs/s)             |
| Memory, Cache  | 274113 MB VRAM, 4736 KB global / 48 KB local               |
| Buffer Limits  | 68528 MB global, 64 KB constant                            |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  Compute (double, fma  )                         1.184 TFLOPs/s (1/64) |
| FP32  Compute (float , fma  )                        71.452 TFLOPs/s ( 1x ) |
| FP16  Compute (half2 , fma  )                        75.201 TFLOPs/s ( 1x ) |
| INT64 Compute (long  , a*b+c)                         3.714  TIOPs/s (1/24) |
| INT32 Compute (int   , a*b+c)                        37.736  TIOPs/s (1/2 ) |
| INT16 Compute (short2, a*b+c)                        34.592  TIOPs/s (1/2 ) |
| INT8  Compute (char4 , dp4a )                       118.743  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read )                            6543.01 GB/s |
| Memory Bandwidth ( coalesced write)                            6887.38 GB/s |
| Memory Bandwidth (misaligned read )                            2355.50 GB/s |
| Memory Bandwidth (misaligned write)                             969.95 GB/s |
| PCIe   Bandwidth (send            )                               9.86 GB/s |
| PCIe   Bandwidth (  receive       )                               9.70 GB/s |
| PCIe   Bandwidth (   bidirectional)            (Gen3 x16)         8.93 GB/s |
|-----------------------------------------------------------------------------|
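
As a cross-check of the header line, the 76.988 TFLOPs/s figure is consistent with cores × clock × 2, counting one fused multiply-add per core per cycle as two floating-point operations. The sketch below reproduces that arithmetic. The function name `peak_fp32_tflops` is hypothetical, and the factor of 2 is an assumption that happens to match this device; other architectures need different factors.

```python
# Minimal sketch: theoretical FP32 peak from core count and clock.
# Assumption: each core retires one FMA (= 2 FLOPs) per cycle; this matches
# the example output above but is not universal across GPU architectures.
def peak_fp32_tflops(cores: int, clock_mhz: float, flops_per_core_per_cycle: int = 2) -> float:
    """cores * MHz * FLOPs/cycle gives MFLOPs/s; divide by 1e6 for TFLOPs/s."""
    return cores * clock_mhz * flops_per_core_per_cycle / 1e6


if __name__ == "__main__":
    peak = peak_fp32_tflops(cores=18944, clock_mhz=2032)
    print(f"{peak:.3f} TFLOPs/s theoretical")  # 76.988 TFLOPs/s theoretical

    measured_fp32 = 71.452  # FP32 row of the example output
    print(f"{measured_fp32 / peak:.1%} of theoretical peak")
```

The ratios in the right-hand column of the output, such as `(1/64)`, compare each measured throughput against this theoretical figure in the same way.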