Releases: ProjectPhysX/OpenCL-Benchmark

OpenCL-Benchmark v2.0

19 Mar 17:42

  • use min(1GB, max_global_buffer) for memory allocation size (thanks @jerryrt) - now older GPUs with <1GB memory will work too
  • more reliable PCIe Gen estimate
  • more robust Intel GPU core/CU detection via CL_DEVICE_IP_VERSION_INTEL
  • set nvidia_compute_capability only for Nvidia GPUs, not Nvidia CPUs
  • fixed TFLOPs/s estimate for AMD CDNA3/4 GPUs
  • fixed Device Name and CU reporting for AMD GPUs with rusticl
  • disabled zero-copy on ARM iGPUs as CL_MEM_USE_HOST_PTR is broken there
  • updated driver download links
  • cosmetics
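
The first item above, the min(1GB, max_global_buffer) allocation size, could look roughly like the sketch below. The helper name benchmark_buffer_bytes is hypothetical, and it is an assumption that max_global_buffer corresponds to the device's maximum single-buffer allocation size (as reported by CL_DEVICE_MAX_MEM_ALLOC_SIZE) and that "1GB" means 1024^3 bytes.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper, not taken from the benchmark source.
// max_global_buffer: largest single buffer the device allows, in bytes
// (assumed here to come from CL_DEVICE_MAX_MEM_ALLOC_SIZE).
inline std::uint64_t benchmark_buffer_bytes(const std::uint64_t max_global_buffer) {
    const std::uint64_t one_gb = 1024ull * 1024ull * 1024ull; // assumption: 1GB = 1024^3 bytes
    // Never request more than the device can allocate in one buffer,
    // so GPUs with less than 1GB of memory still get a valid size.
    return std::min(one_gb, max_global_buffer);
}
```

With this clamp, a device that reports a 512 MB limit gets a 512 MB buffer, while a device with a larger limit still uses the full 1GB.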

Example 🖖😏

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA B300 SXM6 AC                                        |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 580.126.09 (Linux)                                         |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 148 at 2032 MHz (18944 cores, 76.988 TFLOPs/s)             |
| Memory, Cache  | 274113 MB VRAM, 4736 KB global / 48 KB local               |
| Buffer Limits  | 68528 MB global, 64 KB constant                            |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64   Compute   (double, fma  )                      1.184 TFLOPs/s (1/64) |
| FP32   Compute   (float , fma  )                     71.452 TFLOPs/s ( 1x ) |
| FP16   Compute   (half2 , fma  )                     75.201 TFLOPs/s ( 1x ) |
| INT64  Compute   (long  , a*b+c)                      3.714  TIOPs/s (1/24) |
| INT32  Compute   (int   , a*b+c)                     37.736  TIOPs/s (1/2 ) |
| INT16  Compute   (short2, a*b+c)                     34.592  TIOPs/s (1/2 ) |
| INT8   Compute   (char4 , dp4a )                    118.743  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                       6543.01 GB/s |
| Memory Bandwidth ( coalesced      write)                       6887.38 GB/s |
| Memory Bandwidth (misaligned read      )                       2355.50 GB/s |
| Memory Bandwidth (misaligned      write)                        969.95 GB/s |
| PCIe   Bandwidth (send                 )                          9.86 GB/s |
| PCIe   Bandwidth (   receive           )                          9.70 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen3 x16)    8.93 GB/s |
|-----------------------------------------------------------------------------|

OpenCL-Benchmark v1.9

04 Sep 06:04

  • added hardware-supported dp4a on AMD RDNA3+ GPUs and ARM GPUs
  • disabled native dp4a in Intel CPU Runtime for OpenCL because it is slower than emulated dp4a
  • more robust dp4a detection
  • fixed dual CU and IPC reporting on AMD RDNA1-4 GPUs
  • fixed core count reporting for RDNA4 GPUs
  • fixed compiler warning with min_int
  • fixed bug in split_regex()
  • fixed missing <chrono> header on some compilers
  • updated driver download links
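
For readers unfamiliar with dp4a: it computes the dot product of four packed 8-bit integers and adds it to a 32-bit accumulator. The sketch below is a plain C++ emulation of the signed variant, written for illustration only. The names dp4a_emulated and pack4 are hypothetical, and the benchmark's own emulated path in OpenCL C may be written differently.

```cpp
#include <cstdint>

// Hypothetical helper: pack four signed bytes into one 32-bit word, x in the lowest byte.
inline std::uint32_t pack4(const std::int8_t x, const std::int8_t y, const std::int8_t z, const std::int8_t w) {
    return static_cast<std::uint32_t>(static_cast<std::uint8_t>(x))
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(y)) << 8)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(z)) << 16)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(w)) << 24);
}

// Emulated signed dp4a: treat a and b as four packed int8 lanes each,
// multiply lane by lane, and add the sum of the four products to c.
inline std::int32_t dp4a_emulated(const std::uint32_t a, const std::uint32_t b, const std::int32_t c) {
    std::int32_t sum = c;
    for (int i = 0; i < 4; i++) {
        const std::int8_t ai = static_cast<std::int8_t>((a >> (8 * i)) & 0xFFu);
        const std::int8_t bi = static_cast<std::int8_t>((b >> (8 * i)) & 0xFFu);
        sum += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return sum;
}
```

Hardware dp4a does the same work in a single instruction, which is why the INT8 figures depend so strongly on whether the native or the emulated path is used.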

OpenCL-Benchmark v1.8

01 Mar 08:19

  • INT8 benchmark will now measure dp4a throughput on all supported AMD/Intel/Nvidia GPUs
  • fixed compiling on macOS with new OpenCL headers
  • updated OpenCL-Wrapper
 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | NVIDIA H100 80GB HBM3                                      |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 565.57.01 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 132 at 1980 MHz (16896 cores, 66.908 TFLOPs/s)             |
 | Memory, Cache  | 81105 MB VRAM, 4224 KB global / 48 KB local                |
 | Buffer Limits  | 20276 MB global, 64 KB constant                            |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                        31.184 TFLOPs/s (1/2 ) |
 | FP32  compute                                        62.908 TFLOPs/s ( 1x ) |
 | FP16  compute                                       123.749 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.227  TIOPs/s (1/24) |
 | INT32 compute                                        32.946  TIOPs/s (1/2 ) |
 | INT16 compute                                        30.901  TIOPs/s (1/2 ) |
-| INT8  compute                                        30.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       103.204  TIOPs/s ( 2x ) |
 | Memory Bandwidth ( coalesced read      )                       3025.53 GB/s |
 | Memory Bandwidth ( coalesced      write)                       3055.98 GB/s |
 | Memory Bandwidth (misaligned read      )                       2102.44 GB/s |
 | Memory Bandwidth (misaligned      write)                        314.25 GB/s |
 | PCIe   Bandwidth (send                 )                         10.53 GB/s |
 | PCIe   Bandwidth (   receive           )                         11.47 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   10.91 GB/s |
 |-----------------------------------------------------------------------------|
 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | AMD Instinct MI300X                                        |
 | Device Vendor  | Advanced Micro Devices, Inc.                               |
 | Device Driver  | 3635.0 (HSA1.1,LC) (Linux)                                 |
 | OpenCL Version | OpenCL C 2.0                                               |
 | Compute Units  | 304 at 2100 MHz (19456 cores, 81.715 TFLOPs/s)             |
 | Memory, Cache  | 196592 MB VRAM, 32 KB global / 64 KB local                 |
 | Buffer Limits  | 196592 MB global, 201310208 KB constant                    |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                        54.944 TFLOPs/s (2/3 ) |
 | FP32  compute                                       130.000 TFLOPs/s ( 2x ) |
 | FP16  compute                                       141.320 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.666  TIOPs/s (1/24) |
 | INT32 compute                                        47.736  TIOPs/s (2/3 ) |
 | INT16 compute                                        69.022  TIOPs/s ( 1x ) |
-| INT8  compute                                        43.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       106.178  TIOPs/s ( 1x ) |
 | Memory Bandwidth ( coalesced read      )                       3756.64 GB/s |
 | Memory Bandwidth ( coalesced      write)                       4686.31 GB/s |
 | Memory Bandwidth (misaligned read      )                       3881.24 GB/s |
 | Memory Bandwidth (misaligned      write)                       2491.25 GB/s |
 | PCIe   Bandwidth (send                 )                         54.57 GB/s |
 | PCIe   Bandwidth (   receive           )                         55.79 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   55.21 GB/s |
 |-----------------------------------------------------------------------------|
 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | Intel(R) Arc(TM) B580 Graphics                             |
 | Device Vendor  | Intel(R) Corporation                                       |
 | Device Driver  | 32.0.101.6559 (Windows)                                    |
 | OpenCL Version | OpenCL C 3.0                                               |
 | Compute Units  | 160 at 2850 MHz (2560 cores, 14.592 TFLOPs/s)              |
 | Memory, Cache  | 12187 MB VRAM, 18432 KB global / 128 KB local              |
 | Buffer Limits  | 11944 MB global, 12230900 KB constant                      |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                         0.896 TFLOPs/s (1/16) |
 | FP32  compute                                        14.249 TFLOPs/s ( 1x ) |
 | FP16  compute                                        26.547 TFLOPs/s ( 2x ) |
 | INT64 compute                                         0.636  TIOPs/s (1/24) |
 | INT32 compute                                         4.556  TIOPs/s (1/3 ) |
 | INT16 compute                                        37.082  TIOPs/s ( 2x ) |
-| INT8  compute                                        24.424  TIOPs/s ( 2x ) |
+| INT8  compute                                        48.668  TIOPs/s ( 4x ) |
 | Memory Bandwidth ( coalesced read      )                        574.09 GB/s |
 | Memory Bandwidth ( coalesced      write)                        468.07 GB/s |
 | Memory Bandwidth (misaligned read      )                        796.23 GB/s |
 | Memory Bandwidth (misaligned      write)                        383.15 GB/s |
 | PCIe   Bandwidth (send                 )                          4.99 GB/s |
 | PCIe   Bandwidth (   receive           )                          4.87 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen3 x16)    5.11 GB/s |
 |-----------------------------------------------------------------------------|

OpenCL-Benchmark v1.7

18 Feb 06:27

  • faster enqueueReadBuffer() on modern CPUs with 64-Byte-aligned host_buffer
  • updated OpenCL headers
  • better OpenCL device specs detection using vendor ID and Nvidia compute capability
  • better VRAM capacity reporting correction for Intel dGPUs
  • fixed wrong device name reporting for AMD GPUs (unlike every sane GPU vendor, they don't report the device name as CL_DEVICE_NAME but need the CL_DEVICE_BOARD_NAME_AMD extension instead)
  • fixed TFLOPs/s estimate for Intel Battlemage GPUs
 |----------------.------------------------------------------------------------|
 | Device ID      | 1                                                          |
-| Device Name    | gfx90a:sramecc+:xnack-                                     |
+| Device Name    | AMD Instinct MI210                                         |
 | Device Vendor  | Advanced Micro Devices, Inc.                               |
 | Device Driver  | 3625.0 (HSA1.1,LC)                                         |
 | OpenCL Version | OpenCL C 2.0                                               |
 | Compute Units  | 104 at 1700 MHz (6656 cores, 22.630 TFLOPs/s)              |
 | Memory, Cache  | 65520 MB, 16 KB global / 64 KB local                       |
 | Buffer Limits  | 65520 MB global, 67092480 KB constant                      |
 |----------------'------------------------------------------------------------|
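
The 64-byte-aligned host_buffer mentioned in the first item of this release can be obtained in standard C++17 with aligned operator new. The sketch below is one possible way to do it, not the benchmark's actual allocation code. AlignedDeleter, AlignedBuffer, make_host_buffer and is_64_byte_aligned are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Deleter matching the aligned allocation below (hypothetical name).
struct AlignedDeleter {
    void operator()(float* p) const { ::operator delete[](p, std::align_val_t(64)); }
};
using AlignedBuffer = std::unique_ptr<float[], AlignedDeleter>;

// Allocate n floats whose start address is a multiple of 64 bytes (one typical CPU cache line).
// The contents are uninitialized, as with a plain new float[n].
inline AlignedBuffer make_host_buffer(const std::size_t n) {
    void* raw = ::operator new[](n * sizeof(float), std::align_val_t(64));
    return AlignedBuffer(static_cast<float*>(raw));
}

// Check the alignment of an arbitrary pointer.
inline bool is_64_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64u == 0u;
}
```

A pointer obtained this way (buf.get()) can then be passed as the host pointer of a read or write call; whether the driver takes a faster path for aligned pointers depends on the OpenCL implementation.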

OpenCL-Benchmark v1.6

16 Nov 11:28

  • automatically use zero-copy buffers on CPUs/iGPUs to reduce memory footprint
  • bandwidth kernels now write non-zero data, to avoid hardware optimizations for zero-initialized buffers

OpenCL-Benchmark v1.5

18 Aug 09:37

  • enabled benchmarking FP16 vector arithmetic on Nvidia Pascal and newer GPUs with Nvidia driver 520 or newer
  • removed wait() call at the end of the benchmark on Linux
 |----------------.------------------------------------------------------------|
 | Device ID      | 9                                                          |
 | Device Name    | NVIDIA GeForce RTX 2080 Ti                                 |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 525.89.02 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 68 at 1545 MHz (4352 cores, 13.448 TFLOPs/s)               |
 | Memory, Cache  | 11011 MB, 2176 KB global / 48 KB local                     |
 | Buffer Limits  | 2752 MB global, 64 KB constant                             |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                         0.517 TFLOPs/s (1/24) |
 | FP32  compute                                        16.597 TFLOPs/s ( 1x ) |
-| FP16  compute                                          not supported        |
+| FP16  compute                                        33.054 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.563  TIOPs/s (1/4 ) |
 | INT32 compute                                        16.385  TIOPs/s ( 1x ) |
 | INT16 compute                                        13.286  TIOPs/s ( 1x ) |
 | INT8  compute                                        10.502  TIOPs/s (2/3 ) |
 | Memory Bandwidth ( coalesced read      )                        532.76 GB/s |
 | Memory Bandwidth ( coalesced      write)                        548.88 GB/s |
 | Memory Bandwidth (misaligned read      )                        534.43 GB/s |
 | Memory Bandwidth (misaligned      write)                        157.78 GB/s |
 | PCIe   Bandwidth (send                 )                         12.86 GB/s |
 | PCIe   Bandwidth (   receive           )                         12.99 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    6.30 GB/s |
 |-----------------------------------------------------------------------------|

OpenCL-Benchmark v1.4

03 Aug 06:32

OpenCL-Benchmark v1.3

02 May 20:05

  • workaround for Nvidia driver bug: enqueueFillBuffer is broken for large buffers on Nvidia GPUs
  • fixed slow numeric drift issues
  • fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (fma) with a*b+c
  • added automatic OS detection in make.sh
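
One way to macro-replace fma with a*b+c, as described in the ARM item above, is to prepend a preprocessor define to the OpenCL C kernel source before it is compiled. The sketch below shows that idea on the host side. The function name patch_kernel_source and the is_arm_gpu flag are hypothetical, and the release notes do not say how the benchmark actually performs the replacement.

```cpp
#include <string>

// Hypothetical helper: on ARM GPUs, prepend a macro so every fma(a, b, c) in the
// kernel source expands to a plain multiply-add expression instead of the built-in.
inline std::string patch_kernel_source(const std::string& source, const bool is_arm_gpu) {
    if (!is_arm_gpu) return source; // other devices keep the built-in fma
    return "#define fma(a, b, c) ((a)*(b)+(c))\n" + source;
}
```

Because the define is applied by the OpenCL C preprocessor, the kernel code itself stays unchanged for all other devices.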

OpenCL-Benchmark v1.2

07 Dec 19:11

  • corrected TFLOPs/s estimate for Intel Data Center GPU Max series
  • made correction of wrong memory reporting on Intel Arc more robust
  • made CPU/GPU buffer initialization significantly faster with std::fill and enqueueFillBuffer
  • added operating system info to OpenCL device driver version printout
  • bug fix in print_message() function in utilities.hpp

OpenCL-Benchmark v1.1

30 Apr 22:04

  • fixed several issues with macOS