Releases · ProjectPhysX/OpenCL-Benchmark

19 Mar 17:42

ProjectPhysX

v2.0

4694a5b

OpenCL-Benchmark v2.0 Latest

use min(1GB, max_global_buffer) for memory allocation size (thanks @jerryrt) - now older GPUs with <1GB memory will work too
more reliable PCIe Gen estimate
more robust Intel GPU core/CU detection via CL_DEVICE_IP_VERSION_INTEL
set nvidia_compute_capability only for Nvidia GPUs, not Nvidia CPUs
fixed TFLOPs/s estimate for AMD CDNA3/4 GPUs
fixed Device Name and CU reporting for AMD GPUs with rusticl
disabled zero-copy on ARM iGPUs as CL_MEM_USE_HOST_PTR is broken there
updated driver download links
cosmetics
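
The new allocation rule, min(1GB, max_global_buffer), can be sketched as below. This is a minimal illustration rather than the project's actual code: the helper name `benchmark_buffer_bytes` is hypothetical, `max_global_buffer` is assumed to be the value a device reports for CL_DEVICE_MAX_MEM_ALLOC_SIZE, and 1GB is assumed to mean 1024³ bytes.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: choose the benchmark buffer size in bytes.
// max_global_buffer is assumed to come from CL_DEVICE_MAX_MEM_ALLOC_SIZE.
std::uint64_t benchmark_buffer_bytes(const std::uint64_t max_global_buffer) {
    const std::uint64_t one_gb = 1024ull * 1024ull * 1024ull; // assumption: 1GB = 1024^3 bytes
    return std::min(one_gb, max_global_buffer); // never request more than the device allows
}
```

With this clamp, a device that allows only 512 MB per buffer gets a 512 MB allocation instead of a failed 1 GB request.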

Example 🖖😏

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | NVIDIA B300 SXM6 AC                                        |
| Device Vendor  | NVIDIA Corporation                                         |
| Device Driver  | 580.126.09 (Linux)                                         |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 148 at 2032 MHz (18944 cores, 76.988 TFLOPs/s)             |
| Memory, Cache  | 274113 MB VRAM, 4736 KB global / 48 KB local               |
| Buffer Limits  | 68528 MB global, 64 KB constant                            |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64   Compute   (double, fma  )                      1.184 TFLOPs/s (1/64) |
| FP32   Compute   (float , fma  )                     71.452 TFLOPs/s ( 1x ) |
| FP16   Compute   (half2 , fma  )                     75.201 TFLOPs/s ( 1x ) |
| INT64  Compute   (long  , a*b+c)                      3.714  TIOPs/s (1/24) |
| INT32  Compute   (int   , a*b+c)                     37.736  TIOPs/s (1/2 ) |
| INT16  Compute   (short2, a*b+c)                     34.592  TIOPs/s (1/2 ) |
| INT8   Compute   (char4 , dp4a )                    118.743  TIOPs/s ( 2x ) |
| Memory Bandwidth ( coalesced read      )                       6543.01 GB/s |
| Memory Bandwidth ( coalesced      write)                       6887.38 GB/s |
| Memory Bandwidth (misaligned read      )                       2355.50 GB/s |
| Memory Bandwidth (misaligned      write)                        969.95 GB/s |
| PCIe   Bandwidth (send                 )                          9.86 GB/s |
| PCIe   Bandwidth (   receive           )                          9.70 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen3 x16)    8.93 GB/s |
|-----------------------------------------------------------------------------|

Contributors

jerryrt

04 Sep 06:04

ProjectPhysX

v1.9

3b66959

OpenCL-Benchmark v1.9

added hardware-supported dp4a on AMD RDNA3+ GPUs and ARM GPUs
disabled native dp4a in Intel CPU Runtime for OpenCL because it is slower than emulated dp4a
more robust dp4a detection
fixed dual CU and IPC reporting on AMD RDNA1-4 GPUs
fixed core count reporting for RDNA4 GPUs
fixed compiler warning with min_int
fixed bug in split_regex()
fixed missing <chrono> header on some compilers
updated driver download links
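
For readers unfamiliar with dp4a: it computes a dot product of four packed 8-bit lanes and adds the result to a 32-bit accumulator. The sketch below shows what an emulated version might look like on the host side. It assumes signed 8-bit lanes, the function name `dp4a_emulated` is hypothetical, and it is not taken from the benchmark's kernel code.

```cpp
#include <cstdint>

// Software emulation of dp4a: a and b each hold four packed signed 8-bit lanes.
// Lanes are multiplied pairwise and the four products are added to accumulator c.
std::int32_t dp4a_emulated(const std::uint32_t a, const std::uint32_t b, const std::int32_t c) {
    std::int32_t sum = c;
    for (int i = 0; i < 4; i++) {
        // extract lane i and reinterpret its 8 bits as a signed value
        const std::int8_t ai = static_cast<std::int8_t>((a >> (8 * i)) & 0xFFu);
        const std::int8_t bi = static_cast<std::int8_t>((b >> (8 * i)) & 0xFFu);
        sum += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return sum;
}
```

Hardware dp4a performs the same work in a single instruction, which is why native support can raise INT8 throughput well above the emulated path.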

01 Mar 08:19

ProjectPhysX

v1.8

326d3a0

OpenCL-Benchmark v1.8

INT8 benchmark will now measure dp4a throughput on all supported AMD/Intel/Nvidia GPUs
fixed compiling on macOS with new OpenCL headers
updated OpenCL-Wrapper

 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | NVIDIA H100 80GB HBM3                                      |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 565.57.01 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 132 at 1980 MHz (16896 cores, 66.908 TFLOPs/s)             |
 | Memory, Cache  | 81105 MB VRAM, 4224 KB global / 48 KB local                |
 | Buffer Limits  | 20276 MB global, 64 KB constant                            |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                        31.184 TFLOPs/s (1/2 ) |
 | FP32  compute                                        62.908 TFLOPs/s ( 1x ) |
 | FP16  compute                                       123.749 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.227  TIOPs/s (1/24) |
 | INT32 compute                                        32.946  TIOPs/s (1/2 ) |
 | INT16 compute                                        30.901  TIOPs/s (1/2 ) |
-| INT8  compute                                        30.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       103.204  TIOPs/s ( 2x ) |
 | Memory Bandwidth ( coalesced read      )                       3025.53 GB/s |
 | Memory Bandwidth ( coalesced      write)                       3055.98 GB/s |
 | Memory Bandwidth (misaligned read      )                       2102.44 GB/s |
 | Memory Bandwidth (misaligned      write)                        314.25 GB/s |
 | PCIe   Bandwidth (send                 )                         10.53 GB/s |
 | PCIe   Bandwidth (   receive           )                         11.47 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   10.91 GB/s |
 |-----------------------------------------------------------------------------|

 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | AMD Instinct MI300X                                        |
 | Device Vendor  | Advanced Micro Devices, Inc.                               |
 | Device Driver  | 3635.0 (HSA1.1,LC) (Linux)                                 |
 | OpenCL Version | OpenCL C 2.0                                               |
 | Compute Units  | 304 at 2100 MHz (19456 cores, 81.715 TFLOPs/s)             |
 | Memory, Cache  | 196592 MB VRAM, 32 KB global / 64 KB local                 |
 | Buffer Limits  | 196592 MB global, 201310208 KB constant                    |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                        54.944 TFLOPs/s (2/3 ) |
 | FP32  compute                                       130.000 TFLOPs/s ( 2x ) |
 | FP16  compute                                       141.320 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.666  TIOPs/s (1/24) |
 | INT32 compute                                        47.736  TIOPs/s (2/3 ) |
 | INT16 compute                                        69.022  TIOPs/s ( 1x ) |
-| INT8  compute                                        43.582  TIOPs/s (1/2 ) |
+| INT8  compute                                       106.178  TIOPs/s ( 1x ) |
 | Memory Bandwidth ( coalesced read      )                       3756.64 GB/s |
 | Memory Bandwidth ( coalesced      write)                       4686.31 GB/s |
 | Memory Bandwidth (misaligned read      )                       3881.24 GB/s |
 | Memory Bandwidth (misaligned      write)                       2491.25 GB/s |
 | PCIe   Bandwidth (send                 )                         54.57 GB/s |
 | PCIe   Bandwidth (   receive           )                         55.79 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   55.21 GB/s |
 |-----------------------------------------------------------------------------|

 |----------------.------------------------------------------------------------|
 | Device ID      | 0                                                          |
 | Device Name    | Intel(R) Arc(TM) B580 Graphics                             |
 | Device Vendor  | Intel(R) Corporation                                       |
 | Device Driver  | 32.0.101.6559 (Windows)                                    |
 | OpenCL Version | OpenCL C 3.0                                               |
 | Compute Units  | 160 at 2850 MHz (2560 cores, 14.592 TFLOPs/s)              |
 | Memory, Cache  | 12187 MB VRAM, 18432 KB global / 128 KB local              |
 | Buffer Limits  | 11944 MB global, 12230900 KB constant                      |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                         0.896 TFLOPs/s (1/16) |
 | FP32  compute                                        14.249 TFLOPs/s ( 1x ) |
 | FP16  compute                                        26.547 TFLOPs/s ( 2x ) |
 | INT64 compute                                         0.636  TIOPs/s (1/24) |
 | INT32 compute                                         4.556  TIOPs/s (1/3 ) |
 | INT16 compute                                        37.082  TIOPs/s ( 2x ) |
-| INT8  compute                                        24.424  TIOPs/s ( 2x ) |
+| INT8  compute                                        48.668  TIOPs/s ( 4x ) |
 | Memory Bandwidth ( coalesced read      )                        574.09 GB/s |
 | Memory Bandwidth ( coalesced      write)                        468.07 GB/s |
 | Memory Bandwidth (misaligned read      )                        796.23 GB/s |
 | Memory Bandwidth (misaligned      write)                        383.15 GB/s |
 | PCIe   Bandwidth (send                 )                          4.99 GB/s |
 | PCIe   Bandwidth (   receive           )                          4.87 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen3 x16)    5.11 GB/s |
 |-----------------------------------------------------------------------------|

18 Feb 06:27

ProjectPhysX

v1.7

c980082

OpenCL-Benchmark v1.7

faster enqueueReadBuffer() on modern CPUs with 64-Byte-aligned host_buffer
updated OpenCL headers
better OpenCL device specs detection using vendor ID and Nvidia compute capability
better VRAM capacity reporting correction for Intel dGPUs
fixed wrong device name reporting for AMD GPUs (unlike every sane GPU vendor they don't report device name as CL_DEVICE_NAME but need CL_DEVICE_BOARD_NAME_AMD extension instead)
fixed TFLOPs/s estimate for Intel Battlemage GPUs

 |----------------.------------------------------------------------------------|
 | Device ID      | 1                                                          |
-| Device Name    | gfx90a:sramecc+:xnack-                                     |
+| Device Name    | AMD Instinct MI210                                         |
 | Device Vendor  | Advanced Micro Devices, Inc.                               |
 | Device Driver  | 3625.0 (HSA1.1,LC)                                         |
 | OpenCL Version | OpenCL C 2.0                                               |
 | Compute Units  | 104 at 1700 MHz (6656 cores, 22.630 TFLOPs/s)              |
 | Memory, Cache  | 65520 MB, 16 KB global / 64 KB local                       |
 | Buffer Limits  | 65520 MB global, 67092480 KB constant                      |
 |----------------'------------------------------------------------------------|
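
The 64-Byte-aligned host_buffer mentioned in the v1.7 notes could be allocated as sketched below. This assumes C++17 aligned `operator new`. The helper names `make_aligned_host_buffer` and `is_64_byte_aligned` are hypothetical, and the project may allocate its buffer differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Deleter that matches the aligned operator new used below.
struct AlignedDelete {
    void operator()(float* p) const { ::operator delete(p, std::align_val_t(64)); }
};

// Hypothetical helper: allocate n floats at an address that is a multiple of 64 bytes.
std::unique_ptr<float[], AlignedDelete> make_aligned_host_buffer(const std::size_t n) {
    void* p = ::operator new(n * sizeof(float), std::align_val_t(64));
    return std::unique_ptr<float[], AlignedDelete>(static_cast<float*>(p));
}

// Check whether a pointer sits on a 64-byte boundary.
bool is_64_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64u == 0u;
}
```

64 bytes is the cache line size on common x86 CPUs, so a buffer aligned this way starts exactly on a cache line boundary.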

16 Nov 11:28

ProjectPhysX

v1.6

1ece450

OpenCL-Benchmark v1.6

automatically use zero-copy buffers on CPUs/iGPUs to reduce memory footprint
bandwidth kernels now write non-zero data, to avoid hardware optimizations for zero-initialized buffers

18 Aug 09:37

ProjectPhysX

v1.5

7b264f9

OpenCL-Benchmark v1.5

enabled benchmarking FP16 vector arithmetic on Nvidia Pascal and newer GPUs with Nvidia driver 520 or newer
removed wait() call at the end of the benchmark on Linux

 |----------------.------------------------------------------------------------|
 | Device ID      | 9                                                          |
 | Device Name    | NVIDIA GeForce RTX 2080 Ti                                 |
 | Device Vendor  | NVIDIA Corporation                                         |
 | Device Driver  | 525.89.02 (Linux)                                          |
 | OpenCL Version | OpenCL C 1.2                                               |
 | Compute Units  | 68 at 1545 MHz (4352 cores, 13.448 TFLOPs/s)               |
 | Memory, Cache  | 11011 MB, 2176 KB global / 48 KB local                     |
 | Buffer Limits  | 2752 MB global, 64 KB constant                             |
 |----------------'------------------------------------------------------------|
 | Info: OpenCL C code successfully compiled.                                  |
 | FP64  compute                                         0.517 TFLOPs/s (1/24) |
 | FP32  compute                                        16.597 TFLOPs/s ( 1x ) |
-| FP16  compute                                          not supported        |
+| FP16  compute                                        33.054 TFLOPs/s ( 2x ) |
 | INT64 compute                                         3.563  TIOPs/s (1/4 ) |
 | INT32 compute                                        16.385  TIOPs/s ( 1x ) |
 | INT16 compute                                        13.286  TIOPs/s ( 1x ) |
 | INT8  compute                                        10.502  TIOPs/s (2/3 ) |
 | Memory Bandwidth ( coalesced read      )                        532.76 GB/s |
 | Memory Bandwidth ( coalesced      write)                        548.88 GB/s |
 | Memory Bandwidth (misaligned read      )                        534.43 GB/s |
 | Memory Bandwidth (misaligned      write)                        157.78 GB/s |
 | PCIe   Bandwidth (send                 )                         12.86 GB/s |
 | PCIe   Bandwidth (   receive           )                         12.99 GB/s |
 | PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    6.30 GB/s |
 |-----------------------------------------------------------------------------|
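
The v1.5 condition (Nvidia Pascal or newer, driver 520 or newer) can be expressed as a small check. This is only a sketch: the function name and the encoding of compute capability as major*10+minor are assumptions for illustration, and how the benchmark itself gates FP16 is not shown in these notes. Pascal corresponds to compute capability 6.x.

```cpp
#include <string>

// Hypothetical check for the v1.5 FP16 condition.
// compute_capability_x10: e.g. 75 for compute capability 7.5 (assumed encoding).
// driver_version: string such as "525.89.02"; must start with digits, otherwise std::stoi throws.
bool nvidia_fp16_benchmark_enabled(const int compute_capability_x10, const std::string& driver_version) {
    const int driver_major = std::stoi(driver_version); // parses leading digits, stops at the first '.'
    return compute_capability_x10 >= 60 && driver_major >= 520;
}
```

The RTX 2080 Ti in the table above (compute capability 7.5, driver 525.89.02) passes both conditions, consistent with FP16 now being reported.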

03 Aug 06:32

ProjectPhysX

v1.4

c7e8987

OpenCL-Benchmark v1.4

updated OpenCL-Wrapper
GPU Driver and OpenCL Runtime installation instructions will be printed to console if no OpenCL devices are available

02 May 20:05

ProjectPhysX

v1.3

677d52f

OpenCL-Benchmark v1.3

workaround for Nvidia driver bug: enqueueFillBuffer is broken for large buffers on Nvidia GPUs
fixed slow numeric drift issues
fixed terrible performance on ARM GPUs by macro-replacing fused-multiply-add (fma) with a*b+c
added automatic OS detection in make.sh
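
The fma workaround for ARM GPUs can be pictured as prepending a macro to the OpenCL C source before compilation, so that every fma(a, b, c) call expands to a*b+c. A minimal host-side sketch follows, assuming the kernel source is held in a std::string. The helper name `patch_fma_for_arm` and the `is_arm_gpu` flag are hypothetical.

```cpp
#include <string>

// Hypothetical helper: on ARM GPUs, prepend a macro that replaces fma(a, b, c) with a*b+c.
// On all other devices the kernel source is returned unchanged.
std::string patch_fma_for_arm(const std::string& kernel_source, const bool is_arm_gpu) {
    if (!is_arm_gpu) return kernel_source;
    return "#define fma(a, b, c) ((a)*(b)+(c))\n" + kernel_source;
}
```

The extra parentheses in the macro keep operator precedence intact when the arguments are themselves expressions.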

07 Dec 19:11

ProjectPhysX

v1.2

0296687

OpenCL-Benchmark v1.2

corrected TFLOPs/s estimate for Intel Data Center GPU Max series
made correction of wrong memory reporting on Intel Arc more robust
made CPU/GPU buffer initialization significantly faster with std::fill and enqueueFillBuffer
added operating system info to OpenCL device driver version printout
bug fix in print_message() function in utilities.hpp

30 Apr 22:04

ProjectPhysX

v1.1

c5406ef

OpenCL-Benchmark v1.1

fixed several issues with macOS
