<cuda/std/bit>

cuda::std::bit_cast

cuda::std::bit_cast extends the standard std::bit_cast to also recognize CUDA extended floating-point scalar and vector types as trivially copyable.
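As a rough illustration of what a bit cast does, the following host-side sketch copies the object representation of one trivially copyable value into another type of the same size. The names bit_cast_sketch, float_bits, and bit_cast_demo are hypothetical and exist only for this example; this is not the library's implementation. With the library, the equivalent call would be cuda::std::bit_cast<To>(from) from <cuda/std/bit>.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not the library implementation): copies the object
// representation of `from` into a new object of type To.
template <class To, class From>
To bit_cast_sketch(const From& from) {
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
    To to;  // assumes To is default-constructible, which is enough for this sketch
    std::memcpy(&to, &from, sizeof(To));
    return to;
}

// Reinterpret the bits of a 32-bit float as a 32-bit unsigned integer.
// Assumes float uses the IEEE-754 binary32 format.
inline std::uint32_t float_bits(float f) {
    return bit_cast_sketch<std::uint32_t>(f);
}

// Small self-check: 1.0f is encoded as 0x3F800000 in binary32.
inline void bit_cast_demo() {
    assert(float_bits(1.0f) == 0x3F800000u);
}
```

Unlike this memcpy-based sketch, a real bit_cast can also be usable in constexpr contexts, subject to the limitations listed below.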

Limitations

  • The function can be used in constexpr contexts only when the source and destination types are trivially copyable.
  • The function cannot be used in constexpr contexts with MSVC <= 19.25 and GCC <= 10.

CUDA Performance Considerations

Given an unsigned integer with N bits, where N <= 32, the <bit> functions translate into the following SASS instructions. For some functions, the result is decorated with a compile-time assumption that restricts its range and allows further optimizations.

  • bit_width() translates into a single FLO SASS instruction. The result is assumed to be in the range [0, N].
  • bit_ceil() translates into ADD, FLO, SHL, IMINMAX SASS instructions. The result is assumed to be greater than or equal to the input.
  • bit_floor() translates into FLO, SHL SASS instructions. The result is assumed to be less than or equal to the input.
  • byteswap() translates into a single PRMT SASS instruction.
  • popcount() translates into a single POPC SASS instruction. The result is assumed to be in the range [0, N].
  • has_single_bit() translates into POPC + ISETP SASS instructions.
  • rotl()/rotr() translate into a single SHF (funnel shift) SASS instruction.
  • countl_zero() translates into FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
  • countl_one() translates into LOP3, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
  • countr_zero() translates into BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
  • countr_one() translates into LOP3, BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
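The instruction sequences above hint at how the functions relate to one another. The portable host-side sketch below expresses those relationships for 32-bit inputs. Every name ending in _ref is hypothetical. Reading LOP3 as a bitwise NOT and BREV as a bit reversal is an interpretation of the list, not a statement about the generated code.

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits (the role POPC plays on the device).
inline int popcount_ref(std::uint32_t x) {
    int n = 0;
    for (; x != 0; x &= x - 1) ++n;  // clears the lowest set bit each step
    return n;
}

// Bits needed to represent x: 0 for x == 0, else 1 + index of the highest set bit.
inline int bit_width_ref(std::uint32_t x) {
    int w = 0;
    for (; x != 0; x >>= 1) ++w;
    return w;
}

// Reverse the order of the 32 bits.
inline std::uint32_t bit_reverse_ref(std::uint32_t x) {
    std::uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

inline int countl_zero_ref(std::uint32_t x) { return 32 - bit_width_ref(x); }
// Leading ones are the leading zeros of the complement.
inline int countl_one_ref(std::uint32_t x) { return countl_zero_ref(~x); }
// Trailing zeros are the leading zeros of the bit-reversed value.
inline int countr_zero_ref(std::uint32_t x) { return countl_zero_ref(bit_reverse_ref(x)); }
inline int countr_one_ref(std::uint32_t x) { return countr_zero_ref(~x); }

// A power of two has exactly one set bit.
inline bool has_single_bit_ref(std::uint32_t x) { return popcount_ref(x) == 1; }

// Largest power of two <= x (0 for x == 0).
inline std::uint32_t bit_floor_ref(std::uint32_t x) {
    return x == 0 ? 0u : (std::uint32_t{1} << (bit_width_ref(x) - 1));
}

// Smallest power of two >= x. Precondition: x <= 2^31, so the result fits.
inline std::uint32_t bit_ceil_ref(std::uint32_t x) {
    return x <= 1 ? 1u : (std::uint32_t{1} << bit_width_ref(x - 1));
}

// Spot checks of the range properties stated in the list above.
inline void bit_ref_demo() {
    assert(bit_floor_ref(37u) <= 37u && bit_ceil_ref(37u) >= 37u);
    assert(popcount_ref(0xFFFFFFFFu) == 32 && bit_width_ref(0u) == 0);
}
```

The results match the stated ranges: the counting functions return values in [0, 32], bit_floor_ref never exceeds its input, and bit_ceil_ref is never below it.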

Additional Notes

  • All functions are marked [[nodiscard]] and noexcept
  • All functions support 128-bit integer types
  • bit_ceil() checks for overflow in debug mode
  • rotl()/rotr() check for an invalid count value (INT_MIN) in debug mode
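The INT_MIN check can be motivated with a small host-side sketch. It assumes that a negative count is handled by negating it and rotating in the opposite direction; this is one plausible design, not a confirmed description of the library. Under that assumption INT_MIN is the one count that cannot be negated without overflow. The names rotl_sketch and rotr_sketch are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

inline std::uint32_t rotr_sketch(std::uint32_t x, int count);

// Rotate left by `count` bits. A negative count rotates right instead.
// Assumption for illustration: the count is negated to switch direction,
// so INT_MIN (whose negation overflows) is rejected by a debug assertion.
inline std::uint32_t rotl_sketch(std::uint32_t x, int count) {
    assert(count != INT_MIN);
    if (count < 0) return rotr_sketch(x, -count);
    const unsigned r = static_cast<unsigned>(count) % 32u;
    // r == 0 is handled separately because shifting by 32 is undefined.
    return r == 0 ? x : (x << r) | (x >> (32u - r));
}

// Rotate right by `count` bits. A negative count rotates left instead.
inline std::uint32_t rotr_sketch(std::uint32_t x, int count) {
    assert(count != INT_MIN);
    if (count < 0) return rotl_sketch(x, -count);
    const unsigned r = static_cast<unsigned>(count) % 32u;
    return r == 0 ? x : (x >> r) | (x << (32u - r));
}
```

For example, rotl_sketch(0x12345678u, 8) yields 0x34567812u, and rotl_sketch(x, -4) gives the same result as rotr_sketch(x, 4).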

Note

When the inputs are run-time values that the compiler can nevertheless resolve at compile time, e.g. the index of a loop with a fixed number of iterations, using these functions may not be optimal.

Note

GCC <= 8 uses a slow path with more instructions, even in CUDA code.