Releases: NVIDIA/thrust
Thrust 1.9.7-1 (CUDA Toolkit 10.2 for Tegra)
Thrust 1.9.7-1 is a minor release accompanying the CUDA Toolkit 10.2 release for Tegra. It is nearly identical to 1.9.7.
Bug Fixes
- Remove support for GCC's broken nodiscard-like attribute.
Thrust 1.9.7 (CUDA Toolkit 10.2)
Thrust 1.9.7 is a minor release accompanying the CUDA Toolkit 10.2 release. Unfortunately, although the version and patch numbers are identical, one bug fix present in Thrust 1.9.7 (NVBug 2646034: Fix incorrect dependency handling for stream acquisition in thrust::future) was not included in the CUDA Toolkit 10.2 preview release for AArch64 SBSA. The tag cuda-10.2aarch64sbsa contains the exact version of Thrust present in the CUDA Toolkit 10.2 preview release for AArch64 SBSA.
Bug Fixes
- #967, NVBug 2448170: Fix the CUDA backend thrust::for_each so that it supports large input sizes with 64-bit indices.
- NVBug 2646034: Fix incorrect dependency handling for stream acquisition in thrust::future.
  - Not present in the CUDA Toolkit 10.2 preview release for AArch64 SBSA.
- #968, NVBug 2612102: Fix the thrust::mr::polymorphic_adaptor to actually use its template parameter.
Thrust 1.9.6-1 (NVIDIA HPC SDK 20.3)
Thrust 1.9.6-1 is a variant of 1.9.6 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms when using the CUDA Toolkit 10.1 Update 2 release.
Thrust 1.9.6 (CUDA Toolkit 10.1 Update 2)
Thrust 1.9.6 is a minor release accompanying the CUDA Toolkit 10.1 Update 2 release.
Bug Fixes
- NVBug 2509847: Inconsistent alignment of thrust::complex.
- NVBug 2586774: Compilation failure with Clang + older libstdc++ that doesn't have std::is_trivially_copyable.
- NVBug 200488234: CUDA header files contain Unicode characters, which leads to compilation errors on Windows.
- #949, #973, NVBug 2422333, NVBug 2522259, NVBug 2528822: thrust::detail::aligned_reinterpret_cast must be annotated with __host__ __device__.
- NVBug 2599629: Missing include in the OpenMP sort implementation.
- NVBug 200513211: Truncation warning in test code under VC142
Thrust 1.9.5 (CUDA Toolkit 10.1 Update 1)
Thrust 1.9.5 is a minor bugfix release accompanying the CUDA Toolkit 10.1 Update 1 release.
Bug Fixes
- NVBug 2502854: Assignment of a complex vector between host and device fails to compile in CUDA >= 9.1 with GCC 6.
Thrust 1.9.4 (CUDA Toolkit 10.1)
Thrust 1.9.4 adds asynchronous interfaces for parallel algorithms, a new allocator system including caching allocators and unified memory support, as well as a variety of other enhancements, mostly related to C++11/C++14/C++17/C++20 support. The new asynchronous algorithms in the thrust::async namespace return thrust::event or thrust::future objects, which can be waited upon to synchronize with the completion of the parallel operation.
Breaking API Changes
Synchronous Thrust algorithms now block until all of their operations have completed. Use the new asynchronous Thrust algorithms for non-blocking behavior.
New Features
- thrust::event and thrust::future<T>, uniquely-owned asynchronous handles consisting of a state (ready or not ready), content (some value; for thrust::future only), and an optional set of objects that should be destroyed only when the future's value is ready and has been consumed.
  - The design is loosely based on C++11's std::future.
  - They can be .wait'd on, and the value of a future can be waited on and retrieved with .get or .extract.
  - Multiple thrust::events and thrust::futures can be combined with thrust::when_all.
  - thrust::futures can be converted to thrust::events.
  - Currently, these primitives are only implemented for the CUDA backend and are C++11 only.
- New asynchronous algorithms that return thrust::event/thrust::futures, implemented as C++20 range-style customization points:
  - thrust::async::reduce.
  - thrust::async::reduce_into, which takes a target location to store the reduction result into.
  - thrust::async::copy, including a two-policy overload that allows explicit cross-system copies to which execution policy properties can be attached.
  - thrust::async::transform.
  - thrust::async::for_each.
  - thrust::async::stable_sort.
  - thrust::async::sort.
  - By default the asynchronous algorithms use the new caching allocators. Deallocation of temporary storage is deferred until the destruction of the returned thrust::future. The content of thrust::futures is stored in either device or universal memory and transferred to the host only upon request to prevent unnecessary data migration.
  - Asynchronous algorithms are currently only implemented for the CUDA system and are C++11 only.
- exec.after(f, g, ...), a new execution policy method that takes a set of thrust::event/thrust::futures and returns an execution policy that operations on that execution policy should depend upon.
- New logic and mindset for the type requirements for cross-system sequence copies (currently only used by thrust::async::copy), based on:
  - thrust::is_contiguous_iterator and THRUST_PROCLAIM_CONTIGUOUS_ITERATOR for detecting/indicating that an iterator points to contiguous storage.
  - thrust::is_trivially_relocatable and THRUST_PROCLAIM_TRIVIALLY_RELOCATABLE for detecting/indicating that a type is memcpyable (based on principles from https://wg21.link/P1144).
  - The new approach reduces buffering, increases performance, and increases correctness.
  - The fast path is now enabled when copying fp16 and CUDA vector types with thrust::async::copy.
- All Thrust synchronous algorithms for the CUDA backend now actually synchronize. Previously, any algorithm that did not allocate temporary storage (counterexample: thrust::sort) and did not have a computation-dependent result (counterexample: thrust::reduce) would actually be launched asynchronously. Additionally, synchronous algorithms that allocated temporary storage would become asynchronous if a custom allocator was supplied that did not synchronize on allocation/deallocation, unlike cudaMalloc/cudaFree. So, now thrust::for_each, thrust::transform, thrust::sort, etc. are truly synchronous. In some cases this may be a performance regression; if you need asynchrony, use the new asynchronous algorithms.
- Thrust's allocator framework has been rewritten. It now uses a memory resource system, similar to C++17's std::pmr but supporting static polymorphism. In this new model, memory resources are objects that allocate untyped storage, and allocators are cheap handles to memory resources. The new facilities live in <thrust/mr/*>.
  - thrust::mr::memory_resource<Pointer>, the memory resource base class, which takes a (possibly tagged) pointer to void type as a parameter.
  - thrust::mr::allocator<T, MemoryResource>, an allocator backed by a memory resource object.
  - thrust::mr::polymorphic_adaptor_resource<Pointer>, a type-erased memory resource adaptor.
  - thrust::mr::polymorphic_allocator<T>, a C++17-style polymorphic allocator backed by a type-erased memory resource object.
  - New tunable C++17-style caching memory resources, thrust::mr::(disjoint_)?(un)?synchronized_pool_resource, designed to cache both small object allocations and large repetitive temporary allocations. The disjoint variants use separate storage for management of the pool, which is necessary if the memory being allocated cannot be accessed on the host (e.g. device memory).
  - System-specific allocators were rewritten to use the new memory resource framework.
  - New thrust::device_memory_resource for allocating device memory.
  - New thrust::universal_memory_resource for allocating memory that can be accessed from both the host and device (e.g. cudaMallocManaged).
  - New thrust::universal_host_pinned_memory_resource for allocating memory that can be accessed from the host and the device but always resides in host memory (e.g. cudaMallocHost).
  - thrust::get_per_device_resource and thrust::per_device_allocator, which lazily create and retrieve a per-device singleton memory resource.
  - Rebinding mechanisms (rebind_traits and rebind_alloc) for thrust::allocator_traits.
  - thrust::device_make_unique, a factory function for creating a std::unique_ptr to a newly allocated object in device memory.
  - <thrust/detail/memory_algorithms>, a C++11 implementation of the C++17 uninitialized memory algorithms.
  - thrust::allocate_unique and friends, based on the proposed C++23 std::allocate_unique (https://wg21.link/P0211).
- New type traits and metaprogramming facilities. Type traits are slowly being migrated out of thrust::detail:: and <thrust/detail/*>; their new home will be thrust:: and <thrust/type_traits/*>.
  - thrust::is_execution_policy.
  - thrust::is_operator_less_or_greater_function_object, which detects thrust::less, thrust::greater, std::less, and std::greater.
  - thrust::is_operator_plus_function_object, which detects thrust::plus and std::plus.
  - thrust::remove_cvref(_t)?, a C++11 implementation of C++20's std::remove_cvref(_t)?.
  - thrust::void_t, and various other new type traits.
  - thrust::integer_sequence and friends, a C++11 implementation of C++14's std::integer_sequence.
  - thrust::conjunction, thrust::disjunction, and thrust::negation, a C++11 implementation of C++17's logical metafunctions.
  - Some Thrust type traits (such as thrust::is_constructible) have been redefined in terms of C++11's type traits when they are available.
- <thrust/detail/tuple_algorithms.h>, new std::tuple algorithms:
  - thrust::tuple_transform.
  - thrust::tuple_for_each.
  - thrust::tuple_subset.
- Miscellaneous new std::-like facilities:
  - thrust::optional, a C++11 implementation of C++17's std::optional.
  - thrust::addressof, an implementation of C++11's std::addressof.
  - thrust::next and thrust::prev, an implementation of C++11's std::next and std::prev.
  - thrust::square, a <functional>-style unary function object that multiplies its argument by itself.
  - <thrust/limits.h> and thrust::numeric_limits, a customized version of <limits> and std::numeric_limits.
- <thrust/detail/preprocessor.h>, new general purpose preprocessor facilities:
  - THRUST_PP_CAT[2-5], concatenates two to five tokens.
  - THRUST_PP_EXPAND(_ARGS)?, performs double expansion.
  - THRUST_PP_ARITY and THRUST_PP_DISPATCH, tools for macro overloading.
  - THRUST_PP_BOOL, boolean conversion.
  - THRUST_PP_INC and THRUST_PP_DEC, increment/decrement.
  - THRUST_PP_HEAD, a variadic macro that expands to the first argument.
  - THRUST_PP_TAIL, a variadic macro that expands to all its arguments after the first.
  - THRUST_PP_IIF, bitwise conditional.
  - THRUST_PP_COMMA_IF and THRUST_PP_HAS_COMMA, facilities for adding and detecting comma tokens.
  - THRUST_PP_IS_VARIADIC_NULLARY, returns true if called with a nullary __VA_ARGS__.
  - THRUST_CURRENT_FUNCTION, expands to the name of the current function.
- New C++11 compatibility macros:
  - THRUST_NODISCARD, expands to [[nodiscard]] when available and the best equivalent otherwise.
  - THRUST_CONSTEXPR, expands to constexpr when available and the best equivalent otherwise.
  - THRUST_OVERRIDE, expands to override when available and the best equivalent otherwise.
  - THRUST_DEFAULT, expands to = default; when available and the best equivalent otherwise.
  - THRUST_NOEXCEPT, expands to noexcept when available and the best equivalent otherwise.
  - THRUST_FINAL, expands to final when available and the best equivalent otherwise.
  - THRUST_INLINE_CONSTANT, expands to inline constexpr when available and the best equivalent otherwise.
- <thrust/detail/type_deduction.h>, new C++11-only type deduction helpers:
  - THRUST_DECLTYPE_RETURNS*, expand to function definitions with suitable conditional noexcept qualifiers and trailing return types.
  - THRUST_FWD(x), expands to ::std::forward<decltype(x)>(x).
  - THRUST_MVCAP, expands to a lambda move capture.
  - THRUST_RETOF, expands to a decltype computing the return type of an invocable.
New ...
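The cross-system copy requirements listed under Thrust 1.9.4 above boil down to a dispatch: when both ranges are contiguous and the element type is memcpyable, a copy can be one bulk memcpy instead of an element-wise loop. The following is a minimal host-only sketch of that dispatch in standard C++, not Thrust's implementation. It assumes raw pointers are the only contiguous iterators and uses std::is_trivially_copyable as a stand-in for thrust::is_trivially_relocatable; the sketch namespace and every name in it are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iterator>
#include <string>
#include <type_traits>
#include <vector>

namespace sketch {

// Hypothetical stand-ins for the two traits: only raw pointers count as
// contiguous, and std::is_trivially_copyable decides whether memcpy is allowed.
template <typename It>
struct is_contiguous : std::is_pointer<It> {};

template <typename T>
struct is_relocatable : std::is_trivially_copyable<T> {};

// Records which path the most recent call took (for illustration only).
inline bool& used_memcpy() {
  static bool flag = false;
  return flag;
}

// Fast path: one bulk memcpy over contiguous storage.
template <typename InputIt, typename OutputIt>
OutputIt copy_n_impl(InputIt first, std::size_t n, OutputIt result, std::true_type) {
  if (n != 0) std::memcpy(&*result, &*first, n * sizeof(*first));
  used_memcpy() = true;
  return result + n;
}

// Slow path: element-by-element assignment.
template <typename InputIt, typename OutputIt>
OutputIt copy_n_impl(InputIt first, std::size_t n, OutputIt result, std::false_type) {
  for (std::size_t i = 0; i < n; ++i) *result++ = *first++;
  used_memcpy() = false;
  return result;
}

// Picks the fast path only when both sides are contiguous, the value types
// match, and the value type is memcpyable.
template <typename InputIt, typename OutputIt>
OutputIt copy_n(InputIt first, std::size_t n, OutputIt result) {
  using In = typename std::iterator_traits<InputIt>::value_type;
  using Out = typename std::iterator_traits<OutputIt>::value_type;
  using fast = std::integral_constant<bool,
      is_contiguous<InputIt>::value && is_contiguous<OutputIt>::value &&
      std::is_same<In, Out>::value && is_relocatable<In>::value>;
  return copy_n_impl(first, n, result, fast{});
}

}  // namespace sketch
```

Because only raw pointers count as contiguous in this sketch, std::vector iterators take the element-wise path; the release notes describe THRUST_PROCLAIM_CONTIGUOUS_ITERATOR as the way to indicate contiguity for an iterator type in Thrust itself.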
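The caching pool resources in the 1.9.4 allocator rewrite are described as tunable, C++17-style memory resources that cache repeated temporary allocations. The sketch below is a host-only analogy written against the standard std::pmr::memory_resource interface, not thrust::mr::unsynchronized_pool_resource: the class name caching_resource is hypothetical, it has no size classes or tuning options, and it simply parks freed blocks by (bytes, alignment) so that a later request of the same shape skips the upstream resource.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory_resource>
#include <utility>
#include <vector>

// Hypothetical caching resource: a freed block is parked in a free list keyed
// by (bytes, alignment) and handed back on the next matching request.
class caching_resource : public std::pmr::memory_resource {
 public:
  explicit caching_resource(
      std::pmr::memory_resource* upstream = std::pmr::new_delete_resource())
      : upstream_(upstream) {}

  caching_resource(const caching_resource&) = delete;
  caching_resource& operator=(const caching_resource&) = delete;

  ~caching_resource() override {
    // Return every parked block upstream; blocks still held by users are not tracked.
    for (auto& entry : free_)
      for (void* p : entry.second)
        upstream_->deallocate(p, entry.first.first, entry.first.second);
  }

  std::size_t upstream_allocations() const { return upstream_allocations_; }

 private:
  using key = std::pair<std::size_t, std::size_t>;  // (bytes, alignment)

  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    auto it = free_.find(key(bytes, alignment));
    if (it != free_.end() && !it->second.empty()) {
      void* p = it->second.back();  // cache hit: reuse a parked block
      it->second.pop_back();
      return p;
    }
    ++upstream_allocations_;  // cache miss: ask the upstream resource
    return upstream_->allocate(bytes, alignment);
  }

  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    free_[key(bytes, alignment)].push_back(p);  // park the block instead of freeing it
  }

  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  std::pmr::memory_resource* upstream_;
  std::map<key, std::vector<void*>> free_;
  std::size_t upstream_allocations_ = 0;
};
```

Two std::pmr::vector<int> objects of the same length, created one after the other on the same caching_resource, request identical (bytes, alignment) pairs, so only the first one reaches the upstream resource. The resource is the stateful object and the std::pmr::polymorphic_allocator inside each vector is just a cheap handle to it, which mirrors the resource/allocator split the notes describe.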
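Several of the 1.9.4 type traits backport newer standard traits to C++11. Below is a minimal C++11 sketch of how remove_cvref, void_t and the C++17 logical metafunctions are commonly written; every name lives in a hypothetical sketch namespace and is not copied from Thrust's headers.

```cpp
#include <type_traits>

namespace sketch {

// C++20 std::remove_cvref, written with C++11 traits only.
template <typename T>
struct remove_cvref {
  typedef typename std::remove_cv<typename std::remove_reference<T>::type>::type type;
};
template <typename T>
using remove_cvref_t = typename remove_cvref<T>::type;

// C++17 std::void_t; the struct indirection guards against older compilers
// that ignored unused alias template parameters during SFINAE.
template <typename... Ts> struct make_void { typedef void type; };
template <typename... Ts> using void_t = typename make_void<Ts...>::type;

// C++17 std::conjunction: short-circuits at the first false trait.
template <typename... Bs> struct conjunction : std::true_type {};
template <typename B> struct conjunction<B> : B {};
template <typename B, typename... Bs>
struct conjunction<B, Bs...>
    : std::conditional<bool(B::value), conjunction<Bs...>, B>::type {};

// C++17 std::disjunction: short-circuits at the first true trait.
template <typename... Bs> struct disjunction : std::false_type {};
template <typename B> struct disjunction<B> : B {};
template <typename B, typename... Bs>
struct disjunction<B, Bs...>
    : std::conditional<bool(B::value), B, disjunction<Bs...>>::type {};

// C++17 std::negation.
template <typename B>
struct negation : std::integral_constant<bool, !bool(B::value)> {};

// Typical void_t use: detect a nested value_type member.
template <typename T, typename = void>
struct has_value_type : std::false_type {};
template <typename T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type {};

}  // namespace sketch
```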
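The 1.9.4 tuple algorithms visit or map every element of a std::tuple. A minimal C++14 sketch of the usual index-sequence technique follows; tuple_for_each and tuple_transform here are hypothetical stand-ins in a sketch namespace, and their signatures are assumptions rather than those of thrust::tuple_for_each and thrust::tuple_transform.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

namespace sketch {

template <typename Tuple, typename F, std::size_t... Is>
void tuple_for_each_impl(Tuple&& t, F& f, std::index_sequence<Is...>) {
  // Braced-init-list expansion guarantees left-to-right evaluation;
  // the void cast sidesteps any overloaded comma operator.
  int ignore[] = {0, (static_cast<void>(f(std::get<Is>(std::forward<Tuple>(t)))), 0)...};
  (void)ignore;
}

// Calls f on each element of t, in order.
template <typename Tuple, typename F>
void tuple_for_each(Tuple&& t, F f) {
  constexpr std::size_t N = std::tuple_size<std::decay_t<Tuple>>::value;
  tuple_for_each_impl(std::forward<Tuple>(t), f, std::make_index_sequence<N>{});
}

template <typename Tuple, typename F, std::size_t... Is>
auto tuple_transform_impl(const Tuple& t, F& f, std::index_sequence<Is...>) {
  // Note: the order in which f is called here is unspecified.
  return std::make_tuple(f(std::get<Is>(t))...);
}

// Returns a new tuple holding f applied to each element of t.
template <typename Tuple, typename F>
auto tuple_transform(const Tuple& t, F f) {
  constexpr std::size_t N = std::tuple_size<Tuple>::value;
  return tuple_transform_impl(t, f, std::make_index_sequence<N>{});
}

}  // namespace sketch
```

With a generic lambda, each element keeps its own type: applying x * 2 to a tuple of int, double and long yields another tuple of int, double and long.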
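Two of the 1.9.4 preprocessor facilities rest on well-known tricks: token pasting needs an extra expansion level so that macro arguments are expanded before ## runs, and head/tail macros peel a variadic argument list. The SKETCH_PP_* macros below are hypothetical illustrations of those tricks, not Thrust's definitions of THRUST_PP_CAT2, THRUST_PP_HEAD or THRUST_PP_TAIL; this tail macro also assumes at least two arguments.

```cpp
#include <cassert>

// Two-level concatenation: the outer macro lets its arguments be
// macro-expanded before the inner one pastes them with ##.
#define SKETCH_PP_CAT2_IMPL(a, b) a##b
#define SKETCH_PP_CAT2(a, b) SKETCH_PP_CAT2_IMPL(a, b)

// First argument of a variadic list (the trailing ~ keeps __VA_ARGS__ non-empty).
#define SKETCH_PP_HEAD_IMPL(head, ...) head
#define SKETCH_PP_HEAD(...) SKETCH_PP_HEAD_IMPL(__VA_ARGS__, ~)

// Everything after the first argument (needs at least two arguments here).
#define SKETCH_PP_TAIL_IMPL(head, ...) __VA_ARGS__
#define SKETCH_PP_TAIL(...) SKETCH_PP_TAIL_IMPL(__VA_ARGS__)

#define SKETCH_SUFFIX 2

// SKETCH_PP_CAT2_IMPL(value, SKETCH_SUFFIX) would paste valueSKETCH_SUFFIX;
// the extra level expands SKETCH_SUFFIX first and yields value2.
constexpr int value2 = 42;
constexpr int cat_result = SKETCH_PP_CAT2(value, SKETCH_SUFFIX);

constexpr int head_result = SKETCH_PP_HEAD(7, 8, 9);

constexpr int sum2(int a, int b) { return a + b; }
constexpr int tail_result = sum2(SKETCH_PP_TAIL(1, 20, 30));  // sum2(20, 30)
```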
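The 1.9.4 notes state that THRUST_FWD(x) expands to ::std::forward<decltype(x)>(x), and that the THRUST_DECLTYPE_RETURNS* family produces a conditional noexcept plus a trailing return type. The sketch below defines look-alike macros under hypothetical SKETCH_ names to show the pattern; the exact shape of SKETCH_DECLTYPE_RETURNS is an assumption, and is_lvalue and forward_to_probe are illustrative helpers.

```cpp
#include <cassert>
#include <utility>

// Same expansion the release notes give for THRUST_FWD(x).
#define SKETCH_FWD(x) ::std::forward<decltype(x)>(x)

// Hypothetical look-alike of a DECLTYPE_RETURNS helper: the body expression is
// written once and reused for the noexcept specifier and the return type.
#define SKETCH_DECLTYPE_RETURNS(...)  \
  noexcept(noexcept(__VA_ARGS__))     \
      ->decltype(__VA_ARGS__) {       \
    return (__VA_ARGS__);             \
  }

// Reports whether it received an lvalue (true) or an rvalue (false).
inline bool is_lvalue(int&) { return true; }
inline bool is_lvalue(int&&) { return false; }

// Forwards its argument unchanged; decltype(x) is T&&, so an rvalue
// argument stays an rvalue and an lvalue argument stays an lvalue.
template <typename T>
auto forward_to_probe(T&& x) SKETCH_DECLTYPE_RETURNS(is_lvalue(SKETCH_FWD(x)))
```

The decltype(x) form is what makes such a macro usable inside generic lambdas, where a parameter declared auto&& has no named template parameter to pass to std::forward.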
Thrust 1.9.3 (CUDA Toolkit 10.0)
Thrust 1.9.3 unifies and integrates CUDA Thrust and GitHub Thrust.
Bug Fixes
- #725, #850, #855, #859, #860: Unify the thrust::iter_swap interface and fix thrust::device_reference swapping.
- NVBug 2004663: Add a data method to thrust::detail::temporary_array and refactor temporary memory allocation in the CUDA backend to be exception and leak safe.
- #886, #894, #914: Various documentation typo fixes.
- #724: Provide NVVMIR_LIBRARY_DIR environment variable to NVCC.
- #878: Optimize thrust::min/max_element to only use thrust::detail::get_iterator_value for non-numeric types.
- #899: Make thrust::cuda::experimental::pinned_allocator's comparison operators const.
- NVBug 2092152: Remove all includes of <cuda.h>.
- #911: Fix default comparator element type for thrust::merge_by_key.
Acknowledgments
- Thanks to Andrew Corrigan for contributing fixes for swapping interfaces.
- Thanks to Francisco Facioni for contributing optimizations for thrust::min/max_element.
Thrust 1.9.2 (CUDA Toolkit 9.2)
Thrust 1.9.2 brings a variety of performance enhancements, bug fixes and test improvements. CUB 1.7.5 was integrated, enhancing the performance of thrust::sort on small data types and thrust::reduce. Changes were applied to thrust::complex to optimize memory access. Thrust now compiles with compiler warnings enabled and treated as errors. Additionally, the unit test suite and framework were enhanced to increase coverage.
Breaking Changes
- The fallback_allocator example was removed, as it was buggy and difficult to support.
New Features
- <thrust/detail/alignment.h>, utilities for memory alignment:
  - thrust::aligned_reinterpret_cast.
  - thrust::aligned_storage_size, which computes the amount of storage needed for an object of a particular size and alignment.
  - thrust::alignment_of, a C++03 implementation of C++11's std::alignment_of.
  - thrust::aligned_storage, a C++03 implementation of C++11's std::aligned_storage.
  - thrust::max_align_t, a C++03 implementation of C++11's std::max_align_t.
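thrust::aligned_storage_size is described as computing the storage needed for an object of a given size and alignment. A common way to compute such a value, assumed here rather than taken from <thrust/detail/alignment.h>, is to round the size up to the next multiple of the alignment; aligned_storage_size_sketch is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: the smallest multiple of `align` that is >= `n`.
// Assumes align > 0.
constexpr std::size_t aligned_storage_size_sketch(std::size_t n, std::size_t align) {
  return ((n + align - 1) / align) * align;
}

static_assert(aligned_storage_size_sketch(1, 8) == 8, "rounds up to one full alignment unit");
static_assert(aligned_storage_size_sketch(8, 8) == 8, "an exact multiple is unchanged");
static_assert(aligned_storage_size_sketch(13, 4) == 16, "next multiple of 4");
```

Integer division does the rounding, so the helper works for any positive alignment; when the alignment is a power of two, a bit mask such as (n + align - 1) & ~(align - 1) gives the same result.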
Bug Fixes
- NVBug 200385527, NVBug 200385119, NVBug 200385113, NVBug 200349350, NVBug 2058778: Various compiler warning issues.
- NVBug 200355591: thrust::reduce performance issues.
- NVBug 2053727: Fixed an ADL bug that caused user-supplied allocate to be overlooked but deallocate to be called with GCC <= 4.3.
- NVBug 1777043: Fixed thrust::complex to work with thrust::sequence.
Thrust 1.9.1-2 (CUDA Toolkit 9.1)
Thrust 1.9.1 integrates version 1.7.4 of CUB and introduces a new CUDA backend for thrust::reduce based on CUB.
Bug Fixes
- NVBug 1965743: Remove unnecessary static qualifiers.
- NVBug 1940974: Fix regression causing a compilation error when using thrust::merge_by_key with thrust::constant_iterators.
- NVBug 1904217: Allow callables that take non-const refs to be used with thrust::reduce and thrust::*_scan.
Thrust 1.9.0-5 (CUDA Toolkit 9.0)
Thrust 1.9.0 replaces the original CUDA backend (bulk) with a new one written using CUB, a high performance CUDA collectives library. This brings a substantial performance improvement to the CUDA backend across the board.
Breaking Changes
- Any code depending on CUDA backend implementation details will likely be broken.
New Features
- New CUDA backend based on CUB which delivers substantially higher performance.
- thrust::transform_output_iterator, a fancy iterator that applies a function to the output before storing the result.
New Examples
- transform_output_iterator demonstrates use of the new fancy iterator thrust::transform_output_iterator.
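thrust::transform_output_iterator applies a function to each value before storing it. The same idea can be sketched as a small standard-C++ output iterator whose dereference returns an assignment proxy; transform_output_iterator_sketch, make_transform_output_iterator_sketch and negated_copy are hypothetical, host-only, and not Thrust's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical output iterator: `*it = v` stores f(v) through the wrapped iterator.
template <typename F, typename OutputIt>
class transform_output_iterator_sketch {
 public:
  using iterator_category = std::output_iterator_tag;
  using value_type = void;
  using difference_type = std::ptrdiff_t;
  using pointer = void;
  using reference = void;

  transform_output_iterator_sketch(OutputIt out, F f) : out_(out), f_(f) {}

  // Proxy returned by operator*: assigning to it applies f and writes the result.
  struct proxy {
    transform_output_iterator_sketch* self;
    template <typename T>
    proxy& operator=(T&& value) {
      *self->out_ = self->f_(std::forward<T>(value));
      return *this;
    }
  };

  proxy operator*() { return proxy{this}; }
  transform_output_iterator_sketch& operator++() { ++out_; return *this; }
  transform_output_iterator_sketch operator++(int) {
    transform_output_iterator_sketch previous = *this;
    ++out_;
    return previous;
  }

 private:
  OutputIt out_;
  F f_;
};

template <typename F, typename OutputIt>
transform_output_iterator_sketch<F, OutputIt>
make_transform_output_iterator_sketch(OutputIt out, F f) {
  return transform_output_iterator_sketch<F, OutputIt>(out, f);
}

// Usage: negate every input value while copying it into the result.
inline std::vector<int> negated_copy(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::copy(in.begin(), in.end(),
            make_transform_output_iterator_sketch(out.begin(), std::negate<int>()));
  return out;
}
```

The proxy is what lets a plain std::copy drive the transformation: the algorithm only ever writes `*result = value`, and the proxy intercepts that assignment.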
Other Enhancements
- When C++11 is enabled, functors do not have to inherit from thrust::(unary|binary)_function anymore to be used with thrust::transform_iterator.
- Added C++11-only move constructors and move assignment operators for thrust::detail::vector_base-based classes, e.g. thrust::host_vector, thrust::device_vector, and friends.
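The first enhancement above rests on C++11 return type deduction: instead of reading a result_type typedef inherited from thrust::unary_function, a library can ask the compiler what calling the functor yields. A minimal sketch of that technique with decltype and std::declval follows; result_of_call_sketch, half and apply_at are hypothetical names, and Thrust's actual mechanism may differ.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical trait: the type produced by calling F with an argument of type Arg.
template <typename F, typename Arg>
struct result_of_call_sketch {
  typedef decltype(std::declval<F&>()(std::declval<Arg>())) type;
};

// A plain functor: no base class and no result_type typedef.
struct half {
  double operator()(int x) const { return x / 2.0; }
};

static_assert(std::is_same<result_of_call_sketch<half, int>::type, double>::value,
              "deduced from operator() rather than from a result_type typedef");

// A transform-iterator-style dereference whose return type is deduced.
template <typename F, typename It>
typename result_of_call_sketch<F, decltype(*std::declval<It>())>::type
apply_at(F f, It it) {
  return f(*it);
}
```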
Bug Fixes
- sin(thrust::complex<double>) no longer has precision loss to float.
Acknowledgments
- Thanks to Manuel Schiller for contributing a C++11-based enhancement regarding the deduction of functor return types, improving the performance of thrust::unique and implementing thrust::transform_output_iterator.
- Thanks to Thibault Notargiacomo for the implementation of move semantics for the thrust::vector_base-based classes.
- Thanks to Duane Merrill for developing CUB and helping to integrate it into Thrust's backend.