Releases: NVIDIA/thrust
Thrust 1.8.3 (CUDA Toolkit 8.0)
Thrust 1.8.3 is a small bug fix release.
New Examples
- `range_view` demonstrates the use of a view (a non-owning wrapper for an iterator range with a container-like interface).
Bug Fixes
- `thrust::(min|max|minmax)_element` can now accept raw device pointers when an explicit device execution policy is used.
- `thrust::clear` operations on vector types no longer require the element type to have a default constructor.
Thrust 1.8.2 (CUDA Toolkit 7.5)
Thrust 1.8.2 is a small bug fix release.
Bug Fixes
- Avoid warnings and errors concerning user functions called from `__host__ __device__` functions
- #632 CUDA `set_intersection_by_key` error
- #651 `thrust::copy` between host & device is not interoperable with `thrust::cuda::par.on(stream)`
- #664 CUDA `for_each` ignores execution policy's stream
Known Issues
- #628 CUDA's `reduce_by_key` fails on `sm_50` devices
Thrust 1.8.1 (CUDA Toolkit 7.0)
Thrust 1.8.1 is a small bug fix release.
Bug Fixes
- CUDA `thrust::for_each` accesses illegal memory locations when given a large range.
Thrust 1.8.0
Thrust 1.8.0 introduces support for algorithm invocation from CUDA `__device__` code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA `__device__` code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The `thrust::seq` execution policy allows users to require sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The `.on(stream)` syntax allows users to request a CUDA stream for kernels launched during algorithm execution. Finally, new CUDA algorithm implementations provide substantial performance improvements.
New Features
- Algorithms in CUDA `__device__` code

  Thrust algorithms may now be invoked from CUDA `__device__` and `__host__ __device__` functions. Algorithms invoked in this manner must be invoked with an execution policy as the first parameter:

  ```cpp
  __device__ void my_device_sort(int *data, size_t n)
  {
      thrust::sort(thrust::device, data, data + n);
  }
  ```

  The following execution policies are supported in CUDA `__device__` code:

  - `thrust::seq`
  - `thrust::cuda::par`
  - `thrust::device`, when `THRUST_DEVICE_SYSTEM == THRUST_DEVICE_SYSTEM_CUDA`

  Parallel algorithm execution may not be accelerated unless CUDA Dynamic Parallelism is available.
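As an illustrative sketch (the kernel and segment layout here are hypothetical, not from the release notes), a kernel might sort per-block segments using the sequential policy, which requires no nested kernel launch:

```cpp
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

// Hypothetical kernel: each block sorts its own contiguous segment.
// thrust::seq runs the algorithm sequentially in the calling thread.
__global__ void sort_segments(int *data, size_t segment_size)
{
    if (threadIdx.x == 0)
    {
        int *begin = data + blockIdx.x * segment_size;
        thrust::sort(thrust::seq, begin, begin + segment_size);
    }
}
```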
- Execution Policies

  - CUDA Streams

    The `thrust::cuda::par.on(stream)` syntax allows users to request that CUDA `__global__` functions launched during algorithm execution should occur on a given stream:

    ```cpp
    // execute for_each on stream s
    thrust::for_each(thrust::cuda::par.on(s), begin, end, my_functor);
    ```

    Algorithms executed with a CUDA stream in this manner may still synchronize with other streams when allocating temporary storage or returning results to the CPU.

  - `thrust::seq`

    The `thrust::seq` execution policy allows users to require that an algorithm execute sequentially in the calling thread:

    ```cpp
    // execute for_each sequentially in this thread
    thrust::for_each(thrust::seq, begin, end, my_functor);
    ```
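A more complete sketch of the stream syntax (the functor and function names are hypothetical; stream management is ordinary CUDA runtime code):

```cpp
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

// Hypothetical functor applied to each element.
struct increment
{
    __device__ void operator()(int &x) const { ++x; }
};

void increment_on_stream(int *d_first, int *d_last)
{
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Kernels launched by for_each are requested on stream s.
    thrust::for_each(thrust::cuda::par.on(s), d_first, d_last, increment());

    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}
```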
- Other

  - The new `thrust::complex` template provides complex number support.
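For illustration, a minimal sketch (functor name hypothetical) of `thrust::complex` used in a device transformation:

```cpp
#include <thrust/complex.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Hypothetical functor: squares a complex number on the device.
struct square
{
    __host__ __device__
    thrust::complex<float> operator()(const thrust::complex<float> &z) const
    {
        return z * z;
    }
};

int main()
{
    thrust::device_vector<thrust::complex<float> > v(4, thrust::complex<float>(1.0f, 2.0f));
    thrust::transform(v.begin(), v.end(), v.begin(), square());
    return 0;
}
```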
New Examples
- `simple_cuda_streams` demonstrates how to request a CUDA stream during algorithm execution.
- `async_reduce` demonstrates ways to achieve algorithm invocations which are asynchronous with the calling thread.
Other Enhancements
- CUDA `sort` performance for user-defined types is 300% faster on Tesla K20c for large problem sizes.
- CUDA `merge` performance is 200% faster on Tesla K20c for large problem sizes.
- CUDA `sort` performance for primitive types is 50% faster on Tesla K20c for large problem sizes.
- CUDA `reduce_by_key` performance is 25% faster on Tesla K20c for large problem sizes.
- CUDA `scan` performance is 15% faster on Tesla K20c for large problem sizes.
- `fallback_allocator` example is simpler.
Bug Fixes
- #364 iterators with unrelated system tags may be used with algorithms invoked with an execution policy
- #371 do not redefine `__CUDA_ARCH__`
- #379 fix crash when dereferencing `transform_iterator` on the CPU
- #391 avoid use of uppercase variable names
- #392 fix `thrust::copy` between `cusp::complex` & `std::complex`
- #396 program compiled with gcc < 4.3 hangs during comparison sort
- #406 `fallback_allocator.cu` example checks device for unified addressing support
- #417 avoid using `std::less<T>` in binary search algorithms
- #418 avoid various warnings
- #443 including `version.h` no longer configures default systems
- #578 nvcc produces warnings when sequential algorithms are used with cpu systems
Known Issues
- When invoked with primitive data types, `thrust::sort`, `thrust::sort_by_key`, `thrust::stable_sort`, and `thrust::stable_sort_by_key` may fail to link in some cases with `nvcc -rdc=true`.
- The CUDA implementation of `thrust::reduce_by_key` incorrectly outputs the last element in a segment of equivalent keys instead of the first.
Acknowledgments
- Thanks to Sean Baxter for contributing faster CUDA `reduce`, `merge`, and `scan` implementations.
- Thanks to Duane Merrill for contributing a faster CUDA radix sort implementation.
- Thanks to Filipe Maia for contributing the implementation of `thrust::complex`.
Thrust 1.7.2 (CUDA Toolkit 6.5)
Thrust 1.7.2 is a minor bug fix release.
Bug Fixes
- Avoid use of `std::min` in generic find implementation.
Thrust 1.7.1 (CUDA Toolkit 6.0)
Thrust 1.7.1 is a minor bug fix release.
Bug Fixes
- Eliminate identifiers in `set_operations.cu` example with leading underscore
- Eliminate unused variable warning in CUDA `reduce_by_key` implementation
- Avoid deriving function objects from `std::unary_function` and `std::binary_function`
Thrust 1.7.0 (CUDA Toolkit 5.5)
Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of `thrust::merge` and the set operation algorithms have been added, as well as stencil versions of the partitioning algorithms. `thrust::tabulate` has been introduced to tabulate the values of functions taking integers. For 32b types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB `reduce_by_key` implementation provides 80% faster performance.
Breaking API Changes
- Dispatch

  - Custom user backend systems' tag types must now inherit from the corresponding system's `execution_policy` template (e.g. `thrust::cuda::execution_policy`) instead of the tag `struct` (e.g. `thrust::cuda::tag`). Otherwise, algorithm specializations will silently go unfound during dispatch. See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.
  - `thrust::advance` and `thrust::distance` are no longer dispatched based on iterator system type and thus may no longer be customized.
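A minimal sketch of the new pattern, with a purely illustrative backend type (see examples/minimal_custom_backend.cu for the authoritative version):

```cpp
#include <thrust/execution_policy.h>

// Hypothetical custom backend: the tag type now derives from an
// execution_policy template rather than from a bare tag struct.
struct my_system : thrust::execution_policy<my_system> {};

// Algorithm specializations are found by argument-dependent lookup
// on the policy type during dispatch.
template <typename Iterator, typename Function>
Iterator for_each(my_system, Iterator first, Iterator last, Function f)
{
    // ... custom implementation would go here ...
    return last;
}
```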
- Iterators

  - `iterator_facade` and `iterator_adaptor`'s `Pointer` template parameters have been eliminated.
  - `iterator_adaptor` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_adaptor`).
  - `iterator_facade` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_facade`).
  - `iterator_core_access` has been moved into the `thrust` namespace (previously `thrust::experimental::iterator_core_access`).
  - All iterators' nested `pointer` typedef (the type of the result of `operator->`) is now `void` instead of a pointer type to indicate that such expressions are currently impossible.
  - Floating point `counting_iterator`s' nested `difference_type` typedef is now a signed integral type instead of a floating point type.
- Other

  - `normal_distribution` has been moved into the `thrust::random` namespace (previously `thrust::random::experimental::normal_distribution`).
  - Placeholder expressions may no longer include the comma operator.
New Features
- Execution Policies

  - Users may directly control the dispatch of algorithm invocations with optional execution policy arguments. For example, instead of wrapping raw pointers allocated by `cudaMalloc` with `thrust::device_ptr`, the `thrust::device` execution policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
  - The following execution policies are supported in this version:
    - `thrust::host`
    - `thrust::device`
    - `thrust::cpp::par`
    - `thrust::cuda::par`
    - `thrust::omp::par`
    - `thrust::tbb::par`
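For example, a sketch (error handling omitted) of running an algorithm directly on `cudaMalloc` storage via the `thrust::device` policy:

```cpp
#include <thrust/fill.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

int main()
{
    int *raw = 0;
    cudaMalloc(&raw, 1024 * sizeof(int));

    // No thrust::device_ptr wrapper is needed: the execution policy
    // selects the CUDA backend directly.
    thrust::fill(thrust::device, raw, raw + 1024, 7);

    cudaFree(raw);
    return 0;
}
```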
- Algorithms

  - `free`
  - `get_temporary_buffer`
  - `malloc`
  - `merge_by_key`
  - `partition` with stencil
  - `partition_copy` with stencil
  - `return_temporary_buffer`
  - `set_difference_by_key`
  - `set_intersection_by_key`
  - `set_symmetric_difference_by_key`
  - `set_union_by_key`
  - `stable_partition` with stencil
  - `stable_partition_copy` with stencil
  - `tabulate`
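As a brief sketch of one of the new algorithms, `thrust::tabulate` assigns `f(i)` to the i-th element of a range; here `f` is a placeholder expression:

```cpp
#include <thrust/tabulate.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>

int main()
{
    thrust::device_vector<int> v(5);

    // v[i] = i * i, using a placeholder expression.
    thrust::tabulate(v.begin(), v.end(),
                     thrust::placeholders::_1 * thrust::placeholders::_1);
    // v is now {0, 1, 4, 9, 16}
    return 0;
}
```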
New Examples
- `uninitialized_vector` demonstrates how to use a custom allocator to avoid the automatic initialization of elements in `thrust::device_vector`.
Other Enhancements
- Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their `execution_policy` parameter.
- Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as `thrust::device`.
- `THRUST_DEVICE_SYSTEM_CPP` has been added as a compile-time target for the device backend.
- CUDA `merge` performance is 2-15x faster.
- CUDA comparison sort performance is 1.3-4x faster.
- CUDA set operation performance is 1.5-15x faster.
- TBB `reduce_by_key` performance is 80% faster.
- Several algorithms have been parallelized with TBB.
- Support for user allocators in vectors has been improved.
- The `sparse_vector` example is now implemented with `merge_by_key` instead of `sort_by_key`.
- Warnings have been eliminated in various contexts.
- Warnings about `__host__` or `__device__`-only functions called from `__host__ __device__` functions have been eliminated in various contexts.
- Documentation about algorithm requirements has been improved.
- Simplified the `minimal_custom_backend` example.
- Simplified the `cuda/custom_temporary_allocation` example.
- Simplified the `cuda/fallback_allocator` example.
Bug Fixes
- #248 fix broken `counting_iterator<float>` behavior with OpenMP
- #231, #209 fix set operation failures with CUDA
- #187 fix incorrect occupancy calculation with CUDA
- #153 fix broken multigpu behavior with CUDA
- #142 eliminate warning produced by `thrust::random::taus88` and MSVC 2010
- #208 correctly initialize elements in temporary storage when necessary
- #16 fix compilation error when sorting bool with CUDA
- #10 fix ambiguous overloads of `reinterpret_tag`
Known Issues
- `g++` versions 4.3 and lower may fail to dispatch `thrust::get_temporary_buffer` correctly, causing infinite recursion in examples such as `cuda/custom_temporary_allocation`.
Acknowledgments
- Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
- Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
- Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.
Thrust 1.5.3 (CUDA Toolkit 5.0)
Thrust 1.5.3 is a minor bug fix release.
Bug Fixes
- Avoid warnings about potential race due to `__shared__` non-POD variable
Thrust 1.6.0
Thrust 1.6.0 provides an interface for customization and extension and a new backend system based on the Threading Building Blocks library. With this new interface, programmers may customize the behavior of specific algorithms as well as control the allocation of temporary storage or invent entirely new backends. These enhancements also allow multiple different backend systems such as CUDA and OpenMP to coexist within a single program. Support for TBB allows Thrust programs to integrate more naturally into applications which may already employ the TBB task scheduler.
Breaking Changes
- The header <thrust/experimental/cuda/pinned_allocator.h> has been moved to <thrust/system/cuda/experimental/pinned_allocator.h>
- thrust::experimental::cuda::pinned_allocator has been moved to thrust::cuda::experimental::pinned_allocator
- The macro THRUST_DEVICE_BACKEND has been renamed THRUST_DEVICE_SYSTEM
- The macro THRUST_DEVICE_BACKEND_CUDA has been renamed THRUST_DEVICE_SYSTEM_CUDA
- The macro THRUST_DEVICE_BACKEND_OMP has been renamed THRUST_DEVICE_SYSTEM_OMP
- thrust::host_space_tag has been renamed thrust::host_system_tag
- thrust::device_space_tag has been renamed thrust::device_system_tag
- thrust::any_space_tag has been renamed thrust::any_system_tag
- thrust::iterator_space has been renamed thrust::iterator_system
New Features
- Backend Systems
- Threading Building Blocks (TBB) is now supported
- Algorithms
  - `thrust::for_each_n`
  - `thrust::raw_reference_cast`
- Types
  - `thrust::pointer`
  - `thrust::reference`
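A short sketch (functor name hypothetical) of the new `thrust::for_each_n`, which applies a functor to the first n elements of a range:

```cpp
#include <thrust/for_each.h>
#include <thrust/device_vector.h>

// Hypothetical functor: negates an element in place.
struct negate_in_place
{
    __host__ __device__ void operator()(int &x) const { x = -x; }
};

int main()
{
    thrust::device_vector<int> v(10, 1);

    // Apply the functor to the first 4 elements only.
    thrust::for_each_n(v.begin(), 4, negate_in_place());
    return 0;
}
```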
New Examples
- `cuda/custom_temporary_allocation`
- `cuda/fallback_allocator`
- `device_ptr`
- `expand`
- `minimal_custom_backend`
- `raw_reference_cast`
- `set_operations`
Other Enhancements
- thrust::for_each now returns the end of the input range similar to most other algorithms
- thrust::pair and thrust::tuple have swap functionality
- All CUDA algorithms now support large data types
- Iterators may be dereferenced in user `__device__` or `__global__` functions
- The safe use of different backend systems is now possible within a single binary
Bug Fixes
- #469 `min_element` and `max_element` algorithms no longer require a const comparison operator
Known Issues
- NVCC may crash when parsing TBB headers on Windows.
Thrust 1.5.2 (CUDA Toolkit 4.2)
Thrust 1.5.2 is a minor bug fix release.
Bug Fixes
- Fixed warning about C-style initialization of structures