Releases: JuliaGPU/AMDGPU.jl
v0.4.9
AMDGPU v0.4.9
Closed issues:
- State of queues and streams (#337)
- rocBLAS: Remove old hand-wrapped code (#384)
- HSA memory fault upon switching from default device on multi-GPU node (#385)
- Test fail locally with AssertionError: AMDGPU.Runtime.LOGGING_STATIC_ENABLED (#399)
Merged pull requests:
- Switch to task-focused synchronization model (#374) (@jpsamaroo)
- Use broadcast instead of copies to initialize mapreduce buffers. (#390) (@maleadt)
- tests: Skip logging tests if disabled (#391) (@jpsamaroo)
- Add blas wrappers for triangular matrix mul / div (#392) (@pxl-th)
- Simplify signal pooling (#393) (@pxl-th)
- Adapt to GPUCompiler 0.18 (#394) (@pxl-th)
- Reduce memory usage (#395) (@pxl-th)
- Add support for KernelAbstraction 0.9 (#398) (@vchuravy)
- Update to GPUCompiler 0.19 & LLVM 5 (#407) (@pxl-th)
- Fix compiler timespan logging (#408) (@pxl-th)
- rocBLAS: define highlevel dot, gemm, axpy functions for FP16 (#409) (@pxl-th)
- Add KernelAbstractions.jl unsafe_free! (#410) (@pxl-th)
v0.4.8
AMDGPU v0.4.8
Merged pull requests:
- ROCSignal: Pool signals in ctor (#369) (@jpsamaroo)
- Reduce allocations (#376) (@pxl-th)
- Report and exit on memory fault (#379) (@jpsamaroo)
- versioninfo: Indicate if using JLLs or System (#381) (@jpsamaroo)
- ROCSignal: Disable IPC by default (#383) (@jpsamaroo)
v0.4.7
v0.4.6
AMDGPU v0.4.6
Closed issues:
- Implement occupancy API (#271)
- getinfo should determine the Ref output container automatically (#273)
Merged pull requests:
- Add timespan logging via TimespanLogging.jl (#263) (@jpsamaroo)
- Add occupancy API and groupsize tuning (#326) (@jpsamaroo)
- Reduce signal wait allocations (#361) (@jpsamaroo)
- Add more intrinsics, enable always_inline (#362) (@jpsamaroo)
- Simplify math intrinsics (#363) (@pxl-th)
- Implement unified getinfo interface (#364) (@jpsamaroo)
- Assorted fixes (#365) (@jpsamaroo)
- Add memory allocation limiters (#366) (@jpsamaroo)
- Specify return types for getinfo calls (#368) (@pxl-th)
v0.4.5
AMDGPU v0.4.5
Closed issues:
- Mem.alloc: Allow using hipMalloc to service allocations (#286)
- rocBLAS GEMM ignores @view (#319)
- sincospi intrinsic is broken (#334)
- #jps/dev segfaults on MI250x (#340)
- method ambiguity in rand! (#343)
- Add function or macro for AMDGPU.jl equivalent to CUDA.CuDynamicSharedArray (and CUDA.CuStaticSharedArray) (#347)
- Free KernelState in finalizer (#352)
- --check-bounds=no is broken on Julia 1.9.0-beta3 (#354)
Merged pull requests:
- Fix GEMM (regular & batched) and support batched GEMM for 3D array (#318) (@pxl-th)
- Add MIOpen (#320) (@pxl-th)
- Add support for 2D * 3D batched GEMM (#321) (@pxl-th)
- Support NNlib batched gemm format (#322) (@pxl-th)
- Add pointer() method for ROCArray and some library tests (#323) (@torrance)
- Fix double unsafe_free calls (#324) (@jpsamaroo)
- Mem: Allow using hipMalloc/hipFree for allocations (#325) (@jpsamaroo)
- Cast to Ptr before checking NULL pointer (#328) (@torrance)
- Resize! support (#333) (@matinraayai)
- Add sincos/sincospi/frexp/ldexp intrinsics (#336) (@jpsamaroo)
- Add local memory allocation helpers (#348) (@jpsamaroo)
- Add GPUCompiler 0.17 to compat (#349) (@jpsamaroo)
- Preserve UInt32 in indexing intrinsics (#351) (@pxl-th)
- Fix unsafe_free! not actually freeing (#353) (@jpsamaroo)
- Don't sync on default HIP stream every time (#356) (@pxl-th)
- Make alignment generated (#358) (@pxl-th)
- tests: Properly unwrap Distributed exceptions (#359) (@jpsamaroo)
v0.4.4
AMDGPU v0.4.4
Closed issues:
- Repetitive AMDGPU.ones calls crash runtime (#299)
- Add AMDGPU.jl equivalent to CUDA.CuDynamicSharedArray (and CUDA.CuStaticSharedArray) (#304)
- Segfault with basic kernel from AMDGPU.jl doc on LUMI (#308)
- ROC kernel faulting upon having AMDGPU and CUDA loaded (#312)
- AMDGPU.rand failing to create a ROCArray (#315)
Merged pull requests:
- Remove waiter and error monitor threads (#306) (@pxl-th)
- Update bindeps search path (#307) (@luraess)
- Prioritise ENV var to use or not artifacts (#310) (@luraess)
- Add dynamic local memory support (#311) (@jpsamaroo)
- random: Load definitions without rocRAND (#316) (@jpsamaroo)
v0.4.3
AMDGPU v0.4.3
Closed issues:
- Queue selection test fail (#274)
Merged pull requests:
- Add device quirks from CUDA.jl, enhance at-rocprintf (#269) (@jpsamaroo)
- Use an optimized norm function for ROCBLASArray (#282) (@amontoison)
- Add rocBLAS_jll and rocSPARSE_jll deps (#284) (@jpsamaroo)
- active_kernels: Use WeakKeyDict (#285) (@jpsamaroo)
- CI: Add gfx90a to more jobs (#289) (@jpsamaroo)
- build: Remove build step, run at toplevel (#290) (@jpsamaroo)
- Mapreducedim support for AnyROCArray (#291) (@matinraayai)
- Parallelize tests (#293) (@jpsamaroo)
- Fix precompilation (#294) (@pxl-th)
- Do not rethrow EOF (#296) (@pxl-th)
- Use correct queue for kernels (#297) (@pxl-th)
- Implement kernel hashing system (#302) (@jpsamaroo)
v0.4.2
AMDGPU v0.4.2
Closed issues:
- build failure on Julia 1.8.1 (#278)
Merged pull requests:
- Mem: Retry failing allocations (#251) (@jpsamaroo)
- Add device-to-device unsafe_copy3d test (#260) (@luraess)
- Fix allocation retry mechanism, add slow allocation fallback (#262) (@jpsamaroo)
- Run wavefront tests with detected wavefrontsize (#264) (@torrance)
- During HostCall, ensure device has finished using buffers before freeing (#266) (@torrance)
- Expand fft tests (#267) (@torrance)
- Remove code that duplicates AbstractFFTs; add tests for casting (#268) (@torrance)
- Don't embed the method table in the AST (#276) (@jpsamaroo)
- deps: Don't access is_available unless using succeeds (#279) (@jpsamaroo)
- device: Add ROCDevice() ctor (#280) (@jpsamaroo)
v0.4.1
AMDGPU v0.4.1
Closed issues:
- Add option to disable automatic mark/wait of specific arrays (#126)
- Limit multi-dimensional groupsize properly (#150)
- Optimize kernarg allocations in kernel construction (#247)
- Add priority kwarg to ROCQueue ctor (#256)
Merged pull requests:
- Add BackToCPU struct to reduce 'view' allocations (#246) (@pxl-th)
- LB GPUCompiler to 0.16.2 (#248) (@jpsamaroo)
- Optimize kernel setup and launch (#249) (@jpsamaroo)
- launch: Fix groupsize dimension check (#250) (@jpsamaroo)
- device: Add device_id method (#253) (@jpsamaroo)
- Re-export indexing intrinsics (#254) (@jpsamaroo)
- CI: Switch GHA to 1.7 release (#257) (@jpsamaroo)
- queue: Allow setting priority from ctor (#258) (@jpsamaroo)
- math: Make signbit return Bool (#259) (@jpsamaroo)
v0.4.0
AMDGPU v0.4.0
Closed issues:
Merged pull requests:
- Remove launch export (#232) (@matinraayai)
- Remove indirection layer, use modules (#240) (@jpsamaroo)
- Update Setfield compat (#243) (@luraess)