Added planar types to speed up complex half precision GEMMs #1142

Open
cliffburdick wants to merge 8 commits into main from planar_tensor
Conversation

@cliffburdick
Collaborator

No description provided.

@copy-pr-bot
copy-pr-bot bot commented Mar 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR introduces matxFp16ComplexPlanar and matxBf16ComplexPlanar marker types to allow pre-converted planar buffers to be passed directly into complex-half-precision cuBLASLt GEMMs, skipping the per-call interleaved→planar conversion overhead. All three P0/P1 issues from prior review threads are resolved: the SetOp EPT regression is gated on planar output type, non-contiguous planar views are rejected at tensor-construction time via ValidatePlanarLayoutOnCreate_(), and c_adj is correctly reset for the planar-C path.
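
For readers unfamiliar with the two layouts, the conversion that the planar types let callers skip can be sketched in plain C++. This is a minimal illustration, not MatX code: it assumes "planar" means all real parts followed by all imaginary parts in one contiguous buffer, it uses `float` in place of half precision, and the helper names `to_planar` and `to_interleaved` are hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not MatX API): split interleaved complex values
// [re0, im0, re1, im1, ...] into a planar buffer [re0, re1, ..., im0, im1, ...].
std::vector<float> to_planar(const std::vector<std::complex<float>>& interleaved) {
  const std::size_t n = interleaved.size();
  std::vector<float> planar(2 * n);
  for (std::size_t i = 0; i < n; ++i) {
    planar[i] = interleaved[i].real();      // real plane occupies [0, n)
    planar[n + i] = interleaved[i].imag();  // imaginary plane occupies [n, 2n)
  }
  return planar;
}

// Hypothetical inverse: rebuild interleaved complex values from a planar buffer.
std::vector<std::complex<float>> to_interleaved(const std::vector<float>& planar) {
  assert(planar.size() % 2 == 0);  // a planar buffer holds two equal-sized planes
  const std::size_t n = planar.size() / 2;
  std::vector<std::complex<float>> out(n);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = std::complex<float>(planar[i], planar[n + i]);
  }
  return out;
}
```

Both helpers touch every element once, which is the per-call cost a caller would avoid by keeping data in planar form across repeated GEMMs.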

Confidence Score: 5/5

Safe to merge — all three P0/P1 concerns from prior rounds are resolved; remaining findings are P2 style nits.

The three blocking issues from previous review threads (SetOp EPT regression, TotalSize non-contiguous access, c_adj pointer mismatch) are fully addressed. The new planar GEMM logic is mathematically consistent with the pre-existing non-planar path, contiguity is enforced at construction time, and the cache key correctly differentiates planar vs. interleaved configurations. The only new findings are a dead ternary in the JIT string and an unused lambda return — both P2.

include/matx/operators/planar.h (dead ternary in JIT Size), include/matx/core/allocator.h (unused is_cuda_free return for HOST_MALLOC)

Important Files Changed

| Filename | Overview |
| --- | --- |
| include/matx/transforms/matmul/matmul_cuda.h | Planar A/B/C fast-path skips per-call allocation and conversion. c_adj correctly reset to c.Data() for planar-C. ldc fixed to c.Size(RANK-1) for all complex-half paths. Cache key extended with a_planar/b_planar/c_planar booleans. |
| include/matx/core/tensor_impl.h | Adds PlanarComplexProxy, LoadPlanarComplex/StorePlanarComplex, and planar-aware operator() overloads. Contiguity is validated at construction time via tensor.h. Return types in array-indexed operator() overloads are correctly relaxed to decltype(auto). |
| include/matx/operators/planar.h | New ComplexPlanarOp operator with correct size doubling and scalar-only EPT. JIT Size() generates a dead ternary (both branches return the same value), but pre-computed out_dims_ values are correct so output is unaffected. |
| include/matx/operators/set.h | EPT regression fixed: scalar EPT is now only forced when the output is a planar-complex type; non-planar SetOp retains normal vectorization negotiation. |
| include/matx/core/half_complex.h | Adds matxFp16ComplexPlanar and matxBf16ComplexPlanar marker structs inheriting from interleaved counterparts with no extra data fields. |
| include/matx/core/allocator.h | Guards CUDA-runtime free calls behind a cudaGetDevice() liveness check to avoid crashes during static teardown. |
| include/matx/operators/interleaved.h | Adds InnerOp() accessor and two cancellation overloads: interleaved(planar(x)) and planar(interleaved(x)) both short-circuit to the inner operator. |
| include/matx/operators/base_operator.h | Disables cudaMemcpyAsync fast-path when LHS and RHS value types differ, preventing raw-byte copies between planar and interleaved tensors. |
| test/00_transform/MatMul.cu | Adds two typed test suites for planar GEMM: interleaved-reference comparison and raw-buffer validation. |
| test/00_operators/planar_test.cu | Validates per-element real/imag reads from a cudaMemcpy-populated planar tensor against the raw host buffer for both Fp16 and Bf16. |
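
The cancellation behaviour noted for interleaved.h can be illustrated with overload resolution alone. The sketch below is an assumption about the general shape, not the PR's code: `Leaf`, `PlanarView`, and `InterleavedView` are hypothetical stand-ins, and only the `InnerOp()` accessor name is taken from the summary above; the real signatures may differ.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for any tensor or operator.
struct Leaf { int id; };

// Hypothetical wrapper marking its operand as viewed in planar form.
template <typename Op>
struct PlanarView {
  Op inner;
  const Op& InnerOp() const { return inner; }
};

// Hypothetical wrapper marking its operand as viewed in interleaved form.
template <typename Op>
struct InterleavedView {
  Op inner;
  const Op& InnerOp() const { return inner; }
};

// General case: wrap the operand in a view.
template <typename Op>
PlanarView<Op> planar(const Op& op) { return PlanarView<Op>{op}; }

template <typename Op>
InterleavedView<Op> interleaved(const Op& op) { return InterleavedView<Op>{op}; }

// Cancellation overloads: these are more specialized than the general case,
// so a wrapper applied to its opposite returns the inner operand unchanged.
template <typename Op>
Op planar(const InterleavedView<Op>& v) { return v.InnerOp(); }

template <typename Op>
Op interleaved(const PlanarView<Op>& v) { return v.InnerOp(); }

// Both round trips collapse back to the original type at compile time.
static_assert(std::is_same<decltype(interleaved(planar(Leaf{}))), Leaf>::value,
              "interleaved(planar(x)) should yield x's type");
static_assert(std::is_same<decltype(planar(interleaved(Leaf{}))), Leaf>::value,
              "planar(interleaved(x)) should yield x's type");
```

Because the short-circuit is resolved by partial ordering of the function templates, a round trip adds no wrapper layers to the expression type.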

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["matmul called with complex-half types"] --> B{is_complex_half_v?}
    B -- No --> Z["Standard GEMM path"]
    B -- Yes --> C{a_is_planar?}
    C -- No --> D["Alloc a_hp, Convert A to planar"]
    C -- Yes --> E["Use A buffer directly"]
    D --> F{b_is_planar?}
    E --> F
    F -- No --> G["Alloc b_hp, Convert B to planar"]
    F -- Yes --> H["Use B buffer directly"]
    G --> I{c_is_planar?}
    H --> I
    I -- No --> J["Alloc c_hp, c_adj to c_hp"]
    I -- Yes --> K["c_adj.Reset to c.Data()"]
    J --> L["cuBLASLt GEMM"]
    K --> L
    L --> M{c_is_planar?}
    M -- No --> N["Convert C planar to interleaved"]
    M -- Yes --> O["Done"]
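
The decisions in the flowchart can also be written out as a small host-side planning function. This is a sketch of the control flow only, under the assumption that each non-planar operand costs exactly one scratch allocation; `GemmPlan` and `plan_complex_half_gemm` are hypothetical names, not part of MatX.

```cpp
#include <cassert>

// Hypothetical summary of the work the complex-half GEMM path has to do.
struct GemmPlan {
  bool convert_a;       // allocate a_hp and convert A to planar
  bool convert_b;       // allocate b_hp and convert B to planar
  bool use_c_scratch;   // allocate c_hp and point c_adj at it
  bool convert_c_back;  // convert C from planar to interleaved after the GEMM
  int scratch_allocs;   // number of temporary planar buffers required
};

// Mirrors the flowchart: each operand that is already planar is used directly,
// and each operand that is not planar needs a scratch buffer plus a conversion.
GemmPlan plan_complex_half_gemm(bool a_is_planar, bool b_is_planar, bool c_is_planar) {
  GemmPlan plan{};
  plan.convert_a = !a_is_planar;
  plan.convert_b = !b_is_planar;
  plan.use_c_scratch = !c_is_planar;
  plan.convert_c_back = !c_is_planar;
  plan.scratch_allocs = static_cast<int>(plan.convert_a) +
                        static_cast<int>(plan.convert_b) +
                        static_cast<int>(plan.use_c_scratch);
  return plan;
}
```

With all three operands planar, the plan contains no allocations and no conversions, which is the fast path this PR targets.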

Reviews (7): Last reviewed commit: "Fixed issue with teardown where context ..."

@cliffburdick
Collaborator Author

/build
