[Feat][Convolution] Support grouped conv2d and conv3d#1568
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds native grouped Conv2d and Conv3d support for both bias and no-bias variants. It introduces GroupConv2dKernel and GroupConv3dKernel classes, implements their corresponding TileLang JIT kernels, and wires the forward operators to dispatch to these kernels when groups > 1. It also updates the expected weight shape validation, adds test coverage for grouped convolutions, and includes MobileNetV2 depthwise and ResNeXt grouped benchmark cases. I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
63792e2 to
743b0c9
Compare
superAngGao
left a comment
There was a problem hiding this comment.
Reviewed the grouped Conv2d/Conv3d implementation. I did not find any blocking issues. The dispatch path, grouped channel indexing, weight/bias shape handling, and the added grouped/depthwise coverage all look consistent to me.\n\nNon-blocking note: grouped + dilation does not appear to have a dedicated grouped test case yet, but the dilation indexing is wired directly into the grouped kernels and the existing dilation formula is already covered on the non-grouped path, so I do not think this should block the PR.
lcy-seso
left a comment
There was a problem hiding this comment.
Request changes: the speedup numbers in the PR body need rework
The functional change (grouped conv2d/conv3d support) looks good, but the benchmark claims — particularly the 50.29x on conv3d and to a lesser extent the 4.88x on conv2d — are misleading as presented and should not ship as headline numbers in this form.
The 50x is a baseline artifact, not a throughput result
For the 3d-resnext-grouped-k3-fp16 case (N=1, C 64→128, D×H×W = 8×28×28, k=3³, groups=32), the total work is only:
- ~86.7 MFLOP (802,816 output elements × 54 MAC each × 2)
- ~2.4 MB of I/O (0.8 MB in + 1.6 MB out + 14 KB weights)
Against those figures, the reported 0.0148 ms means TileOps is running at ~5.9 TFLOP/s and ~162 GB/s — roughly 0.6% of H200 FP16 tensor-core peak and ~3% of HBM bandwidth. So the TileOps kernel itself is not doing anything fast in absolute terms; this is a tiny, launch-overhead-bound workload.
The entire 50x comes from the denominator: PyTorch's 0.7443 ms corresponds to ~0.12 TFLOP/s, which is the well-known cuDNN pathology for grouped 3D convolution (it tends to serialize into one small kernel per group, here 32 of them). So the comparison is really "one fused TileLang kernel vs. cuDNN's 32 serial micro-kernels," not a throughput comparison. The same table confirms this: when cuDNN picks a sane path (depthwise conv2d → 1.02x, grouped conv2d → 4.88x), the speedup collapses. The 50x is an outlier, not a representative result.
Measurement methodology is unverifiable and operating below the noise floor
0.0148 ms = 14.8 µs, which is in the same range as kernel-launch overhead. At this scale the result is dominated by how it's measured, and the timing harness is not in this PR. Before trusting these numbers, please confirm:
- Both TileOps and the torch baseline use the same harness with warmup +
torch.cuda.synchronize()(or CUDA events) + multiple iterations, reporting median/min — not wall-clock around a single call. - The torch side is warmed up so cuDNN algorithm selection / autotuning isn't counted in its latency (this alone can inflate the baseline several-fold).
- Memory format is stated for the torch baseline (channels_last vs. contiguous) so the comparison is apples-to-apples.
Requested changes
- Don't lead with 4.x / 50x. As-is these read as "TileOps is 50× faster," when the honest reading is "cuDNN's grouped conv3d path is degenerate on a batch-1 workload." Either drop them from the headline or annotate them explicitly as cuDNN-baseline artifacts.
- Report achieved efficiency (TFLOP/s and/or % of bandwidth) alongside or instead of the speedup ratio. That's the metric that actually reflects kernel quality and won't mislead.
- Add a realistic-size data point (e.g. batch ≥ 8, or a more compute-bound shape). I expect the 50x to fall off sharply — the current numbers are cherry-picked by workload size.
- Show the timing loop (or link the harness) so reviewers can confirm warmup/sync for both sides.
Net: the feature is fine, but the benchmark section as written sets a misleading bar. Speedup-vs-cuDNN ratios on launch-bound batch-1 grouped convs are not an acceptable performance claim on their own.
743b0c9 to
f56ca90
Compare
|
Updated the benchmark section to address the review:
|
zhen8838
left a comment
There was a problem hiding this comment.
Blocking on the manifest contract. This PR makes grouped Conv2d/Conv3d usable from the default op path, but tileops/manifest/convolution.yaml is not updated: Conv2dFwdOp, Conv2dBiasFwdOp, Conv3dFwdOp, and Conv3dBiasFwdOp still say status: spec-only # TODO: impl lacks groups, their groups params still say the kernel does not support groups, their kernel_map entries omit GroupConv2dKernel/GroupConv3dKernel, and their roofline formulas still use full C_in instead of per-group C_in_g. That leaves the manifest stats/source-of-truth wrong even though the implementation and tests now expose grouped support; validate-manifest was skipped because the manifest was untouched. Please update the convolution manifest in the same PR so implementation status, source map, TODOs, and roofline accounting match the new behavior.
Closes #1521
Summary
main.Test plan
python -m py_compile tileops/ops/convolution.py tileops/kernels/convolution.py tests/ops/test_convolution.py benchmarks/ops/bench_convolution.py tileops/kernels/__init__.pypython -m ruff check tileops/ops/convolution.py tileops/kernels/convolution.py tests/ops/test_convolution.py benchmarks/ops/bench_convolution.py tileops/kernels/__init__.pyTILELANG_CLEANUP_TEMP_FILES=1 python -m pytest tests/ops/test_convolution.py -m smoke -vvs(25 passed, 28 deselected)TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest tests/ops/test_convolution.py -vvs(53 passed)python scripts/test_node_delta.py tests/ops/test_convolution.py(+6nodes)TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest benchmarks/ops/bench_convolution.py -vvs(27 passed)CUDA_VISIBLE_DEVICES=1 TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest 'benchmarks/ops/bench_convolution.py::test_conv2d_bench[mobilenetv2-depthwise-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv2d_bench[resnext-grouped-3x3-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-b8-fp16]' -vvs(4 passed)Benchmark
Grouped benchmark subset run on GPU 1 with
CUDA_VISIBLE_DEVICES=1; pre-run clock check showed current SM clock at 1830 MHz on NVIDIA H200. Inputs use the benchmark default contiguous memory format: NCHW for Conv2d and NCDHW for Conv3d.Methodology: both TileOps and torch baseline run through
BenchmarkBase.profile()andbenchmarks/benchmark_base.py::bench_kernelundertorch.no_grad(). The harness uses 10 warmup iterations, 50 timed repeats per trial, 3 trials,torch.cuda.synchronize()after warmup, L2 cache flush, cloned tensor args when possible, CUPTI kernel timing throughtorch.profiler, and reports the median trial mean.Command:
Result:
4 passed in 138.39s.conv2d mobilenetv2-depthwise-fp16conv2d resnext-grouped-3x3-fp16conv3d 3d-resnext-grouped-k3-fp16conv3d 3d-resnext-grouped-k3-b8-fp16Notes: