[Feat][Convolution] Support grouped conv2d and conv3d by RMLYC · Pull Request #1568 · tile-ai/TileOPs

RMLYC · 2026-06-09T08:57:51Z

Summary

Add native grouped Conv2d/Conv3d TileLang kernels for bias and no-bias variants.
Wire Conv2d/Conv3d op dispatch to grouped kernels while preserving existing dilation support from main.
Add grouped correctness coverage and model-derived grouped convolution benchmark cases.

Test plan

python -m py_compile tileops/ops/convolution.py tileops/kernels/convolution.py tests/ops/test_convolution.py benchmarks/ops/bench_convolution.py tileops/kernels/__init__.py
python -m ruff check tileops/ops/convolution.py tileops/kernels/convolution.py tests/ops/test_convolution.py benchmarks/ops/bench_convolution.py tileops/kernels/__init__.py
TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest tests/ops/test_convolution.py -m smoke -vvs (25 passed, 28 deselected)
TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest tests/ops/test_convolution.py -vvs (53 passed)
python scripts/test_node_delta.py tests/ops/test_convolution.py (+6 nodes)
TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest benchmarks/ops/bench_convolution.py -vvs (27 passed)
CUDA_VISIBLE_DEVICES=1 TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest 'benchmarks/ops/bench_convolution.py::test_conv2d_bench[mobilenetv2-depthwise-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv2d_bench[resnext-grouped-3x3-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-b8-fp16]' -vvs (4 passed)

Benchmark

Grouped benchmark subset run on GPU 1 with CUDA_VISIBLE_DEVICES=1; pre-run clock check showed current SM clock at 1830 MHz on NVIDIA H200. Inputs use the benchmark default contiguous memory format: NCHW for Conv2d and NCDHW for Conv3d.

Methodology: both TileOps and torch baseline run through BenchmarkBase.profile() and benchmarks/benchmark_base.py::bench_kernel under torch.no_grad(). The harness uses 10 warmup iterations, 50 timed repeats per trial, 3 trials, torch.cuda.synchronize() after warmup, L2 cache flush, cloned tensor args when possible, CUPTI kernel timing through torch.profiler, and reports the median trial mean.

Command:

CUDA_VISIBLE_DEVICES=1 TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest \
  'benchmarks/ops/bench_convolution.py::test_conv2d_bench[mobilenetv2-depthwise-fp16]' \
  'benchmarks/ops/bench_convolution.py::test_conv2d_bench[resnext-grouped-3x3-fp16]' \
  'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-fp16]' \
  'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-b8-fp16]' \
  -vvs

Result: 4 passed in 138.39s.

Case	TileOps latency	TileOps TFLOP/s	TileOps bandwidth	Torch latency	Torch TFLOP/s	Torch bandwidth
`conv2d mobilenetv2-depthwise-fp16`	0.0064 ms	0.2812	0.0626 TB/s	0.0065 ms	0.2772	0.0617 TB/s
`conv2d resnext-grouped-3x3-fp16`	0.0041 ms	3.4882	0.1498 TB/s	0.0200 ms	0.7207	0.0309 TB/s
`conv3d 3d-resnext-grouped-k3-fp16`	0.0147 ms	5.8872	0.1645 TB/s	0.7442 ms	0.1165	0.0033 TB/s
`conv3d 3d-resnext-grouped-k3-b8-fp16`	0.1269 ms	5.4643	0.1519 TB/s	5.7221 ms	0.1212	0.0034 TB/s

Notes:

The batch-1 grouped Conv3d case is a small microbenchmark; its large latency gap primarily reflects a torch/cuDNN grouped-Conv3d baseline artifact and launch overhead, not high absolute H200 utilization.
The batch-8 grouped Conv3d case is included as a larger throughput point from the same 3D-ResNeXt/video-backbone pattern.

gemini-code-assist

Code Review

This pull request adds native grouped Conv2d and Conv3d support for both bias and no-bias variants. It introduces GroupConv2dKernel and GroupConv3dKernel classes, implements their corresponding TileLang JIT kernels, and wires the forward operators to dispatch to these kernels when groups > 1. It also updates the expected weight shape validation, adds test coverage for grouped convolutions, and includes MobileNetV2 depthwise and ResNeXt grouped benchmark cases. I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

superAngGao

Reviewed the grouped Conv2d/Conv3d implementation. I did not find any blocking issues. The dispatch path, grouped channel indexing, weight/bias shape handling, and the added grouped/depthwise coverage all look consistent to me.\n\nNon-blocking note: grouped + dilation does not appear to have a dedicated grouped test case yet, but the dilation indexing is wired directly into the grouped kernels and the existing dilation formula is already covered on the non-grouped path, so I do not think this should block the PR.

lcy-seso

Request changes: the speedup numbers in the PR body need rework

The functional change (grouped conv2d/conv3d support) looks good, but the benchmark claims — particularly the 50.29x on conv3d and to a lesser extent the 4.88x on conv2d — are misleading as presented and should not ship as headline numbers in this form.

The 50x is a baseline artifact, not a throughput result

For the 3d-resnext-grouped-k3-fp16 case (N=1, C 64→128, D×H×W = 8×28×28, k=3³, groups=32), the total work is only:

~86.7 MFLOP (802,816 output elements × 54 MAC each × 2)
~2.4 MB of I/O (0.8 MB in + 1.6 MB out + 14 KB weights)

Against those figures, the reported 0.0148 ms means TileOps is running at ~5.9 TFLOP/s and ~162 GB/s — roughly 0.6% of H200 FP16 tensor-core peak and ~3% of HBM bandwidth. So the TileOps kernel itself is not doing anything fast in absolute terms; this is a tiny, launch-overhead-bound workload.

The entire 50x comes from the denominator: PyTorch's 0.7443 ms corresponds to ~0.12 TFLOP/s, which is the well-known cuDNN pathology for grouped 3D convolution (it tends to serialize into one small kernel per group, here 32 of them). So the comparison is really "one fused TileLang kernel vs. cuDNN's 32 serial micro-kernels," not a throughput comparison. The same table confirms this: when cuDNN picks a sane path (depthwise conv2d → 1.02x, grouped conv2d → 4.88x), the speedup collapses. The 50x is an outlier, not a representative result.

Measurement methodology is unverifiable and operating below the noise floor

0.0148 ms = 14.8 µs, which is in the same range as kernel-launch overhead. At this scale the result is dominated by how it's measured, and the timing harness is not in this PR. Before trusting these numbers, please confirm:

Both TileOps and the torch baseline use the same harness with warmup + torch.cuda.synchronize() (or CUDA events) + multiple iterations, reporting median/min — not wall-clock around a single call.
The torch side is warmed up so cuDNN algorithm selection / autotuning isn't counted in its latency (this alone can inflate the baseline several-fold).
Memory format is stated for the torch baseline (channels_last vs. contiguous) so the comparison is apples-to-apples.

Requested changes

Don't lead with 4.x / 50x. As-is these read as "TileOps is 50× faster," when the honest reading is "cuDNN's grouped conv3d path is degenerate on a batch-1 workload." Either drop them from the headline or annotate them explicitly as cuDNN-baseline artifacts.
Report achieved efficiency (TFLOP/s and/or % of bandwidth) alongside or instead of the speedup ratio. That's the metric that actually reflects kernel quality and won't mislead.
Add a realistic-size data point (e.g. batch ≥ 8, or a more compute-bound shape). I expect the 50x to fall off sharply — the current numbers are cherry-picked by workload size.
Show the timing loop (or link the harness) so reviewers can confirm warmup/sync for both sides.

Net: the feature is fine, but the benchmark section as written sets a misleading bar. Speedup-vs-cuDNN ratios on launch-bound batch-1 grouped convs are not an acceptable performance claim on their own.

RMLYC · 2026-06-12T04:04:14Z

Updated the benchmark section to address the review:

Removed the 4.88x / 50x framing as headline claims.
Added TFLOP/s and bandwidth columns alongside latency.
Documented the timing harness: warmup/repeats/trials, synchronization, L2 flush, CUPTI timing, and memory format.
Added a larger batch-8 grouped Conv3d benchmark case from the same 3D-ResNeXt/video-backbone pattern.
Annotated the batch-1 Conv3d ratio as a torch/cuDNN grouped-Conv3d baseline artifact rather than high absolute H200 utilization.

zhen8838

Blocking on the manifest contract. This PR makes grouped Conv2d/Conv3d usable from the default op path, but tileops/manifest/convolution.yaml is not updated: Conv2dFwdOp, Conv2dBiasFwdOp, Conv3dFwdOp, and Conv3dBiasFwdOp still say status: spec-only # TODO: impl lacks groups, their groups params still say the kernel does not support groups, their kernel_map entries omit GroupConv2dKernel/GroupConv3dKernel, and their roofline formulas still use full C_in instead of per-group C_in_g. That leaves the manifest stats/source-of-truth wrong even though the implementation and tests now expose grouped support; validate-manifest was skipped because the manifest was untouched. Please update the convolution manifest in the same PR so implementation status, source map, TODOs, and roofline accounting match the new behavior.

github-actions Bot added the feature New feature or new operator label Jun 9, 2026

RMLYC added the all-ai-powered Produced entirely by automated contributors label Jun 9, 2026

gemini-code-assist Bot reviewed Jun 9, 2026

View reviewed changes

RMLYC requested a review from superAngGao June 9, 2026 09:07

RMLYC marked this pull request as ready for review June 9, 2026 09:07

RMLYC force-pushed the issue-1521-grouped-convolution branch from 63792e2 to 743b0c9 Compare June 9, 2026 09:10

superAngGao previously approved these changes Jun 9, 2026

View reviewed changes

lcy-seso requested changes Jun 9, 2026

View reviewed changes

[Feature][Convolution] Support grouped convolution

f56ca90

RMLYC dismissed superAngGao’s stale review via f56ca90 June 12, 2026 04:03

RMLYC force-pushed the issue-1521-grouped-convolution branch from 743b0c9 to f56ca90 Compare June 12, 2026 04:03

RMLYC requested a review from lcy-seso June 12, 2026 05:43

zhen8838 requested changes Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feat][Convolution] Support grouped conv2d and conv3d#1568

[Feat][Convolution] Support grouped conv2d and conv3d#1568
RMLYC wants to merge 1 commit into
tile-ai:mainfrom
RMLYC:issue-1521-grouped-convolution

RMLYC commented Jun 9, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

superAngGao left a comment

Uh oh!

lcy-seso left a comment

Uh oh!

RMLYC commented Jun 12, 2026

Uh oh!

zhen8838 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

RMLYC commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Benchmark

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

superAngGao left a comment

Choose a reason for hiding this comment

Uh oh!

lcy-seso left a comment

Choose a reason for hiding this comment

Request changes: the speedup numbers in the PR body need rework

The 50x is a baseline artifact, not a throughput result

Measurement methodology is unverifiable and operating below the noise floor

Requested changes

Uh oh!

RMLYC commented Jun 12, 2026

Uh oh!

zhen8838 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

RMLYC commented Jun 9, 2026 •

edited

Loading