[Feat][Convolution] Support grouped conv2d and conv3d #1568

Open
RMLYC wants to merge 1 commit into tile-ai:main from RMLYC:issue-1521-grouped-convolution
@RMLYC RMLYC commented Jun 9, 2026

Collaborator

Closes #1521

Summary

  • Add native grouped Conv2d/Conv3d TileLang kernels for bias and no-bias variants.
  • Wire Conv2d/Conv3d op dispatch to grouped kernels while preserving existing dilation support from main.
  • Add grouped correctness coverage and model-derived grouped convolution benchmark cases.

Test plan

  • python -m py_compile tileops/ops/convolution.py tileops/kernels/convolution.py tests/ops/test_convolution.py benchmarks/ops/bench_convolution.py tileops/kernels/__init__.py
  • python -m ruff check tileops/ops/convolution.py tileops/kernels/convolution.py tests/ops/test_convolution.py benchmarks/ops/bench_convolution.py tileops/kernels/__init__.py
  • TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest tests/ops/test_convolution.py -m smoke -vvs (25 passed, 28 deselected)
  • TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest tests/ops/test_convolution.py -vvs (53 passed)
  • python scripts/test_node_delta.py tests/ops/test_convolution.py (+6 nodes)
  • TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest benchmarks/ops/bench_convolution.py -vvs (27 passed)
  • CUDA_VISIBLE_DEVICES=1 TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest 'benchmarks/ops/bench_convolution.py::test_conv2d_bench[mobilenetv2-depthwise-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv2d_bench[resnext-grouped-3x3-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-fp16]' 'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-b8-fp16]' -vvs (4 passed)

Benchmark

Grouped benchmark subset run on GPU 1 with CUDA_VISIBLE_DEVICES=1; pre-run clock check showed current SM clock at 1830 MHz on NVIDIA H200. Inputs use the benchmark default contiguous memory format: NCHW for Conv2d and NCDHW for Conv3d.

Methodology: both TileOps and torch baseline run through BenchmarkBase.profile() and benchmarks/benchmark_base.py::bench_kernel under torch.no_grad(). The harness uses 10 warmup iterations, 50 timed repeats per trial, 3 trials, torch.cuda.synchronize() after warmup, L2 cache flush, cloned tensor args when possible, CUPTI kernel timing through torch.profiler, and reports the median trial mean.
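The aggregation described above (warmup, a fixed number of timed repeats per trial, then the median of the per-trial means) can be sketched as follows. This is a minimal illustration and not the code in benchmarks/benchmark_base.py: the function name `profile_median_of_means` and its `timer` parameter are hypothetical, and host wall-clock timing stands in for the CUPTI kernel timing, synchronization, and L2 flush that the real harness performs.

```python
import statistics
import time
from typing import Callable


def profile_median_of_means(
    fn: Callable[[], object],
    warmup: int = 10,
    repeats: int = 50,
    trials: int = 3,
    timer: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the median over trials of the mean per-call latency, in ms.

    Hypothetical stand-in: a GPU harness would synchronize the device and
    use kernel-level timing instead of a host clock.
    """
    # Warmup calls are executed but never timed.
    for _ in range(warmup):
        fn()

    trial_means = []
    for _ in range(trials):
        samples_ms = []
        for _ in range(repeats):
            start = timer()
            fn()
            samples_ms.append((timer() - start) * 1e3)
        trial_means.append(statistics.mean(samples_ms))

    # The median across trials damps a single noisy trial.
    return statistics.median(trial_means)
```

Taking the mean inside each trial and the median across trials keeps one disturbed trial (for example, a clock change mid-run) from shifting the reported number.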

Command:

CUDA_VISIBLE_DEVICES=1 TILELANG_CLEANUP_TEMP_FILES=1 python -m pytest \
  'benchmarks/ops/bench_convolution.py::test_conv2d_bench[mobilenetv2-depthwise-fp16]' \
  'benchmarks/ops/bench_convolution.py::test_conv2d_bench[resnext-grouped-3x3-fp16]' \
  'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-fp16]' \
  'benchmarks/ops/bench_convolution.py::test_conv3d_bench[3d-resnext-grouped-k3-b8-fp16]' \
  -vvs

Result: 4 passed in 138.39s.

| Case | TileOps latency (ms) | TileOps TFLOP/s | TileOps bandwidth (TB/s) | Torch latency (ms) | Torch TFLOP/s | Torch bandwidth (TB/s) |
| --- | --- | --- | --- | --- | --- | --- |
| conv2d mobilenetv2-depthwise-fp16 | 0.0064 | 0.2812 | 0.0626 | 0.0065 | 0.2772 | 0.0617 |
| conv2d resnext-grouped-3x3-fp16 | 0.0041 | 3.4882 | 0.1498 | 0.0200 | 0.7207 | 0.0309 |
| conv3d 3d-resnext-grouped-k3-fp16 | 0.0147 | 5.8872 | 0.1645 | 0.7442 | 0.1165 | 0.0033 |
| conv3d 3d-resnext-grouped-k3-b8-fp16 | 0.1269 | 5.4643 | 0.1519 | 5.7221 | 0.1212 | 0.0034 |

Notes:

  • The batch-1 grouped Conv3d case is a small microbenchmark; its large latency gap primarily reflects a torch/cuDNN grouped-Conv3d baseline artifact and launch overhead, not high absolute H200 utilization.
  • The batch-8 grouped Conv3d case is included as a larger throughput point from the same 3D-ResNeXt/video-backbone pattern.

@github-actions github-actions Bot added the feature New feature or new operator label Jun 9, 2026
@RMLYC RMLYC added the all-ai-powered Produced entirely by automated contributors label Jun 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request adds native grouped Conv2d and Conv3d support for both bias and no-bias variants. It introduces GroupConv2dKernel and GroupConv3dKernel classes, implements their corresponding TileLang JIT kernels, and wires the forward operators to dispatch to these kernels when groups > 1. It also updates the expected weight shape validation, adds test coverage for grouped convolutions, and includes MobileNetV2 depthwise and ResNeXt grouped benchmark cases. I have no feedback to provide.

@RMLYC RMLYC requested a review from superAngGao June 9, 2026 09:07
@RMLYC RMLYC marked this pull request as ready for review June 9, 2026 09:07
@RMLYC RMLYC force-pushed the issue-1521-grouped-convolution branch from 63792e2 to 743b0c9 Compare June 9, 2026 09:10
superAngGao previously approved these changes Jun 9, 2026

@superAngGao superAngGao left a comment

Collaborator

Reviewed the grouped Conv2d/Conv3d implementation. I did not find any blocking issues. The dispatch path, grouped channel indexing, weight/bias shape handling, and the added grouped/depthwise coverage all look consistent to me.

Non-blocking note: grouped + dilation does not appear to have a dedicated grouped test case yet, but the dilation indexing is wired directly into the grouped kernels and the existing dilation formula is already covered on the non-grouped path, so I do not think this should block the PR.
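For readers unfamiliar with the two pieces this review refers to, here is a small sketch of standard grouped channel indexing and the usual dilated output-extent formula. Both helpers are hypothetical illustrations of the general convolution definitions (the same ones PyTorch documents), not code from the TileOps kernels.

```python
def conv_out_extent(size, kernel, stride=1, padding=0, dilation=1):
    """Output extent along one spatial axis of a (possibly dilated) conv."""
    # A dilated kernel spans dilation * (kernel - 1) + 1 input positions.
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def group_input_channels(c_out_idx, c_in, c_out, groups):
    """Input channels read by output channel c_out_idx in a grouped conv."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    c_in_g = c_in // groups    # input channels per group
    c_out_g = c_out // groups  # output channels per group
    g = c_out_idx // c_out_g   # group that owns this output channel
    return range(g * c_in_g, (g + 1) * c_in_g)
```

With `c_in=64, c_out=128, groups=32`, each output channel reads only 2 input channels; with `c_in == c_out == groups` (depthwise), each output channel reads exactly one. Dilation changes only the spatial extent, not the channel mapping, which is why the two concerns can be tested largely independently.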

@lcy-seso lcy-seso left a comment

Collaborator

Request changes: the speedup numbers in the PR body need rework

The functional change (grouped conv2d/conv3d support) looks good, but the benchmark claims — particularly the 50.29x on conv3d and to a lesser extent the 4.88x on conv2d — are misleading as presented and should not ship as headline numbers in this form.

The 50x is a baseline artifact, not a throughput result

For the 3d-resnext-grouped-k3-fp16 case (N=1, C 64→128, D×H×W = 8×28×28, k=3³, groups=32), the total work is only:

  • ~86.7 MFLOP (802,816 output elements × 54 MAC each × 2)
  • ~2.4 MB of I/O (0.8 MB in + 1.6 MB out + 14 KB weights)

Against those figures, the reported 0.0148 ms means TileOps is running at ~5.9 TFLOP/s and ~162 GB/s — roughly 0.6% of H200 FP16 tensor-core peak and ~3% of HBM bandwidth. So the TileOps kernel itself is not doing anything fast in absolute terms; this is a tiny, launch-overhead-bound workload.
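The arithmetic above can be reproduced with a short script. This is a sketch under stated assumptions: the helper `grouped_conv_cost` is hypothetical, the output spatial size is taken to equal the input size (as the 0.8 MB in / 1.6 MB out figures imply), fp16 is 2 bytes per element, and bias traffic is ignored.

```python
from math import prod


def grouped_conv_cost(n, c_in, c_out, in_spatial, out_spatial, kernel,
                      groups, dtype_bytes=2):
    """Return (flops, bytes_moved) for one grouped conv forward pass."""
    c_in_g = c_in // groups                      # input channels per group
    out_elems = n * c_out * prod(out_spatial)
    macs_per_out = c_in_g * prod(kernel)         # 2 * 27 = 54 for this case
    flops = 2 * out_elems * macs_per_out         # 2 FLOPs per MAC
    in_elems = n * c_in * prod(in_spatial)
    weight_elems = c_out * c_in_g * prod(kernel)
    bytes_moved = dtype_bytes * (in_elems + out_elems + weight_elems)
    return flops, bytes_moved


# Shape quoted in the review: N=1, C 64 -> 128, 8x28x28, k=3^3, groups=32.
flops, bytes_moved = grouped_conv_cost(
    n=1, c_in=64, c_out=128,
    in_spatial=(8, 28, 28), out_spatial=(8, 28, 28),
    kernel=(3, 3, 3), groups=32,
)

latency_s = 0.0148e-3                            # 0.0148 ms from the review
tflops = flops / latency_s / 1e12
gbps = bytes_moved / latency_s / 1e9
```

This evaluates to 86,704,128 FLOPs and 2,422,272 bytes, matching the ~86.7 MFLOP and ~2.4 MB figures; dividing by the quoted latency gives throughput in the same range as the review's estimates.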

The entire 50x comes from the denominator: PyTorch's 0.7443 ms corresponds to ~0.12 TFLOP/s, which is the well-known cuDNN pathology for grouped 3D convolution (it tends to serialize into one small kernel per group, here 32 of them). So the comparison is really "one fused TileLang kernel vs. cuDNN's 32 serial micro-kernels," not a throughput comparison. The same table confirms this: when cuDNN picks a sane path (depthwise conv2d → 1.02x, grouped conv2d → 4.88x), the speedup collapses. The 50x is an outlier, not a representative result.

Measurement methodology is unverifiable and operating below the noise floor

0.0148 ms = 14.8 µs, which is in the same range as kernel-launch overhead. At this scale the result is dominated by how it's measured, and the timing harness is not in this PR. Before trusting these numbers, please confirm:

  • Both TileOps and the torch baseline use the same harness with warmup + torch.cuda.synchronize() (or CUDA events) + multiple iterations, reporting median/min — not wall-clock around a single call.
  • The torch side is warmed up so cuDNN algorithm selection / autotuning isn't counted in its latency (this alone can inflate the baseline several-fold).
  • Memory format is stated for the torch baseline (channels_last vs. contiguous) so the comparison is apples-to-apples.

Requested changes

  1. Don't lead with 4.x / 50x. As-is these read as "TileOps is 50× faster," when the honest reading is "cuDNN's grouped conv3d path is degenerate on a batch-1 workload." Either drop them from the headline or annotate them explicitly as cuDNN-baseline artifacts.
  2. Report achieved efficiency (TFLOP/s and/or % of bandwidth) alongside or instead of the speedup ratio. That's the metric that actually reflects kernel quality and won't mislead.
  3. Add a realistic-size data point (e.g. batch ≥ 8, or a more compute-bound shape). I expect the 50x to fall off sharply — the current numbers are cherry-picked by workload size.
  4. Show the timing loop (or link the harness) so reviewers can confirm warmup/sync for both sides.

Net: the feature is fine, but the benchmark section as written sets a misleading bar. Speedup-vs-cuDNN ratios on launch-bound batch-1 grouped convs are not an acceptable performance claim on their own.

@RMLYC RMLYC force-pushed the issue-1521-grouped-convolution branch from 743b0c9 to f56ca90 Compare June 12, 2026 04:03
RMLYC commented Jun 12, 2026

Collaborator Author

Updated the benchmark section to address the review:

  • Removed the 4.88x / 50x framing as headline claims.
  • Added TFLOP/s and bandwidth columns alongside latency.
  • Documented the timing harness: warmup/repeats/trials, synchronization, L2 flush, CUPTI timing, and memory format.
  • Added a larger batch-8 grouped Conv3d benchmark case from the same 3D-ResNeXt/video-backbone pattern.
  • Annotated the batch-1 Conv3d ratio as a torch/cuDNN grouped-Conv3d baseline artifact rather than high absolute H200 utilization.

@RMLYC RMLYC requested a review from lcy-seso June 12, 2026 05:43

@zhen8838 zhen8838 left a comment

Collaborator

Blocking on the manifest contract. This PR makes grouped Conv2d/Conv3d usable from the default op path, but tileops/manifest/convolution.yaml is not updated. For Conv2dFwdOp, Conv2dBiasFwdOp, Conv3dFwdOp, and Conv3dBiasFwdOp:

  • the entries still say status: spec-only # TODO: impl lacks groups;
  • their groups params still say the kernel does not support groups;
  • their kernel_map entries omit GroupConv2dKernel/GroupConv3dKernel;
  • their roofline formulas still use full C_in instead of per-group C_in_g.

That leaves the manifest stats/source-of-truth wrong even though the implementation and tests now expose grouped support; validate-manifest was skipped because the manifest was untouched. Please update the convolution manifest in the same PR so implementation status, source map, TODOs, and roofline accounting match the new behavior.
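To illustrate why the roofline point matters, the sketch below compares a FLOP count that uses full C_in against one that uses per-group C_in_g. The function `conv_flops` is a hypothetical generic count, not the formula stored in the manifest (which is not shown in this thread); the shape is the batch-1 3D-ResNeXt case discussed earlier in the review.

```python
def conv_flops(out_elems, c_in, kernel_volume, groups=1):
    """Forward-pass FLOPs, counting 2 per multiply-accumulate."""
    if c_in % groups:
        raise ValueError("c_in must be divisible by groups")
    c_in_g = c_in // groups  # each output element reads only its own group
    return 2 * out_elems * c_in_g * kernel_volume


# N=1, C_out=128, output 8x28x28, C_in=64, 3x3x3 kernel, groups=32.
out_elems = 1 * 128 * 8 * 28 * 28

full = conv_flops(out_elems, c_in=64, kernel_volume=27)                # full C_in
grouped = conv_flops(out_elems, c_in=64, kernel_volume=27, groups=32)  # C_in_g

overcount = full // grouped
```

Here `overcount` equals `groups` (32): a formula based on full C_in would overstate the work, and therefore the achieved TFLOP/s, by that factor for grouped cases.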

Development

Successfully merging this pull request may close these issues.

[FEAT][CONVOLUTION] support conv2d conv3d groups