tensor-cores

Cre4T3Tiv3 / jetson-orin-matmul-analysis

CUDA matrix multiplication benchmarking on Jetson Orin Nano. Four implementations, three power modes, five matrix sizes. 99.5% mathematical validation. C++/CUDA and Python.

Updated Apr 2, 2026
Python

lavawolfiee / mini-flash-attention

Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch

cuda attention cutlass cute gpu-kernels pytorch-extension tensor-cores llm flash-attention flashattention wmma

Updated Jun 2, 2026
Cuda

MerkyorLynn / lynn-engine

Lynn-native LLM inference engine for NVIDIA Blackwell · W4A8/NVFP4 quantization · hand-written CUDA/Triton kernels · MoE · speculative decoding

Updated Jun 4, 2026
Python

etasnadi / VulkanCooperativeMatrixAttention

Vulkan & GLSL implementation of FlashAttention-2

vulkan glsl artificial-intelligence gpu-acceleration attention gpu-computing deel-learning tensor-cores large-language-models llm flash-attention flash-attention-2

Updated Jan 19, 2025
C++

llcuda / llcuda

CUDA 12-first backend inference for Unsloth on Kaggle — Optimized for small GGUF models (1B-5B) on dual Tesla T4 GPUs (15GB each, SM 7.5)

python machine-learning ai deep-learning jupyter gpu cuda inference pytorch nvidia cuda-kernels google-colab tensor-cores tesla-t4 llm gguf unsloth flashattention

Updated Feb 1, 2026
Jupyter Notebook

LessUp / sgemm-optimization

Bilingual (English/Chinese) CUDA SGEMM optimization tutorial and reference implementation, from naive kernels to Tensor Core WMMA

tutorial cuda matrix-multiplication high-performance-computing cuda-kernels shared-memory gemm sgemm gpu-optimization bank-conflict tensor-cores wmma

Updated May 28, 2026
Cuda

LDRyan0 / Correlator-Bench

A benchmarking framework for correlators of FX telescope arrays

cpp cuda radio-astronomy astronomy-instrumentation tensor-cores

Updated Oct 20, 2023
Cuda

WizardsForgeIo / sparsemma

INT8 Sparse Tensor Core GEMM for PyTorch — built for Windows

windows gpu cuda inference pytorch nvidia sparse quantization gemm int8 ptx structured-sparsity tensor-cores vram-optimization

Updated Feb 16, 2026
Cuda

Umer-Farooq-CS / MNIST-Classification

The MNIST classification problem is a fundamental machine learning task that involves recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.

benchmarking deep-learning parallel-computing cuda mnist neural-networks high-performance-computing gpu-acceleration profiling shared-memory openacc performance-optimization c-cpp nsight tensor-cores cuda-streams pinned-memory

Updated Sep 12, 2025
Cuda

keneoneth / leet_gpu_solution

High-performance CUDA kernels with step-by-step optimization, profiling, and analysis. A growing collection of GPU solutions demonstrating warp-level tuning, memory optimization, and Tensor Core acceleration.

gpu-acceleration cuda-programming tensor-cores leetgpu warp-reduction

Updated Nov 12, 2025
Cuda

NeuralAditya / Neural_Network_C

Neural Network C is an advanced neural network implementation in pure C, optimized for high performance on CPUs and NVIDIA GPUs.

Updated Mar 29, 2025
C

athrva98 / FlashNystrom

Tensor-core CUDA kernels for Nyström attention: linear-time forward and backward passes with exact autograd gradients. Faster than flash-attention at long sequence lengths.

cuda pytorch transformer attention attention-mechanism tensor-cores linear-attention nystromformer long-context flash-attention

Updated May 24, 2026
Cuda

tariqaf / RA-SpMM

RA-SpMM: Regime-Aware Sparse Matrix Multiplication for GNN Workloads on GPUs. 8-rule router, 6 preprocessing-free kernels, 3.25x over cuSPARSE (FGCS 2026).

cuda benchmarks gpu-computing graph-neural-networks gnn spmm tensor-cores sparse-matrix-multiplication sparse-linear-algebra

Updated Apr 30, 2026
Python

ZrobMiloudaa / jetson-orin-matmul-analysis

🔍 Analyze CUDA matrix multiplication performance and power consumption on NVIDIA Jetson Orin Nano across multiple implementations and settings.

machine-learning robotics cuda cublas matrix-multiplication high-performance-computing gpu-computing performance-optimization autonomous-systems edge-computing nvidia-jetson embeded-systems tensor-cores ml-deployment jetson-orin-nano gpu-benchmarking power-efficiency-benchmark cuda-optimization

Updated Jun 4, 2026
Python

Olajide-Badejo / CUDA-Matrix-Library

CUDA matrix library for GEMM, GEMV, TRSM with naive, tiled, register-blocked, and tensor-core kernels. Includes FP16/BF16 mixed precision, sparse ops, cuSOLVER wrappers, and Python bindings.

cpp gpu cuda blas gemm mixed-precision tensor-cores

Updated Apr 15, 2026
C++

shengyuewangshuai-del / Gpu_Burn

Multi-GPU CUDA stress test with Tensor Core power filler for board power testing

cuda tensor-cores gpu-burn gpu-stress-test power-testing

Updated May 25, 2026
C++

Here are 32 public repositories matching this topic...

xlite-dev / ffpa-attn

xlite-dev / HGEMM

tgautam03 / tGeMM

fms-zth / BlackFlash

Cre4T3Tiv3 / jetson-orin-matmul-analysis
