This guide helps engineers learn GPU kernel programming and optimization, with a focus on high-performance AI systems. It covers the path from fundamentals to production deployment, balancing foundational concepts with cutting-edge techniques.
If you're interested in GPU performance engineering, we're hiring at Wafer.
Recommended reading order:
- Read "Tier 1" for all topics
- Read "Tier 2" for all topics
- Continue with each later tier in the same way
- Fundamentals
- Matrix Multiplication
- Tensor Cores & Mixed Precision
- Attention & Memory-Bound Kernels
- Compiler & DSL Approaches
- Profiling & Optimization
- AMD & Alternative Hardware
- Production Inference Systems
- LLM-Generated Kernels
- Distributed & Multi-GPU
- The Big Picture
- Maintainer
- Programming Massively Parallel Processors (PMPP) - Hwu, Kirk, El Hajj. The canonical textbook, 4th edition covers Ampere/Hopper
- GPU Mode Lectures - Community-driven lecture series: profiling → kernels → CUTLASS → SASS. Active Discord (23k+ members): discord.gg/gpumode
- NVIDIA CUDA Programming Guide - Official documentation, essential reference for programming model
- NVIDIA Hopper Architecture In-Depth - TMA, Thread Block Clusters, Distributed Shared Memory, WGMMA
- Chips and Cheese: Blackwell - Microbenchmarking analysis of GB202, memory latency comparisons
- Dissecting the NVIDIA Hopper GPU Architecture - Academic microbenchmarking of H100
- Dissecting the NVIDIA Blackwell Architecture - Microbenchmarks covering tcgen05, TMEM, 2SM MMA
- PTX ISA Documentation - Official PTX instruction set reference
- Understanding PTX - Introduction to CUDA's virtual assembly language
- DocumentSASS - Unofficial SASS instruction documentation extracted from nvdisasm
- JEB SASS Disassembler - Reverse engineering GPU binaries (Volta → Blackwell)
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance - siboehm. The canonical starting tutorial. Covers tiling, shared memory, vectorized loads
- Inside NVIDIA GPUs: Anatomy of High-Performance Matmul Kernels - Aleksa Gordić. 47 figures. Covers PTX/SASS, wave quantization, ILP, roofline model, warp tiling
- Outperforming cuBLAS on H100: A Worklog - cudaforfun. Real optimization journey using WGMMA and TMA
- Fast CUDA GEMM with Tensor Cores - lezcano. Practical tensor core implementation
- Advanced Matrix Multiplication Optimization - salykova. Detailed optimization techniques following CUTLASS approach
- CUDA Matrix Multiplication Optimization - Lei Mao. Systematic optimization progression
- Optimizing SGEMV for cuBLAS-like Performance - Maharshi. Matrix-vector multiplication optimization worklog
- DeepGEMM - DeepSeek. Clean FP8 GEMM implementation for Hopper, ~300 lines
- New cuBLAS 12.0 Features - Hopper-specific optimizations and performance
- cuBLAS 12.9 Floating Point Emulation - FP32 emulation with BF16 tensor cores
- NVIDIA Tensor Core Evolution: Volta to Blackwell - SemiAnalysis. Comprehensive evolution: WMMA → MMA → WGMMA → tcgen05
- Deep Dive on Hopper TMA Unit for FP8 GEMMs - PyTorch. TMA programming model and FP8 integration
- CUTLASS Tutorial: Mastering TMA - Colfax Research. Tensor Memory Accelerator programming
- Introducing FP8 for Efficient AI Training - NVIDIA. E4M3 vs E5M2 formats, scaling strategies
- Introducing NVFP4 for Low-Precision Inference - NVIDIA. Blackwell FP4 with two-level micro-block scaling, compared with MXFP4
- NVIDIA Transformer Engine - Library for FP8/FP4 training and inference
- Per-Tensor and Per-Block Scaling for FP8 - NVIDIA. Scaling strategies for quantization
- Matrix Multiplication on Blackwell: Part 1 - Modular. tcgen05, TMEM, 2SM MMA programming
- Blackwell Pipelining with CuTeDSL - Simon Veitner. Blog post on advanced Blackwell kernel patterns
- FlashAttention: Fast and Memory-Efficient Attention - Dao et al. Original paper: IO-aware exact attention
- FlashAttention-2 - Dao. Better parallelization, work partitioning
- FlashAttention-3: Fast and Accurate Attention with Asynchrony - Dao et al. Hopper-specific: warp specialization, WGMMA pipelining
- A Case Study in CUDA Kernel Fusion: FlashAttention-2 on Hopper - Jay Shah et al. CUTLASS implementation details
- Efficient Memory Management for LLM Serving with PagedAttention - vLLM team. Virtual memory for KV cache
- FlashInfer - Kernel library for LLM serving (MLSys 2025 Best Paper). PagedAttention, FlashAttention-3, MLA support
- Accelerating Self-Attentions with FlashInfer - Architecture and design decisions
- Mastering LLM Techniques: Inference Optimization - NVIDIA. Comprehensive guide: GQA, MQA, KV cache compression
- GQA: Training Generalized Multi-Query Transformer Models - Google. Grouped Query Attention for memory efficiency
- Multi-Head Latent Attention (MLA) - DeepSeek. Low-rank KV compression, 8x cache reduction
- A Survey on LLM Acceleration based on KV Cache Management - Comprehensive taxonomy of KV cache techniques
- Introducing Triton - OpenAI. Original announcement and motivation
- Triton Language - Development repository
- Deep Dive into Triton Internals (Parts 1-3) - Kapil Sharma. Compiler pipeline: Python → MLIR → PTX → CUBIN
- GPU Mode: Triton Internals Talk - Kapil Sharma. Video + slides from the lecture
- Learn CUTLASS the Hard Way - Lei Mao. Naive GEMM → real CUTLASS progression
- CUTLASS Tutorial: GEMM Kernel Design with Pipelining - Colfax Research. Warp specialization, producer-consumer patterns
- NVIDIA CUTLASS - CUDA Templates for Linear Algebra Subroutines
- cuTile (CUDA Tile) - New tile-level programming model in CUDA 13.1
- TileLang - Composable tiled programming, 1075x speedup over PyTorch on H100
- ThunderKittens - Stanford Hazy Research. DSL for writing fast GPU kernels
- Apache TVM - End-to-end ML compiler with auto-tuning (Ansor)
- MLIR GPU Dialect - Compiler infrastructure for heterogeneous compute
- Mojo - MLIR-based language targeting GPU/CPU, SIMD-first design
- Nsight Compute Roofline Analysis - NVIDIA. Roofline modeling for bottleneck analysis
- CUDA Occupancy Calculator - NVIDIA. cudaOccupancyMaxActiveBlocksPerMultiprocessor API
- Hopper Tuning Guide - Official optimization guide for H100
- Memory Coalescing and Bank Conflicts - Shared memory optimization, padding tricks
- Understanding CUDA Occupancy - Thread block configuration
- The Roofline Model - NERSC. Arithmetic intensity, compute vs memory bound
- Understanding the Top-K CUDA Kernel with PTX - alpindale. 10x speedup over torch.topk, PTX-level optimization
- CUDA Graphs for Reduced Launch Overhead - NVIDIA. Batch kernel launches, 5x speedup for small kernels
- Kernel Batching with CUDA Graphs - Optimal batch sizes (50-100 nodes), 1.4x improvement
- Warp Specialization in PyTorch - Producer-consumer patterns, async execution
- Tawa: Automatic Warp Specialization - Matches FlashAttention-3 performance with less effort
- Developing Triton Kernels on AMD GPUs - AMD ROCm Blog. Triton for MI300X
- Triton Kernel Optimizations on AMD - AMD ROCm Blog. Performance tuning for CDNA
- HipKittens - ThunderKittens for AMD. Tile programming abstraction for MI300X
- Chips and Cheese: AMD CDNA 3 - MI300X architecture analysis, chiplet design
- Chips and Cheese: RDNA 4 - Dynamic register allocation, cache strategies
- AMD RDNA 3 Microbenchmarking - Chips and Cheese
- The Rise of Pallas: Custom TPU Kernels - Towards Data Science. JAX Pallas for TPU programming
- vLLM TPU: Unified JAX Backend - vLLM Blog. 20% throughput improvement via JAX primitives
- Building Production AI on Cloud TPUs with JAX - Google
- vLLM - PagedAttention, continuous batching, high throughput
- SGLang - RadixAttention, structured generation, prefix caching
- TensorRT-LLM - NVIDIA's optimized inference library
- Accelerating Transformers with cuDNN 9 - NVIDIA. Fused attention, Graph API
- Orca: Distributed Serving with Iteration-Level Scheduling - OSDI 2022. Original continuous batching paper, 36.9x throughput
- Continuous Batching from First Principles - Hugging Face. Clear explanation of dynamic batching
- Achieve 23x LLM Inference Throughput - Anyscale. vLLM performance analysis
- Medusa: Simple Framework for Accelerating LLM Generation - Multiple heads for parallel draft tokens
- EAGLE: Speculative Sampling with Draft Model - Autoregressive draft prediction
- Speculative Decoding Overview - vLLM Docs. Implementation in vLLM
- KernelBench: Can LLMs Write Efficient GPU Kernels? - Stanford. 250 PyTorch workloads, fast_p metric
- KernelLLM - Meta. 8B model trained on 25k PyTorch→Triton pairs, beats GPT-4o
- TritonBench - 184 real-world Triton operators from GitHub
- The AI CUDA Engineer - Sakana AI. Evolutionary optimization, 10-100x speedups (with caveats about benchmark gaming)
- AlphaEvolve - Google DeepMind. 32.5% FlashAttention speedup, 23% GEMM speedup
- Kevin: Multi-Turn RL for CUDA Kernels - First multi-turn RL model, 82% correctness (vs 56% base)
- CUDA-L1: Contrastive RL for CUDA Optimization - 3.12x average speedup on KernelBench
- EvoEngineer: Automated CUDA Kernel Evolution
- QiMeng-Kernel: Macro-Thinking Micro-Coding for GPU Kernels
- CUDA-LLM: LLMs Can Write Efficient CUDA Kernels
- GEAK: Triton Kernel AI Agent - AMD ROCm. 51% accuracy, 1.81x speedup on MI300X
- NVIDIA NCCL - Collective communication: all-reduce, all-gather, broadcast
- Fast Multi-GPU Collectives with NCCL - NVIDIA. Ring, tree algorithms, topology-aware optimization
- Demystifying NCCL - In-depth analysis of GPU communication protocols
- Collective Communication for 100k+ GPUs - Meta NCCLX. Scaling to massive clusters
- Megatron-LM: Training Multi-Billion Parameter Models - NVIDIA. Tensor parallelism, pipeline parallelism
- Megatron-LM - Up to 47% MFU on H100 clusters
- Large Scale Tensor Parallel Training - PyTorch Tutorial. Native TP support in PyTorch
- Horovod - Ring-allreduce distributed training, 90% scaling efficiency
- Kernel Fusion in CUDA - vrushankdes.ai. Vertical vs horizontal fusion, U-Net optimization
- Automatic Horizontal Fusion for GPU Kernels - CMU. 12-55% speedup via parallel kernel execution
- Michal Pitr - From Scratch - GPU programming, inference optimization
- cudaforfun Substack - cuBLAS-level kernel development
- Lei Mao's Log Book - CUTLASS, CUDA optimization deep dives
- Aleksa Gordić's Blog - Ex-DeepMind, GPU architecture and matmul
- GPU Mode Discord - 23k+ members, weekly lectures, kernel leaderboard
- GPU Mode Resource Stream - Curated CUDA/GPU learning materials
Contributions welcome! Please ensure resources meet our quality criteria:
- Primary sources (papers, official docs)
- Practitioner blogs with real implementation insights
- Active maintenance or timeless fundamentals
- No surface-level tutorials
- No AI-generated content without human verification
License: MIT
