Learning Guide: Performance Engineering for AI Infra

Purpose

This guide helps engineers learn GPU kernel programming and optimization, with a focus on high-performance AI systems. It covers the full journey from fundamentals to production deployment, balancing foundational concepts with cutting-edge techniques.

If you're interested in GPU performance engineering, we're hiring at Wafer.

How to read

Fundamentals

Introduction to GPU programming

Tier 1

Programming Massively Parallel Processors (PMPP) - Hwu, Kirk, El Hajj. The canonical textbook; the 4th edition covers Ampere/Hopper
GPU Mode Lectures - Community-driven lecture series: profiling → kernels → CUTLASS → SASS. Active Discord (23k+ members): discord.gg/gpumode
NVIDIA CUDA Programming Guide - Official documentation, essential reference for programming model

Architecture deep dives

Tier 2

NVIDIA Hopper Architecture In-Depth - TMA, Thread Block Clusters, Distributed Shared Memory, WGMMA
Chips and Cheese: Blackwell - Microbenchmarking analysis of GB202, memory latency comparisons
Dissecting the NVIDIA Hopper GPU Architecture - Academic microbenchmarking of H100
Dissecting the NVIDIA Blackwell Architecture - Microbenchmarks covering tcgen05, TMEM, 2SM MMA

Low-level details

Tier 3

PTX ISA Documentation - Official PTX instruction set reference
Understanding PTX - Introduction to CUDA's virtual assembly language
DocumentSASS - Unofficial SASS instruction documentation extracted from nvdisasm
JEB SASS Disassembler - Reverse engineering GPU binaries (Volta → Blackwell)

Matrix Multiplication

Essential tutorials

Tier 1

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance - siboehm. The canonical starting tutorial. Covers tiling, shared memory, vectorized loads
Inside NVIDIA GPUs: Anatomy of High-Performance Matmul Kernels - Aleksa Gordić. 47 figures. Covers PTX/SASS, wave quantization, ILP, roofline model, warp tiling
Outperforming cuBLAS on H100: A Worklog - cudaforfun. Real optimization journey using WGMMA and TMA
Fast CUDA GEMM with Tensor Cores - lezcano. Practical tensor core implementation
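
The tiling idea these tutorials build on can be sketched in plain Python. This is a minimal illustration under stated assumptions, not code from any of the linked posts: `TILE`, `matmul_naive`, and `matmul_tiled` are hypothetical names, and the innermost loops stand in for the work a CUDA thread block would do on a tile staged in shared memory.

```python
TILE = 2  # hypothetical tile edge; a real kernel sizes this to fit shared memory


def matmul_naive(a, b):
    """Reference triple loop: every output element re-reads a full row and column."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=TILE):
    """Same result, but computed tile by tile along the K dimension.

    Each (i0, j0) pair plays the role of one thread block's output tile, and
    each p0 step is one pass over a pair of input tiles that a GPU kernel
    would load once into shared memory and reuse.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Accumulate the partial product of one A tile and one B tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

The `min(...)` bounds handle matrices whose sizes are not multiples of the tile edge, which is the same edge case a real kernel has to guard against.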

Advanced implementations

Tier 2

Advanced Matrix Multiplication Optimization - salykova. Detailed optimization techniques following CUTLASS approach
CUDA Matrix Multiplication Optimization - Lei Mao. Systematic optimization progression
Optimizing SGEMV for cuBLAS-like Performance - Maharshi. Matrix-vector multiplication optimization worklog
DeepGEMM - DeepSeek. Clean FP8 GEMM implementation for Hopper, ~300 lines

cuBLAS internals

Tier 3

New cuBLAS 12.0 Features - Hopper-specific optimizations and performance
cuBLAS 12.9 Floating Point Emulation - FP32 emulation with BF16 tensor cores

Tensor Cores & Mixed Precision

Tensor core fundamentals

Tier 1

NVIDIA Tensor Core Evolution: Volta to Blackwell - SemiAnalysis. Comprehensive evolution: WMMA → MMA → WGMMA → tcgen05
Deep Dive on Hopper TMA Unit for FP8 GEMMs - PyTorch. TMA programming model and FP8 integration
CUTLASS Tutorial: Mastering TMA - Colfax Research. Tensor Memory Accelerator programming

Precision formats

Tier 2

Introducing FP8 for Efficient AI Training - NVIDIA. E4M3 vs E5M2 formats, scaling strategies
Introducing NVFP4 for Low-Precision Inference - NVIDIA. Blackwell 4-bit floating point with block-level scaling, compared against MXFP4
NVIDIA Transformer Engine - Library for FP8/FP4 training and inference
Per-Tensor and Per-Block Scaling for FP8 - NVIDIA. Scaling strategies for quantization
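
To make the E4M3 vs E5M2 trade-off and the two scaling strategies concrete, here is a small sketch. It assumes the common FP8 conventions (E5M2 reserves the top exponent code for inf/NaN, E4M3 reserves only the all-ones mantissa at the top exponent for NaN); the helper names `fp8_max`, `per_tensor_scale`, and `per_block_scales` are hypothetical and not taken from Transformer Engine or any other library.

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    """Largest finite value of a small float format with the given bit split."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (E5M2 convention).
        max_exp = (2 ** exp_bits - 2) - bias
        max_frac = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    else:
        # Only the all-ones mantissa at the top exponent is NaN (E4M3 convention).
        max_exp = (2 ** exp_bits - 1) - bias
        max_frac = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    return max_frac * 2.0 ** max_exp


E4M3_MAX = fp8_max(4, 3, ieee_like=False)  # more mantissa, less range
E5M2_MAX = fp8_max(5, 2, ieee_like=True)   # more range, less mantissa


def per_tensor_scale(values, fmt_max):
    """One scale for the whole tensor: map the largest magnitude to fmt_max."""
    amax = max(abs(v) for v in values)
    return fmt_max / amax if amax > 0 else 1.0


def per_block_scales(values, block, fmt_max):
    """One scale per contiguous block, so an outlier only affects its own block."""
    return [per_tensor_scale(values[i:i + block], fmt_max)
            for i in range(0, len(values), block)]
```

With per-tensor scaling a single outlier shrinks the scale for every element; per-block scaling confines that loss of resolution to the block containing the outlier, at the cost of storing more scale factors.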

Blackwell-specific

Tier 3

Matrix Multiplication on Blackwell: Part 1 - Modular. tcgen05, TMEM, 2SM MMA programming
Blackwell Pipelining with CuTeDSL - Simon Veitner. Blog post on advanced Blackwell kernel patterns

Attention & Memory-Bound Kernels

FlashAttention

Tier 1

FlashAttention: Fast and Memory-Efficient Attention - Dao et al. Original paper: IO-aware exact attention
FlashAttention-2 - Dao. Better parallelization, work partitioning
FlashAttention-3: Fast and Accurate Attention with Asynchrony - Dao et al. Hopper-specific: warp specialization, WGMMA pipelining
A Case Study in CUDA Kernel Fusion: FlashAttention-2 on Hopper - Jay Shah et al. CUTLASS implementation details
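
The core trick behind IO-aware exact attention is an online softmax: scores are processed block by block while a running maximum, running normalizer, and running output are rescaled as needed. The sketch below shows this for a single query with scalar values. It is a simplified illustration, not the FlashAttention kernel; `attention_reference` and `attention_online` are hypothetical names, and the block size is arbitrary.

```python
import math


def attention_reference(scores, values):
    """Standard softmax-weighted sum that needs all scores at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_online(scores, values, block=2):
    """Same result, computed one block at a time with O(1) running state."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    running_out = 0.0   # unnormalized output, relative to running_max
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale what was accumulated under the old maximum.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * correction + sum(weights)
        running_out = running_out * correction + sum(
            w * v for w, v in zip(weights, v_blk))
        running_max = new_max
    return running_out / running_sum
```

Because only the running state is kept, the full score row never has to be written to memory, which is what makes the blockwise formulation attractive for memory-bound attention.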

PagedAttention & serving

Tier 2

Efficient Memory Management for LLM Serving with PagedAttention - vLLM team. Virtual memory for KV cache
FlashInfer - Kernel library for LLM serving (MLSys 2025 Best Paper). PagedAttention, FlashAttention-3, MLA support
Accelerating Self-Attentions with FlashInfer - Architecture and design decisions
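
The "virtual memory for KV cache" idea can be sketched as a block table that maps each sequence's logical token positions to fixed-size physical blocks. This is a toy model under stated assumptions, not vLLM's or FlashInfer's implementation: the class `PagedKVCache` and its methods are hypothetical, and it tracks only block bookkeeping, not the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy block allocator: sequences grow one token at a time, block by block."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; allocate a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return self.locate(seq_id, length)

    def locate(self, seq_id, pos):
        """Translate a logical position into (physical block, offset in block)."""
        table = self.block_tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Since blocks are allocated on demand and need not be contiguous, at most one partially filled block per sequence is wasted, instead of a whole preallocated maximum-length buffer.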

KV cache optimization

Tier 3

Mastering LLM Techniques: Inference Optimization - NVIDIA. Comprehensive guide: GQA, MQA, KV cache compression
GQA: Training Generalized Multi-Query Transformer Models - Google. Grouped Query Attention for memory efficiency
Multi-Head Latent Attention (MLA) - DeepSeek. Low-rank KV compression, 8x cache reduction
A Survey on LLM Acceleration based on KV Cache Management - Comprehensive taxonomy of KV cache techniques
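
A quick calculation shows why reducing the number of KV heads matters. The sketch below sizes the KV cache for multi-head, grouped-query, and multi-query attention. The model shape (32 layers, 32 query heads, head dimension 128, 4096 tokens, FP16) is a hypothetical configuration chosen for round numbers, not a specific model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for storing both K and V.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, FP16 cache (2 bytes per element).
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)  # one KV head per query head
gqa = kv_cache_bytes(LAYERS, 8, HEAD_DIM, SEQ_LEN)            # 8 KV heads shared by groups
mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, SEQ_LEN)            # a single shared KV head

for name, size in (("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

The cache scales linearly with the number of KV heads, so under these assumptions going from 32 to 8 KV heads cuts per-sequence memory by 4x.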

Compiler & DSL Approaches

Triton

Tier 1

Introducing Triton - OpenAI. Original announcement and motivation
Triton Language - Development repository
Deep Dive into Triton Internals (Parts 1-3) - Kapil Sharma. Compiler pipeline: Python → MLIR → PTX → CUBIN
GPU Mode: Triton Internals Talk - Kapil Sharma. Video + slides from the lecture

CUTLASS & CuTe

Tier 2

Learn CUTLASS the Hard Way - Lei Mao. Naive GEMM → real CUTLASS progression
CUTLASS Tutorial: GEMM Kernel Design with Pipelining - Colfax Research. Warp specialization, producer-consumer patterns
NVIDIA CUTLASS - CUDA Templates for Linear Algebra Subroutines
cuTile (CUDA Tile) - New tile-level programming model in CUDA 13.1

Other DSLs

Tier 3

TileLang - Composable tiled programming, 1075x speedup over PyTorch on H100
ThunderKittens - Stanford Hazy Research. DSL for writing fast GPU kernels
Apache TVM - End-to-end ML compiler with auto-tuning (Ansor)
MLIR GPU Dialect - Compiler infrastructure for heterogeneous compute
Mojo - MLIR-based language targeting GPU/CPU, SIMD-first design

Profiling & Optimization

NVIDIA tools

Tier 1

Nsight Compute Roofline Analysis - NVIDIA. Roofline modeling for bottleneck analysis
CUDA Occupancy Calculator - NVIDIA. cudaOccupancyMaxActiveBlocksPerMultiprocessor API
Hopper Tuning Guide - Official optimization guide for H100

Optimization techniques

Tier 2

Memory Coalescing and Bank Conflicts - Shared memory optimization, padding tricks
Understanding CUDA Occupancy - Thread block configuration
The Roofline Model - NERSC. Arithmetic intensity, compute vs memory bound
Understanding the Top-K CUDA Kernel with PTX - alpindale. 10x speedup over torch.topk, PTX-level optimization
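
The roofline model reduces to one formula: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below applies it to a square GEMM. The hardware numbers (100 TFLOP/s, 2 TB/s) are round hypothetical values, not a real GPU's specification, and the function names are illustrative.

```python
PEAK_FLOPS = 100e12  # hypothetical peak compute, FLOP/s
BANDWIDTH = 2e12     # hypothetical memory bandwidth, bytes/s


def attainable_flops(intensity, peak=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Roofline: min(compute roof, intensity * bandwidth slope)."""
    return min(peak, intensity * bandwidth)


def ridge_point(peak=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak / bandwidth


def classify(intensity, peak=PEAK_FLOPS, bandwidth=BANDWIDTH):
    """Label a kernel by which roof limits it."""
    return "memory-bound" if intensity < ridge_point(peak, bandwidth) else "compute-bound"


def gemm_intensity(n, bytes_per_elem=2):
    """FLOP per byte for an n x n x n GEMM, counting each matrix moved once."""
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved
```

Under these assumptions the ridge point is 50 FLOP/byte, so a 300 x 300 FP16 GEMM (intensity 100) sits under the compute roof while a 30 x 30 one (intensity 10) is limited by bandwidth. Real kernels move more bytes than this idealized count, so measured intensity is usually lower.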

Advanced topics

Tier 3

CUDA Graphs for Reduced Launch Overhead - NVIDIA. Batch kernel launches, 5x speedup for small kernels
Kernel Batching with CUDA Graphs - Optimal batch sizes (50-100 nodes), 1.4x improvement
Warp Specialization in PyTorch - Producer-consumer patterns, async execution
Tawa: Automatic Warp Specialization - Matches FlashAttention-3 performance with less effort

AMD & Alternative Hardware

ROCm fundamentals

Tier 1

Developing Triton Kernels on AMD GPUs - AMD ROCm Blog. Triton for MI300X
Triton Kernel Optimizations on AMD - AMD ROCm Blog. Performance tuning for CDNA
HipKittens - ThunderKittens for AMD. Tile programming abstraction for MI300X

CDNA architecture

Tier 2

Chips and Cheese: AMD CDNA 3 - MI300X architecture analysis, chiplet design
Chips and Cheese: RDNA 4 - Dynamic register allocation, cache strategies
AMD RDNA 3 Microbenchmarking - Chips and Cheese

TPU & others

Tier 3

The Rise of Pallas: Custom TPU Kernels - Towards Data Science. JAX Pallas for TPU programming
vLLM TPU: Unified JAX Backend - vLLM Blog. 20% throughput improvement via JAX primitives
Building Production AI on Cloud TPUs with JAX - Google

Production Inference Systems

Core systems

Tier 1

vLLM - PagedAttention, continuous batching, high throughput
SGLang - RadixAttention, structured generation, prefix caching
TensorRT-LLM - NVIDIA's optimized inference library
Accelerating Transformers with cuDNN 9 - NVIDIA. Fused attention, Graph API

Continuous batching

Tier 2

Orca: Distributed Serving with Iteration-Level Scheduling - OSDI 2022. Original continuous batching paper, 36.9x throughput
Continuous Batching from First Principles - Hugging Face. Clear explanation of dynamic batching
Achieve 23x LLM Inference Throughput - Anyscale. vLLM performance analysis
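
The gain from iteration-level scheduling can be seen in a toy simulation that counts decode steps. This is a simplified model, not Orca's or vLLM's scheduler: each request is just its remaining output length (assumed to be at least 1), every step generates one token per running request, and the function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; others sit idle."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave and queued ones join at every iteration."""
    queue = list(lengths)
    running = []
    steps = 0
    while queue or running:
        # Fill any free slots before the next iteration.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        # One token per request; requests with nothing left are retired.
        running = [r - 1 for r in running if r > 1]
    return steps
```

For output lengths `[8, 2, 2, 2]` and a batch size of 2, static batching takes 10 steps because the short request in the first batch holds its slot until the long one finishes, while continuous batching takes 8 by backfilling that slot.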

Speculative decoding

Tier 3

Medusa: Simple Framework for Accelerating LLM Generation - Multiple heads for parallel draft tokens
EAGLE: Speculative Sampling with Draft Model - Autoregressive draft prediction
Speculative Decoding Overview - vLLM Docs. Implementation in vLLM
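
The draft-then-verify loop common to speculative decoding can be sketched with greedy decoding and toy "models". This is a generic illustration, not Medusa's or EAGLE's method (both use more elaborate drafting and tree-based verification). `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and a real system verifies all draft tokens in a single batched target forward pass rather than one call per token.

```python
def speculative_step(context, draft_next, target_next, k):
    """Propose k draft tokens, then keep the prefix the target agrees with.

    Returns the tokens emitted this step: the accepted draft prefix plus one
    token from the target (a correction on mismatch, or a bonus token if the
    whole draft was accepted). The output matches plain greedy decoding with
    the target model.
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verify phase: compare each draft token with the target's greedy choice.
    emitted, ctx = [], list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)  # replace the first wrong token and stop
            return emitted
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target_next(ctx))  # all accepted: one extra token for free
    return emitted


# Toy models over digit tokens: the target counts upward mod 10,
# and the draft agrees except right after a 3.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

Each step emits between 1 and `k + 1` tokens, so the speedup depends on how often the draft agrees with the target.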

LLM-Generated Kernels

Benchmarks & models

Tier 1

KernelBench: Can LLMs Write Efficient GPU Kernels? - Stanford. 250 PyTorch workloads, fast_p metric
KernelLLM - Meta. 8B model trained on 25k PyTorch→Triton pairs, beats GPT-4o
TritonBench - 184 real-world Triton operators from GitHub

Agentic approaches

Tier 2

The AI CUDA Engineer - Sakana AI. Evolutionary optimization, 10-100x speedups (with caveats about benchmark gaming)
AlphaEvolve - Google DeepMind. 32.5% FlashAttention speedup, 23% GEMM speedup
Kevin: Multi-Turn RL for CUDA Kernels - First multi-turn RL model, 82% correctness (vs 56% base)
CUDA-L1: Contrastive RL for CUDA Optimization - 3.12x average speedup on KernelBench

Research papers

Tier 3

EvoEngineer: Automated CUDA Kernel Evolution
QiMeng-Kernel: Macro-Thinking Micro-Coding for GPU Kernels
CUDA-LLM: LLMs Can Write Efficient CUDA Kernels
GEAK: Triton Kernel AI Agent - AMD ROCm. 51% accuracy, 1.81x speedup on MI300X

Distributed & Multi-GPU

Communication primitives

Tier 1

NVIDIA NCCL - Collective communication: all-reduce, all-gather, broadcast
Fast Multi-GPU Collectives with NCCL - NVIDIA. Ring, tree algorithms, topology-aware optimization
Demystifying NCCL - In-depth analysis of GPU communication protocols
Collective Communication for 100k+ GPUs - Meta NCCLX. Scaling to massive clusters
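
The ring algorithm behind all-reduce can be simulated in a few lines: a reduce-scatter phase followed by an all-gather phase, each taking `n - 1` steps around the ring. This is a single-process sketch, not NCCL's implementation. `ring_all_reduce` is a hypothetical name, and it assumes each rank's buffer is already split into exactly `n` chunks, with one number standing in for each chunk.

```python
def ring_all_reduce(data):
    """Simulate sum all-reduce over n ranks, where data[r] holds rank r's n chunks."""
    n = len(data)
    bufs = [list(row) for row in data]

    # Reduce-scatter: at step s, rank r sends chunk (r - s) mod n to rank r + 1,
    # which adds it to its own copy. Afterwards rank r owns the fully reduced
    # chunk (r + 1) mod n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n) for r in range(n)]
        payloads = [(r, idx, bufs[r][idx]) for r, idx in sends]  # snapshot first
        for r, idx, val in payloads:
            bufs[(r + 1) % n][idx] += val

    # All-gather: at step s, rank r forwards chunk (r + 1 - s) mod n, and the
    # receiver overwrites its stale copy with the reduced value.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        payloads = [(r, idx, bufs[r][idx]) for r, idx in sends]
        for r, idx, val in payloads:
            bufs[(r + 1) % n][idx] = val

    return bufs
```

Each rank sends only one chunk per step, so the data sent per rank stays close to twice the buffer size regardless of how many ranks participate. Payloads are snapshotted before being applied to model all ranks sending simultaneously.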

Parallelism strategies

Tier 2

Megatron-LM: Training Multi-Billion Parameter Models - NVIDIA. Tensor parallelism, pipeline parallelism
Megatron-LM - Up to 47% MFU on H100 clusters
Large Scale Tensor Parallel Training - PyTorch Tutorial. Native TP support in PyTorch
Horovod - Ring-allreduce distributed training, 90% scaling efficiency

Kernel fusion

Tier 3

Kernel Fusion in CUDA - vrushankdes.ai. Vertical vs horizontal fusion, U-Net optimization
Automatic Horizontal Fusion for GPU Kernels - CMU. 12-55% speedup via parallel kernel execution

The Big Picture

Practitioner blogs

Michal Pitr - From Scratch - GPU programming, inference optimization
cudaforfun Substack - cuBLAS-level kernel development
Lei Mao's Log Book - CUTLASS, CUDA optimization deep dives
Aleksa Gordić's Blog - Ex-DeepMind, GPU architecture and matmul

Communities

GPU Mode Discord - 23k+ members, weekly lectures, kernel leaderboard
GPU Mode Resource Stream - Curated CUDA/GPU learning materials

Contributing

Contributions welcome! Please ensure resources meet our quality criteria:

Primary sources (papers, official docs)
Practitioner blogs with real implementation insights
Active maintenance or timeless fundamentals
No surface-level tutorials
No AI-generated content without human verification

License

MIT

Maintainer

emilio@wafer.ai
