BharAI-Lab/gpu-perf-engineering-resources

Performance Engineering for AI Infra

Learning Guide: Performance Engineering for AI Infra

Purpose

The purpose of this guide is to help engineers learn GPU kernel programming and optimization, with a focus on high-performance AI systems. It covers the full journey from fundamentals to production deployment, balancing foundational concepts with cutting-edge techniques.

If you're interested in GPU performance engineering, we're hiring at Wafer.

How to read

Recommended reading order:

  1. Read "Tier 1" for all topics
  2. Read "Tier 2" for all topics
  3. Read "Tier 3" for all topics

Table of contents

Fundamentals

Introduction to GPU programming

Tier 1

Architecture deep dives

Tier 2

Low-level details

Tier 3

Matrix Multiplication

Essential tutorials

Tier 1

Advanced implementations

Tier 2

cuBLAS internals

Tier 3

Tensor Cores & Mixed Precision

Tensor core fundamentals

Tier 1

Precision formats

Tier 2

Blackwell-specific

Tier 3

Attention & Memory-Bound Kernels

FlashAttention

Tier 1

PagedAttention & serving

Tier 2

KV cache optimization

Tier 3

Compiler & DSL Approaches

Triton

Tier 1

CUTLASS & CuTe

Tier 2

Other DSLs

Tier 3

  • TileLang - Composable tiled programming, with a reported 1075x speedup over PyTorch on H100
  • ThunderKittens - Stanford Hazy Research. DSL for writing fast GPU kernels
  • Apache TVM - End-to-end ML compiler with auto-tuning (Ansor)
  • MLIR GPU Dialect - Compiler infrastructure for heterogeneous compute
  • Mojo - MLIR-based language targeting GPU/CPU, SIMD-first design

Profiling & Optimization

NVIDIA tools

Tier 1

Optimization techniques

Tier 2

Advanced topics

Tier 3

AMD & Alternative Hardware

ROCm fundamentals

Tier 1

CDNA architecture

Tier 2

TPU & others

Tier 3

Production Inference Systems

Core systems

Tier 1

Continuous batching

Tier 2

Speculative decoding

Tier 3

LLM-Generated Kernels

Benchmarks & models

Tier 1

Agentic approaches

Tier 2

Research papers

Tier 3

Distributed & Multi-GPU

Communication primitives

Tier 1

Parallelism strategies

Tier 2

Kernel fusion

Tier 3

The Big Picture

Practitioner blogs

Communities

Contributing

Contributions welcome! Please ensure resources meet our quality criteria:

  • Primary sources (papers, official docs)
  • Practitioner blogs with real implementation insights
  • Active maintenance or timeless fundamentals
  • No surface-level tutorials
  • No AI-generated content without human verification

License

MIT

Maintainer

emilio@wafer.ai

About

A curriculum for learning GPU performance engineering, from scratch to what the frontier AI labs do
