I'm an AI Kernel Engineer focused on bridging DSLs and hardware. I work on Triton, MLIR, LLVM, compiler IR transformations, GPU kernel optimization, and Agent-driven end-to-end workload acceleration.
I previously maintained the following open-source projects around OpenAI/Triton.
LLM Inference via Triton (Flexible & Modular): Focused on Kernel
Triton multi-level runner, including cubin, PTX, TTGIR, etc.
Triton with an OpenCL backend, using mlir-translate to emit OpenCL source code
Getting Started with Triton: A Tutorial for Python Beginners
I also keep learning-oriented open-source notes and examples.
NVIDIA cuTile learning notes and examples
🔥 LeetGPU
Personal solutions to LeetGPU problems, primarily written in Triton, with selected CuTeDSL, CUDA, and Mojo implementations. The solutions are organized by problem, and my LeetGPU nickname is BobHuang.
Previously, I worked on:
- 🚀 A new NPU backend for Triton https://github.com/triton-lang/triton
- 🔥 A new TLX-style NPU backend for Triton https://github.com/facebookexperimental/triton
- 🧠 A new PyTorch backend https://github.com/pytorch/pytorch
- 🖥️ MLIR https://github.com/llvm/llvm-project
- 🛠️ LLVM RISC-V backend https://github.com/llvm/llvm-project
- 📦 libclc (OpenCL library) https://github.com/llvm/llvm-project
- ⚡ POCL (OpenCL runtime) https://github.com/pocl/pocl
- 🧩 QEMU (emulator) https://github.com/qemu/qemu
- 🧑‍💻 MLSynthesis (FPGA HLS tool) https://github.com/pku-liang/hector
- 🧪 MLSynthesis Debugger (FPGA HLS tool) https://github.com/pku-liang/Hestia
- ⚙️ ONNX-MLIR (Lowering of ONNX Models in MLIR) https://github.com/onnx/onnx-mlir
- 🧰 Polygeist (C/C++ frontend for MLIR) https://github.com/llvm/Polygeist
I created and maintain the following organizations:




