
# Mini-Inference Engine

CUDA GEMM optimization tutorial and mini inference engine
From naive matrix multiplication to ~85% cuBLAS-class performance on the reference benchmark

English · 简体中文 · Online Docs · Quick Start

CI Docs License: MIT CUDA C++17


## What this repository contains

Mini-Inference Engine is a compact CUDA/C++17 project for learning high-performance GEMM optimization in a realistic inference-engine setting. It keeps the scope intentionally small: matrix multiplication kernels, runtime utilities, benchmarks, tests, and bilingual documentation all live in one traceable codebase.

Core value:

| Area | What to inspect |
| --- | --- |
| GEMM kernels | `src/naive_matmul.cu` through `src/vectorized_gemm.cu` show the optimization path. |
| Runtime components | `include/tensor.h`, `include/inference_engine.h`, `include/memory_pool.h`, and `include/stream_manager.h`. |
| Benchmarks | `benchmarks/benchmark.cpp`, `benchmarks/detailed_benchmark.cu`, and `benchmarks/mnist_demo.cpp`. |
| Specs | `openspec/specs/` defines requirements, architecture, API, data, and testing expectations. |
| Documentation | `docs/en/` and `docs/zh/` provide the tutorial, architecture, API, and tuning guides. |
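The runtime components listed above are standard inference-engine building blocks. As a rough illustration of the kind of facility a header like `include/memory_pool.h` provides, here is a minimal size-bucketed free-list pool; the class name and methods are hypothetical, not the project's actual API, and a real pool would manage CUDA device memory rather than host allocations:

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a size-bucketed memory pool: released blocks are
// cached per size and handed back on the next same-size request, avoiding
// repeated trips to the underlying allocator.
class MemoryPool {
public:
    void* allocate(std::size_t bytes) {
        auto& bucket = free_blocks_[bytes];
        if (!bucket.empty()) {             // reuse a cached block if one exists
            void* p = bucket.back();
            bucket.pop_back();
            return p;
        }
        return ::operator new(bytes);      // otherwise fall back to the system allocator
    }

    void release(void* p, std::size_t bytes) {
        free_blocks_[bytes].push_back(p);  // cache instead of freeing
    }

    ~MemoryPool() {
        for (auto& [size, bucket] : free_blocks_)
            for (void* p : bucket) ::operator delete(p);
    }

private:
    std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
};
```

The payoff is that a release followed by a same-size allocate is an O(1) vector pop with no allocator round trip, which matters when inference runs allocate identically sized activations every step.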

The headline performance number is hardware-specific. The project uses a conservative reference claim: the best optimized kernel reaches about 85% of cuBLAS-class throughput on the documented RTX 3080 1024×1024 benchmark.
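The optimization path in those kernels centers on tiling: splitting the output matrix into blocks so each loaded tile of the inputs is reused many times from fast memory before being evicted. The CUDA kernels stage tiles in shared memory; the same idea can be sketched on the CPU as cache blocking (function and tile names here are illustrative, not taken from the repository):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// CPU cache-blocking analogue of shared-memory tiling in a GEMM kernel:
// C += A * B for row-major N x N matrices. Each TILE x TILE block of A and B
// is reused across a whole tile of C while it is still hot in cache.
constexpr std::size_t TILE = 32;  // plays the role of the CUDA thread-block tile

void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t N) {
    for (std::size_t i0 = 0; i0 < N; i0 += TILE)
        for (std::size_t k0 = 0; k0 < N; k0 += TILE)
            for (std::size_t j0 = 0; j0 < N; j0 += TILE)
                for (std::size_t i = i0; i < std::min(i0 + TILE, N); ++i)
                    for (std::size_t k = k0; k < std::min(k0 + TILE, N); ++k) {
                        const float a = A[i * N + k];  // reused across the whole j tile
                        for (std::size_t j = j0; j < std::min(j0 + TILE, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

On the GPU the same loop structure appears with `__shared__` staging buffers and a `__syncthreads()` barrier between loading a tile and consuming it; the arithmetic is identical.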


## Quick start

Requirements: CUDA Toolkit 11.0+, CMake 3.18+, a C++17 compiler, and an NVIDIA GPU with compute capability 7.0+.

```bash
git clone https://github.com/LessUp/mini-inference-engine.git
cd mini-inference-engine

# Configure, build, and test the default (debug) preset
cmake --preset default
cmake --build --preset default
ctest --preset default --output-on-failure

# Build the release preset and run the benchmark
cmake --preset release
cmake --build --preset release
./build-release/benchmark
```

GPU tests skip when no CUDA device is available, but building still requires a CUDA toolkit because the library is compiled as a CUDA project.


## Documentation map

| Topic | English | 中文 |
| --- | --- | --- |
| Quick Start | docs/en/QUICK_START.md | docs/zh/QUICK_START.md |
| Architecture | docs/en/ARCHITECTURE.md | docs/zh/ARCHITECTURE.md |
| GEMM Optimization | docs/en/GEMM_OPTIMIZATION.md | docs/zh/GEMM_OPTIMIZATION.md |
| Performance Tuning | docs/en/PERFORMANCE_TUNING.md | docs/zh/PERFORMANCE_TUNING.md |
| API Reference | docs/en/API_REFERENCE.md | docs/zh/API_REFERENCE.md |
| Development Guide | docs/en/CONTRIBUTING.md | docs/zh/CONTRIBUTING.md |

## Engineering workflow

- Source of truth: `openspec/specs/**`.
- Build system: explicit source lists in `CMakeLists.txt`; do not use recursive globbing for source files.
- Formatting: `.clang-format` with a Google-based 4-space style.
- Tests: `tests_host` covers utilities that do not require a GPU device; `tests_gpu` covers CUDA runtime and kernel behavior. Configuring and compiling the project still requires a CUDA Toolkit.
- Branching: keep `master` as the only long-lived branch; use short-lived branches or worktrees for changes and delete them after merge.
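The formatting rule above translates into a `.clang-format` along these lines; this is a sketch of the likely key settings, not the project's actual file:

```yaml
# Google base style with the project's 4-space indent override
BasedOnStyle: Google
IndentWidth: 4
```

Run `clang-format -i` on changed files (or wire it into a pre-commit hook) so diffs stay noise-free.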

See AGENTS.md for the full project-specific AI and engineering workflow.