English | 简体中文
# mini-image-pipe

A high-performance DAG-based GPU image processing pipeline with multi-stream scheduling, a pinned memory pool, and CUDA-accelerated operators. Designed for real-time video and batch image processing workflows.
- Highlights
- Quick Start
- Requirements
- Installation
- Build
- Usage
- Operators
- GPU Architecture Support
- Project Structure
- Architecture
- Documentation
- Engineering Quality
- License
## Highlights

- GPU Accelerated: Full CUDA implementation with async kernel execution
- DAG Scheduling: Directed acyclic graph-based task dependency management with automatic parallelization
- Multi-Stream Execution: Concurrent CUDA stream execution for independent tasks
- Memory Efficient: Pinned/Device memory pools with best-fit allocation strategy
- Separable Filtering: Gaussian blur optimized with separable horizontal + vertical passes
- Error Propagation: Task failures automatically propagate downstream along the DAG
## Quick Start

```bash
# Clone the repository
git clone https://github.com/LessUp/mini-image-pipe.git
cd mini-image-pipe

# Build with CMake Presets (recommended)
cmake --preset release
cmake --build --preset release

# Run the demo
./build/demo_pipeline

# Run tests
./build/mini_image_pipe_tests
```

## Requirements

- OS: Linux (Ubuntu 20.04+ recommended)
- CMake: >= 3.18
- CUDA Toolkit: >= 11.0 with `nvcc` in PATH
- C++ Compiler: GCC 7+, Clang 7+, or MSVC 2019+
- GPU: NVIDIA GPU with Compute Capability >= 7.0 (Volta or newer)
- GTest: v1.14.0 (auto-fetched via FetchContent, no manual installation needed)
## Installation

Ensure the CUDA Toolkit is installed:

```bash
# Verify CUDA installation
nvcc --version
```

If it is not installed, download it from https://developer.nvidia.com/cuda-downloads, then add CUDA to your PATH (if it is not already there):

```bash
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

## Build
```bash
# Debug build
cmake --preset default
cmake --build --preset default

# Release build (optimized for performance)
cmake --preset release
cmake --build --preset release

# Native GPU arch only (faster compile)
cmake --preset minimal
cmake --build --preset minimal
```

Without presets:

```bash
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)
```

Run the demo:

```bash
./build/demo_pipeline
```

Run the tests:

```bash
# Using ctest
ctest --preset release

# Or run directly
./build/mini_image_pipe_tests
```

## Usage

```cpp
#include "pipeline.h"
#include "operators/resize.h"
#include "operators/color_convert.h"
#include "operators/gaussian_blur.h"
#include "operators/sobel.h"

#include <cuda_runtime.h>

#include <cstdint>
#include <memory>

using namespace mini_image_pipe;

int main() {
    // Configuration
    PipelineConfig config;
    config.numStreams = 4;
    Pipeline pipeline(config);

    // Add operators
    auto resize = std::make_shared<ResizeOperator>(320, 240, InterpolationMode::BILINEAR);
    auto gray = std::make_shared<ColorConvertOperator>(ColorConversionType::RGB_TO_GRAY);
    auto blur = std::make_shared<GaussianBlurOperator>(GaussianKernelSize::KERNEL_5x5);
    auto sobel = std::make_shared<SobelOperator>();

    int n1 = pipeline.addOperator("Resize", resize);
    int n2 = pipeline.addOperator("Gray", gray);
    int n3 = pipeline.addOperator("Blur", blur);
    int n4 = pipeline.addOperator("Sobel", sobel);

    // Connect: Resize -> Gray -> Blur -> Sobel
    pipeline.connect(n1, n2);
    pipeline.connect(n2, n3);
    pipeline.connect(n3, n4);

    // Allocate GPU memory for input
    int width = 640, height = 480, channels = 3;
    size_t inputSize = width * height * channels * sizeof(uint8_t);
    uint8_t* d_input;
    cudaMalloc(&d_input, inputSize);
    // (Load your image data into d_input here)

    // Set input and execute
    pipeline.setInput(n1, d_input, width, height, channels);
    pipeline.execute();

    // Get output
    void* output = pipeline.getOutput(n4);

    // Cleanup
    cudaFree(d_input);
    return 0;
}
```

See `examples/demo_pipeline.cpp` for a complete working example.
## Operators

| Operator | Function | Features |
|---|---|---|
| GaussianBlur | Gaussian blur | 3×3/5×5/7×7 separable filter, reflection boundary padding |
| Sobel | Edge detection | 3×3 Sobel kernels, gradient magnitude output |
| Resize | Image scaling | Bilinear / nearest-neighbor interpolation |
| ColorConvert | Color conversion | RGB↔Gray, BGR↔RGB, RGBA→RGB |
## GPU Architecture Support

| Architecture | Compute Capability | Example GPUs |
|---|---|---|
| Volta | sm_70 | V100 |
| Turing | sm_75 | RTX 2080, T4 |
| Ampere | sm_80, sm_86 | A100, RTX 3090 |
| Ada Lovelace | sm_89 | RTX 4090, L40 |
| Hopper | sm_90 | H100 |
## Project Structure

```text
mini-image-pipe/
├── include/
│ ├── types.h # Data types, enums, KernelConfig
│ ├── operator.h # IOperator abstract base class
│ ├── memory_manager.h # Pinned/Device memory pool manager
│ ├── task_graph.h # DAG task graph (topological sort, cycle detection)
│ ├── scheduler.h # CUDA multi-stream DAG scheduler
│ ├── pipeline.h # Pipeline builder and execution entry
│ └── operators/
│ ├── color_convert.h # Color space conversion operator
│ ├── resize.h # Image resize operator
│ ├── sobel.h # Sobel edge detection operator
│ └── gaussian_blur.h # Gaussian blur operator (separable filter)
├── src/
│ ├── memory_manager.cu # Memory pool (best-fit strategy)
│ ├── task_graph.cpp # Kahn topological sort, DFS cycle detection
│ ├── scheduler.cu # Stream assignment, event sync, error propagation
│ ├── pipeline.cpp # Buffer allocation, dimension inference, batch processing
│ └── operators/
│ ├── color_convert.cu # RGB/BGR/RGBA/Gray conversion kernels
│ ├── resize.cu # Nearest-neighbor / bilinear interpolation kernels
│ ├── sobel.cu # 3×3 Sobel gradient kernel (__constant__ weights)
│ └── gaussian_blur.cu # Separable Gaussian kernel (horizontal + vertical pass)
├── tests/ # GTest property tests (100 random iterations per operator)
├── examples/
│ └── demo_pipeline.cpp # End-to-end pipeline demo
├── .clang-format # Code format rules
├── .editorconfig # Editor format rules
├── CMakeLists.txt # Build configuration
└── CMakePresets.json # CMake presets (default/release/minimal)
```
## Architecture

```text
┌───────────────────────────────────────────────────────┐
│ Pipeline API │
├───────────────────────────────────────────────────────┤
│ TaskGraph │ DAGScheduler │ MemoryManager │
├───────────────────────────────────────────────────────┤
│ Operators: Gaussian │ Sobel │ Resize │ ColorConvert │
├───────────────────────────────────────────────────────┤
│ CUDA Streams │ CUDA Events │ Shared Memory │
└───────────────────────────────────────────────────────┘
```
## Documentation

- Getting Started Guide - Build and run your first project
- Usage Examples - Common usage patterns and best practices
- Architecture Overview - System design and component overview
- API Reference - Complete API documentation
- Contributing Guide - How to contribute to the project
## Engineering Quality

- Modern CMake: `target_include_directories`, generator expressions, FetchContent, MSVC compatibility
- CI/CD: GitHub Actions (CUDA container build + clang-format check + ctest)
- Memory Safety: Pooled memory management, best-fit allocation, automatic reuse
- Error Handling: Full CUDA API error checking, DAG failure propagation
- Code Standards: `.clang-format` (Google style, 4-space indent, 100 col)
- Test Coverage: 100-iteration randomized property tests per operator/component
## License

MIT License - see LICENSE for details.