```bash
# Profile with Nsight Systems
# (newer Nsight Systems releases write .nsys-rep reports instead of .qdrep)
nsys profile -o report.qdrep ./your_program

# View in Nsight Systems GUI
nsys-ui report.qdrep
```
Key Metrics

| Metric | Target | Action if Poor |
|--------|--------|----------------|
| GPU Utilization | >80% | Increase batch size |
| Memory Bandwidth | >70% | Check data locality |
| Kernel Occupancy | >60% | Check block size |
| PCIe Bandwidth | <20% transfer | Batch transfers |

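
The targets above can be read as a simple checklist. The following is a minimal sketch, not part of Mini-OpenCV: `ProfileMetrics` and `suggestActions` are hypothetical names, and the PCIe target is assumed to mean "less than 20% of total time spent in transfers". The percentages would be read manually from a profiler report.

```cpp
#include <string>
#include <vector>

// Hypothetical container for values read off a profiler report (all in percent).
struct ProfileMetrics {
    double gpuUtilizationPct;
    double memoryBandwidthPct;
    double kernelOccupancyPct;
    double transferTimePct;  // assumption: share of total time spent in PCIe transfers
};

// Returns the "Action if Poor" entry for every metric that misses its target.
std::vector<std::string> suggestActions(const ProfileMetrics& m) {
    std::vector<std::string> actions;
    if (m.gpuUtilizationPct <= 80.0)  actions.push_back("Increase batch size");
    if (m.memoryBandwidthPct <= 70.0) actions.push_back("Check data locality");
    if (m.kernelOccupancyPct <= 60.0) actions.push_back("Check block size");
    if (m.transferTimePct >= 20.0)    actions.push_back("Batch transfers");
    return actions;
}
```

An empty result means every target in the table is met.
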
Performance Comparison
Mini-OpenCV vs OpenCV CPU

| Operation | OpenCV CPU | Mini-OpenCV GPU | Speedup |
|-----------|------------|-----------------|---------|
| Gaussian Blur (5×5) | 12 ms | 0.3 ms | 40x |
| Sobel Edge | 8 ms | 0.2 ms | 40x |
| Histogram Equalize | 15 ms | 0.5 ms | 30x |
| Resize (bilinear) | 6 ms | 0.15 ms | 40x |
| RGB→Grayscale | 2 ms | 0.05 ms | 40x |

1920×1080 image, Intel i9-12900K vs RTX 4080
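
The speedup column is the CPU time divided by the GPU time. As a small worked sketch (the helper names `speedup`, `opsPerSecond`, and `printSpeedups` are illustrative, and the timings are copied from the table), the arithmetic looks like this:

```cpp
#include <cstdio>

// Speedup: CPU time divided by GPU time for the same operation.
double speedup(double cpuMs, double gpuMs) { return cpuMs / gpuMs; }

// Upper bound on operations per second for one stage at the given latency.
// This ignores transfers and any other stage in the pipeline.
double opsPerSecond(double ms) { return 1000.0 / ms; }

struct Row { const char* op; double cpuMs; double gpuMs; };

// Prints the speedup for each row of the comparison table.
void printSpeedups() {
    const Row rows[] = {
        {"Gaussian Blur (5x5)", 12.0, 0.3},
        {"Sobel Edge",           8.0, 0.2},
        {"Histogram Equalize",  15.0, 0.5},
        {"Resize (bilinear)",    6.0, 0.15},
        {"RGB to Grayscale",     2.0, 0.05},
    };
    for (const Row& r : rows) {
        std::printf("%-22s %5.1fx\n", r.op, speedup(r.cpuMs, r.gpuMs));
    }
}
```
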
Use Case Recommendations

| Use Case | Recommendation |
|----------|----------------|
| Real-time video | Mini-OpenCV (GPU) |
| Batch processing | Mini-OpenCV (Pipeline) |
| Single image, latency critical | OpenCV (CPU) |
| Embedded systems | OpenCV (CPU) |

Best Practices
1. Warmup
```cpp
// The first CUDA call pays one-time startup costs
// (context creation, plus JIT compilation when kernels ship as PTX)
GpuImage warmup = processor.gaussianBlur(dummy, 5, 1.5f);
cudaDeviceSynchronize();
// Subsequent calls are faster
```
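
When benchmarking, the warmup runs should be excluded from the measurement. The helper below is a minimal sketch in plain C++; `meanMsAfterWarmup` is a hypothetical name, not a Mini-OpenCV function, and the operation is passed in as a callable so the sketch does not depend on the GPU API.

```cpp
#include <chrono>
#include <functional>

// Runs `op` `warmupRuns` times without timing it, then returns the mean
// wall-clock time in milliseconds over `timedRuns` timed runs.
double meanMsAfterWarmup(const std::function<void()>& op,
                         int warmupRuns, int timedRuns) {
    for (int i = 0; i < warmupRuns; ++i) {
        op();  // absorbs one-time startup costs
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedRuns; ++i) {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timedRuns;
}
```

Because kernel launches are asynchronous, a GPU operation passed as `op` should end with `cudaDeviceSynchronize()`, otherwise the timer measures only the launch overhead.
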
2. Image Size Alignment
```cpp
// Align to 32 for optimal memory access
int alignedWidth = ((width + 31) / 32) * 32;
int alignedHeight = ((height + 31) / 32) * 32;
```
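
The rounding expression generalizes to any positive alignment. A minimal sketch follows, with `alignUp` as an illustrative helper name; 32 matches the warp size on NVIDIA GPUs, which is presumably why it is used here.

```cpp
// Rounds `value` up to the next multiple of `alignment`.
// Assumes value >= 0 and alignment > 0.
constexpr int alignUp(int value, int alignment) {
    return ((value + alignment - 1) / alignment) * alignment;
}

// For a 1920x1080 frame: the width is already a multiple of 32,
// while the height is padded from 1080 to 1088.
static_assert(alignUp(1920, 32) == 1920, "width already aligned");
static_assert(alignUp(1080, 32) == 1088, "height padded to next multiple");
```
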
3. Batch Processing
```cpp
// Process multiple images when possible
PipelineProcessor pipeline(4);
auto results = pipeline.processBatchHost(images);
```
4. Resolution Cascade
```cpp
// For preview quality, downsample first
GpuImage small = processor.resize(image, width/4, height/4);
GpuImage processed = processor.sobelEdgeDetection(small);
GpuImage fullSize = processor.resize(processed, width, height);
```
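
The benefit of the cascade comes from the pixel count: dividing both dimensions by 4 leaves 1/16 of the pixels for the expensive step. A small worked sketch (the `Size` struct and helper names are illustrative only):

```cpp
struct Size { int width; int height; };

// Size after dividing both dimensions by `factor` (integer division).
Size downsample(Size full, int factor) {
    return Size{full.width / factor, full.height / factor};
}

// Total number of pixels in an image of the given size.
long long pixelCount(Size s) {
    return static_cast<long long>(s.width) * s.height;
}

// How many times fewer pixels the downsampled image has.
double pixelReduction(Size full, int factor) {
    return static_cast<double>(pixelCount(full)) /
           static_cast<double>(pixelCount(downsample(full, factor)));
}
```

For a 1920×1080 frame and a factor of 4, the intermediate image is 480×270: 129,600 pixels instead of 2,073,600.
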
Troubleshooting Performance

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| Low GPU utilization | Small images or low batch | Increase batch size |
| PCIe bound | Too many transfers | Use PipelineProcessor |
| Kernel latency | Many small operations | Fuse kernels |
| Memory errors | Image too large | Process in tiles |

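
For the "Process in tiles" case, one way to split an image is a grid of fixed-size tiles with the last row and column clamped to the image border. This is a minimal sketch under that assumption; `Tile` and `makeTiles` are hypothetical names, not part of the Mini-OpenCV API.

```cpp
#include <algorithm>
#include <vector>

// A rectangular region of the image, in pixels.
struct Tile { int x; int y; int width; int height; };

// Covers an imageWidth x imageHeight image with non-overlapping tiles of at
// most tileSize x tileSize pixels. Edge tiles are clamped to the image bounds.
std::vector<Tile> makeTiles(int imageWidth, int imageHeight, int tileSize) {
    std::vector<Tile> tiles;
    if (imageWidth <= 0 || imageHeight <= 0 || tileSize <= 0) {
        return tiles;  // nothing to cover
    }
    for (int y = 0; y < imageHeight; y += tileSize) {
        for (int x = 0; x < imageWidth; x += tileSize) {
            tiles.push_back(Tile{x, y,
                                 std::min(tileSize, imageWidth - x),
                                 std::min(tileSize, imageHeight - y)});
        }
    }
    return tiles;
}
```

Neighborhood filters such as blur or Sobel read pixels beyond the tile border, so in practice each tile would need an overlap margin (halo) of at least the filter radius; the sketch omits this for brevity.
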
Next Steps
[API Reference]({{ site.baseurl }}/api/) - Detailed function documentation
[Examples]({{ site.baseurl }}/tutorials/examples/) - Working code samples
Run benchmarks: `cmake --build build --target gpu_image_benchmark`
For performance questions, see [FAQ]({{ site.baseurl }}/tutorials/faq) or open a discussion