Overview
Create a dynamic loader for cuBLAS, similar to the existing cuBLASLt loader.
Current State
native/jit/
├── cublaslt_loader.cpp (1080 lines, cuBLASLt dynamic loading)
├── cublaslt_loader.hpp
├── nvrtc_loader.cpp
└── nvrtc_loader.hpp
cuBLASLt is loaded dynamically, but there is no equivalent loader for cuBLAS, so cuBLAS functions are not available at runtime.
Proposed Addition
native/jit/
├── cublas_loader.cpp (NEW - cuBLAS dynamic loading)
├── cublas_loader.hpp (NEW)
├── cublaslt_loader.cpp
├── cublaslt_loader.hpp
├── nvrtc_loader.cpp
└── nvrtc_loader.hpp
Required Functions
Core BLAS Operations
// GEMM
cublasSgemm() // FP32
cublasDgemm() // FP64
cublasHgemm() // FP16
cublasGemmEx() // Mixed precision
// GEMV
cublasSgemv()
cublasDgemv()
// Handle management
cublasCreate()
cublasDestroy()
cublasSetStream()
cublasSetMathMode()
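One detail matters when these functions are resolved by name: with cublas_v2.h, several of the names above are macros that expand to `_v2` entry points, so the string passed to GetProcAddress or dlsym can differ from the name used in source code. The table below is a sketch of that mapping as I understand the cuBLAS headers; it should be checked against the headers of the installed CUDA version. `SymbolName`, `kCublasSymbols`, and `exported_symbol` are hypothetical names, not part of the existing loaders.

```cpp
#include <cstring>

// Hypothetical helper type: pairs the name used in source code with the
// symbol name that is actually exported by the cuBLAS shared library.
struct SymbolName {
    const char* api_name;  // name as written when including cublas_v2.h
    const char* exported;  // name to pass to GetProcAddress / dlsym
};

// Assumption: mapping taken from cublas_v2.h; verify against your CUDA version.
static const SymbolName kCublasSymbols[] = {
    {"cublasCreate",      "cublasCreate_v2"},
    {"cublasDestroy",     "cublasDestroy_v2"},
    {"cublasSetStream",   "cublasSetStream_v2"},
    {"cublasSetMathMode", "cublasSetMathMode"},
    {"cublasSgemm",       "cublasSgemm_v2"},
    {"cublasDgemm",       "cublasDgemm_v2"},
    {"cublasHgemm",       "cublasHgemm"},
    {"cublasGemmEx",      "cublasGemmEx"},
    {"cublasSgemv",       "cublasSgemv_v2"},
    {"cublasDgemv",       "cublasDgemv_v2"},
};

// Returns the exported symbol for an API name, or nullptr if it is not listed.
inline const char* exported_symbol(const char* api_name) {
    for (const SymbolName& s : kCublasSymbols) {
        if (std::strcmp(s.api_name, api_name) == 0) return s.exported;
    }
    return nullptr;
}
```

Keeping the mapping in one table means a missing or renamed symbol shows up as a single lookup failure rather than a silent null pointer at call time.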
Dynamic Loading Pattern
// Follows the same pattern as cublaslt_loader
class CuBLASLoader {
public:
    static CuBLASLoader& instance();
    bool is_available() const;

    // Function pointers (null until resolved)
    cublasStatus_t (*sgemm)(...) = nullptr;
    cublasStatus_t (*dgemm)(...) = nullptr;
    cublasStatus_t (*hgemm)(...) = nullptr;
    cublasStatus_t (*gemmEx)(...) = nullptr;
    // ...

private:
    void* handle_ = nullptr;
    bool load_library();
};
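The symbol-resolution step inside load_library() could look roughly like the sketch below. It resolves every required symbol or none, so is_available() never reports a half-loaded library. This is an assumption about how the loader might be structured, not a description of cublaslt_loader.cpp; `SymbolSlot`, `Resolver`, and `resolve_all` are hypothetical names. The resolver is passed in as a callback so the sketch compiles without CUDA or platform headers; in the real loader it would wrap GetProcAddress (Windows) or dlsym (Linux) on handle_.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback: returns the address of a symbol, or nullptr if absent.
// In a real loader this would wrap GetProcAddress or dlsym on the library handle.
using Resolver = std::function<void*(const char*)>;

// One required entry point: exported name plus where to store its address.
struct SymbolSlot {
    const char* name;
    void** target;
};

// Resolves every slot. On the first miss, clears all targets, reports the
// missing symbol through `missing` (if given), and returns false.
inline bool resolve_all(const std::vector<SymbolSlot>& slots,
                        const Resolver& resolve,
                        std::string* missing = nullptr) {
    for (const SymbolSlot& slot : slots) {
        void* addr = resolve(slot.name);
        if (addr == nullptr) {
            for (const SymbolSlot& s : slots) *s.target = nullptr;
            if (missing != nullptr) *missing = slot.name;
            return false;
        }
        *slot.target = addr;
    }
    return true;
}
```

With this shape, is_available() can simply return the result of the last resolve_all call, and the name of the missing symbol is available for a log message.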
DLL Search Order (Windows)
cublas64_13.dll // CUDA 13.x
cublas64_12.dll // CUDA 12.x
cublas64_11.dll // CUDA 11.x
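A minimal sketch of walking this search order, newest version first, and stopping at the first DLL that opens. The function names `cublas_dll_candidates` and `open_first` are hypothetical. The opener is a callback so the sketch runs without Windows headers; in the real loader it would presumably wrap LoadLibraryA.

```cpp
#include <functional>
#include <string>
#include <vector>

// Builds the candidate list in the order given above: CUDA 13.x, 12.x, 11.x.
inline std::vector<std::string> cublas_dll_candidates() {
    std::vector<std::string> names;
    for (int major : {13, 12, 11}) {
        names.push_back("cublas64_" + std::to_string(major) + ".dll");
    }
    return names;
}

// Hypothetical callback: returns a library handle, or nullptr if loading failed.
using Opener = std::function<void*(const std::string&)>;

// Tries each candidate in order and returns the first handle that opens.
// If `loaded` is given, it receives the name of the library that was opened.
inline void* open_first(const std::vector<std::string>& names,
                        const Opener& open,
                        std::string* loaded = nullptr) {
    for (const std::string& name : names) {
        void* handle = open(name);
        if (handle != nullptr) {
            if (loaded != nullptr) *loaded = name;
            return handle;
        }
    }
    return nullptr;
}
```

Recording which DLL was actually opened is useful for diagnostics when several CUDA versions are installed side by side.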
Use Cases
- Fallback path - When custom kernels fail, use cuBLAS
- Correctness reference - Compare custom kernel results vs cuBLAS
- Legacy dtype support - FP64 GEMM via cuBLAS
- Batched GEMM - cublasSgemmBatched for small batch sizes (not yet listed under Required Functions)
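For the correctness-reference use case, the comparison itself can be a small host-side check once both results have been copied back from the device. The sketch below is one possible shape, with a combined relative and absolute tolerance; `all_close` and its default tolerances are assumptions and would need tuning per dtype (FP16 results need much looser bounds than FP32).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true when every element of `actual` (custom kernel output) is within
// atol + rtol * |expected| of the matching element of `expected` (cuBLAS output).
// Size mismatches and NaN values are treated as failures.
inline bool all_close(const std::vector<float>& actual,
                      const std::vector<float>& expected,
                      float rtol = 1e-4f,
                      float atol = 1e-6f) {
    if (actual.size() != expected.size()) return false;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        const float tol = atol + rtol * std::fabs(expected[i]);
        // Written as !(diff <= tol) so that a NaN diff fails the check.
        if (!(diff <= tol)) return false;
    }
    return true;
}
```

A check like this can also back the fallback path: if the custom kernel's output fails the comparison in a test build, the dispatcher can be configured to route that shape to cuBLAS.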
Acceptance Criteria
Related
native/jit/cublaslt_loader.cpp