This repository contains a curated collection of CUDA and GPU computing projects focused on parallel numerical computation, performance optimization, simulation, and multi-GPU programming. The projects demonstrate how computational workloads can be mapped to GPU architectures using CUDA kernels, thread/block organization, memory hierarchy awareness, reductions, atomic operations, stencil methods, Monte Carlo simulation, multi-GPU decomposition, and FFT-based numerical solvers.
CUDA parallel programming uses GPU hardware to accelerate data-parallel and numerically intensive workloads. Instead of executing work sequentially on the CPU, CUDA programs divide computation into many lightweight threads organized into grids and blocks.
The projects in this repository cover:
- CUDA kernel design for data-parallel computation
- grid, block, and thread mapping
- GPU timing with CUDA events
- block-size tuning and runtime comparison
- shared-memory reductions
- atomic operations for concurrent updates
- stencil-based iterative numerical solvers
- Jacobi iteration for Poisson and heat-diffusion problems
- Monte Carlo simulation and random sampling
- single-GPU and multi-GPU workload partitioning
- domain decomposition and boundary exchange
- cuFFT-based Fourier-space numerical solving
- result validation, runtime scaling, and experiment documentation
Core skills: CUDA, C++, GPU Computing, Parallel Programming, Thread/Block Mapping, CUDA Events, Shared Memory, Atomic Operations, Parallel Reduction, Stencil Computation, Jacobi Iteration, Poisson Solver, Heat Diffusion, Multi-GPU Programming, Domain Decomposition, Monte Carlo Simulation, cuFFT, Numerical Computing, Performance Benchmarking, Python, Bash, Technical Documentation
| No. | Project | Topic | Purpose | Main Concepts | Skills / Tags |
|---|---|---|---|---|---|
| 01 | 01-modified-matrix-addition | Modified Matrix Addition | Implement a basic CUDA matrix operation and evaluate how block configuration affects runtime | 2D grid/block mapping, element-wise kernels, CUDA event timing, block-size tuning | CUDA, C++, 2D Kernel, GPU Timing, Performance Tuning |
| 02 | 02-parallel-reduction-trace | Parallel Reduction Trace | Implement a CUDA reduction workflow for large-array summation or trace-style computation | shared-memory reduction, strided access, block-level partial sums, final accumulation | CUDA, Reduction, Shared Memory, Parallel Sum, Benchmarking |
| 03 | 03-poisson-3d-jacobi | 3D Poisson Jacobi Solver | Solve a 3D Poisson problem using iterative Jacobi updates on a structured grid | stencil computation, zero boundary condition, convergence checking, numerical validation | CUDA, Jacobi, Poisson Equation, Stencil, Numerical Solver |
| 04 | 04-multi-gpu-dot-product | Multi-GPU Dot Product | Partition a vector dot product across multiple GPUs and combine partial results | multi-GPU partitioning, per-GPU reduction, host-side accumulation, CUDA timing | CUDA, Multi-GPU, Dot Product, Reduction, Performance Comparison |
| 05 | 05-multi-gpu-heat-diffusion | Multi-GPU Heat Diffusion | Simulate 2D heat diffusion using Jacobi iteration with domain decomposition | halo exchange, boundary conditions, multi-GPU subdomains, iterative stencil updates | CUDA, Multi-GPU, Heat Diffusion, Domain Decomposition, Jacobi |
| 06 | 06-exponential-histogram | Exponential Histogram | Compare CPU and GPU histogram implementations using atomic updates | exponential random samples, histogram bins, global atomics, shared-memory atomics | CUDA, Histogram, Atomic Operations, Shared Memory, Speedup |
| 07 | 07-monte-carlo-10d-integration | Monte Carlo 10D Integration | Estimate a high-dimensional integral using CPU and CUDA Monte Carlo methods | random sampling, importance sampling, Metropolis sampling, convergence analysis | CUDA, Monte Carlo, High-Dimensional Integration, Sampling, Numerical Methods |
| 08 | 08-ising-model-monte-carlo | 2D Ising Model Monte Carlo | Simulate a 2D Ising model using Metropolis updates on GPU | checkerboard update, toroidal lattice, spin simulation, Monte Carlo production runs | CUDA, Monte Carlo, Ising Model, Metropolis, GPU Simulation |
| 09 | 09-poisson-3d-cufft | 3D Poisson Solver with cuFFT | Solve a 3D Poisson equation in Fourier space using cuFFT | FFT transform, Fourier-space Green's function, zero-mode handling, runtime scaling | CUDA, cuFFT, Poisson Solver, FFT, Numerical Computing |
| Skill / Concept | 01 Matrix | 02 Reduction | 03 Jacobi Poisson | 04 Multi-GPU Dot | 05 Heat Diffusion | 06 Histogram | 07 MC Integration | 08 Ising Model | 09 cuFFT Poisson |
|---|---|---|---|---|---|---|---|---|---|
| CUDA kernel programming | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| C++ systems programming | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Grid/block/thread mapping | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| CUDA event timing | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Block-size tuning | ✅ | ✅ | ✅ | ✅ | |||||
| Shared memory | ✅ | ✅ | ✅ | ||||||
| Atomic operations | | | | | | ✅ | | | |
| Parallel reduction | | ✅ | | ✅ | | | | | |
| Stencil computation | | | ✅ | | ✅ | | | | |
| Jacobi iteration | | | ✅ | | ✅ | | | | |
| Poisson equation solving | | | ✅ | | | | | | ✅ |
| Heat diffusion simulation | | | | | ✅ | | | | |
| Multi-GPU programming | ✅ | ✅ | ✅ | ||||||
| Domain decomposition | | | | ✅ | ✅ | | | | |
| Monte Carlo simulation | | | | | | | ✅ | ✅ | |
| Random sampling methods | | | | | | ✅ | ✅ | ✅ | |
| cuFFT / FFT-based solver | | | | | | | | | ✅ |
| Numerical validation | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Experiment automation | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Project 01 (`01-modified-matrix-addition`) implements an element-wise CUDA matrix operation in which each output value is computed independently. It demonstrates basic GPU kernel design and 2D indexing for matrix traversal.
Key demonstrated concepts:
- one GPU thread per matrix element
- 2D grid and block configuration
- CUDA event-based timing
- block-size performance comparison
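The one-thread-per-element mapping and event-based timing listed above can be sketched as follows. This is a minimal illustration, not the project's code: the kernel computes a plain sum `c = a + b` as a stand-in for the "modified" operation, the names `addMatrices` and `timedAdd` are hypothetical, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per element of a row-major rows x cols matrix.
__global__ void addMatrices(const float* a, const float* b, float* c,
                            int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < rows && col < cols) {   // guard threads that fall outside the matrix
        int idx = row * cols + col;
        c[idx] = a[idx] + b[idx];     // assumed operation: plain element-wise sum
    }
}

// Hypothetical helper: runs the kernel with a bx x by block and
// returns the kernel time in milliseconds measured with CUDA events.
float timedAdd(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& c, int rows, int cols, int bx, int by) {
    size_t bytes = size_t(rows) * cols * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(bx, by);
    dim3 grid((cols + bx - 1) / bx, (rows + by - 1) / by);  // ceiling division

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    addMatrices<<<grid, block>>>(dA, dB, dC, rows, cols);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);       // wait until the kernel and stop event finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    c.resize(size_t(rows) * cols);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ms;
}
```

Calling `timedAdd` in a loop over block shapes such as 8x8, 16x16, and 32x32 gives the kind of block-size comparison described above. The timing covers only the kernel, not the host-device copies.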
Project 02 (`02-parallel-reduction-trace`) implements a CUDA reduction workflow for large-array summation or trace-style computation. It demonstrates how partial sums can be computed inside blocks and then combined to obtain a final result.
Key demonstrated concepts:
- shared-memory reduction
- strided global-memory access
- block-level partial sums
- reduction performance tuning
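A minimal sketch of the block-level reduction pattern named above, assuming a fixed block size of 256 threads and a final accumulation on the host. The names `blockSum` and `sumOnGpu` are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block reduces a strided slice of `in` into one partial sum.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float cache[kThreads];
    float local = 0.0f;
    // Grid-stride loop: each thread accumulates elements i, i + stride, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        local += in[i];
    }
    cache[threadIdx.x] = local;
    __syncthreads();
    // Tree reduction in shared memory: halve the active threads each round.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical helper: sums `data` with `blocks` thread blocks.
double sumOnGpu(const std::vector<float>& data, int blocks = 64) {
    int n = static_cast<int>(data.size());
    float *dIn, *dPartial;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);

    // The blocking copy runs after the kernel on the default stream.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    double total = 0.0;               // final accumulation on the host
    for (float p : partial) total += p;
    return total;
}
```

Varying `blocks` (and, with a matching `kThreads`, the block size) is one way to explore the reduction tuning mentioned in the list.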
Project 03 (`03-poisson-3d-jacobi`) solves a 3D Poisson equation using Jacobi iteration. The implementation represents a structured 3D grid and repeatedly applies stencil updates until the configured convergence or iteration condition is reached.
Key demonstrated concepts:
- 3D grid representation
- stencil-based update
- Jacobi iteration
- zero Dirichlet boundary condition
- radial-average validation against expected potential behavior
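The stencil update with a zero Dirichlet boundary can be sketched as below. The sketch makes several assumptions: the equation is taken as ∇²φ = -ρ on an n x n x n grid with spacing `h`, the loop runs a fixed number of iterations instead of testing convergence, and the names `jacobiStep` and `jacobiPoisson` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One Jacobi sweep of the 7-point stencil for the assumed form
// laplacian(phi) = -rho. Boundary cells are held at zero.
__global__ void jacobiStep(const float* phi, float* phiNew, const float* rho,
                           int n, float h2) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x >= n || y >= n || z >= n) return;
    int idx = (z * n + y) * n + x;
    if (x == 0 || y == 0 || z == 0 || x == n - 1 || y == n - 1 || z == n - 1) {
        phiNew[idx] = 0.0f;           // zero Dirichlet boundary
        return;
    }
    phiNew[idx] = (phi[idx - 1] + phi[idx + 1] +
                   phi[idx - n] + phi[idx + n] +
                   phi[idx - n * n] + phi[idx + n * n] +
                   h2 * rho[idx]) / 6.0f;
}

// Hypothetical driver: starts from phi = 0 and runs `iterations` sweeps.
std::vector<float> jacobiPoisson(const std::vector<float>& rho, int n,
                                 float h, int iterations) {
    size_t count = size_t(n) * n * n;
    size_t bytes = count * sizeof(float);
    float *dOld, *dNew, *dRho;
    cudaMalloc(&dOld, bytes);
    cudaMalloc(&dNew, bytes);
    cudaMalloc(&dRho, bytes);
    cudaMemset(dOld, 0, bytes);
    cudaMemset(dNew, 0, bytes);
    cudaMemcpy(dRho, rho.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(8, 8, 8);
    dim3 grid((n + 7) / 8, (n + 7) / 8, (n + 7) / 8);
    for (int it = 0; it < iterations; ++it) {
        jacobiStep<<<grid, block>>>(dOld, dNew, dRho, n, h * h);
        std::swap(dOld, dNew);        // the newest iterate is always in dOld
    }

    std::vector<float> phi(count);
    cudaMemcpy(phi.data(), dOld, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dOld);
    cudaFree(dNew);
    cudaFree(dRho);
    return phi;
}
```

Jacobi reads only the previous iterate, which is why two buffers are swapped instead of updating in place. A convergence test would typically add a reduction over the difference between the two buffers.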
Project 04 (`04-multi-gpu-dot-product`) computes a vector dot product using multiple GPUs. The input vectors are partitioned across devices, each GPU computes a partial dot product, and the host combines the partial results.
Key demonstrated concepts:
- multi-GPU workload partitioning
- per-device memory allocation
- partial reduction
- host-side final accumulation
- single-GPU and multi-GPU timing comparison
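The partition, per-device reduction, and host-side accumulation steps above might look like the following sketch. It assumes contiguous, equally sized chunks per device and uses every visible GPU. `partialDot` and `multiGpuDot` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Per-block partial dot product using a shared-memory tree reduction.
__global__ void partialDot(const float* a, const float* b, float* partial, int n) {
    __shared__ float cache[kThreads];
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        local += a[i] * b[i];
    }
    cache[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical driver: splits the vectors into one contiguous chunk per GPU.
double multiGpuDot(const std::vector<float>& a, const std::vector<float>& b,
                   int blocks = 32) {
    int gpus = 0;
    cudaGetDeviceCount(&gpus);
    if (gpus < 1) return 0.0;                     // no CUDA device visible
    int n = static_cast<int>(a.size());
    int chunk = (n + gpus - 1) / gpus;            // elements per device, rounded up
    std::vector<float*> dA(gpus), dB(gpus), dPartial(gpus);

    // Phase 1: copy each chunk and launch. Kernel launches are asynchronous,
    // so a device keeps computing while the host moves on to the next one.
    for (int g = 0; g < gpus; ++g) {
        int offset = std::min(g * chunk, n);
        int len = std::min(chunk, n - offset);
        size_t bytes = len * sizeof(float);
        cudaSetDevice(g);
        cudaMalloc(&dA[g], std::max(bytes, sizeof(float)));
        cudaMalloc(&dB[g], std::max(bytes, sizeof(float)));
        cudaMalloc(&dPartial[g], blocks * sizeof(float));
        cudaMemcpy(dA[g], a.data() + offset, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB[g], b.data() + offset, bytes, cudaMemcpyHostToDevice);
        partialDot<<<blocks, kThreads>>>(dA[g], dB[g], dPartial[g], len);
    }

    // Phase 2: collect the per-block partial sums and accumulate on the host.
    double total = 0.0;
    std::vector<float> partial(blocks);
    for (int g = 0; g < gpus; ++g) {
        cudaSetDevice(g);
        cudaMemcpy(partial.data(), dPartial[g], blocks * sizeof(float),
                   cudaMemcpyDeviceToHost);
        for (float p : partial) total += p;
        cudaFree(dA[g]);
        cudaFree(dB[g]);
        cudaFree(dPartial[g]);
    }
    return total;
}
```

On a single-GPU machine the same function runs with one chunk, which gives a baseline for the single-GPU versus multi-GPU comparison.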
Project 05 (`05-multi-gpu-heat-diffusion`) simulates 2D heat diffusion using Jacobi iteration and multi-GPU domain decomposition. Each GPU updates a subdomain, and neighboring boundary data is exchanged after each update to keep the subdomains consistent.
Key demonstrated concepts:
- 2D stencil computation
- fixed boundary conditions
- domain decomposition
- halo/boundary exchange
- single-GPU and two-GPU comparison
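The decomposition and halo exchange can be sketched as follows, under stated assumptions: the grid is split into horizontal strips of interior rows, each strip carries one halo row above and below, the Jacobi update is a plain four-neighbor average, and halo rows are exchanged with `cudaMemcpyPeer` after every step. The names `heatStep` and `diffuse` are hypothetical, and the project itself may use a different discretization or exchange mechanism.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <utility>
#include <vector>

// Jacobi update of the owned rows (local rows 1..rows) of one strip.
// Local rows 0 and rows + 1 are halos; columns 0 and nx - 1 are fixed.
__global__ void heatStep(const float* cur, float* next, int nx, int rows) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y + 1;  // skip the lower halo
    if (x < 1 || x > nx - 2 || y > rows) return;
    int i = y * nx + x;
    next[i] = 0.25f * (cur[i - 1] + cur[i + 1] + cur[i - nx] + cur[i + nx]);
}

// Hypothetical driver: u is a row-major nx x ny grid whose edges hold the
// fixed boundary values. Interior rows are split across all visible GPUs.
std::vector<float> diffuse(std::vector<float> u, int nx, int ny, int steps) {
    int gpus = 0;
    cudaGetDeviceCount(&gpus);
    gpus = std::min(gpus, ny - 2);    // at least one interior row per device
    if (gpus < 1) return u;
    std::vector<int> first(gpus), rows(gpus);
    std::vector<float*> cur(gpus), next(gpus);
    size_t rowBytes = nx * sizeof(float);
    int interior = ny - 2, row = 1;
    for (int g = 0; g < gpus; ++g) {
        rows[g] = interior / gpus + (g < interior % gpus ? 1 : 0);
        first[g] = row;
        row += rows[g];
        size_t bytes = (rows[g] + 2) * rowBytes;   // owned rows plus two halos
        const float* src = u.data() + size_t(first[g] - 1) * nx;
        cudaSetDevice(g);
        cudaMalloc(&cur[g], bytes);
        cudaMalloc(&next[g], bytes);
        cudaMemcpy(cur[g], src, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(next[g], src, bytes, cudaMemcpyHostToDevice);
    }
    dim3 block(16, 16);
    for (int s = 0; s < steps; ++s) {
        for (int g = 0; g < gpus; ++g) {
            cudaSetDevice(g);
            dim3 grid((nx + 15) / 16, (rows[g] + 15) / 16);
            heatStep<<<grid, block>>>(cur[g], next[g], nx, rows[g]);
        }
        for (int g = 0; g < gpus; ++g) {           // finish all strips first
            cudaSetDevice(g);
            cudaDeviceSynchronize();
        }
        for (int g = 0; g + 1 < gpus; ++g) {
            // Last owned row of strip g becomes the lower halo of strip g + 1.
            cudaMemcpyPeer(next[g + 1], g + 1,
                           next[g] + size_t(rows[g]) * nx, g, rowBytes);
            // First owned row of strip g + 1 becomes the upper halo of strip g.
            cudaMemcpyPeer(next[g] + size_t(rows[g] + 1) * nx, g,
                           next[g + 1] + nx, g + 1, rowBytes);
        }
        for (int g = 0; g < gpus; ++g) std::swap(cur[g], next[g]);
    }
    for (int g = 0; g < gpus; ++g) {               // gather owned rows
        cudaSetDevice(g);
        cudaMemcpy(u.data() + size_t(first[g]) * nx, cur[g] + nx,
                   rows[g] * rowBytes, cudaMemcpyDeviceToHost);
        cudaFree(cur[g]);
        cudaFree(next[g]);
    }
    return u;
}
```

With one visible GPU the exchange loop does nothing, so the same code serves as the single-GPU baseline. Exchanging a full row per neighbor per step is the communication cost that the single-GPU versus two-GPU comparison exposes.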
Project 06 (`06-exponential-histogram`) builds histograms from samples drawn from an exponential distribution. It compares CPU and GPU versions, including GPU implementations that use global-memory and shared-memory atomic operations.
Key demonstrated concepts:
- histogram binning
- exponential random distribution
- global-memory atomics
- shared-memory atomics
- CPU vs GPU speedup comparison
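The shared-memory variant can be sketched as below: each block accumulates a private histogram with shared-memory atomics and merges it into the global histogram once at the end. The bin count of 64, the uniform bin width, and the names `histShared` and `histogramOnGpu` are assumptions for illustration. Samples are assumed non-negative, as exponential samples are, and values beyond the last bin are dropped.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBins = 64;  // assumed number of bins

__global__ void histShared(const float* x, int n, float binWidth,
                           unsigned int* hist) {
    __shared__ unsigned int local[kBins];
    for (int b = threadIdx.x; b < kBins; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop with atomics on the block-private histogram.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        int bin = static_cast<int>(x[i] / binWidth);
        if (bin >= 0 && bin < kBins) atomicAdd(&local[bin], 1u);
    }
    __syncthreads();

    // One global atomic per bin per block instead of one per sample.
    for (int b = threadIdx.x; b < kBins; b += blockDim.x) {
        atomicAdd(&hist[b], local[b]);
    }
}

// Hypothetical helper: returns the kBins-entry histogram of `samples`.
std::vector<unsigned int> histogramOnGpu(const std::vector<float>& samples,
                                         float binWidth) {
    int n = static_cast<int>(samples.size());
    float* dX;
    unsigned int* dHist;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dHist, kBins * sizeof(unsigned int));
    cudaMemcpy(dX, samples.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, kBins * sizeof(unsigned int));

    histShared<<<32, 256>>>(dX, n, binWidth, dHist);

    std::vector<unsigned int> hist(kBins);
    cudaMemcpy(hist.data(), dHist, kBins * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dX);
    cudaFree(dHist);
    return hist;
}
```

A global-memory version would call `atomicAdd(&hist[bin], 1u)` directly in the sample loop. Because exponential samples concentrate in the first few bins, contention on those bins is what the shared-memory variant is meant to reduce.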
Project 07 (`07-monte-carlo-10d-integration`) estimates a 10-dimensional integral using Monte Carlo techniques. Multiple sampling strategies are compared to evaluate convergence behavior and GPU acceleration.
Key demonstrated concepts:
- high-dimensional numerical integration
- simple Monte Carlo sampling
- direct-inversion importance sampling
- Metropolis sampling
- CPU vs CUDA comparison
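Of the strategies listed above, the sketch below covers only simple (uniform) Monte Carlo sampling, using the cuRAND device API. The integrand 1 / (1 + x₁² + ... + x₁₀²) over the unit hypercube is an assumed stand-in, since the README does not state the actual integrand. `integrand`, `sampleKernel`, and `estimateIntegral` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

constexpr int kDim = 10;

// Assumed integrand on [0, 1]^10; replace with the function of interest.
__device__ double integrand(const double* x) {
    double r2 = 0.0;
    for (int d = 0; d < kDim; ++d) r2 += x[d] * x[d];
    return 1.0 / (1.0 + r2);
}

// Each thread draws its own uniform points and stores the sum of f(x).
__global__ void sampleKernel(double* threadSums, int samplesPerThread,
                             unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);   // same seed, one subsequence per thread
    double sum = 0.0;
    double x[kDim];
    for (int s = 0; s < samplesPerThread; ++s) {
        for (int d = 0; d < kDim; ++d) x[d] = curand_uniform_double(&state);
        sum += integrand(x);
    }
    threadSums[tid] = sum;
}

// Hypothetical helper: returns the sample mean of f, which estimates the
// integral because the unit hypercube has volume 1.
double estimateIntegral(int blocks, int threads, int samplesPerThread,
                        unsigned long long seed) {
    int total = blocks * threads;
    double* dSums;
    cudaMalloc(&dSums, total * sizeof(double));
    sampleKernel<<<blocks, threads>>>(dSums, samplesPerThread, seed);

    std::vector<double> sums(total);
    cudaMemcpy(sums.data(), dSums, total * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(dSums);

    double acc = 0.0;
    for (double s : sums) acc += s;
    return acc / (double(total) * samplesPerThread);
}
```

The statistical error of this estimator falls as one over the square root of the number of samples, regardless of dimension. Importance sampling and Metropolis sampling change how the points are drawn, not this overall structure.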
Project 08 (`08-ising-model-monte-carlo`) simulates the 2D Ising model on a toroidal lattice using the Metropolis algorithm. The checkerboard update pattern splits the lattice into two interleaved sublattices so that spins updated in parallel are never nearest neighbors.
Key demonstrated concepts:
- Ising spin simulation
- toroidal boundary condition
- Metropolis Monte Carlo update
- checkerboard update scheme
- GPU-based production runs
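A minimal sketch of the checkerboard Metropolis update, assuming coupling J = 1, no external field, an even lattice size `L`, and one cuRAND state per site. The names `initRng`, `updateColor`, and `runIsing` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <vector>

__global__ void initRng(curandState* states, int n, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) curand_init(seed, i, 0, &states[i]);
}

// Metropolis update of all sites with (x + y) % 2 == color on an L x L torus.
// Sites of one color only read neighbors of the other color, so there are
// no conflicting reads and writes within a launch (requires even L).
__global__ void updateColor(int* spin, curandState* states, int L,
                            float beta, int color) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L || ((x + y) & 1) != color) return;
    int i = y * L + x;
    int neighbors = spin[y * L + (x + 1) % L] +          // periodic wrap-around
                    spin[y * L + (x + L - 1) % L] +
                    spin[((y + 1) % L) * L + x] +
                    spin[((y + L - 1) % L) * L + x];
    float dE = 2.0f * spin[i] * neighbors;   // energy change of a flip, J = 1
    curandState local = states[i];
    if (dE <= 0.0f || curand_uniform(&local) < expf(-beta * dE)) {
        spin[i] = -spin[i];
    }
    states[i] = local;                        // save the advanced RNG state
}

// Hypothetical driver: runs `sweeps` full lattice sweeps from `spin`.
std::vector<int> runIsing(std::vector<int> spin, int L, float beta,
                          int sweeps, unsigned long long seed) {
    int n = L * L;
    int* dSpin;
    curandState* dStates;
    cudaMalloc(&dSpin, n * sizeof(int));
    cudaMalloc(&dStates, n * sizeof(curandState));
    cudaMemcpy(dSpin, spin.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    initRng<<<(n + 255) / 256, 256>>>(dStates, n, seed);

    dim3 block(16, 16);
    dim3 grid((L + 15) / 16, (L + 15) / 16);
    for (int s = 0; s < sweeps; ++s) {
        updateColor<<<grid, block>>>(dSpin, dStates, L, beta, 0);
        updateColor<<<grid, block>>>(dSpin, dStates, L, beta, 1);
    }

    cudaMemcpy(spin.data(), dSpin, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dSpin);
    cudaFree(dStates);
    return spin;
}
```

Launching half of the threads only to return immediately is wasteful. A production version would more likely index the sublattice compactly, but the two-launch structure per sweep stays the same.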
Project 09 (`09-poisson-3d-cufft`) solves a 3D Poisson problem using an FFT-based method with cuFFT. The charge distribution is transformed into Fourier space, the potential is computed using a Green's-function-style formulation, and the result is transformed back to real space.
Key demonstrated concepts:
- 3D FFT workflow
- cuFFT integration
- Fourier-space Poisson solver
- zero-mode handling
- runtime scaling and grid-size exploration
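The forward transform, Fourier-space division, and inverse transform can be sketched as follows. The sketch assumes a periodic n x n x n grid, the form ∇²φ = -ρ, and the lattice Green's function of the 7-point Laplacian; the project may instead use the continuum 1/k² form. `applyGreens` and `poissonFft` are hypothetical names, the program must be linked with `-lcufft`, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <cmath>
#include <vector>

// Scale each Fourier mode by h^2 / (4 * sum of sin^2(pi k / n)), the inverse
// eigenvalue of the 7-point lattice Laplacian for laplacian(phi) = -rho.
__global__ void applyGreens(cufftComplex* rhoK, int n, float h) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total = n * n * n;
    if (idx >= total) return;
    int kx = idx % n, ky = (idx / n) % n, kz = idx / (n * n);
    const float piOverN = 3.14159265f / n;
    float sx = sinf(piOverN * kx);
    float sy = sinf(piOverN * ky);
    float sz = sinf(piOverN * kz);
    float denom = 4.0f * (sx * sx + sy * sy + sz * sz);
    // The zero mode has denom = 0. Dropping it fixes the mean potential to 0.
    // The 1 / total factor normalizes cuFFT's unnormalized inverse transform.
    float scale = (idx == 0) ? 0.0f : h * h / (denom * total);
    rhoK[idx].x *= scale;
    rhoK[idx].y *= scale;
}

// Hypothetical driver: rho is a row-major n^3 array; returns phi.
std::vector<float> poissonFft(const std::vector<float>& rho, int n, float h) {
    int total = n * n * n;
    std::vector<cufftComplex> buf(total);
    for (int i = 0; i < total; ++i) {
        buf[i].x = rho[i];
        buf[i].y = 0.0f;
    }
    cufftComplex* d;
    cudaMalloc(&d, total * sizeof(cufftComplex));
    cudaMemcpy(d, buf.data(), total * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan3d(&plan, n, n, n, CUFFT_C2C);
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);           // rho -> rho_k, in place
    applyGreens<<<(total + 255) / 256, 256>>>(d, n, h); // rho_k -> phi_k
    cufftExecC2C(plan, d, d, CUFFT_INVERSE);           // phi_k -> phi

    cudaMemcpy(buf.data(), d, total * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    cudaFree(d);

    std::vector<float> phi(total);
    for (int i = 0; i < total; ++i) phi[i] = buf[i].x;  // imaginary part ~ 0
    return phi;
}
```

A complex-to-complex transform keeps the indexing simple. A real-to-complex plan (`CUFFT_R2C` and `CUFFT_C2R`) would roughly halve memory and work at the cost of handling the reduced last dimension.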
cuda-parallel-programming/
├── README.md
├── .gitignore
├── 01-modified-matrix-addition/
├── 02-parallel-reduction-trace/
├── 03-poisson-3d-jacobi/
├── 04-multi-gpu-dot-product/
├── 05-multi-gpu-heat-diffusion/
├── 06-exponential-histogram/
├── 07-monte-carlo-10d-integration/
├── 08-ising-model-monte-carlo/
└── 09-poisson-3d-cufft/
Typical project folders may use the following structure:
project-folder/
├── README.md
├── Makefile
├── src/
├── scripts/
└── results/
Some folders may omit directories that are not required for that specific project.
Each project contains its own README with project-specific build and run instructions. A typical CUDA/C++ project can be built with:

    make clean
    make

A typical executable run may look like:

    ./build/main

For projects with experiment scripts:

    bash scripts/run_experiments.sh

For Python-based plotting or result summarization:

    python3 scripts/plot_results.py

To reproduce results, use a CUDA-capable NVIDIA GPU with a compatible CUDA Toolkit version installed. Exact runtime results may vary depending on GPU model, driver version, CUDA version, CPU, memory bandwidth, and system load.
Recommended environment:
- NVIDIA GPU with CUDA support
- CUDA Toolkit
- C++ compiler compatible with `nvcc`
- GNU Make
- Python 3 for optional scripts and plots
- Python packages such as `numpy`, `pandas`, and `matplotlib` when plotting scripts are used