Summary
I've implemented a GPU-accelerated SFAP (Single Fiber Action Potential) computation backend using CuPy that targets the simulate_muaps() bottleneck in IntramuscularEMG. This is not the same as the existing cupy support for simulate_intramuscular_emg() (spike-train convolution) — it accelerates the biophysical SFAP calculation itself, which dominates total simulation time for large motor units.
Before investing time in adapting the code to MyoGen's coding standards and opening a PR, I'd like to check whether this is something the maintainers would be interested in integrating.
Motivation
When simulating large motor units (>2000 fibers), simulate_muaps() becomes the dominant bottleneck. On a single RTX 3090:
MU with 289 fibers: CPU 141s → GPU 4.1s (34x speedup)
MU with 2036 fibers: CPU 18.6 min → GPU 30.6s (36.6x speedup)
For batch workflows (e.g., generating MUAP template libraries across multiple electrode positions), this reduces hours of compute to minutes.
Approach: Hybrid CPU-GPU Pipeline
The NMJ simulation (sim_nmj_branches_two_layers) stays on CPU to preserve RNG fidelity via get_random_generator(), while the SFAP computation (volume conductor model + current density) is offloaded to GPU:
CPU: NMJ init (preserves RNG state)
→ fiber positions, NMJ delays, conduction velocities
GPU: calc_sfaps batch (CuPy — 95% of compute time)
→ per-fiber SFAPs
CPU: shift_sfaps + sum → MUAP template
Numerical parity: Cross-correlation with CPU reference ≥ 0.9999994. The residual PtP difference (~0.016%) comes from IEEE 754 non-associativity of floating-point summation.
Current Implementation
Zero MyoGen files modified — implemented as a non-invasive wrapper:
GpuSfapCalculator — CuPy SFAP kernels (drop-in replacement for the NumPy path in bioelectric.py)
HybridPipeline — orchestrator with NMJ caching and explicit seed management
Automatic fallback to NumPy when CuPy is not available
Possible integration approaches:
Option A — Backend submodule (myogen/backends/) with SfapEngine Protocol and a new backend= parameter on simulate_muaps()
Option B — GPU subpackage under myogen/simulator/core/emg/intramuscular/gpu/
In either case, CuPy would be an optional [gpu] dependency.
Questions
Is GPU acceleration for simulate_muaps() on the roadmap? I noticed commented-out neuron-gpu-nightly in pyproject.toml — are you planning a different GPU approach?
Preferred integration pattern? Option A vs B, or something else?
Testing requirements? I validate with cross-correlation ≥ 0.999 against CPU reference. Should I integrate with test_determinism.py?
NumPy version? I'm on NumPy 2.2 + CuPy 14. Given your numpy < 2.0 constraint for elephant — should I ensure backwards compat?
Happy to adapt to your coding standards (beartype, __mm unit suffixes, etc.) once we agree on the approach. Thanks for building such a well-structured toolkit!
Summary
I've implemented a GPU-accelerated SFAP (Single Fiber Action Potential) computation backend using CuPy that targets the simulate_muaps() bottleneck in IntramuscularEMG. This is not the same as the existing cupy support for simulate_intramuscular_emg() (spike-train convolution) — it accelerates the biophysical SFAP calculation itself, which dominates total simulation time for large motor units.
Before investing time in adapting the code to MyoGen's coding standards and opening a PR, I'd like to check whether this is something the maintainers would be interested in integrating.
Motivation
When simulating large motor units (>2000 fibers), simulate_muaps() becomes the dominant bottleneck. On a single RTX 3090:
MU with 289 fibers: CPU 141s → GPU 4.1s (34x speedup)
MU with 2036 fibers: CPU 18.6 min → GPU 30.6s (36.6x speedup)
For batch workflows (e.g., generating MUAP template libraries across multiple electrode positions), this reduces hours of compute to minutes.
Approach: Hybrid CPU-GPU Pipeline
The NMJ simulation (sim_nmj_branches_two_layers) stays on CPU to preserve RNG fidelity via get_random_generator(), while the SFAP computation (volume conductor model + current density) is offloaded to GPU:
CPU: NMJ init (preserves RNG state)
→ fiber positions, NMJ delays, conduction velocities
GPU: calc_sfaps batch (CuPy — 95% of compute time)
→ per-fiber SFAPs
CPU: shift_sfaps + sum → MUAP template
Numerical parity: Cross-correlation with CPU reference ≥ 0.9999994. The residual PtP difference (~0.016%) comes from IEEE 754 non-associativity of floating-point summation.
Current Implementation
Zero MyoGen files modified — implemented as a non-invasive wrapper:
GpuSfapCalculator — CuPy SFAP kernels (drop-in replacement for the NumPy path in bioelectric.py)
HybridPipeline — orchestrator with NMJ caching and explicit seed management
Automatic fallback to NumPy when CuPy is not available
Possible integration approaches:
Option A — Backend submodule (myogen/backends/) with SfapEngine Protocol and a new backend= parameter on simulate_muaps()
Option B — GPU subpackage under myogen/simulator/core/emg/intramuscular/gpu/
In either case, CuPy would be an optional [gpu] dependency.
Questions
Is GPU acceleration for simulate_muaps() on the roadmap? I noticed commented-out neuron-gpu-nightly in pyproject.toml — are you planning a different GPU approach?
Preferred integration pattern? Option A vs B, or something else?
Testing requirements? I validate with cross-correlation ≥ 0.999 against CPU reference. Should I integrate with test_determinism.py?
NumPy version? I'm on NumPy 2.2 + CuPy 14. Given your numpy < 2.0 constraint for elephant — should I ensure backwards compat?
Happy to adapt to your coding standards (beartype, __mm unit suffixes, etc.) once we agree on the approach. Thanks for building such a well-structured toolkit!