fix: run block simulation on dedicated tokio runtime #233

Closed
rswanson wants to merge 1 commit into main from fix/dedicated-sim-runtime

Conversation

@rswanson
Member

@rswanson rswanson commented Mar 2, 2026

Summary

  • The simulator was running CPU-heavy EVM simulation (BlockBuild::build()) on the main tokio runtime, which also serves the /healthcheck endpoint. With CONCURRENCY_LIMIT=10 and a 2-CPU pod, all tokio worker threads were saturated with EVM work, preventing the healthcheck from responding within the 1s liveness probe timeout — causing repeated SIGKILL (exit code 137) and a crash loop.
  • This creates a dedicated tokio::runtime::Runtime for simulation, sized to CONCURRENCY_LIMIT worker threads. The main runtime stays free for healthchecks, cache I/O, and channel operations regardless of simulation load.
  • No throughput loss — simulation still runs with full concurrency on its own thread pool.
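
The isolation idea in the summary can be sketched without tokio. This is a std-only stand-in, not the PR's code: the PR builds a `tokio::runtime::Runtime`, while the sketch below uses plain OS threads so it compiles with no dependencies. The names `simulate` and `run_on_dedicated_pool` are hypothetical, and `simulate` is only a placeholder for the CPU-heavy `BlockBuild::build()` step.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the CPU-heavy simulation step.
/// It is pure arithmetic and never yields, like blocking EVM work.
fn simulate(input: u64) -> u64 {
    (0..50_000u64).fold(input, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

/// Runs every job on a dedicated pool of `concurrency_limit` threads and
/// returns the results in input order. The calling thread only performs
/// channel operations; it never executes `simulate` itself.
fn run_on_dedicated_pool(jobs: Vec<u64>, concurrency_limit: usize) -> Vec<u64> {
    let (job_tx, job_rx) = mpsc::channel::<(usize, u64)>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    let (result_tx, result_rx) = mpsc::channel::<(usize, u64)>();

    // Spawn the dedicated workers (at least one, so jobs always drain).
    let workers: Vec<_> = (0..concurrency_limit.max(1))
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock is held only while receiving, not while simulating.
                let next = job_rx.lock().unwrap().recv();
                match next {
                    Ok((idx, input)) => result_tx.send((idx, simulate(input))).unwrap(),
                    Err(_) => break, // job channel closed: no more work
                }
            })
        })
        .collect();
    // Drop the original sender so the result channel closes once workers exit.
    drop(result_tx);

    let job_count = jobs.len();
    for (idx, input) in jobs.into_iter().enumerate() {
        job_tx.send((idx, input)).unwrap();
    }
    drop(job_tx); // signal the workers that no more jobs are coming

    let mut results = vec![0u64; job_count];
    for (idx, value) in result_rx {
        results[idx] = value;
    }
    for worker in workers {
        worker.join().unwrap();
    }
    results
}
```

Calling `run_on_dedicated_pool((1..=20).collect(), 10)` gives the same values as mapping `simulate` over the inputs sequentially. The trade-off is the one a dedicated runtime carries: the pool's threads exist in addition to whatever threads the rest of the process already runs.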

Test plan

  • Verify make clippy and make test pass (confirmed locally)
  • Deploy to parmigiana and confirm the pod no longer crash-loops
  • Verify liveness/readiness probes pass consistently during simulation
  • Confirm block building completes within the 12s slot window

🤖 Generated with Claude Code

The simulator was running CPU-heavy EVM work on the main tokio runtime,
starving the healthcheck handler and causing liveness probe kills (exit
code 137) in Kubernetes. This creates a dedicated multi-thread runtime
sized to CONCURRENCY_LIMIT for simulation, keeping the main runtime free
for healthchecks, cache I/O, and channel operations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Member

@Evalir Evalir left a comment

So, this technically would fix the main issue: the main tokio runtime is getting blocked by the simulation work, which is blocking and CPU-heavy. The problem is that we're spawning an entirely new tokio runtime for this, which increases resource usage just to isolate the sim task. On top of this, assigning CONCURRENCY_LIMIT worker threads is a tad wasteful, since the sim runtime only runs one task at a time: the simulation work.

What we actually want is to keep this single CPU-heavy task from blocking the main tokio runtime, so that it can keep scheduling healthchecks and other tasks while the sim work continues. Tokio can do this with spawn_blocking, which moves the spawned task onto its blocking thread pool, avoiding the contention issue. I've implemented the changes in a PR on top of this one: see #234
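
For illustration only, here is the shape of that offloading idea using std threads. It is not the code in #234 and does not call tokio's `spawn_blocking`; it is an analogue that compiles with no dependencies. The names `build_block` and `build_while_serving_health` are hypothetical, and the counter stands in for answering healthcheck requests.

```rust
use std::sync::mpsc::{self, TryRecvError};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the blocking, CPU-heavy block build.
fn build_block(txs: &[u64]) -> u64 {
    txs.iter()
        .fold(0u64, |acc, tx| acc.wrapping_mul(31).wrapping_add(*tx))
}

/// Offloads the build to a separate thread and keeps "serving" health
/// checks on the calling thread until the result arrives.
/// Returns the build result and the number of health checks served.
fn build_while_serving_health(txs: Vec<u64>) -> (u64, usize) {
    let (result_tx, result_rx) = mpsc::channel();

    // The heavy work runs off the caller's thread, similar in spirit to
    // moving a closure onto a blocking thread pool.
    let worker = thread::spawn(move || {
        let _ = result_tx.send(build_block(&txs));
    });

    let mut served = 0usize;
    let result = loop {
        served += 1; // stand-in for answering one healthcheck request
        match result_rx.try_recv() {
            Ok(value) => break value,
            // Not finished yet: the caller is free to do other work.
            Err(TryRecvError::Empty) => thread::sleep(Duration::from_millis(1)),
            Err(TryRecvError::Disconnected) => {
                panic!("simulation thread exited without sending a result")
            }
        }
    };
    worker.join().unwrap();
    (result, served)
}
```

With `build_while_serving_health(vec![1, 2, 3])` the result matches `build_block(&[1, 2, 3])`, and at least one health check is served before the loop exits. Only one extra thread is involved per heavy task, which is the resource argument made above for offloading over a second runtime.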

Member

Evalir commented Mar 2, 2026

@dylanlott
Contributor

This should be closed now, yeah?

@Evalir
Member

Evalir commented Mar 4, 2026

yep

@Evalir Evalir closed this Mar 4, 2026

3 participants