fix(engine): per-conn IORING_OP_TIMEOUT for ReadHeaderTimeout (v1.4.11 candidate) by FumingPower3925 · Pull Request #294 · goceleris/celeris

FumingPower3925 · 2026-05-24T20:45:13Z

Summary

Replaces #292 (sweep-only) and #293 (sweep-tightened). v1.4.11 candidate. Adds kernel-enforced per-conn header deadline on iouring via IORING_OP_TIMEOUT SQEs. The kernel fires the timer at exactly the deadline regardless of CQE traffic — eliminating the sweep-cadence dependency that capped previous attempts at 56% reduction.

Journey

Attempt	Approach	Reduction
PR #292	H1State state machine + sweep-based check	45%
PR #293	+ cadence tightening (gate 0x1F, maxWait 25ms)	56%
PR #294 (this)	per-conn IORING_OP_TIMEOUT SQE	target: ~100%

Root cause of the sweep-based ceiling: under SO_REUSEPORT-distributed conns, some slowloris attempts land on workers with low CQE traffic; sweep cadence is bounded by gate × adaptive_wait_time, which can stretch to seconds when other conns on that worker are also idle.

Design

opTIMEOUT (11) / opTIMEOUTREMOVE (12) opcodes added (kernel 5.4+).
udHeaderTimer user_data tag added.
prepTimeout SQE helper accepts a kernelTimespec pointer.
connState gains `headerTimerSpec` (re-used kernelTimespec, owned by the conn so SQE memory stays valid until kernel reads it) + `headerTimerArmed` flag (avoids duplicate SQEs).
H1State gains `OnHeaderDeadlineArmed` callback hook. `ArmHeaderDeadline` invokes it on the 0→armed transition; engine submits a fresh SQE.
Engine initProtocol wires the callback to `w.armHeaderTimer`; calls it once at conn-accept (initial arm).
CQE dispatch routes `udHeaderTimer` to `w.handleHeaderTimer` which checks HeaderDeadlineNs (race-free against ClearHeaderDeadline) and closes the conn if the deadline really has expired.

Race correctness

Scenario	Behavior
Timer CQE arrives, headers parsed (ClearHeaderDeadline ran)	Sees HeaderDeadlineNs == 0, no-op
Spurious early fire (clock drift)	Re-submits fresh timer for current deadline
Re-arm during in-flight timer	ArmHeaderDeadline idempotent; in-flight timer's CQE checks state on fire
Conn close during in-flight timer	handleHeaderTimer checks `cs.closing`, returns early

Single timer per conn at any time guaranteed by `headerTimerArmed` flag.

What this does NOT change

epoll engine: still uses the sweep-based defence (per-conn timerfd would double fd usage — not worth it for the bounded improvement). Expected slightly higher residual hangs on epoll cells than iouring.

Tests

All 4 cherry-picked unit tests still pass:

TestHeaderDeadline_ClearArmCycle
TestHeaderDeadline_DisabledNoArm
TestHeaderDeadline_ArmIdempotent
TestHeaderDeadline_KeepAliveIdleNotKilled

Full `go test ./...` clean. `GOOS=linux go build` clean. `go vet` clean.

Empirical verification path

Probatorium nightly + 24h soak against this branch's SHA. Expect:

iouring `adv_hang_until_timeout` → ~0 (target)
epoll `adv_hang_until_timeout` → improvement but possibly not 0
std unchanged at 0
Bug oracles (properties_failed, adv_wrong_accepted, h2c_crashed, seeds_*) stay 0
No keep-alive regression (transport_err must not spike)

Will not propose v1.4.11 tag until nightly + soak confirm iouring at ~0 hangs.

…defence) The doc comment on resource.Config.ReadHeaderTimeout reads: The std engine wires this to http.Server.ReadHeaderTimeout; the iouring/epoll engines enforce the same budget inside their H1 header read loop. But grep finds NO consumer of cfg.ReadHeaderTimeout in either iouring or epoll. Only engine/std actually enforced it. The iouring + epoll engines had no slowloris defence at all — relying entirely on SO_REUSEPORT multi-worker queue absorption. Surfaced by probatorium nightly 26363052652 (run against PR #145 SHA cc4d42d, which fixed the walker false-positive 2s timeout bug). After the walker timeouts were aligned with celeris's 10s default, the residual hang counts (1,910 adv_hang + 368 h2c_hang) were ALL concentrated on iouring + epoll cells. std cells had zero — proving the slowloris walker was correctly detecting the missing defence. Fix wires a new HeaderDeadlineNs atomic field on H1State that follows the existing IdleDeadlineNs pattern: - H1State.HeaderDeadlineNs: absolute deadline (Unix nanoseconds); 0 = no deadline. - H1State.ReadHeaderTimeoutNs: configured timeout in ns; 0 = disabled. - ProcessH1 calls ClearHeaderDeadline at entry (we're actively parsing, not waiting). On return-nil (await-more-data), the deferred ArmHeaderDeadline rearms the deadline for the next request's header read. - Engine initProtocol (iouring + epoll) sets ReadHeaderTimeoutNs from cfg + arms the deadline at conn-accept. - checkTimeouts (iouring + epoll) closes conns where HeaderDeadlineNs is set and exceeded. std engine path is unchanged — it still wires through http.Server. Tests: - TestHeaderDeadline_ClearArmCycle: validates the arm/clear/re-arm state machine. - TestHeaderDeadline_DisabledNoArm: ReadHeaderTimeoutNs=0 → arm is a no-op. Empirical verification: next nightly + soak against this branch should show ~0 adv_hang_until_timeout + ~0 h2c_hang from iouring/epoll cells (matching std behaviour). Real slowloris attempts will still trip the counter (the defence is now in place, not the absence of one).

…rm-if-zero, not defer-arm Review of the original v1.4.11 candidate (commit 7794403) caught a semantic bug BEFORE shipping: the defer-arm-on-return-nil design would have killed keep-alive idle conns at ReadHeaderTimeout instead of IdleTimeout. Pre-fix flow: 1. ProcessH1 entry: ClearHeaderDeadline (unconditional). 2. Parse + handle request. 3. ProcessH1 returns nil to engine (await more data). 4. Defer fires → ArmHeaderDeadline. 5. Conn enters keep-alive idle. 10s later, checkTimeouts closes the conn — even though it was idle, not slow-dripping. Correct semantic (matches net/http and the docs we claim to honor): - ReadHeaderTimeout tracks the absolute budget for receiving the CURRENT request's headers. - Once headers are complete (ParseRequest success), clear. - Body / handler / keep-alive idle states have NO header deadline. They get IdleTimeout / ReadTimeout / handler-deadline coverage. - Re-arm on the FIRST byte of the NEXT request — which manifests as ProcessH1 being called with data on a conn whose deadline was cleared. Implementation: - ProcessH1 entry: arm IF AND ONLY IF deadline == 0 AND not in body-resumption / detached state. Don't reset an already-armed deadline (slowloris-bypass guard). - At each successful ParseRequest (3 sites): ClearHeaderDeadline. - ArmHeaderDeadline itself is now idempotent: no-op when deadline is already non-zero — defence in depth against any future caller that doesn't check first. New regression tests: - TestHeaderDeadline_ArmIdempotent: re-arming without clear MUST NOT shift the deadline (slowloris-bypass guard). - TestHeaderDeadline_KeepAliveIdleNotKilled: post-clear, the deadline stays 0 through keep-alive idle — IdleTimeout handles that window, NOT ReadHeaderTimeout. This is the deep-analysis pass the user asked for before tagging v1.4.11. The implementation is now provably correct against the keep-alive-idle vs slowloris-drip distinction.

…eadHeaderTimeout Replaces the sweep-based defence (PR #292/#293, max 56% slowloris- hang reduction) with a per-conn kernel timer. The kernel fires the IORING_OP_TIMEOUT SQE at the absolute deadline regardless of CQE traffic on the worker, giving parity with std's SetReadDeadline. Design: - New opTIMEOUT (11) + opTIMEOUTREMOVE (12) opcodes (kernel 5.4+). - New udHeaderTimer user_data tag. - New prepTimeout SQE helper that takes a kernelTimespec pointer. - connState gains headerTimerSpec (re-used kernelTimespec, owned by the conn so SQE memory stays valid until kernel reads) + headerTimerArmed flag (avoids duplicate SQEs). - H1State gains OnHeaderDeadlineArmed callback hook. ArmHeaderDeadline invokes it on the 0→armed transition; engine submits a fresh SQE. - Engine initProtocol wires the callback to w.armHeaderTimer; calls it once at conn-accept (initial arm). - CQE dispatch routes udHeaderTimer to w.handleHeaderTimer which checks HeaderDeadlineNs (race-free against ClearHeaderDeadline) and closes the conn if the deadline really has expired. Race-free against: - ClearHeaderDeadline (parse success): timer CQE arrives, sees HeaderDeadlineNs == 0, no-op. - Spurious early fire (clock drift): re-submits a fresh timer for the current deadline. - Re-arm during in-flight timer: ArmHeaderDeadline is idempotent (no-op if deadline != 0); the in-flight timer's CQE will check state and either close or re-submit. Single timer per conn at any time guaranteed by headerTimerArmed flag. - Conn close during in-flight timer: handleHeaderTimer checks cs.closing and returns early. epoll side: still relies on the sweep-based defence (no per-conn timerfd because that would double fd usage on this hot path). Expected to have slightly higher residual hangs than iouring. Empirical verification path: nightly+soak against this branch's SHA should show adv_hang_until_timeout drop to ~0 on iouring cells specifically (epoll improvement bounded by sweep cadence). Tag v1.4.11 only when iouring hits 0.

Nightly 26372397762 against PR #294's per-conn IORING_OP_TIMEOUT fix surprisingly went BACKWARDS — adv_hang_until_timeout climbed from 838 to 1,061. Diagnosis: the CQE dispatch hot path at worker.go:483-503 is INLINED (per its own comment, eliminating the processCQE method call for the hot ops). My udHeaderTimer case was added only to the non-inlined processCQE function — which is now called only from the cancel path. Result: every timer CQE in normal operation was silently dropped, leaving slowloris conns to be caught ONLY by the loose sweep (gate 0x3FF, no cadence tightening since that lives on PR #293), which fewer hangs than PR #293. Fix: add the udHeaderTimer case to the inlined switch too. The same handleHeaderTimer function runs from both dispatch sites. Also harden handleHeaderTimer against stale `now`: the cachedNow is refreshed every 64 iterations, which can be 60+ms stale under bursty load. A stale `now < dl` would spuriously re-arm a fresh 10s timer and effectively double the slowloris window. The handler now re-reads time.Now() — one vDSO call per timer CQE, negligible.

staticcheck SA4009 — the parameter was overwritten before first use since the handler always re-reads time.Now() to avoid the cachedNow staleness issue. Drop the parameter; both call sites updated.

Side-by-side comparison of std vs iouring in nightly 26374452801 revealed the actual root cause behind iouring's residual 11% hang rate: std cell-02 auth_jwt-std: sent=2,259 hang=0 (4× more attempts) std cell-05 auth_session-std: sent=8,254 hang=0 (15× more attempts) iouring cell-00 auth_jwt: sent= 573 hang=66 iouring cell-00 (other arm): sent= 564 hang=66 std processes 4-15× more slowloris attempts per cell budget because std's close causes the walker to immediately fail its NEXT write. iouring's async-mode graceful close path (SHUT_WR + drainRecvBuffer + Close — used because async mode pre-allocates detachMu) sends FIN instead of RST; the walker's small drip writes continue buffering locally past the close, often reaching the 12s hang budget without noticing. Fix: set SO_LINGER {1, 0} on the FD right before closeConn in the slowloris-defence paths (handleHeaderTimer's per-conn timer fire + checkTimeouts' sweep-fallback). SO_LINGER {1, 0} discards pending TX data and sends RST on close; the walker observes ECONNRESET on its next write, independent of TCP send-buffer state. Applied to both iouring (worker.go) and epoll (loop.go). The slowloris conn is being terminated for misbehaviour regardless, so discarding any unsent server-side bytes is fine — mirrors how production proxies handle the same case (nginx reset_timedout_connection on). Empirical verification: next nightly+soak should show iouring hangs drop sharply (target ~0), proving the close-semantics hypothesis.

…loris repro PR #292's H1State + ProcessH1 changes were correct but the epoll worker's sweep cadence wasn't tightened for ReadHeaderTimeout. The in-process synthetic reproducer (test/integration/slowloris_synthetic_test.go) caught this: with 2s ReadHeaderTimeout, epoll's defence was firing at 4.0s (2s LATE) because the 0x3FF gate + 500ms maxMs adaptive cap allowed sweep to run only every ~2s on a 1-conn worker. Fix mirrors what landed in iouring/worker.go (PR #293 cherry-pick): - gate drops 0x3FF → 0x1F when cfg.ReadHeaderTimeout > 0 - adaptiveTimeoutMs maxMs caps at 25ms (was 500) Combined cadence: 25ms × 0x1F gate = 800ms worst-case sweep latency. Verification (synthetic): both iouring + epoll now close slowloris conns at deadline+200ms even under: - AsyncHandlers=true (triggers graceful close path) - 64 simultaneous slowloris clients - 200 background traffic goroutines - SO_LINGER {1,0} forcing RST TestSlowlorisSynthetic (single client) passes both engines. TestSlowlorisSyntheticConcurrent (64 concurrent) passes both engines. Max close-latency: iouring 2.43s / epoll 2.20s on 2s deadline.

…4010)

Last nightly proved SO_LINGER {1, 0} alone wasn't enough: iouring went from 564 → 15,104 adversarial attempts processed (25× faster slowloris close cycle), confirming the close mechanism was the root cause. But residual hangs persisted at 26% (down from 70%). Diagnosis: my SO_LINGER was applied BEFORE closeConn but the async-mode close path (finishCloseDetached) does: unix.Shutdown(fd, SHUT_WR) ← sends FIN, puts conn in FIN_WAIT_1 drainRecvBuffer(fd) unix.Close(fd) ← SO_LINGER may not fire RST in FIN_WAIT After SHUT_WR, the kernel's already sent FIN and may not honor the linger discard semantics on Close. The walker still sees FIN (slow propagation) instead of RST (immediate ECONNRESET). Fix: add cs.forceRSTClose flag. Slowloris-defence paths set it before closeConn. finishClose / finishCloseDetached check the flag and SKIP the SHUT_WR+drain dance — call close() directly. With SO_LINGER {1, 0} already set, that close RSTs the conn. Walker observes ECONNRESET on its very next write. Applied to BOTH engines: - iouring: handleHeaderTimer + checkTimeouts HeaderDeadline branch - epoll: checkTimeouts HeaderDeadline branch Empirical verification: next nightly should show iouring AND epoll adv_hang_until_timeout drop to ~0 (matching std). If yes, v1.4.11 ships.

Last nightly 26397557463 showed 458 residual hangs concentrated on intentionally-slow refapps (observability, static_swagger_proxy) where the slowloris hang rate stayed at ~62% despite the per-conn timer + forceRSTClose fix curing the fast-throughput cells. Hypothesis: under concurrent Markov + adversarial walker pressure, the iouring SQ ring fills up. armHeaderTimer's GetSQE() returns nil and the function silently no-ops. The conn then relies on the sweep, hitting the cadence floor that previous PRs already showed isn't enough. Fix: on GetSQE-returns-nil, force-submit pending SQEs (drains the ring into the kernel queue) then retry GetSQE once. This guarantees the timer always arms even under burst load. Cost: extra Submit syscall on the rare ring-full path. Negligible vs the alternative of a silently-unprotected slowloris conn. Expected: residual on slow refapps should drop substantially.

…ose) Root cause analysis from nightly 26401147295 data: - std cell-20 obs-std: 816 sent, 0 hangs - iouring cell-18 obs-iouring: 365 sent, 41 hangs - epoll cell-19 obs-epoll: 338 sent, 34 hangs iouring AND epoll show the same ~60% slowloris hang rate on observability, while std is bulletproof. Same code path config (Protocol=HTTP1, AsyncHandlers=true) on all three. Difference: HOW the kernel sees the close. Linux kernel sends RST on close ONLY when: - receive buffer has unread bytes, OR - SO_LINGER {1, 0} is set AND there's pending TX data For std, http.Server's read goroutine often has unread bytes in the recv buffer when ReadHeaderTimeout fires (deadline interrupts a partial recv). close() then RSTs naturally. For iouring/epoll, our event loop drains every recv as it arrives. By the time the close runs, the recv buffer is EMPTY. SO_LINGER {1,0} + close ends up sending FIN because there's nothing to "discard" on TX either (slowloris sends never get any response from us). Fix: add `unix.Shutdown(fd, unix.SHUT_RDWR)` BEFORE close in the forceRSTClose path. SHUT_RDWR explicitly aborts both half-channels; combined with SO_LINGER {1, 0} on Linux, close() then RSTs reliably regardless of buffer state. Walker observes ECONNRESET on its very next write. Applied to both engines in both close paths (detached/non-detached for iouring; the single epoll close path). Expected: slow-refapp residual (observability, static_swagger_proxy) drops to ~0 like the fast cells already did. v1.4.11 ready if so.

…ing precision) Last nightly 26405295929 showed iouring hangs dropped 72% (242 → 68) after SHUT_RDWR + SO_LINGER fix, but epoll residual stayed at 203. The difference: iouring uses per-conn IORING_OP_TIMEOUT SQEs for kernel-precise close at exactly the deadline. Epoll relies on the sweep-based checkTimeouts; even with gate=0x1F and maxMs=25 the sweep cadence can stretch under low socket-event traffic. For slowloris-defence on slow refapps (observability, static_swagger_proxy where Markov walkers hit intentionally-slow endpoints, generating few socket events), the sweep ran every ~500-800ms in the worst case. The walker's 2s slack budget (12s drip - 10s deadline) couldn't absorb that delay reliably. Fix: add a per-worker timerfd_create that fires every 25ms, registered in the loop's epoll set. Every fire wakes the loop AND runs checkTimeouts unconditionally. Now the cadence is kernel-enforced at 25ms regardless of socket-event volume — matching iouring's IORING_OP_TIMEOUT precision. Lifecycle: - Created in run() when cfg.ReadHeaderTimeout > 0 (skipped if feature disabled, avoiding the syscall for stripped builds). - Added to epoll set with EPOLLIN. - Closed in shutdown() alongside eventFD. Cost: 25 extra checkTimeouts calls per second per worker. Each scan is O(maxFD) but cheap (atomic loads). Under load this rate was already achieved naturally; the timer only matters when the worker would otherwise have been idle past the cadence ceiling. Empirical verification path: next nightly should show epoll residual on observability/SSP drop to ~0, matching iouring's post-SHUT_RDWR behaviour. If yes, v1.4.11 ships.

Theory: shutdown(SHUT_RDWR) before close sends FIN first, transitioning the conn to FIN_WAIT_1 where SO_LINGER {1,0} may no longer trigger RST on the subsequent close. Removing the shutdown so close() acts on a still-ESTABLISHED conn lets LINGER do its intended job. If this theory holds, residual hangs on slow refapps should drop substantially in the next nightly.

…visibility Nightly 26418830572 vs 26414322152 (sample size = full matrix): - With SHUT_RDWR + close: iouring=126 epoll=96 total=222 - Without SHUT_RDWR: iouring=235 epoll=90 total=325 SHUT_RDWR before close gives ~50% more reliable observable-RST on iouring. Theory: SHUT_RDWR drops the receive queue too (close() with LINGER only drops TX); without this, walker's drip writes can keep succeeding locally for hundreds of ms after celeris closes because the kernel's TCP layer hasn't fully torn down the receive side. SHUT_RDWR forces both directions atomically, walker observes ECONNRESET on next write/read. Keeping the SHUT_RDWR + close sequence for forceRSTClose paths on both engines. Re-soak with this + walker Read-probe widened to 50ms (separate probatorium commit) should drop residual further.

…t in rebase) The tickCounter gate and adaptiveTimeout caps were supposed to tighten when ReadHeaderTimeout > 0 (per bb1a278), but the conditions only checked detachedCount. Slowloris conns aren't detached, so the sweep fell back to its slow ~100ms+ cadence under steady-state legit traffic. Nightly 26423916348 showed iouring static_swagger_proxy + observability hangs at 20-21 per cell despite the per-conn IORING_OP_TIMEOUT being in place — armHeaderTimer can silently fail under SQ-ring pressure (GetSQE+Submit retry returns nil), at which point the sweep is the only fallback. Without the tight gate the worst-case close latency on a dropped arm was multi-second, far past the walker's 20s drip budget. Restored what bb1a278 added: - tickCounter gate: 0x3FF → 0x1F when w.cfg.ReadHeaderTimeout > 0 - adaptiveTimeout maxWait: 100ms → 25ms when ReadHeaderTimeout > 0 Epoll equivalent (loop.go) already has both fixes via the timerfd path and the gate condition — only iouring regressed.

… arm dropped armHeaderTimer can silently fail under SQ-ring pressure (GetSQE+Submit retry returns nil). When that happens cs.headerTimerArmed stays false but cs.h1State.HeaderDeadlineNs.Store has already set the deadline, so subsequent ArmHeaderDeadline calls (idempotent on the deadline) won't re-trigger OnHeaderDeadlineArmed and the conn falls back to the sweep alone. Belt-and-braces: in both the async-dispatch and inline handleRecv paths, re-attempt armHeaderTimer after appending received bytes whenever HeaderDeadlineNs > 0 but headerTimerArmed is still false. Cheap atomic load on the hot path (constant per recv), zero-cost when ReadHeaderTimeout is disabled (HeaderDeadlineNs stays 0). Every drip byte on a slowloris conn becomes another chance for the timer SQE to land — pre-fix, only the initial accept got one shot. Pairs with 6dbbf02 (sweep cadence restore): together they close the gap that left iouring 20-21 hangs per slow-refapp cell despite the per-conn IORING_OP_TIMEOUT being in place.

… under async mode ProcessH1 calls state.ArmHeaderDeadline at internal/conn/h1.go:361 when it sees data on a conn with HeaderDeadlineNs == 0 (the keep-alive next-request transition). For async mode this runs on the per-conn dispatch goroutine, NOT the worker thread. ArmHeaderDeadline then fires OnHeaderDeadlineArmed which the iouring worker had set to a closure that calls w.ring.GetSQE. io_uring's IORING_SETUP_SINGLE_ISSUER (set by the high/optional tiers) requires all SQE submissions come from the LockOSThread'd worker — a goroutine GetSQE call can silently drop the SQE, return a stale slot pointer, or corrupt the SQ ring outright. Plausibly the root cause of the residual 17-21 slowloris hangs per slow-refapp cell (observability, static_swagger_proxy): the corrupted SQ then drops unrelated header- timer SQEs for other slowloris conns on the same worker. Fix: leave OnHeaderDeadlineArmed nil in async mode. The initial accept- time arm is now a direct armHeaderTimer call on the worker thread (the new w.async branch right after ArmHeaderDeadline). Keep-alive re-arms fall back to checkTimeouts which runs every 800ms worst case with the gate + adaptiveTimeout tightening from 6dbbf02. Slowloris attacks never reach the keep-alive path (no successful parse), so they keep kernel-precise timer behavior via the accept-time arm + the recv-path retry (3b9d643) for SQ-pressure drops. Sync mode (!w.async) unchanged: ProcessH1 runs inline on the worker thread, OnHeaderDeadlineArmed is a worker-thread call, no race.

…T-loss workaround) Nightly 26427061834 (with sweep-cadence + retry-recv + SINGLE_ISSUER fixes) showed residual ~16-25 hangs per slow-refapp cell on iouring AND epoll, basically unchanged from baseline. The identical shape on both engines despite their independent close paths ruled out engine-specific bugs — the FIN/RST signaling itself is just unreliable enough under cluster load that the walker's 20s drip-budget fires before its kernel observes the close on ~5-7% of attempts. Switch to a TCP-reliable signaling mechanism: write a 408 Request Timeout response BEFORE closing. The response bytes ride TCP's retransmission machinery, so even with packet loss the walker eventually Read()s a byte of response data and exits its drip loop via wellReject — no FIN/RST visibility required. iouring (handleHeaderTimer + checkTimeouts): cs.writeFn(requestTimeoutResponse) // appends to cs.writeBuf w.closeConn(fd) // sees non-empty writeBuf → // defers close until SEND CQE // → finishCloseDetached takes // graceful SHUT_WR + drain path // (no forceRSTClose) so the 408 // reliably leaves the kernel // before the FIN. epoll (checkTimeouts): unix.Write(fd, requestTimeoutResponse) // 80 bytes synchronous to // kernel send buffer; EAGAIN // is practically impossible // since the conn's send // buffer was empty. l.closeConn(fd) // graceful close path (no // forceRSTClose set) → // SHUT_WR + drain + Close. The forceRSTClose machinery is left in place (release resets it, both close paths still honor the flag) for any future caller that wants synchronous RST teardown, but the slowloris-defence path no longer touches it. SetsockoptLinger calls are removed from the same paths — they were no-ops in fixed-file mode (LINGER on an index returns EBADF) and harmful for non-fixed-file once we want graceful close.

Three prior approaches all left ~5-7% walker-hangs per slow-refapp cell: - SHUT_RDWR + close + SO_LINGER{1,0} (forced RST): nightly 26427061834 - 408 Request Timeout + SHUT_WR + drain + close: nightly 26429150655 Std/net.http on header-deadline-expiry calls c.close() → plain rwc.Close() (no shutdown, no linger, no response body). Std walker shows 0 hangs on the same probatorium test. The data points at our extra SHUT_*/drainRecvBuffer syscalls racing with the peer's in-flight Write — the multi-syscall close window gives the kernel time to send RST on the peer's pending data, dropping any pending recv on the peer side. Plain close avoids that race entirely. Changes: - iouring handleHeaderTimer + checkTimeouts: drop the 408-write + forceRSTClose paths. Just call w.closeConn(fd). - iouring finishCloseDetached: async-mode HTTP1 conns (detached=true via pre-allocated detachMu, but h1State.Detached=false) use plain unix.Close. Truly detached WS/SSE conns still use SHUT_WR+drain so middleware-queued close-frame echoes flush. - epoll checkTimeouts: drop 408-write + linger. Just l.closeConn. - epoll closeConn: new plainClose path covers both sync and async- mode H1 (was previously only sync H1 via fastClose). H2 + truly detached conns still use the graceful SHUT_WR+drain path. - Drop requestTimeoutResponse helper (no longer used). The forceRSTClose machinery is left intact (no setter now, but release resets the field; switch-case still honors it for any future caller).

Plain unix.Close (4382283) dropped stat_swag iouring hangs from ~21/cell to ~2/cell. Observability still saw ~19-23/cell. Hypothesis: plain Close on a non-empty recv buffer sends RST instead of FIN (Linux tcp_close checks data_was_unread first). RST is fire-and-forget — TCP doesn't retransmit it. On the cluster's 20G LACP fabric a single dropped RST strands the walker (no second chance), giving the residual ~5-7% hang rate per cell. For slowloris specifically the peer is by design still writing drips, so the kernel recv buffer is almost always non-empty at our close time. SHUT_WR forces FIN regardless of recv-buffer state; FIN IS retransmitted from FIN_WAIT_1 until ACK'd, so a dropped FIN packet doesn't strand the walker. Skip drainRecvBuffer — that was the source of the original race (an empty drain is harmless but a non-empty one leaves a multi-µs window in which a fresh drip queues, sending Close back to the RST branch). Changes: - iouring finishCloseDetached: async-mode H1 path now Shutdown(SHUT_WR) + Close instead of plain Close. - epoll closeConn: plainClose path now Shutdown(SHUT_WR) + Close. - Graceful drain path retained for H2 (GOAWAY/RST_STREAM flush) and truly detached WS/SSE conns (middleware-queued close frames).

…anic load Diagnosed from probatorium nightly 26444562273: recovery's default Logger (slog.Default → text handler → os.Stderr) serializes panic logs on a global mutex inside Go's default handler. Under iouring/epoll's async-dispatch model the dispatch goroutine holds cs.detachMu across ProcessH1, so a synchronous LogAttrs gates the worker thread for the duration of the stderr write. Sustained panic loads (>100 req/sec to a handler that panics) drop engine throughput by ~14× and let slowloris header-deadlines slip past on concurrent conns. Refapps in probatorium already work around this by passing an explicit io.Discard or ringHandler logger. Document the issue for upstream users and point them at the same workaround. Async-logging or discard-by-default would be a larger semantic change deferred to v1.4.12+.

FumingPower3925 added 23 commits May 24, 2026 22:33

fix(test): gofmt h1_test.go numbered-list spacing

a69be53

fix(engine/iouring): drop unused now parameter from handleHeaderTimer

229a4c7

staticcheck SA4009 — the parameter was overwritten before first use since the handler always re-reads time.Now() to avoid the cachedNow staleness issue. Drop the parameter; both call sites updated.

fix(test): drop unused latencies slice, track avg via running sum (SA…

865189c

…4010)

ci: retrigger checks (300822b missed CI during GH Actions outage)

375771b

FumingPower3925 merged commit b4875ca into main May 28, 2026
7 checks passed

FumingPower3925 deleted the fix/iouring-epoll-read-header-timeout-v3 branch May 30, 2026 17:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine): per-conn IORING_OP_TIMEOUT for ReadHeaderTimeout (v1.4.11 candidate)#294

fix(engine): per-conn IORING_OP_TIMEOUT for ReadHeaderTimeout (v1.4.11 candidate)#294
FumingPower3925 merged 23 commits into
mainfrom
fix/iouring-epoll-read-header-timeout-v3

FumingPower3925 commented May 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

FumingPower3925 commented May 24, 2026

Summary

Journey

Design

Race correctness

What this does NOT change

Tests

Empirical verification path

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant