fix(engine): per-conn IORING_OP_TIMEOUT for ReadHeaderTimeout (v1.4.11 candidate)#294
Merged
Merged
Conversation
…defence)
The doc comment on resource.Config.ReadHeaderTimeout reads:
The std engine wires this to http.Server.ReadHeaderTimeout; the
iouring/epoll engines enforce the same budget inside their H1
header read loop.
But grep finds NO consumer of cfg.ReadHeaderTimeout in either iouring
or epoll. Only engine/std actually enforced it. The iouring + epoll
engines had no slowloris defence at all — relying entirely on
SO_REUSEPORT multi-worker queue absorption.
Surfaced by probatorium nightly 26363052652 (run against PR #145 SHA
cc4d42d, which fixed the walker false-positive 2s timeout bug). After
the walker timeouts were aligned with celeris's 10s default, the
residual hang counts (1,910 adv_hang + 368 h2c_hang) were ALL
concentrated on iouring + epoll cells. std cells had zero — proving
the slowloris walker was correctly detecting the missing defence.
Fix wires a new HeaderDeadlineNs atomic field on H1State that follows
the existing IdleDeadlineNs pattern:
- H1State.HeaderDeadlineNs: absolute deadline (Unix nanoseconds);
0 = no deadline.
- H1State.ReadHeaderTimeoutNs: configured timeout in ns; 0 = disabled.
- ProcessH1 calls ClearHeaderDeadline at entry (we're actively
parsing, not waiting). On return-nil (await-more-data), the
deferred ArmHeaderDeadline rearms the deadline for the next
request's header read.
- Engine initProtocol (iouring + epoll) sets ReadHeaderTimeoutNs from
cfg + arms the deadline at conn-accept.
- checkTimeouts (iouring + epoll) closes conns where HeaderDeadlineNs
is set and exceeded.
std engine path is unchanged — it still wires through http.Server.
Tests:
- TestHeaderDeadline_ClearArmCycle: validates the arm/clear/re-arm
state machine.
- TestHeaderDeadline_DisabledNoArm: ReadHeaderTimeoutNs=0 → arm is
a no-op.
Empirical verification: next nightly + soak against this branch should
show ~0 adv_hang_until_timeout + ~0 h2c_hang from iouring/epoll cells
(matching std behaviour). Real slowloris attempts will still trip the
counter (the defence is now in place, not the absence of one).
…rm-if-zero, not defer-arm Review of the original v1.4.11 candidate (commit 7794403) caught a semantic bug BEFORE shipping: the defer-arm-on-return-nil design would have killed keep-alive idle conns at ReadHeaderTimeout instead of IdleTimeout. Pre-fix flow: 1. ProcessH1 entry: ClearHeaderDeadline (unconditional). 2. Parse + handle request. 3. ProcessH1 returns nil to engine (await more data). 4. Defer fires → ArmHeaderDeadline. 5. Conn enters keep-alive idle. 10s later, checkTimeouts closes the conn — even though it was idle, not slow-dripping. Correct semantic (matches net/http and the docs we claim to honor): - ReadHeaderTimeout tracks the absolute budget for receiving the CURRENT request's headers. - Once headers are complete (ParseRequest success), clear. - Body / handler / keep-alive idle states have NO header deadline. They get IdleTimeout / ReadTimeout / handler-deadline coverage. - Re-arm on the FIRST byte of the NEXT request — which manifests as ProcessH1 being called with data on a conn whose deadline was cleared. Implementation: - ProcessH1 entry: arm IF AND ONLY IF deadline == 0 AND not in body-resumption / detached state. Don't reset an already-armed deadline (slowloris-bypass guard). - At each successful ParseRequest (3 sites): ClearHeaderDeadline. - ArmHeaderDeadline itself is now idempotent: no-op when deadline is already non-zero — defence in depth against any future caller that doesn't check first. New regression tests: - TestHeaderDeadline_ArmIdempotent: re-arming without clear MUST NOT shift the deadline (slowloris-bypass guard). - TestHeaderDeadline_KeepAliveIdleNotKilled: post-clear, the deadline stays 0 through keep-alive idle — IdleTimeout handles that window, NOT ReadHeaderTimeout. This is the deep-analysis pass the user asked for before tagging v1.4.11. The implementation is now provably correct against the keep-alive-idle vs slowloris-drip distinction.
…eadHeaderTimeout Replaces the sweep-based defence (PR #292/#293, max 56% slowloris- hang reduction) with a per-conn kernel timer. The kernel fires the IORING_OP_TIMEOUT SQE at the absolute deadline regardless of CQE traffic on the worker, giving parity with std's SetReadDeadline. Design: - New opTIMEOUT (11) + opTIMEOUTREMOVE (12) opcodes (kernel 5.4+). - New udHeaderTimer user_data tag. - New prepTimeout SQE helper that takes a kernelTimespec pointer. - connState gains headerTimerSpec (re-used kernelTimespec, owned by the conn so SQE memory stays valid until kernel reads) + headerTimerArmed flag (avoids duplicate SQEs). - H1State gains OnHeaderDeadlineArmed callback hook. ArmHeaderDeadline invokes it on the 0→armed transition; engine submits a fresh SQE. - Engine initProtocol wires the callback to w.armHeaderTimer; calls it once at conn-accept (initial arm). - CQE dispatch routes udHeaderTimer to w.handleHeaderTimer which checks HeaderDeadlineNs (race-free against ClearHeaderDeadline) and closes the conn if the deadline really has expired. Race-free against: - ClearHeaderDeadline (parse success): timer CQE arrives, sees HeaderDeadlineNs == 0, no-op. - Spurious early fire (clock drift): re-submits a fresh timer for the current deadline. - Re-arm during in-flight timer: ArmHeaderDeadline is idempotent (no-op if deadline != 0); the in-flight timer's CQE will check state and either close or re-submit. Single timer per conn at any time guaranteed by headerTimerArmed flag. - Conn close during in-flight timer: handleHeaderTimer checks cs.closing and returns early. epoll side: still relies on the sweep-based defence (no per-conn timerfd because that would double fd usage on this hot path). Expected to have slightly higher residual hangs than iouring. Empirical verification path: nightly+soak against this branch's SHA should show adv_hang_until_timeout drop to ~0 on iouring cells specifically (epoll improvement bounded by sweep cadence). Tag v1.4.11 only when iouring hits 0.
Nightly 26372397762 against PR #294's per-conn IORING_OP_TIMEOUT fix surprisingly went BACKWARDS — adv_hang_until_timeout climbed from 838 to 1,061. Diagnosis: the CQE dispatch hot path at worker.go:483-503 is INLINED (per its own comment, eliminating the processCQE method call for the hot ops). My udHeaderTimer case was added only to the non-inlined processCQE function — which is now called only from the cancel path. Result: every timer CQE in normal operation was silently dropped, leaving slowloris conns to be caught ONLY by the loose sweep (gate 0x3FF, no cadence tightening since that lives on PR #293), which fewer hangs than PR #293. Fix: add the udHeaderTimer case to the inlined switch too. The same handleHeaderTimer function runs from both dispatch sites. Also harden handleHeaderTimer against stale `now`: the cachedNow is refreshed every 64 iterations, which can be 60+ms stale under bursty load. A stale `now < dl` would spuriously re-arm a fresh 10s timer and effectively double the slowloris window. The handler now re-reads time.Now() — one vDSO call per timer CQE, negligible.
staticcheck SA4009 — the parameter was overwritten before first use since the handler always re-reads time.Now() to avoid the cachedNow staleness issue. Drop the parameter; both call sites updated.
Side-by-side comparison of std vs iouring in nightly 26374452801
revealed the actual root cause behind iouring's residual 11% hang rate:
std cell-02 auth_jwt-std: sent=2,259 hang=0 (4× more attempts)
std cell-05 auth_session-std: sent=8,254 hang=0 (15× more attempts)
iouring cell-00 auth_jwt: sent= 573 hang=66
iouring cell-00 (other arm): sent= 564 hang=66
std processes 4-15× more slowloris attempts per cell budget because
std's close causes the walker to immediately fail its NEXT write.
iouring's async-mode graceful close path (SHUT_WR + drainRecvBuffer
+ Close — used because async mode pre-allocates detachMu) sends FIN
instead of RST; the walker's small drip writes continue buffering
locally past the close, often reaching the 12s hang budget without
noticing.
Fix: set SO_LINGER {1, 0} on the FD right before closeConn in the
slowloris-defence paths (handleHeaderTimer's per-conn timer fire +
checkTimeouts' sweep-fallback). SO_LINGER {1, 0} discards pending
TX data and sends RST on close; the walker observes ECONNRESET on
its next write, independent of TCP send-buffer state.
Applied to both iouring (worker.go) and epoll (loop.go). The
slowloris conn is being terminated for misbehaviour regardless, so
discarding any unsent server-side bytes is fine — mirrors how
production proxies handle the same case (nginx
reset_timedout_connection on).
Empirical verification: next nightly+soak should show iouring hangs
drop sharply (target ~0), proving the close-semantics hypothesis.
…loris repro PR #292's H1State + ProcessH1 changes were correct but the epoll worker's sweep cadence wasn't tightened for ReadHeaderTimeout. The in-process synthetic reproducer (test/integration/slowloris_synthetic_test.go) caught this: with 2s ReadHeaderTimeout, epoll's defence was firing at 4.0s (2s LATE) because the 0x3FF gate + 500ms maxMs adaptive cap allowed sweep to run only every ~2s on a 1-conn worker. Fix mirrors what landed in iouring/worker.go (PR #293 cherry-pick): - gate drops 0x3FF → 0x1F when cfg.ReadHeaderTimeout > 0 - adaptiveTimeoutMs maxMs caps at 25ms (was 500) Combined cadence: 25ms × 0x1F gate = 800ms worst-case sweep latency. Verification (synthetic): both iouring + epoll now close slowloris conns at deadline+200ms even under: - AsyncHandlers=true (triggers graceful close path) - 64 simultaneous slowloris clients - 200 background traffic goroutines - SO_LINGER {1,0} forcing RST TestSlowlorisSynthetic (single client) passes both engines. TestSlowlorisSyntheticConcurrent (64 concurrent) passes both engines. Max close-latency: iouring 2.43s / epoll 2.20s on 2s deadline.
Last nightly proved SO_LINGER {1, 0} alone wasn't enough: iouring
went from 564 → 15,104 adversarial attempts processed (25× faster
slowloris close cycle), confirming the close mechanism was the
root cause. But residual hangs persisted at 26% (down from 70%).
Diagnosis: my SO_LINGER was applied BEFORE closeConn but the
async-mode close path (finishCloseDetached) does:
unix.Shutdown(fd, SHUT_WR) ← sends FIN, puts conn in FIN_WAIT_1
drainRecvBuffer(fd)
unix.Close(fd) ← SO_LINGER may not fire RST in FIN_WAIT
After SHUT_WR, the kernel's already sent FIN and may not honor the
linger discard semantics on Close. The walker still sees FIN
(slow propagation) instead of RST (immediate ECONNRESET).
Fix: add cs.forceRSTClose flag. Slowloris-defence paths set it
before closeConn. finishClose / finishCloseDetached check the flag
and SKIP the SHUT_WR+drain dance — call close() directly. With
SO_LINGER {1, 0} already set, that close RSTs the conn. Walker
observes ECONNRESET on its very next write.
Applied to BOTH engines:
- iouring: handleHeaderTimer + checkTimeouts HeaderDeadline branch
- epoll: checkTimeouts HeaderDeadline branch
Empirical verification: next nightly should show iouring AND epoll
adv_hang_until_timeout drop to ~0 (matching std). If yes, v1.4.11
ships.
Last nightly 26397557463 showed 458 residual hangs concentrated on intentionally-slow refapps (observability, static_swagger_proxy) where the slowloris hang rate stayed at ~62% despite the per-conn timer + forceRSTClose fix curing the fast-throughput cells. Hypothesis: under concurrent Markov + adversarial walker pressure, the iouring SQ ring fills up. armHeaderTimer's GetSQE() returns nil and the function silently no-ops. The conn then relies on the sweep, hitting the cadence floor that previous PRs already showed isn't enough. Fix: on GetSQE-returns-nil, force-submit pending SQEs (drains the ring into the kernel queue) then retry GetSQE once. This guarantees the timer always arms even under burst load. Cost: extra Submit syscall on the rare ring-full path. Negligible vs the alternative of a silently-unprotected slowloris conn. Expected: residual on slow refapps should drop substantially.
…ose)
Root cause analysis from nightly 26401147295 data:
- std cell-20 obs-std: 816 sent, 0 hangs
- iouring cell-18 obs-iouring: 365 sent, 41 hangs
- epoll cell-19 obs-epoll: 338 sent, 34 hangs
iouring AND epoll show the same ~60% slowloris hang rate on observability,
while std is bulletproof. Same code path config (Protocol=HTTP1,
AsyncHandlers=true) on all three. Difference: HOW the kernel sees the
close.
Linux kernel sends RST on close ONLY when:
- receive buffer has unread bytes, OR
- SO_LINGER {1, 0} is set AND there's pending TX data
For std, http.Server's read goroutine often has unread bytes in the
recv buffer when ReadHeaderTimeout fires (deadline interrupts a partial
recv). close() then RSTs naturally.
For iouring/epoll, our event loop drains every recv as it arrives. By
the time the close runs, the recv buffer is EMPTY. SO_LINGER {1,0} +
close ends up sending FIN because there's nothing to "discard" on TX
either (slowloris sends never get any response from us).
Fix: add `unix.Shutdown(fd, unix.SHUT_RDWR)` BEFORE close in the
forceRSTClose path. SHUT_RDWR explicitly aborts both half-channels;
combined with SO_LINGER {1, 0} on Linux, close() then RSTs reliably
regardless of buffer state. Walker observes ECONNRESET on its very
next write.
Applied to both engines in both close paths (detached/non-detached
for iouring; the single epoll close path).
Expected: slow-refapp residual (observability, static_swagger_proxy)
drops to ~0 like the fast cells already did. v1.4.11 ready if so.
…ing precision)
Last nightly 26405295929 showed iouring hangs dropped 72% (242 → 68)
after SHUT_RDWR + SO_LINGER fix, but epoll residual stayed at 203.
The difference: iouring uses per-conn IORING_OP_TIMEOUT SQEs for
kernel-precise close at exactly the deadline. Epoll relies on the
sweep-based checkTimeouts; even with gate=0x1F and maxMs=25 the
sweep cadence can stretch under low socket-event traffic.
For slowloris-defence on slow refapps (observability,
static_swagger_proxy where Markov walkers hit intentionally-slow
endpoints, generating few socket events), the sweep ran every
~500-800ms in the worst case. The walker's 2s slack budget (12s
drip - 10s deadline) couldn't absorb that delay reliably.
Fix: add a per-worker timerfd_create that fires every 25ms,
registered in the loop's epoll set. Every fire wakes the loop AND
runs checkTimeouts unconditionally. Now the cadence is
kernel-enforced at 25ms regardless of socket-event volume —
matching iouring's IORING_OP_TIMEOUT precision.
Lifecycle:
- Created in run() when cfg.ReadHeaderTimeout > 0 (skipped if
feature disabled, avoiding the syscall for stripped builds).
- Added to epoll set with EPOLLIN.
- Closed in shutdown() alongside eventFD.
Cost: 25 extra checkTimeouts calls per second per worker. Each
scan is O(maxFD) but cheap (atomic loads). Under load this rate
was already achieved naturally; the timer only matters when the
worker would otherwise have been idle past the cadence ceiling.
Empirical verification path: next nightly should show epoll
residual on observability/SSP drop to ~0, matching iouring's
post-SHUT_RDWR behaviour. If yes, v1.4.11 ships.
Theory: shutdown(SHUT_RDWR) before close sends FIN first, transitioning
the conn to FIN_WAIT_1 where SO_LINGER {1,0} may no longer trigger
RST on the subsequent close. Removing the shutdown so close() acts
on a still-ESTABLISHED conn lets LINGER do its intended job.
If this theory holds, residual hangs on slow refapps should drop
substantially in the next nightly.
…visibility Nightly 26418830572 vs 26414322152 (sample size = full matrix): - With SHUT_RDWR + close: iouring=126 epoll=96 total=222 - Without SHUT_RDWR: iouring=235 epoll=90 total=325 SHUT_RDWR before close gives ~50% more reliable observable-RST on iouring. Theory: SHUT_RDWR drops the receive queue too (close() with LINGER only drops TX); without this, walker's drip writes can keep succeeding locally for hundreds of ms after celeris closes because the kernel's TCP layer hasn't fully torn down the receive side. SHUT_RDWR forces both directions atomically, walker observes ECONNRESET on next write/read. Keeping the SHUT_RDWR + close sequence for forceRSTClose paths on both engines. Re-soak with this + walker Read-probe widened to 50ms (separate probatorium commit) should drop residual further.
…t in rebase) The tickCounter gate and adaptiveTimeout caps were supposed to tighten when ReadHeaderTimeout > 0 (per bb1a278), but the conditions only checked detachedCount. Slowloris conns aren't detached, so the sweep fell back to its slow ~100ms+ cadence under steady-state legit traffic. Nightly 26423916348 showed iouring static_swagger_proxy + observability hangs at 20-21 per cell despite the per-conn IORING_OP_TIMEOUT being in place — armHeaderTimer can silently fail under SQ-ring pressure (GetSQE+Submit retry returns nil), at which point the sweep is the only fallback. Without the tight gate the worst-case close latency on a dropped arm was multi-second, far past the walker's 20s drip budget. Restored what bb1a278 added: - tickCounter gate: 0x3FF → 0x1F when w.cfg.ReadHeaderTimeout > 0 - adaptiveTimeout maxWait: 100ms → 25ms when ReadHeaderTimeout > 0 Epoll equivalent (loop.go) already has both fixes via the timerfd path and the gate condition — only iouring regressed.
… arm dropped armHeaderTimer can silently fail under SQ-ring pressure (GetSQE+Submit retry returns nil). When that happens cs.headerTimerArmed stays false but cs.h1State.HeaderDeadlineNs.Store has already set the deadline, so subsequent ArmHeaderDeadline calls (idempotent on the deadline) won't re-trigger OnHeaderDeadlineArmed and the conn falls back to the sweep alone. Belt-and-braces: in both the async-dispatch and inline handleRecv paths, re-attempt armHeaderTimer after appending received bytes whenever HeaderDeadlineNs > 0 but headerTimerArmed is still false. Cheap atomic load on the hot path (constant per recv), zero-cost when ReadHeaderTimeout is disabled (HeaderDeadlineNs stays 0). Every drip byte on a slowloris conn becomes another chance for the timer SQE to land — pre-fix, only the initial accept got one shot. Pairs with 6dbbf02 (sweep cadence restore): together they close the gap that left iouring 20-21 hangs per slow-refapp cell despite the per-conn IORING_OP_TIMEOUT being in place.
… under async mode ProcessH1 calls state.ArmHeaderDeadline at internal/conn/h1.go:361 when it sees data on a conn with HeaderDeadlineNs == 0 (the keep-alive next-request transition). For async mode this runs on the per-conn dispatch goroutine, NOT the worker thread. ArmHeaderDeadline then fires OnHeaderDeadlineArmed which the iouring worker had set to a closure that calls w.ring.GetSQE. io_uring's IORING_SETUP_SINGLE_ISSUER (set by the high/optional tiers) requires all SQE submissions come from the LockOSThread'd worker — a goroutine GetSQE call can silently drop the SQE, return a stale slot pointer, or corrupt the SQ ring outright. Plausibly the root cause of the residual 17-21 slowloris hangs per slow-refapp cell (observability, static_swagger_proxy): the corrupted SQ then drops unrelated header- timer SQEs for other slowloris conns on the same worker. Fix: leave OnHeaderDeadlineArmed nil in async mode. The initial accept- time arm is now a direct armHeaderTimer call on the worker thread (the new w.async branch right after ArmHeaderDeadline). Keep-alive re-arms fall back to checkTimeouts which runs every 800ms worst case with the gate + adaptiveTimeout tightening from 6dbbf02. Slowloris attacks never reach the keep-alive path (no successful parse), so they keep kernel-precise timer behavior via the accept-time arm + the recv-path retry (3b9d643) for SQ-pressure drops. Sync mode (!w.async) unchanged: ProcessH1 runs inline on the worker thread, OnHeaderDeadlineArmed is a worker-thread call, no race.
…T-loss workaround)
Nightly 26427061834 (with sweep-cadence + retry-recv + SINGLE_ISSUER
fixes) showed residual ~16-25 hangs per slow-refapp cell on iouring AND
epoll, basically unchanged from baseline. The identical shape on both
engines despite their independent close paths ruled out engine-specific
bugs — the FIN/RST signaling itself is just unreliable enough under
cluster load that the walker's 20s drip-budget fires before its kernel
observes the close on ~5-7% of attempts.
Switch to a TCP-reliable signaling mechanism: write a 408 Request
Timeout response BEFORE closing. The response bytes ride TCP's
retransmission machinery, so even with packet loss the walker
eventually Read()s a byte of response data and exits its drip loop via
wellReject — no FIN/RST visibility required.
iouring (handleHeaderTimer + checkTimeouts):
cs.writeFn(requestTimeoutResponse) // appends to cs.writeBuf
w.closeConn(fd) // sees non-empty writeBuf →
// defers close until SEND CQE
// → finishCloseDetached takes
// graceful SHUT_WR + drain path
// (no forceRSTClose) so the 408
// reliably leaves the kernel
// before the FIN.
epoll (checkTimeouts):
unix.Write(fd, requestTimeoutResponse) // 80 bytes synchronous to
// kernel send buffer; EAGAIN
// is practically impossible
// since the conn's send
// buffer was empty.
l.closeConn(fd) // graceful close path (no
// forceRSTClose set) →
// SHUT_WR + drain + Close.
The forceRSTClose machinery is left in place (release resets it, both
close paths still honor the flag) for any future caller that wants
synchronous RST teardown, but the slowloris-defence path no longer
touches it. SetsockoptLinger calls are removed from the same paths —
they were no-ops in fixed-file mode (LINGER on an index returns EBADF)
and harmful for non-fixed-file once we want graceful close.
Three prior approaches all left ~5-7% walker-hangs per slow-refapp cell:
- SHUT_RDWR + close + SO_LINGER{1,0} (forced RST): nightly 26427061834
- 408 Request Timeout + SHUT_WR + drain + close: nightly 26429150655
Std/net.http on header-deadline-expiry calls c.close() → plain
rwc.Close() (no shutdown, no linger, no response body). Std walker
shows 0 hangs on the same probatorium test. The data points at our
extra SHUT_*/drainRecvBuffer syscalls racing with the peer's in-flight
Write — the multi-syscall close window gives the kernel time to send
RST on the peer's pending data, dropping any pending recv on the peer
side. Plain close avoids that race entirely.
Changes:
- iouring handleHeaderTimer + checkTimeouts: drop the 408-write +
forceRSTClose paths. Just call w.closeConn(fd).
- iouring finishCloseDetached: async-mode HTTP1 conns (detached=true
via pre-allocated detachMu, but h1State.Detached=false) use plain
unix.Close. Truly detached WS/SSE conns still use SHUT_WR+drain
so middleware-queued close-frame echoes flush.
- epoll checkTimeouts: drop 408-write + linger. Just l.closeConn.
- epoll closeConn: new plainClose path covers both sync and async-
mode H1 (was previously only sync H1 via fastClose). H2 + truly
detached conns still use the graceful SHUT_WR+drain path.
- Drop requestTimeoutResponse helper (no longer used).
The forceRSTClose machinery is left intact (no setter now, but release
resets the field; switch-case still honors it for any future caller).
Plain unix.Close (4382283) dropped stat_swag iouring hangs from ~21/cell to ~2/cell. Observability still saw ~19-23/cell. Hypothesis: plain Close on a non-empty recv buffer sends RST instead of FIN (Linux tcp_close checks data_was_unread first). RST is fire-and-forget — TCP doesn't retransmit it. On the cluster's 20G LACP fabric a single dropped RST strands the walker (no second chance), giving the residual ~5-7% hang rate per cell. For slowloris specifically the peer is by design still writing drips, so the kernel recv buffer is almost always non-empty at our close time. SHUT_WR forces FIN regardless of recv-buffer state; FIN IS retransmitted from FIN_WAIT_1 until ACK'd, so a dropped FIN packet doesn't strand the walker. Skip drainRecvBuffer — that was the source of the original race (an empty drain is harmless but a non-empty one leaves a multi-µs window in which a fresh drip queues, sending Close back to the RST branch). Changes: - iouring finishCloseDetached: async-mode H1 path now Shutdown(SHUT_WR) + Close instead of plain Close. - epoll closeConn: plainClose path now Shutdown(SHUT_WR) + Close. - Graceful drain path retained for H2 (GOAWAY/RST_STREAM flush) and truly detached WS/SSE conns (middleware-queued close frames).
…anic load Diagnosed from probatorium nightly 26444562273: recovery's default Logger (slog.Default → text handler → os.Stderr) serializes panic logs on a global mutex inside Go's default handler. Under iouring/epoll's async-dispatch model the dispatch goroutine holds cs.detachMu across ProcessH1, so a synchronous LogAttrs gates the worker thread for the duration of the stderr write. Sustained panic loads (>100 req/sec to a handler that panics) drop engine throughput by ~14× and let slowloris header-deadlines slip past on concurrent conns. Refapps in probatorium already work around this by passing an explicit io.Discard or ringHandler logger. Document the issue for upstream users and point them at the same workaround. Async-logging or discard-by-default would be a larger semantic change deferred to v1.4.12+.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces #292 (sweep-only) and #293 (sweep-tightened). v1.4.11 candidate. Adds kernel-enforced per-conn header deadline on iouring via IORING_OP_TIMEOUT SQEs. The kernel fires the timer at exactly the deadline regardless of CQE traffic — eliminating the sweep-cadence dependency that capped previous attempts at 56% reduction.
Journey
Root cause of the sweep-based ceiling: under SO_REUSEPORT-distributed conns, some slowloris attempts land on workers with low CQE traffic; sweep cadence is bounded by
gate × adaptive_wait_time, which can stretch to seconds when other conns on that worker are also idle.Design
Race correctness
Single timer per conn at any time guaranteed by `headerTimerArmed` flag.
What this does NOT change
Tests
All 4 cherry-picked unit tests still pass:
Full `go test ./...` clean. `GOOS=linux go build` clean. `go vet` clean.
Empirical verification path
Probatorium nightly + 24h soak against this branch's SHA. Expect:
Will not propose v1.4.11 tag until nightly + soak confirm iouring at ~0 hangs.