Skip to content

fix(engine): per-conn IORING_OP_TIMEOUT for ReadHeaderTimeout (v1.4.11 candidate)#294

Merged
FumingPower3925 merged 23 commits into
mainfrom
fix/iouring-epoll-read-header-timeout-v3
May 28, 2026
Merged

fix(engine): per-conn IORING_OP_TIMEOUT for ReadHeaderTimeout (v1.4.11 candidate)#294
FumingPower3925 merged 23 commits into
mainfrom
fix/iouring-epoll-read-header-timeout-v3

Conversation

@FumingPower3925
Copy link
Copy Markdown
Contributor

Summary

Replaces #292 (sweep-only) and #293 (sweep-tightened). v1.4.11 candidate. Adds kernel-enforced per-conn header deadline on iouring via IORING_OP_TIMEOUT SQEs. The kernel fires the timer at exactly the deadline regardless of CQE traffic — eliminating the sweep-cadence dependency that capped previous attempts at 56% reduction.

Journey

Attempt Approach Reduction
PR #292 H1State state machine + sweep-based check 45%
PR #293 + cadence tightening (gate 0x1F, maxWait 25ms) 56%
PR #294 (this) per-conn IORING_OP_TIMEOUT SQE target: ~100%

Root cause of the sweep-based ceiling: under SO_REUSEPORT-distributed conns, some slowloris attempts land on workers with low CQE traffic; sweep cadence is bounded by gate × adaptive_wait_time, which can stretch to seconds when other conns on that worker are also idle.

Design

  • opTIMEOUT (11) / opTIMEOUTREMOVE (12) opcodes added (kernel 5.4+).
  • udHeaderTimer user_data tag added.
  • prepTimeout SQE helper accepts a kernelTimespec pointer.
  • connState gains `headerTimerSpec` (re-used kernelTimespec, owned by the conn so SQE memory stays valid until kernel reads it) + `headerTimerArmed` flag (avoids duplicate SQEs).
  • H1State gains `OnHeaderDeadlineArmed` callback hook. `ArmHeaderDeadline` invokes it on the 0→armed transition; engine submits a fresh SQE.
  • Engine initProtocol wires the callback to `w.armHeaderTimer`; calls it once at conn-accept (initial arm).
  • CQE dispatch routes `udHeaderTimer` to `w.handleHeaderTimer` which checks HeaderDeadlineNs (race-free against ClearHeaderDeadline) and closes the conn if the deadline really has expired.

Race correctness

Scenario Behavior
Timer CQE arrives, headers parsed (ClearHeaderDeadline ran) Sees HeaderDeadlineNs == 0, no-op
Spurious early fire (clock drift) Re-submits fresh timer for current deadline
Re-arm during in-flight timer ArmHeaderDeadline idempotent; in-flight timer's CQE checks state on fire
Conn close during in-flight timer handleHeaderTimer checks `cs.closing`, returns early

Single timer per conn at any time guaranteed by `headerTimerArmed` flag.

What this does NOT change

  • epoll engine: still uses the sweep-based defence (per-conn timerfd would double fd usage — not worth it for the bounded improvement). Expected slightly higher residual hangs on epoll cells than iouring.

Tests

All 4 cherry-picked unit tests still pass:

  • TestHeaderDeadline_ClearArmCycle
  • TestHeaderDeadline_DisabledNoArm
  • TestHeaderDeadline_ArmIdempotent
  • TestHeaderDeadline_KeepAliveIdleNotKilled

Full `go test ./...` clean. `GOOS=linux go build` clean. `go vet` clean.

Empirical verification path

Probatorium nightly + 24h soak against this branch's SHA. Expect:

  • iouring `adv_hang_until_timeout` → ~0 (target)
  • epoll `adv_hang_until_timeout` → improvement but possibly not 0
  • std unchanged at 0
  • Bug oracles (properties_failed, adv_wrong_accepted, h2c_crashed, seeds_*) stay 0
  • No keep-alive regression (transport_err must not spike)

Will not propose v1.4.11 tag until nightly + soak confirm iouring at ~0 hangs.

…defence)

The doc comment on resource.Config.ReadHeaderTimeout reads:

    The std engine wires this to http.Server.ReadHeaderTimeout; the
    iouring/epoll engines enforce the same budget inside their H1
    header read loop.

But grep finds NO consumer of cfg.ReadHeaderTimeout in either iouring
or epoll. Only engine/std actually enforced it. The iouring + epoll
engines had no slowloris defence at all — relying entirely on
SO_REUSEPORT multi-worker queue absorption.

Surfaced by probatorium nightly 26363052652 (run against PR #145 SHA
cc4d42d, which fixed the walker false-positive 2s timeout bug). After
the walker timeouts were aligned with celeris's 10s default, the
residual hang counts (1,910 adv_hang + 368 h2c_hang) were ALL
concentrated on iouring + epoll cells. std cells had zero — proving
the slowloris walker was correctly detecting the missing defence.

Fix wires a new HeaderDeadlineNs atomic field on H1State that follows
the existing IdleDeadlineNs pattern:

  - H1State.HeaderDeadlineNs: absolute deadline (Unix nanoseconds);
    0 = no deadline.
  - H1State.ReadHeaderTimeoutNs: configured timeout in ns; 0 = disabled.
  - ProcessH1 calls ClearHeaderDeadline at entry (we're actively
    parsing, not waiting). On return-nil (await-more-data), the
    deferred ArmHeaderDeadline rearms the deadline for the next
    request's header read.
  - Engine initProtocol (iouring + epoll) sets ReadHeaderTimeoutNs from
    cfg + arms the deadline at conn-accept.
  - checkTimeouts (iouring + epoll) closes conns where HeaderDeadlineNs
    is set and exceeded.

std engine path is unchanged — it still wires through http.Server.

Tests:
  - TestHeaderDeadline_ClearArmCycle: validates the arm/clear/re-arm
    state machine.
  - TestHeaderDeadline_DisabledNoArm: ReadHeaderTimeoutNs=0 → arm is
    a no-op.

Empirical verification: next nightly + soak against this branch should
show ~0 adv_hang_until_timeout + ~0 h2c_hang from iouring/epoll cells
(matching std behaviour). Real slowloris attempts will still trip the
counter (the defence is now in place, not the absence of one).
…rm-if-zero, not defer-arm

Review of the original v1.4.11 candidate (commit 7794403) caught a
semantic bug BEFORE shipping: the defer-arm-on-return-nil design would
have killed keep-alive idle conns at ReadHeaderTimeout instead of
IdleTimeout.

Pre-fix flow:
  1. ProcessH1 entry: ClearHeaderDeadline (unconditional).
  2. Parse + handle request.
  3. ProcessH1 returns nil to engine (await more data).
  4. Defer fires → ArmHeaderDeadline.
  5. Conn enters keep-alive idle. 10s later, checkTimeouts closes
     the conn — even though it was idle, not slow-dripping.

Correct semantic (matches net/http and the docs we claim to honor):
  - ReadHeaderTimeout tracks the absolute budget for receiving the
    CURRENT request's headers.
  - Once headers are complete (ParseRequest success), clear.
  - Body / handler / keep-alive idle states have NO header deadline.
    They get IdleTimeout / ReadTimeout / handler-deadline coverage.
  - Re-arm on the FIRST byte of the NEXT request — which manifests
    as ProcessH1 being called with data on a conn whose deadline
    was cleared.

Implementation:
  - ProcessH1 entry: arm IF AND ONLY IF deadline == 0 AND not in
    body-resumption / detached state. Don't reset an already-armed
    deadline (slowloris-bypass guard).
  - At each successful ParseRequest (3 sites): ClearHeaderDeadline.
  - ArmHeaderDeadline itself is now idempotent: no-op when deadline
    is already non-zero — defence in depth against any future caller
    that doesn't check first.

New regression tests:
  - TestHeaderDeadline_ArmIdempotent: re-arming without clear MUST
    NOT shift the deadline (slowloris-bypass guard).
  - TestHeaderDeadline_KeepAliveIdleNotKilled: post-clear, the
    deadline stays 0 through keep-alive idle — IdleTimeout handles
    that window, NOT ReadHeaderTimeout.

This is the deep-analysis pass the user asked for before tagging
v1.4.11. The implementation is now provably correct against the
keep-alive-idle vs slowloris-drip distinction.
…eadHeaderTimeout

Replaces the sweep-based defence (PR #292/#293, max 56% slowloris-
hang reduction) with a per-conn kernel timer. The kernel fires the
IORING_OP_TIMEOUT SQE at the absolute deadline regardless of CQE
traffic on the worker, giving parity with std's SetReadDeadline.

Design:
  - New opTIMEOUT (11) + opTIMEOUTREMOVE (12) opcodes (kernel 5.4+).
  - New udHeaderTimer user_data tag.
  - New prepTimeout SQE helper that takes a kernelTimespec pointer.
  - connState gains headerTimerSpec (re-used kernelTimespec, owned
    by the conn so SQE memory stays valid until kernel reads) +
    headerTimerArmed flag (avoids duplicate SQEs).
  - H1State gains OnHeaderDeadlineArmed callback hook. ArmHeaderDeadline
    invokes it on the 0→armed transition; engine submits a fresh SQE.
  - Engine initProtocol wires the callback to w.armHeaderTimer; calls
    it once at conn-accept (initial arm).
  - CQE dispatch routes udHeaderTimer to w.handleHeaderTimer which
    checks HeaderDeadlineNs (race-free against ClearHeaderDeadline)
    and closes the conn if the deadline really has expired.

Race-free against:
  - ClearHeaderDeadline (parse success): timer CQE arrives, sees
    HeaderDeadlineNs == 0, no-op.
  - Spurious early fire (clock drift): re-submits a fresh timer for
    the current deadline.
  - Re-arm during in-flight timer: ArmHeaderDeadline is idempotent
    (no-op if deadline != 0); the in-flight timer's CQE will check
    state and either close or re-submit. Single timer per conn at
    any time guaranteed by headerTimerArmed flag.
  - Conn close during in-flight timer: handleHeaderTimer checks
    cs.closing and returns early.

epoll side: still relies on the sweep-based defence (no per-conn
timerfd because that would double fd usage on this hot path).
Expected to have slightly higher residual hangs than iouring.

Empirical verification path: nightly+soak against this branch's SHA
should show adv_hang_until_timeout drop to ~0 on iouring cells
specifically (epoll improvement bounded by sweep cadence). Tag v1.4.11
only when iouring hits 0.
Nightly 26372397762 against PR #294's per-conn IORING_OP_TIMEOUT
fix surprisingly went BACKWARDS — adv_hang_until_timeout climbed
from 838 to 1,061. Diagnosis: the CQE dispatch hot path at
worker.go:483-503 is INLINED (per its own comment, eliminating the
processCQE method call for the hot ops). My udHeaderTimer case was
added only to the non-inlined processCQE function — which is now
called only from the cancel path. Result: every timer CQE in normal
operation was silently dropped, leaving slowloris conns to be caught
ONLY by the loose sweep (gate 0x3FF, no cadence tightening since
that lives on PR #293), which fewer hangs than PR #293.

Fix: add the udHeaderTimer case to the inlined switch too. The same
handleHeaderTimer function runs from both dispatch sites.

Also harden handleHeaderTimer against stale `now`: the cachedNow is
refreshed every 64 iterations, which can be 60+ms stale under bursty
load. A stale `now < dl` would spuriously re-arm a fresh 10s timer
and effectively double the slowloris window. The handler now re-reads
time.Now() — one vDSO call per timer CQE, negligible.
staticcheck SA4009 — the parameter was overwritten before first use
since the handler always re-reads time.Now() to avoid the cachedNow
staleness issue. Drop the parameter; both call sites updated.
Side-by-side comparison of std vs iouring in nightly 26374452801
revealed the actual root cause behind iouring's residual 11% hang rate:

  std cell-02 auth_jwt-std:    sent=2,259  hang=0  (4× more attempts)
  std cell-05 auth_session-std: sent=8,254  hang=0  (15× more attempts)
  iouring cell-00 auth_jwt:    sent=  573  hang=66
  iouring cell-00 (other arm):  sent=  564  hang=66

std processes 4-15× more slowloris attempts per cell budget because
std's close causes the walker to immediately fail its NEXT write.
iouring's async-mode graceful close path (SHUT_WR + drainRecvBuffer
+ Close — used because async mode pre-allocates detachMu) sends FIN
instead of RST; the walker's small drip writes continue buffering
locally past the close, often reaching the 12s hang budget without
noticing.

Fix: set SO_LINGER {1, 0} on the FD right before closeConn in the
slowloris-defence paths (handleHeaderTimer's per-conn timer fire +
checkTimeouts' sweep-fallback). SO_LINGER {1, 0} discards pending
TX data and sends RST on close; the walker observes ECONNRESET on
its next write, independent of TCP send-buffer state.

Applied to both iouring (worker.go) and epoll (loop.go). The
slowloris conn is being terminated for misbehaviour regardless, so
discarding any unsent server-side bytes is fine — mirrors how
production proxies handle the same case (nginx
reset_timedout_connection on).

Empirical verification: next nightly+soak should show iouring hangs
drop sharply (target ~0), proving the close-semantics hypothesis.
…loris repro

PR #292's H1State + ProcessH1 changes were correct but the epoll
worker's sweep cadence wasn't tightened for ReadHeaderTimeout. The
in-process synthetic reproducer (test/integration/slowloris_synthetic_test.go)
caught this: with 2s ReadHeaderTimeout, epoll's defence was firing at
4.0s (2s LATE) because the 0x3FF gate + 500ms maxMs adaptive cap
allowed sweep to run only every ~2s on a 1-conn worker.

Fix mirrors what landed in iouring/worker.go (PR #293 cherry-pick):
  - gate drops 0x3FF → 0x1F when cfg.ReadHeaderTimeout > 0
  - adaptiveTimeoutMs maxMs caps at 25ms (was 500)

Combined cadence: 25ms × 0x1F gate = 800ms worst-case sweep latency.

Verification (synthetic): both iouring + epoll now close slowloris
conns at deadline+200ms even under:
  - AsyncHandlers=true (triggers graceful close path)
  - 64 simultaneous slowloris clients
  - 200 background traffic goroutines
  - SO_LINGER {1,0} forcing RST

TestSlowlorisSynthetic (single client) passes both engines.
TestSlowlorisSyntheticConcurrent (64 concurrent) passes both engines.
Max close-latency: iouring 2.43s / epoll 2.20s on 2s deadline.
Last nightly proved SO_LINGER {1, 0} alone wasn't enough: iouring
went from 564 → 15,104 adversarial attempts processed (25× faster
slowloris close cycle), confirming the close mechanism was the
root cause. But residual hangs persisted at 26% (down from 70%).

Diagnosis: my SO_LINGER was applied BEFORE closeConn but the
async-mode close path (finishCloseDetached) does:
  unix.Shutdown(fd, SHUT_WR)   ← sends FIN, puts conn in FIN_WAIT_1
  drainRecvBuffer(fd)
  unix.Close(fd)               ← SO_LINGER may not fire RST in FIN_WAIT

After SHUT_WR, the kernel's already sent FIN and may not honor the
linger discard semantics on Close. The walker still sees FIN
(slow propagation) instead of RST (immediate ECONNRESET).

Fix: add cs.forceRSTClose flag. Slowloris-defence paths set it
before closeConn. finishClose / finishCloseDetached check the flag
and SKIP the SHUT_WR+drain dance — call close() directly. With
SO_LINGER {1, 0} already set, that close RSTs the conn. Walker
observes ECONNRESET on its very next write.

Applied to BOTH engines:
  - iouring: handleHeaderTimer + checkTimeouts HeaderDeadline branch
  - epoll: checkTimeouts HeaderDeadline branch

Empirical verification: next nightly should show iouring AND epoll
adv_hang_until_timeout drop to ~0 (matching std). If yes, v1.4.11
ships.
Last nightly 26397557463 showed 458 residual hangs concentrated on
intentionally-slow refapps (observability, static_swagger_proxy)
where the slowloris hang rate stayed at ~62% despite the per-conn
timer + forceRSTClose fix curing the fast-throughput cells.

Hypothesis: under concurrent Markov + adversarial walker pressure,
the iouring SQ ring fills up. armHeaderTimer's GetSQE() returns nil
and the function silently no-ops. The conn then relies on the
sweep, hitting the cadence floor that previous PRs already showed
isn't enough.

Fix: on GetSQE-returns-nil, force-submit pending SQEs (drains the
ring into the kernel queue) then retry GetSQE once. This guarantees
the timer always arms even under burst load.

Cost: extra Submit syscall on the rare ring-full path. Negligible
vs the alternative of a silently-unprotected slowloris conn.

Expected: residual on slow refapps should drop substantially.
…ose)

Root cause analysis from nightly 26401147295 data:
  - std cell-20 obs-std: 816 sent, 0 hangs
  - iouring cell-18 obs-iouring: 365 sent, 41 hangs
  - epoll cell-19 obs-epoll: 338 sent, 34 hangs

iouring AND epoll show the same ~60% slowloris hang rate on observability,
while std is bulletproof. Same code path config (Protocol=HTTP1,
AsyncHandlers=true) on all three. Difference: HOW the kernel sees the
close.

Linux kernel sends RST on close ONLY when:
  - receive buffer has unread bytes, OR
  - SO_LINGER {1, 0} is set AND there's pending TX data

For std, http.Server's read goroutine often has unread bytes in the
recv buffer when ReadHeaderTimeout fires (deadline interrupts a partial
recv). close() then RSTs naturally.

For iouring/epoll, our event loop drains every recv as it arrives. By
the time the close runs, the recv buffer is EMPTY. SO_LINGER {1,0} +
close ends up sending FIN because there's nothing to "discard" on TX
either (slowloris sends never get any response from us).

Fix: add `unix.Shutdown(fd, unix.SHUT_RDWR)` BEFORE close in the
forceRSTClose path. SHUT_RDWR explicitly aborts both half-channels;
combined with SO_LINGER {1, 0} on Linux, close() then RSTs reliably
regardless of buffer state. Walker observes ECONNRESET on its very
next write.

Applied to both engines in both close paths (detached/non-detached
for iouring; the single epoll close path).

Expected: slow-refapp residual (observability, static_swagger_proxy)
drops to ~0 like the fast cells already did. v1.4.11 ready if so.
…ing precision)

Last nightly 26405295929 showed iouring hangs dropped 72% (242 → 68)
after SHUT_RDWR + SO_LINGER fix, but epoll residual stayed at 203.
The difference: iouring uses per-conn IORING_OP_TIMEOUT SQEs for
kernel-precise close at exactly the deadline. Epoll relies on the
sweep-based checkTimeouts; even with gate=0x1F and maxMs=25 the
sweep cadence can stretch under low socket-event traffic.

For slowloris-defence on slow refapps (observability,
static_swagger_proxy where Markov walkers hit intentionally-slow
endpoints, generating few socket events), the sweep ran every
~500-800ms in the worst case. The walker's 2s slack budget (12s
drip - 10s deadline) couldn't absorb that delay reliably.

Fix: add a per-worker timerfd_create that fires every 25ms,
registered in the loop's epoll set. Every fire wakes the loop AND
runs checkTimeouts unconditionally. Now the cadence is
kernel-enforced at 25ms regardless of socket-event volume —
matching iouring's IORING_OP_TIMEOUT precision.

Lifecycle:
  - Created in run() when cfg.ReadHeaderTimeout > 0 (skipped if
    feature disabled, avoiding the syscall for stripped builds).
  - Added to epoll set with EPOLLIN.
  - Closed in shutdown() alongside eventFD.

Cost: 25 extra checkTimeouts calls per second per worker. Each
scan is O(maxFD) but cheap (atomic loads). Under load this rate
was already achieved naturally; the timer only matters when the
worker would otherwise have been idle past the cadence ceiling.

Empirical verification path: next nightly should show epoll
residual on observability/SSP drop to ~0, matching iouring's
post-SHUT_RDWR behaviour. If yes, v1.4.11 ships.
Theory: shutdown(SHUT_RDWR) before close sends FIN first, transitioning
the conn to FIN_WAIT_1 where SO_LINGER {1,0} may no longer trigger
RST on the subsequent close. Removing the shutdown so close() acts
on a still-ESTABLISHED conn lets LINGER do its intended job.

If this theory holds, residual hangs on slow refapps should drop
substantially in the next nightly.
…visibility

Nightly 26418830572 vs 26414322152 (sample size = full matrix):
  - With SHUT_RDWR + close:  iouring=126 epoll=96 total=222
  - Without SHUT_RDWR:        iouring=235 epoll=90 total=325

SHUT_RDWR before close gives ~50% more reliable observable-RST on
iouring. Theory: SHUT_RDWR drops the receive queue too (close()
with LINGER only drops TX); without this, walker's drip writes can
keep succeeding locally for hundreds of ms after celeris closes
because the kernel's TCP layer hasn't fully torn down the receive
side. SHUT_RDWR forces both directions atomically, walker observes
ECONNRESET on next write/read.

Keeping the SHUT_RDWR + close sequence for forceRSTClose paths on
both engines. Re-soak with this + walker Read-probe widened to
50ms (separate probatorium commit) should drop residual further.
…t in rebase)

The tickCounter gate and adaptiveTimeout caps were supposed to tighten
when ReadHeaderTimeout > 0 (per bb1a278), but the conditions only
checked detachedCount. Slowloris conns aren't detached, so the sweep
fell back to its slow ~100ms+ cadence under steady-state legit traffic.

Nightly 26423916348 showed iouring static_swagger_proxy + observability
hangs at 20-21 per cell despite the per-conn IORING_OP_TIMEOUT being in
place — armHeaderTimer can silently fail under SQ-ring pressure
(GetSQE+Submit retry returns nil), at which point the sweep is the only
fallback. Without the tight gate the worst-case close latency on a
dropped arm was multi-second, far past the walker's 20s drip budget.

Restored what bb1a278 added:
  - tickCounter gate: 0x3FF → 0x1F when w.cfg.ReadHeaderTimeout > 0
  - adaptiveTimeout maxWait: 100ms → 25ms when ReadHeaderTimeout > 0

Epoll equivalent (loop.go) already has both fixes via the timerfd path
and the gate condition — only iouring regressed.
… arm dropped

armHeaderTimer can silently fail under SQ-ring pressure (GetSQE+Submit
retry returns nil). When that happens cs.headerTimerArmed stays false
but cs.h1State.HeaderDeadlineNs.Store has already set the deadline, so
subsequent ArmHeaderDeadline calls (idempotent on the deadline) won't
re-trigger OnHeaderDeadlineArmed and the conn falls back to the sweep
alone.

Belt-and-braces: in both the async-dispatch and inline handleRecv paths,
re-attempt armHeaderTimer after appending received bytes whenever
HeaderDeadlineNs > 0 but headerTimerArmed is still false. Cheap atomic
load on the hot path (constant per recv), zero-cost when ReadHeaderTimeout
is disabled (HeaderDeadlineNs stays 0). Every drip byte on a slowloris
conn becomes another chance for the timer SQE to land — pre-fix, only
the initial accept got one shot.

Pairs with 6dbbf02 (sweep cadence restore): together they close the
gap that left iouring 20-21 hangs per slow-refapp cell despite the
per-conn IORING_OP_TIMEOUT being in place.
… under async mode

ProcessH1 calls state.ArmHeaderDeadline at internal/conn/h1.go:361 when
it sees data on a conn with HeaderDeadlineNs == 0 (the keep-alive
next-request transition). For async mode this runs on the per-conn
dispatch goroutine, NOT the worker thread.

ArmHeaderDeadline then fires OnHeaderDeadlineArmed which the iouring
worker had set to a closure that calls w.ring.GetSQE. io_uring's
IORING_SETUP_SINGLE_ISSUER (set by the high/optional tiers) requires
all SQE submissions come from the LockOSThread'd worker — a goroutine
GetSQE call can silently drop the SQE, return a stale slot pointer, or
corrupt the SQ ring outright. Plausibly the root cause of the residual
17-21 slowloris hangs per slow-refapp cell (observability,
static_swagger_proxy): the corrupted SQ then drops unrelated header-
timer SQEs for other slowloris conns on the same worker.

Fix: leave OnHeaderDeadlineArmed nil in async mode. The initial accept-
time arm is now a direct armHeaderTimer call on the worker thread (the
new w.async branch right after ArmHeaderDeadline). Keep-alive re-arms
fall back to checkTimeouts which runs every 800ms worst case with the
gate + adaptiveTimeout tightening from 6dbbf02. Slowloris attacks
never reach the keep-alive path (no successful parse), so they keep
kernel-precise timer behavior via the accept-time arm + the recv-path
retry (3b9d643) for SQ-pressure drops.

Sync mode (!w.async) unchanged: ProcessH1 runs inline on the worker
thread, OnHeaderDeadlineArmed is a worker-thread call, no race.
…T-loss workaround)

Nightly 26427061834 (with sweep-cadence + retry-recv + SINGLE_ISSUER
fixes) showed residual ~16-25 hangs per slow-refapp cell on iouring AND
epoll, basically unchanged from baseline. The identical shape on both
engines despite their independent close paths ruled out engine-specific
bugs — the FIN/RST signaling itself is just unreliable enough under
cluster load that the walker's 20s drip-budget fires before its kernel
observes the close on ~5-7% of attempts.

Switch to a TCP-reliable signaling mechanism: write a 408 Request
Timeout response BEFORE closing. The response bytes ride TCP's
retransmission machinery, so even with packet loss the walker
eventually Read()s a byte of response data and exits its drip loop via
wellReject — no FIN/RST visibility required.

iouring (handleHeaderTimer + checkTimeouts):
  cs.writeFn(requestTimeoutResponse)  // appends to cs.writeBuf
  w.closeConn(fd)                     // sees non-empty writeBuf →
                                      // defers close until SEND CQE
                                      // → finishCloseDetached takes
                                      // graceful SHUT_WR + drain path
                                      // (no forceRSTClose) so the 408
                                      // reliably leaves the kernel
                                      // before the FIN.

epoll (checkTimeouts):
  unix.Write(fd, requestTimeoutResponse)  // 80 bytes synchronous to
                                          // kernel send buffer; EAGAIN
                                          // is practically impossible
                                          // since the conn's send
                                          // buffer was empty.
  l.closeConn(fd)                         // graceful close path (no
                                          // forceRSTClose set) →
                                          // SHUT_WR + drain + Close.

The forceRSTClose machinery is left in place (release resets it, both
close paths still honor the flag) for any future caller that wants
synchronous RST teardown, but the slowloris-defence path no longer
touches it. SetsockoptLinger calls are removed from the same paths —
they were no-ops in fixed-file mode (LINGER on an index returns EBADF)
and harmful for non-fixed-file once we want graceful close.
Three prior approaches all left ~5-7% walker-hangs per slow-refapp cell:
  - SHUT_RDWR + close + SO_LINGER{1,0} (forced RST): nightly 26427061834
  - 408 Request Timeout + SHUT_WR + drain + close:   nightly 26429150655

Std/net.http on header-deadline-expiry calls c.close() → plain
rwc.Close() (no shutdown, no linger, no response body). Std walker
shows 0 hangs on the same probatorium test. The data points at our
extra SHUT_*/drainRecvBuffer syscalls racing with the peer's in-flight
Write — the multi-syscall close window gives the kernel time to send
RST on the peer's pending data, dropping any pending recv on the peer
side. Plain close avoids that race entirely.

Changes:
  - iouring handleHeaderTimer + checkTimeouts: drop the 408-write +
    forceRSTClose paths. Just call w.closeConn(fd).
  - iouring finishCloseDetached: async-mode HTTP1 conns (detached=true
    via pre-allocated detachMu, but h1State.Detached=false) use plain
    unix.Close. Truly detached WS/SSE conns still use SHUT_WR+drain
    so middleware-queued close-frame echoes flush.
  - epoll checkTimeouts: drop 408-write + linger. Just l.closeConn.
  - epoll closeConn: new plainClose path covers both sync and async-
    mode H1 (was previously only sync H1 via fastClose). H2 + truly
    detached conns still use the graceful SHUT_WR+drain path.
  - Drop requestTimeoutResponse helper (no longer used).

The forceRSTClose machinery is left intact (no setter now, but release
resets the field; switch-case still honors it for any future caller).
Plain unix.Close (4382283) dropped stat_swag iouring hangs from ~21/cell
to ~2/cell. Observability still saw ~19-23/cell. Hypothesis: plain Close
on a non-empty recv buffer sends RST instead of FIN (Linux tcp_close
checks data_was_unread first). RST is fire-and-forget — TCP doesn't
retransmit it. On the cluster's 20G LACP fabric a single dropped RST
strands the walker (no second chance), giving the residual ~5-7% hang
rate per cell.

For slowloris specifically the peer is by design still writing drips,
so the kernel recv buffer is almost always non-empty at our close time.
SHUT_WR forces FIN regardless of recv-buffer state; FIN IS retransmitted
from FIN_WAIT_1 until ACK'd, so a dropped FIN packet doesn't strand the
walker. Skip drainRecvBuffer — that was the source of the original race
(an empty drain is harmless but a non-empty one leaves a multi-µs window
in which a fresh drip queues, sending Close back to the RST branch).

Changes:
  - iouring finishCloseDetached: async-mode H1 path now Shutdown(SHUT_WR)
    + Close instead of plain Close.
  - epoll closeConn: plainClose path now Shutdown(SHUT_WR) + Close.
  - Graceful drain path retained for H2 (GOAWAY/RST_STREAM flush) and
    truly detached WS/SSE conns (middleware-queued close frames).
…anic load

Diagnosed from probatorium nightly 26444562273: recovery's default
Logger (slog.Default → text handler → os.Stderr) serializes panic logs
on a global mutex inside Go's default handler. Under iouring/epoll's
async-dispatch model the dispatch goroutine holds cs.detachMu across
ProcessH1, so a synchronous LogAttrs gates the worker thread for the
duration of the stderr write. Sustained panic loads (>100 req/sec to
a handler that panics) drop engine throughput by ~14× and let
slowloris header-deadlines slip past on concurrent conns.

Refapps in probatorium already work around this by passing an explicit
io.Discard or ringHandler logger. Document the issue for upstream
users and point them at the same workaround. Async-logging or
discard-by-default would be a larger semantic change deferred to
v1.4.12+.
@FumingPower3925 FumingPower3925 merged commit b4875ca into main May 28, 2026
7 checks passed
@FumingPower3925 FumingPower3925 deleted the fix/iouring-epoll-read-header-timeout-v3 branch May 30, 2026 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant