feat(session): add LLM stream stall detector in processor.ts #399

@randomm

Motivation

When an LLM provider stalls mid-stream (no tokens for minutes), the subagent still appears "running" but is actually frozen. The for-await loop in processor.ts waits indefinitely for the next SSE chunk. The abort signal is checked only after a token arrives, so if no token arrives the check never runs. The only safety net is a 30-minute timeout, which means a zombie agent can waste up to 30 minutes before the system notices.

Proposed Solution

Add a lastTokenTime timestamp that is updated on every text-delta, reasoning-delta, and tool-call event in the processor stream loop. At the start of each iteration (after the existing input.abort.throwIfAborted() check at line 56), compare Date.now() - lastTokenTime against a configurable stall timeout (default 3 minutes). If the timeout is exceeded, throw an error such as "LLM stream stalled: no tokens received for 3 minutes".

Key hook locations in processor.ts:

  • Line 56: stall check goes after input.abort.throwIfAborted()
  • Line 81: in case "reasoning-delta", set lastTokenTime = Date.now()
  • Lines 134-264: in the tool-call handlers, update lastTokenTime (a tool call indicates the LLM is active)
  • Line 338: in case "text-delta", set lastTokenTime = Date.now()
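
As a rough illustration of how these hooks might fit together, here is a minimal sketch. It is not the actual processor.ts code: StreamEvent, TOKEN_EVENTS, StallDetector, and consume are hypothetical names, the event shape is assumed, and the injectable clock is there only so the logic can be tested without waiting.

```typescript
// Hypothetical event shape; the real processor.ts event types may differ.
type StreamEvent = { type: string };

// Event types treated as evidence that the LLM is still producing output.
const TOKEN_EVENTS = new Set(["text-delta", "reasoning-delta", "tool-call"]);

class StallDetector {
  private lastTokenTime: number;
  private readonly timeoutMs: number;
  private readonly now: () => number;

  // `now` is injectable (assumption) so tests can control the clock.
  constructor(timeoutMs: number, now: () => number = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastTokenTime = now();
  }

  // Call from each token-bearing case branch.
  touch(): void {
    this.lastTokenTime = this.now();
  }

  // Call at the top of each iteration, right after the abort check.
  check(): void {
    const elapsed = this.now() - this.lastTokenTime;
    if (elapsed > this.timeoutMs) {
      throw new Error(`LLM stream stalled: no tokens received for ${elapsed} ms`);
    }
  }
}

// Sketch of the loop; `abort` only needs throwIfAborted(), as on AbortSignal.
async function consume(
  stream: AsyncIterable<StreamEvent>,
  abort: { throwIfAborted(): void },
  detector: StallDetector,
): Promise<number> {
  let handled = 0;
  for await (const event of stream) {
    abort.throwIfAborted();
    detector.check();
    if (TOKEN_EVENTS.has(event.type)) detector.touch();
    handled++;
  }
  return handled;
}
```

Note the ordering: the check runs before the timestamp update, so an event that arrives after a gap longer than the timeout raises the error instead of resetting the clock.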

Important: This will NOT gracefully abort the hung HTTP connection (SSE reads cannot be interrupted mid-byte). It will, however, surface the problem: the session fails visibly instead of hanging indefinitely, so Pulse/check_task can take corrective action.
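
One caveat worth sketching: a check placed inside the loop body runs only when the iterator yields something, so a completely silent stream would not reach it. A possible variation, which is not part of the proposal above, is to race each next() call against a timer. The wrapper below is a hedged sketch under that assumption. withStallTimeout is a hypothetical helper, and for simplicity any event (not just token events) resets the timer. Like the proposal, it does not cancel the underlying HTTP request; the pending read is simply abandoned.

```typescript
// Wraps an async iterable so that a gap longer than timeoutMs between
// events rejects with a stall error, even if no further event ever arrives.
async function* withStallTimeout<T>(
  source: AsyncIterable<T>,
  timeoutMs: number,
): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]();
  try {
    while (true) {
      let timer: ReturnType<typeof setTimeout> | undefined;
      // Rejects once timeoutMs passes without next() settling.
      const stalled = new Promise<never>((_, reject) => {
        timer = setTimeout(
          () => reject(new Error(`LLM stream stalled: no tokens received for ${timeoutMs} ms`)),
          timeoutMs,
        );
      });
      // Whichever settles first wins; the timer is cleared either way.
      const result = await Promise.race([iterator.next(), stalled]).finally(() =>
        clearTimeout(timer),
      );
      if (result.done) return;
      yield result.value;
    }
  } finally {
    // Best-effort cleanup. Not awaited, because a hung source may never settle.
    iterator.return?.().catch(() => {});
  }
}
```

With this shape the existing loop could iterate over withStallTimeout(stream, timeoutMs) instead of the raw stream, and the stall error would propagate like any other stream error.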

Quality Gates (Non-Negotiable)

  • TDD: Write tests before implementation
  • Coverage: 80%+ test coverage for new code
  • Linting: All code passes project linting rules
  • Local Verification: All tests pass locally before completion

Acceptance Criteria

  • lastTokenTime tracked per session in the processor stream loop
  • Stall timeout configurable (default 3 minutes, overridable via env var OPENCODE_STALL_TIMEOUT_MS)
  • On stall detection: descriptive error thrown, session status set to "failed"
  • Existing tests still pass
  • New test for stall detection (mock stream that delivers tokens then goes silent)
  • Log instrumentation: warn-level log on stall detection with session ID and elapsed time
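
For the configurable timeout, a small helper along these lines could resolve the value. This is a sketch only: resolveStallTimeoutMs and DEFAULT_STALL_TIMEOUT_MS are hypothetical names, and falling back to the default on missing, non-numeric, or non-positive values is an assumption rather than a stated requirement.

```typescript
// Default stall timeout from the acceptance criteria: 3 minutes.
const DEFAULT_STALL_TIMEOUT_MS = 3 * 60 * 1000;

// Reads OPENCODE_STALL_TIMEOUT_MS from an env-like object.
// Assumption: invalid or non-positive values fall back to the default.
function resolveStallTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["OPENCODE_STALL_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_STALL_TIMEOUT_MS;
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed <= 0) return DEFAULT_STALL_TIMEOUT_MS;
  return Math.floor(parsed);
}

// Typical call site in Node: resolveStallTimeoutMs(process.env)
```

Taking the environment as a parameter keeps the helper easy to unit-test without mutating process.env.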

Definition of Done

  • Tests written and passing
  • Typecheck passes
  • Linting passes

Fork Manifest Requirement

This issue modifies the subagent monitoring system introduced by the async-tasks fork feature. Upon completion, update the async-tasks entry in .fork-features/manifest.json:

  • modifiedFiles: Add packages/opencode/src/session/processor.ts
  • criticalCode: Add lastTokenTime, OPENCODE_STALL_TIMEOUT_MS, LLM stream stalled
  • absorptionSignals: Add stall.*detector, stream.*stall, lastTokenTime

This ensures sync-time agents understand the stall detection logic and can verify it survives upstream merges.

Metadata

Labels: enhancement (New feature or request)
