
fix: Verify durable cached agent steps match the request before replay #68372

Merged
kaxil merged 1 commit into apache:main from astronomer:durable-replay-verification
Jun 18, 2026

@kaxil kaxil commented Jun 11, 2026

Member

durable=True caches each model response and tool result under positional keys (model_step_{N}, tool_step_{N}). Until now, a cache hit returned the cached entry without ever looking at the current request. So a retry after the operator tweaked the system prompt (or upgraded the model, or a deploy changed the toolset) replayed responses recorded for the old conversation against the new agent. No error, nothing above DEBUG in the logs, just a wrong answer that looks fine. Changing something before retrying is the normal human workflow, which is what makes this the common path rather than an edge case.

The fix stores a fingerprint with each cache entry and only replays when it matches the current request:

  • model steps hash the model identity, the message history (minus the timestamp/run_id/conversation_id fields pydantic-ai regenerates on every attempt), the settings, and the whole ModelRequestParameters, so tool definitions and output mode are covered too
  • tool steps hash the tool name, the args, and the model-issued tool_call_id
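
For readers who want the shape of the two fingerprints above, here is a minimal sketch, not the provider's code. It assumes requests have already been dumped to plain JSON-compatible dicts and lists, and that SHA-256 over canonical JSON is an acceptable digest (the PR does not say which hash is used). All function names and the `VOLATILE_FIELDS` constant are hypothetical.

```python
import hashlib
import json

# Fields the description says are regenerated on every attempt.
# Where they sit inside real messages is an assumption of this sketch.
VOLATILE_FIELDS = frozenset({"timestamp", "run_id", "conversation_id"})


def strip_volatile(value):
    """Recursively drop per-attempt fields from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_volatile(v) for k, v in value.items() if k not in VOLATILE_FIELDS}
    if isinstance(value, list):
        return [strip_volatile(v) for v in value]
    return value


def digest(payload):
    """Hash a JSON-serializable payload with a stable key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def model_step_fingerprint(model_name, messages, settings, request_parameters):
    """Hypothetical model-step fingerprint: identity, history, settings, parameters."""
    return digest(
        {
            "model": model_name,
            "messages": strip_volatile(messages),
            "settings": settings,
            "request_parameters": request_parameters,
        }
    )


def tool_step_fingerprint(tool_name, args, tool_call_id):
    """Hypothetical tool-step fingerprint: name, args, and the model-issued id."""
    return digest({"tool": tool_name, "args": args, "tool_call_id": tool_call_id})
```

With this shape, two attempts whose messages differ only in `timestamp` produce the same model-step fingerprint, while a changed prompt, setting, tool definition, or `tool_call_id` produces a different one.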

On mismatch the step logs a warning and runs live. The tool_call_id part does more work than it looks: ids round-trip through the cache unchanged, but a live model call mints new ones, so once a model step diverges the downstream tool entries stop matching as well. I kept the positional keys from #64199 instead of switching to content-addressed ones; verify-on-hit gets the same invalidation chain without changing the storage layout.
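
To make the invalidation chain concrete, here is a rough sketch of verify-on-hit over positional keys. It uses an in-memory dict as the cache and short strings as stand-in fingerprints; `replay_or_run`, the entry layout, and the warning text are all hypothetical, and the unserializable-request and legacy-entry cases are left out.

```python
import logging

log = logging.getLogger("durable_sketch")


def replay_or_run(cache, key, fingerprint, run_live):
    """Verify-on-hit for one positional key. Returns (result, replayed)."""
    entry = cache.get(key)
    if entry is not None and entry.get("fingerprint") == fingerprint:
        return entry["result"], True
    if entry is not None:
        log.warning("cached %s does not match the current request, running live", key)
    result = run_live()
    # Overwrite in place: the key stays positional, only the entry changes.
    cache[key] = {"fingerprint": fingerprint, "result": result}
    return result, False


# Cache left behind by attempt 1.
cache = {
    "model_step_0": {"fingerprint": "prompt-v1", "result": {"tool_call_id": "call_a"}},
    "tool_step_0": {"fingerprint": "lookup:call_a", "result": "cached tool output"},
}

# Attempt 2 with a changed prompt: the model step misses, and the live
# call mints a new tool_call_id.
response, model_replayed = replay_or_run(
    cache, "model_step_0", "prompt-v2", lambda: {"tool_call_id": "call_b"}
)

# The tool fingerprint includes the new id, so the old tool entry misses too.
tool_fingerprint = "lookup:" + response["tool_call_id"]
output, tool_replayed = replay_or_run(
    cache, "tool_step_0", tool_fingerprint, lambda: "live tool output"
)
```

Neither step replays here, even though nothing about the tool call itself changed apart from its id. That is the property content-addressed keys would otherwise have provided.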

A few decisions worth flagging for review:

  • if a request can't be serialized to JSON, the fingerprint is None and that step replays unverified, i.e. the old behavior. I specifically avoided default=str in the digest: hashing <object at 0x...> reprs would never match across processes, which quietly turns replay off forever while the warning blames the user for changing the agent.
  • entries written by older provider versions carry no fingerprint and are treated as a miss. A provider upgrade is itself a deploy landing between attempts, so re-running once is the right call even though it costs tokens.
  • verification compares requests, not code. Fixing a tool's implementation between attempts won't invalidate a cached result for an identical call, and neither will repointing llm_conn_id at a different endpoint serving the same model name. Both are documented in the operator guide, along with the delete-the-cache-file escape hatch.
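
The first decision above is easier to see with a small stdlib-only illustration. This is a sketch under assumptions, not the provider's code: `request_fingerprint` and `Handle` are hypothetical names, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib
import json


def request_fingerprint(payload):
    """Return a hex digest, or None when the payload is not plain JSON."""
    try:
        canonical = json.dumps(payload, sort_keys=True)
    except (TypeError, ValueError):
        # No default=str fallback: an unverifiable step is better than a
        # fingerprint that can never match again.
        return None
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Handle:
    """Stand-in for an object with no JSON form and the default repr."""


# Both objects are kept alive so they are guaranteed distinct addresses.
a, b = Handle(), Handle()

# Under default=str the default repr, which embeds the memory address,
# ends up in the hashed text, so equivalent objects serialize differently.
first = json.dumps({"client": a}, default=str)
second = json.dumps({"client": b}, default=str)
addresses_leak_into_digest = first != second

# Without the fallback, the same payload yields no fingerprint at all.
unverifiable = request_fingerprint({"client": a}) is None
```

Returning None keeps the failure visible as "unverified" instead of producing a digest that differs on every process start.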

Verified end to end in breeze with a real DAG: two AgentOperator tasks on pydantic-ai's built-in TestModel, each with a tool that fails on attempt 1 only. The unchanged task logged Durable: replayed 2 cached steps (1 model, 1 tool), executed 2 new steps (1 model, 1 tool) on attempt 2, so only the failed step re-ran. The second task templates its prompt on try_number so the request changes per attempt; its retry fired the new warning and replayed 0 model steps. The cache file was gone after the run succeeded.

@kaxil kaxil force-pushed the durable-replay-verification branch 2 times, most recently from d4c16ba to d0036eb Compare June 11, 2026 18:56
@kaxil kaxil marked this pull request as ready for review June 12, 2026 23:19
@kaxil kaxil requested a review from gopidesupavan as a code owner June 12, 2026 23:19
Comment thread providers/common/ai/src/airflow/providers/common/ai/durable/storage.py Outdated
@kaxil kaxil force-pushed the durable-replay-verification branch from d0036eb to 020da0f Compare June 16, 2026 23:55
durable=True cached model responses and tool results under purely
positional keys, so a retry replayed cached steps even when the agent
changed between attempts (prompt tweak, model upgrade, toolset change,
or a deploy landing between retries). The retry silently continued a
different conversation with no warning above DEBUG.

Each cache entry now stores a fingerprint of the request that produced
it (model identity, message history minus per-attempt fields, settings,
and the full ModelRequestParameters; tool name, args, and tool_call_id
for tool steps). On a hit the fingerprint is compared first: a mismatch
logs a warning and re-runs the step live. A divergence invalidates
downstream tool steps too, because a fresh model response mints new
tool_call_ids. Entries written by older provider versions have no
fingerprint and re-run instead of replaying.
@kaxil kaxil force-pushed the durable-replay-verification branch from 020da0f to 3a30867 Compare June 17, 2026 00:34
@kaxil kaxil merged commit db26df7 into apache:main Jun 18, 2026
81 checks passed
@kaxil kaxil deleted the durable-replay-verification branch June 18, 2026 00:05

2 participants