Python: Fix function_approval_response extraction in AG-UI workflow path#4550
Python: Fix function_approval_response extraction in AG-UI workflow path#4550moonbox3 wants to merge 6 commits intomicrosoft:mainfrom
Conversation
…4546) _extract_responses_from_messages now handles function_approval_response content in addition to function_result content. Previously, approval responses sent via the messages field were silently dropped because the function only checked for content.type == "function_result". The approval response is keyed by content.id and includes the approved status, id, and serialized function_call — consistent with how _coerce_content identifies approval response payloads. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
moonbox3
left a comment
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 97% | Result: All clear
Reviewed: Correctness, Security Reliability, Test Coverage, Design Approach
Automated review by moonbox3's agents
Python Test Coverage Report •
Python Unit Test Overview
|
||||||||||||||||||||||||||||||
There was a problem hiding this comment.
Pull request overview
Fixes AG-UI workflow runs hanging in human-in-the-loop scenarios by ensuring user-sent tool approval decisions are extracted from incoming messages and forwarded into the workflow responses map.
Changes:
- Extend
_extract_responses_from_messagesto also extractfunction_approval_responsecontent (in addition tofunction_result). - Add unit tests covering approval extraction (approved/denied/mixed/ignored content).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/packages/ag-ui/agent_framework_ag_ui/_workflow_run.py | Extracts function_approval_response payloads from incoming messages so workflow runs can resume from message-based approvals. |
| python/packages/ag-ui/tests/ag_ui/test_workflow_run.py | Adds unit tests for the updated extraction helper, covering approval response handling. |
You can also share your feedback on Copilot code review. Take the survey.
…ssage-based approvals - Update _extract_responses_from_messages docstring to reflect that it now handles function_approval_response content in addition to function_result content. - Add integration tests for run_workflow_stream across two turns with approval responses provided via messages (function_approvals) rather than resume.interrupts, covering both approved and denied scenarios. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
moonbox3
left a comment
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 82%
✓ Correctness
This diff adds a docstring clarification and two new integration tests for approval-via-messages functionality. The tests are well-structured and symmetric (approved vs denied). No blocking correctness issues found. The tests correctly match request IDs, use the right async patterns (no @pytest.mark.asyncio needed per project convention), and verify both the happy path and denial path. One minor concern: the
"interrupt" not in resumed_finishedassertion may be fragile depending on how the Pydantic model serializes optional fields.
✓ Security Reliability
This diff contains only a docstring clarification in production code and two new integration tests. There are no production logic changes, no new trust boundaries, no deserialization of untrusted input, no secrets, and no resource-leak risks. The tests are well-structured and exercise both the approved and denied approval paths. No security or reliability issues found.
✓ Test Coverage
Two solid integration tests are added covering both approved and denied approval-via-messages paths. The assertions are meaningful: they verify event types, text output content, and absence of leftover interrupts. However, the docstring update to
_extract_responses_from_messagesdescribes newfunction_approval_responsehandling, yet no unit test is added to the existingTestExtractResponsesFromMessagesclass to exercise that extraction logic directly. The integration tests cover the end-to-end flow but a targeted unit test would catch regressions in the extraction function more precisely.
✓ Design Approach
The diff adds a docstring clarification and two new integration tests covering the approval-via-messages path (approved and denied). The production code change is documentation-only, and the tests exercise a real workflow end-to-end. There are no fundamental design approach concerns: the tests verify a legitimate second message-delivery channel for approval responses and cover both positive and negative outcomes. The minor duplication between the two test functions is acceptable given they assert different behavioral branches.
Suggestions
- The assertion
assert "interrupt" not in resumed_finished(lines 1328 and 1399) could be fragile: if the Pydantic model always includesinterruptinmodel_dump()(e.g., asNoneor[]), the key would still be present and the assertion would fail. Considerassert not resumed_finished.get("interrupt")which handles both absent keys and falsy values (None, []). - Add a unit test to
TestExtractResponsesFromMessagesthat constructs aMessagecontainingfunction_approval_responsecontent and asserts_extract_responses_from_messagesreturns the expected{id: content}mapping. This directly validates the extraction logic described in the updated docstring, independent of the full workflow machinery. - Consider adding an edge-case integration test (or unit test) where both a
function_resultand afunction_approval_responseare present in the same message, verifying both are extracted into the responses dict without interference.
Automated review by moonbox3's agents
- Use safer 'not .get("interrupt")' assertion instead of 'not in'
to handle Pydantic v2 model_dump() including keys with None values
- Add unit test for mixed function_result and function_approval_response
in the same message to TestExtractResponsesFromMessages
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
moonbox3
left a comment
There was a problem hiding this comment.
Automated Code Review
Reviewers: 4 | Confidence: 86%
✓ Correctness
The diff adds a straightforward new test for mixed result/approval extraction and weakens two assertions from checking key absence (
"interrupt" not in dict) to checking falsiness (not dict.get("interrupt")). The new test looks correct. The assertion changes are likely motivated by Pydantic'smodel_dump()including"interrupt": Nonefor optional fields, but usingnotis overly permissive—it would also pass for unexpected falsy values like{},[], orFalse, potentially masking real bugs.
✓ Security Reliability
This diff modifies only test code: it adds a new test for mixed result/approval extraction and changes two assertions from
assert "interrupt" not in dicttoassert not dict.get("interrupt"). No production code is touched. The assertion change subtly weakens the check—dict.get("interrupt")returnsNoneif the key is absent but would also pass if the key is present with a falsy value (e.g.,{},0,False,None,""), whereas the original"interrupt" not in dictstrictly checks key absence. This is unlikely to mask a real bug in practice but is worth noting as a minor reliability concern in tests. No security issues found.
✓ Test Coverage
The diff adds a useful test for mixed result and approval content in the same message, and tweaks two assertions from
assert "interrupt" not in resumed_finishedtoassert not resumed_finished.get("interrupt"). The new test is well-structured with meaningful assertions. The assertion change subtly weakens the check: the original verified the key was absent, while.get()would also pass if the key is present but holds a falsy value (e.g.,None,False,{},[],0). If the intent is truly to verify the field is absent or null, consider an explicit check. No missing test scenarios were identified.
✓ Design Approach
This diff adds a well-structured test for mixed result/approval extraction and relaxes two assertions from key-absence checks to falsy-value checks. The changes are minor and appropriate. The assertion change from
"interrupt" not in dicttonot dict.get("interrupt")makes tests less brittle against Pydantic serialization behavior (whether optional None fields are included or excluded in model_dump()), which is a reasonable improvement rather than a design concern.
Suggestions
- Consider using
assert resumed_finished.get("interrupt") is Noneinstead ofassert not resumed_finished.get("interrupt"). Theis Nonecheck precisely allows both key-absent and key-is-None cases while still catching unexpected falsy values like{}orFalse. - The assertion change from
assert "interrupt" not in resumed_finishedtoassert not resumed_finished.get("interrupt")is slightly weaker: it now also passes wheninterruptis present but falsy (e.g.,{},None,0). If the intent is to verify the key is truly absent, the original form is more precise. If the intent is to accept both absence and falsy values, the new form is fine—just be aware of the semantic difference. - The assertion change from
assert "interrupt" not in resumed_finishedtoassert not resumed_finished.get("interrupt")is semantically weaker — it now passes wheninterruptis present but falsy (e.g.,None,{},0). If the intent is to accept both absent andNone, this is fine; if the field should truly be absent, the original assertion was stricter. Consider documenting the intent or usingassert resumed_finished.get("interrupt") is Nonefor clarity.
Automated review by moonbox3's agents
Motivation and Context
The
_extract_responses_from_messageshelper only handledfunction_resultcontent, silently droppingfunction_approval_responseentries. This meant approval/denial decisions from the user were never forwarded to the agent runtime, causing human-in-the-loop tool-approval workflows to hang indefinitely.Fixes #4546
Description
The root cause was a guard clause in
_extract_responses_from_messagesthat skipped any content whose type was notfunction_result. The fix restructures the conditional to also handlefunction_approval_responsecontent, extracting itsapprovedflag,id, and associatedfunction_callinto the responses dict keyed by the approval id. Tests covering approved, denied, mixed, and edge-case scenarios are added to prevent regression.Contribution Checklist