Description
🔴 Required Information
Describe the Bug:
When using LiteLlm in streaming mode (StreamingMode.SSE), if the model's response is truncated due to reaching max_output_tokens (or the model's natural max length), and the model was attempting to produce a tool call, the entire response is silently dropped — no LlmResponse is yielded, no error is raised, and the ADK event stream simply ends with no output.
This happens because the streaming aggregation logic in LiteLlm.generate_content_async() (lite_llm.py lines 1955-2005) only handles finish_reason == "tool_calls" or finish_reason == "stop" when deciding whether to yield the aggregated response. When finish_reason == "length" (which LiteLLM returns when MAX_TOKENS is hit), the accumulated function_calls dict is never yielded and is silently discarded.
Relevant code (v1.24.1, lite_llm.py ~line 1955):
if (
    finish_reason == "tool_calls" or finish_reason == "stop"
) and function_calls:
    # ... builds and yields aggregated_llm_response_with_tool_call
elif finish_reason == "stop" and (text or reasoning_parts):
    # ... builds and yields aggregated_llm_response

For pure text responses this is less critical, because text chunks are already yielded incrementally via _message_to_generate_content_response(..., is_partial=True). But for tool calls, chunks are only accumulated into the function_calls dict during streaming and yielded as a single aggregated response at the end, so when finish_reason == "length" the aggregated tool call response is never emitted.
Steps to Reproduce:
- Install the latest google-adk
- Create an agent with a tool and set a very low max_output_tokens (e.g., 10) to force truncation during tool call generation
- Run a query that triggers a tool call in streaming mode (StreamingMode.SSE)
- Observe that the agent produces no output; the event stream ends silently
Expected Behavior:
When the model's output is truncated due to max tokens, ADK should either:
- Raise an error or yield an LlmResponse with finish_reason=MAX_TOKENS so the caller knows the response was truncated (see the sketch below), OR
- Yield the partial tool call data accumulated so far with the appropriate finish_reason, allowing the framework/caller to handle the truncation gracefully (e.g., retry with higher limits)
At a minimum, a non-silent failure is expected — either an exception or a response event indicating truncation occurred.
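For illustration, the first option would enable caller-side handling along these lines. This is only a sketch and assumes the truncated response surfaces as an event whose finish_reason maps to types.FinishReason.MAX_TOKENS:

from google.genai import types

async def run_and_check_truncation(runner, user_id, session_id, content, run_config):
    # Sketch: if ADK yielded an event for the truncated response, the caller
    # could detect MAX_TOKENS and retry with a larger max_output_tokens.
    events = []
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=run_config,
    ):
        if getattr(event, "finish_reason", None) == types.FinishReason.MAX_TOKENS:
            raise RuntimeError(
                "Response truncated at max_output_tokens; retry with a higher limit"
            )
        events.append(event)
    return events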
Observed Behavior:
- The function_calls dict accumulates partial tool call data during streaming
- When finish_reason == "length" arrives, no branch in the if/elif handles it
- aggregated_llm_response_with_tool_call is never assigned
- The method exits the async for loop and falls through to the yield guards, which find None values
- Result: Zero events yielded for the complete response. The ADK runner produces no output (see the workaround sketch below).
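Until this is fixed, the only caller-visible signal is the complete absence of events. A defensive wrapper along these lines (a hypothetical helper, built only on the same runner.run_async call used in the reproduction code below) at least converts the silent drop into an explicit error:

async def run_or_fail(runner, user_id, session_id, content, run_config):
    # Workaround sketch (not part of ADK): since the truncated tool call
    # currently produces zero events, treat an empty event stream as an error.
    produced_any_event = False
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=run_config,
    ):
        produced_any_event = True
        yield event
    if not produced_any_event:
        raise RuntimeError(
            "Agent produced no events; the model response was likely truncated "
            "while generating a tool call (finish_reason == 'length')."
        )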
Environment Details:
- ADK Library Version: 1.24.1
- Desktop OS: Linux
- Python Version: 3.12.11
Model Information:
- Are you using LiteLLM: Yes
- LiteLLM Version: 1.79.3
- Which model is being used: Azure OpenAI (e.g., azure/<deployment-id>; gpt-4o / gpt-4.1 / o4-mini class models)
🟡 Optional Information
Minimal Reproduction Code:
import asyncio
import os

from dotenv import load_dotenv
from google.adk.agents import Agent
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from google.genai.types import GenerateContentConfig

load_dotenv()


def get_litellm_model() -> LiteLlm:
    deployment_id = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
    return LiteLlm(
        model=f"azure/{deployment_id}",
        stream=True,
    )


async def add(a: float, b: float) -> float:
    return a + b


def create_simple_agent() -> Agent:
    model = get_litellm_model()
    instructions = """
    You are a helpful mathematical assistant with access to a calculator tool.
    """
    agent = Agent(
        name="Claudia",
        model=model,
        generate_content_config=GenerateContentConfig(
            temperature=0.0, max_output_tokens=10
        ),
        instruction=instructions,
        description="A simple agent that can perform basic mathematical calculations",
        tools=[add],
    )
    return agent


async def run_agent_single_query(query: str):
    agent = create_simple_agent()
    session_service = InMemorySessionService()
    runner = Runner(
        agent=agent,
        app_name="SimpleADKApp",
        session_service=session_service,
        auto_create_session=True,
    )
    user_id = "user_123"
    session_id = "session_002"
    content = types.Content(role="user", parts=[types.Part(text=query)])
    response_text = ""
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=(
            RunConfig(
                streaming_mode=StreamingMode.SSE,
                response_modalities=["TEXT"],
            )
        ),
    ):
        if event.content and event.content.parts:
            response_text = event.content.parts[0].text or ""
            print(f"Agent Response:\n{response_text}\n")
    return response_text


async def main():
    await run_agent_single_query("Ciao! 3 + 9")


if __name__ == "__main__":
    asyncio.run(main())

How often has this issue occurred?:
- Always (100%) when max_output_tokens is set low enough to truncate during tool call generation.
Proposed Fix
In generate_content_async() (streaming branch), add an explicit check for finish_reason == "length" when tool calls have been partially accumulated. Since truncated tool calls contain incomplete JSON arguments that cannot be reliably executed, the appropriate behavior is to raise an error rather than silently dropping the response:
elif finish_reason == "stop" and (text or reasoning_parts):
    message_content = text if text else None
    aggregated_llm_response = _message_to_generate_content_response(
        # ... existing code ...
    )
    aggregated_llm_response.finish_reason = _map_finish_reason(
        finish_reason
    )
    text = ""
    reasoning_parts = []
elif finish_reason == "length" and function_calls:
    raise ValueError(
        "LLM response was truncated due to max output token limit "
        "while generating a tool call. The partial tool call data "
        "cannot be executed. Consider increasing `max_output_tokens` "
        "in your GenerateContentConfig, or reducing the number/complexity "
        "of available tools to allow the model to complete its response."
    )
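If raising is considered too disruptive, the second option from Expected Behavior could look roughly like the following instead. This is an untested sketch; it assumes LlmResponse's error_code / error_message fields are a suitable place to carry this signal:

elif finish_reason == "length" and function_calls:
    # Alternative sketch: surface the truncation to the caller instead of
    # raising, so the framework/caller can decide how to recover (e.g. retry
    # with a larger max_output_tokens). Exact field usage is an assumption.
    yield LlmResponse(
        error_code="MAX_TOKENS",
        error_message=(
            "Streamed tool call was truncated by the max output token limit; "
            "the partial tool call arguments cannot be executed."
        ),
        finish_reason=_map_finish_reason(finish_reason),
    )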
Thanks @notTyche for spotting this one