
LiteLLM streaming silently drops tool call responses when finish_reason is "length" (max output tokens reached) #4482

@GitMarco27

Description


🔴 Required Information

Describe the Bug:

When using LiteLlm in streaming mode (StreamingMode.SSE), if the model's response is truncated due to reaching max_output_tokens (or the model's natural max length), and the model was attempting to produce a tool call, the entire response is silently dropped — no LlmResponse is yielded, no error is raised, and the ADK event stream simply ends with no output.

This happens because the streaming aggregation logic in LiteLlm.generate_content_async() (lite_llm.py lines 1955-2005) only handles finish_reason == "tool_calls" or finish_reason == "stop" when deciding whether to yield the aggregated response. When finish_reason == "length" (which LiteLLM returns when MAX_TOKENS is hit), the accumulated function_calls dict is never yielded and is silently discarded.

Relevant code (v1.24.1, lite_llm.py ~line 1955):

if (
    finish_reason == "tool_calls" or finish_reason == "stop"
) and function_calls:
    # ... builds and yields aggregated_llm_response_with_tool_call
elif finish_reason == "stop" and (text or reasoning_parts):
    # ... builds and yields aggregated_llm_response

For pure text responses this is less critical because text chunks are already yielded incrementally via _message_to_generate_content_response(..., is_partial=True). But for tool calls, chunks are only accumulated into the function_calls dict during streaming and are yielded as a single aggregated response at the end — so when finish_reason == "length", the aggregated tool call response is never emitted.
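To make the failure mode concrete, here is a minimal, self-contained sketch of the accumulation pattern described above (simplified and illustrative only; the chunk shapes and names are made up for the demo and this is not the actual lite_llm.py code):

from dataclasses import dataclass


@dataclass
class FakeDelta:
    content: str | None = None
    tool_name: str | None = None
    tool_args: str | None = None


@dataclass
class FakeChunk:
    delta: FakeDelta
    finish_reason: str | None = None


def aggregate(chunks: list[FakeChunk]) -> list[tuple]:
    yielded: list[tuple] = []
    tool_name, tool_args, text = "", "", ""
    for chunk in chunks:
        if chunk.delta.content:
            # Text streams out incrementally as partial responses.
            text += chunk.delta.content
            yielded.append(("partial_text", chunk.delta.content))
        if chunk.delta.tool_name or chunk.delta.tool_args:
            # Tool-call deltas are only accumulated, never yielded mid-stream.
            tool_name += chunk.delta.tool_name or ""
            tool_args += chunk.delta.tool_args or ""
        fr = chunk.finish_reason
        if fr in ("tool_calls", "stop") and tool_name:
            yielded.append(("aggregated_tool_call", tool_name, tool_args))
        elif fr == "stop" and text:
            yielded.append(("aggregated_text", text))
        # fr == "length" matches neither branch, so the accumulated
        # tool call is silently dropped.
    return yielded


# A tool call truncated mid-arguments, finishing with "length":
truncated = [
    FakeChunk(FakeDelta(tool_name="add", tool_args='{"a": 3,')),
    FakeChunk(FakeDelta(tool_args=' "b"'), finish_reason="length"),
]
print(aggregate(truncated))  # -> [] : nothing is ever emitted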

Steps to Reproduce:

  1. Install the latest google-adk release
  2. Create an agent with a tool and set a very low max_output_tokens (e.g., 10) to force truncation during tool call generation
  3. Run a query that triggers a tool call in streaming mode (StreamingMode.SSE)
  4. Observe that the agent produces no output — the event stream ends silently

Expected Behavior:

When the model's output is truncated due to max tokens, ADK should either:

  1. Raise an error or yield an LlmResponse with finish_reason=MAX_TOKENS so the caller knows the response was truncated, OR
  2. Yield the partial tool call data accumulated so far with the appropriate finish_reason, allowing the framework/caller to handle the truncation gracefully (e.g., retry with higher limits)

At a minimum, a non-silent failure is expected — either an exception or a response event indicating truncation occurred.
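As a rough sketch of option 1/2, the streaming branch could assign a truncation-signaling response so that the existing end-of-stream guard yields it. This is illustrative only; which LlmResponse fields are appropriate for signaling the error is an assumption here, not confirmed ADK API:

elif finish_reason == "length" and function_calls:
    # Sketch only: surface the truncation instead of dropping it.
    # The exact LlmResponse fields to populate are an assumption.
    aggregated_llm_response_with_tool_call = LlmResponse(
        error_message=(
            "Tool call truncated by the max output token limit; "
            "partial arguments were discarded."
        ),
        finish_reason=_map_finish_reason(finish_reason),  # presumably "length" -> MAX_TOKENS
    )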

Observed Behavior:

  • The function_calls dict accumulates partial tool call data during streaming
  • When finish_reason == "length" arrives, no branch in the if/elif handles it
  • aggregated_llm_response_with_tool_call is never assigned
  • The method exits the async for loop and falls through to the yield guards, which find None values (sketched just after this list)
  • Result: Zero events yielded for the complete response. The ADK runner produces no output.
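For reference, the post-loop guards behave roughly like this (simplified and illustrative; not the actual lite_llm.py code):

# Both aggregates stay None when finish_reason == "length" arrives.
aggregated_llm_response = None
aggregated_llm_response_with_tool_call = None


def end_of_stream_events():
    # Simplified shape of the yield guards after the async for loop.
    if aggregated_llm_response:
        yield aggregated_llm_response
    if aggregated_llm_response_with_tool_call:
        yield aggregated_llm_response_with_tool_call


print(list(end_of_stream_events()))  # -> [] : zero events for the response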

Environment Details:

  • ADK Library Version: 1.24.1
  • Desktop OS: Linux
  • Python Version: 3.12.11

Model Information:

  • Are you using LiteLLM: Yes
  • LiteLLM Version: 1.79.3
  • Which model is being used: Azure OpenAI (e.g., azure/<deployment-id>, gpt-4o / gpt-4.1 / o4-mini class models)

🟡 Optional Information

Minimal Reproduction Code:

import asyncio
import os

from dotenv import load_dotenv
from google.adk.agents import Agent
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from google.genai.types import GenerateContentConfig

load_dotenv()


def get_litellm_model() -> LiteLlm:
    deployment_id = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
    return LiteLlm(
        model=f"azure/{deployment_id}",
        stream=True,
    )


async def add(a: float, b: float) -> float:
    """Add two numbers and return the sum."""
    return a + b


def create_simple_agent() -> Agent:
    model = get_litellm_model()

    instructions = """
    You are a helpful mathematical assistant with access to a calculator tool.
    """

    agent = Agent(
        name="Claudia",
        model=model,
        generate_content_config=GenerateContentConfig(
            temperature=0.0, max_output_tokens=10
        ),
        instruction=instructions,
        description="A simple agent that can perform basic mathematical calculations",
        tools=[add],
    )

    return agent


async def run_agent_single_query(query: str):
    agent = create_simple_agent()
    session_service = InMemorySessionService()

    runner = Runner(
        agent=agent,
        app_name="SimpleADKApp",
        session_service=session_service,
        auto_create_session=True,
    )

    user_id = "user_123"
    session_id = "session_002"

    content = types.Content(role="user", parts=[types.Part(text=query)])

    response_text = ""
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=(
            RunConfig(
                streaming_mode=StreamingMode.SSE,
                response_modalities=["TEXT"],
            )
        ),
    ):
        if event.content and event.content.parts:
            response_text = event.content.parts[0].text or ""

    print(f"Agent Response:\n{response_text}\n")
    return response_text


async def main():
    await run_agent_single_query("Ciao! 3 + 9")


if __name__ == "__main__":
    asyncio.run(main())

How often has this issue occurred?:

  • Always (100%) - when max_output_tokens is set low enough to truncate during tool call generation.

Proposed Fix

In generate_content_async() (streaming branch), add an explicit check for finish_reason == "length" when tool calls have been partially accumulated. Since truncated tool calls contain incomplete JSON arguments that cannot be reliably executed, the appropriate behavior is to raise an error rather than silently dropping the response:

          elif finish_reason == "stop" and (text or reasoning_parts):
            message_content = text if text else None
            aggregated_llm_response = _message_to_generate_content_response(
                # ... existing code ...
            )
            aggregated_llm_response.finish_reason = _map_finish_reason(
                finish_reason
            )
            text = ""
            reasoning_parts = []

          elif finish_reason == "length" and function_calls:
            raise ValueError(
                "LLM response was truncated due to max output token limit "
                "while generating a tool call. The partial tool call data "
                "cannot be executed. Consider increasing `max_output_tokens` "
                "in your GenerateContentConfig, or reducing the number/complexity "
                "of available tools to allow the model to complete its response."
            )
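With the error raised rather than the response being dropped, the caller from the reproduction above can at least detect the truncation and react. For example (illustrative; assumes the ValueError propagates through runner.run_async, and matching on the message text is just one possible check):

# Inside run_agent_single_query(...) from the reproduction above:
try:
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=RunConfig(streaming_mode=StreamingMode.SSE),
    ):
        ...
except ValueError as exc:
    if "max output token limit" in str(exc):
        # Truncated tool call: e.g. rebuild the agent with a larger
        # max_output_tokens and retry the query.
        ...
    else:
        raise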

Thanks @notTyche for spotting this one
