
LiteLLM streaming silently drops tool call responses when finish_reason is "length" (max output tokens reached) #4482

@GitMarco27

Description


🔴 Required Information

Describe the Bug:

When using LiteLlm in streaming mode (StreamingMode.SSE), if the model's response is truncated due to reaching max_output_tokens (or the model's natural max length), and the model was attempting to produce a tool call, the entire response is silently dropped — no LlmResponse is yielded, no error is raised, and the ADK event stream simply ends with no output.

This happens because the streaming aggregation logic in LiteLlm.generate_content_async() (lite_llm.py lines 1955-2005) only handles finish_reason == "tool_calls" or finish_reason == "stop" when deciding whether to yield the aggregated response. When finish_reason == "length" (which LiteLLM returns when MAX_TOKENS is hit), the accumulated function_calls dict is never yielded and is silently discarded.

Relevant code (v1.24.1, lite_llm.py ~line 1955):

if (
    finish_reason == "tool_calls" or finish_reason == "stop"
) and function_calls:
    # ... builds and yields aggregated_llm_response_with_tool_call
elif finish_reason == "stop" and (text or reasoning_parts):
    # ... builds and yields aggregated_llm_response

For pure text responses this is less critical because text chunks are already yielded incrementally via _message_to_generate_content_response(..., is_partial=True). But for tool calls, chunks are only accumulated into the function_calls dict during streaming and are yielded as a single aggregated response at the end — so when finish_reason == "length", the aggregated tool call response is never emitted.
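To make the failure mode concrete, here is a minimal, self-contained sketch of the accumulation pattern described above (simplified and illustrative only; the chunk shapes and names are made up for the demo and this is not the actual lite_llm.py code):

from dataclasses import dataclass


@dataclass
class FakeDelta:
    content: str | None = None
    tool_name: str | None = None
    tool_args: str | None = None


@dataclass
class FakeChunk:
    delta: FakeDelta
    finish_reason: str | None = None


def aggregate(chunks: list[FakeChunk]) -> list[tuple]:
    yielded: list[tuple] = []
    tool_name, tool_args, text = "", "", ""
    for chunk in chunks:
        if chunk.delta.content:
            # Text streams out incrementally as partial responses.
            text += chunk.delta.content
            yielded.append(("partial_text", chunk.delta.content))
        if chunk.delta.tool_name or chunk.delta.tool_args:
            # Tool-call deltas are only accumulated, never yielded mid-stream.
            tool_name += chunk.delta.tool_name or ""
            tool_args += chunk.delta.tool_args or ""
        fr = chunk.finish_reason
        if fr in ("tool_calls", "stop") and tool_name:
            yielded.append(("aggregated_tool_call", tool_name, tool_args))
        elif fr == "stop" and text:
            yielded.append(("aggregated_text", text))
        # fr == "length" matches neither branch, so the accumulated
        # tool call is silently dropped.
    return yielded


# A tool call truncated mid-arguments, finishing with "length":
truncated = [
    FakeChunk(FakeDelta(tool_name="add", tool_args='{"a": 3,')),
    FakeChunk(FakeDelta(tool_args=' "b"'), finish_reason="length"),
]
print(aggregate(truncated))  # -> [] : nothing is ever emitted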

Steps to Reproduce:

  1. Install the latest google-adk release
  2. Create an agent with a tool and set a very low max_output_tokens (e.g., 10) to force truncation during tool call generation
  3. Run a query that triggers a tool call in streaming mode (StreamingMode.SSE)
  4. Observe that the agent produces no output — the event stream ends silently

Expected Behavior:

When the model's output is truncated due to max tokens, ADK should either:

  1. Raise an error or yield an LlmResponse with finish_reason=MAX_TOKENS so the caller knows the response was truncated, OR
  2. Yield the partial tool call data accumulated so far with the appropriate finish_reason, allowing the framework/caller to handle the truncation gracefully (e.g., retry with higher limits)

At a minimum, a non-silent failure is expected — either an exception or a response event indicating truncation occurred.
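As a rough sketch of option 1/2, the streaming branch could assign a truncation-signaling response so that the existing end-of-stream guard yields it. This is illustrative only; which LlmResponse fields are appropriate for signaling the error is an assumption here, not confirmed ADK API:

elif finish_reason == "length" and function_calls:
    # Sketch only: surface the truncation instead of dropping it.
    # The exact LlmResponse fields to populate are an assumption.
    aggregated_llm_response_with_tool_call = LlmResponse(
        error_message=(
            "Tool call truncated by the max output token limit; "
            "partial arguments were discarded."
        ),
        finish_reason=_map_finish_reason(finish_reason),  # presumably "length" -> MAX_TOKENS
    )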

Observed Behavior:

  • The function_calls dict accumulates partial tool call data during streaming
  • When finish_reason == "length" arrives, no branch in the if/elif handles it
  • aggregated_llm_response_with_tool_call is never assigned
  • The method exits the async for loop and falls through to the yield guards, which find None values (sketched just after this list)
  • Result: Zero events yielded for the complete response. The ADK runner produces no output.
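For reference, the post-loop guards behave roughly like this (simplified and illustrative; not the actual lite_llm.py code):

# Both aggregates stay None when finish_reason == "length" arrives.
aggregated_llm_response = None
aggregated_llm_response_with_tool_call = None


def end_of_stream_events():
    # Simplified shape of the yield guards after the async for loop.
    if aggregated_llm_response:
        yield aggregated_llm_response
    if aggregated_llm_response_with_tool_call:
        yield aggregated_llm_response_with_tool_call


print(list(end_of_stream_events()))  # -> [] : zero events for the response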

Environment Details:

  • ADK Library Version: 1.24.1
  • Desktop OS: Linux
  • Python Version: 3.12.11

Model Information:

  • Are you using LiteLLM: Yes
  • LiteLLM Version: 1.79.3
  • Which model is being used: Azure OpenAI (e.g., azure/<deployment-id>, gpt-4o / gpt-4.1 / o4-mini class models)

🟡 Optional Information

Minimal Reproduction Code:

import asyncio
import os

from dotenv import load_dotenv
from google.adk.agents import Agent
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from google.genai.types import GenerateContentConfig

load_dotenv()


def get_litellm_model() -> LiteLlm:
    deployment_id = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
    return LiteLlm(
        model=f"azure/{deployment_id}",
        stream=True,
    )


async def add(a: float, b: float) -> float:
    """Add two numbers and return the sum."""
    return a + b


def create_simple_agent() -> Agent:
    model = get_litellm_model()

    instructions = """
    You are a helpful mathematical assistant with access to a calculator tool.
    """

    agent = Agent(
        name="Claudia",
        model=model,
        generate_content_config=GenerateContentConfig(
            temperature=0.0, max_output_tokens=10
        ),
        instruction=instructions,
        description="A simple agent that can perform basic mathematical calculations",
        tools=[add],
    )

    return agent


async def run_agent_single_query(query: str):
    agent = create_simple_agent()
    session_service = InMemorySessionService()

    runner = Runner(
        agent=agent,
        app_name="SimpleADKApp",
        session_service=session_service,
        auto_create_session=True,
    )

    user_id = "user_123"
    session_id = "session_002"

    content = types.Content(role="user", parts=[types.Part(text=query)])

    response_text = ""
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=(
            RunConfig(
                streaming_mode=StreamingMode.SSE,
                response_modalities=["TEXT"],
            )
        ),
    ):
        if event.content and event.content.parts:
            response_text = event.content.parts[0].text or ""

    print(f"Agent Response:\n{response_text}\n")
    return response_text


async def main():
    await run_agent_single_query("Ciao! 3 + 9")


if __name__ == "__main__":
    asyncio.run(main())

How often has this issue occurred?:

  • Always (100%) - when max_output_tokens is set low enough to truncate during tool call generation.

Proposed Fix

In generate_content_async() (streaming branch), add an explicit check for finish_reason == "length" when tool calls have been partially accumulated. Since truncated tool calls contain incomplete JSON arguments that cannot be reliably executed, the appropriate behavior is to raise an error rather than silently dropping the response:

          elif finish_reason == "stop" and (text or reasoning_parts):
            message_content = text if text else None
            aggregated_llm_response = _message_to_generate_content_response(
                # ... existing code ...
            )
            aggregated_llm_response.finish_reason = _map_finish_reason(
                finish_reason
            )
            text = ""
            reasoning_parts = []

          elif finish_reason == "length" and function_calls:
            raise ValueError(
                "LLM response was truncated due to max output token limit "
                "while generating a tool call. The partial tool call data "
                "cannot be executed. Consider increasing `max_output_tokens` "
                "in your GenerateContentConfig, or reducing the number/complexity "
                "of available tools to allow the model to complete its response."
            )
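With the error raised rather than the response being dropped, the caller from the reproduction above can at least detect the truncation and react. For example (illustrative; assumes the ValueError propagates through runner.run_async, and matching on the message text is just one possible check):

# Inside run_agent_single_query(...) from the reproduction above:
try:
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=RunConfig(streaming_mode=StreamingMode.SSE),
    ):
        ...
except ValueError as exc:
    if "max output token limit" in str(exc):
        # Truncated tool call: e.g. rebuild the agent with a larger
        # max_output_tokens and retry the query.
        ...
    else:
        raise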

Thanks @notTyche for spotting this one
