Enhancement: Dataset signed URLs and improved validation errors#652

Merged

vprashrex merged 10 commits into main from feature/validation-errors-dataset-signed-url on Mar 11, 2026

Conversation

@vprashrex
Collaborator

@vprashrex vprashrex commented Mar 9, 2026

Summary

Target issue is #651
Explain the motivation for making this change. What existing problem does the pull request solve?

  1. Clients had no way to download dataset files directly from the API since the endpoint only returned the raw S3 path without a time-limited presigned URL.

  2. Validation errors returned a flat string, making it hard for clients to parse and display them per field.

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, it is covered by test cases.

Notes

Previously, validation errors looked like this:

{
    "success": false,
    "data": null,
    "error": "('body', 'file'): Field required\n('body', 'dataset_name'): Field required",
    "metadata": null
}

Now, validation errors look like this:

{
    "success": false,
    "data": null,
    "error": "Validation failed",
    "errors": [
        {
            "field": "file",
            "message": "Field required"
        },
        {
            "field": "dataset_name",
            "message": "Field required"
        }
    ],
    "metadata": null
}
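A minimal sketch of how Pydantic v2 error dicts can be flattened into this per-field shape. The helper name and the handling of the "body"/"query"/"path" prefixes are assumptions for illustration, not the PR's exact implementation:

```python
# Sketch: convert Pydantic v2 error dicts into {field, message} objects.
# to_field_errors is a hypothetical name; the PR's real logic lives in
# failure_response / the exception handlers.

def to_field_errors(errors: list[dict]) -> list[dict]:
    structured = []
    for err in errors:
        # loc is a tuple like ("body", "dataset_name"); drop the request
        # part ("body", "query", "path") and join the rest with dots.
        loc = [
            str(p)
            for p in err.get("loc", ())
            if p not in ("body", "query", "path")
        ]
        structured.append({
            "field": ".".join(loc) or "unknown",
            "message": err.get("msg", "Invalid value"),
        })
    return structured


sample = [
    {"loc": ("body", "file"), "msg": "Field required"},
    {"loc": ("body", "dataset_name"), "msg": "Field required"},
]
print(to_field_errors(sample))
```

Nested locations such as ("body", "items", 0, "name") come out as dotted paths like items.0.name, which matches the dotted field values the tests below assert on.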

Summary by CodeRabbit

  • New Features

    • Optionally request time-limited signed URLs for dataset downloads (1‑hour expiry).
  • Improvements

    • Validation errors now return structured per-field error objects (field + message).
    • Validation error responses are filtered to remove union/branch noise for clearer messages.
    • API error formatting improved for consistent failure responses.
  • Documentation

    • API docs updated to describe signed URL usage and expiry.
  • Tests

    • Added/updated tests for structured error responses and signed URL behavior.

@coderabbitai

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds optional signed URL support for dataset downloads via an include_signed_url query flag and presigned URLs from cloud storage; restructures validation error handling to return filtered, structured error objects (errors array) instead of a single error string.

Changes

  • Dataset signed URL (backend/app/api/routes/evaluations/dataset.py, backend/app/models/evaluation.py, backend/app/api/docs/evaluation/get_dataset.md): Adds include_signed_url query flag; uses get_cloud_storage to generate a presigned URL when object_store_url exists; DatasetUploadResponse adds optional signed_url; docs note 1-hour expiry.
  • Validation error filtering (backend/app/core/exception_handlers.py): Adds _BRANCH_PATTERN, _is_branch_identifier, _filter_union_branch_errors; applies branch-aware filtering to RequestValidationError handling and updates handler signatures to return JSONResponse.
  • Structured API responses (backend/app/utils.py): Introduces ValidationErrorDetail model; APIResponse gains errors: Optional[list[ValidationErrorDetail]]; failure_response builds structured errors from Pydantic error dicts.
  • Tests updated / new tests (backend/app/tests/.../test_*, backend/app/tests/core/test_exception_handlers.py): Many tests updated to assert errors array with field/message; new unit/integration tests for _filter_union_branch_errors and validation response shape; dataset GET tests validating signed_url behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant Endpoint as "Dataset Endpoint"
    participant Storage as "CloudStorage"
    participant Response as "APIResponse"

    Client->>Endpoint: GET /datasets/{id}?include_signed_url=true
    Endpoint->>Endpoint: load dataset record
    alt dataset.object_store_url exists
        Endpoint->>Storage: request presigned URL for object_store_url
        Storage-->>Endpoint: return presigned URL (1h expiry)
        Endpoint->>Response: build DatasetUploadResponse { signed_url }
    else no object_store_url
        Endpoint->>Response: build DatasetUploadResponse { signed_url: null }
    end
    Response-->>Client: JSON payload (success, data, errors?)
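
The alt/else branch in the diagram can be sketched as a small pure helper. The name maybe_signed_url and the presign callable are hypothetical stand-ins for the route's get_cloud_storage / get_signed_url calls:

```python
# Sketch of the optional-presign branch from the diagram above.
from typing import Callable, Optional


def maybe_signed_url(
    include_signed_url: bool,
    object_store_url: Optional[str],
    presign: Callable[[str], str],
) -> Optional[str]:
    # Only presign when the client asked for it and the dataset has a
    # stored object path; otherwise return None so the response still
    # carries the rest of the dataset record.
    if not (include_signed_url and object_store_url):
        return None
    try:
        return presign(object_store_url)
    except Exception:
        # Cloud storage misconfiguration should not fail the whole request.
        return None


print(maybe_signed_url(True, "s3://bucket/key.csv", lambda u: u + "?sig=abc"))
```

The try/except fallback mirrors the reviewer's later suggestion that a presign failure should degrade to signed_url: null rather than a 500.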

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐇 I hopped through code to fetch a tiny key,

a signed link blooming, safe for you to see.
I brushed the noisy branches till errors were clear,
each field with a message now hopping near.
CSV in hand, I danced — download is near.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 60.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately summarizes the two main objectives: adding signed URLs for dataset downloads and restructuring validation error responses into a more detailed format.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/app/utils.py (1)

68-93: Consider a more specific type hint for the error parameter.

The current type hint str | list is imprecise. Since the list variant expects dictionaries with specific keys (loc, msg), a more explicit type would improve code clarity and type checking.

♻️ Suggested type improvement
     `@classmethod`
     def failure_response(
         cls,
-        error: str | list,
+        error: str | list[dict[str, Any]],
         data: Optional[T] = None,
         metadata: Optional[Dict[str, Any]] = None,
     ) -> "APIResponse[None]":
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/utils.py` around lines 68 - 93, The parameter type for error in
the APIResponse classmethod is too broad (str | list); change it to a more
specific annotation—either Union[str, List[Dict[str, Any]]] or, better, define a
TypedDict (e.g., ValidationErrorDict = TypedDict("ValidationErrorDict", {"loc":
Tuple[Any, ...], "msg": str}, total=False)) and use Union[str,
List[ValidationErrorDict]] for the error parameter in the method signature so
static checkers know the list contains dicts with "loc" and "msg"; also add the
required imports from typing (Union, List, Dict, TypedDict, Tuple) and keep the
existing isinstance(error, list) handling but treat elements as
ValidationErrorDict for clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 40e87da3-ff24-4603-ad0c-3d2dcf7d506a

📥 Commits

Reviewing files that changed from the base of the PR and between 6ead760 and a424d9a.

📒 Files selected for processing (5)
  • backend/app/api/docs/evaluation/get_dataset.md
  • backend/app/api/routes/evaluations/dataset.py
  • backend/app/core/exception_handlers.py
  • backend/app/models/evaluation.py
  • backend/app/utils.py

Comment on lines +154 to +159
signed_url = None
if include_signed_url and dataset.object_store_url:
storage = get_cloud_storage(
session=_session, project_id=auth_context.project_.id
)
signed_url = storage.get_signed_url(dataset.object_store_url)

⚠️ Potential issue | 🟡 Minor

Missing error handling for cloud storage failures.

According to backend/app/core/cloud/storage.py:267-284, get_cloud_storage() can raise ValueError for invalid projects or propagate exceptions when AWS initialization fails. If cloud storage is not configured (e.g., in development environments without AWS credentials), this will cause a 500 error even when the dataset exists.

Consider gracefully handling these failures by returning the dataset without the signed URL instead of failing the entire request.

🛡️ Proposed fix with error handling
     signed_url = None
     if include_signed_url and dataset.object_store_url:
-        storage = get_cloud_storage(
-            session=_session, project_id=auth_context.project_.id
-        )
-        signed_url = storage.get_signed_url(dataset.object_store_url)
+        try:
+            storage = get_cloud_storage(
+                session=_session, project_id=auth_context.project_.id
+            )
+            signed_url = storage.get_signed_url(dataset.object_store_url)
+        except Exception as e:
+            logger.warning(
+                f"[get_dataset] Failed to generate signed URL | dataset_id={dataset_id} | error={str(e)}"
+            )
+            # Continue without signed_url rather than failing the request
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/evaluations/dataset.py` around lines 154 - 159, The
current signed URL retrieval (guarded by include_signed_url and
dataset.object_store_url) calls get_cloud_storage(...) and
storage.get_signed_url(...) but can raise ValueError or other exceptions; update
the block to catch ValueError and a broad Exception around get_cloud_storage and
storage.get_signed_url (referencing get_cloud_storage, storage.get_signed_url,
include_signed_url, dataset.object_store_url, auth_context.project_.id,
session=_session), log a warning or debug message with the error, and simply
leave signed_url as None so the endpoint returns the dataset without failing
when cloud storage is not configured or initialization fails.

Comment on lines +14 to +18
_BRANCH_PATTERN = re.compile(r"^[A-Z]|[\[\]()]")


def _is_branch_identifier(part: str) -> bool:
return bool(part and isinstance(part, str) and _BRANCH_PATTERN.search(part))

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Pydantic v2 discriminated union validation error loc format

💡 Result:

In Pydantic v2, ValidationError.errors() returns each error with:

  • loc: a tuple path of str | int segments (field names and container indexes). (docs.pydantic.dev)
  • When the error happens inside a discriminated (tagged) union, Pydantic inserts the selected union tag (i.e., the discriminator result) into the location path, between the union field and the nested field(s).

What the loc looks like for discriminated unions

If your model field is pet and the discriminator selects tag "dog", then a missing field error inside that variant will have a location conceptually like ('pet', 'dog', '<missing_field>'), displayed as pet.dog.<missing_field>.

For nested discriminated unions, tags can stack:

  • ('pet', 'cat', 'black', 'black_name') → displayed as pet.cat.black.black_name (docs.pydantic.dev)

This is the same behavior people notice in practice, e.g. ("child", "a", "field_a") where "a" is the discriminator/tag. (stackoverflow.com)

When the discriminator itself is wrong/missing

For discriminator failures, Pydantic uses error types like union_tag_not_found (no discriminator value present) and union_tag_invalid (unrecognized tag).

(Those errors are raised at the union field; the printed error path you see includes the tag when it’s known, e.g. pet.cat in the nested example.) (docs.pydantic.dev)


🏁 Script executed:

# First, check the full context of the exception handler file
cat -n backend/app/core/exception_handlers.py | head -50

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2025


🏁 Script executed:

# Check the request.py file to verify the discriminator values mentioned
cat -n backend/app/models/llm/request.py | grep -A 5 -B 5 "discriminator\|text\|audio\|openai"

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 9327


The union branch identifier pattern is insufficient for lowercase discriminators used in the codebase.

The regex ^[A-Z]|[\[\]()] only matches strings starting with uppercase letters or containing brackets. However, according to Pydantic v2 documentation, validation error loc tuples include discriminator tag values—and your codebase uses lowercase literal discriminators like "text", "audio", "openai", "openai-native" (from backend/app/models/llm/request.py lines 109, 114, 119, 125 and 197, 217).

Since the pattern won't match these lowercase discriminators when they appear in error locations, _is_branch_identifier() will fail to identify them as branch identifiers. This causes errors from discriminated unions with lowercase discriminators to be misclassified as non-union errors instead of being properly grouped and filtered by union branch (see lines 38–47).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/exception_handlers.py` around lines 14 - 18, The current
_BRANCH_PATTERN only matches uppercase starters and brackets, so
_is_branch_identifier fails for lowercase discriminators like "text" or
"openai-native"; update the regex used by _BRANCH_PATTERN to include lowercase
letters (for example change r"^[A-Z]|[\[\]()]" to r"^[A-Za-z]|[\[\]()]") so
_is_branch_identifier(part: str) correctly recognizes lowercase discriminators
and hyphenated tags used in discriminated unions (leave _is_branch_identifier
implementation as-is and add/verify tests for tags such as "text", "audio",
"openai", "openai-native").

@vprashrex vprashrex self-assigned this Mar 9, 2026
@vprashrex vprashrex added the enhancement (New feature or request) label Mar 9, 2026
@vprashrex vprashrex moved this to In Review in Kaapi-dev Mar 9, 2026
@vprashrex vprashrex linked an issue Mar 9, 2026 that may be closed by this pull request
@codecov

codecov bot commented Mar 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (5)
backend/app/tests/api/routes/test_threads.py (1)

604-605: Assert the structured message too.

This only proves that some error mentions question. It would still pass if the API stopped returning the per-field message that this PR introduces. Please assert the question entry also includes the expected "message": "Field required".

Suggested test tightening
     assert error_response["success"] is False
     assert error_response["errors"]
-    assert any("question" in e["field"] for e in error_response["errors"])
+    assert any(
+        e["field"] == "question" and e["message"] == "Field required"
+        for e in error_response["errors"]
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_threads.py` around lines 604 - 605, The
test currently only checks that some error mentions "question"; tighten it to
assert the structured per-field message by checking the error_response["errors"]
list contains an entry where e["field"] == "question" and e["message"] == "Field
required" (e.g. replace or augment the any("question" in e["field"] ...)
assertion with any(e.get("field") == "question" and e.get("message") == "Field
required" for e in error_response["errors"])); reference the existing
error_response variable and its "errors" list when updating the assertion.
backend/app/tests/core/test_exception_handlers.py (2)

205-230: Make this nested-path assertion exact.

This can pass with one dotted field and a separate "params" field. The point of the test is the full path contract, so it should assert the concrete dotted field value.

💡 Suggested assertion
         body = response.json()
         fields = [e["field"] for e in body["errors"]]
-        assert any("." in f for f in fields)
-        assert any("params" in f for f in fields)
+        assert "config_blob.completion.params" in fields
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/core/test_exception_handlers.py` around lines 205 - 230,
The test_nested_field_path_in_error test currently only checks for any dotted
field and any occurrence of "params" separately, which can pass incorrectly;
update the assertions to require the exact dotted field path returned in the
errors (e.g., "config_blob.completion.params") by computing fields = [e["field"]
for e in body["errors"]] and asserting that the exact string
"config_blob.completion.params" is in fields (and remove the loose any("." in f)
/ any("params" in f) checks) so the test enforces the full dotted path contract.

287-318: Assert the presigner is called from the dataset storage path.

Right now this only proves that a signed_url value is returned. A route that hardcodes or stubs the response without using dataset.object_store_url would still pass.

💡 Suggested assertion
         with patch(
             "app.api.routes.evaluations.dataset.get_cloud_storage"
         ) as mock_get_storage:
             mock_storage = mock_get_storage.return_value
             mock_storage.get_signed_url.return_value = mock_signed_url
@@
         assert response.status_code == 200
+        mock_storage.get_signed_url.assert_called_once()
+        signed_url_call = mock_storage.get_signed_url.call_args
+        assert dataset.object_store_url in signed_url_call.args or (
+            dataset.object_store_url in signed_url_call.kwargs.values()
+        )
         body = response.json()
         assert body["success"] is True
         assert body["data"]["signed_url"] == mock_signed_url
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/core/test_exception_handlers.py` around lines 287 - 318,
The test test_get_dataset_with_signed_url currently only asserts the returned
signed_url value; update it to also verify the presigner is invoked with the
dataset's storage path by asserting mock_storage.get_signed_url was called with
dataset.object_store_url (or the expected object key derived from
dataset.object_store_url) after the client.get call; locate the mocked
get_cloud_storage (patch of
app.api.routes.evaluations.dataset.get_cloud_storage) and add an assertion that
mock_storage.get_signed_url was called once with the dataset.object_store_url to
ensure the route uses the dataset's object store path.
backend/app/tests/api/routes/test_onboarding.py (1)

200-203: Verify these errors stay attached to credentials.

The onboarding validator raises on the credentials field, but these tests only check message text. They won't catch a regression where the API returns the right message with field="unknown".

💡 Suggested assertion
     assert error_response["errors"]
+    assert any(e["field"] == "credentials" for e in error_response["errors"])
     assert any(
         "credential validation failed" in e["message"]
         for e in error_response["errors"]
     )

Apply the same assertion pattern to the other credential-validation cases in this file.

Also applies to: 232-238, 267-271, 302-306, 338-343

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_onboarding.py` around lines 200 - 203, The
tests currently only assert the error message text in error_response["errors"]
but must also assert that each credential-validation error is attached to the
credentials field; update the assertions that check for "credential validation
failed" (the error_response variable in test_onboarding.py) to also assert that
at least one error has e["field"] == "credentials" (or that the matching error
includes field "credentials"), and apply the same pattern to the other
credential-validation cases in this file (the assertion blocks around the other
error_response checks referenced in the comment).
backend/app/tests/api/routes/test_evaluation.py (1)

350-354: Assert the duplication_factor field too.

These checks only validate the message text. If the handler regresses and emits the error under the wrong field, these tests still pass even though the new contract is per-field validation errors.

💡 Suggested assertion
         assert response_data["errors"]
+        assert any(
+            "duplication_factor" in e["field"] for e in response_data["errors"]
+        )
         assert any(
             "greater than or equal to 1" in e["message"]
             for e in response_data["errors"]
         )

Apply the same pattern to the upper-bound test.

Also applies to: 377-380

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_evaluation.py` around lines 350 - 354, The
test currently only checks error messages for the duplication_factor validation;
update the assertions in backend/app/tests/api/routes/test_evaluation.py to also
assert that the error object(s) include the correct field name
"duplication_factor" (e.g., check e["field"] == "duplication_factor") alongside
the existing message check so the failure is tied to the right input field;
apply the same change to the corresponding upper-bound test block that checks
the duplication_factor upper-limit error so both tests validate per-field error
reporting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 21c9dce8-8e03-4286-a6b2-2cdd38219bba

📥 Commits

Reviewing files that changed from the base of the PR and between a424d9a and 7547864.

📒 Files selected for processing (6)
  • backend/app/tests/api/routes/collections/test_create_collections.py
  • backend/app/tests/api/routes/test_doc_transformation_job.py
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/api/routes/test_onboarding.py
  • backend/app/tests/api/routes/test_threads.py
  • backend/app/tests/core/test_exception_handlers.py


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
backend/app/core/exception_handlers.py (1)

24-24: ⚠️ Potential issue | 🟡 Minor

Typo: "happend" → "happened".

📝 Proposed fix
-    This picks the branch where the validation error happend.
+    This picks the branch where the validation error happened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/exception_handlers.py` at line 24, Fix the typo in the
inline comment string "This picks the branch where the validation error
happend." by changing "happend" to "happened" so the comment reads "This picks
the branch where the validation error happened."; locate and update this comment
in backend/app/core/exception_handlers.py (the exact comment text is the unique
identifier).
🧹 Nitpick comments (2)
backend/app/core/exception_handlers.py (2)

65-67: In-place mutation of error dicts may cause unintended side effects.

Modifying err["loc"] directly mutates the dictionaries returned by exc.errors(). If another part of the code holds a reference to these dicts, their loc values will be unexpectedly changed. Consider creating shallow copies before modification.

♻️ Proposed fix
+        filtered = [dict(err) for err in filtered]  # shallow copy to avoid mutating originals
         for err in filtered:
             loc = err.get("loc", ())
             err["loc"] = tuple(p for p in loc if not _is_branch_identifier(p))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/exception_handlers.py` around lines 65 - 67, The loop
currently mutates the dicts returned by exc.errors() by assigning to err["loc"];
instead, create a shallow copy of each error dict (e.g., copy = err.copy())
before modifying the "loc" field so you don't change the original objects;
compute the filtered tuple using _is_branch_identifier and assign it to the
copy's "loc", then use/collect the copy instead of mutating err (apply this
change where the for err in filtered: loop handles items from exc.errors()).

78-79: Broad exception handler silently swallows errors.

Catching all exceptions and returning the original errors list hides bugs in the filtering logic. Consider logging the exception for debugging purposes.

🔍 Proposed fix with logging
+import logging
+
+logger = logging.getLogger(__name__)
+
 # ... in _filter_union_branch_errors:
         return unique_errors or errors
     except Exception:
+        logger.exception("[_filter_union_branch_errors] Error filtering validation errors")
         return errors
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/exception_handlers.py` around lines 78 - 79, The broad
except block currently swallows errors (the "except Exception: return errors"
branch); change it to capture the exception (e.g., "except Exception as e:"),
log the full exception (use the module logger or logging.exception) with
contextual information about the filtering operation and the current "errors"
value, then return errors; ensure the logger is imported/available in
core.exception_handlers.py (or create one) so the exception stacktrace is
recorded for debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f20618a2-8922-4634-8d7a-73c689bb2f9e

📥 Commits

Reviewing files that changed from the base of the PR and between 7547864 and a912a3e.

📒 Files selected for processing (1)
  • backend/app/core/exception_handlers.py


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
backend/app/tests/core/test_exception_handlers.py (1)

126-140: Consider asserting errors list is non-empty for robustness.

The for loop at line 138 would pass vacuously if errors is empty. While the 422 status code and the incomplete request payload make this unlikely, adding an explicit assertion would make the test intent clearer and guard against future regressions in error generation.

🔧 Suggested improvement
         assert response.status_code == 422
+        errors = response.json()["errors"]
+        assert errors, "Expected validation errors but got none"
-        for error in response.json()["errors"]:
+        for error in errors:
             assert "openai-native" not in error["message"]
             assert "NativeCompletionConfig" not in error["field"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/core/test_exception_handlers.py` around lines 126 - 140,
The test test_union_noise_filtered should explicitly assert that the returned
errors list is not empty before iterating; update the assertion logic in
test_union_noise_filtered to retrieve errors = response.json().get("errors", [])
and assert errors (or assert len(errors) > 0) so the subsequent for-loop does
not pass vacuously, then continue asserting each error does not contain
"openai-native" in message or "NativeCompletionConfig" in field.
backend/app/tests/api/routes/test_evaluation.py (2)

1401-1420: Consider verifying mock was called with the correct URL.

The test correctly verifies the signed URL is returned, but doesn't confirm get_signed_url was called with the expected object_store_url. This would make the test more robust against regressions where the wrong URL might be passed.

🔧 Suggested improvement
         with patch("app.api.routes.evaluations.dataset.get_cloud_storage") as mock:
             mock.return_value.get_signed_url.return_value = "https://signed.url"
             response = client.get(
                 f"/api/v1/evaluations/datasets/{dataset.id}",
                 headers={"X-API-KEY": user_api_key.key},
                 params={"include_signed_url": True},
             )
+            mock.return_value.get_signed_url.assert_called_once_with(
+                dataset.object_store_url
+            )
         assert response.json()["data"]["signed_url"] == "https://signed.url"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_evaluation.py` around lines 1401 - 1420,
The test test_include_signed_url doesn't assert that get_signed_url was called
with the expected object URL; after calling the API in the test, add an
assertion that the mocked get_cloud_storage().get_signed_url was invoked with
the dataset's object_store_url (e.g.,
mock.return_value.get_signed_url.assert_called_once_with(dataset.object_store_url))
so the test verifies the correct URL is passed through; locate this in the
test_include_signed_url function and add the assertion immediately after the
client.get and before the final response assertion.
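The call-argument assertion can be illustrated with a plain `MagicMock`, outside FastAPI (`fetch_signed_url` is an illustrative stand-in for the route's signing step):

```python
from unittest.mock import MagicMock

storage = MagicMock()
storage.get_signed_url.return_value = "https://signed.url"

def fetch_signed_url(storage, object_store_url: str) -> str:
    # Simplified stand-in for the route's signing step.
    return storage.get_signed_url(object_store_url)

url = fetch_signed_url(storage, "s3://bucket/datasets/abc.csv")

# Verify not just the return value but also that the right URL was signed:
# a regression that passes the wrong URL would fail here.
storage.get_signed_url.assert_called_once_with("s3://bucket/datasets/abc.csv")
assert url == "https://signed.url"
```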

1398-1399: Consider adding status code assertion for consistency.

The other signed URL tests don't explicitly assert the status code before checking the response body. While the assertion at line 1398 implicitly verifies a successful response (since it accesses response.json()["data"]), adding an explicit assert response.status_code == 200 would improve readability and make failures easier to diagnose.

🔧 Suggested improvement
         response = client.get(
             f"/api/v1/evaluations/datasets/{dataset.id}",
             headers={"X-API-KEY": user_api_key.key},
         )
+        assert response.status_code == 200
         assert response.json()["data"].get("signed_url") is None

Note: Same suggestion applies to test_include_signed_url (line 1420) and test_no_object_store_url_skips_signing (line 1445).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_evaluation.py` around lines 1398 - 1399,
Add an explicit HTTP status assertion before inspecting the response body in the
tests that check signed URL behavior: insert `assert response.status_code ==
200` at the start of the assertions in the test function containing the shown
snippet (the test that asserts signed_url is None) and do the same in
`test_include_signed_url` and `test_no_object_store_url_skips_signing` so each
test first verifies the 200 response code before accessing
`response.json()["data"]`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8872bc60-4f23-4073-aab2-8ec37c5c8dd4

📥 Commits

Reviewing files that changed from the base of the PR and between a912a3e and eea7a48.

📒 Files selected for processing (2)
  • backend/app/tests/api/routes/test_evaluation.py
  • backend/app/tests/core/test_exception_handlers.py

@vprashrex vprashrex linked an issue Mar 9, 2026 that may be closed by this pull request
session: SessionDep,
auth_context: AuthContextDep,
include_signed_url: bool = Query(
False, description="Include signed URLs for dataset"
Collaborator


It should be only "URL" and not "URLs", since a single signed URL is returned.

return bool(part and isinstance(part, str) and _BRANCH_PATTERN.search(part))


def _filter_union_branch_errors(errors: list[dict]) -> list[dict]:
Collaborator


what exactly do you mean by "branch" here

Collaborator Author

@vprashrex vprashrex Mar 11, 2026


Here, “branch” refers to each possible model inside a Union / discriminated union.
For example, CompletionConfig can be either NativeCompletionConfig or KaapiCompletionConfig.

CompletionConfig = Annotated[
    Union[NativeCompletionConfig, KaapiCompletionConfig],
    Field(discriminator="provider"),
]

When a validation error occurs, Pydantic often returns errors from all union branches, even though the issue actually belongs to only one branch. This can result in confusing error messages.

The _filter_union_branch_errors function filters these results and keeps only the errors from the branch where the validation actually failed.

Note: If errors genuinely occur in multiple branches, errors from multiple branches will still be returned.
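The branch markers can be seen directly in a small discriminated-union example (run under Pydantic v2; the `deployment` field is illustrative, not from the PR):

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, ValidationError

class NativeCompletionConfig(BaseModel):
    provider: Literal["openai-native"]
    model: str

class KaapiCompletionConfig(BaseModel):
    provider: Literal["kaapi"]
    deployment: str  # illustrative field

CompletionConfig = Annotated[
    Union[NativeCompletionConfig, KaapiCompletionConfig],
    Field(discriminator="provider"),
]

class Request(BaseModel):
    config: CompletionConfig

try:
    Request(config={"provider": "kaapi"})  # valid tag, but "deployment" is missing
except ValidationError as exc:
    errors = exc.errors()

# The discriminator tag appears inside "loc", e.g. ("config", "kaapi", "deployment").
# Markers like that are the branch identifiers the handler strips before
# building per-field messages.
```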

Collaborator Author


added the docstring as well for that function filter union branch

@vprashrex vprashrex requested a review from nishika26 March 11, 2026 06:14
def _filter_union_branch_errors(errors: list[dict]) -> list[dict]:
"""When a field is a Union type, pydantic returns errors for every possible branch.

This picks the branch where the validation error happend.
Collaborator


write "this function picks", so it's super clear that you are talking about the function here

@vprashrex vprashrex merged commit 48b0a6b into main Mar 11, 2026
3 checks passed
@vprashrex vprashrex deleted the feature/validation-errors-dataset-signed-url branch March 11, 2026 09:40
@github-project-automation github-project-automation bot moved this from In Review to Closed in Kaapi-dev Mar 11, 2026

Labels

enhancement New feature or request

Projects

Status: Closed

Development

Successfully merging this pull request may close these issues.

Evaluation: Add signed URL support
Validation: Structured error responses

3 participants