Enhancement: Dataset signed URLs and improved validation errors #652
Conversation
…on error handling
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough
Adds optional signed URL support for dataset downloads via an `include_signed_url` query parameter, and improves validation error handling.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Endpoint as "Dataset Endpoint"
    participant Storage as "CloudStorage"
    participant Response as "APIResponse"
    Client->>Endpoint: GET /datasets/{id}?include_signed_url=true
    Endpoint->>Endpoint: load dataset record
    alt dataset.object_store_url exists
        Endpoint->>Storage: request presigned URL for object_store_url
        Storage-->>Endpoint: return presigned URL (1h expiry)
        Endpoint->>Response: build DatasetUploadResponse { signed_url }
    else no object_store_url
        Endpoint->>Response: build DatasetUploadResponse { signed_url: null }
    end
    Response-->>Client: JSON payload (success, data, errors?)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/app/utils.py (1)
68-93: Consider a more specific type hint for the `error` parameter.

The current type hint `str | list` is imprecise. Since the list variant expects dictionaries with specific keys (`loc`, `msg`), a more explicit type would improve code clarity and type checking.

♻️ Suggested type improvement

```diff
 @classmethod
 def failure_response(
     cls,
-    error: str | list,
+    error: str | list[dict[str, Any]],
     data: Optional[T] = None,
     metadata: Optional[Dict[str, Any]] = None,
 ) -> "APIResponse[None]":
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/utils.py` around lines 68 - 93, The parameter type for error in the APIResponse classmethod is too broad (str | list); change it to a more specific annotation—either Union[str, List[Dict[str, Any]]] or, better, define a TypedDict (e.g., ValidationErrorDict = TypedDict("ValidationErrorDict", {"loc": Tuple[Any, ...], "msg": str}, total=False)) and use Union[str, List[ValidationErrorDict]] for the error parameter in the method signature so static checkers know the list contains dicts with "loc" and "msg"; also add the required imports from typing (Union, List, Dict, TypedDict, Tuple) and keep the existing isinstance(error, list) handling but treat elements as ValidationErrorDict for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/api/routes/evaluations/dataset.py`:
- Around line 154-159: The current signed URL retrieval (guarded by
include_signed_url and dataset.object_store_url) calls get_cloud_storage(...)
and storage.get_signed_url(...) but can raise ValueError or other exceptions;
update the block to catch ValueError and a broad Exception around
get_cloud_storage and storage.get_signed_url (referencing get_cloud_storage,
storage.get_signed_url, include_signed_url, dataset.object_store_url,
auth_context.project_.id, session=_session), log a warning or debug message with
the error, and simply leave signed_url as None so the endpoint returns the
dataset without failing when cloud storage is not configured or initialization
fails.
In `@backend/app/core/exception_handlers.py`:
- Around line 14-18: The current _BRANCH_PATTERN only matches uppercase starters
and brackets, so _is_branch_identifier fails for lowercase discriminators like
"text" or "openai-native"; update the regex used by _BRANCH_PATTERN to include
lowercase letters (for example change r"^[A-Z]|[\[\]()]" to
r"^[A-Za-z]|[\[\]()]") so _is_branch_identifier(part: str) correctly recognizes
lowercase discriminators and hyphenated tags used in discriminated unions (leave
_is_branch_identifier implementation as-is and add/verify tests for tags such as
"text", "audio", "openai", "openai-native").
---
Nitpick comments:
In `@backend/app/utils.py`:
- Around line 68-93: The parameter type for error in the APIResponse classmethod
is too broad (str | list); change it to a more specific annotation—either
Union[str, List[Dict[str, Any]]] or, better, define a TypedDict (e.g.,
ValidationErrorDict = TypedDict("ValidationErrorDict", {"loc": Tuple[Any, ...],
"msg": str}, total=False)) and use Union[str, List[ValidationErrorDict]] for the
error parameter in the method signature so static checkers know the list
contains dicts with "loc" and "msg"; also add the required imports from typing
(Union, List, Dict, TypedDict, Tuple) and keep the existing isinstance(error,
list) handling but treat elements as ValidationErrorDict for clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 40e87da3-ff24-4603-ad0c-3d2dcf7d506a
📒 Files selected for processing (5)
- backend/app/api/docs/evaluation/get_dataset.md
- backend/app/api/routes/evaluations/dataset.py
- backend/app/core/exception_handlers.py
- backend/app/models/evaluation.py
- backend/app/utils.py
```python
signed_url = None
if include_signed_url and dataset.object_store_url:
    storage = get_cloud_storage(
        session=_session, project_id=auth_context.project_.id
    )
    signed_url = storage.get_signed_url(dataset.object_store_url)
```
Missing error handling for cloud storage failures.
According to backend/app/core/cloud/storage.py:267-284, get_cloud_storage() can raise ValueError for invalid projects or propagate exceptions when AWS initialization fails. If cloud storage is not configured (e.g., in development environments without AWS credentials), this will cause a 500 error even when the dataset exists.
Consider gracefully handling these failures by returning the dataset without the signed URL instead of failing the entire request.
🛡️ Proposed fix with error handling
```diff
 signed_url = None
 if include_signed_url and dataset.object_store_url:
-    storage = get_cloud_storage(
-        session=_session, project_id=auth_context.project_.id
-    )
-    signed_url = storage.get_signed_url(dataset.object_store_url)
+    try:
+        storage = get_cloud_storage(
+            session=_session, project_id=auth_context.project_.id
+        )
+        signed_url = storage.get_signed_url(dataset.object_store_url)
+    except Exception as e:
+        logger.warning(
+            f"[get_dataset] Failed to generate signed URL | dataset_id={dataset_id} | error={str(e)}"
+        )
+        # Continue without signed_url rather than failing the request
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
signed_url = None
if include_signed_url and dataset.object_store_url:
    try:
        storage = get_cloud_storage(
            session=_session, project_id=auth_context.project_.id
        )
        signed_url = storage.get_signed_url(dataset.object_store_url)
    except Exception as e:
        logger.warning(
            f"[get_dataset] Failed to generate signed URL | dataset_id={dataset_id} | error={str(e)}"
        )
        # Continue without signed_url rather than failing the request
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/api/routes/evaluations/dataset.py` around lines 154 - 159, The
current signed URL retrieval (guarded by include_signed_url and
dataset.object_store_url) calls get_cloud_storage(...) and
storage.get_signed_url(...) but can raise ValueError or other exceptions; update
the block to catch ValueError and a broad Exception around get_cloud_storage and
storage.get_signed_url (referencing get_cloud_storage, storage.get_signed_url,
include_signed_url, dataset.object_store_url, auth_context.project_.id,
session=_session), log a warning or debug message with the error, and simply
leave signed_url as None so the endpoint returns the dataset without failing
when cloud storage is not configured or initialization fails.
```python
_BRANCH_PATTERN = re.compile(r"^[A-Z]|[\[\]()]")


def _is_branch_identifier(part: str) -> bool:
    return bool(part and isinstance(part, str) and _BRANCH_PATTERN.search(part))
```
🧩 Analysis chain
🌐 Web query:
Pydantic v2 discriminated union validation error loc format
💡 Result:
In Pydantic v2, ValidationError.errors() returns each error with:
- `loc`: a tuple path of `str | int` segments (field names and container indexes). (docs.pydantic.dev)
- When the error happens inside a discriminated (tagged) union, Pydantic inserts the selected union tag (i.e., the discriminator result) into the location path, between the union field and the nested field(s).
What the loc looks like for discriminated unions
If your model field is pet and the discriminator selects tag "dog", then a missing field error inside that variant will have a location conceptually like:
`('pet', 'dog', 'barks')` → displayed as `pet.dog.barks` (docs.pydantic.dev)
For nested discriminated unions, tags can stack:
`('pet', 'cat', 'black', 'black_name')` → displayed as `pet.cat.black.black_name` (docs.pydantic.dev)
This is the same behavior people notice in practice, e.g. ("child", "a", "field_a") where "a" is the discriminator/tag. (stackoverflow.com)
When the discriminator itself is wrong/missing
For discriminator failures, Pydantic uses error types like:
- `union_tag_invalid` (tag value not in expected set) (docs.pydantic.dev)
- `union_tag_not_found` (cannot extract a discriminator value) (docs.pydantic.dev)
(Those errors are raised at the union field; the printed error path you see includes the tag when it’s known, e.g. pet.cat in the nested example.) (docs.pydantic.dev)
Citations:
- 1: https://docs.pydantic.dev/latest/errors/errors/
- 2: https://docs.pydantic.dev/latest/concepts/unions/
- 3: https://docs.pydantic.dev/latest/concepts/unions/
- 4: https://stackoverflow.com/questions/79129176/pydantic-remove-discriminated-union-discriminator-value-from-validationerror-lo
- 5: https://docs.pydantic.dev/latest/errors/validation_errors/
- 6: https://docs.pydantic.dev/latest/errors/validation_errors/
- 7: https://docs.pydantic.dev/latest/concepts/unions/
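The dotted display of these `loc` tuples is easy to illustrate without Pydantic; a minimal stdlib sketch (the tuple values are the documentation's own examples above):

```python
def dotted(loc: tuple) -> str:
    """Render a Pydantic-style loc tuple as the dotted path shown in error output."""
    return ".".join(str(part) for part in loc)

# The discriminator tag ("dog") sits between the union field and the nested field:
print(dotted(("pet", "dog", "barks")))                # pet.dog.barks
# Nested discriminated unions stack their tags:
print(dotted(("pet", "cat", "black", "black_name")))  # pet.cat.black.black_name
```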
🏁 Script executed:

```shell
# First, check the full context of the exception handler file
cat -n backend/app/core/exception_handlers.py | head -50
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 2025
🏁 Script executed:

```shell
# Check the request.py file to verify the discriminator values mentioned
cat -n backend/app/models/llm/request.py | grep -A 5 -B 5 "discriminator\|text\|audio\|openai"
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 9327
The union branch identifier pattern is insufficient for lowercase discriminators used in the codebase.
The regex `^[A-Z]|[\[\]()]` only matches strings starting with uppercase letters or containing brackets. However, according to Pydantic v2 documentation, validation error `loc` tuples include discriminator tag values—and your codebase uses lowercase literal discriminators like "text", "audio", "openai", "openai-native" (from backend/app/models/llm/request.py lines 109, 114, 119, 125 and 197, 217).
Since the pattern won't match these lowercase discriminators when they appear in error locations, _is_branch_identifier() will fail to identify them as branch identifiers. This causes errors from discriminated unions with lowercase discriminators to be misclassified as non-union errors instead of being properly grouped and filtered by union branch (see lines 38–47).
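The mismatch can be checked with the stock `re` module; a quick sketch against the current pattern (the lowercase tags mirror the discriminators cited above, while the bracketed class-name example is illustrative):

```python
import re

# The pattern currently used by _is_branch_identifier in exception_handlers.py.
branch_pattern = re.compile(r"^[A-Z]|[\[\]()]")

# Class-style and bracketed branch identifiers are recognized...
assert branch_pattern.search("NativeCompletionConfig") is not None
assert branch_pattern.search("list[TextCompletionConfig]") is not None
# ...but the lowercase literal discriminators are not:
assert branch_pattern.search("text") is None
assert branch_pattern.search("audio") is None
assert branch_pattern.search("openai-native") is None
```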
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/core/exception_handlers.py` around lines 14 - 18, The current
_BRANCH_PATTERN only matches uppercase starters and brackets, so
_is_branch_identifier fails for lowercase discriminators like "text" or
"openai-native"; update the regex used by _BRANCH_PATTERN to include lowercase
letters (for example change r"^[A-Z]|[\[\]()]" to r"^[A-Za-z]|[\[\]()]") so
_is_branch_identifier(part: str) correctly recognizes lowercase discriminators
and hyphenated tags used in discriminated unions (leave _is_branch_identifier
implementation as-is and add/verify tests for tags such as "text", "audio",
"openai", "openai-native").
Codecov Report: ✅ All modified and coverable lines are covered by tests.
📢 Thoughts on this report? Let us know!
🧹 Nitpick comments (5)
backend/app/tests/api/routes/test_threads.py (1)
604-605: Assert the structured message too.

This only proves that some error mentions `question`. It would still pass if the API stopped returning the per-field `message` that this PR introduces. Please assert the `question` entry also includes the expected `"message": "Field required"`.

Suggested test tightening

```diff
 assert error_response["success"] is False
 assert error_response["errors"]
-assert any("question" in e["field"] for e in error_response["errors"])
+assert any(
+    e["field"] == "question" and e["message"] == "Field required"
+    for e in error_response["errors"]
+)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/api/routes/test_threads.py` around lines 604 - 605, The test currently only checks that some error mentions "question"; tighten it to assert the structured per-field message by checking the error_response["errors"] list contains an entry where e["field"] == "question" and e["message"] == "Field required" (e.g. replace or augment the any("question" in e["field"] ...) assertion with any(e.get("field") == "question" and e.get("message") == "Field required" for e in error_response["errors"])); reference the existing error_response variable and its "errors" list when updating the assertion.

backend/app/tests/core/test_exception_handlers.py (2)
205-230: Make this nested-path assertion exact.

This can pass with one dotted field and a separate `"params"` field. The point of the test is the full path contract, so it should assert the concrete dotted field value.

💡 Suggested assertion

```diff
 body = response.json()
 fields = [e["field"] for e in body["errors"]]
-assert any("." in f for f in fields)
-assert any("params" in f for f in fields)
+assert "config_blob.completion.params" in fields
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/core/test_exception_handlers.py` around lines 205 - 230, The test_nested_field_path_in_error test currently only checks for any dotted field and any occurrence of "params" separately, which can pass incorrectly; update the assertions to require the exact dotted field path returned in the errors (e.g., "config_blob.completion.params") by computing fields = [e["field"] for e in body["errors"]] and asserting that the exact string "config_blob.completion.params" is in fields (and remove the loose any("." in f) / any("params" in f) checks) so the test enforces the full dotted path contract.
287-318: Assert the presigner is called from the dataset storage path.

Right now this only proves that a `signed_url` value is returned. A route that hardcodes or stubs the response without using `dataset.object_store_url` would still pass.

💡 Suggested assertion

```diff
 with patch(
     "app.api.routes.evaluations.dataset.get_cloud_storage"
 ) as mock_get_storage:
     mock_storage = mock_get_storage.return_value
     mock_storage.get_signed_url.return_value = mock_signed_url
@@
 assert response.status_code == 200
+mock_storage.get_signed_url.assert_called_once()
+signed_url_call = mock_storage.get_signed_url.call_args
+assert dataset.object_store_url in signed_url_call.args or (
+    dataset.object_store_url in signed_url_call.kwargs.values()
+)
 body = response.json()
 assert body["success"] is True
 assert body["data"]["signed_url"] == mock_signed_url
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/core/test_exception_handlers.py` around lines 287 - 318, The test test_get_dataset_with_signed_url currently only asserts the returned signed_url value; update it to also verify the presigner is invoked with the dataset's storage path by asserting mock_storage.get_signed_url was called with dataset.object_store_url (or the expected object key derived from dataset.object_store_url) after the client.get call; locate the mocked get_cloud_storage (patch of app.api.routes.evaluations.dataset.get_cloud_storage) and add an assertion that mock_storage.get_signed_url was called once with the dataset.object_store_url to ensure the route uses the dataset's object store path.

backend/app/tests/api/routes/test_onboarding.py (1)
200-203: Verify these errors stay attached to `credentials`.

The onboarding validator raises on the `credentials` field, but these tests only check message text. They won't catch a regression where the API returns the right message with `field="unknown"`.

💡 Suggested assertion

```diff
 assert error_response["errors"]
+assert any(e["field"] == "credentials" for e in error_response["errors"])
 assert any(
     "credential validation failed" in e["message"]
     for e in error_response["errors"]
 )
```

Apply the same assertion pattern to the other credential-validation cases in this file.
Also applies to: 232-238, 267-271, 302-306, 338-343
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/api/routes/test_onboarding.py` around lines 200 - 203, The tests currently only assert the error message text in error_response["errors"] but must also assert that each credential-validation error is attached to the credentials field; update the assertions that check for "credential validation failed" (the error_response variable in test_onboarding.py) to also assert that at least one error has e["field"] == "credentials" (or that the matching error includes field "credentials"), and apply the same pattern to the other credential-validation cases in this file (the assertion blocks around the other error_response checks referenced in the comment).

backend/app/tests/api/routes/test_evaluation.py (1)
350-354: Assert the `duplication_factor` field too.

These checks only validate the message text. If the handler regresses and emits the error under the wrong `field`, these tests still pass even though the new contract is per-field validation errors.

💡 Suggested assertion

```diff
 assert response_data["errors"]
+assert any(
+    "duplication_factor" in e["field"] for e in response_data["errors"]
+)
 assert any(
     "greater than or equal to 1" in e["message"]
     for e in response_data["errors"]
 )
```

Apply the same pattern to the upper-bound test.
Also applies to: 377-380
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/api/routes/test_evaluation.py` around lines 350 - 354, The test currently only checks error messages for the duplication_factor validation; update the assertions in backend/app/tests/api/routes/test_evaluation.py to also assert that the error object(s) include the correct field name "duplication_factor" (e.g., check e["field"] == "duplication_factor") alongside the existing message check so the failure is tied to the right input field; apply the same change to the corresponding upper-bound test block that checks the duplication_factor upper-limit error so both tests validate per-field error reporting.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 350-354: The test currently only checks error messages for the
duplication_factor validation; update the assertions in
backend/app/tests/api/routes/test_evaluation.py to also assert that the error
object(s) include the correct field name "duplication_factor" (e.g., check
e["field"] == "duplication_factor") alongside the existing message check so the
failure is tied to the right input field; apply the same change to the
corresponding upper-bound test block that checks the duplication_factor
upper-limit error so both tests validate per-field error reporting.
In `@backend/app/tests/api/routes/test_onboarding.py`:
- Around line 200-203: The tests currently only assert the error message text in
error_response["errors"] but must also assert that each credential-validation
error is attached to the credentials field; update the assertions that check for
"credential validation failed" (the error_response variable in
test_onboarding.py) to also assert that at least one error has e["field"] ==
"credentials" (or that the matching error includes field "credentials"), and
apply the same pattern to the other credential-validation cases in this file
(the assertion blocks around the other error_response checks referenced in the
comment).
In `@backend/app/tests/api/routes/test_threads.py`:
- Around line 604-605: The test currently only checks that some error mentions
"question"; tighten it to assert the structured per-field message by checking
the error_response["errors"] list contains an entry where e["field"] ==
"question" and e["message"] == "Field required" (e.g. replace or augment the
any("question" in e["field"] ...) assertion with any(e.get("field") ==
"question" and e.get("message") == "Field required" for e in
error_response["errors"])); reference the existing error_response variable and
its "errors" list when updating the assertion.
In `@backend/app/tests/core/test_exception_handlers.py`:
- Around line 205-230: The test_nested_field_path_in_error test currently only
checks for any dotted field and any occurrence of "params" separately, which can
pass incorrectly; update the assertions to require the exact dotted field path
returned in the errors (e.g., "config_blob.completion.params") by computing
fields = [e["field"] for e in body["errors"]] and asserting that the exact
string "config_blob.completion.params" is in fields (and remove the loose
any("." in f) / any("params" in f) checks) so the test enforces the full dotted
path contract.
- Around line 287-318: The test test_get_dataset_with_signed_url currently only
asserts the returned signed_url value; update it to also verify the presigner is
invoked with the dataset's storage path by asserting mock_storage.get_signed_url
was called with dataset.object_store_url (or the expected object key derived
from dataset.object_store_url) after the client.get call; locate the mocked
get_cloud_storage (patch of
app.api.routes.evaluations.dataset.get_cloud_storage) and add an assertion that
mock_storage.get_signed_url was called once with the dataset.object_store_url to
ensure the route uses the dataset's object store path.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 21c9dce8-8e03-4286-a6b2-2cdd38219bba
📒 Files selected for processing (6)
- backend/app/tests/api/routes/collections/test_create_collections.py
- backend/app/tests/api/routes/test_doc_transformation_job.py
- backend/app/tests/api/routes/test_evaluation.py
- backend/app/tests/api/routes/test_onboarding.py
- backend/app/tests/api/routes/test_threads.py
- backend/app/tests/core/test_exception_handlers.py
…iqueness and handling of literal errors
♻️ Duplicate comments (1)
backend/app/core/exception_handlers.py (1)
24-24: ⚠️ Potential issue | 🟡 Minor
Typo: "happend" → "happened".
📝 Proposed fix
```diff
- This picks the branch where the validation error happend.
+ This picks the branch where the validation error happened.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/core/exception_handlers.py` at line 24, Fix the typo in the inline comment string "This picks the branch where the validation error happend." by changing "happend" to "happened" so the comment reads "This picks the branch where the validation error happened."; locate and update this comment in backend/app/core/exception_handlers.py (the exact comment text is the unique identifier).
🧹 Nitpick comments (2)
backend/app/core/exception_handlers.py (2)
65-67: In-place mutation of error dicts may cause unintended side effects.

Modifying `err["loc"]` directly mutates the dictionaries returned by `exc.errors()`. If another part of the code holds a reference to these dicts, their `loc` values will be unexpectedly changed. Consider creating shallow copies before modification.

♻️ Proposed fix

```diff
+filtered = [dict(err) for err in filtered]  # shallow copy to avoid mutating originals
 for err in filtered:
     loc = err.get("loc", ())
     err["loc"] = tuple(p for p in loc if not _is_branch_identifier(p))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/core/exception_handlers.py` around lines 65 - 67, The loop currently mutates the dicts returned by exc.errors() by assigning to err["loc"]; instead, create a shallow copy of each error dict (e.g., copy = err.copy()) before modifying the "loc" field so you don't change the original objects; compute the filtered tuple using _is_branch_identifier and assign it to the copy's "loc", then use/collect the copy instead of mutating err (apply this change where the for err in filtered: loop handles items from exc.errors()).
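The aliasing hazard behind this suggestion is easy to demonstrate; a minimal sketch (the error shapes and the `TextConfig` segment are illustrative, not taken from the codebase) of why the shallow copy matters:

```python
def strip_first_segment(errors: list[dict]) -> list[dict]:
    """Rewrite each error's loc on a copied dict, leaving the caller's dicts intact."""
    copies = [dict(err) for err in errors]  # shallow copy, as suggested above
    for err in copies:
        err["loc"] = err.get("loc", ())[1:]
    return copies

raw = [{"loc": ("body", "TextConfig", "prompt"), "msg": "Field required"}]
cleaned = strip_first_segment(raw)

# The caller's original dict is untouched; only the copy was rewritten.
print(raw[0]["loc"])      # ('body', 'TextConfig', 'prompt')
print(cleaned[0]["loc"])  # ('TextConfig', 'prompt')
```

Without the `dict(err)` copy, the rewrite would show up in `raw` as well, which is exactly the side effect the review flags for `exc.errors()`.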
78-79: Broad exception handler silently swallows errors.

Catching all exceptions and returning the original errors list hides bugs in the filtering logic. Consider logging the exception for debugging purposes.
🔍 Proposed fix with logging

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 # ... in _filter_union_branch_errors:
         return unique_errors or errors
     except Exception:
+        logger.exception("[_filter_union_branch_errors] Error filtering validation errors")
         return errors
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/core/exception_handlers.py` around lines 78 - 79, The broad except block currently swallows errors (the "except Exception: return errors" branch); change it to capture the exception (e.g., "except Exception as e:"), log the full exception (use the module logger or logging.exception) with contextual information about the filtering operation and the current "errors" value, then return errors; ensure the logger is imported/available in core.exception_handlers.py (or create one) so the exception stacktrace is recorded for debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@backend/app/core/exception_handlers.py`:
- Line 24: Fix the typo in the inline comment string "This picks the branch
where the validation error happend." by changing "happend" to "happened" so the
comment reads "This picks the branch where the validation error happened.";
locate and update this comment in backend/app/core/exception_handlers.py (the
exact comment text is the unique identifier).
---
Nitpick comments:
In `@backend/app/core/exception_handlers.py`:
- Around line 65-67: The loop currently mutates the dicts returned by
exc.errors() by assigning to err["loc"]; instead, create a shallow copy of each
error dict (e.g., copy = err.copy()) before modifying the "loc" field so you
don't change the original objects; compute the filtered tuple using
_is_branch_identifier and assign it to the copy's "loc", then use/collect the
copy instead of mutating err (apply this change where the for err in filtered:
loop handles items from exc.errors()).
- Around line 78-79: The broad except block currently swallows errors (the
"except Exception: return errors" branch); change it to capture the exception
(e.g., "except Exception as e:"), log the full exception (use the module logger
or logging.exception) with contextual information about the filtering operation
and the current "errors" value, then return errors; ensure the logger is
imported/available in core.exception_handlers.py (or create one) so the
exception stacktrace is recorded for debugging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f20618a2-8922-4634-8d7a-73c689bb2f9e
📒 Files selected for processing (1)
backend/app/core/exception_handlers.py
🧹 Nitpick comments (3)
backend/app/tests/core/test_exception_handlers.py (1)
126-140: Consider asserting errors list is non-empty for robustness.

The `for` loop at line 138 would pass vacuously if `errors` is empty. While the 422 status code and the incomplete request payload make this unlikely, adding an explicit assertion would make the test intent clearer and guard against future regressions in error generation.

🔧 Suggested improvement

```diff
 assert response.status_code == 422
+errors = response.json()["errors"]
+assert errors, "Expected validation errors but got none"
-for error in response.json()["errors"]:
+for error in errors:
     assert "openai-native" not in error["message"]
     assert "NativeCompletionConfig" not in error["field"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/core/test_exception_handlers.py` around lines 126 - 140, The test test_union_noise_filtered should explicitly assert that the returned errors list is not empty before iterating; update the assertion logic in test_union_noise_filtered to retrieve errors = response.json().get("errors", []) and assert errors (or assert len(errors) > 0) so the subsequent for-loop does not pass vacuously, then continue asserting each error does not contain "openai-native" in message or "NativeCompletionConfig" in field.

backend/app/tests/api/routes/test_evaluation.py (2)
1401-1420: Consider verifying mock was called with the correct URL.

The test correctly verifies the signed URL is returned, but doesn't confirm `get_signed_url` was called with the expected `object_store_url`. This would make the test more robust against regressions where the wrong URL might be passed.

🔧 Suggested improvement

```diff
 with patch("app.api.routes.evaluations.dataset.get_cloud_storage") as mock:
     mock.return_value.get_signed_url.return_value = "https://signed.url"
     response = client.get(
         f"/api/v1/evaluations/datasets/{dataset.id}",
         headers={"X-API-KEY": user_api_key.key},
         params={"include_signed_url": True},
     )
+    mock.return_value.get_signed_url.assert_called_once_with(
+        dataset.object_store_url
+    )
     assert response.json()["data"]["signed_url"] == "https://signed.url"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/api/routes/test_evaluation.py` around lines 1401 - 1420, The test test_include_signed_url doesn't assert that get_signed_url was called with the expected object URL; after calling the API in the test, add an assertion that the mocked get_cloud_storage().get_signed_url was invoked with the dataset's object_store_url (e.g., mock.return_value.get_signed_url.assert_called_once_with(dataset.object_store_url)) so the test verifies the correct URL is passed through; locate this in the test_include_signed_url function and add the assertion immediately after the client.get and before the final response assertion.
1398-1399: Consider adding status code assertion for consistency.The other signed URL tests don't explicitly assert the status code before checking the response body. While the assertion at line 1398 implicitly verifies a successful response (since it accesses
response.json()["data"]), adding an explicitassert response.status_code == 200would improve readability and make failures easier to diagnose.🔧 Suggested improvement
```diff
 response = client.get(
     f"/api/v1/evaluations/datasets/{dataset.id}",
     headers={"X-API-KEY": user_api_key.key},
 )
 assert response.status_code == 200
-assert response.json()["data"].get("signed_url") is None
+assert response.json()["data"].get("signed_url") is None
```
test_include_signed_url(line 1420) andtest_no_object_store_url_skips_signing(line 1445).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/api/routes/test_evaluation.py` around lines 1398 - 1399, Add an explicit HTTP status assertion before inspecting the response body in the tests that check signed URL behavior: insert `assert response.status_code == 200` at the start of the assertions in the test function containing the shown snippet (the test that asserts signed_url is None) and do the same in `test_include_signed_url` and `test_no_object_store_url_skips_signing` so each test first verifies the 200 response code before accessing `response.json()["data"]`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/tests/api/routes/test_evaluation.py`:
- Around line 1401-1420: The test test_include_signed_url doesn't assert that
get_signed_url was called with the expected object URL; after calling the API in
the test, add an assertion that the mocked get_cloud_storage().get_signed_url
was invoked with the dataset's object_store_url (e.g.,
mock.return_value.get_signed_url.assert_called_once_with(dataset.object_store_url))
so the test verifies the correct URL is passed through; locate this in the
test_include_signed_url function and add the assertion immediately after the
client.get and before the final response assertion.
- Around line 1398-1399: Add an explicit HTTP status assertion before inspecting
the response body in the tests that check signed URL behavior: insert `assert
response.status_code == 200` at the start of the assertions in the test function
containing the shown snippet (the test that asserts signed_url is None) and do
the same in `test_include_signed_url` and
`test_no_object_store_url_skips_signing` so each test first verifies the 200
response code before accessing `response.json()["data"]`.
In `@backend/app/tests/core/test_exception_handlers.py`:
- Around line 126-140: The test test_union_noise_filtered should explicitly
assert that the returned errors list is not empty before iterating; update the
assertion logic in test_union_noise_filtered to retrieve errors =
response.json().get("errors", []) and assert errors (or assert len(errors) > 0)
so the subsequent for-loop does not pass vacuously, then continue asserting each
error does not contain "openai-native" in message or "NativeCompletionConfig" in
field.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 8872bc60-4f23-4073-aab2-8ec37c5c8dd4
📒 Files selected for processing (2)
backend/app/tests/api/routes/test_evaluation.py
backend/app/tests/core/test_exception_handlers.py
session: SessionDep,
auth_context: AuthContextDep,
include_signed_url: bool = Query(
    False, description="Include signed URLs for dataset"
it should be "URL", not "URLs" — only a single URL is returned
return bool(part and isinstance(part, str) and _BRANCH_PATTERN.search(part))


def _filter_union_branch_errors(errors: list[dict]) -> list[dict]:
what exactly do you mean by "branch" here
Here, “branch” refers to each possible model inside a Union / discriminated union.
For example, CompletionConfig can be either NativeCompletionConfig or KaapiCompletionConfig.
CompletionConfig = Annotated[
    Union[NativeCompletionConfig, KaapiCompletionConfig],
    Field(discriminator="provider"),
]

When a validation error occurs, Pydantic often returns errors from all union branches, even though the issue actually belongs to only one branch. This can result in confusing error messages.
The _filter_union_branch_errors function filters these results and keeps only the errors from the branch where the validation actually failed.
Note: If errors genuinely occur in multiple branches, errors from multiple branches will still be returned.
Added a docstring for that function, `_filter_union_branch_errors`, as well.
def _filter_union_branch_errors(errors: list[dict]) -> list[dict]:
    """When a field is a Union type, pydantic returns errors for every possible branch.

    This picks the branch where the validation error happend.
write "this function picks", so it's super clear that you are talking about the function here
Summary
Target issue is #651
Explain the motivation for making this change. What existing problem does the pull request solve?
Clients had no way to download dataset files directly from the API since the endpoint only returned the raw S3 path without a time-limited presigned URL.
Validation errors returned a flat string, making it hard for clients to parse and display them per field.
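To make the contrast concrete, here is a hedged sketch of the two payload shapes; the field names and structure are hypothetical, not the PR's exact schema:

```python
# Before: one flat string, hard for clients to map to a specific input field.
before = {
    "success": False,
    "error": "body.config.temperature: Input should be greater than 0",
}

# After: structured, per-field entries a client can render next to each input.
after = {
    "success": False,
    "errors": [
        {
            "field": "config.temperature",
            "message": "Input should be greater than 0",
        }
    ],
}
```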
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Previously, the validation error looked like this:
Now the validation error looks like this:
Summary by CodeRabbit
New Features
Improvements
Documentation
Tests