Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe changes refactor audio file handling by relocating the AudioUploadResponse model from stt_evaluation to file module, consolidating related imports, and adding optional signed URL generation for sample files in dataset retrieval through a new include_url parameter. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant APIRoute as get_dataset Route
participant CloudStorage as Cloud Storage Service
participant Database as Database
Client->>APIRoute: GET /datasets/{id}?include_url=true
APIRoute->>Database: Query dataset and samples
Database-->>APIRoute: Returns samples
alt include_url is true
APIRoute->>CloudStorage: Initialize storage
CloudStorage-->>APIRoute: Ready
loop For each sample
APIRoute->>CloudStorage: get_signed_url(object_store_url)
CloudStorage-->>APIRoute: Returns signed_url
APIRoute->>APIRoute: Populate signed_url in sample
end
else include_url is false
APIRoute->>APIRoute: signed_url remains None
end
APIRoute-->>Client: Dataset with samples (signed_url populated or null)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
backend/app/services/stt_evaluations/helpers.py (1)
6-11: Improve type hint forstorageparameter.The
storage: object | Nonetype is too loose and will cause type checkers to flagstorage.get_signed_url()as invalid sinceobjectdoesn't have that method. Consider using aProtocolor importing the actualCloudStoragetype.♻️ Suggested improvement using Protocol
"""Helper functions for building audio file response schemas with signed URLs.""" +from typing import Protocol + from app.models.file import File, FilePublic +class StorageProtocol(Protocol): + """Protocol for cloud storage with signed URL support.""" + def get_signed_url(self, object_url: str) -> str: ... + + def build_file_schema( *, file: File, include_url: bool, - storage: object | None, + storage: StorageProtocol | None, ) -> FilePublic:Alternatively, import and use
CloudStoragedirectly if circular imports aren't a concern.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/services/stt_evaluations/helpers.py` around lines 6 - 11, The storage parameter in build_file_schema is typed too loosely as object | None which prevents type checkers from recognizing methods like get_signed_url; update the signature of build_file_schema to use a more specific type (either import the concrete CloudStorage type or declare a minimal Protocol with get_signed_url and any other required methods) and replace storage: object | None with storage: CloudStorage | None or storage: YourStorageProtocol | None, then adjust any imports to avoid circular references (or place the Protocol in this module) so calls to storage.get_signed_url() type-check correctly.backend/app/api/routes/stt_evaluations/files.py (1)
81-91: Consider handling storage initialization failures gracefully.
get_cloud_storagecan raise exceptions if storage initialization fails (seebackend/app/core/cloud/storage.pylines 278-283). Currently this would bubble up as a 500 error. Consider whether to catch and return a more specific error or log additional context.💡 Optional: Add explicit error handling
storage = None if include_url: + try: storage = get_cloud_storage( session=_session, project_id=auth_context.project_.id ) + except Exception as e: + logger.warning( + f"[get_audio] Failed to initialize storage for signed URL | " + f"project_id: {auth_context.project_.id}, error: {e}" + ) + # Proceed without signed URL rather than failing the entire request result = build_file_schema(This is optional—letting it fail with a 500 may be the desired behavior if signed URLs are critical.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/api/routes/stt_evaluations/files.py` around lines 81 - 91, When include_url is true, get_cloud_storage(session=_session, project_id=auth_context.project_.id) can raise during initialization; wrap that call in a try/except around the storage assignment (the storage variable created before calling get_cloud_storage) and handle failures explicitly: log contextual details (file id, project id, include_url) and either raise a more specific HTTP error (e.g., HTTPException 502/503 with a useful message) or return a controlled response indicating signed URL generation failed before calling build_file_schema(file=..., include_url=include_url, storage=storage); ensure exceptions from get_cloud_storage are not allowed to bubble as raw 500s.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/models/file.py`:
- Around line 1-4: Remove the duplicate module docstring in
backend/app/models/file.py so there is only a single triple-quoted module
docstring at the very top of the file; delete the extra repeated string (the
second occurrence) and ensure all imports remain directly below the single
docstring to resolve the E402 import order errors.
---
Nitpick comments:
In `@backend/app/api/routes/stt_evaluations/files.py`:
- Around line 81-91: When include_url is true,
get_cloud_storage(session=_session, project_id=auth_context.project_.id) can
raise during initialization; wrap that call in a try/except around the storage
assignment (the storage variable created before calling get_cloud_storage) and
handle failures explicitly: log contextual details (file id, project id,
include_url) and either raise a more specific HTTP error (e.g., HTTPException
502/503 with a useful message) or return a controlled response indicating signed
URL generation failed before calling build_file_schema(file=...,
include_url=include_url, storage=storage); ensure exceptions from
get_cloud_storage are not allowed to bubble as raw 500s.
In `@backend/app/services/stt_evaluations/helpers.py`:
- Around line 6-11: The storage parameter in build_file_schema is typed too
loosely as object | None which prevents type checkers from recognizing methods
like get_signed_url; update the signature of build_file_schema to use a more
specific type (either import the concrete CloudStorage type or declare a minimal
Protocol with get_signed_url and any other required methods) and replace
storage: object | None with storage: CloudStorage | None or storage:
YourStorageProtocol | None, then adjust any imports to avoid circular references
(or place the Protocol in this module) so calls to storage.get_signed_url()
type-check correctly.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
backend/app/api/docs/stt_evaluation/get_audio.mdbackend/app/api/routes/stt_evaluations/files.pybackend/app/models/__init__.pybackend/app/models/file.pybackend/app/models/stt_evaluation.pybackend/app/services/stt_evaluations/audio.pybackend/app/services/stt_evaluations/helpers.py
💤 Files with no reviewable changes (1)
- backend/app/models/stt_evaluation.py
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_stt_evaluation.py (1)
375-375: Remove the debug print from the test.This introduces noisy CI output and is a leftover debug artifact.
Proposed cleanup
- print("response=", response)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/api/routes/test_stt_evaluation.py` at line 375, Remove the leftover debug print by deleting the print("response=", response) statement in the failing test (the debug uses the local variable response); simply remove that line so the test does not produce noisy CI output (or replace it with an assertion or proper logger.debug if you need visibility).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/api/routes/stt_evaluations/files.py`:
- Around line 82-85: The call to get_cloud_storage(...) inside the include_url
branch can raise and currently propagates; update the include_url block (where
get_cloud_storage is called with session=_session and
project_id=auth_context.project_.id) to catch exceptions, log a prefixed error
(e.g., "Storage init failed: ...") including the exception details, and raise a
controlled HTTP error (HTTPException / appropriate framework error) with a 500
status and a clear message instead of letting the exception bubble up.
- Around line 71-79: The get_audio route currently uses get_file_by_id which can
return non-audio files; add a file-type guard after fetching the file (e.g., in
the same block where get_file_by_id is called inside get_audio) that checks the
file's type (e.g., file.file_type or file.type) against the audio enum/constant
(e.g., FileType.AUDIO) and raise an HTTPException (status 415 or 400) with a
clear message like "Requested file is not an audio file" if it doesn't match;
ensure you import/ reference the correct enum/class (FileType) used elsewhere in
the codebase and update any tests or callers accordingly.
---
Nitpick comments:
In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Line 375: Remove the leftover debug print by deleting the print("response=",
response) statement in the failing test (the debug uses the local variable
response); simply remove that line so the test does not produce noisy CI output
(or replace it with an assertion or proper logger.debug if you need visibility).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/app/api/routes/stt_evaluations/files.pybackend/app/models/file.pybackend/app/models/stt_evaluation.pybackend/app/tests/api/routes/test_stt_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/models/stt_evaluation.py
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Around line 110-117: The test functions (e.g.,
test_get_audio_file_with_signed_url) currently accept an untyped mock parameter
named mock_storage; update the function signatures to add precise type hints
(for example unittest.mock.Mock or MagicMock or the concrete storage
interface/fixture type used in tests) and add return type annotations (-> None)
where missing to satisfy the project typing rule; apply the same change to the
other test functions referenced (the ones around lines 159-166 and 254-261) so
each patched/mock parameter is annotated (e.g., mock_storage: Mock or MagicMock)
and all test functions have explicit parameter and return type hints.
- Line 558: Remove the stray debug print call print("response=", response) from
the test in test_stt_evaluation.py; locate the print in the test function that
handles the HTTP/response object (search for the variable name response or the
exact print string) and delete that line so CI output is not polluted—if the
intent was to assert something about response, replace the print with an
appropriate assertion on response.status_code or response.json() inside the same
test function.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_stt_evaluation.py (1)
209-231: Test logic is correct, but consider clarifying the naming.The test creates a file under
user_api_key(Dalgo org) and verifies thatsuperuser_api_key_header(Glific org) returns 404. This correctly validates cross-organization isolation. The method name and docstring mention "cross_project_access" while the actual test validates cross-organization access—consider renaming totest_get_audio_file_cross_org_accessfor clarity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/tests/api/routes/test_stt_evaluation.py` around lines 209 - 231, The test function name and docstring are misleading: test_get_audio_file_cross_project_access and its docstring mention "cross_project" but the test actually asserts cross-organization isolation using user_api_key (Dalgo org) and superuser_api_key_header (Glific org); rename the function to test_get_audio_file_cross_org_access and update the docstring to mention "cross-organization" (or "cross_org") to reflect the intent, keeping the test body (file creation with user_api_key and request with superuser_api_key_header, asserting 404) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Around line 209-231: The test function name and docstring are misleading:
test_get_audio_file_cross_project_access and its docstring mention
"cross_project" but the test actually asserts cross-organization isolation using
user_api_key (Dalgo org) and superuser_api_key_header (Glific org); rename the
function to test_get_audio_file_cross_org_access and update the docstring to
mention "cross-organization" (or "cross_org") to reflect the intent, keeping the
test body (file creation with user_api_key and request with
superuser_api_key_header, asserting 404) unchanged.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
backend/app/api/routes/stt_evaluations/dataset.py (1)
162-179:⚠️ Potential issue | 🟠 MajorGuard signed URL generation errors so one storage failure doesn’t 500 the whole response.
get_cloud_storage(...)andstorage.get_signed_url(...)can raise; currently those exceptions bubble and fail the entire dataset fetch.Proposed hardening
storage = None if include_url: - storage = get_cloud_storage( - session=_session, project_id=auth_context.project_.id - ) + try: + storage = get_cloud_storage( + session=_session, project_id=auth_context.project_.id + ) + except Exception: + logger.exception( + f"[get_dataset] Failed to initialize cloud storage | " + f"project_id: {auth_context.project_.id}" + ) + raise HTTPException( + status_code=500, detail="Unable to generate signed URLs" + ) from None @@ - signed_url=storage.get_signed_url( - file_map.get(s.file_id).object_store_url - ) - if include_url and storage and s.file_id in file_map - else None, + signed_url=( + storage.get_signed_url(file_map[s.file_id].object_store_url) + if ( + include_url + and storage + and s.file_id in file_map + and file_map[s.file_id].object_store_url + ) + else None + ),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/api/routes/stt_evaluations/dataset.py` around lines 162 - 179, The code currently lets exceptions from get_cloud_storage(...) and storage.get_signed_url(...) bubble up while building STTSamplePublic, causing a full 500; wrap the storage initialization and per-sample signed URL generation in try/except blocks: catch errors from get_cloud_storage(...) and set storage=None (and log), and for each sample when computing signed_url call storage.get_signed_url(...) inside a try/except that returns None on failure (and logs the exception) so building the samples list never raises; refer to the storage variable, get_cloud_storage, storage.get_signed_url, and the STTSamplePublic construction to locate the changes.
🧹 Nitpick comments (1)
backend/app/api/routes/stt_evaluations/dataset.py (1)
122-124: Consider makinginclude_urldefault toFalsefor true opt-in behavior.Right now signed URLs are effectively on by default, which increases response cost and data exposure unless callers explicitly disable it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/api/routes/stt_evaluations/dataset.py` around lines 122 - 124, Change the include_url query default from True to False in the Query declaration (the include_url: bool = Query(...) line) so signed URLs are opt-in rather than on by default; update the parameter description if needed to indicate opt-in behavior and search for any callers or tests that assume the old default (e.g., route handler functions referencing include_url in stt_evaluations dataset endpoints) to update them to explicitly pass include_url=True where required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@backend/app/api/routes/stt_evaluations/dataset.py`:
- Around line 162-179: The code currently lets exceptions from
get_cloud_storage(...) and storage.get_signed_url(...) bubble up while building
STTSamplePublic, causing a full 500; wrap the storage initialization and
per-sample signed URL generation in try/except blocks: catch errors from
get_cloud_storage(...) and set storage=None (and log), and for each sample when
computing signed_url call storage.get_signed_url(...) inside a try/except that
returns None on failure (and logs the exception) so building the samples list
never raises; refer to the storage variable, get_cloud_storage,
storage.get_signed_url, and the STTSamplePublic construction to locate the
changes.
---
Nitpick comments:
In `@backend/app/api/routes/stt_evaluations/dataset.py`:
- Around line 122-124: Change the include_url query default from True to False
in the Query declaration (the include_url: bool = Query(...) line) so signed
URLs are opt-in rather than on by default; update the parameter description if
needed to indicate opt-in behavior and search for any callers or tests that
assume the old default (e.g., route handler functions referencing include_url in
stt_evaluations dataset endpoints) to update them to explicitly pass
include_url=True where required.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
backend/app/api/routes/stt_evaluations/dataset.pybackend/app/api/routes/stt_evaluations/files.pybackend/app/models/file.pybackend/app/models/stt_evaluation.pybackend/app/tests/api/routes/test_stt_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/models/file.py
| @@ -1 +1 @@ | |||
| Get an STT dataset with its samples. | |||
| Get an STT dataset with its samples. You get a pre-signed S3 URL for direct sample file access as well if you have set ``include_url=True``. | |||
There was a problem hiding this comment.
update it to include_signed_url as URL is already there and might create confusion
Summary
Target issue is #629
Checklist
Before submitting a pull request, please ensure that you mark these task.
fastapi run --reload app/main.pyordocker compose upin the repository root and test.Notes
New Features
include_urlparameter to dataset retrieval endpoint to optionally include pre-signed URLs for sample audio files from get dataset and get run endpoint.Moved "AudioFileUploadResponse" model to "models/files.py" instead of letting it stay in "models/stt_evaluation.py"
Updated api documentation to state what will happen when they put include url and include samples value as "true"