Skip to content

Files: audio sample's signed url#632

Merged
nishika26 merged 9 commits intomainfrom
enhancement/get_signed_url
Mar 5, 2026
Merged

Files: audio sample's signed url#632
nishika26 merged 9 commits intomainfrom
enhancement/get_signed_url

Conversation

@nishika26
Copy link
Copy Markdown
Collaborator

@nishika26 nishika26 commented Feb 28, 2026

Summary

Target issue is #629

Checklist

Before submitting a pull request, please ensure that you mark these task.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and test.

Notes

  • New Features

    • Added optional include_url parameter to dataset retrieval endpoint to optionally include pre-signed URLs for sample audio files from get dataset and get run endpoint.
  • Moved "AudioFileUploadResponse" model to "models/files.py" instead of letting it stay in "models/stt_evaluation.py"

  • Updated api documentation to state what will happen when they put include url and include samples value as "true"

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The changes refactor audio file handling by relocating the AudioUploadResponse model from stt_evaluation to file module, consolidating related imports, and adding optional signed URL generation for sample files in dataset retrieval through a new include_url parameter.

Changes

Cohort / File(s) Summary
Model Reorganization
backend/app/models/file.py, backend/app/models/stt_evaluation.py, backend/app/models/__init__.py
Moved AudioUploadResponse from stt_evaluation module to file module. File model refactored to use Field instead of SQLField. FilePublic now inherits from SQLModel. Added signed_url field to STTSamplePublic.
Import Consolidation
backend/app/api/routes/stt_evaluations/files.py, backend/app/services/stt_evaluations/audio.py
Updated import statements to reference AudioUploadResponse from new file module location and consolidated cloud storage imports.
Signed URL Feature
backend/app/api/routes/stt_evaluations/dataset.py
Added include_url parameter to get_dataset endpoint. Conditionally initializes cloud storage and generates signed URLs for audio samples when requested.
Test Updates
backend/app/tests/api/routes/test_stt_evaluation.py
Added MagicMock type annotation to mock_start_job parameter in test_start_stt_evaluation_success.
Documentation
backend/app/api/docs/stt_evaluation/get_audio.md
New endpoint documentation describing get_audio response metadata and S3 URL availability.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant APIRoute as get_dataset Route
    participant CloudStorage as Cloud Storage Service
    participant Database as Database

    Client->>APIRoute: GET /datasets/{id}?include_url=true
    APIRoute->>Database: Query dataset and samples
    Database-->>APIRoute: Returns samples
    
    alt include_url is true
        APIRoute->>CloudStorage: Initialize storage
        CloudStorage-->>APIRoute: Ready
        
        loop For each sample
            APIRoute->>CloudStorage: get_signed_url(object_store_url)
            CloudStorage-->>APIRoute: Returns signed_url
            APIRoute->>APIRoute: Populate signed_url in sample
        end
    else include_url is false
        APIRoute->>APIRoute: signed_url remains None
    end
    
    APIRoute-->>Client: Dataset with samples (signed_url populated or null)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Models shuffled, reorganized with care,
AudioUploadResponse finds its rightful lair,
Signed URLs bloom when include_url's set true,
Cloud storage springs up, then performs its due,
Fields refactored smooth—a storage tale anew!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Files: audio sample's signed url' is directly related to the main change of adding signed URLs to audio samples, but is somewhat vague and uses abbreviated phrasing that could be clearer.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch enhancement/get_signed_url

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@nishika26 nishika26 self-assigned this Feb 28, 2026
@nishika26 nishika26 added the enhancement New feature or request label Feb 28, 2026
@nishika26 nishika26 linked an issue Feb 28, 2026 that may be closed by this pull request
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
backend/app/services/stt_evaluations/helpers.py (1)

6-11: Improve type hint for storage parameter.

The storage: object | None type is too loose and will cause type checkers to flag storage.get_signed_url() as invalid since object doesn't have that method. Consider using a Protocol or importing the actual CloudStorage type.

♻️ Suggested improvement using Protocol
 """Helper functions for building audio file response schemas with signed URLs."""
 
+from typing import Protocol
+
 from app.models.file import File, FilePublic
 
 
+class StorageProtocol(Protocol):
+    """Protocol for cloud storage with signed URL support."""
+    def get_signed_url(self, object_url: str) -> str: ...
+
+
 def build_file_schema(
     *,
     file: File,
     include_url: bool,
-    storage: object | None,
+    storage: StorageProtocol | None,
 ) -> FilePublic:

Alternatively, import and use CloudStorage directly if circular imports aren't a concern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/stt_evaluations/helpers.py` around lines 6 - 11, The
storage parameter in build_file_schema is typed too loosely as object | None
which prevents type checkers from recognizing methods like get_signed_url;
update the signature of build_file_schema to use a more specific type (either
import the concrete CloudStorage type or declare a minimal Protocol with
get_signed_url and any other required methods) and replace storage: object |
None with storage: CloudStorage | None or storage: YourStorageProtocol | None,
then adjust any imports to avoid circular references (or place the Protocol in
this module) so calls to storage.get_signed_url() type-check correctly.
backend/app/api/routes/stt_evaluations/files.py (1)

81-91: Consider handling storage initialization failures gracefully.

get_cloud_storage can raise exceptions if storage initialization fails (see backend/app/core/cloud/storage.py lines 278-283). Currently this would bubble up as a 500 error. Consider whether to catch and return a more specific error or log additional context.

💡 Optional: Add explicit error handling
     storage = None
     if include_url:
+        try:
             storage = get_cloud_storage(
                 session=_session, project_id=auth_context.project_.id
             )
+        except Exception as e:
+            logger.warning(
+                f"[get_audio] Failed to initialize storage for signed URL | "
+                f"project_id: {auth_context.project_.id}, error: {e}"
+            )
+            # Proceed without signed URL rather than failing the entire request
 
     result = build_file_schema(

This is optional—letting it fail with a 500 may be the desired behavior if signed URLs are critical.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/stt_evaluations/files.py` around lines 81 - 91, When
include_url is true, get_cloud_storage(session=_session,
project_id=auth_context.project_.id) can raise during initialization; wrap that
call in a try/except around the storage assignment (the storage variable created
before calling get_cloud_storage) and handle failures explicitly: log contextual
details (file id, project id, include_url) and either raise a more specific HTTP
error (e.g., HTTPException 502/503 with a useful message) or return a controlled
response indicating signed URL generation failed before calling
build_file_schema(file=..., include_url=include_url, storage=storage); ensure
exceptions from get_cloud_storage are not allowed to bubble as raw 500s.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/models/file.py`:
- Around line 1-4: Remove the duplicate module docstring in
backend/app/models/file.py so there is only a single triple-quoted module
docstring at the very top of the file; delete the extra repeated string (the
second occurrence) and ensure all imports remain directly below the single
docstring to resolve the E402 import order errors.

---

Nitpick comments:
In `@backend/app/api/routes/stt_evaluations/files.py`:
- Around line 81-91: When include_url is true,
get_cloud_storage(session=_session, project_id=auth_context.project_.id) can
raise during initialization; wrap that call in a try/except around the storage
assignment (the storage variable created before calling get_cloud_storage) and
handle failures explicitly: log contextual details (file id, project id,
include_url) and either raise a more specific HTTP error (e.g., HTTPException
502/503 with a useful message) or return a controlled response indicating signed
URL generation failed before calling build_file_schema(file=...,
include_url=include_url, storage=storage); ensure exceptions from
get_cloud_storage are not allowed to bubble as raw 500s.

In `@backend/app/services/stt_evaluations/helpers.py`:
- Around line 6-11: The storage parameter in build_file_schema is typed too
loosely as object | None which prevents type checkers from recognizing methods
like get_signed_url; update the signature of build_file_schema to use a more
specific type (either import the concrete CloudStorage type or declare a minimal
Protocol with get_signed_url and any other required methods) and replace
storage: object | None with storage: CloudStorage | None or storage:
YourStorageProtocol | None, then adjust any imports to avoid circular references
(or place the Protocol in this module) so calls to storage.get_signed_url()
type-check correctly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecbae15 and 685d914.

📒 Files selected for processing (7)
  • backend/app/api/docs/stt_evaluation/get_audio.md
  • backend/app/api/routes/stt_evaluations/files.py
  • backend/app/models/__init__.py
  • backend/app/models/file.py
  • backend/app/models/stt_evaluation.py
  • backend/app/services/stt_evaluations/audio.py
  • backend/app/services/stt_evaluations/helpers.py
💤 Files with no reviewable changes (1)
  • backend/app/models/stt_evaluation.py

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_stt_evaluation.py (1)

375-375: Remove the debug print from the test.

This introduces noisy CI output and is a leftover debug artifact.

Proposed cleanup
-        print("response=", response)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_stt_evaluation.py` at line 375, Remove the
leftover debug print by deleting the print("response=", response) statement in
the failing test (the debug uses the local variable response); simply remove
that line so the test does not produce noisy CI output (or replace it with an
assertion or proper logger.debug if you need visibility).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/api/routes/stt_evaluations/files.py`:
- Around line 82-85: The call to get_cloud_storage(...) inside the include_url
branch can raise and currently propagates; update the include_url block (where
get_cloud_storage is called with session=_session and
project_id=auth_context.project_.id) to catch exceptions, log a prefixed error
(e.g., "Storage init failed: ...") including the exception details, and raise a
controlled HTTP error (HTTPException / appropriate framework error) with a 500
status and a clear message instead of letting the exception bubble up.
- Around line 71-79: The get_audio route currently uses get_file_by_id which can
return non-audio files; add a file-type guard after fetching the file (e.g., in
the same block where get_file_by_id is called inside get_audio) that checks the
file's type (e.g., file.file_type or file.type) against the audio enum/constant
(e.g., FileType.AUDIO) and raise an HTTPException (status 415 or 400) with a
clear message like "Requested file is not an audio file" if it doesn't match;
ensure you import/ reference the correct enum/class (FileType) used elsewhere in
the codebase and update any tests or callers accordingly.

---

Nitpick comments:
In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Line 375: Remove the leftover debug print by deleting the print("response=",
response) statement in the failing test (the debug uses the local variable
response); simply remove that line so the test does not produce noisy CI output
(or replace it with an assertion or proper logger.debug if you need visibility).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 685d914 and 4d1e8e8.

📒 Files selected for processing (4)
  • backend/app/api/routes/stt_evaluations/files.py
  • backend/app/models/file.py
  • backend/app/models/stt_evaluation.py
  • backend/app/tests/api/routes/test_stt_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/models/stt_evaluation.py

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Around line 110-117: The test functions (e.g.,
test_get_audio_file_with_signed_url) currently accept an untyped mock parameter
named mock_storage; update the function signatures to add precise type hints
(for example unittest.mock.Mock or MagicMock or the concrete storage
interface/fixture type used in tests) and add return type annotations (-> None)
where missing to satisfy the project typing rule; apply the same change to the
other test functions referenced (the ones around lines 159-166 and 254-261) so
each patched/mock parameter is annotated (e.g., mock_storage: Mock or MagicMock)
and all test functions have explicit parameter and return type hints.
- Line 558: Remove the stray debug print call print("response=", response) from
the test in test_stt_evaluation.py; locate the print in the test function that
handles the HTTP/response object (search for the variable name response or the
exact print string) and delete that line so CI output is not polluted—if the
intent was to assert something about response, replace the print with an
appropriate assertion on response.status_code or response.json() inside the same
test function.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4d1e8e8 and 2c6c8e2.

📒 Files selected for processing (1)
  • backend/app/tests/api/routes/test_stt_evaluation.py

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/app/tests/api/routes/test_stt_evaluation.py (1)

209-231: Test logic is correct, but consider clarifying the naming.

The test creates a file under user_api_key (Dalgo org) and verifies that superuser_api_key_header (Glific org) returns 404. This correctly validates cross-organization isolation. The method name and docstring mention "cross_project_access" while the actual test validates cross-organization access—consider renaming to test_get_audio_file_cross_org_access for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_stt_evaluation.py` around lines 209 - 231,
The test function name and docstring are misleading:
test_get_audio_file_cross_project_access and its docstring mention
"cross_project" but the test actually asserts cross-organization isolation using
user_api_key (Dalgo org) and superuser_api_key_header (Glific org); rename the
function to test_get_audio_file_cross_org_access and update the docstring to
mention "cross-organization" (or "cross_org") to reflect the intent, keeping the
test body (file creation with user_api_key and request with
superuser_api_key_header, asserting 404) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Around line 209-231: The test function name and docstring are misleading:
test_get_audio_file_cross_project_access and its docstring mention
"cross_project" but the test actually asserts cross-organization isolation using
user_api_key (Dalgo org) and superuser_api_key_header (Glific org); rename the
function to test_get_audio_file_cross_org_access and update the docstring to
mention "cross-organization" (or "cross_org") to reflect the intent, keeping the
test body (file creation with user_api_key and request with
superuser_api_key_header, asserting 404) unchanged.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c6c8e2 and 8793b1f.

📒 Files selected for processing (1)
  • backend/app/tests/api/routes/test_stt_evaluation.py

@AkhileshNegi AkhileshNegi changed the title audio_files: get endpoint and signed url Files: get endpoint and signed url Mar 2, 2026
@nishika26 nishika26 changed the title Files: get endpoint and signed url Files: audio sample's signed url Mar 2, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
backend/app/api/routes/stt_evaluations/dataset.py (1)

162-179: ⚠️ Potential issue | 🟠 Major

Guard signed URL generation errors so one storage failure doesn’t 500 the whole response.

get_cloud_storage(...) and storage.get_signed_url(...) can raise; currently those exceptions bubble and fail the entire dataset fetch.

Proposed hardening
         storage = None
         if include_url:
-            storage = get_cloud_storage(
-                session=_session, project_id=auth_context.project_.id
-            )
+            try:
+                storage = get_cloud_storage(
+                    session=_session, project_id=auth_context.project_.id
+                )
+            except Exception:
+                logger.exception(
+                    f"[get_dataset] Failed to initialize cloud storage | "
+                    f"project_id: {auth_context.project_.id}"
+                )
+                raise HTTPException(
+                    status_code=500, detail="Unable to generate signed URLs"
+                ) from None
@@
-                signed_url=storage.get_signed_url(
-                    file_map.get(s.file_id).object_store_url
-                )
-                if include_url and storage and s.file_id in file_map
-                else None,
+                signed_url=(
+                    storage.get_signed_url(file_map[s.file_id].object_store_url)
+                    if (
+                        include_url
+                        and storage
+                        and s.file_id in file_map
+                        and file_map[s.file_id].object_store_url
+                    )
+                    else None
+                ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/stt_evaluations/dataset.py` around lines 162 - 179,
The code currently lets exceptions from get_cloud_storage(...) and
storage.get_signed_url(...) bubble up while building STTSamplePublic, causing a
full 500; wrap the storage initialization and per-sample signed URL generation
in try/except blocks: catch errors from get_cloud_storage(...) and set
storage=None (and log), and for each sample when computing signed_url call
storage.get_signed_url(...) inside a try/except that returns None on failure
(and logs the exception) so building the samples list never raises; refer to the
storage variable, get_cloud_storage, storage.get_signed_url, and the
STTSamplePublic construction to locate the changes.
🧹 Nitpick comments (1)
backend/app/api/routes/stt_evaluations/dataset.py (1)

122-124: Consider making include_url default to False for true opt-in behavior.

Right now signed URLs are effectively on by default, which increases response cost and data exposure unless callers explicitly disable it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/stt_evaluations/dataset.py` around lines 122 - 124,
Change the include_url query default from True to False in the Query declaration
(the include_url: bool = Query(...) line) so signed URLs are opt-in rather than
on by default; update the parameter description if needed to indicate opt-in
behavior and search for any callers or tests that assume the old default (e.g.,
route handler functions referencing include_url in stt_evaluations dataset
endpoints) to update them to explicitly pass include_url=True where required.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@backend/app/api/routes/stt_evaluations/dataset.py`:
- Around line 162-179: The code currently lets exceptions from
get_cloud_storage(...) and storage.get_signed_url(...) bubble up while building
STTSamplePublic, causing a full 500; wrap the storage initialization and
per-sample signed URL generation in try/except blocks: catch errors from
get_cloud_storage(...) and set storage=None (and log), and for each sample when
computing signed_url call storage.get_signed_url(...) inside a try/except that
returns None on failure (and logs the exception) so building the samples list
never raises; refer to the storage variable, get_cloud_storage,
storage.get_signed_url, and the STTSamplePublic construction to locate the
changes.

---

Nitpick comments:
In `@backend/app/api/routes/stt_evaluations/dataset.py`:
- Around line 122-124: Change the include_url query default from True to False
in the Query declaration (the include_url: bool = Query(...) line) so signed
URLs are opt-in rather than on by default; update the parameter description if
needed to indicate opt-in behavior and search for any callers or tests that
assume the old default (e.g., route handler functions referencing include_url in
stt_evaluations dataset endpoints) to update them to explicitly pass
include_url=True where required.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8793b1f and 06c0b7a.

📒 Files selected for processing (5)
  • backend/app/api/routes/stt_evaluations/dataset.py
  • backend/app/api/routes/stt_evaluations/files.py
  • backend/app/models/file.py
  • backend/app/models/stt_evaluation.py
  • backend/app/tests/api/routes/test_stt_evaluation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/models/file.py

@@ -1 +1 @@
Get an STT dataset with its samples.
Get an STT dataset with its samples. You get a pre-signed S3 URL for direct sample file access as well if you have set ``include_url=True``.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update it to include_signed_url as URL is already there and might create confusion

@nishika26 nishika26 merged commit a0c1f24 into main Mar 5, 2026
3 checks passed
@nishika26 nishika26 deleted the enhancement/get_signed_url branch March 5, 2026 06:54
nishika26 added a commit that referenced this pull request Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

STT Eval: Code inconsistency as compared to rest of code STT EVAL: adding get endpoints and audio files signed url

3 participants