
Evaluation: STT#571

Merged
AkhileshNegi merged 59 commits into main from feature/stt-evaluation
Feb 9, 2026

Conversation


@AkhileshNegi AkhileshNegi commented Feb 2, 2026

Summary

Target issue is #533

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

  • New Features

    • End-to-end Speech-to-Text evaluation: upload audio, create datasets/samples, start runs, view transcriptions, and record human feedback.
    • Provider batch transcription integration (Gemini) with batch submission, polling, and result processing.
    • Signed URL expiry cap and MIME detection for uploads; supported audio formats and size limits.
    • Database migration adding STT sample/result tables and extending dataset/run metadata.
  • Functional Requirements Testing

  • Upload audio file to S3 via API
  • Create STT dataset from request
  • List datasets with pagination
  • Get dataset with samples
  • Start STT evaluation run
  • Use the Gemini Batch API to create a new job for transcriptions
  • Transcriptions update once the batch completes
  • Handle transcription errors gracefully
  • Poll evaluation status
  • Get evaluation with results
  • Update human feedback
  • List evaluations with pagination
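The notes above mention a signed-URL expiry cap for uploads. As a rough illustration of that behavior (the constant and function names here are invented for the sketch, not taken from the PR):

```python
# Illustrative sketch only: names are assumptions, not the PR's identifiers.
MAX_SIGNED_URL_EXPIRY_SECONDS = 24 * 60 * 60  # the 24-hour cap described in the notes


def clamp_signed_url_expiry(requested_seconds: int) -> int:
    """Clamp a requested signed-URL expiry to the 24-hour cap."""
    if requested_seconds <= 0:
        raise ValueError("expiry must be a positive number of seconds")
    return min(requested_seconds, MAX_SIGNED_URL_EXPIRY_SECONDS)
```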

Summary by CodeRabbit

  • New Features

    • Speech-to-Text (STT) evaluation: create/list datasets, upload audio, start/list/get runs, view per-run results, and submit human feedback
    • Audio upload support (mp3, wav, flac, m4a, ogg, webm; max 200 MB)
    • Gemini-backed batch transcription for scalable STT runs
  • Documentation

    • User docs added for dataset creation, audio upload, runs, results, and feedback
  • Tests

    • Extensive tests covering STT APIs, services, storage utilities, and Gemini integration
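The audio upload support summarized above (mp3, wav, flac, m4a, ogg, webm; max 200 MB) implies a validation step along these lines; the function and constant names are assumptions, not the PR's actual code:

```python
from pathlib import Path

# Hypothetical names; the PR's service module may differ.
SUPPORTED_AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".webm"}
MAX_AUDIO_SIZE_BYTES = 200 * 1024 * 1024  # 200 MB limit from the summary


def validate_audio_file(filename: str, size_bytes: int) -> None:
    """Reject uploads with an unsupported extension or an oversized payload."""
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_AUDIO_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {extension or '(none)'}")
    if size_bytes > MAX_AUDIO_SIZE_BYTES:
        raise ValueError(f"File exceeds {MAX_AUDIO_SIZE_BYTES} bytes")
```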

@coderabbitai

coderabbitai bot commented Feb 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds end-to-end Speech-to-Text (STT) evaluation: API endpoints, models, CRUD, services, Gemini batch provider, batch polling/processing, storage/file handling, DB migration, and comprehensive tests and docs.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **API docs & router wiring**<br>`backend/app/api/docs/stt_evaluation/*`, `backend/app/api/main.py`, `backend/app/api/routes/stt_evaluations/router.py` | Adds STT endpoint documentation and mounts the new `/evaluations/stt` APIRouter. |
| **API routes**<br>`backend/app/api/routes/stt_evaluations/files.py`, `.../dataset.py`, `.../evaluation.py`, `.../result.py`, `.../__init__.py` | New FastAPI routers for audio upload, dataset CRUD, starting/listing/getting runs, and result feedback, with permission checks and markdown descriptions. Review error handling and dependency usage. |
| **Models, DB migration & enums**<br>`backend/app/models/stt_evaluation.py`, `backend/app/models/evaluation.py`, `backend/app/alembic/versions/044_add_stt_evaluation_tables.py` | Adds STT ORM + Pydantic models, Evaluation type/language/providers fields, and a migration creating `file`, `stt_sample`, and `stt_result` tables plus new columns. Schema and migrations need careful DB review. |
| **CRUD & orchestration**<br>`backend/app/crud/stt_evaluations/*`, `backend/app/crud/file.py`, `backend/app/crud/__init__.py` | Implements dataset/sample/run/result/file CRUD, batch submission orchestration, and the polling/processing cron; re-exports file helpers. Pay attention to transaction boundaries and ownership checks. |
| **Batch provider & core**<br>`backend/app/core/batch/gemini.py`, `backend/app/core/batch/__init__.py`, `backend/app/core/providers.py` | Adds `GeminiBatchProvider`, the `BatchJobState` enum, and the `create_stt_batch_requests` helper, exposes exports, and registers the GOOGLE provider config. Inspect external client interactions and error paths. |
| **Storage utils & cloud**<br>`backend/app/core/storage_utils.py`, `backend/app/core/cloud/storage.py` | Adds `get_mime_from_url`, a generic `upload_to_object_store`, JSON/JSONL options, and clamps signed URL expiry to 24h. Verify MIME inference and upload content-type handling. |
| **Services (audio, dataset, Gemini client)**<br>`backend/app/services/stt_evaluations/*`, `.../gemini/client.py` | Audio validation/upload, CSV dataset upload orchestration, constants, and a `GeminiClient` wrapper for credentials/connection. Check size/format validation and credential error semantics. |
| **Cron / processing integration**<br>`backend/app/crud/evaluations/cron.py`, `backend/app/crud/evaluations/processing.py` | Integrates STT polling into the evaluation cron, separates text polling, and merges STT/text summaries. Review interaction with existing polling and run-state transitions. |
| **Tests & test infra**<br>`backend/app/tests/...` (`api/routes/test_stt_evaluation.py`, `core/batch/test_gemini.py`, `core/test_storage_utils.py`, `services/stt_evaluations/*`) | Extensive unit/integration tests for routes, the Gemini provider, storage utils, services, and the client. Ensure mocks align with production client expectations. |
| **Project deps & minor model docs**<br>`backend/pyproject.toml`, `backend/app/models/credentials.py`, `backend/app/models/collection.py` | Adds `google-genai` and `requests` deps; updates provider examples/documentation strings. Verify dependency versions and CI. |
| **Alembic migration**<br>`backend/app/alembic/versions/044_add_stt_evaluation_tables.py` | Large migration adding new tables/indices and columns; requires DB review and testing in the migration pipeline. |

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as STT API
    participant Service
    participant Storage
    participant DB as Database

    Client->>API: POST /files/audio (multipart)
    API->>Service: validate_audio_file(file)
    Service->>Service: check extension & size
    Service->>Storage: upload file to object store
    Storage-->>Service: object_store_url
    Service->>DB: create_file record
    DB-->>API: file metadata
    API-->>Client: AudioUploadResponse (s3_url, file_id)
sequenceDiagram
    participant Client
    participant API as STT API
    participant CRUD
    participant BatchSvc as Batch Service
    participant Storage
    participant Gemini

    Client->>API: POST /runs (start evaluation)
    API->>CRUD: validate dataset & list samples
    CRUD-->>API: samples
    API->>CRUD: create run & create result records
    API->>BatchSvc: start_stt_evaluation_batch(run, samples)
    BatchSvc->>Storage: generate signed URLs for sample files
    BatchSvc->>BatchSvc: build JSONL requests
    BatchSvc->>Gemini: submit batch job(s)
    Gemini-->>BatchSvc: provider batch id/status
    BatchSvc->>CRUD: update_stt_run(status=processing, batch ids)
    API-->>Client: STTEvaluationRunPublic (processing)
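The "build JSONL requests" step in the batch-submission diagram above could look roughly like the following; the payload shape is an illustrative assumption rather than the exact Gemini batch schema used by the PR:

```python
import json


def build_stt_batch_request_line(sample_key: str, signed_url: str,
                                 mime_type: str, language: str) -> str:
    """Serialize one JSONL request line for a batch transcription job.

    The request structure here is a sketch for illustration; consult the
    provider's batch documentation for the real schema.
    """
    request = {
        "key": sample_key,  # lets results be matched back to the sample
        "request": {
            "contents": [{
                "parts": [
                    {"file_data": {"file_uri": signed_url, "mime_type": mime_type}},
                    {"text": f"Transcribe this audio. Language: {language}."},
                ],
            }],
        },
    }
    # One compact JSON object per line, as JSONL requires.
    return json.dumps(request, ensure_ascii=False)
```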
sequenceDiagram
    participant Cron
    participant CRUD as Run CRUD
    participant Gemini
    participant Results as Result CRUD
    participant DB as Database

    Cron->>CRUD: get_pending_stt_runs()
    CRUD-->>Cron: pending runs with batch_job_id
    loop each run
        Cron->>Gemini: get_batch_status(batch_id)
        Gemini-->>Cron: state
        alt terminal
            Cron->>Gemini: download_batch_results(batch_output_id)
            Gemini-->>Cron: results JSONL
            Cron->>Results: update_stt_result entries
            Results->>DB: persist updates
            Cron->>CRUD: update_stt_run(status=completed/failed)
        else
            Cron-->>Cron: continue polling
        end
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

ready-for-review

Suggested reviewers

  • kartpop
  • Prajna1999
  • vprashrex

Poem

🐇
I hopped with a mic and a curious grin,
Uploaded the sounds and let batches begin.
Gemini hummed, transcripts danced into view,
Datasets, runs, and feedback — a rallying chew.
Nibbles of review make the results true. 🎧✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title "Evaluation: STT" is concise and directly describes the main feature addition, Speech-to-Text evaluation support, which is the primary focus of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 91.04%, which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/stt-evaluation

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@AkhileshNegi AkhileshNegi self-assigned this Feb 2, 2026
@AkhileshNegi AkhileshNegi added the enhancement New feature or request label Feb 2, 2026
@ProjectTech4DevAI ProjectTech4DevAI deleted a comment from codecov bot Feb 6, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py`:
- Around line 116-127: The except block in start_stt_evaluation currently
returns str(e) to the client, leaking internal error details; instead keep
logging the full error and saving the detailed message via
update_stt_run(session=_session, run_id=run.id, status="failed",
error_message=str(e)) but change the HTTPException detail to a generic message
(optionally include a non-sensitive identifier like run.id or an internal error
code) so the client receives no internal stack/secret data; locate the except
block that references logger.error, update_stt_run, and raise HTTPException and
replace the HTTPException detail payload accordingly.
- Around line 62-80: The run's total_items is being computed from
dataset.dataset_metadata.sample_count which can be stale; instead fetch the
actual samples first (the variable named samples used later) and compute
total_items = len(samples) * len(run_create.providers) before calling
create_stt_run (update the create_stt_run call to pass that computed value).
Ensure you still fall back to dataset.dataset_metadata.get("sample_count", 0)
only if samples is empty or the samples fetch fails, and keep using language_id
= dataset.language_id and other create_stt_run parameters unchanged.
🧹 Nitpick comments (7)
backend/app/models/stt_evaluation.py (5)

74-82: default_factory=dict produces {} but column is nullable=True — pick one semantic.

The Python default is {} (via default_factory=dict), so sample_metadata will never be None when created through the ORM without an explicit None assignment, yet the DB column allows NULL. This may cause confusion for downstream consumers checking is None vs == {}. Consider aligning: either use default=None (matching nullable=True) or set nullable=False, server_default=text("'{}'").


136-147: provider and status on STTResult are bare str — consider enum validation or constraints.

status has an STTResultStatus enum defined but the column is typed as str with no DB-level check constraint. Similarly, provider could drift from SUPPORTED_STT_PROVIDERS. At minimum, adding a CheckConstraint on status ensures DB-level integrity.


273-277: STTFeedbackUpdate allows an empty payload (both fields None) — this is a no-op.

If both is_correct and comment are None, the update request does nothing meaningful. Consider a model-level validator to require at least one field.

Example validator
 class STTFeedbackUpdate(BaseModel):
     """Request model for updating human feedback on a result."""
 
     is_correct: bool | None = Field(None, description="Is the transcription correct?")
     comment: str | None = Field(None, description="Feedback comment")
+
+    @model_validator(mode="after")
+    def check_at_least_one_field(self) -> "STTFeedbackUpdate":
+        if self.is_correct is None and self.comment is None:
+            raise ValueError("At least one of 'is_correct' or 'comment' must be provided")
+        return self

108-112: updated_at uses default_factory=now which only fires on INSERT—but CRUD updates explicitly set it, making this a best-practice refactoring suggestion.

Both STTSample.updated_at and STTResult.updated_at use default_factory=now, which only executes on INSERT. However, inspection of the CRUD layer shows that all update operations explicitly set updated_at = now() (e.g., backend/app/crud/stt_evaluations/run.py:222, backend/app/crud/stt_evaluations/result.py:238). While this explicit management works, consider adding SQLAlchemy's onupdate to the column definition for additional safety against future updates that might miss manual assignment.

Example using sa_column with onupdate
-    updated_at: datetime = SQLField(
-        default_factory=now,
-        nullable=False,
-        sa_column_kwargs={"comment": "Timestamp when the sample was last updated"},
-    )
+    updated_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        sa_column_kwargs={
+            "comment": "Timestamp when the sample was last updated",
+            "onupdate": now,
+        },
+    )

Also applies to: 216-219


232-265: Consider adding model_config = ConfigDict(from_attributes=True) to STTSamplePublic and STTResultPublic for idiomatic Pydantic usage.

These models could benefit from Pydantic v2's from_attributes=True configuration. While the current explicit keyword argument construction (e.g., STTSamplePublic(id=sample.id, file_id=sample.file_id, ...)) works correctly, adopting from_attributes=True would enable the idiomatic pattern model_validate(orm_instance) and reduce duplication across construction sites.

backend/app/api/routes/stt_evaluations/evaluation.py (2)

137-155: Verbose manual field-by-field model construction — use model_validate or from_orm.

Both STTEvaluationRunPublic and STTEvaluationRunWithResults are constructed by manually mapping every field from the ORM object. This is error-prone (easy to miss a field when the model evolves) and verbose. If you add model_config = ConfigDict(from_attributes=True) to the Pydantic models, you can replace this with:

STTEvaluationRunPublic.model_validate(run)

This would also simplify STTEvaluationRunWithResults construction.

Also applies to: 233-253


190-206: Response model mismatch when include_results=False.

The endpoint declares response_model=APIResponse[STTEvaluationRunWithResults], but when include_results=False, results will be an empty list and results_total will be 0. While this technically validates, it's semantically misleading — the response schema always advertises results. Consider using a union type or separate endpoint, or at minimum document this behavior clearly.

Comment on lines +62 to +80
sample_count = (dataset.dataset_metadata or {}).get("sample_count", 0)

if sample_count == 0:
raise HTTPException(status_code=400, detail="Dataset has no samples")

# Use language_id from the dataset
language_id = dataset.language_id

# Create run record
run = create_stt_run(
session=_session,
run_name=run_create.run_name,
dataset_id=run_create.dataset_id,
dataset_name=dataset.name,
org_id=auth_context.organization_.id,
project_id=auth_context.project_.id,
providers=run_create.providers,
language_id=language_id,
total_items=sample_count * len(run_create.providers),


⚠️ Potential issue | 🟠 Major

total_items derived from metadata may diverge from actual sample count.

sample_count is read from dataset.dataset_metadata (line 62), but the actual samples are fetched separately on line 84. If metadata becomes stale (e.g., samples added/removed without metadata update), total_items stored on the run will be incorrect. Consider computing total_items from len(samples) after fetching them.

Proposed fix
+    # Get samples for the dataset
+    samples = get_samples_by_dataset_id(
+        session=_session,
+        dataset_id=run_create.dataset_id,
+        org_id=auth_context.organization_.id,
+        project_id=auth_context.project_.id,
+    )
+
+    if not samples:
+        raise HTTPException(status_code=400, detail="Dataset has no samples")
+
     # Create run record
     run = create_stt_run(
         session=_session,
         run_name=run_create.run_name,
         dataset_id=run_create.dataset_id,
         dataset_name=dataset.name,
         org_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
         providers=run_create.providers,
         language_id=language_id,
-        total_items=sample_count * len(run_create.providers),
+        total_items=len(samples) * len(run_create.providers),
     )
-
-    # Get samples for the dataset
-    samples = get_samples_by_dataset_id(
-        session=_session,
-        dataset_id=run_create.dataset_id,
-        org_id=auth_context.organization_.id,
-        project_id=auth_context.project_.id,
-    )
🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 62 - 80,
The run's total_items is being computed from
dataset.dataset_metadata.sample_count which can be stale; instead fetch the
actual samples first (the variable named samples used later) and compute
total_items = len(samples) * len(run_create.providers) before calling
create_stt_run (update the create_stt_run call to pass that computed value).
Ensure you still fall back to dataset.dataset_metadata.get("sample_count", 0)
only if samples is empty or the samples fetch fails, and keep using language_id
= dataset.language_id and other create_stt_run parameters unchanged.

Comment on lines +116 to +127
except Exception as e:
logger.error(
f"[start_stt_evaluation] Batch submission failed | "
f"run_id: {run.id}, error: {str(e)}"
)
update_stt_run(
session=_session,
run_id=run.id,
status="failed",
error_message=str(e),
)
raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")


⚠️ Potential issue | 🟡 Minor

Internal error details leaked to the client.

str(e) is included in the HTTP 500 response detail, which may expose internal implementation details (stack traces, service URLs, credentials in connection strings, etc.) to API consumers.

Proposed fix
-        raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")
+        raise HTTPException(status_code=500, detail="Batch submission failed")
🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 116 - 127,
The except block in start_stt_evaluation currently returns str(e) to the client,
leaking internal error details; instead keep logging the full error and saving
the detailed message via update_stt_run(session=_session, run_id=run.id,
status="failed", error_message=str(e)) but change the HTTPException detail to a
generic message (optionally include a non-sensitive identifier like run.id or an
internal error code) so the client receives no internal stack/secret data;
locate the except block that references logger.error, update_stt_run, and raise
HTTPException and replace the HTTPException detail payload accordingly.

)
return cls(api_key=api_key)

def validate_connection(self) -> bool:
Collaborator


I think once the client instantiation is successful this check is rendered redundant unless there is a wifi outage.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Fix all issues with AI agents
In `@backend/app/api/docs/stt_evaluation/start_evaluation.md`:
- Around line 1-8: The docs refer to "providers" but the API expects "models";
update the wording in this doc so it matches the STTEvaluationRunCreate contract
by replacing mentions of "providers" with "models" and noting the field name
`models: list[str]` where examples or supported options are listed (e.g., change
"**Supported providers:** gemini-2.5-pro" to "**Supported models:**
gemini-2.5-pro") and ensure any descriptive text referencing "providers" (lines
describing processing through providers) instead mentions processing through the
specified `models`.

In `@backend/app/crud/stt_evaluations/batch.py`:
- Around line 133-198: The code only saves first_batch_job_id
(first_batch_job_id) to the EvaluationRun so only the first model's batch job is
discoverable by poll_all_pending_stt_evaluations (which uses
EvaluationRun.batch_job_id); fix by persisting all created batch job IDs instead
of only the first: collect batch_job.id values from start_batch_job into
batch_jobs (or a list batch_job_ids) and call update_stt_run with that
collection (e.g., update_stt_run(session, run_id=run.id, status="processing",
batch_job_id=None, metadata={"batch_job_ids": batch_job_ids}) or add a new
parameter/field to store batch_job_ids); alternatively, if multi-model runs are
not supported, add an explicit assertion or comment near the loop and avoid
creating multiple jobs — reference functions/vars: start_batch_job, batch_jobs,
first_batch_job_id, update_stt_run, and poll_all_pending_stt_evaluations when
making the change.

In `@backend/app/crud/stt_evaluations/cron.py`:
- Around line 282-284: Remove the unused boolean any_succeeded (set at
any_succeeded = False and later modified) or else use it in the final-status
calculation; specifically, edit the block that defines/updates all_terminal,
any_failed, and any_succeeded (search for any_succeeded and any_failed in
cron.py) and either delete all references and assignments to any_succeeded, or
integrate it into the terminal-state logic (e.g., use any_succeeded to
distinguish "all failed" vs "partially succeeded" when computing the overall
status) and remove any linter warnings. Ensure no other code relies on
any_succeeded before removing.
- Around line 435-452: The loop currently skips when result_record is None which
hides failures; update the branch handling to log a warning (including
identifiers like the batch_result payload or an id/metadata if available) and
increment failed_count so missing DB rows are visible; specifically, in the
block where you check result_record (the variable result_record inside the loop
that later calls update_stt_result), add a processLogger.warning (or existing
logger) call describing the missing result_record and the batch_result context,
and increment failed_count (not processed_count) so metrics reflect the skipped
result instead of silently continuing.
- Around line 346-363: The run finalization currently leaves runs stuck in
"processing" when all batches are terminal but there remain pending results;
modify the finalization logic in the block using count_results_by_status,
STTResultStatus, any_failed, errors and update_stt_run so that if all_terminal
is True and pending > 0 you treat the run as terminal (set status to "completed"
or "failed") and populate error_message accordingly (e.g., include failed_count
and a message about unresolved pending results or list errors); ensure
update_stt_run is called with the terminal status and a clear error_message so
subsequent cron cycles won't reprocess the same run.

In `@backend/app/models/collection.py`:
- Around line 42-48: The provider Field's description mentions 'google' but the
ProviderType enum (ProviderType) does not include a GOOGLE/GEMINI member; fix by
either adding a GOOGLE (or GEMINI) enum value to ProviderType (ensure downstream
serialization/name matches existing enum style) or update the provider Field
description in collection.py to only list supported providers (e.g., 'openai')
so OpenAPI/docs match the actual ProviderType; modify the ProviderType
definition or the description string in the provider Field accordingly and run
tests/type checks to confirm no regressions.
🧹 Nitpick comments (15)
backend/app/core/providers.py (1)

1-4: Consider modernizing type hints to Python 3.11+ builtins.

Pre-existing code uses Dict, List, Optional from typing (lines 2, 22, 26, 62, 86). Since the coding guidelines require Python 3.11+, these can be replaced with the built-in dict, list, and X | None syntax. Not blocking, but worth a follow-up cleanup. As per coding guidelines: "Use Python 3.11+ with type hints throughout the codebase."

backend/app/models/stt_evaluation.py (4)

15-17: Remove commented-out code with duplicate entries.

The commented-out list on line 16 contains "gemini-2.5-pro" twice (likely intended to be "gemini-2.5-flash"). Remove the dead comment to avoid confusion.

Proposed fix
-# Supported STT models for evaluation
-# SUPPORTED_STT_MODELS = ["gemini-2.5-pro", "gemini-2.5-pro", "gemini-2.0-flash"]
-SUPPORTED_STT_MODELS = ["gemini-2.5-pro"]
+# Supported STT models for evaluation
+SUPPORTED_STT_MODELS = ["gemini-2.5-pro"]

327-341: Confusing variable names in validate_models.

The parameter valid_model reads as a singular validated model, but it's actually the full list being validated. The loop variable models shadows the field name conceptually and reads as plural when it represents a single model string.

Proposed fix
     @field_validator("models")
     @classmethod
-    def validate_models(cls, valid_model: list[str]) -> list[str]:
+    def validate_models(cls, models: list[str]) -> list[str]:
         """Validate that all models are supported."""
-        if not valid_model:
+        if not models:
             raise ValueError("At least one model must be specified")
         unsupported = [
-            models for models in valid_model if models not in SUPPORTED_STT_MODELS
+            m for m in models if m not in SUPPORTED_STT_MODELS
         ]
         if unsupported:
             raise ValueError(
                 f"Unsupported model(s): {', '.join(unsupported)}. "
                 f"Supported models are: {', '.join(SUPPORTED_STT_MODELS)}"
             )
-        return valid_model
+        return models

36-113: Consider adding database indexes on frequently queried foreign keys.

STTSample will be queried by dataset_id (listing samples for a dataset) and STTResult by evaluation_run_id and stt_sample_id (listing results for a run, looking up results for a sample). Without indexes on these columns, those queries will degrade as data grows.

Example: adding table-level indexes
from sqlalchemy import Index

class STTSample(SQLModel, table=True):
    __tablename__ = "stt_sample"
    __table_args__ = (
        Index("idx_stt_sample_dataset", "dataset_id"),
    )
    # ... fields ...

class STTResult(SQLModel, table=True):
    __tablename__ = "stt_result"
    __table_args__ = (
        Index("idx_stt_result_run", "evaluation_run_id"),
        Index("idx_stt_result_sample", "stt_sample_id"),
    )
    # ... fields ...

Also applies to: 116-221


274-278: No-op feedback update is silently accepted.

Both is_correct and comment default to None. A request body {} passes validation but performs a no-op update. Consider adding a model-level validator to require at least one field.

Proposed fix
 class STTFeedbackUpdate(BaseModel):
     """Request model for updating human feedback on a result."""

     is_correct: bool | None = Field(None, description="Is the transcription correct?")
     comment: str | None = Field(None, description="Feedback comment")
+
+    @model_validator(mode="after")
+    def check_at_least_one_field(self) -> "STTFeedbackUpdate":
+        if self.is_correct is None and self.comment is None:
+            raise ValueError("At least one of 'is_correct' or 'comment' must be provided")
+        return self
backend/app/crud/evaluations/processing.py (1)

669-674: Use the EvaluationType enum for consistency.

The STT polling in cron.py uses EvaluationType.STT.value, but here the type filter uses a raw string "text". Using the enum constant is safer against typos and keeps the codebase consistent.

Proposed fix
+from app.models.stt_evaluation import EvaluationType
+
     # Single query to fetch all processing text evaluation runs
     # STT/TTS evaluations have their own polling
     statement = select(EvaluationRun).where(
         EvaluationRun.status == "processing",
-        EvaluationRun.type == "text",
+        EvaluationRun.type == EvaluationType.TEXT.value,
     )
backend/app/crud/stt_evaluations/batch.py (1)

123-124: Use RuntimeError (or a domain exception) instead of bare Exception.

Raising bare Exception makes it harder for callers to catch specific errors. This also applies to line 190.

Proposed fix
     if not signed_urls:
-        raise Exception("Failed to generate signed URLs for any audio files")
+        raise RuntimeError("Failed to generate signed URLs for any audio files")
     ...
     if not batch_jobs:
-        raise Exception("Batch submission failed for all models")
+        raise RuntimeError("Batch submission failed for all models")

Also applies to: 189-190

backend/app/tests/api/routes/test_stt_evaluation.py (1)

14-99: Consider moving factory helpers to a shared conftest or utility module.

The create_test_file, create_test_stt_dataset, and create_test_stt_sample helpers are well-structured factory functions. Per the project's testing guidelines, these could be extracted to a shared test utilities module (e.g., backend/app/tests/utils/stt_factories.py) so they can be reused by other test files that may need STT test data. As per coding guidelines, "Use factory pattern for test fixtures in backend/app/tests/".

backend/app/core/storage_utils.py (1)

139-206: upload_jsonl_to_object_store duplicates the upload logic instead of delegating to upload_to_object_store.

The new upload_to_object_store was introduced as the generic upload function, and upload_csv_to_object_store correctly delegates to it. However, upload_jsonl_to_object_store still manually creates UploadFile, calls storage.put, and duplicates the same error handling. This defeats the purpose of the generic function.

Proposed fix
 def upload_jsonl_to_object_store(
     storage: CloudStorage,
     results: list[dict],
     filename: str,
     subdirectory: str,
     format: Literal["json", "jsonl"] = "jsonl",
 ) -> str | None:
     ...
     try:
-        file_path = Path(subdirectory) / filename
-
         if format == "jsonl":
             jsonl_content = (
                 "\n".join(json.dumps(result, ensure_ascii=False) for result in results)
                 + "\n"
             )
-            content_type = {"content-type": "application/jsonl"}
+            ct = "application/jsonl"
         else:
             jsonl_content = json.dumps(results, ensure_ascii=False)
-            content_type = {"content-type": "application/json"}
+            ct = "application/json"
 
         content_bytes = jsonl_content.encode("utf-8")
 
-        headers = Headers(content_type)
-        upload_file = UploadFile(
-            filename=filename,
-            file=BytesIO(content_bytes),
-            headers=headers,
+        return upload_to_object_store(
+            storage=storage,
+            content=content_bytes,
+            filename=filename,
+            subdirectory=subdirectory,
+            content_type=ct,
         )
-
-        destination = storage.put(source=upload_file, file_path=file_path)
-        object_store_url = str(destination)
-        ...
     except CloudStorageError as e:
         ...
backend/app/tests/core/test_storage_utils.py (2)

359-369: Uniqueness test doesn't assert uniqueness.

The test is named test_unique_filenames but only asserts prefix format — it never compares filename1 != filename2. Combined with the comment acknowledging same-second collisions, this test adds no value beyond test_timestamp_format.

Either assert inequality or remove the test to avoid misleading coverage.

Proposed fix
     def test_unique_filenames(self):
         """Test that consecutive calls produce different filenames."""
         import time

         filename1 = generate_timestamped_filename("test")
-        time.sleep(0.01)  # Small delay to ensure different timestamp
+        time.sleep(1.1)  # Ensure different second-level timestamp
         filename2 = generate_timestamped_filename("test")
-        # They may be the same if called in the same second
-        # but the format should be correct
-        assert filename1.startswith("test_")
-        assert filename2.startswith("test_")
+        assert filename1 != filename2, "Consecutive filenames should differ"

24-27: Add return type hints to test methods for consistency.

Several test methods lack -> None return type annotations (e.g., test_mp3_url, test_wav_url, test_successful_upload, etc.), while others like test_upload_json_file_success have them. As per coding guidelines, all functions should have type hints on return values.

Also applies to: 29-32, 97-109, 166-176, 197-219, 327-331, 333-337, 353-357

backend/app/crud/stt_evaluations/run.py (1)

218-220: Cannot explicitly clear a field to None (e.g., error_message).

The if value is not None guard means callers can never reset a field — for instance, clearing error_message on a retry. If intentional, document it; otherwise, consider a sentinel or explicit include-list.

Example using a sentinel
+_UNSET = object()
+
 def update_stt_run(
     *,
     session: Session,
     run_id: int,
-    status: str | None = None,
-    score: dict[str, Any] | None = None,
-    error_message: str | None = None,
-    object_store_url: str | None = None,
-    batch_job_id: int | None = None,
+    status: str | None = _UNSET,
+    score: dict[str, Any] | None = _UNSET,
+    error_message: str | None = _UNSET,
+    object_store_url: str | None = _UNSET,
+    batch_job_id: int | None = _UNSET,
 ) -> EvaluationRun | None:
     ...
     for field, value in updates.items():
-        if value is not None:
+        if value is not _UNSET:
             setattr(run, field, value)
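The sentinel pattern in isolation, as a runnable sketch (class and field names are placeholders, independent of the project's SQLModel setup):

```python
# Sentinel-based partial update: distinguishes "argument not passed"
# from "explicitly passed None", so callers can clear a field.
_UNSET = object()


class Run:
    def __init__(self) -> None:
        self.status = "pending"
        self.error_message = "boom"


def update_run(run: Run, status=_UNSET, error_message=_UNSET) -> Run:
    for fld, value in {"status": status, "error_message": error_message}.items():
        if value is not _UNSET:  # None is a legitimate value here
            setattr(run, fld, value)
    return run
```

With this, `update_run(run, error_message=None)` clears the error while leaving `status` untouched, which the `if value is not None` guard cannot express.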
backend/app/crud/stt_evaluations/result.py (3)

22-77: session.flush() before session.commit() is redundant.

commit() already calls flush() internally. The extra flush() on line 69 adds no value.

Proposed fix
     session.add_all(results)
-    session.flush()
     session.commit()

247-299: CRUD layer raises HTTPException — mixing concerns.

update_human_feedback raises HTTPException (line 280), coupling the data-access layer to HTTP semantics. If this function is called from a non-HTTP context (e.g., a script, background job), the HTTP exception is inappropriate.

Return None and let the API route handler raise the 404.

Proposed fix
-from app.core.exception_handlers import HTTPException
 ...
 def update_human_feedback(
     ...
 ) -> STTResult | None:
     ...
     result = get_stt_result_by_id(...)

     if not result:
-        raise HTTPException(status_code=404, detail="Result not found")
+        return None

     ...

Then in the API route, check for None and raise HTTPException(status_code=404, ...) there.


350-358: Simplify dict comprehension per Ruff C416.

Proposed fix
-    return {status: count for status, count in rows}
+    return dict(rows)
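This works because `dict()` accepts any iterable of 2-tuples, which is exactly what the `(status, count)` rows are:

```python
# Rows from a "GROUP BY status" query arrive as (status, count) tuples;
# dict() consumes them directly, no comprehension needed.
rows = [("pending", 3), ("completed", 10), ("failed", 1)]
counts = dict(rows)
```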

Comment on lines +1 to +8
Start an STT evaluation run on a dataset.

The evaluation will:
1. Process each audio sample through the specified providers
2. Generate transcriptions using Gemini Batch API
3. Store results for human review

**Supported providers:** gemini-2.5-pro

⚠️ Potential issue | 🟡 Minor

Documentation uses "providers" but the API model uses "models".

STTEvaluationRunCreate defines the field as models: list[str], but this doc references "providers" (lines 4 and 8). Update the terminology to match the API contract to avoid confusing consumers.

Proposed fix
 Start an STT evaluation run on a dataset.
 
 The evaluation will:
-1. Process each audio sample through the specified providers
+1. Process each audio sample through the specified models
 2. Generate transcriptions using Gemini Batch API
 3. Store results for human review
 
-**Supported providers:** gemini-2.5-pro
+**Supported models:** gemini-2.5-pro
🤖 Prompt for AI Agents
In `@backend/app/api/docs/stt_evaluation/start_evaluation.md` around lines 1 - 8,
the docs refer to "providers" but the API expects "models"; update the wording
in this doc so it matches the STTEvaluationRunCreate contract by replacing
mentions of "providers" with "models" and noting the field name `models:
list[str]` where examples or supported options are listed (e.g., change
"**Supported providers:** gemini-2.5-pro" to "**Supported models:**
gemini-2.5-pro") and ensure any descriptive text referencing "providers" (lines
describing processing through providers) instead mentions processing through the
specified `models`.

Comment on lines +133 to +198
# Submit one batch job per model
batch_jobs: dict[str, Any] = {}
first_batch_job_id: int | None = None

for model in models:
model_path = f"models/{model}"
batch_provider = GeminiBatchProvider(
client=gemini_client.client, model=model_path
)

try:
batch_job = start_batch_job(
session=session,
provider=batch_provider,
provider_name="google",
job_type="stt_evaluation",
organization_id=org_id,
project_id=project_id,
jsonl_data=jsonl_data,
config={
"model": model,
"stt_provider": model,
"evaluation_run_id": run.id,
},
)

batch_jobs[model] = {
"batch_job_id": batch_job.id,
"provider_batch_id": batch_job.provider_batch_id,
}

if first_batch_job_id is None:
first_batch_job_id = batch_job.id

logger.info(
f"[start_stt_evaluation_batch] Batch job created | "
f"run_id: {run.id}, model: {model}, "
f"batch_job_id: {batch_job.id}"
)

except Exception as e:
logger.error(
f"[start_stt_evaluation_batch] Failed to submit batch | "
f"model: {model}, error: {str(e)}"
)
pending = get_pending_results_for_run(
session=session, run_id=run.id, provider=model
)
for result in pending:
update_stt_result(
session=session,
result_id=result.id,
status=STTResultStatus.FAILED.value,
error_message=f"Batch submission failed for {model}: {str(e)}",
)

if not batch_jobs:
raise Exception("Batch submission failed for all models")

# Link first batch job to the evaluation run (for pending run detection)
update_stt_run(
session=session,
run_id=run.id,
status="processing",
batch_job_id=first_batch_job_id,
)

⚠️ Potential issue | 🟠 Major

Only the first batch job is linked to the evaluation run — multi-model runs will lose track of subsequent jobs.

first_batch_job_id (line 164-165) records only the first model's batch job. The cron poller in poll_all_pending_stt_evaluations discovers runs via EvaluationRun.batch_job_id, so batch jobs for any additional models won't be polled or processed. While currently only one model is supported, this code path explicitly loops over multiple models and would silently drop results for all but the first.

Consider either: (a) creating one EvaluationRun per model, (b) storing all batch job IDs (e.g., in the run's metadata/config), or (c) adding a comment/assertion that multi-model is not yet fully supported.

🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/batch.py` around lines 133 - 198, the code
only saves first_batch_job_id (first_batch_job_id) to the EvaluationRun so only
the first model's batch job is discoverable by poll_all_pending_stt_evaluations
(which uses EvaluationRun.batch_job_id); fix by persisting all created batch job
IDs instead of only the first: collect batch_job.id values from start_batch_job
into batch_jobs (or a list batch_job_ids) and call update_stt_run with that
collection (e.g., update_stt_run(session, run_id=run.id, status="processing",
batch_job_id=None, metadata={"batch_job_ids": batch_job_ids}) or add a new
parameter/field to store batch_job_ids); alternatively, if multi-model runs are
not supported, add an explicit assertion or comment near the loop and avoid
creating multiple jobs — reference functions/vars: start_batch_job, batch_jobs,
first_batch_job_id, update_stt_run, and poll_all_pending_stt_evaluations when
making the change.

Comment on lines +282 to +284
all_terminal = True
any_succeeded = False
any_failed = False

⚠️ Potential issue | 🟡 Minor

any_succeeded is assigned but never used.

The variable is set on lines 283 and 329 but never read. Static analysis (Ruff F841) confirms this. Either remove it or incorporate it into the final status logic if the intent was to distinguish "all failed" from "partially succeeded."

Proposed fix — remove unused variable
     all_terminal = True
-    any_succeeded = False
     any_failed = False
     errors: list[str] = []

     for batch_job in batch_jobs:
         provider_name = batch_job.config.get("stt_provider", "unknown")

         # Skip batch jobs already in terminal state that have been processed
         if batch_job.provider_status in TERMINAL_STATES:
             if batch_job.provider_status == BatchJobState.SUCCEEDED.value:
-                any_succeeded = True
+                pass
             else:
                 any_failed = True
...
         if provider_status == BatchJobState.SUCCEEDED.value:
             await process_completed_stt_batch(
                 session=session,
                 run=run,
                 batch_job=batch_job,
                 batch_provider=batch_provider,
             )
-            any_succeeded = True
         else:

Also applies to: 329-329

🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/cron.py` around lines 282 - 284, remove the
unused boolean any_succeeded (set at any_succeeded = False and later modified)
or else use it in the final-status calculation; specifically, edit the block
that defines/updates all_terminal, any_failed, and any_succeeded (search for
any_succeeded and any_failed in cron.py) and either delete all references and
assignments to any_succeeded, or integrate it into the terminal-state logic
(e.g., use any_succeeded to distinguish "all failed" vs "partially succeeded"
when computing the overall status) and remove any linter warnings. Ensure no
other code relies on any_succeeded before removing.

Comment on lines +346 to +363
# All batch jobs are done - finalize the run
status_counts = count_results_by_status(session=session, run_id=run.id)
pending = status_counts.get(STTResultStatus.PENDING.value, 0)
failed_count = status_counts.get(STTResultStatus.FAILED.value, 0)

final_status = "completed" if pending == 0 else "processing"
error_message = None
if any_failed:
error_message = "; ".join(errors)
elif failed_count > 0:
error_message = f"{failed_count} transcription(s) failed"

update_stt_run(
session=session,
run_id=run.id,
status=final_status,
error_message=error_message,
)

⚠️ Potential issue | 🟠 Major

Run stuck in "processing" forever when pending results remain after all batches are terminal.

When all_terminal is True but pending > 0 (e.g., unmatched custom_ids or missing result records), final_status is set to "processing" (line 351). On the next cron cycle, all batch jobs are already terminal and skipped (line 291-299), so the same finalization logic re-runs and sets "processing" again — creating an infinite loop.

If all batches have completed but results are still pending, those results will never be resolved. Mark the run as "completed" (or "failed") with an appropriate error message instead.

Proposed fix
-    final_status = "completed" if pending == 0 else "processing"
+    if pending > 0:
+        # All batches are terminal but some results weren't matched/processed
+        final_status = "completed"
+        if error_message:
+            error_message += f"; {pending} result(s) still unresolved"
+        else:
+            error_message = f"{pending} result(s) still unresolved after all batches completed"
+    else:
+        final_status = "completed"
🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/cron.py` around lines 346 - 363, the run
finalization currently leaves runs stuck in "processing" when all batches are
terminal but there remain pending results; modify the finalization logic in the
block using count_results_by_status, STTResultStatus, any_failed, errors and
update_stt_run so that if all_terminal is True and pending > 0 you treat the run
as terminal (set status to "completed" or "failed") and populate error_message
accordingly (e.g., include failed_count and a message about unresolved pending
results or list errors); ensure update_stt_run is called with the terminal
status and a clear error_message so subsequent cron cycles won't reprocess the
same run.

Comment on lines +435 to +452
if result_record:
if batch_result.get("response"):
text = batch_result["response"].get("text", "")
update_stt_result(
session=session,
result_id=result_record.id,
transcription=text,
status=STTResultStatus.COMPLETED.value,
)
processed_count += 1
else:
update_stt_result(
session=session,
result_id=result_record.id,
status=STTResultStatus.FAILED.value,
error_message=batch_result.get("error", "Unknown error"),
)
failed_count += 1

⚠️ Potential issue | 🟡 Minor

Silently skips results when result_record is not found — no logging or error counting.

If result_record is None (line 433), the loop continues implicitly without any warning. This makes it impossible to diagnose why results are stuck in PENDING after batch completion. At minimum, log a warning and increment failed_count.

Proposed fix
             result_record = session.exec(stmt).one_or_none()

             if result_record:
                 if batch_result.get("response"):
                     ...
                 else:
                     ...
+            else:
+                logger.warning(
+                    f"[process_completed_stt_batch] No result record found | "
+                    f"run_id={run.id}, sample_id={sample_id}, provider={stt_provider}"
+                )
+                failed_count += 1
🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/cron.py` around lines 435 - 452, the loop
currently skips when result_record is None which hides failures; update the
branch handling to log a warning (including identifiers like the batch_result
payload or an id/metadata if available) and increment failed_count so missing DB
rows are visible; specifically, in the block where you check result_record (the
variable result_record inside the loop that later calls update_stt_result), add
a processLogger.warning (or existing logger) call describing the missing
result_record and the batch_result context, and increment failed_count (not
processed_count) so metrics reflect the skipped result instead of silently
continuing.

Comment on lines 42 to 48
provider: ProviderType = (
Field(
nullable=False,
-        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
+        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'google', etc)",
sa_column_kwargs={"comment": "LLM provider used for this collection"},
),
)

⚠️ Potential issue | 🟡 Minor

Description references 'google' but ProviderType enum doesn't include it.

The description on line 45 lists 'google' as an example provider, but the ProviderType enum (lines 15–20) only defines `openai` (GEMINI is commented out), and there is no GOOGLE member. This is misleading for API consumers who see the field description (e.g., in OpenAPI docs). Either add GOOGLE to ProviderType or keep the description aligned with what's actually accepted.

🤖 Prompt for AI Agents
In `@backend/app/models/collection.py` around lines 42 - 48, the provider Field's
description mentions 'google' but the ProviderType enum (ProviderType) does not
include a GOOGLE/GEMINI member; fix by either adding a GOOGLE (or GEMINI) enum
value to ProviderType (ensure downstream serialization/name matches existing
enum style) or update the provider Field description in collection.py to only
list supported providers (e.g., 'openai') so OpenAPI/docs match the actual
ProviderType; modify the ProviderType definition or the description string in
the provider Field accordingly and run tests/type checks to confirm no
regressions.
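A sketch of the enum-side option (member names and the str mixin are assumptions; the project's actual ProviderType may be defined differently):

```python
# Illustrative ProviderType with a GOOGLE member added so the field
# description's 'google' example is actually an accepted value.
from enum import Enum


class ProviderType(str, Enum):
    OPENAI = "openai"
    GOOGLE = "google"
```

With a `str` mixin, `ProviderType("google")` round-trips from the serialized value, so OpenAPI docs and accepted inputs stay in sync.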

@AkhileshNegi AkhileshNegi merged commit 4704afc into main Feb 9, 2026
2 of 3 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Feb 12, 2026
2 tasks
This was referenced Feb 24, 2026
@AkhileshNegi AkhileshNegi deleted the feature/stt-evaluation branch April 2, 2026 04:01

Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

Evaluation: STT evals

3 participants