
Evaluation: Improve error handling for batch job#724

Merged
AkhileshNegi merged 6 commits into main from enhancement/evaluation-errors
Apr 1, 2026

Conversation


@AkhileshNegi AkhileshNegi commented Mar 28, 2026

Summary

Target issue is #723

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Bug Fixes

  • Improved error handling for batch evaluations that complete with all requests failed. The system now extracts and reports aggregated error messages with request counts for better visibility into failure causes.

Tests

  • Added comprehensive test coverage for batch error extraction and handling of all-failed batch completion scenarios.

Screenshot

(screenshot image attached to the PR)

Summary by CodeRabbit

  • Bug Fixes

    • Improved batch evaluation failure handling: when a provider run completes with no output but an error file, the system now aggregates the most frequent error from the batch, marks the evaluation failed, and records a concise "top error (count/total requests)" message.
  • Tests

    • Added tests covering aggregated error extraction and the all-requests-failed completion path to ensure correct failure reporting and persisted error messages.

@AkhileshNegi AkhileshNegi self-assigned this Mar 28, 2026
@AkhileshNegi AkhileshNegi marked this pull request as ready for review March 28, 2026 06:10

coderabbitai bot commented Mar 28, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: efd739df-9c84-4b5e-bea2-ba4839191762

📥 Commits

Reviewing files that changed from the base of the PR and between 589efc2 and d009d19.

📒 Files selected for processing (1)
  • backend/app/tests/crud/evaluations/test_processing.py

📝 Walkthrough

Walkthrough

Detects OpenAI batch jobs that finish with only an error file, downloads and aggregates JSONL error messages, persists an aggregated error to the batch job and eval run, and returns an explicit failed result when all requests failed.

Changes

Cohort / File(s) Summary
Batch Error Extraction & Flow
backend/app/crud/evaluations/processing.py
Added _extract_batch_error_message() to download/parse JSONL error files, choose the most frequent response.body.error.message, append (N/N requests), call update_batch_job to persist batch_job.error_message, and extended check_and_process_evaluation() to handle provider_status == "completed" when provider_output_file_id is missing but error_file_id exists by marking the eval run failed and returning { action: "failed", error: ... }. Also updated imports for update_batch_job, BatchJob, and BatchJobUpdate.
Tests — error aggregation & flows
backend/app/tests/crud/evaluations/test_processing.py
Added TestExtractBatchErrorMessage unit tests for single/multiple error aggregation. Added test_check_and_process_evaluation_completed_all_requests_failed to mock download_file returning JSONL errors and assert returned failure and persisted eval_run.error_message/batch_job.error_message. Adjusted existing completion/failure tests to reflect provider_output_file_id variations and updated imports.
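The aggregation step described above can be sketched as follows. This is a hypothetical standalone version: the real `_extract_batch_error_message()` also downloads the error file and persists the result via `update_batch_job`, and the record shape is assumed from the `response.body.error.message` path named in the summary.

```python
import json
from collections import Counter


def aggregate_batch_errors(error_jsonl: str) -> str:
    """Pick the most frequent error message from a batch error file.

    Sketch of the aggregation only; downloading the file and persisting
    the result are handled elsewhere in the real code.
    """
    messages: list[str] = []
    for line in error_jsonl.splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than fail the whole batch
        msg = (
            record.get("response", {})
            .get("body", {})
            .get("error", {})
            .get("message")
        )
        if msg:
            messages.append(msg)
    if not messages:
        return "Batch completed with errors, but no error messages could be parsed"
    # Most frequent message, annotated with its share of all failed requests.
    top_msg, count = Counter(messages).most_common(1)[0]
    return f"{top_msg} ({count}/{len(messages)} requests)"
```

A tie in `Counter.most_common` is broken by insertion order, which is good enough for a diagnostic message like this.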

Sequence Diagram(s)

sequenceDiagram
  participant Evaluator as EvaluatorService
  participant Poller as poll_batch_status
  participant Storage as BatchProviderStorage
  participant DB as Database(update_batch_job / eval_run)
  participant Logger as Logger

  rect rgba(100,150,250,0.5)
  Evaluator->>Poller: poll_batch_status(batch_job_id)
  Poller-->>Evaluator: provider_status="completed", provider_output_file_id=null, error_file_id
  end

  rect rgba(150,200,100,0.5)
  Evaluator->>Storage: download_file(error_file_id)
  Storage-->>Evaluator: JSONL error records
  Evaluator->>Evaluator: parse & aggregate most frequent error (msg, count, total)
  Evaluator->>DB: update_batch_job(batch_job_id, error_message)
  Evaluator->>DB: mark eval_run as failed with error_message
  Evaluator->>Logger: log failure with aggregated message
  Evaluator-->>Client: return { action: "failed", error: aggregated_message }
  end
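In Python, the completed-but-all-failed branch from the diagram might look roughly like this. It is a sketch under assumptions: the `status_result` keys mirror the walkthrough (`provider_status`, `provider_output_file_id`, `error_file_id`), `extract_error` stands in for `_extract_batch_error_message()`, and the `process_output` action name is invented for illustration.

```python
from typing import Any, Callable


def handle_completed_status(
    status_result: dict[str, Any],
    extract_error: Callable[[str], str],
) -> dict[str, Any]:
    """Decide what to do when the provider reports a batch as completed.

    Sketch only: the real check_and_process_evaluation() also marks the
    eval run failed in the database and logs the aggregated message.
    """
    # Use the freshly polled file IDs rather than stale DB columns
    # (see the review note about batch_job.provider_output_file_id).
    output_file_id = status_result.get("provider_output_file_id")
    error_file_id = status_result.get("error_file_id")

    if output_file_id is None and error_file_id:
        # No output file at all but an error file: every request failed.
        error_message = extract_error(error_file_id)
        return {"action": "failed", "error": error_message}

    # Hypothetical happy-path action name for illustration.
    return {"action": "process_output", "output_file_id": output_file_id}
```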

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • vprashrex
  • Ayush8923
  • Prajna1999

Poem

🐰 I dug through lines of JSONL bright,

counted errors in the night,
the top mistake I gently send,
stamped with counts before the end—
a hop, a note, logs tucked tight 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed: check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: the title clearly describes the main change, improving error handling for batch jobs, and directly aligns with the PR's core objective of extracting and reporting aggregated error messages when batch evaluations fail.
  • Docstring Coverage — ✅ Passed: docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).


@AkhileshNegi AkhileshNegi linked an issue Mar 28, 2026 that may be closed by this pull request

codecov bot commented Mar 28, 2026

Codecov Report

❌ Patch coverage is 94.11765% with 6 lines in your changes missing coverage. Please review.

Files with missing lines:
  • backend/app/crud/evaluations/processing.py — patch coverage 81.81%, 6 lines missing ⚠️



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
backend/app/tests/crud/evaluations/test_processing.py (1)

803-839: Please move this scenario behind a factory-backed fixture.

This test duplicates BatchJob and EvaluationRun construction that's already repeated elsewhere in the module. A factory/fixture would keep the test focused on the failure behavior and make future model changes cheaper to absorb.

As per coding guidelines, Use factory pattern for test fixtures in backend/app/tests/.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/crud/evaluations/test_processing.py` around lines 803 -
839, Extract the inline BatchJob and EvaluationRun setup into a reusable
factory-backed fixture and replace the duplicated construction in this test with
that fixture; specifically, create a fixture that constructs a BatchJob (using
the same fields: provider, provider_batch_id "batch_all_fail", provider_status
"completed", job_type BatchJobType.EVALUATION, total_items, status,
organization_id, project_id, timestamps) and an EvaluationRun created via
create_evaluation_run (then set eval_run.batch_job_id and
eval_run.status="processing"), ensure provider_output_file_id remains unset,
register the fixture in the test module and update this test to accept the
fixture instead of performing inline DB.add/commit/refresh calls so other tests
can reuse BatchJob and EvaluationRun setups.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/crud/evaluations/processing.py`:
- Around line 105-112: The except block in _extract_batch_error_message is
currently catching all exceptions (including DB write errors from
update_batch_job) and returning a generic message, which can hide persistence
failures; change the logic so that only errors related to extracting/formatting
the error message are swallowed here, and any exception raised by
update_batch_job (or any persistence call that sets batch_job.error_message) is
not suppressed—either move the call to update_batch_job out of this broad
try/except, or catch Exception e, detect if it originates from update_batch_job
(or re-raise after logging), ensuring update_batch_job failures are logged and
propagated (do not return the fallback string when update_batch_job fails) so
the caller can observe persistence errors and batch_job.error_message isn't left
unpersisted.
- Around line 631-645: In the provider_status == "completed" branch, the code
incorrectly checks batch_job.provider_output_file_id (which may be stale) to
decide if all requests failed; instead use the freshly polled value from
poll_batch_status (status_result["provider_output_file_id"]) when deciding the
"all requests failed" path—i.e., replace or supplement checks that reference
batch_job.provider_output_file_id with the status_result value (also consider
status_result.get("error_file_id") as already used) so a completed row with a
newly returned provider_output_file_id is not misclassified; see
poll_batch_status, status_result, and batch_job.provider_output_file_id for the
exact symbols to update.

In `@backend/app/tests/crud/evaluations/test_processing.py`:
- Around line 794-801: The test coroutine
test_check_and_process_evaluation_completed_all_requests_failed is missing type
annotations for its mock parameters and return type; update its signature to add
explicit types (e.g. mock_provider_cls: Mock, mock_poll: Mock, mock_get_batch:
Mock, db: Session, test_dataset: Any) and add the return annotation -> None, and
ensure you import typing.Any and unittest.mock.Mock (or the preferred mock type
used in the repo) at the top of the test file so the annotations resolve.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0c1e6a24-d519-49de-95dc-90fa9c6f75f4

📥 Commits

Reviewing files that changed from the base of the PR and between e577369 and 3f5c0b1.

📒 Files selected for processing (2)
  • backend/app/crud/evaluations/processing.py
  • backend/app/tests/crud/evaluations/test_processing.py

Comment on lines +794 to +801

    async def test_check_and_process_evaluation_completed_all_requests_failed(
        self,
        mock_provider_cls,
        mock_poll,
        mock_get_batch,
        db: Session,
        test_dataset,
    ):


⚠️ Potential issue | 🟡 Minor

Add the missing type annotations on the new test coroutine.

The injected mocks and test_dataset fixture are untyped here, and the function is also missing -> None. That breaks the repo-wide Python typing rule.

Suggested fix
     async def test_check_and_process_evaluation_completed_all_requests_failed(
         self,
-        mock_provider_cls,
-        mock_poll,
-        mock_get_batch,
+        mock_provider_cls: MagicMock,
+        mock_poll: MagicMock,
+        mock_get_batch: MagicMock,
         db: Session,
-        test_dataset,
-    ):
+        test_dataset: EvaluationDataset,
+    ) -> None:

As per coding guidelines, Always add type hints to all function parameters and return values in Python code.


@AkhileshNegi AkhileshNegi requested a review from Ayush8923 April 1, 2026 04:39
@AkhileshNegi AkhileshNegi merged commit ed9d789 into main Apr 1, 2026
1 check passed
@AkhileshNegi AkhileshNegi deleted the enhancement/evaluation-errors branch April 1, 2026 04:51

Development

Successfully merging this pull request may close these issues.

Batch Job: Improve error handling

2 participants