
Collections: batch size default to 10 #689

Merged
AkhileshNegi merged 2 commits into main from enhancement/batch_size_10
Mar 18, 2026

Conversation

Collaborator

@nishika26 nishika26 commented Mar 18, 2026

Summary

Target issue is #690

Notes

  • Documentation

    • Added documentation explaining the batch_size parameter that controls document batching during collection creation, helping optimize performance for large uploads.
  • Enhancements

    • Updated the default batch size from 1 to 10, improving efficiency when processing large document uploads.

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive documentation for the batch_size parameter, describing how it controls the number of documents sent per transaction during vector store creation and highlighting optimization benefits for large uploads.
  • Improvements

    • Optimized default batch size for document processing from 1 to 10 documents per transaction to improve upload performance for large datasets.
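To make the effect of the new default concrete, here is a minimal sketch of how many transactions a given upload would need. It assumes batches are filled in order and the last one may be partial; `transactions_needed` is a hypothetical helper for illustration, not a function from this PR.

```python
import math


def transactions_needed(num_documents: int, batch_size: int = 10) -> int:
    """Return how many batches are needed to send all documents.

    Hypothetical helper: assumes each transaction carries up to
    `batch_size` documents and the final batch may be smaller.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return math.ceil(num_documents / batch_size)


# 25 documents with the old default of 1 versus the new default of 10.
print(transactions_needed(25, batch_size=1))   # 25
print(transactions_needed(25, batch_size=10))  # 3
```

Under these assumptions, a 25-document upload drops from 25 transactions to 3.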

@nishika26 nishika26 requested review from AkhileshNegi and Prajna1999 and removed request for AkhileshNegi March 18, 2026 12:56

coderabbitai bot commented Mar 18, 2026

📝 Walkthrough


This PR changes the default collection batch_size from 1 to 10, updates docs to document the parameter, applies a service-layer fallback to 10 when unspecified, and updates tests to use the new default.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation & Model: backend/app/api/docs/collections/create.md, backend/app/models/collection.py | Added doc text describing batch_size; changed CollectionOptions default batch_size from 1 to 10. |
| Service Provider: backend/app/services/collections/providers/openai.py | Use a fallback/default batch_size of 10 when request value is falsy, and pass that into batching logic. |
| Tests: backend/app/tests/api/routes/collections/test_create_collections.py, backend/app/tests/services/collections/providers/test_openai_provider.py, backend/app/tests/services/collections/test_create_collection.py | Updated test payloads to expect/use batch_size = 10 (replacing prior 1 values) to align with new default and provider behavior. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Poem

🐰 Ten little hops, ten docs at play,
I batch them neatly, off they sway.
Soft vectors hum, the pipeline's bright,
A rabbit's nibble, data takes flight! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: increasing the default batch size for collections from 1 to 10, which is the primary alteration across the codebase. |


@nishika26 nishika26 requested a review from vprashrex March 18, 2026 12:56
@nishika26 nishika26 self-assigned this Mar 18, 2026
@nishika26 nishika26 added the documentation (Improvements or additions to documentation) and ready-for-review labels Mar 18, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/models/collection.py (1)

105-112: ⚠️ Potential issue | 🟡 Minor

Add validation constraint to prevent invalid batch_size values.

The batch_size field accepts any integer including 0 or negative values, but the provider layer in openai.py uses batch_size or 10, silently coercing falsy values to 10. This creates an inconsistency where the API accepts batch_size=0 without feedback but internally uses 10.

Add a ge=1 constraint to reject invalid values at the model validation layer:

🛡️ Proposed fix
     batch_size: int = Field(
         default=10,
+        ge=1,
         description=(
             "Number of documents to send to OpenAI in a single "
             "transaction. See the `file_ids` parameter in the "
             "vector store [create batch](https://platform.openai.com/docs/api-reference/vector-stores-file-batches/createBatch)."
         ),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/collection.py` around lines 105 - 112, The collection
model's batch_size Field currently allows zero or negative ints causing silent
coercion in the provider (openai.py) — update the Field definition for
batch_size in collection.py to add a validation constraint ge=1 so values must
be >=1 (i.e., change the Field(...) call for batch_size to include ge=1) so
API-level validation rejects invalid inputs before reaching provider logic; keep
the existing description unchanged.
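The silent coercion described in this comment follows from how Python's `or` treats falsy values. The sketch below mirrors the `batch_size or 10` expression quoted above; `resolve_with_or` is a hypothetical wrapper used only to make the behavior visible.

```python
DEFAULT_BATCH_SIZE = 10


def resolve_with_or(requested):
    """Mirror the `batch_size or 10` expression quoted in the review."""
    return requested or DEFAULT_BATCH_SIZE


# None and 0 are both falsy, so both fall back to the default.
print(resolve_with_or(None))  # 10
print(resolve_with_or(0))     # 10
# Positive values pass through unchanged.
print(resolve_with_or(25))    # 25
# Negative values are truthy, so they also pass through unchanged.
print(resolve_with_or(-5))    # -5
```

Note that the fallback only catches `None` and `0`; a negative value is truthy and would reach the batching logic as-is, which is another reason to reject invalid values at the model layer.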
🧹 Nitpick comments (1)
backend/app/services/collections/providers/openai.py (1)

33-38: Redundant fallback creates silent coercion for batch_size=0.

The or 10 fallback is problematic: if a user explicitly passes batch_size=0, it silently becomes 10 with no validation error or warning. Since CollectionOptions already defaults to 10, this fallback only matters for falsy values (0, None).

If validation is added to the model (recommended), this fallback becomes unnecessary. Otherwise, consider explicit validation here:

♻️ Option 1: Remove fallback if model validates ge=1
-            # Use user-provided batch_size, default to 10 if not set
-            batch_size = collection_request.batch_size or 10
+            batch_size = collection_request.batch_size
             docs_batches = batch_documents(
                 document_crud,
                 collection_request.documents,
                 batch_size,
             )
♻️ Option 2: Keep fallback but add explicit check for invalid values
-            # Use user-provided batch_size, default to 10 if not set
-            batch_size = collection_request.batch_size or 10
+            # Use user-provided batch_size, default to 10 if not set or invalid
+            if collection_request.batch_size is None or collection_request.batch_size < 1:
+                batch_size = 10
+            else:
+                batch_size = collection_request.batch_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/providers/openai.py` around lines 33 - 38,
The current fallback "batch_size = collection_request.batch_size or 10" silently
converts invalid falsy values (e.g., 0) to 10; update the logic in the block
that sets batch_size (referencing collection_request.batch_size and
batch_documents) to handle validation explicitly: if
collection_request.batch_size is None keep the default of 10 (or use
CollectionOptions default), but if it's provided validate that it's an integer
>= 1 and raise a clear error (or return a validation response) for invalid
values (like 0 or negatives) before calling batch_documents; alternatively,
remove the fallback entirely if CollectionOptions already enforces ge=1.
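A minimal sketch of the explicit handling suggested in this comment, assuming a missing value should fall back to 10 and any other invalid value should be rejected. `resolve_batch_size` and `DEFAULT_BATCH_SIZE` are hypothetical names, not part of the PR.

```python
from typing import Optional

DEFAULT_BATCH_SIZE = 10  # assumed to match the model default in this PR


def resolve_batch_size(requested: Optional[int]) -> int:
    """Return a usable batch size or raise for invalid input.

    Hypothetical helper: None means "not provided" and maps to the
    default; anything else must be an integer >= 1.
    """
    if requested is None:
        return DEFAULT_BATCH_SIZE
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise ValueError(f"batch_size must be an integer, got {requested!r}")
    if requested < 1:
        raise ValueError(f"batch_size must be >= 1, got {requested}")
    return requested


print(resolve_batch_size(None))  # 10
print(resolve_batch_size(3))     # 3
try:
    resolve_batch_size(0)
except ValueError as exc:
    print(exc)  # batch_size must be >= 1, got 0
```

If the model already enforces `ge=1`, a helper like this would be redundant and the provider could read the value directly, as Option 1 proposes.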
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/tests/services/collections/test_create_collection.py`:
- Around line 351-356: The CreationRequest instantiation in sample_request is
invalid because a positional argument (0) appears after a keyword argument; fix
by either moving the positional 0 before any keyword arguments or, preferably,
convert it to its explicit keyword name (e.g., priority=0 or the correct
parameter name for that positional arg) inside the CreationRequest call so the
call becomes syntactically valid.
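The rule behind this finding can be checked without running the test file: Python rejects a positional argument that follows a keyword argument at compile time. The sketch below uses the built-in `compile` on small call expressions; `make_request` and `priority` are hypothetical names chosen for illustration.

```python
def compiles(source: str) -> bool:
    """Return True if `source` parses as a valid Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# Positional argument after a keyword argument: rejected by the parser.
print(compiles("make_request(documents=[1], 0)"))           # False
# Moving the positional argument first is valid syntax.
print(compiles("make_request(0, documents=[1])"))           # True
# Giving the value an explicit keyword name is also valid.
print(compiles("make_request(documents=[1], priority=0)"))  # True
```

`compile` only parses the expression, so the undefined `make_request` name does not matter here.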


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a4b676f7-183e-45a8-b409-b8c881d53983

📥 Commits

Reviewing files that changed from the base of the PR and between 824cd26 and d014db2.

📒 Files selected for processing (6)
  • backend/app/api/docs/collections/create.md
  • backend/app/models/collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/tests/api/routes/collections/test_create_collections.py
  • backend/app/tests/services/collections/providers/test_openai_provider.py
  • backend/app/tests/services/collections/test_create_collection.py


codecov bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



 sample_request = CreationRequest(
-    documents=[document.id], batch_size=1, callback_url=None, provider="openai"
+    documents=[document.id], batch_size=10, callback_url=None, provider="openai"
Collaborator


Not sure how this works, but increasing the batch size while still passing a single document ID does not make sense. There should be a list of documents somewhere, but I am not sure about the rest of the code.

Collaborator


Yes, we can write a test where the number of documents is greater than 10 and then check whether batching happens as expected.
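A sketch of what such a test might look like, under stated assumptions: `batch_ids` is a stand-in for the project's batching logic and `FakeVectorStore` is a hypothetical test double that records each upload call. Neither reflects the real signatures in this repository.

```python
from typing import Iterator, List, Sequence


def batch_ids(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Stand-in batching helper: yield consecutive slices of `ids`."""
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


class FakeVectorStore:
    """Hypothetical test double that records every batch it receives."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def upload_batch(self, file_ids: List[str]) -> None:
        self.calls.append(list(file_ids))


def test_batches_when_docs_exceed_batch_size() -> None:
    ids = [f"doc-{i}" for i in range(25)]
    store = FakeVectorStore()

    for batch in batch_ids(ids, batch_size=10):
        store.upload_batch(batch)

    # 25 documents with batch_size=10 should produce batches of 10, 10, 5.
    assert [len(call) for call in store.calls] == [10, 10, 5]
    # Every document is sent exactly once, in the original order.
    assert [doc for call in store.calls for doc in call] == ids


test_batches_when_docs_exceed_batch_size()
```

Using more documents than the batch size, with a count that is not a multiple of it, exercises both the full batches and the partial final batch.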

@AkhileshNegi AkhileshNegi changed the title from "collections: batch size default to 10" to "Collections: batch size default to 10" Mar 18, 2026
@AkhileshNegi AkhileshNegi merged commit a6a7e86 into main Mar 18, 2026
3 checks passed
@AkhileshNegi AkhileshNegi deleted the enhancement/batch_size_10 branch March 18, 2026 13:46
@AkhileshNegi AkhileshNegi linked an issue Mar 18, 2026 that may be closed by this pull request
@coderabbitai coderabbitai bot mentioned this pull request Mar 25, 2026
2 tasks

Labels

documentation (Improvements or additions to documentation), ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Collections: set batch size default to 10

3 participants