
Collection: making the module provider agnostic#508

Closed
nishika26 wants to merge 10 commits into main from
enhancement/collection_provider_agnostic

Conversation

@nishika26
Collaborator

@nishika26 nishika26 commented Dec 24, 2025

Summary

Target issue is #489

Checklist

Before submitting a pull request, please ensure that you complete these tasks:

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested the changes.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

  • New Features

    • Provider-based collections: create/delete collections via pluggable LLM providers.
    • Collections can have a name and description; names are now validated for uniqueness.
    • New Evaluations APIs: upload/list/get/delete datasets and start/list/get evaluation runs.
  • Refactor

    • Collection workflows reworked to use provider abstractions (more extensible).
    • Evaluation service and dataset flows introduced to centralize evaluation functionality.
  • Tests

    • Updated test utilities and fixtures to exercise provider-based and evaluation flows.


@coderabbitai

coderabbitai bot commented Dec 24, 2025

📝 Walkthrough

Walkthrough

This PR introduces a provider-based collection provisioning system, expands collection schema (provider/name/description) with a DB migration, implements a provider registry and OpenAI provider, adds a full Evaluations feature (services, validators, routes, tests), refactors batch/job imports and polling, removes per-key API key helpers, and updates many tests and fixtures.

Changes

Cohort / File(s) Summary
Models & Migration
backend/app/models/collection.py, backend/app/models/__init__.py, backend/app/models/organization.py, backend/app/alembic/versions/041_extend_collection_table_for_provider_.py
Add ProviderType and provider, name, description to Collection; remove Organization.collections and organization_id from Collection; update request/option models; add DB migration to add provider/name/description and drop organization_id.
Provider System
backend/app/services/collections/providers/__init__.py, .../base.py, .../openai.py, .../registry.py
New BaseProvider abstract class, OpenAIProvider implementation, and provider registry with get_llm_provider() factory and exports.
Collection Services & API
backend/app/services/collections/create_collection.py, .../delete_collection.py, .../helpers.py, backend/app/api/routes/collections.py, backend/app/crud/collection/collection.py
Replace legacy OpenAI-specific paths with provider-based create/delete/cleanup flows; add get_service_name() and ensure_unique_name(); call ensure_unique_name() in route; add CollectionCrud.exists_by_name().
Batch / Job Refactor
backend/app/core/batch/__init__.py, .../operations.py, .../polling.py, backend/app/crud/job/*, backend/app/crud/job/batch.py
Re-export batch functions from new core locations, remove inline poll in operations, add poll_batch_status module, add crud.job re-export layer and job/batch re-exports.
Evaluations Feature
backend/app/api/routes/evaluations/*, backend/app/services/evaluations/*, backend/app/services/evaluations/validators.py, backend/app/crud/evaluations/*
Remove old evaluation route file; add new evaluations package (routes: dataset, evaluation), services (dataset/evaluation), validators, CRUD adjustments, and associated orchestration (Langfuse/object store integration).
Security
backend/app/core/security.py, backend/app/tests/core/test_security.py
Remove per-key encrypt_api_key / decrypt_api_key helpers and corresponding tests.
Tests & Utilities
many files under backend/app/tests/..., backend/app/tests/utils/llm_provider.py, backend/app/tests/utils/collection.py, backend/app/tests/utils/document.py
Widespread test updates to use provider mocks (get_mock_provider), rename helpers (get_collection → get_assistant_collection, add get_vector_store_collection), move/remove fixtures (crawler), update import paths, and add many new evaluation tests.
Misc & CI
backend/app/api/main.py, .github/workflows/*, backend/app/core/batch/operations.py
Switch main routing import to new evaluations module; change GitHub action upload-artifact version; add if: false to staging CD job.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant CreateService as create_collection.py
    participant Registry as get_llm_provider
    participant Provider as OpenAIProvider
    participant DocumentCrud
    participant Storage
    participant Database

    Client->>CreateService: execute_job(request, with_assistant, ...)
    CreateService->>Registry: get_llm_provider(provider)
    Registry-->>CreateService: provider instance
    CreateService->>Provider: create(collection_request, Storage, DocumentCrud)
    Provider->>DocumentCrud: batch_documents / read each
    Provider->>Storage: upload files (if used)
    Provider->>Provider: create vector store / assistant
    Provider-->>CreateService: Collection result (llm_service_id, llm_service_name)
    CreateService->>Database: persist Collection with provider metadata
    Database-->>CreateService: saved
    CreateService-->>Client: callback success/failure
sequenceDiagram
    participant Client
    participant DeleteService as delete_collection.py
    participant Database
    participant Registry as get_llm_provider
    participant Provider as OpenAIProvider

    Client->>DeleteService: start_job(collection_id,...)
    DeleteService->>Database: fetch Collection
    Database-->>DeleteService: Collection(provider, llm_service_id, llm_service_name)
    DeleteService->>Registry: get_llm_provider(collection.provider)
    Registry-->>DeleteService: provider instance
    DeleteService->>Provider: delete(collection)
    Provider-->>DeleteService: provider deletion result
    DeleteService->>Database: delete local Collection record
    DeleteService-->>Client: callback success/failure

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • avirajsingh7
  • Prajna1999

Poem

🐰 I hopped through models, tests, and code,
A provider path where old flows strode,
OpenAI tucked inside a class so neat,
Jobs create and delete with tidy feet,
Cheers — a rabbit’s dance for CI green light! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 76.40%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Collection: making the module provider agnostic' is clear and directly summarizes the main objective of the changeset.



@nishika26 nishika26 self-assigned this Dec 24, 2025
@nishika26 nishika26 added the enhancement New feature or request label Dec 24, 2025
@nishika26 nishika26 linked an issue Dec 24, 2025 that may be closed by this pull request
@nishika26 nishika26 marked this pull request as ready for review December 26, 2025 04:23

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/collections/create_collection.py (1)

269-270: Potential NameError if CreationRequest parsing fails.

If CreationRequest(**request) on line 156 raises an exception, creation_request is never assigned. The check on line 269 will then raise a NameError.

Proposed fix: initialize creation_request before try block or guard the check
+    creation_request = None
+
     try:
         creation_request = CreationRequest(**request)
         # ...

     except Exception as err:
         # ...

-        if creation_request and creation_request.callback_url and collection_job:
+        if creation_request is not None and creation_request.callback_url and collection_job:
             failure_payload = build_failure_payload(collection_job, str(err))
             send_callback(creation_request.callback_url, failure_payload)
🧹 Nitpick comments (11)
backend/app/services/collections/helpers.py (1)

17-25: Consider raising an error or logging for unknown providers.

Returning an empty string for unknown providers could lead to silent failures downstream. Consider logging a warning or raising a ValueError for unsupported providers to make debugging easier.

🔎 Suggested improvement
 def get_service_name(provider: str) -> str:
     """Get the collection service name for a provider."""
     names = {
         "openai": "openai vector store",
         #   "bedrock": "bedrock knowledge base",
         #  "gemini": "gemini file search store",
     }
-    return names.get(provider.lower(), "")
+    service_name = names.get(provider.lower())
+    if service_name is None:
+        logger.warning(f"[get_service_name] Unknown provider: {provider}")
+        return ""
+    return service_name
backend/app/services/collections/providers/base.py (3)

30-53: Docstring parameters don't match the method signature.

The docstring mentions batch_size, with_assistant, and assistant_options parameters that don't exist in the actual method signature. Also:

  • Line 48: "CreateCollectionresult" → "CreateCollectionResult" (typo)
  • Line 51: "kb_blob" → "collection_blob" (field name mismatch)
  • Line 53: error message says "execute method" but method is named "create"
Proposed fix
     @abstractmethod
     def create(
         self,
         collection_request: CreationRequest,
         storage: CloudStorage,
         document_crud: DocumentCrud,
     ) -> CreateCollectionResult:
         """Create collection with documents and optionally an assistant.

         Args:
-            collection_params: Collection parameters (name, description, chunking_params, etc.)
+            collection_request: Creation request containing collection params and options
             storage: Cloud storage instance for file access
             document_crud: DocumentCrud instance for fetching documents
-            batch_size: Number of documents to process per batch
-            with_assistant: Whether to create an assistant/agent
-            assistant_options: Options for assistant creation (provider-specific)

         Returns:
-            CreateCollectionresult containing:
+            CreateCollectionResult containing:
             - llm_service_id: ID of the created resource (vector store or assistant)
             - llm_service_name: Name of the service
-            - kb_blob: All collection params except documents
+            - collection_blob: All collection params except documents
         """
-        raise NotImplementedError("Providers must implement execute method")
+        raise NotImplementedError("Providers must implement create method")

55-65: Docstring Args don't match the method signature.

The docstring mentions llm_service_id and llm_service_name as parameters, but the actual signature only accepts collection: Collection.

Proposed fix
     @abstractmethod
     def delete(self, collection: Collection) -> None:
         """Delete remote resources associated with a collection.

         Called when a collection is being deleted and remote resources need to be cleaned up.

         Args:
-            llm_service_id: ID of the resource to delete
-            llm_service_name: Name of the service (determines resource type)
+            collection: The collection whose remote resources should be deleted
         """
         raise NotImplementedError("Providers must implement delete method")

67-76: Typo in docstring.

Line 74: "CreateCollectionresult" should be "CreateCollectionResult".

Proposed fix
-            collection_result: The CreateCollectionresult returned from execute, containing resource IDs
+            collection_result: The CreateCollectionResult returned from create, containing resource IDs
backend/app/services/collections/create_collection.py (1)

35-42: Unused with_assistant parameter.

The with_assistant parameter is accepted but never used in start_job. The assistant creation logic is now determined by checking model and instructions in the provider. Consider removing this parameter if it's no longer needed.

Proposed fix
 def start_job(
     db: Session,
     request: CreationRequest,
     project_id: int,
     collection_job_id: UUID,
-    with_assistant: bool,
     organization_id: int,
 ) -> str:
backend/app/services/collections/providers/openai.py (4)

2-2: Unused import: Any.

The Any type is imported but not used in this file.

Proposed fix
 import logging
-from typing import Any
 
 from openai import OpenAI

24-26: Redundant self.client assignment.

super().__init__(client) already assigns self.client = client in BaseProvider.__init__. The second assignment on line 26 is redundant.

Proposed fix
     def __init__(self, client: OpenAI):
         super().__init__(client)
-        self.client = client

62-65: Log messages reference wrong method name.

The log prefix says [OpenAIProvider.execute] but the method is named create. Per coding guidelines, log messages should be prefixed with the function name.

Proposed fix for all occurrences in create method
             logger.info(
-                "[OpenAIProvider.execute] Vector store created | "
+                "[OpenAIProvider.create] Vector store created | "
                 f"vector_store_id={vector_store.id}, batches={len(docs_batches)}"
             )

Apply similar changes to lines 93-95, 104-105, and 114-118.


60-60: Consider explicit loop for generator consumption.

Using list() to consume a generator whose result is discarded can be unclear. A for loop or collections.deque(maxlen=0) pattern would make intent clearer.

Proposed alternative
-            list(vector_store_crud.update(vector_store.id, storage, docs_batches))
+            for _ in vector_store_crud.update(vector_store.id, storage, docs_batches):
+                pass
backend/app/services/collections/providers/registry.py (1)

61-69: Unreachable else branch and logging format.

The else branch (lines 65-69) is unreachable because LLMProvider.get(provider) on line 47 already raises ValueError for unsupported providers. Also, the log message on line 67 should use square brackets per coding guidelines: [get_llm_provider].

Proposed fix: remove unreachable code or convert to assertion
     if provider == LLMProvider.OPENAI:
         if "api_key" not in credentials:
             raise ValueError("OpenAI credentials not configured for this project.")
         client = OpenAI(api_key=credentials["api_key"])
-    else:
-        logger.error(
-            f"[get_llm_provider] Unsupported provider type requested: {provider}"
-        )
-        raise ValueError(f"Provider '{provider}' is not supported.")
+    else:
+        # This branch is unreachable as LLMProvider.get validates the provider,
+        # but kept as defensive programming for future provider additions.
+        raise AssertionError(f"Unhandled provider: {provider}")

     return provider_class(client=client)
backend/app/models/collection/response.py (1)

20-29: Add provider field to CollectionPublic.

The Collection database model includes a provider field (ProviderType enum) that represents the LLM provider (e.g., "openai"). This field is missing from CollectionPublic and should be exposed in the response schema. Per learnings, provider and llm_service_name serve different purposes—provider indicates the LLM provider name while llm_service_name specifies the particular service from that provider. Exposing both fields provides complete information to API consumers about the collection's LLM configuration.
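The suggested change can be sketched with a plain dataclass as a stand-in for the SQLModel response schema (the existing field names are assumed from the surrounding review, and only shown for illustration):

```python
from dataclasses import dataclass
from uuid import UUID


@dataclass
class CollectionPublic:
    # Existing response fields (names assumed for illustration).
    id: UUID
    llm_service_id: str
    llm_service_name: str  # the specific service, e.g. "openai vector store"
    # Suggested addition: the LLM provider backing the collection.
    provider: str  # e.g. "openai" -- deliberately distinct from llm_service_name
```

Keeping both fields lets API consumers distinguish the provider from the particular service used within it.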

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 91941f9 and 946e7c7.

📒 Files selected for processing (15)
  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/request.py
  • backend/app/models/collection/response.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/helpers.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/tests/utils/collection.py
🧰 Additional context used
📓 Path-based instructions (6)
backend/app/services/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement business logic in services located in backend/app/services/

Files:

  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/helpers.py
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Always add type hints to all function parameters and return values in Python code
Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")
Use Python 3.11+ with type hints throughout the codebase

Files:

  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/tests/utils/collection.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/helpers.py
  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/response.py
  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/request.py
backend/app/tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use factory pattern for test fixtures in backend/app/tests/

Files:

  • backend/app/tests/utils/collection.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
backend/app/models/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use sa_column_kwargs["comment"] to describe database columns, especially for non-obvious purposes, status/type fields, JSON/metadata columns, and foreign keys

Files:

  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/response.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/request.py
backend/app/alembic/versions/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Generate database migrations using alembic revision --autogenerate -m "Description" --rev-id <number> where rev-id is the latest existing revision ID + 1

Files:

  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
backend/app/models/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use SQLModel for database models located in backend/app/models/

Files:

  • backend/app/models/__init__.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.
📚 Learning: 2025-12-17T10:16:25.880Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:25.880Z
Learning: In backend/app/models/collection.py, the `provider` field indicates the LLM provider name (e.g., "openai"), while `llm_service_name` specifies which particular service from that provider is being used. These fields serve different purposes and are not redundant.

Applied to files:

  • backend/app/services/collections/delete_collection.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/collections/providers/registry.py
  • backend/app/tests/utils/collection.py
  • backend/app/tests/api/routes/collections/test_collection_list.py
  • backend/app/services/collections/providers/__init__.py
  • backend/app/services/collections/helpers.py
  • backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
  • backend/app/tests/api/routes/collections/test_collection_info.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/crud/*.py : Use CRUD pattern for database access operations located in `backend/app/crud/`

Applied to files:

  • backend/app/services/collections/delete_collection.py
📚 Learning: 2025-12-17T10:16:16.173Z
Learnt from: nishika26
Repo: ProjectTech4DevAI/kaapi-backend PR: 502
File: backend/app/models/collection.py:29-32
Timestamp: 2025-12-17T10:16:16.173Z
Learning: In backend/app/models/collection.py, treat provider as the LLM provider name (e.g., 'openai') and llm_service_name as the specific service from that provider. These fields serve different purposes and should remain non-redundant. Document their meanings, add clear type hints (e.g., provider: str, llm_service_name: str), and consider a small unit test or validation to ensure they are distinct and used appropriately, preventing accidental aliasing or duplication across the model or serializers.

Applied to files:

  • backend/app/models/collection/__init__.py
  • backend/app/models/collection/response.py
  • backend/app/models/__init__.py
  • backend/app/models/collection/request.py
📚 Learning: 2025-12-17T15:39:30.469Z
Learnt from: CR
Repo: ProjectTech4DevAI/kaapi-backend PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-17T15:39:30.469Z
Learning: Applies to backend/app/models/**/*.py : Use `sa_column_kwargs["comment"]` to describe database columns, especially for non-obvious purposes, status/type fields, JSON/metadata columns, and foreign keys

Applied to files:

  • backend/app/models/collection/request.py
🧬 Code graph analysis (11)
backend/app/services/collections/delete_collection.py (4)
backend/app/services/collections/providers/registry.py (1)
  • get_llm_provider (44-71)
backend/app/services/collections/providers/openai.py (1)
  • delete (121-148)
backend/app/services/collections/providers/base.py (1)
  • delete (56-65)
backend/app/crud/collection/collection.py (1)
  • delete (103-111)
backend/app/services/collections/providers/base.py (5)
backend/app/crud/document/document.py (1)
  • DocumentCrud (13-134)
backend/app/core/cloud/storage.py (1)
  • CloudStorage (113-141)
backend/app/models/collection/request.py (2)
  • CreationRequest (224-236)
  • Collection (26-92)
backend/app/models/collection/response.py (1)
  • CreateCollectionResult (10-13)
backend/app/services/collections/providers/openai.py (3)
  • create (28-119)
  • delete (121-148)
  • cleanup (150-160)
backend/app/tests/utils/collection.py (2)
backend/app/models/collection/request.py (1)
  • ProviderType (16-19)
backend/app/services/collections/helpers.py (1)
  • get_service_name (18-25)
backend/app/tests/api/routes/collections/test_collection_list.py (1)
backend/app/services/collections/helpers.py (1)
  • get_service_name (18-25)
backend/app/services/collections/providers/__init__.py (3)
backend/app/services/collections/providers/base.py (1)
  • BaseProvider (9-84)
backend/app/services/collections/providers/openai.py (1)
  • OpenAIProvider (21-160)
backend/app/services/collections/providers/registry.py (2)
  • LLMProvider (14-41)
  • get_llm_provider (44-71)
backend/app/services/collections/helpers.py (1)
backend/app/services/collections/providers/registry.py (1)
  • get (28-36)
backend/app/models/collection/__init__.py (2)
backend/app/models/collection/request.py (7)
  • Collection (26-92)
  • CreationRequest (224-236)
  • DeletionRequest (239-243)
  • CallbackRequest (197-203)
  • AssistantOptions (141-194)
  • CreateCollectionParams (106-138)
  • ProviderType (16-19)
backend/app/models/collection/response.py (4)
  • CollectionIDPublic (16-17)
  • CollectionPublic (20-29)
  • CollectionWithDocsPublic (32-33)
  • CreateCollectionResult (10-13)
backend/app/models/collection/response.py (1)
backend/app/models/document.py (1)
  • DocumentPublic (72-85)
backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py (2)
backend/app/services/collections/providers/openai.py (1)
  • create (28-119)
backend/app/services/collections/providers/base.py (1)
  • create (31-53)
backend/app/tests/api/routes/collections/test_collection_info.py (1)
backend/app/services/collections/helpers.py (1)
  • get_service_name (18-25)
backend/app/models/collection/request.py (3)
backend/app/core/util.py (1)
  • now (11-12)
backend/app/models/organization.py (1)
  • Organization (44-82)
backend/app/models/project.py (1)
  • Project (51-107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.12, 6)
🔇 Additional comments (25)
backend/app/services/collections/helpers.py (1)

108-111: LGTM!

The refactor to use get_service_name("openai") instead of the removed constant is correct and maintains the existing behavior while aligning with the new provider abstraction.

backend/app/models/collection/request.py (4)

130-138: LGTM!

The deduplication logic in model_post_init correctly removes duplicate documents by ID while preserving order.


214-221: LGTM!

The normalize_provider validator correctly handles case-insensitive provider matching by normalizing to lowercase before validation.


224-243: LGTM!

The CreationRequest and DeletionRequest models are well-structured, properly compose their parent classes, and provide clear field descriptions.


35-47: ENUM type name is consistent. The provider field correctly uses name="providertype" (lowercase) in both the model definition and the Alembic migration at 041_adding_blob_column_in_collection_table.py. The create_type difference is intentional—the migration uses create_type=True to create the type, while the model uses create_type=False to use the existing type.

backend/app/tests/api/routes/collections/test_collection_list.py (2)

10-10: LGTM!

Importing get_service_name from the helpers module is correct and aligns with the provider abstraction changes.


105-106: LGTM!

Using get_service_name("openai") instead of a hardcoded string improves maintainability and ensures consistency with the service layer.

backend/app/tests/api/routes/collections/test_collection_info.py (2)

12-12: LGTM!

Import aligns with the provider abstraction pattern used across test files.


167-168: LGTM!

Assertion updated consistently with other test files to use the helper function.

backend/app/tests/utils/collection.py (3)

11-14: LGTM!

Imports correctly added to support the provider abstraction in test utilities.


42-50: LGTM!

The get_collection function correctly sets provider=ProviderType.OPENAI on the created Collection, aligning with the new provider-based model.


67-75: LGTM!

The get_vector_store_collection function correctly uses both get_service_name("openai") for the service name and ProviderType.OPENAI for the provider field, maintaining consistency with the provider abstraction.

backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py (1)

24-54: LGTM on the safe migration pattern.

The upgrade correctly follows the safe pattern for adding a NOT NULL column with existing data:

  1. Add column as nullable
  2. Backfill existing rows with default value
  3. Alter column to NOT NULL
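The three-step pattern could look roughly like this in an Alembic upgrade (an illustrative sketch only, not the actual migration; the column name, type, and backfill default are assumptions):

```python
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    # 1. Add the column as nullable so existing rows don't violate NOT NULL.
    op.add_column("collection", sa.Column("provider", sa.String(), nullable=True))
    # 2. Backfill existing rows with a default value.
    op.execute("UPDATE collection SET provider = 'openai' WHERE provider IS NULL")
    # 3. Tighten the constraint once every row has a value.
    op.alter_column("collection", "provider", existing_type=sa.String(), nullable=False)
```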
backend/app/services/collections/delete_collection.py (2)

17-20: LGTM!

Imports correctly updated to use the new provider registry pattern, removing direct OpenAI CRUD dependencies.


159-180: Session management looks correct.

The provider is obtained within a session context (for credential lookup), but provider.delete(collection) is called outside the session block. This is appropriate since the external API call shouldn't hold the database session open.

backend/app/models/__init__.py (1)

9-19: LGTM!

The expanded exports correctly expose the new collection-related types (CreateCollectionParams, CreateCollectionResult, CreationRequest, DeletionRequest, ProviderType) needed for the provider-agnostic collection management.

backend/app/services/collections/providers/base.py (1)

78-84: LGTM!

The get_provider_name utility cleanly derives a lowercase provider name from the class name by convention.

backend/app/services/collections/providers/__init__.py (1)

1-6: LGTM!

The re-exports consolidate the provider package's public API cleanly. Consider adding an __all__ list for explicit export control, though this is optional.

backend/app/models/collection/response.py (2)

10-17: LGTM!

CreateCollectionResult and CollectionIDPublic models are well-defined with proper type hints.


32-33: LGTM!

CollectionWithDocsPublic correctly extends CollectionPublic with an optional documents list.

backend/app/services/collections/create_collection.py (1)

254-260: LGTM!

The provider cleanup is properly guarded—only attempts cleanup if both provider and result are available, with error handling to prevent masking the original exception.

backend/app/models/collection/__init__.py (1)

1-15: LGTM!

The package correctly aggregates and re-exports public types from the request and response submodules, providing a clean import surface.

backend/app/services/collections/providers/openai.py (1)

121-160: LGTM!

The delete and cleanup methods correctly handle both assistant and vector store resources with proper error handling and logging.

backend/app/services/collections/providers/registry.py (2)

14-41: LGTM!

The LLMProvider registry pattern is well-structured and extensible, with clear methods for provider lookup and listing supported providers.


44-59: LGTM!

The factory function properly validates credentials existence and provider-specific requirements before constructing the client.

Comment on lines +66 to +76
def downgrade():
    op.alter_column(
        "collection",
        "llm_service_name",
        existing_type=sa.VARCHAR(),
        comment="Name of the LLM service provider",
        existing_comment="Name of the LLM service",
        existing_nullable=False,
    )
    op.drop_column("collection", "provider")
    op.drop_column("collection", "collection_blob")

⚠️ Potential issue | 🟡 Minor

Missing ENUM type drop in downgrade.

The downgrade() function drops the provider and collection_blob columns but doesn't drop the providertype ENUM type. This could leave orphaned types in the database after a rollback.

🔎 Proposed fix
 def downgrade():
     op.alter_column(
         "collection",
         "llm_service_name",
         existing_type=sa.VARCHAR(),
         comment="Name of the LLM service provider",
         existing_comment="Name of the LLM service",
         existing_nullable=False,
     )
     op.drop_column("collection", "provider")
     op.drop_column("collection", "collection_blob")
+    provider_enum.drop(op.get_bind(), checkfirst=True)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In backend/app/alembic/versions/041_adding_blob_column_in_collection_table.py
around lines 66 to 76, the downgrade drops the provider and collection_blob
columns but does not remove the providertype ENUM type, leaving an orphaned type
in the database; update downgrade to drop the providertype ENUM after dropping
the provider column by using op.execute or sa.Enum(...).drop(op.get_bind(),
checkfirst=True) (or op.execute('DROP TYPE IF EXISTS providertype') depending on
DB) to remove the ENUM type safely and ensure checkfirst behavior so downgrade
is idempotent.

Comment on lines +95 to +103
class DocumentInput(SQLModel):
    """Document to be added to knowledge base."""

    name: str | None = Field(
        description="Display name for the document",
    )
    batch_size: int = Field(
        default=1,
        description=(
            "Number of documents to send to OpenAI in a single "
            "transaction. See the `file_ids` parameter in the "
            "vector store [create batch](https://platform.openai.com/docs/api-reference/vector-stores-file-batches/createBatch)."
        ),
    )
    id: UUID = Field(
        description="Reference to uploaded file/document in Kaapi",
    )

⚠️ Potential issue | 🟡 Minor

DocumentInput.name field missing default value.

The name field is typed as str | None but lacks a default=None, which will make it a required field despite being optional by type.

🔎 Proposed fix
 class DocumentInput(SQLModel):
     """Document to be added to knowledge base."""
 
     name: str | None = Field(
+        default=None,
         description="Display name for the document",
     )
     id: UUID = Field(
         description="Reference to uploaded file/document in Kaapi",
     )
🤖 Prompt for AI Agents
In backend/app/models/collection/request.py around lines 95 to 103, the
DocumentInput.name is annotated as optional (str | None) but lacks a default so
Pydantic/SQLModel treats it as required; update the Field declaration to include
default=None (i.e., name: str | None = Field(default=None, description="Display
name for the document")) so the field is truly optional and behaves as intended.

Comment on lines +109 to +112
name: str | None = Field(
    min_length=1,
    description="Name of the knowledge base to create or update",
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find and read the file to see the actual context
fd backend/app/models/collection/request.py --type f

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Read the specific file around lines 109-112
if [ -f "backend/app/models/collection/request.py" ]; then
  wc -l backend/app/models/collection/request.py
fi

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 119


🏁 Script executed:

# Get the file structure to understand imports and class definition
head -50 backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1386


🏁 Script executed:

# Get lines 100-120 to see the name field and context
sed -n '100,120p' backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 729


🏁 Script executed:

python3 << 'EOF'
from pydantic import Field
from sqlmodel import SQLModel

class CreateCollectionParams(SQLModel):
    name: str | None = Field(
        min_length=1,
        description="Name of the knowledge base to create or update",
    )

# Test 1: None value
try:
    t1 = CreateCollectionParams(name=None)
    print(f'Test 1 - None accepted: {t1.name}')
except Exception as e:
    print(f'Test 1 - Validation error with None: {type(e).__name__}: {e}')

# Test 2: Empty string
try:
    t2 = CreateCollectionParams(name="")
    print(f'Test 2 - Empty string accepted: {t2.name}')
except Exception as e:
    print(f'Test 2 - Validation error with empty string: {type(e).__name__}: {e}')

# Test 3: Valid string
try:
    t3 = CreateCollectionParams(name="test")
    print(f'Test 3 - Valid string accepted: {t3.name}')
except Exception as e:
    print(f'Test 3 - Validation error with valid string: {type(e).__name__}: {e}')

# Test 4: No name provided
try:
    t4 = CreateCollectionParams()
    print(f'Test 4 - No name provided, defaults to: {t4.name}')
except Exception as e:
    print(f'Test 4 - Validation error when name not provided: {type(e).__name__}: {e}')
EOF

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 161


🏁 Script executed:

# Check the full CreateCollectionParams class and surrounding context
sed -n '105,145p' backend/app/models/collection/request.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1494


🏁 Script executed:

# Check if there are imports or other usage patterns in the file
grep -n "sa_column_kwargs" backend/app/models/collection/request.py | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 659


Add explicit default=None to the name field.

The name field is typed as str | None but lacks an explicit default=None, while other Optional fields in this class (description, chunking_params, additional_params) all include it. This inconsistency can cause unexpected validation behavior in Pydantic. Add default=None to match the pattern: name: str | None = Field(min_length=1, default=None, description="...").

🤖 Prompt for AI Agents
In backend/app/models/collection/request.py around lines 109 to 112, the name
field is annotated as str | None but lacks an explicit default=None whereas
other optional fields include it; update the Field call to add default=None
(i.e., Field(min_length=1, default=None, description="Name of the knowledge base
to create or update")) so Pydantic treats it consistently and avoids unexpected
validation behavior.

Comment on lines +180 to +184
result = provider.create(
    collection_request=creation_request,
    storage=storage,
    document_crud=document_crud,
)

⚠️ Potential issue | 🟠 Major

provider.create called outside session context—verify document_crud remains usable.

document_crud is created inside a with Session(engine) block (lines 159-178), but provider.create is called after the block exits. Since DocumentCrud holds a reference to the session, operations like read_each inside batch_documents may fail if the session is closed.

Proposed fix: move provider.create inside the session block
         with Session(engine) as session:
             collection_job_crud = CollectionJobCrud(session, project_id)
             collection_job = collection_job_crud.read_one(job_uuid)
             collection_job = collection_job_crud.update(
                 job_uuid,
                 CollectionJobUpdate(
                     task_id=task_id,
                     status=CollectionJobStatus.PROCESSING,
                 ),
             )

             storage = get_cloud_storage(session=session, project_id=project_id)
             document_crud = DocumentCrud(session, project_id)

             provider = get_llm_provider(
                 session=session,
                 provider=creation_request.provider,
                 project_id=project_id,
                 organization_id=organization_id,
             )

-        result = provider.create(
-            collection_request=creation_request,
-            storage=storage,
-            document_crud=document_crud,
-        )
+            result = provider.create(
+                collection_request=creation_request,
+                storage=storage,
+                document_crud=document_crud,
+            )
🤖 Prompt for AI Agents
In backend/app/services/collections/create_collection.py around lines 180 to
184, provider.create is being invoked after the with Session(engine) block that
constructed document_crud has closed, which can leave DocumentCrud holding a
closed session; move the provider.create(...) call (and any dependent logic like
batch_documents or document reads) inside the with Session(engine): block where
document_crud is created so the CRUD instance uses an open session, or
alternatively refactor DocumentCrud so it does not capture the session (e.g.,
pass sessions per-call) if keeping the call outside is required.
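
The failure mode is easy to reproduce with a stdlib stand-in for the session and CRUD objects (names are illustrative, not the real SQLModel `Session` or `DocumentCrud` API):

```python
# Minimal stand-in for the pattern above: an object that captures a resource
# opened by a context manager fails once the manager exits. Names are
# illustrative, not the real Session/DocumentCrud API.
class Session:
    def __init__(self) -> None:
        self.closed = False

    def __enter__(self) -> "Session":
        return self

    def __exit__(self, *exc) -> None:
        self.closed = True

    def query(self) -> str:
        if self.closed:
            raise RuntimeError("session is closed")
        return "row"


class DocumentCrud:
    def __init__(self, session: Session) -> None:
        self.session = session  # holds a reference to the session

    def read_each(self) -> str:
        return self.session.query()


with Session() as session:
    crud = DocumentCrud(session)
    print(crud.read_each())  # works inside the block

try:
    crud.read_each()  # after the block, the captured session is closed
except RuntimeError as e:
    print(e)  # session is closed
```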

Comment on lines +173 to +180
provider = get_llm_provider(
    session=session,
    provider=deletion_request.provider,
    project_id=project_id,
    organization_id=organization_id,
)

# Delete the corresponding OpenAI resource (vector store or assistant)
if is_vector:
    OpenAIVectorStoreCrud(client).delete(llm_service_id)
else:
    OpenAIAssistantCrud(client).delete(llm_service_id)
provider.delete(collection)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and content around the problematic lines
head -n 200 backend/app/services/collections/delete_collection.py | tail -n +150

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1840


🏁 Script executed:

# Also get the full file to understand the context
wc -l backend/app/services/collections/delete_collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 132


🏁 Script executed:

# Search for provider validation patterns in this file and related files
rg "collection\.provider" backend/app/services/collections/delete_collection.py -A3 -B3

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Search for deletion_request and how it's used
rg "deletion_request" backend/app/services/collections/delete_collection.py -A2 -B2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 952


🏁 Script executed:

# Get the full delete_collection.py file content
cat -n backend/app/services/collections/delete_collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 8238


🏁 Script executed:

# Find the Collection and DeletionRequest models
fd -e py | xargs rg "class Collection\b|class DeletionRequest" -A10

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1228


🏁 Script executed:

# Check the get_llm_provider function to see if it validates provider
rg "def get_llm_provider" -A20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3605


Add provider mismatch validation to prevent deletion with wrong provider.

The code uses deletion_request.provider to instantiate the LLM provider, but collection has its own provider attribute (defined in the Collection model). Without validating these match, deletion could attempt to use the wrong provider's client to delete a collection.

Add a validation check after retrieving the collection:

if deletion_request.provider != collection.provider:
    raise ValueError(f"Provider mismatch: request={deletion_request.provider}, collection={collection.provider}")
🤖 Prompt for AI Agents
In backend/app/services/collections/delete_collection.py around lines 173 to
180, after you retrieve the collection and before calling
get_llm_provider()/provider.delete(), validate that deletion_request.provider
matches collection.provider and raise a ValueError when they differ; add a
conditional that compares deletion_request.provider and collection.provider and
raises a ValueError with a clear message like "Provider mismatch: request=<...>,
collection=<...>" so provider.delete is only invoked when the providers match.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
backend/app/tests/crud/collections/collection/test_crud_collection_read_all.py (1)

14-28: Same issues: double creation, unused variable, missing type hint.

This function has the same problems as mk_collection in test_crud_collection_read_one.py:

  1. get_assistant_collection already persists the collection, then crud.create(collection, documents) at line 26 attempts to create it again.
  2. The client variable on line 19 is unused.
  3. Missing return type hint for create_collections.
Proposed fix
-def create_collections(db: Session, n: int):
+def create_collections(db: Session, n: int) -> int:
     crud = None
     project = get_project(db)
     openai_mock = OpenAIMock()
     with openai_mock.router:
-        client = OpenAI(api_key="sk-test-key")
+        _ = OpenAI(api_key="sk-test-key")  # Required for mock initialization
         for _ in range(n):
             collection = get_assistant_collection(db, project=project)
             store = DocumentStore(db, project_id=collection.project_id)
             documents = store.fill(1)
             if crud is None:
                 crud = CollectionCrud(db, collection.project_id)
-            crud.create(collection, documents)
+            # Collection already created; associate documents separately if needed

         return crud.project_id
backend/app/tests/utils/collection.py (1)

53-75: Bug: get_service_name ignores the provider parameter.

Line 71 calls get_service_name("openai") with a hardcoded string, ignoring the provider parameter passed to this function. This means vector store collections for non-OpenAI providers would incorrectly use "openai vector store" as the service name.

Additionally:

  • Line 73 uses provider.upper() (string), but should use ProviderType enum for type safety and consistency with get_assistant_collection.
  • Missing type hint for project parameter.
Proposed fix
+from app.models import Project
+
 def get_vector_store_collection(
     db: Session,
-    project,
+    project: Project,
     *,
     vector_store_id: Optional[str] = None,
     collection_id: Optional[UUID] = None,
-    provider: str,
+    provider: ProviderType,
 ) -> Collection:
     """
     Create a Collection configured for the Vector Store path.
     execute_job will treat this as `is_vector = True` and use vector store id.
     """
     if vector_store_id is None:
         vector_store_id = f"vs_{uuid4().hex}"

     collection = Collection(
         id=collection_id or uuid4(),
         project_id=project.id,
-        llm_service_name=get_service_name("openai"),
+        llm_service_name=get_service_name(provider.value.lower()),
         llm_service_id=vector_store_id,
-        provider=provider.upper(),
+        provider=provider,
     )
     return CollectionCrud(db, project.id).create(collection)
backend/app/tests/crud/collections/collection/test_crud_collection_read_one.py (1)

14-23: Add return type hint to test fixture.

The function is missing a return type hint. Add -> Collection to the function signature per coding guidelines.

The client variable is necessary for OpenAI mock router initialization within the with openai_mock.router: context and should be kept. The double create() call works as intended—the second call triggers IntegrityError handling which returns the existing collection, then documents are associated via DocumentCollectionCrud.

backend/app/tests/crud/collections/collection/test_crud_collection_delete.py (1)

15-34: Add full type hints and prefer ProviderType enum.

Keeps tests aligned with typing rules and avoids string-literal drift.

✍️ Suggested update
+from typing import Optional
@@
-from app.models import APIKey, Collection
+from app.models import APIKey, Collection, ProviderType
@@
-def get_assistant_collection_for_delete(
-    db: Session, client=None, project_id: int = None
-) -> Collection:
+def get_assistant_collection_for_delete(
+    db: Session, client: OpenAI | None = None, project_id: int | None = None
+) -> Collection:
@@
-        provider="OPENAI",
+        provider=ProviderType.OPENAI,
🤖 Fix all issues with AI agents
In `@backend/app/alembic/versions/041_extend_collection_table_for_provider_.py`:
- Line 29: Add explicit return type hints to the Alembic migration functions:
change the signature of upgrade() (and the corresponding downgrade() at the
other location) to include a return type of None (e.g., def upgrade() -> None:
and def downgrade() -> None:) so the functions conform to the repo’s Python
typing guidelines; ensure any other migration functions in this file follow the
same pattern.
- Around line 31-40: The migration adds a non-nullable "provider" column via
op.add_column and then executes an UPDATE, which fails on tables with rows
because there's no server default; modify the op.add_column call for "provider"
to include server_default="OPENAI" so the ALTER succeeds, then run the UPDATE
(op.execute) and finally remove the server default with an ALTER/sa.DDL or
op.alter_column to drop the server_default; also add return type hints -> None
to the upgrade() and downgrade() functions to match coding guidelines.
- Line 67: The database unique constraint on collection.name is too strict;
change the migration that calls op.create_unique_constraint(None, "collection",
["name"]) to instead create a composite unique constraint on ("project_id",
"name") so names are scoped per project, and update the Collection ORM model to
remove unique=True from the name column and add a composite unique constraint in
__table_args__ (e.g., UniqueConstraint("project_id", "name")). Also ensure the
exists_by_name method still filters by project_id and will now align with the DB
constraint.

In `@backend/app/crud/collection/collection.py`:
- Around line 96-104: The exists_by_name implementation is misaligned with the
DB uniqueness: Collection.name is globally unique and soft-deleted rows still
exist, so change the query in exists_by_name to check only Collection.name ==
collection_name (remove the .where(Collection.project_id == self.project_id) and
the .where(Collection.deleted_at.is_(None)) conditions) so it detects any
existing row (including soft-deleted or in other projects) and returns result is
not None; update the method that calls exists_by_name accordingly if callers
expect project-scoped behavior.

In `@backend/app/models/collection.py`:
- Around line 50-60: The model fields name and description are declared nullable
in the SQL model but typed as non-optional, causing validation errors when
omitted; update the type annotations for the Collection model's name and
description to be Optional[str] (or str | None) and ensure their Field
definitions keep nullable=True (or provide default=None) so Pydantic accepts
omission during CreationRequest validation; locate the name and description
Field declarations in collection.py to apply this change.
- Around line 31-37: The provider field definition is currently wrapped in a
tuple and uses a set for sa_column_kwargs which will break SQLAlchemy expansion;
update the provider annotation so it assigns Field(...) directly (not as a
tuple) and change sa_column_kwargs to a dict with the key "comment" containing
the column description (e.g. sa_column_kwargs={"comment": "LLM provider used for
this collection (e.g., 'openai', 'bedrock', 'gemini', etc)"}), keeping
ProviderType, Field, and nullable=False as shown so the model mapping works
correctly.

In `@backend/app/services/collections/create_collection.py`:
- Around line 135-143: The execute_job function declares an unused
Celery-provided parameter task_instance; add a type hint and rename it to
_task_instance (or _task_instance: Any) to indicate it's intentionally unused
and satisfy linters; update the function signature for execute_job accordingly
and import typing.Any if needed so references to execute_job and task_instance
clearly reflect the change.

In `@backend/app/services/collections/helpers.py`:
- Around line 20-27: The helper functions in this module need explicit return
type annotations per repo typing guidelines; update the function signatures
(e.g., get_service_name and the other helper functions defined around lines
116-131) to include explicit return types (for these helpers use -> str or the
correct concrete type) so the module is fully typed, and ensure any necessary
typing imports are added if required.

In `@backend/app/services/collections/providers/base.py`:
- Around line 22-28: The __init__ method in the provider base class is missing
an explicit return type; update the constructor signature for the class in
base.py from "def __init__(self, client: Any)" to include the explicit return
annotation "-> None" (i.e., def __init__(self, client: Any) -> None:) while
keeping the existing docstring and assignment to self.client to satisfy the
project's type-hinting rules.

In `@backend/app/tests/utils/llm_provider.py`:
- Around line 140-162: The function get_mock_provider lacks a return type
annotation; update its signature to include an explicit return type (e.g., ->
MagicMock) so the function complies with typing guidelines—modify the def
get_mock_provider(...) declaration to add the return type and ensure the import
for MagicMock (from unittest.mock import MagicMock) is available in the test
module if not already present.
♻️ Duplicate comments (1)
backend/app/services/collections/create_collection.py (1)

165-190: Provider create uses session-bound CRUD after session closes.

document_crud and storage are created in a session context, but provider.create(...) is invoked after the session exits. If provider logic reads documents via DocumentCrud, this can fail due to a closed session.

🐛 Proposed fix: keep provider.create inside the session scope
-        provider = get_llm_provider(
-            session=session,
-            provider=creation_request.provider,
-            project_id=project_id,
-            organization_id=organization_id,
-        )
-
-        result = provider.create(
-            collection_request=creation_request,
-            storage=storage,
-            document_crud=document_crud,
-        )
+        provider = get_llm_provider(
+            session=session,
+            provider=creation_request.provider,
+            project_id=project_id,
+            organization_id=organization_id,
+        )
+        result = provider.create(
+            collection_request=creation_request,
+            storage=storage,
+            document_crud=document_crud,
+        )
🧹 Nitpick comments (8)
backend/app/services/collections/providers/base.py (2)

31-52: Fix create docstring + error message to match signature/return type.

The docstring references args that don’t exist and documents llm_service_id/name as the return value, even though the method returns a Collection. The NotImplementedError message also mentions “execute.” Both can mislead provider implementers.

📝 Proposed cleanup
-        Args:
-            collection_request: Collection parameters (name, description, document list, etc.)
-            storage: Cloud storage instance for file access
-            document_crud: DocumentCrud instance for fetching documents
-            batch_size: Number of documents to process per batch
-            with_assistant: Whether to create an assistant/agent
-            assistant_options: Options for assistant creation (provider-specific)
+        Args:
+            collection_request: Collection parameters (name, description, document list, etc.)
+            storage: Cloud storage instance for file access
+            document_crud: DocumentCrud instance for fetching documents
...
-        Returns:
-            llm_service_id: ID of the resource to delete
-            llm_service_name: Name of the service (determines resource type)
+        Returns:
+            Collection created by the provider
         """
-        raise NotImplementedError("Providers must implement execute method")
+        raise NotImplementedError("Providers must implement create method")

53-74: Update delete/cleanup docstrings to match parameters.

Both methods take a Collection, but the docstrings still describe llm_service_id/name and “CreateCollectionresult.”

📝 Proposed cleanup
-        Args:
-            llm_service_id: ID of the resource to delete
-            llm_service_name: Name of the service (determines resource type)
+        Args:
+            collection: Collection record containing provider identifiers
...
-        Args:
-            collection_result: The CreateCollectionresult returned from execute, containing resource IDs
+        Args:
+            collection: Collection record containing provider identifiers to roll back
backend/app/tests/utils/collection.py (1)

27-50: Missing type hint for project parameter.

The project parameter should have a type hint for consistency with coding guidelines.

Proposed fix
+from app.models import Project
+
 def get_assistant_collection(
     db: Session,
-    project,
+    project: Project,
     *,
     assistant_id: Optional[str] = None,
     model: str = "gpt-4o",
     collection_id: Optional[UUID] = None,
 ) -> Collection:
backend/app/tests/crud/collections/collection/test_crud_collection_create.py (1)

18-24: Consider using ProviderType.OPENAI enum for consistency.

The test uses the string "OPENAI" directly, while other test utilities like get_assistant_collection use ProviderType.OPENAI. For type safety and consistency across the codebase, consider using the enum.

Proposed fix
+from app.models import ProviderType
+
 class TestCollectionCreate:
     _n_documents = 10

    @openai_responses.mock()
     def test_create_associates_documents(self, db: Session):
         project = get_project(db)
         collection = Collection(
             id=uuid4(),
             project_id=project.id,
             llm_service_id="asst_dummy",
             llm_service_name="gpt-4o",
-            provider="OPENAI",
+            provider=ProviderType.OPENAI,
         )
backend/app/services/collections/create_collection.py (1)

158-162: Use idiomatic truthiness for with_assistant.

Avoid explicit comparisons to True.

♻️ Proposed change
-        if (
-            with_assistant == True
-        ):  # this will be removed once dalgo switches to vector store creation only
+        if with_assistant:  # this will be removed once dalgo switches to vector store creation only
backend/app/services/collections/providers/openai.py (2)

19-21: Add explicit return type to __init__.

Helps enforce the “type hints everywhere” rule.

♻️ Proposed change
-    def __init__(self, client: OpenAI):
+    def __init__(self, client: OpenAI) -> None:
As per coding guidelines, add type hints for all parameters and return values.

44-47: Log prefixes should reflect the function name.

create() logs currently use [OpenAIProvider.execute] which breaks the logging convention.

🔧 Proposed change
-                "[OpenAIProvider.execute] Vector store created | "
+                "[OpenAIProvider.create] Vector store created | "
@@
-                    "[OpenAIProvider.execute] Assistant created | "
+                    "[OpenAIProvider.create] Assistant created | "
@@
-                    "[OpenAIProvider.execute] Skipping assistant creation | with_assistant=False"
+                    "[OpenAIProvider.create] Skipping assistant creation | with_assistant=False"
@@
-                f"[OpenAIProvider.execute] Failed to create collection: {str(e)}",
+                f"[OpenAIProvider.create] Failed to create collection: {str(e)}",
As per coding guidelines, prefix log messages with the function name in square brackets.

Also applies to: 68-71, 78-80, 88-90

backend/app/models/collection.py (1)

87-105: Add explicit return types for new helper methods.

model_post_init and extract_super_type are missing return annotations.

♻️ Proposed change
-from typing import Any, Literal
+from typing import Any, Iterable, Literal
@@
-    def model_post_init(self, __context: Any):
+    def model_post_init(self, __context: Any) -> None:
         self.documents = list(set(self.documents))
@@
-    def extract_super_type(self, cls: "CreationRequest"):
+    def extract_super_type(self, cls: "CreationRequest") -> Iterable[tuple[str, Any]]:
         for field_name in cls.model_fields.keys():
             field_value = getattr(self, field_name)
             yield (field_name, field_value)
As per coding guidelines, add type hints for all parameters and return values.

Also applies to: 179-188

)


def upgrade():

⚠️ Potential issue | 🟡 Minor

Add return type hints to migration functions.

Required by the repo’s Python typing guidelines.

✍️ Suggested update
-def upgrade():
+def upgrade() -> None:
@@
-def downgrade():
+def downgrade() -> None:

Also applies to: 74-74

🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/041_extend_collection_table_for_provider_.py` at
line 29, Add explicit return type hints to the Alembic migration functions:
change the signature of upgrade() (and the corresponding downgrade() at the
other location) to include a return type of None (e.g., def upgrade() -> None:
and def downgrade() -> None:) so the functions conform to the repo’s Python
typing guidelines; ensure any other migration functions in this file follow the
same pattern.

Comment on lines +31 to +40
op.add_column(
    "collection",
    sa.Column(
        "provider",
        provider_type,
        nullable=False,
        comment="LLM provider used for this collection",
    ),
)
op.execute("UPDATE collection SET provider = 'OPENAI' WHERE provider IS NULL")

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "041_extend_collection_table_for_provider_*.py" -type f

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 150


🏁 Script executed:

find backend -name "*.py" -path "*alembic/versions*" | grep -i provider

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 225


🏁 Script executed:

ls -la backend/app/alembic/versions/ 2>/dev/null | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1587


🏁 Script executed:

find . -type d -name versions

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 105


🏁 Script executed:

git ls-files | grep -i alembic | grep -i version | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1328


🏁 Script executed:

find . -name "*041*" -type f 2>/dev/null

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 150


🏁 Script executed:

rg "extend_collection_table_for_provider" -l

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

rg "provider_type" -l | head -10

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 286


🏁 Script executed:

git ls-files | grep alembic

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2986


🏁 Script executed:

rg "op.add_column.*provider" -A 5 -B 2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

rg 'UPDATE collection SET provider' -B 5 -A 2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 923


🏁 Script executed:

ls -la

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1555


🏁 Script executed:

git ls-files | wc -l

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 78


🏁 Script executed:

git ls-files | grep -E "\.py$" | grep -i migration | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🌐 Web query:

Alembic add_column non-nullable without server_default existing rows failure

💡 Result:

Problem

  • Adding a NOT NULL column to a table with existing rows fails because existing rows would have NULL for the new column and the DB enforces NOT NULL. Alembic/autogenerate often emits op.add_column(... nullable=False) which triggers this. [1][2]

Safe solutions (pick one)

  1. Fast: add with a temporary server_default, then remove it
  • Add column with server_default so the DB populates existing rows at add-time, then ALTER to drop the default and set nullable=False.
  • PostgreSQL can optimize ADD COLUMN with a non-volatile DEFAULT so this is fast on modern versions; older PG may rewrite the table. Example:
    • op.add_column('t', sa.Column('col', sa.String(), nullable=False, server_default='x'))
    • op.alter_column('t', 'col', server_default=None)
  • References: Alembic docs; PostgreSQL behavior discussion. [2][3]
  2. Three-step (most portable / safest)
  • Step A: add nullable column:
    • op.add_column('t', sa.Column('col', sa.String(), nullable=True))
  • Step B: populate existing rows (UPDATE in migration, possibly batched):
    • op.execute("UPDATE t SET col = 'x' WHERE col IS NULL")
  • Step C: make column NOT NULL:
    • op.alter_column('t', 'col', nullable=False)
  • Use this for fine-grained control or when you need per-row logic and for DBs that don’t support fast server-default semantics. [1][4]
  3. Constraint/validation approach (avoid long exclusive locks)
  • Create a NOT VALID CHECK constraint enforcing col IS NOT NULL, update rows, then VALIDATE CONSTRAINT and drop the check if desired — reduces locking on large tables (Postgres-specific pattern). [5]

Notes and gotchas

  • SQLite requires batch mode (op.batch_alter_table) for some ALTERs. [2]
  • If you use autogenerate, manually edit the migration to add server_default or split into the three steps above.
  • After using a server_default to populate rows, remove the server_default if it was only intended for migration.

Concise Alembic examples

  • Server-default quick path:
    from alembic import op
    import sqlalchemy as sa

    def upgrade():
        op.add_column('users', sa.Column('plan', sa.String(), nullable=False, server_default='free'))
        op.alter_column('users', 'plan', server_default=None)

  • Three-step portable path:
    def upgrade():
        op.add_column('users', sa.Column('plan', sa.String(), nullable=True))
        op.execute("UPDATE users SET plan='free' WHERE plan IS NULL")
        op.alter_column('users', 'plan', nullable=False)

References
[1] Alembic operation docs (add_column / alter_column).
[2] Ross Gray — Alembic/Postgres defaults & performance (ADD COLUMN with DEFAULT optimization).
[3] Dev.to / blog posts showing server_default and three-step patterns.
[4] Community posts & examples resolving NotNull/IntegrityError when adding columns.
[5] Squawk docs — adding NOT NULL safely (NOT VALID check constraint pattern).

If you want, tell me your DB (Postgres/MySQL/SQLite) and I’ll produce a migration file tuned to it.


🏁 Script executed:

cat -n backend/app/alembic/versions/041_extend_collection_table_for_provider_.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3797


🏁 Script executed:

wc -l backend/app/alembic/versions/041_extend_collection_table_for_provider_.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 152


🏁 Script executed:

cat backend/app/alembic/versions/041_extend_collection_table_for_provider_.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3034


🏁 Script executed:

cat backend/app/alembic/versions/012_added_provider_column_to_the_credential_.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2468


🏁 Script executed:

head -100 backend/app/alembic/versions/041_extend_collection_table_for_provider_.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2693


Fix migration failure risk: non-nullable provider column is added without a server default.

On tables with existing rows, the ADD COLUMN with nullable=False will fail immediately without a server default, preventing the subsequent UPDATE statement from executing. Add server_default="OPENAI" to the column definition, then remove the server default after the update to allow future inserts without a default value.

Additionally, the upgrade() and downgrade() functions lack return type hints. Add -> None return type annotations to both functions per the coding guidelines.

🔧 Suggested migration fix
     op.add_column(
         "collection",
         sa.Column(
             "provider",
             provider_type,
             nullable=False,
+            server_default="OPENAI",
             comment="LLM provider used for this collection",
         ),
     )
     op.execute("UPDATE collection SET provider = 'OPENAI' WHERE provider IS NULL")
+    op.alter_column(
+        "collection",
+        "provider",
+        server_default=None,
+        existing_type=provider_type,
+    )

Also add return type hints:

-def upgrade():
+def upgrade() -> None:

-def downgrade():
+def downgrade() -> None:
🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/041_extend_collection_table_for_provider_.py`
around lines 31 - 40, The migration adds a non-nullable "provider" column via
op.add_column and then executes an UPDATE, which fails on tables with rows
because there's no server default; modify the op.add_column call for "provider"
to include server_default="OPENAI" so the ALTER succeeds, then run the UPDATE
(op.execute) and finally remove the server default with an ALTER/sa.DDL or
op.alter_column to drop the server_default; also add return type hints -> None
to the upgrade() and downgrade() functions to match coding guidelines.

        existing_comment="Name of the LLM service provider",
        existing_nullable=False,
    )
    op.create_unique_constraint(None, "collection", ["name"])

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

fd -t f "041_extend_collection_table_for_provider_" backend/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 148


🏁 Script executed:

find . -path "*/alembic/versions/*" -name "*041*" -type f

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 150


🏁 Script executed:

rg "exists_by_name" --type py -A 5

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1138


🏁 Script executed:

cat -n backend/app/alembic/versions/041_extend_collection_table_for_provider_.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3797


🏁 Script executed:

rg "class Collection" --type py -A 20 backend/app/models/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 8074


🏁 Script executed:

cat -n backend/app/models/collection.py | head -60

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2376


🏁 Script executed:

rg "project_id" backend/app/models/collection.py -B 2 -A 2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 338


🏁 Script executed:

cat -n backend/app/models/collection.py | sed -n '50,75p'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1212


🏁 Script executed:

rg "unique=True" backend/app/models/collection.py -B 5 -A 5

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 428


🏁 Script executed:

rg "Collection.name" --type py -B 3 -A 3 backend/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 656


🏁 Script executed:

rg "unique_together\|UniqueConstraint" --type py backend/app/models/collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

rg "Index\|unique" backend/app/models/collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

python3 << 'EOF'
# Check if there are any other constraints or indexes defined
import ast
import sys

with open('backend/app/models/collection.py', 'r') as f:
    content = f.read()
    
# Search for __table_args__ or similar constraint definitions
if '__table_args__' in content:
    print("Found __table_args__ definition")
    lines = content.split('\n')
    for i, line in enumerate(lines):
        if '__table_args__' in line:
            print(f"Line {i+1}: {line}")
            for j in range(1, 10):
                if i+j < len(lines):
                    print(f"Line {i+j+1}: {lines[i+j]}")
else:
    print("No __table_args__ found")

# Also check for other constraint patterns
if 'UniqueConstraint' in content or 'Index' in content:
    print("\nFound constraint/index patterns")
EOF

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 161


🏁 Script executed:

# Check if there's a previous migration that might have already defined a unique constraint differently
git log --oneline backend/app/alembic/versions/ | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 121


🏁 Script executed:

# Check for any other usages or assumptions about name uniqueness
rg "\.name" backend/app/crud/collection/collection.py -B 2 -A 2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 286


🏁 Script executed:

# Verify the Collection model fully to see if unique=True constraint is defined at field level
sed -n '22,85p' backend/app/models/collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2448


🏁 Script executed:

# Check if there are any queries that might assume global name uniqueness
rg "filter.*name\|where.*name" backend/app/crud/collection/ --type py -B 1 -A 1

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


Unique constraint must include project_id to be project-scoped.

The exists_by_name method filters by project_id, allowing the same collection name to exist in different projects. However, the global unique constraint on name alone prevents this. The DB constraint contradicts the application logic and will reject valid use cases where different projects have collections with identical names.

Required constraint change
-    op.create_unique_constraint(None, "collection", ["name"])
+    op.create_unique_constraint(
+        "uq_collection_project_id_name", "collection", ["project_id", "name"]
+    )
-    op.drop_constraint("collection_name_key", "collection", type_="unique")
+    op.drop_constraint("uq_collection_project_id_name", "collection", type_="unique")

Also update the Collection model to remove unique=True from the name field (line 52) and add a composite unique constraint via __table_args__ if ORM-level validation is needed.
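To make the intended semantics concrete, here is a minimal sketch of the project-scoped uniqueness the review asks for, using only stdlib `sqlite3` rather than the project's SQLModel definitions (table and column names mirror the review but are illustrative): a composite `UNIQUE (project_id, name)` lets two projects reuse a collection name, while still rejecting a duplicate within one project.

```python
import sqlite3

# In-memory table with the composite unique constraint under discussion.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE collection (
        id INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL,
        name TEXT,
        UNIQUE (project_id, name)
    )
    """
)

# Same name in two different projects: both inserts succeed.
conn.execute("INSERT INTO collection (project_id, name) VALUES (1, 'docs')")
conn.execute("INSERT INTO collection (project_id, name) VALUES (2, 'docs')")

# Same (project_id, name) pair a second time: rejected.
try:
    conn.execute("INSERT INTO collection (project_id, name) VALUES (1, 'docs')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

A global `UNIQUE (name)` constraint, by contrast, would already have rejected the second insert above, which is exactly the mismatch with `exists_by_name` that this comment flags.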

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    op.create_unique_constraint(None, "collection", ["name"])
+    op.create_unique_constraint(
+        "uq_collection_project_id_name", "collection", ["project_id", "name"]
+    )
🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/041_extend_collection_table_for_provider_.py` at
line 67, The database unique constraint on collection.name is too strict; change
the migration that calls op.create_unique_constraint(None, "collection",
["name"]) to instead create a composite unique constraint on ("project_id",
"name") so names are scoped per project, and update the Collection ORM model to
remove unique=True from the name column and add a composite unique constraint in
__table_args__ (e.g., UniqueConstraint("project_id", "name")). Also ensure the
exists_by_name method still filters by project_id and will now align with the DB
constraint.

Comment on lines +96 to +104
    def exists_by_name(self, collection_name: str) -> bool:
        statement = (
            select(Collection.id)
            .where(Collection.project_id == self.project_id)
            .where(Collection.name == collection_name)
            .where(Collection.deleted_at.is_(None))
        )
        result = self.session.exec(statement).first()
        return result is not None

⚠️ Potential issue | 🟠 Major

Align name existence check with DB uniqueness (soft deletes + cross‑project).

Collection.name is globally unique and soft-deleted rows still exist, but this check ignores other projects and deleted rows. That can return False even though the insert will fail with an IntegrityError, leading to misleading 409 handling or a later 500. Either (a) align the query with the current schema, or (b) change the DB constraint to match the intended scope/soft‑delete behavior.

✅ Option A: align with current global-unique constraint
-        statement = (
-            select(Collection.id)
-            .where(Collection.project_id == self.project_id)
-            .where(Collection.name == collection_name)
-            .where(Collection.deleted_at.is_(None))
-        )
+        statement = select(Collection.id).where(Collection.name == collection_name)
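Option (b) would typically mean a partial unique index that only covers live rows, so soft-deleted rows stop blocking name reuse. Below is a minimal `sqlite3` sketch of those semantics (the real change would be an Alembic/Postgres migration; the index and column names here are illustrative, not the project's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collection (id INTEGER PRIMARY KEY, name TEXT, deleted_at TEXT)"
)
# Partial unique index: uniqueness is enforced only where deleted_at IS NULL.
conn.execute(
    "CREATE UNIQUE INDEX uq_collection_name_live ON collection (name) "
    "WHERE deleted_at IS NULL"
)

conn.execute("INSERT INTO collection (name, deleted_at) VALUES ('docs', NULL)")
# Soft-delete the row; the name becomes reusable for a new live row.
conn.execute("UPDATE collection SET deleted_at = '2025-01-01' WHERE name = 'docs'")
conn.execute("INSERT INTO collection (name, deleted_at) VALUES ('docs', NULL)")

# A second *live* row with the same name is still rejected.
try:
    conn.execute("INSERT INTO collection (name, deleted_at) VALUES ('docs', NULL)")
    live_duplicate_rejected = False
except sqlite3.IntegrityError:
    live_duplicate_rejected = True
```

With a constraint like this, the existing soft-delete-aware `exists_by_name` check and the database would agree, at the cost of a Postgres-specific (partial index) migration.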
🤖 Prompt for AI Agents
In `@backend/app/crud/collection/collection.py` around lines 96 - 104, The
exists_by_name implementation is misaligned with the DB uniqueness:
Collection.name is globally unique and soft-deleted rows still exist, so change
the query in exists_by_name to check only Collection.name == collection_name
(remove the .where(Collection.project_id == self.project_id) and the
.where(Collection.deleted_at.is_(None)) conditions) so it detects any existing
row (including soft-deleted or in other projects) and returns result is not
None; update the method that calls exists_by_name accordingly if callers expect
project-scoped behavior.

Comment on lines +31 to +37
    provider: ProviderType = (
        Field(
            nullable=False,
            description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
            sa_column_kwargs={"LLM provider used for this collection"},
        ),
    )

⚠️ Potential issue | 🔴 Critical

Provider field definition is invalid and will break model mapping.

The field is wrapped in a tuple and sa_column_kwargs is a set, which will raise when SQLAlchemy expands **sa_column_kwargs.

🐛 Proposed fix
-    provider: ProviderType = (
-        Field(
-            nullable=False,
-            description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
-            sa_column_kwargs={"LLM provider used for this collection"},
-        ),
-    )
+    provider: ProviderType = Field(
+        nullable=False,
+        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
+        sa_column_kwargs={"comment": "LLM provider used for this collection"},
+    )
As per coding guidelines, use `sa_column_kwargs["comment"]` for column descriptions.
📝 Committable suggestion


Suggested change
-    provider: ProviderType = (
-        Field(
-            nullable=False,
-            description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
-            sa_column_kwargs={"LLM provider used for this collection"},
-        ),
-    )
+    provider: ProviderType = Field(
+        nullable=False,
+        description="LLM provider used for this collection (e.g., 'openai', 'bedrock', 'gemini', etc)",
+        sa_column_kwargs={"comment": "LLM provider used for this collection"},
+    )
🤖 Prompt for AI Agents
In `@backend/app/models/collection.py` around lines 31 - 37, The provider field
definition is currently wrapped in a tuple and uses a set for sa_column_kwargs
which will break SQLAlchemy expansion; update the provider annotation so it
assigns Field(...) directly (not as a tuple) and change sa_column_kwargs to a
dict with the key "comment" containing the column description (e.g.
sa_column_kwargs={"comment": "LLM provider used for this collection (e.g.,
'openai', 'bedrock', 'gemini', etc)"}), keeping ProviderType, Field, and
nullable=False as shown so the model mapping works correctly.

Comment on lines +50 to 60
    name: str = Field(
        nullable=True,
        unique=True,
        description="Name of the collection",
        sa_column_kwargs={"comment": "Name of the collection"},
    )
    description: str = Field(
        nullable=True,
        description="Description of the collection",
        sa_column_kwargs={"comment": "Description of the collection"},
    )

⚠️ Potential issue | 🟠 Major

name/description are nullable but typed as non-optional.

This can raise validation errors when CreationRequest omits them (they’re optional there).

🐛 Proposed fix
-    name: str = Field(
-        nullable=True,
-        unique=True,
-        description="Name of the collection",
-        sa_column_kwargs={"comment": "Name of the collection"},
-    )
-    description: str = Field(
-        nullable=True,
-        description="Description of the collection",
-        sa_column_kwargs={"comment": "Description of the collection"},
-    )
+    name: str | None = Field(
+        default=None,
+        nullable=True,
+        unique=True,
+        description="Name of the collection",
+        sa_column_kwargs={"comment": "Name of the collection"},
+    )
+    description: str | None = Field(
+        default=None,
+        nullable=True,
+        description="Description of the collection",
+        sa_column_kwargs={"comment": "Description of the collection"},
+    )
📝 Committable suggestion


Suggested change
-    name: str = Field(
-        nullable=True,
-        unique=True,
-        description="Name of the collection",
-        sa_column_kwargs={"comment": "Name of the collection"},
-    )
-    description: str = Field(
-        nullable=True,
-        description="Description of the collection",
-        sa_column_kwargs={"comment": "Description of the collection"},
-    )
+    name: str | None = Field(
+        default=None,
+        nullable=True,
+        unique=True,
+        description="Name of the collection",
+        sa_column_kwargs={"comment": "Name of the collection"},
+    )
+    description: str | None = Field(
+        default=None,
+        nullable=True,
+        description="Description of the collection",
+        sa_column_kwargs={"comment": "Description of the collection"},
+    )
🤖 Prompt for AI Agents
In `@backend/app/models/collection.py` around lines 50 - 60, The model fields name
and description are declared nullable in the SQL model but typed as
non-optional, causing validation errors when omitted; update the type
annotations for the Collection model's name and description to be Optional[str]
(or str | None) and ensure their Field definitions keep nullable=True (or
provide default=None) so Pydantic accepts omission during CreationRequest
validation; locate the name and description Field declarations in collection.py
to apply this change.

Comment on lines 135 to 143
def execute_job(
    request: dict,
    with_assistant: bool,
    project_id: int,
    organization_id: int,
    task_id: str,
    job_id: str,
    with_assistant: bool,
    task_instance,
) -> None:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n backend/app/services/collections/create_collection.py | sed -n '130,160p'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1174


🏁 Script executed:

rg -A 150 "def execute_job" backend/app/services/collections/create_collection.py | grep -n "task_instance"

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 95


🏁 Script executed:

wc -l backend/app/services/collections/create_collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 132


🏁 Script executed:

cat -n backend/app/services/collections/create_collection.py | sed -n '135,272p' | head -100

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 4275


🏁 Script executed:

cat -n backend/app/services/collections/create_collection.py | sed -n '234,272p'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1731


🏁 Script executed:

rg "execute_job" backend/app/services/collections/create_collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 578


🏁 Script executed:

rg -B 5 'function_path="app.services.collections.create_collection.execute_job"' backend/app/services/collections/create_collection.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 303


🏁 Script executed:

rg -A 10 "def start_low_priority_job" backend/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 887


🏁 Script executed:

rg -A 30 "def start_low_priority_job" backend/app/celery/utils.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1110


🏁 Script executed:

rg -B 5 -A 30 "def execute_low_priority_task" backend/app/celery/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2673


🏁 Script executed:

rg -A 50 "def _execute_job_internal" backend/app/celery/tasks/job_execution.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1835


🏁 Script executed:

rg "from celery" backend/app/celery/tasks/job_execution.py | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 106


task_instance is untyped and unused.

This parameter is passed by the Celery task framework but not used within the function. Add a type hint and prefix with _ to suppress lint warnings.

🔧 Suggested tweak
-    task_instance,
+    _task_instance: object | None,
🧰 Tools
🪛 Ruff (0.14.13)

142-142: Unused function argument: task_instance

(ARG001)

🤖 Prompt for AI Agents
In `@backend/app/services/collections/create_collection.py` around lines 135 -
143, The execute_job function declares an unused Celery-provided parameter
task_instance; add a type hint and rename it to _task_instance (or
_task_instance: Any) to indicate it's intentionally unused and satisfy linters;
update the function signature for execute_job accordingly and import typing.Any
if needed so references to execute_job and task_instance clearly reflect the
change.

Comment on lines +20 to +27
def get_service_name(provider: str) -> str:
    """Get the collection service name for a provider."""
    names = {
        "openai": "openai vector store",
        # "bedrock": "bedrock knowledge base",
        # "gemini": "gemini file search store",
    }
    return names.get(provider.lower(), "")

⚠️ Potential issue | 🟡 Minor

Add return type annotations to new helpers.

Required by repo typing guidelines.

✍️ Suggested update
-def get_service_name(provider: str) -> str:
+def get_service_name(provider: str) -> str:
@@
-def ensure_unique_name(
+def ensure_unique_name(
     session: SessionDep,
     project_id: int,
     requested_name: str,
-) -> str:
+) -> str:

Also applies to: 116-131

🤖 Prompt for AI Agents
In `@backend/app/services/collections/helpers.py` around lines 20 - 27, The helper
functions in this module need explicit return type annotations per repo typing
guidelines; update the function signatures (e.g., get_service_name and the other
helper functions defined around lines 116-131) to include explicit return types
(for these helpers use -> str or the correct concrete type) so the module is
fully typed, and ensure any necessary typing imports are added if required.

Comment on lines +22 to +28
    def __init__(self, client: Any):
        """Initialize provider with client.

        Args:
            client: Provider-specific client instance
        """
        self.client = client

⚠️ Potential issue | 🟡 Minor

Add explicit return type for __init__.

Type hints are required for all parameters and return values; add -> None here. As per coding guidelines, ...

🔧 Suggested fix
-    def __init__(self, client: Any):
+    def __init__(self, client: Any) -> None:
📝 Committable suggestion


Suggested change
-    def __init__(self, client: Any):
+    def __init__(self, client: Any) -> None:
         """Initialize provider with client.

         Args:
             client: Provider-specific client instance
         """
         self.client = client
🤖 Prompt for AI Agents
In `@backend/app/services/collections/providers/base.py` around lines 22 - 28, The
__init__ method in the provider base class is missing an explicit return type;
update the constructor signature for the class in base.py from "def
__init__(self, client: Any)" to include the explicit return annotation "-> None"
(i.e., def __init__(self, client: Any) -> None:) while keeping the existing
docstring and assignment to self.client to satisfy the project's type-hinting
rules.

Comment on lines +140 to +162
def get_mock_provider(
    llm_service_id: str = "mock_service_id",
    llm_service_name: str = "mock_service_name",
):
    """
    Create a properly configured mock provider for tests.

    Returns a mock that mimics BaseProvider with:
    - create() method returning result with llm_service_id and llm_service_name
    - cleanup() method for cleanup on failure
    - delete() method for deletion
    """
    mock_provider = MagicMock()

    mock_result = MagicMock()
    mock_result.llm_service_id = llm_service_id
    mock_result.llm_service_name = llm_service_name

    mock_provider.create.return_value = mock_result
    mock_provider.cleanup = MagicMock()
    mock_provider.delete = MagicMock()

    return mock_provider

⚠️ Potential issue | 🟡 Minor

Add a return type annotation for get_mock_provider.

This is required by the Python typing guideline.

✍️ Suggested update
-def get_mock_provider(
+def get_mock_provider(
     llm_service_id: str = "mock_service_id",
     llm_service_name: str = "mock_service_name",
-):
+) -> MagicMock:
📝 Committable suggestion


Suggested change
 def get_mock_provider(
     llm_service_id: str = "mock_service_id",
     llm_service_name: str = "mock_service_name",
-):
+) -> MagicMock:
     """
     Create a properly configured mock provider for tests.

     Returns a mock that mimics BaseProvider with:
     - create() method returning result with llm_service_id and llm_service_name
     - cleanup() method for cleanup on failure
     - delete() method for deletion
     """
     mock_provider = MagicMock()

     mock_result = MagicMock()
     mock_result.llm_service_id = llm_service_id
     mock_result.llm_service_name = llm_service_name

     mock_provider.create.return_value = mock_result
     mock_provider.cleanup = MagicMock()
     mock_provider.delete = MagicMock()

     return mock_provider
🤖 Prompt for AI Agents
In `@backend/app/tests/utils/llm_provider.py` around lines 140 - 162, The function
get_mock_provider lacks a return type annotation; update its signature to
include an explicit return type (e.g., -> MagicMock) so the function complies
with typing guidelines—modify the def get_mock_provider(...) declaration to add
the return type and ensure the import for MagicMock (from unittest.mock import
MagicMock) is available in the test module if not already present.

avirajsingh7 and others added 5 commits January 20, 2026 09:59
…urity module and tests (#507)

API Key: remove API key encryption and decryption functions
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 6.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4...v6)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
* first stab at refactoring

* cleanups

* run pre commit

* added missing permission imports

* refactoring batch code

* refactoring

* coderabbit suggestion changes

* coderabbit suggestion changes

* updated testcases

* following PEP 8 standards

* cleanup

* cleanup

* renaming endpoints with better semantics

* updated fail status

* cleanup delete dataset

* running pre commit

* updating testcases
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
backend/app/tests/crud/evaluations/test_dataset.py (1)

11-20: Missing imports will cause test failures.

The new TestDeleteDataset class uses delete_dataset, EvaluationRun, and now which are not imported. Add these imports:

🐛 Proposed fix to add missing imports
 from app.crud.evaluations.dataset import (
     create_evaluation_dataset,
+    delete_dataset,
     download_csv_from_object_store,
     get_dataset_by_id,
     get_dataset_by_name,
     list_datasets,
     update_dataset_langfuse_id,
     upload_csv_to_object_store,
 )
-from app.models import Organization, Project
+from app.models import EvaluationRun, Organization, Project
+from app.core.util import now
backend/app/crud/evaluations/embeddings.py (1)

83-127: Prefix the remaining log lines in build_embedding_jsonl.
A few warnings and the final info log still lack the required [build_embedding_jsonl] prefix, which breaks log filtering consistency. As per coding guidelines.

♻️ Suggested patch
-        if not item_id:
-            logger.warning("Skipping result with no item_id")
+        if not item_id:
+            logger.warning("[build_embedding_jsonl] Skipping result with no item_id")
             continue
@@
-        if not trace_id:
-            logger.warning(f"Skipping item {item_id} - no trace_id found")
+        if not trace_id:
+            logger.warning(f"[build_embedding_jsonl] Skipping item {item_id} - no trace_id found")
             continue
@@
-        if not generated_output or not ground_truth:
-            logger.warning(f"Skipping item {item_id} - empty output or ground_truth")
+        if not generated_output or not ground_truth:
+            logger.warning(
+                f"[build_embedding_jsonl] Skipping item {item_id} - empty output or ground_truth"
+            )
             continue
@@
-    logger.info(f"Built {len(jsonl_data)} embedding JSONL lines")
+    logger.info(f"[build_embedding_jsonl] Built {len(jsonl_data)} embedding JSONL lines")
backend/app/tests/api/routes/test_evaluation.py (1)

495-507: Add type hints + return types for the untyped tests.
Two test functions are missing parameter and return annotations; the codebase requires typing on all function params/returns. As per coding guidelines.

🧩 Suggested patch
-    def test_upload_without_authentication(self, client, valid_csv_content):
+    def test_upload_without_authentication(
+        self, client: TestClient, valid_csv_content: str
+    ) -> None:
@@
-    def test_start_batch_evaluation_without_authentication(
-        self, client, sample_evaluation_config
-    ):
+    def test_start_batch_evaluation_without_authentication(
+        self, client: TestClient, sample_evaluation_config: dict[str, Any]
+    ) -> None:

Also applies to: 577-589

🤖 Fix all issues with AI agents
In @.github/workflows/cd-staging.yml:
- Around line 9-10: The staging CD job currently uses a hard disable (`if:
false`) on the `build` job which permanently prevents runs; replace this with a
toggleable condition using a repository or workflow variable (e.g., `if: ${{
vars.STAGING_CD_ENABLED == 'true' }}` or a workflow input) so the job can be
re-enabled without editing the YAML; update references to the `build` job and
document the repo variable in repo settings or the workflow inputs so
maintainers can flip the toggle as needed.

In `@backend/app/core/batch/polling.py`:
- Around line 44-47: The log shows the new status twice because
batch_job.provider_status is updated before logging; capture the previous status
into a local variable (e.g., old_status = batch_job.provider_status) before
calling update_batch_job, then call update_batch_job as before and change the
logger.info call to log f"{old_status} -> {provider_status}" (referencing
batch_job, provider_status, update_batch_job, and logger.info in polling.py) so
the message reflects the transition correctly.

In `@backend/app/services/evaluations/evaluation.py`:
- Around line 65-83: The current merge logic builds merged_config only from
three explicit fields, dropping any other valid config keys; change it to start
with a shallow copy of config (e.g., merged_config = dict(config)) and then fill
missing defaults from the assistant (use assistant.model,
assistant.instructions, assistant.temperature) so config keys take precedence;
for tools, if config contains an explicit "tools" key keep it, otherwise compute
vector_store_ids from config or assistant (vector_store_ids =
config.get("vector_store_ids", assistant.vector_store_ids or [])) and only then
set merged_config["tools"] to the file_search entry when no explicit tools were
provided; ensure this preserves other fields like "reasoning",
"max_output_tokens", and "response_format" while still filtering/normalizing any
non-OpenAI params as needed.

In `@backend/app/services/evaluations/validators.py`:
- Around line 92-97: The validation currently assumes file.content_type is
always set; first check if UploadFile.content_type is None and raise
HTTPException(status_code=422, detail="Missing Content-Type header or content
type not provided") to avoid the confusing "got: None" message, then normalize
the non-None content_type (e.g., lower-case and strip any charset portion by
splitting on ';') and compare the base type against ALLOWED_MIME_TYPES (use the
content_type variable and ALLOWED_MIME_TYPES identifiers) and raise the existing
422 with the clearer message if it still isn't allowed.

In `@backend/app/tests/api/routes/test_evaluation.py`:
- Line 1042: The file ends without a trailing newline causing Ruff W292; add a
single newline character at the end of the file so the final line (the assertion
line containing assert "not found" in error_str.lower()) is followed by a
newline, ensuring the file terminates with a newline.

In `@backend/app/tests/crud/evaluations/test_processing.py`:
- Line 805: Add a trailing newline at EOF of the test file so the last line
"mock_check.assert_called_once()" ends with a newline character; update
backend/app/tests/crud/evaluations/test_processing.py (ensuring the final token
mock_check.assert_called_once() is followed by a newline) to satisfy the W292
lint rule.
- Around line 259-423: The tests lack parameter and return type annotations on
async test functions and some patched mock parameters; update each async test
(e.g., test_process_completed_evaluation_success,
test_process_completed_evaluation_no_results,
test_process_completed_evaluation_no_batch_job_id) to include explicit parameter
types for patched mocks (use unittest.mock.MagicMock or AsyncMock as
appropriate) and add a return annotation -> None for each test and fixture
functions (apply same changes to other tests in the file around 484-805). Also
ensure the module imports the necessary typing and mock types (e.g., MagicMock,
AsyncMock, Any) so the new annotations resolve.
🧹 Nitpick comments (7)
backend/app/tests/utils/document.py (2)

3-3: Import Generator from collections.abc instead of typing.

Per PEP 585 and Python 3.9+, Generator should be imported from collections.abc rather than typing.

Suggested fix
-from typing import Any, Generator
+from typing import Any
+from collections.abc import Generator

165-165: Add trailing newline at end of file.

Static analysis flagged the missing newline at the end of the file.

Suggested fix
-        return result
+        return result
+
backend/app/tests/api/test_auth_failures.py (1)

41-47: Guard against 422s from body validation in auth-failure tests.

For POST/PATCH routes, {"name": "test"} may not satisfy required schemas; if body validation runs before auth, you’ll get 422 instead of 401. Consider using endpoint-specific payloads via factories/shared helpers to keep these tests stable across schema changes.
As per coding guidelines, use factory pattern for test fixtures.
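A minimal sketch of what such a factory-based helper could look like — the names (`PAYLOAD_FACTORIES`, `make_payload`) and the example endpoints/fields are illustrative, not taken from the repository:

```python
# Hypothetical payload factories keyed by endpoint, so auth-failure tests
# send schema-valid bodies and reliably hit 401 instead of tripping 422
# body validation first. Endpoint paths and field names are assumptions.
PAYLOAD_FACTORIES: dict[str, callable] = {
    "/api/v1/collections": lambda: {"name": "test-collection", "provider": "openai"},
    "/api/v1/evaluations/datasets": lambda: {"dataset_name": "test-dataset"},
}


def make_payload(endpoint: str) -> dict:
    """Return a schema-valid body for the endpoint, or a generic fallback."""
    factory = PAYLOAD_FACTORIES.get(endpoint)
    return factory() if factory else {"name": "test"}
```

The auth-failure test can then call `make_payload(route)` per route instead of hard-coding `{"name": "test"}`, keeping the tests stable when request schemas change.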

backend/app/crud/evaluations/__init__.py (1)

1-37: LGTM! Expanded evaluation CRUD exports.

The new imports (save_score, fetch_trace_scores_from_langfuse) are confirmed to exist in their respective modules based on the relevant code snippets.

Consider adding an explicit __all__ list to declare the public API surface. This makes it clearer what symbols are intended for external use:

♻️ Optional: Add explicit __all__
__all__ = [
    "start_evaluation_batch",
    "create_evaluation_run",
    "get_evaluation_run_by_id",
    "list_evaluation_runs",
    "save_score",
    # ... other exports
]
backend/app/services/evaluations/__init__.py (1)

1-16: Consider adding __all__ to explicitly define the public API.

The module correctly consolidates exports from submodules. Adding an __all__ list would make the public API surface explicit and help tools like linters and IDEs understand intended exports.

♻️ Suggested addition
 """Evaluation services."""

 from app.services.evaluations.dataset import upload_dataset
 from app.services.evaluations.evaluation import (
     build_evaluation_config,
     get_evaluation_with_scores,
     start_evaluation,
 )
 from app.services.evaluations.validators import (
     ALLOWED_EXTENSIONS,
     ALLOWED_MIME_TYPES,
     MAX_FILE_SIZE,
     parse_csv_items,
     sanitize_dataset_name,
     validate_csv_file,
 )
+
+__all__ = [
+    "upload_dataset",
+    "build_evaluation_config",
+    "get_evaluation_with_scores",
+    "start_evaluation",
+    "ALLOWED_EXTENSIONS",
+    "ALLOWED_MIME_TYPES",
+    "MAX_FILE_SIZE",
+    "parse_csv_items",
+    "sanitize_dataset_name",
+    "validate_csv_file",
+]
backend/app/api/routes/evaluations/evaluation.py (1)

70-97: Consider bounding limit/offset to avoid unbounded scans.
Adding basic Query constraints helps prevent abuse and accidental large queries (verify desired bounds).

🔧 Suggested tweak
-    limit: int = 50,
-    offset: int = 0,
+    limit: int = Query(50, ge=1, le=100),
+    offset: int = Query(0, ge=0),
backend/app/api/routes/evaluations/dataset.py (1)

56-86: Add logging for the upload operation.

Per coding guidelines, log messages should be prefixed with the function name. Consider adding logging similar to the other endpoints (get_dataset, delete_dataset) for consistency and observability.

Suggested improvement
 async def upload_dataset(
     _session: SessionDep,
     auth_context: AuthContextDep,
     file: UploadFile = File(
         ..., description="CSV file with 'question' and 'answer' columns"
     ),
     dataset_name: str = Form(..., description="Name for the dataset"),
     description: str | None = Form(None, description="Optional dataset description"),
     duplication_factor: int = Form(
         default=1,
         ge=1,
         le=5,
         description="Number of times to duplicate each item (min: 1, max: 5)",
     ),
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
+    logger.info(
+        f"[upload_dataset] Uploading dataset | name={dataset_name} | "
+        f"org_id={auth_context.organization_.id} | "
+        f"project_id={auth_context.project_.id}"
+    )
+
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)

Comment on lines 9 to +10
build:
if: false

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/cd-staging.yml

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1676


🏁 Script executed:

find .github/workflows -type f -name "*.yml" -o -name "*.yaml"

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 220


🏁 Script executed:

cat -n .github/workflows/cd-production.yml

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1794


🏁 Script executed:

git log -p --follow -S "if: false" -- .github/workflows/cd-staging.yml | head -100

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

git log --oneline .github/workflows/cd-staging.yml | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 110


🏁 Script executed:

git show 23c9685:.github/workflows/cd-staging.yml | head -50

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1354


🏁 Script executed:

git log --all --oneline -- .github/workflows/cd-staging.yml | cat

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 146


🏁 Script executed:

git show ef56025:.github/workflows/cd-staging.yml 2>/dev/null | head -50

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1354


🏁 Script executed:

git log --all --reverse --oneline -- .github/workflows/cd-staging.yml

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 146


🏁 Script executed:

git show ef56025 --stat | head -20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 475


🏁 Script executed:

git log --all --reverse --oneline | head -5

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 146


🏁 Script executed:

grep -r "staging" README.md 2>/dev/null || echo "No README found or no staging mention"

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 420


Staging CD job is disabled; confirm intent and consider using a toggleable condition.

if: false at line 10 disables the entire staging deployment job for all main branch pushes. If this is intentional but temporary, use a repository variable to make the toggle explicit and allow re-enabling without code changes.

Suggested toggle via repo variable
 jobs:
   build:
-    if: false
+    if: ${{ vars.ENABLE_STAGING_CD == 'true' }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 build:
-  if: false
+  if: ${{ vars.ENABLE_STAGING_CD == 'true' }}
🧰 Tools
🪛 actionlint (1.7.10)

10-10: constant expression "false" in condition. remove the if: section

(if-cond)

🤖 Prompt for AI Agents
In @.github/workflows/cd-staging.yml around lines 9 - 10, The staging CD job
currently uses a hard disable (`if: false`) on the `build` job which permanently
prevents runs; replace this with a toggleable condition using a repository or
workflow variable (e.g., `if: ${{ vars.STAGING_CD_ENABLED == 'true' }}` or a
workflow input) so the job can be re-enabled without editing the YAML; update
references to the `build` job and document the repo variable in repo settings or
the workflow inputs so maintainers can flip the toggle as needed.

Comment on lines +44 to +47
logger.info(
f"[poll_batch_status] Updated | id={batch_job.id} | "
f"{batch_job.provider_status} -> {provider_status}"
)

⚠️ Potential issue | 🟡 Minor

Log message displays incorrect status transition.

After update_batch_job is called on line 40-42, batch_job.provider_status is updated to the new status. The log at line 46 therefore logs {new_status} -> {new_status} instead of the intended {old_status} -> {new_status}.

🐛 Proposed fix
+        old_status = batch_job.provider_status
+
         provider_status = status_result["provider_status"]
-        if provider_status != batch_job.provider_status:
+        if provider_status != old_status:
             update_data = {"provider_status": provider_status}

             if status_result.get("provider_output_file_id"):
@@ -44,7 +46,7 @@

             logger.info(
                 f"[poll_batch_status] Updated | id={batch_job.id} | "
-                f"{batch_job.provider_status} -> {provider_status}"
+                f"{old_status} -> {provider_status}"
             )
🤖 Prompt for AI Agents
In `@backend/app/core/batch/polling.py` around lines 44 - 47, The log shows the
new status twice because batch_job.provider_status is updated before logging;
capture the previous status into a local variable (e.g., old_status =
batch_job.provider_status) before calling update_batch_job, then call
update_batch_job as before and change the logger.info call to log f"{old_status}
-> {provider_status}" (referencing batch_job, provider_status, update_batch_job,
and logger.info in polling.py) so the message reflects the transition correctly.

Comment on lines +65 to +83
# Build config from assistant (use provided config values to override if present)
merged_config = {
"model": config.get("model", assistant.model),
"instructions": config.get("instructions", assistant.instructions),
"temperature": config.get("temperature", assistant.temperature),
}

# Add tools if vector stores are available
vector_store_ids = config.get(
"vector_store_ids", assistant.vector_store_ids or []
)
if vector_store_ids and len(vector_store_ids) > 0:
merged_config["tools"] = [
{
"type": "file_search",
"vector_store_ids": vector_store_ids,
}
]


⚠️ Potential issue | 🟠 Major

Merge drops non-core config keys when assistant_id is used.
When an assistant is provided, only model/instructions/temperature (plus vector_store_ids → tools) are preserved. Any additional valid config fields (e.g., reasoning, max_output_tokens, response_format, explicit tools) are silently discarded, which conflicts with the “config values take precedence” docstring and can break evaluation behavior.

✅ Suggested fix (preserve config keys; filter non-OpenAI params)
-        merged_config = {
-            "model": config.get("model", assistant.model),
-            "instructions": config.get("instructions", assistant.instructions),
-            "temperature": config.get("temperature", assistant.temperature),
-        }
+        merged_config = {
+            "model": assistant.model,
+            "instructions": assistant.instructions,
+            "temperature": assistant.temperature,
+        }
+        # Overlay provided config while filtering out non-OpenAI params
+        merged_config.update(
+            {key: value for key, value in config.items() if key != "vector_store_ids"}
+        )
@@
-        vector_store_ids = config.get(
-            "vector_store_ids", assistant.vector_store_ids or []
-        )
-        if vector_store_ids and len(vector_store_ids) > 0:
+        vector_store_ids = config.get(
+            "vector_store_ids", assistant.vector_store_ids or []
+        )
+        if vector_store_ids and "tools" not in merged_config:
             merged_config["tools"] = [
                 {
                     "type": "file_search",
                     "vector_store_ids": vector_store_ids,
                 }
             ]
🤖 Prompt for AI Agents
In `@backend/app/services/evaluations/evaluation.py` around lines 65 - 83, The
current merge logic builds merged_config only from three explicit fields,
dropping any other valid config keys; change it to start with a shallow copy of
config (e.g., merged_config = dict(config)) and then fill missing defaults from
the assistant (use assistant.model, assistant.instructions,
assistant.temperature) so config keys take precedence; for tools, if config
contains an explicit "tools" key keep it, otherwise compute vector_store_ids
from config or assistant (vector_store_ids = config.get("vector_store_ids",
assistant.vector_store_ids or [])) and only then set merged_config["tools"] to
the file_search entry when no explicit tools were provided; ensure this
preserves other fields like "reasoning", "max_output_tokens", and
"response_format" while still filtering/normalizing any non-OpenAI params as
needed.

Comment on lines +92 to +97
content_type = file.content_type
if content_type not in ALLOWED_MIME_TYPES:
raise HTTPException(
status_code=422,
detail=f"Invalid content type. Expected CSV, got: {content_type}",
)

⚠️ Potential issue | 🟡 Minor

Handle potential None content type.

UploadFile.content_type can be None when the client doesn't provide a Content-Type header. The current check will fail with a confusing error message ("Expected CSV, got: None").

Proposed fix
     content_type = file.content_type
-    if content_type not in ALLOWED_MIME_TYPES:
+    if not content_type or content_type not in ALLOWED_MIME_TYPES:
         raise HTTPException(
             status_code=422,
-            detail=f"Invalid content type. Expected CSV, got: {content_type}",
+            detail=f"Invalid content type. Expected CSV, got: {content_type or 'unknown'}",
         )
🤖 Prompt for AI Agents
In `@backend/app/services/evaluations/validators.py` around lines 92 - 97, The
validation currently assumes file.content_type is always set; first check if
UploadFile.content_type is None and raise HTTPException(status_code=422,
detail="Missing Content-Type header or content type not provided") to avoid the
confusing "got: None" message, then normalize the non-None content_type (e.g.,
lower-case and strip any charset portion by splitting on ';') and compare the
base type against ALLOWED_MIME_TYPES (use the content_type variable and
ALLOWED_MIME_TYPES identifiers) and raise the existing 422 with the clearer
message if it still isn't allowed.

error_str = response_data.get(
"detail", response_data.get("error", str(response_data))
)
assert "not found" in error_str.lower() No newline at end of file

⚠️ Potential issue | 🟡 Minor

Add trailing newline (W292).
Ruff flagged a missing newline at EOF.

🧰 Tools
🪛 Ruff (0.14.13)

1042-1042: No newline at end of file

Add trailing newline

(W292)

🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_evaluation.py` at line 1042, The file ends
without a trailing newline causing Ruff W292; add a single newline character at
the end of the file so the final line (the assertion line containing assert "not
found" in error_str.lower()) is followed by a newline, ensuring the file
terminates with a newline.

Comment on lines +259 to +423
@pytest.fixture
def eval_run_with_batch(self, db: Session, test_dataset) -> EvaluationRun:
"""Create evaluation run with batch job."""
# Create batch job
batch_job = BatchJob(
provider="openai",
provider_batch_id="batch_abc123",
provider_status="completed",
job_type="evaluation",
total_items=2,
status="submitted",
organization_id=test_dataset.organization_id,
project_id=test_dataset.project_id,
inserted_at=now(),
updated_at=now(),
)
db.add(batch_job)
db.commit()
db.refresh(batch_job)

eval_run = create_evaluation_run(
session=db,
run_name="test_run",
dataset_name=test_dataset.name,
dataset_id=test_dataset.id,
config={"model": "gpt-4o"},
organization_id=test_dataset.organization_id,
project_id=test_dataset.project_id,
)
eval_run.batch_job_id = batch_job.id
eval_run.status = "processing"
db.add(eval_run)
db.commit()
db.refresh(eval_run)

return eval_run

@pytest.mark.asyncio
@patch("app.crud.evaluations.processing.download_batch_results")
@patch("app.crud.evaluations.processing.fetch_dataset_items")
@patch("app.crud.evaluations.processing.create_langfuse_dataset_run")
@patch("app.crud.evaluations.processing.start_embedding_batch")
@patch("app.crud.evaluations.processing.upload_batch_results_to_object_store")
async def test_process_completed_evaluation_success(
self,
mock_upload,
mock_start_embedding,
mock_create_langfuse,
mock_fetch_dataset,
mock_download,
db: Session,
eval_run_with_batch,
):
"""Test successfully processing completed evaluation."""
# Mock batch results
mock_download.return_value = [
{
"custom_id": "item1",
"response": {
"body": {
"id": "resp_123",
"output": "Answer 1",
"usage": {"total_tokens": 10},
}
},
}
]

# Mock dataset items
mock_fetch_dataset.return_value = [
{
"id": "item1",
"input": {"question": "Q1"},
"expected_output": {"answer": "A1"},
}
]

# Mock Langfuse
mock_create_langfuse.return_value = {"item1": "trace_123"}

# Mock embedding batch
mock_start_embedding.return_value = eval_run_with_batch

# Mock upload
mock_upload.return_value = "s3://bucket/results.jsonl"

mock_openai = MagicMock()
mock_langfuse = MagicMock()

result = await process_completed_evaluation(
eval_run=eval_run_with_batch,
session=db,
openai_client=mock_openai,
langfuse=mock_langfuse,
)

assert result is not None
mock_download.assert_called_once()
mock_fetch_dataset.assert_called_once()
mock_create_langfuse.assert_called_once()
mock_start_embedding.assert_called_once()

@pytest.mark.asyncio
@patch("app.crud.evaluations.processing.download_batch_results")
@patch("app.crud.evaluations.processing.fetch_dataset_items")
async def test_process_completed_evaluation_no_results(
self,
mock_fetch_dataset,
mock_download,
db: Session,
eval_run_with_batch,
):
"""Test processing with no valid results."""
mock_download.return_value = []
mock_fetch_dataset.return_value = [
{
"id": "item1",
"input": {"question": "Q1"},
"expected_output": {"answer": "A1"},
}
]

mock_openai = MagicMock()
mock_langfuse = MagicMock()

result = await process_completed_evaluation(
eval_run=eval_run_with_batch,
session=db,
openai_client=mock_openai,
langfuse=mock_langfuse,
)

db.refresh(result)
assert result.status == "failed"
assert "No valid results" in result.error_message

@pytest.mark.asyncio
async def test_process_completed_evaluation_no_batch_job_id(
self, db: Session, test_dataset
):
"""Test processing without batch_job_id."""
eval_run = create_evaluation_run(
session=db,
run_name="test_run",
dataset_name=test_dataset.name,
dataset_id=test_dataset.id,
config={"model": "gpt-4o"},
organization_id=test_dataset.organization_id,
project_id=test_dataset.project_id,
)

mock_openai = MagicMock()
mock_langfuse = MagicMock()

result = await process_completed_evaluation(
eval_run=eval_run,
session=db,
openai_client=mock_openai,
langfuse=mock_langfuse,
)

db.refresh(result)
assert result.status == "failed"
assert "no batch_job_id" in result.error_message


⚠️ Potential issue | 🟡 Minor

Add type hints + return annotations across fixtures/async tests.
Several fixtures and async tests omit parameter types (e.g., patched mocks/fixtures) and -> None return annotations. Please apply consistent typing throughout this module. As per coding guidelines.

🧩 Example updates
-    def eval_run_with_batch(self, db: Session, test_dataset) -> EvaluationRun:
+    def eval_run_with_batch(
+        self, db: Session, test_dataset: EvaluationDataset
+    ) -> EvaluationRun:
@@
-    async def test_process_completed_evaluation_success(
-        self,
-        mock_upload,
-        mock_start_embedding,
-        mock_create_langfuse,
-        mock_fetch_dataset,
-        mock_download,
-        db: Session,
-        eval_run_with_batch,
-    ):
+    async def test_process_completed_evaluation_success(
+        self,
+        mock_upload: MagicMock,
+        mock_start_embedding: MagicMock,
+        mock_create_langfuse: MagicMock,
+        mock_fetch_dataset: MagicMock,
+        mock_download: MagicMock,
+        db: Session,
+        eval_run_with_batch: EvaluationRun,
+    ) -> None:

Also applies to: 484-805

🤖 Prompt for AI Agents
In `@backend/app/tests/crud/evaluations/test_processing.py` around lines 259 -
423, The tests lack parameter and return type annotations on async test
functions and some patched mock parameters; update each async test (e.g.,
test_process_completed_evaluation_success,
test_process_completed_evaluation_no_results,
test_process_completed_evaluation_no_batch_job_id) to include explicit parameter
types for patched mocks (use unittest.mock.MagicMock or AsyncMock as
appropriate) and add a return annotation -> None for each test and fixture
functions (apply same changes to other tests in the file around 484-805). Also
ensure the module imports the necessary typing and mock types (e.g., MagicMock,
AsyncMock, Any) so the new annotations resolve.


assert result["total"] == 1
assert result["still_processing"] == 1
mock_check.assert_called_once() No newline at end of file

⚠️ Potential issue | 🟡 Minor

Add trailing newline (W292).
Ruff flagged a missing newline at EOF.

🧰 Tools
🪛 Ruff (0.14.13)

805-805: No newline at end of file

Add trailing newline

(W292)

🤖 Prompt for AI Agents
In `@backend/app/tests/crud/evaluations/test_processing.py` at line 805, Add a
trailing newline at EOF of the test file so the last line
"mock_check.assert_called_once()" ends with a newline character; update
backend/app/tests/crud/evaluations/test_processing.py (ensuring the final token
mock_check.assert_called_once() is followed by a newline) to satisfy the W292
lint rule.

@nishika26 nishika26 closed this Jan 20, 2026
@nishika26 nishika26 deleted the enhancement/collection_provider_agnostic branch January 20, 2026 06:04
@nishika26
Collaborator Author

Had to close this PR because recent merges to main caused many merge conflicts, which made this PR really messy.


Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Collections: making this module llm provider agnostic

3 participants