feat: extract and surface document file dates across platform#862
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
…ates Add end-to-end support for preserving original file dates (created_at, modified_at) from uploaded documents. Metadata is extracted in the crawler service during file parsing, stored in the Convex document schema and PostgreSQL knowledge database, and made available through the RAG search pipeline for date-aware document retrieval.
Replace ad-hoc ensure_* functions with a proper versioned migration system using dbmate. Migrations run automatically on DB container startup after init scripts, with a healthcheck sentinel to block services until complete. - Install dbmate binary in DB Docker image - Convert ensure_error_column, ensure_content_hash_index, ensure_file_id_column, and ensure_document_date_columns to versioned SQL migration files - Add baseline migration to mark init-script state as version zero - Update entrypoint to run migrations after init scripts with /tmp/.db_ready sentinel - Rename ensure_embedding_dimensions to pin_embedding_dimensions (runtime config) - Remove migrated ensure_* functions from RAG service database.py
…id columns Add source_created_at/source_modified_at fields to document content, status, and search result responses. Include modified date annotation in formatted search results for agent context. Remove unused user_id column from documents and chunks tables with a migration.
…uploadedAt column Add a denormalized `indexed` boolean to the documents schema for fast agent-scoped lookups, replacing runtime ragInfo.status filtering. The RAG search tool gains a `list_indexed` operation so agents can discover which files are available before searching. The documents table now shows separate "Modified" (source file date) and "Uploaded at" columns.
- Fix silent data loss: use getString + ISO parsing instead of getNumber for RAG status dates, and write dates before marking status completed - Fix pagination cursor bug with composite cursor encoding skip counts to prevent document loss across paginated calls - Fix O(N^2) backfill by switching to native .paginate() - Add missing lastModified in org-wide upload path - Remove silent exception swallowing in crawler date extraction - Unify OOXML metadata extraction (dates + title/author) to avoid double-parsing in vision paths and router endpoints - Add source_created_at to SearchResult for API symmetry - Export and reuse getDocumentEffectiveDate to eliminate DRY violation - Replace as type casts with type guards and shared interfaces - Fix false-positive recency boost tests (single-result normalization) - Add pagination continuation tests, parse_search_results tests, and fix create_document_from_upload test mocks
The polling loop stored indexedFileId in ragInfo, but updating ragInfo status to 'running' replaced the entire object and silently dropped indexedFileId. Subsequent polls fell back to the Convex document ID, which RAG doesn't recognise, causing documents to stay stuck in 'running' until the 50-attempt timeout. Remove the redundant indexedFileId field from ragInfo writes and read document.fileId directly in all RAG interactions. Also fix _mark_completed in the RAG service to restore chunks_count from actual chunk rows instead of leaving it at 0 after a re-upload skip.
2b5517f to
e2282ae
Compare
📝 WalkthroughWalkthroughThis PR adds file metadata extraction capabilities to the crawler service with new Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
Use datetime.UTC alias instead of datetime.timezone.utc (ruff UP017) and remove dbmate scripts from db package.json (knip unlisted binary).
The checked-in api.d.ts was generated against an older @convex-dev/workflow version missing list, listByName, restart and updated cleanup signatures.
There was a problem hiding this comment.
Actionable comments posted: 22
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
services/rag/app/routers/search.py (1)
35-43:⚠️ Potential issue | 🟠 Major
source_created_atis still dropped from both response paths.These mappings now forward
source_modified_at, but they still omitsource_created_atwhen constructingSearchResult. Clients will never see the original creation date even if the search service returns it.Suggested fix
search_results = [ SearchResult( content=r.get("content", ""), score=r.get("score", 0.0), file_id=r.get("file_id"), filename=r.get("filename"), + source_created_at=r.get("source_created_at"), source_modified_at=r.get("source_modified_at"), metadata=r.get("metadata") if request.include_metadata else None, ) for r in results ] @@ sources = [ SearchResult( content=s.get("content", ""), score=s.get("score", 0.0), file_id=s.get("file_id"), + source_created_at=s.get("source_created_at"), source_modified_at=s.get("source_modified_at"), metadata=s.get("metadata"), ) for s in result.get("sources", []) ]Also applies to: 76-83
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/rag/app/routers/search.py` around lines 35 - 43, The SearchResult objects are never populated with source_created_at, so add source_created_at=r.get("source_created_at") to the SearchResult constructor when building search_results (the list comprehension/block that constructs SearchResult) and make the same change in the second construction block around the other occurrence (the alternate SearchResult creation at lines noted in the review). Ensure both places include source_created_at from the result dict so clients receive the original creation timestamp.services/rag/tests/test_document_content.py (1)
52-59: 🧹 Nitpick | 🔵 TrivialAdd one positive file-date round-trip case.
This widens the fixture shape, but with both new fields always
Nonethe suite still won't catch a regression whereget_document_content()forgets to return them. Add a case with concrete timestamps and assert they appear in the response.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/rag/tests/test_document_content.py` around lines 52 - 59, The test fixture DOC_ROW currently sets source_created_at and source_modified_at to None, so add a positive round-trip case by assigning concrete ISO8601 timestamps (e.g., "2023-01-01T00:00:00Z") to DOC_ROW["source_created_at"] and DOC_ROW["source_modified_at"] in the test setup and update the test for get_document_content() to assert those exact timestamp strings are present in the returned document payload; ensure you only add one case (leave the existing None case intact) and reference DOC_ROW and get_document_content() when making the assertions.services/rag/app/services/database.py (1)
95-101:⚠️ Potential issue | 🟠 MajorFix dimension mismatch handling to prevent startup failure on populated embedding columns.
The
ALTER TABLE ... TYPE vector({dimensions})operation at lines 95-101 will fail if thechunkstable already contains vectors from a previous model with a different dimension. pgvector cannot directly cast between incompatible fixed dimensions on populated columns. Add a check to detect and reject mismatched dimensions when embeddings already exist, forcing explicit remediation instead of silent startup failure.Suggested guard
elif col_type != expected_type: + has_embeddings = await conn.fetchval( + f"SELECT EXISTS (SELECT 1 FROM {SCHEMA}.chunks WHERE embedding IS NOT NULL)" + ) + if has_embeddings: + raise RuntimeError( + f"{SCHEMA}.chunks.embedding is {col_type}; clear or reindex existing chunks before pinning to {expected_type}." + ) logger.warning( "Embedding column is {}, expected {}. Dimension mismatch — existing embeddings may need re-generation.", col_type, expected_type, ) await conn.execute(f"ALTER TABLE {SCHEMA}.chunks ALTER COLUMN embedding TYPE vector({dimensions})")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/rag/app/services/database.py` around lines 95 - 101, The current logic attempts to ALTER COLUMN embedding TYPE vector({dimensions}) whenever col_type != expected_type, which will fail if the chunks table already contains embeddings of a different dimension; change this to first check whether any rows have a non-null embedding (e.g., using conn to count non-null embedding entries on the chunks table) and only perform the ALTER if the column is empty, otherwise abort startup by raising/logging a clear error that reports col_type, expected_type and the existing row count and instructs explicit remediation (re-generate embeddings or drop/replace the column) instead of attempting the incompatible ALTER; update the branch that references col_type, expected_type, SCHEMA, chunks, embedding, dimensions and conn.execute to implement this guard and error path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 498-501: The code is directly indexing the metadata dict
(meta["..."]) which raises KeyError for missing properties; update all direct
accesses to use meta.get("key") (or meta.get("key", default) when a default is
needed) in the docx router where metadata is extracted (look for the variable
meta and functions like extract_metadata/process_docx or the block that builds
the metadata object for docx documents) so missing keys return None or a safe
default instead of raising; ensure any downstream code that assumed a value
handles None/empty-string appropriately.
- Around line 505-511: The route handler in services/crawler/app/routers/docx.py
currently catches generic exceptions and returns a JSON payload with "success":
false while still sending an HTTP 200; update the exception path in the handler
to raise an HTTP error (e.g., raise fastapi.HTTPException(status_code=500,
detail="Extraction failed") or return a Response/JSONResponse with
status_code=500) instead of returning a 2xx body, and ensure the handler (and/or
surrounding except block) logs the exception details (exception instance and
traceback) before returning/raising so callers get a non-2xx status on
unexpected extraction failures.
In `@services/crawler/app/routers/pdf.py`:
- Line 243: The endpoint currently imports the private helper _parse_pdf_date
from another module — change the helper to a public API (rename/export it as
parse_pdf_date) in its defining module and update this module to import
parse_pdf_date instead of _parse_pdf_date; ensure the original function name is
updated consistently (rename the function or add a public wrapper named
parse_pdf_date) and update all call sites in routers/pdf.py that reference
_parse_pdf_date to use parse_pdf_date.
- Around line 251-254: The PDF document opened into variable doc must be closed
on all paths; wrap the code that opens and processes the PDF (the fitz.open(...)
call that assigns doc and subsequent operations that currently call doc.close())
in a try/finally (or use a context manager if supported) so doc.close() is
invoked in the finally block; update the routine in routers/pdf.py that creates
doc to ensure doc.close() runs even on exceptions to avoid leaking the fitz
document handle.
In `@services/crawler/app/routers/pptx.py`:
- Around line 224-227: The error response in the PPTX metadata handler returns
filename=file.filename or "unknown" while the success path uses "unknown.pptx";
update the failure return (the FileMetadataResponse constructed in the
exception/failure branch) to use the same fallback as the success path
(filename=file.filename or "unknown.pptx") so clients receive a consistent
filename format.
- Around line 197-209: The route currently reads the entire upload with
file.read() and then calls _extract_ooxml_metadata and Presentation using the
same bytes, which allows oversized/zip-bomb uploads to exhaust memory/CPU;
before fully loading, enforce a hard upload-size guard (e.g., check
request.content_length or read at most MAX_UPLOAD_BYTES+1 from file and if
exceeded raise HTTPException 413), avoid parsing twice by reusing the validated
bytes (pass the same BytesIO(file_bytes) to both _extract_ooxml_metadata and
Presentation or adapt _extract_ooxml_metadata to accept a stream) and ensure
MAX_UPLOAD_BYTES is a defined constant used for the guard.
In `@services/crawler/app/services/file_parser_service.py`:
- Around line 292-297: Duplicate inline metadata dicts used for DOCX and PPTX
should be replaced by a shared helper to ensure consistent title/author/date
normalization; add a new function core_properties_to_metadata(core_props) (or
similar) in file_parser_service.py that implements the same normalization logic
currently in _extract_ooxml_metadata, then call core_properties_to_metadata(...)
from both the DOCX and PPTX code paths (replace the inline dicts where metadata
is constructed) so both branches reuse the single normalizer instead of
re-parsing or duplicating logic.
- Around line 33-35: The code is clamping parsed datetimes to 1970 which
destroys valid pre-1970 source dates; instead, stop forcing a 1970 floor in the
date handling code (look for functions/methods like parse_date, _serialize_date
or normalize_timestamp referenced around the flagged lines) and preserve the
original datetime value. Remove the hard floor-to-1970 logic and, if downstream
consumers cannot accept negative epoch seconds, serialize pre-1970 values as
ISO-8601 strings or attach an explicit metadata flag (e.g., original_datetime or
is_pre_1970) rather than altering the timestamp.
- Around line 228-248: The vision-path PDF metadata currently returns only a
partial dict or {} on failure, breaking callers that expect the stable metadata
shape returned by parse_pdf(); update the vision extraction code so it builds
and returns the same metadata object shape as parse_pdf() (e.g., keys like
title, author, created, modified — using None or empty strings as sensible
defaults) instead of returning {} on fitz failure, or re-use/normalize via
parse_pdf()’s metadata normalization function so callers don't need
special-casing.
- Around line 20-31: The PDF date parser is incorrectly accepting malformed
timezone-only digits because it uses re.match (prefix only) and a regex that
allows timezone digits without a sign; update the parser to use re.fullmatch
instead of re.match and refactor the PDF date regex so the entire timezone
portion is an optional grouped unit that requires a sign if present (i.e., make
the sign mandatory whenever any timezone digits appear). Locate the PDF date
regex variable and the call that currently uses re.match in the PDF date parsing
function (the code paths that set source_created_at/source_modified_at) and
apply these two changes together to fully reject malformed strings.
In `@services/db/docker-entrypoint-wrapper.sh`:
- Around line 106-116: The background subshell running run_init_scripts and
run_migrations can fail without stopping the main PostgreSQL process (the
subshell exit won't propagate), so either run those steps synchronously before
starting/post-exec of docker-entrypoint.sh, or make failures fatal by
propagating the error to the main process: start Postgres in the wrapper,
capture its PID, run run_init_scripts/run_migrations in the foreground (or in
background but check their exit status), and on any non-zero exit call kill
-TERM "$PG_PID" (or exit with that code) instead of just exiting the subshell;
use the existing pg_isready, run_init_scripts, run_migrations, touch
/tmp/.db_ready and trap logic to implement the chosen approach so migration/init
failures reliably stop the container startup.
- Around line 92-104: The run_migrations function currently runs dbmate but
doesn't stop on failure; update run_migrations to check dbmate's exit status (or
run with set -e) and on non-zero exit print an error including dbmate's stderr
and exit the script (exit 1), and ensure the code that touches /tmp/.db_ready
only runs after a successful migrate; reference the run_migrations function,
MIGRATIONS_DIR, dbmate migrate invocation, and the /tmp/.db_ready readiness
touch to locate where to add the failure check and early exit.
In `@services/db/init-scripts/03-create-knowledge-database.sql`:
- Around line 174-175: The migration is not convergence-safe: CREATE TABLE IF
NOT EXISTS won’t add new columns to an existing private_knowledge.documents
table and CREATE UNIQUE INDEX IF NOT EXISTS may leave an outdated
idx_pk_docs_unique_scope; replace the CREATE variants with idempotent
ALTER/conditional logic — add ALTER TABLE private_knowledge.documents ADD COLUMN
IF NOT EXISTS for source_created_at and source_modified_at (so pre-existing
installs get new columns), and ensure the unique index idx_pk_docs_unique_scope
is recreated safely by checking/dropping the old index if present and then
creating the desired unique index (or using a conditional DO block that creates
the index only when missing) so long-lived deployments converge to the new
schema.
In
`@services/platform/convex/documents/__tests__/create_document_from_upload.test.ts`:
- Around line 29-37: Add a test that asserts the scheduler is invoked for
document date extraction: in the create_document_from_upload test suite,
mockGetAuthUser to return AUTH_USER, create a mock ctx via createMockCtx, call
the handler returned by getHandler with baseArgs and a fileName ending in
.pdf/.docx/.pptx, and expect ctx.scheduler.runAfter toHaveBeenCalledWith(0,
'extractDocumentDates', expect.objectContaining({ documentId: 'doc_created',
fileId: 'storage_1' })); target the handler returned by getHandler and the
scheduler method ctx.scheduler.runAfter and the internal action name
'extractDocumentDates' so the scheduling logic is covered for PDF/DOCX/PPTX
uploads.
In
`@services/platform/convex/documents/__tests__/transform_to_document_item.test.ts`:
- Around line 5-12: The mock factory createMockDocument currently uses an unsafe
cast ("as never") which disables type checking; change it to return a properly
typed partial/fixture—e.g., declare the function to return Partial<DocumentItem>
or use the TypeScript "satisfies" operator on the object literal to satisfy the
expected DocumentItem type—so TypeScript validates fields while still allowing
overrides; update the createMockDocument signature and its return value
accordingly (refer to createMockDocument and any test usages that consume its
return).
In `@services/platform/convex/documents/internal_actions.ts`:
- Around line 457-462: The current deleteDocumentFromRag flow bails out when
document.fileId is missing and treats HTTP 404 from the RAG delete as a hard
failure, leaving Convex records orphaned; change deleteDocumentFromRag so it
does NOT return early when document?.fileId is falsy but instead logs the
missing fileId and continues to call deleteDocumentById(documentId), and modify
the RAG delete logic (the HTTP delete/response handling where it currently
throws on 404) to catch 404 responses, log that the remote resource was already
gone, and treat them as success so deleteDocumentById always runs; keep throwing
or returning errors only for non-404/critical failures.
In `@services/platform/convex/documents/internal_mutations.ts`:
- Around line 64-89: The mutation updateDocumentDates currently only patches
sourceCreatedAt/sourceModifiedAt when a number is present, preventing explicit
clears; change the arg types to accept null (e.g., sourceCreatedAt:
v.union([v.number(), v.null()]) and same for sourceModifiedAt or use a
nullable/union variant) and update the handler so it checks for undefined (not
null) — e.g., if (args.sourceCreatedAt !== undefined) patch.sourceCreatedAt =
args.sourceCreatedAt — and do the same for sourceModifiedAt, then call
ctx.db.patch with that patch so null values will explicitly clear the fields via
ctx.db.patch.
In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`:
- Around line 140-143: The mapping that builds AgentIndexedDocumentItem
currently uses doc.sourceModifiedAt directly and returns null for legacy docs;
update the mapper in list_indexed_documents_for_agent (the code constructing
documents: AgentIndexedDocumentItem[]) to call the existing
getDocumentEffectiveDate(doc) fallback chain instead of using
doc.sourceModifiedAt, so name, fileId remain the same but sourceModifiedAt is
set to the value returned by getDocumentEffectiveDate for each doc to preserve
legacy timestamps.
In `@services/rag/app/config.py`:
- Around line 48-51: Add Pydantic validation to ensure recency_decay_base is
strictly between 0 and 1 and recency_max_age_days is a positive integer:
implement validators on the model containing
recency_boost_enabled/recency_decay_base/recency_max_age_days (e.g., add
`@validator` methods for "recency_decay_base" and "recency_max_age_days") that
raise ValueError with clear messages when recency_decay_base <= 0 or >= 1 and
when recency_max_age_days <= 0; keep the existing default values and ensure
validation runs on model initialization.
In `@services/rag/app/services/indexing_service.py`:
- Around line 117-127: The PDF handling opens a PyMuPDF document with
fitz.open(stream=content_bytes, filetype="pdf") but calls doc.close() manually,
risking resource leaks if an exception occurs; change to a context manager (with
fitz.open(stream=content_bytes, filetype="pdf") as doc:) inside the branch where
ext == "pdf", read metadata from doc (doc.metadata or {}), then return the two
_parse_pdf_date(...) values and remove the explicit doc.close() call so the file
is always closed safely.
In `@services/rag/app/services/search_service.py`:
- Around line 194-222: The local import of datetime/timezone inside
_apply_recency_boost should be moved to the module level to improve readability
and avoid repeated imports; add "from datetime import datetime, timezone" at the
top of the file and remove the in-function import statement in
_apply_recency_boost so the function uses the module-level symbols.
In `@services/rag/tests/test_indexing_service.py`:
- Around line 727-731: The test's mock for mock_conn.fetchrow currently returns
{"chunks_count": 5, "source_created_at": None, "source_modified_at": None} which
only verifies tolerance for missing dates; update the fixture used by the clone
test to return realistic timestamps instead of None so clone_from_existing() is
exercised copying actual values — replace the None entries in the second
fetchrow response with concrete datetime values (strings or timezone-aware
datetimes matching the code's expected format) so the test asserts the new row
preserves source_created_at and source_modified_at.
---
Outside diff comments:
In `@services/rag/app/routers/search.py`:
- Around line 35-43: The SearchResult objects are never populated with
source_created_at, so add source_created_at=r.get("source_created_at") to the
SearchResult constructor when building search_results (the list
comprehension/block that constructs SearchResult) and make the same change in
the second construction block around the other occurrence (the alternate
SearchResult creation at lines noted in the review). Ensure both places include
source_created_at from the result dict so clients receive the original creation
timestamp.
In `@services/rag/app/services/database.py`:
- Around line 95-101: The current logic attempts to ALTER COLUMN embedding TYPE
vector({dimensions}) whenever col_type != expected_type, which will fail if the
chunks table already contains embeddings of a different dimension; change this
to first check whether any rows have a non-null embedding (e.g., using conn to
count non-null embedding entries on the chunks table) and only perform the ALTER
if the column is empty, otherwise abort startup by raising/logging a clear error
that reports col_type, expected_type and the existing row count and instructs
explicit remediation (re-generate embeddings or drop/replace the column) instead
of attempting the incompatible ALTER; update the branch that references
col_type, expected_type, SCHEMA, chunks, embedding, dimensions and conn.execute
to implement this guard and error path.
In `@services/rag/tests/test_document_content.py`:
- Around line 52-59: The test fixture DOC_ROW currently sets source_created_at
and source_modified_at to None, so add a positive round-trip case by assigning
concrete ISO8601 timestamps (e.g., "2023-01-01T00:00:00Z") to
DOC_ROW["source_created_at"] and DOC_ROW["source_modified_at"] in the test setup
and update the test for get_document_content() to assert those exact timestamp
strings are present in the returned document payload; ensure you only add one
case (leave the existing None case intact) and reference DOC_ROW and
get_document_content() when making the assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 507824ab-7bc0-4830-ba4e-caa9e76f78e3
⛔ Files ignored due to path filters (7)
services/db/migrations/db/migrations/00000000000000_baseline.sqlis excluded by!**/migrations/**services/db/migrations/db/migrations/20260325000001_add_documents_error_column.sqlis excluded by!**/migrations/**services/db/migrations/db/migrations/20260325000002_add_documents_content_hash_index.sqlis excluded by!**/migrations/**services/db/migrations/db/migrations/20260325000003_rename_document_id_to_file_id.sqlis excluded by!**/migrations/**services/db/migrations/db/migrations/20260325000004_add_document_date_columns.sqlis excluded by!**/migrations/**services/db/migrations/db/migrations/20260325000005_drop_user_id_columns.sqlis excluded by!**/migrations/**services/platform/convex/_generated/api.d.tsis excluded by!**/_generated/**
📒 Files selected for processing (51)
services/crawler/app/models.pyservices/crawler/app/routers/docx.pyservices/crawler/app/routers/pdf.pyservices/crawler/app/routers/pptx.pyservices/crawler/app/services/file_parser_service.pyservices/db/Dockerfileservices/db/docker-entrypoint-wrapper.shservices/db/init-scripts/03-create-knowledge-database.sqlservices/db/package.jsonservices/platform/app/features/documents/hooks/mutations.tsservices/platform/app/features/documents/hooks/use-documents-table-config.tsxservices/platform/convex/agent_tools/rag/format_search_results.test.tsservices/platform/convex/agent_tools/rag/format_search_results.tsservices/platform/convex/agent_tools/rag/helpers/list_indexed_documents.tsservices/platform/convex/agent_tools/rag/parse_search_results.test.tsservices/platform/convex/agent_tools/rag/parse_search_results.tsservices/platform/convex/agent_tools/rag/rag_search_tool.tsservices/platform/convex/documents/__tests__/create_document_from_upload.test.tsservices/platform/convex/documents/__tests__/get_agent_scoped_file_ids.test.tsservices/platform/convex/documents/__tests__/list_indexed_documents_for_agent.test.tsservices/platform/convex/documents/__tests__/transform_to_document_item.test.tsservices/platform/convex/documents/actions.tsservices/platform/convex/documents/create_document.tsservices/platform/convex/documents/get_agent_scoped_file_ids.tsservices/platform/convex/documents/internal_actions.tsservices/platform/convex/documents/internal_mutations.tsservices/platform/convex/documents/internal_queries.tsservices/platform/convex/documents/list_indexed_documents_for_agent.tsservices/platform/convex/documents/mutations.tsservices/platform/convex/documents/schema.tsservices/platform/convex/documents/transform_to_document_item.tsservices/platform/convex/documents/types.tsservices/platform/convex/documents/update_document_rag_info.tsservices/platform/convex/documents/validators.tsservices/platform/convex/workflow_engine/action_defs/document/document_action.tsservices/platform/messages/en.jsonservices/platform/types/documents.tsservices/rag/app/config.pyservices/rag/app/models.pyservices/rag/app/routers/documents.pyservices/rag/app/routers/search.pyservices/rag/app/services/database.pyservices/rag/app/services/indexing_service.pyservices/rag/app/services/rag_service.pyservices/rag/app/services/search_service.pyservices/rag/tests/test_background_ingest.pyservices/rag/tests/test_document_content.pyservices/rag/tests/test_file_dates.pyservices/rag/tests/test_indexing_service.pyservices/rag/tests/test_rag_service.pyservices/rag/tests/test_search_service.py
💤 Files with no reviewable changes (1)
- services/platform/convex/documents/actions.ts
| title=meta["title"] or None, | ||
| author=meta["author"] or None, | ||
| created_at=meta["created_at"], | ||
| modified_at=meta["modified_at"], |
There was a problem hiding this comment.
Use safe metadata access to avoid avoidable KeyError failures.
Direct indexing (meta["..."]) can fail when a property is absent. Use .get() for resilient partial metadata extraction.
Proposed fix
- title=meta["title"] or None,
- author=meta["author"] or None,
- created_at=meta["created_at"],
- modified_at=meta["modified_at"],
+ title=meta.get("title") or None,
+ author=meta.get("author") or None,
+ created_at=meta.get("created_at"),
+ modified_at=meta.get("modified_at"),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/docx.py` around lines 498 - 501, The code is
directly indexing the metadata dict (meta["..."]) which raises KeyError for
missing properties; update all direct accesses to use meta.get("key") (or
meta.get("key", default) when a default is needed) in the docx router where
metadata is extracted (look for the variable meta and functions like
extract_metadata/process_docx or the block that builds the metadata object for
docx documents) so missing keys return None or a safe default instead of
raising; ensure any downstream code that assumed a value handles
None/empty-string appropriately.
| except Exception: | ||
| logger.exception("Error extracting DOCX metadata") | ||
| return FileMetadataResponse( | ||
| success=False, | ||
| filename=file.filename or "unknown", | ||
| error="Failed to extract DOCX metadata", | ||
| ) |
There was a problem hiding this comment.
Unexpected extraction failures should not return HTTP 200.
The generic exception path currently returns success=False as a normal response, which results in a 2xx status and can hide operational failures from upstream callers.
Proposed fix
except HTTPException:
raise
except Exception:
logger.exception("Error extracting DOCX metadata")
- return FileMetadataResponse(
- success=False,
- filename=file.filename or "unknown",
- error="Failed to extract DOCX metadata",
- )
+ raise HTTPException(
+ status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+ detail="Failed to extract DOCX metadata",
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception: | |
| logger.exception("Error extracting DOCX metadata") | |
| return FileMetadataResponse( | |
| success=False, | |
| filename=file.filename or "unknown", | |
| error="Failed to extract DOCX metadata", | |
| ) | |
| except HTTPException: | |
| raise | |
| except Exception: | |
| logger.exception("Error extracting DOCX metadata") | |
| raise HTTPException( | |
| status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, | |
| detail="Failed to extract DOCX metadata", | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/docx.py` around lines 505 - 511, The route
handler in services/crawler/app/routers/docx.py currently catches generic
exceptions and returns a JSON payload with "success": false while still sending
an HTTP 200; update the exception path in the handler to raise an HTTP error
(e.g., raise fastapi.HTTPException(status_code=500, detail="Extraction failed")
or return a Response/JSONResponse with status_code=500) instead of returning a
2xx body, and ensure the handler (and/or surrounding except block) logs the
exception details (exception instance and traceback) before returning/raising so
callers get a non-2xx status on unexpected extraction failures.
| """Extract metadata from a PDF file without full text extraction.""" | ||
| import fitz | ||
|
|
||
| from app.services.file_parser_service import _parse_pdf_date |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Avoid importing a private helper across module boundaries.
Importing _parse_pdf_date (underscore-prefixed) from another module couples this endpoint to a private API. Promote it to a public utility (e.g., parse_pdf_date) and import that instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/pdf.py` at line 243, The endpoint currently
imports the private helper _parse_pdf_date from another module — change the
helper to a public API (rename/export it as parse_pdf_date) in its defining
module and update this module to import parse_pdf_date instead of
_parse_pdf_date; ensure the original function name is updated consistently
(rename the function or add a public wrapper named parse_pdf_date) and update
all call sites in routers/pdf.py that reference _parse_pdf_date to use
parse_pdf_date.
| doc = fitz.open(stream=file_bytes, filetype="pdf") | ||
| raw = doc.metadata or {} | ||
| page_count = len(doc) | ||
| doc.close() |
There was a problem hiding this comment.
Ensure fitz document is always closed (exception-safe cleanup).
If an exception is raised after opening the PDF (Line 251) but before doc.close() (Line 254), the document handle is leaked. Wrap open/read operations in try/finally so close is guaranteed.
Proposed fix
- doc = fitz.open(stream=file_bytes, filetype="pdf")
- raw = doc.metadata or {}
- page_count = len(doc)
- doc.close()
+ doc = None
+ try:
+ doc = fitz.open(stream=file_bytes, filetype="pdf")
+ raw = doc.metadata or {}
+ page_count = len(doc)
+ finally:
+ if doc is not None:
+ doc.close()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| doc = fitz.open(stream=file_bytes, filetype="pdf") | |
| raw = doc.metadata or {} | |
| page_count = len(doc) | |
| doc.close() | |
| doc = None | |
| try: | |
| doc = fitz.open(stream=file_bytes, filetype="pdf") | |
| raw = doc.metadata or {} | |
| page_count = len(doc) | |
| finally: | |
| if doc is not None: | |
| doc.close() |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/pdf.py` around lines 251 - 254, The PDF document
opened into variable doc must be closed on all paths; wrap the code that opens
and processes the PDF (the fitz.open(...) call that assigns doc and subsequent
operations that currently call doc.close()) in a try/finally (or use a context
manager if supported) so doc.close() is invoked in the finally block; update the
routine in routers/pdf.py that creates doc to ensure doc.close() runs even on
exceptions to avoid leaking the fitz document handle.
| file_bytes = await file.read() | ||
| if not file_bytes: | ||
| raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Empty file uploaded") | ||
|
|
||
| filename = file.filename or "unknown.pptx" | ||
| meta = _extract_ooxml_metadata(file_bytes, "pptx") | ||
|
|
||
| from io import BytesIO | ||
|
|
||
| from pptx import Presentation | ||
|
|
||
| prs = Presentation(BytesIO(file_bytes)) | ||
|
|
There was a problem hiding this comment.
Add an explicit upload-size guard before parsing PPTX bytes.
Line 197 reads the full file into memory and Lines 202/208 parse it twice. Without a hard cap, oversized/zip-bomb uploads can cause memory/CPU exhaustion on this route.
Suggested fix
try:
file_bytes = await file.read()
if not file_bytes:
raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Empty file uploaded")
+ if len(file_bytes) > 20 * 1024 * 1024:
+ raise HTTPException(status_code=413, detail="File too large for metadata extraction")
filename = file.filename or "unknown.pptx"
meta = _extract_ooxml_metadata(file_bytes, "pptx")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/pptx.py` around lines 197 - 209, The route
currently reads the entire upload with file.read() and then calls
_extract_ooxml_metadata and Presentation using the same bytes, which allows
oversized/zip-bomb uploads to exhaust memory/CPU; before fully loading, enforce
a hard upload-size guard (e.g., check request.content_length or read at most
MAX_UPLOAD_BYTES+1 from file and if exceeded raise HTTPException 413), avoid
parsing twice by reusing the validated bytes (pass the same BytesIO(file_bytes)
to both _extract_ooxml_metadata and Presentation or adapt
_extract_ooxml_metadata to accept a stream) and ensure MAX_UPLOAD_BYTES is a
defined constant used for the guard.
| const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({ | ||
| fileId: String(doc.fileId), | ||
| name: doc.title ?? 'Untitled', | ||
| sourceModifiedAt: doc.sourceModifiedAt ?? null, |
There was a problem hiding this comment.
Preserve legacy modified dates in list_indexed.
Line 143 only emits doc.sourceModifiedAt, but this PR already added getDocumentEffectiveDate() because older documents can still store their source timestamp in metadata.sourceModifiedAt / metadata.lastModified. As written, list_indexed will return null for those documents and drop the temporal context agents are supposed to get.
💡 Reuse the existing fallback chain
-import type { Doc, Id } from '../_generated/dataModel';
+import type { Doc, Id } from '../_generated/dataModel';
+import type { DocumentMetadata } from './types';
import type { QueryCtx } from '../_generated/server';
+
+import { getDocumentEffectiveDate } from './transform_to_document_item';
@@
const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({
fileId: String(doc.fileId),
name: doc.title ?? 'Untitled',
- sourceModifiedAt: doc.sourceModifiedAt ?? null,
+ // oxlint-disable-next-line typescript/no-unsafe-type-assertion -- same Convex metadata shape used in transformToDocumentItem
+ sourceModifiedAt:
+ getDocumentEffectiveDate(
+ doc,
+ doc.metadata as DocumentMetadata | undefined,
+ ) ?? null,
}));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({ | |
| fileId: String(doc.fileId), | |
| name: doc.title ?? 'Untitled', | |
| sourceModifiedAt: doc.sourceModifiedAt ?? null, | |
| import type { Doc, Id } from '../_generated/dataModel'; | |
| import type { DocumentMetadata } from './types'; | |
| import type { QueryCtx } from '../_generated/server'; | |
| import { getDocumentEffectiveDate } from './transform_to_document_item'; | |
| const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({ | |
| fileId: String(doc.fileId), | |
| name: doc.title ?? 'Untitled', | |
| // oxlint-disable-next-line typescript/no-unsafe-type-assertion -- same Convex metadata shape used in transformToDocumentItem | |
| sourceModifiedAt: | |
| getDocumentEffectiveDate( | |
| doc, | |
| doc.metadata as DocumentMetadata | undefined, | |
| ) ?? null, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`
around lines 140 - 143, The mapping that builds AgentIndexedDocumentItem
currently uses doc.sourceModifiedAt directly and returns null for legacy docs;
update the mapper in list_indexed_documents_for_agent (the code constructing
documents: AgentIndexedDocumentItem[]) to call the existing
getDocumentEffectiveDate(doc) fallback chain instead of using
doc.sourceModifiedAt, so name, fileId remain the same but sourceModifiedAt is
set to the value returned by getDocumentEffectiveDate for each doc to preserve
legacy timestamps.
| # Recency boost | ||
| recency_boost_enabled: bool = False | ||
| recency_decay_base: float = 0.85 | ||
| recency_max_age_days: int = 730 |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider adding validation for recency boost parameters.
The recency_decay_base should be constrained to (0, 1) for meaningful exponential decay, and recency_max_age_days should be positive. Invalid values could produce unexpected search scoring behavior.
🛡️ Suggested validation using Pydantic validators
+from pydantic import field_validator
+
class Settings(BaseServiceSettings):
# ... existing fields ...
# Recency boost
recency_boost_enabled: bool = False
recency_decay_base: float = 0.85
recency_max_age_days: int = 730
+
+ `@field_validator`('recency_decay_base')
+ `@classmethod`
+ def validate_decay_base(cls, v: float) -> float:
+ if not 0 < v < 1:
+ raise ValueError('recency_decay_base must be between 0 and 1 (exclusive)')
+ return v
+
+ `@field_validator`('recency_max_age_days')
+ `@classmethod`
+ def validate_max_age_days(cls, v: int) -> int:
+ if v <= 0:
+ raise ValueError('recency_max_age_days must be positive')
+ return v📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Recency boost | |
| recency_boost_enabled: bool = False | |
| recency_decay_base: float = 0.85 | |
| recency_max_age_days: int = 730 | |
| # Recency boost | |
| recency_boost_enabled: bool = False | |
| recency_decay_base: float = 0.85 | |
| recency_max_age_days: int = 730 | |
| `@field_validator`('recency_decay_base') | |
| `@classmethod` | |
| def validate_decay_base(cls, v: float) -> float: | |
| if not 0 < v < 1: | |
| raise ValueError('recency_decay_base must be between 0 and 1 (exclusive)') | |
| return v | |
| `@field_validator`('recency_max_age_days') | |
| `@classmethod` | |
| def validate_max_age_days(cls, v: int) -> int: | |
| if v <= 0: | |
| raise ValueError('recency_max_age_days must be positive') | |
| return v |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/rag/app/config.py` around lines 48 - 51, Add Pydantic validation to
ensure recency_decay_base is strictly between 0 and 1 and recency_max_age_days
is a positive integer: implement validators on the model containing
recency_boost_enabled/recency_decay_base/recency_max_age_days (e.g., add
`@validator` methods for "recency_decay_base" and "recency_max_age_days") that
raise ValueError with clear messages when recency_decay_base <= 0 or >= 1 and
when recency_max_age_days <= 0; keep the existing default values and ensure
validation runs on model initialization.
| try: | ||
| if ext == "pdf": | ||
| import fitz | ||
|
|
||
| doc = fitz.open(stream=content_bytes, filetype="pdf") | ||
| metadata = doc.metadata or {} | ||
| doc.close() | ||
| return ( | ||
| _parse_pdf_date(metadata.get("creationDate")), | ||
| _parse_pdf_date(metadata.get("modDate")), | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider using context manager for PDF document.
If an exception occurs between fitz.open() and doc.close(), the document handle may not be properly closed. While Python's garbage collector will eventually clean it up, explicit resource management is cleaner.
♻️ Suggested refactor using context manager
if ext == "pdf":
import fitz
- doc = fitz.open(stream=content_bytes, filetype="pdf")
- metadata = doc.metadata or {}
- doc.close()
- return (
- _parse_pdf_date(metadata.get("creationDate")),
- _parse_pdf_date(metadata.get("modDate")),
- )
+ with fitz.open(stream=content_bytes, filetype="pdf") as doc:
+ metadata = doc.metadata or {}
+ return (
+ _parse_pdf_date(metadata.get("creationDate")),
+ _parse_pdf_date(metadata.get("modDate")),
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/rag/app/services/indexing_service.py` around lines 117 - 127, The
PDF handling opens a PyMuPDF document with fitz.open(stream=content_bytes,
filetype="pdf") but calls doc.close() manually, risking resource leaks if an
exception occurs; change to a context manager (with
fitz.open(stream=content_bytes, filetype="pdf") as doc:) inside the branch where
ext == "pdf", read metadata from doc (doc.metadata or {}), then return the two
_parse_pdf_date(...) values and remove the explicit doc.close() call so the file
is always closed safely.
| def _apply_recency_boost( | ||
| results: list[dict[str, Any]], | ||
| decay_base: float, | ||
| max_age_days: int, | ||
| ) -> None: | ||
| """Scale RRF scores by document age so newer documents rank higher. | ||
|
|
||
| Modifies *results* in place: adjusts ``rrf_score``, re-normalises so the | ||
| top result equals 1.0, and re-sorts descending. | ||
| """ | ||
| from datetime import datetime, timezone | ||
|
|
||
| now = datetime.now(timezone.utc) | ||
| for item in results: | ||
| doc_ts = item.get("source_modified_at") or item.get("created_at") | ||
| if doc_ts is None: | ||
| item["rrf_score"] *= decay_base | ||
| continue | ||
| age_days = (now - doc_ts).total_seconds() / 86400 | ||
| recency_factor = max(0.0, 1.0 - age_days / max_age_days) | ||
| boost = decay_base + (1.0 - decay_base) * recency_factor | ||
| item["rrf_score"] *= boost | ||
|
|
||
| max_score = max((r["rrf_score"] for r in results), default=1.0) | ||
| if max_score > 0: | ||
| for r in results: | ||
| r["rrf_score"] /= max_score | ||
|
|
||
| results.sort(key=lambda x: x.get("rrf_score", 0), reverse=True) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider moving the import to module level.
The datetime import at line 204 inside the function works but is unconventional. Moving it to the module level improves readability and avoids repeated import overhead.
🔧 Move import to module level
At the top of the file (around line 7):
from datetime import datetime, timezoneThen remove line 204.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/rag/app/services/search_service.py` around lines 194 - 222, The
local import of datetime/timezone inside _apply_recency_boost should be moved to
the module level to improve readability and avoid repeated imports; add "from
datetime import datetime, timezone" at the top of the file and remove the
in-function import statement in _apply_recency_boost so the function uses the
module-level symbols.
| mock_conn.fetchrow = AsyncMock( | ||
| side_effect=[ | ||
| None, | ||
| {"chunks_count": 5}, | ||
| {"chunks_count": 5, "source_created_at": None, "source_modified_at": None}, | ||
| {"id": "new-uuid"}, |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Use real timestamps in this clone fixture.
None/None only proves the clone path tolerates missing dates. It won't catch a regression where clone_from_existing() stops copying actual source_created_at / source_modified_at values into the new row.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/rag/tests/test_indexing_service.py` around lines 727 - 731, The
test's mock for mock_conn.fetchrow currently returns {"chunks_count": 5,
"source_created_at": None, "source_modified_at": None} which only verifies
tolerance for missing dates; update the fixture used by the clone test to return
realistic timestamps instead of None so clone_from_existing() is exercised
copying actual values — replace the None entries in the second fetchrow response
with concrete datetime values (strings or timezone-aware datetimes matching the
code's expected format) so the test asserts the new row preserves
source_created_at and source_modified_at.
Summary
ensure_*migrations in the RAG service with a proper dbmate versioned migration system for thetale_knowledgedatabaselist_indexedagent operation for discovering indexed documentsindexedboolean field anduploadedAtcolumn to the documents table for fast lookups and improved UIuser_idcolumns from knowledge DB tablesTest plan
format_search_results,parse_search_results,transform_to_document_item,list_indexed_documents_for_agentSummary by CodeRabbit
Release Notes
New Features
Improvements