Skip to content

feat: extract and surface document file dates across platform#862

Merged
larryro merged 9 commits into
mainfrom
feat/document-file-dates
Mar 25, 2026
Merged

feat: extract and surface document file dates across platform#862
larryro merged 9 commits into
mainfrom
feat/document-file-dates

Conversation

@larryro

@larryro larryro commented Mar 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Extract original file creation/modification dates during document parsing (crawler service) and store them in both Convex and PostgreSQL
  • Replace ad-hoc ensure_* migrations in the RAG service with a proper dbmate versioned migration system for the tale_knowledge database
  • Surface file dates in RAG search results so agents get temporal context, and add a list_indexed agent operation for discovering indexed documents
  • Add indexed boolean field and uploadedAt column to the documents table for fast lookups and improved UI
  • Drop unused user_id columns from knowledge DB tables

Test plan

  • Unit tests for format_search_results, parse_search_results, transform_to_document_item, list_indexed_documents_for_agent
  • Python tests for file date extraction, search service, background ingest, and indexing service
  • Existing tests updated and passing for affected modules

Summary by CodeRabbit

Release Notes

  • New Features

    • File metadata extraction for PDF, DOCX, and PPTX documents (title, author, creation/modification dates, page/slide counts)
    • Document date tracking: capture file modification, creation, and upload timestamps
    • Recency-based search result boosting to prioritize recently modified documents
    • "Uploaded at" column in documents table for upload timestamp visibility
    • Cursor-based pagination for indexed document listing
  • Improvements

    • Enhanced search results with date annotations
    • Optimized document indexing with improved filtering
    • Database migration management tools

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

larryro added 6 commits March 25, 2026 23:04
…ates

Add end-to-end support for preserving original file dates (created_at,
modified_at) from uploaded documents. Metadata is extracted in the crawler
service during file parsing, stored in the Convex document schema and
PostgreSQL knowledge database, and made available through the RAG search
pipeline for date-aware document retrieval.
Replace ad-hoc ensure_* functions with a proper versioned migration system
using dbmate. Migrations run automatically on DB container startup after
init scripts, with a healthcheck sentinel to block services until complete.

- Install dbmate binary in DB Docker image
- Convert ensure_error_column, ensure_content_hash_index, ensure_file_id_column,
  and ensure_document_date_columns to versioned SQL migration files
- Add baseline migration to mark init-script state as version zero
- Update entrypoint to run migrations after init scripts with /tmp/.db_ready sentinel
- Rename ensure_embedding_dimensions to pin_embedding_dimensions (runtime config)
- Remove migrated ensure_* functions from RAG service database.py
…id columns

Add source_created_at/source_modified_at fields to document content,
status, and search result responses. Include modified date annotation
in formatted search results for agent context. Remove unused user_id
column from documents and chunks tables with a migration.
…uploadedAt column

Add a denormalized `indexed` boolean to the documents schema for fast
agent-scoped lookups, replacing runtime ragInfo.status filtering. The RAG
search tool gains a `list_indexed` operation so agents can discover which
files are available before searching. The documents table now shows
separate "Modified" (source file date) and "Uploaded at" columns.
- Fix silent data loss: use getString + ISO parsing instead of getNumber
  for RAG status dates, and write dates before marking status completed
- Fix pagination cursor bug with composite cursor encoding skip counts
  to prevent document loss across paginated calls
- Fix O(N^2) backfill by switching to native .paginate()
- Add missing lastModified in org-wide upload path
- Remove silent exception swallowing in crawler date extraction
- Unify OOXML metadata extraction (dates + title/author) to avoid
  double-parsing in vision paths and router endpoints
- Add source_created_at to SearchResult for API symmetry
- Export and reuse getDocumentEffectiveDate to eliminate DRY violation
- Replace as type casts with type guards and shared interfaces
- Fix false-positive recency boost tests (single-result normalization)
- Add pagination continuation tests, parse_search_results tests,
  and fix create_document_from_upload test mocks
The polling loop stored indexedFileId in ragInfo, but updating ragInfo
status to 'running' replaced the entire object and silently dropped
indexedFileId. Subsequent polls fell back to the Convex document ID,
which RAG doesn't recognise, causing documents to stay stuck in
'running' until the 50-attempt timeout.

Remove the redundant indexedFileId field from ragInfo writes and read
document.fileId directly in all RAG interactions. Also fix
_mark_completed in the RAG service to restore chunks_count from actual
chunk rows instead of leaving it at 0 after a re-upload skip.
@larryro larryro force-pushed the feat/document-file-dates branch from 2b5517f to e2282ae Compare March 25, 2026 15:04
@coderabbitai

coderabbitai Bot commented Mar 25, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR adds file metadata extraction capabilities to the crawler service with new /extract-metadata endpoints for PDF, DOCX, and PPTX files. It introduces automatic source timestamp (creation/modification date) extraction from documents during ingestion, persists these timestamps in the documents table, and refactors document storage to remove user_id-based tenant isolation in favor of organization scoping. New indexed-document listing capabilities for agents are added to the Convex knowledge system. The RAG service gains recency-based search result scoring, and database migrations are integrated via dbmate. Helper functions for timestamp parsing across file formats are implemented throughout the platform and RAG services.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main objective of the PR: extracting and surfacing document file dates across the entire platform.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/document-file-dates

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Comment @coderabbitai help to get the list of available commands and usage tips.

larryro added 3 commits March 25, 2026 23:14
Use datetime.UTC alias instead of datetime.timezone.utc (ruff UP017)
and remove dbmate scripts from db package.json (knip unlisted binary).
The checked-in api.d.ts was generated against an older @convex-dev/workflow
version missing list, listByName, restart and updated cleanup signatures.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 22

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
services/rag/app/routers/search.py (1)

35-43: ⚠️ Potential issue | 🟠 Major

source_created_at is still dropped from both response paths.

These mappings now forward source_modified_at, but they still omit source_created_at when constructing SearchResult. Clients will never see the original creation date even if the search service returns it.

Suggested fix
         search_results = [
             SearchResult(
                 content=r.get("content", ""),
                 score=r.get("score", 0.0),
                 file_id=r.get("file_id"),
                 filename=r.get("filename"),
+                source_created_at=r.get("source_created_at"),
                 source_modified_at=r.get("source_modified_at"),
                 metadata=r.get("metadata") if request.include_metadata else None,
             )
             for r in results
         ]
@@
         sources = [
             SearchResult(
                 content=s.get("content", ""),
                 score=s.get("score", 0.0),
                 file_id=s.get("file_id"),
+                source_created_at=s.get("source_created_at"),
                 source_modified_at=s.get("source_modified_at"),
                 metadata=s.get("metadata"),
             )
             for s in result.get("sources", [])
         ]

Also applies to: 76-83

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/rag/app/routers/search.py` around lines 35 - 43, The SearchResult
objects are never populated with source_created_at, so add
source_created_at=r.get("source_created_at") to the SearchResult constructor
when building search_results (the list comprehension/block that constructs
SearchResult) and make the same change in the second construction block around
the other occurrence (the alternate SearchResult creation at lines noted in the
review). Ensure both places include source_created_at from the result dict so
clients receive the original creation timestamp.
services/rag/tests/test_document_content.py (1)

52-59: 🧹 Nitpick | 🔵 Trivial

Add one positive file-date round-trip case.

This widens the fixture shape, but with both new fields always None the suite still won't catch a regression where get_document_content() forgets to return them. Add a case with concrete timestamps and assert they appear in the response.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/rag/tests/test_document_content.py` around lines 52 - 59, The test
fixture DOC_ROW currently sets source_created_at and source_modified_at to None,
so add a positive round-trip case by assigning concrete ISO8601 timestamps
(e.g., "2023-01-01T00:00:00Z") to DOC_ROW["source_created_at"] and
DOC_ROW["source_modified_at"] in the test setup and update the test for
get_document_content() to assert those exact timestamp strings are present in
the returned document payload; ensure you only add one case (leave the existing
None case intact) and reference DOC_ROW and get_document_content() when making
the assertions.
services/rag/app/services/database.py (1)

95-101: ⚠️ Potential issue | 🟠 Major

Fix dimension mismatch handling to prevent startup failure on populated embedding columns.

The ALTER TABLE ... TYPE vector({dimensions}) operation at lines 95-101 will fail if the chunks table already contains vectors from a previous model with a different dimension. pgvector cannot directly cast between incompatible fixed dimensions on populated columns. Add a check to detect and reject mismatched dimensions when embeddings already exist, forcing explicit remediation instead of silent startup failure.

Suggested guard
         elif col_type != expected_type:
+            has_embeddings = await conn.fetchval(
+                f"SELECT EXISTS (SELECT 1 FROM {SCHEMA}.chunks WHERE embedding IS NOT NULL)"
+            )
+            if has_embeddings:
+                raise RuntimeError(
+                    f"{SCHEMA}.chunks.embedding is {col_type}; clear or reindex existing chunks before pinning to {expected_type}."
+                )
             logger.warning(
                 "Embedding column is {}, expected {}. Dimension mismatch — existing embeddings may need re-generation.",
                 col_type,
                 expected_type,
             )
             await conn.execute(f"ALTER TABLE {SCHEMA}.chunks ALTER COLUMN embedding TYPE vector({dimensions})")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/rag/app/services/database.py` around lines 95 - 101, The current
logic attempts to ALTER COLUMN embedding TYPE vector({dimensions}) whenever
col_type != expected_type, which will fail if the chunks table already contains
embeddings of a different dimension; change this to first check whether any rows
have a non-null embedding (e.g., using conn to count non-null embedding entries
on the chunks table) and only perform the ALTER if the column is empty,
otherwise abort startup by raising/logging a clear error that reports col_type,
expected_type and the existing row count and instructs explicit remediation
(re-generate embeddings or drop/replace the column) instead of attempting the
incompatible ALTER; update the branch that references col_type, expected_type,
SCHEMA, chunks, embedding, dimensions and conn.execute to implement this guard
and error path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 498-501: The code is directly indexing the metadata dict
(meta["..."]) which raises KeyError for missing properties; update all direct
accesses to use meta.get("key") (or meta.get("key", default) when a default is
needed) in the docx router where metadata is extracted (look for the variable
meta and functions like extract_metadata/process_docx or the block that builds
the metadata object for docx documents) so missing keys return None or a safe
default instead of raising; ensure any downstream code that assumed a value
handles None/empty-string appropriately.
- Around line 505-511: The route handler in services/crawler/app/routers/docx.py
currently catches generic exceptions and returns a JSON payload with "success":
false while still sending an HTTP 200; update the exception path in the handler
to raise an HTTP error (e.g., raise fastapi.HTTPException(status_code=500,
detail="Extraction failed") or return a Response/JSONResponse with
status_code=500) instead of returning a 2xx body, and ensure the handler (and/or
surrounding except block) logs the exception details (exception instance and
traceback) before returning/raising so callers get a non-2xx status on
unexpected extraction failures.

In `@services/crawler/app/routers/pdf.py`:
- Line 243: The endpoint currently imports the private helper _parse_pdf_date
from another module — change the helper to a public API (rename/export it as
parse_pdf_date) in its defining module and update this module to import
parse_pdf_date instead of _parse_pdf_date; ensure the original function name is
updated consistently (rename the function or add a public wrapper named
parse_pdf_date) and update all call sites in routers/pdf.py that reference
_parse_pdf_date to use parse_pdf_date.
- Around line 251-254: The PDF document opened into variable doc must be closed
on all paths; wrap the code that opens and processes the PDF (the fitz.open(...)
call that assigns doc and subsequent operations that currently call doc.close())
in a try/finally (or use a context manager if supported) so doc.close() is
invoked in the finally block; update the routine in routers/pdf.py that creates
doc to ensure doc.close() runs even on exceptions to avoid leaking the fitz
document handle.

In `@services/crawler/app/routers/pptx.py`:
- Around line 224-227: The error response in the PPTX metadata handler returns
filename=file.filename or "unknown" while the success path uses "unknown.pptx";
update the failure return (the FileMetadataResponse constructed in the
exception/failure branch) to use the same fallback as the success path
(filename=file.filename or "unknown.pptx") so clients receive a consistent
filename format.
- Around line 197-209: The route currently reads the entire upload with
file.read() and then calls _extract_ooxml_metadata and Presentation using the
same bytes, which allows oversized/zip-bomb uploads to exhaust memory/CPU;
before fully loading, enforce a hard upload-size guard (e.g., check
request.content_length or read at most MAX_UPLOAD_BYTES+1 from file and if
exceeded raise HTTPException 413), avoid parsing twice by reusing the validated
bytes (pass the same BytesIO(file_bytes) to both _extract_ooxml_metadata and
Presentation or adapt _extract_ooxml_metadata to accept a stream) and ensure
MAX_UPLOAD_BYTES is a defined constant used for the guard.

In `@services/crawler/app/services/file_parser_service.py`:
- Around line 292-297: Duplicate inline metadata dicts used for DOCX and PPTX
should be replaced by a shared helper to ensure consistent title/author/date
normalization; add a new function core_properties_to_metadata(core_props) (or
similar) in file_parser_service.py that implements the same normalization logic
currently in _extract_ooxml_metadata, then call core_properties_to_metadata(...)
from both the DOCX and PPTX code paths (replace the inline dicts where metadata
is constructed) so both branches reuse the single normalizer instead of
re-parsing or duplicating logic.
- Around line 33-35: The code is clamping parsed datetimes to 1970 which
destroys valid pre-1970 source dates; instead, stop forcing a 1970 floor in the
date handling code (look for functions/methods like parse_date, _serialize_date
or normalize_timestamp referenced around the flagged lines) and preserve the
original datetime value. Remove the hard floor-to-1970 logic and, if downstream
consumers cannot accept negative epoch seconds, serialize pre-1970 values as
ISO-8601 strings or attach an explicit metadata flag (e.g., original_datetime or
is_pre_1970) rather than altering the timestamp.
- Around line 228-248: The vision-path PDF metadata currently returns only a
partial dict or {} on failure, breaking callers that expect the stable metadata
shape returned by parse_pdf(); update the vision extraction code so it builds
and returns the same metadata object shape as parse_pdf() (e.g., keys like
title, author, created, modified — using None or empty strings as sensible
defaults) instead of returning {} on fitz failure, or re-use/normalize via
parse_pdf()’s metadata normalization function so callers don't need
special-casing.
- Around line 20-31: The PDF date parser is incorrectly accepting malformed
timezone-only digits because it uses re.match (prefix only) and a regex that
allows timezone digits without a sign; update the parser to use re.fullmatch
instead of re.match and refactor the PDF date regex so the entire timezone
portion is an optional grouped unit that requires a sign if present (i.e., make
the sign mandatory whenever any timezone digits appear). Locate the PDF date
regex variable and the call that currently uses re.match in the PDF date parsing
function (the code paths that set source_created_at/source_modified_at) and
apply these two changes together to fully reject malformed strings.

In `@services/db/docker-entrypoint-wrapper.sh`:
- Around line 106-116: The background subshell running run_init_scripts and
run_migrations can fail without stopping the main PostgreSQL process (the
subshell exit won't propagate), so either run those steps synchronously before
starting/post-exec of docker-entrypoint.sh, or make failures fatal by
propagating the error to the main process: start Postgres in the wrapper,
capture its PID, run run_init_scripts/run_migrations in the foreground (or in
background but check their exit status), and on any non-zero exit call kill
-TERM "$PG_PID" (or exit with that code) instead of just exiting the subshell;
use the existing pg_isready, run_init_scripts, run_migrations, touch
/tmp/.db_ready and trap logic to implement the chosen approach so migration/init
failures reliably stop the container startup.
- Around line 92-104: The run_migrations function currently runs dbmate but
doesn't stop on failure; update run_migrations to check dbmate's exit status (or
run with set -e) and on non-zero exit print an error including dbmate's stderr
and exit the script (exit 1), and ensure the code that touches /tmp/.db_ready
only runs after a successful migrate; reference the run_migrations function,
MIGRATIONS_DIR, dbmate migrate invocation, and the /tmp/.db_ready readiness
touch to locate where to add the failure check and early exit.

In `@services/db/init-scripts/03-create-knowledge-database.sql`:
- Around line 174-175: The migration is not convergence-safe: CREATE TABLE IF
NOT EXISTS won’t add new columns to an existing private_knowledge.documents
table and CREATE UNIQUE INDEX IF NOT EXISTS may leave an outdated
idx_pk_docs_unique_scope; replace the CREATE variants with idempotent
ALTER/conditional logic — add ALTER TABLE private_knowledge.documents ADD COLUMN
IF NOT EXISTS for source_created_at and source_modified_at (so pre-existing
installs get new columns), and ensure the unique index idx_pk_docs_unique_scope
is recreated safely by checking/dropping the old index if present and then
creating the desired unique index (or using a conditional DO block that creates
the index only when missing) so long-lived deployments converge to the new
schema.

In
`@services/platform/convex/documents/__tests__/create_document_from_upload.test.ts`:
- Around line 29-37: Add a test that asserts the scheduler is invoked for
document date extraction: in the create_document_from_upload test suite,
mockGetAuthUser to return AUTH_USER, create a mock ctx via createMockCtx, call
the handler returned by getHandler with baseArgs and a fileName ending in
.pdf/.docx/.pptx, and expect ctx.scheduler.runAfter toHaveBeenCalledWith(0,
'extractDocumentDates', expect.objectContaining({ documentId: 'doc_created',
fileId: 'storage_1' })); target the handler returned by getHandler and the
scheduler method ctx.scheduler.runAfter and the internal action name
'extractDocumentDates' so the scheduling logic is covered for PDF/DOCX/PPTX
uploads.

In
`@services/platform/convex/documents/__tests__/transform_to_document_item.test.ts`:
- Around line 5-12: The mock factory createMockDocument currently uses an unsafe
cast ("as never") which disables type checking; change it to return a properly
typed partial/fixture—e.g., declare the function to return Partial<DocumentItem>
or use the TypeScript "satisfies" operator on the object literal to satisfy the
expected DocumentItem type—so TypeScript validates fields while still allowing
overrides; update the createMockDocument signature and its return value
accordingly (refer to createMockDocument and any test usages that consume its
return).

In `@services/platform/convex/documents/internal_actions.ts`:
- Around line 457-462: The current deleteDocumentFromRag flow bails out when
document.fileId is missing and treats HTTP 404 from the RAG delete as a hard
failure, leaving Convex records orphaned; change deleteDocumentFromRag so it
does NOT return early when document?.fileId is falsy but instead logs the
missing fileId and continues to call deleteDocumentById(documentId), and modify
the RAG delete logic (the HTTP delete/response handling where it currently
throws on 404) to catch 404 responses, log that the remote resource was already
gone, and treat them as success so deleteDocumentById always runs; keep throwing
or returning errors only for non-404/critical failures.

In `@services/platform/convex/documents/internal_mutations.ts`:
- Around line 64-89: The mutation updateDocumentDates currently only patches
sourceCreatedAt/sourceModifiedAt when a number is present, preventing explicit
clears; change the arg types to accept null (e.g., sourceCreatedAt:
v.union([v.number(), v.null()]) and same for sourceModifiedAt or use a
nullable/union variant) and update the handler so it checks for undefined (not
null) — e.g., if (args.sourceCreatedAt !== undefined) patch.sourceCreatedAt =
args.sourceCreatedAt — and do the same for sourceModifiedAt, then call
ctx.db.patch with that patch so null values will explicitly clear the fields via
ctx.db.patch.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`:
- Around line 140-143: The mapping that builds AgentIndexedDocumentItem
currently uses doc.sourceModifiedAt directly and returns null for legacy docs;
update the mapper in list_indexed_documents_for_agent (the code constructing
documents: AgentIndexedDocumentItem[]) to call the existing
getDocumentEffectiveDate(doc) fallback chain instead of using
doc.sourceModifiedAt, so name, fileId remain the same but sourceModifiedAt is
set to the value returned by getDocumentEffectiveDate for each doc to preserve
legacy timestamps.

In `@services/rag/app/config.py`:
- Around line 48-51: Add Pydantic validation to ensure recency_decay_base is
strictly between 0 and 1 and recency_max_age_days is a positive integer:
implement validators on the model containing
recency_boost_enabled/recency_decay_base/recency_max_age_days (e.g., add
`@validator` methods for "recency_decay_base" and "recency_max_age_days") that
raise ValueError with clear messages when recency_decay_base <= 0 or >= 1 and
when recency_max_age_days <= 0; keep the existing default values and ensure
validation runs on model initialization.

In `@services/rag/app/services/indexing_service.py`:
- Around line 117-127: The PDF handling opens a PyMuPDF document with
fitz.open(stream=content_bytes, filetype="pdf") but calls doc.close() manually,
risking resource leaks if an exception occurs; change to a context manager (with
fitz.open(stream=content_bytes, filetype="pdf") as doc:) inside the branch where
ext == "pdf", read metadata from doc (doc.metadata or {}), then return the two
_parse_pdf_date(...) values and remove the explicit doc.close() call so the file
is always closed safely.

In `@services/rag/app/services/search_service.py`:
- Around line 194-222: The local import of datetime/timezone inside
_apply_recency_boost should be moved to the module level to improve readability
and avoid repeated imports; add "from datetime import datetime, timezone" at the
top of the file and remove the in-function import statement in
_apply_recency_boost so the function uses the module-level symbols.

In `@services/rag/tests/test_indexing_service.py`:
- Around line 727-731: The test's mock for mock_conn.fetchrow currently returns
{"chunks_count": 5, "source_created_at": None, "source_modified_at": None} which
only verifies tolerance for missing dates; update the fixture used by the clone
test to return realistic timestamps instead of None so clone_from_existing() is
exercised copying actual values — replace the None entries in the second
fetchrow response with concrete datetime values (strings or timezone-aware
datetimes matching the code's expected format) so the test asserts the new row
preserves source_created_at and source_modified_at.

---

Outside diff comments:
In `@services/rag/app/routers/search.py`:
- Around line 35-43: The SearchResult objects are never populated with
source_created_at, so add source_created_at=r.get("source_created_at") to the
SearchResult constructor when building search_results (the list
comprehension/block that constructs SearchResult) and make the same change in
the second construction block around the other occurrence (the alternate
SearchResult creation at lines noted in the review). Ensure both places include
source_created_at from the result dict so clients receive the original creation
timestamp.

In `@services/rag/app/services/database.py`:
- Around line 95-101: The current logic attempts to ALTER COLUMN embedding TYPE
vector({dimensions}) whenever col_type != expected_type, which will fail if the
chunks table already contains embeddings of a different dimension; change this
to first check whether any rows have a non-null embedding (e.g., using conn to
count non-null embedding entries on the chunks table) and only perform the ALTER
if the column is empty, otherwise abort startup by raising/logging a clear error
that reports col_type, expected_type and the existing row count and instructs
explicit remediation (re-generate embeddings or drop/replace the column) instead
of attempting the incompatible ALTER; update the branch that references
col_type, expected_type, SCHEMA, chunks, embedding, dimensions and conn.execute
to implement this guard and error path.

In `@services/rag/tests/test_document_content.py`:
- Around line 52-59: The test fixture DOC_ROW currently sets source_created_at
and source_modified_at to None, so add a positive round-trip case by assigning
concrete ISO8601 timestamps (e.g., "2023-01-01T00:00:00Z") to
DOC_ROW["source_created_at"] and DOC_ROW["source_modified_at"] in the test setup
and update the test for get_document_content() to assert those exact timestamp
strings are present in the returned document payload; ensure you only add one
case (leave the existing None case intact) and reference DOC_ROW and
get_document_content() when making the assertions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 507824ab-7bc0-4830-ba4e-caa9e76f78e3

📥 Commits

Reviewing files that changed from the base of the PR and between a4d8a39 and e2282ae.

⛔ Files ignored due to path filters (7)
  • services/db/migrations/db/migrations/00000000000000_baseline.sql is excluded by !**/migrations/**
  • services/db/migrations/db/migrations/20260325000001_add_documents_error_column.sql is excluded by !**/migrations/**
  • services/db/migrations/db/migrations/20260325000002_add_documents_content_hash_index.sql is excluded by !**/migrations/**
  • services/db/migrations/db/migrations/20260325000003_rename_document_id_to_file_id.sql is excluded by !**/migrations/**
  • services/db/migrations/db/migrations/20260325000004_add_document_date_columns.sql is excluded by !**/migrations/**
  • services/db/migrations/db/migrations/20260325000005_drop_user_id_columns.sql is excluded by !**/migrations/**
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
📒 Files selected for processing (51)
  • services/crawler/app/models.py
  • services/crawler/app/routers/docx.py
  • services/crawler/app/routers/pdf.py
  • services/crawler/app/routers/pptx.py
  • services/crawler/app/services/file_parser_service.py
  • services/db/Dockerfile
  • services/db/docker-entrypoint-wrapper.sh
  • services/db/init-scripts/03-create-knowledge-database.sql
  • services/db/package.json
  • services/platform/app/features/documents/hooks/mutations.ts
  • services/platform/app/features/documents/hooks/use-documents-table-config.tsx
  • services/platform/convex/agent_tools/rag/format_search_results.test.ts
  • services/platform/convex/agent_tools/rag/format_search_results.ts
  • services/platform/convex/agent_tools/rag/helpers/list_indexed_documents.ts
  • services/platform/convex/agent_tools/rag/parse_search_results.test.ts
  • services/platform/convex/agent_tools/rag/parse_search_results.ts
  • services/platform/convex/agent_tools/rag/rag_search_tool.ts
  • services/platform/convex/documents/__tests__/create_document_from_upload.test.ts
  • services/platform/convex/documents/__tests__/get_agent_scoped_file_ids.test.ts
  • services/platform/convex/documents/__tests__/list_indexed_documents_for_agent.test.ts
  • services/platform/convex/documents/__tests__/transform_to_document_item.test.ts
  • services/platform/convex/documents/actions.ts
  • services/platform/convex/documents/create_document.ts
  • services/platform/convex/documents/get_agent_scoped_file_ids.ts
  • services/platform/convex/documents/internal_actions.ts
  • services/platform/convex/documents/internal_mutations.ts
  • services/platform/convex/documents/internal_queries.ts
  • services/platform/convex/documents/list_indexed_documents_for_agent.ts
  • services/platform/convex/documents/mutations.ts
  • services/platform/convex/documents/schema.ts
  • services/platform/convex/documents/transform_to_document_item.ts
  • services/platform/convex/documents/types.ts
  • services/platform/convex/documents/update_document_rag_info.ts
  • services/platform/convex/documents/validators.ts
  • services/platform/convex/workflow_engine/action_defs/document/document_action.ts
  • services/platform/messages/en.json
  • services/platform/types/documents.ts
  • services/rag/app/config.py
  • services/rag/app/models.py
  • services/rag/app/routers/documents.py
  • services/rag/app/routers/search.py
  • services/rag/app/services/database.py
  • services/rag/app/services/indexing_service.py
  • services/rag/app/services/rag_service.py
  • services/rag/app/services/search_service.py
  • services/rag/tests/test_background_ingest.py
  • services/rag/tests/test_document_content.py
  • services/rag/tests/test_file_dates.py
  • services/rag/tests/test_indexing_service.py
  • services/rag/tests/test_rag_service.py
  • services/rag/tests/test_search_service.py
💤 Files with no reviewable changes (1)
  • services/platform/convex/documents/actions.ts

Comment on lines +498 to +501
title=meta["title"] or None,
author=meta["author"] or None,
created_at=meta["created_at"],
modified_at=meta["modified_at"],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use safe metadata access to avoid avoidable KeyError failures.

Direct indexing (meta["..."]) can fail when a property is absent. Use .get() for resilient partial metadata extraction.

Proposed fix
-            title=meta["title"] or None,
-            author=meta["author"] or None,
-            created_at=meta["created_at"],
-            modified_at=meta["modified_at"],
+            title=meta.get("title") or None,
+            author=meta.get("author") or None,
+            created_at=meta.get("created_at"),
+            modified_at=meta.get("modified_at"),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` around lines 498 - 501, The code is
directly indexing the metadata dict (meta["..."]) which raises KeyError for
missing properties; update all direct accesses to use meta.get("key") (or
meta.get("key", default) when a default is needed) in the docx router where
metadata is extracted (look for the variable meta and functions like
extract_metadata/process_docx or the block that builds the metadata object for
docx documents) so missing keys return None or a safe default instead of
raising; ensure any downstream code that assumed a value handles
None/empty-string appropriately.

Comment on lines +505 to +511
except Exception:
logger.exception("Error extracting DOCX metadata")
return FileMetadataResponse(
success=False,
filename=file.filename or "unknown",
error="Failed to extract DOCX metadata",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Unexpected extraction failures should not return HTTP 200.

The generic exception path currently returns success=False as a normal response, which results in a 2xx status and can hide operational failures from upstream callers.

Proposed fix
     except HTTPException:
         raise
     except Exception:
         logger.exception("Error extracting DOCX metadata")
-        return FileMetadataResponse(
-            success=False,
-            filename=file.filename or "unknown",
-            error="Failed to extract DOCX metadata",
-        )
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail="Failed to extract DOCX metadata",
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
except Exception:
logger.exception("Error extracting DOCX metadata")
return FileMetadataResponse(
success=False,
filename=file.filename or "unknown",
error="Failed to extract DOCX metadata",
)
except HTTPException:
raise
except Exception:
logger.exception("Error extracting DOCX metadata")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail="Failed to extract DOCX metadata",
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` around lines 505 - 511, The route
handler in services/crawler/app/routers/docx.py currently catches generic
exceptions and returns a JSON payload with "success": false while still sending
an HTTP 200; update the exception path in the handler to raise an HTTP error
(e.g., raise fastapi.HTTPException(status_code=500, detail="Extraction failed")
or return a Response/JSONResponse with status_code=500) instead of returning a
2xx body, and ensure the handler (and/or surrounding except block) logs the
exception details (exception instance and traceback) before returning/raising so
callers get a non-2xx status on unexpected extraction failures.

"""Extract metadata from a PDF file without full text extraction."""
import fitz

from app.services.file_parser_service import _parse_pdf_date

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Avoid importing a private helper across module boundaries.

Importing _parse_pdf_date (underscore-prefixed) from another module couples this endpoint to a private API. Promote it to a public utility (e.g., parse_pdf_date) and import that instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/pdf.py` at line 243, The endpoint currently
imports the private helper _parse_pdf_date from another module — change the
helper to a public API (rename/export it as parse_pdf_date) in its defining
module and update this module to import parse_pdf_date instead of
_parse_pdf_date; ensure the original function name is updated consistently
(rename the function or add a public wrapper named parse_pdf_date) and update
all call sites in routers/pdf.py that reference _parse_pdf_date to use
parse_pdf_date.

Comment on lines +251 to +254
doc = fitz.open(stream=file_bytes, filetype="pdf")
raw = doc.metadata or {}
page_count = len(doc)
doc.close()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Ensure fitz document is always closed (exception-safe cleanup).

If an exception is raised after opening the PDF (Line 251) but before doc.close() (Line 254), the document handle is leaked. Wrap open/read operations in try/finally so close is guaranteed.

Proposed fix
-        doc = fitz.open(stream=file_bytes, filetype="pdf")
-        raw = doc.metadata or {}
-        page_count = len(doc)
-        doc.close()
+        doc = None
+        try:
+            doc = fitz.open(stream=file_bytes, filetype="pdf")
+            raw = doc.metadata or {}
+            page_count = len(doc)
+        finally:
+            if doc is not None:
+                doc.close()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
doc = fitz.open(stream=file_bytes, filetype="pdf")
raw = doc.metadata or {}
page_count = len(doc)
doc.close()
doc = None
try:
doc = fitz.open(stream=file_bytes, filetype="pdf")
raw = doc.metadata or {}
page_count = len(doc)
finally:
if doc is not None:
doc.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/pdf.py` around lines 251 - 254, The PDF document
opened into variable doc must be closed on all paths; wrap the code that opens
and processes the PDF (the fitz.open(...) call that assigns doc and subsequent
operations that currently call doc.close()) in a try/finally (or use a context
manager if supported) so doc.close() is invoked in the finally block; update the
routine in routers/pdf.py that creates doc to ensure doc.close() runs even on
exceptions to avoid leaking the fitz document handle.

Comment on lines +197 to +209
file_bytes = await file.read()
if not file_bytes:
raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Empty file uploaded")

filename = file.filename or "unknown.pptx"
meta = _extract_ooxml_metadata(file_bytes, "pptx")

from io import BytesIO

from pptx import Presentation

prs = Presentation(BytesIO(file_bytes))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add an explicit upload-size guard before parsing PPTX bytes.

Line 197 reads the full file into memory and Lines 202/208 parse it twice. Without a hard cap, oversized/zip-bomb uploads can cause memory/CPU exhaustion on this route.

Suggested fix
     try:
         file_bytes = await file.read()
         if not file_bytes:
             raise HTTPException(status_code=status.HTTP_400_BAD_REQUEST, detail="Empty file uploaded")
+        if len(file_bytes) > 20 * 1024 * 1024:
+            raise HTTPException(status_code=413, detail="File too large for metadata extraction")
 
         filename = file.filename or "unknown.pptx"
         meta = _extract_ooxml_metadata(file_bytes, "pptx")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/pptx.py` around lines 197 - 209, The route
currently reads the entire upload with file.read() and then calls
_extract_ooxml_metadata and Presentation using the same bytes, which allows
oversized/zip-bomb uploads to exhaust memory/CPU; before fully loading, enforce
a hard upload-size guard (e.g., check request.content_length or read at most
MAX_UPLOAD_BYTES+1 from file and if exceeded raise HTTPException 413), avoid
parsing twice by reusing the validated bytes (pass the same BytesIO(file_bytes)
to both _extract_ooxml_metadata and Presentation or adapt
_extract_ooxml_metadata to accept a stream) and ensure MAX_UPLOAD_BYTES is a
defined constant used for the guard.

Comment on lines +140 to +143
const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({
fileId: String(doc.fileId),
name: doc.title ?? 'Untitled',
sourceModifiedAt: doc.sourceModifiedAt ?? null,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve legacy modified dates in list_indexed.

Line 143 only emits doc.sourceModifiedAt, but this PR already added getDocumentEffectiveDate() because older documents can still store their source timestamp in metadata.sourceModifiedAt / metadata.lastModified. As written, list_indexed will return null for those documents and drop the temporal context agents are supposed to get.

💡 Reuse the existing fallback chain
-import type { Doc, Id } from '../_generated/dataModel';
+import type { Doc, Id } from '../_generated/dataModel';
+import type { DocumentMetadata } from './types';
 import type { QueryCtx } from '../_generated/server';
+
+import { getDocumentEffectiveDate } from './transform_to_document_item';
@@
   const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({
     fileId: String(doc.fileId),
     name: doc.title ?? 'Untitled',
-    sourceModifiedAt: doc.sourceModifiedAt ?? null,
+    // oxlint-disable-next-line typescript/no-unsafe-type-assertion -- same Convex metadata shape used in transformToDocumentItem
+    sourceModifiedAt:
+      getDocumentEffectiveDate(
+        doc,
+        doc.metadata as DocumentMetadata | undefined,
+      ) ?? null,
   }));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({
fileId: String(doc.fileId),
name: doc.title ?? 'Untitled',
sourceModifiedAt: doc.sourceModifiedAt ?? null,
import type { Doc, Id } from '../_generated/dataModel';
import type { DocumentMetadata } from './types';
import type { QueryCtx } from '../_generated/server';
import { getDocumentEffectiveDate } from './transform_to_document_item';
const documents: AgentIndexedDocumentItem[] = page.map((doc) => ({
fileId: String(doc.fileId),
name: doc.title ?? 'Untitled',
// oxlint-disable-next-line typescript/no-unsafe-type-assertion -- same Convex metadata shape used in transformToDocumentItem
sourceModifiedAt:
getDocumentEffectiveDate(
doc,
doc.metadata as DocumentMetadata | undefined,
) ?? null,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`
around lines 140 - 143, The mapping that builds AgentIndexedDocumentItem
currently uses doc.sourceModifiedAt directly and returns null for legacy docs;
update the mapper in list_indexed_documents_for_agent (the code constructing
documents: AgentIndexedDocumentItem[]) to call the existing
getDocumentEffectiveDate(doc) fallback chain instead of using
doc.sourceModifiedAt, so name, fileId remain the same but sourceModifiedAt is
set to the value returned by getDocumentEffectiveDate for each doc to preserve
legacy timestamps.

Comment on lines +48 to +51
# Recency boost
recency_boost_enabled: bool = False
recency_decay_base: float = 0.85
recency_max_age_days: int = 730

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider adding validation for recency boost parameters.

The recency_decay_base should be constrained to (0, 1) for meaningful exponential decay, and recency_max_age_days should be positive. Invalid values could produce unexpected search scoring behavior.

🛡️ Suggested validation using Pydantic validators
+from pydantic import field_validator
+
 class Settings(BaseServiceSettings):
     # ... existing fields ...
 
     # Recency boost
     recency_boost_enabled: bool = False
     recency_decay_base: float = 0.85
     recency_max_age_days: int = 730
+
+    `@field_validator`('recency_decay_base')
+    `@classmethod`
+    def validate_decay_base(cls, v: float) -> float:
+        if not 0 < v < 1:
+            raise ValueError('recency_decay_base must be between 0 and 1 (exclusive)')
+        return v
+
+    `@field_validator`('recency_max_age_days')
+    `@classmethod`
+    def validate_max_age_days(cls, v: int) -> int:
+        if v <= 0:
+            raise ValueError('recency_max_age_days must be positive')
+        return v
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Recency boost
recency_boost_enabled: bool = False
recency_decay_base: float = 0.85
recency_max_age_days: int = 730
# Recency boost
recency_boost_enabled: bool = False
recency_decay_base: float = 0.85
recency_max_age_days: int = 730
`@field_validator`('recency_decay_base')
`@classmethod`
def validate_decay_base(cls, v: float) -> float:
if not 0 < v < 1:
raise ValueError('recency_decay_base must be between 0 and 1 (exclusive)')
return v
`@field_validator`('recency_max_age_days')
`@classmethod`
def validate_max_age_days(cls, v: int) -> int:
if v <= 0:
raise ValueError('recency_max_age_days must be positive')
return v
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/rag/app/config.py` around lines 48 - 51, Add Pydantic validation to
ensure recency_decay_base is strictly between 0 and 1 and recency_max_age_days
is a positive integer: implement validators on the model containing
recency_boost_enabled/recency_decay_base/recency_max_age_days (e.g., add
`@validator` methods for "recency_decay_base" and "recency_max_age_days") that
raise ValueError with clear messages when recency_decay_base <= 0 or >= 1 and
when recency_max_age_days <= 0; keep the existing default values and ensure
validation runs on model initialization.

Comment on lines +117 to +127
try:
if ext == "pdf":
import fitz

doc = fitz.open(stream=content_bytes, filetype="pdf")
metadata = doc.metadata or {}
doc.close()
return (
_parse_pdf_date(metadata.get("creationDate")),
_parse_pdf_date(metadata.get("modDate")),
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider using context manager for PDF document.

If an exception occurs between fitz.open() and doc.close(), the document handle may not be properly closed. While Python's garbage collector will eventually clean it up, explicit resource management is cleaner.

♻️ Suggested refactor using context manager
         if ext == "pdf":
             import fitz

-            doc = fitz.open(stream=content_bytes, filetype="pdf")
-            metadata = doc.metadata or {}
-            doc.close()
-            return (
-                _parse_pdf_date(metadata.get("creationDate")),
-                _parse_pdf_date(metadata.get("modDate")),
-            )
+            with fitz.open(stream=content_bytes, filetype="pdf") as doc:
+                metadata = doc.metadata or {}
+                return (
+                    _parse_pdf_date(metadata.get("creationDate")),
+                    _parse_pdf_date(metadata.get("modDate")),
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/rag/app/services/indexing_service.py` around lines 117 - 127, The
PDF handling opens a PyMuPDF document with fitz.open(stream=content_bytes,
filetype="pdf") but calls doc.close() manually, risking resource leaks if an
exception occurs; change to a context manager (with
fitz.open(stream=content_bytes, filetype="pdf") as doc:) inside the branch where
ext == "pdf", read metadata from doc (doc.metadata or {}), then return the two
_parse_pdf_date(...) values and remove the explicit doc.close() call so the file
is always closed safely.

Comment on lines +194 to +222
def _apply_recency_boost(
results: list[dict[str, Any]],
decay_base: float,
max_age_days: int,
) -> None:
"""Scale RRF scores by document age so newer documents rank higher.

Modifies *results* in place: adjusts ``rrf_score``, re-normalises so the
top result equals 1.0, and re-sorts descending.
"""
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
for item in results:
doc_ts = item.get("source_modified_at") or item.get("created_at")
if doc_ts is None:
item["rrf_score"] *= decay_base
continue
age_days = (now - doc_ts).total_seconds() / 86400
recency_factor = max(0.0, 1.0 - age_days / max_age_days)
boost = decay_base + (1.0 - decay_base) * recency_factor
item["rrf_score"] *= boost

max_score = max((r["rrf_score"] for r in results), default=1.0)
if max_score > 0:
for r in results:
r["rrf_score"] /= max_score

results.sort(key=lambda x: x.get("rrf_score", 0), reverse=True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider moving the import to module level.

The datetime import at line 204 inside the function works but is unconventional. Moving it to the module level improves readability and avoids repeated import overhead.

🔧 Move import to module level

At the top of the file (around line 7):

from datetime import datetime, timezone

Then remove line 204.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/rag/app/services/search_service.py` around lines 194 - 222, The
local import of datetime/timezone inside _apply_recency_boost should be moved to
the module level to improve readability and avoid repeated imports; add "from
datetime import datetime, timezone" at the top of the file and remove the
in-function import statement in _apply_recency_boost so the function uses the
module-level symbols.

Comment on lines 727 to 731
mock_conn.fetchrow = AsyncMock(
side_effect=[
None,
{"chunks_count": 5},
{"chunks_count": 5, "source_created_at": None, "source_modified_at": None},
{"id": "new-uuid"},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Use real timestamps in this clone fixture.

None/None only proves the clone path tolerates missing dates. It won't catch a regression where clone_from_existing() stops copying actual source_created_at / source_modified_at values into the new row.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/rag/tests/test_indexing_service.py` around lines 727 - 731, The
test's mock for mock_conn.fetchrow currently returns {"chunks_count": 5,
"source_created_at": None, "source_modified_at": None} which only verifies
tolerance for missing dates; update the fixture used by the clone test to return
realistic timestamps instead of None so clone_from_existing() is exercised
copying actual values — replace the None entries in the second fetchrow response
with concrete datetime values (strings or timezone-aware datetimes matching the
code's expected format) so the test asserts the new row preserves
source_created_at and source_modified_at.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant