Add create-subset command for database subsetting #162
Add `msgvault create-subset --output <dir> --rows <n>` to create a smaller database from the archive containing the N most recent messages and all referentially-linked data. Useful for testing, demos, or sharing.

Core logic in internal/store/subset.go:
- Copies messages in dependency order with FK validation
- Updates denormalized conversation counts
- Populates FTS5 index
- Cleans up on error

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
roborev: Combined Review (
- Copy only sources referenced by selected messages instead of all sources, preventing unrelated account metadata from leaking into shared subsets
- Remove automatic config.toml copy, which could expose API keys (server.api_key, remote.api_key) when sharing subset databases
- Add multi-source test verifying only relevant sources, labels, and conversations are included in the subset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
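The source-scoping rule above can be illustrated with a minimal in-memory sketch. The `selectedMessage` type and `referencedSourceIDs` function are hypothetical names chosen for illustration; the actual code in internal/store/subset.go presumably does this in SQL against the attached archive.

```go
package main

import (
	"fmt"
	"sort"
)

// selectedMessage is a simplified, hypothetical stand-in for a selected
// row of the messages table.
type selectedMessage struct {
	ID       int64
	SourceID int64
}

// referencedSourceIDs returns the distinct source IDs used by msgs, sorted
// ascending. Only these sources would be copied into the subset.
func referencedSourceIDs(msgs []selectedMessage) []int64 {
	seen := make(map[int64]struct{})
	for _, m := range msgs {
		seen[m.SourceID] = struct{}{}
	}
	ids := make([]int64, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	msgs := []selectedMessage{{1, 2}, {2, 2}, {3, 5}}
	fmt.Println(referencedSourceIDs(msgs)) // [2 5]
}
```

Sources that no selected message points at never appear in the result, so their account metadata stays out of the shared database.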
SQLite treats LIMIT -1 as unlimited, so a negative rowCount passed directly to the library function would silently copy the entire database. The CLI already validated, but the library API did not.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
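A minimal sketch of such a guard at the library boundary follows. The `validateRowCount` helper and `errInvalidRowCount` sentinel are hypothetical, and rejecting zero as well as negative values is an assumption; the commit only states that negative values were the problem.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidRowCount is a hypothetical sentinel error used for illustration.
var errInvalidRowCount = errors.New("rowCount must be positive")

// validateRowCount rejects non-positive values before they reach SQL.
// SQLite treats a negative LIMIT as "no limit", so -1 would copy everything.
// Rejecting zero too is an assumption of this sketch.
func validateRowCount(rowCount int) error {
	if rowCount <= 0 {
		return fmt.Errorf("%w: got %d", errInvalidRowCount, rowCount)
	}
	return nil
}

func main() {
	for _, n := range []int{100, 0, -1} {
		fmt.Printf("rowCount=%d err=%v\n", n, validateRowCount(n))
	}
}
```

Checking in the library function rather than only in the CLI means every caller gets the same protection.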
- Detach source DB immediately after tx.Commit() so PRAGMA foreign_key_check only scans the destination database, not the entire source archive
- Add os.Stat check for source DB path in CopySubset to prevent ATTACH from silently creating an empty file for missing paths
- Close db before cleanup on error paths so WAL/SHM files are released before removal
- Include WAL/SHM in cleanup when destination directory pre-existed
- Fix "Created subset in <duration>" wording ambiguity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
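The os.Stat pre-check mentioned above might look roughly like the sketch below. `checkSourceDB` is a hypothetical helper, and the directory check is an extra assumption not stated in the commit.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// checkSourceDB is a hypothetical helper that fails fast when the source
// path is missing. Without it, SQLite's ATTACH would create an empty
// database file at that path instead of reporting an error.
func checkSourceDB(path string) error {
	info, err := os.Stat(path)
	if errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("source database not found: %s", path)
	}
	if err != nil {
		return fmt.Errorf("stat source database: %w", err)
	}
	// Assumption of this sketch: a directory is never a valid database path.
	if info.IsDir() {
		return fmt.Errorf("source database path is a directory: %s", path)
	}
	return nil
}

func main() {
	fmt.Println(checkSourceDB("/nonexistent/msgvault.db"))
}
```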
Seeds a source DB with an FK violation (dangling label_id in message_labels), then verifies CopySubset succeeds because the PRAGMA foreign_key_check runs only after src is detached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Filter deleted_from_source_at IS NULL in the subset seed query so soft-deleted messages don't consume row slots
- Include reaction participant_ids in the participant copy scope to prevent FK violations when reactions reference non-sender/recipient participants

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
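To illustrate the widened participant scope, here is a small in-memory sketch. The `participantScope` function and its slice inputs are hypothetical; the real implementation presumably builds this set with SQL over the selected messages.

```go
package main

import "fmt"

// participantScope returns the set of participant IDs that must be copied
// so that no foreign key in the subset dangles. Reactors are included
// because a reaction may come from someone who is neither the sender nor
// a recipient of any selected message.
func participantScope(senders, recipients, reactors []int64) map[int64]bool {
	scope := make(map[int64]bool)
	for _, group := range [][]int64{senders, recipients, reactors} {
		for _, id := range group {
			scope[id] = true
		}
	}
	return scope
}

func main() {
	scope := participantScope([]int64{1}, []int64{2, 3}, []int64{3, 9})
	fmt.Println(len(scope), scope[9]) // 4 true
}
```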
- Include labels referenced by selected messages (via message_labels) in addition to source-scoped labels, so user-created labels with NULL source_id are preserved in subset output
- Only suppress FTS errors for exact "no such table: messages_fts" and "no such module: fts5" patterns instead of broad substring matching that could hide real write failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Switch FTS error suppression from Contains to HasSuffix so errors for related tables like messages_fts_data are not silently suppressed
- Move SourceFKViolationIgnored comment to its correct test function and add comment to NullSourceIDLabels test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
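The difference between substring and suffix matching can be seen in a short sketch. The helper names `isMissingFTSMessage` and `isMissingFTS` are hypothetical; the two message strings are the ones quoted in the commits above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isMissingFTSMessage reports whether msg ends with one of the two messages
// that indicate FTS5 is simply unavailable. A suffix match means an error
// about messages_fts_data is not mistaken for one about messages_fts.
func isMissingFTSMessage(msg string) bool {
	return strings.HasSuffix(msg, "no such table: messages_fts") ||
		strings.HasSuffix(msg, "no such module: fts5")
}

// isMissingFTS applies the same check to an error value.
func isMissingFTS(err error) bool {
	return err != nil && isMissingFTSMessage(err.Error())
}

func main() {
	fmt.Println(isMissingFTS(errors.New("no such table: messages_fts")))      // true
	fmt.Println(isMissingFTS(errors.New("no such table: messages_fts_data"))) // false
}
```

With `strings.Contains`, the second call would also return true and a real write failure could be hidden.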
- Order subset selection by COALESCE(sent_at, received_at, internal_date) to match app query behavior and correctly rank messages where sent_at is NULL
- Null out reply_to_message_id when the parent message wasn't selected, preventing FK violations from dangling self-references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
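The reply_to_message_id fix can be sketched in memory as follows. The `message` type and `clearDanglingReplies` function are hypothetical stand-ins; the actual change is presumably expressed in the SQL that copies messages.

```go
package main

import "fmt"

// message is a simplified, hypothetical stand-in for a messages row.
type message struct {
	ID        int64
	ReplyToID *int64 // nil models a NULL reply_to_message_id
}

// clearDanglingReplies sets ReplyToID to nil whenever the parent message is
// not part of the selected subset, so the self-referencing foreign key
// never points at a row that was left behind in the archive.
func clearDanglingReplies(msgs []message) {
	selected := make(map[int64]bool, len(msgs))
	for _, m := range msgs {
		selected[m.ID] = true
	}
	for i := range msgs {
		if p := msgs[i].ReplyToID; p != nil && !selected[*p] {
			msgs[i].ReplyToID = nil
		}
	}
}

func main() {
	parent10, parent2 := int64(10), int64(2)
	msgs := []message{
		{ID: 10},
		{ID: 11, ReplyToID: &parent10}, // parent selected: kept
		{ID: 12, ReplyToID: &parent2},  // parent not selected: cleared
	}
	clearDanglingReplies(msgs)
	for _, m := range msgs {
		fmt.Println(m.ID, m.ReplyToID != nil)
	}
}
```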
- Use COALESCE(sent_at, received_at, internal_date) in updateConversationCounts to match the selection query, preventing NULL last_message_at for messages without sent_at
- Add id DESC as tie-breaker to subset selection for deterministic results when timestamps are equal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that when multiple messages share the same coalesced timestamp, the id DESC tie-breaker selects the highest IDs at the LIMIT boundary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Adds a `create-subset` CLI command that copies the N most recent messages (and all referenced data) from the archive into a new, standalone msgvault database

Supersedes #101.
Test plan
- `go test ./internal/store/ -run TestCopySubset`
- `./msgvault create-subset -o /tmp/subset --rows 100` against a real archive
- `MSGVAULT_HOME=/tmp/subset msgvault tui`

🤖 Generated with Claude Code