Skip to content

Add create-subset command for database subsetting#162

Merged
wesm merged 11 commits intomainfrom
create-subset-cmd
Feb 26, 2026
Merged

Add create-subset command for database subsetting#162
wesm merged 11 commits intomainfrom
create-subset-cmd

Conversation

@wesm
Copy link
Owner

@wesm wesm commented Feb 26, 2026

Summary

  • Adds a create-subset CLI command that copies the N most recent messages (and all referenced data) from the archive into a new, standalone msgvault database
  • Copies all dependent rows (conversations, participants, labels, recipients, bodies, attachments) in FK-safe dependency order with foreign key verification
  • Updates denormalized conversation counts and populates FTS index in the destination database
  • Includes tests covering basic subsetting, row limits, FTS population, conversation count consistency, pre-existing directories, duplicate destination detection, and SQL injection/control character path validation

Supersedes #101.

Test plan

  • Unit tests pass (go test ./internal/store/ -run TestCopySubset)
  • Manual test: ./msgvault create-subset -o /tmp/subset --rows 100 against a real archive
  • Verify subset DB works: MSGVAULT_HOME=/tmp/subset msgvault tui

🤖 Generated with Claude Code

Add `msgvault create-subset --output <dir> --rows <n>` to create a
smaller database from the archive containing the N most recent messages
and all referentially-linked data. Useful for testing, demos, or sharing.

Core logic in internal/store/subset.go:
- Copies messages in dependency order with FK validation
- Updates denormalized conversation counts
- Populates FTS5 index
- Cleans up on error

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wesm wesm mentioned this pull request Feb 26, 2026
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (2ac2138)

Summary: The PR successfully implements the create-subset functionality but requires remediation for a high-severity secret exposure risk and several medium-severity data integrity and file handling bugs.

High

  • Unintentional Secret Exposure via config.toml
    • Files: [/home/roborev/repos/msgvault/cmd/msgvault/cmd/create_subset.go:77](/home/roborev/repos/msgvault/cmd/
      msgvault/cmd/create_subset.go:77), [/home/roborev/repos/msgvault/cmd/msgvault/cmd/create_subset.go:79](/home/roborev/repos/msgvault/cmd/msgvault/cmd/create_subset.
      go:79), /home/roborev/repos/msgvault/cmd/msgvault/cmd/create_subset.go:101, /home
      /roborev/repos/msgvault/cmd/msgvault/cmd/create_subset.go:104
    • Issue: create -subset unconditionally copies config.toml into the subset directory by default. Since the tool is intended for sharing, this creates a direct credential exposure path if the config contains API keys, account metadata, or other local secrets.
    • Remediation: Do not copy config.toml by
      default. Make it an explicit opt-in (e.g., --include-config or --copy-config) with a strong warning, or generate a sanitized config that excludes secret fields. Ensure restrictive permissions (e.g., 0600) if copied.

Medium


Missing Participant Dependency for Reactions**
* Files: /home/roborev/repos/msgvault/internal/store/subset.go:242, /home/rob
orev/repos/msgvault/internal/store/subset.go:298

* Issue: Participants are copied from message senders and recipients, but reactions are copied
later. If a reactor is not in the sender/recipient sets, foreign key validation fails, causing the entire subset operation to roll back.
* Remediation: Include SELECT participant_id FROM src.reactions WHERE message_id IN (SELECT id FROM selected_messages) in the participant selection
union to ensure all required foreign keys are present.

  • File Lock Prevents Cleanup on Error (Windows)
    • Files: [/home/roborev/repos/msgvault/internal/store/subset.go:121](/home/roborev/repos/msg
      vault/internal/store/subset.go:121)
    • Issue: cleanup() is called synchronously in error paths before the deferred db.Close() executes. On Windows, the active *sql.DB connection holds an exclusive file lock, causing os.Remove and os.RemoveAll to silently fail and leave partial database files behind.
    • Remediation: Ensure db.Close() is explicitly called before cleanup() in error paths, or refactor cleanup() into a deferred function that runs after the database is closed.

Schema-Drift Fragility in INSERT ... SELECT ***
* Files: [/home/roborev/repos/msgvault/internal/store/subset.go:210](/home/roborev/repos/msgvault/internal/store/subset.go:
210), /home/roborev/repos/msgvault/internal/store/subset.go:230, /home/roborev/repos/msgvault/internal
/store/subset.go:274

* Issue: Broad SELECT * copies assume the source and destination table column count and order match exactly. This can fail on
older source databases if the schema has evolved.
* Remediation: Use explicit column lists per table instead of SELECT *, or implement compatibility mapping via PRAGMA table_info and defaults for missing columns.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

- Copy only sources referenced by selected messages instead of all
  sources, preventing unrelated account metadata from leaking into
  shared subsets
- Remove automatic config.toml copy which could expose API keys
  (server.api_key, remote.api_key) when sharing subset databases
- Add multi-source test verifying only relevant sources, labels,
  and conversations are included in the subset

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (472ec60)

Summary Verdict: The create-subset implementation is secure and well-structured, but contains one
medium-severity issue regarding input validation in the library API.

Medium

Missing guard for non-positive rowCount in library API
File: [internal/store/subset.go:33](/home/roborev/repos/msgvault/internal/store/subset.go
:33)

runCreateSubset validates --rows, but CopySubset itself does not. In SQLite, LIMIT -1 means "no limit", so a negative value can copy the entire dataset unexpectedly.
Suggested fix: Validate rowCount > 0 at the start of
CopySubset and return a clear error.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

SQLite treats LIMIT -1 as unlimited, so a negative rowCount passed
directly to the library function would silently copy the entire
database. The CLI already validated, but the library API did not.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (d7dc50f)

Summary Verdict: The PR introduces robust security controls for the new create-subset functionality, but contains a medium-
severity issue regarding foreign key constraints on message replies.

Medium

  • File: internal/store/subset.go, [internal/store/subset.go](/home/roborev/repos
    /msgvault/internal/store/subset.go#L289), internal/store/subset.go
  • Problem: Subset selection copies only the top N messages, but
    preserves messages.reply_to_message_id. If a selected message replies to a non-selected message, the foreign_key_check will fail and subset creation will abort.
  • Suggested Fix: Either include referenced parent messages in selected_messages (via a closure or recursive CTE),
    or null out reply_to_message_id when the parent is not selected during the insert.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

wesm and others added 2 commits February 26, 2026 09:54
- Detach source DB immediately after tx.Commit() so PRAGMA
  foreign_key_check only scans the destination database, not the
  entire source archive
- Add os.Stat check for source DB path in CopySubset to prevent
  ATTACH from silently creating an empty file for missing paths
- Close db before cleanup on error paths so WAL/SHM files are
  released before removal
- Include WAL/SHM in cleanup when destination directory pre-existed
- Fix "Created subset in <duration>" wording ambiguity

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Seeds a source DB with an FK violation (dangling label_id in
message_labels), then verifies CopySubset succeeds because the
PRAGMA foreign_key_check runs only after src is detached.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (f9a4130)

Summary Verdict: The code changes are generally
secure and sound, but there are two medium-severity logic issues related to data subset creation that need to be addressed.

Medium

Soft-deleted messages can be selected into the subset

  • File: [internal/store/subset.go:233](/home/roborev/repos/
    msgvault/internal/store/subset.go:233)
  • Details: The subset seed query selects by sent_at DESC LIMIT ? without filtering deleted_from_source_at IS NULL. Most read/search paths in this repo exclude soft-deleted messages, so --rows N can produce fewer than N visible messages in the subset DB.
  • Suggested fix: Add WHERE deleted_from_source_at IS NULL to the selected_messages query (or make inclusion explicit via a flag).
  • Testing gap: No test currently covers deleted messages in source data.

Reactions can reference participants that were never copied

  • Files: internal/store/subset.go:271, internal/store/subset.go:327
  • Details: Participants are copied from message sender + message_recipients, but reactions are copied separately and include a participant_id foreign key. If a reactor is neither a
    sender nor a recipient in the selected rows, FK violations occur and subset creation fails.
  • Suggested fix: Include reaction participants in the participant copy scope (e.g., UNION SELECT participant_id FROM src.reactions WHERE message_id IN (SELECT id FROM selected_messages)).
  • Testing
    gap:
    No test covers reactions with a distinct reactor participant.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (5098e68)

Summary Verdict: The reviewed code is clean with no medium, high, or critical issues found.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

- Filter deleted_from_source_at IS NULL in the subset seed query so
  soft-deleted messages don't consume row slots
- Include reaction participant_ids in the participant copy scope to
  prevent FK violations when reactions reference non-sender/recipient
  participants

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (927d5a7)

Verdict: The code is secure and structurally sound, but contains one medium-severity data integrity issue related to foreign key constraints during subset creation.

Medium

Dangling reply_to_message_id Foreign Key Reference

  • Files: subset.go, [subset.go](/home/roborev/repos/msgvault/internal/store/subset.go#L3
  • Problem: messages are copied as-is, including reply_to_message_id. If a selected message replies to an older non-selected message, PRAGMA foreign_key_check will fail and abort the whole subset creation.
  • Suggested Fix: After the
    message copy, either include the referenced reply ancestors in selected_messages, or null out any dangling reply_to_message_id values in the destination database (e.g., UPDATE messages SET reply_to_message_id = NULL WHERE reply_to_message_id NOT IN (SELECT id FROM messages )). Additionally, add a regression test with a reply edge that crosses the subset boundary.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

- Include labels referenced by selected messages (via message_labels)
  in addition to source-scoped labels, so user-created labels with
  NULL source_id are preserved in subset output
- Only suppress FTS errors for exact "no such table: messages_fts"
  and "no such module: fts5" patterns instead of broad substring
  matching that could hide real write failures

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (702144c)

Verdict: The changes implement the subset export functionality securely, but there are medium-severity structural and data integrity issues that need addressing.

Medium

  • File: internal/store/subset.go:234
    , internal/store/subset.go:309, internal/store
    /schema.sql:130

    CopySubset selects exactly rowCount messages and copies them directly, but messages.reply_to_message_id is a
    self-FK. If a copied message replies to a parent message outside the subset, FK validation fails and the whole subset creation errors.
    Suggested fix: Either include required parent messages in selected_messages (recursive closure), or preserve exact row count by nulling orphaned reply_to_message _id after insert.

  • File: internal/store/subset.go (around lines 120-131 and 147)
    ATTACH DATABASE and DETACH DATABASE are executed using db.Exec(), which checks out a connection
    and immediately returns it to the pool. The subsequent db.Begin() or DETACH could grab a different connection from the pool that does not have the database attached. While this happens to pass tests due to Go's LIFO idle connection reuse in a single-threaded context, it is structurally unsafe in
    database/sql.
    Suggested fix: Acquire a dedicated connection using conn, err := db.Conn(context.Background()), then execute ATTACH, BeginTx, Commit/Rollback, and DETACH exclusively on that *sql.Conn.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

- Switch FTS error suppression from Contains to HasSuffix so
  errors for related tables like messages_fts_data are not
  silently suppressed
- Move SourceFKViolationIgnored comment to its correct test
  function and add comment to NullSourceIDLabels test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (c333117)

The code is secure and well-structured overall, but requires fixes for timestamp fallbacks and foreign key handling to ensure data consistency
.

Medium

  • Timestamp Fallback Selection: "Most recent" selection is based only on sent_at, which can pick the wrong rows when sent_at is NULL but received_at or internal_date is present.

Suggested fix: Order by COALESCE(sent_at, received_at, internal_date) (or the project’s canonical timestamp expression) so subset selection matches app behavior. Add a test with mixed NULL sent_at values and non-NULL fallback timestamps.

  • Foreign Key Validation on
    Parent Messages:
    Referenced parent messages via reply_to_message_id are not handled, so FK validation can fail if a selected message replies to a non-selected message.
    • Location: [subset.go:308](//home/roborev/repos/msgvault/
      internal/store/subset.go:308), subset.go:310
    • Suggested fix: Either (a) recursively include parent messages in selected_ messages, or (b) null out reply_to_message_id when the parent is absent before the FK check. Add a regression test where a selected message references an older, excluded parent.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

- Order subset selection by COALESCE(sent_at, received_at,
  internal_date) to match app query behavior and correctly rank
  messages where sent_at is NULL
- Null out reply_to_message_id when the parent message wasn't
  selected, preventing FK violations from dangling self-references

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (1a17f7a)

Verdict: The code is generally clean and secure, but there is one medium-severity issue regarding
conversation timestamp calculations.

Medium

  • last_message_at can be incorrectly nulled for conversations whose copied messages only have received_at/internal_date
    • File: internal/store/subset.go (in updateConversationCounts, MAX (sent_at))
    • Problem: Message selection now uses COALESCE(sent_at, received_at, internal_date), but conversation recalculation still uses only sent_at. For subsets with missing sent_at, last_message_at becomes
      NULL, which can break conversation ordering and recency behavior.
    • Suggested fix: Recompute using the same canonical fallback, e.g., MAX(COALESCE(sent_at, received_at, internal_date)).

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

wesm and others added 2 commits February 26, 2026 13:43
- Use COALESCE(sent_at, received_at, internal_date) in
  updateConversationCounts to match the selection query, preventing
  NULL last_message_at for messages without sent_at
- Add id DESC as tie-breaker to subset selection for deterministic
  results when timestamps are equal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that when multiple messages share the same coalesced
timestamp, the id DESC tie-breaker selects the highest IDs at
the LIMIT boundary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (377c580)

Verdict: The code is secure and well-structured, but requires a fix for a connection pooling issue during database attachment.

Medium

File:
internal/store/subset.go:114

  • Problem: ATTACH DATABASE is a connection-scoped operation. Because Go's database/sql uses a connection pool, db.Exec(attachSQL) and the subsequent db.Begin() could theoretically run on different underlying
    SQLite connections, causing the transaction to fail with "no such table".
  • Suggested Fix: Add db.SetMaxOpenConns(1) immediately after sql.Open to ensure all queries strictly reuse the same connection.

Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

@roborev-ci
Copy link

roborev-ci bot commented Feb 26, 2026

roborev: Combined Review (f516095)

Verdict: All reviewers found no issues; the code is clean and secure.

The reviews comprehensively evaluated the new create-subset CLI command and store.CopySubset implementation, focusing on correctness, regression testing, and security (including injection, authorization
, and path traversal risks). All agents agreed that dynamic SQL inputs are safely handled, proper validations are in place, and no vulnerabilities or functional defects were identified.


Synthesized from 4 reviews (agents: codex, gemini | types: default, security)

@wesm wesm merged commit 3164737 into main Feb 26, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant