Skip to content

fix(platform): dedupe overlapping PII matches and add CVC pattern#1624

Merged
larryro merged 1 commit into
mainfrom
fix/pii-masking-overlap-1618
Apr 25, 2026
Merged

fix(platform): dedupe overlapping PII matches and add CVC pattern#1624
larryro merged 1 commit into
mainfrom
fix/pii-masking-overlap-1618

Conversation

@larryro

@larryro larryro commented Apr 25, 2026

Copy link
Copy Markdown
Collaborator

Closes #1618.

Summary

  • Sweep overlapping PII matches in detectPii so the masker never sees two ranges sharing a span (longer match wins; on tie, earlier built-in wins).
  • Add a context-anchored cvc pattern: matches CVC: 123, my cvv is 4567, card security code 890, etc. Bare 3-digit numbers are intentionally not detected.
  • 13 new tests, including the exact repro from the issue.

Why the bug looks like "email wasn't masked"

The reporter said only the credit card got masked and email/CVC came through. Tracing it on real inputs revealed two distinct issues:

  1. The masker was eating the [EMAIL] token. The built-in phone regex matches a 14-char prefix of any credit-card number (e.g. 4532-1234-5678 inside 4532-1234-5678-9010). detectPii returned both ranges; maskPii sorted DESC by start and spliced each replacement using original-text indices into a string already mutated by earlier splices. Once two matches shared a start, the later splice's result.slice(match.end) landed at the wrong offset and consumed adjacent characters — sometimes the [EMAIL] placeholder entirely.

    Concrete repro (verified against the unfixed code):

    Input Output
    4532123456789010 test@example.com 123 [CREDIT_CARD] 123 (email vanishes)
    cc 4111-1111-1111-1111 mail user@foo.org cvv 999 cc [CREDIT_CARD]EMAIL] cvv 999 ([ of [EMAIL] eaten)
    My credit card is 4532-1234-5678-9010, my email is alice@example.com. My credit card is [CREDIT_CARD]ail is [EMAIL]. (, my em eaten)

    Fix: dedupe overlaps in detectPii via a sort-and-sweep. Once the input has no overlaps, the existing end-to-start splice in maskPii is correct, so maskPii itself is unchanged.

  2. CVC was never detected. BUILT_IN_PII_PATTERNS had no CVC entry. Adding one is non-trivial because bare 3-digit numbers can't be reliably distinguished from ages, room numbers, error codes, etc. — Microsoft Presidio, AWS Comprehend, and Cloudflare WAF all skip CVV detection for the same reason. We add a context-anchored pattern that requires a label keyword (cvc/cvv/cv2/card security code) near the digits. Bare 123 is intentionally not detected.

Out of scope (called out for the reviewer)

  • The issue mentions "no per-issue error toast shown". Once the overlap bug is fixed, masking is deterministic and complete, so the silent-failure problem the reporter saw goes away. Adding a positive "Masked N items" notice is a separate UX decision and not in this PR.
  • Bare CVV digits without a label keyword are intentionally not detected (industry-standard tools skip these).

Behavior change for existing orgs

Existing orgs' saved enabledPatterns arrays don't list cvc, so it's off-by-default for them — admins toggle the new "Card security codes (CVC/CVV)" switch on in Settings → Guardrails. New orgs pick it up via the PATTERN_NAMES default in pii-config.tsx. No DB migration; the schema stores enabledPatterns as z.array(z.string()) (no enum).

Test plan

  • npm run lint --workspace=@tale/platform — 0 warnings, 0 errors
  • npx tsc --noEmit from services/platform/ — clean
  • npx vitest run convex/governance/pii — 33/33 (12 new + 21 existing)
  • npx vitest run convex/agents/__tests__/unified_chat_ttft.test.ts — 3/3
  • npx vitest run app/features/settings/governance/components/pii-config.test.tsx — 1/1
  • Manual e2e: enable PII masking, enable cvc toggle, send My card is 4532-1234-5678-9010, email user@example.com, CVC: 987 → all three masked, no eaten text.

Summary by CodeRabbit

  • New Features

    • Added detection for card security codes (CVC/CVV).
    • Extended multi-language support for PII pattern labels.
  • Bug Fixes

    • Enhanced PII detection to better handle overlapping matches, preventing duplicate masking.
  • Tests

    • Added comprehensive regression tests covering PII deduplication, overlapping patterns, and security code detection.

)

The phone regex matches a 14-char prefix of any credit-card number, so
detectPii returned both ranges with the same start. maskPii then spliced
each replacement using original-text indices into a string already mutated
by an earlier splice, which corrupted the output and could swallow the
next [EMAIL] token entirely (e.g. "4532123456789010 test@example.com 123"
became "[CREDIT_CARD] 123" — the email vanished).

Fixed by sweeping overlapping matches in detectPii: longer match wins, on
equal length the earlier-inserted pattern wins. Once detectPii returns no
overlaps, the existing end-to-start splice in maskPii is correct.

Also adds a context-anchored CVC pattern. Bare 3-digit numbers are
intentionally NOT detected (would false-positive on ages, room numbers,
error codes); Microsoft Presidio, AWS Comprehend, and Cloudflare WAF skip
CVV detection for the same reason. The pattern catches labeled cases like
"my CVC is 123", "cvv: 456", "card security code 789".

Existing orgs need to toggle the new "Card security codes (CVC/CVV)"
pattern on in Settings → Guardrails; new orgs pick it up via the
PATTERN_NAMES default.
@larryro larryro merged commit 7ba430e into main Apr 25, 2026
16 of 17 checks passed
@larryro larryro deleted the fix/pii-masking-overlap-1618 branch April 25, 2026 15:19
@coderabbitai

coderabbitai Bot commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

This PR addresses a PII masking bug where multiple PII items in a single input weren't all detected. It modifies the detectPii function to post-process regex matches and remove overlapping spans, prioritizing earlier starts and longer matches. A new CVC/CVV pattern is added to BUILT_IN_PII_PATTERNS to detect card security codes when explicitly labeled. Regression tests are included to verify correct deduplication when credit card, phone, and email patterns overlap, and to validate the new CVC detection logic. Translation entries for the CVC pattern are added to English, German, and French message files.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the two main changes: deduping overlapping PII matches and adding CVC pattern detection.
Linked Issues check ✅ Passed The PR fully addresses issue #1618: fixes overlapping match deduplication [detectPii in pii_detector.ts], adds CVC pattern detection [pii_patterns.ts], includes comprehensive tests [pii_patterns.test.ts], and updates i18n translations for all supported languages.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the PII masking bug: deduplication logic, CVC pattern addition, test coverage, and i18n translations. No unrelated changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/pii-masking-overlap-1618

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: PII Masking Inconsistent — Multiple PII Items in Single Prompt Only Partially Masked

1 participant