jamesconsultingllc

pdf-anonymize

True-redacts PII from PDF, Word (.docx), and Excel (.xlsx) documents before sharing. PDFs have glyphs physically removed and replacement text re-laid; DOCX patches run slices to preserve formatting; XLSX replaces cells, comments, headers, defined names, and properties. Handles names, addresses, account numbers, credit scores, and merchants/stores/phones/cities on transaction lines. Works on ASCII PDFs and modern bank statements with CID-encoded subsetted fonts (BoA, Chase). Single file or directory (with --recursive). Output cannot be recovered via Ctrl+F, pdftotext, or another AI. Trigger on "anonymize/redact/sanitize/de-identify this PDF/Word/Excel", "scrub PII", "remove my name/address", "share this safely" — even when the user just attaches a file and asks to strip identifying info. DO NOT USE FOR visual watermarking, legacy .doc/.xls, or .pptx. Do NOT use nano-pdf — it re-renders pages via image AI and corrupts financial text.

Install

npx skillscat add jamesconsultingllc/anonymize

Install via the SkillsCat registry.

SKILL.md

PDF Anonymize

Performs true redaction of PII in a PDF using PyMuPDF's font-aware
redaction APIs. Original glyphs are physically removed from the content
stream and replacement text is re-laid in their place. Pure local
processing. No cloud API calls.

Works on:

  • ASCII-encoded text PDFs (Citi, most utility bills)
  • CID-encoded subsetted-font PDFs (Bank of America, Chase, most modern
    bank statements) — these store text as glyph indices that defeat naive
    byte-replacement; PyMuPDF's redaction handles them transparently
  • Word documents (.docx) — run-slice replacement preserves inline
    formatting, hyperlinks, and comment anchors; covers paragraphs,
    headers/footers, footnotes/endnotes, comments, text boxes, and chart text
  • Excel workbooks (.xlsx) — covers all cells (including numeric and
    hidden), cell comments, sheet headers/footers, defined names, data
    validation prompts, and workbook properties

For Office files, a package-level OOXML byte-regex backstop runs after the
library pass to catch text in surfaces the library APIs don't expose
(e.g., chart titles, pivot caches, drawing text). The backstop is scoped to
text elements (<w:t>, <a:t>, <t>, <vt:lpstr>/<vt:lpwstr>) inside
an allowlist of editable parts — it never touches signatures, content
types, rels, themes, styles, calcChain, or formula structures.
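
The scoping idea behind the backstop can be sketched with the standard library alone. This is an illustration under stated assumptions, not the script's code: the allowlist EDITABLE_PARTS, the element regex, and the function name scan_text_elements are hypothetical, and the sketch only reports matches instead of rewriting them.

```python
import io
import re
import zipfile

# Hypothetical allowlist of package parts that hold user-visible text.
EDITABLE_PARTS = re.compile(
    r"^(?:word/(?:document|header\d*|footer\d*|footnotes|endnotes|comments)\.xml"
    r"|word/charts/[^/]+\.xml|xl/sharedStrings\.xml|xl/charts/[^/]+\.xml"
    r"|docProps/(?:app|custom)\.xml)$"
)
# Text-bearing elements only: <w:t>, <a:t>, <t>, <vt:lpstr>, <vt:lpwstr>.
TEXT_ELEMENT = re.compile(rb"<((?:w:|a:)?t|vt:lpw?str)(\s[^>]*)?>([^<]*)</\1>")


def scan_text_elements(package_bytes: bytes, needle: str) -> list[tuple[str, str]]:
    """Return (part name, element text) pairs whose text contains needle."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        for name in zf.namelist():
            if not EDITABLE_PARTS.match(name):
                continue  # rels, styles, themes, signatures, etc. are never read
            for m in TEXT_ELEMENT.finditer(zf.read(name)):
                text = m.group(3).decode("utf-8", errors="replace")
                if needle in text:
                    hits.append((name, text))
    return hits
```

Matching on element boundaries rather than on raw bytes anywhere in the part is what keeps a backstop like this away from attribute values, relationship IDs, and formula structures.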

Single file or directory: --in and --out accept a file or a
directory. With --recursive (-r), subdirectories are walked and the
output mirrors the input tree. Office lock files (~$*, .~lock.*) and
hidden files are skipped. Directory symlinks are not followed.
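
The directory rules above can be sketched as a small planning function. This is a minimal sketch, not the script's implementation: plan_jobs, is_skipped, and SUPPORTED are hypothetical names, and skipping hidden subdirectories during recursion is an assumption (the text above only mentions hidden files).

```python
import os
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".xlsx"}


def is_skipped(name: str) -> bool:
    """Office lock files (~$*, .~lock.*) and hidden files are skipped."""
    return name.startswith("~$") or name.startswith(".")


def plan_jobs(src: Path, dst: Path, recursive: bool = False) -> list[tuple[Path, Path]]:
    """Pair every supported input file with its mirrored output path."""
    jobs = []
    # followlinks=False: directory symlinks are never descended into.
    for root, dirs, files in os.walk(src, followlinks=False):
        if not recursive:
            dirs.clear()  # top-level only
        else:
            # Assumption: hidden directories are skipped like hidden files.
            dirs[:] = sorted(d for d in dirs if not d.startswith("."))
        for name in sorted(files):
            if is_skipped(name) or Path(name).suffix.lower() not in SUPPORTED:
                continue
            in_path = Path(root) / name
            jobs.append((in_path, dst / in_path.relative_to(src)))
    return jobs
```

Computing each output path relative to the input root is what makes the output tree mirror the input tree.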

Unsupported: legacy .doc/.xls (ask the user to Save As to the
modern format), .pptx, and other Office types. Files with unknown
extensions are skipped with a warning.

When to Use

  • User wants to share a financial statement, bill, or report without leaking
    identity (name, address, last-4 digits, transaction locations, etc.)
  • User asks to "make this not traceable" or "scrub identifying info"
  • Replacing a name or other field everywhere it appears in a multi-page PDF
  • Removing or replacing transaction-level identifiers (store numbers, city
    names) so location can't be inferred

When NOT to Use

  • The user only needs a visual mock-up — use nano-pdf instead (image-based)
  • The PDF is a scanned image with no text layer — see References
  • The user wants black-bar redaction only (no replacement text) — possible
    with PyMuPDF but not the default; pass an empty string as the replacement

Why True Redaction Matters

Overlay redaction (drawing a white box over text) leaves the original text
intact under the box — Ctrl+F, copy-paste, and pdftotext still find it.
nano-pdf re-renders the page via an image AI, which degrades text in
financial documents and may hallucinate numbers. This skill removes the
underlying glyphs from the page content stream, so the original bytes are
gone from the file.

If the input PDF has no extractable text, see
references/scanned-pdfs.md — this skill cannot
handle scanned/image-only PDFs.

Procedure

The agent — not the script — is responsible for finding PII the user
didn't explicitly mention. The script only redacts what it's told to.
Run this loop on every job:

  1. Collect inputs the user provided (free-form names/values, a YAML
    --config, prior --replace/--scrub flags). These are the
    user-supplied terms.
  2. First pass: apply the user-supplied terms with the script.
  3. Extract text from the output (Step 1 below) and scan for missed
    PII using the checklist.
  4. Propose a supplemental map to the user (additional replacements
    and/or additional scrub:) and wait for approval.
  5. Second pass: re-run with the merged map (easiest: append to the
    YAML config and re-run on the original input).
  6. Final verify: every original token reports 0 occurrences.

If the user provides nothing at all, skip step 2 and start with step 3
on the input.

Step 1: Extract text for PII discovery

Pull every visible string out of the document so the agent can scan
for PII the user didn't enumerate. Choose by file type:

# PDFs
uvx --with pymupdf python3 -c "
import fitz, sys
doc = fitz.open(sys.argv[1])
for i, p in enumerate(doc, 1):
    print(f'=== Page {i} ==='); print(p.get_text() or '')
" "$FILE"

# DOCX (paragraphs, tables, headers/footers, core properties)
uvx --with python-docx python3 -c "
import sys
from docx import Document
d = Document(sys.argv[1])
def emit(label, runs):
    for t in runs:
        if t and t.strip(): print(f'[{label}] {t}')
emit('body', (p.text for p in d.paragraphs))
for tbl in d.tables:
    emit('table', (c.text for r in tbl.rows for c in r.cells))
for s in d.sections:
    emit('header', (p.text for p in s.header.paragraphs))
    emit('footer', (p.text for p in s.footer.paragraphs))
print('--- core props ---')
cp = d.core_properties
for k in ('author','last_modified_by','title','subject','comments','keywords'):
    v = getattr(cp, k, None)
    if v: print(f'{k}: {v}')
" "$FILE"

# XLSX (cells, comments, headers/footers, defined names, properties)
uvx --with openpyxl python3 -c "
import sys
from openpyxl import load_workbook
wb = load_workbook(sys.argv[1], data_only=False)
for ws in wb.worksheets:
    print(f'=== sheet: {ws.title} ===')
    for row in ws.iter_rows(values_only=False):
        for cell in row:
            if cell.value is None: continue
            print(f'{cell.coordinate}: {cell.value!r}')
            if cell.comment: print(f'  # comment: {cell.comment.text}')
    for hf in ('oddHeader','oddFooter','evenHeader','evenFooter'):
        h = getattr(ws, hf, None)
        for part in ('left','center','right'):
            v = getattr(getattr(h, part, None), 'text', None) if h else None
            if v: print(f'{hf}.{part}: {v}')
print('--- defined names ---')
for n in wb.defined_names: print(n, '=', wb.defined_names[n].value)
print('--- core props ---')
p = wb.properties
for k in ('creator','lastModifiedBy','title','subject','description','keywords'):
    v = getattr(p, k, None)
    if v: print(f'{k}: {v}')
" "$FILE"

For directories, run the appropriate command per file (or write a small
loop) — there is no built-in 'extract' subcommand on the script.

Step 2: PII discovery checklist

After every extraction, scan for these categories. Always propose
findings even when the user only asked to redact one thing — they
usually don't realize how much else is in the document.

| Category | What to look for |
|----------|------------------|
| People | Cardholder/payor, joint owner, beneficiaries, "insured name", agent name, recipient, document author, last-modified-by, signatures, "attention to" lines |
| Address | Street, apt/unit, city, state, ZIP (full + ZIP+4), country if non-default |
| Account identifiers | Full account #, last-4 ("ending in NNNN", "*NNNN", "xNNNN"), member ID, customer ID, policy #, MICR routing/account, IBAN |
| Card details | PAN (full or partial), CVV, expiration, cardholder name, BIN |
| Transaction lines | Merchant + city/state, store number, branch, terminal ID, ATM ID, "Doing business as" names |
| Government IDs | SSN/SIN (NNN-NN-NNNN), EIN/TIN (NN-NNNNNNN), driver's license, passport, NPI, federal tax ID |
| Contact | Phone (with area code), fax, email, personal URLs/handles |
| Dates | Date of birth, anniversary, account-open date if combined with name |
| Identifiers in URLs/QR | Tokens, signed URLs in footers, "view online" links with IDs |
| Credit/financial | Credit score, FICO range, balance + last-4 combo, salary, tax withholding |
| Document metadata | core.author, lastModifiedBy, title, subject, keywords, comments, custom properties (often shows full username/email) |
| Embedded references | Phone number embedded in merchant name, city embedded in MERCHANT - CITY, area-code prefix on phone numbers (972-), bare TX / CA strings |

For numeric fields prefer scrub:. Same-length deterministic digits
keep the document's layout intact (e.g. 8575 → 2252) while still
making the value untraceable.

For free-form text where you have a good fake (e.g. you want a
consistent fake name across many documents) prefer replacements:.
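
A first pass over the extracted text can be automated for the pattern-shaped categories in the checklist. The sketch below is an aid for the agent, not part of the script: the PATTERNS table and find_candidates are hypothetical, the regexes are US-centric and deliberately simple, and names, merchants, and cities still need a manual read.

```python
import re

# Illustrative patterns only; real statements use more forms than these.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ein": re.compile(r"\b\d{2}-\d{7}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "last4": re.compile(r"(?:ending in:?\s*|[*x])(\d{4})\b", re.I),
    "zip": re.compile(r"\b[A-Z]{2} (\d{5}(?:-\d{4})?)\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_candidates(text: str) -> list[tuple[str, str]]:
    """Return sorted, de-duplicated (category, value) pairs found in text."""
    found = set()
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            # Use the capture group when the pattern has one, else the whole match.
            found.add((category, m.group(1) if pattern.groups else m.group(0)))
    return sorted(found)
```

Each hit is only a candidate: it belongs in the proposal table for the user to approve, not directly in the scrub list.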

Step 3: Propose the supplemental map

Show the user a table with three columns: Source location,
Original, Suggested action (scrub or replace → fake). Wait
for explicit approval.

When the user approves, append the confirmed items to their profile
by re-running with --save-config. The profile becomes a personal
anonymization fingerprint that grows over time, so future runs need
fewer agent discoveries:

# First time on this user's documents — start the profile.
... --scrub "CHRISTINE JAMES" --scrub "8575" \
    --replace "JAMES, CHRISTINE=DOE, JANE" \
    --save-config ~/.config/pdf-anonymize/personal.yml

# Subsequent runs — load profile, add today's discoveries, save back.
... --config ~/.config/pdf-anonymize/personal.yml \
    --scrub "23-73-11462314-4" --scrub "17880" \
    --save-config

Behavior:

  • --save-config PATH writes (or overwrites) the merged set to PATH.
    If PATH already exists and --config was not also passed, the
    existing file is auto-loaded as the base so additions accumulate
    rather than overwrite.
  • --save-config with no value saves back to whatever --config
    pointed at.
  • Existing entries are deduplicated (same string in replacements,
    scrub, or checks is not added twice).
  • Comments and key order in the YAML are normalized on round-trip.
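
The accumulate-and-deduplicate behavior described above can be illustrated on plain dictionaries shaped like the YAML profile. This is a sketch under assumptions: merge_profile is a hypothetical helper, and treating the first entry for a given find string as the winner is a choice made for the example, not documented behavior.

```python
def merge_profile(base: dict, additions: dict) -> dict:
    """Merge additions into base; lists keep first-seen order, no duplicates."""
    merged = {"salt": additions.get("salt", base.get("salt"))}
    # Replacements are keyed on the find string; a later entry for the
    # same find string is ignored (an assumption for this sketch).
    seen, replacements = set(), []
    for item in base.get("replacements", []) + additions.get("replacements", []):
        if item["find"] not in seen:
            seen.add(item["find"])
            replacements.append(item)
    merged["replacements"] = replacements
    for key in ("scrub", "checks"):
        # dict.fromkeys de-duplicates while preserving insertion order.
        merged[key] = list(dict.fromkeys(base.get(key, []) + additions.get(key, [])))
    return merged
```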

Default profile location (~/.config/pdf-anonymize/profile.yaml)

If no --config is passed, the script auto-loads
~/.config/pdf-anonymize/profile.yaml (or $XDG_CONFIG_HOME/pdf-anonymize/profile.yaml)
when present. This is the recommended location for a user's permanent
anonymization fingerprint — replacements/scrub items added here apply
to every run automatically.
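
Resolving the default location might look like the following. It is a minimal sketch: default_profile_path is a hypothetical name, and giving $XDG_CONFIG_HOME precedence over ~/.config when it is set follows the usual XDG convention rather than anything stated here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_profile_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the profile path, honoring XDG_CONFIG_HOME when it is set."""
    env = os.environ if env is None else env
    # Assumption: a non-empty XDG_CONFIG_HOME wins over ~/.config.
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "pdf-anonymize" / "profile.yaml"
```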

The first time a user runs pdf-anonymize on an interactive terminal
without that profile, they'll be prompted to create one. The starter
template is fully commented and includes a salt: placeholder they
must change. To create it explicitly (or in a non-interactive context):

pdf-anonymize --init                    # writes the default path
pdf-anonymize --init /custom/path.yaml  # writes elsewhere

--init refuses to overwrite an existing file. Use
--no-default-config to skip auto-load and the prompt for a single
run. Agent invocations (non-TTY) never block on the prompt — they
silently fall through to whatever flags were passed.

When proposing an updated map (Step 3), prefer pointing
--save-config at the default profile so it grows over time:

pdf-anonymize --in ... --out ... \
    --scrub "..." --replace "..." \
    --save-config   # saves back to ~/.config/pdf-anonymize/profile.yaml
                    # if that's what was auto-loaded

Step 4: Apply Replacements

Run the bundled script via uvx so PyMuPDF is pulled into an ephemeral
isolated environment (no global installs):

uvx --with pymupdf python3 scripts/anonymize_pdf.py \
    --in "April 22.pdf" \
    --out "April 22 (Anon).pdf" \
    --replace "RUDY R JAMES JR=BILL JOHNSON" \
    --replace "5519 MCCAIN CT=1428 ELMWOOD AVE" \
    --replace "DALLAS TX 75249-1650=DENVER CO 80014-2210" \
    --replace "ending in: 8495=ending in: 0000"

If the user already has pymupdf available, plain python3 scripts/anonymize_pdf.py ... works.

For DOCX/XLSX inputs, include those libraries:

uvx --with pymupdf --with python-docx --with openpyxl python3 \
    scripts/anonymize_pdf.py \
    --in "Statement.xlsx" \
    --out "Statement (Anon).xlsx" \
    --replace "John Smith=REDACTED" \
    --replace "12345=99999"

For a whole directory (top-level only by default; pass -r to recurse):

uvx --with pymupdf --with python-docx --with openpyxl python3 \
    scripts/anonymize_pdf.py \
    --in ./statements/ \
    --out ./statements-anon/ \
    --recursive \
    --replace "RUDY R JAMES JR=BILL JOHNSON" \
    --replace "5519 MCCAIN CT=1428 ELMWOOD AVE"

Reusable config + auto-scrub

For values you redact repeatedly, put them in a YAML config and reuse it:

# anon-profile.yml
salt: "personal-2026"          # rotate to invalidate old fakes (optional)

replacements:                  # explicit find → replace pairs
  - find: "RUDY R JAMES JR"
    replace: "BILL JOHNSON"
  - find: "5519 MCCAIN CT"
    replace: "1428 ELMWOOD AVE"

scrub:                         # auto-fake (deterministic from salt)
  - "8575"                     # → same-length digits, e.g. "2252"
  - "23-73-11462314-4"         # → "72-27-20214807-9"
  - "CHRISTINE JAMES"          # → "PERSON_<6 hex>"
  - "17-F133-77"               # → "REDACTED_<6 hex>"

checks:                        # extra tokens for --verify
  - "JAMES"
  - "WOODSTOCK"

Then pass the config on the command line:

uvx --with pymupdf --with python-docx --with openpyxl --with pyyaml \
    python3 scripts/anonymize_pdf.py \
    --in ./statements/ --out ./statements-anon/ -r \
    --config anon-profile.yml

CLI flags --replace OLD=NEW, --scrub WORD, and --check WORD merge on
top of the config. A --scrub value is auto-classified by shape:

  • digits-only or numeric-with-separators (≥2 digits) → same-length
    deterministic digits, separators preserved
  • mostly letters/spaces (≥70%) → PERSON_<6 hex>
  • otherwise → REDACTED_<6 hex>

Replacements are deterministic per (salt, value), so the same input
always becomes the same fake — drop in new statements at any time and
they'll redact consistently. Each scrub value is also auto-added to the
verify check list.
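
The shape rules and the (salt, value) determinism can be sketched as follows. This is not the script's algorithm: deriving the fake from an HMAC-SHA256 digest, the exact separator set, and the names classify and scrub are assumptions chosen to make the idea concrete.

```python
import hashlib
import hmac
import re


def classify(value: str) -> str:
    """Classify a scrub value by shape: 'digits', 'person', or 'other'."""
    digits = sum(c.isdigit() for c in value)
    # Digits plus common separators only, with at least two digits.
    if digits >= 2 and re.fullmatch(r"[\d\s\-./]+", value):
        return "digits"
    letters = sum(c.isalpha() or c.isspace() for c in value)
    if value and letters / len(value) >= 0.7:
        return "person"
    return "other"


def scrub(value: str, salt: str) -> str:
    """Return a fake that depends only on (salt, value)."""
    digest = hmac.new(salt.encode(), value.encode(), hashlib.sha256).digest()
    kind = classify(value)
    if kind == "digits":
        # Repeat the digest so long values never run out of bytes.
        stream = iter(digest * (len(value) // len(digest) + 1))
        # Replace each digit, keep separators in place (same length).
        return "".join(str(next(stream) % 10) if c.isdigit() else c for c in value)
    prefix = "PERSON" if kind == "person" else "REDACTED"
    return f"{prefix}_{digest.hex()[:6]}"
```

Because the output is a pure function of the salt and the value, rotating the salt changes every fake at once. A production version would also need to guard against the rare case where the fake digits equal the original.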

The script applies replacements longest-first (so "RUDY R JAMES JR" wins
over "RUDY") and applies each redaction completely before moving to the
next. Output reports counts per pattern.
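
On plain text, the longest-first rule reduces to a sort by pattern length. The sketch below only illustrates the ordering and the per-pattern counts; apply_longest_first is a hypothetical helper that works on a string, whereas the script operates on PDF glyphs and Office parts.

```python
def apply_longest_first(text: str, mapping: dict[str, str]) -> tuple[str, dict[str, int]]:
    """Apply each replacement fully, longest pattern first; return text and counts."""
    counts = {}
    for old in sorted(mapping, key=len, reverse=True):
        counts[old] = text.count(old)  # occurrences left for this pattern
        text = text.replace(old, mapping[old])
    return text, counts


text, counts = apply_longest_first(
    "RUDY R JAMES JR\nHello RUDY",
    {"RUDY": "BILL", "RUDY R JAMES JR": "BILL JOHNSON"},
)
print(text)    # BILL JOHNSON / Hello BILL on two lines
print(counts)  # {'RUDY R JAMES JR': 1, 'RUDY': 1}
```

If the short pattern ran first, the full name would become "BILL R JAMES JR" and the longer mapping would never match.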

Step 5: Catch Embedded References

Watch for cases where PII is embedded inside other text, e.g. a merchant
name like ANDY'S - GRAND PRAIRIE includes the city. The verifier (Step 6)
catches these. Just add another --replace and re-run.

Common embedded forms:

  • Merchant <NAME> - <CITY> → also map the merchant form
  • Phone numbers in transaction descriptions
  • Bare TX / CA strings in MICR/footer lines → swap with target state
  • Last-4 digits referenced separately from the full 16-digit account number

Step 6: Verify

After every run, verify zero leakage by extracting text and searching for
every original token:

uvx --with pymupdf python3 scripts/anonymize_pdf.py --verify "$NEW_PDF" \
    --check "RUDY" --check "8495" --check "MCCAIN" --check "DALLAS"

For DOCX/XLSX, verify uses a package-level scan (parsed logical text per
container and raw byte search per part) so PII split across runs is
caught. Numeric-looking tokens are also matched in separator-stripped form
(so 123-45-6789 is detected as 123456789):

uvx --with pymupdf --with python-docx --with openpyxl python3 \
    scripts/anonymize_pdf.py --verify "Statement (Anon).xlsx" \
    --check "John Smith" --check "12345"
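
One reading of the separator-stripped matching is sketched below: a numeric-looking token is searched both as written and with its separators removed. The helper count_leaks and the separator set are assumptions; the script's own matching over parsed text and raw parts is more involved.

```python
import re

_SEPARATORS = re.compile(r"[\s\-./]")


def count_leaks(text: str, token: str) -> int:
    """Count occurrences of token, plus its separator-stripped form if numeric."""
    hits = text.count(token)
    stripped = _SEPARATORS.sub("", token)
    # Only numeric-looking tokens get the second, separator-free search.
    if stripped.isdigit() and stripped != token:
        hits += text.count(stripped)
    return hits
```

A check for 123-45-6789 therefore still fails verification when the document stores the value as 123456789.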

--verify also accepts a directory (with -r for recursion) and exits
non-zero if any check leaks in any file.

Every check must report 0 occurrences before declaring success. The
exit code is non-zero when any check leaks, so this works inside CI:

if ! python3 scripts/anonymize_pdf.py --verify "$OUT" --check RUDY; then
    echo "PII leaked!" >&2
    exit 1
fi

Step 7: Report

Show the user:

  • File path of the scrubbed copy
  • Total replacement count (broken down by pattern from script output)
  • Verification result (✅ 0 occurrences for each original token)
  • Reminder that this is true redaction (irrecoverable), pure local
    processing, no AI service called

Important Rules

  1. Always work on a copy. Never modify the source PDF in place — create
    <original> (Anon).pdf or similar.
  2. Get explicit approval of the replacement map before running the
    script.
  3. Verify after every run. Run --verify and confirm 0 occurrences of
    every original token. If any leak, look at the verifier output for
    surrounding context and add more --replace patterns.
  4. Never use nano-pdf for this. It re-renders pages via Gemini and
    degrades text in financial documents.
  5. Address transaction-level PII. Don't stop at the cardholder block —
    merchants, store numbers, phone numbers, and city/state on transaction
    lines also leak location. Change them all.
  6. Longest-first ordering is automatic. If you list both "RUDY R JAMES JR"
    → "BILL JOHNSON" and "RUDY" → "BILL", the longer pattern wins wherever
    the full name appears, and the shorter one only catches the remaining
    standalone occurrences.
  7. Replacement text font may differ slightly. PyMuPDF re-lays
    replacements in Helvetica. On PDFs that use a custom branded font, the
    replaced regions can look slightly different from surrounding text. This
    is cosmetic only: the original data is gone.
  8. No cloud calls. This skill must remain fully local. Do not pipe
    content through any LLM service for the actual edit step.

Available Scripts

This skill ships as an installable Python package. Three equivalent ways
to invoke it — pick whichever fits the runtime:

| Form | Command | When |
|------|---------|------|
| Console script | pdf-anonymize ... | After pipx install "pdf-anonymize[all]" or uv tool install. |
| Ephemeral (uvx) | uvx --from git+https://github.com/jamesconsultingllc/pdf-anonymize --with python-docx --with openpyxl --with pyyaml pdf-anonymize ... | One-off, no install. |
| In-skill launcher | uvx --with pymupdf --with python-docx --with openpyxl --with pyyaml python3 scripts/anonymize_pdf.py ... | When this skill folder is checked out and the agent runs from inside it. |

  • scripts/anonymize_pdf.py — thin launcher. Adds ../src to
    sys.path and delegates to pdf_anonymize.cli:main. Lets agents and
    users run the tool directly from a clone without installing.
  • src/pdf_anonymize/cli.py — the actual implementation. Also
    exposed as python -m pdf_anonymize (after install) and as the
    pdf-anonymize console script.

Keep only pymupdf in the --with list for PDF-only work; the other
libraries are imported lazily and are unused for PDF inputs.

Examples

Example 1: Anonymize a Bank of America credit-card statement

User asks: "Make a copy of this BoA statement with my name and address
changed, and remove all transaction location info so nobody can profile
where I shop."

# 1. Inspect first to identify PII
uvx --with pymupdf python3 -c "
import fitz, sys
doc = fitz.open(sys.argv[1])
for i, p in enumerate(doc, 1):
    print(f'=== Page {i} ==='); print(p.get_text() or '')
" "April 25.pdf"

# 2. Apply replacements (longest patterns automatically win)
uvx --with pymupdf python3 scripts/anonymize_pdf.py \
    --in "April 25.pdf" \
    --out "April 25 (Anon).pdf" \
    --replace "RUDY R JAMES JR=BILL JOHNSON" \
    --replace "5519 MCCAIN CT=1428 ELMWOOD AVE" \
    --replace "DALLAS TX 75249-1650=DENVER CO 80014-2210" \
    --replace "4400 6631 4372 3867=4400 6631 4372 0000" \
    --replace "GRAND PRAIRIE TX=AURORA CO" \
    --replace "ANDY'S - GRAND PRAIRIE=ANDY'S - AURORA" \
    --replace "DALLAS TX=DENVER CO" \
    --replace " TX = CO " \
    --replace "WHATABURGER 742=WHATABURGER 999" \
    --replace "972-237-7941=303-555-0100"

# 3. Verify nothing leaked
uvx --with pymupdf python3 scripts/anonymize_pdf.py \
    --verify "April 25 (Anon).pdf" \
    --check RUDY --check JAMES --check MCCAIN --check DALLAS \
    --check "GRAND PRAIRIE" --check "4372 3867" --check 75249

Expected output of step 3 — every line must say ✅ ... 0 occurrence(s).

Example 2: Strip transaction location info

User asks: "Anonymize this checking statement so nobody can tell what city
I shop in."

uvx --with pymupdf python3 scripts/anonymize_pdf.py \
    --in "checking-2026-04.pdf" \
    --out "checking-2026-04 (Anon).pdf" \
    --replace "JANE Q DOE=ALEX SMITH" \
    --replace "9001 PRESTON RD=500 5TH AVE" \
    --replace "DALLAS TX 75225=NEW YORK NY 10001" \
    --replace "STARBUCKS #14672=STARBUCKS #20015" \
    --replace "KROGER #586=KROGER #311" \
    --replace "GRAPEVINE TX=BROOKLYN NY" \
    --replace "DALLAS TX=NEW YORK NY" \
    --replace "PLANO TX=QUEENS NY"

Example 3: Dry-run by verifying first

You can run --verify against the original PDF before any edits to see
exactly which tokens currently appear:

uvx --with pymupdf python3 scripts/anonymize_pdf.py \
    --verify "April 22.pdf" \
    --check "RUDY" --check "MCCAIN" --check "8495"

Output indicates what to scrub:

❌ 'RUDY': 11 occurrence(s)
❌ 'MCCAIN': 1 occurrence(s)
❌ '8495': 14 occurrence(s)

Exit code is non-zero when any check finds a leak.

References

Load these only when the trigger condition applies (progressive disclosure):

  • references/scanned-pdfs.md — load when
    fitz.Page.get_text() returns empty for every page (image-only PDF;
    this skill cannot handle it, see the file for alternatives).

PDF Anonymize

Performs true redaction of PII in a PDF by rewriting the content stream so
the original strings are gone — not just visually covered. Pure local
processing using open-source tools (pikepdf / qpdf). No cloud API calls.

When to Use

  • User wants to share a financial statement, bill, or report without leaking
    identity (name, address, last-4 digits, transaction locations, etc.)
  • User asks to "make this not traceable" or "scrub identifying info"
  • Replacing a name or other field everywhere it appears in a multi-page PDF
  • Removing or replacing transaction-level identifiers (store numbers, city
    names) so location can't be inferred

When NOT to Use

  • The user only needs a visual mock-up — use nano-pdf instead (image-based)
  • The PDF is a scanned image with no text layer — see References
  • The user wants to redact (black-bar) rather than replace — this skill
    replaces; for black-bar redaction add a different workflow

Why True Redaction Matters

Overlay redaction (drawing a white box) leaves the original text intact under
the box — Ctrl+F, copy-paste, and pdftotext still find it. nano-pdf
re-renders the page via an image AI, which degrades text on financial docs.
This skill rewrites the actual Tj/TJ strings, so the original bytes are
gone from the file.

If the input PDF has no extractable text, see
references/scanned-pdfs.md — this skill cannot
handle scanned/image-only PDFs.

Procedure

Step 1: Discover PII

Extract every text string from the PDF and present a list of candidates to the
user, grouped by category. Do not assume what should change — confirm.

uvx --with pypdf python3 -c "
import pypdf, sys
r = pypdf.PdfReader(sys.argv[1])
for i, p in enumerate(r.pages, 1):
    print(f'=== Page {i} ===')
    print(p.extract_text() or '')
" "$PDF"

Look for these PII categories:

| Category | Examples |
|----------|----------|
| Names | Cardholder, joint owner, beneficiaries |
| Address | Street, city, state, ZIP |
| Account identifiers | Last-4, full account #, member ID |
| Transaction identifiers | Merchant name, store #, branch, city/state on each line |
| Credit identifiers | Credit score, FICO range |
| Contact | Phone, email |
| Dates | Birth date if printed (rare in statements) |
| Hidden/encoded glyphs | See Step 4 |

Step 2: Propose Replacement Map

Present a table to the user showing original → replacement. Use plausible
fake values that match length where possible to avoid layout breakage:

| Original                | Replacement              |
|-------------------------|--------------------------|
| RUDY R JAMES JR         | BILL JOHNSON             |
| 5519 MCCAIN CT          | 1234 ELM ST              |
| DALLAS TX 75249-1650    | PHOENIX AZ 85001-0000    |
| ending in: 8495         | ending in: 4321          |
| credit score is 720     | credit score is 715      |

For transaction lines, change the city/state to a different region so the
user's home area can't be inferred. Change store numbers too (e.g., STORE #435 → STORE #812).

Get explicit user approval before applying.

Step 3: Apply Content-Stream Replacements

Run the bundled script via uvx so dependencies are pulled into an
ephemeral isolated environment (no global installs):

uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
    --in "April 22.pdf" \
    --out "April 22 (Anon).pdf" \
    --replace "RUDY R JAMES JR=BILL JOHNSON" \
    --replace "5519 MCCAIN CT=1234 ELM ST" \
    --replace "DALLAS  TX  75249-1650=PHOENIX  AZ  85001-0000" \
    --replace "ending in: 8495=ending in: 4321"

If the user already has pikepdf and pypdf available on their Python path,
plain python3 scripts/anonymize_pdf.py ... works too.

The script edits every Tj, TJ, ', and " text-showing operator across
all pages. Replacements are applied longest-first to avoid partial overlap
issues (e.g., replace "RUDY R JAMES" before "RUDY").

Step 4: Handle Custom-Encoded Glyphs

Gotcha: Some PDFs (e.g., Citi statements) print sensitive routing codes
using a custom-encoded font where the visible "8495" is actually hex
bytes like <f8f4f9f5f0f0> in the content stream. Standard string
replacement misses these because pikepdf doesn't see them as "8495".

When the verification step (Step 5) reports residual PII that you can't find
in the parsed text strings, the cause is almost always a custom-encoded
font. Read references/pikepdf-content-streams.md
for background on PDF text operators and the two string encodings, then dump
the raw content streams and search for hex strings:

uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
    --in "$PDF" --out "$PDF.scrubbed" \
    --hex-replace "f8f4f9f5f0f0=f0f0f0f0f0f0"

The right-hand side should reuse the same font's encoding bytes — e.g., if
f0 renders as "0", a string of f0 bytes shows as zeros. To find the
correct bytes for replacement, inspect the font's /ToUnicode or
/Differences entries.

Step 5: Verify

After every run, verify zero leakage by extracting text and searching for
every original token:

uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py --verify "$NEW_PDF" \
    --check "RUDY" --check "8495" --check "MCCAIN" --check "DALLAS"

Every check must report 0 occurrences before declaring success.

Step 6: Report

Show the user:

  • File path of the scrubbed copy
  • Table of all replacements (count per category)
  • Verification result (✅ 0 occurrences for each original token)
  • Reminder that this is true redaction (irrecoverable), pure local processing,
    no AI service called

Important Rules

  1. Always work on a copy. Never modify the source PDF in place — create
    <original> (Anon).pdf or similar.
  2. Get explicit approval of the replacement map before running the script.
  3. Verify after every run. Run text extraction and confirm 0 occurrences
    of every original token. If any leak, investigate custom-encoded glyphs
    (Step 4).
  4. Never use nano-pdf for this. It re-renders pages via Gemini and degrades
    text in financial documents.
  5. Address transaction-level PII. Don't stop at the cardholder block —
    merchants, store numbers, and city/state on transaction lines also leak
    location. Change them all.
  6. Keep replacements layout-safe. Prefer replacements no longer than the
    original. PDF text positions are absolute, so a much shorter or longer
    string will not reflow the line or corrupt the file; it may only leave a
    gap or overlap neighboring text.
  7. No cloud calls. This skill must remain fully local. Do not pipe content
    through any LLM service for the actual edit step.

Available Scripts

  • scripts/anonymize_pdf.py — content-stream find/replace, hex-byte
    replace, and verification. Self-contained; only depends on pikepdf and
    pypdf. Run via uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py ... for ephemeral execution.

Examples

Example 1: Anonymize a credit-card statement

User asks: "Make a copy of this Citi statement with my name and address
changed to fake values, and remove the last-4 of my account."

# 1. Inspect first to identify PII
uvx --with pypdf python3 -c "
import pypdf, sys
r = pypdf.PdfReader(sys.argv[1])
for i, p in enumerate(r.pages, 1):
    print(f'=== Page {i} ==='); print(p.extract_text() or '')
" "April 22.pdf"

# 2. Apply replacements
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
    --in "April 22.pdf" \
    --out "April 22 (Anon).pdf" \
    --replace "RUDY R JAMES JR=BILL JOHNSON" \
    --replace "5519 MCCAIN CT=1234 ELM ST" \
    --replace "DALLAS  TX  75249-1650=PHOENIX  AZ  85001-0000" \
    --replace "ALLISHA L JAMES=JANE JOHNSON" \
    --replace "ending in: 8495=ending in: 4321" \
    --replace "ending in 8495=ending in 4321" \
    --replace "Card ending in 8495=Card ending in 4321" \
    --replace "Card ending in 6588=Card ending in 5678" \
    --replace "credit score is 720=credit score is 715" \
    --hex-replace "f8f4f9f5f0f0=f0f0f0f0f0f0"

# 3. Verify nothing leaked
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
    --verify "April 22 (Anon).pdf" \
    --check RUDY --check JAMES --check ALLISHA --check MCCAIN \
    --check DALLAS --check 75249 --check 8495 --check 6588 --check 720

Expected output of step 3 — every line must say ✅ ... 0 occurrence(s):

✅ 'RUDY': 0 occurrence(s)
✅ 'JAMES': 0 occurrence(s)
✅ 'ALLISHA': 0 occurrence(s)
✅ 'MCCAIN': 0 occurrence(s)
...

Example 2: Strip transaction location info from a bank statement

User asks: "Anonymize this checking statement so nobody can tell what city
I shop in."

The statement has lines like STARBUCKS #14672 DALLAS TX and
KROGER #586 GRAPEVINE TX. Goal: keep the merchant brands but move
locations to a different region.

uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
    --in "checking-2026-04.pdf" \
    --out "checking-2026-04 (Anon).pdf" \
    --replace "JANE Q DOE=ALEX SMITH" \
    --replace "9001 PRESTON RD=500 5TH AVE" \
    --replace "DALLAS TX 75225=NEW YORK NY 10001" \
    --replace "STARBUCKS #14672=STARBUCKS #20015" \
    --replace "KROGER #586=KROGER #311" \
    --replace "GRAPEVINE TX=BROOKLYN NY" \
    --replace "DALLAS TX=NEW YORK NY" \
    --replace "PLANO TX=QUEENS NY"

Example 3: Hex replacement for a custom-encoded font

When --verify reports residual leakage but you can't find the token in the
text strings (the PDF prints sensitive data using a custom subset font), use
--hex-replace to operate on the raw glyph bytes:

# First find the offending hex string by inspecting raw streams:
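uvx --with pikepdf python3 -c "
import pikepdf, re, sys
pdf = pikepdf.open(sys.argv[1])
for i, page in enumerate(pdf.pages, 1):
    contents = page.obj.get('/Contents')
    if contents is None: continue  # page has no content stream
    # /Contents may be a single stream or an array of streams
    streams = contents if isinstance(contents, pikepdf.Array) else [contents]
    raw = b'\n'.join(s.read_bytes() for s in streams)
    for m in re.finditer(rb'<([0-9a-f]{6,})>', raw, re.I):
        print(f'P{i}: {m.group(0).decode()}')
" "input.pdf" | sort -u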

# Then redact it (replace each glyph with the font's '0' glyph):
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
    --in "input.pdf" --out "input-anon.pdf" \
    --hex-replace "f8f4f9f5f0f0=f0f0f0f0f0f0"

Example 4: Dry-run by verifying first

You can run --verify against the original PDF before any edits to see
exactly which tokens currently appear and need attention:

uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
    --verify "April 22.pdf" \
    --check "RUDY" --check "MCCAIN" --check "8495"

Output indicates what to scrub:

❌ 'RUDY': 11 occurrence(s)
❌ 'MCCAIN': 1 occurrence(s)
❌ '8495': 14 occurrence(s)

Exit code is non-zero when any check finds a leak, so this works inside CI
or shell pipelines (if ! python3 ... --verify ...; then echo leaked; fi).

References

Load these only when the trigger condition applies (progressive disclosure):

  • references/pikepdf-content-streams.md
    — load when a --verify check still leaks despite running replacements
    (Step 4); explains text operators, literal vs hex strings, and custom font
    encodings.
  • references/scanned-pdfs.md — load when
    pypdf.extract_text() returns empty for every page (image-only PDF; this
    skill cannot handle it, see the file for alternatives).