True-redacts PII from PDF, Word (.docx), and Excel (.xlsx) documents before sharing. PDFs have glyphs physically removed and replacement text re-laid; DOCX patches run slices to preserve formatting; XLSX replaces cells, comments, headers, defined names, and properties. Handles names, addresses, account numbers, credit scores, and merchants/stores/phones/cities on transaction lines. Works on ASCII PDFs and modern bank statements with CID-encoded subsetted fonts (BoA, Chase). Single file or directory (with --recursive). Output cannot be recovered via Ctrl+F, pdftotext, or another AI. Trigger on "anonymize/redact/sanitize/de-identify this PDF/Word/Excel", "scrub PII", "remove my name/address", "share this safely" — even when the user just attaches a file and asks to strip identifying info. DO NOT USE FOR visual watermarking, legacy .doc/.xls, or .pptx. Do NOT use nano-pdf — it re-renders pages via image AI and corrupts financial text.
Install
npx skillscat add jamesconsultingllc/anonymize
Install via the SkillsCat registry.
PDF Anonymize
Performs true redaction of PII in a PDF using PyMuPDF's font-aware
redaction APIs. Original glyphs are physically removed from the content
stream and replacement text is re-laid in their place. Pure local
processing. No cloud API calls.
Works on:
- ASCII-encoded text PDFs (Citi, most utility bills)
- CID-encoded subsetted-font PDFs (Bank of America, Chase, most modern
bank statements): these store text as glyph indices that defeat naive
byte-replacement; PyMuPDF's redaction handles them transparently
- Word documents (.docx): run-slice replacement preserves inline
formatting, hyperlinks, and comment anchors; covers paragraphs,
headers/footers, footnotes/endnotes, comments, text boxes, and chart text
- Excel workbooks (.xlsx): covers all cells (including numeric and
hidden), cell comments, sheet headers/footers, defined names, data
validation prompts, and workbook properties
For Office files, a package-level OOXML byte-regex backstop runs after the
library pass to catch text in surfaces the library APIs don't expose
(e.g., chart titles, pivot caches, drawing text). The backstop is scoped to
text elements (<w:t>, <a:t>, <t>, <vt:lpstr>/<vt:lpwstr>) inside
an allowlist of editable parts — it never touches signatures, content
types, rels, themes, styles, calcChain, or formula structures.
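A minimal sketch of how a backstop scoped to text elements could work. The helper name `backstop_part` and the regex are assumptions for illustration, not the script's actual implementation, and `<vt:lpstr>`/`<vt:lpwstr>` are left out for brevity. It shows why matching only element bodies leaves tags, attributes, and formula structures untouched.

```python
import re

# Hypothetical pattern: capture only the body of <w:t>, <a:t>, or <t> elements.
TEXT_ELEMENT = re.compile(
    rb"(<(w:t|a:t|t)(?:\s[^>]*)?>)(.*?)(</\2>)", re.DOTALL
)

def backstop_part(xml: bytes, replacements: dict[bytes, bytes]) -> bytes:
    """Apply replacements inside text-element bodies, leaving markup alone."""
    def patch(match: re.Match) -> bytes:
        open_tag, _tag_name, body, close_tag = match.groups()
        for old, new in replacements.items():
            body = body.replace(old, new)
        return open_tag + body + close_tag
    return TEXT_ELEMENT.sub(patch, xml)

# The attribute value "John" survives; only the visible text changes.
sample = b'<w:r w:rsidR="John"><w:t xml:space="preserve">Dear John</w:t></w:r>'
print(backstop_part(sample, {b"John": b"Alex"}).decode())
```

This sketch ignores XML entity escaping and text split across several elements, which a real backstop would need to consider.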
Single file or directory: --in and --out accept a file or a
directory. With --recursive (-r), subdirectories are walked and the
output mirrors the input tree. Office lock files (~$*, .~lock.*) and
hidden files are skipped. Directory symlinks are not followed.
Unsupported: legacy .doc/.xls (ask the user to Save As to the
modern format), .pptx, and other Office types. Files with unknown
extensions are skipped with a warning.
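The directory rules above can be sketched as a small planner. The function names (`is_skipped`, `plan_outputs`) are hypothetical, and skipping hidden directories as well as hidden files is an assumption; the script's real traversal may differ.

```python
import os
from pathlib import Path

SUPPORTED = {".pdf", ".docx", ".xlsx"}  # the three formats listed above

def is_skipped(name: str) -> bool:
    """Office lock files (~$*, .~lock.*) and hidden files are skipped."""
    return name.startswith("~$") or name.startswith(".")

def plan_outputs(src: Path, dst: Path, recursive: bool = False):
    """Yield (input_file, output_file) pairs that mirror the input tree."""
    # os.walk does not follow directory symlinks when followlinks=False.
    for root, dirs, files in os.walk(src, followlinks=False):
        # Pruning dirs in place stops the walk at the top level unless recursive.
        dirs[:] = sorted(d for d in dirs if not d.startswith(".")) if recursive else []
        for name in sorted(files):
            if is_skipped(name):
                continue
            path = Path(root) / name
            if path.suffix.lower() not in SUPPORTED:
                print(f"warning: skipping unsupported file {path}")
                continue
            yield path, dst / path.relative_to(src)
```

Calling `list(plan_outputs(Path("statements"), Path("statements-anon"), recursive=True))` would give the work list to confirm with the user before any file is written.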
When to Use
- User wants to share a financial statement, bill, or report without leaking
identity (name, address, last-4 digits, transaction locations, etc.)
- User asks to "make this not traceable" or "scrub identifying info"
- Replacing a name or other field everywhere it appears in a multi-page PDF
- Removing or replacing transaction-level identifiers (store numbers, city
names) so location can't be inferred
When NOT to Use
- The user only needs a visual mock-up: use nano-pdf instead (image-based)
- The PDF is a scanned image with no text layer: see References
- The user wants black-bar redaction only (no replacement text): possible
with PyMuPDF but not the default; pass an empty string as the replacement
Why True Redaction Matters
Overlay redaction (drawing a white box over text) leaves the original text
intact under the box — Ctrl+F, copy-paste, and pdftotext still find it.
nano-pdf re-renders the page via an image AI, which degrades text in
financial documents and may hallucinate numbers. This skill removes the
underlying glyphs from the page content stream, so the original bytes are
gone from the file.
If the input PDF has no extractable text, see
references/scanned-pdfs.md — this skill cannot
handle scanned/image-only PDFs.
Procedure
The agent — not the script — is responsible for finding PII the user
didn't explicitly mention. The script only redacts what it's told to.
Run this loop on every job:
1. Collect inputs the user provided (free-form names/values, a YAML
--config, prior --replace/--scrub flags). These are the
user-supplied terms.
2. First pass: apply the user-supplied terms with the script.
3. Extract text from the output (Step 1 below) and scan for missed
PII using the checklist.
4. Propose a supplemental map to the user (additional replacements
and/or additional scrub: entries) and wait for approval.
5. Second pass: re-run with the merged map (easiest: append to the
YAML config and re-run on the original input).
6. Final verify: every original token reports 0 occurrences.
If the user provides nothing at all, skip step 2 and start with step 3
on the input.
Step 1: Extract text for PII discovery
Pull every visible string out of the document so the agent can scan
for PII the user didn't enumerate. Choose by file type:
# PDFs
uvx --with pymupdf python3 -c "
import fitz, sys
doc = fitz.open(sys.argv[1])
for i, p in enumerate(doc, 1):
    print(f'=== Page {i} ==='); print(p.get_text() or '')
" "$FILE"
# DOCX (paragraphs, tables, headers/footers, core properties)
uvx --with python-docx python3 -c "
import sys
from docx import Document
d = Document(sys.argv[1])
def emit(label, runs):
    for t in runs:
        if t and t.strip(): print(f'[{label}] {t}')
emit('body', (p.text for p in d.paragraphs))
for tbl in d.tables:
    emit('table', (c.text for r in tbl.rows for c in r.cells))
for s in d.sections:
    emit('header', (p.text for p in s.header.paragraphs))
    emit('footer', (p.text for p in s.footer.paragraphs))
print('--- core props ---')
cp = d.core_properties
for k in ('author','last_modified_by','title','subject','comments','keywords'):
    v = getattr(cp, k, None)
    if v: print(f'{k}: {v}')
" "$FILE"
# XLSX (cells, comments, headers/footers, defined names, properties)
uvx --with openpyxl python3 -c "
import sys
from openpyxl import load_workbook
wb = load_workbook(sys.argv[1], data_only=False)
for ws in wb.worksheets:
    print(f'=== sheet: {ws.title} ===')
    for row in ws.iter_rows(values_only=False):
        for cell in row:
            if cell.value is None: continue
            print(f'{cell.coordinate}: {repr(cell.value)}')
            if cell.comment: print(f'  # comment: {cell.comment.text}')
    for hf in ('oddHeader','oddFooter','evenHeader','evenFooter'):
        h = getattr(ws, hf, None)
        for part in ('left','center','right'):
            v = getattr(getattr(h, part, None), 'text', None) if h else None
            if v: print(f'{hf}.{part}: {v}')
print('--- defined names ---')
for n in wb.defined_names: print(n, '=', wb.defined_names[n].value)
print('--- core props ---')
p = wb.properties
for k in ('creator','lastModifiedBy','title','subject','description','keywords'):
    v = getattr(p, k, None)
    if v: print(f'{k}: {v}')
" "$FILE"
For directories, run the appropriate command per file (or write a small
loop); there is no built-in 'extract' subcommand on the script.
Step 2: PII discovery checklist
After every extraction, scan for these categories. Always propose
findings even when the user only asked to redact one thing — they
usually don't realize how much else is in the document.
| Category | What to look for |
|---|---|
| People | Cardholder/payor, joint owner, beneficiaries, "insured name", agent name, recipient, document author, last-modified-by, signatures, "attention to" lines |
| Address | Street, apt/unit, city, state, ZIP (full + ZIP+4), country if non-default |
| Account identifiers | Full account #, last-4 ("ending in NNNN", "*NNNN", "xNNNN"), member ID, customer ID, policy #, MICR routing/account, IBAN |
| Card details | PAN (full or partial), CVV, expiration, cardholder name, BIN |
| Transaction lines | Merchant + city/state, store number, branch, terminal ID, ATM ID, "Doing business as" names |
| Government IDs | SSN/SIN (NNN-NN-NNNN), EIN/TIN (NN-NNNNNNN), driver's license, passport, NPI, federal tax ID |
| Contact | Phone (with area code), fax, email, personal URLs/handles |
| Dates | Date of birth, anniversary, account-open date if combined with name |
| Identifiers in URLs/QR | Tokens, signed URLs in footers, "view online" links with IDs |
| Credit/financial | Credit score, FICO range, balance + last-4 combo, salary, tax withholding |
| Document metadata | core.author, lastModifiedBy, title, subject, keywords, comments, custom properties (often shows full username/email) |
| Embedded references | Phone number embedded in merchant name, city embedded in MERCHANT - CITY, area-code prefix on phone numbers (972-), bare TX / CA strings |
For numeric fields prefer scrub: — same-length deterministic digits
keep the document's layout intact (e.g. 8575 → 2252) while still
making the value untraceable.
For free-form text where you have a good fake (e.g. you want a
consistent fake name across many documents) prefer replacements:.
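A few checklist rows lend themselves to a first-pass regex scan of the extracted text. The sketch below is only a starting point: the `PATTERNS` table and `scan_for_pii` name are hypothetical, the patterns cover US-style formats only, and names, addresses, and merchants still need a manual read.

```python
import re

# Hypothetical starter patterns for a handful of checklist categories.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ein": re.compile(r"\b\d{2}-\d{7}\b"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "last4": re.compile(r"(?:ending in:?\s*|[*x])(\d{4})\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def scan_for_pii(text: str) -> dict[str, list[str]]:
    """Return the unique matches per category, omitting empty categories."""
    found = {}
    for category, pattern in PATTERNS.items():
        # Use the capture group when the pattern has one, else the whole match.
        hits = sorted({m.group(m.lastindex or 0) for m in pattern.finditer(text)})
        if hits:
            found[category] = hits
    return found

sample = "Card ending in: 8495  WHATABURGER 742 972-237-7941 GRAND PRAIRIE TX"
print(scan_for_pii(sample))
# {'phone': ['972-237-7941'], 'last4': ['8495']}
```

Each hit becomes a candidate row in the supplemental map: numeric hits usually go to scrub:, free-form ones to replacements:.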
Step 3: Propose the supplemental map
Show the user a table with three columns: Source location,
Original, Suggested action (scrub or replace → fake). Wait
for explicit approval.
When the user approves, append the confirmed items to their profile
by re-running with --save-config. The profile becomes a personal
anonymization fingerprint that grows over time, so future runs need
fewer agent discoveries:
# First time on this user's documents — start the profile.
... --scrub "CHRISTINE JAMES" --scrub "8575" \
--replace "JAMES, CHRISTINE=DOE, JANE" \
--save-config ~/.config/pdf-anonymize/personal.yml
# Subsequent runs — load profile, add today's discoveries, save back.
... --config ~/.config/pdf-anonymize/personal.yml \
--scrub "23-73-11462314-4" --scrub "17880" \
--save-config
Behavior:
- --save-config PATH writes (or overwrites) the merged set to PATH.
If PATH already exists and --config was not also passed, the
existing file is auto-loaded as the base so additions accumulate
rather than overwrite.
- --save-config with no value saves back to whatever --config
pointed at.
- Existing entries are deduplicated (same string in replacements,
scrub, or checks is not added twice).
- Comments and key order in the YAML are normalized on round-trip.
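The accumulate-and-deduplicate behavior can be pictured as a merge over already-parsed profiles. This is a sketch on plain dicts with a hypothetical `merge_profile` helper; treating the find string as the dedup key for replacements (first entry wins) is an assumption about the script.

```python
def merge_profile(base: dict, additions: dict) -> dict:
    """Merge additions into base, skipping strings that are already present."""
    merged = {
        "replacements": list(base.get("replacements", [])),
        "scrub": list(base.get("scrub", [])),
        "checks": list(base.get("checks", [])),
    }
    if "salt" in base:
        merged["salt"] = base["salt"]
    # Assumption: a replacement is a duplicate when its find string is known.
    known_finds = {r["find"] for r in merged["replacements"]}
    for r in additions.get("replacements", []):
        if r["find"] not in known_finds:
            merged["replacements"].append(r)
            known_finds.add(r["find"])
    for key in ("scrub", "checks"):
        for value in additions.get(key, []):
            if value not in merged[key]:
                merged[key].append(value)
    return merged

profile = {"salt": "personal-2026", "scrub": ["8575"]}
today = {"scrub": ["8575", "17880"], "checks": ["JAMES"]}
print(merge_profile(profile, today)["scrub"])  # ['8575', '17880']
```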
Default profile location (~/.config/pdf-anonymize/profile.yaml)
If no --config is passed, the script auto-loads
~/.config/pdf-anonymize/profile.yaml (or $XDG_CONFIG_HOME/pdf-anonymize/profile.yaml)
when present. This is the recommended location for a user's permanent
anonymization fingerprint — replacements/scrub items added here apply
to every run automatically.
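For agents that need to check whether a profile exists before a run, the lookup can be sketched as follows. The helper name is hypothetical, and giving a non-empty $XDG_CONFIG_HOME precedence over ~/.config is an assumption based on the usual XDG convention.

```python
import os
from pathlib import Path

def default_profile_path(env=None) -> Path:
    """Resolve the default profile location, honoring XDG_CONFIG_HOME if set."""
    env = os.environ if env is None else env
    # Assumption: an unset or empty XDG_CONFIG_HOME falls back to ~/.config.
    base = env.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / "pdf-anonymize" / "profile.yaml"

print(default_profile_path().exists())
```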
The first time a user runs pdf-anonymize on an interactive terminal
without that profile, they'll be prompted to create one. The starter
template is fully commented and includes a salt: placeholder they
must change. To create it explicitly (or in a non-interactive context):
pdf-anonymize --init # writes the default path
pdf-anonymize --init /custom/path.yaml # writes elsewhere
--init refuses to overwrite an existing file. Use
--no-default-config to skip auto-load and the prompt for a single
run. Agent invocations (non-TTY) never block on the prompt; they
silently fall through to whatever flags were passed.
When proposing an updated map (Step 3), prefer pointing
--save-config at the default profile so it grows over time:
pdf-anonymize --in ... --out ... \
--scrub "..." --replace "..." \
--save-config # saves back to ~/.config/pdf-anonymize/profile.yaml
# if that's what was auto-loaded
Step 4: Apply Replacements
Run the bundled script via uvx so PyMuPDF is pulled into an ephemeral
isolated environment (no global installs):
uvx --with pymupdf python3 scripts/anonymize_pdf.py \
--in "April 22.pdf" \
--out "April 22 (Anon).pdf" \
--replace "RUDY R JAMES JR=BILL JOHNSON" \
--replace "5519 MCCAIN CT=1428 ELMWOOD AVE" \
--replace "DALLAS TX 75249-1650=DENVER CO 80014-2210" \
--replace "ending in: 8495=ending in: 0000"
If the user already has pymupdf available, plain python3 scripts/anonymize_pdf.py ... works.
For DOCX/XLSX inputs, include those libraries:
uvx --with pymupdf --with python-docx --with openpyxl python3 \
scripts/anonymize_pdf.py \
--in "Statement.xlsx" \
--out "Statement (Anon).xlsx" \
--replace "John Smith=REDACTED" \
--replace "12345=99999"
For a whole directory (top-level only by default; pass -r to recurse):
uvx --with pymupdf --with python-docx --with openpyxl python3 \
scripts/anonymize_pdf.py \
--in ./statements/ \
--out ./statements-anon/ \
--recursive \
--replace "RUDY R JAMES JR=BILL JOHNSON" \
--replace "5519 MCCAIN CT=1428 ELMWOOD AVE"
Reusable config + auto-scrub
For values you redact repeatedly, put them in a YAML config and reuse it:
# anon-profile.yml
salt: "personal-2026"          # rotate to invalidate old fakes (optional)
replacements:                  # explicit find → replace pairs
  - find: "RUDY R JAMES JR"
    replace: "BILL JOHNSON"
  - find: "5519 MCCAIN CT"
    replace: "1428 ELMWOOD AVE"
scrub:                         # auto-fake (deterministic from salt)
  - "8575"                     # → same-length digits, e.g. "2252"
  - "23-73-11462314-4"         # → "72-27-20214807-9"
  - "CHRISTINE JAMES"          # → "PERSON_<6 hex>"
  - "17-F133-77"               # → "REDACTED_<6 hex>"
checks:                        # extra tokens for --verify
  - "JAMES"
  - "WOODSTOCK"
uvx --with pymupdf --with python-docx --with openpyxl --with pyyaml \
python3 scripts/anonymize_pdf.py \
--in ./statements/ --out ./statements-anon/ -r \
--config anon-profile.yml
CLI flags --replace OLD=NEW, --scrub WORD, and --check WORD merge on
top of the config. A --scrub value is auto-classified by shape:
- digits-only or numeric-with-separators (≥2 digits) → same-length
deterministic digits, separators preserved
- mostly letters/spaces (≥70%) → PERSON_<6 hex>
- otherwise → REDACTED_<6 hex>
Replacements are deterministic per (salt, value), so the same input
always becomes the same fake — drop in new statements at any time and
they'll redact consistently. Each scrub value is also auto-added to the
verify check list.
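The shape rules and the (salt, value) determinism can be sketched as below. HMAC-SHA256 as the source of fake digits and hex suffixes is an assumption made for this example, so the values it produces will not match the sample fakes shown in the config above; only the classification and the same-input-same-output property are the point.

```python
import hashlib
import hmac
import re

def scrub(value: str, salt: str) -> str:
    """Return a deterministic fake for value, chosen by the value's shape."""
    # Assumption: HMAC-SHA256 keyed by the salt; the real script may differ.
    digest = hmac.new(salt.encode(), value.encode(), hashlib.sha256).digest()
    digits = sum(c.isdigit() for c in value)
    # Numeric shape: only digits and common separators, with at least 2 digits.
    if digits >= 2 and re.fullmatch(r"[\d\s\-./]+", value):
        stream = iter(digest * (len(value) // len(digest) + 1))
        return "".join(str(next(stream) % 10) if c.isdigit() else c for c in value)
    letters = sum(c.isalpha() or c.isspace() for c in value)
    prefix = "PERSON" if value and letters / len(value) >= 0.70 else "REDACTED"
    return f"{prefix}_{digest.hex()[:6]}"

for token in ("8575", "23-73-11462314-4", "CHRISTINE JAMES", "17-F133-77"):
    print(token, "->", scrub(token, "personal-2026"))
```

Because the fake depends only on the salt and the original value, re-running on next month's statement maps the same account number to the same fake, and rotating the salt changes every fake at once.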
The script applies replacements longest-first (so "RUDY R JAMES JR" wins
over "RUDY") and applies each redaction completely before moving to the
next. Output reports counts per pattern.
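Why longest-first matters is easiest to see on a plain string. This is an illustrative sketch on text only (the hypothetical `apply_longest_first` does not touch PDF content), showing the ordering and the per-pattern counts.

```python
def apply_longest_first(text: str, mapping: dict[str, str]) -> tuple[str, dict[str, int]]:
    """Apply each replacement fully, longest pattern first, and count hits."""
    counts = {}
    for old in sorted(mapping, key=len, reverse=True):
        counts[old] = text.count(old)
        text = text.replace(old, mapping[old])
    return text, counts

mapping = {"RUDY": "BILL", "RUDY R JAMES JR": "BILL JOHNSON"}
result, counts = apply_longest_first("RUDY R JAMES JR / RUDY", mapping)
print(result)   # BILL JOHNSON / BILL
print(counts)   # {'RUDY R JAMES JR': 1, 'RUDY': 1}
```

Applied shortest-first, the same input would become "BILL R JAMES JR / BILL", leaving the surname in place.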
Step 5: Catch Embedded References
Watch for cases where PII is embedded inside other text, e.g. a merchant
name like ANDY'S - GRAND PRAIRIE includes the city. The verifier (Step 6)
catches these. Just add another --replace and re-run.
Common embedded forms:
- Merchant <NAME> - <CITY> → also map the merchant form
- Phone numbers in transaction descriptions
- Bare TX/CA strings in MICR/footer lines → swap with target state
- Last-4 digits referenced separately from the full 16-digit account number
Step 6: Verify
After every run, verify zero leakage by extracting text and searching for
every original token:
uvx --with pymupdf python3 scripts/anonymize_pdf.py --verify "$NEW_PDF" \
--check "RUDY" --check "8495" --check "MCCAIN" --check "DALLAS"
For DOCX/XLSX, verify uses a package-level scan (parsed logical text per
container and raw byte search per part) so PII split across runs is
caught. Numeric-looking tokens are also matched in separator-stripped form
(so 123-45-6789 is detected as 123456789):
uvx --with pymupdf --with python-docx --with openpyxl python3 \
scripts/anonymize_pdf.py --verify "Statement (Anon).xlsx" \
--check "John Smith" --check "12345"
--verify also accepts a directory (with -r for recursion) and exits
non-zero if any check leaks in any file.
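The separator-stripped matching for numeric-looking tokens can be sketched on extracted text. The `count_leaks` helper, its separator set, and case-sensitive matching are assumptions for illustration rather than the verifier's actual rules.

```python
import re

SEPARATORS = re.compile(r"[\s\-./]")  # assumed separator set

def count_leaks(text: str, token: str) -> int:
    """Count occurrences of token, also comparing numeric tokens without separators."""
    hits = text.count(token)
    bare = SEPARATORS.sub("", token)
    if bare.isdigit() and len(bare) >= 2:
        # Compare digits only, so 123-45-6789 and 123456789 count as the same value.
        hits = max(hits, SEPARATORS.sub("", text).count(bare))
    return hits

print(count_leaks("SSN 123456789 on file", "123-45-6789"))  # 1
print(count_leaks("Acct 12-34", "1234"))                    # 1
print(count_leaks("Prepared for Alex", "John Smith"))       # 0
```

Stripping separators from the whole text is deliberately conservative: it can over-report when unrelated digit groups sit side by side, which is the safer failure mode for a leak check.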
Every check must report 0 occurrences before declaring success. The
exit code is non-zero when any check leaks, so this works inside CI:
if ! python3 scripts/anonymize_pdf.py --verify "$OUT" --check RUDY; then
  echo "PII leaked!" >&2
  exit 1
fi
Step 7: Report
Show the user:
- File path of the scrubbed copy
- Total replacement count (broken down by pattern from script output)
- Verification result (✅ 0 occurrences for each original token)
- Reminder that this is true redaction (irrecoverable), pure local
processing, no AI service called
Important Rules
- Always work on a copy. Never modify the source PDF in place; create
<original> (Anon).pdf or similar.
- Get explicit approval of the replacement map before running the
script.
- Verify after every run. Run --verify and confirm 0 occurrences of
every original token. If any leak, look at the verifier output for
surrounding context and add more --replace patterns.
- Never use nano-pdf for this. It re-renders pages via Gemini and
degrades text in financial documents.
- Address transaction-level PII. Don't stop at the cardholder block:
merchants, store numbers, phone numbers, and city/state on transaction
lines also leak location. Change them all.
- Longest-first ordering is automatic. If you list both
"RUDY R JAMES JR" → "BILL JOHNSON" and "RUDY" → "BILL", the longer
pattern wins wherever it matches, and the shorter one only applies to
the remaining standalone occurrences.
- Replacement text font may differ slightly. PyMuPDF re-lays
replacements in Helvetica. On PDFs that use a custom branded font, the
replaced regions look subtly different from surrounding text. This is
visually noticeable but functionally fine: the data is gone.
- No cloud calls. This skill must remain fully local. Do not pipe
content through any LLM service for the actual edit step.
Available Scripts
This skill ships as an installable Python package. Three equivalent ways
to invoke it — pick whichever fits the runtime:
| Form | Command | When |
|---|---|---|
| Console script | pdf-anonymize ... | After pipx install "pdf-anonymize[all]" or uv tool install. |
| Ephemeral (uvx) | uvx --from git+https://github.com/jamesconsultingllc/pdf-anonymize --with python-docx --with openpyxl --with pyyaml pdf-anonymize ... | One-off, no install. |
| In-skill launcher | uvx --with pymupdf --with python-docx --with openpyxl --with pyyaml python3 scripts/anonymize_pdf.py ... | When this skill folder is checked out and the agent runs from inside it. |
- scripts/anonymize_pdf.py: thin launcher. Adds ../src to sys.path and
delegates to pdf_anonymize.cli:main. Lets agents and users run the tool
directly from a clone without installing.
- src/pdf_anonymize/cli.py: the actual implementation. Also exposed as
python -m pdf_anonymize (after install) and as the pdf-anonymize console
script.
For PDF-only work, keep only pymupdf in the --with list; the other
libraries are imported lazily and are unused for PDF inputs.
Examples
Example 1: Anonymize a Bank of America credit-card statement
User asks: "Make a copy of this BoA statement with my name and address
changed, and remove all transaction location info so nobody can profile
where I shop."
# 1. Inspect first to identify PII
uvx --with pymupdf python3 -c "
import fitz, sys
doc = fitz.open(sys.argv[1])
for i, p in enumerate(doc, 1):
    print(f'=== Page {i} ==='); print(p.get_text() or '')
" "April 25.pdf"
# 2. Apply replacements (longest patterns automatically win)
uvx --with pymupdf python3 scripts/anonymize_pdf.py \
--in "April 25.pdf" \
--out "April 25 (Anon).pdf" \
--replace "RUDY R JAMES JR=BILL JOHNSON" \
--replace "5519 MCCAIN CT=1428 ELMWOOD AVE" \
--replace "DALLAS TX 75249-1650=DENVER CO 80014-2210" \
--replace "4400 6631 4372 3867=4400 6631 4372 0000" \
--replace "GRAND PRAIRIE TX=AURORA CO" \
--replace "ANDY'S - GRAND PRAIRIE=ANDY'S - AURORA" \
--replace "DALLAS TX=DENVER CO" \
--replace " TX = CO " \
--replace "WHATABURGER 742=WHATABURGER 999" \
--replace "972-237-7941=303-555-0100"
# 3. Verify nothing leaked
uvx --with pymupdf python3 scripts/anonymize_pdf.py \
--verify "April 25 (Anon).pdf" \
--check RUDY --check JAMES --check MCCAIN --check DALLAS \
--check "GRAND PRAIRIE" --check "4372 3867" --check 75249
Expected output of step 3: every line must say ✅ ... 0 occurrence(s).
Example 2: Strip transaction location info
User asks: "Anonymize this checking statement so nobody can tell what city
I shop in."
uvx --with pymupdf python3 scripts/anonymize_pdf.py \
--in "checking-2026-04.pdf" \
--out "checking-2026-04 (Anon).pdf" \
--replace "JANE Q DOE=ALEX SMITH" \
--replace "9001 PRESTON RD=500 5TH AVE" \
--replace "DALLAS TX 75225=NEW YORK NY 10001" \
--replace "STARBUCKS #14672=STARBUCKS #20015" \
--replace "KROGER #586=KROGER #311" \
--replace "GRAPEVINE TX=BROOKLYN NY" \
--replace "DALLAS TX=NEW YORK NY" \
--replace "PLANO TX=QUEENS NY"
Example 3: Dry-run by verifying first
You can run --verify against the original PDF before any edits to see
exactly which tokens currently appear:
uvx --with pymupdf python3 scripts/anonymize_pdf.py \
--verify "April 22.pdf" \
--check "RUDY" --check "MCCAIN" --check "8495"
Output indicates what to scrub:
❌ 'RUDY': 11 occurrence(s)
❌ 'MCCAIN': 1 occurrence(s)
❌ '8495': 14 occurrence(s)
Exit code is non-zero when any check finds a leak.
References
Load these only when the trigger condition applies (progressive disclosure):
- references/scanned-pdfs.md — load when
fitz.Page.get_text() returns empty for every page (image-only PDF;
this skill cannot handle it, see the file for alternatives).
PDF Anonymize
Performs true redaction of PII in a PDF by rewriting the content stream so
the original strings are gone — not just visually covered. Pure local
processing using open-source tools (pikepdf / qpdf). No cloud API calls.
When to Use
- User wants to share a financial statement, bill, or report without leaking
identity (name, address, last-4 digits, transaction locations, etc.)
- User asks to "make this not traceable" or "scrub identifying info"
- Replacing a name or other field everywhere it appears in a multi-page PDF
- Removing or replacing transaction-level identifiers (store numbers, city
names) so location can't be inferred
When NOT to Use
- The user only needs a visual mock-up: use nano-pdf instead (image-based)
- The PDF is a scanned image with no text layer: see References
- The user wants to redact (black-bar) rather than replace: this skill
replaces; for black-bar redaction add a different workflow
Why True Redaction Matters
Overlay redaction (drawing a white box) leaves the original text intact under
the box — Ctrl+F, copy-paste, and pdftotext still find it. nano-pdf
re-renders the page via an image AI, which degrades text on financial docs.
This skill rewrites the actual Tj/TJ strings, so the original bytes are
gone from the file.
If the input PDF has no extractable text, see
references/scanned-pdfs.md — this skill cannot
handle scanned/image-only PDFs.
Procedure
Step 1: Discover PII
Extract every text string from the PDF and present a list of candidates to the
user, grouped by category. Do not assume what should change — confirm.
uvx --with pypdf python3 -c "
import pypdf, sys
r = pypdf.PdfReader(sys.argv[1])
for i, p in enumerate(r.pages, 1):
    print(f'=== Page {i} ===')
    print(p.extract_text() or '')
" "$PDF"
Look for these PII categories:
| Category | Examples |
|---|---|
| Names | Cardholder, joint owner, beneficiaries |
| Address | Street, city, state, ZIP |
| Account identifiers | Last-4, full account #, member ID |
| Transaction identifiers | Merchant name, store #, branch, city/state on each line |
| Credit identifiers | Credit score, FICO range |
| Contact | Phone, email |
| Dates | Birth date if printed (rare in statements) |
| Hidden/encoded glyphs | See Step 4 |
Step 2: Propose Replacement Map
Present a table to the user showing original → replacement. Use plausible
fake values that match length where possible to avoid layout breakage:
| Original | Replacement |
|-------------------------|--------------------------|
| RUDY R JAMES JR | BILL JOHNSON |
| 5519 MCCAIN CT | 1234 ELM ST |
| DALLAS TX 75249-1650 | PHOENIX AZ 85001-0000 |
| ending in: 8495 | ending in: 4321 |
| credit score is 720     | credit score is 715      |
For transaction lines, change the city/state to a different region so the
user's home area can't be inferred. Change store numbers too (e.g.,
STORE #435 → STORE #812).
Get explicit user approval before applying.
Step 3: Apply Content-Stream Replacements
Run the bundled script via uvx so dependencies are pulled into an
ephemeral isolated environment (no global installs):
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
--in "April 22.pdf" \
--out "April 22 (Anon).pdf" \
--replace "RUDY R JAMES JR=BILL JOHNSON" \
--replace "5519 MCCAIN CT=1234 ELM ST" \
--replace "DALLAS TX 75249-1650=PHOENIX AZ 85001-0000" \
--replace "ending in: 8495=ending in: 4321"
If the user already has pikepdf and pypdf available on their Python path,
plain python3 scripts/anonymize_pdf.py ... works too.
The script edits every Tj, TJ, ', and " text-showing operator across
all pages. Replacements are applied longest-first to avoid partial overlap
issues (e.g., replace "RUDY R JAMES" before "RUDY").
Step 4: Handle Custom-Encoded Glyphs
Gotcha: Some PDFs (e.g., Citi statements) print sensitive routing codes
using a custom-encoded font where the visible "8495" is actually hex
bytes like <f8f4f9f5f0f0> in the content stream. Standard string
replacement misses these because pikepdf doesn't see them as "8495".
When the verification step (Step 5) reports residual PII that you can't find
in the parsed text strings, the cause is almost always a custom-encoded
font. Read references/pikepdf-content-streams.md
for background on PDF text operators and the two string encodings, then dump
the raw content streams and search for hex strings:
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
--in "$PDF" --out "$PDF.scrubbed" \
--hex-replace "f8f4f9f5f0f0=f0f0f0f0f0f0"
The right-hand side should reuse the same font's encoding bytes, e.g., if
f0 renders as "0", a string of f0 bytes shows as zeros. To find the
correct bytes for replacement, inspect the font's /ToUnicode or
/Differences entries.
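What a hex replacement does to a decoded content stream can be sketched with a byte-level substitution. The `hex_replace` function below is a hypothetical stand-in for the --hex-replace flag, and requiring equal-length operands is an assumption made to keep the glyph count unchanged.

```python
import re

def hex_replace(stream: bytes, old_hex: str, new_hex: str) -> tuple[bytes, int]:
    """Swap one <hex> string for another in a decoded content stream."""
    if len(old_hex) != len(new_hex):
        raise ValueError("replacement must encode the same number of glyph bytes")
    bytes.fromhex(old_hex)  # raises ValueError if either side is not valid hex
    bytes.fromhex(new_hex)
    # PDF hex strings are case-insensitive, so match <F8F4...> and <f8f4...> alike.
    pattern = re.compile(rb"<" + re.escape(old_hex.encode()) + rb">", re.IGNORECASE)
    return pattern.subn(b"<" + new_hex.lower().encode() + b">", stream)

stream = b"BT /F1 9 Tf <F8F4F9F5F0F0> Tj ET"
patched, count = hex_replace(stream, "f8f4f9f5f0f0", "f0f0f0f0f0f0")
print(count, patched.decode())  # 1 BT /F1 9 Tf <f0f0f0f0f0f0> Tj ET
```

The sketch only matches a hex string that consists of exactly the target bytes; values embedded in a longer hex string, or hex strings containing whitespace, would need extra handling.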
Step 5: Verify
After every run, verify zero leakage by extracting text and searching for
every original token:
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py --verify "$NEW_PDF" \
--check "RUDY" --check "8495" --check "MCCAIN" --check "DALLAS"
Every check must report 0 occurrences before declaring success.
Step 6: Report
Show the user:
- File path of the scrubbed copy
- Table of all replacements (count per category)
- Verification result (✅ 0 occurrences for each original token)
- Reminder that this is true redaction (irrecoverable), pure local processing,
no AI service called
Important Rules
- Always work on a copy. Never modify the source PDF in place; create
<original> (Anon).pdf or similar.
- Get explicit approval of the replacement map before running the script.
- Verify after every run. Run text extraction and confirm 0 occurrences
of every original token. If any leak, investigate custom-encoded glyphs
(Step 4).
- Never use nano-pdf for this. It re-renders pages via Gemini and degrades
text in financial documents.
- Address transaction-level PII. Don't stop at the cardholder block:
merchants, store numbers, and city/state on transaction lines also leak
location. Change them all.
- Keep replacements layout-safe. Prefer replacements at most the same
length as the original. PDF text positions are absolute, so much shorter or
longer strings won't reflow but won't break either; the document just looks
off.
- No cloud calls. This skill must remain fully local. Do not pipe content
through any LLM service for the actual edit step.
Available Scripts
- scripts/anonymize_pdf.py: content-stream find/replace, hex-byte
replace, and verification. Self-contained; only depends on pikepdf and
pypdf. Run via uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py ...
for ephemeral execution.
Examples
Example 1: Anonymize a credit-card statement
User asks: "Make a copy of this Citi statement with my name and address
changed to fake values, and remove the last-4 of my account."
# 1. Inspect first to identify PII
uvx --with pypdf python3 -c "
import pypdf, sys
r = pypdf.PdfReader(sys.argv[1])
for i, p in enumerate(r.pages, 1):
    print(f'=== Page {i} ==='); print(p.extract_text() or '')
" "April 22.pdf"
# 2. Apply replacements
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
--in "April 22.pdf" \
--out "April 22 (Anon).pdf" \
--replace "RUDY R JAMES JR=BILL JOHNSON" \
--replace "5519 MCCAIN CT=1234 ELM ST" \
--replace "DALLAS TX 75249-1650=PHOENIX AZ 85001-0000" \
--replace "ALLISHA L JAMES=JANE JOHNSON" \
--replace "ending in: 8495=ending in: 4321" \
--replace "ending in 8495=ending in 4321" \
--replace "Card ending in 8495=Card ending in 4321" \
--replace "Card ending in 6588=Card ending in 5678" \
--replace "credit score is 720=credit score is 715" \
--hex-replace "f8f4f9f5f0f0=f0f0f0f0f0f0"
# 3. Verify nothing leaked
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
--verify "April 22 (Anon).pdf" \
--check RUDY --check JAMES --check ALLISHA --check MCCAIN \
--check DALLAS --check 75249 --check 8495 --check 6588 --check 720
Expected output of step 3: every line must say ✅ ... 0 occurrence(s):
✅ 'RUDY': 0 occurrence(s)
✅ 'JAMES': 0 occurrence(s)
✅ 'ALLISHA': 0 occurrence(s)
✅ 'MCCAIN': 0 occurrence(s)
...
Example 2: Strip transaction location info from a bank statement
User asks: "Anonymize this checking statement so nobody can tell what city
I shop in."
The statement has lines like STARBUCKS #14672 DALLAS TX and
KROGER #586 GRAPEVINE TX. Goal: keep the merchant brands but move
locations to a different region.
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
--in "checking-2026-04.pdf" \
--out "checking-2026-04 (Anon).pdf" \
--replace "JANE Q DOE=ALEX SMITH" \
--replace "9001 PRESTON RD=500 5TH AVE" \
--replace "DALLAS TX 75225=NEW YORK NY 10001" \
--replace "STARBUCKS #14672=STARBUCKS #20015" \
--replace "KROGER #586=KROGER #311" \
--replace "GRAPEVINE TX=BROOKLYN NY" \
--replace "DALLAS TX=NEW YORK NY" \
--replace "PLANO TX=QUEENS NY"
Example 3: Hex replacement for a custom-encoded font
When --verify reports residual leakage but you can't find the token in the
text strings (the PDF prints sensitive data using a custom subset font), use
--hex-replace to operate on the raw glyph bytes:
# First find the offending hex string by inspecting raw streams:
uvx --with pikepdf python3 -c "
import pikepdf, re, sys
pdf = pikepdf.open(sys.argv[1])
for i, page in enumerate(pdf.pages, 1):
    raw = page.Contents.read_bytes()
    for m in re.finditer(rb'<([0-9a-f]{6,})>', raw, re.I):
        print(f'P{i}: {m.group(0).decode()}')
" "input.pdf" | sort -u
# Then redact it (replace each glyph with the font's '0' glyph):
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
--in "input.pdf" --out "input-anon.pdf" \
--hex-replace "f8f4f9f5f0f0=f0f0f0f0f0f0"
Example 4: Dry-run by verifying first
You can run --verify against the original PDF before any edits to see
exactly which tokens currently appear and need attention:
uvx --with pikepdf --with pypdf python3 scripts/anonymize_pdf.py \
--verify "April 22.pdf" \
--check "RUDY" --check "MCCAIN" --check "8495"
Output indicates what to scrub:
❌ 'RUDY': 11 occurrence(s)
❌ 'MCCAIN': 1 occurrence(s)
❌ '8495': 14 occurrence(s)
Exit code is non-zero when any check finds a leak, so this works inside CI
or shell pipelines (if ! python3 ... --verify ...; then echo leaked; fi).
References
Load these only when the trigger condition applies (progressive disclosure):
- references/pikepdf-content-streams.md: load when a --verify check still
leaks despite running replacements (Step 4); explains text operators,
literal vs hex strings, and custom font encodings.
- references/scanned-pdfs.md: load when pypdf.extract_text() returns empty
for every page (image-only PDF; this skill cannot handle it, see the file
for alternatives).