"Creates, edits, and analyzes Word documents with tracked changes, comments, and formatting preservation. Use when working with .docx files for document creation, modification, redlining, or text extraction."
Install
Install via the SkillsCat registry:
npx skillscat add costa-marcello/skillkit/docx
DOCX Creation, Editing, and Analysis
Read the relevant reference file completely before starting work:
- Creating a new document: read references/docx-js.md
- Editing an existing document: read references/ooxml.md
Workflow Decision Tree
| Task | Workflow | Reference |
|---|---|---|
| Read/analyse content | Text extraction (pandoc) or Raw XML | None needed |
| Create new document | docx-js (JavaScript) | references/docx-js.md |
| Edit your own doc (simple) | OOXML editing | references/ooxml.md |
| Edit someone else's doc | Redlining workflow (recommended) | references/ooxml.md |
| Legal/business/government | Redlining workflow (required) | references/ooxml.md |
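The table can also be read as a small lookup. The sketch below is illustrative only: `pick_workflow`, its `task` strings, and the `own_document` and `formal` flags are hypothetical names, not part of the skill.

```python
from typing import Optional, Tuple


def pick_workflow(task: str, own_document: bool = True,
                  formal: bool = False) -> Tuple[str, Optional[str]]:
    """Return (workflow, reference file) following the decision table.

    task is one of "read", "create", "edit" (hypothetical labels).
    formal marks legal, business, or government documents.
    """
    if task == "read":
        return ("Text extraction (pandoc) or raw XML", None)
    if task == "create":
        return ("docx-js", "references/docx-js.md")
    if task == "edit":
        # Someone else's document, or any formal document, gets tracked changes.
        if formal or not own_document:
            return ("Redlining workflow", "references/ooxml.md")
        return ("OOXML editing", "references/ooxml.md")
    raise ValueError(f"unknown task: {task!r}")
```

For example, `pick_workflow("edit", own_document=False)` selects the redlining workflow.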
Reading and Analysing Content
Text Extraction (Default)
Convert the document to markdown with pandoc:
pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept (default) / reject / all
Default to --track-changes=all to preserve revision history. Use accept only when the user wants clean text without markup.
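When the extraction is driven from a script, the same command can be built in Python. This is a minimal sketch: `pandoc_cmd` and `extract` are hypothetical helper names, and `extract` assumes `pandoc` is on the PATH.

```python
import subprocess

# The three values pandoc accepts for --track-changes.
MODES = ("accept", "reject", "all")


def pandoc_cmd(src: str, dest: str, track_changes: str = "all") -> list:
    """Build the pandoc command line shown above as an argument list."""
    if track_changes not in MODES:
        raise ValueError(f"track_changes must be one of {MODES}")
    return ["pandoc", f"--track-changes={track_changes}", src, "-o", dest]


def extract(src: str, dest: str, track_changes: str = "all") -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_cmd(src, dest, track_changes), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.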
Raw XML Access
Use raw XML when you need: comments, complex formatting, document structure, embedded media, or metadata.
python ooxml/scripts/unpack.py <office_file> <output_directory>
Key files after unpacking:
- word/document.xml -- main document body
- word/comments.xml -- comments referenced in document.xml
- word/media/ -- embedded images and media
- Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags
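For a quick look at tracked changes without the unpack script, a .docx can be read directly as a zip archive. The sketch below uses only the standard library; `tracked_changes` is a hypothetical helper, not part of the skill's scripts. For untrusted files, the defusedxml parser is the safer choice than the standard-library one used here.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in document.xml).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tracked_changes(docx) -> dict:
    """Collect inserted and deleted text from word/document.xml.

    docx may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as z:
        root = ET.fromstring(z.read("word/document.xml"))
    # Inserted text lives in <w:t> under <w:ins>.
    insertions = ["".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
                  for node in root.iter(f"{{{W}}}ins")]
    # Deleted text lives in <w:delText> under <w:del>.
    deletions = ["".join(t.text or "" for t in node.iter(f"{{{W}}}delText"))
                 for node in root.iter(f"{{{W}}}del")]
    return {"insertions": insertions, "deletions": deletions}
```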
Creating a New Word Document
Use docx-js (JavaScript/TypeScript) for new documents.
- Read references/docx-js.md completely
- Write a script using Document, Paragraph, TextRun components
- Export with Packer.toBuffer()
- Verify the output opens in Word/LibreOffice without errors
Action:
- Read references/docx-js.md
- Create script with Document, Paragraph, TextRun, numbering config for bullets
- Run: node memo.js
- Verify: soffice --headless --convert-to pdf memo.docx && pdftoppm -jpeg -r 150 memo.pdf preview
Editing an Existing Word Document
Use the Document library (Python) from scripts/document.py. It handles infrastructure setup automatically (people.xml, RSIDs, settings.xml, comments, relationships, content types).
Standard Editing Workflow
- Read references/ooxml.md completely (focus on "Document Library" section)
- Unpack: python ooxml/scripts/unpack.py <file.docx> <output_dir>
- Edit using Document library methods
- Pack: python ooxml/scripts/pack.py <output_dir> <result.docx>
- Verify: convert to markdown and check output
Action:
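from scripts.document import Document

# Open the unpacked directory with revision tracking enabled
doc = Document('unpacked', track_revisions=True)
# Find the run to change (assumed here to read "within 30 days")
node = doc["word/document.xml"].get_node(tag="w:r", contains="30 days")
# Reuse the run's formatting (<w:rPr>) so the new runs match the original
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
# [unchanged] + [deletion] + [insertion] + [unchanged];
# "ORIGINAL" is a placeholder for the original run's RSID
replacement = (
    f'<w:r w:rsidR="ORIGINAL">{rpr}<w:t>within </w:t></w:r>'
    f'<w:del><w:r>{rpr}<w:delText>30</w:delText></w:r></w:del>'
    f'<w:ins><w:r>{rpr}<w:t>60</w:t></w:r></w:ins>'
    f'<w:r w:rsidR="ORIGINAL">{rpr}<w:t> days</w:t></w:r>'
)
doc["word/document.xml"].replace_node(node, replacement)
doc.save()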
Redlining Workflow (Document Review with Tracked Changes)
Plan tracked changes in markdown before implementing in OOXML. Group related changes into batches of 3-10 for manageable debugging.
Principle: Minimal, Precise Edits. Only mark text that actually changes. Repeating unchanged text makes edits harder to review. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text.
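The [unchanged text] + [deletion] + [insertion] + [unchanged text] split can be computed mechanically. The sketch below is one way to do it, assuming a word-level comparison; `split_change` and `tracked_runs` are hypothetical helpers, not Document library methods. The generated XML leaves out the RSID on unchanged runs and any attributes on `<w:ins>` and `<w:del>`, so treat it as a starting point for `replace_node`, not a finished change.

```python
import re
from xml.sax.saxutils import escape


def split_change(old: str, new: str) -> tuple:
    """Return (unchanged_before, deleted, inserted, unchanged_after).

    Compares word and whitespace tokens so whole words are marked,
    not single characters.
    """
    a = re.findall(r"\s+|\S+", old)
    b = re.findall(r"\s+|\S+", new)
    # Length of the common prefix, in tokens.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    # Length of the common suffix, not overlapping the prefix.
    j = 0
    while j < min(len(a), len(b)) - i and a[len(a) - 1 - j] == b[len(b) - 1 - j]:
        j += 1
    return ("".join(a[:i]), "".join(a[i:len(a) - j]),
            "".join(b[i:len(b) - j]), "".join(a[len(a) - j:]))


def tracked_runs(old: str, new: str, rpr: str = "") -> str:
    """Render the split as runs; xml:space="preserve" keeps edge whitespace."""
    before, deleted, inserted, after = split_change(old, new)

    def run(text: str, tag: str = "w:t") -> str:
        return f'<w:r>{rpr}<{tag} xml:space="preserve">{escape(text)}</{tag}></w:r>'

    parts = []
    if before:
        parts.append(run(before))
    if deleted:
        parts.append(f'<w:del>{run(deleted, "w:delText")}</w:del>')
    if inserted:
        parts.append(f'<w:ins>{run(inserted)}</w:ins>')
    if after:
        parts.append(run(after))
    return "".join(parts)
```

Comparing whole tokens keeps the marked text readable: a character-level comparison of "30" and "60" would mark only the first digit.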
Step-by-Step
Get markdown representation:
pandoc --track-changes=all path-to-file.docx -o current.md
Identify and group changes. Organise into batches by section, type, or proximity. Use these location methods for finding text in XML:
- Section/heading numbers (e.g., "Section 3.2")
- Grep patterns with unique surrounding text
- Document structure (e.g., "first paragraph after Heading 2")
- Do NOT use markdown line numbers -- they do not map to XML structure
Read documentation and unpack:
- Read references/ooxml.md -- focus on "Document Library" and "Tracked Change Patterns"
- Unpack: python ooxml/scripts/unpack.py <file.docx> <dir>
- Note the suggested RSID from unpack script
Implement changes in batches. For each batch:
- Grep word/document.xml to verify current text and line numbers (they shift after each script)
- Write a script using get_node to find nodes, then replace_node, suggest_deletion, or insert_after
- Run the script and verify with doc.save()
Pack the document:
python ooxml/scripts/pack.py unpacked reviewed-document.docx
Final verification:
pandoc --track-changes=all reviewed-document.docx -o verification.md
grep "original phrase" verification.md # Should NOT match
grep "replacement phrase" verification.md # Should match
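With many changes, the two grep checks are easier to run as one pass over the verification file. A minimal sketch, assuming plain substring matching just like grep with a fixed phrase; `check_redline` is a hypothetical helper.

```python
def check_redline(markdown: str, removed: list, added: list) -> list:
    """Return a list of problems; an empty list means every check passed.

    removed: phrases that should no longer appear as-is.
    added: phrases that should now appear.
    """
    problems = [f"still present: {p!r}" for p in removed if p in markdown]
    problems += [f"missing: {p!r}" for p in added if p not in markdown]
    return problems
```

A typical call reads the pandoc output first, for example `check_redline(open("verification.md", encoding="utf-8").read(), ["original phrase"], ["replacement phrase"])`.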
Batch plan:
- Batch 1 (Term changes): "2 years" to "1 year" in Section 5
- Batch 2 (Jurisdiction): "New York" to "Delaware" in Section 8
Per batch: grep for text, write script, run, verify. After all batches, pack and do final verification.
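A batch plan like this can also be kept as data so that each batch drives one script. The sketch below groups changes by section and caps batch size at 10; the `Change` class and `batches` function are hypothetical, and nothing here enforces the lower bound of 3.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Change:
    section: str  # where to look, e.g. "Section 5"
    old: str      # text to find
    new: str      # replacement text


def batches(changes: list, max_size: int = 10) -> list:
    """Group changes by section, splitting any group larger than max_size."""
    out = []
    ordered = sorted(changes, key=lambda c: c.section)
    for _, group in groupby(ordered, key=lambda c: c.section):
        group = list(group)
        out += [group[i:i + max_size] for i in range(0, len(group), max_size)]
    return out


plan = [
    Change("Section 5", "2 years", "1 year"),
    Change("Section 8", "New York", "Delaware"),
]
```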
Method Selection Guide
| Scenario | Method |
|---|---|
| Change part of regular text | replace_node() with <w:del>/<w:ins> |
| Delete entire run or paragraph | suggest_deletion() |
| Reject another author's insertion | revert_insertion() (NOT suggest_deletion()) |
| Restore another author's deletion | revert_deletion() |
| Partially modify another author's change | replace_node() with nested <w:ins>/<w:del> |
Converting Documents to Images
Two-step process for visual analysis:
# Step 1: DOCX to PDF
soffice --headless --convert-to pdf document.docx
# Step 2: PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page
# Creates page-1.jpg, page-2.jpg, etc.
# For specific pages only:
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page
Use -r 150 for a good quality/size balance. Increase to 300 for print-quality output.
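The two steps can be wrapped for scripted use. This is a sketch under stated assumptions: `image_cmds` and `render` are hypothetical helpers, both tools are assumed to be on the PATH, and the PDF is assumed to land in the current working directory because no output directory is passed to soffice.

```python
import subprocess
from pathlib import Path


def image_cmds(docx: str, prefix: str = "page", dpi: int = 150,
               first=None, last=None) -> list:
    """Build the soffice and pdftoppm argument lists for one document."""
    # Assumption: soffice writes <name>.pdf into the current directory.
    pdf = Path(docx).with_suffix(".pdf").name
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf", str(docx)]
    to_jpeg = ["pdftoppm", "-jpeg", "-r", str(dpi)]
    if first is not None:
        to_jpeg += ["-f", str(first)]  # first page to render
    if last is not None:
        to_jpeg += ["-l", str(last)]   # last page to render
    to_jpeg += [pdf, prefix]
    return [to_pdf, to_jpeg]


def render(docx: str, **options) -> None:
    """Run both steps in order, stopping at the first failure."""
    for cmd in image_cmds(docx, **options):
        subprocess.run(cmd, check=True)
```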
Code Style
Write concise code. Avoid verbose variable names, redundant operations, and unnecessary print statements.
Dependencies
Install if not available:
| Dependency | Install | Purpose |
|---|---|---|
| pandoc | brew install pandoc or apt-get install pandoc | Text extraction |
| docx | npm install -g docx | Creating new documents |
| LibreOffice | brew install --cask libreoffice or apt-get install libreoffice | PDF conversion |
| Poppler | brew install poppler or apt-get install poppler-utils | PDF to images |
| defusedxml | pip install defusedxml | Secure XML parsing |
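Before starting a workflow, a quick check shows which command-line tools are missing. A minimal sketch: `missing_tools` is a hypothetical helper, and the executable names are assumptions (for instance, LibreOffice may not expose `soffice` on the PATH on every platform).

```python
import shutil

# Executable name -> what the workflows above use it for.
TOOLS = {
    "pandoc": "text extraction",
    "soffice": "PDF conversion",
    "pdftoppm": "PDF to images",
}


def missing_tools(tools=TOOLS) -> list:
    """Return the names of executables that are not found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

An empty result from `missing_tools()` means all three executables were found. The npm and pip packages in the table need separate checks.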
References
| File | Purpose |
|---|---|
| references/docx-js.md | docx-js API patterns for creating new documents |
| references/ooxml.md | OOXML XML patterns, Document library API, tracked changes |