docx

"Creates, edits, and analyzes Word documents with tracked changes, comments, and formatting preservation. Use when working with .docx files for document creation, modification, redlining, or text extraction."

costa-marcello 2 1 Updated 5mo ago

Resources

GitHub

Install

npx skillscat add costa-marcello/skillkit/docx

Install via the SkillsCat registry.

SKILL.md

DOCX Creation, Editing, and Analysis

Read the relevant reference file completely before starting work:

Creating a new document: read references/docx-js.md
Editing an existing document: read references/ooxml.md

Workflow Decision Tree

Task	Workflow	Reference
Read/analyse content	Text extraction (pandoc) or Raw XML	None needed
Create new document	docx-js (JavaScript)	`references/docx-js.md`
Edit your own doc (simple)	OOXML editing	`references/ooxml.md`
Edit someone else's doc	Redlining workflow (recommended)	`references/ooxml.md`
Legal/business/government	Redlining workflow (required)	`references/ooxml.md`

Reading and Analysing Content

Text Extraction (Default)

Convert the document to markdown with pandoc:

pandoc --track-changes=all path-to-file.docx -o output.md
# Options: --track-changes=accept (default) / reject / all

Default to --track-changes=all to preserve revision history. Use accept only when the user wants clean text without markup.

Raw XML Access

Use raw XML when you need: comments, complex formatting, document structure, embedded media, or metadata.

python ooxml/scripts/unpack.py <office_file> <output_directory>

Key files after unpacking:

word/document.xml -- main document body
word/comments.xml -- comments referenced in document.xml
word/media/ -- embedded images and media
Tracked changes use <w:ins> (insertions) and <w:del> (deletions) tags

Creating a New Word Document

Use docx-js (JavaScript/TypeScript) for new documents.

Read references/docx-js.md completely
Write a script using Document, Paragraph, TextRun components
Export with Packer.toBuffer()
Verify the output opens in Word/LibreOffice without errors

**Task:** User says "Create a one-page memo with a title and two bullet points"

Action:

Read references/docx-js.md
Create script with Document, Paragraph, TextRun, numbering config for bullets
Run: node memo.js
Verify: soffice --headless --convert-to pdf memo.docx && pdftoppm -jpeg -r 150 memo.pdf preview

Editing an Existing Word Document

Use the Document library (Python) from scripts/document.py. It handles infrastructure setup automatically (people.xml, RSIDs, settings.xml, comments, relationships, content types).

Standard Editing Workflow

Read references/ooxml.md completely (focus on "Document Library" section)
Unpack: python ooxml/scripts/unpack.py <file.docx> <output_dir>
Edit using Document library methods
Pack: python ooxml/scripts/pack.py <output_dir> <result.docx>
Verify: convert to markdown and check output

**Task:** User says "Change '30 days' to '60 days' in this contract"

Action:

from scripts.document import Document
doc = Document('unpacked', track_revisions=True)
node = doc["word/document.xml"].get_node(tag="w:r", contains="30 days")
rpr = tags[0].toxml() if (tags := node.getElementsByTagName("w:rPr")) else ""
replacement = (
    f'<w:r w:rsidR="ORIGINAL">{rpr}<w:t>within </w:t></w:r>'
    f'<w:del><w:r>{rpr}<w:delText>30</w:delText></w:r></w:del>'
    f'<w:ins><w:r>{rpr}<w:t>60</w:t></w:r></w:ins>'
    f'<w:r w:rsidR="ORIGINAL">{rpr}<w:t> days</w:t></w:r>'
)
doc["word/document.xml"].replace_node(node, replacement)
doc.save()

Redlining Workflow (Document Review with Tracked Changes)

Plan tracked changes in markdown before implementing in OOXML. Group related changes into batches of 3-10 for manageable debugging.

Principle: Minimal, Precise Edits. Only mark text that actually changes. Repeating unchanged text makes edits harder to review. Break replacements into: [unchanged text] + [deletion] + [insertion] + [unchanged text]. Preserve the original run's RSID for unchanged text.

Step-by-Step

Get markdown representation:

pandoc --track-changes=all path-to-file.docx -o current.md

Identify and group changes. Organise into batches by section, type, or proximity. Use these location methods for finding text in XML:
- Section/heading numbers (e.g., "Section 3.2")
- Grep patterns with unique surrounding text
- Document structure (e.g., "first paragraph after Heading 2")
- Do NOT use markdown line numbers -- they do not map to XML structure
Read documentation and unpack:
- Read references/ooxml.md -- focus on "Document Library" and "Tracked Change Patterns"
- Unpack: python ooxml/scripts/unpack.py <file.docx> <dir>
- Note the suggested RSID from unpack script
Implement changes in batches. For each batch:
- Grep word/document.xml to verify current text and line numbers (they shift after each script)
- Write a script using get_node to find nodes, then replace_node, suggest_deletion, or insert_after
- Run the script and verify with doc.save()

Pack the document:

python ooxml/scripts/pack.py unpacked reviewed-document.docx

Final verification:

pandoc --track-changes=all reviewed-document.docx -o verification.md
grep "original phrase" verification.md   # Should NOT match
grep "replacement phrase" verification.md # Should match

**Task:** User says "Review this NDA and suggest changing the non-compete period from 2 years to 1 year, and update the jurisdiction from New York to Delaware"

Batch plan:

Batch 1 (Term changes): "2 years" to "1 year" in Section 5
Batch 2 (Jurisdiction): "New York" to "Delaware" in Section 8

Per batch: grep for text, write script, run, verify. After all batches, pack and do final verification.

Method Selection Guide

Scenario	Method
Change part of regular text	`replace_node()` with `<w:del>`/`<w:ins>`
Delete entire run or paragraph	`suggest_deletion()`
Reject another author's insertion	`revert_insertion()` (NOT `suggest_deletion()`)
Restore another author's deletion	`revert_deletion()`
Partially modify another author's change	`replace_node()` with nested `<w:ins>`/`<w:del>`

Converting Documents to Images

Two-step process for visual analysis:

# Step 1: DOCX to PDF
soffice --headless --convert-to pdf document.docx

# Step 2: PDF pages to JPEG
pdftoppm -jpeg -r 150 document.pdf page
# Creates page-1.jpg, page-2.jpg, etc.

# For specific pages only:
pdftoppm -jpeg -r 150 -f 2 -l 5 document.pdf page

Use -r 150 for a good quality/size balance. Increase to 300 for print-quality output.

Code Style

Write concise code. Avoid verbose variable names, redundant operations, and unnecessary print statements.

Dependencies

Install if not available:

Dependency	Install	Purpose
pandoc	`brew install pandoc` or `apt-get install pandoc`	Text extraction
docx	`npm install -g docx`	Creating new documents
LibreOffice	`brew install --cask libreoffice` or `apt-get install libreoffice`	PDF conversion
Poppler	`brew install poppler` or `apt-get install poppler-utils`	PDF to images
defusedxml	`pip install defusedxml`	Secure XML parsing

References

File	Purpose
`references/docx-js.md`	docx-js API patterns for creating new documents
`references/ooxml.md`	OOXML XML patterns, Document library API, tracked changes

docx

Resources

Install

DOCX Creation, Editing, and Analysis

Workflow Decision Tree

Reading and Analysing Content

Text Extraction (Default)

Raw XML Access

Creating a New Word Document

Editing an Existing Word Document

Standard Editing Workflow

Redlining Workflow (Document Review with Tracked Changes)

Step-by-Step

Method Selection Guide

Converting Documents to Images

Code Style

Dependencies

References

Categories

Install

Recommended Skills