Install
npx skillscat add thousandcents/convert-file2md
Install via the SkillsCat registry.
SKILL.md
convert-file2md — File to Markdown Converter
Overview
Batch convert PDF, DOCX, PPTX, and XLSX files to Markdown format using the mineru CLI engine.
Requirements
- Python 3.7+
- mineru CLI (`pip install mineru`)
- `curl` (for image download)
The script automatically locates mineru in your PATH or common installation directories.
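The lookup described above could be sketched roughly as follows. This is a minimal illustration rather than the script's actual code: the function name `find_mineru` and the fallback directories are assumptions, since the source does not list which directories are searched.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical fallback locations; the real script's list is not documented.
DEFAULT_CANDIDATE_DIRS = (
    Path.home() / ".local" / "bin",
    Path("/usr/local/bin"),
    Path("/opt/homebrew/bin"),
)


def find_mineru(name: str = "mineru",
                extra_dirs: Iterable[Path] = DEFAULT_CANDIDATE_DIRS) -> Optional[str]:
    """Return the path to the executable, or None if it cannot be found."""
    # First try the normal PATH lookup.
    found = shutil.which(name)
    if found:
        return found
    # Then probe a few common installation directories.
    for directory in extra_dirs:
        candidate = Path(directory) / name
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return str(candidate)
    return None
```

A caller would typically stop with an install hint (`pip install mineru`) when the function returns `None`.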
Quick Start
# Single file conversion
python3 convert2md.py -i document.pdf -o ./output/
# Batch conversion (all supported files in a directory)
python3 convert2md.py -i ./input/ -o ./output/
# Chinese document
python3 convert2md.py -i document.pdf -o ./output/ -l ch
# Scanned document (OCR mode)
python3 convert2md.py -i scanned.pdf -o ./output/ -m ocr
CLI Parameters
| Parameter | Short | Description | Default |
|---|---|---|---|
| `--input` | `-i` | Input file or directory | ./input/ |
| `--output` | `-o` | Output directory | ./output/ |
| `--backend` | `-b` | mineru backend engine | pipeline |
| `--method` | `-m` | Parse method | auto |
| `--lang` | `-l` | Document language | en |
| `--keep-intermediates` | (none) | Keep intermediate files | false |
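For readers who want to see how these parameters map onto an argument parser, here is a minimal `argparse` sketch that mirrors the table. It is an assumption about how `convert2md.py` might be wired, not its actual source. In particular, the function name `build_parser` and the decision to restrict `--backend` and `--method` to the documented values are illustrative choices.

```python
import argparse

# Values taken from the Backend Options and Parse Methods tables in this document.
BACKENDS = [
    "pipeline",
    "vlm-http-client",
    "hybrid-http-client",
    "vlm-auto-engine",
    "hybrid-auto-engine",
]
METHODS = ["auto", "txt", "ocr"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults follow the parameter table."""
    parser = argparse.ArgumentParser(
        prog="convert2md.py",
        description="Batch convert PDF, DOCX, PPTX, and XLSX files to Markdown.",
    )
    parser.add_argument("-i", "--input", default="./input/",
                        help="Input file or directory")
    parser.add_argument("-o", "--output", default="./output/",
                        help="Output directory")
    parser.add_argument("-b", "--backend", default="pipeline", choices=BACKENDS,
                        help="mineru backend engine")
    parser.add_argument("-m", "--method", default="auto", choices=METHODS,
                        help="Parse method")
    parser.add_argument("-l", "--lang", default="en",
                        help="Document language")
    # No short form, matching the table; the flag defaults to False.
    parser.add_argument("--keep-intermediates", action="store_true",
                        help="Keep intermediate files")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```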
Backend Options
| Value | Description |
|---|---|
| `pipeline` | (Default) Local Pipeline engine |
| `vlm-http-client` | VLM HTTP client mode |
| `hybrid-http-client` | Hybrid HTTP client mode |
| `vlm-auto-engine` | VLM auto engine |
| `hybrid-auto-engine` | Hybrid auto engine |
Parse Methods
| Value | Description |
|---|---|
| `auto` | (Default) Auto-detect |
| `txt` | Plain text mode (fast) |
| `ocr` | OCR mode for scanned documents |
Language Options
| Value | Use Case |
|---|---|
| `en` | (Default) English |
| `ch` | Chinese |
| `ch_server` | Chinese (server model) |
| `ch_lite` | Chinese (lightweight model) |
| `japan` | Japanese |
| `korean` | Korean |
| `chinese_cht` | Traditional Chinese |
| `latin` | Latin scripts |
| `arabic` | Arabic |
| `cyrillic` | Cyrillic |
| `devanagari` | Devanagari |
Output Structure
output_dir/
└── <filename>/
    ├── <filename>.md
    └── images/
        ├── image1.png
        └── image2.jpg
Supported Formats
Only .pdf, .docx, .pptx, .xlsx are supported. Legacy .doc files require conversion first.
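As an illustration of how batch mode might select files, the sketch below filters an input path by the four supported extensions. The helper name `collect_inputs`, the case-insensitive match, and the non-recursive directory scan are all assumptions; the source does not say how the script walks the input directory.

```python
from pathlib import Path
from typing import List

# The four extensions listed as supported in this document.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}


def is_supported(path: Path) -> bool:
    """True if the file extension is one of the supported formats."""
    return path.suffix.lower() in SUPPORTED_EXTENSIONS


def collect_inputs(input_path: Path) -> List[Path]:
    """Return supported files for a single file or a directory (top level only)."""
    if input_path.is_file():
        return [input_path] if is_supported(input_path) else []
    if input_path.is_dir():
        return sorted(p for p in input_path.iterdir()
                      if p.is_file() and is_supported(p))
    # Path does not exist: nothing to convert.
    return []
```

Under this sketch a legacy `.doc` file is simply skipped, which matches the note that such files need to be converted first.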
Troubleshooting
- mineru not found: Install via `pip install mineru`, or ensure it is in PATH (check with `which mineru`).
- Poor Chinese recognition: Use `-l ch` instead of the default `-l en`. For Traditional Chinese, use `-l chinese_cht`.
- Conversion timeout: Single file timeout is 600 seconds (10 minutes). For large files, try `-m txt` to skip OCR.
- Unsupported formats: Only `.pdf`, `.docx`, `.pptx`, `.xlsx` are supported.
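
The 600-second per-file timeout mentioned under Troubleshooting could be enforced along these lines. This is a hedged sketch using the standard library's `subprocess.run`, not the script's real implementation. The helper name `run_with_timeout` and its return shape are assumptions, and the command list passed in is left to the caller because the exact mineru invocation is not shown in this document.

```python
import subprocess
from typing import List, Tuple

TIMEOUT_SECONDS = 600  # per-file limit stated in this document


def run_with_timeout(cmd: List[str],
                     timeout: float = TIMEOUT_SECONDS) -> Tuple[bool, str]:
    """Run a command and return (success, output or error message)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return False, f"timed out after {timeout} seconds"
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"
    if result.returncode != 0:
        return False, result.stderr.strip() or f"exit code {result.returncode}"
    return True, result.stdout
```

Returning a status instead of raising lets a batch run record the failure for one file and move on to the next.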