Convert a research PDF into a rich source Markdown package with Mistral OCR first, then translate and correct it into Chinese Markdown by combining the source markdown with rendered PDF page images.
Install
npx skillscat add crushdemo01/pdf-to-md-zh
Install via the SkillsCat registry.
PDF to Markdown (Chinese) Skill
Core Principle
Two-stage workflow: rich OCR first, agent translation second.
This skill is no longer centered on heuristic figure crops. The preferred flow is:
- Use Mistral OCR to generate a complete source markdown package
- Use the agent to translate that markdown into Chinese
- During translation, the agent must also correct OCR errors by checking the
rendered PDF page images
The default final deliverable is target.md. The first-stage OCR output is a
source package, not the final translated result. The default workflow should
not create a separate translation draft file.
Intermediate chunk_XX.md translation files are optional and should not be the
default output. Prefer writing directly into target.md section by section.
This skill expects MISTRAL_API_KEY to be present in the environment before
OCR starts. It does not prompt for the token interactively.
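A minimal sketch of the fail-fast check this implies. The helper name `require_mistral_api_key` is hypothetical and is not one of the skill's scripts; the only thing taken from this document is the `MISTRAL_API_KEY` variable name.

```python
import os


def require_mistral_api_key(env=None):
    """Return the API key, or raise before any OCR work begins.

    Hypothetical helper for illustration; not part of the skill itself.
    """
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        # No interactive prompt: fail with a clear message instead.
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; export it before running the OCR stage."
        )
    return key
```

Calling this at the top of the preparation step would stop the run before any output directory is created.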
Stage 1: Rich OCR Package
The OCR stage should preserve as much Mistral-returned structure as possible.
Prefer these outputs:
- mistral.md
- ocr.json
- images/
- tables_md/ or tables_html/
- asset_index.md
- source_package.md
- pages/page_XX.png
- translation_prep.json
- target.md
At this stage, do not translate. The goal is to create the most complete
possible source markdown in the original language.
OCR Rules
- Prefer Mistral OCR returned images as the image source.
- Prefer Mistral OCR returned tables as the table source.
- Do not treat heuristic figure crops as the primary path.
- Preserve OCR-returned placeholders by mapping them to local image/table files.
- Save the raw OCR response JSON so the agent can inspect missing details later.
- Render PDF pages to pages/ so the agent can visually verify OCR output.
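The placeholder-mapping and raw-JSON rules above could look roughly like the sketch below. The response shape is an assumption (a dict with `pages`, each holding `markdown` and a list of `images` with `id` and `image_base64`), and `save_ocr_package` is a hypothetical name; check both against the real OCR response and the skill's own scripts before relying on it.

```python
import base64
import json
from pathlib import Path


def save_ocr_package(ocr_response, outdir):
    """Write ocr.json, images/, and mistral.md from an OCR response dict.

    Assumed shape (verify against the real response):
    {"pages": [{"markdown": str, "images": [{"id": str, "image_base64": str}]}]}
    """
    outdir = Path(outdir)
    images_dir = outdir / "images"
    images_dir.mkdir(parents=True, exist_ok=True)

    # Keep the raw response so missing details can be inspected later.
    (outdir / "ocr.json").write_text(
        json.dumps(ocr_response, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    page_texts = []
    for page in ocr_response.get("pages", []):
        markdown = page.get("markdown", "")
        for image in page.get("images", []):
            image_id = image["id"]
            payload = image.get("image_base64") or ""
            # Strip a data-URI prefix such as "data:image/jpeg;base64," if present.
            if payload.startswith("data:") and "," in payload:
                payload = payload.split(",", 1)[1]
            if payload:
                (images_dir / image_id).write_bytes(base64.b64decode(payload))
            # Point the OCR placeholder at the local file.
            markdown = markdown.replace(f"]({image_id})", f"](images/{image_id})")
        page_texts.append(markdown)

    md_path = outdir / "mistral.md"
    md_path.write_text("\n\n".join(page_texts), encoding="utf-8")
    return md_path
```

Rewriting only the link target keeps the OCR's own alt text and insertion point, so stage 2 can still decide whether the image needs to move.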
Stage 2: Agent Translation And Correction
The second stage is not just “translate markdown”. The agent must:
- translate the source markdown into Chinese
- inspect rendered PDF page images when OCR structure is suspicious
- correct section boundaries, captions, table content, formula layout, and image placement when needed
- verify that each image is placed near the paragraph or subsection that actually discusses it
- move an image line when the OCR insertion point is clearly wrong
- keep model names, method names, and citations in original form
- preserve markdown structure and math formatting
The translation stage should use:
- source markdown from Mistral OCR
- rendered PDF page images in pages/
- OCR-returned images in images/
- OCR-returned tables in tables_md/ or tables_html/
- OCR response metadata from ocr.json when needed
Required Workflow
- Confirm the PDF path and final markdown filename.
- Run the OCR preparation step once.
- Inspect the source markdown and OCR asset index. First read source_package.md if present.
- Split mistral.md into manageable chunks in memory.
- Translate a small chunk using:
  - source markdown
  - rendered PDF page images
  - OCR images/tables
- Correct OCR mistakes while translating.
- Verify image placement against the PDF pages and repair wrong insertion points.
- Immediately write the translated chunk into target.md.
- Continue chunk by chunk until the final Chinese markdown is complete.
Do not stop at a scaffold containing [待翻译], and do not create an
extra *_draft.md file unless the user explicitly asks for one.
If temporary chunk files are used for convenience, remove them afterwards
unless the user explicitly wants them preserved.
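The split-in-memory and write-immediately steps can be sketched as follows. Both function names and the `max_chars` budget are illustrative assumptions, not the behavior of write_md_chunk.py; the translation itself happens between the two calls and is left to the agent.

```python
import re
from pathlib import Path

HEADING = re.compile(r"^#{1,6} ")


def split_markdown(text, max_chars=4000):
    """Split markdown into chunks, breaking only at heading lines.

    max_chars is a soft limit: a chunk with no heading can grow past it.
    """
    chunks, current, size = [], [], 0
    in_fence = False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        # Start a new chunk at a heading once the current one is large enough.
        if not in_fence and HEADING.match(line) and size >= max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


def append_chunk(target_path, translated_chunk):
    """Append one translated chunk to target.md straight away."""
    with Path(target_path).open("a", encoding="utf-8") as handle:
        handle.write(translated_chunk.rstrip("\n") + "\n\n")
```

Breaking only at headings keeps each chunk aligned with a section, which makes it easier to pick the matching page images for verification. Because nothing is written to disk until a chunk is translated, no chunk_XX.md files are left behind.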
Preferred Tools
- Preparation tools
  - prepare_translation_inputs.py
  - mistral_ocr_to_markdown.py
  - extract_pdf_assets.py
- Mechanical writer
  - write_md_chunk.py
  - direct incremental edits to target.md are preferred
  - temporary chunk files are optional, not required
- Optional compatibility helpers
  - translate_markdown_chunks.py
  - build_translation_draft.py for debugging only
translate_markdown_chunks.py is still allowed as a scripted compatibility path,
but the preferred quality path is agent-led translation with page-image checking.
Preferred Commands
Rich OCR preparation:
python3 skills/pdf-to-md-zh/scripts/prepare_translation_inputs.py \
--pdf <paper.pdf> \
--outdir <output_dir> \
--inline-images \
--table-format markdown
This should also produce source_package.md, which is the preferred brief entry
point for stage 2. The output directory should be normalized and stable:
- mistral.md
- ocr.json
- images/
- pages/
- target.md
Direct OCR only:
python3 skills/pdf-to-md-zh/scripts/mistral_ocr_to_markdown.py \
<paper.pdf> \
-o <output_dir>/mistral.md \
--inline-images \
--table-format markdown
Render pages and build index from OCR outputs:
python3 skills/pdf-to-md-zh/scripts/extract_pdf_assets.py \
--pdf <paper.pdf> \
--outdir <output_dir> \
--ocr-md <output_dir>/mistral.md \
--ocr-json <output_dir>/ocr.json
Translation Rules
- Translate narrative text to Chinese.
- Keep model names, method names, equations, and citations in original form.
- Preserve markdown image syntax and local relative links such as images/....
- Check every image against pages/ and keep it near the paragraph or subsection that discusses it.
- If OCR inserted an image in the wrong place, move that image line in target.md instead of preserving the wrong location.
- Preserve or repair table structure using OCR-returned tables first.
- Use rendered PDF page images as the ground truth when OCR markdown is wrong.
- If OCR is clearly wrong, fix it in the translated output instead of copying the mistake forward.
- Use [待人工校对] only when content remains unreadable after checking page images.
- Do not leave [待翻译] in the final delivered target.md.
- Do not create an intermediate draft markdown file unless the user explicitly asks for one.
- Do not keep chunk_XX.md files unless they are actively useful or the user asked to keep them.
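The two marker rules above are easy to check mechanically before delivery. This is a small sketch with a hypothetical helper name, `find_markers`; the marker strings are the ones used in this document.

```python
UNTRANSLATED = "[待翻译]"      # must not appear in the final file
NEEDS_REVIEW = "[待人工校对]"  # allowed, but worth reporting


def find_markers(markdown_text):
    """Return (untranslated, needs_review) lists of 1-based line numbers."""
    untranslated, needs_review = [], []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if UNTRANSLATED in line:
            untranslated.append(number)
        if NEEDS_REVIEW in line:
            needs_review.append(number)
    return untranslated, needs_review
```

A non-empty first list means target.md is still a scaffold. The second list is informational and points at passages to recheck against pages/.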
Quality Checklist
- Does the OCR package contain source markdown, OCR JSON, OCR images, OCR tables, and rendered page images?
- Does the OCR package contain source_package.md so stage 2 has a stable starting point?
- Does the translated output correct obvious OCR mistakes instead of inheriting them blindly?
- Are images taken from images/ rather than heuristic crops?
- Are images placed near the correct discussion instead of blindly following OCR insertion points?
- Are tables taken from OCR-returned table files where possible?
- Are figures and tables placed near the relevant discussion?
- Does the final target.md read like a publishable translated article?
- Were chunks, if any, written incrementally into target.md rather than being left as the main artifact?
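The file-level items in this checklist can be partly automated; placement and readability still need the agent's judgment. The sketch below assumes target.md lives in the output directory alongside the OCR package, and `check_package` is a hypothetical name rather than one of the skill's scripts.

```python
import re
from pathlib import Path

# File names taken from the output lists in this document.
REQUIRED = ["mistral.md", "ocr.json", "source_package.md", "images", "pages"]
IMAGE_LINK = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def check_package(outdir):
    """Return a list of problems; an empty list means these checks passed."""
    outdir = Path(outdir)
    problems = [
        f"missing: {name}" for name in REQUIRED if not (outdir / name).exists()
    ]

    target = outdir / "target.md"
    if not target.exists():
        problems.append("missing: target.md")
        return problems

    for link in IMAGE_LINK.findall(target.read_text(encoding="utf-8")):
        if not link.startswith("images/"):
            # Anything outside images/ may be a heuristic crop or a remote link.
            problems.append(f"image not from images/: {link}")
        elif not (outdir / link).exists():
            problems.append(f"broken image link: {link}")
    return problems
```

An empty result only covers package completeness and image sources; it says nothing about whether each image sits next to the text that discusses it.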