Ingest any supported document (PDF, EPUB, DOCX, TXT, HTML, Markdown) and convert it into structured, ML-ready JSON. Use when the user asks to: extract text from documents, convert documents to structured data, create ML datasets from documents, parse PDF/EPUB/DOCX content, chunk documents for RAG, or produce tokenizable JSON from unstructured documents. Triggers on file uploads of supported formats or on requests mentioning document conversion, text extraction, or dataset creation from documents.
Install
npx skillscat add usagi-epta/doc2ml-json
Install via the SkillsCat registry.
doc2ml-json: Document-to-ML-JSON Converter
Overview
This skill provides a complete pipeline for converting unstructured documents into
structured JSON optimized for machine learning workflows. It handles format detection,
content extraction, structural understanding, normalization, and JSON generation.
Supported Formats
- PDF (text-based and scanned/OCR)
- EPUB (ebooks)
- DOCX (Microsoft Word)
- TXT / Markdown / MD
- HTML (single file or archive)
Workflow Decision Tree
Start here. Based on the user's request, follow the appropriate path:
User uploaded a document file? → Follow "Full Pipeline" below
User provided a directory of documents? → Follow "Batch Processing" below
User wants to extract specific sections? → Follow "Selective Extraction" below
User wants document converted to training data? → Follow "ML Dataset Generation" below
Full Pipeline
Phase 1: Format Detection
Run scripts/detect_format.py on the input file. If unavailable, implement the
three-layer cascade manually:
- File extension → map to MIME type (.pdf → application/pdf, .epub → application/epub+zip, .docx → application/vnd.openxmlformats-officedocument.wordprocessingml.document, .md → text/markdown, .html → text/html, .txt → text/plain)
- Magic bytes → inspect first 16 bytes (%PDF- = PDF, PK\x03\x04 = ZIP-based, <?xml = XML, <html = HTML)
- Deep inspection → for ZIP-based files, inspect internal structure (DOCX has word/document.xml, EPUB has mimetype containing epub)
Confidence scoring: all three agree = 1.0; signature + extension agree = 0.85; only signature = 0.70; only extension = 0.50. If confidence < 0.5, flag for manual review.
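The cascade and scoring above can be sketched with the standard library alone. This is a minimal sketch, not the contents of scripts/detect_format.py: the function names and the short format labels (pdf, zip, docx, epub) are assumptions made for illustration, and the caller is assumed to map all three layers onto comparable labels before scoring.

```python
import io
import zipfile
from typing import Optional


def sniff_signature(head: bytes) -> Optional[str]:
    """Layer 2: map the leading bytes to a coarse family (hypothetical labels)."""
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    text = head.lstrip().lower()
    if text.startswith(b"<?xml"):
        return "xml"
    if text.startswith(b"<html"):
        return "html"
    return None


def inspect_zip(data: bytes) -> Optional[str]:
    """Layer 3: tell DOCX from EPUB by looking inside the ZIP container."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
            if "word/document.xml" in names:
                return "docx"
            if "mimetype" in names and b"epub" in zf.read("mimetype"):
                return "epub"
    except zipfile.BadZipFile:
        return None
    return None


def detection_confidence(ext_fmt, sig_fmt, deep_fmt) -> float:
    """Apply the scoring rules; 0.0 (nothing matched) is an assumed extra case."""
    if sig_fmt is not None and sig_fmt == ext_fmt == deep_fmt:
        return 1.0   # all three layers agree
    if sig_fmt is not None and sig_fmt == ext_fmt:
        return 0.85  # signature and extension agree
    if sig_fmt is not None:
        return 0.70  # only the signature is usable
    if ext_fmt is not None:
        return 0.50  # only the extension is usable
    return 0.0
```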
For PDFs specifically, check if scanned: open with fitz (PyMuPDF), if page.get_text() is empty but page.get_images() has entries on >50% of pages, set is_scanned: true.
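The scanned-PDF rule reduces to a ratio over pages. The sketch below keeps PyMuPDF out of the picture so it runs anywhere: it assumes the caller has already collected, for each page, the string from page.get_text() and the number of entries from page.get_images(). The function name is hypothetical.

```python
from typing import Sequence, Tuple


def looks_scanned(pages: Sequence[Tuple[str, int]]) -> bool:
    """pages holds one (extracted_text, image_count) pair per page."""
    if not pages:
        return False
    # A page counts as image-only when it has no text but at least one image.
    image_only = sum(1 for text, images in pages if not text.strip() and images > 0)
    return image_only / len(pages) > 0.5
```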
Phase 2: Content Extraction
Route by detected format. Use these libraries:
| Format | Primary Library | Fallback | Key Extraction Target |
|---|---|---|---|
| PDF | pdfplumber + fitz (PyMuPDF) | pytesseract + pdf2image for scanned PDFs | Text blocks with bbox, tables, metadata |
| EPUB | ebooklib + BeautifulSoup | Manual ZIP + XML | Spine reading order, chapter HTML, TOC |
| DOCX | python-docx + lxml | zipfile + manual XML | Paragraphs, tables, lists, styles, footnotes |
| TXT/MD | Native Python + markdown | yaml for frontmatter | Headings, paragraphs, code blocks, lists, tables |
| HTML | BeautifulSoup | lxml | Semantic tags, heading hierarchy, links |
PDF extraction details:
- Use pdfplumber.open(filepath) → iterate pages → page.extract_words() for layout, page.extract_tables() for tables
- Use fitz.open(filepath) → page.get_text("dict") for font/size/style metadata, doc.metadata for document properties
- Detect headings by font size: >1.3x average and <200 chars = heading; level by size tier relative to max
- Remove headers/footers: collect text in top/bottom 10% margin across pages, drop lines appearing on >50% of pages
- OCR fallback: If pdfplumber and PyMuPDF both return no text (scanned/image PDF), automatically fall back to Tesseract OCR via pdf2image. OCR is optional: if pytesseract or pdf2image is not installed, the fallback is skipped gracefully. Set a max_pages limit for large scanned PDFs to avoid excessive processing time. OCR blocks receive confidence 0.75–0.80 vs 0.92 for text-based extraction.
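The heading and header/footer heuristics above can be illustrated on already-extracted lines. This is a sketch under stated assumptions: each page is a list of (text, y) tuples with y measured from the top of the page, and both function names are hypothetical rather than part of the skill's scripts.

```python
from collections import Counter


def is_heading(text, font_size, avg_size):
    """Font-size rule: larger than 1.3x the average and shorter than 200 chars."""
    return font_size > 1.3 * avg_size and len(text) < 200


def drop_repeated_margin_lines(pages, page_height, margin=0.10, share=0.50):
    """Drop margin lines whose text appears on more than `share` of the pages."""
    def in_margin(y):
        return y <= page_height * margin or y >= page_height * (1 - margin)

    counts = Counter()
    for lines in pages:
        # Count each distinct margin line once per page.
        counts.update({text.strip() for text, y in lines if in_margin(y)})
    repeated = {text for text, n in counts.items() if n > share * len(pages)}
    return [
        [(text, y) for text, y in lines
         if not (in_margin(y) and text.strip() in repeated)]
        for lines in pages
    ]
```

Page numbers differ from page to page, so exact-match counting leaves them in place; matching them would need a pattern-based comparison, which this sketch does not attempt.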
EPUB extraction details:
- Open as ZIP → read META-INF/container.xml → find OPF path
- Parse OPF: extract dc:title, dc:creator, dc:language, manifest, spine
- Navigate spine sequentially; parse each chapter with BeautifulSoup; remove script, style, nav
- Extract TOC from NCX (EPUB2) or NavDoc (EPUB3)
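The first two steps correspond to the "Manual ZIP + XML" fallback and can be sketched with zipfile and ElementTree. The function name and the returned dictionary shape are assumptions for illustration; chapter parsing and TOC extraction are left out.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_package(data: bytes) -> dict:
    """Locate the OPF via container.xml, then read metadata and spine order."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]
        opf = ET.fromstring(zf.read(opf_path))

    # Manifest maps item ids to hrefs relative to the OPF file's directory.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in opf.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    spine = [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in opf.findall("opf:spine/opf:itemref", NS)
    ]
    return {
        "title": opf.findtext("opf:metadata/dc:title", default=None, namespaces=NS),
        "creator": opf.findtext("opf:metadata/dc:creator", default=None, namespaces=NS),
        "language": opf.findtext("opf:metadata/dc:language", default=None, namespaces=NS),
        "spine": spine,
    }
```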
DOCX extraction details:
- Open with python-docx → iterate doc.paragraphs, doc.tables
- Detect headings from paragraph style names (Heading1 → level 1)
- Detect lists from w:numPr in paragraph XML; resolve list types via word/numbering.xml
- Extract footnotes from word/footnotes.xml; hyperlinks from word/_rels/document.xml.rels
- Handle merged cells: check w:gridSpan in table cell XML
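For the "zipfile + manual XML" fallback, the heading, list, and merged-cell checks can be sketched directly against the WordprocessingML markup. The sketch assumes the bytes of word/document.xml have already been read from the archive; the function names and the output dictionary shape are hypothetical, and list types are not resolved against word/numbering.xml here.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"  # qualified name of the w:val attribute


def classify_paragraphs(document_xml):
    """Label each w:p as heading, list_item, or paragraph."""
    root = ET.fromstring(document_xml)
    out = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        style = p.find("w:pPr/w:pStyle", NS)
        style_name = style.get(VAL, "") if style is not None else ""
        heading = re.fullmatch(r"Heading\s?(\d)", style_name)
        num_pr = p.find("w:pPr/w:numPr", NS)
        if heading:
            kind, level = "heading", int(heading.group(1))
        elif num_pr is not None:
            ilvl = num_pr.find("w:ilvl", NS)
            kind, level = "list_item", int(ilvl.get(VAL)) if ilvl is not None else 0
        else:
            kind, level = "paragraph", None
        out.append({"kind": kind, "level": level, "text": text})
    return out


def cell_colspan(tc):
    """Return the w:gridSpan value of a w:tc element, defaulting to 1."""
    span = tc.find("w:tcPr/w:gridSpan", NS)
    return int(span.get(VAL)) if span is not None else 1
```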
Markdown extraction details:
- Detect YAML frontmatter between --- delimiters; parse with yaml.safe_load
- Parse line-by-line: ATX headings (#), fenced code blocks (```), GFM tables (|), lists (-, *, 1.), blockquotes (>)
- Setext headings: underlined with = (h1) or - (h2)
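A reduced version of the line-by-line pass, covering only frontmatter splitting and heading detection, might look like the sketch below. The function names are hypothetical; the returned frontmatter string is what would be handed to yaml.safe_load, which is not called here so the example stays dependency-free. Tables, lists, and blockquotes are omitted.

```python
import re

FENCE = "`" * 3  # fenced code block marker
ATX = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def split_frontmatter(text):
    """Return (frontmatter_yaml, body); frontmatter is '' when absent."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text


def find_headings(body):
    """Collect (level, title) for ATX and Setext headings outside code fences."""
    headings, in_fence, prev = [], False, ""
    for line in body.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence, prev = not in_fence, ""
            continue
        if in_fence:
            continue
        atx = ATX.match(line)
        if atx:
            headings.append((len(atx.group(1)), atx.group(2)))
            prev = ""
            continue
        if prev.strip() and re.fullmatch(r"=+\s*", line):
            headings.append((1, prev.strip()))  # Setext h1
            prev = ""
            continue
        if prev.strip() and re.fullmatch(r"-+\s*", line):
            headings.append((2, prev.strip()))  # Setext h2
            prev = ""
            continue
        prev = line
    return headings
```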
Phase 3: Structure Understanding
Heading hierarchy reconstruction:
- Collect all heading blocks; if none exist, infer from short all-caps paragraphs
- Normalize skipped levels: [1, 3, 3, 5] → [1, 2, 2, 3] by remapping unique sorted levels
- If first heading level > 1, insert inferred h1 from first substantial text or filename
- Promote paragraph-as-heading: short (<100 chars) all-caps or bold-styled paragraphs
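The level remapping in the second rule is small enough to show in full. This sketch covers only that rule; the function name is an assumption, and inferring or promoting headings is not handled.

```python
def normalize_levels(levels):
    """Remap heading levels onto consecutive ranks of the unique sorted values."""
    rank = {level: i + 1 for i, level in enumerate(sorted(set(levels)))}
    return [rank[level] for level in levels]


print(normalize_levels([1, 3, 3, 5]))  # [1, 2, 2, 3]
```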
Section boundary detection:
- Group blocks into sections based on heading hierarchy
- Maintain a section stack: on heading at level L, pop stack to find parent with level < L
- Content before first heading → anonymous section at level 0
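The section stack can be sketched as follows. The block and node dictionaries use hypothetical field names (type, text, level, title, blocks, children) chosen for the example, not the canonical schema.

```python
def build_sections(blocks):
    """Group flat blocks into a section tree driven by heading levels."""
    # The root doubles as the anonymous level-0 section for leading content.
    root = {"title": None, "level": 0, "blocks": [], "children": []}
    stack = [root]
    for block in blocks:
        if block["type"] == "heading":
            # Pop until the top of the stack is a parent with a lower level.
            while stack[-1]["level"] >= block["level"]:
                stack.pop()
            node = {"title": block["text"], "level": block["level"],
                    "blocks": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["blocks"].append(block["text"])
    return root
```

Because heading levels start at 1, the level-0 root is never popped, so the loop always finds a parent.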
List nesting reconstruction:
- From flat items with level fields, build nested tree using stack depth
- If level jumps by >1, create intermediate nesting levels
Table normalization:
- Build rectangular grid handling colspan and rowspan
- Track span_map for cells propagated from above rows
- Pad rows to uniform max_cols
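Both reconstructions can be sketched in a few lines each. The input shapes are assumptions made for the example: list items arrive as (level, text) tuples with level starting at 0, and table cells as (text, colspan, rowspan) tuples. Repeating a spanned cell's text in every position it covers is one possible convention, not necessarily the one the skill's scripts use.

```python
def nest_items(items):
    """Build a nested list tree from flat (level, text) items."""
    root = {"text": None, "children": []}
    stack = [root]  # stack[d + 1] is the open node at depth d
    for level, text in items:
        while len(stack) > level + 1:
            stack.pop()
        while len(stack) < level + 1:
            # Level jumped by more than one: insert an empty intermediate node.
            filler = {"text": None, "children": []}
            stack[-1]["children"].append(filler)
            stack.append(filler)
        node = {"text": text, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


def normalize_table(rows):
    """Expand colspan/rowspan into a rectangular grid of cell texts."""
    grid, span_map = [], {}  # span_map: (row, col) -> text carried down from above
    for r, cells in enumerate(rows):
        out, c, pending = [], 0, list(cells)
        while pending or (r, c) in span_map:
            if (r, c) in span_map:
                out.append(span_map.pop((r, c)))
                c += 1
                continue
            text, colspan, rowspan = pending.pop(0)
            for dc in range(colspan):
                out.append(text)
                for dr in range(1, rowspan):
                    span_map[(r + dr, c + dc)] = text
            c += colspan
        grid.append(out)
    max_cols = max((len(row) for row in grid), default=0)
    return [row + [""] * (max_cols - len(row)) for row in grid]
```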
Phase 4: Normalization
Apply in order:
- Unicode NFKC → unicodedata.normalize("NFKC", text) for compatibility characters
- Whitespace → collapse multiple spaces/tabs to single space; normalize \r\n → \n; remove zero-width chars (\u200b, \ufeff); collapse >2 newlines to 2
- Encoding → re-encode as UTF-8 strict, fall back to replacement on error
- Dates → parse common formats (%Y-%m-%d, %d/%m/%Y, %B %d, %Y) to ISO 8601; keep original on failure
- Links → remove tracking params (utm_*, fbclid, gclid); prefer https; resolve relative URLs if base_url known
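The text, date, and link steps map directly onto standard-library calls. The sketch below is one possible arrangement, with hypothetical function names; reading "prefer https" as "rewrite http to https" is an assumption, and the encoding step is omitted.

```python
import re
import unicodedata
from datetime import datetime
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")
TRACKING = {"fbclid", "gclid"}


def normalize_text(text):
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[\u200b\ufeff]", "", text)  # zero-width space, BOM
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text)


def normalize_date(raw):
    """Return an ISO 8601 date, or the original string if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return raw


def clean_url(url, base_url=None):
    if base_url:
        url = urljoin(base_url, url)  # resolve relative links
    parts = urlsplit(url)
    query = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not (k.startswith("utm_") or k in TRACKING)
    ]
    # Assumption: "prefer https" means upgrading plain http links.
    scheme = "https" if parts.scheme == "http" else parts.scheme
    return urlunsplit((scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))
```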
Phase 5: JSON Generation
Build the canonical Doc2MLDocument structure (see schema quick reference below).
Key construction rules:
- chunk_id format: blk-{zero-padded-index:03d}
- Every block gets: content, text_plain, char_count, token_count_est (chars/4), embedding_ready, context_window, provenance, language, semantics, relations
- context_window: link prev_chunk_id, next_chunk_id, parent_heading_chunk_id, parent_structure_node_id, and 100-char preview of surrounding text
- provenance: page_number, source_location (e.g., h1, p tag, line 45-67), extraction_method, confidence
- Build structure tree with node_id, node_type (document/part/chapter/section/...), level, chunk_ids, children
- Build ml_index: chunk_id_map (index + structure_path per block), heading_map, embedding_candidates (all embedding_ready=True blocks), chunk_boundaries (sections with token_count_est)
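The chunk_id format and the neighbour links in context_window can be sketched as below. Only a few block fields are filled in, and the prev_preview and next_preview keys are hypothetical names for the 100-char previews, which the rules above do not name.

```python
def build_blocks(texts, preview_chars=100):
    """Create minimal blocks with sequential chunk_ids and neighbour links."""
    blocks = [
        {
            "chunk_id": f"blk-{i:03d}",
            "text_plain": text,
            "char_count": len(text),
            "context_window": {
                "prev_chunk_id": None,
                "next_chunk_id": None,
                "prev_preview": "",  # hypothetical key
                "next_preview": "",  # hypothetical key
            },
        }
        for i, text in enumerate(texts)
    ]
    for prev, nxt in zip(blocks, blocks[1:]):
        prev["context_window"]["next_chunk_id"] = nxt["chunk_id"]
        prev["context_window"]["next_preview"] = nxt["text_plain"][:preview_chars]
        nxt["context_window"]["prev_chunk_id"] = prev["chunk_id"]
        nxt["context_window"]["prev_preview"] = prev["text_plain"][-preview_chars:]
    return blocks
```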
Note on metadata.source: When a Markdown file contains YAML frontmatter with a source field (e.g., source: "Journal of Machine Learning"), that declared source takes precedence over filesystem metadata. The output JSON stores both declared_source (frontmatter value) and source_type: "frontmatter" to distinguish it from source_type: "filesystem". If no frontmatter source is present, declared_source falls back to the file path.
Note on token counts: token_count_est values are estimates computed with a whitespace-based heuristic (words * 0.75), not formal tokenizer counts. They may vary by ±30% from actual counts produced by subword tokenizers such as BPE or SentencePiece. Use them for relative sizing and chunking decisions, not for exact budget calculations.
Phase 6: Validation & Output
Run scripts/validate_output.py on the generated JSON. Manual checks:
- Schema compliance → validate against JSON Schema in references/schema.md section 9 (use jsonschema library if available, else manual required-field check)
- Roundtrip sanity → text coverage ratio (JSON text chars / original text chars) should be >0.85; heading count should match; metadata fields preserved
- ML readiness → all chunks have non-empty text; token counts are reasonable; no empty embedding_ready blocks; chunk_boundaries cover all embedding_ready blocks
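Two of the manual checks, text coverage and empty embedding_ready blocks, can be sketched as follows. This is not scripts/validate_output.py; the function name and the shape of the returned problem list are assumptions.

```python
def roundtrip_checks(doc, original_text, min_coverage=0.85):
    """Return (coverage, problems) for a generated document dictionary."""
    blocks = doc.get("blocks", [])
    json_chars = sum(len(b.get("text_plain", "")) for b in blocks)
    coverage = json_chars / len(original_text) if original_text else 0.0
    problems = []
    if coverage <= min_coverage:
        problems.append(f"text coverage {coverage:.2f} is not above {min_coverage}")
    for b in blocks:
        if b.get("embedding_ready") and not b.get("text_plain", "").strip():
            problems.append(f"{b.get('chunk_id')} is embedding_ready but empty")
    return coverage, problems
```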
Output: write {document_id}.doc2ml.json with indentation=2.
Batch Processing
For directories or file lists:
- Collect all files with supported extensions (recurse if requested)
- Run Phase 1-6 on each file independently
- Aggregate outputs:
  - Per-document JSONs: write individual .doc2ml.json files
  - Dataset manifest: generate manifest.json with document_id, filename, block_count, token_count_est per file
  - Combined chunks: optionally merge all embedding_ready blocks into a single flat JSONL for training
- Parallelize with multiprocessing.Pool if >10 files; set max_workers = min(8, cpu_count())
- Log failures per file with error code; do not fail the entire batch
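A skeleton of the batch loop, with per-file failure capture and the parallelism threshold, could look like this. convert_document is a hypothetical stand-in for Phases 1-6 (it only reads the file size), and the result dictionaries are illustrative rather than the manifest format.

```python
import multiprocessing
from pathlib import Path

SUPPORTED = {".pdf", ".epub", ".docx", ".txt", ".md", ".html"}


def collect_files(root, recurse=False):
    pattern = "**/*" if recurse else "*"
    return sorted(
        p for p in Path(root).glob(pattern)
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def convert_document(path):
    """Hypothetical placeholder for the full pipeline."""
    return {"document_id": Path(path).stem,
            "file_size_bytes": Path(path).stat().st_size}


def process_one(path):
    """Run one file and record the failure instead of raising."""
    try:
        return {"filename": str(path), "ok": True, "result": convert_document(path)}
    except Exception as exc:
        return {"filename": str(path), "ok": False,
                "error": f"{type(exc).__name__}: {exc}"}


def run_batch(paths):
    paths = list(paths)
    if len(paths) > 10:
        workers = min(8, multiprocessing.cpu_count())
        with multiprocessing.Pool(processes=workers) as pool:
            return pool.map(process_one, paths)
    return [process_one(p) for p in paths]
```

On platforms that start worker processes with spawn (Windows, macOS by default), the call to run_batch needs to sit under an `if __name__ == "__main__":` guard.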
Selective Extraction
When user requests specific content only:
- Parse the full document through Phase 1-4 (do not skip — structure needed for context)
- Filter by criteria:
  - Section/chapter: collect chunk_ids from target StructureNode, emit only those blocks
  - Block type: filter blocks array by type (e.g., only table, only code_block)
  - Page range: filter by provenance.page_number within range
  - Keyword search: scan text_plain for regex matches, emit matched blocks + their context_window previews
- Rebuild a minimal Doc2MLDocument from filtered subset:
  - Update structure to only include nodes with matching chunk_ids
  - Update ml_index.chunk_id_map and ml_index.heading_map
  - Recalculate metadata.statistics for the subset
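The block-level filters (type, page range, keyword) compose naturally into one pass. The sketch below assumes blocks shaped like the schema quick reference; the function name and its parameters are hypothetical, and rebuilding structure and ml_index for the subset is not shown.

```python
import re


def select_blocks(blocks, block_type=None, pages=None, pattern=None):
    """Keep blocks that satisfy every filter that was provided."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    for block in blocks:
        if block_type and block.get("type") != block_type:
            continue
        page = block.get("provenance", {}).get("page_number")
        if pages and (page is None or not pages[0] <= page <= pages[1]):
            continue
        if regex and not regex.search(block.get("text_plain", "")):
            continue
        selected.append(block)
    return selected
```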
ML Dataset Generation
Derive task-specific training data from the canonical JSON:
| Task | Strategy | Source Fields |
|---|---|---|
| Instruction-following | Pair each heading with its section paragraphs as instruction → response | ml_index.heading_map + section blocks |
| Summarization | Abstract block as target, full body as input; or last paragraph per section as target | abstract blocks, chunk_boundaries |
| RAG retrieval | Sliding-window chunks over embedding_candidates with overlap | ml_index.embedding_candidates, context_window |
| Embeddings | One payload per embedding_ready block with metadata for vector DB filtering | text_plain, structure_path, section_role |
| Table-to-text | Table records + caption as input, nearest paragraph as target | table blocks, spatial proximity in blocks array |
| Code LLM | code_block content as output, preceding paragraph as instruction | code_block blocks, context_window.prev_chunk_id |
| QA | Citation cross-references as pseudo-QA pairs | cross_references with ref_type=citation |
| Classification | Document-level metadata + block statistics as features | metadata.classification, metadata.statistics |
Implementation: iterate over blocks or ml_index fields, transform to target format (Alpaca, ChatML, JSONL), write to task-specific output files.
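As one instance of the table above, the instruction-following row can be sketched by walking the blocks array directly instead of going through ml_index.heading_map. The function names are hypothetical; the output uses the Alpaca-style instruction/input/output record, serialized as JSONL.

```python
import json


def instruction_pairs(blocks):
    """Pair each heading with the paragraphs that follow it."""
    pairs, heading, body = [], None, []
    sentinel = {"type": "heading", "text_plain": None}  # flushes the last section
    for block in list(blocks) + [sentinel]:
        if block["type"] == "heading":
            if heading and body:
                pairs.append({"instruction": heading, "input": "",
                              "output": "\n\n".join(body)})
            heading, body = block["text_plain"], []
        elif block["type"] == "paragraph" and heading:
            body.append(block["text_plain"])
    return pairs


def to_jsonl(records):
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```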
Error Handling
13 defined error codes with recovery strategies:
| Code | Severity | Recovery |
|---|---|---|
| DETECT_001 Extension/signature mismatch | Warning | Use signature result, flag for review |
| DETECT_002 Unknown file format | Error | Try generic text extraction; fail if binary |
| DETECT_003 Corrupted ZIP (DOCX/EPUB) | Error | Try to repair ZIP, extract readable parts with zipfile |
| EXTRACT_001 Password-protected PDF | Error | Report failure, request password |
| EXTRACT_002 Scanned PDF without OCR text | Warning | Set is_scanned=true, note OCR recommendation |
| EXTRACT_003 Missing fonts in PDF | Warning | Use raw text extraction fallback (PyPDF2) |
| EXTRACT_004 Broken XML in DOCX | Error | Parse with lxml recovery mode (recover=True) |
| EXTRACT_005 Encoding detection failure | Warning | Use UTF-8 with replacement characters |
| STRUCT_001 No headings detected | Warning | Infer from paragraph patterns (all-caps, short lines) |
| STRUCT_002 Broken heading hierarchy | Warning | Normalize levels, insert inferred headings |
| STRUCT_003 Table extraction failed | Warning | Emit table as unknown block with raw text |
| NORM_001 Invalid date format | Warning | Keep original, set normalized: null |
| NORM_002 URL parse error | Warning | Keep original URL, log warning |
Recovery principle: progressive enhancement. Extract what you can; never fail entirely if partial extraction is possible. Log all errors in metadata.ingestion under warnings or errors arrays.
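Routing a logged issue into the warnings or errors array can be sketched from the severity column of the table. The entry shape ({"code", "detail"}) and the function name are assumptions, as is treating an unknown code as an error.

```python
SEVERITY = {
    "DETECT_001": "warning", "DETECT_002": "error", "DETECT_003": "error",
    "EXTRACT_001": "error", "EXTRACT_002": "warning", "EXTRACT_003": "warning",
    "EXTRACT_004": "error", "EXTRACT_005": "warning",
    "STRUCT_001": "warning", "STRUCT_002": "warning", "STRUCT_003": "warning",
    "NORM_001": "warning", "NORM_002": "warning",
}


def record_issue(metadata, code, detail):
    """Append an issue to metadata.ingestion.warnings or .errors."""
    ingestion = metadata.setdefault("ingestion", {})
    # Assumption: codes missing from the table are treated as errors.
    bucket = "errors" if SEVERITY.get(code, "error") == "error" else "warnings"
    ingestion.setdefault(bucket, []).append({"code": code, "detail": detail})
    return metadata
```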
Output Schema Quick Reference
Top-level structure:
{
  "doc2ml_version": "0.6.2",
  "document_id": "uuid-v4",
  "metadata": {
    "title": "string",
    "authors": [{"name": "string"}],
    "source": {"uri": "...", "mime_type": "...", "filename": "...", "checksum_sha256": "...", "file_size_bytes": 0},
    "ingestion": {"ingestion_date": "ISO-8601", "processing_version": "doc2ml-json v0.6.2", "extractor": "...", "extractor_version": "...", "ingestion_pipeline": ["..."], "processing_duration_ms": 0},
    "language": {"detected": "en", "confidence": 0.97},
    "statistics": {"page_count": 0, "chapter_count": 0, "section_count": 0, "block_count": 0, "table_count": 0, "figure_count": 0, "footnote_count": 0, "total_char_count": 0, "total_token_count_est": 0, "total_word_count": 0},
    "classification": {"doc_type": "...", "genre": "...", "keywords": ["..."], "topics_ml": [{"label": "...", "score": 0.92}]},
    "dates": {"created": "...", "modified": "...", "published": "..."},
    "rights": {"license": "...", "copyright": "...", "open_access": true}
  },
  "structure": {"node_id": "root", "node_type": "document", "title": "...", "level": 0, "chunk_ids": ["..."], "children": [...]},
  "blocks": [{"chunk_id": "blk-000", "type": "heading|paragraph|table|list|code_block|...", "content": {...}, "text_plain": "...", "char_count": 0, "token_count_est": 0, "embedding_ready": true, "context_window": {...}, "provenance": {...}, "language": {...}, "semantics": {...}, "relations": {...}, "custom": {}}],
  "cross_references": [{"ref_id": "...", "ref_type": "citation|internal_link|figure_ref|...", "source_chunk_id": "...", "target_chunk_id": "...", "resolved": true}],
  "ml_index": {"chunk_id_map": {...}, "heading_map": [...], "embedding_candidates": [...], "chunk_boundaries": [...]},
  "custom": {}
}
Resources
- references/schema.md: Full JSON schema specification, block types, examples, ML use-case mappings, and JSON Schema validator (Draft 2020-12)
- references/workflow.md: Complete per-format extraction recipes, edge case guide (scanned PDFs, password protection), tool mapping, and decision trees
- scripts/detect_format.py: Document format detection with extension, magic bytes, and deep inspection
- scripts/extract_structure.py: Main extraction pipeline for all supported formats
- scripts/validate_output.py: JSON schema validation with ML readiness metrics
- scripts/chunk_for_ml.py: Chunk JSON for ML context windows with section boundary respect