kreuzberg-dev

MIME Detection & Routing

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and structured information from PDFs, Office documents, images, and 76+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, TypeScript (Node/Bun/Wasm/Deno)- or use via CLI, REST API, or MCP server.

kreuzberg-dev 8,432 497 Updated 3mo ago
GitHub

Install

npx skillscat add kreuzberg-dev/kreuzberg/ai-rulez-skills-mime-detection-routing

Install via the SkillsCat registry.

SKILL.md

priority: high

MIME Detection & Routing

Detection Flow

Extension → EXT_TO_MIME map → validate → Registry lookup → Extractor

Key Functions

Function Location Purpose
detect_mime_type(path, inspect) core/mime.rs Extension + optional content inspection
detect_mime_type_from_bytes(bytes) core/mime.rs Magic number detection (infer crate)
validate_mime_type(mime) core/mime.rs Check if any extractor supports it

Extension Mapping

118+ extensions mapped in EXT_TO_MIME (core/mime.rs). Case-insensitive.

Key mappings: .pdfapplication/pdf, .docxapplication/vnd.openxmlformats-officedocument.wordprocessingml.document, .xlsx → spreadsheet variant, .png/.jpgimage/*

Registry Selection

// In core/extractor/bytes.rs
fn select_extractor_for_mime(mime_type: &str) -> Result<Arc<dyn DocumentExtractor>> {
    let registry = get_document_extractor_registry();
    let registry_guard = registry.read()?;
    registry_guard.get_for_mime_type(mime_type)
        .ok_or_else(|| KreuzbergError::UnsupportedFormat(mime_type.into()))
}

Selects highest-priority extractor registered for that MIME type.

Adding New MIME Types

  1. Add extension mapping: m.insert("ext", "application/x-new"); in core/mime.rs
  2. Implement DocumentExtractor with supported_mime_types() returning the MIME
  3. Register in register_default_extractors()

Wildcard Support

Extractors can register for MIME type families: "image/*" matches image/png, image/jpeg, etc.

Critical Rules

  1. Always validate_mime_type() before extraction
  2. Extension mapping is case-insensitive
  3. Content inspection (infer crate) is fallback for extension-less files
  4. Registry validation is final authority on supported types