cjk-aware-text-metrics

"CJK/Latin weighted token estimation for multilingual LLM pipelines. Use when processing Japanese/Chinese/Korean text with fixed chars-per-token constants."

shimo4228 1 Updated 5mo ago

Install

npx skillscat add shimo4228/claude-code-learned-skills/cjk-aware-text-metrics

Install via the SkillsCat registry.

SKILL.md

CJK-Aware Text Metrics

Extracted: 2026-02-11
Context: Multilingual LLM pipelines where token estimation affects cost, chunking, or rate limits

Problem

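Fixed chars-per-token constants (e.g., CHARS_PER_TOKEN = 4) assume Latin text.
Japanese/Chinese/Korean text averages closer to ~2.5 chars/token, so the real token
count is ~60% higher than a 4-chars-per-token estimate (put the other way, the
estimate comes out ~37.5% too low). The error carries into token counts, cost
previews, and chunk sizing for CJK-heavy documents.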
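
To make the size of the gap concrete, the arithmetic can be sketched for a hypothetical 10,000-character document written entirely in CJK characters. The document length is an illustrative assumption; the two ratios are the ones quoted above.

```python
# Worked arithmetic for a hypothetical 10,000-character, all-CJK document.
chars = 10_000                      # assumed document length (illustrative only)

fixed_estimate = chars / 4.0        # Latin-tuned constant: CHARS_PER_TOKEN = 4
cjk_estimate = chars / 2.5          # CJK ratio quoted above (~2.5 chars/token)

# Express the gap both ways, since "60% underestimation" can be read either way.
extra_vs_fixed = cjk_estimate / fixed_estimate - 1      # CJK-aware count vs. fixed estimate
shortfall_vs_cjk = 1 - fixed_estimate / cjk_estimate    # fixed estimate vs. CJK-aware count

print(f"fixed: {fixed_estimate:.0f} tokens, CJK-aware: {cjk_estimate:.0f} tokens")
print(f"CJK-aware count is {extra_vs_fixed:.0%} higher; "
      f"fixed estimate is {shortfall_vs_cjk:.1%} lower")
```

A chunk sized with the fixed constant would therefore hold noticeably more tokens than intended whenever the content is mostly CJK.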

Solution

Detect CJK characters by Unicode range:

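def _is_cjk(char: str) -> bool:
    """Return True if a single character falls in a CJK-script Unicode block."""
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0xF900 <= cp <= 0xFAFF   # CJK Compatibility Ideographs
        or 0xAC00 <= cp <= 0xD7AF   # Hangul Syllables (needed for Korean text)
    )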

Weighted token estimation:

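import math

CJK_CHARS_PER_TOKEN = 2.5     # approximate ratio for CJK characters
LATIN_CHARS_PER_TOKEN = 4.0   # approximate ratio for everything else

def estimate_tokens(text: str) -> int:
    """Estimate tokens by weighting CJK and non-CJK characters separately."""
    cjk_count = sum(1 for c in text if _is_cjk(c))
    other_count = len(text) - cjk_count
    # Round up so short non-empty text never estimates to 0 tokens.
    return math.ceil(cjk_count / CJK_CHARS_PER_TOKEN + other_count / LATIN_CHARS_PER_TOKEN)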
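
When tuning the two constants, it can help to see how much each script contributes before any rounding. The sketch below is a diagnostic helper under stated assumptions: `token_breakdown` and `simple_is_cjk` are hypothetical names, and `simple_is_cjk` is a deliberately reduced stand-in classifier (kana plus the main ideograph block) used only to keep the example self-contained.

```python
from typing import Callable

CJK_CHARS_PER_TOKEN = 2.5
LATIN_CHARS_PER_TOKEN = 4.0

def simple_is_cjk(char: str) -> bool:
    """Stand-in classifier (assumption): kana plus the main ideograph block only."""
    cp = ord(char)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF

def token_breakdown(text: str, is_cjk: Callable[[str], bool]) -> dict:
    """Return per-script character counts and unrounded token estimates."""
    cjk_chars = sum(1 for c in text if is_cjk(c))
    other_chars = len(text) - cjk_chars
    return {
        "cjk_chars": cjk_chars,
        "other_chars": other_chars,
        "cjk_tokens": cjk_chars / CJK_CHARS_PER_TOKEN,
        "other_tokens": other_chars / LATIN_CHARS_PER_TOKEN,
    }

sample = "今日は天気がいいですね。Hello, world!"
parts = token_breakdown(sample, simple_is_cjk)
weighted = parts["cjk_tokens"] + parts["other_tokens"]
fixed = len(sample) / LATIN_CHARS_PER_TOKEN   # what a single fixed constant would give

print(parts)
print(f"weighted: {weighted:.2f} tokens, fixed-constant: {fixed:.2f} tokens")
```

Note that the ideographic full stop 。 (U+3002) lies outside the listed ranges, so it is counted at the Latin ratio along with spaces and ASCII punctuation.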

Chunk splitting must use token-based accumulation (not char-based):

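# BAD: char_limit = token_limit * FIXED_CONSTANT
# GOOD: accumulate estimated tokens per paragraph
for para in paragraphs:
    para_tokens = estimate_tokens(para)
    # Flush before adding a paragraph that would push the chunk over the limit
    if current_tokens and current_tokens + para_tokens > token_limit:
        flush_chunk()
        current_tokens = 0
    current_tokens += para_tokens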
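
The fragment above leaves out the surrounding state and the final flush. A minimal sketch of a complete splitter follows, assuming paragraphs arrive as a list of strings. `split_into_chunks`, its `estimate` parameter, and the toy estimator are hypothetical names introduced for illustration, not part of the original skill.

```python
from typing import Callable, Iterable, List

def split_into_chunks(
    paragraphs: Iterable[str],
    token_limit: int,
    estimate: Callable[[str], int],
) -> List[List[str]]:
    """Group paragraphs so each chunk's estimated tokens stay within token_limit.

    A single paragraph larger than token_limit becomes its own oversized chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for para in paragraphs:
        para_tokens = estimate(para)
        # Flush before adding a paragraph that would overflow the current chunk.
        if current and current_tokens + para_tokens > token_limit:
            chunks.append(current)
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append(current)  # emit the final partial chunk
    return chunks

# Toy estimator for the demo only (assumption): 1 token per character.
def one_token_per_char(text: str) -> int:
    return len(text)

demo = split_into_chunks(
    ["aaaa", "bbb", "cc", "dddddddd"],
    token_limit=6,
    estimate=one_token_per_char,
)
print(demo)
```

In a real pipeline, `estimate_tokens` would be passed as `estimate`. Taking the estimator as a parameter keeps the splitter independent of any particular chars-per-token assumption, so the constants can be recalibrated without touching the chunking logic.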

When to Use

Building LLM pipelines that process Japanese/Chinese/Korean text
Implementing chunk splitting for multilingual documents
Estimating API costs for non-English content
Any text metric (token count, cost, rate limit) using a fixed chars-per-token constant
