Extract transcripts from video URLs and produce publication-ready bilingual lecture scripts. Use this skill whenever a user provides a video link (YouTube, Bilibili, TikTok/Douyin, Vimeo, or any platform supported by yt-dlp) and wants a text transcript, lecture notes, subtitle extraction, speech-to-text from video, a written version of a talk, translated lecture scripts, or bilingual transcripts. Also trigger when the user pastes a video URL and asks to "write out what they said", "get the transcript", "turn this video into a document", or makes a similar request, even if they don't explicitly mention "transcript".
Install
npx skillscat add ylouis83/video-transcript
Install via the SkillsCat registry.
Video Transcript Skill
Extract spoken content from any video URL and produce beautifully formatted, publication-ready
lecture scripts. When the source language is not Chinese, automatically produce both the original
language version and a Chinese translation.
Workflow Overview
Video URL
|
v
1. Download subtitles/captions (yt-dlp) — preserve timestamps
|-- Found? --> Parse with timestamps
|-- Not found? --> Download audio --> Whisper transcription (with timestamps)
|
v
2. Structure the raw transcript
|-- Detect language
|-- Split into logical sections (by topic/timestamps)
|-- Map each section to its starting timestamp
|-- Add headings and paragraph breaks
|
v
2.5. Capture Visual Summaries
|-- Try: download video → ffmpeg frame capture at each section timestamp
|-- Fallback: generate concept images via DALL-E (requires OPENAI_API_KEY)
|-- Skip with --no-images flag
|
v
3. If non-Chinese source:
|-- Produce original language document
|-- Produce Chinese translation document
|
v
4. Format and export
|-- Markdown (.md) with timestamps in TOC and section headings, embedded images
|-- Word (.docx) with professional layout, embedded images, timestamp annotations
Step 1: Extract Raw Transcript (with Timestamps)
1a. Try subtitle extraction first (preferred)
Use the bundled script to extract subtitles with timestamps:
python3 "<skill-path>/scripts/extract_transcript.py" "<video-url>"
The script tries in order:
- Manual/human-written subtitles (highest quality) — parses VTT/SRT timestamps
- Auto-generated captions — parses VTT/SRT timestamps
- Falls back to audio download + Whisper (outputs JSON with segment timestamps)
New flags:
- --no-images: Skip visual summary capture
- --no-timestamps: Use legacy behavior without timestamp preservation
If running manually with yt-dlp:
# List available subtitles
yt-dlp --list-subs "<video-url>"
# Download best available subtitle
yt-dlp --write-sub --write-auto-sub --sub-lang "en,zh-Hans,zh,ja,ko,fr,de,es" \
--skip-download --sub-format "vtt/srt/best" -o "transcript" "<video-url>"
1b. Audio download + Whisper fallback
When no subtitles are available:
# Download audio only
yt-dlp -x --audio-format wav --audio-quality 0 -o "audio.%(ext)s" "<video-url>"
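# Transcribe with Whisper (local). Omit --language to let Whisper auto-detect it
# ("auto" is not an accepted value); JSON output keeps the segment timestamps
whisper audio.wav --model medium --output_format json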
# OR via OpenAI API (if OPENAI_API_KEY is set)
python3 "<skill-path>/scripts/whisper_api.py" audio.wav
Model selection guidance:
- Short videos (<15 min): medium model for a good balance of speed and accuracy
- Long videos (>15 min): base or small model to save time
- High accuracy needed: large-v3 model
- OpenAI API: uses OpenAI's hosted Whisper model, best for quality
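The guidance above can be expressed as a small helper. This is only a sketch: the function name `pick_whisper_model` is hypothetical, picking `small` over `base` for long videos is an arbitrary choice, and a video of exactly 15 minutes (which the guidance does not cover) is treated as long.

```python
def pick_whisper_model(duration_seconds: float, high_accuracy: bool = False) -> str:
    """Return a local Whisper model name following the selection guidance."""
    if high_accuracy:
        return "large-v3"
    if duration_seconds < 15 * 60:
        return "medium"  # short video: balance of speed and accuracy
    # Long video: the guidance allows "base" or "small"; "small" is an assumption here.
    return "small"


print(pick_whisper_model(5 * 60))         # short video
print(pick_whisper_model(60 * 60))        # long video
print(pick_whisper_model(60 * 60, True))  # accuracy matters more than speed
```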
1c. Subtitle cleaning and timestamp extraction
Raw subtitles contain timestamps, duplicates, and formatting artifacts. The script produces:
- timestamped_transcript.json: Structured segments with start/end times
{
  "segments": [
    {"start": 0.0, "end": 5.2, "text": "Hello everyone..."},
    {"start": 5.2, "end": 12.1, "text": "Today we'll talk about..."}
  ]
}
- raw_transcript.txt: Cleaned plain text (timestamps removed, duplicates merged, tags stripped)
- sections.json: Segments grouped into sections with timestamp mapping
Cleaning rules:
- Remove formatting tags (<font>, <i>, position markers)
- Merge duplicate lines from overlapping subtitle segments
- Join broken sentences across subtitle blocks
- Preserve paragraph breaks at natural pause points (>2 seconds gap)
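As an illustration of these cleaning rules, here is a minimal sketch that works on segments shaped like those in timestamped_transcript.json. It is not the bundled script's implementation: the function name `clean_segments` is hypothetical, only exact consecutive repeats are merged, and the tag pattern is deliberately broad.

```python
import re

# Matches <font ...>, <i>, </i>, and inline cue timestamps such as <00:00:01.000>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_segments(segments, pause_seconds=2.0):
    """Strip tags, drop repeated lines, and break paragraphs at long pauses."""
    paragraphs, current = [], []
    last_text, last_end = None, None
    for seg in segments:
        text = " ".join(TAG_RE.sub("", seg["text"]).split())  # collapse whitespace
        if not text:
            continue
        if text == last_text:
            last_end = seg["end"]  # overlapping cue repeating the same line
            continue
        if current and last_end is not None and seg["start"] - last_end > pause_seconds:
            paragraphs.append(" ".join(current))  # natural pause: new paragraph
            current = []
        current.append(text)
        last_text, last_end = text, seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


demo = [
    {"start": 0.0, "end": 2.0, "text": "<i>Hello</i> everyone"},
    {"start": 1.5, "end": 3.0, "text": "Hello everyone"},
    {"start": 3.0, "end": 5.0, "text": "welcome back."},
    {"start": 8.0, "end": 10.0, "text": "<font color=\"white\">New topic.</font>"},
]
print(clean_segments(demo))
```

Joining broken sentences across blocks falls out of the space-join here; real subtitle files often need fuzzier duplicate detection (for example, rolling auto-captions that repeat a partial line).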
Step 2: Structure the Transcript
Transform raw text into a structured document:
Language detection
Detect the source language from the first 500 characters. If the video title or metadata
contains language hints, use those as confirmation.
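One simple way to make the Chinese / non-Chinese decision is to measure the share of CJK ideographs in the first 500 characters. The sketch below assumes that heuristic; the function name `is_chinese` and the 0.3 threshold are illustrative, and Japanese text with many kanji can be misclassified, which is where title or metadata hints help.

```python
def is_chinese(text: str, sample_chars: int = 500, threshold: float = 0.3) -> bool:
    """Guess whether a transcript is Chinese from its first `sample_chars` characters."""
    sample = text[:sample_chars]
    letters = [ch for ch in sample if ch.isalpha()]
    if not letters:
        return False
    # CJK Unified Ideographs block; the threshold is an assumed cut-off.
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters) >= threshold


print(is_chinese("今天我们来讲注意力机制"))
print(is_chinese("Today we'll talk about attention"))
```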
Section splitting
- If timestamps are available: Group by natural topic shifts (look for long pauses >3s,
topic transition phrases like "now let's talk about", "moving on to", "next")
- If no timestamps: Split by semantic coherence, so that each section covers one main idea
- Target section length: 300-600 words per section
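The timestamp-based rule can be sketched as follows. This is an assumption-laden illustration rather than the skill's actual logic: the name `split_sections` is hypothetical, transition phrases are only matched at the start of a segment, and word counts use whitespace splitting, which does not suit Chinese text.

```python
TRANSITIONS = ("now let's talk about", "moving on to", "next")


def split_sections(segments, pause_seconds=3.0, min_words=300, max_words=600):
    """Group segments into sections, recording each section's starting timestamp."""
    sections, current, words, prev_end = [], [], 0, None

    def flush():
        sections.append({
            "start": current[0]["start"],
            "text": " ".join(s["text"].strip() for s in current),
        })

    for seg in segments:
        text = seg["text"].strip()
        long_pause = prev_end is not None and seg["start"] - prev_end > pause_seconds
        transition = text.lower().startswith(TRANSITIONS)
        # A pause or transition only counts once the section is long enough.
        topic_shift = (long_pause or transition) and words >= min_words
        if current and (topic_shift or words >= max_words):
            flush()
            current, words = [], 0
        current.append(seg)
        words += len(text.split())
        prev_end = seg["end"]
    if current:
        flush()
    return sections


demo = [
    {"start": 0.0, "end": 4.0, "text": "one two three"},
    {"start": 4.0, "end": 8.0, "text": "four five six"},
    {"start": 12.0, "end": 15.0, "text": "Moving on to seven"},
    {"start": 15.0, "end": 18.0, "text": "eight nine"},
]
# Tiny limits so the short demo actually splits.
print(split_sections(demo, min_words=5, max_words=100))
```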
Heading generation
- Create a descriptive title from the video title/content
- Generate section headings that summarize each section's main point
- Use H2 (##) for major sections, H3 (###) for subsections
Text refinement
The raw transcript is spoken language. Refine for readability while preserving the speaker's
voice and intent:
- Remove filler words ("um", "uh", "you know", "like" when used as filler)
- Fix grammatical artifacts of speech (incomplete restarts, self-corrections)
- Keep the speaker's unique expressions, idioms, and speaking style
- Do NOT rewrite content or add information not present in the original
- Preserve technical terms and proper nouns exactly as spoken
Step 3: Bilingual Output (when source is not Chinese)
When the source language is not Chinese, produce TWO documents:
Original language document
- Clean, structured transcript in the source language
- All formatting and sections applied
Chinese translation document
- Professional, natural Chinese translation (not machine-translation style)
- Adapt idioms and cultural references for Chinese readers
- Keep technical terms with both Chinese translation and original in parentheses
- Example: "Transformer 架构 (Transformer Architecture)"
- Maintain the same section structure as the original
- Translation style: 信达雅 (faithful, expressive, elegant)
When source IS Chinese
- Produce only one document (the Chinese version)
- Apply the same structuring and refinement
Step 4: Format and Export
Produce BOTH formats for every document:
Markdown (.md) format
Use this structure (timestamps and images are included when available):
# [Document Title]
> **Source**: [Video title and URL]
> **Speaker**: [Speaker name if identifiable]
> **Date**: [Video publish date if available]
> **Duration**: HH:MM:SS
---
## Table of Contents
- [Section 1 Title](#section-1-title) `[00:01:23]`
- [Section 2 Title](#section-2-title) `[00:05:47]`
- [Section 3 Title](#section-3-title) `[00:12:00]`
---
## Section 1 Title `[00:01:23]`

[Section content with proper paragraphs]
## Section 2 Title `[00:05:47]`

[Section content]
---
*Transcript extracted and formatted by Video Transcript Skill*
Timestamp format: HH:MM:SS (pure text, not clickable links).
Images: either ffmpeg-captured video frames or AI-generated concept illustrations.
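For reference, converting a segment's start time in seconds into the HH:MM:SS form and a table-of-contents line could look like the sketch below. The helper names are hypothetical, and the slug rule (lowercase, punctuation dropped, spaces to hyphens) is an assumption; anchor generation differs between Markdown renderers.

```python
import re


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, truncating fractional seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def toc_entry(title: str, start_seconds: float) -> str:
    """Build one table-of-contents line in the template's format."""
    slug = re.sub(r"[^\w\s-]", "", title.lower()).strip()  # assumed slug rule
    slug = re.sub(r"\s+", "-", slug)
    return f"- [{title}](#{slug}) `[{format_timestamp(start_seconds)}]`"


print(toc_entry("Section 1 Title", 83))
```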
Word (.docx) format
Use the bundled script to generate professionally formatted Word documents:
python3 "<skill-path>/scripts/generate_docx.py" \
--input transcript.md \
--output transcript.docx \
--title "Document Title" \
--author "Speaker Name" \
--base-dir ./output
The script applies these formatting standards:
- Font: Title in 小二号 (18 pt) Microsoft YaHei (微软雅黑) bold; body in 小四号 (12 pt) SimSun (宋体) for
Chinese, Calibri for English
- Line spacing: 1.5x for body text
- Margins: 2.54cm (1 inch) all around — standard A4
- Header: Document title + page number
- Footer: Source URL
- Title page: Title, speaker, date, source URL
- Table of Contents: Auto-generated with page numbers
- Section headings: Styled with 三号 (16 pt) bold, automatic numbering, timestamp annotation in grey
- Images: Embedded from Markdown image syntax, 5.5 inches wide with centered captions
- Paragraph spacing: 0.5 line before, 0.5 line after
File Naming Convention
Output files follow this pattern:
[video-title]_[language]_transcript.md
[video-title]_[language]_transcript.docx
Examples:
- attention_is_all_you_need_en_transcript.md
- attention_is_all_you_need_zh_transcript.md
- attention_is_all_you_need_en_transcript.docx
- attention_is_all_you_need_zh_transcript.docx
For Chinese-only sources:
[video-title]_transcript.md
[video-title]_transcript.docx
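A small helper can generate names that follow this pattern. The sketch below infers the normalization (lowercase, non-word characters collapsed to underscores) from the examples; the bundled scripts may slugify titles differently, and `output_names` is a hypothetical name.

```python
import re
from typing import List, Optional


def output_names(video_title: str, language: Optional[str] = None) -> List[str]:
    """Return the .md and .docx file names for one document.

    Pass language=None for Chinese-only sources, which omit the language part.
    """
    stem = re.sub(r"[^\w]+", "_", video_title.lower()).strip("_")  # assumed rule
    middle = f"_{language}" if language else ""
    return [f"{stem}{middle}_transcript{ext}" for ext in (".md", ".docx")]


print(output_names("Attention Is All You Need", "en"))
print(output_names("Attention Is All You Need"))
```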
Dependencies
Install these if not already available (prefer uv, fallback to pip):
# Core (required)
uv pip install yt-dlp python-docx
# Whisper - local (optional, for videos without subtitles)
uv pip install openai-whisper
# Whisper - API (optional, alternative to local Whisper)
# Also used for DALL-E image generation fallback
# Requires OPENAI_API_KEY environment variable
uv pip install openai
System tools:
# ffmpeg — required for audio extraction and video frame capture
# Install via: brew install ffmpeg (macOS), apt install ffmpeg (Linux)
ffmpeg -version
# yt-dlp
yt-dlp --version
Note: ffmpeg is needed for both Whisper audio extraction and video frame capture for visual summaries. DALL-E fallback image generation requires OPENAI_API_KEY.
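Before running the pipeline it can help to check which of these dependencies are actually present. The following preflight sketch is not part of the skill; `check_environment` is a hypothetical name, and it relies on the import names `docx` (python-docx), `whisper` (openai-whisper), and `openai`.

```python
import importlib.util
import os
import shutil


def check_environment() -> dict:
    """Report which tools, packages, and credentials are available."""
    return {
        "yt-dlp": shutil.which("yt-dlp") is not None,
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "python-docx": importlib.util.find_spec("docx") is not None,
        "local-whisper": importlib.util.find_spec("whisper") is not None,
        "openai-sdk": importlib.util.find_spec("openai") is not None,
        "OPENAI_API_KEY": bool(os.environ.get("OPENAI_API_KEY")),
    }


for name, available in check_environment().items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

If both `local-whisper` and `OPENAI_API_KEY` are missing, videos without subtitles cannot be transcribed, which matches the corresponding row in the error-handling table.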
Optional Pixelle-Video Installation
Pixelle-Video is optional. It is not bundled inside this skill directory and will not be installed automatically when someone copies or installs video-transcript.
You only need Pixelle-Video if you plan to run:
python3 "<skill-path>/scripts/pixelle_end_to_end.py" ...
Recommended local layout:
workspace/
├── video-transcript/
└── Pixelle-Video/
Suggested setup steps:
# 1. Create a workspace directory
mkdir -p workspace
cd workspace
# 2. Place this skill repo in your workspace
git clone https://github.com/ylouis83/video_transcript.git
# 3. Clone or copy Pixelle-Video beside it
# Replace this with the actual Pixelle-Video source you use
git clone <pixelle-video-repo> Pixelle-Video
# 4. Install Pixelle-Video with its own setup instructions
cd Pixelle-Video
# ...follow Pixelle-Video installation steps...
Default behavior:
- pixelle_end_to_end.py looks for a sibling folder named Pixelle-Video
- If Pixelle lives elsewhere, pass --pixelle-repo /absolute/path/to/Pixelle-Video
- If you do not need video rendering, ignore Pixelle and use the transcript / Markdown / docx pipeline directly
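The lookup described above might be implemented along these lines. This is a guess at the behavior, not the code in pixelle_end_to_end.py: the function name `resolve_pixelle_repo` and the choice to return `None` when nothing is found are assumptions.

```python
from pathlib import Path
from typing import Optional


def resolve_pixelle_repo(skill_dir: Path, override: Optional[str] = None) -> Optional[Path]:
    """Locate Pixelle-Video: an explicit path wins, else a sibling of the skill folder."""
    if override:
        candidate = Path(override)  # value passed via --pixelle-repo
    else:
        candidate = skill_dir.resolve().parent / "Pixelle-Video"
    return candidate if candidate.is_dir() else None


repo = resolve_pixelle_repo(Path("."))
print(repo if repo else "Pixelle-Video not found; continuing without video rendering")
```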
Error Handling
| Scenario | Action |
|---|---|
| Video is private/unavailable | Tell the user, suggest checking the URL or permissions |
| No subtitles AND Whisper not installed | Ask user to install Whisper or set OPENAI_API_KEY |
| Video is very long (>2 hours) | Warn about processing time, suggest processing in chunks |
| Unsupported platform | Try yt-dlp anyway (it supports 1000+ sites), report if it fails |
| Network error during download | Retry once, then report the error |
| Audio extraction fails | Try alternative format (-x --audio-format mp3), then report |
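Two rows of this table, retrying a network failure once and falling back to another audio format, translate naturally into code. The sketch below is illustrative only: `retry_once` and `download_audio` are hypothetical helpers, the `run` parameter exists so the command runner can be swapped out, and the yt-dlp options are the ones already used in this document.

```python
import subprocess
import time


def retry_once(action, delay_seconds=2.0):
    """Call action(); on failure wait, then try exactly one more time."""
    try:
        return action()
    except Exception:
        time.sleep(delay_seconds)
        return action()  # a second failure propagates so it can be reported


def download_audio(url, formats=("wav", "mp3"), run=subprocess.run):
    """Try each audio format in turn and return the expected output file name."""
    for fmt in formats:
        cmd = ["yt-dlp", "-x", "--audio-format", fmt, "-o", "audio.%(ext)s", url]
        try:
            run(cmd, check=True)
            return f"audio.{fmt}"
        except subprocess.CalledProcessError:
            continue  # extraction failed for this format; try the next one
    raise RuntimeError(f"audio extraction failed for formats: {', '.join(formats)}")
```

A caller could combine them as `retry_once(lambda: download_audio(url))`, then surface the final exception message to the user.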
Quality Checklist
Before delivering the final documents, verify:
- No subtitle artifacts (timestamps, tags, position markers) remain in body text
- Sections have meaningful headings (not just "Section 1")
- Paragraphs are properly broken (not one giant wall of text)
- Technical terms are preserved accurately
- Translation (if applicable) reads naturally, not machine-translated
- Both .md and .docx files are generated
- File names follow the naming convention
- Metadata block (source, speaker, date) is filled in
- Table of contents matches actual sections
- Timestamps in TOC and headings are accurate and in HH:MM:SS format
- Section images are present (or --no-images was used)
- Images are properly embedded in .docx (not just text references)