Transcribes audio/video files using ElevenLabs Scribe v2 API. Use when transcribing audio files, generating transcripts, or converting speech to text.
Install
Install via the SkillsCat registry:
`npx skillscat add qdhenry/claude-command-suite/elevenlabs-transcribe`
Requirements:
- `ELEVENLABS_API_KEY` in the project's `.env` file
- `uv` installed (dependencies auto-install via PEP 723)
</quick_start>
Before transcribing, verify:
- `uv` is available (dependency installation is automatic via inline script metadata; no venv or manual pip install needed)
- API key configured in the `.env` file where Claude is running: `ELEVENLABS_API_KEY=your-key-here`
- Audio file exists and is a supported format (mp3, wav, mp4, m4a, ogg, flac, webm, etc.)
MUST stop if the API key is missing — inform the user to add it to their .env file.
Step 1: Parse user input
Extract the audio file path and any options from $ARGUMENTS or the user's message. Supported options:
- `--output <path>` or `-o <path>`: where to save the transcript
- `--language <code>`: ISO-639 language code (e.g., eng, spa, fra, deu, jpn, zho)
- `--num-speakers <n>`: max speakers in the audio (1-32)
- `--keyterms "term1" "term2"`: words/phrases to bias transcription towards
- `--timestamps none|word|character`: timestamp granularity
- `--no-diarize`: disable speaker identification
- `--no-audio-events`: disable audio event tagging
- `--json`: output full JSON response
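To make the option list concrete, here is a minimal `argparse` sketch that accepts the same flags. It is an illustration only: the real `transcribe.py` parser is not shown in this document, and `build_parser` along with the `diarize` and `audio_events` attribute names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real script)."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    parser.add_argument("file", help="path to the audio/video file")
    parser.add_argument("-o", "--output", help="where to save the transcript")
    parser.add_argument("--language", help="ISO-639 language code, e.g. eng")
    parser.add_argument(
        "--num-speakers", type=int, choices=range(1, 33), metavar="N",
        help="max speakers in the audio (1-32)",
    )
    parser.add_argument(
        "--keyterms", nargs="+", default=[],
        help="words/phrases to bias transcription towards",
    )
    parser.add_argument(
        "--timestamps", choices=["none", "word", "character"],
        help="timestamp granularity",
    )
    # The --no-* flags switch off features that are on unless disabled.
    parser.add_argument("--no-diarize", dest="diarize", action="store_false")
    parser.add_argument("--no-audio-events", dest="audio_events", action="store_false")
    parser.add_argument("--json", action="store_true", help="output full JSON response")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Because `--keyterms` takes one or more values, the file path should come before it on the command line, as in the documented invocations.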
Step 2: Validate the audio file
Confirm the file path exists, expanding any ~ first. The script validates the file itself, but checking early gives the user a clearer error message.
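An early check along these lines could look like the sketch below. The helper names are hypothetical, and the extension set only covers formats named in this document; the service may accept others, so a mismatch is better treated as a warning than a hard failure.

```python
from pathlib import Path

# Extensions named in this document; not an exhaustive list.
KNOWN_EXTENSIONS = {".mp3", ".wav", ".mp4", ".m4a", ".ogg", ".flac", ".webm"}


def resolve_audio_path(raw: str) -> Path:
    """Expand ~ and confirm the path points at an existing file."""
    path = Path(raw).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"File not found: {path}")
    return path.resolve()


def looks_supported(path: Path) -> bool:
    """Case-insensitive extension check against the known formats."""
    return path.suffix.lower() in KNOWN_EXTENSIONS
```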
Step 3: Check for API key
`grep -q "ELEVENLABS_API_KEY=" .env 2>/dev/null && echo "API key configured" || echo "API key missing"`
If missing, tell the user to add `ELEVENLABS_API_KEY=your-key-here` to their .env file and stop.
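The `grep` check also succeeds when the variable is present but empty, or only appears in a commented-out line. A stricter check might look like this sketch; `read_api_key` and `has_api_key` are hypothetical helpers, and the parsing assumes simple `NAME=value` lines. It reports only whether a key exists and never prints the value.

```python
from pathlib import Path
from typing import Optional


def read_api_key(env_path: str = ".env") -> Optional[str]:
    """Return the ELEVENLABS_API_KEY value from a .env file, or None."""
    path = Path(env_path)
    if not path.is_file():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines without an assignment
        name, _, value = line.partition("=")
        if name.strip() == "ELEVENLABS_API_KEY":
            value = value.strip().strip("'\"")
            return value or None  # an empty value counts as missing
    return None


def has_api_key(env_path: str = ".env") -> bool:
    """True when a non-empty key is configured; the key itself is not exposed."""
    return read_api_key(env_path) is not None


if __name__ == "__main__":
    print("API key configured" if has_api_key() else "API key missing")
```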
Step 4: Run transcription
Dependencies are installed automatically by uv via inline script metadata (PEP 723). No venv or manual pip install needed.
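For readers unfamiliar with inline script metadata, the sketch below shows what a PEP 723 block looks like and how a runner can locate it, using the reference pattern given in PEP 723. The `requests` dependency in `EXAMPLE` is a placeholder; the actual dependency list of `transcribe.py` is not shown in this document.

```python
import re
from typing import Optional

# Reference pattern from PEP 723 for locating a metadata block.
BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def inline_metadata(script_text: str) -> Optional[str]:
    """Return the TOML text of the 'script' block, or None if there is none."""
    for match in BLOCK.finditer(script_text):
        if match.group("type") == "script":
            # Strip the leading "# " (or bare "#") from every content line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:]
                for line in match.group("content").splitlines(keepends=True)
            )
    return None


# A script header as a tool such as uv would see it (placeholder dependency).
EXAMPLE = "\n".join([
    "# /// script",
    '# requires-python = ">=3.10"',
    '# dependencies = ["requests"]',
    "# ///",
    "import requests",
    "",
])

if __name__ == "__main__":
    print(inline_metadata(EXAMPLE))
```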
Basic transcription (diarize + audio events + auto language):
`uv run ~/.claude/skills/elevenlabs-transcribe/scripts/transcribe.py "<audio_file_path>"`
With output file and options:
`uv run ~/.claude/skills/elevenlabs-transcribe/scripts/transcribe.py "<audio_file_path>" --output transcript.txt --language eng --num-speakers 3`
With key terms for better accuracy:
`uv run ~/.claude/skills/elevenlabs-transcribe/scripts/transcribe.py "<audio_file_path>" --keyterms "technical term" "product name"`
Full JSON response:
`uv run ~/.claude/skills/elevenlabs-transcribe/scripts/transcribe.py "<audio_file_path>" --json --output result.json`
Step 5: Present results
Format the transcription output cleanly for the user. If diarization is enabled, group text by speaker. Highlight any audio events detected. Example output:
[Speaker 0]: Hello, how are you doing today?
[Speaker 1]: I'm doing great, thanks for asking! (laughter)
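When working from the `--json` output, grouping by speaker could be done roughly as in the sketch below. It assumes the response carries a `words` list whose items have `text` and `speaker_id` fields, with spacing and audio events as separate items and IDs shaped like `speaker_0`. Those field names are assumptions to verify against the actual JSON, and `format_transcript` is a hypothetical helper.

```python
def format_transcript(words):
    """Merge consecutive items from the same speaker into one labelled line."""
    turns = []  # each entry: [speaker_id, accumulated_text]
    for item in words:
        speaker = item.get("speaker_id")
        text = item.get("text", "")
        # Items without a speaker (if any) stay attached to the current turn.
        if turns and (speaker is None or speaker == turns[-1][0]):
            turns[-1][1] += text
        else:
            turns.append([speaker, text])
    lines = []
    for speaker, text in turns:
        if not text.strip():
            continue
        label = (speaker or "unknown").replace("_", " ").title()
        lines.append(f"[{label}]: {text.strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"text": "Hello,", "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "how", "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "type": "spacing", "speaker_id": "speaker_1"},
        {"text": "Great!", "type": "word", "speaker_id": "speaker_1"},
        {"text": " ", "type": "spacing", "speaker_id": "speaker_1"},
        {"text": "(laughter)", "type": "audio_event", "speaker_id": "speaker_1"},
    ]
    print(format_transcript(sample))
```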
<script_options>

| Flag | Description | Default |
|------|-------------|---------|
| `<file>` | Path to audio/video file (required) | - |
| `--output <path>`, `-o` | Save transcription to file | stdout |
| `--language <code>` | ISO-639 code (eng, spa, fra, deu, jpn, zho) | auto-detect |
| `--num-speakers <n>` | Max speakers in audio (1-32) | auto-detect |
| `--keyterms "t1" "t2"` | Terms to bias transcription towards (max 100) | none |
| `--timestamps <level>` | Granularity: none, word, character | word |
| `--no-diarize` | Disable speaker identification | diarize enabled |
| `--no-audio-events` | Disable audio event tagging | events enabled |
| `--json` | Output full JSON response | formatted text |

</script_options>

<supported_formats>
All major audio and video formats: mp3, wav, mp4, m4a, ogg, flac, webm, aac, wma, mov, avi, mkv, and more. Maximum file size: 3GB.
</supported_formats>

<api_details>
- **Endpoint:** POST /v1/speech-to-text
- **Model:** scribe_v2 (latest, most accurate)
- **Diarization:** Identifies and labels different speakers (up to 32)
- **Audio events:** Tags non-speech sounds like (laughter), (applause), (music)
- **Language:** Auto-detected or specified via ISO-639 code
- **Timestamps:** none, word-level, or character-level granularity
- **Key terms:** Bias transcription towards specific words/phrases for better accuracy
</api_details>

<error_handling>

| Error | Resolution |
|-------|------------|
| `ELEVENLABS_API_KEY not found` | Add key to `.env` file in current directory |
| `uv: command not found` | Install uv: `curl -LsSf https://astral.sh/uv/install.sh` pipe to `sh` |
| `File not found` | Verify the file path and expand any `~` |
| `422 Validation Error` | Check file format/size, ensure model_id is valid |
| `401 Unauthorized` | API key is invalid or expired |

</error_handling>

<success_criteria>
- Audio file exists and is accessible
- API key loaded from `.env` without exposure in chat
- Transcription completed successfully
- Output formatted with speaker labels (if diarized)
- Audio events shown inline (if enabled)
- If `--output` specified, file written to requested path
- User can see the full transcription text
</success_criteria>
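For orientation, the request described in the API details above could be assembled roughly as follows. This is a sketch under stated assumptions, not the bundled script's code: the form field names (`language_code`, `num_speakers`, `diarize`, `tag_audio_events`, `timestamps_granularity`, `keyterms`), the `"true"`/`"false"` boolean encoding, and the `build_request` helper are assumptions to check against the ElevenLabs API reference. Nothing is sent over the network here.

```python
API_URL = "https://api.elevenlabs.io/v1/speech-to-text"


def build_request(api_key, *, language=None, num_speakers=None, keyterms=(),
                  timestamps=None, diarize=True, audio_events=True):
    """Return (headers, form_fields) for a speech-to-text call (field names assumed)."""
    headers = {"xi-api-key": api_key}
    data = {
        "model_id": "scribe_v2",
        "diarize": str(diarize).lower(),
        "tag_audio_events": str(audio_events).lower(),
    }
    if timestamps is not None:
        data["timestamps_granularity"] = timestamps
    if language:
        data["language_code"] = language
    if num_speakers is not None:
        if not 1 <= num_speakers <= 32:
            raise ValueError("num_speakers must be between 1 and 32")
        data["num_speakers"] = str(num_speakers)
    if keyterms:
        if len(keyterms) > 100:
            raise ValueError("at most 100 key terms are allowed")
        data["keyterms"] = list(keyterms)
    return headers, data


if __name__ == "__main__":
    headers, data = build_request("your-key-here", language="eng", num_speakers=3)
    print(API_URL, data)
```

With the `requests` library, the result could then be posted as `requests.post(API_URL, headers=headers, data=data, files={"file": open(path, "rb")})`, where `path` is the audio file.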