voice — TTS & STT tool for AI agents

Requires macOS with Apple Silicon, Git LFS, and Rust 1.85+. TTS model weights download on first `voice say` (~312MB, cached). STT model weights download on first `voice listen` (~246MB, cached).

rgbkrk 3 1 Updated 4mo ago

Resources

GitHub

Install

npx skillscat add rgbkrk/voice

Install via the SkillsCat registry.

SKILL.md

voice — TTS & STT tool for AI agents

voice speaks text aloud using Kokoro TTS and transcribes speech using Moonshine STT on Apple Silicon. Use it to talk to your user, listen for their response, or run a full voice conversation loop.

Quick reference

Speak (TTS)

# Speak text (backward compatible — no subcommand needed)
voice Hello, I finished the task.

# Explicit say subcommand with options
voice say -v am_michael "Switching to a male voice."

# Speak from a pipe
echo "Build complete." | voice say

# Read a file aloud (strip markdown first)
voice say --markdown -f README.md

# Save to WAV instead of playing
voice say -o result.wav "Here is your audio."

# Precise pronunciation via IPA phonemes
voice say --phonemes "həlˈO wˈɜɹld"

Converse (speak + listen)

# Speak text, then immediately listen for a response
voice converse "How are you today?"

# With voice and speed options
voice converse -v am_michael -s 1.2 "What do you think about that?"

Listen (STT)

# Record from mic, transcribe on Enter/Ctrl+C
voice listen

# Continuous mode — transcribe segments as you speak, split on silence
voice listen --continuous

# Transcribe a WAV file
voice transcribe recording.wav

JSON-RPC server

# Start the server (for programmatic control)
voice serve -v am_michael

# Speak
→ {"jsonrpc":"2.0","method":"speak","params":{"text":"Hello"},"id":1}
← {"jsonrpc":"2.0","result":{"duration_ms":1800,"chunks":1},"id":1}

# Listen (ding plays, records, auto-stops on silence)
→ {"jsonrpc":"2.0","method":"listen","id":2}
← {"jsonrpc":"2.0","result":{"text":"I heard you","tokens":4,"duration_ms":3200},"id":2}

# Cancel current playback or recording
→ {"jsonrpc":"2.0","method":"cancel","id":3}

# Other methods: set_voice, set_speed, list_voices, ping

When to use

Get attention: Speak when a long task finishes, a build fails, or you need input
Read content: Pipe text through voice say to read back docs, errors, or summaries
Confirm actions: "Deploying to production" before doing something irreversible
Listen for input: Use voice listen to capture a spoken response from the user
Voice conversation: Use voice converse to speak then listen in one shot, or voice serve for programmatic control
Transcribe recordings: Use voice transcribe to convert audio files to text

Tips

TTS tips

Use -q for quiet mode — suppresses phonemes and progress, only errors print
For long text, voice automatically chunks at ~510 phonemes and streams playback
Stderr shows phoneme output — useful for debugging pronunciation
Use --sub word=replacement to fix names: voice say --sub kubectl=cube-cuddle "Restarting kubectl"
A .voice-subs file in the project root is auto-discovered for persistent fixes
Wrap substitution values in /slashes/ for raw phoneme overrides: Kokoro=/kˈOkəɹO/

STT tips

A ding sound plays when the mic is ready — wait for it before speaking
Bluetooth mics (AirPods) have ~0.5s latency; the ding helps you time it
Noise floor is calibrated automatically — works with MacBook mic or AirPods
Use STT_MODEL=UsefulSensors/moonshine-tiny for faster (but less accurate) transcription
Default model is moonshine-base (61M params, ~50× real-time on Apple Silicon)

JSON-RPC tips

voice serve loads the TTS model at startup; STT model loads lazily on first listen
cancel interrupts the current speak or listen mid-operation
speak supports per-request voice and speed overrides without changing defaults
listen params are tunable: noise_multiplier, calibration_ms, silence_timeout_ms
Notifications (requests without id) are fire-and-forget — no response returned

Subcommands

Command	What it does
`voice <text>`	Speak text (implicit `say`, backward compatible)
`voice say`	Speak text with full TTS options
`voice converse`	Speak text, then listen for a response
`voice listen`	Record from mic, transcribe once
`voice listen --continuous`	Record and transcribe segments continuously
`voice transcribe <file>`	Transcribe a WAV file
`voice serve`	Start JSON-RPC server on stdin/stdout

Builtin voices (no network)

af_heart (default), af_bella, af_sarah, af_sky, am_michael, am_adam, bf_emma

Install

git clone https://github.com/rgbkrk/voice.git
cd voice
cargo install --path crates/voice-cli

Requires macOS with Apple Silicon, Git LFS, and Rust 1.85+. TTS model weights download on first voice say (~312MB, cached). STT model weights download on first voice listen (~246MB, cached).

voice — TTS & STT tool for AI agents

Resources

Install

voice — TTS & STT tool for AI agents

Quick reference

Speak (TTS)

Converse (speak + listen)

Listen (STT)

JSON-RPC server

When to use

Tips

TTS tips

STT tips

JSON-RPC tips

Subcommands

Builtin voices (no network)

Install

Categories

Install

Recommended Skills