A macOS-native fork of Kyutai's Pocket TTS — a ~100M-parameter CPU-only text-to-speech engine. This fork wraps the original Python library in an Electron desktop app, a macOS Quick Action for system-wide text reading, and a Swift menu bar companion app. Voice enhancement via LavaSR is integrated directly into the app.
Upstream: kyutai-labs/pocket-tts · Demo · HuggingFace · Paper
| Feature | Description |
|---|---|
| Electron Desktop App | Dark-themed GUI with voice management, streaming playback, multi-talk mode, and history |
| LavaSR Enhancement Studio | A/B preview of enhanced vs original voice samples before committing — self-bootstrapping venv, no external setup |
| macOS Quick Action | Select text anywhere → right-click → "Read Selection with Pocket TTS" — streams audio via ffplay |
| Menu Bar App | Native Swift app for voice selection and TTS server monitoring |
| LaunchAgent | Auto-starts the TTS server on login (port 8765) |
| Text Normalizer | Numbers, currencies, abbreviations, acronyms, ISR/radar terms → speakable words |
| Pause/Resume/Stop | Client-side audio controls + server-side cancellation |
| Fish Audio S2 Pro | Alternate 5B-parameter TTS backend via MLX — selectable from Electron, Menu Bar, and Quick Action |
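The Text Normalizer feature above can be illustrated with a minimal sketch. The rules and mappings here are hypothetical and heavily simplified, not the app's actual implementation:

```python
import re

# Hypothetical normalization tables -- illustrative only.
ABBREVIATIONS = {"Dr.": "Doctor", "etc.": "et cetera", "km": "kilometers"}
SMALL_NUMBERS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize(text: str) -> str:
    """Turn numbers, currencies, and acronyms into speakable words."""
    # Expand a few abbreviations.
    for abbr, spoken in ABBREVIATIONS.items():
        text = text.replace(abbr, spoken)
    # "$5" -> "five dollars" (single digits only, for brevity).
    text = re.sub(r"\$(\d)\b",
                  lambda m: f"{SMALL_NUMBERS[m.group(1)]} dollars", text)
    # Spell out short all-caps acronyms, e.g. "ISR" -> "I S R".
    text = re.sub(r"\b([A-Z]{2,5})\b", lambda m: " ".join(m.group(1)), text)
    return text
```

The real normalizer handles far more (currencies, ISR/radar terms, multi-digit numbers), but the pipeline shape — ordered substitution passes over the input text — is the core idea.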
Demo video: `Multi-Talk-Example.webm`
- macOS (Apple Silicon recommended)
- Python 3.10–3.14
- uv (Python package manager)
- Node.js 18+ and npm (for Electron)
- ffplay (for Quick Action streaming — `brew install ffmpeg`)
- PyTorch ≥ 2.5 (CPU build, installed automatically)
```bash
# 1. Clone and install Python package
git clone https://github.com/slaughters85j/pocket-tts.git
cd pocket-tts
uv pip install -e .

# 1b. Optional: enable Fish Audio S2 Pro backend (Apple Silicon only)
uv sync --group mlx

# 2. Run the Electron app in dev mode
cd electron && npm install && npm run dev

# 3. Or start the TTS server directly
uv run pocket-tts serve --port 8765
```

There are several ways to build depending on what changed. Read this section carefully — it will save you headaches.
```bash
./scripts/rebuild-all.sh
```

This runs all steps in order:

- Python editable install (`uv pip install -e .`)
- Electron app — npm install, PyInstaller bundle, electron-builder, copy to `/Applications/`
- macOS Quick Action + Menu Bar App — Swift builds, workflow install
- LaunchAgent restart

Flags:

```bash
./scripts/rebuild-all.sh --skip-electron   # Python + macOS only
./scripts/rebuild-all.sh --skip-macos      # Python + Electron only
```

If you only changed TypeScript/React code and the Python server bundle is already built, use this. It's significantly faster than a full rebuild:
```bash
cd electron && rm -rf out release && npm run build:electron \
  && killall "Pocket TTS" 2>/dev/null; \
  rm -rf "/Applications/Pocket TTS.app" \
  && cp -R "release/mac-arm64/Pocket TTS.app" /Applications/ \
  && open "/Applications/Pocket TTS.app"
```

Why `rm -rf out release`? Without it, electron-builder may repackage stale assets. The content-hashed JS filenames look fresh, but the asar can contain old code. Always nuke `out/` and `release/` for a clean build.
Source changes to `pocket_tts/` take effect immediately for `uv run` and the LaunchAgent (editable install). No rebuild needed unless:

- New dependency added to `pyproject.toml` — re-run from project root: `uv pip install -e .`
- Changes need to be in the Electron distributable — the Electron app bundles Python via PyInstaller, so you must re-bundle:

```bash
uv pip install -e .
cd electron/python && ./bundle-python.sh
cd .. && npm run build:electron
```
```bash
# Reinstall the Quick Action
cd macos-service/scripts && ./install-quick-action.sh

# Test it end-to-end
cd macos-service/scripts && ./dev-test.sh

# Electron dev mode
cd electron && npm run dev
```

Dev mode hot-reloads renderer changes. The dev server connects to whatever TTS server is running on port 8765.
The app integrates LavaSR for speech super-resolution and denoising of voice samples. Enhancement is fully self-bootstrapping:
- The first time you click "Set Up LavaSR" in the Save Voice modal or Voice Selector, the app creates a dedicated venv at `~/Library/Application Support/pocket-tts-electron/lavasr-venv/` and installs torch, torchaudio, soundfile, and LavaSR from GitHub via `uv`.
- Once set up, the Enhancement Studio lets you preview enhanced vs original audio side-by-side before committing.
- Enhanced voices are tagged in `voices.json` with metadata (denoise settings, RMS normalization).

No external scripts or manual venv management required — it just works.
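The first-run bootstrap described above can be sketched as follows. The venv path matches the one given in this README, but the function, package list, and command layout are illustrative assumptions, not the app's actual code:

```python
from pathlib import Path

# Venv location as documented above.
VENV_DIR = Path.home() / "Library/Application Support/pocket-tts-electron/lavasr-venv"

def bootstrap_commands(venv_dir: Path = VENV_DIR) -> list[list[str]]:
    """Return the shell commands a first-run setup would issue.

    Hypothetical sketch: the real app runs these via its own process
    manager; the LavaSR GitHub URL is omitted here.
    """
    cmds = []
    if not venv_dir.exists():
        cmds.append(["uv", "venv", str(venv_dir)])
    cmds.append([
        "uv", "pip", "install",
        "--python", str(venv_dir / "bin" / "python"),
        "torch", "torchaudio", "soundfile",
        # ...plus LavaSR installed from its GitHub repository.
    ])
    return cmds
```

The key design point is idempotence: re-running setup against an existing venv skips creation and only (re)installs packages.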
An alternate TTS backend using Fish Audio S2 Pro (5B params, MLX 8-bit quantized). Requires Apple Silicon.
```bash
# 1. Install MLX dependencies
uv sync --group mlx

# 2. Download the model (~6.7 GB)
huggingface-cli download mlx-community/fish-audio-s2-pro-8bit \
  --local-dir models/fish-audio-s2-pro-8bit
```

The entire `models/fish-audio-s2-pro-8bit/` directory is gitignored — all files come from HuggingFace, so updates are always in sync. The rebuild script (`scripts/rebuild-all.sh`) installs MLX dependencies automatically on Apple Silicon.

Note: The model is gated under the Fish Audio Research License. If prompted, accept the license at the HuggingFace page and authenticate with `huggingface-cli login` first.
Once installed, a Model dropdown appears in the Electron app (and a Select Model submenu in the Menu Bar app). Switch between:
- Pocket TTS (100M, CPU) — fast, lightweight, built-in voices
- Fish Audio S2 Pro (5B, MLX) — higher quality, 80+ languages, inline `[tag]` emotion/prosody control
Only one model is loaded at a time. Switching unloads the current model and loads the new one (~10-15s for fish-speech).
Fish Audio S2 Pro supports 15,000+ inline tags for fine-grained control:
S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using [tag] syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.
```
[whisper in small voice]
[professional broadcast tone]
[pitch up]
```
Common Tags (15,000+ unique tags supported):
```
[pause] [emphasis] [laughing] [inhale] [chuckle] [tsk] [singing] [excited]
[laughing tone] [interrupting] [chuckling] [excited tone] [volume up] [echo]
[angry] [low volume] [sigh] [low voice] [whisper] [screaming] [shouting]
[loud] [surprised] [short pause] [exhale] [delight] [panting]
[audience laughter] [with strong accent] [volume down] [clearing throat]
[sad] [moaning] [shocked]
```
Based on local MLX inference testing, the following guidelines produce the most reliable output:
- One tag per phrase or sentence. Give the model enough text after a tag to settle into the style before switching. Rapid tag switching every sentence degrades quality.
- Do not stack tags back-to-back. Adjacent tags with no text between them (e.g., `[audience laughter][chuckling]`) produce garbled or distorted audio. Separate them with natural text or a full sentence.
- `[pause]` counts as a tag. Do not place `[pause]` immediately before another tag — insert text between them.
- `[singing]` is unreliable. The model is a TTS system, not a vocoder trained on melodic data. Expect spoken cadence, not actual singing.
- `[screaming]` causes distortion. Audio phases out and distorts even with adequate text after the tag. Use `[loud]` or `[troubled]` as safer alternatives for high-intensity delivery.
- Emotion transitions need runway. When shifting between emotions, allow at least one full sentence per tag so the Dual-AR architecture can stabilize the new prosody.
Good:

```
[excited]I cannot believe this works! After all that effort, we finally have local inference.
[whisper]And the best part is, nobody else knows about it yet.
```

Bad:

```
[excited]Wow! [sad]But also sad. [angry]And frustrating! [laughing]Just kidding.
[pause][short pause][excited]Surprise![audience laughter][chuckling]Funny right?
```
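The no-stacking rule above can be checked mechanically before sending text to the model. A hypothetical helper, not part of the app:

```python
import re

# Two closing/opening brackets with nothing but whitespace between them
# means two tags are stacked back-to-back.
STACKED = re.compile(r"\]\s*\[")

def has_stacked_tags(text: str) -> bool:
    """True if two [tag] markers appear with no text between them."""
    return bool(STACKED.search(text))
```

Running the "Bad" example through this check flags it, while the "Good" example passes.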
- Voice cloning uses custom WAV files only — predefined pocket-tts voices (Alba, etc.) are not available with this backend.
- Multi-Talk mode is pocket-tts only.
- Model weights (~6.7 GB) download from HuggingFace on first use.
System-wide text-to-speech from any application.
- Install: `cd macos-service/scripts && ./install-quick-action.sh`
- Enable in System Settings → Keyboard → Shortcuts → Services → find "Read Selection with Pocket TTS"
- Optional: assign a keyboard shortcut (e.g., F19)
- Start the server: `uv run pocket-tts serve --port 8765` (or install the LaunchAgent for auto-start)
Select text anywhere → right-click → Services → "Read Selection with Pocket TTS". Audio streams immediately via ffplay.
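The Quick Action is just a client of the local server on port 8765, so the same server can be driven from code. The `/tts` path and JSON payload below are assumptions for illustration — check the actual server API before relying on them:

```python
import json
import urllib.request

def build_tts_request(text: str, host: str = "127.0.0.1", port: int = 8765):
    """Build a request for the local TTS server.

    Hypothetical endpoint and payload shape -- consult the real API.
    """
    return urllib.request.Request(
        f"http://{host}:{port}/tts",
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )

# Sending (requires a running server):
# with urllib.request.urlopen(build_tts_request("Hello")) as resp:
#     audio_bytes = resp.read()
```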
Logs: `~/Library/Logs/PocketTTS/tts-stream-YYYY-MM-DD.log`
A native Swift menu bar app (macos-service/PocketTTSMenuBar/) for:
- Voice selection (syncs with Electron app and Quick Action)
- Server status monitoring
- Stop Speaking control
Built with AppKit rather than the SwiftUI App lifecycle (which fixed the menu bar item not appearing). Installed to `~/Applications/Pocket TTS Menu Bar.app`.
```
Text → SentencePiece tokenizer → LUTConditioner (embeddings)
                                       ↓
Audio prompt → Mimi encoder → voice state → FlowLMModel (CaLM) → latent frames
                                       ↓
                              Mimi decoder → PCM audio
```
- Thread 1: CaLM generates latent frames autoregressively (12.5 Hz, 80ms/frame)
- Thread 2: Mimi decoder converts latents to waveform in parallel
- CPU-only — GPU provides no speedup at this model size (~100M params)
- Not thread-safe — server does not support concurrent requests
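The two-thread pipeline can be modeled as a producer/consumer pair over a queue. This is a toy sketch with dummy frames, purely illustrative of the structure, not the actual implementation:

```python
import queue
import threading

def run_pipeline(n_frames: int) -> list[str]:
    """Thread 1 'generates' latent frames; Thread 2 'decodes' them."""
    frames: queue.Queue = queue.Queue()
    decoded: list[str] = []

    def generator():
        for i in range(n_frames):          # Thread 1: autoregressive CaLM
            frames.put(f"latent-{i}")      # one frame every 80 ms in reality
        frames.put(None)                   # sentinel: generation finished

    def decoder():
        while (frame := frames.get()) is not None:   # Thread 2: Mimi decoder
            decoded.append(frame.replace("latent", "pcm"))

    t1 = threading.Thread(target=generator)
    t2 = threading.Thread(target=decoder)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return decoded
```

Because decoding overlaps with generation, audio can start playing before the full utterance is generated — which is what makes streaming playback possible on CPU.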
```bash
uv run pytest -n 3 -v                               # all tests (3 parallel workers)
uv run pytest tests/test_cli_generate.py -v         # single file
uv run pytest tests/test_cli_generate.py -k "name"  # single test
```

Ruff via pre-commit. Line length 100, LF endings, relative imports banned.

```bash
uvx pre-commit install          # one-time setup
uvx pre-commit run --all-files  # manual run
```

- PyTorch < 2.5 produces incorrect audio. Enforced in `pyproject.toml`.
- Python 3.10–3.14 only. `uv` manages its own Python — system Python may lack headers.
- Electron won't load? Check `ELECTRON_RUN_AS_NODE` is not set: `unset ELECTRON_RUN_AS_NODE`
- Voice cloning requires gated HF model access (`uvx hf auth login`). Predefined voices work without auth.
- Editable install means Python source changes are live immediately. New deps require `uv pip install -e .` from project root.
- Electron distributable bundles PyInstaller output, not source — must re-bundle after Python changes.
Use of our model must comply with all applicable laws and regulations and must not result in, involve, or facilitate any illegal, harmful, deceptive, fraudulent, or unauthorized activity. Prohibited uses include, without limitation, voice impersonation or cloning without explicit and lawful consent; misinformation, disinformation, or deception (including fake news, fraudulent calls, or presenting generated content as genuine recordings of real people or events); and the generation of unlawful, harmful, libelous, abusive, harassing, discriminatory, hateful, or privacy-invasive content. We disclaim all liability for any non-compliant use.
Upstream (Kyutai): Manu Orsini*, Simon Rouard*, Gabriel De Marmiesse*, Václav Volhejn, Neil Zeghidour, Alexandre Défossez (*equal contribution)
This fork: John Saunders — Electron app, LavaSR integration, macOS Quick Action, Menu Bar app, text normalizer, pause/resume/stop controls, reusable creations with metadata, .mp3/.mpa/.wav export




