Pocket TTS — macOS Desktop Fork

A macOS-native fork of Kyutai's Pocket TTS — a ~100M-parameter CPU-only text-to-speech engine. This fork wraps the original Python library in an Electron desktop app, a macOS Quick Action for system-wide text reading, and a Swift menu bar companion app. Voice enhancement via LavaSR is integrated directly into the app.

Upstream: kyutai-labs/pocket-tts · Demo · HuggingFace · Paper

What This Fork Adds

| Feature | Description |
| --- | --- |
| Electron Desktop App | Dark-themed GUI with voice management, streaming playback, multi-talk mode, and history |
| LavaSR Enhancement Studio | A/B preview of enhanced vs original voice samples before committing; self-bootstrapping venv, no external setup |
| macOS Quick Action | Select text anywhere → right-click → "Read Selection with Pocket TTS"; streams audio via ffplay |
| Menu Bar App | Native Swift app for voice selection and TTS server monitoring |
| LaunchAgent | Auto-starts the TTS server on login (port 8765) |
| Text Normalizer | Numbers, currencies, abbreviations, acronyms, ISR/radar terms → speakable words |
| Pause/Resume/Stop | Client-side audio controls + server-side cancellation |
| Fish Audio S2 Pro | Alternate 5B-parameter TTS backend via MLX; selectable from Electron, Menu Bar, and Quick Action |
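
To make the Text Normalizer row concrete, here is a minimal sketch of the idea in Python. It is not the fork's implementation: the lookup tables, the function names (`number_to_words`, `normalize`), and the 0..999 limit are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical lookup tables for illustration only; a real normalizer would
# cover far more cases (acronyms, ISR/radar terms, larger numbers, decimals).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately", "vs.": "versus"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def _dollars(match: re.Match) -> str:
    """Turn a matched "$N" into "N dollar(s)" spelled out."""
    amount = int(match.group(1))
    unit = "dollar" if amount == 1 else "dollars"
    return f"{number_to_words(amount)} {unit}"


def normalize(text: str) -> str:
    """Expand abbreviations, dollar amounts, and small integers into words."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Currency first, so "$5" is not consumed by the bare-integer rule below.
    text = re.sub(r"\$(\d{1,3})\b", _dollars, text)
    # Remaining standalone integers of up to three digits.
    text = re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)
    return text


print(normalize("Dr. Lee paid $42 for 3 cables."))
# Doctor Lee paid forty-two dollars for three cables.
```

The ordering matters: currency is handled before bare integers so the dollar sign is still attached when its rule runs. Numbers with four or more digits are left untouched by this sketch.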

Screenshots

Single Voice Mode (screenshot) · Multi-Talk example (video: Multi-Talk-Example.webm) · History View (screenshot)

Requirements

  • macOS (Apple Silicon recommended)
  • Python 3.10–3.14
  • uv (Python package manager)
  • Node.js 18+ and npm (for Electron)
  • ffplay (for Quick Action streaming — brew install ffmpeg)
  • PyTorch ≥ 2.5 (CPU build, installed automatically)
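
Before installing, it can help to confirm the requirements above are met. The following is a small preflight sketch, not a script shipped with this repo; the function names are hypothetical, and it only checks that the tools are on `PATH`, not their versions.

```python
import shutil
import sys

# Command-line tools named in the requirements list above.
REQUIRED_TOOLS = ["uv", "node", "npm", "ffplay"]


def python_version_ok(version=sys.version_info) -> bool:
    """True when the interpreter falls inside the documented 3.10-3.14 range."""
    return (3, 10) <= tuple(version[:2]) <= (3, 14)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    if not python_version_ok():
        major, minor = sys.version_info[:2]
        print(f"Python {major}.{minor} is outside the supported 3.10-3.14 range")
    for tool in missing_tools():
        print(f"missing from PATH: {tool}")
```

If `ffplay` is reported missing, `brew install ffmpeg` provides it, as noted in the list.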

Quick Start

# 1. Clone and install Python package
git clone https://github.com/slaughters85j/pocket-tts.git
cd pocket-tts
uv pip install -e .

# 1b. Optional: enable Fish Audio S2 Pro backend (Apple Silicon only)
uv sync --group mlx

# 2. Run the Electron app in dev mode
cd electron && npm install && npm run dev

# 3. Or start the TTS server directly
uv run pocket-tts serve --port 8765
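
To confirm the server from step 3 is up, a plain TCP check is enough. This sketch assumes the server listens on `127.0.0.1` (the host is an assumption; only the port 8765 comes from this README) and deliberately avoids guessing at any HTTP endpoints.

```python
import socket


def is_listening(host: str = "127.0.0.1", port: int = 8765, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    state = "reachable" if is_listening() else "not reachable"
    print(f"TTS server on port 8765 is {state}")
```

A successful connection only shows that some process owns the port, not that it is Pocket TTS.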

Building

There are several ways to build depending on what changed. Read this section carefully — it will save you headaches.

Rebuild Everything (recommended after pulling changes)

./scripts/rebuild-all.sh

This runs all steps in order:

  1. Python editable install (uv pip install -e .)
  2. Electron app — npm install, PyInstaller bundle, electron-builder, copy to /Applications/
  3. macOS Quick Action + Menu Bar App — Swift builds, workflow install
  4. LaunchAgent restart

Flags:

./scripts/rebuild-all.sh --skip-electron   # Python + macOS only
./scripts/rebuild-all.sh --skip-macos      # Python + Electron only

Electron-Only Rebuild (UI/renderer changes, no Python changes)

If you only changed TypeScript/React code and the Python server bundle is already built, use this. It's significantly faster than a full rebuild:

# killall may fail when the app is not running; every other step must succeed
cd electron && rm -rf out release && npm run build:electron \
  && { killall "Pocket TTS" 2>/dev/null || true; } \
  && rm -rf "/Applications/Pocket TTS.app" \
  && cp -R "release/mac-arm64/Pocket TTS.app" /Applications/ \
  && open "/Applications/Pocket TTS.app"

Why rm -rf out release? Without it, electron-builder may repackage stale assets. The content-hashed JS filenames look fresh but the asar can contain old code. Always nuke out/ and release/ for a clean build.

Python Changes Only

Source changes to pocket_tts/ take effect immediately for uv run and the LaunchAgent (editable install). No rebuild needed unless:

  • New dependency added to pyproject.toml: Re-run from project root:

    uv pip install -e .
  • Changes need to be in the Electron distributable: The Electron app bundles Python via PyInstaller, so you must re-bundle:

    uv pip install -e .
    cd electron/python && ./bundle-python.sh
    cd .. && npm run build:electron
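
The decision rule above can be written down as a small lookup. This is only an illustration of the rule, with a hypothetical function name and flags; the commands are the ones listed in this section, rewritten so each runs from the project root.

```python
def python_change_commands(new_dependency: bool, ship_in_electron: bool) -> list[str]:
    """Return the commands to run after a Python change, per the rule above.

    new_dependency: a dependency was added to pyproject.toml.
    ship_in_electron: the change must reach the Electron distributable.
    """
    if ship_in_electron:
        # Re-bundling starts with the editable install, so it also covers
        # the new-dependency case.
        return [
            "uv pip install -e .",
            "cd electron/python && ./bundle-python.sh",
            "cd electron && npm run build:electron",
        ]
    if new_dependency:
        return ["uv pip install -e ."]
    # Plain source edits are live immediately thanks to the editable install.
    return []


for command in python_change_commands(new_dependency=True, ship_in_electron=False):
    print(command)
```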

macOS Quick Action Only

cd macos-service/scripts && ./install-quick-action.sh

Menu Bar App Only

cd macos-service/scripts && ./dev-test.sh

Dev Mode (no build needed)

cd electron && npm run dev

Hot-reloads renderer changes. The dev server connects to whatever TTS server is running on port 8765.

LavaSR Voice Enhancement

The app integrates LavaSR for speech super-resolution and denoising of voice samples. Enhancement is fully self-bootstrapping:

  1. The first time you click "Set Up LavaSR" in the Save Voice modal or Voice Selector, the app creates a dedicated venv at ~/Library/Application Support/pocket-tts-electron/lavasr-venv/ and uses uv to install torch, torchaudio, soundfile, and LavaSR (from GitHub).
  2. Once set up, the Enhancement Studio lets you preview enhanced vs original audio side-by-side before committing.
  3. Enhanced voices are tagged in voices.json with metadata (denoise settings, RMS normalization).

No external scripts or manual venv management required — it just works.

Screenshots: Save voice with LavaSR enhance option · Enhancement Studio tuning controls · A/B preview of enhanced voice

Fish Audio S2 Pro Backend

An alternate TTS backend using Fish Audio S2 Pro (5B params, MLX 8-bit quantized). Requires Apple Silicon.

Setup

# 1. Install MLX dependencies
uv sync --group mlx

# 2. Download the model (~6.7 GB)
huggingface-cli download mlx-community/fish-audio-s2-pro-8bit \
  --local-dir models/fish-audio-s2-pro-8bit

The entire models/fish-audio-s2-pro-8bit/ directory is gitignored — all files come from HuggingFace so updates are always in sync. The rebuild script (scripts/rebuild-all.sh) installs MLX dependencies automatically on Apple Silicon.

Note: The model is gated under the Fish Audio Research License. If prompted, accept the license at the HuggingFace page and authenticate with huggingface-cli login first.

Usage

Once installed, a Model dropdown appears in the Electron app (and a Select Model submenu in the Menu Bar app). Switch between:

  • Pocket TTS (100M, CPU) — fast, lightweight, built-in voices
  • Fish Audio S2 Pro (5B, MLX) — higher quality, 80+ languages, inline [tag] emotion/prosody control

Only one model is loaded at a time. Switching unloads the current model and loads the new one (~10-15s for fish-speech).

Inline Tags

Fish Audio S2 Pro supports 15,000+ inline tags for fine-grained control:

Fine-Grained Inline Control

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using [tag] syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts free-form textual descriptions — such as [whisper in small voice], [professional broadcast tone], or [pitch up] — allowing open-ended expression control at the word level.

[whisper in small voice]
[professional broadcast tone]
[pitch up]

Common Tags (15,000+ unique tags supported):

[pause] [emphasis] [laughing] [inhale] [chuckle] [tsk] [singing] [excited] [laughing tone] [interrupting] [chuckling] [excited tone] [volume up] [echo] [angry] [low volume] [sigh] [low voice] [whisper] [screaming] [shouting] [loud] [surprised] [short pause] [exhale] [delight] [panting] [audience laughter] [with strong accent] [volume down] [clearing throat] [sad] [moaning] [shocked]

Inline Tag Best Practices

Based on local MLX inference testing, the following guidelines produce the most reliable output:

  • One tag per phrase or sentence. Give the model enough text after a tag to settle into the style before switching. Switching tags on every short sentence degrades quality.
  • Do not stack tags back-to-back. Adjacent tags with no text between them (e.g., [audience laughter][chuckling]) produce garbled or distorted audio. Separate them with natural text or a full sentence.
  • [pause] counts as a tag. Do not place [pause] immediately before another tag — insert text between them.
  • [singing] is unreliable. The model is a TTS system, not a vocoder trained on melodic data. Expect spoken cadence, not actual singing.
  • [screaming] causes distortion. Audio phases out and distorts even with adequate text after the tag. Use [loud] or [troubled] as safer alternatives for high-intensity delivery.
  • Emotion transitions need runway. When shifting between emotions, allow at least one full sentence per tag so the Dual-AR architecture can stabilize the new prosody.

Good:

[excited]I cannot believe this works! After all that effort, we finally have local inference.
[whisper]And the best part is, nobody else knows about it yet.

Bad:

[excited]Wow! [sad]But also sad. [angry]And frustrating! [laughing]Just kidding.
[pause][short pause][excited]Surprise![audience laughter][chuckling]Funny right?
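
The guidelines above are mechanical enough to check before sending text to the model. Below is a rough linter sketch, not part of this repo: `lint_tags` is a hypothetical helper, and the five-word minimum is an arbitrary stand-in for "enough text after a tag", since no exact threshold was measured.

```python
import re

TAG = re.compile(r"\[([^\[\]]+)\]")
# Tags the guidelines above describe as unreliable or distortion-prone.
RISKY_TAGS = {"singing", "screaming"}
# Hypothetical threshold: minimum number of words that should follow a tag.
MIN_WORDS_AFTER_TAG = 5


def lint_tags(text: str) -> list[str]:
    """Return warnings for tag usage that the guidelines above advise against."""
    warnings = []
    matches = list(TAG.finditer(text))
    for i, match in enumerate(matches):
        name = match.group(1).strip().lower()
        if name in RISKY_TAGS:
            warnings.append(f"[{name}] is flagged as unreliable")
        has_next = i + 1 < len(matches)
        # Text between this tag and the next tag (or the end of the input).
        stop = matches[i + 1].start() if has_next else len(text)
        following = text[match.end():stop]
        if has_next and not following.strip():
            next_name = matches[i + 1].group(1)
            warnings.append(f"[{name}] is stacked against [{next_name}]")
        elif len(following.split()) < MIN_WORDS_AFTER_TAG:
            count = len(following.split())
            warnings.append(f"[{name}] is followed by only {count} word(s)")
    return warnings


good = ("[excited]I cannot believe this works! After all that effort, "
        "we finally have local inference.\n"
        "[whisper]And the best part is, nobody else knows about it yet.")
print(lint_tags(good))
# []
```

Run against the two "Bad" lines, the same function reports every tag: short runs of text on the first line, and stacked tags on the second.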

Limitations

  • Voice cloning uses custom WAV files only — predefined pocket-tts voices (Alba, etc.) are not available with this backend.
  • Multi-Talk mode is pocket-tts only.
  • Model weights (~6.7 GB) download from HuggingFace on first use.

macOS Quick Action

System-wide text-to-speech from any application.

Setup

  1. Install: cd macos-service/scripts && ./install-quick-action.sh
  2. Enable in System Settings → Keyboard → Shortcuts → Services → find "Read Selection with Pocket TTS"
  3. Optional: assign a keyboard shortcut (e.g., F19)
  4. Start the server: uv run pocket-tts serve --port 8765 (or install the LaunchAgent for auto-start)

Usage

Select text anywhere → right-click → Services → "Read Selection with Pocket TTS". Audio streams immediately via ffplay.

Logs

~/Library/Logs/PocketTTS/tts-stream-YYYY-MM-DD.log
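
To locate the log for a given day from a script, the path pattern above can be expanded as follows. The helper name is hypothetical, and the sketch assumes the date in the filename is the local calendar date on which the log was written.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def stream_log_path(day: Optional[date] = None) -> Path:
    """Build the Quick Action log path for a day (defaults to today)."""
    day = day or date.today()
    # date.isoformat() yields YYYY-MM-DD, matching the pattern above.
    filename = f"tts-stream-{day.isoformat()}.log"
    return Path.home() / "Library" / "Logs" / "PocketTTS" / filename


if __name__ == "__main__":
    path = stream_log_path()
    print(path, "(exists)" if path.exists() else "(not found)")
```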

Menu Bar App

A native Swift menu bar app (macos-service/PocketTTSMenuBar/) for:

  • Voice selection (syncs with Electron app and Quick Action)
  • Server status monitoring
  • Stop Speaking control

Built with AppKit rather than the SwiftUI App lifecycle, which fixed an issue where the menu did not appear. Installed to ~/Applications/Pocket TTS Menu Bar.app.

Architecture (from upstream)

Text → SentencePiece tokenizer → LUTConditioner (embeddings)
                                       ↓
Audio prompt → Mimi encoder → voice state → FlowLMModel (CaLM) → latent frames
                                                                        ↓
                                                               Mimi decoder → PCM audio
  • Thread 1: CaLM generates latent frames autoregressively (12.5 Hz, 80ms/frame)
  • Thread 2: Mimi decoder converts latents to waveform in parallel
  • CPU-only — GPU provides no speedup at this model size (~100M params)
  • Not thread-safe — server does not support concurrent requests
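
The two-thread arrangement is a standard producer/consumer pipeline. The sketch below shows the pattern only; it is not upstream's code, and `generate_frame` and `decode_frame` are placeholder callables standing in for the CaLM step and the Mimi decoder.

```python
import queue
import threading

FRAME_RATE_HZ = 12.5
FRAME_MS = 1000 / FRAME_RATE_HZ  # 80.0 ms of audio per latent frame


def run_pipeline(num_frames: int, generate_frame, decode_frame) -> list:
    """Generate frames on one thread while decoding them on another."""
    frames: queue.Queue = queue.Queue()
    decoded = []
    done = object()  # sentinel marking the end of generation

    def producer():
        try:
            for index in range(num_frames):
                frames.put(generate_frame(index))
        finally:
            # Always signal completion so the consumer cannot block forever.
            frames.put(done)

    def consumer():
        while (item := frames.get()) is not done:
            decoded.append(decode_frame(item))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return decoded


# Stand-in callables: "generate" yields the frame index, "decode" doubles it.
print(run_pipeline(5, lambda i: i, lambda latent: latent * 2))
# [0, 2, 4, 6, 8]
```

Because the decoder consumes frames as they arrive, audio can start playing before generation finishes. At 12.5 Hz, ten seconds of speech corresponds to 125 latent frames.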

Testing

uv run pytest -n 3 -v                              # all tests (3 parallel workers)
uv run pytest tests/test_cli_generate.py -v         # single file
uv run pytest tests/test_cli_generate.py -k "name"  # single test

Linting

Ruff via pre-commit. Line length 100, LF endings, relative imports banned.

uvx pre-commit install          # one-time setup
uvx pre-commit run --all-files  # manual run

Gotchas

  • PyTorch < 2.5 produces incorrect audio. The ≥ 2.5 minimum is enforced in pyproject.toml.
  • Python 3.10–3.14 only. uv manages its own Python — system Python may lack headers.
  • Electron won't load? Check ELECTRON_RUN_AS_NODE is not set: unset ELECTRON_RUN_AS_NODE
  • Voice cloning requires gated HF model access (uvx hf auth login). Predefined voices work without auth.
  • Editable install means Python source changes are live immediately. New deps require uv pip install -e . from project root.
  • Electron distributable bundles PyInstaller output, not source — must re-bundle after Python changes.

Prohibited Use

Use of our model must comply with all applicable laws and regulations and must not result in, involve, or facilitate any illegal, harmful, deceptive, fraudulent, or unauthorized activity. Prohibited uses include, without limitation, voice impersonation or cloning without explicit and lawful consent; misinformation, disinformation, or deception (including fake news, fraudulent calls, or presenting generated content as genuine recordings of real people or events); and the generation of unlawful, harmful, libelous, abusive, harassing, discriminatory, hateful, or privacy-invasive content. We disclaim all liability for any non-compliant use.

Authors

Upstream (Kyutai): Manu Orsini*, Simon Rouard*, Gabriel De Marmiesse*, Václav Volhejn, Neil Zeghidour, Alexandre Défossez (*equal contribution)

This fork: John Saunders — Electron app, LavaSR integration, macOS Quick Action, Menu Bar app, text normalizer, pause/resume/stop controls, reusable creations with metadata, .mp3/.mpa/.wav export
