Six concrete project ideas to inspire your own build. Each one names the failure mode it stresses, the tools that fit best, and what would make a strong demo.
These are not templates - they're sketches. Pick one, remix two, or use them as ammunition for something else entirely.
| # | Project | Failure mode it stresses |
|---|---|---|
| 1 | Meeting Discipline Keeper | Overlapping speech, room echo, varying mic distances |
| 2 | Calisthenics Trainer | Loud music, dropped weights, gym noise |
| 3 | Voice-Controlled Language Learning Game | Non-native pronunciation, hesitant speech |
| 4 | The "Rema 1000" Smart Home | Slurred / mumbled / whispered / accented speech |
| 5 | Private Local Voice Agent (Edge) | Resource constraints, no cloud, on-device |
| 6 | Voice Pipeline Failure Autopsy | Systematic measurement of where things break |
A bot that joins a meeting, tracks the agenda in real time, interrupts off-topic speakers, asks clarifying questions, and updates notes live. Over time, it builds a memory of each participant - who runs late, who derails, who never speaks.
What's hard: Conference rooms are acoustically hostile. Overlapping speech, room echo, varying mic distances, background noise. STT breaks down exactly when the meeting is most chaotic - and that's when the bot needs to understand best.
Tools that fit:
- Agent framework: Pipecat or LiveKit Agents
- Diarization / overlap detection: pyannote-audio
- STT: faster-whisper or whisperX for word-level timestamps
- Speech enhancement before STT: DeepFilterNet or similar
- Notes integration: any structured doc API (Notion, Google Docs, Linear, etc.)
Strong demo angle: Re-enact a chaotic meeting live. The bot interrupts an off-topic speaker mid-sentence, asks for clarification, then updates a shared doc in real time.
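One way to decide when the meeting is "most chaotic" is to find the time spans where two or more speakers talk at once. The sketch below assumes diarization output has already been converted into simple `(speaker, start, end)` turns; the `Turn` type and `overlapping_regions` function are hypothetical names for illustration, not part of pyannote-audio.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    """One diarized speech turn (hypothetical shape, times in seconds)."""
    speaker: str
    start: float
    end: float


def overlapping_regions(turns: list[Turn]) -> list[tuple[float, float]]:
    """Return the time spans where at least two turns are active at once."""
    # Sweep line: +1 when a turn starts, -1 when it ends.
    events = []
    for turn in turns:
        events.append((turn.start, 1))
        events.append((turn.end, -1))
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back turns are not reported as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))

    regions = []
    active = 0
    overlap_start = None
    for timestamp, delta in events:
        active += delta
        if active >= 2 and overlap_start is None:
            overlap_start = timestamp
        elif active < 2 and overlap_start is not None:
            regions.append((overlap_start, timestamp))
            overlap_start = None
    return regions
```

The resulting spans could be used to flag segments whose transcript deserves less trust, or to decide when the bot should ask for a repeat rather than act on what it heard.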
A real-time multimodal voice coach (Gemini Live, OpenAI Realtime, etc.) that watches you exercise via camera, counts reps, corrects form, and motivates you mid-set.
What's hard: The gym is one of the most hostile environments for voice AI - loud music, dropped weights, multiple people talking nearby. Without front-end audio cleanup, the model hears music more than the user.
Tools that fit:
- Realtime S2S model: Gemini Live or OpenAI Realtime API
- Speech enhancement: DeepFilterNet, Resemble Enhance, or commercial alternatives
- Vision: any pose-estimation model (MediaPipe, MoveNet) for rep counting
- Camera + audio capture: WebRTC via LiveKit
Strong demo angle: Do a set of squats live at the demo while the coach counts reps and corrects form. Play loud music in the background to prove the audio pipeline is doing the work.
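Rep counting from pose keypoints can be sketched as a joint-angle calculation plus a two-threshold state machine. This is a minimal illustration, not how any particular pose model works: it assumes you already extract 2D `(x, y)` keypoints for hip, knee, and ankle per frame, and the threshold values (100 and 160 degrees) are made-up defaults you would tune on real footage.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    raw = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    angle = abs(raw)
    # Fold reflex angles back into the 0..180 range.
    return 360.0 - angle if angle > 180.0 else angle


def count_reps(knee_angles, down_below=100.0, up_above=160.0):
    """Count squat reps from a per-frame series of knee angles (degrees).

    A rep is counted when the angle drops below `down_below` and then rises
    above `up_above`. The gap between the two thresholds (hysteresis) keeps
    frame-to-frame jitter from producing extra reps.
    """
    reps = 0
    at_bottom = False
    for angle in knee_angles:
        if not at_bottom and angle < down_below:
            at_bottom = True
        elif at_bottom and angle > up_above:
            at_bottom = False
            reps += 1
    return reps
```

For a squat, `joint_angle(hip, knee, ankle)` would be computed per frame and the resulting series passed to `count_reps`.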
A 2D game where the player controls their character only by speaking the language they're learning. Each level is a real-life scenario - ordering at a cafe, buying a train ticket - and NPCs respond naturally, forcing real conversation.
What's hard: Language learners are the hardest STT users. Non-native pronunciation, hesitant speech, accent variation. Add a noisy classroom or home and standard STT misses badly enough to break the game loop.
Tools that fit:
- STT with strong multilingual support: Whisper (large-v3 or turbo), faster-whisper
- Pronunciation scoring: WER against the expected phrase, or SpeechBrain phoneme recognizers for finer-grained feedback
- LLM for NPC dialogue: any frontier model with a system prompt locking it to the target language
- Game framework: anything you're already comfortable in - Phaser, Godot, plain HTML canvas
Strong demo angle: Live audience plays a level. Show the WER score on screen - before and after enhancement - so they see the recognition improve in real time.
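Scoring an utterance by WER against the expected phrase boils down to a word-level edit distance. A dependency-free sketch follows; in practice a library such as jiwer would likely be the better choice, and this version only lowercases and splits on whitespace (no punctuation or accent normalization), which is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

A game loop might accept the player's line when the score falls under some threshold; the right cutoff depends on the level and would need tuning with real learners.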
Inspired by the Norwegian supermarket ad where a man's voice-controlled smart home becomes unusable after a dentist visit because his speech is slurred. Build a voice-controlled environment and prove voice control still works when the speaker is impaired - mumbling, illness, whispering, post-anaesthesia, non-native accents.
What's hard: Most voice AI tests failure from the environment (background noise). This one tests failure from the speaker themselves. Different problem, different fix - input speech needs to be clarified, not just denoised.
Tools that fit:
- Smart home simulator: a simple web UI with toggleable lights / locks / thermostat is plenty for a demo
- STT robust to degraded speech: try Whisper large-v3 or SeamlessM4T, or benchmark several models side by side
- Speech restoration: voicefixer, Resemble Enhance
- Test corpus: TORGO or UA-Speech for dysarthric reference; or record yourself whispering / mumbling
Strong demo angle: Side-by-side. Same commands, two pipelines. One fails comically (turn on bathroom light → "playing Bathroom by Lana Del Rey"), the other succeeds. Bonus: re-enact the ad.
Accessibility angle: This isn't just funny - it's the right framing for elderly users, people with speech impediments, and non-native speakers, who are systematically excluded by current voice AI.
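The side-by-side comparison can be scored with a small harness that maps each transcript to a device action and checks it against the intended command. Everything here is illustrative: the `COMMANDS` table, the device names, and the normalization rule (lowercase, drop punctuation and the word "the") are assumptions for a toy simulator.

```python
import re

# Hypothetical command table for a simulated home: phrase -> (device, new state).
COMMANDS = {
    "turn on bathroom light": ("bathroom_light", "on"),
    "turn off bathroom light": ("bathroom_light", "off"),
    "lock front door": ("front_door", "locked"),
}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and drop the filler word 'the'."""
    words = re.sub(r"[^a-z ]", "", text.lower()).split()
    return " ".join(word for word in words if word != "the")


def success_rate(intended: list[str], transcripts: list[str]) -> float:
    """Fraction of transcripts that resolve to the same action as the intended command."""
    if not intended or len(intended) != len(transcripts):
        raise ValueError("need one transcript per intended command")
    hits = 0
    for want, heard in zip(intended, transcripts):
        expected_action = COMMANDS[normalize(want)]
        actual_action = COMMANDS.get(normalize(heard))  # None if unrecognized
        hits += actual_action == expected_action
    return hits / len(intended)
```

Feeding the same recordings through both pipelines and calling `success_rate` on each set of transcripts gives a single number per pipeline to put on screen.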
A fully offline voice agent running on a Raspberry Pi or similar - no cloud, no API keys, no data leaving the device. Useful for privacy-sensitive deployments (healthcare, security, kids' devices).
What's hard: Everything has to run small. Small STT, small LLM, small enhancement model, small TTS. And the audio quality is usually worse - cheap mics, no acoustic treatment.
Tools that fit:
- Lightweight STT: whisper.cpp (`tiny` or `base.en`), faster-whisper with int8 quantization
- Lightweight LLM: llama.cpp with a 1B–4B parameter model, Ollama for ease
- Lightweight TTS: piper - runs comfortably on a Pi
- Lightweight VAD: silero-vad
- Lightweight enhancement: RNNoise, DeepFilterNet (Rust runtime)
- Wake word: openWakeWord
Strong demo angle: Show the device unplugged from the internet. Cheap USB mic. Still works. Bonus: show power draw / latency numbers.
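For the latency numbers, a per-stage timer is usually enough to show where the time goes on a small device. The sketch below uses only the standard library; `StageTimer` and the stage names are hypothetical, and the actual STT / LLM / TTS calls are left to whatever runtime you pick.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations per named pipeline stage."""

    def __init__(self):
        self.durations = {}  # stage name -> list of durations in seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            elapsed = time.perf_counter() - start
            self.durations.setdefault(name, []).append(elapsed)

    def summary_ms(self):
        """Mean duration per stage, in milliseconds."""
        return {
            name: 1000.0 * sum(values) / len(values)
            for name, values in self.durations.items()
        }
```

Wrapping each step as `with timer.stage("stt"): ...` over a handful of utterances and printing `timer.summary_ms()` yields a small per-stage latency table for the demo.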
Less of an app, more of a diagnostic tool. Systematically rank where a voice pipeline breaks. Measure how much audio quality alone degrades results. Measure how much enhancement recovers. Quantify how STT model and LLM choice compound the problem.
What's hard: This isn't a build challenge - it's a measurement challenge. The contribution is a clear, reproducible answer to "where does voice break first?"
Tools that fit:
- Quality metrics: VERSA, DNSMOS, NISQA
- Augmentation: audiomentations, pedalboard
- WER: jiwer
- Reference test set: LibriSpeech test-clean
- Plotting: pandas + matplotlib / Plotly is plenty
See ../quality-dashboard/ for the wiring details.
Strong demo angle: A single chart. SNR on the x-axis, with DNSMOS-OVRL and WER each plotted before and after enhancement. The story tells itself.
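To sweep the SNR axis, clean speech has to be mixed with noise at a controlled level. The arithmetic is sketched below in plain Python for clarity; it assumes both signals are sequences of float samples at the same sample rate, and `mix_at_snr` is a hypothetical helper. In a real run you would more likely use NumPy arrays or the transforms in audiomentations.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]

    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # Choose gain so that 10 * log10(speech_power / (gain^2 * noise_power)) == snr_db.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Running each mixed file through the pipeline with and without enhancement, then recording WER and the quality metric at every SNR step, produces the data points for the chart.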
PRs welcome. See ../CONTRIBUTING.md. Real examples from real builds are especially valuable.