Example Projects

Six concrete project ideas to inspire your own build. Each one names the failure mode it stresses, the tools that fit best, and what would make a strong demo.

These are not templates - they're sketches. Pick one, remix two, or use them as ammunition for something else entirely.

| # | Project | Failure mode it stresses |
| --- | --- | --- |
| 1 | Meeting Discipline Keeper | Overlapping speech, room echo, varying mic distances |
| 2 | Calisthenics Trainer | Loud music, dropped weights, gym noise |
| 3 | Voice-Controlled Language Learning Game | Non-native pronunciation, hesitant speech |
| 4 | The "Rema 1000" Smart Home | Slurred / mumbled / whispered / accented speech |
| 5 | Private Local Voice Agent (Edge) | Resource constraints, no cloud, on-device |
| 6 | Voice Pipeline Failure Autopsy | Systematic measurement of where things break |

1. Meeting Discipline Keeper

A bot that joins a meeting, tracks the agenda in real time, interrupts off-topic speakers, asks clarifying questions, and updates notes live. Over time, builds a memory of each participant - who runs late, who derails, who never speaks.

What's hard: Conference rooms are acoustically hostile. Overlapping speech, room echo, varying mic distances, background noise. Speech-to-text (STT) breaks down exactly when the meeting is most chaotic - and that's when the bot needs to understand best.

Tools that fit:

Strong demo angle: Re-enact a chaotic meeting live. The bot interrupts an off-topic speaker mid-sentence, asks for clarification, then updates a shared doc in real time.


2. Calisthenics Trainer with Vision

A real-time multimodal voice coach (Gemini Live, OpenAI Realtime, etc.) that watches you exercise via camera, counts reps, corrects form, and motivates you mid-set.

What's hard: The gym is one of the most hostile environments for voice AI - loud music, dropped weights, multiple people talking nearby. Without front-end audio cleanup, the model hears music more than the user.

Tools that fit:

Strong demo angle: Do a set of squats live at the demo while the coach counts reps and corrects form. Play loud music in the background to prove the audio pipeline is doing the work.


3. Voice-Controlled Language Learning Game

A 2D game where the player controls their character only by speaking the language they're learning. Each level is a real-life scenario - ordering at a cafe, buying a train ticket - and NPCs respond naturally, forcing real conversation.

What's hard: Language learners are the hardest STT users. Non-native pronunciation, hesitant speech, accent variation. Add a noisy classroom or home and standard STT misses badly enough to break the game loop.

Tools that fit:

  • STT with strong multilingual support: Whisper (large-v3 or turbo), faster-whisper
  • Pronunciation scoring: word error rate (WER) against the expected phrase, or use SpeechBrain phoneme recognizers for finer-grained feedback
  • LLM for NPC dialogue: any frontier model with a system prompt locking it to the target language
  • Game framework: anything you're already comfortable in - Phaser, Godot, plain HTML canvas
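
As a rough illustration of the WER-based scoring mentioned above, here is a minimal sketch in plain Python. It assumes the game knows the expected phrase for each prompt and gets a plain-text transcript back from the STT step. The function name `word_error_rate` is made up for this example, and normalization is limited to lowercasing and whitespace splitting, so punctuation handling is left to you.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference phrase must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Example: the learner dropped one of five expected words.
score = word_error_rate("i would like a coffee", "i like a coffee")
```

A score of 0.0 means the transcript matched the expected phrase word for word; values can exceed 1.0 when the transcript contains many extra words. A game would probably map this to a pass / retry threshold rather than show the raw number.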

Strong demo angle: Live audience plays a level. Show the WER score on screen - before and after enhancement - so they see the recognition improve in real time.


4. The "Rema 1000" Smart Home

Inspired by the Norwegian supermarket ad where a man's voice-controlled smart home becomes unusable after a dentist visit because his speech is slurred. Build a voice-controlled environment and prove voice control still works when the speaker is impaired - mumbling, illness, whispering, post-anaesthesia, non-native accents.

What's hard: Most voice AI testing targets failures caused by the environment (background noise). This one targets failures caused by the speaker. Different problem, different fix - input speech needs to be clarified, not just denoised.

Tools that fit:

  • Smart home simulator: a simple web UI with toggleable lights / locks / thermostat is plenty for a demo
  • STT robust to degraded speech: try Whisper large-v3, SeamlessM4T, or benchmark several side by side
  • Speech restoration: voicefixer, Resemble Enhance
  • Test corpus: TORGO or UA-Speech for dysarthric reference; or record yourself whispering / mumbling

Strong demo angle: Side-by-side. Same commands, two pipelines. One fails comically (turn on bathroom light → "playing Bathroom by Lana Del Rey"), the other succeeds. Bonus: re-enact the ad.
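
One possible way to score the side-by-side comparison is sketched below. It assumes each pipeline has already produced a text transcript per spoken command. The command table, the action ids, the 0.8 similarity threshold, and the function names are all hypothetical choices for this sketch, and fuzzy string matching via the standard library's `difflib` is only one option for mapping a transcript to a command.

```python
from difflib import SequenceMatcher

# Hypothetical command table for the simulated smart home.
COMMANDS = {
    "turn on bathroom light": "light.bathroom.on",
    "turn off bathroom light": "light.bathroom.off",
    "lock front door": "lock.front_door.lock",
}


def match_command(transcript, threshold=0.8):
    """Return the action id of the closest known command, or None if nothing is close enough."""
    text = " ".join(transcript.lower().split())  # lowercase and collapse whitespace
    best_action, best_score = None, 0.0
    for phrase, action in COMMANDS.items():
        score = SequenceMatcher(None, text, phrase).ratio()
        if score > best_score:
            best_action, best_score = action, score
    return best_action if best_score >= threshold else None


def success_rate(transcripts, expected_actions):
    """Fraction of transcripts that resolve to the expected action."""
    hits = sum(match_command(t) == a for t, a in zip(transcripts, expected_actions))
    return hits / len(expected_actions)
```

Run `success_rate` once per pipeline on the same recordings and the same expected actions; the two numbers are the side-by-side result. The threshold trades false triggers against missed commands, so it is worth tuning on your own recordings.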

Accessibility angle: This isn't just funny - it's the right framing for elderly users, people with speech impediments, and non-native speakers, who are systematically excluded by current voice AI.


5. Private Local Voice Agent (Edge AI)

A fully offline voice agent running on a Raspberry Pi or similar - no cloud, no API keys, no data leaving the device. Useful for privacy-sensitive deployments (healthcare, security, kids' devices).

What's hard: Everything has to run small. Small STT, small LLM, small enhancement model, small TTS. And the audio quality is usually worse - cheap mics, no acoustic treatment.

Tools that fit:

Strong demo angle: Show the device unplugged from the internet. Cheap USB mic. Still works. Bonus: show power draw / latency numbers.
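
For the latency numbers, a small timing harness is usually enough. The sketch below assumes each pipeline stage can be wrapped as a Python callable that takes the previous stage's output. The stage names and the lambda placeholders are stand-ins for illustration, not real models.

```python
import time


def time_stages(stages, payload):
    """Run (name, callable) stages in order; return the final output and per-stage latency in ms."""
    timings = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = (time.perf_counter() - start) * 1000.0
    return payload, timings


# Placeholder stages; swap in your on-device enhancement, STT, LLM, and TTS calls.
demo_stages = [
    ("enhance", lambda audio: audio),
    ("stt", lambda audio: "what time is it"),
    ("llm", lambda text: "reply to: " + text),
    ("tts", lambda text: text.encode("utf-8")),
]
output, timings = time_stages(demo_stages, b"raw-audio-bytes")
```

Printing `timings` after each interaction gives a per-stage breakdown, which tends to be more convincing on a constrained device than a single end-to-end figure.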


6. Voice Pipeline Failure Autopsy

Less of an app, more of a diagnostic tool. Systematically rank where a voice pipeline breaks. Measure how much audio quality alone degrades results. Measure how much enhancement recovers. Quantify how STT model and LLM choice compound the problem.

What's hard: This isn't a build challenge - it's a measurement challenge. The contribution is a clear, reproducible answer to "where does voice break first?"

Tools that fit:

See ../quality-dashboard/ for the wiring details.

Strong demo angle: A single chart. X-axis is SNR; plot two metrics (DNSMOS-OVRL and WER), each before and after enhancement. The story tells itself.
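
To get points along the SNR axis, one common approach is to mix clean speech with noise at controlled levels and run each mix through the pipeline. The sketch below is a minimal pure-Python version. It assumes both signals are already decoded to lists of float samples at the same sample rate; the function names are made up for this example, and a real build would more likely use NumPy arrays.

```python
import math


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise ratio equals snr_db, then add it to the speech."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if speech_rms == 0.0 or noise_rms == 0.0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR in dB is 20 * log10(speech_rms / noise_rms); solve for the noise level.
    target_noise_rms = speech_rms / (10.0 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` over a range such as 20, 10, 5, 0 and scoring each mix with and without enhancement yields the before / after curves. Note that the mixed signal can exceed the range [-1, 1], so check for clipping before writing it back to a fixed-point audio file.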


Want to add one?

PRs welcome. See ../CONTRIBUTING.md. Real examples from real builds are especially valuable.