The Piper sample generator uses text-to-speech to synthesize many wake word samples. We also generate adversarial phrase samples using openWakeWord.
We apply several augmentations to the generated samples. We use the following sources for background audio:
- FSD50K: An Open Dataset of Human-Labeled Sound Events - (Various Creative Commons Licenses.)
- FMA: A Dataset For Music Analysis - (Creative Commons Attribution 4.0 International License.)
- WHAM!: Extending Speech Separation to Noisy Environments - (Creative Commons Attribution-NonCommercial 4.0 International License.)
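One common way to apply such background audio is to mix a noise clip into each sample at a randomly chosen signal-to-noise ratio. This is an illustrative sketch, not the project's actual augmentation code; `mix_background` is a hypothetical helper, and 16 kHz mono float arrays are assumed:

```python
import numpy as np

def mix_background(clip: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a background noise clip into a sample at a target SNR (dB)."""
    # Loop or trim the noise so it matches the clip length.
    if len(noise) < len(clip):
        noise = np.tile(noise, int(np.ceil(len(clip) / len(noise))))
    noise = noise[: len(clip)]

    clip_power = np.mean(clip ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so clip_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(clip_power / (noise_power * 10 ** (snr_db / 10)))
    return clip + scale * noise

rng = np.random.default_rng(0)
sample = rng.standard_normal(16000).astype(np.float32)  # 1 s at 16 kHz
noise = rng.standard_normal(8000).astype(np.float32)
augmented = mix_background(sample, noise, snr_db=rng.uniform(0.0, 15.0))
```

Drawing the SNR from a range, rather than fixing it, exposes the model to both lightly and heavily corrupted copies of each sample.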
We reverberate the samples with room impulse responses from BIRD: Big Impulse Response Dataset.
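Reverberation with a recorded impulse response amounts to convolving the dry sample with the RIR. A minimal NumPy-only sketch, again illustrative rather than the actual pipeline code:

```python
import numpy as np

def reverberate(clip: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Apply a room impulse response by convolution, keeping the original length."""
    wet = np.convolve(clip, rir, mode="full")[: len(clip)]
    # Rescale so the reverberated clip keeps the dry clip's peak level.
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(clip)) / peak)
    return wet
```

In practice a library FFT-based convolution is faster for long RIRs, but the result is the same.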
We use a variety of ambient background noise sources as negative samples during training:
- Voices Obscured in Complex Environmental Settings (VOiCES) corpus - (Creative Commons Attribution 4.0 License.)
- Common Voice: A Massively-Multilingual Speech Corpus - (Creative Commons CC0 License.)
- FSD50K: An Open Dataset of Human-Labeled Sound Events
- FMA: A Dataset For Music Analysis - reverberated with room impulse responses
- WHAM!: Extending Speech Separation to Noisy Environments
We generate separate positive and negative samples solely for validation and testing, augmented in the same way as the training data. We split the FSD50K, FMA, and WHAM! datasets 90-10 into training and testing sets; none of their audio appears in the validation set. During training, we estimate the false accepts per hour with the VOiCES validation set and DiPCo - Dinner Party Corpus (Community Data License Agreement - Permissive, Version 1.0 License). After training, we test the false accepts per hour in streaming mode with the DiPCo set.
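The 90-10 split described above can be sketched as a deterministic shuffle of file paths. The function name, seed, and file names are illustrative, not taken from the actual tooling:

```python
import random

def split_dataset(files, train_frac=0.9, seed=0):
    """Deterministically split file paths into training and testing sets."""
    files = sorted(files)             # fixed order before shuffling for reproducibility
    random.Random(seed).shuffle(files)
    cut = int(len(files) * train_frac)
    return files[:cut], files[cut:]

train, test = split_dataset([f"clip_{i}.wav" for i in range(100)])
# 90 training files, 10 test files, with no file in both sets
```

Seeding the shuffle keeps the split stable across runs, so test clips never leak into training when the pipeline is re-executed.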