Fix typos and punctuation
AnirudhBHarish committed Jul 20, 2021
commit 11b718eaadd0f39ee9926da892ccea61d539cc25
40 changes: 20 additions & 20 deletions applications/KWS_Phoneme/README.md
@@ -1,25 +1,25 @@
# Phoneme based Keyword Spotting(KWS)
# Phoneme-based Keyword Spotting (KWS)

# Project Description
There are two major issues in the existing KWS systems (a) they are not robust to heavy background noise and random utterances, and (b) they require collecting a lot of data, hampering the ease of adding a new keyword. Tackling these issues from a different perspective, we propose a new two staged scheme with a model for predicting phonemes which are in turn used for phoneme based keyword classification.
There are two major issues in existing KWS systems: (a) they are not robust to heavy background noise and random utterances, and (b) they require collecting a lot of data, which hampers the ease of adding a new keyword. Tackling these issues from a different perspective, we propose a new two-stage scheme: a model first predicts phonemes, which are in turn used for phoneme-based keyword classification.

First we train a phoneme classification model which gives the phoneme transcription of the input speech snippet. For training this phoneme classifier, we use a large public speech dataset like LibriSpeech. The public dataset can be aligned (meaning get the phoneme labels for each speech snippet in the data) using Montreal Forced Aligner. We also add reverberations and additive noise to the speech samples from the public dataset to make the phoneme classifier training robust to various accents, background noise and varied environment. In this project, we predict phonemes at every 10ms which is the standard way. You can find the aligned LibriSpeech dataset we used for training here.
First, we train a phoneme classification model that gives the phoneme transcription of the input speech snippet. To train this phoneme classifier, we use a large public speech dataset such as LibriSpeech. The public dataset can be aligned (that is, phoneme labels can be obtained for each speech snippet in the data) using the Montreal Forced Aligner. We also add reverberation and additive noise to the speech samples from the public dataset to make the phoneme classifier robust to various accents, background noise and varied environments. In this project, we predict phonemes every 10 ms, which is the standard practice. You can find the aligned LibriSpeech dataset we used for training here.
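
To make the 10 ms frame rate concrete, here is a minimal sketch assuming 16 kHz audio and a torchaudio log-mel front end (an assumption for illustration; the actual feature pipeline in this repo may differ):
```
import torch
import torchaudio

# At 16 kHz, a hop of 160 samples yields one feature frame per 10 ms, so a
# phoneme classifier over these frames emits one phoneme prediction per 10 ms.
SAMPLE_RATE = 16000
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=400, hop_length=160, n_mels=80
)

waveform = torch.randn(1, SAMPLE_RATE)   # 1 second of dummy audio
features = melspec(waveform)             # shape (1, 80, ~101): ~100 frames/s
print(features.shape)                    # one frame per 10 ms of speech
```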

In the second part, we use the predicted phoneme outputs from the phoneme classifier for predicting the input keyword. We train a 1 layer FastGRNN classifier to predict the keyword based on the phoneme transcription as input. Since the phoneme classifier training has been done to account for diverse accent, background noise and environments, the keyword classifier can be trained using a small number of Text-To-Speech(TTS) samples generated using any standard TTS api from cloud services like Azure, Google Cloud or AWS.
In the second stage, we use the predicted phoneme outputs from the phoneme classifier to predict the input keyword. We train a 1-layer FastGRNN classifier to predict the keyword from the phoneme transcription. Since the phoneme classifier has been trained to account for diverse accents, background noise and environments, the keyword classifier can be trained with a small number of Text-To-Speech (TTS) samples generated using any standard TTS API from cloud services such as Azure, Google Cloud or AWS.
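
For concreteness, here is a minimal PyTorch sketch of such a 1-layer FastGRNN keyword classifier, following the published FastGRNN cell equations; the class names, hidden size and phoneme-set size are illustrative assumptions, not the modules actually used in this repo:
```
import torch
import torch.nn as nn

class FastGRNNCell(nn.Module):
    """FastGRNN cell: shared W, U for gate and candidate, plus two scalars."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)
        self.b_z = nn.Parameter(torch.zeros(hidden_size))
        self.b_h = nn.Parameter(torch.zeros(hidden_size))
        self.zeta = nn.Parameter(torch.tensor(1.0))   # trainable scalars,
        self.nu = nn.Parameter(torch.tensor(-4.0))    # squashed by sigmoid below

    def forward(self, x_t, h_prev):
        pre = self.W(x_t) + self.U(h_prev)
        z = torch.sigmoid(pre + self.b_z)             # update gate
        h_tilde = torch.tanh(pre + self.b_h)          # candidate state
        # h_t = (zeta * (1 - z) + nu) * h_tilde + z * h_prev
        return (torch.sigmoid(self.zeta) * (1 - z)
                + torch.sigmoid(self.nu)) * h_tilde + z * h_prev

class KeywordClassifier(nn.Module):
    """1-layer FastGRNN over a phoneme-posterior sequence, then a linear head."""
    def __init__(self, num_phonemes=41, hidden_size=128, num_keywords=30):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = FastGRNNCell(num_phonemes, hidden_size)
        self.head = nn.Linear(hidden_size, num_keywords)

    def forward(self, phoneme_seq):                   # (batch, time, num_phonemes)
        h = phoneme_seq.new_zeros(phoneme_seq.size(0), self.hidden_size)
        for t in range(phoneme_seq.size(1)):
            h = self.cell(phoneme_seq[:, t, :], h)
        return self.head(h)                           # keyword logits
```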

This gives two advantages: (a) the phoneme model is trained to account for diverse accents and background-noise settings, so the flexible keyword-classifier training requires only a small number of keyword samples, and (b) empirically, this method was able to detect keywords from as far as 9 ft away. Further, the phoneme model is small, around 250k parameters, and fits on a Cortex-M7 microcontroller.

# Training the Phoneme Classifier
1) Train a phoneme classification model on some public speech dataset like LibriSpeech
2) Training speech dataset can be labelled using Montreal Force Aligner
3) Speech snippets are convolved with reverberation files, and additive noises from YouTube or other open source are added
4) We also add white gaussian noise of various SNRs
1) Train a phoneme classification model on a public speech dataset such as LibriSpeech.
2) The training speech dataset can be labelled using the Montreal Forced Aligner.
3) Speech snippets are convolved with reverberation files, and additive noise from YouTube or other open sources is added (see the sketch after this list).
4) We also add white Gaussian noise at various SNRs.
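
The sketch below illustrates steps 3 and 4 with NumPy/SciPy; the helper names are ours, and the repo's actual augmentation code may differ:
```
import numpy as np
from scipy.signal import fftconvolve

def add_reverb(speech, rir):
    """Convolve speech with a room impulse response and renormalize."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet / (np.max(np.abs(wet)) + 1e-8)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR, then add it."""
    if len(noise) < len(speech):                  # tile short noise clips
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

speech = np.random.randn(16000)                        # stand-in for a 1 s snippet
white = np.random.randn(16000)                         # white Gaussian noise
augmented = add_noise_at_snr(speech, white, snr_db=5)  # step 4 at 5 dB SNR
```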

# Training the KWS Model
1) Our method takes as input the speech snippet and passes it through the phoneme classifier
2) Keywords are detected by training a keyword classifier over the detected phonemes
3) For training the keyword classifier, we use Azure and Google Text-To-Speech API to get the training data (keyword snippets)
4) For example, if you want to train a Keyword classifier for the keywords in the Google30 dataset, generate TTS samples from the Azure/Google-Cloud/AWS API for each of the 30 keywords. The TTS samples for each keyword must be stored in a separate folder named according to the keyword. More details about how the generated TTS data should be stored are mentioned below in sample use case for classifier model training.
1) Our method takes a speech snippet as input and passes it through the phoneme classifier.
2) Keywords are detected by training a keyword classifier over the detected phonemes.
3) To train the keyword classifier, we use the Azure and Google Text-To-Speech APIs to generate the training data (keyword snippets).
4) For example, to train a keyword classifier for the keywords in the Google30 dataset, generate TTS samples from the Azure/Google-Cloud/AWS API for each of the 30 keywords. The TTS samples for each keyword must be stored in a separate folder named after the keyword, as illustrated in the sketch after this list; more details on how the generated TTS data should be stored are given below in the sample use case for classifier model training.
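
As a minimal sketch of this one-folder-per-keyword layout and how a loader might walk it (the folder names are hypothetical, and the repo's actual data-loader may differ):
```
import os

def list_keyword_samples(data_root):
    """Map each keyword subfolder (e.g. data_root/yes/*.wav) to labelled files."""
    keywords = sorted(d for d in os.listdir(data_root)
                      if os.path.isdir(os.path.join(data_root, d)))
    samples = []
    for label, keyword in enumerate(keywords):
        folder = os.path.join(data_root, keyword)
        for fname in sorted(os.listdir(folder)):
            if fname.endswith(".wav"):
                samples.append((os.path.join(folder, fname), label))
    return keywords, samples

# e.g. keywords == ["down", "go", ..., "yes"] and
# samples == [("/data/train/yes/tts_000.wav", 29), ...] for Google30-style data
```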

# Sample Use Cases

@@ -29,10 +29,10 @@ The following command can be used to instantiate and train the phoneme model.
python train_phoneme.py --base_path=/path/to/librispeech_data/ --rir_base_path=/path/to/reverb_files/ --additive_base_path=/path/to/additive_noises/ --snr_samples="0,5,10,25,100,100" --rir_chance=0.5
```
Some important command-line arguments:
1) base_path : Path of the speech data folder. The data in this folder should be in accordance to the dataloader code written here.
2) rir_base_path, additive_base_path : Path to the reverb and additive noise files
1) base_path : Path of the speech data folder. The data in this folder should be organized in accordance with the data-loader code written here.
2) rir_base_path, additive_base_path : Paths to the reverb and additive noise files.
3) snr_samples : List of various SNRs at which the additive noise is to be added.
4) rir_chance : Probability at which reverberation has to be done for each speech sample
4) rir_chance : Probability with which reverberation is applied to a given speech sample (see the sketch below).
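
For intuition, here is one plausible reading of how --snr_samples and --rir_chance combine per training sample; this is our guess at the policy, not the repo's actual logic (note that listing 100 twice doubles the chance of drawing a near-clean, 100 dB mixture under uniform sampling):
```
import random

# Reuses add_reverb / add_noise_at_snr from the augmentation sketch above.
snr_samples = [float(s) for s in "0,5,10,25,100,100".split(",")]  # --snr_samples
rir_chance = 0.5                                                  # --rir_chance

def augment(speech, rirs, noises):
    """One plausible per-sample policy: maybe reverberate, then add noise."""
    if random.random() < rir_chance:          # reverberate rir_chance of the time
        speech = add_reverb(speech, random.choice(rirs))
    snr_db = random.choice(snr_samples)       # 100 dB SNR is effectively clean
    return add_noise_at_snr(speech, random.choice(noises), snr_db)
```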

## Classifier Model Training
The following command can be used to instantiate and train the classifier model.
@@ -41,11 +41,11 @@ python train_classifier.py --base_path=/path/to/train_and_test_data_folders/ --t
```
Some important command-line arguments:

1) base_path : path to train and test data folders
2) train_data_folders, test_data_folders : These folders should have the .wav files for each keyword in a separate subfolder inside according to the dataloader here
3) phoneme_model_load_ckpt : The full path of the checkpoint file that would be used to load the weights to the instantiated phoneme model
4) rir_base_path, additive_base_path : Path to the reverb and additive noise files
5) synth : Boolean flag for specifying if reverberations and noise addition has to be done
1) base_path : Path to the train and test data folders.
2) train_data_folders, test_data_folders : These folders should contain the .wav files for each keyword in a separate subfolder, organized according to the data-loader here.
3) phoneme_model_load_ckpt : Full path of the checkpoint file used to load weights into the instantiated phoneme model.
4) rir_base_path, additive_base_path : Paths to the reverb and additive noise files.
5) synth : Boolean flag specifying whether reverberation and noise addition should be applied.

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
4 changes: 2 additions & 2 deletions applications/KWS_Phoneme/auxiliary_files/README.md
@@ -17,8 +17,8 @@ The downloaded files would need to be converted to 16KHz for our pipeline. Pleas
```
python convert_sampling_rate.py --source_folder=/path/to/csv_file.csv --target_folder=/path/to/target/16KHz_folder/ --fs=16000 --log_rate=100
```
The script can convert the sampling rate of any wav file to the specified --fs. But for our applications, we use 16KHz only.<br/>
Choose the log rate for how often the log should be printed for the sample rate conversion. This will print a string ever log_rate iterations.
The script can convert the sampling rate of any .wav file to the specified --fs, but for our applications we use 16 kHz only.<br/>
Choose log_rate to control how often progress is reported during the sample-rate conversion; a line is printed every log_rate iterations.
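
For reference, an equivalent conversion can be sketched in a few lines with soundfile and SciPy (assumed dependencies; the repo's script may do this differently):
```
import soundfile as sf
from fractions import Fraction
from scipy.signal import resample_poly

def convert_to_16khz(src_path, dst_path, target_fs=16000):
    """Read a .wav file, resample it to target_fs, and write the result."""
    data, fs = sf.read(src_path)
    if fs != target_fs:
        ratio = Fraction(target_fs, fs)      # e.g. 16000/44100 -> 160/441
        data = resample_poly(data, ratio.numerator, ratio.denominator)
    sf.write(dst_path, data, target_fs)
```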

Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT license.
1 change: 0 additions & 1 deletion applications/KWS_Phoneme/kwscnn.py
@@ -675,4 +675,3 @@ def init_hidden(self, batch, rnn_hidden_size, rnn_num_layers):
hidden = Variable(hidden)

return hidden

2 changes: 1 addition & 1 deletion applications/KWS_Phoneme/train_classifier.py
@@ -293,4 +293,4 @@ def train_classifier_model(args):

if __name__ == '__main__':
args = parseArgs()
train_classifier_model(args)
train_classifier_model(args)
2 changes: 1 addition & 1 deletion applications/KWS_Phoneme/train_phoneme.py
@@ -209,4 +209,4 @@ def train_phoneme_model(args):

if __name__ == '__main__':
args = parseArgs()
train_phoneme_model(args)
train_phoneme_model(args)