Train All the Things - Synthetic Generation
After getting the display and worker up and running I started down the path of
training my model for keyword recognition. Right now I've settled on the wake
words Hi Smalltalk. After the wake word is detected the model will then
My starting point for training the model was the
tutorials that are part of the TensorFlow project. One of the first things I
noticed while planning out this step was the lack of good wake words in the
speech command dataset. There are
many voice datasets available
online, but many are unlabeled or conversational. Since digging didn't turn up
much in the way of open labeled word datasets I decided to use
from the speech commands
since that gave me a baseline for comparison with my custom words. After
recording myself saying
smalltalk fewer than ten times, I knew I did
not want to generate my own samples at the scale of the other labeled keywords.
Instead of giving up on my wake word combination I started digging around for options and found an interesting project where somebody had started down the path of generating labeled words with text to speech. After reading through the repo I ended up using espeak and sox to generate my labeled dataset.
The first step was to generate the phonemes for the wake words:
$ espeak -v en -X smalltalk
 sm'O:ltO:k
I then stored the phonemes in a words file that will be used by the generation script:
$ cat words
hi 001 [[h'aI]]
busy 002 [[b'Izi]]
free 003 [[fr'i:]]
smalltalk 004 [[sm'O:ltO:k]]
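As a rough sketch, the same file could also be assembled by a small script instead of by hand. The helper names `format_entry` and `build_words` are hypothetical, and using `espeak -q -x` to print only the phoneme mnemonics without playing audio is an assumption on my part, not something taken from the spoken command repo.

```shell
#!/bin/bash
# Sketch: build a words file in the "word id [[phoneme]]" layout shown above.
# format_entry and build_words are hypothetical helper names.

# Print one entry: word, zero-padded three-digit id, phoneme wrapped in [[ ]].
format_entry() {
  local word="$1"
  local id="$2"
  local phoneme="$3"
  # Remove all whitespace (espeak prints a leading space); fine for single words
  phoneme=$(printf '%s' "$phoneme" | tr -d '[:space:]')
  printf '%s %03d [[%s]]\n' "$word" "$id" "$phoneme"
}

# Look up the phonemes for each argument with espeak and print one entry each.
build_words() {
  local id=1
  local word
  for word in "$@"; do
    # -q: no audio, -x: write phoneme mnemonics to stdout (assumed flags)
    format_entry "$word" "$id" "$(espeak -q -x -v en "$word")"
    id=$((id + 1))
  done
}

# Only call espeak when it is installed
if command -v espeak >/dev/null 2>&1; then
  build_words hi busy free smalltalk
fi
```

Redirecting the output to a file named `words` would give the generation script something to read. It is still worth listening to a sample or two, since the phonemes espeak picks may not match how you actually say the word.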
After modifying generate.sh from the spoken command repo (eliminating some
extra commands and extending the loop to generate more samples), I had
everything I needed to synthetically generate a new labeled word dataset.
#!/bin/bash
# In each of the loops below, the index variable is used to vary
# the voices being created by espeak.
lastword=""
while read -r word wordid phoneme
do
  echo "$word"
  mkdir -p "db/$word"
  # Restart the sample counter whenever a new word begins
  if [[ "$word" != "$lastword" ]]; then
    versionid=0
  fi
  lastword=$word
  # Generate voices with various dialects
  for i in english english-north en-scottish english_rp english_wmids english-us en-westindies
  do
    # Loop changing the pitch in each iteration
    for k in $(seq 1 99)
    do
      # Change the speed in words per minute
      for j in 80 100 120 140 160; do
        echo "$versionid" "$phoneme" "$i" "$j" "$k"
        echo "$phoneme" | espeak -p "$k" -s "$j" -v "$i" -w "db/$word/$versionid.wav"
        # Set sox options for TensorFlow
        sox "db/$word/$versionid.wav" -b 16 --endian little "db/$word/tf_$versionid.wav" rate 16k
        ((versionid++))
      done
    done
  done
done < words
After the run I have samples and labels with a volume comparable to the other
words provided by Google. The pitch, speed and tone of voice changes with each
loop which will hopefully provide enough variety to make this dataset useful in
training. Even if this doesn't work out, learning about espeak and sox was
interesting. I've already got some future ideas on how to use those. If it does
work the ability to generate training data on demand seems incredibly useful.
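To sanity-check the volume, the loop sizes in the script can be multiplied out: 7 voices, 99 pitch values, and 5 speeds. The sketch below does that arithmetic and, if a `db` directory exists, counts the converted `tf_*.wav` files per word. The function names `expected_samples` and `count_samples` are hypothetical.

```shell
#!/bin/bash
# Sketch: expected vs. actual sample counts for the generation loops.
# expected_samples and count_samples are hypothetical helper names.

# Multiply out the loop sizes: voices x pitch steps x speeds.
expected_samples() {
  echo $(( $1 * $2 * $3 ))
}

# Count the sox-converted files (tf_*.wav) in one word directory.
count_samples() {
  find "$1" -name 'tf_*.wav' -type f | wc -l | tr -d ' '
}

# 7 dialects, pitch 1..99, 5 speeds, as in the script above
echo "expected per word: $(expected_samples 7 99 5)"

# Report what is actually on disk, if the db directory exists
if [ -d db ]; then
  for dir in db/*/; do
    echo "$dir $(count_samples "$dir")"
  done
fi
```

With the loop values from the script this works out to 3465 samples per word. A lower count on disk would suggest that espeak or sox failed for some voice, pitch, or speed combination.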
Next up: training the model and loading it onto the ESP-EYE. The code, docs, images, etc. for the project can be found here, and I'll be posting updates to HackadayIO and this blog as I continue along. If you have any questions or ideas, reach out.