Train All the Things - Synthetic Generation

After getting the display and worker up and running, I started down the path of training my model for keyword recognition. Right now I've settled on the wake words "Hi Smalltalk". After the wake words are detected, the model will then classify what follows as silence, on, off, or unknown.

My starting point for training the model was the micro_speech and speech_commands tutorials that are part of the TensorFlow project. One of the first things I noticed while planning out this step was the lack of good wake words in the speech commands dataset. There are many voice datasets available online, but most are unlabeled or conversational. Since digging didn't turn up much in the way of open, labeled word datasets, I decided to use on and off from the speech commands dataset, since that gave me a baseline for comparison with my custom words. After recording myself saying hi and smalltalk fewer than ten times, I knew I did not want to generate my own samples at the scale of the other labeled keywords.

Instead of giving up on my wake word combination, I started digging around for options and found an interesting project where somebody had started down the path of generating labeled words with text-to-speech. After reading through the repo, I ended up using espeak and sox to generate my labeled dataset.

The first step was to generate the phonemes for the wake words:

$ espeak -v en -X smalltalk

I then stored the phonemes in a words file that will be used by the generation script:

$ cat words
hi 001 [[h'aI]]
busy 002 [[b'Izi]]
free 003 [[fr'i:]]
smalltalk 004 [[sm'O:ltO:k]]
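
A malformed line in the words file would only show up after thousands of samples had been generated, so a quick check up front can save a run. The sketch below assumes the three-column layout shown above (word, numeric id, phoneme string wrapped in double square brackets). The function name check_words is made up for illustration and is not part of espeak or the spoken command repo.

```shell
# check_words is a hypothetical helper. It reads a words file and reports
# every line that does not match the assumed "word id [[phonemes]]" layout.
# It returns non-zero if at least one bad line was found.
check_words() {
    file=$1
    lineno=0
    status=0
    while read -r word wordid phoneme; do
        lineno=$((lineno + 1))
        # Skip blank lines
        [ -z "$word" ] && continue
        # The id column must be made up of digits only
        case $wordid in
            ''|*[!0-9]*)
                echo "line $lineno: id '$wordid' is not numeric" >&2
                status=1 ;;
        esac
        # espeak treats text wrapped in [[ ]] as phoneme mnemonics
        case $phoneme in
            '[['*']]') ;;
            *)
                echo "line $lineno: phoneme '$phoneme' is not wrapped in [[ ]]" >&2
                status=1 ;;
        esac
    done < "$file"
    return "$status"
}
```

Running check_words words before the generation loop and stopping on a non-zero exit status would be one way to use it.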

After modifying the script from the spoken command repo (eliminating some extra commands and extending the loops to generate more samples), I had everything I needed to synthetically generate a new labeled word dataset.

#!/bin/bash
# Each loop's index variable (voice, pitch, speed) is passed to espeak
# to vary the voice that is generated.

while read -r word wordid phoneme
do
    echo "$word"
    mkdir -p "db/$word"

    # Assumed: restart the sample counter whenever a new word begins
    if [[ $word != "$lastword" ]]; then
        versionid=0
    fi
    lastword=$word

    # Generate voices with various dialects
    for i in english english-north en-scottish english_rp english_wmids english-us en-westindies
    do
        # Loop changing the pitch in each iteration
        for k in $(seq 1 99)
        do
            # Change the speed in words per minute
            for j in 80 100 120 140 160; do
                echo "$versionid" "$phoneme" "$i" "$j" "$k"
                echo "$phoneme" | espeak -p "$k" -s "$j" -v "$i" -w "db/$word/$versionid.wav"
                # Set sox options for TensorFlow: 16-bit, little-endian, 16 kHz
                sox "db/$word/$versionid.wav" -b 16 --endian little "db/$word/tf_$versionid.wav" rate 16k
                # Assumed: advance the counter so each sample gets its own file
                versionid=$((versionid + 1))
            done
        done
    done
done < words
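
To confirm that the sox step really produced 16 kHz, 16-bit files, the header of a converted sample can be inspected. The sketch below reads the two fields directly with od. It assumes a canonical PCM WAV header (fmt chunk immediately after the RIFF/WAVE tag) and a little-endian host, and the helper names wav_format and check_tf_wav are made up for illustration.

```shell
# wav_format is a hypothetical helper: it prints "<sample rate> <bits per sample>"
# read straight from a canonical 44-byte PCM WAV header.
# Assumptions: the fmt chunk directly follows the RIFF/WAVE tag (no extra
# chunks in front of it) and the host is little-endian, since od decodes
# multi-byte integers in host byte order.
wav_format() {
    # Sample rate: unsigned 32-bit value at byte offset 24
    rate=$(od -An -tu4 -j24 -N4 "$1" | tr -d ' \t\n')
    # Bits per sample: unsigned 16-bit value at byte offset 34
    bits=$(od -An -tu2 -j34 -N2 "$1" | tr -d ' \t\n')
    echo "$rate $bits"
}

# check_tf_wav succeeds only for 16 kHz, 16-bit files, the format the
# sox step is meant to produce.
check_tf_wav() {
    [ "$(wav_format "$1")" = "16000 16" ]
}
```

Calling check_tf_wav on a few of the tf_ files under db/ is enough for a spot check. If sox is installed, soxi -r and soxi -b report the same two values without relying on the header layout.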

After the run I have samples and labels with a volume comparable to the other words provided by Google. The pitch, speed and tone of voice change with each loop, which will hopefully provide enough variety to make this dataset useful in training. Even if this doesn't work out, learning about espeak and sox was interesting, and I've already got some future ideas on how to use them. If it does work, the ability to generate training data on demand seems incredibly useful.
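
The number of samples per word follows directly from the loop sizes in the generation script. As a rough sketch, the lists below are copied from that script and the count helper is made up for illustration:

```shell
# Loop values copied from the generation script.
voices="english english-north en-scottish english_rp english_wmids english-us en-westindies"
speeds="80 100 120 140 160"
pitch_steps=99   # seq 1 99

# count is a hypothetical helper that prints how many arguments it received.
count() { echo "$#"; }

# The lists are expanded unquoted on purpose so each entry becomes one argument.
n_voices=$(count $voices)
n_speeds=$(count $speeds)

# One sample is written per (voice, pitch, speed) combination.
samples_per_word=$((n_voices * pitch_steps * n_speeds))
echo "samples per word: $samples_per_word"   # 7 * 99 * 5 = 3465
```

Adding a voice or a speed to the lists scales the total linearly, which makes it easy to tune the dataset size before kicking off a run.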

Next up: training the model and loading it onto the ESP-EYE. The code, docs, images, etc. for the project can be found here, and I'll be posting updates to HackadayIO and this blog as I continue along. If you have any questions or ideas, reach out.