Published: Mar 24, 2020

Recently I spent some time learning how to generate synthetic voices using espeak. After working with the tools to align the audio with the TensorFlow keyword model's expectations I was ready for training, and to see how well the synthetic data performed. TL;DR: not well :)
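
For context, the generation side boiled down to looping espeak over a range of voice parameters and resampling the output for the keyword model. The sketch below is illustrative rather than a copy of the scripts linked at the end of the post; it assumes espeak and sox are on the PATH and that the keyword pipeline wants one second of 16 kHz mono audio.

```python
import itertools
import subprocess
from pathlib import Path

WORDS = ["hi", "smalltalk"]           # the synthetic keywords
VOICES = ["en-us+m1", "en-us+m3", "en-us+f2", "en-us+f4"]
PITCHES = range(20, 80, 20)           # espeak -p is 0-99
SPEEDS = range(120, 200, 40)          # espeak -s is words per minute

OUT = Path("generated")
OUT.mkdir(exist_ok=True)

for word, voice, pitch, speed in itertools.product(WORDS, VOICES, PITCHES, SPEEDS):
    raw = OUT / f"{word}_{voice}_{pitch}_{speed}_raw.wav"
    wav = OUT / f"{word}_{voice}_{pitch}_{speed}.wav"
    # Synthesize the keyword with espeak.
    subprocess.run(
        ["espeak", "-v", voice, "-p", str(pitch), "-s", str(speed), "-w", str(raw), word],
        check=True,
    )
    # Resample to 16 kHz mono and pad/trim to exactly one second for the keyword model.
    subprocess.run(
        ["sox", str(raw), "-r", "16000", "-c", "1", "-b", "16", str(wav),
         "pad", "0", "1", "trim", "0", "1"],
        check=True,
    )
    raw.unlink()
```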

I started by training with the keywords hi, smalltalk, and on. This let me keep one known working word (on is part of the Speech Commands dataset) while testing the two synthetic words. Although training went well:

INFO:tensorflow:Saving to "/Users/n0mn0m/projects/on-air/voice-assistant/train/model/speech_commands_train/tiny_conv.ckpt-18000"  
I0330 10:34:28.514455 4629171648 train.py:297] Saving to "/Users/n0mn0m/projects/on-air/voice-assistant/train/model/speech_commands_train/tiny_conv.ckpt-18000"  
INFO:tensorflow:set_size=1445  
I0330 10:34:28.570324 4629171648 train.py:301] set_size=1445  
WARNING:tensorflow:Confusion Matrix:  
 [[231 3 3 0 4]  
 [ 2 178 6 29 26]  
 [ 3 12 146 2 2]  
 [ 4 17 2 352 21]  
 [ 2 16 7 16 361]]  
W0330 10:34:32.116044 4629171648 train.py:320] Confusion Matrix:  
 [[231 3 3 0 4]  
 [ 2 178 6 29 26]  
 [ 3 12 146 2 2]  
 [ 4 17 2 352 21]  
 [ 2 16 7 16 361]]  
WARNING:tensorflow:Final test accuracy = 87.8% (N=1445)  
W0330 10:34:32.116887 4629171648 train.py:322] Final test accuracy = 87.8% (N=1445)

The model didn't respond well once it was loaded onto the ESP-EYE. I tried a couple more rounds with other keywords and spectrogram samples, with similar results.
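
For reference, each training round was kicked off with the speech_commands train.py example from the TensorFlow repo. The sketch below is illustrative rather than copied from my scripts; tiny_conv and the 18,000 total steps line up with the checkpoint name in the log above, but the paths and data layout are placeholders.

```python
import subprocess

# Sketch of a training run using TensorFlow's speech_commands example
# (tensorflow/examples/speech_commands/train.py). Paths are illustrative.
subprocess.run(
    [
        "python", "train.py",
        "--data_dir=./data",                      # real + synthetic keyword wavs
        "--wanted_words=hi,smalltalk,on",
        "--model_architecture=tiny_conv",         # matches tiny_conv.ckpt-18000 above
        "--how_many_training_steps=15000,3000",   # 18000 steps total
        "--train_dir=./model/speech_commands_train",
    ],
    check=True,
)
```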

Because of the brute force way I generated the audio, the synthetic training data isn't very representative of real human voices. While the experiment didn't work out, I do think generating data this way could be useful given the right amount of time and research. Instead of scaling parameters in a loop, researching the characteristics of various human voices and using those to tune the data generated via espeak could actually work out well. That said, it's possible the model would also pick up on characteristics of the espeak program itself. Regardless, voice data that is ready for training is still a hard problem in need of more open solutions.

Along with the way I scaled the espeak parameters, another monkey wrench is that the micro_speech model uses a CNN over a spectrogram of the input audio instead of processing the full signal. This means the model is likely to work well for voices whose spectrograms look like the training samples, but not to generalize much beyond them. That makes picking training audio with spectrograms representative of the actual user another key task.
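
One way to see why that matters is to put a synthetic clip and a real recording side by side as spectrograms. This isn't the model's actual feature frontend, just a quick scipy/matplotlib look, and the file names here are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

def plot_spec(ax, path, title):
    rate, samples = wavfile.read(path)            # expects mono 16 kHz wavs
    freqs, times, power = spectrogram(samples, fs=rate)
    ax.pcolormesh(times, freqs, 10 * np.log10(power + 1e-10))
    ax.set_title(title)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("frequency (Hz)")

# Compare an espeak clip with a human recording of the same keyword.
fig, (left, right) = plt.subplots(1, 2, figsize=(10, 4))
plot_spec(left, "generated/hi_espeak.wav", "espeak")
plot_spec(right, "recordings/hi_human.wav", "human")
plt.tight_layout()
plt.show()
```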

Because of these results and bigger issues I ended up tweaking my approach: I used visual as my wake word, followed by on/off. All three are available in the TF Speech Commands dataset, and visual seems like an ok wake word for controlling a display. Anybody working on a more generic voice assistant will want to look at audio segmentation, since many datasets are full sentences, or consider using something like Skainet. All of this was less fun than running my own model on synthetic data, but I needed to keep moving forward. After a final round of training with all three words I followed the TF docs to represent the model as a C array and then flashed it onto the board with the rest of the program. Using idf monitor I was able to observe the model working as expected:

I (31) boot: ESP-IDF v4.1  
I (31) boot: compile time 13:35:43  
I (704) wifi: config NVS flash: enabled  
I (734) WIFI STATION: Setting WiFi configuration SSID Hallow...  
I (824) WIFI STATION: wifi_init_sta finished.  
I (1014) TF_LITE_AUDIO_PROVIDER: Audio Recording started  
Waking up  
Recognized on  
I (20434) HTTPS_HANDLING: HTTPS Status = 200, content_length = 1  
I (20434) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED  
I (20444) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED  
Going back to sleep.  
Waking up  
Recognized off  
I (45624) HTTPS_HANDLING: HTTPS Status = 200, content_length = 1  
I (45624) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED  
I (45634) HTTPS_HANDLING: HTTP_EVENT_DISCONNECTED  

This was an educational experiment. It helped me put some new tools in my belt while thinking further about the problem of voice and audio processing. I developed some [scripts](https://github.com/n0mn0m/on-air/tree/main/voice-assistant/train) to run through the full data generation, train and export cycle. Training will need to be adapted to whatever architecture somebody is using, but hopefully it's useful.
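
The export side of that cycle follows the TF docs: freeze the checkpoint, convert it to a .tflite file, and then dump the flatbuffer as a C array with xxd. A minimal sketch of that last step, with placeholder file names:

```python
import subprocess

# After freezing the checkpoint and converting it to model.tflite per the TF
# docs, dump the flatbuffer as a C array that gets compiled into the firmware.
with open("model_data.cc", "w") as out:
    subprocess.run(["xxd", "-i", "model.tflite"], stdout=out, check=True)
```

The resulting array and its length constant are what get built into the firmware in place of the example model data before flashing.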

The code, docs, images, etc. for the project can be found in the repository linked above, and I'll be posting updates to Hackaday.io and this blog as I continue along. If you have any questions or ideas, reach out.

python, programming, hackaday