My goal: use a cloud ASR (Google?) for just a single part of a dialogue/app handling, to preserve as much as possible of Snips’ privacy-by-design core philosophy while still being able to recognise “free words” not previously configured in any Snips console.
Example uses:
search Wikipedia for “impressionist”, “quantum physics”, etc.
search a translation for a particular word or phrase
get the dictionary definition of a word
add elements to a shopping list, even “Nvidia’s GeForce RTX 2060”
…etc., you get the idea
See @Psycho’s blog for a proper, full (100% Google ASR) implementation: https://laurentchervet.wordpress.com/2018/03/08/project-alice-arbitrary-text/
My own constraints: I want my home automation to use local resources as much as possible, for both practical and privacy purposes. I am therefore using Snips’ 100% local ASR (and the Snips TTS voice, etc.), even if the quality/accuracy of the detection or the voice does not always match (yet?) the quality of the cloud APIs. It remains 100% usable and handy for our family’s everyday voice interactions with the home automation.
-> Snips is a really nice project/product: well done, guys! I mean it!
That being said, the choice of running on a tiny Raspberry Pi inherently implies some limitations: Snips handles well the recognition of pre-entered phrases/words (-> slots) but cannot recognise “free”/“arbitrary” words that we have not previously injected for learning.
I needed a general-purpose ASR (Google in my case), but only for some specific recognition intents / points in time within the Snips dialogue of the corresponding app.
I want the dialogue to be handled 99% via Snips ASR and some parts via Google ASR (Snips ASR -> Google ASR -> back to Snips ASR).
Example of what I actually developed/coded:
1) me: “Jarvis!” … Ding?
2) me: “search on Wikipedia” … DingDong!
3) Jarvis: “What do you want to search for?” … Ding?
4) me: “pikachu” … DingDong!
5) Jarvis: “Pikachu is a Pokémon…”
Steps 1) to 3) are all classic Snips ASR/NLU dialogue stuff, executed 100% locally via Snips.
Step 4) has to be handled via a general-purpose ASR -> Google ASR:
-> after the “hermes/audioServer/default/playFinished” topic fires, I capture just the “hermes/audioServer/default/audioFrame” payloads I need into a WAV file.
-> I send the WAV file through Google ASR, get back the recognised text, and use it for whatever I want… the Wikipedia API being my first use case.
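For the Google ASR call itself, here is a minimal sketch using the google-cloud-speech client library (authentication via the GOOGLE_APPLICATION_CREDENTIALS environment variable; the file name and language code are placeholders, and I am assuming the capture is 16 kHz, 16-bit mono PCM, which is what the Snips audio server streams):

```python
# sketch: send a captured WAV file to Google Cloud Speech-to-Text
from google.cloud import speech

client = speech.SpeechClient()

# placeholder path: whatever file the capture script produced
with open("recordings/20190101-120000.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,   # must match the Snips audio frames
    language_code="en-US",     # placeholder, use your own locale
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```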
The hermes/dialogueManager/startSession outbound message, even when specifying “sendIntentNotRecognized”, was not behaving correctly (or at least not as I understood it should behave…).
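For reference, the kind of message I was publishing looks roughly like this (a sketch following the Hermes dialogue-manager protocol; the siteId, text and hostname are just example values):

```python
# sketch: start a dialogue session with sendIntentNotRecognized set
import json
import paho.mqtt.publish as publish

payload = {
    "siteId": "default",                     # example site
    "init": {
        "type": "action",
        "text": "What do you want to search for?",
        "canBeEnqueued": True,
        "sendIntentNotRecognized": True,     # the flag in question
    },
}
publish.single("hermes/dialogueManager/startSession",
               json.dumps(payload), hostname="localhost")
```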
Snips’ MQTT dialogue logic was not made for this purpose (and/or I am NOT a Python/Snips guru developer! I started Python two months ago…), so I improvised with some evil / very basic / crude solutions to mimic and handle the dialogue, play the end-of-message tones / intent-recognised sounds, etc. I use those messages as flags to signal the correct timing to start and stop the recording, and send just the right portion of the recorded sound to the ASR (I told you it was crude… but nonetheless efficient).
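The gist of the flag approach, as a sketch (append_frame is a hypothetical helper, and the exact stop trigger is up to you; see the full capture code at the end of this post):

```python
# sketch: use hermes messages as flags to gate the audioFrame capture
import paho.mqtt.client as mqtt

START_TOPIC = "hermes/audioServer/default/playFinished"  # prompt tone done
FRAME_TOPIC = "hermes/audioServer/default/audioFrame"

capturing = False

def on_message(client, userdata, msg):
    global capturing
    if msg.topic == START_TOPIC:
        capturing = True            # the user speaks right after the tone
    elif msg.topic == FRAME_TOPIC and capturing:
        append_frame(msg.payload)   # hypothetical: append PCM to the open WAV
```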
After quite some tweaking, I implemented the Wikipedia search; it worked very well and we used it at home for a couple of weeks. I had to implement filters to voice back some pieces of information correctly (Roman-numeral centuries like “XIX”, etc.), and to check with a basic signal-processing algorithm whether the message/WAV file was blank / noise only, etc.
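For the blank/noise check, something along these lines works (a crude sketch using the stdlib audioop module; the threshold is an arbitrary value to tune for your microphone):

```python
# sketch: reject captures whose overall energy is too low to be speech
import wave
import audioop

def is_blank(path, rms_threshold=300):
    """Return True if the WAV file is (probably) silence or faint noise."""
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
    # width=2 because the Snips frames are 16-bit samples
    return audioop.rms(frames, 2) < rms_threshold
```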
BUT, since Snips version 0.60.8 (or earlier?), it seems the sound data payloads sent through MQTT are no longer encoded as they were before: my capture code does not work anymore, and I can’t find what changed, as I did not keep the MQTT message captures I used during the dev phase.
(I first thought the newly introduced VAD handling was interfering, but setting [snips-hotword] no_vad_inhibitor = true in snips.toml does not change anything.)
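That is, in /etc/snips.toml:

```toml
[snips-hotword]
no_vad_inhibitor = true
```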
My problem / questions:
How do I build a WAV file from the “hermes/audioServer/default/audioFrame” message payloads again?
I originally found the basic Python capture code on a snippet copy/paste site on the internet (I cannot seem to find it again to credit the proper dev/source…).
Proper integration with the Snips logic versus my crude implementation: can the actual Snips MQTT dialogue message logic be used or modified to allow such raw component handling (WAV capture, reusing Snips’ already-implemented “stop capturing after the user stops speaking” feature, etc., instead of reinventing the wheel as I did)?
Sorry for the long post; I tend to lean toward the verbose side…
Here is the capture code that used to work simply and well before (it saves into a “recordings” directory; change the code accordingly if you want another location…). It creates a WAV file from what Snips’ audio server publishes over MQTT:
```python
# record audio from snips.ai MQTT traffic
# each hermes audioFrame payload is a tiny self-contained WAV (RIFF header +
# PCM data); we parse the header of each frame and append the PCM chunks of
# every frame to one output file
import paho.mqtt.client as mqtt
import struct
import wave
import datetime
import os
import sys

VC_SERVER = 'localhost'
VC_PORT = 1883
AUDIO_FRAME_TOPIC = 'hermes/audioServer/default/audioFrame'

record_running = False
record = None


def on_connect(client, userdata, flags, rc):
    print('Connected to MQTT system')
    client.subscribe(AUDIO_FRAME_TOPIC)


def on_message(client, userdata, msg):
    if msg.topic == AUDIO_FRAME_TOPIC:
        start_record(msg)


def start_record(msg):
    global record_running
    global record

    # RIFF header: b'RIFF', total size, b'WAVE'
    riff, size, fformat = struct.unpack('<4sI4s', msg.payload[:12])
    if riff != b'RIFF':
        print("RIFF parse error")
        return
    if fformat != b'WAVE':
        print("FORMAT parse error")
        return

    # the first sub-chunk should be 'fmt ' and carries channels / sample rate
    subchunkid, subchunksize = struct.unpack('<4sI', msg.payload[12:20])
    if subchunkid != b'fmt ':
        print("fmt parse error")
        return
    aformat, channels, samplerate, byterate, blockalign, bps = \
        struct.unpack('<HHIIHH', msg.payload[20:36])
    # print("Format: %i, Channels: %i, Sample rate: %i" % (aformat, channels, samplerate))

    sys.stdout.write('.')  # to debug message flag sync for start/stop recording
    sys.stdout.flush()

    if not record_running:
        record_running = True
        sys.stdout.write('Begin')
        os.makedirs("recordings", exist_ok=True)
        record = wave.open(os.path.join(
            "recordings",
            datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + ".wav"), "wb")
        record.setnchannels(channels)
        record.setframerate(samplerate)
        record.setsampwidth(2)  # 16-bit samples

    # walk the remaining sub-chunks and append every 'data' chunk's PCM bytes
    chunkOffset = 36
    while chunkOffset < size:
        subchunk2id, subchunk2size = struct.unpack(
            '<4sI', msg.payload[chunkOffset:chunkOffset + 8])
        chunkOffset += 8
        if subchunk2id == b'data' and record_running:
            record.writeframes(msg.payload[chunkOffset:chunkOffset + subchunk2size])
        chunkOffset += subchunk2size


def stop_recording():
    # called from my dialogue "flag" logic when the user is done speaking
    global record_running
    record_running = False
    record.close()


def main():
    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect(VC_SERVER, VC_PORT)
    client.loop_forever()


if __name__ == "__main__":
    main()
```