Snips Free Words



My goal: use a cloud ASR (Google?) for just a single part of a dialogue/App flow, to preserve as much as possible Snips' privacy-by-design core philosophy, while still being able to recognise "free words" not previously configured in any Snips console.

Use case example:

My own constraints: I want my home automation to use local resources as much as possible, for both practical and privacy purposes. I am therefore using the 100% local Snips ASR (and the Snips TTS voice, etc.), even if the detection accuracy or voice quality does not always match 100% (yet?) that of the cloud APIs. It remains 100% usable and handy for our family's everyday voice interactions at home.
-> Snips is a really nice project/product: well done guys! I mean it!

That being said, the choice of running on a tiny Raspberry Pi inherently implies some limitations: Snips handles the recognition of pre-entered phrases/words (-> slots) well, but cannot recognise "free"/arbitrary words that were not previously injected for learning.

I needed a general-purpose ASR (Google in my case), but to be used only for some specific recognition intents / points in time within the Snips dialogue of the corresponding App.
I want the dialogue to be handled 99% via the Snips ASR and only some parts via the Google ASR (Snips ASR -> Google ASR -> back to Snips ASR).

Example of what I actually developed/coded:

  1. me : “Jarvis!” … Ding ?
  2. me : “search on wikipedia” … DingDong !
  3. Jarvis : “What do you want to search for ?” - Ding ?
  4. me : “pikachu” … DingDong !
  5. Jarvis : “Pikachu is a pokemon…”

Steps 1) to 3) are all classic Snips ASR/NLU dialogue stuff, executed 100% locally via Snips.
Step 4) has to be handled via a general-purpose ASR -> Google ASR:
-> after the "hermes/audioServer/default/playFinished" topic, I capture just the "hermes/audioServer/default/audioFrame" payloads I need into a wav file.
-> I send the wav file through the Google ASR, get back the recognised text, and use it for whatever I want… the Wikipedia API as my first use case.
The hermes/dialogueManager/startSession outbound message, even when specifying "sendIntentNotRecognized", was not behaving correctly (or at least not as I understood it should…).
Snips' MQTT dialogue logic was not made for this purpose (and/or I am NOT a Python/Snips guru developer! I started Python 2 months ago…), so I improvised some evil / very basic / crude solutions to mimic and handle the dialogue, play the end-of-message tones / intent-recognised sounds, etc. I use those messages as flags to signal the correct timing to start and stop the recording, and send just the right portion of the recorded sound to the ASR (I told you it was crude… :slight_smile: but nonetheless efficient).
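For what it's worth, the flag-based timing described above boils down to a tiny state machine. Here is a minimal sketch (the `RecorderGate` class name is mine, and I'm assuming `hermes/asr/stopListening` as the end-of-capture signal; only `playFinished` and `audioFrame` are the topics actually used above):

```python
class RecorderGate:
    """Minimal sketch of the MQTT-flag timing: capture audioFrame
    payloads only between a 'start' and a 'stop' hermes message."""
    START_TOPIC = 'hermes/audioServer/default/playFinished'
    STOP_TOPIC = 'hermes/asr/stopListening'  # assumed stop signal
    FRAME_TOPIC = 'hermes/audioServer/default/audioFrame'

    def __init__(self):
        self.recording = False
        self.frames = []  # raw audioFrame payloads captured while "open"

    def handle(self, topic, payload=b''):
        if topic == self.START_TOPIC:
            # the question prompt finished playing: start capturing
            self.recording = True
            self.frames = []
        elif topic == self.STOP_TOPIC:
            self.recording = False
        elif topic == self.FRAME_TOPIC and self.recording:
            self.frames.append(payload)
```

In a real script, `handle()` would simply be called from the paho-mqtt `on_message` callback with `msg.topic` and `msg.payload`.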

After quite some tweaks, I implemented the Wikipedia search; it was working very well and we used it at home for a couple of weeks. I had to implement filters to voice back some pieces of information correctly (Roman-numeral centuries like "XIX", etc.) and to check, with a basic signal-processing algorithm, whether the wav file was blank / noise only, etc.
BUT, since Snips version 0.60.8 (or before?), the sound data payloads sent over MQTT no longer seem to be encoded as before, and my capture code is not working anymore. I can't find what changed, as I did not keep the MQTT message captures I used during the dev phase.
(I first thought the newly introduced VAD handling was interfering, but setting [snips-hotword] no_vad_inhibitor = true in snips.toml does not change anything.)

My problem / question :

  1. how to encode a wav file from the "hermes/audioServer/default/audioFrame" message payloads again!
    I found the initial basic Python capture code on a snippet copy/paste site on the internet (I cannot seem to find it again to credit the proper dev/source…).

  2. Proper integration with the Snips logic versus my crude implementation: can the actual Snips MQTT dialogue logic be modified to allow such raw component handling (wav capture, reuse of Snips' already-implemented stop-capture-when-the-user-stops-speaking feature, etc., instead of reinventing the wheel as I did)?

Sorry for the long post, I tend to lean on the verbose side… :sweat_smile:

Here is the capture code that used to work just fine before :sob: (just create a "recordings" directory or change the code accordingly…). It creates a wav file from what Snips records via MQTT:

# record audio from mqtt traffic

import datetime
import os
import struct
import sys
import wave

import paho.mqtt.client as mqtt

VC_SERVER = 'localhost'
VC_PORT = 1883

record_running = False
record = None
# MUST CREATE a DIR 'recordings' !

def on_connect(client, userdata, flags, rc):
    print('Connected to MQTT system')
    client.subscribe('hermes/audioServer/default/audioFrame')

def on_message(client, userdata, msg):
    if msg.topic == 'hermes/audioServer/default/audioFrame':
        start_record(msg)

def start_record(msg):
    global record_running
    global record

    riff, size, fformat = struct.unpack('<4sI4s', msg.payload[:12])
    if riff != b'RIFF':
        print('RIFF parse error')
        return
    if fformat != b'WAVE':
        print('FORMAT parse error')
        return

    # 'fmt ' header (always the first sub-chunk in these payloads)
    subchunkid, subchunksize = struct.unpack('<4sI', msg.payload[12:20])
    if subchunkid != b'fmt ':
        return
    aformat, channels, samplerate, byterate, blockalign, bps = \
        struct.unpack('<HHIIHH', msg.payload[20:36])
    sys.stdout.write('.')  # to debug message flag sync for start/stop recording

    if not record_running:
        record_running = True
        record ='recordings','%Y%m%d-%H%M%S') + '.wav'), 'wb')
        record.setnchannels(channels)
        record.setsampwidth(bps // 8)
        record.setframerate(samplerate)

    # walk the remaining sub-chunks until we reach 'data'
    chunkOffset = 36
    while chunkOffset + 8 <= len(msg.payload):
        subchunk2id, subchunk2size = struct.unpack(
            '<4sI', msg.payload[chunkOffset:chunkOffset + 8])
        chunkOffset += 8
        if subchunk2id == b'data':
            record.writeframes(msg.payload[chunkOffset:chunkOffset + subchunk2size])
        chunkOffset += subchunk2size

def stop_recording():
    global record_running
    global record
    record_running = False
    if record is not None:
        record.close()
        record = None

def main():
    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect(VC_SERVER, VC_PORT)
    client.loop_forever()

if __name__ == '__main__':
    main()


Well, I'm using a lot of unsupported word capture. As you said: for shopping lists, online searches, reminders, facts, etc. If my assistant is in offline mode and doesn't understand a word, it asks me to spell it. Once spelled, she reads it back and asks for confirmation. The word is then injected.



Not sure I understand how you do it technically and/or practically :thinking:: Snips will recognise "unknownword" or match a specific (wrong…) intent every time (at least in French, that's how it behaves at home), right?
The unknown-intent state is quite rare (maybe it depends on how many intents/Apps we have?).
I suppose you trigger the "unknown intent" -> ask-for-spelling flow once you are already in a particular App's intent decision tree?
Is it really practical/useful to spell "haricots verts", then "pain de mie", etc., 15 times just for a shopping list? :face_with_raised_eyebrow: My family members and I would take pen and paper after the first 2 or 3 tries made for the fun of the new App daddy put on :smiling_face_with_three_hearts:, but wouldn't use it any further… :sweat:

I did configure :

retry_count = 1

And almost all my intent slots are optional. False-positive hotword and intent recognitions are really annoying when you are in a conversation with friends and the assistant asks 3 times (the default) what your intention really was… :triumph:



I have the retry count deactivated and the retries are handled by my assistant. Also, all my slots are optional. The spelling applies to many things, and you can spell pretty fast. I'm almost sure you'll have "haricots verts" and "pain de mie" already loaded in your assistant; only specific words will be a problem. I loaded a list of 19k common words into a slot for Snips. The spelling in my case happens, for example, when searching for information:

  • Hey Alice!
  • Yes?
  • Can you find me some information about Morat please?
  • Sure! Thank you for asking so nice by the way!
  • I have found many results for “Mohawk”. Please be more precise or try to spell the search word
  • m o r a t
  • Murten (German) or Morat (French) is a municipality in the See district of the canton of Fribourg in Switzerland.
    It is located on the southern shores of Lake Morat (also known as Lake Murten). Morat is situated between Bern and Lausanne and is the capital of the See/Lac District of the canton of Fribourg.

After this, Morat is injected into the NLU. This applies to many things and can be handled differently depending on the use case.

About hotword detection when friends are around: I made a do-not-disturb mode that only a specific wake word can reset, or that you trigger by covering the satellite with your palm, as showcased in one of my videos.

Sorry, this is all beside your specific question, I know. I'm not trying to push you another way, of course, just sharing my way of doing things since you pinged me in the first topic.



Yeah, a "do not disturb" / hotword-off / LED-off mode is a must-have (conversation, watching TV, etc.): activated via Tasker -> Jeedom, or any IR remote (multiple IR receivers/emitters in the house). I have not implemented multiple wake/hot words yet, but I will. My Pi sound card does not have a button, and most of the time I can't physically access it anyway.

I am indeed waiting for Project Alice (awaiting the Snips team's review atm, if I'm correct); I'd like to see how you handled this :smiley:
You have a shopping-list slot type, a wiki one, etc.? And the NLU reinjects matched words. Nice! Good idea! I like it! :slight_smile:
My personal experience with Snips is that if you inject many words into a slot (19k!), false positives on words (-> and then on intent recognition) get too high in some particular situations, which are difficult to predict. But if you are using it like that, I most probably have a lot of tweaking to do on my Snips :sweat_smile:
In your wiki search example, you ask for spelling only if the wiki API returns too many answers (the disambiguation case, etc.), but what do you do when a single match exists, but not the one I want? Say I want to search for info on "Code Quantum" (an old TV show) versus "Quantum physics": both have a result.
But asking for spelling systematically won't do either…

-> A general-purpose ASR for specific cases still seems to allow more practical uses.



Alice does tell me she found many results and asks me to be more specific if more than one result is found. I have not yet implemented listing the results and injecting them for ease of use, but it's on my todo list :slight_smile:

Alice is not awaiting Snips review at all. I gave it to them to look at after they asked me, when I visited during the Maker Faire. Alice is awaiting my satisfaction with the code's sanity and my decision to release. To be really honest, I am not satisfied with the current code, and I'm refactoring daily so that Alice becomes automatically deployable anywhere. And I simply don't want to share her yet :slight_smile:



About your first point

‘how to encode a wav file from the “hermes/audioServer/default/audioFrame” messages payloads again !
I found the initial python basic code for handling the capture on internet as a snippet copy/paste site (I cannot seem to find it again to reference the proper dev/source…).’

What do you mean by encode a wav file again ?



This was a "perfectly" working code/app. The code included in my first post used to work: it decoded and assembled the MQTT "hermes/audioServer/default/audioFrame" message payloads generated by the Snips audio server and appended them to a simple wav file. By starting and stopping the capture at the right times, I got only the voice parts I was interested in, which I could then use.
The "broken" part :sob::
Voice -> mic -> Snips audio server -> encoded and transmitted/published via MQTT -> my capture code -> written to a wav file
I want to get this part working again.
Sorry if I was not clear.



Do you manually parse the wav frames you receive through MQTT? I noticed that with an older version of snips-audio-server (0.58.3), the wav payloads I receive ('hermes/audioServer/default/audioFrame') have a correct 44-byte wav header, but with a more recent version (0.60+) they have a 60-byte header. Hope it helps.



Just saw the code snippet in your initial post: you do parse the wav yourself, so your problem might be the header size.



Yeeeees!! Well spotted @cedmart1, thanks for the info: some extra "time" chunk seems to have been added to the header, growing it from the classic 44-byte wave header to a 60-byte one…
Setting chunkOffset directly to 52 to bypass the extra bytes, it seems I'm back in business :wink:. I just did some basic debugging / byte-info printing to get to the correct data offset, and it seems to work well. I will do further testing tomorrow.
How did you determine the size of the new header (60 bytes) from the rest of the data?
Thanks again :slight_smile:
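For the record, instead of hardcoding offset 52, the RIFF chunk list can be walked generically until the "data" chunk is found, which handles both the old 44-byte and the new 60-byte headers. A minimal sketch (the `find_data_chunk` function name is mine):

```python
import struct

def find_data_chunk(payload):
    """Walk the RIFF sub-chunks of a WAVE payload and return
    (offset, size) of the 'data' chunk content, skipping unknown
    chunks such as the extra 'time' one."""
    riff, _riff_size, wave_id = struct.unpack('<4sI4s', payload[:12])
    if riff != b'RIFF' or wave_id != b'WAVE':
        raise ValueError('not a RIFF/WAVE payload')
    offset = 12
    while offset + 8 <= len(payload):
        chunk_id, chunk_size = struct.unpack('<4sI', payload[offset:offset + 8])
        if chunk_id == b'data':
            return offset + 8, chunk_size
        # skip this chunk (RIFF chunks are padded to an even size)
        offset += 8 + chunk_size + (chunk_size % 2)
    raise ValueError("no 'data' chunk found")
```

With a 24-byte "fmt " chunk and a 16-byte "time" chunk, this lands exactly on byte 60, matching the observation above.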



Glad I could help. I lost some hair on this one too.



I injected a known wav file directly into snips-audio-server (--no-mike --hijack)
and saved the wav file from the payload of the first MQTT message I received.
I compared the two files with a hexadecimal editor; that's where I saw that my first 'data' byte wasn't where it was supposed to be.



Thanks for your help, the App is now working again :sunglasses:

On a side note, out of curiosity and for improvements, it raised some Python dev questions:

  1. How do you inject a wav file to be played (on the default site or a satellite)?
    At the moment, I use a basic system call to the mosquitto client from the shell :sweat_smile::

/usr/bin/mosquitto_pub -h <mqtt_host> -t "hermes/audioServer/<site_id>/playBytes/<wav_name>" -f test.wav

Do you know how to do it in a more Pythonic way (i.e. publish to the topic…)? How do you send the wav data without reinventing the wheel?
(I am using wav_name as an id/signal/flag to emulate a classic Snips dialogue atm.)

  2. You said you "saved the wav data from the payload": how? If you didn't do it the (crude/hardcore) way I did?

Thanks in advance.



Sorry I just saw your last message.

So to play a wav I do this :

import paho.mqtt.publish as publish

with open('./test.wav', mode='rb') as binaryFile:
    wav = bytearray(

publish.single('hermes/audioServer/{}/playBytes/'.format(site),
               payload=wav, hostname=snips_mqtt_host, port=snips_mqtt_port)

But when I want to do it from a skill:
I tweaked my action{} to provide paho-mqtt publish methods to all my skills.

For point 2, I simply saved the payload data minus the header, giving me a raw audio file that I can open with Audacity.



Hi all, looking at the 16 additional bytes: does anyone know how the 12 bytes following the 4-byte "time" chunk id are created? I can see that specific bytes change each second / minute / hour / day, but to be honest, I cannot find a way to recreate those bytes in Java. Any ideas?

EDIT: Solved the problem by assigning zeros to nearly every byte. It seems the content does not matter in my case. (I am replacing snips-audio-server with an MQTT client which parses a raw audio stream into the required MQTT audioFrame messages in WAV format.)

        header[36] = 't';
        header[37] = 'i';
        header[38] = 'm';
        header[39] = 'e';
        header[40] = 8;   // chunk size, little-endian: 8 payload bytes
        header[41] = 0;
        header[42] = 0;
        header[43] = 0;
        header[44] = 0;   // the 8 payload bytes, zeroed (content seems unused)
        header[45] = 0;
        header[46] = 0;
        header[47] = 0;
        header[48] = 0;
        header[49] = 0;
        header[50] = 0;
        header[51] = 0;
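For comparison, the whole 60-byte header (standard "fmt " chunk plus the zero-filled "time" chunk) can be rebuilt in Python under the same assumption that the 8 "time" payload bytes can stay zero. The function name and default parameters below are my own illustration:

```python
import struct

def wav_header_with_time_chunk(data_size, sample_rate=16000, channels=1, bits=16):
    """Build a 60-byte header like the one snips-audio-server 0.60+
    seems to emit: RIFF + 'fmt ' + zeroed 'time' + 'data' header."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    header = struct.pack('<4sI4s', b'RIFF', 52 + data_size, b'WAVE')
    header += struct.pack('<4sIHHIIHH', b'fmt ', 16, 1, channels,
                          sample_rate, byte_rate, block_align, bits)
    header += struct.pack('<4sI8s', b'time', 8, b'\x00' * 8)  # zeroed payload
    header += struct.pack('<4sI', b'data', data_size)
    return header
```

Prepending this header to raw 16-bit PCM samples gives a payload shaped like the newer audioFrame messages.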


@crlhldbrndt can you explain more what you’re doing? Seems very interesting.



@idalys love your progress on this. I'm doing something similar, but my next step is to alter the voice enough that our actual voice samples aren't sent to these cloud-based services :slight_smile:



@computerjunkie yep, I'm happy about it :grinning: (I just implemented list management (add/remove/rename/send items and lists) and the wiki search, so far). Since then, I modified the interaction/dialogue integration with Snips to start and end sample recording using the VAD signals (plus a min and max recording time) instead of my own crappy MQTT signalling.
I am also using Psycho's idea of dynamically reinjecting into the Snips vocabulary all list names/items/search entries previously made (stored, and reinjected on Snips startup). Especially useful for list management! :slight_smile:
My next step will probably be to enable spelling words as @Psycho does, which seems to be the only practical solution for homophones and similar situations.
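For reference, the dynamic reinjection mentioned above goes through the `hermes/injection/perform` topic. A minimal sketch of building the JSON payload, assuming the `[["add", {entity: values}]]` operation shape from the Snips injection docs (the `shopping_item` entity name is a made-up example):

```python
import json

def make_injection_request(entity_values, operation='add'):
    """Sketch of a Snips NLU injection request payload, to be
    published on 'hermes/injection/perform'. entity_values maps an
    entity/slot-type name to the new values to inject."""
    return json.dumps({
        'operations': [
            [operation, entity_values],
        ],
    })
```

The resulting string would then be published with any MQTT client; entity names must match the slot types defined in your assistant.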

What do you mean by altering the voice? Do you mean using an alternate, 100% local ASR to analyse only the "free word" parts? Something homebrew?



@idalys so my idea is similar to yours, with one exception: I want to capture the audio that Snips records and then do the following:

  1. Wait to see if an existing on-device intent is recognized
  2. If no intent is recognized, alter the captured audio
  3. Send it to amazon/google for processing
  4. Play audio response from amazon/google

A basic example would be:

  1. Jarvis, when is the first day of summer?
    1a. Capture audio, alter audio to disguise real voice
  2. Wait for IntentNotRecognized message
  3. Send altered audio to amazon/google
  4. Wait for response from amazon/google
  5. Continue or finish session with the response from amazon/google