"Turn off the ceiling light" - ASR truncates to "Turn off"


I’m facing this problem:

  1. The user says, in a normal voice, “Turn off the ceiling light”
  2. The ASR detects “Turn off”
  3. All lights in that room go off
Sometimes - but not often - the ASR does understand “Turn off the ceiling light”.
By “normal voice” I mean a voice that is strong enough for all other commands. The vast majority of commands and queries work as expected and are understood correctly.

If I raise my voice and make an effort to minimize the delay between “Turn off” and “the ceiling light”, it is more likely to be understood.

I have tried to modify the endpointing rules, following “Is it possible to modify the default timeout on no response from the user”.
I’m on 0.61.1, so I had 5 rules: I increased rule 1 and played with rules 2-4, using longer times and higher costs. I noticed that it took significantly longer to get a response from the ASR, so I guess the values were accepted.
However, this didn’t change the behaviour: the ASR is still truncating the command to “Turn off”.

In the assistant I have no training example with just “Turn off”, but there are quite a few (10%) that start with “Turn off”.

“Switch off/on” and “Turn on” suffer from a similar problem.

How can I address this problem?
This is a commercial product, and users are very likely to start by playing with “Turn on/off” commands, so I really would like to fix it.


Hello @jens !
I think your issue might be solved by increasing the min_trailing_silence of the shortest rules: it sets the minimum amount of trailing silence required before endpointing. It should do what you want but, true, endpointing at the end of the utterance will then be slower too.
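Concretely, the change would look something like this. This is only a sketch: the rule numbering and the default values in the comments follow Kaldi’s online-endpoint.h as I remember them, and the dict layout is illustrative - how your 0.61.1 config actually exposes these options may differ.

```python
# Illustrative override of the short-silence endpointing rules
# (rule2/rule3 in Kaldi's numbering): raising min_trailing_silence
# gives the speaker more time to pause mid-command before the
# decoder endpoints.
endpoint_rule_overrides = {
    "rule2": {"min_trailing_silence": 1.0,   # Kaldi default: 0.5 s
              "max_relative_cost": 2.0},
    "rule3": {"min_trailing_silence": 2.0,   # Kaldi default: 1.0 s
              "max_relative_cost": 8.0},
}
```

The trade-off is exactly the one mentioned above: every complete utterance will also take that much longer to be finalized.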

If you want to know more about how these rules work, you can have a look at the implementation that we use here:

In the first link there are comments that really sum up the logic; I copy-paste them here:

   The endpointing rule is a disjunction of conjunctions.  The way we have
   it configured, it's an OR of five rules, and each rule has the following form:
      (<contains-nonsilence> || !rule.must_contain_nonsilence) &&
       <length-of-trailing-silence> >= rule.min_trailing_silence &&
       <relative-cost> <= rule.max_relative_cost &&
       <utterance-length> >= rule.min_utterance_length
    <contains-nonsilence> is true if the best traceback contains any nonsilence phone;
    <length-of-trailing-silence> is the length in seconds of silence phones at the
      end of the best traceback (we stop counting when we hit non-silence),
    <relative-cost> is a value >= 0 extracted from the decoder, that is zero if
      a final-state of the grammar FST had the best cost at the final frame, and
      infinity if no final-state was active (and >0 for in-between cases).
    <utterance-length> is the number of seconds of the utterance that we have
      decoded so far.
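Translated into runnable form, the decision boils down to the following. This is only a sketch of the logic in the quoted comment: names like `EndpointRule` and `should_endpoint` are illustrative, not Kaldi’s actual API, and the two example rules only loosely mirror Kaldi’s defaults.

```python
import math
from dataclasses import dataclass

@dataclass
class EndpointRule:
    must_contain_nonsilence: bool  # rule may only fire after some speech was heard
    min_trailing_silence: float    # seconds of silence required at the end
    max_relative_cost: float       # how "final" the best decoding path must be
    min_utterance_length: float    # minimum utterance length in seconds

def rule_fires(rule, contains_nonsilence, trailing_silence,
               relative_cost, utterance_length):
    """One conjunction: all four conditions must hold."""
    return ((contains_nonsilence or not rule.must_contain_nonsilence)
            and trailing_silence >= rule.min_trailing_silence
            and relative_cost <= rule.max_relative_cost
            and utterance_length >= rule.min_utterance_length)

def should_endpoint(rules, **state):
    """The disjunction: endpoint as soon as any rule fires."""
    return any(rule_fires(r, **state) for r in rules)

# Two example rules: the first fires on long silence even before any
# speech; the second fires on a short pause once the decoder is
# confident it has reached a final state (low relative cost).
rules = [
    EndpointRule(False, 5.0, math.inf, 0.0),
    EndpointRule(True,  0.5, 2.0,      0.0),
]

# A short pause after "Turn off" already satisfies the second rule,
# because "Turn off" on its own is a valid final state of the grammar:
print(should_endpoint(rules, contains_nonsilence=True,
                      trailing_silence=0.6, relative_cost=1.0,
                      utterance_length=1.2))   # prints True
```

This also shows why your case is tricky: since “Turn off” alone reaches a final state, the relative cost is low during the pause, and only min_trailing_silence stands between the pause and endpointing.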

Hope it helps :slight_smile: