This article also refers to my series of earlier articles regarding Project Sinos.
Did you ever think about how you understand language? How do you manage to pick speech out of all the noises in the environment when someone speaks to you? Almost nobody thinks about it. But it becomes essential if you want to recognize speech electronically.
Frameworks like CMU Sphinx can take over the recognition of words and the formation of meaningful sentences. But permanently analyzing all noises of the environment produces miserable results. This is why smart speakers usually need to detect a Hot Word like "Alexa" or "Hey, Google". Cloud-based smart speakers would also produce a huge amount of network traffic if they permanently sent all noises to their decoding service. This shows the need for hot words.
Well, how do you do that exactly?
Hot Word Detection
The simplest part is the detection of the Hot Word itself. You just need to check whether the (already decoded) input sentence starts with the Hot Word. But as you see, the input already needs to be decoded. If you analyze permanently without any background noise, this works.
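As a minimal sketch of that check, assuming the decoder hands back a plain text sentence: the hot word "hey sinos" and the function name `strip_hot_word` are made up for illustration and are not part of CMU Sphinx.

```python
HOT_WORD = "hey sinos"  # hypothetical hot word, chosen only for this example


def strip_hot_word(sentence, hot_word=HOT_WORD):
    """Return the command that follows the hot word.

    Returns None if the decoded sentence does not start with the hot word.
    The comparison is done word by word and ignores case.
    """
    words = sentence.lower().split()
    hot = hot_word.lower().split()
    if words[:len(hot)] != hot:
        return None
    return " ".join(words[len(hot):])
```

Comparing whole words instead of using a plain string prefix avoids accepting a sentence that merely begins with the same letters as the hot word.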
When To Listen?
Before decoding, we need to detect when to decode at all. Decoding every filled frame you get from the microphone means that you always start a blocking task, which prevents you from recording and receiving further sound frames. In the worst case, you only ever get the first full frame. So you need to record into a buffer until there are no more "loud" frames. And of course you only start buffering when the frames are "loud".
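The buffering described above could look roughly like the following sketch. It assumes frames arrive as bytes of 16-bit signed little-endian PCM and measures loudness as the root mean square of the samples. The names `rms` and `collect_utterances`, the threshold of 500 and the `max_silent_frames` parameter (how many quiet frames in a row end a recording) are assumptions for this example, not something CMU Sphinx prescribes.

```python
import math
import struct


def rms(frame):
    """Root mean square of a frame of 16-bit signed little-endian PCM samples."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, frame[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def collect_utterances(frames, threshold=500.0, max_silent_frames=3):
    """Yield one bytes buffer per utterance from an iterable of frames.

    Buffering starts with the first loud frame and ends after
    max_silent_frames quiet frames in a row.
    """
    buffer = bytearray()
    silent = 0
    for frame in frames:
        loud = rms(frame) > threshold
        if not buffer:
            # Not recording yet: ignore quiet frames, start on a loud one.
            if loud:
                buffer.extend(frame)
                silent = 0
            continue
        # Recording: keep every frame and count consecutive quiet ones.
        buffer.extend(frame)
        silent = 0 if loud else silent + 1
        if silent >= max_silent_frames:
            yield bytes(buffer)
            buffer.clear()
            silent = 0
    if buffer:
        # The input ended while still recording.
        yield bytes(buffer)
```

Each yielded buffer is what you would hand to the decoder. Waiting for a few quiet frames instead of stopping at the first one keeps short pauses between words from cutting a command in half.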
Handling Background Noise
Another problem is background noise, because you never have an absolutely silent environment. Fortunately, CMU Sphinx can handle quite a bit of background noise itself. But our detection from the last paragraph, which fills the buffer, still cannot tell whether the recorded noise is background noise or the start of a voice command. And, worse: how do you handle permanent background noise?
Permanent background noise can be treated as silence, if it is not too loud. If you listen to music at room volume, this is not a big problem for CMU Sphinx. If you start speaking a voice command, the loudness at the recording device will increase (non-linearly). So you can simply add a threshold to the noise detection that needs to be exceeded.
The next problem is increasing background noise. Imagine the following scenario: You are sitting in your living room. A clock on the wall ticks, you breathe and move around. All this causes background noise. You tell the smart speaker to play music ... the environmental loudness increases. And the threshold level for voice detection also needs to be increased.
Okay - so the threshold increased. And now, you tell the smart speaker to stop the music again. As you can imagine, you also need to lower the threshold again.
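One possible way to let the threshold follow the environment in both directions is to track the noise floor with an exponential moving average and place the threshold a fixed factor above it. This is only a sketch of that idea, not necessarily how Project Sinos does it. The class name `AdaptiveThreshold` and the values for `initial_floor`, `alpha` and `factor` are assumptions for illustration.

```python
class AdaptiveThreshold:
    """Loudness threshold that slowly follows the background noise level."""

    def __init__(self, initial_floor=100.0, alpha=0.05, factor=2.0):
        self.floor = initial_floor  # current estimate of the background loudness
        self.alpha = alpha          # how fast the estimate adapts (0..1)
        self.factor = factor        # how far above the floor a frame counts as loud

    @property
    def threshold(self):
        return self.floor * self.factor

    def update(self, loudness):
        """Feed the loudness of one frame.

        Returns True if the frame exceeds the threshold as it was before
        this frame, then moves the noise floor towards the new loudness.
        """
        is_loud = loudness > self.threshold
        self.floor += self.alpha * (loudness - self.floor)
        return is_loud
```

When the music starts, the first frames count as loud, but the floor rises until the music no longer exceeds the threshold. When the music stops, the floor sinks back and the threshold drops with it. A small `alpha` keeps a short voice command from raising the threshold too much while it is being spoken.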