
Spokestack Wakeword Detection

The wakeword detection component in spokestack is responsible for detecting utterances of any of a set of keyword phrases in soft real time in an Android application. Its interface is the same as the current Voice Activity Detector (VAD) module, so that wakeword detection can be plugged into a speech pipeline transparently.

Behavior

The wakeword trigger detects whole keyword phrases, in order to activate the spokestack pipeline. These activations are maintained until the user stops talking (as detected with the VAD) or an activation timeout occurs.

This behavior allows the wakeword trigger to be used for multi-turn dialogue sessions. In these sessions, the first turn is triggered by the wakeword and subsequent turns are triggered by an application call to context.setActive after a system prompt is presented to the user.
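
For example, a follow-up turn might be opened by the application once the system prompt finishes playing. The sketch below assumes the application holds a reference to the pipeline's SpeechContext; the helper class and callback name are hypothetical, not part of the Spokestack API.

```java
import io.spokestack.spokestack.SpeechContext;

/**
 * Hypothetical helper that opens the next dialogue turn once the
 * system prompt has finished playing.
 */
final class TurnManager {
    private final SpeechContext context;

    TurnManager(SpeechContext context) {
        this.context = context;
    }

    /** Called by the application's media player when prompt playback ends. */
    void onPromptFinished() {
        // reactivate the pipeline for the follow-up turn; the activation
        // will close via VAD deactivation or the activation timeout
        this.context.setActive(true);
    }
}
```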

Because the wakeword trigger uses a VAD for speech detection, it must cope with two types of activations. In the first case, the user says the wakeword and then pauses, waiting for a visual cue; for example, "Alexa (pause) what is seventy factorial?" Here, the VAD will deactivate during the pause, before the user has spoken the request, so the trigger must ignore the first VAD deactivation and keep the activation open.

The second type of activation occurs when the user says the wakeword phrase and then immediately makes the request, such as "Alexa what is seventy factorial?" In this case, the VAD does not deactivate after the wakeword is spoken, so the trigger maintains the spokestack activation until the VAD deactivates or the activation timeout expires.

Configuration

A reference for the wakeword trigger's configuration parameters and defaults can be found in the javadoc API documentation. This guide describes how to use the parameters in practice and the subtle interactions among them.
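
As a quick orientation, the sketch below shows how these properties are supplied through the speech pipeline builder. The stage class names and property values are illustrative only; the javadoc is the authoritative reference for both.

```java
import io.spokestack.spokestack.SpeechPipeline;

/**
 * Illustrative pipeline configuration for wakeword detection.
 * Class names and property values are examples, not recommendations.
 */
final class PipelineFactory {
    static SpeechPipeline build(
            String filterPath,
            String encodePath,
            String detectPath) throws Exception {
        return new SpeechPipeline.Builder()
            .setInputClass("io.spokestack.spokestack.android.MicrophoneInput")
            .addStageClass("io.spokestack.spokestack.webrtc.VoiceActivityDetector")
            .addStageClass("io.spokestack.spokestack.wakeword.WakewordTrigger")
            .setProperty("wake-filter-path", filterPath)
            .setProperty("wake-encode-path", encodePath)
            .setProperty("wake-detect-path", detectPath)
            .setProperty("wake-threshold", 0.9)
            .setProperty("wake-active-max", 5000)
            .build();
    }
}
```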

Model Hyperparameters

These parameters are coupled to the machine learning models used by the wakeword trigger and must be adjusted when the models' hyperparameters change.

rms-target

The wakeword trigger normalizes all audio during signal processing to a reference level based on the Root Mean Square (RMS) energy of the signal. All audio samples are scaled to this target RMS value using the overall RMS energy of the clip (training) or an Exponentially-Weighted Moving Average (EMA) of the microphone's RMS energy (inference). The RMS target should be a floating point value in the range [0, 1].
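
The following is a minimal sketch of the inference-time normalization described above, assuming floating-point samples in [-1, 1]. It is not the Spokestack implementation; in particular, Spokestack only updates the running EMA for voiced audio (see rms-alpha below), which this sketch omits.

```java
/**
 * Sketch of RMS normalization: the gain applied to each frame is the ratio
 * of the configured rms-target to a running EMA of the frame RMS energy.
 * Field names mirror the configuration keys.
 */
final class RmsNormalizer {
    private final float rmsTarget;   // rms-target
    private final float rmsAlpha;    // rms-alpha
    private float rmsValue;          // running EMA of RMS energy

    RmsNormalizer(float rmsTarget, float rmsAlpha) {
        this.rmsTarget = rmsTarget;
        this.rmsAlpha = rmsAlpha;
        this.rmsValue = rmsTarget;   // with rms-alpha = 0, gain stays at 1 (disabled)
    }

    void normalize(float[] frame) {
        // frame RMS energy
        float sum = 0;
        for (float sample : frame)
            sum += sample * sample;
        float rms = (float) Math.sqrt(sum / frame.length);

        // exponentially-weighted moving average of the RMS energy
        this.rmsValue = this.rmsAlpha * rms + (1 - this.rmsAlpha) * this.rmsValue;

        // scale each sample toward the target RMS level
        float gain = this.rmsTarget / this.rmsValue;
        for (int i = 0; i < frame.length; i++)
            frame[i] *= gain;
    }
}
```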

fft-window-size

This is the size of the sliding window of audio samples that is used to compute the STFT of the signal. It is measured in number of samples. For best performance, the window size should be a power of 2.

The window size determines the number of frequency bands calculated by the STFT (fft-window-size / 2 + 1), so it affects the runtime performance of the wakeword trigger. Increasing the window size improves the vertical (frequency) resolution of the spectrogram, and thus can improve the accuracy of the trigger, at the cost of a more expensive STFT. A typical value for wakeword detection is 512 samples.

fft-window-type

The window type is the string name of a windowing function to apply to the audio signal prior to calculating its STFT. Currently, only the hann window is supported.

fft-hop-length

This parameter is the number of milliseconds to advance the sliding audio window each time the STFT is calculated. A smaller hop length improves the horizontal (time) resolution of the spectrogram by increasing the rate at which the STFT is calculated. This must be traded against the cost of calculating the STFT and running the rest of the detection pipeline on each frame, since detection occurs on every STFT frame. A typical value for wakeword detection is 10ms.
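
To make the trade-off concrete, the arithmetic below relates the window size and hop length, assuming 16kHz input audio (the sample rate is an assumption here, not a documented default).

```java
// Illustrative arithmetic relating the STFT parameters above.
public final class StftParams {
    public static void main(String[] args) {
        int sampleRate = 16000;             // Hz (assumed)
        int fftWindowSize = 512;            // fft-window-size, samples
        int fftHopLength = 10;              // fft-hop-length, milliseconds

        int numBands = fftWindowSize / 2 + 1;              // 257 frequency bands per frame
        int hopSamples = sampleRate * fftHopLength / 1000; // 160 samples per hop
        int framesPerSecond = 1000 / fftHopLength;         // 100 STFT frames per second

        System.out.printf("bands=%d hop=%d frames/s=%d%n",
                          numBands, hopSamples, framesPerSecond);
    }
}
```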

mel-frame-length

The frame length is the length of the filtered STFT, in milliseconds. This parameter determines the number of frames to include in the spectrogram, which determines the size of the input to the encoder model. For ordinary RNN encoders, this value should be set to fft-hop-length (the default), since these encoders process a single frame at a time. For CRNN encoders, this value can be any multiple of the hop length.

mel-frame-width

This is the number of features in each filtered STFT frame. As with fft-window-size, increasing the frame width increases the vertical resolution of the detector model's inputs. If the filter model uses the mel filterbank, a typical value for this parameter is 40 features for wakeword detection.

wake-encode-length

This is the length of the encoder output sliding window, in milliseconds. It determines the number of encoder output frames to send into the detector model. The default value for this parameter is 1000 (1 second).

wake-encode-width

This is the number of features in each encoded frame. The encode model transforms the mel frames into encoded frames, each of which has this dimension. A typical value for this parameter is 128.

wake-state-width

This is the number of features in the encoder's state vector. This parameter depends on the type of encoder (GRU/LSTM). For GRU networks, this parameter is the same as wake-encode-width (the default for this parameter), since GRU outputs are identical to their hidden states.

Runtime Tunable Parameters

These parameters may be adjusted at runtime without rebuilding/retraining the ML models used for wakeword detection.

wake-filter-path

This parameter is a file system path to the TF-Lite model for filtering the audio spectrogram during signal processing (see Models below for a description of these models). If this model is stored as an Android asset, it must first be extracted to the cache directory, and the cache path must be passed as this parameter.
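
A typical way to produce such a path is to copy the bundled asset into the cache directory at startup. The sketch below uses standard Android APIs; the asset name and helper class are hypothetical, not part of the Spokestack API.

```java
import android.content.Context;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/**
 * Sketch of copying a bundled TF-Lite model out of the APK's assets into
 * the cache directory so its file system path can be used as a
 * wake-*-path property.
 */
final class ModelCache {
    static String extract(Context context, String assetName) throws IOException {
        File cached = new File(context.getCacheDir(), assetName);
        if (!cached.exists()) {
            try (InputStream in = context.getAssets().open(assetName);
                 OutputStream out = new FileOutputStream(cached)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1)
                    out.write(buffer, 0, read);
            }
        }
        return cached.getAbsolutePath();
    }
}
```

The returned path can then be passed as wake-filter-path (and similarly for the encoder and detector models below).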

wake-encode-path

The encode path is a file system path to the wakeword encoder TF-Lite model. It behaves similarly to the wake-filter-path parameter.

wake-detect-path

The detector path is a file system path to the wakeword detection TF-Lite model. It behaves similarly to the wake-filter-path parameter.

rms-alpha

This is the rate parameter of the EMA used to normalize the audio signal to the running RMS energy. A higher rate allows normalization to respond more quickly to changes in signal energy, while a lower rate reduces noise in the RMS value. Note that the RMS energy is only calculated for voiced audio (when the VAD is active).

RMS normalization can be disabled by setting rms-alpha to 0, which is the default.

pre-emphasis

This value controls the pre-emphasis filter used to process the audio signal after RMS normalization. The filter is implemented as x[i] = x[i] - p * x[i - 1], where p is the configured pre-emphasis weight. This filter removes any DC components from the signal and amplifies high frequency components.

Pre-emphasis can be disabled by setting pre-emphasis to 0, which is the default.
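
A minimal sketch of this filter, applied in place to frames of floating-point samples, follows; the class is illustrative, not the Spokestack implementation.

```java
/**
 * Sketch of the pre-emphasis filter x[i] = x[i] - p * x[i - 1], where p is
 * the configured pre-emphasis weight. "prev" carries the last sample of the
 * previous frame so the filter is continuous across frame boundaries.
 */
final class PreEmphasizer {
    private final float p;    // pre-emphasis weight
    private float prev = 0;

    PreEmphasizer(float preEmphasis) {
        this.p = preEmphasis;
    }

    void apply(float[] frame) {
        for (int i = 0; i < frame.length; i++) {
            float current = frame[i];
            frame[i] = current - this.p * this.prev;
            this.prev = current;
        }
    }
}
```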

vad-rise-delay

This parameter sets the rising edge delay (in milliseconds) of the internal VAD used by the wakeword trigger. The rising edge delay is typically configured to 0, since the wakeword trigger sees a sliding window of audio.

vad-fall-delay

The falling edge delay is the number of milliseconds to delay deactivating the VAD after voiced speech is no longer detected. This parameter ensures that the wakeword trigger continues to run between words in a phrase for slow talkers and words with leading/trailing unvoiced phonemes. It also has a subtle interaction with wake-active-min for activations with and without pauses after the wakeword. This parameter should be tuned specifically to the wakeword being used.

wake-threshold

This is the threshold that is compared with the detector model's posterior probability, in order to determine whether to activate the pipeline. It is the primary means of tuning precision/recall for model performance. A standard approach is to choose this threshold such that the model outputs no more than 1 false positive per hour in the test set. This parameter takes on values in the range [0, 1] and defaults to 0.5.

wake-active-min

This parameter represents the minimum number of milliseconds that a wakeword activation must remain active. It is used to prevent a VAD deactivation at the end of the wakeword utterance from prematurely terminating the wakeword activation, when a user pauses between saying the wakeword and making the system request. It should be tuned alongside vad-fall-delay and is typically longer than vad-fall-delay.

wake-active-max

The maximum activation length (milliseconds) is the maximum amount of time any activation can take, even if a VAD deactivation does not occur. This limits the amount of audio processed further in the pipeline by allowing the pipeline activation to time out. The maximum activation length applies to wakeword activations, as well as manual activations (external calls to context.setActive). It should be tuned to the longest expected user utterance.

Design

Models

Filter

The filter model processes the linear amplitude Short-Time Fourier Transform (STFT), converting it into an audio feature representation. This representation may be the result of applying the mel filterbank or calculating MFCC features. The use of a TF-Lite model for filtering hides the details of the filter from spokestack while optimizing the matrix operations involved.

The filter model takes as input a single linear STFT frame, which is computed by spokestack as the magnitude of the FFT over a sliding window of the audio signal. This input is shaped [fft-window-size / 2 + 1]. The model outputs a feature vector shaped [mel-frame-width].
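
A hedged sketch of invoking such a filter model with the TF-Lite Interpreter is shown below. The tensor shapes, including the presence of a batch dimension, are assumptions about the trained model rather than documented facts.

```java
import java.io.File;
import org.tensorflow.lite.Interpreter;

/** Sketch of the filter stage: one STFT magnitude frame in, one mel frame out. */
final class FilterStage {
    private final Interpreter filter;
    private final float[][] stftFrame = new float[1][257]; // [fft-window-size / 2 + 1]
    private final float[][] melFrame  = new float[1][40];  // [mel-frame-width]

    FilterStage(File modelPath) {
        this.filter = new Interpreter(modelPath);
    }

    float[] apply(float[] magnitudes) {
        System.arraycopy(magnitudes, 0, this.stftFrame[0], 0, magnitudes.length);
        this.filter.run(this.stftFrame, this.melFrame);
        return this.melFrame[0];
    }
}
```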

Encoder

The encoder model is the autoregressive component of the system. It processes a single frame (RNN) or a sliding window (CRNN) along with a previous state tensor. The model outputs an encoded representation of the frame and an updated state tensor. The input tensor is shaped [mel-frame-length, mel-frame-width], the output tensor is shaped [wake-encode-width], and the state tensor is shaped [wake-state-width].
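
The autoregressive step might be driven roughly as follows. Again, the tensor shapes and the input/output ordering are assumptions about the model, not documented behavior; a plain RNN encoder consuming one mel frame per step is assumed.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.tensorflow.lite.Interpreter;

/**
 * Sketch of the encoder step: one mel frame plus the previous state tensor
 * go in; an encoded frame plus an updated state come out.
 */
final class EncoderStage {
    private final Interpreter encoder;
    private final float[][] melFrame = new float[1][40];   // [mel-frame-width]
    private final float[][] state    = new float[1][128];  // [wake-state-width]
    private final float[][] encoded  = new float[1][128];  // [wake-encode-width]

    EncoderStage(File modelPath) {
        this.encoder = new Interpreter(modelPath);
    }

    float[] encode(float[] frame) {
        System.arraycopy(frame, 0, this.melFrame[0], 0, frame.length);
        Map<Integer, Object> outputs = new HashMap<>();
        outputs.put(0, this.encoded);
        // the state array is both input and output: the interpreter reads it
        // before running and writes the updated state back for the next call
        outputs.put(1, this.state);
        this.encoder.runForMultipleInputsOutputs(
            new Object[] { this.melFrame, this.state }, outputs);
        return this.encoded[0];
    }
}
```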

Detector

The wakeword trigger uses a pretrained TensorFlow model to process audio samples and detect keyword phrases. The model is a binary classifier that outputs a posterior probability that the wakeword was detected. The architecture of the TensorFlow model is opaque to the wakeword trigger and may vary across models, although it must be constrained to be compatible with TF-Lite and Core ML and fast enough to run in soft real time on all supported devices.

The input to this model is a sliding window of encoder frames, each of which was produced by the encoder model described above. This input is shaped [wake-encode-length, wake-encode-width]. The classifier outputs a scalar probability value.
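
A sketch of the detection step, including the comparison against wake-threshold, might look like the following. The 100-frame window assumes a 10ms hop and a 1-second wake-encode-length; the shapes, like those above, are assumptions about the model.

```java
import java.io.File;
import org.tensorflow.lite.Interpreter;

/**
 * Sketch of the detector stage: a sliding window of encoded frames is fed to
 * the detector, and the scalar posterior is compared with wake-threshold.
 */
final class DetectStage {
    private final Interpreter detector;
    private final float[][][] window = new float[1][100][128]; // [wake-encode-length, wake-encode-width]
    private final float[][] posterior = new float[1][1];
    private final float threshold;                              // wake-threshold

    DetectStage(File modelPath, float threshold) {
        this.detector = new Interpreter(modelPath);
        this.threshold = threshold;
    }

    boolean detect(float[] encodedFrame) {
        // shift the window back one frame and append the newest encoded frame
        System.arraycopy(this.window[0], 1, this.window[0], 0, this.window[0].length - 1);
        this.window[0][this.window[0].length - 1] = encodedFrame.clone();

        this.detector.run(this.window, this.posterior);
        return this.posterior[0][0] >= this.threshold;
    }
}
```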

Spokestack Trigger

The wakeword trigger is responsible for implementing the spokestack trigger interface and using the wakeword model to detect keyword phrases. It receives each frame of audio from the speech pipeline and processes these samples in soft real time to determine whether a keyword phrase was spoken.

Voice Activity Detection

First, the wakeword trigger must determine whether there is active speech, since the wakeword model is not trained to distinguish speech/non-speech. This is done by using an internal instance of the existing voice activity detector component.

Signal Processing

In order to compute the mel spectrogram, the wakeword trigger must maintain a sliding window of audio samples (typically 25ms). The wakeword trigger then computes the amplitude STFT of this sliding window at each frame stride (typically 10ms, 512 FFT components, hann window). The STFT is then multiplied by the mel filterbank matrix to produce a single mel frame that typically contains 40 components for wakeword detection.

For More Information