End-to-end wake-up word detection method and device

Document No.: 1244051    Publication date: 2020-08-18

Reading note: This technique, "End-to-end wake-up word detection method and device", was designed and created by 解传栋, 胡博, 刘忠亮 and 唐文琦 on 2019-01-24. Its main content is as follows: The invention discloses an end-to-end wake-up word detection method and device, wherein the method comprises: receiving speech to be detected; sequentially extracting acoustic features of each speech frame in the speech to be detected; inputting the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model; taking each target pronunciation unit as a node and inserting virtual silence nodes before and after each target pronunciation unit to obtain a target-time relation matrix; calculating, frame by frame, the cumulative probability of each node in the target-time relation matrix; determining an optimal path according to the cumulative probabilities of the nodes in the matrix; and determining a wake-up word detection result according to the optimal path. The invention improves the accuracy of the detection result and reduces the false wake-up rate.

1. An end-to-end wake-up word detection method, the method comprising:

receiving speech to be detected;

sequentially extracting acoustic features of each speech frame in the speech to be detected;

inputting the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model;

taking each target pronunciation unit as a node, inserting virtual silence nodes before and after each target pronunciation unit, and obtaining a target-time relation matrix according to the posterior probabilities of the target pronunciation units;

calculating, frame by frame, the cumulative probability of each node in the target-time relation matrix;

determining an optimal path according to the cumulative probabilities of the nodes in the matrix;

and determining a wake-up word detection result according to the optimal path.

2. The method of claim 1, further comprising constructing the acoustic model in the following manner:

collecting wake-up word data and non-wake-up word data;

time-labeling the wake-up word data and the non-wake-up word data respectively to obtain frame-level label data;

and training with the frame-level label data to obtain the acoustic model.

3. The method of claim 2, wherein time-labeling the wake-up word data and the non-wake-up word data respectively to obtain frame-level label data comprises:

determining a label mapping relation between the wake-up word and non-wake-up words;

aligning the wake-up word data and the non-wake-up word data respectively to obtain a correspondence between each character in the wake-up word data and the non-wake-up word data and the speech frames it occupies;

and mapping the wake-up word data and the non-wake-up word data into label form according to the label mapping relation and the correspondence, to obtain frame-level label data.

4. The method of claim 3, wherein determining the label mapping relation between the wake-up word and non-wake-up words comprises:

representing the starting and ending time periods of the wake-up word with silence;

setting, for each character in the wake-up word, a label corresponding to the character according to its position in the sequence;

and for words or characters other than silence and the wake-up word, setting the corresponding label to 0.

5. The method of claim 1, wherein calculating, frame by frame, the cumulative probability of each node in the target-time relation matrix comprises:

determining the maximum cumulative probability over all paths by which the node can be reached;

and adding that optimal-path cumulative probability to the probability of the node to obtain the cumulative probability of the node.

6. The method of any one of claims 1 to 5, wherein determining the wake-up word detection result according to the optimal path comprises:

determining the start position and end position of each target pronunciation unit on the optimal path, and calculating the length and average probability of each target pronunciation unit from the cumulative probabilities and the start and end positions;

determining whether a set condition is met according to the length and/or the average probability of each target pronunciation unit on the optimal path;

and if so, determining that the wake-up word is detected.

7. The method of claim 6, wherein the set condition comprises:

the length of each target pronunciation unit in a set interval is greater than a set length threshold; and/or

the average probability of each target pronunciation unit in a set interval is greater than a set average probability threshold.

8. An end-to-end wake-up word detection apparatus, the apparatus comprising:

a receiving module, configured to receive speech to be detected;

a feature extraction module, configured to sequentially extract acoustic features of each speech frame in the speech to be detected;

an acoustic detection module, configured to input the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model;

a matrix construction module, configured to take each target pronunciation unit as a node, insert virtual silence nodes before and after each target pronunciation unit, and obtain a target-time relation matrix according to the posterior probabilities of the target pronunciation units;

a calculation module, configured to calculate, frame by frame, the cumulative probability of each node in the target-time relation matrix;

an optimal path determining module, configured to determine an optimal path according to the cumulative probabilities of the nodes in the matrix;

and a detection module, configured to determine a wake-up word detection result according to the optimal path.

9. A computer device, comprising: one or more processors and a memory;

the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the method of any one of claims 1 to 7.

10. A readable storage medium having stored thereon instructions which, when executed, implement the method of any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of voice wake-up, and in particular to an end-to-end wake-up word detection method and device.

Background

Voice wake-up technology means that when an intelligent device in a sleep state detects specific speech from a user (usually a set wake-up word), the device enters a state of awaiting instructions and can then start a voice interaction session. Voice wake-up is widely applied, for example in robots, smart speakers and automobiles. The main metrics for evaluating the voice wake-up effect are the wake-up rate and the false wake-up rate, and the performance of the decoder plays a key role in the recognition process.

At present, the wake-up detection method used by most intelligent devices is a wake-up word detection method based on end-to-end technology. In a conventional end-to-end model, the output of the acoustic model in the decoder typically corresponds to pronunciation units, which may be words, characters or syllables. The acoustic model is mainly used to calculate the likelihood between the speech features and each pronunciation template: its input is the speech features, and its output is the posterior probabilities of the target pronunciation units. For example, for an input speech segment containing the wake-up word "waning you good", the character-based end-to-end acoustic model outputs "you", "good", "waning", "young", "sil" and "other", where "sil" represents the silence output and "other" represents speech other than the wake-up word.

The existing end-to-end wake-up word detection method dynamically programs over the target pronunciation units output by the acoustic model, searches for an optimal value, judges whether the optimal value of a certain path exceeds a preset threshold, and then decides whether to wake up. This detection method has at least the following disadvantages: the detection rate of wake-up words still needs to be improved, and a certain degree of false wake-up occurs.

Disclosure of Invention

Embodiments of the invention provide an end-to-end wake-up word detection method and device, which improve the accuracy of the detection result and reduce the false wake-up rate.

To this end, the invention provides the following technical solution:

an end-to-end wake-up word detection method, the method comprising:

receiving speech to be detected;

sequentially extracting acoustic features of each speech frame in the speech to be detected;

inputting the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model;

taking each target pronunciation unit as a node, inserting virtual silence nodes before and after each target pronunciation unit, and obtaining a target-time relation matrix according to the posterior probabilities of the target pronunciation units;

calculating, frame by frame, the cumulative probability of each node in the target-time relation matrix;

determining an optimal path according to the cumulative probabilities of the nodes in the matrix;

and determining a wake-up word detection result according to the optimal path.

Optionally, the acoustic model is an end-to-end acoustic model based on the target pronunciation units in the wake-up word.

Optionally, a target pronunciation unit is a syllable, a character, or a word.

Optionally, the method further comprises constructing the acoustic model in the following manner:

collecting wake-up word data and non-wake-up word data;

time-labeling the wake-up word data and the non-wake-up word data respectively to obtain frame-level label data;

and training with the frame-level label data to obtain the acoustic model.

Optionally, time-labeling the wake-up word data and the non-wake-up word data respectively to obtain frame-level label data comprises:

determining a label mapping relation between the wake-up word and non-wake-up words;

aligning the wake-up word data and the non-wake-up word data respectively to obtain a correspondence between each character in the wake-up word data and the non-wake-up word data and the speech frames it occupies;

and mapping the wake-up word data and the non-wake-up word data into label form according to the label mapping relation and the correspondence, to obtain frame-level label data.

Optionally, determining the label mapping relation between the wake-up word and non-wake-up words comprises:

representing the starting and ending time periods of the wake-up word with silence;

setting, for each character in the wake-up word, a label corresponding to the character according to its position in the sequence;

and for words or characters other than silence and the wake-up word, setting the corresponding label to 0.

Optionally, aligning the wake-up word data and the non-wake-up word data comprises:

aligning the wake-up word data and the non-wake-up word data respectively with a pre-established alignment model.

Optionally, calculating, frame by frame, the cumulative probability of each node in the target-time relation matrix comprises:

determining the maximum cumulative probability over all paths by which the node can be reached;

and adding that optimal-path cumulative probability to the probability of the node to obtain the cumulative probability of the node.

Optionally, determining an optimal path according to the cumulative probabilities of the nodes in the matrix comprises:

calculating the score of each path from the cumulative probabilities of the nodes in the matrix, and taking the path with the maximum score as the optimal path.

Optionally, determining the wake-up word detection result according to the optimal path comprises:

determining that the wake-up word is detected if the cumulative probability corresponding to each target pronunciation unit on the optimal path is greater than a set maximum probability threshold.

Optionally, determining the wake-up word detection result according to the optimal path comprises:

determining the start position and end position of each target pronunciation unit on the optimal path, and calculating the length and average probability of each target pronunciation unit from the cumulative probabilities and the start and end positions;

determining whether a set condition is met according to the length and/or the average probability of each target pronunciation unit on the optimal path;

and if so, determining that the wake-up word is detected.

Optionally, calculating the length and average probability of a target pronunciation unit from the cumulative probabilities and the start and end positions comprises:

subtracting the start position of the target pronunciation unit from its end position to obtain the length of the target pronunciation unit;

and subtracting the cumulative probability at the start position from the cumulative probability at the end position, and dividing the result by the length of the target pronunciation unit, to obtain the average probability of the target pronunciation unit.

Optionally, the set condition comprises:

the length of each target pronunciation unit in a set interval is greater than a set length threshold; and/or

the average probability of each target pronunciation unit in a set interval is greater than a set average probability threshold.
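As a concrete illustration of the length and average-probability computation above, the following Python sketch derives both quantities from the cumulative probabilities on the optimal path and applies the set conditions. The names and threshold values (`min_len`, `min_avg`) are illustrative assumptions, not values taken from the patent:

```python
# Sketch of the length / average-probability check described above.
# cum_prob[t] is the cumulative probability after frame t on the optimal
# path; thresholds are illustrative defaults, not from the patent.

def unit_stats(cum_prob, start, end):
    """Length and average probability of one target pronunciation unit."""
    length = end - start                      # end position minus start position
    avg = (cum_prob[end] - cum_prob[start]) / length
    return length, avg

def detect(units, cum_prob, min_len=5, min_avg=0.5):
    """A wake-up word is detected only if every unit passes both checks."""
    for start, end in units:
        length, avg = unit_stats(cum_prob, start, end)
        if length <= min_len or avg <= min_avg:
            return False
    return True
```

A unit whose span is too short or whose average probability is too low vetoes the wake-up, which is how the set condition suppresses false wake-ups.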

An end-to-end wake-up word detection apparatus, the apparatus comprising:

a receiving module, configured to receive speech to be detected;

a feature extraction module, configured to sequentially extract acoustic features of each speech frame in the speech to be detected;

an acoustic detection module, configured to input the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model;

a matrix construction module, configured to take each target pronunciation unit as a node, insert virtual silence nodes before and after each target pronunciation unit, and obtain a target-time relation matrix according to the posterior probabilities of the target pronunciation units;

a calculation module, configured to calculate, frame by frame, the cumulative probability of each node in the target-time relation matrix;

an optimal path determining module, configured to determine an optimal path according to the cumulative probabilities of the nodes in the matrix;

and a detection module, configured to determine a wake-up word detection result according to the optimal path.

Optionally, the acoustic model is an end-to-end acoustic model based on the target pronunciation units in the wake-up word.

Optionally, a target pronunciation unit is a syllable, a character, or a word.

Optionally, the apparatus further comprises a model construction module for constructing the acoustic model; the model construction module comprises:

a data collection unit, configured to collect wake-up word data and non-wake-up word data;

a labeling unit, configured to time-label the wake-up word data and the non-wake-up word data respectively to obtain frame-level label data;

and a training unit, configured to train with the frame-level label data to obtain the acoustic model.

Optionally, the labeling unit comprises:

a mapping relation determining unit, configured to determine the label mapping relation between the wake-up word and non-wake-up words;

an alignment unit, configured to align the wake-up word data and the non-wake-up word data respectively to obtain a correspondence between each character in the wake-up word data and the non-wake-up word data and the speech frames it occupies;

and a mapping unit, configured to map the wake-up word data and the non-wake-up word data into label form according to the label mapping relation and the correspondence, to obtain frame-level label data.

Optionally, the mapping relation determining unit is specifically configured to represent the starting and ending time periods of the wake-up word with silence; set, for each character in the wake-up word, a label corresponding to the character according to its position in the sequence; and for words or characters other than silence and the wake-up word, set the corresponding label to 0.

Optionally, the alignment unit is specifically configured to align the wake-up word data and the non-wake-up word data respectively with a pre-established alignment model.

Optionally, the calculation module is specifically configured to determine the maximum cumulative probability over all paths by which a node can be reached, and add that optimal-path cumulative probability to the probability of the node to obtain the cumulative probability of the node.

Optionally, the optimal path determining module is specifically configured to calculate the score of each path from the cumulative probabilities of the nodes in the matrix, and take the path with the largest score as the optimal path.

Optionally, the detection module is specifically configured to determine whether the cumulative probabilities corresponding to the target pronunciation units on the optimal path are all greater than a set maximum probability threshold, and if so, determine that the wake-up word is detected.

Optionally, the detection module comprises:

a determining unit, configured to determine the start position and end position of each target pronunciation unit on the optimal path, and calculate the length and average probability of each target pronunciation unit from the cumulative probabilities and the start and end positions;

and a judging unit, configured to determine whether a set condition is met according to the length and/or the average probability of each target pronunciation unit on the optimal path, and if so, determine that the wake-up word is detected.

Optionally, the determining unit subtracts the start position of a target pronunciation unit from its end position to obtain the length of the target pronunciation unit; and subtracts the cumulative probability at the start position from the cumulative probability at the end position and divides the result by the length of the target pronunciation unit to obtain the average probability of the target pronunciation unit.

Optionally, the set condition comprises:

the length of each target pronunciation unit in a set interval is greater than a set length threshold; and/or

the average probability of each target pronunciation unit in a set interval is greater than a set average probability threshold.

A computer device, comprising: one or more processors and a memory;

the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the method described above.

A readable storage medium having stored thereon instructions which, when executed, implement the foregoing method.

In the end-to-end wake-up word detection method and device provided by embodiments of the invention, virtual silence nodes are inserted before and after each target pronunciation unit, and a target-time relation matrix is obtained according to the posterior probabilities of the target pronunciation units; the cumulative probability of each node in the target-time relation matrix is calculated frame by frame; an optimal path is determined according to the cumulative probabilities of the nodes in the matrix; and the wake-up word detection result is determined according to the optimal path. Because virtual silence nodes are added between the target pronunciation units, the node sequence better matches normal pronunciation patterns, which effectively improves the accuracy of the detection result, raises the detection rate of wake-up words, and suppresses false wake-ups.

In the dynamic time warping process, words or characters other than silence and the wake-up word contribute little and may compete with silence; the invention therefore represents them as silence, avoiding interference with the silence representation and further improving the accuracy of the detection result.

Furthermore, when the acoustic model is trained, an alignment model is used to align the wake-up word data and the non-wake-up word data respectively, yielding the time information of the training data and hence frame-level label data, and the acoustic model is trained on this frame-level label data.

Furthermore, constraining the length and/or the average probability of each target pronunciation unit effectively reduces false wake-ups.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below are only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them.

FIG. 1 is a flow chart of constructing an acoustic model in an embodiment of the present invention;

FIG. 2 is a flowchart of a method for detecting an end-to-end wake-up word according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of end-to-end HMM transition logic in an embodiment of the present invention;

FIG. 4 is another flowchart of a method for detecting an end-to-end wake-up word according to an embodiment of the present invention;

FIG. 5 is a block diagram of an end-to-end wake-up word detection apparatus according to an embodiment of the present invention;

FIG. 6 is a block diagram of a model building module according to an embodiment of the present invention;

FIG. 7 is a block diagram of another structure of an end-to-end wake-up word detection apparatus according to an embodiment of the present invention;

FIG. 8 is a block diagram illustrating an apparatus for an end-to-end wake word detection method in accordance with an example embodiment;

FIG. 9 is a schematic structural diagram of a server in an embodiment of the present invention.

Detailed Description

To help those skilled in the art better understand the solutions of the embodiments of the invention, the embodiments are described in further detail below with reference to the drawings and implementations.

Embodiments of the invention provide an end-to-end wake-up word detection method and device. Based on the posterior probability of each target pronunciation unit output by the acoustic model, virtual silence nodes are inserted before and after each target pronunciation unit during dynamic time warping, the cumulative probability of each node is calculated frame by frame, an optimal path is then determined from the cumulative probabilities, and the wake-up word detection result is determined from the optimal path.

First, the training process of the acoustic model in the embodiment of the present invention is described in detail below.

Fig. 1 is a flowchart of constructing an acoustic model according to an embodiment of the present invention, comprising the following steps:

Step 101: collect wake-up word data and non-wake-up word data.

Wake-up word data refers to speech data containing the set wake-up word, and non-wake-up word data refers to speech data not containing the wake-up word. The total duration of the non-wake-up word data is generally required to be equal to or greater than that of the wake-up word data, for example a duration ratio of 2:1 or 3:1.

Step 102: time-label the wake-up word data and the non-wake-up word data respectively to obtain frame-level label data.

First, the label mapping relation between the wake-up word and non-wake-up words needs to be determined. In the embodiment of the invention, the starting and ending time periods of the wake-up word are represented by silence, and for each character in the wake-up word a label is set according to its position in the sequence. For example, for the wake-up word "waning you good", the labels corresponding to its characters are 1, 2, 3 and 4 respectively, and the corresponding label mapping relation is (1-2-3-4); for words or characters other than silence and the wake-up word, the corresponding label is set to 0, the same label as silence.

Secondly, the wake-up word data and the non-wake-up word data are aligned respectively to obtain the correspondence between each character in the wake-up word data and the non-wake-up word data and the speech frames it occupies, i.e. which frames each character occupies.

In the embodiment of the invention, the wake-up word data and the non-wake-up word data may be aligned respectively with a pre-established alignment model. The alignment model is a state-level speech recognition model, and can be obtained by training a neural network model, such as a DNN-HMM model, on the wake-up word data and the non-wake-up word data.

Finally, the wake-up word data and the non-wake-up word data are mapped into label form according to the label mapping relation and the correspondence, to obtain frame-level label data.
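The labeling steps above can be sketched as follows. The wake-up word string, the alignment tuples and all names here are hypothetical illustrations; a real system would obtain the character-to-frame alignment from the alignment model:

```python
# Illustrative sketch of label mapping and frame-level labeling (Fig. 1).
# The wake-up word and the alignment below are made up for illustration.

SIL = 0  # silence and any non-wake-word speech share label 0

def label_map(wake_word):
    """Character -> label, positions numbered 1..N as described above."""
    return {ch: i + 1 for i, ch in enumerate(wake_word)}

def frame_labels(alignment, mapping, n_frames):
    """Expand a character-level alignment [(char, first_frame, last_frame), ...]
    into one label per frame; unaligned frames stay silence."""
    labels = [SIL] * n_frames
    for ch, first, last in alignment:
        for t in range(first, last + 1):
            labels[t] = mapping.get(ch, SIL)  # non-wake-word characters -> 0
    return labels
```

For a four-character wake-up word "abcd", `label_map` yields labels 1-2-3-4, and `frame_labels` produces the frame-level label sequence used to train the acoustic model.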

Step 103: train with the frame-level label data to obtain the acoustic model.

The acoustic model may be a neural network model, such as a DNN-HMM model. Its input is the acoustic features extracted from the current frame, and its output is the posterior probabilities of the target pronunciation units in the current frame.

Taking "waning you good" as an example, the target pronunciation units are (0-1-2-3-4), and the output of the acoustic model is the posterior probability corresponding to each target pronunciation unit.

Aligning the wake-up word data and the non-wake-up word data with the alignment model, compared with manually labeling time information as in the prior art, greatly saves human effort and improves model training efficiency.

With this acoustic model, the posterior probability of each target pronunciation unit in each speech frame of the speech to be detected can be obtained.

Fig. 2 is a flowchart of an end-to-end wake-up word detection method according to an embodiment of the present invention, comprising the following steps:

Step 201: receive speech to be detected.

For example, the speech may be received through a microphone, which may be disposed on the device to be awakened or on a controller of that device, such as a remote controller.

Step 202: sequentially extract acoustic features of each speech frame in the speech to be detected.

The received speech to be detected needs to be framed; in addition, pre-emphasis may be applied to the framed speech data to increase the high-frequency resolution of the speech.

The acoustic features may be MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction), Filterbank features, etc.; they can be extracted with existing techniques, which are not described again here.
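A minimal sketch of the framing and pre-emphasis mentioned above. The 0.97 coefficient and the 25 ms / 10 ms frame sizes are common defaults assumed for illustration, not values specified by the patent:

```python
# Minimal pre-emphasis and framing sketch (step 202).

def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len=400, hop=160):
    """Split samples into overlapping frames (25 ms window, 10 ms hop
    at a 16 kHz sampling rate)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames
```

MFCC, PLP or Filterbank features would then be computed per frame on top of this framing.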

Step 203: input the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model.

In the embodiment of the invention, the acoustic model is an end-to-end acoustic model based on the target pronunciation units in the wake-up word, and may specifically combine one or more kinds of DNN (Deep Neural Network), for example FFNN (Feedforward Neural Network), CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network). Its input is the acoustic features extracted from the current frame, and its output is the posterior probabilities of the target pronunciation units in the current frame.

The acoustic model may be trained on a large amount of collected speech data; the specific training process is described above in connection with Fig. 1.

A target pronunciation unit is a pronunciation unit of the wake-up word, and may specifically be a syllable, a character, or a word.

Step 204: take each target pronunciation unit as a node, insert virtual silence nodes before and after each target pronunciation unit, and obtain a target-time relation matrix according to the posterior probabilities of the target pronunciation units.

The target-time relation matrix is a DTW (Dynamic Time Warping) matrix. Its vertical axis lists the nodes: taking the wake-up word "waning you good" as an example, the corresponding labels are 1-2-3-4, and after virtual silence nodes are inserted before and after each target pronunciation unit the node sequence takes the form (0-1-0-2-0-3-0-4-0). The horizontal axis is time in units of frames. The entry at the intersection of a node and a frame is the value of that node, namely the product of the node's posterior probability and its weight. For convenience of description, the value of each node in the matrix is hereinafter referred to as the probability of the node.
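The construction of the node sequence and the target-time relation matrix can be sketched as follows. The posterior layout (`posteriors[t][label]`) and the uniform weights are assumptions made for illustration:

```python
# Sketch of building the node sequence and target-time matrix (step 204).
# SIL marks a virtual silence node inserted before and after every
# target pronunciation unit.

SIL = 0

def node_sequence(n_units):
    """For 4 units this yields [0, 1, 0, 2, 0, 3, 0, 4, 0]."""
    seq = [SIL]
    for label in range(1, n_units + 1):
        seq += [label, SIL]
    return seq

def build_matrix(posteriors, n_units, weights=None):
    """matrix[i][t] = weight * posterior of node i's label at frame t,
    where posteriors[t][label] is the acoustic model output for frame t."""
    nodes = node_sequence(n_units)
    if weights is None:
        weights = [1.0] * len(nodes)  # assumed uniform weights
    return [[weights[i] * frame[label] for frame in posteriors]
            for i, label in enumerate(nodes)]
```

Each row of the matrix corresponds to one node of the (0-1-0-2-...-0) sequence and each column to one frame, matching the axes described above.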

Step 205, calculating the cumulative probability of each node in the target-time relation matrix frame by frame.

In the embodiment of the invention, a dynamic planning method is adopted to respectively calculate the cumulative probability of each node (including a virtual silent node and a target pronunciation unit) in the target-time relation matrix, and the cumulative probability of each node is the sum of the cumulative probability of the optimal path in all paths before the node can be reached and the probability of the current node. Specifically, the optimal path cumulative probability of all paths before the node can be reached may be determined first, and then the optimal path cumulative probability and the probability of the node are added to obtain the cumulative probability of the node.

As shown in fig. 3, because virtual silence nodes have been added, the paths that can reach each target pronunciation unit during dynamic programming include not only the path from the preceding target pronunciation unit to that unit, but also the path from the preceding virtual silence node to that unit.

Assuming the node sequence, i.e., the label sequence, is denoted state[], the calculation formulas are as follows:

1) For a virtual silence node, i.e., state[i] = sil_state (= 0):

dp[i][t]=max(dp[i][t-1],dp[i-1][t-1])+out[i][t];

where dp[i][t] denotes the cumulative probability of the i-th node (a virtual silence node) at frame t, dp[i][t-1] denotes the cumulative probability of the i-th node at frame t-1, dp[i-1][t-1] denotes the cumulative probability of the (i-1)-th node (a target pronunciation unit) at frame t-1, and out[i][t] denotes the probability of the i-th node at frame t.

That is, the larger of the cumulative probability of the same virtual silence node at the previous frame and the cumulative probability of the preceding target pronunciation unit node at the previous frame is selected, and this maximum plus the probability of the virtual silence node at the current frame gives the cumulative probability of the virtual silence node at the current frame.

2) For the first target pronunciation unit node, i.e., state[i] ≠ sil_state and i = 1, the calculation formula is the same as above.

That is, the larger of the cumulative probability of the same target pronunciation unit node at the previous frame and the cumulative probability of the preceding virtual silence node at the previous frame is selected, and this maximum plus the probability of the target pronunciation unit node at the current frame gives the cumulative probability of the target pronunciation unit node at the current frame.

3) For the other target pronunciation unit nodes, i.e., state[i] ≠ sil_state and i > 1:

dp[i][t]=max{max(dp[i][t-1],dp[i-1][t-1]),dp[i-2][t-1]}+out[i][t];

where dp[i][t] denotes the cumulative probability of the i-th node (a target pronunciation unit node) at frame t, dp[i][t-1] denotes the cumulative probability of the i-th node at frame t-1, dp[i-1][t-1] denotes the cumulative probability of the (i-1)-th node (a virtual silence node) at frame t-1, dp[i-2][t-1] denotes the cumulative probability of the (i-2)-th node (the preceding target pronunciation unit) at frame t-1, and out[i][t] denotes the probability of the i-th node at frame t.

That is, the maximum of three values is selected: the cumulative probability of the same target pronunciation unit node at the previous frame, the cumulative probability of the preceding virtual silence node at the previous frame, and the cumulative probability of the preceding target pronunciation unit node at the previous frame (i.e., skipping the intervening silence node). This maximum plus the probability of the target pronunciation unit node at the current frame gives the cumulative probability of the target pronunciation unit node at the current frame.
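The three cases above can be combined into a single dynamic programming routine, sketched below for illustration. The initialization of the first frame is not specified in the text; it is assumed here that a path may start either in the leading silence node or directly at the first target pronunciation unit.

```python
def cumulative_probabilities(nodes, value):
    """Frame-by-frame DP over the target-time matrix.
    nodes: interleaved label sequence, e.g. [0, 1, 0, 2, 0, 3, 0, 4, 0]
    value: value[i][t], probability of node i at frame t
    Implements the three cases from the text:
      silence node:       dp[i][t] = max(dp[i][t-1], dp[i-1][t-1]) + out[i][t]
      first target unit:  same two predecessors
      later target units: may also skip the silence node via dp[i-2][t-1]
    """
    n, T = len(nodes), len(value[0])
    NEG = float('-inf')
    dp = [[NEG] * T for _ in range(n)]
    dp[0][0] = value[0][0]          # assumed: path may start in leading silence
    if n > 1:
        dp[1][0] = value[1][0]      # ... or directly at the first unit
    for t in range(1, T):
        for i in range(n):
            best = dp[i][t - 1]                      # stay on the same node
            if i >= 1:
                best = max(best, dp[i - 1][t - 1])   # from the previous node
            if i >= 2 and nodes[i] != 0:
                best = max(best, dp[i - 2][t - 1])   # skip the silence node
            if best > NEG:
                dp[i][t] = best + value[i][t]
    return dp
```

The best path score is then read off the final frame of the matrix, and the optimal path itself can be recovered by backtracking, as described in the following steps.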

Step 206: determine the optimal path according to the cumulative probabilities of the nodes in the matrix.

Specifically, the score of each path may be calculated from the cumulative probabilities of the nodes in the matrix, and the path with the highest score is taken as the optimal path.

Step 207: determine the wake-up word detection result according to the optimal path.

For example, it may be judged whether the cumulative probabilities corresponding to the target pronunciation units on the optimal path are all greater than a set probability threshold; if so, it is determined that the wake-up word is detected; otherwise, it is determined that the wake-up word is not detected.

In the end-to-end wake-up word detection method provided by the embodiment of the present invention, virtual silence nodes are inserted before and after each target pronunciation unit to obtain a target-time relationship matrix; the cumulative probability of each node in the matrix is calculated frame by frame according to the posterior probabilities of the target pronunciation units in each speech frame output by the acoustic model; the optimal path is determined according to the cumulative probabilities of the nodes in the matrix; and the wake-up word detection result is determined according to the optimal path. Because virtual silence nodes are inserted between the target pronunciation units, the model better matches natural pronunciation patterns, which effectively improves the accuracy of the detection result, raises the wake-up word detection rate, and suppresses false wake-ups.

Further, in the dynamic programming process, words or characters other than silence and the wake-up word play no significant role and compete with silence; therefore, all words or characters other than silence and the wake-up word are represented as silence. This avoids interfering with the representation of silence and further improves the accuracy of the detection result.

As shown in fig. 4, another flowchart of an end-to-end wake word detection method according to an embodiment of the present invention includes the following steps:

Step 401: receive the voice to be detected.

Step 402: sequentially extract the acoustic features of each speech frame in the voice to be detected.

Step 403: input the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model.

Step 404: construct a target-time relationship matrix and determine the optimal path using a dynamic programming algorithm.

The construction method of the target-time relationship matrix and the determination process of the optimal path may refer to steps 204 to 206 shown in fig. 2, which are not described herein again.

Step 405: determine the starting position and ending position of each target pronunciation unit on the optimal path, and calculate the length and average probability of each target pronunciation unit according to the cumulative probabilities and the starting and ending positions.

In the dynamic programming process, after the cumulative probabilities have been calculated up to the last frame, backtracking proceeds frame by frame to determine whether each frame contains the corresponding target pronunciation unit. Specifically, if the probability of the target pronunciation unit in the current frame is greater than a set output threshold, the current frame is judged to contain that target pronunciation unit. After backtracking reaches the starting frame, all frames containing each target pronunciation unit are known, from which the starting position and ending position of each target pronunciation unit are obtained; the length of a target pronunciation unit is its ending position minus its starting position.

Accordingly, for each target pronunciation unit, subtracting the cumulative probability corresponding to its starting position from the cumulative probability corresponding to its ending position and dividing the result by its length yields the average probability of the target pronunciation unit.
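As an illustration, the backtracking and the length/average computation may be sketched as follows. The output threshold and the assumption that each unit occupies one contiguous run of frames are hypothetical details for the sake of the example.

```python
def unit_spans(node_probs, threshold):
    """node_probs[t]: per-frame probability of one target unit on the
    optimal path. Returns the (start, end) frame indices of the run of
    frames whose probability exceeds `threshold` (assumes one contiguous
    run), or None if the unit never exceeds the threshold."""
    frames = [t for t, p in enumerate(node_probs) if p > threshold]
    return (frames[0], frames[-1]) if frames else None

def length_and_average(cumulative, start, end):
    """The length is the ending position minus the starting position;
    the average probability is the cumulative probability gained over
    that span divided by the length."""
    length = end - start
    avg = (cumulative[end] - cumulative[start]) / length
    return length, avg
```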

Step 406: determine whether a set condition is satisfied according to the length and/or the average probability of each target pronunciation unit on the optimal path; if so, perform step 407; otherwise, perform step 408.

In the embodiment of the present invention, the set condition may be: the length of each target pronunciation unit is greater than a set length threshold; and/or the average probability of each target pronunciation unit is greater than a set average probability threshold.
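A sketch of this decision follows; the threshold values used in the usage example are hypothetical tuning parameters, not values from the embodiment.

```python
def wake_word_detected(units, min_length, min_avg_prob):
    """units: list of (length, average_probability) pairs, one per target
    pronunciation unit on the optimal path. The wake-up word is detected
    only if every unit exceeds both thresholds."""
    return all(length > min_length and avg > min_avg_prob
               for length, avg in units)
```

For example, with a minimum length of 3 frames and a minimum average probability of 0.5, `wake_word_detected([(8, 0.7), (10, 0.65)], 3, 0.5)` accepts, while a unit of length 2 would cause rejection.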

Step 407: determine that the wake-up word is detected.

Step 408: determine that the wake-up word is not detected.

The end-to-end wake-up word detection method provided by the embodiment of the present invention can effectively reduce the false wake-up rate by further constraining the length and average probability of each target pronunciation unit.

Of course, in practical applications, the total length of non-silence (for example, a length threshold for the wake-up word may be set according to different speaking rates), its average probability, the total length of silence, the length of silence between characters or words, and the like may also be constrained, so as to ensure that the wake-up operation is performed only in situations that better match a normal speaking style.

The end-to-end wake-up word detection method provided by the embodiment of the present invention can be applied to various intelligent devices, such as household appliances, smart speakers, tablet computers, mobile phones, wearable devices, robots, and toys, enabling an intelligent device to accurately detect the user's voice instruction, namely the wake-up word, while in a dormant or screen-locked state, so that a device in the dormant state directly enters a state of awaiting instructions or directly executes the operation corresponding to the voice instruction.

Correspondingly, an embodiment of the present invention further provides an end-to-end wake-up word detection apparatus, as shown in fig. 5, which is a structural block diagram of the apparatus.

In this embodiment, the end-to-end wake word detection apparatus includes the following modules:

a receiving module 501, configured to receive a voice to be detected;

a feature extraction module 502, configured to sequentially extract the acoustic features of each speech frame in the voice to be detected; specifically, the voice received by the receiving module 501 may first be divided into frames, after which the acoustic features of each speech frame are extracted; the acoustic features may be MFCC, PLP, or Filterbank features, and their extraction may use existing techniques, which are not described again here;

an acoustic detection module 503, configured to input the extracted acoustic features into a pre-constructed acoustic model to obtain the posterior probability of each target pronunciation unit in each speech frame output by the acoustic model; a target pronunciation unit is a pronunciation unit of the wake-up word, specifically a syllable, a character, or a word;

a matrix construction module 504, configured to take each target pronunciation unit as a node, insert a virtual silence node before and after each target pronunciation unit, and obtain a target-time relationship matrix according to the posterior probabilities of the target pronunciation units;

a calculating module 505, configured to calculate the cumulative probability of each node in the target-time relationship matrix frame by frame;

an optimal path determining module 506, configured to determine the optimal path according to the cumulative probabilities of the nodes in the matrix; specifically, the score of each path may be calculated from the cumulative probabilities of the nodes in the matrix, and the path with the highest score is taken as the optimal path;

a detection module 507, configured to determine the wake-up word detection result according to the optimal path; for example, it may be judged whether the cumulative probabilities corresponding to the target pronunciation units on the optimal path are all greater than a set probability threshold; if so, it is determined that the wake-up word is detected; otherwise, it is determined that the wake-up word is not detected.
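The frame division performed by the feature extraction module 502 can be sketched as follows. The frame length and shift values in the usage example are illustrative (e.g. a 25 ms window with a 10 ms shift, in samples); the per-frame feature extraction itself (MFCC, PLP, or Filterbank) follows existing techniques and is not shown.

```python
def split_frames(samples, frame_len, frame_shift):
    """Split a sample sequence into overlapping frames; acoustic feature
    extraction is then applied to each frame independently. Trailing
    samples that do not fill a whole frame are dropped."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_shift):
        frames.append(samples[start:start + frame_len])
    return frames
```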

In the embodiment of the present invention, the acoustic model is an end-to-end acoustic model based on the target pronunciation units of the wake-up word; specifically, a DNN may be used, for example a combination of one or more of FFNN, CNN, and RNN. The input of the acoustic model is the acoustic features extracted from the current frame, and the output is the probabilities of the target pronunciation units in the current frame.

The acoustic model may be trained in advance by a corresponding model building module using a large amount of collected speech data, where the speech data includes wake-up word data and non-wake-up word data. The model building module may be integrated in the device or independent of it; the embodiment of the present invention is not limited in this respect.

A structural block diagram of the model building module is shown in fig. 6, and includes the following units:

a data collection unit 61, configured to collect wake-up word data and non-wake-up word data;

a marking unit 62, configured to time-label the wake-up word data and the non-wake-up word data respectively to obtain frame-level label data;

and a training unit 63, configured to train the acoustic model using the frame-level label data.

The marking unit 62 may specifically include: the device comprises a mapping relation determining unit, an aligning unit and a mapping unit. Wherein:

The mapping relation determining unit is configured to determine the label mapping relationship between wake-up words and non-wake-up words. In the embodiment of the present invention, the periods before the start and after the end of the wake-up word may be represented as silence; each character in the wake-up word is assigned a label according to its position in the word; and for all words or characters other than silence and the wake-up word, the corresponding label is set to 0, so that no conventional "other"/filler label is used.
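This label mapping may be sketched as follows; the token names are hypothetical, with `'sil'` standing for silence frames produced by the alignment.

```python
def map_to_frame_labels(aligned_tokens, wake_word_chars):
    """aligned_tokens: per-frame token from forced alignment ('sil' for
    silence, a wake-up word character, or any other character).
    Wake-up word characters receive positional labels 1..N; silence and
    every non-wake-up token map to 0, so non-wake-up speech is treated
    as silence rather than given its own label."""
    label_of = {ch: i + 1 for i, ch in enumerate(wake_word_chars)}
    return [label_of.get(tok, 0) for tok in aligned_tokens]
```

For a four-character wake-up word `['a', 'b', 'c', 'd']` (placeholder characters), a frame sequence containing the first two characters plus an out-of-vocabulary character maps to labels 0-1-2-0-0.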

The alignment unit is configured to align the wake-up word data and the non-wake-up word data, for example using a pre-established alignment model, to obtain the correspondence between each character in the wake-up word data and the non-wake-up word data and the speech frames that character occupies;

and the mapping unit is configured to map the wake-up word data and the non-wake-up word data into label form according to the label mapping relationship and the alignment correspondence, obtaining the frame-level label data.

In the embodiment of the present invention, an alignment model is used to align the wake-up word data and the non-wake-up word data; compared with manually labeling time information as in the prior art, this greatly saves human effort and improves model training efficiency. The alignment model may be a neural network model, such as a DNN-HMM model, trained using the wake-up word data and the non-wake-up word data.

The acoustic model may employ a neural network model, such as a DNN-HMM model. The input of the acoustic model is the acoustic characteristics extracted from the current frame, and the output is the posterior probability of the target pronunciation unit in the current frame.

Using the posterior probabilities of the target pronunciation units output by the acoustic model, the calculating module 505 in fig. 5, when calculating the cumulative probability of each node in the target-time relationship matrix, first determines the best path cumulative probability over all paths reaching the node and then adds this value to the probability of the node to obtain the node's cumulative probability. For the specific calculation method, reference may be made to the description in the foregoing method embodiments, which is not repeated here.

The end-to-end wake-up word detection device provided by the embodiment of the present invention inserts virtual silence nodes before and after each target pronunciation unit to obtain a target-time relationship matrix; calculates the cumulative probability of each node in the matrix frame by frame according to the posterior probabilities of the target pronunciation units in each speech frame output by the acoustic model; determines the optimal path according to the cumulative probabilities of the nodes in the matrix; and determines the wake-up word detection result according to the optimal path. Because virtual silence nodes are inserted between the target pronunciation units, the model better matches natural pronunciation patterns, which effectively improves the accuracy of the detection result, raises the wake-up word detection rate, and suppresses false wake-ups.

Further, in the dynamic programming process, words or characters other than silence and the wake-up word play no significant role and compete with silence; therefore, all words or characters other than silence and the wake-up word are represented as silence. This avoids interfering with the representation of silence and further improves the accuracy of the detection result.

Fig. 7 is another structural block diagram of an end-to-end wake-up word detection apparatus according to an embodiment of the present invention.

The difference from the embodiment shown in fig. 5 is that, in this embodiment, the detection module 507 includes: a determination unit 571 and a judgment unit 572. Wherein:

The determining unit 571 is configured to determine the starting position and ending position of each target pronunciation unit on the optimal path, and to calculate the length and average probability of each target pronunciation unit according to the cumulative probabilities and the starting and ending positions.

Specifically, after the calculating module 505 has calculated the cumulative probabilities up to the last frame, the determining unit 571 backtracks frame by frame to determine whether each frame contains the corresponding target pronunciation unit. If the probability of the target pronunciation unit in the current frame is greater than a set output threshold, the current frame is judged to contain that target pronunciation unit. After backtracking reaches the starting frame, all frames containing each target pronunciation unit are known, from which the starting position and ending position of each target pronunciation unit are obtained; the length of a target pronunciation unit is its ending position minus its starting position. In addition, for each target pronunciation unit, subtracting the cumulative probability corresponding to its starting position from the cumulative probability corresponding to its ending position and dividing the result by its length yields the average probability of the target pronunciation unit.

The judging unit 572 is configured to determine whether a set condition is satisfied according to the length and/or the average probability of each target pronunciation unit on the optimal path and, if so, to determine that the wake-up word is detected.

In the embodiment of the present invention, the set condition may be: the length of each target pronunciation unit is greater than a set length threshold; and/or the average probability of each target pronunciation unit is greater than a set average probability threshold.

The end-to-end wake-up word detection device provided by the embodiment of the present invention can effectively reduce the false wake-up rate by further constraining the length and average probability of each target pronunciation unit.

Of course, in practical applications, the total length of non-silence (for example, a length threshold for the wake-up word may be set according to different speaking rates), its average probability, the total length of silence, the length of silence between characters or words, and the like may also be constrained, so as to ensure that the wake-up operation is performed only in situations that better match a normal speaking style.

The end-to-end wake-up word detection device provided by the embodiment of the present invention can be applied to various intelligent devices, such as household appliances, smart speakers, tablet computers, mobile phones, wearable devices, robots, and toys, enabling an intelligent device to accurately detect the user's voice instruction, namely the wake-up word, while in a dormant or screen-locked state, so that a device in the dormant state directly enters a state of awaiting instructions or directly executes the operation corresponding to the voice instruction.

It should be noted that the terms "first," "second," and the like in the description of the embodiments of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the embodiments of the present invention, the meaning of "a plurality" means two or more unless otherwise specified.

Fig. 8 is a block diagram illustrating an apparatus 800 for an end-to-end wake word detection method according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 8, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

Power component 806 provides power to the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.

The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800, the relative positioning of the components, such as a display and keypad of the apparatus 800, the sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer-readable storage medium is also provided, wherein instructions in the storage medium, when executed by a processor of a mobile terminal, enable the mobile terminal to perform some or all of the steps of the above method embodiments so as to reduce the false wake-up rate.

Fig. 9 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) that store applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.

The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
