Method and device for determining activation probability of awakening words and intelligent voice product

Document No.: 50805  Publication date: 2021-09-28

Reading note: this technology, "A method and device for determining the activation probability of a wake-up word, and an intelligent voice product", was designed and created by Zhao Yadong and Jin Zhongxiao on 2021-07-05. Main content: The application discloses a method and device for determining the activation probability of a wake-up word, and an intelligent voice device. The method and device are applied to an intelligent voice device, specifically: obtain an audio signal received by the intelligent voice device; process the audio signal with an acoustic classification model to obtain an acoustic classification sequence of the audio signal; and input the acoustic classification sequence and the acoustic characterization sequence of the wake-up word into a neural network model to obtain the activation probability of the wake-up word. By accurately determining the activation probability of the wake-up word, the scheme enables the intelligent voice device to decide whether it is activated according to that probability, providing a key technical link for the normal operation of the intelligent voice device.

1. A method for determining activation probability of a wake-up word is applied to intelligent voice equipment, and is characterized in that the method comprises the following steps:

acquiring an audio signal received by the intelligent voice equipment;

processing the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;

and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into a neural network model to obtain the activation probability of the awakening word.

2. The determination method of claim 1, wherein the acoustic classification sequence is x, wherein:

x = [x1, x2, …, xn] is a two-dimensional real tensor of shape N×n;

n is the length of the audio frame sequence;

N is the total number of acoustic classification categories;

the ith column vector of x is xi = [xi1, xi2, …, xiN]; xi is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; xik is the value of the kth element of xi and represents the probability that the ith audio frame belongs to the kth acoustic class, where 1 ≤ k ≤ N and 0 ≤ xik ≤ 1.

3. The determination method of claim 1, wherein the acoustic characterization sequence is s^keyword, wherein:

s^keyword is a two-dimensional real tensor of shape N×m;

keyword denotes a wake-up word;

N is the total number of acoustic classification categories;

m is the length of the acoustic characterization sequence of the specific wake-up word;

the jth column vector of s^keyword is denoted sj^keyword; it is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; sjh^keyword is the value of the hth element of sj^keyword, where 1 ≤ h ≤ N; for any given wake-up word keyword, a unique acoustic characterization s^keyword representing that keyword can always be constructed manually.

4. The determination method according to any one of claims 1 to 3, further comprising the steps of:

performing model pre-training based on labeling data of a large number of general text audios;

and performing tuning training based on annotation data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.

5. The method of claim 4, wherein the model pre-training is performed based on labeling data of a plurality of generic text audios, comprising the steps of:

determining an acoustic classification model;

building a main body neural network according to the network structure of the acoustic classification model;

training the subject neural network by using the labeled data;

and saving the neural network model parameters.

6. The determination method according to claim 5, wherein the tuning training based on the annotation data of the small amount of wake word audio comprises the steps of:

determining an acoustic characterization sequence corresponding to the tuning-training awakening word;

splicing the acoustic classification model with the main body neural network to form an end-to-end network;

loading the activation-probability-calculation neural network model parameters and the acoustic classification model parameters into the end-to-end network, and performing cross-entropy-loss-function-based training on the end-to-end network;

and saving the adjusted neural network model parameters and the acoustic classification model parameters.

7. A device for determining activation probability of a wake-up word is applied to intelligent voice equipment, and is characterized in that the device for determining the activation probability of the wake-up word comprises:

the signal acquisition module is configured to acquire an audio signal received by the intelligent voice device;

a first processing module configured to process the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;

and the second processing module is configured to input the acoustic classification sequence and the acoustic characterization sequence of the awakening word into a neural network model to obtain the activation probability of the awakening word.

8. The determination apparatus of claim 7, wherein the acoustic classification sequence is x, wherein:

x = [x1, x2, …, xn] is a two-dimensional real tensor of shape N×n;

n is the length of the audio frame sequence;

N is the total number of acoustic classification categories;

the ith column vector of x is xi = [xi1, xi2, …, xiN]; xi is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; xik is the value of the kth element of xi and represents the probability that the ith audio frame belongs to the kth acoustic class, where 1 ≤ k ≤ N and 0 ≤ xik ≤ 1.

9. The determination apparatus of claim 7, wherein the acoustic characterization sequence is s^keyword, wherein:

s^keyword is a two-dimensional real tensor of shape N×m;

keyword denotes a wake-up word;

N is the total number of acoustic classification categories;

m is the length of the acoustic characterization sequence of the specific wake-up word;

the jth column vector of s^keyword is denoted sj^keyword; it is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; sjh^keyword is the value of the hth element of sj^keyword, where 1 ≤ h ≤ N; for any given wake-up word keyword, a unique acoustic characterization s^keyword representing that keyword can always be constructed manually.

10. The determination apparatus according to any one of claims 7 to 9, further comprising:

a first training module configured to perform model pre-training based on labeling data of a large amount of general text audio;

and the second training module is configured to perform tuning training based on annotation data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.

11. The determination apparatus of claim 10, wherein the first training module is specifically configured to:

determining an acoustic classification model;

building a main body neural network according to the network structure of the acoustic classification model;

training the subject neural network by using the labeled data;

and saving the neural network model parameters.

12. The determination apparatus of claim 11, wherein the second training module is specifically configured to:

determining an acoustic characterization sequence corresponding to the tuning-training awakening word;

splicing the acoustic classification model with the main body neural network to form an end-to-end network;

loading the activation-probability-calculation neural network model parameters and the acoustic classification model parameters into the end-to-end network, and performing cross-entropy-loss-function-based training on the end-to-end network;

and saving the adjusted neural network model parameters and the acoustic classification model parameters.

13. An intelligent speech device, characterized in that it is provided with a determination device according to any one of claims 7 to 12.

14. An intelligent speech device comprising at least one processor and a memory coupled to the processor, wherein:

the memory is for storing a computer program or instructions;

the processor is configured to execute the computer program or instructions to cause the smart speech device to perform the determination method according to any one of claims 1 to 6.

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a method and a device for determining activation probability of a wake-up word and an intelligent voice product.

Background

When an intelligent voice device such as a smart speaker, a smart in-car unit, or a smartphone implements voice control, the voice interaction process is divided into five links: wake-up, response, input, understanding, and feedback. The wake-up link is the first point of contact between the user and the intelligent voice product; its experience is crucial to the whole voice interaction process and directly shapes the user's first impression of the product. At present, although intelligent voice products are called "intelligent", they still lack human intelligence and cannot be awakened by gaze or gesture, so a word that switches the product from the standby state to the working state, the so-called wake-up word, needs to be defined.

The traditional wake-up-word-based voice wake-up scheme comprises three parts. In the first part, an acoustic classification algorithm generates an acoustic classification sequence from the audio. In the second part, the activation probability of the wake-up word is calculated from the acoustic classification sequence by methods such as distance calculation, generative models, and probability calculation. In the third part, an activation probability threshold is determined, and whether the device is activated is finally judged by comparing the activation probability against that threshold. If the device is determined to be activated, the intelligent voice device is controlled to carry out voice interaction with the user. Accurately determining the activation probability of the wake-up word is therefore a key link in controlling the intelligent voice device.
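The threshold comparison in the third part can be sketched as follows; the function name and the 0.5 default are illustrative assumptions, since the text only states that activation is decided by comparing the probability to a threshold.

```python
def is_activated(activation_prob: float, threshold: float = 0.5) -> bool:
    """Decide activation by comparing the wake-word probability to a threshold.

    The 0.5 default is a hypothetical tuning choice; in practice the threshold
    is chosen to trade off false wake-ups against missed wake-ups.
    """
    return activation_prob >= threshold

print(is_activated(0.92))  # high probability: the device wakes up
print(is_activated(0.10))  # low probability: the device stays in standby
```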

Disclosure of Invention

In view of this, the present application provides a method and an apparatus for determining an activation probability of a wake-up word, and an intelligent voice device, which are used to determine the activation probability of the wake-up word and provide a key technical link for normal operation of the intelligent voice device.

In order to achieve the above object, the following solutions are proposed:

a method for determining activation probability of a wake-up word is applied to intelligent voice equipment, and comprises the following steps:

acquiring an audio signal received by the intelligent voice equipment;

processing the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;

and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into a neural network model to obtain the activation probability of the awakening word.

Optionally, the acoustic classification sequence is x, where:

x = [x1, x2, …, xn] is a two-dimensional real tensor of shape N×n;

n is the length of the audio frame sequence;

N is the total number of acoustic classification categories;

the ith column vector of x is xi = [xi1, xi2, …, xiN]; xi is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; xik is the value of the kth element of xi and represents the probability that the ith audio frame belongs to the kth acoustic class, where 1 ≤ k ≤ N and 0 ≤ xik ≤ 1.

Optionally, the acoustic characterization sequence is s^keyword, where:

s^keyword is a two-dimensional real tensor of shape N×m;

keyword denotes a wake-up word;

N is the total number of acoustic classification categories;

m is the length of the acoustic characterization sequence of the specific wake-up word;

the jth column vector of s^keyword is denoted sj^keyword; it is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; sjh^keyword is the value of the hth element of sj^keyword, where 1 ≤ h ≤ N; for any given wake-up word keyword, a unique acoustic characterization s^keyword representing that keyword can always be constructed manually.

Optionally, the method further comprises the steps of:

performing model pre-training based on labeling data of a large number of general text audios;

and performing tuning training based on annotation data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.

Optionally, the model pre-training is performed on the labeled data based on a large amount of general text audios, and includes the steps of:

determining an acoustic classification model;

building a main body neural network according to the network structure of the acoustic classification model;

training the subject neural network by using the labeled data;

and saving the neural network model parameters.

Optionally, the tuning training based on the labeled data of a small number of wake word audios includes the steps of:

determining an acoustic characterization sequence corresponding to the tuning-training awakening word;

splicing the acoustic classification model with the main body neural network to form an end-to-end network;

loading the activation-probability-calculation neural network model parameters and the acoustic classification model parameters into the end-to-end network, and performing cross-entropy-loss-function-based training on the end-to-end network;

and saving the adjusted neural network model parameters and the acoustic classification model parameters.

A device for determining activation probability of a wake-up word is applied to intelligent voice equipment, and comprises the following steps:

the signal acquisition module is configured to acquire an audio signal received by the intelligent voice device;

a first processing module configured to process the audio signal based on an acoustic classification model to obtain an acoustic classification sequence of the audio signal;

and the second processing module is configured to input the acoustic classification sequence and the acoustic characterization sequence of the awakening word into a neural network model to obtain the activation probability of the awakening word.

Optionally, the acoustic classification sequence is x, where:

x = [x1, x2, …, xn] is a two-dimensional real tensor of shape N×n;

n is the length of the audio frame sequence;

N is the total number of acoustic classification categories;

the ith column vector of x is xi = [xi1, xi2, …, xiN]; xi is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; xik is the value of the kth element of xi and represents the probability that the ith audio frame belongs to the kth acoustic class, where 1 ≤ k ≤ N and 0 ≤ xik ≤ 1.

Optionally, the acoustic characterization sequence is s^keyword, where:

s^keyword is a two-dimensional real tensor of shape N×m;

keyword denotes a wake-up word;

N is the total number of acoustic classification categories;

m is the length of the acoustic characterization sequence of the specific wake-up word;

the jth column vector of s^keyword is denoted sj^keyword; it is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; sjh^keyword is the value of the hth element of sj^keyword, where 1 ≤ h ≤ N; for any given wake-up word keyword, a unique acoustic characterization s^keyword representing that keyword can always be constructed manually.

Optionally, the method further includes:

a first training module configured to perform model pre-training based on labeling data of a large amount of general text audio;

and the second training module is configured to perform tuning training based on annotation data of a small amount of wake-up word audio to obtain the neural network model, wherein general text audio refers to audio whose text is unrelated to the wake-up word, and wake-up word audio refers to audio whose text is the same as or similar to the wake-up word.

Optionally, the first training module is specifically configured to:

determining an acoustic classification model;

building a main body neural network according to the network structure of the acoustic classification model;

training the subject neural network by using the labeled data;

and saving the neural network model parameters.

Optionally, the second training module is specifically configured to:

determining an acoustic characterization sequence corresponding to the tuning-training awakening word;

splicing the acoustic classification model with the main body neural network to form an end-to-end network;

loading the activation-probability-calculation neural network model parameters and the acoustic classification model parameters into the end-to-end network, and performing cross-entropy-loss-function-based training on the end-to-end network;

and saving the adjusted neural network model parameters and the acoustic classification model parameters.

An intelligent speech device is provided with a determination device as described above.

An intelligent speech device comprising at least one processor and a memory coupled to the processor, wherein:

the memory is for storing a computer program or instructions;

the processor is configured to execute the computer program or instructions to cause the intelligent speech device to perform the determination method as described above.

According to the technical scheme, the method and the device for determining the activation probability of the awakening word and the intelligent voice equipment are applied to the intelligent voice equipment, and particularly, the audio signal received by the intelligent voice equipment is obtained; processing the audio signal based on the acoustic classification model to obtain an acoustic classification sequence of the audio signal; and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into the neural network model to obtain the activation probability of the awakening word. According to the scheme, the activation probability of the awakening word is accurately determined, so that the intelligent voice equipment can determine whether the equipment is activated or not according to the activation probability, and a key technical link is provided for normal work of the intelligent voice equipment.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a flowchart of a method for determining activation probability of a wakeup word according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a neural network model according to an embodiment of the present application;

FIG. 3 is a flow chart of a training process of a neural network model according to an embodiment of the present application;

fig. 4 is a block diagram of an apparatus for determining activation probability of a wakeup word according to an embodiment of the present application;

fig. 5 is a block diagram of another apparatus for determining activation probability of a wakeup word according to an embodiment of the present application;

fig. 6 is a block diagram of an intelligent speech device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Example one

Fig. 1 is a flowchart of a method for determining activation probability of a wakeup word according to an embodiment of the present application.

As shown in fig. 1, the determination method provided in this embodiment is applied to an intelligent voice device, such as an intelligent sound box, an intelligent car machine, and a smart phone. The determining method is used for processing the audio received by the intelligent voice equipment so as to obtain the activation probability of the corresponding awakening word. The determination method comprises the following steps:

and S1, acquiring the audio signal received by the intelligent voice equipment.

That is, when the sound collection device of the intelligent voice device collects the sound emitted by the user and converts the sound into an audio signal in a standby state, the execution subject executing the determination method in the application acquires the audio signal so as to further process the audio signal.

In addition, at the same time as or after the audio signal is acquired, the acoustic characterization sequence of the corresponding wake-up word can be obtained. The acoustic characterization sequence can be expressed as s^keyword, where:

s^keyword is a two-dimensional real tensor of shape N×m;

keyword denotes a wake-up word; N is the total number of acoustic classification categories; m is the length of the acoustic characterization sequence of the specific wake-up word;

the jth column vector of s^keyword is denoted sj^keyword; it is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m; sjh^keyword is the value of the hth element of sj^keyword, where 1 ≤ h ≤ N; for any given wake-up word keyword, a unique acoustic characterization s^keyword representing that keyword can always be constructed manually.
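As a concrete illustration of such a characterization, the sketch below builds s^keyword as an N×m matrix whose jth column is a one-hot vector for the acoustic class of the jth position of the wake-up word. The one-hot choice, the class indices, and the function name are illustrative assumptions; the text only requires each column to be a length-N real vector.

```python
import numpy as np

def make_characterization(class_ids, num_classes):
    """Build an N x m acoustic characterization sequence for a wake-up word.

    class_ids gives the acoustic class index of each of the m positions of the
    wake-up word (a hypothetical hand-crafted mapping, in the spirit of the
    statement that a characterization can always be found manually).
    """
    m = len(class_ids)
    s_keyword = np.zeros((num_classes, m))
    for j, h in enumerate(class_ids):
        s_keyword[h, j] = 1.0  # column j puts all mass on class h
    return s_keyword

# A 4-position wake-up word over 10 acoustic classes (indices are made up).
s = make_characterization([3, 7, 7, 1], num_classes=10)
print(s.shape)  # (10, 4), i.e. N x m
```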

And S2, processing the audio signal based on the acoustic classification model.

Once the audio signal is acquired, it is processed with the pre-trained acoustic classification model to obtain the acoustic classification sequence of the audio signal. The acoustic classification sequence is denoted x, where:

x = [x1, x2, …, xn] is a two-dimensional real tensor of shape N×n; n is the length of the audio frame sequence; N is the total number of acoustic classification categories; the ith column vector of x is xi = [xi1, xi2, …, xiN]; xi is a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; xik is the value of the kth element of xi and represents the probability that the ith audio frame belongs to the kth acoustic class, where 1 ≤ k ≤ N and 0 ≤ xik ≤ 1.
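The structure of x can be illustrated with a minimal sketch: a softmax over the N classes is applied per frame, so that each column of x is a probability distribution. The random logits stand in for the acoustic classification model's per-frame scores and are an assumption for illustration only.

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, n = 10, 50                      # N acoustic classes, n audio frames
logits = rng.normal(size=(N, n))   # stand-in for the model's per-frame scores
x = softmax(logits, axis=0)        # acoustic classification sequence, shape N x n

# Each column x_i is a probability distribution over the N classes, so the
# entries lie in [0, 1] and every column sums to 1.
print(x.shape)
print(bool(np.allclose(x.sum(axis=0), 1.0)))
```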

And S3, processing the acoustic classification sequence and the acoustic characterization sequence based on the neural network model.

The neural network model is imported into a corresponding processing body of a corresponding intelligent voice device before the processing is started, wherein the acoustic classification sequence refers to an acoustic classification sequence of the audio signal, and the acoustic characterization sequence refers to an acoustic characterization sequence of the awakening word. And obtaining the activation probability of the awakening word through the processing of the neural network model.

The neural network model comprises a long short-term memory (LSTM) layer, convolutional layers, a max pooling layer, and a linear transformation layer, combined as follows:

emb = Linear(LSTM(s^keyword))

P_keyword(activate | x) = Sigmoid(Conv_emb(MaxPool(Tanh(Conv(x)))))

as shown in fig. 2, where LSTM denotes the long short-term memory layer, Linear the linear transformation layer, Conv a convolutional layer, Tanh the tanh activation function, MaxPool the max pooling layer, Conv_emb a convolutional layer whose convolution parameters are emb, emb the acoustic characterization embedding of the wake-up word, and Sigmoid the sigmoid activation function.

The activation probability of the wake-up word is denoted P_keyword(activate | x), where keyword is a specific wake-up word and x is the acoustic classification sequence of the corresponding audio frame sequence.
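A shape-level sketch of this layer composition follows. It is not the patent's actual model: the LSTM + Linear embedding is replaced by a random linear projection stand-in, the convolution kernel is random, and Conv_emb is realized as a dot product with the embedding; only the tensor shapes and the Conv/Tanh/MaxPool/Sigmoid chain follow the formula above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
N, n, m, d = 10, 50, 4, 8        # acoustic classes, frames, keyword length, embedding dim

x = rng.random((N, n))           # stand-in acoustic classification sequence
s_keyword = rng.random((N, m))   # stand-in acoustic characterization sequence

# emb = Linear(LSTM(s^keyword)); a random linear projection stands in for the
# LSTM + Linear pair, keeping only the shape: a length-d embedding is produced.
W_emb = rng.normal(size=(d, N * m))
emb = W_emb @ s_keyword.reshape(-1)

# Conv(x): a 1-D convolution over the frame axis with d output channels
# (kernel width 3, no padding), giving a (d, n-2) feature map.
kernel = rng.normal(size=(d, N, 3))
conv = np.stack([
    np.array([(kernel[c] * x[:, t:t + 3]).sum() for t in range(n - 2)])
    for c in range(d)
])

h = np.tanh(conv)                # Tanh activation
pooled = h.max(axis=1)           # MaxPool over time -> shape (d,)

# Conv_emb: the final convolution whose parameters are emb, realized here as a
# dot product between the keyword embedding and the pooled features.
p = sigmoid(float(emb @ pooled))
print(0.0 < p < 1.0)             # activation probability lies strictly in (0, 1)
```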

As can be seen from the foregoing technical solutions, the present embodiment provides a method for determining activation probability of a wakeup word, where the method is applied to an intelligent voice device, and specifically, obtains an audio signal received by the intelligent voice device; processing the audio signal based on the acoustic classification model to obtain an acoustic classification sequence of the audio signal; and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into the neural network model to obtain the activation probability of the awakening word. According to the scheme, the activation probability of the awakening word is accurately determined, so that the intelligent voice equipment can determine whether the equipment is activated or not according to the activation probability, and a key technical link is provided for normal work of the intelligent voice equipment.

The above technical scheme relies on an appropriate neural network model. When no such model already exists, the method further comprises the following model training scheme, which supplies the neural network model. As shown in fig. 3, the specific scheme comprises the following steps:

and S101, model pre-training is carried out based on a large amount of labeled data of the general text audio.

The first step: determine the acoustic classification model adopted by the intelligent voice device and save its network parameters as Mc, thereby determining the number of acoustic classes N and the acoustic class set; represent the labeled text of the general-text-audio annotation data as acoustic characterization sequences using the acoustic class set, and force-align each general text audio against its labeled text on the basis of the acoustic classification model (i.e., find the acoustic classification sequence with the highest probability for the audio), thereby obtaining force-aligned acoustic classification sequences.

Segment each force-aligned acoustic classification sequence at uniformly distributed cut intervals of 30 to 50 frames, denoting each segment x_cut. Combine the aligned labels of each x_cut to obtain its corresponding acoustic characterization sequence s_cut, and put (x_cut, s_cut) into the training data set T as a training sample point. After the data set T is constructed, find all distinct acoustic characterization sequence elements in T to form the acoustic characterization sequence set Ks, and then augment the training data set T into Ta based on Ks. The specific expansion procedure is:

while Ks has unselected elements:
    select an unselected element sK from Ks;
    while T has elements not yet selected for sK in the current round:
        select an element (x_cut, s_cut) from T;
        if sK is the same as s_cut:
            put the data entry ((x_cut, sK), 1) into the data set Ta;
        else:
            put the data entry ((x_cut, sK), 0) into the data set Ta.
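The expansion procedure above can be sketched in plain Python; the use of strings as stand-ins for segments and characterization sequences is an illustrative assumption.

```python
def augment(T):
    """Expand the set T of (x_cut, s_cut) samples into the labeled set Ta.

    Every segment x_cut is paired with every distinct characterization sK seen
    in T; the label is 1 when sK equals the segment's own s_cut, else 0.
    """
    Ks = []                            # distinct acoustic characterizations
    for _, s_cut in T:
        if s_cut not in Ks:
            Ks.append(s_cut)
    Ta = []
    for sK in Ks:
        for x_cut, s_cut in T:
            Ta.append(((x_cut, sK), 1 if sK == s_cut else 0))
    return Ta

# Three segments, two of which share the characterization "AB" (toy values).
T = [("seg1", "AB"), ("seg2", "CD"), ("seg3", "AB")]
Ta = augment(T)
print(len(Ta))                         # 2 characterizations x 3 segments = 6
print(sum(label for _, label in Ta))   # 3 positive pairs
```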

The second step is that: building a main body neural network according to the network structure of the acoustic classification model;

The third step: train the main body neural network using the training data set Ta; select a suitable parameter initialization method, learning rate, and adaptive gradient algorithm, perform cross-entropy-loss-function-based training of the main body neural network, and reserve part of the training data for cross-validation during training to avoid overfitting;

the fourth step: saving neural network model parameters, denoted as Mg

And S102, performing tuning training based on the labeled data of a small number of specific awakening word audios.

The tuning training uses the annotation data of the specific wake-up word audio to tune the whole intelligent voice device end to end. It requires the acoustic classification model to be a neural network model; if a non-neural-network acoustic classification model is adopted, the following steps can be skipped. The specific process is as follows:

the first step is as follows: determining the acoustic characterization sequence s of the training awakening words and the corresponding awakening wordsf

The second step is that: splicing an acoustic classification model (neural network) with the main body neural network, namely directly connecting an output layer of the acoustic classification model (neural network) with an acoustic classification sequence input layer (Conv layer) of the main body neural network to form an end-to-end network;

The third step: load the activation-probability-calculation neural network parameters Mg and the acoustic classification model parameters Mc, perform cross-entropy-loss-function-based training of the end-to-end network of the second step using the annotation data of the specific wake-up word audio, and reserve part of the training data for cross-validation during training to avoid overfitting;

The fourth step: save the adjusted acoustic classification model parameters Mcf and the main body neural network parameters Mgf.

According to the above steps, once the corresponding parameters are obtained, they can be loaded into the main neural network to obtain the neural network model.

Example two

Fig. 4 is a block diagram of a device for determining activation probability of a wakeup word according to an embodiment of the present application.

As shown in fig. 4, the determining apparatus provided in this embodiment is applied to an intelligent voice device, such as an intelligent speaker, an intelligent car machine, and a smart phone. The determining device is used for processing the audio received by the intelligent voice equipment so as to obtain the activation probability of the corresponding awakening word. The determining device specifically includes a signal acquiring module 10, a first processing module 20 and a second processing module 30.

The signal acquisition module is used for acquiring the audio signal received by the intelligent voice equipment.

That is, when the sound collection device of the intelligent voice device collects the sound emitted by the user and converts the sound into an audio signal in a standby state, the execution subject executing the determination method in the application acquires the audio signal so as to further process the audio signal.

In addition, this module may also acquire the acoustic characterization sequence of the corresponding wake-up word, at the same time as or after the audio signal is acquired. The acoustic characterization sequence may be denoted s_keyword, wherein:

s_keyword is a two-dimensional real tensor of shape N×m;

keyword denotes the wake-up word; N is the total number of acoustic classification categories; m is the length of the acoustic characterization sequence of the specific wake-up word;

the j-th column vector of s_keyword, denoted s_keyword^j, is a one-dimensional real tensor of length N, where 1 ≤ j ≤ m, and its h-th element is s_keyword^{jh}, where 1 ≤ h ≤ N. For any given keyword, a unique acoustic characterization s_keyword representing the keyword can always be constructed manually.
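As an illustration, an N×m characterization of this shape could be constructed as a sequence of one-hot columns. This is a minimal sketch under assumptions: the unit-id sequence and category count are invented for the example, and the matrix is stored as a list of m length-N columns.

```python
# Hypothetical sketch: build an acoustic characterization sequence s_keyword
# as m one-hot columns of length N, one column per acoustic unit of the word.

def build_characterization(unit_ids, num_classes):
    """Return s_keyword as a list of m columns, each a length-N one-hot vector."""
    cols = []
    for unit in unit_ids:                   # unit = acoustic class index of this position
        col = [0.0] * num_classes
        col[unit] = 1.0
        cols.append(col)
    return cols

# Toy example: a wake-up word spanning 3 acoustic units out of N = 5 classes.
s_keyword = build_characterization([2, 0, 4], num_classes=5)
```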

The first processing module is used for processing the audio signal based on the acoustic classification model.

When the audio signal is acquired, it is processed by the acoustic classification model obtained through pre-training, so as to obtain the acoustic classification sequence of the audio signal. The acoustic classification sequence is denoted x, where:

x = [x_1, x_2, …, x_n] is a two-dimensional real tensor of shape n×N; n is the length of the audio frame sequence; N is the total number of acoustic classification categories. The i-th column vector of x is x_i = [x_{i1}, x_{i2}, …, x_{iN}], a one-dimensional real tensor of length N, where 1 ≤ i ≤ n; x_{ik}, the value of the k-th element of x_i, represents the probability that the i-th audio frame belongs to the k-th acoustic classification, where 1 ≤ k ≤ N and 0 ≤ x_{ik} ≤ 1.
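A minimal sketch of checking these constraints follows. Note one assumption beyond the text: the sum-to-one check presumes a softmax-style classifier output, which the patent does not state explicitly.

```python
# Hypothetical sketch: validate an acoustic classification sequence x,
# i.e. n rows of N per-frame class probabilities as defined above.

def is_valid_classification_sequence(x, num_classes, tol=1e-6):
    """Each row x_i must have N entries in [0, 1]; the sum-to-one check
    assumes a softmax-style classifier output (our assumption)."""
    for row in x:
        if len(row) != num_classes:
            return False
        if any(p < -tol or p > 1 + tol for p in row):
            return False
        if abs(sum(row) - 1.0) > 1e-3:
            return False
    return True

# Toy sequence: n = 2 frames over N = 3 acoustic classes.
x = [[0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8]]
```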

The second processing module is used for processing the acoustic classification sequence and the acoustic characterization sequence based on the neural network model.

Before processing begins, the neural network model is loaded into the corresponding processing body of the intelligent voice device. Here the acoustic classification sequence refers to that of the audio signal, and the acoustic characterization sequence refers to that of the wake-up word. The activation probability of the wake-up word is obtained through the processing of the neural network model.

The neural network model comprises a long short-term memory (LSTM) layer, convolutional layers, a maximum pooling layer and a linear transformation layer, combined as follows:

emb = Linear(LSTM(s_keyword))

P_keyword(activate | x) = Sigmoid(Conv_emb(MaxPool(Tanh(Conv(x))))), as shown in fig. 2. Here LSTM denotes the long short-term memory layer, Linear the linear transformation layer, Conv a convolutional layer, Tanh the tanh activation function, and MaxPool the maximum pooling layer; Conv_emb denotes a convolutional layer whose convolution parameters are emb, where emb is the acoustic characterization embedding of the wake-up word, and Sigmoid denotes the sigmoid activation function.

The activation probability of the wake-up word is denoted P_keyword(activate | x), where keyword is a specific wake-up word and x is the acoustic classification sequence of the corresponding audio frame sequence.
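The combination above can be sketched with toy one-dimensional operations. This is a hedged illustration only: the LSTM/Linear branch is assumed to have already produced emb, Conv_emb is collapsed to a single dot-product position, and all kernel sizes and values are invented rather than taken from the patent.

```python
# Hypothetical sketch of the scoring branch
#   P_keyword(activate | x) = Sigmoid(Conv_emb(MaxPool(Tanh(Conv(x)))))
# over per-frame scores already pooled across classes, for brevity.
import math

def conv1d(seq, kernel):
    """Valid-mode 1-D convolution (cross-correlation) of seq with kernel."""
    k = len(kernel)
    return [sum(kernel[t] * seq[i + t] for t in range(k))
            for i in range(len(seq) - k + 1)]

def tanh(seq):
    return [math.tanh(v) for v in seq]

def maxpool(seq, width=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(seq[i:i + width]) for i in range(0, len(seq) - width + 1, width)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def activation_probability(frame_scores, conv_kernel, emb):
    h = maxpool(tanh(conv1d(frame_scores, conv_kernel)))
    # Conv_emb: a convolution whose parameters are the word embedding emb,
    # collapsed here to one dot product over the pooled features.
    logit = sum(e * v for e, v in zip(emb, h))
    return sigmoid(logit)

# Toy run: 8 frames of scores, an assumed kernel and embedding.
scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.1, 0.3]
p = activation_probability(scores, conv_kernel=[0.5, 0.5], emb=[1.0, -0.5, 0.8])
```

The sigmoid output stays in (0, 1), matching the constraint that P_keyword(activate | x) is a probability.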

It can be seen from the foregoing technical solutions that, the present embodiment provides a device for determining activation probability of a wakeup word, where the device is applied to an intelligent voice device, and is specifically configured to obtain an audio signal received by the intelligent voice device; processing the audio signal based on the acoustic classification model to obtain an acoustic classification sequence of the audio signal; and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into the neural network model to obtain the activation probability of the awakening word. According to the scheme, the activation probability of the awakening word is accurately determined, so that the intelligent voice equipment can determine whether the equipment is activated or not according to the activation probability, and a key technical link is provided for normal work of the intelligent voice equipment.

The above technical solution is implemented on the basis of an existing neural network model. Where no such model exists, the apparatus further includes a first training module 40 and a second training module 50 for providing the neural network model for the above solution, as shown in fig. 5.

The first training module is used for model pre-training based on labeling data of a large amount of general text audios.

Specifically, the first training module specifically performs the following operations:

the first step: determine the acoustic classification model adopted by the intelligent voice device and save its network parameters as M_c, thereby determining the number N of acoustic classifications and the acoustic category set. The labeled text of the general text audio data is represented as an acoustic characterization sequence using the acoustic category set, and each general audio is force-aligned on the basis of the acoustic classification model and its labeled text (i.e. the acoustic classification sequence with the highest probability corresponding to the audio is found), thereby obtaining the force-aligned acoustic classification sequence.
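Forced alignment as described here can be sketched as a monotone dynamic-programming search over the frame-level class probabilities. This is an illustrative reconstruction, not the patent's algorithm: the frame probabilities and unit ids below are toy values, and the sketch assumes at least as many frames as target units.

```python
# Hypothetical sketch of forced alignment: find the most probable monotone
# frame-to-unit assignment for a target acoustic unit sequence.
import math

def force_align(frame_probs, target_units):
    """frame_probs: n rows of N class probabilities; target_units: unit ids.
    Returns the per-frame unit position of the best monotone path."""
    n, m = len(frame_probs), len(target_units)
    NEG = float("-inf")
    score = [[NEG] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    score[0][0] = math.log(frame_probs[0][target_units[0]] + 1e-12)
    for i in range(1, n):
        for j in range(m):
            stay = score[i - 1][j]                      # remain on unit j
            move = score[i - 1][j - 1] if j > 0 else NEG  # advance to unit j
            best = max(stay, move)
            if best == NEG:
                continue                                # unreachable cell
            back[i][j] = j if stay >= move else j - 1
            score[i][j] = best + math.log(frame_probs[i][target_units[j]] + 1e-12)
    # Trace back from the final unit position.
    path = [m - 1]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path

# Toy run: 4 frames over N = 3 classes, aligned to the unit sequence [0, 1, 2].
probs = [[0.8, 0.1, 0.1],
         [0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]
alignment = force_align(probs, target_units=[0, 1, 2])
```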

The force-aligned acoustic classification sequence is segmented, with segmentation intervals uniformly distributed over 30-50; each segment is denoted x_cut. The acoustic classifications of each x_cut are merged to obtain the acoustic characterization sequence s_cut corresponding to x_cut, and (x_cut, s_cut) is put into the training data set T as a training sample point. After the data set T is constructed, all distinct acoustic characterization sequence elements in T are found to form the acoustic characterization sequence set K_s; the training data set T is then augmented to T_a on the basis of K_s. The specific expansion is as follows:

while K_s has unselected elements:

    select an unselected element s_K from K_s;

    while T has elements unselected in the current round:

        select an element (x_cut, s_cut) from T;

        if s_K and s_cut are the same:

            put the data entry ((x_cut, s_K), 1) into the data set T_a;

        else:

            put the data entry ((x_cut, s_K), 0) into the data set T_a.
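The expansion above can be sketched in Python. This is an illustrative version under assumptions: characterizations are represented as hashable tuples so they can form a set, and the segment contents are placeholder strings.

```python
# Hypothetical sketch of expanding T into T_a: pair every segment x_cut with
# every distinct characterization s_K, labeled 1 on a match and 0 otherwise.

def expand_training_set(T):
    """T: list of (x_cut, s_cut) pairs. Returns T_a of ((x_cut, s_K), label)."""
    K_s = {s_cut for _, s_cut in T}          # distinct characterization set
    T_a = []
    for s_K in K_s:
        for x_cut, s_cut in T:
            label = 1 if s_K == s_cut else 0
            T_a.append(((x_cut, s_K), label))
    return T_a

# Toy set: two segments with different characterizations.
T = [(("frames-A",), ("hi",)), (("frames-B",), ("ok",))]
T_a = expand_training_set(T)
```

The cross product yields |K_s| × |T| entries, of which the matching pairs carry positive labels; this supplies the negatives that discriminative training needs.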

The second step: build a main neural network according to the network structure of the acoustic classification model;

the third step: train the main neural network using the training data set T_a; select a suitable parameter initialization method, learning rate and adaptive gradient algorithm, train the main neural network with a cross-entropy loss function, and reserve part of the training data for cross-checking during training to avoid overfitting;

the fourth step: save the neural network model parameters, denoted as M_g.

The second training module is used for tuning training based on the labeled data of a small amount of specific wake-up word audio.

The tuning training uses the labeled data of the specific wake-up word audio to tune the whole intelligent voice device end to end. This requires the acoustic classification model to be a neural network model; if a non-neural-network acoustic classification model is adopted, the following steps can be skipped. Specifically, this module executes the following process:

the first step: determine the training wake-up word and the acoustic characterization sequence s_f of the corresponding wake-up word.

The second step: splice the acoustic classification model (a neural network) with the main neural network, i.e. connect the output layer of the acoustic classification model directly to the acoustic classification sequence input layer (the Conv layer) of the main neural network, to form an end-to-end network;

the third step: load the activation-probability neural network parameters M_g and the acoustic classification model parameters M_c; train the end-to-end network of the second step with a cross-entropy loss function on the labeled audio data of the specific wake-up word, reserving part of the training data for cross-checking during training to avoid overfitting;

the fourth step: save the adjusted acoustic classification model parameters M_cf and main neural network parameters M_gf.

According to the above operations, once the corresponding parameters are obtained, they can be loaded into the main neural network to obtain the neural network model.
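The splicing of the second step can be sketched as function composition, where the classification network's output feeds the main network's input. The two stages below are toy stand-ins: the real device splices an output layer to a Conv input layer inside one jointly trainable network, which plain composition does not capture.

```python
# Hypothetical sketch: chain the acoustic classification stage and the main
# network into one end-to-end callable.

def make_end_to_end(acoustic_model, main_network):
    """Compose the two stages so audio frames map to an activation probability."""
    def end_to_end(audio_frames):
        x = acoustic_model(audio_frames)     # acoustic classification sequence
        return main_network(x)               # activation probability
    return end_to_end

# Toy stand-ins: a constant classifier and an averaging "main network".
acoustic = lambda frames: [[0.9, 0.1] for _ in frames]
main = lambda x: sum(row[0] for row in x) / len(x)
pipeline = make_end_to_end(acoustic, main)
p = pipeline(["f1", "f2", "f3"])
```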

EXAMPLE III

The embodiment also provides an intelligent voice device, including but not limited to an intelligent sound box, an intelligent car machine, an intelligent mobile phone, and the like, where the intelligent voice device is provided with the apparatus for determining activation probability of a wakeup word as provided in the above embodiment, and the apparatus is applied to the intelligent voice device, and is specifically used to obtain an audio signal received by the intelligent voice device; processing the audio signal based on the acoustic classification model to obtain an acoustic classification sequence of the audio signal; and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into the neural network model to obtain the activation probability of the awakening word. According to the scheme, the activation probability of the awakening word is accurately determined, so that the intelligent voice equipment can determine whether the equipment is activated or not according to the activation probability, and a key technical link is provided for normal work of the intelligent voice equipment.

Example four

Fig. 6 is a block diagram of an intelligent speech device according to an embodiment of the present application.

As shown in fig. 6, the smart voice device provided in this embodiment includes, but is not limited to, a smart speaker, a smart car machine, a smart phone, and the like, and the device includes at least one processor 101 and a memory 102, which are connected through a data bus 103. The memory is used for storing computer programs or instructions, and the processor acquires and executes the corresponding computer programs or instructions, so that the intelligent voice device can execute the determination method of the wake-up word probability provided by the embodiment.

The method for determining the activation probability of the awakening word comprises the steps of acquiring an audio signal received by intelligent voice equipment; processing the audio signal based on the acoustic classification model to obtain an acoustic classification sequence of the audio signal; and inputting the acoustic classification sequence and the acoustic characterization sequence of the awakening word into the neural network model to obtain the activation probability of the awakening word. According to the scheme, the activation probability of the awakening word is accurately determined, so that the intelligent voice equipment can determine whether the equipment is activated or not according to the activation probability, and a key technical link is provided for normal work of the intelligent voice equipment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The technical solutions provided by the present invention are described in detail above, and the principle and the implementation of the present invention are explained in this document by applying specific examples, and the descriptions of the above examples are only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
