Terminal device awakening method and device, storage medium and electronic device

Document No.: 193329    Publication date: 2021-11-02

Note: this technique, "Terminal device wake-up method and device, storage medium and electronic device" (终端设备唤醒方法和装置、存储介质及电子装置), was created by 葛路奇, 张卓博 and 朱文博 on 2021-06-25. Its main content is as follows: the invention discloses a terminal device wake-up method and apparatus, a storage medium, and an electronic device. The method includes: acquiring audio data to be recognized; performing wake-up recognition in each of at least two wake-up models configured on the terminal device, based on the audio features that each model extracts from the audio data in a different dimension, to obtain an audio recognition result corresponding to each wake-up model, where each wake-up model is used to extract audio features in one dimension; and, when the audio recognition results satisfy the wake-up condition, switching the terminal device to the awake state. This technical solution solves the problem of poor wake-up performance of terminal devices in the prior art.

1. A terminal device wake-up method, characterized by comprising the following steps:

acquiring audio data to be recognized;

performing wake-up recognition in each of at least two wake-up models configured on the terminal device, based on the audio features that each model extracts from the audio data in a different dimension, to obtain an audio recognition result corresponding to each wake-up model, wherein each wake-up model is used to extract audio features in one dimension;

and, when the audio recognition results satisfy a wake-up condition, switching the terminal device to an awake state.

2. The method according to claim 1, wherein, after the wake-up recognition is performed in each of the at least two wake-up models configured in the terminal device, based on the audio features extracted from the audio data in different dimensions, and the audio recognition result corresponding to each wake-up model is obtained, the method further comprises:

determining that the audio recognition results satisfy the wake-up condition when the number of audio recognition results indicating that the audio data carries wake-up information is greater than a first threshold.

3. The method according to claim 2, wherein performing the wake-up recognition in each of the at least two wake-up models configured in the terminal device, based on the audio features extracted from the audio data in different dimensions, to obtain the audio recognition result corresponding to each wake-up model comprises:

taking each of the at least two wake-up models in turn as the current wake-up model and performing the following operations:

extracting, in the current wake-up model, the audio features of the audio data in the current dimension;

performing wake-up recognition on the audio features in the current dimension;

and, when the wake-up keyword is recognized from the audio features in the current dimension, determining that the audio recognition result carries the wake-up information.

4. The method according to claim 1, wherein, after the wake-up recognition is performed in each of the at least two wake-up models configured in the terminal device, based on the audio features extracted from the audio data in different dimensions, and the audio recognition result corresponding to each wake-up model is obtained, the method further comprises:

inputting the audio data sequentially into each of the at least two wake-up models to obtain the audio recognition result, wherein, for any two adjacent wake-up models among the at least two, the output of the first wake-up model and the audio data are input together into the second wake-up model, the first wake-up model preceding the second.

5. The method according to claim 4, wherein performing the wake-up recognition in each of the at least two wake-up models configured in the terminal device, based on the audio features extracted from the audio data in different dimensions, to obtain the audio recognition result corresponding to each wake-up model comprises:

determining that the audio recognition result satisfies the wake-up condition when the output of the last wake-up model indicates that the audio data carries the wake-up keyword.

6. The method of claim 1, further comprising, before the acquiring of the audio data to be recognized:

acquiring a plurality of sample audio data;

and training at least two initial wake-up models with the plurality of sample audio data to obtain the at least two wake-up models.

7. The method of claim 6, wherein training the at least two initial wake-up models with the plurality of sample audio data to obtain the at least two wake-up models comprises:

traversing the at least two initial wake-up models and performing the following operations until a convergence condition is reached:

determining the current initial wake-up model to be trained;

when the current initial wake-up model is not the first initial wake-up model, acquiring the reference training result produced by training the preceding initial wake-up model, and training the current initial wake-up model with the reference training result and the plurality of sample audio data to obtain a current training result;

when the current initial wake-up model is the first initial wake-up model, training it with the plurality of sample audio data to obtain a current training result;

and, when the current training result does not reach the convergence condition, taking the next initial wake-up model after the current one as the current initial wake-up model.

8. The method of claim 6, wherein training the at least two initial wake-up models with the plurality of sample audio data to obtain the at least two wake-up models comprises:

when the at least two initial wake-up models comprise two initial wake-up models, inputting part of the plurality of sample audio data into the first initial wake-up model as a training set for training, and inputting the remaining audio data into the first initial wake-up model as a test set for prediction, to obtain a prediction result;

splicing the prediction result of the first initial wake-up model with the plurality of sample audio data to obtain spliced data;

and inputting the spliced data into the second initial wake-up model for training until a convergence condition is reached, at which point the at least two wake-up models are obtained.

9. A terminal device wake-up apparatus, comprising:

an acquisition unit, configured to acquire audio data to be recognized;

an extraction unit, configured to perform, in each of at least two wake-up models configured in the terminal device, wake-up recognition based on the audio features extracted from the audio data in different dimensions, to obtain an audio recognition result corresponding to each wake-up model, where each wake-up model is used to extract audio features in one dimension;

and an adjusting unit, configured to switch the terminal device to the awake state when the audio recognition result satisfies the wake-up condition.

10. A computer-readable storage medium comprising a stored program, wherein the program, when executed, performs the method of any one of claims 1 to 8.

11. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program and the processor is arranged to execute the method of any one of claims 1 to 8 by means of the computer program.

Technical Field

The invention relates to the technical field of speech recognition, and in particular to a terminal device wake-up method and apparatus, a storage medium, and an electronic device.

Background

In the field of voice interaction, a terminal device is normally in a standby state. To interact with the device, the first step is to wake it up; the wake-up target is integrated into the terminal's wake-up algorithm. When noise or other non-wake-up speech from the user falsely wakes the device, it brings considerable inconvenience to the user's daily life.

To address this problem, the prior art generally uses a secondary check to assist calibration and reduce false wake-ups. However, the models used for the secondary wake-up check are generally large, high-precision models that are difficult to deploy on the terminal and are usually hosted in the cloud. Network transmission and cloud computation increase the wake-up response time, resulting in poor wake-up performance of the device.

No effective solution has yet been proposed for the problem of poor wake-up performance of terminal devices in the related art.

Disclosure of Invention

The embodiments of the invention provide a terminal device wake-up method and apparatus, a storage medium, and an electronic device, so as to at least solve the problem of poor wake-up performance during terminal device wake-up.

According to an aspect of the embodiments of the present invention, a terminal device wake-up method is provided, including: acquiring audio data to be recognized; performing wake-up recognition in each of at least two wake-up models configured on the terminal device, based on the audio features that each model extracts from the audio data in a different dimension, to obtain an audio recognition result corresponding to each wake-up model, wherein each wake-up model is used to extract audio features in one dimension; and, when the audio recognition results satisfy the wake-up condition, switching the terminal device to the awake state.

Optionally, after the wake-up recognition is performed in each of the at least two wake-up models configured in the terminal device, based on the audio features extracted from the audio data in different dimensions, and the audio recognition result corresponding to each wake-up model is obtained, the method further includes: determining that the audio recognition results satisfy the wake-up condition when the number of audio recognition results indicating that the audio data carries wake-up information is greater than a first threshold.

Optionally, performing the wake-up recognition in each of the at least two wake-up models configured in the terminal device, based on the audio features extracted from the audio data in different dimensions, to obtain the audio recognition result corresponding to each wake-up model includes: taking each of the at least two wake-up models in turn as the current wake-up model and performing the following operations: extracting, in the current wake-up model, the audio features of the audio data in the current dimension; performing wake-up recognition on the audio features in the current dimension; and, when the wake-up keyword is recognized from the audio features in the current dimension, determining that the audio recognition result carries the wake-up information.

Optionally, after the wake-up recognition is performed in each of the at least two wake-up models configured in the terminal device, and the audio recognition result corresponding to each wake-up model is obtained, the method further includes: inputting the audio data sequentially into each of the at least two wake-up models to obtain the audio recognition result, wherein, for any two adjacent wake-up models among the at least two, the output of the first wake-up model and the audio data are input together into the second wake-up model, the first wake-up model preceding the second.
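The sequential arrangement above can be illustrated with a small sketch. The scorers below are hypothetical stand-ins (real wake-up models would be neural acoustic models); only the data flow mirrors the method, in which each model receives both the raw audio and the previous model's output:

```python
from typing import Callable, List

# Hypothetical per-model scorers: each takes (audio, prior_score) and returns
# a wake-up score in [0, 1]. The scoring logic is illustrative only.
def model_a(audio: List[float], prior: float) -> float:
    # Mean absolute amplitude as a crude first-stage score.
    return min(1.0, sum(abs(x) for x in audio) / len(audio))

def model_b(audio: List[float], prior: float) -> float:
    # The second model sees both the raw audio and the first model's output.
    energy = max(abs(x) for x in audio)
    return 0.5 * prior + 0.5 * min(1.0, energy)

def cascade(audio: List[float], models: List[Callable]) -> float:
    score = 0.0
    for m in models:             # each model receives the audio together with
        score = m(audio, score)  # the preceding model's output
    return score

audio = [0.1, -0.4, 0.3, 0.8, -0.2]
final = cascade(audio, [model_a, model_b])
wake = final > 0.5  # the last model's output decides the wake-up condition
```

The design point is that later models refine, rather than replace, the earlier judgment: the raw audio is re-presented at every stage.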

Optionally, performing the wake-up recognition in each of the at least two wake-up models configured in the terminal device, to obtain the audio recognition result corresponding to each wake-up model, includes: determining that the audio recognition result satisfies the wake-up condition when the output of the last wake-up model indicates that the audio data carries the wake-up keyword.

Optionally, before the acquiring of the audio data to be recognized, the method further includes: acquiring a plurality of sample audio data; and training at least two initial wake-up models with the plurality of sample audio data to obtain the at least two wake-up models.

Optionally, training the at least two initial wake-up models with the plurality of sample audio data to obtain the at least two wake-up models includes: traversing the at least two initial wake-up models and performing the following operations until a convergence condition is reached: determining the current initial wake-up model to be trained; when the current initial wake-up model is not the first initial wake-up model, acquiring the reference training result produced by training the preceding initial wake-up model, and training the current initial wake-up model with the reference training result and the plurality of sample audio data to obtain a current training result; when the current initial wake-up model is the first initial wake-up model, training it with the plurality of sample audio data to obtain a current training result; and, when the current training result does not reach the convergence condition, taking the next initial wake-up model as the current initial wake-up model.
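The traversal described above can be sketched as follows. `ToyModel` and its scalar "training result" are illustrative stand-ins for real initial wake-up models and reference training results; only the control flow (first model trains alone, later models also receive the previous result, loop until convergence) mirrors the method:

```python
class ToyModel:
    """Hypothetical initial wake-up model: its 'training result' is a single
    score that improves when a reference result from the previous model is
    supplied alongside the sample data."""
    def fit(self, samples, reference=None):
        base = sum(samples) / len(samples)
        return base + (reference or 0.0) * 0.5

def train_sequentially(models, samples, target):
    """Traverse the models until the convergence condition (result >= target)
    is reached; each non-first model receives the preceding reference result."""
    reference = None
    while True:
        for i, model in enumerate(models):
            result = model.fit(samples) if i == 0 else model.fit(samples, reference)
            if result >= target:       # convergence condition reached
                return result
            reference = result         # handed to the next model in the traversal

samples = [0.4, 0.6, 0.8]
final = train_sequentially([ToyModel(), ToyModel()], samples, target=0.85)
```

With these toy numbers the first model scores 0.6, which falls short of the target, and the second model lifts the result above it using the reference.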

Optionally, training the at least two initial wake-up models with the plurality of sample audio data to obtain the at least two wake-up models includes: when the at least two initial wake-up models comprise two initial wake-up models, inputting part of the plurality of sample audio data into the first initial wake-up model as a training set for training, and inputting the remaining audio data into the first initial wake-up model as a test set for prediction, to obtain a prediction result; splicing the prediction result of the first initial wake-up model with the plurality of sample audio data to obtain spliced data; and inputting the spliced data into the second initial wake-up model for training until a convergence condition is reached, at which point the at least two wake-up models are obtained.
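This two-model procedure resembles classical stacking. A minimal sketch on synthetic data follows; `ThresholdModel` is a hypothetical stand-in for an initial wake-up model, and appending the first model's predictions to the sample features as an extra column is one reading of the "splicing" step, which the text leaves open:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample set: 200 clips, 8 features each; label 1 = contains the wake word.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

class ThresholdModel:
    """Stand-in for an initial wake-up model: scores along the difference of
    class means and thresholds at zero."""
    def fit(self, X, y):
        self.w = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
        return self
    def predict(self, X):
        return (X @ self.w > 0).astype(int)

# Step 1: train the first model on part of the samples, predict on the rest.
X_train, y_train = X[:120], y[:120]
X_test = X[120:]
model_1 = ThresholdModel().fit(X_train, y_train)
test_pred = model_1.predict(X_test)          # the "prediction result"

# Step 2: splice model 1's predictions with the sample data (here: predictions
# over all samples appended as an extra feature column).
X_spliced = np.column_stack([X, model_1.predict(X)])

# Step 3: train the second model on the spliced data (a single closed-form
# fit stands in for the convergence loop).
model_2 = ThresholdModel().fit(X_spliced, y)
acc = float((model_2.predict(X_spliced) == y).mean())
```

In production stacking one would use out-of-fold predictions to avoid leaking the first model's training labels into the second model's input.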

According to another aspect of the embodiments of the present invention, there is provided a terminal device wake-up apparatus, including: an acquisition unit, configured to acquire audio data to be recognized; an extraction unit, configured to perform, in each of at least two wake-up models configured in the terminal device, wake-up recognition based on the audio features extracted from the audio data in different dimensions, to obtain an audio recognition result corresponding to each wake-up model, where each wake-up model is used to extract audio features in one dimension; and an adjusting unit, configured to switch the terminal device to the awake state when the audio recognition result satisfies the wake-up condition.

According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to execute the above terminal device wake-up method when running.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the above terminal device wake-up method through the computer program.

According to the embodiments of the invention, audio data to be recognized are acquired; wake-up recognition is performed in each of at least two wake-up models configured on the terminal device, based on the audio features that each model extracts from the audio data in a different dimension, to obtain an audio recognition result corresponding to each wake-up model, where each wake-up model extracts audio features in one dimension; and, when the audio recognition results satisfy the wake-up condition, the terminal device is switched to the awake state. That is, at least two wake-up models are deployed on the terminal device, each extracting audio features in a different dimension, so that an audio recognition result is obtained for each wake-up model. Whether the wake-up condition of the terminal device is met is then judged from the combined recognition results, and the terminal device is switched to the awake state when it is. Performing feature extraction and audio recognition in different dimensions with at least two wake-up models overcomes the poor wake-up performance of single-model wake-up recognition, improves the reliability of the audio recognition result, and thus improves the wake-up performance of the terminal device.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

fig. 1 is a schematic diagram of a hardware environment of an alternative terminal device wake-up method according to an embodiment of the present invention;

fig. 2 is a flowchart of an alternative terminal device wake-up method according to an embodiment of the present invention;

fig. 3 is a schematic diagram (one) of an alternative terminal device wake-up method according to an embodiment of the present invention;

fig. 4 is a schematic diagram (two) of an alternative terminal device wake-up method according to an embodiment of the present invention;

fig. 5 is a schematic diagram (one) of a terminal device wake-up method in the related art;

fig. 6 is a schematic diagram (two) of a terminal device wake-up method in the related art;

FIG. 7 is a flowchart of an alternative wake-up model training method according to an embodiment of the present invention;

FIG. 8 is a flowchart of another alternative wake-up model training method according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of an alternative wake-up model training method according to an embodiment of the present invention;

fig. 10 is a block diagram of a terminal device wake-up apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an aspect of the embodiments of the present invention, a terminal device wake-up method is provided. Optionally, as one implementation, the method may be, but is not limited to being, used in a terminal device wake-up system in the hardware environment shown in fig. 1, where the system may include, but is not limited to, a terminal device 102, a network 104, a server 106, and a database 108. A target client logged in with a target user account runs on the terminal device 102 (in fig. 1 the target client is an audio recognition client, taken as an example). The terminal device 102 includes a human-computer interaction screen, a processor, and a memory. The human-computer interaction screen is used to display the wake-up scene of the terminal device in the running state (for example, whether the terminal device is in the waiting state or the awake state), and also to provide a human-computer interaction interface for receiving the human-computer interaction operations that wake up the terminal device.

In addition, the server 106 includes a processing engine that performs storage and read operations on the database 108, for example storing the state of the terminal device and the function information of the corresponding wake-up model, so as to complete the terminal device wake-up process provided in this embodiment.

The specific process includes the following steps. In step S102, the audio data to be recognized are acquired, and step S104 is executed when at least two wake-up models are configured in the terminal device: in each of the at least two wake-up models, wake-up recognition is performed based on the audio features extracted from the audio data in different dimensions, to obtain the audio recognition result corresponding to each wake-up model, where each wake-up model is used to extract audio features in one dimension. When the audio recognition result reaches the wake-up condition, the terminal device is switched to the awake state in step S106. Then steps S108-S110 are executed: the audio recognition result corresponding to each wake-up model is sent to the server 106 through the network 104, and the server 106 stores it in the database 108.

The interface and flow shown in fig. 1 are examples; the steps may also be executed on a standalone hardware device with strong processing capability, which is not limited in the embodiment of the present application.

It should be noted that, in this embodiment, in each of the at least two wake-up models configured in the terminal device, audio features in different dimensions are extracted from the audio data and used for wake-up recognition, so as to obtain an audio recognition result corresponding to each wake-up model. When the audio recognition results reach the wake-up condition, the terminal device is switched to the awake state. That is, different wake-up models extract the audio features of the audio data in different dimensions, wake-up recognition is performed on the features in each dimension, and a joint decision on the audio data is made from the individual recognition results. When the decision result reaches the wake-up condition, the terminal device is switched to the awake state; otherwise it remains in the waiting state. This avoids the limitation of the audio recognition result of a single wake-up model, improves the reliability of the wake-up recognition result, and solves the problem of poor wake-up performance of terminal devices in the related art.

Optionally, in this embodiment, the terminal device may be any device that supports running the target application, including but not limited to at least one of the following: mobile phones (such as Android phones and iOS phones), notebook computers, tablet computers, palmtop computers, MIDs (Mobile Internet Devices), PADs, desktop computers, smart televisions, and the like. The target application may be a terminal application that supports running the target task and displaying its task scene, such as a video application, an instant messaging application, a browser application, or an education application. The network may include, but is not limited to, a wired network or a wireless network, where the wired network includes local area networks, metropolitan area networks, and wide area networks, and the wireless network includes Bluetooth, WIFI, and other networks enabling wireless communication. The server may be a single server, a server cluster composed of multiple servers, or a cloud server. The above is merely an example and is not limited in this embodiment.


To solve the problem of poor wake-up performance during terminal device wake-up, this embodiment provides a terminal device wake-up method. Fig. 2 is a flowchart of the terminal device wake-up method according to an embodiment of the present invention, which includes the following steps:

step S202, acquiring audio data to be recognized;

step S204, performing wake-up recognition in each of at least two wake-up models configured in the terminal device, based on the audio features extracted from the audio data in different dimensions, to obtain an audio recognition result corresponding to each wake-up model, where each wake-up model is used to extract audio features in one dimension;

and step S206, switching the terminal device to the awake state when the audio recognition result reaches the wake-up condition.

In step S202, the audio data to be recognized may be obtained in ways including, but not limited to, the following: directly capturing the user's voice as the audio data; or selecting one piece of voice data from a plurality of user voice recordings pre-stored in a voice playback device and playing it, the played voice serving as the audio data.

Further, based on the audio data determined in step S202, features are extracted from the audio data by the multiple wake-up models configured on the terminal device. It should be understood that different types of wake-up models differ in granularity, so the audio features they extract differ from one another. Wake-up recognition is performed on the audio features in the different dimensions, yielding a distinct audio recognition result for each wake-up model. This avoids the limitation of an audio recognition result produced by a single wake-up model and improves the reliability of the audio recognition process.
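As one illustration of features "in different dimensions", the sketch below computes a time-domain feature set and a frequency-domain feature set from the same (synthetic) audio; the specific features are examples, not the ones the embodiment prescribes:

```python
import numpy as np

rng = np.random.default_rng(1)
audio = rng.normal(size=16000)  # one second of synthetic audio at 16 kHz

def time_domain_features(x: np.ndarray, frame: int = 400) -> np.ndarray:
    """One 'dimension': per-frame log energy."""
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    return np.log((frames ** 2).sum(axis=1) + 1e-8)

def spectral_features(x: np.ndarray, frame: int = 400) -> np.ndarray:
    """Another 'dimension': per-frame spectral centroid."""
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame, d=1 / 16000)
    return (mag * freqs).sum(axis=1) / (mag.sum(axis=1) + 1e-8)

# Each wake-up model would consume exactly one of these feature sets.
feats = {"time": time_domain_features(audio), "spectral": spectral_features(audio)}
```

Because the two feature sets respond to different properties of the signal (energy vs. spectral shape), models trained on them tend to make partly independent errors, which is what makes the joint decision useful.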

In step S206, whether the audio recognition result reaches the wake-up condition may be determined in at least one of the following ways:

computing, for each wake-up model, the similarity between the wake-up keyword contained in its audio recognition result and the preset wake-up word in the wake-up condition, then taking a weighted sum of the similarities to obtain a total similarity; the wake-up condition is reached if the total similarity reaches a set threshold;

or computing, for each wake-up model, the similarity between the wake-up keyword contained in its audio recognition result and the preset wake-up word in the wake-up condition; the wake-up condition is reached when the ratio of the number of wake-up models reaching the set threshold to the total number of wake-up models exceeds one half.
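The two determination ways above can be sketched directly; the similarity scores, weights, and thresholds below are illustrative values:

```python
def weighted_decision(similarities, weights, threshold):
    """Way 1: weighted sum of per-model keyword similarities vs. a threshold."""
    total = sum(s * w for s, w in zip(similarities, weights))
    return total >= threshold

def majority_decision(similarities, per_model_threshold):
    """Way 2: wake up when more than half the models pass their own threshold."""
    passed = sum(1 for s in similarities if s >= per_model_threshold)
    return passed / len(similarities) > 0.5

sims = [0.9, 0.4, 0.8]  # hypothetical similarity scores from 3 wake-up models
w1 = weighted_decision(sims, weights=[0.4, 0.2, 0.4], threshold=0.7)
m1 = majority_decision(sims, per_model_threshold=0.6)
```

With these values both strategies agree: the weighted total is 0.76, and two of the three models clear the 0.6 per-model threshold.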

It should be noted that the different types of wake-up models are all configured on the same terminal device, so the audio recognition result of each wake-up model can be quickly passed to the terminal device's data processing module. This saves data transmission time, improves the wake-up efficiency of the terminal device, and thus improves the wake-up performance.

In this embodiment, wake-up models of different types are deployed in the terminal device, and whether to wake up the terminal device is decided jointly from the audio recognition results of the multiple models. This improves the wake-up rate of the terminal device and reduces the false wake-up frequency, thereby improving the wake-up performance of the terminal device.

In an optional embodiment, after the step S204, the method further includes:

and determining that the audio recognition result reaches the wake-up condition under the condition that the number of audio recognition results indicating that the audio data carries wake-up information is greater than a first threshold.

Specifically, as shown in fig. 3, assume that the wake-up module in the terminal device contains three models A, B and C of different types. The speech signal is input into models A, B and C respectively, and three speech recognition results, one per model, are obtained. When 2 of the 3 speech recognition results indicate that the speech signal carries wake-up information, it is determined that the results of the three models reach the wake-up condition.
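The counting rule can be expressed directly; the boolean inputs and the first threshold of 1 below simply mirror the hypothetical three-model example above.

```python
def reaches_wake_condition(results, first_threshold):
    """results holds one boolean per wake-up model: True if that model's
    audio recognition result indicates the audio carries wake-up information.
    The wake-up condition is reached when the count exceeds the threshold."""
    return sum(results) > first_threshold

# Models A and B detect wake-up information, model C does not: 2 > 1.
decision = reaches_wake_condition([True, True, False], first_threshold=1)
```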

As an optional implementation, the at least two wake-up models respectively perform wake-up recognition on the speech signal, and a speech recognition result corresponding to each wake-up model is obtained as follows:

taking each of the at least two wake-up models in turn as the current wake-up model, and performing the following operations:

extracting, in the current wake-up model, the audio features of the audio data in the current dimension;

performing wake-up recognition on the audio features in the current dimension;

and determining that the audio recognition result carries wake-up information under the condition that the wake-up keyword is recognized from the audio features in the current dimension.
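A sketch of the per-model loop described above. The `WakeModel` class and its keyword-matching logic are toy stand-ins: the embodiment does not specify how features are extracted or how the wake-up keyword is recognized.

```python
class WakeModel:
    """Toy wake-up model: 'extracts' the audio feature in one dimension and
    checks whether the wake-up keyword appears in it (placeholder logic)."""
    def __init__(self, dimension):
        self.dimension = dimension

    def extract_features(self, audio_data):
        # One dimension of audio features per model, as in the embodiment.
        return audio_data.get(self.dimension, "")

    def detect_keyword(self, features, wake_word):
        return wake_word in features

def run_wake_recognition(audio_data, models, wake_word):
    results = []
    for model in models:  # each model in turn is the "current wake-up model"
        features = model.extract_features(audio_data)
        carries_wake_info = model.detect_keyword(features, wake_word)
        results.append(carries_wake_info)
    return results

audio = {"spectral": "ok hello device now", "temporal": "background noise"}
results = run_wake_recognition(
    audio, [WakeModel("spectral"), WakeModel("temporal")], "hello device")
```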

As shown in fig. 3, the training data is input into wake-up model A in the terminal device, the speech feature in the first dimension is extracted by wake-up model A, and the speech feature in the first dimension is compared with the speech signal in the wake-up module to obtain a training result R1, as in step S304. When the training result R1 reaches the set condition RR, the result of speech recognition by wake-up model A is determined to be: the speech signal carries wake-up information. When the training result R1 does not reach the set condition RR, the result of speech recognition by wake-up model A is determined to be: the speech signal does not carry wake-up information. The training result may include, but is not limited to, a recognition rate, and the set condition may include, but is not limited to, a recognition rate threshold.

Similarly, the training data is input into wake-up model B in the terminal device, the speech feature in the second dimension is extracted by wake-up model B, and the speech feature in the second dimension is compared with the speech signal in the wake-up module to obtain a training result R2, as in step S304. When the training result R2 reaches the set condition RR, the result of speech recognition by wake-up model B is determined to be: the speech signal carries wake-up information. When the training result R2 does not reach the set condition RR, the result of speech recognition by wake-up model B is determined to be: the speech signal does not carry wake-up information.

The training data is likewise input into wake-up model C in the terminal device, the speech feature in the third dimension is extracted by wake-up model C, and the speech feature in the third dimension is compared with the speech signal in the wake-up module to obtain a training result R3, as in step S304. When the training result R3 reaches the set condition RR, the result of speech recognition by wake-up model C is determined to be: the speech signal carries wake-up information. When the training result R3 does not reach the set condition RR, the result of speech recognition by wake-up model C is determined to be: the speech signal does not carry wake-up information.

When the recognition results of wake-up model A and wake-up model B both indicate that the speech signal carries wake-up information and the recognition result of wake-up model C indicates that it does not, the number of recognition results indicating that the speech signal carries wake-up information is recorded as 2, and the number indicating that it does not is 1. Since 2 is greater than 1, it is determined that the speech recognition result of the terminal device reaches the wake-up condition, as in step S308.

That is, by adopting the principle that the minority obeys the majority, a first threshold is set. Among the speech recognition results of wake-up model A, wake-up model B and wake-up model C, the number of recognition results indicating that the speech signal carries wake-up information is 2; since this number is greater than the set first threshold of 1, it is determined that the speech recognition result reaches the wake-up condition.

It should be noted that the manner of obtaining wake-up model A, wake-up model B and wake-up model C in this embodiment, and their types, may include, but are not limited to, one of the following: a simple model obtained by ensemble learning, or a fine model that achieves a certain classification effect after sufficient training. The number of models used for speech recognition is likewise not limited.

By adopting the above technical solution, the same group of speech signals is recognized by a plurality of models with different structures, and the recognition results are then voted on (the minority obeying the majority) to jointly decide whether to wake up the terminal device. Performing speech recognition with different models and making the decision by combining multiple models makes the final recognition result more reasonable, while also improving the wake-up rate of the terminal device.

As an optional embodiment, as shown in fig. 4, after wake-up recognition is performed in each of the at least two wake-up models configured in the terminal device, based on the audio features in different dimensions extracted from the audio data, to obtain the audio recognition result corresponding to each wake-up model, the method further includes:

and inputting the audio data sequentially into each of the at least two wake-up models to obtain the audio recognition result, wherein, for two adjacent wake-up models among the at least two wake-up models, the output result of the first wake-up model and the audio data are input into the second wake-up model at the same time, and the first wake-up model is located before the second wake-up model.

As shown in fig. 4, there are two adjacent wake-up models A and B in the wake-up module of the terminal device. The audio data is first input into wake-up model A, which maps the audio features in the audio data so as to distinguish audio features of different types. A plurality of feature sets corresponding to the plurality of audio features in the audio data is then obtained. It can be understood that these feature sets are distributed in respective sub-regions of the feature space, i.e. they appear as a plurality of hidden-layer representations distributed in sequence through the feature space.

The penultimate hidden layer is then merged with the audio data, the merged data is taken as the input of wake-up model B, audio recognition is performed again, and the obtained recognition result is taken as the final output. It can be understood that, because the input data of wake-up model B contains the audio features after the first mapping by wake-up model A, data of different types can be better distinguished after the second mapping by wake-up model B, so a better classification effect is obtained.

It can be understood that wake-up model A and wake-up model B above are two adjacent models, with wake-up model A located before wake-up model B: the partial output of wake-up model A (its penultimate layer) is transmitted to wake-up model B, the audio data is judged a second time by wake-up model B, and the output of wake-up model B is taken as the final judgment result. That is, among adjacent models in a plurality of serial models, the partial output of the former model is used as an input of the adjacent latter model, and the audio data is judged in sequence, so as to obtain a recognition result with a better classification effect.

It should be noted that wake-up model A above is a relatively fine model obtained by training with training data, and the type of wake-up model A is not limited here, for example: a Deep Neural Network (DNN) model or a Convolutional Neural Network (CNN) model. Wake-up model B above is a simple linear classifier, and the specific type of wake-up model B is likewise not limited here.
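The cascade of fig. 4 can be sketched with NumPy: a tiny network stands in for the trained wake-up model A, its penultimate-layer activations are merged with the raw audio features, and a linear classifier stands in for wake-up model B. All weights and the input vector here are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Wake-up model A: a tiny DNN; random weights stand in for a trained model.
W1 = rng.normal(size=(16, 32))
W2 = rng.normal(size=(32, 8))
W3 = rng.normal(size=(8, 2))

def model_a_forward(x):
    h1 = relu(x @ W1)
    h2 = relu(h1 @ W2)      # penultimate hidden layer, passed on to model B
    return h2 @ W3, h2

# Wake-up model B: a simple linear classifier over [penultimate | raw audio].
Wb = rng.normal(size=(8 + 16,))

def model_b_forward(x, h2):
    merged = np.concatenate([h2, x])  # merge hidden features with audio data
    return float(merged @ Wb) > 0.0   # final wake / no-wake decision

x = rng.normal(size=16)               # placeholder audio feature vector
_, h2 = model_a_forward(x)
decision = model_b_forward(x, h2)
```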

Further, in the related art shown in fig. 5, a speech signal is acquired and processed in step S502; then the processed speech signal is input into the wake-up module, and at the same time the training result processed by wake-up model A is input into the wake-up module, as in steps S504-S506. The wake-up module judges whether the wake-up condition is reached, and the device responds when the wake-up condition is reached.

That is to say, this wake-up scheme deploys a wake-up module after the terminal's signal processing; wake-up model A judges whether the speech signal contains wake-up information, and when the judgment result is that the speech signal contains wake-up information, it is determined that the wake-up condition is reached, the device responds, and the device state is adjusted to the awake state.

It can be understood that, because the fineness of each model differs, the defects of existing models also differ. If the speech signal is recognized by only the single wake-up model A in fig. 1, false wake-up may occur; or, in some scenarios, a certain type of speech data is likely to be falsely woken up and this is not easy to improve, which affects the user experience. In order to meet the service requirement and reduce false wake-up, a complex and fine model is deployed in the cloud to perform a secondary check on the audio; the specific process is as follows:

as shown in fig. 6, through step S602, a voice signal is acquired and processed; then, the processed voice signal is input into the wake-up module, and the training result processed by the wake-up model a is input into the wake-up module, as shown in steps S604 to S606. And under the condition that the awakening module is in the awakening state, performing secondary verification on the awakening result to the cloud end, and returning the verification result to the awakening module, in steps S608-S610. Whether the awakening condition is achieved or not is judged through the awakening module, and the equipment responds under the condition that the awakening condition is achieved.

That is to say, when the wake-up result produced by wake-up model A for the speech signal carries wake-up information, wake-up model A transmits the wake-up result to the wake-up module. When the wake-up module detects the wake-up result, the audio that woke up the terminal is uploaded to the cloud for classification (i.e. cloud check), and the wake-up classification result (check result) is then returned to the terminal. If the returned result is that the wake-up condition is reached, the device responds and is adjusted to the awake state; if the returned result is that the wake-up condition is not reached, the device does not respond.

It should be noted that, in the related art, the cloud wake-up check requires that, after the terminal wakes up, the audio be packaged and transmitted to the cloud, classified by the cloud, and the result returned to the terminal, which increases the wake-up response time (data transmission plus cloud computing). If network fluctuations are encountered, the maximum latency grows further. The cloud check is performed only after the terminal wake-up module has already woken up; meanwhile, the cloud must also trade off between the wake-up rate and false wake-up, tuning parameters to filter out false wake-ups while leaking as few wake-up words as possible. As a result, the total wake-up rate is less than or equal to the wake-up rate of the terminal alone, while the false wake-up frequency is lower than that of the terminal alone, so the wake-up rate is reduced to a certain extent. Therefore, the wake-up mode adopting the terminal + cloud multi-model secondary check cannot balance the wake-up rate and the false wake-up frequency, which also causes the problem of poor wake-up performance of the terminal device.

The embodiment of the present invention improves on the above technical problems in the related art. Specifically, at least two wake-up models are configured in the terminal device, and among adjacent models in the plurality of serial models, the partial output of the former model is used as an input of the adjacent latter model, with the audio data judged in sequence, so as to obtain a classification result with a better classification effect. At the same time, the secondary check in the cloud is avoided, the consumption of wake-up response time is reduced, the false wake-up frequency is reduced without reducing the wake-up rate, and the wake-up performance of the terminal device is improved.

As an optional embodiment, after the output result of the first wake-up model and the audio data are input into the second wake-up model at the same time, the method further includes:

and under the condition that the output result of the last awakening model indicates that the audio data carries the awakening keyword, determining that the audio identification result reaches the awakening condition.

That is, a plurality of adjacent wake-up models recognize the audio data in sequence, and each wake-up model produces an output result, which may indicate either that the audio data carries the wake-up keyword or that it does not. The basis for judging whether the audio recognition result reaches the wake-up condition is then: when the output result of the last wake-up model indicates that the audio data carries the wake-up keyword, it is determined that the wake-up condition is reached.

Through the above judgment process, the plurality of models perform wake-up recognition on the audio data in different dimensions in sequence; in this process, each later wake-up model achieves a better classification effect on the audio data, and the reliability of its output result is higher. Therefore, judging whether the wake-up condition is reached from the output result of the last wake-up model yields a better classification effect, reduces the false wake-up frequency, and improves the wake-up rate of the terminal device.

As an alternative embodiment, before the step S202, the method further includes:

obtaining a plurality of sample audio data;

training the at least two initialization wake-up models by using a plurality of sample audio data to obtain at least two wake-up models.

Specifically, as shown in fig. 7, the process of training two initialization wakeup models by using a plurality of sample audio data includes:

S702, using the plurality of sample audio data as training samples, and dividing them proportionally into a training-test set and a verification set;

S704, splitting the training-test set by a cross-validation method, wherein one part of the set is used as the test set and the remaining part as the training set;

S706, training the initialization wake-up model with the training set and the test set;

S708, obtaining a prediction result by voting, and calculating the accuracy of the wake-up model with the verification set;

and S710, selecting the optimal wake-up model according to the calculated accuracy.

In step S702, the training samples may be divided according to a ratio of, for example, 1:5 or 1:6, which is not limited here.

In step S704, the cross-validation method may include, but is not limited to, ten-fold cross-validation, in which one tenth of the training-test set is used as the test set and the remaining nine tenths as the training set. The initialized wake-up model is then trained with the training set and test set obtained by cross-validation to obtain a training result, the training result is predicted by voting, and finally the accuracy of the wake-up model is calculated with the verification set.
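Ten-fold cross-validation splitting can be sketched as below; requiring the sample count to be divisible by ten is an assumption made for simplicity.

```python
def ten_fold_splits(samples):
    """Yield (training set, test set) pairs: each fold uses one tenth of the
    training-test set as the test set and the remaining nine tenths as the
    training set (assumes len(samples) is divisible by 10)."""
    fold = len(samples) // 10
    for i in range(10):
        test = samples[i * fold:(i + 1) * fold]
        train = samples[:i * fold] + samples[(i + 1) * fold:]
        yield train, test

splits = list(ten_fold_splits(list(range(100))))
```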

It can be understood that the key to training a data set with a neural network model lies in continuously iterating and updating the weight parameters. Calculating the accuracy of the wake-up model with the verification set here means judging whether the weight change between two iterations (the error between the two sets of weights) is smaller than a set threshold; training stops when the weight change between two iterations is smaller than the set threshold, yielding a wake-up model with a satisfactory classification effect.
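The weight-change stopping criterion can be written as a small check. The maximum absolute difference used here is one plausible error measure; the embodiment does not fix which norm is meant.

```python
def has_converged(prev_weights, curr_weights, threshold=1e-4):
    """Stop training when the weight change between two successive
    iterations is smaller than the set threshold."""
    change = max(abs(p - c) for p, c in zip(prev_weights, curr_weights))
    return change < threshold
```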

As an optional implementation, the specific process of training the at least two initialization wake-up models to obtain the at least two wake-up models includes:

traversing at least two initialization wakeup models to perform the following operations until a convergence condition is reached:

determining a current initialization wake-up model to be trained;

under the condition that the current initialization wake-up model is not the first initialization wake-up model, acquiring a reference training result obtained after the last initialization wake-up model before the current initialization wake-up model is trained;

training the current initialization awakening model by using a reference training result and a plurality of sample audio data to obtain a current training result;

under the condition that the current initialization wakeup model is the first initialization wakeup model, training the current initialization wakeup model by utilizing a plurality of sample audio data to obtain a current training result;

and under the condition that the current training result does not reach the convergence condition, taking the next initialization awakening model after the current initialization awakening model as the current initialization awakening model.

As shown in fig. 8, in this embodiment, the current initialization wake-up model is determined in step S802; the audio data to be recognized is then input into the different initialization wake-up models in sequence, and the output of each initialization wake-up model is used in the iterative calculation of the next one, until the training result reaches the convergence condition, and the last initialization wake-up model is taken as the wake-up model.

Before the iterative calculation starts, it is first judged in step S804 whether the current initialization wake-up model to be trained is the first one. The purpose is to ensure that the training result of the previous initialization wake-up model can always be used as an input, so that iterative calculation on the current initialization wake-up model yields a training result with a gradually improving classification effect.

In steps S806 to S812, when the current initialized wake-up model is not the first, the training result of the previous initialized wake-up model is used as the reference training result, and the reference training result and the plurality of sample audio data are input together into the current initialized wake-up model for training, obtaining the current training result. The current training result is then judged: training stops when it reaches the convergence condition; if it does not, step S806 is executed to continue the iterative calculation.

It should be noted that the above convergence condition may include, but is not limited to, stopping training when the weight change between two iterations is less than a preset threshold.

Through the above training process of the plurality of serial models, the audio data to be recognized is input into the different initialization wake-up models in sequence until the training result reaches the convergence condition, thereby obtaining wake-up models that satisfy the condition. That is to say, the different training processes of the plurality of models can effectively avoid missed wake-ups in the audio data and improve the wake-up rate of the terminal device. At the same time, iterative calculation over the training results of the plurality of wake-up models improves the recognition rate of the audio data, further reduces the false wake-up rate of the terminal device, and achieves the technical effect of improving its wake-up performance.
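The traversal of steps S802-S812 can be sketched as a loop. The `train_step` and `converged` callables are caller-supplied placeholders for the unspecified training and convergence logic; the toy step at the end is purely illustrative.

```python
def train_serial_models(models, samples, train_step, converged):
    """Traverse the initialization wake-up models in order: the first model
    trains on the sample audio data alone; each later model also receives
    the previous model's training result as a reference."""
    reference = None
    result = None
    for model in models:
        if reference is None:
            result = train_step(model, samples)             # first model
        else:
            result = train_step(model, samples, reference)  # later models
        if converged(result):
            break                    # convergence condition reached
        reference = result           # pass forward to the next model
    return result

# Toy illustration: each round adds the sample count to the reference result.
def toy_step(model, samples, reference=0):
    return reference + len(samples)

final = train_serial_models(["A", "B", "C"], [1, 2], toy_step,
                            lambda r: r >= 4)
```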

As an optional embodiment, the specific process of training the at least two initialization wake-up models to obtain the at least two wake-up models further includes:

under the condition that the at least two initialized wake-up models comprise two initialized wake-up models, inputting part of audio data in the plurality of sample audio data into a first initialized wake-up model as a training set for training, and inputting the rest of audio data in the plurality of sample audio data into the first initialized wake-up model as a test set for prediction to obtain a prediction result;

splicing the prediction result of the first initialization awakening model and the plurality of sample audio data to obtain spliced data; inputting the splicing data into a second initialization awakening model for training until a convergence condition is reached, wherein the at least two awakening models are obtained when the convergence condition is reached.

Specifically, as shown in fig. 9, a stacking method is adopted, using one or more models with a simple structure that learn the data from different dimensions. Assuming model B is used, the training data is divided into N parts for N rounds of training in a cross-validation manner, in which N-1 parts of the training data are given to B for training as the training set, and 1 part is given to B for prediction as the test set, as shown in (a) of fig. 9.

As shown in (b) of fig. 9, the N prediction results of model B are put together with the original training data and fed to model A for training. During decoding, the data is first predicted by B, and then the prediction result together with the original training data is used as the input of A to obtain training result A.

It should be noted that the training process can be stopped only when both training result A and training result B obtained through the above process reach the convergence condition, yielding wake-up model A and wake-up model B that satisfy the condition. The convergence condition is the same as that in the above embodiment and is not repeated here.

By adopting a cross-validation training mode across the two models and inputting the partial prediction results of one model into the other, two wake-up models satisfying the convergence condition are obtained; the joint decision of multiple models improves the recognition rate of the audio data and thereby improves the wake-up performance of the terminal device.
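The two-model stacking of fig. 9 can be sketched as follows: model B produces out-of-fold predictions via cross-validation, and those predictions are spliced with the original samples as model A's training input. The `fit_b`/`predict_b` callables and the mean-label toy model are placeholders for the unspecified model B.

```python
def stacked_training_data(samples, labels, n_folds, fit_b, predict_b):
    """Split the training data into n_folds parts; in each round, train model
    B on n-1 parts and predict the held-out part, so every sample receives an
    out-of-fold prediction. Return the [sample, B-prediction] pairs that are
    then fed to model A for training (assumes len(samples) % n_folds == 0)."""
    fold = len(samples) // n_folds
    preds = [None] * len(samples)
    for i in range(n_folds):
        lo, hi = i * fold, (i + 1) * fold
        model_b = fit_b(samples[:lo] + samples[hi:],
                        labels[:lo] + labels[hi:])
        for j in range(lo, hi):
            preds[j] = predict_b(model_b, samples[j])
    return [(x, p) for x, p in zip(samples, preds)]

# Toy model B: "training" stores the mean label, prediction returns it.
stacked = stacked_training_data(
    list(range(10)), [0] * 5 + [1] * 5, 5,
    fit_b=lambda X, y: sum(y) / len(y),
    predict_b=lambda b, x: b)
```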

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

In this embodiment, a wake-up apparatus of a terminal device is further provided, which is used to implement the foregoing embodiments and preferred implementations; what has already been described will not be repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatuses described in the following embodiments are preferably implemented in software, implementations in hardware or a combination of software and hardware are also possible and contemplated.

Fig. 10 is a block diagram of a wake-up apparatus of a terminal device according to an embodiment of the present invention, where the wake-up apparatus includes:

a first obtaining unit 1002, configured to obtain audio data to be identified;

a wake-up unit 1004, configured to perform wake-up identification on the basis of audio features in different dimensions, which are extracted from audio data respectively, in each wake-up model of at least two wake-up models configured in the terminal device, to obtain an audio identification result corresponding to the wake-up model, where each wake-up model is used to extract audio features in one dimension;

an adjusting unit 1006, configured to adjust the terminal device to be in an awake state when the audio recognition result reaches the awake condition.

Optionally, the apparatus in the above embodiment further includes, after the wake-up unit 1004:

the determining unit is configured to determine that the audio identification result reaches the wake-up condition when the number of the audio identification results indicating that the audio data carries the wake-up information is greater than a first threshold.

Optionally, the wake-up unit 1004 in the above embodiment further includes the following modules, which take each of the at least two wake-up models in turn as the current wake-up model and perform the corresponding operations:

the extraction module is used for extracting the audio features of the audio data in the current dimension from the current awakening model;

the first awakening module is used for awakening and identifying the audio features under the current dimensionality;

the first determining module is used for determining that the audio identification result carries awakening information under the condition that the awakening keyword is identified from the audio characteristics under the current dimensionality.

Optionally, the apparatus in the above embodiment further includes, after the wake-up unit 1004:

and the input module is used for sequentially inputting the audio data into each of the at least two awakening models to obtain an audio identification result, wherein in two adjacent awakening models in the at least two awakening models, the output result of the first awakening model and the audio data are simultaneously input into the second awakening model, and the first awakening model is positioned in front of the second awakening model.

Optionally, the apparatus in the above embodiment further includes, after the wake-up unit 1004:

and the first determining submodule is used for determining that the audio recognition result reaches the awakening condition under the condition that the output result of the last awakening model indicates that the audio data carries the awakening keyword.

Optionally, the apparatus further includes, before the first obtaining unit:

a second acquisition unit configured to acquire a plurality of sample audio data;

the first training unit is used for training the at least two initialization wake-up models by using the plurality of sample audio data to obtain the at least two wake-up models.

Optionally, the first training unit includes:

the traversing module is used for traversing at least two initialization awakening models to execute the following operations until a convergence condition is reached:

the second determination module is used for determining the current initialization wake-up model to be trained;

the first training module is used for acquiring a reference training result obtained after the last initialized wake-up model before the current initialized wake-up model is trained under the condition that the current initialized wake-up model is not the first initialized wake-up model; training the current initialization awakening model by using a reference training result and a plurality of sample audio data to obtain a current training result;

the second training module is used for training the current initialization awakening model by utilizing a plurality of sample audio data under the condition that the current initialization awakening model is the first initialization awakening model to obtain a current training result;

and the third determining module is used for determining the next initialization awakening model after the current initialization awakening model as the current initialization awakening model under the condition that the current training result does not reach the convergence condition.

Optionally, the first training unit further includes:

the prediction module is used for, when the at least two initialized wake-up models comprise two initialized wake-up models, inputting part of the plurality of sample audio data as a training set into the first initialized wake-up model for training, and inputting the remaining sample audio data as a test set into the first initialized wake-up model for prediction, so as to obtain a prediction result;

the splicing module is used for splicing the prediction result of the first initialized wake-up model with the plurality of sample audio data to obtain spliced data;

and the third training module is used for inputting the spliced data into the second initialized wake-up model for training until a convergence condition is reached, at which point the at least two wake-up models are obtained.
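The prediction, splicing, and third training modules together describe a two-model stacking scheme, which might be sketched as below. The `TinyModel` classifier, its `fit`/`predict_proba` interface, and the choice to splice the predictions onto the held-out samples as an extra feature column are all assumptions for illustration; the claim text does not fix the exact form of the splicing.

```python
import numpy as np


class TinyModel:
    """Toy classifier stand-in (illustrative): fit stores the mean label,
    predict_proba returns it as a single score column."""

    def fit(self, X, y):
        self.mean = float(np.mean(y))
        return self

    def predict_proba(self, X):
        return np.full((len(X), 1), self.mean)


def stack_train(model_a, model_b, samples, labels, split: float = 0.8):
    """Train model_a on the training part, predict on the held-out test part,
    splice the predictions onto the samples, and train model_b on the result."""
    n = int(len(samples) * split)
    # First model: train on the training-set portion of the sample audio data.
    model_a.fit(samples[:n], labels[:n])
    # Predict on the remaining portion used as the test set.
    preds = model_a.predict_proba(samples[n:])
    # Splice the prediction result onto the held-out samples as an extra feature.
    stacked = np.column_stack([samples[n:], preds])
    # Second model: train on the spliced data.
    model_b.fit(stacked, labels[n:])
    return model_b
```

With real wake-up models the second model would keep training on such spliced data until the convergence condition is reached; the split ratio and feature layout here are placeholders.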

An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is configured to, when executed, perform the steps in any of the above method embodiments.

Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:

S1, acquiring audio data to be recognized;

S2, in each of at least two wake-up models configured in the terminal device, performing wake-up recognition based on the audio features of different dimensions respectively extracted from the audio data, to obtain an audio recognition result corresponding to each wake-up model, where each wake-up model is used for extracting audio features of one dimension;

S3, adjusting the terminal device to the awake state when the audio recognition result reaches the wake-up condition.
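Steps S1-S3 can be sketched as a small decision function. The `ToyWakeModel` class, the per-dimension feature extractors, the score threshold, and the "every model must agree" wake-up condition are illustrative assumptions; this passage of the patent does not fix a particular wake-up condition.

```python
from typing import Callable, List, Sequence


class ToyWakeModel:
    """Stand-in for a wake-up model tied to one feature dimension."""

    def __init__(self, extract: Callable[[Sequence[float]], float]):
        self.extract = extract

    def score(self, audio: Sequence[float]) -> float:
        # "Recognition" here is simply the extracted feature value itself.
        return self.extract(audio)


def wake_decision(audio: Sequence[float],
                  models: List[ToyWakeModel],
                  threshold: float = 0.5) -> bool:
    # S2: each configured model recognizes on its own feature dimension.
    scores = [m.score(audio) for m in models]
    # S3: example wake-up condition -- every model's score clears the threshold.
    return all(s >= threshold for s in scores)
```

Because each model votes on a different feature dimension, a false trigger that fools one dimension (say, raw energy) can still be rejected by another, which is the performance benefit the method claims.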

In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing a computer program, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, in this embodiment, the processor may be configured to perform, through a computer program, the following steps:

S1, acquiring audio data to be recognized;

S2, in each of at least two wake-up models configured in the terminal device, performing wake-up recognition based on the audio features of different dimensions respectively extracted from the audio data, to obtain an audio recognition result corresponding to each wake-up model, where each wake-up model is used for extracting audio features of one dimension;

S3, adjusting the terminal device to the awake state when the audio recognition result reaches the wake-up condition.

In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device, and they may be centralized on a single computing device or distributed across a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from the order described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.
