Acoustic feature generation, voice model training and voice recognition method and device

Document No.: 193310    Publication date: 2021-11-02

Reading note: This technology, "A method and apparatus for generating acoustic features, training a speech model, and recognizing speech" (Acoustic feature generation, voice model training and voice recognition method and device), was created by Dong Linhao and Ma Zejun on 2021-08-02. Its main content is as follows: the embodiments of this application disclose a method and apparatus for generating acoustic features, training a speech model, and recognizing speech. The acoustic information vector of the current speech frame and the information weight of the current speech frame are acquired, and the accumulated information weight corresponding to the current speech frame is obtained from the accumulated information weight corresponding to the previous speech frame, the retention rate corresponding to the current speech frame, and the information weight of the current speech frame. The retention rate is the difference between 1 and the leakage rate. Using the leakage rate to adjust the accumulated information weight corresponding to the current speech frame and the integrated acoustic information vector corresponding to the current speech frame reduces the influence of speech frames with smaller information weights on the integrated acoustic information vector and increases the proportion of the acoustic information vectors of speech frames with larger information weights in the integrated acoustic information vector, so the resulting integrated acoustic information vector is more accurate and the accuracy of the speech model is improved.

1. A method of generating an acoustic feature, the method comprising:

acquiring an acoustic information vector of a current voice frame and an information weight of the current voice frame;

obtaining the accumulated information weight corresponding to the current voice frame according to the accumulated information weight corresponding to the previous voice frame, the retention rate corresponding to the current voice frame and the information weight of the current voice frame; the retention rate is the difference between 1 and the leakage rate;

if the accumulated information weight corresponding to the current voice frame is smaller than a threshold value, obtaining an integrated acoustic information vector corresponding to the current voice frame according to the integrated acoustic information vector corresponding to the previous voice frame, the retention rate corresponding to the current voice frame, the information weight of the current voice frame and the acoustic information vector of the current voice frame;

if the accumulated information weight corresponding to the current voice frame is greater than or equal to a threshold value, outputting a delivered integrated acoustic information vector by using the integrated acoustic information vector corresponding to the previous voice frame and the acoustic information vector of the current voice frame, and calculating the integrated acoustic information vector corresponding to the current voice frame;

and after obtaining the integrated acoustic information vector corresponding to the current voice frame, taking the next voice frame as the current voice frame, and repeating the step of acquiring the acoustic information vector of the current voice frame and the information weight of the current voice frame and the subsequent steps until no next voice frame exists.

2. The method of claim 1, wherein obtaining the accumulated information weight corresponding to the current speech frame according to the accumulated information weight corresponding to the previous speech frame, the retention rate corresponding to the current speech frame, and the information weight of the current speech frame comprises:

and multiplying the accumulated information weight corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and adding the information weight of the current voice frame, to obtain the accumulated information weight corresponding to the current voice frame.

3. The method of claim 1, wherein if the accumulated information weight corresponding to the current speech frame is smaller than a threshold, obtaining an integrated acoustic information vector corresponding to the current speech frame according to an integrated acoustic information vector corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, an information weight of the current speech frame, and an acoustic information vector of the current speech frame, comprises:

and if the accumulated information weight corresponding to the current voice frame is smaller than a threshold value, multiplying the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and adding the product of the information weight of the current voice frame and the acoustic information vector of the current voice frame to obtain the integrated acoustic information vector corresponding to the current voice frame.

4. The method of claim 1, wherein the outputting the delivered integrated acoustic information vector by using the integrated acoustic information vector corresponding to the previous speech frame and the acoustic information vector of the current speech frame if the accumulated information weight corresponding to the current speech frame is greater than or equal to a threshold value comprises:

if the accumulated information weight corresponding to the current voice frame is greater than or equal to a threshold value, multiplying the accumulated information weight corresponding to the previous voice frame by the retention rate corresponding to the current voice frame to obtain a first value, and calculating the difference between 1 and the first value to obtain a first-part information weight of the current voice frame;

and multiplying the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and adding the product of the first-part information weight of the current voice frame and the acoustic information vector of the current voice frame, to obtain the delivered integrated acoustic information vector.

5. The method of claim 4, wherein the calculating the integrated acoustic information vector corresponding to the current speech frame comprises:

calculating the difference between the information weight of the current voice frame and the first-part information weight of the current voice frame to obtain a second-part information weight of the current voice frame, and taking the second-part information weight of the current voice frame as the accumulated information weight of the current voice frame;

and multiplying the second-part information weight of the current voice frame by the acoustic information vector of the current voice frame to obtain the integrated acoustic information vector corresponding to the current voice frame.

6. The method of claim 1, further comprising:

and inputting the acoustic information vector of the current voice frame and the integrated acoustic information vector corresponding to the previous voice frame into a prediction model to obtain the leakage rate of the current voice frame.

7. The method of any of claims 1-6, wherein the leakage rate of the current speech frame is set to 0 once every N speech frames, where N is a positive integer.

8. A method for training a speech model, the method comprising:

inputting training voice data into an encoder to obtain an acoustic information vector of each voice frame;

inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method of generating acoustic features of any one of claims 1 to 7;

inputting the delivered integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data;

and training the encoder, the CIF module and the decoder according to the word prediction result and the word label corresponding to the training voice data.

9. A method of speech recognition, the method comprising:

inputting voice data to be recognized into an encoder to obtain an acoustic information vector of each voice frame;

inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method of generating acoustic features of any one of claims 1 to 7;

and inputting the delivered integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized.

10. An apparatus for generating acoustic features, the apparatus comprising:

a first acquisition unit, configured to acquire an acoustic information vector of a current voice frame and an information weight of the current voice frame;

a first calculating unit, configured to obtain an accumulated information weight corresponding to the current speech frame according to an accumulated information weight corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, and the information weight of the current speech frame; the retention rate is the difference between 1 and the leakage rate;

a second calculating unit, configured to, if the accumulated information weight corresponding to the current speech frame is smaller than a threshold, obtain an integrated acoustic information vector corresponding to the current speech frame according to an integrated acoustic information vector corresponding to the previous speech frame, the retention rate corresponding to the current speech frame, the information weight of the current speech frame, and the acoustic information vector of the current speech frame;

a third calculating unit, configured to output a delivered integrated acoustic information vector by using an integrated acoustic information vector corresponding to a previous speech frame and an acoustic information vector of the current speech frame if the accumulated information weight corresponding to the current speech frame is greater than or equal to a threshold, and calculate to obtain the integrated acoustic information vector corresponding to the current speech frame;

and an execution unit, configured to take the next voice frame as the current voice frame after the integrated acoustic information vector corresponding to the current voice frame is obtained, and to repeat the step of acquiring the acoustic information vector of the current voice frame and the information weight of the current voice frame and the subsequent steps until no next voice frame exists.

11. An apparatus for training a speech model, the apparatus comprising:

the first training unit is used for inputting training voice data into the encoder to obtain acoustic information vectors of all voice frames;

the second training unit is used for inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method of generating acoustic features of any one of claims 1 to 7;

the third training unit is used for inputting the delivered integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data;

and the fourth training unit is used for training the encoder, the CIF module and the decoder according to the word prediction result and the word label corresponding to the training voice data.

12. A speech recognition apparatus, characterized in that the apparatus comprises:

the first input unit is used for inputting the voice data to be recognized into the encoder to obtain the acoustic information vector of each voice frame;

the second input unit is used for inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method of generating acoustic features of any one of claims 1 to 7;

and the third input unit is used for inputting the delivered integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized.

13. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7, the method of claim 8, or the method of claim 9.

14. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7, the method of claim 8, or the method of claim 9.

Technical Field

The application relates to the field of data processing, in particular to a method and a device for generating acoustic features, training a voice model and recognizing voice.

Background

The speech recognition technology is a technology of recognizing speech data and converting contents corresponding to the speech data into computer-readable inputs. For example, by using a speech recognition technology, the content contained in the speech data can be converted into a corresponding text, which facilitates the subsequent processing of the content contained in the speech data.

Currently, speech recognition of speech data can be achieved using speech models. The speech model extracts the acoustic features of the speech data and processes these features to obtain a text recognition result corresponding to the speech data. However, the recognition results obtained in this way are often not accurate enough to meet the requirements of speech recognition.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and an apparatus for generating acoustic features, training a speech model, and recognizing speech, which can generate more accurate acoustic features, thereby improving the recognition accuracy of the speech model.

Based on this, the technical solutions provided by the embodiments of the present application are as follows:

in a first aspect, an embodiment of the present application provides a method for generating an acoustic feature, where the method includes:

acquiring an acoustic information vector of a current voice frame and an information weight of the current voice frame;

obtaining the accumulated information weight corresponding to the current voice frame according to the accumulated information weight corresponding to the previous voice frame, the retention rate corresponding to the current voice frame and the information weight of the current voice frame; the retention rate is the difference between 1 and the leakage rate;

if the accumulated information weight corresponding to the current voice frame is smaller than a threshold value, obtaining an integrated acoustic information vector corresponding to the current voice frame according to the integrated acoustic information vector corresponding to the previous voice frame, the retention rate corresponding to the current voice frame, the information weight of the current voice frame and the acoustic information vector of the current voice frame;

if the accumulated information weight corresponding to the current voice frame is greater than or equal to a threshold value, outputting a delivered integrated acoustic information vector by using the integrated acoustic information vector corresponding to the previous voice frame and the acoustic information vector of the current voice frame, and calculating the integrated acoustic information vector corresponding to the current voice frame;

and after obtaining the integrated acoustic information vector corresponding to the current voice frame, taking the next voice frame as the current voice frame, and repeating the step of acquiring the acoustic information vector of the current voice frame and the information weight of the current voice frame and the subsequent steps until no next voice frame exists.

In a second aspect, an embodiment of the present application provides a speech model training method, where the method includes:

inputting training voice data into an encoder to obtain an acoustic information vector of each voice frame;

inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

inputting the delivered integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data;

and training the encoder, the CIF module and the decoder according to the word prediction result and the word label corresponding to the training voice data.

In a third aspect, an embodiment of the present application provides a speech recognition method, where the method includes:

inputting voice data to be recognized into an encoder to obtain an acoustic information vector of each voice frame;

inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

and inputting the delivered integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized.

In a fourth aspect, an embodiment of the present application provides an apparatus for generating an acoustic feature, the apparatus including:

a first acquisition unit, configured to acquire an acoustic information vector of a current voice frame and an information weight of the current voice frame;

a first calculating unit, configured to obtain an accumulated information weight corresponding to the current speech frame according to an accumulated information weight corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, and the information weight of the current speech frame; the retention rate is the difference between 1 and the leakage rate;

a second calculating unit, configured to, if the accumulated information weight corresponding to the current speech frame is smaller than a threshold, obtain an integrated acoustic information vector corresponding to the current speech frame according to an integrated acoustic information vector corresponding to the previous speech frame, the retention rate corresponding to the current speech frame, the information weight of the current speech frame, and the acoustic information vector of the current speech frame;

a third calculating unit, configured to output a delivered integrated acoustic information vector by using an integrated acoustic information vector corresponding to a previous speech frame and an acoustic information vector of the current speech frame if the accumulated information weight corresponding to the current speech frame is greater than or equal to a threshold, and calculate to obtain the integrated acoustic information vector corresponding to the current speech frame;

and an execution unit, configured to take the next voice frame as the current voice frame after the integrated acoustic information vector corresponding to the current voice frame is obtained, and to repeat the step of acquiring the acoustic information vector of the current voice frame and the information weight of the current voice frame and the subsequent steps until no next voice frame exists.

In a fifth aspect, an embodiment of the present application provides a speech model training apparatus, where the apparatus includes:

the first training unit is used for inputting training voice data into the encoder to obtain acoustic information vectors of all voice frames;

the second training unit is used for inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

the third training unit is used for inputting the delivered integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data;

and the fourth training unit is used for training the encoder, the CIF module and the decoder according to the word prediction result and the word label corresponding to the training voice data.

In a sixth aspect, an embodiment of the present application provides a speech recognition apparatus, including:

the first input unit is used for inputting the voice data to be recognized into the encoder to obtain the acoustic information vector of each voice frame;

the second input unit is used for inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain a delivered integrated acoustic information vector; the CIF module outputs the delivered integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

and the third input unit is used for inputting the delivered integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized.

In a seventh aspect, an embodiment of the present application provides an electronic device, including:

one or more processors;

a storage device having one or more programs stored thereon,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of generating acoustic features, the speech model training method, or the speech recognition method described in any of the embodiments above.

In an eighth aspect, the present application provides a computer-readable medium, on which a computer program is stored, where the program, when executed by a processor, implements a method for generating acoustic features, a method for training a speech model, or a method for speech recognition according to any of the above embodiments.

Therefore, the embodiment of the application has the following beneficial effects:

the embodiments of the present application provide a method and an apparatus for generating acoustic features, training a speech model, and recognizing speech. The acoustic information vector of the current speech frame and the information weight of the current speech frame are acquired; the accumulated information weight corresponding to the previous speech frame is multiplied by the retention rate corresponding to the current speech frame and then added to the information weight of the current speech frame, yielding the accumulated information weight corresponding to the current speech frame. The retention rate is the difference between 1 and the leakage rate. Adjusting the accumulated information weight corresponding to the current speech frame and the integrated acoustic information vector corresponding to the current speech frame by the leakage rate reduces the influence of speech frames with smaller information weights on the integrated acoustic information vector and increases the proportion of the acoustic information vectors of speech frames with larger information weights in the integrated acoustic information vector, so the resulting integrated acoustic information vector is more accurate. The speech model can therefore extract more accurate acoustic features, which improves the accuracy of the speech model.

Drawings

Fig. 1 is a schematic diagram of generating an integrated acoustic information vector by a CIF according to an embodiment of the present application;

FIG. 2 is a block diagram of an exemplary application scenario provided by an embodiment of the present application;

FIG. 3 is a flow chart of a method of generating acoustic signatures provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a method for generating acoustic features according to an embodiment of the present application;

FIG. 5 is a flowchart of a speech model training method according to an embodiment of the present application;

fig. 6 is a flowchart of a speech recognition method according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of an apparatus for generating an acoustic feature according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of a speech model training apparatus according to an embodiment of the present application;

fig. 9 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;

fig. 10 is a schematic diagram of a basic structure of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanying the drawings are described in detail below.

In order to facilitate understanding of the technical solutions provided in the present application, the following description will be made on the background related to the present application.

The CIF (Continuous Integrate-and-Fire) method is applied within an encoder-decoder framework. Referring to fig. 1, the figure is a schematic diagram of generating an integrated acoustic information vector with CIF according to an embodiment of the present application. First, in coding order, the acoustic information vectors $H = \{h_1, h_2, \dots, h_N\}$ of the speech frames output by the encoder and the corresponding information weights $\alpha = \{\alpha_1, \alpha_2, \dots, \alpha_N\}$ are received in sequence, and the information weights of the speech frames are accumulated, where $N$ is the total number of speech frames in the speech data. An acoustic boundary is located once the accumulated information weight reaches a threshold. The acoustic information vectors of the speech frames are integrated by weighted summation to obtain the integrated acoustic information vectors $C = \{c_1, c_2, \dots, c_M\}$, where $M$ is the total number of integrated acoustic information vectors. A study of the conventional CIF method found that speech frames with small information weights degrade the accuracy of the resulting integrated acoustic information vectors.
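For reference, the conventional CIF integration just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of this application; the function name, the NumPy representation, and the threshold value of 1.0 are assumptions made for the example.

```python
import numpy as np

def vanilla_cif(h, alpha, threshold=1.0):
    """Minimal sketch of conventional CIF (no leakage).

    h:     (N, D) array, acoustic information vectors of N speech frames
    alpha: (N,) array, information weight of each speech frame
    Returns the list of integrated acoustic information vectors C.
    """
    fired = []                    # delivered vectors C = {c_1, ..., c_M}
    acc_w = 0.0                   # accumulated information weight
    acc_h = np.zeros(h.shape[1])  # running weighted sum of frame vectors
    for u in range(len(h)):
        if acc_w + alpha[u] < threshold:
            # no acoustic boundary yet: keep accumulating
            acc_w += alpha[u]
            acc_h += alpha[u] * h[u]
        else:
            # boundary located: spend only the weight needed to reach the threshold
            w1 = threshold - acc_w
            fired.append(acc_h + w1 * h[u])
            acc_w = alpha[u] - w1          # leftover weight starts the next vector
            acc_h = acc_w * h[u]
    return fired
```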

Based on this, embodiments of the present application provide a method and an apparatus for generating acoustic features, training a speech model, and recognizing speech. The acoustic information vector of the current speech frame and the information weight of the current speech frame are acquired; the accumulated information weight corresponding to the previous speech frame is multiplied by the retention rate corresponding to the current speech frame and then added to the information weight of the current speech frame, yielding the accumulated information weight corresponding to the current speech frame. The retention rate is the difference between 1 and the leakage rate. Adjusting the accumulated information weight corresponding to the current speech frame and the integrated acoustic information vector corresponding to the current speech frame by the leakage rate reduces the influence of speech frames with smaller information weights on the integrated acoustic information vector and increases the proportion of the acoustic information vectors of speech frames with larger information weights, so the resulting integrated acoustic information vector is more accurate. The speech model can thus extract more accurate acoustic features, and the accuracy of the speech model is improved.

In order to facilitate understanding of the technical solutions provided by the embodiments of the present application, the following description is made with reference to the example scenario shown in fig. 2. Referring to fig. 2, the figure is a schematic framework diagram of an exemplary application scenario provided by an embodiment of the present application.

In practical application, the voice data to be recognized is input into the encoder 201 to obtain the acoustic information vector of each voice frame. The acoustic information vector of each voice frame and the information weight of each voice frame are then input into the CIF module 202 to obtain an integrated acoustic information vector output by the CIF module 202. Finally, the integrated acoustic information vector is input into the decoder 203 to obtain a word recognition result of the voice data to be recognized.

Those skilled in the art will appreciate that the framework diagram shown in fig. 2 is only one example in which embodiments of the present application may be implemented. The scope of applicability of the embodiments of the present application is not limited in any way by this framework.

Based on the above description, the method for generating acoustic features provided in the present application will be described in detail below with reference to the accompanying drawings.

Referring to fig. 3, which is a flowchart illustrating a method for generating an acoustic feature according to an embodiment of the present application, the method includes steps S301 to S305:

s301: and acquiring the acoustic information vector of the current voice frame and the information weight of the current voice frame.

When the voice data is processed by using a model of a coding and decoding frame adopting a CIF method, an encoder performs feature extraction on the input voice data by voice frames to obtain acoustic information vectors of the voice frames. The acoustic information vector of a speech frame is a high-dimensional representation of the speech data. Each speech frame has a corresponding information weight. The information weight is used to measure the amount of information included in the speech frame.

When processing input voice data, the encoder generates an acoustic information vector for each voice frame in the voice data. The acoustic information vectors of the generated voice frames and their information weights are then acquired in turn for processing.

The voice frame currently being processed is taken as the current voice frame, and the acoustic information vector of the current voice frame and the information weight of the current voice frame are acquired.

For example, take the u-th speech frame in the speech data as the current speech frame, where u is a positive integer not greater than N and N is the total number of speech frames in the speech data. The acoustic information vector of the current speech frame is denoted $h_u$, and the information weight of the current speech frame is denoted $\alpha_u$.

S302: obtain the accumulated information weight corresponding to the current voice frame according to the accumulated information weight corresponding to the previous voice frame, the retention rate corresponding to the current voice frame, and the information weight of the current voice frame; the retention rate is the difference between 1 and the leakage rate.

By using the retention rate corresponding to the current voice frame, the accumulated information weight corresponding to the previous voice frame, and the information weight of the current voice frame, the accumulated information weight corresponding to the current voice frame, adjusted by the retention rate, can be obtained. The retention rate is the difference between 1 and the leakage rate. The leakage rate indicates the rate at which the accumulated information weight leaks, and its value lies in the range [0, 1]; the retention rate correspondingly indicates the rate at which the information weight is retained.

In a possible implementation manner, an embodiment of the present application provides a specific implementation of obtaining the accumulated information weight corresponding to the current speech frame from the accumulated information weight corresponding to the previous speech frame, the retention rate corresponding to the current speech frame, and the information weight of the current speech frame; see below for details.

S303: if the accumulated information weight corresponding to the current voice frame is smaller than a threshold value, obtain an integrated acoustic information vector corresponding to the current voice frame according to the integrated acoustic information vector corresponding to the previous voice frame, the retention rate corresponding to the current voice frame, the information weight of the current voice frame, and the acoustic information vector of the current voice frame.

The accumulated information weight corresponding to the current speech frame is compared with a threshold. If the accumulated information weight corresponding to the current speech frame is smaller than the threshold, the information weight corresponding to the next speech frame needs to be accumulated further. The threshold may be set as needed to determine the acoustic boundary and may be, for example, 1.

When the accumulated information weight corresponding to the current voice frame is still small, the integrated acoustic information vector corresponding to the current voice frame is obtained based on the integrated acoustic information vector corresponding to the previous voice frame, the retention rate corresponding to the current voice frame, the information weight of the current voice frame, and the acoustic information vector of the current voice frame. The resulting integrated acoustic information vector corresponding to the current voice frame has been adjusted by the retention rate corresponding to the current voice frame.

In a possible implementation manner, an embodiment of the present application provides a specific implementation manner for obtaining an integrated acoustic information vector corresponding to a current speech frame according to an integrated acoustic information vector corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, an information amount weight of the current speech frame, and an acoustic information vector of the current speech frame if an accumulated information amount weight corresponding to the current speech frame is smaller than a threshold, and please refer to the following text specifically.

S304: if the accumulated information weight corresponding to the current voice frame is greater than or equal to a threshold value, output a delivered integrated acoustic information vector by using the integrated acoustic information vector corresponding to the previous voice frame and the acoustic information vector of the current voice frame, and calculate the integrated acoustic information vector corresponding to the current voice frame.

If the accumulated information weight corresponding to the current voice frame is greater than or equal to the threshold, the accumulated acoustic information vectors of the voice frames can be integrated, and a delivered integrated acoustic information vector is output.

The delivered integrated acoustic information vector is obtained from the integrated acoustic information vector corresponding to the previous voice frame and the acoustic information vector of the current voice frame.

In a possible implementation manner, an embodiment of the present application provides a specific implementation manner that, if the weight of the accumulated information amount corresponding to the current speech frame is greater than or equal to a threshold, the integrated acoustic information vector corresponding to the previous speech frame and the acoustic information vector of the current speech frame are used to output the delivered integrated acoustic information vector, and the integrated acoustic information vector corresponding to the current speech frame is obtained through calculation, which is specifically referred to below.

S305: after obtaining the integrated acoustic information vector corresponding to the current voice frame, take the next voice frame as the current voice frame, and repeat the step of acquiring the acoustic information vector of the current voice frame and the information weight of the current voice frame and the subsequent steps until no next voice frame exists.

Then, the next speech frame is taken as the current speech frame, and step S301 and the subsequent steps, i.e., acquiring the acoustic information vector of the current speech frame and the information weight of the current speech frame and so on, are repeated until no next speech frame exists, that is, until all speech frames of the speech data have been processed.

Based on the contents of S301 to S305, adjusting the accumulated information weight corresponding to the current voice frame and the integrated acoustic information vector corresponding to the current voice frame by the leakage rate reduces the influence of voice frames with smaller information weights on the integrated acoustic information vector, increases the proportion of the acoustic information vectors of voice frames with larger information weights in the integrated acoustic information vector, and makes the resulting integrated acoustic information vector more accurate.

In a specific implementation, the accumulated information weight corresponding to the previous voice frame is multiplied by the retention rate corresponding to the current voice frame to obtain the retained accumulated information weight, and the retained accumulated information weight is then added to the information weight of the current voice frame to obtain the accumulated information weight corresponding to the current voice frame.

Taking the current speech frame as an example, the accumulated information weight corresponding to the current speech frame can be expressed by the following formula:

$$\hat{\alpha}_u = (1 - R)\,\hat{\alpha}_{u-1} + \alpha_u$$

where $\hat{\alpha}_u$ represents the accumulated information weight corresponding to the current speech frame, $R$ is the leakage rate (so $1 - R$ is the retention rate), $\hat{\alpha}_{u-1}$ represents the accumulated information weight corresponding to the previous speech frame, and $\alpha_u$ represents the information weight of the current speech frame.

In a possible implementation manner, an embodiment of the present application provides a specific implementation of obtaining the integrated acoustic information vector corresponding to the current speech frame, when the accumulated information weight corresponding to the current speech frame is less than the threshold, from the integrated acoustic information vector corresponding to the previous speech frame, the retention rate corresponding to the current speech frame, the information weight of the current speech frame, and the acoustic information vector of the current speech frame; it specifically includes:

and if the accumulated information weight corresponding to the current voice frame is smaller than a threshold value, multiplying the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and adding the product of the information weight of the current voice frame and the acoustic information vector of the current voice frame to obtain the integrated acoustic information vector corresponding to the current voice frame.

That is, the integrated acoustic information vector corresponding to the previous voice frame is multiplied by the retention rate corresponding to the current voice frame, the information weight of the current voice frame is multiplied by the acoustic information vector of the current voice frame, and the two products are added to obtain the integrated acoustic information vector corresponding to the current voice frame.

The integrated acoustic information vector corresponding to the current speech frame can be expressed by the following formula:

$$\hat{h}_u = (1 - R)\,\hat{h}_{u-1} + \alpha_u h_u$$

where $\hat{h}_u$ represents the integrated acoustic information vector corresponding to the current speech frame, $R$ is the leakage rate, $\hat{h}_{u-1}$ represents the integrated acoustic information vector corresponding to the previous speech frame, $\alpha_u$ represents the information weight of the current speech frame, and $h_u$ represents the acoustic information vector of the current speech frame.

Further, an embodiment of the present application provides a specific implementation of outputting the delivered integrated acoustic information vector by using the integrated acoustic information vector corresponding to the previous speech frame and the acoustic information vector of the current speech frame when the accumulated information weight corresponding to the current speech frame is greater than or equal to the threshold; it includes the following two steps:

A1: if the accumulated information weight corresponding to the current voice frame is greater than or equal to the threshold, multiply the accumulated information weight corresponding to the previous voice frame by the retention rate corresponding to the current voice frame to obtain a first value, and calculate the difference between 1 and the first value to obtain the first-part information weight of the current voice frame.

When the accumulated information weight corresponding to the current voice frame is greater than or equal to the threshold, the acoustic boundary can be located within the current voice frame, and the corresponding delivered integrated acoustic information vector is obtained.

The first-part information weight of the current speech frame is determined according to the accumulated information weight corresponding to the previous speech frame.

The accumulated information weight corresponding to the previous voice frame is multiplied by the retention rate corresponding to the current voice frame to obtain the first value, and the difference between 1 and the first value gives the first-part information weight of the current voice frame.

The first-part information weight $\alpha_{u1}$ of the current speech frame can be expressed by the following formula:

$$\alpha_{u1} = 1 - (1 - R)\,\hat{\alpha}_{u-1}$$

where $R$ is the leakage rate and $\hat{\alpha}_{u-1}$ represents the accumulated information weight corresponding to the previous speech frame.

A2: multiply the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, then add the product of the first-part information weight of the current voice frame and the acoustic information vector of the current voice frame, to obtain the delivered integrated acoustic information vector.

The delivered integrated acoustic information vector comprises a part of the integrated acoustic information vector corresponding to the previous voice frame and a part of the acoustic information vector of the current voice frame.

The integrated acoustic information vector corresponding to the previous voice frame is multiplied by the retention rate corresponding to the current voice frame; the first-part information weight of the current voice frame is multiplied by the acoustic information vector of the current voice frame; and the two products are added to obtain the delivered integrated acoustic information vector.

The delivered integrated acoustic information vector can be expressed by the following formula:

$$c = (1 - R)\,\hat{h}_{u-1} + \alpha_{u1} h_u$$

where $(1 - R)$ represents the retention rate corresponding to the current speech frame, $\hat{h}_{u-1}$ represents the integrated acoustic information vector corresponding to the previous speech frame, $\alpha_{u1}$ represents the first-part information weight of the current speech frame, and $h_u$ represents the acoustic information vector of the current speech frame.

Based on the above, adjusting the integrated acoustic information vector corresponding to the previous voice frame and the first-part information weight of the current voice frame by the leakage rate further reduces the influence of the integrated acoustic information vectors of voice frames with lower information weights on the delivered integrated acoustic information vector, making the delivered integrated acoustic information vector more accurate.

After the delivered integrated acoustic information vector is obtained by using the first-part information weight of the current voice frame, a part of the current voice frame has not yet been integrated into the delivered integrated acoustic information vector. The integrated acoustic information vector corresponding to the current voice frame therefore needs to be determined according to the first-part information weight of the current voice frame.

An embodiment of the present application further provides a specific implementation of calculating the integrated acoustic information vector corresponding to the current speech frame, which includes the following two steps:

B1: calculate the difference between the information weight of the current voice frame and the first-part information weight of the current voice frame to obtain the second-part information weight of the current voice frame, and take the second-part information weight of the current voice frame as the accumulated information weight of the current voice frame.

The second-part information weight of the current speech frame is the difference between the information weight of the current speech frame and the first-part information weight of the current speech frame. That is, the portion of the current speech frame's information weight that was not used to integrate the delivered integrated acoustic information vector serves as the second-part information weight.

The accumulated information weight of the current speech frame is the information weight of the current speech frame that can still be integrated with subsequent speech frames. The second-part information weight of the current speech frame is therefore taken as the accumulated information weight of the current speech frame.

The accumulated information weight of the current speech frame can be expressed by the following formula:

$$\hat{\alpha}_u = \alpha_{u2} = \alpha_u - \alpha_{u1}$$

where $\alpha_{u2}$ is the second-part information weight of the current speech frame and $\alpha_{u1}$ is the first-part information weight of the current speech frame.

B2: multiply the second-part information weight of the current voice frame by the acoustic information vector of the current voice frame to obtain the integrated acoustic information vector corresponding to the current voice frame.

That is, the second-part information weight of the current voice frame is multiplied by the acoustic information vector of the current voice frame to obtain the integrated acoustic information vector corresponding to the current voice frame.

The integrated acoustic information vector corresponding to the current speech frame can be expressed by the following formula:

$$\hat{h}_u = \alpha_{u2} h_u$$

In the embodiment of the present application, the accumulated information weight of the current speech frame and the integrated acoustic information vector corresponding to the current speech frame are determined based on the first-part information weight of the current speech frame. The accumulated information weight of the current speech frame can thus be obtained more accurately, which facilitates integration with subsequent speech frames.
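Collecting S301 to S305 and the formulas above, the whole leaky integration loop can be sketched as follows. This is a minimal reading of the procedure in plain Python/NumPy, assuming a fixed leakage rate R and a threshold of 1; the names and interfaces are illustrative, not prescribed by this application.

```python
import numpy as np

def leaky_cif(h, alpha, leak=0.1, threshold=1.0):
    """Sketch of the leaky CIF of S301-S305.

    h:     (N, D) acoustic information vectors
    alpha: (N,) information weights
    leak:  leakage rate R in [0, 1]; the retention rate is 1 - R
    """
    keep = 1.0 - leak             # retention rate
    fired = []                    # delivered integrated acoustic information vectors
    acc_w = 0.0                   # accumulated information weight
    acc_h = np.zeros(h.shape[1])  # integrated acoustic information vector
    for u in range(len(h)):
        # S302: leak part of the accumulated weight, then add this frame's weight
        new_w = keep * acc_w + alpha[u]
        if new_w < threshold:
            # S303: still below the threshold, keep integrating with leakage
            acc_w = new_w
            acc_h = keep * acc_h + alpha[u] * h[u]
        else:
            # S304: boundary reached, fire a delivered vector; with threshold = 1
            # this is the document's alpha_{u1} = 1 - (1 - R) * acc_w
            w1 = threshold - keep * acc_w
            fired.append(keep * acc_h + w1 * h[u])
            acc_w = alpha[u] - w1          # second-part weight alpha_{u2}
            acc_h = acc_w * h[u]
        # S305: continue with the next frame until none remain
    return fired
```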

In order to illustrate the method for generating acoustic features provided by the above embodiments, the following example is given with reference to a specific scenario.

Referring to fig. 4, a schematic diagram of a method for generating acoustic features according to an embodiment of the present application is shown. The acoustic information vectors of the speech frames output by the encoder are $H = \{h_1, h_2, h_3, h_4\}$, and the corresponding information weights are $\alpha = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\} = \{0.2, 0.9, 0.6, 0.6\}$.

The current speech frame is the first speech frame, i.e., $u = 1$. The acoustic information vector of the current speech frame is $h_1$ and the corresponding information weight is $\alpha_1$. Since the first speech frame has no previous speech frame, the accumulated information weight corresponding to the current speech frame is

$$\hat{\alpha}_1 = \alpha_1 = 0.2$$

With the threshold set to 1, the accumulated information weight corresponding to the current speech frame is smaller than the threshold, so the integrated acoustic information vector corresponding to the current speech frame is computed as

$$\hat{h}_1 = \alpha_1 h_1 = 0.2\,h_1$$

the next speech frame is taken as the current speech frame, i.e. u-2. Obtaining acoustic information vector h of current voice frame2And the information weight alpha of the current speech frame2. Determining the weight of the accumulated information amount corresponding to the current voice frame

Wherein, R is the leakage rate corresponding to the current voice frame, and the value of R is 0.1.

Since the accumulated information weight $\hat{\alpha}_2 = 1.08$ corresponding to the current speech frame is greater than the threshold, the integrated acoustic information vector $\hat{h}_1$ corresponding to the previous speech frame and the acoustic information vector $h_2$ of the current speech frame are used to output the delivered integrated acoustic information vector $c_1$.

The first-part information weight $\alpha_{21}$ of the current speech frame is calculated as

$$\alpha_{21} = 1 - (1 - R)\,\hat{\alpha}_1 = 1 - 0.9 \times 0.2 = 0.82$$

The delivered integrated acoustic information vector $c_1$ is then

$$c_1 = (1 - R)\,\hat{h}_1 + \alpha_{21} h_2 = 0.18\,h_1 + 0.82\,h_2$$

The accumulated information weight corresponding to the current speech frame is then recomputed as

$$\hat{\alpha}_2 = \alpha_{22} = \alpha_2 - \alpha_{21} = 0.9 - 0.82 = 0.08$$

and the integrated acoustic information vector corresponding to the current speech frame is

$$\hat{h}_2 = \alpha_{22} h_2 = 0.08\,h_2$$

then the next speech frame is used as the current speech frame, i.e. u-3. Obtaining acoustic information vector h of current voice frame3And the information weight alpha of the current speech frame3

Then the weight of the accumulated information amount corresponding to the current voice frame is calculatedCan be expressed as:

if the weight of the accumulated information amount corresponding to the current voice frame is less than the threshold value, calculating an integrated acoustic information vector corresponding to the current voice frame

The next speech frame is taken as the current speech frame, i.e., $u = 4$, and the acoustic information vector $h_4$ of the current speech frame and its information weight $\alpha_4$ are acquired.

The accumulated information weight corresponding to the current speech frame is

$$\hat{\alpha}_4 = (1 - R)\,\hat{\alpha}_3 + \alpha_4 = 0.9 \times 0.672 + 0.6 = 1.2048$$

Since the accumulated information weight corresponding to the current speech frame is greater than or equal to the threshold, a delivered integrated acoustic information vector is calculated.

The first-part information weight $\alpha_{41}$ of the current speech frame is

$$\alpha_{41} = 1 - (1 - R)\,\hat{\alpha}_3 = 1 - 0.9 \times 0.672 = 0.3952$$

and the delivered integrated acoustic information vector is

$$c_2 = (1 - R)\,\hat{h}_3 + \alpha_{41} h_4 = 0.0648\,h_2 + 0.54\,h_3 + 0.3952\,h_4$$

and ending the generation of the integrated acoustic information vector when no other voice frame exists after the fourth voice frame.

In one possible implementation, the leakage rate of the current speech frame is adjustable. The leakage rate of the current speech frame may be determined using a predictive model.

Based on this, the embodiment of the present application further provides a method for generating an acoustic feature, which includes the following steps in addition to the above steps:

and inputting the acoustic information vector of the current voice frame and the integrated acoustic information vector corresponding to the previous voice frame into a prediction model to obtain the leakage rate of the current voice frame.

The prediction model outputs the leakage rate of the current speech frame from the input acoustic information vector of the current speech frame and the integrated acoustic information vector corresponding to the previous speech frame. The leakage rate of the current speech frame lies in the range [0, 1].

The prediction model may be a neural network layer in the speech model, for example a fully connected layer or a convolutional layer with a sigmoid activation function. The prediction model may be trained jointly with the speech model, and its parameters are adjusted during training of the speech model.
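As one illustration of such a layer, the following PyTorch sketch uses a single fully connected layer over the concatenation of the two input vectors with a sigmoid activation; the layer shape and the use of concatenation are assumptions made for the example, since the application does not fix them.

```python
import torch
import torch.nn as nn

class LeakagePredictor(nn.Module):
    """Predicts the leakage rate of the current speech frame in [0, 1]."""

    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)  # one fully connected layer

    def forward(self, h_u, acc_h_prev):
        # h_u:        acoustic information vector of the current frame, (B, dim)
        # acc_h_prev: integrated acoustic vector of the previous frame, (B, dim)
        x = torch.cat([h_u, acc_h_prev], dim=-1)
        return torch.sigmoid(self.fc(x)).squeeze(-1)  # sigmoid keeps R in [0, 1]
```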

Based on the above content, the leakage rate of the current speech frame is obtained by using the prediction model, so that the leakage rate of the speech frame can be more accurate, and the accuracy of the obtained issued integrated acoustic information vector is further improved.

In a possible implementation manner, the leakage rate corresponding to the current speech frame may further be set to 0 once every N speech frames, where N is a positive integer.

In the embodiment of the present application, setting the leakage rate of some speech frames to 0 reduces the amount of calculation and improves the efficiency of generating integrated acoustic information vectors, while still improving the accuracy of the delivered integrated acoustic information vectors.
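One possible reading of this schedule, again as a hedged sketch: on every N-th frame the leakage rate is fixed to 0 and the prediction model is simply not evaluated, which is where the computational saving comes from. The interval N = 4 below is an arbitrary illustrative choice.

```python
def leakage_for_frame(u, predictor, h_u, acc_h_prev, N=4):
    """Return the leakage rate for frame u under an every-N schedule (N assumed)."""
    if u % N == 0:
        return 0.0  # scheduled frame: no leakage, no predictor call
    return predictor(h_u, acc_h_prev)
```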

Based on the method for generating acoustic features provided by the above embodiment, the embodiment of the present application further provides a speech model training method. Referring to fig. 5, which is a flowchart of a speech model training method provided in the present application, the method includes steps S501-S504.

S501: input the training voice data into the encoder to obtain the acoustic information vector of each voice frame.

The training speech data is used to train the speech model and determine the model parameters of the speech model. The training speech data has a corresponding word label.

For example, when the training speech data is speech data corresponding to "hello", the word label corresponding to the training speech data is "hello".

And inputting the training voice data into the encoder to obtain the acoustic information vector of each voice frame output by the encoder.

S502: and inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain the issued integrated acoustic information vector.

And inputting the acoustic information vector of each voice frame output by the encoder and the information weight of each voice frame into the CIF module to obtain an integrated acoustic information vector issued by the CIF module. The CIF module obtains the delivered integrated acoustic information vector by using the method for generating acoustic features according to the above embodiment.

S503: and inputting the issued integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data.

And inputting the obtained issued integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data output by the decoder.

S504: and training the encoder, the CIF module and the decoder according to the word prediction result and the word label corresponding to the training voice data.

And training the voice model according to the word prediction result output by the voice model and the word label corresponding to the training voice data. The speech model is composed of an encoder, a CIF module and a decoder.
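
The following is a minimal PyTorch sketch of one training step under the encoder/CIF/decoder decomposition of S501-S504, using a cross-entropy loss against the word labels. The module interfaces, the assumption that the encoder also produces the per-frame information weights, and the loss choice are all illustrative; the embodiment itself only fixes the three-stage structure.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, cif, decoder, optimizer, speech, labels):
    """One optimization step over a batch of training speech data.

    speech : (B, T, F) input features; labels : (B, L) word label ids.
    """
    h, alpha = encoder(speech)   # (B, T, D) acoustic vectors, (B, T) weights
    c = cif(h, alpha)            # (B, L, D) issued integrated acoustic vectors
    logits = decoder(c)          # (B, L, V) word prediction scores
    # cross_entropy expects (B, V, L) logits against (B, L) targets
    loss = F.cross_entropy(logits.transpose(1, 2), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```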

Based on the relevant contents of the above S501-S504, by adopting the above method for generating acoustic features, the integrated acoustic information vector output by the CIF module is more accurate, so that the word prediction result obtained by decoding by the decoder is more accurate, and the speech model obtained by training has higher accuracy and better performance.

Based on the speech model training method provided by the embodiment, the embodiment of the application further provides a speech recognition method. Referring to fig. 6, which is a flowchart illustrating a speech recognition method according to an embodiment of the present application, the method includes steps S601-S603.

S601: and inputting the voice data to be recognized into a coder to obtain the acoustic information vector of each voice frame.

The voice data to be recognized is the voice data which needs to be recognized to obtain a word recognition result. And inputting the voice data to be recognized into the encoder to obtain the acoustic information vector of each voice frame output by the encoder.

S602: and inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain the issued integrated acoustic information vector.

And inputting the acoustic information vector of each voice frame and the information weight of each voice frame into the CIF module to obtain an integrated acoustic information vector output by the CIF module. The CIF module obtains the delivered integrated acoustic information vector by using the method for generating acoustic features according to the above embodiment.

S603: and inputting the issued integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized.

And finally, inputting the issued integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized output by the decoder. The word recognition result is the recognition result of the voice data to be recognized output by the voice model.
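
For completeness, a matching inference sketch under the same assumed interfaces as the training sketch above: the trained encoder, CIF module, and decoder are chained, and the most probable word at each step is taken by a greedy argmax (the greedy choice is an illustrative assumption; other decoding strategies are equally possible).

```python
import torch

@torch.no_grad()
def recognize(encoder, cif, decoder, speech):
    """Recognize one utterance; returns predicted word ids."""
    h, alpha = encoder(speech)    # per-frame acoustic vectors and weights
    c = cif(h, alpha)             # issued integrated acoustic vectors
    logits = decoder(c)           # (L, V) word recognition scores
    return logits.argmax(dim=-1)  # greedy word recognition result
```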

Based on the relevant contents of the above S601-S603, by adopting the above method for generating acoustic features, the integrated acoustic information vector output by the CIF module is more accurate, so that the word recognition result obtained by decoding by the decoder is more accurate, the accuracy of the speech model is higher, and the performance is better.

Based on the method for generating the acoustic features provided by the above method embodiment, the present application embodiment also provides an apparatus for generating the acoustic features, and the apparatus for generating the acoustic features will be described below with reference to the accompanying drawings.

Referring to fig. 7, the drawing is a schematic structural diagram of an apparatus for generating an acoustic feature according to an embodiment of the present application. As shown in fig. 7, the apparatus for generating acoustic features includes:

a first obtaining unit 701, configured to obtain an acoustic information vector of a current speech frame and an information weight of the current speech frame;

a first calculating unit 702, configured to obtain an accumulated information amount weight corresponding to a current speech frame according to an accumulated information amount weight corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, and an information amount weight of the current speech frame; the retention rate is the difference between 1 and the leakage rate;

a second calculating unit 703, configured to, if the accumulated information weight corresponding to the current speech frame is smaller than a threshold, obtain an integrated acoustic information vector corresponding to the current speech frame according to an integrated acoustic information vector corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, the information weight of the current speech frame, and the acoustic information vector of the current speech frame;

a third calculating unit 704, configured to output a delivered integrated acoustic information vector by using an integrated acoustic information vector corresponding to a previous speech frame and an acoustic information vector of the current speech frame if the accumulated information weight corresponding to the current speech frame is greater than or equal to a threshold, and calculate to obtain the integrated acoustic information vector corresponding to the current speech frame;

the executing unit 705 is configured to, after obtaining the integrated acoustic information vector corresponding to the current speech frame, take a next speech frame as the current speech frame, and repeatedly execute the obtaining of the acoustic information vector of the current speech frame and the information weight of the current speech frame and the subsequent steps until there is no next speech frame.

In a possible implementation manner, the first calculating unit 702 is specifically configured to multiply the accumulated information weight corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and add the product to the information weight of the current voice frame to obtain the accumulated information weight corresponding to the current voice frame.

In a possible implementation manner, the second calculating unit 703 is specifically configured to, if the accumulated information weight corresponding to the current speech frame is smaller than a threshold, multiply the integrated acoustic information vector corresponding to the previous speech frame by the retention rate corresponding to the current speech frame, and add the product of the information weight of the current speech frame and the acoustic information vector of the current speech frame to obtain the integrated acoustic information vector corresponding to the current speech frame.

In a possible implementation manner, the third computing unit 704 includes:

a first calculating subunit, configured to calculate, if the accumulated information weight corresponding to the current speech frame is greater than or equal to a threshold, an accumulated information weight corresponding to a previous speech frame multiplied by a retention rate corresponding to the current speech frame to obtain a first value, and calculate a difference between 1 and the first value to obtain a first partial information weight of the current speech frame;

and the second calculating subunit is used for multiplying the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and then adding the product of the first part of information weight of the current voice frame and the acoustic information vector of the current voice frame to obtain the issued integrated acoustic information vector.

In a possible implementation manner, the third computing unit 704 includes:

a third calculating subunit, configured to calculate a difference between the information weight of the current speech frame and the first part of information weight of the current speech frame to obtain a second part of information weight of the current speech frame, and use the second part of information weight of the current speech frame as an accumulated information weight of the current speech frame;

and the fourth calculating subunit is configured to calculate a second part information weight of the current speech frame multiplied by the acoustic information vector of the current speech frame to obtain an integrated acoustic information vector corresponding to the current speech frame.

In one possible implementation, the apparatus further includes:

and the second obtaining unit is used for inputting the acoustic information vector of the current voice frame and the integrated acoustic information vector corresponding to the previous voice frame into a prediction model to obtain the leakage rate of the current voice frame.

In a possible implementation manner, every N speech frames, the leakage rate corresponding to the current speech frame is 0.

Based on the method for training the speech model provided by the embodiment of the method, the embodiment of the application also provides a device for training the speech model, and the device for training the speech model is described below with reference to the accompanying drawings.

Fig. 8 is a schematic structural diagram of a speech model training apparatus according to an embodiment of the present application. As shown in fig. 8, the speech model training apparatus includes:

a first training unit 801, configured to input training speech data into an encoder to obtain an acoustic information vector of each speech frame;

a second training unit 802, configured to input the acoustic information vector of each speech frame and the information weight of each speech frame into a continuous integrate-and-fire (CIF) module, so as to obtain an issued integrated acoustic information vector; the CIF module outputs the issued integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

a third training unit 803, configured to input the issued integrated acoustic information vector to a decoder, so as to obtain a word prediction result of the training speech data;

a fourth training unit 804, configured to train the encoder, the CIF module, and the decoder according to the word prediction result and the word label corresponding to the training speech data.

Based on the voice recognition method provided by the above method embodiment, the embodiment of the present application further provides a voice recognition apparatus, and the voice recognition apparatus will be described with reference to the accompanying drawings.

Referring to fig. 9, a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application is shown. As shown in fig. 9, the speech recognition apparatus includes:

a first input unit 901, configured to input speech data to be recognized into an encoder, so as to obtain an acoustic information vector of each speech frame;

a second input unit 902, configured to input the acoustic information vector of each speech frame and the information weight of each speech frame into a continuous integrate-and-fire (CIF) module, so as to obtain an issued integrated acoustic information vector; the CIF module outputs the issued integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

and a third input unit 903, configured to input the issued integrated acoustic information vector to a decoder, so as to obtain a word recognition result of the speech data to be recognized.

Based on the method for generating acoustic features, the speech model training method and the speech recognition method provided by the embodiment of the method, the application also provides an electronic device, which comprises: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method of generating acoustic features, the method of speech model training, or the method of speech recognition as described in any of the embodiments above.

Referring now to FIG. 10, shown is a schematic diagram of an electronic device 1000 suitable for use in implementing embodiments of the present application. The terminal device in the embodiment of the present application may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a PAD (Portable Android Device), a PMP (Portable Multimedia Player), and a vehicle-mounted terminal (e.g., a car navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The electronic device shown in fig. 10 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 10, the electronic device 1000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1001 that may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage means 1008 into a random access memory (RAM) 1003. In the RAM 1003, various programs and data necessary for the operation of the electronic device 1000 are also stored. The processing device 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; output devices 1007 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. When executed by the processing device 1001, the computer program performs the above-described functions defined in the method of the embodiment of the present application.

The electronic device provided by the embodiment of the present application and the method for generating acoustic features, the method for training a speech model, and the method for recognizing speech provided by the above embodiment belong to the same inventive concept, and technical details that are not described in detail in the present embodiment can be referred to the above embodiment, and the present embodiment has the same beneficial effects as the above embodiment.

Based on the method for generating acoustic features, the speech model training method, and the speech recognition method provided in the above method embodiments, the present application provides a computer storage medium having a computer program stored thereon, where the program is executed by a processor to implement the method for generating acoustic features, the speech model training method, or the speech recognition method according to any one of the above embodiments.

It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method of generating acoustic features, the method of speech model training, or the method of speech recognition described above.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present application may be implemented by software or hardware. Where the name of a unit/module does not in some cases constitute a limitation on the unit itself, for example, a voice data collection module may also be described as a "data collection module".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present application, [ example one ] there is provided a method of generating acoustic features, the method comprising:

acquiring an acoustic information vector of a current voice frame and an information weight of the current voice frame;

obtaining the accumulated information weight corresponding to the current voice frame according to the accumulated information weight corresponding to the previous voice frame, the retention rate corresponding to the current voice frame and the information weight of the current voice frame; the retention rate is the difference between 1 and the leakage rate;

if the accumulated information weight corresponding to the current voice frame is smaller than a threshold value, obtaining an integrated acoustic information vector corresponding to the current voice frame according to the integrated acoustic information vector corresponding to the previous voice frame, the retention rate corresponding to the current voice frame, the information weight of the current voice frame and the acoustic information vector of the current voice frame;

if the weight of the accumulated information amount corresponding to the current voice frame is larger than or equal to a threshold value, outputting a delivered integrated acoustic information vector by using the integrated acoustic information vector corresponding to the previous voice frame and the acoustic information vector of the current voice frame, and calculating to obtain the integrated acoustic information vector corresponding to the current voice frame;

and after obtaining the integrated acoustic information vector corresponding to the current voice frame, taking the next voice frame as the current voice frame, and repeatedly executing the steps of obtaining the acoustic information vector of the current voice frame, the information weight of the current voice frame and the subsequent steps until the next voice frame does not exist.

According to one or more embodiments of the present application, [ example two ] there is provided a method for generating acoustic features, where the obtaining an accumulated information weight corresponding to a current speech frame according to an accumulated information weight corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, and an information weight of the current speech frame includes:

and multiplying the accumulated information weight corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and adding the product to the information weight of the current voice frame to obtain the accumulated information weight corresponding to the current voice frame.

According to one or more embodiments of the present application, [ example three ] there is provided a method for generating acoustic features, where if the accumulated information amount weight corresponding to the current speech frame is less than a threshold, obtaining an integrated acoustic information vector corresponding to the current speech frame according to an integrated acoustic information vector corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, an information amount weight of the current speech frame, and an acoustic information vector of the current speech frame includes:

and if the accumulated information weight corresponding to the current voice frame is smaller than a threshold value, multiplying the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and adding the product of the information weight of the current voice frame and the acoustic information vector of the current voice frame to obtain the integrated acoustic information vector corresponding to the current voice frame.

According to one or more embodiments of the present application, [ example four ] there is provided a method for generating acoustic features, where if the accumulated information amount weight corresponding to the current speech frame is greater than or equal to a threshold, outputting an issued integrated acoustic information vector by using the integrated acoustic information vector corresponding to the previous speech frame and the acoustic information vector of the current speech frame includes:

if the accumulated information weight corresponding to the current voice frame is larger than or equal to a threshold value, calculating the accumulated information weight corresponding to the last voice frame multiplied by the retention rate corresponding to the current voice frame to obtain a first numerical value, and calculating the difference between 1 and the first numerical value to obtain the information weight of the first part of the current voice frame;

and multiplying the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and adding the integrated acoustic information vector to the product of the first part information weight of the current voice frame and the acoustic information vector of the current voice frame to obtain the delivered integrated acoustic information vector.

According to one or more embodiments of the present application, [ example five ] there is provided a method for generating acoustic features, wherein the calculating of the integrated acoustic information vector corresponding to the current speech frame includes:

calculating the difference between the information weight of the current voice frame and the first part of information weight of the current voice frame to obtain a second part of information weight of the current voice frame, and taking the second part of information weight of the current voice frame as the accumulated information weight of the current voice frame;

and calculating the weight of the second part of information quantity of the current voice frame multiplied by the acoustic information vector of the current voice frame to obtain an integrated acoustic information vector corresponding to the current voice frame.

According to one or more embodiments of the present application, [ example six ] there is provided a method of generating an acoustic feature, the method further comprising:

and inputting the acoustic information vector of the current voice frame and the integrated acoustic information vector corresponding to the previous voice frame into a prediction model to obtain the leakage rate of the current voice frame.

According to one or more embodiments of the present application, [ example seven ] there is provided a method of generating acoustic features, the leakage rate corresponding to the current speech frame being 0 every N speech frames.

According to one or more embodiments of the present application, [ example eight ] there is provided a speech model training method, the method comprising:

inputting training voice data into a coder to obtain acoustic information vectors of all voice frames;

inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain an issued integrated acoustic information vector; the CIF module outputs the issued integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

inputting the issued integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data;

and training the encoder, the CIF module and the decoder according to the word prediction result and the word label corresponding to the training voice data.

According to one or more embodiments of the present application, [ example nine ] there is provided a speech recognition method, the method comprising:

inputting voice data to be recognized into a coder to obtain acoustic information vectors of all voice frames;

inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain an issued integrated acoustic information vector; the CIF module outputs the issued integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

and inputting the issued integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized.

According to one or more embodiments of the present application, [ example ten ] there is provided an apparatus to generate acoustic features, the apparatus comprising:

the device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring an acoustic information vector of a current voice frame and an information weight of the current voice frame;

a first calculating unit, configured to obtain an accumulated information amount weight corresponding to a current speech frame according to an accumulated information amount weight corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, and an information amount weight of the current speech frame; the retention rate is the difference between 1 and the leakage rate;

a second calculating unit, configured to, if the accumulated information amount weight corresponding to the current speech frame is smaller than a threshold, obtain an integrated acoustic information vector corresponding to the current speech frame according to an integrated acoustic information vector corresponding to a previous speech frame, a retention rate corresponding to the current speech frame, the information amount weight of the current speech frame, and the acoustic information vector of the current speech frame;

a third calculating unit, configured to output a delivered integrated acoustic information vector by using an integrated acoustic information vector corresponding to a previous speech frame and an acoustic information vector of the current speech frame if the accumulated information weight corresponding to the current speech frame is greater than or equal to a threshold, and calculate to obtain the integrated acoustic information vector corresponding to the current speech frame;

and the execution unit is used for taking the next voice frame as the current voice frame after the integrated acoustic information vector corresponding to the current voice frame is obtained, and repeatedly executing the steps of obtaining the acoustic information vector of the current voice frame and the information weight of the current voice frame and the subsequent steps until the next voice frame does not exist.

According to one or more embodiments of the present application, [ example eleven ] there is provided an apparatus for generating acoustic features, where the first calculating unit is specifically configured to multiply the accumulated information amount weight corresponding to the previous speech frame by the retention rate corresponding to the current speech frame, and add the product to the information amount weight of the current speech frame to obtain the accumulated information amount weight corresponding to the current speech frame.

According to one or more embodiments of the present application, [ example twelve ] there is provided an apparatus for generating acoustic features, where the second calculating unit is specifically configured to, if the accumulated information amount weight corresponding to the current speech frame is less than a threshold, multiply the integrated acoustic information vector corresponding to the previous speech frame by the retention rate corresponding to the current speech frame, and add the product of the information amount weight of the current speech frame and the acoustic information vector of the current speech frame to obtain the integrated acoustic information vector corresponding to the current speech frame.

According to one or more embodiments of the present application, [ example thirteen ] provides an apparatus that generates an acoustic feature, the third calculation unit including:

a first calculating subunit, configured to calculate, if the accumulated information weight corresponding to the current speech frame is greater than or equal to a threshold, an accumulated information weight corresponding to a previous speech frame multiplied by a retention rate corresponding to the current speech frame to obtain a first value, and calculate a difference between 1 and the first value to obtain a first partial information weight of the current speech frame;

and the second calculating subunit is used for multiplying the integrated acoustic information vector corresponding to the previous voice frame by the retention rate corresponding to the current voice frame, and then adding the product of the first part of information weight of the current voice frame and the acoustic information vector of the current voice frame to obtain the issued integrated acoustic information vector.

According to one or more embodiments of the present application, [ example fourteen ] there is provided an apparatus for generating acoustic features, the third calculation unit comprising:

a third calculating subunit, configured to calculate a difference between the information weight of the current speech frame and the first part of information weight of the current speech frame to obtain a second part of information weight of the current speech frame, and use the second part of information weight of the current speech frame as an accumulated information weight of the current speech frame;

and the fourth calculating subunit is configured to calculate a second part information weight of the current speech frame multiplied by the acoustic information vector of the current speech frame to obtain an integrated acoustic information vector corresponding to the current speech frame.

According to one or more embodiments of the present application, [ example fifteen ] there is provided an apparatus to generate acoustic features, the apparatus further comprising:

and the second obtaining unit is used for inputting the acoustic information vector of the current voice frame and the integrated acoustic information vector corresponding to the previous voice frame into a prediction model to obtain the leakage rate of the current voice frame.

According to one or more embodiments of the present application, [ example sixteen ] there is provided an apparatus for generating acoustic features, wherein every N speech frames, the current speech frame has a corresponding leakage rate of 0.

According to one or more embodiments of the present application, [ example seventeen ] there is provided a speech model training apparatus, the apparatus comprising:

the first training unit is used for inputting training voice data into the encoder to obtain acoustic information vectors of all voice frames;

the second training unit is used for inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain an issued integrated acoustic information vector; the CIF module outputs the issued integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

the third training unit is used for inputting the issued integrated acoustic information vector into a decoder to obtain a word prediction result of the training voice data;

and the fourth training unit is used for training the encoder, the CIF module and the decoder according to the word prediction result and the word label corresponding to the training voice data.

According to one or more embodiments of the present application, [ example eighteen ] there is provided a speech recognition apparatus, the apparatus comprising:

the first input unit is used for inputting the voice data to be recognized into the encoder to obtain the acoustic information vector of each voice frame;

the second input unit is used for inputting the acoustic information vector of each voice frame and the information weight of each voice frame into a continuous integrate-and-fire (CIF) module to obtain an issued integrated acoustic information vector; the CIF module outputs the issued integrated acoustic information vector according to the method for generating acoustic features in any one of the embodiments;

and the third input unit is used for inputting the issued integrated acoustic information vector into a decoder to obtain a word recognition result of the voice data to be recognized.

According to one or more embodiments of the present application, [ example nineteen ] there is provided an electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon,

when executed by the one or more processors, cause the one or more processors to implement a method for generating acoustic features as in any one of the embodiments above, a method for training a speech model as in any one of the embodiments above, or a method for speech recognition as in any one of the embodiments above.

According to one or more embodiments of the present application, [ example twenty ] there is provided a computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of generating acoustic features according to any of the embodiments described above, the speech model training method according to the embodiments described above, or the speech recognition method according to the embodiments described above.

It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the system or the device disclosed by the embodiment, the description is simple because the system or the device corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.

It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
