Speech recognition method, speech recognition device, computer equipment and storage medium

Document No.: 21558 · Published: 2021-09-21

Note: this technology, Speech recognition method, speech recognition device, computer equipment and storage medium, was designed and created by Zhang Xulong and Wang Jianzong on 2021-06-30. Abstract: the invention discloses a speech recognition method, a speech recognition apparatus, computer equipment and a storage medium, wherein the method comprises: extracting first feature information from speech information to be recognized, input by a user, using a pre-trained feature extraction network; converting the first feature information into second feature information related to the recognition task using a pre-trained self-attention neural network; calculating the similarity between the second feature information and preset support feature information of each support set, wherein each support set corresponds to one speech type; determining the target speech type corresponding to the support set to which the support feature information with the highest similarity belongs; and identifying phonemes of the speech information to be recognized according to the phoneme recognition rule of the target speech type, and generating text from the phonemes. The invention can complete speech recognition using features related to the recognition task, which reduces the demands on the generalization capability of the feature extraction network.

1. A speech recognition method, comprising:

extracting first feature information from speech information to be recognized, input by a user, using a pre-trained feature extraction network;

converting the first feature information into second feature information related to a recognition task using a pre-trained self-attention neural network;

calculating a similarity between the second feature information and preset support feature information of each support set, wherein each support set corresponds to one speech type;

determining a target speech type corresponding to the support set to which the support feature information with the highest similarity points; and

identifying phonemes of the speech information to be recognized according to a phoneme recognition rule of the target speech type, and generating text from the phonemes.

2. The speech recognition method of claim 1, further comprising: pre-training the feature extraction network and the self-attention neural network, comprising:

extracting first support feature information of the speech information in each support set using the feature extraction network, and converting the first support feature information into second support feature information related to a recognition task through the self-attention neural network, wherein each support set corresponds to one speech type and comprises a plurality of pieces of speech information of that speech type;

extracting first sample feature information of sample speech information using the feature extraction network, and converting the first sample feature information into second sample feature information related to the recognition task through the self-attention neural network;

calculating a similarity between the second sample feature information and each piece of second support feature information;

determining a predicted speech type corresponding to the support set to which the second support feature information with the highest similarity points; and

reversely updating the feature extraction network and the self-attention neural network according to the predicted speech type and an actual speech type to which the sample speech information belongs.

3. The speech recognition method of claim 2, wherein before extracting the first support feature information of the speech information in the support set using the feature extraction network, the method further comprises:

forcibly aligning the phonemes of the speech information in the support set according to the text information corresponding to that speech information.

4. The speech recognition method according to claim 2, wherein after reversely updating the feature extraction network and the self-attention neural network according to the predicted speech type and the actual speech type to which the sample speech information belongs, the method further comprises:

after training of the feature extraction network and the self-attention neural network is completed, clustering all the second support feature information produced by the trained self-attention neural network to obtain a plurality of clusters;

wherein calculating the similarity between the second feature information and the preset support feature information of each support set comprises:

calculating a similarity between the second feature information and each cluster to determine a target cluster with the highest similarity; and

calculating a similarity between the second feature information and each piece of second support feature information in the target cluster.

5. The speech recognition method according to claim 2, wherein extracting the first support feature information of the speech information in the support set using the feature extraction network and converting the first support feature information into the second support feature information related to the recognition task through the self-attention neural network comprises:

extracting support feature information of each piece of speech information in the support set using the feature extraction network to obtain a plurality of pieces of first support feature information;

converting each piece of first support feature information into second support feature information related to the recognition task using the self-attention neural network to obtain a plurality of pieces of second support feature information; and

calculating an average of the plurality of pieces of second support feature information and taking the average as the second support feature information of the support set.

6. The speech recognition method of claim 1, wherein the feature extraction network is a CNN or a ResNet.

7. The speech recognition method according to claim 1, wherein calculating the similarity between the second feature information and the support feature information corresponding to each support set comprises:

acquiring a first feature vector representation of the second feature information and a second feature vector representation of the support feature information; and

calculating a cosine similarity between the first feature vector representation and the second feature vector representation.

8. A speech recognition apparatus, comprising:

a feature extraction module, configured to extract first feature information from speech information to be recognized, input by a user, using a pre-trained feature extraction network;

a feature conversion module, configured to convert the first feature information into second feature information related to a recognition task using a pre-trained self-attention neural network;

a similarity calculation module, configured to calculate a similarity between the second feature information and preset support feature information of each support set, wherein each support set corresponds to one speech type;

a determination module, configured to determine a target speech type corresponding to the support set to which the support feature information with the highest similarity points; and

a recognition module, configured to recognize phonemes of the speech information to be recognized according to a phoneme recognition rule of the target speech type, and to generate text from the phonemes.

9. A computer device, comprising a processor and a memory coupled to the processor, wherein the memory stores program instructions which, when executed by the processor, cause the processor to carry out the steps of the speech recognition method according to any one of claims 1-7.

10. A storage medium, storing program instructions which, when executed, implement the speech recognition method according to any one of claims 1-7.

Technical Field

The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an apparatus, a computer device, and a storage medium.

Background

With the development of deep learning, speech recognition technology has advanced considerably, and speech can now be recognized with high accuracy. Early speech recognition relied on parametric techniques based on hidden Markov models (HMMs); with the progress of deep learning, deep-learning-based speech recognition techniques have gradually become widely used.

However, existing deep learning models require a large amount of sample data for training, so that the generalization capability of the feature extraction network is strong enough for the extracted feature information to also fit new samples that were never seen during training; when sample data are scarce, recognition accuracy suffers.

Disclosure of Invention

The application provides a speech recognition method, a speech recognition apparatus, computer equipment and a storage medium, to address the low recognition accuracy of existing speech recognition models when sample data are scarce.

In order to solve the technical problem, one technical solution adopted by the application is to provide a speech recognition method comprising: extracting first feature information from speech information to be recognized, input by a user, using a pre-trained feature extraction network; converting the first feature information into second feature information related to the recognition task using a pre-trained self-attention neural network; calculating the similarity between the second feature information and preset support feature information of each support set, wherein each support set corresponds to one speech type; determining the target speech type corresponding to the support set to which the support feature information with the highest similarity points; and identifying phonemes of the speech information to be recognized according to the phoneme recognition rule of the target speech type, and generating text from the phonemes.

As a further improvement of the present application, the method further comprises pre-training the feature extraction network and the self-attention neural network, which comprises: extracting first support feature information of the speech information in each support set using the feature extraction network, and converting the first support feature information into second support feature information related to the recognition task through the self-attention neural network, wherein each support set corresponds to one speech type and comprises a plurality of pieces of speech information of that speech type; extracting first sample feature information of sample speech information using the feature extraction network, and converting the first sample feature information into second sample feature information related to the recognition task through the self-attention neural network; calculating the similarity between the second sample feature information and each piece of second support feature information; determining the predicted speech type corresponding to the support set to which the second support feature information with the highest similarity points; and reversely updating the feature extraction network and the self-attention neural network according to the predicted speech type and the actual speech type to which the sample speech information belongs.

As a further improvement of the present application, before extracting the first support feature information of the speech information in the support set using the feature extraction network, the method further includes: forcibly aligning the phonemes of the speech information in the support set according to the text information corresponding to that speech information.

As a further improvement of the present application, after the feature extraction network and the self-attention neural network are reversely updated according to the predicted speech type and the actual speech type to which the sample speech information belongs, the method further includes: after training of the two networks is completed, clustering all the second support feature information produced by the trained self-attention neural network to obtain a plurality of clusters. Calculating the similarity between the second feature information and the preset support feature information of each support set then comprises: calculating the similarity between the second feature information and each cluster to determine the target cluster with the highest similarity; and calculating the similarity between the second feature information and each piece of second support feature information in the target cluster.

As a further improvement of the present application, extracting the first support feature information of the speech information in the support set using the feature extraction network and converting it into second support feature information related to the recognition task through the self-attention neural network includes: extracting support feature information of each piece of speech information in the support set using the feature extraction network to obtain a plurality of pieces of first support feature information; converting each piece of first support feature information into second support feature information related to the recognition task using the self-attention neural network to obtain a plurality of pieces of second support feature information; and calculating the average of the plurality of pieces of second support feature information and taking the average as the second support feature information of the support set.

As a further refinement of the present application, the feature extraction network comprises one of a CNN network or a ResNet network.

As a further improvement of the present application, calculating the similarity between the second feature information and the preset support feature information of each support set includes: acquiring a first feature vector representation of the second feature information and a second feature vector representation of the support feature information; and calculating a cosine similarity between the two feature vector representations.

In order to solve the above technical problem, another technical solution adopted by the present application is to provide a speech recognition apparatus including: a feature extraction module, configured to extract first feature information from speech information to be recognized, input by a user, using a pre-trained feature extraction network; a feature conversion module, configured to convert the first feature information into second feature information related to the recognition task using a pre-trained self-attention neural network; a similarity calculation module, configured to calculate the similarity between the second feature information and preset support feature information of each support set, wherein each support set corresponds to one speech type; a determination module, configured to determine the target speech type corresponding to the support set to which the support feature information with the highest similarity points; and a recognition module, configured to recognize phonemes of the speech information to be recognized according to the phoneme recognition rule of the target speech type and to generate text from the phonemes.

In order to solve the above technical problem, the present application adopts another technical solution that: there is provided a computer device comprising a processor, a memory coupled to the processor, the memory having stored therein program instructions which, when executed by the processor, cause the processor to carry out the steps of the speech recognition method as claimed in any one of the preceding claims.

In order to solve the above technical problem, the present application adopts another technical solution that: there is provided a storage medium storing program instructions capable of implementing the speech recognition method of any one of the above.

The beneficial effects of this application are as follows. By converting the first feature information extracted by the feature extraction network, which may contain features irrelevant to the recognition task, into second feature information related to the recognition task, the method prevents task-irrelevant features from influencing the final recognition result, and recognition of the speech to be recognized is then carried out on the basis of the task-related second feature information. The same conversion also constrains the extracted features, so the method adapts well to data not encountered during training, greatly reduces the amount of sample data required, and maintains high recognition accuracy even when sample data are scarce.

Drawings

FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of a speech recognition apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention;

fig. 4 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first", "second" and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any indication of the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise. All directional indications (such as up, down, left, right, front, rear, etc.) in the embodiments of the present application are only used to explain the relative positional relationship between the components, the movement, and the like in a specific posture (as shown in the drawings), and if the specific posture is changed, the directional indication is changed accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

Fig. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention. It should be noted that the method of the present invention is not limited to the flow sequence shown in fig. 1 if the results are substantially the same. As shown in fig. 1, the speech recognition method includes the steps of:

step S101: and extracting first characteristic information from the voice information to be recognized input by the user by utilizing a pre-trained characteristic extraction network.

It should be noted that the feature extraction network may be a CNN or a ResNet.

In step S101, after the speech information to be recognized input by the user is received, the trained feature extraction network extracts first feature information from it. The first feature information contains both features related to the recognition task and features unrelated to it.
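For illustration only, the role of the feature extraction network in step S101 can be sketched as a single one-dimensional convolution over a spectrogram-like input. This is a minimal stand-in with random weights, not the trained CNN/ResNet of the embodiment; all shapes and names are invented assumptions.

```python
import numpy as np

def conv1d_feature_extractor(spectrogram, weights, bias):
    """Toy stand-in for the pre-trained feature extraction network.

    spectrogram: (n_frames, n_mels) array of acoustic features.
    weights:     (kernel, n_mels, n_out) convolution kernel.
    bias:        (n_out,) bias vector.
    Returns "first feature information" of shape (n_frames - kernel + 1, n_out).
    """
    kernel, n_mels, n_out = weights.shape
    n_frames = spectrogram.shape[0]
    out = np.empty((n_frames - kernel + 1, n_out))
    for t in range(out.shape[0]):
        window = spectrogram[t:t + kernel]  # (kernel, n_mels) slice of frames
        out[t] = np.tensordot(window, weights, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU non-linearity

rng = np.random.default_rng(0)
spec = rng.standard_normal((100, 40))   # 100 frames, 40 mel bins (invented)
w = rng.standard_normal((3, 40, 64)) * 0.1
b = np.zeros(64)
first_features = conv1d_feature_extractor(spec, w, b)
print(first_features.shape)             # (98, 64)
```

In the embodiment the weights would of course come from pre-training, not a random generator; the sketch only shows the shape of the data flowing into step S102.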

Step S102: and converting the first characteristic information into second characteristic information related to the recognition task by utilizing a pre-trained self-attention neural network.

In step S102, after the feature extraction network has extracted the first feature information, the first feature information is input into the trained self-attention neural network, which converts it into second feature information related to the recognition task. The self-attention neural network is built on a self-attention mechanism.
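The conversion in step S102 can be sketched as a single scaled dot-product self-attention layer. The weight matrices below are random placeholders, not the pre-trained network of the embodiment, and the dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features, Wq, Wk, Wv):
    """Map first feature information (T, d) to task-related second
    feature information (T, d_v) with one scaled dot-product
    self-attention layer."""
    Q, K, V = features @ Wq, features @ Wk, features @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (T, T) attention scores
    return softmax(scores) @ V               # weighted mix of all frames

rng = np.random.default_rng(1)
first = rng.standard_normal((98, 64))        # first feature information
Wq = rng.standard_normal((64, 32)) * 0.1
Wk = rng.standard_normal((64, 32)) * 0.1
Wv = rng.standard_normal((64, 32)) * 0.1
second = self_attention(first, Wq, Wk, Wv)
print(second.shape)                          # (98, 32)
```

Because every output frame attends over all input frames, training such a layer end-to-end lets it emphasise whichever parts of the first feature information matter for the recognition task.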

It should be noted that the feature extraction network and the self-attention neural network are obtained by pre-training, and the step of training the feature extraction network and the self-attention neural network includes:

1. the method comprises the steps of extracting first support characteristic information of voice information in support sets by utilizing a characteristic extraction network, converting the first support characteristic information into second support characteristic information related to a recognition task through a self-attention neural network, wherein each support set corresponds to one voice type, and the support sets comprise a plurality of voice information with the same voice type.

Specifically, the speech types include Mandarin, Cantonese, Sichuanese, Hunanese and other varieties, and each support set corresponds to one speech type. For example, if the speech type of the first support set is Cantonese and that of the second support set is Sichuanese, the speech information in the Cantonese support set is Cantonese speech and the speech information in the Sichuanese support set is Sichuanese speech. It is to be understood that each support set is constructed in advance.

Further, in order to ensure that the extracted support feature information of each support set is accurate, in this embodiment, the step of extracting the first support feature information of the speech information in the support set by using the feature extraction network and converting the first support feature information into the second support feature information related to the recognition task by using the self-attention neural network specifically includes:

1.1. Extracting the support feature information of each piece of speech information in the support set using the feature extraction network to obtain a plurality of pieces of first support feature information.

1.2. Converting each piece of first support feature information into second support feature information related to the recognition task using the self-attention neural network to obtain a plurality of pieces of second support feature information.

1.3. Calculating the average of the plurality of pieces of second support feature information and taking the average as the second support feature information of the support set.

Specifically, to ensure that the extracted support feature information of a support set is accurate, each support set comprises a plurality of pieces of speech information. When extracting features, the feature extraction network first extracts support feature information for each piece of speech information in the support set, yielding a plurality of pieces of first support feature information. These are input one by one into the self-attention neural network, which converts them into a plurality of pieces of second support feature information. The second support feature information is then converted into vector representations and averaged, and the resulting mean vector serves as the final second support feature information of the support set. Averaging over several pieces of speech information of the same speech type makes the final second support feature information more representative of the characteristics of that speech type.
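The per-set averaging in steps 1.1-1.3 is the same prototype computation used by prototypical networks in few-shot learning. A minimal sketch, with an invented embedding size:

```python
import numpy as np

def support_prototype(second_support_features):
    """Average the second support feature vectors of one support set
    into a single representative vector (its 'prototype')."""
    return np.mean(np.stack(second_support_features), axis=0)

# e.g. three utterances of the same speech type, already embedded
rng = np.random.default_rng(2)
embeddings = [rng.standard_normal(32) for _ in range(3)]
proto = support_prototype(embeddings)
print(proto.shape)   # (32,)
```

One mean vector per support set is all that later similarity computations need, which keeps the comparison in step S103 cheap regardless of how many utterances each support set holds.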

Further, in order to further improve the accuracy of the feature extraction by the feature extraction network, before extracting the first support feature information of the speech information in the support set by using the feature extraction network, the method further includes: and forcibly aligning the phonemes of the voice information in the support set according to the text information corresponding to the voice information in the support set.

In the present embodiment, forced alignment of phonemes is performed with MFA (the Montreal Forced Aligner). Specifically, the speech information in the support set is prepared in advance as reference material, so the text corresponding to each piece of speech information is known beforehand. Therefore, before feature information is extracted with the feature extraction network, MFA uses the corresponding text to forcibly align the phonemes of the speech information in the support set: the length of each phoneme is predicted, and the phonemes are aligned with the corresponding text according to the predicted lengths. Forcibly aligning the phonemes of the support-set speech makes the features extracted by the feature extraction network more accurate.
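MFA runs as a separate tool, but the product of forced alignment is easy to picture: each phoneme of the known transcript receives a start and end time, which can then be expanded into frame-level phoneme labels. The sketch below only illustrates that expansion; the alignment times, phoneme labels, and frame rate are invented for the example.

```python
def phonemes_to_frame_labels(alignment, frame_rate_hz=100):
    """Expand a forced alignment [(phoneme, start_s, end_s), ...] into
    one phoneme label per acoustic frame."""
    labels = []
    for phoneme, start, end in alignment:
        n_frames = round((end - start) * frame_rate_hz)
        labels.extend([phoneme] * n_frames)
    return labels

# hypothetical alignment of the word "hi" at 100 frames per second
alignment = [("HH", 0.00, 0.05), ("AY", 0.05, 0.20)]
frames = phonemes_to_frame_labels(alignment)
print(len(frames), frames[0], frames[-1])   # 20 HH AY
```

With such frame-level labels available, the feature extraction network sees which frames belong to which phoneme, which is what makes the subsequently extracted support features more accurate.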

2. Extracting first sample feature information of the sample speech information using the feature extraction network, and converting the first sample feature information into second sample feature information related to the recognition task through the self-attention neural network.

In this embodiment, before the feature extraction network and the self-attention neural network are trained, a query set must be constructed in advance. The query set contains a plurality of pieces of speech information of different speech types and serves as the training sample for both networks. Specifically, after sample speech information in the query set is acquired, the feature extraction network extracts its first sample feature information, which the self-attention neural network then converts into second sample feature information related to the recognition task.

3. Calculating the similarity between the second sample feature information and each piece of second support feature information.

Specifically, after second sample feature information of the sample voice information is acquired, the similarity between the second sample feature information and each piece of second support feature information is calculated.

4. Determining the predicted speech type corresponding to the support set to which the second support feature information with the highest similarity points.

Specifically, after the similarities are calculated, the second support feature information with the highest similarity is selected, the support set it belongs to is identified, and the speech type corresponding to that support set is taken as the predicted speech type for this recognition.
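Steps 3 and 4 amount to a nearest-prototype decision under a similarity measure such as the cosine similarity of claim 7. The sketch below uses invented speech types and toy three-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_speech_type(query_feature, prototypes):
    """Return the speech type whose support prototype is most similar
    to the query's second feature information."""
    return max(prototypes,
               key=lambda t: cosine_similarity(query_feature, prototypes[t]))

# hypothetical support prototypes, one per speech type
prototypes = {
    "Cantonese": np.array([1.0, 0.0, 0.0]),
    "Sichuanese": np.array([0.0, 1.0, 0.0]),
}
query = np.array([0.9, 0.1, 0.0])
print(predict_speech_type(query, prototypes))   # Cantonese
```

At inference time the same decision rule selects the target speech type whose phoneme recognition rule is then applied in the final step.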

5. Reversely updating the feature extraction network and the self-attention neural network according to the predicted speech type and the actual speech type to which the sample speech information belongs.

Specifically, the actual speech type of the sample speech information is obtained and compared with the predicted speech type, and the comparison result is backpropagated to update the feature extraction network and the self-attention neural network, thereby training both networks.
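The text does not specify the loss used for the reverse update; one common choice in this few-shot setting, assumed here purely for illustration, is a cross-entropy loss over softmax-normalised similarities, whose gradient would then be backpropagated through both networks:

```python
import numpy as np

def episode_loss(similarities, true_index):
    """Cross-entropy over softmax-normalised similarities between the
    sample's second feature information and each support prototype.
    (Assumed loss; the embodiment only says the comparison result is
    backpropagated.)"""
    logits = np.asarray(similarities, dtype=float)
    # numerically stable log-softmax
    log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return float(-log_probs[true_index])

# sample actually belongs to support set 0
loss_uncertain = episode_loss([0.4, 0.5, 0.3], true_index=0)
loss_confident = episode_loss([2.0, 0.1, 0.0], true_index=0)
print(loss_confident < loss_uncertain)   # True
```

Minimising such a loss pushes the second sample feature information toward the prototype of its true speech type and away from the others, which is exactly the behaviour the comparison-and-update description calls for.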

Step S103: and calculating the similarity between the second characteristic information and the preset support characteristic information in each support set, wherein each support set corresponds to one voice type.

Further, when training the feature extraction network and the self-attention neural network, after reversely updating the feature extraction network and the self-attention neural network according to the predicted speech type and the actual speech type to which the sample speech information belongs, the method further includes:

after the training of the feature extraction network and the self-attention neural network is finished, clustering all the second support feature information produced by the trained self-attention neural network to obtain a plurality of clusters.

Specifically, after training of the feature extraction network and the self-attention neural network is completed, the feature extraction network extracts the first support feature information of each support set, and the self-attention neural network converts it into second support feature information, yielding one piece of second support feature information per support set. All the second support feature information is then clustered into a plurality of clusters, each cluster representing a group of support sets whose feature information is close.

Step S103, calculating the similarity between the second feature information and the support feature information in each preset support set, then specifically includes: calculating the similarity between the second feature information and each cluster to determine the target cluster with the highest similarity; and calculating the similarity between the second feature information and each piece of second support feature information within the target cluster.

Specifically, the similarity between the second feature information and each cluster is computed first, the most similar target cluster is selected, and the similarity between the second feature information and the second support feature information of each support set within that cluster is then computed to identify the most similar second support feature information.

In this embodiment, because the second support feature information is clustered, the subsequent similarity computation only needs to compare the second feature information against the clusters, and then against the second support feature information within the most similar cluster, rather than against all second support feature information, which greatly reduces the amount of data processed.

Further, in some embodiments, step S103, calculating the similarity between the second feature information and the support feature information in each preset support set, specifically includes: obtaining a first feature vector representation of the second feature information and a second feature vector representation of the support feature information, respectively; and calculating the cosine similarity between the first feature vector representation and the second feature vector representation.
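The cosine similarity named here is standard; a minimal sketch of the computation on the two vector representations:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between the first feature vector representation
    (the query) and a second feature vector representation (a support
    set): the dot product normalized by both vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_parallel = cosine_similarity([1., 2., 3.], [2., 4., 6.])  # same direction
sim_orthogonal = cosine_similarity([1., 0.], [0., 1.])        # unrelated
```

Parallel vectors score 1.0 and orthogonal vectors score 0.0, so higher values indicate that the second feature information is closer to that support set.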

Further, in some embodiments, the similarity between the second feature information and the support feature information may also be measured by the Euclidean distance, Mahalanobis distance, Manhattan distance, Minkowski distance, Hamming distance, Jaccard coefficient, Chebyshev distance, or Pearson correlation coefficient.

Step S104: confirm the target voice type corresponding to the support set pointed to by the support feature information with the highest similarity.

In step S104, after the support feature information with the highest similarity is selected, the target voice type of the support set to which that support feature information belongs is determined.

Step S105: recognize the phonemes of the voice information to be recognized according to the phoneme recognition rule of the target voice type, and generate text from the phonemes.

In step S105, it should be noted that the phoneme recognition rule of each voice type is preset. After the target voice type of the voice information to be recognized is confirmed, its phonemes are recognized according to the phoneme recognition rule of that type, and text is then generated from the phonemes, completing the recognition of the voice information to be recognized.
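The patent does not define the form of a phoneme recognition rule. One plausible sketch (the lexicon, phoneme symbols, and greedy longest-match strategy are all assumptions) treats the rule for the target voice type as a pronunciation lexicon mapping phoneme sequences to words:

```python
def phonemes_to_text(phonemes, lexicon):
    """Sketch of the final step: greedily match the recognized phoneme
    sequence against a pronunciation lexicon, preferring the longest
    matching phoneme run at each position, and join the words into text."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):  # longest match first
            chunk = tuple(phonemes[i:i + length])
            if chunk in lexicon:
                words.append(lexicon[chunk])
                i += length
                break
        else:
            i += 1  # no entry covers this phoneme; skip it
    return " ".join(words)

# hypothetical lexicon for an English-like target voice type (ARPAbet-style)
lexicon = {("HH", "AH", "L", "OW"): "hello", ("W", "ER", "L", "D"): "world"}
text = phonemes_to_text(["HH", "AH", "L", "OW", "W", "ER", "L", "D"], lexicon)
```

A production decoder would also weigh acoustic and language-model scores rather than matching greedily; the sketch only illustrates how a per-type rule turns phonemes into text.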

The voice recognition method provided by the embodiment of the invention converts the first feature information extracted by the feature extraction network, which is unrelated to the recognition task, into second feature information related to the recognition task, and then recognizes the voice information to be recognized based on that second feature information. This effectively avoids the influence of task-irrelevant feature information on the final recognition result. At the same time, the conversion constrains the extracted features, so the method adapts better to data not encountered during training, greatly reduces the demand for sample data, and maintains high recognition accuracy even when sample data is scarce.

Fig. 2 is a functional block diagram of a speech recognition apparatus according to an embodiment of the present invention. As shown in fig. 2, the speech recognition apparatus 20 includes a feature extraction module 21, a feature conversion module 22, a similarity calculation module 23, a confirmation module 24, and a recognition module 25.

The feature extraction module 21 is configured to extract first feature information from the to-be-recognized voice information input by the user by using a pre-trained feature extraction network;

the feature conversion module 22 is configured to convert the first feature information into second feature information related to the recognition task by using a pre-trained self-attention neural network;

the similarity calculation module 23 is configured to calculate a similarity between the second feature information and support feature information in each preset support set, where each support set corresponds to one voice type;

the confirming module 24 is configured to confirm a target voice type corresponding to the support set pointed by the support feature information with the highest similarity;

and the recognition module 25 is used for recognizing the phonemes of the speech information to be recognized according to the phoneme recognition rules of the target speech type and generating a text according to the phonemes.

Preferably, the speech recognition device 20 further comprises a training module for pre-training the feature extraction network and the self-attention neural network. The training operations performed by the training module specifically include: extracting first support feature information of the voice information in each support set by using the feature extraction network, and converting the first support feature information into second support feature information related to the recognition task by using the self-attention neural network, where each support set corresponds to one voice type and comprises a plurality of voice information of that type; extracting first sample feature information of the sample voice information by using the feature extraction network, and converting it into second sample feature information related to the recognition task through the self-attention neural network; calculating the similarity between the second sample feature information and each piece of second support feature information; confirming the predicted voice type corresponding to the support set pointed to by the second support feature information with the highest similarity; and back-propagating updates to the feature extraction network and the self-attention neural network according to the predicted voice type and the actual voice type to which the sample voice information belongs.

Preferably, before the training module extracts the first support feature information of the voice information in the support set by using the feature extraction network, the training module is further configured to: force-align the phonemes of the voice information in the support set according to the text information corresponding to that voice information.

Preferably, after the training module back-propagates updates to the feature extraction network and the self-attention neural network according to the predicted voice type and the actual voice type to which the sample voice information belongs, the training module is further configured to: after the training of the two networks is finished, cluster all second support feature information produced by the trained self-attention neural network to obtain a plurality of clusters.

The operation performed by the similarity calculation module 23 to calculate the similarity between the second feature information and the support feature information in each preset support set may further be: calculating the similarity between the second feature information and each cluster to determine the target cluster with the highest similarity; and calculating the similarity between the second feature information and each piece of second support feature information within the target cluster.

Preferably, the operations performed by the training module to extract the first support feature information of the voice information in the support set by using the feature extraction network and to convert it into second support feature information related to the recognition task by using the self-attention neural network may further include: extracting the support feature information of each piece of voice information in the support set by using the feature extraction network to obtain a plurality of first support feature vectors; converting each first support feature vector into second support feature information related to the recognition task by using the self-attention neural network to obtain a plurality of second support feature vectors; and calculating the average of the plurality of second support feature vectors and taking that average as the second support feature information of the support set.
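The averaging step above yields one prototype vector per support set, in the spirit of prototypical few-shot learning. A minimal sketch (variable names are illustrative):

```python
import numpy as np

def support_prototype(second_support_features):
    """Average the second support feature vectors of all utterances in a
    support set into a single vector that represents that voice type."""
    return np.mean(np.asarray(second_support_features, dtype=float), axis=0)

# three utterance-level second support feature vectors for one support set
proto = support_prototype([[1., 2.], [3., 4.], [5., 6.]])
```

Using the mean makes the per-support-set representation robust to utterance-level variation, and it is this single prototype that the similarity calculation module compares against at recognition time.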

Preferably, the feature extraction network comprises one of a CNN network or a ResNet network.

Preferably, the operation performed by the similarity calculation module 23 to calculate the similarity between the second feature information and the support feature information in each preset support set may further be: obtaining a first feature vector representation of the second feature information and a second feature vector representation of the support feature information, respectively; and calculating the cosine similarity between the first feature vector representation and the second feature vector representation.

For other details of the technical solution implemented by each module in the speech recognition apparatus in the above embodiment, reference may be made to the description in the speech recognition method in the above embodiment, and details are not repeated here.

It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the device embodiment is basically similar to the method embodiment, its description is brief; for relevant details, reference may be made to the description of the method embodiment.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the computer device 30 includes a processor 31 and a memory 32 coupled to the processor 31, wherein the memory 32 stores program instructions, and the program instructions, when executed by the processor 31, cause the processor 31 to execute the steps of the speech recognition method according to any of the above embodiments.

The processor 31 may also be referred to as a CPU (Central Processing Unit). The processor 31 may be an integrated circuit chip having signal processing capabilities. The processor 31 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a storage medium according to an embodiment of the invention. The storage medium of the embodiment of the present invention stores program instructions 41 capable of implementing the voice recognition method described in any of the above embodiments. The program instructions 41 may be stored in the storage medium in the form of a software product, and include several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the method described in each embodiment of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, as well as computer equipment such as a computer, server, mobile phone, or tablet.

In the several embodiments provided in the present application, it should be understood that the disclosed computer apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.
