Method and system for real-time re-scoring in speech recognition

Document No.: 344502 | Publication date: 2021-12-03

Note: This technology, "Method and system for real-time re-scoring in speech recognition" (一种语音识别实时重打分的方法和系统), was created by Wang Jinlong, Xu Xinkang, Hu Xinhui, and Chen Ming on 2021-10-13. Abstract: The embodiments of this specification provide a method and system for real-time re-scoring in speech recognition. The method includes acquiring features of speech frames in speech material; acquiring candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames, where the preset re-scoring model is used to correct the scores of the decoding model's speech recognition results in real time; and determining a target speech recognition result based on the candidate speech recognition results.

1. A method for real-time re-scoring in speech recognition, comprising:

acquiring features of speech frames in speech material;

acquiring candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames, wherein the preset re-scoring model is used for correcting the scores of the decoding model's speech recognition results in real time;

and determining a target speech recognition result based on the candidate speech recognition results.

2. The method of claim 1, wherein the preset re-scoring model comprises a pre-saved correction value for the score, and the real-time correction comprises:

obtaining a corrected score by summing the score and the correction value, wherein the corrected score is the real-time corrected score of the decoding model's speech recognition result.

3. The method of claim 1, wherein the preset re-scoring model is a pre-generated model, and the pre-generation comprises:

acquiring a first language model and a second language model, wherein the second language model is obtained by training a preset language model, and the first language model is obtained by pruning the second language model;

generating the decoding model based on the first language model;

generating a first scoring model and a second scoring model based on the first language model and the second language model;

obtaining the preset re-scoring model by merging the first scoring model and the second scoring model.

4. The method of claim 3, wherein obtaining the preset re-scoring model by merging the first scoring model and the second scoring model comprises:

traversing the second scoring model to obtain a second speech recognition result score, and synchronously traversing the first scoring model based on the traversal of the second scoring model to obtain a first speech recognition result score;

updating the second scoring model based on a difference between the first speech recognition result score and the second speech recognition result score;

and determining the preset re-scoring model based on the updated second scoring model.

5. The method of claim 4, wherein synchronously traversing the first scoring model comprises:

determining, in the first scoring model, corresponding arcs for arcs in the second scoring model, wherein:

when an arc consistent with the arc in the second scoring model is found in the first scoring model, the consistent arc is determined as the corresponding arc;

when no arc consistent with the arc in the second scoring model can be found in the first scoring model, the consistent arc reached with the minimum number of backoff steps is determined as the corresponding arc through backoff.

6. The method of claims 2 and 3, wherein the correction value is the difference between the speech recognition result scores of the first scoring model and the second scoring model.

7. A system for real-time re-scoring in speech recognition, comprising a feature acquisition module, a candidate result acquisition module, and a target result determination module, wherein:

the feature acquisition module is used for acquiring features of speech frames in speech material;

the candidate result acquisition module is used for acquiring candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames, and the preset re-scoring model is used for correcting the scores of the decoding model's speech recognition results in real time;

the target result determination module is used for determining a target speech recognition result based on the candidate speech recognition results.

8. The system of claim 7, wherein the preset re-scoring model comprises a pre-saved correction value for the score, and the real-time correction comprises:

obtaining a corrected score by summing the score and the correction value, wherein the corrected score is the real-time corrected score of the decoding model's speech recognition result.

9. The system of claim 8, wherein the preset re-scoring model is a pre-generated model, the system further comprising a language model acquisition module, a decoding model generation module, a scoring model generation module, and a re-scoring model generation module, wherein:

the language model acquisition module is used for acquiring a first language model and a second language model, wherein the second language model is obtained by training a preset language model, and the first language model is obtained by pruning the second language model;

the decoding model generation module is used for generating the decoding model based on the first language model;

the scoring model generation module is used for generating a first scoring model and a second scoring model based on the first language model and the second language model;

the re-scoring model generation module is used for obtaining the preset re-scoring model by merging the first scoring model and the second scoring model.

10. The system of claim 9, wherein the re-scoring model generation module further comprises a score acquisition unit, a model update unit, and a model determination unit, wherein:

the score acquisition unit is used for obtaining a second speech recognition result score by traversing the second scoring model, and for obtaining a first speech recognition result score by synchronously traversing the first scoring model based on the traversal of the second scoring model;

the model update unit is used for updating the second scoring model based on the difference between the first speech recognition result score and the second speech recognition result score;

the model determination unit is used for determining the preset re-scoring model based on the updated second scoring model.

11. The system of claim 10, wherein synchronously traversing the first scoring model comprises:

determining, in the first scoring model, corresponding arcs for arcs in the second scoring model, wherein:

when an arc consistent with the arc in the second scoring model is found in the first scoring model, the consistent arc is determined as the corresponding arc;

when no arc consistent with the arc in the second scoring model can be found in the first scoring model, the consistent arc reached with the minimum number of backoff steps is determined as the corresponding arc through backoff.

12. The system of claims 8 and 9, wherein the correction value is the difference between the speech recognition result scores of the first scoring model and the second scoring model.

13. An apparatus for real-time re-scoring in speech recognition, comprising a processor configured to perform the method of real-time re-scoring in speech recognition as claimed in any one of claims 1-6.

14. A computer-readable storage medium storing computer instructions which, when read by a computer, cause the computer to perform the method of real-time re-scoring in speech recognition as claimed in any one of claims 1-6.

Technical Field

The present disclosure relates to the field of speech recognition, and more particularly, to a method and system for real-time re-scoring in speech recognition.

Background

In speech recognition, a recognition model is typically combined with a real-time re-scoring method to improve recognition quality. Conventional real-time re-scoring requires building decoding networks on the fly and performing path search and computation over them, which makes decoding slow and consumes a large amount of memory. In this scenario, it is desirable to obtain speech recognition results quickly while maintaining accuracy.

It is therefore desirable to provide an efficient method for real-time re-scoring in speech recognition.

Disclosure of Invention

One embodiment of the present disclosure provides a method for real-time re-scoring in speech recognition. The method comprises the following steps: acquiring features of speech frames in speech material; acquiring candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames, wherein the preset re-scoring model is used for correcting the scores of the decoding model's speech recognition results in real time; and determining a target speech recognition result based on the candidate speech recognition results.

One embodiment of the present disclosure provides a system for real-time re-scoring in speech recognition. The system comprises a feature acquisition module, a candidate result acquisition module, and a target result determination module. The feature acquisition module is used for acquiring features of speech frames in speech material; the candidate result acquisition module is used for acquiring candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames, wherein the preset re-scoring model is used for correcting the scores of the decoding model's speech recognition results in real time; and the target result determination module is used for determining a target speech recognition result based on the candidate speech recognition results.

One embodiment of the present specification provides an apparatus for real-time re-scoring in speech recognition, which includes a processor configured to execute the method for real-time re-scoring in speech recognition described in this specification.

One embodiment of the present specification provides a computer-readable storage medium storing computer instructions; when a computer reads the computer instructions in the storage medium, the computer executes the method for real-time re-scoring in speech recognition described in this specification.

Drawings

The present description will be further explained by way of exemplary embodiments, which will be described in detail by way of the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments like numerals are used to indicate like structures, wherein:

FIG. 1 is a schematic diagram of an application scenario of a system for real-time re-scoring of speech recognition according to some embodiments of the present description;

FIG. 2 is a schematic diagram of a system for real-time re-scoring of speech recognition in accordance with some embodiments of the present description;

FIG. 3 is an exemplary flow diagram of a method of speech recognition real-time re-scoring in accordance with some embodiments of the present description;

FIG. 4 is an exemplary flow diagram of a method of generating a pre-set re-scoring model according to some embodiments of the present description;

FIG. 5 is a schematic diagram of a method of generating a re-scoring model, according to some embodiments of the present description;

FIG. 6 is an exemplary diagram of a method of model traversal shown in accordance with some embodiments of the present description;

FIG. 7 is an exemplary diagram of a method of speech recognition real-time re-scoring in accordance with some embodiments of the present description;

FIG. 8 is an exemplary flow diagram of a method of generating a pre-set re-scoring model, according to some embodiments of the present description.

Detailed Description

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples or embodiments of the present description, and that for a person skilled in the art, the present description can also be applied to other similar scenarios on the basis of these drawings without inventive effort. Unless otherwise apparent from the context, or otherwise indicated, like reference numbers in the figures refer to the same structure or operation.

It should be understood that "system", "apparatus", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts, portions or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.

As used in this specification and the appended claims, the singular forms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; the steps and elements do not form an exclusive list, and a method or apparatus may also include other steps or elements.

Flow charts are used in this description to illustrate operations performed by a system according to embodiments of the present description. It should be understood that the operations are not necessarily performed in the exact order shown. Rather, the various steps may be processed in reverse order or simultaneously. Moreover, other operations may be added to the processes, or one or more steps may be removed from them.

FIG. 1 is a schematic diagram of an application scenario of a system for real-time re-scoring of speech recognition according to some embodiments of the present description. A system 100 for voice recognition real-time re-scoring (hereinafter system 100) may include a server 110, a network 120, a storage device 130, a voice capture device 140, and a user 150.

The server 110 may be used to manage resources and process data and/or information from at least one component of the present system or an external data source (e.g., a cloud data center). In some embodiments, the server 110 may be a single server or a group of servers. The server groups may be centralized or distributed. In some embodiments, the server 110 may be local or remote. For example, the server 110 may receive or retrieve voice data collected by the voice collection device 140 and/or information and/or data in the storage device 130 via the network 120. As another example, server 110 may be directly connected to voice capture device 140 and/or storage device 130 to access stored information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform or on a vehicle computer. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tiered cloud, and the like, or any combination thereof. In some embodiments, server 110 may retrieve relevant data and/or information from storage 130 for performing speech recognition real-time re-scoring as illustrated in some embodiments of the present description, e.g., speech data, language models, decoding models, scoring models, etc.

In some embodiments, the server 110 may include a processing engine 112. Processing engine 112 may process information and/or data related to real-time re-scoring of voice recognition to perform one or more functions described herein. In some embodiments, the server 110 may include the models used for real-time re-scoring, e.g., a decoding model and a preset re-scoring model. In some embodiments, the processing engine 112 may decode the voice data through the decoding model to obtain a text-form voice recognition result, and re-score the voice recognition result in real time through the preset re-scoring model to obtain an optimal voice recognition result. In some embodiments, processing engine 112 may generate the decoding model and/or the preset re-scoring model based on an existing language model. In some embodiments, processing engine 112 may obtain the preset re-scoring model by merging multiple scoring models. In some embodiments, server 110 may send data and/or information generated by processing engine 112, such as models and text-form voice recognition results, to storage device 130 for storage. In some embodiments, the server 110 may send the text-form voice recognition result to the voice capture device 140, a user terminal, and/or an output device to feed it back to the user 150 and/or display it.

In some embodiments, processing engine 112 may include one or more processing engines (e.g., a single chip processing engine or a multi-chip processing engine). By way of example only, the processing engine 112 may include a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), an application specific instruction set processor (ASIP), a Graphics Processing Unit (GPU), a Physical Processing Unit (PPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, or the like, or any combination thereof.

Network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of system 100 (e.g., server 110, storage device 130, voice capture device 140) may send information and/or data to other components of system 100 via network 120. For example, the server 110 may obtain voice data collected by the voice capture device 140 via the network 120. As another example, server 110 and voice capture device 140 may retrieve data and/or information from storage device 130 and/or write data and/or information to storage device 130 via network 120. In some embodiments, the network 120 may be any form of wired or wireless network, or any combination thereof. By way of example only, network 120 may include a cable network, a wireline network, a fiber optic network, a telecommunications network, an intranet, the internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, or the like, or any combination thereof.

Storage device 130 may store data and/or instructions. In some embodiments, the storage device 130 may store data acquired from the voice capture device 140, such as captured voice data and the like. In some embodiments, storage 130 may store data and/or instructions used by server 110 to perform or use to perform the exemplary methods described in this application, e.g., decoding models, scoring models, text-to-speech recognition results, and so forth. In some embodiments, storage 130 may include mass storage, removable storage, volatile read-write memory, read-only memory (ROM), and the like, or any combination thereof. In some embodiments, storage device 130 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an internal cloud, a multi-tiered cloud, and the like, or any combination thereof.

In some embodiments, storage device 130 may be connected to network 120 to communicate with one or more components of system 100 (e.g., server 110, voice capture device 140). One or more components of system 100 may access data or instructions stored in storage device 130 via network 120. In some embodiments, storage device 130 may be directly connected to or in communication with one or more components of system 100 (e.g., server 110 and voice capture device 140). In some embodiments, storage device 130 may be part of server 110. In some embodiments, the storage device 130 may be integrated into the voice capture device 140.

The voice collecting device 140 may collect voice data of the user 150 for obtaining a voice recognition result, for example, a text-form voice recognition result. The voice capture device 140 may be any device and/or means that can input and capture voice or that includes a voice input and capture module. In some embodiments, the voice capture device 140 may include a mobile device 140-1, a tablet computer 140-2, a laptop computer 140-3, a microphone 140-4, or the like, or any combination thereof. The mobile device 140-1 may be any mobile/handheld device capable of voice input and acquisition, such as a smartphone, personal digital assistant, handheld smart terminal, etc.; tablet computer 140-2 may be any smart tablet device capable of voice input and capture, e.g., android tablet, iPad, etc.; the laptop computer 140-3 may be any notebook computer or the like into which a voice input module such as a microphone is integrated; the microphone 140-4 may be a separate device or a microphone integrated device, such as a microphone, a microphone integrated headset, a microphone integrated VR device, and the like. In some embodiments, the voice capture device 140 may include a device and/or module for acquiring voice data/material, such as a microphone 140-5 for acquiring voice, or a module for acquiring voice data/material, etc. In some embodiments, the voice capture device 140 may obtain voice data of the user 150, e.g., a conversation, etc., through any voice input device (e.g., a microphone, etc.). In some embodiments, voice capture device 140 may be in communication and/or connection with server 110 and/or storage device 130 via network 120. For example, the voice capture device 140 may provide the retrieved voice data/material to the server 110 via the network 120. In some embodiments, the voice capture device 140 may be directly connected to or integrated within the server 110. In some embodiments, the speech capture device 140 may receive the text speech recognition results returned by the server 110 and present them to the user 150.

User 150 may provide voice data for recognition. The user 150 can provide the voice data to the server 110 through the voice collecting device 140, and the processing engine 112 recognizes the voice data through the model to obtain the text voice recognition result. In some embodiments, the user 150 may obtain the text-to-speech recognition result of the server 110 through the speech capturing device 140, the user terminal, or other device.

It should be noted that system 100 is provided for illustrative purposes only and is not intended to limit the scope of the present application. It will be apparent to those skilled in the art that various modifications and variations can be made in light of the description of the present application. For example, the system 100 may also include a voice database, a source of voice information, and the like. As another example, server 110 and voice capture device 140 may be integral. System 100 may be implemented on other devices to achieve similar or different functionality. However, variations and modifications may be made without departing from the scope of the present application.

FIG. 2 is a schematic diagram of a system for speech recognition real-time re-scoring in accordance with some embodiments of the present description. In some embodiments, the system 200 may include a feature acquisition module 210, a candidate result acquisition module 220, and a target result determination module 230.

The feature acquisition module 210 may be configured to obtain features of speech frames in speech material. For more details on how the features of speech frames in speech material are obtained, reference may be made to fig. 3 and its description.

The candidate result obtaining module 220 may be configured to obtain a candidate speech recognition result through a decoding model and a preset re-scoring model based on features of the speech frame. In some embodiments, the pre-set re-scoring model is used to modify the scores of the speech recognition results of the decoding model in real-time.

The preset re-scoring model refers to a pre-specified model for re-scoring the speech recognition results. In some embodiments, the preset re-scoring model may include pre-saved correction values for the scores of the speech recognition results.

The corrected score is a score obtained by correcting the score of the speech recognition result of the decoding model. In some embodiments, real-time correction may include obtaining a corrected score based on the score and the correction value. In some embodiments, the score of the speech recognition result of the decoding model may be corrected in real time by summing the score with a correction value, and the corrected score may be obtained. For example, a score of 0.5 and a correction value of-0.03 are added, and a corrected score of 0.47 can be obtained.

The target result determination module 230 may be used to determine a target speech recognition result based on the candidate speech recognition results. For more details on how to determine the target speech recognition result based on the candidate speech recognition results, reference may be made to fig. 3 and its description.

In some embodiments, the preset re-scoring model may be a pre-generated model. In some embodiments, the system 200 may further include a language model acquisition module 240, a decoding model generation module 250, a scoring model generation module 260, and a re-scoring model generation module 270.

The language model acquisition module 240 may be configured to acquire a first language model and a second language model. In some embodiments, the second language model may be obtained by training a preset language model. In some embodiments, the first language model may be derived by pruning the second language model. More details on how the first language model and the second language model are obtained can be found in fig. 4 and its description.

The decoding model generation module 250 may be used to generate a decoding model based on the first language model. For more details on how the decoding model is generated based on the first language model, reference may be made to fig. 4 and its description.

Scoring model generation module 260 may be used to generate a first scoring model and a second scoring model based on the first language model and the second language model. Further details regarding how the first scoring model and the second scoring model are generated based on the first language model and the second language model may be found in fig. 4 and its description.

The re-scoring model generation module 270 may be configured to obtain a preset re-scoring model by merging the first scoring model and the second scoring model. More details on how the preset re-scoring model is obtained by merging the first scoring model and the second scoring model can be seen in fig. 4 and its description.

In some embodiments, the re-scoring model generation module 270 may include a score acquisition unit 271, a model update unit 272, and a model determination unit 273.

The score acquisition unit 271 may be configured to obtain the second speech recognition result score by traversing the second scoring model, and to obtain the first speech recognition result score by synchronously traversing the first scoring model based on the traversal of the second scoring model.

In some embodiments, the first scoring model may be traversed synchronously by determining, while traversing the second scoring model, the features in the first scoring model that correspond to features of the second scoring model; these may be features common to both scoring models, such as consistent arcs.

In some embodiments, arcs corresponding to arcs in the second scoring model may be determined in the first scoring model. In some embodiments, when an arc consistent with an arc in the second scoring model is found in the first scoring model, the consistent arc is determined to be the corresponding arc; when no consistent arc can be found in the first scoring model, the consistent arc reached with the fewest backoff steps is determined to be the corresponding arc through backoff. For more details on traversal and backoff, reference may be made to fig. 6 and its description.

The model update unit 272 may be configured to update the second scoring model based on the difference between the first speech recognition result score and the second speech recognition result score. For more details on how the second scoring model is updated based on this difference, see fig. 5 and its description.

The model determination unit 273 may be configured to determine a preset re-scoring model based on the updated second scoring model.

The correction value is the value by which the score of the speech recognition result is corrected. In some embodiments, the correction value may be the difference between the speech recognition result scores of the first scoring model and the second scoring model. For example, if the first speech recognition result score is 0.6 and the second is 0.63, the correction value may be -0.03 or 0.03, depending on the order of subtraction. In some embodiments, the order of subtraction is fixed; for example, the correction value may be defined as the first speech recognition result score minus the second. In the example above, the correction value is then -0.03.

FIG. 3 is an exemplary flow diagram of a method for speech recognition real-time re-scoring in accordance with some embodiments of the present description. As shown in fig. 3, the process 300 includes the following steps.

Step 310, obtaining features of speech frames in the speech material. In some embodiments, step 310 may be performed by feature acquisition module 210.

Features refer to information contained in the speech material, e.g., loudness, pitch, etc. In some embodiments, the features may refer to phonemes in the speech material.

The feature acquisition module 210 can obtain the features of the speech frames in the speech material in various ways and generate feature vectors.

In some embodiments, the feature acquisition module 210 may extract features of the speech frames at fixed time intervals, generating a sequence of feature vectors for the speech. An acoustic model is then used to compute the probability of each phoneme in every speech frame, producing a matrix: each row corresponds to one feature vector (i.e., one frame of speech), each column corresponds to one phoneme, and each element is a phoneme probability, so n frames of speech form an n-row matrix. The number of phonemes is fixed, for example 80. Each frame of speech has a probability value for every phoneme, and the probabilities of all phonemes within a frame sum to 1.
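
As an illustration of this matrix, consider the following minimal Python sketch. It is not from the patent: the acoustic model is replaced by a random stand-in network, and all names are illustrative. It maps n frames of features to per-frame phoneme probabilities whose rows each sum to 1.

```python
import numpy as np

NUM_PHONES = 80        # fixed phoneme inventory size, as in the example above

def phoneme_posteriors(features: np.ndarray) -> np.ndarray:
    """Map an (n_frames, feat_dim) feature matrix to an (n_frames, NUM_PHONES)
    matrix of phoneme probabilities; each row sums to 1."""
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(features.shape[1], NUM_PHONES))  # stand-in acoustic model
    logits = features @ weights
    logits -= logits.max(axis=1, keepdims=True)   # softmax per frame
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

frames = np.random.default_rng(1).normal(size=(5, 13))  # 5 frames of 13-dim features
matrix = phoneme_posteriors(frames)
assert np.allclose(matrix.sum(axis=1), 1.0)  # all phoneme probabilities in a frame sum to 1
```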

Step 320, acquiring candidate speech recognition results through a decoding model and a preset re-scoring model based on the features of the speech frames. In some embodiments, step 320 may be performed by candidate result acquisition module 220.

The candidate speech recognition result is a set comprising at least one speech recognition result from which the target speech recognition result can be determined. The description of the target speech recognition result can be found below.

The candidate speech recognition results may include speech recognition results of the decoding model. In some embodiments, the candidate speech recognition results may include the decoding model's speech recognition results together with their real-time corrected scores.

The decoding model and the preset re-scoring model may be various models that enable decoding and re-scoring. In some embodiments, the decoding model may be a decoding network HCLG. In some embodiments, the pre-set re-scoring model may be a Weighted Finite-State Transducer (WFST).

The candidate result acquisition module 220 may input the matrix and/or the feature vectors into a decoding model (e.g., the decoding network HCLG), resulting in a directed graph structure. The directed graph structure includes a plurality of arcs and nodes; each arc has an input, an output, and a weight. The weights may be used to score a sequence of arcs. A sequence of arcs is an ordered set of arcs that may reflect an ordered set of words, e.g., "today is Monday" or "I am Zhang San". In some embodiments, the sequence of arcs may reflect the order of the entire speech recognition process; for example, the order of the words recognized during speech recognition may be reflected by the order of arcs in the sequence. In some embodiments, the score may represent the confidence and/or accuracy of the speech recognition result; for example, the higher the score, the higher the confidence and accuracy. The input of an arc is a transition ID between phonemes. When the output of an arc is 0, there is no speech recognition result on that arc; when the output is not 0, it corresponds to a speech recognition result, i.e., a word in the word sequence reflected by the sequence of arcs. The preset re-scoring model can re-score the speech recognition results output on the arcs to obtain real-time corrected scores, from which the candidate speech recognition results can be obtained. More details about re-scoring may be found elsewhere in this specification.
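
The following minimal sketch illustrates this directed-graph structure under simplified, assumed types; the `Arc` fields and the `score_path` helper are illustrative, not the patent's actual data structures. Arcs carry an input transition ID, an output word ID (0 meaning no recognition output), and a weight, and a path is scored by summing arc weights.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    input_id: int     # transition ID between phonemes
    output: int       # word ID; 0 means no speech recognition result on this arc
    weight: float     # contributes to the score of the arc sequence
    next_state: int

def score_path(arcs: list[Arc]) -> tuple[list[int], float]:
    """Collect the word sequence and total weight along an ordered arc sequence."""
    words = [a.output for a in arcs if a.output != 0]
    return words, sum(a.weight for a in arcs)

path = [Arc(3, 0, -0.25, 1), Arc(7, 42, -0.5, 2), Arc(9, 17, -0.25, 3)]
print(score_path(path))  # ([42, 17], -1.0) -- two output words, summed weight
```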

In step 330, a target speech recognition result is determined based on the candidate speech recognition results. In some embodiments, step 330 may be performed by target outcome determination module 230.

The target speech recognition result is the finally determined speech recognition result with the highest accuracy and/or confidence, where accuracy and/or confidence may be expressed as a percentage or a score. For example, of two speech recognition results with accuracies of 80% and 90%, the one with 90% accuracy is determined as the target speech recognition result. As another example, among three speech recognition results with scores of 0.8, 0.85, and 0.78, the one with a score of 0.85 is determined as the target speech recognition result.

The target result determination module 230 may determine the target speech recognition result in a variety of ways. In some embodiments, the target result determination module 230 may obtain an optimal speech recognition result based on the scores of the candidate speech recognition results, and determine the optimal speech recognition result as the target speech recognition result.

In some embodiments of the present description, the decoding model's speech recognition results are re-scored with a pre-generated re-scoring model, which saves model-generation time and reduces resource usage during decoding. Because the preset re-scoring model stores the correction values for the scores, the decoding model's scores can be corrected in real time with a simple computation, which speeds up re-scoring: there is no need to search paths and compute in several decoding models, only to look up the re-scoring model. Memory usage is also reduced, since fewer decoding networks are needed.

FIG. 4 is an exemplary flow diagram of a method of generating a pre-set re-scoring model, according to some embodiments of the present description.

In some embodiments, the preset re-scoring model may be a pre-generated model. As shown in fig. 4, the process 400 includes the following steps.

Step 410, a first language model and a second language model are obtained. In some embodiments, step 410 may be performed by language model acquisition module 240.

In some embodiments, the first language model and the second language model may be language models in the ARPA format.

The language model acquisition module 240 may acquire the first language model and the second language model in various ways. In some embodiments, the language model acquisition module 240 may train a preset ARPA-format language model to obtain the second language model; the training may use, but is not limited to, supervised learning. In some embodiments, the language model acquisition module 240 may prune the second language model down to a smaller language model to serve as the first language model. Pruning can be done in various ways: for example, by limiting the number of states and arcs, or by deleting unimportant states and arcs while retaining those with higher scores, as in the sketch below.
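
A minimal sketch of threshold-based pruning, under the assumption that a scoring model is encoded as a dict mapping states to arcs; real systems would prune the ARPA language model itself with dedicated tooling, so this only illustrates the idea of keeping higher-scoring arcs.

```python
def prune_model(fst: dict, weight_floor: float) -> dict:
    """Keep only arcs whose (log-probability) weight is above the floor."""
    pruned = {}
    for state, arcs in fst.items():
        kept = [arc for arc in arcs if arc[1] > weight_floor]
        if kept:
            pruned[state] = kept
    return pruned

# Arcs are (word, weight, next_state); the rare, low-scoring arc is dropped.
second_model = {0: [("today", -0.2, 1), ("rare", -7.5, 2)], 1: [("is", -0.1, 3)]}
first_model = prune_model(second_model, weight_floor=-5.0)
print(first_model)  # {0: [('today', -0.2, 1)], 1: [('is', -0.1, 3)]}
```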

At step 420, a decoding model is generated based on the first language model. In some embodiments, step 420 may be performed by decoding model generation module 250.

The decoding model generation module 250 may generate the decoding model in a variety of ways. In some embodiments, the decoding model generation module 250 may generate the decoding model based on the first language model. In some embodiments, the decoding model generation module 250 may generate the decoding model HCLG.fst from the first language model together with an acoustic model and a lexicon file.

Step 430, generating a first scoring model and a second scoring model based on the first language model and the second language model. In some embodiments, step 430 may be performed by scoring model generation module 260.

The first scoring model and the second scoring model refer to models that can be used to score speech recognition results. In some embodiments, the first scoring model and the second scoring model may score the output of arcs in the directed graph structure (i.e., speech recognition results).

In some embodiments, the first scoring model and the second scoring model may be any models usable for scoring. In some embodiments, the first scoring model and the second scoring model may be weighted finite-state transducers.

In some embodiments, the scales of the first scoring model and the second scoring model may be constrained; for example, the two scales may be unequal. In some embodiments, the scale of the first scoring model is smaller than the scale of the second scoring model.

In some embodiments, there may be an association between the first scoring model and the second scoring model. For example, at least a portion of the first scoring model and the second scoring model are the same. In some embodiments, the first scoring model may be part of the second scoring model.

Scoring model generation module 260 may generate the first scoring model and the second scoring model in a variety of ways. In some embodiments, scoring model generation module 260 may generate the first scoring model and the second scoring model by converting the first language model and the second language model into Finite-State Transducers (FSTs).

Step 440, obtaining a preset re-scoring model by combining the first scoring model and the second scoring model. In some embodiments, step 440 may be performed by the re-scoring model generation module 270.

The re-scoring model generation module 270 may generate the preset re-scoring model in a variety of ways. In some embodiments, the re-scoring model generation module 270 may derive the preset re-scoring model by merging the first scoring model and the second scoring model. For more details on the preset re-scoring model obtained by combining the first scoring model and the second scoring model, reference may be made to fig. 5 and its description.

In some embodiments of the present description, the first language model is derived by pruning the second language model. This arrangement first improves the accuracy of the first language model, since pruning the second language model retains the states and arcs with high scores; second, it reduces workload, since no language model needs to be trained separately to serve as the first language model.

It should be noted that the above descriptions regarding the processes 300, 400 are only for illustration and description, and do not limit the applicable scope of the present specification. Various modifications and changes to the processes 300, 400 may be made by those skilled in the art, guided by the present description. However, such modifications and variations are intended to be within the scope of the present description. For example, step 320 may be incorporated into step 330.

FIG. 5 is a schematic diagram of a method 500 of generating a re-scoring model, shown in accordance with some embodiments of the present description.

A speech recognition result score is the score that a scoring model assigns to a speech recognition result. For example, if the first scoring model scores a speech recognition result 0.6, the first speech recognition result score is 0.6; if the second scoring model scores the result 0.63, the second speech recognition result score is 0.63.

The re-scoring model generation module 270 may obtain the preset re-scoring model in a variety of ways. In some embodiments, the re-scoring model generation module 270 may derive the preset re-scoring model by merging the first scoring model and the second scoring model. In some embodiments, the score acquisition unit 271 may obtain the second speech recognition result score by traversing the second scoring model and, based on that traversal, synchronously traverse the first scoring model to obtain the first speech recognition result score; the model update unit 272 may update the second scoring model based on the difference between the first and second speech recognition result scores; and the model determination unit 273 may determine the preset re-scoring model based on the updated second scoring model. For more details on the traversal, reference may be made to fig. 6 and its description.

The score acquisition unit 271 may traverse in various ways, such as depth-first or breadth-first traversal. In some embodiments, the traversal may be a recursive depth-first traversal.

Updates proceed in step with the traversal. In some embodiments, the model update unit 272 may update the second scoring model while the score acquisition unit 271 traverses the first scoring model and the second scoring model.

In some embodiments, the difference between the speech recognition result scores of the first scoring model and the second scoring model may be determined as the correction value. For example, the second scoring model's speech recognition result score is subtracted from the first scoring model's, and the difference is used as the correction value.

The model update unit 272 may update the second scoring model in various ways. For example, the model update unit 272 may update the second scoring model based on the first and second speech recognition result scores. In some embodiments, the model update unit 272 may replace the second speech recognition result score stored in the second scoring model with the difference between the first and second speech recognition result scores, i.e., the correction value.

In some embodiments of the present description, the difference is used for the update, and the updated second scoring model is determined as the preset re-scoring model. This simplifies generation of the re-scoring model, reduces the complexity of score computation during re-scoring, increases decoding speed, and reduces memory usage.

FIG. 6 is an exemplary diagram of a method 600 of model traversal shown in accordance with some embodiments of the present description.

In some embodiments, the first scoring model and the second scoring model may be a directed graph structure comprising a plurality of arcs and nodes. For more details on the arc, see the relevant description in step 320.

In some embodiments, there are arcs in the first scoring model that correspond to arcs in the second scoring model, i.e., the outputs of the arcs have a particular relationship; for example, they output the same or similar speech recognition results.

Traversal of a scoring model refers to visiting the nodes and arcs of the model to obtain all possible arc sequences of the model. Traversing the model thus yields all possible speech recognition results.

Synchronous traversal means that while an arc sequence is being traversed in the second scoring model, the corresponding arc sequence is searched for in the first scoring model at the same time. Since an arc sequence is an ordered set of arcs, finding the corresponding arc sequence is essentially finding the next corresponding arc in the sequence. For example, when traversing the arc sequence "today is monday night" in the second scoring model, if the arc sequence "today is monday" has already been found in the first scoring model, the next arc must be found in the first scoring model so that the new arc sequence is "today is monday night". In some embodiments, the corresponding arc sequence may be an identical arc sequence or the closest available one; for example, if no exact match for "today is monday night" can be found, "is monday night" may be determined as the corresponding arc sequence.

In some embodiments, the score acquisition unit 271 may determine, in the first scoring model, the arc corresponding to an arc in the second scoring model in a variety of ways. In some embodiments, when an arc consistent with the arc in the second scoring model is found in the first scoring model, the score acquisition unit 271 determines the consistent arc as the corresponding arc. For example, as shown in fig. 6, the arc in the second scoring model is "today is monday night" (i.e., the corresponding output of the arc sequence is "today is monday night", the same below); the score acquisition unit 271 finds an identical arc, "today is monday night", in the first scoring model and determines it as the corresponding arc.

In some embodiments, when no arc consistent with the arc in the second scoring model is found in the first scoring model, the score acquisition unit 271 determines, through backoff, the consistent arc reached with the fewest backoff steps as the corresponding arc. For example, as shown in fig. 6, the arc in the second scoring model is "today is monday night". The score acquisition unit 271 has found the arc corresponding to "today is monday" in the first scoring model, but on further search it finds no arc corresponding to "today is monday night". It therefore backs off once, to the arc corresponding to "is monday", and then continues searching for an arc corresponding to "is monday night". The score acquisition unit 271 finds the arc "is monday night" in the first scoring model, so this consistent arc, reached with the fewest backoff steps, is determined as the corresponding arc.

In some embodiments, the score acquisition unit 271 may traverse the first scoring model synchronously based on arcs in the second scoring model while traversing the second scoring model. For example, when the score obtaining unit 271 traverses to the arc "today is monday night" in the second scoring model, the arc corresponding to the arc "today is monday night" in the second scoring model is determined in the first scoring model.

The number of backoff steps is the number of words that must be removed during backoff. In some embodiments, words are removed one at a time from the front of the word sequence corresponding to the arc, and each removal counts as one backoff step, as in the sketch below.
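
The following sketch illustrates backoff matching over word histories, with histories modeled as tuples; the names and encoding are illustrative, not the patent's.

```python
def find_with_backoff(first_model_histories: set, history: tuple) -> tuple:
    """Return the longest suffix of `history` present in the first model,
    i.e. the consistent match reached with the fewest backoff steps."""
    while history and history not in first_model_histories:
        history = history[1:]  # one backoff step: remove the front word
    return history

histories = {("is", "monday", "night"), ("today", "is", "monday")}
print(find_with_backoff(histories, ("today", "is", "monday", "night")))
# ('is', 'monday', 'night') -- found after one backoff step
```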

FIG. 7 is an exemplary diagram of a method 700 for voice recognition real-time re-scoring in accordance with some embodiments of the present description.

As shown in fig. 7, the speech recognition process includes generating a feature sequence from the user's speech, processing the feature sequence with an acoustic model, and performing a decoding search to obtain the recognition result. The decoding search uses a preset re-scoring model GF.fst and a decoding model HCLG.fst. In some embodiments, the preset re-scoring model GF.fst is generated in advance by the re-scoring model generation module 270 from the first scoring model G1.fst and the second scoring model G2.fst; GF.fst stores, for each speech recognition result, the difference between its scores in G1.fst and G2.fst. For example, from a first speech recognition result score of 0.6 and a second score of 0.63, with the difference defined as the first score minus the second, the re-scoring model generation module 270 obtains a difference of -0.03 and stores it in the preset re-scoring model GF.fst. When re-scoring a speech recognition result in real time, the candidate result acquisition module 220 can directly look up the stored score, i.e., the difference, for that result in GF.fst and sum it with the score obtained through the decoding model HCLG.fst to obtain the final score of the speech recognition result.
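
A minimal sketch of this decode-time lookup, with an illustrative correction table standing in for GF.fst: the stored difference is simply added to the decoder's score, reproducing the 0.5 + (-0.03) = 0.47 example given earlier.

```python
# Illustrative correction table standing in for GF.fst: it stores, per word
# sequence, the first score minus the second score (see fig. 5).
gf_corrections = {("today", "is", "monday"): -0.03}

def rescore(decoder_score: float, words: tuple) -> float:
    """Final score = HCLG.fst score + correction looked up in GF.fst."""
    return decoder_score + gf_corrections.get(words, 0.0)

print(rescore(0.5, ("today", "is", "monday")))  # 0.47
```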

In some embodiments of the present description, re-scoring the decoding model HCLG.fst's speech recognition results with the preset re-scoring model GF.fst first increases decoding speed: a series of computations that previously had to be performed in both the first scoring model G1.fst and the second scoring model G2.fst is replaced by a single lookup in the preset re-scoring model GF.fst, with no further computation, which speeds up the decoding process. Second, it reduces the number of decoding networks and the intermediate variables that must be stored, thereby reducing memory usage.

FIG. 8 is an exemplary flow diagram illustrating a method 800 of generating a pre-set re-scoring model according to some embodiments of the present description.

In some embodiments, each state in the first and second scoring models G1.fst and G2.fst has an additional attribute (e.g., its n-gram order) that is not stored directly in the models' data structures but kept as an independent attribute. As shown in fig. 8, in some embodiments, the n-gram order of each state in the first scoring model G1.fst and the second scoring model G2.fst may be obtained, e.g., by a statistical method, and stored in Sn1 and Sn2, respectively.

As shown in fig. 8, in some embodiments, a backoff model Gback.fst may be constructed from the first scoring model G1.fst. In G1.fst, a typical state has several arcs, corresponding to the inputs it can accept; given one of these inputs, the state returns a weight and jumps to the next state, and if the given input is not accepted by any of these arcs, an error is returned. The backoff model Gback.fst extends G1.fst so that it can accept any input. This relies on the assumption that if an input is not on any arc of a state, an arc that accepts the input can always be found after some number of backoff operations. Gback.fst therefore backs off repeatedly until a matching arc is found, sums the weights of the backoff steps with the weight of the matching arc, returns the sum as the final arc weight, and jumps to the next state of the matching arc.
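
A minimal sketch of such a backoff lookup, with Gback.fst modeled as a dict in which `None` marks the backoff arc of each state (an illustrative encoding, not the patent's): the lookup backs off until the word is accepted and returns the summed weight and the next state.

```python
GBACK = {
    2: {"night": (-0.4, 5), None: (-0.7, 1)},       # trigram state; None is the backoff arc
    1: {"morning": (-1.2, 4), None: (-0.9, 0)},     # bigram state
    0: {"morning": (-2.5, 3), "night": (-2.6, 3)},  # unigram state accepts every word
}

def gback_lookup(state: int, word: str) -> tuple[float, int]:
    """Back off until `word` is accepted; return (summed weight, next state)."""
    total = 0.0
    while word not in GBACK[state]:
        back_weight, state = GBACK[state][None]  # one backoff operation
        total += back_weight
    arc_weight, next_state = GBACK[state][word]
    return total + arc_weight, next_state

print(gback_lookup(2, "night"))    # (-0.4, 5): accepted directly
print(gback_lookup(2, "morning"))  # (-1.9, 4): one backoff (-0.7) plus the arc (-1.2)
```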

As shown in fig. 8, in some embodiments, when traversing the second scoring model G2.fst, the weights of its arcs may be modified using Sn1, Sn2, and the backoff model Gback.fst, and the second scoring model G2.fst with the modified arc weights is saved as the preset re-scoring model GF.fst.

In some embodiments, before the re-scoring model generation module 270 traverses the second scoring model G2.fst, the initial states of G2.fst and of the backoff model Gback.fst are obtained, and this pair of states is used as the entry point for a recursive depth-first traversal. In the recursive function, it is first determined whether the current G2.fst state has already been processed; if so, the function returns directly. If not, all arcs of the current state are traversed and their weights are updated. There are three cases. First, if the input on the arc is not 0, there is a speech recognition result output. Assuming the weight of the arc is w2, the weight w1 corresponding to the input is looked up in the backoff model Gback.fst, and the difference between w2 and w1 is saved on the original arc; the next state of the current arc and the next state found in Gback.fst are then used as arguments for the recursive call. Second, if the input on the arc is 0 but the attributes of the two states are consistent (as determined by Sn1 and Sn2), both models perform a language-model backoff operation, and the same handling as in the first case applies: the weight is modified and the recursion continues with the next states of both models. Third, if the input on the arc is 0 and the attributes of the states are not consistent, only the second scoring model G2.fst performs the backoff operation before the recursive call. A sketch of this recursion is given below.
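
The following is a minimal sketch of the recursion, under the assumptions that G2.fst is encoded as a dict mapping states to `(input, weight, next_state)` arcs with input 0 marking a backoff arc, that the per-state n-gram orders are kept in the dicts sn1/sn2, and that a `gback_lookup` function like the one sketched above is supplied; all names and encodings are illustrative, not the patent's.

```python
def merge(g2, gback_lookup, sn1, sn2, s2, s1, done=None):
    done = set() if done is None else done
    if s2 in done:                  # current G2.fst state already processed
        return
    done.add(s2)
    for i, (inp, w2, nxt2) in enumerate(list(g2[s2])):
        if inp != 0:                # case 1: a speech recognition result is output
            w1, nxt1 = gback_lookup(s1, inp)
            g2[s2][i] = (inp, w2 - w1, nxt2)   # save the difference on the arc
            merge(g2, gback_lookup, sn1, sn2, nxt2, nxt1, done)
        elif sn1[s1] == sn2[s2]:    # case 2: both models back off together
            w1, nxt1 = gback_lookup(s1, inp)
            g2[s2][i] = (inp, w2 - w1, nxt2)
            merge(g2, gback_lookup, sn1, sn2, nxt2, nxt1, done)
        else:                       # case 3: only G2.fst backs off
            merge(g2, gback_lookup, sn1, sn2, nxt2, s1, done)

# Toy data: one arc in G2.fst outputs word 5 with weight -0.5; the stand-in
# lookup says G1.fst scores that word -0.4, so the stored correction is -0.1.
g2 = {0: [(5, -0.5, 1)], 1: []}
sn1, sn2 = {0: 1}, {0: 1, 1: 1}
merge(g2, lambda s, inp: (-0.4, 0), sn1, sn2, s2=0, s1=0)
print(g2)  # {0: [(5, -0.1, 1)], 1: []} (up to floating-point rounding)
```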

The resulting model can be used for decoding. In some embodiments, during decoding, an additional slot may be created for the current state of each decoding path to store a GF.fst state value. When a word is output on the decoding path, i.e., the output value on an arc is not 0, the corresponding weight can be obtained from GF.fst and added to the current decoding path, and the stored GF.fst state value is updated at the same time.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing detailed disclosure is to be regarded as illustrative only and not as limiting the present specification. Various modifications, improvements and adaptations to the present description may occur to those skilled in the art, although not explicitly described herein. Such modifications, improvements and adaptations are proposed in the present specification and thus fall within the spirit and scope of the exemplary embodiments of the present specification.

Also, the description uses specific words to describe embodiments of the description. Reference throughout this specification to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic described in connection with at least one embodiment of the specification is included. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics of one or more embodiments of the specification may be combined as appropriate.

Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Indeed, the claimed embodiments may have fewer than all of the features of a single embodiment disclosed above.

Numerals describing the number of components, attributes, etc. are used in some embodiments, it being understood that such numerals used in the description of the embodiments are modified in some instances by the use of the modifier "about", "approximately" or "substantially". Unless otherwise indicated, "about", "approximately" or "substantially" indicates that the number allows a variation of ± 20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximations that may vary depending upon the desired properties of the individual embodiments. In some embodiments, the numerical parameter should take into account the specified significant digits and employ a general digit preserving approach. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of the range are approximations, in the specific examples, such numerical values are set forth as precisely as possible within the scope of the application.

For each patent, patent application publication, and other material cited in this specification, such as articles, books, specifications, publications, and documents, the entire contents are hereby incorporated by reference, except for any application history document that is inconsistent with or conflicts with the contents of this specification, and except for any document that would limit the broadest scope of the claims of this specification (whether presently appended or later added). If the descriptions, definitions, and/or use of terms in the materials accompanying this specification are inconsistent with or contrary to those in this specification, the descriptions, definitions, and/or use of terms in this specification shall control.

Finally, it should be understood that the embodiments described herein are merely illustrative of the principles of the embodiments of the present disclosure. Other variations are also possible within the scope of the present description. Thus, by way of example, and not limitation, alternative configurations of the embodiments of the specification can be considered consistent with the teachings of the specification. Accordingly, the embodiments of the present description are not limited to only those embodiments explicitly described and depicted herein.
