Voice information processing method and device and electronic equipment

Document No.: 170833  Publication date: 2021-10-29

Reading note: This technology, Voice information processing method, device and electronic equipment (语音信息处理方法、装置和电子设备), was designed and created by 朱耀明 (Zhu Yaoming), 董倩倩 (Dong Qianqian), 王明轩 (Wang Mingxuan) and 李磊 (Li Lei) on 2021-07-28. Its main content is as follows: the embodiments of the present disclosure disclose a voice information processing method and device and electronic equipment. One embodiment of the method comprises: acquiring first acoustic feature information of at least one frame of speech information to be translated; determining, under streaming speech recognition, whether the first acoustic feature information corresponds to complete semantics; and in response to a determination result of yes, performing a translation operation on the first acoustic feature information to obtain a corresponding translation result, which improves the accuracy of the translation result and reduces the output delay of the translation result.

1. A method of processing speech information, comprising:

acquiring first acoustic characteristic information of at least one frame of voice information to be translated;

determining whether the first acoustic characteristic information corresponds to complete semantics under streaming voice recognition;

and in response to a determination result of yes, performing a translation operation on the first acoustic characteristic information to obtain a corresponding translation result.

2. The method of claim 1, wherein the obtaining of the first acoustic feature information of at least one frame of speech information to be translated comprises:

and inputting at least one frame of voice information to be processed into a pre-trained acoustic model to obtain the first acoustic characteristic information.

3. The method of claim 2, wherein the acoustic model comprises: a masked acoustic model.

4. The method of claim 1 or 2, wherein the determining whether the first acoustic feature information corresponds to full semantics under streaming speech recognition comprises:

and inputting the first acoustic characteristic information into a pre-trained preset semantic recognition model, and determining whether the first acoustic characteristic information corresponds to complete semantics or not by using the preset semantic recognition model.

5. The method of claim 4, wherein the preset semantic recognition model comprises a continuous integrate-and-fire (CIF) module.

6. The method of claim 1, wherein the method further comprises:

under the non-streaming voice recognition, receiving multiple frames of voice information to be translated until an input ending instruction of the voice information is detected, acquiring second acoustic characteristic information of the multiple frames of voice information to be translated, and executing translation operation on the second acoustic characteristic information to obtain a corresponding translation result.

7. The method of claim 1 or 6, wherein the translation operation comprises:

and inputting the acoustic characteristic information into a pre-trained translation model to obtain a translation result corresponding to the acoustic characteristic information.

8. A speech information processing model comprising: an acoustic model, a semantic recognition model, and a translation model, wherein,

the acoustic model is used to: receiving at least one frame of voice information to be translated in a streaming voice recognition mode, and extracting first acoustic characteristic information of the at least one frame of voice information to be translated;

the semantic recognition model is used for: receiving the at least one frame of first acoustic feature information in a streaming voice recognition mode, and determining whether the at least one frame of first acoustic feature information corresponds to complete semantics;

the translation model is used for determining a translation result of the first acoustic characteristic information in a streaming voice recognition mode.

9. The model of claim 8, wherein the acoustic model is further to: in a non-streaming voice recognition mode, receiving multiframe voice information to be translated until an input ending instruction of the voice information is detected, and extracting second acoustic characteristic information of the multiframe voice information to be translated;

the semantic recognition model is further configured to: compressing and aligning the second acoustic feature information in a non-streaming speech recognition mode;

the translation model is further to: and determining a translation result of the second acoustic characteristic information in the non-streaming voice recognition mode.

10. A method for training a speech information processing model, applied to the speech information processing model of claim 8 or 9, the speech information processing model including an acoustic model, a semantic recognition model and a translation model, the method comprising:

acquiring a training sample set, wherein the training sample set comprises a plurality of training sample pairs, and each training sample pair comprises original voice information in a first language and a translation result in a second language corresponding to the original voice information;

and inputting the original voice information in the training sample pair into the acoustic model after initial training, taking the translation result as the output of the translation model, and training the voice information processing model to obtain the trained voice information processing model.

11. The method of claim 10, wherein the inputting the original speech information in the training sample pair to an initially trained acoustic model, using the translation result as the output of the translation model, and training the speech information processing model to obtain a trained speech information processing model comprises:

acquiring a sample encoding corresponding to the translation result;

inputting the original voice information into the acoustic model after initial training, and training the semantic recognition model by using the sample encoding of the translation result and a first loss function to obtain the trained semantic recognition model.

12. The method of claim 10, wherein the inputting the original speech information in the training sample pair to an initially trained acoustic model, using the translation result as the output of the translation model, and training the speech information processing model to obtain a trained speech information processing model comprises:

and training the voice information processing model by using a second loss function and a third loss function to obtain the trained voice information processing model.

13. A speech information processing apparatus comprising:

the device comprises an acquisition unit, a determining unit and a translation unit, wherein the acquisition unit is used for acquiring at least one frame of first acoustic characteristic information of voice information to be translated;

the determining unit is used for determining whether the at least one frame of first acoustic feature information meets a preset translation condition under the streaming voice recognition;

and the translation unit is used for executing translation operation on the first acoustic characteristic information in response to the determination result being yes to obtain a corresponding translation result.

14. An apparatus for training a speech information processing model, which is applied to the speech information processing model of claim 8 or 9, wherein the speech information processing model comprises an acoustic model, a semantic recognition model and a translation model, and the apparatus comprises:

the system comprises a sample acquisition unit, a translation unit and a translation unit, wherein the sample acquisition unit is used for acquiring a training sample set, the training sample set comprises a plurality of training sample pairs, and the training sample pairs comprise translation results corresponding to original voice information of a first language and the original voice information of a second language;

and the training unit is used for inputting the original voice information in the training sample pair into the acoustic model after initial training, taking the translation result as the output of the translation model, and training the voice information processing model to obtain the trained voice information processing model.

15. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7, or the method of any one of claims 10-12.

16. A computer-readable medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7, or the method of any one of claims 10 to 12.

Technical Field

The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for processing voice information, and an electronic device.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

Speech Translation (ST) aims at translating source-language speech into target-language text, and is widely applied in various scenarios such as conference lectures, business meetings, cross-border customer service and overseas travel.

The traditional speech translation model usually uses a speech recognition model to convert speech into characters in a source language, and then uses a machine translation model to translate the recognized characters in the source language into a target language.

Recently, end-to-end translation methods have been applied to both non-streaming and streaming speech translation. The inventors have found that some solutions that apply the end-to-end approach to streaming translation slice the source audio at fixed time intervals and treat each speech slice as a unit to be translated for streaming speech translation. However, in real environments the speech length varies and often becomes long, so end-to-end speech translation based on fixed-length slicing either introduces delay or causes translation errors.

Disclosure of Invention

This disclosure is provided to introduce concepts in a simplified form that are further described below in the detailed description. This disclosure is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

The embodiment of the disclosure provides a voice information processing method and device and electronic equipment.

In a first aspect, an embodiment of the present disclosure provides a method for processing voice information, including: acquiring first acoustic characteristic information of at least one frame of voice information to be translated; determining whether the first acoustic characteristic information corresponds to complete semantics under streaming voice recognition; and in response to a determination result of yes, performing a translation operation on the first acoustic characteristic information to obtain a corresponding translation result.

In a second aspect, an embodiment of the present disclosure provides a speech information processing model, including: an acoustic model, a semantic recognition model, and a translation model, wherein the acoustic model is to: receiving at least one frame of voice information to be translated in a streaming voice recognition mode, and extracting first acoustic characteristic information of the at least one frame of voice information to be translated; the semantic recognition model is used for: receiving the at least one frame of first acoustic feature information in a streaming voice recognition mode, and determining whether the at least one frame of first acoustic feature information corresponds to complete semantics; the translation model is used for determining a translation result of the first acoustic characteristic information in a streaming voice recognition mode.

In a third aspect, an embodiment of the present disclosure provides a method for training a speech information processing model, which is applied to the speech information processing model in the second aspect, where the speech information processing model includes an acoustic model, a semantic recognition model, and a translation model, and the method includes: acquiring a training sample set, wherein the training sample set comprises a plurality of training sample pairs, and each training sample pair comprises original voice information in a first language and a translation result in a second language corresponding to the original voice information; and inputting the original voice information in the training sample pair into the acoustic model after initial training, taking the translation result as the output of the translation model, and training the voice information processing model to obtain the trained voice information processing model.

In a fourth aspect, an embodiment of the present disclosure provides a speech information processing apparatus, including: an acquisition unit, configured to acquire at least one frame of first acoustic characteristic information of voice information to be translated; a determining unit, configured to determine whether the at least one frame of first acoustic feature information meets a preset translation condition under streaming voice recognition; and a translation unit, configured to perform a translation operation on the first acoustic characteristic information in response to a determination result of yes, to obtain a corresponding translation result.

In a fifth aspect, an embodiment of the present disclosure provides a training apparatus for a speech information processing model, including: a sample acquisition unit, configured to acquire a training sample set, wherein the training sample set comprises a plurality of training sample pairs, and each training sample pair comprises original voice information in a first language and a translation result in a second language corresponding to the original voice information; and a training unit, configured to input the original voice information in the training sample pair into the acoustic model after initial training, take the translation result as the output of the translation model, and train the speech information processing model to obtain a trained speech information processing model.

In a sixth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for processing speech information according to the first aspect or the method for training a model for processing speech information according to the third aspect.

In a seventh aspect, the disclosed embodiments provide a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method for processing speech information according to the first aspect, or the method for training a speech information processing model according to the third aspect.

According to the voice information processing method and device and the electronic equipment provided by the embodiments of the present disclosure, first acoustic feature information of at least one frame of voice information to be translated is acquired; under streaming voice recognition, it is determined whether the first acoustic feature information corresponds to complete semantics; and in response to a determination result of yes, a translation operation is performed on the first acoustic feature information to obtain a corresponding translation result. In streaming translation, voice information to be translated that carries complete semantics is thus identified automatically and translated, which improves the accuracy of the translation result. In addition, because the scheme does not need to wait for a time period specified by a fixed slicing duration to end before translating voice information with complete semantics, the output delay of the translation result can be reduced.

Drawings

The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.

FIG. 1 is a flow diagram for one embodiment of a method of speech information processing according to the present disclosure;

FIG. 2 is a flow diagram of another embodiment of a method of speech information processing according to the present disclosure;

FIG. 3 is a schematic diagram of the processing of acoustic feature information by the continuous integrate-and-fire (CIF) module in the embodiment of FIG. 2;

FIG. 4 illustrates a schematic structural diagram of a speech information processing model according to the present disclosure;

FIG. 5 shows a schematic flow chart diagram of a method of training a speech information processing model according to the present disclosure;

FIG. 6 is a schematic block diagram of one embodiment of a speech information processing apparatus according to the present disclosure;

FIG. 7 is a schematic diagram illustrating an embodiment of an apparatus for training a speech information processing model according to the present disclosure;

FIG. 8 is an exemplary system architecture to which the voice information processing method and voice information processing apparatus of one embodiment of the present disclosure may be applied;

fig. 9 is a schematic diagram of a basic structure of an electronic device provided according to an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It is noted that the modifiers "a", "an" and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly indicates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

Referring to fig. 1, a flow of one embodiment of a voice information processing method according to the present disclosure is shown. As shown in fig. 1, the voice information processing method includes the following steps:

step 101, obtaining first acoustic characteristic information of at least one frame of voice information to be translated.

The end-to-end speech recognition model can map audio directly to characters or words. The voice information to be translated may be voice information in a first language. The voice information to be translated can be the currently collected voice information of a speaker, or pre-stored voice information of a speaker. The first language here may be any language, such as English, Chinese or French. The translation result may correspond to a target language. The target language may be, for example, any language other than the first language.

The speech information may comprise a sequence of words. Various methods can be used to perform feature extraction on the voice information to obtain the acoustic features of the voice information. Here, the acoustic features of the speech information may be extracted from a log-mel spectrogram of speech.

In a specific practice, the acoustic features of the speech information to be translated can be extracted frame by frame. In some application scenarios, each audio frame may comprise a plurality of sampling points for discretizing the continuous audio signal, e.g. one audio frame may comprise 1024 sampling points. Each audio frame may correspond to a sequence of acoustic features. The acoustic feature sequence of a frame of audio frame may include an acoustic feature sequence composed of the acoustic features of each sampling point, and the acoustic features of each sampling point may include amplitude, phase, frequency, correlation in each dimension, and the like.

The first acoustic feature information of the at least one frame of speech information to be translated may include an acoustic feature sequence corresponding to the at least one frame of speech information to be translated.
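As an illustration of the frame-wise feature extraction described above, the following is a minimal sketch assuming the librosa library and a 1024-sample analysis frame; the function name, sampling rate and mel dimensions are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np
import librosa  # assumed to be available for log-mel feature extraction


def extract_frame_features(audio_path, n_mels=80, frame_length=1024, hop_length=256):
    """Return a log-mel acoustic feature sequence, one row per audio frame."""
    signal, sr = librosa.load(audio_path, sr=16000)  # mono, 16 kHz (assumed)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=frame_length, hop_length=hop_length, n_mels=n_mels
    )
    log_mel = np.log(mel + 1e-6)  # log-mel spectrogram mentioned in the text
    return log_mel.T              # shape: (num_frames, n_mels)
```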

Step 102, under the streaming voice recognition, determining whether the first acoustic feature information corresponds to complete semantics.

End-to-end speech translation to which the present disclosure relates may include a streaming speech recognition mode and a non-streaming speech recognition mode.

In general, the non-streaming speech translation mode refers to a translation mode that receives the entire speech audio to be translated before generating the translated text. The streaming speech translation mode refers to a translation mode in which translation is performed while the voice stream is being received.

If the current speech translation mode is the streaming speech recognition mode, it may be determined whether at least one frame of the first acoustic feature information satisfies a preset translation condition.

The preset translation condition includes that at least one frame of first acoustic characteristic information corresponds to complete semantics.

If the at least one frame of first acoustic feature information corresponds to complete semantics, step 103 may be entered. If not, the acoustic feature sequence of at least one subsequent frame of speech information to be translated is further acquired and appended to the acoustic feature sequence of the previously received frames, so as to obtain updated first acoustic feature information.

That is, when the feature sequence included in at least one frame of the first acoustic feature information corresponds to a complete semantic meaning, the first acoustic feature information is translated.
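The streaming decision just described can be sketched as follows; has_complete_semantics and translate are hypothetical placeholders for the semantic recognition of step 102 and the translation operation of step 103, so this is only an outline of the control flow, not the disclosed implementation.

```python
from typing import Callable, Iterable, List


def streaming_translate(frame_features: Iterable,
                        has_complete_semantics: Callable,
                        translate: Callable) -> List[str]:
    """Accumulate per-frame acoustic features and translate each time the
    accumulated first acoustic feature information carries complete semantics."""
    results = []
    accumulated = []  # first acoustic feature information gathered so far
    for feats in frame_features:
        accumulated.append(feats)
        if has_complete_semantics(accumulated):      # step 102 (e.g. a CIF-style check)
            results.append(translate(accumulated))   # step 103
            accumulated = []                          # start the next semantic unit
    return results
```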

In some optional implementations, the step 102 may include:

and inputting the first acoustic characteristic information into a pre-trained preset semantic recognition model, and determining whether the first acoustic characteristic information corresponds to complete semantics by using the preset semantic recognition model.

The semantic recognition model may be various machine learning models, such as a convolutional neural network model. In some application scenarios, the machine learning model may be, for example, a Continuous Integrate-and-Fire (CIF) model. The CIF model includes a second encoder. The CIF model may compress the first acoustic feature information obtained in step 101, and may determine whether the compressed first acoustic feature information carries complete semantics.

And 103, in response to the determination result being yes, performing translation operation on the first acoustic feature information to obtain a corresponding translation result.

Here, the translation operation is performed on the first acoustic feature information, and the expression of the semantic meaning in the target language can be determined according to the semantic meaning corresponding to the first acoustic feature information, so as to obtain the translation result.

The translation result may be a translation result in a speech form or a translation result in a text form.

In the speech information processing method provided by this embodiment, first acoustic feature information of at least one frame of speech information to be translated is acquired; under streaming speech recognition, it is determined whether the first acoustic feature information corresponds to complete semantics; and in response to a determination result of yes, a translation operation is performed on the first acoustic feature information to obtain a corresponding translation result. In other words, under streaming speech recognition, translation is performed when the first acoustic feature information of at least one frame of speech information to be translated corresponds to complete semantics, so that speech information carrying complete semantics is identified automatically and translated during streaming translation. In contrast, streaming speech recognition in the related art intercepts speech information according to a fixed duration or a fixed number of words and extracts its feature sequence for translation. Source-language speech intercepted in this way may not carry complete semantics, so the resulting target-language translation may fail to reflect the original meaning of the speech to be translated, leading to poor translation results. Because the scheme of this embodiment translates speech information with complete semantics, a more accurate translation result can be obtained, improving the accuracy of the translation result.

In addition, the scheme can translate the voice information to be translated with complete semantics without waiting for the time period specified by the fixed time length to be ended. The output delay of the translation result can be reduced.

In some optional implementation manners of this embodiment, the voice information processing method further includes the following steps:

and 104, receiving multiple frames of voice information to be translated until an input ending instruction of the voice information is detected under non-streaming voice recognition, acquiring second acoustic feature information of the multiple frames of voice information to be translated, and executing translation operation on the second acoustic feature information to obtain a corresponding translation result.

In these optional implementation manners, in the non-streaming voice recognition mode, after all the to-be-translated voice information is received, the second acoustic feature information corresponding to all the to-be-translated voice information is determined. The second acoustic feature information may include a plurality of feature sequences. The plurality of feature sequences may constitute a word vector matrix. And then analyzing the word vector matrix, and executing translation operation on the analyzed word vector matrix to obtain translation results corresponding to all the voice information to be translated. The translation result may be a translation result in a speech form or a translation result in a text form.
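For contrast, a sketch of the non-streaming path under the same placeholder conventions: every frame is buffered until an end-of-input instruction is detected, and only then are the second acoustic feature information and its translation computed. The helper callables are assumptions for illustration.

```python
def non_streaming_translate(frame_iterator, is_end_of_input, extract_features, translate):
    """Buffer all frames until the end-of-input instruction, then translate once."""
    frames = []
    for frame in frame_iterator:
        if is_end_of_input(frame):   # input ending instruction detected
            break
        frames.append(frame)
    second_features = extract_features(frames)  # one feature matrix for the whole input
    return translate(second_features)
```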

In these optional implementations, it is implemented that all the speech information to be translated is translated in the non-streaming speech recognition mode.

That is, the same translation scheme provides both a streaming translation mode and a non-streaming translation mode and performs the corresponding streaming or non-streaming translation according to the mode selected by the user, so that a single translation scheme supports both streaming and non-streaming speech translation.

Referring to fig. 2, a flow diagram of another embodiment of a method of processing voice information in accordance with the present disclosure is shown. As shown in fig. 2, the method comprises the steps of:

step 201, inputting at least one frame of speech information to be processed into a pre-trained acoustic model to obtain the first acoustic feature information.

The acoustic model may be various machine learning models, such as a recurrent neural network model, and the like. The machine learning model may be a pre-trained machine learning model. The machine learning model may convert the input speech information into a sequence of features.

In some application scenarios, the Acoustic Model may be a Masked Acoustic Model (MAM). The masked acoustic model may include an encoder and a prediction head. When training the MAM, sample audio data may be used as the input and a sample encoding corresponding to the sample audio data may be used as the output. During training, the MAM may select 15% of the input audio frames and mask them, and the masked frames are predicted by the model from the surrounding context. The prediction head comprises two feed-forward layers. A preset loss function (e.g., L1 loss) may be used when training the MAM to minimize the gap between the predicted vectors of the masked frames and the true frame vectors.
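A minimal PyTorch-style sketch of the masking objective described above is given below: roughly 15% of the input frames are masked and a two-layer prediction head is trained with an L1 loss against the true frame vectors. The encoder choice, dimensions and module names are illustrative assumptions rather than the original MAM implementation.

```python
import torch
import torch.nn as nn


class MaskedAcousticSketch(nn.Module):
    """Illustrative masked-acoustic-model objective: mask ~15% of frames and
    reconstruct them with an L1 loss (a sketch, not the original MAM)."""

    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # prediction head: two feed-forward layers mapping back to frame vectors
        self.prediction_head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim)
        )

    def forward(self, frames, mask_ratio=0.15):
        # frames: (batch, time, feat_dim)
        mask = torch.rand(frames.shape[:2], device=frames.device) < mask_ratio
        masked = frames.masked_fill(mask.unsqueeze(-1), 0.0)  # mask ~15% of frames
        encoded, _ = self.encoder(masked)
        predicted = self.prediction_head(encoded)
        # L1 loss only on the masked positions, as in the description above
        return nn.functional.l1_loss(predicted[mask], frames[mask])
```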

The acoustic feature information may include, but is not limited to, amplitude, phase, frequency, and correlations in various dimensions. In addition to features of the spoken content, the acoustic feature information includes features of the speaker's voice, so the acoustic feature information for the same utterance may differ from person to person.

Step 202, inputting the first acoustic feature information into a pre-trained preset semantic recognition model, and determining whether the first acoustic feature information corresponds to complete semantics by using the preset semantic recognition model.

In the streaming translation mode, the first acoustic feature information output in step 201 may be input to a pre-trained preset semantic recognition model. And compressing the first acoustic characteristic information by using a preset semantic recognition model, and judging whether the compressed first acoustic characteristic information corresponds to complete semantics.

The preset semantic recognition model may include, for example, a Continuous Integrate-and-Fire (CIF) module. The CIF module may compress and align the first acoustic feature information output in step 201. When compressing the first acoustic feature information, the CIF module may divide the feature weights at a boundary in the first acoustic feature information into two parts, where one part is used for the current compression processing and the other part is used for the next compression processing.

Please refer to FIG. 3, which shows a schematic diagram of the CIF module processing the acoustic feature information. The acoustic feature information includes an acoustic feature sequence and a weight corresponding to each feature vector in the sequence. The weights may represent the amount of information contained in the feature vectors. As shown in FIG. 3, the acoustic feature sequence may be [h1, h2, h3, h4, h5, ...], and the corresponding weight sequence may be [α1, α2, α3, α4, α5, ...]. The feature vectors may be divided into two parts, the first part being used for the current compression, and two or more feature vectors may be integrated into a new feature vector in each compression; the compressed vectors are arranged in the order of the original vectors. As shown in FIG. 3, the weight α2 of the feature vector h2 is split into α21 and α22, and α4 is split into α41 and α42. In each compression, the sum of the weights (including weight components) of the feature vectors taking part in that compression is 1. For example, when the weight α1 of the feature vector h1 and the weight component α21 of the feature vector h2 sum to 1, the objects of the current compression are determined to be the feature vector h1 and the corresponding component of the feature vector h2. As in FIG. 3, the first two compression results are: c1 = α1×h1 + α21×h2; c2 = α22×h2 + α3×h3 + α41×h4, where α22 + α3 + α41 = 1. When the sum of the weights of a group of feature vectors (including the weight components corresponding to the feature components) is 1, these feature vectors can be considered to carry complete semantics, and they can be integrated.

Through the process, the CIF module can determine whether the first acoustic feature information corresponds to the complete semantics and compress the first acoustic feature information corresponding to the complete semantics.
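The weight accumulation and splitting described above can be reproduced in a short numeric sketch. The firing threshold of 1 and the concrete weight values below are illustrative assumptions, chosen so that the two outputs have the same structure as c1 and c2 in FIG. 3.

```python
import numpy as np


def cif_compress(features, weights, threshold=1.0):
    """Integrate weighted feature vectors until the accumulated weight reaches the
    threshold, then fire one compressed vector; the weight of the boundary vector
    is split between the current and the next compression (illustrative sketch)."""
    outputs, acc_weight, acc_vec = [], 0.0, np.zeros_like(features[0])
    for h, a in zip(features, weights):
        if acc_weight + a < threshold:
            acc_weight += a
            acc_vec = acc_vec + a * h
        else:
            first = threshold - acc_weight       # e.g. the alpha21 component
            outputs.append(acc_vec + first * h)  # fire: complete semantics reached
            acc_weight = a - first               # e.g. alpha22, kept for the next round
            acc_vec = acc_weight * h
    return outputs


# Example weights chosen so that c1 = a1*h1 + a21*h2 and c2 = a22*h2 + a3*h3 + a41*h4
feats = [np.full(4, v) for v in (1.0, 2.0, 3.0, 4.0, 5.0)]   # stand-ins for h1..h5
alphas = [0.6, 0.7, 0.4, 0.5, 0.3]                           # illustrative weights
print(cif_compress(feats, alphas))
```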

Step 203, in response to the determination result being yes, inputting the acoustic feature information into a pre-trained translation model, and obtaining a translation result corresponding to the acoustic feature information.

The translation model may be, for example, various sequence models such as a hidden Markov model. Preferably, the translation model may be a Transformer model. The Transformer model consists of an encoder and a decoder, receives the word vector matrix output by the CIF module, and completes the translation. The process by which the Transformer model translates the word vector matrix output by the CIF module into a translation result in the target language may be the same as that of an existing Transformer model, and is not repeated here.
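As a hedged sketch of how the word vector matrix output by the CIF module could be passed through an encoder-decoder Transformer, the snippet below uses torch.nn.Transformer with illustrative dimensions; it is a shape-level outline under those assumptions, not the original translation model.

```python
import torch
import torch.nn as nn

d_model, vocab_size = 256, 1000  # illustrative sizes (assumptions)
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2)
target_embed = nn.Embedding(vocab_size, d_model)
output_proj = nn.Linear(d_model, vocab_size)

# src: compressed vectors fired by the CIF module, shape (src_len, batch, d_model)
src = torch.randn(7, 1, d_model)
# tgt: embedded target-language tokens generated so far, shape (tgt_len, batch, d_model)
tgt_tokens = torch.randint(0, vocab_size, (5, 1))
decoder_states = transformer(src, target_embed(tgt_tokens))  # (tgt_len, batch, d_model)
logits = output_proj(decoder_states)  # scores over the target vocabulary per position
```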

In addition, in some alternative implementations, under non-streaming speech recognition, step 201 may include receiving multiple frames of speech information to be translated until an input end instruction of the speech information is detected, and inputting the received multiple frames of speech information to be translated into the pre-trained acoustic model to obtain second acoustic feature information of the multiple frames of speech information to be translated. The speech information processing method then further comprises step 204 of inputting the second acoustic feature information into the preset semantic recognition model and compressing and aligning it to obtain compressed second acoustic feature information, and step 205 of inputting the compressed second acoustic feature information into the pre-trained translation model to obtain a translation result corresponding to the second acoustic feature information.

According to the voice information processing method provided by the embodiment, the preset semantic recognition model is used for determining the first acoustic feature information with complete semantics, and the translation result of the first acoustic feature information is obtained by using the translation model, so that the speed and the accuracy of end-to-end voice translation can be improved.

In addition, the voice information processing method provided by this embodiment supports both streaming and non-streaming speech translation.

Referring to FIG. 4, a model for speech information processing according to the present disclosure is shown. As shown in fig. 4, the speech information processing model includes: an acoustic model 401, a semantic recognition model 402, and a translation model 403.

The speech information processing model described above may provide a translation mode selection item. The translation mode selection items include a streaming speech recognition mode and a non-streaming speech recognition mode. When the user uses the voice information processing model to perform end-to-end voice information translation, the translation mode can be selected. According to the selection of the translation mode selection item by the user, the voice information processing model works in a streaming voice recognition mode or a non-streaming voice recognition mode. When the translation mode selection item is selected, the source language speech information to be translated may be input to the speech information processing model to be translated by the speech information processing model.

In the streaming speech recognition mode, the acoustic model 401 is configured to receive at least one frame of speech information to be translated, and extract first acoustic feature information of the at least one frame of speech information to be translated.

A semantic recognition model 402, configured to receive the at least one frame of first acoustic feature information, and determine whether the at least one frame of first acoustic feature information corresponds to complete semantics.

The semantic recognition model may include a Continuous Integrate-and-Fire (CIF) module. The CIF module may perform semantic recognition on the first acoustic feature information, and compress and align the first acoustic feature information. The functions performed by the CIF module may refer to the process shown in FIG. 3.

The translation model 403 is configured to determine, when the first acoustic feature information is determined to correspond to complete semantics, a translation result corresponding to the first acoustic feature information, and to determine a translation result corresponding to the second acoustic feature information in the non-streaming mode.

Furthermore, if the speech information processing model operates in a non-streaming speech recognition mode, the acoustic model 401 is configured to: receiving multi-frame voice information to be translated until an input ending instruction of the voice information is detected, and extracting second acoustic characteristic information of the multi-frame voice information to be translated. The semantic recognition model 402 is used to: compressing and aligning the second acoustic feature information; the translation model 403 is used to: the translation result of the second acoustic feature information is determined.

The functions of the models included in the speech information processing model in the speech information processing method may refer to the description of the embodiment shown in fig. 2, which is not repeated here.

Referring to fig. 5, a method for training a speech information processing model provided by the present disclosure is shown. The voice information processing model comprises an acoustic model, a semantic recognition model and a translation model. As shown in fig. 5, the method includes the following steps.

Step 501, a training sample set is obtained, where the training sample set includes a plurality of training sample pairs, and each training sample pair includes original speech information in a first language and a corresponding sample translation result in a second language.

Step 502, inputting the original voice information in the training sample pair into the acoustic model after initial training, taking the translation result as the output of the translation model, and training the voice information processing model to obtain the trained voice information processing model.

In this embodiment, the acoustic model may be a pre-trained model. The acoustic model may be a recurrent neural network model or the like. Preferably, the acoustic model may be a masked acoustic model.

In the training process, the speech information processing model may be trained using the second loss function and the third loss function. The second loss function may be a quantity loss function and the third loss function may be a cross-entropy loss function. In some application scenarios, training may be ended when the sum of the second loss function and the third loss function is minimized. In other application scenarios, training may be ended when a preset number of training iterations has been reached.
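A sketch of how the two loss terms could be combined during training is shown below; reading the "second loss" as a quantity-style loss on the number of vectors fired by the CIF module and the "third loss" as cross entropy on the translation output is an assumption, and the weighting is illustrative.

```python
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss()


def combined_loss(logits, target_tokens, fired_count, target_length, quantity_weight=1.0):
    """Illustrative sum of a cross-entropy translation loss and a quantity-style
    loss that pushes the number of fired CIF vectors toward the target length."""
    ce = cross_entropy(logits.reshape(-1, logits.size(-1)), target_tokens.reshape(-1))
    quantity = torch.abs(fired_count - target_length.float()).mean()
    return ce + quantity_weight * quantity


# Illustrative shapes: batch of 2 utterances, 5 target positions, vocabulary of 100
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
fired = torch.tensor([4.6, 5.2])       # sum of CIF weights per utterance
lengths = torch.tensor([5, 5])
print(combined_loss(logits, targets, fired, lengths))
```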

In some optional implementations, the step 502 described above includes the following sub-steps:

First, a sample encoding corresponding to the sample translation result is obtained.

Secondly, the original voice information is input into the acoustic model after initial training, and the semantic recognition model is trained by using the sample encoding of the sample translation result and the first loss function, so as to obtain the trained semantic recognition model.

The first loss function here may be a quantity loss function.

The semantic recognition model may include a Continuous Integrate-and-Fire (CIF) module.

In these alternative implementations, the semantic recognition model may be trained first, and then the whole speech information processing model is trained. This can reduce the total number of training iterations required for the speech information processing model.

The speech information processing model obtained by the training can work in a streaming speech recognition mode and can also work in a non-streaming speech recognition mode.

With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a speech information processing apparatus, which correspond to the method embodiment shown in fig. 1, and which may be applied in various electronic devices in particular.

As shown in fig. 6, the speech information processing apparatus of the present embodiment includes: an acquisition unit 601, a determination unit 602, and a translation unit 603. The acquiring unit 601 is configured to acquire at least one frame of first acoustic feature information of the speech information to be translated; a determining unit 602, configured to determine whether the at least one frame of first acoustic feature information meets a preset translation condition under streaming speech recognition; and a translating unit 603, configured to perform a translation operation on the first acoustic feature information in response to a yes determination result, so as to obtain a corresponding translation result.

In this embodiment, the specific processing of the obtaining unit 601, the determining unit 602, and the translating unit 603 of the speech information processing apparatus and the technical effects thereof can refer to the related descriptions of step 101, step 102, and step 103 in the corresponding embodiment of fig. 1, which are not described herein again.

In some optional implementations, the obtaining unit 601 is further configured to: and inputting at least one frame of voice information to be processed into a pre-trained acoustic model to obtain the first acoustic characteristic information.

In some alternative implementations, the acoustic model includes a masked acoustic model.

In some optional implementations, the determining unit 602 is further configured to: and inputting the first acoustic characteristic information into a pre-trained preset semantic recognition model, and determining whether the first acoustic characteristic information corresponds to complete semantics or not by using the preset semantic recognition model.

In some alternative implementations, the preset semantic recognition model includes a continuous integrate-and-fire (CIF) module.

In some optional implementations, the voice information processing apparatus further includes a non-streaming voice information processing unit (not shown in the figure), the non-streaming voice information processing unit is configured to: under the non-streaming voice recognition, receiving multiple frames of voice information to be translated until an input ending instruction of the voice information is detected, acquiring second acoustic characteristic information of the multiple frames of voice information to be translated, and executing translation operation on the second acoustic characteristic information to obtain a corresponding translation result.

In some optional implementations, the translating operation includes: and inputting the acoustic characteristic information into a pre-trained translation model to obtain a translation result corresponding to the acoustic characteristic information.

With further reference to fig. 7, as an implementation of the method shown in fig. 5, the present disclosure provides an embodiment of a training apparatus for a speech information processing model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 5, and the apparatus may be applied to various electronic devices.

As shown in fig. 7, the training apparatus for a speech information processing model according to the present embodiment includes: a sample acquisition unit 701 and a training unit 702. The sample acquisition unit 701 is configured to acquire a training sample set, where the training sample set includes a plurality of training sample pairs, and each training sample pair includes original speech information in a first language and a translation result in a second language corresponding to the original speech information; and the training unit 702 is configured to input the original speech information in the training sample pair into the initially trained acoustic model, use the translation result as the output of the translation model, and train the speech information processing model to obtain a trained speech information processing model.

In this embodiment, specific processing of the sample obtaining unit 701 and the training unit 702 of the training apparatus for processing the speech information model and technical effects thereof can refer to the related descriptions of step 501 and step 502 in the corresponding embodiment of fig. 5, which are not repeated herein.

In some optional implementations, the training unit 702 further includes a first training subunit (not shown in the figure), and the first training subunit is configured to: acquire a sample encoding corresponding to the translation result; input the original voice information into the acoustic model after initial training; and train the semantic recognition model by using the sample encoding of the translation result and the first loss function to obtain the trained semantic recognition model.

In some optional implementations, the training unit 702 is further configured to: and training the voice information processing model by using a second loss function and a third loss function to obtain the trained voice information processing model.

Referring to fig. 8, fig. 8 illustrates an exemplary system architecture to which a voice information processing method or a voice information processing apparatus of an embodiment of the present disclosure may be applied.

As shown in fig. 8, the system architecture may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves to provide a medium for communication links between the terminal devices 801, 802, 803 and the server 805. Network 804 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.

The terminal devices 801, 802, 803 may interact with a server 805 over a network 804 to receive or send messages or the like. Various client applications, such as a voice information collection application, may be installed on the terminal devices 801, 802, 803. The client application in the terminal devices 801, 802, and 803 may receive the instruction of the user, and complete a corresponding function according to the instruction of the user, for example, send the collected voice information to the server.

The terminal devices 801, 802, 803 may be hardware or software. When the terminal devices 801, 802, 803 are hardware, they may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like. When the terminal devices 801, 802, 803 are software, they can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.

The server 805 may be a server that provides various services, for example, a server that analyzes the voice information transmitted from the terminal devices 801, 802, 803, obtains an analysis result (a translation result), and transmits the translation result to the terminal devices 801, 802, 803.

It should be noted that the voice information processing method provided by the embodiment of the present disclosure may be executed by a server, and accordingly, the voice information processing apparatus may be provided in the server 805. Further, the voice information processing method may also be executed by the terminal device, and accordingly, the voice information processing apparatus may be provided in the terminal devices 801, 802, 803.

It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to fig. 9, shown is a schematic diagram of an electronic device (e.g., a server or a terminal device of fig. 8) suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 9, the electronic device may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 901, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are also stored. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication means 909 may allow the electronic device to perform wireless or wired communication with other devices to exchange data. While fig. 9 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing apparatus 901.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.

The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire first acoustic characteristic information of at least one frame of voice information to be translated; determine, under streaming voice recognition, whether the first acoustic characteristic information corresponds to complete semantics; and in response to the determination result being yes, execute a translation operation on the first acoustic characteristic information to obtain a corresponding translation result (an illustrative sketch of this streaming flow is given after the following paragraph). Or

Acquire a training sample set, wherein the training sample set comprises a plurality of training sample pairs, each training sample pair comprising original voice information in a first language and a corresponding translation result in a second language; input the original voice information in the training sample pair into the initially trained acoustic model, take the translation result as the target output of the translation model, and train the voice information processing model to obtain a trained voice information processing model.
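By way of a non-limiting illustration of the streaming flow described above, the following Python sketch shows one possible way incoming frames could be buffered, checked for semantic completeness, and translated only once a complete semantic unit has been accumulated. The class and method names (StreamingSpeechTranslator, extract, is_complete, translate) are hypothetical placeholders introduced here for illustration and do not denote the actual models of this disclosure.

    # Illustrative sketch only: buffer frames, extract acoustic features, and
    # translate once the buffered features are judged to carry complete semantics.
    # All model objects and their interfaces are hypothetical placeholders.
    class StreamingSpeechTranslator:
        def __init__(self, acoustic_model, semantic_model, translation_model):
            self.acoustic_model = acoustic_model      # e.g., a masked acoustic model
            self.semantic_model = semantic_model      # e.g., a semantic-completeness detector
            self.translation_model = translation_model
            self.feature_buffer = []                  # accumulated first acoustic feature information

        def on_frame(self, speech_frame):
            # Extract acoustic features for the incoming frame and buffer them.
            self.feature_buffer.append(self.acoustic_model.extract(speech_frame))
            # Under streaming recognition, translate only when the buffered
            # features correspond to complete semantics.
            if self.semantic_model.is_complete(self.feature_buffer):
                result = self.translation_model.translate(self.feature_buffer)
                self.feature_buffer = []              # start accumulating the next semantic unit
                return result
            return None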
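Similarly, the training procedure sketched in the preceding paragraph could, under stated assumptions, be written roughly as follows. The use of PyTorch, the dataset format, the cross-entropy loss, and the single model wrapper are assumptions made for illustration only and are not part of the original disclosure.

    import torch

    # Hedged sketch: each training pair couples original speech in the first
    # language with its translation in the second language; the translation
    # serves as the supervision target for the end-to-end model.
    def train_speech_translation_model(model, train_pairs, epochs=10, lr=1e-4):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)   # 0 assumed to be the padding id
        model.train()
        for epoch in range(epochs):
            total_loss = 0.0
            for speech_features, target_ids in train_pairs:   # list of (features, target token ids)
                optimizer.zero_grad()
                logits = model(speech_features)                # shape: (target_len, vocab_size)
                loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            print(f"epoch {epoch}: mean loss {total_loss / max(len(train_pairs), 1):.4f}")
        return model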

Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or by hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
