Streaming end-to-end speech recognition method, apparatus, and electronic device

Document No.: 193322    Publication date: 2021-11-02

Reading note: This technology, "Streaming end-to-end speech recognition method, apparatus, and electronic device", was designed and created by 张仕良 and 高志付 on 2020-04-30. Its main content is as follows: The embodiments of this application disclose a streaming end-to-end speech recognition method, apparatus, and electronic device. The method includes: performing speech acoustic feature extraction and encoding on the received speech stream frame by frame; partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output; and determining, according to the prediction result, the positions of the activation points that require decoded output, so that the decoder decodes at those positions and outputs the recognition result. The embodiments of this application improve the robustness of a streaming end-to-end speech recognition system to noise, and thereby improve system performance and accuracy.

1. A streaming end-to-end speech recognition method, comprising:

performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

and determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and outputs the recognition result.

2. The method of claim 1,

each block comprises the encoding result corresponding to one frame of the speech stream;

the prediction result comprises: whether the current block contains an activation point that requires decoded output;

the determining, according to the prediction result, the positions of the activation points that require decoded output comprises:

determining the position of a block containing an activation point as the position of that activation point.

3. The method of claim 2, further comprising:

determining an Attention coefficient for the encoding result of each frame, the Attention coefficient describing the probability that the corresponding frame requires decoded output;

and verifying the prediction result against the Attention coefficients.

4. The method of claim 1,

each block comprises the encoding results corresponding to multiple frames of the speech stream;

the method further comprises:

determining an Attention coefficient for the encoding result of each frame, the Attention coefficient describing the probability that the corresponding frame requires decoded output;

the determining, according to the prediction result, the positions of the activation points that require decoded output comprises:

comparing the Attention coefficients of the frames within the same block and sorting them by magnitude;

and determining, according to the number of activation points contained in the block, the positions of that number of frames with the highest Attention coefficients among the encoding results contained in the block as the positions of the activation points.

5. The method of claim 4, further comprising:

and adaptively adjusting the block size according to the predicted frequency of occurrence of activation points.

6. The method according to any one of claims 1 to 5,

the partitioning of the encoding results into blocks comprises:

caching the encoding results;

and when the number of encoding results added to the cache reaches the block size, determining the currently cached frame encoding results as one block.

7. The method of claim 6, further comprising:

and after the prediction processing of a block is completed, deleting the frame encoding results of the block from the cache.

8. A method of building a predictive model, comprising:

obtaining a training sample set, wherein the training sample set comprises a plurality of pieces of block data and labeling information, each piece of block data comprises the encoding results obtained by separately encoding multiple frames of a speech stream, and the labeling information comprises the number of activation points in each block that require decoded output;

and inputting the training sample set into a prediction model for model training.

9. The method of claim 8,

the training sample set includes cases in which the multiple frames of the speech stream corresponding to the same modeling unit are split across different blocks.

10. A method of providing speech recognition services, comprising:

after receiving a call request from an application system, a cloud service system receives a speech stream provided by the application system;

performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions to obtain a speech recognition result;

and returning the speech recognition result to the application system.

11. A method of obtaining speech recognition information, comprising:

an application system submits a call request and a speech stream to be recognized to a cloud service system by calling an interface provided by the cloud service system, whereupon the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream frame by frame, partitions the encoded frames into blocks, and predicts the number of activation points contained in the same block that require decoded output; and after determining, according to the prediction result, the positions of the activation points that require decoded output, decodes at those positions through a decoder to obtain a speech recognition result;

and receiving the speech recognition result returned by the cloud service system.

12. A court self-service case-filing implementation method, comprising:

a self-service case-filing all-in-one machine receiving case-filing request information input by voice;

performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and determines a recognition result;

and entering the recognition result into an associated case information database.

13. A terminal device upgrading method, comprising:

providing upgrade suggestion information to a terminal device;

after receiving an upgrade request submitted by the terminal device, granting the terminal device permission to perform streaming speech recognition in an upgraded mode, wherein performing streaming speech recognition in the upgraded mode comprises: performing speech acoustic feature extraction and encoding on a received speech stream frame by frame, partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output; and after determining, according to the prediction result, the positions of the activation points that require decoded output, decoding at those positions through a decoder to obtain a speech recognition result.

14. The method of claim 13,

the terminal device comprises a smart speaker device.

15. The method of claim 13, further comprising:

and revoking the terminal device's permission to perform streaming speech recognition in the upgraded mode according to a downgrade request submitted by the terminal device.

16. A streaming end-to-end speech recognition apparatus, comprising:

an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream frame by frame;

a prediction unit, configured to partition the encoded frames into blocks and predict the number of activation points contained in the same block that require decoded output;

and an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and outputs the recognition result.

17. An apparatus for building a predictive model, comprising:

a training sample set obtaining unit, configured to obtain a training sample set, wherein the training sample set comprises a plurality of pieces of block data and labeling information, each piece of block data comprises the encoding results obtained by separately encoding multiple frames of a speech stream, and the labeling information comprises the number of activation points in each block that require decoded output;

and an input unit, configured to input the training sample set into a prediction model for model training.

18. An apparatus for providing a speech recognition service, applied to a cloud service system, comprising:

a speech stream receiving unit, configured to receive a speech stream provided by an application system after receiving a call request from the application system;

an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream frame by frame;

a prediction unit, configured to partition the encoded frames into blocks and predict the number of activation points contained in the same block that require decoded output;

an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions to obtain a speech recognition result;

and a recognition result returning unit, configured to return the speech recognition result to the application system.

19. An apparatus for obtaining speech recognition information, applied to an application system, comprising:

a submitting unit, configured to submit a call request and a speech stream to be recognized to a cloud service system by calling an interface provided by the cloud service system, whereupon the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream frame by frame, partitions the encoded frames into blocks, and predicts the number of activation points contained in the same block that require decoded output, and, after determining, according to the prediction result, the positions of the activation points that require decoded output, decodes at those positions through a decoder to obtain a speech recognition result;

and a recognition result receiving unit, configured to receive the speech recognition result returned by the cloud service system.

20. A court self-service case-filing implementation apparatus, applied to a self-service case-filing all-in-one machine, comprising:

a request receiving unit, configured to receive case-filing request information input by voice;

an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream frame by frame;

a prediction unit, configured to partition the encoded frames into blocks and predict the number of activation points contained in the same block that require decoded output;

an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and determines a recognition result;

and an information entry unit, configured to enter the recognition result into an associated case information database.

21. A terminal device upgrading apparatus, comprising:

an upgrade suggestion providing unit, configured to provide upgrade suggestion information to a terminal device;

a permission granting unit, configured to, after receiving an upgrade request submitted by the terminal device, grant the terminal device permission to perform streaming speech recognition in an upgraded mode, wherein performing streaming speech recognition in the upgraded mode comprises: performing speech acoustic feature extraction and encoding on a received speech stream frame by frame, partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output; and after determining, according to the prediction result, the positions of the activation points that require decoded output, decoding at those positions through a decoder to obtain a speech recognition result.

22. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 15.

23. An electronic device, comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of claims 1 to 15.

Technical Field

The present application relates to the field of streaming end-to-end speech recognition technologies, and in particular, to a streaming end-to-end speech recognition method, an apparatus, and an electronic device.

Background

Speech recognition technology is a technology that allows a machine to convert speech signals into corresponding text or commands through a process of recognition and understanding. Among speech recognition approaches, end-to-end speech recognition is receiving increasingly broad attention from both academia and industry. Compared with a traditional hybrid system, end-to-end speech recognition jointly optimizes the acoustic model and the language model within one model, which greatly reduces training complexity and markedly improves performance. However, most end-to-end speech recognition systems mainly perform offline speech recognition and cannot perform streaming real-time speech recognition. That is, recognition can only be performed and a result output after the user has finished speaking a sentence; a recognition result cannot be output while the speech is still being heard.

Some researchers have proposed schemes for streaming end-to-end speech recognition, but the results are not satisfactory. For example, the MoCHA model implements streaming end-to-end speech recognition on top of an Attention-Encoder-Decoder end-to-end speech recognition system. In MoCHA, streaming speech can be converted into speech acoustic features and input to an Encoder, an Attention module determines the activation points that require decoded output, and a Decoder outputs a specific recognition result (also called a token; for example, one Chinese character may correspond to one token) at the position of each activation point.

When training the Attention model, a complete sentence of speech is usually required as a sample, and the positions of the activation points in the speech are labeled to complete the training. However, when prediction is performed with the Attention model, streaming speech recognition is being carried out, so the input to the model is streaming speech rather than a complete sentence. The Attention model is therefore configured to compute an Attention coefficient for each received frame of the speech stream and to determine activation points by comparing the computed coefficient with a preset threshold: if the Attention coefficient of a frame exceeds the threshold, the frame may be taken as an activation point, and the Decoder is notified to output a token at that position. It can be seen that in the MoCHA scheme there is a large mismatch between training and testing, and this mismatch makes MoCHA less robust to noise, so a streaming end-to-end speech recognition system based on MoCHA may suffer a large performance loss in practical tasks. In addition, because the input is a continuous streaming speech signal, the condition of future speech frames is unknown when the Attention coefficient of a given frame is computed; even if the Attention coefficient of the current frame exceeds the threshold, the next frame's coefficient may be larger still, in which case taking the next frame as the activation point would be more accurate. The MoCHA scheme therefore also suffers from relatively low activation point positioning accuracy.
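By way of illustration only (this sketch is not part of the original disclosure), the frame-by-frame threshold decision described above can be pictured as follows; the function, the coefficient values and the 0.5 threshold are all assumed example values:

```python
# Toy sketch of the threshold-based decision described above (a deliberate
# simplification for illustration; the real MoCHA mechanism is more involved).
# Each frame's Attention coefficient is compared with a fixed threshold, so a
# later frame with an even larger coefficient cannot be preferred afterwards.
def threshold_activation_points(attention_coeffs, threshold=0.5):
    positions = []
    for frame_idx, coeff in enumerate(attention_coeffs):
        if coeff > threshold:   # decided without seeing any future frame
            positions.append(frame_idx)
    return positions

# Frame 1 already exceeds the threshold, although frame 2 would be the
# more accurate activation point.
print(threshold_activation_points([0.1, 0.55, 0.9, 0.2]))  # -> [1, 2]
```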

Therefore, how to improve the robustness of the streaming end-to-end speech recognition system to noise, and further improve the system performance and accuracy becomes a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application provides a streaming end-to-end speech recognition method, apparatus, and electronic device, which can improve the robustness of a streaming end-to-end speech recognition system to noise and thereby improve system performance and accuracy.

The application provides the following scheme:

a streaming end-to-end speech recognition method, comprising:

performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

and determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and outputs the recognition result.

A method of building a predictive model, comprising:

obtaining a training sample set, wherein the training sample set comprises a plurality of pieces of block data and labeling information, each piece of block data comprises the encoding results obtained by separately encoding multiple frames of a speech stream, and the labeling information comprises the number of activation points in each block that require decoded output;

and inputting the training sample set into a prediction model for model training.

A method of providing speech recognition services, comprising:

after receiving a call request from an application system, a cloud service system receives a speech stream provided by the application system;

performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions to obtain a speech recognition result;

and returning the speech recognition result to the application system.

A method of obtaining speech recognition information, comprising:

an application system submits a call request and a speech stream to be recognized to a cloud service system by calling an interface provided by the cloud service system, whereupon the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream frame by frame, partitions the encoded frames into blocks, and predicts the number of activation points contained in the same block that require decoded output; and after determining, according to the prediction result, the positions of the activation points that require decoded output, decodes at those positions through a decoder to obtain a speech recognition result;

and receiving the speech recognition result returned by the cloud service system.

A court self-service case-filing implementation method comprises the following steps:

a self-service case-filing all-in-one machine receives case-filing request information input by voice;

performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and determines a recognition result;

and entering the recognition result into an associated case information database.

A terminal device upgrading method comprises the following steps:

providing upgrade suggestion information to a terminal device;

after receiving an upgrade request submitted by the terminal device, granting the terminal device permission to perform streaming speech recognition in an upgraded mode, wherein performing streaming speech recognition in the upgraded mode comprises: performing speech acoustic feature extraction and encoding on a received speech stream frame by frame, partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output; and after determining, according to the prediction result, the positions of the activation points that require decoded output, decoding at those positions through a decoder to obtain a speech recognition result.

A streaming end-to-end speech recognition apparatus comprising:

an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream frame by frame;

a prediction unit, configured to partition the encoded frames into blocks and predict the number of activation points contained in the same block that require decoded output;

and an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and outputs the recognition result.

An apparatus for building a predictive model, comprising:

a training sample set obtaining unit, configured to obtain a training sample set, wherein the training sample set comprises a plurality of pieces of block data and labeling information, each piece of block data comprises the encoding results obtained by separately encoding multiple frames of a speech stream, and the labeling information comprises the number of activation points in each block that require decoded output;

and an input unit, configured to input the training sample set into a prediction model for model training.

An apparatus for providing a speech recognition service, applied to a cloud service system, comprises:

a speech stream receiving unit, configured to receive a speech stream provided by an application system after receiving a call request from the application system;

an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream frame by frame;

a prediction unit, configured to partition the encoded frames into blocks and predict the number of activation points contained in the same block that require decoded output;

an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions to obtain a speech recognition result;

and a recognition result returning unit, configured to return the speech recognition result to the application system.

An apparatus for obtaining speech recognition information, applied to an application system, comprises:

a submitting unit, configured to submit a call request and a speech stream to be recognized to a cloud service system by calling an interface provided by the cloud service system, whereupon the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream frame by frame, partitions the encoded frames into blocks, and predicts the number of activation points contained in the same block that require decoded output, and, after determining, according to the prediction result, the positions of the activation points that require decoded output, decodes at those positions through a decoder to obtain a speech recognition result;

and a recognition result receiving unit, configured to receive the speech recognition result returned by the cloud service system.

A court self-service case-filing implementation apparatus, applied to a self-service case-filing all-in-one machine, comprises:

a request receiving unit, configured to receive case-filing request information input by voice;

an encoding unit, configured to perform speech acoustic feature extraction and encoding on the received speech stream frame by frame;

a prediction unit, configured to partition the encoded frames into blocks and predict the number of activation points contained in the same block that require decoded output;

an activation point position determining unit, configured to determine, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and determines a recognition result;

and an information entry unit, configured to enter the recognition result into an associated case information database.

A terminal device upgrading apparatus comprises:

an upgrade suggestion providing unit, configured to provide upgrade suggestion information to a terminal device;

and a permission granting unit, configured to, after receiving an upgrade request submitted by the terminal device, grant the terminal device permission to perform streaming speech recognition in an upgraded mode, wherein performing streaming speech recognition in the upgraded mode comprises: performing speech acoustic feature extraction and encoding on a received speech stream frame by frame, partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output; and after determining, according to the prediction result, the positions of the activation points that require decoded output, decoding at those positions through a decoder to obtain a speech recognition result.

According to the specific embodiments provided herein, the present application discloses the following technical effects:

By the embodiments of the application, in the process of recognizing a speech stream, the encoded frames can be partitioned into blocks, and the number of activation points in each block that require decoded output can be predicted, so that the specific positions of the activation points within a block can be determined from the prediction result, guiding the decoder to decode at the corresponding activation point positions and output the recognition result. In this way, the position of an activation point is determined without comparing the Attention coefficient against a threshold and is not affected by future frames, so accuracy can be improved. In addition, predicting the number of activation points contained in a block readily achieves high accuracy, so the degree of mismatch between training and prediction is low, the robustness of the streaming end-to-end speech recognition system to noise is improved, and the impact on system performance is small.

Of course, it is not necessary for any product to achieve all of the above-described advantages at the same time for the practice of the present application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. It is apparent that the drawings in the following description cover only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic representation of a scheme provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 3 is a flow chart of a first method provided by an embodiment of the present application;

FIG. 4 is a flow chart of a second method provided by embodiments of the present application;

FIG. 5 is a flow chart of a third method provided by embodiments of the present application;

FIG. 6 is a flow chart of a fourth method provided by embodiments of the present application;

FIG. 7 is a flow chart of a fifth method provided by embodiments of the present application;

FIG. 8 is a flow chart of a sixth method provided by embodiments of the present application;

FIG. 9 is a schematic diagram of a first apparatus provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of a second apparatus provided by an embodiment of the present application;

FIG. 11 is a schematic diagram of a third apparatus provided by an embodiment of the present application;

FIG. 12 is a schematic diagram of a fourth apparatus provided by an embodiment of the present application;

FIG. 13 is a schematic diagram of a fifth apparatus provided by an embodiment of the present application;

FIG. 14 is a schematic view of a sixth apparatus provided by an embodiment of the present application;

fig. 15 is a schematic diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.

In the embodiments of the present application, in order to improve the robustness of the streaming end-to-end speech recognition system to noise and thereby improve system performance, a prediction module may be added to an attention-based end-to-end speech recognition system, as shown in fig. 1. The prediction module first partitions the output of the encoder into blocks, e.g. one block every 5 frames, and then predicts the number of activation points (tokens) contained in each block that require decoded output. The position of each activation point can then be determined from the predicted per-block counts, and the decoder is guided to perform decoded output at those positions. For example, in a specific implementation, since the number of activation points contained in each block has been predicted, the position of an activation point may be determined by combining information such as the Attention coefficient corresponding to each frame. Specifically, if each block contains 5 frames and a certain block is predicted to contain two activation points that require decoded output, the positions of the two frames with the largest Attention coefficients in that block can be determined as the activation point positions, and the decoder can then decode and output at those positions. In this way, the judgment of the activation point position no longer depends on a manually set Attention coefficient threshold; instead, the predicted number of activation points in each block serves as a guide, and the positions of the one or more frames with the largest Attention coefficients in the block are taken as the activation point positions.
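As an illustrative sketch of this block-wise decision logic (not a reference implementation from the disclosure), the following Python generator buffers encoder outputs into blocks, asks a predictor for the per-block token count, and picks the frames with the largest Attention coefficients; the stub callables in the usage example are hypothetical stand-ins for the trained predictor and Attention module:

```python
from typing import Callable, Iterable, Iterator, List, Sequence

def stream_activation_points(
    frame_encodings: Iterable[Sequence[float]],
    attention_coeff: Callable[[Sequence[float]], float],
    predict_token_count: Callable[[List[Sequence[float]]], int],
    block_size: int = 5,
) -> Iterator[int]:
    """Buffer encoder outputs into blocks of `block_size` frames, ask the
    prediction module how many activation points each block contains, and
    yield the absolute positions of the frames with the largest Attention
    coefficients in that block (no fixed threshold is used). A trailing
    partial block is ignored in this sketch."""
    buffer: List[Sequence[float]] = []
    block_start = 0
    for encoding in frame_encodings:
        buffer.append(encoding)
        if len(buffer) == block_size:
            count = predict_token_count(buffer)
            coeffs = [attention_coeff(frame) for frame in buffer]
            ranked = sorted(range(block_size), key=lambda i: coeffs[i], reverse=True)
            for offset in sorted(ranked[:count]):
                yield block_start + offset      # decoder emits a token here
            block_start += block_size
            buffer.clear()                      # drop the block after prediction

# Tiny usage example with stub components standing in for the trained models.
frames = [[float(i)] for i in range(10)]
coeff = lambda f: {2: 0.9, 7: 0.8, 8: 0.85}.get(int(f[0]), 0.1)
predictor = lambda block: 1 if int(block[0][0]) == 0 else 2
print(list(stream_activation_points(frames, coeff, predictor)))  # -> [2, 7, 8]
```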

Of course, in the scheme provided in the embodiments of the present application, the process of training the prediction module may also be mismatched with the process of actually using it for testing. However, the mismatch is only that training can use the actual number of output tokens (Cm) in each block, whereas in actual testing only the output of the predictor is available. Because the accuracy of predicting how many activation points a block contains is very high, exceeding 95% in the tasks evaluated, the mismatch between training and testing is very low, and performance can be improved markedly compared with the existing MoCHA scheme. Moreover, experiments show that the streaming speech recognition scheme provided by the embodiments of the application causes essentially no degradation relative to offline speech recognition based on whole-sentence attention.

In specific implementations, the technical scheme provided by the embodiments of the application can be used in various application scenarios. For example, as shown in fig. 2, a cloud service system may provide a cloud speech recognition service, and if the service needs to implement streaming end-to-end speech recognition, it may be implemented using the solution provided in the embodiments of the present application. Specifically, the cloud service system may provide a prediction model and expose a cloud speech recognition interface to users; multiple users may call the interface from their respective application systems, and after receiving a call the cloud service system may run the relevant processing program to perform streaming speech recognition and return a recognition result. Alternatively, the scheme provided by the embodiments of the present application may also be used in a localized speech recognition system or device, for example, a navigation robot in a shopping mall, a self-service case-filing all-in-one machine in a court, and the like.

The following describes in detail a specific technical solution provided in an embodiment of the present application.

Example one

First, the embodiment provides a streaming end-to-end speech recognition method, referring to fig. 3, including:

S301: performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

in the process of stream-type speech recognition, speech acoustic feature extraction can be performed on a speech stream by taking a frame as a unit, and coding is performed by taking the frame as a unit, and a coder outputs a coding result corresponding to each frame. Further, since the voice stream is continuously input, the operation of encoding the voice stream may be continuously performed. For example, assuming that 60ms is a frame, as the voice stream is received, feature extraction and encoding processing are performed as a frame every 60ms of the voice stream. The purpose of the encoding process is to convert the received speech acoustic features into a new, more discriminative high-level representation, which can usually be present in the form of a vector. Therefore, the encoder may be a multi-layer neural network, and the neural network may be selected from various types, such as DFSMN, CNN, BLSTM, transform, and so on.

S302: partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

in the embodiment of the present application, after obtaining the encoding result, a blocking process may be performed first, and the number of active points may be predicted in units of blocks. In a specific implementation manner, after the coding processing of each frame of voice stream is completed, the coding result may be firstly buffered, and when the number of buffered coding result frames reaches the number of frames corresponding to one block, the currently buffered coding result of each frame may be determined as one block, and the prediction module may predict the number of active points included in the block, which need to be coded and output. For example, if each 5 frames corresponds to one block, the encoder may perform prediction processing once after encoding each 5 frames of the speech stream. In an optional embodiment, after completing prediction of a block, each frame encoding result corresponding to the block may be deleted from the buffer.

Of course, in a specific Attention mechanism system, the encoding result of each frame may also be used to calculate an Attention coefficient and may be weighted and summed with the Attention coefficients before being provided to the decoder as its input. Therefore, in a specific implementation, in order to avoid data interaction or conflict between the prediction module and the Attention module, the output of the encoder may be provided to the prediction module and the Attention module separately; the two modules may correspond to different cache spaces, and each may process the encoding result data in its own cache space, so as to obtain the prediction of the number of activation points and to perform operations such as computing the Attention coefficients.

Because the encoding results need to be partitioned into blocks, a certain delay may be introduced into the speech recognition process, and the size of this delay depends on the block size. For example, with one block every 5 frames, the delay is 5 frames, and so on. In specific implementations, the block size may be determined according to the delay the system can tolerate; in an extreme case, each frame may be treated as a block, and so on.

The prediction module itself can be implemented by a pre-trained model. To train the model, a training sample set may be prepared that contains the encoding results corresponding to speech streams, partitioned into blocks of a certain size, with the number of activation points requiring output in each block labeled. Specifically, when the prediction model is trained, the sample information and the label information are input into an initialized model, and the model parameters are gradually optimized over multiple iterations until the algorithm converges and training ends. If the prediction model is implemented with a deep learning model such as a neural network, adjusting the parameters corresponds to adjusting the weights of each layer in the deep learning model.

After the training of the prediction model is completed, the encoding results contained in a block can simply be input into the prediction model, and the model outputs the number of activation points in that block that require decoded output.

It should be noted that when the prediction model is trained, the actual number of output tokens (Cm) in each block is used, whereas in actual testing only the output of the predictor is available. However, since the accuracy of predicting how many activation points a block contains can be very high, the mismatch between training and testing is very low compared with the MoCHA system, and recognition performance is essentially unaffected.

It should also be noted that, for a speech stream, the average duration of one character is usually about 200 ms; that is, the pronunciation of each character may last about 200 ms while a user is speaking (of course, the actual duration varies between people because speaking rates differ). If 60 ms is one frame, the pronunciation of the same modeling unit (e.g. a character in Chinese, or a word in English) may be spread over multiple consecutive frames. In practice, the same modeling unit usually only needs one frame to be decoded and output, with the features of the surrounding frames associated with that frame. In the embodiments of the present application, multiple frames are grouped into one block, so the frames in which the same modeling unit lies may be split across several different blocks. Therefore, to avoid the different frames corresponding to the same modeling unit all being identified as activation points in different blocks, this situation can be taken into account when training the prediction model; that is, training samples and corresponding label information covering this situation may be included, so that the trained model can handle such cases correctly at test time.

S303: determining, according to the prediction result, the positions of the activation points that require decoded output, so that the decoder decodes at those positions and outputs the recognition result.

After the number of activation points requiring decoded output contained in each block has been predicted, the positions of the activation points can be further determined, so that the decoder can decode at those positions and output the recognition result. In a specific implementation, if a block contains only the encoding result corresponding to one frame of the speech stream, then when the number of activation points in each block is predicted the result is either 0 or 1, and the problem of predicting how many activation points a block contains reduces to predicting whether each block contains an activation point that requires decoded output. That is, the prediction result may simply be whether the current block contains an activation point that requires decoded output. In this case, the position of a block containing an activation point may be directly determined as the position of the activation point. For example, specific prediction results can be shown in table 1:

TABLE 1

Therefore, when blocking is performed in units of one frame, the position of a specific activation point can be determined directly once the number of activation points per block has been predicted. Alternatively, since the Attention coefficient reflects to some extent whether a frame is an activation point, or the probability that it is one, the Attention coefficient of each frame's encoding result can additionally be determined, the Attention coefficient describing the probability that the corresponding frame requires decoded output. The prediction can then be verified against the Attention coefficient. For example, a threshold for the Attention coefficient may be preset; if the block prediction indicates that a frame is an activation point and the frame's Attention coefficient also exceeds the threshold, the confidence that the frame is an activation point is further increased, and so on. Conversely, if the block prediction indicates that a frame is an activation point but the computed Attention coefficient is low, the prediction module may re-predict the frame after adjusting its strategy, for example by combining the features of more surrounding frames. Of course, this approach still uses an Attention coefficient threshold, but since the threshold is only used to verify the block prediction result, its influence on overall system performance is small.
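A minimal sketch of this verification idea for one-frame blocks follows; the 0.5 threshold and the example inputs are assumed values used only for illustration:

```python
# The block predictor gives a 0/1 decision per frame; the Attention coefficient
# is only used afterwards to double-check that decision.
def verify_single_frame_blocks(predictions, attention_coeffs, threshold=0.5):
    confirmed, to_repredict = [], []
    for frame_idx, (is_active, coeff) in enumerate(zip(predictions, attention_coeffs)):
        if is_active and coeff > threshold:
            confirmed.append(frame_idx)      # both signals agree: activation point
        elif is_active:
            to_repredict.append(frame_idx)   # predictor says yes, Attention is low:
                                             # re-predict with more surrounding context
    return confirmed, to_repredict

print(verify_single_frame_blocks([0, 1, 1, 0], [0.1, 0.9, 0.3, 0.2]))
# -> ([1], [2])
```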

In another approach, the same block may contain the encoding results corresponding to multiple frames of the speech stream. In this case the prediction module can only predict how many activation points the block contains, but cannot directly determine at which frames within the block they lie. Thus, in a specific implementation, the positions of the activation points may be determined with the help of each frame's Attention coefficient. Specifically, the Attention coefficient of each frame's encoding result may be determined separately. Then, according to the number of activation points contained in a block, the positions of that number of frames with the highest Attention coefficients among the frame encoding results contained in the block can be determined as the positions of the activation points. That is, if a block is predicted to contain two activation points, the positions of the two frames with the highest Attention coefficients among the frames of that block may be determined as the positions of the two activation points. For example, in one instance, the predicted numbers of activation points, the Attention coefficients, and the determined activation point positions can be shown in table 2:

TABLE 2

In the above table, every 5 frames form a block, so the 0th to 4th frames form one block, the 5th to 9th frames form the next block, and so on. Suppose the prediction module predicts that the first block contains 1 activation point, and the Attention coefficients of the 0th to 4th frames are computed as 0.01, 0.22, 0.78, 0.95, and 0.75. The position of the frame with the highest Attention coefficient among the 0th to 4th frames can then be determined as the position of the activation point: the 3rd frame is the activation point, and the other frames in the block are not activation points and need not be decoded and output. Similarly, suppose the prediction module predicts that the second block contains 2 activation points, and the Attention coefficients of the 5th to 9th frames are computed as 0.63, 0.88, 0.72, 0.58, and 0.93. The positions of the two frames with the highest Attention coefficients among the 5th to 9th frames can then be determined as the activation point positions: the 6th and 9th frames are activation points, and the other frames in the block are not and need not be decoded and output.
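The same selection can be reproduced with a few lines of Python using the coefficients quoted above (the block size of 5 and the predicted counts 1 and 2 follow the example in the text):

```python
# Blocks of 5 frames, predicted counts [1, 2], per-frame Attention coefficients
# as quoted in the example above.
coeffs = [0.01, 0.22, 0.78, 0.95, 0.75,   # block 0: frames 0-4, 1 activation point
          0.63, 0.88, 0.72, 0.58, 0.93]   # block 1: frames 5-9, 2 activation points
counts = [1, 2]

positions = []
for b, count in enumerate(counts):
    block = list(range(b * 5, b * 5 + 5))
    block.sort(key=lambda i: coeffs[i], reverse=True)   # rank frames in the block
    positions.extend(sorted(block[:count]))             # keep the predicted number

print(positions)   # -> [3, 6, 9]: only these frames are decoded and output
```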

It can be seen that, in the manner described in the embodiments of the present application, when the position of an activation point is determined, the Attention coefficient does not need to be compared against a preset threshold; instead, given the predicted number of activation points in a block, the Attention coefficients are compared among the frames contained in that block, and the corresponding number of frames with the larger coefficients are taken as the activation point positions. Training and testing can therefore be carried out uniformly in this manner, which improves the degree of match between training and testing and reduces the impact on system performance. In addition, since the Attention coefficient comparison is performed within the same block and is not affected by future frames, the determined activation point positions are also relatively accurate.

It should be noted that, in specific implementations, the block size may be preset, or preset only as an initial value and then dynamically adjusted during testing according to the actual speech stream. As described above, the number of modeling units (the number of Chinese characters, etc.) input by different users within the same time span, and hence their density, may differ because speaking rates differ. For this reason, the block size may be adaptively adjusted according to the predicted frequency of occurrence of activation points. For example, if activation points are found to occur frequently during prediction, the blocks may be shrunk to shorten the delay; conversely, the blocks may be enlarged, so that the recognition latency of the system follows the speaker's speech rate.
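A rough sketch of such adaptive adjustment is shown below; the bounds and the tokens-per-frame pivots are assumed example values, not values taken from the disclosure:

```python
# Shrink blocks when activation points come fast (fast speaker, lower latency
# wanted), grow them when they come slowly.
def adapt_block_size(current_size, recent_counts, min_size=2, max_size=10):
    frames_seen = current_size * len(recent_counts)
    tokens_per_frame = sum(recent_counts) / max(frames_seen, 1)
    if tokens_per_frame > 0.5 and current_size > min_size:
        return current_size - 1     # dense output: shorten blocks to cut delay
    if tokens_per_frame < 0.2 and current_size < max_size:
        return current_size + 1     # sparse output: longer blocks are acceptable
    return current_size

print(adapt_block_size(5, recent_counts=[3, 4, 3]))   # fast speech -> 4
print(adapt_block_size(5, recent_counts=[0, 1, 0]))   # slow speech -> 6
```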

In summary, according to the embodiments of the present application, in the process of recognizing a speech stream, the encoded frames may be partitioned into blocks and the number of activation points in each block that require decoded output may be predicted, so that the specific positions of the activation points can be determined within a block from the prediction result, guiding the decoder to decode at the corresponding activation point positions and output the recognition result. In this way, the position of an activation point is determined without comparing the Attention coefficient against a threshold and is not affected by future frames, so accuracy can be improved. In addition, predicting the number of activation points contained in a block readily achieves high accuracy, so the degree of mismatch between training and prediction is low, the robustness of the streaming end-to-end speech recognition system to noise is improved, and the impact on system performance is small.

Example two

The second embodiment provides a method for building a prediction model, and referring to fig. 4, the method may specifically include:

S401: obtaining a training sample set, wherein the training sample set comprises a plurality of pieces of block data and labeling information, each piece of block data comprises the encoding results obtained by separately encoding multiple frames of a speech stream, and the labeling information comprises the number of activation points in each block that require decoded output;

S402: inputting the training sample set into a prediction model for model training.

In specific implementations, the training sample set may include cases in which the multiple frames of speech corresponding to the same modeling unit are split across different blocks, so that the situation in which the same modeling unit, such as the same character, is divided across several blocks is covered during training and an accurate prediction result can still be obtained for such cases during testing.
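For illustration, the following sketch shows how such block-level training samples and count labels could be assembled; the toy encodings and activation positions are assumed values:

```python
# Encoder outputs are cut into fixed-size blocks and each block is labelled
# with how many activation points (token positions) fall inside it. Because
# each activation position lies in exactly one block, a modeling unit whose
# frames straddle a block boundary contributes its label to only one block.
def build_block_samples(frame_encodings, activation_positions, block_size=5):
    samples = []
    for start in range(0, len(frame_encodings), block_size):
        block = frame_encodings[start:start + block_size]
        label = sum(start <= p < start + block_size for p in activation_positions)
        samples.append((block, label))
    return samples

encodings = [[float(i)] for i in range(12)]   # 12 toy frame encodings
activations = [3, 6, 9]                       # frames the decoder should fire on
for block, label in build_block_samples(encodings, activations):
    print(len(block), label)   # -> (5, 1), (5, 2), (2, 0)
```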

Example three

The third embodiment describes a scenario in which the scheme provided by the embodiments of the present application is applied in a cloud service system. Specifically, from the perspective of the cloud service side, it provides a method for providing a speech recognition service, which, referring to fig. 5, may specifically include:

S501: after receiving a call request from an application system, a cloud service system receives a speech stream provided by the application system;

S502: performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

S503: partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

S504: determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions to obtain a speech recognition result;

S505: returning the speech recognition result to the application system.

Example four

The fourth embodiment corresponds to the third embodiment and, from the perspective of an application system, provides a method for obtaining speech recognition information, which, referring to fig. 6, may specifically include:

S601: an application system submits a call request and a speech stream to be recognized to a cloud service system by calling an interface provided by the cloud service system, whereupon the cloud service system performs speech acoustic feature extraction and encoding on the received speech stream frame by frame, partitions the encoded frames into blocks, and predicts the number of activation points contained in the same block that require decoded output, and, after determining, according to the prediction result, the positions of the activation points that require decoded output, decodes at those positions through a decoder to obtain a speech recognition result;

S602: receiving the speech recognition result returned by the cloud service system.

Example five

The fifth embodiment introduces an application scenario of the scheme provided by the embodiments of the present application in a court self-service case-filing all-in-one machine. Specifically, referring to fig. 7, the fifth embodiment provides a method for implementing court self-service case filing, which may include:

S701: a self-service case-filing all-in-one machine receives case-filing request information input by voice;

S702: performing speech acoustic feature extraction and encoding on the received speech stream frame by frame;

S703: partitioning the encoded frames into blocks, and predicting the number of activation points contained in the same block that require decoded output;

S704: determining, according to the prediction result, the positions of the activation points that require decoded output, so that a decoder decodes at those positions and determines a recognition result;

S705: entering the recognition result into an associated case information database.

Example six

The foregoing embodiments describe the streaming speech recognition method provided in the embodiments of the present application and its application in specific scenarios. In practice, for application scenarios involving hardware devices such as smart speakers, the functions provided by the embodiments of the present application may not yet have been implemented when a user purchases a specific hardware device, so that such "old" hardware devices can only perform speech recognition in the traditional manner. In the embodiments of the present application, in order to enable such "old" hardware devices to perform streaming speech recognition in the new manner and thus improve the user experience, an upgrade scheme may be provided for terminal devices. For example, in a specific implementation, the processing flow of streaming speech recognition may be provided at the server side, and the hardware device side only needs to submit the collected user speech stream to the server. In this case, the models and other resources needed for speech recognition only need to be stored at the server, and the terminal device side can be upgraded without any hardware change. Of course, streaming speech recognition usually involves collecting user data and submitting it to the server, so in specific implementations a suggestion that an upgrade is available may be pushed to a specific hardware device through the server; if a user wants to upgrade the device, the user can express this need by voice input or the like, a specific upgrade request is then submitted to the server, and the server processes the upgrade request. The server may further determine the state of the specific hardware device, for example, whether the relevant user has paid the corresponding resources for the upgraded service, and if so, may grant the permission to perform streaming speech recognition in the upgraded mode. The hardware device can then perform streaming speech recognition in the manner provided by the embodiments of the present application while conversing with the user. Specifically, the streaming speech recognition function may be completed by the server; alternatively, where the hardware device's own resources can support it, the updated recognition model may be pushed directly to the specific hardware device, which completes streaming speech recognition locally, and so on.

In addition, for the case where the specific model is stored on the server, a "switch" function may be provided so that the user enables the function only when necessary, thereby saving resources and the like. For example, when the user only needs to use the device in a home scenario, where the requirements on recognition accuracy and the like are not high, a request to close the above advanced function (i.e. the recognition method provided in the embodiment of the present application) may be submitted by issuing a voice instruction or the like; the server may then temporarily close the function for that user, and if charging is involved, charging may also be stopped. In that case it may be acceptable for the hardware device to revert to the original mode of streaming voice recognition, or even to wait until the user has finished speaking a sentence before recognizing it. If the user later needs to use the hardware device in a working scenario, the advanced function provided in the embodiment of the present application may be re-enabled, and so on.

Specifically, an embodiment of the present application provides an apparatus upgrading method, and referring to fig. 8, the method may specifically include:

s801: providing upgrade suggestion information to the terminal device;

s802: after receiving an upgrade request submitted by a terminal device, granting the terminal device the permission to perform streaming voice recognition in an upgraded mode, wherein performing streaming voice recognition in the upgraded mode comprises: performing voice acoustic feature extraction and coding on a received voice stream by taking a frame as a unit, performing block processing on the encoded frames, and predicting the number of activation points that are contained in the same block and need to be encoded and output; and after determining, according to the prediction result, the position of an activation point that needs to be decoded and output, decoding at the position of the activation point through a decoder to obtain a voice recognition result.

The terminal device may specifically include a smart speaker device and the like.

In a specific implementation, the permission to perform streaming voice recognition in the upgraded mode may also be revoked for the terminal device according to a downgrade request submitted by the terminal device.
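As an illustration only, the following sketch shows one possible way the server-side permission bookkeeping for the upgrade and downgrade requests described above could be organized. The class name, the in-memory storage, and the payment check are all illustrative assumptions rather than the actual implementation of this application.

```python
class UpgradeAuthority:
    """Tracks which terminal devices are allowed to use the upgraded
    streaming-recognition mode (a hypothetical sketch, not the real service)."""

    def __init__(self):
        self._upgraded_devices = set()

    def handle_upgrade_request(self, device_id, has_paid_for_upgrade):
        # Grant the upgraded-mode permission only if the user has obtained
        # (e.g. paid for) the upgraded service, as discussed above.
        if has_paid_for_upgrade:
            self._upgraded_devices.add(device_id)
            return True
        return False

    def handle_downgrade_request(self, device_id):
        # Close the upgraded-mode permission when a downgrade request arrives.
        self._upgraded_devices.discard(device_id)

    def may_use_upgraded_mode(self, device_id):
        return device_id in self._upgraded_devices


authority = UpgradeAuthority()
authority.handle_upgrade_request("speaker-001", has_paid_for_upgrade=True)
print(authority.may_use_upgraded_mode("speaker-001"))  # True
authority.handle_downgrade_request("speaker-001")
print(authority.may_use_upgraded_mode("speaker-001"))  # False
```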

For the parts that are not described in detail in the second to sixth embodiments, reference may be made to the description in the first embodiment, which is not described herein again.

It should be noted that the embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the scheme described herein within the scope permitted by applicable law and under conditions that satisfy the applicable laws and regulations of the relevant country (for example, with the user's explicit consent, after the user has been informed, and so on).

Corresponding to the first embodiment, an embodiment of the present application further provides a streaming end-to-end speech recognition apparatus, and referring to fig. 9, the apparatus may specifically include:

an encoding unit 901, configured to perform speech acoustic feature extraction and encoding on a received speech stream by using a frame as a unit;

a prediction unit 902, configured to perform block processing on a frame that has been encoded, and predict the number of active points that need to be encoded and output and are included in the same block;

an active point position determining unit 903, configured to determine, according to the prediction result, the position where an activation point that needs to be decoded and output is located, so that the decoder decodes at the position of the activation point and outputs a recognition result.

Wherein, the block comprises a coding result corresponding to a frame of voice stream;

the prediction result comprises: whether the current block contains an activation point needing encoding output or not;

the active point position determining unit may be specifically configured to:

and determining the position of the block containing the activation point as the position of the activation point.

At this time, the apparatus may further include:

an Attention coefficient determining unit, configured to respectively determine the Attention coefficient of each frame encoding result; the Attention coefficient is used for describing the probability that the corresponding frame needs to be decoded and output;

and the verification unit is used for verifying the prediction result according to the Attention coefficient.
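For the case where each block corresponds to a single encoded frame, the verification performed by the verification unit can be pictured as the following sketch, in which the 0.5 threshold on the Attention coefficient is purely an illustrative assumption.

```python
def verify_prediction(predicted_has_activation, attention_coeff, threshold=0.5):
    """Cross-check the block-level prediction (activation point present or not)
    against the frame's Attention coefficient; return True when they agree."""
    attention_says_activation = attention_coeff >= threshold
    return predicted_has_activation == attention_says_activation

print(verify_prediction(True, 0.82))   # True  -> prediction confirmed
print(verify_prediction(True, 0.07))   # False -> prediction questionable
```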

Or, the block includes a coding result corresponding to the multi-frame voice stream;

at this time, the apparatus may further include:

an Attention coefficient determining unit, configured to respectively determine the Attention coefficient of each frame encoding result; the Attention coefficient is used for describing the probability that the corresponding frame needs to be decoded and output;

the active point position determining unit may specifically be configured to:

comparing the Attention coefficients of all frames in the same block and sorting them by magnitude;

and determining, according to the number of activation points contained in the block, the positions of the frames whose Attention coefficients rank highest among the frame encoding results contained in the block as the positions of the activation points.
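The selection just described amounts to keeping the top-ranked frames by Attention coefficient within the block. The sketch below illustrates this with NumPy; the coefficient values and block size are illustrative assumptions.

```python
import numpy as np

def select_activation_frames(attention_coeffs, num_activations):
    """Given the per-frame Attention coefficients of one block and the predicted
    number of activation points, return the in-block indices of the frames whose
    coefficients rank highest (a sketch of the selection rule described above)."""
    coeffs = np.asarray(attention_coeffs)
    order = np.argsort(coeffs)[::-1]   # frames sorted by coefficient, largest first
    return sorted(order[:num_activations].tolist())

# A block of 6 encoded frames predicted to contain 2 activation points.
print(select_activation_frames([0.05, 0.40, 0.10, 0.35, 0.02, 0.08], 2))  # [1, 3]
```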

At this time, the apparatus may further include:

and the block adjusting unit is used for adaptively adjusting the size of the block according to the predicted occurrence frequency of the activation point.

The prediction unit may specifically include:

a caching subunit, configured to cache the encoding results;

and a block determining subunit, configured to determine the frame encoding results currently cached as one block when the number of encoded frames added to the cache reaches the block size.

Specifically, the apparatus may further include:

a unit configured to delete the frame encoding results of the block from the cache after the prediction processing of the block is completed.
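The cooperation of the caching subunit, the block determining subunit, and the deletion of processed frames from the cache can be sketched as follows; the class and method names are illustrative assumptions, not the actual module division of this application.

```python
from collections import deque

class BlockBuffer:
    """Caches per-frame encoding results, emits one block whenever the number of
    cached frames reaches the block size, and removes the emitted frames from
    the cache once the block has been handed over for prediction."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._cache = deque()

    def add_frame(self, encoded_frame):
        """Add one frame's encoding result; return a completed block when enough
        frames have accumulated, otherwise None."""
        self._cache.append(encoded_frame)
        if len(self._cache) >= self.block_size:
            # Hand over one block and delete its frames from the cache.
            return [self._cache.popleft() for _ in range(self.block_size)]
        return None


buf = BlockBuffer(block_size=4)
for t in range(10):
    block = buf.add_frame(f"enc_{t}")
    if block is not None:
        print(block)   # ['enc_0'..'enc_3'], then ['enc_4'..'enc_7']
```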

Corresponding to the second embodiment, the embodiment of the present application further provides an apparatus for building a prediction model, referring to fig. 10, the apparatus includes:

a training sample set obtaining unit 1001, configured to obtain a training sample set, where the training sample set includes multiple pieces of block data and labeling information, each piece of block data includes encoding results obtained by encoding multiple frames of a speech stream, and the labeling information includes the number of activation points that need to be decoded and output and are contained in each block;

an input unit 1002, configured to input the training sample set into a prediction model for model training.

The training sample set includes cases in which the multi-frame voice stream corresponding to the same modeling unit is divided into different blocks.
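As an illustration of what such a training sample set might look like, the sketch below pairs each block of encoded frames with a label giving how many activation points fall inside that block; the frame count, block size, and activation positions are all illustrative assumptions.

```python
def build_training_samples(encoded_frames, activation_frame_indices, block_size):
    """Split encoded frames into blocks and label each block with the number of
    activation points it contains (a hypothetical sketch of sample construction)."""
    activation_set = set(activation_frame_indices)
    samples = []
    for start in range(0, len(encoded_frames), block_size):
        block = encoded_frames[start:start + block_size]
        label = sum(1 for idx in range(start, start + len(block)) if idx in activation_set)
        samples.append((block, label))
    return samples

# 10 encoded frames, activation points at frames 2 and 7, block size 4.
frames = [f"h{t}" for t in range(10)]
print([label for _, label in build_training_samples(frames, [2, 7], 4)])  # [1, 1, 0]
```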

Corresponding to the third embodiment, the embodiment of the present application further provides an apparatus for providing a voice recognition service, and referring to fig. 11, the apparatus is applied to a cloud service system, and includes:

a voice stream receiving unit 1101, configured to receive a voice stream provided by an application system after receiving a call request of the application system;

an encoding unit 1102, configured to perform speech acoustic feature extraction and encoding on a received speech stream in units of frames;

a prediction unit 1103, configured to perform block processing on a frame that has been encoded, and predict the number of active points that need to be encoded and output and are included in the same block;

an active point position determining unit 1104, configured to determine, according to the prediction result, a position where an active point that needs to be decoded and output is located, so that a decoder decodes at the position where the active point is located to obtain a speech recognition result;

a recognition result returning unit 1105, configured to return the voice recognition result to the application system.

Corresponding to the fourth embodiment, an embodiment of the present application further provides an apparatus for obtaining speech recognition information, and referring to fig. 12, the apparatus is applied to an application system, and includes:

a submitting unit 1201, configured to submit a call request and a voice stream to be recognized to a cloud service system by calling an interface provided by the cloud service system, where the cloud service system performs voice acoustic feature extraction and coding on the received voice stream in units of frames, performs blocking processing on the coded frames, and predicts the number of active points included in the same block that need to be coded and output; after the position of an activation point needing decoding output is determined according to the prediction result, decoding is carried out at the position of the activation point through a decoder to obtain a voice recognition result;

a recognition result receiving unit 1202, configured to receive a voice recognition result returned by the cloud service system.
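From the application-system side, the call to the cloud service can be pictured as the following sketch; the endpoint URL, request fields, and response format are illustrative assumptions and not a published interface of any cloud service system.

```python
import json
import urllib.request

def recognize_via_cloud(audio_chunks, endpoint="https://asr.example.com/stream"):
    """Submit audio chunks to a (hypothetical) cloud recognition endpoint and
    collect the partial recognition results returned for each chunk."""
    results = []
    for chunk in audio_chunks:
        request = urllib.request.Request(
            endpoint,
            data=json.dumps({"audio": chunk.hex()}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            body = json.loads(response.read().decode("utf-8"))
            results.append(body.get("text", ""))
    return results

# Usage (would require a real endpoint): recognize_via_cloud([b"\x00\x01", b"\x02\x03"])
```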

Corresponding to the fifth embodiment, the embodiment of the present application further provides a court self-help filing implementation device, referring to fig. 13, where the device is applied to a self-help filing all-in-one machine device, and includes:

a request receiving unit 1301, configured to receive case filing request information input by voice;

an encoding unit 1302, configured to perform speech acoustic feature extraction and encoding on a received speech stream by using a frame as a unit;

a prediction unit 1303, configured to perform block processing on the encoded frame, and predict the number of active points that need to be encoded and output and are included in the same block;

an active point position determining unit 1304, configured to determine, according to the prediction result, the position where an activation point that needs to be decoded and output is located, so that the decoder decodes at the position of the activation point and determines a recognition result;

an information entry unit 1305, configured to enter the recognition result into an associated case information database.

Corresponding to the sixth embodiment, an embodiment of the present application further provides a terminal device upgrading apparatus, referring to fig. 14, where the apparatus may include:

an upgrade advice providing unit 1401 configured to provide upgrade advice information to the terminal device;

an authority granting unit 1402, configured to, after receiving an upgrade request submitted by a terminal device, grant an authority to perform streaming voice recognition in an upgraded mode to the terminal device, where performing streaming voice recognition in the upgraded mode includes: performing voice acoustic feature extraction and coding on a received voice stream by taking a frame as a unit, performing block processing on a coded frame, and predicting the number of active points which are contained in the same block and need to be coded and output; and after determining the position of an activation point needing decoding output according to the prediction result, decoding the position of the activation point through a decoder to obtain a voice recognition result.

In addition, the present application also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method described in any of the preceding method embodiments.

And an electronic device comprising:

one or more processors; and

a memory associated with the one or more processors for storing program instructions that, when read and executed by the one or more processors, perform the steps of the method of any of the preceding method embodiments.

Fig. 15 illustratively shows the architecture of an electronic device. For example, device 1500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, an aircraft, or the like.

Referring to fig. 15, device 1500 may include one or more of the following components: processing components 1502, memory 1504, power components 1506, multimedia components 1508, audio components 1510, input/output (I/O) interfaces 1512, sensor components 1514, and communication components 1516.

The processing component 1502 generally controls overall operation of the device 1500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing element 1502 may include one or more processors 1520 executing instructions to perform all or a portion of the steps of the methods provided by the disclosed solution. Further, processing component 1502 may include one or more modules that facilitate interaction between processing component 1502 and other components. For example, the processing component 1502 may include a multimedia module to facilitate interaction between the multimedia component 1508 and the processing component 1502.

The memory 1504 is configured to store various types of data to support operation at the device 1500. Examples of such data include instructions for any application or method operating on device 1500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 1506 provides power to the various components of the device 1500. The power components 1506 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 1500.

Multimedia component 1508 includes a screen that provides an output interface between device 1500 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, multimedia component 1508 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 1500 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 1510 is configured to output and/or input audio signals. For example, the audio component 1510 includes a Microphone (MIC) configured to receive external audio signals when the device 1500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1504 or transmitted via the communication component 1516. In some embodiments, audio component 1510 also includes a speaker for outputting audio signals.

The I/O interface 1512 provides an interface between the processing component 1502 and peripheral interface modules, which can be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 1514 includes one or more sensors for providing status assessment of various aspects of the device 1500. For example, the sensor assembly 1514 can detect an open/closed state of the device 1500 and the relative positioning of components, such as the display and keypad of the device 1500; the sensor assembly 1514 can also detect a change in position of the device 1500 or a component of the device 1500, the presence or absence of user contact with the device 1500, the orientation or acceleration/deceleration of the device 1500, and a change in temperature of the device 1500. The sensor assembly 1514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 1516 is configured to facilitate wired or wireless communication between the device 1500 and other devices. The device 1500 may access a wireless network based on a communication standard, such as WiFi, or a mobile communication network such as 2G, 3G, 4G/LTE, 5G, etc. In an exemplary embodiment, the communication component 1516 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1516 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 1500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 1504 comprising instructions, executable by the processor 1520 of the device 1500 to perform the methods provided by the present disclosure is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The streaming end-to-end speech recognition method, device and electronic device provided by the present application are introduced in detail above, and a specific example is applied in the present application to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, the specific embodiments and the application range may be changed. In view of the above, the description should not be taken as limiting the application.
