Speech recognition method, speech recognition device, computer equipment and readable storage medium

Document No.: 1939856    Publication date: 2021-12-07

Reading note: This technology, Speech recognition method, speech recognition device, computer equipment and readable storage medium (语音识别的方法、装置、计算机设备及可读存储介质), was designed and created by 林永业 and 王珺 on 2021-05-13. Its main content is as follows: The application discloses a speech recognition method, a speech recognition apparatus, a computer device, and a readable storage medium, belonging to the technical field of artificial intelligence. The method comprises: acquiring a speech signal and inputting the speech signal into a waveform encoder; obtaining, through the waveform encoder, a first feature matrix corresponding to the speech signal, dividing the first feature matrix into at least two first feature segments, performing local feature extraction on the at least two first feature segments to obtain at least two second feature segments indicating local features, performing global feature extraction on the at least two second feature segments to obtain at least two third feature segments indicating the local features and global features, and merging the at least two third feature segments into a second feature matrix corresponding to the waveform encoder; and performing speech recognition based on the second feature matrix corresponding to the waveform encoder. The accuracy of the speech recognition performed by the present application is high.

1. A method of speech recognition, the method comprising:

acquiring a voice signal, and inputting the voice signal into a waveform encoder;

obtaining a first feature matrix corresponding to the voice signal through the waveform encoder, dividing the first feature matrix into at least two first feature segments, performing local feature extraction on the at least two first feature segments to obtain at least two second feature segments used for indicating local features, performing global feature extraction on the at least two second feature segments to obtain at least two third feature segments used for indicating local features and global features, and merging the at least two third feature segments into a second feature matrix corresponding to the waveform encoder;

and performing speech recognition based on the second feature matrix corresponding to the waveform encoder.

2. The method according to claim 1, wherein the number of the waveform encoders is at least two, at least two waveform encoders correspond to at least two second feature matrices, the at least two second feature matrices have different numbers of columns, and the performing speech recognition based on the second feature matrices corresponding to the waveform encoders comprises:

obtaining at least two third feature matrices corresponding to the at least two second feature matrices through the at least two waveform encoders, wherein the numbers of columns of the at least two third feature matrices are the same, and the at least two third feature matrices are in one-to-one correspondence with the at least two waveform encoders;

concatenating the at least two third feature matrices in the row direction to obtain a concatenated feature matrix;

and performing speech recognition based on the concatenated feature matrix.

3. The method of claim 2, wherein before obtaining at least two third feature matrices corresponding to the at least two second feature matrices by the at least two waveform encoders, the method further comprises:

determining a minimum column number in the column numbers of the at least two second feature matrices, wherein the minimum column number corresponds to a first numerical value;

determining a second numerical value corresponding to the column number of a second feature matrix corresponding to any waveform encoder, and determining convolution kernel information corresponding to any waveform encoder based on the ratio of the first numerical value to the second numerical value;

the obtaining, by the at least two waveform encoders, at least two third feature matrices corresponding to the at least two second feature matrices includes:

and performing convolution processing on the second feature matrix corresponding to any waveform encoder based on the convolution kernel information corresponding to the waveform encoder to obtain a third feature matrix corresponding to the waveform encoder.

4. The method according to claim 2 or 3, wherein the dividing the first feature matrix into at least two first feature segments comprises:

the dividing of the first feature matrix into at least two first feature segments is performed in response to any one of the waveform encoders being a first one of the at least two waveform encoders.

5. The method according to claim 2 or 3, wherein the dividing the first feature matrix into at least two first feature segments comprises:

in response to any waveform encoder being a non-first waveform encoder of the at least two waveform encoders, acquiring a second feature matrix corresponding to a waveform encoder preceding the any waveform encoder;

pooling a second feature matrix corresponding to the previous waveform encoder to obtain a pooled feature matrix, wherein the column number of the pooled feature matrix is the same as that of a first feature matrix corresponding to any one waveform encoder;

summing the first feature matrix corresponding to any waveform encoder and the pooled feature matrix to obtain a summed feature matrix;

dividing the summed feature matrix into the at least two first feature segments.

6. The method according to any one of claims 1 to 3, wherein the global feature extraction on the at least two second feature segments to obtain at least two third feature segments indicating local features and global features comprises:

down-sampling the at least two second feature segments to obtain at least two down-sampling results;

performing global feature extraction on the at least two down-sampling results through a self-attention network to obtain at least two global feature extraction results;

and upsampling the at least two global feature extraction results to obtain at least two upsampled results, and taking the at least two upsampled results as the at least two third feature segments for indicating the local features and the global features.

7. The method according to any one of claims 1-3, wherein said merging the at least two third feature segments into a second feature matrix corresponding to the waveform encoder comprises:

carrying out nonlinear mapping on the at least two third feature segments to obtain at least two nonlinear mapping results;

and combining the at least two nonlinear mapping results into the second feature matrix corresponding to the waveform encoder.

8. An apparatus for speech recognition, the apparatus comprising:

the acquisition module is used for acquiring a voice signal;

an input module for inputting the speech signal into a waveform encoder;

an obtaining module, configured to obtain a first feature matrix corresponding to the voice signal through the waveform encoder, divide the first feature matrix into at least two first feature segments, perform local feature extraction on the at least two first feature segments to obtain at least two second feature segments used for indicating a local feature, perform global feature extraction on the at least two second feature segments to obtain at least two third feature segments used for indicating a local feature and a global feature, and merge the at least two third feature segments into a second feature matrix corresponding to the waveform encoder;

and a speech recognition module for performing speech recognition based on the second feature matrix corresponding to the waveform encoder.

9. A computer device, wherein the computer device comprises a memory and a processor; the memory has stored therein at least one instruction that is loaded and executed by the processor to cause the computer device to implement the method of speech recognition according to any of claims 1-7.

10. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor, to cause a computer to implement the method of speech recognition according to any one of claims 1-7.

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for speech recognition, a computer device, and a readable storage medium.

Background

With the development of artificial intelligence technology, ASR (Automatic Speech Recognition) is widely used in people's lives. In the ASR process, feature extraction is first performed on a speech signal to obtain a feature vector, then phonemes are determined based on the feature vector, and then characters are determined based on the phonemes.

In the related art, a speech signal is processed to obtain a spectrogram, and feature extraction is performed based on the spectrogram to obtain a feature vector for determining phonemes. The process of processing the speech signal to obtain the spectrogram may cause part of the information in the speech signal to be lost, resulting in low accuracy of the speech recognition process.

Disclosure of Invention

The embodiments of the application provide a speech recognition method, a speech recognition apparatus, a computer device, and a readable storage medium, so as to solve the problem of low speech recognition accuracy in the related art. The technical solutions are as follows:

in one aspect, a method of speech recognition is provided, the method comprising:

acquiring a voice signal, and inputting the voice signal into a waveform encoder;

obtaining a first feature matrix corresponding to the voice signal through the waveform encoder, dividing the first feature matrix into at least two first feature segments, performing local feature extraction on the at least two first feature segments to obtain at least two second feature segments used for indicating local features, performing global feature extraction on the at least two second feature segments to obtain at least two third feature segments used for indicating local features and global features, and merging the at least two third feature segments into a second feature matrix corresponding to the waveform encoder;

and performing speech recognition based on the second feature matrix corresponding to the waveform encoder.

In one aspect, an apparatus for speech recognition is provided, the apparatus comprising:

the acquisition module is used for acquiring a voice signal;

an input module for inputting the speech signal into a waveform encoder;

an obtaining module, configured to obtain a first feature matrix corresponding to the voice signal through the waveform encoder, divide the first feature matrix into at least two first feature segments, perform local feature extraction on the at least two first feature segments to obtain at least two second feature segments used for indicating a local feature, perform global feature extraction on the at least two second feature segments to obtain at least two third feature segments used for indicating a local feature and a global feature, and merge the at least two third feature segments into a second feature matrix corresponding to the waveform encoder;

and a speech recognition module for performing speech recognition based on the second feature matrix corresponding to the waveform encoder.

In an exemplary embodiment, the number of waveform encoders is at least two, the at least two waveform encoders correspond to at least two second feature matrices, and the numbers of columns of the at least two second feature matrices are different. The speech recognition module is configured to obtain, through the at least two waveform encoders, at least two third feature matrices corresponding to the at least two second feature matrices, where the numbers of columns of the at least two third feature matrices are the same and the at least two third feature matrices are in one-to-one correspondence with the at least two waveform encoders; concatenate the at least two third feature matrices in the row direction to obtain a concatenated feature matrix; and perform speech recognition based on the concatenated feature matrix.

In an exemplary embodiment, the speech recognition module is further configured to determine a minimum column number among the column numbers of the at least two second feature matrices, where the minimum column number corresponds to a first numerical value; determine a second numerical value corresponding to the column number of the second feature matrix corresponding to any waveform encoder; and determine convolution kernel information corresponding to the waveform encoder based on the ratio of the first numerical value to the second numerical value.

The speech recognition module is configured to perform convolution processing on the second feature matrix corresponding to any waveform encoder based on the convolution kernel information corresponding to the waveform encoder, so as to obtain a third feature matrix corresponding to the waveform encoder.

In an exemplary embodiment, the obtaining module is configured to perform the dividing of the first feature matrix into at least two first feature segments in response to any one of the waveform encoders being a first one of the at least two waveform encoders.

In an exemplary embodiment, the obtaining module is configured to, in response to that any waveform encoder is a non-first waveform encoder of the at least two waveform encoders, obtain a second feature matrix corresponding to a previous waveform encoder of the any waveform encoder; pooling a second feature matrix corresponding to the previous waveform encoder to obtain a pooled feature matrix, wherein the column number of the pooled feature matrix is the same as that of a first feature matrix corresponding to any one waveform encoder; summing the first feature matrix corresponding to any waveform encoder and the pooled feature matrix to obtain a summed feature matrix; dividing the summed feature matrix into the at least two first feature segments.

In an exemplary embodiment, the obtaining module is configured to perform downsampling on the at least two second feature segments to obtain at least two downsampling results; perform global feature extraction on the at least two downsampling results through a self-attention network to obtain at least two global feature extraction results; and upsample the at least two global feature extraction results to obtain at least two upsampled results, and take the at least two upsampled results as the at least two third feature segments for indicating the local features and the global features.

In an exemplary embodiment, the obtaining module is configured to perform nonlinear mapping on the at least two third feature segments to obtain at least two nonlinear mapping results; and combine the at least two nonlinear mapping results into the second feature matrix corresponding to the waveform encoder.

In one aspect, a computer device is provided, the computer device comprising a memory and a processor; the memory has stored therein at least one instruction that is loaded and executed by the processor to cause a computer device to implement a method of speech recognition as provided in any of the exemplary embodiments of this application.

In one aspect, a computer-readable storage medium having at least one instruction stored therein is provided, the instruction being loaded and executed by a processor to cause a computer to implement a method for speech recognition provided by any one of the exemplary embodiments of this application.

In another aspect, there is provided a computer program or computer program product comprising: computer instructions which, when executed by a computer, cause the computer to implement a method of speech recognition as provided by any of the exemplary embodiments of this application.

The beneficial effects of the technical solutions provided by the embodiments of the present application at least include the following:

The local details of the speech signal are preserved through the local feature extraction process, and the global relationships of the speech signal are preserved through the global feature extraction process, which makes feature extraction directly on the speech signal possible. A feature matrix can be obtained based on this feature extraction and applied to the speech recognition process, thereby improving the accuracy of speech recognition.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a waveform encoder according to an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a waveform encoder according to an embodiment of the present application;

FIG. 4 is a flow chart of a method of speech recognition provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an apparatus for speech recognition according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

AI (Artificial Intelligence) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The key technologies of speech technology are ASR, Text To Speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, among which speech is expected to become one of the most promising modes of human-computer interaction.

ML (Machine Learning) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning. In some embodiments, the implementation of ASR involves the application of ML technology, for example, the application of artificial neural networks in ML technology.

With the research and progress of artificial intelligence technology, ASR is increasingly applied to a variety of projects and products, such as audio-video conferencing systems, intelligent voice interaction, intelligent voice assistants, online voice recognition systems, vehicle-mounted voice interaction systems, and so on. It is believed that ASR will find application in more projects and products and will play an increasingly important role as technology develops.

In the ASR process, feature extraction is first performed on the speech signal to obtain a feature vector; then phonemes are determined based on the feature vector; and then characters are determined based on the phonemes, thereby realizing speech recognition. In an end-to-end ASR scenario, the encoder is cascaded with an RNN-T (Recurrent Neural Network Transducer), which includes an encoding portion and a decoding portion. The encoder is used for extracting features from the speech signal to obtain feature vectors, the encoding portion of the RNN-T is used for determining phonemes based on the feature vectors, and the decoding portion of the RNN-T is used for determining characters based on the phonemes. It should be understood that the RNN-T described above is merely an example, and other components may be used in an end-to-end ASR scenario to implement the process of determining phonemes based on feature vectors and the process of determining characters based on phonemes.
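The cascade described above can be sketched roughly as follows. This is a hedged illustration only; the class and attribute names are invented for this sketch and are not part of the claimed structure: the encoder turns the waveform into feature vectors, the RNN-T encoding portion maps them to a phoneme-level representation, and the RNN-T decoding portion maps that representation to characters.

```python
from typing import List


class EndToEndASR:
    """Toy composition of an end-to-end ASR pipeline: a waveform (or spectrogram)
    encoder followed by the encoding and decoding parts of an RNN-T-style model."""

    def __init__(self, encoder, rnnt_encoder, rnnt_decoder):
        self.encoder = encoder            # speech signal -> feature vectors
        self.rnnt_encoder = rnnt_encoder  # feature vectors -> phoneme-level representation
        self.rnnt_decoder = rnnt_decoder  # phoneme-level representation -> characters

    def recognize(self, waveform: List[float]) -> str:
        features = self.encoder(waveform)
        phonemes = self.rnnt_encoder(features)
        return self.rnnt_decoder(phonemes)
```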

In the related art, the encoders include MFCC (Mel-Frequency Cepstral Coefficients), FBANK (Mel filter bank values), and the like, and the input of these encoders is a spectrogram. Therefore, it is necessary to perform a Short-Time Fourier Transform (STFT) or a Modified Discrete Cosine Transform (MDCT) on the original speech signal to obtain a spectrogram, so that the encoder in the related art performs feature extraction based on the spectrogram to obtain a feature vector. Because the process of processing the speech signal to obtain the spectrogram causes information loss, the feature vector obtained by the method provided in the related art is not accurate enough, and the recognition accuracy of the speech recognition process based on such a feature vector is therefore low.

Compared with a processing mode that causes information loss in the speech signal, more accurate feature vectors can be obtained by extracting features directly from the original speech signal, which helps to improve the recognition accuracy of the speech recognition process. In the process of extracting features from the original speech signal, the original speech signal needs to be densely sampled, so that a large number of signal segments are obtained. The difficulty of feature extraction based on a large number of signal segments is that both the local details of each signal segment and the global relationships between different signal segments need to be preserved. In view of this difficulty, the embodiments of the present application provide a speech recognition method; see the following description.

Referring to fig. 1, a method for speech recognition provided by the embodiment of the present application can be applied to an implementation environment shown in fig. 1. Fig. 1 includes a computer device including a waveform encoder for feature extraction for a speech signal. Wherein the computer device comprises an electronic device or a server.

The electronic device may be any electronic product that can interact with a user through one or more ways, such as a keyboard, a touch pad, a touch screen, a remote controller, a voice interaction device, or a handwriting device, for example, a PC (Personal Computer), a mobile phone, a smart phone, a PDA (Personal Digital Assistant), a wearable device, a pocket PC (pocket PC), a tablet Computer, a smart car, a smart television, a smart speaker, and the like. The server may be one server, a server cluster composed of a plurality of servers, or a cloud computing service center.

Those skilled in the art should appreciate that the above-described electronic devices and servers are merely examples, and that other existing or future electronic devices and servers may be suitable for use in the present application and are intended to be included within the scope of the present application and are hereby incorporated by reference.

Referring to fig. 2, fig. 2 shows an exemplary waveform encoder (Waveform Encoder). The input to the waveform encoder is an original speech signal, which is a time-domain waveform, also called the waveform input (Waveform Input). The output of the waveform encoder is a feature matrix, which includes feature vectors, also called encoded features (Encoded Features). The waveform encoder comprises a first convolution module, a dividing module (Split Chunks), a GALR (Globally Attentive Locally Recurrent) module, and a merging module (Merge Chunks), which are connected in series. Illustratively, the number of GALR modules is at least one, B in fig. 2 is used to indicate the total number of GALR modules, and different GALR modules may be distinguished by b (b = 1, …, B). One GALR module includes an RNN (Recurrent Neural Network) and a SAN (Self-Attention Network) connected in series. The waveform encoder comprises the following modules:

A first convolution module: used for extracting features of the speech signal to obtain a feature matrix.

A dividing module: used for dividing the feature matrix output by the first convolution module into at least two feature segments.

A GALR module: in the case where the number of GALR modules is one, the input of the GALR module is the at least two feature segments output by the dividing module, and the output of the GALR module is used as the input of the merging module. In the case where the number of GALR modules is at least two, the input of the first GALR module is the at least two feature segments output by the dividing module, the output of the first GALR module is used as the input of the second GALR module, and so on; the output of the last GALR module is used as the input of the merging module.

In a GALR module, the RNN is used to learn local details of each of at least two feature segments, and at least two feature segments output by the RNN are used as inputs to the SAN. The SAN is used to learn a global relationship between different ones of the at least two feature segments, and at least two feature segments of the SAN output are used as outputs of the GALR module.

A merging module: used for combining the at least two feature segments output by the GALR module to obtain a feature matrix, where the feature matrix is used for speech recognition.

Illustratively, a Layer Normalization (LN) is also included in the GALR module between the RNN and the SAN to avoid excessive differences between the output and input of the RNN. Illustratively, LNs located after the SAN are also included in the GALR module, which are used to avoid too large a difference between the output and the input of the SAN.

Illustratively, a ReLU (Rectified Linear Unit) and an LN between the first convolution module and the dividing module are further included in the waveform encoder. The ReLU is used to keep the output of the first convolution module non-negative, and the LN is used to avoid too large a difference between the output and the input of the first convolution module.

Illustratively, the waveform encoder further comprises a second convolution module located after the merging module, and the second convolution module is used for adjusting the size of the feature matrix output by the merging module, so as to avoid excessive processing resources occupied by the subsequent speech recognition process. Illustratively, the second convolution module includes a ReLU for keeping the output of the second convolution module non-negative and a LN for avoiding too large a difference between the output and the input of the second convolution module.

It should be noted that, for a segment of speech signal, before inputting the speech signal into the first convolution module, the speech signal needs to be divided into signal segments of a certain length. The smaller the length of the signal segment, the smaller the time resolution and the larger the frequency resolution.

To achieve a balance between time resolution and frequency resolution, the present embodiment provides at least two waveform encoders connected in parallel; see fig. 3. The total number of waveform encoders is denoted as N, and different waveform encoders can be distinguished by n (n = 1, …, N). Different waveform encoders divide a segment of speech signal into signal segments of different lengths before inputting them into the first convolution module, so different waveform encoders obtain different feature matrices through the merging module. Because the lengths of the signal segments obtained by the division in different waveform encoders are different, the total numbers of the signal segments are also different, and therefore the sizes of the feature matrices obtained by the merging modules in different waveform encoders are also different. In contrast, feature matrices of different sizes can be converted into feature matrices of the same size through processing by the second convolution module in each waveform encoder, so that the feature matrices of the same size can be concatenated. Subsequent speech recognition is then carried out based on the concatenated feature matrix, which is equivalent to combining a variety of different signal-segment lengths, so that a balance between time resolution and frequency resolution is achieved and the accuracy of speech recognition is further improved.

In this embodiment, the at least two parallel waveform encoders serve as multi-scale waveform encoders. The smaller the length of the signal segments obtained by the division in a waveform encoder, the larger the total number of signal segments and the finer the scale adopted by that waveform encoder (fine). The larger the length of the signal segments obtained by the division in a waveform encoder, the smaller the total number of signal segments and the coarser the scale adopted by that waveform encoder (coarse). Illustratively, in the multi-scale waveform encoders, the scales employed by the different waveform encoders range from fine to coarse (fine to coarse).
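As an illustration of this multi-scale arrangement, the following sketch enumerates example segment lengths from fine to coarse; the 16 kHz sampling rate, the three reference durations, and the 50% overlap are assumed values chosen for the example, not values fixed by this application.

```python
# Example: N = 3 parallel waveform encoders, scales from fine to coarse.
SAMPLE_RATE = 16_000                              # samples per second
REFERENCE_DURATIONS = [0.010, 0.025, 0.050]       # seconds, fine -> coarse

# Segment length M_n and (approximate) segment count L_n for a 4-second utterance,
# assuming a 50% overlap between adjacent segments.
T = SAMPLE_RATE * 4
for n, dur in enumerate(REFERENCE_DURATIONS, start=1):
    m_n = int(SAMPLE_RATE * dur)                  # length of each signal segment
    l_n = T // (m_n // 2) - 1                     # rough count with 50% overlap
    print(f"encoder {n}: M_n = {m_n}, L_n ≈ {l_n}")
```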

Based on the implementation environment shown in fig. 1, referring to fig. 4, an embodiment of the present application provides a method for speech recognition, which is applied to the computer device shown in fig. 1. As shown in fig. 4, the method includes the following steps.

401, a speech signal is acquired and input into a waveform encoder.

The embodiment of the present application does not limit the manner of acquiring the speech signal. For example, the embodiment samples the voice uttered by a user at a certain sampling frequency to obtain the speech signal to be recognized. Alternatively, the present embodiment acquires the speech signal from a network, or acquires the speech signal locally. Since the speech signal varies with time, the speech signal is a time-domain waveform, which can be expressed as the following formula (1):

in formula (1), x is the speech signal, T is used to indicate the length of the speech signal, and T is the product of the sampling frequency and the speech duration. Taking a sampling frequency of 16kHz (i.e., 16,000 samples per second) and a speech duration of 4 seconds as an example, T is 64,000.

Illustratively, in a waveform encoder, the speech signal is divided into at least two signal segments of a reference duration, the at least two signal segments constitute a sequence of signal segments, and adjacent signal segments have a certain overlap ratio. In this embodiment, the total number of waveform encoders is denoted as N, which is an integer not less than 1. When the waveform encoders are distinguished by the subscript n (n = 1, …, N), the sequence of signal segments obtained by the division in the n-th waveform encoder is expressed by the following formula (2):

where X_n is the sequence of signal segments in the n-th waveform encoder, and M_n is the length of a signal segment in the n-th waveform encoder; M_n equals the product of the sampling frequency and the reference duration. For example, with a sampling frequency of 16 kHz and a reference duration of 0.025 s, M_n is 400. M_n is a hyperparameter determined according to actual needs, and this embodiment does not limit M_n. L_n is the total number of signal segments in the n-th waveform encoder, and L_n is calculated according to the following equation (3):

where μ1 is the overlapping ratio between adjacent signal segments. This embodiment does not limit the overlapping ratio between adjacent signal segments; for example, the overlapping ratio between adjacent signal segments is 50%. Illustratively, in the case where T is exactly divisible by M_n, T refers to the real length of the speech signal. In the case where the speech signal cannot be exactly divided by M_n, T is the length of the speech signal after zero padding, where the purpose of the zero padding is to make the zero-padded speech signal exactly divisible by M_n. Illustratively, the present embodiment performs zero padding both before and after the speech signal. Compared with zero-padding only after the speech signal, zero-padding both before and after the speech signal prevents the last signal segment from consisting largely of padded zeros, thereby ensuring the accuracy of the subsequent speech recognition process.

It should be noted that, when the number of waveform encoders is at least two, that is, when N ≥ 2, the different waveform encoders divide the speech signal into signal segments whose length M_n and total number L_n both differ. Illustratively, the length M_n of the signal segments is positively correlated with the magnitude of n, and the total number L_n of signal segments is negatively correlated with the magnitude of n. That is, the larger n is, the larger the segment length M_n is and the smaller the total number of segments L_n is.
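The segmentation in 401 can be sketched as follows. This is a minimal NumPy illustration under assumed values (50% overlap, 16 kHz audio, M_n = 400); the function name and the exact padding arithmetic are illustrative choices, not the reference implementation.

```python
import numpy as np


def segment_waveform(x: np.ndarray, m: int, overlap: float = 0.5) -> np.ndarray:
    """Split a 1-D waveform into overlapping segments of length m.

    Zero padding is applied both before and after the signal so that the
    padded length fits an integer number of hops, as described above.
    """
    hop = int(m * (1.0 - overlap))                 # hop size between adjacent segments
    # smallest number of segments that covers the whole signal
    n_seg = int(np.ceil(max(len(x) - m, 0) / hop)) + 1
    padded_len = (n_seg - 1) * hop + m
    pad = padded_len - len(x)
    x_padded = np.pad(x, (pad // 2, pad - pad // 2))   # pad before and after
    # gather the L_n overlapping segments of length M_n
    return np.stack([x_padded[i * hop: i * hop + m] for i in range(n_seg)])


# usage: 4 s of 16 kHz audio, 25 ms segments (M_n = 400), 50% overlap
segments = segment_waveform(np.random.randn(64000), m=400, overlap=0.5)
print(segments.shape)   # (L_n, M_n) = (319, 400)
```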

402, obtaining a first feature matrix corresponding to a voice signal through a waveform encoder, dividing the first feature matrix into at least two first feature segments, performing local feature extraction on the at least two first feature segments to obtain at least two second feature segments used for indicating local features, performing global feature extraction on the at least two second feature segments to obtain at least two third feature segments used for indicating local features and global features, and merging the at least two third feature segments into a second feature matrix corresponding to the waveform encoder.

As can be seen from the above description, the waveform encoder includes a first convolution module, a dividing module, at least one GALR module, and a merging module connected in series. Therefore, the speech signal is processed sequentially by the first convolution module, the dividing module, the at least one GALR module, and the merging module in the waveform encoder. Next, the processing procedure in each module is explained in the processing order.

4021, obtaining a first feature matrix corresponding to the voice signal through a first convolution module.

In 401, the speech signal is divided into a sequence of signal segments, which includes at least two signal segments, so that at least two signal segments are input into a first convolution module, which is configured to output a feature vector for each signal segment, respectively, so as to obtain at least two feature vectors corresponding to the at least two signal segments.

The first convolution module includes a Conv1D (one-dimensional convolution) layer having a convolution kernel expressed as the following formula (4):

where 1 is the window width of the convolution kernel U_n, D is the number of rows of the convolution kernel U_n, and M_n is the number of columns of the convolution kernel U_n. D may be determined according to actual requirements, and the value of D is not limited in this embodiment. In addition, the window shift of the convolution kernel U_n is 1.

Illustratively, this embodiment takes the at least two feature vectors output by the first convolution module as the first feature matrix. In this case, the first feature matrix is expressed as the following formula (5):

Alternatively, in the case where the first convolution module is connected in series with the ReLU and the LN, the present embodiment further processes the at least two feature vectors output by the first convolution module sequentially through the ReLU and the LN, and takes the processing result as the first feature matrix, where the first feature matrix is expressed by the following equation (6):

4022, the first feature matrix is divided into at least two first feature segments by the dividing module.

After the first feature matrix is obtained, the first feature matrix is input into the dividing module, so that the first feature matrix F_n is divided by the dividing module into at least two first feature segments, and a certain overlapping rate exists between adjacent first feature segments. Illustratively, the length of a first feature segment is denoted as K_n and the overlapping rate between adjacent first feature segments is recorded as μ2; then the total number of first feature segments is expressed as the following formula (7):

wherein S isnI.e. the total number of first feature segments. Length K of the first characteristic sectionnFor the over-parameters determined according to actual needs, this embodiment does not match KnAnd (4) limiting. In addition, the embodiment also does not address the overlapping rate μ between adjacent first feature segments2Is limited, e.g. mu2Is 50%. Exemplarily, at LnCan be covered by KnIn the case of integer division, LnIs the true length of the first feature matrix. At LnCan not be covered by KnIn the case of integer division, LnIs the length after zero padding the first feature matrix. The way of zero padding is described in the above 401, and is not described herein again.

Based on the above division process, the (D × L_n) first feature matrix F_n is divided into S_n first feature segments of size (D × K_n). The obtained first feature segments are expressed as a three-dimensional tensor, which is written according to the following formula (8):

in the case where the number of waveform encoders is at least two, the first feature matrix is illustratively divided into at least two first feature segments, including the following two cases.

Case one: in response to one waveform encoder being a non-first waveform encoder of the at least two waveform encoders, a second feature matrix corresponding to the previous waveform encoder of this waveform encoder is acquired, and the second feature matrix corresponding to the previous waveform encoder is pooled (the process of acquiring a second feature matrix is described in 4025 below) to obtain a pooled feature matrix, where the pooled feature matrix has the same number of columns as the first feature matrix corresponding to this waveform encoder. The first feature matrix corresponding to this waveform encoder and the pooled feature matrix are summed to obtain a summed feature matrix. The summed feature matrix is divided into at least two first feature segments by the dividing module.

As can be seen from the description in 401, when the number of waveform encoders is at least two, the larger the serial number n of a waveform encoder is, the smaller the total number L_n of signal segments obtained by its division. From equations (5) and (6), the first feature matrix obtained by the first convolution module in that waveform encoder has L_n columns; that is, the number of columns of the first feature matrix is L_n. Therefore, the larger the serial number n of the waveform encoder is, the smaller the number of columns L_n of the first feature matrix. In addition, the number of columns of the second feature matrix corresponding to a waveform encoder is the same as the number of columns of the first feature matrix corresponding to that waveform encoder, so the larger the serial number n of the waveform encoder is, the smaller the number of columns L_n of its second feature matrix. As a result, when a waveform encoder is not the first waveform encoder, the number of columns of the second feature matrix corresponding to the waveform encoder immediately preceding it is larger than the number of columns of the first feature matrix corresponding to it.

Therefore, in this embodiment, the second feature matrix corresponding to the previous waveform encoder is pooled to reduce the number of columns of the second feature matrix corresponding to the previous waveform encoder, so as to obtain a pooled feature matrix, where the number of columns of the pooled feature matrix is the same as the number of columns of the first feature matrix of the current waveform encoder. Illustratively, the pooling process described above includes, but is not limited to, average pooling.

Precisely because the number of columns of the pooled feature matrix is the same as the number of columns of the first feature matrix corresponding to the waveform encoder, this embodiment can sum the pooled feature matrix and the first feature matrix corresponding to the waveform encoder to obtain a summed feature matrix, which is represented by the following formula (9):

in formula (9), En-1For the second feature matrix corresponding to the previous waveform encoder, by pair En-1Performing AvgPool1D (Average Pool 1-dimension, one-dimensional Average pooling) processing to obtain pooled feature matrix AvgPool1D (E)n-1) For the pooled feature matrix AvgPool1D (E)n-1) First feature matrix F corresponding to waveform encodernSumming to obtain a summed feature matrix

After the summed feature matrix is obtained, the summed feature matrix can be divided into at least two first feature segments by the dividing module. The process of dividing the summed feature matrix is the same as the process of dividing the first feature matrix F_n described above, and is not repeated here.

Case two: and in response to one waveform encoder being the first waveform encoder of the at least two waveform encoders, the first feature matrix is directly divided into at least two first feature segments by the dividing module. For the first waveform coder, if n is 1, then E is given in equation (9)n-1=E0. Definition of E in this example0When 0, then there is according to formula (9)Therefore, the first waveform encoder can directly divide the first feature matrix without combining with second feature matrices of other waveform encoders to calculate and then divide the first feature matrix.

It can be understood that the three-dimensional tensor shown in the formula (8) can be obtained regardless of whether the division is performed according to the case one or the case two.
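The two cases can be sketched together as follows. The average-pooling window computation is an assumption chosen so that the pooled matrix matches the column count of F_n; the text above only specifies that AvgPool1D is applied so that the column counts match.

```python
from typing import Optional

import numpy as np


def avg_pool_columns(e_prev: np.ndarray, target_cols: int) -> np.ndarray:
    """One-dimensional average pooling along the column axis so that the previous
    encoder's second feature matrix matches the column count of F_n."""
    d, l_prev = e_prev.shape
    # split the l_prev columns into target_cols groups and average each group
    bounds = np.linspace(0, l_prev, target_cols + 1).astype(int)
    return np.stack([e_prev[:, a:b].mean(axis=1)
                     for a, b in zip(bounds[:-1], bounds[1:])], axis=1)


def fuse(f_n: np.ndarray, e_prev: Optional[np.ndarray]) -> np.ndarray:
    """Case one: add the pooled previous matrix; case two (first encoder): E_0 = 0."""
    if e_prev is None:                      # first waveform encoder
        return f_n
    pooled = avg_pool_columns(e_prev, f_n.shape[1])
    return f_n + pooled                     # summed feature matrix of formula (9)


# usage: previous encoder produced E_{n-1} with 799 columns, current F_n has 319
summed = fuse(np.random.randn(64, 319), np.random.randn(64, 799))
print(summed.shape)   # (64, 319)
```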

After the three-dimensional tensor is obtained, the three-dimensional tensor is input to a GALR module in the waveform encoder. The total number of GALR modules is denoted as B, which is an integer not less than 1, and the GALR modules are distinguished by the subscript b (b = 1, …, B). In this embodiment, the input and the output of the b-th GALR module in the n-th waveform encoder are denoted separately. For the first GALR module (b = 1) in the n-th waveform encoder, the input is the three-dimensional tensor output by the dividing module; for the other GALR modules (b ≠ 1) in the n-th waveform encoder, the input of the b-th GALR module is equal to the output of the (b−1)-th GALR module.

A GALR module includes an RNN and a SAN connected in series. For the sake of distinction, the input and output of the RNN, and the input and output of the SAN, are denoted separately. Next, the RNN and the SAN in the GALR module are explained in turn.

4023, extracting local features of the at least two first feature segments through RNN in GALR module to obtain at least two second feature segments for indicating local features.

Local features include, but are not limited to, temporal continuity, spectral structure, timbre, and the like. Different first feature segments may have different local features, so that the local features need to be extracted in this embodiment, thereby avoiding loss of the local features and further ensuring accuracy of the speech recognition process.

The input of the RNN is the input of the GALR module in which the RNN is located, that is, the at least two first feature segments. After the at least two first feature segments are input into the RNN, the RNN produces an output according to the following formula (10); the output of the RNN is the at least two second feature segments used for indicating local features.

Here, the output of the RNN is a three-dimensional tensor, which can also be understood as S_n two-dimensional tensors of size (D × K_n), the s-th of which is the s-th (D × K_n) two-dimensional tensor. M_{n,b} and c_{n,b} are parameters of a linear transformation, and M_{n,b} and c_{n,b} are used to ensure that the output of the RNN has the same size as its input.

Illustratively, the RNN in this embodiment includes an LSTM (Long Short-Term Memory network) having H hidden nodes. In this case, the linear transformation maps the H-dimensional hidden output of the LSTM back to D dimensions, so that the output of the RNN has the same size as its input. Alternatively, the RNN includes a Bi-LSTM (Bidirectional LSTM), in which case the hidden output has 2H dimensions and the linear transformation maps it back to D dimensions, so that the output of the RNN again has the same size as its input.
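A hedged PyTorch sketch of this local step is shown below; the hidden size, the residual-style combination with the input, and the placement of the LayerNorm are illustrative assumptions rather than the exact form of formulas (10) and (11).

```python
import torch
import torch.nn as nn


class LocalRNN(nn.Module):
    """Bi-LSTM over each chunk, followed by a linear map back to D dimensions,
    so the output has the same shape as the input (a sketch of the GALR RNN step;
    hyperparameters and the residual connection are illustrative)."""

    def __init__(self, d: int, hidden: int):
        super().__init__()
        self.rnn = nn.LSTM(input_size=d, hidden_size=hidden,
                           bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, d)   # 2*H because of the Bi-LSTM
        self.norm = nn.LayerNorm(d)            # LN between the RNN and the SAN

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (S_n, K_n, D) -- S_n chunks, each with K_n frames of dimension D
        out, _ = self.rnn(z)                   # (S_n, K_n, 2*H), local modelling within a chunk
        out = self.proj(out)                   # back to (S_n, K_n, D)
        return self.norm(out + z)              # residual-style combination, then LayerNorm


# usage
x = torch.randn(19, 32, 64)                    # S_n = 19 chunks, K_n = 32, D = 64
print(LocalRNN(d=64, hidden=128)(x).shape)     # torch.Size([19, 32, 64])
```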

4024, global feature extraction is performed on the at least two second feature segments through SAN in the GALR module to obtain at least two third feature segments for indicating local features and global features.

The global feature refers to a context relationship between different second feature segments, or a dependency relationship between different second feature segments. The global features can be prevented from being lost by extracting the global features, and the accuracy of the voice recognition process is further ensured. Since the at least two second feature segments are used for indicating the local features, the global features are extracted on the basis of the at least two second feature segments, and the obtained at least two third feature segments can be used for indicating the local features and the global features.

Illustratively, the input to the SAN is the output of the RNN. Alternatively, in the case where an LN is included between the RNN and the SAN, the input to the SAN is determined according to the following equation (11):

then, after inputting the SAN, the output of the SAN is expressed as the following formula (12):

in an exemplary embodiment, the global feature extraction is performed on the at least two second feature segments, and obtaining at least two third feature segments indicating local features and global features includes: and downsampling the at least two second characteristic segments to obtain at least two downsampling results. And performing global feature extraction on the at least two second feature segments through the SAN to obtain at least two global feature extraction results. And upsampling the at least two global feature extraction results to obtain at least two upsampled results, and taking the at least two upsampled results as at least two third feature segments for indicating the local features and the global features. In this case, the output of the SAN is expressed as the following formula (13):

Here, the downsampling process is used to adjust the granularity of the global feature extraction. For example, when the granularity is fine, it is equivalent to performing global feature extraction at the phoneme level, and the extracted global features represent the context relationships between different phonemes. Alternatively, when the granularity is coarse, it is equivalent to performing global feature extraction at the character level, and the extracted global features represent the context relationships between different characters. In addition, UpSmpl() is an upsampling process corresponding to the downsampling process, and is used to ensure that the upsampled results have the same size as the second feature segments.

Illustratively, the present embodiment takes the output of the SAN as the output of the GALR module. Alternatively, in the case where the SAN is followed by an LN, the output of the GALR module is expressed according to the following equation (14):

when B < B, the output of GALR moduleAlso as input to the next GALR module, i.e.Therefore, the above-mentioned at least two third characteristic sections refer to: the output of the last waveform coder of the at least one waveform coder

4025, merging the at least two third feature segments into a second feature matrix corresponding to the waveform encoder through the merging module.

Exemplarily, the at least two third feature segments are merged in the merging module by an overlap-add algorithm; see formula (15) below:

where Overlapadd() denotes the overlap-add algorithm, and E_n is the second feature matrix obtained by the merging.

Illustratively, merging the at least two third feature segments into the second feature matrix corresponding to the waveform encoder includes: carrying out nonlinear mapping on the at least two third feature segments to obtain at least two nonlinear mapping results, and combining the at least two nonlinear mapping results into the second feature matrix corresponding to the waveform encoder.

In some embodiments, the at least two third feature segments are non-linearly mapped by the Swish function and the two-dimensional convolutional layer (Conv2D), see equation (16) below:

In equation (16), Swish() is the Swish function, and the result of equation (16) is the nonlinear mapping result. On the basis of formula (16), the nonlinear mapping results are merged through the merging module to obtain the second feature matrix E_n; see formula (17) below:
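The merging step can be sketched as follows. This is a minimal NumPy overlap-add illustration; averaging the overlapped regions and omitting the Swish/Conv2D nonlinear mapping of formula (16) are simplifications made for the sketch.

```python
import numpy as np


def merge_chunks(chunks: np.ndarray, l_n: int, overlap: float = 0.5) -> np.ndarray:
    """Overlap-add the S_n processed chunks back into a (D x L_n) feature matrix
    (a sketch of the merging module; normalising overlapped regions by their
    count is an illustrative choice)."""
    s_n, d, k = chunks.shape
    hop = int(k * (1.0 - overlap))
    total = (s_n - 1) * hop + k
    acc = np.zeros((d, total))
    weight = np.zeros(total)
    for i in range(s_n):
        acc[:, i * hop: i * hop + k] += chunks[i]
        weight[i * hop: i * hop + k] += 1.0
    acc /= weight                      # average where chunks overlap
    return acc[:, :l_n]                # drop the columns introduced by zero padding


# usage: S_n = 19 chunks of shape (D = 64, K_n = 32) merged back to L_n = 319 columns
e_n = merge_chunks(np.random.randn(19, 64, 32), l_n=319)
print(e_n.shape)   # (64, 319)
```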

and 403, performing voice recognition based on the second feature matrix corresponding to the waveform coder.

In response to the number of waveform encoders being one, speech recognition is performed directly on the basis of the second feature matrix corresponding to the waveform encoder. The process of performing speech recognition based on the second feature matrix includes: determining phonemes based on the second feature matrix, and determining characters based on the phonemes, thereby implementing the speech recognition process.

Alternatively, in response to the number of waveform encoders being at least two, the number of second feature matrices is also at least two and their numbers of columns differ, so this embodiment needs to perform speech recognition based on the at least two second feature matrices.

Illustratively, performing speech recognition based on the second feature matrix corresponding to the waveform encoder includes: obtaining at least two third feature matrices corresponding to the at least two second feature matrices through the at least two waveform encoders, where the numbers of columns of the at least two third feature matrices are the same and the at least two third feature matrices are in one-to-one correspondence with the at least two waveform encoders; concatenating the at least two third feature matrices in the row direction to obtain a concatenated feature matrix; and performing speech recognition based on the concatenated feature matrix. The method for performing speech recognition based on the concatenated feature matrix is the same as the method for performing speech recognition based on the second feature matrix, and is not repeated here.

Since the number of columns of the second feature matrices in different waveform encoders is different, in this embodiment, at least two third feature matrices with the same number of columns need to be obtained based on at least two second feature matrices, and the at least two third feature matrices can be concatenated in the row direction, so that speech recognition is performed based on the concatenated feature matrices. In this embodiment, different convolution kernels are used in the second convolution modules included in different waveform encoders, and the convolution processing is performed on the second feature matrices corresponding to the waveform encoders through the different convolution kernels, so that third feature matrices with the same number of columns can be obtained. Therefore, in an exemplary embodiment, before obtaining at least two third feature matrices corresponding to the at least two second feature matrices by the at least two waveform encoders, the method further includes: determining a minimum column number in the column numbers of the at least two second feature matrixes, wherein the minimum column number corresponds to a first numerical value; and determining a second numerical value corresponding to the column number of the second characteristic matrix corresponding to any waveform encoder, and determining convolution kernel information corresponding to any waveform encoder based on the ratio of the first numerical value to the second numerical value.

Here, the first numerical value corresponding to the minimum column number is the length of the signal segment used to obtain the minimum number of columns. According to formula (3), the minimum number of columns is obtained when the segment length M_n is at its maximum, and thus the length of the signal segment for obtaining the minimum number of columns is the maximum M_n. Therefore, the first numerical value is recorded as M_max and expressed as the following equation (18):

M_max = max{M_1, …, M_N}    (18)

for any waveform encoder, the second numerical value corresponding to the number of columns of the second feature matrix corresponding to the waveform encoder is: the length of the signal segment for obtaining the column number of the second feature matrix, and the second numerical value is M corresponding to the waveform encodern

In any waveform encoder, the convolution kernel information is determined based on the first numerical value M_max and the second numerical value M_n according to the following equation (19):

in the formula (19), C is a constant. Based on the convolution kernel information shown in equation (19), the corresponding convolution kernel of any waveform encoder is expressed as equation (20) below:

where C_n/μ3 is the window width of the convolution kernel V_n, and the numbers of rows and columns of the convolution kernel V_n are both D. μ3 is the overlap ratio used when sliding the window, and this embodiment does not limit the overlap ratio used when sliding the window. For example, when the sliding-window overlap ratio μ3 is 50%, the window width of the convolution kernel V_n is 2C_n. In addition, the window shift of the convolution kernel V_n is C_n.

Correspondingly, obtaining the at least two third feature matrices corresponding to the at least two second feature matrices through the at least two waveform encoders includes: performing convolution processing on the second feature matrix corresponding to any waveform encoder through the convolution kernel information corresponding to that waveform encoder to obtain a third feature matrix.

The second feature matrix is convolved based on the convolution kernel shown in equation (20), and the obtained third feature matrix is expressed by equation (21) as follows:

Wherein, Y_n is the third feature matrix, obtained by applying a one-dimensional convolution layer that uses the above convolution kernel V_n. In some embodiments, the waveform encoder includes a second convolution module after the merging module, and the second feature matrix output by the merging module can be convolved by the second convolution module in the waveform encoder to obtain the third feature matrix.
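As an illustration of equation (21), the sketch below applies a 1-D convolution to a second feature matrix (D rows, L_n columns) to produce the third feature matrix Y_n. The sizes, window width, and stride here are illustrative only; in practice they would follow equations (19) and (20).

```python
# Hedged sketch of the second convolution module producing Y_n.
import torch
import torch.nn as nn

D, L_n = 64, 200                 # feature dimension and column count (illustrative)
window, stride = 32, 16          # window width C_n / mu_3 and window shift C_n (illustrative)

conv_1d = nn.Conv1d(in_channels=D, out_channels=D, kernel_size=window, stride=stride)
second_matrix = torch.randn(1, D, L_n)   # (batch, D, L_n) second feature matrix
Y_n = conv_1d(second_matrix)             # third feature matrix, shape (1, D, L_min)
```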

In addition, the number of columns L_min of the third feature matrix Y_n is calculated according to equation (22) as follows:

Exemplarily, in the case where the second convolution module is further connected in series with ReLU and LN, the third feature matrix can also be expressed as the following formula (23):

Based on formula (21) or (23), the third feature matrix corresponding to each waveform encoder can be obtained. Since the number of waveform encoders is N, a total of N third feature matrices of size (D × L_min) can be obtained. Then, the at least two third feature matrices may be concatenated in the row direction, so as to obtain a concatenated feature matrix represented by the following formula (24):

It can be seen from equation (22) that adjusting the constant C and the overlap ratio μ_3 used in the sliding window can change the number of columns L_min of the third feature matrix Y_n, i.e., change the number of columns L_min of the concatenated feature matrix Y. Illustratively, this embodiment can adjust the constant C and the overlap ratio μ_3 according to actual requirements, thereby controlling the number of columns L_min of the third feature matrix Y_n and of the concatenated feature matrix Y, and avoiding a situation in which L_min is too large and the subsequent speech recognition process occupies too many processing resources. In addition, in the case where the number of waveform encoders is one, the concatenation is not required, but the waveform encoder may still include the second convolution module. The second convolution module can adjust the number of columns of the second feature matrix output by the merging module, thereby avoiding a situation in which the number of columns of the second feature matrix is too large and the subsequent speech recognition process based on the second feature matrix occupies too many processing resources.
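A minimal sketch of the row-direction concatenation of equation (24) follows: the N third feature matrices, each of shape (D, L_min), are stacked along the row dimension to form the concatenated feature matrix of shape (N · D, L_min). The sizes are illustrative.

```python
# Sketch: concatenating the per-encoder third feature matrices in the row direction.
import torch

N, D, L_min = 4, 64, 11                              # illustrative sizes
third_matrices = [torch.randn(D, L_min) for _ in range(N)]
Y = torch.cat(third_matrices, dim=0)                 # concatenated feature matrix, (N * D, L_min)
```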

Next, a comparison between the waveform encoder provided in the embodiment of the present application and an encoder provided in the related art is described to show a positive effect of the waveform encoder provided in the embodiment of the present application on the recognition accuracy of speech recognition.

Comparison description one: From the above description, speech recognition is achieved by connecting the encoder and the RNN-T in series in an end-to-end ASR scenario. Referring to Table 1, the first column in Table 1 shows four different RNN-Ts: Conf-S, Conf-M, Conf-L and TDNN (Time Delay Neural Network)-Conf. The encoder MFCC in the related art and the waveform encoder provided in the embodiment of the present application (denoted as GALR in Table 1) are respectively connected in series before the above four different RNN-Ts, so as to form eight different end-to-end ASR systems. Then, the eight end-to-end ASR systems are trained on the reference data set AISHELL-2, and the number of parameters of the trained eight end-to-end ASR systems and the CER (Character Error Rate) of speech recognition are shown in Table 1 below.

TABLE 1

As can be seen from Table 1, compared with the four end-to-end ASR systems including MFCC, the CER of speech recognition by the four end-to-end ASR systems including GALR is lower, with the reduction of CER between 7.9% and 28.1%, so the GALR provided by the embodiment of the present application can improve the accuracy of speech recognition. In addition, the number of parameters required by the four end-to-end ASR systems including GALR is lower, which is beneficial to reducing the system size and accelerating the training speed. It can also be seen from Table 1 that using TDNN-Conf in the end-to-end ASR system results in lower CER and a lower parameter count than Conf-S, Conf-M and Conf-L, and thus TDNN-Conf is used in the end-to-end ASR systems in the comparisons shown in Tables 2 and 3 below.

It should be noted that the different end-to-end ASR systems described above include only the RNN-T and the encoder. For example, in this embodiment, an NNLM (Neural Network Language Model), MBR (Minimum Bayes Risk), and LAS (Listen, Attend and Spell) rescoring may also be added to the end-to-end ASR system to further verify the positive influence of the GALR provided in this embodiment on speech recognition accuracy, which is not described herein again.

Comparison description two: the MFCC, Conv1D, and the GALR provided by this embodiment are respectively connected in series before TDNN-Conf to obtain three different end-to-end ASR systems. The MFCC is an encoder that performs feature extraction based on a spectrogram obtained by processing the original speech signal, while Conv1D (which is equivalent to the first convolution module included in the waveform encoder of the embodiment of the present application) and GALR are both encoders that perform feature extraction directly on the original speech signal. The three end-to-end ASR systems were trained on a 5,000-hour (5khrs) Mandarin Chinese dataset, and the CERs of the trained three end-to-end ASR systems for speech recognition at different scales are shown in Table 2 below.

TABLE 2

As can be seen by comparing CERs corresponding to MFCC and GALR, the reduction range of CER can be between 16.0% and 21.3% by using GALR for feature extraction. In the case of the scale of 25, comparing the CER (9.4%) corresponding to MFCC with the CER (9.2%) corresponding to Conv1D shows that the method of extracting features based on original speech is more beneficial to improving the accuracy of speech recognition compared with the method of extracting features based on spectrogram. In the case of the scale of 12.5, comparing the CER (9.1%) corresponding to Conv1D with the CER (9.0%) corresponding to GALR, it can be seen that, compared with the way of extracting features of the original speech signal by Conv1D, the method of extracting features of the original speech signal by GALR provided by the embodiment of the present application is beneficial to improving the accuracy of speech recognition.

In addition, in the case of multiscale {6.25, 12.5, 25}, comparing CER (8.9%) corresponding to Conv1D with CER (7.6%) corresponding to GALR, it can be seen that GALR can provide higher speech recognition accuracy than Conv1D even if Conv1D and GALR use the same multiscale in the process of feature extraction of the original speech signal. In the GALR, the CER using two different scales {6.25, 12.5} is 7.9%, the CER using three different scales {6.25, 12.5, 25} is 7.6%, and the CER using four different scales {6.25, 12.5, 25, 50} is 7.4%, so that the increase in the scale can further improve the accuracy of speech recognition. In practical applications, the scale of the waveform encoder can be selected according to practical requirements.

Comparison description three: the MFCC and the GALR provided by this embodiment are respectively connected in series before TDNN-Conf to obtain two different end-to-end ASR systems. The two end-to-end ASR systems were trained on a 21,000-hour (21khrs) Mandarin dataset, and the CER and recognition speed of the trained two end-to-end ASR systems under different speech recognition scenarios are given in Table 3 below.

TABLE 3

In Table 3, reading (read) refers to a 1.5-hour read speech signal, spontaneous (spon) refers to a 2-hour spontaneous speech signal, and music refers to a 2.2-hour speech signal with background music interference. Reading, spontaneous and music belong to three different speech recognition scenarios. It can be seen that, in each speech recognition scenario, the accuracy of speech recognition based on the GALR provided by this embodiment is higher than that based on MFCC. In a complex scene with interference, such as music, the CER reduction is large, reaching 15.2%. Therefore, the GALR provided by this embodiment has stronger robustness in complex scenes with interference, that is, it can still maintain higher accuracy in such scenes. In addition, as can be seen from the comparison of the speech recognition speeds of MFCC and the GALR of this embodiment, the GALR provided in this embodiment can also improve the speech recognition speed compared with the MFCC in the related art.

In summary, this embodiment retains the local details in the speech signal through the local feature extraction process and retains the global relationships of the speech signal through the global feature extraction process, so that feature extraction directly based on the speech signal becomes possible. A feature matrix can be obtained based on the feature extraction of the speech signal, and applying this feature matrix to the speech recognition process can improve the accuracy of speech recognition. In addition, this embodiment can also improve the robustness and the recognition speed of speech recognition.

An embodiment of the present application provides a speech recognition apparatus, and referring to fig. 5, the apparatus includes:

an obtaining module 501, configured to obtain a voice signal;

an input module 502 for inputting a speech signal into a waveform encoder;

an obtaining module 503, configured to obtain a first feature matrix corresponding to a voice signal through a waveform encoder, divide the first feature matrix into at least two first feature segments, perform local feature extraction on the at least two first feature segments to obtain at least two second feature segments used for indicating a local feature, perform global feature extraction on the at least two second feature segments to obtain at least two third feature segments used for indicating a local feature and a global feature, and merge the at least two third feature segments into a second feature matrix corresponding to the waveform encoder;

and a speech recognition module 504, configured to perform speech recognition based on the second feature matrix corresponding to the waveform encoder.
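As a small illustration of the segmentation step performed by the obtaining module 503 above, the sketch below divides a first feature matrix (D rows, L columns) into first feature segments. The segment length K and the 50% overlap are assumptions for illustration only; the actual values follow the formulas of the embodiment.

```python
# Sketch: dividing the first feature matrix into overlapping first feature segments.
import torch

D, L, K = 64, 200, 40                                      # illustrative sizes
first_matrix = torch.randn(D, L)
hop = K // 2                                               # 50% overlap (assumed)
segments = first_matrix.unfold(dimension=1, size=K, step=hop)   # (D, S, K)
first_segments = segments.permute(1, 2, 0)                 # S first feature segments, each (K, D)
```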

In an exemplary embodiment, the number of the waveform encoders is at least two, at least two waveform encoders correspond to at least two second feature matrices, and the number of columns of the at least two second feature matrices is different, and the speech recognition module 504 is configured to obtain at least two third feature matrices corresponding to the at least two second feature matrices through the at least two waveform encoders, where the number of columns of the at least two third feature matrices is the same, and the at least two third feature matrices are in one-to-one correspondence with the at least two waveform encoders; cascading at least two third feature matrices from the row direction to obtain a cascaded feature matrix; and performing voice recognition based on the concatenated feature matrix.

In an exemplary embodiment, the speech recognition module 504 is further configured to determine a minimum number of columns among the number of columns of the at least two second feature matrices, where the minimum number of columns corresponds to the first numerical value; determining a second numerical value corresponding to the column number of a second feature matrix corresponding to any waveform encoder, and determining convolution kernel information corresponding to any waveform encoder based on the ratio of the first numerical value to the second numerical value;

the speech recognition module 504 is configured to perform convolution processing on the second feature matrix corresponding to any waveform encoder based on convolution kernel information corresponding to any waveform encoder, so as to obtain a third feature matrix corresponding to any waveform encoder.

In an exemplary embodiment, the obtaining module 503 is configured to perform the dividing of the first feature matrix into the at least two first feature segments in response to any one of the waveform encoders being a first one of the at least two waveform encoders.

In an exemplary embodiment, the obtaining module 503 is configured to, in response to that any waveform encoder is a non-first waveform encoder of the at least two waveform encoders, obtain a second feature matrix corresponding to a previous waveform encoder of any waveform encoder; pooling a second feature matrix corresponding to a previous waveform encoder to obtain a pooled feature matrix, wherein the column number of the pooled feature matrix is the same as that of a first feature matrix corresponding to any waveform encoder; summing the first feature matrix corresponding to any waveform encoder and the pooled feature matrix to obtain a summed feature matrix; the summed feature matrix is divided into at least two first feature segments.
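A hedged sketch of this fusion step for a non-first waveform encoder is given below. The use of adaptive average pooling to match the column counts is an assumption for illustration; the embodiment only specifies that the previous encoder's second feature matrix is pooled to the same number of columns and then summed with the current first feature matrix.

```python
# Sketch: pool the previous encoder's second feature matrix, then sum with the
# current encoder's first feature matrix before dividing into first feature segments.
import torch
import torch.nn.functional as F

D = 64
prev_second = torch.randn(1, D, 400)      # second feature matrix of the previous waveform encoder
curr_first = torch.randn(1, D, 200)       # first feature matrix of the current waveform encoder

pooled = F.adaptive_avg_pool1d(prev_second, curr_first.shape[-1])   # match the column count
summed = curr_first + pooled              # summed feature matrix, later divided into segments
```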

In an exemplary embodiment, the obtaining module 503 is configured to perform downsampling on the at least two second feature segments to obtain at least two downsampling results; perform global feature extraction on the at least two downsampling results through a self-attention network to obtain at least two global feature extraction results; and perform upsampling on the at least two global feature extraction results to obtain at least two upsampling results, and take the at least two upsampling results as the at least two third feature segments for indicating the local features and the global features.
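The sketch below illustrates one possible arrangement of this step: each second feature segment is downsampled, self-attention is applied across segments at each downsampled position, and the result is upsampled back to the original segment length. The pooling factor, attention configuration, and the choice of attending across segments are assumptions for illustration, not taken from the text.

```python
# Sketch: downsample -> self-attention across segments -> upsample.
import torch
import torch.nn as nn
import torch.nn.functional as F

S, K, D = 12, 40, 64                           # segments, segment length, feature dim (illustrative)
second_segments = torch.randn(S, K, D)

down = F.avg_pool1d(second_segments.transpose(1, 2), kernel_size=4).transpose(1, 2)   # (S, K/4, D)
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)

across = down.transpose(0, 1)                  # (K/4, S, D): attend across segments at each position
global_out, _ = attn(across, across, across)   # global feature extraction results
global_out = global_out.transpose(0, 1)        # back to (S, K/4, D)

up = F.interpolate(global_out.transpose(1, 2), size=K).transpose(1, 2)   # upsample to (S, K, D)
third_segments = up                            # third feature segments (local + global features)
```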

In an exemplary embodiment, the obtaining module 503 is configured to perform nonlinear mapping on at least two third feature segments to obtain at least two nonlinear mapping results; and combining at least two nonlinear mapping results into a second feature matrix corresponding to the waveform coder.
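A small sketch of this last step follows. The Tanh nonlinearity and the simple column-wise concatenation of the mapped segments are assumptions for illustration; the embodiment only specifies a nonlinear mapping followed by merging into the second feature matrix.

```python
# Sketch: nonlinear mapping of the third feature segments, then merging into one matrix.
import torch

S, K, D = 12, 40, 64                                         # illustrative sizes
third_segments = torch.randn(S, K, D)
mapped = torch.tanh(third_segments)                          # nonlinear mapping results
second_matrix = mapped.permute(2, 0, 1).reshape(D, S * K)    # second feature matrix, D rows
```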

In summary, this embodiment retains the local details in the speech signal through the local feature extraction process and retains the global relationships of the speech signal through the global feature extraction process, so that feature extraction directly based on the speech signal becomes possible. A feature matrix can be obtained based on the feature extraction of the speech signal, and applying this feature matrix to the speech recognition process can improve the accuracy of speech recognition. In addition, this embodiment can also improve the robustness and the recognition speed of speech recognition.

It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.

Referring to fig. 6, a schematic structural diagram of an electronic device 600 provided in an embodiment of the present application is shown. The electronic device 600 may be a portable mobile electronic device such as: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. The electronic device 600 may also be referred to by other names such as user equipment, portable electronic device, laptop electronic device, desktop electronic device, and so on.

In general, the electronic device 600 includes: a processor 601 and a memory 602.

The processor 601 may include one or at least two processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 601 may be implemented in at least one hardware form selected from the group consisting of a DSP (Digital Signal Processing), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen 605 needs to display. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.

The memory 602 may include one or at least two computer-readable storage media, which may be non-transitory. The memory 602 may also include high speed random access memory, as well as non-volatile memory, such as one or at least two magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the method of speech recognition provided by the method embodiments herein.

In some embodiments, the electronic device 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of the group consisting of a radio frequency circuit 604, a display 605, a camera assembly 606, an audio circuit 607, a positioning component 608, and a power supply 609.

The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or at least two amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, etc. The radio frequency circuitry 604 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or Wi-Fi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.

The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 605 may be one, disposed on the front panel of the electronic device 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the electronic device 600 or in a foldable design; in other embodiments, the display 605 may be a flexible display disposed on a curved surface or on a folded surface of the electronic device 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.

The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed on a front panel of an electronic apparatus, and a rear camera is disposed on a rear surface of the electronic apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.

Audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For stereo capture or noise reduction purposes, at least two microphones may be provided, each at a different location of the electronic device 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.

The positioning component 608 is used to locate the current geographic location of the electronic device 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the United States' GPS (Global Positioning System), the Chinese BeiDou system, the Russian GLONASS system, or the European Union's Galileo system.

The power supply 609 is used to supply power to various components in the electronic device 600. The power supply 609 may be ac, dc, disposable or rechargeable. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, the electronic device 600 also includes one or at least two sensors 610. The one or at least two sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.

The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the electronic device 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.

The gyro sensor 612 may detect a body direction and a rotation angle of the electronic device 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the electronic device 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 613 may be disposed on a side bezel of the electronic device 600 and/or on a lower layer of the display screen 605. When the pressure sensor 613 is disposed on a side frame of the electronic device 600, a user's holding signal of the electronic device 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 605. The operability control comprises at least one of a group consisting of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the electronic device 600. When a physical button or vendor Logo is provided on the electronic device 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.

The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, the processor 601 may control the display brightness of the display screen 605 based on the ambient light intensity collected by the optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is turned down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.

The proximity sensor 616, also referred to as a distance sensor, is typically disposed on the front panel of the electronic device 600. The proximity sensor 616 is used to capture the distance between the user and the front of the electronic device 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front of the electronic device 600 gradually decreases, the processor 601 controls the display 605 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 616 detects that the distance between the user and the front of the electronic device 600 gradually increases, the processor 601 controls the display 605 to switch from the dark-screen state to the bright-screen state.

Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the electronic device 600, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.

Fig. 7 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 700 may generate a relatively large difference due to a difference in configuration or performance, and may include one or more processors 701 and one or more memories 702, where at least one program code is stored in the one or more memories 702, and is loaded and executed by the one or more processors 701, so as to enable the server to implement the method for speech recognition according to the foregoing method embodiments. Of course, the server 700 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input and output, and the server 700 may also include other components for implementing the functions of the device, which are not described herein again.

The embodiment of the application provides computer equipment, which comprises a memory and a processor; the memory has stored therein at least one instruction that is loaded and executed by the processor to cause the computer device to implement the method of speech recognition provided by any of the exemplary embodiments of this application.

Embodiments of the present application provide a computer-readable storage medium, in which at least one instruction is stored, where the instruction is loaded and executed by a processor, so as to enable a computer to implement a method for speech recognition provided in any one of the exemplary embodiments of the present application.

An embodiment of the present application provides a computer program or a computer program product, where the computer program or the computer program product includes: computer instructions which, when executed by a computer, cause the computer to implement a method of speech recognition as provided by any of the exemplary embodiments of this application.

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again. It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc. The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
