Voice recognition method and device, electronic equipment and storage medium

Document No.: 1818089    Publication date: 2021-11-09

Reading note: This technology, "Voice recognition method and device, electronic equipment and storage medium", was designed and created by Zhang Tielin, Liu Hongxing and Xu Bo on 2021-10-12. Abstract: The invention provides a speech recognition method and apparatus, an electronic device and a storage medium, wherein the method comprises: acquiring a pulse sequence corresponding to speech to be recognized; and inputting the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized. The speech recognition model is constructed based on a recurrent spiking neural network; the membrane potential of any neuron in a hidden layer of the model is determined from the neuron spike flags in a forward channel and the neuron spike flags in a recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected. The forward channel is used for connecting a neuron with the neurons in the previous hidden layer; the recurrent channel is used for connecting the output of a neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step. The method, apparatus, electronic device and storage medium provided by the invention can adapt to changes in the recognition samples, improving the robustness of the model and the accuracy of the recognition results.

1. A speech recognition method, comprising:

acquiring a pulse sequence corresponding to speech to be recognized;

inputting the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized;

wherein the speech recognition model is constructed based on a recurrent spiking neural network, the membrane potential of any neuron in a hidden layer of the speech recognition model is determined based on a neuron spike flag in a forward channel and a neuron spike flag in a recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected;

the forward channel is used for connecting the neuron with the neurons in the previous hidden layer; the recurrent channel is used for connecting the output of the neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step.

2. The speech recognition method of claim 1, wherein the membrane potential of any neuron is initialized based on the following equation:

wherein u_i(t) is the membrane potential of neuron i at time t, u_i^f(t) is the forward membrane potential of neuron i at time t, u_i^r(t) is the recurrent membrane potential of neuron i at time t, C_i is the membrane capacitance of neuron i, g_i is the synaptic conductance of neuron i, u_rest is the resting membrane potential of neuron i, N is the number of neurons in the previous hidden layer connected to neuron i, w_ij^f is the synaptic weight in the forward channel between neuron j in the previous hidden layer and neuron i, w_ik^r is the synaptic weight in the recurrent channel between neuron k in the current layer and neuron i, x_j is the input that neuron i receives from neuron j, S denotes a neuron spike flag, S_j^f is the spike flag of neuron j in the forward channel, and S_k^r is the spike flag of neuron k in the recurrent channel.

3. The speech recognition method of claim 2, wherein the membrane potential of any neuron is updated based on the steps of:

determining a dynamic firing threshold for the neuron based on the neuron spike flags in the forward channel and the neuron spike flags in the recurrent channel that are input to the neuron;

updating the membrane potential of the neuron based on the dynamic firing threshold of the neuron and on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron.

4. The speech recognition method of claim 3, wherein the determining the dynamic firing threshold of the neuron based on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron comprises:

wherein A_i(t) is the dynamic firing threshold of neuron i at time t, α is the first weight coefficient, and β is the second weight coefficient.

5. The speech recognition method of claim 4, wherein the updating the membrane potential of the neuron based on the dynamic firing threshold of the neuron and on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron comprises:

wherein w_ij is the synaptic weight between neuron j in the previous hidden layer and neuron i, and γ is the third weight coefficient.

6. The speech recognition method according to any one of claims 1 to 5, wherein the outputs of the neurons in the same hidden layer in the speech recognition model are sparsely connected based on the following steps:

determining a sparse connection ratio, the sparse connection ratio being the ratio of the number of connected neurons in a hidden layer to the number of all neurons in that hidden layer;

and selecting, from the hidden layer, neurons satisfying the sparse connection ratio, and randomly connecting the outputs of the selected neurons.

7. The speech recognition method of any one of claims 1 to 5, wherein the speech recognition model is trained based on the following steps:

obtaining a sample label corresponding to sample speech to be recognized;

mapping the sample label in parallel, through random matrices, to each hidden layer in the speech recognition model, and determining the local gradient of the neuron membrane potential with respect to the synaptic weights in each hidden layer;

training the speech recognition model based on the local gradients of the neuron membrane potentials with respect to the synaptic weights in the hidden layers.

8. A speech recognition apparatus, comprising:

an acquisition unit, configured to acquire a pulse sequence corresponding to speech to be recognized;

a recognition unit, configured to input the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized;

wherein the speech recognition model is constructed based on a recurrent spiking neural network, the membrane potential of any neuron in a hidden layer of the speech recognition model is determined based on a neuron spike flag in a forward channel and a neuron spike flag in a recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected;

the forward channel is used for connecting the input of a neuron with the outputs of the neurons in the previous hidden layer; the recurrent channel is used for connecting the output of a neuron at the previous time step with its input at the current time step.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the speech recognition method according to any of claims 1 to 7 are implemented when the processor executes the program.

10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 7.

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.

Background

In recent years, many different types of deep neural networks have been proposed to solve problems of classification, recognition, memory association and prediction for speech data; however, with the rapid development of deep neural networks, several drawbacks have also become apparent. The first problem is the growth in synaptic parameters: the unbalanced complexity between artificial neurons and the network as a whole causes a deep neural network to contain a very large number of network parameters to be tuned, which increases the difficulty of network learning. The second problem is that the back-propagation process is slow and computationally expensive, and is considered biologically implausible. In a deep neural network, feedforward and feedback passes are interleaved in sequence, and the error signal must propagate backwards from the output neurons to the hidden neurons layer by layer; especially for very deep networks, there is a risk of vanishing or exploding gradients. The supervised and synchronized nature of the computation in deep neural networks also makes them difficult to accelerate through parallel computation. The third problem is that all artificial neurons involved in back propagation must satisfy the constraint of mathematical differentiability, which clearly lacks biological support, since non-differentiable spike signals are ubiquitous in biology. A key problem in the current development of deep neural networks is their poor interpretability and poor biological plausibility; the richer interpretability of spiking neural networks can compensate for this problem.

In processing speech data, a Spiking Neural Network (SNN) may be used. Compared with a deep neural network, a spiking neural network has more complex neuron and synapse structures. Considering that many biological rules ignored by existing artificial networks may be the key to achieving general brain-like intelligence, adding these biological rules to a more brain-like spiking neural network may give the existing network stronger computing power and adaptability. In a spiking neural network, neuronal plasticity plays a crucial role in the dynamic information processing of neurons.

Existing speech recognition methods usually adopt standard neuron models, such as the Hodgkin-Huxley (H-H) model, the LIF (Leaky Integrate-and-Fire) model, the SRM (Spike Response Model) and the Izhikevich model; these models have poor robustness, poor accuracy of the speech recognition results and high computation cost.

Disclosure of Invention

The invention provides a speech recognition method and apparatus, an electronic device and a storage medium, which are used to solve the technical problems in the prior art of poor accuracy of speech recognition results and high computation cost.

The invention provides a speech recognition method, comprising the following steps:

acquiring a pulse sequence corresponding to speech to be recognized;

inputting the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized;

wherein the speech recognition model is constructed based on a recurrent spiking neural network, the membrane potential of any neuron in a hidden layer of the speech recognition model is determined based on a neuron spike flag in a forward channel and a neuron spike flag in a recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected;

the forward channel is used for connecting the neuron with the neurons in the previous hidden layer; the recurrent channel is used for connecting the output of the neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step.

According to the speech recognition method provided by the present invention, the membrane potential of any neuron is initialized based on the following formula:

wherein u_i(t) is the membrane potential of neuron i at time t, u_i^f(t) is the forward membrane potential of neuron i at time t, u_i^r(t) is the recurrent membrane potential of neuron i at time t, C_i is the membrane capacitance of neuron i, g_i is the synaptic conductance of neuron i, u_rest is the resting membrane potential of neuron i, N is the number of neurons in the previous hidden layer connected to neuron i, w_ij^f is the synaptic weight in the forward channel between neuron j in the previous hidden layer and neuron i, w_ik^r is the synaptic weight in the recurrent channel between neuron k in the current layer and neuron i, x_j is the input that neuron i receives from neuron j, S denotes a neuron spike flag, S_j^f is the spike flag of neuron j in the forward channel, and S_k^r is the spike flag of neuron k in the recurrent channel.

According to the speech recognition method provided by the invention, the membrane potential of any neuron is updated based on the following steps:

determining a dynamic firing threshold for the neuron based on the neuron spike flags in the forward channel and the neuron spike flags in the recurrent channel that are input to the neuron;

updating the membrane potential of the neuron based on the dynamic firing threshold of the neuron and on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron.

According to the speech recognition method provided by the present invention, the determining the dynamic firing threshold of a neuron based on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron comprises:

wherein A_i(t) is the dynamic firing threshold of neuron i at time t, α is the first weight coefficient, and β is the second weight coefficient.

According to the speech recognition method provided by the present invention, the updating the membrane potential of a neuron based on the dynamic firing threshold of the neuron and on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron comprises:

wherein w_ij is the synaptic weight between neuron j in the previous hidden layer and neuron i, and γ is the third weight coefficient.

According to the speech recognition method provided by the invention, the outputs of the neurons in the same hidden layer in the speech recognition model are sparsely connected based on the following steps:

determining a sparse connection ratio, the sparse connection ratio being the ratio of the number of connected neurons in a hidden layer to the number of all neurons in that hidden layer;

and selecting, from the hidden layer, neurons satisfying the sparse connection ratio, and randomly connecting the outputs of the selected neurons.

According to the speech recognition method provided by the invention, the speech recognition model is trained based on the following steps:

obtaining a sample label corresponding to sample speech to be recognized;

mapping the sample label in parallel, through random matrices, to each hidden layer in the speech recognition model, and determining the local gradient of the neuron membrane potential with respect to the synaptic weights in each hidden layer;

training the speech recognition model based on the local gradients of the neuron membrane potentials with respect to the synaptic weights in the hidden layers.

The present invention provides a speech recognition apparatus, including:

an acquisition unit, configured to acquire a pulse sequence corresponding to speech to be recognized;

a recognition unit, configured to input the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized;

wherein the speech recognition model is constructed based on a recurrent spiking neural network, the membrane potential of any neuron in a hidden layer of the speech recognition model is determined based on a neuron spike flag in a forward channel and a neuron spike flag in a recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected;

the forward channel is used for connecting the input of a neuron with the outputs of the neurons in the previous hidden layer; the recurrent channel is used for connecting the output of a neuron at the previous time step with its input at the current time step.

The invention provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the speech recognition method when executing the program.

The invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech recognition method.

According to the speech recognition method and apparatus, the electronic device and the storage medium provided by the invention, the speech recognition model is constructed with a recurrent spiking neural network; the membrane potential of any neuron in a hidden layer of the speech recognition model is determined from the neuron spike flags in the forward channel and in the recurrent channel, and the outputs of neurons within the same hidden layer are sparsely connected, so that the membrane potential of a neuron exhibits specific dynamic changes according to its real-time input. The method and apparatus can therefore adapt to changes in the recognition samples, provide stronger dynamic computing capability, improve the robustness of the model and improve the accuracy of the recognition results. In addition, the sparse connections reduce the resource overhead of model learning to a certain extent while improving model performance, reduce memory usage and energy consumption, and make the model well suited for deployment on a chip.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart illustrating a speech recognition method provided by the present invention;

FIG. 2 is a schematic diagram of a recurrent spiking neural network provided by the present invention;

FIG. 3 is a schematic structural diagram of a speech recognition apparatus provided by the present invention;

FIG. 4 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In a spiking neural network, neuronal plasticity plays a crucial role in the dynamic information processing of neurons. Common standard neuron models ignore the important influence of firing-threshold plasticity on neuron dynamics, and the dynamical characteristics of the neurons directly affect the dynamics and robustness of the network's computation. When speech is processed with an existing spiking neural network, the robustness of the model and the accuracy of the recognition results are therefore poor.

The method and apparatus provided by the embodiments of the invention are suitable for processing video data, audio data, image data and the like; audio data is used here for explanation.

FIG. 1 is a schematic flow chart of a speech recognition method provided by the present invention. As shown in FIG. 1, the method includes:

Step 110: acquiring a pulse sequence corresponding to the speech to be recognized.

In particular, the speech to be recognized may be obtained from a public speech data set, such as TIDigits or TIMIT.

Before recognition, the speech to be recognized can be converted into a pulse sequence. The conversion may use a pulse encoder that encodes a non-pulse input signal into a pulse sequence conforming to a certain distribution. The pulse encoder may be, for example, a Poisson encoder, which encodes the input data into a pulse sequence whose firing times follow a Poisson process. For example, a piece of speech may be divided into a number of frames, and each frame may be converted by the pulse encoder into a pulse sequence conforming to a Poisson distribution.
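As an illustration only (not part of the patent text), the following minimal Python sketch shows one common way to realize such rate-based Poisson encoding of a speech frame; the frame features, time window and function names are assumptions introduced for this example.

import numpy as np

def poisson_encode(frame_features, num_steps=100, max_rate=1.0, rng=None):
    # Encode a 1-D vector of non-negative frame features (normalized to [0, 1]) into a
    # binary spike train of shape (num_steps, len(frame_features)): at every time step,
    # each input channel fires with probability proportional to its feature value.
    rng = np.random.default_rng() if rng is None else rng
    rates = np.clip(np.asarray(frame_features, dtype=float), 0.0, 1.0) * max_rate
    return (rng.random((num_steps, rates.size)) < rates).astype(np.uint8)

# Usage: one speech frame described by, e.g., normalized filter-bank energies.
frame = [0.1, 0.8, 0.3, 0.05]
spikes = poisson_encode(frame, num_steps=50)
print(spikes.shape, spikes.mean(axis=0))   # empirical firing rates roughly track the inputs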

Step 120: inputting the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized.

Here, the speech recognition model is constructed on the basis of a recurrent spiking neural network; the membrane potential of any neuron in a hidden layer of the speech recognition model is determined from the neuron spike flags in the forward channel and the neuron spike flags in the recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected. The forward channel is used for connecting a neuron with the neurons in the previous hidden layer; the recurrent channel is used for connecting the output of a neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step.

Specifically, the speech recognition result is the result obtained by recognizing the pulse sequence of the speech to be recognized. For example, if the speech to be recognized is the pronunciation of the numbers 1 to 9, the speech recognition result may be the specific number corresponding to that pronunciation.

The speech recognition model can be obtained by pre-training, and the specific training procedure is as follows. First, a large number of sample speech segments to be recognized and the sample label (speech recognition result) corresponding to each of them are collected. Second, each sample speech segment is converted to obtain its corresponding pulse sequence. Finally, an initial model is trained with the pulse sequence and the sample label corresponding to each sample speech segment, so that the initial model learns features from the pulse sequences; with the sample labels as ground truth, the ability of the initial model to predict the content of the sample speech is improved, yielding the speech recognition model.

The initial model of the speech recognition model may be a recurrent spiking neural network. The network structure of the recurrent spiking neural network may include an input layer, an output layer and a plurality of hidden layers. The number of neurons in the input layer may be determined by the length of the input pulse sequence, and the number of neurons in the output layer may be determined by the number of speech recognition classes. The number of hidden layers and the number of neurons in each hidden layer can be set according to actual needs.

After the neurons in each hidden layer have been connected, the outputs of the neurons within the same hidden layer can be sparsely connected, so that the signal features extracted by the neurons in the current hidden layer are fused before being fed into the neurons of the next hidden layer. The neurons in the next hidden layer thus receive richer input information, which improves the ability to learn finer-grained features, improves the noise resistance of the model, and thereby improves the robustness of the whole speech recognition model.

The sparse connections may be random connections. For example, if the current hidden layer includes four neurons, neuron 1 to neuron 4, then the outputs of neurons 1 and 2 may be connected, and the outputs of neurons 3 and 4 may be connected.

The output of the hidden layers in a spiking neural network is composed of the firing states of the neurons of the hidden layers, which are determined by the membrane potential, i.e., the output of any hidden layer is determined by the membrane potential of each neuron in that layer.

Because the speech recognition model provided by the embodiments of the invention is constructed with a recurrent spiking neural network as the initial model, the input channels of any neuron in any hidden layer comprise a forward channel and a recurrent channel.

The forward channel is used to connect a neuron with the neurons in the previous hidden layer. The product of the output of a neuron in the previous hidden layer and the corresponding synaptic weight in the forward channel serves as an input to the neuron in the current hidden layer.

The recurrent channel is used for connecting the output of a neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step.

For any neuron in any hidden layer of the speech recognition model, both the initialization and the update of the membrane potential are influenced by the forward channel and the recurrent channel; specifically, the membrane potential is determined according to the neuron spike flag in the forward channel and the neuron spike flag in the recurrent channel. A neuron spike flag represents the number of spikes fired when the membrane potential reaches the firing threshold. Determining the membrane potential of a neuron through the spike flags of the two channels gives the neurons in the hidden layer plasticity, so that the membrane potential exhibits specific dynamic changes according to the real-time input.

According to the speech recognition method provided by the embodiments of the invention, the speech recognition model is constructed with a recurrent spiking neural network; the membrane potential of any neuron in a hidden layer of the speech recognition model is determined from the neuron spike flags in the forward channel and in the recurrent channel, and the outputs of neurons within the same hidden layer are sparsely connected, so that the membrane potential of a neuron exhibits specific dynamic changes according to its real-time input. The method can therefore adapt to changes in the recognition samples, has stronger dynamic computing capability, improves the robustness of the model and improves the accuracy of the recognition results. In addition, the sparse connections reduce the resource overhead of model learning to a certain extent while improving model performance, reduce memory usage and energy consumption, and make the model well suited for deployment on a chip.

Based on the above example, the membrane potential of any neuron is initialized based on the following equation:

wherein u_i(t) is the membrane potential of neuron i at time t, u_i^f(t) is the forward membrane potential of neuron i at time t, u_i^r(t) is the recurrent membrane potential of neuron i at time t, C_i is the membrane capacitance of neuron i, g_i is the synaptic conductance of neuron i, u_rest is the resting membrane potential of neuron i, N is the number of neurons in the previous hidden layer connected to neuron i, w_ij^f is the synaptic weight in the forward channel between neuron j in the previous hidden layer and neuron i, w_ik^r is the synaptic weight in the recurrent channel between neuron k in the current layer and neuron i, x_j is the input that neuron i receives from neuron j, S denotes a neuron spike flag, S_j^f is the spike flag of neuron j in the forward channel, and S_k^r is the spike flag of neuron k in the recurrent channel.

Specifically, because sparse connections are used in the speech recognition model, both the neuron spike flags in the forward channel and the neuron spike flags in the recurrent channel affect the membrane potential of a neuron. The spike flags in the forward channel drive the neuron to generate a forward membrane potential, while the spike flags in the recurrent channel drive it to generate a recurrent membrane potential; the membrane potentials produced by the two channels act on the neuron simultaneously. These two types of membrane potential can be defined by the following equations:

wherein V_th is the firing threshold, t_j^f is the time at which a spike is fired in the forward channel, t_k^r is the time at which a spike is fired in the recurrent channel, t_ref is the refractory period of the neuron, τ_f is the time constant of the forward channel, and τ_r is the time constant of the recurrent channel.

With these definitions, the forward and recurrent membrane potentials can be integrated, and the membrane potential of any neuron can be initialized to obtain the formula above.
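Purely for illustration, since the equations themselves are not reproduced in this text, a conductance-based LIF formulation consistent with the symbol definitions above could take roughly the following form (LaTeX notation; the exponential kernels and their exact arguments are assumptions, not the patent's formulas):

u_i^{f}(t) = \sum_{j=1}^{N} w_{ij}^{f} \, x_j \, S_j^{f} \, \exp\!\left(-\frac{t - t_j^{f}}{\tau_f}\right), \quad t - t_j^{f} > t_{\mathrm{ref}}

u_i^{r}(t) = \sum_{k} w_{ik}^{r} \, S_k^{r} \, \exp\!\left(-\frac{t - t_k^{r}}{\tau_r}\right), \quad t - t_k^{r} > t_{\mathrm{ref}}

C_i \frac{\mathrm{d}u_i(t)}{\mathrm{d}t} = -g_i \big(u_i(t) - u_{\mathrm{rest}}\big) + u_i^{f}(t) + u_i^{r}(t)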

The dynamic change of the LIF neuron membrane potential used in the spiking neural network is shown in the following formula:

wherein t_i^(f) denotes the time at which the neuron fires a particular spike; the membrane potential has a history-integration property, and the refractory period is controlled through the resting history of the membrane potential rather than by directly blocking the membrane potential.
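For reference, the standard (textbook) LIF dynamics usually written for such neurons, which is not necessarily the patent's exact formula, is:

\tau_m \frac{\mathrm{d}u_i(t)}{\mathrm{d}t} = -\big(u_i(t) - u_{\mathrm{rest}}\big) + R \, I_i(t), \qquad u_i(t) \ge V_{\mathrm{th}} \;\Rightarrow\; \text{spike at } t = t_i^{(f)}, \; u_i(t) \leftarrow u_{\mathrm{rest}}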

In accordance with any of the above embodiments, the membrane potential of any neuron is updated based on the following steps:

determining a dynamic firing threshold of the neuron based on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron;

updating the membrane potential of the neuron based on the dynamic firing threshold of the neuron and on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron.

Specifically, after the membrane potential of a neuron is initialized, the firing threshold of the neuron is influenced by the two channels; the membrane-potential firing threshold of the neuron can therefore be adaptively updated according to both channels, which improves the dynamic characteristics of the model.

The dynamic firing threshold of a neuron can be determined based on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron. Then, according to the dynamic firing threshold, the membrane potential of the neuron is updated in combination with the spike flags in the forward channel and in the recurrent channel.

Based on any one of the above embodiments, determining the dynamic firing threshold of a neuron based on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron includes:

wherein A_i(t) is the dynamic firing threshold of neuron i at time t, α is the first weight coefficient, and β is the second weight coefficient.

Specifically, the determination of the dynamic firing threshold can be expressed by the above formula, which is an ordinary differential equation. When there are no input spikes in either channel, the equilibrium point of the dynamic firing threshold is 0. When spikes arrive from the forward channel and from the recurrent channel, the equilibrium point of the dynamic firing threshold is shifted according to the first weight coefficient α and the second weight coefficient β. Both coefficients are hyperparameters and can be set according to the practical situation.
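As an illustrative sketch only (the patent's own equation is not reproduced in this text, and the time constant τ_th is an assumed symbol), a first-order threshold-adaptation equation with the equilibrium behaviour described above could be written as:

\tau_{th} \frac{\mathrm{d}A_i(t)}{\mathrm{d}t} = -A_i(t) + \alpha \, S_i^{f}(t) + \beta \, S_i^{r}(t)

With no input spikes, the threshold increment A_i(t) relaxes to 0; with spikes present in the forward and recurrent channels, it settles near the weighted combination of the two spike flags.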

for theAccording to the formula in the above embodiment, a stable solution can be obtained as follows:

based on any one of the above embodiments, updating the membrane potential of any one neuron based on the dynamic firing threshold of any one neuron and the neuron pulse flag in the forward channel and the neuron pulse flag in the circular channel of any one neuron input comprises:

wherein w_ij is the synaptic weight between neuron j in the previous hidden layer and neuron i, and γ is the third weight coefficient.

Specifically, the dynamic firing threshold of a neuron improves the plasticity of the neuron; further, an update formula for the membrane potential of the neuron is obtained from the LIF neuron model, as shown above.

From the resting membrane potential until the membrane potential is triggered, the dynamic firing threshold gradually accumulates and finally reaches a relatively stable value. Because the input spikes change the firing threshold, the firing state driven by the threshold changes accordingly.

The hyperparameters can be set according to the actual situation.
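To make the two-channel update concrete, the following minimal Python sketch (an illustration under assumed discrete-time LIF dynamics; all function names, constants and the exact form of the flag-driven threshold are assumptions, not the patent's formulation) advances one hidden layer of adaptive-threshold neurons by one time step:

import numpy as np

def lif_step(u, a, x_prev, s_rec, W_f, W_r, p):
    # One discrete-time step of adaptive-threshold LIF neurons.
    # u: membrane potentials; a: adaptive threshold increments;
    # x_prev: spikes from the previous hidden layer (forward channel);
    # s_rec: this layer's spikes at the previous time step (recurrent channel).
    i_f = W_f @ x_prev                               # forward-channel input current
    i_r = W_r @ s_rec                                # recurrent-channel input current
    s_f = (i_f > 0).astype(float)                    # forward-channel spike flag seen by each neuron
    s_r = (i_r > 0).astype(float)                    # recurrent-channel spike flag seen by each neuron
    u = u + (p["dt"] / p["tau_m"]) * (-(u - p["u_rest"]) + i_f + i_r)
    a = a + (p["dt"] / p["tau_th"]) * (-a + p["alpha"] * s_f + p["beta"] * s_r)
    spikes = (u >= p["v_th"] + a).astype(float)      # compare against the dynamic firing threshold
    u = np.where(spikes > 0, p["u_rest"], u)         # reset the neurons that fired
    return u, a, spikes

params = dict(dt=1.0, tau_m=20.0, tau_th=40.0, u_rest=0.0, v_th=1.0, alpha=0.5, beta=0.3)
rng = np.random.default_rng(0)
W_f = rng.normal(0.0, 0.1, (64, 32))                 # previous layer -> this layer
W_r = rng.normal(0.0, 0.05, (64, 64))                # recurrent connections within the layer
u, a, s = np.full(64, params["u_rest"]), np.zeros(64), np.zeros(64)
x = rng.integers(0, 2, 32).astype(float)             # spikes arriving from the previous layer
u, a, s = lif_step(u, a, x, s, W_f, W_r, params)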

based on any of the above embodiments, the output of the neurons in the same hidden layer in the speech recognition model is sparsely connected based on the following steps:

determining a sparse connection proportion; the sparse connection proportion is the number proportion of the connected neurons in any hidden layer to all the neurons in any hidden layer;

and selecting the neurons meeting the sparse connection proportion from any hidden layer, and randomly connecting the output of each neuron.

Specifically, when sparsely connecting the neurons within the same hidden layer, a sparse connection ratio may be set to represent the degree of sparse connectivity.

The sparse connection ratio is the ratio of the number of connected neurons in a hidden layer to the number of all neurons in that hidden layer. For example, when the sparse connection ratio is 60%, 60% of all the neurons in the hidden layer may be selected for sparse connection. The specific connection mode is random connection.
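A minimal sketch of how such a random sparse mask over the recurrent (same-layer) outputs might be generated is given below; the function name, the per-neuron fan-out and the masking scheme are illustrative assumptions, not the patent's implementation:

import numpy as np

def sparse_recurrent_mask(n_neurons, connect_ratio=0.6, rng=None):
    # Select connect_ratio of the neurons and randomly wire each selected neuron's
    # output to a small subset of the other neurons in the same layer.
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((n_neurons, n_neurons))
    selected = rng.choice(n_neurons, size=int(connect_ratio * n_neurons), replace=False)
    for j in selected:
        targets = rng.choice(n_neurons, size=max(1, n_neurons // 10), replace=False)
        mask[targets, j] = 1.0                        # column j: outgoing connections of neuron j
    np.fill_diagonal(mask, 0.0)                       # no self-connections
    return mask

# Usage: mask the recurrent weight matrix of a 64-neuron hidden layer at a 60% ratio.
rng = np.random.default_rng(0)
W_r = sparse_recurrent_mask(64, 0.6, rng) * rng.normal(0.0, 0.05, (64, 64))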

According to the speech recognition method provided by the embodiments of the invention, the random sparse connections make the speech recognition model operate in a way more similar to the human brain, improving the biological plausibility of the model.

Based on any of the above embodiments, the speech recognition model is trained based on the following steps:

obtaining a sample label corresponding to the sample speech to be recognized;

mapping the sample label in parallel, through random matrices, to each hidden layer in the speech recognition model, and determining the local gradient of the neuron membrane potential with respect to the synaptic weights in each hidden layer;

training the speech recognition model based on the local gradients of the neuron membrane potentials with respect to the synaptic weights in each hidden layer.

Specifically, existing neural networks back-propagate an error signal to the hidden-layer neurons layer by layer in order to train the model.

Unlike existing neural networks, when the speech recognition model in this application updates its parameters during training, a global label rather than an error signal is used as the reward for gradient propagation; the global label is used to modify the parameters of each layer in parallel, and there is no gradient propagation between layers.

The sample label L (Label) corresponding to the sample speech to be recognized is mapped to the different hidden layers through corresponding random matrices B, and the mapping result is taken as the gradient of the hidden-layer output neurons, expressed by the following formula:

wherein g_l is the gradient of the output neurons of the l-th layer and B_l is the random matrix corresponding to the l-th layer; the dimension of the random matrix B_l is determined by the number of neurons in the l-th layer.

Then, when the synaptic weights of each layer are updated, the derivative with respect to the spike time is calculated, as expressed by the following formula:

wherein ∂u/∂w is the local gradient of the neuron membrane potential with respect to the synaptic weights in the hidden layer, Δs_i(t) is the difference of the firing pulses (spikes) of the i-th neuron at time t, and ε is a set value.

The above formula is used only where the spiking process is non-differentiable.
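For illustration, a minimal Python sketch of this label-broadcast, layer-local update in the style of direct feedback alignment is given below; the random matrices, the surrogate derivative and the learning rule here are assumed realizations, not the patent's exact formulas:

import numpy as np

def layer_local_update(W, u, pre_spikes, label_onehot, B, v_th, eps=0.5, lr=1e-3):
    # Broadcast the global label to this hidden layer through a fixed random matrix B
    # and apply a layer-local weight update gated by a surrogate spike derivative.
    g = B @ label_onehot                                    # per-neuron gradient signal from the label
    surrogate = (np.abs(u - v_th) < eps).astype(float)      # derivative used only near the firing threshold
    dW = np.outer(g * surrogate, pre_spikes)                # local gradient of membrane potential w.r.t. weights
    return W - lr * dW

# Usage with hypothetical sizes: 64 hidden neurons, 32 presynaptic neurons, 10 classes.
rng = np.random.default_rng(0)
B = rng.normal(0.0, 1.0 / np.sqrt(10), (64, 10))            # fixed random matrix sized by the layer width
W = rng.normal(0.0, 0.1, (64, 32))
u = rng.normal(0.8, 0.3, 64)
pre = rng.integers(0, 2, 32).astype(float)
label = np.eye(10)[3]
W = layer_local_update(W, u, pre, label, B, v_th=1.0)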

Based on any one of the above embodiments, an embodiment of the present invention provides a speech recognition method, including:

Step 1: inputting data and encoding the data into a pulse sequence;

Step 2: adaptively modifying the neuron firing threshold according to historical spike information, and updating the dynamic characteristics;

Step 3: FIG. 2 is a schematic structural diagram of the recurrent spiking neural network provided by the present invention. As shown in FIG. 2, the dynamical neurons described in Step 2 are used to construct a recurrent spiking neural network with custom sparse connections; the network comprises an input layer, hidden layer 1, hidden layer 2 and an output layer, and the dotted lines in the figure are the sparse connections;

Step 4: in the parameter-updating stage of the neural network, global labels rather than error signals are used as the reward for gradient propagation;

Step 5: recognizing the audio sequence with the recurrent spiking neural network based on neuron plasticity and the reward-propagation mechanism. The recurrent spiking neural network performs speech sequence recognition using a group-decision scheme at the output layer: for a given input, the class whose output neurons respond the most is taken as the final speech class of the model, as sketched below.
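The sketch below shows one plausible realization of such a population (group-decision) readout; splitting the output neurons evenly into per-class groups is an assumption made for this example:

import numpy as np

def group_decision(output_spikes, num_classes):
    # output_spikes: array of shape (time_steps, num_outputs), with the output neurons
    # split evenly into num_classes groups; the class whose group fires the most wins.
    counts = output_spikes.sum(axis=0)                       # total spikes per output neuron
    per_class = counts.reshape(num_classes, -1).sum(axis=1)  # total spikes per class group
    return int(np.argmax(per_class))

# Usage: 10 classes, 5 output neurons per class, 100 simulated time steps.
spikes = np.random.default_rng(0).integers(0, 2, (100, 50))
predicted_class = group_decision(spikes, num_classes=10)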

The speech recognition method provided by the embodiments of the invention adopts the recurrent spiking neural network as the initial model and has the following advantages:

(1) Dynamical computation: dynamic neurons with adaptive-threshold characteristics are added, enriching the plasticity of the neurons; many neurons in the network exhibit specific dynamic changes, and the overall dynamical computing capability of the network is improved.

(2) Low power consumption: sparse connections between neurons can reduce computational overhead and power consumption without affecting performance, something deep neural networks cannot do.

(3) Robust computation: the adjustable recurrent connections within the hidden layers benefit recognition performance, especially for noisy samples, and are better at maintaining sequence information and achieving robust classification.

(4) Biological plausibility: the global label, rather than the error in back-propagation, is used as the reward for parallel gradient propagation, which is more consistent with findings in biology and helps in understanding the reward-propagation mode of the brain.

Based on any of the above embodiments, FIG. 3 is a schematic structural diagram of a speech recognition apparatus provided by the present invention. As shown in FIG. 3, the apparatus includes:

an acquisition unit 310, configured to acquire a pulse sequence corresponding to speech to be recognized;

a recognition unit 320, configured to input the pulse sequence into the speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized;

wherein the speech recognition model is constructed on the basis of a recurrent spiking neural network; the membrane potential of any neuron in a hidden layer of the speech recognition model is determined from the neuron spike flags in the forward channel and the neuron spike flags in the recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected. The forward channel is used for connecting a neuron with the neurons in the previous hidden layer; the recurrent channel is used for connecting the output of a neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step.

The speech recognition apparatus provided by the invention constructs the speech recognition model with a recurrent spiking neural network; the membrane potential of any neuron in a hidden layer of the speech recognition model is determined from the neuron spike flags in the forward channel and in the recurrent channel, and the outputs of neurons within the same hidden layer are sparsely connected, so that the membrane potential of a neuron exhibits specific dynamic changes according to its real-time input. The apparatus can therefore adapt to changes in the recognition samples, has stronger dynamic computing capability, improves the robustness of the model and improves the accuracy of the recognition results. In addition, the sparse connections reduce the resource overhead of model learning to a certain extent while improving model performance, reduce memory usage and energy consumption, and make the model well suited for deployment on a chip.

In any of the above embodiments, the membrane potential of any neuron is initialized based on the following equation:

wherein u_i(t) is the membrane potential of neuron i at time t, u_i^f(t) is the forward membrane potential of neuron i at time t, u_i^r(t) is the recurrent membrane potential of neuron i at time t, C_i is the membrane capacitance of neuron i, g_i is the synaptic conductance of neuron i, u_rest is the resting membrane potential of neuron i, N is the number of neurons in the previous hidden layer connected to neuron i, w_ij^f is the synaptic weight in the forward channel between neuron j in the previous hidden layer and neuron i, w_ik^r is the synaptic weight in the recurrent channel between neuron k in the current layer and neuron i, x_j is the input that neuron i receives from neuron j, S denotes a neuron spike flag, S_j^f is the spike flag of neuron j in the forward channel, and S_k^r is the spike flag of neuron k in the recurrent channel.

Based on any of the above embodiments, the apparatus further comprises:

an updating unit, configured to determine a dynamic firing threshold of a neuron based on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron;

and to update the membrane potential of the neuron based on the dynamic firing threshold of the neuron and on the neuron spike flags in the forward channel and in the recurrent channel that are input to the neuron.

Based on any of the above embodiments, the updating unit is configured to determine the dynamic firing threshold based on the following formula:

wherein A_i(t) is the dynamic firing threshold of neuron i at time t, α is the first weight coefficient, and β is the second weight coefficient.

Based on any of the above embodiments, the updating unit is configured to update the membrane potential based on the following formula:

wherein w_ij is the synaptic weight between neuron j in the previous hidden layer and neuron i, and γ is the third weight coefficient.

Based on any embodiment above, the apparatus further comprises:

a sparse connection unit, configured to determine a sparse connection ratio, the sparse connection ratio being the ratio of the number of connected neurons in a hidden layer to the number of all neurons in that hidden layer;

and to select, from the hidden layer, neurons satisfying the sparse connection ratio and randomly connect the outputs of the selected neurons.

Based on any embodiment above, the apparatus further comprises:

a training unit, configured to acquire a sample label corresponding to the sample speech to be recognized;

to map the sample label in parallel, through random matrices, to each hidden layer in the speech recognition model, and to determine the local gradient of the neuron membrane potential with respect to the synaptic weights in each hidden layer;

and to train the speech recognition model based on the local gradients of the neuron membrane potentials with respect to the synaptic weights in each hidden layer.

Based on any of the above embodiments, FIG. 4 is a schematic structural diagram of an electronic device provided by the present invention. As shown in FIG. 4, the electronic device may include: a processor (Processor) 410, a communication interface (Communication Interface) 420, a memory (Memory) 430 and a communication bus (Communication Bus) 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with one another via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform the following method:

acquiring a pulse sequence corresponding to speech to be recognized; inputting the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized; wherein the speech recognition model is constructed on the basis of a recurrent spiking neural network, the membrane potential of any neuron in a hidden layer of the speech recognition model is determined from the neuron spike flags in the forward channel and the neuron spike flags in the recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected; the forward channel is used for connecting a neuron with the neurons in the previous hidden layer, and the recurrent channel is used for connecting the output of a neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step.

In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as an independent product. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a number of instructions for enabling a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The processor in the electronic device provided in the embodiments of the present invention may call logic instructions in the memory to implement the above method; its specific implementation is consistent with the method embodiments described above and achieves the same beneficial effects, which are not repeated here.

Embodiments of the present invention further provide a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the method provided in the foregoing embodiments, the method including:

acquiring a pulse sequence corresponding to speech to be recognized; inputting the pulse sequence into a speech recognition model to obtain a speech recognition result corresponding to the speech to be recognized; wherein the speech recognition model is constructed on the basis of a recurrent spiking neural network, the membrane potential of any neuron in a hidden layer of the speech recognition model is determined from the neuron spike flags in the forward channel and the neuron spike flags in the recurrent channel, and the outputs of the neurons within the same hidden layer are sparsely connected; the forward channel is used for connecting a neuron with the neurons in the previous hidden layer, and the recurrent channel is used for connecting the output of a neuron at the previous time step with the outputs of the other neurons in the same layer at the current time step.

When the computer program stored on the non-transitory computer-readable storage medium provided in the embodiments of the present invention is executed, the above method is implemented; its specific implementation is consistent with the method embodiments described above and achieves the same beneficial effects, which are not repeated here.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes commands for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
