Voice keyword recognition system and method based on graph convolution neural network

Document No.: 925431    Publication date: 2021-03-02

Reading note: This technology, "A speech keyword recognition system and method based on a graph convolution neural network" (Voice keyword recognition system and method based on graph convolution neural network), was designed and created by 陈曦, 宋丹丹, 欧阳鹏 and 尹首一 on 2020-09-29. Its main content is as follows: The invention discloses a system and method for speech keyword recognition based on a graph convolution neural network, belonging to the technical field of lightweight, low-power-consumption speech keyword recognition methods based on neural networks. The system comprises a voice data acquisition module, a band-pass filter, an acoustic feature extraction module, a neural network classifier and a basic network structure. By using a narrow-channel bottleneck structure and residual connections, the system significantly compresses the complexity of the network at comparable accuracy, realizes efficient network computation, and is better suited to low-resource device scenarios. A graph convolution network is introduced to model the global context of the convolutional feature map, improving the accuracy of speech keyword recognition. The invention solves the problems of prior convolutional-neural-network-based keyword recognition methods, namely that the network complexity is still relatively high, the computation is still relatively intensive, and convolutional neural networks have difficulty extracting global information.

1. A speech keyword recognition system based on a graph convolution neural network, comprising:

a voice data acquisition module for acquiring the wake-up words uttered by an operator;

a band-pass filter capable of filtering noise out of the wake-up words received from the voice data acquisition module;

an acoustic feature extraction module capable of receiving the wake-up words acquired by the voice data acquisition module and extracting feature information of the wake-up words;

a basic network structure comprising an initial block, a plurality of stages and a neural network classifier;

the plurality of stages are composed of different numbers of bottleneck blocks; the number of bottleneck blocks is adjusted according to the complexity of the model; the neural network classifier comprises a global pooling layer, a linear layer and a Softmax module;

the initial block has a 3 x 3 convolution without bias; each bottleneck block comprises three convolution layers, the first and third layers being 1 x 1 convolutions and the second layer being a 3 x 3 convolution;

the plurality of stages are composed of different numbers of bottleneck blocks, and the number of bottleneck blocks is adjusted according to the complexity of the model;

a graph convolution neural network module inserted into the basic network structure; the graph convolution neural network module is capable of modeling global context information through graph convolution.

2. The system of claim 1, wherein the graph convolution neural network module regards the convolved feature map as a fully connected graph; information is propagated through the graph convolution network, and the output feature map encodes global information;

the graph convolution neural network module treats the feature map of the convolutional neural network as a fully connected graph, and the information propagation of graph convolution enables the correlations between nodes on the feature map to be modeled directly.

3. The system of claim 1, wherein the band-pass filter has a frequency range of 20 Hz to 4 kHz.

4. The system of claim 3, wherein the acoustic feature extraction module is capable of framing the speech with a frame length of 30 ms and a frame shift of 10 ms, and of extracting the Fbank feature of each frame of speech.

5. The system of claim 4, wherein the process of extracting the Fbank feature of each frame of speech by the acoustic feature extraction module comprises windowing, pre-emphasis, FFT and log-energy computation;

the Fbank is characterized in thatt represents the number of frames in time and f represents the characteristic dimension of the frequency domain.

6. The graph convolution neural network-based speech keyword recognition system of claim 1, wherein the bias-free 3 x 3 convolution layer is capable of extracting a feature representation from the MFCC features and converting the single-channel Fbank feature into a multi-channel convolutional feature map.

7. The system of claim 6, wherein Ratio denotes the dimension-reduction ratio; Ratio is smaller than 1, and a Ratio smaller than 1 compresses the network parameters and the amount of computation.

8. The system of claim 1, wherein the stages differ in the number of channels, a deeper stage having a wider channel count; within a stage, the number of channels is consistent; the last bottleneck structure of each stage is used to increase the dimensionality.

9. The graph convolution neural network-based speech keyword recognition system of claim 8, wherein the global pooling layer is capable of converting the three-dimensional feature map extracted by the convolutions into a one-dimensional vector.

10. A speech keyword recognition method based on a graph convolution neural network is characterized by comprising the following steps:

s101, configuring a voice data acquisition module, and acquiring a wake-up word sent by an operator through the voice data acquisition module;

s102, configuring a band-pass filter which can filter and receive noise in the voice data acquisition module awakening words;

s103, configuring an acoustic feature extraction module which can receive the awakening words collected by the voice data collection module and extracting feature information of the awakening words through the acoustic feature extraction module;

s104, configuring a basic network structure which comprises an initial block group, a plurality of stages and a neural network classifier;

a plurality of the stages are composed of bottelblock blocks with different numbers; the number of the bottelblock is adjusted according to the complexity of the model; the neural network classifier comprises a global pooling layer, a linear layer and a Softmax module;

the initial block has a convolution of 3 x 3 with no offset; the bottomblock comprises three layers of convolutions, the first layer of convolutions and the third layer of convolutions are 1 x 1 convolutions, the second layer of convolutions are 3 x 3 convolutions;

the plurality of stages are composed of bottomblock blocks with different numbers, and the number of the bottomblock blocks is adjusted according to the complexity of the model;

s105, a graph convolution neural network module is inserted into the basic network structure; the convolutional neural network module can model global context information through the convolutional neural network.

Technical Field

The invention belongs to the technical field of lightweight, low-power-consumption speech keyword recognition methods based on neural networks, and particularly relates to a speech keyword recognition system and method based on a graph convolution neural network.

Background

Keyword recognition is often used as the first step in voice interaction to determine whether a user intends to interact. When the user has an interaction intention, the system reacts according to the user's instruction; when the user has no interaction intention, the system remains in a standby, dormant state. The keyword recognition model is generally deployed on the device side and generally runs offline in order to protect the user's privacy. Because the computational and storage resources on the device side are limited and the speech keyword recognition system runs there continuously, strict limits are placed on the model's size, its accuracy, and the amount of computation required for one prediction. At present, keyword recognition methods based on convolutional neural networks have two problems: first, the network complexity is still relatively high and the computation is still relatively intensive; second, it is difficult for a convolutional neural network to extract global information.

Disclosure of Invention

The invention aims to provide a system and method for speech keyword recognition based on a graph convolution neural network, so as to solve the problems of prior-art keyword recognition methods based on convolutional neural networks: the network complexity is relatively high, the computation is relatively intensive, and it is difficult for the convolutional neural network to extract global information.

In order to achieve the above purpose, the invention provides the following technical scheme:

a system for speech keyword recognition based on a graph convolution neural network, comprising:

and the voice data acquisition module is used for acquiring the awakening words sent by the operator.

A band pass filter capable of filtering noise in the received voice data acquisition module wake-up word.

And the acoustic feature extraction module can receive the awakening words acquired by the voice data acquisition module and extract the feature information of the awakening words through the acoustic feature extraction module.

A neural network classifier capable of performing classification of command words by acoustic features.

An infrastructure network architecture includes an initial group of cells, a plurality of stages, and a neural network classifier.

Several stages are composed of different numbers of botteleck blocks. The number of bottelblock blocks is adjusted according to the complexity of the model. The neural network classifier comprises a global pooling layer, a linear layer and a Softmax module.

The initial block has a convolution of 3 x 3 with no offset. The bottomblock comprises three layers of convolutions, the first and third layers of convolutions being 1 x 1 convolutions and the second layer of convolutions being 3 x 3 convolutions.

The plurality of stages are composed of bottomblock blocks with different numbers, and the number of the bottomblock blocks is adjusted according to the complexity of the model.

A graph convolution neural network module inserted into the infrastructure network fabric. The convolutional neural network module can model the global context information through the convolutional neural network.

On the basis of the technical scheme, the invention can be further improved as follows:

Further, the graph convolution neural network regards the convolved feature map as a fully connected graph; through the information propagation of the graph convolution network, the output feature map encodes global information.

The graph convolution neural network module treats the feature map of the convolutional neural network as a fully connected graph, and the information propagation of graph convolution enables the correlations between nodes on the feature map to be modeled directly.

Further, the frequency range of the band-pass filter is 20 Hz to 4 kHz.

Further, the acoustic feature extraction module can frame the speech with a frame length of 30 ms and a frame shift of 10 ms, and can extract the Fbank feature of each frame of speech.

Further, the process of extracting the Fbank feature of each frame of speech by the acoustic feature extraction module comprises windowing, pre-emphasis, FFT and log-energy computation.

The Fbank feature is I ∈ R^(t×f), where t represents the number of frames in time and f represents the feature dimension in the frequency domain.

Further, the 3 × 3 convolutional layer without bias can extract a feature representation from the MFCC features and convert the single-channel Fbank feature into a multi-channel convolutional feature map.

Further, Ratio denotes the dimension-reduction ratio; Ratio is smaller than 1, and a Ratio smaller than 1 compresses the network parameters and the amount of computation.

Further, the stages differ in the number of channels: the deeper the stage, the wider its channels. Within a stage, the number of channels is consistent. The last bottleneck structure of each stage is used to increase the dimensionality.

Further, the global pooling layer can convert the three-dimensional feature map extracted by the convolutions into a one-dimensional vector.

A speech keyword recognition method based on a graph convolution neural network comprises the following steps:

s101, a voice data acquisition module is configured, and awakening words sent by an operator are acquired through the voice data acquisition module.

S102, a band-pass filter is configured and can filter noise in the awakening words of the received voice data acquisition module.

S103, configuring an acoustic feature extraction module which can receive the awakening words collected by the voice data collection module and extracting feature information of the awakening words through the acoustic feature extraction module.

S104, configuring a basic network structure which comprises an initial block group, a plurality of stages and a neural network classifier.

Several stages are composed of different numbers of botteleck blocks. The number of bottelblock blocks is adjusted according to the complexity of the model. The neural network classifier comprises a global pooling layer, a linear layer and a Softmax module.

The initial block has a convolution of 3 x 3 with no offset. The bottomblock comprises three layers of convolutions, the first and third layers of convolutions being 1 x 1 convolutions and the second layer of convolutions being 3 x 3 convolutions.

The plurality of stages are composed of bottomblock blocks with different numbers, and the number of the bottomblock blocks is adjusted according to the complexity of the model.

S105, a graph convolution neural network module is inserted into the basic network structure. The graph convolution neural network module can model the global context information through the graph convolution neural network

The invention has the following advantages: the bottleneck network structure and residual connections are applied to a command word recognition task, so that the relatively complex convolution kernels act on a relatively low dimension, compressing the size and the amount of computation of the model. A graph convolution neural network is introduced to model global context information. The graph convolution network takes the convolved feature map as a fully connected graph; information is propagated through the graph convolution network, and the output feature map encodes the global information.

By using a narrow-channel bottleneck structure and residual connections, the complexity of the network is significantly reduced at comparable accuracy, efficient network computation is realized, and the method is better suited to low-resource device scenarios. A graph convolution network is introduced to model the global context of the convolutional feature map, which improves the accuracy of speech keyword recognition.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below illustrate only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of a method for speech keyword recognition.

Fig. 2 is a schematic diagram of the operation of the speech keyword recognition system.

Fig. 3 is a diagram of the basic network structure of the speech keyword recognition system.

FIG. 4 is a flow chart of graph convolution network information propagation in a speech keyword recognition system.

FIG. 5 is a schematic diagram of the location of the graph convolution module inserted into the underlying network in the speech keyword recognition system.

FIG. 6 is a diagram of comparison data between the networks of the present speech keyword recognition system and the original recognition network.

Description of the reference symbols

The voice recognition system comprises a voice data acquisition module 10, a band-pass filter 20, an acoustic feature extraction module 30, a neural network classifier 40, a global pooling layer 401, a linear layer 402, a Softmax module 403, a basic network structure 50, an initial block 501, a first module 502, a second module 503, a third module 504 and a graph convolution neural network module 60.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

As shown in FIG. 1 to FIG. 6, the embodiment of the present invention provides a speech keyword recognition system based on a graph convolution neural network, which comprises: a voice data acquisition module 10, a band-pass filter 20, an acoustic feature extraction module 30, a neural network classifier 40, and a basic network structure 50.

The speech keyword recognition system based on the graph convolution neural network uses a narrow-channel bottleneck structure and residual connections, which significantly compress the complexity of the network at comparable accuracy, realize efficient network computation, and make the system better suited to low-resource device scenarios. A graph convolution network is introduced to model the global context of the convolutional feature map, improving the accuracy of speech keyword recognition.

As shown in FIG. 1, the broken line represents a residual connection: when learning, the network does not directly learn the feature F(X) but learns F(X) - X, i.e., the residual.

The voice data acquisition module 10 is used for acquiring the wake-up words uttered by the operator.

The band-pass filter 20 can filter noise out of the wake-up words received from the voice data acquisition module 10. In the data pre-processing stage, a 20 Hz - 4 kHz band-pass filter 20 is first used to filter noise out of the speech.
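
The patent does not specify how this filter is realized; the sketch below is a minimal illustration only, assuming a fourth-order Butterworth band-pass applied with SciPy to 16 kHz mono audio.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def bandpass_20hz_4khz(waveform: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
        """Suppress components outside the 20 Hz - 4 kHz speech band."""
        # Fourth-order Butterworth band-pass, applied forward and backward (zero phase).
        sos = butter(N=4, Wn=[20, 4000], btype="bandpass", fs=sample_rate, output="sos")
        return sosfiltfilt(sos, waveform)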

The acoustic feature extraction module 30 can receive the wake-up words collected by the voice data acquisition module 10 and extract their feature information. For each frame of the wake-up word, the acoustic feature extraction module 30 extracts acoustic features covering the intensity, loudness, fundamental frequency and degree of voicing of the sound, and obtains a 40-dimensional feature vector.

The acoustic feature extraction module 30 frames the speech with a frame length of 30 ms and a frame shift of 10 ms, and then extracts the Fbank feature of each frame. Suppose the extracted Fbank feature is I ∈ R^(t×f), where t represents the number of frames in time and f represents the feature dimension in the frequency domain. Since the data is fixed-length 1-second speech and a 40-dimensional feature is employed, t is 101 and f is 40.
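
As an illustration only, such a 40-dimensional log-Mel filterbank front end could be obtained with torchaudio's Kaldi-compatible fbank routine, as sketched below; the exact value of t depends on the framing and padding convention, so this should not be read as the patent's own implementation.

    import torch
    import torchaudio

    def extract_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
        """waveform: (1, num_samples) mono audio -> (t, 40) log-Mel filterbank features."""
        feats = torchaudio.compliance.kaldi.fbank(
            waveform,
            sample_frequency=sample_rate,
            frame_length=30.0,              # frame length in ms
            frame_shift=10.0,               # frame shift in ms
            num_mel_bins=40,                # f = 40
            window_type="hamming",          # windowing
            preemphasis_coefficient=0.97,   # pre-emphasis
            use_log_fbank=True,             # log of the Mel filterbank energies (after FFT)
        )
        return feats                        # I in R^(t x f)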

MFCC is another feature commonly used in speech; it is obtained by further applying a DCT (discrete cosine transform) to the Fbank feature. A dimension of 40 is one of the commonly used acoustic feature dimensions and is an empirical value.

The basic network structure 50 includes an initial block 501, several stages, and the neural network classifier 40.

The bottleneck structure, with ratio < 1, is used as the basic unit of the residual connections to construct the basic network. The ratio is always smaller than one, which compresses the network parameters and the amount of computation. The bottleneck structure confines the 3 × 3 convolution to a low dimension, reducing the number of convolution kernel parameters and the computation of the convolution layers and improving the efficiency of network computation.

The stages are composed of different numbers of bottleneck blocks. The number of bottleneck blocks is adjusted according to the complexity of the model. The neural network classifier 40 includes a global pooling layer 401, a linear layer 402, and a Softmax module 403.

The initial block 501 (Initial Block) has a 3 x 3 convolution without bias. This layer can extract a feature representation from the MFCC features and convert the single-channel Fbank feature into a multi-channel convolutional feature map. To reduce the size of the feature map and remove redundant information, the 3 x 3 convolution is followed by a 2 x 2 average pooling layer (average pooling).
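
A minimal PyTorch sketch of this initial block is given below; the output channel count is an assumption, since the patent does not fix it.

    import torch.nn as nn

    class InitialBlock(nn.Module):
        """Bias-free 3 x 3 convolution followed by 2 x 2 average pooling."""
        def __init__(self, out_channels: int = 16):   # out_channels is a placeholder value
            super().__init__()
            self.conv = nn.Conv2d(1, out_channels, kernel_size=3, padding=1, bias=False)
            self.pool = nn.AvgPool2d(kernel_size=2)    # halves t and f, removes redundancy

        def forward(self, x):                          # x: (batch, 1, t, f) single-channel feature map
            return self.pool(self.conv(x))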

The information carried by a single channel is limited; converting it into multiple channels through convolution with multiple kernels enriches the information available for modeling, and this is a common operation.

The bottleneck block comprises three convolution layers. The first and third layers are 1 x 1 convolutions, used for dimensionality reduction and reconstruction, respectively; the dimensionality is controlled by the number of convolution kernels. The first layer uses a small number of kernels, so the resulting feature map is reduced in dimension, while the third layer increases the number of kernels to reconstruct the dimension. The dimensionality reduction compresses the size of the network and the amount of computation of the model, and the reconstruction satisfies the residual link: since the original feature map has C channels and the first 1 x 1 convolution reduces this to C/4, the channel count must be restored to C for the residual link, so the third convolution layer is required to reconstruct the channel dimension.

The second layer is a 3 x 3 convolution, which therefore operates in this relatively low dimension. Ratio denotes the dimension-reduction ratio (Ratio < 1). In the first module 502 the feature map itself already has a relatively low dimension, so a Ratio that is too small is not chosen and Ratio is set to 0.5; for the second module 503 and the third module 504, Ratio is set to 0.25.
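
Under the description above, one possible PyTorch realization of the bottleneck block with its residual connection is sketched below; the placement of batch normalization and ReLU is an assumption, since the patent does not specify it.

    import torch.nn as nn

    class BottleneckBlock(nn.Module):
        """1x1 reduce -> 3x3 in the low dimension -> 1x1 restore, plus a residual connection."""
        def __init__(self, channels: int, ratio: float = 0.25):
            super().__init__()
            mid = max(1, int(channels * ratio))      # reduced dimension, e.g. C/4 for ratio = 0.25
            self.body = nn.Sequential(
                nn.Conv2d(channels, mid, kernel_size=1, bias=False),         # reduce dimension
                nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),   # 3x3 in the low dimension
                nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, kernel_size=1, bias=False),         # restore dimension
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):                        # x: (batch, C, t, f)
            return self.relu(x + self.body(x))       # residual connection: the body learns F(x) - x

With ratio = 0.25 the 3 x 3 convolution sees only C/4 channels, which is where the parameter and computation savings described above come from.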

The stages are composed of different numbers of bottleneck blocks, and the number of bottleneck blocks is adjusted according to the complexity of the model. Different stages differ in the number of channels: the deeper the stage, the wider its channels, while within a stage the number of channels is consistent. The last bottleneck structure of each stage is used to increase the dimensionality, as sketched below.
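
As an illustration of how such a stage could be assembled (not the patent's own code), the sketch below reuses the BottleneckBlock above, stacking blocks at a fixed width and adding a 1 x 1 projection after the last block as one simple way to widen the channels for the next, deeper stage; block counts and widths are placeholders.

    import torch.nn as nn

    def make_stage(in_channels: int, out_channels: int, num_blocks: int, ratio: float) -> nn.Sequential:
        """Stack bottleneck blocks at a fixed width; promote the channel count at the end of the stage."""
        blocks = [BottleneckBlock(in_channels, ratio) for _ in range(num_blocks)]
        # The last bottleneck of the stage is followed by a 1 x 1 projection that
        # increases the dimensionality for the next, wider stage.
        blocks.append(nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
        return nn.Sequential(*blocks)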

The neural network classifier 40 (Classification Block) consists of a global pooling layer 401 (global average pooling), a linear layer 402 (Linear Layer) and a Softmax module 403. The role of the global pooling layer 401 is to convert the three-dimensional feature map extracted by the convolutions into a one-dimensional vector. Assuming the original feature map has size h x w x c, global pooling averages the h x w points of each channel to obtain one value per channel, i.e. a 1 x c vector, which the linear layer 402 then classifies.
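
A compact sketch of this classifier head follows; the number of keyword classes is an assumption used only for illustration.

    import torch.nn as nn

    class ClassificationBlock(nn.Module):
        """Global average pooling -> linear layer -> Softmax over the keyword classes."""
        def __init__(self, channels: int, num_keywords: int = 12):   # num_keywords is a placeholder
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)          # (b, c, h, w) -> (b, c, 1, 1)
            self.linear = nn.Linear(channels, num_keywords)
            self.softmax = nn.Softmax(dim=-1)

        def forward(self, x):
            v = self.pool(x).flatten(1)                  # one average per channel: (b, c)
            return self.softmax(self.linear(v))          # class probabilities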

A graph convolution neural network module 60, i.e., GCN module (Graph Convolutional Network module), is inserted into the basic network structure 50. The graph convolution neural network module 60 is capable of modeling global context information through the graph convolution neural network.

A convolutional neural network is limited in that it can extract only local information. The speech keyword recognition system of the invention therefore introduces the graph convolution neural network module 60: the feature map of the convolutional neural network is regarded as a fully connected graph, and the associations between nodes on the feature map can be modeled directly through the information propagation of graph convolution. Rich context information is thereby introduced, improving the performance of voice command word recognition.
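
The patent states only that the feature map is treated as a fully connected graph whose nodes exchange information; the sketch below is one hedged interpretation, building a dense adjacency from pairwise feature similarity and performing a single round of propagation with a residual output. The 1 x 1 projections and the similarity-based adjacency are assumptions, not the patent's stated construction.

    import torch
    import torch.nn as nn

    class GraphConvModule(nn.Module):
        """Treat the t x f positions of a feature map as nodes of a fully connected graph."""
        def __init__(self, channels: int):
            super().__init__()
            self.theta = nn.Conv2d(channels, channels, kernel_size=1, bias=False)   # node embedding
            self.weight = nn.Conv2d(channels, channels, kernel_size=1, bias=False)  # propagation weight
            self.out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)     # output projection

        def forward(self, x):                              # x: (b, c, t, f)
            b, c, t, f = x.shape
            nodes = self.theta(x).flatten(2)               # (b, c, N) with N = t * f nodes
            adj = torch.softmax(torch.bmm(nodes.transpose(1, 2), nodes), dim=-1)  # (b, N, N) dense adjacency
            h = self.weight(x).flatten(2)                  # (b, c, N)
            agg = torch.bmm(h, adj.transpose(1, 2))        # each node aggregates context from all nodes
            return x + self.out(agg.view(b, c, t, f))      # residual insertion into the base network

Because the dense adjacency grows with the square of the number of positions, such a module is naturally placed where the feature map is already small.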

To evaluate the graph convolution neural network's ability to model rich context and its influence on the voice command word recognition task, the GCN module is inserted into the basic network; this improves the accuracy of voice command word recognition.

As shown in FIG. 6, the original recognition network occupies 19.9 KB, has an accuracy of 90.1%, and requires 5.65 M multiplications for one prediction. The basic network occupies 16.2 KB, has an accuracy of 93.9%, and requires 1.95 M multiplications for one prediction. The basic network with the graph convolution module occupies 27.6 KB, has an accuracy of 95.2%, and requires 2.55 M multiplications for one prediction.

The keyword recognition model is generally deployed on the device side and generally runs offline in order to protect the user's privacy. The computational and storage resources on the device side are limited, and the speech keyword recognition system is always on at the device side (Always-on), so the size of the model, its accuracy, and the amount of computation required for prediction are strictly limited.

First, a more efficient network structure is tried for the command word task. Inspired by optimizations of deep residual neural networks for training time and computation, the bottleneck network structure and residual connections are applied to the command word recognition task, so that the relatively complex convolution kernels act on a relatively low dimension, compressing the size and the amount of computation of the model.

The speech keyword recognition system introduces a graph convolution neural network to model global context information. The graph convolution network takes the convolved feature map as a fully connected graph; information is propagated through the graph convolution network, and the output feature map encodes the global information. The graph convolution network can model the interrelation of any pair of nodes, independent of the distance between them. The speech signal is a sequential signal with context dependence, and global information helps to improve the performance of speech command word recognition. By enhancing the context modeling capability of the model, the recognition effect is improved. In addition to demonstrating the rationality of the method, the speech keyword recognition system of the present invention also verifies its effectiveness through experiments.

The speech keyword recognition system of the invention first designs a basic voice command word recognition network based on the bottleneck structure, and then adds the graph convolution neural network module 60 to this basic network, further improving performance compared with the basic network alone.

A speech keyword recognition method based on a graph convolution neural network comprises the following steps:

s101, a voice data acquisition module is configured to acquire awakening words sent by an operator.

In this step, a voice data acquisition module 10 is configured, and the voice data acquisition module 10 acquires the awakening words sent by the operator.

S102, a band-pass filter is configured to filter noise in the wake-up words of the received voice data acquisition module.

In this step, a band pass filter 20 is configured, which can filter out noise in the wake-up word of the received voice data acquisition module 10.

S103, an acoustic feature extraction module is configured to extract feature information of the awakening words.

In this step, an acoustic feature extraction module 30 is configured, which can receive the wake-up word acquired by the voice data acquisition module 10, and extract feature information of the wake-up word through the acoustic feature extraction module 30.

S104, a basic network structure is configured.

In this step, a basic network structure 50 is configured, which includes an initial block 501, several stages, and the neural network classifier 40.

The stages are composed of different numbers of bottleneck blocks. The number of bottleneck blocks is adjusted according to the complexity of the model. The neural network classifier 40 includes a global pooling layer 401, a linear layer 402, and a Softmax module 403.

The initial block 501 has a 3 x 3 convolution without bias. Each bottleneck block comprises three convolution layers, the first and third layers being 1 x 1 convolutions and the second layer being a 3 x 3 convolution.

The plurality of stages are composed of different numbers of bottleneck blocks, and the number of bottleneck blocks is adjusted according to the complexity of the model.

S105, a graph convolution neural network module is inserted into the basic network structure; the module can model global context information through the graph convolution neural network.

In this step, a graph convolution neural network module 60 is inserted into the basic network structure 50. The graph convolution neural network module 60 is capable of modeling global context information through the graph convolution neural network.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention, and not for limiting the same. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that: modifications of the technical solutions described in the embodiments above, or equivalent substitutions of some technical features, can still be made. And such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
