MVGG-CTC-based keyword search method


Note: This technology, an MVGG-CTC-based keyword search method (一种基于MVGG-CTC的关键词搜索方法), was designed and created by Jiang Hai, Liu Junnan, Wang Hua, Xue Hui, and Qi Xin on 2021-07-02.

Abstract: A keyword search method based on MVGG-CTC, the method comprising the steps of: acquiring speech data; preprocessing the speech data; inputting the speech data into an MVGG-CTC model for training; constructing a language model and a dictionary; converting the pinyin sequence output by the MVGG-CTC model into continuous text; constructing a speech keyword search library using an inverted index; entering keywords into the search library for retrieval; and acquiring the retrieval results it outputs. The method provides the following beneficial effects: (1) the network structure is improved, raising the efficiency and accuracy of speech recognition; (2) the robustness of the model is improved; (3) fast, intelligent, automatic keyword retrieval is achieved.

1. An MVGG-CTC-based keyword search method, characterized by comprising the following steps:

acquiring speech data;

preprocessing the speech data;

inputting the speech data into an MVGG-CTC model for training;

constructing a language model and a dictionary;

converting the pinyin sequence output by the MVGG-CTC model into continuous text;

constructing a speech keyword search library using an inverted index;

entering keywords into the speech keyword search library for retrieval; and

acquiring the retrieval results output by the speech keyword search library.

2. The MVGG-CTC-based keyword search method of claim 1, wherein preprocessing the speech data comprises the steps of:

performing feature extraction on the speech data;

performing silence detection on the speech data; and

performing multi-environment reverberation enhancement on the speech data.

3. The MVGG-CTC-based keyword search method of claim 1, wherein inputting the speech data into the MVGG-CTC model for training comprises the steps of:

constructing an MVGG model;

constructing a connectionist temporal classification (CTC) classifier;

inputting the speech data into the MVGG model and the CTC classifier in sequence; and

acquiring the pinyin sequence output by the CTC classifier.

4. The MVGG-CTC-based keyword search method of claim 1, wherein constructing the language model and the dictionary comprises the steps of:

constructing unigram and bigram language models; and

building the Chinese characters of the unigram language model into a pinyin dictionary.

5. The MVGG-CTC-based keyword search method of claim 1, wherein converting the pinyin sequence output by the MVGG-CTC model into continuous text comprises the steps of:

constructing a Markov model;

constructing a decoder based on a Viterbi algorithm with panic compensation;

inputting the pinyin sequence into the Markov model and the decoder in sequence; and

acquiring the continuous text output by the decoder.

6. The MVGG-CTC-based keyword search method of claim 1, wherein constructing the speech keyword search library using the inverted index comprises the steps of:

recognizing the text sequence corresponding to each speech segment in the speech data;

constructing an inverted index library; and

constructing a search program.

Technical Field

The invention belongs to the technical field of speech recognition, and particularly relates to an MVGG-CTC-based keyword search method.

Background

With the rapid development of network communication technology, audio and video media have become a mainstream form of information transmission, and their efficient circulation and rapid interaction are increasingly prominent. Public opinion information now spreads and is exchanged at an unprecedented scale. While this brings convenience, the harm caused by the spread of undesirable content is also increasingly apparent: the circulation of pornographic, terrorist, violent, and similar material not only offends public order and good morals but also poses serious threats and hidden dangers to public safety, a problem of key concern to the relevant authorities in China. Effectively strengthening monitoring and blocking the spread of such content while guaranteeing the free flow of information, and effectively guiding and resolving public opinion crises, is of great practical significance for maintaining social stability and promoting national development, and is a new subject and challenge for information science.

For monitoring public opinion information in audio and video media, the most effective approach is to monitor the audio for keywords in real time and to establish a keyword search system. Such a system automatically recognizes continuous speech data, monitors it for sensitive keywords, and builds inverted keyword indexes for the speech segments that contain them, to facilitate later manual verification. The method comprises speech signal preprocessing and feature extraction, establishment of language and acoustic models, and construction of a decoder and an inverted index, wherein:

1) Speech signal preprocessing and feature extraction perform front-end processing on the speech signal data and comprise three parts: feature extraction, silence detection, and speech enhancement with mixed multi-environment reverberation. Feature extraction usually uses spectrogram features, FilterBank (filter bank) features, MFCC (Mel-frequency cepstral coefficient) features, or PLP (perceptual linear prediction) features. Silence detection techniques include VAD (voice activity detection) based on SNR (signal-to-noise ratio), VAD based on a GMM (Gaussian mixture model), silence detection based on a DNN (deep neural network), and the like. Multi-environment reverberation enhancement mainly covers indoor reverberation enhancement, outdoor noise enhancement, music noise enhancement, and the like.

2) Traditional acoustic models include GMM-HMM (Gaussian mixture model with hidden Markov model), HMM-DNN (hidden Markov model with deep neural network), and the like. These models are formed by cascading several sub-models, which lowers efficiency, and accuracy is lost at each stage of the cascade.

3) Although the WFST (weighted finite-state transducer) decoders of the prior art offer good speed and accuracy, each module must still be trained separately when the approach is applied, and because the model is complex and some key information is lost between stages, the results are often unsatisfactory.

4) The most common search technology is the relational database, which most model software uses. Relational databases excel at queries and updates but handle large volumes of data poorly. Given this situation, the most urgent tasks are to reduce model complexity, improve the efficiency of keyword search, and remedy the various defects of existing models.

Disclosure of Invention

In order to solve the above problems, the invention provides an MVGG-CTC-based keyword search method, comprising the steps of:

acquiring speech data;

preprocessing the speech data;

inputting the speech data into an MVGG-CTC model for training;

constructing a language model and a dictionary;

converting the pinyin sequence output by the MVGG-CTC model into continuous text;

constructing a speech keyword search library using an inverted index;

entering keywords into the speech keyword search library for retrieval; and

acquiring the retrieval results output by the speech keyword search library.

Preferably, preprocessing the speech data comprises the steps of:

performing feature extraction on the speech data;

performing silence detection on the speech data; and

performing multi-environment reverberation enhancement on the speech data.

Preferably, inputting the speech data into the MVGG-CTC model for training comprises the steps of:

constructing an MVGG model;

constructing a connectionist temporal classification (CTC) classifier;

inputting the speech data into the MVGG model and the CTC classifier in sequence; and

acquiring the pinyin sequence output by the CTC classifier.

Preferably, constructing the language model and the dictionary comprises the steps of:

constructing unigram and bigram language models; and

building the Chinese characters of the unigram language model into a pinyin dictionary.

Preferably, converting the pinyin sequence output by the MVGG-CTC model into continuous text comprises the steps of:

constructing a Markov model;

constructing a decoder based on a Viterbi algorithm with panic compensation;

inputting the pinyin sequence into the Markov model and the decoder in sequence; and

acquiring the continuous text output by the decoder.

Preferably, constructing the speech keyword search library using the inverted index comprises the steps of:

recognizing the text sequence corresponding to each speech segment in the speech data;

constructing an inverted index library; and

constructing a search program.

The MVGG-CTC-based keyword search method provided by the present application has the following beneficial effects:

(1) the network structure is improved, raising the efficiency and accuracy of speech recognition;

(2) the robustness of the model is improved;

(3) fast, intelligent, automatic keyword retrieval is achieved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of the MVGG-CTC-based keyword search method provided by the present invention;

FIG. 2 is a schematic diagram of the spectrogram extraction process in the MVGG-CTC-based keyword search method provided by the present invention;

FIG. 3 is a schematic diagram of the MVGG-CTC network structure in the MVGG-CTC-based keyword search method provided by the present invention;

FIG. 4 is a schematic diagram of the inverted index structure in the MVGG-CTC-based keyword search method provided by the present invention.

Detailed Description

In order to make the objects, technical solutions, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings. It should be understood that this description is exemplary only and is not intended to limit the scope of the present invention. Moreover, descriptions of well-known structures and techniques are omitted below so as not to obscure the concepts of the present invention unnecessarily.

As shown in FIG. 1 to FIG. 4, in an embodiment of the present application, the invention provides an MVGG-CTC-based keyword search method comprising the steps of:

S1: acquiring speech data;

In the embodiment of the present application, the speech data may be acquired from a variety of storage devices.

S2: preprocessing the speech data;

In the embodiment of the present application, preprocessing the speech data comprises:

performing feature extraction on the speech data;

performing silence detection on the speech data; and

performing multi-environment reverberation enhancement on the speech data.

In the embodiment of the present application, preprocessing proceeds as follows. Acoustic features are extracted from the received input speech signal, using any one of spectrogram features, FilterBank features, MFCC (Mel-frequency cepstral coefficient) features, or PLP (perceptual linear prediction) features. Non-speech sections are removed from the audio of the model's training and test data using any one of SNR-based, GMM-based, or DNN-based silence detection. Finally, multi-environment reverberation enhancement augments the original training corpus with indoor reverberation, outdoor noise, music noise, and the like, to achieve a better fit and stronger generalization.
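As an illustration of this preprocessing stage only (the patent does not prescribe an implementation), the following minimal sketch extracts MFCC features and applies a crude energy-threshold silence detector; the librosa calls are real, but the frame sizes, threshold, and use of a simple energy rule in place of a trained VAD are assumptions:

```python
import librosa
import numpy as np

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Load audio and extract MFCC features (one n_mfcc-dim vector per frame)."""
    y, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # shape: (frames, n_mfcc)

def energy_vad(y, frame_len=400, hop=160, threshold_db=-35.0):
    """Crude energy-based silence detection: keep frames above a dB threshold."""
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    rms = np.sqrt((frames ** 2).mean(axis=0) + 1e-10)          # per-frame energy
    db = 20 * np.log10(rms / (rms.max() + 1e-10) + 1e-10)      # normalize to dBFS
    return db > threshold_db                                    # boolean voiced-frame mask
```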

S3: inputting the speech data into an MVGG-CTC model for training;

In an embodiment of the present application, inputting the speech data into the MVGG-CTC model for training comprises the steps of:

constructing an MVGG model;

constructing a connectionist temporal classification (CTC) classifier;

inputting the speech data into the MVGG model and the CTC classifier in sequence; and

acquiring the pinyin sequence output by the CTC classifier.

In the embodiment of the present application, when the speech data is input into the MVGG-CTC model for training, the training speech features produced in steps S1 and S2 enter the MVGG-CTC model, which maps them to a pinyin sequence. The training process is as follows: the original VGG network is modified by adding a batch normalization layer between every two layers and by adjusting the convolution kernel sizes and the dimensions of the fully connected layers, so as to accelerate training and adapt the network to speech feature signals. The specific structure comprises 16 convolutional layers, 6 pooling layers, 18 batch normalization layers, and 2 fully connected layers; a final softmax normalization yields the speech neuron feature vectors used to compute the CTC loss. A sketch of this structure follows.
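The patent does not give the exact arrangement of its 16 convolutional, 6 pooling, 18 batch normalization, and 2 fully connected layers, so the reduced PyTorch sketch below (layer counts, channel widths, kernel sizes) is an assumption that only demonstrates the stated pattern: batch normalization between layers and a per-frame softmax output feeding the CTC loss:

```python
import torch
import torch.nn as nn

class MVGGSketch(nn.Module):
    """Reduced VGG-style acoustic model sketch with batch normalization between
    layers and a CTC-ready per-frame output. Layer counts and sizes here are
    assumptions, not the patent's exact 16-conv/6-pool/18-BN/2-FC layout."""

    def __init__(self, n_pinyin=1424, feat_dim=200):
        super().__init__()
        blocks, in_ch = [], 1
        for out_ch in (64, 128, 256):  # three VGG-style stages (reduced for brevity)
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),            # batch normalization between layers
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=(1, 2)),  # pool frequency only, keep time steps
            ]
            in_ch = out_ch
        self.conv = nn.Sequential(*blocks)
        self.fc1 = nn.Linear(256 * (feat_dim // 8), 512)  # first fully connected layer
        self.fc2 = nn.Linear(512, n_pinyin)               # 1423 pinyins + 1 blank

    def forward(self, x):                      # x: (batch, 1, time, feat_dim)
        h = self.conv(x)                       # (batch, 256, time, feat_dim // 8)
        h = h.permute(0, 2, 1, 3).flatten(2)   # (batch, time, 256 * feat_dim // 8)
        h = torch.relu(self.fc1(h))
        return self.fc2(h).log_softmax(dim=-1) # per-frame log-probabilities
```

Training would pair these per-frame log-probabilities with torch.nn.CTCLoss, which expects them in (time, batch, classes) order.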

For CTC (connectionist temporal classification), given an input sequence, the network output defines a probability distribution over all possible alignments of the label sequence to the input. Its objective function P(Y|X) is computed as follows.

First, compute

P(A|X) = ∏_{t=1}^{T} p(a_t|X),

the probability that, for a given input sequence X of length T, the output is the alignment sequence A, where a_t denotes the pinyin label at time t. The input speech neuron feature vector is 200-dimensional, and by default the output pinyin dimension at each time step is 1424 (1423 pinyins plus 1 blank).

Then, compute

P(Y|X) = ∑_{A ∈ B⁻¹(Y)} P(A|X),

where B collapses an alignment A by merging consecutive repeated labels and removing blanks; for example, the alignments (c, _, a, _, _, t) and (_, c, _, _, a, t) both collapse to (c, a, t). The collapsed sequence is the final output sequence, and the probabilities of all alignments that collapse to the same sequence Y are summed.

Based on the above formulas, CTC obtains the most likely sequence according to:

h(X) = argmax_A P(A|X).

feature input X for speech segments { X ═ X1,x2,...,xtIs passed through [0013 ]]~[0017]The process outputs the pinyin sequence Y ═ Y1,y2,...,yn}。

S4: constructing a language model and a dictionary;

In an embodiment of the present application, constructing the language model and the dictionary comprises:

constructing unigram and bigram language models; and

building the Chinese characters of the unigram language model into a pinyin dictionary.

In the embodiment of the present application, when the language model and the dictionary are constructed, the unigram and bigram language models are built with a statistics-based method and are used to compute the probability of each single character and of each pair of characters; the Chinese characters of the unigram language model are then transcribed with the pypinyin tool into a pinyin dictionary, in which each pinyin corresponds to its several candidate Chinese characters.
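By way of illustration only, a minimal statistics-based sketch of the unigram/bigram counts and the pypinyin-built dictionary; the toy corpus is an assumption, while lazy_pinyin is an actual pypinyin call:

```python
from collections import Counter, defaultdict
from pypinyin import lazy_pinyin

corpus = ["今天天气很好", "今天上班"]  # toy corpus; the real training text is assumed

# Statistics-based unigram and bigram counts over characters.
unigram, bigram = Counter(), Counter()
for sent in corpus:
    unigram.update(sent)
    bigram.update(zip(sent, sent[1:]))

def p_unigram(ch):
    """Probability of a single character."""
    return unigram[ch] / sum(unigram.values())

def p_bigram(a, b):
    """Transition probability P(b | a) from bigram counts."""
    return bigram[(a, b)] / unigram[a] if unigram[a] else 0.0

# Pinyin dictionary: each pinyin maps to all characters that share it.
pinyin_dict = defaultdict(set)
for ch in unigram:
    pinyin_dict[lazy_pinyin(ch)[0]].add(ch)

print(pinyin_dict["tian"])  # e.g. {'天'}
```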

S5: converting the pinyin sequence output by the MVGG-CTC model into continuous text;

In the embodiment of the present application, converting the pinyin sequence output by the MVGG-CTC model into continuous text comprises the steps of:

constructing a Markov model;

constructing a decoder based on a Viterbi algorithm with panic compensation;

inputting the pinyin sequence into the Markov model and the decoder in sequence; and

acquiring the continuous text output by the decoder.

In the embodiment of the present application, a Viterbi decoder based on panic compensation converts the pinyin sequence output by the MVGG-CTC model into continuous text. This comprises two sub-steps: constructing a Markov model, which computes the transition probabilities between characters; and constructing a decoder based on the panic-compensated Viterbi algorithm, which decodes the continuous pinyin sequence into a character sequence.
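The text does not define "panic compensation", so the sketch below implements a plain Viterbi decoder over the character-level Markov model, with a small probability floor standing in for whatever compensation keeps unseen transitions from zeroing out a path; the floor value and the pinyin_dict/p_unigram/p_bigram helpers (from the language-model sketch above) are assumptions:

```python
import math

def viterbi_decode(pinyins, pinyin_dict, p_unigram, p_bigram, floor=1e-8):
    """Plain Viterbi over a character-level Markov model: each pinyin expands to
    its candidate characters, and the best-scoring character path is returned."""
    states = [list(pinyin_dict.get(p, [])) or ["?"] for p in pinyins]
    # Initialize with unigram probabilities in log space.
    scores = {ch: math.log(max(p_unigram(ch), floor)) for ch in states[0]}
    back = [{} for _ in states]
    for t in range(1, len(states)):
        new_scores = {}
        for ch in states[t]:
            # Best previous character by (floored) transition probability.
            prev, s = max(((pc, ps + math.log(max(p_bigram(pc, ch), floor)))
                           for pc, ps in scores.items()), key=lambda x: x[1])
            new_scores[ch] = s
            back[t][ch] = prev
        scores = new_scores
    # Trace the best path backwards.
    ch = max(scores, key=scores.get)
    out = [ch]
    for t in range(len(states) - 1, 0, -1):
        ch = back[t][ch]
        out.append(ch)
    return "".join(reversed(out))
```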

S6: constructing a speech keyword search library using the inverted index;

In the embodiment of the present application, constructing the speech keyword search library using the inverted index comprises the steps of:

recognizing the text sequence corresponding to each speech segment in the speech data;

constructing an inverted index library; and

constructing a search program.

In the embodiment of the present application, the speech keyword search library is constructed with the inverted index as follows. Each speech segment in the data to be recognized is recognized automatically through steps S1 to S5, yielding a text sequence of the form (voice ID, creation time, text sequence). An inverted index library is then constructed in which each word of each sentence serves as an index ID whose attribute is the set of its (creation time, voice ID) pairs, in the form {index ID, [(creation time 1, voice ID 1), (creation time 2, voice ID 2), ..., (creation time n, voice ID n)]}. Finally, a search program is constructed: the input query is segmented with the jieba tool, the segmentation result is looked up in the inverted index library, and the matching creation times and voice IDs are returned in order.
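A minimal sketch of this index-and-search flow; the record contents and field formats below are illustrative assumptions (jieba.lcut is an actual jieba call):

```python
from collections import defaultdict
import jieba

# Toy recognized records in the (voice ID, creation time, text sequence) form.
records = [
    ("voice_001", "2021-07-02 10:00", "今天 天气 很 好"),
    ("voice_002", "2021-07-02 11:30", "明天 天气 预报"),
]

# Inverted index: word -> [(creation time, voice ID), ...]
index = defaultdict(list)
for voice_id, created, text in records:
    for word in text.split():  # words of the recognized text sequence
        index[word].append((created, voice_id))

def search(query):
    """Segment the query with jieba and look each word up in the inverted index."""
    hits = []
    for word in jieba.lcut(query):
        hits.extend(index.get(word, []))
    return sorted(hits)  # ordered by creation time

print(search("天气"))  # [('2021-07-02 10:00', 'voice_001'), ('2021-07-02 11:30', 'voice_002')]
```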

S7: entering keywords into the speech keyword search library for retrieval;

S8: acquiring the retrieval results output by the speech keyword search library.

In the embodiment of the present application, a search is performed for the entered keywords, and the voice IDs of the segments containing them, together with their detailed information, are output as the retrieval result. This completes the overall system flow.

The MVGG-CTC-based keyword search method provided by the present application has the following beneficial effects:

(1) the network structure is improved, raising the efficiency and accuracy of speech recognition;

(2) the robustness of the model is improved;

(3) fast, intelligent, automatic keyword retrieval is achieved.

It should be understood that the above-described embodiments of the present invention merely illustrate or explain the principles of the invention and are not to be construed as limiting it. Any modification, equivalent replacement, improvement, or the like made without departing from the spirit and scope of the present invention shall therefore fall within its protection scope. Further, the appended claims are intended to cover all such variations and modifications as fall within their scope and boundaries, or the equivalents thereof.
