Text screening method for high-volume high-noise spoken short text

Document No. 1938140, published 2021-12-07.

This technology, a text screening method for massive, high-noise spoken short text (一种针对海量高噪音口语化短文本的文本筛选方法), was created by 戚梦苑, 孙晓晨, 万辛, 李沁, 刘发强, 孙旭东, 倪善金, 吴广君, and 梁睿琪 on 2020-06-05. Abstract: The invention provides a text screening method for massive, high-noise spoken short text, belonging to the field of natural language processing. The training corpus and the target text to be screened are preprocessed; sentence-pattern information is extracted from the labeled positive-class corpus in the preprocessed training corpus, distinguishing business strongly-relevant sentence patterns from business weakly-relevant ones; the extracted sentence-pattern information is used to perform sentence-pattern matching on the preprocessed target text, texts matching strongly-relevant sentence patterns are classified as positive, and texts matching weakly-relevant sentence patterns are passed to the following steps; both the target text and the training corpus undergo text processing and are converted into word-vector representations; the word-vector representation of the training corpus is used to train a text classification model, and the word-vector representation of the target text is input into the trained model to classify the text, thereby achieving text screening of the target text.

1. A text screening method for a high-volume high-noise spoken short text is characterized by comprising the following steps of:

preprocessing a training corpus and a target text to be screened;

extracting sentence-pattern information from the labeled positive-class corpus in the preprocessed training corpus, taking sentence patterns that contain labeled keywords as business strongly-relevant sentence patterns and sentence patterns that do not contain the labeled keywords as business weakly-relevant sentence patterns;

performing sentence-pattern matching on the preprocessed target text using the extracted sentence-pattern information, classifying texts that match strongly-relevant sentence patterns as positive text, and processing texts that match weakly-relevant sentence patterns with the following steps;

performing text processing on both the target text and the training corpus, and converting the processed text into word vector representation;

and using the word-vector representation of the training corpus to train a text classification model, and inputting the word-vector representation of the target text into the trained text classification model to classify the text, thereby achieving text screening of the target text.

2. The method of claim 1, wherein preprocessing comprises word segmentation, noise reduction, and pinyin substitution.

3. The method of claim 2, wherein the step of pre-processing comprises:

firstly, segmenting the words with the jieba Chinese word segmenter;

secondly, performing error detection on the text with a sliding window: constructing a language model by maximum likelihood estimation with an n-gram model, and judging that the text within the sliding window is erroneous if the computed probability is below a legality threshold;

thirdly, for character-granularity errors, obtaining a candidate set from a near-phonetic character dictionary, computing the sentence legality within the sliding window with the n-gram model, and comparing and ranking the results of all candidates to obtain the best corrected characters; for word-granularity errors, directly replacing the word with its pinyin.

4. The method of claim 1, wherein the step of sentence-pattern matching comprises: extracting sentence patterns from the sentences in the text; and comparing the similarity between sentence patterns and selecting the sentence pattern with the greatest similarity for matching.

5. The method of claim 4, wherein the step of sentence-pattern extraction comprises: segmenting the sentence into words and tagging parts of speech; for the subject, object, and object-complement components that represent the action object, removing person names, place names, and organization names, replacing them with part-of-speech tags, and retaining pronouns; replacing time and place words in adverbials with part-of-speech tags, and replacing modifier components in attributives with a uniform character; forming alternative-word lists from the verbs and conjunctions, so that the sentence pattern is represented as a list of words and part-of-speech tags.

6. The method of claim 4, wherein the step of comparing the similarity between sentence patterns comprises: converting semantically meaningful words into word vectors, and computing sentence-pattern similarity based on the edit distance, with word-level distances measured as Euclidean distances between word vectors in the vector space; treating verbs in the same alternative-verb list as the same word, taking the word at the smallest distance from the verbs in the alternative-word list as the word with the greatest similarity, and performing sentence matching according to identical words or the words with the greatest similarity.

7. The method of claim 1, wherein the text processing comprises limiting the length of the text, removing spoken words, and merging repeated words.

8. The method of claim 7, wherein the step of text processing comprises:

scanning the text with a sliding window, and keeping only the first occurrence of any word that appears two or more times within the window;

building a spoken-word lexicon and removing the meaningless spoken words that appear in the text;

and eliminating texts with fewer than 5 effective words, calculating the average length of the training texts, and truncating texts longer than 1.5 times the average length, the truncation point being the end of the sentence closest to the average length so as to keep complete sentences.

9. The method of claim 1, wherein the step of converting text into word-vector representations comprises: training word vectors on the processed text with a word2vec model, and representing the text as word vectors; and, for words replaced by pinyin, calculating pinyin similarity based on the edit distance, finding the word whose pronunciation is most similar to the pinyin, and using that word's vector as the word vector of the pinyin substitute, thereby obtaining the final word-vector representation.

10. The method of claim 1, wherein after the text screening is performed on the target text, whether the screening result contains labeling information is detected, and if so, the training corpus is updated according to the labeling information, and the text classification model is retrained.

Technical Field

The invention relates to a screening method for high-volume, high-noise spoken short text, which can denoise text with a high error rate and screen text according to semantic similarity and sentence-pattern information, and belongs to the field of natural language processing.

Background

With the diversification and convenience of communication technology and the rapid development of computer networks, remote communication between people has become cheaper and of higher quality, and the growth of the internet has made it easier for people to voice their opinions online. This kind of communication and expression differs from written language, and spoken short texts have several characteristics that clearly distinguish them from ordinary written corpora:

1. Complex, non-standard syntax: spoken language is usually less standardized and rigorous than written expression; most of the time people speak for convenience and out of habit, so run-on clauses, inverted word order, ambiguous expressions, and the like are very common in spoken short text;

2. Colloquial vocabulary: voice communication, especially everyday communication on informal occasions, habitually uses many non-standard words, such as dialect-specific words, common English words, abbreviations, and popular slang;

3. Accents and dialects: people in different regions have different expression habits, and the regional differences are very large; for example, the vocabulary and sentence patterns of Cantonese differ considerably from standard Chinese, which is based on the northern dialects, and these differences lead to large variation in how spoken short texts are expressed;

4. High noise: some spoken short texts come from speech transcription, and speech signal transmission depends on the communication environment; unstable network signals and environmental noise can greatly degrade the quality of the speech signal, causing errors and fragmentary omissions in the resulting text;

5. Short sentences: in most cases spoken expression does not favor complex, heavily modified sentence patterns; concise, direct expressions are preferred, and many meaningless words that merely signal a response (such as "okay") are included, so the effective sentence length is generally short.

These characteristics of spoken short text, including high noise, high error rates, poor standardization, and short sentence length, bring great difficulty to text classification.

In natural language processing, text classification is an important task. Chinese text classification is divided into three steps: data preprocessing, text representation, and classification with a classification model. Data preprocessing includes data cleaning, Chinese word segmentation, part-of-speech tagging, and so on. Text representation digitizes the Chinese vocabulary so that the classification model can compute on it. Mainstream text classification models are roughly the classical models based on statistical learning, such as the naive Bayes classifier, support vector machines, the Rocchio algorithm, and KNN. Statistical-learning methods have the advantages of a small amount of computation, low complexity, and little required training corpus, but their accuracy depends heavily on text quality: they struggle with highly complex and highly varied texts, are strongly affected by text noise, and the short text length sharply reduces classification accuracy, so they are not adequate for classifying massive amounts of spoken short text.

Machine learning techniques that have emerged in recent years, particularly deep learning algorithms implemented with high-complexity neural networks, have achieved excellent results on multiple natural language processing tasks and have become almost the default choice for complex text classification tasks. However, these deep learning algorithms achieve good results at considerable cost. (1) Lack of interpretability: the input of a high-complexity neural network is generally natural language digitized into vectors (word vectors or sentence vectors) obtained by preprocessing the text, and the output is a prediction or classification result; the intermediate process is opaque and difficult to understand or control. (2) Huge computation: training a deep neural network usually requires an enormous amount of computation, training often takes several days, the parameters are numerous, and selecting suitable parameters requires a long period of trial and error. (3) Dependence on the corpus: the final effect of a deep learning algorithm is strongly tied to the quality and quantity of the training corpus; more complex networks usually need large amounts of training data, but manually labeling spoken short texts is expensive, and corpora from different genres and fields cannot be reused. To solve a text classification problem in a particular field, such as medicine or news media, one can only spend a large amount of time building a corpus for that field and training a model on it.

Disclosure of Invention

The invention aims to provide a text screening method for high-volume, high-noise spoken short text that solves the problems of strong noise interference, heavy training-corpus requirements, and low classification accuracy in spoken short text classification, and that can, as needed, correct the model through business labeling to further improve classification accuracy.

In order to achieve the purpose, the invention adopts the following technical scheme:

a text screening method for high-volume, high-noise spoken short text comprises the following steps:

preprocessing a training corpus and a target text to be screened;

extracting sentence-pattern information from the labeled positive-class corpus in the preprocessed training corpus, taking sentence patterns that contain labeled keywords (i.e., keywords relevant to the business) as business strongly-relevant sentence patterns and sentence patterns that do not contain the labeled keywords as business weakly-relevant sentence patterns;

performing sentence-pattern matching on the preprocessed target text using the extracted sentence-pattern information, classifying texts that match strongly-relevant sentence patterns as positive text, and processing texts that match weakly-relevant sentence patterns with the following steps;

performing text processing on both the target text and the training corpus, and converting the processed text into word vector representation;

and using the word-vector representation of the training corpus to train a text classification model, and inputting the word-vector representation of the target text into the trained text classification model to classify the text, thereby achieving text screening of the target text.

Further, the text classification model may be TextCNN, FastText, TextRNN, a Hierarchical Attention Network, seq2seq with attention, or the like, preferably the TextCNN model.

Further, after the target text is subjected to text screening, whether the screening result contains the labeling information is detected, if so, the training corpus is updated according to the labeling information, and the text classification model (such as a TextCNN model) is retrained.

Further, the preprocessing includes word segmentation, noise reduction and pinyin substitution.

Further, the step of pre-processing comprises:

firstly, segmenting the words with the jieba Chinese word segmenter;

secondly, performing error detection on the text with a sliding window: constructing a language model by maximum likelihood estimation with an n-gram model, and judging that the text within the sliding window is erroneous if the computed probability is below a legality threshold;

thirdly, for character-granularity errors, obtaining a candidate set from a near-phonetic character dictionary, computing the sentence legality within the sliding window with the n-gram model, and comparing and ranking the results of all candidates to obtain the best corrected characters; for word-granularity errors, directly replacing the word with its pinyin.

Further, the sentence-pattern matching step includes: extracting sentence patterns from the sentences in the text; and comparing the similarity between sentence patterns and selecting the sentence pattern with the greatest similarity for matching.

Further, the sentence-pattern extraction step includes: segmenting the sentence into words and tagging parts of speech; for the subject, object, and object-complement components that represent the action object, removing person names, place names, and organization names, replacing them with part-of-speech tags, and retaining pronouns; replacing time and place words in adverbials with part-of-speech tags, and replacing modifier components in attributives with a uniform character; forming alternative-word lists from the verbs and conjunctions, so that the sentence pattern is represented as a list of words and part-of-speech tags.

Further, the step of comparing the similarity between sentence patterns comprises: converting semantically meaningful words into word vectors, and computing sentence-pattern similarity based on the edit distance, with word-level distances measured as Euclidean distances between word vectors in the vector space; treating verbs in the same alternative-verb list as the same word, taking the word at the smallest distance from the verbs in the alternative-word list as the word with the greatest similarity, and performing sentence matching according to identical words or the words with the greatest similarity.

Further, the text processing is an operation of limiting the length of the text, removing spoken words, and merging repeated words.

Further, the step of text processing comprises:

scanning the text with a sliding window, and keeping only the first occurrence of any word that appears two or more times within the window;

building a spoken-word lexicon and removing the meaningless spoken words that appear in the text;

and eliminating texts with fewer than 5 effective words, calculating the average length of the training texts, and truncating texts longer than 1.5 times the average length, the truncation point being the end of the sentence closest to the average length so as to keep complete sentences.

Further, the step of converting the text into word-vector representations is as follows: training word vectors on the processed text with a word2vec model, and representing the text as word vectors; and, for words replaced by pinyin, calculating pinyin similarity based on the edit distance, finding the word whose pronunciation is most similar to the pinyin, and using that word's vector as the word vector of the pinyin substitute, thereby obtaining the final word-vector representation.

Compared with deep learning methods and classical statistical-learning natural language processing algorithms, the disclosed method first denoises the spoken short texts with a text error-correction algorithm that takes pinyin similarity into account, and then fuses techniques such as sentence-pattern extraction and matching, rule filtering, and semantic similarity analysis; the amount of computation is relatively small, the accuracy is high, little training corpus is required, and the model accuracy can be improved according to feedback from business labeling information.

The main innovations of the invention are: (1) a preprocessing and noise-reduction method for massive, high-noise spoken short text data that corrects character-granularity errors and replaces hard-to-correct word-granularity errors with pinyin; (2) a lightweight semantic-relevance screening method: traditional text classification places heavy demands on the corpus and has high computational complexity, whereas this method provides a lightweight spoken short text screening method that integrates extraction and matching of conventional sentence patterns, word-vector representation combining words and pinyin similarity, and semantic-relevance analysis based on a text classification model (such as the TextCNN model); (3) a feedback-based model correction method driven by business labeling information, which can continuously improve the classification accuracy of the model.

Drawings

FIG. 1 is a flow diagram of spoken short text classification.

FIG. 2 is an exemplary diagram of Chinese sentence pattern extraction and matching result.

FIG. 3 is a simplified exemplary diagram of a text.

FIG. 4 is a flow chart of the TextCNN text classification model training.

Detailed Description

In order to make the objects, technical solutions, and advantages of the present invention more apparent, the text screening method of the present invention and its steps are described in further detail below with reference to the accompanying drawings.

The main aim of the invention is to provide a text screening method for high-volume, high-noise spoken short text that classifies with high accuracy from a small amount of training corpus and corrects the classification model in a feedback manner according to texts labeled by business staff.

According to the first aspect of the invention, a Chinese text error-correction algorithm based on improved pinyin-similarity fuzzy matching is adopted. Owing to the nature of spoken short text, most lexical errors are near-sound or homophone errors (caused by pinyin input or speech transcription problems). Chinese text error correction is divided into error detection, error correction, and pinyin replacement. The error detection part detects errors at both character granularity and word granularity to obtain a candidate set of suspected error positions. The error correction part can correct character-granularity errors with an n-gram language model (alternative characters for the error position are obtained from a candidate dictionary, sentence legality is then computed with the n-gram model, and all candidate results are compared and ranked to obtain the best correction). For word-granularity errors, because spoken short text is highly noisy, has high error and omission rates, and has short sentences, it is difficult to infer the corrected word from the rest of the sentence algorithmically, so the word is replaced with its pinyin instead.

According to the second aspect of the invention, sentence-pattern matching of the text is performed based on conventional sentence patterns extracted from the corpus. Certain types of text contain conventional expressions; the conventional sentence patterns of the positive-class corpus in the training corpus are extracted and matched against the corpus to be classified, so that most of the negative-class texts can be filtered out. Sentence-pattern extraction part: the sentence is first segmented and dependency-parsed, words representing specific action objects are removed according to rules, and the stem of the sentence is retained. Sentence-pattern matching part: the words in the sentence stem are converted into word vectors with the word2vec algorithm, word similarity is computed from the word vectors, and the similarity of the whole sentence pattern is computed from the edit distance.

According to the third aspect of the invention, the corpus is processed according to the characteristics of spoken short text and represented as word vectors, and semantic-relevance analysis of the target texts to be classified is performed with a text classification model (the TextCNN model is taken as an example hereafter). In the sentence-pattern matching stage, texts that match only business weakly-relevant sentence patterns cannot be confirmed as positive texts of interest to the business, because the match only constrains the sentence structure and does not restrict the specific subject of the conversation; these texts therefore need to be further classified at the semantic level. Based on the characteristics of spoken short text, the method applies treatments such as removing spoken words, merging repeated words, and limiting length to the spoken short texts:

(1) Removing spoken words: modal particles, meaningless response words, and common colloquial words occur extremely frequently in spoken short text and lengthen it; since the TextCNN input vector has a fixed, limited length, removing these words lets the input vector carry more effective information and thereby improves accuracy.

(2) Merging repeated words: in informal communication, to make sure the other party hears clearly, or simply out of habit, people easily repeat the same word or phrase several times in a row; in this situation the repeated part is kept only once, which loses no semantics and prevents the repetitions from meaninglessly occupying effective length.

(3) Length limitation: because spoken text lengths vary, the positive and negative training texts may differ greatly in length; length, however, must not become a basis for classification, and the TextCNN network easily learns the length of the non-zero portion of the input vector as a feature. Overly long texts therefore need to be truncated so that their length is roughly the average length of the corpus, and overly short texts, which carry too little effective information, need to be removed.

These operations are designed around the characteristics of spoken short text and the learning characteristics of the TextCNN network; they improve the classification accuracy of TextCNN, but they cannot be merged into the preprocessing step, because they would interfere with sentence-pattern matching. After the training corpus is processed, the word2vec algorithm is used to train the Chinese text into word vectors, which are fed into a shallow TextCNN network; because the network's complexity is low, it captures the local relevance of the text well and achieves accurate text classification even with little training corpus.
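As a rough illustration of this stage, the sketch below trains word vectors with gensim's word2vec implementation and defines a shallow TextCNN in PyTorch. The libraries, the embedding size, the kernel widths, and the function names are illustrative assumptions rather than parameters fixed by this disclosure.

```python
# Minimal sketch: word2vec embeddings feeding a shallow TextCNN (assumed libraries:
# gensim and PyTorch; all hyper-parameters below are placeholders).
import torch
import torch.nn as nn
from gensim.models import Word2Vec

class ShallowTextCNN(nn.Module):
    """One convolution per kernel width, max-pooled over time and concatenated."""
    def __init__(self, embed_dim=100, num_classes=2, kernel_sizes=(2, 3, 4), channels=64):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(embed_dim, channels, k) for k in kernel_sizes)
        self.fc = nn.Linear(channels * len(kernel_sizes), num_classes)

    def forward(self, x):                    # x: (batch, seq_len, embed_dim), seq_len >= 4
        x = x.transpose(1, 2)                # -> (batch, embed_dim, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

def build_embeddings(token_lists, dim=100):
    """Train word2vec on the processed corpus (a list of token lists)."""
    return Word2Vec(sentences=token_lists, vector_size=dim, window=5, min_count=1).wv
```

In use, each text would be converted into a (seq_len, embed_dim) matrix by looking up its tokens in the trained embeddings, padded or truncated to a fixed length, and passed through the network for binary screening.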

According to the fourth aspect of the invention, the model is corrected in a feedback manner based on business annotation data. Massive spoken short text can contain all kinds of cases, and because the training corpus is extremely limited, some texts may still be misclassified; it is therefore necessary for business staff to correct the model using labeling information on the classification results, continuously improving its accuracy. According to the staff's labeling information, the positive and negative corpora, the conventional positive-class sentence patterns, and the seed vocabulary can all be updated, thereby improving the accuracy of the classification results.
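A minimal sketch of this feedback loop follows; the function and variable names (feedback_update, retrain, the structure of annotations) are hypothetical and only illustrate how business labels could flow back into the corpora before retraining.

```python
# Hypothetical feedback step: fold business annotations on the screening results back
# into the positive/negative corpora, then retrain the classification model.
def feedback_update(positive_corpus, negative_corpus, annotations, retrain):
    for text, label in annotations:                    # label assigned by business staff
        (positive_corpus if label == 'positive' else negative_corpus).append(text)
    # conventional positive sentence patterns and seed vocabulary could be refreshed here too
    return retrain(positive_corpus, negative_corpus)   # returns the retrained model
```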

Fig. 1 shows the flow of spoken short text classification. As shown in the figure: first, the training corpus and the target text to be screened are preprocessed. Second, the training corpus contains texts and their labels, of which there are two kinds, positive (the text is relevant to the business) and negative (the text is irrelevant to the business); sentence-pattern information is extracted from the labeled positive-class corpus (i.e., the business-relevant texts), distinguishing business strongly-relevant from weakly-relevant sentence patterns. Third, sentence-pattern matching is performed on the text: matches of strongly-relevant sentence patterns can be classified directly as positive text, while matches of weakly-relevant sentence patterns need the following analysis. Fourth, length limitation, spoken-word removal, and repeated-word merging are applied to both the target text and the training corpus. Fifth, word vectors are trained with the word2vec algorithm, the texts are converted into word-vector representations, the TextCNN network is trained with the training corpus, and the trained model is used to classify the target texts and screen out the desired results. Sixth, if business staff provide labeling information on the screening results, the training corpus is updated and the model is retrained.

The specific steps of the preprocessing part are as follows. Words are segmented with the jieba Chinese word segmenter; because the sentences contain wrongly written characters, the segmentation result is often split incorrectly. The n-gram model rests on the Markov assumption that the occurrence probability of a word depends only on the preceding one or few words, and wrongly written characters in Chinese text are local errors, so in the error detection part it is only necessary to choose a sliding window of reasonable length and check whether it contains wrongly written characters; because of the colloquial nature of short text, the window is kept short. The n-gram model builds the language model by computing maximum likelihood estimates, and if the probability falls below a legality threshold, an error is judged to have occurred at that position. In the error correction part, for character-granularity errors a candidate set is obtained from a near-phonetic character dictionary, the sentence legality within the sliding window is then computed with the n-gram language model, and the results of all candidates are compared and ranked to obtain the best corrected character. For word-granularity errors, the candidate set may be too large because of the high noise of the text to be classified, so pinyin is used directly as a substitute.
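The following Python sketch shows how this preprocessing could be wired together, assuming a bigram language model estimated from the corpus, a near-phonetic candidate dictionary near_sound_dict, and the jieba and pypinyin packages; the window size and the legality threshold are placeholder values, not values specified by this disclosure.

```python
# Preprocessing sketch: jieba segmentation, bigram error detection in a sliding window,
# near-phonetic correction at character granularity, pinyin fallback at word granularity.
from collections import Counter
import math

import jieba                      # Chinese word segmentation
from pypinyin import lazy_pinyin  # pinyin fallback for word-granularity errors

class BigramLM:
    """Maximum-likelihood bigram language model with add-one smoothing."""
    def __init__(self, segmented_sentences):
        self.uni, self.bi = Counter(), Counter()
        for words in segmented_sentences:
            self.uni.update(words)
            self.bi.update(zip(words, words[1:]))

    def log_prob(self, words):
        v = len(self.uni) + 1
        return sum(math.log((self.bi[(a, b)] + 1) / (self.uni[a] + v))
                   for a, b in zip(words, words[1:]))

def preprocess(sentence, lm, near_sound_dict, window=3, threshold=-12.0):
    """Segment, flag low-probability windows, correct characters, or fall back to pinyin."""
    words, out, i = jieba.lcut(sentence), [], 0
    while i < len(words):
        win = words[i:i + window]
        if len(win) < 2 or lm.log_prob(win) >= threshold:    # window looks legal
            out.append(words[i]); i += 1
            continue
        w = words[i]                                         # treat the window head as erroneous
        if len(w) == 1 and w in near_sound_dict:             # character-granularity error
            best = max(near_sound_dict[w], key=lambda c: lm.log_prob([c] + win[1:]))
            out.append(best)                                 # best-ranked near-phonetic candidate
        else:                                                # word-granularity error
            out.append(''.join(lazy_pinyin(w)))              # replace the word with its pinyin
        i += 1
    return out
```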

Sentence-pattern matching part: Fig. 2 shows an example of Chinese sentence-pattern extraction and matching. First, a sentence pattern is extracted from the complete sentence. The sentence is segmented, dependency-parsed, and part-of-speech tagged; for the subject, object, and object-complement components that represent the action object, nouns such as person names, place names, and organization names are removed and replaced with their part-of-speech tags, while pronouns are retained; time and place words in adverbials are replaced with part-of-speech tags; and modifier components in attributives are uniformly replaced with the character "的" (de). Verbs and conjunctions form alternative-word lists, such as ['question', 'consult', 'query'], and the sentence pattern is represented as a list of words and part-of-speech tags. Second, the similarity between sentence patterns is compared and the sentence pattern with the greatest similarity is selected for matching. Sentence-pattern matching is based on the edit distance (Levenshtein distance); in keeping with the characteristics of natural language, certain words such as auxiliary particles carry little semantic weight and are ignored when computing the distance. If a verb is not in the alternative-word list, the minimum distance between that verb and the words in the list is taken. When the distance between two words is computed, the words are converted into word vectors, the Euclidean distance between the vectors in the vector space is calculated, and the result is mapped to the [0, 1] interval with a sigmoid function.
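A simplified sketch of sentence-pattern extraction and matching follows, assuming jieba's part-of-speech tagger and a word-vector lookup table vectors; the dependency-parsing step and the alternative-verb lists are omitted, and the part-of-speech tag sets used for replacement are illustrative.

```python
# Sketch: sentence patterns as token lists, compared by an edit distance whose
# substitution cost comes from word-vector Euclidean distance mapped through a sigmoid.
import math
import numpy as np
import jieba.posseg as pseg

REPLACE_POS = {'nr', 'ns', 'nt', 't'}   # person names, place names, organizations, time words
IGNORE_POS = {'u', 'y'}                 # auxiliary and modal particles carry little meaning

def extract_pattern(sentence):
    """Keep the sentence stem: replace concrete names/time words with their POS tags."""
    pattern = []
    for token in pseg.lcut(sentence):
        if token.flag in IGNORE_POS:
            continue
        pattern.append(token.flag if token.flag in REPLACE_POS else token.word)
    return pattern

def token_cost(a, b, vectors):
    """Substitution cost in [0, 1]: 0 for identical tokens, otherwise a sigmoid of the
    Euclidean distance between their word vectors (1.0 when a vector is missing)."""
    if a == b:
        return 0.0
    if a in vectors and b in vectors:
        d = float(np.linalg.norm(vectors[a] - vectors[b]))
        return 1.0 / (1.0 + math.exp(-d))
    return 1.0

def pattern_distance(p1, p2, vectors):
    """Levenshtein-style edit distance over token lists with a semantic substitution cost."""
    m, n = len(p1), len(p2)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = float(i)
    for j in range(n + 1):
        dp[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(dp[i - 1][j] + 1.0,
                           dp[i][j - 1] + 1.0,
                           dp[i - 1][j - 1] + token_cost(p1[i - 1], p2[j - 1], vectors))
    return dp[m][n]
```

The target text would then be matched against the stored positive-class patterns by selecting the pattern with the smallest pattern_distance, i.e. the greatest similarity.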

Model classification part: text processing is performed on both the text to be classified and the training corpus, comprising the following steps:

(1) scanning the text with a sliding window, and keeping only the first occurrence of any word that appears two or more times within the window;

(2) building a spoken-word lexicon and removing the meaningless spoken words that appear in the text;

(3) length limitation: after the first two steps, texts with fewer than 5 effective words are removed, the average length of the training texts is calculated, and texts longer than 1.5 times the average length are truncated, the truncation point being the end of the sentence closest to the average length so as to keep complete sentences.
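The three steps above could be sketched as follows; the sliding-window size and the spoken-word lexicon are assumptions, while the thresholds (5 effective words, 1.5 times the average length) follow the values given in this description.

```python
# Text-processing sketch: merge repeats in a sliding window, drop spoken words,
# then enforce the length limits described above. Inputs are token lists.
SENTENCE_ENDS = {'。', '！', '？'}   # sentence-final punctuation used as truncation points

def merge_repeats(words, window=4):
    """Keep only the first occurrence of a word repeated within the sliding window."""
    out = []
    for w in words:
        if w not in out[-window:]:
            out.append(w)
    return out

def remove_spoken_words(words, spoken_lexicon):
    return [w for w in words if w not in spoken_lexicon]

def limit_length(token_lists):
    """Drop texts with fewer than 5 effective words; truncate texts longer than 1.5x the
    average length at the sentence end closest to the average length."""
    texts = [t for t in token_lists if len(t) >= 5]
    if not texts:
        return texts
    avg = sum(len(t) for t in texts) / len(texts)
    limited = []
    for t in texts:
        if len(t) <= 1.5 * avg:
            limited.append(t)
            continue
        ends = [i for i, w in enumerate(t) if w in SENTENCE_ENDS] or [int(avg)]
        cut = min(ends, key=lambda i: abs(i - avg))
        limited.append(t[:cut + 1])
    return limited

def process(token_lists, spoken_lexicon):
    cleaned = [remove_spoken_words(merge_repeats(t), spoken_lexicon) for t in token_lists]
    return limit_length(cleaned)
```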

Fig. 3 is a simplified text illustration showing the differences before and after text processing.

Word vectors are trained on the processed text with a word2vec model, and the text is represented as word vectors. For vocabulary that was replaced with pinyin because of the noise in spoken short text, pinyin similarity is calculated based on the edit distance, the word whose pronunciation is most similar to the pinyin is found according to this similarity (if several words share the same pronunciation, the word with the highest frequency in the corpus is taken), and that word's vector is used as the word vector of the pinyin substitute. The TextCNN network is then trained with the training texts to obtain a classification model, which is used to classify the texts to be classified. FIG. 4 is a flow chart of TextCNN model training.
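As an illustration of the pinyin fallback, the sketch below finds, for a pinyin placeholder, the vocabulary word with the most similar pronunciation by edit distance and reuses that word's vector; the Levenshtein helper, the pypinyin dependency, the gensim-style keyed vectors wv, and the frequency table used for tie-breaking are assumptions consistent with the description above.

```python
# Map a pinyin placeholder to the vector of the closest-sounding vocabulary word.
from pypinyin import lazy_pinyin

def edit_distance(a, b):
    """Plain Levenshtein distance over two strings (single-row rolling DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def vector_for_pinyin(pinyin_token, wv, freq):
    """Reuse the vector of the word whose pronunciation is closest to the pinyin
    placeholder; ties are broken by corpus frequency (more frequent wins)."""
    best = min(wv.index_to_key,
               key=lambda w: (edit_distance(pinyin_token, ''.join(lazy_pinyin(w))),
                              -freq.get(w, 0)))
    return wv[best]
```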

It is to be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.
