Text noise data identification method and device, computer equipment and storage medium

Document No.: 830108  Publication date: 2021-03-30

Reading note: This technology, "Text noise data identification method and device, computer equipment and storage medium", was designed and created by 韩旭红 on 2019-09-30. Its main content is as follows: the application relates to a text noise data identification method, apparatus, computer device and storage medium. The text data is split into sentences, and the segmented sentences serve as the basic unit of data processing, converting a complex text data processing task into a simpler sentence data processing task. Unlike the conventional use of a dropout mechanism to drop neurons, this scheme applies the dropout mechanism to training data carrying label data, which prevents the model from overfitting during training. The sentence relevance classification model trained on this data can add the corresponding label data to input text data, so a large amount of text data does not need to be annotated, saving labor cost while increasing data processing speed. In addition, noise prediction is performed based on a splicing matrix obtained by splicing sentence relevance vectors with position vectors, which improves the accuracy of noise data identification.

1. A text noise data recognition method, the method comprising:

acquiring text data;

performing sentence segmentation processing on the text data to obtain segmented sentences, and extracting position vectors of the segmented sentences;

inputting the segmented sentences into a trained sentence relevance classification model, adding label data to the segmented sentences to obtain sentence relevance vectors, wherein the sentence relevance vectors are feature vectors which are output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying label data by adopting a dropout mechanism;

and splicing the sentence correlation vector and the position vector of the sentence to obtain a splicing matrix, and performing noise prediction on text data based on the splicing matrix to obtain a noise identification result.

2. The text noise data recognition method of claim 1, wherein said sentence splitting the text data comprises:

dividing the text data into a plurality of sentences by adopting a preset sentence division algorithm;

and segmenting or splicing the segmented sentences according to a preset sentence length threshold value so as to ensure that the length of the segmented sentences meets the preset sentence length threshold value.

3. The method of recognizing text noise data according to claim 1, wherein before inputting the segmented sentences into the trained sentence correlation classification model, further comprising:

collecting historical text data, wherein the historical text data carries labeling information;

according to the labeling information, sentence dividing and labeling processing are carried out on the historical text data to obtain training data carrying label data;

setting corresponding dropout probability for the training data carrying the label data;

performing dropout processing on the training data carrying the label data based on the dropout probability, and updating the training data;

and training an initial sentence relevance classification model by adopting the updated training data to obtain the trained sentence relevance classification model.

4. The text noise data identification method of claim 3, wherein the obtaining training data carrying label data by performing sentence segmentation and labeling on the historical text data according to the labeling information comprises:

segmenting the historical text data into a plurality of sentences;

identifying the labeling information of the historical text data;

if the labeling information of the historical text data is noise data, labeling labels of sentences segmented from the historical text data as irrelevant labels to obtain training data carrying the irrelevant labels;

and if the labeling information of the historical text data is non-noise data, labeling labels of sentences segmented from the historical text data as relevant labels to obtain training data carrying the relevant labels.

5. The text noise data recognition method of claim 4, wherein the setting of the corresponding dropout probability for the training data carrying the tag data comprises:

respectively inputting the training data carrying the related labels and the training data carrying the unrelated labels into the initial sentence relevance classification model;

and setting a first dropout probability for the training data carrying the related labels by adopting the dropout mechanism, and setting a second dropout probability for the training data carrying the unrelated labels by adopting the dropout mechanism.

6. The text noise data recognition method of claim 5, wherein the dropout processing is performed on the training data carrying the tag data based on the dropout probability, and updating the training data comprises:

based on the first dropout probability, randomly discarding part of training data carrying the relevant labels to obtain a first training set;

based on the second dropout probability, randomly discarding part of training data carrying the irrelevant labels to obtain a second training set;

and combining the first training set and the second training set to serve as new training data, inputting the new training data to the initial sentence relevance classification model again, returning to the step of randomly discarding part of the training data carrying the relevant labels based on the first dropout probability until the number of times of return reaches a preset number threshold.

7. An apparatus for recognizing text noise data, the apparatus comprising:

the data acquisition module is used for acquiring text data;

the sentence dividing processing module is used for carrying out sentence dividing processing on the text data to obtain a divided sentence and extracting a position vector of the divided sentence;

the sentence relevance processing module is used for inputting the segmented sentences into a trained sentence relevance classification model to obtain sentence relevance vectors, the sentence relevance vectors are feature vectors which are output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying label data by adopting a dropout mechanism;

and the noise prediction module is used for splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and performing noise prediction on the text data based on the splicing matrix to obtain a noise identification result.

8. The text noise data recognition apparatus according to claim 7, further comprising:

the model training module is used for collecting historical text data, wherein the historical text data carries labeling information; performing sentence segmentation and labeling processing on the historical text data according to the labeling information to obtain training data carrying label data; setting a corresponding dropout probability for the training data carrying the label data; performing dropout processing on the training data carrying the label data based on the dropout probability and updating the training data; and training the initial sentence relevance classification model with the updated training data to obtain the trained sentence relevance classification model.

9. A computer device comprising at least one processor, at least one memory, and a bus; the processor and the memory complete mutual communication through the bus; the processor is configured to invoke program instructions in the memory to perform the method of any of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.

Technical Field

The present application relates to the field of natural language processing technologies, and in particular, to a text noise data recognition method, apparatus, computer device, and storage medium.

Background

Natural language processing is an important direction in the fields of computer science and artificial intelligence, studying theories and methods that enable effective communication between people and computers using natural language. Text data processing can be regarded as the basis of natural language processing and is an important part of it.

When analyzing text data, some noise data have a great adverse effect on the data analysis work, and therefore, methods for recognizing noise data in text data by using machine learning or deep learning algorithms have appeared.

Although such methods can identify noise data to a certain extent, they require a great deal of annotation work and are labor-intensive; a large amount of recognition computation must be executed by the computer during identification, which increases hardware consumption and slows down data identification; and the accuracy of noise data identification is affected by the quality of the large amount of annotated data. The conventional text noise data identification method therefore suffers from low identification efficiency.

Disclosure of Invention

In view of the above, it is necessary to provide an efficient text noise data recognition method, apparatus, computer device and storage medium to solve the problem of low recognition efficiency of existing text noise data recognition methods.

A text noise data recognition method, the method comprising:

acquiring text data;

sentence dividing processing is carried out on the text data to obtain a segmented sentence, and a position vector of the segmented sentence is extracted;

inputting the segmented sentences into a trained sentence relevance classification model, adding label data to the segmented sentences to obtain sentence relevance vectors, wherein the sentence relevance vectors are feature vectors which are output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying label data by adopting a dropout mechanism;

and splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and performing noise prediction on the text data based on the splicing matrix to obtain a noise identification result.

In one embodiment, the sentence dividing processing of the text data includes:

dividing the text data into a plurality of sentences by adopting a preset sentence division algorithm;

and segmenting or splicing the segmented sentences according to a preset sentence length threshold value so as to ensure that the length of the segmented sentences meets the preset sentence length threshold value.

In one embodiment, before inputting the segmented sentence into the trained sentence correlation classification model, the method further includes:

collecting historical text data, wherein the historical text data carries labeling information;

according to the labeling information, sentence dividing and labeling are carried out on the historical text data to obtain training data carrying label data;

setting corresponding dropout probability for training data carrying label data;

performing dropout processing on training data carrying label data based on the dropout probability, and updating the training data;

and training the initial sentence relevance classification model by using the updated training data to obtain a trained sentence relevance classification model.

In one embodiment, the obtaining training data carrying label data by performing sentence segmentation and labeling on historical text data according to the labeling information comprises:

segmenting the historical text data into a plurality of sentences;

identifying marking information of the historical text data;

if the labeling information of the historical text data is noise data, labeling labels of sentences segmented from the historical text data as irrelevant labels to obtain training data carrying the irrelevant labels;

and if the labeling information of the historical text data is non-noise data, labeling labels of sentences segmented from the historical text data as relevant labels to obtain training data carrying the relevant labels.

In one embodiment, setting a corresponding dropout probability for training data carrying label data includes:

respectively inputting training data carrying related labels and training data carrying unrelated labels into the initial sentence relevance classification model;

and setting a first dropout probability for the training data carrying the related labels by adopting a dropout mechanism, and setting a second dropout probability for the training data carrying the unrelated labels by adopting the dropout mechanism.

In one embodiment, dropout processing is performed on training data carrying tag data based on the dropout probability, and updating the training data includes:

based on the first dropout probability, randomly discarding part of training data carrying related labels to obtain a first training set;

based on the second dropout probability, randomly discarding part of training data carrying irrelevant labels to obtain a second training set;

and combining the first training set and the second training set to serve as new training data, inputting the new training data to the initial sentence relevance classification model again, and returning to the step of randomly discarding part of the training data carrying the relevant labels based on the first dropout probability until the number of times of returning reaches a preset number threshold.

A text noise data recognition apparatus, the apparatus comprising:

the data acquisition module is used for acquiring text data;

the sentence dividing processing module is used for carrying out sentence dividing processing on the text data to obtain divided sentences and extracting position vectors of the divided sentences;

the sentence relevance processing module is used for inputting the segmented sentences into a trained sentence relevance classification model to obtain sentence relevance vectors, the sentence relevance vectors are feature vectors which are output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying label data by adopting a dropout mechanism;

and the noise prediction module is used for splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and performing noise prediction on the text data based on the splicing matrix to obtain a noise identification result.

In one embodiment, the apparatus further comprises:

the model training module is used for collecting historical text data, wherein the historical text data carries labeling information; performing sentence segmentation and labeling processing on the historical text data according to the labeling information to obtain training data carrying label data; setting a corresponding dropout probability for the training data carrying the label data; performing dropout processing on the training data carrying the label data based on the dropout probability and updating the training data; and training an initial sentence relevance classification model with the updated training data to obtain a trained sentence relevance classification model.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

acquiring text data;

sentence dividing processing is carried out on the text data to obtain a segmented sentence, and a position vector of the segmented sentence is extracted;

inputting the segmented sentences into a trained sentence relevance classification model, adding label data to the segmented sentences to obtain sentence relevance vectors, wherein the sentence relevance vectors are feature vectors which are output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying label data by adopting a dropout mechanism;

and splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and performing noise prediction on the text data based on the splicing matrix to obtain a noise identification result.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

acquiring text data;

sentence dividing processing is carried out on the text data to obtain a segmented sentence, and a position vector of the segmented sentence is extracted;

inputting the segmented sentences into a trained sentence relevance classification model, adding label data to the segmented sentences to obtain sentence relevance vectors, wherein the sentence relevance vectors are feature vectors which are output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying label data by adopting a dropout mechanism;

and splicing the sentence correlation vector and the position vector to obtain a splicing matrix, and performing noise prediction on the text data based on the splicing matrix to obtain a noise identification result.

According to the text noise data identification method, apparatus, computer device and storage medium described above, the text data is split into sentences, and the segmented sentences serve as the basic unit of data processing, converting a complex text data processing task into a simpler sentence data processing task. Unlike the conventional approach of using a dropout mechanism to drop neurons, the dropout mechanism is applied here to training data carrying label data, which prevents the model from overfitting during training. The sentence relevance classification model trained on this data can add the corresponding label data to input text data, so a large amount of text data does not need to be annotated, saving labor cost and increasing data processing speed. Furthermore, noise prediction is performed based on a splicing matrix obtained by splicing the sentence relevance vectors with the position vectors, which improves the accuracy of noise data identification.

Drawings

FIG. 1 is a diagram of an environment in which a text noise data recognition method is applied in one embodiment;

FIG. 2 is a flow diagram illustrating a method for text noise data recognition in one embodiment;

FIG. 3 is a schematic flow chart diagram illustrating the steps of constructing a model in one embodiment;

FIG. 4 is a schematic flow chart of the steps of constructing a model in another embodiment;

FIG. 5 is a block diagram showing the structure of a text noise data recognition apparatus according to an embodiment;

FIG. 6 is a block diagram showing the construction of a text noise data recognizing apparatus in another embodiment;

FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The text noise data identification method provided by the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. Specifically, a user uploads the text data to be processed on the terminal 102, and the terminal 102 then sends a noise data recognition request (carrying the text data) to the server 104. In response to the request, the server 104 obtains the text data, performs sentence segmentation to obtain segmented sentences, and extracts the position vector of each segmented sentence. The segmented sentences are then input into a trained sentence relevance classification model (obtained by applying a dropout mechanism to training data carrying label data during training), which adds label data to the segmented sentences and outputs, from its hidden layer, a sentence relevance vector (an intermediate result of the model) representing sentence information. The sentence relevance vector and the position vector are spliced to obtain a splicing matrix, and noise prediction is performed on the text data based on this matrix to obtain a noise identification result. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, or portable wearable device, and the server 104 may be implemented by an independent server or by a server cluster formed of multiple servers. To explain the text noise data identification method more clearly, chapter data is used below as the example of text data.
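
The request/response flow between terminal 102 and server 104 could, for example, be exposed as a small HTTP service. The following is a minimal sketch assuming a JSON API built with Flask; the endpoint path, field names, and the recognize_noise() helper are illustrative placeholders rather than part of the original disclosure.

```python
# Hypothetical server-side entry point for the noise data recognition request.
from flask import Flask, request, jsonify

app = Flask(__name__)

def recognize_noise(text: str) -> int:
    """Placeholder for the full pipeline (sentence splitting, relevance
    classification, splicing, noise prediction); returns 0 or 1."""
    raise NotImplementedError

@app.route("/noise-recognition", methods=["POST"])
def noise_recognition():
    # The noise data recognition request carries the text (chapter) data to be processed.
    payload = request.get_json(force=True)
    text = payload.get("text", "")
    label = recognize_noise(text)          # 0: not noise, 1: noise
    return jsonify({"noise": label})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```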

In one embodiment, as shown in fig. 2, a text data noise identification method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:

step S200, acquiring text data.

In natural language processing tasks, text data includes words, phrases, sentences, and chapter data. In this embodiment, chapter data is used as the example of text data; in practical applications, the text data (chapter data) to be identified may be fetched from a database when a text noise data identification request sent by the terminal is received. Chapter data conveys the intended semantics by organizing information such as entities and events according to a certain structure; it contains sentences, words, and phrases, and chapter analysis is itself an important part of natural language processing.

Step S400, sentence segmentation processing is carried out on the text data to obtain segmented sentences, and position vectors of the segmented sentences are extracted.

The position vector of a segmented sentence represents the position of the sentence in the original text data, for example the line or paragraph in which the sentence appears. In this embodiment, where the text data is chapter data, the preprocessing includes applying a sentence segmentation algorithm to the chapter data to obtain the segmented sentences; at the same time, to improve the accuracy of noise data prediction, the position vector of each segmented sentence is extracted, representing the specific line number of the segmented sentence in the original chapter data.

In one embodiment, the sentence dividing processing of the text data includes: the method comprises the steps of segmenting text data into a plurality of sentences by adopting a preset sentence segmentation algorithm, and segmenting or splicing the segmented sentences according to a preset sentence length threshold value so as to ensure that the length of the segmented sentences meets the preset sentence length threshold value.

Since sentence segmentation may produce sentences that are too complicated (too long) for the actual task, this embodiment adopts a sentence length control strategy, unlike conventional sentence segmentation. First, a preset sentence segmentation algorithm, such as the jentenceend algorithm, is used to split the chapter data into sentences. The segmented sentences are then processed a second time according to their lengths: sentences that are too long are split again at commas, and sentences that are too short are spliced with the following sentence, so that in principle the length of every segmented sentence is less than or equal to the preset sentence length threshold. By adopting this sentence length control strategy, the embodiment avoids sentences that are too long or too short and improves the effective utilization of sentences in text noise data prediction.
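
The sentence length control strategy can be sketched as follows. This is a minimal illustration that substitutes a simple regular-expression splitter for the segmentation algorithm named above; the threshold values and helper names are assumptions, not taken from the original.

```python
import re

MAX_LEN = 50   # preset sentence length threshold (assumed value)
MIN_LEN = 8    # below this, a sentence is spliced with the next one (assumed value)

def split_sentences(text: str) -> list[str]:
    # First pass: split on sentence-ending punctuation.
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p.strip() for p in parts if p.strip()]

def control_length(sentences: list[str]) -> list[tuple[int, str]]:
    """Re-split long sentences at commas, merge short ones with the next sentence,
    and return (position, sentence) pairs, where position is the sentence index."""
    resized = []
    for s in sentences:
        if len(s) > MAX_LEN:
            resized.extend(seg for seg in re.split(r"[，,]\s*", s) if seg)
        else:
            resized.append(s)
    merged, buffer = [], ""
    for s in resized:
        buffer = (buffer + s) if buffer else s
        if len(buffer) >= MIN_LEN:       # long enough: emit and reset
            merged.append(buffer)
            buffer = ""
    if buffer:                           # keep a short trailing remainder
        merged.append(buffer)
    return list(enumerate(merged))
```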

Step S600, inputting the segmented sentences into a trained sentence relevance classification model, adding label data to the segmented sentences to obtain sentence relevance vectors, wherein the sentence relevance vectors are feature vectors which are output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying label data by adopting a dropout mechanism.

The sentence relevance classification model classifies whether an input sentence is related to noise data; in this embodiment it may be an LSTM (Long Short-Term Memory network) + Attention model. Specifically, the sentence relevance classification model is trained in a supervised manner, that is, a function is inferred from a labeled training data set, and the trained model can automatically decide the class label of an input sentence, i.e., whether it is related to noise data. In machine learning, if a model has too many parameters and too few training samples, the trained model is prone to overfitting. Therefore, during training of the sentence relevance classification model, a dropout mechanism is applied to the training data carrying labels so that the input data differs every time, effectively preventing overfitting. In this embodiment, the segmented sentence is input into the trained sentence relevance classification model, which predicts the input sentence and adds a label indicating whether the sentence is related to noise data; the sentence relevance vector, a feature vector representing the sentence information, is obtained from the output of the model's hidden (intermediate) layer.
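
A minimal PyTorch sketch of an LSTM + Attention sentence relevance classifier of the kind described above is given below; the layer sizes, the particular attention form, and the way the hidden-layer output is returned as the "sentence relevance vector" are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SentenceRelevanceClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, 1)          # simple attention score per token
        self.classifier = nn.Linear(hidden_dim, 2)    # relevant / irrelevant label

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)                 # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded)                     # (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.attn(outputs), dim=1)   # (batch, seq_len, 1)
        sentence_vec = (weights * outputs).sum(dim=1)        # (batch, hidden_dim)
        logits = self.classifier(sentence_vec)               # label prediction
        # sentence_vec is the hidden-layer feature vector ("sentence relevance vector")
        # that is later spliced with the position vector for noise prediction.
        return logits, sentence_vec
```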

And step S800, splicing the sentence relevance vector and the position vector to obtain a splicing matrix, and performing noise prediction on the text data based on the splicing matrix to obtain a noise identification result.

In this embodiment, after the sentence relevance vector is obtained as the intermediate-layer result, it is spliced with the extracted sentence position vector to obtain a splicing matrix, and noise prediction is then performed based on the splicing matrix to obtain the noise identification result. For example, the position vector may be [-6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6], where 0 represents the position of the current sentence. The splicing matrix may be fed as input data to a trained noise prediction model, which may be a bidirectional LSTM + Attention model; the input splicing matrix is passed through a dense fully connected layer to obtain a logits vector, which is then converted into a probability distribution via softmax (or the cross entropy may be computed with softmax_cross_entropy_with_logits, which takes the logits as its argument). The label with the largest probability (or one of the two largest) in the distribution is taken as the classification result, which may be binary 0 or 1, where 0 indicates that the input chapter data is not noise data and 1 indicates that it is noise data. It will be appreciated that in other embodiments, a sigmoid may instead be used to perform binary classification on the logits vector to obtain the noise identification result.
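
The splicing and noise prediction step can be illustrated with the following PyTorch sketch: each sentence relevance vector is concatenated with its 13-dimensional relative position vector, and the resulting splicing matrix is passed through a bidirectional LSTM + Attention, a dense layer, and softmax. The dimensions and the attention form are assumptions, not taken from the original disclosure.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    def __init__(self, sent_dim: int = 128, pos_dim: int = 13, hidden_dim: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(sent_dim + pos_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.dense = nn.Linear(2 * hidden_dim, 2)   # dense layer producing the logits

    def forward(self, sent_vecs: torch.Tensor, pos_vecs: torch.Tensor):
        # sent_vecs: (batch, num_sentences, sent_dim)
        # pos_vecs:  (batch, num_sentences, pos_dim)
        spliced = torch.cat([sent_vecs, pos_vecs], dim=-1)   # the splicing matrix
        outputs, _ = self.bilstm(spliced)
        weights = torch.softmax(self.attn(outputs), dim=1)
        doc_vec = (weights * outputs).sum(dim=1)
        logits = self.dense(doc_vec)
        probs = torch.softmax(logits, dim=-1)   # probability distribution over {0, 1}
        return probs.argmax(dim=-1)             # 0: not noise, 1: noise
```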

In the text noise data identification method described above, the text data is split into sentences, and the segmented sentences serve as the basic unit of data processing, converting a complex text data processing task into a simpler sentence data processing task. Unlike the conventional dropout mechanism, which drops neurons, this scheme applies the dropout mechanism to the training data carrying label data, preventing the model from overfitting during training. The sentence relevance classification model trained on this data can add the corresponding label data to input text data without a large amount of text data having to be annotated, saving labor cost and increasing data processing speed. Moreover, noise prediction is performed based on the splicing matrix obtained by splicing the sentence relevance vector and the position vector, which improves the accuracy of noise data identification.

This text noise identification scheme has great application value in fields such as comment text semantic analysis, sentiment analysis, text retrieval, text clustering, text recommendation, and text management. Recognizing noise in text is an upstream task in these fields: accurately identifying the noise in a text provides more reasonable data support for downstream text processing tasks and makes the subsequent processing more accurate. For example, when analyzing the semantics of a text, accurately recognizing the noise in the text prevents the noise data from adversely affecting the semantic analysis result.

In one embodiment, before inputting the segmented sentence into the trained sentence correlation classification model, the method further includes:

step S500, collecting historical text data, wherein the historical text data carries labeling information;

step S520, according to the labeling information, sentence dividing and labeling are carried out on the historical text data, and training data with label data are obtained;

step S540, setting corresponding dropout probability for training data carrying label data;

step S560, performing dropout processing on the training data carrying the label data based on the dropout probability, and updating the training data;

step S580, train the initial sentence relevance classification model using the updated training data, to obtain a trained sentence relevance classification model.

In practical applications, historical chapter data carrying annotation information is collected as sample data. The annotation information is added to chapter data whose noise status is already known: if the chapter data is noise data, it is annotated as noise data, and if it is non-noise data, it is annotated as non-noise data. After the annotated historical chapter data is collected, the chapter data is split into sentences, and the segmented sentences are labeled according to the annotation information to obtain training data carrying label data. Because chapters annotated as relevant may still contain many irrelevant sentences, whereas chapters annotated as irrelevant contain essentially no relevant sentences, a deep-learning dropout mechanism is introduced to prevent the sentence relevance classification model from overfitting. Unlike the traditional deep-learning dropout mechanism, which performs dropout on the neuron nodes of word vectors, in this application different dropout probabilities are set for the training data and dropout is applied to the training data itself, that is, part of the training data carrying label data is randomly discarded and the training data is updated, with the loss function of the model adjusted according to the dropout probabilities. The initial sentence relevance classification model is then trained with the updated training data until the loss function becomes smaller and smaller, at which point the training of the initial sentence relevance classification model is complete. In this embodiment, applying dropout to the training data effectively prevents the model from overfitting.

As shown in fig. 4, in one embodiment, performing sentence segmentation and labeling processing on the historical text data according to the labeling information to obtain training data carrying label data includes: step S522, segmenting the historical text data into a plurality of sentences and identifying the labeling information of the historical text data; if the labeling information of the historical text data is noise data, labeling the sentences segmented from the historical text data with irrelevant labels to obtain training data carrying irrelevant labels; and if the labeling information of the historical text data is non-noise data, labeling the sentences segmented from the historical text data with relevant labels to obtain training data carrying relevant labels.

Similarly, the historical text data is divided into a plurality of sentences using the preset sentence segmentation algorithm, still with the sentence length control strategy: longer sentences are split again at commas, shorter sentences are spliced with the following sentence, and the sentence length is kept within the preset length threshold. The labeling information of the historical text data is then identified; if the labeling information is noise data, the sentences segmented from that historical text data are labeled with irrelevant labels to obtain training data carrying irrelevant labels, and if the labeling information is non-noise data, the sentences are labeled with relevant labels to obtain training data carrying relevant labels. In this embodiment, the chapter-level annotation information is reused as the sentence label, so a large amount of sentence-level annotation is not required, which improves data processing efficiency.
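
Deriving sentence-level labels from chapter-level annotation information might look like the following sketch; the chapter record layout, the label encoding, and the regex splitter are assumptions for illustration only.

```python
import re

RELEVANT, IRRELEVANT = 1, 0

def build_training_data(chapters: list[dict]) -> list[tuple[str, int]]:
    """Each chapter is assumed to look like {"text": ..., "is_noise": bool}.
    Sentences cut from noise chapters get the irrelevant label; sentences cut
    from non-noise chapters get the relevant label, so the chapter annotation
    is reused and no per-sentence annotation is needed."""
    data = []
    for chapter in chapters:
        label = IRRELEVANT if chapter["is_noise"] else RELEVANT
        for sentence in re.split(r"(?<=[。！？.!?])\s*", chapter["text"]):
            if sentence.strip():
                data.append((sentence.strip(), label))
    return data
```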

As shown in fig. 4, in one embodiment, setting a corresponding dropout probability for training data carrying tag data includes: step S542, respectively inputting training data carrying related labels and training data carrying unrelated labels into an initial sentence relevance classification model, setting a first dropout probability for the training data carrying the related labels by adopting a dropout mechanism, and setting a second dropout probability for the training data carrying the unrelated labels by adopting the dropout mechanism.

In this embodiment, a dropout mechanism is used to perform dropout processing on input training data (sentences) carrying tag data, and different dropout probabilities are used for different tags, specifically, the dropout probability of the training data is set to be a first dropout probability if the input training data carries a relevant tag, and the dropout probability of the training data is set to be a second dropout probability if the input training data carries an irrelevant tag. In the embodiment, different dropout probabilities are set for different labels, so that the influence of sentence label errors on the sentence relevance classification model can be effectively reduced.

In one embodiment, dropout processing is performed on training data carrying tag data based on the dropout probability, and updating the training data includes: based on the first dropout probability, randomly discarding part of training data carrying related labels to obtain a first training set, based on the second dropout probability, randomly discarding part of training data carrying unrelated labels to obtain a second training set, combining the first training set and the second training set to serve as new training data, inputting the new training data into the initial sentence relevance classification model again, and returning to the step of randomly discarding part of training data carrying related labels based on the first dropout probability until the number of times of return reaches a preset number threshold.

Unlike the traditional deep-learning dropout mechanism, this application does not stop the activation value of a neuron with some probability p or randomly (temporarily) delete half of the hidden neurons in the neural network. Instead, when training data carrying relevant labels is input, part of that data is randomly discarded based on the first dropout probability and the remaining data is retained to obtain a first training set; when training data carrying irrelevant labels is input, part of that data is randomly discarded based on the second dropout probability and the remaining data is retained to obtain a second training set. The first and second training sets are combined as new training data and input into the initial sentence relevance classification model again, and the dropout processing is repeated in the above manner until the number of iterations (returns) reaches the preset threshold, at which point the loop ends and the finally updated training data is obtained. In this embodiment, processing the training data with the dropout mechanism ensures that the training data input each time is different, which improves the overall training effect of the model.
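
The data-level dropout loop described above could be sketched as follows: instead of dropping neurons, labelled sentences are randomly discarded with a per-label probability before each training pass. The probability values, the iteration count, and the train_one_pass() helper are assumed placeholders, not values from the original disclosure.

```python
import random

P_DROP_RELEVANT = 0.3     # first dropout probability (assumed value)
P_DROP_IRRELEVANT = 0.1   # second dropout probability (assumed value)
MAX_ITERATIONS = 10       # preset number-of-returns threshold (assumed value)

def drop_samples(samples, p):
    """Randomly discard each (sentence, label) pair with probability p."""
    return [s for s in samples if random.random() >= p]

def train_with_data_dropout(model, data, train_one_pass):
    """data: list of (sentence, label) pairs, label 1 = relevant, 0 = irrelevant.
    train_one_pass(model, batch) is a placeholder for one training pass of the
    sentence relevance classification model over the filtered training data."""
    for _ in range(MAX_ITERATIONS):
        relevant = [d for d in data if d[1] == 1]
        irrelevant = [d for d in data if d[1] == 0]
        first_set = drop_samples(relevant, P_DROP_RELEVANT)        # first training set
        second_set = drop_samples(irrelevant, P_DROP_IRRELEVANT)   # second training set
        train_one_pass(model, first_set + second_set)   # combined new training data
```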

It should be understood that although the steps in the flow charts of FIGS. 2-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily executed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in fig. 5, there is provided a text noise data recognition apparatus including: a data acquisition module 510, a sentence segmentation processing module 520, a sentence correlation processing module 530, and a noise prediction module 540, wherein:

a data obtaining module 510, configured to obtain text data.

The sentence segmentation processing module 520 is configured to perform sentence segmentation processing on the text data to obtain a segmented sentence, and extract a position vector of the segmented sentence.

The sentence relevance processing module 530 is configured to input the segmented sentences into a trained sentence relevance classification model to obtain a sentence relevance vector, where the sentence relevance vector is a feature vector output by a hidden layer of the trained sentence relevance classification model and used for representing sentence information, and the sentence relevance classification model is obtained by performing dropout processing training on training data carrying tag data by using a dropout mechanism.

And the noise prediction module 540 is configured to splice the sentence relevance vector and the position vector to obtain a splicing matrix, and perform noise prediction on the text data based on the splicing matrix to obtain a noise identification result.

As shown in fig. 6, in one embodiment, the apparatus further includes a model training module 550, configured to collect historical text data, where the historical text data carries tagging information; perform sentence segmentation and tagging on the historical text data according to the tagging information to obtain training data carrying tag data; set a corresponding dropout probability for the training data carrying tag data; perform dropout processing on the training data carrying tag data based on the dropout probability and update the training data; and train an initial sentence relevance classification model using the updated training data to obtain a trained sentence relevance classification model.

In one embodiment, the sentence segmentation processing module 520 is further configured to segment the text data into a plurality of sentences by using a preset sentence segmentation algorithm, and segment or splice the segmented sentences according to a preset sentence length threshold, so as to ensure that the length of the segmented sentences meets the preset sentence length threshold.

As shown in fig. 6, in one embodiment, the model training module 550 further includes a sentence cutting and labeling unit 552 configured to cut the historical text data into a plurality of sentences and identify the labeling information of the historical text data; if the labeling information of the historical text data is noise data, label the sentences cut from the historical text data with irrelevant labels to obtain training data carrying irrelevant labels; and if the labeling information of the historical text data is non-noise data, label the sentences cut from the historical text data with relevant labels to obtain training data carrying relevant labels.

As shown in fig. 6, in one embodiment, the model training module 550 further includes a probability setting unit 554, configured to input training data carrying related labels and training data carrying unrelated labels to the initial sentence relevance classification model, respectively, set a first dropout probability for the training data carrying related labels by using a dropout mechanism, and set a second dropout probability for the training data carrying unrelated labels by using a dropout mechanism.

As shown in fig. 6, in one embodiment, the model training module 550 further includes a training data updating unit 556, configured to randomly discard a part of the training data carrying the relevant tags based on the first dropout probability to obtain a first training set, randomly discard a part of the training data carrying the irrelevant tags based on the second dropout probability to obtain a second training set, combine the first training set and the second training set as new training data, input the new training data to the initial sentence relevance classification model again, and return to the step of randomly discarding a part of the training data carrying the relevant tags based on the first dropout probability until the number of times of return reaches a preset number threshold.

For specific limitations of the text noise data recognition apparatus, reference may be made to the above limitations of the text noise data recognition method, which are not repeated here. The modules of the text noise data recognition apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in or independent of a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities and invokes the computer program in the memory. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing text data and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the text noise data recognition method.

Those skilled in the art will appreciate that the structure shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising at least one processor, at least one memory, and a bus; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory, and when executing the computer program, the processor implements the steps of the above text noise data recognition method.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the above text noise data recognition method. It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily; for the sake of brevity, not all possible combinations of the technical features in the above embodiments are described, but as long as there is no contradiction between the combinations of these technical features, they should all be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.
