Text encoding method, device, equipment and computer readable storage medium


Reading note: This technology, "Text encoding method, device, equipment and computer readable storage medium", was designed and created by 陈文斌 (Chen Wenbin), 王腾飞 (Wang Tengfei), and 魏帮国 (Wei Bangguo) on 2021-01-11. Its main content is as follows: The application provides a text encoding method, comprising: performing rule matching on a target text and, if the matching succeeds, generating a first encoding result of the target text; classifying the target text with at least two text classification models and, if the classification results of the at least two models are the same and the confidence of each classification result is greater than a preset threshold, taking the model encoding result of the target text as a second encoding result; and generating a final encoding result of the target text from the first encoding result and the second encoding result. Because the application processes the target text with algorithms from natural language processing, the precision and recall of the final encoding result are greatly improved.

1. A method of text encoding, comprising:

performing rule matching on a target text and, if the matching succeeds, generating a first encoding result of the target text, wherein successful matching means that at least one preset keyword and/or at least one preset regular expression is matched in the target text;

classifying the target text by using at least two text classification models and, if the classification results produced by the at least two text classification models are the same and the confidence of each classification result is greater than a preset threshold, taking the model encoding result of the target text as a second encoding result;

and generating a final encoding result of the target text according to the first encoding result and the second encoding result.

2. The method of claim 1, wherein before the rule matching the target text, further comprising:

acquiring an initial text to be coded;

and removing stop words and/or meaningless punctuation marks from the initial text to obtain the target text.

3. The method of claim 2, wherein before obtaining the target text, further comprising:

and deleting conventional sentences from the initial text.

4. The method of claim 1, wherein the rule matching the target text comprises:

and performing rule matching between the target text and each preset keyword and each preset regular expression in a code frame, wherein the code frame is a standard architecture for converting a large amount of collected text corpora into data.

5. The method of any of claims 1-4, wherein the at least two text classification models comprise:

at least two of: a tgrocery model based on a Support Vector Machine (SVM), a Long Short-Term Memory (LSTM) network model based on a neural network, and a fasttext model.

6. The method according to any one of claims 1-4, wherein the generating a final encoding result of the target text according to the first encoding result and the second encoding result comprises:

if identical encoding results and different encoding results both exist, taking the identical encoding results and the different encoding results together as the final encoding result of the target text;

wherein an identical encoding result is the encoding result, in the first encoding result or the second encoding result, of a first text unit in the target text, a first text unit being a text unit whose encoding results in the first encoding result and the second encoding result are the same; and a different encoding result is the encoding result, in the second encoding result, of a second text unit in the target text, a second text unit being a text unit whose encoding results in the first encoding result and the second encoding result differ.

7. The method of claim 6, further comprising:

and if no identical encoding result exists, taking the second encoding result as the final encoding result of the target text.

8. The method according to any one of claims 1-4, wherein after the rule matching of the target text, further comprising:

and if the matching fails, taking the second encoding result as the final encoding result of the target text.

9. A text encoding apparatus, comprising:

the first encoding unit is used for performing rule matching on a target text and, if the matching succeeds, generating a first encoding result of the target text, wherein successful matching means that at least one preset keyword and/or at least one preset regular expression is matched in the target text;

the second encoding unit is used for classifying the target text by using at least two text classification models and, if the classification results produced by the at least two text classification models are the same and the confidence of each classification result is greater than a preset threshold, taking the model encoding result of the target text as a second encoding result;

and the third encoding unit is used for generating a final encoding result of the target text according to the first encoding result and the second encoding result.

10. An electronic device, comprising: a processor, a memory;

the memory for storing a computer program;

the processor for executing the text encoding method of any one of claims 1-8 by calling the computer program.

11. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the text encoding method of any one of claims 1 to 8.

Technical Field

The present application relates to the field of control technologies, and in particular, to a text encoding method, apparatus, device, and computer readable storage medium.

Background

The automobile industry generates a large amount of customer feedback, covering product evaluation, experience evaluation, and the like. Because automobile manufacturers and dealers pay great attention to improving the overall customer experience, it is particularly important to let machines process this large volume of feedback and extract valuable information from it, so as to improve the products and service level of manufacturers and dealers.

At present, text fed back by customers is mainly encoded manually. However, manual encoding is costly and inefficient for processing massive data, encoding results based on personal understanding are unstable, and the extracted information is biased.

In addition, the existing text encoding technology is mainly rule encoding, which extracts text information according to keywords or key expression structures. Although the precision of rule encoding is high, its recall is very low. Moreover, one meaning often has many forms of expression, and text carries emotional coloring, so it is difficult to grasp the meaning of a text accurately with rule encoding alone. Meanwhile, encoding by keywords and key expressions misses a large amount of text, and the processing efficiency of rule encoding alone is not high.

Disclosure of Invention

The application provides a text encoding method, apparatus, device, and computer-readable storage medium, which can improve the accuracy and comprehensiveness of encoding results.

In a first aspect, the present application provides a text encoding method, including:

performing rule matching on a target text and, if the matching succeeds, generating a first encoding result of the target text, wherein successful matching means that at least one preset keyword and/or at least one preset regular expression is matched in the target text;

classifying the target text by using at least two text classification models and, if the classification results produced by the at least two text classification models are the same and the confidence of each classification result is greater than a preset threshold, taking the model encoding result of the target text as a second encoding result;

and generating a final encoding result of the target text according to the first encoding result and the second encoding result.

In a second aspect, the present application provides a text encoding apparatus comprising:

the first encoding unit is used for performing rule matching on a target text and, if the matching succeeds, generating a first encoding result of the target text, wherein successful matching means that at least one preset keyword and/or at least one preset regular expression is matched in the target text;

the second encoding unit is used for classifying the target text by using at least two text classification models and, if the classification results produced by the at least two text classification models are the same and the confidence of each classification result is greater than a preset threshold, taking the model encoding result of the target text as a second encoding result;

and the third encoding unit is used for generating a final encoding result of the target text according to the first encoding result and the second encoding result.

In a third aspect, the present application provides an electronic device, comprising: a processor, a memory;

the memory for storing a computer program;

the processor is used for executing the text coding method by calling the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text encoding method described above.

In the technical solution provided by the application, rule matching is performed on the target text, and if the matching succeeds, a first encoding result of the target text is generated; the target text is classified with at least two text classification models, and if the classification results of the at least two models are the same and the confidence of each classification result is greater than a preset threshold, the model encoding result of the target text is taken as a second encoding result; a final encoding result of the target text is then generated from the first encoding result and the second encoding result. In this way, based on algorithms of natural language processing, the application classifies the target text through an optimized combination of multiple algorithms: the text is classified with multiple text classification models, whether to use the model encoding result is decided from the classification results, and the final encoding result is generated from the model encoding result and the rule encoding result, so that the precision and recall of the final encoding result are greatly improved.

Drawings

Fig. 1 is a schematic flow chart of a text encoding method shown in the present application;

FIG. 2 is a schematic diagram of a multilevel code shown in the present application;

FIG. 3 is a schematic diagram illustrating the analysis of precision and recall shown in the present application;

FIG. 4 is a schematic diagram of a text encoding apparatus shown in the present application;

Fig. 5 is a schematic structural diagram of an electronic device shown in the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.

Referring to fig. 1, a schematic flowchart of a text encoding method provided in an embodiment of the present application is shown, where the method includes the following steps S101 to S103:

s101: and performing rule matching on the target text, and if the matching is successful, generating a first coding result of the target text, wherein the successful matching refers to matching of at least one preset keyword and/or at least one preset key expression from the target text.

In the embodiment of the present application, the target text may be an original initial text or a text obtained by preprocessing the initial text.

It should be noted that, the embodiment of the present application does not limit the text field to which the initial text belongs, for example, the initial text may be a customer feedback text of the automobile sales service. In addition, the text length of the initial text is not limited in the embodiments of the present application, for example, the initial text is a sentence or a paragraph.

Since an initial text may be preprocessed to obtain the target text, in an implementation of the embodiment of the present application, before the rule matching of the target text in S101, the method may further include: acquiring an initial text to be encoded; and removing stop words and/or meaningless punctuation marks from the initial text to obtain the target text.

In this implementation, data cleaning may be performed on the initial text to remove meaningless punctuation marks and/or stop words. To remove stop words, a stop-word list may be created in advance; the initial text is matched against the stop-word list by traversing the list, and each matched stop word is deleted from the initial text. In addition, other words and sentences that interfere with the meaning of the text can be removed through text matching, semantic recognition, or the like, reducing the interference caused by meaningless words. The target text obtained through one or more of these processing steps makes the subsequent text encoding result more accurate.
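As a minimal sketch of this cleaning step, the following Python fragment removes meaningless punctuation and stop words; the stop-word file name and the helper names are illustrative assumptions, not taken from the application:

    import re

    def load_stopwords(path="stopwords.txt"):
        # Hypothetical stop-word list file, one word per line.
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    def preprocess(initial_text, stopwords):
        # Remove meaningless punctuation, keeping word characters and whitespace.
        text = re.sub(r"[^\w\s]", " ", initial_text)
        # Traverse the stop-word list and delete every matched stop word.
        words = [w for w in text.split() if w not in stopwords]
        return " ".join(words)

    target_text = preprocess("Well, the customer says the maintenance was fast!!!",
                             load_stopwords())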

Further, meaningless conventional sentences that do not need to be encoded can also be deleted according to certain rules, for example:

sentence 1: the customer indicates that the speed of maintenance is fast.

Sentence 2: this problem has been addressed by JOY itself.

Sentence 3: the customer indicates that the outlet will be contacted if not understood.

Here, sentence 1 is not a conventional sentence and needs to be encoded; sentences 2 and 3 are conventional sentences and do not need to be encoded.

For the initial text described above, the initial text may contain one or more sentences. When the initial text contains one sentence and that sentence is a conventional sentence, the initial text is not subsequently encoded; conversely, when the initial text contains one sentence and that sentence is not a conventional sentence, the initial text is subsequently encoded. When the initial text contains multiple sentences, the conventional sentences among them may be removed, and the remaining sentences subsequently encoded.

Therefore, in an implementation of the embodiment of the present application, before the target text is obtained, the conventional sentences in the initial text may be deleted, and the initial text with the conventional sentences deleted is used as the target text.

It should be noted that, when a large number of initial texts need to be processed, they may be processed in batches, including the operations of removing stop words, removing meaningless punctuation marks, removing words and sentences that interfere with sentence meaning, and deleting conventional sentences described above, so as to obtain one or more target texts.

In the embodiment of the present application, for each obtained target text, text encoding may be performed in the following manner, which is specifically described below.

First, word segmentation may be performed on the target text to obtain each word segment in it. In a specific implementation, the whole target text can be sliced according to rules; to this end, configuration files such as the slicing rules, together with the AI (artificial intelligence) models used for machine encoding, need to be loaded. The aim is to roughly divide the target text into semantic intervals.

Then, rule matching is performed on each word segment of the target text through step S101.

In an implementation of the embodiment of the present application, the rule matching of the target text in S101 may specifically include: performing rule matching between the target text and each preset keyword and each preset regular expression in a code frame, wherein the code frame is a standard architecture for converting a large amount of collected text corpora into data.

In this implementation, a large number of keywords and regular expressions may be preset in the code frame, where each keyword may carry a positive or a negative tonality.

Regarding the code frame: it is a standard architecture for converting a large amount of collected text corpora into data, and it expands as a multi-level tree. For example, if it is divided into three levels, the first-level codes describe the broadest aspects of the content, such as "consultation service", "reception service", "product introduction", "test ride and test drive", "price negotiation", "vehicle handover", and "hardware" for automobile sales service. The second-level codes develop each first-level code in its various aspects; for example, under the first-level code "test ride and test drive" there are second-level codes such as "test drive invitation", "test drive explanation", "test drive process", "test drive vehicle", "time and route", and "test drive explanation and demonstration". The third-level codes develop each second-level code in its various aspects; for example, under the second-level code "test drive explanation and demonstration" there are third-level codes such as "actively introducing products" and "answering questions during the process". The third-level code is the smallest unit describing automobile sales service.

Each code in the code frame may be represented by a number, such as 1010101, 1010102, … …, 102101, 102102, … …, and so on. The digits have specific meanings. As shown in the multi-level code diagram of Fig. 2, taking 1010101 as an example: the 1st digit from the left represents positive or negative tonality (1 represents positive and 3 represents negative); the 2nd and 3rd digits represent the first-level code; the 4th and 5th digits represent the second-level code; and the 6th and 7th digits represent the third-level code.
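The following sketch shows how such a 7-digit code could be decomposed according to the digit layout above; the function name and return structure are illustrative, not specified by the application:

    def parse_code(code: str) -> dict:
        assert len(code) == 7, "expected a 7-digit code"
        # 1st digit: tonality (1 represents positive, 3 represents negative).
        tonality = "positive" if code[0] == "1" else "negative"
        return {
            "tonality": tonality,
            "level1": code[1:3],  # 2nd-3rd digits: first-level code
            "level2": code[3:5],  # 4th-5th digits: second-level code
            "level3": code[5:7],  # 6th-7th digits: third-level code
        }

    print(parse_code("1010101"))
    # {'tonality': 'positive', 'level1': '01', 'level2': '01', 'level3': '01'}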

It should be noted that the above code frame may be created for a specific field; for example, it may be formed after many adjustments based on years of experience accumulated in the customer feedback field of the automobile industry.

Based on the above rule matching, the target text can be matched against each preset keyword and each preset regular expression in the code frame.

The keywords can be divided into positively toned keywords and negatively toned keywords. For example, in the context of sales follow-up texts, positive keywords may include: actively contacting, actively communicating, frequently calling, having made a return visit, actively calling, having follow-up tracking, periodically calling, following up in time, real-time tracking, calling to follow up, asking about the customer's use of the vehicle, asking about the vehicle's service condition, calling to show care for the customer, calling the customer immediately, having called every customer, return visit, and so on; negative keywords may include: no return visit, no follow-up, should have followed up, not knowing, no reply, hoping for a reply, did not contact the customer, and so on.

A regular expression describes a pattern for matching character strings; it may be used to check whether a string contains a certain substring, to replace a matched substring, or to extract substrings meeting a certain condition from a string.

When the target text is matched against each preset keyword and each preset regular expression in the code frame, if the target text contains one or more preset keywords (found, for example, by traversing a keyword list) and/or matches one or more preset regular expressions, the matching succeeds. At this point, the target text can be encoded according to a preset encoding mode (an encoding mode based on the keywords and/or the regular expressions); the encoding result obtained here is defined as the first encoding result.
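A minimal sketch of this rule-matching step is shown below; the keyword list, regular expressions, and codes are illustrative assumptions, and a real code frame would hold far more entries:

    import re

    # Hypothetical code-frame entries mapping keywords/patterns to codes.
    KEYWORDS = {"actively contacting": "1010101", "no return visit": "3010101"}
    PATTERNS = {re.compile(r"call(ed)?\s+back"): "1010102"}

    def rule_match(target_text):
        """Return the first encoding result, or None if the matching fails."""
        codes = [c for kw, c in KEYWORDS.items() if kw in target_text]
        codes += [c for pat, c in PATTERNS.items() if pat.search(target_text)]
        return codes or None

    first_result = rule_match("the dealer is actively contacting the customer")
    # first_result == ["1010101"]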

However, if the matching fails, that is, if the target text contains none of the preset keywords and matches none of the preset regular expressions, the target text is not encoded in this step.

S102: classifying the target text by using at least two text classification models and, if the classification results produced by the at least two text classification models are the same and the confidence of each classification result is greater than a preset threshold, taking the model encoding result of the target text as a second encoding result.

It should be noted that enabling computers to understand human language has been an important direction of artificial intelligence in recent years, and text classification is one of the important branches of Natural Language Processing (NLP). Taking customer feedback text as the corpus, the main problem NLP solves in this application is: mining, from users' feedback comments, the evaluations of each predefined aspect of the experience (the code frame, i.e., the evaluation objects); the problem is ultimately solved by parsing the text and performing multi-class text classification. In NLP, commonly used text classification algorithms include Naive Bayes, Support Vector Machines (SVM), Long Short-Term Memory networks (LSTM), Convolutional Neural Networks (CNN), Bidirectional Encoder Representations from Transformers (BERT), Gradient Boosting Decision Trees (GBDT), etc. However, because of the complexity of the meaning of customer feedback text, the precision and recall of a single model's classification results are not high, and the evaluation effect of a single model is mediocre.

For precision and recall, see the analysis diagram in Fig. 3. In Fig. 3, A denotes items that are retrieved and relevant (found and wanted), B denotes items that are retrieved but not relevant (found but not wanted), C denotes items that are not retrieved but relevant (not found but wanted), and D denotes items that are neither retrieved nor relevant (not found and not wanted). Precision (abbreviated P) is the number of items the system identifies correctly divided by the number of all identified items, i.e., P = A/(A+B); recall (abbreviated R) is the number of items the system identifies correctly divided by the number of all relevant items, i.e., R = A/(A+C).
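These two definitions translate directly into code; a small sketch, with A, B, and C as defined in Fig. 3:

    def precision(a: int, b: int) -> float:
        return a / (a + b)  # P = A / (A + B)

    def recall(a: int, c: int) -> float:
        return a / (a + c)  # R = A / (A + C)

    # E.g., 80 correct hits, 20 false hits, 40 misses:
    print(precision(80, 20), recall(80, 40))  # 0.8 and roughly 0.667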

Because the precision and recall of a single model's classification results are not high, the embodiment of the application combines multiple weak classifiers into one strong classifier. Therefore, in an implementation of the embodiment of the present application, the at least two text classification models in S102 may include: at least two of a tgrocery model based on a Support Vector Machine (SVM), a Long Short-Term Memory (LSTM) network model based on a neural network, and a fasttext model.

In this implementation, an SVM-based tgrocery model, a neural-network-based LSTM model, and a fasttext model need to be constructed in advance so that the advantages of the models complement one another. The SVM-based tgrocery model and the fasttext model can classify texts simply and quickly and work well on short texts; the neural-network-based LSTM model, which uses word-vector deep neural networks, is slower and requires much more preparation, but it complements traditional machine learning on long texts and semantic understanding.

To build the models, a data set may be created in advance. For example, based on the above code frame, since a large number of manually coded samples have been accumulated, a certain number of them (e.g., about 100,000) can be extracted as a data set for modeling. The data set may be divided according to a certain ratio (e.g., 4:1), with one part serving as the training set of the models and the other part as the test set.
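A minimal sketch of this 4:1 split, assuming scikit-learn as the tooling (the application does not name one) and a hypothetical load_samples() helper returning the manually coded texts and their codes:

    from sklearn.model_selection import train_test_split

    texts, labels = load_samples()  # hypothetical loader for the coded samples
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=42)  # 4:1 train/test ratio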

The following describes the SVM-based tgrocery model, the neural network-based LSTM model, and the fasttext model, respectively.

1. Tgrocery model based on SVM

Experiments show that, after the SVM-based tgrocery model is trained with the training set and then tested with the test set, the test accuracy can reach 84.6%.

The SVM-based tgrocery model regards a text as a point in a high-dimensional space and separates texts of different classes with planes; to predict which code of the code frame a text belongs to, one only needs to determine which subspace the text's point falls into. tgrocery is based on this idea.

2. LSTM model based on neural network

Experiments show that, after the neural-network-based LSTM model is trained with the training set and then tested with the test set, the test accuracy can reach 89.02%.

A recurrent neural network is a nonlinear system with a relatively complex structure. A text is regarded as a time sequence: the words of the text (e.g., their word2vec vectors) are input into the network one by one in temporal order, and when the last word of the text has been input, the output of the whole system is the class code to which the text belongs. The model has many network parameters and a long training convergence time, but because it takes the contextual semantic information of the text into account, it can handle reversals of meaning that are difficult for rule encoding to process; the model therefore has unique advantages in sentiment analysis. The LSTM used here is based on this kind of network.

3. Fasttext model

Experiments show that, after the fasttext model is trained with the training set and then tested with the test set, the test accuracy can reach 81.2%.

fasttext is a short-text classification tool. It rests on the observation that text classification is partly a linear problem: much of the classification information can be captured without excessive nonlinear transformation or feature combination, so some tasks can be solved even with a simple model, and its single-layer network trains fast.

In S102, the target text is input into N (N ≥ 2) text classification models (i.e., weak classifiers), and the N weak classifiers together constitute a strong classifier. Each text classification model classifies the target text; if the classification results of the N models are consistent, the confidence of each model's classification result is checked, and when the confidences of all N classification results are greater than a preset threshold (e.g., 0.95, balancing precision and recall), the encoding result produced by the models for the target text is obtained; this encoding result is defined as the second encoding result.

Since the classification results of the N text classification models are identical, the encoding result of any one of the N models may be used as the second encoding result.
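The agreement check of S102 can be sketched as follows; the model objects and their predict() interface, returning a (label, confidence) pair, are assumptions for illustration:

    THRESHOLD = 0.95  # preset threshold balancing precision and recall

    def model_encode(target_text, models):
        """Return the second encoding result, or None when the models
        disagree or any confidence is not above the threshold."""
        predictions = [m.predict(target_text) for m in models]
        labels = {label for label, _ in predictions}
        if len(labels) == 1 and all(conf > THRESHOLD for _, conf in predictions):
            # The N results are identical, so any one of them can serve
            # as the second encoding result.
            return predictions[0][0]
        return None  # no model encoding result is produced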

However, when the classification results of the N text classification models are inconsistent, or the classifications are consistent but the confidences of the N classification results are not all greater than the preset threshold, no model encoding result is produced for the target text, that is, the target text is not encoded in this step.

S103: generating a final encoding result of the target text according to the first encoding result and the second encoding result.

When the first encoding result of the target text has been obtained through S101 and the second encoding result through S102, the final encoding result of the target text is generated based on the first encoding result and the second encoding result.

In an implementation manner of the embodiment of the present application, the "generating a final encoding result of the target text according to the first encoding result and the second encoding result" in S103 may specifically include:

if identical encoding results and different encoding results both exist, the identical encoding results and the different encoding results are together taken as the final encoding result of the target text; an identical encoding result is the encoding result, in the first encoding result or the second encoding result, of a first text unit in the target text, a first text unit being a text unit whose encoding results in the first encoding result and the second encoding result are the same; a different encoding result is the encoding result, in the second encoding result, of a second text unit in the target text, a second text unit being a text unit whose encoding results in the first encoding result and the second encoding result differ.

Specifically, in this implementation, both the first encoding result and the second encoding result of the target text can be encoded in units of words. For each word segment in the target text, its encoding result can be looked up in the first encoding result and in the second encoding result. When the word segment has the same encoding result in both, it is defined as a first text unit; when its encoding results differ, it is defined as a second text unit. Then, for each first text unit, the shared encoding result is taken from either the first or the second encoding result; for each second text unit, the encoding result is taken from the second encoding result; and the obtained encoding results are combined to yield the final encoding result of the target text.
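A minimal sketch of this merging step, with each encoding result represented as a dict from word segment to code (an assumed representation, not specified by the application):

    def merge_results(first: dict, second: dict) -> dict:
        final = {}
        for unit in set(first) | set(second):
            if unit in first and unit in second and first[unit] == second[unit]:
                final[unit] = first[unit]   # first text unit: identical result
            elif unit in second:
                final[unit] = second[unit]  # second text unit: model result wins
            # Word segments coded only in the first result are not covered by
            # the policy above and are left out of this sketch.
        return final

    final_result = merge_results({"service": "1010101", "price": "3050101"},
                                 {"service": "1010101", "price": "3050202"})
    # {'service': '1010101', 'price': '3050202'}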

Further, the embodiment of the present application may further include: if no identical encoding result exists, taking the second encoding result as the final encoding result of the target text. Specifically, when no first text unit exists in the target text, that is, when no word segment has the same encoding result in the first encoding result and the second encoding result, the second encoding result is directly used as the final encoding result of the target text.

Further, the embodiment of the present application may further include: after the rule matching of the target text in S101, if the matching fails, taking the second encoding result as the final encoding result of the target text. Specifically, if the rule matching does not succeed, that is, if the target text contains none of the preset keywords and matches none of the preset regular expressions, the target text has no rule encoding result, and the second encoding result is directly used as its final encoding result.

Further, in this embodiment of the present application, a manual correction platform may be created in advance. An authorized user (e.g., a coder) can sample the final encoding result of the target text, or the model encoding result from S102, for verification; if a result is not accurate enough, the platform can be used to correct it. Meanwhile, the platform records each updated manual encoding result and adds it to the data set for iterative training of the models in S102. In this way the models are continuously optimized and achieve self-learning; on this basis, the accuracy of machine encoding can be computed from the manual corrections, and the accuracy of the classification results output by the models is gradually improved by continuously adjusting the model parameters.

In the text encoding method provided by the embodiment of the application, rule matching is performed on the target text, and if the matching succeeds, a first encoding result of the target text is generated; the target text is classified with at least two text classification models, and if the classification results of the at least two models are the same and the confidence of each classification result is greater than a preset threshold, the model encoding result of the target text is taken as a second encoding result; a final encoding result of the target text is then generated from the first encoding result and the second encoding result. In this way, based on algorithms of natural language processing, the target text is classified through an optimized combination of multiple algorithms: the text is classified with multiple text classification models, whether to use the model encoding result is decided from the classification results, and the final encoding result is generated from the model encoding result and the rule encoding result, so that the precision and recall of the final encoding result are greatly improved.

Referring to fig. 4, a schematic diagram of a text encoding apparatus provided in an embodiment of the present application is shown, where the apparatus includes:

the first encoding unit 410 is configured to perform rule matching on a target text and, if the matching succeeds, generate a first encoding result of the target text, where successful matching means that at least one preset keyword and/or at least one preset regular expression is matched in the target text;

a second encoding unit 420, configured to classify the target text by using at least two text classification models and, if the classification results produced by the at least two text classification models are the same and the confidence of each classification result is greater than a preset threshold, take the model encoding result of the target text as a second encoding result;

a third encoding unit 430, configured to generate a final encoding result of the target text according to the first encoding result and the second encoding result.

In an implementation manner of the embodiment of the present application, the apparatus further includes:

the preprocessing unit is used for acquiring an initial text to be encoded before rule matching is performed on the target text, and removing stop words and/or meaningless punctuation marks from the initial text to obtain the target text.

In an implementation manner of the embodiment of the present application, the preprocessing unit is further configured to:

and deleting the conventional sentences from the initial text before the target text is obtained.

In an implementation manner of the embodiment of the present application, the first encoding unit 410 is specifically configured to:

and performing rule matching between the target text and each preset keyword and each preset regular expression in a code frame, wherein the code frame is a standard architecture for converting a large amount of collected text corpora into data.

In an implementation manner of the embodiment of the present application, the at least two text classification models include:

at least two of: a tgrocery model based on a Support Vector Machine (SVM), a Long Short-Term Memory (LSTM) network model based on a neural network, and a fasttext model.

In an implementation manner of the embodiment of the present application, the third encoding unit 430 is specifically configured to:

if identical encoding results and different encoding results both exist, taking the identical encoding results and the different encoding results together as the final encoding result of the target text;

wherein an identical encoding result is the encoding result, in the first encoding result or the second encoding result, of a first text unit in the target text, a first text unit being a text unit whose encoding results in the first encoding result and the second encoding result are the same; and a different encoding result is the encoding result, in the second encoding result, of a second text unit in the target text, a second text unit being a text unit whose encoding results in the first encoding result and the second encoding result differ.

In an implementation manner of the embodiment of the present application, the apparatus further includes:

and a fourth encoding unit, configured to take the second encoding result as the final encoding result of the target text if no identical encoding result exists.

In an implementation manner of the embodiment of the present application, the apparatus further includes:

and a fifth encoding unit, configured to take the second encoding result as the final encoding result of the target text if the matching fails after the rule matching of the target text.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the relevant parts of the description of the method embodiments. The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application. One of ordinary skill in the art can understand and implement this without inventive effort.

An embodiment of the present application further provides an electronic device, whose schematic structural diagram is shown in Fig. 5. The electronic device 5000 includes at least one processor 5001, a memory 5002, and a bus 5003, and the at least one processor 5001 is electrically connected to the memory 5002. The memory 5002 is configured to store at least one computer-executable instruction, and the processor 5001 is configured to execute the at least one computer-executable instruction so as to perform the steps of any text encoding method provided in any embodiment or alternative embodiment of the present application.

Further, the processor 5001 may be an FPGA (Field-Programmable Gate Array) or another device with logic processing capability, such as an MCU (Microcontroller Unit) or a CPU (Central Processing Unit).

By applying the embodiment of the application, based on algorithms of natural language processing, the target text is classified through an optimized combination of multiple algorithms: the text is classified with multiple text classification models, whether to use the model encoding result is decided from the classification results, and the final encoding result of the target text is then generated from the model encoding result and the rule encoding result, so that the precision and recall of the final encoding result are greatly improved.

The embodiments of the present application further provide a computer-readable storage medium storing a computer program; when executed by a processor, the program implements the steps of any text encoding method provided in any embodiment or alternative embodiment of the present application.

The computer-readable storage medium provided by the embodiments of the present application includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable storage medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the scope of protection of the present application.
