Deep learning-based method for automatically extracting merchant information

Document No.: 1073213. Publication date: 2020-10-16.

Abstract: "Deep learning-based method for automatically extracting merchant information" was designed by Huang Shiya (黄诗雅), Luo Mujun (罗睦军) and Zhang Zhiqing (张志青) on 2020-06-03. The invention discloses a deep learning-based method for automatically extracting merchant information, comprising: acquiring merchant text, extracting merchant features, and generating a merchant information text data set; rechecking and correcting the extraction results; denoising the manually corrected results and labeling the merchant text according to the extracted merchant features, completing construction of the merchant information text data set; mapping the words of the training corpus to index representations, constructing a word-index mapping table and a label-index mapping table, reading word vectors from a pre-trained word vector model and inputting them into the word vector model as initialization values, converting the merchant text and entity labels to index representations through the mapping tables, padding them to a fixed length, and submitting them to a sequence labeling model for training; and using the sequence labeling model to predict labels for the merchant text to be tested and find the entity information it contains. The invention reduces the manpower and time cost of manual one-by-one labeling and is efficient, automated and highly accurate.

1. A method for automatically extracting merchant information based on deep learning is characterized by comprising the following steps:

A) after acquiring a merchant text, extracting merchant characteristics through a characteristic rule to generate a merchant information text data set;

B) rechecking and correcting the feedback extraction result by a manual sampling method;

C) denoising the manual correction results, removing mislabeled data, and labeling the original merchant text with entity labels and the non-entity label "other" according to the extracted merchant features, to complete the construction of the merchant information text data set;

D) mapping words of a training corpus into index representations, constructing a word-index mapping table, constructing a label-index mapping table for category labels, reading word vectors from a pre-trained word vector model, inputting the word vectors into the word vector model as initialization values, digitizing merchant texts and entity labels into index representations through the word-index mapping table and the label-index mapping table, filling the index representations into fixed lengths, and submitting the index representations to a sequence labeling model for training;

E) using the sequence labeling model to perform prediction labeling on the merchant text to be tested and find the entity information existing in the merchant text.

2. The deep learning based method for automatically extracting merchant information according to claim 1, wherein the extracting merchant features through feature rules comprises the following steps:

A1) constructing a word dictionary to be eliminated;

A2) performing condition judgment on each merchant name, wherein dictionary content at the beginning of the merchant name is filtered out; if dictionary content appears in the second half of the merchant name, that word and everything after it are skipped during extraction; and if the merchant name contains text in parentheses, the parenthesized content is skipped directly;

A3) correcting the extracted feature content according to length.

3. The deep learning-based method for automatically extracting merchant information according to claim 2, wherein step A3) is specifically: if the merchant text is shorter than 3 characters, the feature word is extracted directly; if the merchant text is longer than 15 characters, the extracted feature word is at most 6 characters and the company type is at most two characters.

4. The deep learning-based method for automatically extracting merchant information according to claim 3, wherein the denoising process is: matching the manually repaired feature labels against the full name of the merchant by a condition matching method, labeling the full name of the merchant with feature word labels and other-word labels using the BIO labeling scheme, and skipping the merchant text if the full name of the merchant cannot be matched.

5. The deep learning based automatic merchant information extraction method according to any one of claims 1 to 4, wherein the sequence labeling model generation step further comprises:

D1) reading the training corpus into memory, counting the frequency of each word in the documents and filtering out words whose frequency is below a minimum threshold or above a maximum threshold, mapping the remaining distinct words to index representations, adding the padding character, unknown character and numeric character to form the word-index mapping table, and constructing a label-index mapping table for the labels;

D2) storing all merchant texts as a list, setting the minimum word frequency and maximum word frequency used for filtering and the context window size, training a word2vec model on the merchant texts to obtain a word vector model, and reading the word vectors corresponding to the word-index mapping table from the word vector model as the initial values of the word vector model;

D3) converting the words of each document to numbers through the word-index mapping table and, because document lengths differ, bringing the documents to a fixed length by truncating documents longer than the maximum threshold and padding documents shorter than the minimum threshold with <PAD>, converting the labels to numbers in the same way, and saving the label-index mapping table and the word vectors to a configuration file.

6. The deep learning-based method for automatically extracting merchant information as claimed in claim 1, wherein the prediction labeling is implemented based on BiLSTM-CRF.

7. The deep learning based method for automatically extracting merchant information as claimed in claim 5, wherein the word vector is obtained based on a word2vec model, and the word2vec model is composed of an input layer, a hidden layer and an output layer.

8. The deep learning-based method for automatically extracting merchant information according to claim 1, wherein the merchant text is predicted by a BiLSTM-CRF sequence model, and the BiLSTM-CRF method converts the merchant text into a fixed-length text sequence and then feeds it into the BiLSTM-CRF network structure for training.

Technical Field

The invention relates to the field of telecommunication, in particular to a method for automatically extracting merchant information based on deep learning.

Background

When a user asks for specific information about a company during a call, such as its detailed address or telephone number, the system often cannot capture the company name directly and return the information, because the name the user speaks is usually incomplete: the user gives only an abbreviation, omits the address prefix at the start of the company name, or has forgotten the full name. However, during the call the user may use many feature words to supplement the description of the company. With this supplementary feature word information, the system can determine the company name more effectively.

The difficulty is that the merchant information in the customer service hotline has no feature words for such supplementary description, so merchant names must be annotated with features manually. If a telecommunications operator needs to label merchant names, labeling hundreds of thousands of names consumes a great deal of manpower, and later changes to merchant names require timely maintenance and modification. Labeling merchant feature words manually one by one is accurate but costs a great deal of manpower and time, for two main reasons: first, there are too many merchants, with tens of thousands in a single city; second, new merchants appear and merchant names change over time.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a method for automatically extracting merchant information based on deep learning, which reduces manpower for manual one-by-one labeling, reduces time cost, and has high efficiency, automation, and high accuracy.

The technical scheme adopted by the invention for solving the technical problems is as follows: a method for automatically extracting merchant information based on deep learning is constructed, and comprises the following steps:

A) after acquiring a merchant text, extracting merchant characteristics through a characteristic rule to generate a merchant information text data set;

B) rechecking and correcting the feedback extraction result by a manual sampling method;

E) and the sequence labeling model carries out prediction labeling on the merchant text to be tested and finds out entity information existing in the merchant text.

In the method for automatically extracting the merchant information based on deep learning, the method for extracting the merchant characteristics through the characteristic rules comprises the following steps:

A1) constructing a word dictionary to be eliminated;

A3) and correcting the feature extraction content according to the length.

In the method for automatically extracting merchant information based on deep learning, step A3) is specifically as follows: if the merchant text is shorter than 3 characters, the feature word is extracted directly; if the merchant text is longer than 15 characters, the extracted feature word is at most 6 characters and the company type is at most two characters.

In the method for automatically extracting merchant information based on deep learning, the denoising process is as follows: the manually repaired feature labels are matched against the full name of the merchant by a condition matching method, the full name of the merchant is labeled with feature word labels and other-word labels using the BIO labeling scheme, and the merchant text is skipped if the full name of the merchant cannot be matched.

In the method for automatically extracting merchant information based on deep learning, the step of generating the sequence labeling model further includes:

In the method for automatically extracting merchant information based on deep learning, the prediction labeling is implemented based on BiLSTM-CRF.

In the method for automatically extracting the merchant information based on deep learning, the word vector is obtained based on a word2vec model, and the word2vec model consists of an input layer, a hidden layer and an output layer.

In the method for automatically extracting merchant information based on deep learning, the merchant text is predicted by a BiLSTM-CRF sequence model; the BiLSTM-CRF method converts the merchant text into a fixed-length text sequence and then feeds it into the BiLSTM-CRF network structure for training.

The method for automatically extracting merchant information based on deep learning has the following beneficial effects: after the merchant information text of a telecom operator is acquired, merchant features are identified based on feature word rules, wrongly identified texts are manually rechecked and repaired, and the text is converted into the specified sequence labeling format, completing production of the training corpus; deep learning is then used to build and train a model on the corpus, and finally the trained model extracts feature information and merchant categories for further merchant names. The invention reduces the manpower and time cost of manual one-by-one labeling and is efficient, automated and highly accurate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of one embodiment of a deep learning based method for automatically extracting merchant information in accordance with the present invention;

FIG. 2 is a block flow diagram of a method for automatically extracting merchant information based on deep learning in the embodiment;

FIG. 3 is a detailed flowchart of merchant feature extraction according to the feature rules in the embodiment;

FIG. 4 is a specific flowchart of the sequence annotation model generation in the embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this embodiment, the flow chart of the deep learning-based method for automatically extracting merchant information is shown in fig. 1, and fig. 2 is a block flow diagram of the same method. As shown in fig. 1, the method includes the following steps:

Step S01, after acquiring the merchant text, extracting merchant features through feature rules to generate a merchant information text data set: in this step, the feature rule extraction system downloads the merchant text file via FTP and then extracts merchant features from the merchant text through the feature rules to generate the merchant information text data set.

Step S02, rechecking and correcting the fed-back extraction results by manual sampling.

Step S03, denoising the manual correction results, removing mislabeled data, and labeling the original merchant text with entity labels and the non-entity label "other" according to the extracted merchant features, to complete construction of the merchant information text data set. Specifically, after reading the merchant text file, the feature rule extraction system filters out noise such as addresses, special organization information and parenthesized content. It then extracts the company type and the features from the filtered text according to the feature rules, for example company-type feature words such as "limited company". Service personnel sample and check the merchant texts processed by the feature rule extraction system and repair the labels according to the real feature type of each merchant text. The feature rule extraction system then labels each merchant information text with the manually repaired results; any part that is not a required feature word is labeled other (O), which completes the training corpus.

The noise reduction processing is as follows: the manually repaired feature labels are matched against the full name of the merchant by a condition matching method, and the full name is labeled with feature word labels and other-word labels using the BIO labeling scheme. If the full name of the merchant cannot be matched, the merchant text is skipped.

Step S04, building the mapping tables, initializing the word vectors, and training the sequence labeling model: in this step, the words of the training corpus are mapped to index representations to build a word-index mapping table, and a label-index mapping table is built for the category labels. Word vectors are then read from the pre-trained word vector model and input into the word vector model as initialization values. The merchant text and the entity labels are converted to index representations through the word-index and label-index mapping tables, padded to a fixed length, and finally submitted to the sequence labeling model for training.

Step S05, the sequence labeling model performs prediction labeling on the merchant text to be tested and finds the entity information existing in the merchant text: in this step, the prediction labeling is implemented based on BiLSTM-CRF. The feature information extraction system loads the training text, trains the model on the labeled text, and stores the best resulting prediction model in a model file. It then downloads via FTP the texts whose merchant information is to be predicted and extracted, performs merchant feature extraction with the BiLSTM-CRF sequence labeling model, and finally obtains the merchant feature and type results.
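As an illustration of the last part of Step S05, the sketch below turns a predicted BIO tag sequence back into (text, type) entities. It is a minimal sketch under stated assumptions: the function name `decode_entities` and the tag types `FEAT` and `TYPE` are hypothetical, since the text does not name its label set, and the tags are assumed to be per character.

```python
def decode_entities(chars, tags):
    """Collect (entity_text, entity_type) pairs from per-character BIO tags."""
    entities = []
    current_chars, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the previous one if any.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current entity.
            current_chars.append(ch)
        else:
            # "O" or an inconsistent I- tag closes the current entity.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


# Hypothetical model output for an 8-character merchant name.
name = "广州美味餐饮公司"
predicted = ["O", "O", "B-FEAT", "I-FEAT", "I-FEAT", "I-FEAT", "B-TYPE", "I-TYPE"]
result = decode_entities(name, predicted)
```

An I- tag that does not follow a matching B- tag is ignored here; other policies (for example, treating it as the start of an entity) are equally possible.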

After acquiring the merchant information text of a telecom operator, the method labels merchant features based on feature word rules, manually rechecks and repairs the wrongly labeled texts, and converts the text into the specified sequence labeling format, completing production of the training corpus. Deep learning is then used to build and train a model on the corpus, and finally the trained model extracts feature information and merchant categories for further merchant names. The method addresses the current problem that operators must manually label a large number of merchant names in every city at great cost in manpower. It is based on natural language processing and deep learning, offers high reliability, strong modeling capability and high accuracy, requires only a small amount of manual work over the whole process, and does not depend on operators to provide training corpora, saving operators a great deal of manpower and time.
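The conversion into sequence labeling format mentioned above, together with the rule that unmatched merchant names are skipped, could look roughly like the following. This is a sketch, not the patent's implementation: `bio_label`, the simple substring match standing in for the "condition matching method", and the tag names `FEAT` and `TYPE` are all assumptions made for illustration.

```python
def bio_label(full_name, spans):
    """Tag each character of full_name with BIO labels.

    spans is a list of (substring, tag) pairs taken from the manually
    repaired results. Returns None when a substring cannot be matched,
    which signals that the merchant text should be skipped.
    """
    tags = ["O"] * len(full_name)
    for text, tag in spans:
        start = full_name.find(text) if text else -1
        if start == -1:
            return None  # cannot be matched: skip this merchant text
        tags[start] = "B-" + tag
        for i in range(start + 1, start + len(text)):
            tags[i] = "I-" + tag
    return tags


labels = bio_label("广州美味餐饮公司", [("美味餐饮", "FEAT"), ("公司", "TYPE")])
skipped = bio_label("广州美味餐饮公司", [("海鲜酒家", "FEAT")])
```

Characters not covered by any span keep the label O, matching the "other" label described in Step S03.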

For this embodiment, the above step of extracting merchant features through the feature rule may be further refined, and a flowchart after refinement is shown in fig. 3. In fig. 3, the step of extracting merchant features by using the feature rule further includes the following steps:

Step S11, constructing a dictionary of words to be removed: in this step, a dictionary of words to be removed is constructed, which includes addresses and special proper nouns, such as "Tianhe area", "Communist Party of China", "Guangzhou province", etc.

Step S12, performing condition judgment on each merchant name: dictionary content at the beginning of the merchant name is filtered out; if dictionary content appears in the second half of the merchant name, that word and everything after it are skipped during extraction; and if the merchant name contains text in parentheses, the parenthesized content is skipped directly.

Step S13, correcting the extracted feature content according to length: if the merchant text is shorter than 3 characters, the feature word is extracted directly; if the merchant text is longer than 15 characters, the extracted feature word is at most 6 characters and the company type is at most two characters.
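Steps S11 to S13 can be sketched as a small rule function. Everything concrete here is an assumption for illustration: the dictionary entries, the list of company-type suffixes, the choice to measure length on the original merchant text, and the choice to keep the last two characters of a long company type are not specified in the text.

```python
import re

# Hypothetical dictionaries; the actual ones are not published in the text.
REMOVE_WORDS = ["广州市", "天河区"]
COMPANY_TYPES = ["有限公司", "公司", "酒店", "餐厅"]


def extract_features(name, remove_words=REMOVE_WORDS, company_types=COMPANY_TYPES):
    """Return (feature_word, company_type) for one merchant name."""
    # S12: skip parenthesized content (full-width or ASCII brackets).
    text = re.sub(r"[（(][^）)]*[）)]", "", name)
    # S13: very short names are used directly as the feature word.
    if len(text) < 3:
        return text, ""
    # S12: filter dictionary words at the beginning of the name.
    changed = True
    while changed:
        changed = False
        for w in remove_words:
            if w and text.startswith(w):
                text = text[len(w):]
                changed = True
    # S12: a dictionary word in the second half cuts off itself and the rest.
    half = len(text) // 2
    cut = len(text)
    for w in remove_words:
        pos = text.find(w, half)
        if pos != -1:
            cut = min(cut, pos)
    text = text[:cut]
    # Split off the longest matching company-type suffix.
    company_type = ""
    for t in sorted(company_types, key=len, reverse=True):
        if text.endswith(t):
            company_type = t
            text = text[:-len(t)]
            break
    feature = text
    # S13: length correction for long merchant texts (assumed interpretation).
    if len(name) > 15:
        feature = feature[:6]
        company_type = company_type[-2:]
    return feature, company_type


r1 = extract_features("广州市美味餐饮有限公司(总店)")
r2 = extract_features("天河区老友记港式茶餐厅天河区分店")
r3 = extract_features("麦记")
```

In the first call the leading address and the parenthesized branch name are dropped; in the second, the address in the second half removes the trailing branch description before the suffix is split off.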

For the present embodiment, the generation of the sequence annotation model can be further refined, and a flow chart after refinement is shown in fig. 4. In fig. 4, the step of generating the sequence annotation model further includes:

Step S41, reading the training corpus into memory, filtering words by frequency, and building the word-index and label-index mapping tables: in this step, the training corpus is read into memory, the frequency of each word in the documents is counted, and words whose frequency is below the minimum threshold or above the maximum threshold are filtered out. The remaining distinct words are mapped to index representations, and '<PAD>', '<UNK>' and '<NUM>', which denote the padding character, unknown characters and numeric characters respectively, are added to form the word-index mapping table. A label-index mapping table is constructed for the labels.

Step S42, training the word vectors: in this step, all merchant texts are stored as a list, and the minimum word frequency, the maximum word frequency and the context window size are set. A word2vec model is then trained on the merchant texts to obtain a word vector model, and the word vectors corresponding to the word-index mapping table are read from it as the initial values of the word vector model.

Step S43, converting documents and labels to fixed-length index sequences: in this step, the words of each document are converted to numbers through the word-index mapping table. Because document lengths differ, the documents are brought to a fixed length: documents longer than the maximum threshold are truncated, and documents shorter than the minimum threshold are padded with <PAD>. The labels are converted to numbers in the same way, and the label-index mapping table and the word vectors are saved to a configuration file.
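The table building of Step S41 and the fixed-length encoding of Step S43 can be sketched as follows. This is a simplified illustration: the function names, the thresholds, the per-character treatment of text, and the use of a single `max_len` for both truncation and padding (instead of separate highest and lowest thresholds) are assumptions.

```python
from collections import Counter

PAD, UNK, NUM = "<PAD>", "<UNK>", "<NUM>"


def build_word_index(docs, min_freq=1, max_freq=None):
    """Map each kept character to an index; special tokens come first."""
    counts = Counter(ch for doc in docs for ch in doc if not ch.isdigit())
    word2idx = {PAD: 0, UNK: 1, NUM: 2}
    for word, freq in sorted(counts.items()):
        # Drop characters that are too rare or too frequent.
        if freq < min_freq or (max_freq is not None and freq > max_freq):
            continue
        word2idx[word] = len(word2idx)
    return word2idx


def encode(doc, word2idx, max_len):
    """Convert a document to a fixed-length list of indices."""
    ids = []
    for ch in doc[:max_len]:  # truncate long documents
        if ch.isdigit():
            ids.append(word2idx[NUM])
        else:
            ids.append(word2idx.get(ch, word2idx[UNK]))
    ids += [word2idx[PAD]] * (max_len - len(ids))  # pad short documents
    return ids


word2idx = build_word_index(["abca", "ab1"], min_freq=2)
padded = encode("ab1c", word2idx, max_len=6)
truncated = encode("abab", word2idx, max_len=3)
```

Label sequences can be encoded the same way with a label-index table, so that each position of the padded text has a matching label index.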

Word vectors are obtained with the word2vec model, mainly using the CBOW (continuous bag-of-words) variant. The main idea is to predict the current word from the known context words given as input. The word2vec model consists mainly of a three-layer neural network (an input layer, a hidden layer and an output layer), and works as follows:

(1) The input is the one-hot representation of the context, where the dimension of the word vector space is V and the number of context words is C. The words of all documents are numbered and a feature vector is extracted for each document, in which a word is marked 1 if it appears in the document and 0 otherwise.

(2) Each one-hot vector is multiplied by the shared input weight matrix W, where W is a V × N matrix and N is a user-chosen number. The resulting vectors are summed and averaged to give the hidden layer vector, of size 1 × N.

(3) The hidden layer vector is multiplied by the output weight matrix W', of size N × V, to obtain the desired word vector matrix.
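Written out with row vectors, steps (1) to (3) correspond to the standard CBOW forward pass. The softmax in the last line is part of the usual word2vec formulation rather than something stated in the text above, and the symbols $x_c$, $h$, $u$ and $p$ are notation introduced here for illustration.

$$
h = \frac{1}{C} \sum_{c=1}^{C} x_c W \in \mathbb{R}^{1 \times N}, \qquad x_c \in \{0,1\}^{1 \times V}
$$

$$
u = h\,W' \in \mathbb{R}^{1 \times V}
$$

$$
p(w_j \mid \text{context}) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}
$$

Because each $x_c$ is one-hot, the product $x_c W$ simply selects one row of $W$, which is why the rows of $W$ are commonly taken as the learned word vectors.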

The merchant text is predicted by a BiLSTM-CRF sequence model. The BiLSTM-CRF method converts the merchant text into a fixed-length text sequence and then feeds it into the BiLSTM-CRF network structure for training. The specific prediction steps are:

(1) Input layer (word embedding layer): a text sequence c of fixed length n is fed to the input layer. Each word is represented by a vector $x_i$ of embedding dimension k, so the sentence is represented by the sequence $x_{1:n} = (x_1, x_2, \dots, x_n)$. The word vectors come from the pre-trained word2vec model and are not fine-tuned during model training.

(2) The word vector sequence passes through the BiLSTM, which outputs a predicted score for each label at each position of the text sequence c. For example, for one word of text c the BiLSTM layer might output 1.5 (B-person), 0.9 (I-person), 0.1 (other).

(3) The labels output by the BiLSTM cannot be guaranteed to be correct, that is, there is a label bias problem. The CRF adds constraint rules to reduce the probability of prediction errors.

The matrix $A$, whose entry $A_{y_i, y_{i+1}}$ scores the transition from label $y_i$ to label $y_{i+1}$, is initialized randomly before training the model. The CRF layer keeps learning these constraints as the number of training iterations increases, so the matrix becomes more and more "reasonable".
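To show how the matrix A constrains the BiLSTM scores at prediction time, here is a minimal Viterbi decoding sketch in plain Python. It assumes the usual linear-chain CRF path score (the sum of per-position label scores plus the transition scores between adjacent labels); the function name, the three-label set and all numeric values are made up for illustration and are not taken from the text.

```python
def viterbi(emissions, transitions):
    """Find the highest-scoring label path.

    emissions[i][t]: BiLSTM score of label t at position i.
    transitions[s][t]: score A[s][t] of moving from label s to label t.
    Returns (best_score, best_path) with labels as integer indices.
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        new_score, ptr = [], []
        for t in range(n_tags):
            # Best previous label for ending at label t.
            best_prev = max(range(n_tags), key=lambda s: score[s] + transitions[s][t])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][t] + emit[t])
        score = new_score
        backptr.append(ptr)
    last = max(range(n_tags), key=lambda t: score[t])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


# Labels: 0 = B, 1 = I, 2 = O. The transition O -> I is heavily penalized,
# standing in for a constraint a trained CRF layer would learn.
A = [[0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, -1e4, 0.0]]
P = [[0.1, 0.2, 1.5],   # position 0: O scores highest
     [0.2, 1.0, 0.9]]   # position 1: I scores highest
greedy = [max(range(3), key=lambda t: row[t]) for row in P]
best_score, best_path = viterbi(P, A)
```

Picking the top label at each position independently gives O followed by I, which is not a valid BIO sequence; decoding with the transition scores returns O, O instead.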

In summary, the present invention relates to the fields of telecommunications, deep learning and natural language processing, and in particular to a method for extracting merchant information from operator texts based on deep learning. With deep learning, the training corpus can be produced by extracting features with the existing feature word rules plus a small amount of manual sampling and repair, keeping early-stage manual labeling to a minimum; the training corpus is then modeled with deep learning, and the resulting model is used to extract features from other merchant names.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
