Rail transit standard entity identification method based on catalog topic classification

Document No.: 1831804    Publication date: 2021-11-12

Note: this technology, "Rail transit standard entity identification method based on catalog topic classification", was designed and created by 黑新宏, 董林靖, 朱磊, 方潇颖 and 焦瑞 on 2021-07-19. Abstract: The invention relates to a rail transit specification entity recognition method based on catalog topic classification. It adopts the RoBERTa pre-trained language model together with a Whole Word Masking mechanism, realizes domain-adaptive pre-training by collecting a relatively large corpus of building specification texts, and adds topic classification information to improve the performance of the named entity recognition task. In addition, the trained pre-trained language model is applied to the named entity recognition task, providing important support for constructing a domain knowledge graph. This brings several benefits: the named entity recognition model represents domain text better, improving the recognition of building entities; the text corpus can be enlarged step by step and the finished pre-trained language model extended, so that it adapts to more varied and complex text content; and the model is trained once and used many times, since a language model that has undergone domain-adaptive pre-training can be applied directly to other natural language processing tasks.

1. The rail transit specification entity recognition method based on catalog subject classification is characterized in that the original RoBERTa pre-trained model released by Google is used as the reference model, and domain-adaptive pre-training is realized by collecting large-scale rail transit specification texts. A Whole Word Masking mechanism is added in combination with a rail transit specification domain dictionary, so that the RoBERTa pre-trained model acquires rail transit domain knowledge; then topic classification training is performed on the pre-trained model carrying the domain knowledge, and, based on the catalog data contained in every national standard, each specification text is topic-classified by the chapter name or section name in the catalog; the generated pre-trained model is then applied to the named entity recognition task, the model file is input into the mainstream BiLSTM-CRF NER model for entity recognition training, and a CAT-RailRoBERTa-BiLSTM-CRF model is proposed; finally, the test set data are input into the trained model, and the model effect is judged according to the evaluation indexes; the trained entity recognition model is deployed as a server to test the model effect, prediction data are input into the model to output specification entities and entity categories, and the usability of the model is judged from the recognition effect.

2. The rail transit specification entity recognition method based on catalog subject classification as claimed in claim 1, wherein the experimental data are derived from the 'Subway Design Specification' in the national building standard library, and the domain-adaptive pre-training dataset adopts a large amount of corpora such as the rail transit specifications and building-domain information specifications formulated by the state.

3. The track traffic regulation entity identification method based on the catalog subject classification as claimed in claim 1, which is characterized by comprising the following steps:

step 1, acquiring a track traffic standard experiment corpus;

the experimental corpus is derived from the 'Subway Design Specification (with clause explanations)' GB 50157-2013 in the national building standard library, and the specification is crawled with web crawler technology for entity recognition research;

step 2, cleaning the data of the acquired rail transit standard corpus;

removing dirty data includes deleting duplicate information, correcting existing errors, checking data consistency, and processing invalid and missing values;

step 3, performing text analysis on the cleaned data;

combining the building information model classification and coding standard with term annotation and terminology standards, experts define the entity categories for the subway design specification problem;

step 4, manually marking the data set;

selecting 1650 specification clauses from the specification corpus for data labeling; the entities contained in each clause are labeled manually according to the expert-defined entity categories and professional terms, i.e. entity boundaries and entity categories are marked; statistics are then collected on the labeled entities;

step 5, dividing a data set;

the experimental data are divided into datasets by subway design specification clause information, with a training set, validation set and test set ratio of about 7:2:1;

step 6, constructing an experimental data set;

constructing experimental data from the specification corpus labeled with entities, and generating a rail transit dataset for the named entity recognition task; the BIO labeling scheme is adopted, and the experimental data file contains only two columns of information, the entities and the labels corresponding to the entities;

step 7, constructing a field self-adaptive pre-training data set;

acquiring text data associated with building design specifications through various channels and, after simple cleaning that removes special symbols such as line feeds, tabs and HTML (hypertext markup language) tags, generating JSON data in a uniform format; the dataset contains the 'Subway Design Specification' corpus and also collects corpora from other building fields, 811,120 specification texts in total;

step 8, constructing a self-adaptive pre-training language model in the rail transit field;

inputting the domain-adaptive pre-training dataset obtained in step 7 into the RoBERTa-base pre-trained model released by Google, adding a term dictionary of the subway design specification, and generating a Chinese rail-transit-domain pre-trained language model;

step 9, constructing a theme classification data set;

constructing a topic classification dataset from the unlabeled specification corpus, and generating a rail transit dataset for the topic classification task; the method first uses section names to label the topic of each specification clause;

step 10, constructing a topic classification model, and generating a CAT-RailRoBERTa pre-trained model by taking the RoBERTa_800k pre-trained language model generated in step 8 and the topic classification dataset constructed in step 9 as the input of a text classification model;

step 11, constructing an entity recognition model, and taking the pre-trained language model file generated in step 10 and the training set as the input of the entity recognition model;

and step 12, deploying the trained entity recognition model as a server to test the model effect, inputting the test dataset into the model, recognizing entity boundaries and entity category labels of the test data, and finally realizing automatic recognition of named entities in rail transit specification text.

4. The rail transit specification entity recognition method based on the catalog subject classification as claimed in claim 2, wherein in step 8 a rail-transit-domain adaptive pre-trained language model is constructed; the domain-adaptive pre-training dataset obtained in step 7 is input into the RoBERTa-base pre-trained model released by Google, a term dictionary of the subway design specification is added, and a Chinese rail-transit-domain pre-trained language model is generated;

step 8.1, the invention adopts the Whole Word Masking mechanism: if some sub-words of a complete word are masked, the other sub-words belonging to the same word are masked as well;

step 8.2, the manually labeled entities are extracted to form an entity dictionary; when the jieba word segmentation tool is called, the entity dictionary is loaded to segment the input specification text, and each selected input token is replaced with the mask token with 80% probability, kept unchanged with 10% probability, and replaced with a random token with 10% probability; this mechanism is introduced into the word segmentation function of the RoBERTa model so that a rail transit specification entity keeps its complete semantics when the masking mechanism makes predictions; taking the clause 'the noise peak value of the platform door should not exceed 70 decibels' as an example, after the term dictionary is added the pre-trained language model can represent the two entities 'platform door' and 'decibel' more correctly;

step 8.3, inputting 800K rail-transit-domain pre-training texts and the subway design specification entity dictionary into the model, and setting the number of training iterations to 200 to obtain the rail-transit-domain pre-trained model RoBERTa_800k;

the BERT model combines context information in all layers; it uses a multi-layer bidirectional Transformer as the encoder module to pre-train deep bidirectional representations; BERT-Base contains 12 Transformer layers, the hidden state of each layer has dimension 768, 12-head multi-head attention is used, and the total number of parameters is about 110M;

each encoder of the Transformer first passes the input sentence through a multi-head attention layer; the multi-head attention layer helps the encoder attend to other words in the sentence when encoding each word, and the output is then fed into a feed-forward neural network; the feed-forward network corresponding to the word at each position is identical but does not share parameters; an Add & Norm layer is also included above the Multi-Head Attention, where Add denotes the residual connection used to prevent network degradation, and Norm denotes Layer Normalization, which normalizes the activation values of each layer;

the most critical part of the Transformer is the self-attention calculation; in the NER task, the attention mechanism can be used to find the relatively important characters or words in an input sentence, and the weight of each character or word in the sentence is calculated with a hidden layer and the softmax function, so that the model pays particular attention to the key information and learns it fully; because the input and output of the Transformer are actually the same sequence, the word at each position carries global semantic information, which helps establish long-range dependencies; the self-attention mechanism can generate weights for different connections and thus process variable-length information sequences; with X = [x_1, x_2, …, x_n] representing n input items, the query vector sequence Q, the key vector sequence K and the value vector sequence V are obtained through the following linear transformations, calculated as shown in formulas 1 to 3;

Q = W^Q X        (Equation 1)

K = W^K X        (Equation 2)

V = W^V X        (Equation 3)

After the matrices Q, K and V are obtained, the output of Self-Attention can be calculated by formula 4:

Attention(Q, K, V) = softmax(QK^T / √d_k) V        (Equation 4)

where d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, and K^T is the transpose of the K matrix;

the Transformer also builds a multi-head attention mechanism on top of the self-attention mechanism, where h in the network structure indicates that there are h different self-attention heads; each group of Q/K/V is different and is used to expand the 'representation subspaces' of the attention layer, yielding several different weight matrices; each weight matrix projects the input vectors into a different representation subspace, and different heads can learn the semantics of different representation subspaces at different positions; the feed-forward layer does not accept multiple matrix inputs, so a scaled dot-product operation (Scaled Dot-Product Attention) is applied after the weight matrices are concatenated, guaranteeing the input dimension required by the feed-forward layer and keeping the input and output dimensions of the stacked encoders consistent; the words in a sentence are computed in parallel, and the position information of the words, i.e. the order information of the sentence, is not considered, so the word embedding of the input part is formed by concatenating the word vector and the position encoding of each word and is then passed to a linear activation function layer; the specific calculation is shown in formulas 5 and 6;

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O        (Equation 5)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)        (Equation 6)

where W^O is a linear mapping matrix; finally, the Transformer introduces Position Encoding (PE), i.e. the position information of each word is added to its word vector; the specific calculation is shown in formulas 7 and 8;

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))        (Equation 7)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))        (Equation 8)

in equations 7 and 8, pos represents the position of the word and i represents the dimension index; 2i denotes the even positions and 2i+1 the odd positions, pos ∈ (1, 2, …, N), where N is the length of the input sequence, i ∈ (0, 1, …, d_model/2), and d_model is the word embedding dimension.

5. The track traffic regulation entity identification method based on the catalog subject classification as claimed in claim 2, wherein the specific process of the step 10 is as follows:

step 10, constructing a topic classification model, and generating a CAT-RailRoBERTa pre-trained model by taking the RoBERTa_800k pre-trained language model generated in step 8 and the topic classification dataset constructed in step 9 as the input of a text classification model;

step 10.1, the text classification task adopts a BERT-CNN model, and the BERT part imports the model file of the domain-adaptive RoBERTa_800k pre-trained model trained in step 8; the text representation vectors output by the BERT layer are input into a convolutional neural network, which helps the model extract more feature information, such as local relative position, and enhances the robustness and extensibility of the model;

in the BERT-CNN text classification model, suppose the output matrix of the BERT layer is R = {V_1, V_2, …, V_n}, the length of the convolution kernel is l, and the sliding stride is set to 1; then R can be divided into {V_{1:l}, V_{2:l+1}, …, V_{n-l+1:n}}, where V_{i:j} denotes the concatenation of the vectors V_i to V_j; let the result of the convolution operation be P = {p_1, p_2, …, p_n}, where p_i is calculated as shown in equation 9;

p_i = W^T V_{i:i+l-1} + b        (Equation 9)

where W is the parameter of the convolution kernel, updated by training the model, and b is the bias term; in addition, max pooling is used to reduce the dimensionality of the matrix, i.e. the largest element is selected within each pooling window;

step 10.2, inputting the topic classification dataset constructed in step 9 into the BERT-CNN model, and generating a CAT-RailRoBERTa pre-trained model carrying text classification information.

6. The track traffic regulation entity identification method based on the catalog subject classification as claimed in claim 2, wherein the specific process of the step 11 is as follows:

step 11, constructing an entity recognition model, and taking the pre-trained language model file generated in step 10 and the training set as the input of the entity recognition model;

step 11.1, inputting the experimental dataset constructed in step 6 into the CAT-RailRoBERTa model that has been trained on text classification, converting each specification clause into a vector representation, and obtaining the word vectors, segment vectors and position vectors of the sentence; for the text vectorization of the CAT-RailRoBERTa model, taking the clause 'the distance between outdoor fire hydrants in vehicle bases should not be greater than 120 m' as an example, Token Embeddings are the word vectors, whose first token is the CLS flag that can be used for classification tasks; Segment Embeddings are used to distinguish two sentences and can serve classification tasks that take two sentences as input; Position Embeddings represent the positions; the three kinds of embeddings are obtained through training; the word vectors, segment vectors and position vectors are taken as the input of the deep learning model, which finally outputs text feature vectors fused with full-text semantic information;

and step 11.2, inputting the text feature vectors into the BiLSTM-CRF model to generate the CAT-RailRoBERTa-BiLSTM-CRF entity recognition model.

Technical Field

The invention belongs to the field of information extraction of natural language processing, and relates to a track traffic standard entity identification method based on catalog topic classification.

Background

From 2013 to 2020, the length of urban rail transit operating lines in China increased year by year. By the end of 2020, 40 cities in China had opened urban rail transit operations, with operating lines totalling 7978.19 km. Rail transit construction is complex engineering, and many specifications are involved in the planning, design, review and construction processes. The engineering design specifications issued by the housing and construction authorities generally exist as text, and paper specifications cannot be processed directly; they first need to be stored digitally. However, the data types in the specifications are very complex, which places higher demands on processing accuracy. In recent years, processing natural language with deep-learning-based models has become mainstream; in particular, since 2018, pre-trained language models represented by BERT have been able to understand natural language text well and have achieved good results in downstream tasks such as information extraction, text classification and intelligent question answering. Meanwhile, some researchers have studied data enhancement for natural language in vertical domains so as to understand domain knowledge better.

The core task of the present invention is named entity recognition. Although this task has made good progress in the open domain and on public datasets, named entity recognition research still faces many challenges in specific fields, especially rail transit engineering design specifications, owing to the lack of the necessary knowledge base.

(1) Specification data are very complex

Specification content generally contains many types of data formats, such as text, pictures, tables and formulas. Nesting of multiple data types often occurs in the acquired data, the same type of data is formatted inconsistently in different places, and the hierarchical structure of the data is not uniform.

(2) Long, difficult sentences spanning many disciplines and domains are hard to understand

Rail transit engineering design involves dozens of disciplines and hundreds of work categories, and the national specifications are written by professionals, so the demand for professional knowledge is extremely high; the text contains a large number of technical terms and consists mainly of complex sentence patterns, which brings many difficulties to further structured processing.

(3) The contradiction between low resources and high quality requirements

Low resources mean there is no complete term dictionary, no explicit entity classification criteria and no public dataset. Downstream applications place extremely high demands on the quality of the knowledge graph; taking automatic compliance checking as an example, the quality and completeness of the knowledge graph directly determine the accuracy and completeness of the checking results.

With the deepening application of deep learning in natural language processing tasks, the number of parameters of pre-trained models is also growing rapidly, and larger datasets are required to train the model parameters adequately and prevent overfitting. However, for most NLP tasks, constructing large-scale annotated data is a huge challenge because the annotation cost is very high, and the annotation difficulty rises sharply for semantics-related tasks in vertical domains. In contrast, it is relatively easy to construct large-scale unlabeled corpora, and pre-trained language models (PTMs) can use these unlabeled data to extract a large amount of semantic information and apply these semantic representations to other tasks. Recent studies have shown that PTMs bring significant improvements on many NLP tasks. However, it is difficult to adapt an open-source pre-trained language model directly to downstream tasks, because different tasks generally require different language models: a text generation task usually needs task-specific pre-training of the encoder and decoder, while a text matching task needs a pre-training task designed for sentence pairs. If the data distributions of the model and the target domain are not taken into account, this task variability may even produce counterproductive results.

This project is oriented to the rail transit field. Domain-adaptive pre-training is performed on unlabeled text in the field, so that a large amount of semantic information and domain-related knowledge can be extracted from the unlabeled data and these semantic representations can be applied to other tasks; the topic of each specification text is classified by the chapter name or section name in the specification catalog, and the topic information is added. The rail transit specifications are then processed and stored in an information system, and the model extracts unstructured data information by learning from structured data, so that useful information can be analyzed and extracted automatically. This research can guarantee engineering design quality while shortening the review time of an engineering project; storing the data in a specific knowledge graph structure provides the most basic data support for intelligent applications, improves the speed of search engines and the accuracy of intelligent question-answering systems, greatly simplifies the complexity of this work, and raises the level of intelligence in the rail transit field.

Disclosure of Invention

The invention aims to provide a rail transit specification entity recognition method based on catalog topic classification, and solves the problem of low entity recognition accuracy caused by the fact that open-source pre-trained language models are not adapted to vertical-domain text.

The technical scheme adopted by the invention is a rail transit specification entity recognition method based on catalog subject classification: first, the original RoBERTa pre-trained model released by Google is used as the reference model, and domain-adaptive pre-training is realized by collecting larger-scale rail transit specification texts. A Whole Word Masking mechanism is added in combination with a rail transit specification domain dictionary, so that the RoBERTa pre-trained model acquires rail transit domain knowledge; then topic classification training is performed on the pre-trained model carrying the domain knowledge, and, based on the catalog data contained in every national standard, each specification text is topic-classified by the chapter name or section name in the catalog; the generated pre-trained model is then applied to the named entity recognition task, the model file is input into the mainstream BiLSTM-CRF NER model for entity recognition training, and a CAT-RailRoBERTa-BiLSTM-CRF model is proposed; finally, the test set data are input into the trained model, and the model effect is judged according to the evaluation indexes; the trained entity recognition model is deployed as a server to test the model effect, prediction data are input into the model to output specification entities and entity categories, and the usability of the model is judged from the recognition effect.

The experimental data come from the 'Subway Design Specification' in the national building standard library, and the domain-adaptive pre-training dataset adopts a large amount of corpora such as the rail transit specifications and building-domain information specifications formulated by the state.

The method specifically comprises the following steps:

Step 1, acquiring a rail transit specification experimental corpus;

the experimental corpus is derived from the 'Subway Design Specification (with clause explanations)' GB 50157-2013 in the national building standard library, and the specification is crawled with web crawler technology for entity recognition research.

Step 2, cleaning the data of the acquired rail transit specification corpus;

removing dirty data includes deleting duplicate information, correcting existing errors, checking data for consistency, and processing invalid and missing values.

Step 3, performing text analysis on the cleaned data;

combining the building information model classification and coding standard with term annotation and terminology standards, experts define the entity categories for the subway design specification problem.

Step 4, manually labeling the dataset;

1650 specification clauses are selected from the specification corpus for data labeling. The entities contained in each clause are labeled manually according to the expert-defined entity categories and professional terms, i.e. entity boundaries and entity categories are marked. Statistics on the labeled entities show that the entity length distribution of the labeled dataset is given in FIG. 3 and the frequency distribution of entities of each length in FIG. 4; the longest entity in the dataset contains 45 characters, the shortest contains 2 characters, the average length is 5.33, and entity lengths are mainly concentrated at 5, 3, 7 and 4. These statistics are important for setting hyper-parameters during model training and for analyzing the prediction results.

Step 5, dividing the dataset;

the experimental data are divided into datasets by subway design specification clause information, with a training set, validation set and test set ratio of about 7:2:1.

Step 6, constructing the experimental dataset;

experimental data are constructed from the specification corpus labeled with entities, and a rail transit dataset for the named entity recognition task is generated; the BIO labeling scheme is adopted, and the experimental data file contains only two columns of information, the entities and the labels corresponding to the entities.

Step 7, constructing a domain-adaptive pre-training dataset;

text data associated with building design specifications are acquired through various channels and, after simple cleaning that removes special symbols such as line feeds, tabs and HTML (hypertext markup language) tags, JSON data in a uniform format are generated; the dataset contains the 'Subway Design Specification' corpus and also collects corpora from other building fields, 811,120 specification texts in total.

Step 8, constructing a rail-transit-domain adaptive pre-trained language model;

the domain-adaptive pre-training dataset obtained in step 7 is input into the RoBERTa-base pre-trained model released by Google, a term dictionary of the subway design specification is added, and a Chinese rail-transit-domain pre-trained language model is generated.

Step 9, constructing a topic classification dataset;

a topic classification dataset is constructed from the unlabeled specification corpus to generate a rail transit dataset for the topic classification task. The method first uses section names to label the topic of each specification clause.

Step 10, constructing a topic classification model, and generating a CAT-RailRoBERTa pre-trained model by taking the RoBERTa_800k pre-trained language model generated in step 8 and the topic classification dataset constructed in step 9 as the input of a text classification model.

Step 11, constructing an entity recognition model, and taking the pre-trained language model file generated in step 10 and the training set as the input of the entity recognition model.

Step 12, deploying the trained entity recognition model as a server to test the model effect, inputting the test dataset into the model, recognizing entity boundaries and entity category labels of the test data, and finally realizing automatic recognition of named entities in rail transit specification text.

Step 8, constructing a rail-transit-domain adaptive pre-trained language model: the domain-adaptive pre-training dataset obtained in step 7 is input into the RoBERTa-base pre-trained model released by Google, a term dictionary of the subway design specification is added, and a Chinese rail-transit-domain pre-trained language model is generated.

Step 8.1, the invention adopts the Whole Word Masking mechanism: if some sub-words of a complete word are masked, the other sub-words belonging to the same word are masked as well.

Step 8.2, the manually labeled entities are extracted to form an entity dictionary; when the jieba word segmentation tool is called, the entity dictionary is loaded to segment the input specification text, and each selected input token is replaced with the mask token with 80% probability, kept unchanged with 10% probability, and replaced with a random token with 10% probability. This mechanism is introduced into the word segmentation function of the RoBERTa model so that a rail transit specification entity keeps its complete semantics when the masking mechanism makes predictions; the model structure is shown in FIG. 9. Taking the clause 'the noise peak value of the platform door should not exceed 70 decibels' as an example, after the term dictionary is added the pre-trained language model can represent the two entities 'platform door' and 'decibel' more correctly.

Step 8.3, 800K rail-transit-domain pre-training texts and the subway design specification entity dictionary are input into the model, and the number of training iterations is set to 200 to obtain the rail-transit-domain pre-trained model RoBERTa_800k.

The BERT model combines context information in all layers. It uses a multi-layer bidirectional Transformer as the encoder module to pre-train deep bidirectional representations; BERT-Base contains 12 Transformer layers, the hidden state of each layer has dimension 768, 12-head multi-head attention is used, and the total number of parameters is about 110M.

Each Encoder of the Transformer first passes the input sentence through a Multi-Head Attention layer; the multi-head attention layer helps the encoder attend to other words in the sentence when encoding each word, and the output is then fed into a feed-forward neural network; the feed-forward network corresponding to the word at each position is identical but does not share parameters. An Add & Norm layer is also included above the Multi-Head Attention, where Add denotes the Residual Connection used to prevent network degradation, and Norm denotes Layer Normalization, which normalizes the activation values of each layer.

The most critical part of the Transformer is the Self-Attention calculation; in the NER task, the attention mechanism can be used to find the relatively important characters or words in an input sentence, and the weight of each character or word in the sentence is calculated with a hidden layer and the softmax function, so that the model pays particular attention to the key information and learns it fully. Because the input and output of the Transformer are actually the same sequence, the word at each position carries global semantic information, which helps establish long-range dependencies. The self-attention mechanism can generate weights for different connections and thus handle longer information sequences. With X = [x_1, x_2, …, x_n] representing n input items, the query vector sequence Q, the key vector sequence K and the value vector sequence V can be obtained through the following linear transformations, calculated as shown in formulas 1 to 3.

Q = W^Q X        (Equation 1)

K = W^K X        (Equation 2)

V = W^V X        (Equation 3)

After the matrices Q, K and V are obtained, the output of Self-Attention can be calculated by formula 4:

Attention(Q, K, V) = softmax(QK^T / √d_k) V        (Equation 4)

where d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, and K^T is the transpose of the K matrix.

The Transformer also builds a multi-head attention mechanism on top of the self-attention mechanism, where h in the network structure indicates that there are h different self-attention heads; each group of Q/K/V is different and is used to expand the 'representation subspaces' of the attention layer, yielding several different weight matrices; each weight matrix projects the input vectors into a different representation subspace, and different heads can learn the semantics of different representation subspaces at different positions; the feed-forward layer does not accept multiple matrix inputs, so a scaled dot-product operation (Scaled Dot-Product Attention) is applied after the weight matrices are concatenated, guaranteeing the input dimension required by the feed-forward layer and keeping the input and output dimensions of the stacked encoders consistent. The words in a sentence are computed in parallel, and the position information of the words, i.e. the order information of the sentence, is not considered, so the word embedding of the input part is formed by concatenating (concat) the word vector and the position encoding of each word and is then passed to a linear activation function layer (linear). The specific calculation is shown in formulas 5 and 6.

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O        (Equation 5)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)        (Equation 6)

where W^O is a linear mapping matrix. Finally, the Transformer introduces Position Encoding (PE), i.e. the position information of each word is added to its word vector; the specific calculation is shown in formulas 7 and 8.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))        (Equation 7)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))        (Equation 8)

In equations 7 and 8, pos represents the position of the word and i represents the dimension index; 2i denotes the even positions and 2i+1 the odd positions, pos ∈ (1, 2, …, N), where N is the length of the input sequence, i ∈ (0, 1, …, d_model/2), and d_model is the word embedding dimension.

The specific process of step 10 is as follows:

Step 10, constructing a topic classification model, and generating a CAT-RailRoBERTa pre-trained model by taking the RoBERTa_800k pre-trained language model generated in step 8 and the topic classification dataset constructed in step 9 as the input of a text classification model.

Step 10.1, the text classification task adopts a BERT-CNN model, whose structure is shown in FIG. 8; the BERT part imports the model file of the domain-adaptive RoBERTa_800k pre-trained model trained in step 8. The text representation vectors output by the BERT layer are input into a convolutional neural network, which helps the model extract more feature information, such as local relative position, and enhances the robustness and extensibility of the model.

In the BERT-CNN text classification model, suppose the output matrix of the BERT layer is R = {V_1, V_2, …, V_n}, the length of the convolution kernel is l, and the sliding stride is set to 1; then R can be divided into {V_{1:l}, V_{2:l+1}, …, V_{n-l+1:n}}, where V_{i:j} denotes the concatenation of the vectors V_i to V_j. Let the result of the convolution operation be P = {p_1, p_2, …, p_n}, where p_i is calculated as shown in equation 9.

p_i = W^T V_{i:i+l-1} + b        (Equation 9)

where W is the parameter of the convolution kernel, updated by training the model, and b is the bias term. In addition, max pooling is used to reduce the dimensionality of the matrix, i.e. the largest element is selected within each pooling window.

Step 10.2, the topic classification dataset constructed in step 9 is input into the BERT-CNN model, and a CAT-RailRoBERTa pre-trained model carrying text classification information is generated.

The specific process of step 11 is as follows:

Step 11, constructing an entity recognition model, and taking the pre-trained language model file generated in step 10 and the training set as the input of the entity recognition model.

Step 11.1, the experimental dataset constructed in step 6 is input into the CAT-RailRoBERTa model that has been trained on text classification, and each specification clause is converted into a vector representation, obtaining the word vectors, segment vectors and position vectors of the sentence. The text vectorization of the CAT-RailRoBERTa model is shown in FIG. 10; taking the clause 'the distance between outdoor fire hydrants in vehicle bases should not be greater than 120 m' as an example, Token Embeddings are the word vectors, whose first token is the CLS flag that can be used for classification tasks; Segment Embeddings are used to distinguish two sentences and can serve classification tasks that take two sentences as input; Position Embeddings represent the positions; the three kinds of embeddings are obtained through training. The word vectors, segment vectors and position vectors are taken as the input of the deep learning model, which finally outputs text feature vectors fused with full-text semantic information.

Step 11.2, the text feature vectors are input into the BiLSTM-CRF model to generate the CAT-RailRoBERTa-BiLSTM-CRF entity recognition model. The beneficial effects of the invention are as follows:

the method is based on a RoBERTA pre-training language model and a Whole Word Masking (Whole Word Masking) mechanism, achieves field self-adaptive pre-training by collecting large-scale building standard texts, and improves the performance of a named entity recognition task by adding topic classification information. In addition, the pre-training language model obtained by training is applied to the named entity recognition task, so that important support is provided for constructing the domain knowledge graph, and a lot of benefits can be brought: first, the named entity recognition model can better represent the domain text, and the recognition performance of the building entity is improved. Secondly, the text corpus can be increased step by step, and the completed pre-trained language model is expanded, so that the pre-trained language model is adapted to more various and complicated text contents. Thirdly, the language model which is trained once and used for many times and is subjected to the field adaptive pre-training can be directly applied to other natural language processing tasks, such as text retrieval, text classification, intelligent question answering and the like.

Drawings

FIG. 1 is a general framework diagram of the track traffic specification entity identification method based on catalog topic classification according to the present invention;

FIG. 2 is a general flowchart of the track traffic specification entity identification method based on directory topic classification according to the present invention;

FIG. 3 is a graph of the length distribution of each entity class and the frequency of occurrence of each class for the experimental data set of the present invention;

FIG. 4 is a graph of frequency distribution of occurrences of entities of various lengths of an experimental data set in accordance with the present invention;

FIG. 5 is an annotation case based on the BIO annotation system of the present invention;

FIG. 6 is a schematic diagram of the Transformer encoder module according to the present invention;

FIG. 7 is a schematic view of a model of the attention mechanism of the present invention;

FIG. 8 is a schematic structural diagram of the BERT-CNN model in the present invention;

FIG. 9 is a schematic diagram of the mask process structure of the RoBERTa-WWM model of the present invention;

FIG. 10 is a schematic diagram of the text vectorization representation of the RoBERTa_800k model in the present invention.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The invention aims to provide a rail transit specification entity recognition method based on catalog subject classification; the overall framework is shown in FIG. 1. A rail-transit-domain adaptive entity recognition model, CAT-RailRoBERTa-BiLSTM-CRF, is proposed, and its structure is shown in FIG. 2. RoBERTa is used as the base model for the domain-adaptive pre-training. RoBERTa (Robustly Optimized BERT Approach) adopts the original BERT architecture with targeted modifications and can be understood as a fully trained BERT: it uses a larger batch size and a larger pre-training corpus, removes the NSP (Next Sentence Prediction) task, adopts dynamic masking instead of static masking, and uses Byte-Pair Encoding (BPE) for text encoding; the model structure is shown in FIG. 10. RoBERTa contains only about 110 million parameters, far fewer than today's pre-trained language models with billions of parameters, and is a suitable benchmark model when computing power cannot be increased rapidly. Then topic classification training is performed on the pre-trained model carrying domain knowledge; starting from the characteristics of specification text, and based on the catalog data contained in every national standard, each specification text is topic-classified by the chapter name or section name in the catalog. Finally, the generated pre-trained language model is input into the BiLSTM-CRF model for entity recognition training.

Referring to FIG. 1, the rail transit specification entity recognition method based on catalog subject classification is implemented according to the following steps:

Step 1, acquiring a rail transit specification experimental corpus. The experimental corpus is derived from the 'Subway Design Specification (with clause explanations)' GB 50157-2013 in the national building standard library, and the specification is crawled with web crawler technology for entity recognition research.

Step 2, performing data cleaning on the acquired rail transit specification corpus. Removing dirty data includes deleting duplicate information, correcting existing errors, checking data for consistency, and processing invalid and missing values.

Step 3, performing text analysis on the cleaned data. Combining the building information model classification and coding standard with term annotation and terminology standards, experts define the entity categories for the subway design specification problem. The predefined entity types are shown in Table 1.

Table 1 Predefined entity types

Step 4, manually labeling the dataset. 1650 specification clauses are selected from the specification corpus for data labeling. The entities contained in each clause are labeled manually according to the expert-defined entity categories and professional terms, i.e. entity boundaries and entity categories are marked. Statistics on the labeled entities show that the entity length distribution of the labeled dataset is given in FIG. 3 and the frequency distribution of entities of each length in FIG. 4; the longest entity in the dataset contains 45 characters, the shortest contains 2 characters, the average length is 5.33, and entity lengths are mainly concentrated at 5, 3, 7 and 4. These statistics are important for setting hyper-parameters during model training and for analyzing the prediction results.

Step 5, dividing the dataset. The experimental data are divided into datasets by subway design specification clause information, with a training set, validation set and test set ratio of about 7:2:1.

Step 6, constructing the experimental dataset. Experimental data are constructed from the specification corpus labeled with entities, and a rail transit dataset for the named entity recognition task is generated; the BIO labeling scheme is adopted, and the experimental data file contains only two columns of information, the entities and the labels corresponding to the entities.

Step 6.1, a JSON file is generated from the data labeled with the annotation tool, and the labeled entity categories and the start and end position information of each entity are extracted from the JSON file;

Step 6.2, sequence labeling is performed on the original specification text by combining the BIO labeling strategy with the position information, where B (Begin), I (Intermediate) and O (Other) are abbreviations: Begin marks the character at the start of a recognized object, Intermediate marks its middle characters, and Other marks non-entity characters; the labeling case based on the BIO labeling system is shown in FIG. 5;

Step 6.3, after sequence labeling, the data are processed into the format required by the deep learning model: each character in the data file occupies one line, and each line contains two columns, the entity character and its entity label, thereby generating the rail transit dataset.
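As an illustration of steps 6.1 to 6.3, the following minimal Python sketch converts annotation records into the two-column BIO format; the JSON field names (text, entities, start, end, label) are assumptions about the annotation-tool output, not the actual file layout used in the experiments.

import json

def to_bio(record):
    # Convert one annotated clause into (character, BIO tag) lines.
    text = record["text"]
    tags = ["O"] * len(text)
    for ent in record["entities"]:
        start, end, label = ent["start"], ent["end"], ent["label"]
        tags[start] = f"B-{label}"            # entity boundary
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"            # entity interior
    return [f"{ch}\t{tag}" for ch, tag in zip(text, tags)]

def build_dataset(json_path, out_path):
    # One annotated clause per input line; a blank line separates clauses in the output.
    with open(json_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write("\n".join(to_bio(json.loads(line))) + "\n\n")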

Step 7, constructing a domain-adaptive pre-training dataset. Text data associated with building design specifications are collected through various channels, and JSON data in a uniform format are generated after simple cleaning (removing special symbols such as line feeds, tabs and HTML tags). The dataset contains the 'Subway Design Specification' corpus and also collects corpora from other building fields, 811,120 specification texts in total.
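A minimal cleaning sketch for step 7 is given below; it assumes the raw corpus is already available as a list of text fragments, and the regular expression only illustrates the kinds of symbols removed (HTML tags, line feeds, tabs).

import json
import re

def clean_text(raw):
    text = re.sub(r"<[^>]+>", "", raw)                                  # strip HTML tags
    text = text.replace("\r", "").replace("\n", "").replace("\t", "")   # line feeds and tabs
    return text.strip()

def build_pretrain_corpus(raw_texts, out_path):
    # Write one JSON object per line in a uniform format.
    with open(out_path, "w", encoding="utf-8") as out:
        for raw in raw_texts:
            cleaned = clean_text(raw)
            if cleaned:                                                  # drop fragments that become empty
                out.write(json.dumps({"text": cleaned}, ensure_ascii=False) + "\n")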

Step 8, constructing a rail-transit-domain adaptive pre-trained language model. The domain-adaptive pre-training dataset obtained in step 7 is input into the RoBERTa-base pre-trained model released by Google, a term dictionary of the subway design specification is added, and a Chinese rail-transit-domain pre-trained language model is generated.

Step 8.1, the invention adopts the Whole Word Masking mechanism: if some sub-words of a complete word are masked, the other sub-words belonging to the same word are masked as well, which is more in line with Chinese word formation, so that the model learns Chinese expression patterns better.

Step 8.2, the manually labeled entities are extracted to form an entity dictionary; when the jieba word segmentation tool is called, the entity dictionary is loaded to segment the input specification text, and each selected input token is replaced with the mask token with 80% probability, kept unchanged with 10% probability, and replaced with a random token with 10% probability. This mechanism is introduced into the word segmentation function of the RoBERTa model so that a rail transit specification entity keeps its complete semantics when the masking mechanism makes predictions; the model structure is shown in FIG. 9. Taking the clause 'the noise peak value of the platform door should not exceed 70 decibels' as an example, after the term dictionary is added the pre-trained language model can represent the two entities 'platform door' and 'decibel' more correctly.
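The following sketch illustrates the whole-word masking strategy of step 8.2, assuming jieba is loaded with a user dictionary built from the labeled entities; the dictionary file name, the 15% selection rate and the placeholder mask token are illustrative assumptions rather than the exact RoBERTa configuration.

import random
import jieba

jieba.load_userdict("metro_entity_dict.txt")   # assumed file: one entity term per line

def whole_word_mask(sentence, vocab, mask_token="[MASK]", select_prob=0.15):
    words = jieba.lcut(sentence)               # segmentation guided by the entity dictionary
    output = []
    for word in words:
        chars = list(word)
        if random.random() < select_prob:      # mask the whole word, never a lone character
            r = random.random()
            if r < 0.8:                        # 80%: replace every character with the mask token
                output.extend([mask_token] * len(chars))
            elif r < 0.9:                      # 10%: keep the word unchanged
                output.extend(chars)
            else:                              # 10%: replace with random vocabulary tokens
                output.extend(random.choice(vocab) for _ in chars)
        else:
            output.extend(chars)
    return output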

Step 8.3, 800K rail-transit-domain pre-training texts and the subway design specification entity dictionary are input into the model, and the number of training iterations is set to 200 to obtain the rail-transit-domain pre-trained model RoBERTa_800k. The pseudo code is as follows:
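A hedged sketch of this domain-adaptive masked-language-model training, using the Hugging Face transformers library, is shown below; the checkpoint name, file path and hyper-parameters are illustrative assumptions, not the original experimental settings.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)

checkpoint = "hfl/chinese-roberta-wwm-ext"                 # assumed Chinese RoBERTa-wwm base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

corpus = load_dataset("json", data_files={"train": "rail_corpus.jsonl"})["train"]
tokenized = corpus.map(lambda x: tokenizer(x["text"], truncation=True, max_length=512),
                       remove_columns=["text"])

collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="roberta_800k",
                         num_train_epochs=200,             # 200 training iterations per the text above
                         per_device_train_batch_size=32,
                         save_strategy="epoch")

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()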

the BERT (bidirectional Encoder retrieval from transformations) model is implemented by combining context information in all layers. It uses multi-layer bi-directional transform as the encoder module to pre-train the deep bi-directional representation, BERT-Base contains 12 layers of transform structure, the dimension of each layer of hidden state is 768, the total parameter number is about 110M using 12-head multi-head attention.

Each Encoder (Encoder) of the Transformer first passes the input sentence through a Multi-Head Attention layer. As shown in fig. 6, the multi-head attention layer helps the encoder to focus on other words in the sentence when encoding each word, and then pass the input into a feed-forward neural network, where the corresponding feed-forward neural network for each word at each position is identical and does not share parameters. An Add & Norm Layer is also included above the Multi-Head attachment, wherein Add represents Residual Connection (Residual Connection) for preventing network degradation, and Norm represents Layer Normalization for normalizing the activation value of each Layer.

The most critical part of the transform is Self-attention (Self-attention) calculation, and in the NER task, an attention mechanism can be used for finding relatively important words or words in an input sentence, and the weight of each word or word in the sentence is calculated by using a hidden layer and a softmax function, so that the model is particularly concerned about key information and can be fully learned. Because the input sentence and the output sentence are actually the same sequence when the Transformer calculates, the words at each position have global semantic information, which is beneficial to establishing a long dependency relationship. The weights for different connections can be generated using a self-attention mechanism to handle longer information sequences. With X ═ X1,x2,…,xn]Representing n pieces of input information, a query vector sequence Q, a key vector sequence K and a value vector sequence V can be obtained through the following linear transformation, and the calculation methods are shown in formulas 1 to 3.

Q = W^Q X        (Equation 1)

K = W^K X        (Equation 2)

V = W^V X        (Equation 3)

After the matrices Q, K and V are obtained, the output of Self-Attention can be calculated by formula 4:

Attention(Q, K, V) = softmax(QK^T / √d_k) V        (Equation 4)

where d_k is the number of columns of the Q and K matrices, i.e. the vector dimension, and K^T is the transpose of the K matrix.
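A small NumPy sketch of the scaled dot-product attention in formulas 1 to 4 follows; the matrix shapes and the row-vector convention (X @ W instead of W X) are illustrative choices.

import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    Q = X @ W_q                                   # formula 1
    K = X @ W_k                                   # formula 2
    V = X @ W_v                                   # formula 3
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # formula 4: softmax(QK^T / sqrt(d_k)) V
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # n = 4 inputs of dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, W_q, W_k, W_v).shape)   # (4, 8)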

The Transformer also builds a multi-head attention mechanism on top of the self-attention mechanism; the network structure is shown in FIG. 7, where h indicates that there are h different self-attention heads; each group of Q/K/V is different and is used to expand the 'representation subspaces' of the attention layer, yielding several different weight matrices; each weight matrix projects the input vectors into a different representation subspace, and different heads can learn the semantics of different representation subspaces at different positions; the feed-forward layer does not accept multiple matrix inputs, so a scaled dot-product operation (Scaled Dot-Product Attention) is applied after the weight matrices are concatenated, guaranteeing the input dimension required by the feed-forward layer and keeping the input and output dimensions of the stacked encoders consistent. The words in a sentence are computed in parallel, and the position information of the words, i.e. the order information of the sentence, is not considered, so the word embedding of the input part is formed by concatenating (concat) the word vector and the position encoding of each word and is then passed to a linear activation function layer (linear). The specific calculation is shown in formulas 5 and 6.

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O        (Equation 5)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)        (Equation 6)

where W^O is a linear mapping matrix. Finally, the Transformer introduces Position Encoding (PE), i.e. the position information of each word is added to its word vector; the specific calculation is shown in formulas 7 and 8.

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))        (Equation 7)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))        (Equation 8)

In equations 7 and 8, pos represents the position of the word and i represents the dimension index; 2i denotes the even positions and 2i+1 the odd positions, pos ∈ (1, 2, …, N), where N is the length of the input sequence, i ∈ (0, 1, …, d_model/2), and d_model is the word embedding dimension.
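The sinusoidal position encoding of formulas 7 and 8 can be sketched as follows; N and d_model are illustrative values only.

import numpy as np

def positional_encoding(N, d_model):
    PE = np.zeros((N, d_model))
    pos = np.arange(N)[:, None]                    # word positions
    i = np.arange(d_model // 2)[None, :]           # dimension indices
    angle = pos / np.power(10000, 2 * i / d_model)
    PE[:, 0::2] = np.sin(angle)                    # even dimensions, formula 7
    PE[:, 1::2] = np.cos(angle)                    # odd dimensions, formula 8
    return PE

print(positional_encoding(N=50, d_model=768).shape)   # (50, 768)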

Step 9, constructing a topic classification dataset. A topic classification dataset is constructed from the unlabeled specification corpus to generate a rail transit dataset for the topic classification task; by statistics, the 'Subway Design Specification (with clause explanations)' GB 50157-2013 contains 29 chapters and 150 sections in total. The method first uses section names to label the topic of each clause; for example, if the first section is 'operation mode', the labeled specification text has the following format: '3.3.3 Except in unmanned mode, a subway train should be staffed with at least one driver to drive or monitor train operation. 1'

where '1' indicates the first section, i.e. this clause belongs to the topic category of the first section, 'operation mode'.
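A minimal sketch of this topic-labeling step is shown below; the clause text is the English rendering used above, and the mapping from section numbers to topic ids is an illustrative assumption rather than the full 150-section table.

section_topic = {"3.3": 1}   # assumed: section "3.3" (operation mode) -> topic id 1

clauses = [
    "3.3.3 Except in unmanned mode, a subway train should be staffed with at least "
    "one driver to drive or monitor train operation.",
]

def label_topic(clause):
    section = ".".join(clause.split()[0].split(".")[:2])   # "3.3.3" -> "3.3"
    return f"{clause}\t{section_topic[section]}"

for c in clauses:
    print(label_topic(c))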

Step 10, constructing a topic classification model, and generating a CAT-RailRoBERTa pre-trained model by taking the RoBERTa_800k pre-trained language model generated in step 8 and the topic classification dataset constructed in step 9 as the input of a text classification model.

Step 10.1, the text classification task adopts a BERT-CNN model, whose structure is shown in FIG. 8; the BERT part imports the model file of the domain-adaptive RoBERTa_800k pre-trained model trained in step 8. The text representation vectors output by the BERT layer are input into a convolutional neural network, which helps the model extract more feature information, such as local relative position, and enhances the robustness and extensibility of the model.

In the BERT-CNN text classification model, suppose the output matrix of the BERT layer is R = {V_1, V_2, …, V_n}, the length of the convolution kernel is l, and the sliding stride is set to 1; then R can be divided into {V_{1:l}, V_{2:l+1}, …, V_{n-l+1:n}}, where V_{i:j} denotes the concatenation of the vectors V_i to V_j. Let the result of the convolution operation be P = {p_1, p_2, …, p_n}, where p_i is calculated as shown in equation 9.

p_i = W^T V_{i:i+l-1} + b        (Equation 9)

where W is the parameter of the convolution kernel, updated by training the model, and b is the bias term. In addition, max pooling is used to reduce the dimensionality of the matrix, i.e. the largest element is selected within each pooling window.
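A PyTorch sketch of the convolution and max-pooling head described in step 10.1 and equation 9 follows; the hidden size, kernel length, filter count and topic count are assumptions chosen for illustration.

import torch
import torch.nn as nn

class BertCNNHead(nn.Module):
    def __init__(self, hidden=768, kernel_len=3, filters=256, num_topics=150):
        super().__init__()
        # Conv1d with stride 1 over the token axis realizes p_i = W^T V_{i:i+l-1} + b
        self.conv = nn.Conv1d(hidden, filters, kernel_size=kernel_len, stride=1)
        self.fc = nn.Linear(filters, num_topics)

    def forward(self, bert_output):                # bert_output: (batch, seq_len, hidden) = R
        x = bert_output.transpose(1, 2)            # (batch, hidden, seq_len) for Conv1d
        p = torch.relu(self.conv(x))               # (batch, filters, seq_len - l + 1)
        pooled = p.max(dim=-1).values              # max pooling: largest element per window
        return self.fc(pooled)                     # topic logits

logits = BertCNNHead()(torch.randn(2, 128, 768))
print(logits.shape)                                # torch.Size([2, 150])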

Step 10.2, the topic classification dataset constructed in step 9 is input into the BERT-CNN model, and a CAT-RailRoBERTa pre-trained model carrying text classification information is generated.

Step 11, constructing an entity recognition model, and taking the pre-trained language model file generated in step 10 and the training set as the input of the entity recognition model.

Step 11.1, the experimental dataset constructed in step 6 is input into the CAT-RailRoBERTa model that has been trained on text classification, and each specification clause is converted into a vector representation, obtaining the word vectors, segment vectors and position vectors of the sentence. The text vectorization of the CAT-RailRoBERTa model is shown in FIG. 10; taking the clause 'the distance between outdoor fire hydrants in vehicle bases should not be greater than 120 m' as an example, Token Embeddings are the word vectors, whose first token is the CLS flag that can be used for classification tasks; Segment Embeddings are used to distinguish two sentences and can serve classification tasks that take two sentences as input; Position Embeddings represent the positions; the three kinds of embeddings are obtained through training. The word vectors, segment vectors and position vectors are taken as the input of the deep learning model, which finally outputs text feature vectors fused with full-text semantic information.
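The three embeddings can be inspected with a standard BERT-style tokenizer, as in the sketch below; the checkpoint name and the Chinese rendering of the example clause are assumptions for illustration.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")  # assumed checkpoint
clause = "车辆基地室外消火栓间距不应大于120m"   # assumed original wording of the example clause

enc = tokenizer(clause, return_tensors="pt")
print(enc["input_ids"])        # token ids; position 0 is the [CLS] flag -> token embeddings
print(enc["token_type_ids"])   # sentence id per token -> segment embeddings
# Position embeddings are added inside the model according to each token's index.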

Step 11.2, the text feature vectors are input into the BiLSTM-CRF model to generate the CAT-RailRoBERTa-BiLSTM-CRF entity recognition model, whose structure is shown in FIG. 2. The pseudo code is as follows:
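A hedged PyTorch sketch of the BiLSTM-CRF layer follows; it uses the third-party pytorch-crf package, and the feature dimension, hidden size and tag count are illustrative assumptions rather than the original configuration.

import torch
import torch.nn as nn
from torchcrf import CRF        # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=256, num_tags=9):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim // 2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim, num_tags)     # emission scores for each BIO tag
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, feats, tags=None, mask=None):
        emissions = self.fc(self.lstm(feats)[0])
        if tags is not None:                          # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # inference: best BIO tag sequence

# feats would be the CAT-RailRoBERTa text feature vectors, e.g. shape (batch, seq_len, 768).
model = BiLSTMCRF()
print(model(torch.randn(2, 32, 768))[0][:5])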

and step 12, setting the trained entity recognition model as a server test model effect, inputting the test data set into the model, recognizing entity boundaries and entity class labels of the test data, and finally realizing automatic recognition of named entities in the rail transit specification text.
