Text semantic similarity calculation method

Document No.: 135714    Publication date: 2021-10-22

Reading note: This technology, "Text semantic similarity calculation method" (一种文本语义相似度计算方法), was designed and created by 许晓伟, 张善平, 王晓东, and 曹媛 on 2021-06-11. Its main content is as follows: the invention discloses a text semantic similarity calculation method, which comprises: converting the text into a vector matrix; obtaining a global feature matrix containing text interaction information by using a Siamese BiLSTM network combined with a Co-Attention mechanism; extracting local features of the text with a CapsNet network and automatically calibrating the importance degree of each local feature with a SENet network to obtain a local feature matrix of the text; fusing the global feature matrix and the local feature matrix and extracting context information with a BiLSTM network once more to obtain the similarity matrices of the two texts; and finally judging the semantic similarity of the texts through fusion, pooling, and fully connected layers, so that the effective feature information of the texts is fully extracted. Because the interaction information of the texts to be matched is extracted while the valuable information of the texts is retained, the accuracy of text semantic similarity calculation is improved.

1. A text semantic similarity calculation method is characterized by comprising the following steps:

step 1, converting a text into a vector matrix;

step 2, extracting global features of the text by using a Siamese BiLSTM network, and extracting a global feature matrix containing text interaction information by combining a Co-Attention mechanism;

step 3, extracting local features of the text from the vector matrix through a CapsNet network, and calibrating the importance degree of each local feature through a SENet network to obtain a local feature matrix;

step 4, fusing the global feature matrix and the local feature matrix, and extracting context information by using a BiLSTM network to obtain a semantic similarity matrix of the two texts;

step 5, fusing the two semantic similarity matrices, and judging the semantic similarity of the two texts through a pooling layer and a fully connected layer.

2. The text semantic similarity calculation method according to claim 1, wherein the step 1 specifically comprises:

cutting the sentence to length m, performing word embedding with an n-dimensional pre-trained GloVe model, and representing the text as an m × n vector matrix containing semantic information.

3. The text semantic similarity calculation method according to claim 1, wherein the step 2 specifically comprises:

extracting global features of the texts by using a parameter-sharing Siamese BiLSTM network to obtain global feature matrices M and N of the two texts;

combining the global feature matrices M and N by using a Co-Attention mechanism to obtain a matrix L = M^T * N;

applying softmax to L by rows and by columns respectively to obtain the attention matrix of the first text over the second text, A_N = softmax(L), and A_M = softmax(L^T);

applying the attention to the second text and then generating the attention representation of the first text based on the attention-weighted information of the second text, namely C_N = M * A_N and C_M = C_N * A_M, thereby obtaining the global feature matrices containing the interaction information.

4. The text semantic similarity calculation method according to claim 1, wherein calibrating the importance degree of each local feature by using a SENet network specifically comprises:

taking the output of the DigitCaps layer of the CapsNet network as the text local feature matrix, inputting it into the SENet network, and constructing the SECapsNet network to calibrate the local features of the text.

5. The semantic similarity calculation method according to claim 3, wherein global features of the text are extracted using a parameter-sharing Siamese BiLSTM network to obtain global feature matrices M and N of the two texts, specifically:

using two LSTMs to perform computation in the forward and backward directions respectively, wherein the output of the BiLSTM at time t is:

h_t = w_f * h_t^f + w_b * h_t^b + b_t

where h_t^f represents the output of the forward LSTM at time t, h_t^b represents the output of the backward LSTM at time t, w_f and w_b are the hidden-layer weights of the forward and backward LSTM respectively, b_t represents the bias, and h_t represents the output of the BiLSTM at time t.

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a text semantic similarity calculation method based on feature adaptive calibration and a Co-Attention mechanism.

Background

With the rapid development of Internet technology, natural language processing has become a popular research field, aiming to enable computers to better understand human intentions and return the information people need from massive amounts of text. With the development and widespread use of social networks, text information such as microblogs, movie reviews, and news has grown explosively.

Text semantic similarity calculation is a basic task in the field of natural language processing and plays an irreplaceable role in information retrieval, text classification, question-answering systems, and the like; research on text similarity calculation is therefore of great significance.

Research on text semantic similarity in recent years can be divided into three categories. The first is traditional methods based on keyword and string matching: Kondrak proposed the N-Gram model, which represents the similarity of two texts by the ratio of the number of N-grams they share to the total number of N-grams; Niwattanakul et al. proposed the Jaccard coefficient to measure the similarity between two texts, which does not attend to the differences between set elements. Keyword-based methods only focus on lexical-level similarity of the texts to be matched, do not consider the semantic information of the text, and therefore have great limitations. The second is calculation methods based on the vector space model: Salton et al. first proposed the Vector Space Model (VSM), which maps text to vectors based on word-frequency statistics; Landauer et al. proposed the LSA model on the basis of VSM, mapping text from the word-vector space to a semantic vector space through Singular Value Decomposition (SVD), so that the vectors carry some semantic information. Vector-space methods generally use Euclidean distance, cosine distance, Manhattan distance, and the like to represent the similarity of texts, and well compensate for the shortcomings of traditional lexical-matching methods; however, they represent text as high-dimensional, sparse vectors, which is not conducive to similarity calculation, they assume that words are mutually independent and ignore polysemy, near-synonyms, and synonyms, an assumption that is difficult to satisfy in practice, and they ignore the contextual connections between words, lacking sufficient semantic information. The third is calculation methods based on deep learning: Mikolov et al. proposed the word2vec model, which, compared with the one-hot model, trains low-dimensional, dense word vectors using the context of words, so that the semantic information is richer; Yin et al. modeled sentence pairs and proposed three schemes that fuse an attention mechanism with CNN, verified on tasks such as AS (answer selection) and PI (paraphrase identification); Kusner et al., building on research into word embeddings, proposed the Word Mover's Distance to measure the similarity between two texts. Deep-learning-based methods extract text features through neural networks to judge the similarity of texts, and these features contain semantic information to a certain extent, but they lack sentence-to-sentence comparison and the like.

As described above, conventional statistics-based text similarity calculation methods have certain drawbacks: the text is represented as a high-dimensional, sparse vector, which is not conducive to similarity calculation; words are assumed to be mutually independent, ignoring polysemy, near-synonyms, and synonyms, an assumption that is difficult to satisfy in practice; and the contextual connections between words in the text are ignored, so semantic information is lacking. Approaches that use pre-trained language models in conjunction with neural networks also have disadvantages: convolutional neural networks focus on the local information of a text, extracting local features from the vector matrix, but lack contextual interaction information and lose effective features during convolution and pooling; recurrent neural networks can extract contextual interaction information, but lack the ability to extract text features for long-distance dependencies.

Disclosure of Invention

The invention aims to provide a text semantic similarity calculation method to make up for the defects in the prior art. The method uses a Co-Attention mechanism to extract the interaction information of the texts to be matched, uses an SECapsNet network to fully retain the valuable information of the texts, fully extracts the effective feature information of the texts, and improves the accuracy of short-text semantic similarity calculation.

The invention is realized by adopting the following technical scheme:

a text semantic similarity calculation method is provided, which comprises the following steps:

step 1, converting a text into a vector matrix;

step 2, extracting global features of the text by using a Siamese BiLSTM network, and extracting a global feature matrix containing text interaction information by combining a Co-Attention mechanism;

step 3, extracting local features of the text from the vector matrix through a CapsNet network, and calibrating the importance degree of each local feature through a SENet network to obtain a local feature matrix;

step 4, fusing the global feature matrix and the local feature matrix, and extracting context information by using a BiLSTM network to obtain a semantic similarity matrix of the two texts;

step 5, fusing the two semantic similarity matrices, and judging the semantic similarity of the two texts through a pooling layer and a fully connected layer.

Further, step 1 specifically includes: cutting the sentence to length m, performing word embedding with an n-dimensional pre-trained GloVe model, and representing the text as an m × n vector matrix containing semantic information.

Further, step 2 specifically includes: extracting global features of the texts by using a parameter-sharing Siamese BiLSTM network to obtain global feature matrices M and N of the two texts; combining the global feature matrices M and N by using a Co-Attention mechanism to obtain a matrix L = M^T * N; applying softmax to L by rows and by columns respectively to obtain the attention matrix of the first text over the second text, A_N = softmax(L), and A_M = softmax(L^T); applying the attention to the second text and then generating the attention representation of the first text based on the attention-weighted information of the second text, namely C_N = M * A_N and C_M = C_N * A_M, thereby obtaining the global feature matrices containing the interaction information.

Further, calibrating the importance degree of each local feature by using a SENet network specifically comprises: taking the output of the DigitCaps layer of the CapsNet network as the text local feature matrix, inputting it into the SENet network, and constructing the SECapsNet network to calibrate the local features of the text.

Further, extracting global features of the texts by using a parameter-sharing Siamese BiLSTM network to obtain global feature matrices M and N of the two texts, specifically:

using two LSTMs to perform computation in the forward and backward directions respectively, wherein the output of the BiLSTM at time t is:

h_t = w_f * h_t^f + w_b * h_t^b + b_t

where h_t^f represents the output of the forward LSTM at time t, h_t^b represents the output of the backward LSTM at time t, w_f and w_b are the hidden-layer weights of the forward and backward LSTM respectively, b_t represents the bias, and h_t represents the output of the BiLSTM at time t.

Compared with the prior art, the invention has the following advantages and positive effects: the text semantic similarity calculation method provided by the invention first preprocesses the text, performs word embedding with a pre-trained GloVe model, and converts the text into a vector matrix; then a Siamese BiLSTM network combined with a Co-Attention mechanism extracts the global features and interaction information of the text to obtain a global feature matrix containing text interaction information; meanwhile, a CapsNet network extracts the local features of the text and a SENet network automatically calibrates the importance degree of each local feature, forming an SECapsNet network, so as to obtain a local feature matrix of the text; after the global feature matrix and the local feature matrix are fused, a BiLSTM network is used once more to extract context information and obtain the similarity matrices of the two texts; finally, the semantic similarity of the texts is judged through a fusion layer, a pooling layer, and a fully connected layer, so that the effective feature information of the texts is fully extracted. Compared with the prior art, the Co-Attention mechanism extracts the interaction information of the texts to be matched, and the SECapsNet network fully retains the valuable information of the texts, so that the effective feature information of the texts can be fully extracted and the accuracy of text semantic similarity calculation is improved.

Other features and advantages of the present invention will become more apparent from the detailed description of the embodiments of the present invention when taken in conjunction with the accompanying drawings.

Drawings

FIG. 1 is a flow chart of a text semantic similarity calculation method proposed by the present invention;

FIG. 2 is a technical flow chart of a text semantic similarity calculation method according to the present invention;

FIG. 3 is a diagram of the BiLSTM model architecture in the present invention;

FIG. 4 is a structural diagram of the CapsNet model in the present invention.

Detailed Description

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

The text semantic similarity calculation method provided by the invention, as shown in fig. 1 and fig. 2, comprises the following steps:

step S1: the text is converted to a vector matrix.

First, to address problems such as misspellings and inconsistent capitalization in the texts to be matched (hereinafter the first text and the second text are taken as an example), the texts are preprocessed, including but not limited to spell checking, symbol replacement, and unification of abbreviations.

After preprocessing, word embedding is performed using a pre-trained GloVe model:

because training a GloVe model on a large data set takes a great deal of time, in some embodiments of the invention a 300-dimensional pre-trained GloVe model is used for word embedding. The model is trained on a corpus at the hundred-million scale, so the generated text vectors contain rich semantic information; this dimensionality avoids both carrying too little information when the dimension is too small and the dimensional disaster caused by an excessively large dimension.

300-dimensional randomized embedding is also performed for words outside the word embedding dictionary. Meanwhile, in view of the problem of inconsistent sentence lengths in the data set, in some embodiments of the present invention, the length of the sentence is cut to 25, and the original sentence is represented as a 25 × 300 vector matrix by the word embedding layer.
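As an illustration of this embedding step (a minimal sketch under our own assumptions, with the pre-trained GloVe vectors loaded into a Python dict), each sentence is truncated to 25 tokens and mapped to a 25 × 300 matrix, with a random vector used for out-of-vocabulary words and zero padding for short sentences:

```python
import numpy as np

MAX_LEN, EMB_DIM = 25, 300  # sentence length and GloVe dimension from the embodiment

def sentence_to_matrix(tokens, glove, rng=np.random.default_rng(0)):
    """Map a tokenized sentence to a MAX_LEN x EMB_DIM matrix.

    `glove` is assumed to be a dict {word: np.ndarray of shape (300,)}
    built from a pre-trained GloVe file; out-of-vocabulary words get a
    random vector, and short sentences are zero-padded.
    """
    mat = np.zeros((MAX_LEN, EMB_DIM), dtype=np.float32)
    for i, tok in enumerate(tokens[:MAX_LEN]):
        vec = glove.get(tok)
        mat[i] = vec if vec is not None else rng.normal(scale=0.1, size=EMB_DIM)
    return mat
```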

Step S2: and extracting the global features of the text by using the Siamese BilSt network, and extracting a global feature matrix containing text interaction information by combining a Co-Attention mechanism.

Two LSTMs perform computation in the forward and backward directions respectively; the BiLSTM model structure is shown in FIG. 3, and the output of the BiLSTM at time t is:

h_t = w_f * h_t^f + w_b * h_t^b + b_t

where h_t^f represents the output of the forward LSTM at time t, h_t^b represents the output of the backward LSTM at time t, w_f and w_b are the hidden-layer weights of the forward and backward LSTM respectively, b_t represents the bias, and h_t represents the output of the BiLSTM at time t, so that h_t is jointly determined by h_t^f and h_t^b. The global feature matrices M and N of the two texts are then obtained through the BiLSTM network. In FIG. 3, e_t denotes a token obtained after word segmentation.
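As a minimal sketch of this encoding step (our own illustration, with an assumed hidden size of 128), both texts can be passed through one shared bidirectional LSTM so that the global feature matrices M and N come from the same parameters, as required by the Siamese structure:

```python
import torch
import torch.nn as nn

class SharedBiLSTM(nn.Module):
    """Parameter-shared (Siamese) BiLSTM encoder: both texts pass through
    the same module, yielding global feature matrices M and N."""
    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x_a, x_b):           # each input: (batch, 25, 300)
        M, _ = self.rnn(x_a)               # (batch, 25, 2*hidden)
        N, _ = self.rnn(x_b)
        return M, N
```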

Then, a Co-Attention mechanism is used to obtain the interaction information of the two texts, yielding the global feature matrices C_M and C_N that contain the text interaction information.

Specifically, the Co-Attention mechanism first combines the information of the global feature matrices M and N extracted by the BiLSTM to obtain the matrix

L = M^T * N

Applying softmax along the rows and columns of the combined matrix L gives the attention matrices of the first text A and the second text B:

A_N = softmax(L)

A_M = softmax(L^T)

The matrix A_N represents the attention of each word in the first text A to each word in the second text B; the matrix A_M is defined analogously.

Then, the attention is first applied to the second text B, and the attention representation of the first text A is generated based on the attention-weighted information of the second text B:

C_N = M * A_N

C_M = C_N * A_M

In this way, the information interaction between the two texts is realized, and the global feature matrices C_M and C_N of the two texts, containing the interaction information, are obtained.
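The interaction step above reduces to a few matrix operations. The function below is an illustrative sketch (names and shapes are our assumptions, with M and N laid out as feature-dimension × sequence-length, i.e. the transpose of the encoder output above) that follows the formulas L = M^T * N, A_N = softmax(L), A_M = softmax(L^T), C_N = M * A_N, and C_M = C_N * A_M:

```python
import torch
import torch.nn.functional as F

def co_attention(M, N):
    """Co-Attention between two encoded texts.

    M, N: (feat_dim, seq_len) global feature matrices of text A and text B.
    Returns C_M, C_N, the interaction-aware global feature matrices.
    """
    L = M.t() @ N                       # (seq_len_a, seq_len_b) affinity matrix
    A_N = F.softmax(L, dim=-1)          # attention of text A over text B
    A_M = F.softmax(L.t(), dim=-1)      # attention of text B over text A
    C_N = M @ A_N                       # (feat_dim, seq_len_b)
    C_M = C_N @ A_M                     # (feat_dim, seq_len_a)
    return C_M, C_N
```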

Step S3: and extracting local features of the text by using the CapsNet for the vector matrix, and calibrating the importance degree of each local feature by using a SEnet network to obtain a local feature matrix.

The invention uses a CapsNet network to extract local features of the text; the model structure is shown in FIG. 4. When used for a classification task, the CapsNet comprises an input layer, a convolutional layer, a PrimaryCaps layer, a DigitCaps layer, and a fully connected layer. The input layer is a matrix representation of the raw data; the Conv1 layer is a standard convolutional layer that extracts feature information from the input data through convolution kernels; at the PrimaryCaps layer, the extracted features are packaged into a number of primary capsules that comprehensively reflect the features of a certain class; the DigitCaps layer performs propagation and dynamic routing updates on the basis of the vectors output by the PrimaryCaps layer; finally, the fully connected layer decodes the extracted feature information to judge the class of the original data.

The dynamic routing algorithm in the CapsNet network proceeds as follows:

1) For all capsules i in layer l and capsules j in layer l+1: initialize b_ij = 0

2) For r = 1 to 3 do:

3) For each capsule i in layer l: c_ij = softmax(b_ij)

4) For each capsule j in layer l+1: u_i = v_i * w_i, s_j = Σ_i c_ij * u_i

5) For each capsule j in layer l+1: v_j = Squash(s_j)

6) For all capsules i in layer l and capsules j in layer l+1: b_ij = b_ij + u_i · v_j

7) Return v_j

where v_i is the input vector, w_i is a weight matrix, u_i is obtained by multiplying v_i and w_i, c_ij are the coupling coefficients that indicate the likelihood of a lower-level capsule activating a higher-level capsule and are updated iteratively through b_ij, Squash is a squashing operation similar to an activation function, and v_j is the output vector.
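For illustration only (this code is not from the patent), the routing loop above can be written as a short PyTorch function; here `u_hat` stacks the prediction vectors u_i (the products v_i * w_i from the listing above) for every pair of lower and upper capsules, and three routing iterations are used as in the listing:

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Squash non-linearity: keeps the direction of s, shrinks its norm into [0, 1)."""
    n2 = (s * s).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement between capsule layers.

    u_hat: (num_lower, num_upper, dim_upper) prediction vectors, i.e. the
           lower-level capsule outputs already multiplied by the weight
           matrices (u_i = v_i * w_i in the patent's notation).
    """
    num_lower, num_upper, _ = u_hat.shape
    b = torch.zeros(num_lower, num_upper)          # routing logits b_ij = 0
    for _ in range(iters):
        c = F.softmax(b, dim=1)                    # coupling coefficients c_ij
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)   # s_j = sum_i c_ij * u_i
        v = squash(s)                              # v_j = Squash(s_j)
        b = b + (u_hat * v.unsqueeze(0)).sum(-1)   # b_ij += u_i . v_j
    return v
```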

Compared with a traditional convolutional neural network, extracting the local features of the text with the CapsNet network retains more valuable information, such as the original word order and semantic representation of the text.

Meanwhile, the output of the DigitCaps layer of the CapsNet network is used as the text local feature matrix and is fed into the SENet network, constructing the SECapsNet network. The SENet consists of three parts, namely Squeeze, Excitation, and Reweight. Squeeze compresses the multi-channel feature map U obtained by convolution using global average pooling, converting the receptive field from local to global so that sufficient effective information can be obtained; Excitation generates a weight for each feature channel, similar to the gate mechanism in recurrent neural networks; Reweight recalibrates the original features along the channel dimension on the basis of the Squeeze and Excitation operations.

Therefore, the SECapsNet network not only contains more text features and semantic information, but also realizes adaptive calibration of the local features of the text.
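A compact squeeze-and-excitation block of the kind described above might look as follows in PyTorch; the reduction ratio r = 4 and the one-dimensional layout of the capsule features are our assumptions, not values given by the patent:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-Excitation-Reweight over feature channels (here, capsule features)."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, u):                  # u: (batch, channels, length)
        z = u.mean(dim=-1)                 # Squeeze: global average pooling
        w = self.fc(z)                     # Excitation: per-channel weights
        return u * w.unsqueeze(-1)         # Reweight: recalibrate the channels
```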

Step S4: and fusing the global feature matrix and the local feature matrix, and extracting context information by using a BILSTM network to obtain a semantic similarity matrix of the two texts.

In the embodiment of the invention, the global features and the local features of the text are fused by concatenation, and the fused feature matrix contains more text features.

The BiLSTM network still adopts a parameter-sharing Siamese structure, and context information and text features are extracted again from the fused feature matrix, so that the semantic similarity matrices of the two texts are obtained.

Step S5: and fusing the two semantic similarity matrixes, and judging the semantic similarity of the two texts through a pooling layer and a full connection layer.

The semantic similarity matrices of the first text A and the second text B are fused, and text features are extracted through pooling; the pooling operations used are global average pooling and global max pooling.

A 3-layer Dense network is used as the fully connected layer, with the hidden units of the three layers set to 128, 32, and 1, respectively.
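As a sketch of the matching head of steps S4 and S5 (only the 128/32/1 layer sizes and the average/max pooling come from the embodiment; the fusion by concatenation, the sigmoid output, and all names are assumed), the fused similarity representations can be pooled and passed through the three-layer Dense stack:

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Fuse the two similarity matrices, apply global average + max pooling,
    then the 3-layer Dense stack (128, 32, 1) from the embodiment."""
    def __init__(self, feat_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, sim_a, sim_b):        # each: (batch, seq_len, feat_dim)
        fused = torch.cat([sim_a, sim_b], dim=1)        # fuse the two matrices
        pooled = torch.cat([fused.mean(dim=1),          # global average pooling
                            fused.max(dim=1).values],   # global max pooling
                           dim=-1)                      # (batch, 2*feat_dim)
        return self.mlp(pooled)             # similarity score in [0, 1]
```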

In the following, the semantic similarity calculation method proposed by the present invention is evaluated on the Quora Question Pairs dataset, which consists of more than 400,000 question pairs, each labeled 0 or 1 to indicate whether the two questions have the same meaning.

In the embodiment of the invention, 10,000 balanced samples are selected as the development set, 10,000 balanced samples as the test set, and the remaining examples are kept as the training set; examples from the Quora dataset are shown in Table 1 below:

Table 1

The experiments use accuracy, recall, precision, and F1 value to evaluate the overall performance of the model.

For a binary classification problem, combining the predicted and actual results pairwise leads to four cases: TP, TN, FP, and FN.

TP (true positive) denotes the number of positive samples predicted as positive; TN (true negative) denotes the number of negative samples predicted as negative; FP (false positive) denotes the number of negative samples predicted as positive; FN (false negative) denotes the number of positive samples predicted as negative.

Accuracy is the proportion of correctly predicted samples among all samples and measures the discrimination ability of the model on the whole dataset; recall is the probability of being predicted as positive among the actual positive samples; precision is the probability that samples predicted as positive are actually positive; the F1 value is the harmonic mean of precision and recall, and a higher F1 value indicates better overall model performance.
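For reference, these four metrics follow directly from the confusion-matrix counts; the function below is a generic illustration and is not tied to the patent's experiments:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F1 computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, precision, f1
```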

In order to verify the effectiveness of the method provided by the invention, three groups of comparison experiments are set:

experiment 1: comparative experiments with classical models.

Experiment 2: the comparative experiment of an attention mechanism and a capsNet network is successively integrated in the experimental process.

Experiment 3: comparative experiments with methods proposed in other documents.

In experiment 1, the model of the present invention was compared with classical models such as LSTM, BiLSTM, and Siamese LSTM. The experimental results are shown in Table 2:

Table 2

As can be seen from Table 2, compared with the traditional methods, the method provided by the invention performs better on text semantic similarity, with an accuracy of 87.31%.

The BiLSTM model outperforms the LSTM model because it can extract forward and backward sequence information simultaneously. After the Siamese structure is incorporated, the accuracy, recall, precision, and F1 value of the BiLSTM model improve markedly compared with the LSTM model, indicating that the Siamese structure can effectively improve model performance. The method provided by the invention therefore adopts the Siamese structure.

In experiment 2, the proposed method was compared with model variants that successively incorporate the attention mechanism and the CapsNet network; the experimental results are shown in Table 3:

Table 3

Comparing the results of the BiLSTM and BiLSTM-Attention models, the BiLSTM model with the attention mechanism achieves higher accuracy and F1 value, which shows that the attention mechanism improves accuracy by extracting interaction information between texts. Table 3 also shows that, after incorporating the CapsNet network, the accuracy of the method reaches 88.27%, indicating that the CapsNet network can better extract the local features of the text. After the SENet network is incorporated, the accuracy decreases slightly, but the other indicators improve; in particular, the recall increases by 3.3%. This result shows that the SENet network can learn the dependencies among feature channels and thus further improve the utilization of the features.

In experiment 3, the method proposed by the present invention was compared with the methods proposed in other works, and the experimental results are shown in Table 4:

Table 4

The BiLSTM-DenseNet method extracts text features with a stacked BiLSTM network and therefore needs a longer time to train; compared with that model, the accuracy of the method provided by the invention improves by 1.81% and the F1 value improves by 0.25%. Table 4 also shows that the proposed method achieves better accuracy and F1 value than the BiLSTM-CNN model.

In summary, the experimental results on the Quora Question Pairs dataset demonstrate the effectiveness of the proposed method.

It should be noted that the above description is not intended to limit the present invention, which is not restricted to the above examples; those skilled in the art may make changes, modifications, additions, or substitutions within the spirit and scope of the present invention.
