Intelligent detection method for semantic mutual exclusion of multi-source management terms

Document No.: 1378993 | Publication date: 2020-08-14

Abstract: This technology, 一种多源管理条款的语义互斥的智能检测方法 (Intelligent detection method for semantic mutual exclusion of multi-source management clauses), was designed by 元雨暄, 林欣郁, and 贺惠新 on 2020-04-17. The invention discloses an intelligent method for detecting semantic mutual exclusion among multi-source management clauses. It first acquires the text of each relevant management clause for model training; after preprocessing each clause text, it combines statistical features, lexical semantic features, and contradiction-rule features to build a sememe-based text semantic conflict detection model, which can then be used for intelligent detection of semantic mutual exclusion among management clauses. The invention uses an automatic classification algorithm to build the text conflict detection model and applies it to real scenarios, enabling a computer to automatically judge whether a text pair conflicts and providing a new way to prevent semantic conflicts among multi-source management clauses.

1. An intelligent detection method for semantic mutual exclusion of multi-source management terms comprises a model training stage and a detection stage, and is characterized by comprising the following specific steps:

a model training stage:

step S1: acquire management clause texts, collect upper-level and lower-level clause text pairs, and split and match them item by item to obtain a set of NS policy entry text pairs S = {S(i)} as the training corpus, where each text pair S(i) consists of an upper-level policy entry T(i) and a lower-level policy entry H(i), 1 ≤ i ≤ NS, and NS must be at least 10000;

step S2: from the NS policy entry text pairs, construct NS text pairs that contain a contradiction, labeled 1, and NS text pairs that do not, labeled 0, for a total of 2NS text pairs as the training corpus;

step S3: performing text preprocessing on the training corpus;

step S4: perform feature extraction on each preprocessed text pair T, H to obtain the pair's feature list, comprising statistical features, lexical semantic features, and contradiction-rule features;

step S5: using the feature list obtained in step S4, select a support vector machine as the classifier, apply class-balance processing to the samples, and train to obtain the model M;

a detection stage:

step T1: preprocess the texts T' and H' to be classified, where T' is the upper-level policy entry and H' is the lower-level policy entry;

step T2: extracting the characteristics of the preprocessed text pairs T 'and H' to obtain a characteristic list F of the text pairs;

step T3: input the feature list F of the text pair into the classification model M to obtain an output of 0 or 1, where 1 means the pair is contradictory, i.e., the lower-level policy entry violates the upper-level policy entry, and 0 means it is not, i.e., the lower-level entry conforms to the content of the upper-level entry.

2. The method for intelligently detecting semantic mutual exclusion of multi-source management clauses according to claim 1, wherein the preprocessing in step S3 specifically includes:

step S31: extract Chinese numerals with units and their measure units from the text and convert them to Arabic numerals; convert directional words appearing before or after the numbers into mathematical-symbol strings; for the text pair T(i), H(i), combine each converted Arabic numeral with the converted directional-word string into a tuple, using the empty string "" when no directional word is present; record the tuple sets as digitT(i) and digitH(i);

step S32: for each text pair T(i) and H(i), after word segmentation, stop-word removal, and part-of-speech tagging, obtain two part-of-speech sets wtT(i) and wtH(i), where wtT(i) = {(T(i)(j, word), T(i)(j, tag))}, wtH(i) = {(H(i)(k, word), H(i)(k, tag))}, T(i)(j, word) denotes the j-th word in text T(i), T(i)(j, tag) its part of speech, H(i)(k, word) the k-th word in text H(i), and H(i)(k, tag) its part of speech;

step S33: for each pair of texts T (i) and H (i), respectively extracting negative words to form negative word sets nwT (i) and nwH (i);

step S34: for each text pair T(i) and H(i), normalize the time-related phrases in the texts to the format "xxxx year xx month xx day" and add them to the sets T(i)_time and H(i)_time respectively.

3. The method for intelligently detecting semantic mutual exclusion of multi-source management clauses according to claim 2, wherein the step S4 specifically includes:

step S41: calculate the word overlap degree wLap(T(i), H(i)), where NwordT(i) is the word set contained in T(i) and NwordH(i) is the word set contained in H(i); the calculation formula is as follows:

step S42: calculate the text length difference diffLen(T(i), H(i)), where len() computes string length; the calculation formula is as follows:

diffLen(T(i),H(i))=|len(T(i))-len(H(i))|

step S43: calculate the Jaro-Winkler distance jwDis(T(i), H(i)) of the text pair. Record jSim(T(i), H(i)) as the Jaro distance between T(i) and H(i), where m is the number of matching characters between the strings T(i) and H(i), t is half the number of transpositions, the matching window is mw(T(i), H(i)), f is the number of common prefix characters of the two strings (1 ≤ f ≤ 4), and p is a scaling-factor constant (0 < p ≤ 0.25); the calculation formula is as follows:

jwDis(T(i),H(i))=jSim(T(i),H(i))+fp(1-jSim(T(i),H(i)))

step S44: calculate the sememe-based cosine similarity cosSim(T(i), H(i)). For each tuple T(i)_wt(j) in the part-of-speech set T(i)_wt of text T(i), look up the corresponding sememe according to the word's part of speech to obtain the sememe vector T(i)_wt(j)_vec; sum the sememe vectors and average them to obtain the vector representation vecT(i) of text T(i), and record the total number of tuples in T(i)_wt as sT(i)_wt; in the same way obtain the vector representation vecH(i) of text H(i) and the total number of tuples sH(i)_wt in H(i)_wt; the calculation formula is as follows:

step S45: calculate the negation feature negF(T(i), H(i), nwT(i), nwH(i)), where negT(i) and negH(i) are the total numbers of words in nwT(i) and nwH(i) respectively; the calculation formula is as follows:

negF = |negT(i) - negH(i)| mod 2

where mod 2 takes the remainder after division by 2;

step S46: calculate the combined HowNet and synonym-forest (Cilin) semantic similarity combiSim(T(i), H(i)) of the text pair. Record the function converting semantic distance into similarity as sim_w(w1, w2), where w1 and w2 are two sememes, dis(w1, w2) is their semantic distance, i.e., the path length between the two sememes in the sememe tree, and a is the sememe distance at which the similarity is about 0.5; the calculation formula is as follows:

Record the Cilin word-similarity function as sim_t(w1, w2), where C1 and C2 are the Cilin codes of the words w1 and w2, and disT(C1, C2) is the distance between the two codes in the tree structure, equal to the sum of the edge weights along the path connecting the word pair; the formula for sim_t(C1, C2) is as follows:

where n is the density of the nearest common parent node of the word pair and k is the spacing between the branches on which the two words lie;

where s is the minimum of the numbers of tuples in T(i)_wt and H(i)_wt, and i denotes the index of a tuple within the set;

step S47: calculate the numeric contradiction feature numCF of the text pair T(i), H(i) by comparing the preprocessed tuple sets digitT(i) and digitH(i), where smin is the smaller of the two sets' element counts; the calculation formula is as follows:

step S48: calculate the time contradiction feature timeCF of the text pair T(i), H(i): convert each date into a timestamp and pair it with the corresponding mathematical-symbol string to obtain two tuple sets timeT(i) and timeH(i); check whether every time range represented by a tuple in H(i) falls within a time range represented by a tuple in T(i); if all do, set timeCF to 0, otherwise set it to 1;

step S49: calculate the modifier contradiction feature adjCF of the text pair T(i), H(i). From wtT(i) and wtH(i), form a new set of two-tuples adjSet in order according to the tuples' part-of-speech tags, where the elements of each two-tuple are the words tagged as adjectives in wtT(i) and wtH(i) respectively; if wtT(i) contains an adjective at a position where wtH(i) does not, fill the second element of the new tuple with the empty string "", and conversely fill the first element with "" when the adjective is missing from wtT(i); record the total number of elements in adjSet as sa; the calculation formula is as follows:

4. The method for intelligently detecting semantic mutual exclusion of multi-source management clauses according to claim 1, wherein in step T1 the preprocessing specifically comprises: extracting the digit tuple sets digT' and digH', the part-of-speech sets wtT' and wtH', the negation-word sets nwT' and nwH', and the time sets timeT' and timeH'.

5. The method for intelligently detecting semantic mutual exclusion of multi-source management clauses according to claim 4, wherein the step T2 specifically comprises:

T21: calculate the word overlap degree from wtT' and wtH' as the value of feature f1;

T22: take the absolute difference of the lengths of T' and H' as the text length difference, the value of feature f2;

T23: calculate the Jaro-Winkler distance of the texts T' and H' from wtT' and wtH' as the value of feature f3;

T24: from the words in wtT' and wtH', obtain the sememe sets sT and sH of the texts T' and H', sum and average the sememe vectors to obtain the vector representations vT and vH of the two texts, and compute their cosine similarity as the value of feature f4;

T25: calculate the negation feature from the negation-word sets nwT' and nwH' to obtain the value of feature f5;

T26: calculate the combined HowNet and Cilin text semantic similarity from wtT' and wtH' as the value of feature f6;

T27: using digT' and digH', judge whether the numeric ranges of the elements in the two sets satisfy the containment relationship; if so, set f7 to 1, otherwise set f7 to 0;

T28: using timeT' and timeH', judge whether the time ranges of the elements in the two sets satisfy the containment relationship; if so, set f8 to 1, otherwise set f8 to 0;

T29: from the words and parts of speech in wtT' and wtH', extract the adjective-tagged words to form a new adjective set, calculate the adjective similarity with the combined HowNet and Cilin similarity, and divide the sum of the similarities by the total number of tuples in the set to obtain the modifier contradiction degree, the value of feature f9.

Technical Field

The invention relates to the field of semantic intelligent detection, in particular to a semantic mutual exclusion intelligent detection method for multi-source management terms.

Background

A management clause document sets out the specific code of conduct that a management department establishes to accomplish its tasks within a given historical period. To ensure that objectives are met, the management department first organizes them and then delegates them level by level, down to the grassroots departments that face the targets directly. However, because of information asymmetry and the imbalance of authority and responsibility, conflicts between upper-level and lower-level regulations can easily arise during this level-by-level delegation, hindering the execution of actual tasks. Mutual-exclusion detection on regulations issued at the upper and lower levels makes it possible to discover conflicts in time and strengthens supervision of the delegation process for management documents, which is of great significance for safeguarding an organization's interests.

Disclosure of Invention

The invention mainly aims to overcome the defects of the prior art by providing an intelligent semantic mutual exclusion detection method for multi-source management clauses that can effectively prevent text semantic conflicts among such clauses.

The invention adopts the following technical scheme:

An intelligent detection method for semantic mutual exclusion of multi-source management clauses comprises a model training stage and a detection stage, with the following specific steps:

A model training stage:

step S1: acquire management clause texts, collect upper-level and lower-level clause text pairs, and split and match them item by item to obtain a set of NS policy entry text pairs S = {S(i)} as the training corpus, where each text pair S(i) consists of an upper-level policy entry T(i) and a lower-level policy entry H(i), 1 ≤ i ≤ NS, and NS must be at least 10000;

step S2: from the NS policy entry text pairs, construct NS text pairs that contain a contradiction, labeled 1, and NS text pairs that do not, labeled 0, for a total of 2NS text pairs as the training corpus;

step S3: perform text preprocessing on the training corpus;

step S4: perform feature extraction on each preprocessed text pair T, H to obtain the pair's feature list, comprising statistical features, lexical semantic features, and contradiction-rule features;

step S5: using the feature list obtained in step S4, select a support vector machine as the classifier, apply class-balance processing to the samples, and train to obtain the model M;

A detection stage:

step T1: preprocess the texts T' and H' to be classified, where T' is the upper-level policy entry and H' is the lower-level policy entry;

step T2: extract features from the preprocessed text pair T', H' to obtain the pair's feature list F;

step T3: input the feature list F of the text pair into the classification model M to obtain an output of 0 or 1, where 1 means the pair is contradictory, i.e., the lower-level policy entry violates the upper-level policy entry, and 0 means it is not, i.e., the lower-level entry conforms to the content of the upper-level entry.

Specifically, the preprocessing in step S3 comprises:

step S31: extract Chinese numerals with units and their measure units from the text and convert them to Arabic numerals; convert directional words appearing before or after the numbers into mathematical-symbol strings; for the text pair T(i), H(i), combine each converted Arabic numeral with the converted directional-word string into a tuple, using the empty string "" when no directional word is present; record the tuple sets as digitT(i) and digitH(i);

step S32: for each text pair T(i) and H(i), after word segmentation, stop-word removal, and part-of-speech tagging, obtain two part-of-speech sets wtT(i) and wtH(i), where wtT(i) = {(T(i)(j, word), T(i)(j, tag))}, wtH(i) = {(H(i)(k, word), H(i)(k, tag))}, T(i)(j, word) denotes the j-th word in text T(i), T(i)(j, tag) its part of speech, H(i)(k, word) the k-th word in text H(i), and H(i)(k, tag) its part of speech;

step S33: for each text pair T(i) and H(i), extract the negation words to form the negation-word sets nwT(i) and nwH(i);

step S34: for each text pair T(i) and H(i), normalize the time-related phrases in the texts to the format "xxxx year xx month xx day" and add them to the sets T(i)_time and H(i)_time respectively.

Specifically, step S4 comprises:

step S41: calculate the word overlap degree wLap(T(i), H(i)), where NwordT(i) is the word set contained in T(i) and NwordH(i) is the word set contained in H(i); the calculation formula is as follows:

step S42: calculate the text length difference diffLen(T(i), H(i)), where len() computes string length; the calculation formula is as follows:

diffLen(T(i),H(i))=|len(T(i))-len(H(i))|

step S43: calculate the Jaro-Winkler distance jwDis(T(i), H(i)) of the text pair. Record jSim(T(i), H(i)) as the Jaro distance between T(i) and H(i), where m is the number of matching characters between the strings T(i) and H(i), t is half the number of transpositions, the matching window is mw(T(i), H(i)), f is the number of common prefix characters of the two strings (1 ≤ f ≤ 4), and p is a scaling-factor constant (0 < p ≤ 0.25); the calculation formula is as follows:

jwDis(T(i),H(i))=jSim(T(i),H(i))+fp(1-jSim(T(i),H(i)))

step S44: calculate the sememe-based cosine similarity cosSim(T(i), H(i)). For each tuple T(i)_wt(j) in the part-of-speech set T(i)_wt of text T(i), look up the corresponding sememe according to the word's part of speech to obtain the sememe vector T(i)_wt(j)_vec; sum the sememe vectors and average them to obtain the vector representation vecT(i) of text T(i), and record the total number of tuples in T(i)_wt as sT(i)_wt; in the same way obtain the vector representation vecH(i) of text H(i) and the total number of tuples sH(i)_wt in H(i)_wt; the calculation formula is as follows:

step S45: calculate the negation feature negF(T(i), H(i), nwT(i), nwH(i)), where negT(i) and negH(i) are the total numbers of words in nwT(i) and nwH(i) respectively; the calculation formula is as follows:

negF = |negT(i) - negH(i)| mod 2

where mod 2 takes the remainder after division by 2;

step S46: calculate the combined HowNet and synonym-forest (Cilin) semantic similarity combiSim(T(i), H(i)) of the text pair. Record the function converting semantic distance into similarity as sim_w(w1, w2), where w1 and w2 are two sememes, dis(w1, w2) is their semantic distance, i.e., the path length between the two sememes in the sememe tree, and a is the sememe distance at which the similarity is about 0.5; the calculation formula is as follows:

Record the Cilin word-similarity function as sim_t(w1, w2), where C1 and C2 are the Cilin codes of the words w1 and w2, and disT(C1, C2) is the distance between the two codes in the tree structure, equal to the sum of the edge weights along the path connecting the word pair; the formula for sim_t(C1, C2) is as follows:

where n is the density of the nearest common parent node of the word pair and k is the spacing between the branches on which the two words lie;

where s is the minimum of the numbers of tuples in T(i)_wt and H(i)_wt, and i denotes the index of a tuple within the set;

step S47: calculate the numeric contradiction feature numCF of the text pair T(i), H(i) by comparing the preprocessed tuple sets digitT(i) and digitH(i), where smin is the smaller of the two sets' element counts; the calculation formula is as follows:

step S48: calculate the time contradiction feature timeCF of the text pair T(i), H(i): convert each date into a timestamp and pair it with the corresponding mathematical-symbol string to obtain two tuple sets timeT(i) and timeH(i); check whether every time range represented by a tuple in H(i) falls within a time range represented by a tuple in T(i); if all do, set timeCF to 0, otherwise set it to 1;

step S49: calculate the modifier contradiction feature adjCF of the text pair T(i), H(i). From wtT(i) and wtH(i), form a new set of two-tuples adjSet in order according to the tuples' part-of-speech tags, where the elements of each two-tuple are the words tagged as adjectives in wtT(i) and wtH(i) respectively; if wtT(i) contains an adjective at a position where wtH(i) does not, fill the second element of the new tuple with the empty string "", and conversely fill the first element with "" when the adjective is missing from wtT(i); record the total number of elements in adjSet as sa; the calculation formula is as follows:

specifically, in the step T1, the preprocessing specifically includes: extracting digit tuple sets digT ', digH', part of speech sets wtT ', wtH', negative word sets nwT ', nwH', time sets timeT 'and timeH'.

Specifically, step T2 comprises:

T21: calculate the word overlap degree from wtT' and wtH' as the value of feature f1;

T22: take the absolute difference of the lengths of T' and H' as the text length difference, the value of feature f2;

T23: calculate the Jaro-Winkler distance of the texts T' and H' from wtT' and wtH' as the value of feature f3;

T24: from the words in wtT' and wtH', obtain the sememe sets sT and sH of the texts T' and H', sum and average the sememe vectors to obtain the vector representations vT and vH of the two texts, and compute their cosine similarity as the value of feature f4;

T25: calculate the negation feature from the negation-word sets nwT' and nwH' to obtain the value of feature f5;

T26: calculate the combined HowNet and Cilin text semantic similarity from wtT' and wtH' as the value of feature f6;

T27: using digT' and digH', judge whether the numeric ranges of the elements in the two sets satisfy the containment relationship; if so, set f7 to 1, otherwise set f7 to 0;

T28: using timeT' and timeH', judge whether the time ranges of the elements in the two sets satisfy the containment relationship; if so, set f8 to 1, otherwise set f8 to 0;

T29: from the words and parts of speech in wtT' and wtH', extract the adjective-tagged words to form a new adjective set, calculate the adjective similarity with the combined HowNet and Cilin similarity, and divide the sum of the similarities by the total number of tuples in the set to obtain the modifier contradiction degree, the value of feature f9.

As can be seen from the above description of the present invention, compared with the prior art, the present invention has the following advantages:

the invention provides a text conflict detection method based on statistical characteristics, vocabulary semantic characteristics and contradiction rule characteristics, adopts an automatic classification algorithm to construct a text conflict detection model, is applied to an actual scene, and effectively realizes the purpose of automatically detecting and judging the conflict of a text pair by a computer.

Detailed Description

The invention is further described below by means of specific embodiments.

Model training phase

Step S1: acquire the resources on which the model training stage depends, namely, acquire each management clause text, collect upper-level and lower-level clause text pairs, and split and match them item by item to obtain a set of NS policy entry text pairs S = {S(i)} as the training corpus, where each text pair S(i) consists of an upper-level policy entry T(i) and a lower-level policy entry H(i), 1 ≤ i ≤ NS, and NS must be at least 10000;

step S2: for the NS policy entry text pairs, modify the corresponding content to construct NS text pairs containing a contradiction, labeled 1; label the text pairs without contradictions 0; in total 2NS text pairs serve as the training corpus;

step S3: to facilitate feature extraction, perform text preprocessing on the training corpus as follows:

step S31: extract Chinese numerals with units and their measure units from the text and convert them to Arabic numerals: replace the Chinese digits for zero through nine with "0123456789", replace Chinese magnitude units such as hundred, thousand, and ten thousand with 100, 1000, 10000, and so on, and convert directional words carried before or after the numbers, such as "below", "no more than", and "no less than", into mathematical-symbol strings such as "<", "<=", and ">="; for the text pair T(i), H(i), combine each converted Arabic numeral with the converted directional-word string into a tuple, using the empty string "" when no directional word is present; record the tuple sets as digitT(i) and digitH(i);
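The number normalization of step S31 can be sketched as follows. The digit and unit tables and the directional-word mapping below are illustrative assumptions covering only simple numerals, not the patent's exhaustive lists:

```python
# Sketch of step S31: convert simple Chinese numerals with magnitude units
# into Arabic integers, and map directional words to comparison symbols.
# The tables below are assumed examples, not the method's full inventory.
CN_DIGITS = {"零": 0, "一": 1, "二": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
CN_UNITS = {"十": 10, "百": 100, "千": 1000, "万": 10000}
DIRECTION_WORDS = {"以下": "<", "不超过": "<=", "不少于": ">=", "以上": ">"}

def cn_to_int(s: str) -> int:
    """Convert a simple Chinese numeral string (e.g. '三万五千') to an int."""
    total, current = 0, 0
    for ch in s:
        if ch in CN_DIGITS:
            current = CN_DIGITS[ch]
        elif ch in CN_UNITS:
            # A bare unit (e.g. leading '十') counts as one of that unit.
            total += max(current, 1) * CN_UNITS[ch]
            current = 0
    return total + current
```

Each converted number would then be paired with its directional-word symbol (or "" when absent) to form the tuples in digitT(i) and digitH(i).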

step S32: for each text pair T(i) and H(i), after word segmentation, stop-word removal, and part-of-speech tagging, obtain two part-of-speech sets wtT(i) and wtH(i), where wtT(i) = {(T(i)(j, word), T(i)(j, tag))}, wtH(i) = {(H(i)(k, word), H(i)(k, tag))}, T(i)(j, word) denotes the j-th word in text T(i), T(i)(j, tag) its part of speech, H(i)(k, word) the k-th word in text H(i), and H(i)(k, tag) its part of speech;

step S33: for each text pair T(i) and H(i), extract the negation words to form the negation-word sets nwT(i) and nwH(i), where the negation words are those appearing in a predefined set (e.g. "none", "not", "no", "must not");

step S34: for each text pair T(i) and H(i), normalize the time-related phrases in the texts to the format "xxxx year xx month xx day" and add them to the sets T(i)_time and H(i)_time respectively.

Step S4: perform feature extraction on each preprocessed text pair T(i), H(i), covering statistical features, lexical semantic features, and contradiction-rule features; the specific steps are as follows:

Step S41: calculate the word overlap degree wLap(T(i), H(i)), where NwordT(i) is the word set contained in T(i) and NwordH(i) is the word set contained in H(i); the calculation formula is
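The overlap formula itself is not reproduced in this text; one plausible reading, shown here as an assumption, normalizes the shared words by the smaller word set (the overlap coefficient):

```python
def word_overlap(words_t: set, words_h: set) -> float:
    # Assumed definition of wLap (the original formula image is missing):
    # shared words normalized by the smaller set, i.e. the overlap coefficient.
    if not words_t or not words_h:
        return 0.0
    return len(words_t & words_h) / min(len(words_t), len(words_h))
```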

Step S42: calculate the text length difference diffLen(T(i), H(i)), where len() computes string length:

diffLen(T(i), H(i)) = |len(T(i)) - len(H(i))|

Step S43: calculate the Jaro-Winkler distance jwDis(T(i), H(i)) of the text pair. Record jSim(T(i), H(i)) as the Jaro distance between T(i) and H(i), where m is the number of matching characters between the strings T(i) and H(i), t is half the number of transpositions, the matching window is mw(T(i), H(i)), f is the number of common prefix characters of the two strings (1 ≤ f ≤ 4), and p is a scaling-factor constant (0 < p ≤ 0.25); the calculation formula is

jwDis(T(i),H(i))=jSim(T(i),H(i))+fp(1-jSim(T(i),H(i)))
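Step S43 can be sketched with a standard Jaro-Winkler implementation; p = 0.1 below is a conventional choice within the stated range 0 < p ≤ 0.25, not a value fixed by the method:

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity: average of m/|s1|, m/|s2|, (m-t)/m."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    mw = max(len1, len2) // 2 - 1          # matching window radius
    matched1, matched2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):             # count matching characters
        for j in range(max(0, i - mw), min(len2, i + mw + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                            # t = half the transposition count
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """jwDis = jSim + f*p*(1 - jSim), with f = common prefix length (<= 4)."""
    j = jaro(s1, s2)
    f = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        f += 1
    return j + f * p * (1 - j)
```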

Step S44: calculate the sememe-based cosine similarity cosSim(T(i), H(i)). For each tuple T(i)_wt(j) in the part-of-speech set T(i)_wt of text T(i), look up the corresponding sememe according to the word's part of speech to obtain the sememe vector T(i)_wt(j)_vec; sum the sememe vectors and average them to obtain the vector representation vecT(i) of text T(i), and record the total number of tuples in T(i)_wt as sT(i)_wt; in the same way obtain the vector representation vecH(i) of text H(i) and the total number of tuples sH(i)_wt in H(i)_wt; the calculation formula is
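The averaging and cosine computation of step S44 reduce to the following sketch, assuming the sememe vectors have already been looked up and share one dimensionality:

```python
import math

def avg_vector(vectors: list) -> list:
    # Average a list of equal-length sememe vectors into one text vector
    # (vecT or vecH in step S44).
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def cosine(u: list, v: list) -> float:
    # Cosine similarity of the two averaged text vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```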

Step S45: calculate the negation feature negF(T(i), H(i), nwT(i), nwH(i)), where negT(i) and negH(i) are the total numbers of words in nwT(i) and nwH(i) respectively; the calculation formula is

negF = |negT(i) - negH(i)| mod 2

where mod 2 takes the remainder after division by 2;
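The negation feature is a parity check and can be written directly:

```python
def neg_feature(neg_t: list, neg_h: list) -> int:
    # negF = |negT - negH| mod 2: an odd difference in negation-word counts
    # suggests the two clauses differ in polarity; an even one does not.
    return abs(len(neg_t) - len(neg_h)) % 2
```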

Step S46: calculate the combined HowNet and synonym-forest (Cilin) semantic similarity combiSim(T(i), H(i)) of the text pair. Record the function converting semantic distance into similarity as sim_w(w1, w2), where w1 and w2 are two sememes, dis(w1, w2) is their semantic distance, i.e., the path length between the two sememes in the sememe tree, and a is the sememe distance at which the similarity is about 0.5; the calculation formula is

Recording the similarity calculation function of synonym forest words as simt(w1,w2),C1And C2Is an artificial atom w1And w2Word encoding of (2), disT (C)1,C2) Is a distance function of two word codes in a tree structure, equal to the sum of the weights of the edges in the connection path of the word pair, simt(C1,C2) Is calculated by the formula

Where n is the density of the nearest common parent node of the word pair, k is the spacing of the branches where the word pair is located,

The per-word similarities are aggregated into combiSim(T(i), H(i)), where s is the minimum of the tuple counts of T(i)wt and H(i)wt, and j denotes the index of a tuple within the sets;

Step S47: calculate the numeric contradiction feature numCF of the text pair T(i), H(i) by comparing the two tuple sets digitT(i) and digitH(i) obtained during preprocessing, with smin the minimum of the element counts of the two sets; numCF is set to 1 if a number range expressed in H(i) falls outside the corresponding range in T(i), and to 0 otherwise;

Step S48: calculate the time contradiction feature timeCF of the text pair T(i), H(i). Time expression formats are unified during text preprocessing: time content is extracted with regular expressions; words expressing time ranges, such as "earlier than" and "later than", are extracted and converted into mathematical symbol strings such as "<" and ">"; and dates are converted into timestamps, which together with the corresponding symbol strings form tuples, yielding two tuple sets timeT(i) and timeH(i). Check whether each time range expressed by the tuples in H(i) lies within the time range expressed by the tuples in T(i); if all do, timeCF is set to 0, otherwise to 1;
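The checks in steps S47 and S48 share the same shape: turn each (symbol, value) tuple set into an interval, then test containment of the lower clause's interval in the upper clause's. A hedged sketch (the comparator strings follow the "<"/">" forms produced during preprocessing; "=" handling is an added assumption):

```python
def to_interval(pairs):
    """Fold (symbol, value) tuples like ('<', v) / ('>', v) / ('=', v)
    into a single [lo, hi] interval."""
    lo, hi = float("-inf"), float("inf")
    for sym, v in pairs:
        if sym in ("<", "<="):
            hi = min(hi, v)
        elif sym in (">", ">="):
            lo = max(lo, v)
        else:  # '=' pins both ends
            lo, hi = max(lo, v), min(hi, v)
    return lo, hi


def range_conflict(upper_pairs, lower_pairs):
    """1 if the lower clause's range is NOT contained in the upper clause's
    range (a numeric/time contradiction), else 0."""
    ulo, uhi = to_interval(upper_pairs)
    llo, lhi = to_interval(lower_pairs)
    return 0 if (llo >= ulo and lhi <= uhi) else 1
```

For example, an upper clause requiring "< 100" is satisfied by a lower clause requiring "< 50" but contradicted by one requiring "< 200".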

Step S49: calculate the modifier contradiction feature adjCF of the text pair T(i), H(i). Using the part-of-speech tags of the tuples, form a new tuple set adjSet from wtT(i) and wtH(i) in order, where the elements of each tuple are the words tagged as adjectives in wtT(i) and wtH(i) respectively; if wtT(i) still has adjectives when wtH(i) has none, the second element of the new tuple is replaced with the empty string "", and conversely the first element is replaced with the empty string "". Denote the total number of elements of adjSet by sa; adjCF is then computed from the word similarities of the paired adjectives;
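The adjective pairing of step S49 and the averaging used in step T29 can be sketched as below. The adjective tag "a" and the plug-in `word_sim` function (standing in for the HowNet/Cilin word similarity) are assumptions for illustration:

```python
def adj_pairs(wt_t, wt_h):
    """Pair the adjectives of two part-of-speech-tagged word lists in order,
    padding the shorter side with the empty string. 'a' is the assumed
    adjective tag (as in common Chinese POS tag sets)."""
    at = [w for w, tag in wt_t if tag == "a"]
    ah = [w for w, tag in wt_h if tag == "a"]
    n = max(len(at), len(ah))
    at += [""] * (n - len(at))
    ah += [""] * (n - len(ah))
    return list(zip(at, ah))


def modifier_feature(pairs, word_sim):
    """Per step T29: sum of pairwise word similarities divided by the number
    of adjective pairs (0.0 when there are no pairs)."""
    if not pairs:
        return 0.0
    return sum(word_sim(a, b) for a, b in pairs) / len(pairs)
```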

Step S5: assemble the feature list obtained in the steps above, select a support vector machine as the classifier, perform class-balance processing on the samples, and train; the trained model is denoted M;
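A hypothetical sketch of step S5 using scikit-learn (an assumed library choice, not named by the method); `class_weight='balanced'` stands in for the class-balance processing, and the RBF kernel is illustrative:

```python
# Assumed training sketch: X holds 9-dimensional feature vectors [f1..f9]
# from steps S41-S49, y holds 0/1 contradiction labels.
from sklearn.svm import SVC


def train_model(X, y):
    """Train the conflict-detection classifier M on the feature list."""
    model = SVC(kernel="rbf", class_weight="balanced")
    model.fit(X, y)
    return model
```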

A detection stage:

The management clause text pair T, H to be checked for text conflict is processed through the following steps:

Step T1: preprocess the text pair to be classified: extract the digit tuple sets digT and digH, the part-of-speech sets wtT and wtH, and the negation-word sets nwT and nwH; normalize the time representations in T and H and extract the time sets timeT and timeH;

Step T2: extract the feature list F = {f1, f2, ..., f9} from the sets obtained, specifically as follows:

Step T21: calculate the word overlap degree from wtT and wtH as the value of feature f1;

Step T22: take the absolute value of the difference between the lengths of T and H as the text length difference, the value of feature f2;

Step T23: calculate the Jaro-Winkler distance of T and H from wtT and wtH as the value of feature f3;

Step T24: from the words in wtT and wtH, obtain the sememe sets sT and sH of the texts T and H; sum and average the sememe vectors to obtain the vector representations vT and vH of the two texts, and compute their cosine similarity as the value of feature f4;

Step T25: calculate the negation-word feature from the negation-word sets nwT and nwH as the value of feature f5;

Step T26: using wtT and wtH, calculate the text semantic similarity combining HowNet and the synonym thesaurus as the value of feature f6;

Step T27: using digT and digH, judge the numeric range inclusion relationship of the elements of the two sets, i.e. whether a numeric conflict exists; if so, set f7 to 1, otherwise set f7 to 0;

Step T28: using timeT and timeH, judge the time range inclusion relationship of the elements of the two sets, i.e. whether a time conflict exists; if so, set f8 to 1, otherwise set f8 to 0;

Step T29: using the part-of-speech tags in wtT and wtH, extract from each set, in order, the words tagged as adjectives to form a new adjective pair set; compute the adjective similarity with the combined HowNet/Cilin word similarity, and divide the similarity sum by the total number of tuples in the set to obtain the modifier contradiction degree as the value of feature f9;

Step T3: input the feature list F of the text pair into the classification model M for classification. The output is 0 or 1: 1 indicates that the text pair is contradictory, i.e. the subordinate policy entry violates the superior policy entry; 0 indicates that it is not contradictory, i.e. the subordinate policy entry conforms to the content of the superior policy entry.
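The detection stage above can be tied together as a small driver (a hypothetical sketch: the nine feature functions and the trained model M are stand-ins for the real components of steps T21 to T29 and S5):

```python
def extract_features(T, H, feature_fns):
    """Apply the nine feature functions of steps T21-T29 in order to build
    the feature list F = [f1, ..., f9]."""
    return [fn(T, H) for fn in feature_fns]


def detect_conflict(T, H, feature_fns, model):
    """Return 1 if the subordinate clause H contradicts the superior clause T,
    else 0, using any model exposing a scikit-learn-style predict()."""
    return model.predict([extract_features(T, H, feature_fns)])[0]
```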

The above description is only an embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification made to the present invention using this concept shall fall within the protection scope of the present invention.
