Pseudo-fuzzy detection method in privacy policy document

Document No.: 1905183 Publication date: 2021-11-30

Reading note: This technology, "Pseudo-fuzzy detection method in privacy policy document", was designed by 连小利, 吕鹤阳, 黄丹 and 张莉 on 2021-08-26. Abstract: The invention discloses a pseudo-fuzzy detection method in a privacy policy document, which comprises the following steps: acquiring a privacy policy sample set, and summarizing and classifying the support modes of support statements in the sample set based on grounded theory to construct a pseudo-fuzzy detection model; acquiring the fuzzy statements of a privacy policy document to be detected using a fuzzy detection algorithm based on a deep neural network model; and, based on the fuzzy statements and the privacy policy document to be detected, performing a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model so as to identify potential pseudo-fuzzy statements. The invention adds a second detection pass over the fuzzy statements, which can screen out erroneous results from the first pass and improves detection accuracy.

1. A pseudo-fuzzy detection method in a privacy policy document, comprising:

acquiring a privacy policy sample set, and summarizing and classifying the support modes of support statements in the privacy policy sample set based on grounded theory to construct a pseudo-fuzzy detection model;

acquiring fuzzy statements of a privacy policy document to be detected using a fuzzy detection algorithm based on a deep neural network model;

and, based on the fuzzy statements and the privacy policy document to be detected, performing a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model so as to identify potential pseudo-fuzzy statements.

2. The method of claim 1, wherein summarizing and classifying the support modes of support statements in the privacy policy sample set based on grounded theory to construct a pseudo-fuzzy detection model comprises:

labeling the fuzzy words of each privacy policy document in the privacy policy sample set, and determining the fuzzy degree of each fuzzy statement containing the fuzzy words;

judging whether each fuzzy statement whose fuzzy degree is greater than a threshold has a support statement in the corresponding privacy policy document, so as to identify potential pseudo-fuzzy statements;

and analyzing the characteristics of, and the association between, the potential pseudo-fuzzy statements and their support statements, so as to classify the support modes of the support statements and determine a recognition algorithm for each support mode, thereby constructing the pseudo-fuzzy detection model.

3. The method of claim 2, wherein the support mode comprises a supplementary support mode;

and a recognition algorithm based on keyword matching and paragraph structure matching is designed for the supplementary support mode.

4. The method of claim 3, wherein performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements comprises:

performing sentence segmentation on the privacy policy document to be detected;

performing incomplete-statement identification on the segmented privacy policy document to identify start statements and enumeration statements;

and performing similarity detection between the fuzzy statement and the start statements and enumeration statements, and outputting each fuzzy statement whose similarity detection result is greater than a first set value as a potential pseudo-fuzzy statement.

5. The method of claim 2, wherein the support mode comprises an example support mode;

and a recognition algorithm based on keyword matching is designed for the example support mode.

6. The method of claim 5, wherein performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements comprises:

judging, based on the privacy policy document to be detected, whether the sentence following the fuzzy statement begins with "for example" or "for instance", and if so, outputting the fuzzy statement as a potential pseudo-fuzzy statement.

7. The method of claim 2, wherein the support mode comprises an interpretation support mode;

and a recognition algorithm based on keyword characteristics is designed for the interpretation support mode to recognize interpretation-type candidate statements and the interpreted words in those candidate statements.

8. The method of claim 7, wherein performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements comprises:

acquiring the interpretation statements in the privacy policy document to be detected by keyword matching;

extracting the interpreted words in the interpretation statements based on heuristic rules, according to the text content, the syntactic structure tree and the semantic dependency relations of the sentences in the privacy policy document to be detected;

and performing similarity detection between the interpreted words in the interpretation statements and the fuzzy words in the fuzzy statements, and outputting each fuzzy statement whose similarity detection result is greater than a second set value as a potential pseudo-fuzzy statement.

9. The method of claim 8, wherein the similarity detection comprises synonym determination and LCS-based phrase similarity detection.

10. The method of claim 1, wherein acquiring the fuzzy statements of the privacy policy document to be detected using a fuzzy detection algorithm based on a deep neural network model comprises:

performing sentence segmentation on the privacy policy document to be detected using the segmentation tool provided by the Stanford NLP Group;

and inputting the segmented privacy policy document into the fuzzy detection algorithm based on a deep neural network model to obtain the fuzzy statements of the privacy policy document to be detected.

Technical Field

The invention relates to the field of information technology processing, in particular to a method for detecting pseudo-ambiguity in a privacy policy document.

Background

In recent years, individuals and governments have paid increasing attention to user privacy. The privacy policy is a binding agreement between the enterprise and the user, and is the basis both for users to hold the enterprise accountable and for legal supervision, so its description must be accurate and unambiguous. Many commercial cases and academic studies have shown that privacy policies contain a great deal of ambiguity.

Existing research focuses only on fuzzy words or isolated statements in the privacy policy and does not consider the associations between contexts within the policy. This makes ambiguity detection inaccurate, because some statements flagged as ambiguous are in fact explained or supported elsewhere in the context of the privacy policy.

Disclosure of Invention

The embodiment of the invention provides a pseudo-fuzzy detection method in a privacy policy document, which addresses the problem in the prior art that ambiguity detection is not accurate enough because the contextual associations within a privacy policy are not considered during detection.

The pseudo-fuzzy detection method in the privacy policy document according to the embodiment of the invention comprises the following steps:

acquiring a privacy policy sample set, and summarizing and classifying the support modes of support statements in the privacy policy sample set based on grounded theory to construct a pseudo-fuzzy detection model;

acquiring a fuzzy statement of a privacy policy document to be detected based on a fuzzy detection algorithm of a deep neural network model;

and based on the fuzzy statements and the privacy policy document to be detected, performing potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model so as to identify potential pseudo-fuzzy statements.

According to some embodiments of the present invention, summarizing and classifying the support modes of support statements in the privacy policy sample set based on grounded theory to construct a pseudo-fuzzy detection model includes:

labeling the fuzzy words of each privacy policy document in the privacy policy sample set, and determining the fuzzy degree of each fuzzy statement containing the fuzzy words;

judging whether each fuzzy statement whose fuzzy degree is greater than a threshold has a support statement in the corresponding privacy policy document, so as to identify potential pseudo-fuzzy statements;

and analyzing the characteristics of, and the association between, the potential pseudo-fuzzy statements and their support statements, so as to classify the support modes of the support statements and determine a recognition algorithm for each support mode, thereby constructing the pseudo-fuzzy detection model.

According to some embodiments of the invention, the support mode comprises a supplementary support mode;

and a recognition algorithm based on keyword matching and paragraph structure matching is designed for the supplementary support mode.

According to some embodiments of the present invention, performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements includes:

performing sentence segmentation on the privacy policy document to be detected;

performing incomplete-statement identification on the segmented privacy policy document to identify start statements and enumeration statements;

and performing similarity detection between the fuzzy statement and the start statements and enumeration statements, and outputting each fuzzy statement whose similarity detection result is greater than a first set value as a potential pseudo-fuzzy statement.

According to some embodiments of the invention, the support mode comprises an example support mode;

and a recognition algorithm based on keyword matching is designed for the example support mode.

According to some embodiments of the present invention, performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements includes:

judging, based on the privacy policy document to be detected, whether the sentence following the fuzzy statement begins with "for example" or "for instance", and if so, outputting the fuzzy statement as a potential pseudo-fuzzy statement.

According to some embodiments of the invention, the support mode comprises an interpretation support mode;

and a recognition algorithm based on keyword characteristics is designed for the interpretation support mode to recognize interpretation-type candidate statements and the interpreted words in those candidate statements.

According to some embodiments of the present invention, performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements includes:

acquiring the interpretation statements in the privacy policy document to be detected by keyword matching;

extracting the interpreted words in the interpretation statements based on heuristic rules, according to the text content, the syntactic structure tree and the semantic dependency relations of the sentences in the privacy policy document to be detected;

and performing similarity detection between the interpreted words in the interpretation statements and the fuzzy words in the fuzzy statements, and outputting each fuzzy statement whose similarity detection result is greater than a second set value as a potential pseudo-fuzzy statement.

According to some embodiments of the invention, the similarity detection comprises synonym determination and LCS-based phrase similarity detection.

According to some embodiments of the present invention, acquiring the fuzzy statements of the privacy policy document to be detected using a fuzzy detection algorithm based on a deep neural network model includes:

performing sentence segmentation on the privacy policy document to be detected using the segmentation tool provided by the Stanford NLP Group;

and inputting the segmented privacy policy document into the fuzzy detection algorithm based on a deep neural network model to obtain the fuzzy statements of the privacy policy document to be detected.

By adopting the embodiment of the invention, the fuzzy sentence in the privacy policy document to be detected, which is acquired by the fuzzy detection algorithm based on the deep neural network model, is secondarily detected by using the detection method combined with the context of the privacy policy document, so that the potential pseudo-fuzzy sentence is effectively filtered, and the accuracy of the existing fuzzy detection method is improved.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. In the drawings:

FIG. 1 is a flow chart of the pseudo-fuzzy detection method in an embodiment of the present invention;

FIG. 2 is a flow chart of supplementary support mode detection in an embodiment of the present invention;

FIG. 3 is a flow chart of interpretation support mode detection in an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

An embodiment of the present invention provides a pseudo-fuzzy detection method in a privacy policy document, as shown in FIG. 1, including:

S1, acquiring a privacy policy sample set, and summarizing and classifying the support modes of support statements in the privacy policy sample set based on grounded theory to construct a pseudo-fuzzy detection model;

The privacy policy sample set is a collection comprising a number of privacy policy documents.

S2, acquiring fuzzy statements of the privacy policy document to be detected based on a fuzzy detection algorithm of the deep neural network model;

and S3, based on the fuzzy statements and the privacy policy document to be detected, performing potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements.

A potential pseudo-fuzzy statement is understood here as a fuzzy statement for which a support statement that explains it exists in the privacy policy document.

According to the embodiment of the invention, the fuzzy statements detected by the fuzzy detection algorithm based on a deep neural network model undergo a second, pseudo-fuzzy detection pass that uses the pre-built pseudo-fuzzy detection model together with the privacy policy document to be detected. This screens out erroneous results of the first pass and improves detection accuracy.
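The two-pass flow of steps S2 and S3 can be sketched as follows. This is only a minimal illustration under stated assumptions: the names `find_potential_pseudo_fuzzy`, `is_fuzzy` and `SupportDetector` are hypothetical and do not appear in the patent, the deep-neural-network fuzzy detector is represented by an arbitrary callable, and each support mode is represented by a pluggable detector function.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a support-mode detector receives the index of a
# fuzzy sentence and all sentences of the document, and returns True when
# it finds a supporting statement for that sentence.
SupportDetector = Callable[[int, Sequence[str]], bool]


def find_potential_pseudo_fuzzy(
    sentences: Sequence[str],
    is_fuzzy: Callable[[str], bool],
    detectors: Sequence[SupportDetector],
) -> List[int]:
    """Return indices of fuzzy sentences for which some detector finds support."""
    flagged = []
    for i, sentence in enumerate(sentences):
        # S2: first pass, stand-in for the neural fuzzy detection algorithm.
        if not is_fuzzy(sentence):
            continue
        # S3: second pass, context-aware check against each support mode.
        if any(detector(i, sentences) for detector in detectors):
            flagged.append(i)
    return flagged
```

In practice `is_fuzzy` would wrap the trained model and `detectors` would hold one function per support mode; the sketch only fixes the order of the two passes.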

On the basis of the above-described embodiment, various modified embodiments are further proposed, and it is to be noted herein that, in order to make the description brief, only the differences from the above-described embodiment are described in the various modified embodiments.

In some embodiments of the present invention, summarizing and classifying the support modes of support statements in the privacy policy sample set based on grounded theory to construct a pseudo-fuzzy detection model includes:

labeling the fuzzy words of each privacy policy document in the privacy policy sample set, and determining the fuzzy degree of each fuzzy statement containing the fuzzy words;

In some examples of the invention, a plurality of intervals reflecting the fuzzy degree may be set, each interval corresponding to a different fuzzy degree. For example, four intervals [1,2], (2,3], (3,4], (4,5] may be set, corresponding respectively to the four categories "clear", "slightly fuzzy", "fuzzy" and "extremely fuzzy".

and analyzing the characteristics of, and the association between, the potential pseudo-fuzzy statements and their support statements, so as to classify the support modes of the support statements and determine a recognition algorithm for each support mode, thereby constructing the pseudo-fuzzy detection model.

For example, in the first cycle, the Attribute Coding data processing strategy is used to analyze whether the fuzzy statements in a privacy policy have the "potential pseudo-fuzzy" attribute, that is, to judge whether they have support statements in the full text of the privacy policy. In this stage, two annotators A and B independently read the full text of the 15 privacy policies and judge, for each fuzzy statement, whether the full text contains a statement that supports the fuzzy statement or one of its fuzzy words. If so, the two statements are labeled as a <potential pseudo-fuzzy statement, support statement> pair.

In the second cycle, the Pattern Coding data processing strategy is applied to classify the support modes of the support statements. In this stage, annotators A and B first discuss and analyze the <potential pseudo-fuzzy statement, support statement> pairs labeled in the first cycle and retain only the pairs that both agree provide support for the fuzzy statement, which ensures the accuracy and consistency of the annotation data. They then classify the support relation of each support statement to its potential pseudo-fuzzy statement and draw up a classification guideline. Next, a third annotator C independently reads the 15 privacy policies, labels the potential pseudo-fuzzy statements and their support statements, and classifies the pairs according to the classification guideline. Finally, annotators A, B and C discuss the annotations together, compare C's results with those of A and B, and revise the annotated samples until final agreement is reached. They also propose improvements to the classification guideline and refine the support mode classification.

In this two-cycle coding process, the labeled potential pseudo-fuzzy statements and support statements are fully discussed, so the final result is accurate and consistent. The method also uses them as samples for analyzing the identification rules of potential pseudo-fuzzy statements and their support statements.

In some embodiments of the invention, the support mode comprises a supplementary support mode, and a recognition algorithm based on keyword matching and paragraph structure matching is designed for the supplementary support mode.

It is noted here that, when reading privacy policy documents, one often finds that complex concepts or facts are explained item by item when they are introduced. To a human reader this is a clear form of expression. Natural-language sentence segmentation, however, tends to split such itemized text into separate sentences, and without context these sentences are misjudged as fuzzy by current deep learning algorithms. Such incomplete sentences come in two kinds, start statements and enumeration statements, which complement each other. The start statement is an overview of the enumeration statements and states the target they describe; the enumeration statements refine the start statement item by item. The embodiment of the present invention defines the target specification that the start statement provides for the enumeration statements, and the detailed refinement that the enumeration statements provide for the start statement, as the supplementary support mode.

According to some embodiments of the present invention, performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements includes:

performing sentence segmentation on the privacy policy document to be detected; for example, the paragraph structure information may be retained during segmentation by placing the sentences that belong to the same paragraph in one list.

performing incomplete-statement identification on the segmented privacy policy document to identify start statements and enumeration statements. The start statement is an overview of the enumeration statements and states the target they describe; it usually ends with a colon, indicating that what follows itemizes the present sentence. The enumeration statements refine the start statement item by item and have more characteristics, including: i) punctuation characteristics: a single enumeration statement ends with ";", and the enumeration as a whole ends with "-"; ii) sequence characteristics: the statements are ordered by leading numbers, letters, Roman numerals, or the like; iii) paragraph characteristics: the enumeration statements are a plurality of paragraphs each beginning with a subject term, each subject being one aspect of the complex concept being expressed; iv) specific expression characteristics: today's information systems do not exist in isolation, and most use some third-party services; such third-party services are generally not explained, and a link to the website of the third-party service is given directly instead.

Based on the five heuristic rules summarized above, a regular-expression matching algorithm and a paragraph structure matching algorithm can be adopted to automatically identify the supplementary support mode (start statements and enumeration statements). Because the two kinds of statement are adjacent in the privacy policy, the start statement may be identified first, and it may then be determined whether the statements immediately following it have the characteristics of enumeration statements. The statement recognition process for the supplementary support mode is shown in FIG. 2.

and performing similarity detection between the fuzzy statement and the start statements and enumeration statements, and outputting each fuzzy statement whose similarity detection result is greater than a first set value as a potential pseudo-fuzzy statement.

The first set value can be chosen flexibly according to the sensitivity required of the detection.
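The recognition of start statements and enumeration statements described above can be illustrated with a short sketch. It is an assumption-laden simplification, not the patented algorithm: the regular expression `ENUM_MARKER` and all function names are hypothetical, only the colon rule, the punctuation rule (trailing ";") and the sequence rule (leading number, letter, Roman numeral or bullet) are covered, and the paragraph and third-party-link characteristics are omitted.

```python
import re
from typing import List, Tuple

# Hypothetical pattern for the sequence characteristic: a leading marker such
# as "1.", "(a)", "iv)" or a bullet character, followed by whitespace.
ENUM_MARKER = re.compile(r"^\s*(?:\(?(?:\d+|[a-zA-Z]|[ivxIVX]+)[.)]|[-*•])\s+")


def is_start_statement(sentence: str) -> bool:
    """A start statement usually ends with a colon."""
    return sentence.rstrip().endswith(":")


def is_enumeration_statement(sentence: str) -> bool:
    """Sequence characteristic (leading marker) or punctuation characteristic (trailing ';')."""
    s = sentence.strip()
    return bool(ENUM_MARKER.match(s)) or s.endswith(";")


def find_supplementary_groups(paragraph: List[str]) -> List[Tuple[int, List[int]]]:
    """Return (start_index, enumeration_indices) for every start statement that is
    immediately followed by at least one enumeration statement."""
    groups = []
    i = 0
    while i < len(paragraph):
        if is_start_statement(paragraph[i]):
            j = i + 1
            items = []
            # Collect the run of enumeration statements right after the start statement.
            while j < len(paragraph) and is_enumeration_statement(paragraph[j]):
                items.append(j)
                j += 1
            if items:
                groups.append((i, items))
                i = j
                continue
        i += 1
    return groups
```

Identifying the start statement first and then scanning the statements that follow mirrors the order suggested in the text; a fuzzy statement that matches a member of a returned group would then go on to the similarity check against the first set value.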

According to some embodiments of the invention, the support mode comprises an example support mode, and a recognition algorithm based on keyword matching is designed for the example support mode.

When setting forth an important fact, or a matter that is hard to understand, people generally like to give examples. Such illustrative statements can help the user understand a fuzzy statement to some extent. The embodiment of the invention classifies the statements in the privacy policy that give examples for fuzzy statements into the example support mode.

In some embodiments of the present invention, performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements includes:

judging, based on the privacy policy document to be detected, whether the sentence following the fuzzy statement begins with "for example" or "for instance", and if so, outputting the fuzzy statement as a potential pseudo-fuzzy statement.
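The keyword check for the example support mode is simple enough to sketch directly. The function name `example_supported` and the regular expression are illustrative assumptions; the rule itself (the next sentence opens with "for example" or "for instance") is the one stated above.

```python
import re
from typing import List, Sequence

# Matches a sentence that opens with "for example" or "for instance".
EXAMPLE_OPENING = re.compile(r"^\s*for\s+(?:example|instance)\b", re.IGNORECASE)


def example_supported(fuzzy_indices: Sequence[int], sentences: Sequence[str]) -> List[int]:
    """Return the fuzzy-sentence indices whose following sentence opens with
    'for example' or 'for instance'."""
    return [
        i
        for i in fuzzy_indices
        if i + 1 < len(sentences) and EXAMPLE_OPENING.match(sentences[i + 1])
    ]
```

The indices passed in are assumed to come from the first-pass fuzzy detection; the check looks only at the immediately following sentence, as in the described rule.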

In some embodiments of the invention, the support mode comprises an interpretation support mode, and a recognition algorithm based on keyword characteristics is designed for the interpretation support mode to recognize interpretation-type candidate statements and the interpreted words in those candidate statements.

A statement in the interpretation support mode is a statement in the privacy policy that explains a fuzzy word of a fuzzy statement.

In some embodiments of the present invention, performing, based on the fuzzy statements and the privacy policy document to be detected, a potential pseudo-fuzzy judgment on each fuzzy statement according to the pseudo-fuzzy detection model to identify potential pseudo-fuzzy statements includes:

extracting the interpretation statements in the privacy policy sample set and analyzing their features to obtain recognition rules for interpretation statements; for example, acquiring the interpretation statements in the privacy policy document to be detected by keyword matching;

extracting the interpreted words in the interpretation statements based on heuristic rules, according to the text content, the syntactic structure tree and the semantic dependency relations of the sentences in the privacy policy document to be detected;

and performing similarity detection between the interpreted words in the interpretation statements and the fuzzy words in the fuzzy statements, and outputting each fuzzy statement whose similarity detection result is greater than a second set value as a potential pseudo-fuzzy statement.

The second set value can be chosen flexibly according to the sensitivity required of the detection.

According to some embodiments of the invention, the similarity detection includes synonym determination for fuzzy words and LCS-based phrase similarity detection.
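A possible form of the two-part similarity check is sketched below. Several points are assumptions rather than statements of the patent: LCS is taken to mean the longest common subsequence over word tokens, the score is normalized by the length of the longer phrase, and synonyms come from a caller-supplied table (the patent does not name a synonym resource). All function names are hypothetical.

```python
from typing import Dict, List, Set


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def phrase_similarity(p: str, q: str) -> float:
    """LCS length over lower-cased word tokens, normalized by the longer phrase."""
    a, b = p.lower().split(), q.lower().split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_similar(p: str, q: str, synonyms: Dict[str, Set[str]], threshold: float) -> bool:
    """Synonym lookup first, then LCS-based similarity compared against the set value."""
    p_l, q_l = p.lower(), q.lower()
    if q_l in synonyms.get(p_l, set()) or p_l in synonyms.get(q_l, set()):
        return True
    return phrase_similarity(p, q) > threshold
```

For instance, "personal information" and "personal identifiable information" share a two-token subsequence out of three tokens, so they pass a set value of 0.6 but not one of 0.7.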

In some embodiments of the present invention, acquiring the fuzzy statements of the privacy policy document to be detected using a fuzzy detection algorithm based on a deep neural network model includes:

performing sentence segmentation on the privacy policy document to be detected using the segmentation tool provided by the Stanford NLP Group;

and inputting the segmented privacy policy document into the fuzzy detection algorithm based on a deep neural network model to obtain the fuzzy statements of the privacy policy document to be detected.

The pseudo-fuzzy detection method in a privacy policy document according to an embodiment of the present invention is described in detail below in a specific embodiment with reference to FIGS. 2-3. It is to be understood that the following description is illustrative only and is not intended to be limiting in any way. All similar structures and similar variations adopted by the invention are intended to fall within the scope of the invention.

Fifteen company privacy policy documents are randomly selected from the Logan corpus for labeling and analyzing potential pseudo-fuzzy statements and their support statements. Logan Lebanoff provides a website privacy policy corpus containing 100 website privacy policies. These privacy policies were collected through Amazon Mechanical Turk and come from the most frequently visited websites in 15 categories (arts, business, computers, science, shopping, sports, etc.). The corpus totals 133K words and 4.5K sentences.

The corpus labels the fuzzy words and the fuzzy degree of each privacy policy statement by crowdsourcing. Five annotators are recruited for each privacy policy statement; each annotator marks the fuzzy words in the statement and scores the fuzzy degree of the statement from 1 to 5. The scores of the five annotators are then averaged, and the average falls into one of four intervals [1,2], (2,3], (3,4], (4,5], corresponding respectively to the four categories "clear", "slightly fuzzy", "fuzzy" and "extremely fuzzy".
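The mapping from annotator scores to a fuzziness category can be written out as a small function. The name `fuzziness_category` is hypothetical; the intervals and category labels are the ones given above.

```python
from statistics import mean
from typing import Sequence


def fuzziness_category(scores: Sequence[float]) -> str:
    """Average the annotators' 1-5 scores and map the average to its interval:
    [1,2] clear, (2,3] slightly fuzzy, (3,4] fuzzy, (4,5] extremely fuzzy."""
    avg = mean(scores)
    if not 1 <= avg <= 5:
        raise ValueError("average score must lie in [1, 5]")
    if avg <= 2:
        return "clear"
    if avg <= 3:
        return "slightly fuzzy"
    if avg <= 4:
        return "fuzzy"
    return "extremely fuzzy"
```

With five annotators, scores of 3, 3, 4, 4, 4 average to 3.6 and fall in (3,4], so the statement is categorized as "fuzzy".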

Because the method studies fuzzy statements in privacy policies, non-fuzzy statements in the privacy policy sample set are filtered out first: statements classified as "clear", i.e. those with an average fuzzy degree lower than 2 points, are removed. The privacy policy sample set finally used for support mode classification analysis includes: (1) the texts of 15 privacy policies, expressed in XML format, with each privacy policy divided into several paragraphs, each with a title; and (2) the manually annotated gold standard, expressed in JSON format, comprising the fuzzy statements, the fuzzy words in each statement, and the fuzzy degree score of each statement.

In the first period, attribute coding (Attributecoding) data processing strategies are used, and the method is used for analyzing whether fuzzy statements in the privacy policy have "potential pseudo-fuzzy" attributes, namely judging whether the fuzzy statements have supporting statements in the privacy policy full text. In this stage, two annotators a and B independently read the full text of 15 privacy policies and judge whether the fuzzy sentence sets have sentences supporting the two annotators or a certain fuzzy word in the full text of the privacy policy. If so, the label is < potential pseudo-fuzzy statement, support statement > statement pair. In the second period, a pattern coding (pattern) data processing strategy is applied to classify the support pattern of the support statement. In the stage, annotators A and B firstly discuss and analyze < potential pseudo-fuzzy sentences and support sentences > sentence pairs annotated in a first period, and reserve the sentence pairs which are considered by the annotators A and B to have a support effect on the fuzzy sentences, so that the accuracy and consistency of annotation data are ensured. And then classifying the support relation of the support statement to the potential pseudo-fuzzy statement to make a classification guide. And then, a third annotator C is enabled to independently read the 15 privacy policies, annotate the potential pseudo-fuzzy sentences and the supporting sentences thereof, and classify the sentence pairs according to the classification guidelines. Finally, the annotators ABC discuss the annotation together, compare and analyze the annotation result of C with the annotation result of AB, and reasonably improve the annotation samples to achieve the final consistency. And put forward the improvement suggestion to the classification guideline, refine the support mode classification.

In the two-period coding process, the annotated potential pseudo-fuzzy sentences and their supporting sentences are fully discussed, so that the final result is accurate and consistent. The method also uses them as the samples from which the identification rules for potential pseudo-fuzzy sentences and their supporting sentences are analyzed.

Based on grounded theory, the method divides the potential pseudo-fuzzy sentences into four types according to the support relation of the supporting sentences to the potential pseudo-fuzzy sentences: potential pseudo-fuzzy statements describing phenomena, supplementarily supported potential pseudo-fuzzy statements, exemplarily supported potential pseudo-fuzzy statements, and interpretively supported potential pseudo-fuzzy statements. Of these, the potential pseudo-fuzzy statements describing phenomena have no supporting statement. Such statements describe characteristics of other things that are only weakly related to the core content discussed in the privacy policy. The method does not process this type: the concepts involved are broad and concern domain knowledge of specific applications and products, so they are difficult to identify uniformly.

According to the above labeling and analysis of the original data set, the support statements are classified into the following three support modes:

1. Supplementary support mode

Because the content of some statements is complicated, it may be set forth item by item. Statements of this type in a privacy policy typically consist of a start statement and several enumeration item statements: the start statement is a summary of the enumeration item statements and states their target, while the enumeration item statements refine the start statement item by item. During sentence segmentation these sentences are usually split apart, so the start statement and the enumeration item statements seen by current deep learning algorithms are each incomplete. In such sentences, the start statement and the enumeration item statements complement each other. The method therefore defines the target specification that a start statement gives to its enumeration item statements, together with the detailed statement that the enumeration item statements give to the start statement, as the supplementary support mode.

In reading privacy policy documents, it is found that complex concepts or facts are often explained item by item when they are introduced. To a human reader of the privacy policy, this is a clear way of expression. However, natural-language sentence segmentation tends to split such statements apart, and without their context these statements are misjudged as fuzzy. There are two kinds of incomplete sentence: start statements and enumeration item statements. A start statement is an overview of the enumeration item statements and states their target, and an enumeration item statement is an item-by-item refinement of the start statement.

The invention extracts all incomplete statements in the 15 privacy policies, performs feature analysis on their text content and paragraph structure, and summarizes the features of the supplementary support mode. A start statement usually ends with a colon, indicating that what follows is an itemized statement of the present sentence. Enumeration item statements have more features, including: i) punctuation features: an individual enumeration item statement ends with ";", and the enumeration as a whole ends with a terminating mark (given as "-" in the source text); ii) sequence features: the sentences begin with numbers, letters, Roman numerals, or the like; iii) paragraph features: the enumeration item statements are several paragraphs each beginning with a subject term, each subject belonging to one aspect of the complex concept being expressed; iv) specific expression features: today's information systems do not exist in isolation, and most use some third-party services. When a third-party service is mentioned, it is usually not explained; instead, the website link of the third-party service is given directly.

Based on the five heuristic rules summarized above, the method adopts a regular-expression matching algorithm and a paragraph structure matching algorithm to identify the supplementary support mode (start statement and enumeration item statements) automatically. Because the two kinds of statement are located close together in the privacy policy, the start statement is identified first, and it is then determined whether the statements immediately following the start statement match the enumeration item features. The sentence recognition process for the supplementary support mode is shown in FIG. 2. First, the XML privacy policy original text is segmented into sentences while its paragraph structure information is kept, and the sentences belonging to the same paragraph are placed in one list. Start-statement identification is then performed on the segmented privacy policy, and once a start statement is identified, enumeration item identification is performed on the sentences that follow it. Finally, it is judged whether any fuzzy sentence exists in the identified <start statement, enumeration item> sentence set, and if so, the potential pseudo-fuzzy sentences and their supplementary support sentences are output.
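The recognition flow described above can be sketched as follows. This is a minimal illustration and not the actual rule set of the method: the regular expressions, the function names, and the simplification that a paragraph is given as an ordered list of sentence strings are all assumptions, and only the colon rule, the sequence-marker rule and the semicolon rule are covered.

```python
import re

# Assumed pattern: a start statement ends with a colon.
START_RE = re.compile(r"[:：]\s*$")
# Assumed pattern: an enumeration item begins with a number, letter,
# Roman numeral, or bullet, e.g. "1.", "(a)", "ii)", "-".
ENUM_RE = re.compile(r"^\s*(?:\(?(?:\d+|[a-zA-Z]|[ivxIVX]+)[.)]|[-*•])\s+")


def is_start_statement(sentence):
    return bool(START_RE.search(sentence))


def is_enumeration_item(sentence):
    s = sentence.strip()
    return bool(ENUM_RE.match(s)) or s.endswith(";")


def find_supplementary_pairs(paragraph, fuzzy):
    """paragraph: sentences of one paragraph, in document order.
    fuzzy: set of sentences flagged as fuzzy by the first detection pass.
    Returns (potential pseudo-fuzzy sentence, supporting sentences) pairs."""
    results = []
    i = 0
    while i < len(paragraph):
        if is_start_statement(paragraph[i]):
            j = i + 1
            while j < len(paragraph) and is_enumeration_item(paragraph[j]):
                j += 1
            if j > i + 1:  # at least one enumeration item follows
                group = paragraph[i:j]
                for s in group:
                    if s in fuzzy:
                        results.append((s, [t for t in group if t != s]))
                i = j
                continue
        i += 1
    return results


paragraph = [
    "We may share your data with:",
    "(a) service providers;",
    "(b) certain business partners.",
    "Contact us for details.",
]
fuzzy = {"(b) certain business partners."}
pairs = find_supplementary_pairs(paragraph, fuzzy)
print(pairs)
```

Identifying the start statement first and then scanning forward keeps the check local to adjacent sentences, which matches the observation that the two kinds of statement lie close together in the policy text.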

2. Example support mode

When setting forth an important fact or a matter that is hard to understand, writers generally prefer to give examples. Such illustrative statements can help the user understand a fuzzy statement to some extent. Statements in the privacy policy that exemplify a fuzzy statement are referred to herein as the example support mode.

Text analysis of the privacy policies shows that most supporting sentences that exemplify the preceding sentence begin with an obvious keyword, "for example" or "for instance". A very small number of exemplifying support statements do not begin with "for example" or "for instance". Judging such sentences requires a deep understanding of sentence semantics and is extremely difficult, so the method does not yet recognize example sentences that lack these feature words.

For the example support mode, the matching rule directly determines whether the sentence following the current fuzzy sentence begins with "for example" or "for instance". If so, the current sentence is a potential pseudo-fuzzy sentence and the next sentence is its supporting sentence.
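The matching rule above can be sketched in a few lines. The function name and the representation of the document as an ordered list of sentence strings are assumptions made for illustration; case-insensitive matching is also an assumption.

```python
import re

# The next sentence must begin with "for example" or "for instance".
EXAMPLE_RE = re.compile(r"^\s*for\s+(?:example|instance)\b", re.IGNORECASE)


def find_example_pairs(sentences, fuzzy):
    """sentences: all sentences of the policy, in document order.
    fuzzy: set of sentences flagged as fuzzy by the first detection pass.
    Returns (potential pseudo-fuzzy sentence, supporting sentence) pairs."""
    pairs = []
    for current, following in zip(sentences, sentences[1:]):
        if current in fuzzy and EXAMPLE_RE.match(following):
            pairs.append((current, following))
    return pairs


sentences = [
    "We collect some device information.",
    "For example, we record your browser type.",
    "We keep data as long as necessary.",
]
fuzzy = {sentences[0], sentences[2]}
example_pairs = find_example_pairs(sentences, fuzzy)
print(example_pairs)
```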

3. Explanation support mode

A supporting statement in the explanation support mode is a statement in the privacy policy that explains a particular fuzzy word of a fuzzy statement.

The method classifies the statements in the privacy policy original text that explain the fuzzy words of fuzzy statements into the explanation support mode. Explanatory support sentences and their potential pseudo-fuzzy sentences are generally distributed in different sections of the document and are difficult to recognize, so the recognition of this mode is the focus of the research herein. Its flow chart is shown in FIG. 3 and mainly includes the following three steps:

(1) Recognizing explanatory sentences

In this stage, feature analysis is performed on the samples of explanatory support statements, the identification rules for explanatory statements are defined, and an identification algorithm is implemented that extracts the candidate set of explanatory statements from the privacy policy.

(2) Extracting the interpreted words of explanatory sentences

In this stage, feature analysis is performed on the candidate explanatory statements from three aspects, namely text content, syntactic parse trees and semantic dependency relations, in order to define heuristic rules for extracting the interpreted word in each statement. An interpreted-word extraction algorithm is then implemented according to these rules, and it outputs the interpreted words of the candidate explanatory sentences.

(3) Matching fuzzy statements and explanatory support statements

The bridge linking a fuzzy sentence and an explanatory support sentence is a term that is both a fuzzy word of the fuzzy sentence and the interpreted word of the explanatory sentence. The fuzzy words of all fuzzy statements in the privacy policy are matched against the interpreted words of the candidate explanatory statements. If a fuzzy word of a fuzzy sentence is similar to the interpreted word of an explanatory sentence, the fuzzy sentence is a potential pseudo-fuzzy sentence and the explanatory sentence is its supporting sentence.

According to some embodiments of the invention, the 15 privacy policies analyzed and labeled on the basis of grounded theory are used as the training data set, and the supporting statements are classified into three support modes, namely the supplementary support mode, the example support mode and the explanation support mode, according to the support relation of the supporting statements to the potential pseudo-fuzzy statements. The text features of the supporting sentences in each mode are then analyzed manually and heuristic rules for mode identification are defined. Five identification rules are provided for the supplementary support mode, and one identification rule is provided for the example support mode. The identification of the explanation support mode is more complex and comprises three steps. (i) Candidate explanatory sentences are acquired by keyword matching. (ii) The text content, syntactic structure trees and semantic dependency relations of the sentences are analyzed manually, five heuristic rules for extracting interpreted words are defined, and the interpreted words of all explanatory sentences in the privacy policy are extracted according to these rules. (iii) Similarity detection is carried out between the interpreted words of the explanatory sentences and the fuzzy words of the fuzzy sentences of the privacy policy, in order to identify the potential pseudo-fuzzy sentences and supporting sentences of the explanation support mode. The similarity detection comprises synonymous-term judgment and LCS-based phrase similarity detection.
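Step (iii) can be illustrated with a minimal sketch under stated assumptions: "LCS" is taken here to mean the longest common subsequence over word tokens, the normalization 2·LCS/(len(a)+len(b)), the threshold of 0.8 and the small synonym table are all hypothetical, and the function names are chosen for illustration only. The source does not specify these details.

```python
# Hypothetical synonym table; a real system would use a curated resource.
SYNONYMS = {frozenset({"personal data", "personal information"})}


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def phrase_similarity(p, q):
    """Normalized word-level LCS similarity in [0, 1] (assumed formula)."""
    a, b = p.lower().split(), q.lower().split()
    if not a or not b:
        return 0.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def is_match(fuzzy_word, interpreted_word, threshold=0.8):
    f, w = fuzzy_word.lower().strip(), interpreted_word.lower().strip()
    if f == w or frozenset({f, w}) in SYNONYMS:
        return True  # identical or synonymous terms
    return phrase_similarity(f, w) >= threshold


def match_pairs(fuzzy_sentences, explanations):
    """fuzzy_sentences: list of (sentence, [fuzzy words]).
    explanations: list of (explanatory sentence, interpreted word).
    Returns (potential pseudo-fuzzy sentence, supporting sentence) pairs."""
    pairs = []
    for fuzzy_sentence, words in fuzzy_sentences:
        for explanation, interpreted in explanations:
            if fuzzy_sentence != explanation and any(
                is_match(w, interpreted) for w in words
            ):
                pairs.append((fuzzy_sentence, explanation))
    return pairs


fuzzy_sentences = [("We may process your personal data.", ["personal data"])]
explanations = [
    ("Personal information means any information that identifies you.",
     "Personal information"),
    ("Cookies are small text files.", "Cookies"),
]
matched = match_pairs(fuzzy_sentences, explanations)
print(matched)
```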

For each of the three support modes, heuristic rules for identifying the supporting statements and heuristic rules for matching fuzzy statements with supporting statements are defined, and an algorithm for identifying potential pseudo-fuzzy statements and their supporting statements is provided on the basis of these heuristic rules.

Those skilled in the art will appreciate that all or part of the flow of the method in the above embodiments may be implemented by a computer program instructing the related hardware, the program being stored in a computer-readable storage medium. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.

Compared with the prior art, the invention performs fuzzy detection in combination with the context of the privacy policy: first, the fuzzy sentences and fuzzy words in a privacy policy are identified with an existing fuzziness detection algorithm; then the potential pseudo-fuzzy sentences are filtered out by identifying whether each fuzzy sentence has a supporting sentence, which improves the accuracy of the existing fuzzy detection method.

It should be noted that the above-mentioned embodiments are merely preferred embodiments of the present invention, and the present invention is not limited thereto, and various combinations of the embodiments of the present invention can be freely implemented, and various modifications and changes can be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
