Document analysis method and device, intelligent terminal and storage medium

Document No.: 701001. Published: 2021-04-13.

Reading note: This technology, "Document analysis method and device, intelligent terminal and storage medium", was designed by 杨智威 on 2020-12-29. Main content: The embodiments of the present application disclose a document analysis method and device, an intelligent terminal, and a storage medium. The method comprises: acquiring a first document and a second document to be analyzed; splitting the first document into M content segments and the second document into N content segments, where M and N are positive integers; inputting the M content segments and the N content segments into an analysis model and acquiring the similarity analysis result output by the analysis model; selecting P similarity values from the M groups of similarity values included in the similarity analysis result, where P is a positive integer; and determining the similarity between the first document and the second document according to the P similarity values. The invention can better capture the overall information of a document and improve the accuracy of document comparison.

1. A method of document analysis, comprising:

acquiring a first document and a second document to be analyzed;

splitting the first document into M content segments, and splitting the second document into N content segments, wherein M and N are positive integers;

inputting the M content segments and the N content segments into an analysis model, and obtaining a similarity analysis result output by the analysis model, wherein the similarity analysis result comprises M groups of similarity values, and the similarity value between any one of the M content segments and each of the N content segments obtained by analysis of the analysis model forms a group of similarity values;

selecting P similarity values from M groups of similarity values included in the similarity analysis result, wherein P is a positive integer;

and determining the similarity between the first document and the second document according to the P similarity values.

2. The method of claim 1, wherein splitting the first document into M content segments and the second document into N content segments comprises:

performing content analysis on the first document according to a target symbol group, determining segmentation and splitting position information in the first document, and splitting the first document into M content segments according to the segmentation and splitting position information;

analyzing the content of the second document according to a target symbol group, determining segmentation and splitting position information in the second document, and splitting the second document into N content segments according to the segmentation and splitting position information;

the target symbol group includes: any one or more of a symbol group consisting of periods and carriage return symbols, a symbol group consisting of question marks and carriage return symbols, and a symbol group consisting of exclamation marks and carriage return symbols.

3. The method of claim 1, wherein inputting the M content segments and the N content segments into an analysis model and obtaining a similarity analysis result output by the analysis model comprises:

inputting the M content segments and the N content segments into a first embedding layer of the analysis model and a second embedding layer of the analysis model, respectively;

converting, by the first embedding layer of the analysis model, the M content segments into M first feature vectors;

converting, by the second embedding layer of the analysis model, the N content segments into N second feature vectors;

respectively carrying out memory processing on the M first feature vectors and the N second feature vectors through two long short-term memory networks (LSTMs) of the analysis model to obtain M third feature vectors and N fourth feature vectors;

and inputting the M third feature vectors and the N fourth feature vectors into a semantic matching layer of the analysis model for semantic matching to obtain a similarity analysis result output by the analysis model.

4. The method of claim 3,

the semantic matching layer of the analysis model comprises a splicing layer, a Dropout layer, and a fully connected layer, wherein the splicing layer is used for splicing the M third feature vectors and the N fourth feature vectors, the Dropout layer is used for preventing overfitting, and the fully connected layer is used for determining similarity values between the M third feature vectors and the N fourth feature vectors, so that the similarity analysis result is obtained according to the similarity values determined by the fully connected layer.

5. The method of claim 1, wherein P = M, and said selecting P similarity values from the M groups of similarity values comprised by the similarity analysis result comprises:

and selecting the maximum similarity value from each group of similarity values of the M groups of similarity values included in the similarity analysis result to obtain M maximum similarity values.

6. The method of claim 1, wherein said determining a similarity between said first document and said second document based on said P similarity values comprises:

and averaging the P similarity values to obtain a similarity value between the first document and the second document.

7. The method of any of claims 1 to 6, further comprising:

obtaining a training sample, wherein the training sample comprises a first training document and a second training document, the first training document comprises X first training content segments, the second training document comprises Y second training content segments, and X and Y are positive integers;

inputting X first training content segments included by the first training document, Y second training content segments included by the second training document and labeling information used for representing the similarity between the first training content segments and the second training content segments into an initial model, and acquiring a similarity training analysis result output by the initial model;

and optimizing and updating the initial model according to the correlation between the similarity training analysis result and the labeling information.

8. A document analysis apparatus, comprising:

the acquisition module is used for acquiring a first document and a second document to be analyzed;

a splitting module, configured to split the first document into M content segments, and split the second document into N content segments, where M and N are positive integers;

the processing module is used for inputting the M content segments and the N content segments into an analysis model and acquiring a similarity analysis result output by the analysis model, wherein the similarity analysis result comprises M groups of similarity values, and the similarity value between any one of the M content segments and each of the N content segments obtained by analysis of the analysis model forms a group of similarity values;

a selecting module, configured to select P similarity values from M groups of similarity values included in the similarity analysis result, where P is a positive integer;

and the determining module is used for determining the similarity between the first document and the second document according to the P similarity values.

9. An intelligent terminal, characterized in that the intelligent terminal comprises a storage device and a processor, wherein the storage device stores a computer program, and the processor calls the computer program to realize the method according to any one of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of computer application, in particular to a document analysis method and device, an intelligent terminal and a storage medium.

Background

Document analysis technology is widely applied in repetition-rate comparison scenarios such as duplicate checking of graduation theses and copyright comparison. In most of these scenarios, duplicate checking and comparison are still performed by manual reading, which is inefficient, applies inconsistent review standards, and cannot guarantee accuracy.

In a small number of these scenarios, the technique generally used is to divide an article into sentences, segment each sentence into words and remove unnecessary words to obtain effective feature words, and then compare the similarity of the two articles based on a large number of feature words. However, this approach does not consider semantic information: it compares only at feature-word granularity, cannot reflect contextual relationships, and yields low similarity comparison accuracy.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present application is to provide a document analysis method, a document analysis device, an intelligent terminal, and a storage medium, which can improve the accuracy of similarity comparison between documents.

In one aspect, an embodiment of the present application provides a document analysis method, including:

acquiring a first document and a second document to be analyzed;

splitting the first document into M content segments, and splitting the second document into N content segments, wherein M and N are positive integers;

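inputting the M content segments and the N content segments into an analysis model, and obtaining a similarity analysis result output by the analysis model, wherein the similarity analysis result comprises M groups of similarity values, and the similarity values between any one of the M content segments and each of the N content segments obtained by analysis of the analysis model form one group of similarity values;

selecting P similarity values from M groups of similarity values included in the similarity analysis result, wherein P is a positive integer;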

and determining the similarity between the first document and the second document according to the P similarity values.

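In one embodiment, the splitting the first document into M content segments and the splitting the second document into N content segments includes:

performing content analysis on the first document according to a target symbol group, determining segmentation and splitting position information in the first document, and splitting the first document into M content segments according to the segmentation and splitting position information;

performing content analysis on the second document according to the target symbol group, determining segmentation and splitting position information in the second document, and splitting the second document into N content segments according to the segmentation and splitting position information;

wherein the target symbol group includes any one or more of: a symbol group consisting of a period and a carriage return symbol, a symbol group consisting of a question mark and a carriage return symbol, and a symbol group consisting of an exclamation mark and a carriage return symbol.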

In one embodiment, the inputting the M content segments and the N content segments into an analysis model and obtaining a similarity analysis result output by the analysis model includes:

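inputting the M content segments and the N content segments into a first embedding layer of the analysis model and a second embedding layer of the analysis model, respectively;

converting, by the first embedding layer of the analysis model, the M content segments into M first feature vectors;

converting, by the second embedding layer of the analysis model, the N content segments into N second feature vectors;

respectively carrying out memory processing on the M first feature vectors and the N second feature vectors through two long short-term memory networks (LSTMs) of the analysis model to obtain M third feature vectors and N fourth feature vectors;

and inputting the M third feature vectors and the N fourth feature vectors into a semantic matching layer of the analysis model for semantic matching to obtain a similarity analysis result output by the analysis model.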

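In one embodiment, the semantic matching layer of the analysis model comprises a splicing layer, a Dropout layer, and a fully connected layer, wherein the splicing layer is used for splicing the M third feature vectors and the N fourth feature vectors, the Dropout layer is used for preventing overfitting, and the fully connected layer is used for determining similarity values between the M third feature vectors and the N fourth feature vectors.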

In one embodiment, P = M, and the selecting P similarity values from the M groups of similarity values included in the similarity analysis result includes:

and selecting the maximum similarity value from each group of similarity values of the M groups of similarity values included in the similarity analysis result to obtain M maximum similarity values.

In one embodiment, the determining the similarity between the first document and the second document according to the P similarity values includes:

and averaging the P similarity values to obtain a similarity value between the first document and the second document.

In one embodiment, the method further comprises:

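obtaining a training sample, wherein the training sample comprises a first training document and a second training document, the first training document comprises X first training content segments, the second training document comprises Y second training content segments, and X and Y are positive integers;

inputting the X first training content segments included in the first training document, the Y second training content segments included in the second training document, and labeling information representing the similarity between the first training content segments and the second training content segments into an initial model, and acquiring a similarity training analysis result output by the initial model;

and optimizing and updating the initial model according to the correlation between the similarity training analysis result and the labeling information.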

On the other hand, an embodiment of the present application further provides a document analysis apparatus, including:

the acquisition module is used for acquiring a first document and a second document to be analyzed;

a splitting module, configured to split the first document into M content segments, and split the second document into N content segments, where M and N are positive integers;

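a processing module, configured to input the M content segments and the N content segments into an analysis model and acquire a similarity analysis result output by the analysis model, where the similarity analysis result comprises M groups of similarity values, and the similarity values between any one of the M content segments and each of the N content segments obtained by analysis of the analysis model form one group of similarity values;

a selecting module, configured to select P similarity values from M groups of similarity values included in the similarity analysis result, where P is a positive integer;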

and the determining module is used for determining the similarity between the first document and the second document according to the P similarity values.

In another aspect, an embodiment of the present application further provides an intelligent terminal, where the intelligent terminal includes a storage device and a processor, where the storage device stores a computer program, and the processor calls the computer program to implement the method in the embodiment of the present application.

Correspondingly, an embodiment of the present application further provides a computer-readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements the document analysis method.

In the embodiments of the present application, the first document and the second document to be analyzed are obtained, the M content segments split from the first document and the N content segments split from the second document are input into the analysis model, and the similarity between the first document and the second document is then determined according to the similarity analysis result output by the analysis model. Splitting the documents into content segments and inputting the segments into the analysis model for similarity analysis raises the granularity of document comparison to the paragraph level, which better captures the overall information of the documents and improves the accuracy of document comparison.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1a is a schematic flow chart of a document analysis method provided in an embodiment of the present application;

FIG. 1b is a schematic interface diagram of a document analysis provided by an embodiment of the present application;

FIG. 1c is a schematic view of a document selection interface provided in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an analysis model provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of a processing method based on the analysis model provided in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a document analysis apparatus according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an intelligent terminal according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiments of the present application can perform duplicate checking on any two documents. An analysis model is introduced to compare the paragraphs obtained by splitting the two documents, so as to determine the similarity between the two documents. Compared with comparing individual feature words, the analysis model has more features to refer to when analyzing a paragraph: it refers at least to the association relationships between different sentences and different words in the paragraph, and to the contextual relationship of each sentence and word, so that the obtained similarity is more accurate.

The embodiments of the present application can be applied to scenarios such as duplicate checking of graduation theses, and can also be used for copyright comparison. For example, for duplicate checking of theses, the two graduation theses to be checked are obtained in turn, and content analysis is performed on each to split it into a plurality of content segments. The content segments are input into an analysis model, which performs similarity analysis on them to obtain a similarity analysis result. The similarity between the two theses is then determined from the similarity analysis result, which gives the duplicate checking result for the two theses. Splitting the theses into content segments for comparison raises the comparison granularity, better captures the overall information of each thesis, and can improve the accuracy of duplicate checking.

Please refer to FIG. 1a, which is a schematic flow chart of a document analysis method according to an embodiment of the present application. The method may be executed on an intelligent terminal, which may be, for example, a server, or a terminal such as a smartphone, a personal computer, a tablet computer, or a smart wearable device. The method includes the following steps:

Step S101, a first document and a second document to be analyzed are obtained. A document analysis interface can be displayed on the intelligent terminal. The interface is used to obtain the first document and the second document that the user needs to compare and that are stored locally on the intelligent terminal; after the two documents are input into the interface, it can display the content segments obtained by splitting them and the document comparison result. The document analysis interface is shown in FIG. 1b, which includes a document analysis interface b10, a first document input button b101, a first document display area b102, a second document input button b103, a second document display area b104, a first document content segment display area b105, a second document content segment display area b106, and a comparison result display area b107.

There are various ways of inputting documents into the interface shown in FIG. 1b. In one embodiment, the user inputs a document by triggering the first document input button b101. After the first document input button b101 is clicked, the intelligent terminal displays a new window, namely a file information display window. The file information display window is shown in FIG. 1c, which includes a file information display window c10, a local file display area c101, and a confirmation button c102. The file information display window c10 is used for the user to select and confirm the first document and the second document to be compared and analyzed.

All locally stored file information is displayed in the local file display area c101 of the file information display window c10. The user selects the first document to be compared in the local file display area c101 and then triggers the confirmation button c102, at which point the first document can be regarded as uploaded to the intelligent terminal. After the document is uploaded successfully, the intelligent terminal closes the file information display window c10 and displays the document name of the first document in the first document display area b102 of the document analysis interface b10.

Similarly, after the user triggers the second document input button b103, the intelligent terminal displays the file information display window c10. The user selects the second document to be compared in the local file display area c101 and then triggers the confirmation button c102, at which point the second document can be regarded as uploaded to the intelligent terminal. After the document is uploaded successfully, the intelligent terminal closes the file information display window c10 and displays the document name of the second document in the second document display area b104 of the document analysis interface b10.

In one embodiment, the user opens the local storage area of the intelligent terminal in which the first document and the second document to be analyzed are stored, selects the first document, and drags it to the first document display area b102 of the document analysis interface b10, which can be regarded as uploading the first document to the intelligent terminal. Similarly, the user selects the second document and drags it to the second document display area b104 of the document analysis interface b10, which can be regarded as uploading the second document to the intelligent terminal. After the documents are uploaded successfully, the first document display area b102 displays the document name of the first document and the second document display area b104 displays the document name of the second document, which means that the intelligent terminal has acquired the first document and the second document to be analyzed.

In other embodiments, the user may also load the first document and the second document by dragging them to the corresponding display areas to implement step S101, for example by dragging the first document into the first document display area b102 and the second document into the second document display area b104. The drag may be performed by selecting the first document or the second document, holding down the left mouse button, and moving the mouse; the present invention is not limited in this respect.

Step S102, splitting the first document into M content segments, and splitting the second document into N content segments, wherein M and N are positive integers. Content analysis can be performed on the first document according to the target symbol group to determine the segmentation and splitting position information in the first document, and the first document is split into M content segments according to that information. Likewise, content analysis is performed on the second document according to the target symbol group to determine the segmentation and splitting position information in the second document, and the second document is split into N content segments according to that information.

In one embodiment, the intelligent terminal traverses the entire content of the first document to find symbols that match the target symbol group. The target symbol group includes any one or more of: a symbol group consisting of a period and a carriage return symbol, a symbol group consisting of a question mark and a carriage return symbol, and a symbol group consisting of an exclamation mark and a carriage return symbol. The positions of the symbols that match the target symbol group are taken as the segmentation and splitting position information. According to this information, all content between two adjacent matches of the target symbol group is treated as one content segment, so that the first document is split into M content segments, and the specific content of the M content segments is displayed in the first document content segment display area b105 of the document analysis interface b10. The second document is analyzed in the same way and split into N content segments, whose specific content is displayed in the second document content segment display area b106. Because paragraph segmentation in the embodiments of the present application takes the target symbol group as its reference, misjudgment of paragraphs can be largely avoided, and each document paragraph can be segmented relatively accurately.
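As an illustration of the splitting rule just described, the following is a minimal Python sketch. It assumes that the "carriage return symbol" corresponds to a line break (`\n` or `\r\n`) and that both ASCII and full-width punctuation should be matched; the function name `split_into_segments` and the pattern are hypothetical and are not taken from the application.

```python
import re

# Assumed reading of the "target symbol group": a period, question mark, or
# exclamation mark (ASCII or full-width) immediately followed by a line break.
SPLIT_PATTERN = re.compile(r"(?<=[.?!。？！])\r?\n")


def split_into_segments(text: str) -> list[str]:
    """Split a document into content segments at each target symbol group."""
    parts = SPLIT_PATTERN.split(text)
    # Drop empty pieces left by trailing or repeated line breaks.
    return [part.strip() for part in parts if part.strip()]


document = "First paragraph, wrapped\nover two lines.\nIs this the second?\nYes!\n"
segments = split_into_segments(document)
for index, segment in enumerate(segments, start=1):
    print(index, repr(segment))
```

A line break that is not preceded by one of the three punctuation marks (such as a soft wrap inside a paragraph) does not start a new segment, which is the property the paragraph above relies on.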

Step S103, inputting the M content segments and the N content segments into an analysis model, and acquiring a similarity analysis result output by the analysis model. The similarity analysis result comprises M groups of similarity values, and the similarity values between any one of the M content segments and each of the N content segments obtained by analysis of the analysis model form one group of similarity values.

In one embodiment, any one of the M content segments and each of the N content segments are input into the analysis model, respectively. The M content segments are converted into M first feature vectors by a first embedding layer of the analysis model. The N content segments are converted into N second feature vectors by a second embedding layer of the analysis model. Memory processing is then performed on the M first feature vectors and the N second feature vectors by two LSTMs of the analysis model, respectively, to obtain M third feature vectors and N fourth feature vectors. The M third feature vectors and the N fourth feature vectors are input into a semantic matching layer of the analysis model for semantic matching, and the similarity analysis result output by the analysis model is obtained, wherein the similarity analysis result comprises M groups of similarity values. The specific structure of the analysis model is described in the subsequent embodiments.
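The shape of the similarity analysis result (M groups of N values each) can be sketched as follows. The scorer here is a deliberately simple placeholder (Jaccard overlap of word sets) that stands in for the trained analysis model; it is an assumption made only so that the sketch runs, and the names `token_overlap` and `similarity_groups` are hypothetical.

```python
from typing import Callable


def token_overlap(a: str, b: str) -> float:
    """Placeholder scorer (Jaccard overlap of word sets), NOT the LSTM model."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def similarity_groups(
    first_segments: list[str],
    second_segments: list[str],
    score: Callable[[str, str], float] = token_overlap,
) -> list[list[float]]:
    """Return M groups; group i holds the N scores of first_segments[i]."""
    return [[score(a, b) for b in second_segments] for a in first_segments]


groups = similarity_groups(
    ["the cat sat", "dogs bark loudly"],          # M = 2 segments
    ["the cat sat", "birds sing", "dogs bark"],   # N = 3 segments
)
print(groups)
```

Any function that maps a pair of content segments to a similarity value can be passed as `score`, so a trained model could replace the placeholder without changing the grouping logic.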

Step S104, selecting P similarity values from the M groups of similarity values included in the similarity analysis result, wherein P is a positive integer. A suitable similarity value may be selected from each of the M groups of similarity values included in the similarity analysis result to obtain M similarity values, and P similarity values may then be selected from the M similarity values.

In one embodiment, where P = M, the maximum similarity value is selected from each of the M groups of similarity values included in the similarity analysis result, resulting in M (i.e., P) maximum similarity values.

In one embodiment, where P = M, each of the M groups of similarity values included in the similarity analysis result is averaged, resulting in M (i.e., P) average similarity values.
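The two selection embodiments above (maximum per group and average per group) can be sketched in Python as follows. The function name `select_values`, the `mode` parameter, and the numeric scores are illustrative assumptions rather than part of the application.

```python
from statistics import mean


def select_values(groups: list[list[float]], mode: str = "max") -> list[float]:
    """Reduce each of the M groups to one value, giving P = M values."""
    if mode == "max":
        return [max(group) for group in groups]
    if mode == "mean":
        return [mean(group) for group in groups]
    raise ValueError(f"unknown mode: {mode!r}")


# Illustrative scores only: M = 2 groups of N = 3 values each.
groups = [[0.9, 0.2, 0.1], [0.3, 0.4, 0.8]]
max_values = select_values(groups, "max")
mean_values = select_values(groups, "mean")
print(max_values, mean_values)
```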

Step S105, determining the similarity between the first document and the second document according to the P similarity values. The P similarity values may be averaged to obtain a similarity value between the first document and the second document, and the similarity between the first document and the second document is determined based on that similarity value.
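For the embodiment in which P = M and the maximum of each group is selected, steps S104 and S105 together can be written compactly. The notation is introduced here only for illustration: s_{ij} denotes the similarity value between the i-th content segment of the first document and the j-th content segment of the second document, and S denotes the resulting document similarity.

```latex
S = \frac{1}{M} \sum_{i=1}^{M} \max_{1 \le j \le N} s_{ij}
```

Written this way, the measure is not necessarily symmetric: exchanging the roles of the first and second documents takes the maxima over the other index and averages over N instead of M, which can give a different value.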

In one embodiment, the intelligent terminal averages the P similarity values to obtain the similarity value between the first document and the second document, which serves as the similarity analysis result. The intelligent terminal then displays the similarity value between the first document and the second document as a percentage in the comparison result display area b107 of the document analysis interface b10. The user can see the similarity between the first document and the second document directly from the percentage displayed in the comparison result display area b107.

In one embodiment, the intelligent terminal averages the P similarity values to obtain the similarity value between the first document and the second document, and compares the similarity value with a threshold, which can be customized by the user. When the similarity value is smaller than the threshold, the similarity analysis result is set to 0, that is, the first document and the second document are not similar; when the similarity value is not smaller than the threshold, the similarity analysis result is set to 1, that is, the first document and the second document are similar. After obtaining the similarity analysis result, the intelligent terminal displays corresponding text in the comparison result display area b107 of the document analysis interface b10. If the similarity analysis result is 0, text indicating that the first document and the second document are not similar is displayed in the comparison result display area b107, for example "not similar"; if the similarity analysis result is 1, text indicating that the two documents are similar is displayed, for example "similar". The user can learn the similarity between the first document and the second document from the text displayed in the comparison result display area b107.
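The averaging and threshold comparison described in these two embodiments can be sketched as follows. The function name `classify_similarity`, the threshold of 0.7, and the input values are assumptions chosen for illustration.

```python
def classify_similarity(values: list[float], threshold: float) -> tuple[float, int, str]:
    """Average the P values, then map the average to 0/1 and a display label."""
    if not values:
        raise ValueError("at least one similarity value is required")
    score = sum(values) / len(values)
    # "Not smaller than the threshold" counts as similar, as in the text.
    result = 1 if score >= threshold else 0
    label = "similar" if result == 1 else "not similar"
    return score, result, label


score, result, label = classify_similarity([0.9, 0.8], threshold=0.7)
percent = f"{score:.0%}"  # percentage form for the comparison result area
print(percent, result, label)
```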

The document analysis steps above are further illustrated by the following example:

1. Acquiring a first document and a second document;

2. Performing segmentation processing on the first document and the second document to obtain M content segments corresponding to the first document and N content segments corresponding to the second document;

In one embodiment, the first document is document a, the second document is document b, document a has 2 content segments, and document b has 3 content segments; the segmentation result can be as shown in Table 1:

TABLE 1

3. Calculating the similarity between each of the M content segments and each of the N content segments to obtain M × N similarity values;

In a specific implementation, the similarity between every pair of content segments can be calculated based on a trained analysis model, giving the data shown in Table 2:

TABLE 2

Document a	Document b	Score
Content segment a1	Content segment b1	Score1
Content segment a1	Content segment b2	Score2
Content segment a1	Content segment b3	Score3
Content segment a2	Content segment b1	Score4
Content segment a2	Content segment b2	Score5
Content segment a2	Content segment b3	Score6

4. Selecting the maximum similarity value corresponding to each content segment of document a, and processing the maximum similarity values to obtain the similarity value between the first document and the second document. The largest similarity value is selected from the similarity values of each content segment in document a; assuming these are Score1 and Score6, the final similarity value of the two documents is S = (Score1 + Score6) / 2.
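The worked example can be reproduced with a short Python sketch that starts from rows shaped like Table 2. The numeric scores are made-up placeholders for Score1 through Score6, chosen so that Score1 and Score6 are the per-segment maxima as assumed in the text.

```python
from collections import defaultdict

# Rows shaped like Table 2; the numeric scores are illustrative values only.
rows = [
    ("a1", "b1", 0.92),  # Score1
    ("a1", "b2", 0.15),  # Score2
    ("a1", "b3", 0.30),  # Score3
    ("a2", "b1", 0.10),  # Score4
    ("a2", "b2", 0.45),  # Score5
    ("a2", "b3", 0.78),  # Score6
]

# Keep the largest score seen for each content segment of document a.
# Starting from 0.0 assumes that similarity values are non-negative.
best = defaultdict(float)
for seg_a, _seg_b, score in rows:
    best[seg_a] = max(best[seg_a], score)

# S = (Score1 + Score6) / 2 with the values above.
document_similarity = sum(best.values()) / len(best)
print(dict(best), round(document_similarity, 2))
```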

The specific processing procedure of the analysis model is described in detail with reference to fig. 2 and 3. In an embodiment, please refer to fig. 2, which is a schematic structural diagram of an analysis module according to an embodiment of the present application. As shown in fig. 2, the specific structure of the analysis module includes: content segment 201, content segment 202, first embedding layer 203, second embedding layer 204, LSTM205, LSTM206, semantic matching layer 207, and match score 208.

Please refer to fig. 3, which is a schematic flowchart of a processing method of an analysis model according to an embodiment of the present application. The method may be executed on an intelligent terminal, which may be, for example, a server or a terminal device, and the method includes the following steps:

step S301, inputting the M content segments and the N content segments into a first embedding layer of the analysis model and a second embedding layer of the analysis model respectively;

in one embodiment, the M content segments 201 split from the first document are input into the first embedding layer 203 of the analysis model, respectively; the N content segments 202 split from the second document are input to a second embedding layer 204 of the analytical model.

Step S302, converting the M content segments into M first feature vectors through a first embedding layer of the analysis model; converting, by a second embedding layer of the analytical model, the N content segments into N second feature vectors;

In one embodiment, the M content segments 201 are converted into M first feature vectors by the first embedding layer 203 of the analysis model; the N content segments 202 are converted into N second feature vectors by the second embedding layer 204 of the analysis model.

Step S303, respectively carrying out memory processing on the M first feature vectors and the N second feature vectors through two long short-term memory networks (LSTMs) of the analysis model to obtain M third feature vectors and N fourth feature vectors;

In one embodiment, the M first feature vectors are subjected to memory processing through the LSTM205 of the analysis model to obtain M third feature vectors; the N second feature vectors are subjected to memory processing through the LSTM206 of the analysis model to obtain N fourth feature vectors.

Step S304, inputting the M third feature vectors and the N fourth feature vectors into the semantic matching layer of the analysis model, and obtaining, through the semantic matching layer, a similarity analysis result output by the analysis model.

In one embodiment, the M third feature vectors and the N fourth feature vectors are input into the semantic matching layer 207 of the analysis model. The semantic matching layer 207 includes a splicing (concatenation) layer, a Dropout (random deactivation) layer, and a fully connected layer. The splicing layer splices the M third feature vectors with the N fourth feature vectors to obtain spliced vectors. The spliced vectors are input to the Dropout layer to prevent overfitting. The fully connected layer, which may use a ReLU activation function and a Softmax function, determines the similarity value between any one of the M third feature vectors and each of the N fourth feature vectors, so as to obtain M groups of similarity values. The M groups of similarity values determined by the fully connected layer are input into the matching score 208, which selects from and averages the M groups of similarity values to obtain the similarity analysis result output by the analysis model.
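The forward pass of the semantic matching layer for one pair of feature vectors can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the two-stage dense layout (a ReLU hidden layer followed by a two-class Softmax), the choice of index 1 as the "similar" class, the default dropout rate, and all parameter names (`w_hidden`, `b_hidden`, `w_out`, `b_out`) are hypothetical and are not specified by this application.

```python
import math
import random
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def softmax(v: Vector) -> Vector:
    peak = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def dense(v: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def dropout(v: Vector, rate: float, rng: Optional[random.Random] = None) -> Vector:
    """Identity at inference (rng is None); inverted dropout during training."""
    if rng is None or rate <= 0.0:
        return list(v)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in v]


def match_score(third: Vector, fourth: Vector,
                w_hidden: Matrix, b_hidden: Vector,
                w_out: Matrix, b_out: Vector,
                rate: float = 0.5,
                rng: Optional[random.Random] = None) -> float:
    spliced = third + fourth                      # splicing layer
    dropped = dropout(spliced, rate, rng)         # Dropout layer
    hidden = relu(dense(dropped, w_hidden, b_hidden))
    probabilities = softmax(dense(hidden, w_out, b_out))
    return probabilities[1]                       # assumed "similar" class


# Hypothetical toy parameters, chosen only so the result is easy to verify.
score = match_score(
    third=[1.0, 0.0], fourth=[0.0, 1.0],
    w_hidden=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]], b_hidden=[0.0, 0.0],
    w_out=[[0.0, 0.0], [1.0, 0.0]], b_out=[0.0, 0.0],
)
```

Calling `match_score` for each of the M third feature vectors against each of the N fourth feature vectors would yield the M groups of similarity values described above.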

The analysis model may be obtained in a training phase based on a large number of training samples. Parameters in each layer of an initial model can be optimized and adjusted according to the result of each round of training, and the initial model after final optimization and adjustment is used as the analysis model for document analysis.

In one embodiment, the intelligent terminal can obtain a training sample, wherein the training sample comprises a first training document and a second training document. The first training document is split into X first training content segments 201 according to the target symbol group, and the second training document is split into Y second training content segments 202 according to the target symbol group.

The X first training content segments 201 are input to the first embedding layer 203 of the initial model, and the Y second training content segments 202 are input to the second embedding layer 204 of the initial model. The X first training content segments 201 are converted into X first training feature vectors by the first embedding layer 203 of the initial model; the Y second training content segments 202 are converted into Y second training feature vectors by the second embedding layer 204 of the initial model. The X first training feature vectors are subjected to memory processing through the LSTM205 of the initial model to obtain X third training feature vectors; the Y second training feature vectors are subjected to memory processing through the LSTM206 of the initial model to obtain Y fourth training feature vectors.
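The embedding and memory-processing stages applied to a single content segment can be sketched as below. This is a minimal illustration assuming a standard LSTM cell (input, forget and output gates plus a candidate state) and a token-level lookup table; the tokenization, the zero vector for unknown tokens, and the parameter names (`W_i`, `b_i`, and so on) are assumptions, since this application does not specify them.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def embed(tokens: Sequence[str], table: Dict[str, Vector], dim: int) -> List[Vector]:
    """Embedding layer: look up one vector per token (zeros if unknown)."""
    return [table.get(token, [0.0] * dim) for token in tokens]


def affine(weights: Matrix, x: Vector, bias: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def lstm_encode(inputs: Sequence[Vector], params: Dict[str, list],
                hidden_size: int) -> Vector:
    """Run a standard LSTM over the embedded tokens; return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in inputs:
        xh = list(x) + h  # current input joined with the previous hidden state
        i = [sigmoid(v) for v in affine(params["W_i"], xh, params["b_i"])]
        f = [sigmoid(v) for v in affine(params["W_f"], xh, params["b_f"])]
        o = [sigmoid(v) for v in affine(params["W_o"], xh, params["b_o"])]
        g = [math.tanh(v) for v in affine(params["W_g"], xh, params["b_g"])]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h


# Hypothetical toy setup: 1-dimensional embeddings and a single hidden unit.
table = {"a": [1.0]}
params = {
    "W_i": [[0.0, 0.0]], "b_i": [0.0],
    "W_f": [[0.0, 0.0]], "b_f": [0.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
    "W_g": [[0.0, 0.0]], "b_g": [1.0],
}
first_feature_vectors = embed(["a"], table, dim=1)
third_feature_vector = lstm_encode(first_feature_vectors, params, hidden_size=1)
```

In this sketch the embedded tokens play the role of a first (or second) feature vector, and the final hidden state plays the role of the corresponding third (or fourth) feature vector.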

The X third training feature vectors and the Y fourth training feature vectors are input into the semantic matching layer 207 of the initial model. The semantic matching layer 207 of the initial model includes a splicing (concatenation) layer, a Dropout (random deactivation) layer, and a fully connected layer. The splicing layer splices the X third training feature vectors with the Y fourth training feature vectors to obtain spliced vectors. The spliced vectors are input to the Dropout layer to prevent overfitting. The fully connected layer determines similarity values of the X third training feature vectors and the Y fourth training feature vectors. The similarity values are input into the matching score 208, which judges and processes them to obtain a similarity training analysis result. The similarity training analysis result is compared, in terms of correlation, with the labeling information that represents the similarity between the first training content segment and the second training content segment, and whether the training is correct is determined according to the comparison result. Parameters in each layer of the initial model are then optimized and adjusted, and the first training content segment and the second training content segment are input again to carry out a further round of training on the initial model. After multiple rounds of training, the initial model after final optimization and adjustment is used as the analysis model for analyzing the similarity between documents. The correlation comparison process is as follows:

Optionally, if the labeling information of the first training content segment and the second training content segment is 1, that is, the first training content segment is similar to the second training content segment, and the similarity value output by the initial model is greater than a preset threshold, it is determined that the similarity training analysis result is consistent with the labeling information, and the training is therefore correct.

Optionally, if the labeling information of the first training content segment and the second training content segment is 0, that is, the first training content segment is not similar to the second training content segment, and the similarity value output by the initial model is smaller than the preset threshold, it is determined that the similarity training analysis result is consistent with the labeling information, and the training is therefore correct.
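The two cases above amount to a simple comparison, sketched here in Python. The function name `training_correct` and the default threshold of 0.5 are illustrative assumptions; this application only refers to "a preset threshold".

```python
def training_correct(label: int, similarity: float, threshold: float = 0.5) -> bool:
    """Return True when the model output agrees with the labeling information.

    label == 1: the segments are similar, so the output must exceed the threshold.
    label == 0: the segments are dissimilar, so the output must fall below it.
    """
    if label == 1:
        return similarity > threshold
    if label == 0:
        return similarity < threshold
    raise ValueError("label must be 0 (dissimilar) or 1 (similar)")
```

For example, `training_correct(1, 0.8)` is true while `training_correct(0, 0.8)` is false, because a dissimilar pair should score below the threshold.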

The training phase based on the above is exemplified as follows:

1. training sample data required by the initial model is prepared, and structured data shown in table 3 is obtained:

TABLE 3

Wherein, the "first training content segment a" and the "second training content segment b" respectively represent two independent training content segments, and the "label information" is the category label information (1 is similar, 0 is dissimilar) of the similarity between the first training content segment and the second training content segment;

2. constructing an initial model, wherein the specific structure is as follows: 2 embedding layers, 2 long short term memory networks (LSTM), and a semantic matching layer. The embedding layer is used for converting the first training content segment a into a first training feature vector and converting the second training content segment b into a second training feature vector. The LSTM is used for memorizing the first training feature vector and the second training feature vector, converting the first training feature vector into a third training feature vector, and converting the second training feature vector into a fourth training feature vector. The semantic matching layer comprises: splicing layer, Dropout layer, full connection layer. And the splicing layer is used for splicing the third training characteristic vector and the fourth training characteristic vector. The Dropout layer is used to prevent overfitting. The fully-connected layer may use an activation function Relu function and a Softmax function for determining similarity values for the third training feature vector and the fourth training feature vector.

3. The specific process of training the initial model may be as follows: obtain a first training feature vector corresponding to the first training content segment a and a second training feature vector corresponding to the second training content segment b, and input the first training feature vector and the second training feature vector into the model to obtain a similarity training analysis result between the first training content segment a and the second training content segment b. The similarity training analysis result is compared with the labeling information to obtain the result of one round of training. Multiple rounds of training are performed on the model in this manner, and the training of the initial model is determined to be complete when the accuracy is higher than a preset threshold. For a similar first training content segment a and second training content segment b, the training is determined to be correct when the similarity value output by the initial model is greater than a preset threshold; for a first training content segment a and second training content segment b that are not similar, the training is determined to be correct when the similarity value output by the initial model is smaller than the preset threshold.
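The stopping rule in step 3 can be sketched as a loop that measures accuracy after each round. This is a hedged illustration: `run_round` is a hypothetical callback standing in for one round of parameter updates followed by re-scoring of the training pairs, and the similarity threshold and accuracy threshold are treated as two separate preset values, which is an assumption about how the text should be read.

```python
from typing import Callable, Optional, Sequence


def round_accuracy(labels: Sequence[int], similarities: Sequence[float],
                   threshold: float = 0.5) -> float:
    """Fraction of training pairs whose output agrees with the label."""
    if not labels or len(labels) != len(similarities):
        raise ValueError("labels and similarities must be non-empty and aligned")
    correct = 0
    for label, value in zip(labels, similarities):
        if (label == 1 and value > threshold) or (label == 0 and value < threshold):
            correct += 1
    return correct / len(labels)


def train_until_accurate(run_round: Callable[[int], Sequence[float]],
                         labels: Sequence[int],
                         target_accuracy: float,
                         max_rounds: int,
                         threshold: float = 0.5) -> Optional[int]:
    """Return the round at which accuracy first exceeds the target, else None."""
    for round_index in range(1, max_rounds + 1):
        similarities = run_round(round_index)  # hypothetical: train, then re-score
        if round_accuracy(labels, similarities, threshold) > target_accuracy:
            return round_index
    return None


# Fabricated outputs for illustration only; a real run would call the model.
labels = [1, 0, 1, 0]
outputs_by_round = {
    1: [0.4, 0.6, 0.7, 0.2],
    2: [0.6, 0.6, 0.7, 0.2],
    3: [0.9, 0.1, 0.8, 0.2],
}
finished_at = train_until_accurate(lambda r: outputs_by_round[r], labels,
                                   target_accuracy=0.9, max_rounds=3)
```

Returning `None` when the target is never reached keeps the caller responsible for deciding whether to continue training or stop.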

Please refer to fig. 4, which is a schematic structural diagram of a document analysis apparatus provided in the present application. As shown in fig. 4, the document analysis apparatus 40 may include: an obtaining module 401, a splitting module 402, a processing module 403, a selecting module 404 and a determining module 405.

The obtaining module 401 is configured to obtain a first document and a second document to be analyzed.

A splitting module 402, configured to split the first document into M content segments, and split the second document into N content segments, where M and N are positive integers.

The processing module 403 is configured to input the M content segments and the N content segments into an analysis model, and obtain a similarity analysis result output by the analysis model, where the similarity analysis result includes M groups of similarity values, and a group of similarity values is formed by similarity values between any one of the M content segments and each of the N content segments obtained through analysis by the analysis model.

A selecting module 404, configured to select P similarity values from M groups of similarity values included in the similarity analysis result, where P is a positive integer.

A determining module 405, configured to determine, according to the P similarity values, a similarity between the first document and the second document.

In one embodiment, the splitting module 402 is configured to perform content analysis on the first document according to a target symbol group, determine segmentation and splitting position information in the first document, and split the first document into M content segments according to the segmentation and splitting position information; analyzing the content of the second document according to a target symbol group, determining segmentation and splitting position information in the second document, and splitting the second document into N content segments according to the segmentation and splitting position information; the target symbol group includes: any one or more of a symbol group consisting of periods and carriage return symbols, a symbol group consisting of question marks and carriage return symbols, and a symbol group consisting of exclamation marks and carriage return symbols.
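The splitting rule based on target symbol groups can be sketched with a regular expression. This is a minimal sketch under assumptions: a "carriage return symbol" is taken to mean any line break (`\r\n`, `\r` or `\n`), both ASCII and full-width punctuation are accepted, and the function name `split_into_segments` is illustrative.

```python
import re
from typing import List

# A boundary is a line break that immediately follows a period, question mark
# or exclamation mark (ASCII or full-width forms; the latter is an assumption).
_BOUNDARY = re.compile(r"(?<=[.?!。？！])(?:\r\n|\r|\n)+")


def split_into_segments(text: str) -> List[str]:
    """Split a document into content segments at punctuation + line break."""
    parts = _BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]


sample = "First paragraph.\nStill the same\nparagraph?\r\nThird!\n"
segments = split_into_segments(sample)
```

A line break that does not follow one of the listed punctuation marks, such as the one after "Still the same", is not treated as a splitting position, so soft-wrapped lines stay inside one content segment.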

In one embodiment, the processing module 403 is configured to input the M content segments and the N content segments into a first embedding layer of the analysis model and a second embedding layer of the analysis model, respectively; convert, by the first embedding layer of the analysis model, the M content segments into M first feature vectors; convert, by the second embedding layer of the analysis model, the N content segments into N second feature vectors; respectively carry out memory processing on the M first feature vectors and the N second feature vectors through two long short-term memory networks (LSTMs) of the analysis model to obtain M third feature vectors and N fourth feature vectors; and input the M third feature vectors and the N fourth feature vectors into the semantic matching layer of the analysis model to obtain, through the semantic matching layer, a similarity analysis result output by the analysis model.

In one embodiment, the semantic matching layer of the analysis model comprises a splicing layer, a Dropout layer, and a fully connected layer. The splicing layer is used for splicing the M third feature vectors and the N fourth feature vectors, the Dropout layer is used for preventing overfitting, and the fully connected layer is used for determining similarity values of the M third feature vectors and the N fourth feature vectors, so that a similarity analysis result is obtained according to the similarity values determined by the fully connected layer.

In an embodiment, the selecting module 404 is configured to select a maximum similarity value from each of the M groups of similarity values included in the similarity analysis result, so as to obtain M maximum similarity values.

In an embodiment, the determining module 405 is configured to perform an averaging process on the P similarity values to obtain a similarity value between the first document and the second document.

In one embodiment, the apparatus may further include: a training module 406, where the training module 406 is configured to obtain a training sample, where the training sample includes a first training document and a second training document, the first training document includes X first training content segments, and the second training document includes Y second training content segments; inputting X first training content segments included by the first training document, Y second training content segments included by the second training document and labeling information used for representing the similarity between the first training content segments and the second training content segments into an initial model, and acquiring a similarity training analysis result output by the initial model; and optimizing and updating the initial model according to the correlation between the similarity training and analyzing result and the labeling information.

In the embodiments of the present invention, the specific implementation of each module may refer to the description of the related content in the foregoing embodiments, and is not repeated.

Referring to fig. 5, a schematic structural diagram of an intelligent terminal according to an embodiment of the present invention is shown, where the intelligent terminal according to an embodiment of the present invention may be a server, or may be a smart phone, a personal computer, a tablet computer, an intelligent wearable device, and the like. The intelligent terminal according to the embodiment of the present invention may include a storage device 501 and a processor 502, and may also include a network interface 503, a communication interface 504, and other interfaces for exchanging data. Interface modules such as a power supply module, a USB data interface, etc. may also be included.

The storage device 501 may include a volatile memory, such as a random-access memory (RAM); the storage device 501 may also include a non-volatile memory, such as a flash memory or a solid-state drive (SSD); the storage device 501 may also comprise a combination of the above kinds of memory.

The processor 502 may be a central processing unit (CPU). The processor 502 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a generic array logic (GAL), or the like.

The network interface 503 may be any of various interfaces for accessing a computer network, and the communication interface 504 may be a mobile communication interface capable of accessing a 4G/5G mobile communication network.

Optionally, the storage device 501 is also used for storing computer programs. The processor 502 may invoke the computer program to implement the document analysis method as shown in the embodiments of fig. 1a and fig. 3 of the present application.

In one embodiment, the processor 502 calls a computer program stored in the storage device for obtaining a first document and a second document to be analyzed; splitting the first document into M content segments, and splitting the second document into N content segments, wherein M and N are positive integers; inputting the M content segments and the N content segments into an analysis model, and obtaining a similarity analysis result output by the analysis model, wherein the similarity analysis result comprises M groups of similarity values, and the similarity value between any one of the M content segments and each of the N content segments obtained by analysis of the analysis model forms a group of similarity values; selecting P similarity values from M groups of similarity values included in the similarity analysis result, wherein P is a positive integer; and determining the similarity between the first document and the second document according to the P similarity values.

In one embodiment, the processor 502 is configured to perform content analysis on the first document according to a target symbol group, determine segmentation and splitting position information in the first document, and split the first document into M content segments according to the segmentation and splitting position information; analyzing the content of the second document according to a target symbol group, determining segmentation and splitting position information in the second document, and splitting the second document into N content segments according to the segmentation and splitting position information; the target symbol group includes: any one or more of a symbol group consisting of periods and carriage return symbols, a symbol group consisting of question marks and carriage return symbols, and a symbol group consisting of exclamation marks and carriage return symbols.

In one embodiment, the processor 502 is configured to input the M content segments and the N content segments into a first embedding layer of the analysis model and a second embedding layer of the analysis model, respectively; convert, by the first embedding layer of the analysis model, the M content segments into M first feature vectors; convert, by the second embedding layer of the analysis model, the N content segments into N second feature vectors; respectively carry out memory processing on the M first feature vectors and the N second feature vectors through two long short-term memory networks (LSTMs) of the analysis model to obtain M third feature vectors and N fourth feature vectors; and input the M third feature vectors and the N fourth feature vectors into the semantic matching layer of the analysis model to obtain, through the semantic matching layer, a similarity analysis result output by the analysis model.

In one embodiment, the semantic matching layer of the analysis model comprises a splicing layer, a Dropout layer, and a fully connected layer. The splicing layer is used for splicing the M third feature vectors and the N fourth feature vectors, the Dropout layer is used for preventing overfitting, and the fully connected layer is used for determining similarity values of the M third feature vectors and the N fourth feature vectors, so that a similarity analysis result is obtained according to the similarity values determined by the fully connected layer.

In one embodiment, the processor 502 is configured to select a maximum similarity value from each of the M groups of similarity values included in the similarity analysis result, and obtain M maximum similarity values.

In an embodiment, the processor 502 is configured to perform an averaging process on the P similarity values to obtain a similarity value between the first document and the second document.

In one embodiment, the processor 502 is further configured to obtain a training sample, where the training sample includes a first training document and a second training document, the first training document includes X first training content segments, and the second training document includes Y second training content segments; inputting X first training content segments included by the first training document, Y second training content segments included by the second training document and labeling information used for representing the similarity between the first training content segments and the second training content segments into an initial model, and acquiring a similarity training analysis result output by the initial model; and optimizing and updating the initial model according to the correlation between the similarity training and analyzing result and the labeling information.

It should further be noted that the present application also provides a computer-readable storage medium storing the aforementioned computer program executed by the document analysis apparatus 40. The computer program includes program instructions, and when the processor executes the program instructions, the method described in the embodiment corresponding to any one of fig. 1a and fig. 3 can be performed, so the details are not repeated here. Likewise, the beneficial effects of the same method are not described again. For technical details not disclosed in the embodiments of the computer storage medium to which the present invention relates, reference is made to the description of the method embodiments of the present invention.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
