Method, device and storage medium for deduplicating news documents
Reading note: this technology, a method, device and storage medium for deduplicating news documents, was created by 冯博琳, 王秋森, 刘斌生 and 吴中恒 on 2018-07-02. The application discloses a method, a device and a storage medium for removing duplicate news documents, wherein the method comprises the following steps: performing word segmentation on each document; calculating the weight of each term in the document; obtaining a document vector from the terms; calculating the similarity between documents from the document vectors; clustering documents whose similarity is greater than a preset value into a cluster, and determining a cluster center according to the similarities between the documents in the cluster; and marking the repeated documents according to the cluster center. The beneficial effects are as follows: training samples do not need to be labeled manually, which avoids the time and labor that manual labeling costs; the similarity is calculated from the weights of the terms in the documents; the weights of named entity terms and event behavior terms are increased, which reduces the influence of low-frequency noise words; documents whose similarity is greater than the preset value are clustered into a cluster and each document appears in only one cluster, so that the document a given document repeats is unique; and the marked repeated documents are used for deduplication, so that repeated documents are not processed more than once.
1. A method for deduplication of a news document, the method comprising:
performing word segmentation on each road news document in a news document set to obtain the terms of each road news document;
calculating the weight of each term of each road news document in that road news document;
obtaining a vector for each road news document from the weighted terms of that road news document;
calculating a first similarity between every two road news documents according to the road news document vectors;
clustering the road news documents whose first similarity is greater than a preset similarity threshold into a cluster, and determining a cluster center according to the first similarities between the road news documents in the cluster;
and marking repeated road news documents according to the cluster center of each cluster, wherein the marked repeated road news documents are used for deduplication.
2. The method for removing duplicate news documents according to claim 1, wherein the news document set stores the road news documents of the administrative division to which it belongs; the method further comprises: classifying news document sets into the administrative divisions to which they belong;
the performing of word segmentation on each road news document in the news document set specifically comprises:
performing word segmentation on each road news document in a news document set belonging to the same administrative division.
3. The method for removing duplicate news documents according to claim 1 or 2, wherein calculating the weight of each term of each road news document in that road news document comprises:
calculating the weight according to a formula [formula not reproduced in the source text],
wherein [symbol definitions not reproduced in the source text].
4. The method of claim 1, wherein determining a cluster center according to the first similarity between the road news documents in the cluster comprises:
when the number of road news documents in the cluster is greater than a preset threshold, adding up, for each road news document in the cluster, the first similarities between that road news document and the other road news documents in the cluster to obtain a second similarity of that road news document;
and taking the road news document with the largest second similarity as the cluster center.
5. A method of deduplication as recited in claim 1, wherein the road news document comprises: historical road news documents and newly added road news documents; the marking of the repeated road news document according to the cluster center of the cluster comprises:
if the road news documents in the cluster are all newly added road news documents, reserving the newly added road news documents serving as the cluster center of the cluster, and marking the newly added road news documents except the cluster center as repeated road news documents;
if the road news documents in the cluster comprise historical road news documents and newly-added road news documents, and the cluster center is the historical road news documents, marking the newly-added road news documents as repeated road news documents;
and if the road news documents in the cluster comprise the historical road news documents and the newly added road news documents, and the cluster center is the newly added road news documents, marking the newly added road news documents as repeated road news documents.
6. An apparatus for deduplication of a news document, the apparatus comprising: a word segmentation module, a weight calculation module, a road news document vector obtaining module, a similarity calculation module, a clustering module and a marking module;
the word segmentation module is used for performing word segmentation on each road news document in a news document set to obtain the terms of each road news document;
the weight calculation module is used for calculating the weight of each term of each road news document in that road news document;
the road news document vector obtaining module is used for obtaining a vector for each road news document from the weighted terms of that road news document;
the similarity calculation module is used for calculating a first similarity between every two road news documents according to the road news document vectors;
the clustering module is used for clustering the road news documents whose first similarity is greater than a preset similarity threshold into a cluster, and determining a cluster center according to the first similarities between the road news documents in the cluster;
the marking module is used for marking repeated road news documents according to the cluster center of each cluster, and the marked repeated road news documents are used for deduplication.
7. The apparatus for removing duplicate news documents according to claim 6, wherein the news document set stores the road news documents of the administrative division to which it belongs; the apparatus further comprises a classification module used for classifying news document sets into the administrative divisions to which they belong; the word segmentation module is specifically used for performing word segmentation on each road news document in a news document set belonging to the same administrative division.
8. The apparatus for removing duplicate news documents according to claim 6 or 7, wherein the weight calculation module is specifically used for calculating the weight according to a formula [formula not reproduced in the source text].
9. The apparatus for deduplication of a news document as in claim 6, wherein the clustering module comprises a determine cluster center module; the cluster center determining module is specifically configured to add the first similarity between each road news document in the cluster and other road news documents in the cluster respectively to obtain a second similarity of each road news document when the number of the road news documents in the cluster is greater than a preset threshold; and taking the road news document corresponding to the second similarity with the maximum value as a cluster center.
10. An apparatus for deduplication of a news document as recited in claim 6, wherein the road news document comprises: historical road news documents and newly added road news documents;
the marking module is specifically configured to, if the road news documents in the cluster are all newly added road news documents, retain the newly added road news documents serving as the cluster center of the cluster, and mark the newly added road news documents except the cluster center as repeated road news documents; if the road news documents in the cluster comprise historical road news documents and newly-added road news documents, and the cluster center is the historical road news documents, marking the newly-added road news documents as repeated road news documents; and if the road news documents in the cluster comprise the historical road news documents and the newly added road news documents, and the cluster center is the newly added road news documents, marking the newly added road news documents as repeated road news documents.
11. A storage medium having stored thereon program data for, when executed by a processor, implementing a method of de-duplication of a news document as claimed in any one of claims 1-5.
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for removing duplicate news documents, and a storage medium.
Background
With the development of the internet, the amount of online news information is increasing dramatically. A large amount of repeated news information is processed many times, which reduces information processing efficiency. How to deduplicate news information has therefore become an urgent problem to be solved.
Disclosure of Invention
The embodiments of the application provide a method and a device for removing duplicate news documents and a storage medium, which solve two problems: in supervised learning, manual labeling of training samples costs time and labor; in unsupervised learning, low-frequency noise words have a large influence on the result.
The application provides a method for removing duplicate news documents, which comprises the following steps:
performing word segmentation on each road news document in a news document set to obtain the terms of each road news document;
calculating the weight of each term of each road news document in that road news document;
obtaining a vector for each road news document from the weighted terms of that road news document;
calculating a first similarity between every two road news documents according to the road news document vectors;
clustering the road news documents whose first similarity is greater than a preset similarity threshold into a cluster, and determining a cluster center according to the first similarities between the road news documents in the cluster;
and marking repeated road news documents according to the cluster center of each cluster, wherein the marked repeated road news documents are used for deduplication.
The application also provides a device for removing duplicate news documents, which comprises: a word segmentation module, a weight calculation module, a road news document vector obtaining module, a similarity calculation module, a clustering module and a marking module;
the word segmentation module is used for performing word segmentation on each road news document in a news document set to obtain the terms of each road news document;
the weight calculation module is used for calculating the weight of each term of each road news document in that road news document;
the road news document vector obtaining module is used for obtaining a vector for each road news document from the weighted terms of that road news document;
the similarity calculation module is used for calculating a first similarity between every two road news documents according to the road news document vectors;
the clustering module is used for clustering the road news documents whose first similarity is greater than a preset similarity threshold into a cluster, and determining a cluster center according to the first similarities between the road news documents in the cluster;
the marking module is used for marking repeated road news documents according to the cluster center of each cluster, and the marked repeated road news documents are used for deduplication.
The application also provides a storage medium on which program data are stored, the program data being used for implementing the method for removing duplicate news documents when being executed by a processor.
Compared with the prior art, the application has the following advantages: training samples do not need to be labeled manually, which avoids the time and labor that manual labeling costs; the similarity is calculated from the weights of the terms in the documents; documents whose similarity is greater than a preset value are clustered into a cluster and each document appears in only one cluster, so that the document a given document repeats is unique; the marked repeated documents are used for deduplication, so that repeated documents are not processed more than once; in addition, increasing the weights of named entity terms and event behavior terms reduces the influence of low-frequency noise words.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flowchart illustrating an embodiment of a method for deduplication of news documents provided in the present application;
FIG. 2 is another schematic flow chart diagram illustrating an embodiment of a method for deduplication of news documents provided herein;
FIG. 3 is an example of a news document set provided herein;
FIG. 4 is an example of a duplicate road news document provided herein;
FIG. 5 is a schematic structural diagram of an embodiment of an apparatus for removing duplicate news documents provided in the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of an embodiment of the method for removing duplicate news documents provided in the present application. The flow includes:
Step 105: performing word segmentation on each road news document in the news document set to obtain the terms of each road news document.
Optionally, the news document set stores the road news documents of the administrative division to which it belongs; news document sets are classified into the administrative divisions to which they belong; further, word segmentation is performed on each road news document in a news document set belonging to the same administrative division.
The road news documents in the sample set under analysis are segmented to obtain terms.
TABLE 1 event behavior vocabulary
In this embodiment, the TFIDF weighting algorithm is modified so that the weights of named entity terms and event behavior terms in a road news document are increased, which strengthens the ability to tell apart road news documents that describe different events.
After word segmentation, a road news document-term matrix is constructed from the road news documents in the sample set under analysis: each row of the matrix is a road news document, each column is a term, and each element is the weight of that term in that road news document. Optionally, the weight of the k-th term w_k in a road news document is calculated according to a formula [formula not reproduced in the source text], where the road news document is the j-th road news document in the news document set of city c_i, c_i represents different cities, kw(w_k) is the extracted k-th term w_k, TFIDF() is the term frequency-inverse document frequency weighting algorithm, and i, j and k are all positive integers. In this embodiment, if w_k is a named entity, the first preset threshold is 1.5; if w_k is an event behavior term, the second preset threshold is 1.2.
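The weighting step can be illustrated with a minimal sketch. The formula itself is not reproduced in this text, so the sketch assumes that the two preset thresholds (1.5 and 1.2) act as multiplicative boosts on a standard TF-IDF weight; that reading, the smoothed IDF variant, and the names `boosted_tfidf`, `named_entities` and `event_terms` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Assumed boost factors, taken from the embodiment's two preset thresholds.
NAMED_ENTITY_BOOST = 1.5
EVENT_BEHAVIOR_BOOST = 1.2


def boosted_tfidf(docs, named_entities, event_terms):
    """Return one {term: weight} dict per document.

    docs: list of token lists (documents that are already segmented).
    named_entities / event_terms: hypothetical sets of terms to boost.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weights = {}
        for term, count in tf.items():
            # Smoothed IDF, one common variant; the source does not fix one.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            w = (count / len(doc)) * idf
            if term in named_entities:
                w *= NAMED_ENTITY_BOOST      # assumption: multiplicative boost
            elif term in event_terms:
                w *= EVENT_BEHAVIOR_BOOST    # assumption: multiplicative boost
            weights[term] = w
        weighted.append(weights)
    return weighted
```

Under this assumption a named entity term always ends up 1.5 times heavier than an ordinary term with the same frequency statistics, which is one way to make the event-specific words dominate the document vector.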
In this embodiment, the weighted terms of each road news document are input to a bag-of-words model to obtain the vector of each road news document.
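As a rough illustration of the bag-of-words step, the sketch below maps per-document term weights onto one shared vocabulary so that every document becomes a vector of the same length. The function name `build_vectors` and the choice of a sorted vocabulary are assumptions, not details given in the source.

```python
def build_vectors(weighted_docs):
    """Map each {term: weight} dict onto a shared, sorted vocabulary.

    Returns (vocabulary, matrix): matrix[r][c] is the weight of
    vocabulary[c] in document r, or 0.0 if the term does not occur.
    """
    vocabulary = sorted({term for doc in weighted_docs for term in doc})
    column = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for doc in weighted_docs:
        row = [0.0] * len(vocabulary)
        for term, weight in doc.items():
            row[column[term]] = weight
        matrix.append(row)
    return vocabulary, matrix
```

The rows of `matrix` correspond to the rows of the document-term matrix described above.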
Optionally, each road news document vector is normalized, and the first similarity between every two road news documents is calculated from the normalized road news document vectors. Further, each road news document vector is normalized using L2 normalization, calculated as

$$L2(V) = \frac{V}{\sqrt{\sum_{i=1}^{n} V_i^{2}}}$$

where V is the road news document vector, V_i is a component (dimension) of V, i is the sequence number of the term in the road news document, n is the total number of terms in the road news document, and n and i are positive integers. L2(V) is the original value of each component divided by the length of the current vector V (the denominator, i.e., the square root of the sum of the squares of the components). The first similarity between every two road news documents is calculated after the normalization.
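The normalization and similarity steps can be sketched as follows. The sketch assumes cosine similarity as the first similarity, which matches the cosine similarity S used in the clustering step; the function names are illustrative.

```python
import math


def l2_normalize(vector):
    """Divide each component by the vector's Euclidean length."""
    length = math.sqrt(sum(v * v for v in vector))
    if length == 0.0:
        return list(vector)  # an all-zero vector is left unchanged
    return [v / length for v in vector]


def similarity_matrix(vectors):
    """Pairwise cosine similarity.

    After L2 normalization the cosine similarity of two vectors
    reduces to their dot product.
    """
    unit = [l2_normalize(v) for v in vectors]
    n = len(unit)
    return [[sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(n)]
            for i in range(n)]
```

Normalizing first means that long and short reports of the same event are compared on direction only, not on length.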
In this embodiment, the Canopy clustering method may assign one sample to several clusters (a one-to-many case). To uniquely determine the document that the current sample repeats, the Canopy clustering algorithm is adjusted so that it better meets the uniqueness requirement that the current task places on repeated road news documents (that is, each road news document repeats at most one road news document). The adjustment is as follows: set a preset similarity threshold T, a cluster element set CE = {}, a shuffled news document set index ind, and the cosine similarities S of the road news document pairs in the news document set. Traverse ind; if the road news document di corresponding to the current subscript in ind is not in CE, take di as a cluster center, and take the road news documents that are not in CE and whose similarity with di in S is greater than or equal to T as the elements of the current cluster, obtaining a new cluster. Add di and these road news documents to CE. The set of these road news documents can be empty, in which case di alone forms a cluster. The loop ends when the traversal of ind ends or when the size of CE equals the size of the whole news document set.
In this embodiment, the road news documents include historical road news documents and newly added road news documents. Suppose there are 10 road news documents, as shown in Fig. 3, of which the first 6 are historical road news documents already in the history library and the last 4 are newly added road news documents; the clustering and deduplication operation needs to be performed on the 4 newly added road news documents. Suppose the city number of Wuhan is c_i = c; its news document set is then the union of the set of historical news documents for Wuhan and the set of newly added news documents, where the k-th road news document in the Wuhan historical news document set has k = 0, 1, …, 5 and the j-th road news document in the Wuhan newly added news document set has j = 0, 1, …, 3. The preset similarity threshold is set to T = 0.5, and among the 10 road news documents the pairs whose similarity is greater than 0.5 are assumed to be: the first historical road news document (the national court of four roads district court of Japan, northbound bus station migration) and the second historical road news document (Wuhan Gutian four-way court north-row multi-way bus stop migration); and the sixth historical road news document (start of project of main body project of north road of Wuhan ink lake) and the first newly added road news document (Wuhan two-ring line forming in the north of the instant ink lake for main engineering start).
Let the news document set index be traversed in order:
traverse the similarities between the first historical road news document and the other road news documents; its similarity with the second historical road news document is greater than T, so the two form a cluster and both are added to the cluster element set CE;
traverse the second historical road news document; since it already appears in CE, move on to the next element;
traverse the third, fourth and fifth historical road news documents in turn; since no road news document has a similarity greater than T with any of them, each forms a cluster on its own, and all are added to CE;
traverse the sixth historical road news document; it forms a cluster with the first newly added road news document, and both are added to CE;
traverse the first newly added road news document; since it already appears in CE, move on to the next element;
traverse the second, third and fourth newly added road news documents in turn; since no road news document has a similarity with any of them that is higher than the preset similarity threshold, each forms a cluster on its own.
Clustering then finishes, giving a cluster set of eight clusters: the first and second historical road news documents; the third, fourth and fifth historical road news documents, each alone; the sixth historical road news document together with the first newly added road news document; and the second, third and fourth newly added road news documents, each alone.
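The adjusted clustering pass and the Wuhan example can be sketched as below. This is a minimal illustration under stated assumptions: documents are referred to by index (0 to 5 historical, 6 to 9 newly added), the similarity values 0.8 and 0.1 are made up (the source only states which pairs exceed 0.5), and the name `cluster_unique` is hypothetical.

```python
def cluster_unique(similarity, threshold, order=None):
    """Single-pass clustering in which each document joins exactly one cluster.

    similarity: square matrix of pairwise similarities (S).
    threshold: preset similarity threshold (T).
    order: traversal order of document indices (ind); defaults to 0..n-1.
    """
    n = len(similarity)
    order = list(range(n)) if order is None else list(order)
    assigned = set()   # plays the role of the cluster element set CE
    clusters = []
    for i in order:
        if i in assigned:
            continue   # already belongs to an earlier cluster
        # Unassigned documents whose similarity with document i reaches T.
        members = [j for j in range(n)
                   if j != i and j not in assigned
                   and similarity[i][j] >= threshold]
        clusters.append([i] + members)   # document i seeds the cluster
        assigned.add(i)
        assigned.update(members)
        if len(assigned) == n:
            break      # every document has been placed
    return clusters


# Worked example: 6 historical (0-5) and 4 newly added (6-9) documents.
# Only the pairs (0, 1) and (5, 6) exceed T; 0.8 and 0.1 are made-up values.
n = 10
S = [[1.0 if a == b else 0.1 for b in range(n)] for a in range(n)]
for a, b in [(0, 1), (5, 6)]:
    S[a][b] = S[b][a] = 0.8

example_clusters = cluster_unique(S, threshold=0.5)
# example_clusters == [[0, 1], [2], [3], [4], [5, 6], [7], [8], [9]]
```

Because a document is skipped as soon as it is in the assigned set, no document can appear in two clusters, which is the uniqueness property that plain Canopy clustering does not guarantee.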
Optionally, when the number of the road news documents in the cluster is greater than a preset threshold, respectively adding the first similarity between each road news document in the cluster and other road news documents in the cluster to obtain a second similarity of each road news document; and taking the road news document corresponding to the second similarity with the maximum value as a cluster center.
Further, in this embodiment, when the number of road news documents in the cluster is greater than 2, the first similarities between each road news document in the cluster and the other road news documents in the cluster are added up to obtain the second similarity of that road news document, and the road news document with the largest second similarity is taken as the cluster center. For example, suppose the cluster contains 4 road news documents, numbered 1, 2, 3 and 4. The first similarities between document 1 and documents 2, 3 and 4 are calculated and added up, giving a second similarity of 3.2 for road news document 1; the first similarities between document 2 and documents 1, 3 and 4 are calculated and added up, giving a second similarity of 3.4 for road news document 2; the first similarities between document 3 and documents 1, 2 and 4 are calculated and added up, giving a second similarity of 3.5 for road news document 3; and the first similarities between document 4 and documents 1, 2 and 3 are calculated and added up, giving a second similarity of 3.8 for road news document 4. Road news document 4 has the highest second similarity and is taken as the cluster center.
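The cluster center selection can be sketched as follows. The similarity values in the example are made up for illustration and are not the figures quoted above; keeping the first member as the center of a small cluster is also an assumption, since the source does not state what happens when the size condition is not met.

```python
def choose_center(cluster, similarity, min_size=2):
    """Return the member with the largest summed similarity to the others.

    cluster: list of document indices, seed document first.
    min_size: the preset threshold on cluster size (2 in the embodiment).
    Assumption: for clusters of at most min_size documents, the first
    member (the document that seeded the cluster) stays the center.
    """
    if len(cluster) <= min_size:
        return cluster[0]

    def second_similarity(i):
        # Sum of first similarities between i and every other member.
        return sum(similarity[i][j] for j in cluster if j != i)

    return max(cluster, key=second_similarity)


# Made-up first similarities for a cluster of four documents (indices 0-3).
S4 = [
    [1.0, 0.6, 0.7, 0.9],
    [0.6, 1.0, 0.5, 0.8],
    [0.7, 0.5, 1.0, 0.9],
    [0.9, 0.8, 0.9, 1.0],
]
center = choose_center([0, 1, 2, 3], S4)
# Second similarities: 2.2, 1.9, 2.1, 2.6, so center == 3
```

Choosing the member with the largest summed similarity picks the document that is, on average, closest to everything else in the cluster.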
Manually processing information takes a lot of time, and repeated information yields nothing new. The repeated road news documents are therefore marked so that they do not take part in the subsequent processing flow, which improves the efficiency of the subsequent manual information processing.
After clustering is finished, the obtained clusters are examined. The elements of the cluster set (that is, the individual clusters) show which road news documents repeat one another. Optionally, if the road news documents in a cluster are all newly added road news documents, the newly added road news document serving as the cluster center of the cluster is retained, and the newly added road news documents other than the cluster center are marked as repeated road news documents; if the road news documents in the cluster comprise historical road news documents and newly added road news documents, and the cluster center is a historical road news document, the newly added road news documents are marked as repeated road news documents; and if the road news documents in the cluster comprise historical road news documents and newly added road news documents, and the cluster center is a newly added road news document, the newly added road news documents are marked as repeated road news documents.
In this embodiment, of the 4 newly added road news documents, the first newly added road news document is marked as a repeat of the sixth historical road news document, and no repetition occurs for the other newly added road news documents.
In this embodiment, as shown in Fig. 4, the processing result is written into the database: for each newly added road news document marked as repeated, two fields in the database, BIAOSHI and DUPLICATE_ID, are updated, which respectively indicate the title and the ID of the road news document that it repeats. The newly added road news documents marked as repeated are stored in the database, and the newly added road news documents not marked as repeated are used for later manual processing, so that new information is provided to users.
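The marking rules can be sketched as a small function that returns, for each cluster, which newly added documents are marked and which document each one repeats. The source does not say which document a newly added cluster center repeats in a mixed cluster, so pointing it at the first historical document is an assumption; the name `mark_duplicates` and the index-based representation are also illustrative.

```python
def mark_duplicates(cluster, center, is_new):
    """Return {repeated_doc: doc_it_repeats} for one cluster.

    cluster: list of document indices; center: the cluster center index.
    is_new: sequence mapping each index to True for a newly added document.
    """
    new_docs = [d for d in cluster if is_new[d]]
    historical = [d for d in cluster if not is_new[d]]
    if not historical:
        # Only newly added documents: keep the center, mark the rest.
        return {d: center for d in new_docs if d != center}
    # Mixed cluster: every newly added document is marked as repeated.
    # Assumption: if the center is itself newly added, the repeated
    # documents point to the first historical document instead.
    target = center if not is_new[center] else historical[0]
    return {d: target for d in new_docs}


# Wuhan example: indices 0-5 are historical, 6-9 are newly added.
is_new = [False] * 6 + [True] * 4
marks = mark_duplicates([5, 6], center=5, is_new=is_new)
# marks == {6: 5}: the first newly added document repeats the sixth historical one
```

The returned mapping carries what the BIAOSHI and DUPLICATE_ID fields described above need: the document being repeated, from which its title and ID can be looked up.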
Compared with the prior art, the method has the following advantages: training samples do not need to be labeled manually, which avoids the time and labor that manual labeling costs; the similarity is calculated from the weights of the terms in the documents; the weights of named entity terms and event behavior terms are increased, which reduces the influence of low-frequency noise words; documents whose similarity is greater than a preset value are clustered into a cluster and each document appears in only one cluster, so that the document a given document repeats is unique; and the marked repeated documents are used for deduplication, so that repeated documents are not processed more than once.
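Fig. 5 is a schematic structural diagram of an embodiment of the apparatus for removing duplicate news documents provided in the present application. The apparatus includes: a word segmentation module, a weight calculation module, a road news document vector obtaining module, a similarity calculation module, a clustering module and a marking module;
the word segmentation module is used for performing word segmentation on each road news document in a news document set to obtain the terms of each road news document;
the weight calculation module is used for calculating the weight of each term of each road news document in that road news document;
the road news document vector obtaining module is used for obtaining a vector for each road news document from the weighted terms of that road news document;
the similarity calculation module is used for calculating a first similarity between every two road news documents according to the road news document vectors;
the clustering module is used for clustering the road news documents whose first similarity is greater than a preset similarity threshold into a cluster, and determining a cluster center according to the first similarities between the road news documents in the cluster;
the marking module is used for marking repeated road news documents according to the cluster center of each cluster, and the marked repeated road news documents are used for deduplication.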
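Optionally, the news document set stores the road news documents of the administrative division to which it belongs; the apparatus further comprises a classification module used for classifying news document sets into the administrative divisions to which they belong; further, the word segmentation module is specifically used for performing word segmentation on each road news document in a news document set belonging to the same administrative division.
Optionally, the weight calculation module is specifically used for calculating the weight according to a formula [formula not reproduced in the source text], where the road news document is the j-th road news document in the news document set of city c_i, c_i represents different cities, i is the city number, j is the sequence number of the road news document in the news document set, k is the sequence number of the term in the road news document, and i, j and k are positive integers.
Optionally, the clustering module comprises a cluster center determining module; the cluster center determining module is specifically used for, when the number of road news documents in the cluster is greater than a preset threshold, adding up the first similarities between each road news document in the cluster and the other road news documents in the cluster to obtain the second similarity of that road news document, and taking the road news document with the largest second similarity as the cluster center.
Optionally, the road news documents comprise historical road news documents and newly added road news documents; the marking module is specifically used for: if the road news documents in the cluster are all newly added road news documents, retaining the newly added road news document serving as the cluster center of the cluster, and marking the newly added road news documents other than the cluster center as repeated road news documents; if the road news documents in the cluster comprise historical road news documents and newly added road news documents, and the cluster center is a historical road news document, marking the newly added road news documents as repeated road news documents; and if the road news documents in the cluster comprise historical road news documents and newly added road news documents, and the cluster center is a newly added road news document, marking the newly added road news documents as repeated road news documents.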
Compared with the prior art, the apparatus has the following advantages: training samples do not need to be labeled manually, which avoids the time and labor that manual labeling costs; the weight calculation module calculates the weights of the terms in the documents, and these weights are used for calculating the similarity; the clustering module clusters documents whose similarity is greater than a preset threshold into a cluster and each document appears in only one cluster, so that the document a given document repeats is unique; and the marked repeated documents are used for deduplication, so that repeated documents are not processed more than once.
The application also provides a storage medium on which program data are stored. When executed by a processor, the program data implement the following: performing word segmentation on each road news document in a news document set to obtain the terms of each road news document; calculating the weight of each term of each road news document in that road news document; obtaining a vector for each road news document from the weighted terms of that road news document; calculating a first similarity between every two road news documents according to the road news document vectors; clustering the road news documents whose first similarity is greater than a preset similarity threshold into a cluster, and determining a cluster center according to the first similarities between the road news documents in the cluster; and marking repeated road news documents according to the cluster center of each cluster, wherein the marked repeated road news documents are used for deduplication.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.