Duplicate removal target page determining method and device, computer equipment and computer readable storage medium

Document No.: 1889400. Publication date: 2021-11-26.

Abstract: This technology, "Duplicate removal target page determining method and device, computer equipment and computer readable storage medium", was created by 李珅 on 2021-08-30. The invention relates to the technical field of page deduplication and discloses a deduplication target page determining method and device, computer equipment and a computer-readable storage medium. It provides a new scheme for determining deduplication target pages based on a guess index. First, for all stored pages, a guess index is built from the corresponding keywords; given an input keyword, the index quickly returns either search failure information or a keyword with a high degree of match. Then, for a newly added page, a corresponding newly added keyword is constructed and looked up in the guess index, so that stored page content that may match is found quickly and compared to reach a determination. Because the newly added page no longer has to be compared against the content of every stored page, determination is faster, takes less time and uses fewer computing resources, which helps improve the user search experience and facilitates practical application and popularization.

1. A method for determining a deduplication target page, characterized by comprising the following steps:

constructing a corresponding keyword for each of the stored pages;

constructing a guess index for all the keywords, wherein the guess index satisfies the following conditions (A) to (C):

(A) no two of the indexed keywords are identical;

(B) for a given input keyword, if a unique matching indexed keyword exists among the indexed keywords, that indexed keyword can be found and returned through the guess index;

(C) for a given input keyword, if no matching indexed keyword exists among the indexed keywords, the guess index returns either search failure information or a guess keyword that does not match the input keyword, wherein the guess keyword is one of the indexed keywords, and the ratio of the probability of returning the guess keyword to the probability of returning the search failure information is smaller than a preset value;

constructing a corresponding newly added keyword for a newly added page, wherein the newly added keyword is constructed in the same way as the keywords;

taking the newly added keyword as the given input keyword, and if a keyword to be checked is returned through the guess index, acquiring the corresponding page content according to the keyword to be checked;

judging whether the acquired page content matches the page content of the newly added page;

and if they match, determining the newly added page as a deduplication target page.

2. The method of claim 1, wherein the stored pages comprise deduplicated pages and/or inactive pages with liveness below a predetermined threshold.

3. The method of claim 1, wherein the keywords comprise a page hash value and page content corresponding to the stored page.

4. The method of claim 1, wherein after constructing the corresponding newly added keyword for the newly added page, the method further comprises:

taking the newly added keyword as the given input keyword, and if the guess index returns the search failure information, determining the newly added page as a non-deduplication target page.

5. The method of claim 1, wherein after judging whether the acquired page content matches the page content of the newly added page, the method further comprises:

if they do not match, determining the newly added page as a non-deduplication target page.

6. The method of claim 4 or 5, wherein after determining the newly added page as a non-deduplication target page, the method further comprises:

storing the newly added page, and updating the guess index according to the newly added keyword.

7. The method of claim 1, wherein after determining the newly added page as a deduplication target page, the method further comprises:

determining the corresponding stored page according to the keyword to be checked, and representing the newly added page by the physical page corresponding to that stored page;

and releasing the newly added page.

8. A deduplication target page determining device, characterized by comprising a keyword building module, a guess index building module, an index execution module, a matching judgment module and a target determining module;

the keyword building module is used for constructing a corresponding keyword for each of the stored pages and for constructing a corresponding newly added keyword for a newly added page, wherein the newly added keyword is constructed in the same way as the keywords;

the guess index building module is communicatively connected to the keyword building module and is used for constructing a guess index for all the keywords, wherein the guess index satisfies the following conditions (A) to (C):

(A) no two of the indexed keywords are identical;

(B) for a given input keyword, if a unique matching indexed keyword exists among the indexed keywords, that indexed keyword can be found and returned through the guess index;

(C) for a given input keyword, if no matching indexed keyword exists among the indexed keywords, the guess index returns either search failure information or a guess keyword that does not match the input keyword, wherein the guess keyword is one of the indexed keywords, and the ratio of the probability of returning the guess keyword to the probability of returning the search failure information is smaller than a preset value;

the index execution module is communicatively connected to the keyword building module and to the guess index building module, and is used for taking the newly added keyword as the given input keyword and, if a keyword to be checked is returned through the guess index, acquiring the corresponding page content according to the keyword to be checked;

the matching judgment module is communicatively connected to the keyword building module and to the index execution module, and is used for judging whether the acquired page content matches the page content of the newly added page;

and the target determining module is communicatively connected to the matching judgment module and is used for determining the newly added page as a deduplication target page when a match is found.

9. A computer device comprising a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for reading the computer program and executing the method for determining the deduplication target page according to any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon instructions for performing, when running on a computer, a method for deduplication target page determination as recited in any one of claims 1-7.

Technical Field

The invention belongs to the technical field of page duplicate removal processing, and particularly relates to a duplicate removal target page determination method and device, computer equipment and a computer readable storage medium.

Background

In search engine optimization, on-site optimization is important, and one phenomenon that deserves attention is duplication among search pages: the same article often appears repeatedly under different URLs of the same website. Search engines dislike such duplicate content, because the user experience is poor if a user sees the same article under different URLs on the first two result pages. Although all of these results are relevant, the search engine wants to return only one copy of the article, so duplicate content has to be identified and removed before indexing. This process is called "optimized deduplication".

At present, when a search engine runs on an operating system or a virtual machine hypervisor, the core steps of merging and storing duplicate pages are: first, finding pages with the same content, that is, determining the deduplication target pages; then, representing the pages with the same content by a single physical page; and finally, releasing the redundant pages. (A hypervisor, also called a virtual machine monitor or VMM, is the software, firmware or hardware that creates and runs virtual machines. The computer on which the hypervisor runs one or more virtual machines is called the host, and the virtual machines are called guests. The hypervisor presents a virtual operating platform to the guest operating systems and manages their execution, and the guest operating systems share the virtualized hardware resources.) However, in this process the page to be determined is generally compared with every stored page one by one: if an identical page is found, the page to be determined is a deduplication target page; otherwise it is a non-deduplication target page. This determination is slow, takes a long time and consumes considerable computing resources, which can in turn degrade the user experience.
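For illustration only, the one-by-one comparison criticized above can be sketched as follows. The function name `find_duplicate_naive` and the use of `bytes` for page content are assumptions made for this sketch and do not come from the original text; the point is simply that the cost grows with the number of stored pages.

```python
from typing import List, Optional


def find_duplicate_naive(new_page: bytes, stored_pages: List[bytes]) -> Optional[int]:
    """Return the position of the first stored page identical to new_page, else None."""
    for position, page in enumerate(stored_pages):
        # Every stored page is compared in full, so the work is
        # proportional to the number (and size) of stored pages.
        if page == new_page:
            return position
    return None


stored = [b"alpha" * 100, b"beta" * 100, b"gamma" * 100]
print(find_duplicate_naive(b"beta" * 100, stored))   # 1
print(find_duplicate_naive(b"delta" * 100, stored))  # None
```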

Disclosure of Invention

To solve the problems that determining a deduplication target page during merged storage of duplicate pages is slow, takes a long time and consumes considerable computing resources, the invention aims to provide a new deduplication target page determining method, a device, computer equipment and a computer-readable storage medium. Without comparing the newly added page against the content of every stored page, the invention increases the determination speed, shortens the required time and reduces the required computing resources, which helps improve the user search experience and facilitates practical application and popularization.

In a first aspect, the present invention provides a method for determining a deduplication target page, including:

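constructing a corresponding keyword for each of the stored pages;

constructing a guess index for all the keywords, wherein the guess index satisfies the following conditions (A) to (C):

(A) no two of the indexed keywords are identical;

(B) for a given input keyword, if a unique matching indexed keyword exists among the indexed keywords, that indexed keyword can be found and returned through the guess index;

(C) for a given input keyword, if no matching indexed keyword exists among the indexed keywords, the guess index returns either search failure information or a guess keyword that does not match the input keyword, wherein the guess keyword is one of the indexed keywords, and the ratio of the probability of returning the guess keyword to the probability of returning the search failure information is smaller than a preset value;

constructing a corresponding newly added keyword for a newly added page, wherein the newly added keyword is constructed in the same way as the keywords;

taking the newly added keyword as the given input keyword, and if a keyword to be checked is returned through the guess index, acquiring the corresponding page content according to the keyword to be checked;

judging whether the acquired page content matches the page content of the newly added page;

and if they match, determining the newly added page as a deduplication target page.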

The above provides a new scheme for determining a deduplication target page based on a guess index. First, for all stored pages, a guess index is built from the corresponding keywords; given an input keyword, the index quickly returns either search failure information or a keyword with a high degree of match. Then, for a newly added page, a corresponding newly added keyword is constructed and looked up in the guess index, so that stored page content that may match is found quickly and efficiently and is compared to reach a determination. Because the newly added page no longer has to be compared against the content of every stored page, the determination speed is increased, the required time is shortened and the required computing resources are reduced, which helps improve the user search experience and facilitates practical application and popularization.

In one possible design, the stored pages include deduplicated pages and/or inactive pages with liveness below a predetermined threshold.

In one possible design, the keyword includes a page hash value and/or page content of the stored page.

In one possible design, after constructing the corresponding newly added keyword for the newly added page, the method further includes:

taking the newly added keyword as the given input keyword, and if the guess index returns the search failure information, determining the newly added page as a non-deduplication target page.

In one possible design, after judging whether the acquired page content matches the page content of the newly added page, the method further includes:

if they do not match, determining the newly added page as a non-deduplication target page.

In one possible design, after determining the newly added page as a non-deduplication target page, the method further includes:

storing the newly added page, and updating the guess index according to the newly added keyword.

In one possible design, after determining the newly added page as a deduplication target page, the method further includes:

determining a corresponding stored page according to the keyword to be checked, and representing the newly added page by adopting a physical page corresponding to the stored page;

and releasing the newly added page.

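In a second aspect, the invention provides a deduplication target page determining device, which comprises a keyword building module, a guess index building module, an index execution module, a matching judgment module and a target determining module;

the keyword building module is used for constructing a corresponding keyword for each of the stored pages and for constructing a corresponding newly added keyword for a newly added page, wherein the newly added keyword is constructed in the same way as the keywords;

the guess index building module is communicatively connected to the keyword building module and is used for constructing a guess index for all the keywords, wherein the guess index satisfies the following conditions (A) to (C):

(A) no two of the indexed keywords are identical;

(B) for a given input keyword, if a unique matching indexed keyword exists among the indexed keywords, that indexed keyword can be found and returned through the guess index;

(C) for a given input keyword, if no matching indexed keyword exists among the indexed keywords, the guess index returns either search failure information or a guess keyword that does not match the input keyword, wherein the guess keyword is one of the indexed keywords, and the ratio of the probability of returning the guess keyword to the probability of returning the search failure information is smaller than a preset value;

the index execution module is communicatively connected to the keyword building module and to the guess index building module, and is used for taking the newly added keyword as the given input keyword and, if a keyword to be checked is returned through the guess index, acquiring the corresponding page content according to the keyword to be checked;

the matching judgment module is communicatively connected to the keyword building module and to the index execution module, and is used for judging whether the acquired page content matches the page content of the newly added page;

and the target determining module is communicatively connected to the matching judgment module and is used for determining the newly added page as a deduplication target page when a match is found.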

In a third aspect, the present invention provides a computer device, including a memory and a processor, which are communicatively connected, where the memory is used to store a computer program, and the processor is used to read the computer program and execute the method for determining a deduplication target page according to the first aspect or any possible design of the first aspect.

In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon instructions which, when run on a computer, perform the method for determining a deduplication target page as described in the first aspect above or any of the possible designs of the first aspect.

In a fifth aspect, the present invention provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of determining a deduplication target page as described above in the first aspect or any possible design of the first aspect.

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a deduplication target page determination method provided by the present invention.

Fig. 2 is a schematic structural diagram of a deduplication target page determining apparatus provided in the present invention.

Fig. 3 is a schematic structural diagram of a computer device provided by the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. Specific structural and functional details disclosed herein are merely representative of exemplary embodiments of the invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to the embodiments set forth herein.

It will be understood that, although the terms first, second, etc. may be used herein to describe various objects, these objects should not be limited by these terms. These terms are only used to distinguish one object from another. For example, a first object may be referred to as a second object, and similarly, a second object may be referred to as a first object, without departing from the scope of example embodiments of the present invention.

It should be understood that the term "and/or", where it appears herein, merely describes an association between associated objects and means that three relationships may exist; for example, "A and/or B" may mean: A exists alone, B exists alone, or A and B exist at the same time. The term "/and", where it appears herein, describes another association between objects and means that two relationships may exist; for example, "A /and B" may mean: A exists alone, or A and B exist at the same time. In addition, the character "/", where it appears herein, generally means that the objects before and after it are in an "or" relationship.

As shown in fig. 1, the method for determining a deduplication target page provided in the first aspect of this embodiment may be executed by, but is not limited to, a computer device or virtual machine with certain computing resources, for example an electronic device such as a personal computer (PC), a smart phone, a personal digital assistant (PDA) or a wearable device, or a virtual machine hypervisor. (A personal computer is a multipurpose computer whose size, price and performance suit personal use; desktop computers, notebook computers, mini-notebooks, tablet computers and ultrabooks are all personal computers.) For a newly added page, the method constructs a corresponding newly added keyword and looks it up in a guess index built in advance for all stored pages, so that stored page content that may match is found quickly and efficiently and is compared to reach a determination. Because the newly added page does not have to be compared against the content of every stored page, the determination speed is increased, the required time is shortened and the required computing resources are reduced, which helps improve the user search experience and facilitates practical application and popularization. As shown in fig. 1, the method for determining the deduplication target page may include, but is not limited to, the following steps S1 to S6.

S1. For each of the stored pages, constructing a corresponding keyword.

In step S1, a stored page is a page that has already been merged and stored by the local computer device or virtual machine, and may include, but is not limited to, a deduplicated page and/or an inactive page whose activity is lower than a preset threshold. The keyword is a key in the database sense: an important concept in the relational model that is a logical structure rather than a physical part of the database. Specifically, the keyword includes, but is not limited to, the page hash value and the page content of the corresponding stored page. The page hash value is mandatory and can be computed with a conventional hash algorithm; the page content is optional and can be extracted from the corresponding stored page in a conventional way.
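A minimal sketch of the keyword construction in step S1, assuming SHA-256 as the "conventional hash algorithm" (the text does not name one). The names `PageKey` and `build_key` are hypothetical and introduced only for this example.

```python
import hashlib
from typing import NamedTuple, Optional


class PageKey(NamedTuple):
    """Keyword for one page: a mandatory hash plus optional page content."""
    page_hash: str
    content: Optional[bytes] = None


def build_key(page: bytes, include_content: bool = False) -> PageKey:
    # SHA-256 is an assumption; any conventional hash algorithm could be used.
    digest = hashlib.sha256(page).hexdigest()
    return PageKey(digest, page if include_content else None)


# The same construction must be used for stored pages and newly added pages,
# so that identical pages always produce identical keywords.
stored_key = build_key(b"<html>same article</html>")
new_key = build_key(b"<html>same article</html>")
print(stored_key == new_key)  # True
```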

S2. Constructing a guess index for all the keywords, wherein the guess index satisfies the following conditions (A) to (C):

(A) no two of the indexed keywords are identical;

(B) for a given input keyword, if a unique matching indexed keyword exists among the indexed keywords, that indexed keyword can be found and returned through the guess index;

(C) for a given input keyword, if no matching indexed keyword exists among the indexed keywords, the guess index returns either search failure information or a guess keyword that does not match the input keyword, wherein the guess keyword is one of the indexed keywords, and the ratio of the probability of returning the guess keyword to the probability of returning the search failure information is smaller than a preset value.

In step S2, the indexed keywords are all of the keywords constructed in step S1. The preset value expresses that the probability of the guess index returning a guess keyword is much smaller than the probability of it returning the search failure information, and may be set, for example, to 0.1, 0.01 or 0.001. An index is a separate physical storage structure in a relational database that sorts the values of one or more columns of a table; it is a collection of those column values together with a list of logical pointers to the data pages that physically hold them. In other words, an index works like the table of contents of a book, from which the desired content can be found quickly by page number. Because the index provides pointers to the data values stored in the specified columns and orders those pointers according to a specified sort order, the database can use the index to find a particular value and then follow the pointers to the rows containing it, which lets SQL (Structured Query Language) statements on the table run faster and access specific information in the table quickly. The guess index is a special index satisfying conditions (A) to (C) and can be built with an existing conventional index construction method. Given conditions (A) to (C), inputting a keyword into the guess index quickly yields either search failure information or a returned keyword with a high degree of match. In addition, by designing the keys in the guess index appropriately, the guess keyword can be made to be returned with a significantly lower probability than the search failure information.
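The text leaves the concrete index structure open. As one possible illustration, and not the structure the invention prescribes, conditions (A) to (C) can be met by bucketing keys under a short fingerprint. The class name `GuessIndex`, the parameter `fingerprint_bits` and the use of SHA-256 are all assumptions of this sketch. For simplicity the buckets hold full keys; the guess branch stands in for an index that cannot cheaply tell a true hit from a near miss.

```python
import hashlib
from typing import Dict, List, Optional


class GuessIndex:
    """Toy guess index: keys are bucketed by a short fingerprint (assumed design)."""

    def __init__(self, fingerprint_bits: int = 16) -> None:
        if not 1 <= fingerprint_bits <= 64:
            raise ValueError("fingerprint_bits must be between 1 and 64")
        self.bits = fingerprint_bits
        self.buckets: Dict[int, List[str]] = {}

    def _fingerprint(self, key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        # Keep only the top `bits` bits of the first 64 bits of the digest.
        return int.from_bytes(digest[:8], "big") >> (64 - self.bits)

    def add(self, key: str) -> None:
        bucket = self.buckets.setdefault(self._fingerprint(key), [])
        if key not in bucket:          # condition (A): no repeated indexed keys
            bucket.append(key)

    def lookup(self, key: str) -> Optional[str]:
        bucket = self.buckets.get(self._fingerprint(key))
        if not bucket:
            return None                # search failure information
        if key in bucket:
            return key                 # condition (B): the matching key is returned
        return bucket[0]               # condition (C): a non-matching guess keyword


index = GuessIndex(fingerprint_bits=16)
for k in ("page-a", "page-b", "page-c"):
    index.add(k)
print(index.lookup("page-b"))  # page-b
```

If the hash is assumed to spread keys uniformly, a key that is not indexed falls into an occupied bucket with probability at most n / 2^b for n indexed keys and b fingerprint bits, so b can be chosen large enough to keep the guess-to-failure ratio below the preset value.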

S3. For a newly added page, constructing a corresponding newly added keyword, wherein the newly added keyword is constructed in the same way as the keywords.

In step S3, the newly added page is a page to be determined that has been newly acquired relative to the stored pages. Construction of the newly added keyword may be triggered at an appropriate time, for example when the newly added page is retrieved.

S4. Taking the newly added keyword as the given input keyword, and if a keyword to be checked is returned through the guess index, acquiring the corresponding page content according to the keyword to be checked.

In step S4, since the keyword to be checked is one of the keywords, the page content of the corresponding stored page can be obtained through the one-to-one correspondence between keywords and stored pages; alternatively, if the keyword contains both the page hash value and the page content of the corresponding stored page, the page content can be read directly from the keyword to be checked. If instead the guess index returns the search failure information, no stored page identical to the newly added page exists among the stored pages, and the newly added page can be determined directly as a non-deduplication target page. The newly added page is then stored and the guess index is updated according to the newly added keyword; the guess index may be updated by, but is not limited to, rebuilding it for all the keywords together with the newly added keyword.
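The two ways of obtaining page content in step S4 can be sketched as below. `PageKey`, `fetch_content` and the `page_store` mapping from page hash to stored content are hypothetical names chosen for this illustration.

```python
from typing import Dict, NamedTuple, Optional


class PageKey(NamedTuple):
    page_hash: str
    content: Optional[bytes] = None   # present only if the key embeds the content


def fetch_content(key_to_check: PageKey, page_store: Dict[str, bytes]) -> bytes:
    """Return the stored page content that belongs to the keyword to be checked."""
    if key_to_check.content is not None:
        # The key carries the page content, so it can be read directly.
        return key_to_check.content
    # Otherwise rely on the one-to-one correspondence between key and stored page.
    return page_store[key_to_check.page_hash]


store = {"h1": b"first stored page"}
print(fetch_content(PageKey("h1"), store))                  # b'first stored page'
print(fetch_content(PageKey("h2", b"embedded page"), store))  # b'embedded page'
```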

S5. Judging whether the acquired page content matches the page content of the newly added page.

S6. If they match, determining the newly added page as a deduplication target page.

In step S6, if the acquired page content is judged to match the page content of the newly added page, a stored page identical to the newly added page exists among the stored pages. The newly added page is therefore determined as a deduplication target page; the corresponding stored page is then determined from the keyword to be checked, the newly added page is represented by the physical page corresponding to that stored page, and finally the newly added page is released, which completes the merged storage of the newly added page. If the content is judged not to match (that is, the keyword to be checked is a guess keyword), no stored page identical to the newly added page exists among the stored pages; the newly added page can be determined as a non-deduplication target page, after which it is likewise stored and the guess index is updated according to the newly added keyword.
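Steps S3 to S6, including both non-match paths, can be put together in a short sketch. Everything here is an assumption made for illustration: the class name `Deduplicator`, SHA-256 keys, the fingerprint-bucket index, and the incremental index update (the text mentions rebuilding the index as one option but does not restrict the update method).

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class Deduplicator:
    """Toy end-to-end flow for steps S3 to S6 (assumed design, not the patented one)."""

    def __init__(self, fingerprint_bits: int = 16) -> None:
        self.bits = fingerprint_bits
        self.pages: Dict[str, bytes] = {}       # key -> stored page content
        self.index: Dict[int, List[str]] = {}   # fingerprint -> indexed keys

    @staticmethod
    def _key(page: bytes) -> str:
        return hashlib.sha256(page).hexdigest()

    def _fingerprint(self, key: str) -> int:
        # First 16 hex digits are 64 bits; keep the top `bits` of them.
        return int(key[:16], 16) >> (64 - self.bits)

    def _lookup(self, key: str) -> Optional[str]:
        bucket = self.index.get(self._fingerprint(key))
        if not bucket:
            return None                          # search failure information
        return key if key in bucket else bucket[0]  # exact key or a guess

    def submit(self, page: bytes) -> Tuple[bool, str]:
        """Return (is_dedup_target, key of the stored page representing it)."""
        new_key = self._key(page)                          # S3
        candidate = self._lookup(new_key)                  # S4
        if candidate is not None and self.pages[candidate] == page:  # S5
            # S6: the caller maps the new page onto the stored one and releases it.
            return True, candidate
        # Search failure or wrong guess: store the page and update the index.
        # (SHA-256 collisions between different pages are ignored in this sketch.)
        self.pages[new_key] = page
        self.index.setdefault(self._fingerprint(new_key), []).append(new_key)
        return False, new_key


dedup = Deduplicator()
print(dedup.submit(b"A" * 4096)[0])  # False
print(dedup.submit(b"A" * 4096)[0])  # True
print(dedup.submit(b"B" * 4096)[0])  # False
```

Only the single candidate returned by the index is compared in full, regardless of how many pages are stored; a wrong guess costs one extra comparison and is then handled like a search failure.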

Therefore, the deduplication target page determination method described in steps S1 to S6 provides a new scheme for determining a deduplication target page based on a guess index. First, for all stored pages, a guess index is built from the corresponding keywords; given an input keyword, the index quickly returns either search failure information or a keyword with a high degree of match. Then, for a newly added page, a corresponding newly added keyword is constructed and looked up in the guess index, so that stored page content that may match is found quickly and efficiently and is compared to reach a determination. Because the newly added page no longer has to be compared against the content of every stored page, the determination speed is increased, the required time is shortened and the required computing resources are reduced, which helps improve the user search experience and facilitates practical application and popularization.

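As shown in fig. 2, a second aspect of the present embodiment provides a virtual device for implementing the method for determining a deduplication target page in the first aspect, including a keyword building module, a guess index building module, an index execution module, a matching judgment module and a target determining module;

the keyword building module is used for constructing a corresponding keyword for each of the stored pages and for constructing a corresponding newly added keyword for a newly added page, wherein the newly added keyword is constructed in the same way as the keywords;

the guess index building module is communicatively connected to the keyword building module and is used for constructing a guess index for all the keywords, wherein the guess index satisfies the following conditions (A) to (C):

(A) no two of the indexed keywords are identical;

(B) for a given input keyword, if a unique matching indexed keyword exists among the indexed keywords, that indexed keyword can be found and returned through the guess index;

(C) for a given input keyword, if no matching indexed keyword exists among the indexed keywords, the guess index returns either search failure information or a guess keyword that does not match the input keyword, wherein the guess keyword is one of the indexed keywords, and the ratio of the probability of returning the guess keyword to the probability of returning the search failure information is smaller than a preset value;

the index execution module is communicatively connected to the keyword building module and to the guess index building module, and is used for taking the newly added keyword as the given input keyword and, if a keyword to be checked is returned through the guess index, acquiring the corresponding page content according to the keyword to be checked;

the matching judgment module is communicatively connected to the keyword building module and to the index execution module, and is used for judging whether the acquired page content matches the page content of the newly added page;

and the target determining module is communicatively connected to the matching judgment module and is used for determining the newly added page as a deduplication target page when a match is found.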

For the working process, working details, and technical effects of the foregoing apparatus provided in the second aspect of this embodiment, reference may be made to the method for determining a duplicate removal target page in the first aspect, which is not described herein again.

As shown in fig. 3, a third aspect of the present embodiment provides a computer device for executing the method for determining a deduplication target page in the first aspect. The computer device includes a memory and a processor that are communicatively connected, where the memory is used for storing a computer program, and the processor is used for reading the computer program and executing the method for determining a deduplication target page in the first aspect. For example, the memory may include, but is not limited to, a random-access memory (RAM), a read-only memory (ROM), a flash memory, a first-in first-out memory (FIFO) and/or a first-in last-out memory (FILO); the processor may be, but is not limited to, a microprocessor of the STM32F105 series. In addition, the computer device may also include, but is not limited to, a power module, a display screen and other necessary components.

For the working process, working details, and technical effects of the foregoing computer device provided in the third aspect of this embodiment, reference may be made to the method for determining a deduplication target page in the first aspect, which is not described herein again.

A fourth aspect of the present embodiment provides a computer-readable storage medium storing instructions for the method for determining a deduplication target page according to the first aspect; that is, the computer-readable storage medium has instructions stored thereon which, when run on a computer, perform the method for determining a deduplication target page according to the first aspect. The computer-readable storage medium is a carrier for storing data and may include, but is not limited to, floppy disks, optical disks, hard disks, flash memories, USB flash drives and/or memory sticks; the computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device.

For a working process, working details, and technical effects of the foregoing computer-readable storage medium provided in the fourth aspect of this embodiment, reference may be made to the method for determining a duplicate removal target page in the first aspect, which is not described herein again.

A fifth aspect of the present embodiment provides a computer program product containing instructions, which when run on a computer, cause the computer to execute the method for determining a deduplication target page according to the first aspect. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable devices.

Finally, it should be noted that the present invention is not limited to the above alternative embodiments, and anyone may derive products in various other forms in light of the present invention. The above detailed description should not be taken as limiting the scope of protection of the invention, which is defined by the claims; the description may be used to interpret the claims.
