Automatic processing method and device for click type hyperlink in website crawler

Document No.: 1846148. Publication date: 2021-11-16.

Abstract: This technology, "Automatic processing method and device for click type hyperlink in website crawler" (关于网站爬虫中点击类型超链接的自动化处理方法及装置), was created by 董仲舒, 张阳光, 何文欢, 程杰, 毕静静 and 姚金龙 on 2021-08-30. The invention discloses an automatic processing method and device for click-type hyperlinks in a website crawler. The method comprises: capturing page links with a web crawler; generating web page content; generating a hyperlink queue; using an XPath expression to determine whether the page contains an element that needs to be clicked; if so, returning the expression and continuing, otherwise jumping to the last step; using Selenium to invoke a virtual browser and reload the current page, continuing if loading succeeds and otherwise jumping to the last step; locating the element that needs to be clicked with the expression and simulating a click; continuing if the browser responds successfully and otherwise jumping to the last step; obtaining the response content, encoding it as UTF-8, and jumping to the second step; taking out the next hyperlink and continuing to crawl with a breadth-first or depth-first traversal algorithm; then jumping to the first step. The invention greatly improves the completeness and accuracy of the content captured by the web crawler.

1. An automated processing method for clicking type hyperlinks in web crawlers, comprising:

step 1: capturing page links through a web crawler;

step 2: generating webpage content according to the page link;

step 3: de-duplicating all hyperlinks in the web page and generating a hyperlink queue;

step 4: based on the web page content, using an XPath expression to determine whether the page contains an element that needs to be clicked; if so, returning the XPath expression and executing step 5; if not, jumping to step 8;

step 5: invoking a virtual browser using Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;

step 6: selecting with the XPath expression returned in step 4; once the element that needs to be clicked is found, performing a simulated click on it and then waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;

step 7: obtaining the response content from the browser, encoding it as UTF-8, and then jumping to step 2;

step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm; then jumping to step 1.

2. The method of claim 1, wherein in step 4, the xpath expression comprises:

1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];

2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //*[contains(text(), 'Next page') and @onclick];

5) //a[@onclick].

3. An automated processing apparatus for clicking type hyperlinks in web crawlers, comprising:

the page link capturing module is used for capturing page links through a web crawler;

the webpage content generating module is used for generating webpage content according to the page link;

the de-duplication module is used for de-duplicating all hyperlinks in the web page and generating a hyperlink queue;

the first judgment module is used for determining, based on the web page content and using an XPath expression, whether the web page contains an element that needs to be clicked; if so, returning the XPath expression and executing the second judgment module; if not, executing the loop module;

the second judgment module is used for invoking a virtual browser using Selenium and reloading the current page; if loading succeeds, executing the third judgment module; otherwise executing the loop module;

the third judgment module is used for selecting with the XPath expression returned by the first judgment module; once the element that needs to be clicked is found, performing a simulated click on it and then waiting for the browser's response; if the browser responds successfully, executing the encoding module; otherwise executing the loop module;

the encoding module is used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generating module;

the loop module is used for taking the next hyperlink from the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm, and then executing the page link capturing module.

4. The automated processing apparatus for clicking type hyperlinks according to claim 3, wherein said xpath expression comprises:

1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];

2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //*[contains(text(), 'Next page') and @onclick];

5) //a[@onclick].

Technical Field

The invention belongs to the technical field of website crawlers, and particularly relates to an automatic processing method and device for click-type hyperlinks in a website crawler, suitable for links that can be accessed further only by manual clicking during crawling.

Background

With the development of modern web front-end technology, particularly the front-end language JavaScript, many excellent front-end frameworks such as jQuery, Vue, React, and Angular have appeared. Along with these frameworks, excellent UI component libraries such as Bootstrap and Element-UI have emerged. These frameworks perform well in compatibility, applicability, convenience, and internationalization, which greatly improves the efficiency of website development, so more and more websites are developed with them.

Although these frameworks bring great convenience to website development, they bring great difficulties and challenges to website crawlers and content retrieval. One of the most prominent problems is that some hyperlinks must be clicked before access can continue. A traditional web crawler only captures hyperlinks of the form <a href="xxx">xxx</a> that exist in the web page, and is unable to handle hyperlinks of the form <a onclick="xxx">xxx</a>. Such hyperlinks are increasingly common in existing websites, particularly as "previous page" and "next page" links. Because these links are the basic style and method for paginating website content, the content captured by a website crawler is incomplete and inaccurate.

Disclosure of Invention

Traditional website crawlers cannot capture click-type hyperlinks (such as "previous page" and "next page" hyperlinks), so the captured content is incomplete and inaccurate. To address this problem, the invention provides an automatic processing method and device for click-type hyperlinks in a website crawler.

In order to achieve the purpose, the invention adopts the following technical scheme:

In one aspect, the invention provides an automatic processing method for click-type hyperlinks in a website crawler, which comprises the following steps:

step 1: capturing page links through a web crawler;

step 2: generating webpage content according to the page link;

step 3: de-duplicating all hyperlinks in the web page and generating a hyperlink queue;

step 4: based on the web page content, using an XPath expression to determine whether the page contains an element that needs to be clicked; if so, returning the XPath expression and executing step 5; if not, jumping to step 8;

step 5: invoking a virtual browser using Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;

step 6: selecting with the XPath expression returned in step 4; once the element that needs to be clicked is found, performing a simulated click on it and then waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;

step 7: obtaining the response content from the browser, encoding it as UTF-8, and then jumping to step 2;

step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm; then jumping to step 1.
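The hyperlink queue of steps 3 and 8 can be illustrated with a small sketch. The text does not say how the queue is stored or how links are normalized, so the class name `HyperlinkQueue`, the `breadth_first` flag, the fragment stripping, and the restriction to http/https links are all assumptions made for illustration only.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag


class HyperlinkQueue:
    """De-duplicated queue of URLs still to be crawled (hypothetical helper)."""

    def __init__(self, breadth_first=True):
        self._pending = deque()
        self._seen = set()              # every URL ever enqueued
        self._breadth_first = breadth_first

    def add_all(self, base_url, hrefs):
        """Resolve hrefs against base_url; enqueue unseen http(s) URLs.

        Returns the number of URLs actually added.
        """
        added = 0
        for href in hrefs:
            # Assumption: URLs differing only in their #fragment are duplicates.
            url, _fragment = urldefrag(urljoin(base_url, href))
            if url.startswith(("http://", "https://")) and url not in self._seen:
                self._seen.add(url)
                self._pending.append(url)
                added += 1
        return added

    def next(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._pending:
            return None
        if self._breadth_first:
            return self._pending.popleft()   # oldest first: breadth-first order
        return self._pending.pop()           # newest first: depth-first order
```

Taking links from the front of the queue gives a breadth-first crawl, while taking them from the back gives a depth-first crawl, so one container can serve either traversal named in step 8.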

Further, in step 4, the xpath expression includes:

1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];

2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //*[contains(text(), 'Next page') and @onclick];

5) //a[@onclick].

In another aspect, the present invention provides an apparatus for automatically processing click-type hyperlinks in a website crawler, comprising:

the page link capturing module is used for capturing page links through a web crawler;

the webpage content generating module is used for generating webpage content according to the page link;

the de-duplication module is used for de-duplicating all hyperlinks in the web page and generating a hyperlink queue;

the first judgment module is used for determining, based on the web page content and using an XPath expression, whether the web page contains an element that needs to be clicked; if so, returning the XPath expression and executing the second judgment module; if not, executing the loop module;

the second judgment module is used for invoking a virtual browser using Selenium and reloading the current page; if loading succeeds, executing the third judgment module; otherwise executing the loop module;

the third judgment module is used for selecting with the XPath expression returned by the first judgment module; once the element that needs to be clicked is found, performing a simulated click on it and then waiting for the browser's response; if the browser responds successfully, executing the encoding module; otherwise executing the loop module;

the encoding module is used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generating module;

the loop module is used for taking the next hyperlink from the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm, and then executing the page link capturing module.

Further, the xpath expression includes:

1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];

2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //*[contains(text(), 'Next page') and @onclick];

5) //a[@onclick].

Compared with the prior art, the invention has the following beneficial effects:

The method uses simulated browser technology. When the web page content is judged to contain a feature that requires manual clicking, the simulated browser loads the current page once, the link that currently needs to be clicked is located and a simulated click event is performed on it, the browser then loads the link, and finally the corresponding content is returned. As a result, when crawling, the crawler can capture not only conventional web page content but also click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional web crawler cannot capture, which greatly improves the completeness and accuracy of the crawled content.
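The load, click, and capture sequence described above might look roughly like the following sketch. It is written against the small part of the Selenium WebDriver interface it needs (`get`, `find_element`, `page_source`), so any object with those members can be passed in; the string `"xpath"` is the value of `By.XPATH` in Selenium's Python bindings. The function name `click_and_capture`, the timeout values, and the success criterion (the rendered source changes after the click) are assumptions, since the text does not define what counts as a successful browser response.

```python
import time


def click_and_capture(driver, url, xpath, timeout=10.0, poll=0.2):
    """Load url, click the element matched by xpath, return UTF-8 bytes or None.

    driver is any object exposing the WebDriver members used here:
    get(url), find_element(by, value) and the page_source attribute.
    """
    try:
        driver.get(url)                                 # step 5: reload the current page
        before = driver.page_source
        element = driver.find_element("xpath", xpath)   # step 6: locate the element
        element.click()                                 # step 6: simulated click
    except Exception:
        # Sketch-level handling: any load, lookup, or click failure means "go to step 8".
        return None

    # Assumed success criterion: the page source differs from the pre-click source.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        after = driver.page_source
        if after != before:
            return after.encode("utf-8")                # step 7: UTF-8 encode the response
        time.sleep(poll)
    return None                                         # no visible response in time
```

With Selenium installed, a real browser session such as `selenium.webdriver.Chrome()` could be passed as `driver`. A production crawler would more likely wait on a specific condition than compare page sources, which is used here only to keep the sketch free of dependencies.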

Drawings

FIG. 1 is a basic flowchart of a method for automated processing of click-type hyperlinks in web crawlers, according to an embodiment of the present invention;

FIG. 2 is a graph comparing website crawler data;

FIG. 3 is a block diagram of an exemplary automated processing device for clicking on a hyperlink in a web crawler.

Detailed Description

The invention is further illustrated by the following examples in conjunction with the accompanying drawings:

As shown in FIG. 1, an automated processing method for click-type hyperlinks in a website crawler includes:

step 1: capturing page links through a web crawler;

step 2: generating webpage content according to the page link;

step 3: de-duplicating all hyperlinks in the web page and generating a hyperlink queue;

step 4: based on the web page content, using an XPath expression to determine whether the page contains an element that needs to be clicked; if so, returning the XPath expression and executing step 5; if not, jumping to step 8;

step 5: invoking a virtual browser using Selenium and reloading the current page; if loading succeeds, executing step 6; otherwise jumping to step 8;

step 6: selecting with the XPath expression returned in step 4; once the element that needs to be clicked is found, performing a simulated click on it and then waiting for the browser's response; if the browser responds successfully, executing step 7; otherwise jumping to step 8;

step 7: obtaining the response content from the browser, encoding it as UTF-8, and then jumping to step 2;

step 8: taking the next hyperlink from the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm; then jumping to step 1.
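The control flow of steps 1 to 8 can be sketched as a pair of nested loops. This is only an illustration under stated assumptions: the four callables (`fetch`, `extract_links`, `find_click_xpath`, `click_in_browser`) are hypothetical placeholders for the page fetcher, link extractor, XPath check, and browser click, and the `max_pages` and `max_clicks_per_url` limits are safeguards added here that do not appear in the text. The queue is consumed first-in first-out, which is the breadth-first variant.

```python
def crawl(start_url, fetch, extract_links, find_click_xpath, click_in_browser,
          max_pages=100, max_clicks_per_url=20):
    """Return a list of (url, content) pairs collected by following steps 1 to 8."""
    pending, seen, pages = [start_url], {start_url}, []
    while pending and len(pages) < max_pages:
        url = pending.pop(0)                          # steps 8 and 1: next link, FIFO
        content = fetch(url)                          # step 2: page content (None on failure)
        clicks = 0
        while content is not None and len(pages) < max_pages:
            pages.append((url, content))
            for link in extract_links(url, content):  # step 3: de-duplicate and enqueue
                if link not in seen:
                    seen.add(link)
                    pending.append(link)
            xpath = find_click_xpath(content)         # step 4: element to click?
            if xpath is None or clicks >= max_clicks_per_url:
                break                                 # nothing to click: go to step 8
            content = click_in_browser(url, xpath)    # steps 5 to 7 (None on failure)
            clicks += 1                               # then back to step 2 with new content
    return pages
```

The inner loop corresponds to the jump from step 7 back to step 2, and leaving it corresponds to the jumps to step 8. The click limit is there because content returned after a click may itself contain a clickable element, and the steps as written do not say when that cycle ends.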

Further, in step 4, the xpath expression includes:

1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];

2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //*[contains(text(), 'Next page') and @onclick];

5) //a[@onclick].
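To make the five patterns concrete, the following sketch approximates the step 4 check using only Python's standard `html.parser` module instead of an XPath engine, so it is a stand-in for XPath evaluation and not an implementation of it. Each rule pairs one of the five expressions, written in standard XPath syntax, with an equivalent Python predicate. The names `RULES`, `ClickDetector`, and `find_click_xpath`, and the choice to try the expressions in their listed order, are assumptions. An element's "text" here is the character data directly inside it, which is close to, but not exactly, what `contains(text(), ...)` tests in XPath 1.0.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def _attr(attrs, name):
    """Attribute value as a string ('' when absent or given without a value)."""
    return attrs.get(name) or ""


# (XPath expression, predicate over tag name, attribute dict, own text)
RULES = [
    ("//td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')]",
     lambda tag, attrs, text: tag == "td" and "lower page" in text
     and _attr(attrs, "onclick").startswith("window.location")),
    ("//a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')]",
     lambda tag, attrs, text: tag == "a" and "Next page" in text
     and _attr(attrs, "href").startswith("javascript:")),
    ("//a[contains(text(), 'next page') and @onclick and @href='#']",
     lambda tag, attrs, text: tag == "a" and "next page" in text
     and "onclick" in attrs and attrs.get("href") == "#"),
    ("//*[contains(text(), 'Next page') and @onclick]",
     lambda tag, attrs, text: "Next page" in text and "onclick" in attrs),
    ("//a[@onclick]",
     lambda tag, attrs, text: tag == "a" and "onclick" in attrs),
]


class ClickDetector(HTMLParser):
    """Records which rules match at least one element of the parsed page."""

    def __init__(self):
        super().__init__()
        self._open = []          # stack of [tag, attrs, own_text] for open elements
        self.matched = set()     # indexes into RULES

    def handle_starttag(self, tag, attrs):
        entry = [tag, dict(attrs), ""]
        if tag in VOID_TAGS:
            self._check(entry)   # void elements never get an end tag
        else:
            self._open.append(entry)

    def handle_data(self, data):
        if self._open:
            self._open[-1][2] += data

    def handle_endtag(self, tag):
        # Close everything up to and including the nearest open element named tag.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                for entry in self._open[i:]:
                    self._check(entry)
                del self._open[i:]
                break

    def close(self):
        super().close()
        for entry in self._open:  # elements left unclosed at end of input
            self._check(entry)
        self._open.clear()

    def _check(self, entry):
        tag, attrs, text = entry
        for index, (_xpath, rule) in enumerate(RULES):
            if rule(tag, attrs, text):
                self.matched.add(index)


def find_click_xpath(html):
    """Return the first listed expression that matches the page, or None."""
    detector = ClickDetector()
    detector.feed(html)
    detector.close()
    return RULES[min(detector.matched)][0] if detector.matched else None
```

For example, `find_click_xpath('<a href="/p2">Next page</a>')` returns `None` because an ordinary link needs no click handling. The string literals are kept exactly as they appear in the list above, and a real deployment would presumably match whatever wording the target site uses.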

To verify the effect of the present invention, the following experiment was performed:

Five websites were used in the test. To eliminate the influence of network jitter and server performance, each website was tested ten times: the first five runs used a standard crawler method and the last five used the method of the invention. The test results are shown in the following table. For privacy and similar reasons, website names are replaced with website IDs.

Table 1: comparative test result table

The comparative test chart is shown in fig. 2.

As can be seen from Table 1 and FIG. 2, for certain websites the amount of data captured by the method of the invention is significantly higher, and the completeness of the crawler data is greatly improved.

On the basis of the above embodiment, as shown in fig. 3, another aspect of the present invention provides an automatic processing apparatus for clicking type hyperlinks in web crawlers, comprising:

the page link capturing module is used for capturing page links through a web crawler;

the webpage content generating module is used for generating webpage content according to the page link;

the de-duplication module is used for de-duplicating all hyperlinks in the web page and generating a hyperlink queue;

the first judgment module is used for determining, based on the web page content and using an XPath expression, whether the web page contains an element that needs to be clicked; if so, returning the XPath expression and executing the second judgment module; if not, executing the loop module;

the second judgment module is used for invoking a virtual browser using Selenium and reloading the current page; if loading succeeds, executing the third judgment module; otherwise executing the loop module;

the third judgment module is used for selecting with the XPath expression returned by the first judgment module; once the element that needs to be clicked is found, performing a simulated click on it and then waiting for the browser's response; if the browser responds successfully, executing the encoding module; otherwise executing the loop module;

the encoding module is used for obtaining the response content from the browser, encoding it as UTF-8, and then executing the webpage content generating module;

the loop module is used for taking the next hyperlink from the hyperlink queue and continuing to crawl using a breadth-first or depth-first traversal algorithm, and then executing the page link capturing module.

Further, the xpath expression includes:

1) //td[contains(text(), 'lower page') and starts-with(@onclick, 'window.location')];

2) //a[contains(text(), 'Next page') and starts-with(@href, 'javascript:')];

3) //a[contains(text(), 'next page') and @onclick and @href='#'];

4) //*[contains(text(), 'Next page') and @onclick];

5) //a[@onclick].

In summary, the invention uses simulated browser technology. When the web page content is judged to contain a feature that requires manual clicking, the simulated browser loads the current page once, the link that currently needs to be clicked is located and a simulated click event is performed on it, the browser then loads the link, and finally the corresponding content is returned. As a result, when crawling, the crawler can capture not only conventional web page content but also click-type hyperlinks (such as "previous page" and "next page" hyperlinks) that a traditional web crawler cannot capture, which greatly improves the completeness and accuracy of the crawled content.

The above shows only the preferred embodiments of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.
