Dynamic data capture method based on scrapy and puppeteer

Document No.: 1921687 · Publication date: 2021-12-03

Reading note: This technology, a dynamic data capture method based on scrapy and puppeteer, was designed and created by Tan Wenting (谭文婷) on 2021-09-07. The invention relates to a dynamic data capture method based on scrapy and puppeteer, comprising the following steps: step 1): analyze the network requests; step 2): analyze the dynamic network requests; step 3): install scrapy-redis; step 4): set the crawler's initial urls; step 5): define the scrapy.Item object in advance; step 6): process the Item object in a pipeline; step 7): puppeteer monitors a redis queue; step 8): wait for the page to finish loading, obtain the page html, and extract the target data via xpath; step 9): insert the target data directly into mongodb, or update it by id, for persistent storage. The invention combines the advantages of scrapy and puppeteer and classifies the data to achieve fast and efficient collection: puppeteer collects the asynchronously loaded, API-encrypted, complex data, and communication goes through redis, so the system is easy to scale.

1. A dynamic data capture method based on scrapy and puppeteer, characterized by comprising the following steps:

step 1): analyzing the network requests and determining which of the data to be collected are static and which are dynamic; collecting the static data with scrapy;

step 2): analyzing the dynamic network requests and judging whether the API interface returns data stably; if so, collecting the dynamic data with scrapy; if not, collecting the dynamic data with puppeteer;

step 3): installing scrapy-redis so that scrapy supports distributed collection;

step 4): setting the initial urls of the crawler; scrapy requests each url with the GET method and, once the request succeeds, automatically calls the default callback function parse to handle the result; when the request completes, scrapy passes the response returned by the request to the callback function as a parameter;

step 5): defining the scrapy.Item object in advance, with the required target data defined as attributes of the Item; parsing the response content in the callback function;

step 6): processing the Item object in a pipeline;

step 7): puppeteer monitors a redis queue; when data exist in the queue, it takes out the stored url and other key information, launches chrome, and opens the url of the target website;

step 8): waiting for the page to finish loading, obtaining the page html, and extracting the target data via xpath;

step 9): inserting the target data directly into mongodb, or updating it by id, for persistent storage.

2. The dynamic data capture method based on scrapy and puppeteer according to claim 1, wherein step 1) is specifically: analyzing the network requests: using the chrome developer tools to inspect the network requests of the target webpage and determine which of the data to be collected are static and which are dynamic; static data are data that can be obtained from the html returned by a Doc-type network request; dynamic data are data returned by an additional ajax request rather than by a Doc-type network request; the static data are collected with scrapy.

3. The dynamic data capture method based on scrapy and puppeteer according to claim 1, wherein step 2) is specifically: analyzing the dynamic network requests: using the Postman tool, manually sending the ajax request obtained in step 1) multiple times with the same elements, and checking whether the returned result contains the target data and whether the same result is returned each time, thereby judging whether the API interface returns data stably; if the target data are obtained every time, the interface returns data stably and this part of the dynamic data is collected with scrapy; if the target data cannot be returned stably, this part of the dynamic data is collected with puppeteer.

4. The dynamic data capture method based on scrapy and puppeteer according to claim 3, wherein in step 2) the same elements are the url, request headers, cookies and parameters.

5. The dynamic data capture method based on scrapy and puppeteer according to claim 1, wherein step 4) comprises: for a complex initial url request, overriding scrapy's start_requests method to read all initial urls from the database and redis, manually defining the request headers, cookies and parameters, and setting a custom callback function.

6. The dynamic data capture method based on scrapy and puppeteer according to claim 1, wherein step 5) comprises: creating an Item object, assigning the specific target data to the corresponding attributes, and finally returning the populated Item object.

7. The dynamic data capture method based on scrapy and puppeteer according to claim 6, wherein step 5) comprises: if html-type data are returned, extracting the data using xpath.

8. The dynamic data capture method based on scrapy and puppeteer according to claim 6, wherein step 5) comprises: if json-type data are returned, parsing the data using json.loads.

9. The dynamic data capture method based on scrapy and puppeteer according to claim 1, wherein step 6) comprises: storing the collected target data in mongodb for persistent storage; and packaging the key fields of the data not yet collected, such as the page url and the data id, and pushing them into a redis queue.

10. The dynamic data capture method based on scrapy and puppeteer according to claim 1, wherein step 8) is: monitoring the response events of http requests; when the API returns data, a response event is triggered and the returned content is processed in the event's callback function; if the data are html-type, extracting them with xpath; if json-type, parsing them with json.loads.

Technical Field

The invention belongs to the technical field of data acquisition, and in particular relates to a dynamic data capture method based on scrapy and puppeteer.

Background

With the continuous development of network technology, more and more websites adopt dynamic webpage techniques: content is obtained asynchronously with ajax and rendered through JavaScript, and the security protection of APIs grows ever more thorough. These measures greatly improve the user experience and data security, but also present new challenges to crawlers. Scrapy is a popular crawler framework that can quickly and efficiently crawl a website and extract structured data from its pages, with support for asynchronous requests, concurrency and deduplication; however, it can only collect statically rendered data and data obtainable stably through an api, while much asynchronously loaded data, and data behind interfaces whose random parameters cannot be decoded, cannot be collected. For this portion of the data, a simulated browser can be used. There are two main simulated-browser approaches: one uses selenium to locate element positions and read data, but selenium is a tool developed for automated testing, runs slowly, is unstable, and cannot intercept network requests; the other, puppeteer, is more powerful: it can obtain dynamic request data directly by intercepting ajax responses, and collects data faster. To collect dynamic data efficiently, a new set of capture methods needs to be developed.

Disclosure of Invention

The invention provides a dynamic data capture method based on scrapy and puppeteer to overcome the above defects. The invention combines the advantages of scrapy and puppeteer into one data-collection method: static data are collected with scrapy, which is faster, while requests for dynamic pages are handed to puppeteer for processing, solving the technical problem. Redis is used for communication in between, so tasks are processed in a distributed manner and the system is easy to scale. The resulting crawler is highly concurrent, easy to extend, and able to handle dynamic requests.

The invention is realized by adopting the following technical scheme.

The invention discloses a dynamic data capture method based on scrapy and puppeteer, which comprises the following steps:

step 1): analyzing the network requests and determining which of the data to be collected are static and which are dynamic; collecting the static data with scrapy;

step 2): analyzing the dynamic network requests and judging whether the API interface returns data stably; if so, collecting the dynamic data with scrapy; if not, collecting the dynamic data with puppeteer;

step 3): installing scrapy-redis so that scrapy supports distributed collection;

step 4): setting the initial urls of the crawler; scrapy requests each url with the GET method and, once the request succeeds, automatically calls the default callback function parse to handle the result; when the request completes, scrapy passes the response returned by the request to the callback function as a parameter;

step 5): defining the scrapy.Item object in advance, with the required target data defined as attributes of the Item; parsing the response content in the callback function;

step 6): processing the Item object in a pipeline;

step 7): puppeteer monitors a redis queue; when data exist in the queue, it takes out the stored url and other key information, launches chrome, and opens the url of the target website;

step 8): waiting for the page to finish loading, obtaining the page html, and extracting the target data via xpath;

step 9): inserting the target data directly into mongodb, or updating it by id, for persistent storage.

Further, step 1) of the present invention is specifically: analyzing the network requests: using the chrome developer tools to inspect the network requests of the target webpage and determine which of the data to be collected are static and which are dynamic; static data are data that can be obtained from the html returned by a Doc-type network request; dynamic data are data returned by an additional ajax request rather than by a Doc-type network request; the static data are collected with scrapy.

Further, step 2) of the present invention is specifically: analyzing the dynamic network requests: using the Postman tool, manually sending the ajax request obtained in step 1) multiple times with the same elements, and checking whether the returned result contains the target data and whether the same result is returned each time, thereby judging whether the API interface returns data stably; if the target data are obtained every time, the interface returns data stably and this part of the dynamic data is collected with scrapy; if the target data cannot be returned stably, this part of the dynamic data is collected with puppeteer.

Further, in step 2) of the present invention: the same elements are the url, request headers, cookies and parameters.

Further, step 4) of the present invention comprises: for a complex initial url request, overriding scrapy's start_requests method to read all initial urls from the database and redis, manually defining the request headers, cookies and parameters, and setting a custom callback function.

Further, step 5) of the present invention comprises: creating an Item object, assigning the specific target data to the corresponding attributes, and finally returning the populated Item object.

Further, step 5) of the present invention comprises: if html-type data are returned, extracting the data using xpath.

Further, step 5) of the present invention comprises: if json-type data are returned, parsing the data using json.loads.

Further, step 6) of the present invention comprises: storing the collected target data in mongodb for persistent storage; and packaging the key fields of the data not yet collected, such as the page url and the data id, and pushing them into a redis queue.

Further, step 8) of the present invention is: monitoring the response events of http requests; when the API returns data, a response event is triggered and the returned content is processed in the event's callback function; if the data are html-type, extracting them with xpath; if json-type, parsing them with json.loads.

The method has the advantage of combining the strengths of both scrapy and puppeteer: the data are classified, and static data together with data obtainable stably through an API are collected with scrapy, achieving fast and efficient collection, while puppeteer collects asynchronously loaded, API-encrypted and otherwise complex data. Communication goes through redis, so the system is easy to scale: according to the volume of each type of data actually collected, scrapy and puppeteer servers can be added or removed to maintain high performance.

The invention is further explained below with reference to the drawings and the detailed description.

Drawings

FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.

Detailed Description

A dynamic data capture method based on scrapy and puppeteer comprises the following steps:

1. Analyze the network requests: use the chrome developer tools to view the network requests of the target webpage, and determine which of the data to be collected are static and which are obtained through ajax requests.

Data that can be obtained from the html returned by a Doc-type network request are called static data; conversely, data that are returned by an additional ajax request rather than by the Doc-type network request are called dynamic data. The static portion of the data is collected with scrapy.
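The static/dynamic split described above can be illustrated with a small sketch. The request records below are a hypothetical, simplified view of what the chrome developer tools show in the Network panel; in practice this analysis is done by hand.

```python
def classify_requests(requests):
    """Split recorded network requests into static (Doc-type) and dynamic (ajax)."""
    static, dynamic = [], []
    for req in requests:
        # Doc-type (document) requests return the initial html: static data.
        # XHR/fetch requests are the additional ajax calls: dynamic data.
        if req["type"] == "document":
            static.append(req)
        elif req["type"] in ("xhr", "fetch"):
            dynamic.append(req)
    return static, dynamic

recorded = [
    {"url": "https://example.com/list", "type": "document"},
    {"url": "https://example.com/api/items", "type": "xhr"},
    {"url": "https://example.com/style.css", "type": "stylesheet"},
]
static, dynamic = classify_requests(recorded)
```

Requests that carry neither page html nor target data (stylesheets, images) fall outside both categories and are simply ignored.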

2. Analyze the dynamic network requests: use the Postman tool to manually send the ajax request obtained in step 1 multiple times, with the same url, request headers, cookies and parameters, and check whether the returned result contains the target data and whether the same result is returned each time, thereby judging whether the API interface returns data stably.

If the target data are obtained every time, the interface returns data stably and this part of the dynamic data is also collected with scrapy; if the target data cannot be returned stably, this part is collected with puppeteer.
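The stability rule above can be sketched as a small predicate over the response bodies captured from repeated, identical requests (the sample bodies and the `items` field name are illustrative assumptions):

```python
import json

def api_is_stable(responses, target_field):
    """Treat an API as stable when every response parses as json, contains
    the target field, and all responses are identical."""
    try:
        parsed = [json.loads(body) for body in responses]
    except json.JSONDecodeError:
        return False
    if not parsed:
        return False
    if any(target_field not in p for p in parsed):
        return False
    return all(p == parsed[0] for p in parsed)

# Same body every time -> stable; a token error on the second try -> unstable.
stable = api_is_stable(['{"items": [1, 2]}'] * 3, "items")
unstable = api_is_stable(['{"items": [1]}', '{"error": "token"}'], "items")
```

An unstable result routes that portion of the data to puppeteer, exactly as in the token-protected detail API of the embodiment below.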

3. Install scrapy-redis so that scrapy supports distributed collection.
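A typical settings.py fragment enabling scrapy-redis might look as follows; the redis connection string is a placeholder for this sketch:

```python
# settings.py -- scrapy-redis configuration fragment (step 3).

# Route scheduling through redis so multiple spider processes share one queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across all spider instances via redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue between runs so crawls can be paused and resumed.
SCHEDULER_PERSIST = True

# Connection string of the shared redis instance (placeholder).
REDIS_URL = "redis://localhost:6379/0"
```

With these settings, adding a scrapy server is just a matter of starting another spider process pointed at the same redis instance.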

4. Set the initial urls of the crawler; scrapy requests each url with the GET method and, once the request succeeds, automatically calls the default callback function parse to handle the result.

If the initial url request is complex, scrapy's start_requests method can be overridden to read all initial urls from the database and redis, manually define the request headers, cookies and parameters, and set a custom callback function.

When the request completes, scrapy passes the response returned by the request to the callback function as a parameter.

5. Define the Item object in advance, with the required target data defined as attributes of the Item. Parse the response content in the callback function.

If html-type data are returned, extract the data using xpath;

if json-type data are returned, parse them with json.loads.

Create an Item object, assign the just-extracted target data to the corresponding attributes, and finally return the populated Item object.
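The callback logic above can be sketched as follows, with a plain dict standing in for a scrapy.Item; the field names "id" and "name" are illustrative assumptions, and the html/xpath branch is elided to keep the sketch stdlib-only:

```python
import json

def parse_response(body, content_type):
    """Turn a raw response body into an item dict.

    json responses are parsed with json.loads; html responses would be fed
    to an xpath extractor inside the real spider callback.
    """
    if "json" in content_type:
        data = json.loads(body)
        # Assign the extracted target data to the item's attributes.
        return {"id": data.get("id"), "name": data.get("name")}
    raise ValueError("html responses are handled with xpath in the spider")

item = parse_response('{"id": 42, "name": "widget"}', "application/json")
```

In a real spider the function body lives in the custom callback and yields a scrapy.Item instead of returning a dict.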

6. Process the Item object in a pipeline.

Store the collected target data in mongodb for persistent storage;

package the key fields of the data not yet collected, such as the page url and the data id, and push them into a redis queue.
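A minimal sketch of this pipeline step, with a dict standing in for the mongodb collection and a list for the redis queue (real code would use pymongo and a redis client instead):

```python
import json

mongo_store = {}   # stand-in for a mongodb collection, keyed by item id
redis_queue = []   # stand-in for a redis list used as a queue

def process_item(item):
    """Persist the collected fields, then queue the key fields still to collect."""
    # Persist what scrapy has already collected.
    mongo_store[item["id"]] = item
    # Package the key fields (page url, data id) for the puppeteer side.
    message = json.dumps({"id": item["id"], "url": item["url"]})
    redis_queue.append(message)
    return item

process_item({"id": 42, "name": "widget", "url": "https://example.com/item/42"})
```

Serializing the queue message as json keeps the scrapy and puppeteer sides decoupled: either side can be scaled or replaced as long as the message shape is preserved.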

7. puppeteer monitors the redis queue; when data exist in the queue, it takes out the stored url and other key information, launches chrome, and opens the url of the target website.
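The real listener is a puppeteer (Node.js) script; the sketch below shows only the message-unpacking and url-splicing logic in Python, and the detail-url pattern is an assumption for illustration:

```python
import json

DETAIL_URL = "https://example.com/item/{id}"  # hypothetical url pattern

def next_task(queue):
    """Pop one queued message and build the detail-page url to open in chrome."""
    if not queue:
        return None
    message = json.loads(queue.pop(0))
    return DETAIL_URL.format(id=message["id"])

queue = ['{"id": 42, "url": "https://example.com/item/42"}']
url = next_task(queue)   # the url puppeteer would open in chrome
```

On the Node side the equivalent loop would block on the redis list, then pass the spliced url to puppeteer's page navigation.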

8. Wait for the page to finish loading, obtain the page html, and extract the target data via xpath.

Alternatively, monitor the response events of http requests; when the API interface returns data, a response event is triggered and the returned content is processed in the event's callback function:

if the data are html-type, extract them with xpath;

if the data are json-type, parse them with json.loads.

Listening for the response events of http requests avoids waiting for the whole page to finish loading; for webpages with many dynamic requests, this saves collection time and server bandwidth.

9. Finally, the target data are inserted directly into mongodb, or updated by id, for persistent storage.
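The insert-or-update-by-id behavior can be sketched with a dict standing in for the mongodb collection; with pymongo this would be an update call with upsert semantics:

```python
store = {}  # stand-in for a mongodb collection, keyed by document id

def upsert(doc):
    """Insert the document if its id is new, otherwise merge in the new fields."""
    existing = store.get(doc["id"], {})
    existing.update(doc)
    store[doc["id"]] = existing

upsert({"id": 42, "name": "widget"})               # initial insert by scrapy
upsert({"id": 42, "price": 9.9, "sales": 1300})    # later update by puppeteer
```

Merging by id is what lets the scrapy side and the puppeteer side each write their half of a record without overwriting the other's fields.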

As shown in FIG. 1:

Embodiment:

1. An e-commerce website is to be collected; it contains a product list page and a product detail page. The list page contains the product id and product name, and the detail page contains the product's price and sales volume.

2. Install scrapy, scrapy-redis, puppeteer, chrome, mongodb, redis and Postman on the machine.

3. Open chrome, navigate to the product list page, and view all network requests with the chrome developer tools. Analysis shows there is no product-list data in the Doc-type request: the list data are loaded dynamically. A key API found among the XHR-type requests returns the product list. Open Postman and manually send the request several times with the same url, request headers, cookies and parameters; the product list is returned, and each request returns the same list, so the product list can be obtained stably through the API. The list-page data are therefore collected with scrapy.

4. Open chrome, navigate to the product detail page, and view all network requests with the chrome developer tools. Analysis shows there are no product price or sales data in the Doc-type request: these data are loaded dynamically. A key API found among the XHR-type requests returns the price and sales volume of the product. Open Postman and manually send the request several times with the same url, request headers, cookies and parameters; only the first request successfully returns the price and sales volume, and subsequent requests return errors. The reason is that the product API takes a token parameter that expires after one use, and the token-generation scheme cannot currently be reverse-engineered to obtain the data stably through the API. The detail-page data are therefore collected with puppeteer.

5. Set the list page as the initial url and set the callback function. In the callback function, obtain the json-type response data, parse them with json.loads, create an Item object, assign the product id and name to its attributes, and return it.

6. Process the Item object in the pipeline: store the product id and name in mongodb for persistent storage, then push the product id into the redis queue.

7. puppeteer monitors the redis queue; when data exist in the queue, it takes out the product id, splices together the url of the product detail page, launches chrome, and opens the url.

8. Listen for the detail-API response and set a response callback function. In the callback function, obtain the json-type response data and extract the product's price and sales volume.

9. Finally, update the product's price and sales data into mongodb by id for persistent storage.

The foregoing describes only some specific embodiments of the present invention; specific details and common general knowledge within the schemes have not been set out in further detail. It should be noted that the above embodiments do not limit the present invention in any way; all technical solutions obtained by equivalent substitution or equivalent transformation by those skilled in the art fall within the protection scope of the present invention. The scope of protection of this application shall be determined by the contents of the claims, and the embodiments and other descriptions in the specification serve to explain the contents of the claims.
