Method, device and medium for collecting webpage configurable items based on RPA

Document No.: 1904790. Publication date: 2021-11-30.

Abstract: This technology, 一种基于rpa的网页可配置项的采集方法、装置及介质 (Method, device and medium for collecting webpage configurable items based on RPA), was designed by 梁威 and 谢宏亮 on 2021-08-26. The invention discloses an RPA-based method, device and medium for collecting webpage configurable items. The method comprises: locking a collection area in a webpage interface so that the collection area contains a plurality of configurable items of the current webpage, the configurable items being similar to one another; locating all configurable items within the collection area; and locating and binding all acquisition items in all the configurable items. Because the collection area is locked before the configurable items of the webpage are located, locating cannot extend beyond that area, which avoids finding unneeded configurable items. Combined with an analysis of the internal structure of the configurable items, this makes the locating of acquisition items clearer and their collection more accurate and complete.

1. A method for collecting webpage configurable items based on RPA, characterized by comprising the following steps:

locking a collection area in a webpage interface so that the collection area contains a plurality of configurable items of a current webpage, and each configurable item has similarity;

locating all of the configurable items within the collection area;

locating and binding all acquisition items in all the configurable items.

2. The method for collecting web page configurable items based on RPA as claimed in claim 1, further comprising the steps of:

scrolling the webpage interface and, if new configurable items are loaded on the current webpage, locating all the new configurable items within the collection area in the same manner, and locating and binding all acquisition items in all the new configurable items.

3. The method for collecting web page configurable items based on RPA according to claim 2, wherein the distance of scrolling said web page interface each time is the same as the height of the frame of said web page interface.

4. The collection method of the RPA-based web page configurable item according to claim 1, wherein the collection area is locked in the web interface by xpath and/or dom.

5. The collection method of RPA-based web page configurable items according to any of claims 1-4, wherein said configurable items are cards or tables.

6. An apparatus for collecting web page configurable items based on RPA, comprising:

the acquisition area positioning module is used for locking an acquisition area in a webpage interface so that the acquisition area contains a plurality of configurable items of a current webpage and each configurable item has similarity;

a configurable item location module for locating all the configurable items within the acquisition area;

and the acquisition item positioning and binding module is used for positioning and binding all acquisition items in all the configurable items.

7. A device for collecting webpage configurable items based on RPA, characterized by comprising at least one control processor and a memory communicatively connected to the at least one control processor; the memory stores instructions executable by the at least one control processor to enable the at least one control processor to perform the method for collecting RPA-based webpage configurable items of any one of claims 1 to 5.

8. A computer-readable storage medium characterized by: the computer-readable storage medium stores computer-executable instructions for causing a computer to perform the method for acquiring the RPA-based web page configurable item of any one of claims 1 to 5.

Technical Field

The invention relates to the technical field of RPA webpage configuration, in particular to a method, a device and a medium for acquiring webpage configurable items based on RPA.

Background

In a B/S (Browser/Server) system, a system based on a wide area network, the content of a web page is divided into regions. For example, when a certain product is searched for on a shopping web page, information about matching products appears; each product in the region is shown in a similar card, and the content and its ordering inside each card carry similar meaning.

At present, web page card collection technology based on RPA (Robotic Process Automation, i.e. software automation) searches the whole web page for regions that exhibit similarity, and therefore easily picks out similar regions that are not needed. Even if this is corrected during the configuration period, most web pages are paginated, so during the operation period the data on each page varies slightly as pages are turned. When a page differs from what was seen during configuration, the RPA again finds unneeded similar regions and therefore unneeded cards.

Disclosure of Invention

The present invention is directed to at least solving the problems of the prior art. To this end, the invention provides a method, a device and a medium for collecting webpage configurable items based on RPA. The method avoids collecting unneeded webpage configurable items, so that acquisition items are located more clearly and collected more accurately and completely.

The invention provides a method for acquiring webpage configurable items based on RPA, which comprises the following steps:

locking a collection area in a webpage interface so that the collection area contains a plurality of configurable items of a current webpage, and each configurable item has similarity;

locating all of the configurable items within the collection area;

locating and binding all acquisition items in all the configurable items.

According to the embodiment of the invention, at least the following technical effects are achieved:

In the prior art, similar regions are searched for across the whole web page, so unneeded regions of similar cards or tables are easily picked out. In contrast, the method locks a collection area before locating the configurable items of the web page, so that the collection area contains a plurality of configurable items of the current web page, each similar to the others. Locking the collection area first prevents locating from extending beyond it and thus avoids finding unneeded configurable items.

According to some embodiments of the invention, the method further comprises the step of: scrolling the webpage interface and, if new configurable items are loaded on the current webpage, locating all the new configurable items within the collection area in the same manner, and locating and binding all acquisition items in all the new configurable items.

According to some embodiments of the invention, the distance of each scrolling of the web interface is the same as the height of the frame of the web interface.

According to some embodiments of the invention, the acquisition area is locked in the web interface by xpath and/or dom.

According to some embodiments of the invention, the configurable item is a card or a table.

In a second aspect of the present invention, an apparatus for acquiring a web page configurable item based on RPA is provided, including:

the acquisition area positioning module is used for locking an acquisition area in a webpage interface so that the acquisition area contains a plurality of configurable items of a current webpage and each configurable item has similarity;

a configurable item location module for locating all the configurable items within the acquisition area;

and the acquisition item positioning and binding module is used for positioning and binding all acquisition items in all the configurable items.

According to the embodiment of the invention, at least the following technical effects are achieved:

In the prior art, similar regions are searched for across the whole web page, so unneeded regions of similar cards or tables are easily picked out. In contrast, the device locks a collection area before locating the configurable items of the web page, so that the collection area contains a plurality of configurable items of the current web page, each similar to the others. Locking the collection area first prevents locating from extending beyond it and thus avoids finding unneeded configurable items.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

fig. 1 is a schematic flowchart of a method for collecting a web page configurable item based on RPA according to a first embodiment of the present invention;

fig. 2 is a schematic configuration flow chart of a method for collecting a web page configurable item based on RPA according to a second embodiment of the present invention;

fig. 3 is a schematic view of a runtime flow of a method for collecting a web page configurable item based on RPA according to a second embodiment of the present invention;

fig. 4 is a schematic diagram of a collection area and a card area of a web page according to a second embodiment of the present invention;

fig. 5 is a schematic diagram of a detail page corresponding to an acquisition item according to a second embodiment of the present invention;

fig. 6 is a schematic diagram of a card area of a web page according to a second embodiment of the present invention;

fig. 7 is a schematic diagram of a table area of a web page according to a third embodiment of the present invention;

fig. 8 is a schematic diagram of attribute features on a dom element of a web card according to a third embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the prior art, similar regions are searched for across the whole web page, so regions of similar cards that are not needed are easily picked out. Even if this is corrected during the configuration period, during the operation period the data on each page varies slightly as pages are turned, and when a page differs from what was seen during configuration, unneeded similar regions, and hence unneeded cards, are found.

The method first locks a broad region, which avoids finding similar cards outside that region, and combines this with an analysis of the internal structure of the card, so that locating is clearer and collection is more accurate and complete. The invention can also be used as a collection scheme for web page tables.

Referring to fig. 1, a first embodiment of the present invention provides a method for collecting a web page configurable item based on RPA, where the configurable item is a card, including the following steps:

Step S101: a collection area is locked in the webpage interface, so that the collection area contains a plurality of cards of the current webpage, each similar to the others.

Step S102: all cards within the collection area are located.

Step S103: all acquisition items in all the cards are located and bound.

In the prior art, similar regions are searched for across the whole web page, so unneeded regions of similar cards or tables are easily picked out. In contrast, the method locks a collection area before locating the cards of the web page, so that the collection area contains a plurality of cards of the current web page, each similar to the others. Locking the collection area first prevents locating from extending beyond it and thus avoids finding unneeded cards.
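Steps S101 to S103 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: it uses Python's standard `xml.etree.ElementTree` on a small well-formed snippet instead of a live browser, and the ids, class names and field names (`result-list`, `ads`, `card`, `title`, `price`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one collection area plus a look-alike card outside it.
PAGE = """
<html><body>
  <div id="result-list">
    <div class="card"><span class="title">Phone A</span><span class="price">2999.00</span></div>
    <div class="card"><span class="title">Phone B</span><span class="price">1999.00</span></div>
  </div>
  <div id="ads">
    <div class="card"><span class="title">Sponsored</span><span class="price">9.90</span></div>
  </div>
</body></html>
"""

def collect(page_source):
    root = ET.fromstring(page_source)
    # S101: lock the collection area so later searches cannot leave it.
    area = root.find(".//div[@id='result-list']")
    # S102: locate every card, searching inside the locked area only.
    cards = area.findall(".//div[@class='card']")
    # S103: locate each acquisition item relative to its card and bind it to a field.
    return [
        {
            "title": card.find("./span[@class='title']").text,
            "price": card.find("./span[@class='price']").text,
        }
        for card in cards
    ]

records = collect(PAGE)
```

Because the card search starts from `area` rather than from the page root, the similar card under `ads` is never reached, which is the effect the locked collection area is meant to achieve.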

It should be noted that the scheme of the present invention can also be applied to collecting web page tables; the principle is the same as for collecting web page cards and is not described in detail here.

As an optional implementation, the method further comprises the following steps:

Step S104: the webpage interface is scrolled; if new cards are loaded on the current webpage, all the new cards within the collection area are located in the same manner, and all acquisition items in all the new cards are located and bound.

After the acquisition items in all cards of the current page have been collected in steps S101 to S103, the webpage interface is scrolled so that the next batch of cards on the current page can be located and their acquisition items bound. It should be noted that the collection area in step S104 is the same area as in step S101. With this embodiment, unneeded cards are not found when collection continues across pages.

Referring to fig. 2 and 3, for ease of understanding, a second embodiment of the present invention is described using the process of collecting web page cards as an example:

First, the collection area of the current page is configured and locked, which prevents similar cards outside the area from being identified. The collection area of the current page is located by xpath (a language for finding information in XML documents) or by attributes of a dom (Document Object Model) element such as ID, CLASS, and style, and the result can be manually modified and fine-tuned.
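Manual fine-tuning of the area locator can be illustrated with a short sketch. The helper `build_xpath` and the sample markup are hypothetical and only show how adding an attribute predicate narrows an ambiguous locator; they are not part of the described tool.

```python
import xml.etree.ElementTree as ET

def build_xpath(tag, attrs):
    """Compose an XPath matching `tag` elements that carry all given attribute values."""
    predicates = "".join(f"[@{name}='{value}']" for name, value in attrs.items())
    return f".//{tag}{predicates}"

# Hypothetical page with two regions that share the same class.
PAGE = (
    "<body>"
    "<div id='main' class='list'><p>a</p></div>"
    "<div id='side' class='list'><p>b</p></div>"
    "</body>"
)
root = ET.fromstring(PAGE)

auto = build_xpath("div", {"class": "list"})                  # ambiguous first guess
tuned = build_xpath("div", {"class": "list", "id": "main"})   # fine-tuned with the ID

auto_hits = root.findall(auto)
tuned_hits = root.findall(tuned)
```

Here `auto` matches both regions, while `tuned` matches only the intended collection area.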

Second, the cards are located: all similar cards within the collection area are located, and filter conditions can be set. For example, the width or height of a card can be required to be larger or smaller than a certain value, with an offset value added, or the attributes of certain dom elements inside the card can be used as filter conditions.
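A filter of this kind might look like the sketch below. The `Card` class, the parameter names, and the reading of the offset as a tolerance on the size thresholds are all assumptions made for illustration; the source does not specify how the offset is applied.

```python
from dataclasses import dataclass, field

@dataclass
class Card:
    # Hypothetical stand-in for a located card element.
    width: float
    height: float
    attrs: dict = field(default_factory=dict)

def make_filter(min_width=0.0, min_height=0.0, offset=0.0, required_attrs=None):
    """Build a predicate that keeps cards passing the size and attribute checks."""
    required_attrs = required_attrs or {}

    def keep(card):
        # Assumption: the offset relaxes the thresholds to tolerate small layout shifts.
        big_enough = (card.width >= min_width - offset
                      and card.height >= min_height - offset)
        attrs_match = all(card.attrs.get(k) == v for k, v in required_attrs.items())
        return big_enough and attrs_match

    return keep

cards = [
    Card(240, 320, {"class": "card"}),
    Card(236, 318, {"class": "card"}),     # slightly smaller, within the offset
    Card(240, 40, {"class": "card"}),      # too short, e.g. a banner
    Card(240, 320, {"class": "card ad"}),  # attribute value does not match
]
keep = make_filter(min_width=240, min_height=320, offset=5,
                   required_attrs={"class": "card"})
kept = [c for c in cards if keep(c)]
```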

Then, the acquisition items are configured. The current card serves as the parent locator of its acquisition items: when the traversal reaches a given card, for example the fifth, that card is used as the parent locator. Each item is treated as an acquisition record, bound to a variable, and stored in the corresponding data field.

Finally, whether to collect by scrolling is configured. When this is set to 'true', the scroll bar is scrolled at run time after the first batch of cards has been collected, and it is determined whether new data has been loaded on the page; if so, the card records in the new data are traversed and collected in turn. By default the scroll bar of the window is scrolled; a scroll bar on a specific element can also be designated.

Referring to fig. 4 to 6, fig. 4 shows the configuration of three regions. In fig. 4, the box labeled L1 represents the collection area, within which all collection tasks take place (note that the box labeled L1 is not fully shown in fig. 4). The box labeled L2 shows that, when a card is configured, the locator analyzes elements of the same kind to find the positions of all similar cards within the area. The box labeled L3 represents an acquisition item configured inside a card (note that the box labeled L2 contains several boxes labeled L3, such as "2999.00" shown in fig. 4). For each card, acquisition item locators are generated relative to the current card, so that the data is collected and mapped to the table fields. The configuration proceeds from the outer layer to the inner layer in 3 steps:

Step 1: the outer layer is located first (that is, the box labeled L1), which determines the collection area and narrows the locating range. If similar cards were searched for across the whole page, the range would be too large and locating could cross the intended boundary, so that unneeded cards would be found; this configuration step solves that problem.

Step 2: in fig. 4, the box labeled L2 is the card locator. It is searched for within the area located in step 1 (that is, the area inside the box labeled L1); after this area is selected, the similar cards in it are analyzed, as shown in fig. 6.

Step 3: in fig. 4, the box labeled L3 is an acquisition item; it is located relative to the current card.

The 3 configuration steps use locators based on xpath or dom, and at each step the locating can be fine-tuned by changing the xpath or dom syntax, which makes locating more stable and reliable.

During collection, the acquisition items in a card (fig. 4) may need to be spliced with the acquisition items in a detail page (fig. 5), which is reached by clicking through from the card, for example by clicking the picture in the current card. Fig. 5 is the detail page corresponding to the card in fig. 4, and the box labeled L3 there is the configured locator of an acquisition item in the detail page. In each traversal step, one card is clicked to enter its detail page (such as the interface of fig. 5); after returning to the card page (such as the interface of fig. 4), the acquisition items of fig. 4 and fig. 5 are saved together as one row of records to ensure data integrity.
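The splicing of card items and detail-page items into a single row can be sketched as below. The function `collect_with_details`, the callback `open_detail`, and the sample data are hypothetical; in a real run the callback would click into the detail page, read its acquisition items, and navigate back.

```python
def collect_with_details(cards, open_detail):
    """Merge each card's items with the items from its detail page into one row."""
    rows = []
    for card in cards:
        card_items = dict(card["items"])   # items bound on the card itself
        detail_items = open_detail(card)   # click through, collect, then return
        # One card yields exactly one record holding both sets of fields.
        rows.append({**card_items, **detail_items})
    return rows

# Hypothetical data standing in for a live browser session.
DETAILS = {
    "p1": {"brand": "Acme", "stock": "12"},
    "p2": {"brand": "Bolt", "stock": "0"},
}
cards = [
    {"id": "p1", "items": {"title": "Phone A", "price": "2999.00"}},
    {"id": "p2", "items": {"title": "Phone B", "price": "1999.00"}},
]
rows = collect_with_details(cards, lambda card: DETAILS[card["id"]])
```

Keeping both sets of fields in one row per card matches the stated goal of saving the items of the card page and the detail page as a single record.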

After all cards on the current page have been collected, the page scroll bar starts to scroll, and each scroll covers the window height of the current page. After each scroll it is determined whether new data has been loaded on the page; if so, the newly loaded cards are collected one by one, and the page is scrolled by one window height again once they have been collected. This repeats until no new data is loaded after a scroll, which ensures that the data of all cards on the current page is collected.
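The scroll-and-collect loop can be sketched with a simulated page. `FakePage` is a hypothetical stand-in that loads cards lazily as the viewport moves; a real implementation would drive a browser instead, and the loading rule used here is an assumption.

```python
class FakePage:
    """Hypothetical stand-in for a browser page with lazy loading."""

    def __init__(self, cards, window_height, card_height):
        self._cards = cards
        self.window_height = window_height
        self._card_height = card_height
        self.scroll_top = 0

    def scroll_by(self, distance):
        max_top = max(0, len(self._cards) * self._card_height - self.window_height)
        self.scroll_top = min(self.scroll_top + distance, max_top)

    def loaded_cards(self):
        # Assumption: a card is loaded once its top edge is above the viewport bottom.
        visible_bottom = self.scroll_top + self.window_height
        count = -(-visible_bottom // self._card_height)  # ceiling division
        return self._cards[:count]

def collect_all(page):
    collected = []
    while True:
        # Collect only the cards that were not present in earlier batches.
        collected.extend(page.loaded_cards()[len(collected):])
        before = len(page.loaded_cards())
        page.scroll_by(page.window_height)  # scroll by exactly one window height
        if len(page.loaded_cards()) == before:
            return collected                # nothing new was loaded: stop

page = FakePage([f"card-{i}" for i in range(10)], window_height=300, card_height=100)
all_cards = collect_all(page)
```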

For ease of understanding, a third embodiment of the present invention is described using the process of collecting a web page table as an example:

As shown in fig. 7 (the boxes have the same meaning as in fig. 4, except that the three boxes are labeled R1, R2, and R3), the table range is selected, each row of the table is treated as a card, and each field is treated as an acquisition item.
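Treating table rows as cards can be sketched in the same way as card collection. The markup and the field names in `FIELDS` are hypothetical, and Python's standard `xml.etree.ElementTree` again stands in for a live page.

```python
import xml.etree.ElementTree as ET

# Hypothetical table markup.
TABLE = """
<div id="order-table">
  <table>
    <thead><tr><th>Order</th><th>Customer</th><th>Amount</th></tr></thead>
    <tbody>
      <tr><td>1001</td><td>Li</td><td>35.00</td></tr>
      <tr><td>1002</td><td>Wang</td><td>120.50</td></tr>
    </tbody>
  </table>
</div>
"""

FIELDS = ["order_no", "customer", "amount"]  # hypothetical field names

root = ET.fromstring(TABLE)
# The table body plays the role of the locked collection area.
area = root.find(".//tbody")
rows = []
for tr in area.findall("./tr"):                     # each row is treated as a card
    cells = [td.text for td in tr.findall("./td")]  # each cell is an acquisition item
    rows.append(dict(zip(FIELDS, cells)))
```

Restricting the search to the table body keeps the header row out of the collected records.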

When the similarity of cards is analyzed, the element attributes (including style attributes and node attributes) and the configurable height and width of the elements in the dom node of the current card (see fig. 8) are first used as the objects of analysis. Once acquisition items have been selected and configured, the element attributes (including style attributes and node attributes) and the configurable node height and width of those acquisition items are used as internal features of the card and serve as card screening conditions, so that similar cards are found more accurately.
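One possible reading of this similarity analysis is sketched below. The dictionary layout, the `size_tolerance` parameter, and the `item_classes` key are assumptions introduced for illustration; the source does not define how the features are compared.

```python
def is_similar(sample, candidate, size_tolerance=10, required_items=()):
    """Compare a candidate with the sample card on attributes, size, and inner items."""
    same_attrs = (sample["tag"] == candidate["tag"]
                  and sample.get("class") == candidate.get("class"))
    close_size = (abs(sample["width"] - candidate["width"]) <= size_tolerance
                  and abs(sample["height"] - candidate["height"]) <= size_tolerance)
    # Once acquisition items are configured, a similar card must also contain them.
    has_items = set(required_items) <= set(candidate.get("item_classes", ()))
    return same_attrs and close_size and has_items

sample = {"tag": "div", "class": "card", "width": 240, "height": 320,
          "item_classes": ["title", "price"]}
candidates = [
    {"tag": "div", "class": "card", "width": 238, "height": 325,
     "item_classes": ["title", "price"]},                     # similar
    {"tag": "div", "class": "card", "width": 240, "height": 320,
     "item_classes": ["title"]},                              # lacks the price item
    {"tag": "div", "class": "banner", "width": 240, "height": 320,
     "item_classes": []},                                     # different attributes
]
matches = [c for c in candidates
           if is_similar(sample, c, required_items=["title", "price"])]
```

The first check corresponds to the card-level features, and the `required_items` check corresponds to using the configured acquisition items as additional screening conditions.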

In a fourth embodiment of the present invention, an RPA-based web page configurable item acquisition device is provided, which may be any type of intelligent terminal, such as a mobile phone, a tablet computer, or a personal computer. Specifically, the apparatus includes one or more control processors and a memory; one control processor is taken as an example here. The control processor and the memory may be connected by a bus or by other means; connection by a bus is taken as an example here.

The memory, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the acquisition device of the RPA-based web page configurable item in the embodiment of the present invention. The control processor implements the collection method of the RPA-based web page configurable item of the above method embodiments by running non-transitory software programs, instructions, and modules stored in the memory.

The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes a memory remotely located from the control processor, and the remote memories may be connected to the RPA-based web page configurable item acquisition device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The one or more modules are stored in the memory and, when executed by the one or more control processors, perform the collection method for RPA-based web page configurable items in the above embodiments.

The embodiment of the invention also provides a computer-readable storage medium, which stores computer-executable instructions, and the computer-executable instructions are used by one or more control processors to execute the collection method of the webpage configurable item based on the RPA in the above embodiment.

Through the above description of the embodiments, those skilled in the art can clearly understand that the embodiments can be implemented by software plus a general hardware platform. Those skilled in the art will appreciate that all or part of the processes in the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
