Webpage identification method and device, electronic equipment and storage medium

Document No.: 436678    Publication date: 2021-12-24

Abstract: The invention "Webpage identification method and device, electronic equipment and storage medium" was created by 王晓波 and 位凯志 on 2021-09-27. The application discloses a web page identification method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises: determining a target web page type and a target matching rule corresponding to the target web page type, the target matching rule being a keyword-based matching rule; acquiring the HTML text content of a target web page and matching the HTML text content against the target matching rule; and, if the HTML text content hits the target matching rule, judging that the target web page conforms to the target web page type. By using matching rules to judge whether the target web page conforms to the target web page type, the method improves the speed and accuracy of web page identification.

1. A method for identifying a web page, comprising:

determining a target webpage type and a target matching rule corresponding to the target webpage type; the target matching rule is a rule for matching based on keywords;

acquiring HTML text content of a target webpage, and matching the HTML text content by using the target matching rule;

and if the HTML text content hits the target matching rule, judging that the target webpage conforms to the type of the target webpage.

2. The method for identifying a web page according to claim 1, wherein the acquiring HTML text content of the target webpage comprises:

acquiring an HTML file according to the address of the target webpage, and performing text processing on the HTML file to obtain the HTML text content.

3. The method according to claim 1, wherein the HTML text content hits the target matching rule if the HTML text content contains a target keyword related to a function of the target web page type.

4. The method for identifying a web page according to claim 1, wherein if the HTML text content includes a preset tag, the content corresponding to the preset tag includes a preset attribute, and the preset attribute is a target keyword, the HTML text content hits the target matching rule; and the target keywords are keywords related to the functions of the target webpage type.

5. The method according to claim 1, wherein the target web page type includes a login type.

6. A web page recognition apparatus, comprising:

the determining module is used for determining a target webpage type and a target matching rule corresponding to the target webpage type; the target matching rule is a rule for matching based on keywords;

the matching module is used for acquiring HTML text content of a target webpage and matching the HTML text content by using the target matching rule;

and the judging module is used for judging that the target webpage accords with the type of the target webpage when the HTML text content hits the target matching rule.

7. The web page recognition apparatus according to claim 6, wherein the judging module is specifically a module for judging that the target webpage conforms to the target webpage type when the HTML text content contains a target keyword related to a function of the target webpage type.

8. The web page recognition apparatus according to claim 6, wherein the judging module is specifically a module for judging that the target webpage conforms to the target webpage type when the HTML text content includes a preset tag, the content corresponding to the preset tag includes a preset attribute, and the preset attribute is a target keyword; and the target keywords are keywords related to the functions of the target webpage type.

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the web page identification method according to any one of claims 1 to 5 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the web page identification method according to any one of claims 1 to 5.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for identifying a web page, an electronic device, and a computer-readable storage medium.

Background

In web account auditing, identifying the type of a web page is important. In the related art, identification is based on HTML text and uses either conventional machine learning methods, including but not limited to HAN and TextCNN models, or natural language processing (NLP) techniques, which mainly involve BERT-based text classification. However, when faced with a large number of URLs, these schemes are either slow or inaccurate and cannot meet practical requirements.

Therefore, how to improve the speed and accuracy of page recognition is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The application aims to provide a webpage identification method, a webpage identification device, electronic equipment and a computer readable storage medium, and speed and accuracy of webpage identification are improved.

In order to achieve the above object, the present application provides a web page identification method, including:

determining a target webpage type and a target matching rule corresponding to the target webpage type; the target matching rule is a rule for matching based on keywords;

acquiring HTML text content of a target webpage, and matching the HTML text content by using the target matching rule;

and if the HTML text content hits the target matching rule, judging that the target webpage conforms to the type of the target webpage.

The acquiring the HTML text content of the target webpage comprises the following steps:

and acquiring an HTML file according to the address of the target webpage, and performing text processing on the HTML file to obtain HTML text content.

And if the HTML text content contains target keywords related to the function of the target webpage type, the HTML text content hits the target matching rule.

If the HTML text content comprises a preset tag, the content corresponding to the preset tag comprises a preset attribute, and the preset attribute is a target keyword, the HTML text content hits the target matching rule; and the target keywords are keywords related to the functions of the target webpage type.

Wherein the target web page type comprises a login type.

In order to achieve the above object, the present application provides a web page recognition apparatus, including:

the determining module is used for determining a target webpage type and a target matching rule corresponding to the target webpage type; the target matching rule is a rule for matching based on keywords;

the matching module is used for acquiring HTML text content of a target webpage and matching the HTML text content by using the target matching rule;

and the judging module is used for judging that the target webpage accords with the type of the target webpage when the HTML text content hits the target matching rule.

The judging module is specifically a module for judging that the target webpage conforms to the target webpage type when the HTML text content contains a target keyword related to the function of the target webpage type.

The judging module is specifically a module for judging that the target webpage conforms to the type of the target webpage when the HTML text content comprises a preset tag, the content corresponding to the preset tag comprises a preset attribute, and the preset attribute is a target keyword; and the target keywords are keywords related to the functions of the target webpage type.

To achieve the above object, the present application provides an electronic device including:

a memory for storing a computer program;

a processor for implementing the steps of the above-mentioned web page identification method when executing the computer program.

To achieve the above object, the present application provides a computer-readable storage medium having a computer program stored thereon, which, when being executed by a processor, implements the steps of the above-mentioned web page identification method.

According to the scheme, the webpage identification method comprises the following steps: determining a target webpage type and a target matching rule corresponding to the target webpage type; the target matching rule is a rule for matching based on keywords; acquiring HTML text content of a target webpage, and matching the HTML text content by using the target matching rule; and if the HTML text content hits the target matching rule, judging that the target webpage conforms to the type of the target webpage.

According to the webpage identification method, different matching rules are designed for different webpage types, and if the HTML text content of the target webpage hits the target matching rule corresponding to a certain target webpage type, the target webpage is judged to be in accordance with the target webpage type. Therefore, the webpage identification method provided by the application judges whether the target webpage conforms to the type of the target webpage by using the matching rule, and improves the speed and accuracy of the webpage identification. The application also discloses a webpage identification device, an electronic device and a computer readable storage medium, which can also realize the technical effects.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort. The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the disclosure and, together with the description, serve to explain the disclosure without limiting it. In the drawings:

FIG. 1 is a flow diagram illustrating a method of web page identification in accordance with an exemplary embodiment;

FIG. 2 is a block diagram illustrating a web page recognition apparatus in accordance with one exemplary embodiment;

FIG. 3 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In addition, in the embodiments of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.

The embodiment of the application discloses a webpage identification method, which improves the speed and accuracy of webpage identification.

Referring to FIG. 1, a flowchart of a web page identification method according to an exemplary embodiment, the method includes:

S101: determining a target webpage type and a target matching rule corresponding to the target webpage type; the target matching rule is a rule for matching based on keywords;

The purpose of the present embodiment is to identify whether the target web page conforms to the target web page type. Specifically, different matching rules are designed for different types of web pages, and if the HTML text content of the target web page hits the target matching rule corresponding to a certain target web page type, it is determined that the target web page conforms to that target web page type.

In this step, the target web page type to be identified is determined first; it may include a login type and is not specifically limited here. Next, the target matching rule corresponding to the target web page type is determined. The matching rules are designed according to how the different functions of each web page type are expressed in the HTML text content. HTML (HyperText Markup Language) is a markup language comprising a series of tags; these tags give documents on the network a uniform format and link scattered network resources into a logical whole.
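One way to picture step S101 is a lookup table from web page type to its keyword-based rules. The sketch below is only an illustration under assumptions: the names MatchRule, RULES_BY_PAGE_TYPE, and rules_for are hypothetical and do not come from the application, and the single example rule is a placeholder rather than one of the rules described later.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MatchRule:
    """Hypothetical representation of a rule: a name plus a predicate over HTML text."""
    name: str
    matches: Callable[[str], bool]


# Hypothetical registry: web page type -> keyword-based matching rules for that type.
RULES_BY_PAGE_TYPE: Dict[str, List[MatchRule]] = {
    "login": [
        MatchRule("contains-password-keyword", lambda html: "password" in html.lower()),
    ],
}


def rules_for(page_type: str) -> List[MatchRule]:
    """Return the target matching rules registered for a target web page type."""
    try:
        return RULES_BY_PAGE_TYPE[page_type]
    except KeyError:
        raise ValueError(f"no matching rules registered for page type {page_type!r}")
```

Keeping the rules in data rather than in control flow means that supporting a further page type would only require a new registry entry.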

S102: acquiring HTML text content of a target webpage, and matching the HTML text content by using the target matching rule;

In this step, the HTML text content of the target web page to be identified is acquired first. As a possible implementation, acquiring the HTML text content of the target web page includes: acquiring an HTML file according to the address of the target web page, and performing text processing on the HTML file to obtain the HTML text content. In a specific implementation, the HTML file of the target web page is crawled according to the URL of the target web page. The HTML file is descriptive text composed of HTML commands, which can specify text, graphics, animations, sounds, forms, links, and so on. Text processing is then performed on the crawled HTML file to obtain the HTML text content.
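The acquisition step can be sketched with the Python standard library. The application does not say what its text processing consists of, so the processing below (dropping HTML comments and collapsing whitespace) is an assumption, and the function names fetch_html and to_html_text are hypothetical.

```python
import re
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the HTML file at `url` and decode it, replacing undecodable bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def to_html_text(html_file: str) -> str:
    """Assumed text processing: remove HTML comments, then collapse whitespace runs."""
    without_comments = re.sub(r"<!--.*?-->", "", html_file, flags=re.DOTALL)
    return re.sub(r"\s+", " ", without_comments).strip()
```

The tags themselves are kept, because the rules described later inspect tag names and attributes rather than only the visible text.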

Next, the HTML text content is matched against the target matching rules corresponding to the target web page type to judge whether it hits one or more of them. In a specific implementation, if the target web page type has multiple target matching rules, matching may stop as soon as the HTML text content hits one rule, or every rule may be matched against the HTML text content in turn to determine all the rules it hits; this embodiment does not specifically limit the choice.
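The two matching strategies mentioned here, stopping at the first hit or collecting every hit, can be sketched as follows. A rule is modelled as a plain predicate over the HTML text; the names first_hit and all_hits and the rule names in the test data are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Assumed rule shape: a function that takes HTML text and reports a hit.
Rule = Callable[[str], bool]


def first_hit(html_text: str, rules: Dict[str, Rule]) -> Optional[str]:
    """Stop at the first rule that hits and return its name, or None if none hit."""
    for name, rule in rules.items():
        if rule(html_text):
            return name
    return None


def all_hits(html_text: str, rules: Dict[str, Rule]) -> List[str]:
    """Match every rule in turn and return the names of all rules that hit."""
    return [name for name, rule in rules.items() if rule(html_text)]
```

Stopping at the first hit is enough to judge the page type and does less work per page; collecting all hits is useful when the hit rules themselves need to be inspected.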

S103: and if the HTML text content hits the target matching rule, judging that the target webpage conforms to the type of the target webpage.

In specific implementation, if the HTML text content of the target web page hits the target matching rule corresponding to a certain target web page type, it is determined that the target web page conforms to the target web page type.

As a possible implementation manner, if the HTML text content includes a target keyword related to a function of the target webpage type, the HTML text content hits the target matching rule. In specific implementation, the target matching rule includes a target keyword related to a function of a target webpage type, and the HTML text content of the target webpage hits the target matching rule, indicating that the HTML text content includes the target keyword, and determining that the target webpage conforms to the target webpage type.

As another possible implementation manner, if the HTML text content includes a preset tag, and the content corresponding to the preset tag includes a preset attribute, and the preset attribute is a target keyword, the HTML text content hits the target matching rule; and the target keywords are keywords related to the functions of the target webpage type. In a specific implementation, the target matching rule includes a preset tag, a preset attribute, and a target keyword related to a function of a target webpage type, and an HTML text content of the target webpage hits the target matching rule, which indicates that the HTML text content includes the preset tag, and a content corresponding to the preset tag includes the preset attribute, and the preset attribute is the target keyword, and it is determined that the target webpage conforms to the target webpage type.
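Both kinds of rule can be sketched with Python's standard html.parser module. This is an illustration under assumptions, not the application's implementation: the names hits_keyword_rule, hits_tag_rule, and TagAttributeMatcher are hypothetical, and comparing the whole attribute value case-insensitively is an assumed reading of "the preset attribute is a target keyword".

```python
from html.parser import HTMLParser


def hits_keyword_rule(html_text: str, keyword: str) -> bool:
    """First kind of rule: the HTML text content contains the target keyword."""
    return keyword.lower() in html_text.lower()


class TagAttributeMatcher(HTMLParser):
    """Records whether a preset tag carries a preset attribute equal to a keyword."""

    def __init__(self, tag: str, attribute: str, keyword: str) -> None:
        super().__init__()
        self.tag = tag.lower()
        self.attribute = attribute.lower()
        self.keyword = keyword.lower()
        self.hit = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != self.tag:
            return
        for name, value in attrs:
            if name == self.attribute and (value or "").strip().lower() == self.keyword:
                self.hit = True


def hits_tag_rule(html_text: str, tag: str, attribute: str, keyword: str) -> bool:
    """Second kind of rule: preset tag + preset attribute + target keyword."""
    matcher = TagAttributeMatcher(tag, attribute, keyword)
    matcher.feed(html_text)
    matcher.close()
    return matcher.hit
```

Parsing the tags, instead of searching the raw text for the keyword, keeps a keyword in one attribute from being mistaken for a value of another attribute.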

If the target webpage type is the login type, the corresponding target matching rule may include the following situations:

Rule 1: if the HTML text content comprises an input tag, the content corresponding to the input tag comprises a type attribute, and the type attribute is password, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword input indicates an input tag. If the statement type="password" is present, where password is case-insensitive, the page is judged to be a login page.

Rule 2: if the HTML text content comprises an input tag, the content corresponding to the input tag comprises an ltype attribute, and the ltype attribute is password, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword input indicates an input tag. If the statement ltype="password" is present, where password is case-insensitive, the page is judged to be a login page.

Rule 3: if the HTML text content comprises an input tag, the content corresponding to the input tag comprises a type attribute and a value attribute, the type attribute is submit, and the value attribute is login, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword input indicates an input tag. If the statements type="submit" and value="login" are present, where submit is case-insensitive and login may or may not contain a space in the middle, the page is judged to be a login page.

Rule 4: if the HTML text content comprises an input tag, the content corresponding to the input tag comprises a type attribute and an id attribute, the type attribute is submit, and the id attribute is sign, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword input indicates an input tag. If the statements type="submit" and id="sign" are present, where submit and sign are case-insensitive, the page is judged to be a login page.

Rule 5: if the HTML text content comprises an input tag, the content corresponding to the input tag comprises a class attribute, a type attribute and a value attribute, the class attribute is login, the type attribute is submit, and the value attribute is acquisition or reset, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword input indicates an input tag. If the statements class="login" and type="submit" are present together with value="acquisition" or value="reset", where login, submit, and reset are case-insensitive and acquisition may or may not contain a space in the middle, the page is judged to be a login page.

Rule 6: if the HTML text content comprises an input tag, the content corresponding to the input tag comprises an id attribute, a name attribute and a placeholder attribute, the id attribute is password, the name attribute is password, and the placeholder attribute is password, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword input indicates an input tag. If the statements id="password", name="password", and placeholder="password" are present, where password is case-insensitive and may or may not contain a space in the middle, the page is judged to be a login page.

Rule 7: if the HTML text content comprises an input tag, the content corresponding to the input tag comprises an id attribute and a class attribute, the id attribute is password, and the class attribute is password, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword input indicates an input tag. If the statements id="password" and class="password" are present, where password is case-insensitive, the page is judged to be a login page.

Rule 8: if the HTML text content comprises a button tag, the content corresponding to the button tag comprises a type attribute and a class attribute, the type attribute is submit, and the class attribute is login, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword button indicates a button tag. If the statements type="submit" and class="login" are present, where submit and login are case-insensitive, the page is judged to be a login page.

Rule 9: if the HTML text content comprises the button tag, the content corresponding to the button tag comprises a type attribute and a span attribute, the type attribute is submit, and the span attribute is login, the HTML text content hits the target matching rule.

In a specific implementation, keyword matching is performed on the HTML text content. A match on the keyword button indicates a button tag. If the statements type="submit" and span="login" are present, where submit and login are case-insensitive and login may or may not contain a space in the middle, the page is judged to be a login page.
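Several of the login rules above can be written down as data and checked with a small matcher. The sketch below covers Rules 1, 3, 4, 7 and 8 only and rests on assumptions: LOGIN_RULES, first_login_rule_hit, and is_login_page are hypothetical names; the whole attribute value is compared against the keyword; and every comparison is case-insensitive, which is broader than the per-keyword wording of the rules.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding: rule name -> (tag, {attribute: pattern for the whole value}).
# Every listed attribute must match on the same tag for the rule to hit.
LOGIN_RULES: Dict[str, Tuple[str, Dict[str, str]]] = {
    "rule 1": ("input", {"type": r"password"}),
    "rule 3": ("input", {"type": r"submit", "value": r"log\s*in"}),  # optional space
    "rule 4": ("input", {"type": r"submit", "id": r"sign"}),
    "rule 7": ("input", {"id": r"password", "class": r"password"}),
    "rule 8": ("button", {"type": r"submit", "class": r"login"}),
}


class _StartTagCollector(HTMLParser):
    """Collects every start tag together with its attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.start_tags: List[Tuple[str, Dict[str, str]]] = []

    def handle_starttag(self, tag, attrs):
        # Attribute values may be None (for example <input disabled>).
        self.start_tags.append((tag, {name: (value or "") for name, value in attrs}))


def first_login_rule_hit(html_text: str) -> Optional[str]:
    """Return the name of the first login rule the HTML text hits, or None."""
    collector = _StartTagCollector()
    collector.feed(html_text)
    collector.close()
    for rule_name, (rule_tag, wanted) in LOGIN_RULES.items():
        for tag, attrs in collector.start_tags:
            if tag != rule_tag:
                continue
            if all(
                re.fullmatch(pattern, attrs.get(attribute, "").strip(), re.IGNORECASE)
                for attribute, pattern in wanted.items()
            ):
                return rule_name
    return None


def is_login_page(html_text: str) -> bool:
    """Judge the page to be a login page if any encoded rule hits."""
    return first_login_rule_hit(html_text) is not None
```

For example, is_login_page('<input type="Submit" value="Log In">') hits Rule 3 in this sketch, because the pattern for value tolerates the space inside "Log In".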

In an actual test on a specific data set of 32,665 samples, comprising 13,558 login pages and 19,107 non-login pages, the results were as shown in Table 1:

Table 1

Method                         Positive report rate    Missing report rate    Processing speed (pages per second)
Traditional machine learning   100%                    24.54%                 100+
NLP technique                  99.27%                  2.7%                   60+
Matching rules                 99.45%                  2.46%                  300+

Therefore, the matching-rule-based login page identification method of this embodiment balances processing speed and accuracy: it processes 300+ pages per second, its positive report rate reaches 99%, and its missing report rate stays within 3%, which is an improvement in accuracy and processing speed over the related art.

According to the webpage identification method provided by the embodiment of the application, different matching rules are designed for different webpage types, and if the HTML text content of the target webpage hits the target matching rule corresponding to a certain target webpage type, the target webpage is judged to be in accordance with the target webpage type. Therefore, the webpage identification method provided by the embodiment of the application judges whether the target webpage conforms to the type of the target webpage by using the matching rule, and improves the speed and accuracy of the webpage identification.

In the following, a web page recognition apparatus provided by an embodiment of the present application is introduced, and a web page recognition apparatus described below and a web page recognition method described above may be referred to each other.

Referring to FIG. 2, a block diagram of a web page recognition apparatus according to an exemplary embodiment, the apparatus includes:

a determining module 201, configured to determine a target webpage type and a target matching rule corresponding to the target webpage type; the target matching rule is a rule for matching based on keywords;

the matching module 202 is configured to obtain HTML text content of a target webpage and match the HTML text content by using the target matching rule;

and the judging module 203 is used for judging that the target webpage conforms to the type of the target webpage when the HTML text content hits the target matching rule.
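The division into three modules can be pictured as three small classes wired together. This is a structural sketch only: the class and method names are hypothetical, a rule is modelled as a predicate over HTML text, and the matching module receives a caller-supplied function for obtaining the HTML text content so that the sketch runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed rule shape: a function that takes HTML text and reports a hit.
Rule = Callable[[str], bool]


@dataclass
class DeterminingModule:
    """Counterpart of module 201: maps a target web page type to its rules."""
    rules_by_type: Dict[str, List[Rule]]

    def determine(self, page_type: str) -> List[Rule]:
        return self.rules_by_type[page_type]


@dataclass
class MatchingModule:
    """Counterpart of module 202: obtains the HTML text content and matches it."""
    get_html_text: Callable[[str], str]  # hypothetical fetcher supplied by the caller

    def match(self, address: str, rules: List[Rule]) -> bool:
        html_text = self.get_html_text(address)
        return any(rule(html_text) for rule in rules)


class JudgingModule:
    """Counterpart of module 203: a hit means the page conforms to the type."""

    def judge(self, hit: bool) -> bool:
        return hit


@dataclass
class WebPageRecognizer:
    """Wires the three modules together in the order S101, S102, S103."""
    determining: DeterminingModule
    matching: MatchingModule
    judging: JudgingModule

    def conforms(self, address: str, page_type: str) -> bool:
        rules = self.determining.determine(page_type)
        hit = self.matching.match(address, rules)
        return self.judging.judge(hit)
```

Injecting the fetcher keeps the modules testable in isolation; a real deployment would pass a function that crawls the page at the given address.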

The webpage identification device provided by the embodiment of the application designs different matching rules aiming at different webpage types, and if the HTML text content of the target webpage hits the target matching rule corresponding to a certain target webpage type, the target webpage is judged to be in accordance with the target webpage type. Therefore, the webpage identification device provided by the embodiment of the application judges whether the target webpage conforms to the type of the target webpage by using the matching rule, and improves the speed and accuracy of the webpage identification.

On the basis of the foregoing embodiment, as a preferred implementation, the matching module 202 includes:

the acquisition unit is used for acquiring an HTML file according to the address of the target webpage and performing text processing on the HTML file to obtain HTML text content;

and the matching unit is used for utilizing the target matching rule to match the HTML text content.

On the basis of the foregoing embodiment, as a preferred implementation manner, the judging module 203 is specifically a module for judging that the target webpage conforms to the target webpage type when the HTML text content contains a target keyword related to a function of the target webpage type.

On the basis of the foregoing embodiment, as a preferred implementation manner, the judging module 203 is specifically a module for judging that the target webpage conforms to the target webpage type when the HTML text content includes a preset tag, the content corresponding to the preset tag includes a preset attribute, and the preset attribute is a target keyword; and the target keywords are keywords related to the functions of the target webpage type.

On the basis of the above embodiment, as a preferred implementation, the target webpage type includes a login type.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, and fig. 3 is a structural diagram of an electronic device according to an exemplary embodiment, as shown in fig. 3, the electronic device includes:

a communication interface 1 capable of information interaction with other devices such as network devices and the like;

a processor 2, connected to the communication interface 1 to exchange information with other devices, and configured to execute the web page identification method provided by one or more of the above technical solutions when running a computer program, the computer program being stored in a memory 3.

In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are labeled as bus system 4 in fig. 3.

The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.

It will be appreciated that the memory 3 may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.

The method disclosed in the above embodiment of the present application may be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.

When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present application are realized, and for brevity, are not described herein again.

In an exemplary embodiment, the present application further provides a storage medium, i.e. a computer storage medium, specifically a computer readable storage medium, for example, including a memory 3 storing a computer program, which can be executed by a processor 2 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
