Webpage risk discovery method

Document No.: 1937642    Publication date: 2021-12-07

Description: This technique, "Webpage risk discovery method" (一种网页风险发现的方法), was created by 吴琼, 牛广方, and 余智华 on 2021-04-30. The invention discloses a method for discovering webpage risks, comprising the following steps: S1, a data acquisition stage, in which a large number of web pages of a website are acquired using the basic principle of efficient spreading; S2, a data circulation stage, in which Confluent is used to consume the data in Kafka and, after the risk identification service is called, the data is stored in Elasticsearch, completing the whole data circulation; S3, a risk identification stage, in which word segmentation is performed on the acquired data, the result is matched against the risk rule base, and the data matching the risk rules is saved; S4, a data storage stage, in which the acquired raw data is processed and the processed data is then stored as required according to its type, scale, and intended use; S5, the overall risk discovery process is complete. Beneficial effect: risky websites can be located quickly within massive webpage data.

1. A method for web page risk discovery, comprising the steps of:

S1, a data acquisition stage, in which a large number of web pages of the website are acquired using the basic principle of efficient spreading;

S2, a data circulation stage, in which Confluent is used to consume the data in Kafka and, after the risk identification service is called, the data is stored in Elasticsearch, completing the whole data circulation;

S3, a risk identification stage, in which word segmentation is performed on the acquired data, the result is matched against the risk rule base, and the data matching the risk rules is saved;

S4, a data storage stage, in which the acquired raw data is processed and the processed data is then stored as required according to its type, scale, and intended use;

S5, the overall risk discovery process is complete.

2. The method for webpage risk discovery according to claim 1, wherein in the data acquisition stage of step S1, the specific steps of acquiring a large number of web pages of a website using the basic principle of efficient spreading include:

S11, using the basic principle of efficient spreading, the mutual link relations between web pages are followed to continuously spread and acquire all web pages under the domain name;

S12, starting from the website home page, the web page collector first collects all web pages under the home page;

S13, the acquired web pages are parsed to find the links they contain;

S14, after the links are further deduplicated using Redis, they become new seed URLs;

S15, steps S11-S14 are repeated, so that a large number of web pages of the website are acquired comprehensively.

3. The method for webpage risk discovery according to claim 1, wherein in the data circulation stage of step S2, Confluent is used to consume the data in Kafka and, after the risk identification service is invoked, the data is stored in Elasticsearch, and the specific steps of completing the data circulation include:

S21, an interface for public invocation is provided to the user through an API;

S22, the user only needs to edit a configuration file to start a task and exchange data;

S23, Confluent is used to consume the data in Kafka, the risk identification service is called, and the data is stored in Elasticsearch, completing the whole data circulation.

4. The method for webpage risk discovery according to claim 1, wherein the data storage stage of step S4, in which the collected raw data is processed and the processed data is stored as required according to its type, scale, and intended use, specifically includes the following steps:

S41, the collected raw data is cleaned and standardized according to a data standard;

S42, a data channel with data aggregation, caching, and forwarding functions is provided based on a data bus;

S43, fusion processing of multi-source heterogeneous data is realized according to actual requirements;

S44, the processed data is stored as required according to its type, scale, and intended use.

Technical Field

The invention relates to the field of data risk identification, in particular to a webpage risk discovery method.

Background

In actual service, data from each channel is collected and temporarily stored in a message queue; a data circulation platform receives the real-time data from the message queue, calls the risk identification service, and puts the result set into a full-text search library.

Information is generally published through various channels such as news sites, WeChat, and microblogs, so quickly and efficiently filtering out risk information is a central concern in user business scenarios.

An effective solution to the problems in the related art has not been proposed yet.

Disclosure of Invention

The invention aims to provide a method for webpage risk discovery that finds risks quickly and accurately, so as to solve the problems noted in the background art.

In order to achieve the purpose, the invention provides the following technical scheme:

A method for webpage risk discovery, comprising the following steps:

S1, a data acquisition stage, in which a large number of web pages of the website are acquired using the basic principle of efficient spreading;

S2, a data circulation stage, in which Confluent is used to consume the data in Kafka and, after the risk identification service is called, the data is stored in Elasticsearch, completing the whole data circulation;

S3, a risk identification stage, in which word segmentation is performed on the acquired data, the result is matched against the risk rule base, and the data matching the risk rules is saved;

S4, a data storage stage, in which the acquired raw data is processed and the processed data is then stored as required according to its type, scale, and intended use;

S5, the overall risk discovery process is complete.

Further, in the data acquisition stage of step S1, the specific steps of acquiring a large number of web pages of a website using the basic principle of efficient spreading include:

S11, using the basic principle of efficient spreading, the mutual link relations between web pages are followed to continuously spread and acquire all web pages under the domain name;

S12, starting from the website home page, the web page collector first collects all web pages under the home page;

S13, the acquired web pages are parsed to find the links they contain;

S14, after the links are further deduplicated using Redis, they become new seed URLs;

S15, steps S11-S14 are repeated, so that a large number of web pages of the website are acquired comprehensively.

Further, in the data circulation stage of step S2, Confluent is used to consume the data in Kafka and, after the risk identification service is called, the data is stored in Elasticsearch; the specific steps of completing the whole data circulation include:

S21, an interface for public invocation is provided to the user through an API;

S22, the user only needs to edit a configuration file to start a task and exchange data;

S23, Confluent is used to consume the data in Kafka, the risk identification service is called, and the data is stored in Elasticsearch, completing the whole data circulation.

Further, the data storage stage of step S4, in which the acquired raw data is processed and the processed data is then stored as required according to its type, scale, and intended use, includes the following steps:

S41, the collected raw data is cleaned and standardized according to a data standard;

S42, a data channel with data aggregation, caching, and forwarding functions is provided based on a data bus;

S43, fusion processing of multi-source heterogeneous data is realized according to actual requirements;

S44, the processed data is stored as required according to its type, scale, and intended use.

Compared with the prior art, the invention has the following beneficial effects: information is acquired simultaneously through multiple channels such as websites, WeChat, and microblogs; risk information is analyzed and filtered through a word segmentation algorithm; and whether a website violates the rules is identified automatically, so that risky websites can be located quickly within massive webpage data.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of step S3 in a method for web page risk discovery according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for web page risk discovery according to an embodiment of the present invention.

Detailed Description

Before further describing the present invention, the terms used in the invention are briefly explained as follows:

Channel

The media through which online public opinion spreads, such as news sites, microblogs, WeChat, and instant messaging software, are called channels.

Redis

Redis (Remote Dictionary Server) is a high-performance key-value storage system that supports master-slave synchronization. In this invention it provides unique keys for the data, avoiding repeated collection.
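A minimal sketch of this deduplication, assuming the redis-py client; the set name seen_urls and the MD5 keying are illustrative choices, not specified by the invention:

```python
import hashlib

import redis

# Connect to a local Redis instance (host and port are assumptions).
r = redis.Redis(host="localhost", port=6379)

def is_new_url(url: str) -> bool:
    """Return True if this URL has not been collected before.

    SADD returns 1 when the member is newly added to the set and 0
    when it is already present, so one round trip both records the
    URL and answers the deduplication question.
    """
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    return r.sadd("seen_urls", key) == 1
```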

Confluent

The Confluent Platform is a streaming data platform that can organize and manage data from different data sources; the system is stable and efficient.

Kafka

Kafka is a high-throughput distributed publish-subscribe messaging system that enables high-speed data exchange, buffering, conversion, and migration, and is characterized by high throughput and high stability.

Postgres

PostgreSQL is a powerful open-source object-relational database system that uses and extends the SQL language and combines many features to safely store and scale the most complicated data workloads. In this invention it is used for its full-text retrieval functions.

ES

ES is short for Elasticsearch, a distributed, highly scalable, near-real-time search and data analytics engine. It conveniently gives large amounts of data the capability to be searched, analyzed, and explored, and is used for distributed full-text search, meeting many search requirements. In this invention it stores the risk detail data for use by business personnel.

The invention is further described with reference to the following drawings and detailed description:

Referring to Figs. 1-2, a method for webpage risk discovery according to an embodiment of the present invention includes the following steps:

S1, a data acquisition stage, in which a large number of web pages of the website are acquired using the basic principle of efficient spreading;

S2, a data circulation stage, in which Confluent is used to consume the data in Kafka and, after the risk identification service is called, the data is stored in Elasticsearch, completing the whole data circulation;

S3, a risk identification stage, in which word segmentation is performed on the acquired data, the result is matched against the risk rule base, and the data matching the risk rules is saved;

S4, a data storage stage, in which the acquired raw data is processed and the processed data is then stored as required according to its type, scale, and intended use;

S5, the overall risk discovery process is complete.

The data acquisition stage of step S1 specifically includes:

S11, using the basic principle of efficient spreading: the mutual link relations between web pages are followed to continuously spread and acquire all web pages under the domain name;

S12, starting from the website home page, the web page collector first collects all web pages under the home page;

S13, the acquired web pages are parsed to find the links they contain;

S14, after the links are further deduplicated using Redis, they become new seed URLs;

S15, steps S11-S14 are repeated continuously, and a crawl-depth limit can be set for the website, so that a large number of its web pages are acquired comprehensively; a minimal sketch of this loop follows.
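The following sketch illustrates the spreading loop under stated assumptions: it uses the requests and beautifulsoup4 libraries and reuses the hypothetical is_new_url helper from the Redis sketch above. It is an illustration of the principle, not the patented collector itself:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_site(homepage: str, max_depth: int = 3):
    """Spread outward from the homepage, following in-domain links."""
    domain = urlparse(homepage).netloc
    queue = deque([(homepage, 0)])  # seed URL at depth 0
    pages = []
    while queue:
        url, depth = queue.popleft()
        if depth > max_depth or not is_new_url(url):  # Redis dedup (S14)
            continue
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages.append((url, resp.text))                # collected page (S12)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):       # extract links (S13)
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:       # stay under the domain
                queue.append((link, depth + 1))       # new seed URL (S15)
    return pages
```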

The data circulation stage of step S2 specifically includes the following steps:

S21, to make data exchange between storage systems efficient and simple, an automatic data exchange tool is needed; the APIs of all the storage systems in use are wrapped, and an interface for public invocation is provided to the user;

S22, the user only needs to edit a configuration file to start a task and exchange data;

S23, meanwhile, because different storage systems require different data structures, the data needs to be processed during the exchange; this process adopts the idea of stream processing, i.e. records are handled one by one. Confluent is used to consume the data in Kafka, and after the risk identification service is called, the data is stored in Elasticsearch, completing the whole data circulation, as sketched below.
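A minimal sketch of this pipeline, assuming the confluent-kafka and elasticsearch Python clients; the broker address, topic, index name, and the identify_risk placeholder (standing in for the patent's risk identification service) are all assumptions:

```python
import json

from confluent_kafka import Consumer
from elasticsearch import Elasticsearch

def identify_risk(record: dict) -> bool:
    """Placeholder for the risk identification service (see step S3)."""
    raise NotImplementedError

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # broker address is an assumption
    "group.id": "risk-discovery",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["web-pages"])           # topic name is an assumption
es = Elasticsearch("http://localhost:9200")

try:
    while True:
        msg = consumer.poll(1.0)            # stream processing: one record at a time
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        if identify_risk(record):           # call the risk identification service
            es.index(index="risk-pages", document=record)  # store the hit
finally:
    consumer.close()
```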

The data storage stage of step S4 specifically includes the following steps:

S41, the collected raw data is cleaned and standardized according to data standards, providing basic data support for subsequent data analysis;

S42, a data channel with data aggregation, caching, and forwarding functions, centered on a data bus, is provided;

S43, fusion processing of multi-source heterogeneous data is realized according to actual requirements;

S44, the processed data is stored as required according to its type, scale, and intended use; one possible routing is sketched below.
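One possible reading of "storing as required according to type" is sketched here, using the elasticsearch and psycopg2 clients; the routing rule, field names, index, and table are invented for illustration and are not specified by the invention:

```python
import psycopg2
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect(dbname="risk", user="postgres")  # connection details assumed

def store(record: dict) -> None:
    """Route a processed record to the appropriate store by data type."""
    if record.get("type") == "risk_detail":
        # Risk details go to Elasticsearch for full-text search by business staff.
        es.index(index="risk-pages", document=record)
    else:
        # Structured metadata goes to PostgreSQL.
        with pg, pg.cursor() as cur:
            cur.execute(
                "INSERT INTO pages (url, title) VALUES (%s, %s)",
                (record["url"], record.get("title", "")),
            )
```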

In practical applications, the invention uses the following technologies:

1) Word segmentation

The invention performs word segmentation based on the Python jieba library, covering: word segmentation, custom dictionaries, and stop-word removal.

Word segmentation

The jieba.cut method accepts two main input parameters:

the first parameter is the string to be segmented;

the cut_all parameter controls whether full mode is used.

The jieba.cut_for_search method accepts one parameter: the string to be segmented. It is suitable for building the segmentation of a search engine's inverted index, and its granularity is finer.

Note: the string to be segmented may be a GBK string, a UTF-8 string, or a unicode string.

jieba.cut and jieba.cut_for_search both return iterable generators; each word (unicode) obtained after segmentation can be retrieved with a for loop, or list(jieba.cut(…)) can be used to convert the result into a list.
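A small usage sketch of these calls (the sample sentence is illustrative):

```python
import jieba

text = "小明硕士毕业于中国科学院计算所"  # illustrative sentence

# Precise mode (cut_all=False): a generator of non-overlapping words.
print("/".join(jieba.cut(text, cut_all=False)))

# Full mode (cut_all=True): every possible word, with overlaps.
print("/".join(jieba.cut(text, cut_all=True)))

# Search-engine mode: finer granularity, suited to inverted indexes.
print("/".join(jieba.cut_for_search(text)))
```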

2) User-defined dictionary

In Chinese word segmentation, some special terms cannot be segmented accurately; for example, "cloud computing" (云计算) may be split into "cloud" and "computing". A function for adding a custom dictionary is therefore provided to developers, guaranteeing better segmentation accuracy. The function prototype is as follows:

load_userdict(f)

The function has only one parameter, representing the custom dictionary to load; f may be a file-like object or the path of the custom dictionary file. The dictionary format is: one word per line, with each line divided into three parts: the word, its frequency, and its part-of-speech tag, separated by spaces.
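A sketch of loading a custom dictionary (the file name and entry are illustrative):

```python
import jieba

# userdict.txt holds one entry per line, in three space-separated parts:
#   word  frequency  part-of-speech-tag
# e.g.:
#   云计算 5 n
jieba.load_userdict("userdict.txt")

# "云计算" is now kept as one token instead of being split in two.
print(list(jieba.cut("云计算是一种新技术")))
```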

Stop words

Stop-word filtering mainly means building a stop-word table as a text file, reading its content into a list, checking each segmented word against the stop-word list one by one, filtering out any word found there, and finally obtaining the filtered result.

Basic part: remove the stop words in a sentence and output the result;

extended part: segment the text in a csv file, remove the stop words specified in a txt file, and finally write the result to a csv file. A minimal sketch of the basic part follows.
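A minimal sketch of the basic part, assuming a stopwords.txt file with one stop word per line (file name and sentence are illustrative):

```python
import jieba

# Read the self-built stop-word table into a set for fast lookups.
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def remove_stopwords(sentence: str) -> list[str]:
    """Segment a sentence and drop every token found in the stop-word list."""
    return [w for w in jieba.cut(sentence) if w.strip() and w not in stopwords]

print(remove_stopwords("这是一个需要去除停用词的句子"))
```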

TF-IDF dictionary

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used weighting technique for information retrieval and text mining. TF-IDF is a statistical method for assessing how important a word is to one document in a document set or corpus. It consists of two parts, TF and IDF.

Term Frequency (TF)

refers to the frequency with which a given word appears in the document. It is calculated as:

TF(ti, D) = ni / Σk nk

where ni is the number of times the word ti occurs in document D and the denominator is the total number of words in D.

Inverse Document Frequency (IDF)

is a measure of the general importance of a word. It is calculated as:

IDF(ti) = log( |Dn| / (1 + |Dt|) )

where |Dn| is the total number of documents in the corpus and |Dt| is the number of documents containing the word ti.

Note: the denominator is increased by 1 to avoid a denominator of 0.

3) Keyword extraction

TF-IDF is an important means of keyword extraction: it weights all candidate keywords of a text and ranks them by weight. Assuming Dn is the size of the test corpus, the keyword extraction steps of the algorithm are as follows:

(1) Perform data preprocessing operations such as word segmentation, part-of-speech tagging, and stop-word removal on the given text D. Segmentation uses jieba, retaining only words tagged 'n', 'nz', 'v', 'vd', 'vn', 'l', 'a', and 'd', which finally yields n candidate keywords, i.e. D = [t1, t2, …, tn];

(2) calculate the word frequency TF of the word ti in the text D;

(3) calculate the IDF of the word ti over the whole corpus, where Dt is the number of documents in the corpus in which the word ti appears;

(4) calculate the TF-IDF of the word ti, and repeat (2)-(4) to obtain the TF-IDF values of all candidate keywords;

(5) sort the candidate keywords by score in descending order, and take the top-N ranked words as the text keywords. A sketch of these steps follows.
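A minimal sketch of steps (1)-(5) over a small in-memory corpus, using jieba.posseg for part-of-speech tagging; the corpus handling and top-N value are illustrative (jieba.analyse.extract_tags offers a ready-made alternative):

```python
import math
from collections import Counter

import jieba.posseg as pseg

KEEP_FLAGS = {"n", "nz", "v", "vd", "vn", "l", "a", "d"}

def candidates(text: str) -> list[str]:
    """Step (1): segment, POS-tag, and keep only the retained tags."""
    return [w.word for w in pseg.cut(text) if w.flag in KEEP_FLAGS]

def extract_keywords(text: str, corpus: list[str], top_n: int = 5) -> list[str]:
    words = candidates(text)
    tf = Counter(words)                       # step (2): term frequencies
    total = len(words)
    docs = [set(candidates(d)) for d in corpus]
    scores = {}
    for w in set(words):
        dt = sum(w in d for d in docs)        # documents containing the word
        idf = math.log(len(docs) / (1 + dt))  # step (3): +1 avoids a zero denominator
        scores[w] = (tf[w] / total) * idf     # step (4): TF-IDF
    # Step (5): descending order, take the top-N as keywords.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```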

4) Word vector

A custom word vector set is obtained based on the TF-IDF dictionary and the Tencent open-source word vectors, as sketched below.
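A sketch of how the Tencent pre-trained vectors might be loaded and restricted to the dictionary's words, assuming the gensim library and that the vector file is in word2vec text format (file name and dictionary entries are illustrative):

```python
from gensim.models import KeyedVectors

# Load the Tencent pre-trained embeddings (word2vec text format assumed).
wv = KeyedVectors.load_word2vec_format("tencent_embedding.txt", binary=False)

# Keep only the vectors for words in our TF-IDF dictionary.
tfidf_words = ["云计算", "风险", "网页"]  # illustrative dictionary entries
word_vectors = {w: wv[w] for w in tfidf_words if w in wv}
```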

In summary, the invention discloses a method for webpage risk discovery that collects information through multiple channels such as websites, WeChat, and microblogs, analyzes and filters risk information through a word segmentation algorithm, and automatically identifies whether a website violates the rules, so that risky websites can be located quickly within massive webpage data.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
