Phishing webpage detection method based on machine learning

Document No.: 1711832 | Published: 2019-12-13

Abstract: Phishing webpage detection method based on machine learning (基于机器学习的钓鱼网页检测方法), designed by 范如 and 范渊 on 2019-08-01. The invention provides a machine-learning-based phishing webpage detection method comprising the following steps: S1, judging whether the webpage to be detected is a legitimate webpage and, if not, executing step S2; S2, extracting the URL of the webpage obtained in step S1; S3, judging, by a detection method based on the logistic regression algorithm, whether the webpage obtained in step S2 is a legitimate webpage or a phishing webpage. The method combines the construction of a webpage feature set, a webpage filtering technique and a logistic regression classifier to detect phishing webpages. It effectively reduces the number of legitimate webpages that have to be examined and detects phishing webpages that use evasion techniques well.

1. A phishing webpage detection method based on machine learning, characterized by comprising the following steps:

S1, judging whether the webpage to be detected is a legitimate webpage; if not, executing step S2;

S2, extracting the URL of the webpage obtained in step S1;

segmenting the URL at every symbol that is neither a letter nor a digit, excluding '_' and '-', to obtain the vocabulary set of the segmented URL;

if a phishing attack target word exists in the URL path vocabulary and differs from the domain name labels at every level of the URL, judging the webpage to be a phishing webpage;

if two or more different phishing attack target words exist in the same character string, judging the webpage to be a phishing webpage;

if the segmented URL vocabulary set contains no phishing attack target word, but a phishing attack target word can be found once the digit characters in the character strings are ignored, judging the webpage to be a phishing webpage;

if the segmented URL vocabulary set contains no phishing attack target word and none is found after the digit characters in the character strings are ignored, segmenting the URL again; if a phishing attack target word, or a substring of a phishing attack target word, is found in the new URL vocabulary set, judging the webpage to be a phishing webpage;

if the webpage has not been judged to be a phishing webpage, executing step S3;

S3, judging, by a phishing webpage detection method based on the logistic regression algorithm, whether the webpage obtained in step S2 is a legitimate webpage or a phishing webpage.
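The first three rules of step S2 can be illustrated with a minimal Python sketch. It is not the patented implementation: the target-word list `TARGET_WORDS` and the function names `segment`, `strip_digits` and `rule_filter` are hypothetical, and the rule for two target words in one string is interpreted here as a substring test.

```python
import re
from urllib.parse import urlsplit

# Hypothetical target-word list; the source does not enumerate one.
TARGET_WORDS = {"paypal", "apple", "taobao"}

def segment(text):
    """Split at every character that is not a letter, digit, '_' or '-'."""
    return [w.lower() for w in re.split(r"[^A-Za-z0-9_-]+", text) if w]

def strip_digits(word):
    """Drop all digit characters from a word."""
    return re.sub(r"\d", "", word)

def rule_filter(url):
    """Return True when one of the first three S2 rules flags the URL."""
    parts = urlsplit(url)
    domain_labels = set((parts.hostname or "").lower().split("."))
    path_words = segment(parts.path)
    all_words = segment(url)

    # Rule 1: a target word is in the path but is not a domain label.
    if any(w in TARGET_WORDS and w not in domain_labels for w in path_words):
        return True
    # Rule 2: two or more different target words inside one string.
    if any(sum(t in w for t in TARGET_WORDS) >= 2 for w in all_words):
        return True
    # Rule 3: no target word as-is, but one appears once digits are ignored.
    if not any(w in TARGET_WORDS for w in all_words):
        if any(strip_digits(w) in TARGET_WORDS for w in all_words):
            return True
    return False
```

With this sketch, `rule_filter("http://example.com/paypal/login")` is flagged by rule 1, while `rule_filter("https://www.paypal.com/signin")` is not flagged, because the target word is itself a domain label. URLs that pass all rules would go on to re-segmentation and then to step S3.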

2. The machine-learning-based phishing webpage detection method as claimed in claim 1, wherein:

in step S2, when no phishing attack target word exists in the segmented URL vocabulary set and none is found after the digit characters in the character strings are ignored, the re-segmentation is performed as follows:

the URL is first split using "/" and "\\" as separators, each part is then split at every symbol that is neither a letter nor a digit, and the top-level domain words in all parts are ignored; the character strings at adjacent index positions within each part are then combined to form a new URL vocabulary set; if a phishing attack target word, or a substring of a phishing attack target word, is found in the new URL vocabulary set, the webpage is judged to be a phishing webpage.
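The re-segmentation can be sketched as follows, under stated assumptions: "combining adjacent strings" is read as joining each pair of neighbouring tokens inside a part, and a match is read as a target word occurring inside a combined string. `TARGET_WORDS`, `TLD_WORDS` and both function names are hypothetical.

```python
import re

TARGET_WORDS = {"paypal", "apple"}        # hypothetical list
TLD_WORDS = {"com", "net", "org", "cn"}   # hypothetical, not exhaustive

def resegment(url):
    """Rebuild the URL vocabulary by joining adjacent tokens within each part."""
    vocabulary = set()
    # First split on "/" and "\", then on any non-alphanumeric symbol.
    for part in re.split(r"[/\\]+", url):
        tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", part) if t]
        tokens = [t for t in tokens if t not in TLD_WORDS]
        # Join each pair of neighbouring tokens, e.g. ["pay", "pal"] -> "paypal".
        vocabulary.update(a + b for a, b in zip(tokens, tokens[1:]))
    return vocabulary

def flagged_after_resegmentation(url):
    """True when a target word occurs inside any recombined string."""
    return any(t in word for word in resegment(url) for t in TARGET_WORDS)
```

For example, the first segmentation keeps "pay-pal" as one word because '-' is not a separator there, so no target word is found; the re-segmentation splits it and recombines the pieces into "paypal", which is then flagged.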

3. The machine-learning-based phishing webpage detection method as claimed in claim 2, wherein:

step S3 comprises: extracting features of the webpage obtained in step S2 and, based on the logistic regression algorithm, using these features to judge whether that webpage is a legitimate webpage or a phishing webpage.

4. The machine-learning-based phishing webpage detection method as claimed in claim 3, wherein:

the features include DNS query attributes, HTML tag attributes, character features of the URL, the similarity between the path in the URL and the phishing vocabulary, and Whois features.

5. The machine-learning-based phishing webpage detection method as claimed in claim 4, wherein:

the DNS query attributes comprise the DNS query degree, the number of IPs of the webpage, the IP, the IP subnet, the autonomous system number, whether the country where the IP is located is on a blacklist, and the RETRY, TTL, REFRESH and EXPIRE values;

the HTML tag attributes comprise: whether the page is redirected; whether the redirect target and the webpage to be detected have the same domain name; the proportion of <a> tag href links of the webpage to be detected that use the https protocol; the proportion of <link> tag links of the webpage to be detected that use the https protocol; the proportion of <a> tags containing "#" among all <a> tags; the proportion of <link> tags containing "#" among all <link> tags; the proportion of <a> tag links whose domain name differs from that of the original URL; the proportion of <link> tag links whose domain name differs from that of the original URL; the average number of dots in the <a> tag links of the webpage; the average number of dots in the <link> tag links of the webpage; the average number of links containing "@" in <a> tags; and the average number of links containing "@" in <link> tags;

the character features of the URL comprise the lengths of the domain name, path, file name and parameters, the length of the longest word, the number of "-" or "_" symbols, and the number of dots;

the Whois features include the registration, update and expiration times of the webpage, whether the registration is private, whether the IP is locked, the registry, the registrant, the subnet where the IP is located, the country or region where the IP is located, and whether the IP's autonomous system number appears on a known blacklist.
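As an illustration of the URL character features named in claim 5, the following Python sketch computes them with the standard library. The function name `url_char_features` and the dictionary keys are hypothetical, and "file" is assumed to mean the last path segment.

```python
import re
from urllib.parse import urlsplit

def url_char_features(url):
    """Compute simple length and symbol-count features of a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    filename = path.rsplit("/", 1)[-1]  # assumed: last path segment
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url) if w]
    return {
        "domain_len": len(host),
        "path_len": len(path),
        "file_len": len(filename),
        "query_len": len(parts.query),
        "longest_word_len": max((len(w) for w in words), default=0),
        "dash_underscore_count": url.count("-") + url.count("_"),
        "dot_count": url.count("."),
    }
```

Each value is a plain number, so the dictionary can be flattened into one slice of the feature vector passed to the classifier in step S3.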

6. The machine-learning-based phishing webpage detection method as claimed in claim 5, wherein:

the similarity F_sim between the URL to be detected and the phishing vocabulary is calculated as follows:

[formula not reproduced in the source text]

where JM_j denotes the Jaccard similarity between the phishing vocabulary and the j-th word in the URL vocabulary set, n denotes the number of elements in the phishing vocabulary set, and k denotes the number of elements in the URL vocabulary set generated by segmenting the URL of the webpage to be detected in mode 1; A_j denotes the j-th word in the URL vocabulary set, and B_j denotes the j-th word in the phishing vocabulary set.
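Because the formula itself is not available in this text, the following Python sketch only illustrates the Jaccard similarity that the definitions refer to. Two points are assumptions rather than the patented formula: words are compared as sets of characters, and the per-word scores are aggregated by taking the best match against the phishing vocabulary and averaging over the URL words. The names `jaccard` and `f_sim` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two words, treated as sets of characters."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def f_sim(url_words, phishing_words):
    """Assumed aggregation: mean over URL words of the best match per word."""
    if not url_words or not phishing_words:
        return 0.0
    best = [max(jaccard(a, b) for b in phishing_words) for a in url_words]
    return sum(best) / len(best)
```

Under these assumptions, "paypal" and "paypa1" share three of five distinct characters, giving a similarity of 0.6, so look-alike spellings still score high against the phishing vocabulary.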

Technical Field

The invention relates to a phishing webpage detection method, in particular to a phishing webpage detection method based on machine learning.

Background

Existing detection methods include: 1. a detection method and device for redirect-type phishing webpages; 2. a phishing detection method based on webpage relevance.

The detection method and device for redirect-type phishing webpages considers only the features of the webpage's URL and of the URL after redirection, and checks whether the cluster entity set corresponding to the URL set to be detected shares a cluster entity with a preset cluster information base. This method works along a single dimension and does not identify novel phishing webpages well.

The phishing detection method based on webpage relevance combines the relevance between pages with the overall layout features of a page, and mainly addresses fast phishing webpage detection based on webpage relevance and visual similarity. It treats the webpage as a whole and starts from the link relevance, search relevance and text relevance embedded in the page, together with the overall relevance of the page. This approach extracts few features and does not take the accuracy of the search engine into account.

Accordingly, there is a need to improve on the prior art.

Disclosure of Invention

The invention aims to provide an efficient phishing webpage detection method based on machine learning.

To solve this technical problem, the invention provides a phishing webpage detection method based on machine learning, comprising the following steps:

S1, judging whether the webpage to be detected is a legitimate webpage; if not, executing step S2;

S2, extracting the URL of the webpage obtained in step S1;

segmenting the URL at every symbol that is neither a letter nor a digit, excluding '_' and '-', to obtain the vocabulary set of the segmented URL;

if a phishing attack target word exists in the URL path vocabulary and differs from the domain name labels at every level of the URL, judging the webpage to be a phishing webpage;

if two or more different phishing attack target words exist in the same character string, judging the webpage to be a phishing webpage;

if the segmented URL vocabulary set contains no phishing attack target word, but a phishing attack target word can be found once the digit characters in the character strings are ignored, judging the webpage to be a phishing webpage;

if the segmented URL vocabulary set contains no phishing attack target word and none is found after the digit characters in the character strings are ignored, segmenting the URL again; if a phishing attack target word, or a substring of a phishing attack target word, is found in the new URL vocabulary set, judging the webpage to be a phishing webpage;

if the webpage has not been judged to be a phishing webpage, executing step S3;

S3, judging, by a phishing webpage detection method based on the logistic regression algorithm, whether the webpage obtained in step S2 is a legitimate webpage or a phishing webpage.

As an improvement of the machine-learning-based phishing webpage detection method of the invention:

in step S2, when no phishing attack target word exists in the segmented URL vocabulary set and none is found after the digit characters in the character strings are ignored, the re-segmentation is performed as follows:

the URL is first split using "/" and "\\" as separators, each part is then split at every symbol that is neither a letter nor a digit, and the top-level domain words in all parts are ignored; the character strings at adjacent index positions within each part are then combined to form a new URL vocabulary set; if a phishing attack target word, or a substring of a phishing attack target word, is found in the new URL vocabulary set, the webpage is judged to be a phishing webpage.

As a further improvement of the machine learning-based phishing webpage detection method of the invention:

Step S3 comprises: extracting features of the webpage obtained in step S2 and, based on the logistic regression algorithm, using these features to judge whether that webpage is a legitimate webpage or a phishing webpage.
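The classification in step S3 can be illustrated with a small logistic regression written in plain Python. This is a generic sketch of the algorithm, not the patented model: the learning rate, epoch count, 0.5 decision threshold and the function names `sigmoid`, `train` and `predict` are all assumptions, and the feature vectors are taken to be numeric lists.

```python
import math

def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b

def predict(w, b, x):
    """Return 1 (phishing) when the estimated probability reaches 0.5."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0
```

A usage example would call `train` on labelled feature vectors (1 for phishing, 0 for legitimate) and then `predict` on the feature vector of each webpage that passed the rule-based filter of step S2.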

As a further improvement of the machine-learning-based phishing webpage detection method of the invention:

The features include DNS query attributes, HTML tag attributes, character features of the URL, the similarity between the path in the URL and the phishing vocabulary, and Whois features.

As a further improvement of the machine learning-based phishing webpage detection method of the invention:

The DNS query attributes comprise the DNS query degree, the number of IPs of the webpage, the IP, the IP subnet, the autonomous system number, whether the country where the IP is located is on a blacklist, and the RETRY, TTL, REFRESH and EXPIRE values;

The HTML tag attributes comprise: whether the page is redirected; whether the redirect target and the webpage to be detected have the same domain name; the proportion of <a> tag href links of the webpage to be detected that use the https protocol; the proportion of <link> tag links of the webpage to be detected that use the https protocol; the proportion of <a> tags containing "#" among all <a> tags; the proportion of <link> tags containing "#" among all <link> tags; the proportion of <a> tag links whose domain name differs from that of the original URL; the proportion of <link> tag links whose domain name differs from that of the original URL; the average number of dots in the <a> tag links of the webpage; the average number of dots in the <link> tag links of the webpage; the average number of links containing "@" in <a> tags; and the average number of links containing "@" in <link> tags;

The character features of the URL comprise the lengths of the domain name, path, file name and parameters, the length of the longest word, the number of "-" or "_" symbols, and the number of dots;

The Whois features include the registration, update and expiration times of the webpage, whether the registration is private, whether the IP is locked, the registry, the registrant, the subnet where the IP is located, the country or region where the IP is located, and whether the IP's autonomous system number appears on a known blacklist.
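Several of the HTML tag attributes listed above can be computed for <a> tags with Python's standard `html.parser` module, as in the sketch below. The class `AnchorCollector`, the function `anchor_features` and the dictionary keys are hypothetical; the "@" feature is interpreted here as the share of links containing "@", and relative links are treated as belonging to the page's own domain.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

def anchor_features(html, page_url):
    """Ratios and averages over the <a> links of one page."""
    parser = AnchorCollector()
    parser.feed(html)
    hrefs = parser.hrefs
    keys = ("https_ratio", "hash_ratio", "external_ratio", "avg_dots", "at_ratio")
    if not hrefs:
        return dict.fromkeys(keys, 0.0)
    total = len(hrefs)
    page_host = urlsplit(page_url).hostname
    hosts = [urlsplit(h).hostname for h in hrefs]
    return {
        "https_ratio": sum(h.lower().startswith("https://") for h in hrefs) / total,
        "hash_ratio": sum("#" in h for h in hrefs) / total,
        # Relative links have no host and count as same-domain.
        "external_ratio": sum(x is not None and x != page_host for x in hosts) / total,
        "avg_dots": sum(h.count(".") for h in hrefs) / total,
        "at_ratio": sum("@" in h for h in hrefs) / total,
    }
```

The same collector pattern would apply to <link> tags by changing the tag name checked in `handle_starttag`.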

As a further improvement of the machine learning-based phishing webpage detection method of the invention:

The similarity F_sim between the URL to be detected and the phishing vocabulary is calculated as follows:

[formula not reproduced in the source text]

where JM_j denotes the Jaccard similarity between the phishing vocabulary and the j-th word in the URL vocabulary set, n denotes the number of elements in the phishing vocabulary set, and k denotes the number of elements in the URL vocabulary set generated by segmenting the URL of the webpage to be detected in mode 1; A_j denotes the j-th word in the URL vocabulary set, and B_j denotes the j-th word in the phishing vocabulary set.

The machine-learning-based phishing webpage detection method has the following technical advantages:

The invention provides a phishing webpage detection algorithm based on machine learning, which combines the construction of a webpage feature set, a webpage filtering technique and a logistic regression classification algorithm to detect phishing webpages. The detection method effectively reduces the number of legitimate webpages that have to be examined and detects phishing webpages that use evasion techniques well.

Phishing website detection requires extracting features along as many dimensions as possible, and devising new strategies suited to the service scenario, through detailed analysis of the latest service data; algorithms such as Naive Bayes, SVM and certain deep learning neural networks can also be used to detect phishing webpages. Based on this principle, the method extracts multi-dimensional phishing webpage features, generalizes well to novel phishing website attacks, and is important to the security of the monitoring engine.

Drawings

The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of the vocabulary set reconstruction process;

FIG. 2 is a flow chart of the machine-learning-based phishing webpage detection algorithm.

Detailed Description

The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto.
