Deep learning flow classification method based on combination of space-time characteristics

Document No.: 1569760 Publication date: 2020-01-24

Abstract: "Deep learning flow classification method based on combination of space-time characteristics" was designed by 顾华玺, 魏雯婷, 薛智浩, and 曾祎 on 2019-10-12. The invention discloses a deep learning traffic classification method based on the combination of space-time characteristics, which mainly addresses the low detection accuracy of the prior art. The implementation scheme is: 1) collect and label original traffic load data; 2) generate a preprocessed traffic atlas from the original traffic load data; 3) train a deep learning model that combines space-time characteristics on the traffic atlas; 4) verify the trained deep learning model with newly collected and generated traffic data and, once it qualifies, deploy it as a traffic classifier on a real network node; 5) parse, classify, and label the traffic in the real network environment. The model exploits the space-time characteristics of traffic data, improves classification accuracy, reduces the resources occupied by the classifier, meets the traffic classification requirements of the current network environment, and can be applied at network edge nodes for encrypted traffic identification and malicious traffic detection.

1. A deep learning flow classification method based on the combination of space-time characteristics is characterized by comprising the following steps:

(1) acquiring and marking original network traffic load data to obtain marked network traffic load data:

(1a) collecting network traffic load data from a pure network node, and classifying the network traffic load data according to three types, namely encrypted traffic, unencrypted traffic and malicious access traffic, wherein the encrypted traffic is subdivided and labeled according to six types of applications in the Internet, namely Email, Chat, File, P2P, Streaming and VoIP;

(1b) randomly mixing the collected network traffic load data and the past time point data with a pre-constructed database to obtain a labeled network traffic load database;

(2) generating a preprocessed traffic atlas based on the labeled network traffic load database:

(2a) segmenting continuous network flow by using a packet capturing tool to generate and store a data packet in a pcap format;

(2b) removing protocol noise from the data packets, namely deleting the TCP-protocol and DCP-protocol data that directly reflect the traffic service type; in malicious access traffic or encrypted traffic these data are interference items that can disturb the information extraction of the deep learning model;

(2c) removing physical information from the data packet, namely deleting the information related to the physical address so as to avoid misclassification caused by the fact that the deep learning model mistakenly considers the physical address as a certain service-related identification characteristic;

(2d) blank data packets and repeated data packets are deleted to avoid interference on deep learning training;

(2e) unifying the length of every traffic packet to 900 bytes, namely truncating packets longer than 900 bytes and padding packets shorter than 900 bytes with 0x00;

(2f) performing visual processing on the data packets with uniform length, namely converting each flow packet into a flow graph with the size of 30 × 30, and finally combining all processed data packets into a flow graph set;

(3) constructing a deep learning model which is formed by connecting a first convolution layer, a first local normalization layer, a second convolution layer, a second local normalization layer, a full connection layer, an LSTM layer and a softmax layer in sequence;

(4) training the deep learning model:

(4a) setting training cycle times R;

(4b) inputting the mixed flow atlas into a first convolution layer, a first local normalization layer, a second convolution layer and a second local normalization layer in sequence to learn the spatial characteristics of the flow and carry out normalization processing on abnormal values;

(4c) inputting the processed data in the step (4b) into a full connection layer, and converting the processed data into a data form which can be received by an LSTM model;

(4d) inputting the data obtained in (4c) into the LSTM layer to learn the time characteristics of the flow;

(4e) inputting the data obtained in the step (4d) into a softmax layer, and directly outputting a classification result, namely, giving a label of the original network traffic load data;

(4f) modifying the weight and bias of each network layer according to the difference between the label obtained in the step (4e) and the real label in the training set;

(4g) repeating the steps (4b) to (4f) until the training cycle number R is reached to obtain a well-trained deep learning model;

(5) verifying the trained deep learning model and deploying it on a real network node:

(5a) setting a qualification rate P according to the precision requirement of real network classification;

(5b) according to the steps (1) to (2), original network traffic load data are collected again, and a traffic atlas is generated;

(5c) inputting the flow atlas generated in the step (5b) into a trained deep learning model to obtain a classification result;

(5d) comparing the classification result of (5c) with the real label to obtain the correct sample number, and obtaining the accuracy A of the classification result of the deep learning model:

if A is larger than P, the model is qualified, and the model is used as a flow classifier to be deployed at a real network node;

otherwise, repeating the steps (1) - (4);

(6) classifying traffic in a real network: feeding the real network traffic graphs preprocessed as in the step (2) into the traffic classifier, dividing the traffic into malicious traffic, common traffic and six types of encrypted traffic, labeling it according to the classification result, and, for the common traffic, calling a DPI tool and port numbers to label the traffic service type directly;

(7) saving part of the acquired data as existing data for updating the deep learning model at the next time point.

2. The method according to claim 1, wherein the parameters of the deep learning model constructed in step (3) are set as follows:

the convolution kernel size of the first convolution layer is 5 x 5, and the number of the convolution kernels is 32;

the convolution kernel size of the second convolution layer is 5 x 5, and the number of the convolution kernels is 64;

the local sizes of the first local normalization layer and the second local normalization layer are both 7, the scaling factors are both 0.00011, and the exponential terms are both 0.75;

the number of hidden layer neurons of the LSTM layer is 256.

3. The method of claim 1, wherein modifying the weight and bias of each network layer in (4f) according to the difference between the label obtained in (4e) and the real label in the training set is implemented as follows:

(4f1) solving the loss L between the output value and the true value of the deep learning model:

[Equation for the loss L: image not reproduced in the source]

where N is the number of training samples, $y_i$ is the true value, and $\hat{y}_i$ is the output value of the model;

(4f2) propagating the loss back through the network and obtaining, in sequence, the loss function $L_n(w_n, b_n)$ of each network layer through the BP back-propagation algorithm;

(4f3) according to the loss function $L_n(w_n, b_n)$ obtained in (4f2), updating the weight $w_n$ and bias $b_n$ of each network layer by the gradient descent method to obtain the updated weight $w_n'$ and bias $b_n'$:

$$w_n' = w_n - \alpha \frac{\partial L_n(w_n, b_n)}{\partial w_n}$$

$$b_n' = b_n - \alpha \frac{\partial L_n(w_n, b_n)}{\partial b_n}$$

where $\alpha$ is the learning rate, with $0 < \alpha \le 0.1$.

Technical Field

The invention belongs to the technical field of computer networks, and particularly relates to a traffic classification method which can be applied to network edge nodes to realize encrypted traffic identification and malicious traffic detection.

Background

Network traffic is growing increasingly complex, and keeping malicious traffic detection efficient and fast has become a major challenge. Traffic identification and malicious traffic detection are, in essence, classification problems. Conventional traffic classification methods, such as port-number-based methods or deep packet inspection, can no longer meet the requirements of the current network environment. Methods based on traditional machine learning have also been applied to encrypted traffic identification and malicious traffic detection, but the tedious manual selection of features and labeling of feature libraries raises labor-cost and privacy concerns, which limits their generalization ability. Deep-learning-based methods developed in recent years overcome these shortcomings, but most of them use only one dimension of the original traffic information, either temporal or spatial. This limits classifier performance, and training tends to hit a bottleneck when encrypted traffic and malicious traffic must be analyzed at the same time. The core problem is therefore how to design a deep learning classifier that exploits temporal and spatial characteristics together.

The patent document "a system and method for detecting encrypted malicious traffic based on deep learning" filed by Shanghai Jiao Tong University, Zhoufutai et al. (application No. 201811244932.5, application date 2018.10.24, application publication No. CN109104441A) discloses a system for detecting encrypted malicious traffic based on deep learning. The specific steps are as follows. First step: analyzing the encrypted traffic data with traffic analysis software to obtain three log files, and joining them to obtain a series of aggregated data; second step: extracting a series of feature data from the aggregated data; third step: training on the feature data of the second step with the xgboost algorithm to obtain a first model; fourth step: for all server names aggregated per flow, training a word-vector conversion model with word2vec; fifth step: converting the server names into a word-vector matrix and training an LSTM (long short-term memory) network on it to obtain a second model; sixth step: constructing a flow graph from the features in the packet payload to obtain a third model; seventh step: weighting the three models in different proportions to obtain the final malicious traffic probability.
The method has two disadvantages. First, the seventh step requires weighting the three models in different proportions, but the method does not state how the weights are assigned; in practice, a manually tuned weighting decision breaks the end-to-end structure of deep learning and reduces its self-learning capability. Second, although the method uses three models (xgboost, CNN and LSTM), it simply combines their classification probabilities and does not fully use the space-time characteristics of the traffic for classification. In summary, the method is severely limited for encrypted malicious traffic detection.

The patent document "a method and a device for classifying network traffic based on characterization learning" filed by the Institute of Acoustics, Chinese Academy of Sciences (application No.: 201711189690, date: 2018-06-15, application publication No.: CN201711189690. XA) discloses a network traffic classification method. The specific steps are as follows. First step: preprocessing the acquired network traffic data, which includes segmenting the traffic, unifying the length of the segmented traffic data, and encoding the segmented, length-unified data into a specific format; second step: extracting features from the preprocessed network traffic data with a convolutional neural network algorithm in characterization learning, and generating network traffic vectors from the data; third step: classifying the network traffic data according to the network traffic vectors. The disadvantages of this method are that it uses only the spatial characteristics of the traffic and makes insufficient use of its time-series characteristics, so classification accuracy is low and misjudgments occur easily; the method also requires manual extraction of traffic features, which costs considerable labor and time, so end-to-end network traffic classification cannot be achieved. In summary, the method is severely limited for encrypted malicious traffic detection.

Disclosure of Invention

The invention aims to provide a deep learning traffic classification method based on the combination of space-time characteristics aiming at the defects of the prior art so as to improve the accuracy of traffic classification, reduce resources occupied by a classifier and meet the traffic classification requirement under the current network environment.

In order to achieve the purpose, the technical scheme of the invention comprises the following steps:

(1) acquiring and marking original network traffic load data to obtain marked network traffic load data:

(1a) collecting network traffic load data from a pure network node, and classifying the network traffic load data according to three types, namely encrypted traffic, unencrypted traffic and malicious access traffic, wherein the encrypted traffic is subdivided and labeled according to six types of applications in the Internet, namely Email, Chat, File, P2P, Streaming and VoIP;

(1b) randomly mixing the collected network traffic load data and the past time point data with a pre-constructed database to obtain a labeled network traffic load database;

(2) generating a preprocessed traffic atlas based on the labeled network traffic load database:

(2a) segmenting continuous network flow by using a packet capturing tool to generate and store a data packet in a pcap format;

(2b) removing protocol noise from the data packets, namely deleting the TCP-protocol and DCP-protocol data that directly reflect the traffic service type; in malicious access traffic or encrypted traffic these data are interference items that can disturb the information extraction of the deep learning model;

(2c) removing physical information from the data packet, namely deleting the information related to the physical address so as to avoid misclassification caused by the fact that the deep learning model mistakenly considers the physical address as a certain service-related identification characteristic;

(2d) blank data packets and repeated data packets are deleted to avoid interference on deep learning training;

(2e) unifying the length of every traffic packet to 900 bytes, namely truncating packets longer than 900 bytes and padding packets shorter than 900 bytes with 0x00;

(2f) performing visual processing on the data packets with uniform length, namely converting each flow packet into a flow graph with the size of 30 × 30, and finally combining all processed data packets into a flow graph set;

(3) constructing a deep learning model which is formed by connecting a first convolution layer, a first local normalization layer, a second convolution layer, a second local normalization layer, a full connection layer, an LSTM layer and a softmax layer in sequence;

(4) training the deep learning model:

(4a) setting training cycle times R;

(4b) inputting the mixed flow atlas into a first convolution layer, a first local normalization layer, a second convolution layer and a second local normalization layer in sequence to learn the spatial characteristics of the flow and carry out normalization processing on abnormal values;

(4c) inputting the processed data in the step (4b) into a full connection layer, and converting the processed data into a data form which can be received by an LSTM model;

(4d) inputting the data obtained in (4c) into the LSTM layer to learn the time characteristics of the flow;

(4e) inputting the data obtained in the step (4d) into a softmax layer, and directly outputting a classification result, namely, giving a label of the original network traffic load data;

(4f) modifying the weight and bias of each network layer according to the difference between the label obtained in the step (4e) and the real label in the training set;

(4g) repeating the steps (4b) to (4f) until the training cycle number R is reached to obtain a well-trained deep learning model;

(5) verifying the trained deep learning model and deploying it on a real network node:

(5a) setting a qualification rate P according to the precision requirement of real network classification;

(5b) according to the steps (1) to (2), original network traffic load data are collected again, and a traffic atlas is generated;

(5c) inputting the flow atlas generated in the step (5b) into a trained deep learning model to obtain a classification result;

(5d) comparing the classification result of (5c) with the real label to obtain the correct sample number, and obtaining the accuracy A of the classification result of the deep learning model:

if A is larger than P, the model is qualified, and the model is used as a flow classifier to be deployed at a real network node;

otherwise, repeating the steps (1) - (4);

(6) classifying traffic in a real network: feeding the real network traffic graphs preprocessed as in the step (2) into the traffic classifier, dividing the traffic into malicious traffic, common traffic and six types of encrypted traffic, labeling it according to the classification result, and, for the common traffic, calling a DPI tool and port numbers to label the traffic service type directly;

(7) saving part of the acquired data as existing data for updating the deep learning model at the next time point.

Compared with the prior art, the invention has the following advantages:

firstly, because the invention is based on deep learning, the end-to-end structure avoids the complex process of manually selecting the marking characteristics and avoids the information related to privacy, and compared with the traditional method and the method based on machine learning, the invention not only saves labor and time cost, but also has better generalization capability and applicability;

secondly, the invention utilizes two different deep learning networks of CNN and LSTM to respectively carry out feature learning on the original flow from two information dimensions of space and time, overcomes the defect that the time or space characteristics can only be singly utilized in the past, thereby obtaining better performance and higher accuracy compared with other methods based on deep learning in the past;

thirdly, on the basis of deep learning, the common traffic is labeled by combining the DPI technology with the best identification accuracy in the traditional method and a port number-based method, so that the whole classification process is more efficient and accurate;

fourth, the invention requires fewer storage resources than previous methods and is therefore better suited to deployment at edge nodes;

fifth, because the method is built on a deep learning structure, both encrypted traffic labeling and malicious access traffic identification can run in real time, giving better service experience and quality of service.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a sub-flow diagram of the flow pre-processing of the present invention;

FIG. 3 is a diagram of the deep learning network structure constructed in the present invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to fig. 1, the procedure for the embodiment is as follows.

Step 1, collecting and marking original network flow load data.

1.1) collecting network traffic load data from a pure network node, and classifying the network traffic load data according to three types, namely encrypted traffic, unencrypted traffic and malicious access traffic, wherein the encrypted traffic is subdivided and labeled according to six types of applications in the Internet, namely Email, Chat, File, P2P, Streaming and VoIP;

1.2) randomly mixing the collected network traffic load data, the past time point data and a pre-constructed database to expand the content of the database, reduce blind spots after deep learning model training and obtain a labeled network traffic load database.
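The mixing in 1.2) can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function name `build_labeled_database`, the `(payload, label)` tuple layout, and the label strings are assumptions introduced here.

```python
import random

def build_labeled_database(new_samples, past_samples, prebuilt_db, seed=None):
    """Merge the three labeled sources and shuffle them into one database.

    Each sample is a (payload_bytes, label) pair; this layout is an assumption.
    """
    merged = list(new_samples) + list(past_samples) + list(prebuilt_db)
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    rng.shuffle(merged)
    return merged

# Toy inputs (hypothetical payloads and label names).
new = [(b"\x16\x03\x01", "Email"), (b"\x17\x03\x03", "VoIP")]
past = [(b"GET / HTTP/1.1", "unencrypted")]
prebuilt = [(b"\x90\x90\x90", "malicious")]
database = build_labeled_database(new, past, prebuilt, seed=0)
```

Shuffling the merged list means samples from different collection times are interleaved before training rather than presented source by source.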

Step 2, generating a preprocessed traffic atlas based on the labeled network traffic load database.

Referring to fig. 2, the specific implementation of this step is as follows:

2.1) segmenting continuous flow in a network flow load database by using a packet capturing tool, and storing the continuous flow as a file in a pcap format;

2.2) removing protocol noise from the traffic data obtained in step 2.1), namely deleting the data in each packet that is directly related to the Transmission Control Protocol (TCP) and the discovery and basic configuration protocol (DCP), to obtain traffic packets with protocol noise removed; in malicious access traffic or encrypted traffic these data are interference items that impair the information extraction capability of the deep learning model;

2.3) removing physical information from the traffic packets obtained in 2.2), namely deleting the data related to physical information, to obtain the cleaned traffic packets; the physical information is mainly the MAC (media access control) address. Because some Internet hosts carry traffic for only one type of application, a deep learning model may pick up the MAC address and treat it as relevant to identifying a network service type, so the MAC address is deleted during preprocessing;

2.4) deleting blank flow packets and repeated flow packets in the network flow load database, wherein the flow packets can interfere with the training of the deep learning model;

2.5) unifying the length of the traffic packets in the network traffic load database processed in 2.4) to 900 bytes, namely truncating traffic packets longer than 900 bytes and padding traffic packets shorter than 900 bytes with 0x00;

2.6) carrying out visual processing on the flow packets in the network flow load database after the processing of 2.5), namely mapping 900 bytes in each flow packet to a gray value from 0 to 1, and generating a flow graph with the size of 30 x 30;

2.7) adding all the flow charts processed in the step 2.6) into the real-time flow chart database, and finally finishing the establishment of the real-time flow chart database.
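Steps 2.4) to 2.6) can be sketched as follows. This is a minimal illustration under stated assumptions: a "blank" packet is taken to be an empty byte string, the gray value is taken to be the byte value divided by 255, and all function names are hypothetical. The header and MAC removal of 2.2) and 2.3) depends on the capture format and is not shown.

```python
PACKET_LEN = 900   # unified packet length from step 2.5
SIDE = 30          # 30 x 30 = 900 pixels, step 2.6

def clean_packets(packets):
    """Drop blank and repeated packets (step 2.4), keeping first-seen order."""
    seen = set()
    kept = []
    for pkt in packets:
        if len(pkt) == 0 or pkt in seen:
            continue
        seen.add(pkt)
        kept.append(pkt)
    return kept

def to_fixed_length(pkt, length=PACKET_LEN):
    """Truncate long packets and pad short ones with 0x00 (step 2.5)."""
    return pkt[:length].ljust(length, b"\x00")

def to_traffic_graph(pkt):
    """Map 900 bytes to a 30 x 30 grid of gray values in [0, 1] (step 2.6)."""
    fixed = to_fixed_length(pkt)
    gray = [b / 255.0 for b in fixed]  # assumed scaling: byte value / 255
    return [gray[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]

# Toy packets: one blank, one too long, one short, one duplicate.
packets = [b"", b"\xff" * 1000, b"\x01\x02\x03", b"\x01\x02\x03"]
atlas = [to_traffic_graph(p) for p in clean_packets(packets)]
```

Here `atlas` plays the role of the traffic graph database: one 30 x 30 grid per surviving packet.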

Step 3, constructing a deep learning model.

The deep learning model comprises two convolution layers, two local normalization layers, a full connection layer, an LSTM layer and a softmax layer, and the structural relationship of the deep learning model is shown in FIG. 3;

referring to fig. 3, the deep learning model constructed in this step is sequentially: the first convolution layer, the first local normalization layer, the second convolution layer, the second local normalization layer, the full connection layer, the LSTM layer and the softmax layer have the following parameters:

the convolution kernel size of the first convolution layer is 5 x 5, and the number of the convolution kernels is 32;

the convolution kernel size of the second convolution layer is 5 x 5, and the number of the convolution kernels is 64;

the local sizes of the first local normalization layer and the second local normalization layer are both 7, the scaling factors are both 0.00011, and the exponential terms are both 0.75;

the number of hidden layer neurons of the LSTM layer is 256.
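To illustrate what the local normalization layers do with the parameters above (local size 7, scaling factor 0.00011, exponential term 0.75), the following sketch applies one common form of local response normalization across channels at a single pixel. The source does not give the formula, so the form used here, the additive constant `k = 1.0`, and the choice not to divide the scaling factor by the window size are assumptions; frameworks differ on these details.

```python
def local_response_norm(channels, size=7, alpha=0.00011, beta=0.75, k=1.0):
    """Normalize one pixel's activations across neighboring channels.

    channels: list of activations, one per feature map, at a single pixel.
    Assumed form: out_i = a_i / (k + alpha * sum_j a_j^2) ** beta,
    with j running over a window of `size` channels centered on i.
    """
    radius = size // 2
    out = []
    for i, a in enumerate(channels):
        lo = max(0, i - radius)
        hi = min(len(channels), i + radius + 1)
        sq_sum = sum(x * x for x in channels[lo:hi])
        out.append(a / (k + alpha * sq_sum) ** beta)
    return out

# 32 feature maps, matching the first convolution layer; one outlier value.
activations = [1.0] * 32
activations[10] = 100.0
normalized = local_response_norm(activations)
```

In this toy input the outlier at channel 10 is damped much more strongly than the ordinary activations, which is the sense in which the layer normalizes abnormal values.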

Step 4, training the deep learning model.

4.1) setting training cycle times R;

4.2) the flow chart in the real-time flow chart database is sequentially input into the first convolution layer, the first local normalization layer, the second convolution layer and the second local normalization layer, the spatial characteristics of the flow are learned through the two convolution layers, and the two local normalization layers perform normalization processing on the abnormal value;

4.3) inputting the data processed in the step 4.2) into the full connection layer to be converted into input data of an LSTM model;

4.4) inputting the data obtained in the step 4.3) into the LSTM layer to learn the time characteristics of the flow, then inputting the result into the softmax layer, which directly outputs the classification result, that is, the label of the original traffic load data;

4.5) modifying the weight $w_n$ and bias $b_n$ of each network layer according to the difference between the label output in 4.4) and the real label:

4.5.1) solving the loss L between the label of the original traffic load data obtained in 4.4) and the real label:

[Equation for the loss L: image not reproduced in the source]

where N is the number of training samples, $y_i$ is the real label, and $\hat{y}_i$ is the label output by the model;

4.5.2) based on the loss L obtained in 4.5.1), obtaining in sequence the loss $L_n(w_n, b_n)$ of each network layer through the BP back-propagation algorithm;

4.5.3) according to the loss function $L_n(w_n, b_n)$ of each layer obtained in 4.5.2), updating the weight $w_n$ and bias $b_n$ of each network layer by the gradient descent method to obtain the updated weight $w_n'$ and bias $b_n'$:

$$w_n' = w_n - \alpha \frac{\partial L_n(w_n, b_n)}{\partial w_n}$$

$$b_n' = b_n - \alpha \frac{\partial L_n(w_n, b_n)}{\partial b_n}$$

where $w_n'$ is the updated weight of the n-th network layer, $w_n$ is the weight before the update, $b_n'$ is the updated bias of the n-th network layer, $b_n$ is the bias before the update, $L_n(w_n, b_n)$ is the loss of the n-th network layer obtained in 4.5.2), and $\alpha$ is the learning rate;

4.6) calculating the accuracy T of the model classification after the current training round:

$$T = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$

4.7) repeating the steps 4.2) to 4.6) until the training cycle number R is reached, and obtaining the well-trained deep learning model.

In this example, the training cycle number R is set to 2000000, and the best accuracy obtained over multiple simulations is T = 99.96%.
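The update loop of 4.5) to 4.7) can be illustrated on a single softmax output layer. This is a sketch under stated assumptions, not the patented network: the loss is assumed to be cross-entropy (the source's loss formula is not reproduced), only one layer is updated, and the toy sizes, the function names, and the five cycles standing in for R are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(w, b, x, label, alpha=0.1):
    """One gradient-descent update of a softmax layer on a single sample.

    w: per-class weight rows, b: per-class biases, x: input features.
    Returns the updated (w, b) and the loss measured before the update.
    """
    logits = [sum(wi * xi for wi, xi in zip(row, x)) + bk
              for row, bk in zip(w, b)]
    probs = softmax(logits)
    loss = -math.log(probs[label])       # cross-entropy (assumed loss form)
    # dL/dlogit_k = p_k - 1 if k is the true class, else p_k
    grad = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    new_w = [[wi - alpha * g * xi for wi, xi in zip(row, x)]
             for row, g in zip(w, grad)]
    new_b = [bk - alpha * g for bk, g in zip(b, grad)]
    return new_w, new_b, loss

w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # 3 classes, 2 features (toy sizes)
b = [0.0, 0.0, 0.0]
x, label = [1.0, 2.0], 2
losses = []
for _ in range(5):                          # 5 cycles stand in for R
    w, b, loss = sgd_step(w, b, x, label)
    losses.append(loss)
```

Each pass applies the rule "new parameter = old parameter minus learning rate times gradient" to both the weights and the biases, and the recorded loss falls from one cycle to the next.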

Step 5, verifying the deep learning model and deploying it on a real network node.

5.1) setting a qualified rate P according to the precision requirement of real network classification;

5.2) according to the step 1 and the step 2, re-collecting original network traffic load data and generating a traffic atlas;

5.3) inputting the flow atlas generated in the step 5.2) into a trained deep learning model to obtain a classification result;

5.4) comparing the classification result with the real label to obtain the number of correctly classified samples, and calculating the accuracy A of the classification result of the deep learning model:

$$A = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$

5.5) comparing the classification result accuracy A with the qualification rate P:

if A is larger than P, the deep learning model is qualified, and the model is used as a traffic classifier to be deployed on a real network node;

otherwise, step 1 to step 5 are repeated until A > P.
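The qualification check of 5.4) and 5.5) amounts to the following sketch. The function names and the toy labels are assumptions; the strict comparison mirrors the condition "A is larger than P".

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the real label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def is_qualified(predicted, actual, pass_rate):
    """True when accuracy A strictly exceeds the qualification rate P."""
    return accuracy(predicted, actual) > pass_rate

# Toy verification set (hypothetical labels): 3 of 4 predictions are correct.
predicted = ["Email", "VoIP", "malicious", "Chat"]
actual = ["Email", "VoIP", "malicious", "File"]
A = accuracy(predicted, actual)
deploy = is_qualified(predicted, actual, pass_rate=0.7)
```

If `deploy` is false, the procedure returns to data collection and training instead of deploying the model.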

Step 6, classifying and labeling the traffic in the real network.

6.1) applying the preprocessing of step 2 to the real network traffic to obtain real network traffic graphs;

6.2) inputting the real network flow graph into a flow classifier to obtain classification results of malicious flow, common flow and six types of encrypted flow;

6.3) marking the flow according to the classification result:

if the classification result is malicious access flow, marking the flow as the malicious flow, and reporting the malicious flow to an intrusion detection system IDS for early warning;

if the classification result is the unencrypted traffic, calling the original information of the traffic, and labeling according to a Deep Packet Inspection (DPI) technology and a port number technology;

otherwise, marking the rest encrypted traffic as six traffic service types of Email, Chat, File, P2P, Streaming and VoIP according to the classification result.
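The three branches of 6.3) can be sketched as a small dispatch function. The class-name strings and the two callback hooks (`report_to_ids`, `label_with_dpi`) are hypothetical stand-ins for the IDS interface and the DPI/port-number labeling, which the source does not specify.

```python
ENCRYPTED_TYPES = {"Email", "Chat", "File", "P2P", "Streaming", "VoIP"}

def label_traffic(result, report_to_ids, label_with_dpi):
    """Route one classified flow according to step 6.3.

    result: class name produced by the classifier (names are assumptions).
    report_to_ids / label_with_dpi: caller-supplied hooks (hypothetical).
    """
    if result == "malicious":
        report_to_ids()                 # early warning through the IDS
        return "malicious"
    if result == "unencrypted":
        return label_with_dpi()         # DPI and port-number labeling
    if result in ENCRYPTED_TYPES:
        return result                   # one of the six encrypted service types
    raise ValueError(f"unknown class: {result!r}")

alerts = []
labels = [
    label_traffic("malicious", lambda: alerts.append("alert"), lambda: "HTTP"),
    label_traffic("unencrypted", lambda: alerts.append("alert"), lambda: "HTTP"),
    label_traffic("VoIP", lambda: alerts.append("alert"), lambda: "HTTP"),
]
```

Only the malicious branch triggers the IDS hook, and only the unencrypted branch consults the DPI hook; encrypted traffic keeps the service type assigned by the classifier.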

Step 7, storing part of the real network traffic data from step 6 for updating the model parameters at the next time point, so that the traffic classifier better matches the real network environment and can respond more reasonably to newly appearing malicious traffic and encrypted traffic.

While the invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
