Deep learning flow classification method based on combination of space-time characteristics

Document No.: 1569760 Publication date: 2020-01-24

Reading note: The technology "Deep learning flow classification method based on combination of space-time characteristics" (基于时空特性相结合的深度学习流量分类方法) was designed by Gu Huaxi (顾华玺), Wei Wenting (魏雯婷), Xue Zhihao (薛智浩) and Zeng Yi (曾祎) on 2019-10-12. The invention discloses a deep learning traffic classification method based on the combination of space-time characteristics, which mainly solves the problem of low detection accuracy in the prior art. The implementation scheme is: 1) collecting and labeling original traffic payload data; 2) generating a preprocessed traffic atlas from the original traffic payload data; 3) training a deep learning model based on the combination of space-time characteristics with the traffic atlas; 4) verifying the trained deep learning model with newly collected and generated traffic data and, once it qualifies, deploying the model as a traffic classifier at a real network node; 5) parsing, classifying and labeling the traffic in the real network environment. The model constructed by the invention exploits the space-time characteristics of traffic data, improves the accuracy of traffic classification, reduces the resources occupied by the classifier, can meet the traffic classification requirements of the current network environment, and can be applied at network edge nodes for encrypted traffic identification and malicious traffic detection.

1. A deep learning traffic classification method based on the combination of space-time characteristics, characterized by comprising the following steps:

(1) collecting and labeling original network traffic payload data to obtain labeled network traffic payload data:

(1a) collecting network traffic payload data from a clean network node and classifying it into three types, namely encrypted traffic, unencrypted traffic and malicious access traffic, wherein the encrypted traffic is further subdivided and labeled according to six types of Internet applications, namely Email, Chat, File, P2P, Streaming and VoIP;

(1b) randomly mixing the collected network traffic payload data with data from past time points and with a pre-constructed database to obtain a labeled network traffic payload database;

(2) generating a preprocessed traffic atlas based on the labeled network traffic payload database:

(2a) segmenting the continuous network traffic with a packet capture tool, and generating and storing data packets in pcap format;

(2b) removing protocol information from the data packets, namely deleting the data related to the TCP protocol and the DCP protocol that can directly reflect the traffic service type; in malicious access traffic or encrypted traffic these data are interference items that can disturb the information extraction of the deep learning model;

(2c) removing physical information from the data packets, namely deleting the information related to physical addresses, so that the deep learning model does not mistake a physical address for a service-related identifying feature and misclassify the traffic;

(2d) deleting blank data packets and duplicate data packets to avoid interfering with deep learning training;

(2e) unifying the length of the data packets to 900 bytes, namely truncating packets longer than 900 bytes and padding packets shorter than 900 bytes with 0x00;

(2f) visualizing the length-unified data packets, namely converting each packet into a traffic image of size 30 × 30, and finally combining all processed data packets into a traffic atlas;

(3) constructing a deep learning model formed by connecting, in sequence, a first convolution layer, a first local normalization layer, a second convolution layer, a second local normalization layer, a fully connected layer, an LSTM layer and a softmax layer;

(4) training the deep learning model:

(4a) setting the number of training cycles R;

(4b) inputting the mixed traffic atlas into the first convolution layer, the first local normalization layer, the second convolution layer and the second local normalization layer in sequence, to learn the spatial characteristics of the traffic and to normalize abnormal values;

(4c) inputting the data processed in step (4b) into the fully connected layer, which converts it into a form that the LSTM layer can accept;

(4d) inputting the data obtained in step (4c) into the LSTM layer to learn the temporal characteristics of the traffic;

(4e) inputting the data obtained in step (4d) into the softmax layer, which directly outputs the classification result, namely the label of the original network traffic payload data;

(4f) modifying the weights and biases of each network layer according to the difference between the label obtained in step (4e) and the real label in the training set;

(4g) repeating steps (4b) to (4f) until the number of training cycles R is reached, to obtain a trained deep learning model;

(5) verifying the trained deep learning model and deploying it at a real network node:

(5a) setting a qualification rate P according to the accuracy requirement of real network classification;

(5b) collecting original network traffic payload data again and generating a traffic atlas according to steps (1) to (2);

(5c) inputting the traffic atlas generated in step (5b) into the trained deep learning model to obtain a classification result;

(5d) comparing the classification result of step (5c) with the real labels to count the correctly classified samples, and obtaining the accuracy A of the classification result of the deep learning model:

if A is larger than P, the model is qualified and is deployed as a traffic classifier at a real network node;

otherwise, repeating steps (1) to (4);

(6) classifying encrypted traffic in the real network: feeding the real network traffic images, preprocessed as in step (2), into the traffic classifier, dividing the traffic into malicious traffic, common traffic and the six types of encrypted traffic, and labeling it according to the classification result; for the common traffic identified by the traffic classifier, calling a DPI tool and the port number to label the traffic service type directly;

(7) saving part of the acquired data as existing data for updating the deep learning model at the next time point.
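
Steps (2d) to (2f) of claim 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, padding is assumed to be appended at the end of the packet, the 900 bytes are assumed to be laid out row by row, and "repeated" packets are assumed to mean byte-identical packets.

```python
from typing import List

PACKET_LEN = 900  # unified packet length from step (2e)
SIDE = 30         # 30 x 30 = 900, image size from step (2f)


def unify_length(payload: bytes, length: int = PACKET_LEN) -> bytes:
    """Truncate payloads longer than `length`; pad shorter ones with 0x00.

    Assumption: padding is appended on the right.
    """
    return payload[:length].ljust(length, b"\x00")


def to_traffic_image(payload: bytes, side: int = SIDE) -> List[List[int]]:
    """Map one packet to a side x side grid of byte values (0-255).

    Assumption: bytes are placed row by row.
    """
    data = unify_length(payload, side * side)
    return [list(data[r * side:(r + 1) * side]) for r in range(side)]


def build_traffic_atlas(payloads: List[bytes]) -> List[List[List[int]]]:
    """Drop blank and duplicate packets (step 2d), then convert the rest."""
    seen = set()
    atlas = []
    for p in payloads:
        if not p or p in seen:
            continue  # skip blank or already-seen packets
        seen.add(p)
        atlas.append(to_traffic_image(p))
    return atlas
```

Calling `build_traffic_atlas` on a list of raw packet payloads returns one 30 × 30 grid per distinct, non-empty packet. Protocol and physical-address removal (steps 2b and 2c) depend on the packet parser in use and are left out of this sketch.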

2. The method according to claim 1, wherein the parameters of the deep learning model constructed in step (3) are set as follows:

the convolution kernel size of the first convolution layer is 5 x 5, and the number of the convolution kernels is 32;

the convolution kernel size of the second convolution layer is 5 x 5, and the number of the convolution kernels is 64;

the local sizes of the first local normalization layer and the second local normalization layer are both 7, the scaling factors are both 0.00011, and the exponential terms are both 0.75;

the number of hidden layer neurons of the LSTM layer is 256.
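
To illustrate what the local normalization parameters in claim 2 control, the following Python sketch applies a local response normalization across the channels at a single spatial position. The source gives only the local size, scaling factor and exponent. The formula used here (each activation divided by (k + α · sum of squares over the window)^β), the bias term k = 1.0, the function name, and the choice not to divide α by the window size are all assumptions, and real frameworks differ on these details.

```python
from typing import List


def local_response_norm(
    channels: List[float],
    local_size: int = 7,      # local size from claim 2
    alpha: float = 0.00011,   # scaling factor from claim 2
    beta: float = 0.75,       # exponent from claim 2
    k: float = 1.0,           # assumed bias term; not given in the source
) -> List[float]:
    """Normalize the channel activations at one spatial position.

    Each activation is divided by
    (k + alpha * sum of squares over a window of neighbouring channels) ** beta.
    The window is clipped at the first and last channel.
    """
    half = local_size // 2
    out = []
    for i, a in enumerate(channels):
        lo = max(0, i - half)
        hi = min(len(channels), i + half + 1)
        sq_sum = sum(c * c for c in channels[lo:hi])
        out.append(a / (k + alpha * sq_sum) ** beta)
    return out
```

With such a small scaling factor, activations of ordinary magnitude pass through almost unchanged, while unusually large activations are damped, which matches the stated purpose of normalizing abnormal values in step (4b).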

3. The method of claim 1, wherein modifying the weights and biases of each network layer in (4f) according to the difference between the label obtained in (4e) and the real label in the training set is implemented as follows:

(4f1) computing the loss $L$ between the output value and the true value of the deep learning model [the loss formula is missing in the source text], where $N$ is the number of training samples and $y_i$ is the true value of the $i$-th sample;

(4f2) propagating the loss back through the network and obtaining the loss function $L_n(w_n, b_n)$ of each network layer in turn by the BP back-propagation algorithm;

(4f3) according to the loss function $L_n(w_n, b_n)$ determined in (4f2), updating the weight $w_n$ and bias $b_n$ of each network layer by gradient descent to obtain the updated weight $w_n'$ and bias $b_n'$:

$$w_n' = w_n - \alpha \frac{\partial L_n(w_n, b_n)}{\partial w_n}, \qquad b_n' = b_n - \alpha \frac{\partial L_n(w_n, b_n)}{\partial b_n}$$

where $\alpha$ is the learning rate, with $0 < \alpha \le 0.1$.
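
The gradient descent update of step (4f3) can be illustrated with a short Python sketch. It assumes the standard rule in which each weight and bias moves against its gradient scaled by the learning rate α. The function name and the toy quadratic loss are hypothetical and only serve to show the update converging; they are not the loss used by the invention.

```python
from typing import List, Tuple


def gradient_descent_step(
    w: List[float],
    b: List[float],
    grad_w: List[float],
    grad_b: List[float],
    alpha: float = 0.01,
) -> Tuple[List[float], List[float]]:
    """Return updated weights and biases: x' = x - alpha * dL/dx."""
    if not 0 < alpha <= 0.1:
        # range of the learning rate stated in claim 3
        raise ValueError("learning rate must satisfy 0 < alpha <= 0.1")
    new_w = [wi - alpha * gi for wi, gi in zip(w, grad_w)]
    new_b = [bi - alpha * gi for bi, gi in zip(b, grad_b)]
    return new_w, new_b


# Toy usage (assumed loss): minimise L(w, b) = (w - 3)^2 + (b + 1)^2.
w, b = [0.0], [0.0]
for _ in range(500):
    grad_w = [2 * (w[0] - 3)]  # dL/dw
    grad_b = [2 * (b[0] + 1)]  # dL/db
    w, b = gradient_descent_step(w, b, grad_w, grad_b, alpha=0.1)
```

In the actual model the gradients would come from back-propagation through the softmax, LSTM, fully connected and convolution layers rather than from a closed-form expression.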

Technical Field

The invention belongs to the technical field of computer networks, and particularly relates to a traffic classification method which can be applied to network edge nodes to realize encrypted traffic identification and malicious traffic detection.

Background

The current network traffic environment is increasingly complex, and maintaining efficient, rapid malicious traffic detection has become a major challenge. Traffic identification and malicious traffic detection are in essence classification problems. Conventional traffic classification methods, such as those based on port numbers or deep packet inspection, cannot meet the task requirements of the current network environment. Methods based on traditional machine learning have also been applied to encrypted traffic identification and malicious traffic detection, but the cumbersome steps of manually selecting features and labeling a feature library raise problems of labor cost and private information, which restrict their generalization capability. In recent years, emerging deep-learning-based methods have overcome these shortcomings, but most of them use information from only a single dimension of the original traffic, either time or space. This restricts classifier performance, and the classifier easily hits a bottleneck during training, especially for detection tasks that analyze encrypted traffic and malicious traffic at the same time. In summary, how to design a deep learning classifier that uses temporal and spatial characteristics simultaneously has become the core problem.

The patent document "a system and method for detecting encrypted malicious traffic based on deep learning", filed by Shanghai Jiao Tong University, Zhoufutai et al. (application No. 201811244932.5, filing date 2018.10.24, application publication No. CN109104441A), discloses a system for detecting encrypted malicious traffic based on deep learning. The method comprises the following specific steps. The first step: analyzing the encrypted traffic data with traffic analysis software to obtain three log files, and joining them to obtain a series of aggregated data; the second step: extracting a series of feature data from the aggregated data; the third step: training on the feature data of the second step with the xgboost algorithm to obtain a first model; the fourth step: for all server names aggregated for each flow, training a word vector conversion model with word2vec; the fifth step: converting the server names into a word vector matrix, and then training an LSTM (long short-term memory) network to obtain a second model; the sixth step: constructing a traffic image from the features in the payload of the data packets to obtain a third model; the seventh step: weighting the three models in different proportions to obtain the final malicious traffic probability.
The method has the following disadvantages. When the final malicious traffic probability is obtained in the seventh step, the three models must be weighted in different proportions, but the method does not state how the weights are to be assigned; in practice, a manually tuned weighting decision breaks the end-to-end structure of deep learning and reduces its self-learning capability. In addition, although the method uses three models, namely xgboost, CNN and LSTM, it simply combines their classification probabilities and does not classify by fully exploiting the space-time characteristics of the traffic. In conclusion, the method has great limitations in detecting encrypted malicious traffic.

The patent document "a method and a device for classifying network traffic based on characterization learning", filed by the Institute of Acoustics, Chinese Academy of Sciences (application No. 201711189690, date 2018-06-15, application publication No. CN201711189690.XA), discloses a network traffic classification method. The method comprises the following specific steps. The first step: preprocessing the acquired network traffic data, which includes segmenting the acquired traffic data, unifying the lengths of the segmented traffic data, and encoding the segmented, length-unified traffic data to generate data in a specific format; the second step: performing feature extraction on the preprocessed network traffic data with a convolutional neural network algorithm from characterization learning, and generating network traffic vectors from the network traffic data; the third step: classifying the network traffic data according to the network traffic vectors. The method has the following disadvantages. It only combines the spatial characteristics of the traffic and makes insufficient use of its time-series characteristics, so its classification accuracy is low and misjudgment occurs easily. It also requires manual extraction of traffic features, which is costly in labor and time, so end-to-end network traffic classification cannot be achieved. In conclusion, the method has great limitations in detecting encrypted malicious traffic.

Disclosure of Invention

The invention aims to address the defects of the prior art by providing a deep learning traffic classification method based on the combination of space-time characteristics, so as to improve the accuracy of traffic classification, reduce the resources occupied by the classifier, and meet the traffic classification requirements of the current network environment.

In order to achieve the purpose, the technical scheme of the invention comprises the following steps: