High-speed network traffic classification method based on sampling data flow

Document No.: 1965931. Publication date: 2021-12-14.

Abstract: This technology, "High-speed network traffic classification method based on sampling data flow" (一种基于抽样数据流的高速网络流量分类方法), was designed by 吴桦, 陈晰颖 and 程光 on 2021-11-04. The invention discloses a high-speed network traffic classification method based on a sampled data stream. First, the method samples the massive traffic in a backbone network and designs a HASH bucket array structure to quickly extract the features of the sequentially sampled traffic. Second, it proposes a batch classifier that clusters unlabeled traffic features in batches within reasonable time and limited memory, thereby labeling the traffic feature data. Finally, it trains on the labeled feature data in the batch clustering result with a supervised machine learning method to obtain a classification model, which can be used to classify subsequently arriving backbone network traffic. The invention can classify unlabeled massive backbone network traffic within reasonable time and limited memory, and can be used for network traffic analysis and network management.

1. A high-speed network traffic classification method based on a sampled data stream is characterized by comprising the following steps:

step (1) acquiring backbone network traffic data, wherein the data comprises two parts collected in two different time periods, and the first part is collected earlier than the second part;

step (2) combining sequential sampling with a HASH bucket array structure to rapidly extract traffic feature vectors and build a feature vector library;

step (3) designing a batch classifier based on an agglomerative clustering algorithm, using the batch classifier to cluster the unlabeled feature data of the first part of the backbone network traffic in the feature library built in step (2), and recording the classification result;

step (4) training a model on the labeled feature data in the batch clustering result using a random forest algorithm to obtain a classification model;

step (5) using the classification model from step (4) to classify the unlabeled feature data of the second part of the backbone network traffic from step (1).

2. The method for classifying high-speed network traffic based on sampled data streams as claimed in claim 1, wherein in step (1), the method for obtaining backbone network traffic data is as follows:

(1.1) acquiring a backbone network traffic data set that was collected continuously over a period of time at a backbone network node and contains a large number of data packets;

(1.2) dividing all data packets in the data set, in time order and in a certain proportion, into a first part and a second part.

3. The method for classifying high-speed network traffic based on sampled data streams as claimed in claim 1, wherein in the step (2), the method for establishing the feature vector library is as follows:

(2.1) for all the data acquired in step (1), first sampling with a sequential sampling method; a controlled-variable method is used to compare classification results across a series of sampling ratios, showing that the sampling ratio has little effect on classification accuracy;

(2.2) designing a HASH bucket array structure containing multiple counters to rapidly extract the features of the sampled traffic, wherein the HASH bucket array structure records feature information in a two-dimensional array of w columns and d rows, each cell in the array is a count bucket containing multiple counters, and the structure supports an insertion operation and a feature vector extraction operation; the insertion operation has three steps: extracting the triple of a data packet (transport-layer protocol, IP and port) as the key, hashing the key into one count bucket in each row using d hash functions, and adding 1 to the corresponding counter in each of those count buckets; when the corresponding counter value meets the threshold requirement, the feature vector is extracted by calculating the values of multiple feature attributes.

4. The method for classifying high-speed network traffic based on sampled data streams as claimed in claim 1, wherein the step (3) is implemented as follows:

(3.1) the first step is the segmentation of traffic features: dividing all the feature vectors extracted in step (2) from the first part of the backbone network traffic into a plurality of blocks, wherein the size of each block depends on the memory resources currently available to the user;

(3.2) clustering the traffic feature data in each block using an agglomerative clustering (AGC) algorithm, wherein the implementation details comprise calculating the similarity between classification nodes, determining the distance threshold of the agglomerative clustering algorithm, and merging classification nodes of the same class;

(3.2.1) calculation of similarity between classification nodes: calculating the similarity between nodes using a suitable distance formula;

(3.2.2) determination of the distance threshold: following the clustering principle that classification nodes in the same class are highly similar and classification nodes in different classes are highly dissimilar, the method uses formula (1-1) to evaluate the performance of the batch classifier and analyzes the relationship between different distance thresholds and that performance, so as to determine a suitable distance threshold for the clustering algorithm; in formula (1-1), KeyNum denotes the total number of triples, labelNum denotes the number of classes into which the traffic feature vectors are clustered, n_c denotes the number of triples whose traffic feature vectors are all assigned to the same class, and n_d denotes the number of triples whose class numbers differ from those of other triples, where a triple is the transport-layer protocol, IP and port extracted from a data packet in step (2.2);

(3.2.3) merging classification nodes of the same class: suppose a class in the current clustering result contains n feature vectors, and the i-th feature vector contains m attribute values {(x_1)_i, (x_2)_i, …, (x_m)_i}; the mean of these feature vectors is calculated using formula (1-2) and taken as a new classification node.

(3.3) the third step is merging the clustering results of each block from step (3.2): suppose the traffic clustering result of the i-th block contains N classification nodes, the information of each classification node comprising the feature vector corresponding to the node and the class number obtained by the clustering in step (3.2), and suppose the traffic of the i-th block is clustered into L classes; the classification nodes of the same class in the i-th block are then merged to obtain L new classification nodes as follows:

(3.3.1) defining an int variable k to denote the k-th class (1 ≤ k ≤ L, k an integer);

(3.3.2) traversing the N classification nodes in the clustering result of the i-th block and merging the classification nodes assigned to the k-th class into one new classification node according to formula (1-2); increasing k from 1 to L means the clustering result of the i-th block is traversed L times, yielding L new classification nodes;

(3.4) the fourth step is secondary clustering of the merged new classification nodes: clustering the new classification nodes merged in step (3.3) again with the conventional clustering algorithm to obtain a second clustering result;

(3.5) the fifth step is determining the final classification labels: tracing the classification path of each initial feature vector in each block to obtain its final class label, which completes the batch clustering of all traffic.

5. The method for classifying high-speed network traffic based on the sampled data stream as claimed in claim 1, wherein in the step (4), the labeled feature data in the batch clustering result obtained in the step (3.5) is model-trained by using a random forest algorithm, and the method for obtaining the classification model is as follows:

(4.1) dividing the labeled feature vector data obtained in step (3.5) into a training set and a validation set in a certain proportion, for use in training a random forest classifier;

(4.2) training a random forest classifier on the training set and the validation set to obtain a classification model.

6. The method for classifying high-speed network traffic based on sampled data streams as claimed in claim 1, wherein in the step (5), the classification model in the step (4) is used to classify the second part of backbone network traffic characteristic data which is not labeled in the step (1) as follows:

(5.1) using the feature data of the subsequently arriving second part of the backbone network traffic, extracted in step (2), as a test set;

(5.2) classifying the subsequently arriving second part of the backbone network traffic data using the classification model obtained in step (4).

Technical Field

The invention relates to a high-speed network traffic classification method based on a sampled data stream, and belongs to the technical field of network measurement.

Background

The backbone network is the core part of the Internet, and the network flow analysis of the backbone network is an important link of the whole network management work. Due to the high speed of the mass traffic in the backbone network, the analysis of the network traffic of the backbone network becomes a challenging task. Backbone traffic classification is the basis for backbone network management. The purpose of network traffic classification is to identify traffic classes from a mixture of different applications and protocols, thereby efficiently supporting downstream applications (e.g., QoS provisioning, network measurement, intrusion detection, etc.).

With the use of dynamic ports and the growth of encrypted traffic, traffic classification is gradually shifting from traditional port-based and payload-based methods to statistics-based methods (machine learning and deep learning algorithms).

Machine learning (ML) and deep learning (DL) algorithms are widely used in traffic classification research because of their high classification performance and their adaptability to dynamic ports and encrypted applications. A machine-learning traffic classifier can achieve high classification accuracy when combined with a suitable feature extraction scheme, and a deep-learning traffic classifier can classify unknown or encrypted traffic automatically and accurately.

However, backbone network traffic is fast: its transmission rate is typically about 10 Gbps. First, most related traffic classification work collects the full traffic and extracts features from it; applied to the massive, high-speed data of a backbone network, this approach needs a long time and a large amount of memory. Second, most traffic classification methods require all or part of the feature data to be labeled manually in advance, and manual labeling cannot keep up with the transmission rate of massive backbone network traffic. In addition, the related classification algorithms have high time and space complexity and struggle even with medium-scale data, let alone the massive data in a backbone network.

Disclosure of Invention

The invention aims to classify massive backbone network traffic accurately within reasonable time and limited memory. The method combines sequential sampling with a HASH bucket array structure to quickly extract the features of massive backbone network traffic, and combines an agglomerative clustering algorithm with a random forest algorithm to train the classification model, thereby classifying backbone network traffic effectively.

To this end, the method first samples the massive backbone network traffic sequentially and designs a HASH bucket array structure containing multiple counters to quickly extract the features of the sampled traffic. Second, the invention provides a batch classifier based on an agglomerative clustering algorithm, which clusters unlabeled traffic features in batches within reasonable time and limited memory and thereby labels the traffic feature data. Finally, the invention trains on the labeled feature data in the batch clustering result with a supervised machine learning method to obtain a classification model, which can be used to classify subsequently arriving backbone network traffic.

A high-speed network traffic classification method based on a sampled data stream comprises the following steps:

step (1) acquiring backbone network traffic data, wherein the data comprises two parts collected in two different time periods, and the first part is collected earlier than the second part;

step (2) combining sequential sampling with a HASH bucket array structure to rapidly extract traffic feature vectors and build a feature vector library;

step (3) designing a batch classifier based on an agglomerative clustering algorithm, using the batch classifier to cluster the unlabeled feature data of the first part of the backbone network traffic in the feature library built in step (2), and recording the classification result;

step (4) training a model on the labeled feature data in the batch clustering result using a random forest algorithm to obtain a classification model;

step (5) using the classification model from step (4) to classify the unlabeled feature data of the second part of the backbone network traffic from step (1).

Further, in the step (1), the method for acquiring the backbone network traffic data includes:

(1.1) acquiring a backbone network traffic data set that was collected continuously over a period of time at a backbone network node and contains a large number of data packets.

(1.2) dividing all data packets in the data set, in time order and in a certain proportion, into a first part and a second part. A public data set is divided in time order to simulate actual backbone network traffic. In a real deployment, the collected historical data (corresponding to the first part, collected earlier) is used to train a classification model; this model is then used to classify newly arriving traffic (corresponding to the second part, collected later) in real time.

Further, in the step (2), the method for establishing the feature vector library includes:

(2.1) for all the data acquired in step (1), sampling is first performed using a sequential sampling method. To verify the robustness of this sampling method, the invention uses a controlled-variable method: classification results are compared across a series of sampling ratios, showing that the sampling ratio has little effect on classification accuracy.

(2.2) designing a HASH bucket array structure containing multiple counters to rapidly extract the features of the sampled traffic. To record multiple attributes of backbone network traffic, the invention designs the HASH bucket array structure shown in FIG. 2, which records feature information in a two-dimensional array of w columns and d rows. Each cell in the array is a count bucket containing multiple counters. The structure supports an insertion operation and a feature vector extraction operation. Insertion has three steps: extract the packet's triple (transport-layer protocol, IP and port) as the key; hash the key into one count bucket in each row using d hash functions; and add 1 to the corresponding counter in each of those buckets. When the corresponding counter value meets the threshold requirement, the feature vector is extracted by calculating the values of multiple feature attributes.

Because TCP and UDP have different message structures and therefore different traffic characteristics, separate counters are designed for TCP packets and UDP packets. Because the features of unidirectional and bidirectional flows differ greatly, classifying them together gives poor classification performance. The traffic is therefore first divided into unidirectional flows and bidirectional flows, and each category is then subdivided to obtain a more accurate classification result. In the invention, unidirectional and bidirectional flows can be preliminarily distinguished by the counter values.
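The insertion and extraction operations of step (2.2) can be illustrated with a minimal Python sketch. The class name `HashBucketArray`, the use of salted BLAKE2b as the d hash functions, the row-minimum estimate borrowed from count-min sketches, and the "share of total" feature attributes are all assumptions made for illustration; the text does not specify the hash functions, the counter layout, or how the feature attributes are computed.

```python
import hashlib
from typing import List, Optional, Tuple

Key = Tuple[str, str, int]  # (transport-layer protocol, IP, port)


class HashBucketArray:
    """d rows x w columns of count buckets, each holding several counters."""

    def __init__(self, w: int, d: int, num_counters: int, threshold: int) -> None:
        self.w, self.d = w, d
        self.num_counters = num_counters
        self.threshold = threshold  # packets needed before features are extracted
        self.rows = [[[0] * num_counters for _ in range(w)] for _ in range(d)]

    def _column(self, key: Key, row: int) -> int:
        # Assumption: one hash function per row, made by salting BLAKE2b with the row index.
        digest = hashlib.blake2b(repr(key).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.w

    def insert(self, key: Key, counter: int) -> None:
        """Add 1 to the `counter`-th counter of the key's bucket in every row."""
        for row in range(self.d):
            self.rows[row][self._column(key, row)][counter] += 1

    def extract(self, key: Key) -> Optional[List[float]]:
        """Return each counter's share of the total once the threshold is met, else None."""
        buckets = [self.rows[r][self._column(key, r)] for r in range(self.d)]
        # Assumption: take the minimum over rows, the least-collided estimate.
        estimate = [min(b[i] for b in buckets) for i in range(self.num_counters)]
        total = sum(estimate)
        if total < self.threshold:
            return None
        return [count / total for count in estimate]


sketch = HashBucketArray(w=1024, d=3, num_counters=4, threshold=4)
key = ("TCP", "192.0.2.1", 443)
for counter in (0, 0, 1, 3):  # four packets falling into counters 0, 0, 1 and 3
    sketch.insert(key, counter)
print(sketch.extract(key))  # [0.5, 0.25, 0.0, 0.25]
```

Memory use is fixed at w × d × num_counters integers regardless of how many distinct triples appear, which is the property that makes this kind of structure attractive for high-speed links.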

Further, in step (3), as shown in FIG. 3, a batch classifier is designed based on the agglomerative clustering algorithm. The batch classifier clusters the unlabeled feature data of the first part of the backbone network traffic in the feature library built in step (2) and records the classification result, as follows:

(3.1) the first step is the segmentation of traffic features: all the features extracted in step (2) from the first part of the backbone network traffic are divided into a plurality of blocks, and the size of each block depends on the memory resources currently available to the user.

(3.2) clustering the traffic feature data in each block using the conventional agglomerative clustering (AGC) algorithm. The implementation details comprise calculating the similarity between classification nodes, determining the distance threshold of the agglomerative clustering algorithm, and merging classification nodes of the same class.

(3.2.1) calculation of similarity between classification nodes: the similarity between nodes is calculated using a suitable distance formula.

(3.2.2) determination of the distance threshold: following the clustering principle that classification nodes in the same class are highly similar and classification nodes in different classes are highly dissimilar, the invention uses formula (1-1) to evaluate the performance of the batch classifier and analyzes the relationship between different distance thresholds and that performance, so as to determine a suitable distance threshold for the clustering algorithm. In formula (1-1), KeyNum denotes the total number of triples, labelNum denotes the number of classes into which the traffic feature vectors are clustered, n_c denotes the number of triples whose traffic feature vectors are all assigned to the same class, and n_d denotes the number of triples whose class numbers differ from those of other triples. A triple is the transport-layer protocol, IP and port extracted from a data packet in step (2.2).

(3.2.3) merging classification nodes of the same class: suppose a class in the current clustering result contains n feature vectors, and the i-th feature vector contains m attribute values {(x_1)_i, (x_2)_i, …, (x_m)_i}; the mean of these feature vectors is calculated using formula (1-2) and taken as a new classification node.

(3.3) the third step is merging the clustering results of each block from step (3.2). Suppose the traffic clustering result of the i-th block contains N classification nodes, where the information of each classification node comprises the feature vector corresponding to the node and the class number obtained by the clustering in step (3.2). Suppose the traffic of the i-th block is clustered into L classes; the classification nodes of the same class in the i-th block are then merged to obtain L new classification nodes as follows:

(3.3.1) define an int variable k to denote the k-th class (1 ≤ k ≤ L, k an integer).

(3.3.2) traverse the N classification nodes in the clustering result of the i-th block and merge the classification nodes assigned to the k-th class into one new classification node according to formula (1-2). Increasing k from 1 to L means the clustering result of the i-th block is traversed L times, yielding L new classification nodes.

(3.4) the fourth step is secondary clustering of the merged new classification nodes: the new classification nodes merged in step (3.3) are clustered again with the conventional agglomerative clustering algorithm to obtain a second clustering result.

(3.5) the fifth step is determining the final classification labels: the classification path of each initial feature vector in each block is traced to obtain its final class label, which completes the batch clustering of all traffic. In summary: first, the classification nodes in each block are clustered with the agglomerative clustering algorithm (primary clustering); next, the primary clustering results are merged by class, averaging the classification nodes of the same class in each block to obtain one representative classification node per class; then the representative classification nodes of all blocks are taken as classification objects and clustered again with the agglomerative clustering algorithm (secondary clustering); finally, two mappings are determined: from each representative node of the primary clustering to its class in the secondary clustering result, and from each original classification node to its class in the primary clustering result. Composing the two mappings gives the class to which each original classification node is finally assigned in the secondary clustering result.
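Steps (3.1) to (3.5) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, formula (1-2) is read as a component-wise mean, cosine distance is used as the distance formula, a naive centroid-linkage merge rule stands in for the agglomerative clustering algorithm (the linkage is not specified in the text), and the same distance threshold is reused for both clustering stages.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean, the reading of formula (1-2) assumed here."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def agglomerate(vectors: List[Vector], threshold: float) -> List[int]:
    """Naive agglomerative clustering: merge the two closest clusters
    (by centroid distance) until the closest pair exceeds the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        centroids = [mean_vector([vectors[i] for i in c]) for c in clusters]
        dist, a, b = min((cosine_distance(centroids[a], centroids[b]), a, b)
                         for a in range(len(clusters))
                         for b in range(a + 1, len(clusters)))
        if dist > threshold:
            break
        clusters[a].extend(clusters.pop(b))
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def batch_cluster(vectors: List[Vector], block_size: int, threshold: float) -> List[int]:
    representatives: List[List[float]] = []  # one mean vector per first-stage class
    rep_of_vector: List[int] = []            # representative index of each input vector
    for start in range(0, len(vectors), block_size):      # step (3.1): split into blocks
        block = vectors[start:start + block_size]
        labels = agglomerate(block, threshold)             # step (3.2): cluster one block
        base = len(representatives)
        for k in range(max(labels) + 1):                   # step (3.3): merge each class
            members = [v for v, lab in zip(block, labels) if lab == k]
            representatives.append(mean_vector(members))
        rep_of_vector.extend(base + lab for lab in labels)
    final_of_rep = agglomerate(representatives, threshold)  # step (3.4): secondary clustering
    return [final_of_rep[r] for r in rep_of_vector]          # step (3.5): trace final labels


features = [(1, 0), (0, 1), (0.9, 0.1), (0.1, 0.9), (1, 0.05), (0.05, 1)]
print(batch_cluster(features, block_size=3, threshold=0.1))  # [0, 1, 0, 1, 0, 1]
```

Only one block and the list of representatives are clustered at any time, so peak memory is governed by the block size rather than by the total number of feature vectors.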

Further, in step (4), a random forest algorithm is used to train a model on the labeled feature data in the batch clustering result obtained in step (3.5), and the classification model is obtained as follows:

(4.1) dividing the labeled feature vector data obtained in step (3.5) into a training set and a validation set in a certain proportion, for use in training a random forest classifier.

(4.2) training a random forest classifier on the training set and the validation set to obtain a classification model. The random forest algorithm is used because it is a well-established supervised classification algorithm with good classification performance and stability.

Further, in the step (5), the method for classifying the second part of backbone network traffic characteristic data which is not labeled in the step (1) by using the classification model in the step (4) is as follows:

(5.1) using the feature data of the subsequently arriving second part of the backbone network traffic, extracted in step (2), as a test set.

(5.2) classifying the subsequently arriving second part of the backbone network traffic data using the classification model obtained in step (4).

Compared with the prior art, the technical scheme of the invention has the following beneficial technical effects.

(1) By combining sequential sampling with the HASH bucket array structure, the method extracts the features of massive backbone network traffic quickly, which makes it practical and innovative.

(2) The invention provides a batch classifier based on agglomerative clustering that clusters backbone network traffic accurately with sufficiently low time complexity and space complexity.

(3) The invention trains the classification model by combining an unsupervised agglomerative clustering algorithm with a supervised random forest algorithm. No data needs to be labeled in advance: unlabeled traffic is clustered automatically, and subsequently arriving backbone network traffic can be classified accurately. In addition, the batch classifier can update the classification model in batches as the traffic data changes, which maintains the classification accuracy of the model and makes the method more feasible.

Drawings

FIG. 1 is a general flow diagram of the present invention;

FIG. 2 is a schematic diagram of an array structure of a HASH bucket including a plurality of counters according to the present invention;

FIG. 3 is a schematic diagram of a proposed batch classifier based on agglomerative clustering;

FIG. 4 is a data analysis graph of a distance threshold determination experiment in one example of the invention;

FIG. 5 is an analysis of abnormal triple classification tags in classification results in one example of the present invention;

FIG. 6 is an analysis of TCP retransmission information in the data packets of an abnormal triple in one example of the invention.

Detailed Description

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example: as shown in FIG. 1, the high-speed network traffic classification method based on a sampled data stream according to the invention includes the following steps:

step (1) acquiring backbone network traffic data, wherein the data comprises two parts collected in two different time periods, and the first part is collected earlier than the second part;

step (2) combining sequential sampling with a HASH bucket array structure to rapidly extract traffic feature vectors and build a feature vector library;

step (3) designing a batch classifier based on an agglomerative clustering algorithm, using the batch classifier to cluster the unlabeled feature data of the first part of the backbone network traffic in the feature library built in step (2), and recording the classification result;

step (4) training a model on the labeled feature data in the batch clustering result using a random forest algorithm to obtain a classification model;

step (5) using the classification model from step (4) to classify the unlabeled feature data of the second part of the backbone network traffic from step (1).

In an embodiment of the present invention, in step (1), the method for acquiring the backbone network traffic data includes:

(1.1) obtaining the public data set collected by the MAWI working group on 3/6/2020, which contains 453,043,378 packets collected over 900 seconds on a 10 Gbps link.

(1.2) dividing all data packets in the data set in time order at a ratio of 2:1 into a first part and a second part; that is, the data collected in the first 600 seconds of the public data set is the first part, and the data collected in the last 300 seconds is the second part.
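The 2:1 split by capture time in step (1.2) can be sketched as below. The packet representation (a `(timestamp, payload)` tuple) and the function name `split_by_time` are hypothetical and chosen only for illustration; reading a real capture file is outside the scope of this sketch.

```python
from typing import Iterable, List, Tuple

Packet = Tuple[float, bytes]  # (capture timestamp in seconds, raw bytes); hypothetical layout


def split_by_time(packets: Iterable[Packet],
                  first_seconds: float) -> Tuple[List[Packet], List[Packet]]:
    """Split a capture into an earlier and a later part by elapsed time."""
    ordered = sorted(packets, key=lambda p: p[0])
    if not ordered:
        return [], []
    start = ordered[0][0]
    first = [p for p in ordered if p[0] - start < first_seconds]
    second = [p for p in ordered if p[0] - start >= first_seconds]
    return first, second


# Toy capture: one packet per second for 900 seconds.
capture = [(1000.0 + t, b"") for t in range(900)]
part1, part2 = split_by_time(capture, first_seconds=600.0)
print(len(part1), len(part2))  # 600 300
```

Splitting on elapsed time rather than on packet count keeps the first part strictly earlier than the second, which matches the train-on-history, classify-new-traffic setting described above.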

In one embodiment of the present invention, in step (2), the method for establishing the feature vector library includes the following steps:

(2.1) for all the data acquired in step (1), sampling is first performed using a sequential sampling method. To verify the robustness of this sampling method, a controlled-variable method is used: classification results are compared across the series of sampling ratios shown in Table 1. To evaluate the classification results, the Micro F1 score is used to assess the recall and precision of the classifier jointly, and the Accuracy of Label Prediction (AoLP), defined by formula (3-0), is used to assess the generalization ability of the classification model.

Table 1 Precision and AoLP of the same data at different sampling ratios

It can be seen that, for the same data at different sampling ratios, the overall Micro F1 score and the overall AoLP remain around 97%; even at a sampling ratio of 1:1024 they reach 96.3% and 97.3%, respectively. These results show that the sampling ratio has little effect on classification accuracy, and that accuracy remains acceptable even at a sampling ratio of 1:1024. In one embodiment of the invention, the sampling ratio is fixed at 1:32 for the subsequent experiments.
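As an illustration of sampling at a fixed ratio such as 1:32, the sketch below keeps every n-th packet of a stream in arrival order. Interpreting "sequential sampling" as this kind of count-based systematic sampling is an assumption, and the function name `sample_one_in_n` is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_one_in_n(packets: Iterable[T], n: int, offset: int = 0) -> Iterator[T]:
    """Yield every n-th packet of the stream, starting at index `offset`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return islice(packets, offset, None, n)


stream = range(1, 101)  # stand-in for 100 packets arriving in order
sampled = list(sample_one_in_n(stream, n=32))
print(sampled)  # [1, 33, 65, 97]
```

Because `islice` consumes the stream lazily, the sampler needs no buffer and does constant work per packet.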

(2.2) designing a HASH bucket array structure containing multiple counters to rapidly extract the features of the sampled traffic. To record multiple attributes of backbone network traffic, the invention designs the HASH bucket array structure shown in FIG. 2, which records feature information in a two-dimensional array of w columns and d rows. Each cell in the array is a count bucket containing multiple counters. The structure supports an insertion operation and a feature vector extraction operation. Insertion has three steps: extract the packet's triple (transport-layer protocol, IP and port) as the key; hash the key into one count bucket in each row using d hash functions; and add 1 to the corresponding counter in each of those buckets. When the corresponding counter value meets the threshold requirement, the feature vector is extracted by calculating the values of multiple feature attributes.

Because TCP and UDP have different message structures and therefore different traffic characteristics, separate counters are designed for TCP packets and UDP packets. The counters used in one example of the invention are described in Table 2.

Table 2 description of counters used in HASH bucket array structure

To determine the packet length intervals t1-t4 and u1-u4 in Table 2 and to classify accurately, the invention first groups together packets whose length is close to the maximum transmission unit (MTU); the MTU depends on the path, and the MTU of a typical path is between 1000 and 1500 bytes. Packets shorter than the MTU are then grouped by length according to the maximum entropy principle. From the probability density function (PDF) of packet length in the public data set, several concentration points of packet length are found above 1100 bytes; these points may be the MTUs of particular paths, so packet lengths above 1100 bytes are placed in a single interval. The invention then uses the cumulative distribution function (CDF) to divide the packets with lengths between 0 and 1100 bytes into equal-probability intervals, which gives the remaining length intervals. In this example, t1 to t4 are (0,83], (83,375], (375,1100) and (1100,1500), respectively, and u1 to u4 are (0,28], (28,140], (140,1100) and (1100,1500), respectively.
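Mapping a packet length to one of the four intervals can be sketched with a binary search over the interval edges. The function name `length_interval` is hypothetical, and the sketch assumes every interval is closed on the right, i.e. (low, high], so that lengths of exactly 1100 and 1500 bytes fall into t3/u3 and t4/u4.

```python
from bisect import bisect_left
from typing import Sequence

# Upper edges of the four length intervals from this example, in bytes.
TCP_EDGES = (83, 375, 1100, 1500)  # t1..t4
UDP_EDGES = (28, 140, 1100, 1500)  # u1..u4


def length_interval(length: int, edges: Sequence[int] = TCP_EDGES) -> int:
    """Return the 0-based interval index of a packet length.

    Assumption: each interval is (low, high], closed on the right.
    """
    if not 0 < length <= edges[-1]:
        raise ValueError(f"length {length} is outside (0, {edges[-1]}]")
    return bisect_left(edges, length)


print([length_interval(n) for n in (40, 83, 84, 1100, 1101, 1500)])  # [0, 0, 1, 2, 3, 3]
print(length_interval(29, UDP_EDGES))  # 1
```

The returned index can select which per-interval counter in a count bucket to increment for the packet.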

In an embodiment of the present invention, in step (3), as shown in fig. 3, a batch classifier is designed based on an agglomerative clustering algorithm. Using the feature library established in step (2), the batch classifier clusters the unlabeled feature data of the first part of the backbone network traffic and records the classification results as follows:

(3.1) The first step is the segmentation of the traffic features: all feature vectors extracted from the first part of the backbone network traffic in step (2) are divided into several blocks, and the block size depends on the memory available to the current user. As shown in Table 3, this example sets several block sizes for the classification study. Under the same conditions, the batch classifier achieves classification accuracy similar to that of the conventional agglomerative clustering algorithm (AGC) at each block size, its clustering time decreases as the block size decreases, and its clustering time is less than that of conventional AGC.

Table 3. Comparison of the classification results of the batch classifier and AGC for different block sizes

(NoFV denotes the number of feature vectors; AGC denotes the conventional agglomerative clustering algorithm; marked results are not shown in the block.)

(3.2) The second step clusters the traffic feature data in each block using the conventional Agglomerative Clustering (AGC) algorithm. The details comprise calculating the similarity between classification nodes, determining the distance threshold of the agglomerative clustering algorithm, and merging classification nodes of the same class.

(3.2.1) Calculation of similarity between classification nodes: the cosine distance is used to measure the similarity between classification nodes. Suppose two classification nodes correspond to the feature vectors f1 = (x1, x2, …, xm) and f2 = (y1, y2, …, ym); the similarity between the two classification nodes is calculated using formula (3-1).
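As an illustration of step (3.2.1), the sketch below computes a cosine distance between two feature vectors. It assumes the common definition, one minus the cosine of the angle between the vectors, so that 0 means identical direction; whether formula (3-1) uses exactly this form is an assumption, and the function name `cosine_distance` is hypothetical.

```python
import math


def cosine_distance(f1, f2):
    """Return 1 - cos(angle between f1 and f2); smaller means more similar."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(f1, f2))
    norm = math.sqrt(sum(x * x for x in f1)) * math.sqrt(sum(y * y for y in f2))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm
```

Under this definition, two nodes would be candidates for merging when their distance is at or below the chosen distance threshold.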

(3.2.2) Determination of the distance threshold: following the clustering principle that classification nodes in the same class should be highly similar and classification nodes in different classes highly dissimilar, the invention uses formula (3-2) to evaluate the performance of the batch classifier and analyzes how that performance varies with the distance threshold, so as to determine a suitable distance threshold for the clustering algorithm. In formula (3-2), KeyNum is the total number of triplets, labelNum is the number of classes into which the traffic feature vectors are clustered, n_c is the number of triplets whose traffic feature vectors are all assigned to the same class, and n_d is the number of triplets whose class number differs from those of the other triplets. A triplet here is the (transport layer protocol, IP, port) key extracted from the packet in step (2.2).

This example sets a series of distance thresholds to observe the performance of the batch classifier. As fig. 4 shows, the clustering effect is better when the distance threshold is 0.05, so the distance threshold is set to 0.05 for the subsequent batch classification.

(3.2.3) Merging classification nodes in the same class: suppose a class in the current clustering result contains n feature vectors, and the i-th feature vector contains m attribute values {(x1)i, (x2)i, …, (xm)i}. The invention uses formula (3-3) to calculate the mean of these feature vectors, and the resulting mean vector, whose j-th attribute is the average of (xj)1, …, (xj)n, serves as a new classification node.

(3.3) The third step merges the clustering results of each block from step (3.2). Suppose the traffic clustering result of the i-th block contains N classification nodes, where the information of each classification node comprises the feature vector of the node and the class number obtained by clustering in step (3.2). Suppose the traffic of the i-th block is clustered into L classes; the classification nodes of the same class in the i-th block are then merged into L new classification nodes as follows:

(3.3.1) Define an int variable k denoting the k-th class (1 ≤ k ≤ L, k an integer).

(3.3.2) Traverse the N classification nodes in the clustering result of the i-th block and merge the classification nodes assigned to the k-th class into one new classification node according to formula (3-3). Increasing k from 1 to L traverses the clustering result of the i-th block L times and yields L new classification nodes.
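The per-class merge of step (3.3) can be sketched as follows, taking the merge rule to be the attribute-wise mean described in step (3.2.3). The function name `merge_by_class` is hypothetical, and the sketch accumulates all classes in a single pass over the nodes instead of traversing the block L times; the resulting nodes are the same.

```python
def merge_by_class(vectors, labels):
    """Return {class label: mean feature vector of the nodes carrying that label}."""
    sums, counts = {}, {}
    for vec, lab in zip(vectors, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
            counts[lab] = 0
        # Accumulate attribute-wise sums for this class.
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    # Divide each sum by the class size to obtain the mean vector.
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}
```

For a block clustered into L classes, the returned dictionary has L entries, one new classification node per class.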

(3.4) The fourth step is the secondary clustering of the merged new classification nodes: the new classification nodes merged in step (3.3) are clustered again using the conventional agglomerative clustering algorithm to obtain the secondary clustering result.

(3.5) The fifth step is the determination of the final classification label: the classification track of each initial feature vector in each block is traced to obtain its final class label, which completes the batch clustering of all traffic. In summary, the classification nodes in each block are first clustered with the agglomerative clustering algorithm (primary clustering). The primary clustering results are then merged by class: the classification nodes of the same class in each block are averaged to obtain a representative classification node. Next, the representative classification nodes of all blocks are taken as the objects to classify and are clustered a second time with the agglomerative clustering algorithm (secondary clustering). Finally, two mappings are established: the mapping between the secondary clustering labels and the representative nodes of the primary clustering (that is, which class each representative node falls into in the secondary clustering result), and the mapping between the primary clustering labels and the original classification nodes (that is, which class each original classification node falls into in the primary clustering result). Composing the two mappings gives the classification label of each original classification node in the final secondary clustering result.
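Steps (3.1) to (3.5) can be put together in a compact Python sketch. It is only an illustration under several assumptions: the agglomerative step is a naive centroid-linkage version written for readability (the text does not state the linkage), the distance is one minus cosine similarity, all feature vectors are non-zero, and the names `agglomerate` and `batch_classify` are hypothetical.

```python
import math


def cos_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def agglomerate(vectors, threshold):
    """Naive agglomerative clustering: repeatedly merge the two closest clusters
    (by centroid) until the closest pair is farther apart than `threshold`.
    Returns one label per input vector, numbered 0, 1, 2, ..."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        cents = [mean([vectors[i] for i in c]) for c in clusters]
        dist, p, q = min((cos_dist(cents[p], cents[q]), p, q)
                         for p in range(len(cents)) for q in range(p + 1, len(cents)))
        if dist > threshold:
            break
        clusters[p] += clusters.pop(q)  # q > p, so index p stays valid
    labels = [0] * len(vectors)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


def batch_classify(vectors, block_size, threshold=0.05):
    """Return the final class label of every input vector."""
    reps, rep_of = [], []  # representative nodes; representative index per vector
    for start in range(0, len(vectors), block_size):       # (3.1) split into blocks
        block = vectors[start:start + block_size]
        labels = agglomerate(block, threshold)              # (3.2) primary clustering
        base = len(reps)
        for lab in range(max(labels) + 1):                  # (3.3) merge by class
            reps.append(mean([v for v, l in zip(block, labels) if l == lab]))
        rep_of.extend(base + l for l in labels)
    final = agglomerate(reps, threshold)                    # (3.4) secondary clustering
    return [final[r] for r in rep_of]                       # (3.5) compose the mappings
```

Only one block of original vectors plus the list of representative nodes is clustered at a time, which is what lets the block size be matched to the available memory.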

In an embodiment of the present invention, in step (4), a random forest algorithm is used to train a model on the labeled feature data in the batch clustering result obtained in step (3.5). The classification model is obtained as follows:

(4.1) Divide the labeled feature vector data obtained in step (3.5) into a training set and a verification set in a 7:3 ratio for training the random forest classifier.

(4.2) Train the random forest classifier on the training set and the verification set to obtain the classification model.
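The 7:3 division in step (4.1) can be sketched with the Python standard library alone. The shuffling, the fixed seed, and the function name `split_7_3` are assumptions made for the example; the text does not say how the samples are assigned to the two sets.

```python
import random


def split_7_3(samples, labels, seed=0):
    """Shuffle the labeled samples and split them 7:3 into training and verification sets.

    Returns ((train_samples, train_labels), (val_samples, val_labels)).
    """
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)  # seeded so the split is reproducible
    cut = (7 * len(order)) // 10

    def pick(indices):
        return [samples[i] for i in indices], [labels[i] for i in indices]

    return pick(order[:cut]), pick(order[cut:])
```

The training part would then be passed to a random forest implementation, for instance scikit-learn's `RandomForestClassifier`, and the verification part used to check the fitted model.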

Further, in step (5), the classification model from step (4) is used to classify the unlabeled feature data of the second part of the backbone network traffic from step (1) as follows:

(5.1) Use the feature data of the subsequently arriving second part of the backbone network traffic, obtained in step (2), as the test set.

(5.2) Classify the subsequently arriving second part of the backbone network traffic data using the classification model obtained in step (4), and analyze the classification result to judge the subsequent network state.

In an example of the present invention, the classification result of the second part of the backbone network traffic contains a triplet (TCP,163.61.27.198,80) with lower classification accuracy. Fig. 5 shows how the class of this triplet's data changes over time: at about 600 seconds, the class to which the triplet's data belongs changes markedly.

In this example, the data packets corresponding to the triplet are extracted and analyzed with the Wireshark tool. As shown in fig. 6, TCP retransmissions start to increase after 600 seconds. Further analysis of the feature data of this triplet reveals a significant increase in packet loss at about 600 seconds.

The above examples are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and equivalent substitutions without departing from the spirit of the invention, and all such modifications and equivalents fall within the scope of the invention as defined in the claims.
