Distributed clustering method and system for large-scale stream data

Document No.: 1783062. Published: 2019-12-06.

Reading note: this technology, "Distributed clustering method and system for large-scale stream data", was created by 许利杰 (Xu Lijie), 王伟 (Wang Wei), 魏峻 (Wei Jun), 康锴 (Kang Kai), and 叶星彤 (Ye Xingtong) on 2019-08-27. Its main content is as follows. The invention relates to a distributed clustering method and system for large-scale stream data. By constructing a time-sequenced micro-batch incremental computation model and a parallelization method based on multi-dimensional division, it realizes a distributed system framework that can parallelize typical streaming clustering algorithms, addressing the difficulty of parallelizing current streaming clustering algorithms and their low throughput. The invention was tested for clustering quality, throughput, and scalability on real data sets. The results show that streaming clustering algorithms implemented on the system achieve sub-linear throughput growth as the cluster size increases while keeping clustering quality close to that of the standard single-machine streaming clustering algorithm. Compared with streaming clustering algorithms implemented with other parallelization approaches (such as unordered updating), clustering quality improves by 2.5 times and throughput by 1.9 times. The invention can therefore meet the low-latency, high-throughput cluster analysis requirements of large-scale, high-velocity data.

1. A distributed clustering method for large-scale stream data, characterized by comprising the following steps:

Step one: for large-scale stream data whose high-velocity flow changes dynamically, a time-sequenced micro-batch incremental computation model is constructed, solving the problem that streaming clustering algorithms are difficult to parallelize under the traditional computation model; a typical streaming clustering algorithm maintains a micro-cluster set and updates it according to the stream data received; the time-sequenced micro-batch incremental computation model divides the incoming stream data into batches, updates the micro-cluster set according to the data in one batch, and processes the next batch only after the current batch has been processed; the model decouples a typical streaming clustering algorithm into three computation stages that guarantee time-ordered updating of the data, solving the problem that existing parallelization schemes cannot guarantee time-ordered data processing; the three decoupled computation stages are as follows:

(1) Computation stage for finding the nearest micro-cluster: for each newly arrived data item, this stage computes its distance to all micro-clusters, selects the nearest micro-cluster, and determines from the distance and the boundary condition of that micro-cluster whether the data item can join it; if the data item cannot join the micro-cluster, it becomes an outlier;

(2) Local update stage of the micro-cluster set: each data item is incrementally updated into the corresponding micro-cluster obtained in the first stage;

(3) Global update stage of the micro-cluster set: the updated micro-clusters are merged with the global micro-cluster set;

wherein the computation and update process in each stage is determined by the micro-cluster set representation, the update process, the aging mechanism, and the outlier handling process of the specific streaming clustering algorithm;

Step two: on the basis of the time-sequenced micro-batch incremental computation model constructed in step one, a parallelization method based on multi-dimensional division is adopted to parallelize each computation and update stage of the streaming clustering algorithm, namely the computation stage for finding the nearest micro-cluster, the local update stage of the micro-cluster set, and the global update stage of the micro-cluster set of step one; the parallelization method based on multi-dimensional division comprises the following two methods:

(1) Data-parallel method: the data items are distributed to different computing nodes, each computing node stores the entire micro-cluster set, and each data item is computed against and updated into the entire micro-cluster set;

(2) Model-parallel method: the micro-cluster set is distributed to different computing nodes, each computing node stores a part of the micro-cluster set, finds for each data item the best micro-cluster within its part, and performs the update on that part;

Step three: each micro-cluster in the updated micro-cluster set is treated as a single data point, and a typical iterative clustering algorithm, such as KMeans or DBSCAN, is run on these points to obtain the final clustering result.

2. The distributed clustering method for large-scale stream data according to claim 1, wherein in step one the time-sequenced micro-batch incremental computation model is implemented by the following steps:

(1) Initialization stage: iterative computation is performed on the initial data set according to the number m of micro-clusters set by the user, yielding m micro-clusters {Q1, Q2, …, Qm-1, Qm}, wherein each micro-cluster Qi, 1 ≤ i ≤ m, is represented by a CF vector whose components are the per-dimension sum of the data belonging to the micro-cluster, the per-dimension sum of squares of that data, the sum of the timestamps of that data, and n, the sum of the weights of that data;

(2) Computation stage for finding the nearest micro-cluster: the distance between each data item x and the center point O of each micro-cluster Q is computed, wherein Oi and xi denote the i-th dimension of O and x, respectively;

(3) Local update stage of the micro-cluster set: the micro-cluster Q nearest to each data item x is found, and it is determined whether the distance dis between x and Q is less than or equal to the radius range ε of Q; if so, x is updated into Q, that is, each dimension of x is updated into the CF vector of Q, with n = decay · n + 1; otherwise, a new micro-cluster Q' is generated from x and added to the micro-cluster set, that is, a new CF vector is generated to represent Q', with n = 1; here xi denotes the i-th dimension of each vector, and decay is an attenuation parameter set by the user;

(4) Global update stage of the micro-cluster set: the number of micro-clusters of the streaming clustering algorithm is fixed at m; if the current number exceeds m, it must be reduced as follows: the smallest micro-cluster Q in the existing micro-cluster set is deleted first, and if the number still exceeds m, the two micro-clusters closest to each other are merged into one, repeatedly, until the number of micro-clusters equals m; in the CF vector update formula, n = decay · n + 1.

3. The distributed clustering method for large-scale stream data according to claim 1, wherein in step two each computation and update stage of the streaming clustering algorithm is parallelized with the parallelization method based on multi-dimensional division as follows:

(1) Computation stage for finding the nearest micro-cluster: for each newly arrived data item, this stage computes its distance to all micro-clusters, selects the nearest micro-cluster, and determines from the distance and the boundary condition of that micro-cluster whether the data item can join it; if the data item cannot join the micro-cluster, it becomes an outlier; the data-parallel and model-parallel methods are compared in three respects, namely network transfer volume, computation latency, and degree of parallelism, and this stage is finally parallelized with the data-parallel method;

(2) Local update stage of the micro-cluster set: each data item is incrementally updated into the corresponding micro-cluster; the data-parallel and model-parallel methods are compared in four respects, namely time order of the data, computation latency, network transfer volume, and degree of parallelism, and this stage is finally parallelized with the model-parallel method;

(3) Global update stage of the micro-cluster set: in this stage the updated micro-clusters must be merged with the global micro-cluster set; because the global update stage requires the least computation of the three stages and must preserve the update order, it is executed sequentially on a single machine.

4. A clustering system implemented on the basis of the distributed clustering method for large-scale stream data according to any one of claims 1 to 3, comprising an access layer, an application layer, a computation layer, and a storage layer;

Access layer: feeds data to the application layer as a data stream, which may be a network stream or a real-time data stream; the data stream is distributed to different nodes, and the nodes access the data by pulling;

Application layer: this layer provides four interfaces, namely model representation, update process, aging mechanism, and outlier handling process; users combine these interfaces to implement a specific streaming clustering algorithm and tailor a solution to their business scenario; the application layer is aimed at algorithm developers, who need not consider the distributed implementation of the algorithm and only implement the computation logic of model representation, update process, aging mechanism, and outlier handling according to the given interfaces; the application layer carries out the implementation steps of constructing the time-sequenced micro-batch incremental computation model and the micro-cluster set;

Computation layer: each computing node receives the stream data delivered by the access layer and executes the corresponding processing according to the computation and update process of the specific streaming clustering algorithm in the application layer; after each batch of data is finished, the computation layer merges the results of the computing nodes, updates the incremental data into the micro-cluster set, and finally feeds back the micro-cluster set; the computation layer parallelizes the streaming clustering algorithm and carries out its concrete implementation using the parallelization method based on multi-dimensional division;

Storage layer: this layer adopts a distributed storage architecture and is responsible for storing the state of the micro-cluster set; the micro-cluster set is spread across the storage nodes for storage and backup, and a persistence interface is provided for backing up the micro-cluster set to permanent media, further improving reliability.

Technical Field

The invention relates to data mining over large-scale stream data, in particular to a distributed clustering method and system for large-scale stream data, and belongs to the technical field of software.

Background

At present, stream data mining is widely used in application scenarios such as Internet of Things data mining, network traffic intrusion detection, click-through rate analysis, and real-time product recommendation. Traditional data mining techniques mine static batch data and are not suited to dynamic stream data. First, static data mining knows all the data to be mined in advance, whereas stream data are unbounded and stream data mining must handle data not yet seen, so it has to process incoming data quickly and promptly. Second, stream data change dynamically, and traditional data mining techniques cannot detect and report such changes in time. Third, because stream data change dynamically, stream data mining must pay more attention to newly generated data, that is, give newer data higher weight, whereas static data mining techniques give all data the same weight and do not distinguish new data from old.

Mining latent data patterns from an unbounded data stream accurately and in a timely manner is crucial. Taking a news recommender system as an example, news items with similar characteristics can be grouped into a category, and news in a category can be recommended to users who like that category. However, news is generated in real time, so the system must process each item promptly in order to recommend it in time; in addition, the system should pay more attention to newly generated news.

Streaming clustering is a major category of stream data mining techniques and is widely applied to network attack detection, weather prediction, click-traffic prediction, and the like (reference: J. de Andrade Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. de Leon Ferreira de Carvalho, and J. Gama. Data stream clustering: A survey. ACM Computing Surveys, 46(1):13:1-13:31, 2013.). By clustering similar data in a data stream, a streaming clustering algorithm can effectively capture changes in the stream. In addition, a streaming clustering algorithm can serve as the basis of other machine learning algorithms: streaming clustering produces a data summary, and further data mining can be carried out directly on that summary. Streaming clustering algorithms assign different weights to new and old data by means of various time window techniques, and thus pay more attention to new data during processing. However, current streaming clustering algorithms process one data item at a time, that is, the algorithm can process the next data item only after the previous one has been processed, so they can run only on a single machine and cannot be parallelized. The single-machine implementation of the classic streaming clustering algorithm CluStream can sustain a throughput of only about 5k (reference: C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB), pages 81-92, 2003.), which cannot meet the low-latency, high-throughput processing requirements of today's high-velocity, dynamically changing data.

However, existing work cannot support the parallelization of typical streaming clustering algorithms. Traditional parallel machine learning frameworks and systems, such as Spark MLlib and Petuum, essentially iterate over the entire data set and cannot meet the single-pass computation and time-ordered update requirements of stream data mining algorithms. Researchers have implemented streaming clustering algorithms on various distributed stream computing frameworks; however, each of these works targets a single streaming clustering algorithm and cannot accommodate the computation logic of other typical streaming clustering algorithms, for example, decaying data weights over time. Furthermore, the prior art (reference: O. Backhoff and E. Ntoutsi. Scalable online-offline stream clustering in Apache Spark. In IEEE International Conference on Data Mining Workshops (ICDMW), pages 37-44, 2016.) cannot guarantee that data are processed in time order, as stream data processing requires.

Disclosure of Invention

The problem solved by the invention: a distributed clustering method and system for large-scale stream data are provided that suit a variety of typical streaming clustering algorithms, reduce the computation latency of the algorithms, raise their throughput, and give them good scalability, all while preserving clustering quality.

The technical scheme of the invention is as follows: a distributed clustering method for large-scale stream data comprises the following steps:

Step one: stream data are a dynamic data set that grows without bound over time. The method targets large-scale stream data whose high-velocity flow changes dynamically: "high velocity" means that data are generated quickly, up to the GB/s level; "dynamically changing" means that the data distribution keeps changing over time and is never stable; "large-scale" means that the data volume is large, up to the PB level, and that the data dimensionality is high, from tens to thousands of dimensions. In view of these characteristics, the invention constructs a time-sequenced micro-batch incremental computation model for such data, solving the problem that streaming clustering algorithms are difficult to parallelize under the traditional computation model. A typical streaming clustering algorithm maintains a micro-cluster set and updates it according to the stream data received. The time-sequenced micro-batch incremental computation model divides the incoming stream data into batches, updates the micro-cluster set according to the data in one batch, and processes the next batch only after the current batch has been processed. The model decouples a typical streaming clustering algorithm into three computation stages that guarantee time-ordered updating of the data, solving the problem that existing parallelization schemes cannot guarantee time-ordered data processing; the three decoupled computation stages are as follows:

(1) Computation stage for finding the nearest micro-cluster: for each newly arrived data item, this stage computes its distance to all micro-clusters, selects the nearest micro-cluster, and determines from the distance and the boundary condition of that micro-cluster whether the data item can join it; if the data item cannot join the micro-cluster, it becomes an outlier;

(2) Local update stage of the micro-cluster set: each data item is incrementally updated into the corresponding micro-cluster obtained in the first stage;

(3) Global update stage of the micro-cluster set: the updated micro-clusters are merged with the global micro-cluster set;

wherein the computation and update process in each stage is determined by the micro-cluster set representation, the update process, the aging mechanism, and the outlier handling process of the specific streaming clustering algorithm;

Step two: on the basis of the time-sequenced micro-batch incremental computation model constructed in step one, a parallelization method based on multi-dimensional division is adopted to parallelize each computation and update stage of the streaming clustering algorithm, namely the computation stage for finding the nearest micro-cluster, the local update stage of the micro-cluster set, and the global update stage of the micro-cluster set of step one; the parallelization method based on multi-dimensional division comprises the following two methods:

(1) Data-parallel method: the data items are distributed to different computing nodes, each computing node stores the entire micro-cluster set, and each data item is computed against and updated into the entire micro-cluster set;

(2) Model-parallel method: the micro-cluster set is distributed to different computing nodes, each computing node stores a part of the micro-cluster set, finds for each data item the best micro-cluster within its part, and performs the update on that part;

The invention weighs network transfer volume, computation latency, and degree of parallelism together, selects the best parallelization method for each computation and update stage, and thereby parallelizes the streaming clustering algorithm.

Step three: each micro-cluster in the updated micro-cluster set is treated as a single data point, and a typical iterative clustering algorithm, such as KMeans or DBSCAN, is run on these points to obtain the final clustering result.
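
As an illustration of step three, the sketch below treats each micro-cluster as a single point (its center) weighted by its n and runs a plain Lloyd-style k-means over those points. It is a minimal sketch rather than the patented implementation: the function name `weighted_kmeans`, the deterministic initialization from the first k points, the fixed iteration count, and the sample values are all assumptions made for the example.

```python
from math import dist


def weighted_kmeans(points, weights, k, iters=20):
    """Lloyd-style k-means over micro-cluster centers, each weighted by its n."""
    # Assumption: deterministic initialization from the first k points.
    centers = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each micro-cluster goes to its nearest macro center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: weighted mean of the member micro-cluster centers.
        for j in range(k):
            members = [(p, w) for p, w, lab in zip(points, weights, labels) if lab == j]
            if not members:
                continue  # keep the previous center for an empty cluster
            total = sum(w for _, w in members)
            dims = len(members[0][0])
            centers[j] = tuple(
                sum(p[d] * w for p, w in members) / total for d in range(dims)
            )
    return centers, labels


# Four hypothetical micro-clusters, given by center and weight n.
micro_centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 2.0), (10.0, 12.0)]
micro_weights = [3.0, 1.0, 1.0, 1.0]
centers, labels = weighted_kmeans(micro_centers, micro_weights, k=2)
```

Weighting by n lets a micro-cluster that summarizes many records pull the final center harder than one that summarizes few, which is one plausible way to treat a micro-cluster as "a single data point".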

In step one, the time-sequenced micro-batch incremental computation model is constructed by the following steps:

(1) Initialization stage: iterative computation is performed on the initial data set according to the number m of micro-clusters set by the user, yielding m micro-clusters {Q1, Q2, …, Qm-1, Qm}, wherein each micro-cluster Qi, 1 ≤ i ≤ m, is represented by a CF vector whose components are the per-dimension sum of the data belonging to the micro-cluster, the per-dimension sum of squares of that data, the sum of the timestamps of that data, and n, the sum of the weights of that data;

(2) Computation stage for finding the nearest micro-cluster: the distance between each data item x and the center point O of each micro-cluster Q is computed, wherein Oi and xi denote the i-th dimension of O and x, respectively;

(3) Local update stage of the micro-cluster set: the micro-cluster Q nearest to each data item x is found, and it is determined whether the distance dis between x and Q is less than or equal to the radius range ε of Q. If so, x is updated into Q; specifically, each dimension of x is updated into the CF vector of Q, with n = decay · n + 1. Otherwise, a new micro-cluster Q' is generated from x and added to the micro-cluster set; specifically, a new CF vector is generated to represent Q', with n = 1. Here xi denotes the i-th dimension of each vector, and decay is an attenuation parameter set by the user.

(4) Global update stage of the micro-cluster set: the number of micro-clusters of the algorithm is fixed at m, and if the current number exceeds m, it must be reduced. The smallest micro-cluster Q in the existing micro-cluster set is deleted first; if the number still exceeds m, the two micro-clusters closest to each other are merged into one, repeatedly, until the number of micro-clusters equals m. In the CF vector update formula, n = decay · n + 1.
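
The stages above can be sketched on a single machine as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: only n = decay · n + 1 (for absorption) and n = 1 (for a new micro-cluster) come from the text. Applying the same decay to the other CF components, merging by adding CF components, reading "smallest" as smallest weight n, using Euclidean distance, and using a fixed bound `eps` in place of the radius formula are all assumptions, as are the names `MicroCluster` and `process_batch`.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass
class MicroCluster:
    ls: list   # per-dimension sum of the absorbed data
    ss: list   # per-dimension sum of squares
    ts: float  # sum of timestamps
    n: float   # sum of weights

    @classmethod
    def from_point(cls, x, t):
        # A new micro-cluster starts with n = 1, as stated in the text.
        return cls([float(v) for v in x], [float(v) * v for v in x], float(t), 1.0)

    def center(self):
        return [s / self.n for s in self.ls]

    def absorb(self, x, t, decay):
        # n = decay * n + 1 is from the text; decaying the other components
        # in the same way is an assumption of this sketch.
        self.ls = [decay * s + v for s, v in zip(self.ls, x)]
        self.ss = [decay * q + v * v for q, v in zip(self.ss, x)]
        self.ts = decay * self.ts + t
        self.n = decay * self.n + 1.0

    def merge(self, other):
        # Assumption: merging adds the CF components of both micro-clusters.
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
        self.ts += other.ts
        self.n += other.n


def process_batch(clusters, batch, m, decay, eps):
    """Run the three stages for one batch of (x, t) records."""
    # Stage 1: nearest micro-cluster for every record, on the same snapshot.
    found = []
    for x, t in batch:
        i = min(range(len(clusters)), key=lambda j: dist(x, clusters[j].center()))
        found.append((t, x, i, dist(x, clusters[i].center())))
    # Stage 2: local update in timestamp order; far records start new clusters.
    created = []
    for t, x, i, d in sorted(found, key=lambda r: r[0]):
        if d <= eps:
            clusters[i].absorb(x, t, decay)
        else:
            created.append(MicroCluster.from_point(x, t))
    # Stage 3: global update, keeping at most m micro-clusters.
    clusters = clusters + created
    if len(clusters) > m:
        # Assumption: "smallest" means smallest weight n.
        del clusters[min(range(len(clusters)), key=lambda j: clusters[j].n)]
    while len(clusters) > m:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]].center(), clusters[p[1]].center()),
        )
        clusters[a].merge(clusters[b])
        del clusters[b]
    return clusters


start = [MicroCluster.from_point((0.0, 0.0), 0), MicroCluster.from_point((10.0, 10.0), 0)]
batch = [((1.0, 0.0), 1), ((10.0, 9.0), 2), ((50.0, 50.0), 3)]
result = process_batch(start, batch, m=2, decay=0.9, eps=2.0)
```

In this run the first two records are absorbed by their nearest micro-clusters, while the third lies beyond `eps`, becomes a new micro-cluster with n = 1, and is then removed in stage 3 because the set would otherwise exceed m.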

In step two, each computation stage of the algorithm is parallelized with the parallelization method based on multi-dimensional division as follows:

(1) Computation stage for finding the nearest micro-cluster: for each newly arrived data item, this stage computes its distance to all micro-clusters and selects the nearest micro-cluster to check whether the data item can join it. Whether the data item can join is determined from the distance and the boundary condition of the micro-cluster; if it cannot join, the data item becomes an outlier. The data-parallel and model-parallel methods are compared in three respects, namely network transfer volume, computation latency, and degree of parallelism, and this stage is finally parallelized with the data-parallel method.

(2) Local update stage of the micro-cluster set: this stage incrementally updates each data item into the corresponding micro-cluster. The data-parallel and model-parallel methods are compared in four respects, namely time order of the data, computation latency, network transfer volume, and degree of parallelism, and this stage is finally parallelized with the model-parallel method.

(3) Global update stage of the micro-cluster set: in this stage the updated micro-clusters must be merged with the global micro-cluster set. Because the global update stage requires the least computation of the three stages and must preserve the update order, it is executed on a single machine.
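
The per-stage choice described above can be sketched as follows, using a thread pool as a stand-in for computing nodes. This is a hedged illustration only: the system described in this document is built on Spark Streaming, whereas the sketch uses Python threads purely to show how the work is split. Micro-clusters are reduced to one-dimensional (sum, n) pairs, the boundary check and the deletion and merging of stage 3 are omitted, decay is assumed to apply to the sum as well as to n, and the names `find_nearest`, `local_update`, and `run_batch` are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def find_nearest(partition, centers):
    """Stage 1 worker (data parallel): a slice of records, all centers."""
    return [
        (t, x, min(range(len(centers)), key=lambda i: abs(x - centers[i])))
        for t, x in partition
    ]


def local_update(item):
    """Stage 2 worker (model parallel): one micro-cluster, its own records."""
    cid, (s, n), records, decay = item
    for _, x in sorted(records):  # apply records in timestamp order
        s, n = decay * s + x, decay * n + 1.0
    return cid, (s, n)


def run_batch(state, batch, workers=2, decay=0.9):
    """state: list of (sum, n) pairs; batch: list of (timestamp, value)."""
    centers = [s / n for s, n in state]
    # Data parallel: records are split across workers, centers are replicated.
    parts = [batch[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        assigned = [
            rec
            for part in pool.map(lambda p: find_nearest(p, centers), parts)
            for rec in part
        ]
        # Model parallel: records are regrouped by the micro-cluster they chose.
        groups = defaultdict(list)
        for t, x, cid in assigned:
            groups[cid].append((t, x))
        items = [(cid, state[cid], recs, decay) for cid, recs in groups.items()]
        updates = dict(pool.map(local_update, items))
    # Stage 3 runs sequentially: fold the per-cluster results into the set.
    return [updates.get(cid, old) for cid, old in enumerate(state)]


state = [(0.0, 1.0), (10.0, 1.0)]
batch = [(1, 1.0), (2, 9.0), (3, 2.0), (4, 11.0)]
new_state = run_batch(state, batch)
```

Because each micro-cluster is owned by exactly one stage 2 worker, its records can be applied in timestamp order without coordination between workers, which is consistent with the text's choice of model parallelism for the local update.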

The system architecture of the invention is shown in FIG. 1, and the system as a whole can be divided into the following four parts:

A clustering system implemented with the distributed clustering method for large-scale stream data comprises an access layer, an application layer, a computation layer, and a storage layer.

Compared with the prior art, the invention has the following advantages:

(1) The invention proposes a time-sequenced micro-batch incremental update mode, which overcomes the difficulty of parallelizing streaming clustering algorithms under the existing computation mode. By decoupling the computation into three stages that guarantee time-ordered updating of the data, it solves the problem that existing parallelization schemes cannot guarantee time-ordered data processing.

(2) The invention designs a different parallelization for each stage, solving the problem that existing parallelization schemes cannot deliver both clustering quality and computation throughput.

(3) The invention implements a distributed system for large-scale stream data on the basis of Spark Streaming and provides interfaces through which users implement a specific streaming clustering algorithm.

(4) The invention implements a specific streaming clustering algorithm on the proposed distributed computing system for streaming cluster analysis and tests its clustering quality, throughput, and scalability. Experimental results show that the streaming clustering algorithm implemented with the proposed method and system reaches 99% of the clustering quality of the single-machine streaming clustering algorithm while achieving sub-linear throughput growth in a multi-machine environment. It can meet the low-latency, high-throughput processing requirements of high-velocity, dynamically changing stream data while preserving clustering quality.
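
The interfaces mentioned in advantage (3) are named in claim 4 as model representation, update process, aging mechanism, and outlier handling process. The sketch below shows one way such a four-part contract could look. The class and method names (`StreamClusteringAlgorithm`, `represent`, `update`, `age`, `handle_outlier`) are hypothetical and not taken from the patent, and the toy `DecayedMean1D` algorithm exists only to exercise the interface.

```python
from abc import ABC, abstractmethod


class StreamClusteringAlgorithm(ABC):
    """Hypothetical four-interface contract for an algorithm developer."""

    @abstractmethod
    def represent(self, record):
        """Model representation: build a micro-cluster from one record."""

    @abstractmethod
    def update(self, micro_cluster, record):
        """Update process: fold one record into a micro-cluster."""

    @abstractmethod
    def age(self, micro_cluster):
        """Aging mechanism: reduce the weight of older data."""

    @abstractmethod
    def handle_outlier(self, micro_clusters, record):
        """Outlier handling: decide what happens to a record that fits nowhere."""


class DecayedMean1D(StreamClusteringAlgorithm):
    """Toy 1-D algorithm: a micro-cluster is a decayed (sum, n) pair."""

    def __init__(self, decay):
        self.decay = decay

    def represent(self, record):
        return {"sum": record, "n": 1.0}

    def age(self, micro_cluster):
        return {"sum": self.decay * micro_cluster["sum"],
                "n": self.decay * micro_cluster["n"]}

    def update(self, micro_cluster, record):
        aged = self.age(micro_cluster)
        return {"sum": aged["sum"] + record, "n": aged["n"] + 1.0}

    def handle_outlier(self, micro_clusters, record):
        # Assumption: an outlier simply starts a new micro-cluster.
        return micro_clusters + [self.represent(record)]


algo = DecayedMean1D(decay=0.5)
mc = algo.update(algo.represent(4.0), 2.0)
clusters = algo.handle_outlier([mc], 100.0)
```

Under a contract of this kind the developer writes only per-record logic, and the distribution of that logic across nodes is left to the computation layer, which matches the division of responsibility the text describes.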

Drawings

FIG. 1 is a system architecture diagram of the invention;

FIG. 2 is a diagram of the time-sequenced micro-batch incremental computation model of the invention;

FIG. 3 is a diagram of the parallelized time-sequenced micro-batch incremental computation model of the invention;

FIG. 4 shows the clustering quality of an example on the KDD-99 dataset;

FIG. 5 shows the clustering quality of an example on the CoverType dataset;

FIG. 6 shows the clustering quality of an example on the KDD-98 dataset;

FIG. 7 shows the throughput of an example;

FIG. 8 shows the scalability of an example.

Detailed Description

The invention is described in detail below with reference to the figures and specific examples.

FIG. 1 is a system architecture diagram of the invention; the system as a whole can be divided into four parts, namely an access layer, an application layer, a computation layer, and a storage layer. The distributed clustering method for large-scale stream data of the invention comprises the following:

(1) A time-sequenced micro-batch incremental computation model, which relaxes the computation dependence of a typical streaming clustering algorithm and solves the problems that the algorithm is hard to parallelize and cannot handle high-velocity, dynamically changing stream data;

(2) On the basis of the time-sequenced micro-batch incremental computation model, a parallelization method based on multi-dimensional division, which parallelizes each computation stage of the streaming clustering algorithm and thereby solves its parallelization problem.

Each of these is described in detail below.

1. Time-sequenced micro-batch incremental computation model

The computation mode of a conventional streaming clustering algorithm consists of two serial steps: (1) find the micro-cluster nearest to the current data item; (2) update the data item into that micro-cluster with decay, while performing global operations such as merging and deletion on the whole micro-cluster set. The next data item is then computed against the updated micro-cluster set; one such complete computation is called a feedback loop. The drawback of this mode is that the computation for each data item strictly depends on the result for the previous one, that is, the next data item can start only after the previous one has finished updating the model. This limits parallel implementation of the algorithm and cannot cope with high-velocity, dynamically changing stream data.

To solve these problems, the micro-batch incremental computation model proposed by the invention relaxes the strict processing order among data items. The model lets a batch of data be computed against the same micro-cluster set; once the batch has been computed, the model is updated in order according to the computation results. The micro-batch incremental computation model makes more parallelization strategies applicable and thereby improves algorithm performance.
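
The text does not say how batch boundaries are chosen. The sketch below assumes a fixed time interval per batch, in the manner of the batch interval of Spark Streaming, and groups a timestamp-ordered stream of records accordingly. The function name `micro_batches`, the record layout (timestamp, value), and the choice to skip empty intervals are assumptions for illustration.

```python
def micro_batches(stream, interval):
    """Group (timestamp, value) records into consecutive fixed-interval batches.

    Assumes the records arrive in non-decreasing timestamp order.
    """
    batch, end = [], None
    for t, x in stream:
        if end is None:
            end = t + interval  # the first batch starts at the first record
        while t >= end:
            if batch:
                yield batch  # close the current batch; empty ones are skipped
            batch, end = [], end + interval
        batch.append((t, x))
    if batch:
        yield batch  # flush the final, possibly partial, batch


stream = [(0, "a"), (1, "b"), (2, "c"), (5, "d")]
batches = list(micro_batches(stream, interval=2))
```

Every record in one yielded batch would then be computed against the same micro-cluster set, and the set would be updated only after the whole batch has been computed.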

Meanwhile, to improve the computation performance of the update process, the computation model is decoupled into three stages: the computation stage for finding the nearest micro-cluster, then the local update stage of the micro-cluster set, then the global update stage of the micro-cluster set. In the computation stage for finding the nearest micro-cluster, the distance between each data item and all micro-clusters is computed, and the nearest micro-cluster is selected to check whether the data item can join it. Whether the data item can join is determined from the distance and the boundary condition of the micro-cluster; if it cannot join, the data item becomes an outlier. In the local update stage of the micro-cluster set, data are merged into their nearest micro-cluster, and the temporal and spatial representation of the micro-cluster is updated accordingly. In the global update stage of the micro-cluster set, operations such as merging and deletion are performed on some of the micro-clusters, following the computation flow of the particular algorithm, so that the set reflects the latest trend of the data stream.

In addition, data in a streaming data scene has a time sequence characteristic, although the micro batch-based incremental computation model relaxes the data processing sequence, the time sequence is not strictly followed in the process of searching the nearest micro cluster by the data; however, in order to ensure the clustering quality, the invention keeps the time sequence of data processing in two updating stages:

(1) In the local update stage of the micro-cluster set, each micro-cluster receives a number of records, namely those for which it is the nearest micro-cluster. The local model could merge these records into the micro-cluster all at once in an incremental update; however, to preserve the time order of data processing and to ensure that newer data has a greater influence on the model, the invention adopts a time-ordered incremental update.

(2) In the global update stage of the micro-cluster set, operations such as deletion and merging are irreversible: once they are performed, the affected micro-clusters cannot be restored to their original state, so preserving the time order of the data is essential. Taking the data stream characteristics and the latency requirements into account, the invention adopts a global model update scheme that guarantees the time order and can handle different scenarios.

In the example, the four records of a batch are processed together, stage by stage: first the nearest micro-cluster of each record is found, then the records of each micro-cluster are combined to obtain incremental information, and finally the whole model is updated. Apart from the first stage, the update processes of the second and third stages strictly preserve the processing order of the algorithm; meanwhile, the global model update in the third stage can weaken the impact on clustering quality that relaxing the data processing order in the first stage may cause.

2. Parallelization method based on multi-dimensional division

The parallelization method based on multi-dimensional division refers to the combined use of common parallelization methods such as data parallelism and model parallelism. Considering network transmission overhead, computation latency, and degree of parallelism, the invention builds on the time-ordered incremental update model and designs a suitable parallelization scheme for each computation stage. FIG. 3 is a schematic diagram of the time-ordered micro-batch incremental computation model after the multi-dimensional division parallelization method is applied.

(1) Computation phase of finding the nearest micro-cluster

This stage computes the distance between each record and every micro-cluster and selects the nearest micro-cluster as the candidate to join. Whether the record can join is decided from the distance and the boundary condition of the micro-cluster; if it cannot join, the record becomes an outlier. Because the computation between a record and a micro-cluster is independent of other micro-clusters and of other records, either model parallelism or data parallelism can be considered. For this stage, the invention weighs network transmission volume, computation latency, and degree of parallelism to select the most suitable parallelization scheme.

In terms of network transmission volume, the amount of data is far larger than both the number of computing nodes and the size of the model. Model parallelism must transmit all data to every node, whereas data parallelism transmits each record only once; in addition, model parallelism must aggregate intermediate results, so data parallelism has the smallest network transmission volume. In terms of computation latency, the two strategies take the same time to compute distances, but model parallelism must aggregate the results of every computing node, which adds extra latency. In terms of degree of parallelism, data parallelism depends on the data volume, while model parallelism depends on the number of micro-clusters. The number of micro-clusters varies only within a small range, whereas the micro-batch computation model can change the size of the computation window to adjust the amount of data processed per batch, so data parallelism offers the highest degree of parallelism.

After comparing the two parallel modes on these criteria, the invention selects the more effective data-parallel mode to parallelize this stage. Data parallelism has the further advantage that data can be processed in time order, which makes it easier to maintain the time order in the subsequent update stages.
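A minimal sketch of the data-parallel choice for this stage: the batch is cut into contiguous partitions, every partition is matched against the full (read-only) set of micro-cluster centers, and the partition results are concatenated in their original order. The thread pool stands in for cluster workers, and the names `assign_partition`, `parallel_assign`, and the fixed `radius` boundary are assumptions for illustration only.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Tuple

Record = Tuple[float, List[float]]            # (timestamp, feature vector)
Assignment = Tuple[float, List[float], Optional[int]]


def assign_partition(centers: List[List[float]], radius: float,
                     partition: List[Record]) -> List[Assignment]:
    """Find the nearest center for each record; None marks an outlier.

    Assumes `centers` is non-empty.
    """
    out = []
    for ts, x in partition:
        dists = [math.dist(c, x) for c in centers]
        i = min(range(len(centers)), key=dists.__getitem__)
        out.append((ts, x, i if dists[i] <= radius else None))
    return out


def parallel_assign(centers: List[List[float]], radius: float,
                    batch: List[Record], parallelism: int) -> List[Assignment]:
    """Split the batch into contiguous partitions and process them concurrently."""
    size = max(1, math.ceil(len(batch) / parallelism))
    parts = [batch[k:k + size] for k in range(0, len(batch), size)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() returns results in submission order, so the batch order survives.
        results = list(pool.map(lambda p: assign_partition(centers, radius, p), parts))
    return [row for part in results for row in part]
```

Because each record travels to exactly one worker and only the small set of centers is shared, this layout matches the transmission-volume argument made above.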

(2) Local update phase of the micro-cluster set

After the nearest micro-clusters have been found, the records must be merged into the micro-clusters nearest to them. Specifically, the records x_i assigned to a micro-cluster are combined into Δx in a decaying manner, and Δx is then applied to the micro-cluster. Because incremental updates are independent across records and across micro-clusters, either data parallelism or model parallelism may be employed. For this stage, the scheme considers the time order of data updates as well as network transmission and computation latency.
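The decayed combination of records into Δx can be illustrated as below. The source does not state the decay function, so the exponential form 2^(-λ·Δt), the (weight, linear sum) representation of a micro-cluster, and the names `decayed_delta` and `apply_delta` are assumptions for this sketch.

```python
from typing import Dict, List, Tuple


def decayed_delta(records: List[Tuple[float, List[float]]], lam: float):
    """Combine time-stamped records into (weight, weighted linear sum, latest time).

    Older records are down-weighted by 2 ** (-lam * age), so newer data
    contributes more to the increment. Assumes `records` is non-empty.
    """
    records = sorted(records, key=lambda r: r[0])
    t_latest = records[-1][0]
    weight = 0.0
    linear_sum = [0.0] * len(records[0][1])
    for ts, x in records:
        w = 2.0 ** (-lam * (t_latest - ts))
        weight += w
        linear_sum = [s + w * v for s, v in zip(linear_sum, x)]
    return weight, linear_sum, t_latest


def apply_delta(mc: Dict, delta, lam: float) -> Dict:
    """Fade the existing micro-cluster to the time of the delta, then add the delta.

    Assumes the delta is not older than the micro-cluster's last update.
    """
    weight, linear_sum, t_latest = delta
    fade = 2.0 ** (-lam * (t_latest - mc["t"]))
    return {
        "n": mc["n"] * fade + weight,
        "ls": [a * fade + b for a, b in zip(mc["ls"], linear_sum)],
        "t": t_latest,
    }
```

With this form, applying one combined Δx gives the same result as applying the records one at a time in timestamp order, which is why the combination step can be computed before touching the micro-cluster.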

In terms of time order, model parallelism gathers all records related to a given micro-cluster and then sorts and applies them, so the update order is guaranteed; data parallelism not only breaks the time order among records but also requires a suitable method for merging the partial results. In terms of computation latency, both modes must sort the data and update the model sequentially, but data parallelism needs additional operations to aggregate intermediate results, so model parallelism has the lowest computation latency. In terms of network transmission volume, model parallelism only needs to transmit the data, while data parallelism must additionally transmit intermediate results, so model parallelism has the smallest network transmission volume.

After comparing the two parallel modes on these criteria, the invention selects the more effective model-parallel mode to parallelize this stage.
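The model-parallel layout can be sketched as a shuffle followed by per-owner ordered updates. Routing by `mc_id % num_workers`, the running-mean update, and the function names are assumptions for illustration; the point shown is that each micro-cluster is owned by one worker, which sorts that micro-cluster's records by timestamp before applying them.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Assignment = Tuple[float, List[float], Optional[int]]  # (timestamp, vector, mc_id)


def shuffle_by_cluster(assignments: List[Assignment], num_workers: int):
    """Route each record to the worker that owns its nearest micro-cluster."""
    workers = [defaultdict(list) for _ in range(num_workers)]
    for ts, x, mc_id in assignments:
        if mc_id is None:
            continue  # outliers are left to the global update stage
        workers[mc_id % num_workers][mc_id].append((ts, x))
    return workers


def local_update(worker: Dict[int, list], centers: Dict[int, List[float]],
                 counts: Dict[int, int]) -> Dict[int, Tuple[List[float], int]]:
    """Update the micro-clusters owned by one worker, records in timestamp order."""
    updated = {}
    for mc_id, records in worker.items():
        c, n = list(centers[mc_id]), counts[mc_id]
        for ts, x in sorted(records, key=lambda r: r[0]):
            n += 1
            c = [ci + (xi - ci) / n for ci, xi in zip(c, x)]
        updated[mc_id] = (c, n)
    return updated
```

Only the records are moved across workers here; no partial results need to be merged afterwards, which reflects the latency and transmission arguments above.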

(3) Global update phase of the micro-cluster set

In the global update stage of the micro-cluster set, the updated micro-clusters must be fused with the global micro-cluster set. Because this stage requires the least computation of the three stages and must maintain the update order, the invention performs it in a single-machine environment. The invention also takes the data stream characteristics and the computation latency into account and chooses the update scheme for this stage according to the data stream characteristics.
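A single-machine global update might look like the following sketch. The concrete policy (outliers become new micro-clusters in timestamp order; when the set exceeds `max_clusters`, the oldest stale micro-cluster is deleted, otherwise the two closest are merged) is an assumption loosely modeled on CluStream-style maintenance, not a scheme stated in the source; all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def merge_closest(mcs: Dict[int, dict]) -> None:
    """Merge the two micro-clusters with the closest centers (in place)."""
    ids = sorted(mcs)
    a, b = min(
        ((i, j) for p, i in enumerate(ids) for j in ids[p + 1:]),
        key=lambda ij: math.dist(mcs[ij[0]]["center"], mcs[ij[1]]["center"]),
    )
    ma, mb = mcs[a], mcs[b]
    n = ma["n"] + mb["n"]
    center = [(ma["n"] * u + mb["n"] * v) / n for u, v in zip(ma["center"], mb["center"])]
    mcs[a] = {"center": center, "n": n, "t": max(ma["t"], mb["t"])}
    del mcs[b]


def global_update(mc_set: Dict[int, dict], updated: Dict[int, dict],
                  outliers: List[Tuple[float, List[float]]],
                  max_clusters: int, now: float, horizon: float) -> Dict[int, dict]:
    merged = dict(mc_set)
    merged.update(updated)  # fuse the locally updated micro-clusters
    next_id = max(merged, default=-1) + 1
    # Irreversible operations are applied in timestamp order.
    for ts, x in sorted(outliers, key=lambda r: r[0]):
        merged[next_id] = {"center": list(x), "n": 1, "t": ts}
        next_id += 1
        if len(merged) > max_clusters:
            stale = [k for k, m in merged.items() if now - m["t"] > horizon]
            if stale:
                del merged[min(stale, key=lambda k: merged[k]["t"])]
            else:
                merge_closest(merged)
    return merged
```

Running this step sequentially keeps the deletions and merges in a well-defined order, which is the property the text above relies on.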

The distributed clustering method and system for large-scale stream data are evaluated as follows:

The invention uses three real data sets that are widely used to evaluate streaming clustering algorithms. As shown in Table 1, these data sets have different cluster distributions and data dimensions. Data flows into the system in the order of the raw data. The evaluation method and the experimental results for the streaming clustering algorithm are presented from three aspects: clustering quality, throughput rate, and scalability.

TABLE 1 Characteristics of the three data sets

Data set	Data volume	Number of dimensions	Number of clusters
KDD-99	494021	34	23
CoverType	581012	54	7
KDD-98	95412	315	5

(1) Clustering quality

The invention uses the CMM index (reference: H. Kremer, P. Kranen, T. Jansen, T. Seidl, A. Bifet, G. Holmes, and B. Pfahringer. An effective evaluation measure for clustering on evolving data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 868-.) to evaluate clustering quality. CMM decays the weight of outdated data while penalizing three common types of clustering error: missed data, misplaced data, and noise data. CMM measures the difference between the actual clustering result and the true clusters of the data to quantify these errors, and then normalizes the difference to a value between 0 and 1, with larger values representing higher clustering quality.

For each data set, data is sent at a rate of 1,000 records/second. For a fair comparison with the single-machine streaming clustering algorithm, the parallelism of the time-ordered distributed streaming clustering algorithm implemented on the basis of the invention is set to 1 and the batch interval is set to 10 seconds. At the end of each batch, an offline clustering operation is performed on the model and the result is evaluated with the CMM value. The experiment is repeated five times and the average result is reported.

Figs. 4, 5, and 6 show the clustering quality results of the CluStream algorithm on the three data sets. The clustering quality of the CluStream algorithm based on time-ordered micro-batch incremental update differs little from that of the original CluStream algorithm, with an average difference of 1.1%. The CluStream algorithm based on unordered update, however, shows lower clustering quality on all three data sets, at most 60% worse than the single-machine and the ordered parallel implementations. Therefore, streaming clustering algorithms built on the distributed computing method and system of the invention retain good clustering quality.

(2) Throughput rate

Because the original data sets are small, the throughput advantage of the time-ordered distributed streaming clustering algorithm implemented on the basis of the invention is hard to demonstrate on them. Larger data sets are therefore formed by reading each original data set multiple times, and are named large-KDD99, large-CoverType, and large-KDD98 after the originals. Different data flow rates are set according to the data dimensions: 100,000 records/second, 100,000 records/second, and 10,000 records/second for large-KDD99, large-CoverType, and large-KDD98 respectively, with the same batch interval of 10 seconds for all three. The experiment was repeated five times and the average results are reported.

Fig. 7 shows the throughput results of the CluStream algorithm on the three data sets. On a single computing node, the throughput rate of the time-ordered distributed CluStream algorithm implemented on the basis of the invention is 30% higher than that of the distributed CluStream algorithm based on unordered update.

(3) Scalability

As in the throughput experiment, the scalability experiment was performed on the three larger data sets, using the same data flow rates and batch intervals. In addition, the parallelism of the distributed implementation was varied over the values 1, 2, 4, 8, 16, and 32, and the throughput rate was recorded at each parallelism to observe how it changes as the parallelism changes. Because the three data sets have different dimensions, their absolute throughput rates also differ, so each throughput rate is converted into a growth rate relative to parallelism 1. The throughput growth rate at parallelism n is the throughput rate at parallelism n divided by the throughput rate at parallelism 1; this normalizes the throughput values and makes them easier to compare. As before, the experiment was repeated five times and the average result is reported.
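The normalization described above amounts to a single division per measurement, as in this small sketch. The function name `growth_rates` and the numbers in the usage note are placeholders invented for illustration; they are not measurements from the experiments.

```python
from typing import Dict


def growth_rates(throughput_by_parallelism: Dict[int, float]) -> Dict[int, float]:
    """Divide each throughput by the throughput measured at parallelism 1."""
    base = throughput_by_parallelism[1]
    return {p: t / base for p, t in sorted(throughput_by_parallelism.items())}
```

For example, with placeholder throughputs of 100.0, 180.0, and 300.0 records/second at parallelism 1, 2, and 4, the growth rates are 1.0, 1.8, and 3.0; a growth rate that stays below the parallelism itself is what a sub-linear trend looks like.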

Fig. 8 shows the scalability results of the CluStream algorithm on the three data sets. As the parallelism increases, the throughput growth rates on all three data sets follow a sub-linear growth trend, indicating good scalability.

The invention provides a time-ordered micro-batch incremental update mode, which solves the problem that streaming clustering algorithms are difficult to parallelize under existing computation models. By decoupling the computation into three stages that keep data updates in order, it solves the problem that existing parallelization schemes cannot guarantee in-order data processing. By designing a different parallelization scheme for each stage, it solves the problem that existing parallelization schemes cannot balance clustering quality against computation throughput. The experimental results show that, while preserving the clustering quality, the invention reduces the processing latency of the algorithm, increases its degree of parallelism, and can meet the low-latency, high-throughput computing requirements of large-scale, dynamically changing stream data.

The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.
