Abnormal data cleaning method based on k-means clustering under Spark platform

Document No.: 486068    Publication date: 2022-01-04

Reading note: This technique, "Abnormal data cleaning method based on k-means clustering under Spark platform", was designed and created by 王军, 王志明, 隋鹤铭 and 焦美晴 on 2021-09-07. Abstract: An abnormal data cleaning method based on k-means clustering under a Spark platform, relating to a method for cleaning abnormal data in industrial systems. The method proposes LRU-W, a replacement algorithm based on RDD weight values, which replaces the default cache replacement strategy in the Spark framework. An abnormal data cleaning method for the Spark platform is further provided, in which the K-means data clustering algorithm is optimized and improved to a certain extent by means of the Canopy algorithm and a weighted Euclidean distance. Cleaning abnormal data in an industrial big-data environment with this method yields good experimental performance in terms of the accuracy of data determination and speed-up.

1. An abnormal data cleaning method based on k-means clustering under a Spark platform is characterized by comprising the following steps:

the replacement process of the whole task is as follows: in the Storage module of Spark, the BlockManager class manages the whole Storage module by providing an interactive interface between the Storage module and other modules; the cache replacement strategy maintains an RDD weight list so that the RDD with the smallest weight can be located conveniently; in the Spark source code, the original block information is stored in a LinkedHashMap, and the usage of each RDD is recorded in the iteration order defined by the LinkedHashMap; whether the block corresponding to an RDD needs to be cached is determined from the number of times the RDD is used during execution of the task; if there is enough memory space, the block is cached directly and its corresponding information is recorded; if the remaining space is insufficient, the cache is replaced and the weight information is updated;

optimization and improvement of the K-means algorithm

Firstly, the center points of the Canopy algorithm are selected based on the idea of the min-max principle; in order to effectively avoid the local-optimum problem when center points are selected in this way, assume that the first x center points of Canopy are known; the (x+1)-th center point is then determined accurately, first ensuring that the point satisfies the following conditions;

$$d_{\min}(A_{x+1}) = \min_{1 \le n \le x} d(A_{x+1}, A_n) \qquad (1)$$

$$D_{\min}(x+1) = \max\big\{\, d_{\min}(A_{x+1}) \,\big\} \qquad (2)$$

under the above conditions, the minimum value $d(A_{x+1}, A_n)$ denotes the minimum distance between the (x+1)-th center point and the first x center points, and $D_{\min}(x+1)$ denotes the optimal distance, i.e. the (x+1)-th center is chosen so that this minimum distance is the largest among all candidate points; once the Canopy center-selection algorithm is determined, the next main task is to determine the number of clusters k and the region radius t; to handle this problem more efficiently, the method uses the concept of boundary identification to set a depth indicator that reflects the range of variation of D; for convenience, this is denoted depth(x), with the formula:

$$\mathrm{depth}(x) = \left| D_{\min}(x) - D_{\min}(x-1) \right| + \left| D_{\min}(x+1) - D_{\min}(x) \right| \qquad (3)$$

it can be seen from the formula that depth(x) varies with the value of x; the value of x that best reflects the clustering of the algorithm is the one at which the depth value depth(x) is maximum; a new definition is thus obtained: for a data set $C = \{x_i \mid i = 1, 2, \ldots, n\}$ and a point $x_m \in C$, if the following condition is satisfied, then $x_m$ is a candidate center of a Canopy set, where $d_{\min}(m)$ indicates that the shortest distance from the data point $x_m$ to the selected centers is the largest of all shortest distances:

$$d_{\min}(m) = \max\{\, d_{\min}(i) \mid i = 1, 2, \ldots, n \,\} \qquad (4)$$

Technical Field

The invention relates to a method for cleaning abnormal data of an industrial system, in particular to a method for cleaning abnormal data based on k-means clustering under a Spark platform.

Background

Industrial systems are becoming highly integrated with the internet and computer technology, and are shifting from manual control toward intelligent control. As a result, ever larger volumes of industrial data must be processed by computer systems: the sources of industrial big data keep growing richer, the data become more and more diverse, and the amount of data generated in the industrial production process keeps increasing. Industrial data are comparatively complex, and as they continue to evolve, the dimensionality of industrial big data will continue to grow. Mining valuable information from industrial data will therefore determine the development of industrial intelligence. Conventionally, industrial data are processed with local stand-alone storage, so the data volume is relatively small, the data processing techniques are ill-defined, the utilization rate of the data is low, and analysis results can deviate greatly because of the limited stand-alone data. At present, enterprise users can obtain large amounts of storage space through cloud storage technology; more and more users move local data to the cloud, achieving data sharing among multiple users and enabling cloud computing over the data, which increases the data volume while solving the problem of slow single-machine data processing.

For industrial data processing, the Spark ecosystem is clearly better suited to efficient data handling than the traditional Hadoop platform used as a data processing framework: during data operations, the Spark framework performs computation and storage in memory, which greatly improves execution efficiency for tasks such as data processing.

Disclosure of Invention

The invention aims to provide an abnormal data cleaning method based on k-means clustering under a Spark platform. It proposes LRU-W, a replacement algorithm based on RDD weight values, which replaces the default cache replacement strategy in the Spark framework. It further provides an abnormal data cleaning method for the Spark platform in which the K-means data clustering algorithm is optimized and improved to a certain extent by means of the Canopy algorithm and a weighted Euclidean distance. Cleaning abnormal data in an industrial big-data environment with this method yields good experimental performance in terms of the accuracy of data determination and speed-up.
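As a minimal illustration of the weighted Euclidean distance used to improve the K-means clustering, the following Scala sketch computes the distance between two feature vectors; the weight values and the function name weightedEuclidean are assumptions for illustration, since the text does not fix how the attribute weights are chosen.

```scala
// Weighted Euclidean distance between two feature vectors a and b.
// The weight vector w is an illustrative assumption: the invention mentions a
// weighted Euclidean distance but this text does not specify the weights.
def weightedEuclidean(a: Array[Double], b: Array[Double], w: Array[Double]): Double = {
  require(a.length == b.length && a.length == w.length, "dimension mismatch")
  var sum = 0.0
  var i = 0
  while (i < a.length) {
    val diff = a(i) - b(i)
    sum += w(i) * diff * diff   // each attribute contributes according to its weight
    i += 1
  }
  math.sqrt(sum)
}

// Example: distance between two 3-dimensional samples with hypothetical weights.
val d = weightedEuclidean(Array(1.0, 2.0, 3.0), Array(2.0, 0.5, 3.5), Array(0.5, 0.3, 0.2))
```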

The purpose of the invention is realized by the following technical scheme:

an abnormal data cleaning method based on k-means clustering under a Spark platform comprises the following steps:

the replacement process of the whole task is as follows: in the Storage module of Spark, the BlockManager class manages the whole Storage module by providing an interactive interface between the Storage module and other modules; the cache replacement strategy maintains an RDD weight list so that the RDD with the smallest weight can be located conveniently; in the Spark source code, the original block information is stored in a LinkedHashMap, and the usage of each RDD is recorded in the iteration order defined by the LinkedHashMap; whether the block corresponding to an RDD needs to be cached is determined from the number of times the RDD is used during execution of the task; if there is enough memory space, the block is cached directly and its corresponding information is recorded; if the remaining space is insufficient, the cache is replaced and the weight information is updated;

optimization and improvement of the K-means algorithm

Firstly, the center points of the Canopy algorithm are selected based on the idea of the min-max principle; in order to effectively avoid the local-optimum problem when center points are selected in this way, assume that the first x center points of Canopy are known; the (x+1)-th center point is then determined accurately, first ensuring that the point satisfies the following conditions;

$$d_{\min}(A_{x+1}) = \min_{1 \le n \le x} d(A_{x+1}, A_n) \qquad (1)$$

$$D_{\min}(x+1) = \max\big\{\, d_{\min}(A_{x+1}) \,\big\} \qquad (2)$$

under the above conditions, the minimum value $d(A_{x+1}, A_n)$ denotes the minimum distance between the (x+1)-th center point and the first x center points, and $D_{\min}(x+1)$ denotes the optimal distance, i.e. the (x+1)-th center is chosen so that this minimum distance is the largest among all candidate points; once the Canopy center-selection algorithm is determined, the next main task is to determine the number of clusters k and the region radius t; to handle this problem more efficiently, the method uses the concept of boundary identification to set a depth indicator that reflects the range of variation of D; for convenience, this is denoted depth(x), with the formula:

$$\mathrm{depth}(x) = \left| D_{\min}(x) - D_{\min}(x-1) \right| + \left| D_{\min}(x+1) - D_{\min}(x) \right| \qquad (3)$$

It can be seen from the formula that depth(x) varies with the value of x; the value of x that best reflects the clustering of the algorithm is the one at which the depth value depth(x) is maximum. A new definition is thus obtained: for a data set $C = \{x_i \mid i = 1, 2, \ldots, n\}$ and a point $x_m \in C$, if the following condition is satisfied, then $x_m$ is a candidate center of a Canopy set, where $d_{\min}(m)$ indicates that the shortest distance from the data point $x_m$ to the selected centers is the largest of all shortest distances.

$$d_{\min}(m) = \max\{\, d_{\min}(i) \mid i = 1, 2, \ldots, n \,\} \qquad (4)$$

Drawings

FIG. 1 is a flow chart of the LRU-W cache replacement algorithm of the present invention.

Detailed Description

The present invention will be described in detail with reference to examples.

LRU-W cache replacement flow: in the Spark platform, RDDs implement the user logic and the storage module manages the user data. The storage module mainly manages the various data generated by a task during computation, which can be cached in memory or on disk; it is an important component of the Spark processing framework. The intermediate data fall into several types: RDD caches, shuffle data, data stored on disk, and broadcast variables. Importantly, RDD cache management is performed through the storage module. Spark's storage module is organized in a master/slave fashion: the BlockManager instance running in the driver acts as the master, and the BlockManager instances running in the Executors act as slaves. When the driver saves the block data of each RDD in a Spark job, it sends commands to each Executor. The BlockManager of each Executor is responsible for managing the block data in the memory and on the disk of its node; it receives commands from the driver, updates the local data in time, and returns the update result.
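For context, the following minimal Spark snippet shows the user-level call that causes the storage module described above to cache an RDD's blocks through each Executor's BlockManager; the SparkContext sc, the HDFS path and the MEMORY_ONLY storage level are illustrative assumptions, not part of the invention.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Assume `sc` is an existing SparkContext and the HDFS path is illustrative.
// Persisting an RDD asks each executor's BlockManager to keep the RDD's blocks
// in memory -- exactly the data that a cache replacement policy later keeps or evicts.
def cacheSensorData(sc: SparkContext): Unit = {
  val sensorData = sc.textFile("hdfs:///industrial/sensor_logs")
    .map(_.split(",").map(_.toDouble))
    .persist(StorageLevel.MEMORY_ONLY)

  sensorData.count()   // first action materializes and caches the blocks
  sensorData.take(5)   // later actions reuse the cached blocks instead of recomputing
}
```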

The main idea of the replacement strategy is as follows. During task execution, RDDs that are used multiple times are cached following the order in which the RDDs are executed. Before caching, the weight of each RDD must therefore be computed first; a data structure of type Map<RDDid, weight> is defined to store the RDD weight values, and each node then caches the corresponding data. When caching, the Map data set is traversed: if the current data is found to be already cached, its use count is increased by 1, the weight value of the RDD is recomputed, and the data in the Map is updated. If the cache space is insufficient, the cache replacement algorithm is invoked: the weight value of the RDD is compared with the data in the Map, the RDD with the smaller weight value is replaced in the cache, and the data in the Map is updated. For garbage collection, the CMS collector (a concurrent low-pause collector) is selected with the objective of obtaining the shortest collection pause time; memory clearing runs concurrently with task computation, so cache replacement does not affect task execution time.
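A minimal sketch of this bookkeeping follows, assuming a weight of the form use count × block size (the text states that the weight is recomputed on every reuse but does not give the exact formula); the names LruWBookkeeping, recordUse and evictionCandidate are hypothetical.

```scala
import scala.collection.mutable

// Hypothetical bookkeeping for the LRU-W idea: a Map from RDD id to weight.
// The weight formula (use count x block size) is only an assumption.
object LruWBookkeeping {
  private val useCounts  = mutable.Map.empty[Int, Long]     // RDD id -> times used
  private val rddWeights = mutable.Map.empty[Int, Double]   // RDD id -> current weight

  def recordUse(rddId: Int, sizeInBytes: Long): Unit = {
    val uses = useCounts.getOrElse(rddId, 0L) + 1
    useCounts(rddId) = uses
    rddWeights(rddId) = uses.toDouble * sizeInBytes          // recompute the weight
  }

  // Eviction candidate: the cached RDD with the smallest weight.
  def evictionCandidate(): Option[Int] =
    if (rddWeights.isEmpty) None else Some(rddWeights.minBy(_._2)._1)
}
```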

The whole task replacement process is as follows:

in the Storage module of Spark, the BlockManager class manages the entire Storage module by providing an interactive interface between the Storage module and other modules. The cache replacement policy maintains a list of RDD weights so that the RDD with the smallest weight can be located conveniently. In the Spark source code, the original block information is stored in a LinkedHashMap, and the usage of each RDD is recorded in the iteration order defined by the LinkedHashMap. During task execution, whether the block corresponding to an RDD needs to be cached is determined from the number of times the RDD is used. If there is enough memory space, the block is cached directly and its corresponding information is recorded. If the remaining space is insufficient, the cache is replaced and the weight information is updated.
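The cache-or-replace decision can be sketched as follows; WeightedCache, BlockMeta and freeMemory are hypothetical stand-ins rather than Spark's actual classes, although Spark's memory store does keep its block entries in a LinkedHashMap as described above.

```scala
import java.util.LinkedHashMap
import scala.jdk.CollectionConverters._

// Sketch of the cache-or-replace decision. `BlockMeta`, `WeightedCache` and
// `freeMemory` are hypothetical; this mirrors, in simplified form, block
// entries kept in a LinkedHashMap and evicted by smallest weight.
final case class BlockMeta(rddId: Int, size: Long, weight: Double)

class WeightedCache(private var freeMemory: Long) {
  private val blockInfo = new LinkedHashMap[String, BlockMeta]()

  def cacheBlock(blockId: String, meta: BlockMeta): Boolean = {
    // Evict the lowest-weight blocks until the new block fits.
    while (freeMemory < meta.size && !blockInfo.isEmpty) {
      val victimId = blockInfo.asScala.minBy(_._2.weight)._1
      val dropped  = blockInfo.remove(victimId)
      freeMemory += dropped.size
    }
    if (freeMemory >= meta.size) {   // enough space: cache and record the block
      blockInfo.put(blockId, meta)
      freeMemory -= meta.size
      true
    } else false                     // block is larger than the whole cache
  }
}
```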

Optimization and improvement of the K-means algorithm

First, the center points of the Canopy algorithm are selected based on the idea of the "min-max principle". To effectively avoid the local-optimum problem when selecting center points with this method, assume that the first x center points of Canopy are known; the (x+1)-th center point is then determined accurately. It must first be ensured that this point satisfies the following conditions.

$$d_{\min}(A_{x+1}) = \min_{1 \le n \le x} d(A_{x+1}, A_n) \qquad (1)$$

$$D_{\min}(x+1) = \max\big\{\, d_{\min}(A_{x+1}) \,\big\} \qquad (2)$$

Under the above conditions, the minimum value $d(A_{x+1}, A_n)$ denotes the minimum distance between the (x+1)-th center point and the first x center points, and $D_{\min}(x+1)$ denotes the optimal distance, i.e. the (x+1)-th center is chosen so that this minimum distance is the largest among all candidate points. Once the Canopy center-selection algorithm is determined, the next main task is to determine the number of clusters k and the region radius t [39]. To handle this problem more efficiently, the method uses the concept of boundary identification to set a depth indicator that reflects the range of variation of D. For convenience, this is denoted depth(x), with the formula:

$$\mathrm{depth}(x) = \left| D_{\min}(x) - D_{\min}(x-1) \right| + \left| D_{\min}(x+1) - D_{\min}(x) \right| \qquad (3)$$

It can be seen from the formula that depth(x) varies with the value of x; the value of x that best reflects the clustering of the algorithm is the one at which the depth value depth(x) is maximum. A new definition is thus obtained: for a data set $C = \{x_i \mid i = 1, 2, \ldots, n\}$ and a point $x_m \in C$, if the following condition is satisfied, then $x_m$ is a candidate center of a Canopy set, where $d_{\min}(m)$ indicates that the shortest distance from the data point $x_m$ to the selected centers is the largest of all shortest distances.

$$d_{\min}(m) = \max\{\, d_{\min}(i) \mid i = 1, 2, \ldots, n \,\} \qquad (4)$$
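A compact sketch of the max-min center selection and the depth index follows; the Euclidean distance and the depth formula are taken from the reconstruction above and should be read as assumptions rather than the invention's authoritative definitions.

```scala
// Sketch of the max-min Canopy center selection and the depth index.
// The plain Euclidean distance and the depth formula are assumptions based on
// the reconstructed formulas (1)-(4) above.
object CanopyCenters {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // d_min: shortest distance from a point to any already-chosen center.
  def dMin(p: Point, centers: Seq[Point]): Double =
    centers.map(c => dist(p, c)).min

  // Next center: the point whose shortest distance to the chosen centers is
  // the largest of all shortest distances (max-min principle, formulas (1), (2), (4)).
  def nextCenter(data: Seq[Point], centers: Seq[Point]): Point =
    data.maxBy(p => dMin(p, centers))

  // depth(x) = |D_min(x) - D_min(x-1)| + |D_min(x+1) - D_min(x)|   (formula (3), as reconstructed)
  def depth(dMinValues: Vector[Double], x: Int): Double =
    math.abs(dMinValues(x) - dMinValues(x - 1)) + math.abs(dMinValues(x + 1) - dMinValues(x))
}
```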
