Large-scale parallelization aerobic capacity grouping method

Document No.: 1910735. Published: 2021-12-03.

Reading note: This technology, a large-scale parallelization aerobic capacity grouping method, was designed by Yang Lianghuai (杨良怀) and Kuang Dongwei (匡东伟) on 2021-08-11. The method comprises: 1) loading an aerobic capacity test sequence dataset; 2) dividing the data into buckets; 3) processing each data bucket in parallel, which specifically comprises 3.1) preprocessing the sequences in the bucket and 3.2) re-representing the preprocessed sequences; 4) normalizing the re-representation sequences obtained by the parallel processing to obtain a clustering sample set; 5) clustering the clustering sample set. The invention implements an algorithm suited to clustering large-scale, multivariate, unequal-length aerobic capacity test sequences, and can group large-scale aerobic capacity data quickly and effectively.

1. A large-scale parallelization aerobic capacity clustering method is characterized by comprising the following steps:

(1) loading the aerobic capacity test sequence set D = [AS_1, AS_2, ..., AS_n] into memory;

(2) dividing the aerobic capacity test sequence set into data buckets;

(3) processing the data in each bucket B_i in parallel to generate a clustering subsample set SF_i;

(4) normalizing the samples in the clustering sample set F;

(5) clustering the clustering sample set F with the partition clustering algorithm KMeans++ to obtain a clustering result C_res;

(6) the K clusters in the result set C_res correspond to K populations with different aerobic capacity.

2. The large-scale parallelization aerobic capacity clustering method according to claim 1, wherein AS_i (i = 1, ..., n) in step (1) is a two-dimensional aerobic capacity test sequence containing heart rate and velocity, denoted AS_i = <ss_i, hs_i>, where ss_i is the sequence of velocity values and hs_i is the sequence of heart rate values.

3. The large-scale parallelization aerobic capacity clustering method according to claim 1, wherein step (2) uniformly divides the data set D into p buckets B_1, B_2, ..., B_p, and the elements of each bucket are aerobic capacity test sequences.

4. The large-scale parallelization aerobic capacity clustering method according to claim 1, wherein step (3) further comprises:

(3.1) preprocessing the aerobic capacity test sequences: traversing each element AS_k = <ss_k, hs_k> in B_i and performing null-value processing on it, where null values in both the velocity value sequence and the heart rate value sequence are represented by null, and the preprocessed element is denoted AS_k';

(3.2) traversing each preprocessed aerobic capacity test sequence element AS_k' in B_i, applying re-representation processing to it to obtain its re-representation sequence, denoted f_k, and adding f_k to SF_i; the total clustering sample set is denoted F.

5. The large-scale parallelization aerobic capacity clustering method according to claim 4, wherein the preprocessing in step (3.1) comprises:

(3.1.1) traversing the velocity value sequence ss_k and, if the current value is null, handling it according to one of three cases: 1. if the current value is neither the first nor the last point of the sequence, replacing the null value with the average of the values at the previous and next time points; 2. if the current null value is the first point of the sequence, replacing it with the average of the two consecutive points after it; 3. if the current null value is the last point of the sequence, replacing it with the average of the two consecutive points before it;

(3.1.2) processing the null values of the heart rate sequence in the same way as in step (3.1.1).

6. The large-scale parallelization aerobic capacity clustering method according to claim 4, wherein the re-representation processing in step (3.2) comprises:

(3.2.1) calculating the following global feature values of the sequence: the length of the velocity value sequence, the average velocity, and the average, maximum, and skewness of the heart rate value sequence; then inserting them into f_k in this order;

(3.2.2) with T as the segmentation window threshold, splitting AS_k' into sequence segments of length T;

(3.2.3) for each sequence segment obtained in the previous step, calculating the following local feature values in order: the average of the velocity values, the average of the heart rate values, the heart rate value at the start of the segment, and the heart rate value at the end of the segment; then inserting these values into f_k in this order.

7. The large-scale parallelization aerobic capacity clustering method according to claim 6, wherein T in step (3.2.2) is an integer from 1 to 100.

Technical Field

The invention relates to an efficient, parallelizable aerobic capacity grouping method suitable for large-scale student populations.

Background

As wearable devices mature and become widely used, collecting human physiological health data at scale through them has become effective and feasible. Exercise heart-rate wristbands make it possible to collect exercise physiological data from many people, and the exercise physiological data of a large population holds considerable value. Analyzing and mining such large-scale data to assess an individual's health condition, and then customizing a suitable exercise plan, is an important means of preventing sudden incidents during exercise and improving individual physical health. Exercise physiological data is essentially time series data, and the common mining tasks for time series are anomaly detection, prediction, and clustering. Anomaly detection is generally used for alarms, for example raising an alert on abnormal data points mined from exercise data; prediction forecasts the trend of future data from historical data; clustering groups highly similar individuals into one cluster in an unsupervised manner and places dissimilar individuals in different clusters, so that differences between individuals can be distinguished. Cluster analysis of the exercise physiological data of a large population can distinguish the aerobic capacity (or cardiopulmonary endurance) of that population to a certain extent: people with similar aerobic capacity are placed in the same group, and different exercise plans can then be formulated for groups with different aerobic capacity, which effectively promotes individual physical health and helps prevent emergencies. Clustering large-scale exercise physiological data is therefore of great significance.

Time series can be divided into univariate and multivariate time series according to the number of variables, and into equal-length and unequal-length time series according to whether their lengths are equal. Clustering methods for time series fall roughly into two categories: clustering based on raw measurement data and clustering based on features. Clustering based on raw measurement data defines similarity directly on the raw data, for example with the Manhattan distance, Euclidean distance, or DTW distance, and then selects a clustering algorithm. Feature-based clustering first reduces the dimensionality of the raw data, extracts features that represent its internal change mechanism as the basis of the similarity measure, and then clusters those features with one of various clustering methods; feature-based clustering is the more mainstream approach. Many scholars have proposed effective methods for clustering univariate time series, but research on clustering multivariate, unequal-length time series remains scarce. Aiming at large-scale student exercise physiological data, which is both multivariate and of unequal length, the invention provides an efficient clustering algorithm suitable for large-scale multivariate, unequal-length time series. The difficulties in clustering such series lie in how to represent them, how to measure similarity between sequences, and how to parallelize the algorithm.

The invention is intended to solve the above problems in clustering large-scale exercise physiological data (data collected while students run an aerobic endurance test, hereinafter referred to as aerobic test sequences). When the data volume is large, clustering aerobic test sequences takes a long time, so the response time of a user's grouping request becomes too long and user experience suffers badly. The invention therefore provides an effective aerobic test sequence re-representation method that reduces the dimensionality of the original aerobic test sequences and re-represents them, clusters the re-represented sequences with the KMeans++ algorithm to obtain the grouping results, and uses parallel computation to accelerate the grouping of large-scale aerobic test sequences, thereby achieving fast and efficient clustering of large-scale aerobic test sequences.

Disclosure of Invention

The invention provides an efficient, parallelizable large-scale aerobic capacity grouping method that overcomes the above shortcomings of the prior art and addresses how to cluster large-scale, multivariate, unequal-length aerobic test sequences quickly and efficiently.

The aerobic test sequences to be clustered are numerical time series that are multivariate (two variables, heart rate and velocity) and of unequal length, and the data volume is large. The invention therefore needs a suitable time series representation method to represent the original aerobic test sequences effectively, followed by an efficient clustering algorithm and similarity measure to cluster the re-represented sequences. To speed up clustering on large data volumes, it also uses parallel computation to exploit multi-core CPUs or distributed computing, thereby achieving fast and efficient large-scale aerobic capacity grouping.

Based on the above problems and data characteristics, the invention first proposes an effective aerobic test sequence re-representation method that represents a multivariate original aerobic test sequence as a single feature sequence (hereinafter the re-representation sequence). Then, according to the characteristics of the re-representation sequences, a sound and efficient clustering algorithm is used to cluster and group them. Finally, the generation and the clustering of the re-representation sequences are accelerated based on the ideas of partitioning and parallelization. Based on these techniques, the large-scale aerobic capacity grouping method provided by the invention comprises the following steps:

(1) Load the aerobic capacity test sequence set D = [AS_1, AS_2, ..., AS_n] into memory. AS_i (i = 1, ..., n) is a two-dimensional aerobic capacity test sequence (as shown in Fig. 1) containing heart rate and velocity, denoted AS_i = <ss_i, hs_i>, where ss_i is the sequence of velocity values and hs_i is the sequence of heart rate values. The aerobic capacity test sequences are collected from students wearing professional sports wristbands during a 22-minute aerobic endurance running test.

(2) Divide the aerobic capacity test sequence set into data buckets: uniformly divide the data set D into p buckets B_1, B_2, ..., B_p, where the elements of each bucket are aerobic capacity test sequences.

(3) Process the data in each bucket B_i in parallel to generate a clustering subsample set SF_i:

(3.1) Preprocess the aerobic capacity test sequences. Traverse each element AS_k = <ss_k, hs_k> in B_i and perform null-value processing on it (null values in both the velocity value sequence and the heart rate value sequence are represented by null); the preprocessed element is denoted AS_k'. The preprocessing steps are as follows:

(3.1.1) Traverse the velocity value sequence ss_k. If the current value is null, handle it according to one of three cases: 1. if the current value is neither the first nor the last point of the sequence, replace the null value with the average of the values at the previous and next time points; 2. if the current null value is the first point of the sequence, replace it with the average of the two consecutive points after it; 3. if the current null value is the last point of the sequence, replace it with the average of the two consecutive points before it;

(3.1.2) Process the null values of the heart rate sequence in the same way as in step (3.1.1).

(3.2.1) Calculate the following global feature values of the sequence: the length of the velocity value sequence (denoted SS_LEN), the average velocity (SS_AVG), and the average (HS_AVG), maximum (HS_MAX), and skewness (HS_SKE) of the heart rate value sequence. Insert them into f_k in this order.

(3.2.2) With T as the segmentation window threshold (experiments show that T = 5 gives good results), split AS_k' into sequence segments of length T.

(3.2.3) For each sequence segment obtained in the previous step, calculate the following local feature values in order: the average of the velocity values (ss_avg), the average of the heart rate values (hs_avg), the heart rate value at the start of the segment (hs_start), and the heart rate value at the end of the segment (hs_end). Insert these values into f_k in this order.

(4) Normalize the samples in the clustering sample set F.

(5) Cluster the clustering sample set F with the partition clustering algorithm KMeans++ to obtain a clustering result C_res.

(6) The K clusters in the result set C_res correspond to K populations with different aerobic capacity.
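
As a minimal sketch, the three null-handling cases of step (3.1.1) could be expressed in Python as follows. The function name fill_nulls, the use of None for null, and the handling of neighbouring points that are themselves null (they are skipped, and neighbours are read from the unmodified input) are illustrative assumptions; the disclosure does not specify these details.

```python
from typing import List, Optional


def fill_nulls(values: List[Optional[float]]) -> List[float]:
    """Replace None entries following the three cases of step (3.1.1)."""
    n = len(values)
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        if i == 0:
            # Case 2: first point, use the two consecutive points after it.
            neighbours = values[1:3]
        elif i == n - 1:
            # Case 3: last point, use the two consecutive points before it.
            neighbours = values[max(0, n - 3):n - 1]
        else:
            # Case 1: interior point, use the previous and next points.
            neighbours = [values[i - 1], values[i + 1]]
        # Assumption (not in the source): null neighbours are ignored.
        known = [x for x in neighbours if x is not None]
        if not known:
            raise ValueError(f"no usable neighbour for index {i}")
        filled[i] = sum(known) / len(known)
    return filled


speeds = [None, 2.0, 4.0, None, 8.0, 10.0, None]
print(fill_nulls(speeds))  # [3.0, 2.0, 4.0, 6.0, 8.0, 10.0, 9.0]
```

Per step (3.1.2), the same function would be applied to the heart rate sequence hs_k.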

The above is the whole content of the invention. Unlike time series clustering methods based on raw measurement data, the invention first provides a sound aerobic test sequence re-representation method that re-represents the multivariate, unequal-length original aerobic test sequences as a sequence of feature values (the re-representation sequence). The re-representation method converts an aerobic test sequence from a multivariate time series into a single sequence. On the one hand, it fuses the global features of the aerobic test sequence with the features of its local segments, so it considers global similarity while also capturing local similarity. On the other hand, it incorporates the correspondence between exercise intensity and heart rate variation into the re-representation sequence, so the original aerobic test sequence is represented more effectively. The re-representation sequence is much shorter than the original sequence, which greatly reduces the amount of computation in the clustering stage. To cluster the re-representation sequences, the method uses the KMeans++ clustering algorithm rather than the traditional KMeans algorithm, which improves the accuracy of the clustering results.
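
The re-representation of steps (3.2.1) to (3.2.3) can be illustrated with the following sketch. It is not the inventors' implementation: the function names, the population (Fisher-Pearson) definition of skewness, the requirement that ss and hs have equal length, and keeping a shorter final segment when the length is not a multiple of T are all assumptions made for illustration.

```python
from statistics import fmean
from typing import List


def skewness(xs: List[float]) -> float:
    """Population skewness m3 / m2**1.5 (assumed definition); 0.0 if constant."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def re_represent(ss: List[float], hs: List[float], T: int = 5) -> List[float]:
    """Build the feature sequence f_k from a preprocessed sequence <ss, hs>."""
    if len(ss) != len(hs) or not ss:
        raise ValueError("ss and hs must be non-empty and of equal length")
    # Step (3.2.1): global features SS_LEN, SS_AVG, HS_AVG, HS_MAX, HS_SKE.
    f = [float(len(ss)), fmean(ss), fmean(hs), max(hs), skewness(hs)]
    # Steps (3.2.2)-(3.2.3): local features for each window of length T.
    for start in range(0, len(ss), T):
        seg_ss = ss[start:start + T]
        seg_hs = hs[start:start + T]
        # ss_avg, hs_avg, hs_start, hs_end of the current segment.
        f += [fmean(seg_ss), fmean(seg_hs), seg_hs[0], seg_hs[-1]]
    return f


f_k = re_represent([2.0, 2.0, 3.0, 3.0], [100.0, 110.0, 120.0, 130.0], T=2)
print(len(f_k), f_k)
```

Under these assumptions a sequence of n samples, which holds 2n raw values, becomes a vector of 5 + 4 * ceil(n / T) entries, which is where the reduction in length comes from when T is larger than 2.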

The advantages of the invention are as follows. Large-scale aerobic test sequences can be clustered quickly and effectively, which realizes large-scale aerobic capacity grouping. First, the method is parallelizable, so the clustering can be accelerated by exploiting a multi-core CPU or distributed parallel computation. Second, the re-representation method re-represents the original aerobic test sequences reasonably while also reducing their dimensionality. Third, using the KMeans++ clustering algorithm to cluster the re-representation sequences can noticeably improve clustering accuracy.
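
The bucketing and parallel processing idea can be sketched with Python's standard concurrent.futures module. The helper names split_into_buckets, process_buckets, and toy_worker are hypothetical, and toy_worker is only a stand-in for the per-bucket preprocessing and re-representation. The sketch uses a thread pool so that it runs anywhere; this is an assumption of the example, not something the disclosure prescribes.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, List, Sequence, Type


def split_into_buckets(data: Sequence, p: int) -> List[list]:
    """Split data into p contiguous buckets whose sizes differ by at most one."""
    base, extra = divmod(len(data), p)
    buckets, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        buckets.append(list(data[start:start + size]))
        start += size
    return buckets


def process_buckets(buckets: List[list],
                    worker: Callable[[list], list],
                    executor_cls: Type[Executor] = ThreadPoolExecutor) -> list:
    """Run worker on every bucket concurrently and merge the sub-sample sets."""
    with executor_cls(max_workers=len(buckets)) as pool:
        sub_sets = list(pool.map(worker, buckets))  # one SF_i per bucket, in order
    return [f for sf in sub_sets for f in sf]       # F collects every SF_i


def toy_worker(bucket: list) -> list:
    # Hypothetical stand-in for steps (3.1)-(3.2): one small vector per sequence.
    return [[float(len(seq)), sum(seq) / len(seq)] for seq in bucket]


D = [[1.0, 3.0], [2.0], [4.0, 4.0, 4.0], [5.0, 7.0], [6.0]]
F = process_buckets(split_into_buckets(D, p=2), toy_worker)
print(F)
```

In standard CPython builds, threads do not speed up CPU-bound pure-Python work. To use several cores, ProcessPoolExecutor can be passed as executor_cls, provided the worker is a module-level function and the call is made under an `if __name__ == "__main__":` guard.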

Drawings

Fig. 1 is a schematic diagram of the heart rate values of an aerobic capacity test sequence.

Fig. 2 is a schematic diagram of the velocity values of an aerobic capacity test sequence.

FIG. 3 is a general framework diagram of the large scale aerobic capacity clustering method of the present invention.

Detailed Description

The efficient and parallelizable large-scale aerobic capacity grouping method provided by the invention is further described in detail below with reference to the accompanying drawings.

Referring to Fig. 2, a large-scale aerobic capacity clustering task performs the following steps:

(1) Load the aerobic capacity test sequence set D = [AS_1, AS_2, ..., AS_n] into memory. AS_i (i = 1, ..., n) is a two-dimensional aerobic capacity test sequence containing heart rate and velocity, denoted AS_i = <ss_i, hs_i>, where ss_i is the sequence of velocity values and hs_i is the sequence of heart rate values. The aerobic capacity test sequences are collected from students wearing professional sports wristbands during a 22-minute aerobic endurance running test (the rules of the 22-minute aerobic endurance running test are shown in the following table).

(3) Process the data in each bucket B_i in parallel to generate a clustering subsample set SF_i:

3.1) Preprocess the aerobic capacity test sequences. Traverse each element AS_k = <ss_k, hs_k> in B_i and perform null-value processing on it (null values in both the velocity value sequence and the heart rate value sequence are represented by null); the preprocessed element is denoted AS_k'. The preprocessing steps are as follows:

3.1.1) Traverse the velocity value sequence ss_k. If the current value is null, handle it according to one of three cases: 1. if the current value is neither the first nor the last point of the sequence, replace the null value with the average of the values at the previous and next time points; 2. if the current null value is the first point of the sequence, replace it with the average of the two consecutive points after it; 3. if the current null value is the last point of the sequence, replace it with the average of the two consecutive points before it.

3.1.2) Process the null values of the heart rate sequence in the same way as in 3.1.1).

3.2) Traverse each preprocessed aerobic capacity test sequence element AS_k' in B_i, apply re-representation processing to it to obtain its re-representation sequence, denoted f_k, and add f_k to SF_i; the total clustering sample set is denoted F. The steps of the re-representation processing are as follows:

3.2.1) Calculate the following global feature values of the sequence: the length of the velocity value sequence (denoted SS_LEN), the average velocity (SS_AVG), and the average (HS_AVG), maximum (HS_MAX), and skewness (HS_SKE) of the heart rate value sequence. Insert them into f_k in this order.

3.2.2) With T as the segmentation window threshold, split AS_k' into sequence segments of length T, where T is an integer from 1 to 100.

3.2.3) For each sequence segment obtained in the previous step, calculate the following local feature values in order: the average of the velocity values (ss_avg), the average of the heart rate values (hs_avg), the heart rate value at the start of the segment (hs_start), and the heart rate value at the end of the segment (hs_end). Insert these values into f_k in this order.

(4) Normalize the samples in the clustering sample set F.

(5) Cluster the clustering sample set F with the partition clustering algorithm KMeans++ to obtain a clustering result C_res.

(6) The K clusters in the result set C_res correspond to K populations with different aerobic capacity.
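
Steps (4) and (5) can be illustrated with a small, dependency-free sketch. The disclosure does not say which normalization is used, so min-max scaling is an assumption here, as are the function names, the fixed random seed, and the requirement that all feature vectors in F have the same length and that F contains at least K distinct samples. The clustering routine is a plain Lloyd iteration with k-means++ seeding, written for illustration rather than taken from the source.

```python
import random
from typing import List

Vector = List[float]


def min_max_normalize(F: List[Vector]) -> List[Vector]:
    """Rescale each feature column to [0, 1]; constant columns map to 0.0."""
    lo = [min(col) for col in zip(*F)]
    hi = [max(col) for col in zip(*F)]
    return [[(x - mn) / (mx - mn) if mx > mn else 0.0
             for x, mn, mx in zip(row, lo, hi)] for row in F]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp(F: List[Vector], K: int, iters: int = 50, seed: int = 0) -> List[int]:
    """Lloyd's k-means with k-means++ seeding; returns one label per sample."""
    rng = random.Random(seed)
    # k-means++ seeding: each new centre is drawn with probability proportional
    # to its squared distance from the nearest centre chosen so far.
    centres = [rng.choice(F)]
    while len(centres) < K:
        d2 = [min(sq_dist(x, c) for c in centres) for x in F]
        centres.append(rng.choices(F, weights=d2, k=1)[0])
    labels = [0] * len(F)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [min(range(K), key=lambda j: sq_dist(x, centres[j])) for x in F]
        # Update step: mean of each cluster; an empty cluster keeps its centre.
        new_centres = []
        for j in range(K):
            members = [x for x, lab in zip(F, labels) if lab == j]
            new_centres.append([sum(col) / len(members) for col in zip(*members)]
                               if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


F = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
C_res = kmeans_pp(min_max_normalize(F), K=2)
print(C_res)
```

Each distinct label in C_res then identifies one of the K aerobic capacity groups of step (6). In practice a library implementation of k-means++ would normally be preferred over this hand-written loop.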

The embodiments described in this specification merely illustrate implementation forms of the inventive concept. The protection scope of the invention should not be considered limited to the specific forms set forth in the embodiments; each step may be changed, and all equivalent changes and modifications based on the technical solutions of the invention fall within its protection scope.
