circRNA recognition method based on MapReduce parallelism

Document No.: 1289189    Publication date: 2020-08-28

Note: This invention, "circRNA recognition method based on MapReduce parallelism" (基于MapReduce并行的circRNA识别方法), was designed and created by 邹权 and 牛梦婷 on 2020-05-20. Abstract: The invention discloses a circRNA recognition method based on MapReduce parallelism. The method first applies a plurality of feature extraction algorithms to extract nucleic acid composition features, self-organization correlation features, pseudo nucleic acid composition features and structural features of the sequence data; the resulting feature files are then spliced together by early fusion to form a complete feature set; effective features are selected from this feature set with the MRMD feature selection algorithm to obtain a more effective feature subset; finally, a cirRNAPL classifier is built by combining MapReduce with the extreme learning machine algorithm to realize the recognition of circRNA. By introducing the MapReduce parallel algorithm, the invention improves recognition efficiency and saves the user a great deal of time and effort.

1. A circRNA recognition method based on MapReduce parallelism, characterized by comprising the following steps:

S1, downloading a circRNA sequence data file, and acquiring an original circRNA characteristic data set to be processed;

S2, extracting the data characteristics of the original circRNA characteristic data set by adopting a plurality of characteristic extraction algorithms to obtain a plurality of characteristic files;

S3, splicing all the feature files by adopting an early-stage fusion mode to obtain a complete feature set;

S4, performing feature selection on the feature set by adopting an MRMD algorithm to obtain a feature subset with strong correlation between features and example categories and low redundancy among the features;

S5, optimizing a kernel function parameter g and a penalty coefficient c of the extreme learning machine algorithm by adopting a particle swarm optimization algorithm, so that the classification performance of the extreme learning machine algorithm is optimal;

S6, performing classification training on the circRNA in the feature subset by adopting an optimized extreme learning machine algorithm and combining MapReduce parallel computation to obtain a trained classification model;

S7, constructing a cirRNAPL classifier by adopting the trained classification model, inputting the feature subset into the cirRNAPL classifier to obtain a classification result, and completing the identification of the circRNA.

2. The circRNA recognition method according to claim 1, wherein the original circRNA feature data set obtained in step S1 includes a positive case data set and a negative case data set, the positive case data set is a circRNA sequence file to be classified, and the negative case data set is a non-circRNA sequence file.

3. The circRNA recognition method according to claim 1, wherein in step S1, before the original circRNA feature data set to be processed is obtained, format judgment and content judgment are required to be performed on the downloaded circRNA sequence data file;

the specific method for format judgment is as follows: when a line of the read circRNA sequence data file begins with the character '>', the next line of data is taken as sequence text data;

the specific method for content judgment is as follows: checking whether the content of the read sequence text data consists only of the four letters 'A', 'U', 'C' and 'G'; if any other character is present, prompting that the input text may only contain the letters 'A', 'U', 'C' and 'G'.

4. The circRNA recognition method of claim 1, wherein the feature extraction algorithm in step S2 comprises a nucleic acid composition feature extraction algorithm, a self-organization correlation feature extraction algorithm, a pseudo nucleic acid composition feature extraction algorithm, and a structural feature extraction algorithm;

the nucleic acid composition feature extraction algorithm comprises a k-mer extraction algorithm, a Mismatch extraction algorithm and a Subsequence extraction algorithm;

the self-organization correlation characteristic extraction algorithm comprises a dinucleotide-based auto covariance (DAC) extraction algorithm, a dinucleotide-based cross covariance (DCC) extraction algorithm, a dinucleotide-based auto-cross covariance (DACC) extraction algorithm, a Moran autocorrelation (MAC) extraction algorithm, a Geary autocorrelation (GAC) extraction algorithm and a normalized Moreau-Broto autocorrelation (NMBAC) extraction algorithm;

the pseudo nucleic acid composition feature extraction algorithm comprises a general parallel correlation pseudo dinucleotide composition (PC) extraction algorithm and a general series correlation pseudo dinucleotide composition (SC) extraction algorithm;

the structural feature extraction algorithm comprises a local structural sequence ternary feature Triplet extraction algorithm, a PseSSC extraction algorithm and a PseDPC extraction algorithm.

5. The circRNA recognition method according to any one of claims 1 and 4, wherein in step S2, a plurality of feature extraction algorithms are simultaneously executed in a MapReduce parallel computing manner to extract data features of the original circRNA feature data set, and the specific method is as follows:

A1, designing a Map function and a Reduce function in MapReduce;

A2, reading and dividing the original circRNA characteristic data set by lines through the Map function, and converting it into a file <key, value1> with a specific format in the form of <line number, sample>;

A3, traversing all samples, and sequentially extracting the features of each sample to output data <key, value2> in the form of <line number, feature set>;

A4, receiving the output data <key, value2> of the Map function through the Reduce function, processing the received data, merging entries with the same key and outputting them to the same file, thereby forming a feature file corresponding to each sample.

6. The circRNA recognition method of claim 1, wherein the feature selection of the feature set by the MRMD algorithm in step S4 is based on maximizing MR_i + MD_i, wherein MR_i denotes the Pearson coefficient between the class and the features of the i-th circRNA example and MD_i denotes the Euclidean distance between the features of the i-th circRNA example; the value maxMR_i is calculated as follows:

$$\max MR_i = \max_{1 \le i \le M} \left| PCC(F_i, C_i) \right|, \qquad PCC(F_i, C_i) = \frac{S_{F_i C_i}}{S_{F_i} S_{C_i}}, \qquad S_{F_i C_i} = \frac{1}{N}\sum_{k=1}^{N}\left(f_k - \bar{f}_i\right)\left(c_k - \bar{c}_i\right)$$

and the value maxMD_i is determined from the Euclidean distance ED_i, the Cosine distance COS_i and the Tanimoto coefficient TC_i between the features of the i-th circRNA example;

wherein PCC(·) denotes the Pearson coefficient, F_i the feature vector of the i-th circRNA example, C_i the class vector of the i-th circRNA example, M the feature dimension of a circRNA example, S_{F_i C_i} the covariance of all elements in F_i and C_i, S_{F_i} and S_{C_i} the standard deviations of all elements in F_i and C_i respectively, f_k the k-th element of F_i, c_k the k-th element of C_i, N the number of elements in F_i and C_i, $\bar{f}_i$ and $\bar{c}_i$ the means of all elements in F_i and C_i respectively, ED_i the Euclidean distance between the features of the i-th circRNA example, COS_i the Cosine distance between the features of the i-th circRNA example, and TC_i the Tanimoto coefficient between the features of the i-th circRNA example.

7. The circRNA recognition method of claim 1, wherein step S5 comprises the substeps of:

S51, initializing the maximum number of iterations of the particle swarm algorithm and the population size of the particle swarm to 50 and 50 respectively, wherein each particle consists of a group of kernel function parameter g and penalty coefficient c;

S52, calculating the classification precision obtained by classifying the circRNAs by using the extreme learning machine algorithm, and taking the classification precision as the fitness value of the particle swarm algorithm;

S53, updating the velocity and position of all particles in the swarm;

S54, judging whether the particle swarm algorithm reaches the maximum fitness value or the maximum iteration number; if so, entering step S55, otherwise returning to step S52;

S55, acquiring the optimal kernel function parameter g and penalty coefficient c corresponding to the maximum fitness value, and substituting them into the extreme learning machine algorithm to obtain the extreme learning machine algorithm with the optimal classification performance.

8. The circRNA recognition method according to claim 7, wherein the classification accuracy in step S52 is calculated as:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

wherein ACC represents the classification accuracy obtained by classifying the circRNAs with the extreme learning machine algorithm, TP represents the number of correctly predicted circRNAs, TN represents the number of correctly predicted non-circRNAs, FP represents the number of non-circRNAs wrongly predicted as circRNAs, and FN represents the number of circRNAs wrongly predicted as non-circRNAs.

9. The circRNA recognition method of claim 7, wherein the velocity and position of the particles in the swarm are updated in step S53 according to:

$$v_i(t+1) = \omega v_i(t) + c_1 R_1 \big(P_{best,i} - p_i(t)\big) + c_2 R_2 \big(G_{best} - p_i(t)\big)$$

$$p_i(t+1) = p_i(t) + v_i(t+1)$$

wherein p_i(t) and v_i(t) denote the position and velocity of the i-th particle at the t-th iteration, ω is the inertia weight, c_1 and c_2 are acceleration factors, R_1 and R_2 are random numbers between 0 and 1, P_best,i is the best solution found by the i-th particle, and G_best is the best solution found by the whole swarm.

10. The circRNA recognition method of claim 1, wherein step S6 comprises the substeps of:

S61, designing a Map function and a Reduce function in MapReduce;

S62, dividing the feature data in the feature subset into 10 parts;

S63, reading the feature subset by rows through the Map function, and converting it into a file <key, value2> with a specific format in the form of <row number, feature set>;

S64, traversing each part of the feature data, taking one part as the test set and the remaining 9 parts as the training set, performing classification training on the circRNA in the feature data by adopting the optimized extreme learning machine algorithm, and outputting data <key, value3> in the form of <line number, classification result>;

S65, receiving the output data <key, value3> of the Map function through the Reduce function, and evaluating the classification effect;

S66, repeating steps S64-S65 until each of the 10 parts of feature data has been used as the test set, thereby obtaining a trained classification model.

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a design of a circRNA recognition method based on MapReduce parallelism.

Background

Circular RNA (circRNA) is a novel type of RNA that differs from traditional linear RNA in that circRNA is a non-coding RNA molecule with no 5'-end cap and no 3'-end tail, instead forming a closed circular structure. In 1969, circRNA was first discovered by Diener in a study of potato spindle tuber disease; electron microscopy revealed the formation of such closed circular RNA, also known as a viroid. The subsequent advent of high-throughput sequencing (RNA-seq) technology facilitated the sequencing of circular RNAs from different species, and many have now been identified. To date, over 10000 different circular RNAs have been successfully identified, from Drosophila and worms to mice and humans. Circular RNA plays an important role in the occurrence and development of diseases and provides a new direction for drug development. Accurate recognition of circular RNA is important for an in-depth understanding of its function. There are many studies on protein recognition and site detection based on machine learning, such as random forests (RF) and artificial neural networks. In contrast, few studies have focused on the recognition of circular RNAs. Therefore, it is necessary to study how to use the characteristics of RNA sequences to achieve more accurate recognition of circRNAs.

Disclosure of Invention

The invention aims to provide a circRNA recognition method based on MapReduce parallelism, which expresses the characteristics of a circRNA sequence by utilizing the structural characteristics of RNA and the composition of nucleotides and realizes more accurate recognition of the circRNA.

The technical scheme of the invention is as follows: the circRNA recognition method based on MapReduce parallelism comprises the following steps:

S1, downloading a circRNA sequence data file, and acquiring an original circRNA characteristic data set to be processed.

S2, extracting the data characteristics of the original circRNA characteristic data set by adopting a plurality of characteristic extraction algorithms to obtain a plurality of characteristic files.

S3, splicing all the feature files by adopting an early-stage fusion mode to obtain a complete feature set.

S4, feature selection is carried out on the feature set by adopting an MRMD algorithm, and a feature subset with strong correlation between features and example categories and low redundancy among the features is obtained.

S5, optimizing the kernel function parameter g and the penalty coefficient c of the extreme learning machine algorithm by adopting the particle swarm optimization, so that the classification performance of the extreme learning machine algorithm is optimal.

S6, performing classification training on the circRNA in the feature subset by adopting an optimized extreme learning machine algorithm and combining MapReduce parallel computation to obtain a trained classification model.

S7, constructing a cirRNAPL classifier by adopting the trained classification model, inputting the feature subset into the cirRNAPL classifier to obtain a classification result, and completing the identification of the circRNA.

Further, the original circRNA feature data set obtained in step S1 includes a positive case data set and a negative case data set, the positive case data set is a circRNA sequence file to be classified, and the negative case data set is a non-circRNA sequence file.

Further, in step S1, before acquiring the original circRNA feature data set to be processed, the downloaded circRNA sequence data file needs to undergo format judgment and content judgment; the format judgment is as follows: when a line of the read circRNA sequence data file begins with the character '>', the next line of data is taken as sequence text data; the content judgment is as follows: checking whether the content of the read sequence text data consists only of the four letters 'A', 'U', 'C' and 'G', and if any other character is present, prompting that the input text may only contain the letters 'A', 'U', 'C' and 'G'.

Further, the feature extraction algorithm in step S2 includes a nucleic acid composition feature extraction algorithm, a self-organization correlation feature extraction algorithm, a pseudo nucleic acid composition feature extraction algorithm, and a structural feature extraction algorithm; the nucleic acid composition feature extraction algorithm comprises a k-mer extraction algorithm, a Mismatch extraction algorithm and a Subsequence extraction algorithm; the self-organization correlation feature extraction algorithm comprises a dinucleotide-based auto covariance (DAC) extraction algorithm, a dinucleotide-based cross covariance (DCC) extraction algorithm, a dinucleotide-based auto-cross covariance (DACC) extraction algorithm, a Moran autocorrelation (MAC) extraction algorithm, a Geary autocorrelation (GAC) extraction algorithm and a normalized Moreau-Broto autocorrelation (NMBAC) extraction algorithm; the pseudo nucleic acid composition feature extraction algorithm comprises a general parallel correlation pseudo dinucleotide composition (PC) extraction algorithm and a general series correlation pseudo dinucleotide composition (SC) extraction algorithm; the structural feature extraction algorithm comprises a local structural sequence ternary feature Triplet extraction algorithm, a PseSSC extraction algorithm and a PseDPC extraction algorithm.

Further, in step S2, a MapReduce parallel computing method is adopted to simultaneously execute multiple feature extraction algorithms to extract data features of the original circRNA feature data set, and the specific method is as follows:

A1, designing a Map function and a Reduce function in MapReduce.

A2, reading and dividing the original circRNA characteristic data set by rows through a Map function, and converting the original circRNA characteristic data set into a file < key, value1> with a specific format in the form of < row number, sample >.

A3, traversing all samples, and sequentially extracting the features of each sample to output data <key, value2> in the form of <line number, feature set>.

A4, receiving the output data < key, value2> of the Map function through the Reduce function, processing the received data, integrating the same key value pair and outputting the same to the same file, namely forming a feature file corresponding to each sample.

Further, the criterion for feature selection of the feature set by the MRMD algorithm in step S4 is to maximize MR_i + MD_i, wherein MR_i denotes the Pearson coefficient between the class and the features of the i-th circRNA example and MD_i denotes the Euclidean distance between the features of the i-th circRNA example. The value maxMR_i is calculated as follows:

$$\max MR_i = \max_{1 \le i \le M} \left| PCC(F_i, C_i) \right|, \qquad PCC(F_i, C_i) = \frac{S_{F_i C_i}}{S_{F_i} S_{C_i}}, \qquad S_{F_i C_i} = \frac{1}{N}\sum_{k=1}^{N}\left(f_k - \bar{f}_i\right)\left(c_k - \bar{c}_i\right)$$

and the value maxMD_i is determined from the Euclidean distance ED_i, the Cosine distance COS_i and the Tanimoto coefficient TC_i between the features of the i-th circRNA example. In these expressions, PCC(·) denotes the Pearson coefficient, F_i the feature vector of the i-th circRNA example, C_i the class vector of the i-th circRNA example, M the feature dimension of a circRNA example, S_{F_i C_i} the covariance of all elements in F_i and C_i, S_{F_i} and S_{C_i} the standard deviations of all elements in F_i and C_i respectively, f_k the k-th element of F_i, c_k the k-th element of C_i, N the number of elements in F_i and C_i, $\bar{f}_i$ and $\bar{c}_i$ the means of all elements in F_i and C_i respectively, ED_i the Euclidean distance between the features of the i-th circRNA example, COS_i the Cosine distance between the features of the i-th circRNA example, and TC_i the Tanimoto coefficient between the features of the i-th circRNA example.

Further, step S5 includes the following substeps:

S51, initializing the maximum number of iterations of the particle swarm algorithm and the population size of the particle swarm to 50 and 50 respectively, wherein each particle consists of a group of kernel function parameter g and penalty coefficient c.

S52, calculating the classification precision obtained by classifying the circRNAs by using the extreme learning machine algorithm, and taking the classification precision as the fitness value of the particle swarm algorithm.

S53, updating the velocity and position of all particles in the swarm.

S54, judging whether the particle swarm algorithm reaches the maximum fitness value or the maximum iteration number; if so, entering step S55, and if not, returning to step S52.

S55, acquiring the optimal kernel function parameter g and penalty coefficient c corresponding to the maximum fitness value, and substituting them into the extreme learning machine algorithm to obtain the extreme learning machine algorithm with the optimal classification performance.

Further, the classification accuracy in step S52 is calculated as:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

wherein ACC represents the classification accuracy obtained by classifying the circRNAs with the extreme learning machine algorithm, TP represents the number of correctly predicted circRNAs, TN represents the number of correctly predicted non-circRNAs, FP represents the number of non-circRNAs wrongly predicted as circRNAs, and FN represents the number of circRNAs wrongly predicted as non-circRNAs.

Further, the velocity and position of the particles in the swarm are updated in step S53 according to:

$$v_i(t+1) = \omega v_i(t) + c_1 R_1 \big(P_{best,i} - p_i(t)\big) + c_2 R_2 \big(G_{best} - p_i(t)\big)$$

$$p_i(t+1) = p_i(t) + v_i(t+1)$$

wherein p_i(t) and v_i(t) denote the position and velocity of the i-th particle at the t-th iteration, ω is the inertia weight, c_1 and c_2 are acceleration factors, R_1 and R_2 are random numbers between 0 and 1, P_best,i is the best solution found by the i-th particle, and G_best is the best solution found by the whole swarm.

Further, step S6 includes the following substeps:

S61, designing a Map function and a Reduce function in MapReduce.

S62, dividing the feature data in the feature subset into 10 parts.

S63, reading the feature subset by rows through the Map function, and converting it into a file <key, value2> with a specific format in the form of <row number, feature set>.

S64, traversing each part of the feature data, taking one part as the test set and the remaining 9 parts as the training set, carrying out classification training on the circRNAs by adopting the optimized extreme learning machine algorithm, and outputting data <key, value3> in the form of <line number, classification result>.

S65, receiving the output data <key, value3> of the Map function through the Reduce function, and evaluating the classification effect.

S66, repeating steps S64-S65 until each of the 10 parts of feature data has been used as the test set, thereby obtaining a trained classification model.

The invention has the beneficial effects that:

(1) The invention provides a brand-new circRNA recognition method, which uses the structural characteristics of RNA and the composition of nucleotides to express the characteristics of a circRNA sequence, realizes accurate recognition of circRNA, and provides a theoretical basis for corresponding drug development.

(2) The method introduces MapReduce parallel computation for feature extraction and for classification with the optimized extreme learning machine, effectively improving processing efficiency.

(3) The method uses particle swarm optimization to optimize the extreme learning machine algorithm, trains the classification model based on the optimized extreme learning machine, and further constructs the cirRNAPL classifier, improving the recognition effect for circRNA.

Drawings

Fig. 1 is a flowchart of a circRNA recognition method based on MapReduce parallelism according to an embodiment of the present invention.

Fig. 2 is a schematic diagram illustrating a feature extraction dimension distribution according to an embodiment of the present invention.

Fig. 3 is a schematic diagram illustrating an optimization effect of extreme learning machine parameters according to an embodiment of the present invention.

Fig. 4 is a schematic diagram illustrating recognition effects of different classification methods according to an embodiment of the present invention.

Fig. 5 is a schematic diagram comparing the embodiment of the present invention with the conventional blast method.

Fig. 6 is a schematic diagram comparing the recognition effect of the recognition algorithm of the present invention with those of existing recognition algorithms.

Detailed Description

Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It is to be understood that the embodiments shown and described in the drawings are merely exemplary and are intended to illustrate the principles and spirit of the invention, not to limit the scope of the invention.

The embodiment of the invention provides a circRNA recognition method based on MapReduce parallelism, which comprises the following steps S1-S7 as shown in figure 1:

S1, downloading a circRNA sequence data file, and acquiring an original circRNA characteristic data set to be processed.

The obtained original circRNA characteristic data set comprises a positive case data set and a negative case data set, wherein the positive case data set is a circRNA sequence file to be classified, and the negative case data set is a noncircular RNA sequence file.

In the present example, there are three circular RNA sequence data files, namely circRNA vs PCG (14084 positive circular RNA sequences and 9533 negative non-circular RNA sequences), circRNA vs lncRNA (14084 positive circular RNA sequences and 19722 negative non-circular RNA sequences) and Stem cell vs not (2082 positive circular RNA sequences and 2082 negative non-circular RNA sequences).

In the embodiment of the invention, before the original circRNA characteristic data set to be processed is acquired, the downloaded circRNA sequence data file needs to undergo format judgment and content judgment. The format judgment is as follows: when a line of the read circRNA sequence data file begins with the character '>', the next line of data is taken as sequence text data. The content judgment is as follows: it is checked whether the read sequence text data consists only of the four letters 'A', 'U', 'C' and 'G'; if any other character is present, the user is prompted that the input text may only contain the letters 'A', 'U', 'C' and 'G'.
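A minimal sketch of the format and content judgment described above, assuming a plain-text FASTA-style file in which a header line starting with '>' is followed by one line of sequence text; the file and function names are illustrative, not part of the patent:

```python
# Sketch of the format judgment (header lines start with '>') and the
# content judgment (sequence lines may contain only A, U, C, G).
ALLOWED = set("AUCG")

def read_and_validate(path):
    """Return a list of (header, sequence) pairs, reporting invalid sequences."""
    records = []
    with open(path) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    for i, line in enumerate(lines):
        if line.startswith(">"):                         # format judgment
            if i + 1 < len(lines) and not lines[i + 1].startswith(">"):
                seq = lines[i + 1].upper()
                if set(seq) <= ALLOWED:                  # content judgment
                    records.append((line, seq))
                else:
                    print(f"{line}: input text may only contain 'A', 'U', 'C', 'G'")
    return records

if __name__ == "__main__":
    demo = "demo.fasta"                                  # hypothetical demo file
    with open(demo, "w") as f:
        f.write(">seq1\nAUGCUAGC\n>seq2\nAUXGC\n")       # second record is invalid
    print(read_and_validate(demo))
```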

S2, extracting the data characteristics of the original circRNA characteristic data set by adopting a plurality of characteristic extraction algorithms to obtain a plurality of characteristic files.

In the embodiment of the invention, the feature extraction algorithm comprises a nucleic acid composition feature extraction algorithm, a self-organization correlation feature extraction algorithm, a pseudo nucleic acid composition feature extraction algorithm and a structural feature extraction algorithm.

The nucleic acid composition feature extraction algorithm comprises a k-mer extraction algorithm, a Mismatch extraction algorithm and a Subsequence extraction algorithm; the self-organization correlation feature extraction algorithm comprises a dinucleotide-based auto covariance (DAC) extraction algorithm, a dinucleotide-based cross covariance (DCC) extraction algorithm, a dinucleotide-based auto-cross covariance (DACC) extraction algorithm, a Moran autocorrelation (MAC) extraction algorithm, a Geary autocorrelation (GAC) extraction algorithm and a normalized Moreau-Broto autocorrelation (NMBAC) extraction algorithm; the pseudo nucleic acid composition feature extraction algorithm comprises a general parallel correlation pseudo dinucleotide composition (PC) extraction algorithm and a general series correlation pseudo dinucleotide composition (SC) extraction algorithm; the structural feature extraction algorithm comprises a local structural sequence ternary feature Triplet extraction algorithm, a PseSSC extraction algorithm and a PseDPC extraction algorithm.

In the embodiment of the present invention, the k-mer extraction algorithm produces one feature file for each of k = 2 and k = 3 (k being the length of the adjacent nucleotide substrings whose occurrence frequencies are counted), so 15 feature files are obtained in total from the 14 feature extraction algorithms above; the dimension distribution of the 15 feature files is shown in Fig. 2.
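For illustration, the 2-mer and 3-mer frequency vectors (16 and 64 dimensions respectively) can be computed as follows; the normalization and k-mer ordering used by the actual extraction tool may differ:

```python
from itertools import product

def kmer_frequencies(seq, k):
    """Return the frequency vector of all 4**k k-mers over the alphabet A, U, C, G."""
    alphabet = "AUCG"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    total = max(len(seq) - k + 1, 1)          # number of k-length windows
    return [counts[kmer] / total for kmer in kmers]

seq = "AUGCUAGCUAGCAUGC"
print(len(kmer_frequencies(seq, 2)), len(kmer_frequencies(seq, 3)))  # 16 64
```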

In the embodiment of the invention, a MapReduce parallel computing mode is adopted to simultaneously execute the multiple feature extraction algorithms and extract the data features of the original circRNA feature data set, so as to improve the computing efficiency. The specific method comprises the following steps, and a simplified sketch is given after step A4:

A1, designing a Map function and a Reduce function in MapReduce.

A2, reading and dividing the original circRNA characteristic data set by rows through a Map function, and converting the original circRNA characteristic data set into a file < key, value1> with a specific format in the form of < row number, sample >.

A3, traversing all samples, and sequentially extracting the features of each sample to output data <key, value2> in the form of <line number, feature set>.

A4, receiving the output data < key, value2> of the Map function through the Reduce function, processing the received data, integrating the same key value pair and outputting the same to the same file, namely forming a feature file corresponding to each sample.
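As referenced above, a minimal single-process simulation of steps A1-A4; this is a sketch rather than an actual Hadoop job, and the feature extractor is a placeholder:

```python
from collections import defaultdict

# The mapper turns <line number, sample> into <line number, feature set>;
# the reducer groups values by key and emits one feature record per sample.

def extract_features(sample):
    # Placeholder feature extraction: nucleotide composition of the sample.
    return [sample.count(b) / max(len(sample), 1) for b in "AUCG"]

def map_phase(samples):
    # A2/A3: emit (key, value2) = (line number, feature set)
    for line_no, sample in enumerate(samples):
        yield line_no, extract_features(sample)

def reduce_phase(mapped):
    # A4: merge values that share the same key into one record per sample
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].extend(value)
    return dict(grouped)

samples = ["AUGCUAGC", "GGCAUUCG", "AUAUAUGC"]
feature_files = reduce_phase(map_phase(samples))
for key, features in sorted(feature_files.items()):
    print(key, features)
```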

S3, splicing all the feature files by adopting an early-stage fusion mode to obtain a complete feature set.

Common modes of feature fusion in this field include early fusion and late fusion; the embodiment of the invention adopts the early fusion mode to splice the 15 feature files.
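A minimal sketch of the early-fusion step, assuming the per-sample feature blocks produced in step S2 are already loaded as NumPy arrays; the block widths are illustrative:

```python
import numpy as np

# Early fusion: horizontally concatenate per-sample feature matrices.
# In practice each block would come from one of the 15 feature files.
n_samples = 100
feature_blocks = [
    np.random.rand(n_samples, 16),   # e.g. 2-mer features
    np.random.rand(n_samples, 64),   # e.g. 3-mer features
    np.random.rand(n_samples, 32),   # e.g. a structural feature block
]
full_feature_set = np.hstack(feature_blocks)
print(full_feature_set.shape)        # (100, 112)
```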

S4, feature selection is carried out on the feature set by adopting an MRMD algorithm, and a feature subset with strong correlation between features and example categories and low redundancy among the features is obtained.

In the MRMD algorithm, the correlation between the features and the example categories is characterized by a Pearson coefficient, and the larger the Pearson coefficient is, the stronger the correlation between the features and the example categories is, and the more compact the relationship is; the redundancy between features is characterized by Euclidean distance, which is related to Euclidean distance ED, Cosine distance COS and Tanimoto coefficient TC, and the larger the Euclidean distance, the lower the redundancy between features.

Based on this theory, the criterion adopted by the MRMD algorithm for feature selection on the feature set is to maximize MR_i + MD_i, wherein MR_i denotes the Pearson coefficient between the class and the features of the i-th circRNA example and MD_i denotes the Euclidean distance between the features of the i-th circRNA example. The value maxMR_i is calculated as follows:

$$\max MR_i = \max_{1 \le i \le M} \left| PCC(F_i, C_i) \right|, \qquad PCC(F_i, C_i) = \frac{S_{F_i C_i}}{S_{F_i} S_{C_i}}, \qquad S_{F_i C_i} = \frac{1}{N}\sum_{k=1}^{N}\left(f_k - \bar{f}_i\right)\left(c_k - \bar{c}_i\right)$$

and the value maxMD_i is determined from the Euclidean distance ED_i, the Cosine distance COS_i and the Tanimoto coefficient TC_i between the features of the i-th circRNA example. In these expressions, PCC(·) denotes the Pearson coefficient, F_i the feature vector of the i-th circRNA example, C_i the class vector of the i-th circRNA example, M the feature dimension of a circRNA example, S_{F_i C_i} the covariance of all elements in F_i and C_i, S_{F_i} and S_{C_i} the standard deviations of all elements in F_i and C_i respectively, f_k the k-th element of F_i, c_k the k-th element of C_i, N the number of elements in F_i and C_i, $\bar{f}_i$ and $\bar{c}_i$ the means of all elements in F_i and C_i respectively, ED_i the Euclidean distance between the features of the i-th circRNA example, COS_i the Cosine distance between the features of the i-th circRNA example, and TC_i the Tanimoto coefficient between the features of the i-th circRNA example.
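A hedged sketch of an MRMD-style ranking consistent with the description above. The choice of averaging the Euclidean, Cosine and Tanimoto terms (without normalizing their scales) and of measuring distances between one feature column and the remaining columns is an assumption for illustration; the original MRMD implementation may combine these terms differently:

```python
import numpy as np

def mrmd_scores(X, y):
    """Score features by relevance (|Pearson| to labels) plus distance to other features."""
    n_samples, n_features = X.shape
    scores = []
    for i in range(n_features):
        f = X[:, i]
        mr = abs(np.corrcoef(f, y)[0, 1])                       # relevance term MR_i
        others = np.delete(X, i, axis=1)
        ed = np.mean(np.linalg.norm(others - f[:, None], axis=0))
        cos = np.mean(1 - (others * f[:, None]).sum(0) /
                      (np.linalg.norm(others, axis=0) * np.linalg.norm(f) + 1e-12))
        tc = np.mean(1 - (others * f[:, None]).sum(0) /
                     ((others ** 2).sum(0) + (f ** 2).sum() -
                      (others * f[:, None]).sum(0) + 1e-12))
        md = (ed + cos + tc) / 3.0                              # distance term MD_i (assumed mean)
        scores.append(mr + md)
    return np.array(scores)

X = np.random.rand(50, 8)
y = np.random.randint(0, 2, 50)
ranking = np.argsort(mrmd_scores(X, y))[::-1]                   # best-scoring features first
print(ranking)
```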

S5, optimizing the kernel function parameter g and the penalty coefficient c of the extreme learning machine algorithm by adopting the particle swarm optimization, so that the classification performance of the extreme learning machine algorithm is optimal.

The kernel function has an important influence on the performance of the extreme learning machine algorithm, and within it the kernel parameter g and the penalty coefficient c are decisive: g affects the range of the kernel function and c affects the stability of the model. The embodiment of the invention therefore optimizes the parameters g and c with the particle swarm algorithm: the search space of the particle swarm algorithm corresponds to the parameters of the extreme learning machine algorithm, the position of each particle represents a pair of parameter values (g, c), and the classification precision of the extreme learning machine algorithm is used as the fitness value of the particle swarm algorithm.
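For reference, a compact kernel extreme learning machine in which g is the RBF kernel parameter and c the penalty coefficient, as in step S5. This is a generic kernel-ELM formulation sketched on synthetic data, not necessarily the exact variant used in the patent:

```python
import numpy as np

def rbf_kernel(A, B, g):
    """RBF kernel matrix K(A, B) with width parameter g."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-g * d2)

class KELM:
    def __init__(self, g=0.1, c=1.0):
        self.g, self.c = g, c

    def fit(self, X, y):
        self.X = X
        # one-vs-all targets in {-1, +1}
        T = np.where(y[:, None] == np.arange(y.max() + 1), 1.0, -1.0)
        omega = rbf_kernel(X, X, self.g)
        # closed-form output weights: (I/c + Omega)^-1 T
        self.beta = np.linalg.solve(np.eye(len(X)) / self.c + omega, T)
        return self

    def predict(self, Xtest):
        return rbf_kernel(Xtest, self.X, self.g).dot(self.beta).argmax(axis=1)

X = np.random.rand(60, 10)
y = np.random.randint(0, 2, 60)
model = KELM(g=0.5, c=10.0).fit(X, y)
print((model.predict(X) == y).mean())    # training accuracy of the sketch
```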

Step S5 includes the following substeps S51-S55; a simplified sketch of the optimization loop is given after step S55:

S51, initializing the maximum number of iterations of the particle swarm algorithm and the population size of the particle swarm to 50 and 50 respectively, wherein each particle consists of a group of kernel function parameter g and penalty coefficient c.

S52, calculating the classification precision obtained by classifying the circRNAs by using the extreme learning machine algorithm, and taking the classification precision as the fitness value of the particle swarm algorithm.

The calculation formula of the classification precision is as follows:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

wherein ACC represents the classification precision obtained by classifying the circRNAs with the extreme learning machine algorithm, TP represents the number of correctly predicted circRNAs, TN represents the number of correctly predicted non-circRNAs, FP represents the number of non-circRNAs wrongly predicted as circRNAs, and FN represents the number of circRNAs wrongly predicted as non-circRNAs.

S53, updating the velocity and position of all particles in the swarm according to:

$$v_i(t+1) = \omega v_i(t) + c_1 R_1 \big(P_{best,i} - p_i(t)\big) + c_2 R_2 \big(G_{best} - p_i(t)\big)$$

$$p_i(t+1) = p_i(t) + v_i(t+1)$$

wherein p_i(t) and v_i(t) denote the position and velocity of the i-th particle at the t-th iteration, ω is the inertia weight, c_1 and c_2 are acceleration factors, R_1 and R_2 are random numbers between 0 and 1, P_best,i is the best solution found by the i-th particle, and G_best is the best solution found by the whole swarm.

S54, judging whether the particle swarm algorithm reaches the maximum fitness value or the maximum iteration number; if so, entering step S55, and if not, returning to step S52.

S55, acquiring the optimal kernel function parameter g and penalty coefficient c corresponding to the maximum fitness value, and substituting them into the extreme learning machine algorithm to obtain the extreme learning machine algorithm with the optimal classification performance.
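As referenced above, a simplified particle swarm loop over (g, c) following substeps S51-S55. The fitness function below is a stand-in for the ELM classification precision, and the inertia weight, acceleration factors and search ranges are illustrative assumptions:

```python
import numpy as np

def fitness(g, c):
    # Placeholder for: train the ELM with (g, c) and return its classification precision.
    return -((np.log10(g) - 0.5) ** 2 + (np.log10(c) - 1.0) ** 2)

rng = np.random.default_rng(0)
n_particles, n_iter = 50, 50                     # S51: swarm size 50, 50 iterations
w, c1, c2 = 0.7, 1.5, 1.5                        # inertia weight and acceleration factors (assumed)
low, high = [1e-3, 1e-2], [10.0, 100.0]          # assumed search ranges for (g, c)
pos = rng.uniform(low, high, size=(n_particles, 2))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fitness(g, c) for g, c in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):                          # S54: stop at the maximum iteration count
    r1, r2 = rng.random((n_particles, 1)), rng.random((n_particles, 1))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)   # S53: velocity update
    pos = np.clip(pos + vel, low, high)                                  # S53: position update
    fit = np.array([fitness(g, c) for g, c in pos])                      # S52: evaluate fitness
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    gbest = pbest[pbest_fit.argmax()].copy()

print("best (g, c):", gbest)                     # S55: optimal parameters found
```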

S6, performing classification training on the circRNA in the feature subset by adopting an optimized extreme learning machine algorithm and combining MapReduce parallel computation to obtain a trained classification model.

Step S6 includes the following substeps S61-S66; a simplified sketch of the parallel cross-validation is given after step S66:

S61, designing a Map function and a Reduce function in MapReduce.

S62, dividing the feature data in the feature subset into 10 parts.

S63, reading the feature subset by rows through the Map function, and converting the feature subset into a file < key, value2> with a specific format in the form of < row number, feature set >.

S64, traversing each piece of characteristic data, taking one piece of characteristic data as a test set and the remaining 9 pieces of characteristic data as a training set, carrying out classification training on the circRNAs by adopting an optimized extreme learning machine algorithm, and outputting data < key, value3> in the form of < line number, classification result >.

S65, receiving the output data <key, value3> of the Map function through the Reduce function, and evaluating the classification effect.

In the embodiment of the present invention, the indexes for evaluating the classification effect are SE (sensitivity), SP (specificity), ACC (accuracy) and MCC (Matthews correlation coefficient), calculated as follows:

$$SE = \frac{TP}{TP + FN}, \qquad SP = \frac{TN}{TN + FP}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

wherein TP represents the number of correctly predicted circRNAs, TN represents the number of correctly predicted non-circRNAs, FP represents the number of non-circRNAs wrongly predicted as circRNAs, and FN represents the number of circRNAs wrongly predicted as non-circRNAs.

S66, repeating steps S64-S65 until each of the 10 parts of feature data has been used as the test set, thereby obtaining a trained classification model.
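As referenced above, a single-process sketch of the parallel 10-fold cross-validation in substeps S61-S66; the nearest-mean classifier is a placeholder standing in for the PSO-optimized extreme learning machine, and the data are synthetic:

```python
import math
import numpy as np

# Each map task handles one of the 10 folds (train on the other 9 parts, test
# on its own part) and emits <fold, (TP, TN, FP, FN)>; the reduce step
# aggregates the counts and computes SE, SP, ACC and MCC.

def nearest_mean_predict(Xtr, ytr, Xte):
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    return np.argmin(((Xte[:, None, :] - means[None]) ** 2).sum(-1), axis=1)

def map_fold(fold, folds, X, y):
    # S64: one part is the test set, the remaining 9 parts form the training set
    test_idx = folds[fold]
    train_idx = np.concatenate([folds[i] for i in range(len(folds)) if i != fold])
    pred = nearest_mean_predict(X[train_idx], y[train_idx], X[test_idx])
    truth = y[test_idx]
    tp = int(((pred == 1) & (truth == 1)).sum())
    tn = int(((pred == 0) & (truth == 0)).sum())
    fp = int(((pred == 1) & (truth == 0)).sum())
    fn = int(((pred == 0) & (truth == 1)).sum())
    return fold, (tp, tn, fp, fn)

def reduce_results(pairs):
    # S65: aggregate the per-fold counts and evaluate SE, SP, ACC, MCC
    tp = tn = fp = fn = 0
    for _, (a, b, c, d) in pairs:
        tp, tn, fp, fn = tp + a, tn + b, fp + c, fn + d
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "ACC": acc, "MCC": mcc}

rng = np.random.default_rng(1)
X, y = rng.random((200, 12)), rng.integers(0, 2, 200)
folds = np.array_split(rng.permutation(200), 10)          # S62: 10 parts
print(reduce_results(map_fold(k, folds, X, y) for k in range(10)))  # S66: all folds
```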

S7, constructing a cirRNAPL classifier by adopting the trained classification model, inputting the feature subset into the cirRNAPL classifier to obtain a classification result, and completing the identification of the circRNA.

The recognition effect of the present invention is further described below with a set of specific experimental examples.

First, the recognition effects of the unoptimized extreme learning machine algorithm (ELM), an extreme learning machine optimized with a genetic algorithm (GA-ELM), and the finally constructed cirRNAPL classifier, which uses an extreme learning machine optimized with particle swarm optimization (PSO), are compared, as shown in Fig. 3. As can be seen from Fig. 3, the GA-ELM and cirRNAPL classifiers achieve better classification results than the plain ELM. On the three data sets, the cirRNAPL classifier obtained classification accuracy (ACC) values of 0.815, 0.822 and 0.782. The experiments show that optimization effectively improves the prediction accuracy and generalization ability of the ELM network, so the optimized ELM is used as the classification algorithm to identify circRNA.

The recognition effect of the present invention is then compared with commonly used machine learning algorithms, as shown in Fig. 4. Comparing the ACC, SE, SP and MCC results of the invention with those of the CNN, RF, SVM, J48 and ZeroR algorithms shows that the cirRNAPL classifier constructed in the invention achieves a better effect. On the three data sets, cirRNAPL achieved recognition accuracies of 0.815, 0.822 and 0.782, validating the effectiveness of PSO-ELM for circRNA recognition.

The recognition results of the present invention were then compared with those of the conventional, commonly used blast sequence alignment tool, as shown in Fig. 5. As can be seen from Fig. 5, the recognition accuracies of blast are 0.439, 0.605 and 0.611, while the classification accuracies of the cirRNAPL classifier are 0.815, 0.802 and 0.782, respectively. The lower accuracy of blast is expected, considering that blast compares only certain key segments of a sequence, which are of varying importance. There is therefore little doubt that the sequence-data-based cirRNAPL classification approach will have increasingly broad validity and availability in research.

Finally, the invention is compared with the published results of existing recognition algorithms; to keep the comparison fair, the same data sets and the same evaluation indexes (SE, SP, ACC and MCC) are used, and Fig. 6 shows the effect comparison of the different algorithms. First, comparing cirRNAPL with WebCircRNA, it can be seen from Fig. 6 that cirRNAPL achieved better performance than WebCircRNA on the Stem cell vs not and circRNA vs PCG data sets. Next, the results of cirRNAPL were compared with PredcircRNA, H-ELM and circDeep, respectively. As can be seen from Fig. 6, cirRNAPL is superior to PredcircRNA and H-ELM in the three indexes ACC, SE and MCC. These comparisons show that cirRNAPL is effective for circRNA recognition and can provide a new idea for circRNA research.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
