Method for predicting DNA replication origin in Saccharomyces cerevisiae

Document No.: 1044853  Publication date: 2020-10-09

Abstract: This invention, "Method for predicting DNA replication origin in Saccharomyces cerevisiae", was created by Fan Yongxian (樊永显) and Wang Wanru (王婉茹) on 2020-07-03. The invention discloses a method for predicting DNA replication origins in Saccharomyces cerevisiae, comprising the following steps: acquiring positive and negative sample sequences from Saccharomyces cerevisiae; extracting features with two methods, binary coding and PSEKNC-I; screening the features obtained by the PSEKNC-I method with the F-score and IFS methods to obtain pre-screened features; combining the features obtained by binary coding with the pre-screened features to obtain the combined-feature sample data set; constructing and training a CNN prediction model, and inputting data to obtain a preliminary prediction result; adjusting the parameters of the trained CNN prediction model to optimize it; and evaluating the optimized CNN prediction model by five-fold cross-validation to obtain the optimal CNN prediction model, into which data are input to obtain the final prediction result. The method extracts features from several kinds of DNA information, reduces computation time, avoids overfitting, selects the optimal classification model, and improves the accuracy of replication origin prediction.

1. A method for predicting a DNA replication origin in Saccharomyces cerevisiae is characterized by comprising the following steps:

1) acquiring a sample data set: acquiring a positive sample sequence and a negative sample sequence in the saccharomyces cerevisiae;

2) feature extraction: the sample sequence is represented by using a binary coding method and a PSEKNC-I method, namely, one vector is used for representing each DNA sequence;

3) selecting characteristics: screening the features obtained by using the PSEKNC-I method in the step 2) by using an F-score method and an incremental feature selection method to obtain pre-screening features;

4) combining the characteristics: combining the features obtained by the binary coding method in the step 2) and the pre-screening features obtained in the step 3), and further screening the combined features by using binomial distribution to obtain a sample data set after feature combination;

5) constructing a model: constructing a CNN prediction model and carrying out a five-fold cross-validation experiment on the sample data set obtained in step 4), in which the data set is randomly divided into 5 groups, 1 group serving as the test set and the remaining 4 groups as the training set; training the constructed CNN prediction model on the training set to obtain a trained CNN prediction model; and inputting the test set into the trained prediction model classifier, the resulting classification being the preliminary replication origin prediction;

6) optimizing parameters: adjusting the number of convolution layers, the number of convolutions, the filter size and stride, and the output layer probability in the trained CNN prediction model according to the preliminary result obtained in step 5), thereby optimizing the trained CNN prediction model;

7) model evaluation: evaluating the optimized CNN prediction model by five-fold cross-validation, measuring it with four evaluation metrics, namely sensitivity, specificity, accuracy, and the Matthews correlation coefficient, to finally obtain the optimal CNN prediction model, and inputting a DNA sequence into the optimal CNN prediction model to obtain the final DNA replication origin prediction result.

2. The method according to claim 1, wherein the binary coding method in step 2) is to use 0 and 1 to represent nucleotides in DNA sequences, and each DNA sequence is converted into a feature vector, wherein the nucleotides in DNA sequences are represented as follows:

[Formula (1): image not reproduced]

in the formula (1), A (0,0,0,0) is adenine in the DNA sequence, C (0,1,0,1) is cytosine in the DNA sequence, G (0,0,1,0) is guanine in the DNA sequence, and T (0,0,0,1) is thymine in the DNA sequence.

Technical Field

The invention relates to the technical field of classification prediction of sequence interaction in bioinformatics, in particular to a method for predicting a DNA replication origin in saccharomyces cerevisiae.

Background

In recent years, bioinformatics and computer science have combined to open a new research direction. Taking nucleotide, protein, and gene sequence data sets as its main research objects, using mathematics, informatics, and computer science as its methods, and relying mainly on computer hardware, software, and networks as its tools, the field stores, manages, annotates, and processes extremely large amounts of raw data into biological information with clear biological meaning. By querying, exploring, comparing, and analyzing this biological information, it yields rational knowledge about gene coding, gene regulation, and the relationships among the structure, function, and mechanisms of nucleotides and proteins. On the basis of this large body of information and knowledge, it explores major problems of the life sciences, such as the origin of life, biological evolution, the formation of cells, organs, and individuals, and development, disease, aging, and death, and clarifies their basic rules and spatio-temporal relationships. Ultimately, the goal is to reveal the biological significance contained in the data by acquiring, processing, storing, retrieving, and analyzing biological experimental data. For a genome, obtaining the sequence is only the first step; the next step, which is the task of the so-called post-genome era, is to collect, organize, retrieve, and analyze the structural and functional information expressed in the sequence in order to find its regularities.

DNA replication is the main mechanism on which heredity and gene transmission depend. The origin of replication (ORI) determines where replication starts, and accurately identifying it not only helps to optimize gene expression but also provides a new strategy for research on new drugs for genetic diseases. Errors in the timing and position of replication initiation, and nucleotide mismatches during replication, can cause DNA sequence mutations, genome recombination, and similar events, which increase the transmission of erroneous genetic information and the instability of the cell genome. This directly affects normal cell division and normal embryonic development, and is closely related to the development of cancer and many genetic diseases. Accurate identification of DNA replication origins is therefore of great importance in genetic research.

To date, many studies have addressed ORIs, each with some success. In 2004, Cozzarelli's group predicted yeast replication origins with the Oriscan algorithm, using as sequence features the ARS consensus sequence (ACS), in which the replication origin is rich in AT bases, and a 3' region rich in A bases. In 2014, Li analyzed the compositional bias of the Saccharomyces cerevisiae genome by calculating GC profile and GC skew values, extracted sequence information with type I pseudo k-tuple nucleotide composition, and built the online predictor iORI-PseKNC to identify replication origin sequences in Saccharomyces cerevisiae. In 2016, Zhang made the first attempt to construct a human ORI data set and, extracting information with type I pseudo k-tuple nucleotide composition, built the iOri-Human online predictor based on a random forest classifier to identify human ORIs.

Disclosure of Invention

The invention aims to address the limited accuracy of existing DNA replication origin prediction, and provides a method for predicting the DNA replication origin in Saccharomyces cerevisiae.

The technical scheme for realizing the purpose of the invention is as follows:

a method for predicting a DNA replication origin in Saccharomyces cerevisiae comprises the following steps:

1) acquiring a sample data set: acquiring a positive sample sequence and a negative sample sequence in the saccharomyces cerevisiae;

2) feature extraction: the sample sequence is represented by using a binary coding method and a PSEKNC-I method, namely, one vector is used for representing each DNA sequence;

3) selecting characteristics: screening the features obtained by using the PSEKNC-I method in the step 2) by using an F-score method and an Incremental Feature Selection (IFS) method to obtain pre-screening features;

4) combining the characteristics: combining the features obtained by the binary coding method in the step 2) and the pre-screening features obtained in the step 3), and further screening the combined features by using binomial distribution to obtain a sample data set after feature combination;

5) constructing a model: constructing a CNN prediction model and carrying out a five-fold cross-validation experiment on the sample data set obtained in step 4), in which the data set is randomly divided into 5 groups, 1 group serving as the test set and the remaining 4 groups as the training set; training the constructed CNN prediction model on the training set to obtain a trained CNN prediction model; and inputting the test set into the trained prediction model classifier, the resulting classification being the preliminary replication origin prediction;

6) optimizing parameters: adjusting the number of convolution layers, the number of convolutions, the filter size and stride, and the output layer probability in the trained CNN prediction model according to the preliminary result obtained in step 5), thereby optimizing the trained CNN prediction model;

7) model evaluation: evaluating the optimized CNN prediction model by five-fold cross-validation, measuring it with four evaluation metrics, namely sensitivity (Sn), specificity (Sp), accuracy (Acc), and the Matthews correlation coefficient (MCC), to finally obtain the optimal CNN prediction model, and inputting the DNA sequence into the optimal CNN prediction model to obtain the final DNA replication origin prediction result.
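As an illustration only, the five-fold split of step 5) and the four evaluation metrics of step 7) might be sketched in Python as follows. The function names `five_fold_indices` and `evaluate`, the label convention (1 for a replication origin, 0 otherwise), and the fixed random seed are assumptions made for this sketch and are not taken from the invention; the metrics follow the usual confusion-matrix definitions of Sn, Sp, Acc, and MCC.

```python
import math
import random


def five_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and deal them into 5 roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::5] for i in range(5)]


def evaluate(y_true, y_pred):
    """Return (Sn, Sp, Acc, MCC) for binary labels (assumed: 1 = origin)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / len(pairs) if pairs else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc
```

In each of the five rounds, one fold would serve as the test set and the other four as the training set; the metrics would then typically be averaged over the rounds.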

In the step 2), the binary coding method is to use 0 and 1 to represent nucleotides in DNA sequences, and convert each DNA sequence into a feature vector, wherein the representation of the nucleotides in the DNA sequences is as follows:

[Formula (1): image not reproduced]

in the formula (1), A (0,0,0,0) is adenine in the DNA sequence, C (0,1,0,1) is cytosine in the DNA sequence, G (0,0,1,0) is guanine in the DNA sequence, and T (0,0,0,1) is thymine in the DNA sequence.
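The binary coding of step 2) might be sketched in Python as follows. This is a minimal illustration, not the inventors' implementation: the function name `binary_encode` and the table name `BINARY_CODE` are hypothetical, and the four codes are copied verbatim from the sentence describing formula (1), since the formula image itself is not available for cross-checking.

```python
# Codes copied verbatim from the description of formula (1) in the text.
BINARY_CODE = {
    "A": (0, 0, 0, 0),
    "C": (0, 1, 0, 1),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def binary_encode(sequence):
    """Convert a DNA sequence into a flat 0/1 feature vector (4 bits per base)."""
    vector = []
    for base in sequence.upper():
        if base not in BINARY_CODE:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        vector.extend(BINARY_CODE[base])
    return vector
```

A sequence of length L therefore yields a vector of length 4L, so all sample sequences would need the same length for the vectors to be comparable.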

In the step 2), the PSEKNC-I method comprises the following steps:

2-1) calculating the occurrence frequency of the different k-tuple nucleotide components in the DNA sequence, where a DNA sequence sample R of length L, composed of the four nucleotides adenine A, guanine G, cytosine C, and thymine T, is expressed by the following formula (2), and k takes the values 1, 2, 3, …, n, with n approaching infinity;

R = R_1 R_2 R_3 R_4 R_5 R_6 … R_i … R_L    (2)

where R_i is the nucleotide at position i in the DNA sequence;

2-2) taking k consecutive nucleotides as a group, for a total of 4^k possible combinations: for each sample DNA sequence in the reference data set, the k-tuple nucleotide component method starts from the first nucleotide and takes k adjacent nucleotides from left to right, then moves one nucleotide to the right and takes the next k adjacent nucleotides, repeating this operation L-k+1 times to traverse the whole DNA sequence, where L is the length of the sample DNA sequence, and the occurrence frequency of each k-tuple nucleotide component in the whole DNA sequence is counted;

2-3) converting the occurrence frequencies of the 4^k combinations into a 4^k-dimensional vector, which gives dimensions 1 to 4^k of the matrix D, where D is expressed as:

[Formula (3): image not reproduced]

In formula (3), [symbol image not reproduced] is the frequency of occurrence of each k-tuple nucleotide component in the DNA sequence.
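Steps 2-1) to 2-3) describe a sliding-window count of k-tuples, which might be sketched in Python as follows. The function name `kmer_frequencies` is hypothetical, and dividing each count by the number of windows (L-k+1) is an assumption about what "frequency of occurrence" means here. The sketch covers only the 4^k frequency components; any further components of PSEKNC-I are not covered because their formulas are not reproduced in this text.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return the 4**k k-tuple frequencies of a DNA sequence, in ACGT order."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(sequence) - k + 1  # the window is moved L-k+1 times
    if windows <= 0:
        raise ValueError("sequence is shorter than k")
    for start in range(windows):
        kmer = sequence[start:start + k]
        if kmer not in counts:
            raise ValueError(f"unexpected nucleotide in {kmer!r}")
        counts[kmer] += 1
    # Assumption: normalize counts by the number of windows.
    return [counts[kmer] / windows for kmer in kmers]
```

For example, with k = 2 the result has 16 entries ordered AA, AC, AG, AT, CA, and so on.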

In step 3), the F-score method ranks the features X_k extracted in step 2), where k = 1, 2, 3, …, m. If the numbers of positive and negative samples are n_+ and n_- respectively, the F-score of the i-th feature is defined as:

[Formula (10): image not reproduced]

where the three mean terms are, respectively, the mean value of the i-th feature over the whole data set, the positive sample set, and the negative sample set, and the two per-sample terms are the value of the i-th feature in the k-th positive sample and in the k-th negative sample. The numerator represents the difference between the positive and negative sets, and the denominator represents the variation within each of the two sets. A larger F_i indicates that the i-th feature carries more discriminative information and has a greater influence on classification. The score obtained from formula (10) is used as the feature selection criterion: the F_i values are ranked from largest to smallest, and the features with the greatest influence on classification are selected as the sample data feature set.
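Because formula (10) is not reproduced in this text, the following Python sketch assumes the commonly used F-score definition that matches the quantities described above: the squared deviations of the positive-set and negative-set means from the overall mean, divided by the sum of the two within-set sample variances. The function names `f_score` and `rank_features` are hypothetical, and each set is assumed to contain at least two samples.

```python
def f_score(pos, neg, i):
    """F-score of feature i; pos and neg are lists of feature vectors.

    Assumed definition (formula (10) is not available in this text):
    numerator   = (mean_pos - mean_all)**2 + (mean_neg - mean_all)**2
    denominator = sample variance in pos + sample variance in neg
    """
    pos_vals = [row[i] for row in pos]
    neg_vals = [row[i] for row in neg]
    mean_all = (sum(pos_vals) + sum(neg_vals)) / (len(pos_vals) + len(neg_vals))
    mean_pos = sum(pos_vals) / len(pos_vals)
    mean_neg = sum(neg_vals) / len(neg_vals)
    numerator = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((v - mean_pos) ** 2 for v in pos_vals) / (len(pos_vals) - 1)
    var_neg = sum((v - mean_neg) ** 2 for v in neg_vals) / (len(neg_vals) - 1)
    denominator = var_pos + var_neg
    return numerator / denominator if denominator else 0.0


def rank_features(pos, neg):
    """Return feature indices sorted by F-score, largest first."""
    n_features = len(pos[0])
    scores = [f_score(pos, neg, i) for i in range(n_features)]
    return sorted(range(n_features), key=lambda i: scores[i], reverse=True)
```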

In step 3), the incremental feature selection method performs feature selection over the feature sets: one feature set is first used as the training set to train a model, then the feature sets obtained by the binomial distribution method in step 4) are added to the training set one by one, retraining the model each time, until the number of feature sets giving the highest classification accuracy is found.
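The incremental procedure described above might be sketched in Python as follows. The function name `incremental_feature_selection` and the callback `score_subset` are hypothetical: the callback stands in for training a model on the given features and returning its classification accuracy, which this sketch does not implement.

```python
def incremental_feature_selection(ranked_features, score_subset):
    """Add ranked features one at a time and keep the best-scoring prefix.

    ranked_features: feature identifiers, most important first.
    score_subset: hypothetical callback that trains and evaluates a model
        on the given feature list and returns its accuracy.
    """
    best_score = float("-inf")
    best_subset = []
    current = []
    for feature in ranked_features:
        current.append(feature)
        score = score_subset(current)
        if score > best_score:
            best_score = score
            best_subset = list(current)
    return best_subset, best_score
```

The cost is one model training per added feature, so the ranking step matters: a good ranking lets the accuracy peak early.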

In step 4), the binomial distribution method ranks the feature sets by using the following formula:

q_i = m_i / M    (11)

where q_i is the prior probability, m_i is the number of given data values present in the i-th class of samples, and M is the total number of data values in the feature set,

[Formula (12): image not reproduced]

n_ij is the number of occurrences of the j-th feature in the i-th class of samples, and N_j is the total number of occurrences of the j-th feature in all data,

P_j = min(P(n_1j), P(n_2j))    (13)

CL_ij = 1 - P(n_ij)    (14)

CL_j = max(CL_1j, CL_2j)    (15)

CL_ij is the confidence level. The confidence levels are sorted in descending order, and the feature set with confidence level greater than 0.5 is selected to train and test the model.
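Formulas (11) and (13) to (15) might be sketched in Python as follows. Formula (12) is not reproduced in this text, so the sketch assumes that P(n_ij) is the upper-tail binomial probability of observing at least n_ij occurrences out of N_j with success probability q_i; that choice, and the function names `binomial_tail` and `confidence_level`, are assumptions of this sketch.

```python
from math import comb


def binomial_tail(n_ij, n_j, q_i):
    """Assumed form of P(n_ij): P(X >= n_ij) for X ~ Binomial(n_j, q_i)."""
    return sum(
        comb(n_j, m) * q_i ** m * (1 - q_i) ** (n_j - m)
        for m in range(n_ij, n_j + 1)
    )


def confidence_level(counts_per_class, priors):
    """Confidence level of one feature.

    counts_per_class: occurrences of the feature in each class.
    priors: prior probability of each class, as in formula (11).
    Implements CL = 1 - min_i P(n_i), which equals max_i (1 - P(n_i)).
    """
    n_j = sum(counts_per_class)
    p_j = min(binomial_tail(n, n_j, q) for n, q in zip(counts_per_class, priors))
    return 1.0 - p_j
```

Under this assumption, a feature that occurs 9 times in one class and once in the other, with equal priors, gets a confidence level of about 0.989 and would pass the 0.5 threshold, while an evenly split feature would not.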

The method for predicting the DNA replication origin in Saccharomyces cerevisiae extracts features from several kinds of DNA information, reduces computation time, avoids overfitting, selects the optimal classification model, and improves the accuracy of replication origin prediction.

Drawings

FIG. 1 is a flow chart of a method for predicting the origin of DNA replication in Saccharomyces cerevisiae;

FIG. 2 is a distribution diagram of a reference data set in an embodiment;

FIG. 3 is a flow chart of convolutional neural network prediction.

Detailed Description

The invention will be further elucidated with reference to the drawings and examples, without however being limited thereto.
