Method for identifying DNA enhancer element based on sequence frequency information

Document No.: 193424. Publication date: 2021-11-02.

Summary: This technology, "Method for identifying DNA enhancer element based on sequence frequency information", was created by 郭菲, 吕一诺, 何文颖, 唐继军 and 曹晶 on 2021-08-09. The invention discloses a method for identifying DNA enhancer elements based on sequence frequency information. The method is based on a two-layer DNA enhancer element prediction model constructed with a support vector machine, and the prediction model is generated by the following steps: step (1): constructing a DNA enhancer sequence data set from the chromatin database information of cell lines; step (2): processing the DNA enhancer sequence data set with the PSTNP algorithm to obtain position-specific trinucleotide sequence information of the DNA enhancers; step (3): optimizing the trinucleotide sequence information with the Kullback-Leibler divergence; step (4): reducing the dimensionality of the trinucleotide feature data with the LASSO algorithm. The invention addresses the prediction of DNA enhancers and their strength; it improves the extracted sequence frequency information through feature optimization and feature selection, which markedly improves prediction accuracy.

1. A method for identifying a DNA enhancer element based on sequence frequency information, characterized in that the method is based on a two-layer DNA enhancer element prediction model constructed with a support vector machine, the prediction model being generated by the following steps:

step (1): constructing a DNA enhancer sequence data set from the chromatin database information of the cell line;

step (2): processing the DNA enhancer sequence data set by a PSTNP algorithm to obtain DNA enhancer information of a trinucleotide sequence with position specificity;

step (3): optimizing the trinucleotide sequence information of the DNA enhancer information by a Kullback-Leibler divergence algorithm;

step (4): performing dimensionality reduction on the feature data of the trinucleotide sequence of the DNA enhancer information by a LASSO algorithm.

2. The method for identifying a DNA enhancer element based on sequence frequency information according to claim 1, characterized in that the position-specific trinucleotide composition information of the enhancer sequence in step (2) is generated by the following steps:

2.1. each 200 bp sequence sample S is written as:

$$S = N_1 N_2 \cdots N_l \cdots N_{200}$$

where $N_l$ denotes the nucleotide at position $l$ and is one of A, C, G, T;

2.2. position-specific information of the enhancer sequence is extracted with the k-mer method, taking k = 3;

2.3. the position-specific trinucleotide frequency information of the positive samples, $F^{+}$, is calculated as:

$$F^{+} = \begin{bmatrix} f^{+}_{1,1} & f^{+}_{1,2} & \cdots & f^{+}_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{+}_{4^k,1} & f^{+}_{4^k,2} & \cdots & f^{+}_{4^k,200-k+1} \end{bmatrix}$$

where $f^{+}_{i,j}$ denotes the frequency with which the $i$-th trinucleotide ($3mer_i$) occurs at position $j$ of the sequence in the positive samples, so that $f^{+}_{4^k,\,200-k+1}$ is the frequency of the $4^k$-th trinucleotide at position $200-k+1$, and $3mer_i$ ranges over AAA, AAC, …, TTT;

2.4. the position-specific trinucleotide frequency information of the negative samples, $F^{-}$, is calculated as:

$$F^{-} = \begin{bmatrix} f^{-}_{1,1} & f^{-}_{1,2} & \cdots & f^{-}_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{-}_{4^k,1} & f^{-}_{4^k,2} & \cdots & f^{-}_{4^k,200-k+1} \end{bmatrix}$$

where $f^{-}_{i,j}$ denotes the frequency with which the $i$-th trinucleotide ($3mer_i$) occurs at position $j$ of the sequence in the negative samples, so that $f^{-}_{4^k,\,200-k+1}$ is the frequency of the $4^k$-th trinucleotide at position $200-k+1$, and $3mer_i$ ranges over AAA, AAC, …, TTT.

3. The method for identifying a DNA enhancer element based on sequence frequency information according to claim 1, characterized in that the optimization of the trinucleotide sequence information of the DNA enhancer information in step (3) comprises the following steps:

3.1. the optimization of the trinucleotide sequence information with the KL divergence is expressed as:

$$Z = \mathrm{KL}\left(F^{+}, F^{-}\right) = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ z_{4^k,1} & z_{4^k,2} & \cdots & z_{4^k,200-k+1} \end{bmatrix}$$

where $F^{+}$ and $F^{-}$ denote the distributions of the frequency matrices obtained from the positive and negative sample frequency information, respectively; $z_{4^k,\,200-k+1}$ denotes the degree of difference between the positive and negative sample frequencies of the $4^k$-th trinucleotide ($3mer_i$) at position $200-k+1$ of the sequence, and $3mer_i$ ranges over AAA, AAC, …, TTT;

3.2. each sequence sample S is represented as:

$$S = [\phi_1, \phi_2, \ldots, \phi_w, \ldots, \phi_{200-k+1}]^{T}$$

where T is the transpose operator and $\phi_w$ is defined as:

$$\phi_w = \begin{cases} z_{1,w}, & N_w N_{w+1} N_{w+2} = \mathrm{AAA} \\ z_{2,w}, & N_w N_{w+1} N_{w+2} = \mathrm{AAC} \\ \;\vdots & \\ z_{4^k,w}, & N_w N_{w+1} N_{w+2} = \mathrm{TTT} \end{cases}$$

where $z_{4^k,w}$ denotes the degree of difference between the positive and negative sample frequencies of the $4^k$-th trinucleotide ($3mer_i$) at position $w$ of the sequence, and $3mer_i$ ranges over AAA, AAC, …, TTT.

Technical Field

The invention belongs to the field of functional element prediction algorithms in bioinformatics, and particularly relates to a method for identifying DNA enhancer elements based on sequence frequency information.

Background

Transcription is the first step of gene expression and is also a key step, controlled by regulatory elements such as promoters and enhancers. Among these, enhancers are short sequences (50-1500bp) on DNA that have the ability to recruit transcription factors and their complexes, thus increasing the likelihood that transcription of certain genes will occur. By predicting enhancers in DNA sequences, researchers in the biological field can be helped to find the cause of an abnormally elevated level of transcription, while enhancers of different strengths make transcriptional level programming possible. Therefore, predictive classification of enhancers has important practical implications. However, since enhancers are cis-acting, their position relative to the target gene is highly variable, which complicates the recognition and functional annotation of enhancers.

Disclosure of Invention

The invention aims to provide a method for accurately and efficiently predicting DNA enhancer elements and their strength. The PSTNP algorithm used by the invention extracts the position-specific information of trinucleotides, and the Kullback-Leibler (KL) divergence is then used to improve PSTNP so that the difference between the frequency matrices of the positive and negative samples is described more clearly. LASSO is then used to reduce the dimensionality of the features. Finally, a two-layer prediction model based on a support vector machine is constructed: the first layer determines whether a sequence is an enhancer, and the second layer predicts the strength level of each identified enhancer. The model achieves good prediction performance.

The invention addresses the identification of DNA enhancer elements and the prediction of their strength. The technical solution is as follows:

A method for identifying a DNA enhancer element based on sequence frequency information, the method being based on a two-layer DNA enhancer element prediction model constructed with a support vector machine, the prediction model being generated by the following steps:

step (1): constructing a DNA enhancer sequence data set from the chromatin database information of the cell line;

step (2): processing the DNA enhancer sequence data set by a PSTNP algorithm to obtain DNA enhancer information of a trinucleotide sequence with position specificity;

step (3): optimizing the trinucleotide sequence information of the DNA enhancer information by a Kullback-Leibler divergence algorithm;

step (4): performing dimensionality reduction on the feature data of the trinucleotide sequence of the DNA enhancer information by a LASSO algorithm.

Further, the position-specific trinucleotide composition information of the enhancer sequence in step (2) is generated by the following steps:

2.1. each 200 bp sequence sample S is written as:

$$S = N_1 N_2 \cdots N_l \cdots N_{200}$$

where $N_l$ denotes the nucleotide at position $l$ and is one of A, C, G, T;

2.2. position-specific information of the enhancer sequence is extracted with the k-mer method, taking k = 3;

2.3. the position-specific trinucleotide frequency information of the positive samples, $F^{+}$, is calculated as:

$$F^{+} = \begin{bmatrix} f^{+}_{1,1} & f^{+}_{1,2} & \cdots & f^{+}_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{+}_{4^k,1} & f^{+}_{4^k,2} & \cdots & f^{+}_{4^k,200-k+1} \end{bmatrix}$$

where $f^{+}_{i,j}$ denotes the frequency with which the $i$-th trinucleotide ($3mer_i$) occurs at position $j$ of the sequence in the positive samples, so that $f^{+}_{4^k,\,200-k+1}$ is the frequency of the $4^k$-th trinucleotide at position $200-k+1$, and $3mer_i$ ranges over AAA, AAC, …, TTT;

2.4. the position-specific trinucleotide frequency information of the negative samples, $F^{-}$, is calculated as:

$$F^{-} = \begin{bmatrix} f^{-}_{1,1} & f^{-}_{1,2} & \cdots & f^{-}_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{-}_{4^k,1} & f^{-}_{4^k,2} & \cdots & f^{-}_{4^k,200-k+1} \end{bmatrix}$$

where $f^{-}_{i,j}$ denotes the frequency with which the $i$-th trinucleotide ($3mer_i$) occurs at position $j$ of the sequence in the negative samples, so that $f^{-}_{4^k,\,200-k+1}$ is the frequency of the $4^k$-th trinucleotide at position $200-k+1$, and $3mer_i$ ranges over AAA, AAC, …, TTT.

Further, the trinucleotide sequence information of the DNA enhancer information is optimized in step (3) as follows:

3.1. the optimization of the sequence information with the KL divergence is expressed as:

$$Z = \mathrm{KL}\left(F^{+}, F^{-}\right) = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ z_{4^k,1} & z_{4^k,2} & \cdots & z_{4^k,200-k+1} \end{bmatrix}$$

where $F^{+}$ and $F^{-}$ denote the distributions of the frequency matrices obtained from the positive and negative sample frequency information, respectively; $z_{4^k,\,200-k+1}$ denotes the degree of difference between the positive and negative sample frequencies of the $4^k$-th trinucleotide ($3mer_i$) at position $200-k+1$ of the sequence, and $3mer_i$ ranges over AAA, AAC, …, TTT;

3.2. finally, each sequence sample S can be represented as:

$$S = [\phi_1, \phi_2, \ldots, \phi_w, \ldots, \phi_{200-k+1}]^{T}$$

where T is the transpose operator and $\phi_w$ is defined as:

$$\phi_w = \begin{cases} z_{1,w}, & N_w N_{w+1} N_{w+2} = \mathrm{AAA} \\ z_{2,w}, & N_w N_{w+1} N_{w+2} = \mathrm{AAC} \\ \;\vdots & \\ z_{4^k,w}, & N_w N_{w+1} N_{w+2} = \mathrm{TTT} \end{cases}$$

where $z_{4^k,w}$ denotes the degree of difference between the positive and negative sample frequencies of the $4^k$-th trinucleotide ($3mer_i$) at position $w$ of the sequence, and $3mer_i$ ranges over AAA, AAC, …, TTT.

Advantageous effects

The invention uses sequence frequency information to identify DNA enhancer elements and predict their strength. Trinucleotides express sequence information well, so the position-specific trinucleotide information of the enhancer sequences is extracted as features with the PSTNP algorithm, and PSTNP is improved with the Kullback-Leibler (KL) divergence, which widens the difference between the discrete distributions of the positive and negative sample frequency matrices. The LASSO algorithm removes redundant features while retaining as much useful feature information as possible. Finally, the method constructs a two-layer prediction model based on a support vector machine: the first layer judges whether a sequence is an enhancer, and the second layer predicts the strength level of each identified enhancer. The model achieves good prediction performance, with prediction accuracy higher than that of the existing models it was compared with, and it is of significance for research on the recognition and classification of DNA enhancer elements.

Drawings

FIG. 1 is a flow chart of the computational process of the present invention;

FIG. 2 is a comparison of the performance of the six feature extraction methods over different classification algorithms;

FIG. 3 is a comparison of the performance of two information-theoretic algorithms employed in improving the PSTNP method;

FIG. 4 is a comparison of the performance of five feature selection algorithms;

FIG. 5 shows the dimension selection of the LASSO algorithm in data dimension reduction;

FIG. 6 is a comparison of the performance of the five classification algorithms on the reduced-dimension features;

FIG. 7 is a performance comparison with three existing enhancer prediction models.

Detailed Description

The invention is described in detail below with reference to the accompanying drawings.

An enhancer is a short DNA segment that regulates transcription levels by recruiting transcription factors, forming transcription complexes, and binding to promoter sites. Predicting enhancers in DNA sequences can help researchers in biology find the cause of abnormally elevated transcription levels, and enhancers of different strengths make it possible to program transcription levels. At present, enhancers are identified mainly through biological experiments, which are time-consuming and labor-intensive; predicting enhancers with machine learning methods is easier and faster.

The basic idea of the invention is as follows: extract the position-specific information of the enhancer sequences, optimize and improve the features, and construct a two-layer prediction model based on a support vector machine. The first layer judges whether a sequence is an enhancer; the second layer predicts the strength level of each identified enhancer.
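The two-layer arrangement can be sketched in a few lines. This is a minimal illustration and not the patented implementation: the class name `TwoLayerEnhancerModel`, the RBF kernel, the label coding (1 = enhancer, strength levels 0 and 1, -1 = not an enhancer) and the use of one shared feature matrix for both layers are all assumptions made for the example; it relies on scikit-learn's `SVC`.

```python
import numpy as np
from sklearn.svm import SVC


class TwoLayerEnhancerModel:
    """Hypothetical two-layer SVM: layer 1 detects enhancers, layer 2 grades them."""

    def __init__(self):
        # Kernel choice is an assumption; the source only says "support vector machine".
        self.layer1 = SVC(kernel="rbf")
        self.layer2 = SVC(kernel="rbf")

    def fit(self, X, is_enhancer, strength):
        X = np.asarray(X, dtype=float)
        is_enhancer = np.asarray(is_enhancer)
        strength = np.asarray(strength)
        # Layer 1 is trained on all samples: enhancer (1) vs non-enhancer (0).
        self.layer1.fit(X, is_enhancer)
        # Layer 2 is trained on true enhancers only, to predict their strength level.
        mask = is_enhancer == 1
        self.layer2.fit(X[mask], strength[mask])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        out = np.full(len(X), -1)  # -1 marks "not an enhancer"
        enhancers = self.layer1.predict(X) == 1
        if enhancers.any():
            # Only sequences accepted by layer 1 are passed to layer 2.
            out[enhancers] = self.layer2.predict(X[enhancers])
        return out
```

Calling `TwoLayerEnhancerModel().fit(X, is_enhancer, strength).predict(X_new)` returns one label per row. In practice the two layers may well use differently constructed features, since the positive and negative sets differ between the two tasks.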

The invention mainly comprises the following steps. First, a DNA enhancer sequence data set is constructed. Next, position-specific trinucleotide composition information of the DNA enhancer sequences is obtained with the PSTNP algorithm, the extracted sequence information is optimized with the Kullback-Leibler (KL) divergence, and the dimensionality of the extracted sequence features is reduced with the LASSO algorithm. Finally, a prediction model is built with the support vector machine algorithm to identify enhancers and their strength levels. The flow chart of the whole calculation process is shown in FIG. 1. The two-layer prediction model obtains better prediction results than other existing models. The specific process is as follows:

step (1): a DNA enhancer sequence data set is constructed from the chromatin database information of cell lines; this information comprises the chromatin state information of nine cell lines: H1ES, K562, GM12878, HepG2, HUVEC, HSMM, NHLF, NHEK and HMEC.

step (2): position-specific trinucleotide composition information of the DNA enhancer sequences is obtained with the PSTNP algorithm, as follows:

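2.1. each 200 bp sequence sample S is written as:

$$S = N_1 N_2 \cdots N_l \cdots N_{200}$$

where $N_l$ denotes the nucleotide at position $l$ and is one of A, C, G, T;

2.2. position-specific information of the enhancer sequence is extracted with the k-mer method, taking k = 3;

2.3. the position-specific trinucleotide frequency information of the positive samples, $F^{+}$, is calculated as:

$$F^{+} = \begin{bmatrix} f^{+}_{1,1} & f^{+}_{1,2} & \cdots & f^{+}_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{+}_{4^k,1} & f^{+}_{4^k,2} & \cdots & f^{+}_{4^k,200-k+1} \end{bmatrix}$$

where $f^{+}_{i,j}$ denotes the frequency with which the $i$-th trinucleotide ($3mer_i$) occurs at position $j$ of the sequence in the positive samples, so that $f^{+}_{4^k,\,200-k+1}$ is the frequency of the $4^k$-th trinucleotide at position $200-k+1$, and $3mer_i$ ranges over AAA, AAC, …, TTT;

2.4. the position-specific trinucleotide frequency information of the negative samples, $F^{-}$, is calculated as:

$$F^{-} = \begin{bmatrix} f^{-}_{1,1} & f^{-}_{1,2} & \cdots & f^{-}_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ f^{-}_{4^k,1} & f^{-}_{4^k,2} & \cdots & f^{-}_{4^k,200-k+1} \end{bmatrix}$$

where $f^{-}_{i,j}$ denotes the frequency with which the $i$-th trinucleotide ($3mer_i$) occurs at position $j$ of the sequence in the negative samples, so that $f^{-}_{4^k,\,200-k+1}$ is the frequency of the $4^k$-th trinucleotide at position $200-k+1$, and $3mer_i$ ranges over AAA, AAC, …, TTT.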
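Steps 2.1 to 2.4 can be sketched as follows. This is an illustrative reading of the text rather than the inventors' code: the function names `kmer_index` and `position_frequency_matrix` are hypothetical, positions are 0-based, trinucleotides containing a base other than A, C, G, T are skipped (an assumption), and 5 bp toy sequences stand in for the 200 bp samples.

```python
from itertools import product

import numpy as np


def kmer_index(k=3):
    """Map each k-mer to its row index, in the order AAA, AAC, ..., TTT."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def position_frequency_matrix(sequences, k=3):
    """Return a (4**k, L-k+1) matrix of position-specific k-mer frequencies.

    Entry [i, j] is the fraction of sequences whose k-mer starting at
    0-based position j is the i-th k-mer.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    index = kmer_index(k)
    n_pos = length - k + 1
    counts = np.zeros((4 ** k, n_pos))
    for seq in sequences:
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        for j in range(n_pos):
            i = index.get(seq[j:j + k].upper())
            if i is not None:  # assumption: k-mers with ambiguous bases are skipped
                counts[i, j] += 1
    return counts / len(sequences)


# Toy data: 5 bp sequences stand in for the 200 bp samples of the text.
positives = ["ACGTA", "ACGTT", "ACGAA"]
negatives = ["TTTTT", "ACGTT", "GGGCC"]
F_pos = position_frequency_matrix(positives)  # plays the role of F+
F_neg = position_frequency_matrix(negatives)  # plays the role of F-
```

For real 200 bp samples with k = 3, each matrix has 64 rows and 198 columns. When the features are later used in cross-validation, the matrices would normally be computed from the training folds only, so that test labels do not leak into the features.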

step (3): the extracted sequence information is optimized with the Kullback-Leibler (KL) divergence.

The PSTNP algorithm is optimized with the KL divergence as follows:

3.1. the optimization of the sequence information with the KL divergence is expressed as:

$$Z = \mathrm{KL}\left(F^{+}, F^{-}\right) = \begin{bmatrix} z_{1,1} & z_{1,2} & \cdots & z_{1,200-k+1} \\ \vdots & \vdots & \ddots & \vdots \\ z_{4^k,1} & z_{4^k,2} & \cdots & z_{4^k,200-k+1} \end{bmatrix}$$

where $F^{+}$ and $F^{-}$ denote the distributions of the frequency matrices obtained from the positive and negative sample sets, respectively; $z_{4^k,\,200-k+1}$ denotes the degree of difference between the positive and negative sample frequencies of the $4^k$-th trinucleotide ($3mer_i$) at position $200-k+1$ of the sequence, and $3mer_i$ ranges over AAA, AAC, …, TTT;

3.2. each sequence sample S is represented as:

$$S = [\phi_1, \phi_2, \ldots, \phi_w, \ldots, \phi_{200-k+1}]^{T}$$

where T is the transpose operator and $\phi_w$ is defined as:

$$\phi_w = \begin{cases} z_{1,w}, & N_w N_{w+1} N_{w+2} = \mathrm{AAA} \\ z_{2,w}, & N_w N_{w+1} N_{w+2} = \mathrm{AAC} \\ \;\vdots & \\ z_{4^k,w}, & N_w N_{w+1} N_{w+2} = \mathrm{TTT} \end{cases}$$

where $z_{4^k,w}$ denotes the degree of difference between the positive and negative sample frequencies of the $4^k$-th trinucleotide ($3mer_i$) at position $w$ of the sequence, and $3mer_i$ ranges over AAA, AAC, …, TTT.
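The sketch below illustrates how a positive-negative difference matrix could be derived from F+ and F- and then used to encode a sequence as the vector of φ values. The text does not spell out the per-entry KL expression, so the pointwise term p·ln(p/q) with a small pseudocount `eps` is purely an assumption chosen for illustration, as are the function names `kl_difference_matrix` and `encode_sequence`.

```python
from itertools import product

import numpy as np


def kl_difference_matrix(f_pos, f_neg, eps=1e-6):
    """Pointwise KL-style difference between two frequency matrices.

    ASSUMPTION: each entry is p * ln(p / q), the per-element term of the
    KL divergence; eps avoids division by zero and log(0).
    """
    p = np.asarray(f_pos, dtype=float) + eps
    q = np.asarray(f_neg, dtype=float) + eps
    return p * np.log(p / q)


def encode_sequence(seq, z, k=3):
    """Encode a sequence as [phi_1, ..., phi_{L-k+1}] by looking up, at each
    position w, the row of z that belongs to the k-mer starting at w."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    z = np.asarray(z)
    n_pos = len(seq) - k + 1
    if z.shape != (4 ** k, n_pos):
        raise ValueError("z must have shape (4**k, len(seq) - k + 1)")
    features = []
    for w in range(n_pos):
        i = index.get(seq[w:w + k].upper())
        # assumption: k-mers with ambiguous bases contribute 0
        features.append(z[i, w] if i is not None else 0.0)
    return np.array(features)
```

With 200 bp samples and k = 3, `encode_sequence` yields a 198-dimensional feature vector per sequence. Entries where the positive frequency exceeds the negative frequency come out positive under this assumed form, and the reverse comes out negative.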

step (4): the dimensionality of the extracted sequence features is reduced with the LASSO algorithm.
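A minimal sketch of LASSO-based feature selection is given below. It uses scikit-learn's `Lasso` regressor on 0/1 labels and keeps the features with the largest non-zero coefficients. The function name `lasso_select`, the regularization strength `alpha`, and the `max_features` cap are assumptions for illustration; the text does not state how the final dimension was fixed.

```python
import numpy as np
from sklearn.linear_model import Lasso


def lasso_select(X, y, alpha=0.01, max_features=None):
    """Return the sorted column indices of features kept by LASSO.

    Features whose coefficient is shrunk to exactly zero are dropped; if
    max_features is given, only the largest |coefficients| are kept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    weights = np.abs(model.coef_)
    selected = np.flatnonzero(weights > 0)
    # Rank the surviving features by coefficient magnitude, largest first.
    selected = selected[np.argsort(weights[selected])[::-1]]
    if max_features is not None:
        selected = selected[:max_features]
    return np.sort(selected)
```

A call such as `X_reduced = X[:, lasso_select(X, y, max_features=52)]` would give a reduced matrix with at most 52 columns; in practice `alpha` would be tuned, for example by cross-validation.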

All prediction experiments use 5-fold cross-validation. First, for sequence feature extraction, six different methods were tried, including PSTNP, PseEIIP and PseKNC; FIG. 2 compares their performance across several classification algorithms, including KNN, Random Forest, SVM, GBDT and XGBoost. PSTNP performs best on all models, with an overall accuracy significantly higher than that of the other five methods, so PSTNP was chosen for feature extraction. Next, two information-theoretic methods were tested for improving the PSTNP algorithm, as shown in FIG. 3. The features processed with the KL divergence gave the best predictions (Acc: 82.28%), with a clear advantage over the original PSTNP method and the other strategies. For feature selection, five methods were tried: LASSO, Ridge, Elastic Net, MRMR and MRMD. As shown in FIGS. 4 and 5, the best prediction result (Acc: 84.23%) was obtained when the LASSO algorithm reduced the data to 52 dimensions.

After the features are processed according to this optimal scheme, different classification algorithms are used for prediction under 5-fold cross-validation, as shown in FIG. 6. SVM performs best on MCC (0.68) and also gives excellent results on Acc (84.23%).
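The evaluation protocol described here (5-fold cross-validation, reporting Acc and MCC for an SVM) can be sketched with scikit-learn as follows. The function name `evaluate_svm_5fold`, the stratified and shuffled folds, the fixed seed and the RBF kernel are assumptions made for the example, not details taken from the text.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import SVC


def evaluate_svm_5fold(X, y, seed=0):
    """Return accuracy and MCC of an SVM under 5-fold cross-validation."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Stratified folds keep the class ratio similar in every fold (assumption).
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Each sample is predicted by the model trained on the other four folds.
    pred = cross_val_predict(SVC(kernel="rbf"), X, y, cv=cv)
    return {"acc": accuracy_score(y, pred), "mcc": matthews_corrcoef(y, pred)}
```

Swapping `SVC` for another scikit-learn classifier in the same function would give a like-for-like comparison of classifiers on the reduced features.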

Using 5-fold cross-validation, the performance of different classifiers on the enhancer classification problem was compared. The invention was compared with three other classification methods on the same data set, as shown in FIG. 7. The results show that the iEnhancer-KL classifier provided by the invention clearly outperforms the other models. In particular, in the second layer, which identifies enhancer strength, Acc is almost 30% higher. This indicates that the method is effective.

In conclusion, the invention provides an improved feature extraction algorithm based on PSTNP that effectively describes the position-specific trinucleotide information of enhancer sequences. The invention then uses the LASSO algorithm to select features and remove redundant data. Finally, the invention uses a support vector machine prediction model to identify DNA enhancer sequences and their strength levels. It provides a useful method for enhancer prediction and identification, with a simple calculation process that is easy to implement and widely applicable.
