DNA replication initial region identification method based on word vector and convolutional neural network

Document No.: 1289192 Publication date: 2020-08-28

Reading note: This technology, "DNA replication initiation region identification method based on word vectors and a convolutional neural network", was designed by Yang Runtao, Wu Feng, Zhang Chengjin, Chen Jingui and Zhang Lina on 2020-04-24. Main content: In the method provided by this application, a DNA sequence is first segmented by continuous three-segment sequence segmentation to obtain its triplet nucleotides. After negative sampling, the triplet nucleotides are vectorized through iterative Word2vec training to obtain word vectors, and all word vectors are merged into a pre-training feature vector matrix that contains the pre-training feature vector of every triplet nucleotide. The segmented triplet nucleotides are arranged vertically and the pre-training feature vector of each is embedded to obtain a word embedding layer, which vectorizes the triplet nucleotide sequence features. A convolutional neural network is then obtained through convolution and pooling training, and the convolutional neural network with the word embedding layer performs deep mining and classification of ORI features, finally identifying the ORI. The identification accuracy of the application is greatly improved.

1. A DNA replication initial region identification method based on word vectors and convolutional neural networks is characterized by comprising the following steps:

randomly selecting an ORI sequence and a non-ORI sequence from a yeast biological DNA sequence database, and constructing a DNA sequence sample set;

segmenting the ORI sequence and the non-ORI sequence respectively by continuous three-segment sequence segmentation to obtain a positive sample set and a negative sample set, wherein the positive sample set and the negative sample set both comprise triplet nucleotides;

after negative sampling of the triplet nucleotides, performing iterative training based on Word2vec to obtain a pre-training feature vector matrix;

vertically arranging the triplet nucleotides contained in each sequence of the positive sample set, and then performing one-hot coding to obtain a one-hot coding matrix of the sequence, wherein the one-hot codes of the vertically arranged triplet nucleotides are used as an input layer;

vertically arranging the triplet nucleotides contained in each sequence in the positive sample set, and embedding the triplet nucleotides into the pre-training feature vector matrix to obtain a word embedding layer;

performing convolution, pooling and loss function training on the word embedding layer to obtain a convolutional neural network model;

and inputting the DNA sequence to be detected into the convolutional neural network model, and outputting the probability that the DNA sequence to be detected is the ORI sequence.

2. The method of claim 1, wherein the segmenting the ORI sequence and the non-ORI sequence respectively by continuous three-segment sequence segmentation to obtain a positive sample set and a negative sample set further comprises:

segmenting the ORI sequence and the non-ORI sequence respectively by interval three-segment sequence segmentation to obtain a second positive sample set and a second negative sample set.

3. The method of claim 1, wherein said negative sampling of said triplet of nucleotides comprises:

dividing the [0, 1] interval non-equidistantly according to the length of each triplet nucleotide to obtain a first [0, 1] interval with nodes {I_j}, j = 0, 1, ..., 64, in which the interval between two adjacent nodes is the position of the corresponding triplet nucleotide, L_i = (I_{i-1}, I_i], i = 1, 2, ..., 64;

dividing the [0, 1] interval equidistantly with {m_j}, j = 0, 1, ..., M, M > 64, as nodes to obtain a second [0, 1] interval;

projecting {m_j} onto the first [0, 1] interval, and establishing the mapping relationship between {m_j} and {L_i};

extracting a random number from the second [0, 1] interval, and mapping the random number to the first [0, 1] interval according to the mapping relationship to obtain a non-target triplet nucleotide;

and combining the target triplet nucleotide and the non-target triplet nucleotide to complete negative sampling of the triplet nucleotide.

4. The method of claim 3, wherein the non-equidistant division of the length of each triplet of nucleotides into a first [0, 1] interval comprises:

obtaining the length of each triplet nucleotide from its number of occurrences relative to the occurrences of all triplet nucleotides, where counter(·) represents the number of occurrences of a triplet nucleotide.

5. The method for identifying the DNA replication initial region based on the Word vector and the convolutional neural network as claimed in claim 1, wherein the obtaining of the pre-training feature vector matrix based on Word2vec iterative training comprises:

obtaining, according to an objective function, the word vector of the central triplet nucleotide at which the probability of predicting the central triplet nucleotide from the triplet nucleotides in its context is maximized;

representing the central triplet of nucleotides as a 300-dimensional feature vector by iteration;

and carrying out feature vector training on all the triplet nucleotides to obtain the pre-training feature vector matrix.

6. The method for identifying a DNA replication initiation region based on word vector and convolutional neural network of claim 5, wherein maximizing, according to the objective function, the probability of predicting the central triplet nucleotide from the triplet nucleotides in the context comprises:

the objective function is ∏_{w̃ ∈ context(w)} ∏_{u ∈ {w} ∪ NEG^{w̃}(w)} p(u | w̃), wherein w represents the central triplet nucleotide vector, w̃ represents each triplet nucleotide vector in the context, NEG^{w̃}(w) represents the negative sampling set of the central triplet nucleotide generated when processing w̃, u represents a triplet nucleotide vector in the set obtained by merging w with the negative sampling set NEG^{w̃}(w), and p(u | w̃) represents the probability of predicting the central triplet nucleotide from the triplet nucleotide in the current context.

7. The method for identifying a DNA replication origin region based on a word vector and convolutional neural network as claimed in claim 1, wherein the obtaining of the one-hot coding matrix of sequences by one-hot coding after vertically arranging the triplet nucleotides contained in each sequence of the positive sample set comprises:

the structure of the one-hot coding matrix is

8. The method for identifying a DNA replication initiation region based on a word vector and a convolutional neural network of claim 1, wherein the step of embedding the triplet of nucleotides contained in each sequence in the positive sample set into the pre-training feature vector matrix after vertical arrangement to obtain a word embedding layer comprises:

vertically arranging DNA sequences subjected to continuous three-part sequence word segmentation to obtain a top-down triple nucleotide combination;

for each triplet nucleotide, looking up the corresponding feature vector in the pre-training feature vectors one by one;

and combining the looked-up feature vectors to obtain a non-trainable word embedding layer.

9. The method for identifying a DNA replication initiation region based on a word vector and a convolutional neural network as claimed in claim 1, wherein the word embedding layer is obtained by vertically arranging triplet nucleotides contained in each sequence in the positive sample set and then embedding the triplet nucleotides into the pre-training feature vector matrix, further comprising:

vertically arranging DNA sequences subjected to continuous three-part sequence word segmentation to obtain a top-down triple nucleotide combination;

for each triplet nucleotide in the triplet nucleotide combination, using the weights linked to the position whose value is 1 in the corresponding one-hot coding matrix to obtain the pre-training feature vector corresponding to the triplet nucleotide;

and after the pre-training feature vectors of all the triplet nucleotides are obtained, obtaining a trainable word embedding layer.

10. The method for identifying a DNA replication initiation region based on a word vector and a convolutional neural network as claimed in claim 1, wherein the word embedding layer is obtained by vertically arranging triplet nucleotides contained in each sequence in the positive sample set and then embedding the triplet nucleotides into the pre-training feature vector matrix, further comprising:

the word embedding layer includes two layers, one being a trainable embedding layer and the other being a non-trainable embedding layer.

Technical Field

The application relates to the technical field of biotechnology and genetic engineering, in particular to a DNA replication initiation region identification method based on word vectors and a convolutional neural network.

Background

DNA replication is the first step in the transmission of genetic information and is of profound significance for biological research. DNA replication refers to the biological process in which, prior to cell division, a DNA duplex undergoes semi-conservative replication with each DNA strand serving as a parent (template) strand, thereby producing two daughter duplexes identical to the original DNA duplex. The study of DNA replication is therefore the basis for studying other aspects of biology and a primary task in studying life processes. Numerous biological experiments have shown that DNA replication starts from specific regions, each of which is called an ORI (Origin of Replication).

With the current state of biotechnology, the position of the replication initiation region of an organism's DNA can be determined by biological experiments such as Chromatin Immunoprecipitation (ChIP), Chromatin Immunoprecipitation-chip (ChIP-chip) and Surface Plasmon Resonance (SPR). Although these methods can identify ORIs accurately, the post-genome era has produced a large number of sequenced genes, and experimental determination has prominent disadvantages in time and cost. How to move away from biological experiments and use computers to identify ORIs quickly and accurately is therefore a hotspot of current research.

Many efforts have been made to solve the ORI identification problem. In bacteria, the circular DNA contains only one ORI, and many algorithms can identify it. In eukaryotes, however, replication starts from multiple locations simultaneously to increase the efficiency of DNA replication, which greatly increases the difficulty of identification. In recent years, several methods have been proposed for ORI identification in yeast cells. For example, Chen et al. found that ORI regions are much less bendable and cleavable than non-ORI regions and, on this basis, proposed a computational model to identify ORIs in Saccharomyces cerevisiae. Li et al. generated the K-tuple Pseudo Nucleotide Composition (PseKNC) from sample sequences, extending the pseudo amino acid composition from proteins/peptides to the DNA/RNA field; using the pseudo nucleotide composition as features input to a support vector machine, they developed the "iORI-PseKNC" predictor, which achieved an accuracy of 83.72%. To eliminate redundant features and reduce the feature dimension, Dao et al. used F-score and minimum-Redundancy-Maximum-Relevance (mRMR) for feature selection and a support vector machine for identification, developing a predictor named "iORI-PseKNC2.0" for the yeast genome. Xiao et al. added dinucleotide position-specific propensity information to the pseudo nucleotide composition and proposed a random-forest-based predictor, "iRO-gPseKNC". Liu et al. took the GC asymmetry and the variable length of ORI sequences into account, extracted features in a 3-window mode and, combined with a random forest algorithm, proposed the "iRO-3wPseKNC" predictor, which identifies and predicts ORIs more comprehensively across four yeast genomes and handles sequences of variable length. Building on iRO-3wPseKNC, the "iRO-PsekGCC" predictor was then constructed by calculating the GC skew values in the sequence, combining them with PseKNC and extracting G and C in the sequence as features.

Each of these predictors has its advantages, and the ORI identification performance on yeast cells has gradually improved, which is of great significance in advancing ORI identification. However, the accuracy and other metrics of these methods still cannot meet practical requirements. Furthermore, these methods are all based on conventional machine learning and cannot deeply mine the features of ORI and non-ORI sequences.

Disclosure of Invention

The application provides a DNA replication initial region identification method based on word vectors and a convolutional neural network, and aims to solve the technical problem of low identification precision.

In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:

the application provides a DNA replication initial region identification method based on word vectors and a convolutional neural network, which comprises the following steps:

randomly selecting an ORI sequence and a non-ORI sequence from a yeast biological DNA sequence database, and constructing a DNA sequence sample set;

after negative sampling of the triplet nucleotides, performing iterative training based on Word2vec to obtain a pre-training feature vector matrix;

performing convolution, pooling and loss function training on the word embedding layer to obtain a convolutional neural network model;

and inputting the DNA sequence to be detected into the convolutional neural network model, and outputting the probability that the DNA sequence to be detected is the ORI sequence.

Optionally, the segmenting the ORI sequence and the non-ORI sequence by continuous three-segment sequence segmentation to obtain a positive sample set and a negative sample set further comprises:

segmenting the ORI sequence and the non-ORI sequence respectively by interval three-segment sequence segmentation to obtain a positive sample set and a negative sample set.

Optionally, the negative sampling of the triplet of nucleotides comprises:

dividing the [0, 1] interval non-equidistantly according to the length of each triplet nucleotide to obtain a first [0, 1] interval with nodes {I_j}, j = 0, 1, ..., 64, in which the interval between two adjacent nodes is the position of the corresponding triplet nucleotide, L_i = (I_{i-1}, I_i], i = 1, 2, ..., 64;

dividing the [0, 1] interval equidistantly with {m_j}, j = 0, 1, ..., M, as nodes to obtain a second [0, 1] interval;

projecting {m_j} onto the first [0, 1] interval, and establishing the mapping relationship between {m_j} and {L_i};

extracting a random number from the second [0, 1] interval, and mapping the random number to the first [0, 1] interval according to the mapping relationship to obtain a non-target triplet nucleotide;

and combining the target triplet nucleotide and the non-target triplet nucleotide to complete negative sampling of the triplet nucleotide.

Optionally, the non-equidistant partitioning of the length of each triplet of nucleotides into a first [0, 1] interval comprises:

obtaining the length of each triplet nucleotide from its number of occurrences relative to the occurrences of all triplet nucleotides, where counter(·) represents the number of occurrences of a triplet nucleotide.

Optionally, the obtaining of the pre-training feature vector matrix based on Word2vec iterative training includes:

representing the central triplet of nucleotides as a 300-dimensional feature vector by iteration;

and carrying out feature vector training on all the triplet nucleotides to obtain the pre-training feature vector matrix.

Optionally, maximizing, according to the objective function, the probability of predicting the central triplet nucleotide from the triplet nucleotides in the context includes:

the objective function is ∏_{w̃ ∈ context(w)} ∏_{u ∈ {w} ∪ NEG^{w̃}(w)} p(u | w̃), wherein w represents the central triplet nucleotide vector, w̃ represents each triplet nucleotide vector in the context, NEG^{w̃}(w) represents the negative sampling set of the central triplet nucleotide generated when processing w̃, u represents a triplet nucleotide vector in the set obtained by merging w with the negative sampling set NEG^{w̃}(w), and p(u | w̃) represents the probability of predicting the central triplet nucleotide from the triplet nucleotide in the current context.

Optionally, the obtaining of the one-hot coding matrix of sequences by one-hot coding after vertically arranging the triplet nucleotides contained in each sequence of the positive sample set includes:

the structure of the one-hot coding matrix is

Optionally, the vertically arranging the triplet nucleotides contained in each sequence in the positive sample set and then embedding the triplet nucleotides into the pre-training feature vector matrix to obtain a word embedding layer includes:

vertically arranging DNA sequences subjected to continuous three-part sequence word segmentation to obtain a top-down triple nucleotide combination;

for each triplet nucleotide, looking up the corresponding feature vector in the pre-training feature vectors one by one;

and combining the looked-up feature vectors to obtain a non-trainable word embedding layer.

Optionally, the embedding the triplet nucleotides contained in each sequence in the positive sample set into the pre-training eigenvector matrix after vertical arrangement to obtain a word embedding layer further includes:

vertically arranging DNA sequences subjected to continuous three-part sequence word segmentation to obtain a top-down triple nucleotide combination;

and after the pre-training of the feature vectors of all the triplets of nucleotides is completed, a trainable word embedding layer is obtained.

the word embedding layer includes two layers, one being a trainable embedding layer and the other being a non-trainable embedding layer.

Compared with the prior art, the beneficial effect of this application is:

As can be seen from the above technical scheme, in the method for identifying the DNA replication initiation region based on word vectors and a convolutional neural network, a DNA sequence is first segmented by three-segment sequence segmentation to obtain its triplet nucleotides. The segmented triplet nucleotides are then subjected to iterative Word2vec training to obtain word vectors, and all word vectors are merged into a pre-training feature vector matrix that contains the pre-training feature vector of every triplet nucleotide. The segmented triplet nucleotides are arranged vertically and the pre-training feature vector of each is embedded to obtain a word embedding layer, which vectorizes the triplet nucleotide sequence features. A convolutional neural network is then obtained through convolution and pooling training, and the convolutional neural network with the word embedding layer performs deep mining and classification of ORI features, finally identifying the ORI.

In the application, a DNA sequence is regarded as a "sentence": an ORI sequence is regarded as a correct sentence and a non-ORI sequence as an incorrect sentence. The DNA sequence is segmented into words in a way that highlights the positional relationships among nucleotides while retaining biological significance. The Word2vec framework is then used to vectorize all "words" in the "correct sentences", and the resulting word vectors serve as the feature vectors of the "words", from which the pre-training feature vector matrix required by the subsequent word embedding layer is constructed. Finally, a convolutional neural network framework is used to deeply mine the features of ORI and non-ORI sequences and the differences between them, and to perform the identification task. The identification accuracy of the application can therefore be greatly improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic flowchart of a method for identifying a DNA replication origin region based on a word vector and a convolutional neural network according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram illustrating an application of interval three-segment sequence segmentation provided in an embodiment of the present application;

FIG. 3 is a schematic diagram illustrating an application of negative sampling according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a network framework adopted by a DNA replication origin region identification method based on a word vector and a convolutional neural network according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a non-training word embedding operation process according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a process of trainable word embedding operation provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a dual-channel word embedding operation process provided in an embodiment of the present application;

FIG. 8 is a schematic diagram of the experimental results of continuous three-segment sequence segmentation under the three word embedding operation modes according to an embodiment of the present application;

FIG. 9 is a schematic diagram of the experimental results of interval three-segment sequence segmentation under the three word embedding operation modes according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The present application provides a DNA replication origin region identification method based on word vectors and convolutional neural networks, as shown in fig. 1, including:

s110: randomly selecting ORI sequences and non-ORI sequences from a yeast biological DNA sequence database, and constructing a DNA sequence sample set.

In the embodiments of the present application, DNA samples are selected from yeast gene sequence databases, and a reference data set comprising the DNA samples is constructed.

The positions of the DNA replication initiation regions of four yeast species (Saccharomyces cerevisiae, Schizosaccharomyces pombe, Kluyveromyces lactis and Pichia pastoris) are obtained from the DeOri 6.0 database, and the nucleotide composition of each region, namely its DNA sequence, is obtained from GenBank. Non-ORI sequences are randomly selected for each of the four species. First, sequence samples shorter than 50 bp are deleted, because sequences that are too short may not express the sequence features adequately. Then, sequence samples with a sequence similarity greater than 80% are deleted using CD-HIT, to avoid the long model training time and poor performance caused by sample redundancy. ORI sequences are used as positive samples and non-ORI sequences as negative samples; to prevent data imbalance, the numbers of positive and negative samples in the data set are reduced until they are balanced. A reference data set of the four yeast species is thus constructed, consisting specifically of:

where S1 includes 340 positive samples and 342 negative samples, S2 includes 342 positive samples and 338 negative samples, S3 includes 147 positive samples and 147 negative samples, and S4 includes 305 positive samples and 302 negative samples.

S120: segmenting the ORI sequences and the non-ORI sequences respectively by continuous three-segment sequence segmentation to obtain a positive sample set and a negative sample set, wherein the positive sample set and the negative sample set both comprise triplet nucleotides.

In the present embodiment, a DNA sequence is treated as a natural-language sentence; intuitively, it is a sentence composed of four kinds of nucleotides. Clearly, the four letters A, C, G and T alone cannot fully express the characteristic information of a sequence. The sequences therefore need to be segmented more rationally so that their constituent components are represented more effectively. Two sequence word segmentation methods are used in the application: continuous three-segment sequence segmentation and interval three-segment sequence segmentation.

In one embodiment, because the transcription, translation and other processes of DNA transfer information on the basis of triplet codons, triplet nucleotides are used as the constituent units of a sequence. A sliding window of size 3 is moved along each DNA sequence sample with a step of 1 to obtain the corresponding triplet nucleotides, so that a sequence can be represented by 4³ = 64 kinds of triplet nucleotides. For ease of description, this is named "continuous three-segment sequence segmentation".

One DNA sequence can be represented as:

D＝{R₁,R₂,...R_i,...,R_L}(i＝1,2,...,L)

In the above formula, D represents a DNA sample, L represents the length of the DNA sample, R₁ represents the first nucleotide of the DNA sample, R₂ represents the second nucleotide, R_i represents the i-th nucleotide, and so on. Moving a sliding window of size 3 along the sequence with a step of 1 yields the triplet nucleotide set corresponding to each DNA sequence:

D＝{(R₁R₂R₃),(R₂R₃R₄),...,(R_{L-2}R_{L-1}R_L)}

For the DNA sequence fragment "ACGTCGTA", the word segmentation sequence formed by continuous three-segment sequence segmentation is "ACG CGT GTC TCG CGT GTA". It can be seen that four consecutive triplet nucleotides in the word segmentation sequence cover two adjacent triplet nucleotides of the original sequence, both as whole units and through the partially overlapping triplet nucleotides that span their junction.
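As a non-limiting illustration, the sliding-window segmentation described above can be sketched in Python as follows. The function name `continuous_triplets` and its parameters are hypothetical names chosen for this sketch and do not appear in the application.

```python
def continuous_triplets(seq, k=3, step=1):
    """Slide a window of size k along seq with the given step (hypothetical helper)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


words = continuous_triplets("ACGTCGTA")
print(" ".join(words))  # ACG CGT GTC TCG CGT GTA
```

A sequence of length L yields L − 2 triplet nucleotides, which matches the set D given above.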

In another embodiment, interval three-segment sequence segmentation is adopted, in which one sequence is represented as three segmented sequences. After all samples are combined, the DNA sequences are composed of single nucleotides, dinucleotides and triplet nucleotides, giving 4 + 16 + 64 = 84 "words" in total. As shown in fig. 2, for the DNA sequence fragment "CAATCGAACAGTCTGC", the interval three-segment sequence segmentation, which reduces repetition, forms the three word segmentation sequences (1), (2) and (3) in fig. 2. The three word segmentation sequences consist of single nucleotides, dinucleotides and triplet nucleotides, and each of them in fact represents the original sequence; they differ only in the positions at which the sequence is divided.
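Because fig. 2 is not reproduced here, the exact division positions are not known. The sketch below shows one plausible reading, under the assumption that the three word segmentation sequences are the non-overlapping divisions starting at offsets 0, 1 and 2, with any leading or trailing remainder kept as a single nucleotide or dinucleotide. The function name `interval_segmentations` is hypothetical.

```python
def interval_segmentations(seq, k=3):
    """Return k non-overlapping segmentations of seq, one per starting offset.

    ASSUMPTION: remainders shorter than k at either end are kept as
    single-nucleotide or dinucleotide words; the source figure may differ.
    """
    result = []
    for offset in range(k):
        words = [seq[:offset]] if offset else []
        words += [seq[i:i + k] for i in range(offset, len(seq), k)]
        result.append(words)
    return result


for words in interval_segmentations("CAATCGAACAGTCTGC"):
    print(" ".join(words))
```

Under this assumption every word segmentation sequence concatenates back to the original sequence, consistent with the statement that the three sequences differ only in where the sequence is divided.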

In summary, word segmentation of a DNA sequence can be achieved by continuous three-segment sequence segmentation or by interval three-segment sequence segmentation: the former yields triplet nucleotides, and the latter yields single nucleotides, dinucleotides or triplet nucleotides.

S130: after negative sampling of the triplet nucleotides, performing iterative training based on Word2vec to obtain a pre-training feature vector matrix.

In the present example, there are 64 triplet nucleotides. To negatively sample a given triplet nucleotide (the target triplet nucleotide), a collection of samples that does not include it (called non-target triplet nucleotides) is generated, and the target triplet nucleotide and the non-target triplet nucleotides are combined to form the negative sampling set of the target triplet nucleotide.

The 64 triplet nucleotides occur different numbers of times in the data set: a high-frequency triplet nucleotide has a high probability of being selected as a negative sample, and a low-frequency one has a low probability. Negative sampling is therefore a weighted sampling problem, that is, how to ensure that all triplet nucleotides are selected with the same probability.

First, the length of each triplet nucleotide is determined by its proportion among all triplet nucleotide occurrences, so that the 64 triplet nucleotides are distributed over [0, 1]. Here counter(·) represents the number of occurrences of a triplet nucleotide. To prevent triplet nucleotides with a low proportion from receiving too short a length, the counts are raised to a power.

Next, the 64 triplet nucleotides are mapped onto the first [0, 1] interval. Let I_0 = 0 and I_k = Σ_{j=1}^{k} len(w_j), k = 1, 2, ..., 64, where w_j represents the j-th of the 64 triplet nucleotides and len(w_j) its length. Taking {I_j}, j = 0, 1, ..., 64, as dividing nodes gives a non-equidistant partition of the [0, 1] interval, in which the interval between two adjacent nodes is the position of the corresponding triplet nucleotide, L_i = (I_{i-1}, I_i], i = 1, 2, ..., 64. A second [0, 1] interval is then introduced, divided equidistantly by the nodes {m_j}, j = 0, 1, ..., M, as shown in fig. 3.

Projecting the interior nodes {m_j}, j = 1, 2, ..., M−1, onto the non-equidistant partition, as shown by the dashed lines in the figure, establishes the mapping relationship between {m_j} and {w_i}:

Table(j) = w_i, where m_j ∈ L_i, j = 1, 2, ..., M−1

When a triplet nucleotide is designated as the "target triplet nucleotide", a "non-target triplet nucleotide" is obtained simply by generating a random integer in [1, M−1] and looking it up in the table.

In one example, the occurrence counts of the 64 triplet nucleotides are known: each of the 63 triplet nucleotides other than AAA occurs once, and AAA occurs twice. To perform negative sampling, the length of each of the 64 triplet nucleotides is first found; only four of the length calculation formulas are shown here:

It is evident that the lengths of the 64 triplet nucleotides sum to 1. To negatively sample CCC (the number of negative samples can be specified, say 30), M is set to 1000, so that the [0, 1] interval is divided equally into 1000 parts, each subinterval having a length of 0.001. Random numbers are extracted from [1, 1000] and mapped onto the non-equidistant partition of the [0, 1] interval until 30 triplet nucleotides have been taken (if CCC is drawn, it is skipped and another is selected). The negative sampling result NEG(CCC) thus contains 30 non-target triplet nucleotides; which specific triplet nucleotides it contains is a random combination.
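The table-based negative sampling of this example can be sketched as follows, for illustration only. The exponent applied to the counts is not legible in this text, so the sketch assumes `power=0.75`, the customary Word2vec choice; the names `build_table` and `negative_samples` are likewise hypothetical. The sketch draws from the M − 1 interior nodes, following the mapping Table(j) defined above.

```python
import bisect
import itertools
import random

# All 4^3 = 64 triplet nucleotides in lexicographic order (AAA, AAC, ..., TTT).
TRIPLETS = ["".join(p) for p in itertools.product("ACGT", repeat=3)]


def build_table(counts, m=1000, power=0.75):
    """Map the interior equidistant nodes m_j = j/m (j = 1..m-1) to triplet nucleotides.

    ASSUMPTION: power=0.75 is the usual Word2vec exponent, not a value from the source.
    """
    weights = [counts[w] ** power for w in TRIPLETS]
    total = sum(weights)
    # Cumulative nodes I_1..I_64 of the non-equidistant partition (I_0 = 0 is implicit).
    nodes = list(itertools.accumulate(x / total for x in weights))
    nodes[-1] = 1.0  # guard against floating-point drift
    # m_j lies in L_i = (I_{i-1}, I_i]; bisect_left returns the 0-based index of w_i.
    return [TRIPLETS[bisect.bisect_left(nodes, j / m)] for j in range(1, m)]


def negative_samples(target, table, n, rng):
    """Draw n non-target triplet nucleotides, redrawing whenever the target is hit."""
    out = []
    while len(out) < n:
        w = table[rng.randrange(len(table))]
        if w != target:
            out.append(w)
    return out


counts = {w: 1 for w in TRIPLETS}
counts["AAA"] = 2
table = build_table(counts, m=1000)
neg_ccc = negative_samples("CCC", table, n=30, rng=random.Random(0))
print(len(table), len(neg_ccc))
```

Because AAA has the largest count, it occupies the longest subinterval and is hit by more equidistant nodes than any other triplet nucleotide, so it is drawn more often.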

After word segmentation and negative sampling are completed, the segmented triplet nucleotides are vectorized based on an objective function, specifically as follows:

The objective function is

$$g(w) = \prod_{\tilde{w} \in \mathrm{Context}(w)} \; \prod_{u \in \{w\} \cup \mathrm{NEG}^{\tilde{w}}(w)} p(u \mid \tilde{w})$$

where $w$ denotes the central triplet nucleotide vector, $\tilde{w}$ denotes each triplet nucleotide vector in the context, $\mathrm{NEG}^{\tilde{w}}(w)$ denotes the set obtained by negatively sampling the central triplet nucleotide while processing $\tilde{w}$, $u$ ranges over the triplet nucleotide vectors in the union of $w$ and its negative sampling set, and $p(u \mid \tilde{w})$ denotes the probability of predicting the central triplet nucleotide from the triplet nucleotide $\tilde{w}$ in the current context.

In the present example, for the DNA sequence fragment "ACGTCGTA", continuous three-part sequence word segmentation yields the word sequence "ACG CGT GTC TCG CGT GTA", and each triplet nucleotide is assigned a randomly initialized vector representation. If the central triplet w is TCG and the context range is set to 2, then Context(w) consists of CGT, GTC, CGT and GTA. The context is used to predict the central triplet, and the prediction probability is maximized. When predicting the central triplet TCG from CGT, M is set to 1000, TCG is negatively sampled to obtain the negative sampling set NEG^CGT(TCG), and the probability that CGT predicts TCG is calculated. In the same manner, the sets NEG^GTC(TCG), NEG^CGT(TCG) (which, owing to the randomness of negative sampling, may differ from the first NEG^CGT(TCG)) and NEG^GTA(TCG) are obtained for the other three context triplet nucleotides, and the probability of predicting TCG is determined for each. Maximizing the probability that the four context triplet nucleotides predict the central triplet TCG amounts, in essence, to matrix operations and multiplication of probabilities, and yields the 300-dimensional vector representation of the triplet nucleotide under which this probability is highest.
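The segmentation and context window of this example can be reproduced with a short sketch. The helper names `segment` and `context` are hypothetical and chosen for illustration only.

```python
def segment(sequence, k=3):
    """Continuous segmentation: slide a window of k nucleotides with stride 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def context(words, center_index, window=2):
    """Return up to `window` words on each side of the central word."""
    left = words[max(0, center_index - window):center_index]
    right = words[center_index + 1:center_index + 1 + window]
    return left + right


words = segment("ACGTCGTA")
ctx = context(words, center_index=3)

print(words)     # ['ACG', 'CGT', 'GTC', 'TCG', 'CGT', 'GTA']
print(words[3])  # TCG
print(ctx)       # ['CGT', 'GTC', 'CGT', 'GTA']
```

Each of the four context words is paired with the central word TCG and with its own negative sampling set when the objective is evaluated.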

Through multiple training iterations, each triplet nucleotide is represented as a 300-dimensional feature vector, so that a 64 x 300 dimensional feature vector matrix is obtained.

In summary, according to the objective function, the word vector of each central triplet nucleotide is obtained by maximizing the probability that its context triplet nucleotides predict it; through iteration, the central triplet nucleotide is represented as a 300-dimensional feature vector; and feature vector training is performed on all triplet nucleotides to obtain the pre-training feature vector matrix.

S140: vertically arrange the triplet nucleotides contained in each sequence of the positive sample set and apply one-hot encoding to obtain the one-hot encoding matrix of the sequence, wherein the one-hot codes of the vertically arranged triplet nucleotides serve as the input layer.

The convolutional neural network structure adopted in the present application is shown in fig. 4 and includes an input layer, a word embedding layer, a convolutional layer, and a fully connected and output layer, as detailed below:

Continuous three-part sequence word segmentation can produce 64 kinds of triplet nucleotides as sequence components. One-hot encoding of these 64 kinds yields a one-hot encoding matrix of dimension 64 x 64: the dimension of each vector is fixed at the number of triplet kinds, exactly one position is 1 and the rest are 0, and any two distinct vectors are orthogonal, as shown in the following formula:

The input of the network is the sequence obtained after word segmentation, arranged from top to bottom; each triplet nucleotide is mapped to its one-hot code, and the codes are arranged vertically.
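A minimal sketch of this input layer is given below. The alphabetical ordering of the 64 triplets and the names `TRIPLETS`, `INDEX`, `one_hot` and `encode` are assumptions made for illustration; the text does not state which position corresponds to which triplet.

```python
from itertools import product

# Assumed ordering: AAA, AAC, AAG, AAT, ACA, ... (alphabetical over A, C, G, T).
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def one_hot(triplet):
    """Return a 64-dimensional vector with a single 1 at the triplet's index."""
    vec = [0] * len(TRIPLETS)
    vec[INDEX[triplet]] = 1
    return vec


def encode(words):
    """Stack the one-hot codes of the segmented sequence from top to bottom."""
    return [one_hot(w) for w in words]


# Segmented sequence ACG CGA GAA AAC gives a 4 x 64 input matrix.
matrix = encode(["ACG", "CGA", "GAA", "AAC"])
```

Each row has exactly one non-zero entry, and the rows of two different triplets have a dot product of zero.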

S150: vertically arrange the triplet nucleotides contained in each sequence of the positive sample set and embed them using the pre-training feature vector matrix to obtain a word embedding layer.

In the embodiment of the application, Word2vec is used to train the feature vectors of all triplet nucleotides and generate the pre-training feature vector matrix. At the word embedding layer of the network, each triplet nucleotide is mapped by a lookup operation to its corresponding feature vector, and the vectors are arranged from top to bottom in correspondence with the structure of the input layer.
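The lookup operation can be illustrated with NumPy. In this sketch the matrix `pretrained` is filled with random numbers as a stand-in for the Word2vec result, and the row indices assume an alphabetical ordering of the 64 triplets; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 64 x 300 pre-training feature vector matrix.
pretrained = rng.normal(size=(64, 300))

# Rows of ACG, CGA, GAA, AAC under the assumed alphabetical ordering.
indices = np.array([6, 24, 32, 1])

# Multiplying the one-hot input by the matrix selects the matching rows,
# so the word embedding layer is equivalent to a direct row lookup.
one_hot_input = np.eye(64)[indices]      # shape (4, 64)
embedded = one_hot_input @ pretrained    # shape (4, 300)
looked_up = pretrained[indices]          # same result without the product
```

The equivalence of the two results is why the embedding layer is described as a query rather than a matrix multiplication.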

The word embedding operation can be divided into an untrained mode, a trainable mode and a dual-channel mode, wherein the dual-channel mode comprises a channel in the untrained mode and a channel in the trainable mode; these three modes are specifically:

the untrained mode word embedding operation, as shown in fig. 5, includes:

vertically arranging DNA sequences subjected to continuous three-part sequence word segmentation to obtain a top-down triplet nucleotide combination;

for each triplet of nucleotides, inquiring corresponding feature vectors from the pre-training feature vectors one by one;

and combining the inquired feature vectors to obtain an untrained word embedding layer.

The trainable mode word embedding operation, as shown in fig. 6, includes:

vertically arranging DNA sequences subjected to continuous three-part sequence word segmentation to obtain a top-down triplet nucleotide combination;

and, after the pre-training of the feature vectors of all triplet nucleotides is completed, using the pre-trained feature vectors to obtain a trainable word embedding layer, which continues to be updated during network training.

The dual-channel mode, as shown in FIG. 7, uses two word embedding layers, one trainable and one untrained, and subsequent operations such as convolution and pooling are carried out on each of them separately. The parameter settings of the trainable word embedding layer are consistent with the network parameters of the trainable mode, and those of the untrained word embedding layer are consistent with the network parameters of the untrained mode. The number of subsequent convolution kernels is therefore twice that of the other two modes, and the number of feature vectors after the convolution operation is 2 x 3 x 128. When the network parameters are updated from the prediction error during training, the word embedding layer of the trainable channel is updated as well, whereas the word embedding layer of the untrained channel is not.
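The difference between the two channels during a parameter update can be sketched as follows. The gradient, the learning rate and the row indices are placeholders chosen for illustration, and the plain gradient step is an assumption; the text does not specify the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 64 x 300 pre-training feature vector matrix.
pretrained = rng.normal(size=(64, 300))

frozen = pretrained.copy()     # untrained channel: never modified
trainable = pretrained.copy()  # trainable channel: updated with the network

indices = np.array([6, 24, 32, 1])           # hypothetical rows of one sequence
grad = rng.normal(size=(len(indices), 300))  # placeholder gradient
learning_rate = 0.01                         # placeholder value

# Only the trainable channel receives the update; ufunc.at also handles
# sequences in which the same triplet occurs more than once.
np.subtract.at(trainable, indices, learning_rate * grad)

# Each channel feeds 3 kernel sizes x 128 kernels, so two channels give 768.
features_per_channel = 3 * 128
total_features = 2 * features_per_channel
```

Rows of `trainable` that do not occur in the sequence stay equal to the pre-trained values, while `frozen` stays equal everywhere.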

S160: apply convolution, pooling and loss function training to the word embedding layer to obtain a convolutional neural network model.

In the embodiment of the present application, vertical convolution is performed, that is, the width of each convolution kernel is set to 300, the dimension of the feature vectors, so only the height needs to be chosen. Considering the continuous three-part sequence word segmentation method, two adjacent triplet nucleotides share two nucleotides, which appear at different positions in each. Therefore, when the height of the convolution kernel is set to 2, the convolution operates on two adjacent triplet nucleotide feature vectors and can extract the relationship between a triplet nucleotide and the next single nucleotide in the original sequence. Similarly, a convolution kernel of height 3 can extract the relationship between a triplet nucleotide and the next dinucleotide in the original sequence, and a convolution kernel of height 4 can extract the relationship between a triplet nucleotide and the next triplet nucleotide in the original sequence.

The heights of the convolution kernels used in this application are set to 2, 3, and 4, respectively. With 128 convolution kernels set for each size, more information can be extracted from the convolution region, i.e., between adjacent triplet nucleotides in the original sequence.
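The relationship between kernel height and the stretch of the original sequence that a kernel sees follows from the stride-1 overlap of the triplets, and can be checked with a short sketch. The function name `nucleotides_covered` is hypothetical.

```python
def nucleotides_covered(kernel_height, k=3):
    """h consecutive overlapping k-mers (stride 1) span h + k - 1 nucleotides."""
    return kernel_height + k - 1


words = ["ACG", "CGT", "GTC", "TCG", "CGT", "GTA"]  # from ACGTCGTA

for height in (2, 3, 4):
    window = words[:height]
    # Rebuild the covered stretch: first triplet plus the last base of the rest.
    stretch = window[0] + "".join(w[-1] for w in window[1:])
    extra = nucleotides_covered(height) - 3
    print(height, stretch, extra)
```

For heights 2, 3 and 4 the covered stretch is the first triplet followed by 1, 2 and 3 further nucleotides, which matches the single nucleotide, dinucleotide and triplet nucleotide cases.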

According to the embodiment of the application, a pooling operation is performed after the convolution layer, which reduces the dimensionality of the feature vectors formed by convolution, mitigates overfitting, and speeds up training. The pooling operation used in this application is max pooling, after which the pooled results are concatenated and used as the input of the following fully connected layer. The output dimension of the fully connected layer is 2, and the classification result of the input sample is obtained by applying Softmax.

S170: input the DNA sequence to be detected into the convolutional neural network model, and output the probability that the DNA sequence to be detected is an ORI sequence.

For example, for the sequence ACGAAC, segmented as ACG CGA GAA AAC, the pre-training feature vectors of the triplet nucleotides are embedded at the word embedding layer in the vertical order of the triplet nucleotides, constructing a 4 x 300 dimensional matrix as the word embedding layer. The convolution layer of the network has convolution kernels of 3 sizes (heights 2, 3 and 4), with 128 convolution kernels of each size. After random initialization, the kernels perform vertical convolution on the 4 x 300 dimensional word embedding layer, producing 3 x 128 feature maps as the input of the pooling layer. After max pooling, the feature maps are concatenated into one feature vector, which is used as the input of the fully connected layer. The output dimension of the fully connected layer is 2, i.e., the two classes of the present application, "ORI sequence" and "non-ORI sequence". Because the output of the fully connected layer does not lie between 0 and 1, Softmax is applied to convert the two class scores into two probability values between 0 and 1 that sum to 1. If the probability of the positive class is greater than 0.5, 1 is output and ACG CGA GAA AAC, i.e., ACGAAC, is predicted to be an ORI sequence (positive sample); if it is less than 0.5, 0 is output and ACGAAC is predicted to be a non-ORI sequence (negative sample).
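The forward pass described above can be sketched with NumPy for a single 4 x 300 input. All weights are random placeholders, no activation function or bias is applied after the convolution because none is mentioned in the text, and treating output index 1 as the "ORI sequence" class is an assumption.

```python
import numpy as np


def forward(embedded, kernels, fc_weight, fc_bias):
    """Vertical convolution, max pooling, concatenation, dense layer, Softmax."""
    pooled = []
    for height, weight in kernels.items():  # weight shape: (128, height, 300)
        n_windows = embedded.shape[0] - height + 1
        windows = np.stack(
            [embedded[i:i + height] for i in range(n_windows)]
        )  # shape: (n_windows, height, 300)
        # One feature map per kernel: inner product of kernel and each window.
        feature_maps = np.einsum("whd,fhd->fw", windows, weight)
        pooled.append(feature_maps.max(axis=1))  # max pooling, shape (128,)
    features = np.concatenate(pooled)            # shape (3 * 128,)
    logits = fc_weight @ features + fc_bias      # shape (2,)
    exp = np.exp(logits - logits.max())          # numerically stable Softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
embedded = rng.normal(size=(4, 300))  # stand-in for ACG CGA GAA AAC
kernels = {h: rng.normal(scale=0.01, size=(128, h, 300)) for h in (2, 3, 4)}
fc_weight = rng.normal(scale=0.01, size=(2, 3 * 128))
fc_bias = np.zeros(2)

probs = forward(embedded, kernels, fc_weight, fc_bias)
label = int(probs[1] > 0.5)  # 1 = ORI sequence, 0 = non-ORI sequence (assumed)
```

With four rows in the input, the kernels of height 2, 3 and 4 produce feature maps of length 3, 2 and 1 respectively, and max pooling reduces each of them to a single value.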

In the embodiment of the present application, 5 indexes are used to evaluate the prediction quality of the convolutional neural network: Acc (Accuracy), Sn (Sensitivity), Sp (Specificity), Mcc (Matthews Correlation Coefficient), and AUC (Area Under the Curve). AUC is the area under the ROC (Receiver Operating Characteristic) curve, whose abscissa is 1-Sp and whose ordinate is Sn. Generally, a curve lying above the line y = x indicates better performance, and the faster the ordinate approaches 1 as the abscissa increases, the better. AUC summarizes this graphical property in a single number: 0 < AUC < 1, and the closer AUC is to 1, the better the predictor performs. Acc, Sp, Sn and Mcc are defined by the following formulas:

where TP represents the number of successful predictions of the ORI sequences, TN represents the number of successful predictions of the non-ORI sequences, FP represents the number of incorrect predictions of the non-ORI sequences as ORI sequences, FN represents the number of incorrect predictions of the ORI sequences as non-ORI sequences, P represents the total number of ORI sequences in the dataset, and N represents the total number of non-ORI sequences in the dataset.
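The four count-based indexes can be computed as in the sketch below, which uses the standard definitions in terms of TP, TN, FP, FN, P and N; it is assumed that these agree with the formulas referenced above. The function name `metrics` and the counts in the usage line are hypothetical.

```python
import math


def metrics(tp, tn, fp, fn):
    """Standard Sn, Sp, Acc and Mcc from confusion-matrix counts."""
    p = tp + fn  # total number of ORI sequences
    n = tn + fp  # total number of non-ORI sequences
    sn = tp / p
    sp = tn / n
    acc = (tp + tn) / (p + n)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Mcc": mcc}


# Hypothetical counts, for illustration only.
result = metrics(tp=90, tn=80, fp=20, fn=10)
```

For these hypothetical counts, Sn is 0.9, Sp is 0.8 and Acc is 0.85.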

Ten-fold cross-validation is used: the whole data set is divided into ten parts; in each experiment one part is used as the test set and the other nine parts as the training set, yielding ten groups of TP, TN, FP and FN. The evaluation indexes are calculated and the ROC curves are drawn by combining these ten groups of data.
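The ten-fold split can be sketched as follows. The shuffling seed and the function name `ten_fold_indices` are assumptions; the text does not say how samples are assigned to the ten parts.

```python
import random


def ten_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


splits = list(ten_fold_indices(100))
```

Each of the ten experiments trains on nine parts and evaluates on the remaining part, so every sample is tested exactly once and the ten groups of TP, TN, FP and FN together cover the whole data set.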

In one embodiment, the ORI recognition results based on the 3 word embedding modes under continuous three-part sequence word segmentation are shown in table 1, and the ROC curves are shown in fig. 8.

TABLE 1 ORI recognition results for four yeasts based on continuous three-part sequence word segmentation

As can be seen from Table 1, the recognition performance for the four species is very good, and the results vary to some extent with the model mode. In particular, for Pichia pastoris, the accuracy of the recognition result based on the dual-channel mode reaches 96.7%. The ROC curves of ORI recognition for each species are shown below.

As can be seen from FIG. 8, the ROC curves of the four species are relatively full, and the AUC values are relatively good.

As can be seen from the table, when Word2vec is used to construct word vectors that are then input into a convolutional neural network for recognition, the model recognizes ORI sequences with excellent performance. The analysis attributes this to the following. Word2vec deeply mines the relationship between the triplet nucleotides in the sequence and expresses their biological significance numerically. In a convolutional neural network with a word embedding layer, the word embedding layer reflects the relative positions of the triplet nucleotides, i.e., positional information is incorporated. Finally, the convolutional neural network deeply mines the features of the sequence: combined with the sequence word segmentation method and suitable convolution kernel sizes, it extracts the features surrounding single nucleotides, dinucleotides and trinucleotides in the original sequence, which gives the features biological significance. The model can therefore learn the sequence better and achieves a good recognition effect. Combining the table data and the ROC curves, with accuracy (Acc) as the primary criterion and the AUC value as the secondary criterion, the model with the best ORI sequence recognition result is selected for each species: Saccharomyces cerevisiae (S1), dual-channel mode; Schizosaccharomyces pombe (S2), dual-channel mode; Kluyveromyces lactis (S3), untrained mode; and Pichia pastoris (S4), dual-channel mode.

In another example, interval three-part sequence word segmentation is used, under which each sequence is represented as consisting of 4 mononucleotides, 16 dinucleotides, and 64 trinucleotides. The feature vectors are obtained and the different modes are compared, as shown in table 2, and the ROC curves are shown in fig. 9.

TABLE 2 ORI recognition results for four yeasts based on interval three-part sequence word segmentation

As can be seen from Table 2, the recognition performance for the four species is high, and the results again vary to some extent with the model mode. The ORI sequence recognition results change to some extent for the different species, improving for some and declining for others. In particular, the ORI recognition accuracy for Saccharomyces cerevisiae based on the trainable mode reaches 97.5%, and that for Schizosaccharomyces pombe based on the dual-channel mode reaches 76.5%.

As can be seen from FIG. 9, the ROC curves of the four species are relatively full, and the AUC values are relatively good.

Combining the results obtained with the different model modes and word segmentation methods, the optimal combination for each species is selected as the ORI recognition result of the application and used for performance comparison with other technologies: Saccharomyces cerevisiae (S1), interval three-part sequence word segmentation + trainable mode; Schizosaccharomyces pombe (S2), interval three-part sequence word segmentation + dual-channel mode; Kluyveromyces lactis (S3), continuous three-part sequence word segmentation + untrained mode; and Pichia pastoris (S4), continuous three-part sequence word segmentation + dual-channel mode.

The present application uses a deep-learning-based approach, which differs considerably from the other machine-learning-based approaches used in the prior art. The performance of the model has been analyzed above; to better illustrate its advantages, the embodiments of the present application are compared with other methods.

Liu et al. proposed the predictor "iRO-3wPseKNC" to identify ORIs in the four species used in this application. Dao et al. proposed a predictor named "iORI-PseKNC2.0", which uses two-step feature selection to further select the extracted features, which are then input to a classifier for recognition. Furthermore, Liu et al. proposed "iRO-PsekGCC", which adds GC asymmetric distribution as a feature to obtain recognition results for two eukaryotes. For a fair comparison, the experimental results of the present application are compared, on the same data sets, with those of iRO-3wPseKNC, iRO-PsekGCC and iORI-PseKNC2.0, demonstrating the performance advantage of the ORI identification method of the present application, as shown in table 3.

TABLE 3 Performance comparison of the present application with other prior art

From table 3 above, it can be concluded that the ORI recognition method of the present application performs significantly better than the other methods in the ORI sequence recognition tasks for Saccharomyces cerevisiae, Kluyveromyces lactis, and Pichia pastoris, not only in terms of accuracy but also on the other indexes. In summary, the present application has significant performance advantages and can be put to practical use to some extent.

In conclusion, the model can better learn the sequence and has good recognition effect.

Since the above embodiments are described with reference to and in combination with one another, different embodiments share common portions, and identical or similar portions among the embodiments in this specification may be referred to each other and are not described again in detail here.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

The above-described embodiments of the present application do not limit the scope of the present application.
