Real nanopore sequencing signal filtering method and device based on neural network

Document No.: 685303. Publication date: 2021-04-30.

Reading note: this technology, "Real nanopore sequencing signal filtering method and device based on neural network", was designed by 陈为刚, 张鹏, 韩昌彩 and 赵毅强 on 2020-12-28. Abstract: The invention discloses a neural-network-based method and device for filtering real nanopore sequencing signals. The method comprises the following steps: inputting a nucleotide sequence into a nanopore sequencing K-mer pore model, converting it into the corresponding expected sequencing signal sequence, and repeating each signal value in the expected sequencing signal sequence multiple times to generate a real sequencing signal sequence to be tested; constructing a real sequencing signal processing model based on a bidirectional gated recurrent unit (GRU) neural network; constructing a loss function for the signal processing model, initializing the model parameters, and training the model parameters by minimizing the loss function with an adaptive optimizer; and inputting the real sequencing signal sequence to be tested into the trained signal processing model to filter it. The advantage of the invention is that it can accurately filter out the high-frequency components of the real sequencing signal sequence to be tested that are unrelated to the actual sequencing signal sequence while retaining the useful high-frequency components.

1. A neural-network-based method for filtering a real nanopore sequencing signal, characterized by comprising the following steps:

step one: inputting a nucleotide sequence into a nanopore sequencing K-mer pore model, converting it into the corresponding expected sequencing signal sequence, and repeating each signal value in the expected sequencing signal sequence multiple times to generate a real sequencing signal sequence to be tested;

step two: constructing a real sequencing signal processing model based on a bidirectional gated recurrent unit (GRU) neural network, wherein the signal processing model comprises three bidirectional GRU layers and one fully connected layer, the input of the signal processing model is the median-normalized real sequencing signal sequence to be tested, which is fed to the first bidirectional GRU layer, and the output of the signal processing model is the filtered signal sequence produced by the fully connected layer, which has the same length as the input real sequencing signal sequence to be tested;

step three: acquiring actual sequencing signal sequences to be trained output by a nanopore sequencing platform, calculating from each actual sequencing signal sequence to be trained the corresponding real sequencing signal sequence to be trained, using the actual sequencing signal sequences to be trained and their corresponding real sequencing signal sequences to be trained as the supervised training data required for training the parameters of the signal processing model, constructing a loss function of the signal processing model, initializing the parameters of the signal processing model, and training the model parameters by minimizing the loss function with an adaptive optimizer;

step four: inputting the real sequencing signal sequence to be tested into the trained signal processing model to filter the real sequencing signal sequence to be tested.
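Step two feeds the model a median-normalized signal, but the text does not define that normalization. The following is a minimal sketch under an assumption: it uses the median / median-absolute-deviation (MAD) scaling that is common in nanopore signal tooling. The function name `median_normalize` and the scale constant 1.4826 are illustrative choices, not taken from the patent.

```python
from statistics import median

def median_normalize(signal):
    """Center on the median and scale by the MAD (assumed definition)."""
    med = median(signal)
    mad = median([abs(v - med) for v in signal])
    if mad == 0:
        raise ValueError("signal has zero median absolute deviation")
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data.
    scale = 1.4826 * mad
    return [(v - med) / scale for v in signal]

normalized = median_normalize([1.0, 2.0, 3.0, 4.0, 5.0])
```

If the patent intends a different scaling (for example, subtracting the median only), just the last line of the function would change.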

2. The method according to claim 1, wherein step one comprises:

step 101: inputting a nucleotide sequence $X = x_1, x_2, \ldots, x_T$ containing $T$ bases into the nanopore sequencing K-mer pore model, where $x_T$ denotes the $T$-th base, and, starting from the first base of the input nucleotide sequence, extracting one K-mer at a time while advancing by one base, until the window starts at the K-th base from the end, so that the $T$ bases yield $T-K+1$ K-mers in total;

step 102: the nanopore sequencing K-mer pore model contains the expected current signal value of every K-mer; looking up the pore model for the $T-K+1$ K-mers in order generates the expected sequencing signal sequence $Y = y_1, y_2, \ldots, y_{T-5}$ corresponding to the input nucleotide sequence $X$ (that is, $T-K+1$ values when $K = 6$), where $y_i$ denotes the expected current signal value of the K-mer starting at position $i$ in $X$;

step 103: repeating each signal value in the expected sequencing signal sequence multiple times according to the distribution of signal-value repetition counts in actual sequencing signal sequences, to generate a real sequencing signal sequence to be tested whose length distribution is similar to that of actual sequencing signal sequences.
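Steps 101 to 103 can be illustrated with a short Python sketch. The pore model here is a toy 2-mer table with made-up current values, and the uniform repetition counts stand in for the empirical repetition distribution; `TOY_MODEL` and the function names are hypothetical.

```python
import random
from itertools import product

def kmers(seq, k):
    """Slide a width-k window one base at a time: T bases give T-k+1 k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def expected_signal(seq, pore_model, k):
    """Look up the expected current value of every k-mer, in order."""
    return [pore_model[kmer] for kmer in kmers(seq, k)]

def repeat_signal(expected, repeat_counts):
    """Repeat the i-th expected value repeat_counts[i] times (square wave)."""
    out = []
    for value, n in zip(expected, repeat_counts):
        out.extend([value] * n)
    return out

# Toy 2-mer pore model with invented current values; not real pore data.
TOY_MODEL = {"".join(p): 80.0 + 5.0 * i
             for i, p in enumerate(product("ACGT", repeat=2))}

rng = random.Random(0)
seq = "ACGTAC"
exp = expected_signal(seq, TOY_MODEL, k=2)
# Assumed stand-in for the empirical repetition distribution: 5 to 13 samples.
counts = [rng.randint(5, 13) for _ in exp]
square_wave = repeat_signal(exp, counts)
```

With a real 6-mer pore model, only the table and `k` change; the sliding-window and repetition logic stays the same.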

3. The method for filtering a real nanopore sequencing signal based on a neural network according to claim 1, wherein step two comprises:

step 201: taking the median-normalized real sequencing signal sequence $I = i_1, i_2, \ldots, i_T$ as the input of the first bidirectional GRU layer of the signal processing model, wherein the basic unit of the bidirectional GRU network consists of a forward-propagating gated recurrent unit and a backward-propagating gated recurrent unit; the forward output vector $h_t^f$ at time $t$ is calculated first, by the following formula:

[formula not preserved in the source text]

where $i_t$ is the input signal value at time $t$, $h_{t-1}^f$ is the forward output vector at time $t-1$, $\sigma$ is the sigmoid function $\sigma(z) = 1/(1+e^{-z})$, $z_t^f$ denotes a first intermediate variable, $r_t^f$ denotes a second intermediate variable, a third intermediate variable is also defined, $\odot$ denotes element-wise multiplication of two vectors of equal dimension, $W_z^f$ is a first weight matrix, $W_r^f$ is a second weight matrix, $W^f$ is a third weight matrix, and there are a first bias vector and a second bias vector; the backward output vector $h_t^r$ at time $t$ is then calculated by the following formula:

[formula not preserved in the source text]

where $h_{t+1}^r$ is the backward output vector at time $t+1$, fourth, fifth and sixth intermediate variables are defined analogously to the first, second and third, $W_z^r$ is a fourth weight matrix, $W_r^r$ is a fifth weight matrix, $W^r$ is a sixth weight matrix, and there are a third bias vector and a fourth bias vector; finally, the output vector $h_t$ at time $t$ is obtained by concatenating the forward output vector $h_t^f$ and the backward output vector $h_t^r$ at time $t$, that is, $h_t = h_t^f \,\|\, h_t^r$, where $\|$ denotes vector concatenation;

step 202: taking the output vectors of the first bidirectional GRU layer as the input vectors of the second bidirectional GRU layer, and the output vectors of the second bidirectional GRU layer as the input vectors of the third bidirectional GRU layer, wherein the second and third layers are computed in the same way as the first layer, while the weight matrices and bias vectors differ between layers;

step 203: taking the output vectors of the last of the three bidirectional GRU layers as the input of the fully connected layer, wherein the output of the fully connected layer is the filtered signal sequence $O = o_1, o_2, \ldots, o_T$, which has the same length as the input real sequencing signal sequence to be tested.
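The architecture of steps 201 to 203 can be sketched in plain Python. Because the update formulas are not preserved in this text, the sketch assumes the standard GRU update with three weight matrices and two gate biases per direction, which matches the symbols listed above but may differ in detail from the patent's formula. The hidden size, the initialization range and all function names are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(p, x, h):
    # Assumed standard GRU update (the patent's own formula is unavailable):
    #   z = sigmoid(Wz [h, x] + bz)   r = sigmoid(Wr [h, x] + br)
    #   c = tanh(W [r * h, x])        h_new = (1 - z) * h + z * c
    hx = h + x
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], hx), p["bz"])]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], hx), p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h)] + x
    c = [math.tanh(a) for a in matvec(p["W"], gated)]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def init_gru(rng, n_in, n_hid):
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_hid + n_in)]
                for _ in range(n_hid)]
    return {"Wz": mat(), "Wr": mat(), "W": mat(),
            "bz": [0.0] * n_hid, "br": [0.0] * n_hid}

def bigru_layer(fwd, bwd, xs, n_hid):
    h, f_out = [0.0] * n_hid, []
    for x in xs:                                  # forward pass, t = 1..T
        h = gru_step(fwd, x, h)
        f_out.append(h)
    h, b_out = [0.0] * n_hid, [None] * len(xs)
    for t in range(len(xs) - 1, -1, -1):          # backward pass, t = T..1
        h = gru_step(bwd, xs[t], h)
        b_out[t] = h
    return [f + b for f, b in zip(f_out, b_out)]  # h_t = h_t^f || h_t^r

def init_model(rng, n_hid, n_layers=3):
    layers, n_in = [], 1
    for _ in range(n_layers):
        layers.append((init_gru(rng, n_in, n_hid), init_gru(rng, n_in, n_hid)))
        n_in = 2 * n_hid                          # next layer sees both directions
    fc_w = [rng.uniform(-0.5, 0.5) for _ in range(2 * n_hid)]
    return {"layers": layers, "fc": (fc_w, 0.0), "n_hid": n_hid}

def filter_signal(model, signal):
    xs = [[v] for v in signal]                    # one scalar input per time step
    for fwd, bwd in model["layers"]:
        xs = bigru_layer(fwd, bwd, xs, model["n_hid"])
    w, b = model["fc"]                            # fully connected layer, one output per step
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in xs]

model = init_model(random.Random(0), n_hid=4)
filtered = filter_signal(model, [0.0, 0.0, 1.0, 1.0, 1.0, -0.5, -0.5])
```

The parameters here are untrained random values, so `filtered` only demonstrates that the output has one value per input sample; a practical implementation would more likely use a deep-learning framework.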

4. The method for filtering a real nanopore sequencing signal based on a neural network according to claim 1, wherein step three comprises:

step 301: the supervised training data of the signal processing model comprise actual sequencing signal sequences to be trained and the corresponding real sequencing signal sequences to be trained; first, a fast5 file output by a nanopore sequencing platform is read to obtain the actual sequencing signal sequences to be trained, and then the real sequencing signal sequence to be trained corresponding to each actual sequencing signal sequence to be trained is calculated using a continuous wavelet dynamic time warping algorithm;

step 302: constructing the loss function of the signal processing model as [formula not preserved in the source text], where cosh denotes the hyperbolic cosine function, $o_i$ denotes the $i$-th output signal of the signal processing model, and $r_i$ denotes the $i$-th signal in the actual sequencing signal sequence;

step 303: initializing the weight matrices and bias vectors of each bidirectional GRU layer from a uniform distribution on the interval [interval not preserved in the source text], where $n_i$ denotes the input vector dimension of each bidirectional GRU layer and $n_o$ denotes the output vector dimension of each bidirectional GRU layer;

step 304: iterating with the Adam adaptive optimizer to obtain the model parameters that minimize the loss function, completing the parameter training of the signal processing model.
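Steps 302 to 304 can be sketched as follows. Since the loss formula and the initialization interval are not preserved in this text, the sketch makes two explicit assumptions: the loss is the mean log-cosh of $o_i - r_i$, and the uniform interval is the Glorot (Xavier) range $\pm\sqrt{6/(n_i+n_o)}$. The Adam update is the standard one, shown on a single scalar parameter for brevity. All function names are hypothetical.

```python
import math
import random

def log_cosh_loss(outputs, targets):
    """Mean of log(cosh(o_i - r_i)); assumed form of the cosh-based loss."""
    total = 0.0
    for o, r in zip(outputs, targets):
        d = abs(o - r)
        # log(cosh(d)) = d + log(1 + exp(-2d)) - log(2), stable for large d
        total += d + math.log1p(math.exp(-2.0 * d)) - math.log(2.0)
    return total / len(outputs)

def glorot_uniform(rng, n_in, n_out):
    """Uniform init on [-limit, limit], limit = sqrt(6 / (n_in + n_out)) (assumed)."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

def adam_minimize(grad_fn, theta, steps, lr=0.05,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update applied to one scalar parameter."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

weights = glorot_uniform(random.Random(0), n_in=8, n_out=4)
# Toy problem: fit one offset to the target 3.0; d/dtheta log(cosh(theta - 3)) = tanh(theta - 3).
theta = adam_minimize(lambda th: math.tanh(th - 3.0), theta=0.0, steps=500)
```

The log-cosh loss behaves like a squared error for small residuals and like an absolute error for large ones, which makes it less sensitive to outlier samples than a plain squared error.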

5. The method according to claim 4, wherein step 301 comprises:

step 3011: reading a fast5 sequencing file output by a nanopore sequencing platform to obtain an actual sequencing signal sequence to be trained, wherein the fast5 file is in the HDF5 file format;

step 3012: performing base calling on the actual sequencing signal sequence to be trained to obtain a sequencing read, aligning the sequencing read to a reference genome sequence with a sequence alignment algorithm, obtaining from the alignment result the reference genome sequence fragment corresponding to the actual sequencing signal sequence to be trained, and calculating the expected sequencing signal sequence to be trained from the reference genome sequence fragment and the nanopore sequencing K-mer pore model;

step 3013: using the continuous wavelet dynamic time warping algorithm to complete a point-to-point mapping between the actual sequencing signal sequence to be trained and the expected sequencing signal sequence to be trained, obtaining from the mapping result the expected signal value corresponding to each signal value in the actual sequencing signal sequence to be trained, and thereby calculating the real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence to be trained.
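Step 3013 turns a point-to-point mapping into a square-wave training target. The sketch below substitutes plain dynamic time warping for the continuous wavelet dynamic time warping named in the claim, purely to keep the example short; the mapping-to-signal step is the part being illustrated. The function names and the toy signal values are hypothetical.

```python
def dtw_path(actual, expected):
    """Plain DTW (a simplified stand-in for cwDTW): list of (i, j) index pairs."""
    n, m = len(actual), len(expected)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(actual[i - 1] - expected[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                        # backtrack from the end
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

def real_from_path(path, expected, n_actual):
    """Give every actual sample the expected value it is mapped to."""
    real = [None] * n_actual
    for i, j in path:
        real[i] = expected[j]   # if several j map to one i, the last one wins
    return real

expected = [80.0, 100.0, 90.0]
actual = [79.0, 81.0, 80.0, 101.0, 99.0, 100.0, 91.0, 89.0]
real = real_from_path(dtw_path(actual, expected), expected, len(actual))
```

The resulting `real` sequence has the same length as `actual`, which is what allows the two to be used as an aligned input and target pair during training.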

6. A neural-network-based device for filtering a real nanopore sequencing signal, characterized by comprising:

a to-be-tested real sequencing signal sequence generation module, used for inputting a nucleotide sequence into a nanopore sequencing K-mer pore model, converting it into the corresponding expected sequencing signal sequence, and repeating each signal value in the expected sequencing signal sequence multiple times to generate a real sequencing signal sequence to be tested;

a signal processing model building module, used for constructing a real sequencing signal processing model based on a bidirectional gated recurrent unit (GRU) neural network, wherein the signal processing model comprises three bidirectional GRU layers and one fully connected layer, the input of the signal processing model is the median-normalized real sequencing signal sequence to be tested, which is fed to the first bidirectional GRU layer, and the output of the signal processing model is the filtered signal sequence produced by the fully connected layer, which has the same length as the input real sequencing signal sequence to be tested;

a signal processing model training module, used for acquiring actual sequencing signal sequences to be trained output by a nanopore sequencing platform, calculating from each actual sequencing signal sequence to be trained the corresponding real sequencing signal sequence to be trained, using the actual sequencing signal sequences to be trained and their corresponding real sequencing signal sequences to be trained as the supervised training data required for training the parameters of the signal processing model, constructing a loss function of the signal processing model, initializing the parameters of the signal processing model, and training the model parameters by minimizing the loss function with an adaptive optimizer;

and a filtering module, used for inputting the real sequencing signal sequence to be tested into the trained signal processing model to filter the real sequencing signal sequence to be tested.

7. The apparatus according to claim 6, wherein the to-be-tested real sequencing signal sequence generation module is further configured to perform:

step 101: inputting a nucleotide sequence $X = x_1, x_2, \ldots, x_T$ containing $T$ bases into the nanopore sequencing K-mer pore model, where $x_T$ denotes the $T$-th base, and, starting from the first base of the input nucleotide sequence, extracting one K-mer at a time while advancing by one base, until the window starts at the K-th base from the end, so that the $T$ bases yield $T-K+1$ K-mers in total;

step 102: the nanopore sequencing K-mer pore model contains the expected current signal value of every K-mer; looking up the pore model for the $T-K+1$ K-mers in order generates the expected sequencing signal sequence $Y = y_1, y_2, \ldots, y_{T-5}$ corresponding to the input nucleotide sequence $X$ (that is, $T-K+1$ values when $K = 6$), where $y_i$ denotes the expected current signal value of the K-mer starting at position $i$ in $X$;

step 103: repeating each signal value in the expected sequencing signal sequence multiple times according to the distribution of signal-value repetition counts in actual sequencing signal sequences, to generate a real sequencing signal sequence to be tested whose length distribution is similar to that of actual sequencing signal sequences.

8. The apparatus according to claim 6, wherein the signal processing model building module is further configured to perform:

step 201: taking the median-normalized real sequencing signal sequence $I = i_1, i_2, \ldots, i_T$ as the input of the first bidirectional GRU layer of the signal processing model, wherein the basic unit of the bidirectional GRU network consists of a forward-propagating gated recurrent unit and a backward-propagating gated recurrent unit; the forward output vector $h_t^f$ at time $t$ is calculated first, by the following formula:

[formula not preserved in the source text]

where $i_t$ is the input signal value at time $t$, $h_{t-1}^f$ is the forward output vector at time $t-1$, $\sigma$ is the sigmoid function $\sigma(z) = 1/(1+e^{-z})$, $z_t^f$ denotes a first intermediate variable, $r_t^f$ denotes a second intermediate variable, a third intermediate variable is also defined, $\odot$ denotes element-wise multiplication of two vectors of equal dimension, $W_z^f$ is a first weight matrix, $W_r^f$ is a second weight matrix, $W^f$ is a third weight matrix, and there are a first bias vector and a second bias vector; the backward output vector $h_t^r$ at time $t$ is then calculated by the following formula:

[formula not preserved in the source text]

where $h_{t+1}^r$ is the backward output vector at time $t+1$, fourth, fifth and sixth intermediate variables are defined analogously to the first, second and third, $W_z^r$ is a fourth weight matrix, $W_r^r$ is a fifth weight matrix, $W^r$ is a sixth weight matrix, and there are a third bias vector and a fourth bias vector; finally, the output vector $h_t$ at time $t$ is obtained by concatenating the forward output vector $h_t^f$ and the backward output vector $h_t^r$ at time $t$, that is, $h_t = h_t^f \,\|\, h_t^r$, where $\|$ denotes vector concatenation;

step 202: taking the output vectors of the first bidirectional GRU layer as the input vectors of the second bidirectional GRU layer, and the output vectors of the second bidirectional GRU layer as the input vectors of the third bidirectional GRU layer, wherein the second and third layers are computed in the same way as the first layer, while the weight matrices and bias vectors differ between layers;

step 203: taking the output vectors of the last of the three bidirectional GRU layers as the input of the fully connected layer, wherein the output of the fully connected layer is the filtered signal sequence $O = o_1, o_2, \ldots, o_T$, which has the same length as the input real sequencing signal sequence to be tested.

9. The apparatus according to claim 6, wherein the signal processing model training module is further configured to perform:

step 301: the supervised training data of the signal processing model comprise actual sequencing signal sequences to be trained and the corresponding real sequencing signal sequences to be trained; first, a fast5 file output by a nanopore sequencing platform is read to obtain the actual sequencing signal sequences to be trained, and then the real sequencing signal sequence to be trained corresponding to each actual sequencing signal sequence to be trained is calculated using a continuous wavelet dynamic time warping algorithm;

step 302: constructing the loss function of the signal processing model as [formula not preserved in the source text], where cosh denotes the hyperbolic cosine function, $o_i$ denotes the $i$-th output signal of the signal processing model, and $r_i$ denotes the $i$-th signal in the actual sequencing signal sequence;

step 303: initializing the weight matrices and bias vectors of each bidirectional GRU layer from a uniform distribution on the interval [interval not preserved in the source text], where $n_i$ denotes the input vector dimension of each bidirectional GRU layer and $n_o$ denotes the output vector dimension of each bidirectional GRU layer;

step 304: iterating with the Adam adaptive optimizer to obtain the model parameters that minimize the loss function, completing the parameter training of the signal processing model.

10. The apparatus according to claim 9, wherein step 301 comprises:

step 3011: reading a fast5 sequencing file output by a nanopore sequencing platform to obtain an actual sequencing signal sequence to be trained, wherein the fast5 file is in the HDF5 file format;

step 3012: performing base calling on the actual sequencing signal sequence to be trained to obtain a sequencing read, aligning the sequencing read to a reference genome sequence with a sequence alignment algorithm, obtaining from the alignment result the reference genome sequence fragment corresponding to the actual sequencing signal sequence to be trained, and calculating the expected sequencing signal sequence to be trained from the reference genome sequence fragment and the nanopore sequencing K-mer pore model;

step 3013: using the continuous wavelet dynamic time warping algorithm to complete a point-to-point mapping between the actual sequencing signal sequence to be trained and the expected sequencing signal sequence to be trained, obtaining from the mapping result the expected signal value corresponding to each signal value in the actual sequencing signal sequence to be trained, and thereby calculating the real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence to be trained.

Technical Field

The invention relates to the field of nanopore sequencing signal processing, in particular to a real nanopore sequencing signal filtering method and device based on a neural network.

Background

The introduction of next-generation sequencing technologies has enabled researchers to sequence DNA and RNA in a high-throughput manner, which has prompted numerous breakthroughs in genomics, transcriptomics and epigenomics. The most popular sequencing platforms currently on the market include Illumina, PacBio and Nanopore. Unlike other sequencing technologies, nanopore sequencing detects the base sequence directly through a nanopore embedded in a membrane that separates two electrolyte chambers. When a voltage difference is applied across the membrane, a single-stranded nucleotide sequence passes through the nanopore, and the bases inside the pore change its resistance, so the base sequence passing through the pore can be inferred from a time-varying current signal. Compared with the short-read sequencing technologies in common use (such as the Illumina MiSeq platform), nanopore sequencing has several advantages. First, it can generate sequencing reads from a single nucleotide molecule in real time, and combined with rapid library preparation it effectively shortens the time from sample pretreatment to sequencing data analysis. In addition, it can be used directly for RNA sequencing without prior reverse transcription or amplification. DNA or RNA molecules of any length can be sequenced, and the resulting long reads are very valuable because they simplify genome assembly and structural variation detection. Finally, nanopore sequencing platforms (e.g., the MinION sequencer) are much more portable than current short-read platforms, which allows sequencing to be performed outside traditional laboratory environments.

Despite these advantages, the complex sequencing environment yields sequencing signal sequences with a low signal-to-noise ratio, which poses challenges for downstream data analysis. During nanopore sequencing, the current measured as the base sequence moves through the nanopore is stored as 16-bit integer values. For the current MinION pore chemistry, a single DNA strand passes through the nanopore at an average speed of 450 bp/s while the current signal is sampled at 4 kHz, so each k-mer produces about 9 discrete signal values on average, although the exact number varies with fluctuations in the translocation rate of the motor protein. Converting the raw current signal sequence into a base sequence requires complex base-calling software. Several factors contribute to the low signal-to-noise ratio of the raw current signal sequence: the four bases of DNA or RNA have similar structures, so the raw electrical signals produced by different bases differ only slightly; the raw current signal is determined mainly by the 5 or 6 bases that occupy the nanopore simultaneously, so one signal measurement corresponds to one of $4^5$ or $4^6$ possible k-mers; the stochastic behavior of the motor protein that pulls the DNA or RNA through the pore makes the base sequence move unevenly; and the electrical signal does not change while a homopolymer passes through the nanopore, so long runs of identical bases are misinterpreted. These factors reduce the accuracy of the sequencing reads generated by nanopore sequencing and thereby limit its practical applications.
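The two figures quoted in this paragraph follow from simple arithmetic, shown here as a short Python check. The variable names are illustrative; the input numbers are the ones stated in the text.

```python
sample_rate_hz = 4000      # sampling frequency stated in the text
speed_bp_per_s = 450       # average translocation speed stated in the text

# Average number of current samples recorded while one base moves through the pore.
samples_per_base = sample_rate_hz / speed_bp_per_s

# Number of distinct k-mers over the alphabet {A, C, G, T} for k = 5 and k = 6.
kmer_counts = {k: 4 ** k for k in (5, 6)}

print(f"{samples_per_base:.2f} samples per base on average")  # 8.89
print(kmer_counts)                                            # {5: 1024, 6: 4096}
```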

With the rapid development of nanopore sequencing technology, analysis methods and tools for nanopore sequencing data are also emerging rapidly. Examples include the sequence alignment tools GraphMap, Minimap2 and MashMap2, the genome assembly tools Canu and Racon for assembling long nanopore reads, and the signal visualization tools BulkVis and SquiggleKit. Researchers can be expected to develop more data analysis methods and tools for nanopore sequencing in the near future. New algorithms and tools are usually evaluated on either experimental or simulated sequencing data. Compared with experimental data, simulated data greatly reduces development cost, lowers the difficulty of data analysis and improves development efficiency, so more accurate simulation of nanopore sequencing data helps in developing better-performing methods and tools for nanopore sequencing.

Simulation algorithms for nanopore sequencing data can be divided into nanopore sequencing read simulation and nanopore sequencing signal simulation. The three read simulators ReadSim, SiLiCO and NanoSim generate simulated sequencing reads from an input nucleotide sequence and a configuration file containing a set of preset parameters such as insertion rate, deletion rate, substitution rate, read length and quality score. They differ in that ReadSim uses a fixed configuration file, SiLiCO uses a user-provided configuration file, and NanoSim learns the configuration to be used in the simulation phase from user-provided actual sequencing data. Although read simulators can generate high-quality simulated reads, the electrical signal is the essence of nanopore sequencing technology, and read simulators cannot generate simulated nanopore sequencing signals, which limits their application.

DeepSimulator is the first simulation software to simulate the entire nanopore sequencing workflow, and it generates simulated nanopore sequencing signals and simulated reads at the same time. The nanopore sequencing workflow comprises three main stages: first, a nucleotide sample for the sequencing experiment is produced by sample preparation; next, a nanopore sequencing device (such as the MinION) measures the current signal sequence output as the nucleotide sequence passes through the nanopore, and the collected current signal sequence is usually stored in a fast5 file; finally, base-calling software converts the sequencing current signal sequence into a base sequence. Accordingly, the main framework of DeepSimulator consists of a sequence generation module, a signal generation module and a base-calling module. First, the sequence generation module randomly selects a starting position on the input reference genome sequence and generates shorter nucleotide sequences whose lengths follow the length distribution of actual sequencing reads. Then, the signal generation module generates the simulated nanopore sequencing signal sequence corresponding to each nucleotide sequence output by the sequence generation module, according to a known nanopore sequencing 6-mer pore model. Finally, the base-calling module converts the simulated sequencing signal sequence into a simulated sequencing read using base-calling software (Albacore, Guppy).

Of the three modules, the signal generation module is the core of DeepSimulator. It first generates the real sequencing signal sequence corresponding to the input nucleotide sequence from the known nanopore sequencing 6-mer pore model and the repetition-count distribution of actual sequencing signal sequences, then uses a low-pass filter to remove the high-frequency components embedded in the real sequencing signal sequence that are unrelated to the actual sequencing signal sequence, and finally adds Gaussian noise to the filtered signal sequence to output the final simulated sequencing signal sequence.

To generate a high-quality simulated nanopore sequencing signal sequence, a key step in the DeepSimulator signal generation module is filtering the real sequencing signal sequence with a low-pass filter. The real sequencing signal sequence consists of a series of square waves, whose spectrum is a sum of infinitely many sinusoids. To simulate the nanopore sequencing signal sequence more accurately, the high-frequency components embedded in the square waves of the real sequencing signal sequence must be filtered out. DeepSimulator implements the low-pass filtering by convolving a windowed-sinc function with the real sequencing signal sequence. Since the single-stranded nucleotide sequence passes through the nanopore at about 450 bp/s, the cut-off frequency of the low-pass filter should be greater than 450 Hz. With the cut-off frequency set to 950 Hz, the simulated sequencing signal sequence most similar to the actual sequencing signal sequence is obtained. A low-pass filter passes signals with frequencies below the selected cut-off frequency and attenuates signals with frequencies above it. Because the low-pass filter attenuates all components of the real sequencing signal sequence above the cut-off frequency, it cannot retain the high-frequency components that are related to the actual sequencing signal sequence. For some input nucleotide sequences this can cause a large difference between the simulated sequencing signal sequence and the actual sequencing signal sequence, which is inconvenient for users who care about the simulated signal output.
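The windowed-sinc low-pass filtering described above can be sketched as follows. This is not DeepSimulator's code: the Blackman window, the 31-tap kernel length and the edge padding are assumptions made for illustration, while the 4 kHz sampling rate and 950 Hz cut-off are the values quoted in the text.

```python
import math

def windowed_sinc_kernel(cutoff_hz, sample_rate_hz, num_taps):
    """Blackman-windowed sinc low-pass kernel, normalized to unit DC gain."""
    fc = cutoff_hz / sample_rate_hz              # cutoff in cycles per sample
    mid = (num_taps - 1) / 2.0
    kernel = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2.0 * fc if x == 0 else math.sin(2.0 * math.pi * fc * x) / (math.pi * x)
        window = (0.42 - 0.5 * math.cos(2.0 * math.pi * n / (num_taps - 1))
                  + 0.08 * math.cos(4.0 * math.pi * n / (num_taps - 1)))
        kernel.append(sinc * window)
    total = sum(kernel)
    return [k / total for k in kernel]

def convolve_same(signal, kernel):
    """Same-length convolution; edges are padded by repeating the end samples."""
    half = len(kernel) // 2
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    last = len(kernel) - 1
    return [sum(padded[i + j] * kernel[last - j] for j in range(len(kernel)))
            for i in range(len(signal))]

kernel = windowed_sinc_kernel(cutoff_hz=950, sample_rate_hz=4000, num_taps=31)
# Toy square-wave signal: three k-mer levels (invented values), 9 samples each.
square_wave = [80.0] * 9 + [100.0] * 9 + [90.0] * 9
smoothed = convolve_same(square_wave, kernel)
```

The filter removes every component above the cut-off regardless of whether it resembles real sequencing behavior, which is the limitation the invention addresses with a learned filter.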

Disclosure of Invention

The invention aims to solve the technical problem that the traditional low-pass filter cannot accurately filter out high-frequency components irrelevant to an actual sequencing signal sequence in a real sequencing signal sequence.

The invention solves the technical problems through the following technical means: a method for filtering a real nanopore sequencing signal based on a neural network, the method comprising the steps of:

step one: inputting a nucleotide sequence into a nanopore sequencing K-mer pore model, converting the nucleotide sequence into the corresponding expected sequencing signal sequence, and repeating each signal value in the expected sequencing signal sequence multiple times to generate a real sequencing signal sequence to be detected;

step two: constructing a real sequencing signal processing model based on a bidirectional gated recurrent unit (GRU) neural network, wherein the signal processing model comprises three layers of bidirectional GRU neural networks and a fully connected layer, the input of the signal processing model is the median-normalized real sequencing signal sequence, which is fed to the first-layer bidirectional GRU neural network, and the output of the signal processing model is the filtered signal sequence output by the fully connected layer, which has the same length as the input real sequencing signal sequence to be detected;

step three: acquiring actual sequencing signal sequences to be trained output by a nanopore sequencing platform, calculating real sequencing signal sequences to be trained corresponding to each actual sequencing signal sequence to be trained according to the actual sequencing signal sequences to be trained, taking the actual sequencing signal sequences to be trained and the real sequencing signal sequences to be trained corresponding to the actual sequencing signal sequences as supervision training data required by parameter training of a signal processing model, constructing a loss function of the signal processing model, initializing parameters of the signal processing model, and minimizing the loss function through an adaptive optimizer to realize training of the model parameters;

step four: and inputting the real sequencing signal sequence to be detected into a signal processing model for completing parameter training to realize filtering processing of the real sequencing signal sequence to be detected.

The method comprises the steps of converting an input nucleotide sequence into an expected sequencing signal sequence based on a known nanopore sequencing K-mer pore model, and repeating each signal value in the expected sequencing signal sequence for multiple times to generate a real sequencing signal sequence to be detected; then establishing a real sequencing signal sequence processing model, and finishing the training of the parameters of the signal processing model by using supervised training data; and finally, filtering the real sequencing signal sequence to be tested by using the trained signal processing model, and learning the time-frequency characteristic of the actual sequencing signal sequence by using a neural network, so that the high-frequency component irrelevant to the actual sequencing signal sequence in the real sequencing signal sequence to be tested can be accurately filtered, and the useful high-frequency component is reserved.
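Step two specifies that the model input is median-normalized, but the text does not give the normalization formula. One common convention, assumed here purely for illustration, subtracts the median and divides by the median absolute deviation (MAD); the function name and sample values are hypothetical.

```python
from statistics import median

def median_normalize(signal):
    """Center a signal on its median and scale by the median absolute deviation."""
    med = median(signal)
    mad = median(abs(s - med) for s in signal)
    if mad == 0:
        # A constant signal has no spread; return zeros rather than divide by zero.
        return [0.0 for _ in signal]
    return [(s - med) / mad for s in signal]

raw = [80.0, 82.0, 100.0, 98.0, 90.0]   # assumed current values
normalized = median_normalize(raw)
```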

Further, the first step comprises:

step 101: inputting a nucleotide sequence $X = x_1, x_2, \ldots, x_T$ with T bases into the nanopore sequencing K-mer pore model, where $x_T$ represents the T-th base; starting from the first base of the input nucleotide sequence, one K-mer is taken out each time the window moves by one base, until the window reaches the position of the K-th base from the end, so that T bases yield T-K+1 K-mers in total;

step 102: the nanopore sequencing K-mer pore model contains the expected current signal value corresponding to each K-mer; the expected current signal values corresponding to the T-K+1 K-mers are looked up in turn in the pore model, generating the expected sequencing signal sequence $Y = y_1, y_2, \ldots, y_{T-5}$ corresponding to the input nucleotide sequence X (the last index T-5 equals T-K+1 for K = 6), where $y_i$ represents the expected current signal value corresponding to the K-mer starting at position i in X;

step 103: and repeating each signal value in the expected sequencing signal sequence for multiple times according to the signal value repetition time distribution in the actual sequencing signal sequence to generate a to-be-detected real sequencing signal sequence with similar length distribution with the actual sequencing signal sequence.
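Steps 101 to 103 can be sketched as follows. The 3-mer pore model, its current values and the repeat counts are made up for the example; a real pore model would hold one entry per possible K-mer.

```python
def expected_signal(sequence, pore_model, k):
    """Slide a window of k bases one base at a time and look up each K-mer."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return [pore_model[kmer] for kmer in kmers]

def repeat_values(values, repeats):
    """Hold each expected value for the given number of samples (square wave)."""
    out = []
    for value, count in zip(values, repeats):
        out.extend([value] * count)
    return out

# Hypothetical 3-mer pore model (current values are invented for illustration).
toy_model = {"ACG": 85.0, "CGT": 92.5, "GTA": 78.0, "TAC": 101.0}
x = "ACGTAC"                               # T = 6 bases
y = expected_signal(x, toy_model, k=3)     # T - K + 1 = 4 expected values
square_wave = repeat_values(y, repeats=[3, 1, 2, 4])
```

With T = 6 and K = 3 the lookup yields T-K+1 = 4 expected values, and the repetition turns them into a 10-sample square wave.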

Further, the second step comprises:

step 201: taking the median-normalized real sequencing signal sequence $I = i_1, i_2, \ldots, i_T$ as the input vector of the first-layer bidirectional gated recurrent unit (GRU) neural network in the signal processing model, wherein the basic unit of the bidirectional GRU neural network consists of a forward-propagating GRU and a backward-propagating GRU. The forward output vector $\overrightarrow{h}_t$ at time t is calculated first, wherein $i_t$ is the input signal value at time t, $\overrightarrow{h}_{t-1}$ is the forward output vector at time t-1, $\sigma$ is the sigmoid function with $\sigma(z) = 1/(1 + e^{-z})$, and corresponding elements of two equal-dimension vectors are multiplied; the calculation uses a first, a second and a third intermediate variable, a first and a second weight matrix, a third weight matrix $W_f$, and a first and a second offset vector.

The backward output vector $\overleftarrow{h}_t$ at time t is then calculated, wherein $\overleftarrow{h}_{t+1}$ is the backward output vector at time t+1; the calculation uses a fourth, a fifth and a sixth intermediate variable, a fourth and a fifth weight matrix, a sixth weight matrix $W_r$, and a third and a fourth offset vector. Finally, the output vector $h_t$ at time t is obtained by concatenating the forward output vector and the backward output vector at time t, $h_t = \overrightarrow{h}_t \,\|\, \overleftarrow{h}_t$, where $\|$ denotes vector concatenation;

step 202: taking the output vector of the first-layer bidirectional GRU neural network as the input vector of the second-layer bidirectional GRU neural network, and the output vector of the second-layer bidirectional GRU neural network as the input vector of the third-layer bidirectional GRU neural network, wherein the second and third layers are computed in the same way as the first layer, while the weight matrices and offset vectors differ between layers;

step 203: taking the output vector of the last layer of the three-layer bidirectional GRU neural network as the input vector of the fully connected layer, wherein the output of the fully connected layer is the filtered signal sequence $O = o_1, o_2, \ldots, o_T$, which has the same length as the input real sequencing signal sequence to be detected.
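The text names the quantities used in step 201 without showing the formulas themselves. As an assumption, a standard gated recurrent unit formulation that is consistent with those quantities (two weight matrices with offset vectors for the gates, a third weight matrix $W_f$ without an offset vector, and three intermediate variables) would read as follows for the forward direction. The symbols $z_t$, $r_t$, $\tilde{h}_t$, $W_1$, $W_2$, $b_1$ and $b_2$ are illustrative names, not the patent's own notation.

```latex
\begin{aligned}
z_t &= \sigma\left(W_1\,[\overrightarrow{h}_{t-1},\, i_t] + b_1\right) \\
r_t &= \sigma\left(W_2\,[\overrightarrow{h}_{t-1},\, i_t] + b_2\right) \\
\tilde{h}_t &= \tanh\left(W_f\,[r_t \odot \overrightarrow{h}_{t-1},\, i_t]\right) \\
\overrightarrow{h}_t &= (1 - z_t) \odot \overrightarrow{h}_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Under the same assumption, the backward direction has the same form with $\overrightarrow{h}_{t-1}$ replaced by $\overleftarrow{h}_{t+1}$, the fourth to sixth weight matrices (the sixth being $W_r$) in place of the first to third, and the third and fourth offset vectors in place of the first and second.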

Further, the third step includes:

step 301: the supervised training data of the signal processing model comprises an actual sequencing signal sequence to be trained and a real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence, firstly, a fast5 file output by a nanopore sequencing platform is read to obtain the actual sequencing signal sequence to be trained, and then, a real sequencing signal sequence to be trained corresponding to each actual sequencing signal sequence to be trained is calculated by using a continuous wavelet dynamic time warping algorithm;

step 302: constructing the loss function of the signal processing model, which is expressed in terms of cosh, the hyperbolic cosine function, where $o_i$ represents the i-th output signal of the signal processing model and $r_i$ represents the i-th signal in the actual sequencing signal sequence;

step 303: initializing the weight matrices and bias vectors in each layer of the bidirectional gated recurrent unit (GRU) neural network from a uniform distribution over an interval determined by $n_i$ and $n_o$, where $n_i$ represents the input vector dimension of each layer of the bidirectional GRU neural network and $n_o$ represents the output vector dimension of each layer of the bidirectional GRU neural network;

step 304: iteratively computing, with the Adam adaptive optimizer, the model parameters that minimize the loss function, thereby completing the parameter training process of the signal processing model.
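The loss and the initialization interval of steps 302 and 303 are not spelled out in the text. The sketch below assumes a log-cosh loss averaged over the sequence and the Glorot (Xavier) uniform interval $\pm\sqrt{6/(n_i + n_o)}$, both of which fit the quantities named in the text but are not confirmed by it; the function names are hypothetical.

```python
import math
import random

def log_cosh_loss(outputs, targets):
    """Mean of log(cosh(o_i - r_i)): quadratic near zero, linear for large errors."""
    total = 0.0
    for o, r in zip(outputs, targets):
        d = abs(o - r)
        # Stable identity: log(cosh(d)) = d + log(1 + exp(-2d)) - log(2).
        total += d + math.log1p(math.exp(-2.0 * d)) - math.log(2.0)
    return total / len(outputs)

def glorot_uniform(n_in, n_out, rng):
    """Draw an n_out x n_in matrix from U(-limit, limit), limit = sqrt(6 / (n_in + n_out))."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
weights = glorot_uniform(n_in=4, n_out=8, rng=rng)
loss = log_cosh_loss([0.1, -0.3, 2.0], [0.0, -0.3, 1.0])
```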

Still further, the step 301 includes:

step 3011: reading a fast5 sequencing file output by a nanopore sequencing platform to obtain an actual sequencing signal sequence to be trained, wherein the fast5 file is in an HDF5 file format;

step 3012: performing base calling on the actual sequencing signal sequence to be trained to obtain a sequencing read, aligning the sequencing read to a reference genome sequence with a sequence alignment algorithm, obtaining from the alignment result the reference genome sequence fragment corresponding to the actual sequencing signal sequence to be trained, and calculating the expected sequencing signal sequence to be trained from the reference genome sequence fragment and the nanopore sequencing K-mer pore model;

step 3013: using the continuous wavelet dynamic time warping algorithm to complete the point-to-point mapping between the actual sequencing signal sequence to be trained and the expected sequencing signal sequence to be trained, obtaining from the mapping result the expected signal value corresponding to each signal value in the actual sequencing signal sequence to be trained, and thereby calculating the real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence to be trained.
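The mapping in step 3013 can be illustrated with plain dynamic time warping. Note that this swaps in classic DTW for the continuous wavelet DTW named in the text, which is a faster multi-resolution variant; the signal values and function names below are invented for the example.

```python
def dtw_path(actual, expected):
    """Classic O(n*m) dynamic time warping; returns the optimal alignment path."""
    n, m = len(actual), len(expected)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(actual[i - 1] - expected[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which expected index each sample maps to.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    return path[::-1]

def real_signal_from_mapping(actual, expected):
    """Replace each actual sample by the expected value it is aligned to."""
    mapped = [None] * len(actual)
    for a_idx, e_idx in dtw_path(actual, expected):
        mapped[a_idx] = expected[e_idx]   # later pairs overwrite earlier ones
    return mapped

actual = [80.4, 79.8, 80.1, 99.7, 100.3, 90.2, 89.9, 90.1]   # noisy measured samples
expected = [80.0, 100.0, 90.0]                                # pore-model values
real = real_signal_from_mapping(actual, expected)
```

The result is a square wave of the same length as the actual signal, which is the kind of input and target pair the supervised training data requires.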

The invention also provides a real nanopore sequencing signal filtering device based on the neural network, which comprises:

the real sequencing signal sequence generation module to be detected is used for inputting a nucleotide sequence in the nanopore sequencing K-mer pore model, converting the nucleotide sequence into an expected sequencing signal sequence corresponding to the nucleotide sequence, and repeating each signal value in the expected sequencing signal sequence for multiple times to generate a real sequencing signal sequence to be detected;

the signal processing model building module is used for constructing a real sequencing signal processing model based on a bidirectional gated recurrent unit (GRU) neural network, wherein the signal processing model comprises three layers of bidirectional GRU neural networks and a fully connected layer, the input of the signal processing model is the median-normalized real sequencing signal sequence, which is fed to the first-layer bidirectional GRU neural network, and the output of the signal processing model is the filtered signal sequence output by the fully connected layer, which has the same length as the input real sequencing signal sequence to be tested;

the signal processing model training module is used for acquiring an actual sequencing signal sequence to be trained output by the nanopore sequencing platform, calculating a real sequencing signal sequence to be trained corresponding to each actual sequencing signal sequence to be trained according to the actual sequencing signal sequence to be trained, taking the actual sequencing signal sequence to be trained and the real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence to be trained as supervision training data required by parameter training of the signal processing model, constructing a loss function of the signal processing model, initializing parameters of the signal processing model, and minimizing the loss function through the self-adaptive optimizer to realize training of the model parameters;

and the filtering processing module is used for inputting the real sequencing signal sequence to be detected into the signal processing model completing parameter training to realize filtering processing on the real sequencing signal sequence to be detected.

Further, the module for generating the true sequencing signal sequence to be tested is further configured to:

step 101: inputting a nucleotide sequence $X = x_1, x_2, \ldots, x_T$ with T bases into the nanopore sequencing K-mer pore model, where $x_T$ represents the T-th base; starting from the first base of the input nucleotide sequence, one K-mer is taken out each time the window moves by one base, until the window reaches the position of the K-th base from the end, so that T bases yield T-K+1 K-mers in total;

step 102: the nanopore sequencing K-mer pore model contains the expected current signal value corresponding to each K-mer; the expected current signal values corresponding to the T-K+1 K-mers are looked up in turn in the pore model, generating the expected sequencing signal sequence $Y = y_1, y_2, \ldots, y_{T-5}$ corresponding to the input nucleotide sequence X (the last index T-5 equals T-K+1 for K = 6), where $y_i$ represents the expected current signal value corresponding to the K-mer starting at position i in X;

step 103: and repeating each signal value in the expected sequencing signal sequence for multiple times according to the signal value repetition time distribution in the actual sequencing signal sequence to generate a to-be-detected real sequencing signal sequence with similar length distribution with the actual sequencing signal sequence.

Further, the signal processing model building module is further configured to:

step 201: taking the median-normalized real sequencing signal sequence $I = i_1, i_2, \ldots, i_T$ as the input vector of the first-layer bidirectional gated recurrent unit (GRU) neural network in the signal processing model, wherein the basic unit of the bidirectional GRU neural network consists of a forward-propagating GRU and a backward-propagating GRU. The forward output vector $\overrightarrow{h}_t$ at time t is calculated first, wherein $i_t$ is the input signal value at time t, $\overrightarrow{h}_{t-1}$ is the forward output vector at time t-1, $\sigma$ is the sigmoid function with $\sigma(z) = 1/(1 + e^{-z})$, and corresponding elements of two equal-dimension vectors are multiplied; the calculation uses a first, a second and a third intermediate variable, a first and a second weight matrix, a third weight matrix $W_f$, and a first and a second offset vector.

The backward output vector $\overleftarrow{h}_t$ at time t is then calculated, wherein $\overleftarrow{h}_{t+1}$ is the backward output vector at time t+1; the calculation uses a fourth, a fifth and a sixth intermediate variable, a fourth and a fifth weight matrix, a sixth weight matrix $W_r$, and a third and a fourth offset vector. Finally, the output vector $h_t$ at time t is obtained by concatenating the forward output vector and the backward output vector at time t, $h_t = \overrightarrow{h}_t \,\|\, \overleftarrow{h}_t$, where $\|$ denotes vector concatenation;

step 202: taking the output vector of the first-layer bidirectional GRU neural network as the input vector of the second-layer bidirectional GRU neural network, and the output vector of the second-layer bidirectional GRU neural network as the input vector of the third-layer bidirectional GRU neural network, wherein the second and third layers are computed in the same way as the first layer, while the weight matrices and offset vectors differ between layers;

step 203: taking the output vector of the last layer of the three-layer bidirectional GRU neural network as the input vector of the fully connected layer, wherein the output of the fully connected layer is the filtered signal sequence $O = o_1, o_2, \ldots, o_T$, which has the same length as the input real sequencing signal sequence to be detected.

Further, the signal processing model training module is further configured to:

step 301: the supervised training data of the signal processing model comprises an actual sequencing signal sequence to be trained and a real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence, firstly, a fast5 file output by a nanopore sequencing platform is read to obtain the actual sequencing signal sequence to be trained, and then, a real sequencing signal sequence to be trained corresponding to each actual sequencing signal sequence to be trained is calculated by using a continuous wavelet dynamic time warping algorithm;

step 302: constructing the loss function of the signal processing model, which is expressed in terms of cosh, the hyperbolic cosine function, where $o_i$ represents the i-th output signal of the signal processing model and $r_i$ represents the i-th signal in the actual sequencing signal sequence;

step 303: initializing the weight matrices and bias vectors in each layer of the bidirectional gated recurrent unit (GRU) neural network from a uniform distribution over an interval determined by $n_i$ and $n_o$, where $n_i$ represents the input vector dimension of each layer of the bidirectional GRU neural network and $n_o$ represents the output vector dimension of each layer of the bidirectional GRU neural network;

step 304: iteratively computing, with the Adam adaptive optimizer, the model parameters that minimize the loss function, thereby completing the parameter training process of the signal processing model.

Still further, the step 301 includes:

step 3011: reading a fast5 sequencing file output by a nanopore sequencing platform to obtain an actual sequencing signal sequence to be trained, wherein the fast5 file is in an HDF5 file format;

step 3012: performing base calling on the actual sequencing signal sequence to be trained to obtain a sequencing read, aligning the sequencing read to a reference genome sequence with a sequence alignment algorithm, obtaining from the alignment result the reference genome sequence fragment corresponding to the actual sequencing signal sequence to be trained, and calculating the expected sequencing signal sequence to be trained from the reference genome sequence fragment and the nanopore sequencing K-mer pore model;

step 3013: using the continuous wavelet dynamic time warping algorithm to complete the point-to-point mapping between the actual sequencing signal sequence to be trained and the expected sequencing signal sequence to be trained, obtaining from the mapping result the expected signal value corresponding to each signal value in the actual sequencing signal sequence to be trained, and thereby calculating the real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence to be trained.

The invention has the advantages that: the method comprises the steps of converting an input nucleotide sequence into an expected sequencing signal sequence based on a known nanopore sequencing K-mer pore model, and repeating each signal value in the expected sequencing signal sequence for multiple times to generate a real sequencing signal sequence to be detected; then establishing a real sequencing signal sequence processing model, and finishing the training of the parameters of the signal processing model by using supervised training data; and finally, filtering the real sequencing signal sequence to be tested by using the trained signal processing model, and learning the time-frequency characteristic of the actual sequencing signal sequence by using a neural network, so that the high-frequency component irrelevant to the actual sequencing signal sequence in the real sequencing signal sequence to be tested can be accurately filtered, and the useful high-frequency component is reserved.

Drawings

FIG. 1 is a schematic diagram of a method for filtering a real nanopore sequencing signal based on a neural network according to an embodiment of the present invention;

FIG. 2 is a schematic diagram showing the distribution of the number of signal value repetitions in an actual sequencing signal sequence of a real nanopore sequencing signal filtering method based on a neural network according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a neural network structure of a gated cyclic unit in a method for filtering a real nanopore sequencing signal based on a neural network according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a neural network structure of a three-layer bidirectional gated cyclic unit in a method for filtering a real nanopore sequencing signal based on a neural network according to an embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating a loss function drop in a parameter training process of a signal processing model in a real nanopore sequencing signal filtering method based on a neural network according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of supervised training data of a signal processing model in a real nanopore sequencing signal filtering method based on a neural network according to an embodiment of the present invention;

FIG. 7 is a schematic diagram showing a comparison between waveforms of output filtered signals of a low-pass filter of the prior art and a signal processing model of the present invention, wherein (A) in FIG. 7 is a schematic diagram of an output signal of the low-pass filter, and (B) in FIG. 7 is a schematic diagram of an output signal of the signal processing model of the present invention;

FIG. 8 is a schematic diagram showing a comparison between waveforms of a real sequencing signal, a DeepSimulator simulation sequencing signal and a simulation sequencing signal according to the present invention, wherein (A) in FIG. 8 is a schematic diagram showing a waveform of a real sequencing signal, (B) in FIG. 8 is a schematic diagram showing a waveform of a DeepSimulator simulation sequencing signal, and (C) in FIG. 8 is a schematic diagram showing a waveform of a simulation sequencing signal according to the present invention;

FIG. 9 is a schematic diagram showing the DTW distance comparison between the DeepSimulator simulation sequencing signal, the simulation sequencing signal of the present invention and the real sequencing signal;

FIG. 10 is a time-frequency diagram of continuous wavelet transform of a real sequencing signal, a DeepSimulator simulation sequencing signal and a simulation sequencing signal of the present invention, wherein (A) in FIG. 10 is a time-frequency diagram of a real sequencing signal, (B) in FIG. 10 is a time-frequency diagram of a DeepSimulator simulation sequencing signal, and (C) in FIG. 10 is a time-frequency diagram of a simulation sequencing signal of the present invention;

FIG. 11 is a schematic diagram showing PCC comparison between a DeepSimulator simulation sequencing signal, a simulation sequencing signal of the present invention, and a continuous wavelet transform time-frequency diagram of a real sequencing signal;

FIG. 12 is a schematic diagram showing the comparison of error characteristics of real sequencing reads, DeepSimulator simulation sequencing reads, and simulation sequencing reads of the present invention;

FIG. 13 is a schematic diagram showing the comparison of error characteristics of real sequencing reads, DeepSimulator simulation sequencing reads, simulation sequencing reads of the present invention, and simulation sequencing reads of a customized model of Klebsiella pneumoniae.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, a real nanopore sequencing signal filtering method based on a neural network is described, taking as an example sequencing data obtained on the MinION sequencing platform (R9.4 sequencing chip) of Oxford Nanopore Technologies: 4000 Lambda virus sequencing samples, 4000 Escherichia coli sequencing samples, 4467 Acinetobacter sequencing samples, 15178 Klebsiella pneumoniae sequencing samples, 16742 Serratia marcescens sequencing samples and 11047 Staphylococcus aureus sequencing samples. The Lambda virus sequencing samples and the Escherichia coli sequencing samples are taken as the training data set, namely the real sequencing signal sequences to be trained, and the remaining sequencing data are taken as the test data set, namely the real sequencing signal sequences to be tested. The method includes the following steps:

step S1: inputting a nucleotide sequence into a nanopore sequencing K-mer pore model, converting the nucleotide sequence into an expected sequencing signal sequence corresponding to the nucleotide sequence, and repeating each signal value in the expected sequencing signal sequence for multiple times to generate a real sequencing signal sequence to be detected; the specific process is as follows:

step 101: inputting a nucleotide sequence $X = x_1, x_2, \ldots, x_T$ with T bases into the nanopore sequencing K-mer pore model, where $x_T$ represents the T-th base; starting from the first base of the input nucleotide sequence, one K-mer is taken out each time the window moves by one base, until the window reaches the position of the K-th base from the end, so that T bases yield T-K+1 K-mers in total. It should be noted that, for a given integer K, a short string of K consecutive bases taken from the input nucleotide sequence is called a K-mer; in the present example K is 6, so the invention uses a nanopore sequencing 6-mer pore model and takes out one 6-mer each time the window moves by one base.

step 102: the nanopore sequencing K-mer pore model contains the expected current signal value corresponding to each K-mer; the expected current signal values corresponding to the T-K+1 K-mers are looked up in turn in the pore model, generating the expected sequencing signal sequence $Y = y_1, y_2, \ldots, y_{T-5}$ corresponding to the input nucleotide sequence X, where $y_i$ represents the expected current signal value corresponding to the K-mer starting at position i in X;

step 103: as shown in fig. 2, according to the signal value repetition number distribution in the actual sequencing signal sequence, each signal value in the expected sequencing signal sequence is repeated multiple times, so as to generate a to-be-detected real sequencing signal sequence having a similar length distribution with the actual sequencing signal sequence.
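Step 103 draws the repetition count of each expected value from the distribution observed in actual signals (fig. 2). A minimal sketch of that sampling follows; the observed event lengths, the expected values and the function names are hypothetical, and the empirical histogram stands in for whatever distribution the figure shows.

```python
import random
from collections import Counter

def repeat_count_distribution(actual_event_lengths):
    """Turn observed per-event sample counts into (values, weights) for sampling."""
    counts = Counter(actual_event_lengths)
    values = sorted(counts)
    weights = [counts[v] for v in values]
    return values, weights

def build_square_wave(expected_values, values, weights, rng):
    """Repeat each expected value a random number of times drawn from the distribution."""
    repeats = rng.choices(values, weights=weights, k=len(expected_values))
    signal = []
    for value, count in zip(expected_values, repeats):
        signal.extend([value] * count)
    return signal, repeats

# Hypothetical event lengths (samples per K-mer) measured from actual reads.
observed = [5, 7, 8, 8, 9, 9, 9, 10, 12, 15]
values, weights = repeat_count_distribution(observed)
rng = random.Random(42)
signal, repeats = build_square_wave([85.0, 92.5, 78.0], values, weights, rng)
```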

Step S2: constructing a real sequencing signal processing model based on a bidirectional gated recurrent unit (GRU) neural network, wherein the signal processing model comprises three layers of bidirectional GRU neural networks and a fully connected layer, the input of the signal processing model is the median-normalized real sequencing signal sequence, which is fed to the first-layer bidirectional GRU neural network, and the output of the signal processing model is the filtered signal sequence output by the fully connected layer, which has the same length as the input real sequencing signal sequence to be detected; the specific process is as follows:

step 201: taking the median-normalized real sequencing signal sequence $I = i_1, i_2, \ldots, i_T$ as the input vector of the first-layer bidirectional gated recurrent unit (GRU) neural network in the signal processing model, wherein the basic unit of the bidirectional GRU neural network consists of a forward-propagating GRU and a backward-propagating GRU. As shown in fig. 3, the forward output vector $\overrightarrow{h}_t$ at time t is calculated first, wherein $i_t$ is the input signal value at time t, $\overrightarrow{h}_{t-1}$ is the forward output vector at time t-1, $\sigma$ is the sigmoid function with $\sigma(z) = 1/(1 + e^{-z})$, and corresponding elements of two equal-dimension vectors are multiplied; the calculation uses a first, a second and a third intermediate variable, a first and a second weight matrix, a third weight matrix $W_f$, and a first and a second offset vector.

The backward output vector $\overleftarrow{h}_t$ at time t is then calculated, wherein $\overleftarrow{h}_{t+1}$ is the backward output vector at time t+1; the calculation uses a fourth, a fifth and a sixth intermediate variable, a fourth and a fifth weight matrix, a sixth weight matrix $W_r$, and a third and a fourth offset vector. Finally, the output vector $h_t$ at time t is obtained by concatenating the forward output vector and the backward output vector at time t, $h_t = \overrightarrow{h}_t \,\|\, \overleftarrow{h}_t$, where $\|$ denotes vector concatenation;

step 202: as shown in fig. 4, taking the output vector of the first-layer bidirectional GRU neural network as the input vector of the second-layer bidirectional GRU neural network, and the output vector of the second-layer bidirectional GRU neural network as the input vector of the third-layer bidirectional GRU neural network, wherein the second and third layers are computed in the same way as the first layer, while the weight matrices and offset vectors differ between layers;

step 203: taking the output vector of the last layer of the three-layer bidirectional GRU neural network as the input vector of the fully connected layer, wherein the output of the fully connected layer is the filtered signal sequence $O = o_1, o_2, \ldots, o_T$, which has the same length as the input real sequencing signal sequence to be detected.

Step S3: acquiring actual sequencing signal sequences to be trained output by a nanopore sequencing platform, calculating real sequencing signal sequences to be trained corresponding to each actual sequencing signal sequence to be trained according to the actual sequencing signal sequences to be trained, taking the actual sequencing signal sequences to be trained and the real sequencing signal sequences to be trained corresponding to the actual sequencing signal sequences as supervision training data required by parameter training of a signal processing model, constructing a loss function of the signal processing model, initializing parameters of the signal processing model, and minimizing the loss function through an adaptive optimizer to realize training of the model parameters; the specific process is as follows:

step 301: the supervised training data of the signal processing model consist of actual sequencing signal sequences to be trained and the corresponding real sequencing signal sequences to be trained; first, the fast5 file output by the nanopore sequencing platform is read to obtain the actual sequencing signal sequences to be trained, and then the real sequencing signal sequence to be trained corresponding to each of them is calculated using the continuous wavelet dynamic time warping algorithm;

step 302: the loss function of the signal processing model is constructed using cosh, the hyperbolic cosine function, where $o_i$ denotes the $i$-th output signal of the signal processing model and $r_i$ denotes the $i$-th signal in the actual sequencing signal sequence;

step 303: the weight matrices and bias vectors in each layer of the bidirectional gated recurrent unit neural network are initialized from a uniform distribution over an interval whose bounds depend on $n_i$ and $n_o$, where $n_i$ denotes the input vector dimension of each layer of the bidirectional gated recurrent unit neural network and $n_o$ denotes its output vector dimension;

step 304: the model parameters that minimize the loss function are obtained iteratively using the Adam adaptive optimizer with a learning rate of 0.0001, which completes the parameter training of the signal processing model; during training, batch_size is set to 64 and the number of iterations to 1000; the decrease of the loss function during parameter training is shown in fig. 5.
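The formulas for the loss function and the initialization interval did not survive in this text. A minimal sketch of steps 302 and 303 is given below under two explicit assumptions: that the loss is the mean log-cosh of $o_i - r_i$ (averaging rather than summing is a guess), and that the interval is the Glorot uniform interval $\pm\sqrt{6/(n_i+n_o)}$. Both function names are hypothetical.

```python
import numpy as np

def log_cosh_loss(output, target):
    """Assumed loss: mean of log(cosh(o_i - r_i)), computed in a stable form."""
    d = np.abs(np.asarray(output, dtype=float) - np.asarray(target, dtype=float))
    # log(cosh(d)) = d + log(1 + exp(-2d)) - log(2), which avoids overflow for large d
    return float(np.mean(d + np.log1p(np.exp(-2.0 * d)) - np.log(2.0)))

def glorot_uniform(n_in, n_out, rng):
    """Assumed initializer: uniform on [-limit, limit], limit = sqrt(6 / (n_in + n_out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = glorot_uniform(n_in=16, n_out=8, rng=rng)
loss = log_cosh_loss([0.5, 1.0, -2.0], [0.0, 1.0, 1.0])
```

Log-cosh behaves like a squared error for small differences and like an absolute error for large ones, which makes it less sensitive to outlier samples than a plain squared error.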

Step 301 comprises:

step 3011: reading a fast5 sequencing file output by a nanopore sequencing platform to obtain an actual sequencing signal sequence to be trained, wherein the fast5 file is in an HDF5 file format;

step 3012: base calling is performed on the actual sequencing signal sequence to be trained to obtain a sequencing read; the sequencing read is aligned to a reference genome sequence using a sequence alignment algorithm; the reference genome sequence fragment corresponding to the actual sequencing signal sequence to be trained is obtained from the alignment result; and the expected sequencing signal sequence to be trained is calculated from the reference genome sequence fragment and the nanopore sequencing K-mer pore model;

step 3013: the continuous wavelet dynamic time warping algorithm is used to complete the point-to-point mapping between the actual sequencing signal sequence to be trained and the expected sequencing signal sequence to be trained (see fig. 6); the expected signal value corresponding to each signal value in the actual sequencing signal sequence to be trained is obtained from the mapping result, from which the real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence to be trained is calculated.
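The last part of step 3013 can be illustrated with a short sketch. It assumes the warping result is already available as a list of (actual index, expected index) pairs; the continuous wavelet dynamic time warping algorithm itself is not implemented here. Averaging when one actual sample maps to several expected values is also an assumption, and `expand_expected` is a hypothetical name.

```python
def expand_expected(actual_len, expected, path):
    """Build a sequence as long as the actual signal from a point-to-point mapping.

    path: iterable of (actual_index, expected_index) pairs from a warping algorithm.
    """
    sums = [0.0] * actual_len
    counts = [0] * actual_len
    for a_idx, e_idx in path:
        sums[a_idx] += expected[e_idx]
        counts[a_idx] += 1
    if 0 in counts:
        raise ValueError("every actual sample must appear in the mapping")
    return [s / c for s, c in zip(sums, counts)]

# Toy example: 6 actual samples mapped onto 3 expected values.
expected = [90.0, 100.0, 80.0]
path = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 2)]
real_to_train = expand_expected(6, expected, path)
```

Each expected value is held for as many samples as the mapping assigns to it, so the result lines up sample by sample with the actual signal and can serve as the paired training sequence.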

Step S4: input the real sequencing signal sequence to be tested into the signal processing model whose parameters have been trained, thereby filtering the real sequencing signal sequence to be tested.

To verify that the signal processing model based on the bidirectional gated recurrent unit neural network provided by the invention can efficiently and accurately filter out useless high-frequency components in a standard sequencing signal sequence, the low-pass filter of the signal generation module of the DeepSimulator software and the signal processing model constructed by the invention were each used to filter the standard sequencing signal sequence; the resulting filtered signal waveforms are compared in FIG. 7. The comparison shows that, unlike the conventional low-pass filter, the signal processing model based on the bidirectional gated recurrent unit neural network filters the signal only at the positions where the 6-mer changes in the standard sequencing signal sequence, which agrees better with the principle of actual nanopore sequencing. By using a neural network, the signal processing model provided by the invention can accurately learn the intrinsic relationship between the standard sequencing signal sequence and the real sequencing signal.

The final simulated sequencing signal is generated by adding Gaussian noise to the output filtered signal. As shown in fig. 8, both the DeepSimulator software and the present invention generate simulated sequencing signals whose waveforms resemble that of the real sequencing signal. To quantify the similarity between each of the two simulated sequencing signals and the real sequencing signal, the dynamic time warping (DTW) algorithm, a standard method for comparing two signals, was used. 1000 pieces of sequencing data were randomly selected from the test data set for testing. The average normalized DTW distance between the simulated sequencing signal generated by DeepSimulator and the real sequencing signal is 0.132, whereas the average normalized DTW distance between the simulated sequencing signal generated by the invention and the real sequencing signal is 0.121, which is 7.5% lower than that of DeepSimulator and indicates that the invention generates simulated sequencing signals closer to the real sequencing signal. FIG. 9 shows the results for the 1000 sequencing samples: each point represents one piece of sequencing data, points above the diagonal indicate that the invention simulates better, and points below the diagonal indicate that DeepSimulator simulates better. According to the statistics, the method of the invention performs better on about 89.1% of the tested samples.
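The text does not state how the DTW distance is normalized. The sketch below uses plain dynamic time warping with an absolute-difference cost and divides the accumulated cost by the length of the warping path; that normalization is an assumption, and the function names are hypothetical.

```python
def dtw_distance(a, b):
    """Return (accumulated cost, warping path length) of classic DTW."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Best predecessor: diagonal, vertical or horizontal move.
            best = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = d + best[0]
            steps[i][j] = best[1] + 1
    return cost[n][m], steps[n][m]

def normalized_dtw(a, b):
    """Assumed normalization: accumulated cost divided by warping path length."""
    total, path_len = dtw_distance(a, b)
    return total / path_len

same_shape = normalized_dtw([1.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = normalized_dtw([0.0, 0.0], [1.0, 1.0])
```

This quadratic-time version is only practical for short signals; full-length nanopore reads would call for a banded or otherwise accelerated variant.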

In addition to comparing the DTW distance between the simulated and real sequencing signals, time-frequency analysis was performed on the three types of sequencing signals using the continuous wavelet transform; the resulting time-frequency maps are shown in FIG. 10. Overall, the time-frequency maps of both simulated sequencing signals are very similar to that of the real sequencing signal, but the map of the simulated sequencing signal generated by the invention is closer to it, especially in the high-frequency part. To compare the similarity between the time-frequency maps of the two simulated sequencing signals and that of the real sequencing signal, the Pearson correlation coefficient (PCC) between each simulated map and the real map was calculated, and the PCC was further calculated separately for the low-frequency and high-frequency parts of the maps. The mean values calculated from 4000 pieces of sequencing data in the test data set are shown in figure 11. Overall, the PCC between the time-frequency map of the simulated sequencing signal generated by the invention and that of the real sequencing signal is 9.08% higher than for DeepSimulator. For the high-frequency part of the time-frequency map, the invention improves on DeepSimulator by 18.88%, which further shows that the signal processing model based on the bidirectional gated recurrent unit neural network provided by the invention can more accurately filter out useless high-frequency components in a standard sequencing signal sequence while retaining useful high-frequency components, thereby generating a simulated sequencing signal that is closer to the real sequencing signal in both the time domain and the frequency domain.
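The PCC comparison can be sketched as follows. The sketch assumes that the two time-frequency maps are already available as arrays of equal shape, with rows ordered from low to high frequency; the wavelet transform itself is not computed here. The row index that separates the low-frequency and high-frequency parts is a hypothetical parameter, since the text does not say where the split lies.

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two arrays of equal shape."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def band_pcc(tf_sim, tf_real, split_row):
    """PCC over the whole map and over its low- and high-frequency rows."""
    return {
        "overall": pcc(tf_sim, tf_real),
        "low": pcc(tf_sim[:split_row], tf_real[:split_row]),
        "high": pcc(tf_sim[split_row:], tf_real[split_row:]),
    }

# Toy maps: 4 frequency rows by 3 time columns.
tf_real = np.arange(12.0).reshape(4, 3)
tf_sim = 2.0 * tf_real + 1.0   # perfectly linearly related
scores = band_pcc(tf_sim, tf_real, split_row=2)
```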

An important criterion for assessing the quality of a simulated sequencing signal is whether the simulated reads generated from it have error characteristics similar to those of sequencing reads generated in actual experiments. Guppy, the latest base calling software from Oxford Nanopore Technologies, was used to perform base calling on the two simulated sequencing signals and the real sequencing signal, and the edlib software was then used to calculate the error characteristics of the three kinds of sequencing reads, including the insertion rate, the deletion rate and the substitution rate. The statistics obtained for the four biological sequencing samples provided by the test data set are shown in figure 12. Because the accuracy of actual sequencing reads depends on the complex sequencing environment, the accuracy of the simulated sequencing reads differs considerably from that of the actual sequencing reads. Although the two kinds of simulated sequencing reads generated by the different sequencing signal simulation methods have similar accuracy, the proportions of the three types of errors in them are not the same. To compare the error characteristics of the two simulated sequencing reads and the four real sequencing reads in detail, the proportions of the three types of errors in the six sequencing reads were further calculated. Except for the Klebsiella pneumoniae sequencing sample, the simulated sequencing reads generated by the invention have error distributions similar to those of the actual sequencing reads. By contrast, the simulated sequencing reads generated by DeepSimulator differ considerably from the actual sequencing reads for all biological sequencing samples. This analysis of the simulated sequencing reads shows that the nanopore sequencing signal simulation method provided by the invention greatly improves the quality of the simulated sequencing signal.
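The three error rates can be illustrated with a small sketch. It does not call edlib; it assumes an alignment is already available as an extended CIGAR string with the operations `=` (match), `X` (substitution), `I` (insertion) and `D` (deletion). Dividing by the total number of alignment columns is an assumed definition of the rates, and `error_rates` is a hypothetical name.

```python
import re

def error_rates(cigar):
    """Insertion, deletion and substitution rates from an extended CIGAR string."""
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in re.findall(r"(\d+)([=XID])", cigar):
        counts[op] += int(length)
    aligned = sum(counts.values())
    if aligned == 0:
        raise ValueError("CIGAR string contains no alignment columns")
    return {
        "insertion": counts["I"] / aligned,
        "deletion": counts["D"] / aligned,
        "substitution": counts["X"] / aligned,
    }

# Toy alignment: 16 matches, 1 substitution, 2 inserted and 1 deleted base.
rates = error_rates("6=1X2I10=1D")
```

The proportion of each error type among all errors, as compared in the text, follows by dividing each rate by the sum of the three rates.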

For the Klebsiella pneumoniae sequencing sample, the simulated sequencing reads generated by both DeepSimulator and the simulation method of the invention differ considerably from the actual sequencing reads. To further explore the extensibility of the proposed sequencing signal simulation method, 4000 sequencing reads were randomly selected from the Klebsiella pneumoniae sequencing sample to train the proposed signal processing model based on the bidirectional gated recurrent unit neural network, and the remaining sequencing reads were used as a test data set to verify the performance of this Klebsiella pneumoniae customized model. As shown in fig. 13, the simulated sequencing reads generated by the customized model have insertion and substitution rates similar to those of the real sequencing reads, while their deletion rate is lower than that of the real sequencing reads. Compared with the other two kinds of simulated sequencing reads, the error characteristics of the simulated sequencing reads generated by the Klebsiella pneumoniae customized model are closer to those of the real sequencing reads, which shows that the proposed signal simulation method can effectively learn the time-frequency characteristics of the real sequencing signal and is more extensible than DeepSimulator.

According to the above technical scheme, the input nucleotide sequence is converted into an expected sequencing signal sequence based on a known nanopore sequencing K-mer pore model, and each signal value in the expected sequencing signal sequence is repeated multiple times to generate the real sequencing signal sequence to be tested; a processing model for the real sequencing signal sequence is then established, and its parameters are trained using supervised training data; finally, the trained signal processing model is used to filter the real sequencing signal sequence to be tested. Because the neural network learns the time-frequency characteristics of the actual sequencing signal sequence, the high-frequency components of the real sequencing signal sequence to be tested that are unrelated to the actual sequencing signal sequence can be accurately filtered out, while the useful high-frequency components are retained.

Example 2

Corresponding to embodiment 1 of the present invention, embodiment 2 of the present invention further provides a real nanopore sequencing signal filtering device based on a neural network, the device comprising the following modules:

the module for generating the real sequencing signal sequence to be tested, configured to input a nucleotide sequence into the nanopore sequencing K-mer pore model, convert it into the corresponding expected sequencing signal sequence, and repeat each signal value in the expected sequencing signal sequence multiple times to generate the real sequencing signal sequence to be tested;

the signal processing model building module, configured to build a real sequencing signal processing model based on the bidirectional gated recurrent unit neural network, the signal processing model comprising three layers of bidirectional gated recurrent unit neural networks and a fully connected layer, wherein the input of the signal processing model is the median-normalized real sequencing signal sequence fed to the first-layer bidirectional gated recurrent unit neural network, and the output of the signal processing model is the filtered signal sequence output by the fully connected layer, which has the same length as the input real sequencing signal sequence to be tested;

the signal processing model training module, configured to acquire the actual sequencing signal sequences to be trained output by the nanopore sequencing platform, calculate from each of them the corresponding real sequencing signal sequence to be trained, use each actual sequencing signal sequence to be trained together with its corresponding real sequencing signal sequence to be trained as the supervised training data required for training the parameters of the signal processing model, construct the loss function of the signal processing model, initialize the model parameters, and train them by minimizing the loss function with an adaptive optimizer;

and the filtering processing module, configured to input the real sequencing signal sequence to be tested into the signal processing model whose parameters have been trained, thereby filtering the real sequencing signal sequence to be tested.
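The median normalization applied to the model input is not spelled out in the text. One common choice, assumed here, is to subtract the median and divide by the median absolute deviation (MAD); `median_normalize` is a hypothetical name.

```python
from statistics import median

def median_normalize(signal):
    """Assumed normalization: (x - median) / median absolute deviation."""
    med = median(signal)
    mad = median([abs(x - med) for x in signal])
    if mad == 0:
        raise ValueError("signal has zero median absolute deviation")
    return [(x - med) / mad for x in signal]

normalized = median_normalize([1.0, 2.0, 3.0, 4.0, 5.0])
```

Using the median and MAD rather than the mean and standard deviation keeps occasional current spikes from dominating the scale of the input.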

Specifically, the module for generating the real sequencing signal sequence to be tested is further configured to:

step 101: a nucleotide sequence $X = x_1, x_2, \ldots, x_T$ of $T$ bases is input into the nanopore sequencing K-mer pore model, where $x_T$ denotes the $T$-th base; starting from the first base of the input nucleotide sequence, one K-mer is extracted at a time while moving forward one base at a time, until the position of the $K$-th base from the end is reached, so that $T$ bases yield $T-K+1$ K-mers in total;

step 102: the nanopore sequencing K-mer pore model contains the expected current signal value corresponding to each K-mer; the expected current signal values corresponding to the $T-K+1$ K-mers are looked up in turn in the nanopore sequencing K-mer pore model, generating the expected sequencing signal sequence $Y = y_1, y_2, \ldots, y_{T-5}$ corresponding to the input nucleotide sequence $X$, where $y_i$ denotes the expected current signal value corresponding to the K-mer starting at position $i$ in $X$;

step 103: each signal value in the expected sequencing signal sequence is repeated multiple times according to the distribution of signal value repetition counts in actual sequencing signal sequences, generating a real sequencing signal sequence to be tested whose length distribution is similar to that of actual sequencing signal sequences.
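Steps 101 to 103 can be sketched as follows for $K = 6$, which is the value implied by the $T-5$ expected signal values. The pore model values and the repetition counts below are random placeholders, not a real pore model table or a measured repetition distribution, and all function names are hypothetical.

```python
import random
from itertools import product

K = 6  # implied by the T-5 values of the expected sequence

def extract_kmers(seq, k=K):
    """Step 101: slide a window one base at a time; T bases give T-k+1 K-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def expected_signal(seq, pore_model, k=K):
    """Step 102: look up the expected current value of each K-mer."""
    return [pore_model[kmer] for kmer in extract_kmers(seq, k)]

def repeat_signal(expected, repeat_counts):
    """Step 103: repeat the i-th expected value repeat_counts[i] times."""
    out = []
    for value, n in zip(expected, repeat_counts):
        out.extend([value] * n)
    return out

# Placeholder pore model: one made-up current value per 6-mer.
rng = random.Random(0)
toy_model = {"".join(p): 80.0 + 40.0 * rng.random()
             for p in product("ACGT", repeat=K)}

seq = "ACGTACGTAC"                         # T = 10 bases
y = expected_signal(seq, toy_model)        # T - K + 1 = 5 values
counts = [rng.randint(5, 12) for _ in y]   # made-up repetition counts
real_signal = repeat_signal(y, counts)
```

In practice the repetition counts would be drawn from the distribution observed in actual sequencing signals, as step 103 requires.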

Specifically, the signal processing model building module is further configured to:

step 201: the median-normalized real sequencing signal sequence $I = i_1, i_2, \ldots, i_T$ is used as the input vector of the first-layer bidirectional gated recurrent unit neural network in the signal processing model, wherein the basic unit of the bidirectional gated recurrent unit neural network consists of a forward-propagating gated recurrent unit and a backward-propagating gated recurrent unit. The forward output vector $\overrightarrow{h}_t$ at time $t$ is calculated first,

wherein $i_t$ is the input signal value at time $t$; $\overrightarrow{h}_{t-1}$ is the forward output vector at time $t-1$; $\sigma$ is the sigmoid function, $\sigma(z) = 1/(1+e^{-z})$; the forward calculation uses a first, a second and a third intermediate variable, the element-wise product of two vectors of equal dimension, a first and a second weight matrix, a third weight matrix $W_f$, and a first and a second bias vector. The backward output vector $\overleftarrow{h}_t$ at time $t$ is then calculated,

wherein $\overleftarrow{h}_{t+1}$ is the backward output vector at time $t+1$; the backward calculation uses a fourth, a fifth and a sixth intermediate variable, a fourth and a fifth weight matrix, a sixth weight matrix $W_r$, and a third and a fourth bias vector. The final output vector $h_t$ at time $t$ is obtained by concatenating the forward output vector $\overrightarrow{h}_t$ at time $t$ and the backward output vector $\overleftarrow{h}_t$ at time $t$: $h_t = \overrightarrow{h}_t \,\|\, \overleftarrow{h}_t$, where $\|$ denotes vector concatenation;

step 202: the output vector of the first-layer bidirectional gated recurrent unit neural network is used as the input vector of the second layer, and the output vector of the second layer is used as the input vector of the third layer; the second and third layers are calculated in the same way as the first layer, but each layer has its own weight matrices and bias vectors;

step 203: the output vector of the last layer of the three-layer bidirectional gated recurrent unit neural network is used as the input vector of a fully connected layer, and the output of the fully connected layer is a filtered signal sequence $O = o_1, o_2, \ldots, o_T$ with the same length as the input real sequencing signal sequence to be tested.
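The formulas of step 201 are not legible in this text. The description (three intermediate variables, three weight matrices, two bias vectors, an element-wise product) is consistent with the standard gated recurrent unit, so one plausible reading, given only as an assumption, is the following. Apart from $i_t$, $\sigma$, $W_f$, $W_r$ and $h_t$, every symbol below is hypothetical, as is the assignment of $W_f$ and $W_r$ to the candidate state.

```latex
% Forward unit (assumed standard GRU form; z, g, \tilde{h}, W_z, W_g, b_z, b_g are hypothetical names)
\begin{aligned}
\overrightarrow{z}_t &= \sigma\!\left(\overrightarrow{W}_z\,[\overrightarrow{h}_{t-1},\, i_t] + \overrightarrow{b}_z\right) \\
\overrightarrow{g}_t &= \sigma\!\left(\overrightarrow{W}_g\,[\overrightarrow{h}_{t-1},\, i_t] + \overrightarrow{b}_g\right) \\
\overrightarrow{\tilde{h}}_t &= \tanh\!\left(W_f\,[\overrightarrow{g}_t \odot \overrightarrow{h}_{t-1},\, i_t]\right) \\
\overrightarrow{h}_t &= (1 - \overrightarrow{z}_t) \odot \overrightarrow{h}_{t-1} + \overrightarrow{z}_t \odot \overrightarrow{\tilde{h}}_t
\end{aligned}

% Backward unit: same form, run from t+1 down to t, with W_r in place of W_f
\begin{aligned}
\overleftarrow{z}_t &= \sigma\!\left(\overleftarrow{W}_z\,[\overleftarrow{h}_{t+1},\, i_t] + \overleftarrow{b}_z\right) \\
\overleftarrow{g}_t &= \sigma\!\left(\overleftarrow{W}_g\,[\overleftarrow{h}_{t+1},\, i_t] + \overleftarrow{b}_g\right) \\
\overleftarrow{\tilde{h}}_t &= \tanh\!\left(W_r\,[\overleftarrow{g}_t \odot \overleftarrow{h}_{t+1},\, i_t]\right) \\
\overleftarrow{h}_t &= (1 - \overleftarrow{z}_t) \odot \overleftarrow{h}_{t+1} + \overleftarrow{z}_t \odot \overleftarrow{\tilde{h}}_t
\end{aligned}

% Output of the layer at time t
h_t = \overrightarrow{h}_t \,\|\, \overleftarrow{h}_t
```

Under this reading the forward unit summarizes the signal before time $t$ and the backward unit the signal after it, so each $h_t$ depends on the whole input sequence.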

Specifically, the signal processing model training module is further configured to:

step 301: the supervised training data of the signal processing model consist of actual sequencing signal sequences to be trained and the corresponding real sequencing signal sequences to be trained; first, the fast5 file output by the nanopore sequencing platform is read to obtain the actual sequencing signal sequences to be trained, and then the real sequencing signal sequence to be trained corresponding to each of them is calculated using the continuous wavelet dynamic time warping algorithm;

step 302: the loss function of the signal processing model is constructed using cosh, the hyperbolic cosine function, where $o_i$ denotes the $i$-th output signal of the signal processing model and $r_i$ denotes the $i$-th signal in the actual sequencing signal sequence;

step 303: the weight matrices and bias vectors in each layer of the bidirectional gated recurrent unit neural network are initialized from a uniform distribution over an interval whose bounds depend on $n_i$ and $n_o$, where $n_i$ denotes the input vector dimension of each layer of the bidirectional gated recurrent unit neural network and $n_o$ denotes its output vector dimension;

step 304: the model parameters that minimize the loss function are obtained iteratively using the Adam adaptive optimizer, which completes the parameter training of the signal processing model.
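A single Adam update can be sketched as follows. The learning rate of 0.0001 is the value given in embodiment 1; the decay rates 0.9 and 0.999 and the constant 1e-8 are the commonly used defaults and are assumptions here, since the text does not state them. The toy problem (one scalar parameter under a log-cosh loss, whose gradient is tanh) is purely illustrative.

```python
import numpy as np

def adam_step(param, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds the step count and both moment estimates."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad ** 2
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])   # bias-corrected mean
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])   # bias-corrected variance
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

# Toy problem: move w toward the target r under loss log(cosh(w - r)).
w = np.array([1.0])
r = 0.0
state = {"t": 0, "m": np.zeros_like(w), "v": np.zeros_like(w)}
for _ in range(1000):
    grad = np.tanh(w - r)          # derivative of log(cosh(w - r)) w.r.t. w
    w = adam_step(w, grad, state)
```

Because Adam rescales the gradient by its running magnitude, each step moves the parameter by roughly the learning rate, so 1000 iterations at 0.0001 shift this toy parameter by about 0.1.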

More specifically, the step 301 includes:

step 3011: reading a fast5 sequencing file output by a nanopore sequencing platform to obtain an actual sequencing signal sequence to be trained, wherein the fast5 file is in an HDF5 file format;

step 3012: base calling is performed on the actual sequencing signal sequence to be trained to obtain a sequencing read; the sequencing read is aligned to a reference genome sequence using a sequence alignment algorithm; the reference genome sequence fragment corresponding to the actual sequencing signal sequence to be trained is obtained from the alignment result; and the expected sequencing signal sequence to be trained is calculated from the reference genome sequence fragment and the nanopore sequencing K-mer pore model;

step 3013: the continuous wavelet dynamic time warping algorithm is used to complete the point-to-point mapping between the actual sequencing signal sequence to be trained and the expected sequencing signal sequence to be trained; the expected signal value corresponding to each signal value in the actual sequencing signal sequence to be trained is obtained from the mapping result, from which the real sequencing signal sequence to be trained corresponding to the actual sequencing signal sequence to be trained is calculated.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
