Base mutation detection method and device based on sequencing data and storage medium

Document No.: 395665 | Published: 2021-12-14

Reading note: the technology "Base mutation detection method and device based on sequencing data and storage medium" (基于测序数据的碱基突变检测方法、装置及存储介质) was created by 刘斯洋 (Liu Siyang), 黄树嘉 (Huang Shujia) and 金鑫 (Jin Xin) on 2019-05-15. Abstract: A method, a device and a storage medium for detecting base mutations based on sequencing data are disclosed. The method comprises: determining the initial frequency with which the sequencing data of a plurality of samples to be detected is a specific base at a research site; calculating, based on the initial frequency, the expected value of each sample to be detected being the specific base at the research site; updating the initial frequency by using each expected value; continuing to calculate the expected value of each sample with the updated initial frequency until the expected value of each sample converges; and determining the base mutation type and the variation confidence of each sample to be detected at the research site according to each converged expected value.

A method for detecting base mutation based on sequencing data, comprising:

determining the initial frequency with which the sequencing data of a plurality of samples to be detected is a specific base at a research site;

calculating, based on the initial frequency, the expected value of each sample to be detected being the specific base at the research site;

updating, by using each expected value, the initial frequency with which the sequencing data of the plurality of samples to be detected is the specific base at the research site;

continuing to calculate the expected value of each sample to be detected being the specific base at the research site by using the updated initial frequency, continuing to update the initial frequency by using each new expected value, and repeating this iteration until the expected value of each sample to be detected being the specific base at the research site converges;

determining the base mutation type and the variation confidence of each sample to be detected at the research site according to each converged expected value;

wherein the specific base comprises an adenine A base, a thymine T base, a cytosine C base or a guanine G base.
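
The iteration in this claim has the shape of an expectation-maximization loop, and a minimal Python sketch may make the control flow easier to follow. The function name `em_base_frequencies`, the tolerance `tol`, the iteration cap `max_iter`, and the input `likelihoods` (a per-sample vector of assumed values for how well each of A, C, G, T explains that sample's reads) are illustrative assumptions rather than anything specified in the claims. The update step here assumes one observed base per sample, so it averages over samples.

```python
def em_base_frequencies(likelihoods, init_freq, tol=1e-6, max_iter=100):
    """Alternate expectation and frequency updates until the expectations settle.

    likelihoods[i][j]: assumed likelihood of sample i's data if its base is j.
    init_freq[j]: initial frequency of base j at the site (order A, C, G, T).
    """
    freq = list(init_freq)
    prev = None
    post = []
    for _ in range(max_iter):
        # Expectation: per-sample posterior of each base under the current frequencies.
        post = []
        for lik in likelihoods:
            weighted = [f * l for f, l in zip(freq, lik)]
            total = sum(weighted)
            post.append([w / total for w in weighted])
        # Update: the new frequency of base j is the mean expectation over samples
        # (assumption: one observed base per sample).
        freq = [sum(p[j] for p in post) / len(post) for j in range(len(freq))]
        # Convergence: largest change in any expectation since the previous round.
        if prev is not None:
            delta = max(abs(a - b) for pa, pb in zip(post, prev) for a, b in zip(pa, pb))
            if delta < tol:
                break
        prev = post
    return freq, post
```

Calling it with three hypothetical samples, two supporting A and one supporting C, drives the frequencies toward roughly two thirds A and one third C.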

The method of claim 1, wherein determining the initial frequency with which the sequencing data of the plurality of samples to be detected is a specific base at a research site comprises:

counting the number of occurrences of the specific base in the sequencing data of the plurality of samples to be detected and the total number of the four bases in the sequencing data of the plurality of samples to be detected; wherein the four bases comprise: adenine A base, thymine T base, cytosine C base and guanine G base;

taking the quotient of the number of occurrences of the specific base and the total number of the four bases as the initial frequency with which the sequencing data of the plurality of samples to be detected is the specific base at the research site.
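
The counting step above can be sketched in a few lines of Python. The function name `initial_frequencies` and the example string of observed bases are hypothetical. Skipping characters other than A, C, G, T (such as N) is an assumption, since the claim only speaks of the four bases.

```python
from collections import Counter

BASES = "ACGT"

def initial_frequencies(observed_bases):
    """Return {base: count / total} over the four bases observed at one site."""
    # Count only A, C, G, T; anything else (for example N) is ignored by assumption.
    counts = Counter(b for b in observed_bases if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases observed at this site")
    return {b: counts[b] / total for b in BASES}

# Hypothetical pileup: bases from all samples covering the site (6 A, 1 C, 1 G).
print(initial_frequencies("AAAACAAG"))
# {'A': 0.75, 'C': 0.125, 'G': 0.125, 'T': 0.0}
```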

The method of claim 1, wherein calculating, based on the initial frequency, the expected value of each sample to be detected being a specific base at the research site comprises:

calculating the expected value of each sample to be detected being a specific base at the research site by the following formula:

$$p(b_{i,j} \mid p_j, d_i) = \frac{p(b_{i,j} \mid p_j)\, p(d_i \mid b_{i,j})}{\sum_{j'} p(b_{i,j'} \mid p_{j'})\, p(d_i \mid b_{i,j'})}$$

wherein $b_{i,j}$ indicates that the sample $i$ to be detected is the specific base $j$ at the research site; $p_j$ represents the initial frequency with which the sequencing data of the plurality of samples to be detected is the specific base $j$ at the research site; $d_i$ represents the set of bases, in the base sequences of the sample $i$ to be detected, that cover the research site; $p(b_{i,j} \mid p_j, d_i)$ represents the expected value of the sample $i$ to be detected being the specific base $j$ at the research site; $p(b_{i,j} \mid p_j)$ represents the prior probability, given $p_j$, that the sample $i$ to be detected is the specific base $j$ at the research site; and $p(d_i \mid b_{i,j})$ represents the base quality value of the base sequence of the sample $i$ to be detected that covers the research site.
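
As an illustration of how a prior, a base quality value and an expected value fit together, the sketch below applies Bayes' rule to a single read. Two things are assumptions, not statements from the claim: the prior $p(b_{i,j} \mid p_j)$ is taken to be the frequency $p_j$ itself, and the likelihood $p(d_i \mid b_{i,j})$ is derived from a Phred quality with the common error model (probability 1 - e for the observed base, e/3 for each other base). The function names are hypothetical.

```python
BASES = "ACGT"

def base_likelihoods(observed_base, phred_quality):
    """Assumed error model: likelihood of one read under each candidate base."""
    error = 10 ** (-phred_quality / 10)
    return [1 - error if b == observed_base else error / 3 for b in BASES]

def expected_values(freq, likelihoods):
    """Bayes' rule: prior times likelihood, normalized over the four bases."""
    weighted = [p * l for p, l in zip(freq, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]

# A sample showing C with quality 20, at a site where A is the common base.
post = expected_values([0.75, 0.125, 0.125, 0.0], base_likelihoods("C", 20))
```

Even though A dominates the prior, a quality-20 observation of C leaves C with most of the posterior mass, and T stays at zero because its prior is zero.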

The method of claim 3, wherein updating, by using each expected value, the initial frequency with which the sequencing data of the plurality of samples to be detected is a specific base at the research site comprises:

updating the initial frequency with which the sequencing data of the plurality of samples to be detected is a specific base at the research site, by using each expected value, according to the following formula:

$$\hat{p}_j = \frac{1}{N} \sum_{i=1}^{n} p(b_{i,j} \mid p_j, d_i)$$

wherein $\hat{p}_j$ represents the updated initial frequency with which the sequencing data of the plurality of samples to be detected is the specific base $j$ at the research site; $p(b_{i,j} \mid p_j, d_i)$ represents the expected value of the sample $i$ to be detected being the specific base $j$ at the research site; $n$ represents the number of samples to be detected; and $N$ represents the total number of the four bases in the sequencing data of the plurality of samples to be detected.
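
A small sketch of this update step follows. It assumes the update sums each base's expected value over the samples and divides by the total base count, which is one reading of the symbols defined above. The function name `update_frequencies` and the default of one observed base per sample are hypothetical.

```python
def update_frequencies(expectations, total_bases=None):
    """Sum each base's expected value over samples and divide by the base total.

    expectations[i][j]: expected value of sample i being base j.
    total_bases: total count of the four bases; defaults to the number of
    samples, i.e. one observed base per sample (an assumption).
    """
    if total_bases is None:
        total_bases = len(expectations)
    n_bases = len(expectations[0])
    return [sum(e[j] for e in expectations) / total_bases for j in range(n_bases)]

# Two samples: one certainly A, one split evenly between A and C.
print(update_frequencies([[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]))
# [0.75, 0.25, 0.0, 0.0]
```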

The method of claim 1 or 4, wherein determining the base mutation type of each sample to be detected at the research site according to each converged expected value comprises:

calculating, according to each converged expected value, the maximum likelihood estimate that the plurality of samples to be detected belong, at the research site, to each of four specific base mutation types;

calculating the ratio of the maximum likelihood estimates of two adjacent specific base mutation types;

processing the ratio according to a preset rule to obtain the probability corresponding to the ratio;

in the case where the probability is smaller than a set threshold, determining the base mutation type of each sample to be detected at the research site as the specific base mutation type corresponding to the current denominator;

wherein the four specific base mutation types include: single-base mutation, two-base mutation, three-base mutation and four-base mutation.

The method of claim 5, wherein calculating, according to each converged expected value, the maximum likelihood estimate that the plurality of samples to be detected belong, at the research site, to each of the four specific base mutation types comprises:

calculating the maximum likelihood estimate that the plurality of samples to be detected belong, at the research site, to each of the four specific base mutation types according to the following formula:

$$p(D \mid p_j) = \prod_{i} \sum_{j} p(b_{i,j} \mid p_j)\, p(d_i \mid b_{i,j})$$

wherein $D$ represents the observation data consisting of the sets of bases, in the base sequences of all samples to be detected, that cover the research site; $p_j$ represents the frequency, obtained from each converged expected value, with which the sequencing data of the plurality of samples to be detected is the specific base $j$ at the research site; $p(D \mid p_j)$ represents the maximum likelihood estimate that the plurality of samples to be detected have, at the research site, the base mutation type corresponding to $j$; $p(b_{i,j} \mid p_j)$ represents the prior probability, given $p_j$, that the sample $i$ to be detected is the specific base $j$ at the research site; and $p(d_i \mid b_{i,j})$ represents the base quality value of the base sequence of the sample $i$ to be detected that covers the research site; the corresponding specific base mutation type is a single-base mutation in the case of $j = 0$, a two-base mutation in the case of $j = 1$, a three-base mutation in the case of $j = 2$, or a four-base mutation in the case of $j = 3$.
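
To show how such a likelihood could be evaluated for competing mutation types, the sketch below scores a set of samples under a given frequency vector. It assumes the likelihood is a product over samples of a frequency-weighted sum over bases. Working with logarithms is an implementation choice for numerical stability and is not taken from the claim; the function name and the example data are hypothetical.

```python
import math

def log_likelihood(freq, likelihoods):
    """Log of the product over samples of sum_j freq[j] * likelihoods[i][j]."""
    return sum(
        math.log(sum(p * l for p, l in zip(freq, lik)))
        for lik in likelihoods
    )

# Hypothetical per-sample likelihoods (order A, C, G, T): two A reads, one C read.
a = [0.99, 0.01 / 3, 0.01 / 3, 0.01 / 3]
c = [0.01 / 3, 0.99, 0.01 / 3, 0.01 / 3]
data = [a, a, c]

one_base = log_likelihood([1.0, 0.0, 0.0, 0.0], data)      # only A allowed
two_base = log_likelihood([2 / 3, 1 / 3, 0.0, 0.0], data)  # A and C allowed
```

In this example the two-base model explains the data better than the single-base model, so `two_base` is larger than `one_base`.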

The method of claim 6, wherein calculating the ratio of the maximum likelihood estimates of two adjacent specific base mutation types comprises:

in the case where the maximum likelihood estimate of a four-base mutation at the research site is f₄ and the maximum likelihood estimate of a three-base mutation at the research site is f₃,

determining the ratio of these two estimates as the ratio of the maximum likelihood estimates of two adjacent specific base mutation types.

The method of claim 6, wherein calculating the ratio of the maximum likelihood estimates of two adjacent specific base mutation types comprises:

in the case where the maximum likelihood estimate of a two-base mutation at the research site is f₂ and the minimum of the maximum likelihood estimates of the four mutation combinations of the three-base mutation is f₃min,

determining the ratio of these two estimates as the ratio of the maximum likelihood estimates of two adjacent specific base mutation types.

The method of claim 6, wherein calculating the ratio of the maximum likelihood estimates of two adjacent specific base mutation types comprises:

in the case where the maximum likelihood estimate of a single-base mutation at the research site is f₁ and the minimum of the maximum likelihood estimates of the 16 mutation combinations of the two-base mutation is f₂min,

determining the ratio of these two estimates as the ratio of the maximum likelihood estimates of two adjacent specific base mutation types.

The method according to any one of claims 7 to 9, wherein processing the ratio according to a preset rule to obtain the probability corresponding to the ratio comprises:

taking the natural logarithm of the ratio to obtain a first result;

multiplying the first result by -2 to obtain a second result;

and obtaining the probability corresponding to the second result by looking it up in a chi-square distribution table.
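
These three steps match the usual form of a likelihood ratio test, and a short sketch can stand in for the table lookup. Two points are assumptions: the ratio is taken to be at most 1 (the simpler mutation type's estimate over the richer one's), and the chi-square distribution is given one degree of freedom, which the claim does not state. With one degree of freedom the upper-tail probability has the closed form erfc(sqrt(x / 2)), so only the standard library is needed. The function name `lrt_p_value` is hypothetical.

```python
import math

def lrt_p_value(ratio):
    """Probability for -2 * ln(ratio) under a chi-square with 1 degree of freedom."""
    first = math.log(ratio)   # natural logarithm of the ratio
    second = -2.0 * first     # the test statistic
    # Upper tail of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    # Negative statistics (ratio > 1) are clamped to zero.
    return math.erfc(math.sqrt(max(second, 0.0) / 2.0))
```

A ratio of 1 gives a probability of 1 (no evidence for the richer model), and the probability shrinks as the ratio approaches 0.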

The method of claim 9, wherein determining the variation confidence of each sample to be detected at the research site according to each converged expected value comprises:

performing a conventional Phred-scale conversion on the probability corresponding to the ratio to obtain a Phred quality value;

and determining the Phred quality value as the variation confidence of each sample to be detected at the research site.
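
The Phred-scale conversion is the standard mapping Q = -10 · log10(p). The sketch below applies it to a probability. The function name and the `cap` used when the probability underflows to zero are illustrative assumptions.

```python
import math

def phred_quality(p_value, cap=10000.0):
    """Phred-scale conversion Q = -10 * log10(p), capped for p == 0."""
    if p_value <= 0.0:
        return cap  # avoid log10(0); the cap value itself is an arbitrary choice
    return min(cap, -10.0 * math.log10(p_value))
```

For instance, a probability of 0.001 corresponds to a quality of about 30, and smaller probabilities give higher confidence values.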

A base mutation detection device based on sequencing data, comprising:

an initial frequency determination module, configured to determine the initial frequency with which the sequencing data of a plurality of samples to be detected is a specific base at a research site;

an expected value calculation module, configured to calculate, based on the initial frequency, the expected value of each sample to be detected being the specific base at the research site;

an updating module, configured to update, by using each expected value, the initial frequency with which the sequencing data of the plurality of samples to be detected is the specific base at the research site;

an iteration module, configured to continue calculating the expected value of each sample to be detected being the specific base at the research site by using the updated initial frequency, to continue updating the initial frequency by using each new expected value, and to repeat this iteration until the expected value of each sample to be detected being the specific base at the research site converges;

a variation type determining module, configured to determine the base mutation type and the variation confidence of each sample to be detected at the research site according to each converged expected value;

wherein the specific base comprises an adenine A base, a thymine T base, a cytosine C base or a guanine G base.

A storage medium comprising computer-executable instructions that, when executed by a computer processor, perform a method of base mutation detection based on sequencing data according to any one of claims 1 to 11.
