Method for judging sample degradation based on CNV result

Document No.: 170921  Publication date: 2021-10-29

Abstract: This technology, "Method for judging sample degradation based on CNV result" (一种基于cnv结果判定样本降解的方法), was designed by 贺洪鑫, 梁萌萌, 余伟师, 栗海波 and 李珉 on 2021-09-24. The invention discloses a method for judging sample degradation based on CNV results, comprising the following steps: generating a target coverage CNN file of the test sample; constructing a reference CNN control set; detecting copy number variation from the target coverage CNN file of the test sample, generating an intermediate file containing the specific chromosome intervals and variation types of the copy number variations, and acquiring candidate parameter indexes; taking the target coverage CNN files of historical CNV-seq samples as experimental data, constructing and evaluating a classification model, and taking the classification feature and corresponding threshold that can completely distinguish degraded samples from normal samples as the final parameter index for judging whether a sample is degraded; and comparing the CNV result of the test sample against the final parameter index to judge whether the test sample is degraded. The method provides parameter indexes, with their corresponding threshold ranges, that accurately identify degraded samples, and can distinguish degraded samples from normal samples automatically, efficiently and accurately.

1. A method for determining sample degradation based on CNV results, comprising the steps of:

(1) generating a target coverage CNN file of the test sample according to the reference genome;

(2) taking a target coverage CNN file of a historical CNV-seq sample as a control sample, merging a plurality of control samples according to the reference genome, and constructing a reference CNN control set; the historical CNV-seq samples comprise normal samples and degraded samples;

(3) according to a reference CNN control set, detecting copy number variation from a target coverage CNN file of a test sample, and generating an intermediate file containing specific chromosome intervals of the copy number variation and variation type information;

(4) acquiring a candidate parameter index for judging whether the test sample is degraded or not according to the CNV result from the intermediate file;

(5) taking the target coverage CNN files of the historical CNV-seq samples in step (2) as experimental data, randomly dividing the experimental data into a training set and a test set, randomly combining the candidate parameter indexes of step (4), and taking each combination as one classification feature to form a plurality of classification features; constructing a classification model with the training set and verifying its performance with the test set so as to evaluate the classification performance of each classification feature, and taking the classification feature, together with its corresponding threshold, that can completely distinguish degraded samples from normal samples as the final parameter index for judging whether a sample is degraded;

(6) judging whether the test sample is degraded by applying the final parameter index of step (5) to the CNV result of the test sample.

2. The method according to claim 1, wherein the specific method for generating the target coverage CNN file of the test sample in step (1) is as follows:

a. acquiring the corresponding target bed file according to the reference genome version number;

b. calculating the coverage of the aligned bam file over the regions given in the target bed file to obtain the target coverage CNN file.

3. The method of claim 1, wherein the specific method for constructing the reference CNN control set in step (2) is as follows:

acquiring a reference genome sequence file, and creating an index for it so as to speed up alignment of the sequencing reads of the control samples to the reference genome;

taking the target coverage CNN files of the historical CNV-seq samples as control samples, calculating the sequencing depth of the control samples and the GC content of each region of the reference genome sequence, and combining all the control samples into a sequencing distribution model of a normal genome, namely the reference CNN control set.

4. The method according to claim 1, wherein the specific method for generating the intermediate file in step (3) is as follows:

a. correcting the regional coverage and GC content bias of the target coverage CNN file of the test sample, taking the reference CNN control set as the standard, to obtain a copy number ratio table file;

b. inferring discrete copy number segments from the copy number ratio table file;

c. obtaining the absolute copy number of each segment from the discrete copy number segments;

d. filtering the absolute copy numbers of the segments, using as filtering conditions that the cn value of autosomes and female sex chromosomes is not equal to 2 and the cn value of male sex chromosomes is not equal to 1, to form a filtered file;

e. judging the copy number variation type in the filtered file, wherein the variation type is deletion when the cn value is less than 2 and duplication when the cn value is greater than 2, and finally generating an intermediate file containing the specific chromosome intervals and variation types of the copy number variations.

5. The method of claim 1, wherein the candidate parameter indicators in step (4) include at least two of stdev, segments, mad, total number of detected CNVs, and number corresponding to specific types of detected CNVs.

6. The method of claim 1, wherein the final parameter index of step (5) is stdev and the total number of detected CNVs.

7. The method of claim 6, wherein the test sample is judged to be a degraded sample when the total number of CNV detected is >50, and/or Stdev > 0.5.

Technical Field

The invention relates to the technical field of high-throughput sequencing and mutation detection in biology and precision medicine, and in particular to a method for judging sample degradation based on CNV results.

Background

In recent years, with the development of Next-Generation Sequencing (NGS), detection technologies such as Whole Genome Sequencing (WGS), Whole Exome Sequencing (WES) and Copy Number Variation Sequencing (CNV-seq) have become increasingly well known.

The first step of sequencing is extraction of sample DNA, which is a key foundational step. However, owing to external factors such as temperature, humidity, pH, oxidation and microbial infection, some DNA degradation during extraction is inevitable. When a sequenced sample is degraded, this not only indicates that the sample may be contaminated, but also introduces errors into the sequencing result and ultimately leads to erroneous conclusions in genetic interpretation. To avoid this, degraded samples need to be identified.

The conventional method for identifying a degraded sample is gel electrophoresis, also called gel running. The judgment is based mainly on whether the electrophoresis pattern shows smearing (tailing); if it does, the sample is considered degraded. This method has the following three disadvantages:

a. gel running requires designing and performing experiments, and performing the experiments in particular is time-consuming and labor-intensive;

b. human error is unavoidable when performing the experiments, which makes the judgment inaccurate;

c. the experiments can only be performed by people with the relevant professional background, so the demands on the person making the judgment are high.

In addition, since this determination is completed before sequencing, it cannot tell whether the sample becomes degraded or contaminated between the determination and sequencing.

Disclosure of Invention

The invention aims to solve the above problems in the prior art by providing a method for judging sample degradation based on CNV results. The method automatically identifies degraded DNA samples directly from thresholds on quality control index parameters; it is simple, efficient, time-saving, labor-saving and broadly applicable, and requires no specialist knowledge from the operator.

The technical scheme of the invention is detailed as follows:

A method for judging sample degradation based on CNV results comprises the following steps:

(1) generating a target coverage CNN file of the test sample according to the reference genome;

(2) taking a target coverage CNN file of a historical CNV-seq sample as a control sample, merging a plurality of control samples according to the reference genome, and constructing a reference CNN control set; the historical CNV-seq samples comprise normal samples and degraded samples;

(3) according to a reference CNN control set, detecting copy number variation from a target coverage CNN file of a test sample, and generating an intermediate file containing specific chromosome intervals of the copy number variation and variation type information;

(4) acquiring a candidate parameter index for judging whether the test sample is degraded or not according to the CNV result from the intermediate file;

(5) taking the target coverage CNN files of the historical CNV-seq samples in step (2) as experimental data, randomly dividing the experimental data into a training set and a test set, randomly combining the candidate parameter indexes of step (4), and taking each combination as one classification feature to form a plurality of classification features; constructing a classification model with the training set and verifying its performance with the test set so as to evaluate the classification performance of each classification feature, and taking the classification feature, together with its corresponding threshold, that can completely distinguish degraded samples from normal samples as the final parameter index for judging whether a sample is degraded;

(6) judging whether the test sample is degraded by applying the final parameter index of step (5) to the CNV result of the test sample.

CNV, copy number variation, is an important component of genomic structural variation. It is caused by genome rearrangement and generally refers to an increase or decrease in the copy number of large genomic fragments 1 kb or longer, manifesting mainly as deletions and duplications at the submicroscopic level. Chromosome abnormality detection based on high-throughput sequencing yields a copy number variation result, and the CNV result shows information such as the chromosome interval and chromosome type corresponding to each CNV, together with parameters such as Filter_Count, stdev, segments and mad. Whether the test sample is degraded can be determined by comparing its CNV result against the classification feature and corresponding threshold obtained by this method.

Optionally or preferably, in the method for determining sample degradation based on the CNV result, the specific method for generating the target coverage CNN file of the test sample in step (1) is as follows:

a. acquiring the corresponding target bed file according to the reference genome version number;

b. calculating the coverage of the aligned bam file over the regions given in the target bed file to obtain the target coverage CNN file.

Optionally or preferably, in the method for determining sample degradation based on the CNV result, the specific method for constructing the reference CNN control set in step (2) is as follows:

acquiring a reference genome sequence file, and creating an index for it so as to speed up alignment of the sequencing reads of the control samples to the reference genome;

taking the target coverage CNN files of the historical CNV-seq samples as control samples, calculating the sequencing depth of the control samples and the GC content of each region of the reference genome sequence, and combining all the control samples into a sequencing distribution model of a normal genome, namely the reference CNN control set.

Optionally or preferably, in the method for determining sample degradation based on the CNV result, the specific method for generating the intermediate file in step (3) is as follows:

a. correcting the regional coverage and GC content bias of the target coverage CNN file of the test sample, taking the reference CNN control set as the standard, to obtain a copy number ratio table file;

b. inferring discrete copy number segments from the copy number ratio table file;

c. obtaining the absolute copy number of each segment from the discrete copy number segments;

d. filtering the absolute copy numbers of the segments, using as filtering conditions that the cn value of autosomes and female sex chromosomes is not equal to 2 and the cn value of male sex chromosomes is not equal to 1, to form a filtered file;

e. judging the copy number variation type in the filtered file, wherein the variation type is deletion when the cn value is less than 2 and duplication when the cn value is greater than 2, and finally generating an intermediate file containing the specific chromosome intervals and variation types of the copy number variations.

Optionally or preferably, in the method for determining sample degradation based on a CNV result, the candidate parameter index in step (4) includes at least two of stdev, segments, mad, the total number of detected CNVs, and the number corresponding to a specific type of detected CNV.

Optionally or preferably, in the method for determining sample degradation based on the CNV result, the final parameter index in step (5) is stdev and the total number of detected CNVs.

Optionally or preferably, in the method for determining sample degradation based on the CNV result, the test sample is judged to be a degraded sample when the total number of detected CNVs is >50 and/or Stdev >0.5.

Compared with the prior art, the invention has the following beneficial effects:

The method automatically determines whether a test sample with an available CNV result is degraded, directly from the final parameter index and its corresponding threshold. It is simple, efficient, time-saving and labor-saving, and the judgment is not made ahead of CNV detection but on the same sample data as the CNV detection itself.

The method is broadly applicable and easy to operate; people without any relevant professional background can also check and judge the result, so the technical demands on personnel are low.

The workflow is simple to deploy and convenient to use: the whole analysis can be completed by deploying only the relevant computing nodes. The demand on server computing resources is low; an ordinary server with 8 cores and 64 GB of memory can run dozens of target gene processing tasks simultaneously.

Drawings

FIG. 1 is a schematic view of the overall process flow of the method for determining sample degradation based on CNV results in example 1;

FIG. 2 is a schematic flow chart of step 1 in example 1;

FIG. 3 is a schematic flow chart of step 2 in example 1;

FIG. 4 is a schematic flow chart of step 3 in example 1;

FIG. 5 is a schematic flow chart of step 4 in example 1;

FIG. 6 is a schematic flow chart of step 5 in example 1;

FIG. 7a is the SVM classification boundary diagram of the "Filter_Count + Stdev" combination under scheme one in step 5 of example 1;

FIG. 7b is the SVM classification boundary diagram of the "Filter_Count + Mad" combination under scheme one in step 5 of example 1;

FIG. 7c is the SVM classification boundary diagram of the "Stdev + Mad" combination under scheme one in step 5 of example 1;

FIG. 8a is the SVM classification boundary diagram of the "Filter_Count + Stdev" combination under scheme two in step 5 of example 1;

FIG. 8b is the SVM classification boundary diagram of the "Filter_Count + Mad" combination under scheme two in step 5 of example 1;

FIG. 8c is the SVM classification boundary diagram of the "Stdev + Mad" combination under scheme two in step 5 of example 1;

FIG. 9a is the SVM classification boundary diagram of the "Filter_Count + Stdev" combination under scheme three in step 5 of example 1;

FIG. 9b is the SVM classification boundary diagram of the "Filter_Count + Mad" combination under scheme three in step 5 of example 1;

FIG. 9c is the SVM classification boundary diagram of the "Stdev + Mad" combination under scheme three in step 5 of example 1;

FIG. 10 is a chromosome map of the CNV result of normal sample 1 in the verification procedure of example 1;

FIG. 11 is a chromosome map of the CNV result of abnormal sample 1 in the verification procedure of example 1;

FIG. 12 is a chromosome map of the CNV result of abnormal sample 2 in the verification procedure of example 1.

Detailed Description

The invention is explained and illustrated in detail below with reference to the drawings and preferred embodiments so that those skilled in the art can better understand the invention and implement it.

Example 1

Referring to fig. 1, the overall process is summarized as follows:

1. generating a target coverage CNN file;

2. constructing a reference CNN control set;

3. detecting copy number variation;

4. acquiring parameter indexes for automatic quality control;

5. features and threshold acquisition for distinguishing degraded from normal samples.

The operation steps of each section are described in detail below.

1. Generation of target coverage CNN files

A target coverage CNN file of the test sample is generated according to the reference genome. The file records the coverage of the aligned bam file, calculated over the regions given in the target bed file, and is used together with the subsequently constructed reference CNN control set (named reference.cnn) to detect CNVs.

Referring to fig. 2, the construction process is as follows:

Reference genomes are available from public databases such as Ensembl and NCBI; two reference genome versions are currently in use, hg38 and hg19.

a. first, acquiring the reference genome version number, and acquiring the corresponding target bed file according to that version number;

b. then using the coverage method of the CNV analysis software CNVkit, which computes the coverage of a bam file over given bed regions, to calculate the coverage of the aligned bam file over the regions given in the target bed file, finally obtaining the target coverage cnn file.

Input files:

the aligned bam file of the test sample and the target bed file for the specific reference genome version.

Related software:

the coverage method in the CNVkit software.

Output file:

the target coverage CNN file of the test sample.
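The coverage step above can be sketched as a small Python wrapper around the CNVkit command line. This is a minimal sketch, not the inventors' script: it assumes `cnvkit.py` is on the PATH, and the file names (`sample.bam`, `hg19.target.bed`, `sample.targetcoverage.cnn`) and function names are hypothetical placeholders.

```python
import subprocess


def coverage_command(bam, target_bed, out_cnn):
    """Build the argv list for CNVkit's `coverage` sub-command.

    bam        : aligned bam file of the test sample
    target_bed : target bed file matching the reference genome version
    out_cnn    : path of the target coverage CNN file to write
    """
    return ["cnvkit.py", "coverage", bam, target_bed, "-o", out_cnn]


def run_coverage(bam, target_bed, out_cnn):
    """Run the command; raises CalledProcessError if CNVkit fails."""
    subprocess.run(coverage_command(bam, target_bed, out_cnn), check=True)
    return out_cnn


# Show the command that would be executed (hypothetical file names).
print(" ".join(coverage_command("sample.bam", "hg19.target.bed",
                                "sample.targetcoverage.cnn")))
```

Building the argument list separately from running it makes the command easy to log or inspect before any sequencing data is touched.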

2. Construction of reference CNN control set

The target coverage CNN files of historical CNV-seq samples (comprising normal samples and degraded samples) are taken as control samples, and a plurality of control samples are combined according to the reference genome to construct the reference CNN control set. This step builds the reference CNN control set from the target coverage CNN files of a certain number of historical CNV-seq samples; the control set is then used together with the target coverage CNN file of the test sample for CNV detection.

Referring to fig. 3, the construction process is as follows:

a. first, acquiring the reference genome version number (hg19 or hg38) and downloading the corresponding reference genome sequence file from a public database (Ensembl, NCBI, etc.) according to that version number; the reference genome sequence file is in FASTA format and is referred to as ref.fa for short;

b. building a reference genome alignment index for the downloaded reference genome sequence file with the index construction module of the sequence alignment software, generating index files that mainly comprise ref.fa.amb, ref.fa.bft, ref.fa.ann, ref.fa.fa.fai, ref.fa.misa, ref.fa.pac and ref.fa.sa;

c. then acquiring the target coverage CNN files of a certain number of historical cnvseq samples and taking them as control samples; using the reference method in the cnvkit tool, which calculates the sequencing depth of the control samples and the GC content of each region of the reference genome sequence, combining all the control samples into a sequencing distribution model of a normal genome. This model is the reference CNN control set and is recorded as the reference CNN file.

Input files:

the target coverage cnn files of a number of historical cnvseq samples, and the reference genome sequence file;

Related software:

the software used to build the sequence alignment index,

the reference method in the cnvkit tool;

Output file:

the reference CNN control set.
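As an illustration of step c, the pooling of control samples can be expressed as a call to CNVkit's `reference` sub-command. The sketch below only builds the command; it assumes `cnvkit.py` is installed, and the function name and file names are hypothetical. The alignment index of step b is not shown because the source does not name the alignment software.

```python
def reference_command(control_cnns, ref_fasta, out_cnn="reference.cnn"):
    """Build the argv list for CNVkit's `reference` sub-command.

    control_cnns : target coverage CNN files of the historical control samples
    ref_fasta    : reference genome FASTA (used by CNVkit for GC content)
    out_cnn      : pooled reference file to write
    """
    if not control_cnns:
        raise ValueError("at least one control coverage file is required")
    # Sort so the command is reproducible regardless of input order.
    return ["cnvkit.py", "reference", *sorted(control_cnns),
            "-f", ref_fasta, "-o", out_cnn]


# Hypothetical control samples.
cmd = reference_command(["s2.targetcoverage.cnn", "s1.targetcoverage.cnn"],
                        "ref.fa")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the input files exist.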

3. Detection of copy number variation

According to the reference CNN control set, copy number variation is detected from the target coverage CNN file of the test sample, and an intermediate file containing the specific chromosome intervals and variation types of the copy number variations is generated. This step takes the reference CNN file as the reference CNN control set, detects copy number variation from the target coverage CNN file of the test sample, and finally produces intermediate files such as cnr, cns, call.

Referring to fig. 4, the construction process is as follows:

a. correcting the regional coverage and GC content bias of the target coverage CNN file of the test sample against the given control set, using the fix method in the cnvkit tool, and outputting a copy number ratio table file (.cnr);

b. inferring discrete copy number segments (.cns file) from the copy number ratio table file (.cnr) output in step a, using the segment method in the cnvkit tool;

c. obtaining the absolute copy number of each segment (.call.cns file) from the discrete copy number segments (.cns file), using the call method in the cnvkit tool;

d. filtering the call.cns file, keeping segments whose cn value is not equal to 2 on autosomes and female sex chromosomes and not equal to 1 on male sex chromosomes, to form the filtered filter.cns file;

e. finally, judging the copy number variation type in the filtered file. There are two variation types, gain (duplication) and loss (deletion), judged as follows: if cn <2, the variation type is loss, i.e., deletion; if cn >2, it is gain, i.e., duplication. A CNV.bed file containing the specific chromosome intervals and variation types of the copy number variations, namely the intermediate file, is then generated.
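Steps d and e can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the `Segment` type, the function names, and the `chrX`/`chrY` chromosome naming are hypothetical. The stated rule does not cover a male sex chromosome segment with cn equal to 2, so the sketch leaves that case unclassified rather than guessing.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int
    end: int
    cn: int  # absolute copy number, as produced by the call step


SEX_CHROMS = {"chrX", "chrY"}  # assumed chromosome naming


def passes_filter(seg, is_male):
    """Step d: keep segments whose cn differs from the expected value
    (1 on male sex chromosomes, 2 everywhere else)."""
    expected = 1 if (is_male and seg.chrom in SEX_CHROMS) else 2
    return seg.cn != expected


def variation_type(cn):
    """Step e: cn < 2 is a loss (deletion), cn > 2 is a gain (duplication).
    cn == 2 is not covered by the stated rule and returns None."""
    if cn < 2:
        return "loss"
    if cn > 2:
        return "gain"
    return None


def to_cnv_records(segments, is_male):
    """Return (chrom, start, end, type) tuples for the retained segments."""
    return [(s.chrom, s.start, s.end, variation_type(s.cn))
            for s in segments if passes_filter(s, is_male)]


# Toy segments, purely illustrative.
segments = [Segment("chr1", 0, 1000, 2), Segment("chr2", 0, 1000, 1),
            Segment("chr3", 0, 1000, 3), Segment("chrX", 0, 1000, 1)]
print(to_cnv_records(segments, is_male=True))
```

For a male sample the chrX segment with cn 1 is treated as normal and dropped, whereas for a female sample the same segment is kept and reported as a loss.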

Input files:

target coverage cnn file of the test sample, reference.

Related software:

the fix method in the cnvkit tool,

the segment method in the cnvkit tool,

the call method in the cnvkit tool;

Output files:

cnr, cns, call.
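The three CNVkit methods named above map onto the `fix`, `segment` and `call` sub-commands. The sketch below only assembles the command lines so that the chaining of outputs to inputs is visible. It makes assumptions worth flagging: CNVkit's `fix` takes an antitarget coverage file as its second positional argument, which the source does not mention, so `antitarget_cnn` here is a hypothetical (possibly empty) placeholder file; the output naming scheme and function name are also hypothetical.

```python
def detection_commands(sample, target_cnn, antitarget_cnn, reference_cnn):
    """Return the fix -> segment -> call command lines for one sample."""
    cnr = f"{sample}.cnr"            # copy number ratio table
    cns = f"{sample}.cns"            # discrete copy number segments
    call_cns = f"{sample}.call.cns"  # segments with absolute copy number
    return [
        ["cnvkit.py", "fix", target_cnn, antitarget_cnn, reference_cnn,
         "-o", cnr],
        ["cnvkit.py", "segment", cnr, "-o", cns],
        ["cnvkit.py", "call", cns, "-o", call_cns],
    ]


# Print the commands in execution order (hypothetical file names).
for cmd in detection_commands("sample", "sample.targetcoverage.cnn",
                              "sample.antitargetcoverage.cnn",
                              "reference.cnn"):
    print(" ".join(cmd))
```

Each command consumes the file written by the previous one, so they must run sequentially.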

4. Acquisition of parameter index for automatic quality control

The step obtains candidate parameter indexes used for judging whether the test sample is degraded or not according to the CNV result from the intermediate file.

Referring to fig. 5, the construction process is as follows:

a. combining the .cnr and .cns files with the metrics method in the cnvkit tool to generate quality control indexes including parameters such as stdev, segments and mad;

b. counting the entries in the filter.cns file to obtain the total number of detected CNVs;

c. screening and counting the entries in the CNV.bed file to obtain the number of detected CNVs of each specific type (gain and loss);

d. taking parameters such as stdev, segments and mad, the total number of detected CNVs, and the number of detected CNVs of each specific type (gain and loss) as candidate parameter indexes for judging from the CNV result whether the test sample is degraded.

Input files:

intermediate files such as cnr, cns, filter.

Related software:

the metrics method in the cnvkit tool;

Output file:

a result file containing all candidate indexes that can be used to automatically identify degraded samples.
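Collecting the candidate indexes for one sample might look like the sketch below. It assumes the metrics output is a tab-separated table with a header row containing `segments`, `stdev` and `mad` columns; that layout, the function names, and all numeric values in the example are illustrative assumptions, not data from the source.

```python
import csv
import io
from collections import Counter


def parse_metrics(text):
    """Read stdev, segments and mad from a one-row tab-separated table."""
    row = next(csv.DictReader(io.StringIO(text), delimiter="\t"))
    return {"stdev": float(row["stdev"]),
            "segments": int(row["segments"]),
            "mad": float(row["mad"])}


def candidate_indexes(metrics_text, filtered_segments, cnv_types):
    """Merge the metrics with the CNV counts.

    filtered_segments : rows of the filtered segment file (step b)
    cnv_types         : 'gain' / 'loss' labels from the CNV.bed file (step c)
    """
    indexes = parse_metrics(metrics_text)
    counts = Counter(cnv_types)
    indexes["filter_count"] = len(filtered_segments)  # total detected CNVs
    indexes["gain"] = counts["gain"]
    indexes["loss"] = counts["loss"]
    return indexes


# Made-up example input.
metrics_text = "sample\tsegments\tstdev\tmad\nS1\t42\t0.31\t0.22\n"
print(candidate_indexes(metrics_text, ["seg1", "seg2", "seg3"],
                        ["gain", "loss", "loss"]))
```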

5. Acquisition of features and thresholds for distinguishing degraded and normal samples

The method comprises the steps that target coverage CNN files of historical CNV-seq samples are used as experimental data, the experimental data are randomly divided into a training set and a testing set, candidate parameter indexes are randomly combined, one combination form is used as one classification feature, and a plurality of classification features are formed; and (3) constructing a classification model by using the training set, verifying the performance of the classification model by using the test set, thereby evaluating the classification performance of each classification characteristic, and taking the classification characteristic and corresponding threshold value which can completely distinguish the degraded sample from the normal sample as a final parameter index for judging whether the sample is degraded or not.

Referring to fig. 6, the detailed process is as follows:

(1) data acquisition:

sample source: historical cnvseq samples;

total number of samples: 510 (normal samples: 489, abnormal samples: 21).

(2) For each of the 510 samples, intermediate files such as the .cnr, .cns, filter.cns and CNV.bed files are generated from the bam file following the detailed steps described above. The .cnr and .cns files are combined with the metrics method in the cnvkit tool to generate a file containing the 3 evaluation parameters Stdev, Segments and Mad; the filter.cns file is then counted to obtain the total number of CNVs detected in each sample (hereinafter named Filter_Count). These 4 parameters are taken as candidate parameters for the subsequent test comparison. Each sample has a value for each of the 4 parameters.

(3) The 4 candidate parameters are combined in every possible way, and each combination is used as one classification feature, giving 15 cases: "Filter_Count", "Segments", "Stdev", "Mad", "Filter_Count + Segments", "Filter_Count + Stdev", "Filter_Count + Mad", "Segments + Stdev", "Segments + Mad", "Stdev + Mad", "Filter_Count + Segments + Stdev", "Filter_Count + Segments + Mad", "Filter_Count + Stdev + Mad", "Segments + Stdev + Mad", and "Segments + Stdev + Mad + Filter_Count".
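The count of 15 follows from taking every non-empty subset of the 4 parameters (2^4 - 1 = 15). A short sketch that enumerates them with the standard library, using a hypothetical function name:

```python
from itertools import combinations

PARAMS = ["Filter_Count", "Segments", "Stdev", "Mad"]


def feature_combinations(params):
    """Return every non-empty combination of the candidate parameters."""
    return [combo
            for size in range(1, len(params) + 1)
            for combo in combinations(params, size)]


features = feature_combinations(PARAMS)
for combo in features:
    print(" + ".join(combo))
print(len(features))  # 4 singles + 6 pairs + 4 triples + 1 quadruple = 15
```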

(4) The SVM, a classic binary classification model, is a linear classifier and a machine learning method that separates two classes by finding an optimal decision boundary.

The discrimination performance of the 15 classification features was evaluated with the SVM model. First, 15 different matrices were constructed from the 4 parameter values of the 510 samples according to the 15 combination types. Each matrix contains the parameter values of each sample under the given combination, together with the sample's status (represented by a number: "0" for a normal sample and "1" for an abnormal sample).

Then the SVM algorithm was applied to the 15 matrices in turn. The verification method is as follows: the matrix data are randomly divided into a training set and a test set in a ratio of 8:2; the training set is used to train and construct the binary classification model, and the test set is used to test and verify the constructed model. Finally, 3 parameters, Score, Intercept and Coefficients, are introduced to evaluate the classification performance of each classification feature, where a higher Score indicates higher accuracy and therefore better classification ability. The specific numerical results are shown in the following table; Score values are sorted in descending order and the top 3 features are selected.

Feature               Score   Intercept      Coefficients
Stdev+Mad             0.99    -1.69347366    1.85678699, 1.99126666
Filter_Count+Stdev    0.97    -2.13857193    0.05151421, -0.27803567
Filter_Count+Mad      0.97    -1.95546197    0.04871313, -0.27713227
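To make the Intercept and Coefficients columns concrete, the sketch below evaluates a linear decision function f(x) = intercept + w · x with the values from the table. Two things are assumptions rather than statements from the source: that the coefficients apply to the raw (unscaled) parameter values in the order given by the feature name, and that a positive f(x) corresponds to class "1" (abnormal). The example inputs and function names are made up.

```python
# (intercept, coefficients) copied from the table above.
MODELS = {
    "Stdev+Mad": (-1.69347366, (1.85678699, 1.99126666)),
    "Filter_Count+Stdev": (-2.13857193, (0.05151421, -0.27803567)),
    "Filter_Count+Mad": (-1.95546197, (0.04871313, -0.27713227)),
}


def decision_value(feature, values):
    """Linear decision function: intercept + sum(w_i * x_i)."""
    intercept, coefs = MODELS[feature]
    if len(values) != len(coefs):
        raise ValueError("expected one value per coefficient")
    return intercept + sum(w * x for w, x in zip(coefs, values))


def predict(feature, values):
    """Assumed sign convention: positive side is class 1 (abnormal)."""
    return 1 if decision_value(feature, values) > 0 else 0


# Made-up inputs: (Filter_Count, Stdev)
print(predict("Filter_Count+Stdev", (60, 0.6)))
print(predict("Filter_Count+Stdev", (10, 0.2)))
```

Under these assumptions the first input falls on the abnormal side of the boundary (decision value of about 0.785) and the second on the normal side (about -1.679).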

(5) Three different experimental test conditions were then designed to further evaluate and compare the discrimination performance of these 3 classification features, and the SVM classification boundaries were visualized to show the classification ability of the 3 features intuitively. The specific experimental schemes are as follows:

Scheme one: the 510 historical cnvseq samples are randomly divided into a training set and a test set, with 80% as the training set and 20% as the test set.

Scheme two: the data are manually selected to construct the training set and the test set; 408 cnvseq samples (normal samples: 392, abnormal samples: 16) are used as the training set, and 102 cnvseq samples (normal samples: 97, abnormal samples: 5) are used as the test set.

Scheme three: based on the sample data of scheme two, the 5 abnormal samples in the test set of scheme two are split up: 2 of them are moved into the training set and the remaining 3 stay in the test set. The training set then contains 410 samples (normal samples: 392, abnormal samples: 18) and the test set contains 100 samples (normal samples: 97, abnormal samples: 3).
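The sample counts in the three schemes can be checked with a short sketch. The random split below uses Python's standard `random` module with a fixed seed purely for reproducibility; the seed, the function names and the dictionary layout are assumptions, and the source does not say how its random split was implemented.

```python
import random


def random_split(samples, train_fraction=0.8, seed=0):
    """Scheme one: shuffle, then cut into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def move_abnormal_to_train(split, n):
    """Scheme three: move n abnormal samples from the test set to the
    training set, leaving the normal samples untouched."""
    train, test = dict(split["train"]), dict(split["test"])
    if n > test["abnormal"]:
        raise ValueError("not enough abnormal samples in the test set")
    train["abnormal"] += n
    test["abnormal"] -= n
    return {"train": train, "test": test}


# 489 normal (0) and 21 abnormal (1) samples, as in the data set.
labels = [0] * 489 + [1] * 21
train, test = random_split(labels)
print(len(train), len(test))

scheme_two = {"train": {"normal": 392, "abnormal": 16},
              "test": {"normal": 97, "abnormal": 5}}
scheme_three = move_abnormal_to_train(scheme_two, 2)
print(scheme_three)
```

With only 21 abnormal samples, a purely random split can leave very few of them in the test set, which may be one reason the manually constructed schemes two and three were added.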

(6) According to the three designed experimental schemes, an SVM binary classification model is sequentially used for evaluating the distinguishing performance of the 3 classification features including Stdev + Mad, Filter _ Count + Stdev and Filter _ Count + Mad, and an SVM classification boundary diagram is drawn. The SVM classification boundary plots for each of the 3 different features under the three different experimental protocols are shown in FIGS. 7-9.

FIGS. 7a to 7c are the SVM classification boundary plots for Scheme I. They show that the "Filter_Count + Stdev" feature clearly separates degraded from normal CNV-seq samples, the "Filter_Count + Mad" feature is second best, and the "Stdev + Mad" feature cannot completely separate the two and performs worst.

FIGS. 8a to 8c are the SVM classification boundary plots for Scheme II. They show that both the "Filter_Count + Stdev" and the "Filter_Count + Mad" features clearly separate degraded from normal CNV-seq samples, while the "Stdev + Mad" feature cannot completely separate the two and performs worst.

FIGS. 9a to 9c are the SVM classification boundary plots for Scheme III. They show that the "Filter_Count + Stdev" feature still clearly separates degraded from normal CNV-seq samples, the "Filter_Count + Mad" feature is second best, and the "Stdev + Mad" feature cannot completely separate the two and again performs worst.

In conclusion, under all three experimental schemes the "Filter_Count + Stdev" feature shows the best ability to distinguish degraded from normal CNV-seq samples, the "Filter_Count + Mad" feature is second best, and the "Stdev + Mad" feature has no clear discriminating effect. Therefore, the feature combination "Filter_Count + Stdev" is selected as the parameter index for automated quality control of samples with an excessive number of CNVs.

(7) Finally, based on the actual Filter_Count and Stdev values of the critical samples on the three SVM classification boundary plots for the "Filter_Count + Stdev" feature, the intermediate thresholds that separate degraded samples from normal CNV-seq samples were identified: Filter_Count > 50 or Stdev > 0.5. That is, a CNV-seq sample with Filter_Count > 50 or Stdev > 0.5 is a degraded sample.

Therefore, the "Filter_Count + Stdev" feature, with the thresholds Filter_Count > 50 or Stdev > 0.5, is used as the final parameter index for determining whether a sample is degraded.

Checking the CNV result of a test sample against this final parameter index determines whether the test sample is a normal sample or a degraded sample.
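The decision rule itself is a simple disjunction of two threshold checks. A minimal sketch, assuming the two values have already been extracted from the CNV result of the test sample; the function name `is_degraded`, the constant names and the example inputs are hypothetical and do not come from the original method:

```python
# Thresholds taken from the text: Filter_Count > 50 or Stdev > 0.5.
FILTER_COUNT_MAX = 50
STDEV_MAX = 0.5

def is_degraded(filter_count, stdev,
                filter_count_max=FILTER_COUNT_MAX, stdev_max=STDEV_MAX):
    """Return True if either value exceeds its threshold (degraded sample)."""
    return filter_count > filter_count_max or stdev > stdev_max

# Hypothetical inputs, for illustration only.
print(is_degraded(20, 0.30))   # False: both values are within the thresholds
print(is_degraded(120, 0.30))  # True: Filter_Count exceeds 50
print(is_degraded(20, 0.80))   # True: Stdev exceeds 0.5
```

Keeping the thresholds as parameters makes it straightforward to re-derive them from a different set of historical samples without changing the rule.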

To further verify the accuracy of this feature and its thresholds, 3 known CNV-seq samples (1 normal sample and 2 abnormal samples) were selected from the historical CNV-seq samples used in the above method.

The verification checks whether the CNV status actually detected in these known CNV-seq samples is consistent with the result of applying the feature and thresholds.

Details of the selected known CNV-seq samples:

Normal sample 1: Filter_Count < 50 and Stdev < 0.5;

Abnormal sample 1: Filter_Count > 50;

Abnormal sample 2: Stdev > 0.5.

The CNVs actually detected in the 3 samples and the corresponding Filter_Count and Stdev values are as follows:

Normal sample 1 (Filter_Count < 50 and Stdev < 0.5):

For this sample, Filter_Count = 36 and Stdev = 0.27, so the CNV-seq sample is judged to be a normal sample.

The actually detected CNV status is shown in FIG. 10, where the blue regions marked by arrows are CNVs of the duplication type and the circled red regions are CNVs of the deletion type. The figure shows that the CNV-seq sample is a normal sample, which is consistent with the result of the feature and threshold determination.

Abnormal sample 1 (Filter_Count > 50):

For this sample, Filter_Count = 566 and Stdev = 0.497, so the CNV-seq sample is judged to be a degraded sample.

The actually detected CNV status is shown in FIG. 11, where the blue regions marked by arrows are CNVs of the duplication type and the circled red regions are CNVs of the deletion type. The figure shows that the CNV-seq sample has too many abnormal CNVs and is a degraded sample, which is consistent with the result of the feature and threshold determination.

Abnormal sample 2 (Stdev > 0.5):

For this sample, Filter_Count = 63 and Stdev = 1.59, so the CNV-seq sample is judged to be a degraded sample.

The actually detected CNV status is shown in FIG. 12, where the blue regions marked by arrows are CNVs of the duplication type and the circled red regions are CNVs of the deletion type. The figure shows that the CNV-seq sample has too many abnormal CNVs and is a degraded sample, which is consistent with the result of the feature and threshold determination.

In summary, for these known CNV-seq samples, the results of applying the feature and thresholds are consistent with the CNV status actually detected in the samples.
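The consistency check on the three known samples can be reproduced from the values reported above. The sketch below is illustrative only: the helper `triggered_criteria` and the data layout are hypothetical, and the "observed" flags simply encode the status read from FIGS. 10-12 as described in the text.

```python
# Thresholds stated in the text.
FILTER_COUNT_MAX = 50
STDEV_MAX = 0.5

# (name, Filter_Count, Stdev, degraded according to the observed CNV status)
REPORTED = [
    ("normal sample 1", 36, 0.27, False),
    ("abnormal sample 1", 566, 0.497, True),
    ("abnormal sample 2", 63, 1.59, True),
]

def triggered_criteria(filter_count, stdev):
    """Return the list of threshold criteria that the sample exceeds."""
    reasons = []
    if filter_count > FILTER_COUNT_MAX:
        reasons.append("Filter_Count > 50")
    if stdev > STDEV_MAX:
        reasons.append("Stdev > 0.5")
    return reasons

results = {}
for name, filter_count, stdev, observed in REPORTED:
    reasons = triggered_criteria(filter_count, stdev)
    predicted = bool(reasons)  # degraded if at least one criterion fires
    results[name] = (predicted, reasons)
    status = "consistent" if predicted == observed else "INCONSISTENT"
    print(f"{name}: degraded={predicted}, criteria={reasons}, {status}")
```

With the reported values, abnormal sample 1 is flagged by Filter_Count alone (its Stdev of 0.497 is just below 0.5), while abnormal sample 2 exceeds both thresholds, which shows why the rule combines the two criteria with "or".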

The inventive concept is explained in detail herein using specific examples, which are given only to aid in understanding the core concepts of the invention. It should be understood that any obvious modifications, equivalents and other improvements made by those skilled in the art without departing from the spirit of the present invention are included in the scope of the present invention.
