Construction method, detection device and computer readable medium of liver cancer early screening model

Document No.: 21610 Published: 2021-09-21

Reading note: This technology, "Construction method, detection device and computer readable medium of liver cancer early screening model" (肝癌早筛模型的构建方法、检测装置以及计算机可读取介质), was designed by 刘睿, 包华, 吴雪, 吴舒雨, 魏玉林, 包海荣, 邵阳, 杨珊珊, 朱柳青, 崔月利 and 刘璟文 on 2021-07-03. Abstract: The invention relates to a method for constructing an early screening model for liver cancer, a detection device, and a computer readable medium. Read lengths of WGS cfDNA from 170 controls and 192 liver cancer patients were tallied, and the numbers of total fragments (40-300 bp), short fragments (40-80 bp) and ultra-long fragments (200-300 bp) were found to differ between the two groups; the numbers of fragments of different lengths, counted per chromosome long and short arm, also differed significantly between the two groups. Based on high-throughput, low-depth sequencing of plasma cfDNA, the invention provides for the first time a diagnostic model relating DNA fragment size distribution and end-sequence proportion to liver cancer. The model can diagnose early liver cancer and distinguish it from liver cirrhosis, and offers non-invasive detection, low throughput, and high detection specificity and sensitivity.

1. A method for constructing an early screening model for liver cancer, characterized by comprising the following steps:

step 1, extracting cfDNA from samples of a positive group and a control group and sequencing it to obtain read data;

step 2, aligning the read data to a reference genome;

step 3, obtaining, as initial feature values, the numbers of reads in different length intervals within different window ranges on the reference genome;

step 4, screening the initial feature values for those that differ significantly between the samples of the positive group and the control group, and using them as the model feature vector;

step 5, inputting the model feature vectors of the samples of the positive group and the control group into a model, and training the model with the probability of having liver cancer as the model output value, to obtain the early screening model.

2. The method for constructing an early screening model for liver cancer according to claim 1, wherein step 3 comprises:

step 3-1, dividing the reference genome into a plurality of windows, and obtaining the number of all reads, the number of short reads and the number of ultra-long reads within each window;

step 3-2, taking the long arm and the short arm of each chromosome as region ranges, and obtaining the number of reads in each of the different length gradient intervals within each range;

step 3-3, taking the data obtained in steps 3-1 and 3-2 together as the initial feature values.

3. The method for constructing an early screening model for liver cancer according to claim 2, wherein the short reads are 40-80 bp in length and the ultra-long reads are 200-300 bp in length; all reads refers to reads with a length in the range of 40-300 bp.

4. The method for constructing an early screening model for liver cancer according to claim 2, wherein the window size in step 3-1 is 2-7 Mb.

5. The method for constructing an early screening model for liver cancer according to claim 2, wherein the different length gradient intervals in step 3-2 are length ranges obtained by incrementing in steps of 8-12 bp within the range of 40-300 bp.

6. The method of claim 2, wherein the numbers of reads are normalized.

7. A device for constructing an early screening model for liver cancer, characterized by comprising:

a sequencing module, used for extracting and sequencing cfDNA from the samples of the positive group and the control group to obtain read data;

an alignment module, used for aligning the read data to a reference genome;

a feature value acquisition module, used for obtaining, as initial feature values, the numbers of reads in different length intervals within different window ranges on the reference genome;

a screening module, used for screening the initial feature values for those that differ significantly between the samples of the positive group and the control group, and using them as the model feature vector;

and a model construction module, used for inputting the model feature vectors of the samples of the positive group and the control group into a model, and training the model with the probability of having liver cancer as the model output value, to obtain the early screening model.

8. A computer readable medium comprising a stored program, wherein when the program runs, the apparatus in which the computer readable medium is located is controlled to execute the method for constructing an early screening model for liver cancer according to claim 1.

Technical Field

The invention relates to early screening for hepatocellular carcinoma (HCC), and belongs to the technical field of molecular biomedicine.

Background

Liver cancer is a malignant tumor of the liver. The number of new liver cancer patients worldwide is about sixty-one hundred thousand every year, and its mortality ranks second. Liver cancer has an insidious onset and the progression from hepatitis to cancer is long; there are no obvious symptoms or signs in the early stage, most patients are diagnosed at an intermediate or advanced stage, and the early diagnosis rate is low. In China, patients with intermediate or advanced liver cancer survive less than 2 years, whereas with early intervention the five-year survival rate can reach 90%.

Liver cancer still lacks effective screening methods, and the detection performance and accessibility of traditional early screening methods limit effective clinical screening. Current liver cancer screening mainly consists of serum AFP (alpha-fetoprotein) testing and imaging examination. Screening by AFP combined with ultrasound demands high patient compliance, its accessibility falls short of clinical needs, and its diagnostic sensitivity for early liver cancer is insufficient; imaging examination also has limitations and cannot meet screening needs. China therefore urgently needs an effective, economical and practical screening method suitable for a broad population.

Disclosure of Invention

The invention provides a method that performs WGS on cfDNA from plasma samples and, on the high-throughput sequencing results, analyzes the high-resolution fragmentation size distribution of DNA fragments that differ between liver cancer patients and healthy individuals to construct a model, thereby achieving non-invasive and accurate diagnosis of liver cancer.

A method for constructing an early screening model for liver cancer comprises the following steps:

step 1, extracting cfDNA from samples of a positive group and a control group and sequencing it to obtain read data;

step 2, aligning the read data to a reference genome;

step 3, obtaining, as initial feature values, the numbers of reads in different length intervals within different window ranges on the reference genome;

step 4, screening the initial feature values for those that differ significantly between the samples of the positive group and the control group, and using them as the model feature vector;

step 5, inputting the model feature vectors of the samples of the positive group and the control group into a model, and training the model with the probability of having liver cancer as the model output value, to obtain the early screening model.

Step 3 comprises:

step 3-1, dividing the reference genome into a plurality of windows, and obtaining the number of all reads, the number of short reads and the number of ultra-long reads within each window;

step 3-2, taking the long arm and the short arm of each chromosome as region ranges, and obtaining the number of reads in each of the different length gradient intervals within each range;

step 3-3, taking the data obtained in steps 3-1 and 3-2 together as the initial feature values.

The short reads are 40-80 bp in length and the ultra-long reads are 200-300 bp in length; all reads refers to reads with a length in the range of 40-300 bp.

The window size in step 3-1 is in the range of 2-7 Mb.

The different length gradient intervals in step 3-2 are length ranges obtained by incrementing in steps of 8-12 bp within the range of 40-300 bp.

The numbers of reads are normalized.

A device for constructing an early screening model for liver cancer comprises:

a sequencing module, used for extracting and sequencing cfDNA from the samples of the positive group and the control group to obtain read data;

an alignment module, used for aligning the read data to a reference genome;

a feature value acquisition module, used for obtaining, as initial feature values, the numbers of reads in different length intervals within different window ranges on the reference genome;

a screening module, used for screening the initial feature values for those that differ significantly between the samples of the positive group and the control group, and using them as the model feature vector;

and a model construction module, used for inputting the model feature vectors of the samples of the positive group and the control group into a model, and training the model with the probability of having liver cancer as the model output value, to obtain the early screening model.

The feature value acquisition module comprises:

a first read counting module, used for dividing the reference genome into a plurality of windows and obtaining the total number of reads, the number of short reads and the number of ultra-long reads within each window;

a second read counting module, used for taking the long arm and the short arm of each chromosome as region ranges and obtaining the number of reads in each of the different length gradient intervals within each range;

and a merging module, used for taking the data obtained by the first read counting module and the second read counting module together as the initial feature values.

A computer readable medium comprises a stored program; when the program runs, the device in which the readable medium is located is controlled to execute the above method for constructing an early screening model for liver cancer.

Advantageous effects

(1) In early-stage liver cancer, the ctDNA concentration is higher than in other cancer types. The blood ctDNA content of patients with hepatocellular carcinoma (HCC) is far higher than that of healthy people and patients with common liver diseases, and the difference appears from a very early stage: even when a patient has no solid tumor or only a very small one, the blood ctDNA content is markedly above the normal level. Liver cancer is therefore well suited to liquid biopsy based on ctDNA detection. According to current clinical research data on early liver cancer screening, the sensitivity and specificity of liquid biopsy early screening products both exceed 90%, which gives them high clinical value.

(2) Read lengths of WGS cfDNA from 170 controls and 192 liver cancer patients were tallied, and the numbers of total fragments (40-300 bp), short fragments (40-80 bp) and ultra-long fragments (200-300 bp) were found to differ between the two groups; the numbers of fragments of different lengths, counted per chromosome long and short arm, also differed significantly between the two groups.

(3) Based on high-throughput, low-depth sequencing of plasma cfDNA, the invention provides for the first time a diagnostic model relating DNA fragment size distribution and end-sequence proportion to liver cancer. The model can diagnose early liver cancer and distinguish it from liver cirrhosis, and offers non-invasive detection, low throughput, and high detection specificity and sensitivity.

Drawings

FIG. 1 is a schematic diagram of a model building process;

FIG. 2 shows statistics of DNA fragments of different lengths in liver cancer patients and the control group;

FIG. 3 shows statistics of DNA fragments of 120 bp or less in liver cancer patients and the control group;

FIG. 4 is a heat map of the differences in the total read proportion features of the top 50 5-Mb windows between liver cancer patients and the control group;

FIG. 5 is a heat map of the differences in the short read features of the top 50 5-Mb windows between liver cancer patients and the control group;

FIG. 6 is a heat map of the differences in the ultra-long read features of the top 50 5-Mb windows between liver cancer patients and the control group;

FIG. 7 is a heat map of the differences in the proportion features of reads of different lengths in the top 50 chromosome arm windows between liver cancer patients and the control group;

FIG. 8 is a graph of the predicted results of classifiers on a validation set and a test set;

FIG. 9 is a graph of the predicted results of classifiers on a validation set;

FIG. 10 is a graph of the predicted results of classifiers on a test set;

FIG. 11 is an AUC curve over validation and test sets;

FIG. 12 is an AUC curve over the validation set;

FIG. 13 is an AUC curve over the test set;

FIG. 14 shows the AUC curves of the individual DNA fragment statistics for the liver cancer versus non-liver cancer groups;

FIG. 15 shows the AUC curves of different combinations of DNA fragment statistics for the liver cancer versus non-liver cancer groups;

FIG. 16 shows the AUC curves of the individual DNA fragment statistics for the liver cancer versus liver cirrhosis groups;

FIG. 17 shows the AUC curves of different combinations of DNA fragment statistics for the liver cancer versus liver cirrhosis groups;

Detailed Description

The calculation method of the invention is detailed as follows:

The invention first requires extracting cfDNA from blood samples, constructing a library, and sequencing. The extraction and library construction methods are not particularly limited and can be adapted from methods in the prior art. The base information of the cfDNA can be obtained with existing sequencing techniques.

The data set conditions adopted in the model construction process of the invention are as follows:

Extraction and sequencing of plasma cfDNA samples

An 8 ml whole blood sample is collected from the patient in a purple-top blood collection tube (EDTA anticoagulant tube), and the plasma is separated by centrifugation promptly (within 2 hours). After transport to the laboratory, ctDNA is extracted from the plasma sample with a QIAGEN plasma DNA extraction kit according to the instructions. A library is constructed from the collected cfDNA sample, and WGS is performed at 2× depth. After the sequencer output is obtained, the data are aligned to the human reference genome to obtain the base information of the corresponding reads.

Data processing

The marker data in the invention mainly use the high-resolution fragmentation size distribution for machine learning to build a prediction model that distinguishes non-liver-cancer subjects (healthy people and liver cirrhosis patients) from liver cancer patients.

The DNA fragment size distribution reflects the distribution of cfDNA read lengths. Comparing the cfDNA read lengths of 192 liver cancer patients and 170 controls shows that the numbers of fragments of 40-80 bp and of 200-300 bp differ between the two groups, so they can serve as distinguishing features.

The cfDNA read length data are obtained as follows: from the aligned BAM files, the quality, length and alignment position of each read are recorded; the human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is divided into 572 windows of 5 Mb each, and the total number of reads (40-300 bp), the number of short reads (40-80 bp) and the number of ultra-long reads (200-300 bp) in each window are counted. Each read count is then normalized against the counts of the same category across all windows: normalized value = (raw value - mean) / standard deviation. This yields 572 normalized counts for each of the three read categories.
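The windowed counting and normalization described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the toy fragment list, the use of a fragment's start position to assign it to a window, and the choice of the population standard deviation are all assumptions, since the text does not specify them.

```python
from collections import Counter
from statistics import mean, pstdev

WINDOW_BP = 5_000_000  # 5 Mb windows, as stated in the description


def categories(length):
    """Return the read categories that a fragment length (bp) falls into."""
    cats = []
    if 40 <= length <= 300:
        cats.append("all")
    if 40 <= length <= 80:
        cats.append("short")
    if 200 <= length <= 300:
        cats.append("ultra_long")
    return cats


def count_by_window(fragments):
    """Count fragments per (chromosome, window index) for each category.

    fragments: iterable of (chrom, start_position, fragment_length).
    Assigning a fragment to a window by its start position is an assumption.
    """
    counts = {"all": Counter(), "short": Counter(), "ultra_long": Counter()}
    for chrom, pos, length in fragments:
        window = (chrom, pos // WINDOW_BP)
        for cat in categories(length):
            counts[cat][window] += 1
    return counts


def zscore(values):
    """(raw value - mean) / standard deviation, taken across all windows."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; the source does not say which SD is used
    return [(v - mu) / sd if sd else 0.0 for v in values]


# Hypothetical toy fragments: (chromosome, start position, fragment length in bp)
fragments = [
    ("chr1", 1_200_000, 60),    # window 0: all + short
    ("chr1", 1_900_000, 167),   # window 0: all
    ("chr1", 3_000_000, 220),   # window 0: all + ultra_long
    ("chr1", 6_100_000, 250),   # window 1: all + ultra_long
    ("chr1", 12_000_000, 310),  # longer than 300 bp, not counted
]
counts = count_by_window(fragments)
windows = [("chr1", 0), ("chr1", 1)]
normalized_all = zscore([counts["all"][w] for w in windows])
```

In a real run the window list would hold all 572 windows, so that windows with zero reads still contribute to the mean and standard deviation.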

Meanwhile, to obtain high-resolution read results, the 41 long-arm and short-arm regions of the chromosomes of the human reference genome are used as windows, as listed below:

chr1_p chr4_q chr8_p chr11_q chr16_q chr20_p
chr1_q chr5_p chr8_q chr12_p chr17_p chr20_q
chr2_p chr5_q chr9_p chr12_q chr17_q chr21_q
chr2_q chr6_p chr9_q chr13_q chr18_p chr22_q
chr3_p chr6_q chr10_p chr14_q chr18_q chrX_p
chr3_q chr7_p chr10_q chr15_q chr19_p chrX_q
chr4_p chr7_q chr11_p chr16_p chr19_q

Fragments of 40-300 bp are divided into 27 length gradients in steps of 10 bp (for example, 40-49 bp, 50-59 bp, ... on the q arm of chr1). The number of fragments in each length gradient is counted within each long-arm and short-arm window and then normalized. In total, 2823 high-resolution DNA fragment size distribution features are obtained (2823 = 572 normalized total read counts + 572 normalized short read counts + 572 normalized ultra-long read counts + 41 × 27 normalized length-gradient counts).
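The length-gradient binning and the feature count can be checked with a short sketch. It assumes 10-bp steps starting at 40 bp, which gives 26 full bins (40-49 through 290-299) plus a final bin holding only 300 bp; the text states 27 gradients but does not say how the last one is bounded, so that boundary is an assumption, as are the function and variable names.

```python
from collections import Counter

# Start of each length gradient: 40, 50, ..., 300 (27 values)
BIN_STARTS = list(range(40, 301, 10))


def length_bin(length):
    """Return the (low, high) gradient for a fragment length, or None if out of range."""
    if not 40 <= length <= 300:
        return None
    low = 40 + (length - 40) // 10 * 10
    return (low, min(low + 9, 300))  # last bin is truncated at 300 bp (assumption)


def count_by_arm(fragments):
    """Count fragments per (chromosome arm, length gradient).

    fragments: iterable of (arm_name, fragment_length), e.g. ("chr1_q", 45).
    """
    counts = Counter()
    for arm, length in fragments:
        gradient = length_bin(length)
        if gradient is not None:
            counts[(arm, gradient)] += 1
    return counts


N_WINDOWS = 572  # 5 Mb windows
N_ARMS = 41      # chromosome arm regions listed in the description
n_features = 3 * N_WINDOWS + N_ARMS * len(BIN_STARTS)

# Hypothetical toy fragments on two arms
arm_counts = count_by_arm([("chr1_q", 45), ("chr1_q", 48), ("chr1_q", 52), ("chr19_p", 305)])
```

With these constants `n_features` evaluates to 3 × 572 + 41 × 27 = 2823, matching the total stated above.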

After the high-resolution DNA data of 192 liver cancer patients and 170 controls are obtained, the high-resolution DNA fragment size distribution statistics are used as input (the input vector of each sample consists of 2823 read proportion feature values), and a deep learning network model determines whether the sample under test is classified as a liver cancer sample or a normal sample. The deep learning is based on a multi-layer feedforward artificial neural network trained with stochastic gradient descent using backpropagation. The network may contain many hidden layers of neurons with hyperbolic tangent (tanh), rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing and grid search can yield higher prediction accuracy. During training, each compute node trains a copy of the global model parameters on its local data using multiple threads (asynchronously) and periodically contributes to the global model through model averaging across the network. The feedforward artificial neural network (ANN) model, also known as a deep neural network (DNN) or multi-layer perceptron (MLP), is the most common type of deep neural network and is the type used for deep learning in this patent.
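To make the training procedure concrete, the following is a toy sketch of a feedforward network with one tanh hidden layer and a sigmoid output, trained by stochastic gradient descent with backpropagation. It is not the patent's model: the text names no library, layer sizes or hyperparameters, so the single hidden layer, the cross-entropy loss, the learning rate and the 2-feature toy samples are all assumptions chosen to keep the example small. Real inputs would be the 2823-dimensional feature vectors.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_in, n_hidden, rng):
    """Small random weights for one tanh hidden layer and a sigmoid output unit."""
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(p, x):
    """Return hidden activations and the predicted probability of the positive class."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hi for w, hi in zip(p["W2"], h)) + p["b2"])
    return h, y


def sgd_step(p, x, t, lr):
    """One stochastic gradient descent update via backpropagation (cross-entropy loss)."""
    h, y = forward(p, x)
    d_out = y - t  # gradient of the loss w.r.t. the output pre-activation
    for j, hj in enumerate(h):
        # Backpropagate through the output weight and the tanh derivative.
        d_hid = d_out * p["W2"][j] * (1.0 - hj * hj)
        p["W2"][j] -= lr * d_out * hj
        for i, xi in enumerate(x):
            p["W1"][j][i] -= lr * d_hid * xi
        p["b1"][j] -= lr * d_hid
    p["b2"] -= lr * d_out


def mean_loss(p, data):
    """Mean binary cross-entropy over (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, t in data:
        _, y = forward(p, x)
        total -= t * math.log(y + eps) + (1 - t) * math.log(1 - y + eps)
    return total / len(data)


rng = random.Random(0)
# Hypothetical toy samples: label 1 = positive group, label 0 = control group.
data = [([1.0, -1.0], 1), ([0.8, -0.5], 1), ([-1.0, 1.0], 0), ([-0.7, 0.9], 0)]
params = init_params(n_in=2, n_hidden=3, rng=rng)
loss_before = mean_loss(params, data)
for _ in range(200):          # epochs
    rng.shuffle(data)         # visit samples in random order each epoch
    for x, t in data:
        sgd_step(params, x, t, lr=0.1)
loss_after = mean_loss(params, data)
```

The output of `forward` plays the role of the model output value described in step 5, that is, the predicted probability of having liver cancer.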

After training, the deep learning network model ranks the discrimination contribution of the 2823 high-resolution DNA size distribution features, and 926 features that differ markedly between the two groups are selected (208 total read count features, 244 short read count features, 177 ultra-long read count features and 297 chromosome arm read count features). The top 50 features of each distribution were analyzed for differences; as the heat maps show, these 50 features differ significantly between the two groups in every distribution.
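The ranking and selection step can be illustrated with a small sketch. The patent does not state how the contribution values are computed or what cutoff is applied, so the `importance` scores, the feature names and the `threshold` below are hypothetical; only the per-category counts in `reported` come from the text.

```python
from collections import Counter


def select_features(importance, threshold):
    """Rank features by contribution (highest first) and keep those above the threshold."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Hypothetical contribution scores; names are "category|location" for illustration only.
importance = {
    "total|chr1:0": 0.90,
    "short|chr2:3": 0.40,
    "arm|chr19_q|210-219": 0.70,
    "ultra_long|chr5:7": 0.05,
}
selected = select_features(importance, threshold=0.10)
per_category = Counter(name.split("|")[0] for name in selected)

# Per-category counts of selected features as reported in the description.
reported = {"total": 208, "short": 244, "ultra_long": 177, "arm": 297}
reported_total = sum(reported.values())
```

The reported category counts sum to 926, consistent with the number of selected features stated above.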

The distinguishing features on the chromosome arms are shown in the following table, where chr denotes the chromosome number, p and q denote the short and long arms respectively, and the range values denote fragment length intervals in bp.

Chromosome arm Fragment length (bp) Chromosome arm Fragment length (bp)
chr19_q 210-219 chr7_p 220-229
chr19_p 200-209 chr8_q 170-179
chr18_p 170-179 chr7_q 290-299
chr19_p 170-179 chr17_p 200-209
chr1_p 160-169 chr1_q 290-299
chrX_q 140-149 chr2_q 170-179
chrX_q 130-139 chr17_q 290-299
chr20_p 170-179 chr22_q 160-169
chr18_p 180-189 chr1_q 230-239
chr1_p 80-89 chr8_p 210-219
chr12_q 140-149 chr20_p 210-219
chr16_q 220-229 chr12_q 240-249
chr10_q 230-239 chr1_q 260-269
chr3_p 230-239 chr8_q 140-149
chr9_q 160-169 chr15_q 220-229
chr17_q 220-229 chr16_q 290-299
chr18_p 190-199 chr22_q 140-149
chr12_p 290-299 chr19_p 160-169
chr7_p 290-299 chr4_q 230-239
chr1_p 170-179 chr1_q 270-279
chr11_q 280-289 chr12_p 210-219
chr20_q 210-219 chr9_q 220-229
chr11_p 290-299 chr12_q 230-239
chr16_q 210-219 chr5_p 210-219
chr1_p 240-249 chr18_p 200-209

Meanwhile, as can be seen in the heat map of the chromosome arm reading distribution, some characteristics are clearly different between the liver cancer patients and the liver cirrhosis patients.

The results obtained with the above model are shown in the following table:

in the case of different model input vectors, the model prediction performance is as follows:

The total read, short read, ultra-long read and chromosome arm read distributions can each be used alone for training and distinguish non-cancer subjects from cancer patients to some extent. Combining them as the high-resolution DNA fragment size distribution gives the best training and prediction performance, with a highest AUC of 0.995. The combined input vector also distinguishes liver cancer patients from liver cirrhosis patients well, with a highest AUC of 0.985.
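For readers unfamiliar with the metric, AUC can be computed as the probability that a randomly chosen positive sample receives a higher model score than a randomly chosen negative sample. The sketch below uses that pairwise definition; the scores are hypothetical and are not the patent's data.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities for liver cancer and control samples
cancer_scores = [0.9, 0.8, 0.4]
control_scores = [0.5, 0.3, 0.1]
example_auc = auc(cancer_scores, control_scores)
```

Here 8 of the 9 pairs are ordered correctly, so `example_auc` is 8/9; an AUC of 1.0 would mean every cancer sample scores above every control.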
