Method for identifying protein biomarkers independent of database search

Document No.: 1478003  Publication date: 2020-02-25

Abstract: This invention, a method for identifying protein biomarkers independent of database search (一种不依赖数据库搜索的蛋白质生物标志物鉴定方法), was created by 朱云平 (Zhu Yunping), 常乘 (Chang Cheng), 刘祎 (Liu Yi) and 贺福初 (He Fuchu) on 2019-10-31. The invention discloses a protein biomarker identification method independent of database search, comprising the following steps: 1) extracting the ion current chromatographic peaks from each raw mass spectrometry file in the training data set; 2) preprocessing the ion current chromatographic peak list, and arranging the mean and standard deviation of the signal intensity values corresponding to the commonly detected mass-to-charge ratios, in order and as points of the form (mean, standard deviation), into a feature vector; 3) using a deep learning technique, with the preprocessed ion current chromatographic peak lists as the training set, constructing a classification model for experimental-group and control-group samples; 4) using the trained classification model to classify the experimental data to be identified, determining whether they belong to the experimental group or the control group; 5) after confirming that the accuracy of the classification results meets the requirement, outputting the key feature vectors used by the classification model; 6) determining, by a targeted proteomics technique, the peptides and protein sequences corresponding to the key feature vectors, which serve as the biomarkers.

1. A method for identifying protein biomarkers independent of database search, comprising the steps of:

1) extracting ion current chromatographic peaks from each raw mass spectrometry file in a training data set, wherein the raw files in the training data set comprise files derived from experimental-group samples and files derived from control-group samples;

2) preprocessing the ion current chromatographic peak list: for the commonly detected mass-to-charge ratios, arranging the mean and standard deviation of the corresponding signal intensity values in order, as points of the form (mean, standard deviation), into a feature vector, and storing the feature vector;

3) using a deep learning technique, with the preprocessed ion current chromatographic peak lists as the training set, constructing a classification model that distinguishes experimental-group samples from control-group samples;

4) using the trained classification model to classify experimental data to be identified, determining whether the data belong to the experimental group or the control group;

5) after confirming that the accuracy of the classification results meets the requirement, outputting the key feature vectors used by the classification model of step 3);

6) determining, by a targeted proteomics technique, the peptides and protein sequences corresponding to the key feature vectors, which serve as the biomarkers.

2. The method of claim 1, wherein extracting the ion current chromatographic peaks comprises:

1-1) reading a mass spectrometry file to obtain, for each spectrum in the file, its scan number, retention time, number of spectral peaks, spectral peak intensities and spectral peak mass-to-charge ratios;

1-2) searching for isotope peak clusters in each spectrum, and recording the peak with the highest intensity in each cluster as the monoisotopic peak;

1-3) recording monoisotopic peaks that have equal mass-to-charge ratios and whose retention times lie within a set time difference as one ion current chromatographic peak group;

1-4) fitting each ion current chromatographic peak group to obtain one ion current chromatographic peak, and calculating the peak area and the average retention time of each ion current chromatographic peak.

3. The method of claim 2, wherein in step 1-4), each ion current chromatographic peak group is fitted with a Gaussian peak to obtain the ion current chromatographic peak.

4. The method according to claim 2, wherein the ion current chromatographic peak information obtained in step 1-4) is output as a list in which each row stores the information of one ion current chromatographic peak, including its mass-to-charge ratio, peak area, intensity and average retention time.

5. The method of claim 1, wherein in step 2), the feature vector is generated as follows: traversing all samples in the training data set to obtain every mass-to-charge ratio present in the samples, and counting, within each class of samples, the number of samples sharing each mass-to-charge ratio; then, for any class i, taking the mass-to-charge ratios shared by at least a set proportion of the samples in class i as the mass-to-charge ratios of class i and storing them as the common mass-to-charge ratio vector of class i; merging the common mass-to-charge ratio vectors of all classes into a total-sample common mass-to-charge ratio vector; for each sample, extracting the intensity values corresponding to each mass-to-charge ratio in the total-sample common mass-to-charge ratio vector, calculating in turn the mean and standard deviation of those intensity values, and arranging them in order, as points of the form (mean, standard deviation), into the feature vector.

6. The method as claimed in claim 5, wherein in step 4), the ion current chromatographic peaks of the mass spectrometry file to be identified are first extracted; the mean and standard deviation of the signal intensity values corresponding to each mass-to-charge ratio in the total-sample common mass-to-charge ratio vector are then extracted from that file to obtain its feature vector; the feature vector is input into the trained classification model; and the class of the file, namely whether it comes from the experimental group or the control group, is determined from the output.

7. The method of claim 1, wherein in step 3), the classification model is constructed based on a convolutional neural network; wherein the classification model comprises three convolutional layers and two fully connected layers, the first convolutional layer comprises N different filters, the second convolutional layer comprises 2N filters, the third convolutional layer comprises 4N filters, the size of the first fully connected layer is 64N, and the size of the second fully connected layer is 8N; the first convolutional layer is connected to the second convolutional layer through a first pooling layer, the second convolutional layer is connected to the third convolutional layer through a second pooling layer, the third convolutional layer is connected to the first fully connected layer through a third pooling layer, and the output of the first fully connected layer is connected to the input of the second fully connected layer.

8. The method of claim 1, wherein, based on the peak area, retention time, intensity and mass-to-charge ratio of the ion current chromatographic peak corresponding to each key feature vector, targeted proteomics is used to determine the peptide and protein sequence corresponding to each ion current chromatographic peak, which serve as the biomarkers.

9. The method as claimed in claim 1, wherein in step 5), the key feature vectors used by the classification model of step 3) are output by using an interpretability method for deep learning models.

10. The method of claim 9, wherein the interpretability method for deep learning models is Grad-CAM.

Technical Field

The invention relates to a method for identifying protein biomarkers in proteomics, in particular to a method for identifying protein biomarkers in shotgun proteomics.

Background

A biomarker is an indicator that can be objectively measured and evaluated. It can indicate normal biological processes, pathological processes, or pharmacological responses to therapeutic intervention, and it is important for screening, diagnosing or monitoring disease, guiding molecularly targeted therapy, and evaluating therapeutic effect (reference: Ludwig JA, Weinstein JN. Biomarkers in cancer staging, prognosis and treatment selection. Nat Rev Cancer 5, 845-856 (2005)). Proteins sit at the end of the central dogma as the carriers of life activities. Because of alternative splicing, single nucleotide polymorphisms and post-translational modifications, the state of a protein carries information in more dimensions and is closely tied to every aspect of life activity, which makes proteins better suited to serve as biomarkers. However, discovering protein biomarkers is more challenging than discovering DNA- and RNA-derived markers, because protein expression spans a larger dynamic range and proteome data are more complex (reference: Rifai N, Gillette MA, Carr SA. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol 24, 971-983 (2006)). As the mainstream method of proteomics research, mass spectrometry is now widely applied to protein biomarker screening thanks to its high throughput and high sensitivity (reference: Chang C, Zhu YP. Research progress in mass spectrometry-based quantitative proteomics strategies and methods. Scientia Sinica Vitae (中国科学: 生命科学) 45, 425-438 (2015)). At present, protein biomarker screening is mostly based on differences in protein expression abundance between an experimental group and a control group, and it follows two main strategies.

One is the classical biomarker screening strategy, which is divided into three stages: discovery, confirmation and verification of protein biomarkers. Because the number of samples required grows from stage to stage while the number of candidate proteins shrinks, it is also called the "triangle" strategy (reference: Whiteaker JR, et al. A targeted proteomics-based pipeline for verification of biomarkers in plasma. Nat Biotechnol 29, 625-634 (2011)). The other is the "rectangle" strategy, which resembles genome-wide association analysis (reference: Geyer PE, Holdt LM, Teupser D, Mann M. Revisiting biomarker discovery by plasma proteomics. Mol Syst Biol 13, 942 (2017)): shotgun proteome data from a large cohort are analyzed in the initial discovery stage to find correlations between disease state and changes in protein expression level or modification state, and large-scale shotgun proteome data are used again in the verification stage. Both strategies depend on the accuracy and sensitivity of the qualitative (identification) and quantitative results of the proteome data. However, the proportion of mass spectra that can be interpreted is still not high, so searching for peptide or protein markers on the basis of identification and quantification results misses much information. In addition, traditional screening strategies judge each marker by its individual effect, rather than screening markers at the global level on the basis of expression patterns.

Disclosure of Invention

In view of the technical problems in the prior art, the invention aims to use a deep learning method, with raw mass spectrometry files as input data, to extract the key feature vectors of a training data set without relying on database search, and to identify the class of other, unknown mass spectrometry files. The technical solution comprises the following steps:

Step 1) extracting the ion current chromatographic peaks from the raw mass spectrometry files;

Step 2) preprocessing the ion current chromatographic peak list: arranging the mean and standard deviation of the signal intensity values corresponding to the commonly detected mass-to-charge ratios, in order and as points of the form (mean, standard deviation), into a feature vector, and storing the feature vector;

Step 3) using a deep learning technique, with the preprocessed ion current chromatographic peak lists as the training set, constructing a classification model for experimental-group and control-group samples;

Step 4) using the trained classification model to classify other experimental data to be identified, determining whether they belong to the experimental group or the control group;

Step 5) after confirming that the accuracy of the classification results meets the requirement, using an interpretability method for deep learning models to output the key feature vectors used by the classification model trained in step 3) and applied in step 4);

Step 6) determining, by a targeted proteomics technique, the peptides and protein sequences corresponding to the key feature vectors, which serve as the biomarkers.

In the above technical solution, in step 1), extracting the ion current chromatographic peaks from the raw mass spectrometry files comprises:

Step 1-1) reading all raw mass spectrometry files to obtain, for each spectrum, information such as its scan number, retention time, number of spectral peaks, spectral peak intensities and spectral peak mass-to-charge ratios; the files in the training data set comprise files derived from experimental-group samples (such as cancer tissue) and files derived from control-group samples (such as paracancerous tissue);

Step 1-2) searching for isotope peak clusters in each spectrum, an isotope peak cluster being a series of consecutive spectral peaks separated by equal mass-to-charge ratio differences, and recording the peak with the highest intensity in each cluster as the monoisotopic peak;

Step 1-3) recording monoisotopic peaks that have equal mass-to-charge ratios and whose retention times lie within 5 min of one another as one ion current chromatographic peak group;

Step 1-4) fitting each ion current chromatographic peak group with a Gaussian peak to obtain one ion current chromatographic peak, and calculating the peak area and the average retention time of each ion current chromatographic peak;

Step 1-5) outputting all the ion current chromatographic peak information as a list in which each row stores the information of one ion current chromatographic peak, mainly its mass-to-charge ratio, peak area, intensity and average retention time.
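Steps 1-3) to 1-5) can be illustrated with a minimal Python sketch. It is not the inventors' implementation. It assumes the monoisotopic peaks have already been found and are given as (retention time in minutes, m/z, intensity) tuples, that "equal" m/z means equal after rounding to two decimals, and that the 5 min window is measured from the first peak of a group. It also replaces a least-squares Gaussian fit with a simpler moment estimate (intensity-weighted mean and standard deviation of retention time). The function names are hypothetical.

```python
import math
from collections import defaultdict

def group_xic_peaks(mono_peaks, rt_window=5.0, mz_decimals=2):
    """Group (rt_min, mz, intensity) tuples that share a rounded m/z and whose
    retention time lies within rt_window minutes of the group's first peak."""
    by_mz = defaultdict(list)
    for rt, mz, intensity in mono_peaks:
        by_mz[round(mz, mz_decimals)].append((rt, intensity))
    groups = []
    for mz, points in sorted(by_mz.items()):
        points.sort()  # order by retention time
        current = [points[0]]
        for rt, intensity in points[1:]:
            if rt - current[0][0] <= rt_window:
                current.append((rt, intensity))
            else:
                groups.append((mz, current))
                current = [(rt, intensity)]
        groups.append((mz, current))
    return groups

def fit_gaussian_by_moments(mz, points):
    """Assumed stand-in for the Gaussian fit: estimate the centre and width from
    intensity-weighted moments, then use the Gaussian area A * sigma * sqrt(2*pi)."""
    total = sum(i for _, i in points)
    mean_rt = sum(rt * i for rt, i in points) / total
    variance = sum(i * (rt - mean_rt) ** 2 for rt, i in points) / total
    sigma = math.sqrt(variance)
    apex = max(i for _, i in points)
    area = apex * sigma * math.sqrt(2 * math.pi)
    return {"mz": mz, "area": area, "intensity": apex, "mean_rt": mean_rt}

# Usage: three nearby peaks form one group; the peak at 30 min starts another.
mono_peaks = [(10.0, 500.251, 1.0), (10.1, 500.249, 2.0),
              (10.2, 500.25, 1.0), (30.0, 500.25, 4.0)]
peak_list = [fit_gaussian_by_moments(mz, pts)
             for mz, pts in group_xic_peaks(mono_peaks)]
```

Each dictionary in `peak_list` corresponds to one row of the output list in step 1-5). A group containing a single point gets zero width and zero area under this estimate, so a real pipeline would probably require a minimum number of points per group.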

In the above technical solution, in step 2), the mass-to-charge ratios are rounded to two decimal places. All samples are traversed to obtain every mass-to-charge ratio present in them, and the number of samples sharing each mass-to-charge ratio is counted within each class (the classes can be defined according to the specific goal; in the specific implementation of the invention, samples are classed as cancer tissue or paracancerous tissue). Within each class, the mass-to-charge ratios shared by at least a set proportion (such as 80%) of the samples are stored as that class's common mass-to-charge ratio vector, and the common vectors of all classes are merged into the total-sample common mass-to-charge ratio vector. For each sample, the intensity values corresponding to each mass-to-charge ratio in the total-sample common vector are extracted, the mean and standard deviation of those intensity values are calculated in turn, and the results are arranged in order, as points of the form (mean, standard deviation), into a feature vector, which is stored.
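The feature construction of step 2) might look like the following sketch. Several details are assumptions rather than statements of the source: each sample is a list of peak dictionaries with `mz` and `intensity` keys, the (mean, standard deviation) points are flattened into one list, the population standard deviation is used, and an m/z missing from a sample contributes (0.0, 0.0). All function names are hypothetical.

```python
from statistics import mean, pstdev

def common_mz(samples, min_fraction=0.8):
    """m/z values (rounded to two decimals) found in at least min_fraction of the samples."""
    counts = {}
    for sample in samples:
        for mz in {round(p["mz"], 2) for p in sample}:
            counts[mz] = counts.get(mz, 0) + 1
    return {mz for mz, c in counts.items() if c >= min_fraction * len(samples)}

def total_common_mz(samples_by_class, min_fraction=0.8):
    """Merge the per-class common m/z sets into one sorted total-sample vector."""
    merged = set()
    for samples in samples_by_class.values():
        merged |= common_mz(samples, min_fraction)
    return sorted(merged)

def feature_vector(sample, mz_vector):
    """Flattened (mean, std) pairs of the intensities at each common m/z."""
    by_mz = {}
    for p in sample:
        by_mz.setdefault(round(p["mz"], 2), []).append(p["intensity"])
    features = []
    for mz in mz_vector:
        values = by_mz.get(mz, [])
        if values:
            features.extend([mean(values), pstdev(values)])
        else:
            features.extend([0.0, 0.0])  # assumption: missing m/z is encoded as zeros
    return features

# Usage with two tiny classes of two samples each.
samples_by_class = {
    "cancer": [
        [{"mz": 500.251, "intensity": 10.0}, {"mz": 500.249, "intensity": 30.0},
         {"mz": 600.0, "intensity": 5.0}],
        [{"mz": 500.25, "intensity": 20.0}],
    ],
    "paracancerous": [
        [{"mz": 700.10, "intensity": 8.0}],
        [{"mz": 700.1, "intensity": 4.0}, {"mz": 600.0, "intensity": 1.0}],
    ],
}
mz_vector = total_common_mz(samples_by_class)
first_features = feature_vector(samples_by_class["cancer"][0], mz_vector)
```

Here m/z 600.0 is dropped because it appears in only half of the samples of either class, which is below the 80% threshold.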

In the above technical solution, in step 3), the deep learning model is based on a basic convolutional neural network and consists of three convolutional layers and two fully connected layers. The first convolutional layer contains 16 different filters, and the second and third convolutional layers contain 32 and 64 filters, respectively. Each convolutional layer is followed by a pooling layer. The two fully connected layers come last, with sizes 1024 and 128, respectively. The input layer is sized according to the feature vector obtained in step 2), and the output is 0 or 1. The model is trained with the feature vectors obtained in step 2) as the training set.
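To make the layer sizes concrete, the sketch below tabulates output shapes and parameter counts for the described architecture in plain Python, without a deep learning framework. The source gives only the filter counts (16, 32, 64) and the fully connected sizes (1024, 128). Everything else here is an assumption made for illustration: one-dimensional convolutions over the sequence of (mean, standard deviation) points with two input channels, kernel size 3 with "same" padding, pooling size 2, and a single output unit for the 0/1 decision. The function name is hypothetical.

```python
def cnn_layer_plan(input_length, in_channels=2, n_filters=16,
                   kernel_size=3, pool_size=2):
    """Return (layer name, output shape, parameter count) for each layer."""
    plan = []
    length, channels = input_length, in_channels
    for out_channels in (n_filters, 2 * n_filters, 4 * n_filters):
        # Conv1d with 'same' padding keeps the length; weights plus one bias per filter.
        params = (channels * kernel_size + 1) * out_channels
        plan.append((f"conv1d x{out_channels}", (out_channels, length), params))
        # Pooling divides the length by pool_size and has no parameters.
        length //= pool_size
        plan.append(("pool", (out_channels, length), 0))
        channels = out_channels
    flat = channels * length  # flattened input to the first fully connected layer
    for size in (64 * n_filters, 8 * n_filters, 1):
        plan.append((f"dense {size}", (size,), (flat + 1) * size))
        flat = size
    return plan

# Usage: a feature vector of 800 (mean, std) points.
for name, shape, params in cnn_layer_plan(800):
    print(f"{name:12s} {str(shape):12s} {params}")
```

With N = 16 the fully connected sizes 64N and 8N reproduce the 1024 and 128 stated above. Under these assumptions most parameters sit in the first fully connected layer, so its input size, and therefore the length of the common m/z vector, dominates the model size.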

In the above technical solution, in step 4), the raw mass spectrometry file derived from the unknown sample is processed as in step 1); its feature vector is then extracted in the form described in step 2), using the total-sample common mass-to-charge ratio vector obtained in step 2); the feature vector is input into the model trained in step 3); and whether the unknown sample comes from the experimental group or the control group is determined from the output.

In the above technical solution, in step 5), an interpretability method for deep learning models is a method that explains the basis on which a deep learning model classifies; its defining property is that it can mark the weight that each part of the input data (the feature vector of step 2)) carries in the classification. Using such a method yields a list of the key feature vectors on which the deep learning model bases its classification.
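Claim 10 names Grad-CAM as one such interpretability method. The core Grad-CAM computation for a one-dimensional convolutional layer can be sketched in plain Python as below. The sketch assumes that the activations of the last convolutional layer and the gradients of the class score with respect to them have already been obtained from the trained model, each as a list of channels over positions. How they are extracted depends on the framework and is not shown. The function names are hypothetical.

```python
def grad_cam_1d(activations, gradients):
    """Grad-CAM for 1D feature maps.

    activations[k][j]: activation of channel k at position j.
    gradients[k][j]:   gradient of the class score w.r.t. that activation.
    Returns one non-negative importance value per position.
    """
    # Channel weight = gradient averaged over all positions of that channel.
    weights = [sum(g) / len(g) for g in gradients]
    length = len(activations[0])
    # Weighted sum over channels, followed by ReLU.
    return [max(0.0, sum(w * a[j] for w, a in zip(weights, activations)))
            for j in range(length)]

def top_positions(cam, k):
    """Indices of the k positions with the largest Grad-CAM value."""
    return sorted(range(len(cam)), key=lambda j: cam[j], reverse=True)[:k]

# Usage with two channels and three positions.
activations = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
gradients = [[0.5, 0.5, 0.5], [-1.0, -1.0, -1.0]]
cam = grad_cam_1d(activations, gradients)
key = top_positions(cam, 1)
```

Because pooling shortens the feature maps, the positions returned here index the convolutional output rather than the input. Mapping them back to entries of the feature vector needs an upsampling step, which this sketch omits.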

In the above technical solution, in step 6), each feature vector in the list obtained in step 5) can be traced back, by reversing the feature vector construction described in step 2), to its corresponding ion current chromatographic peaks. The peptide and protein sequence corresponding to each ion current chromatographic peak can then be determined by a targeted proteomics technique, and the proteins finally obtained serve as the biomarkers.
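The reverse lookup from key features to ion current chromatographic peaks could be sketched as follows. It assumes the feature vector is a flattened list of (mean, standard deviation) pairs ordered like the total-sample common m/z vector, so that feature index i belongs to m/z number i // 2, and that peaks are dictionaries with `mz`, `area`, `intensity` and `mean_rt` keys. Both the layout and the function name are illustrative assumptions.

```python
def peaks_for_key_features(key_indices, mz_vector, peak_list):
    """Map key feature indices back to m/z values and the matching peaks."""
    # Mean and std of the same m/z occupy indices 2*n and 2*n + 1.
    key_mz = sorted({mz_vector[i // 2] for i in key_indices})
    return {mz: [p for p in peak_list if round(p["mz"], 2) == mz]
            for mz in key_mz}

# Usage: features 2 and 3 both point at the second common m/z.
mz_vector = [500.25, 700.1]
peak_list = [
    {"mz": 500.251, "area": 1.2, "intensity": 10.0, "mean_rt": 10.1},
    {"mz": 700.10, "area": 0.8, "intensity": 8.0, "mean_rt": 42.0},
    {"mz": 700.104, "area": 0.3, "intensity": 2.0, "mean_rt": 55.5},
]
targets = peaks_for_key_features([2, 3], mz_vector, peak_list)
```

The m/z and retention time of each returned peak are the kind of information a targeted proteomics experiment would need as an inclusion list in order to identify the underlying peptide.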

The invention has the following advantages:

1. The method does not depend on protein identification and quantification. Mass-to-charge ratios that differ between experimental-group and control-group samples are mined directly from the mass spectra, so the method is expected to detect potential biomarkers that are of low abundance or otherwise difficult to identify by mass spectrometry.

2. Traditional biomarker screening strategies screen on the degree of difference of a single marker between the experimental group and the control group. The invention instead screens biomarkers directly at the global level on the basis of expression patterns, which is more conducive to screening and discovering marker combinations.

Drawings

FIG. 1 is a flow chart of the deep-learning-based method for identifying protein biomarkers independent of database search according to the present invention;

FIG. 2 is a schematic diagram of the classification model for experimental-group and control-group samples.

Detailed Description

The invention is further described with reference to the following figures and detailed description.
