Methods, apparatus and computer readable media for glycopeptide identification

Document No.: 1047806. Published: 2020-10-09.

Abstract: This technology, "Methods, apparatus and computer readable media for glycopeptide identification", was created by 朱森发, T·阮-孔 and P·M·鲁德 on 2019-02-21. The method identifies glycopeptides in a sample. It comprises converting mass spectra of MS1 precursors of the sample into more than one node in a graph, each node corresponding to a mass and a retention time of a glycopeptide to be identified in the sample; calculating differences in mass and/or retention time between all combinations of node pairs; generating a graph theory network of the nodes; and predicting the composition of the glycopeptides in the sample based on the graph theory network of the nodes, thereby identifying the glycopeptides.

1. A method for identifying a glycopeptide in a sample, the method comprising:

converting mass spectra of MS1 precursors of the sample to more than one node in a graph, each node corresponding to a mass and a retention time of a glycopeptide to be identified in the sample;

calculating differences in mass and/or retention time between all combinations of node pairs;

generating a graph theory network of nodes; and

predicting the composition of the glycopeptides in the sample based on the graph theory network of nodes, thereby identifying the glycopeptides.

2. The method of claim 1, further comprising:

for each node of the graph, setting the node as a center node and all other nodes as arm nodes, and connecting the center node to each arm node to form a node pair.

3. The method of claim 1 or 2, wherein the step of generating a graph theory network of the nodes comprises:

retaining more than one node pair, wherein the difference in mass between the nodes of each retained node pair is equal to a glycan attachment mass in a list of known glycan attachments, and/or the difference in retention time between the nodes of each retained node pair is less than a retention time threshold.

4. The method of claim 3, wherein the list of known glycan attachments comprises one or more of: N-acetylhexosamine, sialic acid, hexose and deoxyhexose.

5. The method of claim 4, wherein the N-acetylhexosamine includes N-acetylglucosamine and N-acetylgalactosamine; the sialic acid includes N-acetylneuraminic acid and N-glycolylneuraminic acid; the hexose includes mannose and galactose; and the deoxyhexose includes fucose.

6. The method of any of claims 3 to 5, wherein the retention time threshold is 50 seconds.

7. The method of any one of claims 3 to 6, wherein the retention time threshold is variable based on separation performance of a liquid chromatography device used prior to obtaining the MS1 precursor of the sample.

8. The method of any of claims 3 to 7, further comprising:

extracting the more than one retained node pair as one or more node subgraphs, each node subgraph being separate from the other node subgraphs;

wherein each node subgraph represents a group of glycopeptides that share the same peptide backbone with different glycan attachments, and

wherein the graph theory network of nodes comprises the one or more node subgraphs.

9. The method of claim 8, wherein the step of predicting glycopeptide composition comprises:

identifying the sequence of the peptide backbone in each node subgraph;

calculating the mass of the peptide backbone based on the identified sequence; and

predicting the composition of the glycopeptides by identifying the glycan attachments of the glycopeptides based on the differences in mass and/or retention time between each pair of nodes in the subgraph.

10. The method of claim 8 or 9, further comprising:

for each node subgraph, selecting a reference node;

identifying a sequence of the reference node of the subgraph; and

predicting the composition of the remaining nodes in the subgraph based on the differences in mass and/or retention time between each pair of nodes in the subgraph and on the sequence of the reference node.

11. The method of claim 10, wherein the step of predicting the composition of the remaining nodes in the subgraph comprises:

predicting a relative composition of the remaining nodes in the subgraph based on the differences in mass and/or retention time between each pair of nodes in the subgraph; and

predicting an absolute composition of the remaining nodes in the subgraph by merging the relative composition of the remaining nodes with the sequence of the reference node.

12. The method of any of claims 1-11, further comprising:

providing the predicted composition of glycopeptides to a database comprising compositions of known glycans and peptides; and

performing one or more searches based on the database to identify glycopeptides in the sample.

13. An apparatus for identifying a glycopeptide in a sample, the apparatus comprising:

at least one input module;

at least one output module;

at least one processor; and

at least one memory including computer program code;

wherein the input module is configured to receive data from a liquid chromatography-mass spectrometry (LC-MS) system, the data comprising mass spectrometry data,

wherein the output module is configured to output a result of the identified glycopeptide; and

wherein the at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to:

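convert mass spectra of MS1 precursors of the sample to more than one node in a graph, each node corresponding to a mass and a retention time of a glycopeptide to be identified in the sample;

calculate differences in mass and/or retention time between all combinations of node pairs;

generate a graph theory network of nodes; and

predict the composition of glycopeptides in the sample based on the graph theory network of nodes, thereby identifying the glycopeptides.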

14. The apparatus of claim 13, wherein the apparatus is further caused to:

for each node of the graph, set the node as a center node and all other nodes as arm nodes, and connect the center node to each arm node to form a node pair.

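15. The apparatus of claim 13 or 14, wherein in the step of generating a graph theory network of nodes, the apparatus is caused to:

retain more than one node pair, wherein the difference in mass between the nodes of each retained node pair is equal to a glycan attachment mass in a list of known glycan attachments, and/or the difference in retention time between the nodes of each retained node pair is less than a retention time threshold.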

16. The apparatus of claim 15, wherein the list of known glycan attachments comprises one or more of: N-acetylhexosamine, sialic acid, hexose and deoxyhexose.

17. The apparatus of claim 16, wherein the N-acetylhexosamine includes N-acetylglucosamine and N-acetylgalactosamine; the sialic acid includes N-acetylneuraminic acid and N-glycolylneuraminic acid; the hexose includes mannose and galactose; and the deoxyhexose includes fucose.

18. The apparatus of any of claims 15-17, wherein the retention time threshold is 50 seconds.

19. The apparatus of any one of claims 15 to 18, wherein the retention time threshold is variable based on separation performance of a liquid chromatography device used prior to obtaining the MS1 precursor of the sample.

20. An apparatus according to any one of claims 15 to 19, wherein the apparatus is further caused to:

extract the more than one retained node pair as one or more node subgraphs, each node subgraph being separate from the other node subgraphs;

wherein each node subgraph represents a group of glycopeptides that share the same peptide backbone with different glycan attachments, and

wherein the graph theory network of nodes comprises the one or more node subgraphs.

21. The apparatus of claim 20, wherein in the step of predicting glycopeptide composition, the apparatus is caused to:

identify the sequence of the peptide backbone in each node subgraph;

calculate the mass of the peptide backbone based on the identified sequence; and

predict the composition of the glycopeptides by identifying the glycan attachments of the glycopeptides based on the differences in mass and/or retention time between each pair of nodes in the subgraph.

22. An apparatus according to claim 20 or 21, wherein the apparatus is further caused to:

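for each node subgraph, select a reference node;

identify a sequence of the reference node of the subgraph; and

predict the composition of the remaining nodes in the subgraph based on the differences in mass and/or retention time between each pair of nodes in the subgraph and on the sequence of the reference node.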

23. The apparatus of claim 22, wherein in the step of predicting the composition of the remaining nodes in the subgraph, the apparatus is caused to:

predict a relative composition of the remaining nodes in the subgraph based on the differences in mass and/or retention time between each pair of nodes in the subgraph; and

predict an absolute composition of the remaining nodes in the subgraph by merging the relative composition of the remaining nodes with the sequence of the reference node.

24. An apparatus according to any one of claims 13 to 23, wherein the apparatus is further caused to:

provide the predicted composition of glycopeptides to a database comprising compositions of known glycans and peptides; and

perform one or more searches based on the database to identify glycopeptides in the sample.

25. A computer readable medium comprising instructions which, when executed by a processor, cause the processor to perform a method for identifying a glycopeptide in a sample according to any one of claims 1 to 12.
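
The subgraph and reference-node steps of claims 8, 10 and 11 can be illustrated with a short Python sketch. It is not the claimed implementation: it assumes that retained node pairs are already available as `(lighter, heavier, residue)` tuples, that a subgraph is a connected component containing at least one retained pair, and that the glycan composition of the reference node is known. The function names, the edge format and the example data are hypothetical.

```python
from collections import Counter, deque

def connected_subgraphs(n_nodes, edges):
    """Group nodes linked by retained pairs into separate subgraphs."""
    adjacency = {i: [] for i in range(n_nodes)}
    for light, heavy, residue in edges:
        adjacency[light].append((heavy, residue, +1))   # heavier node gains one residue
        adjacency[heavy].append((light, residue, -1))   # lighter node loses one residue
    seen, subgraphs = set(), []
    for start in range(n_nodes):
        if start in seen or not adjacency[start]:
            continue  # assumption: isolated nodes do not form a subgraph
        component, queue = [], deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour, _, _ in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        subgraphs.append(sorted(component))
    return adjacency, subgraphs

def propagate(adjacency, reference, reference_glycan):
    """Walk outward from the reference node, adding or removing one residue per edge."""
    compositions = {reference: Counter(reference_glycan)}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for neighbour, residue, sign in adjacency[node]:
            if neighbour in compositions:
                continue
            composition = Counter(compositions[node])
            composition[residue] += sign
            compositions[neighbour] = composition
            queue.append(neighbour)
    return compositions

# Hypothetical retained pairs: node 1 = node 0 + Hex, node 2 = node 1 + HexNAc, node 4 = node 3 + dHex.
edges = [(0, 1, "Hex"), (1, 2, "HexNAc"), (3, 4, "dHex")]
adjacency, subgraphs = connected_subgraphs(6, edges)

# Assume node 1 is the reference and was identified as carrying HexNAc2Hex4.
compositions = propagate(adjacency, 1, {"HexNAc": 2, "Hex": 4})
```

Because every node in a subgraph shares one peptide backbone, only the reference node needs a sequence identification in this sketch; the other nodes inherit the backbone and differ only in the residue counts accumulated along the edges.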

Technical Field

The present invention relates to the field of glycopeptide identification. In particular, the present invention relates to methods, apparatus and computer readable media for glycopeptide identification using graph-theoretic analysis of liquid chromatography-mass spectrometry (LCMS) data.

Background

Current methods of glycopeptide identification interpret glycopeptide LCMS data using approaches borrowed from proteomics. These methods typically involve database-driven MS/MS fragmentation searches, which combine a database of theoretical peptides generated from the genome with a database of theoretical glycans to produce a theoretical glycopeptide database containing the theoretical MS/MS fragments of each glycopeptide. Because such a database would otherwise be computationally impractical in size, a common approach when studying a single glycoprotein is to first limit the glycan list by characterizing the set of N-glycans released in a glycomics experiment. Glycopeptides in the LCMS data are then identified by matching the detected precursors (i.e., MS1 precursors) and MS/MS fragment ions (i.e., MS2 fragment ions) against the theoretical database, and the confidence of each match is scored statistically.

Several database-dependent algorithms and software packages built on such theoretical databases are available for identifying N-glycopeptides, such as Byonic (ProteinMetrics), Proteome Discoverer (Thermo), GlycoPeptideSearch, GlycoPep Evaluator, MAGIC, and pGlyco. These programs work well when there is sufficient a priori knowledge about the glycoprotein and its glycosylation. However, an inherent limitation of these database-dependent approaches is that they fail to identify unexpected glycopeptides that are not in the user-provided database, even when those glycopeptides appear obvious in the raw data to expert researchers. Furthermore, database-dependent software does not show (1) which peaks were not identified, (2) which unidentified spectra are likely to be glycopeptides, (3) an assessment of annotation completeness that alerts the researcher when the search parameters are suboptimal, (4) cases in which a spectrum matches the database well but matches a sequence outside the database better, or (5) a visual representation of the dense LCMS data that enables exploration. Since the total set of glycopeptides is unknown, the current approach creates the problem of "not knowing what you do not know", and it also limits the proteomics-style remedies of increasing the number of peptides found and reducing the false discovery rate.

Accordingly, there is a need for a method that addresses the above-mentioned shortcomings and also provides other related advantages.

Summary of The Invention

Exemplary embodiments include methods and apparatus for glycopeptide identification.

One exemplary embodiment is a method for identifying a glycopeptide in a sample. The method comprises converting mass spectra of MS1 precursors of the sample to more than one node in a graph, each node corresponding to a mass and a retention time of a glycopeptide to be identified in the sample; calculating differences in mass and/or retention time between all combinations of node pairs; generating a graph theory network of the nodes; and predicting the composition of the glycopeptide in the sample based on the graph theory network of the nodes so as to identify the glycopeptide.
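
As a minimal sketch of the node, difference and network steps summarized above, the Python below treats each MS1 precursor as a (mass, retention time) pair and keeps a node pair only when its mass difference matches one monosaccharide residue mass and its retention times differ by less than a threshold. The names `Node`, `GLYCAN_MASSES` and `build_edges`, the 0.01 Da matching tolerance, and the example nodes are assumptions made for illustration; the 50-second default is the threshold recited in claim 6. Requiring both conditions together is only one reading of the "and/or" wording.

```python
from itertools import combinations
from typing import NamedTuple

class Node(NamedTuple):
    mass: float  # neutral monoisotopic mass in Da
    rt: float    # retention time in seconds

# Monoisotopic residue masses (Da) for the monosaccharide classes named in claims 4 and 5.
GLYCAN_MASSES = {
    "Hex": 162.0528,     # hexose, e.g. mannose or galactose
    "HexNAc": 203.0794,  # N-acetylhexosamine
    "dHex": 146.0579,    # deoxyhexose, e.g. fucose
    "NeuAc": 291.0954,   # N-acetylneuraminic acid
    "NeuGc": 307.0903,   # N-glycolylneuraminic acid
}

def build_edges(nodes, rt_threshold=50.0, mass_tol=0.01):
    """Return (i, j, residue) for every node pair that passes both filters."""
    edges = []
    for i, j in combinations(range(len(nodes)), 2):
        d_mass = abs(nodes[i].mass - nodes[j].mass)
        d_rt = abs(nodes[i].rt - nodes[j].rt)
        if d_rt >= rt_threshold:
            continue  # too far apart in retention time
        for residue, residue_mass in GLYCAN_MASSES.items():
            # assumption: "equal to" is tested within a small tolerance
            if abs(d_mass - residue_mass) <= mass_tol:
                edges.append((i, j, residue))
                break
    return edges

# Hypothetical precursors, not measured data.
nodes = [
    Node(3000.0000, 1200.0),
    Node(3162.0528, 1190.0),  # node 0 + Hex
    Node(3365.1322, 1185.0),  # node 1 + HexNAc
    Node(3162.0528, 2400.0),  # same mass as node 1 but elutes much later
    Node(4100.5000, 1195.0),  # no glycan-sized mass difference to any other node
]
edges = build_edges(nodes)
```

Comparing every pair is quadratic in the number of nodes, which matches the "all combinations of node pairs" wording; a real implementation might sort by retention time first to avoid most comparisons.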

One exemplary embodiment is an apparatus for identifying a glycopeptide in a sample. The apparatus includes at least one input module; at least one output module; at least one processor; and at least one memory including computer program code. The input module is configured to receive data from a liquid chromatography-mass spectrometry (LC-MS) system, the data including mass spectrometry data. The output module is configured to output a result of the identified glycopeptide. The at least one memory and the computer program code are configured to, with the at least one processor, cause the apparatus at least to: convert mass spectra of MS1 precursors of the sample to more than one node in a graph, each node corresponding to a mass and a retention time of a glycopeptide to be identified in the sample; calculate differences in mass and/or retention time between all combinations of node pairs; generate a graph theory network of the nodes; and predict the composition of the glycopeptide in the sample based on the graph theory network of the nodes so as to identify the glycopeptide.
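
The mass bookkeeping behind the final prediction step can be sketched in Python: a peptide backbone mass is computed from its sequence, and a predicted glycan composition is added to give the expected glycopeptide mass that a node should match. This is an illustrative sketch only. The function names and the example peptide and glycan are hypothetical, the masses are standard monoisotopic residue masses, and modifications such as carbamidomethylation are ignored.

```python
# Monoisotopic amino acid residue masses in Da.
AMINO_ACID_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# Monoisotopic monosaccharide residue masses in Da.
GLYCAN_MASSES = {
    "Hex": 162.05282, "HexNAc": 203.07937, "dHex": 146.05791,
    "NeuAc": 291.09542, "NeuGc": 307.09033,
}

WATER = 18.01056  # one water is added to the sum of residues for the intact peptide

def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(AMINO_ACID_MASSES[aa] for aa in sequence) + WATER

def glycopeptide_mass(sequence, glycan):
    """Peptide backbone mass plus the residue masses of the attached glycan."""
    glycan_mass = sum(GLYCAN_MASSES[name] * count for name, count in glycan.items())
    return peptide_mass(sequence) + glycan_mass

# Hypothetical example: tripeptide NGS carrying HexNAc2Hex5.
backbone = peptide_mass("NGS")
total = glycopeptide_mass("NGS", {"HexNAc": 2, "Hex": 5})
```

In this sketch the difference `total - backbone` is the glycan mass alone, so a node whose observed mass lies close to `total` would be consistent with the predicted composition.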
