Multi-species GC-MS endogenous metabolite database and establishment method thereof

Document No.: 1312741 Publication date: 2020-07-10

Reading note: this invention, "Multi-species GC-MS endogenous metabolite database and establishment method thereof" (一种多物种GC-MS内源性代谢物数据库及其建立方法), was created by Hu Zhe (胡哲), Yin Xiaoling (尹小羚), Peng Zhangxiao (彭章哓), Lu Jiawei (陆嘉伟), Hu Xujun (胡绪俊) and Shu Liebo (舒烈波) on 2020-05-13. Abstract: The invention discloses a method for establishing a multi-species GC-MS endogenous metabolite database, comprising: 1) searching GCMS data from derivatized multi-species samples against the NIST library, and retaining substances with a score above 700 as the screened high-scoring substances; 2) extracting the mass spectrum information of the high-scoring substances screened in step 1) to establish a high-score NIST library; 3) translating the names bearing derivatization groups in the high-score NIST library into the names before derivatization and replacing them, to obtain a high-score library; 4) merging the high-score library with the expanded background noise library and the expanded standards database to obtain the multi-species GC-MS endogenous metabolite database. The database provided by the invention meets the retrieval needs of many types of biological samples, such as plants, animals and microorganisms, and identifies more metabolites with greater accuracy.

1. A method for establishing a multi-species GC-MS endogenous metabolite database, comprising the following steps:

1) searching GCMS data from derivatized multi-species samples against the NIST library, and retaining substances with a score above 700 as the screened high-scoring substances;

2) extracting the mass spectrum information of the high-scoring substances screened in step 1) to establish a high-score NIST library;

3) translating the names bearing derivatization groups in the high-score NIST library into the names before derivatization and replacing them, to obtain a high-score library;

4) merging the high-score library with the expanded background noise library and the expanded standards database to obtain the multi-species GC-MS endogenous metabolite database.

2. The method for establishing the multi-species GC-MS endogenous metabolite database according to claim 1, wherein in step 2) a screening script written in Python is used to match the screened high-scoring substances and extract their mass spectrum information.

3. The method for establishing the multi-species GC-MS endogenous metabolite database according to claim 1, wherein the forms of the names bearing derivatization groups in step 3), and their translation, comprise:

form ①: "Name, n TMS (derivative)" / "Name, n (trimethylsilyl) ether", where Name is the name of the substance before derivatization, n is a number or word giving the number of derivatization groups, and "TMS derivative" / "(trimethylsilyl) ether" is the derivatization group; during translation, "n TMS derivative" / "n (trimethylsilyl) ether" is removed and the retained Name is the translated name;

form ②: "Name, N trimethylsilyl ether, (O)-methyloxime ()", where Name is the name of the substance before derivatization, N is the number of derivatization groups, "trimethylsilyl ether" is the trimethylsilyl derivatization group, and "methyloxime ()" is the group "-C(R)=N-O-" formed by the oximation reaction; during translation, the oximated structure is first restored to a ketone or aldehyde group, the TMS groups are then removed, and the substance is finally named according to the resulting structure;

form ③: a substance containing a "bis(trimethylsilyl) phosphate" group is translated either by removing the first two derivatization groups of the phosphate group, or by naming it according to its structure after the derivatization groups are removed, following the naming approach of form ②;

form ④: for a name that cannot be translated directly, the structure of the substance is looked up in the library with the NIST software, the TMS groups are removed, and the substance is finally named according to the resulting structure.

4. The method for establishing the multi-species GC-MS endogenous metabolite database according to claim 1, wherein in step 3) a name replacement script written in the R language is used to replace the original names bearing derivatization groups in the high-score NIST library with the translated names.

5. The method for establishing the multi-species GC-MS endogenous metabolite database according to claim 1, wherein in step 4) the background noise library is expanded by searching blank samples from different time periods and different samples against the NIST library to collect common background interferents.

6. The method for establishing the multi-species GC-MS endogenous metabolite database according to claim 1, wherein in step 4) standards are used for verification: the standards are derivatized and then run on the GCMS instrument to collect their mass spectrum information, thereby expanding the standards database.

7. The method for establishing the multi-species GC-MS endogenous metabolite database according to claim 1, wherein the multi-species GC-MS endogenous metabolite database obtained in step 4) comprises a species-source classification for metabolites from different sample types (animal, plant and microbial sources), HMDB ID, CAS number and KEGG number information, and substance class information.

8. A multi-species GC-MS endogenous metabolite database created using the method of creating a multi-species GC-MS endogenous metabolite database of any one of claims 1-7.

Technical Field

The invention belongs to the field of biological databases, and particularly relates to a multi-species GC-MS endogenous metabolite database and an establishment method thereof.

Background

GCMS is currently one of the most commonly used analytical methods in metabolomics research and is generally used to analyze volatile small-molecule metabolites. Metabolites with low molecular weight and high polarity, such as amino acids, sugar alcohols, organic acids, biogenic amines and organic phosphates, must first be derivatized (silylation, esterification, etc.) to lower their boiling points and increase their thermal stability before they can be analyzed by GCMS. The most important step in GCMS analysis is the qualitative identification of metabolites, and both the accuracy and the number of identifications depend on the database.

The database most commonly used for GCMS is published by the National Institute of Standards and Technology (NIST). The NIST standard mass spectral database has been updated to NIST v17 and the library search software (NIST MS Search) to version 2.3; the main library contains mass spectrum information for 267376 compounds. The NIST library is therefore very large and covers a highly diverse set of substances, both exogenous and endogenous. Because a derivatized substance is usually a new substance carrying a derivatization group, searching the NIST library is slow and inefficient, and each hit must be renamed to recover the name of the substance before derivatization. The other widely used database is the Fiehn library, which covers far fewer substances.

In summary, the two most widely used databases, the NIST library and the Fiehn library, both have limitations. The NIST library is very large and complex; substances retrieved from it suffer from low identification accuracy, names that are hard to trace back to the underivatized substance, and long search times. The Fiehn library is more accurate than the NIST library and already reports back-translated names, but it is aimed mainly at animal endogenous substances, covers few species, and contains only a few hundred substances.

Disclosure of Invention

To address these problems, the invention provides a multi-species GC-MS endogenous metabolite database and a method for establishing it. By combining several programming languages with manual correction and verification against standards, the invention yields a self-built database that covers multiple species, uses back-translated names, contains more of the important pathway substances, is fast to search, and gives more accurate identification.

In order to achieve the purpose, the invention adopts the technical scheme that:

a method for establishing a multi-species GC-MS endogenous metabolite database, comprising the following steps:

1) searching GCMS data from derivatized multi-species samples against the NIST library, and retaining substances with a score above 700 as the screened high-scoring substances;

2) extracting the mass spectrum information of the high-scoring substances screened in step 1) to establish a high-score NIST library;

3) translating the names bearing derivatization groups in the high-score NIST library into the names before derivatization and replacing them, to obtain a high-score library;

4) merging the high-score library with the expanded background noise library and the expanded standards database to obtain the multi-species GC-MS endogenous metabolite database.

Preferably, in step 2), a screening script written in Python is used to match the screened high-scoring substances and extract their mass spectrum information.
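The screening in steps 1) and 2) can be illustrated roughly as follows. This is a minimal sketch, not the inventors' script: the `Hit` record, the `name -> peaks` library mapping, the function names and the toy peak values are all assumptions made for illustration, and whether a score of exactly 700 is kept is also an assumption (the sketch uses a strict comparison).

```python
from dataclasses import dataclass

SCORE_THRESHOLD = 700  # from the text; strict ">" is an assumption


@dataclass(frozen=True)
class Hit:
    """One library-search hit (hypothetical record layout)."""
    name: str
    score: int


def select_high_scoring(hits, threshold=SCORE_THRESHOLD):
    """Return {name: best score} for hits scoring above the threshold."""
    best = {}
    for hit in hits:
        if hit.score > threshold and hit.score > best.get(hit.name, 0):
            best[hit.name] = hit.score
    return best


def extract_spectra(library, selected_names):
    """Copy the spectra of the selected names out of a name -> peaks mapping."""
    return {name: library[name] for name in selected_names if name in library}


# Toy data: names and (m/z, intensity) pairs are illustrative, not real spectra.
hits = [
    Hit("L-Alanine, 2TMS derivative", 912),
    Hit("L-Alanine, 2TMS derivative", 845),
    Hit("Glycerol, 3TMS derivative", 700),
    Hit("Urea, 2TMS derivative", 655),
]
library = {
    "L-Alanine, 2TMS derivative": [(73, 999), (116, 850), (147, 300)],
    "Glycerol, 3TMS derivative": [(73, 999), (147, 400), (205, 350)],
}

high_score_library = extract_spectra(library, select_high_scoring(hits))
print(sorted(high_score_library))
```

Keeping only the best score per name means a substance found in several samples enters the high-score library once.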

Preferably, the forms of the names bearing derivatization groups in step 3), and their translation, include:

form ①: "Name, n TMS (derivative)" / "Name, n (trimethylsilyl) ether", where Name is the name of the substance before derivatization, n is a number or word giving the number of derivatization groups, and "TMS derivative" / "(trimethylsilyl) ether" is the derivatization group; during translation, "n TMS derivative" / "n (trimethylsilyl) ether" is removed and the retained Name is the translated name;

form ②: "Name, N trimethylsilyl ether, (O)-methyloxime ()", where Name is the name of the substance before derivatization, N is the number of derivatization groups, "trimethylsilyl ether" is the trimethylsilyl derivatization group, and "methyloxime ()" is the group "-C(R)=N-O-" formed by the oximation reaction; during translation, the oximated structure is first restored to a ketone or aldehyde group, the TMS groups are then removed, and the substance is finally named according to the resulting structure;

form ③: a substance containing a "bis(trimethylsilyl) phosphate" group is translated either by removing the first two derivatization groups of the phosphate group, or by naming it according to its structure after the derivatization groups are removed, following the naming approach of form ②;

form ④: for a name that cannot be translated directly, the structure of the substance is looked up in the library with the NIST software, the TMS groups are removed, and the substance is finally named according to the resulting structure.
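Of the four forms, only form ① is a purely textual rule, so it is the one that lends itself to a short example. The sketch below is an assumption about how such a rule might be written as a regular expression; the pattern, the list of multiplier words and the function name `translate_form_1` are hypothetical, and the sample names are illustrative. Names that do not fit form ① return `None`, signalling that they need the structure-based handling of forms ② to ④.

```python
import re

# Optional count: digits or a multiplier word (this word list is an assumption).
_COUNT = r"(?:\d+|mono|bis|tris|tetrakis|pentakis|hexakis)?"

# "Name, n TMS (derivative)" or "Name, n (trimethylsilyl) ether"
_FORM_1 = re.compile(
    r"^(?P<name>.+?),\s*" + _COUNT + r"\s*"
    r"(?:TMS(?:\s+derivative)?|\(trimethylsilyl\)\s*ether)\s*$",
    re.IGNORECASE,
)


def translate_form_1(derivatized_name):
    """Strip a trailing form-① suffix; return None if the name is not form ①."""
    match = _FORM_1.match(derivatized_name.strip())
    return match.group("name").strip() if match else None


for example in (
    "L-Alanine, 2TMS derivative",
    "Glycerol, tris(trimethylsilyl) ether",
    "Propanoic acid, 2-hydroxy-, 2TMS derivative",
    "D-Glucose, 5TMS derivative, methyloxime",  # form ②: not handled here
):
    print(example, "->", translate_form_1(example))
```

The lazy `.+?` lets commas inside the base name survive, because the suffix must match all the way to the end of the string.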

Preferably, in step 3), a name replacement script written in the R language is used to replace the original names bearing derivatization groups in the high-score NIST library with the translated names.
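The text specifies an R script for this replacement. Purely as an illustration of the same idea, and in Python rather than R, the sketch below rewrites the `Name:` lines of a library stored as MSP-style text. The assumption that the high-score library is held in that text layout, the function name `replace_names` and the sample records are all hypothetical.

```python
def replace_names(msp_text, translations):
    """Swap each 'Name:' value for its translation, leaving other lines as-is."""
    out = []
    for line in msp_text.splitlines():
        if line.startswith("Name:"):
            original = line[len("Name:"):].strip()
            # Names without a translation are kept unchanged.
            line = "Name: " + translations.get(original, original)
        out.append(line)
    return "\n".join(out)


# Toy library text: record contents are illustrative, not real spectra.
msp = (
    "Name: L-Alanine, 2TMS derivative\n"
    "Num Peaks: 2\n"
    "73 999; 116 850\n"
    "\n"
    "Name: Unknown siloxane\n"
    "Num Peaks: 1\n"
    "73 999\n"
)
translations = {"L-Alanine, 2TMS derivative": "L-Alanine"}

renamed = replace_names(msp, translations)
print(renamed)
```

Because only the name line changes, the peak lists of the high-score NIST library carry over untouched into the high-score library.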

Preferably, in step 4), the background noise library is expanded by searching blank samples from different time periods and different samples against the NIST library to collect common background interferents.
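One way to read "common background interferents" is as substances that recur across blank runs. The sketch below makes that reading concrete under stated assumptions: the `min_runs` cutoff is hypothetical (the text gives no threshold), each blank run is represented simply as a list of hit names, and the compound names are illustrative.

```python
from collections import Counter


def common_interferents(blank_runs, min_runs=2):
    """Return names found in at least `min_runs` blank runs, sorted by name."""
    counts = Counter()
    for run in blank_runs:
        counts.update(set(run))  # count each name at most once per run
    return sorted(name for name, n in counts.items() if n >= min_runs)


# Toy data: library hits from three blank runs.
blank_runs = [
    ["Cyclotrisiloxane, hexamethyl-", "Toluene"],
    ["Cyclotrisiloxane, hexamethyl-", "Cyclotetrasiloxane, octamethyl-"],
    [
        "Cyclotetrasiloxane, octamethyl-",
        "Cyclotrisiloxane, hexamethyl-",
        "Cyclotrisiloxane, hexamethyl-",
    ],
]

print(common_interferents(blank_runs))
```

Counting per run rather than per hit keeps a single noisy blank from promoting a one-off contaminant into the background library.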

Preferably, in step 4), standards are used for verification: the standards are derivatized and then run on the GCMS instrument to collect their mass spectrum information, thereby expanding the standards database.

Preferably, the multi-species GC-MS endogenous metabolite database obtained in step 4) includes a species-source classification for metabolites from different sample types (animal, plant and microbial sources), HMDB ID, CAS number and KEGG number information, and substance class information.
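The merge in step 4) and the annotations listed above can be pictured with a simple record type. This is a sketch under assumptions, not the database's actual schema: the `Entry` fields, the `kind` labels and the rule that a verified standard overrides a high-score entry of the same name are all hypothetical, and the peak values are toy data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entry:
    """One database record (hypothetical layout)."""
    name: str
    kind: str  # "metabolite", "background" or "standard"
    peaks: list = field(default_factory=list)  # (m/z, intensity) pairs
    sources: frozenset = frozenset()  # of "animal", "plant", "microorganism"
    hmdb_id: Optional[str] = None
    cas: Optional[str] = None
    kegg_id: Optional[str] = None
    chem_class: Optional[str] = None


def merge_libraries(high_score, background, standards):
    """Merge three libraries by name; later libraries override earlier ones."""
    merged = {}
    for library in (high_score, background, standards):
        for entry in library:
            merged[entry.name] = entry
    return merged


alanine = dict(
    sources=frozenset({"animal", "plant", "microorganism"}),
    hmdb_id="HMDB0000161", cas="56-41-7", kegg_id="C00041",
    chem_class="amino acid",
)
high_score = [Entry("L-Alanine", "metabolite", [(116, 999)], **alanine)]
background = [Entry("Cyclotrisiloxane, hexamethyl-", "background", [(207, 999)])]
standards = [Entry("L-Alanine", "standard", [(116, 999), (73, 600)], **alanine)]

database = merge_libraries(high_score, background, standards)
print({name: entry.kind for name, entry in database.items()})
```

Tagging every record with its `kind` keeps background interferents searchable while making them easy to separate from metabolites later.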

The invention also provides a multi-species GC-MS endogenous metabolite database established by the establishing method.

Compared with the prior art, the invention has the beneficial effects that:

The Untarget database of GC-MS from Lumingbio (LUG database) established by the invention contains 2082 endogenous metabolites detectable by GC-MS (shown in FIG. 3) and 6251 EI spectra (shown in FIG. 4), and is continuously updated. The mass-to-charge ratio range is 85 to 650. The database covers many classes of endogenous small-molecule metabolites, including lipids, amino acids, fatty acids, amines, alcohols, saccharides, amino sugars, sugar alcohols, sugar acids, organic phosphates, hydroxy acids, aromatics, purines and sterols.

In addition, background mass spectral signals are entered into the LUG database in advance, so impurity signals can be subtracted by database comparison even when no blank sample is run.
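Deducting impurity signals by database comparison amounts to separating hits that match background entries from the rest. A minimal sketch of that idea, assuming hits and background entries are compared by name only (the function name `split_hits` and the sample names are hypothetical):

```python
def split_hits(hit_names, background_names):
    """Split search hits into (metabolites, interferents) by background membership."""
    background = set(background_names)
    metabolites = [name for name in hit_names if name not in background]
    interferents = [name for name in hit_names if name in background]
    return metabolites, interferents


# Toy data: hits from one sample and the names held in the background library.
sample_hits = ["L-Alanine", "Cyclotrisiloxane, hexamethyl-", "Glycerol"]
background_library = ["Cyclotrisiloxane, hexamethyl-", "Cyclotetrasiloxane, octamethyl-"]

metabolites, interferents = split_hits(sample_hits, background_library)
print(metabolites)
print(interferents)
```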

Metabolites in the LUG database carry species-source information for multiple matrices (animal, plant and microbial sources) and have been verified against 300 standards, so the LUG database supports high-accuracy qualitative analysis of different sample matrices.

Drawings

FIG. 1 is a technical flow chart of the method for establishing the multi-species GC-MS endogenous metabolite database according to the present invention.

FIG. 2 is a schematic diagram of an example structure found with NIST MS Search for form ④ of the present invention.

FIG. 3 is a schematic representation of some of the metabolites in the LUG database of the present invention.

FIG. 4 is a schematic diagram of the spectral fragment information of some of the metabolites in the LUG database of the present invention.

FIG. 5 is a schematic diagram of the spectral fragment information of some of the background interfering substances in the LUG database of the present invention.

FIG. 6 is a schematic representation of a chromatogram of a portion of a standard according to the invention.

Detailed Description

For a better understanding of the present invention, the contents of the present invention will be further explained below with reference to the drawings and examples, but the present invention is not limited to the following examples.
