Metabolite marking method, metabolite marking device, computer device, and storage medium

Document No.: 1891643 Publication date: 2021-11-26

Abstract: The metabolite marking method, metabolite marking device, computer device, and storage medium described here were created by Guo Jianying (郭建影) and Xu Xiao (徐啸) on 2021-08-31. The application relates to the field of artificial intelligence and discloses a metabolite marking method, a metabolite marking device, a computer device, and a storage medium. The method comprises: obtaining a molecular structural formula of a metabolite to be marked; collecting node features from the atomic information and constructing a node matrix of the metabolite; constructing an adjacency relation among the atoms based on the chemical connection relation, and generating an adjacency matrix of the metabolite according to the adjacency relation; performing matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model, wherein the labeling model is a neural network model that is trained to convergence by pseudo-label self-training and is used to perform mass spectrogram classification on the metabolite; and reading the classification result output by the labeling model, and labeling the mass spectrogram of the metabolite according to the classification result.

1. A metabolite labeling method, comprising:

obtaining a molecular structural formula of a metabolite to be marked, wherein the molecular structural formula comprises atomic information for forming the metabolite and a chemical connection relation between atoms;

collecting node characteristics in the atomic information and constructing a node matrix of the metabolite;

constructing an adjacency relation among the atoms based on the chemical connection relation, and generating an adjacency matrix of the metabolite according to the adjacency relation;

performing matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model, wherein the labeling model is a neural network model that is trained to convergence by pseudo-label self-training and is used to perform mass spectrogram classification on the metabolite;

and reading the classification result output by the labeling model, and labeling the mass spectrogram of the metabolite according to the classification result.

2. The method of claim 1, wherein obtaining the molecular structural formula of the metabolite to be labeled comprises:

sending request query information to a plurality of preset metabolite databases, wherein the request query information comprises the identity information of the metabolites;

determining a target database according to the reply information of the plurality of metabolite databases;

and sending a request to the target database for obtaining information, and receiving the molecular structural formula of the metabolite sent by the target database.

3. The metabolite labeling method according to claim 2, wherein the response information includes a response time length of each metabolite database and a storage state of the molecular structural formula, and the determining the target database from the response information of the plurality of metabolite databases includes:

screening the plurality of metabolite databases according to the storage state to obtain at least one database to be selected;

and performing ascending arrangement on the at least one database to be selected by taking the response time length as a sorting condition, and determining the database to be selected positioned at the head of the sorting as the target database.

4. The metabolite labeling method according to claim 3, wherein, before the collecting of the node features in the atomic information and the constructing of the node matrix of the metabolite, the method comprises:

storing the molecular structural formula in a local database, and generating a storage linked list based on the storage position of the molecular structural formula;

based on a plurality of preset storage hash algorithms, carrying out hash operation on the identity information to generate a hash structural formula of the molecular structural formula;

and storing the hash structural formula in a preset storage bitmap, and establishing a mapping association between the hash structural formula and the storage linked list.

5. The metabolite labeling method according to claim 1, wherein performing matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model, comprises:

sequentially coding each atom according to a preset identification rule;

sorting the feature elements corresponding to the atoms in the node matrix and the adjacency matrix according to the sequence codes;

inserting the sorted feature elements of each atom in the node matrix into the adjacency matrix, before the feature elements of the corresponding atom, to generate the fusion matrix;

and inputting the fusion matrix into a preset marking model.

6. The method of claim 1, wherein the labeling model is trained by:

acquiring a training sample set, wherein the training sample set comprises a marked sample set and a non-marked sample set;

performing supervised training on an initial labeling model with the labeled sample set to obtain a first model;

classifying the unmarked sample set through the first model to obtain a first classification result, screening unmarked samples in a preset proportion and a classification result corresponding to the unmarked samples based on the first classification result, and constructing a first marked sample;

updating the first labeled sample to the labeled sample set, and performing supervised training on the first model through the updated labeled sample set to generate a second model;

and classifying the remaining unlabeled samples with the second model to obtain a second classification result, iteratively updating the labeled sample set, and training the labeling model on the updated labeled sample set until the labeling model converges.

7. The metabolite marking method according to claim 1, wherein after reading the classification result output by the marking model and performing mass spectrogram marking on the metabolite according to the classification result, the method comprises:

encrypting the mass spectrogram mark according to a preset asymmetric encryption algorithm to generate ciphertext information;

carrying out Hash operation on the ciphertext information based on a plurality of preset encryption Hash algorithms to generate a target password of the ciphertext information, and encrypting the ciphertext information according to the target password to generate an encrypted ciphertext;

and sending the encrypted ciphertext and the target password to corresponding request terminals, wherein the encrypted ciphertext and the target password are sent through different interfaces.

8. A metabolite labeling device, comprising:

the obtaining module is used for obtaining a molecular structural formula of a metabolite to be marked, wherein the molecular structural formula comprises atomic information for forming the metabolite and a chemical connection relation among atoms;

the collection module is used for collecting node features in the atomic information and constructing a node matrix of the metabolite;

the processing module is used for constructing the adjacency relation among the atoms based on the chemical connection relation and generating an adjacency matrix of the metabolites according to the adjacency relation;

the fusion module is used for performing matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model, wherein the labeling model is a neural network model that is trained to convergence by pseudo-label self-training and is used to perform mass spectrogram classification on the metabolite;

and the reading module is used for reading the classification result output by the marking model and marking the mass spectrogram of the metabolite according to the classification result.

9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the metabolite labeling method of any one of claims 1 to 7.

10. A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the metabolite labeling method of any one of claims 1 to 7.

Technical Field

The embodiment of the invention relates to the field of artificial intelligence, in particular to a metabolite marking method, a metabolite marking device, computer equipment and a storage medium.

Background

The detection and quantification of cellular metabolites using Mass Spectrometry (MS) has become a common detection method and has great potential for development in a number of biomedical research and applications. However, the biggest challenge in mass spectrometric detection and quantification of metabolites is that most metabolites in an organism lack annotation of mass spectra, and only a very small number of metabolites have mass spectra annotation of standards. For example, quantitative analysis of targeted metabolomics (determination of absolute content of metabolites in a sample) requires manual identification of standards one by one, and establishment of a mass spectrum library, so as to identify and quantify the metabolites of interest in a biological sample.

The prior art primarily expands the metabolite library by extending the range of standards identified by mass spectrometry, but this approach incurs significant time, economic, and labor costs.

The inventors found in their research that, in the prior art, metabolite spectra are measured and labeled in the laboratory with a large amount of manual work, so that metabolites cannot be rapidly identified and labeled with mass spectrograms.

Disclosure of Invention

The embodiment of the invention provides a metabolite marking method, a metabolite marking device, computer equipment and a storage medium, which can improve the metabolite spectrogram marking efficiency.

In order to solve the above technical problem, the embodiment of the present invention adopts a technical solution that: provided is a metabolite labeling method including:

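obtaining a molecular structural formula of a metabolite to be marked, wherein the molecular structural formula comprises atomic information for forming the metabolite and a chemical connection relation between atoms;

collecting node features in the atomic information and constructing a node matrix of the metabolite;

constructing an adjacency relation among the atoms based on the chemical connection relation, and generating an adjacency matrix of the metabolite according to the adjacency relation;

performing matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model, wherein the labeling model is a neural network model that is trained to convergence by pseudo-label self-training and is used to perform mass spectrogram classification on the metabolite;

and reading the classification result output by the labeling model, and labeling the mass spectrogram of the metabolite according to the classification result.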

Optionally, obtaining the molecular structural formula of the metabolite to be labeled comprises:

sending request query information to a plurality of preset metabolite databases, wherein the request query information comprises the identity information of the metabolites;

determining a target database according to the reply information of the plurality of metabolite databases;

and sending a request to the target database for obtaining information, and receiving the molecular structural formula of the metabolite sent by the target database.

Optionally, the reply information includes a response time length of each metabolite database and a storage state of the molecular structural formula, and the determining the target database according to the reply information of the plurality of metabolite databases includes:

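screening the plurality of metabolite databases according to the storage state to obtain at least one database to be selected;

and sorting the at least one database to be selected in ascending order of response time length, and determining the database at the head of the sorted order as the target database.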

Optionally, before the collecting the node features in the atomic information and constructing the node matrix of the metabolite, the method includes:

storing the molecular structural formula in a local database, and generating a storage linked list based on the storage position of the molecular structural formula;

based on a plurality of preset storage hash algorithms, carrying out hash operation on the identity information to generate a hash structural formula of the molecular structural formula;

and storing the hash structural formula in a preset storage bitmap, and establishing a mapping association between the hash structural formula and the storage linked list.
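The storage steps above read like a Bloom-filter-style index: several hash functions set bits in a bitmap, and the hashed key points to the stored locations. The following is a minimal sketch under that interpretation, which the source does not state explicitly. The class name `StructureIndex`, the bitmap size, the chosen hash algorithms, the identifier `metabolite-001`, and the use of a Python list in place of the storage linked list are all assumptions for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


class StructureIndex:
    """Bitmap index keyed by several hashes of a metabolite's identity information."""

    def __init__(self, n_bits: int = 1024,
                 algorithms: Tuple[str, ...] = ("sha256", "sha512", "blake2b")):
        self.n_bits = n_bits
        self.algorithms = algorithms          # assumed stand-ins for the preset hash algorithms
        self.bitmap = bytearray((n_bits + 7) // 8)
        # Maps the tuple of bit positions to a list of storage locations
        # (a list stands in for the storage linked list).
        self.locations: Dict[Tuple[int, ...], List[str]] = {}

    def _positions(self, identity: str) -> Tuple[int, ...]:
        data = identity.encode("utf-8")
        return tuple(
            int.from_bytes(hashlib.new(name, data).digest(), "big") % self.n_bits
            for name in self.algorithms
        )

    def add(self, identity: str, storage_location: str) -> None:
        positions = self._positions(identity)
        for p in positions:
            self.bitmap[p // 8] |= 1 << (p % 8)   # set one bit per hash
        self.locations.setdefault(positions, []).append(storage_location)

    def might_contain(self, identity: str) -> bool:
        # False means "definitely not stored"; True may be a false positive.
        return all(self.bitmap[p // 8] & (1 << (p % 8))
                   for p in self._positions(identity))

    def lookup(self, identity: str) -> List[str]:
        if not self.might_contain(identity):
            return []
        return self.locations.get(self._positions(identity), [])


index = StructureIndex()
index.add("metabolite-001", "local/0001")
stored = index.lookup("metabolite-001")
missing = index.lookup("metabolite-999")
```

With this layout, a cheap bitmap check can rule out most identifiers that were never stored before any storage location is followed.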

Optionally, the performing matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model includes:

sequentially coding each atom according to a preset identification rule;

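sorting the feature elements corresponding to the atoms in the node matrix and the adjacency matrix according to the sequence codes;

inserting the sorted feature elements of each atom in the node matrix into the adjacency matrix, before the feature elements of the corresponding atom, to generate the fusion matrix;

and inputting the fusion matrix into a preset labeling model.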

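Optionally, the labeling model is trained as follows:

acquiring a training sample set, wherein the training sample set comprises a labeled sample set and an unlabeled sample set;

performing supervised training on an initial labeling model with the labeled sample set to obtain a first model;

classifying the unlabeled sample set with the first model to obtain a first classification result, selecting a preset proportion of the unlabeled samples together with their classification results based on the first classification result, and constructing a first labeled sample;

adding the first labeled sample to the labeled sample set, and performing supervised training on the first model with the updated labeled sample set to generate a second model;

and classifying the remaining unlabeled samples with the second model to obtain a second classification result, iteratively updating the labeled sample set, and training the labeling model on the updated labeled sample set until the labeling model converges.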

Optionally, after reading the classification result output by the labeling model and performing mass spectrogram labeling on the metabolite according to the classification result, the method includes:

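encrypting the mass spectrogram label according to a preset asymmetric encryption algorithm to generate ciphertext information;

performing a hash operation on the ciphertext information based on a plurality of preset encryption hash algorithms to generate a target password of the ciphertext information, and encrypting the ciphertext information according to the target password to generate an encrypted ciphertext;

and sending the encrypted ciphertext and the target password to the corresponding request terminal, wherein the encrypted ciphertext and the target password are sent through different interfaces.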

In order to solve the above-mentioned technical problem, an embodiment of the present invention further provides a metabolite labeling device, including:

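the obtaining module is used for obtaining a molecular structural formula of a metabolite to be marked, wherein the molecular structural formula comprises atomic information for forming the metabolite and a chemical connection relation among atoms;

the collection module is used for collecting node features in the atomic information and constructing a node matrix of the metabolite;

the processing module is used for constructing the adjacency relation among the atoms based on the chemical connection relation and generating an adjacency matrix of the metabolite according to the adjacency relation;

the fusion module is used for performing matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model, wherein the labeling model is a neural network model that is trained to convergence by pseudo-label self-training and is used to perform mass spectrogram classification on the metabolite;

and the reading module is used for reading the classification result output by the labeling model and labeling the mass spectrogram of the metabolite according to the classification result.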

Optionally, the metabolite labeling device further comprises:

the first request submodule is used for sending request query information to a plurality of preset metabolite databases, wherein the request query information comprises identity information of the metabolites;

the first processing submodule is used for determining a target database according to the reply information of the plurality of metabolite databases;

and the second request submodule is used for sending a request to the target database to acquire information and receiving the molecular structural formula of the metabolite sent by the target database.

Optionally, the reply message includes a response time of each metabolite database and a storage state of the molecular structural formula, and the metabolite marking device further includes:

the first screening submodule is used for screening the plurality of metabolite databases according to the storage state to obtain at least one database to be selected;

and the first execution submodule is used for performing ascending arrangement on the at least one database to be selected by taking the response time length as a sorting condition, and determining the database to be selected positioned at the head of the sorting as the target database.

Optionally, the metabolite labeling device further comprises:

the first storage submodule is used for storing the molecular structural formula in a local database and generating a storage linked list based on the storage position of the molecular structural formula;

the first operation submodule is used for carrying out hash operation on the identity information based on a plurality of preset storage hash algorithms to generate a hash structural formula of the molecular structural formula;

and the second storage submodule is used for storing the hash structural formula in a preset storage bitmap and generating the hash structural formula to be in mapping association with the storage linked list.

Optionally, the metabolite labeling device further comprises:

the first coding submodule is used for sequentially coding each atom according to a preset identification rule;

the first sorting submodule is used for sorting the feature elements corresponding to the atoms in the node matrix and the adjacency matrix according to the sequence codes;

the first fusion submodule is used for inserting the sorted feature elements of each atom in the node matrix into the adjacency matrix, before the feature elements of the corresponding atom, to generate the fusion matrix;

and the first input submodule is used for inputting the fusion matrix into a preset marking model.

Optionally, the metabolite labeling device further comprises:

the first obtaining submodule is used for obtaining a training sample set, wherein the training sample set comprises a labeled sample set and an unlabeled sample set;

the first training submodule is used for performing supervised training on an initial labeling model with the labeled sample set to obtain a first model;

the first classification submodule is used for classifying the unmarked sample set through the first model to obtain a first classification result, screening unmarked samples in a preset proportion and a classification result corresponding to the unmarked samples based on the first classification result, and constructing a first marked sample;

the second training submodule is used for updating the first labeled sample to the labeled sample set and carrying out supervised training on the first model through the updated labeled sample set to generate a second model;

and the second classification submodule is used for classifying the rest unmarked sample sets through the second model to obtain a second classification result, repeatedly and iteratively updating the marked sample sets, and training the marked model based on the updated marked sample sets until the marked model is trained to be converged.

Optionally, the metabolite labeling device further comprises:

the first encryption sub-module is used for encrypting the mass spectrogram mark according to a preset asymmetric encryption algorithm to generate ciphertext information;

the second operation submodule is used for carrying out hash operation on the ciphertext information based on a plurality of preset encryption hash algorithms to generate a target password of the ciphertext information and encrypting the ciphertext information according to the target password to generate an encrypted ciphertext;

and the first sending submodule is used for sending the encrypted ciphertext and the target password to a corresponding request terminal, wherein the encrypted ciphertext and the target password are sent through different interfaces.

In order to solve the above technical problem, an embodiment of the present invention further provides a computer device, including a memory and a processor, where the memory stores computer-readable instructions, and the computer-readable instructions, when executed by the processor, cause the processor to execute the steps of the above-mentioned metabolite marking method.

In order to solve the above technical problem, embodiments of the present invention further provide a storage medium storing computer-readable instructions, which, when executed by one or more processors, cause the one or more processors to perform the steps of the metabolite labeling method described above.

The beneficial effects of the embodiment of the invention are as follows: atomic features are extracted from the molecular structural formula of the metabolite to be labeled, the chemical connection relations between atoms are converted into adjacency relations, and the adjacency relations are converted into an adjacency matrix. The adjacency matrix and the atomic node matrix are fused to obtain a complete matrix of the metabolite, and the fusion matrix is classified by a neural network to quickly obtain the mass spectrogram of the metabolite. This completes the mass spectrogram labeling of the metabolite and improves labeling efficiency.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic diagram of a basic process flow for a metabolite labeling method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a process for obtaining a molecular structural formula according to an embodiment of the present application;

FIG. 3 is a flow diagram illustrating the determination of a target database according to an embodiment of the present application;

FIG. 4 is a schematic flow chart of the local storage of molecular structural formulas in accordance with one embodiment of the present application;

FIG. 5 is a schematic diagram illustrating a process for generating and using a fusion matrix according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating a training process of a label model according to an embodiment of the present application;

FIG. 7 is a schematic flow chart of transmitting the mass spectrogram label according to an embodiment of the present application;

FIG. 8 is a schematic view of the basic structure of a metabolite marking device according to an embodiment of the present application;

FIG. 9 is a block diagram of a basic structure of a computer device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As will be understood by those skilled in the art, a "terminal" as used herein includes both devices that have only a wireless signal receiver without transmit capability and devices that have receive and transmit hardware capable of two-way communication over a two-way communication link. Such a device may include: a cellular or other communication device with a single-line display, with a multi-line display, or without a multi-line display; a PCS (Personal Communications Service) device, which may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant), which may include a radio frequency receiver, a pager, internet/intranet access, a web browser, a notepad, a calendar and/or a GPS (Global Positioning System) receiver; and a conventional laptop and/or palmtop computer or other device having and/or including a radio frequency receiver. As used herein, a "terminal" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or situated and/or configured to operate locally and/or in a distributed fashion at any other location(s) on earth and/or in space. The "terminal" used herein may also be a communication terminal, a web-enabled terminal, or a music/video playing terminal, such as a PDA, an MID (Mobile Internet Device) and/or a mobile phone with music/video playing function, and may also be a smart TV, a set-top box, or the like.

Referring to FIG. 1, FIG. 1 is a schematic diagram of the basic flow of the metabolite labeling method of the present embodiment. As shown in FIG. 1, a metabolite labeling method includes:

S1100, obtaining a molecular structural formula of a metabolite to be marked, wherein the molecular structural formula comprises atomic information for forming the metabolite and a chemical connection relation among atoms;

Metabolites, also known as intermediate metabolites, are substances produced or consumed by metabolic processes in the human or animal body, excluding biological macromolecules. The precursors and degradation products of biological macromolecules are true metabolites. After a metabolite is produced, it is analyzed with a dedicated metabolite detection instrument to obtain its chemical name or number.

After the chemical name or number of the metabolite is obtained, request information is sent to preset metabolite databases to obtain the molecular structure of the metabolite. Different metabolite databases store the molecular formulas of different metabolites; in practice, no single database stores the molecular structures of all kinds of metabolites. Request information therefore needs to be sent to a plurality of metabolite databases by synchronous query, asking whether each database stores the molecular structural formula of the metabolite, and the database from which the molecular structural formula is requested is determined according to the feedback from each database.
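As an illustration of how the feedback might be turned into a choice of database, the following minimal sketch filters the replies by storage state and then picks the fastest responder, following the optional screening and ascending sort by response time described earlier in this document. The `Reply` type, its field names, and the database names are hypothetical; the actual network queries are not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reply:
    database: str        # hypothetical database name
    has_structure: bool  # storage state: is the structural formula stored?
    response_ms: float   # response time length of this reply


def choose_target_database(replies: List[Reply]) -> Optional[str]:
    """Return the fastest database that stores the structure, or None if none does."""
    # Step 1: screen by storage state to get the candidate databases.
    candidates = [r for r in replies if r.has_structure]
    if not candidates:
        return None
    # Step 2: sort ascending by response time and take the head of the list.
    candidates.sort(key=lambda r: r.response_ms)
    return candidates[0].database


replies = [
    Reply("db_a", False, 12.0),  # fastest, but does not store the structure
    Reply("db_b", True, 85.0),
    Reply("db_c", True, 40.0),
]
target = choose_target_database(replies)
```

Here `db_a` is excluded despite its short response time because it does not hold the structural formula.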

The molecular structural formula is a chemical formula that uses element symbols and short lines to represent how the atoms in a molecule of a compound (or a simple substance) are arranged and bonded; it is a compact way of describing molecular structure. The structural formula shows every chemical bond between the atoms in the molecule.

The molecular structural formula includes the atoms constituting the molecule and the chemical connection relations between the atoms. The chemical connection relation refers to the chemical bonds between the atoms, which include (but are not limited to) ionic, covalent, and metallic bonds.

S1200, collecting node characteristics in the atomic information and constructing a node matrix of the metabolite;

According to the atomic information in the molecular structural formula, the node features of the metabolite are generated. The node features include the standard atomic weight, the atom type, the number of bonds, the number of adjacent hydrogen atoms, whether the atom is in a ring or an aromatic ring, and the like; together they record the atomic composition of the metabolite.

Each node feature is encoded according to a preset encoding list, which records the vector value that each node feature maps to. After each node feature is mapped to its vector value, the vector values are arranged in a set order to generate an N x M matrix, where N is greater than or equal to 1 and M is greater than or equal to 1. Each row in the node matrix represents the node features of one atom in the molecular structure.
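A minimal sketch of building such a node matrix follows. The encoding list (a one-hot atom type followed by numeric features), the dictionary keys, and the choice to count only bonds to non-hydrogen atoms are assumptions made for illustration; the source does not specify the encoding.

```python
from typing import Dict, List

# Hypothetical encoding list: the position of each atom type in a one-hot vector.
ATOM_TYPES = ["C", "N", "O", "S", "P"]
# Standard atomic weights (conventional values) for the atom types above.
ATOMIC_WEIGHT = {"C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "P": 30.974}


def encode_atom(atom: Dict) -> List[float]:
    """Map one atom's features to a fixed-length row of the node matrix."""
    one_hot = [1.0 if atom["symbol"] == t else 0.0 for t in ATOM_TYPES]
    return one_hot + [
        ATOMIC_WEIGHT[atom["symbol"]],
        float(atom["bonds"]),             # bonds to non-hydrogen atoms (assumption)
        float(atom["hydrogens"]),         # number of adjacent hydrogen atoms
        1.0 if atom["in_ring"] else 0.0,
        1.0 if atom["aromatic"] else 0.0,
    ]


def build_node_matrix(atoms: List[Dict]) -> List[List[float]]:
    """Return an N x M matrix with one row per atom."""
    return [encode_atom(a) for a in atoms]


# Heavy atoms of ethanol (CH3-CH2-OH), listed in chain order.
ethanol_atoms = [
    {"symbol": "C", "bonds": 1, "hydrogens": 3, "in_ring": False, "aromatic": False},
    {"symbol": "C", "bonds": 2, "hydrogens": 2, "in_ring": False, "aromatic": False},
    {"symbol": "O", "bonds": 1, "hydrogens": 1, "in_ring": False, "aromatic": False},
]
node_matrix = build_node_matrix(ethanol_atoms)
```

In this example N is 3 (three heavy atoms) and M is 10 (five one-hot positions plus five numeric features).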

S1300, constructing an adjacency relation among the atoms based on the chemical connection relation, and generating an adjacency matrix of the metabolite according to the adjacency relation;

The adjacency relation between atoms is constructed from the chemical connection relations described in the molecular structural formula. The adjacency relation indicates whether two atoms in the molecular structural formula are connected, the type of chemical bond between connected atoms, and the like.

Different chemical bonds are first mapped to vector values according to bond type. Then, following the connections between atoms recorded in the molecular structural formula, the adjacency relation between two atoms is set to 0 when they are not connected, and to the vector value of the bond between them when they are connected. This completes the conversion of the adjacency relations between atoms.

After the adjacency relations of each atom with respect to the other atoms are arranged in order, the adjacency matrix of the metabolite is generated from them. The adjacency matrix is an N x M matrix, where N is greater than or equal to 1 and M is greater than or equal to 1. Each row of the adjacency matrix represents the adjacency relations between one atom and the other atoms.
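The construction can be sketched as follows. The bond codes in `BOND_CODES` are hypothetical scalar values standing in for the vector mapping of bond types, and the sketch produces a square matrix (one row and one column per atom), which is one special case of the dimensions stated above.

```python
from typing import List, Tuple

# Hypothetical mapping from bond type to a numeric code; 0 means "not connected".
BOND_CODES = {"single": 1.0, "double": 2.0, "triple": 3.0, "aromatic": 1.5}


def build_adjacency_matrix(n_atoms: int,
                           bonds: List[Tuple[int, int, str]]) -> List[List[float]]:
    """Build a symmetric matrix whose entry (i, j) encodes the bond between atoms i and j."""
    matrix = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i, j, bond_type in bonds:
        code = BOND_CODES[bond_type]
        matrix[i][j] = code
        matrix[j][i] = code  # a bond connects both atoms, so mirror the entry
    return matrix


# Heavy atoms of acetaldehyde: C0 - C1 = O2
adjacency = build_adjacency_matrix(3, [(0, 1, "single"), (1, 2, "double")])
```

Row 1 of the result, `[1.0, 0.0, 2.0]`, reads as: the middle carbon has a single bond to atom 0 and a double bond to atom 2.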

S1400, performing matrix fusion on the node matrix and the adjacent matrix to generate a fusion matrix, and inputting the fusion matrix into a preset labeling model, wherein the labeling model is a neural network model which is trained to a convergence state based on a self-training mode of a pseudo label and is used for performing mass spectrogram classification on the metabolites;

The generated node matrix and adjacency matrix are fused by insertion (plug-in) fusion. Specifically, the node matrix row of each atom is inserted before that atom's adjacency row in the adjacency matrix, so that in the fusion matrix the row following each atom's node feature row is that atom's adjacency row. The fusion matrix obtained by fusing the node matrix and the adjacency matrix is a 2N x M matrix, where N is greater than or equal to 1 and M is greater than or equal to 1.

The plug-in fusion can enable the relevant characteristics of atoms to be arranged in a concentrated manner, is favorable for extracting the characteristics of the atoms, and improves the efficiency and the accuracy of subsequent processing.

However, the fusion method of the node matrix and the adjacency matrix is not limited to this, and according to different application scenarios, in some embodiments, the fusion method of the node matrix and the adjacency matrix is as follows: and performing matrix splicing, namely splicing the node matrix before the adjacent matrix or splicing the adjacent matrix before the node matrix to construct an N x 2M-dimensional matrix, wherein N is more than or equal to 1, and M is more than or equal to 1.
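The two fusion modes can be sketched as follows, assuming both matrices are plain lists of rows with one row per atom and the same width M. The function names and the example node features (atomic number, number of bonds, charge) are hypothetical.

```python
def interleave_fusion(node_matrix, adjacency_matrix):
    """Plug-in fusion: node row of atom k, then adjacency row of atom k (2N x M)."""
    if len(node_matrix) != len(adjacency_matrix):
        raise ValueError("both matrices need one row per atom")
    fused = []
    for node_row, adjacency_row in zip(node_matrix, adjacency_matrix):
        fused.append(list(node_row))
        fused.append(list(adjacency_row))
    return fused


def splice_fusion(node_matrix, adjacency_matrix):
    """Matrix splicing: node features followed by adjacency entries (N x 2M)."""
    if len(node_matrix) != len(adjacency_matrix):
        raise ValueError("both matrices need one row per atom")
    return [list(n) + list(a) for n, a in zip(node_matrix, adjacency_matrix)]


# Water, with atoms ordered O, H, H.
nodes = [[8, 2, 0], [1, 1, 0], [1, 1, 0]]
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

fused = interleave_fusion(nodes, adjacency)   # 6 rows x 3 columns
spliced = splice_fusion(nodes, adjacency)     # 3 rows x 6 columns
```

Splicing the adjacency matrix before the node matrix instead only requires swapping the two arguments of `splice_fusion`.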

The input of the labeling model is the fusion matrix, and the output is the mass spectrogram corresponding to the metabolite. The mass spectrogram records the mass-to-charge ratios of the metabolite and the relative abundance at each mass-to-charge ratio.

The labeling model is obtained by training a feedforward neural network model. However, the type of the labeling model is not limited to this. According to different application scenarios, in some embodiments, the labeling model can also be obtained by training a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or any variant of these three models.

The labeling model is trained by pseudo-label-based self-training, as follows: a training sample set is obtained, which comprises a labeled sample set and an unlabeled sample set; supervised training is performed on the initial labeling model with the labeled sample set to obtain a first model; the unlabeled sample set is classified by the first model to obtain a first classification result, and a preset proportion of unlabeled samples and their corresponding classification results are screened based on the first classification result to construct first labeled samples; the first labeled samples are added to the labeled sample set, and supervised training is performed on the first model with the updated labeled sample set to generate a second model; the remaining unlabeled sample set is classified by the second model to obtain a second classification result, and the steps of updating the labeled sample set and training the labeling model on the updated labeled sample set are iterated until the labeling model converges.

The labeling model trained to a convergence state can classify the mass spectrogram of the metabolite according to the input fusion matrix.

S1500, reading the classification result output by the labeling model, and labeling the mass spectrogram of the metabolite according to the classification result.

The classification result output by the labeling model is read, where the classification result is the mass spectrogram of the metabolite. The metabolite and the mass spectrogram are stored as a key-value pair, or the storage address of the mass spectrogram is mapped to the metabolite to establish a mapping list, which completes the mass spectrogram labeling of the metabolite.

In the above embodiment, atomic features are extracted from the molecular structural formula of the metabolite to be labeled, the chemical connection relationships between atoms are converted into adjacency relationships, and the adjacency relationships are further converted into an adjacency matrix. The adjacency matrix and the atomic node matrix are fused to obtain a complete matrix of the metabolite, and neural network classification of the fusion matrix quickly yields the mass spectrogram of the metabolite. This completes the mass spectrogram labeling of the metabolite and improves the labeling efficiency of mass spectrograms.

In some embodiments, different metabolite databases store different metabolite information, and no single database collects all metabolite information. Therefore, when the molecular structural formula of a metabolite is obtained, query information needs to be sent to a plurality of metabolite databases. Referring to fig. 2, fig. 2 is a schematic flow chart of obtaining a molecular structural formula according to the present embodiment.

As shown in fig. 2, S1100 includes:

S1111, sending request query information to a plurality of preset metabolite databases, wherein the request query information comprises the identity information of the metabolite;

The terminal or the server in the present embodiment stores the access addresses of a plurality of metabolite databases locally in advance. After the identity information of the metabolite is obtained, query information is sent in turn to the plurality of prestored metabolite databases.

The query information records the identity information of the metabolite to be queried, and the identity information comprises a chemical name, a common name, or another universal identifier capable of identifying the metabolite.

S1112, determining a target database according to the reply information of the plurality of metabolite databases;

After the query information is sent to the plurality of metabolite databases, each metabolite database searches its own data according to the identity information of the metabolite in the query information, and the search result is either that the database contains information on the metabolite or that it does not. A metabolite database that contains the metabolite information sends reply information to the terminal that sent the query information. A metabolite database that does not contain the metabolite information sends no reply information, disconnects from the terminal, and releases the interface connected to the terminal back into the interface queue.

After receiving the reply information sent by a metabolite database, the terminal calculates the response time of the reply information as follows: response time = timestamp of receiving the reply information - timestamp of sending the query information. That is, the terminal generates a first timestamp when sending the query information and a second timestamp when receiving the reply information. Subtracting the first timestamp from the second timestamp gives the response time of each reply.

The response time of each reply reflects the response speed of the corresponding metabolite database, and the metabolite database with the shortest response time is selected as the target database. However, the way of determining the target database is not limited to this. In some embodiments, the terminal checks whether it satisfies the metabolite acquisition policy recorded in the reply information, and when the terminal's own conditions match the acquisition policy, the matching metabolite database is determined as the target database.

S1113, sending a request to the target database to obtain information, and receiving the molecular structural formula of the metabolite sent by the target database.

After the target database is determined, the terminal sends an information acquisition request to the target database, where the request includes the identity information of the metabolite. After receiving the request, the target database retrieves the molecular structural formula corresponding to the identity information in the request and returns the molecular structural formula to the terminal.

In some embodiments, after the terminal obtains the reply information of each metabolite database, the target database is determined according to the response time length associated with the reply information. Referring to fig. 3, fig. 3 is a schematic flow chart illustrating the determination of the target database according to the present embodiment.

As shown in fig. 3, S1112 includes:

S1121, screening the plurality of metabolite databases according to the storage state to obtain at least one database to be selected;

After the query information is sent to the plurality of metabolite databases, each metabolite database searches its own data according to the identity information of the metabolite in the query information, and the search result is either that the database contains information on the metabolite or that it does not. A metabolite database that contains the metabolite information sends reply information back to the terminal that sent the query information, and a metabolite database that does not contain the metabolite information sends no reply information.

After the terminal obtains the reply information, each metabolite database that sent reply information is determined as a database to be selected. The number of databases to be selected may be 1, 2, 3, or more.

S1122, with the response time as a sorting condition, sorting the at least one database to be selected in ascending order, and determining the database to be selected at the head of the sorting as the target database.

After receiving the reply information sent by a database to be selected, the terminal calculates the response time of the reply information as follows: response time = timestamp of receiving the reply information - timestamp of sending the query information. That is, the terminal generates a first timestamp when sending the query information and a second timestamp when receiving the reply information. Subtracting the first timestamp from the second timestamp gives the response time of each reply.

After the response time of each database to be selected is calculated, the databases to be selected are arranged in ascending order of response time. The database to be selected at the head of the ascending queue has the highest response speed and is determined as the target database. Determining the target database in this way selects the database with the highest response speed and increases the speed of acquiring the molecular structural formula of the metabolite.
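The selection of the target database by response time can be sketched as below. The record type `QueryRecord`, the function name, and the database names are hypothetical, and timestamps are assumed to be seconds on a common clock.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueryRecord:
    database: str                # hypothetical database name
    sent_at: float               # first timestamp: query information sent
    replied_at: Optional[float]  # second timestamp: reply received, None if no reply

    @property
    def response_time(self) -> float:
        return self.replied_at - self.sent_at


def pick_target_database(records: List[QueryRecord]) -> Optional[str]:
    # Only databases that replied hold the metabolite and become candidates.
    candidates = [r for r in records if r.replied_at is not None]
    if not candidates:
        return None
    # Ascending sort by response time; the head of the queue is the target.
    candidates.sort(key=lambda r: r.response_time)
    return candidates[0].database


records = [
    QueryRecord("db_a", sent_at=10.00, replied_at=10.42),
    QueryRecord("db_b", sent_at=10.01, replied_at=10.19),
    QueryRecord("db_c", sent_at=10.02, replied_at=None),
]
target = pick_target_database(records)
```

Here `db_c` never replies and is excluded, and `db_b` is chosen because its reply arrives fastest.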

In some embodiments, the terminal stores the molecular structure locally after obtaining the molecular structure of the metabolite. Referring to fig. 4, fig. 4 is a schematic flow chart illustrating the local storage of the molecular structural formula according to the present embodiment.

As shown in fig. 4, before S1200 the method further includes:

S1131, storing the molecular structural formula in a local database, and generating a storage linked list based on the storage position of the molecular structural formula;

After the terminal receives the molecular structural formula of the metabolite, the molecular structural formula is stored in a local database, and a storage linked list of its storage position is generated once it is stored. The storage linked list records the physical address or the logical address of the storage position of the molecular structural formula, and the molecular structural formula can be accessed through this address information.

S1132, carrying out hash operations on the identity information based on a plurality of preset storage hash algorithms to generate a hash structural formula of the molecular structural formula;

When storing the molecular structural formula, the terminal performs hash calculations on the identity information of the metabolite with a plurality of preset storage hash algorithms. The hash algorithms differ from one another, each is assigned a distinct number, and the algorithms are arranged by number. The identity information is hashed by each storage hash algorithm in that order, each storage hash algorithm outputs a hash array of equal length, and the hash arrays are concatenated in the order of the storage hash algorithms to generate the hash structural formula.

S1133, storing the hash structural formula in a preset storage bitmap, and associating the hash structural formula with the storage linked list by mapping.

After the hash structural formula is generated, it is stored in a preset storage bitmap. Because the hash structural formula is composed of binary digits, it can be mapped into pixel values of 0 and 255 and stored in the preset storage bitmap. The hash structural formula is then associated with the storage linked list, and a mapping relationship between the hash structural formula and the storage linked list is established.

When the molecular structural formula is called again after being stored, it cannot be obtained through a direct query because it is stored in image form. In this case, the identity information of the metabolite is processed with the storage hash algorithms to generate the hash structural formula corresponding to the identity information, and the storage position of the molecular structural formula is obtained from the mapping relationship between the hash structural formula and the storage linked list.

In this way, the molecular structural formula does not need to be named when it is stored and is retrieved directly through the hash algorithms. This improves storage efficiency and security, because others cannot obtain the storage information of the molecular structural formula from a stored name.
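The hash-based storage and lookup can be sketched as follows. The choice of SHA-256, SHA3-256, and BLAKE2s is an assumption made only because all three produce 32-byte digests; the application does not name its storage hash algorithms. The dictionary `storage_index` stands in for the mapping between the hash structural formula and the storage linked list, the bit-to-pixel mapping assumes the hash is treated as a bit string, and the address string is hypothetical.

```python
import hashlib

# Assumed storage hash algorithms, listed in their numbered order.
STORAGE_HASHES = ("sha256", "sha3_256", "blake2s")


def hash_structural_formula(identity: str) -> bytes:
    """Concatenate the equal-length digests of the identity information."""
    data = identity.encode("utf-8")
    return b"".join(hashlib.new(name, data).digest() for name in STORAGE_HASHES)


def to_bitmap_pixels(hash_bytes: bytes) -> list:
    """Map each bit to a pixel value: bit 0 -> 0, bit 1 -> 255."""
    return [
        255 if (byte >> shift) & 1 else 0
        for byte in hash_bytes
        for shift in range(7, -1, -1)
    ]


storage_index = {}  # hash structural formula -> storage address


def store(identity: str, address: str) -> None:
    storage_index[hash_structural_formula(identity)] = address


def lookup(identity: str):
    return storage_index.get(hash_structural_formula(identity))


store("glucose", "block-17/offset-4096")
pixels = to_bitmap_pixels(hash_structural_formula("glucose"))
```

Looking up a stored identity recomputes the same hash structural formula and returns the address, while an identity that was never stored returns `None`.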

In some embodiments, the merging of the node matrix and the adjacency matrix is a plug-in merge. Referring to fig. 5, fig. 5 is a schematic diagram illustrating a generating and using process of the fusion matrix of the present embodiment.

As shown in fig. 5, S1400 includes:

S1411, sequentially coding each atom according to a preset identification rule;

In this embodiment, each atom in the molecular structural formula needs to be coded. The coding follows a preset identification rule, under which the atoms are assigned sequential codes.

The identification rule follows the order of the molecular structural formula and codes the atoms from left to right. When the molecular structural formula includes a ring structure such as a benzene ring, the ring atoms are coded clockwise.

S1412, sorting the feature elements corresponding to the atoms in the node matrix and the adjacency matrix according to the sequential codes;

Each row in the node matrix represents the node features of one atom in the molecular structural formula, and each row in the adjacency matrix represents the adjacency relationships between one atom and the other atoms. Each row in the node matrix is coded according to the sequential code of its atom, so that the codes of the atoms correspond one-to-one with the codes of the feature elements of the node matrix. Each row in the adjacency matrix is coded in the same way, so that the codes of the atoms correspond one-to-one with the codes of the feature elements of the adjacency matrix. After the feature elements in the node matrix and the adjacency matrix are coded, the feature elements of each matrix are arranged in ascending or descending order of their codes.
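Ordering the rows by atom code can be sketched as below, assuming the codes are integers supplied in the same order as the matrix rows. The function name and the placeholder rows are hypothetical. The text describes row ordering only; for a square adjacency matrix the columns would presumably need the same permutation, which this sketch does not attempt.

```python
def sort_rows_by_code(codes, matrix, descending=False):
    """Return the rows of `matrix` reordered by the sequential code of each atom."""
    if len(codes) != len(matrix):
        raise ValueError("one code is required per row")
    order = sorted(range(len(codes)), key=lambda k: codes[k], reverse=descending)
    return [matrix[k] for k in order]


# Rows arrive in arbitrary order; codes give each atom's sequential position.
codes = [2, 0, 1]
node_rows = [["C-row"], ["O-row"], ["H-row"]]

ascending = sort_rows_by_code(codes, node_rows)
descending = sort_rows_by_code(codes, node_rows, descending=True)
```

Applying the same function with the same codes to both matrices keeps the node row and the adjacency row of each atom at the same position.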

S1413, inserting the sorted feature elements corresponding to each atom in the node matrix before the feature elements of the corresponding atom in the adjacency matrix, and generating the fusion matrix;

After the sorting is completed, the row corresponding to each atom in the node matrix is inserted before the adjacency relationship row of that atom in the adjacency matrix. That is, in the fusion matrix, the row immediately below the node feature row of each atom is the adjacency relationship row of that atom. The fusion matrix obtained by fusing the node matrix and the adjacency matrix is a 2N x M-dimensional matrix, where N is greater than or equal to 1 and M is greater than or equal to 1.

S1414, inputting the fusion matrix into a preset labeling model.

After the node matrix and the adjacency matrix are fused to generate the fusion matrix, the fusion matrix is input into a preset labeling model, where the labeling model is a neural network model which is trained to a convergence state by pseudo-label-based self-training and is used for performing mass spectrogram classification on metabolites. Plug-in fusion arranges the features related to each atom together, which facilitates extraction of atomic features and improves the efficiency and accuracy of subsequent processing.

In some embodiments, the labeling model is trained to a converged state based on a self-training approach of the pseudo-label. Referring to fig. 6, fig. 6 is a schematic diagram of a training process of the labeling model of the present embodiment.

As shown in fig. 6, the training mode is as follows:

S1611, obtaining a training sample set, wherein the training sample set comprises a labeled sample set and an unlabeled sample set;

When the labeling model in this embodiment is trained, a training sample set for training the labeling model is collected first, and the training sample set comprises a labeled sample set and an unlabeled sample set. The training samples in the labeled sample set are labeled samples, in which each molecular structural formula is labeled with its corresponding mass spectrogram. The training samples in the unlabeled sample set are unlabeled samples, in which no molecular structural formula has a labeled mass spectrogram.

S1612, performing supervised training on the initial labeling model through the labeled sample set to obtain a first model;

The initial labeling model is trained with the training samples in the labeled sample set, and the training mode is supervised training, specifically: the training samples in the labeled sample set are input into the initial labeling model one at a time, and the initial labeling model performs feature extraction and mass spectrogram classification on each input training sample to obtain the mass spectrogram classification result corresponding to that sample. A feature distance between the mass spectrogram classification result and the labeled mass spectrogram is calculated according to the loss function of the initial labeling model and compared with a preset threshold. When the feature distance is less than or equal to the threshold, the classification result is considered correct. When the feature distance is greater than the threshold, the classification result is considered incorrect, and the weights of the initial labeling model are adjusted through back propagation so that the classification result of the initial labeling model gradually approaches the pre-labeled mass spectrogram. The initial labeling model is trained with the labeled sample set in this way, and the first model is generated once the number of training iterations reaches the set number.
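The threshold-gated training procedure can be illustrated with a deliberately small stand-in. The sketch below replaces the neural network with a one-weight linear model y = weight * x and the feature distance with an absolute error; the function name, learning rate, threshold, and data are all assumptions chosen only to make the control flow visible.

```python
def train_supervised(samples, weight=0.0, threshold=0.05,
                     learning_rate=0.1, epochs=200):
    """Fit y = weight * x, feeding one labeled sample at a time."""
    for _ in range(epochs):                  # stop after the set number of passes
        for x, target in samples:
            error = weight * x - target      # stand-in for the feature distance
            if abs(error) <= threshold:
                continue                     # result counts as correct, no update
            # Result counts as incorrect: gradient step on 0.5 * error**2.
            weight -= learning_rate * error * x
    return weight


labeled_samples = [(1.0, 2.0), (2.0, 4.0)]   # targets follow y = 2 * x
first_model_weight = train_supervised(labeled_samples)
```

The weight moves toward 2 and stops changing once every sample falls within the threshold, which mirrors the rule that weights are adjusted only for samples judged incorrect.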

S1613, classifying the unlabeled sample set through the first model to obtain a first classification result, screening a preset proportion of unlabeled samples and the classification results corresponding to those unlabeled samples based on the first classification result, and constructing first labeled samples;

After the first model is obtained through training, the unlabeled samples are classified by the first model: the training samples in the unlabeled sample set are input into the first model in turn to obtain a first classification result corresponding to each unlabeled sample.

After the first classification result of each unlabeled sample is obtained, screening determines the unlabeled samples whose classification results match them. The screening mode is: select the unlabeled samples whose confidence in the first classification result is greater than 80%. However, the screening mode is not limited to this. According to different application scenarios, in some embodiments, the first classification results that are correct are identified and screened out manually.

After the screening is finished, each first classification result that meets the condition is used as the label information of the corresponding unlabeled sample, so that the unlabeled sample becomes a labeled sample; the labeled samples generated in this way are the first labeled samples. In some embodiments, the ratio of the number of first labeled samples to the number of samples in the unlabeled sample set is limited to 10%. However, the value of the preset ratio is not limited to this. According to different application scenarios, in some embodiments, the ratio can be any value greater than 0% and less than 100%.
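One possible way to combine the confidence screening with the preset ratio is sketched below. How the two conditions interact is an assumption (here: keep results above the confidence threshold, then cap the count at the ratio, preferring the most confident); the function name, sample identifiers, and confidence values are hypothetical.

```python
def select_pseudo_labels(predictions, min_confidence=0.8, max_ratio=0.10):
    """predictions: list of (sample_id, predicted_label, confidence)."""
    # Keep only results whose confidence exceeds the threshold.
    confident = [p for p in predictions if p[2] > min_confidence]
    # Prefer the most confident results when the quota is smaller.
    confident.sort(key=lambda p: p[2], reverse=True)
    quota = int(len(predictions) * max_ratio)
    return confident[:quota]


confidences = [0.95, 0.85, 0.60, 0.99, 0.40, 0.81, 0.30, 0.20, 0.70, 0.10]
predictions = [
    ("m%d" % i, "spectrum_%d" % i, c) for i, c in enumerate(confidences)
]
first_labeled = select_pseudo_labels(predictions)
```

With ten unlabeled samples and a 10% ratio, only the single most confident result is promoted to a labeled sample.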

S1614, adding the first labeled samples to the labeled sample set, and performing supervised training on the first model through the updated labeled sample set to generate a second model;

The first labeled samples are added to the labeled sample set to form an updated labeled sample set. The training samples in the updated labeled sample set are input into the first model one at a time, and the first model performs feature extraction and mass spectrogram classification on each input training sample to obtain the mass spectrogram classification result corresponding to that sample. A feature distance between the mass spectrogram classification result and the labeled mass spectrogram is calculated according to the loss function of the first model and compared with a preset threshold. When the feature distance is less than or equal to the threshold, the classification result is considered correct. When the feature distance is greater than the threshold, the classification result is considered incorrect, and the weights of the first model are adjusted through back propagation so that the classification result of the first model gradually approaches the pre-labeled mass spectrogram. The first model is trained with the updated labeled sample set in this way, and the second model is generated once the number of training iterations reaches the set number.

S1615, classifying the remaining unlabeled sample set through the second model to obtain a second classification result, and iteratively updating the labeled sample set and training the labeling model based on the updated labeled sample set until the labeling model converges.

After the second model is obtained through training, the remaining unlabeled samples are classified by the second model: the training samples in the remaining unlabeled sample set are input into the second model in turn to obtain a second classification result corresponding to each unlabeled sample.

After the second classification result of each unlabeled sample is obtained, screening determines the unlabeled samples whose classification results match them. The screening mode is: select the unlabeled samples whose confidence in the second classification result is greater than 80%. However, the screening mode is not limited to this. According to different application scenarios, in some embodiments, the second classification results that are correct are identified and screened out manually.

After the screening is finished, each second classification result that meets the condition is used as the label information of the corresponding unlabeled sample, so that the unlabeled sample becomes a labeled sample; the labeled samples generated in this way are the second labeled samples. In some embodiments, the ratio of the number of second labeled samples to the number of samples in the unlabeled sample set is limited to 10%. However, the value of the preset ratio is not limited to this. According to different application scenarios, in some embodiments, the ratio can be any value greater than 0% and less than 100%.

In this way, unlabeled samples whose classification results reach the preset standard are repeatedly moved from the unlabeled sample set into the labeled sample set, so the number of labeled samples keeps growing. The robustness of the labeling model can thus be improved even though training starts from a limited number of labeled samples, and the labeling accuracy is improved. For example, a labeled metabolite data set A1 is used to train model 1. Model 1 is applied to an unlabeled metabolite data set B1 to predict the corresponding mass spectrograms y1, which are soft pseudo labels. According to the prediction scores, the top 10% of samples in B1 with the highest predicted quality are screened and merged into the labeled data set A1 to obtain a new data set A2. The labeling model is trained again with A2 to obtain iterated model 2, and model 2 is applied to the remaining unlabeled metabolite data set B2 to predict the corresponding mass spectrograms y2. The top 10% of samples in B2 with the highest predicted quality are screened according to the prediction scores and merged into the labeled data set A2 to obtain a new data set A3, and the labeling model is trained again with A3 to obtain iterated model 3. These steps are repeated for multiple iterations to finally obtain an optimal model.
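The A1/B1, A2/B2 iteration can be sketched with a toy stand-in for the labeling model. The sketch swaps the neural network for a one-dimensional nearest-centroid classifier and uses a distance-based score as the prediction score; all function names, the ratio, the stopping rule (a fixed maximum number of rounds or an empty unlabeled set, in place of a convergence test), and the data are assumptions.

```python
def train_centroids(labeled):
    """labeled: list of (x, label). Returns {label: mean of x}."""
    sums = {}
    for x, label in labeled:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def predict(centroids, x):
    """Return (label, score); the score is higher the closer x is to its centroid."""
    label = min(centroids, key=lambda name: abs(x - centroids[name]))
    return label, 1.0 / (1.0 + abs(x - centroids[label]))


def self_train(labeled, unlabeled, ratio=0.10, max_rounds=20):
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train_centroids(labeled)                 # model 1 trained on A1
    for _ in range(max_rounds):
        if not unlabeled:
            break
        # Predict pseudo labels and rank the samples by prediction score.
        scored = sorted(((predict(model, x), x) for x in unlabeled),
                        key=lambda item: item[0][1], reverse=True)
        take = max(1, int(len(scored) * ratio))      # top share joins the labeled set
        labeled += [(x, label) for (label, _), x in scored[:take]]
        unlabeled = [x for _, x in scored[take:]]
        model = train_centroids(labeled)             # retrain on the enlarged set
    return model, labeled, unlabeled


model, final_labeled, remaining = self_train(
    [(0.0, "a"), (10.0, "b")], [1.0, 2.0, 9.0, 8.0], ratio=0.5
)
```

In each round the highest-scoring predictions are moved into the labeled set and the model is retrained, so the labeled set grows from two samples to six in this example.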

In some embodiments, when the terminal is a server or serves as a server, after the generation of the mass spectrogram classification result of the metabolite is completed, the mass spectrogram needs to be sent to a request terminal requesting to acquire a corresponding mass spectrogram. Referring to fig. 7, fig. 7 is a schematic flow chart of sending a mass spectrogram according to the present embodiment.

As shown in fig. 7, S1500 includes:

S1511, encrypting the mass spectrogram label according to a preset asymmetric encryption algorithm to generate ciphertext information;

When a request sent by a request terminal to obtain the mass spectrogram of a metabolite is received, the mass spectrogram is encrypted. The encryption mode is asymmetric encryption, and an asymmetric encryption algorithm needs two keys: a public key and a private key. The public key and the private key form a pair, and data encrypted with the public key can be decrypted only with the corresponding private key. The algorithm is called asymmetric because encryption and decryption use two different keys.

The mass spectrogram label is encrypted with the public key of the asymmetric encryption algorithm to generate the ciphertext information.

S1512, carrying out hash operation on the ciphertext information based on a plurality of preset encryption hash algorithms to generate a target password of the ciphertext information, and encrypting the ciphertext information according to the target password to generate an encrypted ciphertext;

The ciphertext information is processed by a plurality of preset encryption hash algorithms. The encryption hash algorithms differ from one another, so the same input produces different outputs, but the outputs of the different encryption hash algorithms have the same length. The ciphertext information is input into the plurality of encryption hash algorithms to generate a plurality of encryption arrays, and the encryption arrays are spliced to generate the target password of the ciphertext information. The ciphertext information is then encrypted with the target password to generate the encrypted ciphertext.

S1513, the encrypted ciphertext and the target password are sent to the corresponding request terminal, wherein the encrypted ciphertext and the target password are sent through different interfaces.

The encrypted ciphertext and the target password are both sent to the request terminal according to its communication address, but through different interfaces. For example, the encrypted ciphertext is sent through a network link interface, and the target password is sent through an interface such as short message, telephone, or mail. Sending the encrypted ciphertext and the target password through different channels mainly avoids the problem that both pieces of data, if transmitted through the same channel, are easily intercepted together, so dual-channel sending improves data security.

After the request terminal receives the target password and the encrypted ciphertext, it decrypts the encrypted ciphertext with the target password to obtain the ciphertext information, and then decrypts the ciphertext information with the private key of the asymmetric encryption to generate the mass spectrogram of the metabolite.

Generating the ciphertext information and the encrypted ciphertext with the asymmetric encryption algorithm and the encryption hash algorithms improves data security, and transmitting the data through two different interfaces further improves the security of data transmission.

Specifically, referring to fig. 8, fig. 8 is a schematic view of the basic structure of the metabolite marking device of the present embodiment.

As shown in fig. 8, a metabolite marking device includes: an acquisition module 1100, a collecting module 1200, a processing module 1300, a fusion module 1400, and a reading module 1500. The acquisition module 1100 is configured to acquire a molecular structural formula of a metabolite to be labeled, where the molecular structural formula includes atom information constituting the metabolite and a chemical connection relationship between atoms; the collecting module 1200 is configured to collect node features in the atomic information, and construct a node matrix of the metabolite; the processing module 1300 is configured to construct an adjacency relationship between the atoms based on the chemical connection relationship, and generate an adjacency matrix of the metabolite according to the adjacency relationship; the fusion module 1400 is configured to perform matrix fusion on the node matrix and the adjacency matrix to generate a fusion matrix, and input the fusion matrix into a preset labeling model, where the labeling model is a neural network model trained to a convergence state by pseudo-label-based self-training and used for performing mass spectrogram classification on the metabolites; the reading module 1500 is configured to read the classification result output by the labeling model, and perform mass spectrogram labeling on the metabolite according to the classification result.

The metabolite marking device extracts atomic features from the molecular structural formula of the metabolite to be labeled, converts the chemical connection relationships between atoms into adjacency relationships, and further converts the adjacency relationships into an adjacency matrix. The adjacency matrix and the atomic node matrix are fused to obtain a complete matrix of the metabolite, and neural network classification of the fusion matrix quickly yields the mass spectrogram of the metabolite. This completes the mass spectrogram labeling of the metabolite and improves the labeling efficiency of mass spectrograms.

In some embodiments, the metabolite labeling device further comprises:

the first processing submodule is used for determining a target database according to the reply information of the plurality of metabolite databases;

In some embodiments, the reply message includes a response time of each metabolite database and a storage status of the molecular structural formula, and the metabolite labeling device further includes: