Automatic mining method for oligopeptide drug leads based on machine learning

Document No.: 1955261. Publication date: 2021-12-10.

Reading note: This invention, "An automatic mining method for oligopeptide drug leads based on machine learning" (一种基于机器学习的寡肽药先导物的自动挖掘方法), was created by Zhang Yongbiao (张永彪), Xiao Baichuan (肖百川), Wang Xiaogang (王晓刚) and Ma Chao (马超) on 2021-09-17. Abstract: The invention discloses an automatic mining method for oligopeptide drug leads based on machine learning, comprising the following steps: obtaining a functional protein set and extracting its Intrinsically Disordered Regions (IDRs); constructing an N-Gram model based on a deep neural network; learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides; using a Monte Carlo method to simulate, according to the amino acid context probability vectors, the extension of an oligopeptide from scratch, thereby obtaining candidate oligopeptides; and scoring and ranking the candidate oligopeptides and selecting several top-ranked candidates for functional verification. By combining the N-Gram model with the Monte Carlo method, the invention mines potentially druggable functional oligopeptides from a functional protein set that is positively related to the treatment of a disease, and it is generally applicable.

1. An automatic mining method for oligopeptide drug leads based on machine learning, characterized by comprising the following steps:

S1: obtaining a functional protein set and extracting the IDRs of the functional protein set;

S2: constructing an N-Gram model based on a deep neural network;

S3: learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides;

S4: using a Monte Carlo method to simulate, according to the amino acid context probability vectors, the extension of an oligopeptide from scratch, thereby obtaining candidate oligopeptides;

S5: scoring and ranking the candidate oligopeptides, and selecting several top-ranked candidate oligopeptides for functional verification.

2. The method for automatically mining oligopeptide drug leads based on machine learning according to claim 1, wherein the expression of the N-Gram model is as follows:

$p(\omega_k \mid \mathrm{context}(\omega_k)) = F(i_{\omega_k}, v(\mathrm{context}(\omega_k)), \theta)$;

where $F$ denotes the deep neural network, $\theta$ denotes the parameters to be learned in $F$, $i_{\omega_k}$ denotes the index of the k-th word $\omega_k$ in the amino acid character set, and $v(\mathrm{context}(\omega_k))$ denotes the word vector of $\mathrm{context}(\omega_k)$, the context of the character $\omega_k$.

3. The method for automatically mining oligopeptide drug leads based on machine learning according to claim 1, wherein S4 comprises the following steps:

S41: selecting any amino acid as the starting amino acid;

S42: inferring, with the N-Gram model, the context probability vector of the linking amino acid for the oligopeptide to be extended;

S43: generating the linking amino acid by Monte Carlo simulation according to the context probability vector inferred in S42;

S44: joining the linking amino acid to the current oligopeptide to be extended to obtain a new oligopeptide to be extended;

S45: repeating S42 to S44, adding one amino acid per round, until a preset termination condition is met, thereby obtaining a candidate oligopeptide.

4. The method for automatically mining oligopeptide drug leads based on machine learning according to claim 3, wherein the preset termination conditions in S45 are: the length of the oligopeptide reaches 10; and the probability of every potential linking amino acid of the current oligopeptide is less than the random probability.

5. The method for automatically mining oligopeptide drug leads based on machine learning according to claim 3, wherein S5 comprises:

grouping the candidate oligopeptides by length and clustering within each group;

scoring the recommendation degree of the oligopeptides in each cluster of each group, and selecting several top-scoring candidate oligopeptides for functional verification.

6. The method of claim 5, wherein in S5, if the functional verification result of one or more of the selected candidate oligopeptides meets a preset requirement, the remaining candidate oligopeptides in the cluster containing the oligopeptide that met the requirement are also subjected to functional verification.

7. The method of claim 3, wherein in S5, the product of the context probabilities of the linking amino acids over all rounds of oligopeptide extension is used as the recommendation score of a candidate oligopeptide, and the candidate oligopeptides are ranked by recommendation score.

8. The method for automatically mining oligopeptide drug leads based on machine learning according to claim 1, wherein the deep neural network architecture of the N-Gram model consists of an input layer, a projection layer, a hidden layer, an output layer and a SoftMax layer.

Technical Field

The invention relates to the technical field of computer-aided drug design, and in particular to an automatic mining method for oligopeptide drug leads based on machine learning.

Background

Polypeptide drugs are highly selective and potent, and they also offer high safety and tolerability. However, traditional polypeptide drug design relies heavily on accurate protein structures and functional annotation, which makes drug development costly and slow. To reduce the cost and duration of drug development, various machine learning and statistical analysis methods have been used to assist it, with good progress.

Across recent work on artificial-intelligence-assisted drug development, almost all common machine learning methods have been used, including deep neural networks, support vector machines, KNN, random forests and GBMs, logistic regression, discriminant analysis and hidden Markov models. In terms of application, this work mainly focuses on fields where data accumulation is mature, such as antimicrobial peptides (AMPs), anticancer peptides (ACPs) and tumor neoantigens.

Based on the features used, these algorithms fall into two categories. The first is deep learning, which achieves high accuracy without manually designed features but is data-hungry and has an opaque decision process. The second is traditional machine learning based on feature engineering, which has lower model capacity than deep learning but can obtain more accurate results from high-quality manual features when data are scarce. Common manual features also fall into two categories. The first characterizes the elemental composition of the primary sequence, for example: the number of amino acid residues at the N- and C-termini or in the whole peptide; the pseudo amino acid composition (PseAAC) method; sequence-order-based approaches; and Evolutionary Feature Construction (EFC) methods, which rely on non-local correlations between motifs. The second is based on the physicochemical properties of the natural amino acids, characterized by the average of the physicochemical indices of all amino acids in the entire polypeptide sequence or at its termini. Taking antimicrobial peptides as an example, 56 primary-sequence-based physicochemical indices are currently in common use, comprising 47 peptide-segment features and 9 global features, including the well-known t-scale, u-polarity and other structural and effective indexes.

However, these methods, which have worked well for polypeptide drug development, are difficult to apply to oligopeptide drugs. On the one hand, far less data is available for oligopeptide drugs than for polypeptide drugs such as ACPs and AMPs. To date there are only 28 FDA-approved oligopeptide drugs and 55 experimental oligopeptides, most of which are modifications or derivatives of the same oligopeptides, and this severely limits the use of supervised learning approaches such as deep learning. On the other hand, because oligopeptide drugs contain so few amino acid residues, the manual features used for polypeptide drug development are hard to identify in oligopeptides and therefore hard to transfer. The lack of prior information and the short length of oligopeptides also make it difficult to design manual features specific to oligopeptide drugs.

Therefore, an automatic machine-learning-based design method for oligopeptide drugs is urgently needed.

Disclosure of Invention

In view of the above, the invention provides an automatic mining method for oligopeptide drug leads based on machine learning. It combines an N-Gram model with a Monte Carlo method to mine potentially druggable functional oligopeptides from a functional protein set that is positively related to the treatment of a disease, and it is generally applicable.

In order to achieve the purpose, the invention adopts the following technical scheme:

An automatic mining method for oligopeptide drug leads based on machine learning, comprising the following steps:

S1: obtaining a functional protein set and extracting the IDRs of the functional protein set;

S2: constructing an N-Gram model based on a deep neural network;

S3: learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides;

S4: using a Monte Carlo method to simulate, according to the amino acid context probability vectors, the extension of an oligopeptide from scratch, thereby obtaining candidate oligopeptides;

S5: scoring and ranking the candidate oligopeptides, and selecting several top-ranked candidate oligopeptides for functional verification.

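Preferably, in the above method for automatically mining oligopeptide drug leads based on machine learning, the N-Gram model is expressed as:

$p(\omega_k \mid \mathrm{context}(\omega_k)) = F(i_{\omega_k}, v(\mathrm{context}(\omega_k)), \theta)$;

where $F$ denotes the deep neural network, $\theta$ denotes the parameters to be learned in $F$, $i_{\omega_k}$ denotes the index of the k-th word $\omega_k$ in the amino acid character set, and $v(\mathrm{context}(\omega_k))$ denotes the word vector of the context of $\omega_k$.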

Preferably, in the above method for automatically mining oligopeptide drug leads based on machine learning, S4 comprises the following steps:

S41: selecting any amino acid as the starting amino acid;

S42: inferring, with the N-Gram model, the context probability vector of the linking amino acid for the oligopeptide to be extended;

S43: generating the linking amino acid by Monte Carlo simulation according to the context probability vector inferred in S42;

S44: joining the linking amino acid to the current oligopeptide to be extended to obtain a new oligopeptide to be extended;

S45: repeating S42 to S44, adding one amino acid per round, until a preset termination condition is met, thereby obtaining a candidate oligopeptide.

Preferably, in the above method for automatically mining oligopeptide drug leads based on machine learning, the preset termination conditions in S45 are: the length of the oligopeptide reaches 10; and the probability of every potential linking amino acid of the current oligopeptide is less than the random probability.

Preferably, in the above method for automatically mining oligopeptide drug leads based on machine learning, S5 comprises:

grouping the candidate oligopeptides by length and clustering within each group;

scoring the recommendation degree of the oligopeptides in each cluster of each group, and selecting several top-scoring candidate oligopeptides for functional verification.

Preferably, in the above method for automatically mining oligopeptide drug leads based on machine learning, in S5, if the functional verification result of one or more of the selected candidate oligopeptides meets a preset requirement, the remaining candidate oligopeptides in the cluster containing the oligopeptide that met the requirement are also subjected to functional verification.

Preferably, in the above method for automatically mining oligopeptide drug leads based on machine learning, in S5, the product of the context probabilities of the linking amino acids over all rounds of oligopeptide extension is used as the recommendation score of a candidate oligopeptide, and the candidate oligopeptides are ranked by recommendation score.

Preferably, in the above method for automatically mining oligopeptide drug leads based on machine learning, the deep neural network architecture of the N-Gram model consists of an input layer, a projection layer, a hidden layer, an output layer and a SoftMax layer.

As the above technical solution shows, compared with the prior art, the invention discloses an automatic mining method for oligopeptide drug leads based on machine learning. Protein IDRs are the structural basis of protein phase separation, and phase separation is strongly associated with the occurrence of disease. Using the IDRs as characteristic regions therefore mitigates the shortage of data to some extent and improves the success rate of developing oligopeptide drugs from small samples.

The invention also takes into account the difficulty of manually designing oligopeptide descriptors and adopts deep learning to avoid manual feature design. Because oligopeptides have no long-range semantic patterns and the functional protein set (that is, the model training set) is usually small, the invention uses the most basic natural language processing model, the N-Gram, to mine semantic patterns in the IDRs and thereby learn the amino acid distribution pattern of potentially druggable oligopeptides. The N-Gram model is essentially a conditional probability model. Its function is similar to that of the commonly used naive Bayes model, but the conditional probability between words is computed by a deep neural network, so it has greater model capacity than traditional machine learning models and needs no manual features. The model is simple in principle, does not depend on large amounts of training data, and exposes the decision probability of every step, which makes it suitable for oligopeptide drug development. In addition, the method uses the Monte Carlo method to simulate the extension of an oligopeptide from scratch, so that de novo design of the oligopeptide drug is closer to the natural process.
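The statement that the N-Gram model is "essentially a conditional probability model" can be made concrete with the standard N-Gram factorization. The following is a sketch of that textbook assumption, not a formula given in the source: the window size $n$ is not specified there, and the assumption that $\mathrm{context}(\omega_k)$ consists of the preceding $n-1$ residues is made here only because the oligopeptide is generated one residue at a time.

```latex
p(\omega_1, \dots, \omega_T)
  = \prod_{k=1}^{T} p(\omega_k \mid \omega_1, \dots, \omega_{k-1})
  \approx \prod_{k=1}^{T} p(\omega_k \mid \mathrm{context}(\omega_k)),
\qquad
\mathrm{context}(\omega_k) = (\omega_{k-n+1}, \dots, \omega_{k-1})
```

Under this reading, each factor on the right is one entry of a context probability vector, which is why the per-step decision probability is available during generation.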

Overall, the invention fills a gap in the field by applying machine learning to the fully automatic mining of oligopeptide drug leads. It is also highly general: for any application scenario (indication), one only needs to provide a functional protein set that is positively related to the treatment of the disease, and potentially druggable functional oligopeptides can be mined from it.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of the method for automatically mining oligopeptide drug leads based on machine learning according to the present invention;

FIG. 2 is a general flow chart of mining therapeutic oligopeptides from a functional protein set according to the present invention;

FIG. 3 is a flow chart of obtaining candidate oligopeptides by combining the N-Gram model and the Monte Carlo method according to the present invention;

FIGS. 4(A) - (E) are schematic diagrams of the candidate oligopeptides and experimental verification results provided by the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1, an embodiment of the invention discloses an automatic mining method for oligopeptide drug leads based on machine learning, comprising the following steps:

S1: obtaining a functional protein set and extracting the IDRs of the functional protein set;

S2: constructing an N-Gram model based on a deep neural network;

S3: learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides;

S4: using a Monte Carlo method to simulate, according to the amino acid context probability vectors, the extension of an oligopeptide from scratch, thereby obtaining candidate oligopeptides;

S5: scoring and ranking the candidate oligopeptides, and selecting several top-ranked candidate oligopeptides for functional verification.

The above steps are further described below.

S1, obtaining the functional protein set, and extracting the IDRs of the functional protein set.

Proteins contain hot-spot regions called intrinsically disordered regions (IDRs). These regions usually interact with domains of other proteins through peptide motifs (conserved linear peptide fragments less than 10 residues long) located within the region, thereby causing allostery. According to existing studies, phase separation caused by protein allostery is strongly associated with the occurrence of disease, so protein IDRs are a central target of interest in drug development.

As shown in FIG. 2, the invention extracts the IDRs from the functional protein set as characteristic regions, which mitigates the shortage of data to some extent and improves the success rate of developing oligopeptide drugs from small samples.
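As an illustration of S1, the sketch below turns per-residue disorder scores (such as those produced by a disorder predictor) into IDR subsequences. The function name `extract_idrs`, the 0.5 score threshold and the minimum region length of 5 are assumptions made for this example; the source does not state which cutoff or minimum length was used.

```python
from typing import List


def extract_idrs(sequence: str, scores: List[float],
                 threshold: float = 0.5, min_len: int = 5) -> List[str]:
    """Return maximal runs of residues whose disorder score exceeds `threshold`.

    `threshold` and `min_len` are illustrative assumptions, not values
    taken from the source.
    """
    if len(sequence) != len(scores):
        raise ValueError("sequence and scores must have the same length")
    regions: List[str] = []
    start = None  # index where the current disordered run began, if any
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at position i - 1; keep it if long enough.
            if i - start >= min_len:
                regions.append(sequence[start:i])
            start = None
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(sequence) - start >= min_len:
        regions.append(sequence[start:])
    return regions
```

The returned strings would then serve as the training text for the N-Gram model.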

S2: constructing an N-Gram model based on a deep neural network.

In S2, an N-Gram model based on a deep neural network is constructed. As an unsupervised deep learning model, it learns semantic patterns from the IDRs of the functional proteins. The N-Gram model is expressed as:

$p(\omega_k \mid \mathrm{context}(\omega_k)) = F(i_{\omega_k}, v(\mathrm{context}(\omega_k)), \theta)$

where $F$ denotes the deep neural network, $\theta$ denotes the parameters to be learned in $F$, $i_{\omega_k}$ denotes the index of the k-th word $\omega_k$ in the character set of amino acid residues, and $v(\mathrm{context}(\omega_k))$ denotes the word vector of $\mathrm{context}(\omega_k)$, the context of the character $\omega_k$.

Specifically, the deep neural network architecture of the N-Gram model consists of an input layer, a projection layer, a hidden layer, an output layer and a SoftMax layer:

1) Input layer: each residue is mapped to a word vector of length m. The word vectors are initialized randomly before training and updated iteratively during training.

2) Projection layer: the word vectors are mapped into a higher-dimensional space to increase the representational power of the model.

3) Hidden layer: the tanh function is used as the activation to extract deep features.

4) Output layer: the output of the hidden layer is mapped to a low-dimensional feature vector whose dimensionality equals the number of possible outcomes.

5) SoftMax layer: the output layer result is normalized to obtain the probability of each outcome.
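The five layers can be sketched as a single forward pass. This is a minimal, untrained illustration in plain Python, not the authors' implementation: the class name `TinyNGramNet`, the context size of 2, all layer widths, the absence of bias terms and the random weight range are assumptions chosen only to keep the example small.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


class TinyNGramNet:
    """Input -> projection -> tanh hidden -> output -> SoftMax (all sizes assumed)."""

    def __init__(self, n_context=2, m=8, proj=32, hidden=16, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.n_context = n_context
        self.embed = mat(len(AMINO_ACIDS), m)        # input layer: word vector per residue
        self.w_proj = mat(proj, n_context * m)       # projection to a wider space
        self.w_hid = mat(hidden, proj)               # hidden layer weights
        self.w_out = mat(len(AMINO_ACIDS), hidden)   # one logit per possible residue

    def forward(self, context):
        """Return a probability vector over AMINO_ACIDS for the next residue."""
        if len(context) != self.n_context:
            raise ValueError("context must contain exactly n_context residues")
        # Concatenate the word vectors of the context residues.
        x = [v for aa in context for v in self.embed[AMINO_ACIDS.index(aa)]]
        p = matvec(self.w_proj, x)
        h = [math.tanh(v) for v in matvec(self.w_hid, p)]
        return softmax(matvec(self.w_out, h))
```

With random weights the output is close to uniform; training would adjust the word vectors and weight matrices so that the output matches the residue statistics of the IDRs.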

S3: learning the semantic distribution pattern of the IDRs with the N-Gram model to obtain context probability vectors for the amino acids of potentially druggable oligopeptides.

The invention uses the N-Gram model to learn the semantic distribution pattern (the context probability vectors) of the IDRs of the functional protein set. The semantic distribution pattern is the relative positional relationship between characters in a text or sentence. It is expressed concretely as a context probability vector, which gives the probability of each character appearing in a given context.
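To show what a context probability vector looks like, the sketch below estimates one by simple counting. Note that this swaps in a count-based N-Gram estimate in place of the neural network the source describes; it is meant only to illustrate the data structure. The function name `context_probabilities` and the default window `n=3` are assumptions, and the input is assumed to contain only the 20 standard residues.

```python
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def context_probabilities(idrs, n=3):
    """Map each (n-1)-residue context to a 20-entry probability vector.

    Count-based stand-in for the neural model; assumes sequences contain
    only standard residues.
    """
    counts = defaultdict(Counter)
    for seq in idrs:
        for k in range(n - 1, len(seq)):
            context = seq[k - n + 1:k]   # the n-1 residues before position k
            counts[context][seq[k]] += 1
    table = {}
    for context, counter in counts.items():
        total = sum(counter.values())
        table[context] = [counter[aa] / total for aa in AMINO_ACIDS]
    return table
```

For example, `context_probabilities(["SPSPSG"], n=2)["S"]` assigns 2/3 to P and 1/3 to G, because S is followed by P twice and by G once in that sequence.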

S4: using a Monte Carlo method to simulate, according to the amino acid context probability vectors, the extension of an oligopeptide from scratch, thereby obtaining candidate oligopeptides.

After the N-Gram model has produced the context probability vectors for the amino acids of potentially druggable oligopeptides, the invention introduces Monte Carlo simulation to mimic the natural elongation of an oligopeptide. The Monte Carlo method uses the probability vector output by the SoftMax layer as the probability distribution of the simulator (playing a role similar to a random seed) to simulate the extension of the oligopeptide from scratch.

Overall, starting from a single amino acid residue (character 1), the N-Gram model first computes the context probability vector for that character, and Monte Carlo simulation then generates the next candidate character (character 2). Character 1 and character 2 are concatenated to form the input of the next round (that is, character 1 of the next round). This process repeats until the final output length is reached; because of the definition of an oligopeptide, iteration terminates when the length reaches 10.

Specifically, as shown in FIG. 3, candidate oligopeptides are generated by combining the N-Gram model and the Monte Carlo method as follows:

S41: selecting any amino acid as the starting amino acid; in this example, the 10 amino acids with the highest frequency in the functional protein IDRs were selected as starting amino acids;

S42: inferring, with the N-Gram model, the context probability vector of the linking amino acid for the oligopeptide to be extended;

S43: generating the linking amino acid by Monte Carlo simulation according to the context probability vector inferred in S42;

S44: joining the linking amino acid to the current oligopeptide to be extended to obtain a new oligopeptide to be extended;

S45: repeating S42 to S44, adding one amino acid per round, until a preset termination condition is met, thereby obtaining a candidate oligopeptide.

There are two preset termination conditions: first, the length of the extended oligopeptide reaches 10; second, the probability of every amino acid that could be linked to the current oligopeptide is less than the random probability, namely 1/20.
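Steps S41 to S45 and the two termination conditions can be sketched as a sampling loop. The function `extend_oligopeptide` and the `predict` callable (a stand-in for the trained N-Gram model, taking the current peptide and returning 20 probabilities) are hypothetical names introduced for this example. One caveat: if `predict` returns a vector that sums to 1 over exactly 20 residues, its largest entry is always at least 1/20, so the second condition can only fire when the model's output carries probability mass on additional outcomes; whether the original model does so is not stated in the source.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"       # the 20 standard residues
RANDOM_PROB = 1.0 / len(AMINO_ACIDS)       # the "random probability", 1/20


def extend_oligopeptide(start, predict, max_len=10, rng=None):
    """Grow a peptide one residue at a time by Monte Carlo sampling.

    `predict(peptide)` is a hypothetical stand-in for the trained model and
    must return one probability per entry of AMINO_ACIDS.
    Returns the peptide and the probability of each sampled residue.
    """
    rng = rng or random.Random()
    peptide, step_probs = start, []
    while len(peptide) < max_len:            # condition 1: length reaches 10
        probs = predict(peptide)             # S42: context probability vector
        if max(probs) < RANDOM_PROB:         # condition 2: nothing beats 1/20
            break
        # S43: sample the linking residue in proportion to its probability.
        idx = rng.choices(range(len(AMINO_ACIDS)), weights=probs, k=1)[0]
        peptide += AMINO_ACIDS[idx]          # S44: join it to the peptide
        step_probs.append(probs[idx])
    return peptide, step_probs
```

Returning the per-step probabilities alongside the peptide keeps the information needed later for the recommendation score.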

S5: scoring and ranking the candidate oligopeptides, and selecting several top-ranked candidate oligopeptides for functional verification.

After the candidate oligopeptides are obtained, the invention groups them by length, clusters within each group, and then scores the recommendation degree of the oligopeptides in each cluster of each group. The recommendation score is the product of the context probabilities of the linking amino acids over all rounds of extension. Finally, functional verification is performed on several top-scoring candidate oligopeptides, and the remaining candidates in any cluster that contains an oligopeptide with a satisfactory verification result are verified as well.
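The scoring rule can be sketched as follows. The names `recommendation_score` and `rank_by_length` are hypothetical, and each candidate is assumed to arrive as a pair of the peptide and the probabilities of its sampled residues. The source also clusters within each length group but does not name the clustering method, so this sketch ranks within whole length groups and omits that step.

```python
import math
from collections import defaultdict


def recommendation_score(step_probs):
    """Product of the context probabilities of the residues added in each round."""
    return math.prod(step_probs)


def rank_by_length(candidates, top_k=3):
    """Group (peptide, step_probs) pairs by peptide length and keep the top_k.

    Within-group clustering is omitted because the source does not
    specify the clustering method.
    """
    groups = defaultdict(list)
    for peptide, step_probs in candidates:
        groups[len(peptide)].append((recommendation_score(step_probs), peptide))
    return {
        length: [peptide for _, peptide in sorted(items, reverse=True)[:top_k]]
        for length, items in groups.items()
    }
```

Because a product of probabilities shrinks with every added residue, scores are only comparable between peptides of the same length, which is consistent with grouping by length before ranking.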

The method of the present invention is verified below with a specific example.

In practice the invention is carried out in 3 parts. First, the UniProt website (https://www.uniprot.org/) is searched for a functional protein set that is positively related to the treatment of a given disease. Then IUPred2A (https://iupred2a.elte.hu/) is used to extract the IDRs of these functional proteins. Finally, the IDRs are input into the deep learning model, which implements the N-Gram model and the Monte Carlo method, to obtain the required candidate oligopeptides. This example illustrates the discovery of oligopeptides for the treatment of osteoporosis (bone formation promotion):

1. UniProt is searched with the 4 keywords "ossification", "osteopenis", "osteoplast reduction" and "osteoplast differentiation", yielding 171 related functional protein sequences.

2. The IDRs of these functional proteins are predicted with IUPred2A.

3. The sequences of the protein IDRs are input into the deep learning model to obtain candidate oligopeptides.

4. The candidate oligopeptides are grouped by length and clustered within each group. From each resulting cluster, the 3 highest-scoring oligopeptides are selected for functional verification. If any of a cluster's top 3 oligopeptides shows a good experimental effect, the remaining oligopeptides in that cluster are also verified in cell experiments.
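The two-stage verification policy in step 4 can be sketched as below. The function `verification_plan` is a hypothetical name; `clusters` is assumed to map a cluster identifier to peptides already sorted by descending recommendation score, and `passed` is a stand-in for the outcome of a wet-lab experiment, which in practice would not be a function call.

```python
def verification_plan(clusters, passed, top_k=3):
    """Return the peptides to test, in order.

    Each cluster's top_k peptides are always tested; the rest of a cluster
    is added only if at least one of its top_k peptides passed.
    `passed` is a hypothetical callable standing in for experimental results.
    """
    plan = []
    for peptides in clusters.values():
        first_round = peptides[:top_k]
        plan.extend(first_round)
        if any(passed(p) for p in first_round):
            plan.extend(peptides[top_k:])    # expand verification to the whole cluster
    return plan
```

This keeps the number of experiments small while still exploring clusters that have already produced a hit.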

Several oligopeptides with a good bone-formation-promoting effect were finally obtained. FIG. 4(A) shows the Alizarin Red S (ARS) osteogenic staining values from cell experiments on 28 oligopeptides generated by the algorithm; the deeper the color, the stronger the osteogenic function. An animal experiment was then performed on the oligopeptide that performed best in the cell experiments (named AIB5P). FIG. 4(B) shows high-magnification images of femurs double-labeled with calcein and xylenol orange; the scale bar is 100 μm. FIG. 4(C) shows von Kossa staining of the femur; the scale bar is 200 μm. FIG. 4(D) shows immunohistochemical staining for DMP1. FIG. 4(E) shows representative micro-CT images of a mouse femur: the upper part is a longitudinal section along the medial axis, with a scale bar of 1 mm, and the lower part is the trabecular bone under the growth plate, with a scale bar of 500 μm. These results show that the oligopeptide has a good bone-formation-promoting effect.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
