Parallel drug-target correlation prediction method based on ranking learning

Document No.: 1289183 Publication date: 2020-08-28

Reading note: This technology, a parallel drug-target correlation prediction method based on ranking learning (一种基于排序学习的并行式药物-靶标相关性预测方法), was designed by Zou Quan (邹权) and Ru Xiaoqing (茹晓青) on 2020-05-22. The invention discloses a parallel drug-target correlation prediction method based on ranking learning and belongs to the field of bioinformatics. The method uses several feature extraction methods to extract multiple types of similarity features, correlation features, chemical space features, and gene space features. Because multi-angle feature extraction yields a high-dimensional feature set, and the samples carry no conventional positive and negative class labels, principal component analysis is used for dimensionality reduction. The reduced feature set is then fed to a ranking learning algorithm, which predicts the degree of correlation between the drugs and targets involved in each query. Ranking learning no longer simply divides drug-target relationships into related and unrelated; it ranks the pairs by their degree of correlation, which benefits both new drug development and drug repositioning.

1. A parallel drug-target correlation prediction method based on ranking learning, characterized by comprising the following steps:

S1, acquiring a chemical structure sample set of drugs and a sequence sample set of targets;

S2, extracting drug feature information, target feature information, and drug-target association feature information based on the chemical structure sample set of the drugs and the sequence sample set of the targets;

S3, combining the drug feature information, the target feature information, and the drug-target association feature information, and then performing dimensionality reduction;

S4, taking the feature set obtained by the dimensionality reduction as input, querying the proteins or ligands related to a drug or target, computing the degree of correlation with a ranking learning method, and ranking the results by the magnitude of the output values;

S5, comparing the obtained ranking with the true correlation ranking; voting on the obtained ranking results; and using different types of test sets to test generalization ability.

2. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein the targets in S1 are G protein-coupled receptors and the drugs are drugs related or unrelated to them.

3. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein in S2 the drug feature information is represented by 2D fingerprints and drug descriptors, the target feature information is represented by the physicochemical properties, frequency distribution, and evolutionary information of amino acids, and the drug-target association feature information is calculated by k-NN, BLM-svr, and NetLapRLS.

4. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein in S2, for drug repositioning, feature information is extracted by constructing heterogeneous networks, including drug-drug, drug-disease, drug-side-effect, and drug-similarity association networks.

5. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein in S3 principal component analysis is used for the dimensionality reduction.

6. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein in S4 the input file is converted into a standard format:

wherein $q_i$ represents a query, $F_j$ represents all the features of sample $j$, and the remaining term indicates the degree of correlation.

7. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein the true correlation in S5 is represented by the affinity value between the drug and the target.

8. The parallel drug-target correlation prediction method based on ranking learning of claim 7, wherein the affinity value is IC50, and the IC50 is converted to its negative logarithm, $-\log(\mathrm{IC50})$, to indicate the degree of drug-target correlation intuitively.

9. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein in S5 the performance of the ranking learning algorithm is measured by the NDCG value, and the NDCG value is calculated as follows:

wherein K indicates that only the query results at the first K positions of the output are evaluated, $r_i$ is the predicted correlation of the drug-protein pair at the $i$-th position, and $R$ is the true correlation of the drug-protein pair at the $i$-th position.

10. The parallel drug-target correlation prediction method based on ranking learning of claim 1, wherein in S5 performance in new drug development and in drug repositioning is tested separately by adjusting the samples in the different types of test sets.

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a parallel drug-target correlation prediction method based on ranking learning.

Background

Many methods and techniques exist for predicting drug-protein correlations. Traditional prediction methods fall into two types, ligand-based and target-based. Ligand-based methods require information on the known ligands of the target protein, from which pharmacophore models are defined to describe the common features of the bound ligands; such methods are therefore unsuitable when little is known about the ligands. Target-based methods require the three-dimensional structure of the target in advance, but for some proteins the three-dimensional structure is unknown and difficult to obtain.

Although traditional prediction methods can achieve high accuracy, they consume a great deal of time and money. Many researchers have therefore introduced machine learning into this research, and machine-learning prediction methods are classified into two types, feature-based and similarity-based. Machine learning does bring great gains in speed, but both approaches have shortcomings. Similarity-based methods rely only on one-sided (drug or target) similarity, and when few ligands (or targets) are known to act on a given target (or ligand), relevance can only be judged from the similarity between the query compound and those few samples, which is clearly insufficient. Feature-based methods, depending on the algorithms used, may not represent drug information and protein sequence information well in numerical form.

In addition, when predicting drug-protein correlations with machine learning, many researchers only predict whether a drug is related to a protein, that is, they treat the problem as binary classification. They do not further investigate the degree of correlation, that is, which proteins (drugs) are most strongly related to a given drug (protein).

Disclosure of Invention

The invention aims to provide a parallel drug-target correlation prediction method based on ranking learning that addresses the above defects in the prior art.

The technical scheme adopted by the invention is as follows:

A parallel drug-target correlation prediction method based on ranking learning comprises the following steps:

S1, acquiring a chemical structure sample set of drugs and a sequence sample set of targets;

S2, extracting drug feature information, target feature information, and drug-target association feature information based on the chemical structure sample set of the drugs and the sequence sample set of the targets;

S3, combining the drug feature information, the target feature information, and the drug-target association feature information, and then performing dimensionality reduction;

S4, taking the feature set obtained by the dimensionality reduction as input, querying the proteins or ligands related to a drug or target, computing the degree of correlation with a ranking learning method, and ranking the results by the magnitude of the output values;

S5, comparing the obtained ranking with the true correlation ranking; voting on the obtained ranking results; and using different types of test sets to test generalization ability.

The invention ranks drug-protein pairs by their degree of correlation using a ranking learning algorithm, rather than simply classifying drug-protein relationships as related or unrelated. The ranking learning algorithm can also integrate the various types of information obtained by different techniques, so that the strengths of one feature algorithm offset the weaknesses of another and performance improves.

Further, the targets in S1 are G protein-coupled receptors, and the drugs are drugs related or unrelated to them.

Further, in S2, the drug feature information is represented by 2D fingerprints and drug descriptors; the target feature information is represented by the physicochemical properties, frequency distribution, and evolutionary information of amino acids; and the drug-target association feature information is calculated by k-NN, BLM-svr, and NetLapRLS. Each algorithm has its own strengths and weaknesses. Extracting chemical space features, gene space features, similarity features, and correlation features in parallel with several feature extraction algorithms, from the three angles of drug, target, and drug-target pair, lets the algorithms compensate for one another and further improves performance.
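As a minimal sketch of one of the target features named above, the frequency distribution of amino acids, the following Python computes the relative frequency of each standard residue and concatenates it with drug and pair features. The function names, the 4-bit toy fingerprint, and the single pair score are hypothetical placeholders; real 2D fingerprints, descriptors, and k-NN, BLM-svr, or NetLapRLS scores would come from dedicated tools that are not shown here.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is aligned.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the relative frequency of each standard amino acid."""
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

def combine_features(drug_features, target_features, pair_features):
    """Concatenate the three feature groups into one sample vector."""
    return list(drug_features) + list(target_features) + list(pair_features)

# Hypothetical toy inputs: a short peptide, a 4-bit fingerprint, one pair score.
target_vec = amino_acid_composition("MKTAYIAKQR")
sample = combine_features([1, 0, 1, 1], target_vec, [0.42])
print(len(sample))
```

The concatenated vector is the kind of high-dimensional input that the dimensionality reduction in S3 is meant to compress.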

Further, in S2, for drug repositioning, feature information is extracted by constructing heterogeneous networks, including drug-drug, drug-disease, drug-side-effect, and drug-similarity association networks.

Further, in S3, principal component analysis (PCA) is used for the dimensionality reduction. PCA combines high-dimensional variables that may be correlated into low-dimensional variables that are linearly uncorrelated, removes redundant feature information, and shortens the experimental cycle. In addition, PCA is suitable for feature sets without explicit positive and negative class labels.
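To illustrate how PCA turns correlated variables into fewer uncorrelated ones without using any class labels, here is a small standard-library sketch. It finds the leading eigenvectors of the sample covariance matrix by power iteration with deflation; this is one simple way to compute PCA and is an assumption of this example, not the implementation used in the invention. The toy data are hypothetical.

```python
import math
import random

def pca(rows, n_components, n_iter=500, seed=0):
    """Project rows (at least two samples) onto their top principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            # Multiply by the covariance matrix and renormalize.
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
        components.append(v)
        # Deflate: remove the found direction so the next pass finds the next one.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    projected = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
                 for c in centered]
    return projected, components

# Hypothetical data: the second feature is exactly twice the first, the third is constant.
rows = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
projected, components = pca(rows, n_components=1)
print(projected)
```

In this toy case three redundant features collapse to a single component that keeps all of the variance. In practice a tested numerical library would be preferable to this hand-written loop.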

Further, in S4, the input file is converted into a standard format:

wherein $q_i$ represents a query, $F_j$ represents all the features of sample $j$, and the remaining term indicates the degree of correlation.
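The format expression itself is not reproduced in this text. One widely used layout for ranking learning input is the SVMlight/LETOR-style line `<relevance> qid:<query> 1:<f1> 2:<f2> ...`, which carries the same three ingredients: a query, the features of a sample, and a degree of correlation. The sketch below assumes that layout; it is not confirmed to be the exact format used in the invention, and the function name is hypothetical.

```python
def to_letor_line(relevance, query_id, features):
    """Format one sample as '<relevance> qid:<query_id> 1:<f1> 2:<f2> ...'."""
    feats = " ".join(f"{i}:{v:g}" for i, v in enumerate(features, start=1))
    return f"{relevance:g} qid:{query_id} {feats}"

# Hypothetical sample: query 1, two reduced features, relevance value 7.3.
print(to_letor_line(7.3, 1, [0.5, 0.25]))
```

All samples that share a `qid` are ranked against one another, which is how a single file can hold many drug or target queries at once.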

Further, in S4, what matters for the output correlation values is their relative magnitude rather than their exact values.
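A short sketch of why only the relative magnitude matters: any order-preserving rescaling of the scores leaves the ranking unchanged. The target names and score values are hypothetical.

```python
def rank_by_score(scores):
    """Return item names ordered from highest to lowest predicted score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical predicted scores for three targets under one drug query.
scores = {"target_A": 2.1, "target_B": 7.4, "target_C": 5.0}
# An increasing transformation changes the values but not their order.
rescaled = {name: 10 * s + 3 for name, s in scores.items()}

print(rank_by_score(scores))
print(rank_by_score(scores) == rank_by_score(rescaled))
```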

Further, the true correlation in S5 is represented by the affinity value between the drug and the target.

Further, the affinity value is IC50, and the IC50 is converted to its negative logarithm, $-\log(\mathrm{IC50})$, to indicate the degree of drug-target correlation intuitively. The IC50 value (half-maximal inhibitory concentration) can be used as a measure of a drug's potency, for example its ability to induce apoptosis: the stronger the effect, the lower the value. However, the IC50 values of different drug-protein pairs differ greatly in magnitude, so to observe the correlation between drug-protein pairs more intuitively, the affinity values need to be processed, generally as $-\log(\mathrm{IC50})$. Taking the logarithm removes the influence of the large numerical differences on subsequent experiments, and taking its negative matches the usual statistical intuition that a larger final value means a stronger correlation.
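The transformation can be sketched in a few lines. The base of the logarithm and the concentration unit are not stated in the text; base 10 and molar units (the usual pIC50 convention) are assumed here, and the function name is hypothetical.

```python
import math

def negative_log_ic50(ic50_molar):
    """Return -log10(IC50), assuming IC50 is given in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)

# Hypothetical values spanning several orders of magnitude.
for ic50 in (1e-9, 5e-7, 1e-4):
    print(ic50, round(negative_log_ic50(ic50), 2))
```

Values that span five orders of magnitude end up within a few units of one another, and the most potent pair receives the largest value.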

Further, in S5 the performance of the ranking learning algorithm is measured by the NDCG value; the larger the NDCG value, the better the performance, which allows the effectiveness of the method to be verified. The NDCG value for the drug-protein pairs at the first K positions is calculated as follows:

wherein K indicates that only the query results at the first K positions of the output are evaluated, $r_i$ is the predicted correlation of the drug-protein pair at the $i$-th position, and $R$ is the true correlation of the drug-protein pair at the $i$-th position.
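For orientation, a commonly used definition of NDCG at K divides the discounted cumulative gain of the predicted ordering by that of the ideal ordering. The sketch below uses the exponential gain $2^{rel} - 1$ with a $\log_2(i + 1)$ discount; this is the common textbook form and an assumption here, and it may differ in detail from the expression used in the invention.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(predicted_scores, true_relevances, k):
    """NDCG@k: order items by predicted score, then compare with the ideal order."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked_true = [true_relevances[i] for i in order]
    ideal = sorted(true_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_true, k) / idcg if idcg > 0 else 0.0

# Hypothetical true relevance values for three drug-protein pairs in one query.
true_rel = [3, 2, 1]
print(ndcg_at_k([0.9, 0.5, 0.1], true_rel, k=3))  # predicted order matches the true order
print(ndcg_at_k([0.1, 0.5, 0.9], true_rel, k=3))  # predicted order is reversed
```

A prediction that reproduces the true order scores 1, and any misordering within the first K positions lowers the value.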

Further, the ranking results are put to a vote to observe which drug-protein pairs always appear in the first K positions, which can further improve the model's ability to predict correlations for drug-protein pairs whose correlation is unknown.
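The voting step can be sketched as counting how often each pair lands in the first K positions across several ranked lists. Where the several lists come from (for example repeated runs or different ranking models) is not specified in the text, so the input below is a hypothetical illustration, and each list is assumed to contain no duplicates.

```python
from collections import Counter

def top_k_votes(rankings, k):
    """Count, for each item, how many rankings place it in the first k positions."""
    votes = Counter()
    for ranking in rankings:
        votes.update(ranking[:k])
    return votes

def always_in_top_k(rankings, k):
    """Return the items that appear in the first k positions of every ranking."""
    votes = top_k_votes(rankings, k)
    return sorted(item for item, n in votes.items() if n == len(rankings))

# Hypothetical ranked lists of drug-protein pairs from three runs.
rankings = [
    ["d1-t1", "d2-t1", "d3-t1"],
    ["d2-t1", "d1-t1", "d4-t1"],
    ["d1-t1", "d3-t1", "d2-t1"],
]
print(top_k_votes(rankings, k=2))
print(always_in_top_k(rankings, k=2))
```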

Further, in S5, performance in new drug development and in drug repositioning is tested separately by adjusting the samples in the different types of test sets. When the proteins in the test set never appear in the training set, the method can be used to verify new uses for existing drugs; when the drugs in the test set never appear in the training set, the method can be used to verify which proteins these drugs act on.
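The two kinds of test set described here can be built by holding out all pairs that involve a chosen set of proteins or a chosen set of drugs. The sketch below assumes each sample is a `(drug, target, affinity)` tuple; the identifiers, values, and function name are hypothetical.

```python
def hold_out(pairs, held_out_ids, field):
    """Split pairs so that the held-out drugs or targets occur only in the test set."""
    index = {"drug": 0, "target": 1}[field]
    train = [p for p in pairs if p[index] not in held_out_ids]
    test = [p for p in pairs if p[index] in held_out_ids]
    return train, test

# Hypothetical (drug, target, affinity) samples.
pairs = [("d1", "t1", 6.2), ("d1", "t2", 7.9), ("d2", "t1", 5.1), ("d3", "t2", 8.3)]

# Proteins unseen in training: the setting described for new uses of existing drugs.
train_p, test_p = hold_out(pairs, {"t2"}, "target")
# Drugs unseen in training: the setting described for finding a new drug's proteins.
train_d, test_d = hold_out(pairs, {"d3"}, "drug")
print(test_p)
print(test_d)
```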

In summary, by adopting the above technical scheme, the invention has the following beneficial effects:

1. the method is based on ranking learning, that is, a ranking learning algorithm ranks drugs and proteins by their degree of correlation, which benefits both new drug development and drug repositioning;

2. the invention is a parallel prediction method, and the parallelism takes two forms: first, the ranking learning is parallel, that is, the correlations between multiple drugs (targets) and proteins (ligands) can be predicted at the same time; second, the experimental steps and the program are parallel, that is, the feature information based on the drug, the protein, and so on can be extracted in parallel;

3. the method extracts multiple types of similarity features, correlation features, chemical space features, and gene space features with several feature extraction methods, then reduces the dimensionality of the data to remove redundancy so that computation is more efficient and lightweight, and then feeds the reduced feature set to a ranking learning algorithm, which predicts the degree of correlation between the drugs and targets involved in each query;

4. by using the ranking learning algorithm, the invention can integrate the various types of information obtained by different techniques, so that the strengths of one feature algorithm offset the weaknesses of another and performance improves further;

5. by adjusting the samples in the test set, the invention tests performance in new drug development and in drug repositioning separately, realizing multi-angle application of a single technique.
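The parallel feature extraction mentioned in item 2 can be sketched with Python's standard `concurrent.futures` module. The two extractor functions below are hypothetical stand-ins that return trivial counts; they only illustrate that the drug-side and target-side extractions are independent and can be submitted at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

def drug_features(smiles):
    """Hypothetical placeholder: string length and number of lowercase 'c' characters."""
    return [len(smiles), smiles.count("c")]

def target_features(sequence):
    """Hypothetical placeholder: sequence length and number of lysine (K) residues."""
    return [len(sequence), sequence.count("K")]

def extract_in_parallel(smiles, sequence):
    """Run both extractors concurrently and concatenate their outputs."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        drug_future = pool.submit(drug_features, smiles)
        target_future = pool.submit(target_features, sequence)
        return drug_future.result() + target_future.result()

print(extract_in_parallel("c1ccccc1", "MKTAYIAKQR"))
```

For CPU-heavy extractors in Python, `ProcessPoolExecutor` from the same module is usually the better fit, because threads share one interpreter lock.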

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

FIG. 1 is a flow chart of the parallel drug-target correlation prediction based on the ranking learning algorithm described in example 1;

FIG. 2 is a schematic diagram of the data file types supported by the parallel drug-target correlation prediction based on the ranking learning algorithm described in example 1;

FIG. 3 is a schematic diagram of the data before and after PCA dimensionality reduction in example 1;

FIG. 4 is a schematic diagram of a ranking learning algorithm;

FIG. 5 shows the correlation prediction result for each drug-protein pair calculated by the ranking learning algorithm in example 1;

FIG. 6 shows part of the contents of the output file calculated by the ranking learning algorithm in example 1.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The features and properties of the present invention are described in further detail below with reference to examples.
