Virus-host correlation prediction method based on network fusion and graph embedding

Document No.: 191713. Publication date: 2021-11-02.

Reading note: This technology, "Virus-host correlation prediction method based on network fusion and graph embedding", was designed by 朱强, 代庆辉, 李丽 and 胡新荣 on 2021-07-06. Main content: The invention discloses a virus-host association prediction method based on network fusion and graph embedding. Two kinds of virus-virus and host-host similarity networks are constructed, one with a similarity network fusion method and one with a graph embedding method. A graph mining approach is proposed that extracts meta-path scores from a graph; with it, a feature vector for each virus-host pair is obtained from the two networks, and a machine learning method produces the final result. The invention achieves high accuracy on existing data sets and performs more stably than other methods. Some of the virus-host associations it predicts have been verified in published papers and databases, and it also predicts entirely new virus-host associations that are absent from the known literature and databases; these can provide effective guidance for experimental verification.

1. A virus-host association prediction method based on network fusion and graph embedding, characterized by comprising the following steps:

Step 1, acquiring known virus-host associations;

Step 2, measuring the similarity of each virus-virus pair and each host-host pair, and constructing multiple virus-virus similarity networks and multiple host-host similarity networks;

Step 3, integrating the virus-virus similarity networks and the host-host similarity networks obtained in step 2 using the similarity network fusion algorithm, to obtain a fused virus similarity matrix and a fused host similarity matrix;

Step 4, constructing a heterogeneous network $G_1$ from the fused virus similarity matrix and the fused host similarity matrix obtained in step 3 and the known virus-host associations obtained in step 1;

Step 5, applying a graph mining technique to the training portion of the virus-host associations obtained in step 1 to generate a feature representation of each node, the nodes comprising virus nodes and host nodes;

Step 6, calculating the cosine similarity between the feature vector of each virus obtained in step 5 and those of the other viruses, and between the feature vector of each host and those of the other hosts, to construct a virus cosine similarity matrix and a host cosine similarity matrix;

Step 7, constructing a heterogeneous network $G_2$ from the virus cosine similarity matrix and the host cosine similarity matrix obtained in step 6 and the known virus-host associations obtained in step 1;

Step 8, for $G_1$ obtained in step 4 and $G_2$ obtained in step 7, extracting the corresponding meta-path scores from graph $G_1$ and from graph $G_2$ according to the path structures and their features;

Step 9, performing feature selection to eliminate weak features, and then generating the feature vector X and the label Y of all virus-host pairs;

Step 10, inputting the feature vector X and the label Y obtained in step 9 into a supervised machine learning prediction model.

2. The method of claim 1, wherein the similarity in step 2 is measured using oligonucleotide frequencies or Gaussian interaction profiles, and the similarity measurement using oligonucleotide frequencies is implemented as follows:

Two count-based measures, defined by formulas (1) and (2), together with the JS, Hao and Teeling measures, are used to calculate the distance between the genomic oligonucleotide frequency vectors of each virus-virus pair and each host-host pair, and thereby to measure the similarity of each virus-virus pair and each host-host pair;

The first count-based measure is defined by formula (1):

The second count-based measure is defined by formula (2):

Suppose two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are composed of letters from a finite alphabet $\Lambda$ of size $d$. For $a \in \Lambda$, let $p_a$ denote the probability of letter $a$. For a word $w = (w_1, \ldots, w_k) \in \Lambda^k$, let $X_w$ count the occurrences of $w$ in $A$ and, similarly, let $Y_w$ count the occurrences of $w$ in $B$; both counts are centered in the same way. If $X$ and $Y$ are independent mean-zero normal variables, each with its own variance, then the normalized quantity formed from them is also normal. With $p_w$ denoting the probability that the word $w = (w_1, \ldots, w_k)$ occurs, the count variable is given by formula (1).

The other count variable is given by formula (2), in which the letter probabilities are not observed directly and are estimated by the relative counts of the letters in the concatenation of the two sequences. The two sequences are assumed to be independent of each other and both generated by independent letters from the same distribution, and the estimated letter probabilities are then used to estimate the probability that $w = (w_1, \ldots, w_k)$ occurs.

The Hao measure is defined by formula (3):

The two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are converted into composition vectors $a = (a_1, a_2, \ldots, a_N)$ and $b = (b_1, b_2, \ldots, b_N)$, where $N \in [1, 4^k]$. The correlation $C(a, b)$ between $a$ and $b$ is the cosine of the angle between the two representative vectors in the $N$-dimensional space:

$C(a, b) = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2}\,\sqrt{\sum_{i=1}^{N} b_i^2}}$ (4)

The Teeling measure is derived from formulas (5), (6) and (7) and the Pearson correlation coefficient:

The observed frequency of a tetranucleotide $n_1n_2n_3n_4$ is denoted $N(n_1n_2n_3n_4)$. The corresponding expected frequency is calculated with a maximal-order Markov model:

$E(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3)\,N(n_2n_3n_4)}{N(n_2n_3)}$ (5)

The variance is:

$\mathrm{var}(N(n_1n_2n_3n_4)) = E(n_1n_2n_3n_4)\,\frac{[N(n_2n_3) - N(n_1n_2n_3)]\,[N(n_2n_3) - N(n_2n_3n_4)]}{N(n_2n_3)^2}$ (6)

The significance of over- or under-representation, i.e. the difference between the observed and expected frequencies, is assessed using the Z-score:

$Z(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3n_4) - E(n_1n_2n_3n_4)}{\sqrt{\mathrm{var}(N(n_1n_2n_3n_4))}}$ (7)

If two genomic fragments A and B show similar patterns of tetranucleotide over- and under-representation, this similarity can be quantified by the Pearson correlation coefficient of their Z-scores;

The JS divergence is defined by formula (8):

$JS(P \| Q) = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M), \quad M = \frac{1}{2}(P + Q)$ (8)

Given a sequence $S$ comprising $N$ genes, the log-likelihood of $S$ under the Markov model is

$\lambda(S) = \sum n(b_1 \ldots b_k b) \log P(b \mid b_1 \ldots b_k)$ (9)

The JS divergence is used to compare the probability distributions $P = \lambda(S_1)$ and $Q = \lambda(S_2)$ of two sequences $S_1$ and $S_2$. The JS divergence is a variant of the KL divergence, which is defined as follows:

$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$ (10)

3. The method of claim 2, wherein the Gaussian kernel interaction profile is calculated in two steps:

First, the interaction profile $IP(v_i)$ of virus $v_i$ is a binary vector encoding the presence or absence of an association between $v_i$ and each host in the known virus-host network. Second, the Gaussian kernel similarity between virus $v_i$ and virus $v_j$ is calculated from their interaction profiles and is defined as follows:

$S_v(v_i, v_j) = \exp(-\gamma_v \|IP(v_i) - IP(v_j)\|^2)$ (11)

The parameter $\gamma_v$ is the kernel bandwidth, which is defined through a new bandwidth parameter as:

$N_h$ is the number of hosts and, following previous studies, $\gamma'_v$ is set to 1. Similarly, the Gaussian kernel similarity between host $h_i$ and host $h_j$ is defined as:

$S_h(h_i, h_j) = \exp(-\gamma_h \|IP(h_i) - IP(h_j)\|^2)$ (13)

Its kernel bandwidth parameter is defined as:

where $N_v$ is the number of viruses and $\gamma'_h$ is set to 1.

4. The method of claim 1, wherein step 3 is implemented as follows:

Taking the virus similarity networks as an example, the edge weights of each virus similarity network are represented by an $N_v \times N_v$ matrix $S_v$. A normalized weight matrix $P$, defined by formula (15), is then obtained for each similarity network:

In formula (15), $S(i, j)$ is an element of $S_v$, where $i$ and $j$ are the row and column indices of the matrix. Local relationships are then measured using K-nearest neighbors, as defined by formula (16):

In formula (16), $N_i$ denotes the neighbors of the $i$-th virus;

Formula (15) yields $P^{(v)}$ and formula (16) yields $KNN^{(v)}$. $P_{i,j}$ in formula (15) captures the similarity of the $i$-th virus to all other viruses, while $KNN(i, j)$ in formula (16) captures the similarity of the $i$-th virus to its neighboring viruses only. In the similarity network fusion (SNF) algorithm, $P_{i,j}$ is always used as the initial state, while $KNN(i, j)$ serves as the kernel matrix in the fusion process, both to capture local structure and for computational efficiency. SNF iteratively updates the similarity matrix, as defined by formula (17):

where $P^{(k)}$ is the similarity matrix at step $t$, whose initial value is $P_{i,j}$, and $P^{(v)}$ is the similarity matrix at step $t+1$. Each application of formula (17) updates the matrix $P^{(v)}$, generating $m$ parallel interchanging diffusion processes on the $m$ virus networks.

SNF uses the K-nearest-neighbor method to measure local relationships and thereby filter out low-similarity edges. After multiple iterations a final matrix is obtained; SNF fusion thus yields the fused virus similarity matrix and the fused host similarity matrix.

5. The method of claim 1, wherein step 5 is implemented as follows:

The Node2vec algorithm framework is used to perform representation learning on the virus-host heterogeneous network $G$ constructed from the known virus-host associations obtained in step 1. This network $G$ contains only the known virus-host associations and does not use the virus-virus or host-host similarity networks. Node2vec introduces two hyperparameters, $p$ and $q$, to control the random walk strategy. Suppose the current random walk has reached vertex $v$ via edge $(t, v)$; set $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $\pi_{vx}$ is the unnormalized transition probability between vertex $v$ and vertex $x$, and $w_{vx}$ is the weight of the edge between $v$ and $x$. The path sampling strategy $\alpha_{pq}(t, x)$ is defined as follows:

$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases}$ (18)

In formula (18), $d_{tx}$ is the shortest-path distance between vertex $t$ and vertex $x$; the neighborhood set of a node can be obtained through formula (18);

Let $f(u)$ be the mapping function that maps node $u$ to its embedding vector. For any node $u$ in the graph, define $N_S(u)$ as the set of neighbors of $u$ sampled through formula (18); the $f(u)$ that maximizes the probability of observing these neighbors is obtained from formula (19):

under the following two assumptions:

(1) Conditional independence: given a source vertex, the probability of observing its neighbor vertex $n_i$ is independent of the remaining vertices in the neighbor set;

(2) Feature space symmetry: a vertex shares the same embedding vector whether it acts as a source vertex or as a neighbor vertex;

Formula (19) is then simplified to the final objective function, formula (22):

In formula (22), the normalization factor is expensive to compute, so negative sampling is used for optimization;

Maximizing the final objective function (22) yields the functional form of $f(u)$ and hence the feature vector of each node.

6. The method of claim 1, wherein step 8 is implemented as follows:

For each simple path of each virus-host pair, starting at the source node (the host node) and ending at the target node (the virus node), a path score is calculated using formula (23):

In formula (23), $P = \{p_1, p_2, \ldots, p_n\}$ is the set of paths connecting host node $h_i$ and virus node $v_j$, and $P_{weights}$ denotes the edge weights between nodes. The path score is the product of all edge weights from the starting host node to the ending virus node in each path structure. To reduce the amount of computation, the path length is limited to at most 3, which gives 6 path structures, each starting at a host node and ending at a virus node: Path1 (H-H-V), Path2 (H-V-V), Path3 (H-H-H-V), Path4 (H-H-V-V), Path5 (H-V-V-V) and Path6 (H-V-H-V). Two features are mined for each path structure:

(1) the sum of all meta-path scores under each path structure:

(2) the highest of all meta-path scores under each path structure:

A meta-path refers to all paths sharing the same path structure, and the meta-path score is the product of all edge weights from the starting host node to the ending virus node in that path structure; ASP denotes the meta-paths between a virus $v_j$ and a host $h_i$. To ensure that longer paths are not penalized, each maximum or sum score is computed separately, and each score considers the full set of paths belonging to a particular path structure.

7. The method of claim 1, wherein in step 9 an AdaBoost classification model is used as the prediction model; AdaBoost assigns different weights to the $M$ weak classifiers according to how well each classifies the sample data and combines them into a strong classifier. The AdaBoost algorithm proceeds as follows:

(1) Given a binary classification data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x$ denotes an input sample and $y$ the class to which the sample belongs, initialize the weight distribution of the training data:

The $M$ classifiers $G_m(x): x \to \{0, 1\}$, $m \in (1, M)$, are each trained on the data with the current weight distribution;

(2) Compute the classification error rate of the weak classifier $G_m(x)$:

Compute the coefficient of $G_m(x)$:

Update the weight distribution of the training data:

$D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,j}, \ldots, w_{m+1,N})$ (29)

where $G_m(x_i)$ denotes the weak classifier's classification of the sample data;

(3) Construct a linear combination of the basis classifiers:

The final classifier is:

Technical Field

The invention belongs to the interdisciplinary field of bioinformatics, computational biology and artificial intelligence, and particularly relates to a virus-host association prediction method based on network fusion and graph embedding.

Background

Viruses depend on their hosts for survival and play an important role in community structure and function, but viruses are highly diverse and their relationships with hosts vary. Traditional experimental methods for identifying the relationship between a virus and its host are costly and slow, their results can be affected by uncertain factors, and their success rate is low, so a more efficient and accurate approach is needed. Computational methods that predict virus-host interactions from mathematical models have therefore received increasing attention. Both viruses and hosts face natural selection pressure and are in constant competition: the host must develop resistance to protect itself against infection, while the virus cannot survive unless it can infect a host. One outcome may be that the virus integrates its genes into the host genome, and this information can be used to identify the host of a virus, i.e. the virus and the host share a functional relationship.

Given the limitations of traditional experiment-based exploration of virus-host associations, researchers have proposed computational techniques for predicting new associations between viruses and hosts. These techniques require known virus-host associations as input, and some also require virus-virus and host-host associations. Within a sample or a community, microorganisms (bacteria, viruses, etc.) compete for nutrients or territory and are linked by relationships such as mutualism, parasitism and antagonism, forming a complex network of interactions known as a heterogeneous network. The nodes of this heterogeneous network are bacteria and viruses, and the various bacterium-bacterium, virus-virus and bacterium-virus interactions form its edges. Traditional heterogeneous network mining typically begins by extracting structural features, such as object relationships, network structures and meta-paths, and then feeds these features into a machine learning model for downstream learning tasks. However, manually designing features is time-consuming and labor-intensive, and the resulting features do not transfer well: hand-designed features are often suited only to specific application scenarios and are therefore not general. For this reason, heterogeneous-network data mining has shifted toward representation learning based on graph neural networks. Heterogeneous network representation learning assumes that the structural and semantic properties of the network can be encoded into latent low-dimensional vectors, so that a model can automatically learn latent low-dimensional representations of network objects such as vertices, edges and subgraphs, which facilitates downstream learning tasks.

For example, some feature-based classification methods sample virus-host associations, represent each sample as a feature vector built from side information about the virus and the host, and then use a classifier to decide whether an association exists. Although various methods for predicting virus-host interactions exist, prediction models based on a single source of information are less accurate. As the number of discovered viruses grows, new and efficient analysis methods are needed that integrate multiple types of virus-host and virus-virus feature information to predict virus-host relationships more accurately and more quickly.

Disclosure of Invention

The invention aims to solve the problems in the background art and provides a virus-host association prediction method based on network fusion and graph embedding.

To further improve the accuracy of virus-host association prediction, the invention proposes a method that computes multiple virus and host similarity networks and uses the topological information of the virus-host association network. The method converts the virus-host association prediction problem into a link prediction problem between nodes of a heterogeneous network, and fuses multiple kinds of network information using graph embedding and similarity network fusion, thereby avoiding the limitations of other methods. The technical scheme of the invention is a computational method for predicting virus-host interactions based on graph embedding, comprising the following steps:

Step 1, acquiring known virus-host associations;

Step 2, measuring the similarity of each virus-virus pair and each host-host pair, and constructing multiple virus-virus similarity networks and multiple host-host similarity networks;

Step 3, integrating the virus-virus similarity networks and the host-host similarity networks obtained in step 2 using the similarity network fusion algorithm, to obtain a fused virus similarity matrix and a fused host similarity matrix;

Step 4, constructing a heterogeneous network $G_1$ from the fused virus similarity matrix and the fused host similarity matrix obtained in step 3 and the known virus-host associations obtained in step 1;

Step 5, applying a graph mining technique to the training portion of the virus-host associations obtained in step 1 to generate a feature representation of each node, the nodes comprising virus nodes and host nodes;

Step 6, calculating the cosine similarity between the feature vector of each virus obtained in step 5 and those of the other viruses, and between the feature vector of each host and those of the other hosts, to construct a virus cosine similarity matrix and a host cosine similarity matrix;

Step 7, constructing a heterogeneous network $G_2$ from the virus cosine similarity matrix and the host cosine similarity matrix obtained in step 6 and the known virus-host associations obtained in step 1;

Step 8, for $G_1$ obtained in step 4 and $G_2$ obtained in step 7, extracting the corresponding meta-path scores from graph $G_1$ and from graph $G_2$ according to the path structures and their features;

Step 9, performing feature selection to eliminate weak features, and then generating the feature vector X and the label Y of all virus-host pairs;

Step 10, inputting the feature vector X and the label Y obtained in step 9 into a supervised machine learning prediction model.
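Steps 6 and 7 can be illustrated with a minimal sketch. It assumes node embeddings are plain lists of floats and that the association matrix is indexed as `assoc[virus][host]`; the function names and the block layout of the adjacency matrix (viruses first, then hosts) are illustrative choices, not taken from the patent.

```python
import math


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity of a list of embedding vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # zero vector: similarity left at 0
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


def build_heterogeneous_network(virus_sim, host_sim, assoc):
    """Block adjacency matrix [[S_v, A], [A^T, S_h]] (assumed layout).

    assoc[i][j] is 1 if virus i is known to be associated with host j.
    """
    n_v, n_h = len(virus_sim), len(host_sim)
    g = [[0.0] * (n_v + n_h) for _ in range(n_v + n_h)]
    for i in range(n_v):
        for j in range(n_v):
            g[i][j] = virus_sim[i][j]
        for j in range(n_h):
            g[i][n_v + j] = float(assoc[i][j])
            g[n_v + j][i] = float(assoc[i][j])
    for i in range(n_h):
        for j in range(n_h):
            g[n_v + i][n_v + j] = host_sim[i][j]
    return g


virus_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
host_emb = [[1.0, 2.0], [2.0, 1.0]]
assoc = [[1, 0], [0, 1], [1, 1]]
g2 = build_heterogeneous_network(
    cosine_similarity_matrix(virus_emb),
    cosine_similarity_matrix(host_emb),
    assoc,
)
```

The same assembly function would serve for $G_1$ by passing the fused similarity matrices from step 3 instead of the cosine similarity matrices.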

Further, the similarity in step 2 is measured using oligonucleotide frequencies or Gaussian interaction profiles, and the similarity measurement using oligonucleotide frequencies is implemented as follows:

Two count-based measures, defined by formulas (1) and (2), together with the JS, Hao and Teeling measures, are used to calculate the distance between the genomic oligonucleotide frequency vectors of each virus-virus pair and each host-host pair, and thereby to measure the similarity of each virus-virus pair and each host-host pair;

The first count-based measure is defined by formula (1):

The second count-based measure is defined by formula (2):

Suppose two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are composed of letters from a finite alphabet $\Lambda$ of size $d$. For $a \in \Lambda$, let $p_a$ denote the probability of letter $a$. For a word $w = (w_1, \ldots, w_k) \in \Lambda^k$, let $X_w$ count the occurrences of $w$ in $A$ and, similarly, let $Y_w$ count the occurrences of $w$ in $B$; both counts are centered in the same way. If $X$ and $Y$ are independent mean-zero normal variables, each with its own variance, then the normalized quantity formed from them is also normal. With $p_w$ denoting the probability that the word $w = (w_1, \ldots, w_k)$ occurs, the count variable is given by formula (1).

The other count variable is given by formula (2), in which the letter probabilities are not observed directly and are estimated by the relative counts of the letters in the concatenation of the two sequences. The two sequences are assumed to be independent of each other and both generated by independent letters from the same distribution, and the estimated letter probabilities are then used to estimate the probability that $w = (w_1, \ldots, w_k)$ occurs.

The Hao measure is defined by formula (3):

The two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are converted into composition vectors $a = (a_1, a_2, \ldots, a_N)$ and $b = (b_1, b_2, \ldots, b_N)$, where $N \in [1, 4^k]$. The correlation $C(a, b)$ between $a$ and $b$ is the cosine of the angle between the two representative vectors in the $N$-dimensional space:

$C(a, b) = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2}\,\sqrt{\sum_{i=1}^{N} b_i^2}}$ (4)
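The cosine correlation between two composition vectors can be sketched as follows. This is a simplified illustration: it assumes raw k-mer counts over the DNA alphabet ACGT as the vector entries, whereas the composition vectors of the Hao measure may involve a background correction that the text above does not spell out. The function names are hypothetical.

```python
import math
from itertools import product


def kmer_vector(seq, k):
    """Count vector over all 4^k DNA words, in lexicographic order."""
    index = {"".join(w): i for i, w in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in index:  # skip words containing ambiguous bases such as N
            counts[index[word]] += 1
    return counts


def cosine_correlation(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


virus_vec = kmer_vector("ACGTACGTTG", 2)
host_vec = kmer_vector("ACGTTGCAAC", 2)
correlation = cosine_correlation(virus_vec, host_vec)
```

A correlation near 1 indicates similar oligonucleotide usage; how the correlation is turned into a distance is left to formula (3).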

The Teeling measure is derived from formulas (5), (6) and (7) and the Pearson correlation coefficient:

The observed frequency of a tetranucleotide $n_1n_2n_3n_4$ is denoted $N(n_1n_2n_3n_4)$. The corresponding expected frequency is calculated with a maximal-order Markov model:

$E(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3)\,N(n_2n_3n_4)}{N(n_2n_3)}$ (5)

The variance is:

$\mathrm{var}(N(n_1n_2n_3n_4)) = E(n_1n_2n_3n_4)\,\frac{[N(n_2n_3) - N(n_1n_2n_3)]\,[N(n_2n_3) - N(n_2n_3n_4)]}{N(n_2n_3)^2}$ (6)

The significance of over- or under-representation, i.e. the difference between the observed and expected frequencies, is assessed using the Z-score:

$Z(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3n_4) - E(n_1n_2n_3n_4)}{\sqrt{\mathrm{var}(N(n_1n_2n_3n_4))}}$ (7)

If two genomic fragments A and B show similar patterns of tetranucleotide over- and under-representation, this similarity can be quantified by the Pearson correlation coefficient of their Z-scores;
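A minimal sketch of the tetranucleotide Z-score comparison follows. It assumes the commonly cited maximal-order Markov formulation (expected count from the two overlapping trinucleotides and the shared dinucleotide), counts words on a single strand only, and sets the Z-score to 0 where the variance is 0; these handling choices and the function names are assumptions for illustration.

```python
import math
from itertools import product


def count_words(seq, k):
    """Overlapping k-mer counts of a sequence."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts


def tetra_zscores(seq):
    """Z-scores of all 256 tetranucleotides, in lexicographic order."""
    n4, n3, n2 = count_words(seq, 4), count_words(seq, 3), count_words(seq, 2)
    scores = []
    for letters in product("ACGT", repeat=4):
        w = "".join(letters)
        mid = n2.get(w[1:3], 0)    # N(n2 n3)
        left = n3.get(w[:3], 0)    # N(n1 n2 n3)
        right = n3.get(w[1:], 0)   # N(n2 n3 n4)
        if mid == 0:
            scores.append(0.0)
            continue
        expected = left * right / mid
        variance = expected * (mid - left) * (mid - right) / (mid * mid)
        observed = n4.get(w, 0)
        scores.append((observed - expected) / math.sqrt(variance) if variance > 0 else 0.0)
    return scores


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0
    return cov / (sd_x * sd_y)


similarity = pearson(tetra_zscores("ACGTTGCAACGTAGCTAGGA"), tetra_zscores("TTGACCGTACGATCGGATCA"))
```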

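The JS divergence is defined by formula (8):

$JS(P \| Q) = \frac{1}{2} KL(P \| M) + \frac{1}{2} KL(Q \| M), \quad M = \frac{1}{2}(P + Q)$ (8)

Given a sequence $S$ comprising $N$ genes, the log-likelihood of $S$ under the Markov model is

$\lambda(S) = \sum n(b_1 \ldots b_k b) \log P(b \mid b_1 \ldots b_k)$ (9)

The JS divergence is used to compare the probability distributions $P = \lambda(S_1)$ and $Q = \lambda(S_2)$ of two sequences $S_1$ and $S_2$. The JS divergence is a variant of the KL divergence (Kullback-Leibler divergence), which is defined as follows:

$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$ (10)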
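The relation between the two divergences can be sketched in a few lines. The example assumes the two distributions are given as equal-length lists of probabilities (for instance normalized k-mer frequencies); how the distributions are derived from the Markov model is outside this sketch.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) with natural logarithm; terms with p_i = 0 contribute 0.

    Requires q_i > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS(P || Q): symmetrized KL divergence against the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
distance = js_divergence(p, q)
```

Unlike the KL divergence, the JS divergence is symmetric and always finite, because the mixture $M$ is positive wherever $P$ or $Q$ is.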

Further, the Gaussian kernel interaction profile is calculated in two steps:

First, the interaction profile $IP(v_i)$ of virus $v_i$ is a binary vector encoding the presence or absence of an association between $v_i$ and each host in the known virus-host network. Second, the Gaussian kernel similarity between virus $v_i$ and virus $v_j$ is calculated from their interaction profiles and is defined as follows:

$S_v(v_i, v_j) = \exp(-\gamma_v \|IP(v_i) - IP(v_j)\|^2)$ (11)

The parameter $\gamma_v$ is the kernel bandwidth, which is defined through a new bandwidth parameter as:

$N_h$ is the number of hosts and, following previous studies, $\gamma'_v$ is set to 1. Similarly, the Gaussian kernel similarity between host $h_i$ and host $h_j$ is defined as:

$S_h(h_i, h_j) = \exp(-\gamma_h \|IP(h_i) - IP(h_j)\|^2)$ (13)

Its kernel bandwidth parameter is defined as:

where $N_v$ is the number of viruses and $\gamma'_h$ is set to 1.
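The Gaussian interaction profile kernel of formulas (11) and (13) can be sketched as below. The bandwidth normalization is an assumption: the sketch divides the parameter `gamma_prime` by the mean squared norm of the interaction profiles, which is a common choice for this kernel, and the patent's own bandwidth formulas may normalize differently.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel for a list of binary profiles.

    profiles[i] is the interaction profile IP of entity i (one row of the
    association matrix). The bandwidth normalization is an assumed choice.
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm if mean_sq_norm > 0 else gamma_prime
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


# Rows are viruses, columns are hosts.
assoc = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
virus_gip = gip_kernel(assoc)
# Host profiles are the columns of the same matrix.
host_gip = gip_kernel([list(col) for col in zip(*assoc)])
```

Viruses with identical host profiles get similarity 1, and the similarity decays as the profiles diverge.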

Further, step 3 is implemented as follows:

Taking the virus similarity networks as an example, the edge weights of each virus similarity network are represented by an $N_v \times N_v$ matrix $S_v$. A normalized weight matrix $P$, defined by formula (15), is then obtained for each similarity network:

In formula (15), $S(i, j)$ is an element of $S_v$, where $i$ and $j$ are the row and column indices of the matrix. Local relationships are then measured using K-nearest neighbors, as defined by formula (16):

In formula (16), $N_i$ denotes the neighbors of the $i$-th virus;

Formula (15) yields $P^{(v)}$ and formula (16) yields $KNN^{(v)}$. $P_{i,j}$ in formula (15) captures the similarity of the $i$-th virus to all other viruses, while $KNN(i, j)$ in formula (16) captures the similarity of the $i$-th virus to its neighboring viruses only. In the similarity network fusion (SNF) algorithm, $P_{i,j}$ is always used as the initial state, while $KNN(i, j)$ serves as the kernel matrix in the fusion process, both to capture local structure and for computational efficiency. SNF iteratively updates the similarity matrix, as defined by formula (17):

where $P^{(k)}$ is the similarity matrix at step $t$, whose initial value is $P_{i,j}$, and $P^{(v)}$ is the similarity matrix at step $t+1$. Each application of formula (17) updates the matrix $P^{(v)}$, generating $m$ parallel interchanging diffusion processes on the $m$ virus networks.

SNF uses the K-nearest-neighbor method to measure local relationships and thereby filter out low-similarity edges. After multiple iterations a final matrix is obtained; SNF fusion thus yields the fused virus similarity matrix and the fused host similarity matrix.
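The fusion loop can be sketched in pure Python as follows. This follows the commonly published form of SNF and is an assumption rather than the patent's exact formulas (15) to (17): the full kernel puts 1/2 on the diagonal and scales each row's off-diagonal entries to sum to 1/2, the local kernel keeps only the `k` most similar other nodes, and each network is updated as `K × (average of the other networks' P) × K^T`, then re-normalized. The final averaging and symmetrization are also illustrative choices.

```python
def normalize_full(S):
    """Full kernel P: diagonal 1/2, off-diagonal entries of each row sum to 1/2."""
    n = len(S)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off = sum(S[i][c] for c in range(n) if c != i)
        for j in range(n):
            if i == j:
                P[i][j] = 0.5
            elif off > 0:
                P[i][j] = S[i][j] / (2.0 * off)
    return P


def knn_kernel(S, k):
    """Local kernel: similarities to the k nearest other nodes, row-normalized."""
    n = len(S)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: S[i][j], reverse=True)[:k]
        total = sum(S[i][j] for j in nearest)
        for j in nearest:
            if total > 0:
                K[i][j] = S[i][j] / total
    return K


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def snf(networks, k=2, iterations=10):
    """Fuse m >= 2 similarity matrices of the same size into one matrix."""
    m, n = len(networks), len(networks[0])
    if m < 2:
        raise ValueError("SNF needs at least two networks")
    P = [normalize_full(S) for S in networks]
    K = [knn_kernel(S, k) for S in networks]
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Average status matrix of all networks except v.
            avg = [[sum(P[u][i][j] for u in range(m) if u != v) / (m - 1)
                    for j in range(n)] for i in range(n)]
            K_t = [list(col) for col in zip(*K[v])]
            updated.append(normalize_full(matmul(matmul(K[v], avg), K_t)))
        P = updated
    fused = [[sum(P[v][i][j] for v in range(m)) / m for j in range(n)] for i in range(n)]
    return [[(fused[i][j] + fused[j][i]) / 2.0 for j in range(n)] for i in range(n)]


s1 = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
s2 = [[1.0, 0.8, 0.3], [0.8, 1.0, 0.1], [0.3, 0.1, 1.0]]
fused_virus = snf([s1, s2], k=2, iterations=5)
```

The host similarity networks would be fused by a second call of the same function.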

Further, step 5 is implemented as follows:

The Node2vec algorithm framework is used to perform representation learning on the virus-host heterogeneous network $G$ constructed from the known virus-host associations obtained in step 1. This network $G$ contains only the known virus-host associations and does not use the virus-virus or host-host similarity networks. Node2vec introduces two hyperparameters, $p$ and $q$, to control the random walk strategy. Suppose the current random walk has reached vertex $v$ via edge $(t, v)$; set $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $\pi_{vx}$ is the unnormalized transition probability between vertex $v$ and vertex $x$, and $w_{vx}$ is the weight of the edge between $v$ and $x$. The path sampling strategy $\alpha_{pq}(t, x)$ is defined as follows:

$\alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases}$ (18)

In formula (18), $d_{tx}$ is the shortest-path distance between vertex $t$ and vertex $x$; the neighborhood set of a node can be obtained through formula (18);

Let $f(u)$ be the mapping function that maps node $u$ to a low-dimensional vector. For any node $u$ in the graph, define $N_S(u)$ as the set of neighbors of $u$ sampled through formula (18); the $f(u)$ that maximizes the probability of observing these neighbors is obtained from formula (19):

under the following two assumptions:

(1) Conditional independence: given a source vertex, the probability of observing its neighbor vertex $n_i$ is independent of the remaining vertices in the neighbor set;

(2) Feature space symmetry: a vertex shares the same embedding vector whether it acts as a source vertex or as a neighbor vertex;

Formula (19) is then simplified to the final objective function, formula (22):

In formula (22), the normalization factor is expensive to compute, so negative sampling is used for optimization;

Maximizing the final objective function (22) yields the functional form of $f(u)$ and hence the feature vector of each node.
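One step of the biased random walk can be sketched as follows. The sketch assumes an undirected graph stored as a dictionary of neighbor sets and edge weights keyed by ordered node pairs (defaulting to 1.0); these data structures and the function names are illustrative and not part of the patent.

```python
def alpha_pq(t, x, neighbors, p, q):
    """Search bias for moving from the current vertex to x, having arrived from t."""
    if x == t:
        return 1.0 / p          # d_tx = 0: return to the previous vertex
    if x in neighbors[t]:
        return 1.0              # d_tx = 1: x is also adjacent to t
    return 1.0 / q              # d_tx = 2: move away from t


def transition_probabilities(t, v, neighbors, weights, p, q):
    """Normalized probabilities of stepping from v to each neighbor x."""
    unnormalized = {
        x: alpha_pq(t, x, neighbors, p, q) * weights.get((v, x), 1.0)
        for x in neighbors[v]
    }
    total = sum(unnormalized.values())
    return {x: value / total for x, value in unnormalized.items()}


# Known associations: v1-h1, v1-h2, v2-h1 (a bipartite virus-host graph).
neighbors = {
    "v1": {"h1", "h2"},
    "v2": {"h1"},
    "h1": {"v1", "v2"},
    "h2": {"v1"},
}
# The walk arrived at h1 from v1; it may return to v1 or move on to v2.
probs = transition_probabilities("v1", "h1", neighbors, {}, p=2.0, q=0.5)
```

Because the network in step 5 contains only virus-host edges, it is bipartite, so the previous vertex `t` and any candidate `x` lie on the same side and are never adjacent; in that setting only the `1/p` and `1/q` cases occur.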

Further, step 8 is implemented as follows:

For each simple path of each virus-host pair, starting at the source node (the host node) and ending at the target node (the virus node), a path score is calculated using formula (23):

In formula (23), $P = \{p_1, p_2, \ldots, p_n\}$ is the set of paths connecting host node $h_i$ and virus node $v_j$, and $P_{weights}$ denotes the edge weights between nodes. The path score is the product of all edge weights from the starting host node to the ending virus node in each path structure. To reduce the amount of computation, the path length is limited to at most 3, which gives 6 path structures, each starting at a host node and ending at a virus node: Path1 (H-H-V), Path2 (H-V-V), Path3 (H-H-H-V), Path4 (H-H-V-V), Path5 (H-V-V-V) and Path6 (H-V-H-V). Two features are mined for each path structure:

(1) the sum of all meta-path scores under each path structure:

(2) the highest of all meta-path scores under each path structure:

A meta-path refers to all paths sharing the same path structure, and the meta-path score is the product of all edge weights from the starting host node to the ending virus node in that path structure; ASP denotes the meta-paths between a virus $v_j$ and a host $h_i$. To ensure that longer paths are not penalized, each maximum or sum score is computed separately, and each score considers the full set of paths belonging to a particular path structure.
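The two features per path structure can be sketched for the two length-2 structures, H-H-V and H-V-V. The sketch assumes a host similarity matrix, a virus similarity matrix and an association matrix indexed as `assoc[host][virus]`; longer structures would follow the same pattern with one more nested loop. All names are illustrative.

```python
def path_scores_hhv(host_sim, assoc, i, j):
    """Scores of all simple H-H-V paths from host i to virus j."""
    scores = []
    for k in range(len(host_sim)):
        if k == i:
            continue  # a simple path may not revisit the starting host
        score = host_sim[i][k] * assoc[k][j]
        if score > 0:
            scores.append(score)
    return scores


def path_scores_hvv(virus_sim, assoc, i, j):
    """Scores of all simple H-V-V paths from host i to virus j."""
    scores = []
    for u in range(len(virus_sim)):
        if u == j:
            continue  # the intermediate virus must differ from the target
        score = assoc[i][u] * virus_sim[u][j]
        if score > 0:
            scores.append(score)
    return scores


def sum_and_max(scores):
    """The two features mined per path structure: total and highest score."""
    return (sum(scores), max(scores) if scores else 0.0)


host_sim = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.4], [0.2, 0.4, 1.0]]
virus_sim = [[1.0, 0.6], [0.6, 1.0]]
assoc = [[1, 0], [0, 1], [1, 1]]  # rows: hosts, columns: viruses

features = sum_and_max(path_scores_hhv(host_sim, assoc, 0, 1)) \
    + sum_and_max(path_scores_hvv(virus_sim, assoc, 0, 1))
```

Concatenating the sum and maximum over all six path structures, and over both graphs $G_1$ and $G_2$, would give the feature vector of one virus-host pair.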

Further, in step 9 an AdaBoost classification model is used as the prediction model; AdaBoost assigns different weights to the $M$ weak classifiers according to how well each classifies the sample data and combines them into a strong classifier. The AdaBoost algorithm proceeds as follows:

(1) Given a binary classification data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x$ denotes an input sample and $y$ the class to which the sample belongs, initialize the weight distribution of the training data:

The $M$ classifiers $G_m(x): x \to \{0, 1\}$, $m \in (1, M)$, are each trained on the data with the current weight distribution;

(2) Compute the classification error rate of the weak classifier $G_m(x)$:

Compute the coefficient of $G_m(x)$:

Update the weight distribution of the training data:

$D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,j}, \ldots, w_{m+1,N})$ (29)

where $G_m(x_i)$ denotes the weak classifier's classification of the sample data;

(3) Construct a linear combination of the basis classifiers:

The final classifier is:
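One boosting round (error rate, classifier coefficient, weight update) can be sketched as follows. The sketch uses the textbook formulation with labels and predictions in {-1, +1}, which is an assumption: the text above writes the weak classifiers as mapping to {0, 1}, and its own formulas may be expressed differently. The clamping of the error rate is an added safeguard against division by zero.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One AdaBoost round with labels and predictions in {-1, +1}.

    Returns the weighted error rate, the classifier coefficient alpha,
    and the re-normalized sample weights for the next round.
    """
    error = sum(w for w, y, g in zip(weights, labels, predictions) if y != g)
    error = min(max(error, 1e-10), 1.0 - 1e-10)  # avoid log(0) and division by zero
    alpha = 0.5 * math.log((1.0 - error) / error)
    raw = [w * math.exp(-alpha * y * g) for w, y, g in zip(weights, labels, predictions)]
    norm = sum(raw)
    return error, alpha, [w / norm for w in raw]


def strong_classify(alphas, votes):
    """Sign of the alpha-weighted sum of weak classifier votes for one sample."""
    total = sum(a * g for a, g in zip(alphas, votes))
    return 1 if total >= 0 else -1


labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]          # the weak classifier errs on the last sample
initial = [1.0 / len(labels)] * len(labels)
error, alpha, updated = adaboost_round(initial, labels, predictions)
```

After the round, the misclassified sample carries half of the total weight, so the next weak classifier is pushed to classify it correctly.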

To address the limitations of existing virus-host association prediction methods, the invention provides a graph-embedding-based computational method for predicting virus-host interactions that improves the accuracy of virus-host association prediction. Evaluated on four data sets, the invention improves prediction performance over other methods, achieves higher accuracy on all data sets, and shows more stable model performance. Some of the virus-host associations predicted by the invention have been verified in published papers and databases.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention.

Detailed Description

The technical solution of the present invention can be implemented by a person skilled in the art using computer software technology. Embodiments of the invention are described in detail below with reference to the accompanying drawings:

Step 1, the known virus-host associations are obtained from relevant bioinformatics papers and an authoritative bioinformatics website (NCBI).

Step 2, the relationship between genomic sequences is inferred from dissimilarity measures based on genomic oligonucleotide frequency. The invention uses several such measures, including JS, Hao and Teeling, to calculate the distance between the genomic oligonucleotide frequency vectors of each virus pair (and of each host pair), and thereby measures the similarity of each virus pair (and of each host pair).

Is defined by formula (1):

is defined by formula (2):

suppose there are two sequences a ═ a1A2...AnAnd B ═ B1B2...BmConsisting of a letter of finite alphabet Λ of length d, let p be e Λ for a ∈ ΛaIndicating the probability of the occurrence of the letter a. For w ═ w1,...,wk)∈ΛkLet aThe number of occurrences of w at A is calculated, and similarly, YwThe number of occurrences of w at B is calculated, hereAlso, the same applies toIf X and Y are independent mean-zero-normal, then X has a varianceY has variance Are also normal, have varianceFor w ═ w1,...,wkRepresenting the probability of w occurrence, the counting variable in the set is represented as (34), where

Another counting variable (35), whereinIs the probability of an unobserved letter, i.e., the relative count of letters in the concatenation of the two sequences. The relative number of letters a in the concatenation of two sequences, independent of each other and both generated from independent letters in the same distribution, and then usedEstimating w ═ w1,...,wkThe probability of occurrence.

Hao is defined by formula (3):

Two sequences $A = A_1 A_2 \ldots A_n$ and $B = B_1 B_2 \ldots B_m$ are converted into vectors $a = (a_1, a_2, \ldots, a_N)$ and $b = (b_1, b_2, \ldots, b_N)$, where $N \in [1, 4^k]$. The correlation $C(a, b)$ between a and b is the cosine of the angle between the two representative vectors in the N-dimensional space:

$$C(a, b) = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2}\,\sqrt{\sum_{i=1}^{N} b_i^2}}$$
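As an illustration of the cosine correlation described above, the following minimal sketch builds k-mer count vectors for two nucleotide sequences and computes the cosine of the angle between them. How the source fills the vector entries is not legible in this text, so raw k-mer counts are used here as an assumption; the function names `kmer_vector` and `cosine_similarity` are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k, alphabet="ACGT"):
    """Count vector over all len(alphabet)**k possible k-mers, in lexicographic order.

    Assumption: raw counts are used as vector entries.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(w)] for w in product(alphabet, repeat=k)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = kmer_vector("ACGTACGTAC", 2)
b = kmer_vector("ACGTTTGCAC", 2)
sim = cosine_similarity(a, b)
```

With k = 2 each vector has 4^2 = 16 entries; larger k gives a finer but sparser representation.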

Teeling is derived from formulas (5), (6) and (7) and the Pearson correlation coefficient:

The observed frequency of a tetranucleotide is denoted $N(n_1 n_2 n_3 n_4)$; the corresponding expected frequency is calculated by a maximal-order Markov model:

The variance is:

The significance of over- or under-representation, i.e. the difference between the observed and expected frequencies, is assessed using the Z-score.

If two genomic fragments A and B show similar patterns of tetranucleotide over- and under-representation, this similarity can be quantified by calculating the Pearson correlation coefficient of their Z-scores.
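The Z-score and Pearson steps can be sketched as follows. The expected-frequency and variance expressions are not legible in this text, so the sketch assumes the commonly cited maximal-order Markov formulation for tetranucleotides: expected = N(n1n2n3)·N(n2n3n4)/N(n2n3), and variance = expected·[N(n2n3) − N(n1n2n3)]·[N(n2n3) − N(n2n3n4)]/N(n2n3)². All function names are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def kmer_counts(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tetra_zscores(seq):
    """Z-score of each of the 256 tetranucleotides (assumed formulation, see lead-in)."""
    di, tri, tetra = (kmer_counts(seq, k) for k in (2, 3, 4))
    zscores = []
    for letters in product("ACGT", repeat=4):
        w = "".join(letters)
        mid = di[w[1:3]]                      # N(n2 n3)
        left, right = tri[w[:3]], tri[w[1:]]  # N(n1 n2 n3), N(n2 n3 n4)
        if mid == 0:
            zscores.append(0.0)
            continue
        expected = left * right / mid
        variance = expected * (mid - left) * (mid - right) / mid ** 2
        zscores.append((tetra[w] - expected) / math.sqrt(variance) if variance > 0 else 0.0)
    return zscores


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


z_a = tetra_zscores("ACGTTGCAAGCTTAGCCGATACGGATCCAAGT")
z_b = tetra_zscores("TTGACCGATAGGCATCGATCGGATTACAGCTA")
similarity = pearson(z_a, z_b)
```

Real genomes are far longer than these toy strings; on short inputs most tetranucleotides never occur and their Z-scores default to zero.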

JS divergence (Jensen-Shannon divergence) is defined by formula (8)

Given a sequence S comprising N genes, the log-likelihood of S under a Markov model is

$\lambda(S) = \sum n(b_1 \ldots b_k b) \log P(b \mid b_1 \ldots b_k)$ (42)

JS divergence is used to measure the difference between the probability distributions $P = \lambda(S_1)$ and $Q = \lambda(S_2)$ of two sequences $S_1$ and $S_2$. JS divergence is a variant of KL divergence (Kullback-Leibler divergence), which is defined as follows:

$$KL(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
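A minimal sketch of the two divergences, assuming P and Q are given as discrete probability distributions over the same outcomes (lists of equal length that each sum to 1). The JS form used here is the standard symmetric one built from the mixture M = (P + Q)/2; how the source derives P and Q from the Markov model is not shown, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) for discrete distributions; terms with p_i == 0 contribute 0.

    Raises ZeroDivisionError if q_i == 0 where p_i > 0 (KL is infinite there).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
d = js_divergence(p, q)
```

Unlike KL divergence, this JS divergence is symmetric and always finite, because the mixture M is non-zero wherever P or Q is non-zero.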

In addition, the invention calculates the Gaussian interaction profile (GIP) kernel similarity between viruses (and between hosts) from the known virus-host association matrix. The Gaussian interaction profile kernel is currently a widely used way of extracting similarity information from a virus-host association network. Its calculation consists of two main steps. First, the interaction profile $IP(v_i)$ of virus $v_i$ is a binary vector encoding whether virus $v_i$ is associated with each host in the known virus-host network. Second, the Gaussian kernel similarity between virus $v_i$ and virus $v_j$ is calculated from their interaction profiles and is defined as follows:

$S_v(v_i, v_j) = \exp(-\gamma_v \|IP(v_i) - IP(v_j)\|^2)$ (44)

The parameter $\gamma_v$ represents the kernel bandwidth; a new kernel bandwidth parameter is defined as:

$N_h$ is the number of hosts and, following previous studies, $\gamma'_v$ is set to 1. Similarly, the Gaussian kernel similarity between host $h_i$ and host $h_j$ is defined as:

$S_h(h_i, h_j) = \exp(-\gamma_h \|IP(h_i) - IP(h_j)\|^2)$ (46)

Its kernel bandwidth parameter is defined as:

where $N_v$ is the number of viruses and $\gamma'_h$ is set to 1.
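The GIP kernel can be sketched as below. The bandwidth formulas are not legible in this text, so the sketch assumes the commonly used normalization in which the bandwidth is γ' divided by the mean squared norm of the interaction profiles; that choice, the function name `gip_kernel` and the toy matrix are assumptions.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between the rows of a binary matrix.

    Assumption: gamma = gamma_prime / (mean squared norm of the profiles).
    """
    n = len(profiles)
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    gamma = gamma_prime / mean_sq_norm if mean_sq_norm > 0 else gamma_prime
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * dist_sq)
    return kernel


# Toy association matrix: rows are viruses, columns are hosts.
assoc = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
S_v = gip_kernel(assoc)                                # virus-virus similarity
S_h = gip_kernel([list(col) for col in zip(*assoc)])   # host-host similarity
```

Viruses that share hosts (rows 0 and 1) end up more similar than viruses with disjoint host sets (rows 0 and 2).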

Step 3, the invention uses the similarity network fusion (SNF) algorithm. Taking the virus similarity networks as an example, the edge weights of each virus similarity network are represented by an $N_v \times N_v$ matrix $S_v$. A normalized weight matrix P, defined by equation (15), can then be obtained for each similarity network:

In formula (48), $S(i, j)$ is an element of $S_v$, where i and j denote the row and column indices of the matrix. The local relationship is then measured using K-nearest neighbors (KNN), defined by equation (16):

In formula (49), $N_i$ represents the number of neighbors of the virus, which is predefined. The distance between each element and the other elements can be calculated from the similarity matrix, and the K most similar elements are selected. Here $N_i$ is set to 5; this step filters out edges with low similarity.

$P^{(v)}$ is obtained from formula (48) and $KNN^{(v)}$ from formula (49). In formula (48), $P_{i,j}$ is the similarity of the i-th virus to all other viruses, while $KNN(i, j)$ in formula (49) is the similarity of the i-th virus to its neighboring viruses. In the SNF algorithm, $P_{i,j}$ is always used as the initial state, while $KNN(i, j)$ serves as the kernel matrix during fusion, both to capture local structure and for computational efficiency. The SNF process is an iterative update of the similarity matrix, defined by equation (17):

where $P^{(k)}$ is the similarity matrix at step t, whose initial value is $P_{i,j}$, and $P^{(v)}$ is the similarity matrix at step t + 1. Each application of formula (50) updates the matrix $P^{(v)}$, producing m parallel interchanging diffusion processes on the m virus networks.

SNF thus measures local relationships with the K-nearest neighbor (KNN) method to filter out low-similarity edges, and obtains the final matrix after multiple iterations. SNF fusion yields the fused virus similarity matrix and the fused host similarity matrix.
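The fusion loop can be sketched as follows. Formulas (48) to (50) are not legible in this text, so the sketch follows the commonly published SNF formulation as an assumption: a full kernel P with 1/2 on the diagonal, a KNN kernel S normalized over the K most similar other nodes, and the update P(v) ← S(v) · mean of the other P(k) · S(v)ᵀ, with the fused result taken as the average of the final P(v). It needs at least two input networks; all names are hypothetical.

```python
def full_kernel(W):
    """Row-normalized kernel: 1/2 on the diagonal, the rest shared by the off-diagonal."""
    n = len(W)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off_diag = sum(W[i][k] for k in range(n) if k != i)
        for j in range(n):
            if i == j:
                P[i][j] = 0.5
            elif off_diag > 0:
                P[i][j] = W[i][j] / (2 * off_diag)
    return P


def knn_kernel(W, k):
    """Keep only each node's k most similar other nodes, normalized to sum to 1."""
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: W[i][j], reverse=True)[:k]
        total = sum(W[i][j] for j in neighbours)
        for j in neighbours:
            if total > 0:
                S[i][j] = W[i][j] / total
    return S


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def snf(networks, k=5, iterations=10):
    """Fuse m >= 2 similarity matrices of identical shape into one matrix."""
    m, n = len(networks), len(networks[0])
    P = [full_kernel(W) for W in networks]
    S = [knn_kernel(W, k) for W in networks]
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Average status matrix of all the other networks.
            others = [[sum(P[u][i][j] for u in range(m) if u != v) / (m - 1)
                       for j in range(n)] for i in range(n)]
            S_T = [list(col) for col in zip(*S[v])]
            updated.append(matmul(matmul(S[v], others), S_T))
        P = updated
    return [[sum(P[v][i][j] for v in range(m)) / m for j in range(n)] for i in range(n)]


W1 = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.3], [0.1, 0.3, 1.0]]
W2 = [[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]]
fused = snf([W1, W2], k=1, iterations=3)
```

Published SNF implementations usually add per-iteration normalization and symmetrization for numerical stability; those steps are left out to keep the sketch short.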

Step 4, a heterogeneous network $G_1$ is constructed from the fused virus similarity matrix and the fused host similarity matrix obtained in step 3, together with the known virus-host associations obtained in step 1.

Step 5, the node2vec algorithm framework is used to perform representation learning on the virus-host heterogeneous network G constructed from the known virus-host associations obtained in step 1. This network G contains only the known virus-host associations and uses neither the virus-virus similarity networks nor the host-host similarity networks. node2vec introduces two hyper-parameters p and q to control the random walk strategy. Suppose the current random walk has just traversed edge (t, v) and now resides at vertex v; set $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $\pi_{vx}$ is the unnormalized transition probability between vertex v and vertex x and $w_{vx}$ is the weight of the edge between vertex v and vertex x. The path sampling strategy $\alpha_{pq}(t, x)$ is defined as follows:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p} & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ \dfrac{1}{q} & \text{if } d_{tx} = 2 \end{cases} \quad (51)$$

In formula (51), $d_{tx}$ is the shortest-path distance between vertex t and vertex x. A node neighborhood set can be obtained using formula (51).
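The biased step of the random walk can be sketched as below: given that the walk has just moved from t to v, the weight of each candidate next vertex x is scaled by 1/p, 1 or 1/q depending on whether x is t itself, a neighbor of t, or two hops away from t. The adjacency-dictionary representation, the function name and the toy graph are assumptions for illustration.

```python
def transition_probabilities(graph, t, v, p, q):
    """Normalized node2vec transition probabilities out of v, having arrived from t.

    graph: dict mapping node -> dict of neighbour -> edge weight (undirected).
    """
    unnormalized = {}
    for x, w_vx in graph[v].items():
        if x == t:              # d_tx = 0: step back to the previous vertex
            alpha = 1.0 / p
        elif x in graph[t]:     # d_tx = 1: x is also a neighbour of t
            alpha = 1.0
        else:                   # d_tx = 2: x moves away from t
            alpha = 1.0 / q
        unnormalized[x] = alpha * w_vx
    total = sum(unnormalized.values())
    return {x: pi / total for x, pi in unnormalized.items()}


# Toy graph covering all three cases for a walk that just moved t -> v.
graph = {
    "t": {"v": 1.0, "a": 1.0},
    "a": {"t": 1.0, "v": 1.0},
    "v": {"t": 1.0, "a": 1.0, "b": 1.0},
    "b": {"v": 1.0},
}
probs = transition_probabilities(graph, "t", "v", p=2.0, q=0.5)
```

In a purely bipartite virus-host graph, t and x always lie on the same side and are never adjacent, so only the 1/p and 1/q cases occur there.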

Let f(u) be the mapping function that maps node u to a low-dimensional vector. For any node u in the graph, define $N_S(u)$ as the set of neighbors of node u sampled via formula (51). The function f that maximizes the probability of these neighbors appearing is obtained from equation (19):

According to the following two assumptions:

(1) Conditional independence: given a source vertex, the probability that its neighbor vertex $n_i$ appears is independent of the remaining vertices in the neighbor set;

(2) Feature space symmetry: a vertex shares the same embedding vector whether it acts as a source vertex or as a neighbor vertex;

formula (52) is optimized into the final objective function, equation (22):

In formula (55), the normalization factor is expensive to compute, so negative sampling is adopted for optimization.

When this final objective function (55) is maximized, the functional form of f(u) is obtained, which gives the feature vector of each node.

Step 6, the cosine similarity between the feature vector of each virus (each host) obtained in step 5 and the feature vectors of the other viruses (other hosts) is calculated, from which the virus cosine similarity matrix and the host cosine similarity matrix are constructed.

Step 7, a heterogeneous network $G_2$ is constructed from the virus cosine similarity matrix and the host cosine similarity matrix obtained in step 6, together with the known virus-host associations from step 1.

Step 8, graph-based features are extracted from the two heterogeneous weighted graphs $G_1$ and $G_2$ obtained in step 4 and step 7. Multiple path scores between each virus-host pair in each graph are used to reflect these features. For each simple path of each virus-host pair, starting from the source node (the host node) and ending at the target node (the virus node), a path score is calculated using equation (23) below:

In formula (56), $P = \{p_1, p_2, \ldots, p_n\}$ is the set of paths connecting host node $h_i$ and virus node $v_j$, and $P_{weights}$ is the weight between nodes. The path score is the product of all edge weights from the starting host node to the ending virus node in each path structure. To reduce the amount of computation, the invention limits the path length to at most 3, which gives 6 path structures, each starting at a host node and ending at a virus node: Path1 (H-H-V), Path2 (H-V-V), Path3 (H-H-H-V), Path4 (H-H-V-V), Path5 (H-V-V-V) and Path6 (H-V-H-V). Two features are mined for each path structure:

(1) the sum of all meta-path scores under each path structure:

(2) the highest of all meta-path scores under each path structure:

The meta-path refers to all paths having the same path structure, and the meta-path score is the product of all edge weights from the starting host node to the ending virus node in the path structure. ASP represents the meta-paths between a virus $v_j$ and a host $h_i$. To ensure that longer paths are not penalized, each (maximum or sum) path score is computed separately, where each score considers the full set of paths belonging to a particular path structure. Thus, scores from different path structures are not blended together in one feature. In addition, the scores are normalized using min-max normalization to ensure that the classifier treats the features equally.
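The feature extraction for one virus-host pair can be sketched as follows. For each of the six path structures it enumerates the simple paths from the host to the virus, multiplies the edge weights along each path, and records the sum and the maximum, giving 12 features per graph. The matrix names `hh`, `vv`, `hv`, the function names and the brute-force enumeration are assumptions made for illustration; min-max normalization across pairs is omitted.

```python
from itertools import product

PATH_STRUCTURES = ["HHV", "HVV", "HHHV", "HHVV", "HVVV", "HVHV"]


def edge_weight(a, b, hh, vv, hv):
    """Weight between two typed nodes ('H' or 'V', index) of the heterogeneous graph."""
    (type_a, i), (type_b, j) = a, b
    if type_a == "H" and type_b == "H":
        return hh[i][j]
    if type_a == "V" and type_b == "V":
        return vv[i][j]
    return hv[i][j] if type_a == "H" else hv[j][i]


def path_features(h, v, hh, vv, hv):
    """Sum and max of path scores per path structure for host h and virus v."""
    n_h, n_v = len(hh), len(vv)
    features = []
    for structure in PATH_STRUCTURES:
        inner_types = structure[1:-1]
        ranges = [range(n_h) if t == "H" else range(n_v) for t in inner_types]
        scores = []
        for middle in product(*ranges):
            nodes = [("H", h)] + list(zip(inner_types, middle)) + [("V", v)]
            if len(set(nodes)) < len(nodes):
                continue  # a node repeats, so this is not a simple path
            score = 1.0
            for a, b in zip(nodes, nodes[1:]):
                score *= edge_weight(a, b, hh, vv, hv)
            if score > 0:
                scores.append(score)
        features.append(sum(scores))
        features.append(max(scores, default=0.0))
    return features


# Toy graph: 2 hosts, 2 viruses; known associations h0-v0 and h1-v1.
hh = [[1.0, 0.5], [0.5, 1.0]]
vv = [[1.0, 0.4], [0.4, 1.0]]
hv = [[1, 0], [0, 1]]
features = path_features(0, 1, hh, vv, hv)
```

For the unlinked pair (h0, v1), only H-H-V (through h1) and H-V-V (through v0) yield non-zero scores in this toy graph. Exhaustive enumeration grows quickly with graph size, so a practical implementation would likely use matrix products or pruning.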

Step 9, from step 8, 12 features can be extracted for each virus-host pair in each constructed heterogeneous graph, and the features from the two graphs are combined into a 24-dimensional feature vector. The accuracy of the invention depends on the underlying characteristics of the data set. After empirical analysis and many experiments, the feature set most relevant to this classification task was determined; when analyzing performance, combinations of one or more features were removed. Thus, after feature selection is applied, the dimension of the feature vector input to the prediction model is reduced from 24 to 16, depending on the data set.

Step 10, the invention uses Adaboost, a well-established machine learning classification model with good performance. According to how well each of the M weak classifiers classifies the sample data, Adaboost assigns different weights to the weak classifiers and combines them into a strong classifier. The algorithm flow of Adaboost is as follows:

(1) given a binary classification data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where x represents an input sample and y represents the class to which the sample belongs, initializing the weight distribution of the training data:

the M classifiers $G_m(x): x \to \{0, 1\}$, $m \in (1, M)$, are each trained using the data with the weight distribution;

(2) computing the classification error rate of the weak classifier $G_m(x)$:

computing the coefficient of $G_m(x)$:

updating the weight distribution of the training data:

$D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,j}, \ldots, w_{m+1,N})$ (62)

where $G_m(x_i)$ represents the result of classifying the sample data by the weak classifier;

(3) constructing a linear combination of basis classifiers:

the final classifier:
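The flow above can be illustrated with a minimal sketch of textbook discrete Adaboost. The exact formulas of the source are not legible in this text, so the standard forms are assumed: uniform initial weights 1/N, coefficient α = ½·ln((1 − e)/e), and exponential re-weighting. Labels in {0, 1} are mapped internally to {−1, +1}, and one-feature threshold "stumps" stand in for the weak classifiers; all names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Best single-feature threshold classifier under sample weights w (y in {-1, +1})."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] >= thr else -sign for row in X]
                err = sum(wi for wi, pi, yi in zip(w, pred, y) if pi != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, row):
    _, f, thr, sign = stump
    return sign if row[f] >= thr else -sign


def adaboost_fit(X, labels, rounds=10):
    """Return a list of (alpha, stump) pairs forming the strong classifier."""
    y = [1 if label == 1 else -1 for label in labels]
    n = len(X)
    w = [1.0 / n] * n                                    # initial weight distribution
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-12), 1 - 1e-12)       # clip to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)          # coefficient of this classifier
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the weighted vote, mapped back to a {0, 1} label."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else 0


X = [[0.1, 0.9], [0.2, 0.4], [0.8, 0.5], [0.9, 0.1]]
labels = [0, 0, 1, 1]
model = adaboost_fit(X, labels, rounds=5)
predictions = [adaboost_predict(model, row) for row in X]
```

In practice each row of `X` would be the selected path-score feature vector of one virus-host pair, with label 1 for a known association.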

To verify the validity of the method, comparative experiments were performed on multiple data sets. The experiments used four data sets drawn from published papers and authoritative bioinformatics websites; their basic information is shown in Table 1:

TABLE 1 Basic information of the four data sets

| | Dataset I | Dataset II | Dataset III | Dataset IV |
|---|---|---|---|---|
| Number of viruses | 728 | 32 | 312 | 1380 |
| Number of hosts | 129 | 119 | 747 | 221 |
| Known associations | 728 | 368 | 4539 | 1479 |
| Unknown associations | 93184 | 3440 | 228525 | 303501 |
| Sparsity ratio | 0.0078 | 0.1070 | 0.0199 | 0.0048 |

The invention was compared with five other association prediction methods:

■ ILMF-VH predicts virus-host associations based on the fusion of multiple information matrices. The virus similarity network is constructed from oligonucleotide frequency (ONF) measures, and the host similarity network is constructed by integrating the oligonucleotide frequency similarity and the Gaussian interaction profile (GIP) kernel similarity of the hosts through similarity network fusion (SNF). A neighborhood regularized logistic matrix factorization algorithm is then executed on the virus-host heterogeneous network to predict virus-host associations;

■ LAGCN (layer attention graph convolutional network) integrates the known virus-host associations, virus-virus similarity and host-host similarity into a heterogeneous network and applies graph convolution on it to learn embeddings of the viruses and hosts. LAGCN then combines the embeddings from multiple graph convolution layers using an attention mechanism. The method performs well in predicting virus-host associations;

■ NetLapRLS trains on the virus domain and the host domain separately, using semi-supervised learning with regularized least squares on the combined known virus-host interaction network, and then combines the two to obtain the final prediction;

■ BLM-NII performs neighbor-based interaction-profile inferring (NII) and integrates it into a supervised learning approach, the bipartite local model (BLM), to handle the problem of new candidates. Specifically, the inferred interactions are treated as label information and used for model learning on new candidates;

■ CMF projects viruses and hosts into a common low-rank feature space and predicts virus-host interactions by combining the two resulting low-rank matrices.

The evaluation indices used are AUC and AUPR, i.e. the area under the receiver operating characteristic (ROC) curve and the area under the precision-recall curve. The experimental results are shown in Table 2:

TABLE 2 Comparison of the experimental results of the invention and other methods on the four data sets

| Data set | Evaluation index | Ours | ILMF-VH | LAGCN | NetLapRLS | BLM-NII | CMF |
|---|---|---|---|---|---|---|---|
| Dataset I | AUC | 0.99991 | 0.75380 | 0.92508 | 0.08741 | 0.86028 | 0.76867 |
| | AUPR | 0.99086 | 0.21475 | 0.79621 | 0.00422 | 0.24655 | 0.04473 |
| Dataset II | AUC | 0.98955 | 0.79128 | 0.79811 | 0.76468 | 0.80453 | 0.50939 |
| | AUPR | 0.91827 | 0.30862 | 0.41345 | 0.50196 | 0.48382 | 0.22213 |
| Dataset III | AUC | 0.99999 | 0.99391 | 0.99868 | 0.99740 | 0.99683 | 0.77741 |
| | AUPR | 0.99999 | 0.63898 | 0.96357 | 0.97915 | 0.90456 | 0.42784 |
| Dataset IV | AUC | 0.99965 | 0.82112 | 0.91179 | 0.69508 | 0.90606 | 0.73420 |
| | AUPR | 0.96485 | 0.24104 | 0.73203 | 0.01979 | 0.38681 | 0.02030 |
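For reference, the two evaluation indices can be computed from ranked prediction scores as in the sketch below. AUC is computed as the fraction of positive-negative pairs ranked correctly, and AUPR is approximated by average precision, one common estimator of the area under the precision-recall curve. The source does not state which estimator was used, so this choice and the function names are assumptions.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
```

On sparse data such as these data sets, where known associations are a small fraction of all pairs, AUPR is usually the more demanding of the two indices because it is sensitive to false positives among the top-ranked predictions.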

The top ten associations predicted by the invention on Dataset IV are shown in Table 3:

TABLE 3 Top ten associations predicted by the invention

| Rank | Host Name | Virus Name | Evidence |
|---|---|---|---|
| 1 | Campylobacter jejuni | Campylobacter phage CP8 | PMID:32054081 |
| 2 | Erysimum | Listeria phage A118 | unknown |
| 3 | Erwinia sp. | Erwinia phage phiEa1H | PMID:26555076 |
| 4 | Klebsiella pneumoniae | Klebsiella phage PMBT1 | PMID:31976857 |
| 5 | Pseudomonas syringae | Pseudomonas phage phiPSA2 | PMID:32610695 |
| 6 | Lactococcus lactis subsp. cremoris | Lactococcus phage P680 | PMID:30135597 |
| 7 | Gordonia terrae | Gordonia phage Troje | unknown |
| 8 | Lactococcus sp. | Lactococcus phage fd13 | unknown |
| 9 | Aeropyrum pernix K1 | Aeropyrum pernix bacilliform virus 1 | PMID:21784945 |
| 10 | Pseudomonas aeruginosa | Pseudomonas phage MP1412 | PMID:26115051 |

These results show that the accuracy of the virus-host association prediction method based on network fusion and graph embedding is markedly higher than that of the existing common methods, which demonstrates the superiority of the invention.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
