Virus-host correlation prediction method based on network fusion and graph embedding

Document No.: 191713    Publication date: 2021-11-02

Reading note: this technology, "Virus-host correlation prediction method based on network fusion and graph embedding", was designed by 朱强, 代庆辉, 李丽 and 胡新荣 on 2021-07-06. Its main content is as follows. The invention discloses a virus-host association prediction method based on network fusion and graph embedding. Two kinds of virus-virus and host-host similarity networks are constructed, one with a similarity network fusion method and one with a graph embedding method. A graph mining approach is proposed that extracts meta-path scores from a graph; with it, a feature vector for each virus-host pair is obtained from the two networks, and a machine learning method then produces the final result. The invention achieves high accuracy on existing data sets and performs more stably than other methods. Some of the virus-host associations it predicts have been verified in published papers and databases. It also predicts new virus-host associations that do not appear in the known literature or databases, and these can provide effective guidance for experimental verification.

1. A virus-host association prediction method based on network fusion and graph embedding, characterized by comprising the following steps:

step 1, acquiring known virus-host associations;

step 2, measuring the similarity of each pair of viruses and of each pair of hosts, and thereby constructing a plurality of virus-virus similarity networks and a plurality of host-host similarity networks;

step 3, integrating the virus-virus similarity networks and the host-host similarity networks obtained in step 2 using a similarity network fusion algorithm, to obtain a virus fusion similarity matrix and a host fusion similarity matrix;

step 4, constructing a heterogeneous network G₁ from the virus fusion similarity matrix and the host fusion similarity matrix obtained in step 3 and the known virus-host associations obtained in step 1;

step 5, applying a graph mining technique to the training portion of the virus-host associations obtained in step 1 to generate a feature representation of each node, wherein the nodes comprise virus nodes and host nodes;

step 6, calculating the cosine similarity between the feature vector of each virus obtained in step 5 and the feature vectors of the other viruses, and between the feature vector of each host and the feature vectors of the other hosts, and thereby constructing a virus cosine similarity matrix and a host cosine similarity matrix;

step 7, constructing a heterogeneous network G₂ from the virus cosine similarity matrix and the host cosine similarity matrix obtained in step 6 and the known virus-host associations obtained in step 1;

step 8, for G₁ obtained in step 4 and G₂ obtained in step 7, extracting the corresponding meta-path scores from graph G₁ and from graph G₂ based on the path structures and their characteristics;

step 9, performing feature selection to eliminate weak features, and then generating the feature vector X and the labels Y of all virus-host pairs;

and step 10, inputting the feature vector X and the labels Y obtained in step 9 into a supervised machine learning prediction model.

2. The method of claim 1, wherein the similarity in step 2 is measured using oligonucleotide frequencies or Gaussian interaction profiles, and the similarity measurement using oligonucleotide frequencies is implemented as follows:

distance measures on genomic oligonucleotide frequency vectors, including JS, Hao and Teeling, are used to calculate the distance between each pair of viruses and between each pair of hosts, and thereby to measure the similarity of each pair of viruses and of each pair of hosts;

Two of these measures are defined by formula (1) and formula (2), in the following setting:

suppose two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are composed of letters from a finite alphabet $\Lambda$ of size $d$, and for $a \in \Lambda$ let $p_a$ denote the probability of the letter $a$ occurring. For $w = (w_1, \ldots, w_k) \in \Lambda^k$, let $X_w$ count the number of occurrences of $w$ in $A$ and, similarly, let $Y_w$ count the number of occurrences of $w$ in $B$. If $X$ and $Y$ are independent mean-zero normal variables, each with its own variance, the quantities derived from them are also normal. With the probability of $w = (w_1, \ldots, w_k)$ occurring defined accordingly, the count variable is expressed as (1).

A second count variable is given by (2). There the letter probabilities are not observed and are estimated instead: the probability of a letter $a$ is taken to be its relative count in the concatenation of the two sequences, which are assumed to be independent of each other and both generated from independent letters of the same distribution. These estimates are then used to estimate the probability of $w = (w_1, \ldots, w_k)$ occurring.

Hao is defined by formula (3):

the two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are converted, character by character, into vectors $a = (a_1, a_2, \ldots, a_N)$ and $b = (b_1, b_2, \ldots, b_N)$, where $N \in [1, 4^k]$; the correlation $C(a, b)$ between $a$ and $b$ is the cosine of the angle between the two representative vectors in the $N$-dimensional space:

Teeling is derived from formulas (5), (6) and (7) together with the Pearson correlation coefficient:

the observed frequency of a tetranucleotide is denoted $N(n_1n_2n_3n_4)$; the corresponding expected frequency is calculated with a maximal-order Markov model:

the variance is:

the significance of over- or under-representation, i.e. of the difference between the observed and expected frequencies, is assessed with a Z-score:

if two genomic fragments A and B show similar patterns of tetranucleotide over- and under-representation, the similarity can be quantified by calculating the Pearson correlation coefficient of their Z-scores;

The JS divergence is defined by formula (8):

given a sequence $S$ of length $N$, its log-likelihood under a Markov model is

$\lambda(S) = \sum n(b_1 \ldots b_k b)\,\log P(b \mid b_1 \ldots b_k)$ (9)

The JS divergence is used to measure the difference between the probability distributions $P = \lambda(S_1)$ and $Q = \lambda(S_2)$ of two sequences $S_1$ and $S_2$. The JS divergence is a variant of the KL divergence, which is defined as follows:

3. The method of claim 2, wherein the Gaussian kernel interaction profile is calculated in two steps:

first, the interaction profile $IP(v_i)$ of virus $v_i$ is a binary vector encoding the presence or absence of an association between $v_i$ and each host in the known virus-host network; second, the Gaussian kernel similarity between virus $v_i$ and virus $v_j$ is calculated from their interaction profiles and is defined as follows:

$S^v(v_i, v_j) = \exp\left(-\gamma_v \lVert IP(v_i) - IP(v_j) \rVert^2\right)$ (11)

where the parameter $\gamma_v$ is the kernel bandwidth, which is defined through a new bandwidth parameter as:

$N_h$ is the number of hosts, and $\gamma'_v$ is set to 1 according to previous studies; analogously, the Gaussian kernel similarity between host $h_i$ and host $h_j$ is defined as:

$S^h(h_i, h_j) = \exp\left(-\gamma_h \lVert IP(h_i) - IP(h_j) \rVert^2\right)$ (13)

whose kernel bandwidth parameter is defined as:

where $N_v$ is the number of viruses and $\gamma'_h$ is set to 1.

4. The method of claim 1, wherein step 3 is implemented as follows:

taking the virus similarity networks as an example, the edge weights of each virus similarity network are represented by an $N_v \times N_v$ matrix $S^v$; a normalized weight matrix $P$, defined by formula (15), can then be obtained for each similarity network:

in formula (15), $S(i, j)$ is an element of $S^v$, where $i$ and $j$ are the row and column indices of the matrix; local relationships are then measured using K-nearest neighbors, as defined by formula (16):

in formula (16), $N_i$ denotes the set of neighbors of the $i$-th virus;

$P^{(v)}$ is obtained from formula (15) and $KNN^{(v)}$ from formula (16). In formula (15), $P_{i,j}$ carries the similarity of the $i$-th virus to all other viruses, whereas $KNN(i, j)$ in formula (16) carries the similarity of the $i$-th virus only to its neighboring viruses. In the similarity network fusion algorithm (SNF), $P_{i,j}$ is always used as the initial state, while $KNN(i, j)$ is used as the kernel matrix during fusion, both to capture local structure and for computational efficiency. SNF iteratively updates the similarity matrices, as defined by formula (17):

where $P^{(k)}$ is the similarity matrix at step $t$, with initial value $P_{i,j}$, and $P^{(v)}$ is the similarity matrix at step $t+1$; each application of formula (17) updates the matrices $P^{(v)}$, producing $m$ parallel interchanging diffusion processes on the $m$ virus networks;

SNF then measures local relationships with the K-nearest-neighbor method to filter out low-similarity edges and, after multiple iterations, yields a final matrix; SNF fusion thus gives the virus similarity matrix and the host similarity matrix.

5. The method of claim 1, wherein step 5 is implemented as follows:

the Node2vec algorithm framework is used to perform representation learning on the virus-host heterogeneous network G constructed from the known virus-host associations obtained in step 1; the heterogeneous network G contains only the known virus-host associations and uses neither the virus-virus similarity networks nor the host-host similarity networks. Node2vec introduces two hyperparameters $p$ and $q$ to control the random walk strategy. If the current random walk has reached vertex $v$ through edge $(t, v)$, then $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $\pi_{vx}$ is the unnormalized transition probability from vertex $v$ to vertex $x$, $w_{vx}$ is the weight of the edge between $v$ and $x$, and the path sampling strategy $\alpha_{pq}(t, x)$ is defined as follows:

in formula (18), $d_{tx}$ is the shortest-path distance between vertex $t$ and vertex $x$; the neighborhood set of a node can be obtained by sampling with formula (18);

let $f(u)$ be the mapping function that maps node $u$ to its embedding vector, and for any node $u$ in the graph let $N_s(u)$ be the set of neighbors of $u$ sampled by formula (18); the $f(u)$ that maximizes the probability of observing its neighbors is obtained from formula (19):

under the following two assumptions:

(1) conditional independence: given a source vertex, the probability that its neighbor vertex $n_i$ appears is independent of the remaining vertices in the neighbor set;

(2) feature-space symmetry: a vertex uses the same embedding vector whether it acts as a source vertex or as a neighbor vertex;

formula (19) is reduced to the final objective function, formula (22):

in formula (22), because the normalization factor is expensive to compute, negative sampling is used for optimization;

maximizing the final objective function (22) yields the functional form of $f(u)$, and hence the feature vector of each node.

6. The method of claim 1, wherein step 8 is implemented as follows:

for each simple path of each virus-host pair, starting at the source node (the host node) and ending at the target node (the virus node), a path score is calculated using formula (23):

in formula (23), $P = \{p_1, p_2, \ldots, p_n\}$ is the set of paths connecting host node $h_i$ and virus node $v_j$, and $P_{weights}$ denotes the weights of the edges between nodes; the path score is the product of all edge weights from the starting host node to the ending virus node in each path structure. To reduce the amount of computation, the path length is limited to at most 3, which gives 6 path structures, each starting at a host node and ending at a virus node: Path1 (H-H-V), Path2 (H-V-V), Path3 (H-H-H-V), Path4 (H-H-V-V), Path5 (H-V-V-V) and Path6 (H-V-H-V). Two features are mined for each path structure:

(1) the sum of all meta-path scores under each path structure:

(2) the highest of all meta-path scores under each path structure:

a meta-path refers to all paths that share the same path structure, and the meta-path score is the product of all edge weights from the starting host node to the ending virus node in the path structure; ASP denotes the meta-paths between a virus $v_j$ and a host $h_i$. To ensure that longer paths are not penalized, each maximum or sum path score is calculated separately, and each score considers all paths belonging to a particular path structure.

7. The method of claim 1, wherein in step 9 an Adaboost classification model is used as the prediction model; Adaboost assigns different weights to the $M$ weak classifiers according to how well each classifies the sample data and combines them into a strong classifier, and the Adaboost algorithm proceeds as follows:

(1) given a binary classification data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x$ denotes an input sample and $y$ the class to which the sample belongs, initialize the weight distribution of the training data:

each of the $M$ classifiers $G_m(x): x \to \{0, 1\}$, $m \in (1, M)$, is trained on the data with the current weight distribution;

(2) compute the classification error rate of the weak classifier $G_m(x)$:

compute the coefficient of $G_m(x)$:

update the weight distribution of the training data:

$D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,j}, \ldots, w_{m+1,N})$ (29)

where $G_m(x_i)$ denotes the result of the weak classifier on the sample $x_i$;

(3) construct a linear combination of the base classifiers:

the final classifier:

Technical Field

The invention belongs to the interdisciplinary field of bioinformatics, computational biology and artificial intelligence, and in particular relates to a virus-host association prediction method based on network fusion and graph embedding.

Background

Viruses depend on their hosts to survive and play an important role in community structure and function, but viruses are highly diverse and their relationships with hosts vary. Traditional experimental methods for identifying the relationship between a virus and its host are costly and slow, uncertain factors can affect the results, and the success rate is low, so a more efficient and accurate approach is needed. Computational methods that predict virus-host interactions from mathematical models have therefore received increasing attention. Viruses and hosts both face natural selection pressure and are in constant competition: the host must develop resistance to protect itself from infection, while a virus that cannot infect a host cannot survive. One outcome may be that the virus integrates its genes into the host, and this information can be used to identify the host of the virus; that is, the virus and the host are functionally related.

To overcome the limitations of exploring virus-host associations by traditional experiments, researchers have proposed computational techniques for predicting new virus-host associations. These techniques require known virus-host associations as input, and some also require virus-virus and host-host associations. Within a sample or community, microorganisms (bacteria, viruses and so on) compete for nutrients or territory and interact through mutualism, parasitism, antagonism and other relationships, forming a complex network of interactions known as a heterogeneous network. The nodes of the heterogeneous network are bacteria and viruses, and the various bacteria-bacteria, virus-virus and bacteria-virus interactions form its edges. Traditional heterogeneous network mining typically begins by extracting structural features, such as object relationships, network structure and meta-paths, and then feeds these features into a machine learning model for the downstream learning task. However, designing features by hand is time-consuming and labor-intensive, and the features do not transfer: hand-designed features usually suit only specific application scenarios and are not general. Data mining on heterogeneous networks has therefore shifted toward representation learning based on graph neural networks. Heterogeneous network representation learning assumes that the intrinsic structural and semantic properties of the network can be encoded into latent low-dimensional vectors, so that a model can automatically learn latent low-dimensional representations of network objects such as vertices, edges and subgraphs, which benefits the downstream learning task.

For example, some feature-based classification methods sample virus-host associations, represent each sample as a feature vector built from side information about the virus and the host, and then use a classifier to decide whether an association exists. Although various methods exist for predicting virus-host interactions, prediction models based on a single source of information have low accuracy. As the number of discovered viruses grows, new and efficient analysis methods are needed that integrate multiple types of virus-host and virus-virus feature information to predict virus-host relationships more accurately and more quickly.

Disclosure of Invention

The invention aims to solve the problems in the background art and provides a virus-host association prediction method based on network fusion and graph embedding.

To further improve the accuracy of predicting virus-host associations, the invention proposes computing multiple virus and host similarity networks using the topological information of the virus-host association network. The method converts the virus-host association prediction problem into a link prediction problem between nodes in a heterogeneous network, and fuses the various kinds of network information using graph embedding and similarity network fusion, thereby avoiding the limitations of other methods. The technical solution of the invention is a computational method for predicting virus-host interactions based on graph embedding, which comprises the following steps:

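step 1, acquiring known virus-host associations;

step 2, measuring the similarity of each pair of viruses and of each pair of hosts, and thereby constructing a plurality of virus-virus similarity networks and a plurality of host-host similarity networks;

step 3, integrating the virus-virus similarity networks and the host-host similarity networks obtained in step 2 using a similarity network fusion algorithm, to obtain a virus fusion similarity matrix and a host fusion similarity matrix;

step 4, constructing a heterogeneous network G₁ from the virus fusion similarity matrix and the host fusion similarity matrix obtained in step 3 and the known virus-host associations obtained in step 1;

step 5, applying a graph mining technique to the training portion of the virus-host associations obtained in step 1 to generate a feature representation of each node, wherein the nodes comprise virus nodes and host nodes;

step 6, calculating the cosine similarity between the feature vector of each virus obtained in step 5 and the feature vectors of the other viruses, and between the feature vector of each host and the feature vectors of the other hosts, and thereby constructing a virus cosine similarity matrix and a host cosine similarity matrix;

step 7, constructing a heterogeneous network G₂ from the virus cosine similarity matrix and the host cosine similarity matrix obtained in step 6 and the known virus-host associations obtained in step 1;

step 8, for G₁ obtained in step 4 and G₂ obtained in step 7, extracting the corresponding meta-path scores from graph G₁ and from graph G₂ based on the path structures and their characteristics;

step 9, performing feature selection to eliminate weak features, and then generating the feature vector X and the labels Y of all virus-host pairs;

and step 10, inputting the feature vector X and the labels Y obtained in step 9 into a supervised machine learning prediction model.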
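Step 6 of the method, the cosine similarity between node feature vectors, can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are made-up two-dimensional vectors, the function name `cosine_similarity_matrix` is hypothetical, and a zero vector is given similarity 0 by convention.

```python
import math

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix for equal-length vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] == 0.0 or norms[j] == 0.0:
                continue  # assumption: a zero vector is similar to nothing
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim

# Hypothetical 2-dimensional embeddings for three virus nodes.
virus_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
virus_cosine = cosine_similarity_matrix(virus_embeddings)
```

The same function applied to the host embeddings would give the host cosine similarity matrix used in step 7.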

Further, the similarity in step 2 is measured using oligonucleotide frequencies or Gaussian interaction profiles; the similarity measurement using oligonucleotide frequencies is implemented as follows:

Two of the measures are defined by formula (1) and formula (2), in the following setting:

suppose two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are composed of letters from a finite alphabet $\Lambda$ of size $d$, and for $a \in \Lambda$ let $p_a$ denote the probability of the letter $a$ occurring. For $w = (w_1, \ldots, w_k) \in \Lambda^k$, let $X_w$ count the number of occurrences of $w$ in $A$ and, similarly, let $Y_w$ count the number of occurrences of $w$ in $B$. If $X$ and $Y$ are independent mean-zero normal variables, each with its own variance, the quantities derived from them are also normal. With the probability of $w = (w_1, \ldots, w_k)$ occurring defined accordingly, the count variable is expressed as (1).

A second count variable is given by (2). There the letter probabilities are not observed and are estimated instead: the probability of a letter $a$ is taken to be its relative count in the concatenation of the two sequences, which are assumed to be independent of each other and both generated from independent letters of the same distribution. These estimates are then used to estimate the probability of $w = (w_1, \ldots, w_k)$ occurring.

Hao is defined by formula (3):

the two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are converted, character by character, into vectors $a = (a_1, a_2, \ldots, a_N)$ and $b = (b_1, b_2, \ldots, b_N)$, where $N \in [1, 4^k]$; the correlation $C(a, b)$ between $a$ and $b$ is the cosine of the angle between the two representative vectors in the $N$-dimensional space:
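A sketch of the correlation just described, assuming the standard cosine form; the symbols follow the text, with $a_i$ and $b_i$ the components of the two vectors:

```latex
C(a, b) = \frac{\sum_{i=1}^{N} a_i\, b_i}
               {\sqrt{\sum_{i=1}^{N} a_i^{2}}\;\sqrt{\sum_{i=1}^{N} b_i^{2}}}
```

Under this form $C(a, b)$ lies in $[-1, 1]$ and equals 1 when the two vectors point in the same direction.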

Teeling is derived from formulas (5), (6) and (7) together with the Pearson correlation coefficient:

the observed frequency of a tetranucleotide is denoted $N(n_1n_2n_3n_4)$; the corresponding expected frequency is calculated with a maximal-order Markov model:

the variance is:

the significance of over- or under-representation, i.e. of the difference between the observed and expected frequencies, is assessed with a Z-score:
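For orientation, the tetranucleotide statistics described above are commonly written as follows. This is a sketch assuming the usual maximal-order Markov approximation; the symbols $E$, $\mathrm{var}$ and $Z$ are notation chosen here and may differ from the numbered formulas of the text:

```latex
\begin{aligned}
E(n_1n_2n_3n_4) &= \frac{N(n_1n_2n_3)\,N(n_2n_3n_4)}{N(n_2n_3)} \\[4pt]
\mathrm{var}\bigl(N(n_1n_2n_3n_4)\bigr) &= E(n_1n_2n_3n_4)\,
  \frac{\bigl[N(n_2n_3) - N(n_1n_2n_3)\bigr]\bigl[N(n_2n_3) - N(n_2n_3n_4)\bigr]}{N(n_2n_3)^{2}} \\[4pt]
Z(n_1n_2n_3n_4) &= \frac{N(n_1n_2n_3n_4) - E(n_1n_2n_3n_4)}
  {\sqrt{\mathrm{var}\bigl(N(n_1n_2n_3n_4)\bigr)}}
\end{aligned}
```

Each genome then yields a vector of 256 Z-scores, one per tetranucleotide, and two genomes are compared through the Pearson correlation of these vectors.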

The JS divergence is defined by formula (8):

given a sequence $S$ of length $N$, its log-likelihood under a Markov model is

$\lambda(S) = \sum n(b_1 \ldots b_k b)\,\log P(b \mid b_1 \ldots b_k)$ (9)

The JS divergence is used to measure the difference between the probability distributions $P = \lambda(S_1)$ and $Q = \lambda(S_2)$ of two sequences $S_1$ and $S_2$. The JS divergence is a variant of the KL divergence (Kullback-Leibler divergence), which is defined as follows:
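The relationship between the two divergences can be sketched in a few lines. This is an illustration under stated assumptions: $P$ and $Q$ are taken to be discrete distributions over the same set of words, the logarithm is base 2, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of P and Q to their midpoint M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical word distributions of two sequences.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
d = js_divergence(p, q)
```

Unlike the KL divergence, the JS divergence is symmetric in its arguments and stays finite when one distribution assigns zero probability to a word, because each distribution is compared with the midpoint rather than with the other distribution directly.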

Further, the Gaussian kernel interaction profile is calculated in two steps:

first, the interaction profile $IP(v_i)$ of virus $v_i$ is a binary vector encoding the presence or absence of an association between $v_i$ and each host in the known virus-host network; second, the Gaussian kernel similarity between virus $v_i$ and virus $v_j$ is calculated from their interaction profiles and is defined as follows:

$S^v(v_i, v_j) = \exp\left(-\gamma_v \lVert IP(v_i) - IP(v_j) \rVert^2\right)$ (11)

where the parameter $\gamma_v$ is the kernel bandwidth, which is defined through a new bandwidth parameter as:

$N_h$ is the number of hosts, and $\gamma'_v$ is set to 1 according to previous studies; analogously, the Gaussian kernel similarity between host $h_i$ and host $h_j$ is defined as:

$S^h(h_i, h_j) = \exp\left(-\gamma_h \lVert IP(h_i) - IP(h_j) \rVert^2\right)$ (13)

whose kernel bandwidth parameter is defined as:

where $N_v$ is the number of viruses and $\gamma'_h$ is set to 1.
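The two-step calculation can be sketched as below. This is an illustration, not the patent's exact formulas: the bandwidth is assumed to be $\gamma'$ divided by the mean squared norm of the interaction profiles, which is a common choice for this kernel, and the association matrix and the function name `gip_kernel` are made up for the example.

```python
import math

def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel over the rows of a binary matrix."""
    n = len(profiles)
    # Assumption: bandwidth = gamma' / mean squared norm of the profiles.
    mean_sq_norm = sum(sum(x * x for x in row) for row in profiles) / n
    gamma = gamma_prime / mean_sq_norm
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = math.exp(-gamma * sq_dist)
    return kernel

# Hypothetical association matrix: rows are viruses, columns are hosts.
associations = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
virus_kernel = gip_kernel(associations)
# Host profiles are the columns of the same matrix.
host_kernel = gip_kernel([list(col) for col in zip(*associations)])
```

Viruses that infect the same hosts have identical profiles and therefore similarity 1; the similarity decays as the profiles differ in more positions.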

Further, step 3 is implemented as follows:

taking the virus similarity networks as an example, the edge weights of each virus similarity network are represented by an $N_v \times N_v$ matrix $S^v$; a normalized weight matrix $P$, defined by formula (15), can then be obtained for each similarity network:

in formula (15), $S(i, j)$ is an element of $S^v$, where $i$ and $j$ are the row and column indices of the matrix; local relationships are then measured using K-nearest neighbors, as defined by formula (16):

in formula (16), $N_i$ denotes the set of neighbors of the $i$-th virus;

$P^{(v)}$ is obtained from formula (15) and $KNN^{(v)}$ from formula (16). In formula (15), $P_{i,j}$ carries the similarity of the $i$-th virus to all other viruses, whereas $KNN(i, j)$ in formula (16) carries the similarity of the $i$-th virus only to its neighboring viruses. In the similarity network fusion algorithm (SNF), $P_{i,j}$ is always used as the initial state, while $KNN(i, j)$ is used as the kernel matrix during fusion, both to capture local structure and for computational efficiency. SNF iteratively updates the similarity matrices, as defined by formula (17):

where $P^{(k)}$ is the similarity matrix at step $t$, with initial value $P_{i,j}$, and $P^{(v)}$ is the similarity matrix at step $t+1$; each application of formula (17) updates the matrices $P^{(v)}$, producing $m$ parallel interchanging diffusion processes on the $m$ virus networks;
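The fusion procedure can be sketched in plain Python as below. This is a simplified illustration that assumes the commonly published form of SNF: a full kernel whose off-diagonal row entries sum to 1/2, a K-nearest-neighbor kernel that excludes the node itself, the update $P^{(v)} \leftarrow KNN^{(v)} \cdot \overline{P^{(k \neq v)}} \cdot (KNN^{(v)})^{T}$, and re-normalization after every step. The function names and the two input matrices are hypothetical.

```python
def normalize(w):
    """Full kernel: off-diagonal entries of each row sum to 1/2, diagonal is 1/2."""
    n = len(w)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off_diag = sum(w[i][k] for k in range(n) if k != i)
        for j in range(n):
            p[i][j] = 0.5 if i == j else w[i][j] / (2.0 * off_diag)
    return p

def knn_kernel(w, k):
    """Sparse kernel: keep each row's k most similar other nodes, rescaled to sum to 1."""
    n = len(w)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: w[i][j], reverse=True)
        neighbours = others[:k]
        total = sum(w[i][j] for j in neighbours)
        for j in neighbours:
            s[i][j] = w[i][j] / total
    return s

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def snf(networks, k=2, iterations=10):
    """Fuse m >= 2 similarity matrices of the same size into one."""
    m, n = len(networks), len(networks[0])
    p = [normalize(w) for w in networks]
    s = [knn_kernel(w, k) for w in networks]
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Average state of all the other networks.
            others = [[sum(p[u][i][j] for u in range(m) if u != v) / (m - 1)
                       for j in range(n)] for i in range(n)]
            s_t = [list(col) for col in zip(*s[v])]
            updated.append(normalize(matmul(matmul(s[v], others), s_t)))
        p = updated
    fused = [[sum(p[v][i][j] for v in range(m)) / m for j in range(n)]
             for i in range(n)]
    # Symmetrise the averaged result.
    return [[(fused[i][j] + fused[j][i]) / 2.0 for j in range(n)] for i in range(n)]

# Two hypothetical virus similarity networks over the same four viruses.
w1 = [[1.0, 0.8, 0.1, 0.2], [0.8, 1.0, 0.3, 0.1],
      [0.1, 0.3, 1.0, 0.9], [0.2, 0.1, 0.9, 1.0]]
w2 = [[1.0, 0.6, 0.2, 0.1], [0.6, 1.0, 0.2, 0.3],
      [0.2, 0.2, 1.0, 0.7], [0.1, 0.3, 0.7, 1.0]]
fused = snf([w1, w2], k=2, iterations=10)
```

The sketch assumes strictly positive off-diagonal similarities; a production implementation would guard the divisions and use a numerical library for the matrix products.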

Further, step 5 is implemented as follows:

the Node2vec algorithm framework is used to perform representation learning on the virus-host heterogeneous network G constructed from the known virus-host associations obtained in step 1; the heterogeneous network G contains only the known virus-host associations and uses neither the virus-virus similarity networks nor the host-host similarity networks. Node2vec introduces two hyperparameters $p$ and $q$ to control the random walk strategy. If the current random walk has reached vertex $v$ through edge $(t, v)$, then $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $\pi_{vx}$ is the unnormalized transition probability from vertex $v$ to vertex $x$, $w_{vx}$ is the weight of the edge between $v$ and $x$, and the path sampling strategy $\alpha_{pq}(t, x)$ is defined as follows:

in formula (18), $d_{tx}$ is the shortest-path distance between vertex $t$ and vertex $x$; the neighborhood set of a node can be obtained by sampling with formula (18);
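The biased walk can be sketched as follows, assuming the standard node2vec search bias ($1/p$ when $d_{tx} = 0$, $1$ when $d_{tx} = 1$, $1/q$ when $d_{tx} = 2$). The graph, the parameter values and the function names are hypothetical.

```python
def alpha_pq(d_tx, p, q):
    """Search bias for a walk that came from t, sits at v, and considers x."""
    if d_tx == 0:
        return 1.0 / p   # x is t itself: step back to the previous vertex
    if d_tx == 1:
        return 1.0       # x is also a neighbour of t
    if d_tx == 2:
        return 1.0 / q   # x moves further away from t
    raise ValueError("d_tx must be 0, 1 or 2")

def transition_probabilities(prev, neighbours_of_v, adjacency, p, q):
    """Normalised probabilities of stepping from v to each neighbour x.

    neighbours_of_v maps x -> edge weight w_vx; adjacency maps a vertex to
    the set of its neighbours.
    """
    unnormalised = {}
    for x, w_vx in neighbours_of_v.items():
        if x == prev:
            d_tx = 0
        elif x in adjacency[prev]:
            d_tx = 1
        else:
            d_tx = 2
        unnormalised[x] = alpha_pq(d_tx, p, q) * w_vx
    total = sum(unnormalised.values())
    return {x: pi / total for x, pi in unnormalised.items()}

# Hypothetical bipartite graph: virus v1 is associated with hosts h1 and h2.
adjacency = {"h1": {"v1"}, "h2": {"v1"}, "v1": {"h1", "h2"}}
# The walk arrived at v1 from h1; both edges have weight 1.
probs = transition_probabilities("h1", {"h1": 1.0, "h2": 1.0}, adjacency, p=2.0, q=0.5)
```

Because the network G used in this step contains only virus-host edges, it is bipartite, so $t$ and $x$ always lie on the same side and $d_{tx} = 1$ cannot occur; only the return weight $1/p$ and the outward weight $1/q$ come into play.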

let $f(u)$ be the mapping function that maps node $u$ to a low-dimensional vector, and for any node $u$ in the graph let $N_s(u)$ be the set of neighbors of $u$ sampled by formula (18); the $f(u)$ that maximizes the probability of observing its neighbors is obtained from formula (19):

under the following two assumptions:

(1) conditional independence: given a source vertex, the probability that its neighbor vertex $n_i$ appears is independent of the remaining vertices in the neighbor set;

(2) feature-space symmetry: a vertex uses the same embedding vector whether it acts as a source vertex or as a neighbor vertex;

formula (19) is reduced to the final objective function, formula (22):

in formula (22), because the normalization factor is expensive to compute, negative sampling is used for optimization;

maximizing the final objective function (22) yields the functional form of $f(u)$, and hence the feature vector of each node.
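For reference, the derivation described above is usually written as follows. This is a sketch assuming the standard node2vec formulation; $V$ (the vertex set) and $Z_u$ (the normalization factor) are names used here for illustration, and the four lines correspond to the objective, the two assumptions, and the resulting final objective:

```latex
\begin{aligned}
&\max_{f} \sum_{u \in V} \log \Pr\bigl(N_s(u) \mid f(u)\bigr) \\[4pt]
&\Pr\bigl(N_s(u) \mid f(u)\bigr) = \prod_{n_i \in N_s(u)} \Pr\bigl(n_i \mid f(u)\bigr) \\[4pt]
&\Pr\bigl(n_i \mid f(u)\bigr) = \frac{\exp\bigl(f(n_i) \cdot f(u)\bigr)}{\sum_{v \in V} \exp\bigl(f(v) \cdot f(u)\bigr)} \\[4pt]
&\max_{f} \sum_{u \in V} \Bigl[ -\log Z_u + \sum_{n_i \in N_s(u)} f(n_i) \cdot f(u) \Bigr],
\qquad Z_u = \sum_{v \in V} \exp\bigl(f(u) \cdot f(v)\bigr)
\end{aligned}
```

$Z_u$ sums over every vertex in the graph, which is why it is approximated with negative sampling rather than computed exactly.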

Further, step 8 is implemented as follows:

for each simple path of each virus-host pair, starting at the source node (the host node) and ending at the target node (the virus node), a path score is calculated using formula (23):

in formula (23), $P = \{p_1, p_2, \ldots, p_n\}$ is the set of paths connecting host node $h_i$ and virus node $v_j$, and $P_{weights}$ denotes the weights of the edges between nodes; the path score is the product of all edge weights from the starting host node to the ending virus node in each path structure. To reduce the amount of computation, the path length is limited to at most 3, which gives 6 path structures, each starting at a host node and ending at a virus node: Path1 (H-H-V), Path2 (H-V-V), Path3 (H-H-H-V), Path4 (H-H-V-V), Path5 (H-V-V-V) and Path6 (H-V-H-V). Two features are mined for each path structure:

(1) the sum of all meta-path scores under each path structure:

(2) the highest of all meta-path scores under each path structure:

a meta-path refers to all paths that share the same path structure, and the meta-path score is the product of all edge weights from the starting host node to the ending virus node in the path structure; ASP denotes the meta-paths between a virus $v_j$ and a host $h_i$. To ensure that longer paths are not penalized, each maximum or sum path score is calculated separately, and each score considers all paths belonging to a particular path structure.
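The two features per path structure can be sketched with a small path enumerator. This is an illustration under stated assumptions: the heterogeneous network is held as three hypothetical matrices (`host_sim`, `virus_sim`, and a host-by-virus `assoc`), edges with weight 0 are treated as absent, and a pair with no path of a given structure gets the features (0, 0).

```python
def edge_weight(kind_a, a, kind_b, b, host_sim, virus_sim, assoc):
    """Weight of the edge between two typed nodes ('H' = host, 'V' = virus)."""
    if kind_a == "H" and kind_b == "H":
        return host_sim[a][b]
    if kind_a == "V" and kind_b == "V":
        return virus_sim[a][b]
    if kind_a == "H":            # H-V edge: assoc rows are hosts
        return assoc[a][b]
    return assoc[b][a]           # V-H edge

def path_scores(structure, host, virus, host_sim, virus_sim, assoc):
    """Scores of all simple paths host -> virus that follow `structure`, e.g. "HHV"."""
    sizes = {"H": len(host_sim), "V": len(virus_sim)}
    scores = []

    def extend(position, node, visited, score):
        if position == len(structure) - 1:
            if node == virus:
                scores.append(score)
            return
        next_kind = structure[position + 1]
        for nxt in range(sizes[next_kind]):
            if (next_kind, nxt) in visited:
                continue         # simple paths: no node is visited twice
            w = edge_weight(structure[position], node, next_kind, nxt,
                            host_sim, virus_sim, assoc)
            if w > 0:
                extend(position + 1, nxt, visited | {(next_kind, nxt)}, score * w)

    extend(0, host, {("H", host)}, 1.0)
    return scores

def meta_path_features(structure, host, virus, host_sim, virus_sim, assoc):
    """Return (sum of path scores, maximum path score) for one path structure."""
    scores = path_scores(structure, host, virus, host_sim, virus_sim, assoc)
    return (sum(scores), max(scores)) if scores else (0.0, 0.0)

# Hypothetical network with two hosts and two viruses.
host_sim = [[1.0, 0.5], [0.5, 1.0]]
virus_sim = [[1.0, 0.4], [0.4, 1.0]]
assoc = [[1, 0], [0, 1]]         # h0-v0 and h1-v1 are known associations
structures = ("HHV", "HVV", "HHHV", "HHVV", "HVVV", "HVHV")
features = {s: meta_path_features(s, 0, 1, host_sim, virus_sim, assoc)
            for s in structures}
```

Concatenating the twelve values (six structures, two features each) from one graph, and doing the same for the second graph, would give the raw feature vector of the pair (h0, v1) before feature selection.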

Further, in step 9 an Adaboost classification model is used as the prediction model; Adaboost assigns different weights to the $M$ weak classifiers according to how well each classifies the sample data and combines them into a strong classifier. The Adaboost algorithm proceeds as follows:

(1) given a binary classification data set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x$ denotes an input sample and $y$ the class to which the sample belongs, initialize the weight distribution of the training data:

each of the $M$ classifiers $G_m(x): x \to \{0, 1\}$, $m \in (1, M)$, is trained on the data with the current weight distribution;

(2) compute the classification error rate of the weak classifier $G_m(x)$:

compute the coefficient of $G_m(x)$:

update the weight distribution of the training data:

$D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,j}, \ldots, w_{m+1,N})$ (29)

where $G_m(x_i)$ denotes the result of the weak classifier on the sample $x_i$;

(3) construct a linear combination of the base classifiers:

the final classifier:
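One boosting round can be sketched as below, assuming the textbook AdaBoost update: weighted error $e_m$, coefficient $\alpha_m = \tfrac{1}{2}\ln\frac{1 - e_m}{e_m}$, and weights scaled by $e^{\alpha_m}$ for misclassified samples and $e^{-\alpha_m}$ for correct ones before normalization. The labels, the weak classifier's predictions and the function name are made up for the example.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: weighted error, classifier coefficient, updated weights."""
    error = sum(w for w, g, y in zip(weights, predictions, labels) if g != y)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples are scaled up by exp(alpha), correct ones down by exp(-alpha).
    raw = [w * math.exp(alpha if g != y else -alpha)
           for w, g, y in zip(weights, predictions, labels)]
    z = sum(raw)                      # normalisation constant
    return error, alpha, [r / z for r in raw]

labels = [1, 1, 0, 0]
predictions = [1, 0, 0, 0]            # hypothetical weak classifier, one mistake
weights = [0.25] * 4                  # uniform initial distribution, 1/N each
error, alpha, new_weights = adaboost_round(weights, predictions, labels)
```

After the update the misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right. The sketch requires an error strictly between 0 and 1.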

To address the limitations of existing virus-host association prediction methods, the invention provides a computational method for predicting virus-host interactions based on graph embedding, which greatly improves the accuracy of virus-host association prediction. Evaluated on four data sets, the invention achieves higher accuracy than other methods on all of them, and its model performance is more stable. In addition, part of the virus-host associations predicted by the invention have been verified in published papers and databases.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention.

Detailed Description

The technical solution of the present invention can be implemented by a person skilled in the art using computer software technology. Embodiments of the invention are described in detail below with reference to the accompanying drawings:

Step 1, obtaining the known virus-host associations from relevant bioinformatics papers and an authoritative bioinformatics website (NCBI).

Step 2, inferring the relationship between genomic sequences based on dissimilarity measures of genomic oligonucleotide frequencies. The invention uses measures including JS, Hao and Teeling to calculate the distance between the genomic oligonucleotide frequency vectors of each pair of viruses (and of each pair of hosts), and thereby measures the similarity of each pair of viruses (and of each pair of hosts).
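All of these measures start from an oligonucleotide (k-mer) frequency vector. A minimal sketch of building one is shown below; the function name, the example sequence and the choice to skip windows containing symbols other than A, C, G, T are assumptions made for illustration.

```python
from itertools import product

def kmer_frequency_vector(sequence, k):
    """Relative frequency of every length-k word over ACGT, in lexicographic order."""
    words = ["".join(w) for w in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:        # windows with other symbols (e.g. N) are skipped
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]

# Hypothetical short sequence; real genomes would be read from a FASTA file.
freq = kmer_frequency_vector("ACGTAC", 2)
```

The vector has $4^k$ entries, so for the tetranucleotides used by the Teeling measure each genome is described by 256 numbers.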

Two of these measures are defined by formula (1) and formula (2), in the following setting:

suppose two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are composed of letters from a finite alphabet $\Lambda$ of size $d$, and for $a \in \Lambda$ let $p_a$ denote the probability of the letter $a$ occurring. For $w = (w_1, \ldots, w_k) \in \Lambda^k$, let $X_w$ count the number of occurrences of $w$ in $A$ and, similarly, let $Y_w$ count the number of occurrences of $w$ in $B$. If $X$ and $Y$ are independent mean-zero normal variables, each with its own variance, the quantities derived from them are also normal. With the probability of $w = (w_1, \ldots, w_k)$ occurring defined accordingly, the count variable is expressed as (34).

A second count variable is given by (35). There the letter probabilities are not observed and are estimated instead: the probability of a letter $a$ is taken to be its relative count in the concatenation of the two sequences, which are assumed to be independent of each other and both generated from independent letters of the same distribution. These estimates are then used to estimate the probability of $w = (w_1, \ldots, w_k)$ occurring.

Hao is defined by formula (3):

the two sequences $A = A_1A_2\ldots A_n$ and $B = B_1B_2\ldots B_m$ are converted, character by character, into vectors $a = (a_1, a_2, \ldots, a_N)$ and $b = (b_1, b_2, \ldots, b_N)$, where $N \in [1, 4^k]$; the correlation $C(a, b)$ between $a$ and $b$ is the cosine of the angle between the two representative vectors in the $N$-dimensional space:

Teeling is derived from formulas (38), (39) and (40) together with the Pearson correlation coefficient:

The observed frequency of a tetranucleotide is denoted $N(n_1n_2n_3n_4)$. The corresponding expected frequency is calculated with a maximal-order Markov model:

The variance is:

The significance of over- or under-representation, i.e. of the difference between the observed and expected frequencies, is assessed using the Z-score:
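The three quantities named above are usually written as follows in tetranucleotide z-score statistics. The expressions below are an assumption drawn from that standard formulation rather than a transcription of the invention's formulas, and the symbols $E$, $\operatorname{var}$ and $Z$ are chosen here for illustration.

```latex
E(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3)\, N(n_2n_3n_4)}{N(n_2n_3)}

\operatorname{var}(n_1n_2n_3n_4) = E(n_1n_2n_3n_4)\,
  \frac{\bigl[N(n_2n_3) - N(n_1n_2n_3)\bigr]\,\bigl[N(n_2n_3) - N(n_2n_3n_4)\bigr]}{N(n_2n_3)^2}

Z(n_1n_2n_3n_4) = \frac{N(n_1n_2n_3n_4) - E(n_1n_2n_3n_4)}{\sqrt{\operatorname{var}(n_1n_2n_3n_4)}}
```

Under this reading, each genome is summarized by its vector of 256 z-scores, and the Pearson correlation coefficient between the vectors of two genomes gives their similarity.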

The JS divergence (Jensen-Shannon divergence) is defined by formula (41):

Given a sequence $S$ comprising $N$ genes, the log-likelihood of $S$ under a Markov model is

$\lambda(S) = \sum n(b_1 \ldots b_k b)\, \log P(b \mid b_1 \ldots b_k)$ (42)

In addition, the invention calculates the Gaussian interaction profile (GIP) kernel similarity between viruses (and between hosts) from the known virus-host association matrix. The GIP kernel is currently a widely used method for extracting similarity information from a virus-host association network. Its calculation consists of two main steps. First, the interaction profile $IP(v_i)$ of virus $v_i$ is a binary vector that encodes whether or not $v_i$ is associated with each host in the known virus-host network. Second, the Gaussian kernel similarity between virus $v_i$ and virus $v_j$ is calculated from their interaction profiles and is defined as follows:

$S^v(v_i, v_j) = \exp\left(-\gamma_v \lVert IP(v_i) - IP(v_j) \rVert^2\right)$ (44)

where the parameter $\gamma_v$ is the kernel bandwidth. A new kernel bandwidth parameter is defined as:

where $N_h$ is the number of hosts and, following previous studies, $\gamma'_v$ is set to 1. Analogously, the Gaussian kernel similarity between host $h_i$ and host $h_j$ is defined as:

$S^h(h_i, h_j) = \exp\left(-\gamma_h \lVert IP(h_i) - IP(h_j) \rVert^2\right)$ (46)

Its kernel bandwidth parameter is defined as:

where $N_v$ is the number of viruses and $\gamma'_h$ is set to 1.
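The two-step GIP calculation can be illustrated with a short Python sketch. The bandwidth normalization is an assumption: the sketch divides $\gamma'$ by the mean squared norm of the interaction profiles, as in the usual GIP formulation, because the exact normalization formula is not legible in the text. The function name `gip_similarity` is hypothetical.

```python
from math import exp


def gip_similarity(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel between binary interaction profiles.

    profiles: one binary vector per entity (rows of the association matrix
    for viruses, columns for hosts).
    """
    n = len(profiles)
    # Assumed normalization: gamma = gamma' / mean squared norm of the profiles.
    mean_sq_norm = sum(sum(x * x for x in p) for p in profiles) / n
    gamma = gamma_prime / mean_sq_norm if mean_sq_norm > 0 else gamma_prime
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = exp(-gamma * sq_dist)
    return sim


if __name__ == "__main__":
    # Rows are viruses, columns are hosts (1 = known association).
    association = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
    virus_sim = gip_similarity(association)
    host_profiles = [list(col) for col in zip(*association)]
    host_sim = gip_similarity(host_profiles)
    print(virus_sim[0][2], host_sim[0][2])
```

Viruses with identical host profiles receive similarity 1, and the similarity decays with the number of hosts on which two profiles differ.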

Step 3, the invention uses the similarity network fusion (SNF) algorithm. Taking the virus similarity networks as an example, the edge weights of each virus similarity network are represented by an $N_v \times N_v$ matrix $S^v$. A normalized weight matrix $P$, defined by formula (48), can then be obtained for each similarity network:

In formula (48), $S(i, j)$ is the element of $S^v$ in row $i$ and column $j$. The local relationship is then measured using K-nearest neighbors (KNN), defined by formula (49):

In formula (49), $N_i$ denotes the neighbors of virus $i$; the number of neighbors is predefined. The distance between each element and the other elements can be calculated from the similarity matrix, and the top K elements are selected according to their similarity (that is, how near they are). Here the number of neighbors is set to 5, which filters out edges with low similarity.

Formula (48) yields $P^{(v)}$ and formula (49) yields $KNN^{(v)}$. In formula (48), $P_{i,j}$ carries the similarity of the $i$-th virus to all other viruses, whereas $KNN(i, j)$ in formula (49) carries only the similarity of the $i$-th virus to its neighboring viruses. In the SNF algorithm, $P_{i,j}$ is always used as the initial state, while $KNN(i, j)$ serves as the kernel matrix in the fusion process, both to capture local structure and for computational efficiency. The SNF process is an iterative update of the similarity matrices, defined by formula (50):

where $P^{(k)}$ is the similarity matrix of the $k$-th network at step $t$, with initial value $P_{i,j}$, and $P^{(v)}$ is the similarity matrix at step $t+1$. Each application of formula (50) updates the matrices $P^{(v)}$, producing $m$ parallel interchanging diffusion processes on the $m$ virus networks.

SNF thus measures local relationships with the K-nearest neighbor (KNN) method to filter out low-similarity edges, and obtains the final matrix after multiple iterations. SNF fusion yields the fused virus similarity matrix and the fused host similarity matrix.
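The fusion procedure described in this step can be sketched in plain Python. The sketch follows the commonly published form of SNF and rests on several assumptions that the text does not confirm: the diagonal of the normalized matrix is set to 1/2, a node is not counted as its own neighbor, the fused result is the average of the per-network matrices after the last iteration, and at least two networks are supplied. All function names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def full_kernel(s):
    """Normalized matrix P: diagonal 1/2, off-diagonal entries share the other 1/2."""
    n = len(s)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off_diag = sum(s[i][k] for k in range(n) if k != i)
        for j in range(n):
            if i == j:
                p[i][j] = 0.5
            elif off_diag > 0:
                p[i][j] = s[i][j] / (2.0 * off_diag)
    return p


def knn_kernel(s, k):
    """Local kernel: keep only each node's k most similar neighbors, row-normalized."""
    n = len(s)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: s[i][j], reverse=True)
        neighbors = order[:k]
        total = sum(s[i][j] for j in neighbors)
        for j in neighbors:
            if total > 0:
                out[i][j] = s[i][j] / total
    return out


def snf(sims, k=5, iterations=20):
    """Fuse m >= 2 similarity matrices by cross-network diffusion."""
    m, n = len(sims), len(sims[0])
    p = [full_kernel(s) for s in sims]
    kern = [knn_kernel(s, k) for s in sims]
    for _ in range(iterations):
        updated = []
        for v in range(m):
            # Average of the current matrices of all the other networks.
            avg = [[sum(p[u][i][j] for u in range(m) if u != v) / (m - 1)
                    for j in range(n)] for i in range(n)]
            updated.append(matmul(matmul(kern[v], avg), transpose(kern[v])))
        p = updated
    return [[sum(q[i][j] for q in p) / m for j in range(n)] for i in range(n)]
```

For example, `snf([gip_matrix, onf_matrix], k=5)` would fuse a GIP-based and an oligonucleotide-frequency-based virus network, where both argument names are placeholders for matrices computed in Step 2.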

Step 4, obtaining the fused virus similarity matrix and the fused host similarity matrix from step 3, and constructing a heterogeneous network $G_1$ from them together with the known virus-host associations obtained in step 1.

Step 5, using the node2vec algorithm framework to perform representation learning on the virus-host heterogeneous network $G$ constructed from the known virus-host associations obtained in step 1. The heterogeneous network $G$ contains only the known virus-host associations and uses neither the virus-virus similarity networks nor the host-host similarity networks. node2vec introduces two hyper-parameters $p$ and $q$ to control the random-walk strategy. Suppose the current random walk has traversed edge $(t, v)$ and reached vertex $v$. Set $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $\pi_{vx}$ is the unnormalized transition probability between vertex $v$ and vertex $x$, and $w_{vx}$ is the weight of the edge between vertex $v$ and vertex $x$. The path sampling strategy $\alpha_{pq}(t, x)$ is defined as follows:

In formula (51), $d_{tx}$ is the shortest-path distance between vertex $t$ and vertex $x$. The neighborhood set of a node can be obtained using formula (51).
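The search bias and the resulting transition probabilities can be sketched as follows, using the values that the node2vec framework assigns to $\alpha_{pq}$ ($1/p$ for returning to $t$, $1$ for a common neighbor of $t$ and $v$, $1/q$ for moving further away). The graph representation, a dictionary of neighbor-to-weight dictionaries, and the function names are assumptions made for illustration.

```python
def alpha_pq(d_tx, p, q):
    """Search bias for a candidate next vertex x at shortest-path distance d_tx from t."""
    if d_tx == 0:
        return 1.0 / p  # x is the previous vertex t: return step
    if d_tx == 1:
        return 1.0      # x is also a neighbor of t
    if d_tx == 2:
        return 1.0 / q  # x moves the walk away from t
    raise ValueError("d_tx must be 0, 1 or 2")


def transition_probabilities(graph, t, v, p, q):
    """Normalized probabilities of the next step from v, having arrived from t.

    graph: dict mapping each vertex to a dict {neighbor: edge weight}.
    """
    pi = {}
    for x, w_vx in graph[v].items():
        if x == t:
            d_tx = 0
        elif x in graph[t]:
            d_tx = 1
        else:
            d_tx = 2
        pi[x] = alpha_pq(d_tx, p, q) * w_vx
    total = sum(pi.values())
    return {x: value / total for x, value in pi.items()}


if __name__ == "__main__":
    # Small bipartite virus-host graph with unit weights.
    g = {
        "h1": {"v1": 1.0},
        "h2": {"v1": 1.0, "v2": 1.0},
        "v1": {"h1": 1.0, "h2": 1.0},
        "v2": {"h2": 1.0},
    }
    print(transition_probabilities(g, "h1", "v1", p=2.0, q=0.5))
```

Because the network $G$ is bipartite, every candidate other than $t$ itself lies at distance 2 from $t$, so in practice $p$ controls how often the walk backtracks and $q$ how readily it explores.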

Let $f(u)$ be the mapping function that maps node $u$ to a low-dimensional vector. For any node $u$ in the graph, define $N_S(u)$ as the set of neighbors of node $u$ sampled using formula (51). The mapping $f$ that maximizes the probability of these neighbors appearing is obtained from formula (52):

under the following two assumptions:

(1) Conditional independence: given a source vertex, the probability that its neighbor vertex $n_i$ appears is independent of the other vertices in the neighbor set;

(2) Symmetry in feature space: a vertex shares the same embedding vector whether it acts as a source vertex or as a neighbor vertex;

formula (52) is optimized into the final objective function, formula (55):

In formula (55), because the normalization factor is expensive to compute, negative sampling is used for optimization.

When this final objective function (55) is maximized, the form of $f(u)$ is obtained, and with it the feature vector of each node.
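For reference, the derivation outlined above takes the following form in the node2vec framework. This is given as the standard formulation that the description appears to follow, with $V$ (the vertex set) and $Z_u$ (the per-node normalization factor) as symbol names assumed from that framework.

```latex
\max_f \sum_{u \in V} \log \Pr\bigl(N_S(u) \mid f(u)\bigr)

\Pr\bigl(N_S(u) \mid f(u)\bigr) = \prod_{n_i \in N_S(u)} \Pr\bigl(n_i \mid f(u)\bigr)

\Pr\bigl(n_i \mid f(u)\bigr) = \frac{\exp\bigl(f(n_i) \cdot f(u)\bigr)}{\sum_{v \in V} \exp\bigl(f(v) \cdot f(u)\bigr)}

\max_f \sum_{u \in V} \Bigl[ -\log Z_u + \sum_{n_i \in N_S(u)} f(n_i) \cdot f(u) \Bigr],
\qquad Z_u = \sum_{v \in V} \exp\bigl(f(u) \cdot f(v)\bigr)
```

The second line is the conditional-independence assumption, the third is the symmetric softmax parameterization, and the last is the resulting objective, in which $Z_u$ sums over every vertex and is therefore the term approximated by negative sampling.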

Step 6, calculating the cosine similarity between the feature vector of each virus (each host) obtained in step 5 and the feature vectors of the other viruses (other hosts), and thereby constructing the virus cosine similarity matrix and the host cosine similarity matrix.
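A minimal sketch of this step, assuming the node2vec embeddings are available as plain lists of floats (one list per virus or per host). The function name is hypothetical.

```python
from math import sqrt


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between embedding vectors (lists of floats)."""
    norms = [sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:
                dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
                sim[i][j] = dot / (norms[i] * norms[j])
    return sim


if __name__ == "__main__":
    virus_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in cosine_similarity_matrix(virus_embeddings):
        print([round(x, 3) for x in row])
```

Cosine similarity between learned embeddings can be negative. The text does not say how such values are handled before they are used as edge weights, so any clipping or rescaling would be a separate design decision.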

Step 7, obtaining the virus cosine similarity matrix and the host cosine similarity matrix from step 6, and constructing a heterogeneous network $G_2$ from them together with the known virus-host associations of step 1.

Step 8, extracting graph-based features from the two heterogeneous weighted graphs $G_1$ (from step 4) and $G_2$ (from step 7). Multiple path scores between each virus-host pair in each graph are used to reflect these features. For each simple path of each virus-host pair, starting from the source node (the host node) and ending at the target node (the virus node), a path score is calculated using formula (56):

In formula (56), $P = \{p_1, p_2, \ldots, p_n\}$ is the set of paths connecting host node $h_i$ and virus node $v_j$, and $P_{weights}$ denotes the weights between nodes. The path score is the product of all edge weights from the starting host node to the ending virus node in each path structure. To reduce the amount of calculation, the invention limits the path length to at most 3, which gives 6 path structures (Path1, Path2, Path3, Path4, Path5 and Path6), each starting at a host node and ending at a virus node: Path1 (H-H-V), Path2 (H-V-V), Path3 (H-H-H-V), Path4 (H-H-V-V), Path5 (H-V-V-V) and Path6 (H-V-H-V). Two features are mined for each path structure:

(1) the sum of all meta-path scores under each path structure:

(2) the highest of all meta-path scores under each path structure:

A meta-path refers to all paths having the same path structure, and the meta-path score is the product of all edge weights from the starting host node to the ending virus node in the path structure. ASP denotes the meta-paths between a virus $v_j$ and a host $h_i$. To ensure that longer paths are not penalized, each (maximum or sum) path score is computed separately, and each score considers all the paths belonging to one particular path structure. Scores from different path structures are therefore never blended into a single feature. In addition, the scores are normalized using min-max normalization so that the classifier treats the features equally.
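The two features per path structure can be sketched with a brute-force enumeration of simple paths. This is an illustration under stated assumptions rather than the invention's implementation: the network is held as three hypothetical matrices (`HH` host-host weights, `VV` virus-virus weights, `HV` host-by-virus known associations), paths with a zero-weight edge are treated as absent, and min-max normalization is left out.

```python
from itertools import product

STRUCTURES = ["HHV", "HVV", "HHHV", "HHVV", "HVVV", "HVHV"]  # Path1 to Path6


def edge_weight(net, a, b):
    """Weight of the edge between two typed nodes, e.g. ("H", 0) and ("V", 2)."""
    (type_a, i), (type_b, j) = a, b
    if type_a == "H" and type_b == "H":
        return net["HH"][i][j]
    if type_a == "V" and type_b == "V":
        return net["VV"][i][j]
    if type_a == "H":
        return net["HV"][i][j]
    return net["HV"][j][i]


def meta_path_features(net, host, virus, structure):
    """Return (sum, max) of the scores of all simple paths with this structure."""
    n_hosts, n_viruses = len(net["HH"]), len(net["VV"])
    inner = structure[1:-1]
    choices = [range(n_hosts) if t == "H" else range(n_viruses) for t in inner]
    scores = []
    for mids in product(*choices):
        path = [("H", host)] + list(zip(inner, mids)) + [("V", virus)]
        if len(set(path)) != len(path):
            continue  # a simple path never repeats a node
        score = 1.0
        for a, b in zip(path, path[1:]):
            score *= edge_weight(net, a, b)
        if score > 0:
            scores.append(score)
    return (sum(scores), max(scores)) if scores else (0.0, 0.0)


def pair_features(net, host, virus):
    """12 features for one graph: (sum, max) for each of the 6 path structures."""
    features = []
    for structure in STRUCTURES:
        features.extend(meta_path_features(net, host, virus, structure))
    return features
```

Calling `pair_features` once on $G_1$ and once on $G_2$ and concatenating the results gives 24 values per virus-host pair, matching the feature count described in Step 9.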

Step 9, for each virus-host pair, 12 features can be extracted in step 8 from each constructed heterogeneous graph, and the features from the two graphs are combined into a 24-dimensional feature vector. The accuracy of the invention depends on the basic characteristics of the data set. After empirical analysis and many experiments, the feature set most relevant to this classification task was determined. When analyzing performance, combinations of one or more features need to be removed. Thus, after feature selection is applied, the dimension of the feature vector input to the prediction model is reduced from 24 to 16, depending on the data set.

Step 10, the invention uses AdaBoost, a well-known machine learning classification model with good performance. According to how well each of the $M$ weak classifiers classifies the sample data, AdaBoost assigns different weights to the weak classifiers and combines them into a strong classifier. The AdaBoost algorithm proceeds as follows:

(1) for the $M$ classifiers $G_m(x): x \to \{0, 1\}$, $m \in \{1, \ldots, M\}$, each is trained on the data with the current weight distribution;

(2) computing the classification error rate of weak classifier $G_m(x)$:

calculating the coefficient of $G_m(x)$:

updating the weight distribution of the training data:

$D_{m+1} = (w_{m+1,1}, \ldots, w_{m+1,j}, \ldots, w_{m+1,N})$ (62)

where $G_m(x_i)$ denotes the result of classifying the sample data by the weak classifier;

(3) constructing a linear combination of the basis classifiers:

the final classifier:
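The flow above can be sketched with a small from-scratch AdaBoost. Several choices here are assumptions not stated in the text: the weak classifiers are single-feature threshold rules (decision stumps), the {0, 1} labels are mapped internally to {-1, +1} as in the textbook formulation, and the coefficient is computed as half the log-odds of the weighted error. Function names are hypothetical.

```python
from math import exp, log


def train_stump(X, y, w):
    """Best one-feature threshold rule under sample weights w (y in {-1, +1})."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] >= thr else -sign for row in X]
                err = sum(wi for wi, pi, yi in zip(w, pred, y) if pi != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, row):
    _, f, thr, sign = stump
    return sign if row[f] >= thr else -sign


def adaboost_fit(X, y01, rounds=10):
    """Train `rounds` weighted stumps; returns a list of (coefficient, stump)."""
    y = [1 if label == 1 else -1 for label in y01]
    n = len(X)
    w = [1.0 / n] * n  # initial uniform weight distribution
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0) and division by 0
        alpha = 0.5 * log((1 - err) / err)          # coefficient of this classifier
        model.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the others, renormalize.
        w = [wi * exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    """Sign of the weighted vote, mapped back to a {0, 1} label."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else 0


if __name__ == "__main__":
    X = [[0.1, 0.3], [0.2, 0.1], [0.8, 0.7], [0.9, 0.6]]
    y = [0, 0, 1, 1]
    model = adaboost_fit(X, y, rounds=3)
    print([adaboost_predict(model, row) for row in X])
```

In the setting of this method, each row of `X` would be the selected meta-path feature vector of one virus-host pair and the label would indicate whether the association is known.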

To verify the validity of the method of the invention, comparative experiments were performed on multiple data sets. The experiments used four data sets obtained from a paper and an authoritative bioinformatics website; their basic information is shown in Table 1:

Table 1 Basic information of the four data sets

	Dataset I	Dataset II	Dataset III	Dataset IV
Number of viruses	728	32	312	1380
Number of hosts	129	119	747	221
Known associations	728	368	4539	1479
Unknown associations	93184	3440	228525	303501
Sparsity ratio	0.0078	0.1070	0.0199	0.0048

The method was compared with five other association prediction methods:

■ ILMF-VH, virus-host association prediction based on multi-information matrix fusion. The virus similarity network is constructed from oligonucleotide frequency (ONF) measures, and the host similarity network is constructed by integrating the oligonucleotide frequency similarity and the Gaussian interaction profile (GIP) kernel similarity of the hosts through similarity network fusion (SNF). A neighborhood-regularized logistic matrix factorization algorithm is then run on the virus-host heterogeneous network to predict virus-host associations;

■ LAGCN, the layer attention graph convolutional network. It integrates the known virus-host associations, the virus-virus similarity and the host-host similarity into a heterogeneous network, and applies graph convolution on that network to learn embeddings of the viruses and hosts. LAGCN then combines the embeddings from multiple graph convolution layers using an attention mechanism. The method performs well in predicting virus-host associations;

■ NetLapRLS, a semi-supervised learning method that applies regularized least squares, incorporating the known virus-host interaction network, to train models in the virus domain and the host domain separately, and then combines the two to obtain the final prediction;

■ BLM-NII, which introduces neighbor-based interaction-profile inference (NII) and integrates it into a supervised learning approach, the bipartite local model (BLM), to handle new candidates without known associations. Specifically, the inferred interactions are treated as label information and used for model learning on the new candidates;

■ CMF, which projects viruses and hosts into a common low-rank feature space and predicts virus-host interactions through the collaborative factorization into two low-rank matrices.

The evaluation indices used in the invention are AUC, the area under the receiver operating characteristic (ROC) curve, and AUPR, the area under the precision-recall curve. The experimental results are shown in Table 2:

Table 2 Comparison of the experimental results of the present invention and the other methods on the four data sets

Data set	Evaluation index	Ours	ILMF-VH	LAGCN	NetLapRLS	BLM-NII	CMF
Dataset I	AUC	0.99991	0.75380	0.92508	0.08741	0.86028	0.76867
Dataset I	AUPR	0.99086	0.21475	0.79621	0.00422	0.24655	0.04473
Dataset II	AUC	0.98955	0.79128	0.79811	0.76468	0.80453	0.50939
Dataset II	AUPR	0.91827	0.30862	0.41345	0.50196	0.48382	0.22213
Dataset III	AUC	0.99999	0.99391	0.99868	0.99740	0.99683	0.77741
Dataset III	AUPR	0.99999	0.63898	0.96357	0.97915	0.90456	0.42784
Dataset IV	AUC	0.99965	0.82112	0.91179	0.69508	0.90606	0.73420
Dataset IV	AUPR	0.96485	0.24104	0.73203	0.01979	0.38681	0.02030

The top ten associations predicted by the present invention on Dataset IV are shown in Table 3:

Table 3 Top ten associations predicted by the present invention

Rank	Host Name	Virus Name	Evidence
1	Campylobacter jejuni	Campylobacter phage CP8	PMID:32054081
2	Erysimum	Listeria phage A118	unknown
3	Erwinia sp.	Erwinia phage phiEa1H	PMID:26555076
4	Klebsiella pneumoniae	Klebsiella phage PMBT1	PMID:31976857
5	Pseudomonas syringae	Pseudomonas phage phiPSA2	PMID:32610695
6	Lactococcus lactis subsp. cremoris	Lactococcus phage P680	PMID:30135597
7	Gordonia terrae	Gordonia phage Troje	unknown
8	Lactococcus sp.	Lactococcus phage fd13	unknown
9	Aeropyrum pernix K1	Aeropyrum pernix bacilliform virus 1	PMID:21784945
10	Pseudomonas aeruginosa	Pseudomonas phage MP1412	PMID:26115051

These results show that the accuracy of the virus-host association prediction method based on network fusion and graph embedding is markedly higher than that of the existing common methods, which demonstrates the superiority of the method.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
