Medicine knowledge graph representation learning method

Document No.: 1923668    Publication date: 2021-12-03

Note: this technology, a medicine knowledge graph representation learning method, was designed by 刘细涓 and 杨晨 on 2021-09-15. Abstract: The invention discloses a medicine knowledge graph representation learning method comprising: acquiring medicine-related information to generate a medicine knowledge graph; using a balance factor to fuse the text description information of entities in the medicine knowledge graph with the structural information of the entities and relations in the graph; introducing an inclusion factor to learn the representation of the relation in which a medicine contains a chemical component; introducing a penalty factor to learn the representations of the negative interaction relation between medicines and the negative interaction relation between chemical components; establishing a similarity coefficient to extract information on medicines with similar chemical components; and defining multi-type score functions that measure the correlation between different types of relations and entity pairs. The invention can learn to represent not only the basic textual and structural features of medicine information in the knowledge graph but also implicit medicine similarity information and medicine interaction information.

1. A medicine knowledge graph representation learning method, characterized by comprising:

identifying and acquiring medicine-related information from medicine instruction leaflets, medical literature and/or medicine books, either from a network or by character recognition, and generating a medicine knowledge graph from that information, wherein the generated medicine knowledge graph comprises at least a negative interaction relation between medicines, a negative interaction relation between chemical components, and a contains relation between a medicine and its chemical components;

jointly extracting the text description information of entities and relations in the medicine knowledge graph with an end-to-end neural network; fusing the extracted text description information with the structural information of the entities and relations in the knowledge graph by means of a balance factor; introducing an inclusion factor to learn the representation of the relation in which a medicine contains a chemical component; introducing a penalty factor to learn the representations of the negative interaction relation between medicines and the negative interaction relation between chemical components; establishing a similarity coefficient to extract information on medicines with similar chemical components; and, based on the TransF model, defining multi-type score functions that measure the correlation between relations and entity pairs according to the type of entity and relation being predicted:

when a negative interaction relation is predicted, the negative interaction score function $f_k(h, r, t)$ is defined as:

$$f_k(h, r, t) = k \left\| \left(\mu h_s + (1-\mu) h_d + \mu t_s + (1-\mu) t_d\right)^{T} r + \left(r - \mu t_s + (1-\mu) t_d\right)^{T} \left(\mu h_s + (1-\mu) h_d\right) \right\|_{L2}$$

when a relation in which a medicine contains a chemical component is predicted, the inclusion score function $f_j(h, r, t)$ is defined as:

$$f_j(h, r, t) = j \left\| \left(\mu h_s + (1-\mu) h_d + r\right)^{T} \left(\mu t_s + (1-\mu) t_d\right) + \left(\mu t_s + (1-\mu) t_d - r\right)^{T} \left(\mu h_s + (1-\mu) h_d\right) \right\|_{L2}$$

when the predicted relation is neither a negative interaction relation nor a relation in which a medicine contains a chemical component, the generic score function $f(h, r, t)$ is defined as:

$$f(h, r, t) = \left\| \left(\mu h_s + (1-\mu) h_d + r\right)^{T} \left(\mu t_s + (1-\mu) t_d\right) + \left(\mu t_s + (1-\mu) t_d - r\right)^{T} \left(\mu h_s + (1-\mu) h_d\right) \right\|_{L2}$$

where $\mu$ is the balance factor, with value range $[0, 1]$; $k$ is the penalty factor and $j$ is the inclusion factor, both positive numbers; $h$ is the head entity vector, $t$ is the tail entity vector, and $r$ is the relation vector between the head entity and the tail entity; $h_s$ is the structure vector of the head entity, $t_s$ is the structure vector of the tail entity, $h_d$ is the text description vector of the head entity, and $t_d$ is the text description vector of the tail entity; $T$ denotes transposition, and $L2$ denotes the L2 norm;

to extract information on medicines with similar chemical components, a similar-component score function is established that measures, based on a similarity coefficient, the relation between medicine entities with similar chemical components:

where SIMI(h, t) denotes the chemical component similarity coefficient between medicine entities h and t, which can be computed with an existing text similarity algorithm over all the chemical components contained in the two medicine entities;

establishing a loss function of the entity and the relation based on the multi-type score function:

where $\gamma$ is a preset margin (boundary value); $(h, r, t)$ denotes a positive example triple and $(h', r', t')$ a negative example triple; $S_k(h, r, t)$ denotes the set of all positive example triples in the medicine knowledge graph that have a negative interaction relation, and $S_k'(h, r, t)$ denotes a randomly generated set of negative example triples that have no negative interaction relation; $S_j(h, r, t)$ denotes the set of all positive example triples in the medicine knowledge graph expressing that a medicine contains a chemical component, and $S_j'(h, r, t)$ denotes a randomly generated set of negative example triples that do not express that relation; $S(h, r, t)$ denotes the set of all positive example triples in the medicine knowledge graph that express neither a negative interaction relation nor a medicine-contains-chemical-component relation; $(h', r, t')$ denotes a generic negative example triple generated by randomly replacing the head entity and the tail entity, and $S'(h, r, t)$ denotes the set of generic negative example triples; $(h, t)$ denotes a positive example pair with no relation constraint, and $(h', t')$ denotes a generic negative example pair generated by randomly replacing the head entity and the tail entity; $S_s(h, t)$ denotes the set of positive example pairs whose similarity coefficient SIMI(h, t) exceeds a preset threshold, and $S_s'(h, t)$ denotes the set of generic negative example pairs;

and minimizing the loss function to learn low-dimensional vector representations of entities and relations that fuse structural information, text information, medicine similarity information and interaction information, using the stochastic gradient descent (SGD) algorithm for optimization during training.

2. The method as claimed in claim 1, wherein:

the entity types of the generated drug knowledge graph include, but are not limited to, drug generic name, drug commodity name, drug description, drug chemical composition, indication, approval document, dosage form, specification, administration mode, administration time, caution, adverse reaction, applicable population entity, drug category, interaction entity, drug compatibility entity.

3. The method as claimed in claim 1, wherein:

the medicine knowledge graph representation learning method can be combined with general knowledge graph techniques and applied to all downstream applications based on the medicine knowledge graph, including but not limited to relation prediction, triple classification, entity type classification, relation extraction, and intelligent question-answering and recommendation systems.

4. A computer program for implementing the method of claim 1.

5. A storage medium storing a computer program according to claim 1.

6. A terminal device provided with the computer program according to claim 4.

Technical Field

The invention relates to the technical field of knowledge graphs and deep learning, and in particular to a medicine knowledge graph representation learning method.

Background

A knowledge graph is essentially a directed graph composed of nodes and edges. Knowledge is usually organized in it as a network in which each node represents an entity (a person name, place name, organization name, concept and so on) and each edge represents a relation between entities. Most knowledge can therefore be represented by triples (h, r, t), where h and t denote the head and tail entities respectively and r denotes the relation between them. Large-scale knowledge graphs can be used in many practical tasks, but their correctness and completeness cannot be guaranteed, and they face serious problems of data sparsity and computational efficiency. Research on knowledge graph completion finds missing or wrong relations, which improves the overall quality of the knowledge graph and can improve existing downstream applications or enable new ones.

In recent years, representation learning, typified by deep learning, has attracted attention in many fields such as speech recognition, image analysis and natural language processing. Knowledge representation learning produces distributed representations of entities and relations, markedly improves computational efficiency, effectively alleviates data sparsity, and enables the fusion of heterogeneous information. However, some existing knowledge representation learning models are too simple to represent the entities in a knowledge graph and the relations between them well, while others are too complex to be applied to large-scale knowledge graphs.

Prior art such as the Chinese patent with publication number CN108197290B discloses a knowledge graph representation learning method that fuses entity descriptions and relation descriptions: an end-to-end neural network model jointly extracts entities and relations, a balance factor is set to balance structural information against text description information, and different score functions are defined for different prediction targets; a loss function then associates the entity vectors with the relation vectors, and optimizing it means that, once the optimization objective is reached, vectors for each entity and relation in the knowledge graph, together with their text description information, have been learned.

The medicine industry is a special industry in which erroneous or missing data can create serious health hazards. Statistics show that 2.5 million people in China suffer harm to their health each year because of medication errors, of whom 200,000 die, twice the number of deaths from traffic accidents nationwide, so the accuracy of a medicine knowledge graph is particularly important. A drug interaction is the combined effect produced when a patient takes two or more medicines at the same time or within a certain period: it may strengthen the therapeutic effect or reduce side effects, or it may weaken the therapeutic effect or produce undesirable toxic and side effects. Enhanced action includes increased efficacy and increased toxicity, and reduced action includes decreased efficacy and decreased toxicity. Therefore, when medicines are combined clinically, the characteristics of each medicine should be taken into account so that the pharmacological action of every medicine in the combination is fully exerted, achieving the best curative effect with the fewest adverse reactions and improving medication safety.
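The triple view of a knowledge graph described above can be illustrated with a minimal sketch. The entity and relation names (`drug_A`, `contains`, and so on) and the helper `build_adjacency` are hypothetical and chosen only for illustration; they do not come from the patent.

```python
from collections import defaultdict

# Hypothetical example facts; names are illustrative only.
triples = [
    ("drug_A", "contains", "component_X"),
    ("drug_B", "contains", "component_Y"),
    ("drug_A", "negative_interaction", "drug_B"),
]


def build_adjacency(triples):
    """Index (head, relation, tail) triples as a directed graph.

    Each head entity maps to the list of (relation, tail) edges leaving it.
    """
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)


graph = build_adjacency(triples)
```

Here every edge keeps its direction, so `component_X` appears only as a tail and has no outgoing edges of its own.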

Disclosure of Invention

In view of the prior art, the invention aims to provide a medicine knowledge graph representation method that can represent basic medicine information text and structural features and can also represent implicit medicine similarity information and negative medicine interaction information. It includes: identifying and acquiring medicine-related information from medicine instruction leaflets, medical literature and/or medicine books, either from a network or by character recognition, and generating a medicine knowledge graph from that information, wherein the generated medicine knowledge graph comprises at least a negative interaction relation between medicines, a negative interaction relation between chemical components, and a contains relation between a medicine and its chemical components.

The text description information of entities and relations in the medicine knowledge graph is jointly extracted with an end-to-end neural network; a balance factor is used to fuse the extracted text description information with the structural information of the entities and relations in the knowledge graph; an inclusion factor is introduced to learn the representation of the relation in which a medicine contains a chemical component; a penalty factor is introduced to learn the representations of the negative interaction relation between medicines and the negative interaction relation between chemical components; a similarity coefficient is established to extract information on medicines with similar chemical components; and, based on the TransF model, multi-type score functions measuring the correlation between relations and entity pairs are defined according to the type of entity and relation being predicted:

When a negative interaction relation is predicted, the negative interaction score function $f_k(h, r, t)$ is defined as:

$$f_k(h, r, t) = k \left\| \left(\mu h_s + (1-\mu) h_d + \mu t_s + (1-\mu) t_d\right)^{T} r + \left(r - \mu t_s + (1-\mu) t_d\right)^{T} \left(\mu h_s + (1-\mu) h_d\right) \right\|_{L2}$$

When a relation in which a medicine contains a chemical component is predicted, the inclusion score function $f_j(h, r, t)$ is defined as:

$$f_j(h, r, t) = j \left\| \left(\mu h_s + (1-\mu) h_d + r\right)^{T} \left(\mu t_s + (1-\mu) t_d\right) + \left(\mu t_s + (1-\mu) t_d - r\right)^{T} \left(\mu h_s + (1-\mu) h_d\right) \right\|_{L2}$$

When the predicted relation is neither a negative interaction relation nor a relation in which a medicine contains a chemical component, the generic score function $f(h, r, t)$ is defined as:

$$f(h, r, t) = \left\| \left(\mu h_s + (1-\mu) h_d + r\right)^{T} \left(\mu t_s + (1-\mu) t_d\right) + \left(\mu t_s + (1-\mu) t_d - r\right)^{T} \left(\mu h_s + (1-\mu) h_d\right) \right\|_{L2}$$

where $\mu$ is the balance factor, with value range $[0, 1]$; $k$ is the penalty factor and $j$ is the inclusion factor, both positive numbers; $h$ is the head entity vector, $t$ is the tail entity vector, and $r$ is the relation vector between the head entity and the tail entity; $h_s$ is the structure vector of the head entity, $t_s$ is the structure vector of the tail entity, $h_d$ is the text description vector of the head entity, and $t_d$ is the text description vector of the tail entity; $T$ denotes transposition, and $L2$ denotes the L2 norm.
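The three score functions can be sketched in plain Python as follows. This is only an illustrative reading of the formulas, not the patentees' implementation: vectors are assumed to be equal-length lists of floats, the function names are hypothetical, and because both inner products are scalars the L2 norm is taken here as an absolute value. The second factor of the negative interaction score is transcribed literally from the text as r - μ·t_s + (1-μ)·t_d.

```python
def fuse(structure_vec, text_vec, mu):
    """Blend structure and text vectors: mu * v_s + (1 - mu) * v_d."""
    return [mu * s + (1.0 - mu) * d for s, d in zip(structure_vec, text_vec)]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def generic_score(h_s, h_d, r, t_s, t_d, mu):
    """f(h, r, t) = |(h + r)^T t + (t - r)^T h| with fused h and t."""
    h = fuse(h_s, h_d, mu)
    t = fuse(t_s, t_d, mu)
    h_plus_r = [a + b for a, b in zip(h, r)]
    t_minus_r = [a - b for a, b in zip(t, r)]
    return abs(dot(h_plus_r, t) + dot(t_minus_r, h))


def inclusion_score(h_s, h_d, r, t_s, t_d, mu, j):
    """f_j: the generic score scaled by the inclusion factor j > 0."""
    return j * generic_score(h_s, h_d, r, t_s, t_d, mu)


def negative_interaction_score(h_s, h_d, r, t_s, t_d, mu, k):
    """f_k: k * |(h + t)^T r + (r - mu*t_s + (1-mu)*t_d)^T h|."""
    h = fuse(h_s, h_d, mu)
    t = fuse(t_s, t_d, mu)
    h_plus_t = [a + b for a, b in zip(h, t)]
    # Literal transcription of the second factor as printed in the text.
    second = [ri - mu * ts + (1.0 - mu) * td
              for ri, ts, td in zip(r, t_s, t_d)]
    return k * abs(dot(h_plus_t, r) + dot(second, h))
```

Setting μ = 1 uses only the structure vectors and μ = 0 only the text description vectors, which is the sense in which μ balances the two sources.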

To extract information on medicines with similar chemical components, a similar-component score function is established that measures, based on a similarity coefficient, the relation between medicine entities with similar chemical components:

where SIMI(h, t) denotes the chemical component similarity coefficient between medicine entities h and t, which can be computed with an existing text similarity algorithm over all the chemical components contained in the two medicine entities.
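The text leaves the choice of similarity algorithm open. As one possible stand-in, the sketch below uses the Jaccard coefficient over each medicine's set of chemical components; this choice, the function names `simi` and `similar_pairs`, and the example data layout are assumptions made for illustration, not something the patent specifies.

```python
def simi(components_h, components_t):
    """Jaccard coefficient between two component collections (assumed metric)."""
    a, b = set(components_h), set(components_t)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(drug_components, threshold):
    """Return medicine pairs whose similarity coefficient exceeds threshold.

    drug_components maps a medicine name to its list of chemical components.
    """
    names = sorted(drug_components)
    pairs = []
    for i, h in enumerate(names):
        for t in names[i + 1:]:
            if simi(drug_components[h], drug_components[t]) > threshold:
                pairs.append((h, t))
    return pairs
```

Pairs selected this way would play the role of the positive example pairs whose similarity coefficient exceeds the preset threshold.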

Establishing a loss function of the entity and the relation based on the multi-type score function:

where $\gamma$ is a preset margin (boundary value); $(h, r, t)$ denotes a positive example triple and $(h', r', t')$ a negative example triple; $S_k(h, r, t)$ denotes the set of all positive example triples in the medicine knowledge graph that have a negative interaction relation, and $S_k'(h, r, t)$ denotes a randomly generated set of negative example triples that have no negative interaction relation; $S_j(h, r, t)$ denotes the set of all positive example triples in the medicine knowledge graph expressing that a medicine contains a chemical component, and $S_j'(h, r, t)$ denotes a randomly generated set of negative example triples that do not express that relation; $S(h, r, t)$ denotes the set of all positive example triples in the medicine knowledge graph that express neither a negative interaction relation nor a medicine-contains-chemical-component relation; $(h', r, t')$ denotes a generic negative example triple generated by randomly replacing the head entity and the tail entity, and $S'(h, r, t)$ denotes the set of generic negative example triples; $(h, t)$ denotes a positive example pair with no relation constraint, and $(h', t')$ denotes a generic negative example pair generated by randomly replacing the head entity and the tail entity; $S_s(h, t)$ denotes the set of positive example pairs whose similarity coefficient SIMI(h, t) exceeds a preset threshold, and $S_s'(h, t)$ denotes the set of generic negative example pairs.

The loss function is minimized to learn low-dimensional vector representations of entities and relations that fuse structural information, text information, medicine similarity information and interaction information, with the stochastic gradient descent (SGD) algorithm used for optimization during training.
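The loss equation itself is not reproduced in this text, so the following is only a sketch of the general pattern the definitions suggest: negative examples made by randomly replacing the head or tail entity, and a margin-based ranking (hinge) loss with margin γ. The hinge form, the convention that a lower score means a more plausible triple, and the names `corrupt` and `margin_loss` are all assumptions; the patent's actual loss combines several such terms over the different triple and pair sets.

```python
import random


def corrupt(triple, entities, rng):
    """Build a negative example by randomly replacing the head or the tail.

    Assumption: the sketch does not check whether the corrupted triple
    happens to be a true fact in the knowledge graph.
    """
    head, relation, tail = triple
    if rng.random() < 0.5:
        candidates = [e for e in entities if e != head]
        return (rng.choice(candidates), relation, tail)
    candidates = [e for e in entities if e != tail]
    return (head, relation, rng.choice(candidates))


def margin_loss(pos_scores, neg_scores, gamma):
    """Hinge loss: sum of max(0, gamma + f(positive) - f(negative)).

    Assumes lower scores are better; swap the two scores if the chosen
    score function treats higher values as more plausible.
    """
    return sum(max(0.0, gamma + p - n)
               for p, n in zip(pos_scores, neg_scores))
```

In training, a loss of this shape would be evaluated on mini-batches of positive examples and their corrupted counterparts, and the entity and relation vectors updated by SGD.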

Representation learning reduces a multidimensional, complex knowledge graph to a low-dimensional space, which reduces storage and improves computational efficiency; and because each entity representation results from interaction with the whole knowledge graph, the information of the entire graph is taken into account when serving downstream applications. The medicine knowledge graph constructed by the invention contains not only medicine-related information but also accounts for similarity between medicines and interactions between medicines. Different score functions are constructed for different entity types. When a negative interaction relation is predicted, the score function encourages the vectors of interacting entities to lie far apart, and the penalty factor further expresses the negative relation between interacting medicines or chemical components, so that the adverse effect of such negative interactions is taken into account. When a relation in which a medicine contains a chemical component is predicted, the inclusion factor emphasizes the inclusion relation. The chemical component similarity coefficient is introduced to further extract latent medicine similarity relations.
The invention learns vector representations of the medicine knowledge graph with multi-type score functions that fuse structural information, text information, medicine similarity information and interaction information. The resulting representation captures not only the basic textual and structural features of medicine information but also implicit medicine similarity and interaction information, providing all knowledge-graph-based downstream applications with a medicine knowledge graph representation method that is richer in information and better matched to practical medication needs.

The entity types of the generated drug knowledge-graph include, but are not limited to, drug generic name, drug commodity name, drug description, drug chemical composition, drug treatment disease description, approval literature, dosage form, specification, mode of administration, time of administration, notice, drug treatment symptom entity, drug applicable population entity, drug class, interaction entity, drug compatibility entity.

The medicine knowledge graph representation learning method can be combined with general knowledge graph techniques and applied to all downstream applications based on the medicine knowledge graph, including but not limited to relation prediction, triple classification, entity type classification, relation extraction, and intelligent question-answering and recommendation systems. After representation learning, a huge, multidimensional, complex medicine knowledge graph is reduced to a low-dimensional space, which reduces storage. In an actual downstream application, the low-dimensional vector representation of the medicine knowledge graph is obtained directly through this representation learning method, and the application is then completed using those vectors together with existing general knowledge graph techniques.

In summary, the invention acquires medicine-related information to generate a medicine knowledge graph; jointly extracts the text description information of entities and relations in the graph with an end-to-end neural network; fuses the extracted text description information with the structural information of the entities and relations by means of a balance factor; introduces an inclusion factor to learn the representation of the relation in which a medicine contains a chemical component; introduces a penalty factor to learn the representations of the negative interaction relations between medicines and between chemical components; establishes a similarity coefficient to extract information on medicines with similar chemical components; defines, based on the TransF model, multi-type score functions that measure the correlation between relations and entity pairs according to the type of entity and relation being predicted; and minimizes a loss function, optimized with stochastic gradient descent during training, to learn low-dimensional vector representations of entities and relations that fuse structural information, text information, medicine similarity information and interaction information. The method can therefore represent basic medicine information text and structural features as well as implicit medicine similarity and interaction information, and provides all knowledge-graph-based downstream applications with a medicine knowledge graph representation method that is richer in information and better matched to practical medication needs.
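As a hedged illustration of how learned vectors might be consumed downstream, the sketch below ranks entities by cosine similarity to a query entity, the kind of lookup a recommendation feature could build on. The use of cosine similarity, the function names and the toy embeddings are assumptions for illustration; the patent does not prescribe a particular downstream technique.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def nearest_entities(query, embeddings, top_n=3):
    """Return the top_n entity names most similar to the query entity."""
    scored = [
        (cosine(embeddings[query], vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    # Highest similarity first; ties broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:top_n]]
```

Because only the low-dimensional vectors are needed at query time, the full graph does not have to be traversed for each lookup.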

Drawings

FIG. 1 is a flow chart of a method for representing a knowledge graph of a drug according to the present invention;

FIG. 2 is a timing diagram of a method according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating a comparison of related indexes according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples.

Example 1:

Referring to FIGS. 1 to 3, the medicine knowledge graph representation method of this embodiment includes identifying and acquiring medicine-related information from medicine instruction leaflets, medical literature and/or medicine books, either from a network or by character recognition, and generating a medicine knowledge graph from that information, where the generated medicine knowledge graph comprises at least a negative interaction relation between medicines, a negative interaction relation between chemical components, and a contains relation between a medicine and its chemical components.

The text description information of entities and relations in the medicine knowledge graph is jointly extracted with an end-to-end neural network; a balance factor is used to fuse the extracted text description information with the structural information of the entities and relations in the knowledge graph; an inclusion factor is introduced to learn the representation of the relation in which a medicine contains a chemical component; a penalty factor is introduced to learn the representations of the negative interaction relation between medicines and the negative interaction relation between chemical components; a similarity coefficient is established to extract information on medicines with similar chemical components; and, based on the TransF model, multi-type score functions measuring the correlation between relations and entity pairs are defined according to the type of entity and relation being predicted: