Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases

Document No.: 1891757    Publication date: 2021-11-26

Note: this technology, "一种多源异质数据库间概念对齐与内容互译方法及系统" (Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases), was designed and created by 徐颂华, 代笃伟, 李宗芳 and 徐宗本 on 2021-08-02. Its main content is as follows: the invention discloses a method and a system for concept alignment and content inter-translation among multi-source heterogeneous databases. For databases whose data dictionary is unknown, a data-driven concept alignment and content inter-translation method is adopted, and concept alignment and content inter-translation between databases are realized by mining uncertain functional mapping relations. For heterogeneous databases whose dictionaries are incomplete, unreliable or mutually contradictory, an ontology-driven concept alignment and content inter-translation method is adopted; viewed as the problem of deciding graph isomorphism, graph isomorphism decision is realized by an unsupervised graph representation learning method. For databases in which the dictionary and the data both exist and each has defects, a concept alignment and content inter-translation method dual-driven by data and ontology is adopted, and concept alignment and content inter-translation are realized with the help of a cross-view domain knowledge graph. By collaboratively mining the mapping relations between data and ontologies within multiple systems, accurate, efficient, robust alignment and inter-translation with low data dependency are realized.

1. A method for concept alignment and content inter-translation among multi-source heterogeneous databases is characterized by comprising the following steps:

acquiring basic information of a database to be processed, and judging the defect type of the database to be processed according to the basic information;

for databases where the data dictionary is unknown: obtaining functional mapping relations between data fields that are heterogeneous and whose data dictionaries are unknown in the multi-source heterogeneous databases by using functional dependency and probability statistical models, and realizing concept alignment and content inter-translation between databases by mining uncertain functional mapping relations;

for heterogeneous databases where the data dictionary is incomplete, unreliable, or contradictory: according to the data ontology model carried by each database, the concepts and relations involved in the multi-source heterogeneous medical databases are represented as a plurality of graph structures, the problems of concept alignment and content inter-translation among databases are converted into graph isomorphism decision problems, an unsupervised graph representation learning method is adopted to obtain the structure information and attribute information of the graphs, and then, based on a deep-learning weakly supervised graph classification method, equivalent concept graphs are given the same label according to the structure information and attribute information of the graphs, thereby achieving concept alignment and content inter-translation among the multi-source heterogeneous databases;

for databases where the dictionary and the data both exist and each has defects: firstly, a joint learning framework is built and a mutual attention mechanism is introduced; under the guidance of ontology logic rules, potential medical knowledge in medical texts is explored, and meanwhile the potential medical knowledge in the medical texts is fed back to the knowledge graph constructed based on the ontology, so that the features of words and entities, textual relation patterns and graph relation patterns are fully fused, and comprehensive alignment of the words and entities, the textual relation patterns and the graph relation patterns is realized;

entities are learned and labeled by using a mutual attention mechanism, a knowledge enhancement method and a deep neural network, and the entities are classified at fine granularity; the fine-grained medical concepts form an ontology view, and instantiating the fine-grained concepts forms an instance view; finally, cross-view learning and intra-view learning are performed on the knowledge graph by using a cross-view association model and an intra-view model, so as to achieve concept alignment and content inter-translation.

2. The method of claim 1, wherein for databases with unknown data dictionaries, concept alignment and content inter-translation between databases are achieved for structured data directly by mining uncertainty function mapping relationships; for unstructured data, the unstructured data is first converted into structured medical data, and then concept alignment and content inter-translation between databases are realized by a natural language processing method, specifically as follows:

extracting required data from a database to be analyzed, and preprocessing the data by adopting data cleaning and normalization;

firstly, preliminarily aligning concepts in a multi-source database according to a numerical distribution rule of the concepts, expressing different concepts as different parameter distributions, calculating similarity among the data concepts through statistical rules among the parameter distributions, such as mean, median, covariance and the like, and preliminarily aligning the data concepts;

secondly, the preliminarily aligned data concepts are further aligned by utilizing the potential relationship among the data concepts, and after the concepts, the relationship and the attribute values are aligned, concept alignment and content inter-translation among multi-source heterogeneous data can be realized.

3. The method for concept alignment and content inter-translation among multi-source heterogeneous databases of claim 1, wherein when unstructured data is converted into structured medical data, potential complementarity and consistency among different databases are mined based on an adversarial-learning-based relation extraction model among multi-source heterogeneous databases, relationships among entities are extracted from free text of unlabeled medical data to obtain structured medical data, and the entities and relationships are further converted into knowledge to provide basic data for semantic understanding and intelligent inference, specifically comprising the following steps:

firstly, based on the existing medical knowledge graph, word segmentation is performed on Chinese medical texts through an ensemble learning module consisting of an improved clustering algorithm and a bidirectional recurrent neural network, medical entities with complex description patterns are extracted from the segmented Chinese medical texts, and the descriptions of the extracted medical entities are mapped to standard entities through deep-learning ranking, thereby completing entity extraction and coreference disambiguation in the medical texts;

secondly, based on the adversarial-learning-based relation extraction model for multi-source heterogeneous databases, the adversarial learning method is used to learn the unique properties of each single database in the multi-source heterogeneous database environment while globally fusing the common features of the multi-source heterogeneous databases, so that the relation extraction model obtains more accurate knowledge by utilizing the corpora of multiple databases.

4. The method of claim 2, wherein the adversarial-learning-based multi-source heterogeneous database relation extraction model comprises a sentence encoder module, a multi-source heterogeneous database attention mechanism module, and an adversarial learning module;

in a sentence encoder module, for a sentence containing a plurality of words, converting all words in the sentence into corresponding input word vectors through an input layer; the input word vector is formed by splicing a text word vector and a position vector, the text word vector is used for describing grammar and semantic information of each word, and the position vector is used for describing position information of an entity; on the basis of an input layer, a sentence encoder is used to obtain vector representation of sentences, and two encoding modes, namely independent encoding and cross-database encoding, are respectively used for each database;

in the multi-source heterogeneous database attention mechanism module, the information richness of each entity is measured through an attention mechanism; an independent attention mechanism module for each database and a consistency attention mechanism module among the databases are set, the independent attention mechanism module adopts a sentence-level selective attention mechanism to weaken the influence of entities with insufficient information on the overall extraction, and the consistency attention mechanism module among the databases is used to characterize the commonality of the entities across the databases;

the adversarial learning module comprises an encoder and a discriminator, and the entities from different databases are encoded into a unified semantic space.

5. The method for concept alignment and content inter-translation among multi-source heterogeneous databases according to claim 1, wherein, during unsupervised graph representation learning based on a relational graph convolution network, affine transformation is firstly carried out on attribute information to learn the association relationship among attribute features; and aggregating the feature vectors of the neighbor nodes of each node, and updating the feature vector of the current node.

6. The method of claim 1, wherein when graph isomorphism decision is implemented by the unsupervised graph representation learning method, unsupervised graph representation learning is realized by combining unsupervised loss functions, wherein the loss functions comprise a reconstruction-loss-based R-GCN and a contrastive-loss-based R-GCN; the reconstruction-loss-based R-GCN draws on the idea of auto-encoding and performs reconstruction learning of the adjacency relations between nodes; the contrastive-loss-based R-GCN sets a scoring function used to raise the scores of positive samples and lower the scores of negative samples, and the contrastive loss is constructed based on the nodes of the graph data and the objects that have corresponding relationships with the nodes.

7. The method of claim 1, wherein the concept alignment and content inter-translation method based on concept graph isomorphism comprises the following steps:

based on the ontology, the problems of concept alignment and content inter-translation among databases are converted into a graph isomorphism decision problem by constructing concept graphs of the multi-source heterogeneous databases; graph isomorphism means that, given two graphs, it is judged whether the two graphs are completely equivalent; a deep-learning-based weakly supervised graph classification algorithm is adopted, and equivalent concept graphs are given the same label, specifically as follows:

firstly, carrying out isomorphic judgment on a small part of concept graphs by using a Weisfeiler Lehman method, and then training a weakly supervised graph neural network classification model by using a judgment result as training data for classifying the concept graphs;

based on the Weisfeiler-Lehman iterative algorithm, firstly the labels of each node and its neighbors are aggregated; then the aggregated labels of the nodes and their neighbors are hashed into unique new labels, and if the node labels of the two graphs differ in some iteration, the two graphs are considered non-isomorphic;

acquiring a concept graph from a multi-source database, and carrying out isomorphic judgment on part of the concept graph by a Weisfeiler Lehman algorithm to obtain a classification label of the concept graph; training a weakly supervised graph neural network classification model by using an unlabeled concept graph and a concept graph with classification labels, and carrying out isomorphic classification alignment on the concept graph based on the graph neural network classification model.

8. A concept alignment and content inter-translation system among multi-source heterogeneous databases is characterized by comprising a database defect judgment module, a data-driven concept alignment and inter-translation module, an ontology-driven concept alignment and inter-translation module and a data-and-ontology-driven concept alignment and inter-translation module;

the database defect judging module is used for acquiring basic information of the database to be processed and judging the defect type of the database to be processed according to the basic information;

the data-driven concept alignment and inter-translation module is used for databases where the data dictionary is unknown: obtaining functional mapping relations between data fields that are heterogeneous and whose data dictionaries are unknown in the multi-source heterogeneous databases by using functional dependency and probability statistical models, and realizing concept alignment and content inter-translation between databases by mining uncertain functional mapping relations;

the ontology-driven concept alignment and inter-translation module is used for heterogeneous databases where the data dictionary is incomplete, unreliable, or contradictory: according to the data ontology model carried by each database, the concepts and relations involved in the multi-source heterogeneous medical databases are represented as a plurality of graph structures, the problems of concept alignment and content inter-translation among databases are converted into graph isomorphism decision problems, an unsupervised graph representation learning method is adopted to obtain the structure information and attribute information of the graphs, and then, based on a deep-learning weakly supervised graph classification method, equivalent concept graphs are given the same label according to the structure information and attribute information of the graphs, thereby achieving concept alignment and content inter-translation among the multi-source heterogeneous databases;

the data-and-ontology dual-driven concept alignment and inter-translation module is used for databases where the dictionary and the data both exist and each has defects: a joint learning framework is built and a mutual attention mechanism is introduced; under the guidance of ontology logic rules, potential medical knowledge in medical texts is explored, and meanwhile the potential medical knowledge in the medical texts is fed back to the knowledge graph constructed based on the ontology, so that the features of words and entities, textual relation patterns and graph relation patterns are fully fused and comprehensively aligned; entities are learned and labeled by using a mutual attention mechanism, a knowledge enhancement method and a deep neural network, and the entities are classified at fine granularity; the fine-grained medical concepts form an ontology view, and instantiating the fine-grained concepts forms an instance view; finally, cross-view learning and intra-view learning are performed on the knowledge graph by using a cross-view association model and an intra-view model, so as to achieve concept alignment and content inter-translation.

9. A computer device, comprising a processor and a memory, wherein the memory is used for storing a computer executable program, and the processor reads the computer executable program from the memory and executes it; when executing the computer executable program, the processor can realize the method for concept alignment and content inter-translation among multi-source heterogeneous databases according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program can implement the method for concept alignment and content inter-translation among multi-source heterogeneous databases according to any one of claims 1 to 7.

Technical Field

The invention belongs to the technical field of big data processing and multi-source data fusion, and particularly relates to a method and a system for concept alignment and content inter-translation among multi-source heterogeneous databases.

Background

At present, many information systems of medical institutions suffer from problems such as unknown, incomplete, unreliable or mutually contradictory data architectures and dictionaries, unclear data associations between systems, and non-uniform value-range standards across systems. At the regional medical level these problems are even more serious, and point-to-point interface development between institutions (concept alignment and content inter-translation) is not feasible for large-scale deployment. To realize interconnection and interoperation among multi-source heterogeneous databases, in recent years many scholars have proposed using ontologies (metadata) as intermediaries for data integration, solving semantic problems through mappings between data sources and standard ontologies; integration platforms in the health field mainly acquire the meaning of data in business systems by establishing a medical ontology base in advance to assist data understanding. The state has also established a number of data element and data set standards for different medical scenarios. However, it is often difficult to design a unified global ontology library in advance, and when local data sources are dynamically added, deleted or modified, the approach of constructing a unified ontology library lacks flexibility and can hardly meet user requirements in a short time. Another difficulty is that the mapping between the relational database schemas of current business systems and the ontology lacks automation tools, and the labor cost is huge. The data structures and the names of diseases, examinations, symptoms, medications and operations differ greatly across hospital information systems and are not named in a standard way. Achieving unified ontology management and mapping involves not only the design of medical information systems but also the differences between the expressive capability of medical language and the usage habits of specialists, and no regional platform currently solves this problem well. Because the mapping process is too complex and lacks algorithms with good performance, the mapping between database schemas and ontologies is still mostly done manually. The whole integration work depends heavily on analysts performing a large amount of data combing: data analysts complete the analysis of database contents by analyzing table structures with tools, extracting summary data and interviewing business experts, so the system implementation cycle is long and the mapping cost is high.

To construct the mapping between databases and ontologies more intuitively, graphical mapping tools have been developed in many projects, allowing users to interactively construct mappings between a database and an ontology; typical projects include COG, DartGrid, VisAVis, and the like. Such semi-automatic tools are of limited use for reducing labor costs.

In general, current methods fall into two broad categories: manual mapping and automatic mapping. Manual mapping scales poorly and its workload grows exponentially; automatic mapping is severely affected by noise, requires a large amount of manual labeling, and has not been adopted by industry.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a method and a system for concept alignment and content inter-translation among multi-source heterogeneous databases, which realize semantic intercommunication and interoperation among the multi-sources on the premise of not damaging the storage structure, the management mode and the language use habit of the existing service system.

In order to achieve the purpose, the invention adopts the technical scheme that: a method for concept alignment and content inter-translation among multi-source heterogeneous databases comprises the following steps:

acquiring basic information of a database to be processed, and judging the defect type of the database to be processed according to the basic information;

for databases where the data dictionary is unknown: obtaining functional mapping relations between data fields that are heterogeneous and whose data dictionaries are unknown in the multi-source heterogeneous databases by using functional dependency and probability statistical models, and realizing concept alignment and content inter-translation between databases by mining uncertain functional mapping relations;

for heterogeneous databases where the data dictionary is incomplete, unreliable, or contradictory: according to the data ontology model carried by each database, the concepts and relations involved in the multi-source heterogeneous medical databases are represented as a plurality of graph structures, the problems of concept alignment and content inter-translation among databases are converted into graph isomorphism decision problems, an unsupervised graph representation learning method is adopted to obtain the structure information and attribute information of the graphs, and then, based on a deep-learning weakly supervised graph classification method, equivalent concept graphs are given the same label according to the structure information and attribute information of the graphs, thereby achieving concept alignment and content inter-translation among the multi-source heterogeneous databases;

for databases where the dictionary and the data both exist and each has defects: firstly, a joint learning framework is built and a mutual attention mechanism is introduced; under the guidance of ontology logic rules, potential medical knowledge in medical texts is explored, and meanwhile the potential medical knowledge in the medical texts is fed back to the knowledge graph constructed based on the ontology, so that the features of words and entities, textual relation patterns and graph relation patterns are fully fused, and comprehensive alignment of the words and entities, the textual relation patterns and the graph relation patterns is realized;

entities are learned and labeled by using a mutual attention mechanism, a knowledge enhancement method and a deep neural network, and the entities are classified at fine granularity; the fine-grained medical concepts form an ontology view, and instantiating the fine-grained concepts forms an instance view; finally, cross-view learning and intra-view learning are performed on the knowledge graph by using a cross-view association model and an intra-view model, so as to achieve concept alignment and content inter-translation.

For a database whose data dictionary is unknown: for structured data, concept alignment and content inter-translation between databases are directly realized by mining uncertainty function mapping relations; for unstructured data, the unstructured data is first converted into structured medical data, and then concept alignment and content inter-translation between databases are realized by a natural language processing method, specifically as follows:

extracting required data from a database to be analyzed, and preprocessing the data by adopting data cleaning and normalization;

firstly, preliminarily aligning concepts in a multi-source database according to a numerical distribution rule of the concepts, expressing different concepts as different parameter distributions, calculating similarity among the data concepts through statistical rules among the parameter distributions, such as mean, median, covariance and the like, and preliminarily aligning the data concepts;

secondly, the preliminarily aligned data concepts are further aligned by utilizing the potential relationship among the data concepts, and after the concepts, the relationship and the attribute values are aligned, concept alignment and content inter-translation among multi-source heterogeneous data can be realized.

When converting unstructured data into structured medical data, based on an adversarial-learning-based relation extraction model among multi-source heterogeneous databases, potential complementarity and consistency between different databases are mined, relationships between entities are extracted from the free text of unlabeled medical data to obtain structured medical data, and the entities and relationships are then converted into knowledge to provide basic data for semantic understanding and intelligent inference, specifically comprising the following steps:

firstly, based on the existing medical knowledge graph, word segmentation is performed on Chinese medical texts through an ensemble learning module consisting of an improved clustering algorithm and a bidirectional recurrent neural network, medical entities with complex description patterns are extracted from the segmented Chinese medical texts, and the descriptions of the extracted medical entities are mapped to standard entities through deep-learning ranking, thereby completing entity extraction and coreference disambiguation in the medical texts;

secondly, based on the adversarial-learning-based relation extraction model for multi-source heterogeneous databases, the adversarial learning method is used to learn the unique properties of each single database in the multi-source heterogeneous database environment while globally fusing the common features of the multi-source heterogeneous databases, so that the relation extraction model obtains more accurate knowledge by utilizing the corpora of multiple databases.

The adversarial-learning-based multi-source heterogeneous database relation extraction model specifically comprises a sentence encoder module, a multi-source heterogeneous database attention mechanism module and an adversarial learning module;

in a sentence encoder module, for a sentence containing a plurality of words, converting all words in the sentence into corresponding input word vectors through an input layer; the input word vector is formed by splicing a text word vector and a position vector, the text word vector is used for describing grammar and semantic information of each word, and the position vector is used for describing position information of an entity; on the basis of an input layer, a sentence encoder is used to obtain vector representation of sentences, and two encoding modes, namely independent encoding and cross-database encoding, are respectively used for each database;

in the multi-source heterogeneous database attention mechanism module, the information richness of each entity is measured through an attention mechanism; an independent attention mechanism module for each database and a consistency attention mechanism module among the databases are set, the independent attention mechanism module adopts a sentence-level selective attention mechanism to weaken the influence of entities with insufficient information on the overall extraction, and the consistency attention mechanism module among the databases is used to characterize the commonality of the entities across the databases;

the adversarial learning module comprises an encoder and a discriminator, and the entities from different databases are encoded into a unified semantic space.

When unsupervised graph representation learning based on the relational graph convolutional network is performed, affine transformation is performed on attribute information, and the association relation among attribute features is learned; and aggregating the feature vectors of the neighbor nodes of each node, and updating the feature vector of the current node.

When graph isomorphism decision is implemented by the unsupervised graph representation learning method, unsupervised graph representation learning is realized by combining unsupervised loss functions, wherein the loss functions comprise a reconstruction-loss-based R-GCN and a contrastive-loss-based R-GCN; the reconstruction-loss-based R-GCN draws on the idea of auto-encoding and performs reconstruction learning of the adjacency relations between nodes; the contrastive-loss-based R-GCN sets a scoring function used to raise the scores of positive samples and lower the scores of negative samples, and the contrastive loss is constructed based on the nodes of the graph data and the objects that have corresponding relationships with the nodes.

The concept alignment and content inter-translation method based on concept graph isomorphism specifically comprises the following steps:

based on the ontology, the problems of concept alignment and content inter-translation among databases are converted into a graph isomorphism decision problem by constructing concept graphs of the multi-source heterogeneous databases; graph isomorphism means that, given two graphs, it is judged whether the two graphs are completely equivalent; a deep-learning-based weakly supervised graph classification algorithm is adopted, and equivalent concept graphs are given the same label, specifically as follows:

firstly, carrying out isomorphic judgment on a small part of concept graphs by using a Weisfeiler Lehman method, and then training a weakly supervised graph neural network classification model by using a judgment result as training data for classifying the concept graphs;

based on the Weisfeiler-Lehman iterative algorithm, firstly the labels of each node and its neighbors are aggregated; then the aggregated labels of the nodes and their neighbors are hashed into unique new labels, and if the node labels of the two graphs differ in some iteration, the two graphs are considered non-isomorphic;

acquiring a concept graph from a multi-source database, and carrying out isomorphic judgment on part of the concept graph by a Weisfeiler Lehman algorithm to obtain a classification label of the concept graph; training a weakly supervised graph neural network classification model by using an unlabeled concept graph and a concept graph with classification labels, and carrying out isomorphic classification alignment on the concept graph based on the graph neural network classification model.

A concept alignment and content inter-translation system among multi-source heterogeneous databases comprises a database defect judgment module, a concept alignment and inter-translation module based on data driving, a concept alignment and inter-translation module based on ontology driving and a concept alignment and inter-translation module based on data and ontology dual driving;

the database defect judging module is used for acquiring basic information of the database to be processed and judging the defect type of the database to be processed according to the basic information;

the data-driven concept alignment and inter-translation module is used for databases where the data dictionary is unknown: obtaining functional mapping relations between data fields that are heterogeneous and whose data dictionaries are unknown in the multi-source heterogeneous databases by using functional dependency and probability statistical models, and realizing concept alignment and content inter-translation between databases by mining uncertain functional mapping relations;

the ontology-driven concept alignment and inter-translation module is used for heterogeneous databases where the data dictionary is incomplete, unreliable, or contradictory: according to the data ontology model carried by each database, the concepts and relations involved in the multi-source heterogeneous medical databases are represented as a plurality of graph structures, the problems of concept alignment and content inter-translation among databases are converted into graph isomorphism decision problems, an unsupervised graph representation learning method is adopted to obtain the structure information and attribute information of the graphs, and then, based on a deep-learning weakly supervised graph classification method, equivalent concept graphs are given the same label according to the structure information and attribute information of the graphs, thereby achieving concept alignment and content inter-translation among the multi-source heterogeneous databases;

the data-and-ontology dual-driven concept alignment and inter-translation module is used for databases where the dictionary and the data both exist and each has defects: a joint learning framework is built and a mutual attention mechanism is introduced; under the guidance of ontology logic rules, potential medical knowledge in medical texts is explored, and meanwhile the potential medical knowledge in the medical texts is fed back to the knowledge graph constructed based on the ontology, so that the features of words and entities, textual relation patterns and graph relation patterns are fully fused and comprehensively aligned; entities are learned and labeled by using a mutual attention mechanism, a knowledge enhancement method and a deep neural network, and the entities are classified at fine granularity; the fine-grained medical concepts form an ontology view, and instantiating the fine-grained concepts forms an instance view; finally, cross-view learning and intra-view learning are performed on the knowledge graph by using a cross-view association model and an intra-view model, so as to achieve concept alignment and content inter-translation.

A computer device comprises a processor and a memory, wherein the memory is used for storing a computer executable program, and the processor reads the computer executable program from the memory and executes it; when executing the computer executable program, the processor can realize the method for concept alignment and content inter-translation among multi-source heterogeneous databases.

A computer readable storage medium, in which a computer program is stored; when the computer program is executed by a processor, the method for concept alignment and content inter-translation among multi-source heterogeneous databases according to the present invention can be implemented.

Compared with the prior art, the invention has at least the following beneficial effects:

the data-driven alignment and inter-translation method requires no expert labeling and relies only on the inherent distribution characteristics of the data; the ontology-driven alignment and inter-translation method is accurate and efficient and does not need to rely on a large amount of training data; the concept alignment and content inter-translation technique among multi-source heterogeneous databases dual-driven by data and ontology combines the advantages of the two so that they complement and promote each other, bringing the whole system to a higher level of intelligence. It solves the problems that data among business systems are heterogeneous, unknown, incomplete, unreliable or mutually contradictory, and that language use across hospitals and departments in the databases lacks a unified guideline specification, and realizes semantic intercommunication and interoperation among multiple systems without damaging the storage structures, management modes and language usage habits of the existing business systems. Accurate, efficient and robust automatic concept alignment and content inter-translation among multi-source heterogeneous databases can be realized in the following three scenarios: 1. when the dictionary is unknown, alignment and inter-translation are realized by mining massive heterogeneous multi-modal medical data; 2. among heterogeneous databases whose dictionaries are incomplete, unreliable or contradictory, alignment and inter-translation are realized by reasoning over the mapping relations between multiple ontology definitions and models; 3. when the dictionary and the data both exist and each is defective, accurate, efficient, robust alignment and inter-translation with low data dependency are realized by collaboratively mining the mapping relations between data and ontologies within the multiple systems.

Drawings

FIG. 1 is a schematic diagram of a key technical framework for the solution of the multi-source heterogeneous database of the present invention.

FIG. 2 is a schematic diagram of key steps of a data-driven, ontology-driven, and dual-drive system oriented to concept alignment and content translation according to the present invention.

FIG. 3 is a domain knowledge graph based on mutual attention mechanism and collaborative training framework for concept-oriented alignment and content inter-translation.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings.

The data in the invention refers to multi-source heterogeneous data across multiple medical institutions; the ontology refers to a clear and well-defined specification of a conceptual model, which can represent commonly recognized, shareable knowledge.

Concept alignment and content inter-translation among multi-source heterogeneous databases based on data driving:

Acquiring basic information of a database to be processed, and judging the defect type of the database to be processed according to the basic information; the defect types comprise: the data dictionary is unknown; the data dictionary in the database is incomplete, unreliable or contradictory; and the dictionary and the data of the database both exist but each has defects.

Referring to FIG. 1, there is structured and unstructured data in a multi-source heterogeneous medical database; as an example, for structured data, the invention realizes concept alignment and content inter-translation between databases based on uncertainty function mapping relation mining; for unstructured data, the method firstly converts the unstructured data into structured medical data, and then a natural language processing method realizes concept alignment and content inter-translation among databases.

For a structured-data-driven multi-source heterogeneous database: some structured data exist in the multi-source heterogeneous medical databases, such as the patient's name, age, sex, height, weight, test results, and the like. Although the structured data all correspond to fields in the corresponding data dictionaries, because the data of different hospitals are heterogeneous and the data dictionaries are unknown, incomplete, unreliable or contradictory, the data concepts are difficult to align and the contents cannot be inter-translated; for example, for blood pressure, some hospitals record systolic and diastolic pressure while others record central arterial pressure, and ICD codes may differ between hospitals. To solve these problems, the invention realizes concept alignment and content inter-translation between databases based on uncertainty function mapping relation mining.

Two concepts may be equivalent if their numerical distributions are similar and they share multiple identical attributes. Data mining technology is applied to the medical field by utilizing functional dependency and probability statistical models, so as to find the functional mapping relations between data fields that are heterogeneous and whose data dictionaries are unknown, incomplete, unreliable or contradictory in the multi-source heterogeneous databases. The specific scheme is as follows:

extracting required data from a database to be analyzed, and preprocessing the data by adopting data cleaning and normalization;

firstly, according to the value distribution rule of the concepts, the concepts in the multi-source database are preliminarily aligned, different concepts are expressed as different parameter distributions, the similarity among the data concepts is calculated through the statistical rules among the parameter distributions, such as the average number, the median, the covariance and the like, and the data concepts are preliminarily aligned.
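As a rough illustration of the preliminary alignment step above, the sketch below compares two numeric fields by summary statistics of their value distributions; the field names, the Gaussian toy data, and the particular similarity measure (one minus a normalized distance between statistics) are illustrative assumptions rather than the patent's prescribed formulation.

```python
# Minimal sketch: preliminary concept alignment by comparing summary statistics
# of numeric fields from two databases. Field names and the similarity measure
# are illustrative assumptions.
import numpy as np

def field_stats(values):
    """Summarize a field's value distribution with simple statistics."""
    v = np.asarray(values, dtype=float)
    return np.array([v.mean(), np.median(v), v.std()])

def stat_similarity(values_a, values_b, eps=1e-9):
    """Similarity in [0, 1]: the closer the statistics, the closer to 1."""
    sa, sb = field_stats(values_a), field_stats(values_b)
    scale = np.abs(sa) + np.abs(sb) + eps
    return float(1.0 - np.mean(np.abs(sa - sb) / scale))

# Hypothetical fields from two hospital systems recording the same concept.
systolic_db1 = np.random.normal(120, 15, 500)    # e.g. "SBP"
systolic_db2 = np.random.normal(122, 14, 480)    # e.g. "systolic_pressure"
heart_rate_db2 = np.random.normal(75, 10, 480)

print(stat_similarity(systolic_db1, systolic_db2))    # high -> candidate alignment
print(stat_similarity(systolic_db1, heart_rate_db2))  # lower -> unlikely match
```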

Secondly, the preliminarily aligned data concepts are further aligned by utilizing the potential relations among the data concepts. Specifically, for an ontology O, if ⟨X, R, Y⟩ ∈ O, it is written as R(X, Y), where X is a concept, Y is a concept or an attribute value, and R is a mapping relationship between X and Y; if R(X, Y) holds, then R⁻¹ is called the inverse mapping of R, i.e. R⁻¹(Y, X). When concepts co-refer to the same object, or when a concept corresponds to the same attribute value, the concepts or attribute values are said to be equivalent, denoted by the symbol "≡". Although the functional mapping relation can serve as a basis for judging concept alignment, being a function is not a sufficient and necessary condition for alignment: when many errors exist in the ontology, simply using the functional relation to judge whether concepts are aligned gives a low fault-tolerance rate; moreover, even if some concepts in the ontology are not connected by a functional relation R, they may still be equivalent and alignable, for example when the relation R is one-to-many. Therefore, the invention introduces a function τ(·) that measures the functionality of a relation R, i.e. how strictly the relation behaves as a function. The functionality is defined according to the notion of a function, which must be many-to-one or one-to-one: if R is a function, then τ(R) = 1; if R is one-to-many or many-to-many, then τ(R) < 1; the value range of τ(·) is 0 to 1. The inverse functionality is defined as τ⁻¹(R) = τ(R⁻¹). Reasoning shows that the higher the probability that the two Y are equivalent and the higher the functionality of the relation R, the higher the probability that the two X are equivalent. Two logical rules for concept alignment can be expressed on this basis.

the conversion probability is expressed as:

Pr₁(X ≡ X′) = 1 − ∏_{R(X,Y), R(X′,Y′)} (1 − τ⁻¹(R) × Pr(Y ≡ Y′))    (2)

the above description is a method of aligning X (concept), and the same method can be used for aligning relationship or attribute values similarly. After the concepts, the relations and the attribute values are all aligned, concept alignment and content inter-translation among the multi-source heterogeneous data can be achieved.
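The following sketch illustrates the functionality measure τ and the alignment probability of equation (2). The estimator used for τ (the ratio of distinct subjects to subject-object pairs) and the toy relation are assumptions for illustration; the patent only specifies that τ equals 1 for functional relations and is below 1 otherwise.

```python
# Sketch of the functionality measure tau and the alignment probability of
# equation (2). The estimator for tau is an illustrative assumption.

def tau(pairs):
    """Functionality of a relation given as (x, y) pairs, in (0, 1]."""
    subjects = {x for x, _ in pairs}
    return len(subjects) / len(pairs) if pairs else 0.0

def tau_inv(pairs):
    """tau of the inverse relation: tau^{-1}(R) = tau(R^{-1})."""
    return tau([(y, x) for x, y in pairs])

def align_probability(evidence):
    """
    Equation (2): Pr1(X = X') = 1 - prod(1 - tau^{-1}(R) * Pr(Y = Y'))
    over relation instances R(X, Y), R(X', Y') whose Y are probably equal.
    `evidence` is a list of (relation_pairs, prob_y_equiv) tuples.
    """
    prod = 1.0
    for pairs, p_y in evidence:
        prod *= 1.0 - tau_inv(pairs) * p_y
    return 1.0 - prod

# Hypothetical relation "recorded_in": patient -> ward (one-to-many inverse).
r_pairs = [("p1", "wardA"), ("p2", "wardA"), ("p3", "wardB")]
print(tau(r_pairs), tau_inv(r_pairs))
print(align_probability([(r_pairs, 0.9), (r_pairs, 0.7)]))
```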

Concept alignment and content inter-translation among unstructured data-driven multi-source heterogeneous databases:

in the electronic medical record, unstructured texts such as patient symptom expressions, past medical history and treatment records input by doctors are difficult to store in a database in separate fields, and unified standardization cannot be achieved. In order to effectively utilize the medical data, the invention provides a natural language processing method for converting unstructured medical data into structured medical data, and after the structured medical data exist, the concept alignment and the content inter-translation among multi-source heterogeneous databases can be realized according to a method for carrying out the concept alignment and the content inter-translation among the multi-source heterogeneous databases driven by the structured data.

Because existing methods for word segmentation and for extracting entities (instances of concrete objects and abstract concepts) are relatively mature, relation extraction systems based on distant supervision make it possible to train usable relation extraction models with large-scale data, but this approach still has problems to be solved: the training data acquired by distant supervision contains a great deal of noise, and distant supervision has difficulty obtaining long-tail entities and their relations. The invention is based on an adversarial-learning-based relation extraction model among multi-source heterogeneous databases: potential complementarity and consistency between different databases are mined, relations between entities are extracted from the free text of unlabeled medical data to obtain structured medical data, and the entities and relations are further converted into knowledge, providing basic data for semantic understanding and intelligent inference.

The specific steps are as follows: firstly, based on the existing medical knowledge graph, Chinese medical texts are segmented by an ensemble learning module consisting of an improved clustering algorithm and a bidirectional recurrent neural network (the segmentation may of course also be performed by a self-attention neural network and a generative adversarial network); medical entities with complex description patterns are extracted from the segmented Chinese medical texts, and the descriptions of the extracted medical entities are mapped to standard entities through a deep-learning ranking algorithm, completing entity extraction and coreference disambiguation in the medical texts.

Referring to fig. 2, the adversarial-learning-based multi-source heterogeneous database relation extraction model is specifically as follows:

given an entity pair (h, t), the sentence containing the entity pair in m different databases is defined asWhereinCorresponding to n in the jth databasejThe relational extraction model of the multi-source heterogeneous database of the example set is to utilize S(h,t)Predicting entity pairs (h, t) with each instance in a multiple source database scenarioThe relation R ∈ R forms the probability of valid knowledge. The multi-source heterogeneous database relation extraction model comprises a sentence encoder module, a multi-source heterogeneous database attention mechanism module and a confrontation learning module.

In a sentence encoder module, for a sentence containing a plurality of words, converting all words in the sentence into corresponding input word vectors through an input layer; the input word vector is formed by splicing a text word vector and a position vector, the text word vector is used for describing grammar and semantic information of each word, and the position vector is used for describing position information of an entity. On the basis of the input layer, a sentence encoder, such as a bi-directional recurrent neural network, is used to obtain a vector representation of the sentence. The multi-source heterogeneous database relation extraction model respectively uses two coding modes of independent coding and cross-database coding for each database.
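A minimal sketch of the input layer described above: each token vector is the concatenation of a text word vector and two position vectors (relative offsets to the head and tail entities). The embedding dimensions, the random lookup tables, and the mean pooling used as a stand-in for the bidirectional recurrent encoder are assumptions.

```python
# Sketch of the input layer: [word embedding ; position-to-head ; position-to-tail].
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_WORD, D_POS, MAX_OFFSET = 1000, 50, 5, 60

word_emb = rng.normal(size=(VOCAB, D_WORD))
pos_emb = rng.normal(size=(2 * MAX_OFFSET + 1, D_POS))   # offsets in [-60, 60]

def encode_sentence(token_ids, head_idx, tail_idx):
    """Return a sentence vector from concatenated word/position embeddings."""
    vecs = []
    for i, tok in enumerate(token_ids):
        off_h = int(np.clip(i - head_idx, -MAX_OFFSET, MAX_OFFSET)) + MAX_OFFSET
        off_t = int(np.clip(i - tail_idx, -MAX_OFFSET, MAX_OFFSET)) + MAX_OFFSET
        vecs.append(np.concatenate([word_emb[tok], pos_emb[off_h], pos_emb[off_t]]))
    x = np.stack(vecs)              # (sentence_len, D_WORD + 2 * D_POS)
    return x.mean(axis=0)           # stand-in for the BiRNN sentence encoder

sent = encode_sentence([12, 7, 431, 55, 99], head_idx=1, tail_idx=3)
print(sent.shape)                   # (60,)
```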

In the multi-source heterogeneous database attention mechanism module, the information richness of each entity is measured through the attention mechanism, and the sentence encoder separately encodes the independent information of each database and the consistent information among the databases, so that the independent attention mechanism module of each database and the consistent attention mechanism module among the databases are set. The independent attention mechanism module adopts a sentence-level selective attention mechanism, the influence of entities with poor information on the whole extraction is weakened, and the attention mechanism module with consistency among the databases is used for describing the commonality of the entities in the databases.

In the adversarial learning module, which comprises an encoder and a discriminator, the entities from different databases are encoded into a unified semantic space, and an adversarial learning strategy is adopted to ensure that the entities from different databases are fully mixed in that space. The discriminator is used to judge which database a feature vector comes from, and the encoder is used to generate feature vectors whose origin the discriminator finds hard to distinguish. After training, when the encoder and the discriminator reach equilibrium, entities with similar semantic information in different databases are encoded to nearby positions in the space and their features are fully fused, so that the model can obtain more accurate knowledge from the corpora of multiple databases, providing a basis for concept alignment and content inter-translation among the multi-source heterogeneous databases.
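The adversarial objective can be sketched as follows: a discriminator tries to tell which database an encoded sentence came from, while the encoder is rewarded when the discriminator cannot. The logistic discriminator, the feature shapes, and the simple negated loss are illustrative assumptions; a real model would update both networks by backpropagation.

```python
# Conceptual sketch of the adversarial objective between encoder and discriminator.
import numpy as np

rng = np.random.default_rng(1)

def discriminator(features, w):
    """Probability that a feature vector came from database 1 (logistic model)."""
    return 1.0 / (1.0 + np.exp(-features @ w))

def adversarial_losses(feat_db0, feat_db1, w):
    p0, p1 = discriminator(feat_db0, w), discriminator(feat_db1, w)
    # Discriminator loss: binary cross-entropy on database-of-origin labels.
    d_loss = -np.mean(np.log(1 - p0 + 1e-9)) - np.mean(np.log(p1 + 1e-9))
    # Encoder loss: the negation, i.e. the encoder wants indistinguishable features.
    e_loss = -d_loss
    return d_loss, e_loss

feat_db0 = rng.normal(size=(32, 60))   # encoded sentences from database 0
feat_db1 = rng.normal(size=(32, 60))   # encoded sentences from database 1
w = rng.normal(size=60)
print(adversarial_losses(feat_db0, feat_db1, w))
```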

Concept alignment and content inter-translation among multi-source heterogeneous databases based on ontology driving

According to the data ontology model carried by each database, the concepts and relations involved in the multi-source heterogeneous medical databases are represented as a plurality of graph structures, and the problems of concept alignment and content inter-translation among databases are thereby converted into graph isomorphism decision problems. From the viewpoint of solving the graph isomorphism decision problem, the invention adopts an unsupervised graph representation learning method to realize graph isomorphism decision.

The concepts form an ontology, and the ontology defines computable logic rules among the concepts; according to the guidance of the ontology, the concepts in the database are constructed into a graph representation, the concepts or attribute values of the concepts serve as nodes of the graph, and the relationships or attributes among the concepts serve as edges of the graph. By constructing the concept graph of the multi-source heterogeneous database, the problem of concept alignment and content inter-translation among databases can be converted into a graph isomorphic judgment problem.
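A minimal sketch of constructing such a concept graph: concepts and attribute values become nodes, and relations or attributes become labeled edges. The toy ontology triples are hypothetical.

```python
# Build a concept graph from ontology triples: nodes = concepts/attribute values,
# edges = relations/attributes. The triples below are hypothetical examples.
triples = [
    ("Hypertension", "has_measurement", "SystolicPressure"),
    ("SystolicPressure", "unit", "mmHg"),
    ("Hypertension", "treated_by", "Antihypertensive"),
]

nodes = set()
edges = {}          # (head, tail) -> relation label
for head, relation, tail in triples:
    nodes.update([head, tail])
    edges[(head, tail)] = relation

print(sorted(nodes))
print(edges)
```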

The unsupervised graph representation learning method and the isomorphism decision algorithm for concept graphs are specifically as follows.

An unsupervised graph representation learning method: if the representation of graph data contains rich semantic information, downstream tasks such as node classification, edge prediction and graph classification obtain good input features. Traditional graph representation learning methods include matrix factorization methods and random walk methods. Matrix factorization methods decompose a matrix describing the structural information of the graph, mapping nodes into a low-dimensional vector space while preserving structural similarity; generally, such methods have analytic solutions but high time and space complexity. Random walk methods treat the sequences generated by random walks on the graph as sentences and the nodes as words, and learn node representations by analogy with word vector methods. Therefore, the invention adopts an unsupervised graph representation learning method based on the relational graph convolutional network (R-GCN).

The learning of attribute information and structure information by a graph convolutional network (GCN) can be divided into two steps: first, an affine transformation is applied to the attribute information to learn the associations among attribute features; second, the features of the neighbor nodes of each node in the graph structure are aggregated and the features of the current node are updated. Since the constructed medical data concept graph has complex relations and the GCN does not explicitly consider the differences between relations among nodes, the invention models the medical data concept graph using R-GCN and its variants. When processing the neighbors of a node, R-GCN considers, for each relation, both the forward and the inverse direction of the relation; it first aggregates separately the neighbors connected by the same relation, adds a self-connection, and, after all neighbors of the same relation are aggregated, performs an overall aggregation. On top of the neighbor aggregation of GCN, R-GCN adds the dimension of aggregating over relations, so the aggregation becomes a double aggregation process; the core formula is as follows:

h_i^(l+1) = σ( Σ_{r∈R} Σ_{j∈N_i^r} (1/c_{i,r}) · W_r^(l) · h_j^(l) + W_o^(l) · h_i^(l) )

where h_i^(l+1) represents the state of node i at layer l+1, l denotes the l-th layer of the relational graph neural network, R denotes the set of all relations in the graph, N_i^r denotes the set of neighbors of node v_i connected by relation r, c_{i,r} is a normalization constant, W_r^(l) is the weight parameter at layer l corresponding to neighbors connected by relation r, W_o^(l) is the weight parameter corresponding to the node itself, v_j denotes node j, h_i^(l) is the state of node i at layer l, h_j^(l) is the state of node j at layer l, and σ is the activation function.
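A minimal numpy sketch of one propagation step following the formula above: per-relation neighbor aggregation with the normalization c_{i,r} plus a self-connection term. The toy graph, the dimensions, and the tanh activation are illustrative assumptions.

```python
# One R-GCN propagation step: relation-wise neighbor aggregation + self-connection.
import numpy as np

rng = np.random.default_rng(2)
N, D_IN, D_OUT = 4, 8, 6                 # nodes, input dim, output dim

# neighbors[r][i] = list of neighbors of node i under relation r (toy graph)
neighbors = {
    "treats":   {0: [1], 1: [0], 2: [], 3: [2]},
    "subclass": {0: [2, 3], 1: [], 2: [0], 3: []},
}
H = rng.normal(size=(N, D_IN))                        # h^{(l)}
W_r = {r: rng.normal(size=(D_IN, D_OUT)) for r in neighbors}
W_o = rng.normal(size=(D_IN, D_OUT))

def rgcn_layer(H, neighbors, W_r, W_o, act=np.tanh):
    H_next = H @ W_o                                  # self-connection term
    for r, adj in neighbors.items():
        for i, nbrs in adj.items():
            if not nbrs:
                continue
            c_ir = len(nbrs)                          # normalization constant
            H_next[i] += sum(H[j] @ W_r[r] for j in nbrs) / c_ir
    return act(H_next)

print(rgcn_layer(H, neighbors, W_r, W_o).shape)       # (4, 6) -> h^{(l+1)}
```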

R-GCN is an important neural network structure for representation learning on graph data, and unsupervised graph representation learning can be realized by combining it with a suitable unsupervised loss function; the design effort of the unsupervised learning lies mainly in the loss function, and the invention constructs two types: the reconstruction-loss-based R-GCN and the contrastive-loss-based R-GCN. The reconstruction-loss-based R-GCN draws on the idea of auto-encoding and performs reconstruction learning of the adjacency relations between nodes; it comprises an encoder module, a decoder module and a loss function module. The contrastive-loss-based R-GCN sets a scoring function used to raise the scores of positive samples and lower the scores of negative samples, and the contrastive loss is constructed based on the nodes of the graph data and the objects that have corresponding relationships with the nodes. The objects having a corresponding relationship with a node may be its neighbors, the subgraph where the node is located, or the full graph. The scoring function is expected to raise the score of a node paired with its corresponding objects and lower the score of the node paired with unrelated objects.
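The contrastive scoring idea can be sketched as follows: a bilinear score that should be high for a node paired with a related object (e.g. a subgraph summary) and low for an unrelated one, trained with a logistic contrastive loss. The bilinear form and the way negatives are drawn are assumptions.

```python
# Sketch of a contrastive scoring function for unsupervised graph learning.
import numpy as np

rng = np.random.default_rng(3)
D = 16
W = rng.normal(size=(D, D)) * 0.1

def score(node_vec, obj_vec):
    """Bilinear scoring function s(h, o) = h^T W o."""
    return float(node_vec @ W @ obj_vec)

def contrastive_loss(node, pos_obj, neg_obj):
    """Encourage score(node, positive object) > score(node, negative object)."""
    s_pos, s_neg = score(node, pos_obj), score(node, neg_obj)
    return -np.log(1 / (1 + np.exp(-s_pos))) - np.log(1 - 1 / (1 + np.exp(-s_neg)))

node = rng.normal(size=D)          # embedding of a node (e.g. R-GCN output)
pos = rng.normal(size=D)           # e.g. summary of the node's own subgraph
neg = rng.normal(size=D)           # object unrelated to the node
print(contrastive_loss(node, pos, neg))
```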

The unsupervised R-GCN model learns the structure information and the attribute information of the graph at the same time, the two kinds of information are effectively complemented in the learning process, an accurate and robust graph characteristic learning result is obtained, and assistance is provided for tasks such as downstream node classification, edge prediction and graph classification.

The concept alignment and content inter-translation method based on concept graph isomorphism comprises the following steps:

based on the ontology, by constructing the concept graph of the multi-source heterogeneous database, the problems of concept alignment and content inter-translation among databases can be converted into graph isomorphism judgment problems. Graph isomorphism means that given two graphs, it is judged whether the two graphs are completely equivalent. As an example, the Weisfeiler Lehman algorithm can be used for graph isomorphism judgment, the efficiency is relatively low, and the weak supervised graph classification algorithm based on deep learning is preferably adopted by the invention, and equivalent conceptual graphs are given the same label. The method comprises the following specific steps:

First, the Weisfeiler-Lehman algorithm is used for isomorphism judgment on a small part of the concept graphs, and the judgment results are then used as training data to train a weakly supervised graph neural network classification model for classifying the concept graphs.

The Weisfeiler-Lehman test is an iterative algorithm that addresses the graph isomorphism problem through the following steps: (1) aggregate the label of each node with the labels of its neighbors; (2) hash each aggregated node-and-neighbor label into a unique new label. Two graphs are considered non-isomorphic if the node labels of the two graphs differ in some iteration.
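For concreteness, here is a minimal sketch of the 1-dimensional Weisfeiler-Lehman test in Python; the adjacency-list input format and the use of Python's built-in hash in place of a perfect relabeling are assumptions for illustration, and a matching label histogram only means the two graphs may be isomorphic.

```python
# Sketch of the iterative Weisfeiler-Lehman relabeling described above.
from collections import Counter

def wl_label_histogram(adj, labels, iterations=3):
    """adj: {node: [neighbors]}; labels: {node: initial label}."""
    for _ in range(iterations):
        new_labels = {}
        for v in adj:
            # (1) aggregate the label of v with the sorted labels of its neighbors
            signature = (labels[v], tuple(sorted(labels[u] for u in adj[v])))
            # (2) hash the aggregated signature into a compact new label
            new_labels[v] = hash(signature)
        labels = new_labels
    return Counter(labels.values())

def maybe_isomorphic(adj1, labels1, adj2, labels2, iterations=3):
    # Different histograms prove non-isomorphism; equal histograms are inconclusive.
    return wl_label_histogram(adj1, labels1, iterations) == wl_label_histogram(adj2, labels2, iterations)
```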

A large number of concept graphs are obtained from the multi-source databases, and isomorphism judgment is carried out on a small number of them with the Weisfeiler-Lehman algorithm to obtain classification labels for those concept graphs. A weakly supervised graph neural network classification model is then trained with the large number of unlabeled concept graphs and the small number of concept graphs carrying classification labels.

Graph classification needs to pay attention not only to the attribute information of each node but also to the structural information of the graph, and it must fuse and learn the global information of the graph; therefore the graph classification model must both perform representation learning on the nodes and, after multiple iterations, pool and integrate the learned node information. The invention uses a weakly supervised graph classification algorithm based on global pooling and a weakly supervised graph classification algorithm based on hierarchical pooling. In hierarchical pooling, the invention relies on a graph-collapse pooling mechanism and an edge-contraction pooling mechanism. In the graph-collapse pooling mechanism, the graph is divided into different subgraphs, and each subgraph is regarded as a super node, forming a collapsed graph and realizing hierarchical learning of the graph's global information. In the edge-contraction pooling mechanism, edges in the graph are removed in parallel, the two endpoints of each removed edge are merged while their connection relationships are maintained, and the global information of the graph is gradually learned through this recursive merging operation.
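A minimal sketch of the global-pooling variant is given below in PyTorch; the mean-pooling readout and the batching scheme via a graph-index vector are illustrative assumptions, and the hierarchical (graph-collapse or edge-contraction) variants would replace the single pooling step with repeated coarsening.

```python
# Sketch of weakly supervised graph classification with a global pooling readout,
# assuming node embeddings come from an R-GCN encoder as sketched earlier.
import torch
import torch.nn as nn

class GlobalPoolClassifier(nn.Module):
    def __init__(self, node_dim: int, num_classes: int):
        super().__init__()
        self.readout = nn.Sequential(nn.Linear(node_dim, node_dim), nn.ReLU(),
                                     nn.Linear(node_dim, num_classes))

    def forward(self, node_states, graph_index, num_graphs):
        """node_states: (total_nodes, node_dim); graph_index[i] tells which graph
        node i belongs to, so a whole batch of concept graphs is handled at once."""
        pooled = torch.zeros(num_graphs, node_states.size(1),
                             device=node_states.device).index_add_(0, graph_index, node_states)
        counts = torch.zeros(num_graphs, device=node_states.device).index_add_(
            0, graph_index, torch.ones_like(graph_index, dtype=node_states.dtype))
        # Mean-pool the node states of each graph, then predict a class label per graph.
        return self.readout(pooled / counts.clamp(min=1).unsqueeze(-1))
```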

The trained graph classification model can efficiently predict whether two concept graphs are isomorphic. When two concept graphs are isomorphic, all nodes and edges in the two graphs are aligned, and concept alignment and content inter-translation of the multi-source heterogeneous databases can be carried out according to this alignment.

Referring to FIG. 3, the concept alignment and content inter-translation technique between multi-source heterogeneous databases driven jointly by data and ontology is as follows:

Purely data-driven concept alignment and content inter-translation algorithms for multi-source heterogeneous databases depend heavily on access to a large amount of original data resources in the databases, incur huge computational cost, exhibit strong data dependence, are unsuitable when data access authorization is limited, and are easily affected by noise. On the other hand, although the ontology-driven method greatly improves operating efficiency, it easily produces ambiguous results when the ontologies are unknown, unreliable or contradictory, and it cannot exploit the rich semantic information implied in the original data. The invention therefore adopts a method of concept alignment and content inter-translation between multi-source heterogeneous databases driven jointly by data and ontology: first, a data and ontology dual-driven mutual attention algorithm for medical knowledge acquisition is proposed; on this basis, a cross-view domain knowledge graph oriented to a specific medical scenario is constructed; and the concept alignment and content inter-translation of the multi-source heterogeneous databases are realized by means of this cross-view domain knowledge graph.

Data-driven artificial intelligence algorithms have automatic learning capability, are relatively easy to establish and maintain, and can well simulate human thinking processes such as association, intuition, analogy, induction, learning and memory, but they lack the capability of inverse deduction and are insufficient in systematicness and interpretability. Ontology-driven, logic-based computing techniques have extremely strong deductive reasoning capability, but they require humans to supply a great deal of common sense and domain knowledge as a prerequisite for rule establishment; such knowledge is often very expensive to acquire and may contain incorrect information that affects the correctness of reasoning. Therefore, the invention adopts a concept alignment and content inter-translation method between multi-source heterogeneous databases driven jointly by data and ontology, combining the advantages of data driving and ontology driving so that they complement and promote each other and the whole system reaches a higher level of intelligence. The invention proposes a data and ontology dual-driven mutual attention algorithm mechanism for medical knowledge acquisition, and at the same time proposes a method for constructing and applying a cross-view domain knowledge graph oriented to concept alignment and content inter-translation.

Data and ontology dual-driven mutual attention algorithm mechanism for medical knowledge acquisition

There are two main methods for extending the relevant knowledge in an existing medical knowledge graph. One is to train a relation extraction model that extracts medical knowledge from medical text, which is a data-driven method; the other is to use a knowledge representation model to fill in knowledge in a knowledge graph constructed from an ontology, which is an ontology-driven method. However, current work rarely considers combining the two approaches for unified knowledge extraction, so the invention proposes a data and ontology dual-driven algorithm model suitable for medical knowledge acquisition and introduces a joint learning strategy and a mutual attention mechanism. The specific steps are as follows:

First, a joint learning framework is built and a mutual attention mechanism is introduced, so that under the guidance of the ontology logic rules the data mining technique can more easily discover potential medical knowledge in the medical text; at the same time, the data mining results are fed back to the knowledge graph constructed from the ontology, and the knowledge content that has a large influence on training is reinforced. The joint learning framework comprehensively aligns words with entities and textual relation patterns with graph relation patterns, so that their features can be fully fused.

The medical knowledge graph G is defined as a large set consisting of an entity set, a relation set and a fact-triple set, and the medical text corpus is defined as D. The joint learning framework supports training all models simultaneously in a unified continuous space, so that the embedded representations of entities, relations and words are obtained synchronously, and the joint constraints and feature information brought by the unified space can be conveniently shared and transferred between the knowledge graph and the text model during training. Specifically, all parameters involved in the embedded representations and models are defined as the model parameters θ = {θ_E, θ_R, θ_V}, where θ_E, θ_R and θ_V denote the parameters of the entities, relations and words, respectively. The joint training framework searches for the embedded representations that best fit the structure of the given knowledge graph and the semantic information of the entities, relations and words, i.e., it finds the best parameter

\hat{\theta} = \arg\max_{\theta} P(G, D \mid \theta),

where P(G, D | θ) is a conditional probability function that measures how well the embeddings express the graph and the text given the entity, relation and word embedding model parameters θ. The conditional probability P(G | θ_E, θ_R) is used to learn structural features from the knowledge graph G and obtain the embedded representations of entities and relations. The conditional probability P(D | θ_V) is used to learn textual features from the medical texts and obtain the embedded representations of words and semantic relations. The triples in the triple set of the medical knowledge graph are encoded and embedded with a knowledge representation model such as TransD, TransR or PTransE to optimize the conditional probability function P(G | θ_E, θ_R), and representation learning on the textual relations is performed with neural networks such as CNNs and RNNs to optimize the conditional probability P(D | θ_V).
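The sketch below illustrates, under assumed module names and dimensions, how such a joint objective might be assembled: a simple translational score stands in for TransD/TransR/PTransE on the knowledge-graph side, and a GRU sentence encoder stands in for the CNN/RNN text model. It is an illustration of the shared-embedding-space idea, not the exact model of the invention.

```python
# Sketch of a joint objective over shared entity/relation/word embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self, n_entities, n_relations, n_words, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)     # theta_E
        self.rel = nn.Embedding(n_relations, dim)    # theta_R
        self.word = nn.Embedding(n_words, dim)       # theta_V
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)
        self.rel_classifier = nn.Linear(dim, n_relations)

    def kg_loss(self, h, r, t, h_neg, t_neg, margin=1.0):
        """Translational margin loss: true (h, r, t) should score better than corrupted triples."""
        pos = (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)
        neg = (self.ent(h_neg) + self.rel(r) - self.ent(t_neg)).norm(p=2, dim=-1)
        return F.relu(margin + pos - neg).mean()

    def text_loss(self, sentences, relation_labels):
        """Relation extraction loss on sentences that mention an entity pair."""
        _, hidden = self.text_encoder(self.word(sentences))
        return F.cross_entropy(self.rel_classifier(hidden.squeeze(0)), relation_labels)

    def joint_loss(self, kg_batch, text_batch):
        # Maximizing P(G, D | theta) is approximated by minimizing the summed losses
        # in one shared embedding space, so constraints transfer between the two views.
        return self.kg_loss(*kg_batch) + self.text_loss(*text_batch)
```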

The data and ontology dual-driven mutual attention algorithm model for medical knowledge acquisition introduces a mutual attention mechanism on top of the joint learning framework. The mutual attention model contains an attention mechanism module based on graph knowledge and an attention mechanism module based on text semantics, and the two modules promote each other during training. In the knowledge-based attention module, for each triple there may be multiple sentences in the medical text that suggest the relationship between the entities; since some sentences may contain fuzzy or wrong components, the invention uses the latent relation vector between the entities as knowledge-based attention to highlight the important sentences in the training data and reduce the noisy components. In the semantics-based attention module, for each relation there may be multiple entity pairs in the medical knowledge graph that instantiate the relation; to make the knowledge graph representation model more effective, the invention uses the semantic information extracted by the medical text model as feedback to help the actual relation vectors approach the latent vectors of the most reasonable entity pairs as closely as possible.
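As an illustration of the knowledge-based half of the mutual attention mechanism, the sketch below weights the sentences of one entity pair by their agreement with a latent relation vector taken from the graph side; the choice of t - h as the latent relation and the bag-level averaging are assumptions, and the symmetric semantics-based module would feed the attended text vector back to refine the relation embedding.

```python
# Sketch of knowledge-based attention over the sentences of one entity pair.
import torch
import torch.nn.functional as F

def knowledge_attention(sentence_vecs, head_emb, tail_emb):
    """sentence_vecs: (num_sentences, dim) encodings of the sentences mentioning
    one entity pair; head_emb, tail_emb: (dim,) entity embeddings from the graph."""
    r_latent = tail_emb - head_emb                    # latent relation from the graph side
    scores = sentence_vecs @ r_latent                 # agreement of each sentence with it
    weights = F.softmax(scores, dim=0)                # important sentences get larger weight
    return (weights.unsqueeze(-1) * sentence_vecs).sum(dim=0)   # denoised bag representation
```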

The algorithm is a dual-driven model that constructs a medical knowledge graph from medical text data and an ontology. By introducing the joint learning framework and the mutual attention mechanism, medical knowledge can be acquired effectively, words can be comprehensively aligned with entities and textual relations with graph relations, and concept alignment and content inter-translation among the multi-source heterogeneous databases can be realized.

Construction and application of concept alignment and content inter-translation-oriented cross-view domain knowledge graph

The concepts in the multi-source heterogeneous medical databases form an ontology view, and the instantiated ontology concepts form an instance view. Existing knowledge graph representation methods focus only on knowledge representation under a single view and cannot make full use of the available information. Modeling the knowledge of the ontology view and the instance view at the same time preserves the rich information in the instance representations, captures the hierarchical structure between the ontology and the instances, and facilitates the alignment of instances with concepts. The specific scheme is as follows:

First, entities are labeled using a knowledge enhancement technique and a deep neural network; second, the entities are classified at fine granularity, the fine-grained medical concepts form the ontology view, and the instantiated fine-grained concepts form the instance view; finally, multi-aspect representation learning is performed on the knowledge graph with a cross-view association model and an intra-view model, realizing the fusion of ontology and instance information.

1) Knowledge obtained from the ontology bases that widely exist in the Chinese medical field and knowledge obtained from a weakly supervised recurrent neural network are used as complementary knowledge sources for each other to obtain more accurate medical named entities. Specifically, semantic concept features are extracted based on the medical ontology and fused with word vector features to construct a named entity recognition model; a Transformer framework is adopted to extract semantic features and character features, which are then combined, and the entity labels in Chinese medical texts are obtained through a deep learning model with an attention mechanism.
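A hedged sketch of such an entity-labeling model follows: character embeddings are concatenated with ontology-derived concept features and encoded with a Transformer encoder, whose self-attention plays the role of the attention mechanism; the tag set, dimensions and the concept-feature lookup are assumptions for illustration.

```python
# Sketch of concept-feature fusion for named entity recognition on character sequences.
import torch
import torch.nn as nn

class ConceptAwareNER(nn.Module):
    def __init__(self, n_chars, n_concepts, n_tags, dim=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)        # character features
        self.concept_emb = nn.Embedding(n_concepts, dim)  # ontology concept features
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.tagger = nn.Linear(2 * dim, n_tags)

    def forward(self, char_ids, concept_ids):
        """char_ids, concept_ids: (batch, seq_len); concept_ids hold the ontology
        concept matched at each character position (0 when there is no match)."""
        x = torch.cat([self.char_emb(char_ids), self.concept_emb(concept_ids)], dim=-1)
        return self.tagger(self.encoder(x))               # (batch, seq_len, n_tags) tag scores
```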

2) A medical knowledge network is constructed to provide knowledge for enhancing text understanding. The input text is converted into a graph structure through the knowledge network, in which the nodes are entities, attributes, verbs, adjectives and the like; a random walk is then performed on the graph according to the context. After the random walk converges, the most appropriate hypernym concept of each entity in the current context is obtained, giving the fine-grained classification of the entities; the fine-grained medical concepts are then combined into the ontology view, and the instantiated fine-grained concepts are combined into the instance view.
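The random-walk step can be sketched, for illustration, as a personalized random walk with restart over the text graph; the damping factor, the restart distribution over context nodes and the fixed iteration count standing in for a convergence test are assumptions.

```python
# Sketch of a random walk with restart used to pick an entity's hypernym concept.
import numpy as np

def random_walk_scores(adj, context_idx, damping=0.85, iters=100):
    """adj: (n, n) adjacency matrix of the text graph; context_idx: indices of
    the context nodes the walk keeps restarting from."""
    n = adj.shape[0]
    transition = adj / np.clip(adj.sum(axis=1, keepdims=True), 1e-9, None)
    restart = np.zeros(n)
    restart[context_idx] = 1.0 / len(context_idx)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):                  # iterate until (approximate) convergence
        scores = damping * transition.T @ scores + (1 - damping) * restart
    return scores

def best_hypernym(adj, context_idx, candidate_concept_idx):
    # Among the candidate hypernym concepts, keep the one the walk visits most.
    scores = random_walk_scores(adj, context_idx)
    return candidate_concept_idx[int(np.argmax(scores[candidate_concept_idx]))]
```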

3) A Co-training framework divides the feature vectors into the ontology view and the instance view; an entity alignment model based on joint representation learning over the two graphs is trained separately under each view, and the most reliable entity alignment results are continually selected to assist the training of the model under the other view, realizing the fusion of ontology and instance information and improving the accuracy of entity alignment by 12%. After entity alignment among the multiple databases is completed, concept alignment and content inter-translation of the multi-source heterogeneous databases can be realized.
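A schematic sketch of the co-training loop is given below; the model interface (fit, most_confident_pairs), the confidence threshold and the number of rounds are hypothetical placeholders, since the text only describes the exchange of reliable alignments between the two views.

```python
# Sketch of co-training between an ontology-view and an instance-view alignment model.
def co_train(ontology_model, instance_model, seed_alignments, rounds=5, threshold=0.9):
    seeds = set(seed_alignments)
    for _ in range(rounds):
        ontology_model.fit(seeds)                              # hypothetical model interface
        instance_model.fit(seeds)
        # Each view proposes the alignments it is most confident about ...
        new_from_onto = ontology_model.most_confident_pairs(threshold)
        new_from_inst = instance_model.most_confident_pairs(threshold)
        # ... which then assist the training of the other view in the next round.
        seeds = seeds | set(new_from_onto) | set(new_from_inst)
    return seeds
```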

The invention can also provide a computer device comprising a processor and a memory, wherein the memory is used for storing a computer-executable program; the processor reads part or all of the computer-executable program from the memory and executes it, and when executing part or all of the computer-executable program the processor can realize the method for concept alignment and content inter-translation between multi-source heterogeneous databases according to the invention.

In another aspect, the invention provides a computer-readable storage medium in which a computer program is stored; when executed by a processor, the computer program can implement the method for concept alignment and content inter-translation between multi-source heterogeneous databases according to the invention.

The computer device may be an onboard computer, a notebook computer, a desktop computer or a workstation.

The processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or a Field-Programmable Gate Array (FPGA).

The memory of the invention may be an internal storage unit of a notebook computer, desktop computer or workstation, such as internal memory or a hard disk; an external storage unit such as a removable hard disk or a flash memory card may also be used.

Computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).
