Data integration method and device for heterogeneous database and storage medium

Document No.: 923967 Publication date: 2021-03-02

Reading notes: This technology, "Data integration method and device for heterogeneous database and storage medium", was created by 陈曦, 王尔昕, 张伟, 王统仁 and 麻志毅 on 2020-10-23. Main content: The invention discloses a data integration method, device and storage medium for heterogeneous databases. The method comprises: establishing a first undirected weighted graph model and a second undirected weighted graph model for a first database and a second database, respectively; extracting key nodes from the first and second undirected weighted graph models to generate a first key node set and a second key node set; constructing a similarity matrix between all data columns contained in the key nodes of the first key node set and all data columns contained in the key nodes of the second key node set; determining a data column to be matched, and obtaining from the similarity matrix a plurality of optimal data columns corresponding to it to generate a candidate matching list; sorting the optimal data columns in the candidate matching list in descending order; and determining a data matching result based on the sorted optimal data columns. The embodiments of the application can therefore improve data matching efficiency and matching accuracy when integrating data across heterogeneous databases.

1. A method for data integration of heterogeneous databases, the method comprising:

establishing a first undirected weighted graph model for a first database, and establishing a second undirected weighted graph model for a second database, wherein the first database and the second database are heterogeneous databases;

respectively extracting key nodes in the first undirected weighted graph model and the second undirected weighted graph model to generate a first key node set and a second key node set;

constructing a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

determining a data column to be matched, and acquiring a plurality of optimal data columns corresponding to the data column to be matched from the similarity matrix to generate a candidate matching list;

sorting the optimal data columns in the candidate matching list in descending order to generate a plurality of sorted optimal data columns;

and determining a data matching result based on the sorted optimal data columns.

2. The method of claim 1, wherein after determining a data match result based on the sorted optimal data columns, further comprising:

when matching between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set is completed, generating a plurality of data matching results;

and integrating the first database and the second database according to the plurality of data matching results to generate a target database.

3. The method according to claim 1 or 2, characterized in that the method further comprises:

and when it is detected that the data column to be matched has been matched, deleting the data column to be matched from the candidate matching lists of other data columns.

4. The method of claim 1, wherein establishing a first undirected weighted graph model for a first database and a second undirected weighted graph model for a second database comprises:

respectively traversing data tables in the first database and the second database to generate a first data table set and a second data table set;

determining each data table in the first data table set as a plurality of first nodes, and constructing a first undirected weighted graph model based on the plurality of first nodes;

and determining each data table in the second data table set as a plurality of second nodes, and constructing a second undirected weighted graph model based on the plurality of second nodes.

5. The method of claim 1, wherein the extracting key nodes in the first undirected weighted graph model and the second undirected weighted graph model, respectively, and generating a first set of key nodes and a second set of key nodes comprises:

acquiring the weights of all edges connected by each node in a first undirected weighted graph model, and summing the weights of all edges connected by each node to generate a first target value corresponding to each node;

sorting the first target values corresponding to the nodes in a descending order to generate a plurality of sorted first target values;

selecting a value larger than a preset threshold value from the plurality of first target values, and determining a node corresponding to the value larger than the preset threshold value as a first key node set;

wherein each of the edges is an undirected edge connected between two nodes, and the weight of the undirected edge is equal to the number of identical data columns contained in the two corresponding tables.

6. The method of claim 5, further comprising:

acquiring the weights of all edges connected with each node in the second undirected weighted graph model, and summing the weights of all edges connected with each node to generate a second target value corresponding to each node;

sorting the second target values corresponding to the nodes in a descending order to generate a plurality of sorted second target values;

and selecting a value larger than a preset threshold value from the plurality of second target values, and determining a node corresponding to the value larger than the preset threshold value as a second key node set.

7. The method of claim 1, wherein the constructing a similarity matrix between all data columns contained in each key node in the first set of key nodes and all data columns contained in each key node in the second set of key nodes comprises:

calculating data column name similarity and data column data similarity between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

carrying out weighted summation on the data column name similarity and the data similarity of the data columns according to a preset weighting coefficient to generate comprehensive similarity between all data columns contained in all key nodes in the first key node set and all data columns contained in all key nodes in the second key node set;

and constructing a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set according to the comprehensive similarity.

8. The method of claim 7, wherein the calculating of the data column name similarity and the data similarity of the data columns between all the data columns contained in each key node in the first set of key nodes and all the data columns contained in each key node in the second set of key nodes comprises:

converting data column names corresponding to all data columns contained in each key node in the first key node set and the second key node set into word vectors through a word2vec model, and generating a first word vector set and a second word vector set;

calculating cosine similarity between each word vector in the first word vector set and each word vector in the second word vector set, and generating data column name similarity between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

acquiring data of data columns corresponding to all data columns contained in each key node in the first key node set and the second key node set, and generating a first data set and a second data set;

and calculating the data similarity between each data in the first data set and each data in the second data set, and generating the data similarity between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set.

9. An apparatus for data integration of heterogeneous databases, the apparatus comprising:

the graph model establishing module is used for establishing a first undirected weighted graph model aiming at a first database and establishing a second undirected weighted graph model aiming at a second database, wherein the first database and the second database are heterogeneous databases;

a key node extraction module, configured to extract key nodes in the first undirected weighted graph model and the second undirected weighted graph model respectively, and generate a first key node set and a second key node set;

a similarity matrix construction module, configured to construct a similarity matrix between all data columns included in each key node in the first key node set and all data columns included in each key node in the second key node set;

the candidate matching list generating module is used for determining the data columns to be matched and acquiring a plurality of optimal data columns corresponding to the data columns to be matched from the similarity matrix to generate a candidate matching list;

the data column sorting module is used for performing descending sorting on the optimal data columns in the candidate matching list to generate a plurality of sorted optimal data columns;

and the matching result generating module is used for determining a data matching result based on the sorted optimal data columns.

10. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to carry out the method steps according to any one of claims 1 to 8.

Technical Field

The invention relates to the technical field of computers, in particular to a data integration method and device for a heterogeneous database and a storage medium.

Background

At present, relational database systems remain the mainstream means of data storage. With the development of information technology, the amount of data held in the relational databases of software systems in every field has grown sharply. For example, a software system in a given field may comprise a plurality of subsystems, each with its own relational database, so that the software system as a whole has a plurality of heterogeneous databases. Because any single one of these heterogeneous databases holds a comparatively small amount of data and can express the field as a whole only to a limited extent, researchers increasingly wish to integrate the plurality of heterogeneous databases into one database.

In the prior art, data in heterogeneous databases is often integrated by "direct matching between two schemas", that is, by matching columns that have the same meaning across the data tables of the two heterogeneous databases. For example, some current matching algorithms measure the similarity between every pair of columns in the two databases to be matched and generate a matching result for each column. The similarity of two columns is measured mainly from their data content: statistical features are extracted from the data of each column as that column's feature vector, and the similarity between the two feature vectors is then measured. Some improved algorithms measure the similarity of two columns by combining the data features of the columns with the semantics of the column names. In these algorithms the labels of a matched element pair can be regarded as a pair of synonyms and automatically added to a synonym dictionary, which to some extent accommodates columns that have the same meaning but different column names.

The disadvantages of the above methods are mainly as follows: (1) The algorithms have high complexity. When the data source to be matched is large, they perform a large number of similarity calculations on data elements with low occurrence frequency (non-key columns), consuming substantial computing resources and time. (2) For columns that have not produced a match, the synonym dictionary contains no synonyms, so these columns can be matched only by a single similarity measure based on column data features. (3) The similarity measure between two columns is too one-dimensional: it mainly considers the data features of the columns, rarely considers the semantics of the column names, and does not consider the relationships between columns in the same data table.

Disclosure of Invention

The embodiment of the application provides a data integration method and device for a heterogeneous database and a storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.

In a first aspect, an embodiment of the present application provides a data integration method for a heterogeneous database, where the method includes:

establishing a first undirected weighted graph model for a first database, and establishing a second undirected weighted graph model for a second database, wherein the first database and the second database are heterogeneous databases;

respectively extracting key nodes in the first undirected weighted graph model and the second undirected weighted graph model to generate a first key node set and a second key node set;

constructing a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

determining a data column to be matched, and acquiring a plurality of optimal data columns corresponding to the data column to be matched from the similarity matrix to generate a candidate matching list;

sorting the optimal data columns in the candidate matching list in descending order to generate a plurality of sorted optimal data columns;

and determining a data matching result based on the sorted optimal data columns.

Optionally, after determining a data matching result based on the sorted optimal data columns, the method further includes:

when matching between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set is completed, generating a plurality of data matching results;

and integrating the first database and the second database according to the multiple data matching results to generate a target database.

Optionally, the method further includes:

and when it is detected that the data column to be matched has been matched, deleting the data column to be matched from the candidate matching lists of other data columns.

Optionally, the establishing a first undirected weighted graph model for a first database and a second undirected weighted graph model for a second database includes:

respectively traversing data tables in the first database and the second database to generate a first data table set and a second data table set;

determining each data table in the first data table set as a plurality of first nodes, and constructing a first undirected weighted graph model based on the plurality of first nodes;

and determining each data table in the second data table set as a plurality of second nodes, and constructing a second undirected weighted graph model based on the plurality of second nodes.

Optionally, the extracting key nodes in the first undirected weighted graph model and the second undirected weighted graph model respectively, and generating the first key node set and the second key node set, includes:

acquiring the weights of all edges connected by each node in the first undirected weighted graph model, and summing the weights of all edges connected by each node to generate a first target value corresponding to each node;

sorting the first target values corresponding to the nodes in a descending order to generate a plurality of sorted first target values;

selecting a value larger than a preset threshold value from the plurality of first target values, and determining a node corresponding to the value larger than the preset threshold value as a first key node set;

wherein each of the edges is an undirected edge connected between two nodes, and the weight of the undirected edge is equal to the number of identical data columns contained in the two corresponding tables.

Optionally, the method further comprises:

acquiring the weights of all edges connected with each node in the second undirected weighted graph model, and summing the weights of all edges connected with each node to generate a second target value corresponding to each node;

sorting the second target values corresponding to the nodes in a descending order to generate a plurality of sorted second target values;

and selecting a value larger than a preset threshold value from the plurality of second target values, and determining a node corresponding to the value larger than the preset threshold value as a second key node set.

Optionally, constructing a similarity matrix between all data columns included in each key node in the first key node set and all data columns included in each key node in the second key node set includes:

calculating data column name similarity and data column data similarity between all data columns contained in all key nodes in the first key node set and all data columns contained in all key nodes in the second key node set;

carrying out weighted summation on the similarity of the data column names and the data similarity of the data columns according to a preset weighting coefficient to generate comprehensive similarity between all data columns contained in all key nodes in the first key node set and all data columns contained in all key nodes in the second key node set;

and constructing a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set according to the comprehensive similarity.

Optionally, calculating data column name similarity and data similarity of data columns between all data columns included in each key node in the first key node set and all data columns included in each key node in the second key node set includes:

converting data column names corresponding to all data columns contained in each key node in the first key node set and the second key node set into word vectors through a word2vec model, and generating a first word vector set and a second word vector set;

calculating cosine similarity between each word vector in the first word vector set and each word vector in the second word vector set, and generating data column name similarity between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

acquiring data of data columns corresponding to all data columns contained in each key node in a first key node set and a second key node set, and generating a first data set and a second data set;

and calculating the data similarity between each data in the first data set and each data in the second data set, and generating the data similarity between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set.

In a second aspect, an embodiment of the present application provides a data integration apparatus for heterogeneous databases, where the apparatus includes:

the graph model establishing module is used for establishing a first undirected weighted graph model aiming at a first database and establishing a second undirected weighted graph model aiming at a second database, and the first database and the second database are heterogeneous databases;

the key node extraction module is used for respectively extracting key nodes in the first undirected weighted graph model and the second undirected weighted graph model and generating a first key node set and a second key node set;

the similarity matrix building module is used for building a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

the candidate matching list generating module is used for determining the data columns to be matched and acquiring a plurality of optimal data columns corresponding to the data columns to be matched from the similarity matrix to generate a candidate matching list;

the data column sorting module is used for performing descending sorting on the optimal data columns in the candidate matching list to generate a plurality of sorted optimal data columns;

and the matching result generating module is used for determining a data matching result based on the sorted optimal data columns.

In a third aspect, embodiments of the present application provide a computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.

The technical scheme provided by the embodiment of the application can have the following beneficial effects:

In the embodiments of the application, the two heterogeneous databases are traversed, each is modeled as a graph, and the key nodes of each graph model are extracted, which achieves the purpose of pruning the databases. The similarity between the data elements contained in the key nodes of the two heterogeneous databases is measured by multiple measurement methods, the batch of elements most similar to each data element to be matched is screened out, and a matching candidate list is generated for that data element. Finally, the two heterogeneous databases can be integrated into one database based on the matching candidate lists. The method can greatly improve the efficiency and accuracy of data matching between heterogeneous databases and lays a solid foundation for data integration, so that a computer operates the integrated database more efficiently than it operates a plurality of separate databases, and the speed of data processing in the database is improved.
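As an illustration of the candidate-list step summarized above, the following Python sketch takes a similarity matrix, keeps the k most similar columns for each column to be matched in descending order of similarity, and then assigns matches. The function names, the parameter k, the example column names and the greedy assignment rule are assumptions made for this illustration; the application only states that the matching result is determined from the sorted candidates and that a matched column is removed from the candidate lists of other columns.

```python
def candidate_lists(matrix, first_cols, second_cols, k):
    """For each column of the first database, keep its k most similar
    columns of the second database, sorted by similarity (descending)."""
    result = {}
    for i, col in enumerate(first_cols):
        scored = sorted(zip(second_cols, matrix[i]),
                        key=lambda pair: pair[1], reverse=True)
        result[col] = scored[:k]
    return result


def greedy_match(candidates):
    """Assumed rule: each column takes its best candidate that is still free.
    Skipping used targets plays the role of deleting a matched column
    from the candidate lists of the other columns."""
    matched, used = {}, set()
    for col, ranked in candidates.items():
        for target, _score in ranked:
            if target not in used:
                matched[col] = target
                used.add(target)
                break
    return matched


# Hypothetical example data.
first_cols = ["cust_id", "cust_name"]
second_cols = ["customer_id", "name", "created"]
matrix = [
    [0.9, 0.2, 0.1],  # similarities of cust_id to each second-database column
    [0.8, 0.7, 0.3],  # similarities of cust_name
]

candidates = candidate_lists(matrix, first_cols, second_cols, k=2)
matches = greedy_match(candidates)
print(matches)
```

In this example "cust_name" falls back to its second candidate because its best candidate has already been taken by "cust_id".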

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

Fig. 1 is a schematic flowchart of a data integration method for a heterogeneous database according to an embodiment of the present application;

Fig. 2 is a diagram illustrating the determination of matching results during data integration of a heterogeneous database according to an embodiment of the present disclosure;

Fig. 3 is a process block diagram of a data integration process of a heterogeneous database according to an embodiment of the present application;

Fig. 4 is a schematic device diagram of a data integration apparatus for heterogeneous databases according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a terminal according to an embodiment of the present application.

Detailed Description

The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.

It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific case. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., A and/or B, which may mean: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

The data integration method for heterogeneous databases according to the embodiments of the present application will be described in detail below with reference to fig. 1 to 3. The method may be implemented by a computer program that runs on a data integration apparatus for heterogeneous databases based on the von Neumann architecture. The computer program may be integrated into an application or may run as a separate tool-like application. The data integration apparatus for heterogeneous databases in the embodiment of the present application may be a user terminal, including but not limited to: personal computers, tablet computers, handheld devices, in-vehicle devices, wearable devices, computing devices or other processing devices connected to a wireless modem, and the like. The user terminals may be called different names in different networks, for example: user equipment, access terminal, subscriber unit, subscriber station, mobile station, remote terminal, mobile device, user terminal, wireless communication device, user agent or user equipment, cellular telephone, cordless telephone, Personal Digital Assistant (PDA), terminal equipment in a 5G network or future evolution network, and the like.

Referring to fig. 1, a schematic flow chart of a data integration method for a heterogeneous database is provided in an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application may include the following steps:

S101, establishing a first undirected weighted graph model for a first database, and establishing a second undirected weighted graph model for a second database, wherein the first database and the second database are heterogeneous databases;

A database is the data store associated with a computer software system, and the data in it is operated on as the software's functions are used (for example, data is added, deleted, updated and queried through the functional nodes of the software system). The undirected weighted graph model is generated from a plurality of nodes, with each table in the database regarded as one node. Heterogeneous databases are the databases corresponding to different systems. For example, a plant that manufactures transmission gears may have several sub-plants, each using a different type of database; such databases are called heterogeneous databases.

In the embodiment of the application, integrating two heterogeneous databases requires an undirected weighted graph model to be built for each of them. To build the models, the data tables in the first database and the second database are traversed respectively to generate a first data table set and a second data table set. Each data table in the first data table set is determined as one of a plurality of first nodes, and the first undirected weighted graph model is constructed based on the plurality of first nodes. Likewise, each data table in the second data table set is determined as one of a plurality of second nodes, and the second undirected weighted graph model is constructed based on the plurality of second nodes.

In one possible implementation, the two databases are first traversed and an undirected weighted graph model is established for each, with each table in a database serving as a node of the corresponding graph. The tables are then compared pairwise: if two tables contain the same columns, they are considered related, and an undirected edge is connected between the two corresponding nodes, the weight of the edge being equal to the number of identical columns the two tables contain.
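The graph construction described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the application: the function name, the dictionary-based schema representation and the example tables are hypothetical, and "the same columns" is assumed here to mean columns with identical names.

```python
from itertools import combinations


def build_table_graph(schema):
    """schema maps each table name to its column names.
    Returns {(table_a, table_b): weight} for every pair of tables
    that shares at least one column name (assumed meaning of "same column")."""
    edges = {}
    for a, b in combinations(sorted(schema), 2):
        shared = set(schema[a]) & set(schema[b])
        if shared:
            # Edge weight = number of columns the two tables have in common.
            edges[(a, b)] = len(shared)
    return edges


# Hypothetical schema of one database.
schema = {
    "orders": ["order_id", "customer_id", "created_at"],
    "customers": ["customer_id", "name", "created_at"],
    "products": ["product_id", "name"],
    "logs": ["log_id"],
}

graph = build_table_graph(schema)
print(graph)
```

Tables that share no column with any other table, such as "logs" here, remain isolated nodes with no edges.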

S102, respectively extracting key nodes in the first undirected weighted graph model and the second undirected weighted graph model, and generating a first key node set and a second key node set;

In the embodiment of the application, key nodes are extracted from the two undirected weighted graph models corresponding to the two heterogeneous databases, that is, key nodes are extracted from the first undirected weighted graph model and from the second undirected weighted graph model.

To extract the key nodes of the first undirected weighted graph model, the weights of all edges connected to each node in the model are first obtained and summed to generate a first target value for each node. The first target values are then sorted in descending order to generate a plurality of sorted first target values. Finally, the values larger than a preset threshold are selected from the plurality of first target values, and the nodes corresponding to those values are determined as the first key node set. Each edge is an undirected edge connected between two nodes, and its weight is equal to the number of identical data columns contained in the two corresponding tables; see step S101 for details, which are not repeated here.

To extract the key nodes of the second undirected weighted graph model, the weights of all edges connected to each node in the model are first obtained and summed to generate a second target value for each node. The second target values are then sorted in descending order to generate a plurality of sorted second target values. Finally, the values larger than a preset threshold are selected from the plurality of second target values, and the nodes corresponding to those values are determined as the second key node set.

In a possible implementation manner, when extracting key nodes in graph models corresponding to two heterogeneous databases, the weights of all edges connected to each node in the model graphs are summed at first, then the nodes are sorted in a descending order according to the summed values, finally a threshold value is set according to the total number of the nodes, and all the nodes which are sorted and are larger than the set threshold value are selected as the key nodes.
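The key node extraction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `extract_key_nodes`, the dictionary layout of `edges`, and the sample table names are assumptions, and the threshold is compared against the summed edge weight of each node.

```python
from collections import defaultdict


def extract_key_nodes(edges, threshold):
    """Return the nodes whose summed edge weight exceeds `threshold`.

    `edges` maps an unordered pair of table names (a, b) to the weight of
    the undirected edge between them (hypothetical layout).
    """
    strength = defaultdict(int)
    for (a, b), weight in edges.items():
        # An undirected edge contributes its weight to both endpoints.
        strength[a] += weight
        strength[b] += weight
    # Sort nodes by summed weight, largest first.
    ranked = sorted(strength.items(), key=lambda item: item[1], reverse=True)
    return [node for node, total in ranked if total > threshold]


# Hypothetical example graph: four tables, three weighted edges.
edges = {("orders", "customers"): 3, ("orders", "items"): 2, ("items", "logs"): 1}
key_nodes = extract_key_nodes(edges, threshold=2)
```

With these sample weights, the table `logs` has a summed weight of 1 and is pruned, while the other three tables are kept.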

S103, constructing a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

The data columns are the individual columns of data in each table of a database. The similarity matrix is a matrix built between the data columns of the first database and the data columns of the second database.

In the embodiment of the application, the similarity matrix between the first database and the second database is constructed as follows. First, the data column name similarity and the data similarity of the data columns are calculated between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set. Then, the data column name similarity and the data similarity are weighted and summed according to preset weighting coefficients to obtain a comprehensive similarity. Finally, the similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set is established according to the comprehensive similarity.

Further, the data column name similarity and the data similarity are calculated as follows. First, the data column names of all data columns contained in the key nodes of the first key node set and the second key node set are converted into word vectors by a word2vec model, generating a first word vector set and a second word vector set. Next, the cosine similarity between each word vector in the first word vector set and each word vector in the second word vector set is calculated, which gives the data column name similarity between the data columns of the two key node sets. Then, the data of all data columns contained in the key nodes of the first key node set and the second key node set are acquired, generating a first data set and a second data set. Finally, the data similarity between each item of data in the first data set and each item of data in the second data set is calculated, which gives the data similarity between the data columns of the two key node sets.

Specifically, when constructing the similarity matrix, the column name similarity and the column data similarity are first calculated for all columns contained in all key nodes of the two databases. The column name similarity (denoted $sim_{name}$) and the column data similarity (denoted $sim_{data}$) are then summed with the weighting coefficients $\omega_1$ and $\omega_2$ to give the combined similarity of two columns: $sim = \omega_1 \cdot sim_{name} + \omega_2 \cdot sim_{data}$. Finally, the similarity matrix is built from the combined similarities of all column pairs, for example as shown in table 1. The rows of the matrix represent all the columns of one database and the columns of the matrix represent all the columns of the other database.
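The weighted summation and the resulting matrix can be sketched as below. The function names, the equal default weights of 0.5, and the toy similarity functions are assumptions made for illustration only; the source does not state the values of the weighting coefficients.

```python
def combined_similarity(sim_name, sim_data, w_name=0.5, w_data=0.5):
    """Weighted sum of name similarity and data similarity (weights assumed)."""
    return w_name * sim_name + w_data * sim_data


def build_similarity_matrix(cols_a, cols_b, name_sim, data_sim,
                            w_name=0.5, w_data=0.5):
    """Rows follow the columns of one database, columns those of the other."""
    return [
        [combined_similarity(name_sim(a, b), data_sim(a, b), w_name, w_data)
         for b in cols_b]
        for a in cols_a
    ]


# Toy stand-ins for the real similarity measures (hypothetical).
def toy_name_sim(a, b):
    return 1.0 if a == b else 0.0


def toy_data_sim(a, b):
    return 1.0 if a[0] == b[0] else 0.0


matrix = build_similarity_matrix(["id", "name"], ["id", "nick"],
                                 toy_name_sim, toy_data_sim)
```

Passing the two similarity measures in as functions keeps the matrix construction independent of how each measure is computed.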

TABLE 1

When calculating the column name similarity, the column names of all columns contained in the key nodes are encoded. Each column name is regarded as a word and, because the columns of one table are internally related, all the column names contained in the same table are regarded as a sentence; a word2vec model then represents each column name as a vector of the same dimension. The relationship between the vectors corresponding to column names reflects the degree of similarity between the column names. The cosine similarity of the word vectors of all column names is calculated pairwise to represent the similarity between the column names.
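The pairwise cosine step can be sketched as follows. The vectors here are made-up two-dimensional values standing in for the output of a trained word2vec model, and the column names are hypothetical; training the model itself is outside this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector carries no direction
    return dot / (norm_u * norm_v)


# Hypothetical column-name vectors for the two databases.
name_vectors_a = {"user_id": [1.0, 0.0], "city": [0.0, 1.0]}
name_vectors_b = {"uid": [2.0, 0.0], "town": [0.0, 3.0]}

# Column name similarity for every pair of columns across the two databases.
name_sim = {
    (a, b): cosine_similarity(vec_a, vec_b)
    for a, vec_a in name_vectors_a.items()
    for b, vec_b in name_vectors_b.items()
}
```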

When calculating the data similarity of the data columns, the input of the algorithm is, for example, the two columns col_1 and col_2 whose data similarity is to be calculated, and the output is sim_data (the data similarity of the two columns).

When measuring the data similarity of the columns, the algorithm distinguishes the following cases:

(1) When the data types of the two columns are different, the two columns are considered to have different meanings and cannot match, and the similarity is defined to be 0.

(2) When the data types of the two columns are both int or both float, features such as the mean, the variance, the minimum value and the maximum value of each column are calculated to form a feature vector, and the cosine similarity of the two feature vectors is used to express the similarity of the data of the two columns.

(3) When the data types of the two columns are both short character strings, the degree of similarity of the data of the two columns is measured by calculating the edit distance between the character strings.

(4) When the data types of the two columns are both long text, the text is segmented with jieba, each word is then represented as a vector by a word2vec model, and the degree of similarity of the data of the two columns is measured by the similarity between the word vectors.
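Cases (1) to (3) above can be sketched as follows. This is an illustration under several assumptions that the source does not specify: the edit distance is normalized to a similarity as 1 minus the distance divided by the longer string length, string columns are aggregated by averaging each value's best match in the other column, and the population variance is used. Case (4), which needs jieba and a trained word2vec model, is omitted.

```python
import math
import statistics


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def numeric_features(values):
    """Feature vector: mean, variance, minimum, maximum."""
    return [statistics.mean(values), statistics.pvariance(values),
            min(values), max(values)]


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(s, t):
    """Assumed normalization of the edit distance into the range [0, 1]."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


def column_type(values):
    kinds = {type(v) for v in values}
    if kinds == {int}:
        return "int"
    if kinds == {float}:
        return "float"
    if kinds == {str}:
        return "str"
    return "other"


def data_similarity(col_1, col_2):
    type_1, type_2 = column_type(col_1), column_type(col_2)
    if type_1 != type_2:
        return 0.0  # case (1): different types never match
    if type_1 in ("int", "float"):
        # case (2): cosine similarity of the statistical feature vectors
        return cosine(numeric_features(col_1), numeric_features(col_2))
    if type_1 == "str":
        # case (3): assumed aggregation, best match per value then average
        best = [max(string_similarity(s, t) for t in col_2) for s in col_1]
        return sum(best) / len(best)
    return 0.0
```

For example, `data_similarity([1, 2, 3], ["a"])` falls under case (1) and returns 0.0.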

S104, determining a data column to be matched, and acquiring a plurality of optimal data columns corresponding to the data column to be matched from the similarity matrix to generate a candidate matching list;

In a possible implementation manner, after the similarity matrix of the two heterogeneous databases is constructed in step S103, the user terminal first determines a data column to be matched, and then obtains a plurality of optimal data columns corresponding to the data column to be matched from the similarity matrix to generate a candidate matching list. In this embodiment, the 10 columns with the highest similarity are selected to generate the candidate matching list.

S105, performing descending order arrangement on the optimal data columns in the candidate matching list to generate a plurality of ordered optimal data columns;

In one possible implementation manner, the elements in the candidate list are sorted in descending order of similarity to generate the plurality of sorted optimal data columns, and the data matching result is determined based on the sorted columns. For example, if the similarity of an element differs greatly from that of the preceding element, the set of elements before that element is taken as the exact matching result. As shown in fig. 2, after the elements in the candidate list are arranged in descending order of similarity, the similarity of the fourth point differs greatly from that of the third point, so the first three elements are taken as the exact matching result.
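Candidate selection and the cut at a large similarity drop can be sketched as below. The source does not define how large a drop must be, so this sketch assumes the cut is placed at the single largest drop between consecutive elements; the function names and sample similarities are hypothetical.

```python
import heapq


def candidate_list(similarity_row, k=10):
    """Top-k (column, similarity) pairs, already in descending order."""
    return heapq.nlargest(k, similarity_row.items(), key=lambda item: item[1])


def cut_at_largest_gap(candidates):
    """Keep the elements that precede the largest drop in similarity."""
    if len(candidates) < 2:
        return list(candidates)
    gaps = [candidates[i - 1][1] - candidates[i][1]
            for i in range(1, len(candidates))]
    cut = gaps.index(max(gaps)) + 1
    return candidates[:cut]


# Hypothetical row of the similarity matrix for one column to be matched.
row = {"a": 0.95, "b": 0.93, "c": 0.90, "d": 0.40, "e": 0.35}
matches = cut_at_largest_gap(candidate_list(row))
matched_names = [name for name, _ in matches]
```

In this sample the largest drop lies between the third and fourth elements, so the first three columns are kept, which mirrors the situation described for fig. 2.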

S106, determining a data matching result based on the plurality of sorted optimal data columns.

Further, when it is detected that the data column to be matched has been matched, that data column is deleted from the candidate matching lists of the other data columns. When matching between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set is completed, a plurality of data matching results are generated. Finally, the first database and the second database are integrated according to the plurality of data matching results to generate the target database.
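One way to read the deletion step is sketched below. It assumes that each column takes the best remaining candidate in turn and that a chosen candidate is then removed from the candidate lists of all other columns, so no column is matched twice. The processing order, the one-to-one matching, and all names are assumptions, not details given in the source.

```python
def resolve_matches(candidate_lists):
    """Greedy one-to-one matching over per-column candidate lists.

    `candidate_lists` maps each column to be matched to its candidates,
    ordered by descending similarity (hypothetical layout).
    """
    remaining = {src: list(targets) for src, targets in candidate_lists.items()}
    matches = {}
    for src in candidate_lists:
        targets = remaining[src]
        if not targets:
            continue  # every candidate was already taken by another column
        chosen = targets[0]
        matches[src] = chosen
        # Remove the matched column from the lists of all other columns.
        for other, others in remaining.items():
            if other != src and chosen in others:
                others.remove(chosen)
    return matches


# Hypothetical candidate lists for two columns of the first database.
candidate_lists = {
    "a.id": ["b.uid", "b.code"],
    "a.code": ["b.uid", "b.code"],
}
result = resolve_matches(candidate_lists)
```

The input lists are copied first, so the caller's candidate lists are left unchanged.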

For example, fig. 3 is a schematic diagram of the database integration process provided by the present application. A graph model is established for database A and for database B respectively, key nodes are extracted from the graph models corresponding to the two databases, and a similarity matrix is constructed from the pairwise similarity between the columns of the tables corresponding to the extracted key nodes in the two databases. When the similarity between two columns is calculated, the column name similarity and the data similarity are calculated separately and then weighted and summed to generate the similarity between the two columns. After the similarity matrix is generated, the column to be matched is determined, and a plurality of optimal columns corresponding to the column to be matched are obtained from the similarity matrix to generate a candidate matching list.

In the embodiment of the application, the databases are pruned by traversing the two heterogeneous databases, modeling each of them as a graph model, and extracting the key nodes of each graph model. The data elements contained in the key nodes of the two heterogeneous databases are compared using several similarity measures, the elements most similar to the data element to be matched are screened out to form a matching candidate list for that element, and the two heterogeneous databases are finally integrated into one database based on the matching candidate lists. The method improves the data matching efficiency and matching accuracy between heterogeneous databases and provides a basis for data integration, so that a computer operates the integrated database more efficiently than it would operate several separate databases, which increases the speed of data processing in the database.

The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.

Referring to fig. 4, a schematic structural diagram of a data integration apparatus for heterogeneous databases according to an exemplary embodiment of the present invention is shown. The data integration device of the heterogeneous database can be implemented by software, hardware or a combination of the two to form all or part of the terminal. The device 1 comprises a graph model establishing module 10, a key point extracting module 20, a similarity matrix establishing module 30, a candidate matching list generating module 40, a data column sorting module 50 and a matching result generating module 60.

The graph model establishing module 10 is configured to establish a first undirected weighted graph model for a first database, and establish a second undirected weighted graph model for a second database, where the first database and the second database are heterogeneous databases;

a key point extracting module 20, configured to extract key nodes in the first undirected weighted graph model and the second undirected weighted graph model, respectively, and generate a first key node set and a second key node set;

a similarity matrix constructing module 30, configured to construct a similarity matrix between all data columns included in each key node in the first key node set and all data columns included in each key node in the second key node set;

a candidate matching list generating module 40, configured to determine a data column to be matched, and obtain multiple optimal data columns corresponding to the data column to be matched from the similarity matrix to generate a candidate matching list;

the data column sorting module 50 is configured to sort the multiple optimal data columns in the candidate matching list in a descending order to generate a plurality of sorted optimal data columns;

and a matching result generating module 60, configured to determine a data matching result based on the plurality of sorted optimal data columns.

It should be noted that, when the data integration apparatus for a heterogeneous database provided in the foregoing embodiment executes the data integration method for a heterogeneous database, only the division of the functional modules is described as an example, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the data integration device of the heterogeneous database and the data integration method of the heterogeneous database provided in the above embodiments belong to the same concept, and details of implementation processes thereof are referred to in the method embodiments and are not described herein again.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

The present invention also provides a computer readable medium, on which program instructions are stored, which when executed by a processor implement the data integration method for heterogeneous databases provided by the above-mentioned method embodiments. The present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method for data integration of heterogeneous databases of the various method embodiments described above.

Please refer to fig. 5, which provides a schematic structural diagram of a terminal according to an embodiment of the present application. As shown in fig. 5, terminal 1000 can include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.

The communication bus 1002 is used to enable connection and communication between these components.

The user interface 1003 may include a display screen (Display) and a camera (Camera); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface.

The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface).

The processor 1001 may include one or more processing cores. The processor 1001 connects the various components of the terminal 1000 through various interfaces and lines, and performs the functions of the terminal 1000 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 1005 and by invoking the data stored in the memory 1005. Optionally, the processor 1001 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1001 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU renders and draws the content to be displayed by the display screen; the modem handles wireless communications. It is understood that the modem may also not be integrated into the processor 1001 and may instead be implemented by a separate chip.

The memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store an instruction, a program, code, a set of codes, or a set of instructions. The memory 1005 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the data storage area may store the data and the like referred to in the above method embodiments. The memory 1005 may optionally be at least one memory device located remotely from the processor 1001. As shown in fig. 5, the memory 1005, which is a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a data integration application for heterogeneous databases.

In the terminal 1000 shown in fig. 5, the user interface 1003 is mainly used as an interface for providing input for a user, and acquiring data input by the user; the processor 1001 may be configured to call a data integration application of the heterogeneous database stored in the memory 1005, and specifically perform the following operations:

establishing a first undirected weighted graph model for a first database, and establishing a second undirected weighted graph model for a second database, wherein the first database and the second database are heterogeneous databases;

respectively extracting key nodes in the first undirected weighted graph model and the second undirected weighted graph model to generate a first key node set and a second key node set;

constructing a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

determining a data column to be matched, and acquiring a plurality of optimal data columns corresponding to the data column to be matched from the similarity matrix to generate a candidate matching list;

performing descending order arrangement on the optimal data columns in the candidate matching list to generate a plurality of ordered optimal data columns;

and determining a data matching result based on the plurality of sorted optimal data columns.

In one embodiment, the processor 1001, after performing the determination of the data matching result based on the sorted optimal data columns, further performs the following operations:

when the data column to be matched is detected to be matched, deleting the data column to be matched from the candidate matching lists of other data columns;

when matching between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set is completed, generating a plurality of data matching results;

and integrating the first database and the second database according to the multiple data matching results to generate a target database.

In one embodiment, the processor 1001 specifically performs the following operations when establishing the first undirected weighted graph model for the first database and the second undirected weighted graph model for the second database:

respectively traversing data tables in the first database and the second database to generate a first data table set and a second data table set;

determining each data table in the first data table set as a plurality of first nodes, and constructing a first undirected weighted graph model based on the plurality of first nodes;

and determining each data table in the second data table set as a plurality of second nodes, and constructing a second undirected weighted graph model based on the plurality of second nodes.
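The graph construction can be sketched as follows, with one node per table and an undirected edge whose weight is the number of identical data columns the two tables share. The sketch assumes that identical columns are recognized by equal column names and that tables sharing no column are left unconnected; the function name and sample schema are hypothetical.

```python
from itertools import combinations


def build_weighted_graph(tables):
    """Map each connected pair of tables to the weight of their edge.

    `tables` maps a table name to the names of its columns (hypothetical
    layout). Each table name is a node of the graph.
    """
    edges = {}
    for (name_a, cols_a), (name_b, cols_b) in combinations(tables.items(), 2):
        shared = len(set(cols_a) & set(cols_b))
        if shared > 0:
            edges[(name_a, name_b)] = shared
    return edges


# Hypothetical schema of one database.
tables = {
    "orders": ["id", "customer_id", "date"],
    "customers": ["customer_id", "name"],
    "logs": ["ts"],
}
graph_edges = build_weighted_graph(tables)
```

The same function would be applied once to each of the two databases to obtain the two graph models.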

In one embodiment, when the processor 1001 respectively extracts key nodes in the first undirected weighted graph model and the second undirected weighted graph model, and generates the first key node set and the second key node set, it specifically performs the following operations:

acquiring the weights of all edges connected by each node in the first undirected weighted graph model, and summing the weights of all edges connected by each node to generate a first target value corresponding to each node;

sorting the first target values corresponding to the nodes in a descending order to generate a plurality of sorted first target values;

selecting a value larger than a preset threshold value from the plurality of first target values, and determining a node corresponding to the value larger than the preset threshold value as a first key node set;

acquiring the weights of all edges connected with each node in the second undirected weighted graph model, and summing the weights of all edges connected with each node to generate a second target value corresponding to each node;

sorting the second target values corresponding to the nodes in a descending order to generate a plurality of sorted second target values;

selecting a value larger than a preset threshold value from the plurality of second target values, and determining a node corresponding to the value larger than the preset threshold value as a second key node set;

each of the edges is an undirected edge connecting two nodes, and the weight of the undirected edge is equal to the number of identical data columns contained in the two tables.

In one embodiment, the processor 1001 specifically performs the following operations when performing the construction of the similarity matrix between all data columns included in each key node in the first key node set and all data columns included in each key node in the second key node set:

calculating data column name similarity and data column data similarity between all data columns contained in all key nodes in the first key node set and all data columns contained in all key nodes in the second key node set;

carrying out weighted summation on the similarity of the data column names and the data similarity of the data columns according to a preset weighting coefficient to generate comprehensive similarity between all data columns contained in all key nodes in the first key node set and all data columns contained in all key nodes in the second key node set;

and constructing a similarity matrix between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set according to the comprehensive similarity.

In one embodiment, when performing the calculation of the data column name similarity and the data similarity of the data columns between all the data columns contained in each key node in the first set of key nodes and all the data columns contained in each key node in the second set of key nodes, the processor 1001 specifically performs the following operations:

converting data column names corresponding to all data columns contained in each key node in the first key node set and the second key node set into word vectors through a word2vec model, and generating a first word vector set and a second word vector set;

calculating cosine similarity between each word vector in the first word vector set and each word vector in the second word vector set, and generating data column name similarity between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set;

acquiring data of data columns corresponding to all data columns contained in each key node in a first key node set and a second key node set, and generating a first data set and a second data set;

and calculating the data similarity between each data in the first data set and each data in the second data set, and generating the data similarity between all data columns contained in each key node in the first key node set and all data columns contained in each key node in the second key node set.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware that is related to instructions of a computer program, and the program can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.

The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.
