Method and device for constructing standard knowledge base

Document No.: 1831801. Publication date: 2021-11-12.

Reading note: this technology, "Method and device for constructing standard knowledge base" (一种标准知识库的构建方法及装置), was designed by 李海丽, 史晨阳, 王磊, 黄登玺, 潘学芳, 林勇, 金佩, 王宇宸 and 乔佳丽 on 2021-10-14. Main content: The invention discloses a method and a device for constructing a standard knowledge base, comprising: acquiring field information of a to-be-processed expression; standardizing the field information of the to-be-processed expression to obtain field information of a standard expression; performing word segmentation on the Chinese names of the fields; constructing a standard word library from the Chinese words obtained by segmentation; constructing a standard domain library from the standard domain information of the standard expressions, wherein the standard domain information comprises classification words and data characteristics; and establishing a standard expression library by establishing association relations between the standard expressions and the standard word library and the standard domain library. The standard knowledge base thus divides standard expressions at a finer granularity, and the resulting fine-grained information improves the success rate of standard expression retrieval. Moreover, the degree of data standardization is improved and data types are unified, which improves the efficiency of logical model and physical design in the database generation process.

1. A method for constructing a standard knowledge base, characterized by comprising the following steps:

acquiring field information of the to-be-processed expression; the field information of the to-be-processed expression at least comprises: Chinese name, English name, data characteristics;

standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression;

performing word segmentation processing on the Chinese name of each standard expression to obtain at least one Chinese word;

obtaining an English name corresponding to each Chinese word;

adding the Chinese words obtained after word segmentation processing and the English name corresponding to each Chinese word into a standard word library;

acquiring standard domain information of the standard expressions, and adding the standard domain information into a standard domain library; the standard domain information includes: classification words and data characteristics of the standard expressions;

and associating the standard expressions, the standard word library and the standard domain library to generate a standard expression library.

2. The method according to claim 1, wherein standardizing the field information of the to-be-processed expression according to the preset rule to obtain the field information of the standard expression comprises:

removing first special characters contained in the Chinese name of the to-be-processed expression;

removing second special characters contained in the English name of the to-be-processed expression;

if the condition that the data item is missing in the field of the to-be-processed expression is detected, completing the missing data item by adopting preset supplementary information;

and performing de-duplication processing on the standard expression by taking the Chinese name, the English name, the data type and the data characteristic as reference units.

3. The method of claim 1, further comprising, prior to performing the word segmentation processing on the Chinese name of each standard expression:

performing de-duplication processing on the standard expressions by taking the Chinese name and the English name as reference units.

4. The method of claim 1, wherein adding the Chinese words obtained after the word segmentation processing and the English name corresponding to each Chinese word into the standard word library comprises:

matching the Chinese words obtained after word segmentation processing with words in a standard word library;

marking words which are not successfully matched;

and if the marked words conform to the preset word definition, adding the marked words into the standard word library.

5. The method of claim 1, further comprising:

checking whether a condition that one Chinese word corresponds to a plurality of English names exists in the standard word library;

if a situation that one Chinese word corresponds to a plurality of English names exists, determining the standard English name of the Chinese word;

replacing the English name corresponding to the Chinese word with a standard English name;

or

detecting whether there are ambiguous words in the standard word library;

if ambiguous words exist in the standard word library, obtaining the Chinese name from which each ambiguous word was split, and replacing the ambiguous word with that Chinese name.

6. The method of claim 1, further comprising:

checking whether words with similar meanings exist in the standard word library;

if words with similar meanings exist, determining a standard word from the words with similar meanings;

removing words with similar meanings to the standard words in the standard word library, and storing the words with similar meanings to the standard words in a non-standard word library;

and establishing a mapping relation between the standard words in the standard word library and the words with similar meanings stored in the non-standard word library.

7. The method of claim 1, wherein obtaining standard domain information for standard expression comprises:

based on the standard word library, performing word segmentation processing on the Chinese name of the standard expression to obtain at least one Chinese word;

taking the last Chinese word among the Chinese words forming the Chinese name as the classification word;

acquiring data characteristics of the standard expression; the data characteristics include: data type and length precision;

and taking the classification word, together with the data type and length precision of the standard expression, as the standard domain information of the standard expression.

8. The method of claim 1, further comprising:

detecting whether there are standard expressions whose Chinese name and English name are the same but whose standard domain information is different;

if such standard expressions exist, generating identification information according to the standard domain information;

and identifying the Chinese name and the English name of the standard expression by means of the identification information.

9. An apparatus for building a standard knowledge base, comprising:

a to-be-processed expression acquisition unit, used for acquiring field information of the to-be-processed expression; the field information of the to-be-processed expression at least comprises: Chinese name, English name, data characteristics;

a standardization unit, used for standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression;

a standard word library construction unit, used for performing word segmentation processing on the Chinese name of each standard expression to obtain at least one Chinese word; obtaining an English name corresponding to each Chinese word; and adding the Chinese words obtained after word segmentation processing and the English name corresponding to each Chinese word into a standard word library;

a standard domain library construction unit, used for acquiring standard domain information of the standard expressions and adding the standard domain information into the standard domain library; the standard domain information includes: classification words and data characteristics of the standard expressions;

and a standard expression library construction unit, used for associating the standard expressions, the standard word library and the standard domain library to generate a standard expression library.

10. A standard knowledge base, comprising:

a standard word library comprising: Chinese words and English words;

a standard domain library comprising: classification words and data characteristics;

a standard expression library comprising: standard expressions, the relations between the standard expressions and the standard word library, and the relations between the standard expressions and the standard domain library;

the standard knowledge base is constructed by the construction method of the standard knowledge base according to any one of claims 1 to 8.

11. The standard knowledge base of claim 10, further comprising:

a non-standard word library comprising words having meanings similar to words in the standard word library.

Technical Field

The invention relates to the field of data processing, in particular to a method and a device for constructing a standard knowledge base.

Background

At present, data often suffers from low quality, a lack of standards, and an inability to be shared, so that its due value cannot be realized.

Disclosure of Invention

In view of this, the embodiment of the present invention discloses a method and an apparatus for constructing a standard knowledge base, where the standard knowledge base obtained by the method not only contains standard expressions, but also includes words forming the standard expressions and data features of the standard expressions. Therefore, the standard knowledge base divides the standard expression into finer granularity, and the success rate of standard expression retrieval can be improved through the divided fine granularity information.

The embodiment of the invention discloses a method for constructing a standard knowledge base, which comprises the following steps:

acquiring field information of the to-be-processed expression; the field information of the to-be-processed expression at least comprises: Chinese name, English name, data characteristics;

standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression;

performing word segmentation processing on the Chinese name of each standard expression to obtain at least one Chinese word;

obtaining an English name corresponding to each Chinese word;

adding the Chinese words obtained after word segmentation processing and the English name corresponding to each Chinese word into a standard word library;

acquiring standard domain information of the standard expressions, and adding the standard domain information into the standard domain library; the standard domain information includes: classification words and data characteristics of the standard expressions;

and associating the standard expressions, the standard word library and the standard domain library to generate a standard expression library.

Optionally, standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression includes:

removing first special characters contained in the Chinese name of the to-be-processed expression;

removing second special characters contained in the English name of the to-be-processed expression;

if the condition that the data item is missing in the field of the to-be-processed expression is detected, completing the missing data item by adopting preset supplementary information;

and performing de-duplication processing on the standard expression by taking the Chinese name, the English name, the data type and the data characteristic as reference units.

Optionally, before performing word segmentation processing on the Chinese name of each standard expression, the method further includes:

performing de-duplication processing on the standard expressions by taking the Chinese name and the English name as reference units.

Optionally, adding the Chinese words obtained after word segmentation processing and the English name corresponding to each Chinese word into the standard word library includes:

matching the Chinese words obtained after word segmentation processing with words in a standard word library;

marking words which are not successfully matched;

and if the marked words conform to the preset word definition, adding the marked words into the standard word library.

Optionally, the method further includes:

checking whether a condition that one Chinese word corresponds to a plurality of English names exists in the standard word library;

if a situation that one Chinese word corresponds to a plurality of English names exists, determining the standard English name of the Chinese word;

and replacing the English name corresponding to the Chinese word with the standard English name;

or

detecting whether there are ambiguous words in the standard word library;

if ambiguous words exist in the standard word library, obtaining the Chinese name from which each ambiguous word was split, and replacing the ambiguous word with that Chinese name.

Optionally, the method further includes:

checking whether words with similar meanings exist in the standard word library;

if words with similar meanings exist, determining a standard word from the words with similar meanings;

removing words with similar meanings to the standard words in the standard word library, and storing the words with similar meanings to the standard words in a non-standard word library;

and establishing a mapping relation between the standard words in the standard word library and the words with similar meanings stored in the non-standard word library.

Optionally, the obtaining of standard domain information of the standard expression includes:

based on the standard word library, performing word segmentation processing on the Chinese name of the standard expression to obtain at least one Chinese word;

taking the last Chinese word among the Chinese words forming the Chinese name as the classification word;

acquiring data characteristics of the standard expression; the data characteristics include: data type and length precision;

and taking the classification word, together with the data type and length precision of the standard expression, as the standard domain information of the standard expression.

Optionally, the method further includes:

detecting whether there are standard expressions whose Chinese name and English name are the same but whose standard domain information is different;

if such standard expressions exist, generating identification information according to the standard domain information;

and identifying the Chinese name and the English name of the standard expression by means of the identification information.

The embodiment of the invention discloses a device for constructing a standard knowledge base, which comprises:

a to-be-processed expression acquisition unit, used for acquiring field information of the to-be-processed expression; the field information of the to-be-processed expression at least comprises: Chinese name, English name, data characteristics;

a standardization unit, used for standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression;

a standard word library construction unit, used for performing word segmentation processing on the Chinese name of each standard expression to obtain at least one Chinese word; obtaining an English name corresponding to each Chinese word; and adding the Chinese words obtained after word segmentation processing and the English name corresponding to each Chinese word into a standard word library;

a standard domain library construction unit, used for acquiring standard domain information of the standard expressions and adding the standard domain information into the standard domain library; the standard domain information includes: classification words and data characteristics of the standard expressions;

and a standard expression library construction unit, used for associating the standard expressions, the standard word library and the standard domain library to generate a standard expression library.

The embodiment of the invention discloses a standard knowledge base, which comprises:

a standard word library comprising: Chinese words and English words;

a standard domain library comprising: classification words and data characteristics;

a standard expression library comprising: standard expressions, the relations between the standard expressions and the standard word library, and the relations between the standard expressions and the standard domain library;

the standard knowledge base is constructed by the method for constructing the standard knowledge base.

Optionally, the standard knowledge base further includes:

a non-standard word library comprising words having meanings similar to words in the standard word library.

The embodiment of the invention discloses a method and a device for constructing a standard knowledge base, wherein the method comprises the following steps: acquiring field information of a to-be-processed term, standardizing the field information of the to-be-processed term to obtain field information of a standard term, then segmenting Chinese names of the field of the to-be-processed term, constructing a standard word library according to Chinese words obtained by segmentation, and constructing a standard domain library through standard domain information of the standard term, wherein the standard domain information comprises classified words and data characteristics; and establishing a standard expression library by establishing an association relation between the standard expression and the standard word library and the standard domain library. It can be seen that the standard knowledge base obtained by the method not only contains the standard expression, but also includes the words forming the standard expression and the data characteristics of the standard expression. Therefore, the standard knowledge base divides the standard expression into finer granularity, and the success rate of standard expression retrieval can be improved through the divided fine granularity information. Moreover, the standardization degree of the data is improved, the data types are unified, and the efficiency of the logical model and the physical design in the database generation process is improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.

FIG. 1 is a flow chart of a method for building a standard knowledge base according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for verifying a standard word library;

FIG. 3 is a flow chart of another method for verifying a standard word library;

FIG. 4 is a flow chart of yet another method for verifying a standard word library according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a standard knowledge base;

FIG. 6 is a schematic structural diagram of a device for constructing a standard knowledge base according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, a flow chart of a method for constructing a standard knowledge base according to an embodiment of the present invention is shown. In this embodiment, the method includes:

S101: acquiring field information of the to-be-processed expression; the field information of the to-be-processed expression at least comprises: Chinese name, English name, data type and length precision;

In this embodiment, the field information of the to-be-processed expression may be obtained in various ways, for example from an existing standard database model or standard data dictionary, or from data systems used in particular scenarios.

S102: standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression;

In this embodiment, the to-be-processed expressions acquired in S101 may contain missing or repeated data. To improve the efficiency and accuracy of subsequent data processing, the acquired to-be-processed expressions may be standardized in advance, where the standardization process may include:

removing first special characters contained in the Chinese name of the to-be-processed expression;

removing second special characters contained in the English name of the to-be-processed expression;

detecting whether any data item is missing in the field of the to-be-processed expression;

if a missing data item is detected in the field of the to-be-processed expression, completing the missing data item with preset supplementary information;

and performing de-duplication processing on the fields by taking the Chinese name, English name and data characteristics of each field as reference units.

In this embodiment, different fields have different normative requirements. A processing rule is preset for each field, and each field is standardized according to its preset rule. Specifically, standardizing the Chinese name, the English name and the data characteristics according to the preset rules includes:

For Chinese names: removing the preset first special characters from the Chinese name;

The preset first special characters may be any characters other than Chinese characters, English letters and digits; for example, they may include question marks, quotation marks, and the like.

For English names: removing the preset second special characters from the English name;

The preset second special characters may be any characters other than English letters, digits and the underscore, which are the characters allowed in the database.

For data characteristics: processing them according to the database's specifications for data type and length precision.

As can be seen from the above description, the field information of the standard expression at least includes: Chinese name + English name + data characteristics. Data items may be missing from this field information, for example a missing Chinese name or a missing English name.

When a missing data item is detected in the field of the standard expression, the missing data item is completed with preset supplementary information.

In general, the Chinese name is the item most likely to be missing from the field information; in this case, the missing Chinese name can be completed using the English name of the standard expression.

Fields may also be repeated. To keep the data comprehensive while avoiding repeated fields, the data needs to be de-duplicated. Since a standard expression mainly includes a Chinese name, an English name and data characteristics, de-duplication in this embodiment uses these three items as the reference: when the Chinese name, English name and data characteristics of several fields are completely the same, only one of those fields is retained.
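The standardization step described above can be sketched as follows. This is a minimal illustration only: the record keys (`cn_name`, `en_name`, `data_type`, `length_precision`), the two regular expressions and the function name `standardize` are assumptions made for the example and are not specified by the embodiment.

```python
import re

# Assumed rule: in Chinese names keep only Chinese characters, English letters and digits.
CN_DISALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]")
# Assumed rule: in English names keep only English letters, digits and underscores.
EN_DISALLOWED = re.compile(r"[^A-Za-z0-9_]")


def standardize(records):
    """Clean, complete and de-duplicate field records (illustrative only)."""
    seen = set()
    result = []
    for rec in records:
        cn = CN_DISALLOWED.sub("", rec.get("cn_name") or "")
        en = EN_DISALLOWED.sub("", rec.get("en_name") or "")
        if not cn:
            # Complete a missing Chinese name from the English name.
            cn = en
        key = (cn, en, rec.get("data_type"), rec.get("length_precision"))
        if key in seen:
            # An identical field was already kept; drop the repeat.
            continue
        seen.add(key)
        result.append({
            "cn_name": cn,
            "en_name": en,
            "data_type": rec.get("data_type"),
            "length_precision": rec.get("length_precision"),
        })
    return result
```

Cleaning runs before de-duplication so that records differing only in stray punctuation collapse into one entry.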

S103: performing word segmentation processing on the Chinese name of each standard expression to obtain at least one Chinese word;

In this embodiment, various methods may be used to perform word segmentation on the Chinese name of the standard expression; the choice is not limited in this embodiment.

After the Chinese name is segmented, at least one Chinese word is obtained, and the obtained words together make up the Chinese name of the standard expression.

As described above, the standard expressions are de-duplicated based on the Chinese name, English name, data type and length precision. Word segmentation, however, operates only on the Chinese name, and standard expressions with the same Chinese name and English name but different data types or length precision may still remain after that de-duplication, in which case the same Chinese name would be segmented repeatedly. To avoid this, before word segmentation the standard expressions may be de-duplicated again by taking the Chinese name and the English name as reference units.
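As a sketch of this step, the code below segments each distinct (Chinese name, English name) pair only once. The embodiment does not prescribe a segmentation algorithm; forward maximum matching against a small vocabulary is used here purely as an assumed stand-in, and the names `segment`, `segment_unique` and `max_len` are hypothetical.

```python
def segment(name, vocabulary, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    words = []
    i = 0
    while i < len(name):
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + size]
            # Fall back to a single character when no vocabulary word matches.
            if piece in vocabulary or size == 1:
                words.append(piece)
                i += size
                break
    return words


def segment_unique(expressions, vocabulary):
    """Segment each distinct (Chinese name, English name) pair only once."""
    segmented = {}
    for expr in expressions:
        key = (expr["cn_name"], expr["en_name"])
        if key not in segmented:
            segmented[key] = segment(expr["cn_name"], vocabulary)
    return segmented
```

Two expressions that share names but differ in data type map to the same key, so the second one does not trigger another segmentation.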

S104: obtaining an English name corresponding to each Chinese word;

In this embodiment, various methods may be used to obtain the English name corresponding to each Chinese word, and the choice is not limited in this embodiment. The following two methods are preferred:

Method one:

detecting whether the English name of the field conforms to a preset rule;

if the English name conforms to the preset rule, splitting the English name of the field;

In this embodiment, English names may be constructed according to certain rules; for example, when words are connected by underscores, the English name can be split at the underscore positions.

For example: the English name of the field is "CUST_NM". From the meaning of the field it can be determined that "CUST" is the English abbreviation of "customer" and "NM" is the English abbreviation of "name".

Method two:

calling a preset translation tool, and obtaining the English name corresponding to the Chinese word through the preset translation tool.
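Method one can be sketched as below. The sketch assumes that the underscore-separated parts of the English name appear in the same order as the segmented Chinese words, which the embodiment does not state; the function name `pair_words` is hypothetical. When the counts differ, it returns `None` so that a caller could fall back to method two.

```python
def pair_words(cn_words, en_name):
    """Pair segmented Chinese words with underscore-separated English parts.

    Returns None when the number of parts differs from the number of words,
    so the caller can fall back to another method (for example a translation tool).
    """
    en_parts = [part for part in en_name.split("_") if part]
    if len(en_parts) != len(cn_words):
        return None
    # Positional pairing: an assumption made for this illustration.
    return dict(zip(cn_words, en_parts))
```

For the field in the example above, `pair_words(["客户", "名称"], "CUST_NM")` pairs "客户" with "CUST" and "名称" with "NM".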

S105: adding the Chinese words obtained after word segmentation processing and the English name corresponding to each Chinese word into a standard word library;

In this embodiment, when the Chinese words obtained by word segmentation and the English name corresponding to each Chinese word are added to the standard word library, either all generated Chinese words and their corresponding English names may be added, or they may be added according to a preset addition rule.

Adding according to the preset addition rule includes:

matching the Chinese words obtained by word segmentation with words in a standard word library;

marking words which are not successfully matched;

and if the marked words conform to the preset word definition, adding the marked words into the standard word library.
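The addition rule above can be sketched as follows, assuming the standard word library is held as a mapping from Chinese word to English name and that the "preset word definition" is supplied as a predicate. Both choices, and the name `add_new_words`, are assumptions for illustration.

```python
def add_new_words(candidates, word_library, meets_definition):
    """Add unmatched candidate words that pass a definition check.

    candidates: dict of Chinese word -> English name
    word_library: dict of Chinese word -> English name (updated in place)
    meets_definition: callable deciding whether a marked word qualifies
    Returns the marked words that were rejected.
    """
    # Mark the words that did not match anything in the library.
    marked = {cn: en for cn, en in candidates.items() if cn not in word_library}
    rejected = []
    for cn, en in marked.items():
        if meets_definition(cn, en):
            word_library[cn] = en
        else:
            rejected.append(cn)
    return rejected
```

Words already present in the library are left untouched, so an existing English name is never overwritten by this step.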

S106: acquiring standard domain information of the standard expressions, and adding the standard domain information into the standard domain library; the standard domain information includes: the classification words, data types and length precision of the standard expressions;

In this embodiment, in order to distinguish standard expressions, they may be described by features of different attributes; here, a standard expression is described by its classification attribute and its data attributes.

The data characteristics include: data type and length precision.

In this embodiment, the standard domain information of a standard expression may be obtained, for example, as follows:

based on the standard word library, performing word segmentation processing on the Chinese name of the standard expression to obtain at least one Chinese word;

taking the last Chinese word among the Chinese words forming the Chinese name as the classification word;

acquiring data characteristics of the standard expression; the data characteristics include: data type and length precision;

and taking the classification word, together with the data type and length precision of the standard expression, as the standard domain information of the standard expression.

In this embodiment, if the attributes of a standard expression are to be represented at a finer granularity, the classification word may be refined further, for example by adding a modifier to it, where the modifier may be a segmented Chinese word that precedes the classification word.

For example: if the classification word is "name", its modifiers may be "user", "system", etc.

In addition, the standard domain information may also include code information indicating business meaning.

In this embodiment, the data type and length precision are stored in the standard domain library, so that the data type, length, format and the like used for the same data stay consistent wherever the data is used, which avoids inconsistent interfaces during data interaction.
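A minimal sketch of deriving standard domain information is shown below. The tuple layout, the string representation of length precision and the names `StandardDomain` and `domain_of` are assumptions for illustration; modifiers and code information are omitted.

```python
from typing import NamedTuple, Sequence


class StandardDomain(NamedTuple):
    classification_word: str
    data_type: str
    length_precision: str


def domain_of(cn_words: Sequence[str], data_type: str, length_precision: str) -> StandardDomain:
    """Use the last segmented word of the Chinese name as the classification word."""
    if not cn_words:
        raise ValueError("a Chinese name must contain at least one word")
    return StandardDomain(cn_words[-1], data_type, length_precision)
```

Because the tuple is hashable, a standard domain library can be kept as a set, so that expressions sharing a classification word, data type and length precision map to a single domain entry.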

S107: associating the standard expressions, the standard word library and the standard domain library to generate a standard expression library;

A standard expression includes a plurality of items of field information, and the field information includes: Chinese name, English name and data characteristics. Associating the standard expressions with the standard word library and the standard domain library can be understood as associating each field of a standard expression with the standard word library and the standard domain library.

In this embodiment, the Chinese name of each standard expression is composed of words in the standard word library, so one standard expression may be associated with a plurality of words in the standard word library; each standard expression has standard domain information, so each standard expression is associated with one item of standard domain information in the standard domain library; and a word in the standard word library, such as a classification word, may be associated with a plurality of items of standard domain information in the standard domain library.

In this embodiment, after the field information of the to-be-processed expression is obtained, it is standardized to obtain the field information of the standard expression; word segmentation is then performed on the Chinese names of the fields, a standard word library is constructed from the Chinese words obtained by segmentation, and a standard domain library is constructed from the standard domain information of the standard expressions, wherein the standard domain information comprises classification words and data characteristics; the standard expression library is obtained by establishing association relations between the standard expressions and the standard word library and the standard domain library. It can be seen that the standard knowledge base obtained by the method contains not only the standard expressions but also the words forming them and their data characteristics. The standard knowledge base therefore divides standard expressions at a finer granularity, and the fine-grained information can improve the success rate of standard expression retrieval. Moreover, the degree of data standardization is improved and data types are unified, which improves the efficiency of logical model and physical design in the database generation process.

In this embodiment, the standard word library obtained by the above method may have some problems; for example, one Chinese word may correspond to multiple English names. Therefore, in order to refine the standard word library, referring to fig. 2, a flowchart of a method for verifying the standard word library is shown, including:

S201: checking whether the standard word library contains a Chinese word that corresponds to a plurality of English names;

S202: if such a Chinese word exists, determining the standard English name of the Chinese word;

S203: replacing the English names corresponding to the Chinese word with the standard English name.

In this embodiment, the standard English name of the Chinese word may be determined in multiple ways, which is not limited in this embodiment; preferably, either of the following two modes may be adopted:

Mode one:

detecting the occurrence frequency of each English name in a preset data model or database;

taking the English name with the highest occurrence frequency as the standard English name.
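
Mode one can be illustrated with a minimal Python sketch. It assumes that the English names used in the preset data model or database have already been collected into a plain list; the function name `pick_standard_english_name` and the sample data are hypothetical, and the tie-breaking rule is an arbitrary choice not stated in this embodiment.

```python
from collections import Counter

def pick_standard_english_name(candidates, observed_names):
    """Return the candidate English name that occurs most often.

    candidates: the English names currently mapped to one Chinese word.
    observed_names: English names harvested from an existing data model
    or database (how they are harvested is outside this sketch).
    """
    counts = Counter(name for name in observed_names if name in candidates)
    # Ties are resolved in favour of the earlier candidate (an assumption).
    return max(candidates, key=lambda name: counts[name])

# Hypothetical sample: "amt" is used three times, "amount" twice.
observed = ["amt", "amount", "amt", "bal", "amt", "amount"]
standard = pick_standard_english_name(["amount", "amt"], observed)
```

Counting only the names that are candidates keeps unrelated column names from influencing the choice.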

Mode two:

calling a preset translation tool, and obtaining a reference English name through the preset translation tool;

comparing the reference English name with the plurality of English names corresponding to the Chinese word to determine the standard English name.

The reference English name may or may not be the same as one of the English names corresponding to the Chinese word. If the reference English name is the same as one of the English names corresponding to the Chinese word, the reference English name is taken as the standard English name. If the reference English name differs from every one of the English names corresponding to the Chinese word, the similarity between the reference English name and each English name is calculated, and the English name with the highest similarity is taken as the standard English name; alternatively, the reference English name itself may be used as the standard English name.
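
The comparison in mode two might be sketched as follows. This embodiment does not name a similarity measure, so the sketch assumes the ratio computed by `difflib.SequenceMatcher` from the Python standard library; the function name `resolve_with_reference` is hypothetical, and the call to the translation tool is left out (the reference name is passed in directly).

```python
from difflib import SequenceMatcher

def resolve_with_reference(reference, candidates):
    """Choose a standard English name given a translation-tool reference."""
    # Case 1: the reference coincides with one of the existing names.
    if reference in candidates:
        return reference
    # Case 2: pick the existing name most similar to the reference.
    # (The embodiment also allows returning the reference itself instead.)
    return max(
        candidates,
        key=lambda name: SequenceMatcher(None, reference, name).ratio(),
    )
```

For example, with the reference name `"balance"` and the candidates `["balance_amt", "total"]`, the sketch selects `"balance_amt"`.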

In this embodiment, the standard word library may contain words whose meanings are ambiguous, and such words need to be processed to eliminate them. Referring to fig. 3, a flowchart of another verification method for the standard word library is shown, which in this embodiment includes:

S301: checking whether the standard word library contains words with ambiguous meanings;

S302: if the standard word library contains words with ambiguous meanings, obtaining the Chinese name from which each such word was split, and replacing the ambiguous word with that Chinese name.

In this embodiment, if a term needs to exist as an independent word, and its meaning becomes ambiguous after it is split into multiple words, the term is defined as a compound term, and the compound term replaces the ambiguous words.
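
Steps S301 and S302 can be pictured with a short Python sketch. How ambiguity is detected is not specified in this embodiment, so the sketch assumes the ambiguous words are supplied as a set, together with a hypothetical mapping `source_names` from each word to the Chinese name it was split from.

```python
def replace_ambiguous_words(word_library, ambiguous_words, source_names):
    """Replace ambiguous words with the compound term they were split from.

    word_library: set of Chinese words.
    ambiguous_words: set of words judged ambiguous (judgement is external).
    source_names: dict mapping a word to its Chinese name before splitting.
    """
    cleaned = set()
    for word in word_library:
        if word in ambiguous_words and word in source_names:
            cleaned.add(source_names[word])  # keep the compound term instead
        else:
            cleaned.add(word)
    return cleaned
```

For example, if "日" is ambiguous on its own and was split from "工作日", the library keeps "工作日" as a compound term and drops "日".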

In this embodiment, near-synonyms or synonyms may also exist in the standard word library. To further standardize them, referring to fig. 4, a flowchart of another verification method for the standard word library provided by the embodiment of the present invention is shown, including:

S401: detecting whether words with similar meanings exist in the standard word library;

where words with similar meanings may be understood as near-synonyms or synonyms.

S402: if words with similar meanings exist, determining a standard word from the words with similar meanings;

S403: removing from the standard word library the words whose meanings are similar to the standard word, and storing them in a non-standard word library;

S404: establishing a mapping relation between the standard words in the standard word library and the words with similar meanings stored in the non-standard word library.

In this embodiment, a standard word is selected from the words with similar meanings, and the other words are stored in the non-standard word library. The standard word library thus stores the standard words, and the non-standard word library stores their near-synonyms and synonyms, so that during retrieval a non-standard word can be resolved to its standard word through the non-standard word library, and the standard expression corresponding to the word can then be found. This improves the success rate of retrieval.
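
A minimal Python sketch of steps S402 to S404 and of the retrieval path follows. It assumes that the choice of standard word has already been made and is supplied as a dictionary from each standard word to its similar words; the function names `move_synonyms` and `resolve` are hypothetical.

```python
def move_synonyms(standard_words, synonym_groups):
    """Split a word set into standard words and a non-standard mapping.

    synonym_groups: dict mapping each chosen standard word to the words
    that have a similar meaning.
    Returns (remaining standard words, non-standard word -> standard word).
    """
    remaining = set(standard_words)
    non_standard = {}
    for standard, similar_words in synonym_groups.items():
        for word in similar_words:
            if word == standard:
                continue
            remaining.discard(word)         # S403: remove from standard library
            non_standard[word] = standard   # S404: record the mapping
    return remaining, non_standard

def resolve(word, standard_words, non_standard):
    """Map a query word to its standard word, or None if it is unknown."""
    return word if word in standard_words else non_standard.get(word)
```

With this mapping, a query that uses "顾客" can be resolved to the standard word "客户" before the standard expression library is searched.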

In this embodiment, standard expressions may exist whose Chinese names and English names are the same but whose standard domains are different. To distinguish such cases, the Chinese name and the English name may be specially marked, so that expressions carrying different marks are treated as having different Chinese names and English names. Specifically, the method further includes:

detecting whether there are standard expressions whose Chinese names and English names are the same but whose standard domain information is different;

if such standard expressions exist, generating identification information according to the standard domain information;

identifying the Chinese name and the English name of the standard expression through the identification information.
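
The marking step can be sketched in Python as below. The format of the identification information is not given in this embodiment; the sketch assumes, purely for illustration, a bracketed suffix built by joining the parts of the standard domain information. The dictionary keys `cn`, `en` and `domain` are hypothetical.

```python
from collections import defaultdict

def tag_conflicting_terms(terms):
    """Mark terms that share names but differ in standard domain.

    terms: list of dicts with keys "cn", "en" and "domain", where "domain"
    is a tuple such as (category word, data type, length precision).
    """
    domains_by_name = defaultdict(set)
    for term in terms:
        domains_by_name[(term["cn"], term["en"])].add(term["domain"])

    tagged = []
    for term in terms:
        if len(domains_by_name[(term["cn"], term["en"])]) > 1:
            # Assumed mark format: the domain parts joined by underscores.
            mark = "_".join(str(part) for part in term["domain"])
            term = {**term,
                    "cn": f'{term["cn"]}[{mark}]',
                    "en": f'{term["en"]}[{mark}]'}
        tagged.append(term)
    return tagged
```

Terms whose names are unique are returned unchanged, so the mark only appears where it is needed to tell two expressions apart.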

Referring to fig. 5, a schematic diagram of a standard knowledge base is shown. In this embodiment, the standard knowledge base includes:

a standard word library 501, including: Chinese words and English words;

a standard domain library 502, including: category words and data characteristics;

a standard expression library 503, including: standard expressions, the relation between the standard expressions and the standard word library, and the relation between the standard expressions and the standard domain library.

The standard knowledge base is constructed by the following method:

acquiring field information of the to-be-processed expression, where the field information of the to-be-processed expression at least includes: Chinese name, English name and data characteristics;

standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression;

performing word segmentation on the Chinese name of each standard expression to obtain at least one Chinese word;

obtaining an English name corresponding to each Chinese word;

adding the Chinese words obtained by word segmentation and the English name corresponding to each Chinese word to the standard word library;

acquiring standard domain information of the standard expression, and adding the standard domain information to the standard domain library, where the standard domain information includes: the category word and the data characteristics of the standard expression;

associating the standard expressions, the standard word library and the standard domain library to generate the standard expression library.
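
The three libraries and their associations can be pictured with a small Python data model. This is a sketch under stated assumptions rather than the structure used by the invention: the class and function names are hypothetical, the segmentation result is passed in ready-made, and the last segmented word is used as the category word.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class StandardDomain:
    category_word: str
    data_type: str
    length_precision: str

@dataclass
class KnowledgeBase:
    words: dict = field(default_factory=dict)    # Chinese word -> English name
    domains: set = field(default_factory=set)    # StandardDomain entries
    terms: dict = field(default_factory=dict)    # Chinese name -> association

def add_term(kb, cn_name, en_name, segments, data_type, length_precision):
    """Register one standard expression and its associations.

    segments: list of (Chinese word, English name) pairs produced by a
    segmenter that is outside this sketch.
    """
    for cn_word, en_word in segments:
        kb.words.setdefault(cn_word, en_word)        # standard word library
    domain = StandardDomain(segments[-1][0], data_type, length_precision)
    kb.domains.add(domain)                           # standard domain library
    kb.terms[cn_name] = (en_name, [w for w, _ in segments], domain)
    return kb

kb = add_term(KnowledgeBase(), "账户余额", "account_balance",
              [("账户", "account"), ("余额", "balance")], "DECIMAL", "18,2")
```

Each entry in `terms` links one standard expression to several words and to exactly one standard domain, which mirrors the one-to-many and one-to-one associations described in this embodiment.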

Optionally, standardizing the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression includes:

removing first special characters contained in the Chinese name of the to-be-processed expression;

removing second special characters contained in the English name of the to-be-processed expression;

if a data item is detected to be missing from a field of the to-be-processed expression, completing the missing data item with preset supplementary information;

performing de-duplication on the standard expressions by taking the Chinese name, the English name, the data type and the data characteristics as the reference unit.
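
The standardization steps above might look as follows in Python. The concrete rules are assumptions: the sketch treats whitespace and punctuation as the first special characters, treats anything other than ASCII letters, digits and underscores as the second special characters, and uses a hypothetical `DEFAULTS` table as the preset supplementary information.

```python
import re

# Hypothetical supplementary values for missing data items.
DEFAULTS = {"data_type": "VARCHAR", "length_precision": "255"}

def standardize(records):
    """Clean, complete and de-duplicate to-be-processed expressions."""
    seen, result = set(), []
    for rec in records:
        clean = {
            # Drop whitespace, punctuation and underscores from the Chinese name.
            "cn": re.sub(r"[\W_]+", "", rec.get("cn", "")),
            # Keep only ASCII letters, digits and underscores in the English name.
            "en": re.sub(r"[^A-Za-z0-9_]+", "", rec.get("en", "")),
        }
        for key, default in DEFAULTS.items():
            clean[key] = rec.get(key) or default     # fill missing data items
        fingerprint = tuple(clean[k] for k in
                            ("cn", "en", "data_type", "length_precision"))
        if fingerprint not in seen:                  # de-duplicate
            seen.add(fingerprint)
            result.append(clean)
    return result
```

De-duplication runs after cleaning, so two records that differ only in special characters collapse into one standard expression.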

Optionally, before performing word segmentation on the Chinese name of each standard expression, the method further includes:

performing de-duplication on the standard expressions by taking the Chinese name and the English name as the reference unit.

Optionally, adding the Chinese words obtained by word segmentation and the English name corresponding to each Chinese word to the standard word library includes:

matching the Chinese words obtained by word segmentation against the words in the standard word library;

marking the words that are not successfully matched;

if a marked word conforms to the preset word definition, adding the marked word to the standard word library.
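
The matching and marking steps can be sketched as follows. The preset word definition is not spelled out in this embodiment, so it is represented by a caller-supplied predicate `is_valid_word`; that predicate and the function name `add_new_words` are hypothetical.

```python
def add_new_words(segments, word_library, is_valid_word):
    """Add unmatched words that satisfy the word definition.

    segments: list of (Chinese word, English name) pairs from segmentation.
    word_library: dict mapping Chinese word -> English name (updated in place).
    is_valid_word: predicate standing in for the preset word definition.
    Returns the list of marked (unmatched) pairs.
    """
    marked = [(cn, en) for cn, en in segments if cn not in word_library]
    for cn, en in marked:
        if is_valid_word(cn):
            word_library[cn] = en
    return marked
```

Returning the marked pairs lets a reviewer inspect the words that were rejected by the definition as well as those that were added.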

Optionally, the method further includes:

checking whether the standard word library contains a Chinese word that corresponds to a plurality of English names;

if such a Chinese word exists, determining the standard English name of the Chinese word;

replacing the English names corresponding to the Chinese word with the standard English name;

or,

detecting whether the standard word library contains words with ambiguous meanings;

if the standard word library contains words with ambiguous meanings, obtaining the Chinese name from which each such word was split, and replacing the ambiguous word with that Chinese name.

Optionally, the method further includes:

checking whether words with similar meanings exist in the standard word library;

if words with similar meanings exist, determining a standard word from the words with similar meanings;

removing from the standard word library the words whose meanings are similar to the standard word, and storing them in a non-standard word library;

establishing a mapping relation between the standard words in the standard word library and the words with similar meanings stored in the non-standard word library.

Optionally, acquiring the standard domain information of the standard expression includes:

performing word segmentation on the Chinese name of the standard expression based on the standard word library to obtain at least one Chinese word;

taking the last of the Chinese words forming the Chinese name as the category word;

acquiring the data characteristics of the standard expression, where the data characteristics include: data type and length precision;

taking the category word together with the data type and the length precision of the standard expression as the standard domain information of the standard expression.
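
A minimal Python sketch of this derivation is given below. The embodiment only states that segmentation is based on the standard word library; the sketch assumes greedy forward maximum matching as one possible way to do that, and the function names `segment` and `standard_domain` are hypothetical.

```python
def segment(name, word_library):
    """Greedy forward maximum matching against the standard word library."""
    words, i = [], 0
    max_len = max(map(len, word_library), default=1)
    while i < len(name):
        # Try the longest possible piece first; fall back to one character.
        for size in range(min(max_len, len(name) - i), 0, -1):
            piece = name[i:i + size]
            if piece in word_library or size == 1:
                words.append(piece)
                i += size
                break
    return words

def standard_domain(name, word_library, data_type, length_precision):
    """Build standard domain information for a non-empty Chinese name."""
    words = segment(name, word_library)
    return {
        "category_word": words[-1],   # the last word is the category word
        "data_type": data_type,
        "length_precision": length_precision,
    }
```

For the Chinese name "账户余额" and a library containing "账户" and "余额", the category word is "余额".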

Optionally, the method further includes:

detecting whether there are standard expressions whose Chinese names and English names are the same but whose standard domain information is different;

if such standard expressions exist, generating identification information according to the standard domain information;

identifying the Chinese name and the English name of the standard expression through the identification information.

Referring to fig. 6, a schematic structural diagram of an apparatus for constructing a standard knowledge base according to an embodiment of the present invention is shown. In this embodiment, the apparatus includes:

a to-be-processed expression obtaining unit 601, configured to obtain field information of a to-be-processed expression, where the field information of the to-be-processed expression at least includes: Chinese name, English name and data characteristics;

a standardization unit 602, configured to standardize the field information of the to-be-processed expression according to a preset rule to obtain the field information of the standard expression;

a standard word library constructing unit 603, configured to perform word segmentation on the Chinese name of each standard expression to obtain at least one Chinese word; obtain an English name corresponding to each Chinese word; and add the Chinese words obtained by word segmentation and the English name corresponding to each Chinese word to the standard word library;

a standard domain library constructing unit 604, configured to obtain standard domain information of the standard expression and add the standard domain information to the standard domain library, where the standard domain information includes: the category word and the data characteristics of the standard expression;

a standard expression library constructing unit 605, configured to associate the standard expressions, the standard word library and the standard domain library to generate the standard expression library.

Optionally, the standardization unit is configured to:

remove first special characters contained in the Chinese name of the to-be-processed expression; remove second special characters contained in the English name of the to-be-processed expression; if a data item is detected to be missing from a field of the to-be-processed expression, complete the missing data item with preset supplementary information; and perform de-duplication on the standard expressions by taking the Chinese name, the English name, the data type and the data characteristics as the reference unit.

Optionally, the apparatus further includes:

a de-duplication unit, configured to perform de-duplication on the standard expressions by taking the Chinese name and the English name as the reference unit.

Optionally, the standard word library constructing unit includes:

a data adding subunit, configured to match the Chinese words obtained by word segmentation against the words in the standard word library; mark the words that are not successfully matched; and, if a marked word conforms to the preset word definition, add the marked word to the standard word library.

Optionally, the apparatus further includes a first standard word library verification unit, configured to:

check whether the standard word library contains a Chinese word that corresponds to a plurality of English names;

if such a Chinese word exists, determine the standard English name of the Chinese word;

replace the English names corresponding to the Chinese word with the standard English name;

or,

detect whether the standard word library contains words with ambiguous meanings;

if the standard word library contains words with ambiguous meanings, obtain the Chinese name from which each such word was split, and replace the ambiguous word with that Chinese name.

Optionally, the apparatus further includes a second standard word library constructing unit, configured to:

check whether words with similar meanings exist in the standard word library;

if words with similar meanings exist, determine a standard word from the words with similar meanings;

remove from the standard word library the words whose meanings are similar to the standard word, and store them in a non-standard word library;

establish a mapping relation between the standard words in the standard word library and the words with similar meanings stored in the non-standard word library.

Optionally, the standard domain library constructing unit includes:

a standard domain information obtaining subunit, configured to:

perform word segmentation on the Chinese name of the standard expression based on the standard word library to obtain at least one Chinese word; take the last of the Chinese words forming the Chinese name as the category word; acquire the data characteristics of the standard expression, where the data characteristics include: data type and length precision; and take the category word together with the data type and the length precision of the standard expression as the standard domain information of the standard expression.

Optionally, the apparatus further includes:

a distinguishing unit, configured to detect whether there are standard expressions whose Chinese names and English names are the same but whose standard domain information is different; if such standard expressions exist, generate identification information according to the standard domain information; and identify the Chinese name and the English name of the standard expression through the identification information.

After obtaining the field information of the to-be-processed expression, the apparatus of this embodiment standardizes that field information, then segments the Chinese name into words, constructs a standard word library from the Chinese words obtained by segmentation, and constructs a standard domain library from the standard domain information of the standard expression, where the standard domain information includes category words and data characteristics. The standard expression library is obtained by establishing the association between the standard expressions and the standard word library and the standard domain library. The standard knowledge base obtained in this way therefore contains not only the standard expressions, but also the words that form each standard expression and the data characteristics of each standard expression. The standard knowledge base thus divides the standard expressions at a finer granularity, and the fine-grained information can improve the success rate of standard expression retrieval. Moreover, the degree of data standardization is improved and the data types are unified, which improves the efficiency of logical modeling and physical design in the database generation process.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
