Method and device for converting Chinese natural language into database language

Document No.: 136541; Published: 2021-10-22

Reading note: This technology, "Method and device for converting Chinese natural language into database language" (中文自然语言转数据库语言的方法及装置), was designed by Chen Jiangjie (陈江捷), Liang Jiaqing (梁家卿), Fang Shineng (方世能), and Xiao Yanghua (肖仰华) on 2020-04-17. Abstract: The invention provides a method and a device for converting Chinese natural language into database language, used for converting natural language text input by a user into a query statement capable of querying a database according to the database, characterized by comprising the following steps: a preprocessing step of performing normalized correction on the natural language text to obtain a normalized text; a column filling step of performing column filling processing based on the normalized text and the headers of the data tables in the database, thereby generating a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator; a condition filling step of performing extraction on the normalized text, based on the normalized text and the WHERE column, and filling in the WHERE content corresponding to the WHERE column; and an assembly output step of assembling the connector, the SELECT column and its corresponding aggregation function, the WHERE column and its corresponding WHERE operator, and the WHERE content into the query statement and outputting the query statement.

1. A method for converting Chinese natural language into database language, used for converting natural language text input by a user into a query statement capable of querying a database according to the database, characterized by comprising the following steps:

a preprocessing step of performing normalized correction on the natural language text to obtain a normalized text;

a column filling step of inputting the normalized text and the headers of all data tables in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and performing column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator;

a condition filling step of inputting the normalized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, performing extraction on the normalized text based on the semantic information, and filling in the WHERE content corresponding to the WHERE column; and

an assembly output step of assembling the connector, the SELECT column and its corresponding aggregation function, the WHERE column and its corresponding WHERE operator, and the WHERE content into the query statement and outputting the query statement.

2. The method for converting Chinese natural language into database language according to claim 1, wherein:

in the condition filling step, when extraction is performed on the normalized text based on the semantic information, the text content corresponding to the WHERE column is extracted from the normalized text, fuzzy matching is performed in turn between the text content and each data content in the data table, and the data content with the highest similarity is taken as the WHERE content.

3. The method for converting Chinese natural language into database language according to claim 1, wherein:

the input of the first BERT model is an input text formed by adding the token [CLS] before the normalized text and splicing on the headers, which are separated by [SEP],

the column filling processing includes:

connector filling, in which the encoding vector output for [CLS] after passing through the first BERT model and the first DGCNN model is input into a first fully-connected layer for prediction so as to fill the connector;

SELECT column and aggregation function filling, in which the encoding vectors output for all the [SEP] tokens after passing through the first BERT model and the first DGCNN model are input into a second fully-connected layer for prediction so as to fill the SELECT column and the aggregation function;

WHERE column filling, in which the encoding vector output for the [SEP] corresponding to each header after passing through the first BERT model and the first DGCNN model is input in turn into a third fully-connected layer to predict whether the corresponding header is in a WHERE condition, and the header is filled in as the WHERE column if it is; and

WHERE operator filling, in which the output of the first DGCNN model is taken as the input of a fourth fully-connected layer to predict the WHERE operator corresponding to each character in the normalized question,

and in the assembly output step, when the WHERE column is assembled with its corresponding WHERE operator and the WHERE content, the corresponding WHERE operator is found according to the characters in the normalized question that correspond to the WHERE content, and the assembly is completed.

4. The method for converting Chinese natural language into database language according to claim 1, wherein:

the normalized correction comprises:

number unification processing, which converts Chinese numerals into Arabic numerals by regular-expression matching;

year and date normalization processing, which corrects the dates and times in the natural language text into the form of expression used in the database;

numerical unit unification processing, which unifies the different numerical units in the natural language text into the numerical units used in the database; and

synonymous expression correction processing, which uses an entity disambiguation technique to correct mentions in the natural language text into the corresponding entities in the database.

5. A device for converting Chinese natural language into database language, used for converting natural language text input by a user into a query statement capable of querying a database according to the database, comprising:

a preprocessing module, configured to perform normalized correction on the natural language text to obtain a normalized text;

a column filling module, configured to input the normalized text and the headers of the data tables in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and to perform column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator;

a condition filling module, configured to input the normalized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, to perform extraction on the normalized text based on the semantic information, and to fill in the WHERE content corresponding to the WHERE column; and

an assembly output module, configured to assemble the connector, the SELECT column and its corresponding aggregation function, the WHERE column and its corresponding WHERE operator, and the WHERE content into the query statement and to output the query statement.

Technical Field

The invention belongs to the field of converting natural language into structured text, and particularly relates to a method and a device for converting Chinese natural language into the structured query language used to query database tables.

Background

The conversion from natural language to SQL is an important subject of natural language structuring, and requires that a machine can understand information such as query intention and restriction conditions of a human question sentence and generate an executable SQL sentence corresponding to the natural language question sentence according to syntax of a database structured query language. The application scene of converting natural language into SQL is wide, and the technology is a key technology of intelligent customer service and intelligent assistance, but due to the complexity of human language, the technology still needs to be improved.

In the prior art, the method for converting the natural language into the SQL can be divided into three categories:

1) Rule-based methods. These methods use manually defined rules to extract the intent and conditions in the question, for example by extracting the table fields and table contents mentioned in the question with a predefined NER dictionary, and then assemble the complete SQL statement according to SQL grammar.

2) Sequence-generation methods. These methods treat the natural-language-to-SQL task as a sequence-to-sequence generation task and adopt Seq2Seq-style models. Some researchers combine SQL grammar with reinforcement learning when generating the SQL sequence in order to strengthen the model's ability to produce output that conforms to SQL grammar.

3) Slot-filling methods. An SQL statement is highly structured, and generated statements conform to a uniform template. The conversion from natural language to SQL can therefore be regarded as a slot filling task: the slots of the template are filled through a series of classification or extraction tasks, which completes the conversion to an SQL statement. For example, filling the field to be queried in SELECT amounts to classifying the fields of the table.

Each of these approaches has limitations. Because manually defined rules are limited, rule-based methods can only generate SQL statements for natural language questions with specific, simple sentence patterns and cannot handle more complex questions. Sequence-generation methods ignore the structural characteristics of SQL statements and cannot exploit the template information of SQL grammar, so they often generate SQL statements that do not conform to the grammar, which reduces the accuracy of the generated statements. Existing slot-filling methods are limited by the ability of the deep model to encode natural language questions: the information in questions with complex sentence patterns and semantics cannot be encoded well into a low-dimensional vector, the weaker natural language representation degrades the downstream classification and extraction tasks, and the accuracy of slot value filling falls. In addition, the prior art does not consider the synonymy and ambiguity present in natural language, which reduces the accuracy with which the association between the data table and the question, and the condition values in the SQL statement, are captured.

Disclosure of Invention

In order to solve the above problems, the invention provides a method and a device that use transfer learning and rules to enhance the representation of Chinese natural language so as to convert it accurately into an SQL statement. The invention adopts the following technical scheme:

The invention provides a method for converting Chinese natural language into database language, used for converting natural language text input by a user into a query statement capable of querying a database according to the database, characterized by comprising the following steps: a preprocessing step of performing normalized correction on the natural language text to obtain a normalized text; a column filling step of inputting the normalized text and the headers of the data tables in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and performing column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator; a condition filling step of inputting the normalized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, performing extraction on the normalized text based on the semantic information, and filling in the WHERE content corresponding to the WHERE column; and an assembly output step of assembling the connector, the SELECT column and its corresponding aggregation function, the WHERE column and its corresponding WHERE operator, and the WHERE content into a query statement and outputting the query statement.

The method for converting the Chinese natural language into the database language provided by the invention can also have the technical characteristics that in the condition filling step, when the standard text is extracted based on the semantic information, the text content corresponding to the WHERE column is extracted from the standard text, the similarity calculation is sequentially carried out on the text content and each data content in the data table, and the data content with the highest similarity is further selected as the WHERE content.

The method for converting Chinese natural language into database language provided by the invention may further have the following technical features. The input of the first BERT model is an input text formed by adding the token [CLS] before the normalized text and splicing on the headers, which are separated by [SEP]. The column filling processing comprises: connector filling, in which the encoding vector output for [CLS] after passing through the first BERT model and the first DGCNN model is input into a first fully-connected layer for prediction so as to fill the connector; SELECT column and aggregation function filling, in which the encoding vectors output for all the [SEP] tokens after passing through the first BERT model and the first DGCNN model are input into a second fully-connected layer for prediction so as to fill the SELECT column and the aggregation function; WHERE column filling, in which the encoding vector output for the [SEP] corresponding to each header after passing through the first BERT model and the first DGCNN model is input in turn into a third fully-connected layer to predict whether the corresponding header is in a WHERE condition, and the header is filled in as the WHERE column if it is; and WHERE operator filling, in which the output of the first DGCNN model is used as the input of a fourth fully-connected layer to predict the WHERE operator corresponding to each character in the normalized question. In the assembly output step, when a WHERE column is assembled with its corresponding WHERE operator and the WHERE content, the corresponding WHERE operator is found according to the characters in the normalized question that correspond to the WHERE content, and the assembly is completed.

The method for converting Chinese natural language into database language provided by the invention may further have the technical feature that the normalized correction comprises: number unification processing, which converts Chinese numerals into Arabic numerals by regular-expression matching; year and date normalization processing, which corrects the dates and times in the natural language text into the form of expression used in the database; numerical unit unification processing, which unifies the different numerical units in the natural language text into the numerical units used in the database; and synonymous expression correction processing, which uses an entity disambiguation technique to correct mentions in the natural language text into the corresponding entities in the database.

The invention also provides a device for converting Chinese natural language into database language, used for converting natural language text input by a user into a query statement capable of querying a database according to the database, characterized by comprising: a preprocessing module, configured to perform normalized correction on the natural language text to obtain a normalized text; a column filling module, configured to input the normalized text and the headers of the data tables in the database into a preset first BERT model and a preset first DGCNN model so as to obtain a semantic representation of the natural language text and a header representation of each header, and to perform column filling processing based on the semantic representation and the header representations so as to generate a connector, a SELECT column and its corresponding aggregation function, and a WHERE column and its corresponding WHERE operator; a condition filling module, configured to input the normalized text and the WHERE column into a preset second BERT model and a preset second DGCNN model so as to obtain semantic information, to perform extraction on the normalized text based on the semantic information, and to fill in the WHERE content corresponding to the WHERE column; and an assembly output module, configured to assemble the connector, the SELECT column and its corresponding aggregation function, the WHERE column and its corresponding WHERE operator, and the WHERE content into a query statement and to output the query statement.

Action and Effect of the invention

According to the method and the device for converting Chinese natural language into database language, the natural language text is turned into a normalized text through normalized correction, which unifies the form of the questions and facilitates the subsequent mining and modeling of text features. Further, when the normalized text is converted into an SQL query statement, processing is staged into a column filling step, which handles classification tasks, and a condition filling step, which handles a reading-comprehension task, and two sets of BERT and DGCNN models that do not share parameters are used for feature extraction in the two steps respectively. In the column filling step, the headers of the data table and the normalized text can therefore be semantically analyzed together, and by combining the two representations the connector, the SELECT column and its corresponding aggregation function, and the WHERE column and its corresponding WHERE operator in the SQL query statement can be predicted more accurately. In the condition filling step, the corresponding WHERE content can be accurately extracted from the normalized text based on the predicted WHERE column, which enhances the representation capability for Chinese natural language. The conversion method and device can thus adapt better to Chinese text, represent synonymous entities well, and extract content more accurately, thereby ensuring the accuracy of the generated SQL query statement.

Drawings

FIG. 1 is a schematic diagram of a movie data table according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for converting Chinese natural language to database language according to an embodiment of the present invention;

FIG. 3 is a block diagram of a method for converting Chinese natural language into database language according to an embodiment of the present invention;

FIG. 4 is a block diagram of a column filling step in an embodiment of the present invention; and

FIG. 5 is a schematic diagram of the structure of the SQL statement in the embodiment of the present invention.

Detailed Description

In order to make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the method and device for converting Chinese natural language into database language are described below with reference to an embodiment and the drawings.

< example >

When processing natural language text, the method for converting Chinese natural language into database language of the present invention can also handle natural language questions (hereinafter, questions) that are more complex and semantically richer than simple sentence patterns. In this embodiment, the processing flow is described using the question "What is the total box-office share of the two films Bumblebee and Escape Room in the fourth week of 19?" as an example.

In addition, in this embodiment, the method for converting the chinese natural language into the database language can be implemented by a computer, and each step of the method is programmed into a corresponding executable module and stored in the computer, and the question sentence is converted into an SQL sentence (query sentence) capable of querying the database by sequentially operating each executable module.

FIG. 2 is a flowchart illustrating a method for converting Chinese natural language into database language according to an embodiment of the present invention.

FIG. 3 is a block diagram of a method for converting Chinese natural language into database language according to an embodiment of the present invention.

As shown in fig. 2 and 3, the method for converting the chinese natural language into the database language includes the following steps:

Preprocessing step S1: the natural language question (i.e., the natural language text) is given normalized correction to obtain a normalized question (i.e., the normalized text).

In this embodiment, Chinese natural language questions are highly colloquial and lack a uniform descriptive convention. For example, numbers may be written either as Arabic numerals or as Chinese numerals, which makes SQL generation harder. The preprocessing step S1 therefore applies normalized correction to the expressions of an irregular natural language question in four respects, namely numbers, years and dates, numerical units, and synonyms, so that the subsequent steps can more easily mine and model the features of the question. The normalized correction specifically comprises:

1) Number unification processing: to make it easier for downstream steps to extract content from the question, Chinese numerals are converted into Arabic numerals by regular-expression matching;

2) Year and date normalization processing: to normalize how the question expresses dates and years, the dates and times in the question are corrected into the form of expression used in the database, for example correcting a two-digit year such as "10年" (the year '10) to "2010", as used in the data table;

3) Numerical unit unification processing: the numerical units in the question are unified with those in the table, that is, different numerical units in the question are converted into the units used in the database, for example "5000 meters" in the question becomes "5 kilometers" as in the database;

4) Synonymous expression correction processing: to resolve references in the question, an entity disambiguation technique is used to correct a mention in the question into the corresponding entity, for example correcting "little yellow bike" to "ofo shared bicycle".

Through this normalized correction, a natural language question is corrected into a normalized question. For example, after the preprocessing step S1, the question "What is the total box-office share of the two films Bumblebee and Escape Room in the fourth week of 19?" becomes the normalized question "What is the total box-office share of the two films Bumblebee and Escape Room in the 4th week of 2019?".
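The number and year corrections can be illustrated with a minimal Python sketch. It is not the embodiment's implementation: the function names, the rule that only numerals following 第 are converted, the restriction to numerals below 100, and the assumption that a two-digit year belongs to the 2000s are all simplifications introduced here for illustration. Unit unification and entity disambiguation are omitted.

```python
import re

# Hypothetical digit table for this sketch; only numerals below 100 are handled.
CN_DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}


def cn_to_int(token: str) -> int:
    """Convert a Chinese numeral below 100 (e.g. 四, 十五, 二十三) to an int."""
    if "十" in token:
        tens, _, ones = token.partition("十")
        return (CN_DIGITS[tens] if tens else 1) * 10 + (CN_DIGITS[ones] if ones else 0)
    # No 十: read the characters as a plain digit string (e.g. 二零 -> 20).
    return int("".join(str(CN_DIGITS[ch]) for ch in token))


def normalize_numbers(text: str) -> str:
    """Replace Chinese numerals that directly follow the ordinal marker 第."""
    pattern = re.compile(r"(?<=第)[零一二两三四五六七八九十]+")
    return pattern.sub(lambda m: str(cn_to_int(m.group())), text)


def normalize_year(text: str) -> str:
    """Expand a two-digit year before 年 to four digits (assumes the 2000s)."""
    return re.sub(r"(?<!\d)(\d{2})年", lambda m: f"20{m.group(1)}年", text)


def normalize(text: str) -> str:
    return normalize_year(normalize_numbers(text))


print(normalize("19年第四周大黄蜂和密室逃生这两部影片的票房总占比是多少"))
```

Restricting the numeral rule to positions after 第 keeps words such as 两部 untouched; a fuller rule set would need to decide case by case which numerals are safe to rewrite.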

A column filling step S2, inputting the canonical question and the headers of each data table in the database into a preset first BERT model and a first DGCNN model, so as to obtain semantic representations of the natural language question and header representations of each header, and filling the connectors, SELECT columns and corresponding aggregation functions, and WHERE columns and corresponding WHERE operators based on the semantic representations and the header representations.

FIG. 4 is a flowchart illustrating a column filling step according to an embodiment of the present invention.

As shown in fig. 4, the input of the column filling step S2 is the normalized question produced by the preprocessing step S1 together with the headers of the data table corresponding to the question, and the step predicts the SELECT query column, its corresponding aggregation function, the connector, and the WHERE condition column of the SQL statement to be generated. The column filling step S2 can be divided into two sub-steps:

Step S2-1, question and header representation: the normalized question and the headers of the data tables in the database are input into a preset first BERT model and a preset first DGCNN model so as to obtain the semantic representation of the natural language question and the header representations.

In step S2-1 of this embodiment, first, a pre-training language model (i.e., the first BERT model) is used to obtain semantic representations of an input natural language (i.e., a canonical question), and then a convolutional neural network model (i.e., the first DGCNN model) based on a Convolutional Neural Network (CNN) and an Attention mechanism (Attention) is used to perform sequence feature extraction, so as to obtain representations of the question and each header.

Since the question alone is not enough to determine which part of the data table the user actually wants to query, the question needs to be encoded together with the headers of the data table so that the computer can learn the correspondence between the question and the headers. To obtain the representations of the question and the headers, Chinese BERT is therefore used as the input encoder, because the BERT model is pre-trained on a large-scale corpus and has strong semantic representation capability. A DGCNN model based on CNN and the Attention mechanism is then used to further extract the semantic relations between the question and the headers. This exploits the computational efficiency of CNN compared with RNN and allows the CNN model to capture longer-range information without increasing the number of model parameters, which improves the running efficiency on the computer while maintaining prediction accuracy.

Specifically, in step S2-1, as shown in fig. 4, the token [CLS], which is used for classifying the connector, is added in front of the natural language question (i.e., Question in fig. 4), and the headers (i.e., H1, H2, H3, H4 in fig. 4) are separated by [SEP], thereby splicing the question and the headers together. The spliced input text is then used as the input of the BERT encoder, and the BERT output for the question is obtained after computation by BERT's multi-head self-attention mechanism and multiple Transformer layers.
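The splicing can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_input` is hypothetical, tokens are taken one per character, each [SEP] is assumed to precede the header it stands for, and no trailing [SEP] is added. The text does not fix these details, and fig. 4 may differ.

```python
from typing import List, Tuple


def build_input(question: str, headers: List[str]) -> Tuple[List[str], int, List[int]]:
    """Build '[CLS] question [SEP] h1 [SEP] h2 ...' and record marker positions.

    Returns (tokens, cls_index, sep_indices), where sep_indices[k] is the
    position of the [SEP] assumed to represent headers[k].
    """
    tokens = ["[CLS]"] + list(question)  # character-level tokens (assumption)
    sep_indices = []
    for header in headers:
        sep_indices.append(len(tokens))  # position of this header's [SEP]
        tokens.append("[SEP]")
        tokens.extend(list(header))
    return tokens, 0, sep_indices


tokens, cls_index, sep_indices = build_input("票房", ["影片", "周"])
print(tokens)
print(cls_index, sep_indices)
```

Recording the [CLS] and [SEP] positions at construction time lets the later classification layers pick out the encoder output vectors for the whole input and for each header by index.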

Meanwhile, in order to enhance the representation of the question, the part-of-speech-tagged question (i.e., Question POS Tag) is passed through an Embedding layer to obtain part-of-speech information at the level of each character, and the BERT output for the question and the Embedding of the part-of-speech-tagged question are added together as the input of the DGCNN model. After the multiple one-dimensional convolution blocks of the DGCNN, the Attention mechanism is used in place of the Pooling layer of a conventional CNN to integrate the information of the input sequence effectively. Finally, the output of the DGCNN serves as the representation of the question and the headers in the column filling step S2.
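Two of the operations in this paragraph, adding the part-of-speech embedding to the BERT output and using attention in place of pooling, can be sketched in plain Python. The text does not specify the form of the attention, so the dot-product scoring against a hypothetical learned vector `w` below is an assumption, as are the function names. The convolution blocks themselves are not shown.

```python
import math
from typing import List

Vector = List[float]


def add_pos_embedding(bert_out: List[Vector], pos_emb: List[Vector]) -> List[Vector]:
    """Element-wise sum of BERT outputs and part-of-speech embeddings per position."""
    return [[b + p for b, p in zip(bv, pv)] for bv, pv in zip(bert_out, pos_emb)]


def attention_pool(states: List[Vector], w: Vector) -> Vector:
    """Weighted sum of states; weights are a softmax over dot products with w.

    This stands in for a pooling layer: every position contributes in
    proportion to its score instead of taking a max or a plain average.
    """
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in states]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(states[0])
    return [sum(a * h[d] for a, h in zip(alphas, states)) for d in range(dim)]


summed = add_pos_embedding([[1.0, 2.0]], [[0.5, 0.5]])
pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(summed, pooled)
```

With a zero scoring vector all positions receive equal weight, so the pooled vector reduces to the plain average; a trained `w` would shift the weight toward the positions that matter for the prediction.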

Step S2-2, column filling: the connector, the SELECT column and its corresponding aggregation function, and the WHERE column and its corresponding WHERE operator are filled in based on the semantic representation and the header representations output in step S2-1.

In this embodiment, SELECT and WHERE in the slot template are preset template keywords before column filling. In step S2-2, the connector, the SELECT column and its corresponding aggregation function, and the WHERE column and cop are each generated by one of several fully-connected (Dense) layers from the encoded question and headers. As shown in fig. 5, taking one SQL statement as an example: the connector 10 is the operator used when several conditions occur in WHERE, and takes one of the values "AND", "OR", and "NULL"; the SELECT column 20 is a column name in the SELECT clause, and the corresponding aggregation function 30 is a function that performs aggregation such as counting or summation on the values of the SELECT column in the data table, for example "AVG", "COUNT", or "SUM"; the WHERE column 40 is a column name in the WHERE condition and is used by the condition filling step S3 to extract the WHERE content 50 corresponding to that column name; and cop denotes the operator 60 corresponding to each header in the input. The specific process for the connector, the SELECT column and its corresponding aggregation function, the WHERE column, and cop is as follows:

Connector filling: the connector has three possible outputs, "AND", "OR", and "NULL", so filling the connector can be regarded as a three-class classification problem. The input is the state corresponding to [CLS], which can be regarded as a representation of the whole question together with the headers, and this state is passed through a fully-connected layer (first fully-connected layer D1) to predict the connector. That is, in connector filling, the encoding vector output for [CLS] after passing through the first BERT model and the first DGCNN model is input into the first fully-connected layer for prediction, thereby filling the connector.
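A three-class prediction of this kind can be sketched as one dense layer followed by a softmax. The weights, bias, and input vector below are made-up toy values, and the helper names are hypothetical; in the embodiment the parameters of D1 would be learned and the input would be the encoded [CLS] vector.

```python
import math
from typing import List

CONNECTORS = ["AND", "OR", "NULL"]  # the three possible outputs named in the text


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One fully-connected layer: one output logit per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_connector(cls_vector: List[float],
                      weights: List[List[float]], bias: List[float]) -> str:
    """Return the connector whose class probability is highest."""
    probs = softmax(dense(cls_vector, weights, bias))
    best = max(range(len(probs)), key=probs.__getitem__)
    return CONNECTORS[best]


# Toy parameters (assumed, not trained): logits come out as [0, 2, 1].
toy_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]
toy_bias = [0.0, 0.0, 0.0]
print(predict_connector([1.0, 0.0], toy_weights, toy_bias))
```

The same pattern, a dense layer over a selected encoder vector followed by an argmax, applies to the other classification heads, with only the input vector and the number of classes changing.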

SELECT column and aggregation function filling: predicting the SELECT column and its corresponding aggregation function is treated as a single task, namely classifying each header. The possible outputs are "selected with no aggregation function", "AVG", "MAX", "MIN", "COUNT", "SUM", and "not selected". The first six indicate that the header is selected and the seventh indicates that it is not. Predicting whether each header is selected and which aggregation function it takes is therefore essentially a seven-class classification problem. The state of each [SEP] is regarded as the representation of the corresponding header and is passed through a fully-connected layer (second fully-connected layer D2) to predict the SELECT column and the aggregation function. That is, in SELECT column and aggregation function filling, the encoding vectors output for all the [SEP] tokens after passing through the first BERT model and the first DGCNN model are input into the second fully-connected layer for prediction, thereby filling the SELECT column and the aggregation function.
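Once a class has been predicted for each header, turning the seven classes into SELECT expressions is a small decoding step, sketched below. The class order follows the list in the text; the function name, the use of an empty string and `None` as markers, and the example headers are assumptions made for illustration.

```python
from typing import List, Optional

# Class order as listed in the text: index 0 is "selected with no aggregation
# function", indices 1-5 are aggregation functions, index 6 is "not selected".
SELECT_CLASSES: List[Optional[str]] = ["", "AVG", "MAX", "MIN", "COUNT", "SUM", None]


def decode_select(headers: List[str], class_ids: List[int]) -> List[str]:
    """Turn one predicted class per header into a list of SELECT expressions."""
    expressions = []
    for header, class_id in zip(headers, class_ids):
        agg = SELECT_CLASSES[class_id]
        if agg is None:          # class 6: header is not in the SELECT clause
            continue
        expressions.append(f"{agg}({header})" if agg else header)
    return expressions


# Hypothetical headers; the real ones come from the data table in fig. 1.
print(decode_select(["影片", "票房占比", "周"], [6, 5, 6]))
```

Folding "which column" and "which aggregation" into one label per header means a single classifier output is enough to produce each SELECT expression.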

WHERE column filling: WHERE column filling, i.e., predicting whether a header is in a WHERE condition, is therefore a binary problem for each header. The [ SEP ] corresponding to each header is passed through a full link layer (third full link layer D3) to predict whether the header is in WHERE condition. That is, the WHERE column is filled, in which the encoding vector output correspondingly after the [ SEP ] corresponding to each header passes through the first BERT model and the first DGCNN model is sequentially input to a third fully-connected layer and whether the corresponding header is in a WHERE condition is predicted, and if so, the header is filled as the WHERE column.

WHERE operator filling: this predicts the operator corresponding to each WHERE column. Because one header may correspond to several operators, predicting the WHERE column and predicting its operator are split into two tasks, unlike SELECT column filling, where the column and its aggregation function are predicted as one task. Specifically, an operator is predicted for each word of the original question, so that the table content mentioned in the question is mapped to its operator. The output of the first DGCNN model is therefore used as the input of one fully-connected layer (the fourth fully-connected layer D4) to predict the operator for each word of the question, and the assembly output step S4 then looks up the operator according to the WHERE content extracted in the condition filling step S3.
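The classification heads D1 to D3 described above can be illustrated with a minimal decoding sketch. It assumes the per-class scores have already been produced by the fully-connected layers; the function name `decode_column_filling`, the argument layout, and the score values in the usage note are hypothetical and only show how the three-class, seven-class, and binary outputs might be turned into a connector, SELECT columns, and WHERE columns.

```python
from typing import Dict, List

CONNECTORS = ["AND", "OR", "NULL"]
# Seven classes per header, in the order given in the text:
# index 0 = selected without aggregation, 1..5 = aggregation functions,
# index 6 (None) = not selected.
SELECT_CLASSES = ["", "AVG", "MAX", "MIN", "COUNT", "SUM", None]


def argmax(scores: List[float]) -> int:
    """Index of the largest score (the first one wins on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_column_filling(
    headers: List[str],
    connector_scores: List[float],     # assumed D1 output for [CLS], 3 scores
    select_scores: List[List[float]],  # assumed D2 output, 7 scores per header
    where_scores: List[List[float]],   # assumed D3 output, [not in WHERE, in WHERE]
) -> Dict[str, object]:
    """Turn head scores into a connector, SELECT columns and WHERE columns."""
    select, where_columns = [], []
    for header, s_scores, w_scores in zip(headers, select_scores, where_scores):
        agg = SELECT_CLASSES[argmax(s_scores)]
        if agg is not None:            # classes 0..5 mean "selected"
            select.append((header, agg))
        if argmax(w_scores) == 1:      # class 1 means "in a WHERE condition"
            where_columns.append(header)
    return {
        "connector": CONNECTORS[argmax(connector_scores)],
        "select": select,
        "where_columns": where_columns,
    }
```

For instance, with headers `["film", "week", "share"]`, a D2 score vector for "share" that peaks at index 5 yields the pair `("share", "SUM")`, while headers whose D2 scores peak at index 6 are left out of the SELECT list.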

Condition filling step S3: the standard text and the WHERE column are input into a preset second BERT model and a second DGCNN model to obtain semantic information, the standard text is extracted from based on this semantic information, and the WHERE content corresponding to the WHERE column is filled in. In this embodiment, the condition filling step S3 can be divided into two sub-steps:

Step S3-1: text content is extracted from the standard text based on the WHERE column.

In step S3-1 of this embodiment, the inputs are the normalized question and the WHERE column predicted in the column filling step S2, and the corresponding WHERE content is extracted from the normalized question based on that WHERE column. The task of this step is therefore essentially a reading-comprehension substring extraction problem: the start position and the end position of the content corresponding to the WHERE column are predicted in the normalized question, so condition extraction can be modeled as a sequence labeling problem over the question.

Specifically, the input normalized question is concatenated with the WHERE column (the column name), and semantic information is extracted from this input by a second BERT model and a second DGCNN model that have the same structure as those in the column filling step S2. Because the tasks differ, the BERT + DGCNN model in the condition filling step S3 and the BERT + DGCNN model in the column filling step S2 do not share parameters. The output of the second DGCNN model is then used as the input of a fully-connected layer, which predicts the probability of each token being the start position and the end position of the content. A fixed value is set as the extraction threshold so that as much as possible of the text content corresponding to the WHERE column is extracted from the normalized question, which ensures coverage.
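The thresholded start/end prediction can be sketched as follows. This is only one possible decoding rule, not necessarily the one used in the embodiment: each token whose start probability reaches the threshold is paired with the nearest following token whose end probability also reaches it. The function name `extract_spans`, the default threshold of 0.5, and the `max_len` cap are assumptions made for illustration.

```python
from typing import List, Tuple


def extract_spans(
    tokens: List[str],
    start_probs: List[float],
    end_probs: List[float],
    threshold: float = 0.5,   # assumed value; the text only says "a threshold"
    max_len: int = 20,        # assumed cap on span length, in tokens
) -> List[Tuple[int, int, str]]:
    """Return (start, end, text) for every span passing the threshold."""
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue
        # Pair this start with the nearest end at or after it.
        for j in range(i, min(i + max_len, len(tokens))):
            if end_probs[j] >= threshold:
                spans.append((i, j, "".join(tokens[i:j + 1])))
                break
    return spans
```

Lowering `threshold` admits more candidate spans, which matches the stated goal of favoring coverage; the later matching against the data table can then filter out poor candidates.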

In this embodiment, the extracted text content serves two purposes. First, its similarity to the data content of the data table is calculated in step S3-2, and the content with the highest similarity is selected as the filled-in condition. Second, it is used to obtain the operator corresponding to the WHERE column from the cop (the operator predictions) generated in the column filling step S2.

Step S3-2: the text content is fuzzy-matched against each data content in the data table in turn, and the data content with the highest similarity is taken as the WHERE content.

Although the question has been normalized and corrected in the preprocessing step S1, a certain proportion of cases still slip through the net; that is, the content in the question does not necessarily match the way that content is expressed in the database. The text content extracted from the normalized question therefore still needs some post-processing to find similar content in the database as the final extracted content. According to what the user queries in the question, the extracted substrings (i.e., the contents of the SQL condition columns) can be roughly divided into two types, numeric values and character strings:

1) for a numeric substring, the value must be modified so that its unit is consistent with the unit used by the corresponding column in the data table. For example, if the question mentions "5 ten-thousand square kilometres" (5万平方公里) and the unit of the corresponding column in the data table is "square kilometres", the extracted numeric substring "5" needs to be modified to "50000";

2) for a string-type substring, the diversity of Chinese natural language gives rise to two cases that need to be handled: the substring is an incomplete abbreviation of the corresponding element in the data table (for example, the full name "Shanghai Jiao Tong University" versus a shortened form of that name), or the substring is a synonym of the corresponding element. By combining features such as edit distance and longest common substring length for fuzzy matching at the string level, and by using an entity reference resolution technique for entity disambiguation at the semantic level, string-type substrings can be matched well to the database.

Through the above processing, the data content with the highest similarity to the text content is obtained from the data table and used as the WHERE content, which ensures that the WHERE content corresponds to data content that actually exists in the data table.
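The two kinds of post-processing can be sketched as below. For numeric substrings, the sketch assumes the question carried a Chinese magnitude word such as 万 (ten thousand) that the table column omits. For string substrings, it combines edit distance with longest common substring length using equal weights. The names `scale_number` and `best_match`, the multiplier table, and the 0.5/0.5 weighting are assumptions, and the semantic-level entity disambiguation mentioned in the text is not covered.

```python
from difflib import SequenceMatcher
from typing import List

# Assumed magnitude words; only what this sketch needs.
UNIT_MULTIPLIERS = {"万": 10_000, "亿": 100_000_000}


def scale_number(value: str, magnitude: str) -> str:
    """Rewrite a number so that it matches a column without the magnitude word."""
    return str(int(float(value) * UNIT_MULTIPLIERS.get(magnitude, 1)))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous block shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def similarity(text: str, candidate: str) -> float:
    """Equal-weight mix of the two string-level features (assumed weighting)."""
    longest = max(len(text), len(candidate)) or 1
    edit_score = 1 - edit_distance(text, candidate) / longest
    lcs_score = longest_common_substring(text, candidate) / (len(text) or 1)
    return 0.5 * edit_score + 0.5 * lcs_score


def best_match(text: str, cell_values: List[str]) -> str:
    """Pick the data table value most similar to the extracted text."""
    return max(cell_values, key=lambda v: similarity(text, v))
```

With this scoring, an abbreviation that preserves the leading characters of the full name scores higher against that name than against unrelated values in the same column, which is the behavior the string-level matching relies on.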

Assembly output step S4: the connector, the SELECT columns and their aggregation functions, the WHERE columns and their WHERE operators, and the WHERE content are assembled into a query statement, which is then output.

In the assembly output step S4 of this embodiment, the WHERE column acquired in the column filling step S2 and the corresponding WHERE content acquired in the condition filling step S3 are used to pick out the operator for that WHERE column from the operators predicted for the individual tokens. The resulting WHERE column, operator, and content together complete the WHERE condition, after which the SQL statement (i.e., the query statement) is assembled and output.
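A minimal sketch of this assembly is given below. It assumes the operator for a condition is taken as the most frequent per-token operator label inside the extracted span, which is one plausible reading of the lookup rather than a rule stated in the text. The function names, the table name in the usage note, and the choice to quote every value as a string are illustrative assumptions.

```python
from collections import Counter
from typing import List, Tuple


def operator_for_span(token_ops: List[str], start: int, end: int) -> str:
    """Most common per-token operator label inside the span [start, end]."""
    return Counter(token_ops[start:end + 1]).most_common(1)[0][0]


def quote(value: str) -> str:
    """Quote a value as an SQL string literal, doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def assemble_sql(
    table: str,
    select: List[Tuple[str, str]],           # (column, aggregation or "")
    conditions: List[Tuple[str, str, str]],  # (WHERE column, operator, content)
    connector: str,                          # "AND", "OR" or "NULL"
) -> str:
    """Assemble the predicted pieces into one query statement."""
    cols = ", ".join(f"{agg}({col})" if agg else col for col, agg in select)
    sql = f"SELECT {cols} FROM {table}"
    if conditions:
        if len(conditions) > 1 and connector not in ("AND", "OR"):
            raise ValueError("multiple conditions need an AND/OR connector")
        parts = [f"{col} {op} {quote(val)}" for col, op, val in conditions]
        sql += " WHERE " + f" {connector} ".join(parts)
    return sql
```

As a usage example, calling `assemble_sql` with a hypothetical table `films`, the SELECT pair `("share", "SUM")`, two equality conditions on `film`, and the connector "OR" produces a statement of the form `SELECT SUM(share) FROM films WHERE film = '...' OR film = '...'`. A production system would more likely pass values as bound parameters than build the string directly.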

In this embodiment, the query statement finally obtained for the question "What is the combined box-office share of the two films Bumblebee and Escape Room in the fourth week of '19?" is shown in Figs. 3 and 5.

In this embodiment, the assembly output step S4 may output the query statement to the display screen of a computer, so that the user can confirm the converted SQL query statement or perform other operations on it, such as running it. Alternatively, the query statement may be output directly to an SQL database and run there, so that the corresponding query result is obtained directly.

In this embodiment, for greater convenience in practical use, steps S1 to S4 of the method for converting Chinese natural language into database language may also be packaged in advance into corresponding program modules, namely a preprocessing module, a column filling module, a condition filling module, and an assembly output module. Together these form a device for converting Chinese natural language into database language, which applies steps S1 to S4 to the natural language text input by the user and outputs the converted query statement.

Functions and effects of the embodiment

According to the method and the device for converting Chinese natural language into database language of this embodiment, the natural language text is processed into standard text through normalization correction, so that the form of the questions is unified and text features can be conveniently mined and modeled in subsequent processing. Further, the conversion of the standard text into an SQL query statement is carried out in stages, by a column filling step that handles classification tasks and a condition filling step that handles a reading-comprehension task, and two sets of BERT and DGCNN models that do not share parameters are used for feature extraction. In the column filling step, the headers of the data table and the standard text are therefore analyzed semantically together, and the connector, the SELECT columns and their aggregation functions, and the WHERE columns and their WHERE operators in the SQL query statement can be predicted more accurately by combining the two representations. In the condition filling step, the corresponding WHERE content can be accurately extracted from the standard text based on the predicted WHERE column, which strengthens the representation of Chinese natural language. The conversion method and device are thus better adapted to Chinese text, represent synonymous entities well, and extract more accurate content, which safeguards the accuracy of the generated SQL query statement.

In addition, in this embodiment, when the text content corresponding to the WHERE column is extracted from the standard text, its similarity to the data content in the data table is also calculated, and the data content with the highest similarity is selected as the WHERE content. This has the effect of introducing external knowledge: the substring obtained from the question is mapped to a specific element of the corresponding data table, so the final SQL statement can be executed in the database. In other words, the accuracy of the WHERE condition values in the converted SQL statement is further improved, and errors caused by string-level differences between synonymous entities are reduced. By contrast, most previous SQL generation methods do not consider synonymy between the question and the entities in the data table, and thus neglect the semantic relationship between them.

In this embodiment, the normalization correction fixes irregularly expressed natural language text in four respects: numerals, years and dates, numeric units, and synonymous expressions. Natural language text that is highly colloquial can therefore be described uniformly and normalized. Moreover, when synonymous expressions are corrected, entity disambiguation and reference resolution are performed using an entity disambiguation technique, which reduces the semantic differences between synonyms and further improves the accuracy of the converted SQL statement.

The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
