Natural language understanding method based on pre-training model

Document No.: 1614259    Publication date: 2020-01-10

Note: This technique, "一种基于预训练模型的自然语言理解方法" (Natural language understanding method based on a pre-training model), was designed and created by 王春辉 and 胡勇 on 2019-09-24. Abstract: The invention discloses a natural language understanding method based on a pre-training model. The method comprises the following steps: establishing a pre-training model based on a bidirectional deep Transformer; performing word segmentation on the sentence to be understood, and adding special labels at the beginning and the end of the sentence respectively to obtain a text vector of the sentence; taking the text vector of the sentence to be understood as input and calling the pre-training model to obtain the text semantic vector of the sentence; performing intent recognition; and performing entity recognition. The invention can accurately and comprehensively understand intents and identify entities, providing a solid foundation for the subsequent dialogue and significantly improving the quality and user experience of a human-machine dialogue system.

1. A natural language understanding method based on a pre-training model is characterized by comprising the following steps:

step 1, establishing a pre-training model based on a bidirectional deep Transformer, wherein the input of the pre-training model is the text vector obtained after a sentence is segmented into words and special labels are added at the beginning and the end of the sentence respectively, and the output is the text semantic vector of the sentence;

step 2, performing word segmentation processing on the sentence to be understood, and adding the special labels at the beginning and the end of the sentence to be understood respectively to obtain a text vector of the sentence to be understood;

step 3, taking the text vector of the sentence to be understood as input, and calling the pre-training model to obtain the text semantic vector of the sentence to be understood;

step 4, inputting the text semantic vector of the sentence to be understood into a first multilayer perceptron to extract semantic features related to intentions, and calculating the probability of the sentence to be understood for each preset intention category by a softmax layer, wherein the intention category with the highest probability is the intention category of the sentence to be understood;

and step 5, inputting the text semantic vector of the sentence to be understood into a second multilayer perceptron to extract semantic features related to entities, sending the extracted features into a bidirectional long short-term memory network to fuse forward and backward semantics, and finally using a conditional random field to calculate the probability P of each word of the sentence to be understood taking each preset entity category, the entity category that maximizes P being the entity category of that word, so as to obtain the identified entities.

2. The pre-trained model based natural language understanding method of claim 1, wherein the pre-trained model is obtained by pre-training on the whole Chinese Wikipedia corpus.

3. The pre-trained model based natural language understanding method of claim 2, wherein the pre-trained model is formed by stacking 12 layers of Transformer structures, each layer is composed of a self-attention network and a feed-forward network connected through residual connections and layer normalization, and each layer outputs 768-dimensional vectors.

4. The pre-trained model based natural language understanding method of claim 3, wherein said step 3 extracts the outputs of the last 4 layers of the pre-trained model and concatenates them to obtain a representation of each word with 768 × 4 = 3072 dimensions.

5. The pre-trained model based natural language understanding method of claim 4, wherein the special labels added at the beginning and end of the sentence are [CLS] and [SEP], respectively.

6. The pre-training model-based natural language understanding method of claim 5, wherein if a word in the sentence to be understood is not in the preset vocabulary used for pre-training, a special label [UNK] is added before the word.

7. The pre-trained model based natural language understanding method according to claim 6, wherein the probability P in step 5 is calculated as follows:

$$s(y)=\sum_{i=1}^{n-1}A(y_{i+1}\mid y_i)+\sum_{i=1}^{n}p_i(y_i)$$

$$P=\frac{\exp\big(s(y)\big)}{\sum_{y'}\exp\big(s(y')\big)}$$

in the formula, A(y_{i+1}|y_i) is the transition probability that the (i+1)-th word of the sentence to be understood after word segmentation has entity category y_{i+1} given that the i-th word has entity category y_i, p_i(y_i) is the probability that the i-th word has entity category y_i, y = (y_1, …, y_n) denotes a candidate sequence of entity categories with y' ranging over all such sequences, and n is the number of words of the sentence to be understood after word segmentation.

Technical Field

The invention belongs to the technical field of natural language understanding, and particularly relates to a natural language understanding method based on a pre-training model.

Background

In recent years, as the most convenient and natural way for humans to express their thoughts, natural language has gradually become the mainstream medium of human-computer interaction. Owing to the diversity and complexity of natural language, accurate machine understanding has long been a research hotspot and a difficulty in the field of artificial intelligence.

The first step of a human-machine dialogue is natural language understanding: only when the user's utterance is understood accurately and comprehensively can a reasonable answer be given. Natural language understanding specifically comprises two tasks, intent recognition and entity recognition, both of which can be achieved by building a pre-trained model. Traditional pre-trained models are generally based on a bidirectional long short-term memory network; for example, the deep contextualized semantic representation model proposed by Matthew E. Peters et al. in "Deep contextualized word representations" (CoRR, volume abs/1802.05365) is based on a bidirectional long short-term memory network. The problem is that stacking multiple such layers significantly increases the size of the model and the time required to train it. A multi-layer network structure is therefore difficult to construct under this framework, so the pre-trained model cannot capture deep semantic information.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a natural language understanding method based on a pre-training model. By establishing a pre-trained language model on large-scale corpora, the method can accurately recognize intents and comprehensively identify entities even in scenarios with only a small number of training samples.

In order to achieve the purpose, the invention adopts the following technical scheme:

a natural language understanding method based on a pre-training model comprises the following steps:

step 1, establishing a pre-training model based on a bidirectional deep Transformer, wherein the input of the pre-training model is the text vector obtained after a sentence is segmented into words and special labels are added at the beginning and the end of the sentence respectively, and the output is the text semantic vector of the sentence;

step 2, performing word segmentation processing on the sentence to be understood, and adding the special labels at the beginning and the end of the sentence to be understood respectively to obtain a text vector of the sentence to be understood;

step 3, taking the text vector of the sentence to be understood as input, and calling the pre-training model to obtain the text semantic vector of the sentence to be understood;

step 4, inputting the text semantic vector of the sentence to be understood into a first multilayer perceptron to extract semantic features related to intentions, and calculating the probability of the sentence to be understood for each preset intention category by a softmax layer, wherein the intention category with the highest probability is the intention category of the sentence to be understood;

and step 5, inputting the text semantic vector of the sentence to be understood into a second multilayer perceptron to extract semantic features related to entities, sending the extracted features into a bidirectional Long Short-Term Memory (LSTM) network to fuse forward and backward semantics, and finally using a conditional random field to calculate the probability P of each word of the sentence to be understood taking each preset entity category, the entity category that maximizes P being the entity category of that word, so as to obtain the identified entities.

Compared with the prior art, the invention has the following beneficial effects:

according to the invention, through establishing the pre-training model based on the bidirectional depth Transformer, word segmentation processing is carried out on the sentence to be understood, the pre-training model is called to obtain the text semantic vector of the sentence to be understood, and then intention identification and entity identification are carried out, so that the intention can be accurately and comprehensively understood, the entity can be identified, and a solid foundation is provided for subsequent conversation. The quality and the user experience of the man-machine conversation system can be obviously improved.

Drawings

FIG. 1 is a flowchart of a natural language understanding method based on a pre-training model according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of each layer of the pre-training model;

FIG. 3 is a schematic flow chart of the application of a pre-trained model for intent recognition and entity recognition.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

An embodiment of the invention provides a natural language understanding method based on a pre-training model; its flow chart is shown in Fig. 1, and the method comprises the following steps:

s101, establishing a pre-training model based on a bidirectional depth Transformer, wherein the pre-training model is input into a text vector obtained after a sentence is subjected to word segmentation and special labels are respectively added at the beginning and the end of the sentence, and the text vector is output into a text semantic vector of the sentence;

s102, performing word segmentation on the sentence to be understood, and adding the special labels at the beginning and the end of the sentence to be understood to obtain a text vector of the sentence to be understood;

s103, calling the pre-training model by taking the text vector of the sentence to be understood as input to obtain a text semantic vector of the sentence to be understood;

s104, inputting the text semantic vector of the sentence to be understood into a first multilayer perceptron to extract semantic features related to intentions, and calculating the probability of the sentence to be understood for each preset intention category by a softmax layer, wherein the intention category with the highest probability is the intention category of the sentence to be understood;

s105, inputting the text semantic vector of the sentence to be understood into a second multilayer perceptron to extract semantic features related to the entity, then sending the semantic features into a two-way long-short term memory network to fuse forward semantics and backward semantics, and finally calculating the probability P of each word in the sentence to be understood when each preset entity type is selected by a conditional random field, wherein each entity type when the probability P is maximum is the entity type of each word, so as to obtain the identified entity type.

In this embodiment, step S101 is mainly used to construct the pre-training model. The pre-training model of this embodiment is based on a bidirectional deep Transformer structure. Its input is the text vector of a sentence after word segmentation, obtained by adding special tags (such as [CLS] and [SEP]) at the beginning and the end of the sentence respectively, with each word or Chinese character represented by its ID; see Table 1 below. The output of the pre-trained model is a multi-dimensional vector representing the semantics of the input sentence. Because the pre-training model adopts a bidirectional deep Transformer structure, each word in the sentence can interact with every other word in the sentence, so the special tag added at the beginning of the sentence can be regarded as a representation of the whole input sentence.

In this embodiment, step S102 is mainly used to perform word segmentation on the sentence to be understood and to add the special tags described in step S101 at the beginning and the end of the sentence. For English, segmentation is at the word level, i.e., splitting on spaces; for Chinese, segmentation is at the character level, i.e., splitting into single Chinese characters. For example, the segmentation result of "我在北京工作" ("I work in Beijing") is: "我", "在", "北", "京", "工", "作".
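As a concrete illustration of this step, the following Python sketch performs the character-level segmentation, adds the special tags, and maps every token to an ID. The vocabulary and the ID values are hypothetical stand-ins for the pre-training vocabulary; out-of-vocabulary characters are replaced with [UNK] here (the common convention), whereas the claims describe inserting [UNK] before the word, which would be a one-line change.

```python
# Minimal sketch of step S102, assuming a hypothetical vocabulary and token IDs.
vocab = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100,
         "我": 2769, "在": 1762, "北": 1266, "京": 776, "工": 2339, "作": 868}

def segment(sentence):
    """Character-level segmentation for Chinese (English would be split on spaces)."""
    return list(sentence)

def build_text_vector(sentence):
    """Add [CLS]/[SEP] around the segmented sentence, mark out-of-vocabulary
    characters with [UNK], and map every token to its ID."""
    tokens = ["[CLS]"]
    for ch in segment(sentence):
        tokens.append(ch if ch in vocab else "[UNK]")
    tokens.append("[SEP]")
    return [vocab[t] for t in tokens]

print(build_text_vector("我在北京工作"))
# -> [101, 2769, 1762, 1266, 776, 2339, 868, 102]
```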

In this embodiment, step S103 is mainly used to obtain the text semantic vector of the sentence to be understood. Specifically, the text vector of the sentence to be understood obtained in step S102 is taken as input and the pre-training model is called; the output of the pre-training model is the text semantic vector of the sentence to be understood.

In this embodiment, step S104 is mainly used for performing intent recognition on the sentence to be understood. The task of intent recognition is to recognize which preset scenario the input sentence belongs to, and is essentially a text classification task. Intent recognition is implemented by an intent recognition network comprising a first multilayer perceptron and a softmax layer, see Fig. 3. First, the text semantic vector X of the sentence to be understood is input into the first multilayer perceptron to extract semantic features related to intent. The first multilayer perceptron may employ the following activation function:

relu(x)=max(0,x)

Then, the output of the first multilayer perceptron is sent into the softmax layer, and the probability p of the sentence to be understood for each preset intent category is calculated. The calculation formula is as follows:

p=softmax(relu(WX+b))

$$\mathrm{softmax}(z_j)=\frac{e^{z_j}}{\sum_{k}e^{z_k}}$$

where W is the weight matrix of the first multilayer perceptron and b is the bias (threshold) term.

The preset intent categories are those defined in advance for a specific application scenario. For example, a task-oriented dialogue scenario may preset intent categories such as sending a notification, casual chat, and in-school question answering. After the probability of the sentence to be understood for each preset intent category is calculated, the intent category with the highest probability is taken as the intent category of the sentence to be understood.
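To make the intent head concrete, here is a minimal PyTorch sketch that follows the formula p = softmax(relu(WX + b)) literally with a single linear layer; a deeper multilayer perceptron would simply stack further Linear/ReLU layers. The input dimension and the number of intents are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class IntentHead(nn.Module):
    """Sketch of the intent recognition network: one perceptron layer (the W and b
    of the formula above) with a relu activation, followed by softmax over the
    preset intent categories."""
    def __init__(self, sem_dim: int = 3072, num_intents: int = 3):
        super().__init__()
        self.fc = nn.Linear(sem_dim, num_intents)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sem_dim) text semantic vector of the sentence, e.g. the [CLS] token
        return torch.softmax(torch.relu(self.fc(x)), dim=-1)  # p = softmax(relu(WX + b))

head = IntentHead()
p = head(torch.randn(1, 3072))      # probabilities over the preset intents
predicted = int(p.argmax(dim=-1))   # index of the highest-probability intent
```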

In this embodiment, step S105 is mainly used to perform entity recognition on the sentence to be understood. The task of entity recognition is to recognize the entity names in the input sentence. For example, for the input sentence "I work in Beijing", with entity types such as person name and place name defined before recognition, the recognized "Beijing" is an entity of the place-name type. Entity recognition is therefore essentially a sequence labeling task. Entity recognition is realized by an entity recognition network comprising a second multilayer perceptron, a bidirectional long short-term memory network, and a conditional random field, see Fig. 3. First, the text semantic vector of the sentence to be understood is input into the second multilayer perceptron to extract semantic features related to entities. Then, the output of the second multilayer perceptron is sent into the bidirectional long short-term memory network to fuse the forward and backward semantics. Feeding the word representations X of the sentence into the forward LSTM network (which reads "I work in Beijing" from left to right) can be expressed as

$$\overrightarrow{h}=\overrightarrow{\mathrm{LSTM}}(X)$$

feeding them into the backward LSTM network (which reads the same sentence from right to left) can be expressed as

$$\overleftarrow{h}=\overleftarrow{\mathrm{LSTM}}(X)$$

and fusing the forward and backward semantics can be expressed as the concatenation

$$h=[\overrightarrow{h};\overleftarrow{h}]$$

Finally, the output of the bidirectional long short-term memory network is sent to the conditional random field, which calculates the probability of each word of the sentence to be understood taking each preset entity category; the entity category with the maximum probability is the entity category corresponding to that word. The finally identified entity categories for the worked example below can be seen in Table 2.
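A minimal PyTorch sketch of this pipeline is given below, stopping at the emission scores that the conditional random field consumes (CRF decoding itself follows the probability formula discussed later). The hidden size and the number of entity labels are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EntityHead(nn.Module):
    """Sketch of the entity recognition network: a perceptron layer extracts
    entity-related features, a bidirectional LSTM fuses forward and backward
    semantics, and a linear layer turns the fused states into per-word emission
    scores p_i(y_i) for the conditional random field."""
    def __init__(self, sem_dim: int = 3072, hidden: int = 256, num_labels: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(sem_dim, hidden), nn.ReLU())
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
        self.emission = nn.Linear(hidden, num_labels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, sem_dim) per-word text semantic vectors
        feats = self.mlp(x)
        fused, _ = self.bilstm(feats)   # concatenation of forward and backward states
        return self.emission(fused)     # (batch, seq_len, num_labels) emission scores

emissions = EntityHead()(torch.randn(1, 19, 3072))  # e.g. 19 tokens incl. [CLS]/[SEP]
```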

As an alternative embodiment, the pre-training model is obtained by pre-training on the whole Chinese Wikipedia corpus.

This embodiment defines the training samples of the pre-trained model. The pre-training model of this embodiment is obtained by pre-training the language model on the whole Chinese Wikipedia corpus, which contains 1,043,224 Wikipedia entries with a total size of 1.6 GB. Pre-training the language model on the whole Chinese Wikipedia corpus yields a pre-trained model with rich semantics and good robustness.

As an alternative embodiment, the pre-training model is formed by stacking 12 layers of Transformer structures, each layer is composed of a self-attention network and a feed-forward network connected through residual connections and layer normalization, and each layer outputs 768-dimensional vectors.

This embodiment further defines the structure of the pre-training model. The pre-training model is a 12-layer Transformer structure; each layer, shown in Fig. 2, comprises a self-attention network and a feed-forward network connected through residual connections and layer normalization, and each layer outputs 768-dimensional vectors. The residual connections make it possible to build a deep network, so that deep semantic information can be captured effectively. The input of a conventional attention network includes three matrices, Q (query), K (key), and V (value). In the self-attention network, Q, K, and V all come from the same input, namely the sentence matrix X formed by the representations of the words in the sentence. The calculation formulas of the attention network and the self-attention network are, respectively:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

$$\mathrm{SelfAttention}(X)=\mathrm{softmax}\!\left(\frac{XX^{\top}}{\sqrt{d}}\right)X$$

where d is the dimension of X, in this example d = 768. Through the operation $XX^{\top}$, each word in the sentence can interact with every other word in the entire sentence.
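The self-attention computation can be sketched in a few lines of Python; note that a full Transformer layer additionally applies learned linear projections to obtain Q, K and V, multi-head splitting, the residual connection and layer normalization, all of which are omitted here to mirror the simplified description above where Q = K = V = X.

```python
import math
import torch

def self_attention(X: torch.Tensor) -> torch.Tensor:
    """softmax(X X^T / sqrt(d)) X with Q = K = V = X, as in the text above."""
    d = X.size(-1)
    scores = X @ X.transpose(-2, -1) / math.sqrt(d)  # (seq_len, seq_len) word-word interactions
    return torch.softmax(scores, dim=-1) @ X

out = self_attention(torch.randn(19, 768))  # every word attends to every other word
```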

As an alternative, said step 3 extracts the outputs of the last 4 layers of the pre-trained model and concatenates them to obtain a representation of each word with 768 × 4 = 3072 dimensions.

In this embodiment, in order to obtain richer semantic representation information, the outputs of the last 4 layers of the pre-trained model are concatenated, i.e., four 768-dimensional vectors are connected end to end, giving a 768 × 4 = 3072-dimensional representation of each word.
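As a sketch of how this extraction and concatenation might look in practice, assuming the HuggingFace transformers library and the bert-base-chinese checkpoint as a stand-in for the pre-trained model (neither of which is named in the patent):

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed stand-in for the patent's pre-trained model: a 12-layer, 768-dimensional
# Chinese BERT checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

inputs = tokenizer("我在北京工作", return_tensors="pt")  # adds [CLS] and [SEP] automatically
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states contains the embedding output plus one tensor per Transformer layer;
# concatenating the last four gives a 768 * 4 = 3072-dimensional vector per token.
semantic = torch.cat(outputs.hidden_states[-4:], dim=-1)  # (1, seq_len, 3072)
```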

As an alternative embodiment, the special tags added at the beginning and end of a sentence are [CLS] and [SEP], respectively.

This embodiment gives the specific special tags, namely, a special tag [CLS] added at the beginning of the sentence and a special tag [SEP] added at the end of the sentence. This embodiment is merely a preferred embodiment and does not exclude or limit other possible special tags.

As an alternative embodiment, if a word in the sentence to be understood is not in the preset vocabulary used for pre-training, a special tag [UNK] is added before the word.

This embodiment gives the labeling method for a word of the sentence to be understood that is not in the preset vocabulary, i.e., the special tag [UNK] is added before the word. The preset vocabulary is the vocabulary of the training sample set used when the pre-training model is pre-trained. Again, this embodiment is merely a preferred embodiment and does not exclude or limit possible special tags other than [UNK].

As an alternative embodiment, the calculation formula of the probability P in step S105 is as follows:

$$s(y)=\sum_{i=1}^{n-1}A(y_{i+1}\mid y_i)+\sum_{i=1}^{n}p_i(y_i)$$

$$P=\frac{\exp\big(s(y)\big)}{\sum_{y'}\exp\big(s(y')\big)}$$

in the formula, A(y_{i+1}|y_i) is the transition probability that the (i+1)-th word of the sentence to be understood after word segmentation has entity category y_{i+1} given that the i-th word has entity category y_i, p_i(y_i) is the probability that the i-th word has entity category y_i, y = (y_1, …, y_n) denotes a candidate sequence of entity categories with y' ranging over all such sequences, and n is the number of words of the sentence to be understood after word segmentation.
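The following Python sketch computes P exactly as written above, omitting any start/end boundary terms and using brute-force enumeration over label sequences for the normalization (real implementations use the forward algorithm); the transition and emission scores are made-up toy values.

```python
import itertools
import math

def sequence_score(A, p, y):
    """s(y) = sum_i A[y_i][y_{i+1}] + sum_i p[i][y_i] for one label sequence y."""
    transition = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    emission = sum(p[i][y[i]] for i in range(len(y)))
    return transition + emission

def crf_probability(A, p, y):
    """P of the sequence y: exp(s(y)) normalised over every possible label sequence."""
    n, k = len(p), len(A)
    z = sum(math.exp(sequence_score(A, p, cand))
            for cand in itertools.product(range(k), repeat=n))
    return math.exp(sequence_score(A, p, y)) / z

# Toy example: 2 entity categories, 3 words; the scores are purely illustrative.
A = [[0.5, 0.1],
     [0.2, 0.8]]                          # A[y_i][y_{i+1}]: transition scores
p = [[1.0, 0.1], [0.2, 1.5], [0.3, 1.2]]  # p[i][y]: emission score of word i for label y
print(crf_probability(A, p, [0, 1, 1]))
```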

In order to better understand the technical scheme of the invention, an example of performing intention recognition and entity recognition on a sentence to be understood by applying the established pre-training model is given below.

A task-oriented dialogue scenario presets 3 intents: sending a notification, casual chat, and in-school question answering; and presets 4 entity types: contact, event, time, and question.

The sentence to be understood is: "给全班家长发个通知，周六要召开运动会" (send a notification to all the parents in the class that a sports meeting will be held on Saturday).

First, word segmentation is performed on the sentence to be understood. If a character is in the preset vocabulary, the segmentation result is the character itself; if a character is not in the preset vocabulary, it is marked with [UNK]. Since every character in this sentence is in the vocabulary, the segmentation result is the characters themselves. Finally, the special tag [CLS] is added at the beginning of the sentence and the special tag [SEP] at the end. The final segmentation result is shown in Table 1.

TABLE 1 results of word segmentation

[CLS] 给 全 班 家 长 发 个 通 知 周 六 要 召 开 运 动 会 [SEP]

Second, the pre-training model is called to extract semantic features. The segmentation result of the previous step is taken as input and the pre-training model is called to extract semantic features. The pre-trained model is a 12-layer Transformer network, through which each word in the sentence interacts with the other words in the sentence; the outputs of the last 4 layers are extracted and concatenated to obtain a 3072-dimensional representation of each word.

Third, intent recognition is performed. The result of the previous step is fed into the intent recognition network, as in Fig. 3. The features are first re-extracted by the first multilayer perceptron and then fed into the softmax layer to obtain the probability distribution over the intents. The probability of "sending a notification" is the highest, so the recognized intent is "sending a notification".

Fourth, entity recognition is performed. The result of the second step is fed into the entity recognition network, as in Fig. 3. The features are first re-extracted by the second multilayer perceptron, then fed into the bidirectional long short-term memory network to extract sequence features, and finally fed into the conditional random field, which calculates the probability of each word taking each preset entity category; the entity category with the maximum probability is the entity category corresponding to each word. The results are shown in Table 2. The finally identified entities are therefore: contact: 全班 (the whole class); time: 周六 (Saturday); event: 运动会 (sports meeting).

TABLE 2 entity identification results

Character: [CLS] 给 全 班 家 长 发 个 通 知 周 六 要 召 开 运 动 会 [SEP]
Category: Other Other Contact Contact Other Other Other Other Other Other Time Time Other Other Other Event Event Event Other

The above description covers only a few embodiments of the present invention and should not be taken as limiting its scope; all equivalent changes, modifications, and equivalent substitutions made in accordance with the spirit of the present invention shall be considered to fall within the scope of the present invention.
