Parse tree based vectorization for natural language processing

Document No.: 1185104    Publication date: 2020-09-22

Reading note: This technology, Parse tree based vectorization for natural language processing, was designed and created by M·斯里瓦萨, R·K·甘蒂, 林沇燮, S·斯里兰甘斯里达兰, and A·帕利特 on 2020-01-09. Its main content is as follows: Embodiments of the present disclosure relate to parse tree based vectorization for natural language processing. A parse tree is constructed that corresponds to a portion of narrative text. The parse tree includes a data structure that represents the semantic structure of the portion of narrative text as a set of tags according to a grammar. A context window is generated using a tag in the parse tree as a focus word, the context window including a set of words within a specified distance from the focus word, the distance determined according to the number of parse-tree links separating the focus word and a context word in the set of words. Weights are generated for the focus word and the context words. Using the weights, a first vector representation of a first word is generated, the first word being in a second portion of the narrative text.

1. A method, comprising:

constructing a parse tree corresponding to a portion of narrative text, wherein the parse tree includes a data structure that represents a semantic structure of the portion of narrative text as a set of tags according to a grammar;

generating a context window using a tag in the parse tree as a focus word, the context window including a set of words within a specified distance from the focus word, the distance determined according to a number of links of the parse tree separating the focus word and a context word in the set of words;

generating weights for the focus word and the context words; and

generating a first vector representation of a first word in a second portion of the narrative text using the weights.

2. The method of claim 1, further comprising:

training an encoding network to convert words into vector representations according to semantic similarity by applying a training data set comprising the focus word, the context words, and the weights to the encoding network.

3. The method of claim 1, wherein the weights are constant.

4. The method of claim 1, wherein the weights correspond to the number of links of the parse tree separating the focus word and the context word.

5. The method of claim 4, further comprising determining a weight by:

subtracting the number of links from the specified distance; and

dividing the result of the subtraction by the specified distance.

6. The method of claim 4, wherein the weights decrease exponentially with the number of links.

7. The method of claim 1, further comprising:

generating a second vector representation of a second word using the weights, the second word being in a third portion of the narrative text; and

calculating a measure of distance between the first word and the second word using the first vector representation and the second vector representation.

8. The method of claim 7, wherein the measure of distance comprises cosine similarity.

9. A computer readable storage medium having stored thereon program instructions executable by one or more processing devices to perform the method of any one of claims 1 to 8.

10. The computer readable storage medium of claim 9, wherein the program instructions are stored in a computer readable storage device in a data processing system, and wherein the program instructions are transmitted over a network from a remote data processing system.

11. The computer readable storage medium of claim 9, wherein the program instructions are stored in a computer readable storage device in a server data processing system, and wherein the program instructions are downloaded over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system.

12. A computer system comprising one or more processors, one or more computer-readable memories, and one or more computer-readable storage devices, and program instructions stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories, the stored program instructions comprising:

program instructions to construct a parse tree corresponding to a portion of narrative text, wherein the parse tree includes a data structure representing a semantic structure of the portion of narrative text as a set of tags according to a grammar;

program instructions to generate a context window using a tag in the parse tree as a focus word, the context window including a set of words within a specified distance from the focus word, the distance determined according to a number of links of the parse tree separating the focus word and a context word in the set of words;

program instructions to generate weights for the focus word and the context words; and

program instructions to generate a first vector representation of a first word using the weights, the first word being in a second portion of the narrative text.

13. The computer system of claim 12, further comprising:

program instructions to train an encoding network to convert words into vector representations according to semantic similarity by applying a training data set comprising the focus word, the context words, and the weights to the encoding network.

14. A computer system comprising modules configured to perform the steps of the method according to any one of claims 1 to 8.

Technical Field

The present invention relates generally to a method, system, and computer program product for vectorization in natural language processing. More particularly, the present invention relates to a method, system, and computer program product for parse tree based vectorization for natural language processing.

Background

Vectorization, or embedding, is the general term for a set of language modeling and feature learning techniques in Natural Language Processing (NLP) in which units of narrative text are mapped to corresponding vectors of real numbers. A unit of the narrative text may be part of a word (e.g., the root of a word, or a plural noun with the trailing 's' removed), a word, a phrase, a sentence, a set of sentences, or another element of the narrative text, depending on the grammar of the language of the narrative text. Unless explicitly disambiguated, the term "word" as used herein refers to the smallest unit of narrative text in a given language for which a model determines a vector. The term "word" may include more than one natural-language word or only a portion of a natural-language word. Similarly, unless explicitly disambiguated, the term "sentence" as used herein refers to a portion of narrative text that includes one or more words according to the grammar of the language of the narrative text. As used herein, the term "sentence" may comprise only a portion of a natural-language sentence or more than one natural-language sentence.

Also, while the examples herein are taken from, and reference grammatical features of, English, the techniques disclosed herein are equally applicable to languages other than English. A vector is a representation of a unit of narrative text: an array of real numbers, typically, though not necessarily, between zero and one. The array usually has many dimensions, for example 300. The exact range and dimensionality are not important, however, as long as they are kept consistent while a particular model is trained and used.

For use with NLP, vectors are assigned such that words sharing a common context have corresponding vectors that are close together, as measured by a distance metric. One such distance metric is cosine similarity. Cosine similarity is a measure of similarity between two non-zero vectors determined by the cosine of the angle between the vectors. As a result, relationships such as analogies can be solved simply by using vector operations on the corresponding vectors. For example, the vector calculation vec("madrid") - vec("spain") + vec("france") results in a vector that is closer to vec("paris") than to any other word vector. Also, because the vector representation is numeric rather than a text string, vectors can be conveniently fed into other neural networks to perform other NLP tasks. Vectorization, for example, facilitates sentiment analysis, natural language understanding, and classification.
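As a minimal illustration of this arithmetic (a sketch only; the `vectors` lookup table of pretrained word vectors is an assumption, not part of this disclosure), the following computes cosine similarity and solves the analogy by vector arithmetic:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(vectors: dict, a: str, b: str, c: str) -> str:
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c).

    `vectors` is a hypothetical word -> numpy array lookup produced by a
    trained vectorization model.
    """
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))

# With suitably trained vectors, this is expected to return "paris":
# solve_analogy(vectors, "madrid", "spain", "france")
```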

A set of related neural network models for assigning vectors is called word2vec. These models are typically two-layer neural networks. They take as input a large corpus of narrative text and, once trained, assign each unique word in the corpus a corresponding vector, such that words in the corpus that share a common context have vectors that are correspondingly close to each other.
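For reference, a word2vec-style model can be trained with an off-the-shelf library such as gensim. This is a sketch under assumptions: the toy corpus stands in for a large corpus of narrative text, and the parameter values are illustrative only.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for a large corpus of narrative text.
corpus = [["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]]

# Skip-gram word2vec: 300-dimensional vectors, linear context window of size 3.
model = Word2Vec(sentences=corpus, vector_size=300, window=3, min_count=1, sg=1)

fox_vector = model.wv["fox"]  # 300-dimensional numpy array for the word "fox"
```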

Doc2vec, an extension of the word2vec concept, learns and produces document-specific vector representations of words. In other words, doc2vec extends the word2vec model by learning a vector representation of words based on a particular document in addition to a general corpus of text. This is helpful when dealing with specialized narrative text in which some words have document-specific or domain-specific meanings. Other models, also extensions of the word2vec concept, learn and generate vector representations of words, phrases, sentences, and other units of narrative text.

A model that assigns vectors must be trained before use, using words in the narrative text that are contexts for a focus word. The model typically uses word pairs from a training corpus of narrative text as training data. Half of each pair is called the focus word. The other half of each pair is selected from the linear context window associated with the focus word. The linear context window includes words, within sentences of the narrative text, that are within a specified linear distance from a focus word in the document containing the narrative text. The specified distance is referred to as the size of the context window. Training on data assembled in this manner trains the model using words that are contexts for the focus word within the narrative text.

For example, consider the sentence "The quick brown fox jumped over the lazy dog" and assume the focus word is "fox". If the size of the linear context window is equal to 2, the linear context window includes "quick", "brown", "jumped", and "over". As a result, the training data pairs may include (fox, quick), (fox, brown), (fox, jumped), and (fox, over). If the size of the linear context window is equal to 3, the linear context window also includes the first and second instances of "the", and the training data pairs may also include two instances of (fox, the).
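A short sketch of how such (focus, context) training pairs can be enumerated for a linear context window (illustrative only; the function name and the whitespace tokenizer are assumptions):

```python
def linear_context_pairs(tokens, window_size):
    """Enumerate (focus, context) pairs for every focus position in a sentence."""
    pairs = []
    for i, focus in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

tokens = "The quick brown fox jumped over the lazy dog".split()
# With window_size=2, the pairs whose focus word is "fox" are
# (fox, quick), (fox, brown), (fox, jumped), and (fox, over).
fox_pairs = [p for p in linear_context_pairs(tokens, window_size=2) if p[0] == "fox"]
```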

The context window may be constant or dynamic. When a constant context window is used, all training data pairs are weighted in the same way, regardless of distance from the focus word. Thus, continuing the example with a constant linear context window, (fox, quick) and (fox, brown) would have the same weight, even though "quick" is farther away from "fox" than "brown" is in the sentence.

In contrast, when a dynamic context window is used, the training data pairs are weighted differently depending on their distance from the focus word. Using a dynamic context window reflects the observation that, when evaluating a focus word within narrative text, words closer to the focus word tend to be more important than words farther from it.

One dynamic context windowing approach weights the training pairs in inverse relation to their distance from the focus word. For example, for a context window of size 3, training pairs that include a word adjacent to the focus word are weighted 3/3 or 1, training pairs that include the next word away from the focus word are weighted 2/3 or 0.67, and training pairs that include the word after that are weighted 1/3 or 0.33. Thus, continuing the example with a dynamic linear context window of size 3, (fox, brown) would have a weight of 1 and (fox, quick) a weight of 2/3, since "quick" is farther away from "fox" than "brown" is in the sentence.

Another dynamic context windowing approach weights the training pairs according to distance from the focus word, where the weight for a word at distance k from the focus word is equal to exp(-alpha * k), where alpha is a constant.
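The two dynamic weighting schemes described above can be written compactly as follows (a sketch; the function names and the default value of alpha are assumptions):

```python
import math

def linear_decay_weight(distance, window_size):
    """Weight that decreases with distance: 1 for a word adjacent to the focus
    word, falling by 1/window_size for each additional position away."""
    return (window_size - distance + 1) / window_size

def exponential_decay_weight(distance, alpha=0.5):
    """Weight exp(-alpha * k) for a context word at distance k from the focus word."""
    return math.exp(-alpha * distance)

# Window size 3: adjacent word -> 1.0, next word -> 0.67, next word -> 0.33.
weights = [round(linear_decay_weight(k, 3), 2) for k in (1, 2, 3)]
```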

A parse tree is an ordered, rooted tree that represents the semantic structure of a portion of narrative text according to a grammar. Each node of the tree is a tag. Each tag includes a marker and, optionally, a word of the narrative text. The tag of a non-terminal node of the tree describes the set of nodes below that node. The tag of a terminal node (i.e., a node with no nodes below it) describes the node itself. For example, a parse tree for a sentence may include a root node tagged as a sentence, nodes tagged as a noun phrase and a verb phrase directly below the root, and terminal nodes describing each word of the noun phrase and the verb phrase.
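A minimal sketch of such a parse tree, using the NLTK `Tree` class (the bracketed string is an illustrative constituency parse written by hand, not the output of any particular parser):

```python
from nltk import Tree

# Root tag S, with an NP and a VP directly below it, and tagged words at the leaves.
parse = Tree.fromstring(
    "(S (NP (DT The) (JJ quick) (JJ brown) (NN fox))"
    " (VP (VBD jumped) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))))"
)

parse.label()     # 'S'  -- tag of the root node, describing the whole sentence
parse[0].label()  # 'NP' -- tag of a non-terminal node, describing the nodes below it
parse.leaves()    # the words of the narrative text at the terminal nodes
```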

Disclosure of Invention

The illustrative embodiments provide a method, system, and computer program product. Embodiments include a method of constructing a parse tree corresponding to a portion of narrative text, wherein the parse tree includes a data structure representing a semantic structure of the portion of narrative text as a set of tags according to a grammar. Embodiments generate a context window using a tag in the parse tree as a focus word, the context window including a set of words within a specified distance from the focus word, the distance determined according to a number of links of the parse tree separating the focus word and a context word in the set of words. Embodiments generate weights for the focus word and the context words. Embodiments generate a first vector representation of a first word using the weights, the first word being within a second portion of the narrative text.
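A minimal sketch of the central idea under stated assumptions: starting from the focus word, a breadth-first walk over the links of the parse tree collects every word within a specified number of links, and each (focus, context) pair is weighted by (specified distance - links) / specified distance, matching the weighting recited in claim 5. The function name and the adjacency-list representation of the tree are illustrative, not part of this disclosure.

```python
from collections import deque

def tree_context_pairs(adjacency, focus, max_links):
    """Breadth-first walk over parse-tree links starting at the focus word.

    `adjacency` maps each node identifier to the identifiers of the nodes it is
    linked to (parent and children). Returns (focus, context, weight) triples
    for every node reachable within max_links links, with
    weight = (max_links - links) / max_links.
    """
    triples, seen, queue = [], {focus}, deque([(focus, 0)])
    while queue:
        node, links = queue.popleft()
        if links > 0:
            triples.append((focus, node, (max_links - links) / max_links))
        if links < max_links:
            for neighbor in adjacency.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, links + 1))
    return triples
```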

Drawings

Certain novel features which are considered as characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

FIG. 1 depicts a block diagram of a network of data processing systems in which illustrative embodiments may be implemented;

FIG. 2 depicts a block diagram of a data processing system in which illustrative embodiments may be implemented;

FIG. 3 depicts a block diagram of an example configuration for parse-tree based vectorization for natural language processing, in accordance with an illustrative embodiment;

FIG. 4 depicts an example of parse tree generation in conjunction with parse tree-based vectorization for natural language processing in accordance with an illustrative embodiment;

FIG. 5 depicts another example of parse tree generation in conjunction with parse tree-based vectorization for natural language processing in accordance with an illustrative embodiment;

FIG. 6 depicts an example of training data generation using parse tree based vectorization for natural language processing in accordance with an illustrative embodiment;

FIG. 7 depicts another example of training data generation using parse tree based vectorization for natural language processing in accordance with an illustrative embodiment;

FIG. 8 depicts another example of training data generation using parse tree based vectorization for natural language processing in accordance with an illustrative embodiment; and

FIG. 9 depicts a flowchart of an example process for parse tree based vectorization for natural language processing in accordance with an illustrative embodiment.

An embodiment includes a computer usable program product. A computer usable program product includes one or more computer readable storage devices and program instructions stored on at least one of the one or more storage devices.
