Word representation method, device, equipment and computer readable storage medium

Document No.: 661725    Publication date: 2021-04-27

Note: this technique, "Word representation method, device, equipment and computer readable storage medium", was designed and created by Li Xiaotao and You Shujuan on 2019-10-25. Its main content is as follows: the embodiments of this application disclose a word representation method, apparatus, device, and computer-readable storage medium, where the method includes: acquiring the root of a word to be represented; when no vector representation corresponding to the root of the word to be represented exists in the trained vector model, decomposing the word to be represented at simple-word granularity to obtain a simple word set; obtaining the vector representation of each simple word in the simple word set according to the vector model; and determining the word vector representation of the word to be represented according to the vector representations of the simple words.

1. A method of representing a word, the method comprising:

acquiring a root of a word to be represented;

when vector representation corresponding to the root of the word to be represented does not exist in the trained vector model, decomposing the word to be represented by taking a simple word as decomposition granularity to obtain a simple word set;

obtaining the vector representation of each simple word in the simple word set according to the vector model;

and determining word vector representation of the word to be represented according to the vector representation of each simple word.

2. The method according to claim 1, wherein decomposing the word to be represented by using a simple word as a decomposition granularity to obtain a simple word set comprises:

decomposing the words to be represented by taking simple words as decomposition granularity to obtain a first decomposition result;

when the simple word in the first decomposition result is determined to be the word to be represented, converting the word to be represented into a second language to obtain a word to be represented in the second language;

converting the words to be represented in the second language into the first language to obtain target words; the target word and the word to be represented are in a first language;

and decomposing the target word to obtain a second decomposition result, and obtaining a simple word set according to the simple words in the second decomposition result.

3. The method of claim 2, further comprising:

when determining that the simple words in the first decomposition result are not the words to be represented, acquiring the simple words in the first decomposition result;

and obtaining a simple word set according to the simple words in the first decomposition result.

4. The method of claim 1, wherein obtaining a vector representation of each simple word in the set of simple words according to the vector model comprises:

acquiring the ith simple word in the simple word set, wherein i = 1, 2, …, N, and the simple word set contains N simple words;

judging whether a vector representation corresponding to the ith simple word exists in the vector model;

and when the vector representation corresponding to the ith simple word exists in the vector model, acquiring the vector representation of the ith simple word.

5. The method of claim 4, further comprising:

when the vector representation of the ith simple word does not exist in the vector model, extracting the root of the ith simple word;

judging whether a vector representation corresponding to the root of the ith simple word exists in the vector model;

and when the vector representation corresponding to the root word of the ith simple word exists in the vector model, determining the vector representation corresponding to the root word as the vector representation of the ith simple word.

6. The method of claim 5, further comprising:

and when the vector representation corresponding to the root of the ith simple word does not exist in the vector model, determining the vector representation of the simple word as a preset vector.

7. The method of claim 1, further comprising:

when vector representation corresponding to the root of the word to be represented exists in the vector model, obtaining the vector representation corresponding to the root of the word to be represented;

and determining word vector representation of the word to be represented according to the vector representation corresponding to the root word.

8. An apparatus for representing a word, the apparatus comprising:

the first acquisition module is used for acquiring the root of a word to be represented;

the decomposition module is used for decomposing the words to be represented by taking simple words as decomposition granularity to obtain a simple word set when the vector representation corresponding to the root of the words to be represented does not exist in the trained vector model;

the second obtaining module is used for obtaining the vector representation of each simple word in the simple word set according to the vector model;

and the determining module is used for determining the word vector representation of the word to be represented according to the vector representation of each simple word.

9. A word presentation apparatus, comprising:

a processor; and

a memory for storing a computer program operable on the processor;

wherein the computer program, when executed by the processor, implements the steps of the word representation method as claimed in any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon computer-executable instructions configured to perform the steps of the method of representing a word as claimed in any one of the preceding claims 1 to 7.

Technical Field

The embodiment of the application relates to the technical field of data processing, and relates to a method, a device, equipment and a computer-readable storage medium for representing words.

Background

A word vector (word embedding) is a feature representation of a word: each word is represented as a fixed-length vector by means of a semantic dictionary or machine learning techniques. At present, word vector models play an increasingly important role in fields such as the semantic web, information retrieval, data mining, information integration, knowledge management, text classification, speech recognition, and word sense disambiguation.

Rare words (also called uncommon words) are words that are used infrequently or occur with low frequency. In the prior art, the word vector of a rare word is usually obtained either by training vector representations of words with a word vector algorithm or by inference with an external semantic dictionary (WordNet). When a word vector algorithm trains word vectors, the context word set at each position where a word occurs in the corpus serves as the input of a neural network, and the vectorized representation of the central word is obtained through continuous iteration. However, because rare words occur with low frequency in the corpus, they are difficult to train sufficiently, so the accuracy with which their meaning is expressed is greatly reduced relative to non-rare words, and in some cases no word vector for a rare word can be obtained at all. The method that infers word vectors of rare words from WordNet is limited by WordNet's coverage: once a rare word does not appear in WordNet, the method fails; moreover, WordNet contains common words of the general domain, while domain-specific words are often not included. The prior art therefore lacks a good method for obtaining word vectors of rare words, and even when a vector representation of a rare word can be obtained, its semantic accuracy is low.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and an apparatus for representing a word, a device for representing a word, and a computer-readable storage medium.

The technical scheme of the embodiment of the application is realized as follows:

in a first aspect, an embodiment of the present application provides a method for representing a word, where the method includes:

acquiring a root of a word to be represented;

when vector representation corresponding to the root of the word to be represented does not exist in the trained vector model, decomposing the word to be represented by taking a simple word as decomposition granularity to obtain a simple word set;

obtaining the vector representation of each simple word in the simple word set according to the vector model;

and determining word vector representation of the word to be represented according to the vector representation of each simple word.

In a second aspect, an embodiment of the present application provides a word representation apparatus, including:

the first acquisition module is used for acquiring the root of a word to be represented;

the decomposition module is used for decomposing the words to be represented by taking simple words as decomposition granularity to obtain a simple word set when the vector representation corresponding to the root of the words to be represented does not exist in the trained vector model;

the second obtaining module is used for obtaining the vector representation of each simple word in the simple word set according to the vector model;

and the determining module is used for determining the word vector representation of the word to be represented according to the vector representation of each simple word.

In a third aspect, an embodiment of the present application provides a word presentation apparatus, including:

a processor; and

a memory for storing a computer program operable on the processor;

wherein the computer program, when executed by the processor, implements the steps of the above word representation method.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein computer-executable instructions configured to perform the steps of the method for representing words described above.

According to the word representation method, apparatus, device, and computer-readable storage medium provided by the embodiments of the application, when the root of a word to be represented (such as a rare word) is not in the trained vector model, the word to be represented is decomposed to obtain a simple word set, and the vector representation of each simple word in the simple word set is obtained through the vector model so as to determine the word vector representation of the word to be represented. Because a simple word has only one word sense, the vector representation of the word to be represented is obtained through the vector representations corresponding to simple words, so the semantics of the word vector of the word to be represented is expressed more accurately.

Drawings

In the drawings, which are not necessarily drawn to scale, like reference numerals may describe similar components in different views. The drawings illustrate generally, by way of example, but not by way of limitation, various embodiments discussed herein.

Fig. 1 is a schematic structural diagram of a CBOW model provided in the related art;

FIG. 2 is a schematic diagram of the structure of a Skip-gram model provided in the related art;

FIG. 3 is a flow chart illustrating a method for representing words provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an improved CBOW model provided in an embodiment of the present application;

fig. 5 is a schematic flow chart illustrating a process of obtaining a vector representation of each simple word in a simple word set according to an embodiment of the present application;

FIG. 6 is a flow chart illustrating a method for representing words provided by embodiments of the present application;

FIG. 7 is a schematic diagram of a component structure of a device for representing words provided by an embodiment of the present application;

fig. 8 is a schematic diagram of a component structure of a word representing device provided in an embodiment of the present application.

Detailed Description

In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

Where the term "first/second/third" appears in this specification, it is used merely to distinguish similar items and does not indicate a particular ordering of items. It is to be understood that "first/second/third" may be interchanged in specific order or sequence where permissible, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

In order to better understand the word expression method, device, apparatus, and computer-readable storage medium provided in the embodiments of the present application, first, an implementation of the word expression method and problems occurring in the related art are analyzed and explained.

Word vector representation methods fall mainly into one-hot representation and distributed representation, of which distributed representation is currently the most common way of representing word features.

Among the distributed word-vector models, word2vec is the best known. word2vec is an open-source deep-learning-based tool released by Google; it converts words into vector form by training a model on a given corpus. Word vectors have good semantic properties and are a common way to represent word features. The value of each dimension of a word vector represents a feature with a certain semantic and grammatical interpretation. Through word vector training, the processing of text content can be simplified into vector operations in a vector space, and the correlation between words in the vector space is computed to express their semantic relatedness.

word2vec comprises two training models: the Continuous Bag-Of-Words (CBOW) model and the Skip-gram model. Fig. 1 is a schematic structural diagram of the CBOW model in the related art; as shown in fig. 1, the model includes an input layer 101, a mapping layer 102, and an output layer 103. The CBOW model predicts the current word from its context, while the Skip-gram model predicts the context from the current word. As shown in fig. 1, the input layer 101 of the CBOW model is the context of the current word w_t and comprises the word vectors of w_{t-2}, w_{t-1}, w_{t+1}, and w_{t+2}; the input of the mapping layer 102 is the accumulated sum of the context word vectors, and the word vector of w_t is output at the output layer 103.

Fig. 2 is a schematic structural diagram of the Skip-gram model in the related art; as shown in fig. 2, the model includes an input layer 201, a mapping layer 202, and an output layer 203. The input layer 201 of the Skip-gram model is w_t, and the mapping layer 202 predicts the context of w_t, so that the context w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} is output at the output layer 203; that is, the mapping direction of the Skip-gram model is opposite to that of the CBOW model. Both the CBOW model and the Skip-gram model are based on neural network models.
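
For illustration only (this tooling is not part of the patent), both training models are available in the open-source gensim library. A minimal sketch, assuming the gensim 4.x API and a toy stand-in corpus:

```python
# Minimal sketch: training CBOW and Skip-gram word vectors with gensim
# (gensim 4.x API assumed; the two-sentence corpus is a toy stand-in).
from gensim.models import Word2Vec

corpus = [
    ["the", "robot", "cleaner", "sweeps", "the", "floor"],
    ["the", "bottle", "warmer", "heats", "the", "bottle"],
]

# sg=0 selects CBOW (predict the current word from its context);
# sg=1 selects Skip-gram (predict the context from the current word).
# min_count=1 is used only because the toy corpus is tiny; the common
# default of min_count=5 is what excludes rare words from the model.
cbow = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=100, window=2, min_count=1, sg=1)

print(cbow.wv["robot"].shape)  # (100,) — one fixed-length vector per word
```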

Taking the CBOW model as an example:

the objective function for CBOW model optimization is:

where C is a corpus and context (w) represents the context of word w, i.e., the set of words around w. Taking the mapping layer as an NEG model as an example, knowing the context (w) of the word w, which needs to be predicted, for a given context (w), the word w is a positive sample, the other words are negative samples, and for a given sample (context (w), performing maximization:

wherein neg (w) is a negative sample set, the positive sample label is 1, the negative sample label is 0, and the probability is calculated as follows:

wherein x iswSum of word vectors, θ, representing words of context (w)u∈RmAn auxiliary vector representing the word u, i.e. the parameters for which the model is to be trained, the σ (x) function is a sigmoid function, which is defined as:

substituting equation (3) into equation (1), the final objective function is:

solving (6) by using a random gradient rise method to obtain:

thus, the parameter θuAnd the context word vector is updated by the formula:

when the CBOW model is used for distributed representation of training words, a context word set of all appearance positions of a word in a corpus is used as input of a neural network, vectorization representation of a central word is obtained through continuous iteration, but for rare words, due to the fact that the appearance frequency in the corpus is low, sufficient training is difficult to obtain, the accuracy of semantic representation of the rare words is greatly reduced relative to non-rare words, and even word vectors of the rare words cannot be obtained. Taking word2vec algorithm as an example, word vectors are usually trained by using words with a word frequency greater than or equal to 5, so that no distributed representation of words with a word frequency less than 5 is included in either CBOW model or Skip-gram model.

To address this problem, a subword-information-enhanced word vector and text classification model (FastText) has been proposed in the related art. The FastText model decomposes each word into a number of character N-Gram strings and trains N-Gram vectors and word vectors simultaneously; the final word vector is the sum of the word's own vector and the vectors of all the N-Grams appearing in the word. Taking a 2-Gram FastText model as an example, the word vector of "book" is the superposition of the vectors corresponding to the 2-Gram set {"bo", "oo", "ok"} and the word vector of "book" itself in the model. For a rare word, the word is first decomposed into several subwords, the corresponding subword vectors are then obtained from the FastText model, and the sum of the subword vectors is taken as the word vector of the rare word. This method can enhance the word vector representation of a rare word by means of subword vector information.
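
The 2-Gram decomposition just described can be sketched in a few lines of Python. Note that this follows the simplified description above; the actual FastText implementation additionally wraps each word in "<" and ">" boundary markers and uses longer n-grams by default:

```python
# Character n-gram decomposition as described above (simplified relative
# to the real FastText, which adds '<'/'>' boundary markers).
def char_ngrams(word, n=2):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("book"))  # ['bo', 'oo', 'ok']

# The final word vector is then the superposition of the word's own vector
# (when it exists in the model) and the vectors of all its n-grams, e.g.:
#   V_book = V["book"] + sum(V[g] for g in char_ngrams("book"))
```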

Yet another approach infers word vectors of rare words by means of WordNet. For a rare word, the corresponding semantic information is looked up in WordNet, including word sense definitions, example sentences, and semantically related words (synonyms, hypernyms, hyponyms, and the like); the word vectors of the words contained in this semantic information are then superposed to obtain the distributed representation of the rare word.
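
A minimal sketch of this lookup using NLTK's WordNet interface; the function name is ours, and the final superposition into a vector depends on a word-vector model, so it is indicated only in a comment:

```python
# Minimal sketch: collecting WordNet-related words of a word with NLTK.
# Requires a one-time `nltk.download('wordnet')`.
from nltk.corpus import wordnet as wn

def semantic_neighbors(word):
    related = set()
    for synset in wn.synsets(word):
        related.update(l.name() for l in synset.lemmas())     # synonyms
        for hyper in synset.hypernyms():
            related.update(l.name() for l in hyper.lemmas())  # hypernyms
        for hypo in synset.hyponyms():
            related.update(l.name() for l in hypo.lemmas())   # hyponyms
    related.discard(word)
    return related

# The rare word's distributed representation is then the superposition of
# the vectors of these words, e.g.:
#   V_rare = sum(model[w] for w in semantic_neighbors(rare) if w in model)
print(sorted(semantic_neighbors("robot"))[:5])
```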

In the FastText method, the word vector of a rare word is realized by superposition of subword vectors. But some simple words have no correlation with their subwords, such as the word "choke" and the subword "hok"; in such cases, the expression accuracy of the meaning of the rare word is greatly reduced compared with that of non-rare words. Moreover, the subwords contained in many rare words are themselves rare words and are not contained in the trained subword vector model; for example, the subwords of some domain-specific vocabularies, such as "iight" and "scaldabiberon", are also rare, and thus word vectors of these rare words cannot be obtained.

The method of estimating the word vector of a rare word by means of WordNet is limited by the coverage of the semantic dictionary WordNet: once the rare word does not appear in WordNet, the method no longer works. Moreover, WordNet contains common words of the general domain, and specific words of some domains are not included, so vectors of those rare words cannot be obtained.

Based on the problems in the related art, the embodiments of the present application provide a word representation method. The method provided by the embodiments can be implemented by a computer program, and each step in the method is completed when the computer program is executed. In some embodiments, the computer program may be executed by a processor in a word representation device.

Fig. 3 is a schematic flow chart of a word representation method provided in an embodiment of the present application, and as shown in fig. 3, the method is applied to a word representation apparatus, and the method includes:

in step S301, a root word (stem) of a word to be expressed is acquired.

In the embodiment of the present application, the word to be represented may be an uncommon word or a common word. Taking an uncommon word as an example: rare words, also called uncommon words, are words that are rarely used or appear infrequently, and a rare word may be a compound word, a simple word, or a domain-specific word. In the embodiment of the present application, whether a word is a rare word can generally be determined by its word frequency in the corpus, where the word frequency is the number of times the word appears in the corpus. For example, words with a word frequency less than 5 in the corpus may be set as rare words, and words with a word frequency greater than or equal to 5 may be set as common words.
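
As a small illustration, such a frequency threshold can be applied with a counter over the corpus tokens (the token list below is a stand-in):

```python
# Flagging rare words by corpus frequency, with the threshold of 5 used above.
from collections import Counter

corpus_tokens = ["the", "the", "the", "the", "the",
                 "robot", "cleaner", "robot", "scaldabiberon"]  # stand-in corpus
freq = Counter(corpus_tokens)

def is_rare(word, threshold=5):
    return freq[word] < threshold

print(is_rare("the"))            # False: word frequency is 5
print(is_rare("scaldabiberon"))  # True: word frequency is 1
```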

In the embodiment of the application, when the word to be represented is obtained, the word representing device may extract the root of the word to be represented.

Taking rare English words as an example, the core of English word formation is the root: the meaning of a word is mainly embodied by the root that forms it. Roots can form words independently or in combination, and prefixes and suffixes change the part of speech and meaning of words. That is, a word generally consists of three parts: a root, a prefix, and a suffix. The root determines the meaning of the word, the prefix modifies that meaning, and the suffix determines the part of speech. For example: root extraction on "refrigerating" yields the root "refriger".

In some embodiments, the root extracted from the word to be represented may be the word itself; for example, "scaldabiberon" still yields itself after root extraction.
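
This root-extraction behavior can be reproduced, for example, with NLTK's Porter stemmer; the patent does not prescribe a particular stemming algorithm, so Porter is shown here only as one common choice:

```python
# Root (stem) extraction with NLTK's Porter stemmer (one possible choice).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("refrigerating"))  # 'refriger'
print(stemmer.stem("scaldabiberon"))  # unchanged: no known suffix to strip
```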

In some embodiments, before step S301, the representation device of the word is further configured to receive a word to be represented.

Step S302, when the vector representation corresponding to the root of the word to be represented does not exist in the trained vector model, the word to be represented is decomposed by taking a simple word as the decomposition granularity to obtain a simple word set.

In the embodiment of the application, the trained vector model includes word vectors and root vectors. Whether a vector representation corresponding to the extracted root exists is looked up in the vector model; for example, "refrigerating" yields the root "refriger", and whether a vector representation of "refriger" exists in the vector model can then be looked up.

In the embodiment of the present application, a simple word is composed of a single morpheme; that is, a simple word expresses only one meaning and cannot be further decomposed. Illustratively, "take", "come", "back", "man", "make", etc. are all simple words. In the embodiment of the application, the decomposition granularity is the minimum unit into which the word to be represented is decomposed; when the word to be represented is decomposed at simple-word granularity, it can be decomposed into one or more simple words, yielding a simple word set. In other words, the simple word set obtained in this step may include one or more simple words. When the simple word set includes only one simple word, that simple word is the word to be represented itself.
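
The patent does not fix a particular decomposition algorithm. One plausible realization is dictionary segmentation against a vocabulary of simple words; a sketch under that assumption, with a toy vocabulary:

```python
# Sketch: decomposing a word into simple words by dictionary segmentation.
# SIMPLE_WORDS is an assumed toy vocabulary, not one given by the patent.
SIMPLE_WORDS = {"robot", "cleaner", "bottle", "warmer", "book"}

def decompose(word, vocab=SIMPLE_WORDS):
    """Return a list of simple words covering `word`, or [word] if none."""
    n = len(word)
    best = [None] * (n + 1)   # best[i] = segmentation of word[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and word[j:i] in vocab:
                best[i] = best[j] + [word[j:i]]
                break
    return best[n] if best[n] else [word]

print(decompose("robotcleaner"))   # ['robot', 'cleaner']
print(decompose("scaldabiberon"))  # ['scaldabiberon'] — cannot be split
```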

In some embodiments, when a vector representation corresponding to the root of the word to be represented exists in the vector model, that vector representation is obtained, and the word vector representation of the word to be represented is determined from it. Exemplarily, the word to be represented is "refrigerating", and the root "refriger" is obtained through root extraction. The vector of "refriger" is looked up in the trained vector model to determine the vector representation of the word to be represented, i.e., V("refrigerating") = V("refriger"), and the flow of the method then ends.

Step S303, obtaining the vector representation of each simple word in the simple word set according to the vector model.

In this embodiment of the application, when the word representation device obtains the vector representation of each simple word in the simple word set according to the vector model, the vector representation of a simple word may be obtained by using the simple word itself, by using the root of the simple word, or by using both the simple word and its root.

In the embodiment of the application, when the simple word set includes a plurality of simple words, the word vector representation of each simple word is obtained. When the simple word set contains only one simple word, the word to be represented can be converted into a near-synonym or synonym of the word to be represented, and that synonym is then decomposed to obtain a plurality of simple words, so that the vector representations of the plurality of simple words are obtained.

And step S304, determining word vector representation of the word to be represented according to the vector representation of each simple word.

In this embodiment of the present application, when step S304 is implemented, the vector representations of the simple words may be accumulated, so as to determine the word vector representation of the word to be represented.

The following are exemplary. The word vector representation of "robotcleaner":

V("robotcleaner") = V("robot") + V("cleaner").

The word vector representation of "scaldabiberon":

V("scaldabiberon") = V("bottle") + V("warmer").

By the word representation method described above, when the root of a word to be represented (such as a rare word) is not in the trained vector model, the word to be represented is decomposed to obtain a simple word set, and the vector representation of each simple word in the simple word set is obtained from the vector model so as to determine the word vector representation of the word to be represented. Because a simple word has only one word sense, the vector representation of the word to be represented is obtained through the vector representations corresponding to simple words, so the semantics of the word vector of the word to be represented is expressed more accurately.

In some embodiments, before step S302, the method further comprises: and obtaining a trained vector model.

In the embodiment of the application, when determining the representation of the word to be represented, the root of the word to be represented needs to be utilized, and therefore the vector representation of each root needs to be obtained. Multiple words usually share a root, and sometimes the root is not itself a word; for example, the roots of the words "produced" and "producing" are both "produc", but "produc" is not a word in the corpus dictionary. Therefore, in the embodiment of the present application, the trained vector model may be an improved CBOW model that is able to train word vectors and root vectors at the same time.

Fig. 4 is a schematic structural diagram of an improved CBOW model provided in an embodiment of the present application. As shown in fig. 4, the model includes an input layer 401, a mapping layer 402, and an output layer 403. The context of the word w_t to be represented and the roots of the context words are input at the input layer 401, where the context includes w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}, and the roots of the context include stem(w_{t-2}), stem(w_{t-1}), stem(w_{t+1}), stem(w_{t+2}). The inputs to the mapping layer 402 are the accumulated sum of the context word vectors and the accumulated sum of the context root vectors. The output of the output layer 403 is the word vector of the word w_t to be represented. In each stochastic gradient ascent iteration, the word vectors and the root vectors of the context are updated at the same time, so that a converged improved CBOW model is finally obtained (in some embodiments, the improved CBOW model may also be referred to as the Word and Stem Embedding (WSE) model).
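
To make the modified input concrete, here is a minimal sketch of the projection-layer computation; the names, dimensions, and zero-vector fallback are our illustrative assumptions, since the actual WSE training code is not given in this text:

```python
# Sketch of the WSE (improved CBOW) projection layer: the mapping-layer
# input is the sum of the context word vectors plus the sum of the context
# roots' vectors. `word_emb`, `stem_emb`, `stem_of` are illustrative stand-ins.
import numpy as np

m = 100                          # embedding dimension (illustrative)
word_emb = {}                    # word -> np.ndarray of shape (m,)
stem_emb = {}                    # root -> np.ndarray of shape (m,)

def stem_of(word):
    return word                  # stand-in; e.g. a Porter stemmer in practice

def wse_projection(context_words):
    x = np.zeros(m)
    for w in context_words:
        x += word_emb.get(w, np.zeros(m))           # context word vector
        x += stem_emb.get(stem_of(w), np.zeros(m))  # root vector of that word
    return x  # fed to the output layer; both embedding tables are updated
              # together during each gradient step
```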

In some embodiments, step S302: when the vector representation corresponding to the root of the word to be represented does not exist in the trained vector model, the word to be represented is decomposed by taking a simple word as a decomposition granularity to obtain a simple word set, and the method can be realized by the following steps:

and step S3021, the word representing device decomposes the word to be represented by using a simple word as decomposition granularity to obtain a first decomposition result.

In the embodiment of the present application, when a word to be represented can be decomposed into a plurality of simple words, that is, when it is composed of two or more simple words, the word to be represented is a compound word; decomposing a compound word at simple-word granularity thus yields at least two simple words. For example, "robotcleaner" can be decomposed into "robot" and "cleaner", and the resulting simple word set then includes "robot" and "cleaner".

In some embodiments, when the word to be represented is decomposed, the word to be represented itself is obtained, that is, the word to be represented itself is a simple word. For example: the scaldabiberon still obtains itself after the morphological decomposition, and the simple word set only comprises the word to be represented itself.

Step S3022, the word representing apparatus determines whether the simple word in the first decomposition result is the word to be represented.

In this embodiment of the present application, when the word representing apparatus determines that the simple word in the first decomposition result is the word to be represented, that is, the word to be represented cannot be decomposed, step S3023 is executed; when the word representing device judges that the simple word in the first decomposition result is not the word to be represented, namely, the word to be represented can be decomposed, step S3026 is executed.

Step S3023, the word representing apparatus converts the word to be represented into a second language, so as to obtain a word to be represented in the second language.

In the embodiment of the present application, the first language is different from the second language, and converting the word to be represented into the second language may be done by translating it. For example, the first language may be English and the second language Chinese; that is, the English word to be represented is translated into a Chinese word to be represented. For instance, translating the word to be represented "scaldabiberon" into Chinese yields "milk heater" (in Chinese), which is then the word to be represented in the second language.

In this embodiment of the application, in the implementation of step S3023, an Application Programming Interface (API) of a translation website may be called to perform the translation, so as to obtain the word to be represented in the second language.

Step S3024, the word representing apparatus converts the word to be represented in the second language into the first language to obtain the target word.

In the embodiment of the present application, the target word and the word to be represented are in the first language; the word to be represented in the second language may be translated into a target word in the first language. Following the example above, the Chinese "milk heater" is translated into English to obtain "bottle warmer", which is now the target word. That is to say, the English word to be represented is translated into Chinese, and the Chinese is then reversely translated to obtain the English target word. In the embodiment of the present application, the target word may be considered a synonym or a synonymous phrase of the word to be represented.
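
A sketch of this bidirectional translation step follows. The `translate` function stands in for a call to some translation service API, and the stub table, including the Chinese intermediate, is purely illustrative:

```python
# Sketch of bidirectional translation (steps S3023 and S3024). `translate`
# stands for a translation service API call; the stub table below is an
# illustrative assumption so the sketch runs end to end.
_STUB = {
    ("scaldabiberon", "en", "zh"): "暖奶器",       # assumed Chinese intermediate
    ("暖奶器", "zh", "en"): "bottle warmer",
}

def translate(text, src, dst):
    return _STUB[(text, src, dst)]   # replace with a real API call

def synonym_via_round_trip(word, first_lang="en", second_lang="zh"):
    """Translate `word` into the second language and back, yielding a target
    word (a synonym or synonymous phrase) in the first language."""
    s1 = translate(word, src=first_lang, dst=second_lang)  # S3023
    return translate(s1, src=second_lang, dst=first_lang)  # S3024

print(synonym_via_round_trip("scaldabiberon"))  # 'bottle warmer'
```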

And step S3025, the word representing device decomposes the target word to obtain a second decomposition result, and obtains a simple word set according to the simple words in the second decomposition result, and the process of the step is ended.

In the embodiment of the present application, decomposing the target word also uses a simple word as a decomposition granularity to decompose the target word. Illustratively, the target word, a bottle warmer, is decomposed into a bottle and a warmer, where the resulting set of simple words includes: bottle and warmer.

In step S3026, the word representing apparatus obtains the simple word in the first decomposition result.

Illustratively, after "robotcleaner" is decomposed into "robot" and "cleaner", the first decomposition result is "robot" and "cleaner".

Step S3027, the word representing apparatus obtains a simple word set according to the simple words in the first decomposition result.

Illustratively, the set of simple words is determined to include robot and cleaner.

In the embodiment of the application, when the vector representation corresponding to the root of the word to be represented does not exist in the trained vector model, the word to be represented is decomposed by taking the simple word as the decomposition granularity, so that a simple word set is determined; and when the word to be represented cannot be decomposed, obtaining the synonym or the synonym phrase of the word to be represented, and decomposing the synonym or the synonym phrase to determine the simple word set.

Fig. 5 is a schematic flow chart of obtaining a vector representation of each simple word in the simple word set in the embodiment of the present application, and as shown in fig. 5, the step S303: obtaining the vector representation of each simple word in the simple word set according to the vector model, and the method can be realized by the following steps:

step S3031, the word representing apparatus obtains the ith simple word in the simple word set.

In the embodiment of the present application, i = 1, 2, …, N, and the simple word set includes N simple words.

In this embodiment of the present application, the simple word set may be determined based on the first decomposition result, or may be determined based on the second decomposition result.

Step S3032, the word representing device determines whether a vector representation corresponding to the i-th simple word exists in the vector model.

In this embodiment of the present application, when a vector representation corresponding to the ith simple word exists in the vector model, step S3033 is executed; when the vector representation of the i-th simple word does not exist in the vector model, step S3034 is performed.

Step S3033, the word representing apparatus obtains the vector representation of the ith simple word.

In the embodiment of the application, vector representations corresponding to all the simple words in the simple word set can be obtained by sequentially obtaining the vector representations of all the simple words in the simple word set, and when the vector representations corresponding to all the simple words in the simple word set are obtained, the process of the step is ended.

Step S3034, the word representing device extracts the root of the ith simple word.

In the embodiment of the present application, the method for extracting the root of the ith word is the same as the above method for extracting the root of the word to be represented, and details are not described here.

Step S3035, the word representing device determines whether a vector representation corresponding to the root of the i-th simple word exists in the vector model.

In this embodiment of the present application, when a vector representation corresponding to the root of the ith simple word exists in the vector model, step S3036 is executed. When the vector representation corresponding to the root of the ith simple word does not exist in the vector model, step S3037 is executed.

Step S3036, the word representing device determines the vector representation corresponding to the root word as the vector representation of the ith simple word.

Step S3037, the word representing device determines the vector representation of the simple word as a preset vector.

In the embodiment of the present application, a situation may occur in which the root of a simple word is not in the vector model; in that case, the vector representation of the simple word may be determined as the 0 vector.

In the embodiment of the application, the vector representation corresponding to a simple word is obtained through the simple word or through its root, and the vector representation of each simple word in the simple word set is thereby obtained. This solves the problem in the prior art that the vector representation of the word to be represented cannot be obtained because only a single obtaining mode is available. Because a simple word has only one word sense, the vector representation of the word to be represented is obtained through the vector representations corresponding to simple words, so the semantic information of the vector representation of the word to be represented is more accurate.

An embodiment of the present application further provides a word representation method, taking an uncommon word as an example. Fig. 6 is a schematic flow chart of the word representation method provided in the embodiment of the present application; as shown in fig. 6, the method includes:

in step S601, the expression device of the word initializes the vector V of the rare word to 0 vector.

In step S602, the expression device of the word extracts the root word of the rare word w.

In the embodiment of the present application, the root word of w is denoted as ws.

In step S603, the word representing device determines whether a root vector of ws exists in the WSE model.

In this embodiment of the present application, step S618 is executed if a root vector of ws exists in the WSE model, and step S604 is executed if a root vector of ws does not exist in the WSE model.

In step S604, the word representation device performs morphological decomposition on the rare word w to obtain the decomposed simple word set W.

In step S605, the word representing apparatus determines whether the number of words (i.e., simple words) in the W set is equal to 1.

In the embodiment of the present application, if the number of words in the W set is equal to 1, which indicates that the rare word W itself is a simple word, and morphological decomposition cannot be performed, step S606 is performed at this time, and if the number of words in the W set is not equal to 1, step S609 is performed.

In step S606, the word representation device translates the rare word w from language L1 into language L2, obtaining a translation result S1.

In this embodiment of the present application, before step S606, the method further includes: the language category of the rare word w is determined to be L1, and the intermediate language category is determined to be L2, wherein L2 ≠ L1, e.g., L1 is English and L2 is Chinese.

In the embodiment of the application, the word representing device may perform translation by calling an API of a translation website.

In step S607, the word representation device reversely translates the translation result S1 from language L2 back into language L1, obtaining a translation result S2.

In step S608, the word representing device performs morphological decomposition on the translation result S2.

In step S609, the word representation device obtains the simple word set W.

In step S610, the word representation device sequentially selects a simple word w0 from the simple word set W to obtain the vector representation M[w0] of w0 in the WSE model.

In step S611, the word representation device determines whether the vector representation M[w0] of w0 exists in the WSE model.

In the embodiment of the present application, step S612 is performed if M[w0] exists in the WSE model, and step S613 is performed if M[w0] does not exist in the WSE model.

In step S612, the word representation device accumulates M[w0] into the vector: V = V + M[w0].

In this embodiment, V is the vector accumulated on top of the 0 vector set at initialization. That is, the vector V changes dynamically: whenever the vector representation of a simple word is obtained, that representation is superimposed onto V. In this embodiment, step S617 is executed after step S612 is completed.

In step S613, the word representation device extracts the root ws0 of w0 in order to obtain the root vector M[ws0].

In step S614, the word representation device determines whether the vector representation M[ws0] of ws0 is stored in the WSE model.

In the embodiment of the present application, if M[ws0] does not exist in the WSE model, step S615 is performed; if M[ws0] exists in the WSE model, step S616 is performed.

In step S615, the word representation device records M[ws0] as the 0 vector.

In step S616, the word representation device accumulates M[ws0] into the vector: V = V + M[ws0].

In step S617, the word representation device determines whether all simple words in W have been selected.

In the embodiment of the present application, if the selection is not completed, step S610 is executed, and if the selection is completed, step S619 is executed.

In step S618, the word representation device sets V = M[ws].

In step S619, the word representing device returns the vector V as the distributed representation of the rare word w, and the process ends.
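
Pulling steps S601 to S619 together, the following condensed sketch shows the whole flow. `M` is a dict-like WSE model mapping words and roots to vectors, and the `stem`, `decompose`, and `round_trip` helpers are the assumed components sketched earlier:

```python
# Condensed sketch of steps S601-S619; all helper components are assumed.
import numpy as np

def represent_rare_word(w, M, stem, decompose, round_trip, m=100):
    V = np.zeros(m)                           # S601: initialize V to the 0 vector
    ws = stem(w)                              # S602: extract the root of w
    if ws in M:                               # S603: root vector exists?
        return M[ws].copy()                   # S618: V = M[ws]
    W = decompose(w)                          # S604: morphological decomposition
    if len(W) == 1:                           # S605: w is itself a simple word
        s2 = round_trip(w)                    # S606-S607: bidirectional translation
        W = [t for tok in s2.split() for t in decompose(tok)]  # S608-S609
    for w0 in W:                              # S610/S617: each simple word in W
        if w0 in M:                           # S611: M[w0] exists?
            V += M[w0]                        # S612: V = V + M[w0]
        else:
            ws0 = stem(w0)                    # S613: root of the simple word
            if ws0 in M:                      # S614: M[ws0] exists?
                V += M[ws0]                   # S616: V = V + M[ws0]
            # S615: otherwise M[ws0] counts as the 0 vector (V unchanged)
    return V                                  # S619: distributed representation of w
```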

Illustratively, the word vector of the rare word "robotcleaner" can be represented, via morphological decomposition, as the sum of two simple word vectors:

V("robotcleaner") = V("robot") + V("cleaner").

The rare word "refrigerating" can be represented, via root extraction, by the vector of its root "refriger" in the WSE model:

V("refrigerating") = V("refriger");

The rare word "scaldabiberon" still yields itself after root extraction and morphological decomposition, so its vector description must be obtained through bidirectional translation. If the intermediate language is Chinese, "scaldabiberon" is first translated into Chinese to obtain the translation result "milk heater" (in Chinese), and that result is then reversely translated into English to obtain "bottle warmer". Morphological decomposition of "bottle warmer" yields the two simple words "bottle" and "warmer", and finally the word vectors of "bottle" and "warmer" are superposed to obtain the distributed representation of "scaldabiberon":

V("scaldabiberon") = V("bottle") + V("warmer").

the expression method of the words comprehensively utilizes morphology decomposition, root extraction and bidirectional translation, can realize synonym vector expression of rare words, and the used pre-training model is a WSE model. In the embodiment of the application, the minimum semantic element for expressing the rare words is a complete word or a root, and compared with a method based on a sub-word vector model represented by FastText, each constituent unit has more definite semantic information and can express the meaning of the original rare words more completely. Compared with a method based on a knowledge base, the expression method of the words provided by the embodiment of the application is not limited by the inclusion range of the knowledge base, has better coverage rate of the rare words, and can provide vector reasoning expression for words in a larger range. In addition, the word representation method provided by the embodiment of the application can better express the semantic information of the rare word, namely the domain-specific word, is also suitable for different domains, and has better universality.

Based on the foregoing embodiments, the present application provides a word representation apparatus. The apparatus includes modules, and the modules include units, which may be implemented by a processor in a computer device, or by specific logic circuits. In implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.

Fig. 7 is a schematic structural diagram of a word representation apparatus provided in an embodiment of the present application, and as shown in fig. 7, the word representation apparatus 700 includes a first obtaining module 701, a decomposition module 702, a second obtaining module 703, and a first determining module 704. Wherein:

the first obtaining module 701 is configured to obtain a root of a word to be represented.

And the decomposition module 702 is configured to decompose the word to be represented by using a simple word as a decomposition granularity when the vector representation corresponding to the root of the word to be represented does not exist in the trained vector model, so as to obtain a simple word set.

A second obtaining module 703, configured to obtain, according to the vector model, a vector representation of each simple word in the simple word set.

A first determining module 704, configured to determine, according to the vector representation of each simple word, a word vector representation of the word to be represented.

In some embodiments, the decomposition module 702 comprises: the device comprises a first decomposition unit, a first conversion unit, a second conversion unit and a second decomposition unit. Wherein:

and the first decomposition unit is used for decomposing the words to be represented by taking the simple words as decomposition granularity to obtain a first decomposition result.

And the first conversion unit is used for converting the word to be represented into a second language to obtain a word to be represented in a second language when the simple word in the first decomposition result is determined to be the word to be represented.

The second conversion unit is used for converting the words to be represented in the second language into the first language to obtain target words; and the target word and the word to be represented are in a first language.

And the second decomposition unit is used for decomposing the target word to obtain a second decomposition result and obtaining a simple word set according to the simple words in the second decomposition result.

In some embodiments, the apparatus 700 further comprises: the device comprises a third obtaining module and a second determining module.

Wherein:

and the third obtaining module is used for obtaining the simple words in the first decomposition result when the simple words in the first decomposition result are determined not to be the words to be represented.

And the second determining module is used for obtaining a simple word set according to the simple words in the first decomposition result.

In some embodiments, the second obtaining module 703 further includes: the device comprises a first acquisition unit, a first judgment unit and a second acquisition unit. Wherein:

a first obtaining unit, configured to obtain an i-th simple word in the simple word set, where i is 1, 2, …, N, and N is the total number of the simple words in the simple word set.

And the first judgment unit is used for judging whether the vector representation corresponding to the ith simple word exists in the vector model.

A second obtaining unit, configured to obtain a vector representation of the ith simple word when a vector representation corresponding to the ith simple word exists in the vector model.

In some embodiments, the second obtaining module 703 further includes: the device comprises a first extraction unit, a second judgment unit and a first determination unit. Wherein:

a first extracting unit, configured to extract a root word of the ith simple word when the vector representation of the ith simple word does not exist in the vector model.

And the second judging unit is used for judging whether a vector representation corresponding to the root of the ith simple word exists in the vector model.

A first determining unit, configured to determine, when a vector representation corresponding to a root of the ith simple word exists in the vector model, the vector representation corresponding to the root of the word as a vector representation of the ith simple word.

In some embodiments, the second obtaining module 703 further includes:

a second determining unit, configured to determine, when a vector representation corresponding to a root of the ith simple word does not exist in the vector model, the vector representation of the simple word as a preset vector.

In some embodiments, the apparatus 700 further comprises: the device comprises a fourth obtaining module and a third determining module.

Wherein:

and the fourth obtaining module is used for obtaining the vector representation corresponding to the root of the word to be represented when the vector representation corresponding to the root of the word to be represented exists in the vector model.

And the third determining module is used for determining the word vector representation of the word to be represented according to the vector representation corresponding to the root word.

The above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.

It should be noted that, in the embodiment of the present application, if the above word representation method is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.

Accordingly, an embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps in the word representation method provided in the above embodiments.

An embodiment of the present application provides a word representation device. Fig. 8 is a schematic structural diagram of the composition of the word representation device provided in the embodiment of the present application. As shown in fig. 8, the word representation device 800 includes: a processor 801, at least one communication bus 802, a user interface 803, at least one external communication interface 804, and a memory 805. The communication bus 802 is configured to enable connection and communication between these components. The user interface 803 may include a display screen, and the external communication interface 804 may include standard wired and wireless interfaces. The processor 801 is configured to execute the word representation program stored in the memory to implement the steps in the word representation method provided by the above embodiments.

The above description of the word representation device and storage medium embodiments is analogous to the description of the method embodiments above, with similar beneficial effects. For technical details not disclosed in the device and storage medium embodiments of the present application, reference is made to the description of the method embodiments of the present application for understanding.


It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.

Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.

Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a device to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.

The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
