Chinese word vector modeling method

Document No.: 1378985; Published: 2020-08-14

Reading note: This technology, "Chinese word vector modeling method" (一种中文词向量建模方法), was designed by Wang Junjun (王君君) on 2020-04-21. The invention belongs to the field of vector modeling and relates in particular to a Chinese word vector modeling method. It addresses the problems that existing Chinese word vector modeling methods merely introduce information such as radicals and strokes, cannot reasonably represent unknown (out-of-vocabulary) words, and cannot automatically update sentences, so that they fail to keep up with the times, have a low degree of intelligence, and are inconvenient to use. The proposed scheme comprises the following steps: S1: acquiring a corpus basis of the morphological-acoustic features of Chinese words, classifying the corpus basis, and assigning class-A numbers to the types; S2: assigning class-B numbers in sequence to the corpus-basis items within each type; S3: mapping the numbers into a vector space. The invention can reasonably represent unknown words and automatically update sentences, keeps up with the times, has a high degree of intelligence, and is convenient to use.

1. A Chinese word vector modeling method, characterized by comprising the following steps:

S1: acquiring a corpus basis of the morphological-acoustic features of Chinese words, classifying the corpus basis, and assigning class-A numbers to the types;

S2: assigning class-B numbers in sequence to the corpus-basis items within each type;

S3: mapping the numbers into a vector space;

S4: constructing a base model for the corpus basis;

S5: inputting Chinese words, detecting the sentence length, and detecting the corpus basis in the Chinese words;

S6: converting the detected corpus basis into numbers and judging the conversion result;

S7: mapping the numbers to real-valued vectors in the vector space;

S8: performing word segmentation on the corpus basis of the Chinese words and detecting the result;

S9: inputting the word segmentation result into the base model;

S10: completing the modeling of the Chinese word vector.

2. The method according to claim 1, wherein in S1, the corpus basis includes simplified Chinese characters, traditional Chinese characters, pinyin words, pinyin Chinese characters, radicals, strokes, and the Wubi input method.

3. The method according to claim 1, wherein in S1, a corpus basis of the morphological-acoustic features of Chinese words is obtained and classified, and the types are assigned class-A numbers and stored in a memory A.

4. The method according to claim 1, wherein in S2, class-B numbers are assigned in sequence to the corpus-basis items within each type, and the corresponding class-B numbers in each type are associated and stored in a memory B.

5. The method according to claim 1, wherein in S3, the numbers are mapped into the vector space, and the result is stored in a memory C.

6. The method according to claim 1, wherein in S4, a base model is constructed for the corpus basis, and the base model is associated with memory A, memory B, and memory C respectively.

7. The method according to claim 1, wherein in S6, the detected corpus basis is converted into numbers, the conversion result is judged, and the data stored in memory A, memory B, and memory C are retrieved and checked in one-to-one correspondence against the conversion result.

8. The method according to claim 1, wherein in S8, a word segmentation database is established and automatically updated, and word segmentation is performed on the corpus basis of the Chinese words according to the word segmentation database to obtain the detection result.

9. The method according to claim 1, wherein in S9, the word segmentation result is input into the base model, and the base model is automatically optimized according to the word segmentation database, memory A, memory B, memory C, and the detected sentence length.

Technical Field

The invention relates to the technical field of vector modeling, in particular to a Chinese word vector modeling method.

Background

Word vectors have become an important component of deep-learning-based natural language processing systems. Such systems encode words and sentences as dense, fixed-length vectors, which greatly improves how neural networks process text data.

Existing Chinese word vector modeling methods merely introduce information such as radicals and strokes. They cannot reasonably represent unknown words and cannot automatically update sentences, so they fail to keep up with the times, have a low degree of intelligence, and are inconvenient to use.

Therefore, we propose a Chinese word vector modeling method to solve the above problems.

Disclosure of Invention

The Chinese word vector modeling method provided by the invention solves the problems that existing Chinese word vector modeling methods merely introduce information such as radicals and strokes, cannot reasonably represent unknown words, cannot automatically update sentences, fail to keep up with the times, have a low degree of intelligence, and are inconvenient to use.

To achieve this purpose, the invention adopts the following technical scheme:

A Chinese word vector modeling method comprises the following steps:

S1: acquiring a corpus basis of the morphological-acoustic features of Chinese words, classifying the corpus basis, and assigning class-A numbers to the types;

S2: assigning class-B numbers in sequence to the corpus-basis items within each type;

S3: mapping the numbers into a vector space;

S4: constructing a base model for the corpus basis;

S5: inputting Chinese words, detecting the sentence length, and detecting the corpus basis in the Chinese words;

S6: converting the detected corpus basis into numbers and judging the conversion result;

S7: mapping the numbers to real-valued vectors in the vector space;

S8: performing word segmentation on the corpus basis of the Chinese words and detecting the result;

S9: inputting the word segmentation result into the base model;

S10: completing the modeling of the Chinese word vector.

Preferably, in S1, the corpus basis includes simplified Chinese characters, traditional Chinese characters, pinyin words, pinyin Chinese characters, radicals, strokes, and the Wubi input method.

Preferably, in S1, the corpus basis of the morphological-acoustic features of Chinese words is obtained and classified, and the types are assigned class-A numbers and stored in memory A.

Preferably, in S2, class-B numbers are assigned in sequence to the corpus-basis items within each type, and the corresponding class-B numbers in each type are associated and stored in memory B.

Preferably, in S3, the numbers are mapped into the vector space, and the result is stored in memory C.
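The numbering and mapping in S1 to S3 can be illustrated with a minimal sketch. The source does not specify data structures, dimensions, or how vectors are initialized, so everything below is an assumption made for illustration: the sample items in `CORPUS_BASIS`, the use of plain dictionaries for memories A, B, and C, the 4-dimensional vectors, and the random initialization are all hypothetical.

```python
import random

# Hypothetical corpus-basis types (S1), each with a few illustrative items.
CORPUS_BASIS = {
    "simplified_character": ["词", "向", "量"],
    "radical": ["讠", "口", "日"],
    "pinyin": ["ci", "xiang", "liang"],
}


def build_numbering(corpus_basis):
    """Assign class-A numbers to types (S1) and class-B numbers to items (S2)."""
    memory_a = {}  # type name -> class-A number
    memory_b = {}  # (class-A number, item) -> class-B number
    for a_number, (type_name, items) in enumerate(corpus_basis.items()):
        memory_a[type_name] = a_number
        for b_number, item in enumerate(items):
            memory_b[(a_number, item)] = b_number
    return memory_a, memory_b


def build_vector_table(memory_b, dim=4, seed=0):
    """Map each (A, B) number pair to a real-valued vector (S3).

    Random initialization is an assumption; the source does not say how
    the vectors are obtained.
    """
    rng = random.Random(seed)
    memory_c = {}  # (class-A number, class-B number) -> vector
    for (a_number, _item), b_number in memory_b.items():
        memory_c[(a_number, b_number)] = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
    return memory_c


memory_a, memory_b = build_numbering(CORPUS_BASIS)
memory_c = build_vector_table(memory_b)
```

Keying memory B on the pair (class-A number, item) keeps items of different types apart even when they share the same surface string, which is one plausible reading of the two-level numbering.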

Preferably, in S4, a base model is constructed for the corpus basis, and the base model is associated with memory A, memory B, and memory C respectively.

Preferably, in S6, the detected corpus basis is converted into numbers, the conversion result is judged, and the data stored in memory A, memory B, and memory C are retrieved and checked in one-to-one correspondence against the conversion result.
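One way to read the conversion and judgment in S6 is as a chain of lookups that succeeds only if each memory holds a matching entry. The sketch below assumes that reading; the memory contents and the function name `convert_and_check` are hypothetical, and the source does not state what happens when a lookup fails (here the function simply returns `None`).

```python
# Hypothetical contents of memories A, B, and C (tiny, for illustration only).
MEMORY_A = {"simplified_character": 0, "radical": 1}
MEMORY_B = {(0, "词"): 0, (0, "量"): 1, (1, "讠"): 0}
MEMORY_C = {(0, 0): [0.1, 0.2], (0, 1): [0.3, 0.4], (1, 0): [0.5, 0.6]}


def convert_and_check(type_name, item):
    """Convert a detected item to its (A, B) numbers and validate the result.

    Returns (a_number, b_number, vector), or None when any of the three
    memories has no matching entry.
    """
    a_number = MEMORY_A.get(type_name)
    if a_number is None:
        return None
    b_number = MEMORY_B.get((a_number, item))
    if b_number is None:
        return None
    vector = MEMORY_C.get((a_number, b_number))
    if vector is None:
        return None
    return a_number, b_number, vector
```

For example, `convert_and_check("radical", "讠")` yields the numbers and vector stored for that radical, while an item missing from memory B is rejected.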

Preferably, in S8, a word segmentation database is established and automatically updated, and word segmentation is performed on the corpus basis of the Chinese words according to the word segmentation database to obtain the detection result.
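The source does not name a segmentation algorithm or say how the database is updated. As a sketch only, the example below assumes a dictionary-based approach (forward maximum matching) over a word list that can be extended at any time; the class name `SegmentationDatabase` and its methods are hypothetical.

```python
class SegmentationDatabase:
    """A word list that can grow over time, used for forward maximum matching."""

    def __init__(self, words):
        self.words = set(words)
        self.max_len = max((len(w) for w in self.words), default=1)

    def update(self, new_words):
        """Add newly observed words so that later segmentations can use them."""
        self.words.update(new_words)
        self.max_len = max((len(w) for w in self.words), default=1)

    def segment(self, text):
        """Greedily take the longest known word at each position.

        A single character is emitted when no known word matches.
        """
        tokens, i = [], 0
        while i < len(text):
            for size in range(min(self.max_len, len(text) - i), 0, -1):
                candidate = text[i:i + size]
                if size == 1 or candidate in self.words:
                    tokens.append(candidate)
                    i += size
                    break
        return tokens


db = SegmentationDatabase(["中文", "向量", "建模"])
before = db.segment("中文词向量建模")  # "词向量" is not yet known
db.update(["词向量"])
after = db.segment("中文词向量建模")   # the new entry is now matched as one word
```

Comparing `before` and `after` shows how an update to the database changes the segmentation of the same input without any other change to the pipeline.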

Preferably, in S9, the word segmentation result is input into the base model, and the base model is automatically optimized according to the word segmentation database, memory A, memory B, memory C, and the detected sentence length.

Compared with the prior art, the invention has the beneficial effects that:

1. Because the corpus bases are classified, numbered, and associated with one another, accuracy can be greatly improved;

2. Because the word segmentation database is established and automatically updated, sentences can be updated automatically and the method keeps up with the times;

3. Because the judgment is made in one-to-one correspondence with the conversion result, accuracy can be further improved.

The invention can reasonably represent unknown words, can automatically update sentences, keeps up with the times, has a high degree of intelligence, and is convenient to use.
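The source does not describe how an unknown word receives its vector. One common approach, shown here purely as an assumption and not as the patented method, is to compose the vector from smaller units that are already known, for example by averaging the vectors of the word's characters. The table `FEATURE_VECTORS` and the function `represent_unknown_word` are hypothetical.

```python
# Hypothetical sub-word feature vectors (for example, character or radical vectors).
FEATURE_VECTORS = {
    "词": [1.0, 0.0],
    "向": [0.0, 1.0],
    "量": [1.0, 1.0],
}


def represent_unknown_word(word, feature_vectors, dim=2):
    """Average the vectors of the word's known characters.

    Characters without a vector are skipped; if none are known, a zero
    vector of length `dim` is returned.
    """
    known = [feature_vectors[ch] for ch in word if ch in feature_vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]
```

With this table, a word never seen as a whole, such as "词量", still gets a vector derived from the characters "词" and "量".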

Drawings

Fig. 1 is a schematic block diagram of the Chinese word vector modeling method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some of the embodiments of the present invention, not all of them.
