Chinese character coding method based on character embedding

Document No.: 782690    Publication date: 2021-04-09

Note: this technology, "Chinese character coding method based on character embedding", was designed and created by 柯逍 (Ke Xiao) and 刘童安 (Liu Tong'an) on 2021-01-04. Its main content is as follows. The invention relates to a Chinese character coding method based on character embedding, which comprises the following steps. Step S1: construct a Chinese character set, decompose each character into a number of substructures, construct a substructure set, define the contribution degree of each substructure to each character, and build the matrix of substructure contribution degrees to each character from the substructure set. Step S2: construct and train a substructure embedding matrix according to the obtained substructure set and contribution degree matrix, and extract a character embedding matrix from it. Step S3: input a character and obtain its character embedding through the character embedding matrix. The invention can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.

1. A Chinese character coding method based on character embedding is characterized by comprising the following steps:

step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and constructing, from the substructure set, the matrix of substructure contribution degrees to each character;

step S2: constructing and training a substructure embedding matrix according to the obtained substructure set and the matrix of substructure contribution degrees to each character, and extracting a character embedding matrix from it;

step S3: inputting characters, and acquiring character embedding through a character embedding matrix.

2. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S1 specifically comprises:

step S11: determining the character set to be encoded, the ia-th Chinese character being char_ia; there are n_chars Chinese characters to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};

Step S12: all Chinese characters in chars are split to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;

step S13: calculating the substructure frequency table nfreq_parts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes that part_ib is a substructure of nfreq_ib characters;

step S14: because the splitting result when k = 1 is the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);

Step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

3. The method for encoding Chinese characters based on character embedding of claim 2, wherein said step S12 specifically comprises:

(1) presetting that each Chinese character can be split into k substructures;

(2) k is an integer not less than 1, and when k is 1, the split result is a character per se;

(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;

splitting all Chinese characters in chars according to (1)-(3) to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.

4. The method for encoding Chinese characters based on character embedding of claim 2, wherein said step S15 specifically comprises:

(1) when a Chinese character is split into k parts, the contribution degree of the split sub-structure to the character is

(2) When one substructure appears in a plurality of splitting results of one character at the same time, calculating the contribution degree by taking the splitting method with the minimum k;

(3) if a sub-structure cannot form a character, the contribution degree of the sub-structure to the character is 0;

calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

5. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S2 specifically comprises:

step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedding vectors;

step S22: one-hot encoding each substructure in parts, so that part_ib is encoded as ponehot_ib and the one-hot codes of all substructures form ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};

Step S23: for the ib-th substructure, ponehot_ib is taken as the central substructure with probability f(nfreq_ib), where the probability is calculated as follows:

where min is the minimum function and alpha is a manually set parameter; a window of size t is then set, t being a manually set positive integer; taking the ib-th row of charparts as the probability distribution over characters, t characters are sampled, their character indices are mapped to substructure indices through the mapping g, and the results are placed in the window as the related substructures; in addition, r substructures are sampled at random as the unrelated substructures, r being a manually set positive integer;

step S24: the computation of embedding the one-hot code into the vector by the sub-structure embedding matrix is as follows:

emb = ponehot × embs_parts

where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the resulting embedding vector; the one-hot code of the central substructure is embedded through embs1 to obtain the embedding vector emb1;

step S25: mixing t phasesSingle-hot coding of the structure of the joint is embedded through embs2 to obtain t embedded vectors emb2ps ═ emb2pic1, 2.., t }, where emb2picThe ith of the t embedded vectors;

step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedding vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedding vectors;

step S27: loss is calculated and the network is optimized using the following formula:

where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the expression of the logsigmoid function is as follows:

logsigmoid(x) = log(1/(1 + e^(-x))), where x is the independent variable, e is the natural constant, and log is the logarithm with base e;

step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed a plurality of times until the network converges, and embs1 is taken as the trained substructure embedding matrix;

step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, where row ia of embschar corresponds to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).

6. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S3 specifically comprises:

step S31: selecting a Chinese character to be coded;

step S32: using conehots to encode the Chinese character to be coded as a one-hot code;

step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.

Technical Field

The invention relates to the field of pattern recognition and computer vision, in particular to a Chinese character coding method based on character embedding.

Background

Language is one of the main ways that humans transmit information, and words are written language, which is also one of the most widespread ways that humans transmit information visually.

With the rapid development of technologies such as artificial intelligence and the Internet, automatically recognizing text in images with a computer is of great significance. For character recognition tasks, characters are usually encoded with one-hot coding; this coding is sparse and ignores the correlation between similar characters. For recognizing English letters and digits the applicability is still acceptable because the number of categories is small. For Chinese character recognition, however, there are thousands of commonly used characters, so one-hot coding slows network convergence and completely ignores the structural similarity between Chinese characters, resulting in low accuracy and low efficiency of character recognition.
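To make the sparsity point concrete, the short illustration below (not part of the patented method; the indices and embedding values are made up) shows that the one-hot codes of two structurally similar characters are always orthogonal, while low-dimensional embeddings can expose their similarity:

```python
# Illustration only: one-hot codes carry no similarity information, dense embeddings can.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

n_chars = 3500                                            # roughly the number of common Chinese characters
onehot_song = np.zeros(n_chars); onehot_song[10] = 1.0    # hypothetical index of 松
onehot_bai = np.zeros(n_chars); onehot_bai[11] = 1.0      # hypothetical index of 柏
print(cosine(onehot_song, onehot_bai))                    # 0.0: similarity is invisible

emb_song = np.array([0.8, 0.1, 0.3])    # made-up low-dimensional embeddings;
emb_bai = np.array([0.7, 0.2, 0.2])     # both characters share the 木 radical
print(cosine(emb_song, emb_bai))        # > 0: structural similarity is reflected
```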

Disclosure of Invention

In view of the above, the present invention provides a Chinese character coding method based on character embedding, which can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.

In order to achieve the purpose, the invention adopts the following technical scheme:

a Chinese character coding method based on character embedding comprises the following steps:

step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and constructing, from the substructure set, the matrix of substructure contribution degrees to each character;

step S2: constructing and training a substructure embedding matrix according to the obtained substructure set and the matrix of substructure contribution degrees to each character, and extracting a character embedding matrix from it;

step S3: inputting characters, and acquiring character embedding through a character embedding matrix.

Further, the step S1 is specifically:

step S11: determining the character set to be encoded, the ia-th Chinese character being char_ia; there are n_chars Chinese characters to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};

Step S12: all Chinese characters in chars are split to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;

step S13: calculating the substructure frequency table nfreq_parts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes that part_ib is a substructure of nfreq_ib characters;

step S14: because the split result is character split when k is 1, chars is a subset of parts, and a mapping relation g is established, so that part is formedib=partg(ia)

Step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

Further, the step S12 is specifically:

(1) presetting that each Chinese character can be split into k substructures;

(2) k is an integer not less than 1, and when k is 1, the split result is a character per se;

(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;

splitting all Chinese characters in chars according to (1)-(3) to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.

Further, the step S15 is specifically:

(1) when a Chinese character is split into k parts, the contribution degree of the split sub-structure to the character is

(2) When one substructure appears in a plurality of splitting results of one character at the same time, calculating the contribution degree by taking the splitting method with the minimum k;

(3) if a sub-structure cannot form a character, the contribution degree of the sub-structure to the character is 0;

calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

Further, the step S2 is specifically:

step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedding vectors;

step S22: one-hot encoding each substructure in parts, so that part_ib is encoded as ponehot_ib and the one-hot codes of all substructures form ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};

Step S23: for the ib-th substructure, ponehot_ib is taken as the central substructure with probability f(nfreq_ib), where the probability is calculated as follows:

where min is the minimum function and alpha is a manually set parameter; a window of size t is then set, t being a manually set positive integer; taking the ib-th row of charparts as the probability distribution over characters, t characters are sampled, their character indices are mapped to substructure indices through the mapping g, and the results are placed in the window as the related substructures; in addition, r substructures are sampled at random as the unrelated substructures, r being a manually set positive integer;

step S24: the computation of embedding the one-hot code into the vector by the sub-structure embedding matrix is as follows:

emb = ponehot × embs_parts

where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the resulting embedding vector; the one-hot code of the central substructure is embedded through embs1 to obtain the embedding vector emb1;

step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedding vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedding vectors;

step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedding vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedding vectors;

step S27: loss is calculated and the network is optimized using the following formula:

where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the expression of the logsigmoid function is as follows:

logsigmoid(x) = log(1/(1 + e^(-x))), where x is the independent variable, e is the natural constant, and log is the logarithm with base e;

step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed a plurality of times until the network converges, and embs1 is taken as the trained substructure embedding matrix;

step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, where row ia of embschar corresponds to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).

Further, the step S3 is specifically:

step S31: selecting a Chinese character to be coded;

step S32: using conehots to encode the Chinese character to be coded as a one-hot code;

step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.

Compared with the prior art, the invention has the following beneficial effects:

The invention can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

Referring to FIG. 1, the present invention provides a method for encoding Chinese characters based on character embedding, comprising the following steps:

step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and constructing, from the substructure set, the matrix of substructure contribution degrees to each character;

step S2: constructing and training a substructure embedding matrix according to the obtained substructure set and the matrix of substructure contribution degrees to each character, and extracting a character embedding matrix from it;

step S3: inputting characters, and acquiring character embedding through a character embedding matrix.

In this embodiment, the step S1 specifically includes:

step S11: determining the character set to be encoded, the ia-th Chinese character being char_ia; there are n_chars Chinese characters to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};

Step S12, (1) presetting that each Chinese character can be split into k substructures;

(2) k is an integer not less than 1, and when k is 1, the split result is a character per se;

(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;

splitting all Chinese characters in chars according to (1)-(3) to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.
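A minimal sketch of steps S11-S12 follows. The decomposition table `decompositions` (here a three-character toy example) is a placeholder the patent does not specify; in practice it could be derived from IDS/radical data. All later sketches reuse the names defined here.

```python
# Sketch of steps S11-S12 under the stated assumptions.
from typing import Dict, List

k_max = 4  # manually set maximum split number (demo value)

# Hypothetical decomposition table: character -> {k: [splits, each a list of k substructures]}.
decompositions: Dict[str, Dict[int, List[List[str]]]] = {
    "好": {1: [["好"]], 2: [["女", "子"]]},
    "妈": {1: [["妈"]], 2: [["女", "马"]]},
    "吗": {1: [["吗"]], 2: [["口", "马"]]},
}

chars = list(decompositions)        # character set chars, n_chars elements
parts: List[str] = []               # set of all substructures parts
for ch in chars:
    for k, splits in decompositions[ch].items():
        for split in splits:
            for sub in split:
                if sub not in parts:
                    parts.append(sub)
n_parts = len(parts)
part_index = {p: ib for ib, p in enumerate(parts)}
# Mapping g: character index ia -> substructure index of its k = 1 split (the character itself).
g = {ia: part_index[ch] for ia, ch in enumerate(chars)}
```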

Step S13: calculating the substructure frequency table nfreq_parts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes that part_ib is a substructure of nfreq_ib characters;

step S14: because the splitting result when k = 1 is the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);

Step S15: (1) when a Chinese character is split into k parts, the contribution degree of the split sub-structure to the character is

(2) When one substructure appears in a plurality of splitting results of one character at the same time, calculating the contribution degree by taking the splitting method with the minimum k;

(3) if a sub-structure cannot form a character, the contribution degree of the sub-structure to the character is 0;

calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.
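The contribution-degree formula itself appears only as an image in the source and is not reproduced above, so the sketch below assumes the simplest rule consistent with the text: a substructure obtained from a k-way split contributes 1/k, and the smallest applicable k is used when a substructure occurs in several splits of the same character. Names continue the previous sketch.

```python
# Sketch of steps S13-S15; the 1/k contribution is an assumption, not the patented formula.
import numpy as np

nfreq = np.zeros(n_parts)                     # nfreq_parts: number of characters containing each substructure
charparts = np.zeros((n_parts, len(chars)))   # contribution matrix, n_parts rows x n_chars columns

for ia, ch in enumerate(chars):
    smallest_k = {}                           # substructure index -> smallest k in which it appears
    for k, splits in decompositions[ch].items():
        for split in splits:
            for sub in split:
                ib = part_index[sub]
                smallest_k[ib] = min(smallest_k.get(ib, k), k)
    for ib, k in smallest_k.items():
        charparts[ib, ia] = 1.0 / k           # assumed contribution degree
        nfreq[ib] += 1                        # part_ib is a substructure of one more character
```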

In this embodiment, the step S2 specifically includes:

step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedding vectors;

step S22: one-hot encoding each substructure in parts, so that part_ib is encoded as ponehot_ib and the one-hot codes of all substructures form ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};

Step S23: for the ib-th substructure, ponehot_ib is taken as the central substructure with probability f(nfreq_ib), where the probability is calculated as follows:

where min is the minimum function and alpha is a manually set parameter; a window of size t is then set, t being a manually set positive integer; taking the ib-th row of charparts as the probability distribution over characters, t characters are sampled, their character indices are mapped to substructure indices through the mapping g, and the results are placed in the window as the related substructures; in addition, r substructures are sampled at random as the unrelated substructures, r being a manually set positive integer;
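The probability formula f(nfreq_ib) is likewise an image in the source; the sketch below assumes the form min(1, alpha / nfreq_ib), which matches the stated ingredients (a min function and a manually set alpha) but is only a guess. The window and negative sampling follow the description of step S23; the hyper-parameter values are demo choices, and names continue the previous sketches.

```python
# Sketch of steps S21-S23 under the stated assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, t, r, alpha = 16, 4, 8, 3.0                  # manually set hyper-parameters (demo values)

embs1 = rng.normal(0.0, 0.1, (n_parts, m))      # substructure embedding matrices,
embs2 = rng.normal(0.0, 0.1, (n_parts, m))      # each with n_parts rows and m columns

def f(nfreq_ib: float) -> float:
    # Assumed form of the missing formula: probability of using part_ib as a central substructure.
    return min(1.0, alpha / max(nfreq_ib, 1.0))

def sample_window(ib: int):
    """Sample t related and r unrelated substructure indices for the centre ib."""
    row = charparts[ib]
    if row.sum() == 0:
        return None                                        # this substructure never forms a character
    p = row / row.sum()                                     # ib-th row of charparts as a character distribution
    char_ids = rng.choice(len(chars), size=t, p=p)
    related = [g[ia] for ia in char_ids]                    # map character indices to substructure indices via g
    unrelated = rng.integers(0, n_parts, size=r).tolist()   # r randomly drawn substructures
    return related, unrelated
```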

step S24: the computation of embedding the one-hot code into the vector by the sub-structure embedding matrix is as follows:

emb = ponehot × embs_parts

where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the resulting embedding vector; the one-hot code of the central substructure is embedded through embs1 to obtain the embedding vector emb1;
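Because ponehot is a one-hot row vector, the product ponehot × embs_parts in step S24 simply selects one row of the embedding matrix; a tiny check, continuing the previous sketches:

```python
# Step S24: embedding a one-hot code is a row lookup in the embedding matrix.
import numpy as np

ib = 0
ponehot_ib = np.eye(n_parts)[ib]      # one-hot code of the ib-th substructure
emb1 = ponehot_ib @ embs1             # emb = ponehot x embs_parts
assert np.allclose(emb1, embs1[ib])   # identical to selecting row ib directly
```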

step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedding vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedding vectors;

step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedding vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedding vectors;

step S27: loss is calculated and the network is optimized using the following formula:

where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the expression of the logsigmoid function is as follows:

logsigmoid(x) = log(1/(1 + e^(-x))), where x is the independent variable, e is the natural constant, and log is the logarithm with base e;
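The loss formula is also an image in the source. The sketch below therefore assumes the standard skip-gram negative-sampling objective, which is consistent with the ingredients described in steps S25-S27 (dot products of emb1 with the related and unrelated embeddings, two sums, and the logsigmoid): loss = -Σ_ic logsigmoid(emb2p_ic^T emb1) - Σ_id logsigmoid(-emb2n_id^T emb1), optimized here with plain SGD. This is an assumption, not a reproduction of the patented formula; names continue the previous sketches.

```python
# Assumed negative-sampling loss and a manual SGD step for steps S25-S27.
import numpy as np

def logsigmoid(x):
    # logsigmoid(x) = log(1 / (1 + e^(-x))), written in a numerically stable form
    return -np.logaddexp(0.0, -x)

def train_step(ib, related, unrelated, lr=0.05):
    emb1 = embs1[ib].copy()                     # central substructure embedding (through embs1)
    emb2p = embs2[related]                      # t related embeddings (through embs2)
    emb2n = embs2[unrelated]                    # r unrelated embeddings (through embs2)
    loss = -(logsigmoid(emb2p @ emb1).sum() + logsigmoid(-(emb2n @ emb1)).sum())
    # Gradients of the assumed objective (duplicate indices are not accumulated; fine for a sketch).
    gp = 1.0 / (1.0 + np.exp(-(emb2p @ emb1))) - 1.0   # d loss / d(emb2p @ emb1)
    gn = 1.0 / (1.0 + np.exp(-(emb2n @ emb1)))         # d loss / d(emb2n @ emb1)
    embs2[related] -= lr * np.outer(gp, emb1)
    embs2[unrelated] -= lr * np.outer(gn, emb1)
    embs1[ib] -= lr * (gp @ emb2p + gn @ emb2n)
    return loss
```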

step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed a plurality of times until the network converges, and embs1 is taken as the trained substructure embedding matrix;

step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, where row ia of embschar corresponds to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
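A compact training-loop sketch for steps S28-S29, reusing the helpers above; the epoch count stands in for "until the network converges" and is not a value from the patent.

```python
# Sketch of steps S28-S29: repeatedly traverse the substructures, then extract
# the character embedding matrix and one-hot table through the mapping g.
import numpy as np

for epoch in range(200):                           # placeholder for "until the network converges"
    for ib in range(n_parts):
        if rng.random() < f(nfreq[ib]):            # take part_ib as centre with probability f(nfreq_ib)
            sample = sample_window(ib)
            if sample is not None:
                train_step(ib, *sample)

embschar = np.stack([embs1[g[ia]] for ia in range(len(chars))])      # row ia of embschar = row g(ia) of embs1
conehots = {ia: np.eye(n_parts)[g[ia]] for ia in range(len(chars))}  # conehot_ia = ponehot_g(ia)
```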

In this embodiment, the step S3 specifically includes:

step S31: selecting a Chinese character to be coded;

step S32: using conehots to encode the Chinese character to be coded as a one-hot code;

step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.
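Finally, a sketch of step S3, encoding one character with the extracted tables (names continue the previous sketches; the character is from the toy set):

```python
# Sketch of steps S31-S33: encode a Chinese character as a low-dimensional vector.
import numpy as np

ch = "好"                                        # step S31: the Chinese character to be coded
ia = chars.index(ch)
onehot = conehots[ia]                            # step S32: its one-hot code from conehots
embedding = embschar[ia]                         # step S33: its low-dimensional embedding from embschar
assert np.allclose(embedding, onehot @ embs1)    # equivalent to multiplying the one-hot code into embs1
```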

The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
