Chinese character coding method based on character embedding

Document No.: 782690    Publication date: 2021-04-09

Note: this technology, "Chinese character coding method based on character embedding", was designed and created by 柯逍 (Ke Xiao) and 刘童安 (Liu Tong'an) on 2021-01-04. Its main content is as follows. The invention relates to a Chinese character coding method based on character embedding, which comprises the following steps. Step S1: construct a Chinese character set, decompose each character into a number of substructures, construct a substructure set, define the contribution degree of each substructure to each character, and build the matrix of substructure contribution degrees to each character from the substructure set. Step S2: construct and train a substructure embedding matrix according to the obtained substructure set and contribution degree matrix, and extract a character embedding matrix from it. Step S3: input a character and obtain its character embedding through the character embedding matrix. The invention can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.

1. A Chinese character coding method based on character embedding is characterized by comprising the following steps:

step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and constructing, from the substructure set, the matrix of substructure contribution degrees to each character;

step S2: constructing and training a substructure embedding matrix according to the obtained substructure set and the matrix of substructure contribution degrees to each character, and extracting a character embedding matrix from it;

step S3: inputting characters, and acquiring character embedding through a character embedding matrix.

2. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S1 specifically comprises:

step S11: determining the character set to be encoded, the ia-th Chinese character being char_ia; there are n_chars Chinese characters to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};

Step S12: all Chinese characters in chars are split to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;

step S13: calculating the substructure frequency table nfreq_parts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes that part_ib is a substructure of nfreq_ib characters;

step S14: because the splitting result when k = 1 is the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);

Step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

3. The method for encoding Chinese characters based on character embedding of claim 2, wherein said step S12 specifically comprises:

(1) presetting that each Chinese character can be split into k substructures;

(2) k is an integer not less than 1, and when k is 1, the split result is a character per se;

(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;

splitting all Chinese characters in chars according to (1)-(3) to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.

4. The method for encoding Chinese characters based on character embedding of claim 2, wherein said step S15 specifically comprises:

(1) when a Chinese character is split into k parts, the contribution degree of the split sub-structure to the character is

(2) When one substructure appears in a plurality of splitting results of one character at the same time, calculating the contribution degree by taking the splitting method with the minimum k;

(3) if a sub-structure cannot form a character, the contribution degree of the sub-structure to the character is 0;

calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

5. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S2 specifically comprises:

step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedding vectors;

step S22: one-hot encoding each substructure in parts, so that part_ib is encoded as ponehot_ib and the one-hot codes of all substructures form ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};

Step S23: for the ib-th substructure, ponehot_ib is taken as the central substructure with probability f(nfreq_ib), where the probability is calculated as follows:

where min is the minimum function and alpha is a manually set parameter; a window of size t is then set, t being a manually set positive integer; taking the ib-th row of charparts as the probability distribution over characters, t characters are sampled, their character indices are mapped to substructure indices through the mapping g, and the results are placed in the window as the related substructures; in addition, r substructures are sampled at random as the unrelated substructures, r being a manually set positive integer;

step S24: the computation of embedding the one-hot code into the vector by the sub-structure embedding matrix is as follows:

emb = ponehot × embs_parts

where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the resulting embedding vector; the one-hot code of the central substructure is embedded through embs1 to obtain the embedding vector emb1;

step S25: mixing t phasesSingle-hot coding of the structure of the joint is embedded through embs2 to obtain t embedded vectors emb2ps ═ emb2pic1, 2.., t }, where emb2picThe ith of the t embedded vectors;

step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedding vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedding vectors;

step S27: loss is calculated and the network is optimized using the following formula:

where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the expression of the logsigmoid function is as follows:

logsigmoid(x) = log(1/(1 + e^(-x))), where x is the independent variable, e is the natural constant, and log is the logarithm with base e;

step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed a plurality of times until the network converges, and embs1 is taken as the trained substructure embedding matrix;

step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, where row ia of embschar corresponds to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).

6. The method for encoding Chinese characters based on character embedding of claim 1, wherein said step S3 specifically comprises:

step S31: selecting a Chinese character to be coded;

step S32: using conehots to encode the Chinese character to be coded as a one-hot code;

step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.

Technical Field

The invention relates to the field of pattern recognition and computer vision, in particular to a Chinese character coding method based on character embedding.

Background

Language is one of the main ways that humans transmit information, and words are written language, which is also one of the most widespread ways that humans transmit information visually.

With the rapid development of technologies such as artificial intelligence and the Internet, automatically recognizing text in images with a computer is of great significance. For character recognition tasks, characters are usually encoded with one-hot coding; this coding is sparse and ignores the correlation between similar characters. For recognizing English letters and digits the applicability is still acceptable because the number of categories is small. For Chinese character recognition, however, there are thousands of commonly used characters, so one-hot coding slows network convergence and completely ignores the structural similarity between Chinese characters, resulting in low accuracy and low efficiency of character recognition.
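To make the sparsity point concrete, the short illustration below (not part of the patented method; the indices and embedding values are made up) shows that the one-hot codes of two structurally similar characters are always orthogonal, while low-dimensional embeddings can expose their similarity:

```python
# Illustration only: one-hot codes carry no similarity information, dense embeddings can.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

n_chars = 3500                                            # roughly the number of common Chinese characters
onehot_song = np.zeros(n_chars); onehot_song[10] = 1.0    # hypothetical index of 松
onehot_bai = np.zeros(n_chars); onehot_bai[11] = 1.0      # hypothetical index of 柏
print(cosine(onehot_song, onehot_bai))                    # 0.0: similarity is invisible

emb_song = np.array([0.8, 0.1, 0.3])    # made-up low-dimensional embeddings;
emb_bai = np.array([0.7, 0.2, 0.2])     # both characters share the 木 radical
print(cosine(emb_song, emb_bai))        # > 0: structural similarity is reflected
```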

Disclosure of Invention

In view of the above, the present invention provides a Chinese character coding method based on character embedding, which can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.

In order to achieve the purpose, the invention adopts the following technical scheme:

a Chinese character coding method based on character embedding comprises the following steps:

step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and constructing, from the substructure set, the matrix of substructure contribution degrees to each character;

step S2: constructing and training a substructure embedding matrix according to the obtained substructure set and the matrix of substructure contribution degrees to each character, and extracting a character embedding matrix from it;

step S3: inputting characters, and acquiring character embedding through a character embedding matrix.

Further, the step S1 is specifically:

step S11: determining the character set to be encoded, the ia-th Chinese character being char_ia; there are n_chars Chinese characters to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};

Step S12: all Chinese characters in chars are split to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts;

step S13: calculating the substructure frequency table nfreq_parts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes that part_ib is a substructure of nfreq_ib characters;

step S14: because the split result is character split when k is 1, chars is a subset of parts, and a mapping relation g is established, so that part is formedib=partg(ia)

Step S15: calculating the contribution degree of each substructure in parts to each character in chars to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

Further, the step S12 is specifically:

(1) presetting that each Chinese character can be split into k substructures;

(2) k is an integer not less than 1, and when k is 1, the split result is a character per se;

(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;

splitting all Chinese characters in chars according to (1)-(3) to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.

Further, the step S15 is specifically:

(1) when a Chinese character is split into k parts, the contribution degree of the split sub-structure to the character is

(2) When one substructure appears in a plurality of splitting results of one character at the same time, calculating the contribution degree by taking the splitting method with the minimum k;

(3) if a sub-structure cannot form a character, the contribution degree of the sub-structure to the character is 0;

calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.

Further, the step S2 is specifically:

step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedding vectors;

step S22: one-hot encoding each substructure in parts, so that part_ib is encoded as ponehot_ib and the one-hot codes of all substructures form ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};

Step S23: for the ib-th substructure, ponehot_ib is taken as the central substructure with probability f(nfreq_ib), where the probability is calculated as follows:

where min is the minimum function and alpha is a manually set parameter; a window of size t is then set, t being a manually set positive integer; taking the ib-th row of charparts as the probability distribution over characters, t characters are sampled, their character indices are mapped to substructure indices through the mapping g, and the results are placed in the window as the related substructures; in addition, r substructures are sampled at random as the unrelated substructures, r being a manually set positive integer;

step S24: the computation of embedding the one-hot code into the vector by the sub-structure embedding matrix is as follows:

emb = ponehot × embs_parts

where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the resulting embedding vector; the one-hot code of the central substructure is embedded through embs1 to obtain the embedding vector emb1;

step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedding vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedding vectors;

step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedding vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedding vectors;

step S27: loss is calculated and the network is optimized using the following formula:

where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the expression of the logsigmoid function is as follows:

logsigmoid(x) = log(1/(1 + e^(-x))), where x is the independent variable, e is the natural constant, and log is the logarithm with base e;

step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed a plurality of times until the network converges, and embs1 is taken as the trained substructure embedding matrix;

step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, where row ia of embschar corresponds to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).

Further, the step S3 is specifically:

step S31: selecting a Chinese character to be coded;

step S32: using conehots to encode the Chinese character to be coded as a one-hot code;

step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.

Compared with the prior art, the invention has the following beneficial effects:

The invention can effectively reduce the dimensionality of Chinese character coding, make the codes of structurally similar Chinese characters positively correlated, and effectively improve character recognition efficiency.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

Referring to FIG. 1, the present invention provides a method for encoding Chinese characters based on character embedding, comprising the following steps:

step S1: constructing a Chinese character set, decomposing each character into a plurality of substructures, constructing a substructure set, defining the contribution degree of each substructure to each character, and constructing, from the substructure set, the matrix of substructure contribution degrees to each character;

step S2: constructing and training a substructure embedding matrix according to the obtained substructure set and the matrix of substructure contribution degrees to each character, and extracting a character embedding matrix from it;

step S3: inputting characters, and acquiring character embedding through a character embedding matrix.

In this embodiment, the step S1 specifically includes:

step S11: determining the character set to be encoded, the ia-th Chinese character being char_ia; there are n_chars Chinese characters to be embedded in total, so the character set is chars = {char_ia | ia = 1, 2, ..., n_chars};

Step S12, (1) presetting that each Chinese character can be split into k substructures;

(2) k is an integer not less than 1, and when k is 1, the split result is a character per se;

(3) the maximum value of k is the number of strokes of the character or k_max, where k_max is a manually set maximum split number;

splitting all Chinese characters in chars according to (1)-(3) to obtain the set of all substructures parts = {part_ib | ib = 1, 2, ..., n_parts}, where part_ib is the ib-th substructure and n_parts is the number of elements of parts.
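A minimal sketch of steps S11-S12 follows. The decomposition table `decompositions` (here a three-character toy example) is a placeholder the patent does not specify; in practice it could be derived from IDS/radical data. All later sketches reuse the names defined here.

```python
# Sketch of steps S11-S12 under the stated assumptions.
from typing import Dict, List

k_max = 4  # manually set maximum split number (demo value)

# Hypothetical decomposition table: character -> {k: [splits, each a list of k substructures]}.
decompositions: Dict[str, Dict[int, List[List[str]]]] = {
    "好": {1: [["好"]], 2: [["女", "子"]]},
    "妈": {1: [["妈"]], 2: [["女", "马"]]},
    "吗": {1: [["吗"]], 2: [["口", "马"]]},
}

chars = list(decompositions)        # character set chars, n_chars elements
parts: List[str] = []               # set of all substructures parts
for ch in chars:
    for k, splits in decompositions[ch].items():
        for split in splits:
            for sub in split:
                if sub not in parts:
                    parts.append(sub)
n_parts = len(parts)
part_index = {p: ib for ib, p in enumerate(parts)}
# Mapping g: character index ia -> substructure index of its k = 1 split (the character itself).
g = {ia: part_index[ch] for ia, ch in enumerate(chars)}
```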

Step S13: calculating the substructure frequency table nfreq_parts = {nfreq_ib | ib = 1, 2, ..., n_parts}, where nfreq_ib denotes that part_ib is a substructure of nfreq_ib characters;

step S14: because the splitting result when k = 1 is the character itself, chars is a subset of parts, and a mapping relation g is established such that char_ia = part_g(ia);

Step S15: (1) when a Chinese character is split into k parts, the contribution degree of the split sub-structure to the character is

(2) When one substructure appears in a plurality of splitting results of one character at the same time, calculating the contribution degree by taking the splitting method with the minimum k;

(3) if a sub-structure cannot form a character, the contribution degree of the sub-structure to the character is 0;

calculating the contribution degree of each substructure in parts to each character in chars according to (1)-(3) to obtain the contribution degree matrix charparts with n_parts rows and n_chars columns.
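The contribution-degree formula itself appears only as an image in the source and is not reproduced above, so the sketch below assumes the simplest rule consistent with the text: a substructure obtained from a k-way split contributes 1/k, and the smallest applicable k is used when a substructure occurs in several splits of the same character. Names continue the previous sketch.

```python
# Sketch of steps S13-S15; the 1/k contribution is an assumption, not the patented formula.
import numpy as np

nfreq = np.zeros(n_parts)                     # nfreq_parts: number of characters containing each substructure
charparts = np.zeros((n_parts, len(chars)))   # contribution matrix, n_parts rows x n_chars columns

for ia, ch in enumerate(chars):
    smallest_k = {}                           # substructure index -> smallest k in which it appears
    for k, splits in decompositions[ch].items():
        for split in splits:
            for sub in split:
                ib = part_index[sub]
                smallest_k[ib] = min(smallest_k.get(ib, k), k)
    for ib, k in smallest_k.items():
        charparts[ib, ia] = 1.0 / k           # assumed contribution degree
        nfreq[ib] += 1                        # part_ib is a substructure of one more character
```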

In this embodiment, the step S2 specifically includes:

step S21: constructing a pair of substructure embedding matrices embs1 and embs2, where embs1 and embs2 are both matrices with n_parts rows and m columns, and m is the manually set dimension of the embedding vectors;

step S22: one-hot encoding each substructure in parts, so that part_ib is encoded as ponehot_ib and the one-hot codes of all substructures form ponehots = {ponehot_ib | ib = 1, 2, ..., n_parts};

Step S23: for the ib-th substructure, ponehot_ib is taken as the central substructure with probability f(nfreq_ib), where the probability is calculated as follows:

where min is the minimum function and alpha is a manually set parameter; a window of size t is then set, t being a manually set positive integer; taking the ib-th row of charparts as the probability distribution over characters, t characters are sampled, their character indices are mapped to substructure indices through the mapping g, and the results are placed in the window as the related substructures; in addition, r substructures are sampled at random as the unrelated substructures, r being a manually set positive integer;
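The probability formula f(nfreq_ib) is likewise an image in the source; the sketch below assumes the form min(1, alpha / nfreq_ib), which matches the stated ingredients (a min function and a manually set alpha) but is only a guess. The window and negative sampling follow the description of step S23; the hyper-parameter values are demo choices, and names continue the previous sketches.

```python
# Sketch of steps S21-S23 under the stated assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, t, r, alpha = 16, 4, 8, 3.0                  # manually set hyper-parameters (demo values)

embs1 = rng.normal(0.0, 0.1, (n_parts, m))      # substructure embedding matrices,
embs2 = rng.normal(0.0, 0.1, (n_parts, m))      # each with n_parts rows and m columns

def f(nfreq_ib: float) -> float:
    # Assumed form of the missing formula: probability of using part_ib as a central substructure.
    return min(1.0, alpha / max(nfreq_ib, 1.0))

def sample_window(ib: int):
    """Sample t related and r unrelated substructure indices for the centre ib."""
    row = charparts[ib]
    if row.sum() == 0:
        return None                                        # this substructure never forms a character
    p = row / row.sum()                                     # ib-th row of charparts as a character distribution
    char_ids = rng.choice(len(chars), size=t, p=p)
    related = [g[ia] for ia in char_ids]                    # map character indices to substructure indices via g
    unrelated = rng.integers(0, n_parts, size=r).tolist()   # r randomly drawn substructures
    return related, unrelated
```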

step S24: the computation of embedding the one-hot code into the vector by the sub-structure embedding matrix is as follows:

emb = ponehot × embs_parts

where embs_parts is a substructure embedding matrix, ponehot is the one-hot code of a substructure, and emb is the resulting embedding vector; the one-hot code of the central substructure is embedded through embs1 to obtain the embedding vector emb1;
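Because ponehot is a one-hot row vector, the product ponehot × embs_parts in step S24 simply selects one row of the embedding matrix; a tiny check, continuing the previous sketches:

```python
# Step S24: embedding a one-hot code is a row lookup in the embedding matrix.
import numpy as np

ib = 0
ponehot_ib = np.eye(n_parts)[ib]      # one-hot code of the ib-th substructure
emb1 = ponehot_ib @ embs1             # emb = ponehot x embs_parts
assert np.allclose(emb1, embs1[ib])   # identical to selecting row ib directly
```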

step S25: the one-hot codes of the t related substructures are embedded through embs2 to obtain t embedding vectors emb2ps = {emb2p_ic | ic = 1, 2, ..., t}, where emb2p_ic is the ic-th of the t embedding vectors;

step S26: the one-hot codes of the r unrelated substructures are embedded through embs2 to obtain r embedding vectors emb2ns = {emb2n_id | id = 1, 2, ..., r}, where emb2n_id is the id-th of the r embedding vectors;

step S27: loss is calculated and the network is optimized using the following formula:

where Σ_ic denotes summation over ic = 1, 2, ..., t, Σ_id denotes summation over id = 1, 2, ..., r, emb2p_ic^T is the transpose of emb2p_ic, and emb2n_id^T is the transpose of emb2n_id; the expression of the logsigmoid function is as follows:

logsigmoid(x) = log(1/(1 + e^(-x))), where x is the independent variable, e is the natural constant, and log is the logarithm with base e;
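The loss formula is also an image in the source. The sketch below therefore assumes the standard skip-gram negative-sampling objective, which is consistent with the ingredients described in steps S25-S27 (dot products of emb1 with the related and unrelated embeddings, two sums, and the logsigmoid): loss = -Σ_ic logsigmoid(emb2p_ic^T emb1) - Σ_id logsigmoid(-emb2n_id^T emb1), optimized here with plain SGD. This is an assumption, not a reproduction of the patented formula; names continue the previous sketches.

```python
# Assumed negative-sampling loss and a manual SGD step for steps S25-S27.
import numpy as np

def logsigmoid(x):
    # logsigmoid(x) = log(1 / (1 + e^(-x))), written in a numerically stable form
    return -np.logaddexp(0.0, -x)

def train_step(ib, related, unrelated, lr=0.05):
    emb1 = embs1[ib].copy()                     # central substructure embedding (through embs1)
    emb2p = embs2[related]                      # t related embeddings (through embs2)
    emb2n = embs2[unrelated]                    # r unrelated embeddings (through embs2)
    loss = -(logsigmoid(emb2p @ emb1).sum() + logsigmoid(-(emb2n @ emb1)).sum())
    # Gradients of the assumed objective (duplicate indices are not accumulated; fine for a sketch).
    gp = 1.0 / (1.0 + np.exp(-(emb2p @ emb1))) - 1.0   # d loss / d(emb2p @ emb1)
    gn = 1.0 / (1.0 + np.exp(-(emb2n @ emb1)))         # d loss / d(emb2n @ emb1)
    embs2[related] -= lr * np.outer(gp, emb1)
    embs2[unrelated] -= lr * np.outer(gn, emb1)
    embs1[ib] -= lr * (gp @ emb2p + gn @ emb2n)
    return loss
```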

step S28: based on steps S23-S27, ib = 1, 2, ..., n_parts is traversed a plurality of times until the network converges, and embs1 is taken as the trained substructure embedding matrix;

step S29: the character embedding matrix embschar is extracted from embs1 through the mapping relation g, where row ia of embschar corresponds to row g(ia) of embs1, and the character one-hot coding table conehots = {conehot_ia | ia = 1, 2, ..., n_chars} is extracted from ponehots through the mapping relation g, where conehot_ia = ponehot_g(ia).
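A compact training-loop sketch for steps S28-S29, reusing the helpers above; the epoch count stands in for "until the network converges" and is not a value from the patent.

```python
# Sketch of steps S28-S29: repeatedly traverse the substructures, then extract
# the character embedding matrix and one-hot table through the mapping g.
import numpy as np

for epoch in range(200):                           # placeholder for "until the network converges"
    for ib in range(n_parts):
        if rng.random() < f(nfreq[ib]):            # take part_ib as centre with probability f(nfreq_ib)
            sample = sample_window(ib)
            if sample is not None:
                train_step(ib, *sample)

embschar = np.stack([embs1[g[ia]] for ia in range(len(chars))])      # row ia of embschar = row g(ia) of embs1
conehots = {ia: np.eye(n_parts)[g[ia]] for ia in range(len(chars))}  # conehot_ia = ponehot_g(ia)
```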

In this embodiment, the step S3 specifically includes:

step S31: selecting a Chinese character to be coded;

step S32: using conehots to encode the Chinese character to be coded as a one-hot code;

step S33: the one-hot encoding is embedded as a low-dimensional vector using embschar.
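Finally, a sketch of step S3, encoding one character with the extracted tables (names continue the previous sketches; the character is from the toy set):

```python
# Sketch of steps S31-S33: encode a Chinese character as a low-dimensional vector.
import numpy as np

ch = "好"                                        # step S31: the Chinese character to be coded
ia = chars.index(ch)
onehot = conehots[ia]                            # step S32: its one-hot code from conehots
embedding = embschar[ia]                         # step S33: its low-dimensional embedding from embschar
assert np.allclose(embedding, onehot @ embs1)    # equivalent to multiplying the one-hot code into embs1
```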

The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.
