Tibetan ordering method based on hash function

Document No. 568419. Published 2021-05-18.

Summary: The invention "Tibetan ordering method based on hash function" (一种基于哈希函数的藏文排序方法) was created by 安见才让 and 孙琦龙 on 2021-01-28. The invention discloses a Tibetan ordering method based on a hash function. The method comprises: decomposing the Tibetan character set to be sorted into individual Tibetan characters and identifying the components of each character; constructing a Tibetan ordering model based on a hash function; calculating the hash value of each Tibetan character from its identified components and the constructed model; and sorting the Tibetan characters by their hash values to obtain a sorted Tibetan character sequence. To address the excessive complexity of prior-art methods for computing Tibetan character weights, the invention uses a hash function that gives each Tibetan character a unique hash value, so that sorting is simple and fast.

1. A Tibetan ordering method based on a hash function, characterized by comprising the following steps:

S1, decomposing the Tibetan character set to be sorted into a plurality of Tibetan characters, and identifying the components of each Tibetan character;

S2, constructing a Tibetan ordering model based on a hash function;

S3, calculating the hash value of each Tibetan character based on the components of each Tibetan character identified in step S1 and the Tibetan ordering model constructed in step S2;

S4, sorting the Tibetan characters based on the hash values calculated in step S3 to obtain a sorted Tibetan character sequence.

2. The Tibetan ordering method based on hash function as claimed in claim 1, wherein the step S1 specifically includes:

decomposing the Tibetan character set to be sorted into a plurality of Tibetan characters, wherein the i-th Tibetan character T_i is identified as {t₆, t₅, t₄, t₃, t₂, t₁, t₀}, where t₆, t₅, t₄, t₃, t₂, t₁, t₀ respectively denote the base letter, prefix, superscript, subscript, vowel, suffix and second suffix among the components of the Tibetan character, and an absent component is replaced by 'ε'.

3. The Tibetan sorting method based on the hash function as claimed in claim 2, wherein the step S2 includes the following sub-steps:

S21, assigning different bit weights to the different components of a Tibetan character;

S22, assigning different feature values to the consonants and vowels of the Tibetan alphabet;

S23, constructing the Tibetan ordering model based on the hash function from the bit weights assigned in step S21 and the feature values assigned in step S22.

4. The Tibetan ordering method based on hash function as claimed in claim 3, wherein the step S21 specifically includes:

assigning different bit weights to the different components of a Tibetan character, wherein the bit weight of the base letter is 35⁶, the bit weight of the prefix is 35⁵, the bit weight of the superscript is 35⁴, the bit weight of the subscript is 35³, the bit weight of the vowel is 35², the bit weight of the suffix is 35¹, and the bit weight of the second suffix is 35⁰.

5. The Tibetan ordering method based on hash function as claimed in claim 4, wherein the step S22 specifically includes:

integer values 0 to 34 are assigned, respectively, to the absent letter 'ε' and to the 30 consonants and 4 vowels of the Tibetan alphabet, as the corresponding feature values.

6. The Tibetan ordering method based on hash function as claimed in claim 5, wherein the step S23 specifically includes:

based on the bit weights assigned in step S21 and the feature values assigned in step S22, a Tibetan ordering model for calculating the hash value of every Tibetan character is constructed using the hash function:

$$h(T) = \sum_{j=0}^{6} f(t_j) \cdot 35^{j}$$

where h(T) is the hash value of the Tibetan character T, t_j is the j-th component of T, f(t_j) is the feature function that gives the feature value of component t_j, and 35^j is the bit weight of the j-th component.

7. The Tibetan ordering method based on hash function as claimed in claim 6, wherein the step S3 specifically includes:

calculating the hash value of each Tibetan character based on the components of each Tibetan character identified in step S1 and the Tibetan ordering model constructed in step S23:

$$h(T_i) = \sum_{j=0}^{6} f(t_{ij}) \cdot 35^{j}$$

where h(T_i) is the hash value of the i-th Tibetan character T_i, t_ij is the j-th component of the i-th Tibetan character, and f(t_ij) is the feature function that gives the feature value of component t_ij.

8. The Tibetan ordering method based on hash function as claimed in claim 7, wherein the step S1 further includes: separately preprocessing the non-standard Tibetan characters to restore them to standard Tibetan characters.

Technical Field

The invention relates to a Tibetan ordering method, in particular to a Tibetan ordering method based on a hash function.

Background

Every script has its own ordering rules, which are a shared convention among the people who use it. Dictionaries and word lists are arranged, and entries in them are looked up, according to these well-known rules. Today a computer can arrange a complex dictionary or word list and search it automatically and quickly, using the sorting rules and algorithms of the language concerned.

At present, Tibetan information processing technology lags behind Chinese information processing technology, and Chinese proofreading technology cannot be applied directly to Tibetan. One prior-art approach compares the components of two syllables one by one, in component comparison order, to decide which syllable sorts first; this requires a large number of grammatical checks, many of them nested, and the algorithm has high time complexity. Another approach assigns numerical values to each class of Tibetan symbol, determines the position of each letter step by step, identifies the letters, and finally sorts Tibetan words by the combination of numerical values of the extracted letters.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a Tibetan ordering method based on a hash function.

In order to achieve the purpose of the invention, the invention adopts the technical scheme that:

A Tibetan ordering method based on a hash function comprises the following steps:

S1, decomposing the Tibetan character set to be sorted into a plurality of Tibetan characters, and identifying the components of each Tibetan character;

S2, constructing a Tibetan ordering model based on a hash function;

S3, calculating the hash value of each Tibetan character based on the components of each Tibetan character identified in step S1 and the Tibetan ordering model constructed in step S2;

S4, sorting the Tibetan characters based on the hash values calculated in step S3 to obtain a sorted Tibetan character sequence.

The invention has the following beneficial effects: the algorithm is simple and fast. Each component of a Tibetan character can be recovered from its hash value, and operations on individual components can be carried out by adding to or subtracting from the hash value.
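Because every feature value lies between 0 and 34, the hash value behaves like a seven-digit base-35 numeral, so components can be read back by repeated division. The following is a minimal sketch under that assumption; the function names `encode`, `decode` and `set_component` and the sample feature values are hypothetical illustrations, not part of the patent text.

```python
BASE = 35       # number of feature values (0..34)
POSITIONS = 7   # t6 (base letter) down to t0 (second suffix)

def encode(features):
    """Combine feature values [f(t6), ..., f(t0)] into one hash value."""
    h = 0
    for f in features:
        h = h * BASE + f
    return h

def decode(h):
    """Recover [f(t6), ..., f(t0)] from a hash value, digit by digit."""
    digits = []
    for _ in range(POSITIONS):
        h, d = divmod(h, BASE)
        digits.append(d)   # least significant digit (t0) comes out first
    return digits[::-1]

def set_component(h, j, new_feature):
    """Replace component j by subtracting its old term and adding the new one."""
    old = (h // BASE**j) % BASE
    return h + (new_feature - old) * BASE**j

features = [3, 16, 0, 25, 0, 28, 0]   # hypothetical feature values
h = encode(features)
print(decode(h))                       # the original feature values
print(decode(set_component(h, 1, 4)))  # same character with a different suffix
```

Here `set_component` illustrates the "add or subtract" remark: changing one component only changes one term of the sum.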

Preferably, step S1 specifically includes:

decomposing the Tibetan character set to be sorted into a plurality of Tibetan characters, wherein the i-th Tibetan character T_i is identified as {t₆, t₅, t₄, t₃, t₂, t₁, t₀}, where t₆, t₅, t₄, t₃, t₂, t₁, t₀ respectively denote the base letter, prefix, superscript, subscript, vowel, suffix and second suffix among the components of the Tibetan character, and an absent component is replaced by 'ε'.

The preferred scheme has the following beneficial effects: the Tibetan character set to be sorted is decomposed into a plurality of Tibetan characters consisting of all components, so that calculation in subsequent steps is facilitated.

Preferably, step S2 includes the following substeps:

S21, assigning different bit weights to the different components of a Tibetan character;

S22, assigning different feature values to the consonants and vowels of the Tibetan alphabet;

S23, constructing the Tibetan ordering model based on the hash function from the bit weights assigned in step S21 and the feature values assigned in step S22.

The preferred scheme has the following beneficial effects: based on the grammar rules and the sorting rules of Tibetan characters, the Tibetan character sorting model is skillfully constructed by combining a Hash function, and the construction method is simple and convenient.

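Preferably, step S21 specifically includes:

assigning different bit weights to the different components of a Tibetan character, wherein the bit weight of the base letter is 35⁶, the bit weight of the prefix is 35⁵, the bit weight of the superscript is 35⁴, the bit weight of the subscript is 35³, the bit weight of the vowel is 35², the bit weight of the suffix is 35¹, and the bit weight of the second suffix is 35⁰.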

The preferred scheme has the following beneficial effects: different bit weight values are distributed to each component, so that each Tibetan character is ensured to have a unique hash value, and the repetition is avoided.
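The uniqueness claim can be checked with a short positional-numeral argument. The sketch below assumes that every feature value satisfies 0 ≤ f(t_j) ≤ 34 and that the bit weights are 35⁶ down to 35⁰; T and T' stand for two characters whose component tuples differ, and k is the highest position at which they differ, with f(t_k) > f(t'_k).

```latex
h(T) = \sum_{j=0}^{6} f(t_j)\,35^{j}, \qquad 0 \le f(t_j) \le 34

h(T) - h(T') = \bigl(f(t_k) - f(t'_k)\bigr)\,35^{k}
             + \sum_{j<k} \bigl(f(t_j) - f(t'_j)\bigr)\,35^{j}
             \ge 35^{k} - 34 \sum_{j<k} 35^{j}
             = 35^{k} - \bigl(35^{k} - 1\bigr) = 1 > 0
```

So two different component tuples never share a hash value, and comparing hash values is the same as comparing components one by one from the base letter down to the second suffix.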

Preferably, step S22 specifically includes:

integer values 0 to 34 are assigned, respectively, to the absent letter 'ε' and to the 30 consonants and 4 vowels of the Tibetan alphabet, as the corresponding feature values.

The preferred scheme has the following beneficial effects: different feature values are assigned to different letters, which facilitates calculation in the subsequent steps; the case where some components of a Tibetan character are absent is also covered, by assigning a feature value to the absent letter.

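Preferably, step S23 specifically includes:

based on the bit weights assigned in step S21 and the feature values assigned in step S22, a Tibetan ordering model for calculating the hash value of every Tibetan character is constructed using the hash function:

$$h(T) = \sum_{j=0}^{6} f(t_j) \cdot 35^{j}$$

where h(T) is the hash value of the Tibetan character T, t_j is the j-th component of T, f(t_j) is the feature function that gives the feature value of component t_j, and 35^j is the bit weight of the j-th component.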

Preferably, step S3 specifically includes:

calculating the hash value of each Tibetan character based on the components of each Tibetan character identified in step S1 and the Tibetan ordering model constructed in step S23:

$$h(T_i) = \sum_{j=0}^{6} f(t_{ij}) \cdot 35^{j}$$

where h(T_i) is the hash value of the i-th Tibetan character T_i, t_ij is the j-th component of the i-th Tibetan character, and f(t_ij) is the feature function that gives the feature value of component t_ij.

The preferred scheme has the following beneficial effects: the constructed Tibetan ordering model yields the unique hash value of each Tibetan character simply and conveniently, which facilitates the sorting in the subsequent step.

Preferably, step S1 further includes: separately preprocessing the non-standard Tibetan characters to restore them to standard Tibetan characters.

The preferred scheme has the following beneficial effects: the invention is suitable for the standard modern Tibetan characters and also suitable for the non-standard Tibetan characters, thereby enhancing the practicability of the invention.

Drawings

FIG. 1 is a flow chart of a Tibetan ordering method based on a hash function according to the present invention;

FIG. 2 is a schematic diagram of the bit weights assigned to the components of a Tibetan character in the embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Tibetan characters are formed from four vowels and thirty consonants according to the grammar rules of the "Thirty Verses" and the "word standard". Following these rules, a Tibetan character combines up to 7 letters in the order prefix, superscript, base letter, subscript, vowel, suffix and second suffix. The ordering rule used in Tibetan dictionaries is that the same letter plays a different role in different positions, just as the same digit in a number carries a different weight in different positions. The prefix, superscript, subscript, vowel, suffix and second suffix of a Tibetan character may be absent, but the base letter may not.

Based on these rules, the invention builds a hash function from the sequence values of the Tibetan letters and the weights of the 7 positions in a Tibetan character, calculates a hash value for each Tibetan character, and sorts the characters by these hash values.

Referring to fig. 1, the present invention provides a Tibetan language sorting method based on a hash function, which includes the following steps:

S1, decomposing the Tibetan character set to be sorted into a plurality of Tibetan characters T₀, T₁, T₂, …, T_n, and identifying the components of each Tibetan character;

in the embodiment of the present invention, step S1 specifically includes:

decomposing the Tibetan character set to be sorted into a plurality of Tibetan characters T₀, T₁, T₂, …, T_n, wherein the i-th Tibetan character T_i is identified as {t₆, t₅, t₄, t₃, t₂, t₁, t₀}, where t₆, t₅, t₄, t₃, t₂, t₁, t₀ respectively denote the base letter, prefix, superscript, subscript, vowel, suffix and second suffix among the components of the Tibetan character, and an absent component is replaced by 'ε'.
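One way to hold the result of step S1 is a fixed record with seven slots that default to 'ε'. The sketch below is only an illustration: the class name `Components`, the field names and the use of Wylie transliteration in place of Tibetan code points are assumptions, and the component identification itself is not shown.

```python
from typing import NamedTuple

EPSILON = "ε"  # placeholder for an absent component

class Components(NamedTuple):
    """Seven component slots of one Tibetan character, ordered t6 .. t0."""
    base: str                      # t6, never absent
    prefix: str = EPSILON          # t5
    superscript: str = EPSILON     # t4
    subscript: str = EPSILON       # t3
    vowel: str = EPSILON           # t2
    suffix: str = EPSILON          # t1
    second_suffix: str = EPSILON   # t0

# Hypothetical example, Wylie "bsgrubs": all seven slots are filled.
bsgrubs = Components(base="ga", prefix="ba", superscript="sa",
                     subscript="ra", vowel="u", suffix="ba", second_suffix="sa")

# A character consisting of the base letter only.
ka = Components(base="ka")

print(tuple(ka))
print(tuple(bsgrubs))
```

Keeping the slots in the order t₆ to t₀ means the tuple can be passed directly to a positional hash without reordering.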

S2, constructing a Tibetan ordering model based on a hash function;

in the embodiment of the present invention, step S2 includes the following sub-steps:

S21, assigning different bit weights to the different components of a Tibetan character;

in the embodiment of the present invention, step S21 specifically includes:

referring to FIG. 2, different bit weights are assigned to the different components of a Tibetan character, wherein the bit weight of the base letter is 35⁶, the bit weight of the prefix is 35⁵, the bit weight of the superscript is 35⁴, the bit weight of the subscript is 35³, the bit weight of the vowel is 35², the bit weight of the suffix is 35¹, and the bit weight of the second suffix is 35⁰, as shown in Table 1.

TABLE 1 Bit weights of Tibetan character components

Component	Symbol	Bit weight
Base letter	t₆	35⁶
Prefix	t₅	35⁵
Superscript	t₄	35⁴
Subscript	t₃	35³
Vowel	t₂	35²
Suffix	t₁	35¹
Second suffix	t₀	35⁰

S22, assigning different feature values to the consonants and vowels of the Tibetan characters;

in the embodiment of the present invention, step S22 specifically includes:

according to the conventional order of the Tibetan consonants and vowels, integer values 0 to 34 are assigned, respectively, to the absent letter 'ε' and to the 30 consonants and 4 vowels of the Tibetan alphabet as the corresponding feature values, as shown in Table 2.

TABLE 2 Tibetan letter order and assigned feature values
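A feature-value lookup for step S22 can be sketched as a dictionary. The concrete assignment below is an assumption made for illustration: 'ε' maps to 0, the 30 consonants map to 1 to 30 in traditional alphabet order, and the vowels i, u, e, o map to 31 to 34. Letters are written in Wylie transliteration rather than Tibetan code points.

```python
EPSILON = "ε"  # absent component

# 30 consonants in traditional order (Wylie transliteration).
CONSONANTS = ("ka kha ga nga ca cha ja nya ta tha da na pa pha ba ma "
              "tsa tsha dza wa zha za 'a ya ra la sha sa ha a").split()

# 4 vowel signs in traditional order.
VOWELS = "i u e o".split()

# Assumed assignment: ε -> 0, consonants -> 1..30, vowels -> 31..34.
FEATURE = {letter: value
           for value, letter in enumerate([EPSILON, *CONSONANTS, *VOWELS])}

def f(component):
    """Feature function: map one component to its integer feature value."""
    return FEATURE[component]

print(len(FEATURE), f(EPSILON), f("ka"), f("a"), f("i"), f("o"))
```

Giving 'ε' the smallest value makes a character with an absent component sort before otherwise identical characters that have a letter in that position.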

S23, constructing a Tibetan language sorting model based on the hash function based on the bit weight values distributed in the step S21 and the characteristic values distributed in the step S22.

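In the embodiment of the present invention, step S23 specifically includes:

based on the bit weights assigned in step S21 and the feature values assigned in step S22, a Tibetan ordering model for calculating the hash value of every Tibetan character is constructed using the hash function:

$$h(T) = \sum_{j=0}^{6} f(t_j) \cdot 35^{j}$$

where h(T) is the hash value of the Tibetan character T, t_j is the j-th component of T, f(t_j) is the feature function that gives the feature value of component t_j, and 35^j is the bit weight of the j-th component.

S3, calculating the hash value of each Tibetan character based on the components of each Tibetan character identified in step S1 and the Tibetan ordering model constructed in step S2;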

in the embodiment of the present invention, step S3 specifically includes:

calculating the hash value of each Tibetan character based on the components of each Tibetan character identified in step S1 and the Tibetan ordering model constructed in step S23:

$$h(T_i) = \sum_{j=0}^{6} f(t_{ij}) \cdot 35^{j}$$

where h(T_i) is the hash value of the i-th Tibetan character T_i, t_ij is the j-th component of the i-th Tibetan character, and f(t_ij) is the feature function that gives the feature value of component t_ij.
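The calculation in step S3 can be sketched as a weighted sum over the seven component slots. This is a minimal illustration, not the patented implementation: the function name `tibetan_hash`, the Wylie transliteration and the concrete feature values (ε = 0, consonants 1 to 30, vowels 31 to 34) are assumptions.

```python
EPSILON = "ε"
CONSONANTS = ("ka kha ga nga ca cha ja nya ta tha da na pa pha ba ma "
              "tsa tsha dza wa zha za 'a ya ra la sha sa ha a").split()
VOWELS = "i u e o".split()

# Assumed feature values: ε -> 0, consonants -> 1..30, vowels -> 31..34.
FEATURE = {c: v for v, c in enumerate([EPSILON, *CONSONANTS, *VOWELS])}

def tibetan_hash(components):
    """components = (t6, t5, t4, t3, t2, t1, t0): base letter first."""
    if len(components) != 7:
        raise ValueError("expected exactly 7 component slots")
    # Pair each component with its exponent j = 6, 5, ..., 0.
    return sum(FEATURE[t] * 35**j
               for j, t in zip(range(6, -1, -1), components))

ka = ("ka", EPSILON, EPSILON, EPSILON, EPSILON, EPSILON, EPSILON)
bsgrubs = ("ga", "ba", "sa", "ra", "u", "ba", "sa")

print(tibetan_hash(ka))       # 1 * 35**6 = 1838265625 under the assumed values
print(tibetan_hash(bsgrubs))
```

The largest possible value is 35⁷ − 1 = 64339296874, which does not fit in 32 bits, so a 64-bit integer is needed in languages with fixed-width integers.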

And S4, sorting the Tibetan characters based on the hash value calculated in the step S3 to obtain a sorted Tibetan character sequence.
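Step S4 then reduces to an ordinary integer sort. The sketch below uses hypothetical characters whose feature values are already known (listed from t₆ down to t₀, with ε = 0, ka = 1, kha = 2, ga = 3 and the vowel i = 31 as assumed values); the helper name `hash_from_features` is also an assumption.

```python
BASE = 35

def hash_from_features(features):
    """Weighted sum of feature values [f(t6), ..., f(t0)] with weights 35**6 .. 35**0."""
    return sum(f * BASE**j for j, f in zip(range(6, -1, -1), features))

# Hypothetical input, deliberately out of order.
characters = {
    "ga":  [3, 0, 0, 0, 0, 0, 0],
    "ki":  [1, 0, 0, 0, 31, 0, 0],
    "kha": [2, 0, 0, 0, 0, 0, 0],
    "ka":  [1, 0, 0, 0, 0, 0, 0],
}

# Compute each hash once, then sort the characters by it.
hashes = {name: hash_from_features(f) for name, f in characters.items()}
ordered = sorted(characters, key=hashes.get)

print(ordered)  # ['ka', 'ki', 'kha', 'ga']
```

Computing each hash once and sorting on the stored integers avoids the repeated component-by-component comparisons described in the background.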

To verify the accuracy of the Tibetan ordering method based on the hash function provided by the invention, an experimental corpus containing 18864 standard modern Tibetan characters and 12 non-standard Tibetan characters was collected; the experimental results are shown in Table 3.

TABLE 3 results of the experiment

Character set	Accuracy (%)	Error rate (%)
Corpus	99.9365	0.0636
Modern Tibetan characters	100	0
Non-standard Tibetan characters	0	100

The experimental results show that the method sorts standard modern Tibetan characters efficiently and correctly, whereas non-standard Tibetan characters must first be preprocessed separately before the method can sort them correctly.
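The corpus-level figures can be related to the character counts with a short calculation. It assumes, as the per-class rows of Table 3 indicate, that all 18864 standard characters were sorted correctly and all 12 non-standard characters were not.

```latex
\text{accuracy} = \frac{18864}{18864 + 12} = \frac{18864}{18876} \approx 99.936\%
\qquad
\text{error rate} = \frac{12}{18876} \approx 0.0636\%
```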

Therefore, in this embodiment of the present invention, step S1 further includes: separately preprocessing the non-standard Tibetan characters to restore them to standard modern Tibetan characters.

In summary, the invention provides a method for sorting Tibetan characters based on a hash function by researching the grammar and the character forming rule of the Tibetan characters. The method can be applied to the research and work of fields and industries such as corpus, character recognition, voice recognition, publishing printing and the like.

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
