Text processing method and device, electronic equipment and computer readable storage medium

Document No.: 1922143 Published: 2021-12-03

Reading note: this technology, "Text processing method and device, electronic equipment and computer readable storage medium", was designed by 包祖贻, 李辰 and 王睿 on 2020-05-27. Main content: The embodiments of the disclosure disclose a text processing method and device, an electronic device, and a computer-readable storage medium. The method comprises the following steps: acquiring a text object, and performing vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object; performing first feature extraction on the vectorization matrix to obtain a first feature matrix, and performing spell check based on the first feature matrix to obtain a spell check probability matrix; and performing second feature extraction on the first feature matrix to obtain a second feature matrix, and executing processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object. The technical scheme uses the spell check feature data and the spell correction feature data together, and can effectively improve text processing efficiency and save labor cost while ensuring text processing quality and improving text processing robustness.

1. A text processing method, comprising:

acquiring a text object, and performing vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object;

performing first feature extraction on the vectorization matrix to obtain a first feature matrix, and performing spell check on the basis of the first feature matrix to obtain a spell check probability matrix;

and performing second feature extraction on the first feature matrix to obtain a second feature matrix, and executing processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object.

2. The method of claim 1, wherein the obtaining a text object and vectorizing the text object to obtain a vectorized matrix corresponding to the text object are implemented as:

acquiring a text object;

performing word segmentation on the text object to obtain a word set;

and performing vectorization matrix mapping on the word set to obtain a vectorization matrix corresponding to the text object, wherein the dimension of the vectorization matrix is n × d, n is the number of words in the word set, and d is a preset vectorization dimension.

3. The method of claim 2, wherein performing the first feature extraction on the vectorization matrix to obtain the first feature matrix is implemented as:

determining a context feature extraction model;

extracting context features of the vectorization matrix by using the context feature extraction model to obtain a first feature matrix, wherein the dimension of the first feature matrix is n × h1, n is the number of words in the word set, and h1 is a preset output dimension of the context feature extraction model.

4. The method of claim 2 or 3, wherein performing the spell check based on the first feature matrix to obtain the spell check probability matrix is implemented as:

determining a spell check model;

and calculating with the first feature matrix as the input of the spell check model to obtain a spell check probability matrix, wherein the dimension of the spell check probability matrix is n × r, n is the number of words in the word set, and r is the number of spell check result categories.

5. The method of claim 2 or 3, wherein performing the second feature extraction on the first feature matrix to obtain the second feature matrix is implemented as:

determining a spelling processing feature extraction model;

and extracting spelling processing features of the first feature matrix by using the spelling processing feature extraction model to obtain a second feature matrix, wherein the dimension of the second feature matrix is n × h2, n is the number of words in the word set, and h2 is a preset output dimension of the spelling processing feature extraction model.

6. The method of claim 2 or 3, wherein executing processing based on the spell check probability matrix and the second feature matrix to obtain the processing result of the text object is implemented as:

determining a preset word list and a processing candidate probability calculation model, wherein the preset word list at least comprises the word set;

calculating with the second feature matrix as the input of the processing candidate probability calculation model to obtain a processing candidate probability matrix, wherein the processing candidate probability matrix represents, for each word in the word set, the correction candidate probability of each word in the preset word list, the dimension of the processing candidate probability matrix is n × V, n is the number of words in the word set, and V is the number of words in the preset word list;

and obtaining a processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix.

7. The method of claim 6, wherein the spell check probability matrix includes both correct and incorrect spell check results;

said deriving a processing result for said text object based on said spell check probability matrix and a processing candidate probability matrix is implemented as:

establishing a position probability matrix based on the positions of the words of the word set in the preset word list;

multiplying the position probability matrix by the probability vector corresponding to the correct spell check result in the spell check probability matrix to obtain a first text object processing probability matrix;

multiplying the processing candidate probability matrix by the probability vector corresponding to the incorrect spell check result in the spell check probability matrix to obtain a second text object processing probability matrix;

adding the first text object processing probability matrix and the second text object processing probability matrix to obtain a text object processing probability matrix, wherein row vectors of the text object processing probability matrix respectively correspond to the processing probabilities of the words in the preset word list in sequence;

and taking the word corresponding to the vector value with the maximum probability in the text object processing probability matrix row vector as the processing result of the word at the corresponding position in the text object.

8. A text processing apparatus comprising:

the vectorization processing module is configured to acquire a text object and perform vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object;

the spelling check module is configured to perform first feature extraction on the vectorization matrix to obtain a first feature matrix, and perform spelling check on the basis of the first feature matrix to obtain a spelling check probability matrix;

and the processing module is configured to perform second feature extraction on the first feature matrix to obtain a second feature matrix, and execute processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object.

9. The apparatus of claim 8, the vectorization processing module configured to:

acquiring a text object;

performing word segmentation on the text object to obtain a word set;

and performing vectorization matrix mapping on the word set to obtain a vectorization matrix corresponding to the text object, wherein the dimension of the vectorization matrix is n × d, n is the number of words in the word set, and d is a preset vectorization dimension.

10. The apparatus of claim 9, the portion of the spell check module that performs a first feature extraction on the vectorized matrix resulting in a first feature matrix configured to:

determining a context feature extraction model;

extracting context features of the vectorization matrix by using the context feature extraction model to obtain a first feature matrix, wherein the dimension of the first feature matrix is n × h1, n is the number of words in the word set, and h1 is a preset output dimension of the context feature extraction model.

11. The apparatus of claim 9 or 10, the portion of the spell-check module that performs spell-checking based on the first feature matrix resulting in a spell-check probability matrix configured to:

determining a spell check model;

and calculating with the first feature matrix as the input of the spell check model to obtain a spell check probability matrix, wherein the dimension of the spell check probability matrix is n × r, n is the number of words in the word set, and r is the number of spell check result categories.

12. The apparatus of claim 9 or 10, the portion of the processing module that performs second feature extraction on the first feature matrix resulting in a second feature matrix configured to:

determining a spelling processing feature extraction model;

and extracting spelling processing features of the first feature matrix by using the spelling processing feature extraction model to obtain a second feature matrix, wherein the dimension of the second feature matrix is n × h2, n is the number of words in the word set, and h2 is a preset output dimension of the spelling processing feature extraction model.

13. The apparatus of claim 9 or 10, the portion of the processing module performing processing based on the spell check probability matrix and the second feature matrix resulting in a processing result of the text object configured to:

determining a preset word list and a processing candidate probability calculation model, wherein the preset word list at least comprises the word set;

calculating with the second feature matrix as the input of the processing candidate probability calculation model to obtain a processing candidate probability matrix, wherein the processing candidate probability matrix represents, for each word in the word set, the correction candidate probability of each word in the preset word list, the dimension of the processing candidate probability matrix is n × V, n is the number of words in the word set, and V is the number of words in the preset word list;

and obtaining a processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix.

14. The apparatus of claim 13, the spell check probability matrix including both correct and incorrect spell check results;

the portion that derives a processing result for the text object based on the spell check probability matrix and the processing candidate probability matrix is configured to:

establishing a position probability matrix based on the positions of the words of the word set in the preset word list;

multiplying the position probability matrix by the probability vector corresponding to the correct spell check result in the spell check probability matrix to obtain a first text object processing probability matrix;

multiplying the processing candidate probability matrix by the probability vector corresponding to the incorrect spell check result in the spell check probability matrix to obtain a second text object processing probability matrix;

adding the first text object processing probability matrix and the second text object processing probability matrix to obtain a text object processing probability matrix, wherein row vectors of the text object processing probability matrix respectively correspond to the processing probabilities of the words in the preset word list in sequence;

and taking the word corresponding to the vector value with the maximum probability in the text object processing probability matrix row vector as the processing result of the word at the corresponding position in the text object.

15. An electronic device comprising a memory and a processor; wherein

the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement the method steps of any of claims 1-7.

16. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the method steps of any of claims 1-7.

Technical Field

The embodiments of the disclosure relate to the technical field of text processing, and in particular to a text processing method and device, electronic equipment and a computer-readable storage medium.

Background

When people use an input tool to enter text, input errors such as spelling errors, errors between similar-looking words, and errors between similar-sounding words often occur. Such errors can make the text easy to misunderstand and thereby reduce the effect and efficiency of written communication; in many rigorous document scenarios, such as policy documents, judicial documents, and contract texts, the tolerance for input errors is almost zero. Text therefore needs to be proofread in many cases. The prior art offers two proofreading modes, manual proofreading and machine proofreading. Manual proofreading achieves relatively high quality, but it is time-consuming, labor-intensive, and inefficient. Machine proofreading is fast, but because Chinese text expression is very complex, many spelling errors are difficult to identify accurately; in addition, the spell check feature data and the spell correction feature data used in existing machine proofreading methods are relatively independent of each other, and neither alone is enough to improve the robustness of text proofreading, so the proofreading quality is low. In summary, a text processing method is needed that can ensure both proofreading quality and proofreading efficiency while saving labor cost.

Disclosure of Invention

The embodiment of the disclosure provides a text processing method and device, electronic equipment and a computer readable storage medium.

In a first aspect, a text processing method is provided in an embodiment of the present disclosure.

Specifically, the text processing method includes:

acquiring a text object, and performing vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object;

performing first feature extraction on the vectorization matrix to obtain a first feature matrix, and performing spell check on the basis of the first feature matrix to obtain a spell check probability matrix;

and performing second feature extraction on the first feature matrix to obtain a second feature matrix, and executing processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object.

With reference to the first aspect, in a first implementation manner of the first aspect, the obtaining a text object and performing vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object is implemented as:

acquiring a text object;

performing word segmentation on the text object to obtain a word set;

and performing vectorization matrix mapping on the word set to obtain a vectorization matrix corresponding to the text object, wherein the dimension of the vectorization matrix is n × d, n is the number of words in the word set, and d is a preset vectorization dimension.
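As an illustration only, the word segmentation and vectorization matrix mapping described above might be sketched as follows. The whitespace segmenter, the randomly initialised lookup table, the zero-vector fallback, and the names segment, build_table and vectorize are all assumptions made for this sketch; the disclosure does not name a particular segmenter or embedding scheme.

```python
import random

def segment(text):
    """Toy word segmentation by whitespace (an assumption; the source names no segmenter)."""
    return text.split()

def build_table(word_list, d, seed=0):
    """Hypothetical embedding table: one fixed d-dimensional vector per known word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(d)] for w in word_list}

def vectorize(text, table, d):
    """Return the word set and its n x d vectorization matrix."""
    words = segment(text)
    unknown = [0.0] * d  # assumed fallback for words missing from the table
    return words, [list(table.get(w, unknown)) for w in words]

table = build_table(["the", "teh", "cat", "sat"], d=4)
words, X = vectorize("teh cat sat", table, d=4)
print(len(X), len(X[0]))  # n (number of words) and d (vectorization dimension)
```

Each row of X corresponds to one word of the word set, so the number of rows is n and the number of columns is the preset dimension d.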

With reference to the first aspect and the first implementation manner of the first aspect, in a second implementation manner of the first aspect, the performing the first feature extraction on the vectorization matrix to obtain a first feature matrix is implemented as:

determining a context feature extraction model;

extracting context features of the vectorization matrix by using the context feature extraction model to obtain a first feature matrix, wherein the dimension of the first feature matrix is n × h1, n is the number of words in the word set, and h1 is a preset output dimension of the context feature extraction model.
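The passage above does not fix the form of the context feature extraction model. As a minimal stand-in, assuming a sliding-window average followed by a linear projection and tanh (the function context_features, the window size, and the weights W are hypothetical), the mapping from an n × d matrix to an n × h1 matrix could look like this:

```python
import math

def context_features(X, W, window=1):
    """Map an n x d matrix to n x h1 by mixing each row with its neighbours."""
    n, d, h1 = len(X), len(X[0]), len(W[0])
    H1 = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Mean of the rows inside the window: a crude stand-in for context.
        ctx = [sum(X[j][k] for j in range(lo, hi)) / (hi - lo) for k in range(d)]
        # Linear projection d -> h1 followed by tanh.
        H1.append([math.tanh(sum(ctx[k] * W[k][c] for k in range(d))) for c in range(h1)])
    return H1

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3, d = 2
W = [[0.5, -0.5, 0.1], [0.2, 0.3, -0.4]]   # d x h1 with h1 = 3 (illustrative weights)
H1 = context_features(X, W)
print(len(H1), len(H1[0]))
```

The point of the sketch is only that each output row depends on neighbouring words as well as the word itself; a trained sequence model would normally take the place of the window average.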

With reference to the first aspect, the first implementation manner of the first aspect, and the second implementation manner of the first aspect, in a third implementation manner of the first aspect, the performing spell check based on the first feature matrix to obtain a spell check probability matrix is implemented as:

determining a spell check model;

and calculating with the first feature matrix as the input of the spell check model to obtain a spell check probability matrix, wherein the dimension of the spell check probability matrix is n × r, n is the number of words in the word set, and r is the number of spell check result categories.
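One common way to obtain a per-word probability over r categories is a linear layer followed by a softmax. The sketch below assumes that form for the spell check model; the disclosure does not state it, and the names spell_check_probs, W and b as well as the numeric values are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spell_check_probs(H1, W, b):
    """Map an n x h1 feature matrix to an n x r matrix of per-word category probabilities."""
    r = len(b)
    return [
        softmax([sum(h[k] * W[k][c] for k in range(len(h))) + b[c] for c in range(r)])
        for h in H1
    ]

H1 = [[0.2, -0.1, 0.4], [0.9, 0.3, -0.5]]    # n = 2, h1 = 3
W = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]]   # h1 x r with r = 2 (assumed weights)
b = [0.0, 0.0]
P = spell_check_probs(H1, W, b)
print(len(P), len(P[0]))
```

Every row of P sums to 1, so with r = 2 a row can be read as the probability that the word is spelled correctly and the probability that it is not.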

With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, and the third implementation manner of the first aspect, in a fourth implementation manner of the first aspect, the performing second feature extraction on the first feature matrix to obtain a second feature matrix is implemented as:

determining a spelling processing feature extraction model;

and extracting spelling processing features of the first feature matrix by using the spelling processing feature extraction model to obtain a second feature matrix, wherein the dimension of the second feature matrix is n × h2, n is the number of words in the word set, and h2 is a preset output dimension of the spelling processing feature extraction model.

With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, and the fourth implementation manner of the first aspect, in a fifth implementation manner of the first aspect, the performing processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object is implemented as:

determining a preset word list and a processing candidate probability calculation model, wherein the preset word list at least comprises the word set;

calculating with the second feature matrix as the input of the processing candidate probability calculation model to obtain a processing candidate probability matrix, wherein the processing candidate probability matrix represents, for each word in the word set, the correction candidate probability of each word in the preset word list, the dimension of the processing candidate probability matrix is n × V, n is the number of words in the word set, and V is the number of words in the preset word list;

and obtaining a processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix.
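To make the shapes in the fourth and fifth implementation manners concrete, the sketch below chains a hypothetical second feature extraction (a dense layer with tanh, n × h1 to n × h2) with a hypothetical candidate probability model (a dense layer with softmax, n × h2 to n × V). Both layer choices, the helper dense, and the weights W2 and Wv are assumptions for illustration, not the models of the disclosure.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense(X, W, activation=None):
    """Row-wise matrix product X @ W with an optional element-wise activation."""
    out = [[sum(x[k] * W[k][c] for k in range(len(x))) for c in range(len(W[0]))] for x in X]
    if activation is not None:
        out = [[activation(v) for v in row] for row in out]
    return out

def candidate_probs(H1, W2, Wv):
    """Return the n x h2 second feature matrix and the n x V candidate probability matrix."""
    H2 = dense(H1, W2, math.tanh)                   # second feature extraction (assumed form)
    return H2, [softmax(row) for row in dense(H2, Wv)]

H1 = [[0.5, -0.2], [0.1, 0.9]]                      # n = 2, h1 = 2
W2 = [[0.3, -0.6, 0.2], [0.7, 0.1, -0.4]]           # h1 x h2 with h2 = 3
Wv = [[0.2, 0.1, -0.3, 0.5],
      [0.0, 0.4, 0.6, -0.2],
      [-0.1, 0.3, 0.2, 0.1]]                        # h2 x V with V = 4
H2, C = candidate_probs(H1, W2, Wv)
print(len(C), len(C[0]))
```

Row i of C is a distribution over the V words of the preset word list for the i-th word of the word set.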

With reference to the first aspect, the first implementation manner of the first aspect, the second implementation manner of the first aspect, the third implementation manner of the first aspect, the fourth implementation manner of the first aspect, and the fifth implementation manner of the first aspect, in a sixth implementation manner of the first aspect, the spell check probability matrix includes two spell check results, namely correct and incorrect;

said deriving a processing result for said text object based on said spell check probability matrix and a processing candidate probability matrix is implemented as:

establishing a position probability matrix based on the positions of the words of the word set in the preset word list;

multiplying the position probability matrix by the probability vector corresponding to the correct spell check result in the spell check probability matrix to obtain a first text object processing probability matrix;

multiplying the processing candidate probability matrix by the probability vector corresponding to the incorrect spell check result in the spell check probability matrix to obtain a second text object processing probability matrix;

adding the first text object processing probability matrix and the second text object processing probability matrix to obtain a text object processing probability matrix, wherein row vectors of the text object processing probability matrix respectively correspond to the processing probabilities of the words in the preset word list in sequence;

and taking the word corresponding to the vector value with the maximum probability in the text object processing probability matrix row vector as the processing result of the word at the corresponding position in the text object.
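The combination steps above can be sketched as follows. The sketch assumes that the position probability matrix is one-hot (a 1 at the column of each input word in the preset word list), that the multiplications are row-wise scalings, and that each row of the spell check probability matrix is ordered as (correct, incorrect); the function name combine and the numbers are illustrative.

```python
def combine(words, word_list, check_probs, cand_probs):
    """Blend 'keep the word' and 'replace the word' distributions and pick the best word.

    check_probs[i] = (p_correct, p_incorrect) for word i (assumed column order).
    cand_probs[i]  = correction candidate distribution over word_list for word i.
    """
    index = {w: j for j, w in enumerate(word_list)}
    V = len(word_list)
    result = []
    for w, (p_ok, p_err), cand in zip(words, check_probs, cand_probs):
        position = [0.0] * V
        position[index[w]] = 1.0  # assumed one-hot row of the position probability matrix
        # First matrix row (p_ok * position) plus second matrix row (p_err * cand).
        final = [p_ok * position[j] + p_err * cand[j] for j in range(V)]
        best = max(range(V), key=final.__getitem__)
        result.append(word_list[best])
    return result

word_list = ["the", "teh", "cat", "sat"]
words = ["teh", "cat"]
check_probs = [(0.1, 0.9), (0.95, 0.05)]
cand_probs = [[0.8, 0.05, 0.1, 0.05], [0.1, 0.1, 0.7, 0.1]]
print(combine(words, word_list, check_probs, cand_probs))  # ['the', 'cat']
```

For the first word the incorrect probability dominates, so the candidate distribution decides and "the" is selected; for the second word the correct probability dominates, so the original word "cat" is kept. Because the preset word list contains the word set, the lookup index[w] always succeeds.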

In a second aspect, a text processing apparatus is provided in an embodiment of the present disclosure.

Specifically, the text processing apparatus includes:

the vectorization processing module is configured to acquire a text object and perform vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object;

the spelling check module is configured to perform first feature extraction on the vectorization matrix to obtain a first feature matrix, and perform spelling check on the basis of the first feature matrix to obtain a spelling check probability matrix;

and the processing module is configured to perform second feature extraction on the first feature matrix to obtain a second feature matrix, and execute processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object.

With reference to the second aspect, in a first implementation manner of the second aspect, the vectorization processing module is configured to:

acquiring a text object;

performing word segmentation on the text object to obtain a word set;

and performing vectorization matrix mapping on the word set to obtain a vectorization matrix corresponding to the text object, wherein the dimension of the vectorization matrix is n × d, n is the number of words in the word set, and d is a preset vectorization dimension.

With reference to the second aspect and the first implementation manner of the second aspect, in a second implementation manner of the second aspect, the portion of the spell checking module that performs first feature extraction on the vectorization matrix to obtain a first feature matrix is configured to:

determining a context feature extraction model;

extracting context features of the vectorization matrix by using the context feature extraction model to obtain a first feature matrix, wherein the dimension of the first feature matrix is n × h1, n is the number of words in the word set, and h1 is a preset output dimension of the context feature extraction model.

With reference to the second aspect, the first implementation manner of the second aspect, and the second implementation manner of the second aspect, in a third implementation manner of the second aspect, the portion of the spell checking module that performs spell check based on the first feature matrix to obtain a spell check probability matrix is configured to:

determining a spell check model;

and calculating with the first feature matrix as the input of the spell check model to obtain a spell check probability matrix, wherein the dimension of the spell check probability matrix is n × r, n is the number of words in the word set, and r is the number of spell check result categories.

With reference to the first implementation manner of the second aspect, the second implementation manner of the second aspect, and the third implementation manner of the second aspect, in a fourth implementation manner of the second aspect, the portion of the processing module that performs second feature extraction on the first feature matrix to obtain a second feature matrix is configured to:

determining a spelling processing feature extraction model;

and extracting spelling processing features of the first feature matrix by using the spelling processing feature extraction model to obtain a second feature matrix, wherein the dimension of the second feature matrix is n × h2, n is the number of words in the word set, and h2 is a preset output dimension of the spelling processing feature extraction model.

With reference to the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, and the fourth implementation manner of the second aspect, in a fifth implementation manner of the second aspect, the portion of the processing module that executes processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object is configured to:

determining a preset word list and a processing candidate probability calculation model, wherein the preset word list at least comprises the word set;

calculating with the second feature matrix as the input of the processing candidate probability calculation model to obtain a processing candidate probability matrix, wherein the processing candidate probability matrix represents, for each word in the word set, the correction candidate probability of each word in the preset word list, the dimension of the processing candidate probability matrix is n × V, n is the number of words in the word set, and V is the number of words in the preset word list;

and obtaining a processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix.

With reference to the second aspect, the second implementation manner of the second aspect, the third implementation manner of the second aspect, the fourth implementation manner of the second aspect, and the fifth implementation manner of the second aspect, in a sixth implementation manner of the second aspect, the spell check probability matrix includes two spell check results, namely correct and incorrect;

the portion that derives a processing result for the text object based on the spell check probability matrix and the processing candidate probability matrix is configured to:

establishing a position probability matrix based on the positions of the words of the word set in the preset word list;

multiplying the position probability matrix by the probability vector corresponding to the correct spell check result in the spell check probability matrix to obtain a first text object processing probability matrix;

multiplying the processing candidate probability matrix by the probability vector corresponding to the incorrect spell check result in the spell check probability matrix to obtain a second text object processing probability matrix;

adding the first text object processing probability matrix and the second text object processing probability matrix to obtain a text object processing probability matrix, wherein row vectors of the text object processing probability matrix respectively correspond to the processing probabilities of the words in the preset word list in sequence;

and taking the word corresponding to the vector value with the maximum probability in the text object processing probability matrix row vector as the processing result of the word at the corresponding position in the text object.

In a third aspect, the disclosed embodiments provide an electronic device including a memory and a processor, wherein the memory is configured to store one or more computer instructions that support a text processing apparatus in executing the above text processing method, and the processor is configured to execute the computer instructions stored in the memory. The text processing apparatus may further include a communication interface through which it communicates with other devices or a communication network.

In a fourth aspect, the disclosed embodiments provide a computer-readable storage medium storing computer instructions used by a text processing apparatus, including the computer instructions involved in executing the above text processing method.

The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:

according to the technical scheme, feature extraction is performed twice on the text object to be processed: spell check is first performed using the result of the first feature extraction, and text processing is then performed by combining the spell check result with the result of the second feature extraction to obtain a text processing result. The technical scheme uses the spell check feature data and the spell correction feature data together, and can effectively improve text processing efficiency and save labor cost while ensuring text processing quality and improving text processing robustness.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the disclosure.

Drawings

Other features, objects, and advantages of embodiments of the disclosure will become more apparent from the following detailed description of non-limiting embodiments when taken in conjunction with the accompanying drawings. In the drawings:

FIG. 1 shows a flow diagram of a text processing method according to an embodiment of the present disclosure;

FIG. 2 illustrates an overall flow diagram of a text processing method according to an embodiment of the present disclosure;

FIG. 3 shows a block diagram of a text processing apparatus according to an embodiment of the present disclosure;

FIG. 4 is a schematic block diagram of a computer system suitable for use in implementing a text processing method according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, exemplary embodiments of the disclosed embodiments will be described in detail with reference to the accompanying drawings so that they can be easily implemented by those skilled in the art. Also, for the sake of clarity, parts not relevant to the description of the exemplary embodiments are omitted in the drawings.

In the disclosed embodiments, it is to be understood that terms such as "including" or "having," etc., are intended to indicate the presence of the disclosed features, numbers, steps, behaviors, components, parts, or combinations thereof, and are not intended to preclude the possibility that one or more other features, numbers, steps, behaviors, components, parts, or combinations thereof may be present or added.

It should be further noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

According to the technical scheme provided by the embodiments of the disclosure, feature extraction is performed twice on the text object to be processed: spell check is first performed using the result of the first feature extraction, and text processing is then performed by combining the spell check result with the result of the second feature extraction to obtain a text processing result. The technical scheme uses the spell check feature data and the spell correction feature data together, and can effectively improve text processing efficiency and save labor cost while ensuring text processing quality and improving text processing robustness.

Fig. 1 shows a flowchart of a text processing method according to an embodiment of the present disclosure, which is applied to a text processing server, and as shown in fig. 1, the text processing method includes the following steps S101 to S103:

in step S101, a text object is obtained, and vectorization processing is performed on the text object to obtain a vectorization matrix corresponding to the text object;

in step S102, performing first feature extraction on the vectorization matrix to obtain a first feature matrix, and performing spell check based on the first feature matrix to obtain a spell check probability matrix;

in step S103, a second feature extraction is performed on the first feature matrix to obtain a second feature matrix, and processing is performed based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object.

As mentioned above, when people enter text with an input tool, input errors such as spelling errors, errors of similar shape, and errors of similar pronunciation often occur. Such errors make the text easy to misunderstand, which affects the effect and efficiency of written communication; in many strict text scenarios, such as policy, judicial, and contract texts, the tolerance for input errors is almost zero. Text therefore often needs to be proofread. The prior art offers two modes, manual proofreading and machine proofreading. Manual proofreading yields relatively high quality, but it is time-consuming and labor-intensive, and its efficiency is low. Machine proofreading is fast, but because Chinese text is expressed in very complex ways, many spelling errors are difficult to identify accurately. In addition, the spell check feature data and the spell correction feature data used in existing machine proofreading methods are relatively independent of each other, and neither alone is sufficient to make text proofreading robust, so the proofreading quality is low. In summary, a text processing method is needed that ensures both proofreading quality and proofreading efficiency while saving labor cost.

In view of the above problem, in this embodiment, a text processing method is proposed, which obtains a text processing result by performing two feature extractions on a text object to be processed, first performing a spell check using a result obtained by the first feature extraction, and then performing text processing by combining a result of the spell check and a result obtained by the second feature extraction. The technical scheme comprehensively uses the spell checking characteristic data and the spell correction characteristic data, and can effectively improve the text processing efficiency and save the labor cost on the premise of ensuring the text processing quality and improving the text processing robustness.

In an embodiment of the present disclosure, the text object refers to a text, especially a Chinese text, which may contain spelling errors and needs to be checked and corrected. The text object may be a currently input text or a text obtained from other text sources.

In an embodiment of the present disclosure, the vectorization process refers to a process of converting a non-vector into a vector, for example, a text may be converted into a vectorization matrix by using a vectorization matrix mapping method.

In an embodiment of the present disclosure, the first feature refers to a feature for performing spell checking subsequently, and the second feature refers to a feature for performing text processing subsequently, where both the first feature and the second feature may be taken as context features, and of course, other text features may also be used, and the present disclosure is not limited to the specific expression of the first feature and the second feature.

In an embodiment of the present disclosure, the step S101, namely, obtaining a text object, and performing vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object, may be implemented as:

acquiring a text object;

performing word segmentation on the text object to obtain a word set;

and performing vectorization matrix mapping on the word set to obtain a vectorization matrix corresponding to the text object, wherein the dimension of the vectorization matrix can be represented as n x d, n is the number of words in the word set, and d is a preset vectorization dimension.

In order to more conveniently perform effective spell checking and accurate word correction and the like on the text object, in this embodiment, vectorization processing needs to be performed on the text object to convert it into a vector form.

First, the acquired text object is segmented to obtain a word set. In the prior art, spell checking and proofreading for Western European languages such as English are relatively mature, and are mainly performed at word granularity. Chinese differs greatly from such languages: a Chinese text has no explicit word boundaries, and a text containing misspellings easily causes a word segmentation system to segment incorrectly, so that errors accumulate and become more serious. Spell checking on Chinese text therefore needs to be performed at character granularity, that is, the text object is first segmented with a single character as the unit, so as to obtain a word set that corresponds to the text object and consists of the segmented characters.

Then, vectorization matrix mapping is performed on the word set, that is, each element of the word set is mapped to a vector representation through the vectorization matrix, so as to obtain the vectorization matrix corresponding to the text object, i.e., the vectorization result of the text object. The dimension of the vectorization matrix may be n × d, where n is the number of words in the word set and d is a preset vectorization dimension; d may be determined according to the requirements of the actual application and the scale of the word set, which is not specifically limited by the present disclosure.

For example, suppose a certain text object glosses, unit by unit, as "I surprise day 11 point, eat KFC". It may first be segmented to obtain the segmentation result "I | surprise | day | 11 | point | , | eat | KFC", which yields a word set consisting of 8 units: {I, surprise, day, 11, point, ",", eat, KFC}. Each element of the word set is then mapped to a vector representation through the vectorization matrix, and an 8 × d vectorization matrix corresponding to the text object is obtained.

In an embodiment of the present disclosure, the step of performing the first feature extraction on the vectorization matrix in the step S102 to obtain a first feature matrix may be implemented as:

determining a context feature extraction model;

extracting context features of the vectorization matrix by using the context feature extraction model to obtain a first feature matrix, wherein the dimension of the first feature matrix is n × h1, n is the number of words in the word set, and h1 is a preset output dimension of the context feature extraction model.

In this embodiment, the first feature is taken as a context feature, and therefore, when the first feature matrix is obtained, a context feature extraction model is first determined; and then, extracting context features of the vectorization matrix by using the context feature extraction model to finally obtain a first feature matrix, wherein the dimension of the first feature matrix can be represented as n × h1, n is still the number of words in the word set, h1 is a preset output dimension of the context feature extraction model, and h1 is determined according to the needs of practical application and the scale of the word set, which is not specifically limited by the present disclosure.

In an embodiment of the present disclosure, the context feature extraction model may be a bidirectional long short-term memory network (BiLSTM).
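To illustrate what "bidirectional" context features mean in terms of shapes, the sketch below runs a recurrent cell over the sequence in both directions and concatenates the two results. It deliberately swaps in a plain tanh recurrent cell instead of an LSTM cell to stay short, and the weights are random rather than trained; the function names and the choice h1 = 2 × hidden are assumptions made for illustration only.

```python
import math
import random


def rnn_pass(X, W_in, W_rec, reverse=False):
    """Run a plain tanh recurrent cell over the rows of X (n x d).

    Returns n hidden states, aligned with the input positions."""
    hidden = len(W_rec)
    h = [0.0] * hidden
    states = []
    for x in (reversed(X) if reverse else X):
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_in[j], x))
                + sum(w * hi for w, hi in zip(W_rec[j], h))
            )
            for j in range(hidden)
        ]
        states.append(h)
    if reverse:
        states.reverse()
    return states


def random_weights(rows, cols, rng):
    """Untrained weights, used only so that the sketch runs."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def bidirectional_features(X, hidden, seed=0):
    """Concatenate left-to-right and right-to-left states: n x (2 * hidden)."""
    rng = random.Random(seed)
    d = len(X[0])
    fwd = rnn_pass(X, random_weights(hidden, d, rng), random_weights(hidden, hidden, rng))
    bwd = rnn_pass(X, random_weights(hidden, d, rng), random_weights(hidden, hidden, rng),
                   reverse=True)
    return [f + b for f, b in zip(fwd, bwd)]


X = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4]]   # n = 3, d = 2 (made-up values)
F = bidirectional_features(X, hidden=3)
print(len(F), len(F[0]))                     # prints: 3 6, so h1 = 2 * hidden here
```

Because the backward pass starts from the end of the sequence, the feature row for the first position also depends on the later positions, which is what allows a misspelled character to be judged against both its left and right context.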

In an embodiment of the present disclosure, the step of performing spell check based on the first feature matrix to obtain a spell check probability matrix in the step S102 may be implemented as:

determining a spell check model;

and calculating by taking the first characteristic matrix as the input of the spell checking model to obtain a spell checking probability matrix, wherein the dimensionality of the spell checking probability matrix is n x r, n is the number of the words in the word set, and r is the number of the spell checking result categories.

In this embodiment, spell checking is performed using the first feature matrix. First, a spell check model is determined; the first feature matrix is then fed into the spell check model as its input, and a spell check probability matrix is calculated. The dimension of the spell check probability matrix is n × r, where n is still the number of words in the word set and r is the number of spell check result categories. For example, if the spell check results are divided into two categories, spelled correctly and spelled incorrectly, then r is 2. In that case, each row of the spell check probability matrix gives the probabilities that one word in the word set is spelled correctly and incorrectly; the first column gives, for every word in the word set, the probability that it is spelled correctly, and the second column gives, for every word, the probability that it is spelled incorrectly.

In one embodiment of the present disclosure, the spell check model may be a computational model consisting of a feed-forward neural network and a softmax function.
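A spell check model of this kind can be sketched as a single linear layer followed by a row-wise softmax, mapping an n × h1 feature matrix to an n × r probability matrix. The single-layer structure, the function names, and the random untrained weights below are assumptions for illustration; the disclosure does not fix the depth or parameters of the network.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spell_check_head(F, W, b):
    """Apply one linear layer (h1 -> r) and a row-wise softmax.

    F is n x h1, W is r x h1, b has length r; the result is n x r."""
    return [
        softmax([sum(w * f for w, f in zip(W[k], row)) + b[k] for k in range(len(W))])
        for row in F
    ]


rng = random.Random(0)
h1, r = 4, 2
F = [[rng.uniform(-1, 1) for _ in range(h1)] for _ in range(3)]   # stand-in first feature matrix
W = [[rng.uniform(-1, 1) for _ in range(h1)] for _ in range(r)]   # untrained weights (assumption)
b = [0.0] * r
S = spell_check_head(F, W, b)
p_c = [row[0] for row in S]   # column 0: probability that each unit is spelled correctly
p_w = [row[1] for row in S]   # column 1: probability that each unit is misspelled
```

Since each row is a softmax output, the two probabilities for every unit sum to 1. The same layer shape, with V outputs instead of r, would fit the processing candidate probability calculation model described later in this disclosure.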

In an embodiment of the present disclosure, the context feature extraction model and the spell check model may further be trained with a back-propagation algorithm based on the output of the spell check model, namely the spell check probability matrix, so as to achieve better optimization and obtain a target spell check probability matrix.

In an embodiment of the present disclosure, the step of performing the second feature extraction on the first feature matrix in step S103 to obtain a second feature matrix may be implemented as:

determining a spelling processing feature extraction model;

and extracting spelling processing characteristics of the first characteristic matrix by using the spelling processing characteristic extraction model to obtain a second characteristic matrix, wherein the dimension of the second characteristic matrix is n × h2, n is the number of words in the word set, and h2 is the preset output dimension of the spelling processing characteristic extraction model.

In this embodiment, the second feature is obtained based on the first feature, and similar to the first feature, the second feature may also be a context feature, so when obtaining the second feature matrix, it is necessary to first determine a spelling processing feature extraction model, i.e. a second feature extraction model; and then, the spelling processing feature extraction model is used for extracting spelling processing features based on the first feature matrix to finally obtain a second feature matrix, wherein the dimension of the second feature matrix can be represented as n × h2, n is still the number of words in the word set, h2 is a preset output dimension of the spelling processing feature extraction model, and similar to d and h1, h2 can also be determined according to the needs of practical application and the scale of the word set, which is not specifically limited by the disclosure.

In an embodiment of the present disclosure, the spelling processing feature extraction model may also be a bidirectional long short-term memory network (BiLSTM). Combining the first feature extraction model with the spelling processing feature extraction model is equivalent to stacking a spelling processing feature extraction model on the output of the first feature extraction model to realize a second round of feature extraction.

In an embodiment of the present disclosure, the step of performing processing based on the spell check probability matrix and the second feature matrix in the step S103 to obtain a processing result of the text object may be implemented as:

determining a preset word list and a processing candidate probability calculation model, wherein the preset word list at least comprises the word set;

calculating by taking the second feature matrix as an input of the processing candidate probability calculation model to obtain a processing candidate probability matrix, wherein the processing candidate probability matrix is used for representing a corresponding correction candidate probability of the words in the word set in the preset word list, the dimensionality of the processing candidate probability matrix is n x V, n is the number of the words in the word set, and V is the number of the words in the preset word list;

and obtaining a processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix.

In order to improve the quality of text processing and enhance its robustness, in this embodiment the processing operation uses the spell check feature data and the spell correction feature data jointly. Specifically, a preset word list and a processing candidate probability calculation model are first determined. The preset word list is a predetermined word list that includes at least the word set; for example, it may be a list containing both words likely to appear as errors and words likely to appear as corrections, a relatively complete list of words, or a list of words obtained from historical correction data. Then, the second feature matrix is fed into the processing candidate probability calculation model as its input, and a processing candidate probability matrix is calculated. The processing candidate probability matrix represents, for each word in the word set, the correction candidate probability of each word in the preset word list, so its dimension is n × V, where n is the number of words in the word set and V is the number of words in the preset word list. Finally, the processing result of the text object is obtained based on the spell check probability matrix and the processing candidate probability matrix.

In an embodiment of the present disclosure, the processing candidate probability calculation model may be a computational model consisting of a feed-forward neural network and a softmax function.

In an embodiment of the present disclosure, the context feature extraction model, the spell check model, the spelling processing feature extraction model, and the processing candidate probability calculation model may further be trained with a back-propagation algorithm based on the output of the processing candidate probability calculation model, namely the processing candidate probability matrix, so as to achieve better optimization and obtain a target processing candidate probability matrix.

In one embodiment of the present disclosure, the spell check probability matrix includes both correct and incorrect spell check results;

in this embodiment, the step of obtaining the processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix may be implemented as:

establishing a position probability matrix based on the position relation of the characters between the character set and a preset word list;

multiplying the position probability matrix with a probability vector corresponding to a correct spelling check result in the spelling check probability matrix to obtain a first text object processing probability matrix;

multiplying the processing candidate probability matrix with a probability vector corresponding to a misspell check result in the spell check probability matrix to obtain a second text object processing probability matrix;

adding the first text object processing probability matrix and the second text object processing probability matrix to obtain a text object processing probability matrix, wherein row vectors of the text object processing probability matrix respectively correspond to the processing probabilities of the words in the preset word list in sequence;

and taking the word corresponding to the vector value with the maximum probability in the text object processing probability matrix row vector as the processing result of the word at the corresponding position in the text object.

In this embodiment, a position probability matrix O is first established based on the positional relationship of the words between the word set and the preset word list. The position probability matrix represents the probability that each word in the word set exists at each position in the preset word list. For example, in each row of the position probability matrix, the element at the position that the word corresponding to that row occupies in the preset word list is 1, and the elements at all other positions are 0. The dimension of the position probability matrix is therefore n × V, where n is still the number of words in the word set and V is the number of words in the preset word list. When the spell check probability matrix includes both correct and incorrect spell check results, the correction result of the text object is obtained as follows. First, the position probability matrix is multiplied by the probability vector corresponding to the correct spell check result in the spell check probability matrix, that is, each row is scaled by the probability for the corresponding word, to obtain a first text object processing probability matrix. Then, the processing candidate probability matrix is multiplied in the same way by the probability vector corresponding to the incorrect spell check result in the spell check probability matrix to obtain a second text object processing probability matrix. The first and second text object processing probability matrices are added to obtain a text object processing probability matrix P, in which the elements of each row vector correspond, in order, to the processing probabilities of the words in the preset word list. Finally, for each row vector of P, the word corresponding to the element with the maximum probability is taken as the processing result of the word at the corresponding position in the text object.
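Written element by element, the combination described above takes the following form. The index notation is introduced here only for illustration and is an assumption: i runs over the n words of the word set, j runs over the V entries of the preset word list, and p_{c,i} and p_{w,i} are the correct and incorrect spelling probabilities of word i taken from the spell check probability matrix.

```latex
P_{ij} = p_{c,i}\, O_{ij} + p_{w,i}\, C_{ij},
\qquad
\hat{y}_i = \arg\max_{1 \le j \le V} P_{ij}
```

Under the additional assumptions that r = 2 (so that p_{c,i} + p_{w,i} = 1), that each row of O contains exactly one 1, and that each row of C is a softmax output, every row of P sums to p_{c,i} + p_{w,i} = 1 and is therefore itself a probability distribution over the preset word list.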

Fig. 2 shows an overall flowchart of a text processing method according to an embodiment of the present disclosure, and as shown in fig. 2, a text object to be processed is first obtained, and then vectorized, so as to obtain a vectorization matrix corresponding to the text object; extracting context characteristics of the vectorization matrix by using a preset context characteristic extraction model to obtain a first characteristic matrix; performing spell check by taking the first characteristic matrix as the input of a preset spell check model to obtain a spell check probability matrix, namely a spell check result; spelling processing feature extraction is carried out on the first feature matrix to obtain a second feature matrix, and the second feature matrix is used as the input of a preset processing candidate probability calculation model to be calculated to obtain a processing candidate probability matrix; and finally, obtaining a processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix.

The technical solution of the present disclosure is explained and illustrated below by taking a specific example as an example.

Assume that the input text object is "flat fruit", a misspelling of "apple", where each English gloss stands for a single character of the Chinese text. Assume also that the word list V includes only the three characters glossed as apple, flat, and fruit: [apple, flat, fruit], so its size is 3; here the gloss "apple" in the word list denotes the first character of the word "apple". The text object is first segmented to obtain the segmentation result "flat | fruit", which yields a word set consisting of 2 units: [flat, fruit]. Vectorization matrix mapping is performed on the word set; since n = 2 and, in this example, the vectorization dimension d = 1, an n × d = 2 × 1 vectorization matrix [[flat], [fruit]] corresponding to the text object "flat fruit" is obtained.

A bidirectional long short-term memory network (BiLSTM) is used as the context feature extraction model to extract context features from the vectorization matrix and obtain the first feature matrix. The first feature matrix is used as the input of a spell check model consisting of a feed-forward neural network and a softmax function. If the spell check results are divided into two categories, spelled correctly and spelled incorrectly, i.e., r = 2, an n × r = 2 × 2 spell check probability matrix [[0.1, 0.9], [0.95, 0.05]] is obtained. The first row [0.1, 0.9] indicates that the character "flat" is spelled correctly with probability 0.1 and incorrectly with probability 0.9; the second row [0.95, 0.05] indicates that the character "fruit" is spelled correctly with probability 0.95 and incorrectly with probability 0.05. The first column [0.1, 0.95] is the vector p_c of probabilities that "flat" and "fruit" are spelled correctly, and the second column [0.9, 0.05] is the vector p_w of probabilities that they are spelled incorrectly.

Spelling processing feature extraction is performed on the first feature matrix to obtain the second feature matrix, which is used as the input of a processing candidate probability calculation model consisting of a feed-forward neural network and a softmax function, yielding an n × V = 2 × 3 processing candidate probability matrix C = [[0.9, 0.1, 0.0], [0.1, 0.1, 0.8]]. The first row [0.9, 0.1, 0.0] gives the processing candidate probabilities of the character "flat" over the word list V: [apple = 0.9, flat = 0.1, fruit = 0.0]. The second row [0.1, 0.1, 0.8] gives the processing candidate probabilities of the character "fruit" over the word list V: [apple = 0.1, flat = 0.1, fruit = 0.8].

A position probability matrix O = [[0, 1, 0], [0, 0, 1]] is established based on the positional relationship of the characters between the word set [flat, fruit] and the preset word list [apple, flat, fruit]. The first row [0, 1, 0] gives the position existence probabilities of the character "flat" in the word list V, and the second row [0, 0, 1] gives those of the character "fruit"; that is, "flat" is located at the position of the second entry of the word list V, and "fruit" at the position of the third entry. The text object processing probability matrix P, which indicates the processing result of the text object, can then be calculated by the following formula:

P=O*p_c+C*p_w

=[[0,1,0],[0,0,1]]*[0.1,0.95]+[[0.9,0.1,0.0],[0.1,0.1,0.8]]*[0.9,0.05]

=[[0,1*0.1,0],[0,0,1*0.95]]+[[0.9*0.9,0.1*0.9,0.0],[0.1*0.05,0.1*0.05,0.8*0.05]]

=[[0,0.1,0],[0,0,0.95]]+[[0.81,0.09,0.0],[0.005,0.005,0.04]]

=[[0.81,0.19,0],[0.005,0.005,0.99]].

In the text object processing probability matrix P, the maximum value in the first row vector is 0.81, which corresponds to the character "apple", and the maximum value in the second row vector is 0.99, which corresponds to the character "fruit". These are the processing results for the characters "flat" and "fruit" at the corresponding positions in the text object, so the processing result of the text object "flat fruit" is "apple".
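The arithmetic of this example can be reproduced with a few lines of code. The sketch below takes the matrices given in the example as fixed inputs and only performs the combination and the row-wise maximum; the function names are hypothetical, and the English glosses stand in for the individual characters.

```python
def position_matrix(units, vocab):
    """One-hot n x V matrix: row i has a 1 at the word-list index of unit i."""
    return [[1.0 if v == u else 0.0 for v in vocab] for u in units]


def combine(O, C, p_c, p_w):
    """P[i][j] = p_c[i] * O[i][j] + p_w[i] * C[i][j]."""
    return [
        [pc * o + pw * c for o, c in zip(o_row, c_row)]
        for o_row, c_row, pc, pw in zip(O, C, p_c, p_w)
    ]


def decode(P, vocab):
    """Pick, for each row, the word-list entry with the largest probability."""
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in P]


vocab = ["apple", "flat", "fruit"]          # glosses of the three word-list characters
units = ["flat", "fruit"]                   # the segmented input
S = [[0.1, 0.9], [0.95, 0.05]]              # spell check probability matrix from the example
C = [[0.9, 0.1, 0.0], [0.1, 0.1, 0.8]]      # processing candidate probability matrix from the example

O = position_matrix(units, vocab)
P = combine(O, C, [row[0] for row in S], [row[1] for row in S])
print([[round(x, 3) for x in row] for row in P])   # [[0.81, 0.19, 0.0], [0.005, 0.005, 0.99]]
print(decode(P, vocab))                             # ['apple', 'fruit']
```

The first row shows the intended behavior: although "flat" is a valid entry of the word list, its low probability of being spelled correctly (0.1) lets the candidate distribution dominate, so the character is replaced; for "fruit" the high probability of being spelled correctly (0.95) keeps the original character.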

The above technical solution can be used in many scenarios, such as:

1. The solution can be used to check and proofread documents that have been published, are to be published, or have just been completed, so as to avoid text input and usage errors, thereby improving the writing quality of the documents and greatly reducing the cost of manual checking and proofreading. Specifically, vectorization processing is first performed on the document to obtain a vectorization matrix corresponding to the document; first feature extraction is then performed on the vectorization matrix to obtain a first feature matrix, and spell checking is performed based on the first feature matrix to obtain a spell check probability matrix; second feature extraction is performed on the first feature matrix to obtain a second feature matrix; finally, the spell check probability matrix and the second feature matrix are combined to obtain a check result for the document. The document can also be corrected according to the check result.

2. In the education field, automatic checking of homework and automatic correction of compositions can be realized. In this scenario, the word list can be determined from the correct homework content; once the word list is determined, the technical solution described above can be used to check and correct student homework and compositions automatically and effectively, which can greatly reduce teachers' workload while ensuring the accuracy of checking and correction.

3. Text processing services can be provided. For example, a text processing interface can be offered through a medium such as a cloud server, so that users obtain processing results such as text checking and proofreading immediately, simply by calling the interface and submitting the text object to be processed. Users do not need to spend time and manpower developing the related technology, which saves their time and labor cost; the service is particularly convenient for users with limited research and development capability.

The technical solution of the present disclosure can also be applied to other application scenarios, and the above are only illustrative examples and illustrations, and should not be construed as limitations of the technical solution of the present disclosure.

The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods.

Fig. 3 shows a block diagram of a text processing apparatus according to an embodiment of the present disclosure, which may be implemented as part or all of an electronic device, and may be implemented as a text processing server, through software, hardware, or a combination of both. As shown in fig. 3, the text processing apparatus includes:

the vectorization processing module 301 is configured to obtain a text object, and perform vectorization processing on the text object to obtain a vectorization matrix corresponding to the text object;

a spell check module 302 configured to perform a first feature extraction on the vectorization matrix to obtain a first feature matrix, and perform a spell check based on the first feature matrix to obtain a spell check probability matrix;

and the processing module 303 is configured to perform second feature extraction on the first feature matrix to obtain a second feature matrix, and perform processing based on the spell check probability matrix and the second feature matrix to obtain a processing result of the text object.
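The way the three modules hand data to one another can be sketched as a small container class. This is only an illustration of the data flow under assumed interfaces: the class name, field names, and type signatures are hypothetical, and the toy callables at the end are placeholders rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Matrix = List[List[float]]


@dataclass
class TextProcessingApparatus:
    """Chains the three modules; each field is a placeholder callable."""
    vectorize: Callable[[str], Matrix]                        # module 301: text -> n x d
    spell_check: Callable[[Matrix], Tuple[Matrix, Matrix]]    # module 302: n x d -> (n x h1, n x r)
    process: Callable[[Matrix, Matrix], List[str]]            # module 303: (n x h1, n x r) -> result

    def run(self, text: str) -> List[str]:
        vectors = self.vectorize(text)
        first_features, check_probs = self.spell_check(vectors)
        return self.process(first_features, check_probs)


# Toy stand-ins that only demonstrate the data flow, not real models.
toy = TextProcessingApparatus(
    vectorize=lambda text: [[float(ord(ch))] for ch in text],
    spell_check=lambda X: (X, [[1.0, 0.0] for _ in X]),
    process=lambda F, S: [chr(int(row[0])) for row in F],
)
print(toy.run("abc"))   # ['a', 'b', 'c']
```

The spell check module returns both the first feature matrix and the spell check probability matrix because the processing module needs the former for the second feature extraction and the latter for the final combination.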

As mentioned above, when people enter text with an input tool, input errors such as spelling errors, errors of similar shape, and errors of similar pronunciation often occur. Such errors make the text easy to misunderstand, which affects the effect and efficiency of written communication; in many strict text scenarios, such as policy, judicial, and contract texts, the tolerance for input errors is almost zero. Text therefore often needs to be proofread. The prior art offers two modes, manual proofreading and machine proofreading. Manual proofreading yields relatively high quality, but it is time-consuming and labor-intensive, and its efficiency is low. Machine proofreading is fast, but because Chinese text is expressed in very complex ways, many spelling errors are difficult to identify accurately. In addition, the spell check feature data and the spell correction feature data used in existing machine proofreading methods are relatively independent of each other, and neither alone is sufficient to make text proofreading robust, so the proofreading quality is low. In summary, a text processing method is needed that ensures both proofreading quality and proofreading efficiency while saving labor cost.

In view of the above problem, in this embodiment, a text processing apparatus is proposed that obtains a text processing result by performing feature extraction twice for a text object to be processed, first performing spell check using a result obtained by the first feature extraction, and then performing text processing by combining a result of the spell check and a result obtained by the second feature extraction. The technical scheme comprehensively uses the spell checking characteristic data and the spell correction characteristic data, and can effectively improve the text processing efficiency and save the labor cost on the premise of ensuring the text processing quality and improving the text processing robustness.

In an embodiment of the present disclosure, the text object refers to a text, especially a Chinese text, which may contain spelling errors and needs to be checked and corrected. The text object may be a currently input text or a text obtained from other text sources.

In an embodiment of the present disclosure, the vectorization process refers to a process of converting a non-vector into a vector, for example, a text may be converted into a vectorization matrix by using a vectorization matrix mapping method.

In an embodiment of the present disclosure, the first feature refers to a feature for performing spell checking subsequently, and the second feature refers to a feature for performing text processing subsequently, where both the first feature and the second feature may be taken as context features, and of course, other text features may also be used, and the present disclosure is not limited to the specific expression of the first feature and the second feature.

In an embodiment of the present disclosure, the vectorization processing module 301 may be configured to:

acquiring a text object;

performing word segmentation on the text object to obtain a word set;

and performing vectorization matrix mapping on the word set to obtain a vectorization matrix corresponding to the text object, wherein the dimension of the vectorization matrix can be represented as n x d, n is the number of words in the word set, and d is a preset vectorization dimension.

In order to more conveniently perform effective spell checking and accurate word correction and the like on the text object, in this embodiment, vectorization processing needs to be performed on the text object to convert it into a vector form.

First, the acquired text object is segmented to obtain a word set. In the prior art, spell checking and proofreading for Western European languages such as English is relatively mature and is mainly performed at word granularity. Chinese, however, differs greatly from Western European languages such as English: a Chinese text has no word delimiters, so a text containing misspellings easily causes errors in a word segmentation system, and these errors then accumulate in later steps. Spell checking of a Chinese text therefore needs to be performed at character granularity, that is, the text object is first segmented with a single character as the unit, so as to obtain a word set that corresponds to the text object and consists of the segmented units.

Then, vectorization matrix mapping is performed on the word set, that is, each element of the word set is mapped to a vector representation, so as to obtain the vectorization matrix corresponding to the text object, i.e., the vectorization result of the text object. The dimension of the vectorization matrix may be expressed as n × d, where n is the number of words in the word set and d is a preset vectorization dimension; d may be determined according to the requirements of the actual application and the scale of the word set, which is not specifically limited by the present disclosure.

For example, if a certain text object is "I am 11 surprise, eat KFC", it may first be segmented to obtain the segmentation result "I | surprise | day | 11 | point | , | eat | KFC", which yields a word set consisting of 8 units: {I, surprise, day, 11, point, ",", eat, KFC}. Each element of the word set is then mapped to a vector representation through the vectorization matrix, so that an 8 × d vectorization matrix corresponding to the text object is obtained.
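As a rough illustration of the segmentation and mapping steps, the following Python sketch splits a short input into units and looks each unit up in a randomly initialised table. The input string, the segmentation rule (single characters, with runs of digits or Latin letters kept whole), the dimension d = 4 and all function names are assumptions made for illustration only; the disclosure does not prescribe them, and a practical system would use learned vectors rather than random ones.

```python
import random
import re

def segment(text):
    """Split text into units: digit runs and Latin-letter runs stay whole,
    every other non-space character becomes its own unit (an assumed rule)."""
    return re.findall(r"[A-Za-z]+|\d+|\S", text)

def build_table(units, d, seed=0):
    """Hypothetical lookup table: one random d-dimensional vector per distinct unit."""
    rng = random.Random(seed)
    return {u: [rng.uniform(-1.0, 1.0) for _ in range(d)] for u in sorted(set(units))}

def vectorize(units, table):
    """Stack the vectors of the units into an n x d matrix (a list of n rows)."""
    return [table[u] for u in units]

text = "我今天11点，吃KFC"  # illustrative input, not taken from the disclosure
units = segment(text)
table = build_table(units, d=4)
X = vectorize(units, table)
print(len(units), "units ->", len(X), "x", len(X[0]), "matrix")
```

With this input the sketch produces 8 units, so X has the n × d shape described above with n = 8.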

In an embodiment of the present disclosure, the part of the spell check module 302 that performs the first feature extraction on the vectorization matrix to obtain a first feature matrix may be configured to:

determining a context feature extraction model;

extracting context features of the vectorization matrix by using the context feature extraction model to obtain a first feature matrix, wherein the dimension of the first feature matrix is n × h1, n is the number of words in the word set, and h1 is a preset output dimension of the context feature extraction model.

In this embodiment, the first feature is taken as a context feature, and therefore, when the first feature matrix is obtained, a context feature extraction model is first determined; and then, extracting context features of the vectorization matrix by using the context feature extraction model to finally obtain a first feature matrix, wherein the dimension of the first feature matrix can be represented as n × h1, n is still the number of words in the word set, h1 is a preset output dimension of the context feature extraction model, and h1 is determined according to the needs of practical application and the scale of the word set, which is not specifically limited by the present disclosure.

In an embodiment of the present disclosure, the context feature extraction model may be a bidirectional long short-term memory network (BiLSTM).
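To make the n × d to n × h1 mapping concrete, the sketch below runs a minimal bidirectional LSTM forward pass in plain Python. It is only a sketch under stated assumptions: the weights are random and untrained, the hidden size per direction (3, giving h1 = 6) is arbitrary, and concatenating the forward and backward hidden states is the usual BiLSTM construction rather than something the disclosure spells out. All function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_params(input_dim, hidden_dim, rng):
    """Random (untrained) weights for one direction: one matrix and bias per gate."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    # Gates: input (i), forget (f), output (o), candidate cell state (g).
    return {g: (mat(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim) for g in "ifog"}

def affine(W, b, v):
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_pass(xs, params, hidden_dim):
    """Run one LSTM direction over the sequence and return the hidden state per step."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    outputs = []
    for x in xs:
        z = x + h  # concatenation [x_t ; h_(t-1)]
        i = [sigmoid(v) for v in affine(*params["i"], z)]
        f = [sigmoid(v) for v in affine(*params["f"], z)]
        o = [sigmoid(v) for v in affine(*params["o"], z)]
        g = [math.tanh(v) for v in affine(*params["g"], z)]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        outputs.append(h)
    return outputs

def bilstm(xs, fwd, bwd, hidden_dim):
    """Concatenate left-to-right and right-to-left hidden states for every position."""
    forward = lstm_pass(xs, fwd, hidden_dim)
    backward = lstm_pass(xs[::-1], bwd, hidden_dim)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

rng = random.Random(0)
n, d, hidden = 8, 4, 3
X = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]  # stand-in for the vectorization matrix
H1 = bilstm(X, init_params(d, hidden, rng), init_params(d, hidden, rng), hidden)
print(len(H1), "x", len(H1[0]))
```

Each row of H1 depends on the units both before and after its position, which is why a bidirectional network suits context feature extraction.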

In an embodiment of the present disclosure, the part of the spell check module 302 that performs spell check based on the first feature matrix to obtain a spell check probability matrix may be configured to:

determining a spell check model;

and performing calculation with the first feature matrix as the input of the spell check model to obtain a spell check probability matrix, wherein the dimension of the spell check probability matrix is n × r, n is the number of words in the word set, and r is the number of spell check result categories.

In this embodiment, the spell check is performed using the first feature matrix. First, a spell check model is determined; the first feature matrix is then fed into the spell check model as its input, and the calculation yields a spell check probability matrix. The dimension of the spell check probability matrix is n × r, where n is still the number of words in the word set and r is the number of spell check result categories. For example, if the spell check results are divided into two categories, spelled correctly and spelled incorrectly, then r is 2. In that case, the elements of each row of the spell check probability matrix represent the probabilities that the corresponding word in the word set is spelled correctly and incorrectly, respectively: the elements of the first column represent the probability that each word in the word set is spelled correctly, and the elements of the second column represent the probability that each word is spelled incorrectly.

In one embodiment of the present disclosure, the spell check model may be a computational model consisting of a feedforward neural network and a softmax function.
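The following Python sketch shows one way such a model could turn an n × h1 feature matrix into an n × r probability matrix. The single tanh hidden layer, its width, and the random untrained weights are assumptions for illustration; the disclosure only states that a feedforward network is followed by softmax.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over one row of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def spell_check_probs(H1, W1, b1, W2, b2):
    """Map an n x h1 feature matrix to an n x r matrix whose rows sum to 1."""
    probs = []
    for row in H1:
        hidden = [math.tanh(sum(w * x for w, x in zip(ws, row)) + b) for ws, b in zip(W1, b1)]
        logits = [sum(w * x for w, x in zip(ws, hidden)) + b for ws, b in zip(W2, b2)]
        probs.append(softmax(logits))
    return probs

rng = random.Random(0)
n, h1, hidden, r = 3, 4, 5, 2  # r = 2: column 0 = spelled correctly, column 1 = spelled incorrectly
H1 = [[rng.uniform(-1.0, 1.0) for _ in range(h1)] for _ in range(n)]
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(h1)] for _ in range(hidden)]
b1 = [0.0] * hidden
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(r)]
b2 = [0.0] * r
S = spell_check_probs(H1, W1, b1, W2, b2)
for row in S:
    print([round(p, 3) for p in row])
```

Because softmax is applied row by row, each word receives its own distribution over the r result categories.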

In an embodiment of the present disclosure, the context feature extraction model and the spell check model may further be trained with a back propagation algorithm based on the output of the spell check model, namely the spell check probability matrix, so as to achieve better optimization and obtain a target spell check probability matrix.

In an embodiment of the present disclosure, the part of the processing module 303 that performs second feature extraction on the first feature matrix to obtain a second feature matrix may be configured to:

determining a spelling processing feature extraction model;

and extracting spelling processing features from the first feature matrix by using the spelling processing feature extraction model to obtain a second feature matrix, wherein the dimension of the second feature matrix is n × h2, n is the number of words in the word set, and h2 is a preset output dimension of the spelling processing feature extraction model.

In this embodiment, the second feature is obtained based on the first feature, and similar to the first feature, the second feature may also be a context feature, so when obtaining the second feature matrix, it is necessary to first determine a spelling processing feature extraction model, i.e. a second feature extraction model; and then, the spelling processing feature extraction model is used for extracting spelling processing features based on the first feature matrix to finally obtain a second feature matrix, wherein the dimension of the second feature matrix can be represented as n × h2, n is still the number of words in the word set, h2 is a preset output dimension of the spelling processing feature extraction model, and similar to d and h1, h2 can also be determined according to the needs of practical application and the scale of the word set, which is not specifically limited by the disclosure.

In an embodiment of the present disclosure, the spelling processing feature extraction model may also be a bidirectional long short-term memory network (BiLSTM). Combining the first feature extraction model with the spelling processing feature extraction model is equivalent to stacking a spelling processing feature extraction model on the output of the first feature extraction model, so as to realize secondary feature extraction.

In an embodiment of the present disclosure, the portion of the processing module 303 that performs processing based on the spell check probability matrix and the second feature matrix to obtain the processing result of the text object may be configured to:

determining a preset word list and a processing candidate probability calculation model, wherein the preset word list at least comprises the word set;

performing calculation with the second feature matrix as the input of the processing candidate probability calculation model to obtain a processing candidate probability matrix, wherein the processing candidate probability matrix represents, for the words in the word set, the corresponding correction candidate probabilities in the preset word list, the dimension of the processing candidate probability matrix is n × V, n is the number of words in the word set, and V is the number of words in the preset word list;

and obtaining a processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix.

In order to improve the quality of text processing and enhance its robustness, in this embodiment the processing operation uses the spell check feature data and the spell correction feature data in combination. Specifically, a preset word list and a processing candidate probability calculation model are first determined, where the preset word list is a predefined list that includes at least the word set, for example a list that contains both words likely to appear as errors and words likely to appear as correct forms, a relatively complete word list, or a word list obtained from historical correction data. Then, the second feature matrix is fed into the processing candidate probability calculation model as its input, and the calculation yields a processing candidate probability matrix, which represents, for the words in the word set, the corresponding correction candidate probabilities in the preset word list; the dimension of the processing candidate probability matrix is therefore n × V, where n is the number of words in the word set and V is the number of words in the preset word list. Finally, the processing result of the text object is obtained based on the spell check probability matrix and the processing candidate probability matrix.

In an embodiment of the present disclosure, the processing candidate probability calculation model may be a computational model consisting of a feedforward neural network and a softmax function.

In an embodiment of the present disclosure, the context feature extraction model, the spell check model, the spelling processing feature extraction model and the processing candidate probability calculation model may further be trained with a back propagation algorithm based on the output of the processing candidate probability calculation model, namely the processing candidate probability matrix, so as to achieve better optimization and obtain a target processing candidate probability matrix.

In one embodiment of the present disclosure, the spell check probability matrix includes both correct and incorrect spell check results.

In this embodiment, the part that obtains the processing result of the text object based on the spell check probability matrix and the processing candidate probability matrix may be configured to:

establishing a position probability matrix based on the positional relationship between the words in the word set and a preset word list;

multiplying the position probability matrix with a probability vector corresponding to a correct spelling check result in the spelling check probability matrix to obtain a first text object processing probability matrix;

multiplying the processing candidate probability matrix with a probability vector corresponding to an incorrect spell check result in the spell check probability matrix to obtain a second text object processing probability matrix;

adding the first text object processing probability matrix and the second text object processing probability matrix to obtain a text object processing probability matrix, wherein row vectors of the text object processing probability matrix respectively correspond to the processing probabilities of the words in the preset word list in sequence;

and taking the word corresponding to the vector value with the maximum probability in the text object processing probability matrix row vector as the processing result of the word at the corresponding position in the text object.

In this embodiment, a position probability matrix O is first established based on the positional relationship between the words in the word set and the preset word list. The position probability matrix represents the probability that each word in the word set is present at the corresponding position in the preset word list; for example, in each row of the position probability matrix, the element at the position that the word corresponding to that row occupies in the preset word list is 1, and all other elements are 0. The dimension of the position probability matrix is therefore n × V, where n is still the number of words in the word set and V is the number of words in the preset word list. When the spell check probability matrix includes both correct and incorrect spell check results, the correction result of the text object is obtained as follows. First, the position probability matrix is multiplied by the probability vector corresponding to the correct spell check result in the spell check probability matrix to obtain a first text object processing probability matrix. Then, the processing candidate probability matrix is multiplied by the probability vector corresponding to the incorrect spell check result in the spell check probability matrix to obtain a second text object processing probability matrix. The first and second text object processing probability matrices are added to obtain a text object processing probability matrix P, in which the elements of each row vector correspond, in order, to the processing probabilities of the words in the preset word list. Finally, for each row vector of the text object processing probability matrix, the word corresponding to the element with the maximum probability is taken as the processing result of the word at the corresponding position in the text object.
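The combination step can be sketched in Python as follows. The sketch assumes that "multiplying a matrix by a probability vector" means scaling row i of the n × V matrix by the i-th entry of the n-dimensional vector, which is the reading consistent with the stated dimensions. The three-word list, the probability values and the function names are invented for illustration.

```python
def position_matrix(units, vocab):
    """One-hot n x V matrix O: row i is 1 at the word-list position of unit i, 0 elsewhere."""
    index = {w: j for j, w in enumerate(vocab)}
    O = [[0.0] * len(vocab) for _ in units]
    for i, u in enumerate(units):
        O[i][index[u]] = 1.0
    return O

def combine(O, cand, S):
    """P[i][j] = O[i][j] * p_correct[i] + cand[i][j] * p_wrong[i] (assumed row-wise scaling)."""
    return [
        [o * p_ok + c * p_err for o, c in zip(o_row, c_row)]
        for o_row, c_row, (p_ok, p_err) in zip(O, cand, S)
    ]

def decode(P, vocab):
    """Pick, for every position, the word-list entry with the highest probability."""
    return [vocab[max(range(len(row)), key=row.__getitem__)] for row in P]

vocab = ["A", "B", "C"]          # hypothetical preset word list (V = 3)
units = ["B", "C"]               # hypothetical word set (n = 2)
S = [[0.10, 0.90],               # spell check probabilities: [correct, incorrect]
     [0.95, 0.05]]
cand = [[0.8, 0.1, 0.1],         # processing candidate probabilities over the word list
        [0.2, 0.2, 0.6]]

O = position_matrix(units, vocab)
P = combine(O, cand, S)
result = decode(P, vocab)
print(result)  # ['A', 'C']
```

In this toy case the first unit is judged likely to be misspelled, so the candidate distribution dominates and "B" is replaced by "A"; the second unit is judged correct, so the one-hot row dominates and "C" is kept. Because each row of O and of the candidate matrix sums to 1 and the two spell check probabilities sum to 1, every row of P is again a probability distribution.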

An embodiment of the present disclosure also discloses an electronic device, which includes a memory and a processor, wherein

the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to perform any of the method steps described above.

FIG. 4 is a schematic block diagram of a computer system suitable for use in implementing a text processing method according to an embodiment of the present disclosure.

As shown in FIG. 4, the computer system 400 includes a processing unit 401 that can execute the various processes of the above-described embodiments according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data necessary for the operation of the system 400. The processing unit 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 408 as needed. The processing unit 401 may be implemented as a CPU, a GPU, a TPU, an FPGA, an NPU, or another processing unit.

In particular, according to embodiments of the present disclosure, the methods described above may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the text processing method. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units or modules described in the embodiments of the present disclosure may be implemented by software or hardware. The units or modules described may also be provided in a processor, and the names of the units or modules do not in some cases constitute a limitation of the units or modules themselves.

As another aspect, the disclosed embodiment also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the foregoing embodiment; or it may be a separate computer readable storage medium not incorporated into the device. The computer readable storage medium stores one or more programs for use by one or more processors in performing the methods described in the embodiments of the present disclosure.

The foregoing description is only of the preferred embodiments of the disclosure and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure are also encompassed.
