Disordered code identification method and device for document content and electronic equipment

Document No.: 1567735 Publication date: 2020-01-24

Reading note: This technology, "Disordered code identification method and device for document content and electronic equipment" (文档内容的乱码识别方法、装置及电子设备), was designed by 冷志峰 and 张作兵 on 2018-07-17. Main content: The embodiments of the invention provide a method, a device, and equipment for identifying messy codes in document content. The document to be processed is analyzed to obtain its character analysis result, and each piece of font information is extracted from that result. According to each piece of font information, the corresponding font library is loaded for the text in the document to be processed and the text is encoded, yielding coded text data. When uncommon words exist in the coded text data, the occupancy rate of uncommon words in the document to be processed is calculated from the word number of uncommon words in the coded text data and the word number of characters in the document to be processed. Whether the occupancy rate of uncommon words in the target text is greater than a first preset threshold value is then judged; if so, the coded text data is identified as containing messy code text data. This automatically identifies whether the coded text data contains messy code text data and makes messy code identification of document content more convenient.

1. A method for recognizing a messy code of document contents is characterized by comprising the following steps:

analyzing a document to be processed to obtain a character analysis result of the document to be processed;

extracting each font information in the character analysis result of the document to be processed;

according to the font information, loading the font library corresponding to the font information for the text in the document to be processed, and encoding the text to obtain coded text data;

when rare words exist in the coded text data, calculating the occupancy rate of the rare words in the document to be processed based on the word number of the rare words in the coded text data and the word number of the words in the document to be processed;

judging whether the occupancy rate of the uncommon word in the document to be processed is greater than a first preset threshold value;

and if so, identifying that the coded text data contains messy code text data.

2. The method according to claim 1, wherein before the loading and encoding a font library corresponding to the font information into a text in a document to be processed according to the respective font information to obtain encoded text data, the method further comprises:

classifying all texts in the document to be processed according to the font information, and respectively taking the texts corresponding to the font information as target texts;

determining the word number of the target characters in each target text;

the loading and coding a font library corresponding to the font information into a text in a document to be processed according to the font information to obtain coded text data comprises the following steps:

loading a corresponding font library for each target text according to the font information, and coding to obtain coded text data corresponding to the target text;

when rare words exist in the encoded text data, the occupancy rate of rare words in the document to be processed is calculated based on the word number of rare words in the encoded text data and the word number of words in the document to be processed, and the occupancy rate comprises the following steps:

judging whether the coded text data contains uncommon words or not;

if the rare words exist, calculating the occupancy rate of the rare words in the target text according to the word number of the rare words and the word number of the target words in each target text;

the step of judging whether the occupancy rate of the uncommon word in the target text is greater than a first preset threshold value comprises the following steps:

judging whether the occupancy rate of the uncommon word in the target text is greater than a first preset threshold value;

if so, identifying that the coded text data contains messy code text data, including:

and if the occupancy rate of the uncommon word in the target text is greater than a first preset threshold value, recognizing that the encoded text data contains messy code text data.

3. The method of claim 2, wherein the step of determining whether the encoded text data contains uncommon words comprises:

acquiring the word frequency of each target character from a pre-stored word frequency table;

and if the word frequency of the target word is lower than a preset word frequency threshold value of the uncommon word, determining that the target word is the uncommon word.

4. The method of claim 2, wherein after determining that the occupancy rate of the uncommon word in the target text is not greater than a first preset threshold, the method further comprises:

judging whether the occupancy rate of the uncommon word in the target text is smaller than a second preset threshold value, wherein the second preset threshold value is smaller than the first preset threshold value;

if the occupancy rate of the uncommon word in the target text is smaller than a second preset threshold value, recognizing that no messy code text data exists in the coded text data;

if the occupancy rate of the uncommon word in the target text is not less than the second preset threshold value, outputting to the user a prompt asking whether the document needs to be repaired;

and after receiving a document repairing instruction input by a user, determining that the coded text data contains messy code text data.

5. An apparatus for recognizing a scrambled code of a document content, the apparatus comprising:

the analysis module is used for analyzing the document to be processed to obtain a character analysis result of the document to be processed;

the extraction module is used for extracting each font information in the character analysis result of the document to be processed;

the coding module is used for loading the font library corresponding to the font information for the text in the document to be processed according to the font information, and encoding the text to obtain coded text data;

the identification module is used for calculating the occupancy rate of the uncommon word in the document to be processed based on the word number of the uncommon word in the encoded text data and the word number of the word in the document to be processed when the uncommon word exists in the encoded text data; judging whether the occupancy rate of the uncommon word in the document to be processed is greater than a first preset threshold value; and if so, identifying that the coded text data contains messy code text data.

6. The apparatus of claim 5, further comprising: a classification module;

the classification module is used for classifying all texts in the document to be processed according to the font information, and respectively taking the texts corresponding to the font information as target texts; determining the word number of the target characters in each target text;

the encoding module is specifically configured to:

loading a corresponding font library for each target text according to the font information, and coding to obtain coded text data corresponding to the target text;

the identification module is specifically configured to:

judging whether the coded text data contains uncommon words or not;

judging whether the occupancy rate of the uncommon word in the target text is greater than a first preset threshold value or not, and if the occupancy rate of the uncommon word in the target text is greater than the first preset threshold value, identifying that the encoded text data contains messy code text data.

7. The apparatus according to claim 6, wherein the identification module is specifically configured to:

acquiring the word frequency of each target character from a pre-stored word frequency table;

and if the word frequency of the target word is lower than a preset word frequency threshold value of the uncommon word, determining that the target word is the uncommon word.

8. The apparatus of claim 6, wherein the identification module is further configured to:

if the occupancy rate of the uncommon word in the target text is smaller than a second preset threshold value, recognizing that no messy code text data exists in the coded text data;

if the occupancy rate of the uncommon word in the target text is not less than the second preset threshold value, outputting to the user a prompt asking whether the document needs to be repaired;

and after receiving a document repairing instruction input by a user, determining that the coded text data contains messy code text data.

9. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the bus; a memory for storing a computer program; a processor for executing a program stored in the memory to perform the method steps of any of claims 1 to 4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 4.

Technical Field

The invention relates to the technical field of word processing, in particular to a method and a device for identifying messy codes of document contents and electronic equipment.

Background

Messy code (garbled text) means that document content is displayed on the electronic device not as the correct characters but as other, meaningless characters. When the text content of a document is displayed, the electronic device uses the font information of the text content, together with information about the font library matching that font information, to look up the matching font library on the device and load it for the text content, so that the correct text is displayed. If the matching font library is missing from the electronic device, the text content is parsed with a non-matching font library, which produces messy code. Because the set of font libraries stored on an electronic device is limited, documents frequently become garbled when the font library matching the font information of the document content cannot be loaded.

To resolve messy codes caused by a mismatch between the font library and the font information of the document content, the user currently has to inspect the document visually to decide whether messy codes are present, and then decide from that result whether to repair the document content. Identifying messy codes manually in this way is inconvenient.

Disclosure of Invention

The embodiments of the invention aim to provide a method, a device, and equipment for identifying messy codes in document content, so that messy codes caused by a mismatch between the font library and the font information of the document content are identified automatically and messy code identification of document content becomes more convenient. The specific technical scheme is as follows:

in a first aspect, an embodiment of the present invention provides a method for recognizing a scrambled code of a document content, where the method includes:

analyzing the document to be processed to obtain a character analysis result of the document to be processed;

extracting each font information in the character analysis result of the document to be processed;

according to each font information, loading the font library corresponding to the font information for the text in the document to be processed and encoding the text to obtain coded text data;

when the rarely-used words exist in the coded text data, calculating to obtain the occupancy rate of the rarely-used words in the document to be processed based on the word number of the rarely-used words in the coded text data and the word number of the words in the document to be processed;

judging whether the occupancy rate of the uncommon words in the document to be processed is greater than a first preset threshold value;

and if so, identifying that the text data after encoding contains messy code text data.

In a second aspect, an embodiment of the present invention provides a device for recognizing a scrambled code of document content, where the device includes:

the analysis module is used for analyzing the document to be processed to obtain a character analysis result of the document to be processed;

the extraction module is used for extracting each font information in the character analysis result of the document to be processed;

the encoding module is used for loading the corresponding font library for the text in the document to be processed according to each font information and encoding the text to obtain encoded text data;

the identification module is used for calculating the occupancy rate of the uncommon words in the document to be processed based on the word number of the uncommon words in the encoded text data and the word number of the words in the document to be processed when the rarely used words exist in the encoded text data; judging whether the occupancy rate of the uncommon words in the document to be processed is greater than a first preset threshold value; and if so, identifying that the text data after encoding contains messy code text data.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the bus; the memory is used for storing a computer program; and the processor is used for executing the program stored in the memory to implement the steps of the method for identifying the messy codes of the document contents provided by the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for recognizing a scrambled code of document content provided in the first aspect.

According to the method, the device, and the equipment for identifying messy codes of document content provided by the embodiments of the invention, the document to be processed is analyzed to obtain its character analysis result, and each piece of font information is extracted from that result. According to each piece of font information, the corresponding font library is loaded for the text in the document to be processed and the text is encoded to obtain coded text data. When uncommon words exist in the coded text data, the occupancy rate of uncommon words in the document to be processed is calculated from the word number of uncommon words in the coded text data and the word number of characters in the document to be processed. Whether this occupancy rate is greater than a first preset threshold value is then judged, and if so, the coded text data is identified as containing messy code text data. The word number of characters in the document to be processed serves as the reference for the case without messy codes: when the occupancy rate of uncommon words in the document to be processed is greater than the first preset threshold value, the occupancy rate has reached the level at which messy codes occur. Because the word number of characters in the document to be processed equals that of the coded text data when no messy code is present, this in turn indicates that the coded text data contains messy code text data. Whether the coded text data contains messy code text data is thus identified automatically, the user does not need to open the document manually and read it to identify messy codes, and messy code identification of document content becomes more convenient.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flowchart illustrating a method for recognizing a scrambled code of document content according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for recognizing a scrambled code of document content according to another embodiment of the present invention;

FIG. 3 is a flowchart of the specific process, in step S206 of the embodiment shown in FIG. 2, of determining whether an uncommon word exists in the encoded text data;

FIG. 4 is a schematic structural diagram of a device for recognizing a scrambled code of document contents according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a device for recognizing a scrambled code of document contents according to another embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. The embodiments of the present application, and all other embodiments that can be obtained by a person of ordinary skill in the art without any inventive work, belong to the scope of protection of the present application.

In order to realize automatic messy code identification aiming at the messy code phenomenon caused by mismatching of a font library and content of document content, the embodiment of the invention provides a messy code identification method, a messy code identification device and messy code identification equipment of document content.

First, a method for recognizing a scrambled code of a document content according to an embodiment of the present invention will be described.

As shown in fig. 1, a flow of a method for recognizing a scrambled code of document content according to an embodiment of the present invention may include the following steps:

S101, analyzing the document to be processed to obtain a character analysis result of the document to be processed.

In particular implementations, the documents to be processed may be a variety of text-containing documents. For example: the document to be processed can be a word processing document in doc, wps and other formats, a presentation document in ppt, dps and other formats, a table document in xls, et and other formats, and a portable document in pdf format. In specific implementation, the document to be processed may be analyzed by using document analysis software, so as to obtain a text analysis result of the document to be processed.

S102, extracting each font information in the character analysis result of the document to be processed.

In a specific implementation, the document to be processed may be parsed by using document parsing software, and font information of document characters may be extracted from a character parsing result. Specifically, the font information includes the font name, font style information, and position in the document of each character in the document.
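As an illustration of the font information described in S102, the following minimal Python sketch models one record per character. The class name `CharFontInfo`, its field names, and the helper `distinct_font_names` are hypothetical and are not taken from the source; only the three kinds of information (font name, font style, position) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharFontInfo:
    """Font information for one character (hypothetical record layout)."""
    font_name: str   # e.g. "Songti" for the Song body
    font_style: str  # e.g. "regular" or "bold"
    position: int    # index of the character within the document


def distinct_font_names(infos):
    """Return the font names used in the document, in first-seen order."""
    names = []
    for info in infos:
        if info.font_name not in names:
            names.append(info.font_name)
    return names


infos = [
    CharFontInfo("Heiti", "bold", 0),
    CharFontInfo("Heiti", "bold", 1),
    CharFontInfo("Songti", "regular", 2),
]
print(distinct_font_names(infos))
```

The list of distinct names is what a later step would use to decide which font libraries need to be loaded.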

S103, according to each piece of font information, loading the corresponding font library for the text in the document to be processed and encoding the text to obtain coded text data.

In specific implementation, a corresponding font library stored in the terminal device can be determined according to the obtained font name, and the corresponding font library is loaded to the text with the same font information; and then, according to the font information of each character in the text, finding the corresponding code of each character in the text by using the font index in the corresponding font library, and finally obtaining the coded text data.

For example: font information for the Song body (Songti, 宋体) and the black body (Heiti, 黑体) is extracted from the document. The Song body and black body font library files are looked up by name in the font library folder of the terminal device; the Song body font library is loaded for the Song body text in the document, and the black body font library is loaded for the black body text. Then, according to the font information of each character in the Song body text and the black body text, the font indexes in the Song body and black body font libraries are used to find the code corresponding to each character, finally yielding coded Song body text data and coded black body text data.
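The lookup in this example can be sketched in Python under strong simplifying assumptions: each font library is modeled as an in-memory dictionary from a font index to a character, the names `FONT_LIBRARIES` and `encode_text` are hypothetical, and `"Songti"` and `"Heiti"` stand for the Song body and black body fonts. A real font library file is far more complex; this only shows how the same indexes yield different characters depending on which library is loaded.

```python
# Hypothetical in-memory font libraries: font index -> character.
FONT_LIBRARIES = {
    "Songti": {1: "文", 2: "档", 3: "内", 4: "容"},
    "Heiti": {1: "标", 2: "题"},
}


def encode_text(font_indexes, font_name, libraries=FONT_LIBRARIES):
    """Look up each font index in the library selected by font name."""
    library = libraries[font_name]  # KeyError if the library is not installed
    return "".join(library[index] for index in font_indexes)


print(encode_text([1, 2, 3, 4], "Songti"))  # text coded with the matching library
print(encode_text([1, 2], "Heiti"))         # the same indexes 1, 2 give other characters
```

Loading the wrong library for the indexes `[1, 2]` still produces characters, just not the intended ones, which is why the later steps look at the statistics of the coded text rather than at lookup failures.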

In detail, there are various ways to encode the text in the document to be processed, including but not limited to the following ways:

if the text in the document to be processed uses Unicode, the coded text data is obtained according to the implementation process of step S103;

if the text in the document to be processed uses CID codes (a character encoding developed by Adobe, mainly used in pdf format documents), the CID codes of the document first need to be converted into Unicode, and the coded text data is then obtained according to the implementation process of S103.
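The CID-to-Unicode conversion can be pictured as a table lookup. The sketch below assumes the mapping table has already been obtained (for example from the document itself); the function name `cids_to_unicode`, the sample CID values, and the choice of U+FFFD for unmapped CIDs are all assumptions made for illustration, not details from the source.

```python
def cids_to_unicode(cids, cid_map, missing="\ufffd"):
    """Convert a sequence of CID codes to a Unicode string.

    cid_map: dict from CID (int) to Unicode character (assumed to be given).
    missing: placeholder for CIDs with no mapping (an assumption of this sketch).
    """
    return "".join(cid_map.get(cid, missing) for cid in cids)


# Hypothetical mapping values, chosen only for the example.
cid_map = {100: "文", 101: "档"}
print(cids_to_unicode([100, 101], cid_map))
```

After this conversion the text can go through the same S103 processing as text that was stored in Unicode from the start.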

S104, when uncommon words exist in the coded text data, calculating the occupancy rate of the uncommon words in the document to be processed based on the word number of the uncommon words in the coded text data and the word number of the characters in the document to be processed.

The occupancy rate of uncommon words in the document to be processed can be calculated as: the word number of uncommon words in the coded text data divided by the word number of characters in the document to be processed.

The two word numbers can be obtained as follows:

counting the word number of uncommon words in the coded text data;

counting the word number of characters in the document to be processed.
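The calculation in S104 can be sketched as follows. The function name `uncommon_occupancy` and the predicate parameter `is_uncommon` are hypothetical; how a character is judged to be uncommon is left to the caller in this sketch.

```python
def uncommon_occupancy(coded_text, is_uncommon, document_char_count):
    """Occupancy rate = uncommon words in coded text / characters in document."""
    if document_char_count <= 0:
        raise ValueError("the document to be processed must contain characters")
    uncommon_count = sum(1 for ch in coded_text if is_uncommon(ch))
    return uncommon_count / document_char_count


# Hypothetical set of uncommon characters, chosen only for the example.
UNCOMMON = {"龘", "靐"}
rate = uncommon_occupancy("文龘档靐", lambda ch: ch in UNCOMMON, 4)
print(rate)  # 2 uncommon characters out of 4 document characters
```

Note that the denominator is passed in separately, because the method counts it on the document to be processed and not on the coded text.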

S105, judging whether the occupancy rate of the uncommon word in the document to be processed is larger than a first preset threshold value.

S106, if so, identifying that the coded text data contains messy code text data.

Even when no messy code is present, a certain number of uncommon words can appear in the text data obtained by encoding the document to be processed; when messy code is present, however, the number of uncommon words is usually larger. Therefore, to avoid misjudging uncommon words in text without messy codes as messy code, a first preset threshold value can be set according to historical experience, and whether the occupancy rate of uncommon words in the document to be processed is greater than the first preset threshold value is judged.

The word number of characters in the coded text data may itself be inaccurate because of messy codes (for example, when the font library does not match, two characters that would be present without messy code may be encoded as a single character), so it cannot be used to calculate the occupancy rate. When no messy code is present, however, the word number of characters in the normally coded text data should equal the word number of characters in the document to be processed. The word number of characters in the document to be processed can therefore replace that of the coded text data in the calculation, so that the occupancy rate of uncommon words in the document to be processed represents the occupancy rate of uncommon words in the coded text data.

Therefore, when the occupancy rate of the uncommon word in the document to be processed is greater than the first preset threshold value, the occupancy rate of the uncommon word in the encoded text data is also greater than the first preset threshold value, which indicates that the word count of the uncommon word in the encoded text data reaches the level of the occurrence of the messy code condition, so that the encoded text data is identified to contain the messy code text data.
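The judgment in S105 and S106, together with the choice of denominator discussed above, can be sketched as follows. The threshold value `0.3` and the character counts are hypothetical; the source only states that the first preset threshold value is set according to historical experience.

```python
FIRST_THRESHOLD = 0.3  # hypothetical value, not given in the source


def contains_messy_code(uncommon_count, document_char_count,
                        first_threshold=FIRST_THRESHOLD):
    """S105/S106: True when the occupancy rate exceeds the first threshold."""
    if document_char_count <= 0:
        raise ValueError("the document to be processed must contain characters")
    return uncommon_count / document_char_count > first_threshold


# Hypothetical case: the document holds 10 characters, but a mismatched font
# library merged some of them, so the coded text holds only 8 characters,
# 4 of which are uncommon.
rate_from_coded_text = 4 / 8  # denominator distorted by the messy code
rate_from_document = 4 / 10   # denominator the method uses
print(rate_from_coded_text, rate_from_document, contains_messy_code(4, 10))
```

Using the document's own character count keeps the rate comparable across documents, since that count does not change with the font library that happens to be loaded.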

In the messy code identification method for document content provided by the embodiment of the invention, the word number of characters in the document to be processed serves as the reference for the case without messy codes: when the occupancy rate of uncommon words in the document to be processed is greater than the first preset threshold value, the occupancy rate has reached the level at which messy codes occur. Because the word number of characters in the document to be processed equals that of the coded text data when no messy code is present, this in turn indicates that the coded text data contains messy code text data. Whether the coded text data contains messy code text data is thus identified automatically, the user does not need to open the document manually and read it to identify messy codes, and messy code identification of document content becomes more convenient.

In specific implementation, messy code text arises in two cases: 1) the electronic device does not have the font library corresponding to the text, so another font library is loaded to encode the text characters, and messy codes appear because the wrong font library was loaded; 2) the electronic device has the font library corresponding to the text, but the font library exists in multiple versions and the font information does not record which version the document uses, so another version of the font library may be loaded to encode the text characters, and messy codes appear because the wrong version was loaded. For example, if the text to be processed is Song body text: 1) when the electronic device has no Song body font library corresponding to the Song body text, the black body font library is loaded to encode the text characters, and messy codes appear in the Song body text because the wrong font library was loaded; 2) when version 1.0 of the Song body font library, which corresponds to the Song body text, exists on the electronic device but version 1.2 of the Song body font library is loaded to encode the text characters, messy codes appear in the Song body text because the wrong version of the font library was loaded.

Therefore, in order to specifically identify the messy codes when the messy codes caused by different reasons appear in the same text, as shown in fig. 2, optionally, the method for identifying the messy codes of the document content according to another embodiment of the present invention includes the following steps:

S201, analyzing the document to be processed to obtain a character analysis result of the document to be processed.

S202, extracting each font information in the character analysis result of the document to be processed.

S201 to S202 are the same as S101 to S102 in the embodiment shown in fig. 1 of the present invention, and are not repeated herein, for details, see the description of the embodiment shown in fig. 1 of the present invention.

S203, classifying all texts in the document to be processed according to each font information, and taking the texts corresponding to each font information as target texts respectively.

In specific implementation, all texts in the document to be processed are classified according to the font name in each font information and are used as target texts.

S204, determining the word number of the target characters in each target text.

In one embodiment, the word count of the target word is determined to provide for a later calculation of rare word occupancy. For example: and dividing the text in the document into a Song body text and a black body text, taking the Song body text and the black body text as target texts, and respectively determining the word number of the target characters in the Song body text and the black body text.
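Steps S203 and S204 amount to grouping characters by font name and counting each group. The sketch below assumes the parsed document is available as `(character, font_name)` pairs; that representation, the function names, and the font names `"Songti"` and `"Heiti"` (standing for the Song body and black body) are assumptions of this example.

```python
from collections import defaultdict


def split_into_target_texts(chars_with_fonts):
    """S203: group the document's characters by font name."""
    groups = defaultdict(list)
    for ch, font_name in chars_with_fonts:
        groups[font_name].append(ch)
    return {font_name: "".join(chars) for font_name, chars in groups.items()}


def target_char_counts(target_texts):
    """S204: word number of target characters in each target text."""
    return {font_name: len(text) for font_name, text in target_texts.items()}


document = [("标", "Heiti"), ("题", "Heiti"),
            ("正", "Songti"), ("文", "Songti"), ("内", "Songti")]
targets = split_into_target_texts(document)
counts = target_char_counts(targets)
print(targets, counts)
```

The counts are taken before any font library is loaded, so they are unaffected by a later mismatch.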

S205, loading a corresponding font library for each target text according to each font information, and coding to obtain coded text data corresponding to the target text.

In specific implementation, a corresponding font library stored in the terminal device can be determined according to the obtained font name, and the corresponding font library is loaded to the target text; and then, according to the font information of each character in the target text, finding the corresponding code of each character in the target text by using the font index in the corresponding font library, and finally obtaining the coded text data.

S206, judging whether the coded text data contains uncommon words. If uncommon words exist, step S207 is executed; if no uncommon words exist, it is determined that the document has no messy code text.
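One way to make the S206 judgment, following the word frequency table described in claim 3, is sketched below. The table contents, the threshold value, and the decision to treat characters absent from the table as having frequency 0 are all assumptions of this sketch.

```python
def is_uncommon(ch, frequency_table, frequency_threshold):
    """A character is uncommon if its word frequency is below the threshold.

    Characters missing from the table are given frequency 0.0 here, which is
    an assumption of this sketch and not stated in the source.
    """
    return frequency_table.get(ch, 0.0) < frequency_threshold


def has_uncommon_words(coded_text, frequency_table, frequency_threshold):
    """S206: True if any character of the coded text is uncommon."""
    return any(is_uncommon(ch, frequency_table, frequency_threshold)
               for ch in coded_text)


# Hypothetical frequencies, chosen only for the example.
FREQUENCY_TABLE = {"文": 0.004, "档": 0.001, "龘": 0.0000001}
THRESHOLD = 0.00001
print(has_uncommon_words("文档", FREQUENCY_TABLE, THRESHOLD))
print(has_uncommon_words("文龘", FREQUENCY_TABLE, THRESHOLD))
```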

S207, calculating the occupancy rate of the uncommon words in the target text according to the word number of the uncommon words and the word number of the target words in each target text.

In a specific implementation, the occupancy rate of uncommon words in the target text can be calculated as follows: word number of uncommon word in target text ÷ word number of target word.

S208, judging whether the occupancy rate of the uncommon word in the target text is greater than a first preset threshold value. If so, recognizing that the encoded text data contains messy code text data; if not, the encoded text data can be identified not to contain the messy code text data.

For example, after the encoded Song typeface text data and Hei typeface text data are obtained, each is checked for uncommon words. Taking the Song typeface text data as an example: if it contains uncommon words, the uncommon words are counted and their occupancy rate in the Song typeface text is calculated as the number of uncommon words in the Song typeface text ÷ the number of words in the Song typeface text.

It is then judged whether this occupancy rate is greater than the first preset threshold. If so, the Song typeface text data is identified as containing garbled text data; if not, it is identified as not containing garbled text data. The Hei typeface text data is identified in the same way.

If the Song typeface text data contains no uncommon words, that text is judged not to be garbled text; if the Hei typeface text data also contains no uncommon words, the document is judged to have no garbled text.

In practice, a document without garbled text usually contains fewer uncommon words than one with garbled text, but the content of a document to be processed is unpredictable. If the presence of garbled text is judged only by comparing the occupancy rate of uncommon words with the first preset threshold, text data that does contain garbled text but whose occupancy rate is below the first preset threshold may be determined to be free of garbled text, which reduces the accuracy of garbled text identification for the document content.

To address this loss of accuracy, optionally, after S208 in the embodiment shown in fig. 2, the method for identifying garbled text in document content according to the embodiment of the present invention may further include:

If the occupancy rate of uncommon words in the target text is not greater than the first preset threshold, judging whether it is smaller than a second preset threshold, where the second preset threshold is smaller than the first preset threshold. If the occupancy rate is smaller than the second preset threshold, it is identified that no garbled text data exists in the encoded text data.

On top of the first preset threshold, the second, smaller preset threshold widens the range over which the occupancy rate of uncommon words is examined. This avoids classifying text data that contains garbled text as clean merely because its occupancy rate is below the first preset threshold, and thereby improves the accuracy of garbled text identification for the document content.

If the occupancy rate of uncommon words in the target text is not smaller than the second preset threshold, a prompt asking whether the document needs to be repaired is output to the user.

In a specific implementation, if the occupancy rate is not greater than the first preset threshold and not smaller than the second preset threshold, it cannot be determined whether garbled text data exists. In this case a repair prompt box can be popped up for the user to choose whether to repair the document. When the terminal device receives "yes" from the user, the encoded text data is taken to contain garbled text data.

After the user's instruction that the document needs to be repaired is received, the encoded text data is determined to contain garbled text data.

Specifically, the user's response to the repair prompt may be yes or no. If yes, the repair instruction is deemed received and the encoded text data is determined to contain garbled text data; if no, the encoded text data is determined not to contain garbled text data.
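The two-threshold decision, including the fallback to the user, can be sketched as a three-way classification. The names `Verdict`, `classify` and `resolve` and the threshold values used in the usage lines are assumptions of this sketch; the prompt box itself is represented only by a boolean answer.

```python
from enum import Enum


class Verdict(Enum):
    GARBLED = "garbled"    # occupancy greater than the first threshold
    CLEAN = "clean"        # occupancy smaller than the second threshold
    ASK_USER = "ask_user"  # in between: prompt the user about repair


def classify(occupancy, first_threshold, second_threshold):
    """Apply the first and second preset thresholds to an occupancy rate."""
    if not second_threshold < first_threshold:
        raise ValueError("second threshold must be smaller than the first")
    if occupancy > first_threshold:
        return Verdict.GARBLED
    if occupancy < second_threshold:
        return Verdict.CLEAN
    return Verdict.ASK_USER


def resolve(verdict, user_wants_repair):
    """Return True if the encoded text data is deemed to contain garbled text."""
    if verdict is Verdict.ASK_USER:
        # The user's yes/no answer to the repair prompt decides the outcome.
        return user_wants_repair
    return verdict is Verdict.GARBLED


# Hypothetical thresholds: first = 0.30, second = 0.10.
verdict = classify(0.20, first_threshold=0.30, second_threshold=0.10)
is_garbled = resolve(verdict, user_wants_repair=True)
```

Keeping the classification separate from the user's answer means the prompt is shown only for the undecidable middle range.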

For example, after it is judged that the occupancy rate of uncommon words in the Song typeface text is not greater than the first preset threshold, it is judged whether that occupancy rate is smaller than the second preset threshold, where the second preset threshold is smaller than the first preset threshold. If the occupancy rate is smaller than the second preset threshold, it is identified that no garbled text data exists in the encoded Song typeface text data.

If the occupancy rate is not smaller than the second preset threshold, a prompt asking whether the document needs to be repaired is output to the user. After the user's instruction that the document needs to be repaired is received, the encoded Song typeface text data is judged to contain garbled text data, and the Song typeface font information corresponding to the garbled text is obtained from the document to be processed in preparation for uploading it to the cloud backend. After the user's instruction that the document does not need to be repaired is received, the encoded Song typeface text data is judged not to contain garbled text data.

As shown in fig. 3, in S206 of the embodiment shown in fig. 2, the process of judging whether the encoded text data contains uncommon words may specifically include:

S301: Obtain the word frequency of each target word from a pre-stored word frequency table.

In a specific implementation, the word frequency in the word frequency table is the frequency with which a word occurs in general documents. For example, if a document contains 20000 words and a given word appears 690 times in total, its word frequency is 3.45% (690/20000 = 3.45%).
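The word frequency described above is a simple ratio of counts. The sketch below computes it over a sample text; the function name `build_frequency_table`, the row layout (character, code point, frequency) and the decision to skip whitespace are assumptions of this sketch rather than details from the embodiment.

```python
from collections import Counter


def build_frequency_table(sample_text):
    """Return rows of (character, code point, frequency), highest frequency first."""
    counts = Counter(ch for ch in sample_text if not ch.isspace())
    total = sum(counts.values())
    rows = [(ch, ord(ch), n / total) for ch, n in counts.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# The worked figure from the text: 690 occurrences among 20000 words.
example_frequency = 690 / 20000
print(f"{example_frequency:.2%}")  # prints 3.45%
```

In practice such a table would be built once from a large number of document samples and stored, rather than recomputed for each document to be processed.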

The pre-stored word frequency table is obtained from statistics over a large number of document samples. It mainly contains each character's name, code and frequency, and is sorted by frequency from high to low. Part of its contents is shown in Table 1 below.

TABLE 1

S302: Judge whether the word frequency of the target word is lower than a preset uncommon-word frequency threshold. If so, execute step S303; if not, execute step S304.

In a specific implementation, an uncommon-word frequency threshold is generally preset; if the word frequency of the target word is lower than this threshold, the word is considered an uncommon word.

S303: Determine that the target word is an uncommon word.

S304: Determine that the target word is not an uncommon word, and continue with the next target word. If all target words are determined not to be uncommon words, the encoded text data contains no uncommon words.

For example, the encoded Song typeface text data is obtained, and the word frequency of each word in it is looked up in the pre-stored word frequency table. If the word frequency of a word is lower than the preset uncommon-word frequency threshold, the word is determined to be an uncommon word.

If the word frequency of a word in the Song typeface text data is not lower than the preset uncommon-word frequency threshold, the word is determined not to be an uncommon word, and the next word is examined. If all words in the Song typeface text data are determined not to be uncommon words, the Song typeface text data is determined to contain no uncommon words.
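Steps S301 to S304 can be sketched as a single pass over the encoded text. The table contents, the threshold value and the function name `find_uncommon_words` are hypothetical; treating a word that is absent from the table as having frequency 0 is likewise an assumption of this sketch, since the embodiment does not say how missing entries are handled.

```python
def find_uncommon_words(encoded_text, frequency_table, uncommon_threshold):
    """Return the words whose frequency is below the uncommon-word threshold."""
    uncommon = []
    for word in encoded_text:
        # S301: look up the word frequency (0.0 if the word is not in the table).
        frequency = frequency_table.get(word, 0.0)
        # S302/S303: a frequency below the threshold marks an uncommon word.
        if frequency < uncommon_threshold:
            uncommon.append(word)
        # S304: otherwise continue with the next word.
    return uncommon


# Hypothetical word frequency table and threshold.
FREQUENCY_TABLE = {"的": 0.0345, "是": 0.0100, "龘": 0.000001}
UNCOMMON_THRESHOLD = 0.0001

uncommon = find_uncommon_words("的是龘", FREQUENCY_TABLE, UNCOMMON_THRESHOLD)
has_uncommon_words = bool(uncommon)
```

An empty result corresponds to the outcome of S304 in which the encoded text data contains no uncommon words.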

Corresponding to the method embodiments, an embodiment of the present invention also provides a device for identifying garbled text in document content.

As shown in fig. 4, the device for identifying garbled text in document content according to an embodiment of the present invention may include:

the analysis module 401, configured to analyze the document to be processed to obtain a text analysis result of the document to be processed;

the extracting module 402, configured to extract each piece of font information from the text analysis result of the document to be processed;

the encoding module 403, configured to load, according to each piece of font information, the font library corresponding to that font information for the text in the document to be processed, and to encode the text to obtain encoded text data;

the identification module 404, configured to, when uncommon words exist in the encoded text data, calculate the occupancy rate of uncommon words in the document to be processed based on the number of uncommon words in the encoded text data and the number of words in the document to be processed; judge whether the occupancy rate of uncommon words in the document to be processed is greater than a first preset threshold; and if so, identify that the encoded text data contains garbled text data.

As shown in fig. 5, the device for identifying garbled text in document content according to another embodiment of the present invention may include:

the analysis module 501, configured to analyze the document to be processed to obtain a text analysis result of the document to be processed;

the extracting module 502, configured to extract each piece of font information from the text analysis result of the document to be processed;

Modules 501 and 502 are the same as modules 401 and 402 in the embodiment shown in fig. 4 of the present invention.

the classification module 503, configured to classify all text in the document to be processed according to each piece of font information, take the text corresponding to each piece of font information as a target text, and determine the number of target words in each target text;

the encoding module 504, configured to load the corresponding font library for each target text according to each piece of font information, and to encode the text to obtain the encoded text data corresponding to the target text;

the identification module 505, configured to judge whether the encoded text data contains uncommon words; if it does, calculate the occupancy rate of uncommon words in each target text from the number of uncommon words and the number of target words in that target text; and judge whether the occupancy rate of uncommon words in the target text is greater than a first preset threshold and, if so, identify that the encoded text data contains garbled text data.
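How the classification and identification modules of fig. 5 fit together can be illustrated with a per-font pass over already encoded target texts. This is a sketch under stated assumptions, not the device itself: `RecognitionResult`, `identify_per_font`, the sample frequency table and both threshold values are hypothetical, and parsing, font extraction and encoding are taken as already done.

```python
from dataclasses import dataclass


@dataclass
class RecognitionResult:
    font_name: str
    occupancy: float
    garbled: bool


def identify_per_font(target_texts, frequency_table,
                      uncommon_threshold, first_threshold):
    """Compute the uncommon-word occupancy and verdict for each target text."""
    results = []
    for font_name, text in target_texts.items():
        words = [ch for ch in text if not ch.isspace()]
        if not words:
            continue  # nothing to judge for an empty target text
        # Words missing from the table count as frequency 0.0 (assumption).
        uncommon = [w for w in words
                    if frequency_table.get(w, 0.0) < uncommon_threshold]
        occupancy = len(uncommon) / len(words)
        results.append(
            RecognitionResult(font_name, occupancy, occupancy > first_threshold)
        )
    return results


# Hypothetical inputs: one clean target text and one with unlikely characters.
results = identify_per_font(
    {"SimSun": "的是的是", "SimHei": "龘龘的是"},
    frequency_table={"的": 0.03, "是": 0.01},
    uncommon_threshold=0.001,
    first_threshold=0.30,
)
```

Judging each font's target text separately lets a document with one wrongly loaded font library be flagged even when the rest of its text is clean.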

Optionally, when the identification module 505 in the embodiment shown in fig. 5 is used to determine uncommon words, it is specifically configured to:

obtain the word frequency of each target word from a pre-stored word frequency table; and if the word frequency of the target word is lower than a preset uncommon-word frequency threshold, determine that the target word is an uncommon word.

Optionally, the identification module 505 in the embodiment shown in fig. 5 of the present invention may be further configured to:

if the occupancy rate of uncommon words in the target text is smaller than a second preset threshold, identify that no garbled text data exists in the encoded text data;

if the occupancy rate is not smaller than the second preset threshold, output to the user a prompt asking whether the document needs to be repaired;

after the user's instruction that the document needs to be repaired is received, determine that the encoded text data contains garbled text data.

An embodiment of the present invention further provides an electronic device. As shown in fig. 6, the electronic device may include:

a processor 601, a communication interface 602, a memory 603 and a communication bus 604, where the processor 601, the communication interface 602 and the memory 603 communicate with one another through the communication bus 604.

The memory 603 is configured to store a computer program.

The processor 601 is configured to implement the steps of the method for identifying garbled text in document content according to any of the above embodiments when executing the program stored in the memory 603.

With the electronic device provided by the embodiment of the present invention, the number of words in the document to be processed serves as the baseline for text without garbling. When the occupancy rate of uncommon words in the text data of the document to be processed is greater than the first preset threshold, that occupancy rate has reached the level at which garbled text is formed. Since, in the absence of garbling, the number of words in the document to be processed equals the number of words in the encoded text data, an occupancy rate at this level indicates that the encoded text data contains garbled text data. Whether the encoded text data contains garbled text data is thus identified automatically, without the user having to open the document manually and read it to spot garbled content, which makes garbled text identification for document content more convenient.

The machine-readable storage medium may include a RAM (Random Access Memory) and may also include a NVM (Non-Volatile Memory), such as at least one disk Memory. Additionally, the machine-readable storage medium may be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The computer-readable storage medium provided by an embodiment of the present invention is included in a server and stores a computer program; when the computer program is executed by a processor, the steps of the method for identifying garbled text in document content in any of the above embodiments are implemented.

The computer-readable storage medium provided by the embodiment of the present invention is included in the proxy server. When the computer program stored in the medium is executed, the number of words in the document to be processed serves as the baseline for text without garbling. When the occupancy rate of uncommon words in the text data of the document to be processed is greater than the first preset threshold, that occupancy rate has reached the level at which garbled text is formed. Since, in the absence of garbling, the number of words in the document to be processed equals the number of words in the encoded text data, an occupancy rate at this level indicates that the encoded text data contains garbled text data. Whether the encoded text data contains garbled text data is thus identified automatically, without the user having to open the document manually and read it to spot garbled content, which makes garbled text identification for document content more convenient.

In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the method for identifying garbled text in document content in any of the above embodiments.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, DSL (Digital Subscriber Line)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD (Digital Versatile Disc)), or a semiconductor medium (e.g., SSD (Solid State Disk)), etc.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the device and electronic apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.

The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.
