Text processing method and device, computer equipment and readable storage medium

Document No.: 1889899  Publication date: 2021-11-26

Reading note: this technology, "Text processing method and device, computer equipment and readable storage medium", was designed by 童丽霞 and 雷植程 on 2021-03-31. Abstract: The embodiments of the present application disclose a text processing method, a text processing apparatus, a computer device and a readable storage medium. The method comprises: acquiring a text to be processed, and determining the character proportion of each first language in the text to be processed; if there is a first language whose character proportion is greater than or equal to a first set threshold, determining that first language as the target language corresponding to the text to be processed; if there is no first language whose character proportion is greater than or equal to the first set threshold, determining the high-frequency word proportion of each second language in the text to be processed; and if the maximum of the high-frequency word proportions of the second languages is greater than or equal to a second set threshold, determining the second language corresponding to the maximum high-frequency word proportion as the target language corresponding to the text to be processed. With the embodiments of the present application, the main language of the text to be processed can be determined accurately.

1. A method of text processing, comprising:

acquiring a text to be processed, and determining the character proportion of each first language in the text to be processed;

if there is a first language whose character proportion is greater than or equal to a first set threshold, determining that first language as the target language corresponding to the text to be processed;

if there is no first language whose character proportion is greater than or equal to the first set threshold, determining the high-frequency word proportion of each second language in the text to be processed;

and if the maximum of the high-frequency word proportions of the second languages is greater than or equal to a second set threshold, determining the second language corresponding to the maximum high-frequency word proportion as the target language corresponding to the text to be processed.

2. The method as claimed in claim 1, wherein said determining the character proportion of each first language in the text to be processed comprises:

acquiring a character code corresponding to each character in the text to be processed;

and determining the character proportion of each first language in the text to be processed according to the character coding interval of each first language, the character code corresponding to each character in the text to be processed, and the total number of characters in the text to be processed.

3. The method according to claim 1, wherein said determining the high-frequency word proportion of each second language in the text to be processed comprises:

determining a target high-frequency word list of each second language and acquiring word segmentation information of the text to be processed;

and determining the high-frequency word proportion of each second language in the text to be processed according to the target high-frequency word list of each second language and the word segmentation information of the text to be processed.

4. The method of claim 3, wherein said determining a target high-frequency word list of each second language comprises:

for any second language, acquiring a candidate high-frequency word list of the second language, wherein the candidate high-frequency word list comprises one or more candidate high-frequency words;

and removing one or more of single-character candidate high-frequency words and common candidate high-frequency words from the candidate high-frequency word list to obtain the target high-frequency word list of the second language, wherein a common candidate high-frequency word is a candidate word whose frequency of occurrence across the candidate high-frequency word lists of the second languages is greater than or equal to a set frequency.

5. The method of claim 4, wherein the method further comprises:

determining the service type indicated by the text to be processed;

acquiring high-frequency words related to the service types;

and adding the high-frequency words related to the service types into the target high-frequency word list.

6. The method of claim 3, wherein the obtaining word segmentation information of the text to be processed comprises:

performing word segmentation processing on the text to be processed according to word interval symbols, and taking one or more obtained words as word segmentation information of the text to be processed; or

performing word segmentation processing on the text to be processed by using a word segmentation model, and taking one or more obtained words as word segmentation information of the text to be processed.

7. The method of any one of claims 1-6, further comprising:

if the maximum of the high-frequency word proportions of the second languages is smaller than the second set threshold, determining a default language as the target language corresponding to the text to be processed.

8. A text processing apparatus, comprising:

the acquisition module is used for acquiring a text to be processed and determining the character proportion of each first language in the text to be processed;

the determining module is used for determining, if there is a first language whose character proportion is greater than or equal to a first set threshold, that first language as the target language corresponding to the text to be processed;

the determining module is further configured to determine a high-frequency word proportion of each second language in the text to be processed if a first language with a character proportion larger than or equal to a first set threshold does not exist;

the determining module is further configured to determine, if a maximum high-frequency word proportion of the high-frequency word proportions of the second languages is greater than or equal to a second set threshold, the second language corresponding to the maximum high-frequency word proportion as the target language corresponding to the text to be processed.

9. A computer device, comprising: a processor, a memory, and a network interface;

the processor is connected with the memory and the network interface, wherein the network interface is used for providing network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the text processing method of any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, perform the text processing method of any one of claims 1-7.

Technical Field

The present application relates to the field of computer technologies, and in particular, to a text processing method and apparatus, a computer device, and a readable storage medium.

Background

With the advance of globalization and the rapid development of internet information services, more and more products offer services to different countries. Product service providers therefore have to process large amounts of text data in different languages, such as consultation mails from overseas users, and most of this text data mixes multiple languages.

At present, language identification of a text usually relies on the special characters of a certain language: when a special character appearing in the text belongs to a certain language, that language is taken as the language of the text. Alternatively, a set of high-frequency words of a certain language is collected, and when words appearing in the text belong to the high-frequency words of that language, that language is taken as the language of the text. Such discrimination conditions are reliable for identifying which languages are present in a text, but they are not reliable for identifying the main language of a multi-language text, which reduces the accuracy of the identification result.

Disclosure of Invention

The embodiment of the application provides a text processing method, a text processing device, computer equipment and a readable storage medium, which can accurately determine the main language of a text to be processed.

An embodiment of the present application provides a text processing method, including:

acquiring a text to be processed, and determining the character proportion of each first language in the text to be processed;

An embodiment of the present application provides a text processing apparatus in one aspect, including:

the acquisition module is used for acquiring a text to be processed and determining the character proportion of each first language in the text to be processed;

An aspect of an embodiment of the present application provides a computer device, including: a processor and a memory;

the memory stores a computer program that, when executed by the processor, causes the processor to execute the text processing method in the embodiments of the present application.

An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, where the computer program includes program instructions, and when the program instructions are executed by a processor, the text processing method in the embodiments of the present application is executed.

Accordingly, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the text processing method provided by one aspect of the embodiment of the application.

In the embodiments of the present application, the character proportion of each first language is compared with the first set threshold, and the first language whose character proportion is greater than or equal to the first set threshold is determined as the main language of the text to be processed (i.e., the target language is the language with the larger proportion in the text to be processed), so that the accuracy of language identification is improved for texts in which different languages are mixed. This addresses the multi-language mixing found in real application scenarios and improves the credibility and accuracy with which the overall language identification model judges the main language of a text to be processed.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a block diagram of a text processing system according to an embodiment of the present disclosure;

FIG. 2 is a schematic flow chart of mail processing according to an embodiment of the present disclosure;

FIG. 3 is a schematic flowchart of text language identification according to an embodiment of the present application;

FIG. 4 is a schematic flowchart of a text processing method according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a text processing method according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The core idea of the embodiments of the present application is to exploit the characteristics of different languages: the character proportion of a text to be processed is determined from character codes, or the word proportion is determined from the word segmentation information obtained by segmenting the text, and the main language of the text is determined by comparing the proportion with a corresponding threshold. Detecting different languages by their distinguishing features improves the efficiency of main-language identification, and introducing a confidence level (i.e., the threshold), so that a language is chosen only when its proportion reaches the threshold, improves the accuracy of the identification result. A text to be processed usually has only one main language, also called the target language, which may be the language with the maximum character proportion or the maximum high-frequency word proportion calculated from the characters or high-frequency words in the text. The text processing method provided by the embodiments of the present application is based on Natural Language Processing (NLP) technology, which belongs to the field of artificial intelligence.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises computer vision technology, voice processing technology, natural language processing technology, machine learning/deep learning and the like.

Among them, Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language that people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering by robots, knowledge graphs, and the like. For example, in the embodiments of the present application, the character proportion of the text to be processed is obtained by using character codes, or the high-frequency word proportion is obtained by segmenting the text to be processed into words.

Referring to fig. 1, fig. 1 is a schematic diagram illustrating an architecture of a text processing system according to an embodiment of the present disclosure. As shown in fig. 1, the architecture of the text processing system may include a text processing device 100 and a plurality of terminal devices 101. Each terminal device 101 may be connected to the text processing device 100 through a network connection, which may be wired or wireless, so that each terminal device 101 can exchange data with the text processing device 100 and the text processing device 100 can receive service data from each terminal device 101. The functions of the text processing device 100 and the terminal device 101 are described in detail below.

The text processing device 100 may be one or more of a single server, a server cluster, and a cloud server, which is not limited herein. Its main function is to perform language identification on a text to be processed, where the text to be processed may be pre-stored text data acquired from a text database, or real-time text data uploaded directly to the text processing device 100 by the terminal device 101. In the process of language identification, the text processing device 100 first processes each character in the text to be processed to generate a character code, then counts the number of characters of each first language and the total number of characters of the text according to the character codes, and determines the character proportion of each first language in combination with the character coding intervals of the character sets stored in advance in the text database. It then determines the main language of the text to be processed according to the character proportions and a first set threshold. If the main language cannot be determined from the character proportions, the text processing device 100 performs word segmentation on the text to be processed to obtain word segmentation information, determines the high-frequency word proportion of each second language in combination with the target high-frequency word lists stored in advance in the text database, and determines the main language according to the high-frequency word proportions and a second set threshold. If the main language cannot be determined from the high-frequency word proportions either, the default language is determined as the main language of the text to be processed.
After recognizing the language corresponding to the text to be processed, the text processing device 100 may store the recognition result in a text database or directly send the recognition result to the terminal device, so that the terminal device 101 performs subsequent processing, for example, if the text to be processed is a mail text, the mail text is distributed to the terminal device 101 of the corresponding service department according to the recognition result of the mail and replied by a specialist. Furthermore, the text processing device 100 can also store the full amount of service data as a text database, which can be text data or identification data, such as unrecognized text to be processed or recognized text to be processed.
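The two-stage decision described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `decide_language`, the threshold values 0.5 and 0.2, and the choice of the largest proportion when several first languages reach the threshold are all assumptions made for the example. The proportions are passed in as precomputed dictionaries.

```python
from typing import Dict


def decide_language(
    char_ratios: Dict[str, float],
    word_ratios: Dict[str, float],
    first_threshold: float = 0.5,   # assumed value, not given in the text
    second_threshold: float = 0.2,  # assumed value, not given in the text
    default_language: str = "en",
) -> str:
    """Return the target language from precomputed proportions."""
    # Stage 1: character proportions of the first languages.
    if char_ratios:
        lang, ratio = max(char_ratios.items(), key=lambda kv: kv[1])
        if ratio >= first_threshold:
            return lang
    # Stage 2: high-frequency word proportions of the second languages.
    if word_ratios:
        lang, ratio = max(word_ratios.items(), key=lambda kv: kv[1])
        if ratio >= second_threshold:
            return lang
    # Neither stage reached its threshold: fall back to the default language.
    return default_language
```

For instance, under these assumed thresholds, `decide_language({"ch": 0.7}, {})` selects `"ch"` in the first stage, while a text with low character and word proportions falls through to the default language.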

The terminal device 101 may be one or more of a smartphone, a camera, a desktop computer, a tablet computer, an MP4 player, and a laptop computer, and the terminal device 101 may install and run a target application for managing text data in which the terminal device 101 may monitor the number of unrecognized pending texts, the number of pending texts recognized by the text processing device 100, the number of recognized pending texts dispatched, view a corresponding data report, and the like. In addition, the terminal device 101 may also upload the text to be processed to the text processing device 100, or download the text to be processed, which is successfully identified, from the text database or the text processing device 100 for subsequent processing.

For a specific application scenario of the text processing system architecture, refer to fig. 2, which is a schematic flow chart of mail processing provided in an embodiment of the present application. The language identification capability for mail is mainly applied on a customer service platform, specifically in the service scenario of replying to mail from overseas users. The whole process comprises the following steps:

Step 1, mail receiving. The server may obtain the mails sent by users from a database dedicated to storing mail data, for example the database holding the mails of the official mailbox of a certain product. Generally, a mail text concerns consultation about a product or help with using it and is written mainly in one language, so the mail text may include one or more languages, such as a mail written mainly in Korean with a small number of English abbreviations.

Step 2, language identification. The received mail is sent to a language identification model for intelligent detection. The model processes the mail with the corresponding detection method and finally gives one language as the target language of the mail, which achieves the purpose of language identification. In the above example with English and Korean, language identification determines that Korean is the target language of the mail.

Step 3, mail assignment. The mail with its identification result is assigned to the service skill group of the corresponding language. Specifically, mails are distributed automatically according to their identification results, instead of being collected in a public service group and then distributed manually by that group, which improves the efficiency with which each language service skill group receives its mails and simplifies the processing flow.

Step 4, manual processing. After a mail is assigned to the service skill group of the corresponding language, manual processing is still required: the skill group arranges for a specialist to handle it. For example, the above Korean mail is assigned to the Korean service skill group, which arranges for a specialist to reply to the mail.

At this point, the whole flow of the mail reply is completed. In the whole mail dispatching process, the language of the mail is automatically identified through a machine, and the mail is automatically dispatched to different skill groups without manual intervention, so that the process complexity is reduced, and the labor is saved.
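The automatic assignment in step 3 amounts to a lookup from the identified target language to a skill group. The sketch below is illustrative only: the mapping `SKILL_GROUPS`, the function `assign_mail`, the language code `"ko"`, and the fallback to a public service group for languages without a dedicated group are assumptions, not details given in the text.

```python
from typing import Dict

# Hypothetical mapping from target language to service skill group.
SKILL_GROUPS: Dict[str, str] = {
    "ko": "Korean service skill group",
    "en": "English service skill group",
}


def assign_mail(target_language: str,
                groups: Dict[str, str] = SKILL_GROUPS,
                fallback: str = "public service group") -> str:
    """Return the skill group that should handle a mail in the given language."""
    # Languages without a dedicated group go to the assumed fallback group.
    return groups.get(target_language, fallback)
```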

In a specific technical implementation, the language identification step of the mail processing flow in fig. 2 is shown in fig. 3, which is a schematic flowchart of text language identification provided in an embodiment of the present application. Replying to overseas mail often involves multiple languages, and efficient language detection is a precondition for replying to multi-language mail. The scheme therefore designs a detection service that can identify multiple languages and can easily be extended to other languages; Chinese (ch), English (en), Arabic (ar), Russian (ru), Turkish (tr), Spanish (es) and Italian (it) are currently supported. The core of the scheme is character set detection and stop word detection, with the detection mode selected according to the characteristics of each language. First, the text is preprocessed to filter out punctuation marks, hyperlinks, numbers and the like. Character set detection then checks whether the text contains special characters; if it does not, the stop word proportion is detected; if that detection also fails, the default language, English, is returned. The steps of the flowchart are explained as follows:

Step 1, inputting a text. The languages of the input text include those that the current language recognition can process, such as Chinese (ch), English (en), Arabic (ar), Russian (ru), Turkish (tr), Spanish (es) and Italian (it), without limitation.

Step 2, fast preprocessing. This step mainly removes symbols other than special characters from the text, such as punctuation marks, hyperlinks and numbers, so that the remaining text contains only the required text characters, which saves time in subsequent processing and increases the processing speed.

Step 3, character set detection. This step detects languages that have special characters, including Chinese (ch), Arabic (ar) and Russian (ru). When character set detection succeeds, the recall of step 5 is executed and the language that meets the criterion is passed to step 5; when character set detection fails, the stop word proportion detection of step 4 is executed.

Step 4, stop word proportion detection. This step is performed when character set detection fails and mainly detects Turkish (tr), Spanish (es) and Italian (it). When stop word proportion detection succeeds, the recall of step 5 is executed on the result of the stop word detection; when it fails, recall is likewise executed, but the recalled language is the default language, English (en). The recalled language is then taken as the main language of the input text, which completes text recognition.

Step 5, recall. When character set detection or stop word proportion detection succeeds, the language obtained by that detection is recalled and taken as the target language of the input text, which completes text recognition. When stop word proportion detection fails, the default language, English, is recalled as the target language of the input text. Generally, a successful recall returns only one language. Viewed in terms of characters or words, the proportions of the languages sum to 1, so the main language of the input text can be determined by calculating the proportion of each language and setting a specific threshold.
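The stop word proportion detection of step 4 can be sketched as below. This is a hedged illustration: the tiny stop-word lists in `STOP_WORDS`, the function name `stop_word_ratios`, and the use of whitespace splitting as the word segmentation are assumptions for the example; a real deployment would use much larger word lists.

```python
from typing import Dict, Set

# Toy stop-word lists, assumed for illustration only.
STOP_WORDS: Dict[str, Set[str]] = {
    "tr": {"ve", "bir", "bu", "için", "ile"},
    "es": {"el", "los", "que", "por", "para"},
    "it": {"il", "che", "per", "non", "gli"},
}


def stop_word_ratios(text: str,
                     stop_words: Dict[str, Set[str]] = STOP_WORDS) -> Dict[str, float]:
    """Return, per language, the share of words that are in its stop-word list."""
    # Whitespace splitting stands in for word segmentation by separators.
    words = text.lower().split()
    if not words:
        return {lang: 0.0 for lang in stop_words}
    return {
        lang: sum(word in vocab for word in words) / len(words)
        for lang, vocab in stop_words.items()
    }
```

The language with the largest ratio is then compared with the threshold; if it does not reach the threshold, the default language is recalled instead.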

According to the differing characteristics of different languages, the above scheme detects the main language of a text in a targeted manner by using character set detection and stop word proportion detection, which improves the accuracy and credibility of the model's judgment. Because the overall scheme identifies the main language of the input text efficiently, the text to be processed can be assigned accurately to the personnel who handle the corresponding language. In particular, for the intelligent identification of mail languages, this speeds up the targeted response of customer service personnel of the corresponding language and improves mail processing efficiency. Experiments show that the language identification accuracy for mails in this scheme is more than 90%. After the scheme went online, the reply flow for multi-language mails was simplified, the complexity of the system was reduced, and manpower was saved.

It is understood that the text language identification method provided by the embodiment of the present application may be executed by computer devices, including but not limited to the terminal device 101 and the text processing device 100. The computer device may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.

Further, for convenience of understanding, please refer to fig. 4, and fig. 4 is a schematic flowchart of a text processing method according to an embodiment of the present application. The execution main body in this embodiment may be one computer device or a cluster formed by a plurality of computer devices, and the computer device may be a terminal device or a server. The text processing method at least includes the following steps S401 to S404:

S401, obtaining a text to be processed, and determining the character proportion of each first language in the text to be processed.

In a possible embodiment, the text to be processed includes one or more characters, which may be one or more of Roman numerals, Latin letters, punctuation marks, geometric symbols, characters of multiple languages, and so on. These characters may form content that expresses specific semantics, for example a hyperlink composed of Latin letters and punctuation marks, content composed of words formed from English characters, or content composed of characters of other languages; they may also be random characters without any semantics. The content of the text to be processed and the languages involved are not limited herein. A server with the text processing function may acquire the text to be processed from a text database, which may belong to the server side. After the text to be processed is acquired, it is preprocessed to filter out punctuation marks, hyperlinks, numbers and the like, and the character proportion of each first language is determined from the characters in the preprocessed text.
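The preprocessing described above can be sketched with regular expressions. This is a minimal sketch under stated assumptions: the function name `preprocess` and both patterns are illustrative, the hyperlink pattern only covers `http(s)://` and `www.` forms, and combining marks that Python does not treat as word characters would also be removed.

```python
import re

# Assumed hyperlink pattern: http(s):// or www. followed by non-space characters.
_URL = re.compile(r"https?://\S+|www\.\S+")
# Runs of non-word characters (punctuation, symbols, whitespace), digits and
# underscores. For str patterns, \W and \d are Unicode-aware in Python.
_NON_TEXT = re.compile(r"[\W\d_]+")


def preprocess(text: str) -> str:
    """Remove hyperlinks, numbers and punctuation, keeping word characters."""
    text = _URL.sub(" ", text)       # drop hyperlinks first, before punctuation
    text = _NON_TEXT.sub(" ", text)  # drop punctuation, digits and symbols
    return " ".join(text.split())    # collapse the remaining whitespace
```

Removing hyperlinks before punctuation matters here, because otherwise the Latin letters inside a link would survive and count toward the character totals.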

In a possible embodiment, determining the character proportion of each first language in the text to be processed may include the following steps. First, the character code corresponding to each character in the text to be processed is obtained. The character code may be the Unicode code (that is, a universal character code) of each character. Because Unicode assigns a unique code to every character of the world's writing systems and thus supports cross-language and cross-platform applications, the Unicode code of a character can be used to determine the encoding range into which the character falls and hence the language to which it belongs, which makes Unicode well suited to identifying the language of a text. The character code may also be any other code that uniquely represents the characters of various languages, which is not limited herein. The text to be processed may contain characters of multiple languages, and the character codes and character encoding intervals of the languages differ. The character proportion of each first language in the text to be processed may therefore be determined from the character encoding interval of each first language, the character code of each character in the text to be processed, and the total number of characters in the text to be processed.
Here, the character encoding interval of each first language is the Unicode encoding range of its special characters, and each such interval corresponds to a character set made up of the characters with those codes. The character proportion refers to the proportion of special characters of a first language. A special character is a character that appears only in one language and not in any other language, and languages that have special characters are collectively referred to as first languages. For example, Latin characters are used not only in English but also in Italian and French, so Latin characters are not special characters, and English, Italian, and French are not first languages. Although each Latin character has its own Unicode code, the same characters make up words in several languages, so they cannot uniquely indicate a language. By contrast, languages such as Chinese, Arabic, Russian, Korean, and Japanese each have unique characters with their own Unicode encoding ranges. Table 1 lists the Unicode encoding ranges of different languages.

TABLE 1 Unicode encoding ranges for different languages

As can be seen from Table 1, each encoding range corresponds to a character set. For example, 4E00-9FCC covers part of the commonly used Chinese characters, and 3400-4DB5 is the character set of Chinese Extension A, which contains 6582 characters; for Japanese, 3040-309F is the hiragana character set and 30A0-30FF is the katakana character set. Because each language in Table 1 has unique characters (that is, special characters) with a corresponding Unicode encoding range, each can serve as a first language. The set of first languages may be extended. The criterion is that any language with special characters, that is, characters that do not also appear in other languages, may serve as a first language. Languages such as Thai and Mongolian meet this criterion and may be used as first languages together with the languages in Table 1.

In summary, because some characters are shared by different languages and a shared character has the same code in each of them, the special characters can be identified by finding the encoding ranges of the different languages. Whether a character of the text to be processed falls into the character encoding range of a first language shown in Table 1 is then determined from its character code. If it does, the character is identified as a special character of the corresponding language; the remaining characters are checked in the same way. The ratio of the number of special characters to the total number of characters in the text to be processed is taken as the special character proportion. The special character proportion is counted separately for each first language, so the values for different first languages may be the same or different. Corresponding to the first languages in Table 1, the character proportions finally obtained for the text to be processed cover Chinese, Arabic, Russian, Korean, and Japanese, and may be denoted {x₁, x₂, x₃, x₄, x₅}. The character proportion of each first language is usually expressed as a percentage, for example 10%. If the text to be processed contains no characters of a certain first language, for example Chinese, the corresponding character proportion is recorded as 0.

For example, suppose the text to be processed is a mixed Chinese-English text whose English rendering is "help me deblock my PUBG ID as soon as possible". The characters involve Chinese and English, and only the Chinese characters are special characters. The number of special characters is 8, the number of English characters is 6, and the total number of characters is 14, so the character proportion of Chinese is 8/14 (about 57.14%). The character proportions of the other first languages (Arabic, Russian, Korean, and Japanese) are all 0. English is not included in the character proportion calculation for the first languages because Latin letters are not special characters.

S402, if a first language with a character occupation ratio larger than or equal to a first set threshold exists, determining the first language with the character occupation ratio larger than or equal to the first set threshold as a target language corresponding to the text to be processed.

In a possible embodiment, ideally a text could be assigned to a language simply by checking whether the Unicode codes of its characters fall into the encoding range of that language. In practical application scenarios, however, different languages are often mixed in one text, and the main language cannot be determined in this way. Take a Chinese-English mixed text similar to the example above, such as "help me unblock my PUBG ID as soon as possible, thank you". Judged by Unicode ranges alone, the language of this text would be determined to be Chinese, whereas English is actually its main language, and the text can be handled correctly only if English is identified. Therefore, to improve the accuracy of the determination, this scheme introduces a special character proportion (that is, the character proportion) and a confidence level (that is, the first set threshold). The ratio of the number of special characters in the text to be processed to the total number of characters is taken as the special character proportion, and the text is assigned to a language only when that language's proportion reaches the first set threshold.

By comparing the character proportion of each first language with the first set threshold, the first language whose character proportion is greater than or equal to the first set threshold can be determined and used as the target language corresponding to the text to be processed, that is, its main language. As an optional implementation, the maximum character proportion may be selected from the character proportions of the first languages and compared with the first set threshold. If the maximum character proportion is greater than or equal to the first set threshold, the main language of the text to be processed is among the first languages, and the first language corresponding to the maximum character proportion may be used as the main language without performing the subsequent steps. In general, the first set threshold is set to a value greater than 50%, so that more than half of the characters in the text to be processed are special characters of a single language, which makes it more credible to treat that language as the main language.
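
The optional implementation just described (take the maximum character proportion, then compare it with the first set threshold) might look like the following Python sketch. The function name `pick_first_language` and the default threshold of 0.6 are assumptions made for illustration; the text only states that the threshold is generally greater than 50%.

```python
def pick_first_language(char_proportions, first_threshold=0.6):
    """Return the first language whose proportion is the maximum and reaches
    the threshold, or None if character-level detection is inconclusive.

    char_proportions: dict mapping language name -> proportion in [0, 1].
    first_threshold: hypothetical value; the text only requires > 50%.
    """
    if not char_proportions:
        return None
    language, proportion = max(char_proportions.items(), key=lambda item: item[1])
    return language if proportion >= first_threshold else None


# A clearly Chinese text passes; a mixed text at 8/14 does not reach 0.6.
print(pick_first_language({"chinese": 0.9, "japanese": 0.05}))
print(pick_first_language({"chinese": 8 / 14, "arabic": 0.0}))
```

A return value of `None` corresponds to the case handled by the subsequent word-level step.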

S403, if the first language with the character occupation ratio larger than or equal to the first set threshold does not exist, determining the high-frequency word occupation ratio of each second language in the text to be processed.

In a possible embodiment, if the maximum character proportion among the first languages is less than the first set threshold, no first language has a character proportion greater than or equal to the first set threshold, and the main language of the text to be processed cannot be identified at the character level. A likely reason is that the text contains many Latin letters. The identification therefore has to move from characters to words: the best-matching language is determined by building the high-frequency word distribution of the input text under multiple languages. This requires first determining the high-frequency word proportion of each second language, where a high-frequency word is a word or single character that is used frequently in that second language. The high-frequency words in the text to be processed are determined with the help of stop word lists (stopwords) collected and compiled in advance. In this scheme, stop word lists for Turkish, Spanish, and Italian are collected and compiled, and Turkish, Spanish, and Italian are referred to as second languages. The concept of stop words is common in retrieval systems; they are so called because such words are discarded as soon as they are encountered during text processing, which reduces the index size and improves retrieval efficiency and, in turn, retrieval quality.
After the input text to be processed is segmented into words, the proportion of stop words of each second language in the text is calculated. Specifically, the ratio of the number of high-frequency words of a second language to the total number of words in the text may be used as the high-frequency word proportion of that language. Each second language has one high-frequency word proportion, usually expressed as a percentage.

S404, if the maximum high-frequency word occupation ratio in the high-frequency word occupation ratios of the second languages is larger than or equal to a second set threshold, determining the second language corresponding to the maximum high-frequency word occupation ratio as the target language corresponding to the text to be processed.

In a possible embodiment, as with the character proportion, this step introduces a high-frequency word proportion and a confidence level (referred to here as the second set threshold), and the second language whose high-frequency word proportion reaches the second set threshold is determined as the main language (that is, the target language) of the text to be processed. Specifically, the maximum high-frequency word proportion may be selected from the high-frequency word proportions of the second languages and compared with the second set threshold. If the maximum high-frequency word proportion is greater than or equal to the second set threshold, the target language of the text to be processed is determined to be the second language corresponding to that maximum; in other words, a language is chosen only if its high-frequency word proportion is both the highest and at least the set threshold. Conversely, if the maximum high-frequency word proportion is smaller than the second set threshold, no second language has a high-frequency word proportion greater than or equal to the second set threshold, and the main language of the text to be processed cannot be determined from the second languages.

Optionally, the high-frequency word proportion of each second language may first be compared with the second set threshold. If two or more high-frequency word proportions are greater than or equal to the second set threshold, the largest of them is selected, and the second language corresponding to that maximum high-frequency word proportion is determined as the target language corresponding to the text to be processed.

It should be noted that the second set threshold is set to a different value from the first set threshold. Like the first set threshold, however, it is shared by all languages; that is, the high-frequency word proportion of every second language is compared against the same second set threshold.
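
The two orderings described for S404 (select the maximum and then compare it with the threshold, or filter by the threshold and then select the maximum) give the same result, which the following Python sketch illustrates. The function names and the threshold values used in the example are hypothetical; the text does not specify a value for the second set threshold.

```python
def pick_by_max_then_threshold(ratios, second_threshold):
    """Select the largest proportion first, then check it against the threshold."""
    if not ratios:
        return None
    language = max(ratios, key=ratios.get)
    return language if ratios[language] >= second_threshold else None


def pick_by_threshold_then_max(ratios, second_threshold):
    """Keep only proportions that reach the threshold, then select the largest."""
    qualifying = {lang: r for lang, r in ratios.items() if r >= second_threshold}
    if not qualifying:
        return None
    return max(qualifying, key=qualifying.get)


# Hypothetical high-frequency word proportions for three second languages.
ratios = {"turkish": 0.0, "spanish": 0.5, "italian": 0.3}
print(pick_by_max_then_threshold(ratios, 0.2))
print(pick_by_threshold_then_max(ratios, 0.2))
```

Both functions return `None` when no second language reaches the threshold, which corresponds to the case where the main language cannot be determined from the second languages.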

In summary, the embodiments of the present application have at least the following advantages:

The main language of a text to be processed that contains multiple languages is determined by detecting the text with a character set based on Unicode codes or with a high-frequency word list made up of high-frequency words. Specifically, to ensure accurate identification of the main language of a multilingual text, the special character proportion and the first set threshold are introduced at the character level as the criteria for main language identification, and the character set is used to detect the special character proportion of the text so as to determine its main language. When character-set detection cannot determine the main language, the high-frequency word proportion of each second language and the second set threshold are introduced, and the high-frequency word lists are used to detect the high-frequency word proportions of the text so as to determine its main language. Of the two approaches, character-set detection is applied first and high-frequency word list detection serves as a supplement. Because both approaches rely on a proportion and a threshold, the identification result is more convincing when the text to be processed is multilingual, which improves the accuracy and reliability of main language identification.

Refer to Fig. 5, which is a schematic flowchart of a text processing method according to an embodiment of the present application. The executing entity in this embodiment may be one computer device or a cluster formed by a plurality of computer devices, and the computer device may be a terminal device or a server. The text processing method includes at least the following steps S501 to S505:

S501, obtaining a text to be processed, and determining the character proportion of each first language in the text to be processed.

S502, if a first language with a character occupation ratio larger than or equal to a first set threshold exists, determining the first language with the character occupation ratio larger than or equal to the first set threshold as a target language corresponding to the text to be processed.

The specific implementation manner of the above steps may refer to S401 to S402 in the embodiment corresponding to fig. 4, which is not described herein again.

S503, if the first language with the character ratio larger than or equal to the first set threshold does not exist, determining the target high-frequency word list of each second language, and acquiring word segmentation information of the text to be processed.

In a possible embodiment, if the character proportion of every first language is smaller than the first set threshold, character-set detection cannot effectively determine the main language of the text to be processed. In that case, language detection may be performed on the words of the text. The key to this process is to build a target high-frequency word list for each second language in advance; the target high-frequency word list is the stop word list (stopwords) mentioned in the above embodiment. The target high-frequency word list of each second language may be determined as follows. For any second language, a candidate high-frequency word list consisting of one or more high-frequency words of that language is obtained. The candidate list is determined from the number of occurrences of words in the second language and a set threshold; for example, words that occur more often than the threshold are taken as high-frequency words. It should be noted that the candidate high-frequency word lists of the second languages are compiled independently, so the same high-frequency word may appear in the lists of several second languages. For example, Spanish and Italian share many high-frequency words, so a candidate high-frequency word A may appear in the candidate lists of both. The candidate lists of the second languages may therefore intersect. Too much overlap distorts the distribution of high-frequency words, so the stop word distribution calculated for a piece of text can be highly misleading and language identification accuracy suffers. Common high-frequency words therefore need to be removed from the candidate lists.
A common high-frequency word is a candidate high-frequency word that appears in the candidate high-frequency word lists of the second languages a number of times greater than or equal to a set number, that is, a stop word shared by multiple second languages. For example, assume there are 4 second languages whose candidate high-frequency word lists are {T₁, T₂, T₃, T₄}, and the set number of occurrences is 2. If candidate high-frequency word A appears in 3 of the four candidate lists, which is greater than the set number, it can be removed. The screening criterion for common high-frequency words may be expressed as expression (1):

S≥ceil(N/2) (1)

wherein S represents the number of candidate high-frequency word lists in which a candidate high-frequency word appears, ceil represents rounding up, and N represents the number of languages. According to this screening criterion, candidate high-frequency words that appear in half or more of the candidate high-frequency word lists are deleted.
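
Expression (1) can be illustrated with a short Python sketch. The function name `shared_high_frequency_words` and the toy lists T1 to T4 are hypothetical; the words in them are placeholders rather than real stop words. With N = 4 the bound ceil(N/2) equals 2, which matches the set number used in the example above.

```python
import math
from collections import Counter


def shared_high_frequency_words(candidate_lists):
    """Return the words that satisfy S >= ceil(N / 2).

    candidate_lists: dict mapping a list name (one per language) to its
    candidate high-frequency words. S is the number of lists a word is in.
    """
    n_languages = len(candidate_lists)
    min_lists = math.ceil(n_languages / 2)
    # Count each word once per list, even if a list repeats it.
    counts = Counter(
        word for words in candidate_lists.values() for word in set(words)
    )
    return {word for word, s in counts.items() if s >= min_lists}


# Placeholder candidate lists for four hypothetical second languages.
candidate_lists = {
    "T1": ["A", "b1", "c"],
    "T2": ["A", "b2"],
    "T3": ["A", "c"],
    "T4": ["d"],
}
shared = shared_high_frequency_words(candidate_lists)
print(sorted(shared))
```

Here "A" appears in three lists and "c" in two, so both meet the bound of 2 and would be removed, while "b1", "b2", and "d" remain.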

In addition, single-character candidate high-frequency words in the candidate high-frequency word lists may be removed, where a single-character candidate high-frequency word may be a single Latin letter A-Z (or a-z). The two kinds of words, single-character and common candidate high-frequency words, may be removed as follows. First, the candidate high-frequency word list of each second language is traversed and the single-character candidate high-frequency words are removed. Next, all candidate high-frequency words of the languages to be detected (some or all of the second languages) are pooled into one collection, the words in the collection are counted and sorted, and every word whose count is greater than or equal to the set number is recorded in a shared high-frequency word list, filtered_stopwords. Then each candidate high-frequency word list is traversed and the candidate high-frequency words that appear in filtered_stopwords are removed. Finally, each candidate list after removal is taken as the target high-frequency word list of its second language, such as the stop word lists shown in Table 2 (that is, the target high-frequency word lists of the second languages):

TABLE 2 Stop word list

Turkish, Spanish, and Italian in the stop word list are collectively referred to as second languages, and the table shows part of the corresponding stop words (that is, high-frequency words). The second languages may be extended to any language for which a corresponding high-frequency word list can be built.

In addition, frequently occurring business high-frequency words may be added to the target high-frequency word lists according to business characteristics, with the business high-frequency words of each language added to the list of that language. Specifically, the service type indicated by the text to be processed, such as a game type, a makeup type, or an entertainment type, may be determined from the semantic information of the text or in another manner. High-frequency words related to that service type are then obtained from the text to be processed; for an overseas game service, for example, such words include account ban, unban, login, and lag. The second-language equivalents of these words are added to the target high-frequency word lists; for example, the Turkish, Spanish, and Italian words for "account ban" are added to the Turkish, Spanish, and Italian target high-frequency word lists respectively. When a text to be processed of the same service type is encountered later, the languages it contains can be determined quickly, which improves the efficiency of identifying its main language. Alternatively, all terms related to a service type may be collected in advance as business high-frequency words and added to the corresponding target high-frequency word lists, and the business high-frequency words in each target list may be partitioned by service type. The high-frequency words of the corresponding language are then looked up in the partition that matches the service type indicated by the text to be processed.
Furthermore, high-frequency words that are unique to a language may be added to increase the discrimination between the second languages, so that the second language corresponding to the high-frequency words of the text to be processed can be determined quickly.

In short, this scheme preprocesses the candidate high-frequency word lists: it deletes all stop words that consist of a single Latin letter, deletes some of the stop words shared by multiple languages, and adds some high-frequency words that are more specific to individual languages.
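
The three preprocessing operations can be combined in a Python sketch such as the one below. It is a minimal illustration under assumptions, not the embodiment's code: the function name `build_target_lists`, the `extra_words` parameter, and the small candidate lists are hypothetical, and the shared-word bound reuses ceil(N/2) from expression (1).

```python
import math
import string
from collections import Counter


def build_target_lists(candidate_lists, extra_words=None):
    """Turn candidate high-frequency word lists into target lists.

    candidate_lists: dict language -> iterable of candidate words.
    extra_words: optional dict language -> language-specific or business
                 words to add after filtering (hypothetical parameter).
    """
    extra_words = extra_words or {}

    # Step 1: drop candidates that are a single Latin letter (A-Z or a-z).
    stripped = {
        lang: {w for w in words if not (len(w) == 1 and w in string.ascii_letters)}
        for lang, words in candidate_lists.items()
    }

    # Step 2: record words shared by at least ceil(N / 2) lists.
    counts = Counter(w for words in stripped.values() for w in words)
    min_lists = math.ceil(len(stripped) / 2)
    filtered_stopwords = {w for w, s in counts.items() if s >= min_lists}

    # Step 3: remove shared words, then add the language-specific extras.
    return {
        lang: (words - filtered_stopwords) | set(extra_words.get(lang, ()))
        for lang, words in stripped.items()
    }


# Toy candidate lists; real lists would be far longer.
candidates = {
    "spanish": ["a", "de", "la", "que", "mi"],
    "italian": ["a", "di", "la", "che", "mi"],
    "turkish": ["bir", "ve", "bu", "o"],
}
targets = build_target_lists(candidates, extra_words={"spanish": ["desbloquear"]})
print(targets)
```

In this toy run "a" and "o" are dropped as single letters, "la" and "mi" are dropped because they occur in two of the three lists, and "desbloquear" is added to the Spanish list as a business word.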

In addition to the target high-frequency word list of each second language, the word segmentation information of the text to be processed needs to be obtained. Specifically, the text may be segmented at word spacers or by a word segmentation model to obtain one or more words, which are used as the word segmentation information. The word spacer may be a space, so that the natural spaces between words serve as the segmentation criterion. Splitting words made of Latin letters at spaces is accurate as well as fast and convenient. For languages that do not use spaces to separate words, such as Chinese and Japanese, however, segmentation at word spacers introduces errors and accuracy is poor, so a general deep-learning segmentation method should be considered. For example, a universal word segmentation model may be used to process the text to be processed: it can quickly split words at spaces and can also split closely connected characters according to semantics. This improves the accuracy of multilingual word segmentation and, in turn, the accuracy of identifying the main language of the text based on the target high-frequency word list of each second language. The universal word segmentation model may be a deep neural network trained on multiple languages or another network model, which is not limited herein.
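
The simpler of the two options, segmentation at word spacers, can be sketched in Python as follows. The function name `segment_by_spaces` is hypothetical, and lowercasing and stripping of ASCII punctuation are assumptions added so that tokens can later be matched against a word list. The universal word segmentation model is not sketched, since the text does not specify it.

```python
import string


def segment_by_spaces(text):
    """Split text at whitespace and return cleaned, lowercased tokens.

    Only ASCII punctuation is stripped from token edges; this is adequate
    for space-delimited Latin-script text but not for Chinese or Japanese.
    """
    tokens = []
    for raw in text.split():
        token = raw.strip(string.punctuation).lower()
        if token:  # skip tokens that were punctuation only
            tokens.append(token)
    return tokens


words = segment_by_spaces("Please unblock my PUBG ID, thank you!")
print(words)
```

For text without spaces between words, this function would return the whole run of characters as one token, which is the error case that motivates a model-based segmenter.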

S504, determining the high-frequency word ratio of each second language in the text to be processed according to the target high-frequency word list of each second language and the word segmentation information of the text to be processed.

In a possible embodiment, because the word segmentation information of the text to be processed includes one or more text words, the high-frequency words contained in the text and the total number of text words can be determined from those text words. The high-frequency words are determined using the target high-frequency word list of each second language. For each second language, the text words in the word segmentation information are matched against the high-frequency words in the target high-frequency word list, and each text word that is identical to a word in the list is determined to be a high-frequency word. This yields the high-frequency words of each second language and their number, and the ratio of the number of high-frequency words to the total number of text words may be used as the high-frequency word proportion of that second language.

For example, the text "Ayúdame a desbloquear mi ID de PUBG lo antes posible" involves English abbreviations and Spanish. Segmenting it at spaces gives a total of 10 words, of which 5 are high-frequency words: Ayúdame, desbloquear, mi, lo, and de. Because English is common in texts of other languages, no high-frequency word list is built for it, so it contributes no high-frequency words. The high-frequency word proportions of the second languages are therefore 50% for Spanish and 0 for Turkish and Italian.
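
The proportion calculation can be sketched in Python as follows. The function name `high_frequency_word_ratios` and the three tiny target lists are hypothetical; the Spanish list contains just the five words named in the example, and the Turkish and Italian lists are toy stand-ins, so the output illustrates the arithmetic rather than real stop word lists.

```python
# Toy target high-frequency word lists (ASSUMED for illustration only).
TARGET_LISTS = {
    "spanish": {"ayúdame", "desbloquear", "mi", "lo", "de"},
    "turkish": {"bir", "ve", "için"},
    "italian": {"che", "non", "sono"},
}


def high_frequency_word_ratios(tokens, target_lists):
    """Return {language: matched token count / total token count}."""
    total = len(tokens)
    ratios = {}
    for language, words in target_lists.items():
        matched = sum(1 for token in tokens if token in words)
        ratios[language] = matched / total if total else 0.0
    return ratios


# Segment at spaces and lowercase so tokens match the lowercase lists.
tokens = "Ayúdame a desbloquear mi ID de PUBG lo antes posible".lower().split()
ratios = high_frequency_word_ratios(tokens, TARGET_LISTS)
print(len(tokens), ratios)
```

With these lists, 5 of the 10 tokens match the Spanish list and none match the other two, which gives the 50% and 0 values of the example.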

S505, if the maximum high-frequency word proportion among the high-frequency word proportions of the second languages is greater than or equal to a second set threshold, determining the second language corresponding to the maximum high-frequency word proportion as the target language corresponding to the text to be processed.

The specific implementation manner of this step may refer to S404 in the embodiment corresponding to fig. 4, which is not described herein again.

In summary, the embodiments of the present application have at least the following advantages:

In the process of determining the target language of the text to be processed using the high-frequency word proportion and the second set threshold, the target high-frequency word list of each second language and the word segmentation information of the text to be processed are both key to language identification. A sufficiently accurate high-frequency word distribution and a sufficiently accurate word segmentation are needed for the high-frequency word proportion to be accurate, which in turn makes the final identification result more accurate. High-frequency words may be shared across different second languages, which distorts the high-frequency word distribution; the specific countermeasure is to delete high-frequency words shared by multiple languages. In addition, removing single-character high-frequency words, adding language-specific high-frequency words, and introducing high-frequency words for specific service types further improve the accuracy of the high-frequency word distribution and increase the discrimination of the target high-frequency word list of each second language. As a result, the high-frequency word proportion can be determined quickly and accurately when the target high-frequency words are matched against the word segmentation information of the text to be processed, and the main language of the text is determined among the second languages.
Furthermore, processing the multilingual text to be processed with a universal word segmentation model greatly reduces word segmentation errors, so that the main language of the text can be identified efficiently and accurately.

Fig. 6 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present application. The text processing apparatus may be a computer program (including program code) running in the text processing device 100 shown in Fig. 1; for example, the text processing apparatus is application software. The apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. The text processing apparatus 60 includes an obtaining module 601 and a determining module 602, wherein:

an obtaining module 601, configured to obtain a text to be processed, and determine a ratio of characters of each first language in the text to be processed;

a determining module 602, configured to determine, if a first language with a character ratio greater than or equal to a first set threshold exists, the first language with the character ratio greater than or equal to the first set threshold as a target language corresponding to a text to be processed;

the determining module 602 is further configured to determine a high-frequency word proportion of each second language in the text to be processed if there is no first language of which the character proportion is greater than or equal to a first set threshold;

the determining module 602 is further configured to determine, if a maximum high-frequency word proportion in the high-frequency word proportion of each second language is greater than or equal to a second set threshold, the second language corresponding to the maximum high-frequency word proportion as the target language corresponding to the text to be processed.

In an embodiment, the determining module 602 is specifically configured to: acquire the character code corresponding to each character in the text to be processed; and determine the character ratio of each first language in the text to be processed according to the character coding interval of each first language, the character code corresponding to each character in the text to be processed, and the total number of characters in the text to be processed.
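The character-ratio computation described above may be sketched as follows. This is a minimal illustration only: the function name `character_ratios`, the language labels, and the Unicode code-point intervals in `CODE_INTERVALS` are assumptions made for the example and are not specified by the present application.

```python
from collections import Counter

# Hypothetical code-point intervals for a few "first languages";
# the embodiment does not list concrete ranges.
CODE_INTERVALS = {
    "zh": [(0x4E00, 0x9FFF)],                    # CJK Unified Ideographs
    "ja": [(0x3040, 0x309F), (0x30A0, 0x30FF)],  # Hiragana, Katakana
    "ko": [(0xAC00, 0xD7A3)],                    # Hangul Syllables
    "th": [(0x0E00, 0x0E7F)],                    # Thai
}


def character_ratios(text, intervals=CODE_INTERVALS):
    """Return, per language, the share of characters whose code point
    falls inside that language's coding interval(s)."""
    total = len(text)  # total number of characters in the text
    if total == 0:
        return {lang: 0.0 for lang in intervals}
    counts = Counter()
    for ch in text:
        code = ord(ch)  # character code of this character
        for lang, ranges in intervals.items():
            if any(lo <= code <= hi for lo, hi in ranges):
                counts[lang] += 1
                break  # count each character for at most one language
    return {lang: counts[lang] / total for lang in intervals}
```

In this sketch every character, including spaces and punctuation, counts toward the total; whether such characters should be excluded is a design choice that the sketch does not settle.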

In an embodiment, the determining module 602 is specifically configured to: determine a target high-frequency word list of each second language and acquire word segmentation information of the text to be processed; and determine the high-frequency word ratio of each second language in the text to be processed according to the target high-frequency word list of each second language and the word segmentation information of the text to be processed.
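As an illustration of the high-frequency word ratio, the following sketch assumes that the ratio of a second language is the share of segmented words that appear in that language's target high-frequency word list. The function name `high_frequency_word_ratios`, the lowercase normalization, and the sample word lists are hypothetical.

```python
def high_frequency_word_ratios(tokens, target_lists):
    """tokens: word segmentation information (a list of words).
    target_lists: mapping from language to a set of high-frequency words."""
    total = len(tokens)
    if total == 0:
        return {lang: 0.0 for lang in target_lists}
    lowered = [t.lower() for t in tokens]  # assumption: case-insensitive match
    return {
        lang: sum(1 for t in lowered if t in words) / total
        for lang, words in target_lists.items()
    }


# Example with made-up word lists:
ratios = high_frequency_word_ratios(
    ["The", "cat", "is", "here"],
    {"en": {"the", "is", "here"}, "es": {"el", "es"}},
)
```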

In an embodiment, the determining module 602 is specifically configured to: for any second language, acquire a candidate high-frequency word list of the second language, the candidate high-frequency word list including one or more candidate high-frequency words; and remove one or more of the single-character candidate high-frequency words and the common candidate high-frequency words from the candidate high-frequency word list to obtain the target high-frequency word list of the second language, where a common candidate high-frequency word is a word whose frequency of occurrence across the candidate high-frequency word lists of the second languages is greater than or equal to a set frequency.
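The filtering of a candidate high-frequency word list may be sketched as below. The sketch interprets the "frequency" of a common candidate word as the number of second languages whose candidate list contains it; this interpretation, the function name `build_target_lists`, and the sample data are assumptions for illustration.

```python
from collections import Counter


def build_target_lists(candidate_lists, set_frequency):
    """Drop single-character words and words shared by at least
    `set_frequency` candidate lists; return one target set per language."""
    # Count in how many languages' candidate lists each word appears.
    occurrence = Counter()
    for words in candidate_lists.values():
        occurrence.update(set(words))
    return {
        lang: {
            w for w in words
            if len(w) > 1 and occurrence[w] < set_frequency
        }
        for lang, words in candidate_lists.items()
    }


# Made-up candidate lists: "no" appears in all three and is removed.
targets = build_target_lists(
    {
        "en": ["a", "the", "no", "hello"],
        "es": ["y", "el", "no", "hola"],
        "pt": ["e", "o", "no", "obrigado"],
    },
    set_frequency=2,
)
```

Removing shared words keeps only words that discriminate between the second languages, which is the apparent purpose of the filtering step.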

In an embodiment, the text processing apparatus 60 further comprises an adding module 603, wherein:

the determining module 602 is further configured to determine the service type indicated by the text to be processed;

the obtaining module 601 is further configured to obtain high-frequency words related to the service type;

and the adding module 603 is configured to add the high-frequency words related to the service type to the target high-frequency word list.
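A possible shape for the adding step is sketched below. The mapping `SERVICE_WORDS`, the service type name "game", and the function name `add_service_words` are purely hypothetical; the present application does not specify how service-related words are stored or looked up.

```python
# Hypothetical mapping: service type -> language -> service-related words.
SERVICE_WORDS = {
    "game": {"es": {"partida", "jugador"}, "en": {"match", "player"}},
}


def add_service_words(target_lists, service_type, service_words=SERVICE_WORDS):
    """Return a copy of the target lists extended with the words
    associated with the given service type (if any)."""
    extended = {lang: set(words) for lang, words in target_lists.items()}
    for lang, words in service_words.get(service_type, {}).items():
        extended.setdefault(lang, set()).update(words)
    return extended


extended = add_service_words({"es": {"hola"}, "pt": {"obrigado"}}, "game")
```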

In an embodiment, the obtaining module 601 is specifically configured to: perform word segmentation on the text to be processed according to word interval symbols, and take the one or more resulting words as the word segmentation information of the text to be processed; or perform word segmentation on the text to be processed by using a word segmentation model, and take the one or more resulting words as the word segmentation information of the text to be processed.
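The two segmentation options may be sketched as follows. The separator pattern and the function names are assumptions; the `model` parameter stands in for any word segmentation model and is represented here simply as a callable that maps a string to a sequence of words.

```python
import re


def segment_by_separator(text, separators=r"[\s,.;:!?]+"):
    """Split on word interval symbols (here: whitespace and basic punctuation)."""
    return [w for w in re.split(separators, text) if w]


def segment(text, model=None):
    """Use the segmentation model when one is supplied, else fall back to separators."""
    if model is not None:
        return list(model(text))
    return segment_by_separator(text)


words = segment("hello world, again")
```

Separator-based splitting suits languages written with spaces between words, while a model is needed for text in which words are not delimited by interval symbols.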

In an embodiment, the determining module 602 is further configured to: if the maximum of the high-frequency word ratios of the second languages is smaller than the second set threshold, determine a default language as the target language corresponding to the text to be processed.
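Taken together, the determinations performed by the determining module 602 form a three-stage cascade, which may be sketched as below. The threshold values, the default language "en", the function name, and the choice of the largest ratio when comparing against the first threshold are illustrative assumptions. The high-frequency word ratios are supplied as a callable so that they are computed only when no first language reaches the first threshold.

```python
def determine_target_language(
    char_ratios,
    compute_word_ratios,
    first_threshold=0.5,    # hypothetical first set threshold
    second_threshold=0.3,   # hypothetical second set threshold
    default_language="en",  # hypothetical default language
):
    # Stage 1: a first language whose character ratio reaches the first threshold.
    if char_ratios:
        lang, ratio = max(char_ratios.items(), key=lambda kv: kv[1])
        if ratio >= first_threshold:
            return lang
    # Stage 2: the second language with the maximum high-frequency word ratio.
    word_ratios = compute_word_ratios()
    if word_ratios:
        lang, ratio = max(word_ratios.items(), key=lambda kv: kv[1])
        if ratio >= second_threshold:
            return lang
    # Stage 3: neither threshold is reached, so fall back to the default language.
    return default_language
```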

It can be understood that the functions of the functional modules of the text processing apparatus described in the embodiments of the present application can be implemented according to the text processing method embodiments; for the specific implementation process, reference may be made to the relevant description of the method embodiments, which is not repeated here. The beneficial effects, being the same as those of the text processing method, are likewise not repeated here.

Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 70 may include a processor 701, a memory 702, a network interface 703, and at least one communication bus 704. The processor 701 is used for scheduling computer programs and may include a central processing unit, a controller, and a microprocessor; the memory 702 is used to store computer programs and may include high-speed random access memory and non-volatile memory, such as magnetic disk storage devices and flash memory devices; the network interface 703 provides data communication functions; and the communication bus 704 is responsible for connecting the various communication elements. The computer device 70 corresponds to the text processing device 100 described above.

The processor 701 may be configured to invoke a computer program in the memory 702 to perform the following operations:

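acquiring a text to be processed, and determining the character ratio of each first language in the text to be processed;

if there is a first language whose character ratio is greater than or equal to a first set threshold, determining that first language as the target language corresponding to the text to be processed;

if there is no first language whose character ratio is greater than or equal to the first set threshold, determining the high-frequency word ratio of each second language in the text to be processed;

if the maximum of the high-frequency word ratios of the second languages is greater than or equal to a second set threshold, determining the second language corresponding to the maximum high-frequency word ratio as the target language corresponding to the text to be processed.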

In an embodiment, the processor 701 is specifically configured to: acquire the character code corresponding to each character in the text to be processed; and determine the character ratio of each first language in the text to be processed according to the character coding interval of each first language, the character code corresponding to each character in the text to be processed, and the total number of characters in the text to be processed.

In an embodiment, the processor 701 is specifically configured to: determine a target high-frequency word list of each second language and acquire word segmentation information of the text to be processed; and determine the high-frequency word ratio of each second language in the text to be processed according to the target high-frequency word list of each second language and the word segmentation information of the text to be processed.

In an embodiment, the processor 701 is specifically configured to: for any second language, acquire a candidate high-frequency word list of the second language, the candidate high-frequency word list including one or more candidate high-frequency words; and remove one or more of the single-character candidate high-frequency words and the common candidate high-frequency words from the candidate high-frequency word list to obtain the target high-frequency word list of the second language, where a common candidate high-frequency word is a word whose frequency of occurrence across the candidate high-frequency word lists of the second languages is greater than or equal to a set frequency.

In an embodiment, the processor 701 is further configured to: determine the service type indicated by the text to be processed; acquire high-frequency words related to the service type; and add the high-frequency words related to the service type to the target high-frequency word list.

In an embodiment, the processor 701 is specifically configured to: perform word segmentation on the text to be processed according to word interval symbols, and take the one or more resulting words as the word segmentation information of the text to be processed; or perform word segmentation on the text to be processed by using a word segmentation model, and take the one or more resulting words as the word segmentation information of the text to be processed.

In an embodiment, the processor 701 is further configured to: if the maximum of the high-frequency word ratios of the second languages is smaller than the second set threshold, determine a default language as the target language corresponding to the text to be processed.

In a specific implementation, the processor 701, the memory 702, and the network interface 703 described in this embodiment may carry out the implementation of the computer device described in the text processing method provided in the embodiments of the present application, and may also carry out the implementation and achieve the beneficial effects described for the text processing apparatus provided in the embodiments of the present application, which are not repeated here.

Embodiments of the present application also provide a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute a text processing method according to embodiments of the present application. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.

Embodiments of the present application also provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the text processing method according to the embodiment of the present application. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.

It should be noted that, for simplicity of description, the above method embodiments are described as a series of combined acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the present application.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include flash disks, read-only memories (ROMs), random access memories (RAMs), magnetic disks, optical disks, and the like.

While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
