Language identification method and device

Document No.: 1875855. Publication date: 2021-11-23.

Reading note: This technology, "Language identification method and device" (一种语种识别方法及装置), was designed by 竺晨曦 on 2020-05-19. Main content: The application provides a language identification method and device. In the application, the language tag features of each character are obtained according to the code of each character in the text to be recognized, and these features are then input into a language identification model, so that the model identifies the language of the text to be recognized based on the language tag features of each character. The input data received by the model therefore carries more language-related information about each character in the text, and the model can identify the language of the text from more data. Because the model uses this per-character language information, the accuracy of the identified language can be improved, the likelihood of false recall can be reduced, and degradation of the user experience is avoided.

1. A language identification method, applied to an electronic device, the method comprising:

acquiring a text to be recognized;

acquiring a code of each character in the text to be recognized;

acquiring language tag features of each character in the text to be recognized according to the codes;

and determining the language of the text to be recognized according to the language tag features of each character and a language identification model based on a deep network.

2. The method according to claim 1, wherein the acquiring language tag features of each character in the text to be recognized according to the codes comprises:

for each character in the text to be recognized, determining, according to the code of the character, whether the character belongs to one language or simultaneously belongs to a plurality of languages;

in the case that the character belongs to one language, acquiring the language tag features of the character according to that language;

and in the case that the character simultaneously belongs to at least two languages, determining adjacent characters of the character in the text to be recognized, and generating the language tag features of the character according to the character and the adjacent characters.

3. The method according to claim 2, wherein the determining, according to the code of the character, whether the character belongs to one language or simultaneously belongs to a plurality of languages comprises:

acquiring the code segment in which the code of the character is located;

searching, in a correspondence between code segments and languages, for the languages corresponding to the code segment;

determining that the character belongs to one language in the case that exactly one language corresponds to the code segment;

and determining that the character simultaneously belongs to a plurality of languages in the case that at least two languages correspond to the code segment.

4. The method according to claim 1, wherein the acquiring language tag features of each character in the text to be recognized according to the codes comprises:

in the case that the character is a word, forming letter sets from each run of a preset number of consecutive letters in the word;

and generating the language tag features of the character according to the word and the letter sets.

5. The method of claim 1, wherein the obtaining the text to be recognized comprises:

acquiring an original text;

and removing non-language characters in the original text to obtain the text to be recognized.

6. The method of claim 1, further comprising:

acquiring a sample data set, wherein the sample data set comprises sample language tag features of each sample character in at least one sample text;

acquiring a marked language of each sample text in the sample data set;

and training a deep network model based on the sample data set and the marked languages until parameters of the deep network model converge, thereby obtaining the language identification model based on the deep network.

7. The method of claim 6, wherein the acquiring a sample data set comprises:

acquiring a plurality of first sample texts, wherein each first sample text comprises characters of one language;

generating a second sample text from at least two first sample texts that comprise characters of different languages;

and generating the sample data set according to the plurality of first sample texts and the generated second sample text.

8. A language identification device, applied to an electronic device, the device comprising:

a first acquisition module, configured to acquire a text to be recognized;

a second acquisition module, configured to acquire a code of each character in the text to be recognized;

a third acquisition module, configured to acquire language tag features of each character in the text to be recognized according to the codes;

and a determining module, configured to determine the language of the text to be recognized according to the language tag features of each character and a language identification model based on a deep network.

9. The device of claim 8, wherein the third acquisition module comprises:

a first determining unit, configured to determine, for each character in the text to be recognized and according to the code of the character, whether the character belongs to one language or simultaneously belongs to a plurality of languages;

a first acquisition unit, configured to acquire the language tag features of the character according to that language in the case that the character belongs to one language;

and a first generating unit, configured to generate the language tag features of the character according to the character and its adjacent characters in the text to be recognized.

10. The device of claim 9, wherein the first determining unit comprises:

an acquisition subunit, configured to acquire the code segment in which the code of the character is located;

a searching subunit, configured to search, in a correspondence between code segments and languages, for the languages corresponding to the code segment;

a first determining subunit, configured to determine that the character belongs to one language in the case that exactly one language corresponds to the code segment;

and a second determining subunit, configured to determine that the character simultaneously belongs to a plurality of languages in the case that at least two languages correspond to the code segment.

11. The device of claim 8, wherein the third acquisition module comprises:

a composition unit, configured to form letter sets from each run of a preset number of consecutive letters in a word in the case that the character is the word;

and a second generating unit, configured to generate the language tag features of the character according to the word and the letter sets.

12. The device of claim 8, wherein the first acquisition module comprises:

a second acquisition unit, configured to acquire an original text;

and a removing unit, configured to remove non-language characters from the original text to obtain the text to be recognized.

13. The device of claim 8, further comprising:

a fourth acquisition module, configured to acquire a sample data set, wherein the sample data set comprises sample language tag features of each sample character in at least one sample text;

a fifth acquisition module, configured to acquire a marked language of each sample text in the sample data set;

and a training module, configured to train a deep network model based on the sample data set and the marked languages until parameters of the deep network model converge, thereby obtaining the language identification model based on the deep network.

14. The device of claim 13, wherein the fourth acquisition module comprises:

a third acquisition unit, configured to acquire a plurality of first sample texts, wherein each first sample text comprises characters of one language;

a third generating unit, configured to generate a second sample text from at least two first sample texts that comprise characters of different languages;

and a fourth generating unit, configured to generate the sample data set according to the plurality of first sample texts and the generated second sample text.

15. An electronic device, characterized in that the electronic device comprises:

a processor; and

a memory having executable code stored thereon which, when executed, causes the processor to perform a language identification method as claimed in one or more of claims 1-7.

16. One or more machine-readable media having executable code stored thereon that, when executed, causes a processor to perform a language identification method as recited in one or more of claims 1-7.

Technical Field

The present application relates to the field of internet technology, and in particular to a language identification method and device.

Background

The development of the internet has made information rich and varied, and users can look up the information they need through search technology. A server on the network can perform a search according to the search text input by a user and provide the user with search results.

However, some search texts may involve illegal content; for example, a search text may include prohibited keywords such as "gun", "knife", or "jammer". To keep the network environment clean, such searches need to be restricted; for example, when the search text input by a user includes one of these keywords, the server may refuse to provide search results.

However, some text is used in several languages and has a different meaning in each. For example, the word "jammer" means a signal-jamming device in English but means "unfortunately" in Dutch.

Assume the server needs to restrict "jammer" in the sense of a jamming device but does not need to restrict "jammer" in the sense of "unfortunately". If the user inputs the search text in a Dutch context and the server interprets it in an English context, the server takes "jammer" in the search text to mean a jamming device and refuses to provide search results.

A false recall thus occurs: the user should have been given search results but was refused them, which degrades the user experience.

Disclosure of Invention

In order to improve the user experience, the present application provides a language identification method and a language identification device.

In a first aspect, the present application provides a language identification method applied to an electronic device, the method including:

acquiring a text to be recognized;

acquiring a code of each character in the text to be recognized;

acquiring language tag features of each character in the text to be recognized according to the codes;

and determining the language of the text to be recognized according to the language tag features of each character and a language identification model based on a deep network.

In an optional implementation, the acquiring language tag features of each character in the text to be recognized according to the codes includes:

for each character in the text to be recognized, determining, according to the code of the character, whether the character belongs to one language or simultaneously belongs to a plurality of languages;

in the case that the character belongs to one language, acquiring the language tag features of the character according to that language;

and in the case that the character simultaneously belongs to at least two languages, determining adjacent characters of the character in the text to be recognized, and generating the language tag features of the character according to the character and the adjacent characters.

In an optional implementation, the determining, according to the code of the character, whether the character belongs to one language or simultaneously belongs to a plurality of languages includes:

acquiring the code segment in which the code of the character is located;

searching, in a correspondence between code segments and languages, for the languages corresponding to the code segment;

determining that the character belongs to one language in the case that exactly one language corresponds to the code segment;

and determining that the character simultaneously belongs to a plurality of languages in the case that at least two languages correspond to the code segment.

In an optional implementation, the acquiring language tag features of each character in the text to be recognized according to the codes includes:

in the case that the character is a word, forming letter sets from each run of a preset number of consecutive letters in the word;

and generating the language tag features of the character according to the word and the letter sets.

In an optional implementation, the acquiring a text to be recognized includes:

acquiring an original text;

and removing non-language characters from the original text to obtain the text to be recognized.
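The removal of non-language characters could be sketched as follows. This is only an illustration under stated assumptions: the application does not define "non-language characters", so the sketch assumes they are anything other than Unicode letters and combining marks, and the function name `strip_non_language` is hypothetical.

```python
import unicodedata


def strip_non_language(original_text):
    """Keep letters and combining marks; drop digits, punctuation, symbols.

    Assumption: "non-language characters" means every character whose
    Unicode general category is not a letter (L*) or a mark (M*).
    """
    kept = []
    for ch in original_text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            kept.append(ch)
        elif ch.isspace():
            kept.append(" ")  # normalise any whitespace to a plain space
    # Collapse the runs of spaces left behind by removed characters.
    return " ".join("".join(kept).split())


text_to_recognize = strip_non_language("Hello, world 42")
```

Whitespace is kept here so that word boundaries survive for languages that use them; a different policy would be equally consistent with the text.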

In an optional implementation, the method further includes:

acquiring a sample data set, wherein the sample data set includes sample language tag features of each sample character in at least one sample text;

acquiring a marked language of each sample text in the sample data set;

and training a deep network model based on the sample data set and the marked languages until parameters of the deep network model converge, thereby obtaining the language identification model based on the deep network.

In an optional implementation, the acquiring a sample data set includes:

acquiring a plurality of first sample texts, wherein each first sample text includes characters of one language;

generating a second sample text from at least two first sample texts that include characters of different languages;

and generating the sample data set according to the plurality of first sample texts and the generated second sample text.

In a second aspect, the present application provides a language identification device applied to an electronic device, the device including:

a first acquisition module, configured to acquire a text to be recognized;

a second acquisition module, configured to acquire a code of each character in the text to be recognized;

a third acquisition module, configured to acquire language tag features of each character in the text to be recognized according to the codes;

and a determining module, configured to determine the language of the text to be recognized according to the language tag features of each character and a language identification model based on a deep network.

In an optional implementation, the third acquisition module includes:

a first determining unit, configured to determine, for each character in the text to be recognized and according to the code of the character, whether the character belongs to one language or simultaneously belongs to a plurality of languages;

a first acquisition unit, configured to acquire the language tag features of the character according to that language in the case that the character belongs to one language;

and a first generating unit, configured to generate the language tag features of the character according to the character and its adjacent characters in the text to be recognized.

In an optional implementation, the first determining unit includes:

an acquisition subunit, configured to acquire the code segment in which the code of the character is located;

a searching subunit, configured to search, in a correspondence between code segments and languages, for the languages corresponding to the code segment;

a first determining subunit, configured to determine that the character belongs to one language in the case that exactly one language corresponds to the code segment;

and a second determining subunit, configured to determine that the character simultaneously belongs to a plurality of languages in the case that at least two languages correspond to the code segment.

In an optional implementation, the third acquisition module includes:

a composition unit, configured to form letter sets from each run of a preset number of consecutive letters in a word in the case that the character is the word;

and a second generating unit, configured to generate the language tag features of the character according to the word and the letter sets.

In an optional implementation, the first acquisition module includes:

a second acquisition unit, configured to acquire an original text;

and a removing unit, configured to remove non-language characters from the original text to obtain the text to be recognized.

In an optional implementation, the device further includes:

a fourth acquisition module, configured to acquire a sample data set, wherein the sample data set includes sample language tag features of each sample character in at least one sample text;

a fifth acquisition module, configured to acquire a marked language of each sample text in the sample data set;

and a training module, configured to train a deep network model based on the sample data set and the marked languages until parameters of the deep network model converge, thereby obtaining the language identification model based on the deep network.

In an optional implementation, the fourth acquisition module includes:

a third acquisition unit, configured to acquire a plurality of first sample texts, wherein each first sample text includes characters of one language;

a third generating unit, configured to generate a second sample text from at least two first sample texts that include characters of different languages;

and a fourth generating unit, configured to generate the sample data set according to the plurality of first sample texts and the generated second sample text.

In a third aspect, the present application provides an electronic device including:

a processor; and

a memory having executable code stored thereon which, when executed, causes the processor to perform the language identification method of the first aspect.

In a fourth aspect, the present application provides one or more machine-readable media having executable code stored thereon which, when executed, causes a processor to perform the language identification method of the first aspect.

Compared with the prior art, the embodiments of the present application have the following advantages:

In one approach, each character in the text to be recognized is input directly into the language identification model, which processes the text and outputs its language.

In the present application, by contrast, the language tag features of each character are obtained according to the code of each character in the text to be recognized and are then input into the language identification model, so that the model identifies the language of the text based on those features.

The language tag features of a character reflect, directly or indirectly, which languages the character may belong to. They may include the character itself, the language to which the character belongs, the adjacent characters of the character in the text to be recognized, and so on. In the case that the character is a word composed of several letters, they may further include the letter sets formed from each run of a preset number of consecutive letters in the word.

The input data received by the language identification model therefore carries more language-related information about each character in the text to be recognized, and the model can identify the language of the text from more data. Because the model uses this per-character language information, the accuracy of the identified language can be improved, the likelihood of false recall can be reduced, and degradation of the user experience is avoided.

In addition, the language identification model in the present application is obtained by training a deep network, which improves how well the model generalizes. For example, once the model is deployed, even if a text to be recognized did not appear among the sample texts used in training, the model identifies its language more accurately than a model based on a shallow network.

Drawings

Fig. 1 is a flowchart illustrating a language identification method according to an exemplary embodiment of the present application.

Fig. 2 is a flowchart illustrating a method for obtaining language tag features according to an exemplary embodiment of the present application.

Fig. 3 is a flowchart illustrating a method for obtaining language tag features according to an exemplary embodiment of the present application.

Fig. 4 is a flowchart illustrating a method for training a language identification model according to an exemplary embodiment of the present application.

Fig. 5 is a block diagram illustrating a structure of a language identification device according to an exemplary embodiment of the present application.

Fig. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

Referring to Fig. 1, a flowchart of a language identification method according to the present application is shown. The method is applied to an electronic device and may include the following steps.

In step S101, a text to be recognized is acquired.

In the present application, the text to be recognized may include text input by a user, for example text entered in a search box; the electronic device may acquire the text input by the user and use it as the text to be recognized.

Alternatively, the text to be recognized may include text recognized from audio, video, or images.

In one example, if a person speaks in a video, the speech can be converted with speech-to-text technology to obtain the text it expresses, and that text is used as the text to be recognized.

In another example, during a voice call between two or more people, the call speech can be converted with speech-to-text technology to obtain the text it expresses, and that text is used as the text to be recognized.

In yet another example, an image contains text; the text can be recognized with OCR (Optical Character Recognition) technology and used as the text to be recognized.

Of course, the text to be recognized may also be acquired in other ways, and the present application does not limit the specific manner of acquisition.

In step S102, a code of each character in the text to be recognized is acquired.

In the present application, the text to be recognized may include one character or at least two characters. Here, a character is required to express a specific meaning. In languages written with the Latin alphabet, such as English and French, a single letter generally does not express a specific meaning, whereas a word composed of at least two letters generally does. In these languages, therefore, a character may be a word rather than a single letter.

In languages such as Chinese and Japanese, a single written character can often express a specific meaning, so in these languages a single written character may be a character.

In the present application, for any character in any language, a code may be set for the character in advance; the character and its code then form a table entry that is stored in a correspondence between characters and codes. The same applies to every other character in the language and to every other language.

Therefore, for any character in the text to be recognized, the code corresponding to the character can be looked up in the correspondence between characters and codes. Performing this lookup for every character yields the code of each character in the text to be recognized.
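Step S102 could be sketched as below. The application describes a preset table of characters and codes without specifying it, so this sketch makes two assumptions: a run of Latin letters is treated as one character (a word) and every other non-space symbol as a character on its own, and Unicode code points stand in for the preset codes. All function names are hypothetical.

```python
import unicodedata


def is_latin_letter(ch):
    # Assumption: a letter whose Unicode name starts with "LATIN".
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def split_characters(text):
    """Split text into 'characters': Latin words, or single other symbols."""
    characters, word = [], []
    for ch in text:
        if is_latin_letter(ch):
            word.append(ch)
            continue
        if word:  # a non-Latin symbol ends the current word
            characters.append("".join(word))
            word = []
        if not ch.isspace():
            characters.append(ch)
    if word:
        characters.append("".join(word))
    return characters


def code_of(character):
    # Stand-in for the character-to-code table: Unicode code points.
    return tuple(ord(c) for c in character)


def codes_of_text(text):
    return [(ch, code_of(ch)) for ch in split_characters(text)]
```

In a real system the lookup would consult the preset correspondence rather than compute the code, but the control flow (split, then look up each character) would be the same.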

In step S103, the language tag features of each character in the text to be recognized are acquired according to the code of each character.

For any character, the language tag features may include the character itself, the language to which the character belongs, the adjacent characters of the character in the text to be recognized, and so on. In the case that the character is a word composed of several letters, they may further include the letter sets formed from each run of a preset number of consecutive letters in the word. See the embodiments shown in Fig. 2 and Fig. 3 for details, which are not repeated here. The language tag features of a character reflect, directly or indirectly, which languages the character may belong to. The same applies to every other character.
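The letter sets mentioned for word characters can be illustrated with a sliding window over the word. This is a minimal sketch: the window size of 3 and the dictionary layout of the resulting feature are assumptions, and the names `letter_sets` and `word_tag_features` are hypothetical.

```python
def letter_sets(word, n=3):
    """Return every run of n consecutive letters in word, left to right."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def word_tag_features(word, n=3):
    # Assumed layout: the word itself plus its letter sets.
    return {"character": word, "letter_sets": letter_sets(word, n)}


features = word_tag_features("jammer")
```

A word shorter than the window yields no letter sets, so the feature then consists of the word alone.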

In step S104, the language of the text to be recognized is determined according to the language tag features of each character and the language identification model based on a deep network.

The language tag features of each character can be input into the language identification model based on the deep network to obtain the language of the text to be recognized as output by the model.

In one embodiment of the present application, the language identification model determines the most likely language of the text to be recognized and outputs it as the language of the text, for example Chinese, English, or Japanese.

In another embodiment of the present application, the language identification model outputs the probability that the text to be recognized belongs to each language, for example 82% for Chinese, 15% for English, 1% for Japanese, and so on.
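The two output modes can be related with a short sketch. It assumes, purely for illustration, that the model produces one raw score per language and that the scores are normalised with softmax; the application does not specify the output layer, and the scores and language codes below are made up.

```python
import math


def softmax(scores):
    """Turn raw per-language scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the peak for numerical stability
    exps = {lang: math.exp(s - peak) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def most_likely_language(probabilities):
    """Single-label output: the language with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Hypothetical raw scores, not taken from the application.
probabilities = softmax({"zh": 2.0, "en": 0.5, "ja": -1.0})
language = most_likely_language(probabilities)
```

Returning the full distribution lets a caller apply its own threshold, while the single-label mode is simply the arg max of the same distribution.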

After the language of the text to be recognized is determined, the text can be processed in light of its language. For example, whether a search may be performed for the text is decided according to its language; if so, the search is performed and the search results are returned to the user.

In one approach, each character in the text to be recognized is input directly into the language identification model, which processes the text and outputs its language.

In another embodiment of the present application, referring to Fig. 2, step S103 includes:

In step S201, for each character in the text to be recognized, it is determined according to the code of the character whether the character belongs to one language or simultaneously belongs to a plurality of languages.

A character may be used in only one of several languages. For example, the four characters of "魑魅魍魉", namely "魑", "魅", "魍", and "魉", are used only in Chinese, so each of them belongs to the single language Chinese. The characters meaning "what", "just", and "day" are used in at least two languages, such as Chinese and Japanese, so each of them belongs to both languages.

All codes available for encoding characters can be divided into a plurality of code segments in advance. Any code segment may correspond to one language or to at least two languages simultaneously. Each code segment and its corresponding languages can be formed into a table entry in advance and stored in a correspondence between code segments and languages.

In this step, therefore, the code segment in which the code of the character is located can be obtained, and the languages corresponding to that code segment can be looked up in the correspondence between code segments and languages. If exactly one language corresponds to the code segment, the character is determined to belong to one language; if at least two languages correspond to the code segment, the character is determined to belong to a plurality of languages simultaneously.
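The code-segment lookup in step S201 might look like the sketch below. The ranges are real Unicode blocks, but using Unicode blocks as the code segments and the language sets attached to them are assumptions made for illustration; the application relies on its own preset codes, which may split characters differently. The table and function names are hypothetical.

```python
# (start, end, languages), with inclusive bounds.
CODE_SEGMENTS = [
    (0x3040, 0x309F, {"ja"}),        # Hiragana
    (0x30A0, 0x30FF, {"ja"}),        # Katakana
    (0x4E00, 0x9FFF, {"zh", "ja"}),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF, {"ko"}),        # Hangul Syllables
]


def languages_of_code(code):
    """Find the segment containing code and return its languages."""
    for start, end, languages in CODE_SEGMENTS:
        if start <= code <= end:
            return set(languages)
    return set()  # the code falls outside every known segment


def belongs_to_multiple_languages(code):
    return len(languages_of_code(code)) >= 2
```

With only a handful of segments a linear scan is enough; a sorted table with binary search would be the natural choice for a larger one.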

When a character is assigned its code, if the character is used in only one language, the code segment corresponding to that language can be looked up in the correspondence between code segments and languages, and a code is selected from that segment as the code of the character. If the character is used in at least two languages, the code segment corresponding to those languages can be looked up in the same correspondence, and a code is selected from that segment as the code of the character. Every other character is encoded in the same way.

In the case that the character belongs to one language, in step S202, the language tag features of the character are acquired according to that language.

In the present application, the language itself may be used as the language tag feature of the character.

In the case that the character simultaneously belongs to at least two languages, in step S203, the adjacent characters of the character are determined in the text to be recognized.

When a character belongs to at least two languages simultaneously, it is often impossible to tell from the character alone which language it belongs to. Accurate input data then cannot be provided to the language identification model, which impairs the accuracy of the language that the model identifies for the text.

To avoid this, in another embodiment of the present application, the language to which the character belongs needs to be determined. To that end, the adjacent characters of the character can be determined in the text to be recognized, for example based on an N-gram.

In one example, the character immediately following the character may be determined to be the adjacent character. Alternatively, at least two characters that follow the character and are closest to it may be determined to be the adjacent characters. Alternatively, the character immediately preceding the character may be determined to be the adjacent character. Alternatively, at least two characters that precede the character and are closest to it may be determined to be the adjacent characters. Alternatively, the character immediately preceding and the character immediately following the character may be determined to be the adjacent characters. Alternatively, at least two characters that precede the character and are closest to it, together with at least two characters that follow it and are closest to it, may be determined to be the adjacent characters.
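The alternatives listed above differ only in how many characters are taken on each side, so they can be sketched with one windowing helper. The function name and the `before`/`after` parameters are hypothetical.

```python
def adjacent_characters(characters, index, before=0, after=1):
    """Return up to `before` preceding and `after` following characters.

    `characters` is the text already split into characters and `index`
    is the position of the character whose neighbours are wanted.
    """
    if not 0 <= index < len(characters):
        raise IndexError("index out of range")
    preceding = characters[max(0, index - before):index]
    following = characters[index + 1:index + 1 + after]
    return preceding, following


chars = list("abcde")
# One neighbour on each side of "c".
neighbours = adjacent_characters(chars, 2, before=1, after=1)
```

At the start or end of the text the window is simply truncated, which matches the case where the first character has only following neighbours.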

For example, suppose the user needs to search for "soft toothbrush" and, in a Japanese context, enters the search text "柔らかい歯ブラシ" in the search box. The electronic device obtains the search text "柔らかい歯ブラシ" input by the user and uses it as the text to be recognized.

The characters "ら", "か", "い", "ブ", "ラ" and "シ" in the text to be recognized belong to one language, Japanese.

The character "柔" in the text to be recognized belongs to both Japanese and Chinese, and the character "歯" in the text to be recognized likewise belongs to both Japanese and Chinese.

For the character "柔": in the text to be recognized "柔らかい歯ブラシ", "柔" is the first character, so its adjacent characters may be determined among the characters that follow it; for example, the adjacent characters of "柔" may include "ら", "らか", or "らかい". The same applies to the character "歯", which is not described in detail here.

In step S204, a language tag feature of the character is generated from the character and the adjacent characters.

In the present application, the character may be combined with its adjacent characters to obtain the language tag feature of the character.

In the present application, a user usually inputs the text to be recognized in the context of a single language. When a character in the text belongs to multiple languages at the same time, the user has nevertheless entered it in the context of one language. Although the electronic device cannot determine which language the character belongs to from the character alone, if the adjacent characters of the character in the text belong to a certain language, the character normally belongs to that language as well.

For example, in the above example, the electronic device cannot accurately determine which language the character "柔" belongs to based on the character "柔" alone.

However, the adjacent characters "ら", "らか" or "らかい" of the character "柔" in the text to be recognized "柔らかい歯ブラシ" belong to Japanese. With the help of these adjacent characters, the language identification model can determine with high probability that the character "柔" belongs to Japanese, which increases the likelihood of determining that the language of the text to be recognized is Japanese and thus improves the accuracy of determining the language of the text to be recognized.

In this way, the present application can improve the likelihood and accuracy of determining the language of a character in the text to be recognized that belongs to multiple languages at the same time, so that accurate input data can be provided to the language identification model and the accuracy of the language identified by the model for the text to be recognized can be improved.
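Steps S203 and S204 might be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of the present application: Unicode code point ranges stand in for the character codes, the mapping from ranges to the tags "ja" and "zh" is deliberately simplified, and the names `candidate_languages` and `language_tag_feature`, the `window` parameter, and the choice of following characters only are all hypothetical.

```python
def candidate_languages(char):
    """Map a character's code point to the languages it may belong to (simplified)."""
    code = ord(char)
    if 0x3040 <= code <= 0x30FF:
        # Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) blocks.
        return ("ja",)
    if 0x4E00 <= code <= 0x9FFF:
        # CJK Unified Ideographs: treated here as shared by Japanese and Chinese.
        return ("ja", "zh")
    return ()


def language_tag_feature(text, index, window=2):
    """Build a feature for text[index].

    Assumption: an ambiguous character is combined with up to `window`
    following characters; an unambiguous character is used on its own.
    """
    char = text[index]
    if len(candidate_languages(char)) >= 2:
        return char + text[index + 1:index + 1 + window]
    return char


# Assumed sample text for the "soft toothbrush" query.
features = [language_tag_feature("柔らかい歯ブラシ", i) for i in range(8)]
```

In this sketch the ambiguous characters carry their Japanese context into the feature, which is the information the model needs to favour Japanese.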

In another embodiment of the present application, referring to fig. 3, step S02 includes:

In step S301, in the case that the character is a word, different groups of a preset number of consecutive letters in the word are each combined into a letter set;

In languages written with the Latin alphabet, such as English, French, and German, a word consists of multiple letters.

Sometimes a user may inadvertently or intentionally misspell a word when entering the word in a search box.

In one example, the Chinese meaning of the word "volunteer" is "志愿者", and it is assumed that the word "volunteer" is used only in English and not in other languages.

For example, the user intended to enter the English word "volunteer" in the search box but misspelled it as "volunterr"; assume that the word "volunterr" does not exist in English.

Thus, the electronic device cannot determine the language of the word "volunterr" from the word itself; for example, it may fail to determine that the language of "volunterr" is English, or it may erroneously determine that "volunterr" belongs to a language other than English. In either case, accurate input data cannot be provided to the language identification model, which affects the accuracy with which the model identifies the language of the text to be recognized.

Therefore, to avoid this, in another embodiment of the present application, different groups of a preset number of consecutive letters in the word "volunterr" may each be combined into a letter set. For example, with a preset number of 4, the consecutive letters "volu", "olun", "lunt", "unte", "nter", and "terr" may each be combined into a letter set.

The preset number may be determined according to actual conditions; it is greater than 1 and may be smaller than the number of letters in the word.
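Step S301 amounts to taking a sliding window over the letters of the word. A minimal sketch follows; the function name `letter_sets` and the misspelled sample word "volunterr" (an assumed misspelling of "volunteer") are hypothetical.

```python
def letter_sets(word, preset_number=4):
    """Return every run of `preset_number` consecutive letters in `word`."""
    if preset_number <= 1:
        raise ValueError("the preset number must be greater than 1")
    last_start = len(word) - preset_number
    return [word[i:i + preset_number] for i in range(last_start + 1)]


sets_of_four = letter_sets("volunterr", 4)
# ['volu', 'olun', 'lunt', 'unte', 'nter', 'terr']
```

A word of length n yields n - k + 1 letter sets for a preset number k, so a preset number larger than the word length yields none.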

In step S302, a language tag feature of the character is generated according to the word and the letter sets.

In the present application, the word may be combined with the letter sets to obtain the language tag feature of the character.

When a word is misspelled, it is sometimes impossible to determine which language the word belongs to. Alternatively, the word may be assigned to the wrong language. For example, the user wants to input an English word but misspells a letter, and the misspelled word does not exist in English but does exist in French. The electronic device may then determine that the word belongs to French, which differs from the language the user actually intended, so the determined language of the word is wrong.

However, in the present embodiment, different groups of a preset number of consecutive letters in the word are each combined into letter sets. Each letter set includes at least two consecutive letters of the word, and the consecutive letters in at least one letter set are used in at least one language, even though the language of the word sometimes cannot be accurately determined from the word itself.

For example, in the above example, the electronic device cannot accurately determine which language the word "volunterr" belongs to based on the word "volunterr" alone.

However, since the word "volunteer" is used in English, the letter sets "volu", "olun", "lunt", and "unte" are also used in English. With the help of these letter sets, the language identification model can determine that the language of "volunterr" is English, which increases the likelihood of determining that the language of the text to be recognized is English and thus improves the accuracy of determining the language of the text to be recognized.

In this way, the present application can improve the likelihood and accuracy of determining the language of a character that is a word, so that accurate input data can be provided to the language identification model and the accuracy of the language identified by the model for the text to be recognized can be improved.
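The reasoning above can be checked with a short sketch: the feature of a misspelled word is formed from the word plus its letter sets, and several of those letter sets also occur in the correctly spelled English word. The names `letter_sets` and `word_feature`, the list layout of the feature, and the sample misspelling "volunterr" are assumptions made for illustration only.

```python
def letter_sets(word, preset_number=4):
    """Return every run of `preset_number` consecutive letters in `word`."""
    return [word[i:i + preset_number] for i in range(len(word) - preset_number + 1)]


def word_feature(word, preset_number=4):
    """Combine the word with its letter sets (assumed layout: a flat list)."""
    return [word] + letter_sets(word, preset_number)


feature = word_feature("volunterr")

# Letter sets of the misspelled word that also occur in the English word.
shared = sorted(set(letter_sets("volunterr")) & set(letter_sets("volunteer")))
```

Here `shared` contains "lunt", "olun", "unte", and "volu", so four of the six letter sets of the misspelled word still point toward English even though the whole word matches no English entry.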

In this embodiment of the present application, the word may be a misspelled word. In this case, after the language of the text to be recognized is determined based on the language identification model, the misspelled word may be corrected in the context of the determined language by combining the word, its context in the text to be recognized, and an error correction algorithm. For example, "volunterr" is corrected in the English context to "volunteer", "volunterr" in the text to be recognized is replaced with "volunteer", a search operation is performed according to the replaced text to be recognized, and the search result is returned to the user. The present application does not limit the specific error correction algorithm.
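Since no particular error correction algorithm is prescribed, the following sketch simply uses similarity matching from Python's standard `difflib` module as a stand-in. The vocabulary `ENGLISH_VOCABULARY`, the function names, the cutoff value, and the sample query are hypothetical, and context is not used in this simplified version.

```python
import difflib

# Hypothetical vocabulary for the language determined by the model.
ENGLISH_VOCABULARY = ["volunteer", "voluntary", "toothbrush", "soft"]


def correct_word(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or the word unchanged if none is close enough."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, misspelled, vocabulary):
    """Replace the misspelled word in the text with its correction before searching."""
    return text.replace(misspelled, correct_word(misspelled, vocabulary))


corrected_query = correct_text("cheap volunterr shirt", "volunterr", ENGLISH_VOCABULARY)
```

Selecting the vocabulary by the identified language is what ties the correction to the language context; a production system would likely also weigh the surrounding words.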

In the present application, so that the language of the text to be recognized can be determined in step S104 according to the language tag features of each character and the language identification model based on a deep network, the language identification model may be trained in advance, where, referring to fig. 4, the training method includes:

in step S401, a sample data set is obtained, where the sample data set includes sample language tag characteristics of each sample character in at least one sample text;

the sample text may be screened out from a wide range of texts on the network by a technician, and for any sample text, the characters in the sample text may all be characters of one language.

The manner of obtaining the sample language tag feature of each sample character in the sample text can refer to the embodiments shown in fig. 2 and fig. 3, and is not described in detail here.

In step S402, the labeled language of each sample text in the sample data set is obtained;

In the present application, the labeled language may be assigned to each sample text manually by a technician. The labeled language of a sample text may also be obtained in other ways. For example, if the sample text is obtained from an English electronic magazine, whose text usually consists of English words, the labeled language of that sample text may be automatically determined to be English. The present application does not limit the manner of obtaining the labeled language of the sample text.

In step S403, the deep network model is trained based on the sample data set and the labeled language until the parameters in the deep network model converge, so as to obtain a language identification model based on the deep network.

The deep network model may be, for example, a convolutional neural network, a recurrent neural network, or the like.
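The "train until the parameters converge" loop of step S403 can be illustrated with a deliberately small stand-in. The sketch below trains a single-layer softmax classifier in pure Python rather than a convolutional or recurrent network, so it shows only the shape of the training loop and the convergence test, not the model of the present application. The function names, learning rate, tolerance, and toy samples are all assumptions.

```python
import math
from collections import defaultdict


def predict_proba(weights, features):
    """Softmax over per-language scores; each score sums the weights of the features."""
    scores = {lang: sum(w.get(f, 0.0) for f in features) for lang, w in weights.items()}
    top = max(scores.values())
    exps = {lang: math.exp(s - top) for lang, s in scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


def predict(weights, features):
    probs = predict_proba(weights, features)
    return max(probs, key=probs.get)


def train_until_converged(samples, labels, lr=0.5, tol=1e-4, max_epochs=1000):
    """Gradient descent on cross-entropy until the epoch loss stops changing."""
    languages = sorted(set(labels))
    weights = {lang: defaultdict(float) for lang in languages}
    previous_loss = float("inf")
    for _ in range(max_epochs):
        loss = 0.0
        for features, gold in zip(samples, labels):
            probs = predict_proba(weights, features)
            loss -= math.log(probs[gold])
            for lang in languages:
                gradient = probs[lang] - (1.0 if lang == gold else 0.0)
                for f in features:
                    weights[lang][f] -= lr * gradient
        if abs(previous_loss - loss) < tol:  # assumed convergence criterion
            break
        previous_loss = loss
    return weights


# Toy sample language tag features and labeled languages (hypothetical).
toy_samples = [["ら", "か"], ["的", "我"], ["t", "h", "e"]]
toy_labels = ["ja", "zh", "en"]
model = train_until_converged(toy_samples, toy_labels)
```

Replacing the linear scorer with a deep network changes `predict_proba` and the gradient step, while the outer loop over the sample data set and labeled languages keeps the same structure.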

In one embodiment, the language identification model may be trained based on the sample language tag features of each sample character in sample texts that each contain characters of a single language. After such a trained model goes online, its accuracy in identifying the language of a text containing characters of a single language is high, but its accuracy in identifying the language of a text containing characters of at least two languages is low.

In the present application, the text input by the user in the search box may sometimes include characters of at least two languages. A model trained in this way does not actually have the ability to identify the language of such a text, so the language it identifies for the text to be recognized is likely to be wrong.

Therefore, to avoid the above situation, in another embodiment of the present application, the robustness of the language identification model needs to be improved. For example, when the sample data set is obtained, a plurality of first sample texts may be obtained, each first sample text including characters of one language; that is, all characters in any first sample text are characters of the same language, with no characters of other languages.

A second sample text may then be generated from the at least two first sample texts comprising characters of different languages, e.g. by combining the at least two first sample texts comprising characters of different languages, resulting in the second sample text.

A sample data set may then be generated from the plurality of first sample texts and the generated second sample text.

For example, sample language tag features of each sample character in each first sample text and sample language tag features of each sample character in each second sample text are obtained, and then the sample language tag features of each sample character in each first sample text and the sample language tag features of each sample character in each second sample text are combined to obtain a sample data set.

In this embodiment, the language identification model is trained based on the sample language tag features of each sample character in the first sample text of the single language character and the sample language tag features of each sample character in the second sample text of the multilingual character. Thus, the trained language identification model can have the ability to identify the language of the text including the characters of at least two languages. Therefore, the accuracy of recognizing the language of the text including the characters of at least two languages by the trained language recognition model can be improved.
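Generating second sample texts from first sample texts can be sketched as pairing texts whose languages differ and joining them. This is one possible reading only: the function name `build_second_samples`, the space used as a joiner, and the decision to record both source languages alongside the combined text are assumptions, since the manner of combination and labeling is not specified here.

```python
from itertools import combinations


def build_second_samples(first_samples):
    """Combine pairs of first sample texts that contain characters of different languages.

    `first_samples` is a list of (text, language) pairs. Each result is a
    (combined_text, (language_a, language_b)) pair.
    """
    second_samples = []
    for (text_a, lang_a), (text_b, lang_b) in combinations(first_samples, 2):
        if lang_a != lang_b:
            second_samples.append((text_a + " " + text_b, (lang_a, lang_b)))
    return second_samples


# Hypothetical first sample texts, each in a single language.
first = [("soft toothbrush", "en"), ("柔らかい", "ja"), ("hello", "en")]
second = build_second_samples(first)
```

The sample data set would then be built from the language tag features of both `first` and `second`.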

In the present application, the original text input by the user in the search box may be used as the text to be recognized. However, the original text sometimes includes non-language characters in addition to language characters; for example, the non-language characters include punctuation marks, Arabic numerals, and the like. Here, language characters can indicate a language, whereas non-language characters usually cannot.

When the language identification model is trained, if the sample text used includes non-language characters in addition to language characters, then, in the course of training on the sample text and its labeled language, the model may learn the labeled language of the sample text as the language class of the non-language characters in the sample text.

Thus, after the language identification model goes online, suppose a user inputs a search text in the search box that includes both language characters and non-language characters. If the proportion of non-language characters in the entire search text is greater than a preset threshold, the language that the model learned for the non-language characters may be determined as the language of the search text, even though it may not be the language of the language characters in the search text. In this case, the language of the search text is identified incorrectly.

Therefore, to avoid this situation, in another embodiment of the present application, when the language identification model is trained, the sample text includes language characters and does not include non-language characters, so that every sample character in the sample text is a language character.

In this way, during training the language identification model does not learn from non-language characters but only from language characters; for example, it learns the labeled language of the sample text as the language class of the language characters in the sample text.

Thus, after the language identification model is applied online, the electronic device may obtain an original text, where the original text includes a text input by the user in the search box, for example, when the user inputs a search text in the search box, the electronic device obtains the search text input by the user as the original text.

Sometimes the original text includes non-language characters. However, the language identification model is trained on sample texts that include only language characters, so its accuracy is high for search texts that include only language characters and low for search texts mixed with non-language characters.

Therefore, to prevent non-language characters in the original text from adversely affecting the accuracy with which the language identification model identifies the language of the original text, in the present application the non-language characters may be removed from the original text to obtain the text to be recognized. The text to be recognized then includes language characters and no non-language characters, after which step S102 is performed.
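Removing non-language characters might be sketched with Unicode general categories from Python's standard `unicodedata` module. The rule used here is an assumption: characters in the letter ("L") and mark ("M") categories are treated as language characters, whitespace is kept only to preserve word boundaries, and everything else (punctuation, digits, symbols) is dropped. The function name `strip_non_language` is hypothetical.

```python
import unicodedata


def strip_non_language(original_text):
    """Return the text to be recognized with non-language characters removed."""
    kept = []
    for char in original_text:
        category = unicodedata.category(char)
        if category[0] in ("L", "M"):
            kept.append(char)   # letters and combining marks
        elif char.isspace():
            kept.append(" ")    # keep word boundaries (assumption)
    # Collapse runs of whitespace left behind by removed characters.
    return " ".join("".join(kept).split())


text_to_recognize = strip_non_language("soft toothbrush, 2 pcs!")
```

The same filter could be applied to sample texts before training so that the training data and the online input are cleaned in the same way.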

Referring to fig. 5, a block diagram of an embodiment of a language identification apparatus according to the present application is shown, and specifically, the apparatus may include the following modules:

the first obtaining module 11 is configured to obtain a text to be recognized;

a second obtaining module 12, configured to obtain codes of each character in the text to be recognized;

a third obtaining module 13, configured to obtain, according to the code, a language tag feature of each character in the text to be recognized;

and the determining module 14 is configured to determine the language of the text to be recognized according to the language tag features of each character and the language recognition model based on the deep network.