Voice recognition method and device, electronic equipment and computer readable storage medium

Document No.: 1906554  Publication date: 2021-11-30  Views: 24  Language: Chinese

Reading note: This technology, "Voice recognition method and device, electronic equipment and computer readable storage medium", was designed and created by Li Zexuan on 2021-10-19. Main content: The invention discloses a speech recognition method and device, electronic equipment, and a computer-readable storage medium. The method includes: generating a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to a target wake-up word, wherein the target wake-up word is a user-defined wake-up word; generating a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary; and decoding the target speech frame by frame using the decoding graph to obtain a speech recognition result. The invention addresses the technical problem of low reliability of speech recognition approaches in the related art.

1. A speech recognition method, comprising:

generating a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to a target wake-up word, wherein the target wake-up word comprises a user-defined wake-up word;

generating a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary;

and decoding the target speech frame by frame using the decoding graph to obtain a speech recognition result.

2. The method of claim 1, wherein before generating the wake word acoustic dictionary and the wake word language dictionary corresponding to the target wake word, the method further comprises: generating a first mapping table, wherein the first mapping table comprises a mapping relation between a Chinese character and at least one pinyin of the Chinese character;

wherein generating the first mapping table comprises:

performing word segmentation processing on a preset text by using a first word segmentation tool to obtain a word segmentation result;

performing pinyin annotation on the word segmentation result by using a pinyin generation tool to obtain a second mapping table, wherein the second mapping table contains a mapping relation between a word and at least one pinyin of the word;

analyzing the second mapping table to obtain a third mapping table, wherein the third mapping table comprises a mapping relation between each character in the words and at least one pinyin of each character;

and combining entries of the third mapping table according to a preset combination mode to obtain the first mapping table.

3. The method of claim 2, wherein generating the wake-up word acoustic dictionary corresponding to the target wake-up word comprises:

acquiring the target wake-up word;

performing word segmentation processing on the target wake-up word by using a second word segmentation tool to obtain a plurality of sub-words;

processing the plurality of sub-words according to the first mapping table to obtain a fourth mapping table, wherein the fourth mapping table comprises a mapping relation between each sub-word in the plurality of sub-words and at least one pinyin of each sub-word;

and fusing the fourth mapping table with the first mapping table to obtain the wake-up word acoustic dictionary.

4. The method of claim 2, wherein generating the wake-up word language dictionary corresponding to the target wake-up word comprises:

performing de-duplication processing on the Chinese characters in the second mapping table to obtain a Chinese character dictionary;

performing word segmentation processing on the target wake-up word to obtain a plurality of sub-words, and performing de-duplication processing on the plurality of sub-words to obtain remaining sub-words;

and combining the remaining sub-words with the Chinese character dictionary to obtain the wake-up word language dictionary.

5. The method of claim 1, wherein generating a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary comprises:

fusing the wake-up word acoustic dictionary with a preset dictionary to obtain a fused acoustic dictionary;

fusing the wake-up word language dictionary with a preset language dictionary to obtain a fused language dictionary;

and inputting the fused acoustic dictionary and the fused language dictionary into a decoding graph generation tool, and processing them with the decoding graph generation tool to obtain the decoding graph.

6. The method according to any one of claims 1 to 5, wherein decoding the target speech frame by frame using the decoding graph to obtain a speech recognition result comprises:

acquiring an audio stream corresponding to the target speech;

performing feature extraction on the audio stream to obtain target acoustic features;

determining a phoneme information sequence corresponding to the target acoustic features based on an acoustic model, wherein the acoustic model is a model for phoneme recognition based on the acoustic features;

and processing the phoneme information sequence by using the decoding graph to obtain the speech recognition result.

7. The method of claim 6, wherein after decoding the target speech frame by frame using the decoding graph to obtain a speech recognition result, the method further comprises:

and when it is determined that the target wake-up word exists in the speech recognition result, waking up the device corresponding to the target wake-up word.

8. A speech recognition apparatus, comprising:

a first generation module, configured to generate a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to a target wake-up word, wherein the target wake-up word comprises a user-defined wake-up word;

a second generation module, configured to generate a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary;

and a decoding module, configured to decode the target speech frame by frame using the decoding graph to obtain a speech recognition result.

9. An electronic device, comprising:

a processor; and

a memory for storing executable instructions of the processor;

wherein the processor is configured to perform the speech recognition method of any of claims 1 to 7 via execution of the executable instructions.

10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the speech recognition method according to any one of claims 1 to 7.

Technical Field

The invention relates to the technical field of computers, in particular to a voice recognition method and device, electronic equipment and a computer readable storage medium.

Background

Speech recognition is an important technology for making various devices more intelligent, as it enables spoken communication with a machine. That is, speech recognition technology allows a machine to convert a speech signal into corresponding text or a command through a process of recognition and understanding. It mainly involves three aspects: feature extraction, pattern matching criteria, and model training.

In a typical speech recognition system in the related art, an acquisition module acquires a wake-up word uttered by an operator, an acoustic feature extraction module extracts feature information of the new word, and a custom wake-up word module outputs a custom wake-up word list. A pronunciation dictionary generator then generates a pronunciation dictionary from the custom wake-up word list and a preset dictionary, a language model generator generates a language model from the pronunciation dictionary, a decoding graph generator generates a static decoding graph from the language model and the pronunciation dictionary, and a decoder decodes against the static decoding graph and a general acoustic model to determine whether the speech data contains the new word.

However, the above solution has the following drawbacks: 1) the preset dictionary is constructed directly from phonemes, which increases false wake-ups; 2) when a character in the wake-up word is a polyphone, its different pronunciations are not all added to the dictionary (for example, "tongxue" (classmate) has the pronunciations T UH2 NG2 X IY2 EH2, T UH4 NG4 X IY2 EH2, and T UH5 NG5 X IY2 EH2; to improve the recognition rate, similar-sounding variants of the wake-up word can also be added to the pronunciation dictionary); 3) not all sub-words of the wake-up word are added to the dictionary (for example, for the wake-up word "Xiaobi classmate", adding sub-words such as "Xiaobi" and "classmate" to the dictionary reduces false wake-ups); 4) hot-word technology is not introduced (hot words can improve the wake-up rate, for example by forcibly correcting a misrecognized character in the recognition result back to the corresponding character of the wake-up word).

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiments of the invention provide a speech recognition method and device, electronic equipment, and a computer-readable storage medium, which at least solve the technical problem of low reliability of speech recognition approaches in the related art.

According to an aspect of an embodiment of the present invention, there is provided a speech recognition method including: generating a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to a target wake-up word, wherein the target wake-up word comprises a user-defined wake-up word; generating a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary; and decoding the target speech frame by frame using the decoding graph to obtain a speech recognition result.

Optionally, before generating the wake-up word acoustic dictionary and the wake-up word language dictionary corresponding to the target wake-up word, the speech recognition method further includes: generating a first mapping table, wherein the first mapping table comprises a mapping relation between a Chinese character and at least one pinyin of the Chinese character; wherein generating the first mapping table comprises: performing word segmentation processing on a preset text by using a first word segmentation tool to obtain a word segmentation result; performing pinyin annotation on the word segmentation result by using a pinyin generation tool to obtain a second mapping table, wherein the second mapping table contains a mapping relation between a word and at least one pinyin of the word; analyzing the second mapping table to obtain a third mapping table, wherein the third mapping table comprises a mapping relation between each character in the words and at least one pinyin of each character; and combining entries of the third mapping table according to a preset combination mode to obtain the first mapping table.

Optionally, generating the wake-up word acoustic dictionary corresponding to the target wake-up word includes: acquiring the target wake-up word; performing word segmentation processing on the target wake-up word by using a second word segmentation tool to obtain a plurality of sub-words; processing the plurality of sub-words according to the first mapping table to obtain a fourth mapping table, wherein the fourth mapping table comprises the mapping relation between each sub-word in the plurality of sub-words and at least one pinyin of each sub-word; and fusing the fourth mapping table with the first mapping table to obtain the wake-up word acoustic dictionary.

Optionally, generating the wake-up word language dictionary corresponding to the target wake-up word includes: performing de-duplication processing on the Chinese characters in the second mapping table to obtain a Chinese character dictionary; performing word segmentation processing on the target wake-up word to obtain a plurality of sub-words, and performing de-duplication processing on the plurality of sub-words to obtain remaining sub-words; and combining the remaining sub-words with the Chinese character dictionary to obtain the wake-up word language dictionary.

Optionally, generating a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary includes: fusing the wake-up word acoustic dictionary with a preset dictionary to obtain a fused acoustic dictionary; fusing the wake-up word language dictionary with a preset language dictionary to obtain a fused language dictionary; and inputting the fused acoustic dictionary and the fused language dictionary into a decoding graph generation tool, and processing them with the decoding graph generation tool to obtain the decoding graph.

Optionally, performing frame-by-frame decoding on the target speech by using the decoding graph to obtain a speech recognition result includes: acquiring an audio stream corresponding to the target speech; performing feature extraction on the audio stream to obtain target acoustic features; determining a phoneme information sequence corresponding to the target acoustic features based on an acoustic model, wherein the acoustic model is a model for phoneme recognition based on acoustic features; and processing the phoneme information sequence by using the decoding graph to obtain the speech recognition result.
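A minimal sketch of this frame-by-frame flow follows. The feature extractor, acoustic model, and decoding graph are trained, non-trivial components in a real system; the stand-ins below (a pass-through "acoustic model" and a phone-sequence trie standing in for the decoding graph) are hypothetical simplifications that only illustrate the data flow, not the actual method.

```python
def extract_features(audio_frames):
    # Stand-in for MFCC/filterbank extraction: pass frames through unchanged.
    return audio_frames

def acoustic_model(feature):
    # Stand-in: map each "feature" directly to a phone label.
    return feature

def build_decoding_trie(lexicon):
    # lexicon: word -> phone sequence; build a prefix trie over phones.
    trie = {}
    for word, phones in lexicon.items():
        node = trie
        for p in phones:
            node = node.setdefault(p, {})
        node["<word>"] = word
    return trie

def decode(audio_frames, trie):
    # Decode frame by frame, emitting a word whenever a lexicon entry completes.
    results, node = [], trie
    for feat in extract_features(audio_frames):
        phone = acoustic_model(feat)
        # Advance in the trie, or restart matching from the root on mismatch.
        node = node.get(phone, trie.get(phone, {}))
        if "<word>" in node:
            results.append(node["<word>"])
            node = trie
    return results

lexicon = {"Xiaobi classmate": ["X", "IY3", "AW3", "B", "IY4",
                                "T", "UH2", "NG2", "X", "IY2", "EH2"]}
trie = build_decoding_trie(lexicon)
frames = ["X", "IY3", "AW3", "B", "IY4", "T", "UH2", "NG2", "X", "IY2", "EH2"]
print(decode(frames, trie))  # ['Xiaobi classmate']
```

A real decoder would score competing hypotheses with acoustic and language model weights instead of this exact-match traversal.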

Optionally, after performing frame-by-frame decoding on the target speech by using the decoding graph to obtain a speech recognition result, the speech recognition method further includes: when it is determined that the target wake-up word exists in the speech recognition result, waking up the device corresponding to the target wake-up word.

According to another aspect of the embodiments of the present invention, there is also provided a speech recognition apparatus including: a first generation module, configured to generate a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to a target wake-up word, wherein the target wake-up word comprises a user-defined wake-up word; a second generation module, configured to generate a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary; and a decoding module, configured to decode the target speech frame by frame using the decoding graph to obtain a speech recognition result.

Optionally, the speech recognition apparatus further comprises: a third generation module, configured to generate a first mapping table before the wake-up word acoustic dictionary and the wake-up word language dictionary corresponding to the target wake-up word are generated, wherein the first mapping table comprises a mapping relation between a Chinese character and at least one pinyin of the Chinese character; wherein the third generation module comprises: a first word segmentation unit, configured to perform word segmentation processing on a preset text by using a first word segmentation tool to obtain a word segmentation result; a pinyin annotation unit, configured to perform pinyin annotation on the word segmentation result by using a pinyin generation tool to obtain a second mapping table, the second mapping table containing a mapping relation between a word and at least one pinyin of the word; an analysis unit, configured to analyze the second mapping table to obtain a third mapping table, the third mapping table containing a mapping relation between each character in the words and at least one pinyin of each character; and a first combination unit, configured to combine entries of the third mapping table according to a preset combination mode to obtain the first mapping table.

Optionally, the first generation module includes: a first acquisition unit, configured to acquire the target wake-up word; a second word segmentation unit, configured to perform word segmentation processing on the target wake-up word by using a second word segmentation tool to obtain a plurality of sub-words; a first processing unit, configured to process the plurality of sub-words according to the first mapping table to obtain a fourth mapping table, the fourth mapping table comprising the mapping relation between each sub-word in the plurality of sub-words and at least one pinyin of each sub-word; and a first fusion unit, configured to fuse the fourth mapping table with the first mapping table to obtain the wake-up word acoustic dictionary.

Optionally, the first generation module includes: a de-duplication unit, configured to perform de-duplication processing on the Chinese characters in the second mapping table to obtain a Chinese character dictionary; a third word segmentation unit, configured to perform word segmentation processing on the target wake-up word to obtain a plurality of sub-words, and to perform de-duplication processing on the plurality of sub-words to obtain the remaining sub-words; and a first combination unit, configured to combine the remaining sub-words with the Chinese character dictionary to obtain the wake-up word language dictionary.

Optionally, the second generation module includes: a second fusion unit, configured to fuse the wake-up word acoustic dictionary with a preset dictionary to obtain a fused acoustic dictionary; a third fusion unit, configured to fuse the wake-up word language dictionary with a preset language dictionary to obtain a fused language dictionary; and a generation unit, configured to input the fused acoustic dictionary and the fused language dictionary into a decoding graph generation tool, and to process them with the decoding graph generation tool to obtain the decoding graph.

Optionally, the decoding module includes: a second acquisition unit, configured to acquire an audio stream corresponding to the target speech; an extraction unit, configured to perform feature extraction on the audio stream to obtain target acoustic features; a determination unit, configured to determine a phoneme information sequence corresponding to the target acoustic features based on an acoustic model, where the acoustic model is a model for performing phoneme recognition based on acoustic features; and a processing unit, configured to process the phoneme information sequence by using the decoding graph to obtain the speech recognition result.

Optionally, the speech recognition apparatus further comprises: a wake-up module, configured to wake up the device corresponding to the target wake-up word when it is determined that the target wake-up word exists in the speech recognition result, after the target speech has been decoded frame by frame using the decoding graph.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the speech recognition method of any of the above via execution of the executable instructions.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium including a stored computer program, wherein, when the computer program runs, the device where the computer-readable storage medium is located is controlled to execute any one of the above speech recognition methods.

In the embodiments of the present invention, a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to a target wake-up word are generated, wherein the target wake-up word comprises a user-defined wake-up word; a decoding graph is generated based on the wake-up word acoustic dictionary and the wake-up word language dictionary; and the target speech is decoded frame by frame using the decoding graph to obtain a speech recognition result. The speech recognition method provided by the embodiments of the present invention builds a decoding graph corresponding to the user-defined wake-up word and uses this new decoding graph to decode the target speech frame by frame to obtain the speech recognition result, thereby improving the recognition accuracy of the user-defined wake-up word and solving the technical problem of low reliability of speech recognition approaches in the related art.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow diagram of a speech recognition method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an alternative speech recognition method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

In accordance with an embodiment of the present invention, there is provided a method embodiment of a speech recognition method, it being noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.

Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention, as shown in fig. 1, the speech recognition method includes the steps of:

Step S102, generating a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to the target wake-up word, wherein the target wake-up word comprises a user-defined wake-up word.

Optionally, the target wake-up word here may be a word customized by the user according to actual needs, for example a custom name such as "Xiaobi classmate". Once the user's setting succeeds, the target device can be woken up based on the user-defined wake-up word.

Optionally, in the embodiments of the present invention, the wake-up word here may wake up a target device, for example an electronic device such as an air conditioner, a refrigerator, a television, a washing machine, or a smart speaker.

Since the target wake-up word is set by the user in this embodiment, for the target device to be able to recognize the wake-up word, a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to the target wake-up word need to be constructed.

Step S104, generating a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary.

Step S106, decoding the target speech frame by frame using the decoding graph to obtain a speech recognition result.

As can be seen from the above, in the embodiments of the present invention, after the user sets the user-defined wake-up word (the target wake-up word), a wake-up word acoustic dictionary and a wake-up word language dictionary corresponding to the target wake-up word are generated; a decoding graph is then generated based on these two dictionaries, and the decoding graph is used to decode the target speech frame by frame to obtain a speech recognition result. A decoding graph corresponding to the user-defined wake-up word is thus established and used for decoding, which improves the recognition accuracy of the user-defined wake-up word.

Therefore, the speech recognition method provided by the embodiments of the present invention solves the technical problem of low reliability of speech recognition approaches in the related art.

As an optional embodiment, before generating the wake-up word acoustic dictionary and the wake-up word language dictionary corresponding to the target wake-up word, the speech recognition method may further include: generating a first mapping table, wherein the first mapping table comprises a mapping relation between a Chinese character and at least one pinyin of the Chinese character; wherein generating the first mapping table comprises: performing word segmentation processing on a preset text by using a first word segmentation tool to obtain a word segmentation result; performing pinyin annotation on the word segmentation result by using a pinyin generation tool to obtain a second mapping table, wherein the second mapping table contains a mapping relation between a word and at least one pinyin of the word; analyzing the second mapping table to obtain a third mapping table, wherein the third mapping table contains the mapping relation between each character in the words and at least one pinyin of each character; and combining entries of the third mapping table according to a preset combination mode to obtain the first mapping table.

That is, in the embodiments of the present invention, a preset dictionary (a Chinese character-pinyin mapping table, i.e., an acoustic dictionary, corresponding to the first mapping table) may be established first. Specifically, this can be divided into two steps: 1) constructing a word-pinyin mapping table (i.e., the second mapping table); 2) constructing a single-character-pinyin mapping table (i.e., the third mapping table).

The first word segmentation tool can be used to segment the collected text material, and the pinyin generation tool then automatically annotates the segmented words with pinyin, forming the word-pinyin mapping table. It should be noted that, in the embodiments of the present invention, the type of the first word segmentation tool is not specifically limited; it may be any platform or software with a word segmentation function.

Next, the word-pinyin mapping table can be parsed into single-character-pinyin mapping pairs, and the pairs can be de-duplicated; for example, the pairs you-ni3, you-ni3, good-hao3, good-hao3 become, after de-duplication, you-ni3 and good-hao3. Entries for homophonic but different characters are then merged, that is, only one character is kept as the representative of each set of homophones; for example, three different characters that all map to ni2 are merged into a single entry for ni2, with one character selected at random as the representative of each pronunciation-tone pair. This yields the Chinese character-pinyin mapping table.
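The parse-deduplicate-merge procedure above can be sketched as follows. The small word-pinyin table is a hypothetical stand-in for the output of the segmentation and pinyin annotation tools, and the merge rule (keep the first character seen for each pronunciation) is an assumption:

```python
def build_char_pinyin_table(word_pinyin_table):
    """Parse word->pinyin entries into de-duplicated char->pinyin pairs,
    keeping one representative character per shared pronunciation."""
    char_pinyin = {}       # character -> set of pinyins
    seen_pinyins = set()   # pronunciations already assigned a representative
    for word, pinyins in word_pinyin_table.items():
        for char, py in zip(word, pinyins):
            if py in seen_pinyins and char not in char_pinyin:
                continue   # homophone merge: keep the first character seen
            char_pinyin.setdefault(char, set()).add(py)
            seen_pinyins.add(py)
    return char_pinyin

# Hypothetical "second mapping table": word -> per-character pinyin list.
word_table = {
    "你好": ["ni3", "hao3"],
    "你":   ["ni3"],          # duplicate pair, removed by de-duplication
    "妮":   ["ni2"],
    "泥":   ["ni2"],          # homophone of 妮, merged away
}
print(build_char_pinyin_table(word_table))
# {'你': {'ni3'}, '好': {'hao3'}, '妮': {'ni2'}}
```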

In one aspect, in this embodiment, the target wake-up word is decomposed and recombined to obtain the set of sub-words of the wake-up word, and these sub-words are added to the pronunciation dictionary. For example, the sub-words of the wake-up word "Xiaobi classmate" include "Xiaobi", "Bitong", "Xiaobitong", and so on. After the sub-words are added, when the user inadvertently utters only a sub-word, the device will not be woken up, which reduces the model's false wake-up rate.
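The sub-word decomposition can be sketched as below, assuming the policy is "all contiguous sub-strings of at least two characters" (the exact decomposition policy is not specified in the text):

```python
def subwords(wake_word, min_len=2):
    """Return all contiguous sub-strings of the wake-up word with
    length >= min_len, including the full wake-up word itself."""
    n = len(wake_word)
    return [wake_word[i:j]
            for i in range(n)
            for j in range(i + min_len, n + 1)]

# Example wake-up word from the document: 小毕同学 ("Xiaobi classmate").
print(subwords("小毕同学"))
# includes 小毕, 毕同, 同学, 小毕同, 毕同学, and the full word 小毕同学
```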

On the other hand, in the embodiments of the present invention, a table lookup is performed for each character in the wake-up word to obtain all of its pronunciations, and the pronunciations of the different characters are then combined to obtain all pronunciation combinations of the wake-up word, which are added to the pronunciation dictionary. This can effectively improve the wake-up rate.
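The pronunciation combination step can be sketched with a Cartesian product over per-character pronunciation lists; the polyphone entries below are hypothetical excerpts from the character-pinyin table, not values given in the text:

```python
from itertools import product

# Hypothetical excerpt of the character-pinyin table ("first mapping table"),
# with 同 treated as a polyphone for illustration.
char_pinyins = {
    "同": ["tong2", "tong4"],
    "学": ["xue2"],
}

def all_pronunciations(word, table):
    """Combine the per-character pronunciations of a word into all
    possible pronunciation sequences of the whole word."""
    options = [table[ch] for ch in word]
    return [" ".join(combo) for combo in product(*options)]

print(all_pronunciations("同学", char_pinyins))
# ['tong2 xue2', 'tong4 xue2']
```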

In this way, the size of the pronunciation dictionary is greatly reduced, which in turn greatly reduces the false wake-up rate.

As an optional embodiment, in the embodiments of the present invention, the Chinese characters in the word-pinyin mapping table may be de-duplicated (the table contains many characters, and all of them are de-duplicated; for example, the characters of a phrase meaning "hello, who are you" contain a repeated character, and de-duplication leaves four characters), thereby obtaining the language dictionary.

As an alternative embodiment, in step S102, generating the wake-up word acoustic dictionary corresponding to the target wake-up word may include: acquiring the target wake-up word; performing word segmentation processing on the target wake-up word by using a second word segmentation tool to obtain a plurality of sub-words; processing the plurality of sub-words according to the first mapping table to obtain a fourth mapping table, wherein the fourth mapping table comprises the mapping relation between each sub-word in the plurality of sub-words and at least one pinyin of each sub-word; and fusing the fourth mapping table with the first mapping table to obtain the wake-up word acoustic dictionary.

In this embodiment, after the target wake-up word is obtained, a second word segmentation tool may be used to segment the target wake-up word to obtain all of its sub-words; for example, the sub-words of "Xiaobi classmate" are "Xiaobi" and "Bitong". The plurality of sub-words can then be processed according to the Chinese character-pinyin mapping table to obtain a sub-word-pinyin mapping table (i.e., the fourth mapping table), and the sub-word-pinyin mapping table and the Chinese character-pinyin mapping table are combined to obtain the wake-up word acoustic dictionary.

For example:

!SIL SIL
[SPK] SPN
[FIL] NSN
<UNK> SPN
You N IY3
Good HH AW3
Jin J IY1 N1
Day T IY1 AE1 N1
Qi Q IY4
Xiaobi X IY3 AW3 B IY4
Bitong B IY4 T UH2 NG2
Xiaobi classmate X IY3 AW3 B IY4 T UH2 NG2 X IY2 EH2
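The segmentation and table-lookup steps above can be sketched as follows; the bigram-based segmentation and the character-pinyin table are illustrative assumptions standing in for the patent's actual second word-segmentation tool and first mapping table:

```python
from itertools import product

def segment_subwords(word, n=2):
    # Enumerate the contiguous character bigrams of the wake word, plus
    # the full word itself (an assumed stand-in for the segmentation tool).
    subs = ["".join(word[i:i + n]) for i in range(len(word) - n + 1)]
    return subs + [word]

def subword_pinyin_table(subwords, char_table):
    # Map each sub-word to every pronunciation combination of its
    # characters (cartesian product over per-character alternatives).
    return {
        sw: [" ".join(p) for p in product(*(char_table[c] for c in sw))]
        for sw in subwords
    }

# Illustrative character -> pinyin table (tones encoded as trailing digits).
char_table = {"小": ["xiao3"], "比": ["bi3", "bi4"],
              "同": ["tong2"], "学": ["xue2"]}
subs = segment_subwords("小比同学")
print(subs)  # ['小比', '比同', '同学', '小比同学']
print(subword_pinyin_table(["小比"], char_table)["小比"])
```

Combining per-character alternatives with a cartesian product is what yields "all pronunciation combinations" of each sub-word, as described above.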

It should be noted that, in the embodiment of the present invention, the pronunciation of the custom wakeup word is added to the custom wakeup word acoustic dictionary.

As an alternative embodiment, in step S102, generating a wake word language dictionary corresponding to the target wake word may include: carrying out duplication elimination processing on the Chinese characters in the second mapping table to obtain a Chinese character dictionary; performing word segmentation processing on the target awakening word to obtain a plurality of sub-words, and performing de-duplication processing on the plurality of sub-words to obtain residual sub-words; and combining the residual sub-words with the Chinese character dictionary to obtain the awakening word language dictionary.

In this embodiment, the Chinese characters in the word-pinyin mapping table may be subjected to deduplication processing to obtain a Chinese character dictionary, then the target wake-up word is subjected to word segmentation processing to obtain a plurality of sub-words, the plurality of sub-words are subjected to deduplication processing to obtain remaining sub-words, and the remaining sub-words are combined with the Chinese character dictionary to obtain the wake-up word language dictionary.

Taking the target wake-up word as "Xiaobi classmate" as an example, the wake-up word language dictionary may be:

<UNK>
You
Good
Sky
Qi
Xiaobi
Bitong
Classmate
Xiaobi classmate
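The de-duplication and combination steps that produce such a language dictionary can be sketched as follows; the entry contents and the "<UNK>" placeholder are illustrative assumptions:

```python
def build_language_dictionary(table_chars, subwords):
    # De-duplicate the characters from the mapping table and the wake-word
    # sub-words, then combine them (order-preserving) into the wake-up
    # word language dictionary, keeping "<UNK>" as the unknown entry.
    return list(dict.fromkeys(["<UNK>"] + list(table_chars) + list(subwords)))

vocab = build_language_dictionary("你好天气", ["小比", "比同", "同学", "小比同学"])
print(vocab)
```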

As an alternative embodiment, in step S104, generating a decoding graph based on the wake-up word acoustic dictionary and the wake-up word language dictionary may include: fusing the wake-up word acoustic dictionary with a preset dictionary to obtain a fused acoustic dictionary; fusing the wake-up word language dictionary with a preset language dictionary to obtain a fused language dictionary; and inputting the fused acoustic dictionary and the fused language dictionary into a decoding graph generation tool, which processes them to obtain the decoding graph.

In this embodiment, the wake-up word acoustic dictionary and the preset dictionary may be fused to obtain a fused acoustic dictionary; at the same time, the wake-up word language dictionary is fused with the preset language dictionary to obtain a fused language dictionary. The decoding model building module then builds a decoding graph (i.e., the model file used in speech recognition, HCLG.fst) from the fused acoustic dictionary and the fused language dictionary, and overwrites the original decoding graph.
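The two fusion steps can be sketched as a dictionary merge; the entry format (word mapped to a list of pronunciations) is an assumption about the dictionaries' in-memory representation, not the patent's actual file format:

```python
def fuse_dictionaries(preset, wake):
    # Merge wake-word entries into a copy of the preset dictionary;
    # wake-word pronunciations extend (rather than replace) existing ones.
    fused = {word: list(prons) for word, prons in preset.items()}
    for word, prons in wake.items():
        entry = fused.setdefault(word, [])
        for p in prons:
            if p not in entry:
                entry.append(p)
    return fused

preset = {"你": ["ni3"], "好": ["hao3"]}
wake = {"小比": ["xiao3 bi3", "xiao3 bi4"], "你": ["ni3"]}
print(fuse_dictionaries(preset, wake))
```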

In addition, in the embodiment of the present invention, the preset dictionary may be a mapping table prepared before the wake-up word acoustic dictionary and wake-up word language dictionary are generated: the collected text material is segmented with a word segmentation tool, the segmented words are automatically labeled with pinyin by a pinyin generation tool to obtain a word-pinyin mapping table, that table is parsed into single-character-pinyin mapping pairs, and the pairs are de-duplicated. The preset dictionary records pinyin.

The preset language dictionary may likewise be a Chinese character dictionary obtained, before the wake-up word acoustic dictionary and wake-up word language dictionary are generated, by segmenting the collected text material with a word segmentation tool. The preset language dictionary records Chinese words.

As an alternative embodiment, in step S106, performing frame-by-frame decoding on the target speech by using the decoding map to obtain a speech recognition result, which may include: acquiring an audio stream corresponding to target voice; extracting the characteristics of the audio stream to obtain target acoustic characteristics; determining a phoneme information sequence corresponding to the target acoustic features based on an acoustic model, wherein the acoustic model is a model for performing phoneme recognition based on the acoustic features; and processing the phoneme information sequence by using the decoding graph to obtain a voice recognition result.

For example, when a user speaks into the device, the audio stream is sent to the decoding module, which loads the previously constructed decoding graph to decode the stream; the decoding process uses a hot-word technique, and finally it is determined whether the audio stream contains a wake-up word.

The decoding step may be: Mel features are extracted from each frame of the audio stream to obtain acoustic features (i.e., the target acoustic features), which are then fed into an acoustic model to obtain triphones; over time, a series of triphone strings is generated and assembled into words and sentences through the language model (i.e., the decoding graph). Finally, fuzzy matching is used to judge whether the sentence (i.e., the target speech) contains the wake-up word.
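The final fuzzy-matching check might be sketched with a sliding-window similarity; the 0.8 threshold and the `difflib`-based character similarity are illustrative assumptions, since the patent does not specify the matching algorithm:

```python
from difflib import SequenceMatcher

def contains_wake_word(transcript, wake_word, threshold=0.8):
    # Slide a window the size of the wake word over the transcript and
    # keep the best character-level similarity against the wake word.
    n = len(wake_word)
    best = 0.0
    for i in range(max(1, len(transcript) - n + 1)):
        ratio = SequenceMatcher(None, transcript[i:i + n], wake_word).ratio()
        best = max(best, ratio)
    return best >= threshold

print(contains_wake_word("小比同学今天天气怎么样", "小比同学"))  # True
print(contains_wake_word("你好", "小比同学"))  # False
```

Fuzzy rather than exact matching tolerates the occasional character-level recognition error around the wake word.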

Because the hot-word technique is introduced in the embodiment of the invention, when a sub-word of the wake-up word appears on a search path of the decoding graph, the acoustic score and language score of that path are boosted, so the decoding result is biased toward the wake-up word and the wake-up rate is improved.
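The hot-word boost can be illustrated as a fixed bonus added to a path's combined score; the bonus value and the additive log-score formulation are assumptions made for illustration:

```python
def path_score(token, acoustic_score, language_score, hot_subwords, bonus=2.0):
    # Combine acoustic and language scores (log domain), then boost the
    # total when the token is a wake-word sub-word, biasing the search
    # toward wake-word paths.
    score = acoustic_score + language_score
    if token in hot_subwords:
        score += bonus
    return score

hot = {"小比", "比同", "同学", "小比同学"}
print(path_score("小比", -10.0, -3.0, hot))  # boosted: -11.0
print(path_score("你好", -10.0, -3.0, hot))  # unboosted: -13.0
```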

As an alternative embodiment, after performing frame-by-frame decoding on the target speech by using the decoding map to obtain a speech recognition result, the speech recognition method further includes: and when the target awakening word exists in the voice recognition result, awakening the equipment corresponding to the target awakening word.

That is, if the voice recognition result includes the target wake-up word, the device corresponding to the target wake-up word may be woken up.

Fig. 2 is a schematic diagram of an alternative voice recognition method according to an embodiment of the present invention. As shown in Fig. 2, after a user sets a wake-up word, the acoustic dictionary builder builds a customized wake-up word pronunciation dictionary (i.e., the wake-up word acoustic dictionary) and fuses it with the preset dictionary; the language dictionary builder builds a customized wake-up word dictionary (i.e., the wake-up word language dictionary) and fuses it with the preset language dictionary; the decoding model building module builds a decoding graph from the pronunciation dictionary and the language dictionary, overwrites the original model, and decodes the audio stream in real time with the decoding graph to obtain a voice recognition result. For example, when a user speaks the wake-up word, the decoding module decodes each frame of the audio, boosts the acoustic and language scores of a path whenever a character or sub-word of the wake-up word appears on it, and finally obtains an optimal decoding path; a recognition result is obtained from this path, and if the result is the wake-up word, the device is woken up.

Through this embodiment, after the user-defined wake-up word is obtained, the corresponding wake-up word acoustic dictionary and wake-up word language dictionary can be generated, a decoding graph can be generated from them, and the target speech can be decoded frame by frame with the decoding graph to obtain a voice recognition result. Because the pronunciations of the different characters are combined to obtain all pronunciation combinations of the wake-up word and these are added to the pronunciation dictionary, the wake-up rate is effectively improved. In addition, the wake-up word is decomposed and recombined to obtain its sub-word set, and the sub-words are added to the pronunciation dictionary, so that when the user inadvertently speaks only a sub-word, the device is not woken up, which reduces the false wake-up rate. Moreover, owing to the hot-word technique, when a sub-word of the wake-up word appears on a search path of the decoding graph, the acoustic score and language score of that path are boosted, so the decoding result is biased toward the wake-up word and the wake-up rate is improved.

Example 2

According to another aspect of the embodiments of the present invention, there is also provided a speech recognition apparatus, where a plurality of implementation units or modules included in the speech recognition apparatus correspond to the implementation steps in embodiment 1, and fig. 3 is a schematic diagram of the speech recognition apparatus according to the embodiments of the present invention, and as shown in fig. 3, the speech recognition apparatus may include: a first generation module 31, a second generation module 33 and a decoding module 35.

The first generating module 31 is configured to generate a wake word acoustic dictionary and a wake word language dictionary corresponding to a target wake word, where the target wake word includes a custom wake word.

And a second generating module 33, configured to generate a decoding map based on the acoustic dictionary of the wake word and the dictionary of the wake word.

And the decoding module 35 is configured to perform frame-by-frame decoding on the target speech by using the decoding map to obtain a speech recognition result.

It should be noted here that the first generation module 31, the second generation module 33 and the decoding module 35 correspond to steps S102 to S106 in embodiment 1; the modules are the same as the corresponding steps in implementation examples and application scenarios, but are not limited to the disclosure in embodiment 1. It should also be noted that the modules described above may, as part of an apparatus, be implemented in a computer system such as a set of computer-executable instructions.

As can be seen from the above, in the embodiment of the present invention, the first generation module 31 may be utilized to generate the awakening word acoustic dictionary and the awakening word language dictionary corresponding to the target awakening word, where the target awakening word includes the custom awakening word; then, generating a decoding graph based on the awakening word acoustic dictionary and the awakening word language dictionary by using a second generating module 33; and then, the decoding module 35 decodes the target speech frame by using the decoding map to obtain a speech recognition result. By the voice recognition device provided by the embodiment of the invention, the aim of establishing the decoding graph corresponding to the user-defined awakening word to decode the target voice frame by using the new decoding graph so as to obtain the voice recognition result is achieved, the technical effect of improving the recognition accuracy of the user-defined awakening word is achieved, and the technical problem of low reliability of a voice recognition mode in the related technology is solved.

Optionally, the speech recognition apparatus further comprises: the third generation module is used for generating a first mapping table before generating a wakeup word acoustic dictionary and a wakeup word language dictionary corresponding to the target wakeup word, wherein the first mapping table comprises a mapping relation between a Chinese character and at least one pinyin of the Chinese character; wherein, the third generation module comprises: the first word segmentation unit is used for carrying out word segmentation processing on the preset text by using a first word segmentation tool to obtain a word segmentation result; the pinyin marking unit is used for performing pinyin marking on the word segmentation result by using the pinyin generating tool to obtain a second mapping table, and the second mapping table contains the mapping relation between the word and at least one pinyin of the word; the analysis unit is used for analyzing the second mapping table to obtain a third mapping table, and the third mapping table contains the mapping relation between each character in the words and at least one pinyin of each character; and the first combination unit is used for combining the third mapping tables according to a preset combination mode to obtain the first mapping table.

Optionally, the first generating module includes: the first acquisition unit is used for acquiring a target awakening word; the second word segmentation unit is used for performing word segmentation processing on the target awakening word by using a second word segmentation tool to obtain a plurality of sub-words; the first processing unit is used for processing the plurality of sub-words according to the first mapping table to obtain a fourth mapping table, and the fourth mapping table comprises the mapping relation between each sub-word in the plurality of sub-words and at least one pinyin of each sub-word; and the first fusion unit is used for fusing the fourth mapping table with the first mapping table to obtain the awakening word acoustic dictionary.

Optionally, the first generating module includes: the duplication removing unit is used for carrying out duplication removing processing on the Chinese characters in the second mapping table to obtain a Chinese character dictionary; the third word segmentation unit is used for performing word segmentation processing on the target awakening word to obtain a plurality of sub-words, and performing duplication removal processing on the plurality of sub-words to obtain residual sub-words; and the first combination unit is used for combining the residual sub-words with the Chinese character dictionary to obtain the awakening word language dictionary.

Optionally, the second generating module includes: the second fusion unit is used for fusing the awakening word acoustic dictionary with the preset dictionary to obtain a fused acoustic dictionary; the third fusion unit is used for fusing the awakening word language dictionary with the preset language dictionary to obtain a fused language dictionary; and the generating unit is used for inputting the fused acoustic dictionary and the fused language dictionary into a decoding graph generating tool, and processing the fused acoustic dictionary and the fused language dictionary by using the decoding graph generating tool to obtain a decoding graph.

Optionally, a decoding module comprising: the second acquisition unit is used for acquiring an audio stream corresponding to the target voice; the extraction unit is used for extracting the characteristics of the audio stream to obtain target acoustic characteristics; a determining unit, configured to determine a phoneme information sequence corresponding to a target acoustic feature based on an acoustic model, where the acoustic model is a model for performing phoneme recognition based on the acoustic feature; and the first processing unit is used for processing the phoneme information sequence by using the decoding graph to obtain a voice recognition result.

Optionally, the speech recognition apparatus further comprises: and the awakening module is used for awakening equipment corresponding to the target awakening word when the target awakening word is determined to exist in the voice recognition result after the voice recognition result is obtained by decoding the target voice frame by using the decoding graph.

Example 3

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to perform the speech recognition method of any of the above via execution of the executable instructions.

Example 4

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, which includes a stored computer program, wherein when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to execute the speech recognition method of any one of the above.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
