Voice recognition method and device, storage medium and electronic equipment

Document No.: 1955165    Publication date: 2021-12-10

Note: This technique, "Voice recognition method and device, storage medium and electronic equipment", was created by 雪巍, 蔡玉玉, 吴俊仪, 彭毅, 范璐, 杨帆, 丁国宏 and 何晓冬 on 2021-01-18. Abstract: The present disclosure provides a speech recognition method, a speech recognition apparatus, a computer-readable storage medium, and an electronic device. The method comprises the following steps: acquiring a sample speech signal, decoding the sample speech signal to obtain a decoding result, and extracting a first feature from the decoding result, wherein the first feature comprises prefix information of the sample speech signal; extracting a target speech segment from the sample speech signal and obtaining a log-magnitude spectrum of the target speech segment; determining a second feature from the log-magnitude spectrum, wherein the second feature is a two-dimensional time-frequency domain feature of the sample speech signal; combining the first feature and the second feature to obtain a third feature; training an untrained classifier using the third feature to obtain a trained classifier; and acquiring the third feature of a speech signal to be recognized to determine whether it contains a prefix character. The present disclosure thus provides a method for recognizing short words containing prefix characters.

1. A speech recognition method, comprising:

acquiring a sample speech signal, decoding the sample speech signal to obtain a decoding result, and extracting a first feature from the decoding result, wherein the first feature comprises prefix information of the sample speech signal;

extracting a target speech segment from the sample speech signal and obtaining a log-magnitude spectrum of the target speech segment; determining a second feature from the log-magnitude spectrum, wherein the second feature is a two-dimensional time-frequency domain feature of the sample speech signal;

combining the first feature and the second feature to obtain a third feature;

training an untrained classifier using the third feature to obtain a trained classifier;

and acquiring a third feature to be recognized of a speech signal to be recognized, and classifying the third feature to be recognized using the trained classifier to determine whether the third feature to be recognized contains the prefix.

2. The method of claim 1, wherein decoding the sample speech signal to obtain a decoding result comprises:

decoding the sample speech signal using a trained acoustic model and a trained language model, and taking the top three candidates among the decoding candidates as the decoding result.

3. The method of claim 1 or 2, wherein extracting the first feature from the decoding result further comprises:

acquiring the acoustic model score and the language model score of the decoding result, and normalizing the acoustic model score and the language model score to obtain a normalized acoustic model score and a normalized language model score as the first feature.

4. The method of claim 3, wherein normalizing the acoustic model score and the language model score to obtain a normalized acoustic model score and a normalized language model score comprises:

dividing the acoustic model score by the acoustic model score of an optimal decoding result to obtain the normalized acoustic model score, wherein the optimal decoding result is the candidate ranked first among the decoding candidates;

and dividing the language model score by the language model score of the optimal decoding result to obtain the normalized language model score.

5. The method of claim 1, further comprising:

when the decoding result contains a prefix, the prefix information is 1;

when the decoding result does not include the prefix, the prefix information is 0.

6. The method of claim 5, further comprising:

the prefix is at least one of "no" and "none".

7. The method of claim 1, wherein before extracting the first feature from the decoding result, the method further comprises:

determining a sensitive word set containing prefix characters, and extracting the first feature from the decoding result when the decoding result contains any element of the sensitive word set.

8. The method of claim 7, wherein extracting a speech segment from the sample speech signal comprises:

determining a time starting point and a time ending point corresponding to the element according to the time information of the decoding result;

and extracting a voice segment between the time starting point and the time ending point from the sample voice signal as the target voice segment.

9. The method according to claim 1 or 8, wherein obtaining a log-magnitude spectrum of the target speech segment comprises:

dividing the target speech segment into a preset number of sub-segments, and performing a short-time Fourier transform with a preset number of points on each sub-segment to obtain a spectrogram;

and obtaining the log-magnitude spectrum from the spectrogram.

10. The method of claim 9, wherein determining a second feature from the log-magnitude spectrum comprises:

normalizing the logarithmic magnitude spectrum to an interval of 0 to 1 to obtain a normalized logarithmic magnitude spectrum;

extracting the second feature from the normalized log-magnitude spectrum.

11. The method of claim 10, wherein extracting the second feature from the normalized log-magnitude spectrum comprises:

dividing the normalized log-magnitude spectrum into a plurality of sub-bands;

performing smoothing operation on the sub-band energy of the sub-band in the time direction to obtain a time smoothing sub-band energy value;

calculating a sub-band energy time jump ratio according to the time smoothing sub-band energy value;

averaging the sub-band energy time hopping ratios of the plurality of sub-bands corresponding to each moment to obtain a full-band time hopping ratio corresponding to that moment;

taking the maximum value, average value, and standard deviation of the time hopping ratio, obtained from the plurality of full-band time hopping ratios corresponding to a plurality of moments, as the second feature;

performing smoothing operation on the sub-band energy of the sub-band in the frequency direction to obtain a frequency smoothing sub-band energy value;

calculating a frequency hopping ratio of the sub-band energy according to the frequency smoothing sub-band energy value;

averaging the sub-band energy frequency hopping ratios of the sub-bands corresponding to each moment to obtain a full-band frequency hopping ratio corresponding to that moment;

and taking the minimum value, average value, and standard deviation of the frequency hopping ratio, obtained from the plurality of full-band frequency hopping ratios corresponding to a plurality of moments, as the second feature.

12. The method of claim 11, wherein smoothing the sub-band energy of the sub-band in the time direction to obtain a time-smoothed sub-band energy value comprises:

averaging the energy of the sub-band at the current moment and the energy of the sub-band at the adjacent moment to obtain the time smoothing energy value of the sub-band;

performing a smoothing operation on the subband energy of the subband in the frequency direction, and obtaining a frequency-smoothed subband energy value includes:

and averaging the energy of the sub-band of the current frequency with the energy of the sub-band of the adjacent frequency to obtain the frequency smoothing sub-band energy value.

13. The method of claim 11, wherein computing a subband energy-to-time hopping ratio based on the time-smoothed subband energy values comprises:

obtaining the quotient of the time-smoothed sub-band energy value corresponding to a preset moment and the time-smoothed sub-band energy value corresponding to the current moment, and taking the quotient as the sub-band energy time hopping ratio;

calculating a sub-band energy-to-frequency hopping ratio based on the frequency smoothed sub-band energy values comprises:

and obtaining the quotient of the frequency smoothing sub-band energy value corresponding to the preset frequency and the frequency smoothing sub-band energy value corresponding to the current frequency as the frequency hopping ratio of the sub-band energy.

14. The method of claim 13, wherein the preset moment is separated from the current moment by 5 time instants, and the preset frequency is separated from the current frequency by 5 frequency bins.

15. The method of claim 1, wherein training an untrained classifier using the third features, obtaining a trained classifier comprises:

training an untrained naive Bayes classifier using the third feature to obtain a trained naive Bayes classifier.

16. A speech recognition apparatus, comprising:

the first feature acquisition module is used for acquiring a sample voice signal, decoding the sample voice signal to obtain a decoding result, and extracting a first feature from the decoding result, wherein the first feature comprises prefix information of the sample voice signal;

the second characteristic acquisition module is used for extracting a target voice segment from the sample voice signal and acquiring a logarithmic magnitude spectrum of the target voice segment; determining a second characteristic according to the logarithmic magnitude spectrum, wherein the second characteristic is a two-dimensional time-frequency domain characteristic of the sample voice signal;

a third feature obtaining module, configured to combine the first feature and the second feature to obtain a third feature;

a classifier training module, configured to train an untrained classifier using the third feature to obtain a trained classifier;

and the classification and recognition module is used for acquiring a third feature to be recognized of the voice signal to be recognized and classifying the third feature to be recognized by using the trained classifier so as to determine whether the prefix is contained in the third feature to be recognized.

17. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 15.

18. An electronic device, comprising:

a processor;

memory for storing one or more programs which, when executed by the processor, cause the processor to implement the speech recognition method of any of claims 1 to 15.

Technical Field

The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, a speech recognition apparatus, a computer-readable storage medium, and an electronic device.

Background

Speech recognition technology converts human speech into text. It is widely applied in artificial-intelligence products such as intelligent dialogue robots, smart speakers, and intelligent translation devices, which exchange information mainly through man-machine dialogue.

In daily man-machine conversation, users usually indicate their intent by answering with short words such as "yes/no", "right/wrong", or "buy/not buy"; the accuracy of recognizing these short words is therefore important.

In the model training corpora used by existing large-scale continuous speech recognition frameworks, the proportion of such short words is very small, and there is no dedicated method for recognizing short words containing prefix characters.

Disclosure of Invention

The present disclosure provides a speech recognition method, a speech recognition apparatus, a computer-readable storage medium, and an electronic device, thereby providing a method of recognizing short words that contain a prefix character.

According to a first aspect of the present disclosure, there is provided a speech recognition method comprising: acquiring a sample voice signal, decoding the sample voice signal to obtain a decoding result, and extracting a first feature from the decoding result, wherein the first feature comprises prefix information of the sample voice signal; extracting a target voice fragment from the sample voice signal to obtain a logarithmic magnitude spectrum of the target voice fragment; determining a second characteristic according to the logarithmic magnitude spectrum, wherein the second characteristic is a two-dimensional time-frequency domain characteristic of the sample voice signal; combining the first feature and the second feature to obtain a third feature; training an untrained classifier using the third features to obtain a trained classifier; and acquiring a third feature to be recognized of the voice signal to be recognized, and classifying the third feature to be recognized by using the trained classifier to determine whether the third feature to be recognized contains the prefix.

According to a second aspect of the present disclosure, there is provided a speech recognition apparatus comprising: the first feature acquisition module is used for acquiring a sample voice signal, decoding the sample voice signal to obtain a decoding result, and extracting a first feature from the decoding result, wherein the first feature comprises prefix information of the sample voice signal; the second characteristic acquisition module is used for extracting a target voice segment from the sample voice signal and acquiring a logarithmic magnitude spectrum of the target voice segment; determining a second characteristic according to the logarithmic magnitude spectrum, wherein the second characteristic is a two-dimensional time-frequency domain characteristic of the sample voice signal; a third feature obtaining module, configured to combine the first feature and the second feature to obtain a third feature; a classifier training module, configured to train an untrained classifier using the third feature to obtain a trained classifier; and the classification and recognition module is used for acquiring a third feature to be recognized of the voice signal to be recognized and classifying the third feature to be recognized by using the trained classifier so as to determine whether the prefix is contained in the third feature to be recognized.

According to a third aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method described above.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising a processor; a memory for storing one or more programs which, when executed by the processor, cause the processor to implement the speech recognition method described above.

In some embodiments of the present disclosure, the prefix information and the two-dimensional time-frequency domain feature are extracted as features of the sample speech signal, a classifier is trained on them to obtain a trained classifier that can recognize the prefix, and the trained classifier is used to determine whether a speech signal to be recognized contains the prefix. On one hand, this speech recognition method does not require training on a large general-purpose corpus, which reduces the complexity of the algorithm. On the other hand, by introducing a new two-dimensional time-frequency domain signal feature, statistical modeling can be performed along both the time axis and the frequency axis of the speech signal; for the specific scenario of short-word recognition, the influence of phrase adhesion, environmental noise, far-field conditions, and the like can be reduced, improving the accuracy of short-word recognition. In yet another aspect, a more accurate speech recognition method is provided for short words containing a prefix character.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:

FIG. 1 schematically illustrates a flow chart of a speech recognition method according to an exemplary embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow chart of steps of a speech recognition method according to an exemplary embodiment of the present disclosure;

FIG. 3 schematically illustrates a block diagram of a first feature in a speech recognition method according to an exemplary embodiment of the present disclosure;

FIG. 4 schematically shows a flow chart of steps in a speech recognition method for obtaining a second feature according to an exemplary embodiment of the present disclosure;

FIG. 5 schematically illustrates a block diagram of a speech recognition apparatus according to an exemplary embodiment of the present disclosure;

FIG. 6 schematically illustrates a module diagram of an electronic device according to an exemplary embodiment of the present disclosure;

fig. 7 schematically shows a program product schematic according to an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.

Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.

The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the steps. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, all of the following terms "first" and "second" are used for distinguishing purposes only and should not be construed as limiting the present disclosure.

The languages used by humans for communication typically include both speech and text representations. With the development of information technology, a great deal of information exchange between people and machines is also needed, and computers have started to simulate the process of human information exchange.

Specifically, the process of human communication includes: 1. natural language generation: converting thought generated by the brain into language; 2. speech synthesis: converting the language into speech; 3. speech recognition: recognizing the speech content that expresses the language; 4. natural language understanding: understanding the meaning expressed by the speech. The first two steps are performed by the speaker, and the last two by the listener. Speech recognition is step 3 above; for a device, this means recognizing the speech spoken by a human and converting it into text.

Speech recognition is a pattern recognition system that mainly comprises the following steps: (1) speech input; (2) preprocessing; (3) feature extraction, after which two branches follow: (4) training and clustering, which builds the template library, and (5)-(7) recognition, in which (5) performs similarity comparison against the reference patterns of the template library, (6) performs distortion detection on the result of (5), and (7) outputs the recognition result.

Preprocessing includes sampling and filtering of the speech signal. Feature extraction derives several groups of parameters that describe the signal characteristics, such as energy, formants, and cepstral coefficients, for use in training and recognition. A speech recognition system is built by training on a large amount of speech to obtain a template library; the templates are then read and compared for similarity with the speech to be recognized to obtain the recognition result.

In speech recognition technology, acoustic models and language models are the basis of automatic speech recognition. The acoustic model recognizes "sounds" from the acoustic signal, and the language model converts those sounds into "words". Large-scale speaker-independent continuous speech recognition relies on large-scale acoustic and text training corpora. To achieve good performance, the acoustic model must be adapted during training to the different accents, noises, tone changes, and channel transmission compression and distortion found in practical application scenarios; the language model must be adapted to the proper nouns, dialects, etc. of different domains.

Conventional isolated-word or command-word recognition is related to short-word recognition, but it typically uses a limited decoding search space and therefore cannot be applied in large-scale speaker-independent continuous speech recognition. Within the framework of large-scale continuous speech recognition, short-word recognition for categories such as "yes/no", "right/wrong", "in/out", and "buy/not buy" lacks specific optimization; moreover, since such short words account for only a small proportion of the corpus, retraining the model on large-scale corpora incurs high labor and time costs with little improvement in recognition.

Based on the above problems, exemplary embodiments of the present disclosure provide a speech recognition method and apparatus, aiming to improve the accuracy of short-word recognition, a problem that frequently arises across service fields and application scenarios. The method and apparatus can be applied in a wide variety of devices, such as mobile phones and computers; the exemplary embodiments do not limit the devices in which the embodiments of the present disclosure may be used.

Fig. 1 schematically shows a flow chart of a speech recognition method of an exemplary embodiment of the present disclosure. Referring to fig. 1, the voice recognition method may include the steps of:

step S110, obtaining a sample voice signal, decoding the sample voice signal to obtain a decoding result, and extracting a first feature from the decoding result, where the first feature includes prefix information of the sample voice signal.

In an exemplary embodiment of the present disclosure, the sample speech signal is a signal sample for training of a classifier. The signal sample may be a continuous long speech signal or a short speech signal containing a prefix, or a continuous long speech signal or a short speech signal containing no prefix. If the prefix is included in the decoding result, the prefix information is 1, and if the prefix is not included in the decoding result, the prefix information is 0.

In practical applications, the prefix word may be "no", or "none", etc., or may be other words, and the speech recognition method provided by the exemplary embodiment can recognize short words containing any kind of prefix word. Therefore, the exemplary embodiments of the present disclosure are not particularly limited to the specific prefix.

In exemplary embodiments of the present disclosure, the acquired sample speech signal may be decoded using an existing trained acoustic model and trained language model. Decoding typically yields a plurality of decoding candidates.

In order to improve the accuracy of the recognition result, in the present exemplary embodiment, the top three candidate results of the decoding candidate results may be extracted as the decoding result, so as to reduce the complexity while considering the accuracy.

In practical applications, the candidate results of the top two or the top four may also be extracted according to actual needs, which is not limited in the present exemplary embodiment.

For the candidate results, taking the sample speech signal "jingdong" as an example, the decoding candidates ranked by matching degree may be "jindong", "ding-dong", "d-dong", etc.; these three words are used as the decoding result from which the first feature is extracted.

In an exemplary embodiment of the present disclosure, extracting the first feature from the decoding result may specifically include: acquiring the prefix information, acquiring the acoustic model score and the language model score of the decoding result, and normalizing the two scores to obtain a normalized acoustic model score and a normalized language model score as the first feature. A score here refers to the probability of a word appearing in the decoding result: the higher the probability of occurrence, the higher the score, and the more likely the word is the recognized one.

By obtaining a first feature comprising a normalized acoustic model score, a normalized language model score, and prefix information, a sample speech signal with or without a prefix can be preliminarily characterized by the first feature.

In this exemplary embodiment, the normalizing the acoustic model score and the language model score may specifically include: and dividing the acoustic model score by the acoustic model score of the optimal decoding result to obtain a normalized acoustic model score, and dividing the language model score by the language model score of the optimal decoding result to obtain a normalized language model score. And the optimal decoding result is the candidate result with the first rank in the decoding candidate results. The acoustic model score and the language model score can be unified into the 0-1 interval by normalization.
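As a concrete illustration of this normalization, the following sketch (with made-up candidate names and score values; the real scores come from the acoustic and language models) divides each candidate's scores by those of the rank-1 candidate:

```python
# Minimal sketch of the score normalization, under assumed variable names.
# Each decoding candidate carries an acoustic-model (am) and language-model
# (lm) score (here, probabilities: higher is better); dividing both by the
# optimal (rank-1) candidate's scores maps them into the 0-1 interval.
candidates = [
    {"text": "candidate-1", "am": 0.80, "lm": 0.60},   # optimal decoding result
    {"text": "candidate-2", "am": 0.15, "lm": 0.30},
    {"text": "candidate-3", "am": 0.05, "lm": 0.10},
]

best_am = candidates[0]["am"]
best_lm = candidates[0]["lm"]
normalized = [
    {"text": c["text"], "am": c["am"] / best_am, "lm": c["lm"] / best_lm}
    for c in candidates
]
```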

It should be noted that, before extracting the first feature from the decoding result, the speech recognition method provided by the exemplary embodiment of the present disclosure needs to determine the sensitive word set Ω = {AB, B, AC, C, …} containing the prefix A, where B and C represent common phrases combined with A. For example, for the prefix "no", B and C are typically "on", "is", "right", etc. When the decoding result contains any element of the sensitive word set Ω, the first feature is extracted from the decoding result.
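A minimal sketch of this gating step follows, with illustrative English stand-ins for the prefix A and the phrases B, C (the actual set would be built from the target language's sensitive words):

```python
# Minimal sketch of the sensitive-word gate: the first feature is only
# extracted when some element of the set Ω appears in a decoding result.
# The prefix A and the phrases below are illustrative stand-ins.
A = "not"
phrases = ["is", "on", "right"]                    # the B, C, ... words
sensitive_set = {A + " " + w for w in phrases} | set(phrases)

def should_extract(decoding_results):
    # True if any decoding candidate contains an element of Ω
    # (naive substring matching, for illustration only).
    return any(elem in result
               for result in decoding_results
               for elem in sensitive_set)
```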

Step S120, extracting a target voice fragment from the sample voice signal, and acquiring a logarithmic magnitude spectrum of the target voice fragment; and determining a second characteristic according to the logarithmic magnitude spectrum, wherein the second characteristic is a two-dimensional time-frequency domain characteristic of the sample voice signal.

In practical application, if a decoding result of step S110 includes a certain element in the sensitive word set Ω, a time start point and a time end point corresponding to the element may be determined according to the time information of the decoding result. And according to the corresponding time starting point and time ending point, extracting a speech segment between the time starting point and the time ending point from the sample speech signal as a target speech segment, namely, a target speech segment containing an element in the sensitive word set omega.
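The segment extraction can be sketched as a simple slice over the sampled signal, assuming the decoder reports the element's start and end times in seconds (the sample rate and time values here are illustrative):

```python
# Sketch of cutting the target speech segment out of the sample signal;
# the decoder is assumed to report word timings in seconds.
sample_rate = 16000
signal = [0.0] * (sample_rate * 2)      # two seconds of (silent) audio

def extract_segment(signal, t_start, t_end, sr):
    # Slice the samples between the element's start and end times.
    return signal[int(t_start * sr):int(t_end * sr)]

segment = extract_segment(signal, t_start=0.50, t_end=0.75, sr=sample_rate)
```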

In the exemplary embodiment of the present disclosure, the target speech segment may further be divided into a preset number t of sub-segments, and a short-time Fourier transform with a preset number f of points may be performed on each sub-segment to obtain a spectrogram. In this way, a one-dimensional signal is converted into a two-dimensional map; for example, when t = 120 and f = 128, the resulting spectrogram is a 120 × 128 two-dimensional map. For the target speech segment of a short word, the time interval is very short, and after division into t sub-segments the signal within each sub-segment is approximately stationary, so the short-time Fourier transform improves computational efficiency while meeting the resolution requirement.

It should be noted that the frame length, the window type, and the frame shift of the short-time fourier transform need to be consistent with the frame length, the window type, and the frame shift adopted during decoding, so that consistency of the extracted information represented by the first feature and the second feature can be ensured.
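A toy-scale sketch of the spectrogram step follows, using a plain DFT per sub-segment; the document's t = 120 sub-segments and f = 128 points are shrunk here to keep the illustration short, and the windowing and frame-shift details noted above are omitted:

```python
import cmath

# Toy-scale sketch: split the target segment into t sub-segments and take
# an f-point transform of each, giving a t x f two-dimensional map.
def stft_spectrogram(segment, t=4, f=8):
    sub_len = len(segment) // t
    spec = []
    for i in range(t):
        frame = segment[i * sub_len:(i + 1) * sub_len]
        frame = (frame + [0.0] * f)[:f]     # zero-pad/truncate to f points
        row = [sum(x * cmath.exp(-2j * cmath.pi * k * n / f)
                   for n, x in enumerate(frame))
               for k in range(f)]
        spec.append(row)
    return spec

spec = stft_spectrogram([1.0, 0.0] * 16)    # 32-sample toy segment
```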

Then, the log-magnitude spectrum of the spectrogram can be computed. For convenience of analysis and calculation, the log-magnitude spectrum is normalized to the interval 0 to 1 to obtain the normalized log-magnitude spectrum Y(t, f); the second feature is then extracted from Y(t, f).
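The normalization to the interval 0 to 1 can be sketched as a linear rescaling of the log-magnitude map (the small epsilon guarding log(0) is an implementation assumption not stated in the text):

```python
import math

# Sketch of the log-magnitude normalization: take log|X| of each
# spectrogram bin, then rescale the whole map linearly into [0, 1]
# to obtain Y(t, f).
def normalized_log_magnitude(spectrogram):
    log_mag = [[math.log(abs(x) + 1e-10) for x in row] for row in spectrogram]
    lo = min(min(row) for row in log_mag)
    hi = max(max(row) for row in log_mag)
    return [[(v - lo) / (hi - lo) for v in row] for row in log_mag]

Y = normalized_log_magnitude([[1.0, 10.0], [100.0, 1000.0]])
```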

In this exemplary embodiment, extracting the second feature from the normalized log-magnitude spectrum Y(t, f) in the time direction may specifically include: dividing the normalized log-magnitude spectrum Y(t, f) into a plurality of sub-bands; and smoothing the sub-band energy of each sub-band in the time direction to obtain the time-smoothed sub-band energy value P_T(t, f), as shown in equation (1):

P_T(t, f) = (1/5) · Σ_{τ=t−2}^{t+2} |Y(τ, f)|²    (1)

where the time-smoothed sub-band energy value P_T(t, f) is obtained by averaging the sub-band energy at the current instant with the sub-band energies at adjacent instants; the adjacent instants used in equation (1) are the two instants before and the two instants after the current instant. The sub-band energy is the product of the normalized log-magnitude spectrum Y(t, f) with its complex conjugate, i.e., |Y(t, f)|².

By performing the smoothing operation on the sub-band energy, on one hand, the influence of the environmental noise can be suppressed, and on the other hand, a more stable frequency band energy variation characteristic can be obtained.
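The time smoothing of equation (1) can be sketched as follows; clipping the averaging window at the edges of the segment is an implementation choice the text does not specify:

```python
import numpy as np

def time_smoothed_energy(Y, radius=2):
    """Equation (1): average the sub-band energy |Y(t, f)|^2 over the
    current time instant and its `radius` neighbors on each side
    (2 before and 2 after in the text)."""
    E = (Y * np.conj(Y)).real   # sub-band energy: Y times its complex conjugate
    T = E.shape[0]
    P = np.empty_like(E)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        P[t] = E[lo:hi].mean(axis=0)  # smooth over adjacent time instants
    return P
```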

In the present exemplary embodiment, after the time-smoothed sub-band energy value PT(t, f) is obtained, the sub-band energy time jump ratio rT(t, f) may be calculated from the time-smoothed sub-band energy value PT(t, f), as shown in equation (2):

rT(t,f)=PT(t+5,f)/PT(t,f) (2)

wherein the sub-band energy time jump ratio rT(t, f) is the quotient of the time-smoothed sub-band energy value PT(t + 5, f) corresponding to the preset time instant and the time-smoothed sub-band energy value PT(t, f) corresponding to the current time instant.

In practical applications, the preset time instant may be determined according to actual conditions, for example, separated from the current time instant by 5 time instants, and the like; this is not limited by the exemplary embodiments of the present disclosure.
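Equation (2) can be sketched as follows; the small epsilon guarding against division by zero is an added assumption, and only frames with a valid t + lag are returned:

```python
import numpy as np

def time_jump_ratio(P_T, lag=5):
    """Equation (2): r_T(t, f) = P_T(t + lag, f) / P_T(t, f),
    with lag = 5 time instants as in the text."""
    eps = 1e-10  # guard against division by zero in silent regions
    return P_T[lag:] / (P_T[:-lag] + eps)
```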

Since a short time interval often exists between the prefix word and the following word, when t lies in this interval region, PT(t, f) is small, so the time jump ratio takes a large value. Therefore, by calculating the maximum value of the time jump ratio, whether a sub-band contains a short time interval can be detected efficiently.

An important property of the short time interval is that its position t is essentially the same across all sub-bands. Therefore, the sub-band energy time jump ratios rT(t, f) of the plurality of sub-bands corresponding to each time instant t can be averaged to obtain the full-band time jump ratio rT(t) corresponding to the time instant t.

When a short word contains a short time interval, one obvious feature is that a large value appears within the extracted time range, which can serve as a reference feature for whether a prefix is present. Based on this, the set {rT(1), rT(2), ..., rT(N)} is obtained, where N is the total number of frames, and the {maximum value, average value, standard deviation} of this set are computed, i.e. the time jump ratio maximum, the time jump ratio average, and the time jump ratio standard deviation. In other words, the maximum, average, and standard deviation obtained from the plurality of full-band time jump ratios rT(t) corresponding to the plurality of time instants are taken as part of the second feature.
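The full-band averaging and the {maximum, average, standard deviation} statistics can be sketched as follows:

```python
import numpy as np

def time_jump_statistics(r_T):
    """Average the sub-band time jump ratios over frequency to get the
    full-band ratio r_T(t), then take {max, mean, std} over the N
    frames as three components of the second feature."""
    r_full = r_T.mean(axis=1)  # full-band time jump ratio per time instant t
    return r_full.max(), r_full.mean(), r_full.std()
```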

Similarly, in the frequency direction, extracting the second feature from the normalized log-magnitude spectrum Y(t, f) may specifically include: smoothing the sub-band energy of each sub-band in the frequency direction to obtain a frequency-smoothed sub-band energy value PF(t, f), as shown in equation (3):

PF(t, f) = (1/5)·[|Y(t, f−2)|² + |Y(t, f−1)|² + |Y(t, f)|² + |Y(t, f+1)|² + |Y(t, f+2)|²]    (3)

wherein the frequency-smoothed sub-band energy value PF(t, f) is obtained by smoothing the sub-band energy at the current frequency with the sub-band energies at adjacent frequencies, for example by averaging them; the adjacent frequencies selected in equation (3) are the 2 frequencies below and the 2 frequencies above the current frequency. The sub-band energy is the product of the normalized log-magnitude spectrum Y(t, f) and its complex conjugate, i.e. |Y(t, f)|².

By performing the smoothing operation on the sub-band energy, on one hand, the influence of the environmental noise can be suppressed, and on the other hand, a more stable frequency band energy variation characteristic can be obtained.

In the present exemplary embodiment, after the frequency-smoothed sub-band energy value PF(t, f) is obtained, the sub-band energy frequency jump ratio rF(t, f) may be calculated from the frequency-smoothed sub-band energy value PF(t, f), as shown in equation (4):

rF(t,f)=PF(t,f+5)/PF(t,f) (4)

wherein the sub-band energy frequency jump ratio rF(t, f) is the quotient of the frequency-smoothed sub-band energy value PF(t, f + 5) corresponding to the preset frequency and the frequency-smoothed sub-band energy value PF(t, f) corresponding to the current frequency.

In practical applications, the preset frequency may be determined according to practical situations, for example, the preset frequency is separated from the current frequency by 5 frequencies, and the like, which is not limited in the exemplary embodiments of the present disclosure.

Because a short time interval often exists between the prefix word and the following word, when t lies in this interval region the signal is in a silent region, the energy difference between frequency bands is small, and the sub-band energy frequency jump ratio rF(t, f) therefore takes small values. In contrast, in speech segments the sub-band energy frequency jump ratio rF(t, f) yields larger values.

The sub-band energy frequency jump ratios rF(t, f) of the plurality of sub-bands corresponding to each time instant t are averaged to obtain the full-band frequency jump ratio rF(t) corresponding to the time instant t.

Another obvious feature when a short word contains a short time interval is the presence of small values of rF(t, f). Based on this, the set {rF(1), rF(2), ..., rF(N)} is obtained, where N is the total number of frames, and the {minimum value, average value, standard deviation} of this set are computed, i.e. the frequency jump ratio minimum, the frequency jump ratio average, and the frequency jump ratio standard deviation. In other words, the minimum, average, and standard deviation obtained from the plurality of full-band frequency jump ratios rF(t) corresponding to the plurality of time instants are taken as part of the second feature.
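The frequency-direction pipeline mirrors the time-direction one and can be sketched end to end as follows; the index-clipping at band edges and the epsilon guard are implementation assumptions:

```python
import numpy as np

def frequency_jump_statistics(Y, radius=2, lag=5):
    """Frequency-direction features: smooth the sub-band energy over
    neighboring frequencies (equation (3)), form the frequency jump
    ratio r_F(t, f) = P_F(t, f + lag) / P_F(t, f) (equation (4)),
    average over frequency, and keep {min, mean, std}."""
    E = (Y * np.conj(Y)).real           # sub-band energy |Y(t, f)|^2
    F = E.shape[1]
    P = np.empty_like(E)
    for f in range(F):
        lo, hi = max(0, f - radius), min(F, f + radius + 1)
        P[:, f] = E[:, lo:hi].mean(axis=1)       # equation (3)
    r_F = P[:, lag:] / (P[:, :-lag] + 1e-10)     # equation (4)
    r_full = r_F.mean(axis=1)   # full-band frequency jump ratio per t
    return r_full.min(), r_full.mean(), r_full.std()
```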

At this point, the second feature, i.e. the two-dimensional time-frequency domain feature of the sample speech signal, has been obtained in full: the time jump ratio maximum, the time jump ratio average, the time jump ratio standard deviation, the frequency jump ratio minimum, the frequency jump ratio average, and the frequency jump ratio standard deviation.

And S130, combining the first characteristic and the second characteristic to obtain a third characteristic.

In the present exemplary embodiment, the first feature includes a normalized acoustic model score, a normalized language model score, and prefix information; the second feature includes the time jump ratio maximum, the time jump ratio average, the time jump ratio standard deviation, the frequency jump ratio minimum, the frequency jump ratio average, and the frequency jump ratio standard deviation. These 9 features are spliced together into the third feature. In practical applications, the third feature may be represented in the form of a vector.
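The splicing of the 9 components into the third feature vector can be sketched as follows; the numeric values are hypothetical placeholders, not data from the disclosure:

```python
import numpy as np

# Hypothetical first feature: normalized acoustic model score,
# normalized language model score, and a prefix-information flag.
first_feature = np.array([0.82, 0.74, 1.0])

# Hypothetical second feature: time jump ratio {max, mean, std}
# followed by frequency jump ratio {min, mean, std}.
second_feature = np.array([3.1, 1.2, 0.6, 0.4, 0.9, 0.2])

# Splice the 9 components into the third feature vector.
third_feature = np.concatenate([first_feature, second_feature])
```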

And step S140, training the untrained classifier by using the third feature to obtain a trained classifier.

In the present exemplary embodiment, an untrained naive Bayes classifier is trained using the third feature, resulting in a trained naive Bayes classifier. The naive Bayes classifier may be a naive Bayes classifier based on Gaussian distributions. Because the naive Bayes classifier has low complexity, training efficiency can be improved while still meeting the requirements of third-feature training, the acoustic model and language model of the speech recognition system need not be retrained, and the accuracy of short-word recognition can be improved. A detailed description of the training method is omitted here.
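A Gaussian naive Bayes classifier of the type named above can be sketched from scratch as follows; this is a minimal illustration of the classifier family, not the authors' exact implementation:

```python
import numpy as np

class GaussianNaiveBayes:
    """Gaussian naive Bayes over the 9-dimensional third feature:
    one Gaussian per class per dimension, dimensions assumed
    conditionally independent given the class."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_, self.var_, self.prior_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.mu_[c] = Xc.mean(axis=0)
            self.var_[c] = Xc.var(axis=0) + 1e-9   # variance floor
            self.prior_[c] = len(Xc) / len(X)
        return self

    def predict(self, X):
        scores = []
        for c in self.classes_:
            # Log-likelihood of each row under the class Gaussians,
            # plus the log prior.
            ll = -0.5 * np.sum(
                np.log(2 * np.pi * self.var_[c])
                + (X - self.mu_[c]) ** 2 / self.var_[c], axis=1)
            scores.append(ll + np.log(self.prior_[c]))
        return self.classes_[np.argmax(np.stack(scores), axis=0)]
```

Training amounts to estimating per-class means, variances, and priors from the third features of labeled sample speech signals; recognition then evaluates the same features of the signal to be recognized.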

And S150, acquiring a third feature to be recognized of the voice signal to be recognized, and classifying the third feature to be recognized by using the trained classifier to determine whether the third feature to be recognized contains a prefix.

During recognition, the speech signal to be recognized is usually analyzed in the same way as during training to obtain its speech parameters, i.e. the third feature to be recognized; this feature is input into the trained naive Bayes classifier to obtain a judgment result, namely whether the third feature to be recognized contains a prefix word.

In summary, according to the speech recognition method of the exemplary embodiment of the present disclosure, the prefix information and the two-dimensional time-frequency domain feature are extracted as the third feature of the sample speech signal and used to train a naive Bayes classifier, yielding a trained naive Bayes classifier capable of recognizing the prefix; the trained classifier is then applied to the speech signal to be recognized to determine whether it contains the prefix. On the one hand, because the naive Bayes classifier has low complexity, training efficiency can be improved while meeting the requirements of third-feature training, and the acoustic model and language model of the speech recognition system need not be retrained. On the other hand, the new two-dimensional time-frequency domain signal feature allows statistical modeling along both the time axis and the frequency axis of the speech signal; for the specific scenario of short-word recognition, it reduces the influence of phrase adhesion in wake-up expressions, environmental noise, far-field conditions, and the like, thereby improving the accuracy of short-word recognition. In a further aspect, the exemplary embodiments of the present disclosure provide a speech recognition method with higher accuracy for short words containing a prefix.

The flow of the speech recognition method of the exemplary embodiment of the present disclosure will be described below with reference to fig. 2:

in step S201, a sample speech signal is acquired; in step S202, the sample speech signal is decoded to obtain a decoding result; in step S203, a judgment condition is entered to determine whether the decoding result contains any element of a sensitive word set, where the sensitive word set includes the prefix; if not, the process ends. If yes, i.e. a sensitive word set element exists, step S204 is executed to extract the first feature from the decoding result; in addition, step S205 is executed to determine the time start point and time end point corresponding to the sensitive word set element; in step S206, the target speech segment is obtained according to the time start point and time end point; step S207 is then executed to perform a short-time Fourier transform on the target speech segment to obtain a log-magnitude spectrum; step S208 is then executed to normalize the log-magnitude spectrum to obtain a normalized log-magnitude spectrum; step S209 is then executed to obtain the two-dimensional time-frequency domain feature from the normalized log-magnitude spectrum as the second feature; in step S210, the first feature and the second feature are combined to obtain the third feature; in step S211, the untrained naive Bayes classifier is trained using the third feature to obtain a trained naive Bayes classifier; in step S212, the trained naive Bayes classifier is used to classify the third feature to be recognized of the speech signal to be recognized, so as to determine whether it contains a prefix, i.e. to recognize whether the speech signal to be recognized contains a prefix.

As shown in fig. 3, the first feature 300 includes a normalized acoustic model score 310, a normalized language model score 320, and prefix information 330. Fig. 4 shows a process of acquiring a two-dimensional time-frequency domain feature from the normalized logarithmic magnitude spectrum as a second feature, that is, the second feature is acquired as follows:

in step S401, the normalized log-magnitude spectrum is obtained; in step S402, the time-smoothed sub-band energy values are obtained from the normalized log-magnitude spectrum; next, in step S403, the sub-band energy time jump ratio is calculated from the time-smoothed sub-band energy values; in step S404, the frequency-smoothed sub-band energy values are obtained from the normalized log-magnitude spectrum; next, in step S405, the sub-band energy frequency jump ratio is calculated from the frequency-smoothed sub-band energy values; finally, in step S406, the time jump ratio maximum, average, and standard deviation are obtained from the sub-band energy time jump ratio, and the frequency jump ratio minimum, average, and standard deviation are obtained from the sub-band energy frequency jump ratio, together forming the second feature.

It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Further, a speech recognition device is also provided in the present exemplary embodiment.

Fig. 5 schematically shows a block diagram of a speech recognition apparatus of an exemplary embodiment of the present disclosure. Referring to fig. 5, a voice recognition apparatus 500 according to an exemplary embodiment of the present disclosure may include: a first feature obtaining module 510, a second feature obtaining module 520, a third feature obtaining module 530, a classifier training module 540, and a classification identifying module 550.

Specifically, the first feature obtaining module 510 may be configured to obtain a sample voice signal, decode the sample voice signal to obtain a decoding result, and extract a first feature from the decoding result, where the first feature includes prefix information of the sample voice signal; the second feature obtaining module 520 may be configured to extract a target speech segment from the sample speech signal, and obtain a logarithmic magnitude spectrum of the target speech segment; determining a second characteristic according to the logarithmic magnitude spectrum, wherein the second characteristic is a two-dimensional time-frequency domain characteristic of the sample voice signal; a third feature obtaining module 530, configured to combine the first feature and the second feature to obtain a third feature; a classifier training module 540, configured to train an untrained classifier using the third feature, to obtain a trained classifier; the classification recognition module 550 may be configured to acquire a third feature to be recognized of the speech signal to be recognized, and classify the third feature to be recognized by using a trained classifier to determine whether the third feature to be recognized contains a prefix.

Since each functional module of the speech recognition device in the embodiment of the present disclosure is the same as that in the embodiment of the method described above, it is not described herein again.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be implemented in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, a bus 630 connecting different system components (including the memory unit 620 and the processing unit 610), and a display unit 640.

Wherein the storage unit 620 stores program code that can be executed by the processing unit 610, such that the processing unit 610 performs the steps according to various exemplary embodiments of the present invention described in the above section "exemplary method" of the present specification. For example, the processing unit 610 may execute step S110 shown in fig. 1, obtaining a sample speech signal, decoding the sample speech signal, obtaining a decoding result, and extracting a first feature from the decoding result, where the first feature includes prefix information of the sample speech signal; step S120, extracting a target voice fragment from the sample voice signal, and acquiring a logarithmic magnitude spectrum of the target voice fragment; determining a second characteristic according to the logarithmic magnitude spectrum, wherein the second characteristic is a two-dimensional time-frequency domain characteristic of the sample voice signal; step S130, combining the first characteristic and the second characteristic to obtain a third characteristic; step S140, training the untrained classifier by using the third feature to obtain a trained classifier; and S150, acquiring a third feature to be recognized of the voice signal to be recognized, and classifying the third feature to be recognized by using the trained classifier to determine whether the third feature to be recognized contains a prefix.

The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.

The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The electronic device 600 may also communicate with one or more external devices 670 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.

Referring to fig. 7, a program product 700 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.
