Speech recognition method and apparatus, electronic device, and storage medium

Document No.: 812438    Publication date: 2021-03-26

Description: This technology, "Speech recognition method and apparatus, electronic device and storage medium", was designed and created by 高建清 and 万根顺 on 2020-12-11. Its main content is as follows: an embodiment of the invention provides a speech recognition method and apparatus, an electronic device, and a storage medium, wherein the method includes: determining speech data to be recognized; performing speech recognition on the speech data based on scene-associated text corresponding to the speech data, to obtain a speech recognition result of the speech data; the scene-associated text is determined based on application record data of a plurality of associated users. By acquiring the application record data generated by different users across different applications in the same speech recognition scene and exploiting the similarity of the points of interest among the associated users, scene-associated text is extracted, providing the speech data to be recognized with auxiliary text highly associated with the current scene and improving the accuracy of the speech recognition result obtained based on that text.

1. A speech recognition method, comprising:

determining speech data to be recognized;

performing speech recognition on the speech data based on scene-associated text corresponding to the speech data, to obtain a speech recognition result of the speech data;

wherein the scene-associated text is determined based on application record data of a plurality of associated users.

2. The speech recognition method according to claim 1, wherein the performing speech recognition on the speech data based on the scene-associated text corresponding to the speech data to obtain a speech recognition result of the speech data comprises:

decoding acoustic hidden-layer features of the speech data based on the scene-associated text corresponding to the speech data, to obtain a probability of each candidate word for each time period of the speech data;

and determining the speech recognition result based on the probability of each candidate word for each time period of the speech data.

3. The speech recognition method according to claim 2, wherein the scene-associated text comprises hotwords;

the decoding the acoustic hidden-layer features of the speech data based on the scene-associated text corresponding to the speech data to obtain a probability of each candidate word for each time period of the speech data comprises:

correcting the probability of each candidate word for each time period of the speech data based on the hotwords, or based on the hotwords and their excitation coefficients, and determining the speech recognition result based on the corrected probability of each candidate word for each time period.

4. The speech recognition method according to claim 3, wherein the hotwords are determined based on the following steps:

determining a first duration range of historical speech data of the speech data;

screening query keywords input within the first duration range from the application record data of the plurality of associated users;

and selecting, as the hotwords, query keywords input by at least a preset number of users, and/or query keywords input by any single user and associated with the current scene.

5. The speech recognition method according to claim 3 or 4, wherein excitation coefficients decrease in the following order: hotwords appearing in the query keywords of at least two users, hotwords having repeated or similar words in the query keywords of any single user, and other hotwords; and the excitation coefficient of any hotword increases with the frequency at which that hotword appears in the query keywords of different users.

6. The speech recognition method according to claim 2, wherein the scene-associated text comprises historical extended texts corresponding to respective historical speech segments of the speech data;

the decoding the acoustic hidden-layer features of the speech data based on the scene-associated text corresponding to the speech data to obtain a probability of each candidate word for each time period of the speech data comprises:

decoding the acoustic hidden-layer features of the speech data based on a general corpus and the historical extended texts corresponding to the respective historical speech segments, to obtain the probability of each candidate word for each time period of the speech data.

7. The speech recognition method according to claim 6, wherein the decoding the acoustic hidden-layer features of the speech data based on the general corpus and the historical extended texts corresponding to the respective historical speech segments to obtain the probability of each candidate word for each time period of the speech data comprises:

decoding the acoustic hidden-layer features of the speech data for any time period based on the general corpus and the historical extended texts corresponding to the respective historical speech segments, to obtain candidate probabilities of any candidate word for that time period corresponding to the general corpus and to each historical speech segment;

determining the probability of the candidate word based on its candidate probabilities corresponding to the general corpus and to each historical speech segment, and on the weights corresponding to the general corpus and to each historical speech segment;

wherein the closer a historical speech segment is to the speech data, the greater its corresponding weight.

8. The speech recognition method according to claim 7, wherein the decoding the acoustic hidden-layer features of the speech data for any time period based on the general corpus and the historical extended texts corresponding to the respective historical speech segments to obtain candidate probabilities of any candidate word for that time period corresponding to the general corpus and to each historical speech segment comprises:

determining the candidate probability of the candidate word corresponding to any historical speech segment based on the respective types of historical extended text corresponding to that historical speech segment and their corresponding importance coefficients.

9. The speech recognition method according to claim 8, wherein the respective types of historical extended text comprise at least one of a browsed-content extended text, a hotword-query extended text, and a preset extended text;

the browsed-content extended text corresponding to any historical speech segment is obtained based on the following steps:

determining a second duration range of the historical speech segment;

screening browsed content within the second duration range from the application record data of the plurality of associated users;

and selecting, as the browsed-content extended text corresponding to the historical speech segment, at least one of browsed content associated with the hotwords, browsed content associated with at least two users, and browsed content associated with the current scene.

10. A speech recognition apparatus, comprising:

a speech data determination unit, configured to determine speech data to be recognized;

a speech recognition unit, configured to perform speech recognition on the speech data based on scene-associated text corresponding to the speech data, to obtain a speech recognition result of the speech data;

wherein the scene-associated text is determined based on application record data of a plurality of associated users.

11. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the speech recognition method according to any one of claims 1 to 9.

12. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 9.

Technical Field

The present invention relates to the field of speech signal processing technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.

Background

With the continuous development of artificial intelligence technology, speech recognition is widely applied in scenes such as conferences, interviews, lectures, and speeches.

Existing speech recognition technology usually acquires, before recognition is performed, corpora that may be related to the current usage scene to assist recognition. However, if the topic changes while speech is actually being collected and recognized, or if the corpus acquired in advance is inaccurate, the accuracy of speech recognition degrades.

Disclosure of Invention

Embodiments of the invention provide a speech recognition method and apparatus, an electronic device, and a storage medium, to overcome the defect of poor speech recognition accuracy in the prior art.

An embodiment of the invention provides a speech recognition method, which includes the following steps:

determining speech data to be recognized;

performing speech recognition on the speech data based on scene-associated text corresponding to the speech data, to obtain a speech recognition result of the speech data;

wherein the scene-associated text is determined based on application record data of a plurality of associated users.

According to the speech recognition method of an embodiment of the invention, the performing speech recognition on the speech data based on the scene-associated text corresponding to the speech data to obtain a speech recognition result of the speech data includes:

decoding acoustic hidden-layer features of the speech data based on the scene-associated text corresponding to the speech data, to obtain a probability of each candidate word for each time period of the speech data;

and determining the speech recognition result based on the probability of each candidate word for each time period of the speech data.

According to the speech recognition method of an embodiment of the invention, the scene-associated text includes hotwords;

the decoding the acoustic hidden-layer features of the speech data based on the scene-associated text corresponding to the speech data to obtain a probability of each candidate word for each time period of the speech data includes:

correcting the probability of each candidate word for each time period of the speech data based on the hotwords, or based on the hotwords and their excitation coefficients, and determining the speech recognition result based on the corrected probability of each candidate word for each time period.

According to the speech recognition method of an embodiment of the invention, the hotwords are determined based on the following steps:

determining a first duration range of historical speech data of the speech data;

screening query keywords input within the first duration range from the application record data of the plurality of associated users;

and selecting, as the hotwords, query keywords input by at least a preset number of users, and/or query keywords input by any single user and associated with the current scene.

According to the speech recognition method of an embodiment of the invention, excitation coefficients decrease in the following order: hotwords appearing in the query keywords of at least two users, hotwords with repeated or similar words in the query keywords of any single user, and other hotwords; and the excitation coefficient of any hotword increases with the frequency at which it appears in the query keywords of different users.

According to the speech recognition method of an embodiment of the invention, the scene-associated text includes historical extended texts corresponding to respective historical speech segments of the speech data;

the decoding the acoustic hidden-layer features of the speech data based on the scene-associated text corresponding to the speech data to obtain a probability of each candidate word for each time period of the speech data includes:

decoding the acoustic hidden-layer features of the speech data based on a general corpus and the historical extended texts corresponding to the respective historical speech segments, to obtain the probability of each candidate word for each time period of the speech data.

According to the speech recognition method of an embodiment of the invention, the decoding the acoustic hidden-layer features of the speech data based on the general corpus and the historical extended texts corresponding to the respective historical speech segments to obtain the probability of each candidate word for each time period of the speech data includes:

decoding the acoustic hidden-layer features of the speech data for any time period based on the general corpus and the historical extended texts corresponding to the respective historical speech segments, to obtain candidate probabilities of any candidate word for that time period corresponding to the general corpus and to each historical speech segment;

determining the probability of the candidate word based on its candidate probabilities corresponding to the general corpus and to each historical speech segment, and on the weights corresponding to the general corpus and to each historical speech segment;

wherein the closer a historical speech segment is to the speech data, the greater its corresponding weight.

According to the speech recognition method of an embodiment of the invention, the decoding the acoustic hidden-layer features of the speech data for any time period based on the general corpus and the historical extended texts corresponding to the respective historical speech segments to obtain candidate probabilities of any candidate word for that time period corresponding to the general corpus and to each historical speech segment includes:

determining the candidate probability of the candidate word corresponding to any historical speech segment based on the respective types of historical extended text corresponding to that historical speech segment and their corresponding importance coefficients.

According to the speech recognition method of an embodiment of the invention, the respective types of historical extended text include at least one of a browsed-content extended text, a hotword-query extended text, and a preset extended text;

the browsed-content extended text corresponding to any historical speech segment is obtained based on the following steps:

determining a second duration range of the historical speech segment;

screening browsed content within the second duration range from the application record data of the plurality of associated users;

and selecting, as the browsed-content extended text corresponding to the historical speech segment, at least one of browsed content associated with the hotwords, browsed content associated with at least two users, and browsed content associated with the current scene.

An embodiment of the invention further provides a speech recognition apparatus, including:

a speech data determination unit, configured to determine speech data to be recognized;

a speech recognition unit, configured to perform speech recognition on the speech data based on scene-associated text corresponding to the speech data, to obtain a speech recognition result of the speech data;

wherein the scene-associated text is determined based on application record data of a plurality of associated users.

An embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any of the above-mentioned speech recognition methods when executing the program.

Embodiments of the present invention also provide a non-transitory computer readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the speech recognition method as described in any of the above.

According to the speech recognition method and apparatus, the electronic device, and the storage medium provided by the embodiments of the invention, the application record data generated by different users across different applications in the same speech recognition scene is acquired, and scene-associated text is extracted by exploiting the similarity of the points of interest among the associated users. This provides the speech data to be recognized with auxiliary text highly associated with the current scene and improves the accuracy of the speech recognition result obtained based on that text.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a speech recognition method according to another embodiment of the present invention;

FIG. 3 is a flowchart illustrating a hotword determination method according to an embodiment of the present invention;

FIG. 4 is a diagram illustrating query terms provided by an embodiment of the present invention;

FIG. 5 is a flowchart illustrating a decoding method according to an embodiment of the present invention;

FIG. 6 is a flowchart illustrating a method for determining browsed-content extended text according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating browsing content provided by an embodiment of the present invention;

FIG. 8 is a flowchart illustrating a speech recognition method according to another embodiment of the present invention;

FIG. 9 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

An embodiment of the invention provides a speech recognition method. FIG. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present invention; as shown in FIG. 1, the method includes:

step 110, determining speech data to be recognized;

step 120, performing speech recognition on the speech data based on scene-associated text corresponding to the speech data, to obtain a speech recognition result of the speech data;

wherein the scene-associated text is determined based on application record data of a plurality of associated users.

Here, the plurality of associated users are intelligent-terminal users associated under the same speech recognition scene. For example, in a conference scene the associated users may be the participants of the conference; in a lecture scene they may be the listeners of the lecture; and so on. Because the associated users are in the same speech recognition scene, the application record data they generate with different applications on their mobile terminals in that scene (for example, the data obtained when each user queries or browses with a search engine, an entertainment or shopping application, or a life-service application during a conference or lecture) is usually strongly associated with the current speech recognition scene. Even if the topic currently changes, the change is reflected in the application record data of the associated users, so the extracted scene-associated text can be adjusted accordingly, ensuring that it stays highly associated with the current scene.

Therefore, text strongly associated with the current scene can be mined from the application record data of the plurality of associated users as the scene-associated text. Because different users in the same speech recognition scene share similar points of interest, the correlation between the application record data provided by different users can be used to mutually confirm how relevant each user's data is to the current speech recognition scene. This yields text more relevant to the current scene, improves recognition accuracy, eliminates irrelevant text content, and mitigates false triggering of speech recognition. In addition, acquiring the scene-associated text from the application record data of a plurality of associated users overcomes the user bias introduced when associated text is acquired from the application record data of a single user, further improving the association between the scene-associated text and the current scene.

Here, a sharing mechanism may first be established among the plurality of associated users to acquire the application record data of each user. For example, any user may initiate a sharing proposal, with sharing messages sent and accepted through intercommunicating channels within the existing local area network. When the other users confirm their participation in sharing, time synchronization can be checked and confirmed at the same time. For example, taking the time of the intelligent terminal of the user who initiated the sharing proposal as a reference, the intelligent terminals of the other users record their time offsets relative to the initiator, thereby achieving time synchronization.
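To illustrate the time-synchronization step, the minimal Python sketch below (the helper names and the exchange of a single reference timestamp are assumptions for illustration; the embodiment does not prescribe an API) records each participant's clock offset against the initiator and maps local timestamps onto the initiator's reference clock, so that query and browsing times from different devices become comparable:

```python
import time

def record_offset(initiator_timestamp: float) -> float:
    """Clock offset of this device relative to the sharing initiator,
    computed when the initiator's current time is received."""
    return time.time() - initiator_timestamp

def to_initiator_clock(local_timestamp: float, offset: float) -> float:
    """Map a locally recorded timestamp onto the initiator's reference clock."""
    return local_timestamp - offset
```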

Then, speech recognition is performed with the assistance of the scene-associated text corresponding to the speech data, to obtain the speech recognition result of the speech data to be recognized. For example, the scene-associated text can help determine the semantic information of the speech data to be recognized and can provide language expressions better suited to the current context, thereby resolving the ambiguity caused by homophones or near-homophones, producing a recognition result that better fits the language conventions of the current scene, and improving the accuracy of speech recognition.

According to the method provided by this embodiment of the invention, the application record data of different users across different applications in the same speech recognition scene is acquired, and scene-associated text is extracted by exploiting the similarity of the points of interest among the associated users, providing the speech data to be recognized with auxiliary text highly associated with the current scene and improving the accuracy of the speech recognition result obtained based on that text.

Based on the foregoing embodiment, FIG. 2 is a schematic flowchart of a speech recognition method according to another embodiment of the present invention; as shown in FIG. 2, step 120 includes:

step 121, decoding acoustic hidden-layer features of the speech data based on the scene-associated text corresponding to the speech data, to obtain the probability of each candidate word for each time period of the speech data;

step 122, determining the speech recognition result based on the probability of each candidate word for each time period of the speech data.

Here, the scene-associated text corresponding to the speech data to be recognized can provide language expressions better suited to the context of the current scene, which helps select the correct word from the many words with identical or similar pronunciations and obtain a recognition result that better fits the language conventions of the current scene.

Therefore, in decoding the acoustic hidden-layer features of the speech data, for any time period of the speech data (for example, the pronunciation of a character or a word), the phoneme information contained in the acoustic hidden-layer features of that period can be combined with the scene-associated text to determine the probability of each candidate word possibly expressed in that period. The acoustic hidden-layer features of the speech data can be used to determine the acoustic states and phonemes corresponding to the speech data. Then, based on the probability of each candidate word for each time period, the word corresponding to each period is determined, and the words are combined to form the speech recognition result of the whole speech data.
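As a simplified illustration of the final composition step, the sketch below greedily picks the most probable candidate word per time period (a real decoder would search over whole word sequences, so treat this as a sketch rather than the embodiment's decoding algorithm):

```python
def compose_result(period_candidate_probs: list[dict[str, float]]) -> str:
    """period_candidate_probs holds, for each time period, a mapping from
    candidate word to its decoded probability; pick the best word per period."""
    return "".join(max(probs, key=probs.get) for probs in period_candidate_probs)
```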

Based on any of the above embodiments, the scene-associated text includes hotwords;

step 121 includes:

correcting the probability of each candidate word for each time period of the speech data based on the hotwords, or based on the hotwords and their excitation coefficients, and determining the speech recognition result based on the corrected probability of each candidate word for each time period.

Here, the scene-associated text may include keywords that appear frequently in the application record data of the plurality of associated users, i.e., hotwords. Since a hotword occurs frequently across the associated users in the current speech recognition scene, it can be inferred that the hotword is also more likely to occur in the speech data. Thus, the probability of each candidate word for each time period of the speech data may be corrected based on the hotwords. For example, for any time period, the probability of a candidate word that is a hotword may be increased by a preset value, raising the chance that this candidate word is selected as the word for that period.

In addition, the scene-associated text may include multiple hotwords whose importance differs. For example, a hotword that appears more frequently is more important; likewise, a hotword that appears in the application record data of multiple users is being followed by multiple users and is therefore more important. Hence, when performing hotword excitation, hotwords of different importance can be distinguished by assigning them different excitation coefficients, which improves the effect of hotword excitation and further improves recognition accuracy. A hotword of higher importance receives a higher excitation coefficient, and the value added when correcting a candidate word's probability is correspondingly larger. The probability of each candidate word for each time period is then corrected based on the hotwords and their excitation coefficients; for example, for any time period, the preset value may be multiplied by the hotword's excitation coefficient before being added to the probability of the candidate word.
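As a concrete illustration of this correction, the following minimal Python sketch (the function name and the renormalization step are illustrative assumptions, not part of the embodiment) adds a preset boost, scaled by each hotword's excitation coefficient, to the probability of candidate words that are hotwords in one time period:

```python
def apply_hotword_excitation(candidate_probs: dict[str, float],
                             hotwords: dict[str, float],
                             base_boost: float = 0.05) -> dict[str, float]:
    """candidate_probs: candidate word -> probability for one time period.
    hotwords: hotword -> excitation coefficient (1.0 = lowest tier).
    base_boost: the preset value added for a hotword, scaled by its coefficient."""
    corrected = dict(candidate_probs)
    for word, prob in candidate_probs.items():
        if word in hotwords:
            corrected[word] = prob + base_boost * hotwords[word]
    # Renormalize so the corrected scores remain a probability distribution.
    total = sum(corrected.values())
    return {w: p / total for w, p in corrected.items()}
```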

According to the method provided by this embodiment of the invention, the scene-associated text includes the acquired hotwords, and hotword excitation raises the chance that a candidate word that is a hotword is selected as the word for its time period, improving the accuracy of speech recognition.

Based on any of the above embodiments, FIG. 3 is a schematic flowchart of a hotword determination method according to an embodiment of the present invention; as shown in FIG. 3, the method includes:

at step 310, a first duration range of historical speech data of the speech data is determined.

Here, the first duration range may be the time period over which the historical speech data lasts from the start of collection to the end of collection, and the historical speech data may be one or more sentences preceding the current speech data. The historical speech data and its first duration range can be intercepted according to the boundary information of the historical speech recognition results obtained before the current speech data.

At step 320, the query keywords input within the first duration range are screened from the application record data of the plurality of associated users.

Here, the query keywords in the application record data of the plurality of associated users, together with the times at which the keyword searches were performed, may first be acquired. For example, using the input-method functionality of the intelligent terminal, input records produced by a pinyin, voice, or handwriting input method can be taken as query keywords, and the time each record was produced can be taken as the time of the keyword search. Then, the query keywords input within the first duration range are obtained according to the search times. FIG. 4 is a schematic diagram of query keywords according to an embodiment of the present invention. As shown in FIG. 4, assume the first duration range is T0-T1, where T0 represents the beginning of the historical speech data and T1 represents its end, which is also the beginning of the current speech data. The query keywords screened within the first duration range are shown in FIG. 4, where U1K1 denotes the 1st query keyword input by user 1 in an application, and UNKM denotes the M-th query keyword input by user N.
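A minimal sketch of this time-window screening (the record layout is an assumption for illustration; timestamps are taken to be on the initiator's reference clock):

```python
from dataclasses import dataclass

@dataclass
class QueryRecord:
    user_id: str
    keyword: str
    timestamp: float  # time the keyword search was performed

def keywords_in_range(records: list[QueryRecord],
                      t0: float, t1: float) -> list[QueryRecord]:
    """Keep only the query keywords input within the first duration range [t0, t1]."""
    return [r for r in records if t0 <= r.timestamp <= t1]
```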

Step 330, selecting, as hotwords, query keywords input by at least a preset number of users, and/or query keywords input by any single user and associated with the current scene.

Here, if a plurality of different users, for example more than two, all input the same query keyword, that keyword may be taken as a hotword. For a query keyword that appears for only one user, its correlation with the currently available speech recognition results, or with pre-acquired text related to the current scene, can be computed based on a TF-IDF strategy, and a threshold can be set to select the query keywords most correlated with the current scene as hotwords. If a query keyword has identical or similar counterparts among the query keywords of the same user, a relatively low threshold can be set; for other query keywords, a relatively high threshold can be set. The hotwords obtained in this step are then deduplicated against the hotwords corresponding to the historical speech data and merged with them to obtain the hotwords corresponding to the current speech data. In addition, the hotwords corresponding to the current speech data may take effect immediately, or may take effect at the end time of the first duration range.

Before this, the query keywords of each individual user can be deduplicated, since a user may repeatedly input the same query keyword in different applications, or input identical or similar keywords after multiple attempts in the same application or after automatic correction by an engine. Specifically, for any user i, let the input query keywords be Ki1 through KiM. If identical keywords exist among the M query keywords, the redundant duplicates are deleted. If there are keywords judged similar in pronunciation based on a pinyin-restoration scheme, or similar in glyph based on an existing glyph-similarity detection scheme, their correlation with the currently available speech recognition results, or with pre-collected text related to the current scene, is computed based on the TF-IDF strategy, and a threshold is set to select from them the keywords more relevant to the current scene. If none of the similar keywords reaches the threshold, only the last-entered one is retained.
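The selection logic of steps 320-330 can be condensed as follows (a simplified sketch: the cosine-over-TF-IDF relevance measure, the character analyzer, the thresholds, and the helper names are assumptions; the embodiment only requires a TF-IDF strategy with thresholds):

```python
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_hotwords(user_keywords, repeated_kws, scene_text,
                    min_users=2, low_thresh=0.1, high_thresh=0.3):
    """user_keywords: dict user_id -> deduplicated query keywords of that user.
    repeated_kws: keywords that had duplicates or similar forms for some user.
    scene_text: existing recognition results plus pre-acquired scene-related text."""
    # Keywords input by at least min_users different users become hotwords directly.
    counts = Counter(kw for kws in user_keywords.values() for kw in set(kws))
    hotwords = {kw for kw, n in counts.items() if n >= min_users}
    # Score the remaining single-user keywords against the scene text with TF-IDF.
    singles = [kw for kw, n in counts.items() if n < min_users]
    if singles:
        vec = TfidfVectorizer(analyzer="char").fit([scene_text] + singles)
        scene_vec = vec.transform([scene_text])
        for kw in singles:
            score = cosine_similarity(vec.transform([kw]), scene_vec)[0, 0]
            # Keywords repeated/similar within one user get the lower threshold.
            thresh = low_thresh if kw in repeated_kws else high_thresh
            if score >= thresh:
                hotwords.add(kw)
    return hotwords
```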

According to the method provided by this embodiment of the invention, query keywords input by multiple users and/or query keywords input by individual users and associated with the current scene are selected as hotwords, which improves the association between the hotwords and the current speech recognition scene and further improves the accuracy of speech recognition.

Based on any of the above embodiments, excitation coefficients decrease in the following order: hotwords appearing in the query keywords of at least two users, hotwords with repeated or similar words in the query keywords of any single user, and other hotwords; and the excitation coefficient of any hotword increases with the frequency at which it appears in the query keywords of different users.

Here, if a hotword appears in the query keywords of multiple users, multiple users are following it, so the probability that it appears in the speech data of this scene is also high; its importance is therefore greater than that of a hotword appearing in the query keywords of only one user. Meanwhile, the more frequently a hotword appears across the query keywords of different users, the more users are following it and the more important it is. In addition, if a hotword has repeated or similar words among the query keywords of a single user, that hotword matters more to that user, so its importance is greater than that of other hotwords.

Therefore, when setting the excitation coefficients, the coefficients of hotwords appearing in the query keywords of at least two users, hotwords with repeated or similar words in the query keywords of a single user, and other hotwords decrease in that order, and the excitation coefficient of a hotword grows with the frequency at which it appears in the query keywords of different users.

Specifically, for a hotword appearing in the query keywords of at least two users, the excitation coefficient may be set to 1 + (number of occurrences across different users / number of users); for a hotword with repeated or similar words in the query keywords of a single user, the excitation coefficient may be set to 1 + 1/number of users; and the excitation coefficients of other hotwords may be set to 1.
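These three tiers translate directly into a small helper (a sketch assuming the occurrence counts have already been gathered per hotword):

```python
def excitation_coefficient(cross_user_occurrences: int, num_users: int,
                           repeated_within_user: bool) -> float:
    """Excitation coefficient for one hotword, per the three-tier scheme above."""
    if cross_user_occurrences >= 2:
        # Hotword appearing in the query keywords of at least two users.
        return 1.0 + cross_user_occurrences / num_users
    if repeated_within_user:
        # Hotword with repeated or similar words within a single user's queries.
        return 1.0 + 1.0 / num_users
    return 1.0  # All other hotwords.
```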

According to the method provided by this embodiment of the invention, when the excitation coefficients are set, the coefficients of hotwords appearing in the query keywords of at least two users, hotwords with repeated or similar words in the query keywords of a single user, and other hotwords decrease in that order, and the more frequently a hotword appears in the query keywords of different users, the higher its excitation coefficient. Hotwords of different importance are thereby distinguished, further improving the accuracy of speech recognition.

Based on any of the above embodiments, the scene-associated text includes historical extended texts corresponding to respective historical speech segments of the speech data;

step 121 includes:

decoding the acoustic hidden-layer features of the speech data based on a general corpus and the historical extended texts corresponding to the respective historical speech segments, to obtain the probability of each candidate word for each time period of the speech data.

Here, the scene-associated text may further include the historical extended text corresponding to each historical speech segment of the current speech data. The historical extended text corresponding to a historical speech segment may be text acquired or expanded from the application record data of the plurality of associated users during the period from the start to the end of the collection of that segment, for example the content users browsed during that period, or other text related to what they browsed or queried. Because each historical extended text is produced during the collection of a historical speech segment of the current speech data, it is strongly associated with the current speech recognition scene; it can provide language expressions that better fit the context of the current scene, helping to select the correct word among the many words with identical or similar pronunciations and to obtain a recognition result that better fits the language conventions of the current scene.

Therefore, when decoding the acoustic hidden-layer features of the current speech data, the general corpus and the historical extended texts corresponding to the historical speech segments can serve as the corpora consulted by the language model when it computes the probability of each candidate word for each time period. For example, with a statistical language model, the n-gram probability of each candidate word for a time period may be computed over a corpus composed of the general corpus and the historical extended texts corresponding to the historical speech segments. The better a candidate word fits the language expressions given by the corpus, the higher its probability.

According to the method provided by this embodiment of the invention, the scene-associated text includes the historical extended texts corresponding to the historical speech segments, so that, by updating the language-model corpus, the probability of candidate words that fit the language expressions given by the historical extended texts is increased, a recognition result better matching the current scene's language conventions is obtained, and the accuracy of speech recognition is improved.

Based on any of the above embodiments, FIG. 5 is a flowchart of a decoding method according to an embodiment of the present invention. As shown in FIG. 5, decoding the acoustic hidden-layer features of the speech data based on the general corpus and the historical extended texts corresponding to the historical speech segments to obtain the probability of each candidate word for each time period of the speech data includes:

step 1211, decoding the acoustic hidden-layer features of any time period of the speech data based on the general corpus and the historical extended texts corresponding to the respective historical speech segments, to obtain the candidate probabilities of any candidate word for that period corresponding to the general corpus and to each historical speech segment.

That is, the general corpus and the historical extended texts corresponding to the historical speech segments are each used as the corpus of the language model to decode the acoustic hidden-layer features of the speech data for the time period, and the candidate probabilities of the candidate word for that period corresponding to the general corpus and to each historical speech segment are computed.

Taking a trigram language model as an example, the general corpus and the historical extended texts corresponding to the historical speech segments can each be used as the corpus of the language model, and the candidate probabilities of any candidate word for any time period corresponding to the general corpus and to each historical speech segment can be computed as P_t(w_x | w_{x-2} w_{x-1}), P_{P_i}(w_x | w_{x-2} w_{x-1}), P_{P_{i-1}}(w_x | w_{x-2} w_{x-1}), ..., P_{P_1}(w_x | w_{x-2} w_{x-1}). Here there are i historical speech segments, w_{x-2} and w_{x-1} are the words of the two time periods preceding the current one, P_t(w_x | w_{x-2} w_{x-1}) is the candidate probability of the candidate word corresponding to the general corpus, and P_{P_i}(w_x | w_{x-2} w_{x-1}), P_{P_{i-1}}(w_x | w_{x-2} w_{x-1}), ..., P_{P_1}(w_x | w_{x-2} w_{x-1}) are its candidate probabilities corresponding to the respective historical speech segments.

Step 1212, determining the probability of the candidate word based on its candidate probabilities corresponding to the general corpus and to each historical speech segment, and on the weights corresponding to the general corpus and to each historical speech segment;

wherein the closer a historical speech segment is to the speech data, the greater its corresponding weight.

Here, the candidate probabilities of the candidate word corresponding to the general corpus and to each historical speech segment may be weighted and summed to obtain the probability of the candidate word. The closer a historical speech segment is to the current speech data, the more strongly the historical extended text produced during its collection is associated with the current speech data, and the more accurate the candidate probability computed from it, so the greater its weight. When setting the weights, a basic weight and a forgetting coefficient may be preset, and the basic weight is multiplied by the n-th power of the forgetting coefficient, where n counts backward from the segment immediately preceding the current speech data (n = 1 for the most recent historical speech segment). Since the weights must sum to 1 in the weighted summation, the difference between 1 and the sum of the weights of the historical speech segments is used as the weight of the general corpus. For example, the probability of the candidate word may be determined using the following formula:

P_new(w_x | w_{x-2} w_{x-1}) = (1-α)β P_{P_i}(w_x | w_{x-2} w_{x-1}) + (1-α)β^2 P_{P_{i-1}}(w_x | w_{x-2} w_{x-1}) + ... + (1-α)β^i P_{P_1}(w_x | w_{x-2} w_{x-1}) + [1 - (1-α)β - (1-α)β^2 - ... - (1-α)β^i] P_t(w_x | w_{x-2} w_{x-1})

where P_new(w_x | w_{x-2} w_{x-1}) is the probability of the candidate word, 1-α is the basic weight, and β is the forgetting coefficient.
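A minimal sketch of this interpolation (assuming the per-corpus trigram probabilities have already been computed; the names and default values are illustrative):

```python
def interpolated_probability(p_general: float, p_segments: list[float],
                             alpha: float = 0.5, beta: float = 0.8) -> float:
    """Weighted sum of trigram probabilities, most recent segment first.

    p_segments[0] is P_{P_i} for the segment closest to the current speech data;
    its weight (1-alpha)*beta decays by a factor of beta for each older segment.
    """
    weights = [(1 - alpha) * beta ** (n + 1) for n in range(len(p_segments))]
    p = sum(w * ps for w, ps in zip(weights, p_segments))
    # The remainder of the probability mass goes to the general corpus.
    return p + (1 - sum(weights)) * p_general
```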

According to the method provided by this embodiment of the invention, the general corpus and the historical extended texts corresponding to the historical speech segments are each used as the language-model corpus to compute the candidate probabilities of a candidate word, and the probability of the candidate word is determined from those candidate probabilities and the corresponding weights. The historical extended texts of different segments are thus distinguished in importance, the texts of segments closer to the current speech data are emphasized, and the accuracy of speech recognition is improved.

Based on any of the above embodiments, step 1211 includes:

determining the candidate probability of the candidate word corresponding to any historical speech segment based on the respective types of historical extended text corresponding to that historical speech segment and their corresponding importance coefficients.

Here, to enrich the historical extended text, different types of historical extended text may be acquired through different channels. For example, the content related to the current scene that users browsed during the collection of the historical speech segment may be acquired, or other text related to what they browsed or queried in that period. Different types of historical extended text are associated with the current scene to different degrees and therefore play correspondingly different roles in speech recognition.

To reflect these different roles in decoding, a corresponding importance coefficient may be set for each type of historical extended text, with a higher coefficient for types more relevant to the current scene. Each type of historical extended text corresponding to a historical speech segment is used as the corpus of the language model to compute the candidate probability of the candidate word corresponding to that type, and the results are weighted and summed using the importance coefficients to obtain the candidate probability of the candidate word corresponding to that historical speech segment.

Based on any of the above embodiments, each type of historical extended text includes at least one of a browsed-content extended text, a hotword-query extended text, and a preset extended text.

Here, the browsed-content extended text is text associated with the current scene that is acquired from the browsing data of the plurality of associated users; the hotword-query extended text is text acquired by keyword query in a pre-acquired corpus based on the existing hotwords; and the preset extended text is text acquired by text-similarity computation in the pre-acquired corpus based on the browsed-content extended text and/or the hotword-query extended text, the acquired text being strongly associated with them. Because the browsed-content extended text is obtained from the browsed content of the plurality of associated users, it is more strongly associated with the current scene, so its importance coefficient is higher than those of the hotword-query extended text and the preset extended text. For example, the importance coefficients of the hotword-query extended text and the preset extended text may be set to 1, and the importance coefficient of the browsed-content extended text may be set higher.

FIG. 6 is a flowchart illustrating a method for determining browsed-content extended text according to an embodiment of the present invention; as shown in FIG. 6, the method includes:

at step 610, a second duration range for the historical speech segment is determined.

Here, the second duration range may be the time period over which the historical speech segment lasts from the start of collection to the end of collection. Each historical speech segment and its second duration range can be intercepted according to the segmentation information of the historical speech recognition results obtained before the current speech data.

Step 620, screening browsed content within the second duration range from the application record data of the plurality of associated users.

Here, the browsed content in the application record data of the plurality of associated users, together with the times at which it was produced, may first be acquired. For example, the text content of the web pages corresponding to the addresses each user browsed in different applications can be acquired, or the interface each user browsed can be captured automatically and its text extracted with an existing optical character recognition method, with the time the browsed content was produced recorded. The browsed content within the second duration range is then obtained according to these times. FIG. 7 is a schematic diagram of browsed content according to an embodiment of the present invention. As shown in FIG. 7, assume the second duration range is P0-P1, where P0 represents the beginning of the historical speech segment and P1 represents its end, which is also the beginning of the next historical speech segment. The browsed content screened within the second duration range is shown in FIG. 7, where U1H1 denotes the 1st browsed content of user 1 in an application, and UNHL denotes the L-th browsed content of user N.

Step 630, selecting at least one of the browsed content associated with the hotwords, the browsed content associated with at least two users, and the browsed content associated with the current scene as the browsed-content extended text corresponding to the historical speech segment.

Here, the browsed content associated with the hotwords may be selected as browsed-content extended text. For example, for any user, based on the hotwords obtained during the collection of the historical speech segment, the existing TF-IDF strategy is used to compute the correlation between the hotwords and each piece of the user's browsed content, and content with higher correlation is screened out as browsed-content extended text. For the remaining browsed content, relevance between items from different users can be measured, and strongly related items can also be taken as browsed-content extended text. For convenience of description, the text obtained in these two ways may be called inter-user important browsed-content extended text, and its importance coefficient may be set to 1 + (number of selected texts / total number of texts), so as to emphasize the browsed content that different users jointly focus on. Here, the number of selected texts is the number of browsed-content extended texts screened out of all browsed content in the two ways above, and the total number of texts is the number of all browsed content items.

For the remaining browsed content, the relevance between it and the currently available speech recognition results is measured, and strongly related content is selected as browsed-content extended text to ensure strong association with the current scene. For convenience of description, text obtained in this way may be called intra-user important browsed-content extended text, and its importance coefficient may be set to 1.
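Steps 610-630 and the coefficient bookkeeping can be condensed as below (the relevance measure is abstracted behind a callable and the threshold is an assumed constant; both are illustrative, as the embodiment specifies only TF-IDF-style relevance with thresholds):

```python
def select_browsed_extended_text(browsed, hotwords, asr_so_far, relevance,
                                 thresh=0.3):
    """browsed: list of (user_id, text) items within the second duration range.
    relevance(a, b) -> float: any TF-IDF-style relevance measure.
    Returns inter-user texts with their importance coefficient, and intra-user texts."""
    inter_user, remaining = [], []
    hotword_text = " ".join(hotwords)
    # 1) Browsed content associated with the hotwords.
    for user_id, text in browsed:
        if relevance(text, hotword_text) >= thresh:
            inter_user.append(text)
        else:
            remaining.append((user_id, text))
    # 2) Content strongly related across different users.
    still_left = []
    for user_id, text in remaining:
        others = " ".join(t for u, t in remaining if u != user_id)
        if others and relevance(text, others) >= thresh:
            inter_user.append(text)
        else:
            still_left.append(text)
    inter_coeff = 1 + len(inter_user) / max(len(browsed), 1)
    # 3) Remaining content related to the existing recognition results (intra-user).
    intra_user = [t for t in still_left if relevance(t, asr_so_far) >= thresh]
    return inter_user, inter_coeff, intra_user
```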

On this basis, when determining the candidate probability of a candidate word for any time period corresponding to a historical speech segment based on the types of historical extended text for that segment and their importance coefficients, the candidate probabilities of the candidate word corresponding to the inter-user important browsed-content extended text, the intra-user important browsed-content extended text, the hotword-query extended text, and the preset extended text are computed with each used as the language-model corpus, and are then weighted and summed using their importance coefficients to obtain the candidate probability of the candidate word corresponding to that historical speech segment. For example, the candidate probability of the candidate word corresponding to the historical speech segment may be determined using the following formula:

P_{P_i}(w_x | w_{x-2} w_{x-1}) = U_{E_i} P_{E_i}(w_x | w_{x-2} w_{x-1}) + U_{I_i} P_{I_i}(w_x | w_{x-2} w_{x-1}) + U_{S_i} P_{S_i}(w_x | w_{x-2} w_{x-1}) + U_{B_i} P_B(w_x | w_{x-2} w_{x-1})

where P_{P_i}(w_x | w_{x-2} w_{x-1}) is the candidate probability of the candidate word corresponding to historical speech segment i; P_{E_i}(w_x | w_{x-2} w_{x-1}) is its candidate probability corresponding to the inter-user important browsed-content extended text and U_{E_i} is the corresponding importance coefficient; P_{I_i}(w_x | w_{x-2} w_{x-1}) is its candidate probability corresponding to the intra-user important browsed-content extended text and U_{I_i} is the corresponding importance coefficient; P_{S_i}(w_x | w_{x-2} w_{x-1}) is its candidate probability corresponding to the hotword-query extended text and U_{S_i} is the corresponding importance coefficient; and P_B(w_x | w_{x-2} w_{x-1}) is its candidate probability corresponding to the preset extended text and U_{B_i} is the corresponding importance coefficient.
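A one-function sketch of this per-segment combination, with names mirroring the formula (the default coefficients of 1 for the intra-user, hotword-query, and preset texts follow the examples given above):

```python
def segment_candidate_probability(p_inter: float, p_intra: float,
                                  p_hotword: float, p_preset: float,
                                  u_inter: float, u_intra: float = 1.0,
                                  u_hotword: float = 1.0,
                                  u_preset: float = 1.0) -> float:
    """P_{P_i} = U_E*P_E + U_I*P_I + U_S*P_S + U_B*P_B for one historical segment."""
    return (u_inter * p_inter + u_intra * p_intra
            + u_hotword * p_hotword + u_preset * p_preset)
```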

Based on any of the above embodiments, FIG. 8 is a schematic flowchart of a speech recognition method according to another embodiment of the present invention; as shown in FIG. 8, the method includes:

at step 810, a sharing mechanism is established among a plurality of associated users. And information sharing among a plurality of associated users is carried out so as to acquire application record data of each user.

At step 820, the application record data generated by the different applications used by each user is acquired. For example, the query keywords each user inputs when querying through different applications, such as search engines, entertainment and shopping applications, or life-service applications, and the content browsed in the query results are acquired.

Step 830, determining, based on the application record data of the plurality of associated users, the hotwords corresponding to the speech data to be recognized and their effective times. The hotwords may be generated using the hotword determination method provided in any of the above embodiments, which is not repeated here. In addition, the effective time of each hotword is the end time of the first duration range.

Step 840, determining, based on the application record data of the plurality of associated users, the historical extended text corresponding to each historical speech segment of the speech data to be recognized, and its effective time. The historical extended text includes the browsed-content extended text, the hotword-query extended text, and the preset extended text. The browsed-content extended text may be generated using the determination method provided in any of the above embodiments, which is not repeated here. The historical extended text corresponding to any historical speech segment is effective within the duration range of the speech segment that follows it.

Step 850: speech recognition is performed on the voice data based on the hotwords and the historical extended texts corresponding to the voice data to be recognized, to obtain the speech recognition result of the voice data.
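Purely to illustrate the flow of steps 810 to 850, the following toy sketch strings the stages together; every function, data shape, and value here is simplified stand-in logic under assumed inputs, not the actual implementation:

```python
from collections import Counter

# Toy illustration of steps 810-850; all logic is simplified stand-in code.

def determine_hotwords(user_queries, min_users=2):
    """Step 830, simplified: keep query keywords entered by >= min_users users."""
    counts = Counter(kw for kws in user_queries.values() for kw in set(kws))
    return {kw for kw, n in counts.items() if n >= min_users}

def decode(candidates_per_period, hotwords, boost=1.5):
    """Step 850, simplified: per time period, pick the candidate word with the
    highest probability after a flat hotword boost."""
    return [max(period, key=lambda wp: wp[1] * (boost if wp[0] in hotwords else 1.0))[0]
            for period in candidates_per_period]

# Steps 810-820: application record data shared among associated users (toy data).
user_queries = {"userA": ["vaccine", "mask"], "userB": ["vaccine"], "userC": ["news"]}
hotwords = determine_hotwords(user_queries)                      # {'vaccine'}
print(decode([[("vacuum", 0.5), ("vaccine", 0.4)]], hotwords))   # ['vaccine']
```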

The following describes a speech recognition apparatus provided in an embodiment of the present invention, and the speech recognition apparatus described below and the speech recognition method described above may be referred to correspondingly.

Based on any of the above embodiments, fig. 9 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention, and as shown in fig. 9, the apparatus includes a speech data determining unit 910 and a speech recognition unit 920.

The voice data determining unit 910 is configured to determine voice data to be recognized;

the voice recognition unit 920 is configured to perform voice recognition on the voice data based on the scene-related text corresponding to the voice data to obtain a voice recognition result of the voice data;

the scene associated text is determined based on application recording data of a plurality of associated users.

According to the apparatus provided by the embodiment of the invention, the application record data of different users across different applications in the same speech recognition scene is acquired, and the scene associated text is extracted by exploiting the similarity of the points of attention among associated users. This supplies the voice data to be recognized with auxiliary text highly associated with the current scene, improving the accuracy of the speech recognition result obtained based on that text.

Based on any of the above embodiments, the speech recognition unit 920 includes:

the decoding unit is configured to decode the acoustic hidden layer features of the voice data based on the scene associated text corresponding to the voice data, to obtain the probability of each candidate word for each time period of the voice data;

and the speech recognition result determining unit is configured to determine the speech recognition result based on the probability of each candidate word for each time period of the voice data.

Based on any embodiment, the scene associated text comprises hotwords;

the decoding unit includes:

and the hot word excitation unit is used for correcting the probability of each candidate participle of each time interval of the voice data based on the hot words or based on the hot words and the excitation coefficients thereof, and determining a voice recognition result based on the corrected probability of each candidate participle of each time interval.

In the apparatus provided by the embodiment of the invention, the scene associated text includes the obtained hotwords, so hotword excitation increases the likelihood that a candidate word which is a hotword is selected as the recognized word for a given time period, thereby improving the accuracy of speech recognition.

Based on any of the above embodiments, the apparatus further comprises a hotword determining unit, configured to:

determining a first duration range of historical speech data of the speech data;

screening the application usage data of the plurality of associated users for query keywords entered within the first duration range;

and selecting, as hotwords, the query keywords entered by at least a preset number of users and/or the query keywords entered by any user that are associated with the current scene.
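A minimal sketch of this screening, assuming each application record is a (timestamp, user, keyword) tuple and that scene association is judged by a caller-supplied predicate (both assumptions made for illustration):

```python
# Hypothetical hotword selection; record shape and predicate are assumptions.

def select_hotwords(records, t_start, t_end, min_users=2, is_scene_related=None):
    """Screen query keywords entered within the first duration range
    [t_start, t_end], then keep those entered by >= min_users distinct users
    and/or those associated with the current scene."""
    by_keyword = {}
    for ts, user, keyword in records:        # records: (timestamp, user, keyword)
        if t_start <= ts <= t_end:
            by_keyword.setdefault(keyword, set()).add(user)
    hotwords = {kw for kw, users in by_keyword.items() if len(users) >= min_users}
    if is_scene_related is not None:
        hotwords |= {kw for kw in by_keyword if is_scene_related(kw)}
    return hotwords

records = [(5, "userA", "vaccine"), (7, "userB", "vaccine"), (9, "userC", "news")]
print(select_hotwords(records, 0, 10))       # {'vaccine'}
```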

According to the apparatus provided by the embodiment of the invention, query keywords entered by multiple users, and/or query keywords entered by individual users that are associated with the current scene, are selected as hotwords, which increases the association between the hotwords and the current speech recognition scene and further improves the accuracy of speech recognition.

Based on any of the above embodiments, the excitation coefficients decrease in order from hotwords appearing in the query keywords of at least two users, to hotwords with repeated or similar words within any single user's query keywords, to the remaining hotwords; and the more frequently a hotword appears across different users' query keywords, the larger its excitation coefficient.

By setting the excitation coefficients in this way, the apparatus provided by the embodiment of the invention distinguishes hotwords of different importance, which is conducive to further improving the accuracy of speech recognition.
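One way such a tiered assignment could be realized is sketched below; the tier values and the per-user increment are illustrative assumptions, and near-duplicate ("similar word") detection is omitted for brevity:

```python
# Hypothetical tiered excitation coefficients; the values 1.5/1.3/1.1 and the
# 0.1 per-extra-user increment are illustrative. Only exact repeats within one
# user's queries are counted; similar-word detection is omitted.

def excitation_coefficients(hotwords, user_queries,
                            inter=1.5, intra=1.3, base=1.1, per_extra_user=0.1):
    coeffs = {}
    for hw in hotwords:
        users = {u for u, kws in user_queries.items() if hw in kws}
        if len(users) >= 2:
            # Shared across users: top tier, growing with the number of users.
            coeffs[hw] = inter + per_extra_user * (len(users) - 2)
        elif any(kws.count(hw) >= 2 for kws in user_queries.values()):
            coeffs[hw] = intra                 # repeated within one user's queries
        else:
            coeffs[hw] = base                  # remaining hotwords: lowest tier
    return coeffs
```

During decoding, the probability of a candidate word that is a hotword would then be multiplied by its coefficient before the per-period maximum is taken.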

Based on any of the above embodiments, the scene associated text includes the historical extended texts corresponding to the historical speech segments of the voice data;

the decoding unit includes:

and the probability calculation unit is used for decoding the acoustical hidden layer characteristics of the voice data based on the universal corpus and the historical extension texts corresponding to the historical voice segments to obtain the probability of each candidate participle of the voice data in each time period.

In the apparatus provided by the embodiment of the invention, the scene associated text includes the historical extended text corresponding to each historical speech segment. By updating the language model corpus, the probability of candidate words that conform to the language expression given by each historical extended text is increased, yielding recognition results that better match the expression conventions of the current scene and improving the accuracy of speech recognition.

Based on any one of the above embodiments, the probability calculation unit includes:

the candidate probability calculation unit is configured to decode the acoustic hidden layer features of any time period of the voice data based on the general corpus and on the historical extended texts corresponding to the respective historical speech segments, to obtain the candidate probabilities of any candidate word corresponding to the general corpus and to each historical speech segment in that time period;

a probability determining unit, configured to determine the probability of the candidate word based on the candidate probabilities of the candidate word corresponding to the general corpus and to each historical speech segment, and on the weights corresponding to the general corpus and to each historical speech segment;

wherein the closer a historical speech segment is to the voice data, the larger its corresponding weight.

The apparatus provided by the embodiment of the invention takes the general corpus and the historical extended texts corresponding to the respective historical speech segments as language model corpora, computes the candidate probability of any candidate word under each of them, and determines the probability of the candidate word from these candidate probabilities and their corresponding weights. This distinguishes the importance of the historical extended texts of different segments and emphasizes those closer to the current voice data, which is conducive to improving the accuracy of speech recognition.
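A minimal sketch of such recency-weighted mixing, assuming the per-segment candidate probabilities have already been computed and that the weights decay geometrically with segment age (the decay scheme and values are assumptions):

```python
# Hypothetical recency weighting: weights decay geometrically with segment age,
# so the historical segment closest to the current speech gets the most weight.

def mixed_prob(p_general, p_segments, w_general=0.5, decay=0.7):
    """Combine the general-corpus candidate probability with per-segment
    candidate probabilities (p_segments ordered oldest -> most recent)."""
    n = len(p_segments)
    raw = [decay ** (n - 1 - i) for i in range(n)]        # most recent -> 1.0
    scale = (1.0 - w_general) / sum(raw) if raw else 0.0
    return w_general * p_general + sum(r * scale * p for r, p in zip(raw, p_segments))

print(mixed_prob(0.02, [0.01, 0.03, 0.05]))  # the recent 0.05 dominates the mix
```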

Based on any of the above embodiments, the candidate probability calculation unit is configured to:

and determining the candidate probability of the candidate word segmentation corresponding to any historical voice segment based on the historical expansion texts of various types corresponding to the historical voice segment and the corresponding importance coefficients thereof.

Based on any of the above embodiments, the historical extended text includes at least one of browsing content extended text, hotword query extended text, and preset extended text.

The apparatus further includes a browsing content extended text determination unit configured to:

determining a second duration range of the historical speech segment;

screening the application record data of the plurality of associated users for content browsed within the second duration range;

and selecting, as the browsing content extended text corresponding to the historical speech segment, at least one of: browsing content associated with a hotword, browsing content associated with at least two users, and browsing content associated with the current scene.
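A sketch of this screening under assumed record shapes ((timestamp, user, content) tuples) and a caller-supplied scene predicate; all names are illustrative:

```python
# Hypothetical browsing content screening; record shape and predicate are
# assumptions made for illustration.

def select_browsing_extension(records, t_start, t_end, hotwords,
                              min_users=2, is_scene_related=lambda c: False):
    """Screen content browsed within the second duration range, then keep
    content that mentions a hotword, was browsed by >= min_users users, or is
    associated with the current scene."""
    by_content = {}
    for ts, user, content in records:        # records: (timestamp, user, content)
        if t_start <= ts <= t_end:
            by_content.setdefault(content, set()).add(user)
    return [content for content, users in by_content.items()
            if any(hw in content for hw in hotwords)
            or len(users) >= min_users
            or is_scene_related(content)]
```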

Fig. 10 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 10, the electronic device may include: a processor 1010, a communications interface 1020, a memory 1030, and a communication bus 1040, where the processor 1010, the communications interface 1020, and the memory 1030 communicate with one another via the communication bus 1040. The processor 1010 may invoke logic instructions in the memory 1030 to perform a speech recognition method comprising: determining voice data to be recognized; performing speech recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a speech recognition result of the voice data; the scene associated text being determined based on application record data of a plurality of associated users.

Furthermore, the logic instructions in the memory 1030 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the speech recognition method provided by the above-mentioned method embodiments, where the method includes: determining voice data to be recognized; performing voice recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a voice recognition result of the voice data; the scene associated text is determined based on application recording data of a plurality of associated users.

In yet another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the speech recognition method provided in the foregoing embodiments, the method comprising: determining voice data to be recognized; performing speech recognition on the voice data based on the scene associated text corresponding to the voice data to obtain a speech recognition result of the voice data; the scene associated text being determined based on application record data of a plurality of associated users.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
