Speech recognition system and method

Document No.: 1510439 | Publication date: 2020-02-07

Abstract: This technology, "Speech recognition system and method" (语音识别系统和方法), was designed and created by 李秀林 on 2018-06-15. Its main content is as follows. Systems and methods for speech recognition are provided. The method may include obtaining at least two candidate recognition results of speech information uttered by a user and at least two preliminary scores respectively corresponding to the at least two candidate recognition results. The method may further include, for each of the at least two candidate recognition results, extracting one or more key terms from the candidate recognition result, and determining at least one parameter related to the one or more extracted key terms. The method may further include generating, for each of the at least two candidate recognition results, an update coefficient based on the at least one parameter, and updating the preliminary score based on the update coefficient to generate an updated score. The method may further include determining a target recognition result from the at least two candidate recognition results based on the at least two updated scores.

1. A method implemented on a computing device having at least one storage device storing a set of instructions for speech recognition, a data exchange port communicatively connected to a network, and at least one processor in communication with the at least one storage device and the data exchange port, the method comprising:

acquiring at least two candidate recognition results of speech information uttered by a user and at least two preliminary scores respectively corresponding to the at least two candidate recognition results;

for each of the at least two candidate recognition results,

extracting one or more key terms from the candidate recognition result;

determining at least one parameter associated with the one or more extracted key terms;

generating an update coefficient based on the at least one parameter; and

updating the preliminary score based on the update coefficient to generate an updated score; and

determining a target recognition result from the at least two candidate recognition results based on the at least two updated scores.
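By way of illustration only, the following Python sketch shows one possible reading of the flow recited in claim 1. The helper functions, the multiplicative update rule, and the toy data are assumptions; the claim does not prescribe any of them.

    from typing import Callable, List, Tuple

    def rescore_candidates(
        candidates: List[Tuple[str, float]],
        extract_key_terms: Callable[[str], List[str]],
        update_coefficient: Callable[[List[str]], float],
    ) -> str:
        """Rescore candidates and return the target recognition result."""
        updated = []
        for text, preliminary_score in candidates:
            key_terms = extract_key_terms(text)          # extract key terms
            coefficient = update_coefficient(key_terms)  # parameters -> coefficient
            updated.append((text, preliminary_score * coefficient))  # updated score
        # target recognition result: the candidate with the highest updated score
        return max(updated, key=lambda pair: pair[1])[0]

    # Toy usage with stand-in extraction and coefficient functions.
    candidates = [("navigate to the train station", 0.62),
                  ("navigate to the rain station", 0.64)]
    print(rescore_candidates(
        candidates,
        extract_key_terms=lambda text: [w for w in text.split() if len(w) > 4],
        update_coefficient=lambda terms: 1.2 if "train" in terms else 1.0,
    ))  # -> "navigate to the train station"

Although "rain station" has the higher preliminary score here, the key-term-driven coefficient lets the semantically plausible candidate win.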

2. The method of claim 1, wherein determining at least one parameter associated with the one or more extracted key terms comprises:

acquiring at least two sample key terms from a database via the data exchange port;

for each of the one or more extracted key terms,

determining a degree of match between the extracted key term and each of the at least two sample key terms;

determining one or more target sample key terms from the at least two sample key terms, wherein a degree of match between each of the one or more target sample key terms and the extracted key term is above a degree of match threshold; and

determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms.
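A minimal sketch of the matching step recited in claim 2, assuming difflib's sequence ratio as a stand-in for the unspecified degree-of-match metric and an invented threshold of 0.8:

    import difflib

    def target_sample_terms(extracted_term, sample_terms, threshold=0.8):
        """Return the target sample key terms for one extracted key term:
        those whose degree of match with it is above the threshold."""
        matches = []
        for sample in sample_terms:
            degree = difflib.SequenceMatcher(None, extracted_term, sample).ratio()
            if degree > threshold:
                matches.append((sample, degree))
        return matches

    # "rain station" (a likely misrecognition) still matches the sample term.
    print(target_sample_terms("rain station", ["train station", "bus stop"]))
    # -> [('train station', 0.96)]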

3. The method of claim 2, wherein the at least one parameter comprises a retrieval parameter, and determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms comprises:

determining the retrieval parameter based on the degree of match between the one or more target sample key terms and the one or more extracted key terms.

4. The method of claim 2, wherein the at least one parameter comprises a heat parameter, and determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms comprises:

acquiring a heat of the one or more target sample key terms; and

determining the heat parameter based on the heat of the one or more target sample key terms.

5. The method of claim 2, wherein the at least one parameter comprises a preference parameter, and determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms comprises:

acquiring the user's degree of preference for the one or more target sample key terms; and

determining the preference parameter based on the degree of preference for the one or more target sample key terms.

6. The method of claim 2, wherein the at least one parameter comprises a distance parameter, and determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms comprises:

obtaining location information associated with the one or more target sample key terms;

identifying one or more location type indicators in the candidate recognition result;

for each of the one or more extracted key terms that immediately follows the one or more identified location type indicators, determining a location type of the extracted key term based on the corresponding location type indicator;

determining distance information associated with the one or more extracted key terms based on the location information associated with the one or more target sample key terms and a location type of each of the one or more extracted key terms; and

determining the distance parameter based on the distance information.
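A sketch of the location-type step recited in claim 6. The English indicator words, the coordinate table, and the planar distance computation are illustrative assumptions only; the claim leaves the indicators and the distance derivation open.

    import math

    INDICATORS = {"from": "origin", "to": "destination"}   # location type indicators
    COORDS = {"airport": (40.08, 116.58), "museum": (39.92, 116.40)}  # target sample terms

    def distance_info(tokens):
        """Type each key term that immediately follows an indicator, then
        derive distance information from the typed locations."""
        typed = {}
        for i, token in enumerate(tokens[:-1]):
            nxt = tokens[i + 1]
            if token in INDICATORS and nxt in COORDS:
                typed[INDICATORS[token]] = COORDS[nxt]
        if {"origin", "destination"} <= typed.keys():
            return math.dist(typed["origin"], typed["destination"])  # crude planar distance
        return None

    print(distance_info("go from airport to museum".split()))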

7. The method of claim 6, further comprising:

obtaining a first number of travel times corresponding to the distance information associated with each of at least one travel mode in a statistical time period;

determining a second number of travel times corresponding to the distance information associated with all of the at least one travel mode in the statistical time period;

for each of the at least one travel mode, determining a usage probability of the travel mode based on the first number of travel times and the second number of travel times;

acquiring a travel mode associated with the speech information; and

determining the distance parameter based on the usage probability of the travel mode associated with the speech information.
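A numeric sketch of the usage probability recited in claim 7, namely the first number of travel times divided by the second. The travel modes and counts are invented:

    def usage_probability(trip_counts, mode):
        """trip_counts maps each travel mode to its number of trips for the
        relevant distance information within the statistical time period."""
        first_number = trip_counts[mode]            # trips by this mode
        second_number = sum(trip_counts.values())   # trips by all modes
        return first_number / second_number

    counts = {"taxi": 120, "bus": 60, "walk": 20}   # hypothetical statistics
    print(usage_probability(counts, "taxi"))        # 0.6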

8. The method of claim 1, wherein the at least one parameter comprises at least one of a heat parameter, a preference parameter, a retrieval parameter, or a distance parameter, and generating the update coefficient based on the at least one parameter comprises:

generating the update coefficient based on the heat parameter, the preference parameter, and the retrieval parameter; or

generating the update coefficient based on the distance parameter and the retrieval parameter.
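A sketch of the two parameter combinations that claim 8 admits. Multiplication is only one plausible operator, as the claim does not fix how the parameters are combined:

    def update_coefficient(retrieval, heat=None, preference=None, distance=None):
        """Combine whichever parameter set is available into one coefficient."""
        if distance is not None:
            return retrieval * distance      # distance + retrieval branch
        return retrieval * heat * preference  # heat + preference + retrieval branch

    print(update_coefficient(0.9, heat=1.2, preference=1.1))  # ~1.188
    print(update_coefficient(0.9, distance=1.3))              # ~1.17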

9. The method of claim 1, wherein the target recognition result comprises a departure place or a destination, and the method further comprises:

generating a service request based on the target recognition result.

10. The method of claim 9, further comprising:

sending the service request to a user terminal associated with a service provider.

11. A speech recognition system comprising:

at least one storage device comprising a set of instructions;

a data exchange port communicatively connected to a network; and

at least one processor in communication with the at least one storage device and the data exchange port, wherein the at least one processor is configured to execute the set of instructions and is directed to cause the system to:

acquiring at least two candidate recognition results of speech information uttered by a user and at least two preliminary scores respectively corresponding to the at least two candidate recognition results;

for each of the at least two candidate recognition results,

extracting one or more key terms from the candidate recognition result;

determining at least one parameter associated with the one or more extracted key terms;

generating an update coefficient based on the at least one parameter; and

updating the preliminary score based on the update coefficient to generate an updated score; and

determining a target recognition result from the at least two candidate recognition results based on the at least two updated scores.

12. The system of claim 11, wherein to determine the at least one parameter associated with the one or more extracted key terms, the at least one processor is further directed to cause the system to:

obtaining at least two sample key terms from a database via the data exchange port;

for each of the one or more extracted key terms,

determining a degree of match between the extracted key term and each of the at least two sample key terms; and

determining one or more target sample key terms from the at least two sample key terms, wherein a degree of match between each of the one or more target sample key terms and the extracted key term is above a degree of match threshold; and

determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms.

13. The system of claim 12, wherein the at least one parameter comprises a retrieval parameter, and to determine the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms, the at least one processor is further directed to cause the system to:

determining the retrieval parameter based on the degree of match between the one or more target sample key terms and the one or more extracted key terms.

14. The system of claim 12, wherein the at least one parameter comprises a heat parameter, and to determine the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms, the at least one processor is further directed to cause the system to:

acquiring a heat of the one or more target sample key terms; and

determining the heat parameter based on the heat of the one or more target sample key terms.

15. The system of claim 12, wherein the at least one parameter comprises a preference parameter, and to determine the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms, the at least one processor is further directed to cause the system to:

acquiring the user's degree of preference for the one or more target sample key terms; and

determining the preference parameter based on the degree of preference for the one or more target sample key terms.

16. The system of claim 12, wherein the at least one parameter comprises a distance parameter, and to determine the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms, the at least one processor is further directed to cause the system to:

obtaining location information associated with the one or more target sample key terms;

identifying one or more location type indicators in the candidate recognition result;

for each of the one or more extracted key terms that immediately follows the one or more identified location type indicators, determining a location type of the extracted key term based on the corresponding location type indicator;

determining distance information associated with the one or more extracted key terms based on location information associated with the one or more target sample key terms and a location type of each of the one or more extracted key terms; and

determining the distance parameter based on the distance information.

17. The system of claim 16, wherein the at least one processor is further directed to cause the system to:

obtaining a first number of travel times corresponding to the distance information associated with each of at least one travel mode in a statistical time period;

determining a second number of travel times corresponding to the distance information associated with all of the at least one travel mode in the statistical time period;

for each of the at least one travel mode, determining a usage probability of the travel mode based on the first number of travel times and the second number of travel times;

acquiring a travel mode associated with the speech information; and

determining the distance parameter based on the usage probability of the travel mode associated with the speech information.

18. The system of claim 11, wherein the at least one parameter comprises at least one of a heat parameter, a preference parameter, a retrieval parameter, or a distance parameter, and to generate the update coefficient based on the at least one parameter, the at least one processor is further directed to cause the system to:

generating the update coefficient based on the heat parameter, the preference parameter, and the retrieval parameter; or

generating the update coefficient based on the distance parameter and the retrieval parameter.

19. The system of claim 11, wherein the target recognition result comprises a departure place or a destination, and the at least one processor is further directed to cause the system to:

generating a service request based on the target recognition result.

20. The system of claim 19, wherein the at least one processor is further directed to cause the system to:

sending the service request to a user terminal associated with a service provider.

21. A non-transitory computer-readable medium comprising a set of instructions for speech recognition that, when executed by at least one processor, direct the at least one processor to implement a method comprising:

acquiring at least two candidate recognition results of speech information uttered by a user and at least two preliminary scores respectively corresponding to the at least two candidate recognition results;

for each of the at least two candidate recognition results,

extracting one or more key terms from the candidate recognition result;

determining at least one parameter associated with the one or more extracted key terms;

generating an update coefficient based on the at least one parameter; and

updating the preliminary score based on the update coefficient to generate an updated score; and

determining a target recognition result from the at least two candidate recognition results based on the at least two updated scores.

22. A method implemented on a computing device having at least one storage device storing a set of instructions for speech recognition and at least one processor in communication with the at least one storage device, the method comprising:

acquiring at least two candidate recognition results and at least two preliminary scores of speech information provided by a current user, wherein each of the at least two preliminary scores corresponds to one of the at least two candidate recognition results;

extracting one or more key terms of a preset type from each of the at least two candidate recognition results based on a predetermined key term extraction rule; and

revising the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key terms, and determining a target recognition result of the speech information based on the revised results.

23. The method of claim 22, wherein revising the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key terms comprises:

determining an update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms based on a similarity between the one or more extracted key terms and at least two sample key terms in a preset sample database; and

updating the preliminary score corresponding to each of the at least two candidate recognition results based on the update coefficient to generate an updated score corresponding to each of the at least two candidate recognition results.

24. The method of claim 23, wherein the preset sample database further comprises at least one of heat information of the at least two sample key terms or historical information of the current user's usage of the at least two sample key terms.

25. The method of claim 24, wherein

the preset sample database further comprises the heat information of the at least two sample key terms, and

determining an update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms based on a similarity between the one or more extracted key terms and the at least two sample key terms in the preset sample database comprises:

determining a similarity between the one or more extracted key terms and the at least two sample key terms;

selecting one or more sample key terms from the at least two sample key terms, wherein a similarity between the one or more extracted key terms and the one or more selected sample key terms is greater than a similarity threshold;

converting the heat information of the one or more selected sample key terms into one or more heat parameters according to a first conversion relationship between the heat information and the heat parameters; and

determining the update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms based on the one or more heat parameters.

26. The method of claim 25, wherein

the heat information of the at least two sample key terms includes at least two heats of the at least two sample key terms corresponding to at least two periodic statistical time periods, and

converting the heat information of the one or more selected sample key terms into the one or more heat parameters according to the first conversion relationship between the heat information and the heat parameters comprises:

determining a statistical time period to which the current time belongs;

selecting one or more heats corresponding to the statistical time period from the at least two heats of the one or more selected sample key terms corresponding to the at least two periodic statistical time periods; and

converting the one or more heats into the one or more heat parameters of each of the at least two candidate recognition results according to a second conversion relationship between the heats and the heat parameters.
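A sketch of the periodic-heat selection recited in claim 26, assuming hour-of-day statistical time periods and normalization by the peak heat as the second conversion relationship; both choices are assumptions for illustration:

    from datetime import datetime

    # Hypothetical periodic heat table: one heat per hour-of-day statistical
    # time period for one sample key term.
    HEAT_BY_HOUR = {"coffee shop": [3, 1, 0, 0, 0, 2, 9, 25, 30, 22, 15, 18,
                                    24, 20, 14, 12, 11, 13, 10, 8, 6, 5, 4, 3]}

    def heat_parameter(sample_term, now=None):
        now = now or datetime.now()
        period = now.hour                    # statistical time period of current time
        heats = HEAT_BY_HOUR[sample_term]
        return heats[period] / max(heats)    # second conversion relationship

    print(heat_parameter("coffee shop", datetime(2018, 6, 15, 8)))  # 1.0 at 8 a.m.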

27. The method of claim 24, wherein

the preset sample database further comprises the heat information of the at least two sample key terms and the historical information of the current user using the at least two sample key terms; and

determining an update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms based on a similarity between the one or more extracted key terms and the at least two sample key terms in the preset sample database comprises:

determining a similarity between the one or more extracted key terms and the at least two sample key terms;

converting the similarity into a retrieval parameter according to a third conversion relationship between the similarity and the retrieval parameter;

converting the similarity into a preference parameter according to a fourth conversion relationship between the similarity and the preference parameter;

determining a heat parameter based on the similarity, the heat information of the at least two sample key terms, and a first conversion relationship between the heat information and the heat parameter; and

determining the update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms by adding the retrieval parameter to, or multiplying it by, the higher of the preference parameter and the heat parameter,

wherein, for the same similarity, the preference parameter converted according to the fourth conversion relationship is larger than the heat parameter determined based on the first conversion relationship.
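A sketch of the combination rule recited in claim 27, with invented parameter values:

    def update_coefficient_27(retrieval, preference, heat, multiply=False):
        """Add the retrieval parameter to, or multiply it by, the higher of
        the preference parameter and the heat parameter."""
        higher = max(preference, heat)
        return retrieval * higher if multiply else retrieval + higher

    # For the same similarity the preference parameter must exceed the heat
    # parameter, so a term the current user has used before wins a tie:
    print(update_coefficient_27(retrieval=1.0, preference=1.5, heat=1.25))  # 2.5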

28. An apparatus for speech recognition, comprising:

at least one storage device comprising a set of instructions; and

at least one processor in communication with the at least one storage device, wherein the at least one processor is configured to execute the set of instructions, the at least one processor comprising:

an information acquisition module configured to acquire at least two candidate recognition results of speech information provided by a current user and at least two preliminary scores, wherein each of the at least two preliminary scores corresponds to one of the at least two candidate recognition results;

an information extraction module configured to extract one or more key terms of a preset type from each of the at least two candidate recognition results based on a predetermined key term extraction rule; and

a result determination module configured to revise the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key terms and determine a target recognition result of the speech information based on the revised result.

29. The apparatus of claim 28, wherein the result determination module comprises:

an update coefficient determination sub-module configured to determine an update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms based on a similarity between the one or more extracted key terms and at least two sample key terms in a preset sample database; and

an information modification sub-module configured to update the preliminary score corresponding to each of the at least two candidate recognition results based on the update coefficient to generate an updated score corresponding to each of the at least two candidate recognition results.

30. The apparatus of claim 29, wherein the preset sample database further comprises at least one of heat information of the at least two sample key terms or historical information of the current user's use of the at least two sample key terms.

31. The apparatus of claim 30, wherein

the preset sample database further comprises the heat information of the at least two sample key terms, and

the update coefficient determination sub-module is further configured to:

determining a similarity between the one or more extracted key terms and the at least two sample key terms;

selecting one or more sample key terms from the at least two sample key terms, wherein a similarity between the one or more extracted key terms and the one or more selected sample key terms is greater than a similarity threshold;

converting the heat information of the one or more selected sample key terms into one or more heat parameters based on a first conversion relationship between the heat information and the heat parameters; and

determining an update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms based on the one or more heat parameters.

32. The apparatus of claim 31, wherein

the heat information of the at least two sample key terms includes at least two heats of the at least two sample key terms corresponding to at least two periodic statistical time periods, and

the update coefficient determination sub-module is further configured to:

determining a statistical time period to which the current time belongs;

selecting one or more heats corresponding to the statistical time period from the at least two heats of the one or more selected sample key terms corresponding to the at least two periodic statistical time periods; and

converting the one or more heats into the one or more heat parameters of each of the at least two candidate recognition results according to a second conversion relationship between the heats and the heat parameters.

33. The apparatus of claim 30, wherein

the preset sample database further comprises the heat information of the at least two sample key terms and the historical information of the current user using the at least two sample key terms; and

the update coefficient determination sub-module includes:

a similarity determination unit configured to determine a similarity between the one or more extracted key terms and the at least two sample key terms;

a retrieval parameter determination unit configured to convert the similarity into a retrieval parameter according to a third conversion relationship between the similarity and the retrieval parameter;

a preference parameter determination unit configured to convert the similarity into a preference parameter according to a fourth conversion relationship between the similarity and the preference parameter;

a heat parameter determination unit configured to determine a heat parameter based on the similarity, the heat information of the at least two sample key terms, and a first conversion relationship between the heat information and the heat parameter; and

an update coefficient determination unit configured to determine the update coefficient of each of the at least two candidate recognition results having the one or more extracted key terms by adding the retrieval parameter to, or multiplying it by, the higher of the preference parameter and the heat parameter,

wherein, for the same similarity, the preference parameter converted according to the fourth conversion relationship is larger than the heat parameter determined based on the first conversion relationship.

34. A non-transitory computer-readable medium comprising a set of instructions for speech recognition, which when executed by at least one processor, direct the at least one processor to implement a method comprising:

obtaining at least two candidate recognition results and at least two preliminary scores of speech information provided by a current user, wherein each of the at least two preliminary scores corresponds to one of the at least two candidate recognition results;

extracting one or more key terms of a preset type from each of the at least two candidate recognition results based on a predetermined key term extraction rule; and

revising the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key terms, and determining a target recognition result of the speech information based on the revised results.

35. A method implemented on a computing device having at least one storage device storing a set of instructions for speech recognition in a transportation service and at least one processor in communication with the at least one storage device, the method comprising:

receiving and analyzing speech information to generate at least two candidate recognition results and at least two preliminary scores for the speech information, wherein each of the at least two preliminary scores corresponds to one of the at least two candidate recognition results;

extracting information of at least one location from each of the at least two candidate recognition results;

searching a database for one or more points of interest (POIs) matching each of the at least one location, and determining a first parameter of each of the at least two candidate recognition results based on a match between the searched one or more POIs and each of the at least one location;

determining a location type of each of the at least one location for each of the at least two candidate recognition results, and determining a second parameter of each of the at least two candidate recognition results based on the location type;

determining an updated score corresponding to each of the at least two candidate recognition results based on the preliminary score, the first parameter, and the second parameter corresponding to each of the at least two candidate recognition results; and

determining a highest updated score of at least two updated scores corresponding to the at least two candidate recognition results, and outputting a recognition result corresponding to the highest updated score.

36. The method of claim 35, wherein searching a database for one or more points of interest matching each of the at least one location, and determining a first parameter of each of the at least two candidate recognition results based on the match between the searched one or more points of interest and each of the at least one location comprises:

when a point of interest matching the at least one location is found in the database,

determining the first parameter of the recognition result as 1;

when no point of interest matching the at least one location is found in the database,

determining a degree of match between each of the one or more points of interest in the database and the at least one location;

when the degree of match between each of the one or more points of interest and the at least one location is less than or equal to a first degree of match threshold,

determining the first parameter of the recognition result as 0; and

when the degree of match between each of the one or more points of interest and the at least one location is greater than the first degree of match threshold,

determining the first parameter of the recognition result based on the degree of match, wherein the first parameter of the recognition result is proportional to the degree of match.
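A sketch of the first-parameter assignment recited in claim 36, again using difflib as a stand-in for the unspecified match metric; treating the degree of match itself as the proportional value is an assumption:

    import difflib

    def first_parameter(location, poi_names, first_threshold=0.5):
        """1 on an exact POI match; otherwise 0 when the best degree of match
        is at or below the threshold, else a value proportional to it."""
        if location in poi_names:
            return 1.0
        best = max(difflib.SequenceMatcher(None, location, poi).ratio()
                   for poi in poi_names)
        return best if best > first_threshold else 0.0

    pois = ["peking university", "beijing zoo"]
    print(first_parameter("peking university", pois))  # 1.0
    print(first_parameter("peking univ", pois))        # ~0.79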

37. The method of claim 35, wherein determining a location type of each of the at least one location for each of the at least two candidate recognition results comprises:

determining whether the recognition result includes origin information before the information of the at least one location;

in response to determining that the recognition result does not include origin information before the information of any of the at least one location,

determining location information related to the speech information as the origin; and

in response to determining that the recognition result includes origin information before the information of the at least one location,

searching the database for a first point of interest matching the at least one location, and determining a first location corresponding to the first point of interest as the origin; or

searching the database for at least two second points of interest, wherein a degree of match between each of the at least two second points of interest and the at least one location is greater than a second degree of match threshold, determining a second location corresponding to each of the at least two second points of interest, and determining a first average location as the origin based on the second locations corresponding to the at least two second points of interest.
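A minimal sketch of the averaging fallback in claim 37: when no single point of interest matches exactly, the first average location is computed over the second locations that cleared the second degree-of-match threshold. Coordinates are invented:

    def first_average_location(second_locations):
        """second_locations: (lat, lon) pairs of the second points of interest
        whose degree of match cleared the second threshold."""
        lats = [lat for lat, _ in second_locations]
        lons = [lon for _, lon in second_locations]
        return sum(lats) / len(lats), sum(lons) / len(lons)

    print(first_average_location([(39.90, 116.39), (39.92, 116.41)]))
    # -> approximately (39.91, 116.40), used as the origin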

38. The method of claim 37, wherein determining a location type of each of the at least one location for each of the at least two candidate recognition results further comprises:

determining whether the recognition result includes destination information before the information of the at least one location;

in response to determining that the recognition result does not include destination information before the information of any of the at least one location,

generating a notification prompting the user to provide destination information; and

in response to determining that the recognition result includes destination information before the information of the at least one location,

searching the database for a third point of interest matching the information of the at least one location, and determining a third location corresponding to the third point of interest as the destination; or

searching the database for at least two fourth points of interest, wherein a degree of match between each of the at least two fourth points of interest and the at least one location is greater than a third degree of match threshold, determining a fourth location corresponding to each of the at least two fourth points of interest, and determining a second average location as the destination based on the fourth locations corresponding to the at least two fourth points of interest.

39. The method of claim 38, wherein determining the second parameter of each of the at least two candidate recognition results based on the location type comprises:

determining, for each of the at least two candidate recognition results, distance information from the origin to the destination;

determining at least one travel mode corresponding to the distance information;

determining a number of trips corresponding to the distance information for each of the at least one travel mode in a statistical time period;

determining a usage probability of each of the at least one travel mode based on the number of trips corresponding to that travel mode and the total number of trips in the statistical time period; and

determining the usage probability as the second parameter.

40. The method of any one of claims 35-39, further comprising:

correlating the name associated with each point of interest with the location corresponding to the point of interest, and storing the correlation in a database.

41. A speech recognition system for use in a transportation service, comprising:

at least one storage device comprising a set of instructions; and

at least one processor in communication with the at least one storage device, wherein the at least one processor is configured to execute the set of instructions, the at least one processor comprising:

a preliminary score determination module configured to receive and analyze speech information to generate at least two candidate recognition results and at least two preliminary scores for the speech information, wherein each of the at least two preliminary scores corresponds to one of the at least two candidate recognition results;

an extraction module configured to extract information of at least one location from each of the at least two candidate recognition results;

a first parameter assignment module configured to search a database for one or more points of interest (POIs) matching each of the at least one location, and to determine a first parameter of each of the at least two candidate recognition results based on a match between the searched one or more POIs and each of the at least one location;

a second parameter assignment module configured to determine a location type of each of the at least one location in each of the at least two candidate recognition results, and to determine a second parameter of each of the at least two candidate recognition results based on the location type;

a modification module configured to determine an updated score corresponding to each of the at least two candidate recognition results based on the first parameter and the second parameter; and

an output module configured to determine a highest updated score of at least two updated scores corresponding to the at least two candidate recognition results, and to output a recognition result corresponding to the highest updated score.

42. The system of claim 41, wherein the first parameter assignment module is configured to:

when a point of interest matching the at least one location is found in the database,

determining the first parameter of the recognition result as 1;

when no point of interest matching the at least one location is found in the database,

determining a degree of match between each of the one or more points of interest in the database and the at least one location;

when the degree of match between each of the one or more points of interest and the at least one location is less than or equal to a first degree of match threshold,

determining the first parameter of the recognition result as 0; and

when the degree of match between each of the one or more points of interest and the at least one location is greater than the first degree of match threshold,

determining the first parameter of the recognition result based on the degree of match, wherein the first parameter of the recognition result is proportional to the degree of match.

43. The system of claim 41, wherein the second parameter assignment module includes an origin determination sub-module configured to:

determining whether the recognition result includes origin information before the information of the at least one location;

in response to determining that the recognition result does not include origin information before the information of any of the at least one location,

determining location information related to the speech information as the origin; and

in response to determining that the recognition result includes origin information before the information of the at least one location,

searching the database for a first point of interest matching the at least one location, and determining a first location corresponding to the first point of interest as the origin; or

searching the database for at least two second points of interest, wherein a degree of match between each of the at least two second points of interest and the at least one location is greater than a second degree of match threshold, determining a second location corresponding to each of the at least two second points of interest, and determining a first average location as the origin based on the second locations corresponding to the at least two second points of interest.

44. The system of claim 43, wherein the second parameter assignment module includes a destination determination sub-module configured to:

determining whether the recognition result includes destination information before the information of the at least one location;

in response to determining that the recognition result does not include destination information before the information of any of the at least one location,

generating a notification prompting the user to provide destination information; and

in response to determining that the recognition result includes destination information before the information of the at least one location,

searching the database for a third point of interest matching the information of the at least one location, and determining a third location corresponding to the third point of interest as the destination; or

searching the database for at least two fourth points of interest, wherein a degree of match between each of the at least two fourth points of interest and the at least one location is greater than a third degree of match threshold, determining a fourth location corresponding to each of the at least two fourth points of interest, and determining a second average location as the destination based on the fourth locations corresponding to the at least two fourth points of interest.

45. The system of claim 44, wherein the second parameter assignment module further comprises:

a distance determination sub-module configured to determine, for each of the at least two candidate recognition results, distance information from the origin to the destination; and

a probability determination sub-module configured to:

determining at least one travel mode corresponding to the distance information;

determining a number of trips corresponding to the distance information for each of the at least one travel mode in a statistical time period;

determining a usage probability of each of the at least one travel mode based on the number of trips corresponding to that travel mode and the total number of trips in the statistical time period; and

determining the usage probability as the second parameter.

46. The system of any one of claims 41-45, further comprising a correlation module configured to:

correlating the name associated with each point of interest with the location corresponding to the point of interest, and storing the correlation in the database.
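A minimal sketch of the correlation step recited in claims 40 and 46, with a dict standing in for the unspecified database:

    poi_database = {}   # stand-in for the database

    def correlate(name, location):
        """Correlate a POI name with its location and store the correlation."""
        poi_database[name] = location

    correlate("beijing zoo", (39.939, 116.333))
    print(poi_database["beijing zoo"])  # (39.939, 116.333)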

47. A computing device comprising at least one storage device storing a set of instructions and at least one processor in communication with the at least one storage device, wherein the at least one processor, when executing the set of instructions, is directed to implement the method of any one of claims 35-40.

48. A non-transitory computer-readable medium comprising a set of instructions for speech recognition, wherein the set of instructions, when executed by at least one processor, directs the at least one processor to perform the method of any one of claims 35-40.

Technical Field

The present application relates generally to speech information processing and, more particularly, to a method and system for speech recognition.

Background

With the development of computer technology, human-computer interaction has become increasingly common. A basic requirement of human-computer interaction is that the computer understand the information the user provides. With the development of acoustic models and speech recognition technologies, such as Automatic Speech Recognition (ASR), speech is often the first choice for users interacting with computers because of its convenience. However, current speech recognition methods usually perform single-pass recognition, converting the speech information into only one possible recognition result. In other words, the speech information provided by different people in different scenarios may be converted into the same result, which matches the true intent of only one or a few of them. For example, the voice message "I want to order a high table dinner" may be treated as an instruction to purchase a particular table, while the user actually wants to book a formal dinner at a restaurant. Misrecognized results are typically displayed directly to the user without correction, and the user may need to repeat himself or herself several times before the computer accurately understands his or her meaning. The experience of using current speech recognition methods is therefore neither easy nor pleasant. Accordingly, it is desirable to provide systems and methods for recognizing speech information more accurately and efficiently.

Disclosure of Invention

According to one aspect of the present application, a method for speech recognition is provided. The method may be implemented on a computing device having at least one storage device storing a set of instructions for speech recognition, a data exchange port communicatively connected to a network, and at least one processor in communication with the at least one storage device and the data exchange port. The method may include acquiring at least two candidate recognition results of speech information uttered by a user and at least two preliminary scores respectively corresponding to the at least two candidate recognition results. The method may further include, for each of the at least two candidate recognition results, extracting one or more key terms from the candidate recognition result and determining at least one parameter associated with the one or more extracted key terms. The method may further include, for each of the at least two candidate recognition results, generating an update coefficient based on the at least one parameter, and updating the preliminary score based on the update coefficient to generate an updated score. The method may further include determining a target recognition result from the at least two candidate recognition results based on the at least two updated scores.

In some embodiments, determining the at least one parameter associated with the one or more extracted key terms may include obtaining at least two sample key terms from a database via the data exchange port. For each of the one or more extracted key terms, it may further include determining a degree of match between the extracted key term and each of the at least two sample key terms, and determining one or more target sample key terms from the at least two sample key terms, where the degree of match between each of the one or more target sample key terms and the extracted key term may be above a degree of match threshold. It may further include determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms.

In some embodiments, the at least one parameter may include a retrieval parameter, and determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms may include determining a retrieval parameter based on the degree of match between the one or more target sample key terms and the one or more extracted key terms.

In some embodiments, the at least one parameter may include a heat parameter, and determining at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms may include obtaining a heat of the one or more target sample key terms and determining the heat parameter from the heat of the one or more target sample key terms.

In some embodiments, the at least one parameter may include a preference parameter, and determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms may include obtaining the user's degree of preference for the one or more target sample key terms and determining the preference parameter based on the degree of preference for the one or more target sample key terms.

In some embodiments, the at least one parameter may include a distance parameter, and determining the at least one parameter associated with the one or more extracted key terms based on the one or more target sample key terms may include obtaining location information associated with the one or more target sample key terms and identifying one or more location type indicators in the candidate recognition result. For each of the one or more extracted key terms that immediately follows the one or more identified location type indicators, it may further include determining a location type of the extracted key term based on the corresponding location type indicator, and determining distance information associated with the one or more extracted key terms based on the location information associated with the one or more target sample key terms and the location type of each of the one or more extracted key terms. It may further include determining the distance parameter based on the distance information.

In some embodiments, the method may further include obtaining a first number of travel times corresponding to the distance information associated with each of at least one travel mode in a statistical time period, and determining a second number of travel times corresponding to the distance information associated with all of the at least one travel mode in the statistical time period. The method may further include, for each of the at least one travel mode, determining a usage probability of the travel mode based on the first number of travel times and the second number of travel times, and acquiring the travel mode associated with the speech information. The method may further include determining the distance parameter based on the usage probability of the travel mode associated with the speech information.

In some embodiments, the at least one parameter may include at least one of a heat parameter, a preference parameter, a retrieval parameter, or a distance parameter. Generating an update coefficient based on the at least one parameter may include generating the update coefficient based on the heat parameter, the preference parameter, and the retrieval parameter, or generating the update coefficient based on the distance parameter and the retrieval parameter.

In some embodiments, the target recognition result may include a departure place or a destination, and the method may further include generating a service request based on the target recognition result.

In some embodiments, the method may further comprise sending the service request to a user terminal associated with a service provider.

According to another aspect of the present application, a system for speech recognition is provided. The system may include at least one storage device including a set of instructions, a data exchange port communicatively connected to a network, and at least one processor in communication with the at least one storage device and the data exchange port. The at least one processor may be configured to execute the set of instructions and be directed to cause the system to obtain at least two candidate recognition results of speech information uttered by a user and at least two preliminary scores respectively corresponding to the at least two candidate recognition results. The at least one processor may be further directed to cause the system to, for each of the at least two candidate recognition results, extract one or more key terms from the candidate recognition result, determine at least one parameter associated with the one or more extracted key terms, generate an update coefficient based on the at least one parameter, and update the preliminary score based on the update coefficient to generate an updated score. The at least one processor may be further directed to cause the system to determine a target recognition result from the at least two candidate recognition results based on the at least two updated scores.

According to another aspect of the present application, a non-transitory computer-readable medium is provided. The non-transitory computer-readable medium may include a set of instructions for speech recognition. The set of instructions, when executed by at least one processor, may direct the at least one processor to implement a method. The method may include acquiring at least two candidate recognition results of speech information uttered by a user and at least two preliminary scores respectively corresponding to the at least two candidate recognition results. The method may further include, for each of the at least two candidate recognition results, extracting one or more key terms from the candidate recognition result, determining at least one parameter associated with the one or more extracted key terms, generating an update coefficient based on the at least one parameter, and updating the preliminary score based on the update coefficient to generate an updated score. The method may further include determining a target recognition result from the at least two candidate recognition results based on the at least two updated scores.

According to another aspect of the present application, a method for speech recognition is provided. The method may be implemented on a computing device having at least one storage device storing a set of instructions for speech recognition and at least one processor in communication with the at least one storage device. The method may include obtaining at least two candidate recognition results and at least two preliminary scores of speech information provided by a current user, wherein each of the at least two preliminary scores corresponds to one of the candidate recognition results. The method may further include extracting one or more key terms of a preset type from each of the at least two candidate recognition results based on a predetermined key term extraction rule. The method may further include revising the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key terms, and determining a target recognition result of the speech information based on the revised results.

In some embodiments, revising the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key terms may include determining an update coefficient for each of the at least two candidate recognition results having the one or more extracted key terms based on a similarity between the one or more extracted key terms and at least two sample key terms in a preset sample database. It may further include updating the preliminary score corresponding to each of the at least two candidate recognition results based on the update coefficient to generate an updated score corresponding to each of the at least two candidate recognition results.

In some embodiments, the preset sample database may further include at least one of heat information of the at least two sample key terms or historical information of the current user's usage of the at least two sample key terms.

In some embodiments, the preset sample database may further include heat information of the at least two sample key terms. Determining an update coefficient for each of the at least two candidate recognition results having the one or more extracted key terms based on a similarity between the one or more extracted key terms and the at least two sample key terms in the preset sample database may include determining a similarity between the one or more extracted key terms and the at least two sample key terms, selecting one or more sample key terms from the at least two sample key terms, converting the heat information of the one or more selected sample key terms into one or more heat parameters according to a first conversion relationship between the heat information and the heat parameters, and determining the update coefficient for each of the at least two candidate recognition results with the one or more extracted key terms based on the one or more heat parameters. The similarity between the one or more extracted key terms and the one or more selected sample key terms may be greater than a similarity threshold.

In some embodiments, the heat information of the at least two sample key terms may include at least two heats of the at least two sample key terms corresponding to at least two periodic statistical time periods. Converting the heat information of the one or more selected sample key terms into the one or more heat parameters according to the first conversion relationship between the heat information and the heat parameters may include determining a statistical time period to which the current time belongs, selecting one or more heats corresponding to the statistical time period from the at least two heats of the one or more selected sample key terms corresponding to the at least two periodic statistical time periods, and converting the one or more heats into the one or more heat parameters of each of the at least two candidate recognition results according to a second conversion relationship between the heats and the heat parameters.

In some embodiments, the preset sample database may further include the heat information of the at least two sample key terms and historical information of the current user using the at least two sample key terms. Determining an update coefficient for each of the at least two candidate recognition results having the one or more extracted key terms based on a similarity between the one or more extracted key terms and the at least two sample key terms in the preset sample database may include determining a similarity between the one or more extracted key terms and the at least two sample key terms, converting the similarity into a retrieval parameter according to a third conversion relationship between the similarity and the retrieval parameter, converting the similarity into a preference parameter according to a fourth conversion relationship between the similarity and the preference parameter, and determining a heat parameter based on the similarity, the heat information of the at least two sample key terms, and a first conversion relationship between the heat information and the heat parameter. It may further include obtaining the update coefficient for each of the at least two candidate recognition results having the one or more extracted key terms by adding the retrieval parameter to, or multiplying it by, the higher of the preference parameter and the heat parameter. For the same similarity, the preference parameter converted according to the fourth conversion relationship may be larger than the heat parameter determined based on the first conversion relationship.

According to another aspect of the present application, there is provided an apparatus for speech recognition. The apparatus may include at least one storage device including a set of instructions and at least one processor in communication with the at least one storage device. The at least one processor may be configured to execute the set of instructions. The at least one processor may include an information acquisition module configured to acquire at least two candidate recognition results of speech information provided by the current user and at least two preliminary scores, wherein each of the at least two preliminary scores corresponds to one of the candidate recognition results. The at least one processor may further include an information extraction module configured to extract one or more key words of a preset type from each of the at least two candidate recognition results based on a predetermined key word extraction rule. The at least one processor may further include a result determination module configured to revise the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key words, and to determine a target recognition result of the speech information based on a result of the revising.

According to another aspect of the present application, a non-transitory computer-readable medium is provided. The non-transitory computer-readable medium may include a set of instructions for speech recognition. The instructions, when executed by at least one processor, may direct the at least one processor to implement a method. The method may include obtaining at least two candidate recognition results and at least two preliminary scores of speech information provided by the current user, wherein each of the at least two preliminary scores corresponds to one of the candidate recognition results. The method may also include extracting one or more key words of a preset type from each of the at least two candidate recognition results based on a predetermined key word extraction rule. The method may further include revising the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key words, and determining a target recognition result of the speech information based on a result of the revising.

According to another aspect of the present application, a method for speech recognition in a transportation service is provided. The method may be implemented on a computing device having at least one storage device storing a set of instructions for speech recognition in a transportation service, and at least one processor in communication with the at least one storage device. The method may include receiving and analyzing speech information to generate at least two candidate recognition results and at least two preliminary scores for the speech information, wherein each of the at least two preliminary scores may correspond to one of the at least two candidate recognition results. The method may further include extracting information of at least one location from each of the at least two candidate recognition results. The method may further include searching a database for one or more points of interest (POIs) matching each of the at least one location, and determining a first parameter for each of the at least two candidate recognition results based on a result of the matching between the searched one or more POIs and each of the at least one location. The method may further include determining a location type for each of the at least one location in each of the at least two candidate recognition results, and determining a second parameter for each of the at least two candidate recognition results based on the location type. The method may further include determining an updated score corresponding to each of the at least two candidate recognition results based on the preliminary score, the first parameter, and the second parameter corresponding to each of the at least two candidate recognition results. The method may further include determining a highest update score of the at least two update scores corresponding to the at least two candidate recognition results, and outputting the recognition result corresponding to the highest update score.

In some embodiments, searching the database for one or more POIs matching each of the at least one location, and determining the first parameter for each of the at least two candidate recognition results based on the results of the matching between the searched one or more POIs and each of the at least one location, may include: determining the first parameter of the recognition result to be 1 when a POI matching the at least one location is found in the database; determining a degree of match between each of the one or more POIs in the database and the at least one location when no POI matching the at least one location is found in the database; determining the first parameter of the recognition result to be 0 when the degree of match between each of the one or more POIs and the at least one location is less than or equal to a first degree-of-match threshold; and determining the first parameter of the recognition result based on the degree of match when the degree of match between each of the one or more POIs and the at least one location is greater than the first degree-of-match threshold, wherein the first parameter of the recognition result may be proportional to the degree of match.
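The rule above might be sketched as follows in Python. The difflib-based match_degree helper and the 0.6 threshold are illustrative stand-ins for whatever matching scheme an implementation actually uses:

```python
from difflib import SequenceMatcher

FIRST_MATCH_THRESHOLD = 0.6  # illustrative value of the first degree-of-match threshold

def match_degree(location, poi_name):
    # Assumed string-similarity measure; any fuzzy matcher could stand in here.
    return SequenceMatcher(None, location.lower(), poi_name.lower()).ratio()

def first_parameter(location, poi_names):
    # 1 for an exact hit in the database.
    if location in poi_names:
        return 1.0
    # Otherwise compare against every POI in the database.
    best = max((match_degree(location, name) for name in poi_names), default=0.0)
    # 0 when no POI clears the threshold; otherwise proportional to
    # (here, equal to) the best degree of match.
    return 0.0 if best <= FIRST_MATCH_THRESHOLD else best

poi_names = ["People's Park", "People's Square", "Central Station"]
print(first_parameter("People's Park", poi_names))   # 1.0 (exact match)
print(first_parameter("Peoples Park", poi_names))    # ~0.96 (fuzzy match only)
```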

In some embodiments, determining the location type of each of the at least one location corresponding to each of the at least two candidate recognition results may include determining whether the recognition result includes departure location information prior to the information of the at least one location. In response to determining that the recognition result does not include departure location information prior to the information of any of the at least one location, location information associated with the speech information may be determined as the departure location. In response to determining that the recognition result includes departure location information prior to the information of the at least one location, the database may be searched for a first POI matching the at least one location, and a first location corresponding to the first POI may be determined as the departure location; alternatively, the database may be searched for at least two second POIs, a second location corresponding to each of the at least two second POIs may be determined, and a first average location, determined based on the second locations corresponding to the at least two second POIs, may be determined as the departure location. A degree of match between each of the at least two second POIs and the at least one location may be greater than a second degree-of-match threshold.

In some embodiments, determining the location type of each of the at least one location corresponding to each of the at least two candidate recognition results may further include determining whether the recognition result includes destination information prior to the information of the at least one location. In response to determining that the recognition result does not include destination information prior to the information of any of the at least one location, a notification may be generated for notifying the user to provide the destination information. In response to determining that the recognition result includes destination information prior to the information of the at least one location, the database may be searched for a third POI matching the information of the at least one location, and a third location corresponding to the third POI may be determined as the destination; alternatively, the database may be searched for at least two fourth POIs, a fourth location corresponding to each of the at least two fourth POIs may be determined, and a second average location, determined based on the fourth locations corresponding to the at least two fourth POIs, may be determined as the destination. A degree of match between each of the at least two fourth POIs and the at least one location may be greater than a third degree-of-match threshold.
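A sketch of resolving a spoken location to either an exact POI or the average location of several close matches follows. The POI_DB coordinates, the resolve_location helper, and the 0.7 threshold are hypothetical, and averaging latitude/longitude is only a reasonable approximation when the matched POIs are near one another:

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.7  # stands in for the second/third degree-of-match thresholds

# Hypothetical POI database: name -> (latitude, longitude).
POI_DB = {
    "People's Park": (31.2304, 121.4737),
    "People's Square": (31.2310, 121.4750),
    "Central Station": (31.2400, 121.4550),
}

def resolve_location(spoken_name):
    # Exact hit: use the matching POI's location directly.
    if spoken_name in POI_DB:
        return POI_DB[spoken_name]
    # Otherwise average the locations of every POI whose degree of match
    # exceeds the threshold.
    hits = [coords for name, coords in POI_DB.items()
            if SequenceMatcher(None, spoken_name.lower(), name.lower()).ratio() > MATCH_THRESHOLD]
    if not hits:
        return None  # the caller may then prompt the user for more information
    return (sum(c[0] for c in hits) / len(hits),
            sum(c[1] for c in hits) / len(hits))

print(resolve_location("peoples park"))  # averaged location of the close matches
```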

In some embodiments, determining the second parameter of each of the at least two candidate recognition results based on the location type may include: determining distance information from the departure location to the destination for each of the at least two candidate recognition results; determining at least one travel pattern corresponding to the distance information; determining a number of trips corresponding to the distance information for each of the at least one travel pattern in a statistical time period; determining a usage probability of each of the at least one travel pattern based on the number of trips corresponding to each of the at least one travel pattern and the total number of trips in the statistical time period; and determining the usage probability as the second parameter.
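For example, the usage probability might be computed as in the following sketch, where the TRIPS_BY_PATTERN counts and the distance band are invented for illustration:

```python
# Hypothetical trip statistics for one distance band (e.g., 5-10 km) in a
# statistical time period: travel pattern -> number of trips.
TRIPS_BY_PATTERN = {"taxi": 1200, "bus": 2600, "bicycle": 200}

def second_parameter(pattern, trips_by_pattern):
    # Usage probability of a travel pattern: its trip count divided by the
    # total number of trips in the statistical time period.
    total = sum(trips_by_pattern.values())
    return trips_by_pattern.get(pattern, 0) / total if total else 0.0

print(second_parameter("taxi", TRIPS_BY_PATTERN))  # 1200 / 4000 = 0.3
```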

In some embodiments, the method may further include correlating a name associated with each POI with a location corresponding to the POI, and storing the correlation in a database.

In some embodiments, a computing device is provided. The computing device may include at least one storage device storing a set of instructions and at least one processor in communication with the at least one storage device. The instructions, when executed, may direct the at least one processor to implement the above-described methods.

In some embodiments, a non-transitory computer-readable medium is provided. The non-transitory computer-readable medium may include a set of instructions for speech recognition. The instructions, when executed by at least one processor, may direct the at least one processor to implement the methods described above.

According to another aspect of the present application, a speech recognition system for a transportation service is provided. The system may include at least one storage device including a set of instructions and at least one processor in communication with the at least one storage device. The at least one processor may be configured to execute the set of instructions. The at least one processor may include a preliminary score determination module configured to receive and analyze speech information to generate at least two candidate recognition results and at least two preliminary scores for the speech information, wherein each of the at least two preliminary scores may correspond to one of the at least two candidate recognition results. The at least one processor may further include an extraction module configured to extract information of at least one location from each of the at least two candidate recognition results. The at least one processor may further include a first parameter assignment module configured to search a database for one or more points of interest (POIs) matching each of the at least one location, and determine a first parameter for each of the at least two candidate recognition results based on a result of the matching between the searched one or more POIs and each of the at least one location. The at least one processor may further include a second parameter assignment module configured to determine a location type for each of the at least one location in each of the at least two candidate recognition results, and determine a second parameter for each of the at least two candidate recognition results based on the location type. The at least one processor may further include a modification module configured to determine an update score corresponding to each of the at least two candidate recognition results based on the preliminary score, the first parameter, and the second parameter. The at least one processor may further include an output module configured to determine a highest update score of the at least two update scores corresponding to the at least two candidate recognition results, and output the recognition result corresponding to the highest update score.

Additional features will be set forth in part in the description which follows, and in part will become apparent to those having ordinary skill in the art upon examination of the following drawings, or may be learned by production or operation of the examples. The features of the present application may be realized and attained by means of the instruments, methods, and combinations set forth in the detailed examples discussed below.

Drawings

The present application will be further described in conjunction with the exemplary embodiments. The exemplary embodiments may be described in detail with reference to the accompanying drawings. These embodiments are not intended to be limiting, and in these embodiments, like reference numerals are used to refer to like structures, wherein:

FIG. 1 is a schematic diagram of an exemplary speech recognition system according to some embodiments of the present application;

FIG. 2 is a schematic diagram of exemplary hardware and/or software components of an exemplary computing device, according to some embodiments of the present application;

FIG. 3 is a schematic diagram of an exemplary terminal device shown in accordance with some embodiments of the present application;

FIG. 4 is a block diagram of an exemplary speech recognition device shown in accordance with some embodiments of the present application;

FIG. 5 is a schematic diagram of an exemplary process for speech recognition, shown in accordance with some embodiments of the present application;

FIG. 6 is a flow diagram illustrating an exemplary process for determining a target recognition result for speech information according to some embodiments of the present application;

FIG. 7 is a flow diagram illustrating an exemplary process of determining update coefficients according to some embodiments of the present application;

FIG. 8 is a schematic diagram of an exemplary process for speech recognition shown in accordance with some embodiments of the present application;

FIG. 9 is a schematic diagram of an exemplary process for speech recognition shown in accordance with some embodiments of the present application; and

FIG. 10 is a schematic diagram illustrating an exemplary interface for generating a service request based on voice information according to some embodiments of the present application.

Detailed Description

The following description is presented to enable one of ordinary skill in the art to make and use the application and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present application. Thus, the present application is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of the present application. As used herein, the singular forms "a", "an" and "the" may include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

These and other features and characteristics of the present application, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description. Reference is made to the accompanying drawings, which form a part hereof, however, it is to be understood that the drawings are for illustration and description only and are not intended to limit the scope of the present application. It should be understood that the drawings are not to scale.

The flowcharts used in this application illustrate operations implemented by systems according to some embodiments of the present application. It is to be expressly understood that the operations of the flowcharts need not be performed in the order shown; these operations may instead be performed in reverse order or simultaneously. Moreover, one or more other operations may be added to the flowcharts, and one or more operations may be deleted from the flowcharts.

Further, while the systems and methods disclosed herein are primarily directed to speech recognition in a transportation service, it should be understood that this is merely one exemplary embodiment. The system or method of the present application may be applied to any other scenario in which speech information needs to be recognized. For example, the system or method of the present application may be applied to an e-commerce service, an online shopping service, a voice control system, and the like, or any combination thereof. Application scenarios of the system or method of the present application may include web pages, plug-ins for browsers, client terminals, customization systems, internal analysis systems, artificial intelligence robots, and the like, or any combination thereof.

The departure location of the transportation service in the present application may be obtained by a positioning technology embedded in a wireless device (e.g., a user terminal). Positioning technologies used in the present application may include a Global Positioning System (GPS), a Global Navigation Satellite System (GLONASS), a BeiDou navigation satellite system (COMPASS), a Galileo positioning system, a Quasi-Zenith Satellite System (QZSS), a wireless fidelity (WiFi) positioning technology, and the like, or any combination thereof. One or more of the above positioning technologies may be used interchangeably herein. For example, a GPS-based method and a WiFi-based method may be used together as a positioning technology to locate a wireless device.

As used in this application, "voice information" may refer to a stream of audio data. The terms "voice information" and "voice data" may be used interchangeably. In some embodiments, the voice information may be acquired by a microphone of a user terminal (e.g., a cell phone, an in-vehicle device). In some embodiments, the voice information may be converted to text and displayed on a screen of the user terminal before being further processed by the user terminal (e.g., when the user is "typing" by voice). In some embodiments, the voice information may be converted to voice commands for controlling the user terminal, such as playing music, dialing numbers, and the like. In some embodiments, the voice information may be converted to a service request (e.g., a taxi service, a navigation service). Operations related to the service request may be performed after the voice information is recognized. For example, after a destination, a departure location, and/or a start time is identified, a taxi service request may be sent to a service provider (e.g., a driver).

One aspect of the present application relates to a system and/or method for speech recognition. For example, voice information may be obtained from a user terminal. The speech information may be processed to generate at least two candidate recognition results and corresponding preliminary scores. Each of the at least two candidate recognition results may be further evaluated. For example, one or more key words may be extracted from each of the at least two candidate recognition results. The one or more extracted key words may be compared with at least two sample key words to determine one or more target sample key words from the at least two sample key words. At least one parameter may be determined based on the one or more extracted key words, the at least one parameter including a retrieval parameter associated with a degree of match between the extracted key words and the target sample key words, a heat parameter associated with the use of the target sample key words by at least two users, a preference parameter associated with the use of the target sample key words by the user providing the voice information, a distance parameter associated with a road distance from a departure location to a destination determined based on the target sample key words, or the like, or any combination thereof. An update coefficient may be determined based on the at least one parameter and used to update the preliminary score corresponding to each of the at least two candidate recognition results. The target recognition result may then be selected from the at least two candidate recognition results based on the update scores.
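Putting the pieces together, the rescoring step might look like the following sketch, where the candidate texts, scores, and the toy coefficient function are invented for illustration:

```python
def rescore(candidates, coefficient_fn):
    # Multiply each preliminary score by its update coefficient and return
    # the candidate with the highest updated score as the target result.
    updated = [(text, score * coefficient_fn(text)) for text, score in candidates]
    return max(updated, key=lambda pair: pair[1])

# Two toy candidates; the acoustically favored one loses once key-word
# evidence (here, a hypothetical POI hit) is taken into account.
candidates = [("navigate to People's Park", 0.48),
              ("navigate to peoples bark", 0.51)]
best = rescore(candidates, lambda t: 1.5 if "People's Park" in t else 1.0)
print(best)  # ('navigate to People's Park', 0.72)
```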

FIG. 1 is a schematic diagram of an exemplary speech recognition system, shown in accordance with some embodiments of the present application. For example, the speech recognition system 100 may be a service platform for speech recognition services. The speech recognition system 100 may include a server 110, a network 120, a user terminal 130, and a storage 140 (also referred to as a database). The server 110 may include a processing engine 112.

The server 110 may be used to process voice information. For example, the server 110 may obtain voice information of a user from the user terminal 130 via the network 120. The server 110 may access a database in the storage 140 and recognize the voice information based on that database. The recognition result of the voice information may be transmitted back to the user terminal 130 via the network 120. In some embodiments, the server 110 may be a single server or a group of servers. The group of servers may be centralized or distributed (e.g., the server 110 may be a distributed system). In some embodiments, the server 110 may be local or remote. For example, the server 110 may access information and/or data stored in the user terminal 130 and/or the storage 140 via the network 120. As another example, the server 110 may be directly connected to the user terminal 130 and/or the storage 140 to access information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, and the like, or any combination thereof. In some embodiments, the server 110 may be implemented on a computing device 200 having one or more of the components shown in FIG. 2.

In some embodiments, the server 110 may include a processing engine 112. Processing engine 112 may process the voice information to perform one or more functions of server 110 described herein. In some embodiments, the processing engine 112 may obtain the user's voice information from the user terminal 130 and recognize the voice information to generate at least two candidate recognition results and at least two preliminary scores. The processing engine 112 may further determine an update coefficient for each candidate recognition result and update the preliminary score based on the update coefficient. For example, processing engine 112 may retrieve target data from one or more databases stored in storage 140 and determine update coefficients based on the target data.

The processing engine 112 may further determine a target recognition result from the candidate recognition results based on the update scores. For voice information related to a service request, the processing engine 112 may generate the service request based on the target recognition result and perform operations related to the service request, such as generating the service request, searching for a service provider related to the service request, sending the service request to the service provider, and so on. In some embodiments, the processing engine 112 may include one or more processing engines (e.g., a single-core processing engine or a multi-core processor). By way of example only, the processing engine 112 may include a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), an Application-Specific Instruction-set Processor (ASIP), a Graphics Processing Unit (GPU), a Physics Processing Unit (PPU), a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a microcontroller unit, a Reduced Instruction Set Computer (RISC), a microprocessor, and the like, or any combination thereof.

The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components of the speech recognition system 100 (e.g., the server 110, the user terminal 130, and/or the storage 140) may transmit information and/or data to other components of the speech recognition system 100 via the network 120. For example, the server 110 may obtain voice information from the user terminal 130 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or a combination thereof. By way of example only, the network 120 may include a cable network, a fiber-optic network, a telecommunications network, an intranet, the Internet, a Local Area Network (LAN), a Wide Area Network (WAN), a Wireless Local Area Network (WLAN), a Metropolitan Area Network (MAN), a Public Switched Telephone Network (PSTN), a Bluetooth network, a ZigBee network, a Near Field Communication (NFC) network, a Global System for Mobile communications (GSM) network, a Code Division Multiple Access (CDMA) network, a Time Division Multiple Access (TDMA) network, a General Packet Radio Service (GPRS) network, an Enhanced Data rates for GSM Evolution (EDGE) network, a Wideband Code Division Multiple Access (WCDMA) network, a High Speed Downlink Packet Access (HSDPA) network, a Long Term Evolution (LTE) network, a User Datagram Protocol (UDP) network, a Transmission Control Protocol/Internet Protocol (TCP/IP) network, a Short Message Service (SMS) network, a Wireless Application Protocol (WAP) network, an Ultra-WideBand (UWB) network, an infrared network, and the like, or any combination thereof. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points, such as base stations and/or Internet exchange points 120-1, 120-2, etc., through which one or more components of the speech recognition system 100 may connect to the network 120 to exchange data and/or information.

The user terminal 130 may be associated with a user. In some embodiments, the user terminal 130 may obtain voice information from the user. The user terminal 130 may send the voice information to the server 110 (e.g., the processing engine 112). In some embodiments, the user terminal 130 may perform one or more functions of the processing engine 112 described above, such as the generation of candidate recognition results, the determination of a target recognition result, or the like. In some embodiments, the user terminal 130 may perform operations related to the voice information, such as playing music, dialing numbers, determining a navigation route from a departure location to a destination, generating a service request, and so forth. In some embodiments, the user terminal 130 may include a mobile device 130-1, a tablet computer 130-2, a laptop computer 130-3, a desktop computer 130-4, and the like, or any combination thereof. In some embodiments, the mobile device 130-1 may include a smart home device, a wearable device, a smart mobile device, a virtual reality device, an augmented reality device, and the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device for a smart electrical device, a smart monitoring device, a smart television, a smart camera, an interphone, and the like, or any combination thereof. In some embodiments, the wearable device may include a smart bracelet, smart footwear, smart glasses, a smart helmet, a smart watch, smart clothing, a smart backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the smart mobile device may include a smartphone, a Personal Digital Assistant (PDA), a gaming device, a navigation device, a Point Of Sale (POS) device, and the like, or any combination thereof. In some embodiments, the virtual reality device and/or the augmented reality device may include a virtual reality helmet, virtual reality glasses, virtual reality eyewear, an augmented reality helmet, augmented reality glasses, augmented reality eyewear, and the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include Google Glass, Oculus Rift, HoloLens, Gear VR, and the like. In some embodiments, a built-in device in a motor vehicle may include a vehicle-mounted computer, a vehicle-mounted television, and the like. In some embodiments, the user terminal 130 may be a wireless device having a positioning technology for locating the user and/or the user terminal 130.

The storage 140 may store data and/or instructions. In some embodiments, the storage 140 may store data obtained/acquired from the user terminal 130. In some embodiments, the storage 140 may store data and/or instructions that the server 110 may execute or use to perform the exemplary methods described herein. For example, the storage 140 may store a recognition model for recognizing speech information. As another example, the storage 140 may store one or more databases, such as a sample key word database (also referred to as a POI database when used in a transportation service), a heat information database, a preference database, a travel pattern database, and the like, or a combination thereof. In some embodiments, the storage 140 may include mass storage, removable storage, volatile read-write memory, Read-Only Memory (ROM), and the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a compact disk, a magnetic tape, etc. Exemplary volatile read-write memory may include Random Access Memory (RAM). Exemplary RAM may include Dynamic RAM (DRAM), Double Data Rate Synchronous Dynamic RAM (DDR SDRAM), Static RAM (SRAM), Thyristor RAM (T-RAM), Zero-capacitor RAM (Z-RAM), etc. Exemplary ROM may include Programmable ROM (PROM), Erasable Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), Compact Disk ROM (CD-ROM), Digital Versatile Disk ROM, etc. In some embodiments, the storage 140 may be implemented on a cloud platform. By way of example only, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, and the like, or any combination thereof.

In some embodiments, the storage 140 may be connected to the network 120 to communicate with one or more components of the speech recognition system 100 (e.g., the server 110, the user terminal 130). One or more components of the speech recognition system 100 may access data or instructions stored in the storage 140 via the network 120. In some embodiments, the storage 140 may be directly connected to or communicate with one or more components of the speech recognition system 100 (e.g., the server 110, the user terminal 130). In some embodiments, the storage 140 may be part of the server 110.

In some embodiments, one or more components in the speech recognition system 100 (e.g., the server 110, the user terminal 130, etc.) may have permission to access the storage 140. In some embodiments, one or more components in the speech recognition system 100 may read and/or modify information related to the user when one or more conditions are satisfied. For example, the server 110 may obtain information including sample key words, heat information, preference information associated with the user of the user terminal 130, statistical data related to at least one travel pattern (also referred to as travel pattern information), and the like or a combination thereof from the storage 140.

One of ordinary skill in the art will appreciate that when an element of the speech recognition system 100 performs, the element may perform through electrical and/or electromagnetic signals. For example, when the user terminal 130 processes a task such as inputting voice data, or recognizing or selecting an object, the user terminal 130 may operate logic circuits in its processor to perform such a task. When the user terminal 130 transmits voice information to the server 110, the processor of the user terminal 130 may generate an electrical signal encoding the voice information. The processor of the user terminal 130 may then send the electrical signal to an output port. If the user terminal 130 communicates with the server 110 via a wired network, the output port may be physically connected to a cable that further transmits the electrical signal to an input port of the server 110. If the user terminal 130 communicates with the server 110 via a wireless network, the output port of the user terminal 130 may be one or more antennas that convert the electrical signal to an electromagnetic signal. Within an electronic device, such as the user terminal 130 and/or the server 110, when its processor processes instructions, sends out instructions, and/or performs operations, the instructions and/or operations may be conducted via electrical signals. For example, when the processor retrieves or stores data from a storage medium, it may transmit electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium. The structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device. Here, an electrical signal may refer to one electrical signal, a series of electrical signals, and/or at least two discrete electrical signals.

FIG. 2 is a schematic diagram of exemplary hardware and/or software components of a computing device shown in accordance with some embodiments of the present application. In some embodiments, server 110 and/or user terminal 130 may be implemented on computing device 200 shown in FIG. 2. For example, the processing engine 112 may be implemented on the computing device 200 and used to perform the functions of the processing engine 112 disclosed herein.

The computing device 200 may be used to implement any component of the speech recognition system 100 described herein. For example, the processing engine 112 may be implemented on the computing device 200 via its hardware, software programs, firmware, or a combination thereof. Although only one such computer is shown for convenience, the computer functions relating to the speech recognition service described herein may be implemented in a distributed manner on at least two similar platforms to distribute the processing load.

For example, the computing device 200 may include a communication port 250 connected to a network to facilitate data communication. The computing device 200 may further include a processor 220 in the form of one or more processors (e.g., logic circuits) for executing program instructions. For example, the processor 220 may include interface circuits and processing circuits therein. The interface circuits may be configured to receive electronic signals from the bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process. The processing circuits may conduct logic calculations and then determine conclusions, results, and/or instructions encoded as electronic signals. The interface circuits may then send out the electronic signals from the processing circuits via the bus 210.

The computing device 200 may further include different forms of program storage and data storage, including, for example, a disk 270, a Read-Only Memory (ROM) 230, or a Random Access Memory (RAM) 240, for storing various data files to be processed and/or transmitted by the computing device. The program instructions to be executed by the processor 220 may be stored in the ROM 230, the RAM 240, and/or another type of non-transitory storage medium. The methods and/or processes of the present application may be implemented as such program instructions. The computing device 200 may further include an input/output component 260 supporting input/output between the computing device and other components. The computing device 200 may also receive programs and data via network communications.

For illustration only, only one processor is shown in FIG. 2. At least two processors 220 are also contemplated; thus, operations and/or method steps described as performed by one processor 220 may also be performed by at least two processors, jointly or separately. For example, if in the present application the processor 220 of the computing device 200 performs both step A and step B, it should be understood that step A and step B may also be performed jointly or separately by two different processors 220 of the computing device 200 (e.g., a first processor performs step A and a second processor performs step B, or the first and second processors perform steps A and B together).

FIG. 3 is a schematic diagram of exemplary hardware and/or software components of a terminal device shown in accordance with some embodiments of the present application. In some embodiments, the user terminal 130 may be implemented on the terminal device 300 shown in FIG. 3. The terminal device 300 may be a mobile device, such as a mobile phone of a passenger or a driver, or a built-in device on a vehicle driven by the driver. As shown in FIG. 3, the terminal device 300 may include a communication platform 310, a display 320, a Graphics Processing Unit (GPU) 330, a Central Processing Unit (CPU) 340, an input/output (I/O) 350, a memory 360, and a storage 390. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown), may also be included in the terminal device 300.

In some embodiments, a mobile operating system 370 (e.g., iOS™, Android™, Windows Phone™, etc.) and one or more applications 380 may be loaded from the storage 390 into the memory 360 for execution by the CPU 340. In some embodiments, the terminal device 300 may include a microphone 315 or the like for acquiring voice information. The microphone 315 may continuously acquire voice information while the terminal device 300 is operating or while a voice-related application 380 is running. For example, the voice-related applications 380 may include an online transportation service application (e.g., DiDi), an e-commerce application (e.g., Taobao, Amazon), a voice-controlled application (e.g., Siri™), etc.; the microphone 315 may continuously acquire voice information when the user opens the voice-related application 380. In some embodiments, the terminal device 300 may include a record button such that when the user presses and holds the record button, the microphone 315 may begin acquiring voice information. The microphone 315 may continuously capture voice information until the user releases the button or a preset length of recording time is reached. As another example, the voice-related application 380 may provide a recording icon on a Graphical User Interface (GUI) of the terminal device 300 via the display 320, so that the microphone 315 starts to acquire voice information when the user touches the recording icon. In some embodiments, the CPU 340 may retrieve data from the storage 390 and recognize the voice information to determine a target recognition result based on the retrieved data. Alternatively or additionally, the terminal device 300 may send the voice information to the server 110 or the processing engine 112 to be recognized. In some embodiments, the target recognition result may be displayed on the GUI of the terminal device 300 via the display 320. In some embodiments, in addition to the target recognition result, the candidate recognition results may also be displayed on the display 320 in descending order of their update scores. In some embodiments, the user may confirm and/or modify the target recognition result or a service request related to the target recognition result. User interactions may be received via the I/O 350 and provided to the server 110 and/or other components of the speech recognition system 100 via the network 120. The terminal device 300 may transmit/receive data related to the voice information via the communication platform 310. For example, the terminal device 300 may transmit the voice information to the server 110 via the communication platform 310.

FIG. 4 is a block diagram of an exemplary speech recognition device shown in accordance with some embodiments of the present application. The speech recognition device 400 may be in communication with a storage medium (e.g., the storage 140 of the speech recognition system 100 and/or the storage 390 of the terminal device 300) and may execute instructions stored in the storage medium. In some embodiments, the speech recognition device 400 may include an information acquisition module 410, an information extraction module 420, and a result determination module 430.

The information acquisition module 410 may be configured to acquire data/information related to speech recognition. For example, the information acquisition module 410 may acquire voice information from a user terminal (e.g., the user terminal 130 or a microphone thereof). The user terminal may obtain the voice information uttered by the current user of the user terminal. The information acquisition module 410 may further obtain information associated with the user terminal, such as location information of the user terminal when the user terminal obtains the voice information, a user identifier (e.g., a user account name) associated with the user, and the like, or a combination thereof. As another example, the information acquisition module 410 may obtain at least two candidate recognition results and at least two preliminary scores of the voice information.

The information extraction module 420 may be configured to extract one or more key terms from each candidate recognition result. The information extraction module 420 may extract one or more key words based on predetermined rules. For example, when voice information is used for a transportation service, the information extraction module 420 may extract content following a destination indicator (also referred to as destination information) as a key vocabulary of a destination and extract content following a departure position indicator (also referred to as departure position information) as a key vocabulary of a departure place.

The result determination module 430 may be configured to determine a target recognition result of the voice information. For example, the result determination module 430 may revise the preliminary score corresponding to each of the at least two candidate recognition results based on the one or more extracted key words, and determine the target recognition result of the speech information based on a result of the revising. In some embodiments, the result determination module 430 may include an update coefficient determination sub-module and an information modification sub-module. The update coefficient determination sub-module may be configured to determine the update coefficient for each of the at least two candidate recognition results having the one or more extracted key words based on a similarity (also referred to as a "degree of match") between the one or more extracted key words and each of the at least two sample key words in the preset sample database. For example, the update coefficient determination sub-module may select one or more sample key words from the at least two sample key words as target sample key words (also referred to as "selected sample key words"), wherein the similarity between the one or more extracted key words and the one or more target sample key words is greater than a similarity threshold. The update coefficient determination sub-module may determine the update coefficient based on at least one parameter associated with the one or more target sample key words, the at least one parameter including a retrieval parameter, a heat parameter, a preference parameter, a distance parameter, or the like, or a combination thereof. In some embodiments, the update coefficient determination sub-module may include a similarity determination unit, a retrieval parameter determination unit, a preference parameter determination unit, a heat parameter determination unit, and an update coefficient determination unit. The similarity determination unit may be configured to determine the similarity between the one or more extracted key words and the at least two sample key words. The retrieval parameter determination unit may be configured to convert the similarity into the retrieval parameter according to the third conversion relation between the similarity and the retrieval parameter. The preference parameter determination unit may be configured to convert the similarity into the preference parameter according to the fourth conversion relation between the similarity and the preference parameter. The heat parameter determination unit may be configured to determine the heat parameter based on the similarity, the heat information of the at least two sample key words, and the first conversion relation between the heat information and the heat parameter. The update coefficient determination unit may be configured to determine the update coefficient of each of the at least two candidate recognition results having the one or more extracted key words by, for example, adding the retrieval parameter to, or multiplying it by, the higher value of the preference parameter and the heat parameter. The information modification sub-module may be configured to update the preliminary score corresponding to each of the at least two candidate recognition results based on the update coefficient to generate an updated score corresponding to each of the at least two candidate recognition results.
For example, the information modification submodule may update the preliminary score by multiplying the update coefficient by the preliminary score. In some embodiments, each update coefficient corresponding to a candidate recognition result may be normalized, i.e., converted to a number between 0 and 1. The information modification submodule may update the preliminary score by multiplying the normalized update coefficient by the preliminary score.
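For instance, one simple normalization, divide-by-maximum, could be sketched as follows; the raw coefficients and preliminary scores are invented for illustration:

```python
def normalize(coefficients):
    # One normalization among several possibilities: divide every raw
    # update coefficient by the maximum, mapping them into (0, 1].
    peak = max(coefficients)
    return [c / peak for c in coefficients] if peak else list(coefficients)

raw_coefficients = [1.98, 1.10, 0.65]
preliminary_scores = [0.50, 0.45, 0.52]
updated = [s * c for s, c in zip(preliminary_scores, normalize(raw_coefficients))]
print(updated)  # approximately [0.5, 0.25, 0.171]
```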

In some embodiments, the speech recognition device 400 may further include a preliminary score determination module, an extraction module, a first parameter assignment module, a second parameter assignment module, a modification module, and an output module (not shown in the figures). Some or all of these modules may be integrated as sub-modules into the result determination module 430.

The preliminary score determination module may be configured to receive and analyze the speech information to generate at least two candidate recognition results of the speech information and at least two preliminary scores, wherein each of the at least two preliminary scores corresponds to one of the at least two candidate recognition results. For example, the preliminary score determination module may recognize the speech information based on a recognition model (e.g., the recognition model 500) to generate the at least two candidate recognition results and the corresponding preliminary scores.

The first parameter assignment module may be configured to search a database for one or more points of interest (POIs) matching each of the at least one location, and determine a first parameter (e.g., a retrieval parameter) for each of the at least two candidate recognition results based on a result of the matching between the searched one or more POIs and each of the at least one location. For example, when a POI matching the at least one location is found in the database, the first parameter assignment module may determine the first parameter of the recognition result to be 1. When no POI matching the at least one location is found in the database, the first parameter assignment module may determine a degree of match between each of the one or more POIs in the database and the at least one location. The first parameter assignment module may determine the first parameter of the recognition result to be 0 when the degree of match between each of the one or more POIs and the at least one location is less than or equal to a first degree-of-match threshold; and when the degree of match between each of the one or more POIs and the at least one location is greater than the first degree-of-match threshold, the first parameter assignment module may determine the first parameter of the recognition result based on the degree of match, wherein the first parameter of the recognition result may be positively correlated with (e.g., proportional to) the degree of match.

The second parameter assignment module may be configured to determine a location type for each of the at least one location in each of the at least two candidate recognition results, and determine a second parameter (e.g., a distance parameter) for each of the at least two candidate recognition results based on the location type. In some embodiments, the second parameter assignment module may further include a departure location determination sub-module, a destination determination sub-module, a distance determination sub-module, and a probability determination sub-module. The departure location determination sub-module may be configured to determine the departure location based on the at least one location. The destination determination sub-module may be configured to determine the destination based on the at least one location. The distance determination sub-module may be configured to determine distance information (e.g., a road distance from the departure location to the destination) for each of the at least two candidate recognition results. The probability determination sub-module may be configured to determine a usage probability of each of at least one travel pattern based on the number of trips corresponding to each of the at least one travel pattern and the total number of trips in the statistical time period. The usage probability may be determined as, or converted to, the second parameter.

The modification module may be configured to determine an update score corresponding to each of the at least two candidate recognition results based on the first parameter, the second parameter, and the preliminary score.

The output module may be configured to determine a highest update score of the at least two update scores corresponding to the at least two candidate recognition results, and output the recognition result corresponding to the highest update score.

In some embodiments, the speech recognition device 400 may further include an association module. The association module may be configured to correlate a description (e.g., a name or an address) relating to each POI with the location corresponding to the POI, and store the correlation in a database. For example, a key word extracted from the candidate recognition results may be a description used by the user in relation to a POI, and may be the same as or different from the sample key words. The association module may store the correlation between the description of each POI used by the user and the location corresponding to the POI to update the database.

It should be noted that the above description is provided for illustrative purposes only, and is not intended to limit the scope of the present application. Many variations and modifications will be apparent to those of ordinary skill in the art in light of the teachings herein. However, such variations and modifications do not depart from the scope of the present application. The modules, sub-modules, or units described above may be connected to or communicate with each other via wired or wireless connections. In some embodiments, two or more modules/sub-modules/units may be combined into a single module/sub-module/unit, and any one module/sub-module/unit may be divided into two or more modules/sub-modules/units.

FIG. 5 is a schematic diagram of an exemplary process for speech recognition shown in accordance with some embodiments of the present application. In some embodiments, speech information 505 may be input to a recognition model 500. The recognition model 500 may be implemented by or included in the user terminal 130 and/or the processing engine 112. Based on the input speech information, the recognition model 500 may generate at least two candidate recognition results and corresponding preliminary scores 565 as output. Each preliminary score may correspond to one of the candidate recognition results. In some embodiments, a candidate recognition result may be textual information associated with a word, a phrase, a sentence, or a letter.

In some embodiments, the recognition model 500 may be stored in a storage device (e.g., the storage 140 of the speech recognition system 100 or the storage 390 of the terminal device 300). As shown in FIG. 5, the recognition model 500 may include a preprocessor 510, a feature extractor 520, an acoustic model 530, a decoder 540, a pronunciation model 550, and a language model 560.

The preprocessor 510 may preprocess the speech information 505. For example, the speech information 505 to be recognized may be pre-processed by the pre-processor 510 to be divided into at least two audio frames. In some embodiments, the pre-processing of the speech information 505 may further include noise filtering, enhancement, channel equalization, domain conversion, e.g., time-frequency domain conversion via Fourier Transform (FT), frequency-time domain conversion via Inverse Fourier Transform (IFT), etc., or any combination thereof.

The feature extractor 520 may extract acoustic feature information from the transformed (e.g., frequency-domain) audio frames.

The acoustic model 530 may determine pronunciation data corresponding to the audio signal based on the acoustic feature information. For example, the acoustic model 530 may be trained based on at least two sample utterances and corresponding sample acoustic feature information from an utterance database (e.g., utterance data stored in the storage device 140). The acoustic model 530 may use the acoustic feature information as input to map the acoustic feature information to the pronunciations corresponding to the audio frames. The acoustic model 530 may determine a first probability of mapping the audio frame to each pronunciation. In some embodiments, the pronunciation model 550 may determine at least two words or characters corresponding to the pronunciation and a second probability associated with the words or characters. In some embodiments, language model 560 may include correlations between different language units (e.g., words, characters, or phrases) and probabilities corresponding to the correlations. Language model 560 may estimate a third probability for various texts constructed based on the language units.

The decoder 540 may build a recognition network based on the acoustic model 530, the language model 560, and the pronunciation model 550. Each path in the recognition network (similar to a branch node in a neural network) may correspond to a text and/or a pronunciation related to the text. The decoder 540 may then determine a preliminary score for each path of the recognition network based on the pronunciations output by the acoustic model to obtain preliminary recognition results and corresponding preliminary scores.
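As an illustration of how such a path might be scored, the sketch below combines the three kinds of probabilities in log space. The log-linear combination with a language-model weight is a conventional decoding choice, assumed here rather than specified by the application:

```python
import math

def path_score(acoustic_probs, pronunciation_probs, lm_prob, lm_weight=1.0):
    # Sum log probabilities along one path of the recognition network:
    # per-frame acoustic (first) probabilities, per-word pronunciation
    # (second) probabilities, and a weighted language-model (third)
    # probability. Higher (less negative) is better.
    score = sum(math.log(p) for p in acoustic_probs)
    score += sum(math.log(p) for p in pronunciation_probs)
    score += lm_weight * math.log(lm_prob)
    return score

# The same sounds with two textual interpretations; the language model
# separates the plausible text from the implausible one.
print(path_score([0.9, 0.8], [0.7], lm_prob=0.05))   # about -3.68
print(path_score([0.9, 0.8], [0.7], lm_prob=0.001))  # about -7.59
```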

In some embodiments, processing engine 112 or terminal device 300 may determine at least two candidate recognition results and corresponding preliminary scores 565 based on the preliminary recognition results and corresponding preliminary scores. For example, the processing engine 112 or the user terminal 130 may select at least two preliminary recognition results having relatively high preliminary scores as candidate recognition results from all the preliminary recognition results. For example only, a preliminary recognition result having a preliminary score above a predetermined score threshold may be determined as a candidate recognition result. For another example, the preliminary recognition results corresponding to the top N scores may be determined as candidate recognition results, and N may be a natural number greater than 1, such as 5, 10, 20, or the like. In some embodiments, all of the preliminary recognition results may be determined as candidate recognition results.
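
A minimal sketch of the candidate selection just described, covering the threshold variant, the top-N variant, and the keep-all variant; the function name and the (text, score) pair representation are assumptions:

```python
def select_candidates(results, score_threshold=None, top_n=None):
    """Select candidate recognition results from (text, preliminary_score)
    pairs: by a score threshold, by the top-N scores, or all of them."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    if score_threshold is not None:
        return [r for r in ranked if r[1] > score_threshold]
    if top_n is not None:
        return ranked[:top_n]
    return ranked  # all preliminary results become candidates

results = [("go to a digital valley", 0.8), ("go to a digital value", 0.5),
           ("going to dig a valley", 0.2)]
print(select_candidates(results, top_n=2))
```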

In some embodiments, a target recognition result corresponding to the speech information may be determined from the candidate recognition results. For example, the processing engine 112 or the user terminal 130 may determine the candidate recognition result corresponding to the highest preliminary score as the target recognition result. As another example, the processing engine 112 or the user terminal 130 may further update the preliminary scores corresponding to the candidate recognition results based on update coefficients to generate update scores, and determine the target recognition result based on the update scores. A detailed description of determining a target recognition result based on the candidate recognition results may be found elsewhere in the present application, such as in fig. 6 and its description.

It should be noted that the above description is provided for illustrative purposes only, and is not intended to limit the scope of the present application. Many variations and modifications may be made to the teachings of the present application by those of ordinary skill in the art. However, such changes and modifications do not depart from the scope of the present application. For example, the preprocessor 510 and/or feature extractor 520 may be omitted from the recognition model 500. As another example, the recognition model 500 may be located outside of the speech recognition system 100. More particularly, the recognition model 500 external to the speech recognition system 100 may recognize speech information to produce candidate recognition results and corresponding preliminary scores, and the speech recognition system 100 (e.g., server 110, processing engine 112, user terminal 130) may directly retrieve and process the candidate recognition results and corresponding preliminary scores.

FIG. 6 is a flow diagram illustrating an exemplary process for determining a target recognition result for speech information according to some embodiments of the present application. The process 600 may be performed by the speech recognition system 100. For example, the process 600 may be implemented as a set of instructions (e.g., an application) stored in a memory (e.g., the storage 140 of the speech recognition system 100 in fig. 1, the memory 390 of the terminal device 300 in fig. 3). The modules of the speech recognition device 400 in fig. 4 may execute the set of instructions and, when executing the instructions, the modules may be configured to perform the process 600. In some embodiments, at least a portion of the speech recognition device 400 may be implemented on the processing engine 112 and/or the terminal device 300. The operations of the illustrated process 600 presented below are intended to be illustrative. In some embodiments, process 600 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of process 600 are illustrated in FIG. 6 and described below is not intended to be limiting.

In 610, the speech recognition device 400 (e.g., the information acquisition module 410) may acquire voice information from a user terminal (e.g., the user terminal 130 or a microphone thereof, the terminal device 300 or a microphone thereof). The user terminal may obtain the voice information spoken by a user of the user terminal. In some embodiments, the speech recognition device 400 may further obtain information associated with the user terminal, such as location information of the user terminal at the time the user terminal obtained the voice information, a user identification (e.g., a user account name) associated with the user, or the like, or a combination thereof.

In 620, the speech recognition device 400 (e.g., the result determination module 430) may determine at least two candidate recognition results of the voice information and at least two preliminary scores corresponding to the at least two candidate recognition results. In some embodiments, the voice information may be recognized by a recognition model (e.g., the recognition model 500) to generate the at least two candidate recognition results based on a speech recognition method. The speech recognition method may include, but is not limited to, a feature parameter matching algorithm, a Hidden Markov Model (HMM) algorithm, an Artificial Neural Network (ANN) algorithm, and the like.

In some embodiments, the determination of candidate recognition results and corresponding preliminary scores may be performed by the processing engine 112 and/or the user terminal 130. The candidate recognition results and their corresponding scores, e.g., (candidate recognition result 1, preliminary score 1), (candidate recognition result 2, preliminary score 2), etc., may be determined in pairs. The preliminary score may be any number, such as 10, 30, 500, etc., or fall within a range of 0-1, such as 0.3, 0.5, 0.8, etc.

In 630, the speech recognition device 400 (e.g., the information extraction module 420) may extract one or more key words from each of the at least two candidate recognition results based on predetermined key word extraction rules.

When the speech recognition device 400 is used for a car-hailing service or a navigation service, the extracted one or more key words may include points of interest (POIs), street names, etc. When the speech recognition device 400 is used for an e-commerce service, the extracted one or more key words may include a merchant name, a commodity name, a price, etc. The type of key words to be extracted may depend on the software or App installed on the speech recognition device 400 through which the voice information is input. For example, if the voice information is input to a car-hailing App or a navigation service App, key words such as POIs, street names, and the like may be extracted. If the voice information is entered into an e-commerce App, key words such as merchant names, commodity names, prices, etc. may be extracted.

In some embodiments, the candidate recognition result may be in text form, and the term "candidate recognition result" and the term "recognition text" may be used interchangeably. In some embodiments, the candidate recognition result may be a sentence including a subject, a predicate, an object, an adverbial, and the like. Sometimes, the subject and adverbial may be omitted. For example, a candidate recognition result may be: "I want to go to a digital valley", "go to a digital valley", or "I want to go to a digital valley from the west kingdom at 3 pm today", etc.

The predetermined key vocabulary extraction rule may be a predetermined rule for extracting a key vocabulary from the candidate recognition result. There may be at least two extraction rules and the following exemplary description is associated with an extraction rule based on a structure template. In some embodiments, the structure template may be determined based on historical candidate recognition results or set manually by a user. In some embodiments, the content in the candidate recognition results that matches the structure template may be determined as the key vocabulary.

Taking a traffic scenario as an example, the structure template related to the destination may be {destination indicator (also referred to as destination information) + POI (or location)}. The destination indicator may include text such as "i want to go", "destination is", "go", and so on. The content following the destination indicator may be extracted as a key vocabulary of the destination. As another example, the structure template related to the departure location may be {departure location indicator (also referred to as departure location information) + POI (or location)}. The departure location indicator may include text such as "i am", "i am located", "from", and the like. The content following the departure location indicator may be extracted as a key vocabulary of the departure location. When the voice information is used for transportation services, the key vocabulary thus extracted may also be referred to as a "suspected POI". The destination indicator and the departure location indicator may also be referred to as location type indicators. By way of example only, for the candidate recognition result "i want to go from west ethnic gate to a digital valley", since the departure location indicator "from" immediately precedes "west ethnic gate", the "west ethnic gate" may be extracted as a suspected POI of the departure location. Similarly, since the destination indicator "go" precedes the "digital valley", the "digital valley" may be extracted as a suspected POI of the destination.
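
A toy sketch of the {indicator + POI} template matching described above, using hypothetical English indicator words in place of the translated examples; a real implementation would work on the original language and a fuller indicator list:

```python
import re

def extract_suspected_pois(text: str) -> dict:
    """Template-based extraction for the traffic scenario:
    {departure location indicator + POI} and {destination indicator + POI}.
    The indicator words "from" and "to" are hypothetical stand-ins."""
    text = text.lower().strip()
    pois = {}
    # Departure: content between "from" and the destination indicator.
    match = re.search(r"\bfrom\s+(.+?)(?=\s+to\s+|$)", text)
    if match:
        pois["departure"] = match.group(1)
    # Destination: content after a "to" that does not introduce a verb.
    match = re.search(r"\bto\s+(?!go\b)(.+?)$", text)
    if match:
        pois["destination"] = match.group(1)
    return pois

print(extract_suspected_pois(
    "i want to go from west ethnic gate to a digital valley"))
# {'departure': 'west ethnic gate', 'destination': 'a digital valley'}
```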

In some embodiments, if no key vocabulary satisfying a preset type is extracted from a candidate recognition result, the preliminary score of that candidate recognition result may be reduced, or the candidate recognition result may be deleted. In some embodiments, if no key vocabulary satisfying a preset type is extracted from any of the candidate recognition results, a prompt message may be sent to the user terminal 130 to inform the user that the provided voice information could not be recognized or is not sufficient to generate a service request or a voice command. The prompt message may further include a suggestion and/or instruction for the user to provide the voice information again. For example, the prompt message may be "Sorry, I did not recognize that. Please say it again."

In 640, the speech recognition device 400 (e.g., the information acquisition module 410) may acquire one or more databases associated with speech recognition. The one or more databases associated with speech recognition may be stored in a storage medium (e.g., the storage 140 of the speech recognition system 100 in fig. 1, the memory 390 of the terminal device 300 in fig. 3). In some embodiments, the one or more databases associated with speech recognition may include a sample key vocabulary database, a heat database, a preference database, a travel pattern database, or the like, or any combination thereof. The sample key vocabulary database may include at least two sample key vocabularies used in different scenarios, such as POIs, street names, merchant names, commodity names, food names, common voice commands, App names, and the like. The heat database (also referred to as the popularity database) may include heat information (popularity) corresponding to each of the at least two sample key vocabularies as used by at least two users. For example, the heat may include the number of uses (e.g., a total number of uses or a frequency of uses) and/or the probability of using each of the at least two sample key vocabularies as an input for an application related to the voice information to be recognized. In some embodiments, each of the at least two sample key vocabularies may correspond to at least two heats associated with at least two periodic statistical time periods and/or at least two geographic regions. A periodic statistical time period may be one week, one month, or one season (spring, summer, fall, or winter). The periodic statistical time periods may also include peak periods, such as morning and evening commute periods (e.g., 8:00-9:30 am, 5:00-6:30 pm), and off-peak periods. The periodic statistical time periods may also include weekdays, weekends, holidays, and the like. A geographic region may be a block, a street, a city, a town, a county, a province, a country, a continent, and so on.

The preference database may include preference information (e.g., preferences) corresponding to each of the at least two sample key vocabularies as used by the user of the user terminal. The user of the user terminal in 610 may be identified by the user identification obtained from the user terminal 130. For example, the preferences may include historical information associated with the user, such as whether the user has used the sample key vocabulary before, the number of times the user has used it in the past, and/or the probability of using the sample key vocabulary. In some embodiments, each of the at least two sample key vocabularies may correspond to at least two degrees of preference with respect to at least two periodic statistical time periods and/or at least two geographic regions. In some embodiments, the preference information may be included in the heat information in the heat database. For example, the heat database may be searched to generate the user's preference information regarding a sample key vocabulary.

The travel pattern database may include travel pattern information related to various distance information. The travel pattern information may include the number of uses or the probability of using each of at least two travel modes corresponding to various distance information (e.g., different road distances). For example, the travel modes may include walking, cycling, driving, taking a taxi, taking a bus, taking a train, taking an airplane, and the like. For example only, the travel pattern database may include probability distribution data related to different distance information corresponding to each of the at least two travel modes. In some embodiments, the probability distribution data may be depicted as at least two probability curves corresponding to the at least two travel modes. Each probability curve may exhibit the probability trend of using the corresponding travel mode over different road distances. For example, in the probability curve corresponding to taking a taxi, the probability may be relatively low when the road distance is less than 1 kilometer, and gradually increase to a relatively high value as the road distance increases from 1 kilometer to 20 kilometers. The probability may drop sharply as the road distance increases from 20 kilometers to 200 kilometers.
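
One way such probability distribution data might be stored and queried, assuming (purely for illustration) that each curve is kept as sampled (road distance, probability) points and linearly interpolated; the numbers loosely follow the taxi example above and are not from the disclosure:

```python
import numpy as np

# Hypothetical sampled curves: road distance (km) -> probability of use.
TRAVEL_CURVES = {
    "taxi":    ([0.5, 1.0, 5.0, 20.0, 200.0], [0.05, 0.10, 0.60, 0.80, 0.05]),
    "walking": ([0.5, 1.0, 5.0, 20.0, 200.0], [0.80, 0.50, 0.05, 0.00, 0.00]),
}

def travel_mode_probability(mode: str, road_distance_km: float) -> float:
    """Look up the probability of using a travel mode for a given road
    distance by linear interpolation over the stored probability curve."""
    distances, probabilities = TRAVEL_CURVES[mode]
    return float(np.interp(road_distance_km, distances, probabilities))

print(travel_mode_probability("taxi", 10.0))  # rises between 5 km and 20 km
```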

In some embodiments, two or more of the databases may be integrated into one database. For example, the preference database may be integrated into the heat database. As another example, the heat database and the preference database may be integrated into the sample key vocabulary database.

In 650, the speech recognition device 400 (e.g., the result determination module 430) may determine an update coefficient corresponding to each of the at least two candidate recognition results based on the one or more extracted key vocabularies and the one or more databases associated with speech recognition. The update coefficient may be determined based on the one or more extracted key vocabularies and at least one parameter determined from historical data. The at least one parameter may include a retrieval parameter, a heat parameter, a preference parameter, a distance parameter, or the like, or a combination thereof.

In some embodiments, the speech recognition device 400 may determine a degree of match (also referred to as "similarity") between an extracted key vocabulary and each of the at least two sample key vocabularies, and determine one or more target sample key vocabularies from the at least two sample key vocabularies. The degree of match between each of the one or more target sample key vocabularies and the extracted key vocabulary may be above a first match degree threshold. The speech recognition device 400 may determine the at least one parameter based on the one or more target sample key vocabularies. For example, the search parameter may be determined based on the degree of match between the one or more target sample key vocabularies and the one or more extracted key vocabularies. The heat parameter may be determined based on heat information associated with the one or more target sample key vocabularies. The preference parameter may be determined based on preference information associated with the one or more target sample key vocabularies. The distance parameter may be determined based on travel pattern information associated with the one or more target sample key vocabularies. In some embodiments, the term "search parameter" may also be referred to as a first parameter, and the term "distance parameter" may also be referred to as a second parameter. Details regarding determining the at least one parameter may be found elsewhere in the present disclosure, such as in fig. 7 and the description thereof.

In some embodiments, the speech recognition device 400 may determine the update coefficient based on an average or weighted average, a sum or weighted sum, a product, or a combination thereof of the at least one parameter. Other methods of determining the update coefficient based on the at least one parameter may also be used and are within the scope of the present application. For example only, the speech recognition device 400 may take the higher value of the heat parameter and the preference parameter, and determine the update coefficient by adding that value to the retrieval parameter. As another example, the speech recognition device 400 may determine the update coefficient by multiplying the retrieval parameter by the distance parameter. In some embodiments, the update coefficients corresponding to the candidate recognition results may be normalized, i.e., converted to numbers between 0 and 1, by dividing each update coefficient by the highest of the update coefficients. For example, the three update coefficients 20, 40, and 50 may be normalized to 0.4 (20/50), 0.8 (40/50), and 1.0 (50/50), respectively.
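
A sketch of one combination variant named above (retrieval parameter plus the higher of the heat and preference parameters) together with the normalization by the highest coefficient; the function names are assumptions:

```python
def update_coefficient(retrieval: float, heat: float, preference: float) -> float:
    """One variant from the text: add the higher of the heat and
    preference parameters to the retrieval parameter."""
    return retrieval + max(heat, preference)

def normalize(coefficients):
    """Normalize update coefficients by dividing each by the highest one."""
    highest = max(coefficients)
    return [c / highest for c in coefficients]

print(normalize([20, 40, 50]))  # [0.4, 0.8, 1.0], as in the example above
```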

In 660, the speech recognition device 400 (e.g., the result determination module 430) may update the preliminary score corresponding to each of the at least two candidate recognition results based on the update coefficient to generate an update score corresponding to each of the at least two candidate recognition results. For example, the speech recognition device 400 may update the preliminary score by multiplying the update coefficient by the preliminary score. The update score of a candidate recognition result may be represented as y{x, v(k), w[dist(a, b), D]}, where x may be the preliminary score, v may be a function for determining the retrieval parameter, k may represent the degree of match of a target sample key vocabulary or the average degree of match of at least two target sample key vocabularies, a may represent the departure location, b may represent the destination, dist may represent a function for determining the road distance between two locations, D may represent the probability distribution data related to different road distances, w may be a function for determining the distance parameter, and y may be a function for determining the update score based on the retrieval parameter and the distance parameter. In some embodiments, the speech recognition device 400 may directly update the preliminary score using the at least one parameter. For example, the preliminary score corresponding to each candidate recognition result may be updated using the retrieval parameter to generate an updated preliminary score. The updated preliminary score may be further updated using the distance parameter to generate the update score. Other methods of updating the preliminary score may also be used and are within the scope of the present application.
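
Since the combining function y is left open by the text, the following shows only one plausible multiplicative instance, assuming the retrieval and distance parameters have already been computed:

```python
def update_score(x: float, retrieval_param: float, distance_param: float) -> float:
    """A multiplicative instance of y{x, v(k), w[dist(a, b), D]}: the
    preliminary score x is scaled by the retrieval parameter v(k) and the
    distance parameter w[dist(a, b), D]. The multiplicative form is an
    assumption; averages and weighted sums are equally possible."""
    return x * retrieval_param * distance_param
```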

In 670, the speech recognition device 400 (e.g., the result determination module 430) may determine a target recognition result for the speech information based on the update score. In some embodiments, the speech recognition device 400 may sort the candidate recognition results in descending order according to the corresponding update scores. For example, the candidate recognition result corresponding to the highest score may be determined as the target recognition result. The target recognition result may be sent to the user terminal and/or the processing engine 112. In some embodiments, information related to the target recognition result may also be sent to the user terminal and/or the processing engine 112. For example, the information related to the target recognition result may include a target sample key vocabulary. The target sample key vocabulary may be used for subsequent operations, such as generating a service request. For example, target sample key words such as departure location and destination can be sent to the processing engine 112 to generate a transport service request.

In some embodiments, a candidate recognition result may correspond to at least two distance parameters corresponding to at least two travel modes. In that case, the candidate recognition result may have at least two update scores corresponding to the at least two travel modes. The speech recognition device 400 may compare all update scores corresponding to the at least two candidate recognition results and determine the candidate recognition result corresponding to the highest score as the target recognition result. In some embodiments, the travel mode corresponding to the target recognition result may be transmitted to the user terminal as a recommended travel mode. For example, for voice information associated with a transportation service, the speech recognition device 400 may generate a highest update score of 0.5 when the travel mode is bicycle and a highest update score of 0.8 when the travel mode is car. The speech recognition device 400 may determine the candidate recognition result having the update score of 0.8 as the target recognition result and recommend the car as the user's travel mode. If the user selects the bicycle as the travel mode, the candidate recognition result having the update score of 0.5 may be determined as the target recognition result.

In some embodiments, the target recognition result and the at least two candidate recognition results having relatively high update scores may be transmitted to the user terminal. For example, a relatively high update score may refer to a score above a score threshold, or the first three/five/ten scores, or the like. The user may confirm and/or correct the target recognition result through the user terminal 130. In some embodiments, a service request generated based on the target recognition result (e.g., by the server 110 or the processing engine 112) may also be sent to the user terminal 130. The user may confirm and/or amend the service request through the user terminal. In some embodiments, the confirmed service request may be communicated to a service provider, such as a driver.

Fig. 7 is a flow diagram of an exemplary process for determining update coefficients according to some embodiments of the present application. The process 700 may be performed by the speech recognition system 100. For example, the process 700 may be implemented as a set of instructions (e.g., an application) stored in a memory (e.g., the storage 140 of the speech recognition system 100 in fig. 1, the memory 390 of the terminal device 300 in fig. 3). Processing engine 112, terminal device 300, and/or the modules in fig. 4 may execute the set of instructions, and when executing the instructions, processing engine 112, terminal device 300, and/or the modules may be configured to perform process 700. The operations of the illustrated process 700 presented below are intended to be illustrative. In some embodiments, process 700 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of process 700 are illustrated in FIG. 7 and described below is not intended to be limiting.

In 710, the speech recognition device 400 (e.g., the information acquisition module 410) may acquire one or more extracted key vocabularies corresponding to each of the at least two candidate recognition results. In some embodiments, the one or more key vocabularies may be extracted based on predetermined key vocabulary extraction rules. For example, the content immediately following an indicator in a structure template may be extracted as a key vocabulary. For details on extracting key vocabularies, please refer to other parts of the application, such as the description of operation 630 in fig. 6. In some embodiments, an extracted key vocabulary may include characters, words, phrases, sentences, or the like.

At 720, the speech recognition device 400 (e.g., the information acquisition module 410) may retrieve at least two sample key vocabularies from one or more databases. In some embodiments, the one or more databases may include a sample key vocabulary database (also referred to as a POI database when used in a transportation service), a heat database, a preference database, a travel pattern database, or the like, or any combination thereof. Details regarding the one or more databases may be found in the description of operation 640.

In 730, the speech recognition device 400 (e.g., the result determination module 430) may determine a degree of match between each of the one or more extracted key terms and each of the at least two sample key terms. In some embodiments, the degree of match may be determined based on an edit distance algorithm. As used herein, the term "edit distance" between the first text and the second text may refer to the minimum number of editing operations required to convert the first text to the second text. A suitable editing operation may include replacing a character with another character, inserting a character or deleting a character, etc. The edit distance may be inversely proportional to a similarity between the first text and the second text. That is, the smaller the edit distance, the greater the similarity between the first text and the second text. The degree of match may be determined based on an edit distance between each of the extracted one or more key terms and each of the at least two sample key terms.

In some embodiments, the degree of match may be determined according to the match length. For example, the degree of match may be equal to the ratio of the match length to the total length of the sample key vocabulary. As used herein, the term "match length" refers to the number of words or characters in the extracted key vocabulary that are also present in the sample key vocabulary. The term "total length of the sample key vocabulary" refers to the total number of words or characters in the sample key vocabulary. For example only, the key vocabulary extracted from a candidate recognition result may be related to a location and may be referred to as a suspected POI. For the suspected POI "digital valley" (i.e., the extracted key vocabulary), if a sample POI "digital valley" (i.e., a sample key vocabulary) is found in the sample key vocabulary database (also referred to as the "POI database"), the speech recognition device 400 may determine that the suspected POI "digital valley" and the sample POI "digital valley" completely match, and the degree of match between the extracted key vocabulary and the sample key vocabulary may be 1. For the suspected POI "Zhongguancun Street", if the sample POI "Zhongguancun Street" is not found in the POI database, but the sample POI "No. 1, Zhongguancun Street" and other similar sample POIs exist in the POI database, the degree of match between the suspected POI "Zhongguancun Street" and the sample POI "No. 1, Zhongguancun Street" may be determined as 2/4 = 0.5 according to the ratio of the match length to the total length of the sample POI, where 2 is the number of completely matching words between the suspected POI "Zhongguancun Street" and the sample POI, and 4 is the total number of words of the sample POI "No. 1, Zhongguancun Street". It should be noted that other methods for determining the degree of match between each of the one or more extracted key vocabularies and each of the at least two sample key vocabularies may also be used, and are within the scope of the present application.
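
Both variants of the degree of match can be sketched directly; the word-level match-length function below reproduces the 2/4 = 0.5 example, and the edit distance is the standard Levenshtein dynamic program:

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions, and
    deletions needed to convert string a into string b (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete
                           cur[j - 1] + 1,               # insert
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

def match_degree(extracted: str, sample: str) -> float:
    """Ratio of the match length (words of the extracted key vocabulary
    also present in the sample) to the total length of the sample."""
    sample_words = sample.lower().split()
    matched = sum(w in sample_words for w in extracted.lower().split())
    return matched / len(sample_words)

print(match_degree("Zhongguancun Street", "No. 1, Zhongguancun Street"))  # 0.5
```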

In 740, the speech recognition device 400 (e.g., the result determination module 430) may determine one or more target sample key vocabularies from the at least two sample key vocabularies, wherein the degree of match between each of the one or more target sample key vocabularies and each of the one or more extracted key vocabularies is above a first match degree threshold. The first match degree threshold may be a predetermined value. For example, when the degree of match is determined based on the match length, the first match degree threshold may be a predetermined value between 0 and 1, such as 0.4, 0.5, and so on.

In 750, the speech recognition device 400 (e.g., the result determination module 430) may determine the search parameter based on one or more degrees of match of the one or more target sample key vocabularies. The degree of match may be converted into the search parameter based on a conversion relationship between the degree of match and the search parameter. For example, the search parameter may be positively correlated with or proportional to the degree of match. In some embodiments, the degree of match may be multiplied by an adjustment ratio to generate the search parameter. The adjustment ratio may be less than or greater than 1. As another example, the search parameter may be equal to the degree of match. In some embodiments, for a degree of match less than the first match degree threshold, the corresponding search parameter may be 0. Alternatively or additionally, the corresponding candidate recognition result may be deleted.

In some embodiments, at least two degrees of match corresponding to at least two target sample key vocabularies may be determined for each candidate recognition result. For example, the search parameter corresponding to each candidate recognition result may be determined based on the highest of the at least two degrees of match. Alternatively or additionally, the search parameter may be determined based on the average of the degrees of match that are greater than the first match degree threshold.

In 760, the speech recognition device 400 (e.g., the result determination module 430) may determine a heat parameter based on heat information of the one or more target sample key vocabularies. For example, the heat parameter may be determined as q* = h(q), where h may represent a function for converting heat information or a heat value into a heat parameter, q may represent the heat information or heat value of the one or more target sample key vocabularies, and q* may represent the heat parameter. More specifically, the heat information of the one or more target sample key vocabularies may be converted into the heat parameter based on a first conversion relationship between the heat information and the heat parameter. Alternatively, a heat value (also referred to as "heat") may be determined based on the heat information. The heat value may be converted into the heat parameter based on a second conversion relationship between the heat value and the heat parameter.

In some embodiments, each target sample key vocabulary may correspond to heat information for at least two periodic statistical time periods and/or at least two geographic regions. The speech recognition device 400 may determine the statistical time period containing the current time point and/or the current geographic region, and determine the heat parameter according to the heat value corresponding to that statistical time period and/or geographic region. The heat value may be determined based on the number of uses (e.g., a total number of uses or a frequency of uses) and/or the probability of using each of the at least two sample key vocabularies.

In some embodiments, at least two heat values corresponding to at least two target sample key vocabularies may be determined for each candidate recognition result. The heat parameter corresponding to each candidate recognition result may be determined based on the highest of the at least two heat values. Alternatively or additionally, the heat parameter may be determined based on the average of the at least two heat values.

In 770, the speech recognition device 400 (e.g., the result determination module 430) may determine a preference parameter based on preference information for one or more target sample key words associated with the user. The preference information may be converted into the preference parameter based on a third conversion relationship between the preference information and the preference parameter.

In some embodiments, each target sample key vocabulary may correspond to preference information for at least two periodic statistical time periods and/or at least two geographic regions. The speech recognition device 400 may determine the statistical time period containing the current time point and/or the current geographic region, and determine the preference parameter based on the preference information corresponding to that statistical time period and/or geographic region. If the preference information indicates whether the user providing the voice information has used the sample key vocabulary, the preference parameter may be determined based on the degree of match corresponding to the target sample key vocabulary. If the preference information includes a preference value associated with the number of uses (e.g., a total number of uses or a frequency of uses) and/or the probability that the user used each of the at least two sample key vocabularies, the preference parameter may be determined based on the preference value corresponding to the target sample key vocabulary. In some embodiments, for the same degree of match, the preference parameter converted according to the fourth conversion relationship between the degree of match and the preference parameter may be larger than the heat parameter determined based on the first conversion relationship between the heat information and the heat parameter.

In some embodiments, at least two degrees of preference corresponding to at least two target sample key vocabularies may be determined for each candidate recognition result. For example, the preference parameter may be determined based on the highest of the at least two degrees of match, or based on their average. As another example, the preference parameter may be determined based on the highest of the at least two preference values, or based on their average.

In 780, the speech recognition device 400 (e.g., the result determination module 430) may determine a distance parameter based on the one or more target sample key vocabularies. For example only, a user may input voice information into the speech recognition device 400 to request a transportation service. The one or more extracted key vocabularies may include at least one location, such as a street name, a store name, an address, a POI, and the like. The distance parameter may be determined based on distance information between the departure location and the destination. For example, the distance information may be the road distance between the departure location and the destination.

The speech recognition device 400 may determine the location type of each of the one or more extracted key vocabularies. The location type may include a departure location type and a destination type. For example, if the candidate recognition result includes a location type indicator associated with the departure location (i.e., a departure location indicator or departure location information) before the extracted key vocabulary, the speech recognition device 400 may determine the location type of the extracted key vocabulary as the departure location type. Similarly, if the candidate recognition result includes a location type indicator associated with the destination (i.e., a destination indicator or destination information) before the extracted key vocabulary, the speech recognition device 400 may determine the location type of the extracted key vocabulary as the destination type.

For example only, if an extracted key vocabulary of the departure location type completely matches a first POI of the one or more target sample POIs, the first location corresponding to the first POI may be determined as the departure location. If no target sample POI completely matches the extracted key vocabulary of the departure location type, at least two target sample POIs whose degrees of match are above a second match degree threshold may be selected from the one or more target sample POIs and determined as second POIs. The second match degree threshold may be higher than or equal to the first match degree threshold. The speech recognition device 400 may determine second locations corresponding to the second POIs, and may further determine an average location based on the second locations as the departure location. Similarly, if an extracted key vocabulary of the destination type completely matches a third POI of the one or more target sample POIs, the third location corresponding to the third POI may be determined as the destination. If no target sample POI completely matches the extracted key vocabulary of the destination type, at least two target sample POIs whose degrees of match are above a third match degree threshold may be selected from the one or more target sample POIs and determined as fourth POIs. The third match degree threshold may be higher than or equal to the first match degree threshold, and may be the same as or different from the second match degree threshold. The speech recognition device 400 may determine fourth locations corresponding to the fourth POIs, and may further determine a second average location based on the fourth locations as the destination.
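
A small sketch of the averaging step, assuming each matched POI carries (latitude, longitude) coordinates; plain arithmetic averaging is an assumption and is only sensible for nearby candidate locations:

```python
def average_location(locations):
    """Average the (latitude, longitude) pairs of the selected POIs to
    obtain a single departure location or destination."""
    lats = [lat for lat, _ in locations]
    lons = [lon for _, lon in locations]
    return (sum(lats) / len(lats), sum(lons) / len(lons))

# e.g. two near-identical matches for "Zhongguancun Street"
print(average_location([(39.980, 116.310), (39.982, 116.314)]))
```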

In some embodiments, when no key vocabulary of the departure location type is extracted, the speech recognition device 400 may acquire the location information of the user terminal that provided the voice information and determine that location as the departure location. When no key vocabulary of the destination type is extracted, the speech recognition device 400 may send a prompt message to the user terminal to notify the user that the voice information could not be recognized or is insufficient to, for example, generate a service request or a voice command, and that the user may need to provide the voice information again. The speech recognition device 400 may then acquire the re-provided voice information and determine the departure location and the destination based on it.

In some embodiments, the speech recognition device 400 may determine at least one travel mode that the user may use based on information received from the user terminal. For example, if the speech recognition device 400 determines that the voice information is associated with a taxi-hailing service, the at least one travel mode may be taking a taxi. As another example, if the speech recognition device 400 determines that the voice information is associated with a navigation service, the at least one travel mode may include walking, riding a bicycle, taking a bus, taking a subway, taking a taxi, or the like, or a combination thereof. The distance parameter may be determined based on the probability distribution data related to different distance information corresponding to each of the at least one travel mode. The distance parameter corresponding to a travel mode may be positively correlated with the probability of using that travel mode for the road distance between the departure location and the destination. For example, when the road distance is 1.5 kilometers, the probability of walking may be 0.3, the probability of riding a bicycle may be 0.5, and the probability of taking a taxi may be 0.2. The distance parameters corresponding to walking, riding a bicycle, and taking a taxi may be determined based on 0.3, 0.5, and 0.2, respectively. For 1.5 kilometers, the distance parameter corresponding to riding a bicycle may thus be higher than those corresponding to walking and taking a taxi.

In 790, the speech recognition device 400 (e.g., the result determination module 430) may determine the update coefficient based on at least one of the search parameter, the heat parameter, the preference parameter, or the distance parameter. For example, the speech recognition device 400 may determine the update coefficient based on an average or weighted average, a sum or weighted sum, a product, or a combination thereof of the at least one parameter. Other methods of determining the update coefficient based on the at least one parameter may also be used and are within the scope of the present application. For more details on determining the update coefficient, please refer to other parts of the application, for example, the description of operation 650 in fig. 6.

It should be noted that the above description is provided for illustrative purposes only, and is not intended to limit the scope of the present application. Many variations and modifications may be made to the teachings of the present application by those of ordinary skill in the art. However, such changes and modifications do not depart from the scope of the present application.

FIG. 8 is a schematic diagram of an exemplary process for speech recognition according to some embodiments of the present application. By way of example only, voice data 810 may be used for a transportation service. The key vocabulary extracted from the candidate recognition results may be related to a location (e.g., a departure location or a destination) and is referred to as a suspected POI. In some embodiments, voice recognition device 400 may obtain voice data 810 from a user terminal (e.g., user terminal 130, terminal device 300). The speech recognition device 400 may recognize the speech data 810 based on the recognition model 500 to generate at least two candidate recognition results and corresponding preliminary scores. Such a method of recognizing speech data 810 can be found, for example, in fig. 5 and 6.

In 820, the speech recognition device 400 may extract one or more suspected POIs from each of the at least two candidate recognition results based on the predetermined key vocabulary extraction rules. The speech recognition device 400 may evaluate the accuracy of each of the at least two candidate recognition results by comparing the suspected POIs (i.e., extracted key vocabularies) with the sample POIs (i.e., sample key vocabularies) in one or more databases, such as the POI database 860, the POI heat database 870, the POI preference database 880, or the like.

At 830, the speech recognition device 400 may determine a search parameter for each of the at least two candidate recognition results based on the degree of match between the suspected POI and the sample POIs in the POI database 860. The POI database 860 may include at least two sample POIs. Each sample POI may include at least one description corresponding to a location (e.g., geographic coordinates). The description may include a name, an address, etc., or a combination thereof. The speech recognition device 400 may select one or more target sample POIs from the sample POIs, where the degrees of match between the one or more target sample POIs and the suspected POI are above the first match degree threshold. The search parameter may be determined based on the degree of match between the target sample POI and the suspected POI.

In 840, the speech recognition device 400 may determine a heat parameter and a preference parameter based on the suspected POIs. For example, the heat parameter may be determined according to the heat information corresponding to the target sample POI in the POI heat database 870. The POI heat database 870 may include heat information corresponding to each of the at least two sample POIs. The preference parameter may be obtained according to the preference information corresponding to the target sample POI in the POI preference database 880. The POI preference database 880 may include preference information, associated with the current user providing the voice information, corresponding to each sample POI.

In 850, the speech recognition device 400 may determine a target recognition result 890 based on the search parameter, the heat parameter, and the preference parameter. Since both the heat parameter and the preference parameter indicate usage information of a suspected POI, the one of the two having the higher value may be selected. The speech recognition device 400 may determine the update coefficient corresponding to a candidate recognition result based on the search parameter and the higher of the heat parameter and the preference parameter. The preliminary score of the candidate recognition result may be updated based on the update coefficient to generate an update score of the candidate recognition result. The target recognition result 890 may be selected based on the update scores. For example, the candidate recognition result corresponding to the highest update score may be determined as the target recognition result 890.

It should be noted that the above description is provided for illustrative purposes only, and is not intended to limit the scope of the present application. Many variations and modifications may be made to the teachings of the present application by those of ordinary skill in the art. However, such changes and modifications do not depart from the scope of the present application. For example, one or more of POI database 860, POI heat database 870, and POI preference database 880 may be integrated into one database.

FIG. 9 is a schematic diagram illustrating an exemplary process for speech recognition according to some embodiments of the present application. For example only, voice information 910 may be used for a transportation service. The key vocabulary extracted from the candidate recognition results may relate to a location (e.g., a departure location or destination) and is referred to as a suspected POI. In some embodiments, the speech recognition device 400 may obtain the speech information 910 from a user terminal (e.g., user terminal 130, terminal device 300). The speech recognition device 400 may recognize the speech information 910 based on the recognition model 500 to generate at least two candidate recognition results and corresponding preliminary scores. Such a method of recognizing speech information 910 may be found, for example, in fig. 5 and 6.

At 920, the speech recognition device 400 may extract one or more suspected POIs. In some embodiments, n suspected POIs may be obtained. The speech recognition device 400 may determine the update coefficient by comparing the suspected POIs with sample POIs in one or more databases, such as the POI database 970, the travel pattern database 980, or the like.

In 930, the speech recognition device 400 may determine a search parameter for each of the at least two candidate recognition results based on the degree of match between the one or more suspected POIs and the sample POIs in the POI database 970. The POI database 970 may include at least two sample POIs. A sample POI corresponding to a degree of match above the first match degree threshold may be determined as a target sample POI. In some embodiments, the preliminary score of each candidate recognition result may be updated based on the search parameter. For example, the updated preliminary score corresponding to a candidate recognition result may be expressed as f(x, s), where f is a function for determining the updated preliminary score based on the search parameter, x may represent the preliminary score corresponding to the candidate recognition result, and s may represent the search parameter corresponding to the candidate recognition result. For example, the updated preliminary score may be obtained by multiplying the preliminary score by the search parameter, which may be expressed as f(x, s) = xs.

At 940, the speech recognition device 400 may determine a road distance. The road distance may be determined based on the GPS information of the departure location and the destination. The departure location and the destination may be determined based on the sample POIs that match the one or more suspected POIs and the location types of the suspected POIs (e.g., whether a suspected POI is a departure location or a destination), for example as described in operation 780 of fig. 7. If no suspected POI of the departure location type is extracted from the candidate recognition result, the positioning information of the user terminal may be determined via GPS and used as the departure location. If no suspected POI of the destination type is extracted, a prompt message may be sent to the user terminal to inform the user that the voice information could not be recognized or is insufficient, for example, to generate a service request or a voice command. In some embodiments, if a suspected POI completely matches a target sample POI in the POI database (i.e., the degree of match is 1), the location information of the target sample POI is used directly to determine the departure location or destination. In some embodiments, M target sample POIs may be arranged in descending order of their degrees of match. The speech recognition device 400 may determine an average location as the departure location or destination based on the GPS information corresponding to the M target sample POIs. In some embodiments, the speech recognition device 400 may acquire, via the user terminal, at least one travel mode employed by the user providing the voice information. In some embodiments, the road distance corresponding to each of the at least one travel mode may be determined.
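
The disclosure does not say how the distance is computed from the GPS coordinates; as a hedged sketch, the great-circle (haversine) distance below gives a straight-line proxy, whereas a production system would query a routing service for the actual road distance:

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points,
    used here as a straight-line proxy for the road distance."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

print(round(haversine_km((39.980, 116.310), (39.993, 116.306)), 2))  # ~1.5 km
```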

In 950, the speech recognition device 400 may determine a distance parameter. The distance parameter may be determined from the probability distribution data related to different distance information (e.g., different road distances) corresponding to each of the at least two travel modes in the travel pattern database. The probability corresponding to the road distance determined in operation 940 may be determined for each of the at least one travel mode, and determined as, or converted into, the distance parameter. In some embodiments, the update score of each candidate recognition result may be represented as g{F, p[dist(a, b), D]}, where F may represent the updated preliminary score determined based on the search parameter, a may represent the departure location, b may represent the destination, dist may represent a function for determining the road distance between two locations, D may represent the probability distribution data related to different road distances, p may be a function for determining the target probability corresponding to the road distance, and g may be a function for determining the update score based on the updated preliminary score and the distance parameter. In some embodiments, the speech recognition device 400 may determine an update coefficient based on the search parameter and the distance parameter. The preliminary score may be updated using the update coefficient to generate the update score.

In 960, the speech recognition device 400 may determine a target recognition result 990. In some embodiments, each candidate recognition result may correspond to at least one update score associated with at least one travel mode. For example, the speech recognition device 400 may compare all update scores and determine the candidate recognition result corresponding to the highest update score as the target recognition result. In some embodiments, the target recognition result and information related to the target recognition result may be sent to the user terminal 130 or the processing engine 112. The information related to the target recognition result may include one or more target sample key vocabularies and/or the travel mode corresponding to the target recognition result. The one or more target sample key vocabularies (e.g., departure location, destination) may be used for subsequent operations, such as generating a service request. The travel mode corresponding to the target recognition result may be determined as a recommended travel mode for the user. For example only, the service request may be a request for a car-hailing service. The service request may be sent to a user terminal associated with the service provider (e.g., a driver).

It should be noted that the above description is provided for illustrative purposes only, and is not intended to limit the scope of the present application. Many variations and modifications may be made to the teachings of the present application by those of ordinary skill in the art. However, such changes and modifications do not depart from the scope of the present application.

FIG. 10 is a schematic diagram of exemplary interfaces for generating a service request based on voice information according to some embodiments of the present application. Interfaces 1010-1040 in fig. 10 are exemplary interfaces associated with a car-hailing service. For example, the car-hailing service may be provided by a taxi service App such as "drip taxi".

When a user requests a service through a user terminal (e.g., the user terminal 130, the terminal device 300), the user terminal may use a positioning technology such as GPS to obtain its current location and display a map around the current location on its display, as shown in interface 1010. The names of at least two streets, such as "co-blessing street", "Yongkang street", etc., may be displayed on the interface 1010. The drip taxi App may provide the user with two options: make a service request now, or make a reservation for a future service request. For example, the user may tap the icon with the word "now" to make a service request immediately.

After the user taps "now" in the interface 1010, the interface 1020 may be displayed on the screen of the user terminal. A microphone icon is displayed to indicate that the user can speak to provide the desired information. An icon with the text "press and talk" is displayed in interface 1020. The user can press and hold the icon to speak, and the microphone of the user terminal can acquire the voice information. Additionally or alternatively, the user may hold a button on the terminal device while speaking, such as a home button, a volume button, or any combination thereof. For example, the user may say, "I want to go to Beijing University". When the user releases the icon or a preset recording duration is reached, the microphone may stop acquiring the voice information. After acquiring the voice information, the user terminal may perform a speech recognition operation. Alternatively, the user terminal may transmit the voice information to a server (e.g., the server 110 in fig. 1), and the server may perform the speech recognition operation. At least two candidate recognition results may be generated based on the voice information, and a target recognition result may be selected from the at least two candidate recognition results. Such methods can be found, for example, in figs. 6-8. If the speech recognition operation is performed by the server, the at least two candidate recognition results and/or the target recognition result may be transmitted to the user terminal.

In the interface 1030, the target recognition result "I want to go to Beijing University" is displayed under "recognition text". A list of four candidate recognition results (e.g., "I want to go to Beijing zoo", etc.) is displayed in interface 1030 under the text "candidate text". The user may confirm the recognized text or select a candidate text from the list, for example, by tapping the recognized text or the selected candidate text. If neither the recognized text nor the candidate texts are accurate, the user may edit the recognized text or a candidate text. Alternatively, the user may record the speech again to update the recognized text.

After the recognized text is confirmed, the user terminal may generate a service request, as shown in interface 1040. The departure location and the destination of the service request may be displayed on the screen. For example, the departure location may be determined based on positioning information of the user terminal. The destination may be a sample key word of a destination type corresponding to the target recognition result. In the interface 1040, the departure location is displayed as "current location" under the text "from", and the destination is displayed as "Peking University" under the text "go". If desired, the user may modify the departure location and/or the destination. The user may click the "confirm" icon to confirm the service request, or click the "cancel" icon to cancel it. If the user confirms the service request, the service request may be initiated and sent to a service provider (e.g., a driver).
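Assembling the service request of interface 1040 from the positioning information and the matched destination key word might be sketched as follows; the ServiceRequest fields and the confirm/cancel handling are illustrative assumptions, not a specified data model.

```python
# Sketch of building the interface 1040 service request from the
# departure location and the matched destination-type sample key word.

from dataclasses import dataclass

@dataclass
class ServiceRequest:
    departure: str    # e.g., derived from the terminal's positioning info
    destination: str  # the matched destination-type sample key word

def build_request(current_location: str, destination_keyword: str) -> ServiceRequest:
    return ServiceRequest(departure=current_location,
                          destination=destination_keyword)

request = build_request("current location", "Peking University")
print(f"from: {request.departure}  go: {request.destination}")
# On "confirm" the request would be sent to a service provider (driver);
# on "cancel" it would simply be discarded.
```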

It should be noted that the above description is provided for illustrative purposes only and is not intended to limit the scope of the present application. Those of ordinary skill in the art may make various variations and modifications under the teachings of the present application. However, such variations and modifications do not depart from the scope of the present application. For example, the content displayed on the user terminal interface may differ from interfaces 1010 through 1040.

Having thus described the basic concept, it will be apparent to those skilled in the art that the foregoing disclosure is by way of example only, and is not intended to limit the present application. Various modifications, improvements and adaptations to the present application may occur to those skilled in the art, although not explicitly described herein. Such alterations, modifications, and improvements are intended to be suggested herein and are intended to be within the spirit and scope of the exemplary embodiments of the present application.

Also, this application uses specific terminology to describe embodiments of the application. For example, the terms "one embodiment," "an embodiment," and/or "some embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Therefore, it is emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, some features, structures, or characteristics may be combined as suitable in one or more embodiments of the application.

Moreover, those of ordinary skill in the art will understand that aspects of the present application may be illustrated and described in terms of several patentable classes or contexts, including any new and useful process, machine, article, or material, or any new and useful improvement thereof. Accordingly, various aspects of the present application may be embodied entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.), or in a combination of software and hardware implementations, which may generally be referred to herein as a "unit," "module," or "system." Furthermore, aspects of the present application may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied in the medium.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therewith, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electromagnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, or VB; conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, or ABAP; dynamic programming languages, such as Python, Ruby, and Groovy; or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider), or in a cloud computing environment, or as a service using, for example, software as a service (SaaS).

Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefor, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. While various presently contemplated embodiments have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein disclosed. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.

Similarly, it should be noted that in the preceding description of embodiments of the present application, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.
