Voice processing method, device, equipment and storage medium

Document No.: 344507 | Published: 2021-12-03

Note: This technology, "A voice processing method, apparatus, device and storage medium", was created by Zhang Jianjun and Chen Zhen on 2021-08-31. Abstract: The present disclosure provides a voice processing method, apparatus, device and storage medium, relating to the field of artificial intelligence and in particular to Internet of Vehicles and intelligent cabin technologies. The specific implementation scheme is as follows: determining context information of voice information input by a sender; acquiring at least two semantic parsing results of the voice information; and selecting a target parsing result from the at least two semantic parsing results according to the context information. According to the technology of the present disclosure, the user's intention can be accurately identified.

1. A voice processing method, comprising:

determining context information of voice information input by a sender;

acquiring at least two semantic parsing results of the voice information;

and selecting a target parsing result from the at least two semantic parsing results according to the context information.

2. The method of claim 1, wherein determining the context information of the voice information input by the sender comprises:

determining the context information of the voice information input by the sender according to a vehicle-end interface state.

3. The method of claim 2, wherein determining the context information of the voice information input by the sender according to the vehicle-end interface state comprises:

determining the context information of the voice information input by the sender according to the vehicle-end interface state and position information and/or a historical interaction record of the sender.

4. The method of claim 1, wherein the context information comprises a target interface identifier, and selecting the target parsing result from the at least two semantic parsing results according to the context information comprises:

determining a target scene associated with a target interface according to the target interface identifier;

and selecting the target parsing result from the at least two semantic parsing results according to a matching result between the at least two semantic parsing results and the target scene.

5. The method of claim 4, wherein selecting the target parsing result from the at least two semantic parsing results according to the matching result between the at least two semantic parsing results and the target scene comprises:

if a semantic parsing result matching the target scene exists, determining a priority of the matching semantic parsing result according to a priority of the target scene;

and determining the target parsing result from the matching semantic parsing results according to the priority of the matching semantic parsing results.

6. The method of claim 4, wherein selecting the target parsing result from the at least two semantic parsing results according to the matching result between the at least two semantic parsing results and the target scene comprises:

if no semantic parsing result matching the target scene exists, scoring the at least two semantic parsing results;

and selecting the target parsing result from the at least two semantic parsing results according to a scoring result.

7. The method of claim 1, further comprising, after selecting the target parsing result from the at least two semantic parsing results according to the context information:

executing the target parsing result and outputting an execution result to the sender.

8. A voice processing apparatus, comprising:

a context information determining module, configured to determine context information of voice information input by a sender;

a parsing result acquiring module, configured to acquire at least two semantic parsing results of the voice information;

and a target result selecting module, configured to select a target parsing result from the at least two semantic parsing results according to the context information.

9. The apparatus of claim 8, wherein the context information determining module comprises:

a context information determining unit, configured to determine the context information of the voice information input by the sender according to a vehicle-end interface state.

10. The apparatus of claim 9, wherein the context information determining unit is specifically configured to:

determine the context information of the voice information input by the sender according to the vehicle-end interface state and position information and/or a historical interaction record of the sender.

11. The apparatus of claim 8, wherein the context information comprises a target interface identifier, and the target result selecting module comprises:

a target scene determining unit, configured to determine a target scene associated with a target interface according to the target interface identifier;

and a target result selecting unit, configured to select the target parsing result from the at least two semantic parsing results according to a matching result between the at least two semantic parsing results and the target scene.

12. The apparatus of claim 11, wherein the target result selecting unit is specifically configured to:

if a semantic parsing result matching the target scene exists, determine a priority of the matching semantic parsing result according to a priority of the target scene;

and determine the target parsing result from the matching semantic parsing results according to the priority of the matching semantic parsing results.

13. The apparatus of claim 11, wherein the target result selecting unit is specifically configured to:

if no semantic parsing result matching the target scene exists, score the at least two semantic parsing results;

and select the target parsing result from the at least two semantic parsing results according to a scoring result.

14. The apparatus of claim 8, further comprising:

an executing module, configured to execute the target parsing result and output an execution result to the sender.

15. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the voice processing method of any one of claims 1-7.

16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the voice processing method of any one of claims 1-7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the voice processing method of any one of claims 1-7.

Technical Field

The present disclosure relates to the field of artificial intelligence, in particular to Internet of Vehicles and intelligent cabin technologies, and more particularly to a voice processing method, apparatus, device, and storage medium.

Background

With the wide adoption of artificial intelligence technology, human-computer interaction has been applied in various fields. Currently, in the Internet of Vehicles field, voice interaction is the main way for vehicle owners to interact with vehicle terminals. Accurate recognition of user intent is therefore crucial to voice interaction.

Disclosure of Invention

The disclosure provides a voice processing method, apparatus, device and storage medium.

According to an aspect of the present disclosure, there is provided a voice processing method, including:

determining context information of voice information input by a sender;

acquiring at least two semantic parsing results of the voice information;

and selecting a target parsing result from the at least two semantic parsing results according to the context information.

According to another aspect of the present disclosure, there is provided a voice processing apparatus, including:

a context information determining module, configured to determine context information of voice information input by a sender;

a parsing result acquiring module, configured to acquire at least two semantic parsing results of the voice information;

and a target result selecting module, configured to select a target parsing result from the at least two semantic parsing results according to the context information.

According to another aspect of the present disclosure, there is provided an electronic device, including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the voice processing method according to any one of the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the voice processing method according to any one of the embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the voice processing method according to any one of the embodiments of the present disclosure.

According to the technology of the present disclosure, the user's intention can be accurately identified.

It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 is a flowchart of a voice processing method provided according to an embodiment of the present disclosure;

FIG. 2 is a flowchart of another voice processing method provided according to an embodiment of the present disclosure;

FIG. 3 is a flowchart of yet another voice processing method provided according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure;

FIG. 5 is a block diagram of an electronic device for implementing the voice processing method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to assist understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a flowchart of a voice processing method according to an embodiment of the present disclosure. The embodiment is applicable to voice processing in general, and in particular to processing a user's voice information when the voice information input to the vehicle end has multiple possible meanings (i.e., the voice information does not express the user's intention unambiguously), so as to accurately identify the user's intention. The embodiment may be performed by a voice processing apparatus configured in an electronic device; the apparatus may be implemented in software and/or hardware. Optionally, the electronic device may be a vehicle end, a server end, or the like. As shown in Fig. 1, the voice processing method provided in this embodiment may include:

s101, determining the context information of the voice information input by the sender.

In this embodiment, the sender is any party that needs to interact with the vehicle end, such as the driver, the front-seat passenger, or another user in the vehicle. The voice information is a service request sent to the vehicle end by the sender through voice. Further, in this embodiment, the voice information the sender inputs to the vehicle end can express multiple meanings. For example, the voice information "red day" has two possible meanings: "play the song 'red day'" and "show the encyclopedia entry for 'red day'".

The context information characterizes the situation in which the sender inputs the voice information to the vehicle end; optionally, it includes any factor that influences the semantic parsing result. The context information may include information about the vehicle-end state, such as the identifier of the interface currently displayed at the vehicle end, where the interface identifier uniquely identifies a specific interface and may be an interface ID. The context information may also include the language and tone the sender uses to express the voice information, the relative distance between the sender and the vehicle end when the sender speaks, and the like.

Optionally, if the execution subject of this embodiment is the vehicle end, the vehicle end may collect the voice information input by the sender through a voice collection module (such as a microphone) and determine the context information of the voice information. For example, the vehicle end may determine the context information according to its own state. Alternatively, when the vehicle end collects the sender's voice information, it may synchronously capture a scene image that includes the sender (for example, an image of the whole vehicle interior while the sender is in the vehicle), and then recognize and analyze the captured scene image to determine the context information of the voice information.

Further, if the execution subject of this embodiment is the server end, the vehicle end may collect the voice information input by the sender through a voice collection module (such as a microphone) and synchronously capture a scene image including the sender; the vehicle end then sends the collected voice information and scene image to the server end, and the server end determines the context information of the voice information input by the sender according to the scene image.

Considering the network quality, storage space, computing power, and the like of the vehicle end and the server end, the execution subject of this embodiment is preferably the server end. The execution subject may also be the vehicle end where the vehicle-end hardware, network quality, and the like meet the requirements. For example, when the execution subject is the server end, having the server end determine the context information directly would require the vehicle end to transmit data (such as scene images) and thus occupy the vehicle end's network resources; the server end may therefore instead obtain context information that the vehicle end has already determined and transmitted.
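To make the data flow concrete, the following is a minimal sketch of how the context information might be represented and assembled. It is an illustration only, not the disclosed implementation; all names (ContextInfo, determine_context, the field names) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ContextInfo:
    # Identifier(s) of the vehicle-end interface(s) relevant to the utterance.
    target_interface_ids: List[str]
    # Optional factors mentioned above: sender-to-vehicle distance, history.
    relative_distance_m: Optional[float] = None
    history: List[str] = field(default_factory=list)

def determine_context(displayed_interface_id: str,
                      relative_distance_m: Optional[float] = None) -> ContextInfo:
    """Assemble context information from the vehicle-end state."""
    return ContextInfo(target_interface_ids=[displayed_interface_id],
                       relative_distance_m=relative_distance_m)

print(determine_context("car_map_interface", 0.8))
```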

S102, acquiring at least two semantic parsing results of the voice information.

In this embodiment, a semantic parsing result is obtained by performing voice recognition on the voice information to obtain the corresponding text content and then performing semantic parsing on that text content.

Optionally, if the execution subject of this embodiment is the vehicle end, the vehicle end may distribute the voice information input by the sender to at least two semantic parsing modules, each of which parses the voice information, so that the vehicle end obtains each module's semantic parsing result. Alternatively, the vehicle end may send a semantic request to the server end, the server end distributes the voice information to different semantic parsing modules, and the vehicle end then obtains the at least two semantic parsing results fed back by the server end. A semantic parsing module may be configured in the vehicle end or in other devices independent of the vehicle end.

Further, if the execution subject of this embodiment is the server end, the server end may send a semantic request containing the voice information to at least two semantic parsing modules, each of which parses the voice information; the server end then obtains each module's semantic parsing result. A semantic parsing module may be configured in the server end or in other devices independent of the server end.
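As a rough illustration of the fan-out described above, the sketch below distributes one utterance to several parsing modules and collects their results. The two toy parsers are invented stand-ins; real modules would run voice recognition plus semantic parsing as described in S102.

```python
from typing import Callable, Dict, List

def music_parser(text: str) -> Dict:
    # Toy stand-in for a music-vertical semantic parsing module.
    return {"card_type": "music", "intent": "play", "data": text}

def map_parser(text: str) -> Dict:
    # Toy stand-in for a map-vertical semantic parsing module.
    return {"card_type": "map", "intent": "nearby", "data": text}

PARSERS: List[Callable[[str], Dict]] = [music_parser, map_parser]

def get_semantic_results(text: str) -> List[Dict]:
    """Distribute the recognized text to all modules; collect >= 2 results."""
    return [parse(text) for parse in PARSERS]

print(get_semantic_results("red day"))
```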

S103, selecting a target parsing result from the at least two semantic parsing results according to the context information.

In this embodiment, the target parsing result is the semantic parsing result that expresses the sender's real intention.

In one implementation, this embodiment may perform statistical analysis on the at least two semantic parsing results based on the context information to determine the target parsing result.

In another implementation, the target parsing result may be determined by a neural network model. For example, the context information and the at least two semantic parsing results are input into a pre-trained matching model, which outputs the target parsing result.
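The two selection strategies above (statistical/rule-based and model-based) might sit behind one entry point, as in this hedged sketch; the MatchingModel protocol and the naive rule are assumptions for illustration, not the disclosed design.

```python
from typing import Dict, List, Optional, Protocol

class MatchingModel(Protocol):
    # Hypothetical interface for a pre-trained matching model.
    def score(self, context: Dict, result: Dict) -> float: ...

def select_target(context: Dict, results: List[Dict],
                  model: Optional[MatchingModel] = None) -> Dict:
    if model is not None:
        # Model-based selection: the highest-scoring (context, result) pair wins.
        return max(results, key=lambda r: model.score(context, r))
    # Statistical / rule-based selection: a naive example rule that prefers
    # results whose card_type matches a scene in the context (see Fig. 3).
    scene_types = set(context.get("scene_card_types", []))
    matching = [r for r in results if r.get("card_type") in scene_types]
    return matching[0] if matching else results[0]

demo_context = {"scene_card_types": ["map"]}
demo_results = [{"card_type": "music"}, {"card_type": "map"}]
print(select_target(demo_context, demo_results))  # -> {'card_type': 'map'}
```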

It should be noted that when the same voice information can express multiple meanings, the prior art determines the user intention directly from a single semantic parsing module's result, which can recall the wrong semantics. In this embodiment, by combining the context information of the voice information, the target parsing result, i.e., the user's real intention, can be accurately located among the different semantic parsing results, effectively alleviating wrong semantic recall and improving user satisfaction.

According to the technical solution provided by this embodiment of the present disclosure, when the same voice information can express multiple meanings, combining the context information of the voice information makes it possible to accurately locate the target parsing result, i.e., the user's real intention, among different semantic parsing results. This effectively alleviates wrong semantic recall in existing voice interaction and helps human-computer interaction proceed smoothly; in addition, the solution improves the intelligence of voice processing.

On the basis of the above embodiment, as an option of this embodiment of the present application, after the target parsing result is selected from the at least two semantic parsing results according to the context information, the target parsing result may further be executed and the execution result output to the sender.

Specifically, if the execution subject of this embodiment is the vehicle end, the vehicle end may directly execute the target parsing result once it is determined and output the execution result to the sender. The execution result may be output by voice broadcast, interface display, a combination of the two, and the like.

For example, the voice information is "XX supermarket", semantic parsing result A is "show the encyclopedia entry for XX supermarket", and semantic parsing result B is "search for nearby XX supermarkets in the map". If the target parsing result is semantic parsing result B, the vehicle end executes it and displays the search results for nearby XX supermarkets in the vehicle-end map interface for the sender to view.

For another example, the voice information is "turn on the air conditioner". In a smart-home environment where the car and the home are interconnected, "turn on the air conditioner" may correspond to semantic parsing result C, "turn on the air conditioner at home", and semantic parsing result D, "turn on the air conditioner in the car". If the target parsing result is semantic parsing result D, the vehicle end executes it and may broadcast a fixed response to the sender, such as "the air conditioner has been turned on".

Further, if the execution subject of this embodiment is the server end, after determining the target parsing result, the server end may feed it back to the vehicle end, and the vehicle end executes the target parsing result and outputs the execution result to the sender.

It should be noted that, after the user intention is accurately located, the result of executing that intention can be output to the user, further rounding out the solution.

Fig. 2 is a flowchart of another voice processing method according to an embodiment of the present disclosure. On the basis of the above embodiment, this embodiment explains in further detail how to determine the context information of the voice information input by the sender. As shown in Fig. 2, the voice processing method provided in this embodiment may include:

s201, determining the context information of the voice information input by the sender according to the state of the vehicle-side interface.

In this embodiment, the vehicle-end interface state may include the display and running states of vehicle-end interfaces at the moment the sender inputs the voice information to the vehicle end.

In one implementation, when the sender inputs the voice information to the vehicle end, the interface being displayed on the vehicle-end display screen is taken as the target interface, and the target interface identifier is used as the context information of the voice information.

Alternatively, when the sender inputs the voice information to the vehicle end, both the interface being displayed on the vehicle-end display screen and the interfaces running in the background at that moment may be taken as target interfaces, with the target interface identifiers used as the context information of the voice information.

In yet another implementation, the context information of the voice information input by the sender can be determined based on the vehicle-end interface state and the sender's position information and/or historical interaction record. The historical interaction record may include interaction records between the vehicle end and the sender within a period of time; considering memory usage and similar factors, the historical interaction record in this embodiment preferably includes the interaction record based on the previous voice information between the vehicle end and the sender, specifically the previous voice information the sender input to the vehicle end and the content the vehicle end output to the sender based on the parsing result of that previous voice information. The previous voice information is voice information the sender input to the vehicle end before the voice information of S201.

For example, when the sender inputs the voice information to the vehicle end, it may be determined whether the interface being displayed on the vehicle-end display screen is the same as the interface the vehicle end displayed after responding to the sender's previous voice information. If the two are the same, the interface being displayed is taken as the target interface; otherwise, both the interface being displayed and the interface displayed after the response to the previous voice information are taken as target interfaces. The target interface identifier(s) then serve as the context information of the voice information.

For another example, when the sender inputs the voice information to the vehicle end, the interface being displayed on the vehicle-end display screen may be taken as the target interface; meanwhile, the relative distance between the sender and the vehicle end is determined according to the sender's position information. The target interface identifier and the relative distance can then together serve as the context information.

Alternatively, the two approaches may be combined: determine the target interface(s) by comparing the currently displayed interface with the interface displayed after the response to the previous voice information, as above, and meanwhile determine the relative distance between the sender and the vehicle end according to the sender's position information; the target interface identifier(s) and the relative distance then together serve as the context information.
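A minimal sketch of this last variant, under assumed helper names: compare the currently displayed interface with the interface shown after the previous voice response to pick the target interface(s), then attach the relative distance derived from the sender's position. The interface identifier strings are hypothetical.

```python
from typing import Dict, List, Optional

def determine_target_interfaces(displayed: str,
                                after_last_response: Optional[str]) -> List[str]:
    """Apply the rule from the text: same interface -> one target, else both."""
    if after_last_response is None or displayed == after_last_response:
        return [displayed]
    return [displayed, after_last_response]

def build_context(displayed: str, after_last_response: Optional[str],
                  relative_distance_m: Optional[float]) -> Dict:
    # Target interface identifier(s) plus relative distance together form
    # the context information.
    return {"target_interfaces": determine_target_interfaces(displayed,
                                                             after_last_response),
            "relative_distance_m": relative_distance_m}

print(build_context("car_map_interface", "music_play_interface", 0.8))
```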

S202, acquiring at least two semantic parsing results of the voice information.

S203, selecting a target parsing result from the at least two semantic parsing results according to the context information.

According to the technical solution provided by this embodiment of the present disclosure, the context information of the voice information input by the sender can be determined from the vehicle-end interface state, providing one optional way to determine context information and supplying data support for subsequently locating the target parsing result accurately among different semantic parsing results. In addition, when determining the context information, this embodiment can also incorporate other factors such as the sender's position information and/or historical interaction record, making the context information more accurate and comprehensive and further improving the accuracy of the finally determined target parsing result.

Fig. 3 is a flowchart of yet another voice processing method according to an embodiment of the present disclosure. On the basis of the above embodiments, and for the case where the context information includes a target interface identifier, this embodiment explains in detail how to select a target parsing result from at least two semantic parsing results according to the context information. As shown in Fig. 3, the voice processing method provided in this embodiment may include:

S301, determining the context information of the voice information input by the sender, wherein the context information includes a target interface identifier.

S302, acquiring at least two semantic parsing results of the voice information.

S303, determining a target scene associated with the target interface according to the target interface identifier.

Optionally, in this embodiment, the target scene associated with the target interface may be determined by looking up a pre-established context mapping table using the target interface identifier.

In one implementation, the context mapping table may be constructed from the services (also called verticals) supported by the vehicle end, the functions (also called intents) included in each service, the interfaces the vehicle end can present to the user, and the like. Optionally, a scene in the context mapping table may consist of a service and a function under that service. Services include, but are not limited to, maps, weather, music, telephony, games, and the like; the map service, for example, may have multiple functions such as navigation, nearby search, and search along the way.

For example, one interface in the context mapping table may be associated with one or more scenes. As shown in Table 1, card_type indicates a service and intent indicates a function. The car map interface may be associated with a scene composed of the map service (map) and nearby search (nearby), and also with a scene composed of the music service (music) and playback (play). Further, a scene may also be associated with one or more interfaces: the map search-along-the-way scene, for example, may be associated with both the car map interface and the car map navigation interface.

Table 1 context mapping table

card_type       intent    Description                 Interface
map             nearby    Map nearby search           Car map interface
music           play      Music playback              Car map interface
passing_point   search    Map search along the way    Car map interface
passing_point   search    Map search along the way    Car map navigation interface
map             route     Navigation                  Car map navigation interface
music           play      Music playback              Music playing interface
music           collect   Music collection            Music playing interface

It should be noted that Table 1 merely lists some correspondences between scenes and interfaces for illustration; the context mapping table may further include correspondences between other interfaces and scenes, which this embodiment does not limit.

In order to cover a wide variety of scenarios more fully, and thus locate any user intention more precisely, the historical interaction records between the user and the vehicle end may also be incorporated when constructing the context mapping table.

Illustratively, the correspondence between scenes and interfaces in the context mapping table may be adjusted dynamically as service functions at the vehicle end are added, deleted, or updated, as new interaction records between the user and the vehicle end accumulate, and so on; for example, a new scene-interface correspondence may be added.

Taking the server end as the execution subject as an example: the server end may use the target interface identifier as an index and look up the pre-established context mapping table to obtain the target scene(s) associated with the target interface. There may be one or more target scenes; in this embodiment, multiple target scenes are preferred.
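For illustration, the context mapping table of Table 1 could be held in memory as below; the interface identifier strings are hypothetical, and a real system might keep the table in a database or configuration file instead.

```python
# In-memory form of Table 1: (card_type, intent, interface) triples.
SCENE_TABLE = [
    ("map", "nearby", "car_map_interface"),
    ("music", "play", "car_map_interface"),
    ("passing_point", "search", "car_map_interface"),
    ("passing_point", "search", "car_map_navigation_interface"),
    ("map", "route", "car_map_navigation_interface"),
    ("music", "play", "music_play_interface"),
    ("music", "collect", "music_play_interface"),
]

def scenes_for_interface(interface_id: str):
    """Look up all target scenes associated with a target interface."""
    return [(card, intent) for card, intent, iface in SCENE_TABLE
            if iface == interface_id]

print(scenes_for_interface("car_map_interface"))
# -> [('map', 'nearby'), ('music', 'play'), ('passing_point', 'search')]
```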

S304, selecting a target parsing result from the at least two semantic parsing results according to the matching result between the at least two semantic parsing results and the target scene.

Optionally, if an obtained semantic parsing result already contains fields such as card_type, intent, and data, it may be matched with the target scene directly; if a semantic parsing result is not composed of fields, it may first be parsed into a form containing fields such as card_type, intent, and data, where data carries the specific content of the voice request. For example, the semantic parsing result "search for the XX supermarket nearby in the map" may be parsed into a field format in which card_type is map, intent is nearby, and data is XX supermarket.
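A hedged sketch of this normalization step follows; the keyword rules are invented solely to show the card_type / intent / data target shape.

```python
def normalize_result(result) -> dict:
    """Coerce a free-form parsing result into field form for scene matching."""
    if isinstance(result, dict):
        return result  # already in field form
    text = str(result)
    # Invented keyword rules purely for illustration.
    if "search" in text and "map" in text:
        return {"card_type": "map", "intent": "nearby", "data": text}
    return {"card_type": "unknown", "intent": "unknown", "data": text}

print(normalize_result("search for the XX supermarket nearby in the map"))
```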

Optionally, after the target scene(s) associated with the target interface are determined, each semantic parsing result is matched against the target scene(s); if exactly one semantic parsing result matches a target scene, that matching result is taken as the target parsing result.

Illustratively, if at least two semantic parsing results match target scenes, the priority of each matching semantic parsing result is determined according to the priority of the target scene it matches, and the target parsing result is then determined from the matching semantic parsing results according to those priorities.

In this embodiment, one interface may be associated with multiple scenes, and each scene has a priority that represents the recall order of the scenes. Optionally, the priorities among scenes may be determined based on the ordering of the functions supported by a service, the historical interaction records between the user and the vehicle end, and the like.

If there is only one target interface identifier, the obtained target scenes are the scenes under that one target interface. Optionally, a semantic parsing result matches at most one target scene. Further, if at least two semantic parsing results match target scenes, for example semantic parsing result A matches target scene a and semantic parsing result B matches target scene b, then the priority of each target scene under the target interface is used as the priority of the matching semantic parsing result: the priority "first" of target scene a becomes the priority of semantic parsing result A, and the priority "second" of target scene b becomes the priority of semantic parsing result B. One of the matching semantic parsing results is then selected as the target parsing result according to these priorities, for example the matching result with the highest priority. Further, if the matching semantic parsing results have the same priority, each may be scored based on a set scoring rule, and the result with the highest score is taken as the target parsing result.

For another example, if the target interface identifiers cover two or more target interfaces, the obtained target scenes are scenes under those interfaces. If at least two semantic parsing results match target scenes and the matched scenes correspond to different target interfaces, for example semantic parsing result C matches target scene c (under target interface 1) and semantic parsing result D matches target scene d (under target interface 2), the priorities of target scenes c and d are determined first; the priority of target scene c is then used as the priority of semantic parsing result C, the priority of target scene d as the priority of semantic parsing result D, and one of the matching semantic parsing results is selected as the target parsing result according to these priorities. If, based on the context mapping table, the priority of target scene c under target interface 1 equals the priority of target scene d under target interface 2, the priorities of target scenes c and d may be determined according to the priority between target interface 1 and target interface 2. Optionally, in this embodiment, the priority between interfaces may be determined based on the priority between the services providing those interfaces (for example, the service providing the car map interface is the map service).

Further, if the obtained target scenes are scenes under two or more target interfaces and the same scene exists under different target interfaces, i.e., a target scene matched by a semantic parsing result corresponds to multiple target interfaces, one of those target interfaces may be selected, based on the priority between interfaces, as the target interface corresponding to the matched target scene.
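Putting the priority rules above together, a simplified selection routine might look like the following; the numeric priorities (lower number = higher priority) and the tie-breaking score function are assumptions, not the disclosed scheme.

```python
from typing import Dict, List, Tuple

def score(result: Dict) -> float:
    # Placeholder for the set scoring rule (e.g., source credibility).
    return 0.0

def select_by_priority(matches: List[Dict],
                       scene_priority: Dict[Tuple[str, str], int]) -> Dict:
    """Pick the matching result whose scene has the highest priority."""
    for r in matches:
        r["_prio"] = scene_priority[(r["card_type"], r["intent"])]
    best = min(r["_prio"] for r in matches)
    top = [r for r in matches if r["_prio"] == best]
    # Single highest-priority match wins; ties fall back to scoring.
    return top[0] if len(top) == 1 else max(top, key=score)

prio = {("map", "nearby"): 1, ("music", "play"): 2}
matches = [
    {"card_type": "music", "intent": "play", "data": "red day"},
    {"card_type": "map", "intent": "nearby", "data": "red day"},
]
print(select_by_priority(matches, prio))  # map/nearby wins (priority 1)
```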

Illustratively, if no semantic parsing result matches the target scene, the at least two semantic parsing results are scored, and the target parsing result is selected from them according to the scoring result.

Specifically, if no semantic parsing result matches the target scene, each semantic parsing result may be scored based on a preset scoring rule, and the result with the highest score taken as the target parsing result. The scoring rule may score based on the source credibility of the semantic parsing result, or based on the semantic recall history of the voice information in the historical interaction records.
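A fallback scoring sketch for when nothing matches the target scene; the credibility table is invented for illustration, since the text only says scoring may use source credibility or historical recall data.

```python
# Hypothetical per-source credibility weights (not from the disclosure).
SOURCE_CREDIBILITY = {"map": 0.9, "music": 0.8, "unknown": 0.1}

def fallback_select(results):
    """Score each result by source credibility and keep the best one."""
    return max(results, key=lambda r: SOURCE_CREDIBILITY.get(r["card_type"], 0.0))

candidates = [
    {"card_type": "music", "intent": "play", "data": "red day"},
    {"card_type": "map", "intent": "nearby", "data": "red day"},
]
print(fallback_select(candidates))  # picks the higher-credibility source
```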

It should be noted that this embodiment introduces an association between interfaces and scenes and preferentially recalls, from multiple semantic parsing results, those that match the scene, i.e., those that match the context information, so that the recalled semantic parsing result better fits the user's intention. Meanwhile, when no semantic parsing result matches the context information, the different semantic parsing results are scored and recalled under a fallback strategy, improving the flexibility and completeness of the solution.

According to the technical solution provided by this embodiment of the present disclosure, the context information is represented by an interface identifier, a scene can be located quickly from the interface identifier, and the target parsing result, i.e., the user's real intention, is accurately located among multiple semantic parsing results based on the located scene. This effectively alleviates wrong semantic recall in existing voice interaction while improving response efficiency.

Fig. 4 is a schematic structural diagram of a voice processing apparatus according to an embodiment of the present disclosure. The embodiment is applicable to voice processing in general, and in particular to processing a user's voice information when the user interacts with the vehicle end, so as to accurately identify the user's intention. The apparatus may be implemented in software and/or hardware and can implement the voice processing method described in any embodiment of the present disclosure. As shown in Fig. 4, the voice processing apparatus includes:

a context information determining module 401, configured to determine context information of voice information input by a sender;

a parsing result acquiring module 402, configured to acquire at least two semantic parsing results of the voice information;

and a target result selecting module 403, configured to select a target parsing result from the at least two semantic parsing results according to the context information.

According to the technical solution provided by this embodiment of the present disclosure, when the same voice information can express multiple meanings, combining the context information of the voice information makes it possible to accurately locate the target parsing result, i.e., the user's real intention, among different semantic parsing results. This effectively alleviates wrong semantic recall in existing voice interaction and helps human-computer interaction proceed smoothly; in addition, the solution improves the intelligence of voice processing.

Illustratively, the context information determining module 401 includes:

a context information determining unit, configured to determine the context information of the voice information input by the sender according to a vehicle-end interface state.

Illustratively, the context information determining unit is specifically configured to:

determine the context information of the voice information input by the sender according to the vehicle-end interface state and position information and/or a historical interaction record of the sender.

Illustratively, the context information includes a target interface identifier; the target result selecting module 403 includes:

a target scene determining unit, configured to determine a target scene associated with the target interface according to the target interface identifier;

and a target result selecting unit, configured to select a target parsing result from the at least two semantic parsing results according to a matching result between the at least two semantic parsing results and the target scene.

Illustratively, the target result selecting unit is specifically configured to:

if a semantic parsing result matching the target scene exists, determine a priority of the matching semantic parsing result according to a priority of the target scene;

and determine the target parsing result from the matching semantic parsing results according to the priority of the matching semantic parsing results.

Illustratively, the target result selecting unit is specifically configured to:

if no semantic parsing result matching the target scene exists, score the at least two semantic parsing results;

and select the target parsing result from the at least two semantic parsing results according to the scoring result.

Illustratively, the apparatus further includes:

an executing module, configured to execute the target parsing result and output the execution result to the sender.

In the technical solutions of the present disclosure, the collection, storage, and application of the sender's voice information, semantic parsing results, and the like comply with relevant laws and regulations and do not violate public order and good morals.

In accordance with embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.

FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or loaded from a storage unit 508 into a random access memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the electronic device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.

A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard or a mouse; an output unit 507 such as various types of displays and speakers; a storage unit 508 such as a magnetic disk or an optical disc; and a communication unit 509 such as a network card, a modem, or a wireless communication transceiver. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.

The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Examples of the computing unit 501 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 501 performs the methods and processes described above, such as the voice processing method. For example, in some embodiments, the voice processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the voice processing method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the voice processing method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
