Information processing apparatus, information processing method, and program

Document No.: 328160; Publication date: 2021-11-30

Description: This technology, "Information processing apparatus, information processing method, and program", was designed and created by 仓沢裕美 on 2020-03-26. Its main content is as follows. The present invention allows a user to perform utterance operations on a voice agent satisfactorily. The information processing apparatus accepts user utterance data and shared information of the user, analyzes the user utterance data in consideration of the shared information of the user, acquires an analysis result including the utterance intention, and outputs the analysis result. For example, the shared information of the user is a combination of text information and tag information for identifying the type of information indicated by the text information. As another example, the shared information of the user is information indicating a state in a predetermined number of state types. In utterance operations performed on the voice agent, the information processing apparatus allows the user to speak with appropriate omissions, as in person-to-person conversation, so that the utterance operation can be performed satisfactorily.

1. An information processing apparatus comprising:

an utterance input unit configured to accept user utterance data and shared information of a user;

an utterance analysis unit configured to analyze the user utterance data in consideration of the shared information of the user and acquire an analysis result including an utterance intention; and

an analysis result output unit configured to output the analysis result.

2. The information processing apparatus according to claim 1, wherein the shared information of a user is a combination of text information and tag information for identifying a type of information indicated by the text information.

3. The information processing apparatus according to claim 2, wherein a synonym is added to the text information.

4. The information processing apparatus according to claim 2, wherein the shared information of the user includes visually or audibly identifiable information presented to the user.

5. The information processing apparatus according to claim 1, wherein the shared information of a user is information indicating a state in a predetermined number of state types.

6. The information processing apparatus according to claim 5, wherein the shared information of a user is a combination of tag information indicating a status type and status information indicating a status of each status type.

7. The information processing apparatus according to claim 5, wherein the status type includes at least one of a screen status, a volume status, and a performance status.

8. The information processing apparatus according to claim 7, wherein when the status type is the screen status, the status information indicates a display status of a music playlist or a weather forecast.

9. The information processing apparatus according to claim 5,

wherein the shared information of the user is information having a predetermined format acquired from information processed by the application, and

wherein the utterance analysis unit analyzes the user utterance data using information having the predetermined format based on machine learning.

10. The information processing apparatus according to claim 5, wherein the utterance analysis unit further analyzes the user utterance data in consideration of a predetermined number of previous user utterance data.

11. An information processing method comprising:

a process of accepting user utterance data and shared information of a user;

a process of obtaining an analysis result including an utterance intention by analyzing the user utterance data in consideration of the shared information of the user; and

a process of outputting the analysis result.

12. A program for causing a computer to function as:

utterance input means for accepting user utterance data and shared information of a user;

utterance analysis means for obtaining an analysis result including an utterance intention by analyzing the user utterance data in consideration of the shared information of the user; and

analysis result output means for outputting the analysis result.

Technical Field

The present technology relates to an information processing apparatus, an information processing method, and a program, and more particularly, to an information processing apparatus and the like suitably applied to an agent system and the like.

Background

In recent years, with the advent of devices such as home agents, interactive systems have been introduced in homes. Therefore, in the future, it is expected that voice agents will be used as interfaces for various devices.

In person-to-person conversation, the speakers often assume which information is mutually identifiable and conduct the conversation on that shared basis. For example, what is visible in front of both speakers can be referred to with the demonstrative "that", or can be identified from a partial description such as "the red one"; when both speakers are in the same place, parts of an utterance are sometimes simply omitted.

Similarly, when a person talks to a machine, the person may estimate what "information the machine recognizes" and speak in terms of "information displayed by or responded to by the device" or "information controlled by the machine itself".

For example, Patent Document 1 discloses a technique that proposes a display method for distinguishing, among the information displayed on a screen, the information that can be entered by voice recognition from other displayed information. To handle the case where not all of the information displayed by the apparatus can be entered by voice input, a mismatch with the user's expectations is prevented by displaying the information so that the user can understand which items can be entered by voice.

[List of references]

[Patent document]

[Patent document 1]

JP 2014-202857 A

Disclosure of Invention

[Problem]

The technique described in Patent Document 1 actively presents to the user expressions that the application side can accept as voice input. With this technique, however, the user may not be able to perform a variety of operations with free utterance expressions and may only be able to perform limited operations.

In order for an application to present certain information to the user and flexibly understand the user's spoken input regarding that information, the module that interprets the utterance needs to actively grasp the content the application has presented to the user, that is, the information shared with the user.

However, a typical agent system has a configuration in which the control unit that controls the application itself and the unit that interprets the meaning of an utterance are separate modules. In some cases, the application is at hand on the user's client side, while the unit that interprets the meaning of the utterance is on the server side, receives the utterance as input, and simply returns the interpretation result to the client.

In this case, unless the application's control result or the information already presented to the user is actively transmitted to the unit that interprets the meaning of the utterance, that unit interprets the utterance in isolation. Even if the control result or similar information is transmitted, the interpreting unit cannot accept it unless the information is in a format it can understand.

An object of the present technology is to allow a user utterance operation with respect to a voice agent to be performed satisfactorily.

[Solution to problem]

According to one aspect of the present technology, an information processing apparatus includes: an utterance input unit configured to accept user utterance data and shared information of a user; an utterance analysis unit configured to analyze the user utterance data in consideration of the shared information of the user and acquire an analysis result including an utterance intention; and an analysis result output unit configured to output the analysis result.

In this aspect of the present technology, the utterance input unit accepts user utterance data and shared information of the user. The utterance analysis unit analyzes the user utterance data in consideration of the shared information of the user and acquires an analysis result including the utterance intention. Here, the shared information of the user is information that can be regarded as shared between the user and the system, for example, information involved in the control of information presentation by the application control unit, in addition to the information presented to the user as an image or sound by the application itself. The analysis result output unit outputs the analysis result.

For example, the shared information of the user may be a combination of text information and tag information for identifying the type of information indicated by the text information. In this case, for example, synonyms may be added to the text information. Thus, variations in the user's utterance can be processed. In this case, for example, the shared information of the user may include information presented to the user to be visually or audibly recognizable. Accordingly, information presented to the user so that the user can visually or audibly recognize the information can be processed as information shared with the user.

For example, the shared information of the user may be information indicating a state in a predetermined number of state types. In this case, for example, the shared information of the user may be a combination of tag information indicating a status type and status information indicating a status of each status type. Therefore, the utterance analysis unit can appropriately recognize the state of each state type.

In this case, for example, the shared information of the user may be information having a predetermined format acquired from the information processed by the application, and the utterance analysis unit may analyze the user utterance data using the information having the predetermined format based on machine learning. In this case, for example, the utterance analysis unit may further analyze the user utterance data in consideration of a predetermined number of previous user utterance data.

In this way, in the present technology, the user utterance data is analyzed in consideration of the shared information of the user, and an analysis result including the utterance intention is acquired. Therefore, in a voice operation on the voice agent, the user can speak with appropriate omissions, as in person-to-person conversation, so that the utterance operation can be performed satisfactorily.

Drawings

Fig. 1 is a block diagram showing an example of the configuration of a voice agent system.

Fig. 2 is a block diagram showing an example of the configuration of an application apparatus and an interaction apparatus.

Fig. 3 is a flowchart showing an example of a processing procedure of the interactive apparatus.

Fig. 4 is a diagram showing an example of the operation of the application apparatus and the interaction apparatus.

Fig. 5 is a diagram showing an example of the operation of the application apparatus and the interaction apparatus.

Fig. 6 is a diagram showing another example of the operation of the application apparatus and the interaction apparatus.

Fig. 7 is a diagram showing examples of various status types.

Fig. 8 is a block diagram showing an example of a hardware configuration of a computer.

Detailed Description

[Description of embodiments]

Hereinafter, modes for carrying out the present invention (hereinafter, referred to as embodiments) will be described. The description will be made in the following order.

1. Embodiment

2. Modified example

<1. Embodiment>

[ configuration example of Voice agent System ]

Fig. 1 shows an example of the configuration of the voice agent system 10. The voice agent system 10 is configured such that the system main body 100 and the cloud server 200 are connected via a network 300 such as the internet. The system main body 100 includes an application device 110, and the cloud server 200 includes an interaction device 210.

Fig. 2 shows an example of the configuration of the application apparatus 110 and the interaction apparatus 210. The application device 110 includes an input unit 111, an application control unit 112, and an output unit 113. The input unit 111 detects an utterance of a user, and transmits voice data corresponding to the utterance to the application control unit 112. The input unit 111 is configured by, for example, a microphone.

The application control unit 112 transmits the user utterance data and the shared information of the user to the interaction device 210, receives an analysis result including the utterance intention from the interaction device 210, performs application control corresponding to the analysis result, and transmits presentation data to the output unit 113 as necessary. The output unit 113 displays an image and/or outputs sound based on the presentation data. The output unit 113 is configured by a display or a speaker. Various configurations of the output unit 113 are possible: the application device 110 of the system main body 100 may itself include the output unit 113, or the output unit 113 may be configured by a television receiver, a projector, or the like outside the system main body 100.

Here, the user utterance data transmitted from the application control unit 112 to the interaction device 210 is voice data corresponding to a user utterance or text data acquired through a voice recognition process on the voice data.

When the application control unit 112 does not have a voice processing function, it may convert the voice data into text data using a voice recognition server, or it may transmit the voice data as it is to the interaction device 210. In the latter case, the interaction device 210 converts the voice data into text data and uses the text data.

Here, the shared information of the user includes information that can be regarded as shared between the user and the system, for example, information involved in the control of information presentation by the application control unit 112 (information presented visually or aurally so that the user can recognize it), in addition to information presented as an image or sound by the application device 110 itself.

For example, suppose that the system responds "It will be sunny" to the user's utterance "What is the weather in Tokyo tomorrow?". The user understands this reply as "It will be sunny in Tokyo tomorrow". Here, "tomorrow" and "Tokyo" are not the presented information itself, but information used to acquire the presented answer "sunny", and they are information controlled by the application control unit 112.

The interaction device 210 includes an utterance input unit 211, an utterance analysis unit 212, and an analysis result output unit 213. The utterance input unit 211 accepts as input a pair of user utterance data and shared information of the user transmitted from the application control unit 112.

The utterance analysis unit 212 analyzes the user utterance data in consideration of the shared information of the user to acquire an analysis result including the utterance intention. The analysis result output unit 213 returns the analysis result acquired by the utterance analysis unit 212 to the application control unit 112. Although a general format may be used, it is assumed here that a label indicating the utterance intention and one or more parameters are returned, where each parameter is a pair of a parameter item name and a vocabulary item.
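
As an illustration of this input/output contract, the following minimal sketch (in Python) shows a pair that the utterance input unit 211 might accept and an analysis result that the analysis result output unit 213 might return, based on example 1-1 described later; the field names and the exact label string are assumptions, not definitions from this document.

```python
# Hypothetical illustration of the data exchanged with the interaction device 210.
# Field names and the intent label string are assumptions.

# Pair accepted by the utterance input unit 211: user utterance data plus
# shared information of the user (here, a playlist shared as "music title").
request = {
    "utterance": "delete egg",
    "shared_info": {"music title": ["egg", "apple", "banana"]},
}

# Analysis result returned by the analysis result output unit 213: a label
# indicating the utterance intention and one or more parameters, each being
# a pair of a parameter item name and a vocabulary item.
analysis_result = {
    "intent": "playlist_delete_item",
    "parameters": {"item": "egg"},
}
```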

Although not described above, the application control unit 112 of the application device 110, shown inside the system main body 100, may instead be located on the cloud server 200 side. It is also conceivable that the interaction device 210 is located on the system main body 100 side, like the application device 110.

Fig. 3 is a flowchart showing an example of a processing procedure of the interaction means 210. In step ST1, the utterance input unit 211 of the interaction apparatus 210 accepts as input a pair of user utterance data and shared information of the user transmitted from the application control unit 112.

Subsequently, in step ST2, the utterance analysis unit 212 of the interaction device 210 analyzes the user utterance data in consideration of the shared information of the user to acquire an analysis result including an utterance intention. In step ST3, the analysis result output unit 213 of the interaction device 210 returns the analysis result to the application control unit 112.
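
The three steps ST1 to ST3 can be pictured as the following skeleton, assuming the units are modeled as plain Python objects; the class and method names are illustrative only.

```python
class InteractionDevice:
    """Minimal sketch of the interaction device 210 (steps ST1 to ST3)."""

    def __init__(self, analyze_fn):
        # analyze_fn stands in for the utterance analysis unit 212.
        self.analyze_fn = analyze_fn

    def handle(self, user_utterance, shared_info):
        # ST1: the utterance input unit 211 accepts the pair of user utterance
        # data and shared information of the user.
        pair = (user_utterance, shared_info)
        # ST2: the utterance analysis unit 212 analyzes the utterance in
        # consideration of the shared information and acquires an analysis
        # result including the utterance intention.
        result = self.analyze_fn(*pair)
        # ST3: the analysis result output unit 213 returns the analysis result
        # to the application control unit 112.
        return result
```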

[ example 1]

Next, an example of an operation between the application apparatus 110 and the interaction apparatus 210 will be described. This example is an example in which the shared information of the user is a combination of text information and tag information for identifying the type of information indicated by the text information.

A case will be considered in which a playlist of a musical piece to be reproduced is displayed on a display configured as the output unit 113 of the application device 110, as shown in fig. 4 (a). The playlist includes music pieces of "egg", "apple" and "banana".

For example, when the user says "delete egg", as shown in example 1-1 of fig. 4(b), the tag information "music title" and the text information "egg", "apple" and "banana" are transmitted together with the words from the application control unit 112 to the interaction device 210.

In this case, the speech analysis unit 212 of the interaction device 210 analyzes "eggs" into music pieces, acquires analysis results indicating an operation of deleting the music pieces "eggs" from the play list of the music pieces, and returns the analysis results to the application control unit 112. The analysis result returned to the application control unit 112 is composed of "playlist _ delete item" as a label indicating the utterance intention and an' item as a parameter: an "egg" is formed. In the parameters, "item" is the parameter item name, and "egg" is the vocabulary item.

Thus, the application control unit 112, to which the analysis result is returned, performs control so that the music piece "egg" is deleted from the playlist.

For example, when the user says "play egg", as shown in example 1-2 of fig. 4(b), the tag information "music title" and the text information "egg", "apple" and "banana" are transmitted together with the words from the application control unit 112 to the interaction device 210.

In this case, the speech analysis unit 212 of the interaction device 210 analyzes "eggs" into music pieces, acquires analysis results indicating an operation of reproducing the music pieces "eggs", and returns the analysis results to the application control unit 112. The analysis result returned to the application control unit 112 is composed of "play _ music" as a label indicating the utterance intention and an' item as a parameter: an "egg" is formed.

Thus, the application control unit 112 that returns the analysis result performs control so that the music pieces "egg" in the play list of music pieces are reproduced.

For example, when the user says "play natto", as shown in examples 1 to 3 of fig. 4(b), the tag information "music title" and the text information "egg", "apple", and "banana" are transmitted together with the words from the application control unit 112 to the interaction device 210.

In this case, the speech analysis unit 212 of the interaction apparatus 210 analyzes that "natto" is a general noun because the music piece "natto" is not included in the play list of the music piece and is not shared with the user, and returns "unknown ()" indicating an unclear meaning to the application control unit 112 as an analysis result. Based on this, the application control unit 112 performs control so that, for example, a reply "cannot be completed".
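
The behavior of examples 1-1 to 1-3 could be reproduced with a simple rule-based sketch such as the one below; the intent labels and the matching rules are assumptions made for illustration and are not the actual analysis method of the utterance analysis unit 212.

```python
def analyze_playlist_utterance(utterance, shared_info):
    """Rule-based sketch of examples 1-1 to 1-3 (labels and rules assumed)."""
    titles = shared_info.get("music title", [])
    words = utterance.lower().split()

    # Find a word of the utterance that matches a title shared with the user.
    matched = next((w for w in words if w in titles), None)
    if matched is None:
        # "natto" is not in the shared playlist, so the meaning is unclear.
        return {"intent": "unknown", "parameters": {}}

    if "delete" in words:
        # "delete egg": delete the music piece "egg" from the playlist.
        return {"intent": "playlist_delete_item", "parameters": {"item": matched}}
    if "play" in words:
        # "play egg": reproduce the music piece "egg".
        return {"intent": "play_music", "parameters": {"item": matched}}
    return {"intent": "unknown", "parameters": {}}


shared = {"music title": ["egg", "apple", "banana"]}
print(analyze_playlist_utterance("delete egg", shared))  # playlist_delete_item, item: egg
print(analyze_playlist_utterance("play egg", shared))    # play_music, item: egg
print(analyze_playlist_utterance("play natto", shared))  # unknown
```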

Next, a case will be considered in which a shopping cart list is displayed on a display configured as the output unit 113 of the application device 110, as shown in fig. 5(a); the shopping cart list includes foods such as "egg", "apple", and "banana".

For example, when the user says "delete egg", as shown in example 2-1 of fig. 5(b), the tag information "food item" and the text information "egg", "apple", and "banana" are transmitted together with the utterance from the application control unit 112 to the interaction device 210.

In this case, the utterance analysis unit 212 of the interaction device 210 analyzes "egg" as a food, acquires an analysis result indicating an operation of deleting the food "egg" from the shopping cart list, and returns the analysis result to the application control unit 112. The analysis result returned to the application control unit 112 is composed of "shopping_cart_delete_item" as the label indicating the utterance intention and {item: "egg"} as the parameter.

Accordingly, the application control unit 112, to which the analysis result is returned, performs control so that the food "egg" is deleted from the shopping cart list.

For example, when the user says "play egg", as shown in example 2-2 of fig. 5(b), the tag information "food item" and the text information "egg", "apple" and "banana" are transmitted together with the words from the application control unit 112 to the interaction device 210.

In this case, the speech analysis unit 212 of the interaction device 210 analyzes "egg" as a subject of food, and returns "unknown ()" indicating an unclear meaning to the application control unit 112 as an analysis result. Based on this, the application control unit 112 performs control so that, for example, a reply "cannot be completed". That is, in this case, even if there is a music piece "egg", the music piece is not reproduced.

In the above description, the shared information of the user has a format in which text information is attached to tag information. For example, there is { "music title": ("egg", "apple", "banana") }. However, the shared information of the user may have a format in which tag information is attached for each piece of text information. For example, { "egg": "music title", "apple": "music title", "banana": "music title" }.

Although not described above, synonyms may be added to the text information. Here, a synonym is an expression that the user may actually say instead of the expression indicated by the text information. For example, when the playlist shown in fig. 4(a) is displayed, the user may say "play No. 1" or the like instead of "play egg". In this case, "No. 1" and the like may be added as synonyms of "egg". By adding synonyms to the text information in this manner, variations in the user's utterance expression can be handled satisfactorily.
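
A sketch of the two shared-information formats mentioned above, and of how a synonym such as "No. 1" might be attached to a text item so that variations in the user's wording can be resolved to the displayed title; the concrete data structure is an assumption.

```python
# Format A: text information attached to tag information.
shared_a = {"music title": ["egg", "apple", "banana"]}

# Format B: tag information attached to each piece of text information.
shared_b = {"egg": "music title", "apple": "music title", "banana": "music title"}

# Text information with synonyms appended (structure assumed for illustration):
# each displayed title also carries expressions the user might actually say.
shared_with_synonyms = {
    "music title": [
        {"text": "egg", "synonyms": ["No. 1"]},
        {"text": "apple", "synonyms": ["No. 2"]},
        {"text": "banana", "synonyms": ["No. 3"]},
    ]
}

def resolve_title(spoken, shared):
    """Map a spoken expression back to the displayed title, using synonyms."""
    for entry in shared["music title"]:
        if spoken == entry["text"] or spoken in entry["synonyms"]:
            return entry["text"]
    return None

print(resolve_title("No. 1", shared_with_synonyms))  # -> "egg"
```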

[ example 2]

Next, another example of the operation between the application device 110 and the interaction device 210 will be described. In this example, the shared information of the user is formed of information indicating a state in a predetermined number of state types, where a state type is a category of state. In this case, the shared information of the user is, for example, a combination of tag information indicating the state type and state information (a state flag) indicating the state of each state type.

Here, three state types (the screen state, the volume state, and the performance state) are handled. For the screen state, the state information indicates what is displayed on the display configured as the output unit 113 of the application device 110. For the volume state, the state information indicates whether volume adjustment is in progress. For the performance state, the state information indicates whether a music piece is being reproduced.

In fig. 6(a), a playlist of a musical piece to be reproduced is displayed on the display. In this case, the state information indicates a playlist display state. The playlist shown includes "love", "excited", and "inharmonious" music pieces.

Here, the screen state, the volume state, and the performance state may be changed by the user's operation and are shared information of the user. Fig. 6(b) shows an example of changes in the screen state, the volume state, and the performance state, together with the user's utterance timings. The state changes are most likely caused by user utterances, but description of those utterances is omitted here. For example, it is assumed that the change of the reproduction state of the music piece is automatically performed by the application control unit 112 at a preset time; in practice, however, such a change is typically performed based on the user's utterance.

For the screen state, the period of the arrow indicates the state of each of the playlist display or the weekly weather display. As for the performance state, the period of the arrow indicates the reproduction state of the music piece.

For the volume state, the start of the arrow is the timing at which the volume adjustment state is entered, and the period of the arrow indicates a given period during which the user has not yet forgotten the volume adjustment, for example because the volume is shown on the display. This period is set arbitrarily, and during it the user can refer to the volume adjustment with a shortened expression.

At the utterance timing T1, the screen state is the playlist display state of the music application, the volume state is the volume adjustment state, and the performance state is the reproduction state of a music piece. In this case, the user may refer to any of these states: even when the immediately preceding utterance was a volume adjustment request, the next utterance may be a request to stop reproduction of the music piece, or a request to reproduce another music piece that is displayed on the screen and is not currently being reproduced.

At the utterance timing T2, the screen state is a playlist display state of a music piece of the music application, the volume state is a volume non-adjustment state, and the performance state is a reproduction state of the music piece. At the utterance timing T3, the screen state is a playlist display state of a music piece of the music application, the volume state is a volume non-adjustment state, and the performance state is a reproduction stop state of the music piece.

At the utterance timing T4, the screen state is the weekly weather display state, the volume state is the volume non-adjustment state, and the performance state is the reproduction state of the music piece. At the utterance timing T5, the screen state is a weekly weather display state, the volume state is a volume adjustment state, and the performance state is a reproduction state of the music piece.

In this case, at the utterance timing T1, information indicating each of the screen state, the volume state, and the performance state is transmitted from the application control unit 112 to the interaction apparatus 210 together with the utterance of the user. At this time, the information indicating the screen state is composed of a pair of "display state" serving as tag information indicating the type of state and "music playlist" serving as state information indicating the display state of the playlist.

The information indicating the volume state is composed of a pair of "volume state" serving as tag information indicating the type of state and "currently changed" serving as state information indicating the volume adjustment state. The information indicating the performance status is composed of a pair of "play status" serving as tag information indicating the status type and "play music" serving as status information indicating the reproduction status.

At the utterance timing T2, information indicating each of the screen state, the volume state, and the performance state is also transmitted from the application control unit 112 to the interaction device 210 together with the utterance of the user. At this time, the information indicating the screen state is composed of a pair of "display state" serving as tag information indicating the type of state and "music playlist" serving as state information indicating the display state of the playlist.

The information indicating the volume state is composed of a pair of "volume state" serving as tag information indicating the state type and "currently unchanged" serving as state information indicating the volume non-adjustment state. The information indicating the performance status is composed of a pair of "play status" serving as tag information indicating the status type and "play music" serving as status information indicating the reproduction status.

At the utterance timing T3, information indicating each of the screen state, the volume state, and the performance state is also transmitted from the application control unit 112 to the interaction device 210 together with the utterance of the user. At this time, the information indicating the screen state is composed of a pair of "display state" serving as tag information indicating the type of state and "music playlist" serving as state information indicating the display state of the playlist.

The information indicating the volume state is composed of a pair of "volume state" serving as tag information indicating the state type and "currently unchanged" serving as state information indicating the volume non-adjustment state. The information indicating the performance status is composed of a pair of "play status" serving as tag information indicating the status type and "stop music" serving as status information indicating the non-reproduction status.

At the utterance timing T4, information indicating each of the screen state, the volume state, and the performance state is also transmitted from the application control unit 112 to the interaction device 210 together with the utterance of the user. At this time, the information indicating the screen status is composed of a pair of "display status" serving as tag information indicating the status type and "weekly weather" serving as status information indicating the weekly weather display status.

The information indicating the volume state is composed of a pair of "volume state" serving as tag information indicating the state type and "currently unchanged" serving as state information indicating the volume non-adjustment state. The information indicating the performance status is composed of a pair of "play status" serving as tag information indicating the status type and "play music" serving as status information indicating the reproduction status.

At the utterance timing T5, information indicating each of the screen state, the volume state, and the performance state is transmitted from the application control unit 112 to the interaction device 210 together with the utterance of the user. At this time, the information indicating the screen status is composed of a pair of "display status" serving as tag information indicating the status type and "weekly weather" serving as status information indicating the weekly weather display status.

The information indicating the volume state is composed of a pair of "volume state" serving as tag information indicating the type of state and "currently changed" serving as state information indicating the volume adjustment state. The information indicating the performance status is composed of a pair of "play status" serving as tag information indicating the status type and "play music" serving as status information indicating the reproduction status.
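
For reference, the shared information at the five utterance timings described above can be written out compactly as follows; the dictionary representation is an assumption, while the tag and state strings are the ones given above.

```python
# Shared information of the user (tag information -> state information)
# at utterance timings T1 to T5, as described above.
SHARED_INFO = {
    "T1": {"display state": "music playlist", "volume state": "currently changed",   "play status": "play music"},
    "T2": {"display state": "music playlist", "volume state": "currently unchanged", "play status": "play music"},
    "T3": {"display state": "music playlist", "volume state": "currently unchanged", "play status": "stop music"},
    "T4": {"display state": "weekly weather", "volume state": "currently unchanged", "play status": "play music"},
    "T5": {"display state": "weekly weather", "volume state": "currently changed",   "play status": "play music"},
}
```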

The utterance analysis unit 212 of the interaction device 210 analyzes user utterance data in consideration of shared information of the user (information indicating each of a screen state, a volume state, and a performance state), acquires an analysis result including an utterance intention, and returns the analysis result to the application control unit 112.

For example, when the utterance of the user is "set to 2", the utterance analysis unit 212 interprets the user utterance data as meaning for requesting an operation to change the volume to "2", acquires an analysis result indicating an instruction to perform an operation to change the volume to "2", and returns the analysis result to the application control unit 112 at utterance timings T1 and T5 due to the volume adjustment state. Therefore, the application control unit 112 that returns the analysis result performs control so that the volume is changed to "2".

In this case, at the utterance timings T2 and T3, due to the volume non-adjustment state and the display state of the playlist, the utterance analysis unit 212 analyzes the user utterance data into the meaning for reproducing No. 2 music piece in the playlist, acquires an analysis result indicating an instruction to perform an operation of reproducing No. 2 music piece of the playlist, and returns the analysis result to the application control unit 112. Thus, the application control unit 112 that returns the analysis result performs control so that the No. 2 tune of the playlist is reproduced.

In this case, at the utterance timing T4, the utterance analysis unit 212 analyzes the user utterance data into an unclear meaning due to a volume non-adjustment state and a display state of the weekly weather, and returns the analysis result of the unclear meaning to the application control unit 112. Based on this, the application control unit 112 performs control so that, for example, a reply "cannot be completed".
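
The state-dependent handling of "set to 2" described above could be sketched as follows; this is a rule-based illustration only, and the intent labels are assumptions.

```python
def interpret_set_to_2(shared_info):
    """Sketch of the state-dependent interpretation of the utterance "set to 2"."""
    if shared_info["volume state"] == "currently changed":
        # Volume adjustment is in progress: interpret as a volume change request.
        return {"intent": "volume_set", "parameters": {"level": "2"}}
    if shared_info["display state"] == "music playlist":
        # No volume adjustment, but the playlist is shown: play track No. 2.
        return {"intent": "play_music", "parameters": {"track_no": "2"}}
    # Weekly weather is shown and no volume adjustment is in progress: unclear.
    return {"intent": "unknown", "parameters": {}}

# Shared information at timings T1, T2, and T4 as described above:
print(interpret_set_to_2({"display state": "music playlist",
                          "volume state": "currently changed"}))    # T1: change volume to 2
print(interpret_set_to_2({"display state": "music playlist",
                          "volume state": "currently unchanged"}))  # T2: play track No. 2
print(interpret_set_to_2({"display state": "weekly weather",
                          "volume state": "currently unchanged"}))  # T4: unknown
```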

For example, when the utterance of the user is "tokyo", the utterance analysis unit 212 analyzes that the music theme is preferred, the music name is "tokyo", and issues a request to display the music piece "tokyo", acquires an analysis result indicating an instruction to perform an operation of displaying the music piece "tokyo", and returns the analysis result to the application control unit 112 due to the play list display state at the utterance timing T1, T2, and T3. Thus, the application control unit 112 that returns the analysis result performs control so that the music piece "tokyo" is displayed.

In this case, at the utterance timings T4 and T5, due to the weekly weather display state, the utterance analysis unit 212 analyzes a preferred screen state rather than a performance state even in the reproduction state of the music, and issues a request to check "tokyo" weather, acquires an analysis result indicating an instruction to perform an operation of checking "tokyo" weather, and returns the analysis result to the application control unit 112. Therefore, the application control unit 112 that returns the analysis result performs control so that "tokyo" weather is checked.

It is also conceivable that the utterance analysis unit 212 may return an analysis result indicating an instruction to perform an operation of displaying the music piece "tokyo" and an analysis result indicating an instruction to perform an operation of checking "tokyo" weather without preferring the screen state to the application control unit 112, and one side of the application device 110 may select one of the analysis results.

For example, when the utterance of the user is "stop", the utterance analysis unit 212 analyzes that a request to stop reproduction of a music piece is issued, acquires an analysis result indicating an instruction to stop reproduction of the music piece, acquires an analysis result indicating an instruction to perform an operation to stop reproduction of the music piece, and returns the analysis result to the application control unit 112 at utterance timings T1 and T2 due to a play list display state and a reproduction state of the music piece. Thus, the application control unit 112 that returns the analysis result performs control so that the reproduction of the music is stopped.

In this case, at the utterance timing T3, the utterance analysis unit 212 analyzes the user utterance data into an unclear meaning due to the playlist display state and the non-reproduction state of the musical composition, and returns the analysis result of the unclear meaning to the application control unit 112. Based on this, the application control unit 112 performs control so that, for example, a reply "cannot be completed".

In this case, at the utterance timings T4 and T5, since the weekly weather display state and the reproduction state of the music piece, it is preferable that the utterance analysis unit 212 analyzes the reproduction state of the music piece, and issues a request to stop the reproduction of the music piece, acquires an analysis result indicating an instruction to perform an operation to stop the reproduction of the music piece, and returns the analysis result to the application control unit 112. Thus, the application control unit 112 that returns the analysis result performs control so that the reproduction of the music is stopped.

The above describes examples in which the utterance analysis unit 212 of the interaction device 210 analyzes the user utterance data at each utterance timing in consideration of the shared information of the user (the state information of a predetermined number of state types) transmitted together with the user's utterance data.

It is also conceivable that the utterance analysis unit 212 performs the analysis in consideration of a predetermined number of previous user utterance data. For example, in the above-described example, the volume state is the volume adjustment state "currently changed" for a given period (the period indicated by the length of the arrow) after the application control unit 112 side enters the volume adjustment state; however, the utterance analysis unit 212 side may also be allowed to keep treating the volume state as the adjustment state for a certain period after the volume state was put into the adjustment state by a user utterance.
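
One way to realize the idea above, keeping the volume state active on the analysis side for a while after the user's own volume-related utterance, is to retain a short history of previous utterances and their interpreted intents; the following rough sketch assumes that approach and a hypothetical "volume_set" intent label.

```python
from collections import deque

class UtteranceContext:
    """Keeps a predetermined number of previous user utterances (sketch only)."""

    def __init__(self, max_history=3):
        # Previous user utterance data and their interpreted intent labels.
        self.history = deque(maxlen=max_history)

    def remember(self, utterance, intent):
        self.history.append((utterance, intent))

    def volume_recently_adjusted(self):
        # If a recent utterance was a volume adjustment request, keep treating
        # the volume state as the adjustment state for the analysis.
        return any(intent == "volume_set" for _, intent in self.history)
```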

Three state types (the screen state, the volume state, and the performance state) have been described above as examples. However, the state types are not limited thereto. For example, as shown in fig. 7, other state types such as display content name, display content attribute value, display content attribute name, number of times of display, display number, and avatar may be conceived in addition to the screen state, the volume state, and the performance state.

The above has described an example in which the shared information of the user is a combination of tag information indicating a state type and state information (a state flag) indicating the state of each state type. However, it is also conceivable to provide the shared information of the user as information having a predetermined format (e.g., a vector representation) acquired from the information processed by the application.

In this case, the raw information of the various state types is not given directly in an information format understood by the part that interprets the meaning of the utterance, but is instead provided as, for example, information having a predetermined format (e.g., a vector representation) acquired based on the result of learning on each state of the system. In this case, it is also conceivable that the utterance analysis unit 212 analyzes the user utterance data using the information having the predetermined format based on, for example, machine learning.
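
As one possible (hypothetical) realization of this, the state information and the utterance could be encoded into a fixed-length feature vector and fed to a learned classifier. The encoding below and the use of scikit-learn's LogisticRegression are assumptions made for illustration; they are not the learning method described by this document, and the intent labels are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical fixed-length encoding: one-hot state flags plus a tiny bag of words.
STATE_FEATURES = [
    ("display state", "music playlist"),
    ("display state", "weekly weather"),
    ("volume state", "currently changed"),
    ("play status", "play music"),
]
VOCAB = ["set", "to", "2", "stop", "tokyo"]

def encode(utterance, shared_info):
    state_vec = [1.0 if shared_info.get(tag) == value else 0.0
                 for tag, value in STATE_FEATURES]
    words = utterance.lower().split()
    bow = [1.0 if w in words else 0.0 for w in VOCAB]
    return np.array(state_vec + bow)

# Toy training data: (utterance, shared information) -> intent label.
samples = [
    ("set to 2", {"display state": "music playlist", "volume state": "currently changed",
                  "play status": "play music"}, "volume_set"),
    ("set to 2", {"display state": "music playlist", "volume state": "currently unchanged",
                  "play status": "play music"}, "play_track"),
    ("stop",     {"display state": "weekly weather", "volume state": "currently unchanged",
                  "play status": "play music"}, "stop_music"),
    ("tokyo",    {"display state": "weekly weather", "volume state": "currently unchanged",
                  "play status": "play music"}, "check_weather"),
]
X = np.stack([encode(u, s) for u, s, _ in samples])
y = [label for _, _, label in samples]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([encode("set to 2", samples[0][1])]))  # e.g. ['volume_set']
```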

As described above, in the voice agent system 10 shown in figs. 1 and 2, the utterance analysis unit 212 of the interaction device 210 analyzes the user utterance data in consideration of the shared information of the user and acquires an analysis result including the utterance intention. Therefore, in a voice operation on the voice agent, the user can speak with appropriate omissions, as in person-to-person conversation, so that the utterance operation can be performed satisfactorily.

<2. Modified example>

Fig. 8 is a block diagram showing an example of the hardware configuration of a computer that executes the above-described series of processing by a program. For example, the application device 110 or the interaction device 210 shown in fig. 2 may be configured as such a computer.

In the computer, a Central Processing Unit (CPU) 501, a Read Only Memory (ROM) 502, and a Random Access Memory (RAM) 503 are connected to each other by a bus 504. An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a storage unit 508, a communication unit 509, and a drive 510 are connected to the input/output interface 505.

The input unit 506 is a keyboard, a mouse, a microphone, or the like. The output unit 507 is a display, a speaker, or the like. The storage unit 508 is a hard disk, a nonvolatile memory, or the like. The communication unit 509 is a network interface or the like. The drive 510 drives a removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

In the computer having the above-described configuration, for example, the CPU 501 executes the above-described series of processing by loading a program stored in the storage unit 508 to the RAM 503 via the input/output interface 505 and the bus 504 and executing the program.

The program executed by the computer (CPU 501) may be recorded on, for example, a removable medium 511 serving as a package medium, and provided in that form. The program may also be supplied via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer, the program can be installed in the storage unit 508 via the input/output interface 505 by mounting the removable medium 511 on the drive 510. The program may also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the storage unit 508. In addition, the program may be installed in advance in the ROM 502 or the storage unit 508.

The program executed by the computer may be a program that performs processing in time series in the procedures described in this specification, or may be a program that performs processing at a necessary timing, for example, in parallel or when called.

The preferred embodiments of the present disclosure have been described in detail with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to these examples. It is apparent that a person having ordinary knowledge in the technical field of the present disclosure can conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are, of course, understood to belong to the technical scope of the present disclosure.

The present technology can be configured as follows.

(1)

An information processing apparatus comprising:

an utterance input unit configured to accept user utterance data and shared information of a user;

an utterance analysis unit configured to analyze user utterance data in consideration of shared information of a user and acquire an analysis result including an utterance intention; and

an analysis result output unit configured to output an analysis result.

(2)

The information processing apparatus according to (1), wherein the shared information of the user is a combination of text information and tag information for identifying a type of information indicated by the text information.

(3)

The information processing apparatus according to (2), wherein a synonym is added to the text information.

(4)

The information processing apparatus according to (2) or (3), wherein the shared information of the user includes visually or audibly identifiable information presented to the user.

(5)

The information processing apparatus according to (1), wherein the shared information of the user is information indicating a state in a predetermined number of state types.

(6)

The information processing apparatus according to (5), wherein the shared information of the user is a combination of tag information indicating a status type and status information indicating a status of each status type.

(7)

The information processing apparatus according to (5) or (6), wherein the status type includes at least one of a screen status, a volume status, and a performance status.

(8)

The information processing apparatus according to (7), wherein the status information indicates a display status of a music playlist or a weather forecast when the status type is a screen status.

(9)

The information processing apparatus according to (5),

wherein the shared information of the user is information having a predetermined format acquired from information processed by the application, and

wherein the utterance analysis unit analyzes the user utterance data using information having a predetermined format based on machine learning.

(10)

The information processing apparatus according to any one of (5) to (9), wherein the utterance analysis unit further analyzes the user utterance data in consideration of a predetermined number of previous user utterance data.

(11)

An information processing method comprising:

a process of accepting user utterance data and shared information of a user;

a process of obtaining an analysis result including an utterance intention by analyzing user utterance data in consideration of shared information of a user; and

a process of outputting the analysis result.

(12)

A program for causing a computer to function as:

utterance input means for accepting user utterance data and shared information of a user;

utterance analysis means for obtaining an analysis result including an utterance intention by analyzing user utterance data in consideration of shared information of a user; and

analysis result output means for outputting the analysis result.

[List of reference signs]

10 voice agent system

100 system body

110 application device

111 input unit

112 application control unit

113 output unit

200 cloud server

210 interaction device

211 utterance input unit

212 utterance analysis unit

213 analysis result output unit

300 network
