Information processing apparatus, information processing system, information processing method, and program

Document No.: 1277193 | Publication date: 2020-08-25

Abstract: This technology, "Information processing apparatus, information processing system, information processing method, and program," was devised by Shinichi Kawano, Yuhei Taki, and Hiro Iwase on 2018-11-16. Its main content is as follows: By generating and using an utterance collection list in which a plurality of user utterances are collected, the present invention realizes an apparatus and method capable of accurately and repeatedly performing processing based on a plurality of user utterances. A learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected. Further, the generated utterance collection list is displayed on a display unit. The learning processing unit generates the utterance collection list and stores it in a storage unit in cases such as when the user's consent is obtained, when it is determined that execution of a plurality of processes corresponding to the user utterances has succeeded, when a combination of a plurality of user utterances reaches or exceeds a predetermined threshold number of times, or when it is inferred that the user is satisfied.

1. An information processing apparatus comprising:

a learning processing unit configured to perform learning processing of a user utterance, wherein,

the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

2. The information processing apparatus according to claim 1, wherein

the information processing apparatus further displays the utterance collection list on a display unit.

3. The information processing apparatus according to claim 1, wherein

the user utterances recorded in the utterance collection list are user utterances corresponding to commands, that is, processing requests made by the user to the information processing apparatus.

4. The information processing apparatus according to claim 1, wherein

the learning processing unit asks the user whether to generate the utterance collection list, generates the utterance collection list with the user's consent, and stores the utterance collection list in a storage unit.

5. The information processing apparatus according to claim 1, wherein

in a case where the learning processing unit determines that a plurality of processes corresponding to a plurality of user utterances have been successfully performed, the learning processing unit generates the utterance collection list and stores the utterance collection list in a storage unit.

6. The information processing apparatus according to claim 1, wherein

in a case where a combination of a plurality of user utterances has been made a number of times equal to or greater than a predetermined threshold, the learning processing unit generates the utterance collection list and stores the utterance collection list in a storage unit.

7. The information processing apparatus according to claim 1, wherein

the learning processing unit analyzes the presence or absence of an indicator indicating a relationship between utterances included in a plurality of user utterances, generates the utterance collection list based on a result of the analysis, and stores the utterance collection list in a storage unit.

8. The information processing apparatus according to claim 1, wherein

the learning processing unit analyzes a state of a user with respect to processing performed by the information processing apparatus in response to a user utterance, generates the utterance collection list based on a result of the analysis, and stores the utterance collection list in a storage unit.

9. The information processing apparatus according to claim 1, wherein

in a case where the learning processing unit receives an input of user status information and the user status information is information indicating that a user is satisfied, the learning processing unit generates the utterance collection list and stores the utterance collection list in a storage unit.

10. The information processing apparatus according to claim 9, wherein

the user status information is information indicating a user satisfaction status, and is acquired based on at least one of the following:

non-verbal information based on the user utterance and generated by a voice analysis unit;

image analysis information based on a user image and generated by an image analysis unit; or

sensor information analysis information based on sensor information and generated by a sensor information analysis unit.

11. The information processing apparatus according to claim 1, further comprising

a display information generation unit configured to perform a process of highlighting an utterance corresponding node that corresponds to a process currently being executed by the information processing apparatus, among a plurality of utterance corresponding nodes included in the utterance collection list displayed on a display unit.

12. The information processing apparatus according to claim 1, wherein

the information processing apparatus further acquires an external utterance collection list that is acquirable by the information processing apparatus, and displays the external utterance collection list on a display unit.

13. The information processing apparatus according to claim 1, wherein

the learning processing unit selects user utterances to be collected according to context information, and generates the utterance collection list.

14. An information processing system comprising a user terminal and a data processing server, wherein:

the user terminal includes

a voice input unit configured to input a user utterance;

the data processing server includes

a learning processing unit configured to perform learning processing of a user utterance received from the user terminal; and

the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

15. The information processing system according to claim 14, wherein

the user terminal displays the utterance collection list on a display unit.

16. An information processing method executed in an information processing apparatus, wherein:

the information processing apparatus includes a learning processing unit configured to perform learning processing of a user utterance; and

the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

17. An information processing method executed in an information processing system including a user terminal and a data processing server, wherein:

the user terminal performs voice input processing of inputting a user utterance;

the data processing server performs learning processing of a user utterance received from the user terminal; and

an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected is generated in the learning processing.

18. A program for causing an information processing apparatus to execute information processing, wherein:

the information processing apparatus includes a learning processing unit configured to perform learning processing of a user utterance; and

the program causes the learning processing unit to generate an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

Technical Field

The present disclosure relates to an information processing apparatus, an information processing system, an information processing method, and a program. More specifically, the present disclosure relates to an information processing apparatus, an information processing system, and an information processing method, and a program that perform processing according to a user utterance.

Background

In recent years, voice interactive systems have been increasingly used which perform voice recognition of a user utterance and perform various processes and responses based on the recognition result.

Those speech recognition systems recognize and understand a user utterance input through a microphone and perform processing according to the recognition and understanding.

For example, in a case where the user says "display a moving image of interest", the voice recognition system performs a process of acquiring moving image content from a moving image content providing server and outputting the moving image content to a display unit or a connected television set. Alternatively, in the case where the user says "turn off the television", the voice recognition system performs an operation such as turning off the television.

A general voice interaction system has a natural language understanding (NLU) function, and understands the intention of a user utterance by applying the NLU function.

However, in order for the voice interaction system to continuously perform a plurality of processes, for example, the user needs to make a plurality of user utterances corresponding to the plurality of processes, such as the following.

"displaying a moving image (moving image) of interest. "

Playing classical music. "

"I want to continue playing the game that was stopped yesterday. "

"I want to play a game with my friend, so please contact them. "

For example, in the case where such successive user utterances are made, it is difficult for the user to immediately confirm whether or not the system can understand and execute all of the utterances.

In fact, the user needs to wait for a certain period of time after uttering the utterance to confirm whether or not the processing is performed in response to the user utterance based on the execution result.

In a case where some processing is not executed, the user needs to repeat the utterance for the processing that was not executed or take other corrective action.

Such a response places a heavy burden on the user. Further, an increase in the time required to complete these processes is problematic.

A related art that discloses a configuration for reliably executing a processing request based on a user utterance is, for example, Patent Document 1 (Japanese Patent Application Laid-Open No. 2007-052397). This document discloses a configuration in which a list of voice commands that can be input to a car navigation system is displayed in advance on a display unit so that a user can input a voice command while viewing the list.

This configuration enables the user to make a user utterance (command) that the car navigation system can understand. Therefore, the possibility of making a user utterance (command) that cannot be understood by the car navigation system can be reduced.

This configuration can match user utterances with commands registered in the system. However, as described above, in order to have a plurality of processing requests executed continuously, the user needs to search the list for the plurality of commands corresponding to the plurality of desired processes. This increases the burden on the user, and as a result, the time required to complete these processes also increases.

Citation List

Patent document

Patent Document 1: Japanese Patent Application Laid-Open No. 2007-052397

Disclosure of Invention

Problems to be solved by the invention

The present disclosure has been made in view of, for example, the above-described problems, and an object thereof is to provide an information processing apparatus, an information processing system, an information processing method, and a program capable of performing processing according to a user utterance more reliably.

Further, embodiments of the present disclosure provide an information processing apparatus, an information processing system, an information processing method, and a program capable of reliably executing a plurality of processes requested by a user while executing a plurality of different processes collectively.

Solution to the problem

A first aspect of the present disclosure is

an information processing apparatus including

a learning processing unit configured to perform learning processing of a user utterance, wherein

the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

Further, a second aspect of the present disclosure is

an information processing system including

a user terminal; and

a data processing server, wherein:

the user terminal includes

a voice input unit configured to input a user utterance;

the data processing server includes

a learning processing unit configured to perform learning processing of a user utterance received from the user terminal; and

the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

Further, a third aspect of the present disclosure is

an information processing method executed in an information processing apparatus, wherein:

the information processing apparatus includes a learning processing unit configured to perform learning processing of a user utterance; and

the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

Further, a fourth aspect of the present disclosure is

an information processing method executed in an information processing system including a user terminal and a data processing server, wherein:

the user terminal performs voice input processing of inputting a user utterance;

the data processing server performs learning processing of a user utterance received from the user terminal; and

an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected is generated in the learning processing.

Further, a fifth aspect of the present disclosure is

a program for causing an information processing apparatus to execute information processing, wherein:

the information processing apparatus includes a learning processing unit configured to perform learning processing of a user utterance; and

the program causes the learning processing unit to generate an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

Note that the program of the present disclosure is, for example, a program that can be provided in a computer-readable format by a storage medium or a communication medium for an information processing apparatus or a computer system that can execute various program codes. By providing such a program in a computer-readable format, processing according to the program is realized in an information processing apparatus or a computer system.

Other objects, features and advantages of the present disclosure will become apparent based on the more detailed description of the embodiments of the present disclosure and the accompanying drawings, which are described later. Note that in this specification, a system is a logical set configuration of a plurality of devices, and is not limited to a system in which devices having respective configurations are in the same housing.

Effects of the invention

According to the configuration of the embodiments of the present disclosure, an apparatus and method capable of accurately and repeatedly performing processing based on a plurality of user utterances is realized by generating and using an utterance collection list in which a plurality of user utterances are collected.

Specifically, for example, the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected. Further, the generated utterance collection list is displayed on the display unit. The learning processing unit generates the utterance collection list and stores the utterance collection list in the storage unit in a case where it is determined that the plurality of processes corresponding to the user utterances have been successfully performed, in a case where a combination of a plurality of user utterances has been made a number of times equal to or greater than a predetermined threshold, in a case where it is estimated that the user is satisfied, or in other cases.

With this configuration, an apparatus and method capable of accurately and repeatedly performing processing based on a plurality of user utterances is realized by generating and using an utterance collection list in which the plurality of user utterances are collected.

Note that the effects described in this specification are merely examples, are not limited thereto, and may have other additional effects.

Drawings

Fig. 1 shows an example of an information processing apparatus that performs response and processing based on a user utterance;

Fig. 2 shows a configuration example and a use example of the information processing apparatus;

Fig. 3 shows a specific configuration example of an information processing apparatus;

Fig. 4 shows an example of display data of the information processing apparatus;

Fig. 5 shows an example of display data of the information processing apparatus;

Fig. 6 shows an example of display data of the information processing apparatus;

Fig. 7 shows an example of display data of the information processing apparatus;

Fig. 8 shows an example of display data of the information processing apparatus;

Fig. 9 shows an example of display data of the information processing apparatus;

Fig. 10 shows an example of display data of the information processing apparatus;

Fig. 11 shows an example of display data of the information processing apparatus;

Fig. 12 shows an example of display data of the information processing apparatus;

Fig. 13 shows an example of display data of the information processing apparatus;

Fig. 14 shows an example of display data of the information processing apparatus;

Fig. 15 shows an example of display data of the information processing apparatus;

Fig. 16 shows an example of display data of the information processing apparatus;

Fig. 17 shows an example of display data of the information processing apparatus;

Fig. 18 shows an example of display data of the information processing apparatus;

Fig. 19 shows an example of display data of the information processing apparatus;

Fig. 20 shows an example of display data of the information processing apparatus;

Fig. 21 shows an example of display data of the information processing apparatus;

Fig. 22 shows an example of display data of the information processing apparatus;

Fig. 23 shows an example of display data of the information processing apparatus;

Fig. 24 shows an example of display data of the information processing apparatus;

Fig. 25 shows an example of display data of the information processing apparatus;

Fig. 26 shows an example of display data of the information processing apparatus;

Fig. 27 shows an example of display data of the information processing apparatus;

Fig. 28 shows an example of display data of the information processing apparatus;

Fig. 29 shows an example of display data of the information processing apparatus;

Fig. 30 shows an example of display data of the information processing apparatus;

Fig. 31 shows an example of display data of the information processing apparatus;

Fig. 32 shows an example of display data of the information processing apparatus;

Fig. 33 shows an example of display data of the information processing apparatus;

Fig. 34 shows an example of display data of the information processing apparatus;

Fig. 35 shows an example of display data of the information processing apparatus;

Fig. 36 is a flowchart showing a processing sequence executed by the information processing apparatus;

Fig. 37 is a flowchart showing a processing sequence executed by the information processing apparatus;

Fig. 38 is a flowchart showing a processing sequence executed by the information processing apparatus;

Fig. 39 is a flowchart showing a processing sequence executed by the information processing apparatus;

Fig. 40 is a flowchart showing a processing sequence executed by the information processing apparatus;

Fig. 41 shows a configuration example of an information processing system;

Fig. 42 shows a hardware configuration example of the information processing apparatus.

Detailed Description

Hereinafter, details of an information processing apparatus, an information processing system, and an information processing method and a program of the present disclosure will be described with reference to the drawings. Note that description will be made in terms of the following items.

1. Configuration example of the information processing apparatus

2. Example of generating display information and an utterance collection list output by the information processing apparatus

3. Processing examples using the utterance collection list

4. Other examples of displaying and generating the utterance collection list

5. Processing sequence executed by the information processing apparatus

6. Configuration examples of the information processing apparatus and information processing system

7. Hardware configuration example of the information processing apparatus

8. Brief description of the configurations of the present disclosure

[1. Configuration example of the information processing apparatus]

First, a configuration example of an information processing apparatus according to an embodiment of the present disclosure will be described with reference to fig. 1 and the following drawings.

Fig. 1 shows a configuration and processing example of an information processing apparatus 10, which information processing apparatus 10 recognizes a user utterance made by a user 1 and performs processing and response corresponding to the user utterance.

The user 1 utters the following user utterance in step S01.

The user utterance is "display moving image of interest. "

In step S02, the information processing apparatus 10 performs speech recognition of the user utterance, and performs processing based on the recognition result.

In the example of fig. 1, in step S02, the following system utterance is output as a response to the user utterance "Display a moving image of interest."

The system utterance is "OK. I will play the moving image of interest."

Further, the information processing apparatus 10 acquires moving image content from a content distribution server, for example, as a server 20 in the cloud connected to the network, and outputs the moving image content to the display unit 13 of the information processing apparatus 10 or a nearby external apparatus (television) 30 controlled by the information processing apparatus 10.

Further, in step S03, the user 1 utters the following user utterance.

The user utterance is "Play classical music."

In step S04, the information processing apparatus 10 performs speech recognition of the user utterance, and performs processing based on the recognition result.

In the example of fig. 1, in step S04, the following system utterance is output as a response to the user utterance "Play classical music."

The system utterance is "OK. I will play classical music."

Further, the information processing apparatus 10 acquires the classical music content from, for example, a music distribution server as a server 20 in the cloud connected to the network, and outputs the classical music content to the speaker 14 of the information processing apparatus 10 or a nearby external apparatus (speaker).

The information processing apparatus 10 in fig. 1 includes a camera 11, a microphone 12, a display unit 13, and a speaker 14, and is configured to perform voice input/output and image input/output.

The information processing apparatus 10 in fig. 1 is referred to as, for example, a "smart speaker," an "agent apparatus," or the like.

Note that the voice recognition processing and semantic analysis processing for the user utterance may be executed in the information processing apparatus 10, or may be executed in a data processing server, that is, a server 20 in the cloud.

As shown in fig. 2, the information processing apparatus 10 of the present disclosure is not limited to the agent apparatus 10a, and may take various apparatus forms, for example, a smartphone 10b or a PC 10c.

The information processing apparatus 10 recognizes the speech of the user 1 and responds based on the user speech, and also controls an external apparatus 30, such as a television set and an air conditioner shown in fig. 2, for example, in response to the user speech.

For example, in a case where the user utterance is "Change the television channel to 1," "Set the temperature of the air conditioner to 20 degrees," or the like, the information processing apparatus 10 outputs a control signal (wireless network, infrared light, or the like) to the external apparatus 30 based on the voice recognition result of the user utterance, and performs control in accordance with the user utterance.
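For illustration only, the dispatch from a recognized intention to a device control signal might look like the following minimal Python sketch; the intent names and the send_ir helper are hypothetical assumptions, not part of this disclosure.

```python
# Hypothetical sketch: dispatching a recognized intention to an external device.
# The intent names and the send_ir transmitter are illustrative assumptions.

def send_ir(device: str, command: str) -> None:
    """Placeholder for an infrared or network transmitter (assumed API)."""
    print(f"control signal -> {device}: {command}")

def control_external_device(intent: str, entities: dict) -> None:
    if intent == "change_tv_channel":
        send_ir("television", f"CHANNEL {entities['channel']}")
    elif intent == "set_ac_temperature":
        send_ir("air_conditioner", f"TEMP {entities['temperature']}")
    else:
        raise ValueError(f"no external-device handler for intent: {intent}")

# Example: "Change the television channel to 1"
control_external_device("change_tv_channel", {"channel": 1})
```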

Note that the information processing apparatus 10 is connected to the server 20 via a network, and can acquire information necessary for generating a response to a user utterance from the server 20. Further, as described above, the server may be configured to perform speech recognition processing and semantic analysis processing.

Next, a specific configuration example of the information processing apparatus will be described with reference to fig. 3.

Fig. 3 shows a configuration example of the information processing apparatus 10 that recognizes a user utterance and performs processing and response corresponding to the user utterance.

As shown in fig. 3, the information processing apparatus 10 includes an input unit 110, an output unit 120, and a data processing unit 150.

Note that although the data processing unit 150 may be provided in the information processing apparatus 10, a data processing unit of an external server may be used without providing the data processing unit 150 in the information processing apparatus 10. In the case of using the configuration of the server, the information processing apparatus 10 transmits input data input from the input unit 110 to the server via the network, receives a processing result of the data processing unit 150 of the server, and outputs the processing result via the output unit 120.

Next, the components of the information processing apparatus 10 of fig. 3 will be described.

The input unit 110 includes a voice input unit (microphone) 111, an image input unit (camera) 112, and a sensor 113.

The output unit 120 includes a voice output unit (speaker) 121 and an image output unit (display unit) 122.

The information processing apparatus 10 includes at least those components.

Note that the voice input unit (microphone) 111 corresponds to the microphone 12 of the information processing apparatus 10 in fig. 1.

The image input unit (camera) 112 corresponds to the camera 11 of the information processing apparatus 10 in fig. 1.

The voice output unit (speaker) 121 corresponds to the speaker 14 of the information processing apparatus 10 in fig. 1.

The image output unit (display unit) 122 corresponds to the display unit 13 of the information processing apparatus 10 in fig. 1.

Note that the image output unit (display unit) 122 may also be configured by, for example, a projector or the like, or may be configured as a display unit using a television as an external device.

As described above, the data processing unit 150 is provided in the information processing apparatus 10 or a server capable of communicating with the information processing apparatus 10.

The data processing unit 150 includes an input data analysis unit 160, a storage unit 170, and an output information generation unit 180.

The input data analysis unit 160 includes a voice analysis unit 161, an image analysis unit 162, a sensor information analysis unit 163, a user state estimation unit 164, and a learning processing unit 165.

The output information generating unit 180 includes an output voice generating unit 181 and a display information generating unit 182.

The display information generating unit 182 generates display data, for example, a node tree and an utterance collection list. The display data will be described in detail later.

The utterance voice of the user is input to the voice input unit 111, such as a microphone.

The voice input unit (microphone) 111 inputs the received user utterance voice to the voice analysis unit 161.

The voice analysis unit 161 has, for example, an Automatic Speech Recognition (ASR) function, and converts voice data into text data including a plurality of words.

Further, the voice analysis unit 161 performs semantic analysis processing on the text data.

The voice analysis unit 161 also has a Natural Language Understanding (NLU) function, and estimates, from the text data, the intention (intent) of the user utterance and entities that are meaningful elements (important elements) included in the utterance.

A specific example will be described. For example, the following user utterance is input.

The user utterance is "Tell me the weather forecast for Osaka tomorrow afternoon."

The intention (intent) of this user utterance is to know the weather, and the entities are the following words: Osaka, tomorrow, and afternoon.

When the intention and the entities can be accurately estimated and acquired from the user utterance, the information processing apparatus 10 can perform accurate processing in response to the user utterance.

For example, in the above example, the weather forecast for tomorrow afternoon in Osaka can be acquired and output as a response.
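As an illustration of this ASR-to-NLU step, a toy rule-based analyzer might look as follows; the pattern and the intent/entity labels are assumptions for this sketch, and a real system would use a trained NLU model.

```python
import re

# Toy intent/entity extraction illustrating the NLU step described above.
# The single pattern and the slot names are illustrative assumptions.

def analyze(text: str) -> dict:
    m = re.search(r"weather forecast for (\w+) (\w+) (\w+)", text.lower())
    if m:
        place, day, time_of_day = m.groups()
        return {"intent": "check_weather",
                "entities": {"place": place, "day": day, "time": time_of_day}}
    return {"intent": "out_of_domain", "entities": {}}

print(analyze("Tell me the weather forecast for Osaka tomorrow afternoon"))
# -> {'intent': 'check_weather',
#     'entities': {'place': 'osaka', 'day': 'tomorrow', 'time': 'afternoon'}}
```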

The user utterance analysis information 191 acquired by the voice analysis unit 161 is stored in the storage unit 170, and is also output to the learning processing unit 165 and the output information generation unit 180.

Further, the voice analysis unit 161 acquires information (non-verbal information) necessary for user emotion analysis processing based on the voice of the user, and outputs the acquired information to the user state estimation unit 164.

The image input unit 112 captures an image of the uttering user and the user's surroundings, and inputs the image to the image analysis unit 162.

The image analysis unit 162 analyzes facial expressions, gestures, gaze information, and the like of the user, and outputs the analysis result to the user state estimation unit 164.

The sensor 113 includes, for example, a sensor for acquiring data required for analyzing the line of sight, body temperature, heart rate, pulse, brain wave, and the like of the user. The acquired information from the sensor is input to the sensor information analysis unit 163.

The sensor information analysis unit 163 acquires data of the user, for example, the line of sight, the body temperature, the heart rate, and the like, based on the sensor acquisition information, and outputs the analysis result to the user state estimation unit 164.

The user state estimation unit 164 receives the input of the following data, estimates the state of the user, and generates user state estimation information 192:

an analysis result of the voice analysis unit 161, that is, information (non-verbal information) necessary for user emotion analysis processing based on the voice of the user;

analysis results of the image analysis unit 162, i.e., analysis information, such as facial expressions, gestures, and sight line information of the user; and

the analysis result of the sensor information analysis unit 163, i.e., data of the user, such as line of sight, body temperature, heart rate, pulse, and brain wave.
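A minimal sketch of fusing these three analysis results into a single satisfaction estimate is shown below; the score ranges, weights, and threshold are illustrative assumptions, not values from this disclosure.

```python
from dataclasses import dataclass

# Each analyzer is assumed to report a satisfaction score in [0.0, 1.0].

@dataclass
class AnalysisResults:
    voice_nonverbal: float  # voice analysis unit (tone, prosody)
    image: float            # image analysis unit (expression, gesture, gaze)
    sensor: float           # sensor information analysis unit (heart rate, etc.)

def estimate_user_state(r: AnalysisResults) -> dict:
    # Weighted fusion; the weights 0.4/0.4/0.2 are arbitrary for illustration.
    score = 0.4 * r.voice_nonverbal + 0.4 * r.image + 0.2 * r.sensor
    return {"satisfaction": score, "satisfied": score >= 0.5}

print(estimate_user_state(AnalysisResults(0.8, 0.7, 0.5)))
# -> satisfaction ≈ 0.7, satisfied: True
```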

The generated user state estimation information 192 is stored in the storage unit 170, and is also output to the learning processing unit 165 and the output information generation unit 180.

Note that the user state estimation information 192 generated by the user state estimation unit 164 is, specifically, for example, estimation information indicating whether or not the user is satisfied, that is, whether or not the user is satisfied with the processing performed by the information processing apparatus in response to the user utterance.

For example, in a case where it is estimated that the user is satisfied, it is estimated that the processing performed by the information processing apparatus in response to the user utterance is correct, that is, the processing has been successfully performed.

The learning processing unit 165 performs learning processing on user utterances and stores learning data in the storage unit 170. For example, in a case where a new user utterance is input or the intention of a user utterance is unknown, and the intention is then resolved based on subsequent interaction between the apparatus and the user, the learning processing unit 165 generates learning data in which the user utterance is associated with the resolved intention, and stores the learning data in the storage unit 170.

By performing such learning processing, accurate understanding of the intention of a large number of user utterances can be gradually achieved.
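As a sketch, such learning data could be a simple utterance-to-intention store that is written only after an unknown utterance has been resolved through interaction; the storage format and names here are assumptions.

```python
# Illustrative learning-data store: utterance text mapped to a resolved intention.
# Persistence to the storage unit is mocked with an in-memory dict.

learning_data: dict[str, str] = {}

def record_resolved_utterance(utterance: str, intention: str) -> None:
    """Called once a dialog with the user has clarified the intention."""
    learning_data[utterance] = intention

def lookup_intention(utterance: str) -> str | None:
    return learning_data.get(utterance)

record_resolved_utterance("Add Souzan", "play_artist_songs")
print(lookup_intention("Add Souzan"))  # -> play_artist_songs
```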

Further, the learning processing unit 165 also performs processing of generating an "utterance collection list" in which a plurality of user utterances are collected, and storing the utterance collection list in the storage unit 170.

The "utterance collection list" will be described in detail later.

Note that not only the analysis result of the voice analysis unit 161 but also the analysis information and the estimation information generated by the image analysis unit 162, the sensor information analysis unit 163, and the user state estimation unit 164 are input to the learning processing unit 165.

Based on such input information, the learning processing unit 165 grasps, for example, the degree of success of the processing performed by the information processing apparatus 10 in response to the user utterance. In the case where the learning processing unit 165 determines that the processing has been successfully executed, the learning processing unit 165 executes processing of generating learning data and storing the learning data in the storage unit 170 or other processing.

The storage unit 170 stores content of the user utterance, learning data based on the user utterance, display data to be output to the image output unit (display unit) 122, and the like.

Note that the display data includes a node tree, an utterance collection list, and the like generated by the display information generation unit 182. These data will be described in detail later.

The output information generating unit 180 includes an output voice generating unit 181 and a display information generating unit 182.

The output speech generating unit 181 generates a response to the user based on the user utterance analysis information 191 that is the analysis result of the speech analyzing unit 161. Specifically, the output speech generation unit 181 generates a response according to the intention of the user utterance as the analysis result of the speech analysis unit 161.

The response voice information generated by the output voice generation unit 181 is output via the voice output unit 121 such as a speaker.

The output speech generating unit 181 also performs control of changing a response to be output based on the user state estimation information 192.

For example, in a case where the user has a dissatisfied or confused expression, the output speech generating unit 181 performs processing of making a system utterance to check the situation (e.g., "Do you have any problem?") or other processing.

The display information generation unit 182 generates display data to be displayed on the image output unit (display unit) 122, for example, a node tree and an utterance collection list.

These data will be described in detail later.

Note that fig. 3 does not show the process execution functions for a user utterance, for example, a configuration for executing the moving image acquisition processing for playing a moving image and a configuration for outputting the acquired moving image, which have been described above with reference to fig. 1. However, these functions are also included in the data processing unit 150.

[2. Example of generating display information and an utterance collection list output by the information processing apparatus]

Next, an example of generating the display information and the utterance collection list output by the information processing apparatus 10 will be described.

Fig. 4 shows an example of display data to be output to the image output unit (display unit) 122 of the information processing apparatus 10.

Note that the image output unit (display unit) 122 corresponds to the display unit 13 of the information processing apparatus 10 in fig. 1 as described above, but may be configured by, for example, a projector or the like, and may also be configured as a display unit using a television as an external apparatus.

In the example of fig. 4, first, the user utters the following user utterance as a call to the information processing apparatus 10.

The user utterance is "hey, Sonitaro. "

Note that "Sonitaro" is a nickname of the information processing apparatus 10.

In response to the call, the information processing apparatus 10 makes the following system response.

System response = "What do you want to do? Here is what you can do."

In the information processing apparatus 10, the output voice generating unit 181 generates the above-described system response, and outputs the system response via the voice output unit (speaker) 121.

In addition to the output of the above-described system response, the information processing apparatus 10 displays the display data of fig. 4 generated by the display information generation unit 182 on the image output unit (display unit) 122.

The display data shown in fig. 4 will be described.

The domain corresponding node tree 200 is tree (tree structure) data that classifies processing that the information processing apparatus 10 can perform in response to a user utterance according to type (domain), and also shows an acceptable user utterance example for each domain.

In the example of fig. 4,

a game domain,

a media domain,

a setting domain, and

a store domain

are set as domains 201, and

a photo domain,

a video domain, and

a music domain

are further shown as subdomains of the media domain.

The acceptable-utterance display node 202 is further set as a child node of each domain.

A specific example of the acceptable utterance display node 202 will be described later with reference to fig. 5 and subsequent drawings.

The display unit also displays the display area identification information 211 in the upper right part. This is information indicating to which part of the entire tree the domain corresponding node tree 200 displayed on the display unit corresponds.

The display unit also displays registered utterance collection list information 212 in the lower right portion. This is list data of the utterance collection list recorded on the storage unit 170 of the information processing apparatus 10.

The utterance collection list is a list in which a series of a plurality of different user utterances are collected. For example, in a case where the information processing apparatus 10 is requested to continuously perform two or more processes, the utterance collection list is used.

The utterance collection list will be described in detail later.
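Purely as a rough illustration, such a list could be represented as an ordered series of utterance/intention pairs; the class and field names below are hypothetical and not taken from this disclosure.

```python
from dataclasses import dataclass, field

# Hypothetical representation of an utterance collection list:
# an ordered series of user utterances (commands) to be executed together.

@dataclass
class UtteranceEntry:
    utterance: str  # the user utterance text (command)
    intention: str  # the processing request it maps to

@dataclass
class UtteranceCollectionList:
    name: str
    entries: list[UtteranceEntry] = field(default_factory=list)

evening_list = UtteranceCollectionList(
    name="game night",
    entries=[
        UtteranceEntry("Play the favorites list", "play_favorites"),
        UtteranceEntry("Send an invitation to my friend", "invite_friends"),
    ],
)
```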

The state in fig. 4 shifts to the state in fig. 5.

As shown in fig. 5, the user utters the following user utterance.

The user utterance is "Play BGM."

The information processing apparatus 10 performs speech recognition and semantic analysis of the user utterance, and grasps that the user's intention is "play."

Based on the user utterance analysis information, the display information generation unit 182 updates display data on the display unit, as shown in fig. 5.

The display data of fig. 5 shows the processing category display nodes 203 as child nodes of the video domain and the music domain, and further shows the acceptable-utterance display nodes 202 as child nodes of the processing category display nodes 203.

The processing category display node 203 is a node indicating a category of executable processing corresponding to each domain (video, music, game, etc.).

The acceptable utterance display node 202 is displayed as a child node of the processing category display node 203.

A registered user utterance, for example, a command, that causes the information processing apparatus 10 to execute processing related to the processing displayed in the processing category node is displayed in the acceptable-utterance display node 202. Note that a command is a user utterance that causes the information processing apparatus 10 to execute some processing.

As shown in fig. 5,

text data of the following user utterances (= commands) is displayed in the acceptable-utterance display nodes 202:

"Fast forward ten minutes";

"Return to the starting point"; and

"Play the moving image that everyone watched yesterday."

The user utterances displayed in the acceptable-utterance display nodes 202 are data recorded on the storage unit 170: for example, learning data recorded in advance (learning data in which the correspondence between a user utterance and its intention is recorded) or learning data generated by the learning processing unit 165 from past user utterances.

When the user utters an utterance matching an acceptable-utterance display node 202, the information processing apparatus 10 can accurately grasp the intention of the user utterance based on the learning data and reliably perform processing in accordance with the user utterance.

From the user's perspective, when the user reads out the acceptable-utterance display node 202 displayed on the display unit as it is, the user can be sure that the information processing apparatus 10 performs the processing intended by the user, and thus can utter an utterance without anxiety.

Note that the character string displayed in the acceptable-utterance display node 202 is a character string recorded as learning data. However, even in a case where the user utters an utterance including a character string that does not exactly match the recorded character string, the voice analysis unit 161 of the information processing apparatus 10 estimates the intention of the user utterance by referring to learning data including a close character string. Therefore, when the user utters an utterance close to the displayed data, the information processing apparatus 10 can perform accurate processing according to the user utterance.
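A minimal sketch of this close-string lookup, using standard-library string similarity, is shown below; the similarity cutoff is an assumption for illustration.

```python
import difflib

# Utterances (commands) assumed to be recorded as learning data.
learned_utterances = [
    "Fast forward ten minutes",
    "Return to the starting point",
    "Play the moving image that everyone watched yesterday",
]

def match_utterance(spoken: str, cutoff: float = 0.6) -> str | None:
    """Return the closest learned utterance, or None if nothing is close."""
    hits = difflib.get_close_matches(spoken, learned_utterances, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_utterance("Play the moving image everyone watched yesterday"))
# -> Play the moving image that everyone watched yesterday
```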

The display data of fig. 5 is displayed on the display unit. Next, description will be made with reference to fig. 6.

As shown in fig. 6, the user utters the following user utterance.

The user utterance is "Play songs of the 80s."

The information processing apparatus 10 performs speech recognition and semantic analysis of the user utterance, and grasps that the user's intention is "play songs of the 80s."

Based on the user utterance analysis information, the information processing apparatus 10 performs the processing (playing songs of the 80s).

Note that the song to be played is acquired from, for example, a server (a service providing server that provides music content) connected to a network.

Further, as shown in fig. 6, the display information generating unit 182 updates the display data on the display unit.

In the display data of fig. 6,

the following node is highlighted as a highlighted node 221:

"Play songs of 1999," which is an acceptable-utterance display node 202.

The user utterance "Play songs of the 80s" is similar to the utterance data "Play songs of 1999" in the node, which has been recorded as learning data, and

the voice analysis unit 161 of the information processing apparatus 10 can perform accurate voice recognition and semantic analysis by referring to the learning data in which the utterance data "Play songs of 1999" is recorded, and thus can reliably grasp that the user's intention is "play songs of the 80s." That is, "the 80s" can be acquired as an entity indicating an era, and as a result, songs of the 80s are played.

When the intention of the user utterance is grasped, the display information generating unit 182 of the information processing apparatus 10 highlights the following node as the highlighted node 221:

the node "Play songs of 1999," which is an acceptable-utterance display node 202 with a similar intention.

By viewing the display, the user can be confident that the user utterance has been correctly interpreted.

In addition, as shown in fig. 6, the user can grasp the degree of understanding of the information processing apparatus 10 and discover other usable utterances, as can be seen from the following user impression:

{The processing has been performed. Very good! It seems I can express various requests by changing the "1999" part.}

Next, description will be made with reference to fig. 7.

As shown in fig. 7, the user utters the following user utterance.

The user utterance is "Play the favorites list."

The information processing apparatus 10 performs speech recognition and semantic analysis of the user utterance, and grasps that the user's intention is "play the favorites list."

Based on the user utterance analysis information, the information processing apparatus 10 performs the processing (playing the favorites list).

Note that the favorite list and songs to be played are acquired from, for example, a server (a service providing server that provides music content) connected to a network.

Further, as shown in fig. 7, the display information generating unit 182 updates the display data on the display unit.

In the display data of fig. 7,

the following node is highlighted as a highlighted node 221:

"Play the favorites list," which is an acceptable-utterance display node 202.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system responds by "i am playing your favorite song. "

Note that during execution of processing in response to a user utterance (during playback of a song), the voice analysis unit 161, the image analysis unit 162, the sensor information analysis unit 163, and the user state estimation unit 164 of the input data analysis unit 160 estimate the state of the user (whether the user is satisfied or not, etc.) based on the user utterance, the image, the sensor information, and the like, and output the estimation information to the learning processing unit 165. The learning processing unit 165 performs processing such as generating, updating, or discarding learning data based on the information.

For example, in a case where it is estimated that the user is satisfied, the learning processing unit 165 determines that the grasping of the intention and the execution of the processing in response to the user utterance have been correctly performed, generates and updates learning data, and stores the learning data in the storage unit 170.

In the case where it is estimated that the user is not satisfied, the learning processing unit 165 determines that the grasping of the intention and the execution of the processing in response to the user utterance are not correctly executed, and does not generate or update the learning data. Alternatively, for example, the learning processing unit 165 discards the generated learning data.

Next, description will be made with reference to fig. 8.

As shown in fig. 8, the user utters the following user utterance.

The user utterance is "Add Souzan."

Note that "Souzan" is assumed to be the name of a famous artist.

It is assumed that the information processing apparatus 10 performs speech recognition and semantic analysis of a user utterance, but cannot interpret a user intention.

Such utterances that cannot interpret the user's intent are called "out-of-domain utterances" (OOD utterances).

Note that a user utterance whose intention can be interpreted and whose requested processing can be executed by the information processing apparatus 10 is referred to as an "in-domain utterance."

When the information processing apparatus 10 receives an input of such an OOD utterance, the output speech generation unit 181 generates a query response, and outputs the query response via the speech output unit 121. That is, as shown in fig. 8, the output speech generating unit 181 generates and outputs the following system response.

The system response is "Sorry, I don't understand 'Add Souzan.' Can you say it again?"

Further, as shown in fig. 8, the display information generating unit 182 displays the following guide information 222 in the lower right of the display unit.

I did not understand "Add Souzan." You can restate it within ten seconds.

After the display, the information processing apparatus 10 waits for 10 seconds.

Next, description will be made with reference to fig. 9.

As shown in fig. 9, the user utters the following user utterance as a restatement of "Add Souzan," which was regarded as an OOD utterance.

The user utterance (restatement) is "Play yesterday's Souzan song."

The information processing apparatus 10 performs speech recognition and semantic analysis of the user utterance, and

grasps that the intention of "Add Souzan," which was regarded as an OOD utterance, is "play Souzan songs," similar to the intention of "Play yesterday's Souzan song."

The learning processing unit 165 stores the result of this intention grasping as learning data in the storage unit 170.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates and outputs the following system response.

The system response is "good", i learn "add Souzan".

Further, as shown in fig. 9, the display information generating unit 182 updates the display data on the display unit.

A node indicating a user utterance whose intention has been successfully grasped is added as an additional node 231, and guide information 232 indicating that learning has been performed is further displayed.

Note that, as described above, the learning processing unit 165 performs processing such as generation, update, and discard of learning data based on the user state (whether the user is satisfied or not, etc.) estimated from the information input from the voice analyzing unit 161, the image analyzing unit 162, the sensor information analyzing unit 163, and the user state estimating unit 164 of the input data analyzing unit 160.

That is, in a case where it is estimated that the user is satisfied, the learning processing unit 165 determines that the grasping of the intention and the execution of the processing in response to the user utterance have been correctly performed, generates and updates the learning data, and stores the learning data in the storage unit 170. In the case where it is estimated that the user is not satisfied, the learning processing unit 165 determines that the grasping of the intention and the execution of the processing in response to the user utterance are not correctly executed, and does not generate or update the learning data. Alternatively, the learning processing unit 165 discards the generated learning data.

Next, description will be made with reference to fig. 10.

The user next wants to play a game and utters the following user utterance.

The user utterance is "Show commands (utterances) that I can use in the game."

Note that, as described above, a command is a user utterance that causes the information processing apparatus 10 to execute some processing.

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance. Based on the analysis result, the display information generation unit 182 updates the display data on the display unit, as shown in fig. 10.

As shown in fig. 10, the tree region in which the acceptable-utterance display nodes 202 (acceptable command nodes) set for the game domain are displayed is shown.

The user thinks that he/she wants to play a game with his/her friend, and searches for the best utterance (command) from the acceptable-utterance display node 202 (acceptable-command node).

The user finds the following node:

Node = "Send an invitation to my friend," and

utters the utterance displayed in the node.

As shown in fig. 11, the user utters the following user utterance.

The user utterance is "Send an invitation to my friend."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance, and based on the result thereof, the information processing apparatus 10 performs processing (sending an invitation email to a friend).

Note that the invitation email to the friend is transmitted directly from the information processing apparatus 10, for example, or is transmitted via a server (a service providing server that provides a game) connected to a network.

Further, as shown in fig. 11, the display information generating unit 182 updates the display data on the display unit.

When the intention of the user utterance is grasped, the display information generation unit 182 of the information processing apparatus 10 highlights the following node:

the node "Send an invitation to my friend," which is an acceptable-utterance display node 202 with a similar intention.

By viewing the display, the user can be confident that the user utterance has been correctly interpreted.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system responds by "i have sent an invitation to your general game friend. "

Next, description will be made with reference to fig. 12.

The user wants to play a moving image while playing the game, and utters the following user utterance.

The user utterance is "Play the moving image that everyone watched yesterday."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance. Based on the analysis result, the information processing apparatus 10 executes processing (playing a moving image).

Note that the moving image to be played is acquired from, for example, a server (a service providing server that provides moving image content) connected to a network.

Further, as shown in fig. 12, the display information generating unit 182 updates the display data on the display unit.

As shown in fig. 12, the following node is highlighted:

the node "Play the moving image that everyone watched yesterday," which is the acceptable-utterance display node in the video domain corresponding to the user utterance.

By viewing the display, the user can be confident that the user utterance has been correctly interpreted.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system responds by "i are playing comedy motion pictures that we have seen yesterday. "

Next, description will be made with reference to fig. 13.

In fig. 13, the user thinks as follows:

{I was able to perform these processes (four things) before, but I do not know whether I can do the same things again, and it would be troublesome to redo them one by one.}

These four things are the processes corresponding to the following four user utterances:

(1) "Play the favorites list" (fig. 7);

(2) "Add Souzan" (fig. 8);

(3) "Send an invitation to my friend" (fig. 11); and

(4) "Play the moving image that everyone watched yesterday" (fig. 12).

At this time, the input data analysis unit 160 of the information processing apparatus 10 determines from its analysis that the user is worried about something and seems dissatisfied. That is, based on the information input from the voice analysis unit 161, the image analysis unit 162, and the sensor information analysis unit 163, the user state estimation unit 164 generates user state estimation information 192 indicating that the user is worried about something and seems dissatisfied, and outputs the user state estimation information to the output information generation unit 180.

The output speech generating unit 181 of the output information generating unit 180 generates and outputs the following system utterance in response to the input of the user state estimation information 192.

The system utterance is "I can record together the utterances from 'Play the favorites list' to 'Play the moving image that everyone watched yesterday'."

Next, description will be made with reference to fig. 14.

As shown in fig. 14, the user utters the following user utterance in response to the system utterance.

The user utterance is "Remember this operation."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance. Based on the analysis result, the information processing apparatus 10 performs processing (processing of generating a "utterance collection list"). Further, the display information generating unit 182 updates the display data on the display unit, as shown in fig. 14.

As shown in fig. 14, the display unit displays an utterance collection list 231 in which a plurality of utterances are collected and listed.

The "utterance collection list" is data that lists a plurality of user utterances (commands).

That is, the user utterances recorded in the "utterance collection list" are user utterances corresponding to commands, that is, processing requests made by the user to the information processing apparatus 10.

The "utterance collection list" is generated in the learning processing unit 165.

In response to the user utterance "Remember this operation.",

the learning processing unit 165 generates an utterance collection list in which the following four user utterances are collected as a list, and stores the list as one piece of learning data in the storage unit 170:

(1) "Play the favorites list" (fig. 7);

(2) "Add Souzan" (fig. 8);

(3) "Send an invitation to my friend" (fig. 11); and

(4) "Play the moving image that everyone watched yesterday" (fig. 12).

For example, in a case where the user utters a user utterance included in the "utterance collection list" stored in the storage unit 170, or in a case where the user specifies the "utterance collection list" stored in the storage unit 170 and issues an utterance to request processing, the information processing apparatus 10 sequentially performs processing in accordance with the user utterance recorded in the "utterance collection list".

When the "utterance collection list" is generated in the learning processing unit 165, as shown in fig. 14, the display information generating unit 182 displays the generated "utterance collection list" 231 on the display unit.

When the user utters an utterance to specify the "utterance collection list" 231 from the next time, the user can cause the information processing apparatus to execute a plurality of processes recorded in the specified list together.

A processing example using the generated utterance collection list will be described with reference to fig. 15.

[3. processing example Using utterance Collection List ]

Next, a processing example using the utterance collection list will be described.

An example of processing using the "utterance collection list" 231 generated by the processing described above with reference to fig. 14 will be described.

First, when the information processing apparatus 10 starts, the display unit of the information processing apparatus 10 displays an initial screen shown in fig. 15.

This is the same as the display data described above with reference to fig. 4.

As shown in fig. 15, first, the user utters the following user utterance as a call to the information processing apparatus 10:

The user utterance is "Hey, Sonitaro."

In response to the call, the information processing apparatus 10 makes the following system response.

The system response is "What do you want to do? Here is what you can do."

In addition to the output of the above-described system response, the information processing apparatus 10 displays the display data of fig. 15 generated by the display information generation unit 182 on the image output unit (display unit) 122.

The display data of fig. 15 is data showing the domain correspondence node tree 200 described above with reference to fig. 4.

The user may have the following thought when viewing the display data.

{How do I do what I did before? I do not remember.}

Note that the "utterance collection list" 231 described with reference to fig. 14 is assumed to be generated by the previous day.

Next, description will be made with reference to fig. 16.

As shown in fig. 16, the user utters the following user utterances.

User utterances being "utterance collection list showing collected previous days"

The information processing apparatus 10 performs speech recognition and semantic analysis of the user utterance, and grasps that the user intention is "a request to display an utterance collection list generated by the previous day".

Based on the user utterance analysis information, the display information generation unit 182 of the information processing apparatus 10 displays the "utterance collection list" 231 on the display unit.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system response is "good", which is the collection list of utterances collected the day before. "

By viewing the utterance collection list 231 displayed on the display unit, the user can confirm again a series of four utterances and processing performed the previous day.

Next, description will be made with reference to fig. 17.

In fig. 17, the user sequentially utters utterances similar to those recorded in the utterance collection list 231 displayed on the display unit. That is, the user utters the following utterances in sequence:

(1) "Play the favorites list";

(2) "Add Souzan";

(3) "Send an invitation to my friend"; and

(4) "Play the moving image that everyone watched yesterday".

In this way, the information processing apparatus 10 can be made to reliably execute exactly the same processing as that executed on the previous day.

Alternatively, instead of uttering those utterances in turn, the user may utter one of the following utterances:

the user utterance is "process utterance collection list (2)"; and is

User utterances as "processing displayed utterance collection list"

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance. Based on the analysis result, the information processing apparatus 10 performs processing ("processing of the utterance collection list (2)"). That is, the information processing apparatus 10 sequentially executes processing corresponding to a plurality of user utterances recorded in the utterance collection list.

Note that the display information generation unit 182 of the information processing apparatus 10 changes the display mode of the utterance collection list 231 displayed on the display unit according to the execution state of the processing in the information processing apparatus 10.

Specifically, the display information generation unit 182 performs processing of highlighting a node (acceptable utterance display node) in the list corresponding to the processing currently performed by the information processing apparatus 10.
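The following is a minimal sketch of this sequential execute-and-highlight behavior; the function and callback names are hypothetical, not from the original disclosure.

```python
from typing import Callable, List

def execute_collection_list(
    utterances: List[str],
    execute_command: Callable[[str], None],
    highlight_node: Callable[[str], None],
) -> None:
    # For each utterance (command) recorded in the list, highlight the
    # corresponding node on the display, then execute the process.
    for utterance in utterances:
        highlight_node(utterance)   # update the display mode (figs. 18 to 21)
        execute_command(utterance)  # perform the corresponding process

execute_collection_list(
    ["Play the favorites list", "Add Souzan"],
    execute_command=lambda u: print("executing:", u),
    highlight_node=lambda u: print("highlighting node:", u),
)
```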

The highlighting process will be described with reference to fig. 18 and subsequent figures.

First, the information processing apparatus 10 starts the processing (the processing of playing the favorites list) based on the user utterance corresponding to the following node:

Node: "Play the favorites list", which is the first node recorded in the utterance collection list 231.

As shown in fig. 18, the display information generating unit 182 highlights the node that is recorded in the utterance collection list 231 and is currently being executed by the information processing apparatus 10, that is, the following node:

Node: "Play the favorites list".

By viewing the highlight display, the user can confirm that the information processing apparatus 10 is correctly performing the process of playing the favorite list.

Next, description will be made with reference to fig. 19.

As shown in fig. 19, the information processing apparatus 10 starts the processing (playing Souzan) based on the user utterance corresponding to the following node:

Node: "Add Souzan", which is the second node recorded in the utterance collection list 231.

Then, as shown in fig. 19, the display information generating unit 182 highlights the node that is recorded in the utterance collection list 231 and is currently being executed by the information processing apparatus 10, that is, the following node:

Node: "Add Souzan".

By viewing the highlight display, the user can confirm that the information processing apparatus 10 is correctly performing the process of playing Souzan.

Next, description will be made with reference to fig. 20.

As shown in fig. 20, the information processing apparatus 10 starts the processing (sending an invitation email to a friend) based on the user utterance corresponding to the following node:

Node: "Send an invitation to my friend", which is the third node recorded in the utterance collection list 231.

Then, as shown in fig. 20, the display information generating unit 182 highlights the node that is recorded in the utterance collection list 231 and is currently being executed by the information processing apparatus 10, that is, the following node:

Node: "Send an invitation to my friend".

By viewing the highlight display, the user can confirm that the information processing apparatus 10 is correctly performing the process of sending the invitation email to the friend.

Next, description will be made with reference to fig. 21.

As shown in fig. 21, the information processing apparatus 10 starts the processing (playing the moving image that everyone watched yesterday) based on the user utterance corresponding to the following node:

Node: "Play the moving image that everyone watched yesterday", which is the fourth node recorded in the utterance collection list 231.

Then, as shown in fig. 21, the display information generating unit 182 highlights the node that is recorded in the utterance collection list 231 and is currently being executed by the information processing apparatus 10, that is, the following node:

Node: "Play the moving image that everyone watched yesterday".

By viewing the highlight display, the user can confirm that the information processing apparatus 10 is correctly performing the process of playing the moving image that everyone watched yesterday.

The "utterance collection list" can be freely created by the user, and processing can be performed by using the created list, so that the information processing apparatus 10 safely performs a plurality of processes at once or in sequence.

Furthermore, a "speech collection list" created by another user may also be used.

Fig. 22 shows an example of displaying the utterance collection list 232 generated by the user ABC as another user.

The user utters the following user utterance.

The user utterance is "Display Mr. ABC's public utterance collection list."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance, and based on the result thereof, the information processing apparatus 10 performs the processing (acquiring and displaying Mr. ABC's public utterance collection list).

As shown in fig. 22, the display information generating unit 182 updates the display data on the display unit.

That is, Mr. ABC's public utterance collection list 232 is displayed.

For example, the utterance collection lists of a large number of users are stored in a storage unit of a server accessible to the information processing apparatus 10.

For each utterance collection list, whether to disclose the list can be set, and only the lists set to "public" can be acquired and displayed in response to a request from another user.
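A minimal sketch of this visibility rule follows; the field names (owner, public, utterances) are assumptions for illustration.

```python
from typing import Dict, List

def fetch_visible_lists(
    stored_lists: List[Dict], requesting_user: str
) -> List[Dict]:
    # A list is returned to another user only when its owner set it to
    # "public"; a user's own lists are always visible to that user.
    return [
        lst for lst in stored_lists
        if lst["public"] or lst["owner"] == requesting_user
    ]

stored_lists = [
    {"owner": "ABC", "public": True,
     "utterances": ["Play the favorites list", "Display comedy movies"]},
    {"owner": "XYZ", "public": False, "utterances": ["Add Souzan"]},
]
print(fetch_visible_lists(stored_lists, requesting_user="DEF"))
# -> only Mr. ABC's public list
```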

As shown in fig. 22, the public utterance collection list of another user displayed on the display unit can thereafter be stored in the storage unit 170 as a list that can be used at any time by the user who called it up.

Further, as shown in fig. 23, it is also possible to acquire, display, and use, for example, a network public utterance collection list 233, which is a public utterance collection list provided by a game-dedicated network managed by a game-dedicated server.

Further, as shown in fig. 24, it is also possible to acquire, display, and use, for example, a blog public utterance collection list 234, which is a public utterance collection list disclosed on a blog.

[4. other examples of displaying and generating a Collection List of utterances ]

Next, other processing examples of displaying and generating the utterance collection list different from the above-described embodiment will be described.

These processing examples will be described with reference to fig. 25 and subsequent drawings.

Fig. 25 shows an initial screen displayed on the display unit of the information processing apparatus 10 when the information processing apparatus 10 starts.

This is the same as the display data described above with reference to fig. 4.

As shown in fig. 25, first, the user utters the following user utterance as a call to the information processing apparatus 10.

The user utterance is "hey, Sonitaro. "

In response to the call, the information processing apparatus 10 makes the following system response.

The system response is "What do you want to do? Here is what you can do."

In addition to outputting the above-described system response, the information processing apparatus 10 displays the display data of fig. 25 generated by the display information generation unit 182 on the image output unit (display unit) 122.

The display data of fig. 25 is data showing the domain correspondence node tree 200 described above with reference to fig. 4.

The user may have the following idea when viewing the display data.

{What did I say at the beginning? I told Sonitaro "Play the favorites list"!}

Next, description will be made with reference to fig. 26.

As shown in fig. 26, the user utters the following user utterances.

User utterance being "play favorite List"

The information processing apparatus 10 performs speech recognition and semantic analysis of the user utterance, and grasps that the user intention is "request to play a favorite list".

Further, the learning processing unit 165 of the information processing apparatus 10 receives the voice analysis result and searches the storage unit 170 to determine whether an "utterance collection list" in which the following user utterance is registered is stored:

The user utterance is "Play the favorites list".

As a result, it is detected that the "utterance collection list" described above with reference to fig. 14 is stored in the storage unit 170. That is, it is detected that the "utterance collection list" in which the following user utterances are recorded is stored in the storage unit 170:

(1) "Play favorites List";

(2) "Add Souzan";

(3) "invite to my friends"; and is

(4) "play the motion picture that everyone looked yesterday. ".

Based on the detection result, the display information generation unit 182 of the information processing apparatus 10 performs a process of displaying the "utterance collection list" stored in the storage unit 170 on the display unit.

First, as shown in fig. 26, the display information generating unit 182 starts moving the nodes corresponding to the user utterances recorded in the "utterance collection list", that is, the utterance collection list corresponding nodes 241 of fig. 26.

Then, as shown in fig. 27, an utterance collection list 242 including those nodes is displayed.

By viewing this display, the user can confirm that there is an "utterance collection list" 242 that includes the previously uttered user utterance, i.e., the following:

The user utterance is "Play the favorites list".

Further, by referring to the displayed "utterance collection list" 242, the user can cause the information processing apparatus 10 to reliably perform the same series of processes as the plurality of processes that were previously performed.

Further, an example in which the learning processing unit 165 of the information processing apparatus 10 autonomously determines whether to execute the process of generating the utterance collection list and executes the process of generating the utterance collection list will be described with reference to fig. 28 and subsequent figures.

First, as shown in fig. 28, the user utters the following user utterance.

The user utterance is "Play Happy Birthday."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance, and grasps that the user intention is "a request to play Happy Birthday".

Based on the user utterance analysis information, the information processing apparatus 10 performs the processing (playing Happy Birthday). Further, the display information generating unit 182 updates the display data on the display unit, as shown in fig. 28.

In the display data of fig. 28, the following node is highlighted as the highlight node 221:

"Play Happy Birthday", which is an acceptable utterance display node 202.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system responds as "i am playing birthday happy. "

Then, as shown in fig. 29, the user utters the following user utterance.

The user utterance is "Play a movie that uses that song."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance, and grasps that the user intention is "a request to play a movie that uses Happy Birthday".

Based on the user utterance analysis information, the information processing apparatus 10 performs the processing (playing a movie that uses Happy Birthday). Further, the display information generating unit 182 updates the display data on the display unit, as shown in fig. 29.

In the display data of fig. 29, the following node is highlighted as the highlight node 221:

"Play a movie that uses 'Happy Birthday'", which is an acceptable utterance display node 202.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system responds by "i will play the movie happy life. "

Further, in fig. 30, the learning processing unit 165 of the information processing apparatus 10 verifies the history of user utterances:

The user utterance is "Play Happy Birthday."

The user utterance is "Play a movie that uses that song."

The learning processing unit 165 confirms that, of the two user utterances, the second user utterance includes the indicator "that song" referring to the first user utterance, and determines that the two user utterances have a strong relationship.

Based on this determination of the relationship, the learning processing unit 165 determines that an utterance collection list including the two user utterances should be generated.
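The following is a heuristic sketch of this indicator-based relationship check; the indicator vocabulary is an assumption, since the disclosure only gives examples such as "that song" and "this movie".

```python
from typing import Tuple

# Illustrative indicator vocabulary.
INDICATORS: Tuple[str, ...] = ("that song", "this movie", "that", "this")

def strongly_related(first_utterance: str, second_utterance: str) -> bool:
    # The two utterances are treated as strongly related when the
    # second contains an indicator referring back to the first.
    # (Only the second utterance is inspected in this simplification.)
    text = second_utterance.lower()
    return any(indicator in text for indicator in INDICATORS)

# Example from figs. 28 to 30:
assert strongly_related("Play Happy Birthday",
                        "Play a movie that uses that song")
```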

As shown in fig. 30, the information processing apparatus 10 outputs the following system utterance even without an explicit request from the user.

The system utterance is "I can record together the utterances from 'Play Happy Birthday' to 'Play a movie that uses that song'."

Next, description will be made with reference to fig. 31.

As shown in fig. 31, the user utters the following user utterance in response to the system utterance.

The user utterance is "Remember this operation."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance. Based on the analysis result, the information processing apparatus 10 performs processing (processing of generating a "utterance collection list"). Further, the display information generating unit 182 updates the display data on the display unit, as shown in fig. 31.

As shown in fig. 31, the display unit displays an utterance collection list 261 in which a plurality of utterances are collected and listed.

The "utterance collection list" 261 of fig. 31 is a list that collects two kinds of user utterances:

the user's words are "play happy birthday"; and is

The user utterance is "play a movie using the song".

The "utterance collection list" is generated in the learning processing unit 165.

This operation is "remembered" in response to a user utterance. "

The learning processing unit 165 generates a speech collection list in which the following two user utterances are collected as a list and the list is stored as one piece of learning data in the storage unit 170:

(1) "play birthday happy"; and is

(2) "play a movie using the song".

The user can later reliably perform the same series of processes by using the utterance collection list.

The processing described with reference to figs. 28 to 31 is a processing example in which it is confirmed that, of the following two user utterances:

a first user utterance: "Play Happy Birthday."; and

a second user utterance: "Play a movie that uses that song.",

the second user utterance includes the indicator "that song" referring to the first user utterance, the two user utterances are determined to have a strong relationship, and, as a result of the determination, an utterance collection list is generated.

Next, a processing example of generating an utterance collection list in a case where the order of the two user utterances is reversed, that is, in a case where a request to play a movie is made first and a request to play a song used in the movie is made afterward, will be described with reference to fig. 32 and subsequent figures.

First, as shown in fig. 32, the user utters the following user utterance.

User's words as "playing happy life"

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance, and grasps that the user intention is "a request to play a movie happy life".

Based on the user utterance analysis information, the information processing apparatus 10 performs processing (playing a movie happy life). Further, the display information generating unit 182 updates the display data on the display unit, as shown in fig. 32.

In the display data of fig. 32, the following node is highlighted as the highlight node 221:

"Play Happy Life", which is an acceptable utterance display node 202.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system responds by "i will play the movie happy life. "

Then, as shown in fig. 33, the user utters the following user utterance.

The user utterance is "play the theme tune in the movie". "

First, the image analysis unit 162 of the information processing apparatus 10 analyzes the line-of-sight information of the user, and confirms that the user is viewing a movie happy life. Further, the voice analysis unit 161 performs voice recognition and semantic analysis of the user utterance, and grasps that the user intention is "request to play a theme tune in movie happy life".

Based on the user utterance analysis information, the information processing apparatus 10 performs processing (playing a theme tune in movie happy life — birthday happy). Further, the display information generating unit 182 updates the display data on the display unit, as shown in fig. 33.

In the display data of fig. 33, the following node is highlighted as the highlight node 221:

"Play Happy Birthday", which is an acceptable utterance display node 202.

Further, the output voice generating unit 181 of the information processing apparatus 10 generates the following system response, and outputs the system response via the voice output unit 121.

The system responds as "i am playing birthday happy. "

Further, in fig. 34, the learning processing unit 165 of the information processing apparatus 10 verifies the history of user utterances:

The user utterance is "Play Happy Life."

The user utterance is "Play the theme tune of this movie."

The learning processing unit 165 confirms that, of the two user utterances, the second user utterance includes the indicator "this movie" referring to the first user utterance.

Further, the learning processing unit 165 confirms, based on the analysis result of the image analysis unit 162, that the user is viewing the movie Happy Life, and determines that the above-described two user utterances have a strong relationship.

Based on the determination of the relationship, the learning processing unit 165 determines that an utterance collection list including two user utterances should be generated.

As shown in fig. 34, the information processing apparatus 10 outputs the following system utterance even without an explicit request from the user.

The system utterance is "I can record together the utterances from 'Play Happy Life' to 'Play the theme tune of this movie'."

Next, description will be made with reference to fig. 35.

As shown in fig. 35, the user utters the following user utterance in response to the system utterance.

The user utterance is "Remember this operation."

The voice analysis unit 161 of the information processing apparatus 10 performs voice recognition and semantic analysis of the user utterance. Based on the analysis result, the information processing apparatus 10 performs processing (processing of generating a "utterance collection list"). Further, the display information generating unit 182 updates the display data on the display unit, as shown in fig. 35.

As shown in fig. 35, the display unit displays an utterance collection list 262 in which a plurality of utterances are collected and listed.

The "utterance collection list" 262 of fig. 35 is a list that collects two user utterances:

the user utterance is "happy life"; and is

The user's utterance is "play birthday happy".

The "utterance collection list" is generated in the learning processing unit 165.

In response to the user utterance "Remember this operation.",

the learning processing unit 165 generates an utterance collection list in which the following two user utterances are collected as a list, and stores the list as one piece of learning data in the storage unit 170:

(1) "Play Happy Life"; and

(2) "Play Happy Birthday".

By using this utterance collection list, the user can later reliably perform the same series of processing.

As described above, the learning processing unit 165 of the information processing apparatus 10 of the present disclosure generates the utterance collection list according to various conditions.

Examples of conditions under which the learning processing unit 165 generates an utterance collection list and stores it in the storage unit 170 are, for example, as follows.

(1) The learning processing unit 165 inquires of the user whether to generate a utterance collection list, generates the utterance collection list with the user's consent, and stores the utterance collection list in the storage unit 170.

(2) In the case where the learning processing unit 165 determines that the plurality of processes corresponding to the plurality of user utterances have been successfully performed, the learning processing unit 165 generates an utterance collection list and stores the utterance collection list in the storage unit 170.

(3) In the case where the combination of the plurality of user utterances is equal to or greater than the predetermined threshold number of times, the learning processing unit 165 generates an utterance collection list, and stores the utterance collection list in the storage unit 170.

For example, suppose that the threshold is set to three times and the following two user utterances are combined:

The user utterance is "Play the favorites list"; and

The user utterance is "Display comedy movies".

In a case where this combination has been input three times, the learning processing unit 165 generates an utterance collection list including the combination of the above-described two utterances, and stores the utterance collection list in the storage unit 170.
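A minimal sketch of this threshold rule is shown below; the class and method names are illustrative assumptions.

```python
from collections import Counter
from typing import List

class CombinationTracker:
    # Count how often a combination of user utterances is input; once
    # the count reaches the threshold, an utterance collection list
    # should be generated.

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def observe(self, utterances: List[str]) -> bool:
        key = tuple(utterances)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold  # True -> generate list

tracker = CombinationTracker(threshold=3)
combo = ["Play the favorites list", "Display comedy movies"]
assert not tracker.observe(combo)  # first time
assert not tracker.observe(combo)  # second time
assert tracker.observe(combo)      # third time reaches the threshold
```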

(4) The learning processing unit 165 analyzes the presence or absence of an indicator indicating a relationship between utterances included in a plurality of user utterances, generates an utterance collection list based on the analysis result, and stores the utterance collection list in the storage unit 170.

This corresponds to the processing example described above with reference to fig. 28 to 31.

(5) The learning processing unit 165 analyzes a user state regarding processing performed by the information processing apparatus 10 in response to a user utterance, generates an utterance collection list based on the analysis result, and stores the utterance collection list in the storage unit 170.

As described above, the voice analysis unit 161, the image analysis unit 162, the sensor information analysis unit 163, and the user state estimation unit 164 of the input data analysis unit 160 estimate the state of the user (whether the user is satisfied or not, etc.) based on the user utterance, the image, the sensor information, and the like, and output the estimation information to the learning processing unit 165. The learning processing unit 165 performs processing such as generation, update, or discard of learning data based on the information.

For example, in a case where it is estimated that the user is satisfied, the learning processing unit 165 determines that the grasping of the intention and the execution of the processing in response to the user utterance have been correctly performed, generates and updates learning data, and stores the learning data in the storage unit 170.

In the case where it is estimated that the user is not satisfied, the learning processing unit 165 determines that the grasping of the intention and the execution of the processing in response to the user utterance are not correctly executed, and does not generate or update the learning data. Alternatively, for example, the learning processing unit 165 discards the generated learning data.
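The following sketch illustrates this satisfaction-gated handling of learning data; it is a simplification under assumed names, not the disclosed implementation.

```python
from typing import List

def update_learning_data(
    user_satisfied: bool,
    new_list: List[str],
    stored_lists: List[List[str]],
) -> None:
    # Keep the generated learning data (here, an utterance collection
    # list) only when the user is estimated to be satisfied; otherwise
    # do not store it, and discard it if already present.
    if user_satisfied:
        # Intention grasping and processing judged correct.
        if new_list not in stored_lists:
            stored_lists.append(new_list)
    else:
        # Judged incorrect: discard the generated learning data.
        if new_list in stored_lists:
            stored_lists.remove(new_list)
```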

(6) The learning processing unit 165 selects a user utterance to be collected according to the context information, generates an utterance collection list, and stores the utterance collection list in the storage unit 170.

This is an example in which processing such as generation, update, or discarding of learning data is performed, similarly to the above-described example, based on context information indicating the user state obtained from the analysis results of, for example, the voice analysis unit 161, the image analysis unit 162, and the sensor information analysis unit 163 of the input data analysis unit 160.

For example, the learning processing unit 165 selects only the processes estimated to be required by the user according to the state of the user, for example, a state in which the user is cooking, a state in which the user is playing a game, or a state in which the user is listening to music, generates an utterance collection list, and stores the utterance collection list in the storage unit 170.

Note that the context information is not limited to the behavior information of the user, and may be various environment information, for example, time information, weather information, and location information.

For example, in the case where the time period is daytime, the learning processing unit 165 generates a list including only user utterances corresponding to requests for processing that may be performed during the daytime.

For example, in the case where the time period is evening, the learning processing unit 165 generates a list including only user utterances corresponding to requests for processing that may be performed in evening.
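The following sketch illustrates such time-of-day filtering; the utterances and time windows are hypothetical examples, not taken from the disclosure.

```python
from typing import Dict, List, Tuple

def select_for_time_of_day(
    candidates: Dict[str, Tuple[int, int]], hour: int
) -> List[str]:
    # Keep only the user utterances whose allowed time window
    # (start_hour, end_hour) contains the current hour.
    return [
        utterance
        for utterance, (start, end) in candidates.items()
        if start <= hour < end
    ]

candidates = {
    "Open the curtains": (6, 18),  # processing likely performed in daytime
    "Dim the lights": (18, 24),    # processing likely performed in evening
}
print(select_for_time_of_day(candidates, hour=10))  # -> ["Open the curtains"]
```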

[5. processing sequence executed by information processing apparatus ]

Next, a processing sequence executed by the information processing apparatus 10 will be described with reference to flowcharts in fig. 36 and subsequent drawings.

The processing according to the flowcharts in fig. 36 and the subsequent drawings is executed according to, for example, a program stored in a storage unit of the information processing apparatus 10. For example, these processes may be executed as program execution processes by a processor (e.g., CPU) having a program execution function.

First, the overall sequence of processing performed by the information processing apparatus 10 will be described with reference to the flowchart of fig. 36.

The processing in each step in the flow of fig. 36 will be described.

(step S101)

First, in step S101, the information processing apparatus 10 inputs and analyzes voice, image, and sensor information.

This processing is processing performed by the input unit 110 and the input data analysis unit 160 of the information processing apparatus 10 of fig. 3.

In step S101, speech recognition and semantic analysis of the user utterance speech are performed to acquire the intention of the user utterance, and further the user state (whether the user is satisfied or not, etc.) based on the user utterance speech, the image, the sensor information, and the like is acquired.

Details of this processing will be described later with reference to the flow in fig. 37.

(Steps S102 to S103)

Then, in steps S102 to S103, the information processing apparatus 10 analyzes the content of the user utterance (command (processing request)), and determines whether the processing corresponding to the user utterance is executable (in the domain) or non-executable (out of the domain: OOD).

In the case where the process is not executable (out of domain: OOD), the process terminates.

Note that at this time, the user may be notified that the process cannot be performed, or may be provided with a system response requesting restatement.

Meanwhile, in a case where it is determined that the processing corresponding to the user utterance is executable (in the domain), the processing proceeds to step S104.

(step S104)

Then, in step S104, the information processing apparatus 10 records the user utterance determined to be executable (in the domain) in the storage unit 170.

(step S105)

Then, in step S105, the information processing apparatus 10 highlights the node corresponding to the user utterance in the domain-corresponding node tree displayed on the image output unit (display unit) 122.

This is, for example, the process of displaying the highlighted node 221 described above with reference to fig. 7.

This processing is processing executed by the display information generation unit 182 of the information processing apparatus 10 in fig. 3.

(step S106)

Then, in step S106, the information processing apparatus 10 performs processing corresponding to the user utterance, that is, processing corresponding to the node highlighted in step S105.

Specifically, for example, in the example of fig. 7, the user utterance is "Play the favorites list", and songs included in the user's pre-registered favorites list are therefore played.

Note that the favorite list and the songs to be played are acquired from, for example, a server (a service providing server that provides music content) connected to a network.

(Steps S107 to S108)

Then, in steps S107 to S108, the information processing apparatus 10 estimates whether the process corresponding to the user utterance (command) has been successfully performed based on the user state (satisfaction, dissatisfaction, etc.) estimated from the analysis result of the input information (voice, image, and sensor information), and determines whether the process of collecting a plurality of utterances is performed based on the estimation result.

This is processing performed by the learning processing unit 165 of the information processing apparatus 10 in fig. 3.

That is, the learning processing unit 165 generates the utterance collection list described with reference to fig. 14 and the like, and stores the utterance collection list in the storage unit 170.

For example, in a case where the following condition is satisfied:

(1) a plurality of user utterances (commands) are input at intervals within a specified time,

the learning processing unit 165 outputs, as described with reference to fig. 13, a system utterance indicating that an "utterance collection list" can be generated.
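A minimal sketch of this timing condition follows; the 30-second interval is an assumed value for illustration.

```python
from typing import List

def within_collection_window(
    timestamps: List[float], max_interval_seconds: float = 30.0
) -> bool:
    # A series of commands is a candidate for an utterance collection
    # list when each command follows the previous one within a
    # specified time interval.
    return all(
        later - earlier <= max_interval_seconds
        for earlier, later in zip(timestamps, timestamps[1:])
    )

# Four commands uttered 10 to 20 seconds apart qualify:
assert within_collection_window([0.0, 12.0, 25.0, 44.0])
```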

Further, in a case where the user agrees, as shown in fig. 14, it is determined that the "utterance collection list" is to be generated (yes in step S108), and the processing proceeds to step S109.

Meanwhile, in a case where the user disagrees, it is determined that the "utterance collection list" is not generated (no in step S108), and the process is terminated.

(step S109)

In a case where it is determined in step S108 that the "utterance collection list" is generated (yes in step S108) and the process proceeds to step S109, the learning processing unit 165 of the information processing apparatus 10 generates the "utterance collection list".

Specifically, this is, for example, the utterance collection list 231 of fig. 14.

The example of fig. 14 shows an utterance collection list in which the following four user utterances are collected as a list:

(1) "Play the favorites list";

(2) "Add Souzan";

(3) "Send an invitation to my friend"; and

(4) "Play the moving image that everyone watched yesterday".

The learning processing unit 165 of the information processing apparatus 10 stores the list as one piece of learning data in the storage unit 170.

In the case where the learning processing unit 165 generates the "utterance collection list", as shown in fig. 14, the display information generating unit 182 displays the generated "utterance collection list" on the display unit.

When the user utters an utterance to specify the "utterance collection list" 231 from the next time, the user can cause the information processing apparatus to execute a plurality of processes recorded in the specified list together.

For example, in a case where the user utters a user utterance included in the "utterance collection list" stored in the storage unit 170, or in a case where the user specifies the "utterance collection list" stored in the storage unit 170 and issues an utterance to request processing, the information processing apparatus 10 sequentially performs processing in accordance with the user utterance recorded in the "utterance collection list".

Next, the details of the processing in step S101 in the flowchart of fig. 36, that is, the details of the processing of inputting and analyzing voice, image, and sensor information, will be described with reference to the flowchart of fig. 37.

This processing is processing performed by the input unit 110 and the input data analysis unit 160 of the information processing apparatus 10 of fig. 3.

In step S101, speech recognition and semantic analysis of the user utterance speech are performed to acquire the intention of the user utterance, and further the user state (whether the user is satisfied or not, etc.) based on the user utterance speech, the image, the sensor information, and the like is acquired.

The input unit 110 includes a voice input unit (microphone) 111, an image input unit (camera) 112, and a sensor 113, and acquires user speech, user images, and sensor acquisition information (line of sight, body temperature, heart rate, pulse, brain wave, and the like of the user).

The voice analysis unit 161, the image analysis unit 162, the sensor information analysis unit 163, and the user state estimation unit 164 of the input data analysis unit 160 perform analysis of input data.

Processing in each step in the flow of fig. 37 will be described.

(step S201)

First, in step S201, the voice input unit (microphone) 111, the image input unit (camera) 112, and the sensor 113 of the input unit 110 acquire a user utterance voice, a user image, and sensor acquisition information (a line of sight, a body temperature, a heart rate, a pulse, brain waves, and the like of the user).

In steps S202 to S205, the voice information acquired by the voice input unit (microphone) 111 is processed.

The image information acquired by the image input unit (camera) 112 is processed in steps S206 and S207.

In step S208, the sensor information acquired by the sensor 113 is processed.

These processes may be performed in parallel.

(Steps S202 to S203)

Steps S202 to S203 are processing performed by the voice analysis unit 161.

For example, in step S202, the speech analysis unit 161 converts speech data into text data including a plurality of words by an Automatic Speech Recognition (ASR) function.

Further, in step S203, the speech analysis unit 161 performs utterance semantic analysis processing on the text data. For example, the speech analysis unit 161 estimates the intention of the user utterance and an entity that is a meaningful element (important element) included in the utterance from the text data by applying a natural language understanding function such as Natural Language Understanding (NLU).

The processing in step S102 in the flow of fig. 36 is executed by using the result of this semantic analysis.
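The following sketch outlines the ASR-then-NLU pipeline of steps S202 to S203; both engine functions are placeholders, since the disclosure does not specify particular ASR or NLU implementations.

```python
from typing import Dict, Tuple

def automatic_speech_recognition(audio: bytes) -> str:
    # Placeholder for the ASR engine of step S202; a real system
    # would invoke a speech recognizer here.
    return "play the favorites list"

def natural_language_understanding(text: str) -> Tuple[str, Dict[str, str]]:
    # Placeholder for the NLU engine of step S203: estimate the
    # intent of the utterance and its entities (meaningful elements).
    return "PLAY_MUSIC", {"target": "favorites list"}

def analyze_user_speech(audio: bytes) -> Tuple[str, str, Dict[str, str]]:
    text = automatic_speech_recognition(audio)                # step S202
    intent, entities = natural_language_understanding(text)   # step S203
    return text, intent, entities
```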

(Steps S204 to S205)

The processing in steps S204 to S205 is also processing performed by the voice analysis unit 161.

The voice analysis unit 161 acquires information (non-language information) necessary for the emotion analysis processing of the user based on the voice of the user, and outputs the acquired information to the user state estimation unit 164.

The non-language information is, for example, information obtained from user's voice other than text data, such as pitch, tone, intonation, and trembling of the voice, and is information that can be used to analyze a user's state, such as excitement or tension. This information is output to the user state estimation unit 164.

(step S206)

The processing in step S206 is processing performed by the image analysis unit 162.

The image analysis unit 162 analyzes the facial expression, gesture, and the like of the user captured by the image input unit 112, and outputs the analysis result to the user state estimation unit 164.

(step S207)

The processing in step S207 is processing performed by the image analysis unit 162 or the sensor information analysis unit 163.

The image analysis unit 162 or the sensor information analysis unit 163 analyzes the line of sight of the user based on the user image or the sensor information captured by the image input unit 112.

Specifically, for example, the image analysis unit 162 or the sensor information analysis unit 163 acquires line-of-sight information or the like for analyzing the degree of attention to the processing performed by the information processing apparatus 10, for example, whether the user is viewing a moving image that the information processing apparatus 10 has started playing. This information is output to the user state estimation unit 164.

(step S208)

The processing in step S208 is processing performed by the sensor information analysis unit 163.

The sensor information analysis unit 163 acquires information (the user's line of sight, body temperature, heart rate, pulse, brain wave, and the like) acquired by the sensor 113, and outputs the acquired information to the user state estimation unit 164.

(step S210)

The processing in step S210 is processing performed by the user state estimation unit 164.

The user state estimation unit 164 receives the following data inputs, estimates the state of the user, and generates the user state estimation information 192 of fig. 3:

an analysis result of the voice analysis unit 161, that is, information (non-language information) necessary for the user emotion analysis processing based on the voice of the user;

analysis results of the image analysis unit 162, i.e., analysis information, such as facial expressions, gestures, and sight line information of the user; and

the analysis result of the sensor information analysis unit 163, that is, data of the user such as line of sight, body temperature, heart rate, pulse, and brain wave.

This information is used later in the processing of step S102 and the processing of step S107 in the flow of fig. 36.

Note that the user state estimation information 192 generated by the user state estimation unit 164 is specifically, for example, information that estimates whether the user is satisfied, that is, whether the user is satisfied with the processing that the information processing apparatus performs on the user utterance.

In the case where the user satisfaction is estimated, it is estimated that the processing performed by the information processing apparatus in response to the user utterance is correct, that is, the processing has been successfully performed.

The learning processing unit 165 performs learning processing on the user utterance and stores learning data in the storage unit 170. For example, in a case where a new user utterance is input and the intention of the user utterance is unknown, the intention is analyzed based on subsequent interaction with the apparatus, and the learning processing unit 165 performs a process of generating learning data in which the user utterance is associated with the intention and storing the learning data in the storage unit 170.

By performing such learning processing, accurate grasping of the user utterance intention can be gradually achieved.

Further, in step S107 of fig. 36 described above, the learning processing unit 165 also performs processing of generating a "utterance collection list" that collects a plurality of user utterances and storing the utterance collection list in the storage unit 170.

Next, a sequence showing an example of a process of displaying and using the utterance collection list will be described with reference to a flowchart of fig. 38.

The processing in each step of the flowchart in fig. 38 will be described in turn.

(Steps S301 to S304)

The processing in steps S301 to S304 is similar to the processing in steps S101 to S104 described above with reference to the flow of fig. 36.

That is, first, in step S301, the information processing apparatus 10 inputs and analyzes voice, image, and sensor information.

This processing is the processing described with reference to fig. 37, and is processing of performing speech recognition and semantic analysis of the user utterance speech to acquire the intention of the user utterance, and further acquiring the state of the user (whether the user is satisfied or not, etc.) based on the user utterance speech, images, sensor information, and the like.

Then, in steps S302 to S303, the information processing apparatus 10 analyzes the content of the user utterance (command (processing request)), and determines whether the processing corresponding to the user utterance is executable (in the domain) or non-executable (out of the domain: OOD).

In the case where the process is not executable (out of domain: OOD), the process terminates.

Meanwhile, in a case where it is determined that the processing corresponding to the user utterance is executable (in the domain), the processing proceeds to step S304.

Then, in step S304, the information processing apparatus 10 records the user utterance determined to be executable (in the domain) in the storage unit 170.

(step S305)

Then, in step S305, the information processing apparatus determines whether there is an utterance collection list including an utterance corresponding to the user utterance.

This processing is processing performed by the output information generation unit 180 in fig. 3.

The output information generation unit 180 searches in the storage unit 170 to determine whether there is an utterance collection list including utterances corresponding to the user utterances.

In the case where there is no utterance collection list including utterances corresponding to the user utterances, the process proceeds to step S306.

Meanwhile, in a case where there is an utterance collection list including an utterance corresponding to the user utterance, the process proceeds to step S308.
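A minimal sketch of the search in step S305 follows; exact string matching stands in for the semantic matching the apparatus would actually perform, and the field name is an assumption.

```python
from typing import Dict, List

def find_lists_containing(
    stored_lists: List[Dict], user_utterance: str
) -> List[Dict]:
    # Return every stored utterance collection list that records an
    # utterance matching the user utterance.
    return [
        lst for lst in stored_lists
        if user_utterance in lst["utterances"]
    ]
```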

(Steps S306 to S307)

In the case where it is determined in step S305 that there is no utterance collection list including utterances corresponding to the user utterances, in step S306, nodes corresponding to the user utterances in the domain corresponding node tree displayed on the image output unit (display unit) 122 are highlighted.

This is, for example, the process of displaying the highlighted node 221 described above with reference to fig. 7.

This processing is processing executed by the display information generation unit 182 of the information processing apparatus 10 in fig. 3.

Further, in step S307, processing corresponding to the user utterance, that is, processing corresponding to the node highlighted in step S306 is performed.

(step S308)

Meanwhile, in a case where it is determined in step S305 that there is an utterance collection list including an utterance corresponding to the user utterance, in step S308, the utterance collection list is displayed on the image output unit (display unit) 122.

This is, for example, the process of displaying the utterance collection list 231 described above with reference to fig. 14 and the like.

This processing is processing executed by the display information generation unit 182 of the information processing apparatus 10 in fig. 3.

(step S309)

Then, in step S309, processing corresponding to the user utterance, that is, processing corresponding to the user utterance corresponding node listed in the utterance collection list 231 displayed in step S308 is sequentially performed.

Further, a process of highlighting the currently executed user utterance corresponding node in the displayed utterance collection list 231 is performed.

This process corresponds to the process described above with reference to fig. 18 to 21.

This processing is processing executed by the display information generation unit 182 of the information processing apparatus 10 in fig. 3.

Next, a processing sequence using the external utterance collection list described above with reference to fig. 22 to 24 (i.e., an utterance collection list of another person, a network disclosure list, a blog disclosure list, etc.) in the absence of an utterance collection list created by the user will be described with reference to flowcharts in fig. 39 and 40.

The processing in each step of the flowcharts in fig. 39 and 40 will be described in turn.

(Steps S401 to S404)

The processing in steps S401 to S404 is similar to the processing in steps S101 to S104 described above with reference to the flow of fig. 36.

That is, first, in step S401, the information processing apparatus 10 inputs and analyzes voice, image, and sensor information.

This processing is the processing described with reference to fig. 37, and is processing of performing speech recognition and semantic analysis of the user utterance speech to acquire the intention of the user utterance, and further acquiring the state of the user (whether the user is satisfied or not, etc.) based on the user utterance speech, images, sensor information, and the like.

Then, in steps S402 to S403, the information processing apparatus 10 analyzes the content of the user utterance (command (processing request)), and determines whether the processing corresponding to the user utterance is executable (in the domain) or non-executable (out of the domain: OOD).

In the case where the process is not executable (out of domain: OOD), the process terminates.

Meanwhile, in a case where it is determined that the processing corresponding to the user utterance is executable (in the domain), the processing proceeds to step S404.

Then, in step S404, the information processing apparatus 10 records the user utterance determined to be executable (in the domain) in the storage unit 170.

(step S405)

Then, in step S405, the information processing apparatus determines whether the user utterance is a request to acquire and display an external utterance collection list.

In a case where the user utterance is not a request to acquire and display an external utterance collection list, the process proceeds to step S406.

Meanwhile, in a case where the user utterance is a request to acquire and display an external utterance collection list, the process proceeds to step S408.

(Steps S406 to S407)

In a case where the user utterance is not a request to acquire and display the external utterance collection list in step S405, in step S406, a node corresponding to the user utterance in the domain corresponding node tree displayed on the image output unit (display unit) 122 is highlighted.

This is, for example, the process of displaying the highlighted node 221 described above with reference to fig. 7.

This processing is processing executed by the display information generation unit 182 of the information processing apparatus 10 in fig. 3.

Further, in step S407, processing corresponding to the user utterance, that is, processing corresponding to the node highlighted in step S406 is performed.

(step S408)

Meanwhile, in a case where the user utterance is a request to acquire and display an external utterance collection list in step S405, the utterance collection list acquired from the outside is displayed on the image output unit (display unit) 122 in step S408.

This is, for example, the process of displaying the utterance collection list described above with reference to fig. 22 to 24.

This processing is processing executed by the display information generation unit 182 of the information processing apparatus 10 in fig. 3.

(step S501)

Then, in step S501, it is determined whether a new user utterance indicating a processing request corresponding to a node displayed in the displayed external utterance collection list has been input.

This processing is processing performed by the input data analysis unit 160 of the information processing apparatus 10.

In a case where it is determined that a new user utterance indicating a processing request corresponding to a node displayed in the displayed external utterance collection list has been input, the process proceeds to step S502.

Meanwhile, in a case where it is determined that no new user utterance indicating a processing request corresponding to a node displayed in the displayed external utterance collection list is input, the process proceeds to step S503.

(step S502)

In a case where it is determined that a new user utterance indicating a processing request corresponding to a node displayed in the external utterance collection list displayed in step S501 has been input, the process proceeds to step S502. In step S502, processing corresponding to the user utterance corresponding nodes listed in the utterance collection list is sequentially performed.

Further, processing of highlighting the currently executed user utterance corresponding node in the displayed utterance collection list is performed.

This processing is processing executed by the display information generation unit 182 of the information processing apparatus 10 in fig. 3.

(step S503)

Meanwhile, in a case where it is determined that a new user utterance indicating a processing request corresponding to a node displayed in the external utterance collection list displayed in step S501 is not input, the process proceeds to step S503. In step S503, normal processing according to the user utterance is performed without using the utterance collection list.

[6. configuration examples of information processing apparatus and information processing System ]

A plurality of embodiments have been described. The various processing functions described in these embodiments (for example, the processing functions of the respective components of the information processing apparatus 10 of fig. 3) may all be configured in a single apparatus owned by a user, such as an agent apparatus, a smartphone, or a PC. Alternatively, a part of the processing functions may be executed in a server or the like.

Fig. 41 shows a system configuration example.

The information processing system configuration example 1 of fig. 41(1) is an example in which almost all functions of the information processing apparatus of fig. 3 are configured in a single apparatus, namely an information processing apparatus 410 owned by a user, which is a user terminal such as a smartphone or a PC, or an agent apparatus having a voice input/output function and an image input/output function.

The information processing apparatus 410 corresponding to the user terminal communicates with the service providing server 420 only when, for example, the information processing apparatus 410 generates a response sentence using an external service.

The service providing server 420 is, for example, a content providing server for music, movies, or the like, a game server, a weather information providing server, a traffic information providing server, a medical information providing server, a sightseeing information providing server, or the like, and is a group of servers capable of providing information necessary for executing processing in response to a user utterance and for generating a response.

Meanwhile, the information processing system configuration example 2 of fig. 41(2) is a system example in which a part of the functions of the information processing apparatus of fig. 3 is configured in the information processing apparatus 410 owned by the user, which is a user terminal such as a smartphone, a PC, or an agent apparatus, and the remaining functions are executed in a data processing server 460 capable of communicating with the information processing apparatus.

For example, a configuration may be adopted in which only the input unit 110 and the output unit 120 in the apparatus of fig. 3 are provided in the information processing apparatus 410 serving as a user terminal, and all other functions are performed in a server.

Note that the manner of dividing the functions between the user terminal and the server may be set in various ways, and a single function may also be performed by both.
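As one illustration of configuration example 2, the user terminal could be reduced to a thin client that captures an utterance, forwards it to the data processing server, and presents the returned response. The sketch below is only an assumption of how such a split might look; the URL, the HTTP/JSON transport, and the field name response_text are invented for illustration and are not part of the disclosure.

```python
# A minimal sketch of configuration example 2 of fig. 41: the user terminal
# keeps only the input unit 110 and the output unit 120 and forwards all
# other processing to the data processing server 460.

import json
from urllib import request

SERVER_URL = "http://data-processing-server.example/analyze"  # hypothetical


class UserTerminal:
    """Holds only voice/image input and output; no analysis logic of its own."""

    def send_utterance(self, utterance: str) -> str:
        # Forward the captured utterance to the server, which runs the
        # data processing unit 150 (analysis, learning, response generation).
        payload = json.dumps({"utterance": utterance}).encode("utf-8")
        req = request.Request(SERVER_URL, data=payload,
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            return json.loads(resp.read())["response_text"]

    def present(self, response_text: str) -> None:
        # Stands in for the voice output unit 121 / image output unit 122.
        print(response_text)
```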

[7. example of hardware configuration of information processing apparatus ]

Next, a hardware configuration example of the information processing apparatus will be described with reference to fig. 42.

The hardware described with reference to fig. 42 is a hardware configuration example of the information processing apparatus described above with reference to fig. 3, and is also a hardware configuration example of the information processing apparatus forming the data processing server 460 described with reference to fig. 41.

A Central Processing Unit (CPU) 501 functions as a control unit or a data processing unit that executes various processes according to programs stored in a Read Only Memory (ROM) 502 or a storage unit 508. For example, the CPU 501 executes processing according to the sequences described in the above-described embodiments. A Random Access Memory (RAM) 503 stores programs to be executed by the CPU 501, data, and the like. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504.

The CPU 501 is connected to an input/output interface 505 through a bus 504. The input/output interface 505 is connected to an input unit 506 including various switches, a keyboard, a mouse, a microphone, sensors, and the like, and also to an output unit 507 including a display, a speaker, and the like. The CPU 501 executes various processes in response to a command input from the input unit 506, and outputs a processing result to, for example, the output unit 507.

The storage unit 508 connected to the input/output interface 505 includes, for example, a hard disk or the like, and stores programs executed by the CPU 501 and various data. The communication unit 509 functions as a transmission/reception unit that performs Wi-Fi communication, bluetooth (registered trademark) (BT) communication, and other data communication via a network such as the internet or a local area network, and communicates with an external device.

A drive 510 connected to the input/output interface 505 drives a removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory (e.g., a memory card), to record or read data.

[8. summary of the configuration of the present disclosure ]

In the foregoing, the present disclosure has been described in detail with reference to specific embodiments. It will be apparent, however, to those skilled in the art that modifications and substitutions can be made to the embodiments without departing from the scope of the disclosure. That is, the present invention has been described in an illustrative manner and should not be construed in a limiting sense. In order to determine the gist of the present disclosure, the claims should be considered.

Note that the technique disclosed in this specification can be configured as follows.

(1) An information processing apparatus includes

A learning processing unit configured to perform a learning process of a user utterance, wherein,

the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

(2) The information processing apparatus according to (1), wherein,

the information processing apparatus also displays the utterance collection list on a display unit.

(3) The information processing apparatus according to (1) or (2), wherein,

the user utterances recorded in the utterance collection list are user utterances corresponding to commands as processing requests made by the user to the information processing apparatus.

(4) The information processing apparatus according to any one of (1) to (3),

the learning processing unit asks the user whether to generate an utterance collection list, generates the utterance collection list with the user's consent, and stores the utterance collection list in the storage unit.

(5) The information processing apparatus according to any one of (1) to (4), wherein,

in a case where the learning processing unit determines that the plurality of processes corresponding to the plurality of user utterances have been successfully performed, the learning processing unit generates an utterance collection list and stores the utterance collection list in the storage unit.

(6) The information processing apparatus according to any one of (1) to (4), wherein,

in a case where a combination of a plurality of user utterances has been observed a number of times equal to or greater than a predetermined threshold, the learning processing unit generates an utterance collection list and stores the utterance collection list in the storage unit (see the sketch following this enumeration).

(7) The information processing apparatus according to any one of (1) to (4), wherein,

the learning processing unit analyzes the presence or absence of an indicator indicating a relationship between utterances included in a plurality of user utterances, generates an utterance collection list based on a result of the analysis, and stores the utterance collection list in a storage unit.

(8) The information processing apparatus according to any one of (1) to (4), wherein,

the learning processing unit analyzes a state of a user with respect to processing performed by the information processing apparatus in response to a user utterance, generates an utterance collection list based on a result of the analysis, and stores the utterance collection list in the storage unit.

(9) The information processing apparatus according to any one of (1) to (4), wherein,

in a case where the learning processing unit receives an input of user status information and the user status information indicates that the user is satisfied, the learning processing unit generates an utterance collection list and stores the utterance collection list in the storage unit.

(10) The information processing apparatus according to (9), wherein,

the user status information is information indicating a user satisfaction status, and is acquired based on at least one of the following information:

non-verbal information based on the user utterance and generated by the speech analysis unit;

image analysis information based on the user image and generated by the image analysis unit; or

sensor information analysis information generated by the sensor information analysis unit.

(11) The information processing apparatus according to any one of (1) to (10), further comprising

A display information generation unit configured to perform a process of highlighting, among a plurality of utterance corresponding nodes included in the utterance collection list displayed on the display unit, the utterance corresponding node whose process is currently being executed by the information processing apparatus.

(12) The information processing apparatus according to any one of (1) to (11), wherein,

the information processing apparatus also acquires an external utterance collection list that is available to the information processing apparatus, and displays the external utterance collection list on the display unit.

(13) The information processing apparatus according to any one of (1) to (12), wherein,

the learning processing unit selects user utterances to be collected according to the context information, and generates an utterance collection list.

(14) An information processing system comprising a user terminal and a data processing server, wherein:

the user terminal comprises

A voice input unit configured to input a user utterance;

the data processing server comprises

A learning processing unit configured to perform learning processing of a user utterance received from a user terminal; and is

The learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

(15) The information processing system according to (14), wherein,

the user terminal displays the utterance collection list on a display unit.

(16) An information processing method executed in an information processing apparatus, wherein:

the information processing apparatus includes: a learning processing unit configured to perform learning processing of a user utterance; and is

The learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

(17) An information processing method executed in an information processing system including a user terminal and a data processing server, wherein:

the user terminal performs a voice input process of inputting a user utterance;

the data processing server performs a learning process of a user utterance received from a user terminal; and is

An utterance collection list is generated in the learning process, in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.

(18) A program for causing an information processing apparatus to execute information processing, wherein:

the information processing apparatus includes: a learning processing unit configured to perform learning processing of a user utterance; and is

The program causes the learning processing unit to generate an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected.
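As a supplement to configuration (6) above, the following minimal sketch shows one way the threshold-based trigger could be realized: a counter over utterance combinations that, on reaching a predetermined threshold, generates and stores an utterance collection list. The class name, the threshold value of 3, and the dictionary standing in for the storage unit 170 are all assumptions for illustration.

```python
# A minimal sketch of the trigger in configuration (6): when the same
# combination of user utterances has been observed a number of times equal
# to or greater than a predetermined threshold, an utterance collection
# list is generated and stored. All names and values here are assumptions.

from collections import Counter
from typing import Dict, List, Tuple

THRESHOLD = 3  # hypothetical; the disclosure says only "predetermined threshold"


class LearningProcessingUnit:
    def __init__(self) -> None:
        self.combination_counts: Counter = Counter()
        # Stands in for the storage unit 170.
        self.stored_lists: Dict[Tuple[str, ...], List[str]] = {}

    def observe_combination(self, utterances: List[str]) -> None:
        combo = tuple(u.lower() for u in utterances)
        self.combination_counts[combo] += 1
        if (self.combination_counts[combo] >= THRESHOLD
                and combo not in self.stored_lists):
            # Generate the utterance collection list and store it.
            self.stored_lists[combo] = list(combo)
```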

Further, the series of processes described in the specification may be executed by hardware, by software, or by a combined configuration of both. In the case of executing the processing by software, a program in which the processing sequence is recorded may be installed in a memory inside a computer incorporated into dedicated hardware and executed, or the program may be installed in a general-purpose computer capable of executing various kinds of processing and executed. For example, the program may be recorded in advance on a recording medium. The program may be installed in a computer from the recording medium, or may be received via a network such as a Local Area Network (LAN) or the internet and installed in a recording medium such as a built-in hard disk.

Note that the various processes described in the specification may be performed not only in time series according to the description but also in parallel or individually according to the processing capability of the apparatus that executes the processes, or as necessary. Further, in the present specification, a system is a logical set configuration of a plurality of devices, and is not limited to one in which the devices having the respective configurations are in the same housing.

INDUSTRIAL APPLICABILITY

As described above, according to the configurations of the embodiments of the present disclosure, an apparatus and a method capable of accurately and repeatedly performing processing based on a plurality of user utterances are realized by generating and using an utterance collection list in which the plurality of user utterances are collected.

Specifically, for example, the learning processing unit generates an utterance collection list in which a plurality of user utterances corresponding to a plurality of different processing requests are collected. Further, the generated utterance collection list is displayed on the display unit. The learning processing unit generates the utterance collection list and stores the utterance collection list in the storage unit, for example, in a case where it is determined that the plurality of processes corresponding to the user utterances have been successfully performed, in a case where a combination of a plurality of user utterances has been observed a number of times equal to or greater than a predetermined threshold, or in a case where it is estimated that the user is satisfied.

With this configuration, an apparatus and a method capable of accurately and repeatedly performing processing based on a plurality of user utterances are realized by generating and using an utterance collection list in which the plurality of user utterances are collected.

List of reference numerals

10 information processing apparatus

11 camera

12 microphone

13 display unit

14 loudspeaker

20 server

30 external device

110 input unit

111 voice input unit

112 image input unit

113 sensor

120 output unit

121 voice output unit

122 image output unit

150 data processing unit

160 input data analysis unit

161 speech analysis unit

162 image analysis unit

163 sensor information analysis unit

164 user state estimation unit

165 learning processing unit

170 storage unit

180 output information generating unit

181 output voice generation unit

182 display information generating unit

200 domain corresponding node tree

201 domain

202 acceptable utterance display node

211 display area identification information

212 registered utterance collection list information

221 highlighting nodes

222 guide information

231 utterance collection list

232 public utterance collection list of another user

233 network public utterance collection list

234 blog public utterance collection list

241 utterance collection list corresponding node

242 utterance collection list

261 utterance collection list

420 service providing server

460 data processing server

501 CPU

502 ROM

503 RAM

504 bus

505 input/output interface

506 input unit

507 output unit

508 storage unit

509 communication unit

510 driver

511 removable media
