Cross-modal input fusion for wearable systems

Document No.: 914506    Publication date: 2021-02-26

Reading note: This technology, Cross-modal input fusion for wearable systems, was created by P. Lacey, S. A. Miller, N. A. Kramer, and D. C. Lundmark on 2019-05-21. Its main content is as follows: Examples of wearable systems and methods may use multiple inputs (e.g., hand gestures, head pose, eye gaze, voice, totems, and/or environmental factors (e.g., location)) to determine a command that should be executed and an object that should be operated on in a three-dimensional (3D) environment. The wearable system may detect when different inputs converge, such as when a user selects a virtual object using multiple inputs such as eye gaze, head pose, hand gestures, and totem input. Upon detecting the convergence of the inputs, the wearable system may execute a cross-modal filtering scheme that uses the converged inputs to help correctly interpret which command the user is providing or which object the user is aiming at.

1. A wearable system, comprising:

a head pose sensor configured to determine a head pose of a user of the wearable system;

an eye gaze sensor configured to determine an eye gaze direction of the user of the wearable system;

a gesture sensor configured to determine a gesture of the user of the wearable system;

a hardware processor in communication with the head pose sensor, the eye gaze sensor, and the gesture sensor, the hardware processor programmed to:

determine a first vergence between the eye gaze direction and the head pose of the user relative to an object;

execute a first interaction command associated with the object based at least in part on input from the head pose sensor and the eye gaze sensor;

determine a second vergence of the gesture of the user with the eye gaze direction and the head pose relative to the object; and

execute a second interaction command associated with the object based at least in part on input from the gesture sensor, the head pose sensor, and the eye gaze sensor.

2. The wearable system of claim 1, wherein the head pose sensor comprises an Inertial Measurement Unit (IMU), the eye gaze sensor comprises an eye tracking camera, and the gesture sensor comprises an outward facing camera.

3. The wearable system of claim 1, wherein to determine the first vergence, the hardware processor is programmed to determine that an angle between the eye gaze direction and a head pose direction associated with the head pose is less than a first threshold.

4. The wearable system of claim 1, wherein to determine the second vergence, the hardware processor is programmed to determine that a cross-modal triangle associated with the gesture, the eye gaze direction, and the head pose is less than a second threshold.

5. The wearable system of claim 1, wherein the first interaction command comprises aiming at the object.

6. The wearable system of claim 1, wherein the second interaction command comprises selecting the object.

7. The wearable system of claim 1, wherein the hardware processor is further programmed to determine a divergence of the object from at least one of the gesture, the eye gaze direction, or the head pose.

8. The wearable system of claim 1, wherein the first interaction command comprises applying a first filter or the second interaction command comprises applying a second filter.

9. The wearable system of claim 8, wherein the first filter is different from the second filter.

10. The wearable system of claim 8, wherein the first filter or the second filter comprises a low pass filter having an adaptive cutoff frequency.

11. The wearable system of claim 10, wherein the low pass filter comprises a one-euro filter.

12. The wearable system of claim 1, wherein to determine the first vergence, the hardware processor is programmed to determine that a dwell time of the eye gaze direction and the head pose toward the object exceeds a first dwell time threshold.

13. The wearable system of claim 1, wherein to determine the second vergence, the hardware processor is programmed to determine that a dwell time of the eye gaze direction, the head pose, and the gesture relative to the object exceeds a second dwell time threshold.

14. The wearable system of claim 1, wherein the first or second interaction command comprises providing a stable aiming vector associated with the object.

15. The wearable system of claim 14, wherein the hardware processor provides the stable aiming vector to an application.

16. The wearable system of claim 1, wherein the gesture sensor comprises a handheld user input device.

17. The wearable system of claim 16, wherein the hardware processor is programmed to determine a third vergence between an input from the user input device and at least one of the eye gaze direction, the head pose, or the gesture.

18. The wearable system of claim 1, further comprising a voice sensor, and wherein the hardware processor is programmed to determine a fourth vergence between input from the voice sensor and at least one of the eye gaze direction, the head pose, or the gesture.

19. A system, comprising:

a first sensor of the wearable system configured to acquire first user input data in a first input mode;

a second sensor of the wearable system configured to acquire second user input data in a second input mode, the second input mode being different from the first input mode;

a third sensor of the wearable system configured to acquire third user input data in a third input mode, the third input mode being different from the first input mode and the second input mode; and

a hardware processor in communication with the first sensor, the second sensor, and the third sensor, the hardware processor programmed to:

receive a plurality of inputs comprising the first user input data in the first input mode, the second user input data in the second input mode, and the third user input data in the third input mode;

identify a first interaction vector based on the first user input data;

identify a second interaction vector based on the second user input data;

identify a third interaction vector based on the third user input data;

determine a vergence between at least two of the first interaction vector, the second interaction vector, and the third interaction vector;

identify a target virtual object from a set of candidate objects in a three-dimensional (3D) region around the wearable system based at least in part on the vergence;

determine a user interface operation on the target virtual object based on at least one of the first user input data, the second user input data, the third user input data, and the vergence; and

generate a cross-modal input command that causes the user interface operation to be performed on the target virtual object.

20. A method, comprising:

under control of a hardware processor of the wearable system:

accessing sensor data from a plurality of more than three sensors having different modalities;

identifying convergence events for a first sensor and a second sensor of the more than three sensors having different modalities; and

targeting objects in a three-dimensional (3D) environment surrounding the wearable system using first sensor data from the first sensor and second sensor data from the second sensor.

21. A method, comprising:

under control of a hardware processor of the wearable system:

accessing sensor data from at least a first sensor and a second sensor having different modalities, wherein the first sensor provides sensor data having a plurality of potential interpretations;

identifying a convergence of sensor data from the second sensor with a given one of the potential interpretations of the sensor data from the first sensor; and

generating an input command for the wearable system based on the given one of the potential interpretations.

22. A method, comprising:

under control of a hardware processor of the wearable system:

accessing sensor data from a plurality of sensors having different modalities;

identifying a convergence event of sensor data from a first sensor and a second sensor of the plurality of sensors; and

selectively applying a filter to the sensor data from the first sensor during the convergence event.

23. A method, comprising:

under control of a hardware processor of the wearable system:

identifying a current cross-modal state, the current cross-modal state comprising cross-modal vergence associated with an object;

identifying a region of interest (ROI) associated with the cross-modal vergence;

identifying a corresponding interaction field based at least in part on the ROI;

selecting an input fusion method based at least in part on the cross-modal state;

selecting a setting for a primary aiming vector;

applying an adjustment to the primary aiming vector to provide a stabilized pose vector; and

communicating the stabilized pose vector to an application.

24. A method, comprising:

under control of a hardware processor of the wearable system:

identifying a cross-modal fixation point;

defining an extended region of interest (ROI) based on a cross-modal gaze time or a predicted dwell time in the vicinity of the cross-modal fixation point;

determining that the ROI intersects a rendering element;

determining a rendering enhancement compatible with the ROI, the cross-modal gaze time, or the predicted dwell time; and

activating the rendering enhancement.

25. A method, comprising:

under control of a hardware processor of the wearable system:

receiving sensor data from a plurality of sensors having different modalities;

determining that data from a particular subset of the plurality of sensors having different modalities indicates that a user is initiating execution of a particular motor or sensorimotor control strategy from a plurality of predetermined motor and sensorimotor control strategies;

selecting a particular sensor data processing scheme corresponding to the particular motor or sensorimotor control strategy from a plurality of different sensor data processing schemes corresponding to respective different ones of the plurality of predetermined motor and sensorimotor control strategies; and

processing data received from the particular subset of the plurality of sensors having different modalities according to the particular sensor data processing scheme.

26. A method, comprising:

under control of a hardware processor of the wearable system:

receiving sensor data from a plurality of sensors having different modalities;

determining that data from a particular subset of the plurality of sensors having different modalities is varying randomly in a particular manner;

in response to determining that data from the particular subset of the plurality of sensors having different modalities is varying randomly in the particular manner, switching between:

processing data received from the particular subset of the plurality of sensors having different modalities according to a first sensor data processing scheme; and

processing data received from the particular subset of the plurality of sensors having different modalities according to a second sensor data processing scheme different from the first sensor data processing scheme.

Technical Field

The present disclosure relates to virtual reality and augmented reality imaging and visualization systems, and more particularly, to dynamically fusing multiple modes of user input to facilitate interaction with virtual objects in a three-dimensional (3D) environment.

Background

Modern computing and display technologies have facilitated the development of systems for so-called "virtual reality", "augmented reality", or "mixed reality" experiences, in which digitally reproduced images, or portions thereof, are presented to a user in a manner in which they appear to be, or may be perceived as, real. A virtual reality, or "VR", scenario typically involves the presentation of digital or virtual image information without transparency to other real-world visual input; an augmented reality, or "AR", scenario typically involves the presentation of digital or virtual image information as an augmentation to the visualization of the real world around the user; a mixed reality, or "MR", scenario involves merging the real and virtual worlds to produce new environments in which physical and virtual objects coexist and interact in real time. The human visual perception system has proven to be very complex, and it is challenging to produce VR, AR, or MR technology that facilitates a comfortable, natural-feeling, rich presentation of virtual image elements among other virtual or real-world image elements. The systems and methods disclosed herein address various challenges associated with VR, AR, and MR technologies.

Disclosure of Invention

Examples of the wearable systems and methods described herein may use multiple inputs (e.g., gestures from a user input device, head pose, eye gaze, voice, or environmental factors (e.g., location)) to determine a command that should be executed or an object that should be manipulated or selected in a three-dimensional (3D) environment. The multiple inputs may also be used by the wearable device to allow a user to interact with physical objects, virtual objects, text, graphics, icons, user interfaces, and the like.

For example, the wearable display device may be configured to dynamically resolve multiple sensor inputs in order to perform a task or aim at an object. The wearable device may dynamically use a combination of multiple inputs, such as head pose, eye gaze, hand, arm, or body gestures, voice commands, user input devices, and environmental factors (e.g., the user's location or the objects surrounding the user), to determine which object in the user's environment the user intends to select or which action the wearable device should perform. The wearable device may dynamically select a group of sensor inputs that collectively indicate the user's intent to select the target object (inputs that provide an independent or supplemental indication of the user's intent to select the target object may be referred to as convergent or converging inputs). The wearable device may combine or fuse the inputs from the group (e.g., to enhance the quality of the user interaction as described herein). If a sensor input from the group later diverges from the target object, the wearable device may stop using that divergent sensor input (or reduce the relative weight assigned to it).

The process of dynamically using converging sensor inputs while ignoring (or reducing the relative weight assigned to) diverging sensor inputs is sometimes referred to herein as cross-modal input fusion (or simply cross-modal fusion), and it may provide substantial advantages over techniques that merely accept inputs from multiple sensors. Cross-modal input fusion can anticipate, or even predict, on a dynamic, real-time basis which of the many possible sensor inputs are the appropriate modal inputs for conveying the user's intent to aim at or manipulate real or virtual objects in the user's 3D AR/MR/VR environment.

The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Neither this summary nor the following detailed description is intended to define or limit the scope of the inventive subject matter.

Drawings

FIG. 1 depicts an illustration of a mixed reality scene with certain virtual reality objects and certain physical objects viewed by a person.

Fig. 2A and 2B schematically illustrate examples of wearable systems that may be configured to use the cross-modal input fusion technique described herein.

FIG. 3 schematically illustrates aspects of a method for simulating a three-dimensional image using multiple depth planes.

Fig. 4 schematically shows an example of a waveguide stack for outputting image information to a user.

Fig. 5 shows an example exit beam that may be output by a waveguide.

FIG. 6 is a schematic diagram showing an optical system including a waveguide device, an optical coupler subsystem for optically coupling light to or from the waveguide device, and a control subsystem for generating a multi-focal volumetric display, image or light field.

Fig. 7 is a block diagram of an example of a wearable system.

FIG. 8 is a process flow diagram of an example of a method of rendering virtual content related to an identified object.

Fig. 9 is a block diagram of another example of a wearable system.

Fig. 10 is a process flow diagram of an example of a method for determining user input to a wearable system.

FIG. 11 is a process flow diagram of an example of a method for interacting with a virtual user interface.

Fig. 12A schematically shows examples of a field of regard (FOR), a field of view (FOV) of a world camera, a field of view of a user, and a field of fixation of a user.

Fig. 12B schematically illustrates an example of a virtual object in the field of view of a user and a virtual object in the field of regard.

FIG. 13 illustrates an example of interacting with a virtual object using one mode of user input.

FIG. 14 illustrates an example of selecting a virtual object using a combination of user input modes.

FIG. 15 illustrates an example of interacting with a virtual object using a combination of direct user inputs.

FIG. 16 illustrates an example computing environment for aggregating input patterns.

FIG. 17A illustrates an example of using lattice tree analysis to identify a target virtual object.

FIG. 17B illustrates an example of determining a target user interface operation based on multimodal input.

FIG. 17C illustrates an example of aggregating confidence scores associated with input patterns for virtual objects.

Fig. 18A and 18B illustrate examples of calculating a confidence score for an object within the FOV of a user.

Fig. 19A and 19B illustrate an example of interacting with a physical environment using multimodal input.

FIG. 20 illustrates an example of automatically resizing virtual objects based on multimodal input.

FIG. 21 illustrates an example of identifying a target virtual object based on a location of the object.

Fig. 22A and 22B illustrate another example of interacting with a user environment based on a combination of direct and indirect inputs.

FIG. 23 illustrates an example process of interacting with a virtual object using multimodal input.

Fig. 24 shows an example of setting a direct input mode associated with user interaction.

FIG. 25 illustrates an example of a user experience with multimodal input.

FIG. 26 shows an example user interface for an application with various bookmarks.

FIG. 27 shows an example user interface when a search command is issued.

Fig. 28A-28F illustrate an example user experience for composing and editing text based on a combination of voice and gaze input.

FIG. 29 shows an example of selecting a word based on input from a user input device and gaze.

FIG. 30 illustrates an example of selecting a word for editing based on a combination of speech and gaze input.

FIG. 31 shows an example of selecting a word for editing based on a combination of gaze and gesture inputs.

Fig. 32 shows an example of replacing words based on a combination of eye gaze and speech input.

FIG. 33 shows an example of changing words based on a combination of speech and gaze input.

FIG. 34 illustrates an example of editing a selected word using a virtual keyboard.

FIG. 35 illustrates an example user interface displaying possible actions applied to a selected word.

FIG. 36 shows an example of interacting with phrases using multimodal input.

Fig. 37A and 37B illustrate additional examples of interacting with text using multimodal input.

FIG. 38 is a process flow diagram of an example method of interacting with text using multiple modes of user input.

Fig. 39A shows an example of user input received through the controller button.

Fig. 39B shows an example of user input received through the controller touch panel.

Fig. 39C shows an example of user input received through physical movement of a controller or a Head Mounted Device (HMD).

Fig. 39D shows an example of how user input may have different durations.

Fig. 40A shows additional examples of user input received through the controller buttons.

Fig. 40B illustrates additional examples of user input received through the controller touchpad.

FIG. 41A illustrates examples of user input received through various modes of user input for spatially manipulating a virtual environment or virtual object.

FIG. 41B illustrates examples of user inputs for interacting with planar objects received through various modes of user input.

Fig. 41C shows an example of user input received through various modes of user input for interacting with a wearable system.

Fig. 42A, 42B, and 42C illustrate examples of user input in the form of fine finger gestures and hand motions.

Fig. 43A illustrates examples of the perception domains of a user of a wearable system, including a visual perception domain and an auditory perception domain.

Fig. 43B shows an example of a display rendering plane of a wearable system having multiple depth planes.

Fig. 44A, 44B, and 44C illustrate examples of different interaction regions, whereby the wearable system may receive and respond to user input differently depending on which interaction region the user interacts with.

FIG. 45 illustrates an example of a single modality user interaction.

Fig. 46A, 46B, 46C, 46D, and 46E illustrate examples of multi-modal user interactions.

Fig. 47A, 47B, and 47C illustrate examples of cross-modal user interaction.

Fig. 48A, 48B, and 49 illustrate examples of cross-modal user interaction.

FIG. 50 is a process flow diagram of an example of a method of detecting modal vergence.

FIG. 51 illustrates examples of user selections in single-modality, dual-modality, and tri-modality interactions.

FIG. 52 illustrates an example of interpreting user input based on a convergence of user input for multiple modes.

FIG. 53 shows an example of how different user inputs may converge across different interaction regions.

Fig. 54, 55, and 56 illustrate examples of how the system selects among a plurality of possible input convergence interpretations based at least in part on rankings of the different inputs.

Fig. 57A and 57B are block diagrams of examples of wearable systems that fuse multiple modes of user input to facilitate user interaction with the wearable systems.

FIG. 58A is a graph of vergence distance and vergence region for various input pairs for user interaction with dynamic cross-modal input fusion disabled.

FIG. 58B is a graph of vergence distance and vergence region for various input pairs for user interaction with dynamic cross-modal input fusion enabled.

Fig. 59A and 59B illustrate examples of user interaction and feedback during gaze and dwell events.

Fig. 60A and 60B illustrate examples of wearable systems that include at least one neuromuscular sensor, such as, for example, an Electromyography (EMG) sensor, and that may be configured to use embodiments of the cross-modal input fusion techniques described herein.

Throughout the drawings, reference numerals may be reused to indicate correspondence between reference elements. The drawings are provided to illustrate example embodiments described herein and are not intended to limit the scope of the present disclosure.

Detailed Description

Overview

Modern computing systems can support a variety of user interactions. A wearable device may present an interactive VR/AR/MR environment that may include data elements with which a user can interact through various inputs. Modern computing systems are typically designed to generate a given output based on a single direct input. For example, a keyboard relays text input received from a user's finger presses. A speech recognition application may create an executable data string based on the user's voice as a direct input. A computer mouse may guide a cursor in response to a user's direct manipulation (e.g., the user's hand motion or hand gesture). The various ways in which a user may interact with the system are sometimes referred to herein as user input modes. For example, user input via a mouse or keyboard is a gesture-based mode of interaction (because a finger of the hand presses a key on the keyboard or the hand moves the mouse).

However, in a data-rich and dynamic interactive environment (e.g., an AR/VR/MR environment), traditional input techniques such as keyboards, user input devices, and gestures may require a high degree of specificity to accomplish a desired task. Otherwise, absent precise input, the computing system may suffer a high error rate and may perform incorrect computer operations. For example, when a user intends to move an object in 3D space using a touchpad, the computing system may be unable to correctly interpret the move command if the user does not specify a destination or the target object with the touchpad. As another example, using a virtual keyboard (e.g., with a user input device or via gesture manipulation) as the sole input mode to enter a text string can be slow and physically tiring, because of the prolonged fine motor control needed to type the desired keys in mid-air or on the physical surface (e.g., a desk) on which the virtual keyboard is rendered.

To reduce the degree of specificity required in input commands and to reduce the error rate associated with imprecise commands, the wearable systems described herein may be programmed to dynamically apply multiple inputs to identify an object to be selected or acted upon and to perform an interaction event associated with that object, such as, for example, a task of selecting, moving, resizing, or aiming at a virtual object. The interaction event may include causing an application (sometimes referred to as an app) associated with the virtual object to execute (e.g., if the target object is a media file, the interaction event may include causing a media player to play the media file (e.g., a song or video)). Selecting the target virtual object may include executing an application associated with the target virtual object. As described below, the wearable device may dynamically select which of two or more types of input (or input from multiple input channels or sensors) to use to generate a command for performing a task or to identify the target object on which the command is to be executed.

The particular sensor inputs used at any point in time may change dynamically as the user interacts with the 3D environment. When the device determines that an input mode is providing additional information that helps target a virtual object, that input mode may be dynamically added (or "fused," as further described herein), and it may be dynamically removed if it no longer provides relevant information. For example, the wearable device may determine that the user's head pose and eye gaze are directed at the target object. The device may use both input modes to assist in selecting the target object. If the device determines that the user is also pointing a totem at the target object, the device may dynamically add the totem input to the head pose and eye gaze inputs, which may further confirm that the user intends to select the target object. It can be said that the totem input has "converged" with the head pose input and the eye gaze input. Continuing with the example, if the user looks away from the target object such that the user's eye gaze is no longer directed at it, the device may cease using the eye gaze input while continuing to use the totem input and the head pose input. In this case, it can be said that the eye gaze input has "diverged" from the totem input and the head pose input.
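By way of illustration only (this is a hedged sketch, not the implementation described in this disclosure), the following example shows one way a device could maintain the set of converged input modes described above, adding a mode when its aim vector agrees with the direction toward the candidate target and dropping it when it drifts away. The mode names, the thresholds, and the use of hysteresis are assumptions.

```python
import numpy as np

# Hypothetical angular thresholds (radians); real values would be tuned per device.
CONVERGE_ANGLE = np.deg2rad(5.0)
DIVERGE_ANGLE = np.deg2rad(12.0)   # wider than CONVERGE_ANGLE to avoid flicker (hysteresis)

def angle_between(v1, v2):
    """Angle between two 3D aim vectors."""
    v1 = v1 / np.linalg.norm(v1)
    v2 = v2 / np.linalg.norm(v2)
    return np.arccos(np.clip(np.dot(v1, v2), -1.0, 1.0))

def update_converged_modes(aim_vectors, target_dir, converged):
    """aim_vectors: dict of mode name -> aim vector (e.g., 'eye_gaze', 'head_pose', 'totem').
    target_dir: direction from the user toward the candidate target object.
    converged: set of modes currently treated as converged (updated in place)."""
    for mode, v in aim_vectors.items():
        err = angle_between(v, target_dir)
        if mode not in converged and err < CONVERGE_ANGLE:
            converged.add(mode)        # convergence event: start fusing this mode
        elif mode in converged and err > DIVERGE_ANGLE:
            converged.discard(mode)    # divergence event: stop using (or down-weight) it
    return converged
```

Using a wider divergence threshold than the convergence threshold keeps a briefly noisy input (e.g., an eye-gaze saccade) from flickering in and out of the fused set.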

The wearable device may dynamically determine the divergence and convergence events that occur between multiple input modes, and may dynamically select, from the multiple input modes, a subset of input modes that is relevant to the user's interaction with the 3D environment. For example, the system may use input modes that have converged and may ignore input modes that have diverged. The number of input modes that can be dynamically fused or filtered in response to input convergence is not limited to the three modes described in this example (totem, head pose, eye gaze); the system can dynamically switch among 1, 2, 3, 4, 5, 6, or more sensor inputs as different input modes converge or diverge.

The wearable device may use the converged inputs by accepting them for analysis, by increasing the computational resources available or allocated to a converged input (e.g., an input sensor component), by selecting a particular filter to apply to one or more of the converged inputs, by taking other appropriate actions, or by any combination of these actions. The wearable device may not use, may discontinue use of, or may reduce the weight given to, divergent or diverging sensor inputs.
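This passage, together with claims 10 and 11, points to applying a low-pass filter with an adaptive cutoff to a converged input. Below is a hedged sketch in the spirit of the one-euro filter known from the HCI literature; the class name and parameter values are illustrative assumptions rather than values taken from this disclosure.

```python
import math

class OneEuroFilter:
    """Adaptive low-pass filter: the cutoff frequency rises with signal speed, so slow,
    deliberate aiming is smoothed heavily while fast motion passes through with little lag."""
    def __init__(self, min_cutoff=1.0, beta=0.05, d_cutoff=1.0):
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.t_prev = self.x_prev = self.dx_prev = None

    @staticmethod
    def _alpha(cutoff, dt):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, t, x):
        if self.t_prev is None:
            self.t_prev, self.x_prev, self.dx_prev = t, x, 0.0
            return x
        dt = t - self.t_prev
        dx = (x - self.x_prev) / dt                        # estimated signal speed
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev     # smoothed speed
        cutoff = self.min_cutoff + self.beta * abs(dx_hat) # adaptive cutoff frequency
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev            # filtered sample
        self.t_prev, self.x_prev, self.dx_prev = t, x_hat, dx_hat
        return x_hat
```

In this sketch a filter instance would be applied per scalar channel (e.g., each component of a converged aim vector), and could simply be bypassed for inputs that have diverged.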

For example, the input modes can be said to converge when the variance between the input vectors of the inputs is smaller than a threshold. Once the system recognizes that the inputs have converged, it can filter the converged inputs and fuse them together to create a new, conditioned input, which can then be used to do useful work and to perform a task with greater confidence and accuracy than could be achieved by using the inputs separately. In various embodiments, the system may apply dynamic filtering (e.g., dynamically fuse inputs together) in response to the relative convergence of the inputs. The system may continuously evaluate whether the inputs converge. In some embodiments, the system may scale the strength of the input fusion (e.g., how strongly the system fuses the inputs together) relative to the strength of the input convergence (e.g., how closely the input vectors of two or more inputs match).
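As a concrete, purely illustrative reading of scaling fusion strength with convergence strength, the sketch below blends a secondary converged input into a primary aiming vector with a weight that grows as the angle between the two inputs shrinks; the linear ramp and the threshold angles are assumptions.

```python
import numpy as np

def fusion_weight(angle, full_fuse=np.deg2rad(2.0), no_fuse=np.deg2rad(10.0)):
    """Map the angle between two input vectors to a fusion strength in [0, 1]:
    1.0 when the inputs are tightly converged, 0.0 when they are far apart."""
    return float(np.clip((no_fuse - angle) / (no_fuse - full_fuse), 0.0, 1.0))

def fuse_aim_vectors(primary, secondary):
    """Blend a secondary input into the primary aiming vector in proportion to
    how closely the two inputs agree. Both arguments are 3D numpy vectors."""
    cos = np.dot(primary, secondary) / (np.linalg.norm(primary) * np.linalg.norm(secondary))
    w = fusion_weight(np.arccos(np.clip(cos, -1.0, 1.0)))
    fused = (1.0 - 0.5 * w) * primary + (0.5 * w) * secondary   # equal blend at full convergence
    return fused / np.linalg.norm(fused), w
```

The returned weight can also drive how aggressively the conditioned input is filtered, so that the fusion strength tracks the convergence strength continuously rather than switching on and off.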

The process of dynamically using converging sensor inputs while ignoring (or reducing the relative weight assigned to) diverging sensor inputs is sometimes referred to herein as cross-modal input fusion (or simply cross-modal fusion), and it may provide substantial advantages over techniques that merely accept inputs from multiple sensors. Cross-modal input fusion can anticipate, or even predict, on a dynamic, real-time basis which of the many possible sensor inputs are the appropriate modal inputs for conveying the user's intent to target or operate on a real or virtual object in the user's 3D AR/MR/VR environment.

As will be further explained herein, input modes may include, but are not limited to, hand or finger gestures, arm gestures, body gestures, head poses, eye gaze, voice commands, environmental inputs (e.g., the location of the user or the location of an object in the user's environment), shared gestures from another user, and so forth. The sensors used to detect these input modes may include, for example, an outward-facing camera (e.g., to detect hand or body gestures), an inward-facing camera (e.g., to detect eye gaze), an inertial measurement unit (IMU, e.g., accelerometer, gyroscope, magnetometer), an electromagnetic tracking sensor system, a Global Positioning System (GPS) sensor, a radar or lidar sensor, and so forth (e.g., see the description of example sensors with reference to Fig. 2A and 2B).

As another example, when the user says "move that there," the wearable system may use a combination of head pose, eye gaze, hand gestures, and other environmental factors (e.g., the location of the user or the locations of objects around the user), in conjunction with the voice command, to determine which object should be moved (e.g., which object is "that") and which destination is intended (e.g., "there"), based on appropriate dynamic selection among these multiple inputs.
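A hedged sketch of how the "move that there" example might be resolved: the voice command supplies the verb, while the fused aim vector, sampled at the moments the user utters "that" and "there", supplies the object and the destination. The dictionaries, the ray-distance scoring rule, and the function names are assumptions for illustration.

```python
import numpy as np

def nearest_to_ray(origin, direction, candidates):
    """Pick the candidate whose 3D position lies closest to the aim ray.
    candidates: dict of name -> 3D position; direction is assumed to be a unit vector."""
    def ray_distance(p):
        v = np.asarray(p, dtype=float) - origin
        return np.linalg.norm(v - np.dot(v, direction) * direction)
    return min(candidates, key=lambda name: ray_distance(candidates[name]))

def resolve_move_command(aim_at_that, aim_at_there, objects, destinations):
    """aim_at_that / aim_at_there: (origin, unit direction) of the fused aim vector
    sampled when the user said "that" and "there", respectively."""
    source = nearest_to_ray(*aim_at_that, objects)
    target = nearest_to_ray(*aim_at_there, destinations)
    return {"command": "move", "object": source, "destination": target}
```

In practice the candidates could also be weighted by confidence scores from each converged mode rather than by ray distance alone.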

As will be further described herein, the techniques for cross-modal input are not merely an aggregation of multiple user input modes. Rather, wearable systems employing such cross-modal techniques can advantageously support the added depth dimension available in 3D (as compared to traditional 2D interactions). The added dimensionality not only allows additional types of user interaction (e.g., rotation, or movement along additional axes of a Cartesian coordinate system), but also demands a higher degree of precision in the user input to produce correct results.

However, because of limits on a user's motor control, user input for interacting with virtual objects is not always accurate. Although conventional input techniques can calibrate and adjust for inaccuracies in the user's motor control in 2D space, such inaccuracies are magnified in 3D space because of the added dimensionality, and conventional input methods (such as keyboard input) are poorly suited to correcting for them in 3D space. Among other benefits, the cross-modal input techniques adapt the input methods to provide smoother and more accurate interaction with objects in 3D space.

Thus, embodiments of the cross-modal input techniques may dynamically monitor which input modes have converged and use the set of converged input modes to more accurately determine or predict the target with which the user intends to interact. Embodiments of the cross-modal input techniques may also dynamically monitor which input modes have diverged (e.g., indicating that an input mode is no longer relevant to a potential target) and stop using those diverged input modes (or reduce the weight given to them compared to the converged input modes). The converged set of sensor input modes is typically temporary and constantly changing. For example, as a user provides input on a totem or uses voice commands, different sensor input modes dynamically converge and diverge as the user moves his or her hands, body, head, or eyes. Thus, a potential advantage of the cross-modal input techniques is that only the appropriate set of sensor input modes is used at any particular time or for any particular target object in the 3D environment. In some embodiments, the system may assign a greater weight than normally assigned to a given input based on the physiological context. As an example, the system may determine that the user is attempting to grasp and move a virtual object. In response, the system may assign greater weight to the gesture input and lesser weight to other inputs, such as the eye gaze input. The system may also time any shift in input weighting in an appropriate manner. As an example, the system may shift weight to the gesture input when the gesture converges on the virtual object.
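The context-dependent weight shift described above (favoring the hand gesture once the user appears to be grasping and moving a virtual object, and only once the gesture has actually converged) might look roughly like the following; the context label and the numeric weights are illustrative assumptions.

```python
# Baseline relative weights for fusing aim estimates (illustrative values only).
BASE_WEIGHTS = {"eye_gaze": 0.4, "head_pose": 0.3, "gesture": 0.3}

def contextual_weights(context, gesture_converged):
    """Return per-mode fusion weights, boosting the gesture input when a grasp-and-move
    strategy is detected, but only after the gesture has converged on the object."""
    w = dict(BASE_WEIGHTS)
    if context == "grasp_and_move" and gesture_converged:
        w["gesture"] = 0.6     # hand position dominates during direct manipulation
        w["eye_gaze"] = 0.15   # gaze often saccades ahead of the hand, so trust it less
        w["head_pose"] = 0.25
    total = sum(w.values())
    return {mode: value / total for mode, value in w.items()}
```

Timing the shift on the convergence event (rather than on the inferred intent alone) avoids handing control to a hand that is still in transit toward the object.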

Additionally, in some embodiments, the techniques described herein may advantageously reduce the hardware requirements and cost of the wearable system. For example, rather than relying on a high-resolution eye-tracking camera (which may be expensive and complex) to determine a task by itself, the wearable device may use a low-resolution eye-tracking camera in conjunction with voice commands or head pose to perform the task (e.g., by determining that some or all of these input modes have converged on the target object). In this example, the user's voice commands can compensate for the lower resolution of the eye tracking. Accordingly, a cross-modal combination of multiple user input modes (allowing dynamic selection of which of the multiple user input modes to use) may provide lower-cost, simpler, and more robust user interaction with the AR/VR/MR device than a single input mode. Additional benefits and examples of techniques related to cross-modal sensor fusion for interacting with real or virtual objects are described further below with reference to Fig. 13-59B.

Cross-modal fusion techniques may provide substantial advantages over simply aggregating multiple sensor inputs for functions such as, for example, targeting small objects, targeting objects in a field of view containing many objects, targeting moving objects, managing transitions among near-field, mid-field, and far-field targeting methods, manipulating virtual objects, and the like. In some implementations, the cross-modal fusion technique is described as providing a TAMDI interaction model: targeting (e.g., specifying a cursor vector toward an object), activating (e.g., selecting a particular object, region, or volume in the 3D environment), manipulating (e.g., directly moving or changing the selection), deactivating (e.g., deselecting), and integrating (e.g., placing the previous selection back into the environment, if necessary).
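The TAMDI model reads naturally as a small state machine over the five stages named above. The sketch below is illustrative only; the allowed transitions are assumptions rather than a specification from this disclosure.

```python
from enum import Enum, auto

class TamdiState(Enum):
    TARGETING = auto()     # cursor vector directed toward an object
    ACTIVATED = auto()     # a particular object, region, or volume is selected
    MANIPULATING = auto()  # the selection is being moved or changed directly
    DEACTIVATED = auto()   # the selection is released
    INTEGRATED = auto()    # the previous selection is placed back into the environment

# Assumed legal transitions between stages.
TRANSITIONS = {
    TamdiState.TARGETING: {TamdiState.ACTIVATED},
    TamdiState.ACTIVATED: {TamdiState.MANIPULATING, TamdiState.DEACTIVATED},
    TamdiState.MANIPULATING: {TamdiState.DEACTIVATED},
    TamdiState.DEACTIVATED: {TamdiState.INTEGRATED, TamdiState.TARGETING},
    TamdiState.INTEGRATED: {TamdiState.TARGETING},
}

def advance(state, next_state):
    """Move to the next TAMDI stage, rejecting transitions the model does not allow."""
    if next_state not in TRANSITIONS[state]:
        raise ValueError(f"invalid TAMDI transition: {state.name} -> {next_state.name}")
    return next_state
```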

3D display example of wearable system

Wearable systems, also referred to herein as Augmented Reality (AR) systems, may be configured to present 2D or 3D virtual images to a user. The image may be a still image, a frame of video, or video, a combination of the above, or the like. The wearable system may include a wearable device that may present VR, AR, or MR content, alone or in combination, in an environment for user interaction. The wearable device may be a Head Mounted Device (HMD), which may include a head mounted display. In some cases, the wearable device is interchangeably referred to as an AR device (ARD).

FIG. 1 depicts an illustration of a mixed reality scene with certain virtual reality objects and certain physical objects viewed by a person. In fig. 1, an MR scene 100 is depicted, where a user of MR technology sees a real-world park-like setting 110 featuring people, trees, buildings in the background, and a concrete platform 120. In addition to these items, the user of MR technology may also perceive that he "sees" a robot figurine 130 standing on the real world platform 120, as well as a flying cartoon-like avatar character 140 that appears to be an avatar of bumblebee, even though these elements are not present in the real world.

In order for a 3D display to produce a realistic sense of depth, and more specifically, a simulated sense of surface depth, it may be desirable for each point in the display's field of view to generate an accommodative response corresponding to its virtual depth. If the accommodative response to a displayed point does not correspond to the virtual depth of that point (as determined by the binocular depth cues of convergence and stereopsis), the human eye may experience an accommodation conflict, resulting in unstable imaging, harmful eye strain, headaches and, in the absence of accommodation information, an almost complete lack of surface depth.

The VR, AR, and MR experiences may be provided by a display system having a display in which images corresponding to a plurality of rendering planes are provided to a viewer. The rendering plane may correspond to one depth plane or a plurality of depth planes. The images may be different for each rendering plane (e.g., providing a slightly different presentation of the scene or object) and may be focused separately by the eyes of the viewer, providing depth cues to the user based on eye accommodation required to focus different image features of the scene located on the different rendering planes, or based on observing defocusing of different image features on the different rendering planes. As discussed elsewhere herein, such depth cues provide reliable depth perception.

Fig. 2A illustrates an example of a wearable system 200. Wearable system 200 includes a display 220 and various mechanical and electronic modules and systems that support the functionality of display 220. The display 220 may be coupled to a frame 230, which frame 230 may be worn by a user, wearer, or viewer 210. The display 220 may be positioned in front of the eyes of the user 210. The display 220 may present AR/VR/MR content to the user. The display 220 may include a Head Mounted Display (HMD) that is worn on the head of the user. In some embodiments, speaker 240 is coupled to frame 230 and positioned adjacent to the ear canal of the user (in some embodiments, another speaker, not shown, is positioned adjacent to another ear canal of the user to provide stereo/shapeable sound control). The display 220 may include an audio sensor 232 (e.g., a microphone) for detecting an audio stream from the environment on which speech recognition is to be performed.

The wearable system 200 may include an outward-facing imaging system 464 (shown in fig. 4) that views the world in the user's surroundings. Wearable system 200 may also include an inward facing imaging system 462 (shown in fig. 4) that may track the user's eye movements. The inward facing imaging system may track the motion of one eye or the motion of both eyes. An inward-facing imaging system 462 may be attached to frame 230 and may be in electrical communication with processing module 260 or 270, which processing module 260 or 270 may process image information acquired by the inward-facing imaging system to determine, for example, a pupil diameter or orientation, an eye movement, or an eye pose of the eye of user 210.

As an example, wearable system 200 may use outward facing imaging system 464 or inward facing imaging system 462 to acquire images of a user's gesture (e.g., a hand gesture). The image may be a still image, a frame of video, or a video, a combination of the above, or the like. The wearable system 200 may include other sensors, such as Electromyography (EMG) sensors, that sense signals indicative of the action of muscle groups (e.g., see the description with reference to fig. 60A and 60B).

The display 220 may be operatively coupled 250 to a local data processing module 260, such as by a wired lead or wireless connection, which local data processing module 260 may be mounted in various configurations, such as fixedly attached to the frame 230, fixedly attached to a helmet or hat worn by the user, embedded in headphones, or otherwise removably attached to the user 210 (e.g., in a backpack, belt-coupled configuration).

The local processing and data module 260 may include a hardware processor as well as digital memory, such as non-volatile memory (e.g., flash memory), both of which may be used to facilitate processing, caching, and storage of data. The data may include: a) data captured from environmental sensors (which may be, for example, operatively coupled to the frame 230 or otherwise attached to the user 210) or from the audio sensor 232 (e.g., a microphone); or b) data acquired or processed using the remote processing module 270 or the remote data store 280, possibly for passage to the display 220 after such processing or retrieval. Local processing and data module 260 may be operatively coupled to remote processing module 270 or remote data store 280 by communication links 262 or 264 (such as via wired or wireless communication links) such that these remote modules may serve as resources for local processing and data module 260. Further, the remote processing module 270 and the remote data store 280 may be operatively coupled to each other.

In some embodiments, remote processing module 270 may include one or more processors configured to analyze and process data and/or image information. In some embodiments, remote data store 280 may include a digital data storage facility, which may be used over the internet or other network configurations in a "cloud" resource configuration. In some embodiments, all data is stored and all computations are performed in local processing and data modules, allowing fully autonomous use from remote modules.

In addition to or in lieu of the components depicted in fig. 2A or 2B (described below), wearable system 200 may include environmental sensors to detect objects, stimuli, people, animals, locations, or other aspects of the world around the user. The environmental sensors may include an image capture device (e.g., camera, inward facing imaging system, outward facing imaging system, etc.), a microphone, an Inertial Measurement Unit (IMU) (e.g., accelerometer, gyroscope, magnetometer (compass)), a Global Positioning System (GPS) unit, a radio device, an altimeter, a barometer, a chemical sensor, a humidity sensor, a temperature sensor, an external microphone, a light sensor (e.g., photometer), a timing device (e.g., clock or calendar), or any combination or subcombination thereof. In some embodiments, the IMU may be a 9-axis IMU, which may include a three-axis gyroscope, a three-axis accelerometer, and a three-axis magnetometer.

The environmental sensors may also include various physiological sensors. These sensors may measure or estimate physiological parameters of the user, such as heart rate, respiratory rate, galvanic skin response, blood pressure, electroencephalographic state, and the like. The environmental sensors may further include emission devices configured to emit signals such as laser light, visible light, light at non-visible wavelengths, or sound (e.g., audible sound, ultrasound, or other frequencies). In some embodiments, one or more environmental sensors (e.g., cameras or light sensors) may be configured to measure the ambient light (e.g., brightness) of the environment (e.g., to capture the lighting conditions of the environment). Physical contact sensors (such as strain gauges, curb feelers, etc.) may also be included as environmental sensors.

Fig. 2B shows another example of a wearable system 200, which includes an example of a number of sensors. The input from any of these sensors can be used by the system in the cross-modal sensor fusion technique described herein. The head-mounted wearable assembly 200 is shown operably coupled (68) to a local processing and data module (70), such as a belt pack, here using physical multi-wire leads, which also features a control and quick release module (86) to connect the belt pack to the head-mounted display. The head-mounted wearable assembly 200 is also referenced with reference numeral 58 in fig. 2B and below. The local processing and data module (70) is here operatively coupled (100) to the handheld assembly (606) by a wireless connection such as low power bluetooth; the handheld assembly (606) may also be directly operatively coupled (94) to the head-mounted wearable assembly (58), such as through a wireless connection, such as low power bluetooth. Typically, where IMU data is communicated to coordinate gesture detection of various components, high frequency connections are desired, such as in the range of hundreds or thousands of cycles/second or higher; tens of cycles per second may be sufficient for electromagnetic position sensing, such as by sensor (604) and transmitter (602) pairing. Also shown is a global coordinate system (10) representing fixed objects in the real world around the user, such as walls (8).

The cloud resources (46) may also be operatively coupled (42, 40, 88, 90) to local processing and data modules (70), head-mounted wearable assemblies (58), resources possibly coupled to walls (8) or other items fixed relative to the global coordinate system (10), respectively. Resources coupled to a wall (8) or having a known position and/or orientation relative to a global coordinate system (10) may include a wireless transceiver (114), an electromagnetic transmitter (602) and/or receiver (604), a beacon or reflector (112) configured to transmit or reflect a given type of radiation, such as an infrared LED beacon, a cellular network transceiver (110), a RADAR transmitter or detector (108), a LIDAR transmitter or detector (106), a GPS transceiver (118), a poster or marker (122) having a known detectable pattern, and a camera (124).

The system 200 may include a depth camera or depth sensor (154), which may be, for example, a stereo triangulation style depth sensor (such as a passive stereo depth sensor, a texture projection stereo depth sensor, or a structured light stereo depth sensor) or a time-of-flight style depth sensor (such as a LIDAR depth sensor or a modulated emission depth sensor). The system 200 may include a forward-facing "world" camera (124, which may be a grayscale camera, e.g., with a sensor capable of 720p range resolution) and a relatively high resolution "picture camera" (156, which may be a full color camera, e.g., with a sensor capable of 2 megapixel or higher resolution).

The head-mounted wearable assembly (58) has similar components, in addition to an illumination emitter (130) configured to assist the camera (124) detectors, such as an infrared emitter (130) for an infrared camera (124). Also on the head-mounted wearable assembly (58) are one or more strain gauges (116), which may be fixedly coupled to the frame or mechanical platform of the head-mounted wearable assembly (58) and configured to determine deflection of such a platform between components such as the electromagnetic receiver sensors (604) or the display elements (220), where it may be valuable to know whether the platform has bent, such as at a thinned portion of the platform, for example the portion above the nose of the eyeglasses-like platform shown in fig. 2B.

The head-mounted wearable assembly (58) also has a processor (128) and one or more IMUs (102). Each of the components is preferably operatively coupled to a processor (128). A handheld component (606) with similar components and a local processing and data module (70) are shown. With many sensing and connection mechanisms, such a system can be utilized to provide a very high level of connection, system component integration, and position/orientation tracking, as shown in fig. 2B. For example, with such a configuration, the various primary mobile components (58, 70, 606) may be located in terms of position relative to a global coordinate system using WiFi, GPS, or cellular signal triangulation; beacon, electromagnetic tracking, RADAR and LIDAR systems may provide even further position and/or orientation information and feedback. The markers and camera may also be used to provide further information about relative and absolute position and orientation. For example, various camera components (124), such as those shown coupled with the head-mounted wearable component (58), may be used to capture data that may be used in a simultaneous localization and mapping protocol (or "SLAM") to determine the location and how to orient the component (58) relative to other components.

An illustrative and non-limiting list of types of sensors and input modes that may be used with wearable system 200 is described with reference to the descriptions of fig. 2A and 2B. However, not all of these sensors or input modes need be used in every embodiment. In addition, additional or alternative sensors may be used. The selection of sensors and input modes for particular embodiments of wearable system 200 may be based on factors such as cost, weight, size, complexity, and the like. Many permutations and combinations of sensors and input modes are contemplated. Wearable systems that include sensors (such as those described with reference to fig. 2A and 2B) may advantageously utilize the cross-modal input fusion techniques described herein to dynamically select a subset of these sensor inputs to assist a user in selecting, aiming, or interacting with real or virtual objects. A subset of the sensor inputs (typically less than the set of all possible sensor inputs) may include sensor inputs that have converged on the target object, and may exclude (or reduce reliance on) sensor inputs that diverge from the subset or that have not converged on the target object.

The human visual system is complex, and providing a realistic perception of depth is challenging. Without being limited by theory, it is believed that a viewer of an object may perceive the object as three-dimensional due to a combination of vergence and accommodation. Vergence movements of the two eyes relative to each other (e.g., rotational movements of the pupils toward or away from each other to converge the eyes' lines of sight and fixate on an object) are closely associated with the focusing (or "accommodation") of the eyes' lenses. Under normal conditions, changing the focus of the eyes' lenses, or accommodating the eyes, to change focus from one object to another object at a different distance will automatically cause a matching change in vergence to the same distance, under a relationship known as the "accommodation-vergence reflex." Likewise, under normal conditions, a change in vergence will trigger a matching change in accommodation. Display systems that provide a better match between accommodation and vergence may result in more realistic and comfortable simulations of three-dimensional images.
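For reference, the geometric relationship between fixation distance and the vergence angle of the two eyes is standard trigonometry rather than anything specific to this disclosure; the interpupillary distance in the example is a typical adult value.

```latex
% Vergence angle \theta for interpupillary distance p and fixation distance d:
\theta = 2\arctan\!\left(\frac{p}{2d}\right)
% Example with p = 63\,\mathrm{mm}:
%   d = 0.5\,\mathrm{m} \;\Rightarrow\; \theta \approx 7.2^{\circ}
%   d = 2\,\mathrm{m}   \;\Rightarrow\; \theta \approx 1.8^{\circ}
% so the vergence demand falls off quickly as fixation distance increases.
```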

FIG. 3 illustrates aspects of a method of simulating a three-dimensional image using multiple rendering planes. Referring to fig. 3, objects at different distances from the eye 302 and the eye 304 on the z-axis are accommodated by the eye 302 and the eye 304 such that the objects are in focus. Eyes 302 and 304 assume a particular state of accommodation such that objects at different distances along the z-axis are brought into focus. Thus, it can be said that a particular accommodation state is associated with a particular one of the rendering planes 306, which has an associated focal length, such that an object or portion of an object in the rendering plane is focused when the eye is in the accommodation state of the particular rendering plane. In some embodiments, a three-dimensional image may be simulated by providing a different presentation of the image for each of the eyes 302 and 304, and also by providing a different presentation of the image corresponding to each of the rendering planes. Although shown as separate for clarity of illustration, it is understood that the fields of view of the eye 302 and the eye 304 may overlap, for example, as the distance along the z-axis increases. Additionally, while shown as flat for ease of illustration, it is understood that the outline of the rendering plane may be curved in physical space such that all features in the rendering plane are in focus with the eye under a particular state of accommodation. Without being limited by theory, it is believed that the human eye can typically interpret a limited number of rendering planes to provide depth perception. Thus, by providing the eye with different presentations of images corresponding to each of these limited number of rendering planes, a highly reliable simulation of perceived depth may be achieved.

Waveguide stack assembly

Fig. 4 shows an example of a waveguide stack for outputting image information to a user. The wearable system 400 includes a waveguide stack or stacked waveguide assembly 480 that can be used to provide three-dimensional perception to the eye/brain using a plurality of waveguides 432b, 434b, 436b, 438b, 440b. In some embodiments, wearable system 400 may correspond to wearable system 200 of fig. 2A or 2B, with fig. 4 schematically illustrating some portions of wearable system 200 in more detail. For example, in some embodiments, the waveguide assembly 480 may be integrated into the display 220 of figs. 2A and 2B.

With continued reference to fig. 4, the waveguide assembly 480 may further include a plurality of features 458, 456, 454, 452 located between the waveguides. In some embodiments, the features 458, 456, 454, 452 may be lenses. In other embodiments, the features 458, 456, 454, 452 may not be lenses. Instead, they may simply be spacers (e.g., cladding and/or structures for forming air gaps).

The waveguides 432b, 434b, 436b, 438b, 440b or the plurality of lenses 458, 456, 454, 452 may be configured to transmit image information to the eye at various levels of wavefront curvature or light divergence. Each waveguide level may be associated with a particular rendering plane and may be configured to output image information corresponding to that rendering plane. The image injection devices 420, 422, 424, 426, 428 may be used to inject image information into waveguides 440b, 438b, 436b, 434b, 432b, each of which may be configured to distribute incident light through each respective waveguide for output toward the eye 410. Light exits the output surfaces of the image injection devices 420, 422, 424, 426, 428 and is injected into the respective input edges of the waveguides 440b, 438b, 436b, 434b, 432 b. In some embodiments, a single beam (e.g., a collimated beam) may be injected into each waveguide to output an entire field of cloned collimated beams that are directed toward the eye 410 at a particular angle (and amount of divergence) corresponding to the rendering plane associated with the particular waveguide.

In some embodiments, the image injection devices 420, 422, 424, 426, 428 are discrete displays, each display producing image information for injection into a respective waveguide 440b, 438b, 436b, 434b, 432b, respectively. In some other embodiments, the image injection devices 420, 422, 424, 426, 428 are outputs of a single multiplexed display that can pipe image information to each of the image injection devices 420, 422, 424, 426, 428, such as via one or more light pipes (e.g., fiber optic cables).

The controller 460 controls the operation of the stacked waveguide assembly 480 and the image injection devices 420, 422, 424, 426, 428. Controller 460 includes programming (e.g., instructions in a non-transitory computer readable medium) that adjusts the timing and provision of image information to waveguides 440b, 438b, 436b, 434b, 432 b. In some embodiments, controller 460 may be a single unitary device or a distributed system connected by a wired or wireless communication channel. In some embodiments, the controller 460 may be part of the processing module 260 or 270 (as shown in fig. 2A, 2B).

The waveguides 440b, 438b, 436b, 434b, 432b may be configured to propagate light within each respective waveguide by Total Internal Reflection (TIR). The waveguides 440b, 438b, 436b, 434b, 432b may each be planar or have other shapes (e.g., curved), with top and bottom major surfaces and edges extending between the top and bottom major surfaces. In the illustrated configuration, the waveguides 440b, 438b, 436b, 434b, 432b may each include light extraction optics 440a, 438a, 436a, 434a, 432a configured to extract light out of the waveguides by redirecting light propagating within each respective waveguide to output image information to the eye 410. The extracted light may also be referred to as outcoupled light, and the light extracting optical element may also be referred to as an outcoupled optical element. The extracted light beam is output by the waveguide at a location where light propagating in the waveguide strikes the light redirecting element. The light extraction optical elements (440a, 438a, 436a, 434a, 432a) may be, for example, reflective or diffractive optical features. Although illustrated as being disposed at the bottom major surface of the waveguides 440b, 438b, 436b, 434b, 432b for ease of description and clarity of drawing, in some embodiments the light extraction optical elements 440a, 438a, 436a, 434a, 432a may be disposed at the top or bottom major surface or may be disposed directly in the volume of the waveguides 440b, 438b, 436b, 434b, 432 b. In some embodiments, the light extraction optical elements 440a, 438a, 436a, 434a, 432a may be formed in a layer of material attached to a transparent substrate to form waveguides 440b, 438b, 436b, 434b, 432 b. In some other embodiments, the waveguides 440b, 438b, 436b, 434b, 432b may be a single piece of material, and the light extraction optical elements 440a, 438a, 436a, 434a, 432a may be formed on a surface of that piece of material or in the interior of that piece of material.

With continued reference to fig. 4, as discussed herein, each waveguide 440b, 438b, 436b, 434b, 432b is configured to output light to form an image corresponding to a particular rendering plane. For example, the waveguide 432b closest to the eye may be configured to deliver collimated light to the eye 410 as injected into such waveguide 432b. The collimated light may represent an optically infinite focal plane. The next waveguide up 434b can be configured to emit collimated light that is transmitted through the first lens 452 (e.g., a negative lens) before it can reach the eye 410. The first lens 452 can be configured to produce a slight convex wavefront curvature such that the eye/brain interprets light from the next waveguide up 434b as coming from a first focal plane that is closer inward toward the eye 410 from optical infinity. Similarly, the third waveguide up 436b transmits its output light through both the first lens 452 and the second lens 454 before reaching the eye 410. The combined optical power of the first lens 452 and the second lens 454 can be configured to produce another increment of wavefront curvature so that the eye/brain interprets light from the third waveguide up 436b as coming from a second focal plane that is even closer inward toward the person from optical infinity than light from the next waveguide up 434b.
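
By way of background only, and not as part of the disclosed apparatus, the relationship between cumulative lens power and perceived rendering-plane distance can be pictured in diopters: collimated output corresponds to optical infinity, and each negative lens adds wavefront curvature that moves the apparent source closer. The sketch below applies that bookkeeping with hypothetical lens powers chosen purely for illustration.

    # Minimal sketch: apparent rendering-plane distance from cumulative lens power.
    # The lens powers below are hypothetical placeholders, not values from this disclosure.

    def apparent_distance_m(cumulative_lens_power_diopters: float) -> float:
        """0 diopters -> optical infinity; -0.5 D -> light appears to come from 2 m."""
        if cumulative_lens_power_diopters == 0.0:
            return float("inf")
        return 1.0 / abs(cumulative_lens_power_diopters)

    # Waveguides listed from nearest the eye (432b) upward, with the lenses each
    # output passes through on its way to the eye (e.g., 434b passes through 452).
    lens_powers = {"452": -0.5, "454": -0.5, "456": -1.0, "458": -1.0}   # diopters (assumed)
    stack = [
        ("432b", []),                               # collimated -> optical infinity
        ("434b", ["452"]),
        ("436b", ["452", "454"]),
        ("438b", ["452", "454", "456"]),
        ("440b", ["452", "454", "456", "458"]),     # closest perceived focal plane
    ]

    for waveguide, lenses in stack:
        total = sum(lens_powers[name] for name in lenses)
        print(waveguide, "appears at", apparent_distance_m(total), "m")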

The other waveguide layers (e.g., waveguides 438b, 440b) and lenses (e.g., lenses 456, 458) are similarly configured, with the highest waveguide 440b in the stack sending its output through all of the lenses between it and the eye, representing the aggregate focal power for the focal plane closest to the person. To compensate for the stack of lenses 458, 456, 454, 452 when viewing/interpreting light from the world 470 on the other side of the stacked waveguide assembly 480, a compensating lens layer 430 may be disposed at the top of the stack to compensate for the total power of the underlying lens stack 458, 456, 454, 452. This configuration provides as many perceived focal planes as there are waveguide/lens pairs available. The focusing aspects of the light extraction optics and lenses of the waveguide may be static (e.g., not dynamic or electrically active). In some alternative embodiments, either or both may be dynamic using electrically active features.

With continued reference to fig. 4, the light extraction optical elements 440a, 438a, 436a, 434a, 432a may be configured to redirect light out of their respective waveguides and output the light with an appropriate amount of divergence or degree of collimation for the particular rendering plane associated with the waveguide. As a result, waveguides with different associated rendering planes may have different configurations of light extraction optical elements that output light with different amounts of divergence depending on the associated rendering plane. In some embodiments, as discussed herein, the light extraction optical elements 440a, 438a, 436a, 434a, 432a may be volumes or surface features that may be configured to output light at a particular angle. For example, the light extraction optical elements 440a, 438a, 436a, 434a, 432a may be volume holograms, surface holograms, and/or diffraction gratings. Light extraction optical elements such as diffraction gratings are described in U.S. patent publication No.2015/0178939, published on 25/6/2015, which is incorporated herein by reference in its entirety.

In some embodiments, the light extraction optical elements 440a, 438a, 436a, 434a, 432a are diffractive features or "diffractive optical elements" (also referred to herein as "DOEs") that form a diffraction pattern. Preferably, the DOE has a relatively low diffraction efficiency, such that only a portion of the beam is deflected by each intersection of the DOE towards the eye 410, while the remainder continues to move through the waveguide via total internal reflection. The light carrying the image information can thus be split into a plurality of related exit beams that exit the waveguide at a plurality of locations, and the result is a fairly uniform pattern of exit emissions towards the eye 304 for that particular collimated beam bouncing within the waveguide.

In some embodiments, one or more DOEs may be switchable between an "on" state in which they are actively diffracting and an "off" state in which they are not significantly diffracting. For example, a switchable DOE may comprise a polymer dispersed liquid crystal layer, wherein the droplets comprise a diffraction pattern in the matrix medium, and the refractive index of the droplets may be switched to substantially match the refractive index of the matrix material (in which case the pattern does not significantly diffract incident light), or the droplets may be switched to a refractive index that does not match the refractive index of the matrix medium (in which case the pattern actively diffracts incident light).

In some embodiments, the number and distribution or depth of field of the rendering planes may be dynamically changed based on the pupil size or orientation of the viewer's eyes. The depth of field may vary inversely with the pupil size of the viewer. Thus, as the pupil size of the viewer's eye decreases, the depth of field increases, such that a plane that is not discernable due to its position beyond the depth of focus of the eye may become discernable, and appear to be more focused as the pupil size decreases, commensurate with the increase in depth of field. Likewise, the number of spaced rendering planes for presenting different images to a viewer may decrease as the pupil size decreases. For example, a viewer may not be able to clearly perceive details of both the first and second rendering planes at one pupil size without adjusting the accommodation of the eyes from one rendering plane to another. However, the two rendering planes may be sufficiently focused at the same time for a user at another pupil size without changing the adjustment.

In some embodiments, the display system may change the number of waveguides receiving image information based on a determination of pupil size or orientation, or upon receiving an electrical signal indicative of a particular pupil size or orientation. For example, if the user's eye is unable to distinguish between two depth planes associated with two waveguides, controller 460 may be configured or programmed to stop providing image information to one of the waveguides. Advantageously, this may reduce the processing burden on the system, thereby increasing the responsiveness of the system. In embodiments in which the DOE for a waveguide is switchable between on and off states, the DOE may be switched to the off state when the waveguide is not receiving image information.
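
One way to picture the controller behavior described above is as a decision that keeps only rendering planes whose dioptric separation exceeds a pupil-dependent resolvability threshold and stops feeding image information to the rest. The sketch below is purely illustrative: the plane spacings and the pupil-to-threshold mapping are assumptions, not values taken from this disclosure.

    # Illustrative sketch: choose which rendering planes to drive, given pupil size.
    # Plane depths (in diopters) and the pupil->threshold mapping are assumptions.

    def resolvable_separation_diopters(pupil_diameter_mm: float) -> float:
        """Smaller pupil -> larger depth of field -> coarser plane separation suffices."""
        return 0.6 if pupil_diameter_mm < 3.0 else 0.3    # hypothetical mapping

    def active_planes(plane_depths_diopters, pupil_diameter_mm):
        threshold = resolvable_separation_diopters(pupil_diameter_mm)
        kept = []
        for depth in sorted(plane_depths_diopters):
            if not kept or depth - kept[-1] >= threshold:
                kept.append(depth)          # keep this plane
            # else: skip it; its waveguide receives no image information
            #       (and a switchable DOE could be put in its "off" state)
        return kept

    planes = [0.0, 0.33, 0.66, 1.0, 2.0]    # 0.0 D = optical infinity (assumed spacing)
    print(active_planes(planes, pupil_diameter_mm=2.5))   # fewer planes for a small pupil
    print(active_planes(planes, pupil_diameter_mm=5.0))   # more planes for a large pupil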

In some embodiments, it may be desirable to have the outgoing beam satisfy a condition that the diameter is smaller than the diameter of the viewer's eye. However, meeting such conditions can be challenging given the variability of the pupil size of the viewer. In some embodiments, this condition is satisfied over a wide range of pupil sizes by varying the size of the emergent beam in response to a determination of the pupil size of the viewer. For example, as the pupil size decreases, the size of the exiting beam may also decrease. In some embodiments, an iris diaphragm may be used to vary the outgoing beam size.

The wearable system 400 may include an outward facing imaging system 464 (e.g., a digital camera) that images a portion of the world 470. This portion of the world 470 may be referred to as the field of view (FOV) of the world camera, and the imaging system 464 is sometimes referred to as a FOV camera. The entire area available for viewing or imaging by a viewer may be referred to as the field of regard (FOR). Because the wearer can move his body, head, or eyes to perceive substantially any direction in space, the FOR can include substantially all 4 pi steradians of solid angle around the wearable system 400. In other cases, the wearer's motion may be more restricted, and accordingly, the wearer's FOR may subtend a smaller solid angle. Images obtained from the outward facing imaging system 464 may be used to track gestures made by the user (e.g., hand or finger gestures), detect objects in the world 470 in front of the user, and so forth.

Wearable system 400 may also include an inward facing imaging system 462 (e.g., a digital camera) that observes user motion, such as eye motion and facial motion. Inward facing imaging system 462 may be used to capture images of eye 410 to determine the size and/or orientation of the pupil of eye 304. The inward facing imaging system 462 may be used to obtain images for determining a direction in which the user is looking (e.g., eye pose) or for biometric recognition of the user (e.g., via iris recognition). In some embodiments, at least one camera may be utilized for each eye to independently determine the pupil size or eye pose of each eye separately, thereby allowing image information to be presented to each eye to dynamically fit the eye. In some other embodiments, the pupil diameter or orientation of only a single eye 410 (e.g., using only a single camera per pair of eyes) is determined and assumed to be similar for both eyes of the user. The images obtained by inward facing imaging system 462 may be analyzed to determine the user's eye pose or emotion, which may be used by wearable system 400 to decide which audio or visual content should be presented to the user. Wearable system 400 may also use sensors such as IMUs, accelerometers, gyroscopes, etc. to determine head pose (e.g., head position or head orientation).

Wearable system 400 may include a user input device 466 by which a user may input commands to controller 460 to interact with wearable system 400. For example, the user input devices 466 may include a touch pad, touch screen, joystick, multiple degree of freedom (DOF) controller, capacitive sensing device, game controller, keyboard, mouse, directional pad (D-pad), magic wand, haptic device, totem (e.g., for use as a virtual user input device), and so forth. The multi-DOF controller may sense user input in some or all of the possible translational (e.g., left/right, fore/aft, or up/down) or rotational (e.g., yaw, pitch, or roll) aspects of the controller. A multi-DOF controller that supports translational motion may be referred to as a 3DOF controller, while a multi-DOF controller that supports translation and rotation may be referred to as a 6DOF controller. In some cases, the user may press or swipe on the touch-sensitive input device using a finger (e.g., a thumb) to provide input to the wearable system 400 (e.g., to provide user input to a user interface provided by the wearable system 400). User input device 466 may be held by a user's hand during use of wearable system 400. User input device 466 may be in wired or wireless communication with wearable system 400.
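
As a small, non-limiting illustration of the 3DOF/6DOF distinction drawn above, the following sketch models controller input as translation and rotation components, with a 3DOF reading carrying only the translational part as described in the preceding paragraph. The field names and units are assumptions made for the example.

    # Minimal sketch of multi-DOF controller input (field names and units are illustrative).
    from dataclasses import dataclass

    @dataclass
    class Translation:
        x: float   # left/right, meters
        y: float   # up/down, meters
        z: float   # fore/aft, meters

    @dataclass
    class Rotation:
        yaw: float    # degrees
        pitch: float  # degrees
        roll: float   # degrees

    @dataclass
    class ThreeDofInput:
        translation: Translation          # translation-only, per the passage above

    @dataclass
    class SixDofInput:
        translation: Translation
        rotation: Rotation                # translation plus rotation

    sample = SixDofInput(Translation(0.02, 0.00, -0.10), Rotation(15.0, -5.0, 0.0))
    print(sample)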

Fig. 5 shows an example of an outgoing light beam output by a waveguide. One waveguide is shown, but it should be understood that other waveguides in the waveguide assembly 480 may function similarly, where the waveguide assembly 480 includes a plurality of waveguides. Light 520 is injected into waveguide 432b at input edge 432c of waveguide 432b and propagates within waveguide 432b by TIR. At the point where light 520 impinges on DOE 432a, a portion of the light exits the waveguide as exit beam 510. The exit light beam 510 is shown as being substantially parallel, but depending on the rendering plane associated with the waveguide 432b, the exit light beam 510 may also be redirected at an angle (e.g., to form a diverging exit light beam) to propagate to the eye 410. It should be understood that the substantially parallel exit beams may be indicative of a waveguide having light extraction optics that couple out light to form an image that appears to be disposed on a rendering plane at a large distance (e.g., optical infinity) from the eye 410. Other waveguides or other groups of light extraction optics may output a more divergent exit beam pattern, which would require the eye 410 to adjust to a closer distance to focus it on the retina and would be interpreted by the brain as light from a distance closer to the eye 410 than optical infinity.

FIG. 6 is a schematic diagram showing an optical system including a waveguide device, an optical coupler subsystem to optically couple light to or from the waveguide device, and a control subsystem for generating a multi-focal stereoscopic display, image, or light field. The optical system may include a waveguide device, an optical coupler subsystem to optically couple light to or from the waveguide device, and a control subsystem. The optical system may be used to generate a multi-focal stereo, image or light field. The optical system may include one or more main planar waveguides 632b (only one shown in fig. 6) and one or more DOEs 632a associated with each of at least some of the main planar waveguides 632b. The planar waveguide 632b may be similar to the waveguides 432b, 434b, 436b, 438b, 440b discussed with reference to fig. 4. The optical system may use a distributed waveguide device to relay light along a first axis (the vertical or Y-axis in the view of fig. 6) and expand the effective exit pupil of the light along the first axis (e.g., the Y-axis). The distributed waveguide apparatus may, for example, include a distribution planar waveguide 622b and at least one DOE 622a (shown by dashed double-dotted lines) associated with the distribution planar waveguide 622b. The distribution planar waveguide 622b may be similar or identical to, but have a different orientation than, the main planar waveguide 632b in at least some respects. Similarly, the at least one DOE 622a may be similar or identical to the DOE 632a in at least some respects. For example, the distribution planar waveguide 622b or DOE 622a may be composed of the same material as the main planar waveguide 632b or DOE 632a, respectively. The embodiment of the optical display system 600 shown in fig. 6 may be integrated into the wearable display system 200 shown in fig. 2A or fig. 2B.

The relayed and exit pupil expanded light can be optically coupled from the distribution waveguide arrangement into one or more of the main planar waveguides 632 b. The primary planar waveguide 632b may relay light along a second axis (e.g., the horizontal or X-axis in the view of fig. 6), which is preferably orthogonal to the first axis. Notably, the second axis may be a non-orthogonal axis to the first axis. The main planar waveguide 632b expands the effective exit pupil of light along this second axis (e.g., the X-axis). For example, the distribution planar waveguide 622b can relay and expand light along the vertical or Y-axis and deliver the light to the main planar waveguide 632b, which can relay and expand light along the horizontal or X-axis.

The optical system may include one or more colored light sources (e.g., red, green, and blue lasers) 610 that may be optically coupled into the proximal end of a single mode optical fiber 640. The distal end of the optical fiber 640 may be passed through or received by a hollow tube 642 of piezoelectric material. The distal end protrudes from the tube 642 as an unsecured flexible cantilever 644. The piezoelectric tubes 642 may be associated with four quadrant electrodes (not shown). For example, the electrodes may be plated on the outside, outer surface, or periphery or outer diameter of the tube 642. A core electrode (not shown) may also be located in the core, center, inner periphery, or inner diameter of tube 642.

Drive electronics 650, for example, electrically coupled via leads 660, drive the opposing pairs of electrodes to independently bend the piezoelectric tube 642 in two axes. The protruding distal tip of the optical fiber 644 has a mechanical resonance mode. The frequency of resonance may depend on the diameter, length, and material characteristics of the optical fiber 644. By vibrating the piezoelectric tube 642 near the first mechanical resonance mode of the fiber cantilever 644, the fiber cantilever 644 can be made to vibrate and can sweep large deflections.

The tip of the fiber optic cantilever 644 is scanned bi-axially throughout the area of the two-dimensional (2-D) scan by exciting resonances in both axes. By modulating the intensity of one or more light sources 610 in synchronization with the scanning of the fiber optic cantilever 644, the light exiting the fiber optic cantilever 644 may form an image. A description of such an arrangement is provided in U.S. patent publication No.2014/0003762, which is incorporated herein by reference in its entirety.

Components of the optical coupler subsystem may collimate the light exiting the scanning fiber cantilever 644. The collimated light may be reflected by a mirror 648 into a narrow distribution planar waveguide 622b containing at least one Diffractive Optical Element (DOE)622 a. The collimated light can propagate vertically (relative to the view of figure 6) along the distribution planar waveguide 622b by TIR and in so doing repeatedly intersect with the DOE 622 a. The DOE 622a preferably has low diffraction efficiency. This may result in a portion (e.g., 10%) of the light being diffracted towards the edge of the larger principal planar waveguide 632b at each intersection with the DOE 622a, and a portion of the light continuing down the length of the distribution planar waveguide 622b on its original trajectory by TIR.

At each intersection with the DOE 622a, additional light may be diffracted toward the entrance of the main waveguide 632b. By dividing the incident light into a plurality of outcoupled sets, the exit pupil of the light can be expanded vertically by the DOE 622a in the distribution planar waveguide 622b. The vertically expanded light coupled out of the distribution planar waveguide 622b can enter the edge of the main planar waveguide 632b.

Light entering the main waveguide 632b may propagate horizontally (relative to the view of fig. 6) along the main waveguide 632b via TIR. Since light propagates horizontally by TIR along at least a portion of the length of the primary waveguide 632b, the light intersects the DOE 632a at multiple points. The DOE 632a may advantageously be designed or constructed to have a phase profile that is the sum of a linear diffraction pattern and a radially symmetric diffraction pattern to produce deflection and focusing of light. The DOE 632a may advantageously have a low diffraction efficiency (e.g., 10%) such that only a portion of the light of the beam is deflected toward the eye of the viewer at each intersection with the DOE 632a, while the rest of the light continues to propagate through the waveguide 632b via TIR.

At each intersection between the propagating light and the DOE 632a, a portion of the light is diffracted towards the adjacent facet of the primary waveguide 632b, allowing the light to escape TIR and exit the facet of the primary waveguide 632 b. In some embodiments, the radially symmetric diffraction pattern of the DOE 632a additionally imparts a level of focus to the diffracted light, both shaping the optical wavefront of the individual beams (e.g., imparting curvature) and steering the beams at an angle that matches the designed level of focus.

Thus, these different paths may cause light to be coupled out of the main planar waveguide 632b by a multiplicity of DOEs 632a at different angles, with different focus levels, and/or with different fill patterns at the exit pupil. Different fill patterns at the exit pupil may advantageously be used to create a light field display having multiple depth planes. Each layer in the waveguide assembly or a set of layers (e.g., 3 layers) in the stack may be used to produce a respective color (e.g., red, blue, green). Thus, for example, a first set of three adjacent layers may be employed to produce red, blue, and green light, respectively, at a first focal depth. A second set of three adjacent layers may be employed to produce red, blue, and green light, respectively, at a second focal depth. Multiple groups can be employed to generate a full 3D or 4D color image light field with various focal depths.

Other parts of wearable system

In many embodiments, the wearable system may include other components in addition to or in place of the components of the wearable system described above. The wearable system may include, for example, one or more haptic devices or components. Haptic devices or components may be used to provide haptic sensations to a user. For example, a haptic device or component may provide pressure or texture haptics when touching virtual content (e.g., virtual objects, virtual tools, other virtual constructs). Haptic sensations can replicate the sensation of a physical object represented by a virtual object, or can replicate the sensation of a desired object or character (e.g., a dragon) represented by virtual content. In some embodiments, the haptic device or component may be worn by a user (e.g., a glove wearable by the user). In some embodiments, the haptic device or component may be held by a user.

The wearable system may include, for example, one or more physical objects that may be manipulated by a user to allow input or interaction with the wearable system. These physical objects may be referred to herein as totems. Some totems may take the form of inanimate objects such as, for example, metal or plastic blocks, walls, or surfaces of tables. In some embodiments, the totem may not actually have any physical input structures (e.g., keys, triggers, joysticks, trackballs, rocker switches). Instead, the totem may simply provide a physical surface, and the wearable system may present a user interface so as to appear to the user to be on one or more surfaces of the totem. For example, the wearable system may make it appear that an image of a computer keyboard and touch pad reside on one or more surfaces of the totem. For example, the wearable system may make a virtual computer keyboard and virtual touchpad appear on the surface of a thin rectangular plate of aluminum that serves as a totem. The rectangular plate itself does not have any physical keys or touch pads or sensors. However, the wearable system may detect user manipulation of, interaction with, or touching of the rectangular plate as a selection or input made via the virtual keyboard or virtual trackpad. The user input device 466 (shown in fig. 4) may be an embodiment of a totem, which may include a trackpad, touchpad, trigger, joystick, trackball, rocker or virtual switch, mouse, keyboard, multiple degree of freedom controller, or another physical input device. The user may use totems, alone or in combination with gestures, to interact with the wearable system or other users.

Examples of haptic devices and totems that may be used in the wearable devices, HMDs, and display systems of the present disclosure are described in U.S. patent publication No.2015/0016777, which is incorporated herein by reference in its entirety.

Examples of wearable systems, environments, and interfaces

Wearable systems may employ various mapping-related techniques in order to achieve a high depth of field in the rendered light field. When mapping out the virtual world, it is advantageous to know all the features and points in the real world to accurately depict virtual objects in relation to the real world. To do so, FOV images captured from a user of the wearable system may be added to the world model by including new pictures that convey information about various points and features of the real world. For example, the wearable system may collect a set of map points (such as 2D points or 3D points) and find new map points to present a more accurate version of the world model. The world model of the first user may be communicated to the second user (e.g., over a network such as a cloud network) so that the second user may experience the world surrounding the first user.

Fig. 7 is a block diagram of an example of an MR environment 700. The MR environment 700 may be configured to receive input (e.g., visual input 702 from a user's wearable system, fixed input 704 such as a room camera, sensor input 706 from various sensors, user input from a user input device 466, gestures, totems, eye tracking, etc.) from one or more user wearable systems (e.g., wearable system 200 or display system 220) or fixed room systems (e.g., an indoor camera, etc.). The wearable system may use various sensors (e.g., accelerometers, gyroscopes, temperature sensors, movement sensors, depth sensors, GPS sensors, inward facing imaging systems, outward facing imaging systems, etc.) to determine the location and various other attributes of the user's environment. This information may be further supplemented with information from stationary cameras in the room, which may provide images or various cues from different viewpoints. Image data acquired by a camera (such as a room camera and/or a camera of an outward facing imaging system) may be reduced to a set of map points.

One or more object identifiers 708 may crawl through the received data (e.g., a collection of points) and recognize or map points, tag images, and append semantic information to objects by means of a map database 710. The map database 710 may include various points collected over time and their corresponding objects. The various devices and the map database may be interconnected through a network (e.g., LAN, WAN, etc.) to access the cloud.

Based on this information and the set of points in the map database, object identifiers 708 a-708 n may identify objects in the environment. For example, the object identifier may identify a face, a person, a window, a wall, a user input device, a television, other objects in the user's environment, and so forth. One or more object identifiers may be dedicated to objects having particular characteristics. For example, object recognizer 708a may be used to recognize a face, while another object recognizer may be used to recognize a totem, while another object recognizer may be used to recognize a hand, finger, arm, or body gesture.

Object recognition may be performed using various computer vision techniques. For example, the wearable system may analyze images acquired by the outward facing imaging system 464 (as shown in fig. 4) to perform scene reconstruction, event detection, video tracking, object recognition, object pose estimation, learning, indexing, motion estimation, image restoration, or the like. One or more computer vision algorithms may be used to perform these tasks. Non-limiting examples of computer vision algorithms include: Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), oriented FAST and rotated BRIEF (ORB), binary robust invariant scalable keypoints (BRISK), FAST retina keypoints (FREAK), the Viola-Jones algorithm, the eigenface method, the Lucas-Kanade algorithm, the Horn-Schunck algorithm, mean-shift algorithms, visual simultaneous localization and mapping (vSLAM) techniques, sequential bayesian estimators (e.g., kalman filters, extended kalman filters, etc.), bundle adjustment, adaptive thresholding (and other thresholding techniques), Iterative Closest Point (ICP), semi-global matching (SGM), semi-global block matching (SGBM), feature point histograms, various machine learning algorithms (e.g., support vector machines, k-nearest neighbor algorithms, naive bayes, neural networks (including convolutional or deep neural networks), or other supervised/unsupervised models, etc.), and the like.
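
By way of a concrete, non-limiting illustration of one of the listed techniques, the sketch below uses OpenCV's ORB keypoints with brute-force Hamming matching to compare a reference image of an object against a camera frame. The file paths and match threshold are hypothetical, and the wearable system is in no way limited to this particular algorithm or library.

    # Illustrative only: ORB feature matching with OpenCV (paths/threshold are hypothetical).
    import cv2

    reference = cv2.imread("reference_object.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path
    frame = cv2.imread("camera_frame.png", cv2.IMREAD_GRAYSCALE)          # hypothetical path

    orb = cv2.ORB_create(nfeatures=500)
    kp_ref, des_ref = orb.detectAndCompute(reference, None)
    kp_frame, des_frame = orb.detectAndCompute(frame, None)

    # Hamming distance suits ORB's binary descriptors; cross-checking filters weak matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_ref, des_frame), key=lambda m: m.distance)

    good = [m for m in matches if m.distance < 50]    # hypothetical threshold
    print(f"{len(good)} good matches; the object may be considered present if this count is high")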

Additionally or alternatively, object recognition may be performed by various machine learning algorithms. Once trained, the machine learning algorithm may be stored by the HMD. Some examples of machine learning algorithms may include supervised or unsupervised machine learning algorithms, including regression algorithms (e.g., ordinary least squares regression), instance-based algorithms (e.g., learning vector quantization), decision tree algorithms (e.g., classification and regression trees), bayesian algorithms (e.g., naive bayes), clustering algorithms (e.g., k-means clustering), association rule learning algorithms (e.g., a priori), artificial neural network algorithms (e.g., perceptron), deep learning algorithms (e.g., deep boltzmann machine or deep neural network), dimension reduction algorithms (e.g., principal component analysis), ensemble algorithms (e.g., stack generalization), and/or other machine learning algorithms. In some embodiments, individual models may be customized for individual data sets. For example, the wearable device may generate or store a base model. The base model may be used as a starting point for generating additional models that are specific to the type of data (e.g., a particular user in a telepresence session), the data set (e.g., additional image sets obtained from the user in the telepresence session), the condition, or other changes. In some embodiments, the wearable HMD may be configured to utilize a variety of techniques to generate a model for analyzing the aggregated data. Other techniques may include using predefined thresholds or data values.

Based on this information and the set of points in the map database, the object identifiers 708 a-708 n may identify objects and supplement the objects with semantic information to give the objects life. For example, if the object identifier identifies a set of points as a door, the system may append some semantic information (e.g., the door has a hinge and has 90 degrees of movement around the hinge). If the object identifier identifies a set of points as mirrors, the system may append semantic information that: the mirror has a reflective surface that can reflect an image of an object in the room. Over time, the map database grows as the system (which may reside locally or be accessible through a wireless network) accumulates more data from the world. Once the object is identified, this information may be sent to one or more wearable systems. For example, MR environment 700 may include information about a scene occurring in California. The environment 700 may be sent to one or more users in New York. Based on data received from the FOV camera and other inputs, the object identifier and other software components may map points collected from various images, identify objects, etc., so that the scene may be accurately "passed on" to a second user, possibly in a different part of the world. The environment 700 may also use a topology map for localization purposes.
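
One simple way to picture the attachment of semantic information is a lookup from a recognized object label to a set of properties that are appended to the map entry, as in the sketch below. The labels and properties shown are illustrative examples drawn from the passage above rather than a prescribed schema.

    # Illustrative sketch: appending semantic information to recognized objects.
    SEMANTICS = {
        "door":   {"has_hinge": True, "hinge_rotation_degrees": 90},
        "mirror": {"reflective_surface": True, "may_reflect_room_objects": True},
    }

    def annotate(recognized_object: dict) -> dict:
        """recognized_object is assumed to carry at least a 'label' and a 'points' field."""
        label = recognized_object.get("label")
        recognized_object["semantics"] = SEMANTICS.get(label, {})
        return recognized_object

    entry = annotate({"label": "door", "points": [(1.0, 0.2, 2.5), (1.0, 2.1, 2.5)]})
    print(entry["semantics"])    # {'has_hinge': True, 'hinge_rotation_degrees': 90}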

FIG. 8 is a process flow diagram of an example of a method 800 of presenting virtual content related to an identified object. Method 800 describes how a virtual scene may be presented to a user of a wearable system. The user may be geographically distant from the scene. For example, a user may be in New York, but may want to view a scene that is currently occurring in California, or may want to go on a walk with friends who live in California.

At block 810, the wearable system may receive input from the user and other users regarding the user's environment. This can be achieved by various input devices and knowledge already held in the map database. At block 810, the user's FOV camera, sensors, GPS, eye tracking, etc. communicate information to the system. At block 820, the system may determine sparse points based on this information. Sparse points may be used to determine pose data (e.g., head pose, eye pose, body pose, or hand gesture) that may be used to display and understand the orientation and position of various objects in the user's surroundings. At block 830, the object identifiers 708a-708n may crawl through these collected points and identify one or more objects using the map database. This information may then be communicated to the user's personal wearable system at block 840, and the desired virtual scene may be displayed to the user accordingly at block 850. For example, the desired virtual scene (e.g., a user located in CA) may be displayed in an appropriate orientation, position, etc., relative to various objects and other surroundings of the user in New York.

Fig. 9 is a block diagram of another example of a wearable system. In this example, wearable system 900 includes a map, which may include map data of the world. The map may reside partially locally to the wearable system, and may reside partially in a network storage location accessible through a wired or wireless network (e.g., in a cloud system). Gesture processing 910 (e.g., head or eye gestures) may be performed on the wearable computing architecture (e.g., processing module 260 or controller 460) and utilize data from the map to determine the position and orientation of the wearable computing hardware or the user. Gesture data may be calculated from data collected instantaneously as the user is experiencing the system and operating in the world. The data may include images, data from sensors (e.g., inertial measurement units, which typically include accelerometer and gyroscope components), and surface information related to objects in a real or virtual environment.

The sparse point representation may be the output of a simultaneous localization and mapping (SLAM or V-SLAM, meaning a configuration in which the input is image/visual only) process. The system may be configured to not only find the locations of various components in the world, but also find out what the world is made up of. Gestures may be building blocks that achieve many goals, including populating maps and using data from maps.

In one embodiment, the sparse point locations may not themselves be entirely sufficient, and further information may be needed to produce a multi-focus AR, VR, or MR experience. The gap may be at least partially filled with a dense representation, which generally refers to depth map information. Such information may be calculated according to a process 940 known as Stereo, wherein depth information is determined using techniques such as triangulation or time-of-flight sensing. Image information and active patterns (such as infrared patterns created using an active projector) may be used as inputs to the stereo process 940. A large amount of depth map information may be fused together, and some of it may be summarized with a surface representation. For example, a mathematically definable surface is an efficient (e.g., relative to a large point cloud) and digestible input for other processing devices such as a game engine. Thus, the output of the stereo process (e.g., a depth map) 940 may be combined in the fusion process 930. A gesture may also be an input to the fusion process 930, and the output of the fusion 930 becomes an input to the process 920 of populating the map. Sub-surfaces may be connected to each other (e.g., in a topographical map) to form larger surfaces, and the map becomes a large mixture of points and surfaces.
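
As background for the stereo process described above, one commonly used relationship recovers depth from the disparity between two rectified camera views as Z = f*B/d, where f is the focal length in pixels, B is the camera baseline, and d is the disparity in pixels. The sketch below applies that relationship with assumed camera parameters and is offered only as an illustration.

    # Background sketch: depth from stereo disparity (camera parameters are assumed).
    def depth_from_disparity(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
        """Z = f * B / d for rectified stereo pairs; larger disparity -> closer surface."""
        if disparity_px <= 0:
            return float("inf")    # no measurable disparity -> treat as very far away
        return focal_length_px * baseline_m / disparity_px

    # Hypothetical parameters: 700 px focal length, 6 cm baseline.
    for d in (70.0, 14.0, 7.0):
        print(f"disparity {d:5.1f} px -> depth {depth_from_disparity(d, 700.0, 0.06):.2f} m")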

To address various aspects in the mixed reality process 960, various inputs may be used. For example, in the embodiment shown in fig. 9, game parameters may be entered to determine that a user of the system is playing a monster game, where one or more monsters are located at various locations, monsters die or escape under various conditions (e.g., if the user kills a monster), walls or other objects are located at various locations, and so forth. The world map may include information about where these objects are relative to each other, as another valuable input to mixed reality. Gestures relative to the world also become an input and play a key role for almost any interactive system.

The control or input from the user is another input to wearable system 900. As described herein, user inputs may include visual inputs, gestures, totems, audio inputs, sensory inputs, and the like. To move around or play a game, for example, the user may need to instruct wearable system 900 as to what he or she wants to do. In addition to moving themselves in space, there are many forms of user control that may be utilized. In one embodiment, a totem (e.g., a user input device) or an object such as a toy gun may be held by the user and tracked by the system. The system is preferably configured to know that the user is holding the item and to understand what kind of interaction the user is having with the item (e.g., if the totem or object is a gun, the system may be configured to know its location and orientation, and whether the user is clicking a trigger or other sensed button or element that may be equipped with a sensor, such as an IMU, which may help determine what is happening even if such activity is not within the field of view of any camera).

Gesture tracking or recognition may also provide input information. Wearable system 900 may be configured to track and interpret hand gestures for button presses, for gesturing left or right, stopping, grabbing, holding, and the like. For example, in one configuration, a user may want to flip through an email or calendar, or "punch" with another person or player in a non-gaming environment. Wearable system 900 may be configured to utilize a minimal set of gestures, which may or may not be dynamic. For example, the gesture may be a simple static gesture such as opening the hand to indicate stop, pointing with the thumb up to indicate good (ok), and pointing with the thumb down to indicate not good; or flipping the hand side-to-side or up-and-down to make directional commands.

Eye tracking is another input (e.g., tracking where the user is looking in order to control the display technology to render at a particular depth or range). In one embodiment, the vergence of the eyes may be determined using triangulation, and then accommodation may be determined using a vergence/accommodation model developed for that particular person.
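
A rough geometric sketch of such triangulation follows: if the two eyes' gaze directions converge symmetrically on a point, the fixation distance can be estimated from the interpupillary distance and the vergence angle between the gaze vectors. The interpupillary distance and the symmetric-convergence simplification used below are assumptions for illustration only.

    # Illustrative sketch: fixation distance from the vergence angle of two gaze vectors.
    import math

    def vergence_distance_m(left_gaze, right_gaze, ipd_m=0.063):
        """Assumes unit gaze vectors and symmetric convergence; ipd_m is a typical value."""
        dot = sum(l * r for l, r in zip(left_gaze, right_gaze))
        dot = max(-1.0, min(1.0, dot))          # numerical safety before acos
        vergence = math.acos(dot)               # angle between the two gaze rays
        if vergence < 1e-6:
            return float("inf")                 # parallel gaze -> looking far away
        return (ipd_m / 2.0) / math.tan(vergence / 2.0)

    # Eyes toeing in by about 1.8 degrees each toward the midline (hypothetical reading).
    theta = math.radians(1.8)
    left = (math.sin(theta), 0.0, math.cos(theta))     # left eye rotated inward, toward +x
    right = (-math.sin(theta), 0.0, math.cos(theta))   # right eye rotated inward, toward -x
    print(f"fixation at about {vergence_distance_m(left, right):.2f} m")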

Speech recognition is another input that may be used alone or in combination with other inputs (e.g., totem tracking, eye tracking, gesture tracking, etc.). The system 900 may include an audio sensor 232 (e.g., a microphone) that receives an audio stream from the environment. The received audio stream may be processed (e.g., by the processing modules 260, 270 or the central server 1650) to recognize the user's speech (from other speech or background audio) and to extract commands, topics (subjects), parameters, etc. from the audio stream. For example, the system 900 may identify from the audio stream that the phrase "move that to there" has been spoken, identify that the phrase was spoken by the wearer of the system 900 (and not another person in the user's environment), and extract from the phrase that there is an executable command ("move") and an object ("that") to be moved to a location ("there"). The object to be operated on by the command may be referred to as the subject of the command, and other information is provided as parameters of the command. In this example, the location to which the object is to be moved is a parameter of the "move" command. The parameters may include, for example, a location, a time, other objects to interact with (e.g., "move that next to the red chair" or "give the magic wand to Linda"), how the command is to be executed (e.g., "play my music using the upstairs speakers"), and so on.
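
As a toy illustration of splitting a recognized utterance into a command, a subject, and parameters, and not a representation of the system's actual speech pipeline, the sketch below pattern-matches phrases of the "move that to there" form. The grammar and field names are assumptions.

    # Toy sketch: extracting command / subject / parameters from a transcribed phrase.
    import re

    MOVE_PATTERN = re.compile(r"^move (?P<subject>.+?) (?:to|next to) (?P<location>.+)$", re.IGNORECASE)

    def parse_command(transcript: str):
        match = MOVE_PATTERN.match(transcript.strip())
        if not match:
            return None
        return {
            "command": "move",
            "subject": match.group("subject"),        # e.g., a deictic reference ("that")
            "parameters": {"location": match.group("location")},
        }

    print(parse_command("move that to there"))
    print(parse_command("move that next to the red chair"))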

As another example, the system 900 may process the audio stream using speech recognition techniques to enter text strings or modify text content. The system 900 may incorporate speaker recognition techniques to determine who is speaking and speech recognition techniques to determine what is being spoken. The speech recognition techniques may include hidden markov models, gaussian mixture models, pattern matching algorithms, neural networks, matrix representations, vector quantization, speaker diarization, decision trees, and Dynamic Time Warping (DTW) techniques, alone or in combination. The speech recognition techniques may also include anti-speaker techniques such as cohort models and world models. Spectral features may be used to represent speaker characteristics.
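
Of the listed techniques, dynamic time warping is simple enough to sketch: it aligns two sequences of acoustic features that may have been spoken at different speeds by accumulating the cheapest alignment cost. The classic recurrence is shown below on scalar features; a real system would operate on multidimensional feature vectors (e.g., spectral features), and nothing in the sketch is specific to this disclosure.

    # Classic dynamic time warping (DTW) over scalar feature sequences, for illustration.
    def dtw_distance(a, b):
        n, m = len(a), len(b)
        inf = float("inf")
        cost = [[inf] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                step = abs(a[i - 1] - b[j - 1])                    # local distance
                cost[i][j] = step + min(cost[i - 1][j],            # insertion
                                        cost[i][j - 1],            # deletion
                                        cost[i - 1][j - 1])        # match
        return cost[n][m]

    # Two "utterances" of the same shape at different speeds (toy feature values).
    template = [0.0, 1.0, 2.0, 1.0, 0.0]
    observed = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
    print(dtw_distance(template, observed))    # small value -> the sequences align well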

With respect to camera systems, the exemplary wearable system 900 shown in fig. 9 may include three pairs of cameras: a relatively wide FOV or passive SLAM camera pair arranged to the side of the user's face, with a different camera pair positioned in front of the user to handle the stereo imaging process 940 and also to capture gestures and totem/object tracking in front of the user's face. The FOV camera and the camera pair used for stereoscopic processing 940 may be part of an outward facing imaging system 464 (shown in fig. 4). Wearable system 900 may include an eye-tracking camera (which may be part of inward-facing imaging system 462 shown in fig. 4) oriented toward the user's eye in order to triangulate eye vectors and other information. Wearable system 900 may also include one or more textured light projectors (e.g., Infrared (IR) projectors) to inject texture into the scene.

Fig. 10 is a process flow diagram of an example of a method 1000 for determining user input to a wearable system. In this example, the user may interact with a totem. A user may have multiple totems. For example, the user may have designated one totem for a social media application, another totem for playing a game, and so on. At block 1010, the wearable system may detect a movement of the totem. The movement of the totem may be recognized by an outward facing imaging system or may be detected by sensors (e.g., tactile gloves, image sensors, hand tracking devices, eye tracking cameras, head pose sensors, etc.).

At block 1020, based at least in part on the detected gesture, eye pose, head pose, or input through the totem, the wearable system detects the position, orientation, and/or movement of the totem (or the user's eyes or head or gesture) relative to a reference frame (reference frame). The reference frame may be a set of map points based on which the wearable system translates the movement of the totem (or user) into an action or command. At block 1030, the user's interactions with the totem are mapped (map). At block 1040, based on the mapping of the user interaction with respect to the frame of reference 1020, the system determines a user input.

For example, a user may move a totem or physical object back and forth to indicate flipping a virtual page and moving to the next page or from one User Interface (UI) display screen to another. As another example, a user may move their head or eyes to view different real or virtual objects in the user's FOR. If the user gazes at a particular real or virtual object for longer than a threshold time, the real or virtual object may be selected as the user input. In some implementations, the vergence of the user's eyes may be tracked and an accommodation state of the user's eyes may be determined using an accommodation/vergence model, which provides information about the rendering plane on which the user is focusing. In some implementations, the wearable system can use cone projection techniques to determine which real or virtual objects are along the direction of the user's head pose or eye pose. Generally described, the cone projection technique projects an invisible cone in the direction of the user's view and identifies any objects that intersect the cone. Cone projection may involve projecting, from the AR display (of the wearable system) toward a physical or virtual object, a thin pencil ray having a substantially small lateral width or a ray having a large lateral width (e.g., a cone or frustum). A cone projection with a single ray may also be referred to as a ray projection. Detailed examples of cone projection techniques are described in U.S. application No.15/473,444 entitled "Interactions with 3D Virtual Objects Using Poses and Multiple-DOF Controllers", filed on 29/3/2017, the entire contents of which are incorporated herein by reference.
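
For the single-ray (ray projection) case mentioned above, a minimal geometric sketch is to test the user's view ray against a bounding sphere around each candidate object and keep the nearest hit. The bounding-sphere representation and the example scene below are assumptions for illustration, not a description of how the system stores scene geometry.

    # Minimal ray-projection sketch: nearest object whose bounding sphere the view ray hits.
    def _dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def ray_hits_sphere(origin, direction, center, radius):
        """direction is assumed to be a unit vector; returns hit distance or None."""
        to_center = [c - o for c, o in zip(center, origin)]
        along = _dot(to_center, direction)             # distance along the ray to the closest point
        if along < 0:
            return None                                # sphere is behind the viewer
        closest_sq = _dot(to_center, to_center) - along * along
        return along if closest_sq <= radius * radius else None

    def pick_object(origin, direction, objects):
        hits = []
        for name, center, radius in objects:
            t = ray_hits_sphere(origin, direction, center, radius)
            if t is not None:
                hits.append((t, name))
        return min(hits)[1] if hits else None

    # Hypothetical scene: head at the origin, looking straight ahead along +z.
    scene = [("map 1210", (0.0, 0.0, 2.0), 0.3), ("object 1230", (1.5, 0.0, 2.0), 0.3)]
    print(pick_object((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene))    # -> "map 1210"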

The user interface may be projected by a display system described herein (e.g., display 220 in fig. 2A or fig. 2B). It may also be displayed using various other technologies, such as one or more projectors. The projector may project an image onto a physical object such as a canvas or a globe. One or more cameras external to or as part of the system may be used (e.g., using an inward facing imaging system 462 or an outward facing imaging system 464) to track interactions with the user interface.

FIG. 11 is a process flow diagram of an example of a method 1100 for interacting with a virtual user interface. Method 1100 may be performed by a wearable system described herein.

At block 1110, the wearable system may identify a particular UI. The type of UI may be predetermined by the user. The wearable system may identify a need to populate a particular UI based on user input (e.g., gestures, visual data, audio data, sensory data, direct commands, etc.). At block 1120, the wearable system may generate data for the virtual UI. For example, data associated with the bounds, general structure, shape, etc. of the UI may be generated. Additionally, the wearable system may determine map coordinates of the user's physical location such that the wearable system may display the UI in relation to the user's physical location. For example, if the UI is body-centered, the wearable system may determine the coordinates, head pose, or eye pose of the user's physical stance such that a ring-shaped UI may be displayed around the user, or a planar UI may be displayed on a wall or in front of the user. If the UI is hand-centered, the map coordinates of the user's hand may be determined. These map points may be obtained by means of data received by the FOV camera, sensory input, or any other type of collected data.

At block 1130, the wearable system may send data from the cloud to the display, or the data may be sent from a local database to the display component. At block 1140, the UI is displayed to the user based on the transmitted data. For example, a light field display may project the virtual UI into one or both of the user's eyes. At block 1150, once the virtual UI has been created, the wearable system may simply wait for a command from the user to generate more virtual content on the virtual UI. For example, the UI may be a body-centric ring surrounding the user's body. The wearable system may then wait for a command (a gesture, head or eye motion, input from a user input device, etc.) and, if the command is recognized (block 1160), may display virtual content associated with the command to the user (block 1170). As an example, the wearable system may wait for a gesture from the user before mixing multiple stream tracks.

Other examples of wearable systems, UIs, and user experiences (UX) are described in U.S. patent publication No.2015/0016777, which is incorporated herein by reference in its entirety.

Example objects in the field of regard (FOR) and field of view (FOV)

Fig. 12A schematically shows an example of a field of regard (FOR) 1200, a field of view (FOV) 1270 of a world camera, a field of view 1250 of a user, and a gaze field of view 1290 of a user. As described with reference to fig. 4, the FOR 1200 includes the portion of the environment surrounding the user that is perceptible to the user via the wearable system. The FOR may comprise a solid angle of 4 pi steradians around the wearable system, since the wearer may move his body, head, or eyes to perceive substantially any direction in space. In other cases, the wearer's movement may be more restricted, and thus the wearer's FOR may subtend a smaller solid angle.

The field of view 1270 of the world camera may include a portion of the user's FOR that is currently viewed by the outward facing imaging system 464. Referring to fig. 4, the field of view 1270 of the world camera may include the world 470 observed by the wearable system 400 at a given time. The size of the world camera's FOV 1270 may depend on the optical characteristics of the outward facing imaging system 464. For example, the outward-facing imaging system 464 may include a wide-angle camera that may image a 190-degree space around the user. In some implementations, the FOV 1270 of the world camera may be greater than or equal to the natural FOV of the user's eye.

The FOV 1250 of a user may include a portion of the FOR 1200 that the user perceives at a given time. The FOV may depend on the size or optical characteristics of the display of the wearable device. For example, an AR/MR display may include optics that provide AR/MR functionality when a user views a particular portion of the display. The FOV 1250 may correspond to the solid angle that is perceivable by a user when viewing an AR/MR display, such as the stacked waveguide assembly 480 (fig. 4) or the planar waveguide 600 (fig. 6). In some embodiments, the FOV 1250 of the user may be smaller than the natural FOV of the user's eye.

The wearable system may also determine the user's gaze field of view 1290. The gaze field of view 1290 may include the portion of the FOV 1250 at which the user's eyes gaze (e.g., at which visual gaze is maintained). The gaze field of view 1290 may correspond to the foveal region of the eye on which light falls. The gaze field of view 1290 may be smaller than the FOV 1250 of the user; for example, the gaze field of view may span from several degrees to about 5 degrees. Thus, the user may perceive some virtual objects that are not in the gaze field of view 1290 but are in the FOV 1250, in the user's peripheral field of view.

Fig. 12B schematically shows an example of a virtual object in a user field of view (FOV) and a virtual object in a field of view (FOR). In fig. 12B, FOR 1200 may contain a set of objects (e.g., 1210, 1220, 1230, 1242, and 1244) that can be perceived by a user via a wearable system. Objects within the user's FOR 1200 may be virtual and/or physical objects. FOR example, the user's FOR 1200 may include a physical object such as a chair, sofa, wall, and the like. The virtual objects may include operating system objects such as a recycle bin for deleted files, a terminal for entering commands, a file manager for accessing files or directories, icons, menus, application programs for audio or video streaming, notifications from an operating system, text editing applications, messaging applications, and so forth. The virtual objects may also include objects in the application, such as avatars, virtual objects in games, graphics or images, and the like. Some virtual objects may be both operating system objects and objects in an application. In some embodiments, the wearable system may add a virtual element to an existing physical object. For example, the wearable system may add a virtual menu associated with a television in the room, where the virtual menu may provide the user with an option to turn on or change a television channel using the wearable system.

The virtual object may be a three-dimensional (3D), two-dimensional (2D), or one-dimensional (1D) object. For example, the virtual object may be a 3D coffee cup (which may represent a virtual control of a physical coffee machine). The virtual object may also be a 2D graphical representation of the clock (showing the current time to the user). In some implementations, one or more virtual objects can be displayed within (or associated with) another virtual object. The virtual coffee cup may be displayed inside the user interface plane, although the virtual coffee cup appears to be 3D in the 2D planar virtual space.

The object in the FOR of the user may be part of the world map described with reference to fig. 9. Data associated with an object (e.g., location, semantic information, properties, etc.) may be stored in various data structures, such as arrays, lists, trees, hashes, graphs, and so forth. The index of each stored object may be determined, for example, by the location of the object, as appropriate. For example, the data structure may index the object by a single coordinate such as the distance of the object from the reference location (e.g., how far to the left or right of the reference location, how far to the top or bottom of the reference location, or how far to the depth of the reference location). The reference position may be determined based on a position of the user (e.g., a position of the user's head). The reference location may also be determined based on the location of a virtual or physical object (e.g., a target object) in the user environment. Thus, a 3D space in a user environment may be represented in a 2D user interface in which virtual objects are arranged according to the distance of the objects from a reference position.
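
A minimal sketch of the single-coordinate indexing described above follows: each object is keyed by its distance from a reference position (e.g., the user's head position), so the 2D user interface can list objects in near-to-far order. The object records and reference position used here are hypothetical.

    # Illustrative sketch: index FOR objects by distance from a reference position.
    import math

    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def build_distance_index(objects, reference_position):
        """objects: iterable of (name, position) pairs; returns [(distance, name)] sorted near to far."""
        index = [(distance(pos, reference_position), name) for name, pos in objects]
        return sorted(index)

    reference = (0.0, 1.6, 0.0)        # e.g., the user's head position (hypothetical)
    for_objects = [
        ("object 1244", (0.5, 1.0, 1.0)),
        ("map 1210",    (3.0, 1.5, 4.0)),
        ("object 1230", (0.0, 1.6, 2.0)),
    ]
    print(build_distance_index(for_objects, reference))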

In fig. 12B, FOV1250 is schematically illustrated by dashed line 1252. A user of the wearable system may perceive a plurality of objects in the FOV1250, such as a portion of object 1242, object 1244, and object 1230. When the user's pose changes (e.g., head pose or eye pose), the FOV1250 will change accordingly, and objects within the FOV1250 may also change. For example, in fig. 12B, map 1210 is initially outside the FOV of the user. If the user looks towards the map 1210, the map 1210 may move into the user's FOV1250 and the object 1230, for example, may move outside of the user's FOV 1250.

The wearable system may keep track of objects in FOR 1200 as well as objects in FOV 1250. FOR example, local processing and data module 260 may communicate with remote processing module 270 and remote data store 280 to retrieve virtual objects in the user's FOR. The local processing and data module 260 may store the virtual objects in, for example, a buffer or temporary memory. The local processing and data module 260 may determine the FOV of the user using the techniques described herein and render a subset of the virtual objects in the FOV of the user. As the user's pose changes, the local processing and data module 260 may update the user's FOV and render another set of virtual objects corresponding to the user's current FOV accordingly.
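
The sketch below illustrates, under simplified assumptions, the kind of bookkeeping described in this passage: the FOR objects are cached locally, and whenever the head pose changes, the subset falling within an angular FOV around the head's forward direction is recomputed so that only that subset is rendered. The angular FOV value and object representation are illustrative assumptions.

    # Illustrative sketch: recompute the FOV subset of cached FOR objects on a pose change.
    import math

    def _normalize(v):
        norm = math.sqrt(sum(c * c for c in v)) or 1.0
        return tuple(c / norm for c in v)

    def in_fov(head_position, head_forward, obj_position, half_angle_deg=25.0):
        to_obj = _normalize(tuple(o - h for o, h in zip(obj_position, head_position)))
        forward = _normalize(head_forward)
        cos_angle = max(-1.0, min(1.0, sum(a * b for a, b in zip(forward, to_obj))))
        return math.degrees(math.acos(cos_angle)) <= half_angle_deg

    class FovTracker:
        def __init__(self, for_objects):
            self.for_objects = list(for_objects)    # cached FOR contents (e.g., from remote storage)

        def visible(self, head_position, head_forward):
            return [name for name, pos in self.for_objects
                    if in_fov(head_position, head_forward, pos)]

    tracker = FovTracker([("map 1210", (3.0, 0.0, 0.5)), ("object 1230", (0.0, 0.0, 2.0))])
    print(tracker.visible((0, 0, 0), (0, 0, 1)))    # looking ahead -> object 1230
    print(tracker.visible((0, 0, 0), (1, 0, 0)))    # turning right -> map 1210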

Overview of various user input modes

The wearable system may be programmed to accept various input modes for performing operations. For example, the wearable system may accept two or more of the following types of input modes: voice commands, head gestures, body gestures (which may be measured, for example, by an IMU in a belt pack or by sensors external to the HMD), eye gaze (also referred to herein as eye gestures), hand gestures (or gestures made by other parts of the body), signals from user input devices (e.g., totems), environmental sensors, and the like. Computing devices are typically designed to produce a given output based on a single input from the user. For example, a user may enter a text message by typing on a keyboard or use a mouse to guide the motion of a virtual object, which are examples of gesture input modes. As another example, the computing device may receive an audio data stream from the user's voice and convert the audio data into executable commands using voice recognition techniques.

In some cases, user input modes may be non-exclusively categorized as direct user input or indirect user input. The direct user input may be user interaction provided directly by the user, e.g., via voluntary movement of the user's body (e.g., turning the head or eyes, staring at an object or location, speaking a phrase, moving a finger or hand). As an example of direct user input, a user may interact with a virtual object using a gesture such as a head gesture, an eye gesture (also referred to as an eye gaze), a hand gesture, or another body gesture. For example, the user may look at the virtual object (with the head and/or eyes). Another example of a direct user input is the user's voice. For example, the user may say "launch a browser" to cause the HMD to open a browser application. As yet another example of direct user input, a user may actuate a user input device, for example, by touch gestures (e.g., touching a touch-sensitive portion of a totem) or body movements (e.g., rotating a totem acting as a multiple degree of freedom controller).

In addition to or instead of direct user input, the user may also interact with the virtual object based on indirect user input. The indirect user input may be determined based on various contextual factors (e.g., geographic location of the user or virtual object, environment of the user, etc.). For example, the user's geographic location may be in the user's office (rather than in the user's home), and different tasks (e.g., work-related tasks) may be performed based on the geographic location (e.g., derived from a GPS sensor).

Contextual factors may also include the affordance of the virtual object. The affordance of a virtual object comprises a relation between the virtual object and the object's environment that affords an opportunity for an action or use associated with the object. The affordance may be determined based on, for example, the function, orientation, type, location, shape, and/or size of the object. The affordance may also be based on the environment in which the virtual object is located. As an example, the affordance of a horizontal table is that objects may be placed on the table, while the affordance of a vertical wall is that objects may be hung on or projected onto the wall. For example, a user could say "place that there," and the system could place a virtual office calendar so that it appears flat (level) on the desk in the user's office.

A single direct user input mode may create various limitations: the number or types of user interface operations available may be constrained by the type of user input. For example, the user may not be able to zoom in or out with head pose alone, because head pose may not provide sufficiently precise user interaction. As another example, the user may need to move a thumb back and forth (or over a large distance) on the touchpad in order to move a virtual object from the floor to the wall, which may cause fatigue to the user over time.

However, some direct input modes may be more convenient and intuitive for the user. For example, a user may speak to the wearable system to issue a voice command without having to type a sentence using gesture-based keyboard input. As another example, instead of moving a cursor to identify a target virtual object, the user may use a hand gesture to point at the target virtual object. Other direct input modes may improve the accuracy of user interaction, although they may be less convenient or intuitive. For example, the user may move a cursor onto the virtual object to indicate that the virtual object is the target object. However, as described above, if a user wants to select the same virtual object using direct user input (e.g., head gestures or other input that is a direct result of user actions), the user may need to control precise movements of the head, which may lead to muscle fatigue. A 3D environment (e.g., a VR/AR/MR environment) may present more challenges to user interaction because the user input also needs to specify depth (as opposed to a planar surface). This additional depth dimension creates more opportunities for error than a 2D environment. For example, in a 2D environment, user input may be translated with respect to a horizontal and a vertical axis in a coordinate system, whereas in a 3D environment user input may require translation with respect to three axes (horizontal, vertical, and depth). Thus, inaccurate user input may result in errors along three axes (rather than two axes as in a 2D environment).

To take advantage of the existing advantages of direct user input, while improving the accuracy of interacting with objects in 3D space and reducing user fatigue, user interface operations may be performed using multiple direct input modes. Multimodal input can further improve existing computing devices (particularly wearable devices) to interact with virtual objects in data rich and dynamic environments such as AR, VR, or MR environments.

In multimodal user input techniques, one or more of the direct inputs can be used to identify a target virtual object (also referred to as a subject) with which the user is to interact and to determine the user interface operation to be performed on the target virtual object. For example, the user interface operation may include a command operation, such as select, move, zoom, pause, or play, and parameters of the command operation (e.g., how the operation is performed, where or when the operation will occur, which object the target object will interact with, etc.). As an example of identifying a target virtual object and determining an interaction to perform on the target virtual object, a user may look at a virtual note (head or eye pose input mode), point at a table (gesture input mode), and then say "move that there" (voice input mode). The wearable system may recognize that the subject of the phrase "move that there" is the virtual note ("that"), and may determine that the user interface operation involves moving (the executable command) the virtual note to the table ("there"). In this example, the command operation may be to "move" the virtual object, and the parameters of the command operation may include a destination object, which is the table at which the user is pointing. Advantageously, in some embodiments, the wearable system may improve the overall accuracy of the user interface operation or increase the convenience of user interaction by performing the user interface operation based on multiple direct user input modes (e.g., three modes in the above example: head/eye pose, hand gesture, and speech). For example, rather than saying "move the leftmost browser 2.5 feet to the right", the user can say "move that there" while using a head pose or hand gesture to indicate that the target object is the leftmost browser (without identifying the object in the speech input), and use the head or hand movement to indicate the movement distance.
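The following sketch illustrates, under simplifying assumptions, how a phrase such as "move that there" might be split into a command operation and deictic slots ("that", "there") that are filled from non-speech input modes. The function name, keyword matching, and target labels are hypothetical; a real system would use the speech-recognition and object-identification pipelines described herein.

```python
def resolve_command(utterance, gaze_target, gesture_target):
    """Split a spoken phrase into a command operation and fill its deictic slots
    ("that", "there") from non-speech input modes."""
    words = utterance.lower().split()
    operation = words[0]                                       # e.g. "move"
    subject = gaze_target if "that" in words else None          # object to act on
    destination = gesture_target if "there" in words else None  # where to put it
    return {"operation": operation, "subject": subject, "destination": destination}

# Head/eye pose identifies the note; the pointing gesture identifies the table.
print(resolve_command("move that there",
                      gaze_target="virtual note",
                      gesture_target="table surface"))
# {'operation': 'move', 'subject': 'virtual note', 'destination': 'table surface'}
```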

Example of interaction in a virtual environment using various input modes

FIG. 13 illustrates an example of interacting with a virtual object using one mode of user input. In fig. 13, a user 1310 wears an HMD and interacts with virtual content in three scenes 1300a, 1300b, and 1300c. The user's head position (and corresponding eye gaze direction) is represented by the geometric cone 1312a. In this example, the user may perceive the virtual content via the display 220 of the HMD. While interacting with the HMD, the user may enter a text message through the user input device 466. In scene 1300a, the user's head is in its natural resting position 1312a and the user's hands are also in their natural resting position 1316a. However, while the user may comfortably type text on the user input device 466, the user may not be able to see the interface on the user input device 466 to ensure that characters are typed correctly.

To view text entered on the user input device, the user may move the hands up to position 1316b, as shown in scenario 1300b. Thus, when the head is in its natural resting position 1312a, the hands will be in the FOV of the user's head. However, position 1316b is not the natural resting position of the hands, and thus may cause fatigue to the user. Alternatively, as shown in scenario 1300c, the user may move her head to position 1312c in order to keep the hands in the natural resting position 1316a. However, due to the unnatural position of the head, the muscles around the user's neck may become fatigued and the user's FOV is directed to the ground or floor, rather than to the outside world (which may be unsafe if the user walks in a crowded area). In either scenario 1300b or 1300c, when the user performs a user interface operation using a single input mode, the user's natural ergonomics are sacrificed to accomplish the desired user interface operation.

The wearable systems described herein may at least partially alleviate the ergonomic limitations depicted in scenarios 1300b and 1300c. For example, a virtual interface may be projected within the user's field of view in scene 1300a. The virtual interface may allow the user to view the entered input from a natural location.

The wearable system may also display and support interaction with virtual content without device constraints. For example, the wearable system may present multiple types of virtual content to the user, and the user may interact with one type of content using the touchpad while interacting with another type of content using the keyboard. Advantageously, in some embodiments, the wearable system may determine which virtual content is the target virtual object (the object on which the user intended to perform the operation) by calculating a confidence score (a higher confidence score representing a higher confidence (or likelihood) that the system has identified the correct target virtual object). Detailed examples regarding the recognition of the target virtual object are described with reference to fig. 15 to 18B.

FIG. 14 shows an example of selecting a virtual object using a combination of user input modes. In scene 1400a, the wearable system may present to user 1410 a plurality of virtual objects represented by squares 1422, circles 1424, and triangles 1426.

As shown in scene 1400b, user 1410 may interact with a virtual object using head gestures. This is an example of a head gesture input mode. The head pose input mode may involve a cone projection for targeting or selecting a virtual object. For example, the wearable system may project a cone 1430 from the user's head toward the virtual object. The wearable system may detect whether one or more of the virtual objects fall within the volume of the cone to identify which object the user intends to select. In this example, the cone 1430 intersects the circle 1424 and the triangle 1426. Thus, the wearable system may determine that the user intends to select circle 1424 or triangle 1426. However, because the cone 1430 intersects both the circle 1424 and the triangle 1426, the wearable system may not be able to determine whether the target virtual object is the circle 1424 or the triangle 1426 based only on the head pose input.
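A simplified sketch of such a cone-projection test is shown below, assuming each virtual object is approximated by a single center point and that an object is a candidate when its direction from the cone origin lies within the cone's half-angle. The coordinates and the half-angle are illustrative; they are chosen so that, as in scene 1400b, the cone captures the circle and the triangle but not the square.

```python
import math

def objects_in_cone(origin, direction, half_angle_deg, objects):
    """Return objects whose centers fall inside a cone cast from `origin` along `direction`."""
    norm = math.sqrt(sum(c * c for c in direction))
    direction = [c / norm for c in direction]
    hits = []
    for name, center in objects.items():
        to_obj = [c - o for c, o in zip(center, origin)]
        dist = math.sqrt(sum(c * c for c in to_obj))
        if dist == 0:
            continue
        cos_angle = sum(d * t for d, t in zip(direction, to_obj)) / dist
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
        if angle <= half_angle_deg:
            hits.append(name)
    return hits

objects = {"square": (-1.5, 0.0, -3.0), "circle": (0.2, 0.0, -3.0), "triangle": (0.8, 0.0, -3.0)}
# A cone cast from the head may intersect both the circle and the triangle.
print(objects_in_cone(origin=(0, 0, 0), direction=(0, 0, -1), half_angle_deg=15, objects=objects))
# ['circle', 'triangle']
```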

In scene 1400c, a user 1410 may interact with a virtual object by manually orienting a user input device 466 (e.g., a handheld remote control device) such as a totem. This is an example of a gesture input mode. In this scenario, the wearable system may determine that circle 1424 or square 1422 is the intended target because these two objects are in the direction in which user input device 466 is pointing. In this example, the wearable system may determine the direction of the user input device 466 by detecting the position or orientation of the user input device 466 (e.g., via an IMU in the user input device 466), or by performing a cone projection from the user input device 466. Since circle 1424 and square 1422 are both candidates for a target virtual object, the wearable system cannot determine which one is the object that the user actually wants to select based on the gesture input pattern alone.

In scene 1400d, the wearable system may determine the target virtual object using multimodal user input. For example, the wearable system may use the results obtained from the cone projection (head pose input mode) and the results obtained from the orientation of the user input device (gesture input mode) to identify the target virtual object. In this example, circle 1424 is a candidate identified both in the results from the cone projection and in the results obtained from the user input device. Thus, the wearable system may use these two input modes to determine with high confidence that the target virtual object is the circle 1424. As further shown in scenario 1400d, the user may issue a voice command 1442 (illustrated as "Move that", an example of a third input mode (i.e., voice)) to interact with the target virtual object. The wearable system may associate the word "that" with the target virtual object, associate the word "move" with the command to be executed, and may move the circle 1424 accordingly. However, using the voice command 1442 alone (without an indication from the user input device 466 or the cone projection 1430) may cause confusion for the wearable system, as the wearable system may not know which object is associated with the word "that".

Advantageously, in some embodiments, by accepting multiple input modes to identify and interact with virtual objects, the amount of precision required for each input mode may be reduced. For example, cone projection may not be able to accurately pinpoint an object on a far rendering plane because the cone diameter increases as the cone extends farther from the user. As another example, a user may need to hold the input device in a particular orientation to point at a target object, or speak a particular phrase at a particular pace to ensure proper speech input. However, by combining the results of the speech input and the cone projection (from head pose or from a gesture using the input device), the wearable system may still recognize the target virtual object without requiring either input (e.g., the cone projection or the speech input) to be precise. For example, even if the cone projection selects multiple objects (e.g., as described with reference to scenes 1400b, 1400c), the speech input may help narrow the selection (e.g., increase the confidence score of the selection). For example, the cone projection may capture three objects, where a first object is to the right of the user, a second object is to the left of the user, and a third object is in the center of the user's FOV. The user can narrow the selection by saying "select the rightmost object". As another example, there may be two identically shaped objects in the user's FOV. In order for the user to select the correct object, the user may need to describe the object in more detail by voice command. For example, the user may need to say "select the square object that is red", rather than "select the square object". However, when cone projection is used, the voice command need not be as precise. For example, the user may look at one of the square objects and say "select the square object" or even "select the object". The wearable system may automatically select the square object that is consistent with the user's eye gaze direction, rather than the square object that is not in the user's eye gaze direction.
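The sketch below illustrates one way such narrowing could work, under the assumption that each cone-projection candidate has a small set of descriptive attributes and that the utterance is matched against them by simple keyword overlap. The attribute lists and the function name are illustrative, not part of the described system.

```python
def narrow_selection(cone_candidates, utterance, attributes):
    """Keep the cone-projection candidates whose attributes best match the utterance."""
    words = set(utterance.lower().split())
    scores = {obj: len(words & set(attributes.get(obj, []))) for obj in cone_candidates}
    best = max(scores.values(), default=0)
    if best == 0:
        return list(cone_candidates)  # speech adds no constraint; keep all candidates
    return [obj for obj, score in scores.items() if score == best]

attributes = {
    "red square": ["square", "red"],
    "blue square": ["square", "blue"],
    "circle": ["circle", "round"],
}
# Cone projection captured two identically shaped squares; speech disambiguates.
print(narrow_selection(["red square", "blue square"],
                       "select the square object that is red", attributes))
# ['red square']
```

If the utterance matches several candidates equally (e.g., "select the square object"), all of them remain and another mode such as eye gaze can break the tie, consistent with the example above.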

In some embodiments, the system may have a hierarchy of preferences for input mode combinations. For example, a user tends to look in the direction his or her head is pointing; thus, eye gaze and head pose may provide similar information to each other. The combination of head pose and eye gaze may not be preferred because it does not provide much more information than using eye gaze alone or head pose alone. Thus, the system can use the hierarchy of input mode preferences to select mode inputs that provide contrasting information rather than largely repetitive information. In some embodiments, the hierarchy uses head pose and voice as the primary mode inputs, followed by eye gaze and gestures.

Thus, based on the multimodal input, the system can calculate a confidence score for each object in the user environment that such object is the target object, as further described herein. The system may select the particular object in the environment with the highest confidence score as the target object.

FIG. 15 illustrates an example of interacting with a virtual object using a combination of direct user inputs. As shown in fig. 15, a user 1510 wears an HMD 1502 configured to display virtual content. The HMD 1502 may be part of the wearable system 200 described herein and may include a belt-worn power supply and processing package 1503. The HMD 1502 may be configured to accept user input from the totem 1516. A user 1510 of HMD 1502 may have a first FOV 1514. The user may view the virtual object 1512 in the first FOV 1514.

User 1510 may interact with virtual object 1512 based on a combination of direct inputs. For example, user 1510 may select virtual object 1512 by cone projection techniques based on the user's head or eye pose, or by totem 1516, by voice command, or by a combination of these (or other) input modes (e.g., as described with reference to fig. 14).

User 1510 may change his head pose to move the selected virtual object 1512. For example, the user may rotate his head to the left to update the FOV from the first FOV 1514 to the second FOV 1524 (as shown from scene 1500a to scene 1500b). The movement of the user's head may be combined with other direct inputs to move the virtual object from the first FOV 1514 to the second FOV 1524. For example, changes in head pose may be aggregated with other inputs, such as voice commands ("move that there"), guidance from the totem 1516, or eye gaze direction (e.g., recorded by the inward-facing imaging system 462 shown in fig. 4). In this example, HMD 1502 may use the updated FOV 1524 as the approximate region to which virtual object 1512 should be moved. HMD 1502 may further determine the destination to which virtual object 1512 is moved based on the user's gaze direction. As another example, the HMD may capture the voice command "move that there". The HMD may identify virtual object 1512 as the object with which the user will interact (because the user previously selected virtual object 1512). The HMD may further determine that the user intends to move the object from FOV 1514 to FOV 1524 by detecting the change in the user's head pose. In this example, the virtual object 1512 may initially be located in a central portion of the user's first FOV 1514. Based on the voice command and the user's head pose, the HMD may move the virtual object to the center of the user's second FOV 1524.

Examples of identifying target virtual objects or user interface operations through multimodal user input

As described with reference to fig. 14, in some cases, the wearable system may not be able to identify (with sufficient confidence) the target virtual object with which the user intends to interact using a single input mode. Furthermore, even if multiple user input modes are used, one user input mode may indicate one virtual object while another user input mode may indicate a different virtual object.

To address this ambiguity and provide an improved wearable system that supports multimodal user input, the wearable system may aggregate the user input modes and calculate confidence scores to identify the desired virtual object or user interface operation. As described above, a higher confidence score indicates a higher probability or likelihood that the system has identified the desired target object.

FIG. 16 illustrates an example computing environment for aggregating input modes. The example environment 1600 includes, for example, three virtual objects associated with application A 1672, application B 1674, and application C 1676. As described with reference to figs. 2A, 2B, and 9, the wearable system may include various sensors, may receive various user inputs from these sensors, and may analyze the user inputs to interact with the mixed reality 960, e.g., using the cross-modal input fusion techniques described herein. In the example environment 1600, the central runtime server 1650 can aggregate direct inputs 1610 and indirect user inputs 1630 to produce multimodal interactions for the applications. Examples of direct inputs 1610 may include gestures 1612, head gestures 1614, voice inputs 1618, totems 1622, eye gaze direction (e.g., eye gaze tracking 1624), other types of direct inputs 1626, and so forth. Examples of indirect inputs 1630 may include environmental information (e.g., environmental tracking 1632) and geographic location 1634. The central runtime server 1650 may include the remote processing module 270. In certain embodiments, the local processing and data module 260 (or the processor 128) may perform one or more functions of the central runtime server 1650. The local processing and data module 260 may also communicate with the remote processing module 270 to aggregate the input modes.

The wearable system may track the gesture 1612 using the outward-facing imaging system 464. The wearable system may track gestures using the various techniques described with reference to fig. 9. For example, the outward-facing imaging system 464 may take an image of the user's hand and map the image to a corresponding gesture. The outward-facing imaging system 464 may use a FOV camera or a depth camera (configured for depth detection) to image the user's gesture. The central runtime server 1650 may use the object recognizers 708 to identify the user's gesture. The gesture 1612 may also be tracked by the user input device 466. For example, the user input device 466 may include a touch-sensitive surface that can track the user's hand movements, such as a swipe gesture or a tap gesture.

The HMD may recognize the head pose 1614 using the IMU. The user's head may have multiple degrees of freedom, including three types of rotation (e.g., yaw, pitch, and roll) and three types of translation (e.g., surge, sway, and heave). The IMU may be configured to measure, for example, 3-DOF motion or 6-DOF motion of the head. Measurements obtained from the IMU may be transmitted to the central runtime server 1650 for processing (e.g., to recognize the head pose).

The wearable system may perform eye gaze tracking 1624 using the inward facing imaging system 462. For example, inward facing imaging system 462 may include an eye camera configured to acquire images of the user's eye area. The central runtime server 1650 may analyze the images (e.g., via the object identifier 708) to infer a user's gaze direction or track the user's eye movement.

The wearable system may also receive input from the totem 1622. As described herein, the totem 1622 can be an embodiment of the user input device 466. Additionally or alternatively, the wearable system may receive speech input 1618 from the user. The input from the totem 1622 and the voice input 1618 may be transmitted to the central runtime server 1650. The central runtime server 1650 may parse the user's audio data (e.g., audio data from the microphone 232) in real-time or near real-time using natural language processing. The central runtime server 1650 may identify the content of the speech by applying various speech recognition algorithms, such as hidden Markov models, dynamic time warping (DTW) based speech recognition, neural networks, deep learning algorithms (e.g., deep feedforward and recurrent neural networks), end-to-end automatic speech recognition, machine learning algorithms (described with reference to figs. 7 and 9), semantic analysis, other algorithms using acoustic or language modeling, and so forth. The central runtime server 1650 may also apply voice recognition algorithms that can identify the speaker (e.g., whether the speaker is the user of the wearable device or another person in the user's environment).

The central runtime server 1650 may also receive indirect inputs while the user interacts with the HMD. The HMD may include various environmental sensors as described with reference to figs. 2A and 2B. Using data acquired by the environmental sensors (either alone or in combination with data from the direct inputs 1610), the central runtime server 1650 can reconstruct or update the user's environment (e.g., the map 920). For example, the central runtime server 1650 may determine the ambient light conditions based on the user's environment. The ambient light conditions may be used to determine which virtual object the user can interact with. For example, when the user is in a bright environment, the central runtime server 1650 may identify the target virtual object as a virtual object that supports the gesture 1612 as an input mode, because the camera can observe the user's gesture 1612. However, if the environment is dark, the central runtime server 1650 may determine that the virtual object may be an object that supports the voice input 1618 rather than the gesture 1612.

The central runtime server 1650 can perform environment tracking 1632 and aggregate direct input patterns to produce multi-modal interactions for multiple applications. As an example, the central runtime server 1650 may disable the voice input 1618 when the user enters a noisy environment from a quiet environment. Other examples regarding selecting an input mode based on context are further described with reference to FIG. 24.

The central runtime server 1650 may also identify the target virtual object based on the user's geographic location information. Geographic location information 1634 may also be obtained from environmental sensors (e.g., GPS sensors). The central runtime server 1650 may identify virtual objects for potential user interaction, where the distance between the virtual object and the user is within a threshold distance. Advantageously, in some embodiments, the cone in the cone projection may have a length that is adjustable by the system (e.g., based on the number or density of objects in the environment). By selecting objects within a certain radius of the user, the number of potential objects that may be target objects may be significantly reduced. Other examples of using indirect input as an input mode are described with reference to fig. 21.
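A rough sketch of the distance-based pruning described above is shown below, assuming the object positions and the user position are available in a common coordinate frame; the object names and the threshold distance are illustrative.

```python
import math

def candidates_within_radius(user_position, objects, max_distance):
    """Limit candidate target objects to those within a threshold distance of the user."""
    return {name: pos for name, pos in objects.items()
            if math.dist(pos, user_position) <= max_distance}

objects = {"virtual tv": (2.0, 1.0, -3.0), "office calendar": (40.0, 0.0, 5.0)}
# Only objects near the user remain candidates, shrinking the search space for cone projection.
print(candidates_within_radius((0.0, 0.0, 0.0), objects, max_distance=10.0))
# {'virtual tv': (2.0, 1.0, -3.0)}
```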

Example of determining target object

The central runtime server 1650 may use a variety of techniques to determine the target object. Fig. 17A shows an example of identifying a target object using lattice tree analysis. The central runtime server 1650 may derive given values from the input sources and generate a lattice of possible values for the candidate virtual objects with which the user may interact. In some embodiments, the value may be a confidence score. The confidence scores may include rankings, ratings, valuations, quantitative or qualitative values (e.g., a value in the range of 1 to 10, a percentage or percentile, or a qualitative value "A", "B", "C", etc.), and so forth. Each candidate object may be associated with a confidence score, and in some cases, the system selects the candidate object with the highest confidence score (e.g., higher than the confidence scores of the other objects or higher than a threshold score) as the target object. In other cases, the system excludes objects with confidence scores below a threshold confidence score from consideration as the target object, which may improve computational efficiency.

In many examples herein, reference is made to selection of a target virtual object or selection from a set of virtual objects. This is intended to illustrate an example embodiment, but is not intended to be limiting. The described techniques may be applied to virtual objects or physical objects in a user environment. For example, the voice command "move that there" may refer to moving a virtual object (e.g., a virtual calendar) onto a physical object (e.g., a horizontal surface of a user's table). Alternatively, the voice command "move that there" may refer to moving a virtual object (e.g., a virtual word processing application) to another location within another virtual object (e.g., another location in the user's virtual desktop).

The context of the command may also provide information about whether the system should attempt to identify virtual objects and/or physical objects. For example, in the command "move that there", the system may recognize that "that" is a virtual object, because the AR/VR/MR system cannot move an actual physical object. Thus, the system may eliminate physical objects as candidates for "that". As described in the examples above, the target location "there" may be a virtual object (e.g., the user's virtual desktop) or a physical object (e.g., the user's desk).

Additionally, the system may assign confidence scores to objects in the user's environment, which may be the FOR, the FOV, or the gaze field (see, e.g., fig. 12A), depending on the context and the goal of the system at that point in time. For example, a user may wish to move a virtual calendar to a location on the user's desk, with both objects in the user's FOV. The system may analyze objects within the user's FOV rather than all objects in the user's FOR, because the context of this situation suggests that the command to move the virtual calendar refers to moving it to a target destination in the user's FOV, which may improve processing speed or efficiency. In another case, the user may be viewing a menu of movie selections in a virtual movie application, and may stare at a small portion of the movie selections. The system may target only the movie selections in the user's gaze field (e.g., based on the user's eye gaze) rather than the full FOV (or FOR) (and provide confidence scores for them, for example), which may also improve processing efficiency or speed.

Referring to the example shown in fig. 17A, a user may interact with the virtual environment using two input modes (head pose 1614 and eye gaze 1624). Based on the head pose 1614, the central runtime server 1650 may identify two candidate virtual objects associated with application A 1672 and application B 1674. The central runtime server 1650 may evenly distribute a 100% confidence score between application A 1672 and application B 1674. Thus, application A 1672 and application B 1674 may each be assigned a confidence score of 50%. The central runtime server 1650 may also identify two candidate virtual objects (application A 1672 and application C 1676) based on the eye gaze direction 1624. The central runtime server 1650 may likewise divide 100% confidence between application A 1672 and application C 1676.

The central runtime server 1650 may execute a lattice compression logic function 1712 to reduce or eliminate confidence values that are not common among the multiple input modes or that fall below a certain threshold, in order to determine the most likely application with which the user wants to interact. For example, in fig. 17A, the central runtime server 1650 may eliminate application B 1674 and application C 1676 because neither of these virtual objects was identified by both the head pose 1614 analysis and the eye gaze 1624 analysis. As another example, the central runtime server 1650 may aggregate the values assigned to each application. The central runtime server 1650 may set a threshold confidence value of 80% or greater. In this example, the total value of application A 1672 is 100% (50% + 50%), the total value of application B 1674 is 50%, and the total value of application C 1676 is 50%. Because the confidence value of each of applications B and C is below the threshold confidence value, and the overall confidence value (100%) of application A exceeds the threshold, the central runtime server 1650 may be programmed to select application A 1672 instead of applications B and C.
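The sketch below mirrors the aggregation just described: each input mode distributes its confidence evenly across its candidates, the per-object values are summed, and objects below the threshold are pruned. The numbers follow the fig. 17A example; the function names are illustrative and are not the lattice compression logic function 1712 itself.

```python
def distribute(candidates, total=100.0):
    """Split one input mode's confidence evenly across its candidate objects."""
    share = total / len(candidates)
    return {obj: share for obj in candidates}

def aggregate_and_prune(per_mode_candidates, threshold=80.0):
    """Sum per-object confidence across input modes and drop objects under the threshold."""
    totals = {}
    for candidates in per_mode_candidates:
        for obj, score in distribute(candidates).items():
            totals[obj] = totals.get(obj, 0.0) + score
    return {obj: score for obj, score in totals.items() if score >= threshold}

# Head pose suggests A or B; eye gaze suggests A or C (as in fig. 17A).
per_mode = [["application A", "application B"], ["application A", "application C"]]
print(aggregate_and_prune(per_mode))   # {'application A': 100.0}
```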

Although the example in fig. 17A averages the values associated with an input mode (e.g., confidence scores) among the candidate virtual objects, in some embodiments the distribution of values may not be equal among the candidate virtual objects. For example, if the head pose 1614 has a value of 10, application A 1672 may receive a value of 7 while application B 1674 receives a value of 3 (because the head pose points more toward application A 1672). As another example, if the head pose 1614 has a qualitative rating of "A", application A 1672 may be assigned the rating "A", while application B 1674 and application C 1676 receive nothing from the head pose 1614.

The wearable system (e.g., the central runtime server 1650) may assign a focus indicator to the target virtual object so that the user can more easily perceive the target virtual object. The focus indicator may be a visual focus indicator. For example, the focus indicator may include a halo (substantially surrounding or near the object), a color, a perceived change in size or depth (e.g., making the target object appear closer and/or larger when selected), or other visual effects that draw the user's attention. The focus indicator may also include audible or tactile effects such as vibrations, ringtones, beeps, and the like. The focus indicator may provide useful feedback to the user that the system is "doing the right thing" by confirming (via the focus indicator) that the system has correctly determined the objects associated with the command (e.g., "that" and "there" in the command "move that there"). For example, a first focus indicator may be assigned to the identified target virtual object, and a second focus indicator may be assigned to the destination location ("there" in the command). In some cases, if the system incorrectly determines the target object, the user may override the system's determination, for example, by gazing at the correct object and providing a voice command (e.g., "no, not that").

Example of identifying target user interface operations

In addition to, or in lieu of, identifying the target virtual object, the central runtime server 1650 may also determine a target user interface operation based on the plurality of inputs received. FIG. 17B illustrates an example of determining a target user interface operation based on multimodal inputs. As shown, the central runtime server 1650 may receive a plurality of inputs in the form of head poses 1614 and gestures 1612. The central runtime server 1650 may display to the user a plurality of virtual objects associated with, for example, application A 1672 and application B 1674. However, using the head pose input mode alone may not be sufficient to determine the desired user interface operation, because there is a 50% confidence that the head pose applies to the user interface operation associated with application A 1672 (shown as modification option 1772) and a 50% confidence that the head pose applies to another user interface operation associated with application B 1674 (shown as modification option 1774).

In various embodiments, a particular application or some type of user interface operation may be programmed to be more responsive to a particular input mode. For example, the HTML tag or JavaScript programming of application B 1674 may be set to be more responsive to gesture input than the HTML tag or JavaScript programming of application A 1672. As another example, application A 1672 may be more responsive to the head pose 1614 than to the gesture 1612, while a "select" operation may be more responsive to the gesture 1612 (e.g., a flick gesture) than to the head pose 1614, because in some cases the user is more likely to select an object using a gesture than using a head pose.

Referring to FIG. 17B, the gesture 1612 may be more responsive to a particular type of user interface operation in application B 1674. As shown, the gesture 1612 may have a higher confidence associated with the user interface operations of application B, while the gesture 1612 may not apply to the interface operations in application A 1672. Thus, if the target virtual object is application A 1672, the input received from the head pose 1614 may determine the target user interface operation. However, if the target virtual object is application B 1674, the input received from the gesture 1612 (alone or in combination with the input based on the head pose 1614) may determine the target user interface operation.

As another example, because the confidence level of the gesture 1612 is higher than the confidence level of the head pose 1614 when the user interacts with application B, the gesture 1612 may become the primary input mode for application B 1674, while the head pose 1614 may be the secondary input mode. Thus, the input received from the gesture 1612 may be given a higher weight than the head pose 1614. For example, if the head pose indicates that the virtual object associated with application B 1674 should remain stationary while the gesture 1612 indicates that the virtual object should move to the left, the central runtime server 1650 may cause the virtual object to move to the left. In some embodiments, the wearable system may allow the user to interact with the virtual object using the primary input mode, and may consider the secondary input mode if the primary input mode is insufficient to determine the user's action. For example, the user may interact with application B 1674 primarily through the gesture 1612. However, when the HMD is unable to determine the target user interface operation (e.g., because application B 1674 contains multiple candidate virtual objects, or because the gesture 1612 is ambiguous), the HMD may use the head pose as an input to determine the target virtual object or the target user interface operation to perform on application B 1674.

The scores associated with each input mode may be aggregated to determine the desired user interface operation. FIG. 17C illustrates an example of aggregating confidence scores associated with the input modes for virtual objects. As shown in this example, the head pose input 1614 produces a higher confidence score for application A (80% confidence) than for application B (30% confidence), while the gesture input 1612 produces a higher confidence score for application B (60% confidence) than for application A (30% confidence). The central runtime server 1650 may aggregate a confidence score for each object based on the confidence scores derived from each user input mode. For example, the central runtime server 1650 may generate an overall score of 110 for application A 1672 and an overall score of 90 for application B 1674. The overall score may be a weighted or unweighted average or other mathematical combination. Because application A 1672 has a higher overall score than application B 1674, the central runtime server 1650 may select application A as the application with which to interact. Additionally or alternatively, because the overall score of application A 1672 is higher, the central runtime server 1650 may determine that the head pose 1614 and the gesture 1612 are intended to perform a user interface operation on application A 1672, even though application B is more responsive to the gesture 1612 than application A.

In this example, the central runtime server 1650 aggregates the confidence scores by adding the confidence scores of the various inputs for a given object. In various other embodiments, the central runtime server 1650 may aggregate the confidence scores using techniques other than simple addition. For example, the input modes or scores may be associated with weights, and the aggregation of confidence scores then takes into account the weights assigned to the input modes or scores. The weights may be user-adjustable to allow the user to selectively adjust the "responsiveness" of the multimodal interaction with the HMD. The weights may also be context dependent. For example, weights used in public places may weight head or eye pose more heavily than hand gestures, thereby avoiding the social awkwardness that might be caused by having the user gesture frequently while operating the HMD. As another example, on a subway, airplane, or train, a voice command may be given less weight than head or eye pose, because the user may not want to speak loudly to his or her HMD in such an environment. Environmental sensors (e.g., GPS) may assist in determining the appropriate context in which the user is operating the HMD.
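A brief sketch of weighted aggregation with context-dependent weights is shown below. The per-mode scores follow the fig. 17C example, while the weight values and the "public place" context are illustrative assumptions.

```python
def weighted_scores(scores_by_mode, weights):
    """Combine per-mode confidence scores into one weighted total per object."""
    totals = {}
    for mode, scores in scores_by_mode.items():
        w = weights.get(mode, 1.0)
        for obj, score in scores.items():
            totals[obj] = totals.get(obj, 0.0) + w * score
    return totals

# Per-mode scores as in fig. 17C: head pose favors application A, hand gesture favors B.
scores_by_mode = {
    "head_pose": {"application A": 80, "application B": 30},
    "gesture":   {"application A": 30, "application B": 60},
}
print(weighted_scores(scores_by_mode, weights={"head_pose": 1.0, "gesture": 1.0}))
# {'application A': 110.0, 'application B': 90.0}  (unweighted sum, as in the text)

# In a public place, head pose might be weighted more heavily than hand gestures.
public_weights = {"head_pose": 1.0, "gesture": 0.5}
print(weighted_scores(scores_by_mode, public_weights))
# {'application A': 95.0, 'application B': 60.0}
```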

Although the examples in fig. 17A-17C are shown with reference to two objects, the techniques described herein may also be applied when there are more or fewer objects. In addition, the techniques described with reference to these figures may be applied to applications of a wearable system or virtual objects associated with one or more applications. Furthermore, the techniques described herein may also be applied to direct or indirect input modes other than head gestures, eye gaze, or hand gestures. For example, voice commands may also be used. In addition, although the central runtime server 1650 is used throughout as an example to describe the processing of various input modes, the local processing and data module 260 of the HMD may perform some or all of the operations in addition to or instead of the central runtime server 1650.

Example techniques to calculate confidence scores

The wearable system may use one or a combination of various techniques to calculate the confidence score for the subject. Fig. 18A and 18B illustrate an example of calculating a confidence score for an object within the FOV of a user. For example, the FOV of the user may be calculated based on the head pose or eye gaze of the user during cone projection. The confidence scores in fig. 18A and 18B may be based on a single input mode (e.g., a user's head pose). Multiple confidence scores (for some or all of the various multi-modal inputs) may be computed and then aggregated to determine a user interface operation or target virtual object based on the multi-modal user inputs.

Fig. 18A illustrates an example of computing a confidence score for a virtual object based on a portion of the virtual object that falls within the FOV 1810 of the user. In fig. 18A, the FOV of the user has a portion of two virtual objects (represented by circle 1802 and triangle 1804). The wearable system may assign confidence scores to circles and triangles based on the proportion of the projected area of the object that falls within FOV 1810. As shown, approximately half of the circle 1802 falls within the FOV 1810, so the wearable system may assign a 50% confidence score to the circle 1802. As another example, about 75% of the triangle is located within the FOV 1810. Thus, the wearable system may assign a 75% confidence score to triangle 1804.

The wearable system may use regression analysis of the FOV and the content in the FOR to calculate the proportion of a virtual object within the FOV. As described with reference to fig. 12B, while the wearable system keeps tracking objects in the FOR, the wearable system may also communicate the objects (or portions of the objects) in the FOV to a rendering projector (e.g., the display 220) for display within the FOV. The wearable system may determine which portions are provided to the rendering projector, and may analyze the proportion of the portions communicated to the rendering projector relative to the entire virtual object to determine the percentage of the virtual object within the FOV.

In addition to, or instead of, calculating a confidence score based on the proportion of the object's area that falls within the FOV, the wearable system may also analyze the space near the object in the FOV to determine the confidence score of the object. Fig. 18B shows an example of calculating a confidence score based on the uniformity of the space surrounding a virtual object in the FOV 1820. The FOV 1820 includes two virtual objects, shown as the triangle 1814 and the circle 1812. The space around each virtual object may be represented by vectors. For example, the space around virtual object 1812 may be represented by vectors 1822a, 1822b, 1822c, and 1822d, and the space around virtual object 1814 may be represented by vectors 1824a, 1824b, 1824c, and 1824d. The vectors may originate from a virtual object (or the boundary of a virtual object) and end at an edge of the FOV 1820. The system may analyze the distribution of the lengths of the vectors from the objects to the edges of the FOV to determine which object is closer to the center of the FOV. For example, an object at the very center of a circular FOV will have a relatively uniform distribution of vector lengths, while an object very close to an edge will have a non-uniform distribution of vector lengths (because some vectors pointing to the near edge are shorter, while those pointing to the farthest edge are longer). As shown in fig. 18B, the distribution of vector lengths from the virtual triangle 1814 to the edge of the FOV 1820 varies more than the distribution of vector lengths from the circle 1812 to the edge of the FOV 1820, which means that the virtual circle 1812 is closer to the center of the FOV 1820 than the virtual triangle 1814. The variability of the vector length distribution may be represented by the standard deviation or variance of the lengths (or another statistical measure). The wearable system may accordingly assign a higher confidence score to the virtual circle 1812 (as compared to the virtual triangle 1814).
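The sketch below illustrates this centeredness measure under simplifying assumptions: a circular FOV, an object represented by a single 2D point, and edge distances sampled along evenly spaced directions. Objects with a more uniform (lower standard deviation) distribution of edge distances score higher; the FOV radius and sample count are illustrative.

```python
import math
import statistics

def edge_vector_lengths(obj_xy, fov_center, fov_radius, samples=8):
    """Distances from the object to the edge of a circular FOV along evenly spaced directions."""
    lengths = []
    ox, oy = obj_xy[0] - fov_center[0], obj_xy[1] - fov_center[1]
    for i in range(samples):
        theta = 2 * math.pi * i / samples
        dx, dy = math.cos(theta), math.sin(theta)
        # Solve |obj + t*(dx, dy) - center| = radius for t > 0.
        b = ox * dx + oy * dy
        c = ox * ox + oy * oy - fov_radius * fov_radius
        lengths.append(-b + math.sqrt(b * b - c))
    return lengths

def centeredness_score(obj_xy, fov_center=(0.0, 0.0), fov_radius=1.0):
    """Lower variability of edge distances means the object is closer to the FOV center."""
    return 1.0 / (1.0 + statistics.pstdev(edge_vector_lengths(obj_xy, fov_center, fov_radius)))

# An object near the FOV center (like circle 1812) scores higher than one near the edge.
print(centeredness_score((0.1, 0.0)))   # close to center -> higher score
print(centeredness_score((0.8, 0.3)))   # near the edge   -> lower score
```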

In addition to the techniques described with reference to figs. 18A and 18B, the wearable system may assign confidence scores to virtual objects based on a historical analysis of user interactions. As an example, the wearable system may assign a higher confidence score to a virtual object with which the user frequently interacts. As another example, one user may prefer to use voice commands to move a virtual object (e.g., "move that there"), while another user may prefer to use hand gestures (e.g., by reaching out, "grabbing" the virtual object, and moving it to another location). The system may determine such user tendencies from the historical analysis. As yet another example, an input mode may often be associated with a particular user interface operation or a particular virtual object, and thus the wearable system may increase the confidence score assigned to that user interface operation or virtual object, even though alternative user interface operations or virtual objects may be consistent with the same input.

Given the field of view 1810 or 1820 shown in fig. 18A or 18B, a second input mode may facilitate selecting the appropriate virtual object or the appropriate user interface operation on a virtual object. For example, the user may say "enlarge the triangle" to increase the size of the triangle within the field of view 1810. As another example, in fig. 18A, the user may issue a voice command such as "make that twice as big". Because the virtual object 1804 has the higher confidence score based on the head pose, the wearable system may determine that the subject of the voice command (e.g., the target object) is the virtual object 1804. Advantageously, in some embodiments, this reduces the specificity of interaction required to produce the desired result. For example, the user can cause the wearable system to achieve the same interaction without saying "make the triangle twice as big".

The triangles and circles in fig. 18A and 18B are for illustration purposes only. The various techniques described herein may also be applied to virtual content that supports more complex user interactions.

Multimodal interaction examples in a physical environment

In addition to or in lieu of interacting with virtual objects, the wearable system may also provide a wide range of interactions within the real-world environment. Figs. 19A and 19B illustrate an example of interacting with a physical environment using multimodal inputs. In fig. 19A, three input modes are shown: gestures 1960, head pose 1920, and input from the user input device 1940. The head pose 1920 may be determined using a pose sensor. The pose sensor may be an IMU, gyroscope, magnetometer, accelerometer, or another type of sensor described with reference to figs. 2A and 2B. The outward-facing imaging system 464 may be used to measure the gesture 1960, while the user input device 1940 may be an embodiment of the user input device 466 shown in fig. 4.

In some embodiments, the wearable system may also measure the user's eye gaze. The eye gaze may include a vector extending from each eye of the user to the location where the lines of sight of the two eyes converge. The vector may be used to determine the direction in which the user is looking, and may be used to select or identify virtual content at the point of convergence or along the vector. Such eye gaze may be determined by eye tracking techniques such as glint detection, iris or pupil shape mapping, infrared illumination, or binocular imaging in which the point of intersection is regressed from the respective pupil orientations. The eye gaze or head pose may then be treated as the source point of a cone projection or ray projection for virtual object selection.

As described herein, an interaction event that moves selected virtual content in the user's environment (e.g., "put that there") may require a determination of a command operation (e.g., "put"), a subject (e.g., "that", which may be determined according to the multimodal selection techniques described above), and a parameter (e.g., "there"). A combination of input modes may be used to determine the command operation (or simply the command) and the subject (also referred to as the target object or target virtual object). For example, the command to move the subject 1912 may be based on a change in the head pose 1920 (e.g., a head turn or a nod) or a gesture 1960 (e.g., a swipe gesture), alone or in combination. As another example, the subject 1912 may be determined based on a combination of head pose and eye gaze. Thus, commands based on multimodal user inputs may sometimes also be referred to as multimodal input commands.

The parameters may also be determined using single input or multi-modal input. The parameters may be associated with an object in the user's physical environment (e.g., a table or wall) or an object in the user's virtual environment (e.g., a movie application, an avatar in a game, or a virtual building). In some embodiments, identifying real-world parameters may allow for faster and more accurate content placement responses. For example, a particular virtual object (or a portion of a virtual object) may be substantially planar with a horizontal orientation (e.g., the normal of the virtual object is perpendicular to the floor of the room). When a user initiates an interaction to move a virtual object, the wearable system may identify a real-world surface (e.g., a desktop) having a similar orientation and move the virtual object to the real-world surface. In some embodiments, this movement may be automatic. For example, a user may want to move a virtual book from a location on the floor on which it is located. The only available horizontal surface in the room may be the user's desk. Thus, the wearable system can automatically move the virtual book to the surface of the desk in response to the voice command to "move that" without the user entering additional commands or parameters, since the surface of the desk is the most likely location to which the user wishes to move the book. As another example, the wearable system may identify a real-world surface of suitable size for a given content, which may provide a better parameter match for the user. For example, if a user is watching a virtual video screen with a given display size and wishes to move it to a particular surface by simple voice command, the system can determine which real world surfaces provide the necessary surface area to best support the display size of the virtual video.
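A rough sketch of this surface-matching idea is shown below: among candidate real-world surfaces, those whose orientation matches the content and whose area can support the content's display size are kept. The surface records, normals, areas, tolerance, and the tie-breaking rule (prefer the smallest surface that still fits) are illustrative assumptions, not the described system's actual criteria.

```python
def best_surface(content, surfaces, normal_tolerance=0.9):
    """Pick a real-world surface whose orientation matches the content and whose
    area is large enough for the content's display size."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    candidates = [
        s for s in surfaces
        if abs(dot(s["normal"], content["normal"])) >= normal_tolerance
        and s["area_m2"] >= content["area_m2"]
    ]
    # Prefer the smallest surface that still fits, to avoid occupying large surfaces.
    return min(candidates, key=lambda s: s["area_m2"], default=None)

surfaces = [
    {"name": "desk top", "normal": (0, 1, 0), "area_m2": 1.2},
    {"name": "wall",     "normal": (1, 0, 0), "area_m2": 6.0},
    {"name": "shelf",    "normal": (0, 1, 0), "area_m2": 0.1},
]
# A horizontally oriented virtual book needs a horizontal surface of sufficient area.
virtual_book = {"normal": (0, 1, 0), "area_m2": 0.3}
print(best_surface(virtual_book, surfaces)["name"])   # desk top
```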

The wearable system may identify a target parameter (e.g., a target surface) using a technique described with reference to identifying a target virtual object. For example, the wearable system may calculate confidence scores associated with a plurality of target parameters based on indirect user input or direct user input. As an example, the wearable system may calculate a confidence score associated with a wall based on direct input (e.g., a user's head pose) and indirect input (e.g., a feature of the wall such as a vertical surface).

Example techniques to identify real world parameters

The wearable system may use various techniques to determine the parameters (such as a target location) of a multimodal input command. For example, the wearable system may use various depth sensing techniques, such as applying a SLAM protocol to environmental depth information (e.g., as described with reference to fig. 9), or constructing or accessing a mesh model of the environment. In some embodiments, depth sensing determines the distance between known points in 3D space (e.g., the distance between sensors on the HMD) and points of interest ("POIs") on the surface of an object in the real world (e.g., a wall on which virtual content is to be located). This depth information may be stored in the world map 920. Parameters for the interaction may be determined based on the set of POIs.

The wearable system may apply these depth sensing techniques to data obtained from the depth sensors to determine the boundaries of the physical environment. The depth sensor may be part of the outward facing imaging system 464. In some embodiments, a depth sensor is coupled to the IMU. Data acquired from the depth sensor may be used to determine the orientation of the multiple POIs relative to each other. For example, the wearable system may calculate a truncated signed distance function ("TSDF") for the POI. TSDF may include a numerical value for each POI. When a point is within a given tolerance of a particular plane, the value may be zero; when a point is spaced apart from a particular plane in a first direction (e.g., above or outside), the value can be positive; when a point is spaced apart (e.g., below or inside) from a particular plane in a second (e.g., opposite) direction, the value may be negative. The calculated TSDF can be used to define a 3D volume grid of bricks or boxes aligned in, above, and below the particular plane to construct or represent a particular surface along the orientation determined by the IMU.

POIs that are outside a given plane tolerance (e.g., the absolute value of the TSDF is greater than the tolerance) may be eliminated, leaving only POIs that are adjacent to each other within the given tolerance, to create a virtual representation of a surface in the real-world environment. For example, the real-world environment may include a conference table. There may be various other objects (e.g., a phone, a laptop, a coffee cup, etc.) on top of the conference table. For the surface of the conference table, the wearable system may keep the POIs associated with the conference table and remove the POIs of the other objects. Thus, a planar map (depicting the surface of the conference table) may represent the conference table with only the points belonging to the conference table. The map may omit points associated with the objects on the conference table top. In some embodiments, the set of POIs retained in the planar map may be referred to as the "usable surfaces" of the environment, because these areas of the planar map represent the space in which virtual objects may be placed. For example, when a user wants to move a virtual screen onto a table, the wearable system may identify suitable surfaces (e.g., a desktop, a wall, etc.) in the user's environment while eliminating objects (e.g., a coffee cup, a pencil, or a fresco) or surfaces (e.g., bookshelf surfaces) that are not suitable for placing the screen. In this example, the identified suitable surfaces may be the usable surfaces of the environment.
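The sketch below illustrates this plane-tolerance filtering under simplifying assumptions: a signed distance to a single candidate plane is computed for each POI, and only points within the tolerance are retained as the usable surface. A real system would truncate the distances (as in a TSDF) and fuse many depth measurements; the plane, tolerance, and sample points here are illustrative.

```python
def signed_distance(point, plane_point, plane_normal):
    """Signed distance from a POI to a plane (positive on one side, negative on the other)."""
    return sum((p - q) * n for p, q, n in zip(point, plane_point, plane_normal))

def usable_surface(points, plane_point, plane_normal, tolerance=0.02):
    """Keep only POIs lying within `tolerance` meters of the candidate plane."""
    return [p for p in points
            if abs(signed_distance(p, plane_point, plane_normal)) <= tolerance]

# Points on a table top at y = 0.75 m, plus clutter (a cup and a laptop) above it.
table_plane_point, table_normal = (0.0, 0.75, 0.0), (0.0, 1.0, 0.0)
pois = [(0.1, 0.75, 0.2), (0.4, 0.76, -0.1), (0.2, 0.85, 0.0), (0.3, 0.95, 0.1)]
print(usable_surface(pois, table_plane_point, table_normal))
# [(0.1, 0.75, 0.2), (0.4, 0.76, -0.1)]  -- the cup and laptop points are removed
```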

Referring back to the example shown in FIG. 19A, environment 1900 may include physical walls 1950. The HMD or user input device 1940 may house a depth sensor system (e.g., a time-of-flight sensor or a Vertical Cavity Surface Emitting Laser (VCSEL)) and a pose sensor (e.g., an IMU). Data obtained by the depth sensor system may be used to identify various POIs in the user's environment. The wearable system may group substantially planar POIs together to form a boundary polygon 1910. The boundary polygon 1910 may be an example embodiment of a usable surface.

In some embodiments, the outward-facing imaging system 464 may recognize a user gesture 1960, which user gesture 1960 may include a finger pointing to an area within the real world environment 1900. The outward-facing imaging system 464 may identify the pre-measured boundary polygon 1910 by determining a sparse point vector configuration of finger pointing towards the boundary polygon 1910.

As shown in fig. 19A, inside the boundary polygon 1910 there may be a virtual video screen 1930. The user can interact with virtual objects 1912 inside virtual video screen 1930 using multimodal input. FIG. 19B illustrates interaction with a virtual object 1912 within a real-world environment using multimodal input. The environment in fig. 19B includes a vertical surface 1915 (which may be part of a wall) and a surface 1917 on a desktop. In the first state 1970a, the virtual content 1926 is initially displayed within the boundary polygon 1972a on the wall surface 1915. The user may select virtual object 1926, for example, by cone projection or multi-modal input (including two or more of gesture 1960, head pose 1920, eye gaze, or input from user input device 1940).

The user may select surface 1917 as the destination using another input as part of the multimodal input. For example, the user may use head gestures and gestures in combination to indicate that the surface 1917 is a destination. The wearable system may identify the surface 1917 (and the polygon 1972b) by combining POIs that appear to be located on the same plane. The wearable system may also use other surface identification techniques to identify the surface 1917.

The user may also pan the virtual content 1926 to a bounding polygon 1972b on the surface 1917 using multimodal input, as shown in a second state 1970b. For example, the user may move virtual content 1926 through a combination of changes in head pose and movement of user input device 1940.

As another example, a user may say "move that there" via the microphone 232 of the wearable system, which may receive the audio stream and parse the command therefrom (as described herein). The user may combine this voice command with the initiation of a head gesture, eye gaze, hand gesture, or totem. Since the virtual object 1926 is the highest confidence object (e.g., see dashed lines in scene 1970a, which indicates that the user's finger 1960, HMD 1920, and totem 1940 are oriented toward the object 1926), the wearable system may detect the virtual object 1926 as the subject of the command. The wearable system may also recognize the command operation as "movement" and determine the parameters of the command as "there". The wearable system may further determine that "there" is a boundary polygon 1972b based on input patterns other than speech (e.g., eye gaze, head pose, gestures, totems).

The commands in the interaction event may involve the adjustment and calculation of a number of parameters. For example, the parameters may include a destination, placement, orientation, appearance (e.g., size or shape), or animation of the virtual object. The wearable system may also automatically calculate the parameters even if the direct input is not explicit in altering the parameters. As an example, the wearable system may automatically change the orientation of virtual object 1926 as virtual object 1926 moves from vertical surface 1915 to horizontal surface 1917. In the first state 1970a, the virtual content 1926 is substantially vertically oriented on the surface 1915. When virtual content 1926 is moved to surface 1917 in second state 1970b, the orientation of virtual content 1926 may remain consistent (e.g., remain vertically oriented), as shown by virtual object 1924. The wearable system may also automatically adjust the orientation of the virtual content 1926 to align with the orientation of the surface 1917 such that the virtual content 1926 appears to be in a horizontal position, as shown by virtual object 1922. In this example, the orientation may be automatically adjusted based on the environmental tracking 1632 as an indirect input. When the wearable system determines that the object is a target destination object, the wearable system may automatically consider characteristics of the object (e.g., surface 1917). The wearable system may adjust a parameter of the virtual object based on a characteristic of the target destination object. In this example, the wearable system automatically rotates the orientation of virtual object 1926 based on the orientation of surface 1917.
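
A minimal sketch of the automatic orientation adjustment is shown below; it assumes the destination surface's normal is known from the environment mesh, and the helper simply derives the rotation that maps the object's up direction onto that normal (the math is standard; the surrounding names are hypothetical):

import numpy as np

def align_to_surface(object_up, surface_normal):
    # Return the rotation matrix that rotates object_up onto surface_normal.
    a = object_up / np.linalg.norm(object_up)
    b = surface_normal / np.linalg.norm(surface_normal)
    v, c = np.cross(a, b), float(np.dot(a, b))
    if np.isclose(c, -1.0):  # opposite directions: rotate 180 degrees about a perpendicular axis
        axis = np.cross(a, [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0])
        axis /= np.linalg.norm(axis)
        k = np.array([[0, -axis[2], axis[1]], [axis[2], 0, -axis[0]], [-axis[1], axis[0], 0]])
        return np.eye(3) + 2.0 * k @ k
    k = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + k + k @ k / (1.0 + c)  # Rodrigues' rotation formula

# Moving content from a wall (up along +y for an upright panel) to a table top
# (normal along +z) lays the panel flat, as with virtual object 1922.
R = align_to_surface(np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0]))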

Other examples of automatic placement or movement of virtual objects are described in U.S. application No.15/673,135, entitled "AUTOMATIC PLACEMENT OF VIRTUAL OBJECTS IN THREE-DIMENSIONAL SPACE," filed on 8/9/2017, and published as U.S. patent publication No.2018/0045963, the entire contents of which are incorporated herein by reference.

In some embodiments, the input may explicitly modify a plurality of parameters. For example, in addition to identifying the surface 1917 as a destination, the voice command "place that there flat" may also change the orientation of the virtual object 1926. In this example, the word "flat" and the word "there" may both be parameter values, where "there" causes the wearable system to update the location of the target virtual object, and the word "flat" is associated with the orientation of the target virtual object at the destination location. To carry out the parameter "flat," the wearable system may match the orientation of virtual object 1926 to the orientation of surface 1917.

In addition to or in lieu of selecting and moving virtual objects, the multimodal input can interact with the virtual content in other ways. FIG. 20 illustrates an example of automatically resizing a virtual object based on multimodal input. In fig. 20, a user 1510 may wear an HMD 1502 and may interact with virtual objects using gestures and voice commands 2024. Fig. 20 shows four scenes 2000a, 2000b, 2000c, and 2000 d. Each scene includes a display screen and a virtual object (shown by a smiling face).

In scene 2000a, the display screen has size 2010 and the virtual object has size 2030. The user may change the gesture from gesture 2020 to gesture 2022 to indicate that the user wants to resize the virtual object or display screen. The user may use the speech input 2024 to indicate whether a virtual object or a display screen is the subject of the manipulation.

As an example, a user may want to simultaneously zoom in on a display screen and a virtual object. Thus, the user may use the input gesture 2022 as a zoom-in command. The parameter of the degree of enlargement may be represented by the extended finger range. Meanwhile, the user may indicate the subject of the interaction using the voice input 2024. As shown in scene 2000b, the user may say "all" (all) to produce an enlarged display 2012 and an enlarged virtual object 2032. As another example, in scene 2000c, the user may say "content" to generate an enlarged virtual object 2034, while the size of the display screen is the same as in scene 2000 a. As yet another example, in scene 2000d, the user may say "display" to produce an enlarged display screen 2016 while the size of the virtual objects remains the same as in scene 2000 a.
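
The following short sketch (hypothetical object interfaces) shows how the spoken word can disambiguate the subject of the resize gesture while the finger spread supplies the scale factor:

def apply_resize(spoken_word, scale, display, virtual_object):
    # The voice input picks the subject; the gesture supplies the zoom amount.
    word = spoken_word.lower()
    if word == "all":            # scene 2000b: enlarge both
        display.scale(scale)
        virtual_object.scale(scale)
    elif word == "content":      # scene 2000c: enlarge only the virtual object
        virtual_object.scale(scale)
    elif word == "display":      # scene 2000d: enlarge only the display screen
        display.scale(scale)
    else:
        raise ValueError(f"unrecognized resize subject: {spoken_word!r}")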

Examples of indirect input as input mode

As described herein, the wearable system can be programmed to allow a user to interact with direct user input and indirect user input as part of the multimodal input. Direct user input may include head gestures, eye gaze, voice input, gestures, input from a user input device, or other input directly from the user. The indirect input may include various environmental factors such as the user's location, the user's characteristics/preferences, characteristics of the object, characteristics of the user's environment, and the like.

As described with reference to fig. 2A and 2B, the wearable system may include a location sensor, such as GPS or radar or lidar. The wearable system may determine the subject of the user interaction based on the proximity of the object to the user. FIG. 21 illustrates an example of identifying a target virtual object based on a location of the object. Fig. 21 schematically shows a bird's eye view 2100 of a FOR of a user. The FOR may include a plurality of virtual objects 2110a through 2110 q. A user may wear an HMD that includes a position sensor. The wearable system may determine candidate target objects based on the proximity of the object to the user. For example, the wearable system may select virtual objects within a threshold radius (e.g., 1m, 2m, 3m, 5m, 10m, or more) from the user as candidate target virtual objects. In fig. 21, virtual objects (e.g., virtual objects 2110o, 2110p, 2110q) fall within a threshold radius (illustrated by dashed circle 2122) from user location 2120. Accordingly, the wearable system may set virtual objects 2110o through 2110q as candidate target virtual objects. The wearable system may further refine the selection based on other inputs (e.g., the user's head pose). The threshold radius may depend on contextual factors, such as the location of the user. For example, the threshold radius may be shorter for a user when in his or her office than for a user when in an outside park. The candidate object may be selected from a portion of the region 2122 within a threshold radius from the user. For example, only those objects that are within circle 2122 and that are in the FOV of the user (e.g., typically in front of the user) may be candidates, while objects that are within circle 2122 but outside the FOV of the user (e.g., behind the user) may not be candidates. As another example, multiple virtual objects may be along a common line of sight. For example, the pyramidal projection may select multiple virtual objects. The wearable system may use the user's location as another input to determine the target virtual object or parameters for user interaction. For example, cone projection may select objects corresponding to different depth planes, but the wearable system may be configured to identify the target virtual object as an object within reach of the user's hand.
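
One way to combine the proximity filter with the FOV constraint described above is sketched below; the radius, FOV angle, and object attributes are illustrative assumptions:

import numpy as np

def candidate_targets(objects, user_pos, gaze_dir, radius=3.0, fov_deg=60.0):
    # Keep virtual objects that are within `radius` meters of the user AND
    # inside the user's field of view cone (e.g., circle 2122 in FIG. 21).
    gaze = gaze_dir / np.linalg.norm(gaze_dir)
    half_fov = np.radians(fov_deg) / 2.0
    candidates = []
    for obj in objects:                      # obj.position assumed to be a 3-vector
        offset = obj.position - user_pos
        dist = np.linalg.norm(offset)
        if dist == 0.0 or dist > radius:
            continue                         # outside the threshold radius
        angle = np.arccos(np.clip(np.dot(offset / dist, gaze), -1.0, 1.0))
        if angle <= half_fov:                # roughly in front of the user
            candidates.append(obj)
    return candidates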

Similar to direct input, indirect input may also be assigned a value that may be used to calculate a confidence score for the virtual object. For example, when multiple objects or parameters are within a common selection confidence, the indirect input may be further used as a confidence factor. Referring to fig. 21, virtual objects within circle 2122 may have a higher confidence score than virtual objects in between circle 2122 and circle 2124 because objects closer to user location 2120 are more likely to be objects with which the user is interested in interacting.

In the example shown in fig. 21, for convenience, dashed circles 2122, 2124 are shown representing projections of spheres having respective radii onto the plane shown in fig. 21. This is for illustration only and not for limitation; in other embodiments, other shapes of regions (e.g., polyhedrons) may be selected.

FIGS. 22A and 22B illustrate another example of interacting with a user environment based on a combination of direct and indirect inputs. These two figures show two virtual objects in the world camera's FOV 1270 (which may be larger than the user's FOV 1250): virtual object A 2212 and virtual object B 2214. Virtual object A 2212 is also within the FOV 1250 of the user. For example, virtual object A 2212 may be a virtual document that the user is currently viewing, while virtual object B 2214 may be a virtual sticky note on a wall. However, when the user interacts with virtual object A 2212, the user may need to view virtual object B 2214 to obtain additional information from it. The user could rotate the head to the right (to change FOV 1250) to view virtual object B 2214. Advantageously, in some embodiments, the wearable system may instead detect a change in the user's gaze direction (toward virtual object B 2214) without the user turning the head. The wearable system may then automatically move virtual object B 2214 within the FOV of the user without the user having to change his head pose. Virtual object B may overlay virtual object A (or be included within object A), or object B may be placed within the user's FOV 1250 but at least partially spaced apart from object A (so that object A remains at least partially visible to the user).

As another example, virtual object B2214 may be on another user interface screen. The user may want to switch between the user interface screen with virtual object a 2212 and the user interface screen with virtual object B2214. The wearable system may switch without changing the FOV 1250 of the user. When an eye gaze change or actuation of a user input device is detected, the wearable system may automatically move the user interface screen with virtual object a 2212 outside the user's FOV 1250 while simultaneously moving the user interface screen with virtual object B2214 inside the user's FOV 1250. As another example, the wearable system may automatically overlay the user interface screen with virtual object B2214 over the user interface screen with virtual object a 2212. Once the user provides an indication that he has completed the virtual user interface screen, the wearable system may automatically move the virtual user interface screen out of the FOV 1250.

Advantageously, in some embodiments, the wearable system may identify virtual object B2214 as the target virtual object to be moved within the FOV based on the multi-modal input. For example, the wearable system may make this determination based on the user's eye gaze and the location of the virtual object. The wearable system may set the target virtual object to be in the user's gaze direction and to be the closest object to the user.

Example Process for interacting with virtual objects Using Multi-modal user input

FIG. 23 illustrates an example process of interacting with a virtual object using multimodal input. Process 2300 may be performed by a wearable system as described herein. For example, process 2300 can be performed by local processing and data module 260, remote processing module 270, and central runtime server 1650, alone or in combination.

At block 2310, the wearable system may optionally detect an initiation condition. The initiation condition may be a user input that indicates that the user intends to issue a command to the wearable system. The initiation condition may be predefined by the wearable system. The initiation condition may be a single input or a combined input. For example, the initiation condition may be a voice input, such as speaking the phrase "Hey Magic Leap." The initiation condition may also be gesture based. For example, the wearable system may detect the presence of the initiation condition when the user's hand is detected within the FOV of the world camera (or the FOV of the user). As another example, the initiation condition may be a specific hand action, such as a finger snap. The initiation condition may also be detected when the user actuates the user input device. For example, the user may click a button on the user input device to indicate that the user is about to issue a command. In some implementations, the initiation condition can be based on a multimodal input. For example, the wearable system may require both a voice command and a gesture to detect the presence of the initiation condition.
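
A sketch of a multimodal initiation check is shown below; the event fields, wake phrase, and time window are assumptions used only to illustrate requiring two input modes to agree:

def initiation_condition_met(voice_event, gesture_event, window_s=2.0):
    # Require both a wake phrase and a detected hand within a short time window.
    if voice_event is None or gesture_event is None:
        return False
    heard_wake_phrase = voice_event.text.lower().startswith("hey magic leap")
    saw_hand_in_fov = gesture_event.kind == "hand_detected"
    close_in_time = abs(voice_event.timestamp - gesture_event.timestamp) <= window_s
    return heard_wake_phrase and saw_hand_in_fov and close_in_time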

Block 2310 is optional. In some embodiments, the wearable system may receive and begin parsing the multimodal input without detecting an initiation condition. For example, while the user is watching a video, the wearable system may accept the user's multimodal input to adjust the volume, fast forward, fast rewind, skip to the next episode, etc., without the user first providing an initiation condition. Advantageously, in some embodiments, the user may not need to wake up the video screen before the user can interact with the video screen using multimodal input (e.g., so that the video screen can present time adjustment or volume adjustment tools).

At block 2320, the wearable system may receive multimodal input for user interaction. The multimodal input may be direct or indirect input. Example input modes may include speech, head gestures, eye gaze, gestures (on a user input device or in the air), input on a user input device (e.g., totem), characteristics of an object (physical or virtual object) in a user environment, or a 3D environment.

At block 2330, the wearable system can parse the multimodal input to identify topics, commands, and parameters of user interaction. For example, the wearable system may assign confidence scores to the candidate target virtual objects, target commands, and target parameters and select topics, commands, and parameters based on the highest confidence scores. In some embodiments, one input mode may be a primary input mode and another input mode may be a secondary input mode. Input from the secondary input mode may supplement input from the primary input mode to determine a target topic, command, or parameter. For example, the wearable system may set the head pose to a primary input mode and the voice command to a secondary input mode. The wearable system may first interpret as much input as possible from the primary input mode, and then interpret additional input from the secondary input mode. The wearable system may automatically provide a disambiguation prompt to the user if the additional input is interpreted as suggesting a different interaction than the input of the primary input. The disambiguation prompt may request that the user select a desired task from: interpretation of the primary input, or alternative options based on interpretation of the secondary input. Although the example is described with reference to a primary input mode and a secondary input mode, in various cases there may be more than two input modes. The same technique can also be applied to the third input mode, the fourth input mode, and the like.
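
The primary/secondary resolution logic might be structured as in the sketch below (a simplification with an assumed confidence margin, not the actual parser): the secondary mode only triggers a disambiguation prompt when it clearly points at a different target than the primary mode:

def resolve_target(primary_scores, secondary_scores, margin=0.15):
    # Each argument maps candidate object -> confidence score in [0, 1].
    # Returns (choices, needs_prompt).
    primary_pick = max(primary_scores, key=primary_scores.get)
    secondary_pick = max(secondary_scores, key=secondary_scores.get)
    if secondary_pick == primary_pick:
        return [primary_pick], False
    if secondary_scores[secondary_pick] > primary_scores[primary_pick] + margin:
        # The modes disagree strongly: ask the user to pick an interpretation.
        return [primary_pick, secondary_pick], True
    return [primary_pick], False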

At block 2340, the wearable system may perform user interactions based on the topics, commands, and parameters. For example, the multimodal input may include an eye gaze and a voice command "place that there". The wearable system may determine that the subject of the interaction is the object with which the user is currently interacting, that the command is "drop", and that the parameter is the center of the user's gaze field (determined based on the user's eye gaze direction). Thus, the user may move the virtual object with which the user is currently interacting to the center of the user's gaze field of view.

Example of setting direct input mode associated with user interaction

In some cases, for example when a user interacts with the wearable system using gestures or speech, there is a risk that others in the vicinity of the user may "hijack" the user interaction by issuing commands using these direct inputs. For example, in a park, user A may be standing near user B. User A may interact with the HMD using voice commands. User B can hijack the experience of user A by saying "take a picture." This voice command issued by user B causes the HMD of user A to take a picture even though user A never intended to take a picture. As another example, user B may gesture within the FOV of the world camera of user A's HMD. For example, while user A is playing a video game, this gesture may cause user A's HMD to go to the home page.

In some implementations, the input can be analyzed to determine whether the input originated from a user. For example, the system may apply speaker recognition techniques to determine whether the command "take a picture" was spoken by user a or hijacker B. The system may apply computer vision techniques to determine whether the gesture was made by the hand of user a or the hand of hijacker B.

Additionally or alternatively, to prevent security breaches and interruption of user interaction with the wearable system, the wearable system may automatically set the available direct input modes based on indirect input, or require multiple direct input modes before executing a command. FIG. 24 illustrates an example of setting a direct input mode associated with user interaction. Three direct inputs are shown in FIG. 24: speech 2412, head pose 2414, and gestures 2416. As described further below, the sliders 2422, 2424, 2426 represent the amount of weighting for each input in determining the command. If the slider is all the way to the right, the input is given full weight (e.g., 100%); if the slider is all the way to the left, the input is given zero weight (e.g., 0%); and if the slider is between these limit settings, the input is given a partial weight (e.g., 20% or 80% or some other intermediate value, such as a value between 0 and 1). In this example, the wearable system may be set to require both the voice command 2412 and the gesture 2416 (without using head pose 2414) prior to executing a command. Thus, if the voice command 2412 and the gesture 2416 indicate different user interactions (or virtual objects), the wearable system may not execute the command. By requiring both types of input, the wearable system may reduce the likelihood that others hijack the user interaction.

As another example, one or more input modes may be disabled. For example, when a user interacts with a document processing application, head pose 2414 may be disabled as an input mode, as shown in FIG. 24, where head pose slider 2424 is set to 0.

Each input may be associated with an authentication level. In FIG. 24, speech 2412 is associated with authentication level 2422; head pose 2414 is associated with authentication level 2424; and the gesture 2416 is associated with authentication level 2426. The authentication level may be used to determine whether an input is required for a command to be executed, whether an input is disabled, or whether an input is partially weighted (between fully enabled and fully disabled). As shown in FIG. 24, the authentication levels of the speech 2412 and the gesture 2416 are set all the way to the right (associated with a maximum authentication level), indicating that both inputs are required to issue a command. As another example, the authentication level of the head pose 2414 is set all the way to the left (associated with the minimum authentication level). This indicates that head pose 2414 is not required to issue a command, even though head pose 2414 may still be used to determine a target virtual object or a target user interface operation. In some cases, by setting the authentication level to the minimum, the wearable system may disable the head pose 2414 as an input mode.

In some embodiments, the authentication level may also be used to calculate a confidence level associated with the virtual object. For example, the wearable system may assign a higher value to input modes with higher authentication levels and a lower value to input modes with lower authentication levels. Thus, when the confidence scores from multiple input patterns are aggregated to calculate a total confidence score for the virtual object, the input pattern with the higher authentication level has a greater weight in the total confidence score than the input pattern with the lower authentication level.
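
A sketch of how authentication levels could both weight per-mode confidence scores and gate command execution is shown below; the threshold and weighting scheme are assumptions, not the system's actual policy:

def total_confidence(per_mode_scores, auth_levels):
    # Dicts keyed by mode name, e.g., "voice", "head_pose", "gesture".
    # Modes with authentication level 0 contribute nothing; higher levels weigh more.
    weighted = sum(auth_levels.get(m, 0.0) * s for m, s in per_mode_scores.items())
    total_weight = sum(auth_levels.get(m, 0.0) for m in per_mode_scores) or 1.0
    return weighted / total_weight

def may_execute(per_mode_scores, auth_levels, threshold=0.5):
    # Every mode at the maximum authentication level (1.0) must agree.
    required = [m for m, level in auth_levels.items() if level >= 1.0]
    return all(per_mode_scores.get(m, 0.0) >= threshold for m in required)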

The authentication level may be set by the user (either by input or via a settings panel) or may be set automatically by the wearable system (e.g., based on indirect input). The wearable system may require more input modes when the user is in a public location, and fewer input modes when the user is in a private location. For example, when the user is on a subway, the wearable system may require both speech 2412 and gestures 2416. However, when the user is at home, the wearable system may only need speech 2412 to issue the command. As another example, the wearable system may disable voice commands when the user is in the park, thereby providing privacy for the user's interactions. However, when the user is at home, voice commands remain available.

Although the examples are described with reference to setting a direct input mode, similar techniques may also be applied to setting an indirect input mode as part of a multimodal input. For example, when a user is using a public vehicle (e.g., a bus), the wearable system may be configured to disable the geographic location as an input mode because the wearable system may not know exactly where the user is specifically sitting or standing on the public vehicle.

Other examples of user experiences

In addition to the examples described herein, this section also describes other user experiences that utilize multimodal input. As a first example, the multimodal input may include speech input. For example, the user may speak a voice command, such as "Hey Magic Leap, call her," that is received by the audio sensor 232 on the HMD and interpreted by the HMD system. In this command, the user can initiate the task (or provide an initiation condition) by speaking "Hey Magic Leap." The word "call" may be a pre-programmed word, so the wearable system knows that it should make a phone call (rather than initiate a video call). In some embodiments, these pre-programmed words, which may also be referred to as "hotwords" or "carrier phrases," may be recognized by the system as indicating that the user wants to take a particular action (e.g., "call"), and may alert the system to accept further input to complete the desired action (e.g., identify the person(s) or phone number after the word "call"). The wearable system may use additional input to identify who "her" refers to. For example, the wearable system may use eye tracking to see which contact on the user's phone or virtual contact list the user is viewing. The wearable system may also use head pose or eye tracking to determine whether the user is looking directly at the person the user wants to call. In some embodiments, the wearable system may utilize face recognition techniques (e.g., using the object recognizer 708) to determine the identity of the person the user is viewing.

As a second example, the user may have a virtual browser placed directly on a wall (e.g., the display 220 of the wearable system may project the virtual browser as if it were overlaid on the wall). The user may extend his or her hand and provide a flick gesture on a link in the browser. Since the browser appears to be located on a wall, the user may tap on the wall or tap in the space such that the projection of the user's finger appears to tap on the wall to provide an indication. The wearable system can use the multi-modal input to identify the link that the user intends to click on. For example, the wearable system may use gesture detection (e.g., via data acquired by outward facing imaging system 464), cone projection based on head pose, and eye gaze. In this example, the accuracy of the gesture detection may be less than 100%. Wearable systems may utilize data acquired from head gestures and eye gaze to improve gesture detection, thereby improving gesture tracking accuracy. For example, the wearable system may identify the radius at which the eye is most likely to focus based on data acquired by the inward-facing imaging system 462. In some embodiments, the wearable system may identify a gaze field of view of the user based on the eye gaze. The wearable system may also use indirect input such as environmental features (e.g., the location of a wall, features of a browser or web page, etc.) to improve gesture tracking. In this example, the wall may be represented by a planar grid (which may be pre-stored in the map 920 of the environment), from which the wearable system may determine the location of the user's hand, and thus the link that the user aims and selects. Advantageously, in various embodiments, by combining multiple input modes, the accuracy required for one input mode for user interaction may be reduced as compared to a single input mode. For example, the FOV camera need not have very high resolution for gesture recognition, as the wearable system may complement the gesture with head pose or eye gaze to determine the intended user interaction.
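
The complementary use of eye gaze to tighten gesture-based link selection can be sketched as follows (2D wall-plane coordinates and attribute names are assumptions for illustration):

import numpy as np

def pick_link(links, fingertip_on_wall, gaze_point_on_wall, gaze_radius=0.15):
    # Only links inside the eye-gaze focus circle are considered; among those,
    # the one closest to the detected fingertip position is selected.
    in_focus = [link for link in links
                if np.linalg.norm(link.pos - gaze_point_on_wall) <= gaze_radius]
    if not in_focus:
        return None
    return min(in_focus, key=lambda link: np.linalg.norm(link.pos - fingertip_on_wall))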

Although the multimodal input in the above examples includes audio input, audio input is not necessary for the multimodal input interactions described above. For example, a user may use a 2D touch swipe gesture (e.g., on a totem) to move a browser window from one wall to another. The browser may initially be located on the left wall. The user may select the browser by actuating the totem. The user may then look at the right wall and make a rightward swipe gesture on the totem's touchpad. Swiping on a touchpad alone is loose and imprecise, because a 2D swipe by itself does not translate easily into 3D motion. However, the wearable system may detect the walls (e.g., based on environmental data acquired by the outward facing imaging system) and detect the point on the wall that the user is specifically looking at (e.g., based on eye gaze). With these three inputs (touch swipe, gaze, and environmental features), the wearable system can place the browser window, with high confidence, exactly where the user wants it to sit.

Other examples of head gestures as multimodal input

In various embodiments, the multimodal input may support a totem-less experience (or an experience that does not use totems frequently). For example, the multimodal input can include a combination of head gestures and voice controls that can be used to share or search for virtual objects. Multimodal input can also use a combination of head gestures and gestures to navigate various user interface planes and virtual objects within the user interface planes. A combination of head gestures, speech, and gestures may be used to move objects, conduct social networking activities (e.g., initiate and conduct telepresence sessions, share posts), browse information on web pages, or control media players.

FIG. 25 shows an example of a user experience for multimodal input. In the example scenario 2500a, the user 2510 may target the applications 2512 and 2514 with head gestures and select the applications 2512 and 2514. The wearable system may display a focus indicator 2524a, the focus indicator 2524a indicating that the user is currently interacting with the virtual object using the head pose. Once the user selects the application 2514, the wearable system may display a focus indicator 2524a for the application 2514 (e.g., the target graphic shown in fig. 25, a halo around the application 2514, or make the virtual object 2514 appear closer to the user). The wearable system may also change the appearance of the focus indicator from focus indicator 2524a to focus indicator 2524b (e.g., the arrow graphic shown in scene 2500 b), which indicates that the interaction by user input device 466 also becomes available after the user selects virtual object 2514. Voice and gesture interactions extend this interaction mode of head pose plus gestures. For example, when a user issues a voice command, an application targeted with head gestures may respond to or be manipulated by the voice command. Other examples of interacting with virtual objects through combinations of, for example, head gestures, and speech recognition are described in U.S. application No.15/296,869, entitled "SELECTING VIRTUAL OBJECTS IN A THREE-DIMENSIONAL SPACE (for selecting virtual objects in three-DIMENSIONAL SPACE)", filed on 10/18/2016, published as U.S. patent publication No.2017/0109936, the entire disclosure of which is incorporated herein by reference.

Head gestures may be combined with voice control, gesture recognition, and environmental information (e.g., mesh information) to provide hands-free browsing. For example, if the user is using a head gesture to aim at the browser, a voice command to "Search for Lauderdale" will be processed by the browser. If the user is not aiming at a particular browser, the wearable system can also process the voice command without going through the browser. As another example, when the user says "Share this with Karen," the wearable system will perform a sharing action on the application at which the user is aiming (e.g., using head gestures, eye gaze, or gestures). As another example, voice control may perform browser window functions, such as "Go to Bookmarks," while gestures may be used to perform basic navigation of a web page, such as clicking and scrolling.

The virtual object can also be launched and moved using multimodal input without the need for a user input device. The wearable system may use multimodal inputs such as gestures, speech, and gaze to place content naturally in the vicinity of the user and in the environment. For example, when the user interacts with the HMD, the user may open an un-launched application using voice. The user may issue a voice command by saying "Hey Magic Leap, launch the Browser." In this command, the initiation condition includes the presence of the enabling phrase "Hey Magic Leap." The command may be interpreted to include "launch" or "open" (which may be interchangeable commands). The subject of this command is the application name, e.g., "Browser." However, the command does not require a parameter. In some embodiments, the wearable system may automatically apply default parameters, such as placing the browser in the user's environment (or the user's FOV).

Multimodal input can also be used to perform basic browser controls, such as opening bookmarks, opening new tabs, navigating to history, and the like. The ability to refer to web content in a hands-free or fully manual situation may allow the user to obtain more information and improve work efficiency. For example, user Ada is a radiologist looking at a film in her office. Ada can browse web pages by voice and gestures to bring up reference material when viewing a film, reducing the need for her to move a mouse back and forth across the screen to switch between the film and the reference material. As another example, the user Chris is cooking a new recipe through a virtual browser window. The virtual browser window may be placed on his cabinet. Chris may use voice commands to call out bookmarked recipes when starting to shred food.

FIG. 26 shows an example user interface for an application with various bookmarking. The user may select an application on user interface 2600 by speaking the name of the application. For example, the user may say "open food" to start a food application. As another example, the user may say "open this". The wearable system may determine a gaze direction of the user and identify applications on the user interface 2600 that intersect the gaze direction of the user. The wearable system may open the identified application accordingly.

The user may also use speech to issue search commands. The search command may be executed by the application at which the user is currently aiming. If the object does not currently support a search command, the wearable system may perform a search within the wearable system's data store or search for information through a default application (e.g., through a browser). Fig. 27 shows an example user interface 2700 when a search command is issued. The user interface 2700 shows an email application and a media viewing application. The wearable system may determine (based on the user's head pose) that the user is currently interacting with the email application. Thus, the wearable system may automatically translate the user's voice command into a search command in an email application.

Media control can also be implemented using multimodal input. For example, the wearable system may issue commands, such as play, pause, mute, fast forward, and fast rewind, using voice and gesture controls to control a media player in an application (e.g., a screen). The user may use voice and gesture controls with the media application and put the totem aside.

The multimodal input can further be used in a social networking context. For example, a user may start a conversation and share experience (e.g., virtual images, documents, etc.) without a user input device. As another example, a user may participate in a telepresence session and set a private context so that the user may comfortably use a voice navigation user interface.

Accordingly, in various embodiments, the system can utilize multimodal inputs, such as: head pose plus voice (e.g., for information sharing and general application searching), head pose plus gesture (e.g., for navigation in an application), or head pose plus voice plus gesture (e.g., for "put that there" functionality, media player controls, social interactions, or browser applications).

Other examples of gesture control as part of multimodal input

Gesture interactions may be of two non-limiting and non-exclusive categories: event gestures and dynamic hand tracking. The event gesture may be a response to an event while the user interacts with the HMD, such as a throwing sign made by a catcher to a pitcher in a baseball game or a thumbs-up sign in a browser window, causing the wearable system to open a sharing session. The wearable system may follow one or more gesture patterns performed by the user and respond to events accordingly. Dynamic hand tracking may involve tracking a user's hand with low latency. For example, the user may move one hand over the FOV of the user, while the virtual character may follow the movement of the user's finger.

The quality of the gesture tracking may depend on the type of user interaction. Quality may involve a number of factors such as robustness, responsiveness, and ergonomics. In some embodiments, event gestures have near perfect robustness. In social experiences, margin-edge interactions, and third-party applications, the threshold for minimum acceptable gesture performance may be lower, because the aesthetics of these experiences may tolerate errors, interference, some latency, and so on, but gesture recognition should still perform well enough in these experiences to maintain responsiveness.

To increase the likelihood of the wearable system responding to user gestures, the system may reduce or minimize the delay of gesture detection (for both event gestures and dynamic hand tracking). For example, the wearable system may reduce or minimize latency by detecting when the user's hand is within the field of view of the depth sensor, automatically switching the depth sensor to an appropriate gesture mode, and then providing feedback to the user as to when he or she performed the gesture.

As described herein, gestures may be used in conjunction with other input modes to launch, select, and move applications. Gestures may also be used to interact with virtual objects within an application, for example by tapping, scrolling in the air or on a surface (e.g., on a table or wall).

In some embodiments, the wearable system may implement a social networking tool that supports gesture interactions. The user may perform semantic event gestures to enrich the communication. For example, the user may wave his hand in front of the FOV camera, so the waving animation is sent to the person with whom the user is chatting. The wearable system may also provide dynamic hand tracking for virtualization of the user's hand. For example, a user may lift his or her hand in front of his or her FOV and obtain visual feedback on the hand animation that is tracking his or her hand to make his or her avatar.

Gestures may also be used as part of a multimodal input to enable media player control. For example, a user may use gestures to play or pause a video stream. The user may perform gesture manipulations remotely from the device playing the video (e.g., a television). Upon detecting the user's gesture, the wearable system may remotely control the device based on the user's gesture. The user may also view the media panel, and the wearable system may update parameters of the media panel using the user's gaze direction in conjunction with the user's gestures. For example, a pinch (ok) gesture may imply a "play" command, and a fist gesture may imply a "pause" command. The user may also close the menu by waving one arm in front of the FOV camera. An example of gesture 2080 is shown in FIG. 20.

Other examples of interacting with virtual objects

As described herein, the wearable system can support various multimodal interactions with objects (physical or virtual objects) in the user's environment. For example, the wearable system may support direct input for interacting with a found object, such as aiming at, selecting, or controlling (e.g., moving or changing a characteristic of) the found object. The interaction with the found object may also include interaction with the found object's geometry or interaction with a surface to which the found object is attached.

Direct input for interaction with a planar surface is also supported, such as aiming at and selecting a wall or a desktop. The user may also initiate various user interface events, such as a touch event, a tap event, a swipe event, or a scroll event. The user may manipulate 2D user interface elements (e.g., panels) using direct interaction, for example, scrolling or swiping a panel and selecting elements within the panel (e.g., virtual objects or user interface elements such as buttons).

The direct input may further be used to manipulate objects at different depths. The wearable system may set various threshold distances (threshold distances from the user) to determine the area of the virtual object. Referring to fig. 21, objects within the dashed circle 2122 may be considered objects in the near field, objects within the dashed circle 2124 (but outside the dashed circle 2122) may be considered objects in the mid-field, and objects outside the dashed circle 2124 may be considered objects in the far field. The threshold distance between the near field and the far field may be, for example, 1m, 2m, 3m, 4m, 5m, or more, and may depend on the environment (e.g., larger in an outdoor park than an indoor office cubicle).
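
A small sketch of the depth-band classification follows; the threshold distances are examples consistent with the ranges mentioned above and would in practice depend on the environment:

def depth_field(distance_m, near_max=2.0, mid_max=5.0):
    # Classify an object by its distance from the user (cf. FIG. 21).
    if distance_m <= near_max:
        return "near"   # inside dashed circle 2122: full 2D/3D manipulation
    if distance_m <= mid_max:
        return "mid"    # between circles 2122 and 2124: translate, radial motion
    return "far"        # outside circle 2124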

The wearable system may support various 2D or 3D manipulations of virtual objects in the near field. Exemplary 2D manipulations may include moving or resizing. Exemplary 3D manipulations may include placing a virtual object in 3D space, for example, by pinching, dragging, moving, or rotating the virtual object. The wearable system may also support interactions with virtual objects in the midfield, such as translating and repositioning objects in the user environment, performing radial motion of objects, or moving objects into the near field or far field.

The wearable system may also support continuous fingertip interaction. For example, the wearable system may allow the user's finger to point like an attractor, or pinpoint an object and perform a push interaction on the object. The wearable system may further support fast gesture interactions, such as hand surface interactions or hand contour interactions.

Other examples of Voice commands in social networking and shared scenarios

The wearable system may support voice commands as input for a social networking (or messaging) application. For example, the wearable system may support voice commands for sharing information with a contact or talking to a contact.

As an example of initiating a call with a contact, a user may use a voice command, such as "Hey Magic Leap, call Karen." In this command, "Hey Magic Leap" is the calling phrase, the command is "call," and a parameter of the command is the name of the contact. The wearable system may automatically initiate the call using a messenger application (as the subject). The command "call" may be associated with a task, such as "start a call."

If the user says "Start a call" and then speaks a name, the wearable system may attempt to identify the name. If the wearable system is unable to recognize the name, the wearable system may communicate a message to the user for the user to confirm the name or contact information. If the wearable system recognizes the name, the wearable system may present a dialog prompt so that the user can confirm/decline (or cancel) the call, or provide an alternate contact.

The user may also start a call with several contacts using a buddy list. For example, the user may say "Hey Magic Leap, start a group chat with Karen, Cole, and Kojo." The group chat command may be extracted from the phrase "start a group chat," or may be extracted from the buddy list provided by the user. While on the call, the user may add another user to the session. For example, the user may say "Hey Magic Leap, invite Karen," where the phrase "invite" may be associated with the invite command.

The wearable system may share a virtual object with a contact using voice commands. For example, a user may say "Hey Magic Leap, share Screens with Karen" or "Hey Magic Leap, share that with David and Tony." In these examples, the word "share" is the share command. The words "Screens" or "that" may refer to a topic that the wearable system may determine based on the multimodal input. Names such as "Karen," "David," and "Tony" are parameters of the command. In some embodiments, when the voice command provided by the user includes the word "share" with an application reference and a contact, the wearable system may provide a confirmation dialog to ask the user to confirm whether the user wants to share the application itself or share a topic via the referenced application. When the user issues a voice command that includes the word "share," an application reference, and a contact, the wearable system may determine whether the application name is recognized by the wearable system, or whether the application is present on the user's system. If the system fails to recognize the name or the application is not present on the user's system, the wearable system may provide a message to the user. The message may suggest that the user try the voice command again.

If the user provides a deictic or anaphoric reference (e.g., "this" or "that") in the voice command, the wearable system can use the multimodal input (e.g., the user's head pose) to determine whether the user is interacting with an object that can be shared. If the object cannot be shared, the wearable system may prompt the user with an error message or move to a second input mode (e.g., a gesture) to determine the object that should be shared.

The wearable system may also determine whether a contact with which the object is shared can be identified (e.g., as part of the user's contact list). If the wearable system recognizes the contact's name, the wearable system may provide a confirmation dialog to confirm that the user wishes to continue sharing. If the user confirms, the virtual object may be shared. In some embodiments, the wearable system may share multiple virtual objects associated with the application. For example, the wearable system may share an entire photo album or share recently viewed photos in response to a user's voice command. If the user refuses to share, the sharing command is cancelled. If the user indicates a contact error, the wearable system may prompt the user to speak the contact name again, or select a contact from a list of available contacts.
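
The share-command validation flow described above could be organized as in the sketch below; the helper callbacks and object attributes are hypothetical and stand in for the system's actual dialog and contact-store interfaces:

def handle_share(target_object, contact_name, contacts, confirm, prompt):
    # confirm(message) -> bool and prompt(message) are assumed UI callbacks.
    if target_object is None or not target_object.shareable:
        prompt("That item cannot be shared. Try aiming at a shareable object.")
        return
    if contact_name not in contacts:
        prompt(f"Could not find '{contact_name}'. Say the name again or pick "
               "a contact from the list of available contacts.")
        return
    if confirm(f"Share {target_object.name} with {contact_name}?"):
        target_object.share_with(contacts[contact_name])
    # If the user declines, the share command is simply cancelled.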

In some implementations, if the user says "share" and says the application reference but does not specify a contact, the wearable system may locally share the application with people in the user's environment that have access to the user's files. The wearable system may also reply and ask the user to enter a name using one or more of the input modes described herein. Similar to the social networking example, a user may issue a voice command to share a virtual object with a contact or a group of contacts.

A challenge in making a call via voice is that the voice user interface may incorrectly identify, or fail to identify, a contact name. This is particularly problematic for less common or non-English names (e.g., Ily, Ileana, etc.). For example, when the user speaks a voice command containing a contact name (e.g., "Share Screens with Ily"), the wearable system may not be able to recognize the name "Ily" or its pronunciation. The wearable system may respond with a prompt such as "Who?" to ask the user to clarify. The user may use speech to try again to specify "Ily," spell out the name "I-L-Y" using speech or a user input device, or quickly select the name from a panel of available names using a user input device. The name "Ily" may be a nickname for Ileana, who has an entry in the user's contacts. Once the user indicates to the system that "Ily" is a nickname, the system may be configured to "remember" the nickname by automatically associating the nickname (or a pronunciation or audio pattern associated with the nickname) with the name of the friend.

Other examples of selecting and moving virtual objects using voice commands

A user can naturally and quickly manage the placement of virtual objects in the user's environment using multimodal input (e.g., a combination of eye gaze, gestures, and speech). For example, a user named Lindsay sits at a table and is ready to do some work. She opens her laptop and then launches the desktop monitor application on her computer. While the computer loads, she extends her hand over the laptop screen and says: "Hey Magic Leap, put the monitor there." In response to the voice command, the wearable system may automatically launch the monitor screen and place it over the laptop. However, when Lindsay says "Put the screen there" while looking at the wall on the other side of the room, the wearable system may automatically put the screen on the wall opposite her. Lindsay may also say "Put the delphinium there" while looking at her desk. The delphinium was originally on the kitchen table, but in response to the voice command, the wearable system may automatically move it onto her desktop. While working, she can use the totem to interact with these objects and adjust their scale according to her own preferences.

The user may use speech to open an un-launched application at any point in the user's environment. For example, a user may say "Hey Magic Leap, launch the Browser." In this command, "Hey Magic Leap" is the calling phrase, the word "launch" is the launch command, and the word "Browser" is the application name, which is the subject. The launch command may be associated with the words "start," "open," or "play." For example, when the user says "open the browser," the wearable system may still recognize the launch command. In some embodiments, the application may be an immersive application that can provide a 3D virtual environment to the user as if the user were part of the 3D virtual environment. Thus, when the immersive application is launched, the user may be positioned as if he were in the 3D virtual environment. In some implementations, immersive applications also include store applications. When the store application is launched, the wearable system may provide the user with a 3D shopping experience, so that the user may feel as if he were shopping in a real store. In contrast to immersive applications, other applications may be landscape applications. When a landscape application is launched, it may be placed at the location where it would be placed if launched from the launcher with the totem. Thus, the user may interact with the landscape application, but the user may not feel that he is part of the landscape application.

The user may also use a voice command to launch a virtual application at a specified location in the user's FOV, or the user may move an already-placed virtual application (e.g., a landscape application) to a particular location in the user's FOV. For example, a user may say "Hey Magic Leap, put the browser here," "Hey Magic Leap, put the browser there," "Hey Magic Leap, put this here," or "Hey Magic Leap, put that there." These voice commands include a calling phrase, a "put" command, an application name (which is the subject), and a location cue (which is a parameter). The subject may be identified based on audio data (e.g., based on the name of the application spoken by the user). When the user instead speaks the word "this" or "that," the subject may also be identified based on head pose or eye gaze. To facilitate such voice interaction, the wearable system may make, for example, two inferences: (1) which application to launch; and (2) where to place the application.

The wearable system may use the "put" command and the application name to infer which application to launch. For example, if the user speaks an application name that the wearable system cannot recognize, the wearable system may provide an error message. If the user speaks the application name that the wearable system recognizes, the wearable system may determine whether the application has been placed in the user's environment. If the application is already displayed in the user environment (e.g., in the user FOV), the wearable system may determine how many application instances (instances) are in the user environment (e.g., how many browser windows are open). If there is only one target application instance, the wearable system may move the application to a user-specified location. If there is more than one instance of the application spoken in the environment, the wearable system may move all instances of the application to a specified location, or move the most recently used instance to a specified location. If the virtual application has not been placed in the user's environment, the system may determine whether the application is a landscape application, an immersive application, or a store application (in which the user may download or purchase other applications). If the application is a landscape application, the wearable system may launch the virtual application at a specified location. If the application is an immersive application, the wearable system can place a shortcut of the application at a specified location because the immersive application does not support functions that are launched at the specified location in the FOV of the user. If the application is a store application, the system can place the mini-store in a specified location, since the store application needs to have the user fully 3D immersed in the virtual world, and therefore does not support a specific location launch in the user's environment. The mini-store may include a brief summary or icon of the virtual objects in the store.

The wearable system may use various inputs to determine where to place the application. The wearable system may parse the grammar (e.g., "here" or "there") in the user's command, determine intersections of virtual objects in the user's environment with the head-pose-based ray projection (or cone projection), determine the user's hand position, determine a planar surface mesh or an environment planar mesh (e.g., a mesh associated with a wall or table), and so on. As an example, if the user says "here," the wearable system may determine the user's gesture, e.g., whether there is a flat open hand in the user's FOV. The wearable system may place the object at the location of the user's flat open hand, at a rendering plane near the user's hand reach. If there is no flat open hand in the FOV, the wearable system may determine whether the head pose (e.g., based on the direction of a cone projection from the head pose) intersects a surface plane mesh within reach of the user's arm. If a surface plane mesh is present, the wearable system may place the virtual object at the intersection of the head pose direction and the surface plane mesh, at a rendering plane located within reach of the user's arm. The user may place the object flat on the surface. Without a surface plane mesh, the wearable system may place the virtual object at a rendering plane located somewhere between arm's reach and an optimal reading distance. If the user says "there," the wearable system may perform similar operations as when the user says "here," except that if there is no surface plane mesh within reach of the user's arm, the wearable system may place the virtual object at a rendering plane within the midfield.
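
A sketch of the placement inference for "here" versus "there" is shown below; every helper on the ctx object is an assumption introduced for illustration:

def placement_for(deictic_word, ctx):
    # ctx is assumed to expose: open_hand_in_fov(), head_ray_surface_hit(max_dist),
    # head_ray_point(distance), arm_reach, reading_distance, midfield_distance.
    if deictic_word == "here":
        hand = ctx.open_hand_in_fov()
        if hand is not None:
            return hand.position                   # on the flat open hand
        hit = ctx.head_ray_surface_hit(max_dist=ctx.arm_reach)
        if hit is not None:
            return hit.point                       # flat on the nearby surface
        return ctx.head_ray_point(distance=(ctx.arm_reach + ctx.reading_distance) / 2)
    if deictic_word == "there":
        hit = ctx.head_ray_surface_hit(max_dist=ctx.arm_reach)
        if hit is not None:
            return hit.point
        return ctx.head_ray_point(distance=ctx.midfield_distance)
    raise ValueError(f"unsupported placement word: {deictic_word!r}")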

Once the user says "Put the application at … …" (Put the application.) ", the wearable system may immediately provide predictive feedback to the user to show where to place the virtual object based on the available input when the user says" here "or" there ". The feedback may take the form of a focus indicator. For example, the feedback may include a small floating text bubble saying "here" at a hand, grid, or planar surface intersecting the user's head pose direction at a rendering plane within reach of the user's arm. If the user's command is "here" the planar surface may be located in the near field, and if the user's command is "there" the planar surface may be located in the mid-field or the far field. The feedback can be visualized as shadows or contours of visual objects.

The user may also cancel the interaction. In various cases, the interaction may be canceled in two ways: (1) the command fails to complete before an n-second timeout, or (2) a cancel command is input, such as "no," "never mind," or "cancel."

Example of interacting with text using user input combinations

Free-form text input, particularly the input of long character string sequences, is problematic in mixed reality environments that use traditional interaction means. For example, systems that rely entirely on Automatic Speech Recognition (ASR) can be difficult to use for text editing (e.g., to correct ASR errors that are characteristic of the speech recognition technology itself, such as erroneous transcription of the user's speech), especially in "hands-free" environments that lack input or interface devices (e.g., keyboards, handheld controllers (e.g., totems), or mice). As another example, a virtual keyboard in a "hands-free" environment may require fine user control and may cause fatigue when used as the primary form of user input.

The wearable system 200 described herein can be programmed to allow a user to interact with virtual text quickly and conveniently using multimodal input, such as a combination of two or more of the following: speech, eye gaze, gestures, head gestures, totem input, etc. The term "text" as used herein may include letters, characters, words, phrases, sentences, paragraphs, or other types of free-form text. The text may also include graphics or animations, such as emoji, ideograms, emoticons, smileys, logos, etc. Interaction with the virtual text may include writing, selecting (e.g., selecting some or all of the text), or editing the text (e.g., changing, copying, cutting, pasting, deleting, clearing, undoing, redoing, inserting, replacing, etc.), alone or in combination. By utilizing a combination of user inputs, the system described herein provides significant improvements in speed and convenience compared to single-input systems.

The multimodal text interaction techniques described herein may be applied to any dictation scenario or application (e.g., where the system simply transcribes the user's speech without applying any semantic evaluation, even if the transcription is part of another task that does rely on semantic evaluation). Some example applications may include messaging applications, word processing applications, gaming applications, system configuration applications, and the like. Example use cases include a user composing a text message to be sent to a contact who may or may not be in the user's contact list; a user writing notes, articles, or other textual content; a user publishing and sharing content on a social media platform; and a user completing or otherwise filling out a form using the wearable system 200.

The system utilizing the combination of user inputs need not be a wearable system. If desired, such a system may be any suitable computing system, such as a desktop computer, a laptop computer, a tablet computer, a smartphone, or another computing device having multiple user input channels (e.g., a keyboard, a trackpad, a microphone, an eye or gaze tracking system, a gesture recognition system, etc.).

Examples of composing text using multi-modal user input

FIGS. 28A-28F illustrate an example user experience of composing and editing text based on a combination of inputs such as voice commands and eye gaze. As described herein, the wearable system may determine the user's gaze direction based on images acquired by the inward-facing imaging system 462 shown in fig. 4. The inward-facing imaging system 462 may determine the orientation of one or both pupils of the user and may extrapolate the line of sight of one or both of the user's eyes. By determining the line of sight of both of the user's eyes, the wearable system 200 can determine the three-dimensional location in space at which the user is looking.

The wearable system may also determine a voice command based on data acquired from an audio sensor 232 (e.g., a microphone) shown in fig. 2A. The system may have an Automatic Speech Recognition (ASR) engine that converts speech input 2800 into text. The speech recognition engine may use natural language understanding in converting the speech input 2800 to text, including separating and extracting message text from longer utterances.

As shown in fig. 28A, the audio sensor 232 may receive a phrase 2800 spoken by the user. As illustrated in fig. 28A, the phrase 2800 may include a command, such as "Send a message to John Smith saying," and parameters of the command, such as composing and sending a message and the intended recipient of the message, John Smith. Phrase 2800 may also include the content of the message to be composed. In this example, the content of the message may include "I'm flying in from Boston and will be there around seven o'clock; period; let's meet at the corner near the office." Such content may be obtained by parsing the audio data using an ASR engine that may implement natural language understanding, separating and extracting message content and punctuation (e.g., "period") from the user's utterance. In some examples, punctuation may be processed for presentation within the context of the transcribed string (e.g., "two o'clock" may be presented as "2:00," or "question mark" may be presented as "?"). The wearable system may also tokenize the text string (e.g., by isolating discrete words in the text string) and display the results in the mixed reality environment (e.g., by displaying the discrete words).

However, in some cases, automatic speech recognition may be prone to errors. As shown in FIG. 28B, a system using an ASR engine may produce results that do not completely match the user's speech input for a variety of reasons, including poor or unusual pronunciation, environmental noise, homonyms and other similarly pronounced words, hesitations or disfluencies, and words that are not in the ASR dictionary (e.g., uncommon phrases, technical terms, jargon, slang, etc.). In the example of FIG. 28B, the system correctly interprets the command aspect of phrase 2800 and generates a message with a header 2802 and body 2804. However, in the body 2804 of the message, the system erroneously interpreted the user's word "corner" as "quarter" (the two words are pronounced somewhat similarly). In systems that rely entirely on speech input, it is difficult for the user to quickly replace a misrecognized word (or phrase) with the intended word (or phrase). However, the wearable system 200 described herein may advantageously allow the user to quickly correct such errors, as shown in figs. 28C-28F.

An ASR engine in the wearable system may generate a text result (including at least one word) associated with the user's utterance, and may also generate an ASR score associated with each word (or phrase) in the text result. A high ASR score may indicate that the ASR engine has a high confidence or high likelihood of having correctly transcribed the user's utterance into text, while a low ASR score may indicate a low confidence or low likelihood of a correct transcription. In some embodiments, the system may display words with low ASR scores (e.g., ASR scores below an ASR threshold) in an emphasized manner (e.g., with a highlighted background, italics or bold, a different-colored font, etc.), which may make it easier for the user to recognize or select misrecognized words. A low ASR score for a word may indicate that the user has a greater likelihood of selecting that word for editing or replacement, because the ASR engine is more likely to have recognized it incorrectly.
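
As a simple illustration of the emphasis behavior described above, the snippet below flags words whose ASR score falls below a threshold; the (word, score) pairs and the 0.6 threshold are assumed values for the example.

```python
def words_to_emphasize(transcript, asr_threshold=0.6):
    """Return indices of words that should be rendered with a focus indicator (low ASR score)."""
    return [i for i, (word, score) in enumerate(transcript) if score < asr_threshold]

transcript = [("meet", 0.93), ("at", 0.97), ("the", 0.99), ("quarter", 0.41), ("near", 0.95)]
print(words_to_emphasize(transcript))   # -> [3]; "quarter" would be emphasized for easy selection
```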

As shown in fig. 28C and 28D, the wearable system may enable the user to select a misrecognized word (or phrase) using an eye tracking system (e.g., inward facing imaging system 462 of fig. 4). In this example, the selected word may be an example of the target virtual object described above in connection with the earlier figures.

The wearable system 200 may determine a gaze direction based on the inward-facing imaging system 462 and may project a cone 2806 or ray in the gaze direction. The wearable system may select one or more words that intersect the user's gaze direction. In some implementations, the word may be selected when the user's gaze stays on the incorrect word for at least a threshold time. As described above, the incorrect word may be identified at least in part by its association with a low ASR score. The threshold time may be any amount of time sufficient to indicate that the user wants to select a particular word, but not so long as to unnecessarily delay the selection. The threshold time may also be used to determine a confidence score indicating that the user wants to select a particular virtual word. For example, the wearable system may calculate a confidence score based on the time the user gazes at a direction or object, where the confidence score may increase as the duration of gazing at that particular direction or object increases. Confidence scores may also be calculated based on the multimodal inputs described herein. For example, if both the user's gesture and eye gaze indicate that a word should be selected, the wearable system may determine the selection with a higher confidence score than the confidence score derived from eye gaze alone.

As another example, the wearable system may calculate the confidence score based in part on the ASR score, which, as discussed in more detail herein, may indicate the relative confidence with which the ASR engine converted a particular word. For example, a low ASR score may indicate that the ASR engine has low confidence that it correctly transcribed the spoken word; the user therefore has a greater chance of selecting that word for editing or replacement. If the user's gaze stays on a word with a low ASR score for more than the threshold time, the system may assign a higher confidence score to reflect that the user has selected the word, for at least two reasons: first, the eye has fixated on the word for a long time; second, the word is likely to have been misrecognized by the ASR engine. Both of these factors tend to indicate that the user will want to edit or replace the word.

If the confidence score exceeds a threshold criterion, the word may be selected. As an example, the threshold time may be one-half second, one second, one and a half seconds, two seconds, two and a half seconds, between one and two seconds, between one and three seconds, and so forth. Thus, the user can easily and quickly select the wrong word simply by gazing at the wrong word "quarter" for a long enough time. A word may be selected based on a combination of the eye gaze (or gesture) time exceeding its time threshold and the ASR score falling below the ASR threshold, with both criteria providing an indication that the user intends to select that particular word.
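
The selection heuristic described above can be sketched as a weighted combination of gaze dwell time and (low) ASR score. The weights and thresholds below are illustrative assumptions, not values used by the wearable system.

```python
def selection_confidence(gaze_dwell_s: float, asr_score: float,
                         dwell_threshold_s: float = 1.5, asr_threshold: float = 0.6) -> float:
    """Combine dwell time and ASR score into a single selection confidence in [0, 1]."""
    dwell_term = min(gaze_dwell_s / dwell_threshold_s, 1.0)   # grows with dwell time
    asr_term = 1.0 - min(asr_score / asr_threshold, 1.0)      # grows as the ASR score drops
    return 0.7 * dwell_term + 0.3 * asr_term

def should_select(gaze_dwell_s: float, asr_score: float, confidence_threshold: float = 0.6) -> bool:
    return selection_confidence(gaze_dwell_s, asr_score) >= confidence_threshold

# A brief dwell on a poorly recognized word can suffice, while a well-recognized word needs a longer dwell.
print(should_select(gaze_dwell_s=1.0, asr_score=0.2))   # True
print(should_select(gaze_dwell_s=1.0, asr_score=0.9))   # False
```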

As an example, suppose the results of the ASR engine include a first word with a high ASR score (e.g., a word that the ASR engine is relatively confident it recognized correctly) and a second word with a low ASR score (e.g., a word that the ASR engine has likely recognized incorrectly), and the two words are displayed adjacent to each other by the wearable system. The wearable system may assume that a gaze input containing both the first and second words is in fact an attempt by the user to select the second word, based on its relatively low ASR score, because the user is more likely to want to edit the misrecognized second word than the correctly recognized first word. In this way, words with low ASR scores produced by the ASR engine (which are more likely to be inaccurate and require editing) are significantly easier for the user to select for editing, thereby facilitating editing by the user.

Although this example describes selecting a misrecognized word using eye gaze, words may also be selected using other multimodal inputs. For example, the cone projection may identify words such as "around," "7:00," "the," and "quarter" because they also intersect a portion of the virtual cone 2806. As will be further described with reference to figs. 29-31, the wearable system may combine the user's gaze input with another input (e.g., a gesture, a voice command, or an input from the user input device 466) to select the word "quarter" as the word for further editing.

After selecting the word 2808, the system may enable editing of the selected word. The wearable system may allow the user to edit words using a variety of techniques, e.g., change, cut, copy, paste, delete, clear, undo, redo, insert, replace, etc. As shown in fig. 28D, the wearable system may allow the user to change word 2808 to another word. The wearable system may support a variety of user inputs for editing the word 2808, such as receiving additional voice input through a microphone to replace or delete a selected word, displaying a virtual keyboard to enable a user to type in an alternate word, or receiving user input via a user input device, etc. In some implementations, the input can be associated with a particular type of text edit. For example, a wave gesture may be associated with deleting selected text, while a finger gesture directed to a location in the text may cause the wearable system to insert additional text at the location. The wearable system may also support user input combinations for editing words. As will be further described with reference to fig. 32-35, the system may support combining eye gaze with another input mode to edit a word.

In the example of figs. 28D and 28E, the system may automatically present the user with an array of suggested substitute words, such as substitute words 2810a and 2810b, upon selection of the word 2808. The suggested substitute words may be generated by an ASR engine or another language processing engine in the system, and may be based on the raw speech input, natural language understanding, context, learning from user behavior, or other suitable sources. In at least some embodiments, the suggested substitute words can be alternative hypotheses generated by the ASR engine, hypotheses generated by a predictive text engine (which may attempt to "fill in" using the context of adjacent words and the user's historical text style), homonyms of the originally converted word, entries from a synonym library, or other suitable techniques. In the example shown, suggested alternatives for "quarter" include "corner" and "court," which are provided by the language engine as words that are pronounced similarly to "quarter."

Fig. 28E shows how the system may enable the user to select a desired alternative word, such as "corner," by eye gaze. The wearable system may select the substitute word using a technique similar to that described with reference to fig. 28C. For example, the system may track the user's eyes using the inward-facing imaging system 462 to determine that the user's gaze 2812 has been focused on a particular substitute word (such as the substitute word 2810a, "corner") for at least a threshold time. After determining that the user's gaze 2812 has rested on the substitute word for the threshold time, the system may modify the text (message) by replacing the originally selected word with the selected substitute word 2814, as shown in fig. 28F. In some implementations where the wearable system uses cone projection to select words, the wearable system may dynamically adjust the cone size based on the density of the text. For example, the wearable system may present a cone with a larger aperture (and thus a larger cross-section away from the user) to select an alternative word for editing, as shown in fig. 28E, because only a few options are available. The wearable system may present a cone with a smaller aperture to select word 2808 in fig. 28C, because word 2808 is surrounded by other words and a smaller cone may reduce the rate of accidentally selecting another word.
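
The cone-aperture adjustment described above might look like the following, where the aperture shrinks as the density of selectable text increases; the mapping and its constants are assumptions for illustration.

```python
def cone_aperture_deg(targets_per_steradian: float,
                      min_deg: float = 1.0, max_deg: float = 8.0) -> float:
    """Wider cone when few targets are nearby; narrower cone in dense text to avoid mis-selection."""
    density = max(targets_per_steradian, 0.0)
    aperture = max_deg / (1.0 + density)      # shrinks roughly inversely with target density
    return max(min_deg, min(max_deg, aperture))

print(round(cone_aperture_deg(0.2), 2))    # sparse list of alternatives -> wide cone (6.67)
print(round(cone_aperture_deg(25.0), 2))   # dense body text -> narrow cone (1.0)
```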

Throughout operation, the wearable system may provide feedback (e.g., visual, audible, or tactile feedback) to the user. For example, the wearable system may present a focus indicator to help the user identify the target virtual object. As shown in fig. 28E, the wearable system may provide a contrasting background 2830 around the word "quarter" to indicate that the word "quarter" is selected and that the user is currently editing it. As another example, as shown in fig. 28F, the wearable system may change the font of the word "corner" 2814 (e.g., to bold) to indicate that the wearable system has confirmed replacing the word "quarter" with the replacement word "corner." In other embodiments, the focus indicator may include a crosshair, a circle or oval surrounding the selected text, or another graphical technique to highlight or emphasize the selected text.

Example of selecting words using multimodal user input

The wearable system may be configured to support and utilize a variety of user input modes to select words. Fig. 29 to 31 show examples of selecting a word based on a combination of an eye gaze and another input mode. In other examples, however, interaction with text may also be accomplished using an input other than eye gaze in conjunction with another input mode.

FIG. 29 shows an example of selecting a word based on input from a user input device and gaze. As shown in fig. 29, the system may combine the user's gaze 2900 (which may be determined based on data from the inward-facing imaging system 462) with user input received via the user input device 466. In this example, the wearable system may perform a cone projection based on the user's gaze direction. The wearable system may confirm the selection of the word "quarter" based on input from the user input device. For example, the wearable system may recognize that the word "quarter" is the word closest to the user's gaze direction, and may confirm the selection of the word "quarter" upon user actuation of the user input device 466. As another example, the cone projection may capture a plurality of words, such as "around," "7:00," "the," and "quarter." The user may select one of these words for further editing via the user input device 466. By receiving an input that is independent of the user's gaze, the system can confidently identify a particular word as the one the user wants to edit without waiting a long time. After selecting the word to edit in this manner, the system may present substitute words (as discussed in connection with fig. 28E) or otherwise allow the user to edit the selected word. The same process of combining the user's gaze with user input received via the totem may be applied to select the desired replacement word (e.g., selecting the word "corner" among the replacement words to replace the word "quarter"). Some embodiments may utilize a confidence score to determine which text the user is selecting. The confidence score may aggregate multiple input modes to better determine the selected text. For example, the confidence score may be based on the time the user gazed at the text, whether the user actuated the user input device 466 while gazing at the text, whether the user pointed at the selected text, and so on. If the confidence score exceeds a threshold, the wearable system may determine with increased confidence that it has correctly selected the text intended by the user. For example, to select text by eye gaze alone, the system may be configured to select text when the gaze time exceeds 1.5 seconds. However, if the user looks at the text for only 0.5 seconds but actuates the user input device at the same time, the system can determine the selected text more quickly and confidently, which can improve the user experience.
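
The multi-mode confidence aggregation described in this example can be sketched as follows. The mode weights are assumptions; the 1.5-second gaze-only threshold and the 0.5-second gaze-plus-totem case come from the scenario above.

```python
def aggregate_confidence(gaze_dwell_s: float,
                         totem_actuated: bool,
                         pointing_at_text: bool,
                         gaze_only_threshold_s: float = 1.5) -> float:
    confidence = min(gaze_dwell_s / gaze_only_threshold_s, 1.0)   # gaze alone saturates at 1.0
    if totem_actuated:
        confidence += 0.7                                         # explicit actuation is strong evidence
    if pointing_at_text:
        confidence += 0.3                                         # a pointing gesture adds further support
    return confidence

SELECTION_THRESHOLD = 1.0
# 0.5 s of gaze alone is not enough, but the same gaze plus a totem actuation crosses the threshold.
print(aggregate_confidence(0.5, totem_actuated=False, pointing_at_text=False) >= SELECTION_THRESHOLD)  # False
print(aggregate_confidence(0.5, totem_actuated=True,  pointing_at_text=False) >= SELECTION_THRESHOLD)  # True
```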

FIG. 30 illustrates an example of selecting a word to edit based on a combination of speech and gaze input. The wearable system may determine the target virtual object based on the user's gaze. As shown in fig. 30, the system may determine that the user's gaze 3000 is directed at a particular word (in this case, "quarter"). The wearable system may also determine an operation to perform on the target virtual object based on the user's voice command. For example, the wearable system may receive the user's voice input 3010 via the audio sensor 232, may recognize the voice input 3010 as a command, and may combine the two user inputs into a command to apply the command operation ("edit") to the target virtual object (e.g., the word "quarter" on which the user's gaze is focused). As previously described, the system may present substitute words after the user selects the word to edit. The same process of combining the user's gaze with voice input may be applied to select a desired alternative word among a plurality of alternative words to replace the word "quarter." As described herein, terms such as "edit" represent context-specific wake words that are used to invoke a constrained library of system commands associated with editing for each of one or more different user input modes. That is, such terms, when received by the system as speech input, may enable the system to evaluate subsequently received user input against a limited set of criteria, so as to identify editing-related commands provided by the user with increased accuracy. For example, in the context of speech input, the system may consult a limited vocabulary of command-specific terms to perform speech recognition on subsequently received speech input. As another example, in the context of gaze or gesture input, the system may consult a limited library of command-specific template images to perform image recognition on subsequently received gaze or gesture inputs. Terms such as "edit" are sometimes referred to as "hotwords" or "carrier phrases," and the system may include a number of preprogrammed (and optionally user-settable) hotwords, for example (in an editing context): edit, cut, copy, paste, bold, italicize, delete, move, etc.
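
A minimal sketch of hotword ("carrier phrase") handling follows: once an editing hotword is heard, later speech is matched against a restricted command vocabulary. The vocabulary and return values are illustrative assumptions, not the system's actual command set.

```python
EDITING_HOTWORDS = {"edit", "cut", "copy", "paste", "bold", "italicize", "delete", "move"}

def parse_speech(utterance: str, editing_context_active: bool):
    tokens = utterance.lower().split()
    if not editing_context_active:
        # Outside the editing context, only check whether a hotword opens the context.
        if tokens and tokens[0] in EDITING_HOTWORDS:
            return ("enter_editing_context", tokens[0])
        return ("dictation", utterance)
    # Inside the editing context, interpret speech against the constrained command library.
    if tokens and tokens[0] in EDITING_HOTWORDS:
        return ("command", tokens[0], " ".join(tokens[1:]))
    return ("unrecognized", utterance)

print(parse_speech("edit", editing_context_active=False))    # ('enter_editing_context', 'edit')
print(parse_speech("delete", editing_context_active=True))   # ('command', 'delete', '')
```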

FIG. 31 shows an example of selecting a word to edit based on a combination of gaze and gesture inputs. As shown in the example of fig. 31, the system may use a combination of eye gaze input 3100 and gesture input 3110 to select a word to edit. In particular, the system may determine the eye gaze input 3100 (e.g., based on data acquired by the inward-facing imaging system 462) and may recognize the gesture input 3110 (e.g., based on images acquired by the outward-facing imaging system 464). An object recognizer, such as recognizer 708, may be used to detect a portion of the user's body (e.g., the user's hand) and recognize a gesture associated with selecting a word to edit.

Gestures may be used alone or in combination with eye gaze to select words. For example, although the cone projection may capture multiple words, the wearable system may identify the word "quarter" as the target virtual object because it was identified by both the cone projection and the user's gesture (e.g., based on the confidence score of the eye gaze cone projection exceeding a confidence threshold, in addition to the gesture indicating that the user selected the word "quarter"). As another example, although the cone projection may capture multiple words, the wearable system may still identify the word "quarter" as the target virtual object because it was identified by the cone projection and is also the word from the ASR engine with the lowest ASR score that is located within (or near) the cone projection. In some implementations, a gesture can be associated with a command operation, because the gesture is associated with a command such as "edit" or another hotword described herein. As an example, the system may identify when the user points to the same word they are gazing at and interpret these user inputs as a request to edit that word. If desired, the system may also utilize additional user input, such as a voice command of "edit," while determining that the user intends to edit a particular word.

Example of editing words using multimodal user input

Once the user selects a word to edit, the system can edit the selected word using any desired user input mode. The wearable system may allow the user to alter or replace the selected word by displaying a list of possible replacement words and receiving a user gaze input 2812 to select the replacement word that replaces the original word (see the example shown in fig. 28E). FIGS. 32-34 illustrate additional examples of editing a selected word, where the selected word may be edited using multimodal input.

FIG. 32 shows an example of replacing a word based on a combination of eye gaze and speech input. In fig. 32, the system receives speech input 3210 from the user (via the audio sensor 232 or another suitable sensor). The speech input 3210 may contain the desired replacement word (which may or may not be a replacement word from the list of suggested replacement words 3200). Upon receiving the speech input 3210, the wearable system may parse the input (e.g., strip out a carrier phrase such as "change this to …") to identify the word spoken by the user, and replace the selected word "quarter" with the word "corner" spoken by the user. Although in this example the replacement is a single word, in some implementations the wearable system may be configured to replace the word "quarter" with a phrase, a sentence, or some other element (e.g., an emoji). In examples where multiple words are included in the eye gaze cone projection, the wearable system may automatically select the word within the eye gaze cone that is closest to the replacement word (e.g., "quarter" is closer to "corner" than "the" or "7:00" is).

FIG. 33 shows an example of altering a word based on a combination of speech and gaze input. In this example, the wearable system may receive the voice input 3310 and determine the user's gaze direction 3300. As shown in fig. 33, the speech input 3310 includes the phrase "change it to 'corner'." The wearable system may parse the voice input 3310 and determine that it includes the command operation "change" (which is an example of a carrier phrase), the subject "it," and a parameter of the command (the result word "corner"). This speech input 3310 may be combined with the eye gaze 3300 to determine the subject of the operation. As described with reference to figs. 28A and 28B, the wearable system may identify the word "quarter" as the subject of the operation. Thus, the wearable system may change the subject ("quarter") to the result word "corner."
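
The parse of a voice edit such as "change it to 'corner'" can be sketched as below, with the carrier phrase stripped out and the subject resolved from eye gaze. The regular expression and the gazed-index parameter are assumptions made for the example.

```python
import re

def parse_change_command(utterance: str):
    """Return (operation, subject, result_word), or None if the carrier phrase is absent."""
    match = re.match(r"change\s+(it|this|that)\s+to\s+'?(\w+)'?", utterance.strip(), re.IGNORECASE)
    if match is None:
        return None
    return ("change", match.group(1).lower(), match.group(2))

def apply_edit(text_words, gazed_index, utterance):
    parsed = parse_change_command(utterance)
    if parsed is None:
        return list(text_words)
    _, _, result_word = parsed
    edited = list(text_words)
    edited[gazed_index] = result_word        # eye gaze supplies the subject ("it") of the command
    return edited

words = ["meet", "at", "the", "quarter", "near", "the", "office"]
print(apply_edit(words, gazed_index=3, utterance="change it to 'corner'"))
```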

Fig. 34 shows an example of editing a selected word 3400 using a virtual keyboard 3410. The virtual keyboard 3410 may be controlled by user gaze input, gesture input, input received from a user input device, and the like. For example, the user may type the replacement word by moving eye gaze direction 3420 over the virtual keyboard 3410 (which is displayed to the user through the display of wearable system 200). The user may type each letter of the replacement word by pausing their gaze on the respective key for a threshold period of time, or the wearable system may recognize a change in the user's gaze direction 3420 over a particular key as an indication that the user wants to select that key (thereby eliminating the need for the user to maintain a steady focus on each individual key while typing a word). As described with reference to fig. 28D, in some embodiments the wearable system may change the size of the cone based on the size of the keys. For example, on a virtual keyboard 3410 where each key is relatively small, the wearable system may reduce the size of the cone so that the user can more accurately identify the letters of the replacement word (and the cone projection does not accidentally capture a large number of possible keys). If the keys are relatively large, the wearable system may increase the size of the cone accordingly, so that the user does not have to position the gaze direction precisely (which may reduce fatigue).

In some embodiments, after a word is selected, the wearable system may present a set of possible actions in addition to, or in place of, displaying a list of suggested substitute words to replace the selected word. The user 210 may select an action and edit the selected word using the techniques described herein. FIG. 35 illustrates an example user interface displaying possible actions applied to a selected word. In fig. 35, after selection of the word to edit 3500, the wearable system may present a list 3510 of options for editing, including (in this example) options to: (1) change the word (using any of the editing techniques described herein), (2) cut the word and optionally store it in the clipboard, or copy the word and store it in the clipboard, or (3) paste a word or phrase from the clipboard. Additional or alternative options that may be presented include a delete-selection option, an undo option, a redo option, a select-all option, an insert-here option, and a replace option. The various options may be selected using gaze input, totem input, gesture input, etc., as described herein.

Example of interacting with phrases using multi-modal user input

Although the foregoing examples have described the use of multimodal input to select and edit words, this is for illustration only, and the same or similar processing and input may be used to select and edit phrases or sentences or paragraphs (including multiple words or characters) in general.

FIGS. 36(i) through 36(iii) illustrate examples of interacting with phrases using multimodal input. In fig. 36(i), the wearable system may determine the direction of the user's gaze 3600 and perform a cone projection based on the user's gaze direction. In fig. 36(ii), the system may recognize that the gaze 3600 of the user 210 is focused on the first word 3610 (e.g., "I'm"). The system may make this determination of the first word 3610 using any of the techniques discussed herein, including but not limited to recognizing that the user's gaze 3600 lingers on a particular word for a threshold period of time, or recognizing that the user's gaze 3600 is located on a particular word while the user provides voice, gesture, or totem input. The wearable system may also display a focus indicator (e.g., the contrasting background shown) on the selected word "I'm" 3610 to indicate that the word has been determined from the eye-gaze cone projection. The user may actuate the totem 3620 (which is an example of the user input device 466) while gazing at the first word 3610. The actuation may indicate that the user intends to select a phrase or sentence beginning with the first word 3610.

In fig. 36(iii), after the user input device 466 is actuated, the user may gaze at the last intended word (e.g., the word "there") to indicate that the user desires to select a phrase beginning with the word "I'm" and ending with the word "there." The wearable system may also detect that the user has stopped actuating the totem 3620 (e.g., released the button that the user previously pressed) and may accordingly select the entire range 3630 of the phrase "I'm flying in from Boston and will be there." The system may use the focus indicator to display the selected phrase (e.g., by extending the contrasting background to all words in the phrase).

The system may use various techniques to determine that the user desires to select a phrase for editing rather than another word. As an example, when the user selects a second word shortly after selecting a first word, the system may determine that the user wishes to select a phrase rather than undo the selection of the first word. As another example, when the user selects a second word that occurs after the first word and the user has not yet edited the first selected word, the system may determine that the user wants to select a phrase. As yet another example, the user may press a button on the totem 3620 while focusing on the first word 3610 and then hold the button until their gaze lands on the last word. When the system recognizes that the button is pressed while the gaze 3600 is focused on the first word but is released only after the user's gaze 3600 has moved to the second word, the system may interpret the multimodal user input as selecting a phrase. The system can then identify all words in the phrase, including the first word, the last word, and all words in between, and allow the phrase to be edited as a whole. The system may use a focus indicator to highlight the selected phrase (e.g., highlighting, emphasized text (such as bold, italics, or a different color), etc.) so that it stands out from the unselected text. The system may then display contextually relevant options for editing the selected phrase, such as option 3510, a virtual keyboard (e.g., keyboard 3410), alternative phrases, and so forth. The system may receive additional user input, such as voice input, totem input, gesture input, etc., to determine how to edit the selected phrase 3630.
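
The press-hold-release phrase selection described above can be sketched as a small event loop: the totem button is pressed while gazing at the first word and released after the gaze lands on the last word. The event tuples are an assumed representation of the multimodal input stream.

```python
def select_phrase(events, words):
    """events: iterable of (event_type, gazed_word_index); returns the selected slice of words."""
    start = None
    for event_type, word_index in events:
        if event_type == "button_down":
            start = word_index                     # gaze-targeted word when the button is pressed
        elif event_type == "button_up" and start is not None:
            lo, hi = sorted((start, word_index))   # also supports selecting backwards
            return words[lo:hi + 1]
    return []

words = "I'm flying in from Boston and will be there around 7:00".split()
events = [("button_down", 0), ("gaze_move", 4), ("button_up", 8)]
print(select_phrase(events, words))   # ["I'm", ..., "there"]
```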

Although FIG. 36 shows the user selecting the first word 3610 at the beginning of the phrase, the system may also allow the user to make the selection in reverse order. In other words, the user may select a phrase by first selecting the last word of the desired phrase (e.g., "there") and then selecting the first word of the desired phrase (e.g., "I'm").

FIGS. 37A-37B illustrate another example of interacting with text using multimodal input. In fig. 37A, the user 210 speaks a sentence ("I want to sleep"). The wearable system may capture the user's utterance as speech input 3700. For this speech input, the wearable system can display, for each word, the primary and secondary results from an Automatic Speech Recognition (ASR) engine, as shown in fig. 37B. The primary result for each word may represent the ASR engine's best guess at the word the user uttered in the speech input 3700 (e.g., the word with the highest ASR score), while the secondary results may represent similarly pronounced alternative words or words with ASR scores lower than the ASR engine's best guess. In fig. 37B, the primary results are shown as sequence 3752. In some embodiments, the wearable system may present the alternative results or hypotheses as alternative phrases and/or entire sentences rather than as alternative words. As an example, the wearable system may provide a primary result of "four score and seven years ago" and a secondary result of "carrying seven year to go," where there is no one-to-one correspondence between discrete words in the primary and secondary results. In such embodiments, the wearable system may support input from the user (in any of the manners described herein) to select an alternative or secondary phrase and/or sentence.

As shown in FIG. 37B, each word from the user speech input 3700 may be displayed as a set 3710, 3720, 3730, 3740 of primary and secondary results. This type of arrangement may enable a user to quickly swap out an incorrect primary result and correct any errors introduced by the ASR engine. The primary results 3752 (e.g., in the example of fig. 37B, each word is bold text surrounded by a bounding box) may be emphasized with a focus indicator to distinguish them from the secondary results.

If the primary word is not the word the user intended, the user 210 may dwell on a secondary result, such as a secondary word, phrase, or sentence. As an example, the primary result of the ASR engine in set 3740 is "slip," while the correct transcription is actually the first secondary result, "sleep." To correct this error, the user may focus the gaze on the correct secondary result "sleep," and the system may recognize that the user's gaze remains on that secondary result for a threshold period of time. The system may interpret this gaze input as a request to replace the primary result "slip" with the selected secondary result "sleep." Additional user input, such as user voice input (e.g., the user may say "edit," "use this," or "replace" while looking at the desired secondary result), may be received along with the selection of the desired secondary result.
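
A small sketch of the primary/secondary swap follows, using the "slip"/"sleep" example above; the dwell threshold is an assumed value.

```python
def resolve_word(primary: str, secondaries: list, gazed_secondary: str, dwell_s: float,
                 dwell_threshold_s: float = 1.5) -> str:
    """Replace the primary ASR result when the user dwells long enough on a secondary result."""
    if gazed_secondary in secondaries and dwell_s >= dwell_threshold_s:
        return gazed_secondary
    return primary

print(resolve_word("slip", ["sleep", "slick"], gazed_secondary="sleep", dwell_s=2.0))   # "sleep"
print(resolve_word("slip", ["sleep", "slick"], gazed_secondary="sleep", dwell_s=0.4))   # "slip"
```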

Once the user has completed editing or confirmed that the transcription of the phrase "I want to sleep" is correct, the phrase may be added to the text body using any of the user input modes described herein. For example, the user may speak a hotword such as "finish" to cause the edited phrase to be added back into the text body.

Example process for interacting with text using a combination of user inputs

FIG. 38 is a process flow diagram of an example method 3800 for interacting with text using multiple user input modes. Process 3800 may be performed by wearable system 200 described herein.

At block 3810, the wearable system may receive a voice input from the user. The speech input may include a user's speech including one or more words. In one example, a user may dictate a message, and the wearable system may receive the dictated message. This may be accomplished by any suitable input device (e.g., audio sensor 232).

At block 3820, the wearable system may convert the speech input to text. The wearable system may utilize an Automatic Speech Recognition (ASR) engine to convert the user's speech input into text (e.g., a word transcription), and may further utilize natural language processing techniques to convert such text into a semantic representation that indicates intent and concepts. The ASR engine may be optimized for free-form text input.

At block 3830, the wearable system may tokenize the text into discrete actionable elements, such as words, phrases, or sentences. The wearable system may also display the text to the user using a display system such as display 220. In some embodiments, the wearable system does not need to understand the meaning of the text during tokenization. In other embodiments, the wearable system is equipped with the ability to understand the meaning of the text (e.g., using one or more natural language processing models or other probabilistic statistical models), or is equipped only with the ability to distinguish between (i) words, phrases, and sentences that represent a user-composed message or portion thereof, and (ii) words, phrases, and sentences that do not represent a user-composed message or portion thereof but instead correspond to commands to be executed by the wearable system. For example, the wearable system may need to know the meaning of the text to identify a command operation or the parameters of a command spoken by the user. Examples of such text may include the context-specific wake words, also referred to herein as hotwords, used to invoke one or more restricted libraries of system commands associated with editing for each of one or more different user input modes.

The user can interact with one or more of the actionable elements using multimodal user input. At block 3840, the wearable system may select one or more elements in response to the first indication. As described herein, the first indication may be one or a combination of user inputs. The wearable system may receive input from a user selecting one or more elements of the text string for editing. The user may select a single word or multiple words (e.g., phrases or sentences). The wearable system may receive any desired form of user input to select an element to edit, including, but not limited to, voice input, gaze input (e.g., via the inward facing imaging system 462), gesture input (e.g., captured by the outward facing imaging system 464), totem input (e.g., via actuation of the user input device 466), or any combination thereof. As an example, the wearable system may receive user input in the form of a user's gaze that remains on a particular word for a threshold period of time, or may receive a user's gaze on a particular word while user input acquired via a microphone or totem indicates selection of the particular word to edit.

At block 3850, the wearable system may edit the selected element in response to the second indication. The second indication may be received via a single input mode or a combination of the input modes described with reference to the above figures, including but not limited to user gaze input, voice input, gesture input, and totem input. The wearable system may receive user input indicating how the selected element should be edited. The wearable system may edit the selected element according to the user input received in block 3850. For example, the wearable system may replace the selected element based on the voice input. The wearable system may also present a list of suggested substitute elements and select among the suggested substitute elements based on the user's eye gaze. The wearable system may also receive input via user interaction with a virtual keyboard or via a user input device 466 (e.g., a physical keyboard or a handheld device).

At block 3860, the wearable system may display the results of editing the selected element. In some implementations, the wearable system can provide a focus indicator on the element being edited.

As shown by arrow 3870, the wearable system may repeat blocks 3840, 3850, and 3860 if the user provides additional user input to edit additional elements of the text.
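
The overall flow of method 3800 can be sketched at a high level as follows. The asr_engine, get_indication, and display collaborators are hypothetical stand-ins for the system components described above; the sketch only shows the loop over blocks 3840-3860.

```python
def interact_with_text(audio, asr_engine, get_indication, display):
    text = asr_engine.transcribe(audio)          # blocks 3810-3820: speech input -> text
    elements = text.split()                      # block 3830: tokenize into actionable elements
    display(elements)
    while True:
        first = get_indication("select")         # block 3840: gaze/gesture/totem/voice selection
        if first is None:
            break                                # no further edits requested
        index = first["element_index"]
        second = get_indication("edit")          # block 3850: how the selected element is edited
        if second["action"] == "replace":
            elements[index] = second["replacement"]
        elif second["action"] == "delete":
            del elements[index]
        display(elements)                        # block 3860: show the result, then repeat (arrow 3870)
    return " ".join(elements)
```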

Additional details regarding multimodal task execution and text editing for wearable systems are provided in U.S. Patent Application No. 15/955,204, entitled "MULTIMODAL TASK EXECUTION AND TEXT EDITING FOR A WEARABLE SYSTEM," filed on April 17, 2018, and published as U.S. Patent Publication No. 2018/0307303, the entire contents of which are incorporated herein by reference.

Examples of Cross-modality input fusion techniques

As described above, cross-modal input fusion techniques that provide dynamic selection of appropriate input modes may advantageously allow a user to more accurately and confidently target real or virtual objects and may provide a more robust user-friendly AR/MR/VR experience.

The wearable system may advantageously support opportunistic fusion of multimodal user inputs to facilitate user interaction in a three-dimensional (3D) environment. The system may detect when a user provides two or more inputs that converge via two or more respective input modes. As an example, a user may be pointing at a virtual object with a finger while also directing their eye gaze at the virtual object. The wearable system may detect this convergence of eye gaze and finger gesture inputs, apply an opportunistic fusion of the eye gaze and finger gesture inputs, and thereby determine the virtual object at which the user is pointing with greater accuracy and/or speed. Thus, the system allows the user to select smaller (or faster-moving) elements by reducing the uncertainty of the primary input targeting method. The system can also be used to speed up and simplify the selection of elements. The system may allow a user to more successfully target a moving element. The system may be used to accelerate rich rendering of display elements. The system can be used to prioritize and accelerate local (and cloud) processing of object point cloud data, dense meshing, and plane acquisition, to improve the fidelity with which discovered entities, surfaces, and grasped objects of interest are represented. Embodiments of the cross-modal techniques described herein allow the system to establish varying degrees of cross-modal focus from the user's perspective while requiring only minimal head, eye, and hand movement, thereby significantly enhancing the system's understanding of the user's intent.

As will be described further herein, identification of cross-modal states may be performed through analysis of the relative convergence of some or all of the available input vectors. This may be accomplished, for example, by examining the angular distance between pairs of aiming vectors (e.g., a vector from the user's eye to the target object and a vector pointing to the target object from a totem held by the user). The relative variance of each pair of inputs may then be checked. If the distance or variance is below a threshold, a bimodal state (e.g., bimodal input convergence) may be associated with that pair of inputs. A trimodal state may be associated with a triplet of inputs if the triplet's targeting vectors have angular distances or variances below a threshold. Convergence of four or more inputs is also possible. In examples where a triplet of a head pose aiming vector (head gaze), an eye vergence aiming vector (eye gaze), and a tracked controller or tracked hand (hand pointer) aiming vector is identified, the triplet may be referred to as a cross-modal triangle. The relative sizes of the triangle's sides, the area of the triangle, or their associated variances may present characteristic features that the system may use to predict aiming and activation intent. For example, if the area of the cross-modal triangle is less than a threshold area, a trimodal state may be associated with the triplet of inputs. Examples of vergence calculations are provided herein and in Appendix A. As an example of predictive aiming, the system may recognize that the user's eye and head inputs tend to converge before the user's hand inputs converge. For example, when trying to grasp an object, the motions of the eyes and the head occur quickly and can converge on the object a short time (e.g., about 200 ms) before the hand motion input converges. The system may detect head-eye convergence and predict that hand motion input will converge shortly thereafter.
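
The vergence test can be sketched as pairwise angular distances between aiming vectors plus the area of the cross-modal triangle formed where three aim rays meet a target plane. The thresholds and the plane-projection shortcut below are illustrative assumptions, not the system's calibrated values (see also Appendix A).

```python
import math

def angle_between_deg(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def triangle_area(p1, p2, p3):
    """Area of the cross-modal triangle spanned by three 2D intersection points on the target plane."""
    return 0.5 * abs((p2[0] - p1[0]) * (p3[1] - p1[1]) - (p3[0] - p1[0]) * (p2[1] - p1[1]))

def vergence_state(head_vec, eye_vec, hand_vec, head_pt, eye_pt, hand_pt,
                   angle_threshold_deg=5.0, area_threshold=0.01):
    pairs = {"head-eye": angle_between_deg(head_vec, eye_vec),
             "head-hand": angle_between_deg(head_vec, hand_vec),
             "eye-hand": angle_between_deg(eye_vec, hand_vec)}
    converged = {name for name, angle in pairs.items() if angle < angle_threshold_deg}
    if len(converged) == 3 and triangle_area(head_pt, eye_pt, hand_pt) < area_threshold:
        return "trimodal convergence"
    if converged:
        return "bimodal convergence: " + ", ".join(sorted(converged))
    return "diverged"

# Head and eye aim nearly together at the target; the hand ray still lags behind them.
print(vergence_state((0, 0, 1), (0.01, 0, 1), (0.3, 0.1, 1),
                     (0.00, 0.00), (0.01, 0.00), (0.30, 0.10)))
```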

Convergence of targeting vector pairs (bimodal), triplets (trimodal), or quartets (quadrodal) or more inputs (e.g., 5, 6, 7, or more) may be used to further define sub-types of cross-modal coordination. In at least some embodiments, the desired fusion method is identified based on the detailed cross-modal state of the system. The desired fusion method may also be determined, at least in part, by: cross modality types (e.g., which inputs are converged), motion types (e.g., how converged inputs move), and interaction field types (e.g., in which interaction field (such as midfield region, task space, and workspace interaction region described in connection with fig. 44A, 44B, and 44C), the inputs are focused). In at least some embodiments, the selected fusion method can determine which of the available input modes (and associated input vectors) are selectively converged. The motion type and the field type may determine settings of the selected fusion method, such as relative weighting or filtering of one or more fusion inputs.

Additional benefits and examples of techniques related to opportunistic fusion of cross-modal input and multi-modal user input for interacting with virtual objects are further described with reference to fig. 39A-60B (and below and appendix a).

Interpretation of certain cross-modal terms

The following provides an explanation of certain terms used for cross-modality input fusion techniques. These explanations are intended to illustrate, but not limit, the scope of the cross-modal terminology. The cross-modal terminology should be understood from the perspective of one of ordinary skill in the art in view of the entire description set forth in the specification, claims, and drawings.

The term IP region may include a volume associated with an Interaction Point (IP). For example, the pinch IP region may include a volume (e.g., a sphere) created by the separation of the poses of the index finger tip and the thumb tip. The term region of intent (ROI) may include a volume composed of overlapping uncertainty regions (e.g., volumes) associated with a set of aiming vectors for an intended target object. The ROI may represent a volume in which the desired target object may be found.

The term modal input may refer to input from any sensor of the wearable system. For example, common modal inputs include inputs from six degree of freedom (6DOF) sensors (e.g., for head pose or totem position or orientation), eye tracking cameras, microphones (e.g., for voice commands), outward facing cameras (for hand gestures or body poses, etc.). Modal input may refer to the simultaneous use of multiple dynamically coupled modal inputs.

The term vergence may include the convergence of multiple input vectors associated with multiple modes of user input on a common interaction point (e.g., when both eye gaze and a gesture are directed to the same spatial location). The term fixation may include the local slowing and pausing of a vergence point or of a single input vector. The term dwell may include a fixation that extends in time for at least a given duration. The term ballistic tracking may include ballistic (e.g., projectile-like) motion of a vergence point toward a target or other object. The term smooth tracking may include smooth (e.g., low-acceleration or low rate of change of acceleration) motion of the vergence point toward the target or other object.

The term sensor convergence may include convergence of sensor data (e.g., convergence of data from a gyroscope with data from an accelerometer forming an Inertial Measurement Unit (IMU), convergence of data from an IMU with a camera, convergence of multiple cameras for SLAM processing, etc.). The term feature convergence may include spatial convergence of inputs (e.g., convergence of input vectors from inputs of multiple modes) as well as temporal convergence of inputs (e.g., parallel or sequential timing of inputs of multiple modes).

The term bimodal convergence may include a convergence of two input modes (e.g., when the input vectors of two input modes converge on a common interaction point). The terms trimodal and quadmodal convergence may include convergence of three and four input modes, respectively. The term cross-modality convergence may include a temporal convergence of multiple inputs (e.g., detecting the temporal convergence of user inputs of multiple modes and correspondingly integrating or fusing those user inputs to improve the overall input experience).

The term divergence may refer to at least one input mode that previously converged (and thus fused) with at least one other input mode and no longer converged with that other input mode (e.g., trimodal divergence may refer to a transition from a trimodal convergence state to a bimodal convergence state as the initially converged third input vector diverges from the converging first and second input vectors).

The term head-hand-vergence may include convergence of head and hand ray casting vectors or convergence of head pose and hand interaction points. The term head-eye-vergence may include convergence of head pose and eye gaze vectors. The term head-eye-hand-vergence may include convergence of head pose, eye gaze and hand direction input vectors.

The term passive cross-mode intent may include pre-aiming, head-eye fixation, and dwell. The term active cross-modality intent may include head-eye-hand dwell interactions or head-eye-hand manipulation interactions. The term cross-modal triangle may include a region created by the convergence of three modal input vectors (and may refer to a region of uncertainty for a three modal convergence input). This region may also be referred to as a vergence region or a modal vergence region. The term trans-modal quadrilateral may include a region created by the convergence of four modal input vectors.

Examples of user input

Fig. 39A and 39B illustrate examples of user inputs received through input regions on controller buttons or user input devices. In particular, fig. 39A and 39B illustrate a controller 3900, which controller 3900 may be part of the wearable system disclosed herein, and may include a home button 3902, a trigger 3904, a bumper (bump) 3906, and a touchpad 3908. The user input device 466 or totem 1516 described with reference to fig. 4 and 15, respectively, may be used as the controller 3900 in various embodiments of the wearable system 200.

Potential user inputs that may be received through controller 3900 include, but are not limited to: pressing and releasing main button 3902; half, full, and other partial presses of trigger 3904; releasing trigger 3904; pressing and releasing bumper 3906; and touching, moving while touching, releasing a touch, increasing or decreasing touch pressure, touching a particular portion (such as an edge) of touchpad 3908, or making a gesture on touchpad 3908 (e.g., drawing a shape with the thumb).

Fig. 39C shows an example of user input received through physical movement of a controller or a Head Mounted Device (HMD). As shown in fig. 39C, physical movements of controller 3900 and head mounted display 3910(HMD) may form user inputs into the system. The HMD 3910 may include the head mounted assembly 220, 230 shown in fig. 2A or the head mounted wearable assembly 58 shown in fig. 2B. In some embodiments, controller 3900 provides a three degree of freedom (3DOF) input by recognizing rotation of controller 3900 in any direction. In other embodiments, controller 3900 also provides a six degree of freedom (6DOF) input by recognizing translation of the controller in any direction. In other embodiments, the controller 3900 can provide inputs that are less than 6DOF or less than 3 DOF. Similarly, the head mounted display 3910 may recognize and receive 3DOF, 6DOF, less than 6DOF, or less than 3DOF inputs.

Fig. 39D shows an example of how user input may have different durations. As shown in fig. 39D, some user inputs may have a short duration (e.g., a duration of less than a fraction of a second, such as 0.25 seconds), or may have a long duration (e.g., a duration of greater than a fraction of a second, such as more than 0.25 seconds). In at least some embodiments, the duration of the input itself can be recognized by the system and used as the input. The short duration input and the long duration input may be processed differently by wearable system 200. For example, a short-duration input may represent selecting an object, while a long-duration input may represent activating the object (e.g., causing execution of an application (app) associated with the object).
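
A tiny sketch of the short-versus-long classification is shown below; the 0.25-second boundary comes from the example above, and the mapping of short presses to selection and long presses to activation follows the text.

```python
def classify_press(press_time_s: float, release_time_s: float, short_max_s: float = 0.25) -> str:
    duration = release_time_s - press_time_s
    return "short duration (e.g., select object)" if duration < short_max_s else "long duration (e.g., activate object)"

print(classify_press(10.00, 10.10))   # short press
print(classify_press(10.00, 10.60))   # long press
```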

Fig. 40A, 40B, 41A, 41B and 41C illustrate various examples of user inputs that may be received and recognized by the system. The user input may be received via one or more modes of user input (either individually or in combination, as shown). User input may include input through controller buttons (such as main button 3902, trigger 3904, bumper 3906, and touch pad 3908); physical movement of controller 3900 or HMD 3910; an eye gaze direction; a head pose direction; a gesture; inputting voice; and so on.

As shown in fig. 40A, a short press and release of main button 3902 may indicate a main tap action, while a long press of main button 3902 may indicate a main press and hold action. Similarly, a short press and release of trigger 3904 or bumper 3906 may indicate a trigger flick action or a bumper flick action, respectively; while a long press of trigger 3904 or bumper 3906 may indicate a press and hold action of the trigger or bumper, respectively.

As shown in fig. 40B, a touch of the touchpad 3908 that is moved over the touchpad may indicate a touch-drag action. Short touches and releases of touchpad 3908 (where the touch does not substantially move) may indicate a tap action. If such short touches and releases of touchpad 3908 are accomplished with forces that exceed some threshold level (which may be a predetermined threshold, a dynamically determined threshold, a learned threshold, or some combination thereof), the input may indicate a force tap input. Force-touch touchpad 3908 that exceeds a threshold level may indicate a force-press action, while a long touch with such force may indicate a force-press and hold input. A touch near an edge of the touch panel 3908 may indicate an edge press action. In some embodiments, the edge press action may also include an edge touch that is greater than a threshold pressure level. Fig. 40B also shows that a touch moving in an arc on the touch pad 3908 can indicate a touch circle action.

The example of fig. 41A shows that interaction with the touch pad 3908 (e.g., by moving a thumb on the touch pad) or physical movement (6DOF) of the controller 3900 can be used to rotate a virtual object (e.g., by making a circular gesture on the touch pad), move the virtual object in a z-direction toward or away from a user (e.g., by making a gesture on the touch pad in, for example, a y-direction), and expand or contract the size of the virtual object (e.g., by making a gesture on the touch pad in a different direction (e.g., an x-direction)).

FIG. 41A also shows that combinations of inputs can represent actions. In particular, fig. 41A shows that interaction with bumper 3906 and a user turning and tilting their head (e.g., adjusting their head pose) can indicate a maneuver start and/or maneuver end action. As an example, a user may provide an indication to begin manipulation of an object by double-clicking or pressing and holding bumper 3906, may then move the object by providing additional input, and may then provide an indication to end manipulation of the object by double-clicking or releasing bumper 3906. In at least some embodiments, the user can provide additional input to move the object in the form of physical motions (6 or 3DOF) of controller 3900 or by adjusting their head pose (e.g., tilting and/or rotating their head).

FIG. 41B illustrates additional examples of user interaction. In at least some embodiments, the interactions of FIG. 41B involve two-dimensional (2D) content. In other embodiments, the interactions of FIG. 41B may be applied to three-dimensional content. As shown in fig. 41B, a combination of a head gesture (which may point at the 2D content) and a touch moving on the touchpad 3908 may indicate a set-selection action or a scroll action. A head gesture in combination with a force press on an edge of the touchpad 3908 may indicate a scrolling action. A head gesture in combination with a tap or a short press on the touchpad 3908 may indicate an activation action. A force press and hold on the touchpad 3908 in combination with a head gesture (which may be a specific head gesture) may indicate a contextual menu action.

As shown in fig. 41C, the wearable device may open a menu associated with an application using a head gesture (indicating that the user's head is directed at the virtual application) combined with a primary tap action, or may open a launcher application (e.g., an application that allows multiple applications to execute) using a head gesture combined with a primary press and hold action. In some embodiments, the wearable device may use a primary tap action (e.g., a single or double tap of the primary button 3902) to open a launcher application associated with the pre-targeted application.

FIGS. 42A, 42B, and 42C illustrate examples of user input in the form of fine finger gestures and hand motions. The user inputs shown in figs. 42A, 42B, and 42C may sometimes be referred to herein as micro-gestures, and may take the form of fine finger movements, such as pinching the thumb and forefinger together, pointing with a single finger, grasping with a closed or open hand, pointing with a thumb, tapping with a thumb, and so forth. As one example, micro-gestures may be detected by a wearable system using a camera system. In particular, micro-gestures may be detected using one or more cameras (which may include a pair of cameras in a stereoscopic configuration), which may be part of an outward-facing imaging system 464 (shown in fig. 4). The object recognizer 708 may analyze the images from the outward-facing imaging system 464 to recognize the example micro-gestures shown in figs. 42A-42C. In some implementations, the micro-gesture is activated by the system when the system determines that the user has focused on a target object for a sufficiently long gaze or dwell time (e.g., when the convergence of multiple input modes is considered robust).

Examples of perceptual fields, display rendering planes, and interaction regions

Fig. 43A shows the visual and auditory perception fields of a wearable system. As shown in fig. 43A, within the visual perception field, the user may have a main field of view (FOV) and a peripheral FOV. Similarly, the auditory perception field may include at least forward, backward, and peripheral directions in which the user can hear.

Fig. 43B shows a display rendering plane of a wearable system having multiple depth planes. In the example of fig. 43B, the wearable system has at least two rendering planes, one rendering plane displaying virtual content at a depth of about 1.0 meter and another rendering plane displaying virtual content at a depth of about 3.0 meters. The wearable system may display virtual content at a given virtual depth on a depth plane having a closest display depth. Fig. 43B also shows a 50 degree field of view for this example wearable system. In addition, fig. 43B shows a near clipping plane at about 0.3 meters and a far clipping plane at about 4.0 meters. Virtual content that is closer than the near clipping plane may be clipped (e.g., not displayed) or may be shifted away from the user (e.g., at least to the distance of the near clipping plane). Similarly, virtual content that is further from the user than the far clipping plane may be clipped or may be shifted toward the user (e.g., at least a distance to the far clipping plane).
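
As a non-limiting illustration of how virtual content might be assigned to a rendering plane and kept within the clipping planes, the following sketch uses the example values of fig. 43B (rendering planes at about 1.0 and 3.0 meters, a near clip at about 0.3 meters, and a far clip at about 4.0 meters). The function name and the choice to shift (rather than cull) out-of-range content are assumptions for illustration.

```python
# Sketch: choose the rendering plane closest to a virtual object's depth and
# clamp the depth to the near/far clipping planes (example values from fig. 43B).
RENDER_PLANES_M = (1.0, 3.0)   # example depth planes
NEAR_CLIP_M = 0.3
FAR_CLIP_M = 4.0

def place_virtual_content(depth_m: float):
    """Return (clamped_depth, rendering_plane) for content at depth_m."""
    # Alternative to clipping: shift content to the clip-plane distance.
    clamped = min(max(depth_m, NEAR_CLIP_M), FAR_CLIP_M)
    plane = min(RENDER_PLANES_M, key=lambda p: abs(p - clamped))
    return clamped, plane

print(place_virtual_content(0.1))   # nearer than the near clip -> (0.3, 1.0)
print(place_virtual_content(2.2))   # -> (2.2, 3.0), rendered on the 3.0 m plane
```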

FIGS. 44A, 44B, and 44C illustrate examples of different interaction regions around a user, including a midfield region, an extended workspace, a task space, a manipulation space, an examination space, and a head space. These interaction regions represent spatial regions in which a user can interact with real and virtual objects; the type of interaction may be different in different regions, and the appropriate set of sensors for cross-modal fusion may be different in different regions. For example, the workspace may include the area in front of the user, within the user's field of view (FOV), and within the user's reach (e.g., out to about 0.7m). The task space may be a volume within the workspace and may generally correspond to a volume (e.g., from about 0.2m to 0.65m) in which a user may comfortably manipulate an object with their hands. The task space may span a generally downward angle (measured from a forward horizontal vector) from, for example, about 20 degrees below horizontal to about 45 degrees below horizontal. The examination space may be a volume within the task space and may generally correspond to a volume in which a user may hold an object in close proximity (e.g., from about 0.2m to 0.3m) in order to examine the object. The types of input in the examination, task, and work spaces may generally include head gestures, eye gaze, and gestures (e.g., the micro-gestures shown in figs. 42A-42C). At greater distances from the user, the extended workspace may extend to a distance of about 1m, the workspace may generally be the volume in front of the user within the user's FOV, and the midfield region may extend spherically to a distance (from the user's head) of about 4 meters. The head space may correspond to the volume occupied by the user's head (e.g., out to the near clipping plane shown in fig. 43B) as well as the volume occupied by any head-mounted components of the wearable systems disclosed herein.

The near field region includes a region near the user and extends from about 0.7m to about 1.2m from the user. The midfield region extends beyond the near field and out to about 4 m. The far field region extends beyond the mid field region and may include a distance (which may be up to 10m or 30m or even infinity) outward to a distance of a maximum rendering plane or a maximum depth plane provided by the wearable system. In some implementations, the midfield region may range from about 1.2m to about 2.5m and may represent a region of space in which a user may "tilt" and grasp or interact with real or virtual objects. In some such implementations, the far field region extends beyond about 2.5 m.
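
As a non-limiting illustration, the interaction regions described above may be approximated by thresholding the distance of a point from the user. The boundaries in the following sketch are the example distances given above; they overlap in the description and are illustrative rather than fixed.

```python
# Sketch: classify which interaction region a point lies in, using the example
# distances given above (boundaries are illustrative assumptions, not fixed).
def interaction_region(distance_m: float) -> str:
    if distance_m <= 0.3:
        return "examination space"     # ~0.2-0.3 m, close inspection
    if distance_m <= 0.65:
        return "task space"            # comfortable hand manipulation
    if distance_m <= 0.7:
        return "workspace"             # within arm's reach
    if distance_m <= 1.2:
        return "extended workspace / near field"
    if distance_m <= 4.0:
        return "midfield"
    return "far field"

print(interaction_region(0.25))  # 'examination space'
print(interaction_region(2.0))   # 'midfield'
```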

The exemplary interaction regions shown in FIGS. 44A-44C are illustrative and not limiting. The user interaction regions may be arranged differently from the illustrated regions and may have different sizes, shapes, etc.

Degree of input integration

As will be described further below, the development of cross-modal input fusion techniques represents increasing integration of inputs: from statically defined input systems, through dynamic coupling based on input feedback systems, to input coupling that employs multiple dynamic feedback and feedforward system operations (e.g., dynamically anticipating or predicting which input sensors will be used). For example, single-modality techniques may utilize a single sensor with dedicated controls (e.g., touch gestures), and multi-modality techniques may utilize multiple independent sensors (e.g., head gaze and button selection on an input device) operating concurrently in parallel. Cross-modality techniques may utilize multiple sensors that are statically fused (e.g., permanently cross-coupled). In such techniques, the wearable system typically accepts all sensor inputs and determines the user's likely intent (e.g., to select a particular object).

In contrast, cross-modal input fusion techniques provide dynamic coupling of sensor inputs, e.g., identifying sensor inputs that converge spatially (e.g., converge in a spatial region around a target object) or temporally (e.g., continue for gaze or dwell time).

In some implementations, the sensor input coupled across the modes occurs a relatively small fraction of the time that the user interacts with the 3D environment. For example, in some such implementations, cross-modal coupling input occurs only about 2% of the time. However, during the time that the appropriate set of convergence sensors is identified, cross-modal input fusion techniques can significantly improve the accuracy of target object selection and interaction.

Single modality user interaction

FIG. 45 illustrates an example of a single modality user interaction. In single-modality user interaction, user input is received via a single mode of user input. In the example of fig. 45, the touch input is registered as a touch gesture, and the other mode of input is not used in interpreting the user input. If desired, the touch input may include a single pattern of changes in user input over time (e.g., a moving touch followed by a tap).

Multi-modal user interaction

FIGS. 46A, 46B, 46C, 46D, and 46E illustrate examples of multi-modal user interactions. In a multi-modal user interaction, independent modes of user input are received and utilized together to improve the user experience. However, in multi-modal interactions, dynamic fusion of the different sensor modalities typically does not occur. In figures such as figs. 46A and 46B, user actions and corresponding input controls (e.g., touch gestures, head gaze (also referred to as head pose), eye gaze, and gestures) are shown on the left side (e.g., the user may provide touch input or move his or her head, eyes, or hands).

As shown in the examples of figs. 46A and 46B, a user may use their head gaze to aim at a virtual object and then use a controller button to select the virtual object. Thus, the user provides input through two modes (head gaze and button), and the system utilizes these two modes to determine that the user wishes to select the virtual object to which the user's head is directed.

In addition, the patterns of user input combined in a multi-modal user interaction may be interchanged to some extent. As an example, a user may use their eye gaze instead of their head pose to aim at a virtual object. As another example, the user may select the virtual target using a blink, gesture, or other input instead of a controller button.

FIG. 46C illustrates an example of multi-modal user interaction in the near and midfield interaction regions, which may correspond, for example, to the workspace and midfield of FIG. 44B. As shown in fig. 46C, the user 4610 may aim at a virtual object 4600 in the near field region using a totem collider directly associated with the location of the totem 4602, using a totem touchpad cursor (e.g., by manipulating the touchpad to move a mouse arrow over the virtual object 4600), or by making a particular gesture 4604 on or near the virtual object 4600. Similarly, the user 4610 may aim at a virtual object 4612 in the midfield region using a head gesture, an eye gaze gesture, the controller 4602 with 3 or 6DOF inputs, or a hand or arm gesture 4604. In the example of FIG. 46C, the interaction is multi-modal because there are multiple single-modality options available for the user to provide the same input to the system.

While fig. 46C shows an example of multi-modal user interaction associated with targeting, fig. 46D shows an example of multi-modal user interaction associated with targeting and selection (in the near-field and midfield interaction regions). As shown in fig. 46D, the user 4610 may select the virtual objects 4600 and 4612 targeted in fig. 46C using various techniques including, but not limited to, pressing a bumper or trigger on the totem (also referred to herein as a controller), pressing or tapping a touchpad on the totem, performing a micro-gesture (such as a finger pinch or tap), or dwelling (e.g., hovering) with the input used for aiming at the objects (e.g., keeping their head pose, eye pose, totem projection, or gesture focused on the virtual object 4600 or 4612 for longer than some threshold amount of time). In the example of FIG. 46D, the interaction is multi-modal because the user is using multiple modes of input (where time, in the form of dwell, may itself be a mode of input).

As shown in fig. 46E, the user may select targeted virtual objects, such as objects 4600 and 4612, using an input mode other than the one used for targeting. As shown in the various examples of fig. 46E, the user may select the aimed-at virtual object using various techniques including, but not limited to, pressing a trigger or bumper or tapping on the totem touchpad (e.g., while aiming using a head gesture, an eye gesture, or the totem), and making a gesture or micro-gesture (such as a pinch tap, a flick gesture, or a pointing gesture) (e.g., while aiming using a hand gesture). In the example of FIG. 46E, the interaction is multi-modal because the user is using multiple modes of input.

Cross-modality user interaction

Fig. 47A, 47B, and 47C illustrate examples of cross-modality user interaction. In a cross-modality user interaction, a user input of a first mode is modified by a user input of a second mode. In the example of targeting, user input in a primary mode may be used to target a desired virtual object, and user input in a secondary mode may be used to adjust the user input in the primary mode. This may be referred to as a relative cursor mode. As an example of the relative cursor mode, a user may use a first mode of user input (such as eye gaze) to provide primary control of a cursor, and a second mode of user input (such as input on a controller) to adjust the position of the cursor. This may provide the user with more precise control over the cursor.

As shown in fig. 47C, a head ray cast (e.g., based on head pose) that approximately aims at the virtual object 4700 may be received. Input from the totem may then be received that applies an increment to the head ray cast in order to fine-tune the aim at the virtual object 4700. Once the user is satisfied, the user may provide a touch tap to select the targeted virtual object 4700. FIG. 47C also shows various examples of similar processes using different combinations of user input in the primary mode and user input in the secondary mode, as well as different examples of selection inputs.

Cross-modal user interaction

Fig. 48A, 48B, and 49 illustrate various examples of cross-modal user interaction. In cross-modal user interaction, two or more modes of user input may be dynamically coupled together. As an example, the wearable system may dynamically detect when two (or more) different modes of user input converge, and may then combine the inputs received in those modes to achieve better results than any of the individual inputs alone could provide.

As shown in the example of fig. 48A, as inputs come together and separate, the user's gesture inputs, head pose inputs, and eye gaze inputs may be dynamically integrated and separated.

At time 4800, the user is providing gesture input to target a particular virtual object. As an example, the user may be pointing at a virtual object.

At time 4802, the user has focused their eyes on the same virtual object (as the user is aiming with one or more of their hands), and also turned the head to point at the same virtual object. Thus, all three modes of input (gestures, eye gestures, and head gestures) have come together on a common virtual object. The wearable system may detect this convergence (e.g., a trimodal convergence) and cause the user input to be filtered to reduce any uncertainty associated with the user input (e.g., increase the probability that the system correctly identifies the virtual object that the user intends to select). The system may selectively process the converged input with different filters and/or patterns of user input (e.g., based on factors such as which inputs have converged and how strongly the inputs converged). As an example, the system may overlay or otherwise combine the uncertainty regions for each of the convergence inputs and thereby determine a new uncertainty region that is less than the respective uncertainties. In at least some embodiments, the wearable system may integrate convergence inputs, where different weights are given to different user inputs. As an example, during this brief period of time, the eye gaze input may indicate the current location of the target more accurately than the head or hand gesture input and thus be given more weight. In particular, even when using an eye tracking system with a relatively low resolution camera and sampling rate, the user's eyes tend to guide other inputs and thus tend to respond more quickly to small changes in target position. In this way, the eye gaze input may provide a more accurate input vector when appropriately adjusted (e.g., filtered and fused with other inputs in an appropriately weighted manner).
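
As a non-limiting illustration of giving different weights to converged inputs, the following sketch uses inverse-variance weighting of the input direction vectors, which is one possible way of letting a more precise input (such as the eye gaze in this example) dominate the fused result; the specific direction and uncertainty values, and the weighting scheme itself, are assumptions.

```python
# Sketch: fuse converged input directions by weighting each modality with the
# inverse of its variance, so the more precise modality (here, eye gaze) dominates.
import math

def fuse_inputs(directions, sigmas):
    """directions: modality -> 3-vector; sigmas: modality -> angular std-dev (radians)."""
    weights = {m: 1.0 / (sigmas[m] ** 2) for m in directions}
    total = sum(weights.values())
    fused = [sum(weights[m] * directions[m][i] for m in directions) / total
             for i in range(3)]
    norm = math.sqrt(sum(c * c for c in fused))
    return [c / norm for c in fused]          # renormalize to a unit direction

fused = fuse_inputs(
    {"eye": (0.02, 0.00, 1.00), "head": (0.06, 0.01, 1.00), "hand": (0.10, -0.02, 1.00)},
    {"eye": 0.01, "head": 0.03, "hand": 0.05},
)
print(fused)   # lies closest to the eye-gaze direction, which has the smallest uncertainty
```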

At time 4804, the user has shifted their eye gaze away from the virtual object while their head gestures and gestures are still focused or pointed at the virtual object. In other words, the user's eye gaze has diverged and no longer converged with the user's head pose and gesture input. The wearable system may detect the divergence event and adjust its filtering of different user inputs accordingly. As an example, the wearable system may continue to combine the user's head gestures and gesture inputs (e.g., converge in a bimodal fashion) to identify which virtual object the user wishes to select, while ignoring eye gaze inputs for this purpose.

At time 4806, the user has returned their eye gaze to the virtual object such that the user's eye gaze and gestures converge on the virtual object. The wearable system may detect this bimodal convergence and combine or fuse the two inputs in a weighted manner.

As shown in the example of fig. 48B, cross-modal selection or targeting of objects may include dynamically cross-coupled inputs and may be used for static and ballistic objects (e.g., moving objects) in various interaction regions such as the near field (sometimes referred to as the task space) and mid-field regions. FIG. 48B shows that a user can use cross-modality input to target a static or moving object. As an example, a user may aim at an object by turning the gaze of their head and eyes toward the object (and possibly hovering over the object and/or employing such input to track the object). The user may also provide additional input to select an object, such as a controller or gesture (e.g., converging with the head and/or eye gaze to form a tri-modal input).

As shown in the example of fig. 49, the system may dynamically cross-couple the inputs together. At time 4900, the system may fuse the eye gaze input with the head pose input, allowing the user to view to select an object (while using their head pose to improve the accuracy of their eye gaze input). At time 4910, the system may fuse the head pose, eye pose, and gesture inputs together, allowing the user to view and point to select an object (using both head and eye gaze pose inputs to improve the accuracy of the gesture inputs).

Example Process of Cross-modality interaction

Fig. 50 is a process flow diagram of an example method 5000 of interacting with a wearable system using multiple modes of user input. Process 5000 may be performed by wearable system 200 described herein.

At block 5002, the wearable system may receive user input, classify the modal interaction, and determine whether any convergence of the different modes of user input has occurred. The wearable system may classify the modal interaction as bi-modal or tri-modal (e.g., having two or three different user input modes come together), or quad-modal (e.g., four input modes), or a higher number of input modes. Block 5002 may relate to detection and classification of modal interactions at various stages, such as initial formation of a bi-modal or tri-modal "bond" (e.g., where differences between the input vectors of the different user input modes fall below some given threshold), fixation or stabilization of the bi-modal or tri-modal bond (e.g., where the differences between the input vectors stabilize below the threshold), and divergence (e.g., where the differences between the input vectors increase beyond the threshold). In some cases, two input modes may stabilize before an action is performed or before another input mode converges with them. For example, the head and eye inputs may converge and stabilize for a short time (e.g., about 200ms) before a hand-grasping action is performed or before the hand input converges with the head and eye inputs.
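
As a non-limiting illustration, the formation, stabilization, and divergence of a bimodal "bond" may be framed as thresholding the angle between two input vectors over a short history. The numeric thresholds, the hysteresis between the convergence and divergence thresholds, and the stabilization window in the following sketch are assumptions.

```python
# Sketch: form, hold, or break a bimodal "bond" based on the angle between two
# input vectors over a short history. Thresholds and window size are assumptions.
import math

CONVERGE_DEG = 5.0     # below this angle, the inputs are considered converged
DIVERGE_DEG = 8.0      # above this angle, the bond is broken (hysteresis band)
STABLE_WINDOW = 6      # consecutive samples (~200 ms at 30 Hz) for a stable bond

def angle_deg(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def classify_bond(angle_history):
    """angle_history: recent head-eye (or other pair) angles in degrees, oldest first."""
    if angle_history[-1] > DIVERGE_DEG:
        return "diverged"
    recent = angle_history[-STABLE_WINDOW:]
    if len(recent) >= STABLE_WINDOW and all(a < CONVERGE_DEG for a in recent):
        return "stable bond"
    return "forming" if angle_history[-1] < CONVERGE_DEG else "unbonded"

history = [angle_deg((0, 0, 1), v) for v in
           [(0.2, 0, 1), (0.15, 0, 1), (0.1, 0, 1), (0.06, 0, 1), (0.04, 0, 1), (0.03, 0, 1)]]
print(classify_bond(history))   # 'forming': converged, but not yet stable for 6 samples
```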

Block 5002 may also relate to classifying the type of motion and the interaction region of the modal interaction. In particular, block 5002 may involve determining whether the converging user inputs are in ballistic tracking (e.g., with variable velocity or variable acceleration), in smooth tracking (e.g., with a more constant velocity and lower acceleration), in gaze (e.g., with relatively low velocity and acceleration over time), or in dwell (e.g., have been in gaze for more than a given amount of time). Additionally, block 5002 may involve determining whether the converged inputs are within a near-field region (e.g., the task space region), a mid-field region, or a far-field region. The system may process cross-modal input differently depending on the type of motion and the interaction region.
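
As a non-limiting illustration, the motion state of a converged input may be classified from its angular speed, angular acceleration, and time spent in gaze; the numeric thresholds in the following sketch are assumptions.

```python
# Sketch: classify the motion state of converged inputs from angular speed (deg/s),
# angular acceleration (deg/s^2), and time spent in gaze; thresholds are assumptions.
def classify_motion(speed_dps, accel_dps2, time_in_gaze_s, dwell_threshold_s=0.5):
    if speed_dps < 2.0 and accel_dps2 < 5.0:
        # low velocity and low acceleration: gaze, becoming dwell once it persists
        return "dwell" if time_in_gaze_s >= dwell_threshold_s else "gaze"
    if accel_dps2 < 20.0:
        return "smooth tracking"     # fairly constant velocity, low acceleration
    return "ballistic tracking"      # variable velocity / high acceleration

print(classify_motion(1.0, 2.0, 0.7))     # 'dwell'
print(classify_motion(40.0, 150.0, 0.0))  # 'ballistic tracking'
```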

At block 5004, the wearable system may apply filtering to the cross-modal input. Block 5004 may involve applying filtering of different strengths based on how strong the vergence is between the inputs, which inputs are coming together, in which region the inputs are coming together, and so on. As an example, when eye gaze inputs and gesture inputs converge and are in a mid- or far-field region, stronger filtering may need to be applied than when such inputs are in a near-field region (since the uncertainty of these inputs typically increases with increasing distance from the user). In at least some embodiments, the filtering in block 5004 may involve conditioning of the input and may or may not include removing portions of the input. As an example, the system may filter the cross-modal input by noise-filtering the primary input pose to increase targeting accuracy. Such filtering may include a low-pass filter, such as a one-Euro filter, which filters out high frequency components (and which may include an adaptive cutoff frequency). Although such filtering may improve targeting accuracy even without fusion, applying such filtering permanently (as opposed to only while inputs dynamically converge) may introduce significant (and undesirable) latency. By selectively applying noise filtering only when the inputs converge (which may represent a small fraction of the operating time), the system can retain the accuracy benefits of applying a low-pass filter while avoiding most of the added latency. In various implementations, other filters may be used (alone or in combination), including, for example, Kalman filters, Finite Impulse Response (FIR) filters, Infinite Impulse Response (IIR) filters, moving averages, single or double exponential filters, and the like. Another example of a filter is a dynamic recursive low-pass filter, in which the low-pass filter has a dynamically adjustable cutoff frequency such that at low input vector speeds the cutoff frequency is smaller to reduce jitter (while accepting a small amount of lag or latency), and at high input vector speeds the cutoff frequency is larger to reduce lag (while accepting more jitter).
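
As a non-limiting illustration, the adaptive low-pass filtering described above may be sketched as a one-Euro-style filter that is applied only while the inputs are converged; the parameter values below are illustrative choices, not the system's.

```python
# Sketch of a one-Euro-style adaptive low-pass filter, applied only while the
# inputs are converged. Parameter values (min_cutoff, beta, d_cutoff) are assumed.
import math

class OneEuroFilter:
    def __init__(self, min_cutoff=1.0, beta=0.02, d_cutoff=1.0):
        self.min_cutoff, self.beta, self.d_cutoff = min_cutoff, beta, d_cutoff
        self.x_prev = self.dx_prev = None

    @staticmethod
    def _alpha(cutoff, dt):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, x, dt):
        if self.x_prev is None:
            self.x_prev, self.dx_prev = x, 0.0
            return x
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1 - a_d) * self.dx_prev
        # Cutoff rises with speed: low speed -> less jitter, high speed -> less lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat

pose_filter = OneEuroFilter()

def filter_primary_pose(sample, dt, converged):
    """Apply the adaptive filter only while cross-modal convergence holds."""
    return pose_filter(sample, dt) if converged else sample
```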

At block 5006, the wearable system may integrate any converged user inputs. In combining user inputs, the wearable system may interpolate (linearly, spherically, or otherwise) between the user inputs to create a combined or fused input. In some embodiments, the wearable system may apply (linear, quadratic, exponential, or other) easing to avoid jittery input. In particular, the wearable system may smooth abrupt changes in the combined input. As an example, at the moment when the difference between two inputs becomes less than the threshold for convergence, the active input of the wearable system may jump from one of the individual inputs to a new fused input. To avoid jitter, the wearable system may move the active input (and any corresponding cursor or feedback mechanism) from the original active input to the new fused input in a damped manner (e.g., travel away from the original active input with limited acceleration and then travel to the new fused input with limited deceleration). The acts in block 5006 may be dynamic in that the method 5000 may continuously or repeatedly check for converging or diverging inputs and integrate the converging inputs. For example, the method may dynamically integrate inputs that have converged and dynamically remove inputs that have diverged.
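
As a non-limiting illustration, the damped hand-off from the original active input to the new fused input may be implemented as an eased interpolation over a short blend time; the 0.2-second blend time and the smoothstep easing curve in the following sketch are assumptions.

```python
# Sketch: ease the active input from its pre-fusion value toward the fused input
# to avoid a visible jump when convergence is first detected. The 0.2 s blend
# time and the smoothstep curve are illustrative choices.
def smoothstep(t):
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)     # zero slope at both ends -> limited accel/decel

def blended_input(original, fused, time_since_fusion_s, blend_time_s=0.2):
    w = smoothstep(time_since_fusion_s / blend_time_s)
    return [(1 - w) * o + w * f for o, f in zip(original, fused)]

print(blended_input([0.0, 0.0, 1.0], [0.1, 0.0, 1.0], 0.1))  # halfway eased toward fused
```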

At block 5008, the wearable system may optionally provide feedback to the user that the user input has been fused and that cross-modal interactions are now available. As examples, the wearable system may provide such feedback in the form of text, visual markers such as points, lines, rings, arcs, triangles (e.g., for tri-modal convergence), squares (e.g., for quad-modal convergence), and meshes.

Examples of user selection procedures in single-modality, dual-modality, and tri-modality interactions

FIG. 51 illustrates examples of user selections in single-modality, dual-modality, and tri-modality interactions.

In single modality interaction 5100, a user provides user input via a first mode of user input to track or select a given object. The mode of user input in interaction 5100 may be any suitable mode, such as head gestures, eye gestures, hand gestures, controller or totem inputs, and so forth. The user input in interaction 5100 is typically input from the user identifying a particular region or volume for selection. As shown in fig. 51, there may be an uncertainty associated with the user input in the interaction 5100. In particular, user input typically has at least some amount of uncertainty due to limitations of input devices such as eye tracking systems, head pose tracking systems, gesture tracking systems, and the like. In at least some embodiments, this uncertainty can decrease over time (e.g., as the system averages otherwise-constant user input over time). This effect is illustrated in figure 51 by a circle, which represents the uncertainty of a given user input and which shrinks over time (e.g., compare the relatively larger uncertainty at time 5102 to the relatively smaller uncertainty at time 5104).

In the bimodal interaction 5110, a user provides user input via two different modes of user input to track or select a given object. The mode of user input in interaction 5110 may be any suitable combination of modes, such as head-eye, hand-eye, head-hand, and the like. As shown in fig. 51, the system may use the overlapping uncertainties to reduce the uncertainty of the system to the user interaction. As an example, the system may identify a region 5112 that is located within both the uncertainty region of the first mode of user input and the uncertainty region of the second mode of user input in the bimodal interaction 5110. As shown in fig. 51, the overlap error zone 5112 associated with fusing bimodal inputs is substantially smaller than the error zone associated with any of the constituent modal inputs. Additionally, as the uncertainty of each of the base modality inputs decreases (e.g., the bimodal uncertainty 5112 may generally decrease from the initial time 5102 to a later time 5104), the overlap error region 5112 may also shrink over time.

In tri-modal interaction 5120, a user provides user input via three different modes of user input to track or select a given object. The modes of user input in interaction 5120 may be any suitable combination of modes, such as head-eye-hand, head-eye-totem, and so forth. As discussed in connection with the bimodal interaction 5110, the system can use the overlapping uncertainties of the three different modes to reduce the overall uncertainty of the system about the user interaction. As an example, the system can identify the area 5122 that is within the uncertainty areas of the user inputs of the first, second, and third modes. The total uncertainty region 5122 may sometimes be referred to as a cross-modal triangle (for tri-modal interactions), a cross-modal quadrilateral (for quad-modal interactions), or a cross-modal polygon (for interactions with a greater number of inputs). As shown in fig. 51, as the uncertainty of each of the underlying modality inputs decreases (e.g., the tri-modal uncertainty 5122 may generally decrease from the initial time 5102 to the later time 5104), the overall uncertainty region 5122 may also shrink over time. Fig. 51 also shows, in an interaction 5121, an overall uncertainty region 5123 in the form of a circle instead of a triangle. The exact shape of the uncertainty region formed by multiple converging modes of user input may depend on the shapes of the uncertainty regions of the underlying user inputs.

In at least some embodiments, the length of the "legs" of the cross-modal triangle, cross-modal quadrilateral, and/or higher-order cross-modal shape may be proportional to the degree of convergence of the pair of input gesture vectors associated with the respective leg. The length of the "legs" may also indicate the type of task and/or characteristics of the individual user (e.g., different users may tend to interact in different and recognizable ways, which may be reflected in the lengths of the "legs"). In various embodiments, the lengths of the "legs" or edges of the cross-modal shape may be used to classify which type of user interaction is involved. For example, the area of the triangle or quadrilateral (depending on the number of input gestures) is directly proportional to the cross-modal convergence and can be applied to a wide range of scenarios. In use, two different triangles may have the same area but different side lengths. In this case, the lengths of the edges may be used to classify the subtype of the input convergence, and the area may serve as a proxy for the strength and variance of the user's intent.
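
As a non-limiting illustration, the legs and area of the cross-modal triangle may be computed from three targeting points (e.g., the points where the head, eye, and hand rays intersect a plane at the target depth); how the rays are projected to points is an assumption of this sketch.

```python
# Sketch: compute the legs and area of the "cross-modal triangle" formed by three
# targeting points (e.g., where the head, eye, and hand rays meet the target plane).
import math

def leg_lengths(p_head, p_eye, p_hand):
    d = lambda a, b: math.dist(a, b)
    return d(p_head, p_eye), d(p_eye, p_hand), d(p_hand, p_head)

def triangle_area(legs):
    a, b, c = legs
    s = (a + b + c) / 2.0                      # Heron's formula
    return math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))

legs = leg_lengths((0.00, 0.00), (0.03, 0.01), (0.05, -0.02))
print(legs, triangle_area(legs))   # smaller area -> stronger tri-modal convergence
```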

Examples of interpreting user input based on convergence of user input

In at least some embodiments, the systems disclosed herein can interpret user input based on a convergence of the user input. In the example of fig. 52, the user is providing various user inputs to select an object 5200. The user has turned his head towards the object 5200, providing a head gesture input 5202; the user is looking at the object 5200, providing an eye gaze input 5204; and the user is gesturing with his arm.

In general, the system may have difficulty interpreting the user's arm gesture because it may have several different meanings. Perhaps the user is pointing at object 5200 such that a ray cast from the user's wrist to their palm (or fingertip) represents the intended input. Alternatively, perhaps the user is forming an "O" with their fingers and moving their hand such that the "O" surrounds the object within their line of sight, so that a ray cast from the user's head through the fingers represents the intended input. In another alternative, perhaps the user is pointing as if aiming a water hose at a car, and intends a ray cast from their shoulder to their fingertips to indicate the intended input. In the absence of additional information, it may be difficult for the system to determine which is the intended input.

With the present system, the system can determine which potential inputs (e.g., one of the potential interpretations of an arm or gesture input) converge with another mode of input (e.g., a head or eye gaze gesture input). The system may then assume that the potential interpretation that results in modal convergence is the intended input. In the example of fig. 52, palm-fingertip inputs 5206 (e.g., wrist-palm, joint-fingertip, etc.) converge with head and eye gaze inputs, while head-palm (e.g., head-fingertip) inputs 5208 and shoulder-palm (e.g., shoulder-fingertip) inputs 5210 diverge with other inputs. Thus, the system may determine that the palm-fingertip input 5206 is most likely the intended input, and may then use the palm-fingertip input 5206 to identify the object for selection 5200 (e.g., by using the input 5206 and its uncertainty features to reduce the overall uncertainty features of the three-modality selection of the object 5200). Thus, the system may interpret an arm gesture based at least in part on which possible interpretations result in modal convergence.
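
As a non-limiting illustration, the disambiguation described above may be sketched as selecting the candidate arm-gesture ray whose direction lies closest (within a threshold) to the converged eye/head direction; the candidate names and the 10-degree threshold are assumptions.

```python
# Sketch: pick the arm-gesture interpretation (wrist-to-palm, head-to-fingertip,
# shoulder-to-fingertip) whose ray direction best converges with the eye/head
# direction. Candidate names and the 10-degree threshold are assumptions.
import math

def angle_deg(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def pick_gesture_interpretation(candidates, gaze_direction, max_angle_deg=10.0):
    """candidates: name -> ray direction. Return the converging interpretation, if any."""
    best = min(candidates, key=lambda name: angle_deg(candidates[name], gaze_direction))
    return best if angle_deg(candidates[best], gaze_direction) <= max_angle_deg else None

candidates = {
    "palm_fingertip":     (0.02, 0.00, 1.00),
    "head_fingertip":     (0.30, -0.10, 0.95),
    "shoulder_fingertip": (0.45, -0.20, 0.90),
}
print(pick_gesture_interpretation(candidates, (0.00, 0.00, 1.00)))  # 'palm_fingertip'
```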

Fig. 53, 54, 55, and 56 illustrate additional examples of how user input is interpreted based on the convergence of the user input.

Fig. 53 illustrates how the tri-modal convergence of head gestures (H), eye gaze gestures (E), and palm-fingertip inputs (Ha, for the hand) occurs in near field region 5300 (which may be a task space or examination space of the type shown in figs. 44A, 44B, and 44C), mid field region 5302 (which may correspond to a mid field region between about 1 meter and 4 meters from the user shown in figs. 44A, 44B, and 44C), and far field region 5304 (which may correspond to a region outside of the mid field region of figs. 44A, 44B, and 44C). As shown in the example of fig. 53, the H, E, and Ha inputs may come together at points 5301, 5303, 5305, and 5307. In the example associated with the convergence point 5301, the Ha input may relate to a fingertip position, while in the examples associated with the convergence points 5303, 5305, and 5307, the Ha input may relate to a palm-to-fingertip ray cast.

In at least some embodiments, a system with cross-modality fusion capability may place different weights (e.g., importance values) on different inputs. In general, an input may be dominant (e.g., assigned more weight) when it has factors such as lower error, increased functionality (such as providing depth information), higher frequency, and so forth. Fig. 54, 55, and 56 illustrate how a system with cross-modality fusion capability utilizes inputs of different weights in interpreting user inputs. As an example and in at least some embodiments, the eye gaze (E) input may dominate over the head pose (H) input and the gesture (Ha) input, because the eye gaze input includes an indication of depth (e.g., distance from the user) while the head pose input does not, and because eye gaze (E) is sometimes more indicative of the user's intent than the gesture (Ha) input.

As shown in the example of fig. 54, the eye gaze (E) input may determine how a system with cross-modality fusion capability interprets various other inputs. In the example of fig. 54, the eye gaze (E) input and head pose (H) input appear to converge at points 5400, 5402, and 5404 (which may be in the near, mid, and far field regions, respectively). However, the head pose (H) input also appears to converge with the gesture (Ha) input at points 5401, 5403, and 5405, which lie beyond the convergence points of the E and H inputs (and may be substantially farther away, off the page of fig. 54). The system described herein may decide to ignore the apparent convergence at points 5401, 5403, and 5405, and instead utilize the apparent convergence at points 5400, 5402, and 5404. The system may do so based on the relatively higher weight of the eye gaze (E) input.

In other words, fig. 54 shows how an input such as head pose (H) has multiple interpretations (e.g., one fused with eye gaze (E) and another fused with gesture (Ha)) and how the system selects an interpretation that results in convergence with a higher weighted input. Thus, fig. 54 shows how the system can ignore apparent input convergence when the convergence is inconsistent with another, more dominant input, such as an eye gaze (E) input.

Fig. 55 shows an example similar to fig. 54, except that the eye gaze (E) input converges with the gesture (Ha) input at points 5500, 5502, and 5504. Furthermore, the head pose (H) input clearly converges with the Ha input at points 5501, 5503, and 5505. As with fig. 54, the system may decide in the example of fig. 55 to prefer an apparent convergence involving a more dominant eye gaze (E) input.

In the example of fig. 56, the eye gaze (E) input diverges from both the gesture (Ha) input and the head pose (H) input. In this way, the system may decide to use the apparent convergence of gesture (Ha) inputs and head pose (H) inputs (e.g., inputs to points 5600, 5602, 5604, and 5606) as the intended cross-modal input. The system may filter out eye gaze (E) inputs to points 5601, 5603, 5605, and 5607, or may use the eye gaze (E) inputs for other purposes.

Example diagram of wearable system with cross-modality fusion capability

Fig. 57A is a block system diagram of an example processing architecture of a wearable system 200 (e.g., a wearable system with cross-modality capabilities) that incorporates multiple modes of user input to facilitate user interaction. The processing architecture may be implemented by the local processing and data module 260 shown in fig. 2A or the processor 128 shown in fig. 2B. As shown in fig. 57A, the processing architecture may include one or more interactive blocks, such as blocks 5706 and 5708 that implement a cross-modal fusion technique of the type described herein. Block 5706 receives inputs such as, for example, head pose, eye gaze direction, and gesture inputs, and may apply cross-modal fusion techniques (such as filtering and combining the inputs when input convergence is detected) to those inputs. Similarly, block 5708 receives inputs such as controller inputs, voice inputs, head pose inputs, eye inputs, and gesture inputs, and may apply cross-modal fusion techniques to these (and any other available) user inputs. The integrated and filtered user input may be communicated to and used by various software applications as indicated by the arrows in FIG. 57A.

Fig. 57B is a system block diagram of another example of a processing architecture of wearable system 200 with cross-modality input capabilities. Fig. 57B illustrates how the processing architecture can include a cross-modal interaction toolkit 5752 as part of a software development toolkit (SDK), allowing a software developer to selectively implement some or all of the available cross-modal input capabilities of the system.

Example graphs of vergence distances and regions for converging and diverging user interactions

Fig. 58A and 58B are graphs of the vergence distance and vergence region for various input pairs observed for user interaction with a wearable system, where the user is asked to track a static object using their head, eyes, and controller. In the example of fig. 58A, dynamic cross-modal input fusion is disabled, while in the example of fig. 58B, dynamic cross-modal input fusion is enabled. In the example of fig. 58A, a static object is presented to the user at a first location at time 5810 and then at a second location at time 5820. In the example of fig. 58B, a static object is presented to the user at a first location at time 5830 and then at a second location at time 5840.

Fig. 58A and 58B show changes in head-eye vergence distance 5800, head-controller vergence distance 5802, and controller-eye vergence distance 5804 over time. Head-eye vergence distance 5800 is the distance between the head pose input vector and the eye gaze input vector. Similarly, head-controller vergence distance 5802 is the distance between the head pose input vector and the controller input vector. Additionally, the controller-eye vergence distance 5804 is the distance between the controller input vector and the eye gaze input vector. Fig. 58A and 58B also depict a vergence region 5806, which may indicate a region of uncertainty associated with user input (e.g., the system has uncertainty related to user input tracking the object).
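
As a non-limiting illustration, the pairwise vergence distances plotted in figs. 58A and 58B may be computed as angular separations between each pair of input vectors; the choice of angular distance as the metric is an assumption of this sketch, since the figures do not specify the distance measure.

```python
# Sketch: pairwise vergence distances between the head, eye, and controller input
# vectors. Angular separation is used here as the distance metric (an assumption).
import math

def angular_distance_deg(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def vergence_distances(head, eye, controller):
    return {
        "head_eye":        angular_distance_deg(head, eye),
        "head_controller": angular_distance_deg(head, controller),
        "controller_eye":  angular_distance_deg(controller, eye),
    }

print(vergence_distances((0, 0, 1), (0.05, 0, 1), (0.1, 0.05, 1)))
```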

As shown in fig. 58B (particularly when compared to fig. 58A), cross-modal filtering of the head pose and eye gaze input vectors significantly reduces the vergence region 5806 and the head-eye vergence distance 5800. In particular, and after the initial spike, head-eye vergence distance 5800 may be significantly reduced by dynamic cross-modal filtering as the user shifts his input to the newly presented object at times 5830 and 5840. In the example of fig. 58B, dynamic cross-modal filtering may include recognizing that head gesture and eye gaze inputs have converged, and then integrating the inputs together and applying filtering to achieve a more accurate result than either input alone.

The system may use information similar to the graphs in figs. 58A and 58B to determine the cognitive load of the user (e.g., the effort used in the user's working memory). For example, the rate of rise of the plots 5800-5804, or the time difference between the peaks of these curves, may indicate the mental processing capacity the user can apply to the task. For example, if the cognitive load is lower, the user may be able to apply more working memory to the task, and the rise times may be steeper and the peaks closer together in time, because the user has sufficient cognitive capacity to complete the task (e.g., aim at the moving object). If the cognitive load is high, the user has less working memory to apply to the task, and the rise times may be less steep and the peaks farther apart in time, as the user takes longer to complete the task. Note that the rise time (time to peak) of the eye input tends to be faster than the rise time of the head input, and both tend to be faster than the rise time of the hand input (e.g., it can be seen in figs. 58A and 58B that the head-controller graph 5802 rises less steeply and peaks later than the head-eye graph 5800).

Examples of dwell and feedback

Fig. 59A and 59B illustrate examples of user interaction and feedback during gaze and dwell events. In particular, figs. 59A and 59B illustrate how a user provides input by shifting their eye gaze onto object 5910, gazing at object 5910, and dwelling (e.g., lingering) their eye gaze on object 5910 for a threshold period of time. Graphs 5900 and 5901 illustrate the rate (e.g., velocity) at which the user's eye gaze input changes over time.

At time 5902, the user completes shifting their eye gaze onto object 5910. The system may provide feedback to the user if desired. The feedback may take the form of an indicator 5911 and may indicate that the object 5910 is at least temporarily selected.

After the user's gaze rests on object 5910 for an initial threshold period (represented by time 5904), the system may provide feedback 5912 to the user. In some cases, the dwell time 5904 of the eye on the subject is about 500 ms. Feedback 5912 may be in the form of a progress bar, progress arc, or other such mechanism to generally show what percentage of the dwell time has been completed to successfully provide the dwell user input. As one example, feedback 5914 may take the form of a continuously or gradually updated gaze arc. In at least some embodiments, gaze arc 5914 may approach completion progressively as the user's gaze is incrementally longer (e.g., as shown by the dashed vertical line in graph 5900).

After the user's gaze rests on object 5910 for a threshold period of time (represented by time 5906), the system may provide feedback 5914 to the user. The feedback 5914 may be in any desired form, such as a progress bar for completion, a progress arc for completion, a highlighting of objects, and so forth. In the illustrated example of fig. 59B, the feedback 5914 is in the form of a completed square surrounding object 5910.

Although this is an example based on eye gaze, the concept can be applied to other sensor input modalities, such as head gestures, hand inputs, etc., alone or in combination with other inputs. For example, the dwell time 5904, 5906 of the eye gaze or head pose towards the object may be about 500ms, but if both the eye gaze and head pose inputs converge on the object, the dwell time for the system may be reduced to 300ms to determine that the user has selected the object, given the greater certainty in aiming due to the convergence of the two input modes.
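
As a non-limiting illustration, the reduced dwell requirement under convergence may be implemented as a dwell threshold that depends on how many input modes converge on the object; the 500ms and 300ms values come from the example above, while the tri-modal value is an assumption.

```python
# Sketch: shorten the required dwell time when multiple input modes converge on
# the same object (500 ms single-mode, 300 ms bimodal, per the example above).
DWELL_THRESHOLDS_S = {1: 0.50, 2: 0.30, 3: 0.25}   # the tri-modal value is assumed

def is_selected(dwell_time_s: float, num_converged_modes: int) -> bool:
    threshold = DWELL_THRESHOLDS_S.get(min(num_converged_modes, 3), 0.50)
    return dwell_time_s >= threshold

print(is_selected(0.35, 1))  # False: a single mode needs the full 500 ms dwell
print(is_selected(0.35, 2))  # True: converged eye + head dwell only needs 300 ms
```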

Personalized examples for cross-modality input fusion techniques

The wearable system can monitor the user's interaction with the 3D environment and how the sensor inputs tend to converge or diverge during use. The system may apply machine learning techniques to learn the user's patterns of behavior and convergence/divergence tendencies. For example, the user may have unsteady hands (e.g., due to genetic effects, age, or illness), and thus there may be more jitter associated with hand input because the user's hands tend to shake during totem use. The system may learn this behavior and adjust a threshold (e.g., increase the variance threshold used to determine fusion of the hand input with other inputs), or apply appropriate filtering to compensate for the user's hand jitter (e.g., by adjusting the cutoff frequency of a low pass filter). For users with steadier hands, the threshold or filter may be set differently by the system, since the user's hand sensor input will show less jitter. Continuing with this example, the system can learn how a particular user picks up or grasps an object by learning the sequence and timing of sensor convergence (or divergence) that is specific to that user (see, e.g., the time sequence of head-eye-controller convergence in fig. 58B).

Thus, the system may provide an improved or maximum user experience for any particular user by adaptively integrating an appropriate set of convergence inputs for that particular user. Such personalization may be beneficial to users with poor coordination, illness, age, etc., by allowing the system to customize thresholds and filters to more easily identify input convergence (or divergence).

Cross-modal input techniques may allow the wearable system to operate more efficiently. The wearable system may include a plurality of sensors (see, e.g., the sensors described with reference to fig. 16), including inward and outward facing cameras, depth cameras, IMUs, microphones, user input devices (e.g., totems), electromagnetic tracking sensors, ultrasound, radar or laser sensors, electromyography (EMG) sensors, and so forth. The system may have a processing thread that tracks each sensor input. A sensor processing thread may update at a lower rate until a convergence event is identified, and then the sensor update rate may be increased (at least for the converged inputs), which increases efficiency by using a higher update rate only for converged sensor inputs. As described above, the system may learn user behavior based on the time history of other sensors' convergence and predict which sensor inputs are likely to converge. For example, convergence of head and eye inputs onto a target object may indicate that the user is attempting to grasp or interact with the target object, and the system may accordingly predict that the hand input will converge soon thereafter (e.g., hundreds of milliseconds to about one second later). The system may increase the hand sensor thread update rate based on the prediction, before the hand sensor input actually converges. Such predictive capability may provide the user with little or no perceived latency in the wearable system's response, because the converged (or soon-to-converge) sensors have an increased update rate, thereby reducing the lag or latency of the system.
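
As a non-limiting illustration, the update-rate adjustment described above may be sketched as boosting the rate of a sensor-processing thread when its input has converged or is predicted to converge soon; the rates below are illustrative.

```python
# Sketch: raise the update rate of a sensor-processing thread when its input has
# converged (or is predicted to converge soon); rates in Hz are illustrative.
BASE_RATE_HZ = 30
BOOSTED_RATE_HZ = 120

def thread_update_rate(converged: bool, predicted_to_converge: bool) -> int:
    return BOOSTED_RATE_HZ if (converged or predicted_to_converge) else BASE_RATE_HZ

print(thread_update_rate(converged=False, predicted_to_converge=True))   # 120
```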

As another example, the system may initiate (or wake up) certain processing routines based on the convergence (or lack thereof) of sensor inputs. For example, realistically rendering a virtual diamond in a user's environment may require the application of processor-intensive subsurface scattering (or subsurface light transport) techniques to convey the glints of the virtual diamond. Performing such computationally intensive tasks every time a user merely glances at or near the virtual diamond may be inefficient. Thus, by detecting spatial convergence (e.g., of the user's head, eye, and hand inputs) or temporal convergence (e.g., the eyes gazing at the diamond for longer than a dwell time), the system can perform the computationally intensive subsurface rendering techniques only when the convergence indicates that the user is interacting with the virtual diamond (e.g., by looking at it for a long time, reaching out to grasp it, etc.). This may result in increased efficiency, as these processing routines are only executed when necessary. Although described with respect to subsurface scattering or light transport techniques, other computationally intensive enhancement techniques may additionally or alternatively be applied, such as, for example, reflectograms, surface sparkle effects, gaseous lenses and refraction, particle counting, advanced High Dynamic Range (HDR) rendering or illumination methods, and so forth.

As another example, the user's head may be turning to an object. The system may predict that the user's head will be pointing at the object at a particular time in the future, and the system may begin rendering (or prepare to render) the virtual object such that when the user's head reaches the object at the future time, the system will be able to render the virtual object with little or no perceived latency.

User history, behavior, personalization, and the like may be stored, for example, locally or remotely (e.g., in the cloud), as part of the world map or model 920. The system may start with a default set of cross-modal parameters that gradually improve as the system learns and adapts to the user's behavior. For example, the world map or model 920 may include a cross-modal interaction profile for the user with information about thresholds, variances, filters, etc. that is specific to the manner in which that user interacts with the real or virtual world. Cross-modal input fusion is user-centric, with an appropriate set of sensor inputs fused and integrated together for that particular user to provide improved targeting, tracking, and interaction.

Example wearable System with Electromyography (EMG) sensor

In some embodiments, fine motor control may be achieved through additional sensors, such as those depicted in figs. 60A and 60B, which provide position feedback or additional sensor input. For example, an electromyography (EMG) sensor system as described herein may provide controller or gesture data that may be used as an additional or alternative input in a cross-modal input system, and may facilitate accurate selection and user input. As with the other input modes described herein, when the inputs converge, the inputs received through the EMG system can be opportunistically fused with the other input modes, thereby improving the speed or accuracy of the EMG inputs and any other fused inputs. Thus, EMG sensor inputs may be used with any of the cross-modal, multi-modal, or cross-modality techniques described herein. EMG sensor, as used herein, is a broad term and may include any type of sensor configured to detect neuromuscular activity or neuromuscular signals (e.g., neural activation of spinal motor neurons that innervate muscles, muscle activation, or muscle contraction). The EMG sensors may include mechanomyography (MMG) sensors, sonomyography (SMG) sensors, and the like, which sense muscle contraction. The EMG sensors may include electrodes configured to measure surface or in-body electrical potentials, vibration sensors configured to measure skin surface vibrations, acoustic sensors configured to measure acoustic (e.g., ultrasound) signals caused by muscle activity, and the like.

Referring to figs. 60A and 60B, additional embodiments are shown in which electromyography or EMG techniques may be used to assist in determining the position of one or more portions of the user's body, such as the position of one or more fingers or thumbs of the user and/or the position of one or more hands of the user, while the user is operating the wearable computing system. An example of a wearable computing system is depicted in fig. 60A, which includes a head-mounted module 58, a handheld controller module 606, and a belt pack or base module 70 (e.g., modules described further herein at least in connection with fig. 2B), each of which may be operatively coupled (6028, 6032, 6034, 6036, 6038, 6042) to each other and to other connected resources 46 (such as, for example, cloud resources, which may also be referred to as computing resources for storage and/or processing), such as via wired or wireless connection configurations (such as an IEEE 802.11 connection configuration, a Bluetooth wireless configuration, etc.).

EMG techniques may be used in various sensor configurations, such as, for example, with intracorporeal indwelling electrodes or surface electrodes, to monitor activation of a muscle or muscle group. With advances in modern manufacturing and connectivity, EMG electrodes may be used to form systems, or aspects of systems, that are non-traditional relative to previous uses. Referring again to fig. 60A, one or more EMG electrodes may be coupled to a portion of the body, such as the proximal forearm of the hand (6000), using a cuff or other coupling platform (6026). The EMG electrodes may be operably coupled (6020, 6022, 6024; such as via a direct wired or wireless connection) to a local controller module (6018), which local controller module (6018) may be configured with an on-board power source (such as a battery), a controller or processor, and various amplifiers to assist in improving the signal-to-noise ratio when observing information generated by the associated EMG electrodes (in the illustrated embodiment, three arrays 6012, 6014, and 6016 of non-indwelling EMG surface electrodes are shown; however, this is for illustration only and not for limitation). EMG-related signals may be communicated through the local controller module (6018) to the other modules (70, 606, 58, 46) of the operably coupled system using wired or wireless connection (6044, 6040, 6030, 6046) configurations (such as IEEE 802.11 connection configurations, Bluetooth wireless configurations, etc.) or, if so connected, directly from the electrodes (6012, 6014, 6016) themselves.

Referring again to fig. 60A, the EMG electrodes may be positioned relative to the anatomy of the user such that they may be utilized to track activation of various muscles known to produce various motions at the relevant joints, such as by tracking the muscles acting through the carpal tunnel of the wrist that are used to move the various joints of the hand to produce, for example, hand motions such as hand gestures. In various configurations, the EMG electrodes may be placed on or in the anatomy of the user, such as, for example, on the fingers, arms, legs, feet, neck or head, torso, or the like. A plurality of EMG electrodes may be placed on or in the anatomy of the user in order to detect muscle signals from a plurality of muscles or muscle groups. The configurations shown in figs. 60A and 60B are intended to be illustrative and not limiting.

In one embodiment, with all of the illustrated modules operatively coupled to one another, a central processor, which may reside, for example, in the belt pack or base module 70 or on the cloud 46, may be used to coordinate and refine the tracking of gestures of the hand 6000 that may be visible through camera elements of the head-mounted module 58 (such as, for example, the world camera 124 of the head-mounted wearable device 58 of fig. 2B). In some embodiments, gestures may also be tracked by features of the handheld controller module 606 that include certain camera elements (such as the world camera 124 of the handheld assembly 606 of fig. 2B), which may be capable of capturing various aspects of information about the location of the hand 6000, depending on the location/orientation of the various camera elements and the location/orientation of the user's relevant hand 6000. In other words, in one embodiment, EMG data that predicts hand motion may be used alone, or together with camera-based data regarding observation of the hand, to help refine the system's prediction of the hand's position in space relative to the various other modules, and of what various parts of the hand are doing.

For example, one or more camera views may be utilized to provide a prediction that the user is making the U.S. non-verbal "OK" sign with his or her thumb and index finger (an example of an "OK" sign is shown in fig. 42C). The EMG data associated with the various muscles passing through the user's carpal tunnel can be observed to further confirm that, in fact, the user's thumb and index finger appear to bend in a manner generally associated with making the U.S. non-verbal "OK" sign, and thus the system can provide a more accurate prediction as to what the user is doing. This perception of "OK" may be fed back into the system to provide an indication from the user that a next step, dialog box, or the like is accepted within software that may be operated by the user while the user wears the various wearable computing modules 70, 606, 58, 6010 (wearable EMG module). In one variation, for example, at a given operational juncture, the associated software may present a dialog box to the user asking the user to select "OK" or "decline"; for example, the user may view the dialog box in an augmented reality or virtual reality visualization mode through the head-mounted module 58. In this illustrative example, the user may make the "OK" sign by hand in order to select the "OK" button in the software, as described above. The three electrode arrays (6012, 6014, 6016) may be used to assist in common-mode error suppression to refine the output of the EMG module, and/or may be used to observe different muscles or portions thereof to assist in observing activity occurring under the user's skin that may be related to, for example, the user's hand movements.
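
By way of non-limiting illustration only, the following sketch shows one possible way such camera-based and EMG-based evidence for the same sign could be combined into a single confidence value. It is written in C# (the language of the example code referenced in Appendix A), but it is not taken from Appendix A; the class, method, and parameter names are hypothetical, and the log-odds combination shown is merely one assumed fusion rule, not a required implementation.

using System;

// Minimal sketch (assumed): combining a camera-based estimate and an EMG-based
// estimate that the user is making the "OK" sign into one fused confidence.
public static class GestureFusionSketch
{
    // cameraConfidence, emgConfidence: probabilities in [0, 1] from each modality.
    public static double FuseOkSignConfidence(double cameraConfidence, double emgConfidence)
    {
        // Treat the two estimates as independent evidence and combine them with
        // a simple log-odds (naive Bayes style) update around a neutral 0.5 prior.
        double Clamp(double p) => Math.Min(0.999, Math.Max(0.001, p));
        double logOdds = Math.Log(Clamp(cameraConfidence) / (1 - Clamp(cameraConfidence)))
                       + Math.Log(Clamp(emgConfidence) / (1 - Clamp(emgConfidence)));
        return 1.0 / (1.0 + Math.Exp(-logOdds));
    }
}

For instance, under this assumed rule a camera-based confidence of 0.7 and an EMG-based confidence of 0.8 combine to a fused confidence of approximately 0.9, which the software could then treat as acceptance of the "OK" selection.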

Although the foregoing example is described in the context of gestures (and in particular the "OK" sign), this is merely exemplary and not a limitation of EMG sensor systems. As described above, EMG electrodes may be placed on or in the anatomy of the user to measure signals from other muscle groups in order to determine that the user is making any form of gesture (e.g., a non-verbal sign). Examples of gestures (or non-verbal signs) have been described with reference to gesture 2080 of fig. 20 or with reference to the gestures described with reference to figs. 42A-42C. Thus, the EMG system may measure gestures, non-verbal signs, positions, or movements of the fingers, arms, feet, legs, torso (e.g., twists or bends), neck, head, or the like.

Referring to fig. 60B for illustrative purposes, an embodiment similar to that of fig. 60A is depicted, but without the interconnected handheld module 606, as such a handheld module may not be needed or desired for certain system configurations or functional paradigms.

Additional examples of cross-modal input fusion techniques

This section provides additional details regarding examples of various implementations of cross-modal input fusion. Cross-modal input fusion can provide opportunistic fusion of multimodal input features using egocentric motion dynamics to improve interaction fidelity. These example implementations are intended to be illustrative and not restrictive. These techniques may be performed by the wearable display systems described elsewhere in this application (e.g., see wearable system 200 described in figs. 2A and 2B). Any particular wearable system may implement one, some, or all of these functions and techniques, or may implement additional or different functions and techniques.

The following provides an explanation of some terms for the cross-modal fusion technique described herein. These explanations are exemplary only, and not limiting:

elements: discrete interactive display items.

Primary aiming vector: The dominant input pose vector used to steer a spatial targeting method (e.g., ray casting, cone casting, ball casting, a hit test associated with a collider, or a normal).

Input pose vector: A pose obtained from a standard system input modality. This may be a head gaze pose, an eye gaze pose, a controller pose, or a hand pose. It may also come from cross-modal static hybrid inputs that create statically fused pose vectors, such as poses derived from a controller and touchpad or from eye gaze and a touchpad.

Interaction field: A region that may be based on the effective reach of the user and constrained by overlap limitations from the sensing field, the display field, and the audio field.

Region of intent (ROI): A volume made up of the overlapping uncertainty regions (volumes) associated with a pair or triplet of aiming vectors.

Cross-modal state identification: The identification of cross-modal states may be performed by analyzing the relative convergence of all defined and available input aiming vectors. This can be achieved by first checking the angular distance between the aiming vectors in each pair and then checking the relative variance of each pair. If the distance and variance are below defined thresholds, a bimodal state may be associated with that pair. If a triplet of aiming vectors has a set of angular distances and variances below defined thresholds, an associated region may be established. In the case of a head pose aiming vector (head gaze), an eye vergence aiming vector (eye gaze), and a tracked controller or tracked hand (hand pointer) aiming vector, a triplet that meets these requirements is called a cross-modal triangle. The relative sizes of the triangle sides, the area of the triangle, and its associated variance present characteristic features that can be used to predict aiming and activation intent. Group motions of aiming vector pairs (bimodal), triplets (trimodal), or larger sets of sensor inputs (e.g., 4, 5, 6, 7, or more) can be used to further define exact subtypes of cross-modal coordination. A minimal sketch of this check is provided after these definitions.

Selection and configuration of the fusion method: An appropriate fusion method may be determined based on the detailed cross-modal state of the input system. This may be defined by a cross-modal type, a motion type, and an interaction field type. The defined fusion type determines which of the listed input modes (and associated input vectors) should be selectively fused. The motion type and field type determine the fusion method settings.
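
By way of non-limiting illustration only, the following C# sketch shows one possible form of the cross-modal state check described above, using pairwise angular distances between the aiming vectors and the recent variance of each pair. The thresholds, type names, and the source of the variance estimates are assumptions for illustration and are not taken from Appendix A.

using System;
using System.Numerics;

// Minimal sketch (assumed) of cross-modal state identification from three
// aiming vectors: head gaze, eye gaze, and hand pointer.
public enum CrossModalState { Unimodal, Bimodal, Trimodal }

public static class CrossModalStateSketch
{
    // Angular distance (radians) between two aiming directions.
    static double AngularDistance(Vector3 a, Vector3 b)
    {
        double d = Vector3.Dot(Vector3.Normalize(a), Vector3.Normalize(b));
        return Math.Acos(Math.Max(-1.0, Math.Min(1.0, d)));
    }

    // pairVariances: recent variance of each pairwise angular distance,
    // ordered as head-eye, head-hand, eye-hand (assumed bookkeeping).
    public static CrossModalState Identify(
        Vector3 headGaze, Vector3 eyeGaze, Vector3 handPointer,
        double[] pairVariances,
        double angleThreshold = 0.10,    // ~6 degrees, assumed
        double varianceThreshold = 0.01) // assumed
    {
        double headEye  = AngularDistance(headGaze, eyeGaze);
        double headHand = AngularDistance(headGaze, handPointer);
        double eyeHand  = AngularDistance(eyeGaze, handPointer);

        bool headEyeConverged  = headEye  < angleThreshold && pairVariances[0] < varianceThreshold;
        bool headHandConverged = headHand < angleThreshold && pairVariances[1] < varianceThreshold;
        bool eyeHandConverged  = eyeHand  < angleThreshold && pairVariances[2] < varianceThreshold;

        // All three pairs converged: the triplet forms a small, stable
        // "cross-modal triangle" and a trimodal state is reported.
        if (headEyeConverged && headHandConverged && eyeHandConverged)
            return CrossModalState.Trimodal;

        if (headEyeConverged || headHandConverged || eyeHandConverged)
            return CrossModalState.Bimodal;

        return CrossModalState.Unimodal;
    }
}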

The techniques described herein may allow a user to select smaller elements by reducing the uncertainty of the primary input targeting method. The techniques described herein may also be used to speed up and simplify the selection of elements. The techniques described herein may allow a user to increase the success rate of aiming at a moving element. The techniques described herein may be used to accelerate rich rendering of display elements. The techniques described herein may be used to prioritize and accelerate local (and cloud) processing of object point cloud data, dense meshing, and plane acquisition to improve the interaction fidelity of discovered entities and surfaces and of grasped objects of interest.

The techniques described herein allow, for example, wearable system 200 to establish varying degrees of cross-modal focus from the perspective of the user, while still preserving the subtle movements of the head, eyes, and hands, thereby significantly enhancing the system's understanding of the user's intent.

Examples of some of the functions that can be addressed by the cross-modal input fusion technique are described below. Such functions include, but are not limited to: targeting smaller elements; fast targeting of static, closely-proximate elements; targeting dynamically moving elements; managing transitions between near-field and mid-field targeting methods; managing transitions between relative targeting methods; managing activation of elements; managing manipulation of elements; managing transitions between macro and micro manipulation methods; managing deactivation and integration of elements; managing active physical modeling of elements; managing active rendering of display elements; managing active collection of dense point clouds; managing active collection of dense meshes; managing active collection of planar surfaces; managing active collection of dynamically discovered entities; and managing active modeling of grasped discovered entities.

(1) Targeting smaller elements: The techniques may provide for targeting smaller elements (discrete interactive display items). Aiming at smaller items from a distance can be inherently difficult. As the presented size of the target item decreases toward the accuracy limits, reliably intersecting it with the projected aiming mechanism may become increasingly difficult. Opportunistic fusion of multimodal inputs can improve effective accuracy by reducing the uncertainty of the primary targeting vector. Embodiments of wearable system 200 may perform any combination of the following actions: identifying the current cross-modal state; identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); selecting the correct input fusion method and settings for the (static/pseudo-static/dynamic) primary targeting vector; applying the defined conditioning to the primary targeting vector; transmitting the stabilized gesture vector to the application with increased confidence; reducing aiming vector registration error, jitter, and drift; and enabling confident targeting of smaller elements that are near or far (as compared to modal targeting methods).
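
By way of non-limiting illustration only, one possible form of the conditioning applied to a component of the primary targeting vector is an adaptive-cutoff low-pass filter of the kind referenced in the enumerated examples below (e.g., a One Euro filter). The following single-axis C# sketch is illustrative; the parameter values are assumptions, and in practice a separate filter instance could be run per axis of the aiming pose.

using System;

// Minimal sketch (assumed): single-axis One Euro filter for conditioning one
// component of a primary targeting vector (e.g., an azimuth angle).
public sealed class OneEuroFilter
{
    readonly double minCutoff, beta, dCutoff;
    double? prevX;
    double prevDx;

    public OneEuroFilter(double minCutoff = 1.0, double beta = 0.02, double dCutoff = 1.0)
    {
        this.minCutoff = minCutoff; this.beta = beta; this.dCutoff = dCutoff;
    }

    static double Alpha(double cutoff, double dt)
    {
        double tau = 1.0 / (2.0 * Math.PI * cutoff);
        return 1.0 / (1.0 + tau / dt);
    }

    // x: raw sample; dt: seconds since the previous sample.
    public double Filter(double x, double dt)
    {
        if (prevX is null) { prevX = x; prevDx = 0; return x; }

        // Smooth the derivative, then let the speed of motion raise the cutoff:
        // slow motion -> low cutoff (less jitter), fast motion -> high cutoff (less lag).
        double dx = (x - prevX.Value) / dt;
        double dxHat = prevDx + Alpha(dCutoff, dt) * (dx - prevDx);
        double cutoff = minCutoff + beta * Math.Abs(dxHat);
        double xHat = prevX.Value + Alpha(cutoff, dt) * (x - prevX.Value);

        prevX = xHat; prevDx = dxHat;
        return xHat;
    }
}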

(2) Fast targeting of static, closely-proximate elements: This technique can provide reliable targeting of static elements in close proximity. Because the separation distance between such objects may be smaller than the accuracy of the primary input vector, objects in close proximity may be inherently difficult to resolve and reliably aim at. Using a conditioned input gesture vector to target a static, closely-proximate target provides an improvement in accuracy with only a limited change in perceived latency. Embodiments of wearable system 200 may perform any combination of the following actions: identifying the current cross-modal state; identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); selecting the correct input fusion method for the (pseudo-static) primary targeting vector; applying the defined conditioning to the primary targeting vector; presenting the application with a stable gesture vector with increased confidence; and reducing fixation and dwell time.

(3) Targeting dynamically moving elements: This technique can provide reliable targeting of moving elements, including elements that pan, rotate, or zoom relative to the display or relative to the world. Moving objects in a dynamic environment may be inherently difficult to track. Such targeting may be even more challenging given the increased number of degrees of freedom provided by dynamic 3D content distributed throughout the world and by head- and eye-driven display methods. Embodiments of wearable system 200 may perform any combination of the following actions: identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); selecting the correct input fusion method for the (dynamic) primary targeting vector; transmitting the stable gesture vector to the application with increased confidence; reducing aiming vector registration error, jitter, and drift; and enabling confident targeting of smaller elements, near or far, that move at faster speeds and at non-linear rates (as compared to modal targeting methods).

(4) Managing transitions between near-field and mid-field targeting methods: This technique provides a marshaling process between near-field targeting mechanisms and mid-field/far-field targeting methods. There are various direct and indirect targeting methods. Typical direct methods occur in the near field, while indirect methods target content in the mid and far fields. Knowing whether the region of intent (ROI) is in the near field or the mid field allows selection between method groups, which also provides opportunities and methods for identifying field transition events and handling the associated transitions in the interaction mechanism, thereby reducing the need to explicitly manage mode changes in the application layer. Embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); selecting the correct input fusion method for the (dynamic) primary targeting vector; and presenting the stable gesture vector to the application with increased confidence.
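
By way of non-limiting illustration only, the following C# sketch shows one possible marshaling step: classifying the interaction field from the distance of the ROI and raising a transition event when the field changes, so that the interaction mechanism can be swapped below the application layer. The boundary distances and type names are assumptions for illustration.

using System;

// Minimal sketch (assumed): selecting an interaction field from the ROI
// distance and detecting field transition events.
public enum InteractionField { Near, Mid, Far }

public sealed class InteractionFieldMarshal
{
    const double NearBoundaryMeters = 0.65; // roughly arm's reach, assumed
    const double MidBoundaryMeters  = 3.0;  // assumed

    InteractionField? current;

    // Raised when the ROI crosses a field boundary, so the targeting mechanism
    // can be changed without the application layer managing mode changes.
    public event Action<InteractionField, InteractionField> FieldTransition;

    public InteractionField Update(double roiDistanceMeters)
    {
        InteractionField next =
            roiDistanceMeters < NearBoundaryMeters ? InteractionField.Near :
            roiDistanceMeters < MidBoundaryMeters  ? InteractionField.Mid  :
                                                     InteractionField.Far;

        if (current is InteractionField previous && previous != next)
            FieldTransition?.Invoke(previous, next);

        current = next;
        return next;
    }
}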

(5) Managing transitions between relative targeting methods: The techniques may provide for reliably selecting the optimal or most likely interaction mechanism within an interaction field. Embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying the current cross-modal state; identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); selecting the most suitable aiming vector origin and endpoint for the primary input vector; and managing any transitions between aiming vectors (either within the same mode or between modes). A pre-targeting "cool down" period may be enforced between targeting feedbacks in order to reduce any disorientation due to changes in the targeting vector within the field of view (FOV).

(6) Managing activation of elements: Embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying the current cross-modal state; identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); presenting the application with a stable gesture vector with increased confidence; and enabling confident activation of near or far smaller elements that move at faster speeds and at non-linear rates (as compared to modal targeting methods).

(7) Managing manipulation of elements: Embodiments of wearable system 200 may perform any combination of the following actions: identifying the current cross-modal state; identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); presenting the application with a stable gesture vector with increased confidence; enabling confident steering and manipulation of near or far smaller elements that move at faster speeds and at non-linear rates (as compared to modal targeting methods); enabling trusted micro-steering and manipulation of elements; and enabling confident micro-steering actions.

(8) Managing transitions between macro and micro manipulation methods: Embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); identifying the relevant modal macro interaction mechanisms and checking whether enhancement is enabled; activating the related micro-interaction management method; identifying whether cross-modal divergence occurs; and preparing the system for deactivation of the operational micro-gesture by reducing the cross-modal confidence.

For example: a pinch gesture. Pinch manipulation can be aggressively enhanced by micro-pinch manipulations (such as a thumb tap on an index finger joint or an index-finger-to-thumb slide action), but these methods may be enabled only in regions of robust tracking, speed, and confidence and with robust user gaze metrics (e.g., a gaze or dwell time exceeding a user gaze threshold time of, for example, hundreds to thousands of milliseconds). Cross-modal gaze can thus be used as a more robust way to conditionally activate micro-gesture analysis.
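
By way of non-limiting illustration only, the following C# sketch shows one possible gate for conditionally activating micro-pinch analysis only after a cross-modal fixation on the target has persisted past a dwell threshold and while hand tracking remains robust. The threshold values and member names are assumptions for illustration.

using System;

// Minimal sketch (assumed): dwell-gated activation of micro-gesture analysis.
public sealed class MicroGestureGate
{
    static readonly TimeSpan DwellThreshold = TimeSpan.FromMilliseconds(500); // assumed
    const double MinTrackingConfidence = 0.8;                                 // assumed

    DateTime? fixationStart;

    // Called once per frame with the current cross-modal state.
    public bool MicroPinchEnabled(bool crossModalFixationOnTarget,
                                  double handTrackingConfidence,
                                  DateTime now)
    {
        if (!crossModalFixationOnTarget || handTrackingConfidence < MinTrackingConfidence)
        {
            // Divergence or weak tracking: reset the dwell timer and prepare
            // to disable micro-gesture analysis by reducing confidence.
            fixationStart = null;
            return false;
        }

        if (fixationStart == null) fixationStart = now;
        return now - fixationStart.Value >= DwellThreshold;
    }
}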

(9) Managing deactivation and integration of elements: Embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying the current cross-modal state; identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); presenting the application with a stable gesture vector with increased confidence; and enabling confident micro-motions that can result in more robust modal input state changes.

For example: greater confidence in the transition from pinch touch to pinch hover to pinch end results in more predictable virtual object release and separation behavior.

For example: when the hand is partially occluded, greater confidence in hand trajectory and gesture transitions may result in better recognition of deactivation and more reliable end-state characteristics. This may result in more reliable separation behavior, a more stable throwing mechanism, and more realistic physical behavior.

(10) Managing physical modeling of elements: Embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying the current cross-modal state; identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); presenting the application with a stable gesture vector with increased confidence; enabling confident custom processing of local conditions resulting from standard (e.g., simulated) physical interactions in the ROI, which may result in more robust state changes in the physics engine; and enabling efficient management and context-driven optimization of advanced (e.g., simulated) physical behaviors, such as hyper-local soft-body simulation, inelastic collisions, or liquid simulation.

For example: with greater confidence, a thrown or placed object exhibits more predictable and intentional physical behavior.

(11) Managing active rendering of display elements: Embodiments of wearable system 200 may perform any combination of the following for passive cross-modality intent: identifying a cross-modal gaze point and defining an extended intent region based on the cross-modal gaze time and the predicted dwell time; identifying whether the intent region intersects a rendering element and its enhancement options; identifying available enhancements compatible with the ROI, gaze time, and predicted dwell time; activating the optimal rendering enhancement option (which may require second and third rendering passes), such as a level-of-detail (LOD) method that may be activated during the predicted dwell time; and detecting cross-modal divergence and managing rendering priorities. Activating such LOD methods may, for example, include increasing the resolution or quality of the virtual content or other graphics to be displayed or otherwise presented to the user. Similarly, in some implementations, activating such LOD methods may include improving the quality of audio output to the user.

For example: detailed reflection maps, surface sparkle effects, subsurface scattering, gaseous lensing and refraction, particle counting, advanced high dynamic range (HDR) rendering or illumination methods, and the like. These proposed mechanisms may differ from typical foveated rendering techniques driven by eye vergence locations, which tend to use first-order/first-pass rendering methods to manage polygon count and pixel density and generally must work on different time scales in order to remain imperceptible. Using cross-modal gaze and dwell allows rendering options that involve higher latency, which are typically not attempted on mobile platforms due to computational limitations.
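
By way of non-limiting illustration only, the following C# sketch shows one possible way to select a rendering enhancement tier for an element in the intent region based on the cross-modal gaze time and the predicted dwell time. The tiers and time budgets are assumptions for illustration.

using System;

// Minimal sketch (assumed): choosing a rendering enhancement tier so that the
// extra render passes can complete, and stay imperceptible, within the
// predicted dwell window.
public enum RenderEnhancement { None, ReflectionMap, SparkleAndScattering, FullHdrPass }

public static class PassiveRenderSketch
{
    public static RenderEnhancement Select(TimeSpan gazeTime, TimeSpan predictedDwell, bool intersectsRoi)
    {
        if (!intersectsRoi) return RenderEnhancement.None;

        if (predictedDwell > TimeSpan.FromSeconds(2) && gazeTime > TimeSpan.FromMilliseconds(800))
            return RenderEnhancement.FullHdrPass;
        if (predictedDwell > TimeSpan.FromSeconds(1) && gazeTime > TimeSpan.FromMilliseconds(400))
            return RenderEnhancement.SparkleAndScattering;
        if (gazeTime > TimeSpan.FromMilliseconds(150))
            return RenderEnhancement.ReflectionMap;
        return RenderEnhancement.None;
    }
}

On detection of cross-modal divergence, such an enhancement could simply be dropped back to a lower tier as part of managing rendering priorities.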

(12) Managing active collection of dense point clouds: Embodiments of wearable system 200 may perform any combination of the following for passive cross-modality intent: identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); accelerating the processing of the point cloud of interest; and improving the fidelity of object interactions by preparing for direct observation and interaction.

For example: binding dense point clouds to the objects of discovered entities to enable rich element interactions on the surfaces of the discovered entities (dynamic object textures).

(13) Managing active collection of dense meshes: Embodiments of wearable system 200 may perform any combination of the following for passive cross-modality intent: identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); accelerating the processing of the dense mesh of interest; and improving the fidelity of object interactions by preparing for direct observation and interaction.

For example: concave meshing to improve dense-mesh occlusion of objects in the near field, or non-planar surface touch interactions.

(14) Managing active collection of planar surfaces: Embodiments of wearable system 200 may perform any combination of the following for passive cross-modality intent: identifying the ROI defined by cross-modal vergence and gaze; identifying the corresponding interaction field (near/medium/far); accelerating the processing of fast planes of interest; and improving the fidelity of surface interactions by preparing for direct observation and interaction.

For example: improving surface touch interaction and actively reducing errors in surface touch tracking.

(15) Managing active collection of dynamically discovered entities: embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying ROIs defined by cross-modal vergence, gaze, and smooth pursuit; identifying the corresponding interaction field (near/medium/far); accelerating processing of the discovered entities of interest; and improving the fidelity of discovered entity interactions by preparing for dynamic motions or interactions.

For example: pre-provisioning the system for interaction with discovered entities may reduce the apparent latency of dynamically discovered entity tracking.

(16) Managing active grasping of dynamically discovered entities: embodiments of wearable system 200 may perform any combination of the following for active cross-modality intent: identifying ROIs defined by cross-modal vergence, gaze, and smooth pursuit; identifying the corresponding interaction field (near/medium/far); accelerating processing of the discovered entities of interest; enabling local processing of the discovered entities of interest; and improving the fidelity of discovered physical interactions by preparing dynamic interactions based on hand grips.

For example: preparing the system in advance for hand-tracking interaction with a grasped object may reduce the apparent latency of tracking the grasped object, or may improve real-time object segmentation methods and pose estimation while the object is grasped. Objects that are often grasped may produce better segmentation models in the cloud, which can be used for personalized, user-specific optimization of hand tracking for grasping.

The foregoing functionality may be provided by various implementations of cross-modal input fusion techniques, and is not limiting. A wearable system (e.g., such as wearable system 200) may perform embodiments of one, some, or all of these techniques, or may perform additional or different cross-modal input fusion techniques. Many variations and combinations are possible.

Example software code

Appendix A includes examples of code in the C# programming language that may be used to perform example implementations of the cross-modal input fusion technique described herein. In various implementations, the programs in Appendix A may be executed by the local processing and data module 260, the remote processing module 270, the central runtime server 1650, the processor 128, or other processors associated with wearable system 200. Appendix A is intended to illustrate example implementations of various features of the cross-modal input fusion technique and is not intended to limit the scope of the technique. Appendix A is incorporated herein by reference in its entirety to form a part of this specification.

Example neurophysiological approaches to cross-modal input fusion techniques

Without intending to be bound or limited by any particular neurophysiological model or sensorimotor paradigm, certain embodiments of the cross-modal input fusion system and method may apply or utilize the teachings of such models or paradigms to, for example, sensor convergence or divergence, cross-modal input fusion techniques, sensor input filtering, identification or characterization of cross-modal states, operations performed using cross-modal fusion inputs, and the like.

For example, many leading theories in the field of neurophysiology suggest that the kinematic and kinetic properties of human motor behavior are too large and complex to be reasonably controlled by a single internal model or computational scheme. Instead, the human brain has been hypothesized to employ a modular computing architecture. Examples of theoretical architectures that may be useful in describing the human sensorimotor paradigm, or aspects thereof, include the Multiple Paired Forward-Inverse Models ("MPFIM") architecture, mixture-of-experts architectures, the MOdular Selection And Identification for Control ("MOSAIC") model, and the like. In some implementations, certain aspects of one or more of the systems and techniques described herein may be functionally similar or analogous to aspects of such theoretical architectures.

Within such architectures, switching between multiple different "modules" (also referred to as "synergies" or "coordination structures") may be performed based on context. Each module may physically correspond to a neurophysiological mechanism, but may logically represent a complex dynamical system that may be described using differential equations and configured to implement a particular motor or sensorimotor control strategy.

In some examples, each module may contain a forward model and an inverse model. For example, such an inverse model may be considered specific to a particular behavioral context, while the corresponding forward model may be considered to determine the "responsibility" of such an inverse model in the current context. In operation, the inverse model may receive a reference input indicative of a target sensory state, and then compute and provide a motion command (e.g., one or more muscle activation signals) to a "plant" or motor unit (e.g., one or more muscles) in the system. The plant may then execute the motion command, effectively creating a new sensory state. The inverse model may also provide an efference copy of the motion command to the forward model, which may in turn calculate a predicted sensory state. The new sensory state may be evaluated relative to the predicted sensory state to generate an error signal that may be used by the inverse model as feedback to correct the current motion or otherwise improve system performance. In effect, the feedback loop formed between the forward model and the inverse model creates an inevitable interplay between the inputs of the module and the outputs of the module.

In some embodiments, one or more of the cross-modal input fusion systems and techniques described herein may seek to detect the occurrence of a switching event in which one or more modules are activated, deactivated, or a combination thereof. To this end, one or more of the systems described herein may monitor module outputs (e.g., movements and positions of muscles, joints, or other anatomical features that may be tracked by electronic sensing components of the system) for indications of the feedback stabilization process that occurs immediately or relatively soon after activation of a given module (e.g., by virtue of its respective step response), or for other indications of an operational bifurcation. Such feedback stabilization processes may produce random convergence between at least one pair of inputs to or outputs from the modules. That is, as a module initially stabilizes itself upon activation, the inputs to or outputs from the modules may become increasingly influential on one another. For example, a change in statistical variance, covariance, or correlation between a given pair of module outputs monitored by the system may indicate a convergence event (e.g., module activation, an increase in a module's control contribution, etc.) or a divergence event (e.g., module deactivation, a decrease in a module's control contribution, etc.).
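
By way of non-limiting illustration only, the following C# sketch shows one possible way to monitor a pair of module outputs over a sliding window and to flag convergence and divergence events from changes in their sample correlation. The window length, thresholds, and type names are assumptions for illustration.

using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch (assumed): flagging convergence/divergence events from the
// rolling correlation between two monitored module output signals.
public sealed class ConvergenceMonitor
{
    public enum Event { None, Convergence, Divergence }

    readonly int windowSize;
    readonly Queue<(double a, double b)> window = new Queue<(double a, double b)>();
    double? previousCorrelation;

    public ConvergenceMonitor(int windowSize = 60) => this.windowSize = windowSize;

    public Event AddSample(double outputA, double outputB,
                           double convergeThreshold = 0.8, double divergeThreshold = 0.4)
    {
        window.Enqueue((outputA, outputB));
        if (window.Count > windowSize) window.Dequeue();
        if (window.Count < windowSize) return Event.None;

        // Sample correlation over the window.
        double meanA = window.Average(s => s.a);
        double meanB = window.Average(s => s.b);
        double cov  = window.Sum(s => (s.a - meanA) * (s.b - meanB));
        double varA = window.Sum(s => (s.a - meanA) * (s.a - meanA));
        double varB = window.Sum(s => (s.b - meanB) * (s.b - meanB));
        double corr = (varA <= 0 || varB <= 0) ? 0 : cov / Math.Sqrt(varA * varB);

        // A rise above the upper threshold suggests module activation (convergence);
        // a fall below the lower threshold suggests module deactivation (divergence).
        Event result = Event.None;
        if (previousCorrelation is double prev)
        {
            if (prev < convergeThreshold && corr >= convergeThreshold) result = Event.Convergence;
            else if (prev > divergeThreshold && corr <= divergeThreshold) result = Event.Divergence;
        }
        previousCorrelation = corr;
        return result;
    }
}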

Other considerations

Although certain examples of cross-modal input fusion have been described herein in the context of an AR/MR/VR system, this is merely exemplary and not limiting. Embodiments of the cross-modal input fusion techniques described herein may be applied to, for example, robotics, drones, user-guided perception, human-machine interaction, human-computer interaction, brain-computer interface, user experience design, and the like. For example, a robotic system or drone may have multiple input modes, and cross-modal techniques may be used to dynamically determine which of the multiple inputs have converged and utilize the converged input mode as described above.

Each of the processes, methods, and algorithms described herein and/or depicted in the figures can be embodied in, and fully or partially automated by, code modules executed by one or more physical computing systems, hardware computer processors, application specific circuits, and/or electronic hardware configured to execute specific and special computer instructions. For example, a computing system may include a general purpose computer (e.g., a server) or a special purpose computer, special purpose circuits, and so forth, programmed with specific computer instructions. Code modules may be compiled and linked into executable programs, installed in dynamically linked libraries, or written in interpreted programming languages. In some implementations, certain operations and methods may be performed by circuitry that is dedicated to a given function.

Furthermore, certain implementations of the functionality of the present disclosure are sufficiently complex mathematically, computationally, or technically that application-specific hardware or one or more physical computing devices (with appropriate specific executable instructions) may be necessary to perform the functionality, e.g., due to the number or complexity of computations involved or to provide the results in substantially real-time. For example, video may include many frames, each having millions of pixels, and specifically programmed computer hardware is necessary to process the video data to provide the desired image processing task or application in a commercially reasonable amount of time. Furthermore, cross-modal techniques may utilize dynamic monitoring of sensor inputs to detect convergence and divergence events, and may utilize complex hardware processor or firmware based solutions for real-time execution.

The code modules or any type of data may be stored on any type of non-transitory computer readable medium, such as physical computer memory, including hard drives, solid state memory, Random Access Memory (RAM), Read Only Memory (ROM), optical disks, volatile or non-volatile memory, combinations thereof, and/or the like. The methods and modules (or data) may also be transmitted as a generated data signal (e.g., as part of a carrier wave or other analog or digital propagated signal) over a variety of computer-readable transmission media, including wireless-based and wired/cable-based media, and may take many forms (e.g., as part of a single or multiplexed analog signal, or as multiple discrete digital packets or frames). The results of the disclosed methods or method steps may be stored persistently or otherwise in any type of non-transitory tangible computer memory, or may be communicated via a computer-readable transmission medium.

Any process, block, state, step, or function in the flowcharts described herein and/or depicted in the figures should be understood as potentially representing a module, segment, or portion of code which includes one or more executable instructions to implement a particular function (e.g., logical or arithmetic) or step in a method. Various methods, blocks, states, steps or functions may be combined, rearranged, added to, deleted, modified or otherwise altered with the illustrative examples provided herein. In some embodiments, additional or different computing systems or code modules may perform some or all of the functions described herein. The methods and processes described herein are also not limited to any particular sequence, and the blocks, steps, or states associated therewith may be performed in other sequences as appropriate, e.g., serially, in parallel, or in some other manner. Tasks or events can be added to, or removed from, the disclosed exemplary embodiments. Moreover, the separation of various system components in the implementations described herein is for illustrative purposes, and should not be understood as requiring such separation in all implementations. It should be understood that the described program components, methods and systems can generally be integrated together in a single computer product or packaged into multiple computer products. Many implementation variations are possible.

The processes, methods, and systems may be practiced in network (or distributed) computing environments. Network environments include enterprise-wide computer networks, intranets, Local Area Networks (LANs), Wide Area Networks (WANs), Personal Area Networks (PANs), cloud computing networks, crowd-sourced computing networks, the internet, and the world wide web. The network may be a wired or wireless network or any other type of communication network.

The systems and methods of the present disclosure each have several inventive aspects, no single one of which is fully responsible for or required for the desirable attributes disclosed herein. The various features and processes described above may be used independently of one another or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of the present disclosure. Various modifications to the embodiments described in this disclosure may be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the claims are not intended to be limited to the embodiments shown herein but are to be accorded the widest scope consistent with the present disclosure, the principles and novel features disclosed herein.

Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination. No single feature or group of features is essential or essential to each embodiment.

Conditional language, such as "can," "might," "should," "may," "for example," and the like, or otherwise understood in the context, as used herein, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps, unless expressly stated otherwise. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include instructions for deciding, with or without author input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment. The terms "comprising," "including," "having," and the like, are synonymous and are used inclusively, in an open-ended fashion, and do not exclude additional elements, features, acts, operations, and the like. Furthermore, the term "or" is used in its inclusive sense (and not its exclusive sense), so that when used, for example, to connect lists of elements, the term "or" refers to one, some, or all of the elements in the list. In addition, the articles "a", "an" and "the" as used in this application and the appended claims should be construed to mean "one or more" or "at least one" unless specified otherwise.

As used herein, a phrase referring to "at least one of" a list of items refers to any combination of those items, including single members. For example, "at least one of A, B, or C" is intended to cover: A; B; C; A and B; A and C; B and C; and A, B, and C. Conjunctive language such as the phrase "at least one of X, Y, and Z," unless specifically stated otherwise, is otherwise understood with the context as used in general to convey that an item, term, etc. may be at least one of X, Y, or Z. Thus, such conjunctive language generally does not imply that certain embodiments require the presence of at least one of X, at least one of Y, and at least one of Z.

The term "threshold" as used herein refers to any possible type of threshold. By way of example, the term "threshold" includes predefined thresholds, dynamically determined thresholds, dynamically adjusted thresholds, and learned thresholds (e.g., thresholds learned through user interaction, thresholds based on user preferences, thresholds based on user capabilities, etc.). A threshold based on user capabilities may be adjusted up or down based on the capabilities of an individual user. As an example, a touchpad force threshold based on user capabilities may be adjusted downward for a user with weaker than normal finger strength (known either from the user's previous interactions or from user preferences).

Similarly, while operations may be depicted in the drawings in a particular order, it will be appreciated that these operations need not be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Further, the figures may schematically depict one or more example processes in the form of a flow diagram. However, other operations not shown may be incorporated into the exemplary methods and processes illustrated schematically. For example, one or more additional operations may be performed before, after, concurrently with, or between any of the illustrated operations. In addition, in other implementations, the operations may be rearranged or reordered. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products. Additionally, other implementations are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results.

Examples of the invention

Various examples of systems that dynamically fuse multiple modes of user input to facilitate interaction with virtual objects in a three-dimensional (3D) environment are described herein, such as the following enumerated examples:

example 1: a system, the system comprising: a first sensor of the wearable system configured to acquire first user input data in a first input mode; a second sensor of the wearable system configured to acquire second user input data in a second input mode, the second input mode being different from the first input mode; a third sensor of the wearable system configured to acquire third user input data in a third input mode, the third input mode being different from the first input mode and the second input mode; and a hardware processor in communication with the first, second, and third sensors, the hardware processor programmed to: receiving a plurality of inputs including first user input data in a first input mode, second user input data in a second input mode, and third user input data in a third input mode; identifying a first interaction vector based on the first user input data; identifying a second interaction vector based on the second user input data; identifying a third interaction vector based on the third user input data; determining vergence between at least two of the first interaction vector, the second interaction vector and the third interaction vector; identifying a target virtual object from a set of candidate objects in a three-dimensional (3D) region around the wearable system based at least in part on the vergence; determining a user interface operation on the target virtual object based on at least one of the first user input data, the second user input data, the third user input data, and the vergence; and generating a cross-modality input command that causes a user interface operation to be performed on the target virtual object.

Example 2: the system of example 1, wherein the first sensor comprises a head pose sensor, the second sensor comprises an eye gaze sensor, and the third sensor comprises a gesture sensor.

Example 3: the system of example 1 or 2, wherein the vergence is among all three of the first interaction vector, the second interaction vector, and the third interaction vector.

Example 4: the system of any of examples 1-3, wherein the hardware processor is programmed to determine a divergence of at least one of the first interaction vector, the second interaction vector, or the third interaction vector from the vergence.

Example 5: The system of any of examples 1-4, wherein to determine vergence between at least two of the first interaction vector, the second interaction vector, and the third interaction vector, the hardware processor is programmed to determine a dataset comprising the user input data associated with the sensors determined to be in vergence.

Example 6: the system of any of examples 1-5, wherein the third sensor comprises an Electromyography (EMG) sensor sensitive to hand motion.

Example 7: the system of any of examples 1-5, wherein the third sensor comprises an Electromyography (EMG) sensor sensitive to muscles passing through a carpal tunnel of the user.

Example 8: a method, comprising: under control of a hardware processor of the wearable system: accessing sensor data from a plurality of more than three sensors having different modalities; identifying convergence events for a first sensor and a second sensor of a plurality of sensors having different modalities greater than three; and targeting objects in the three-dimensional 3D environment around the wearable system using the first sensor data from the first sensor and the second sensor data from the second sensor.

Example 9: The method of example 8, further comprising: identifying a second convergence event of a third sensor fused with the first sensor and the second sensor, the third sensor being from the plurality of more than three sensors having different modalities, and wherein utilizing the first sensor data from the first sensor and the second sensor data from the second sensor to target the object in the three-dimensional (3D) environment around the wearable system further comprises: utilizing third sensor data from the third sensor.

Example 10: The method of any of examples 8 or 9, further comprising: identifying a divergence event in which the first sensor diverges from the second sensor, or the third sensor diverges from the first sensor or the second sensor.

Example 11: The method of example 10, wherein the utilizing does not include utilizing data from a diverged sensor.

Example 12: The method of example 10, wherein the utilizing includes weighting data from a diverged sensor less than data from a converged sensor.

Example 13: The method of any of examples 10-12, wherein the plurality of more than three sensors having different modalities includes: a head pose sensor, an eye gaze sensor, a gesture sensor, and a touch sensor.

Example 14: The method of any of examples 8-13, wherein the first sensor comprises an Electromyography (EMG) sensor sensitive to hand motion.

Example 15: The method of any of examples 8-13, wherein the first sensor comprises an Electromyography (EMG) sensor sensitive to hand motion, wherein the second sensor comprises a camera-based gesture sensor, and wherein identifying the convergence event of the first sensor and the second sensor comprises: determining, using the EMG sensor, that a muscle of the user is flexed in a manner consistent with a non-verbal sign; and determining, with the camera-based gesture sensor, that at least a portion of the user's hand is positioned in a manner consistent with the non-verbal sign.

Example 16: a method, comprising: under control of a hardware processor of the wearable system: accessing sensor data from at least first and second sensors having different modalities, wherein the first sensor provides sensor data having a plurality of potential interpretations; identifying a convergence of sensor data from the second sensor with a given potential interpretation of the potential interpretations of the sensor data from the first sensor; and generating an input command for the wearable system based on a given one of the potential interpretations.

Example 17: the method of example 16, wherein generating the input command comprises: an input command is generated based on a given potential interpretation of the potential interpretations while discarding the remaining potential interpretations.

Example 18: the method of any of examples 16 or 17, wherein the first sensor comprises a gesture sensor that tracks hand motion of the user.

Example 19: the method of any of examples 16-18, wherein the first sensor comprises a gesture sensor that tracks a motion of an arm of the user.

Example 20: the method of any of examples 16-19, wherein the potential interpretation of the sensor data from the first sensor comprises: a first ray or cone projection from a wrist of a user to a fingertip of the user, and comprising: a second ray or cone projection from the user's head to the user's fingertip.

Example 21: The method of any of examples 16-20, wherein the second sensor comprises an eye tracking sensor, and wherein identifying the convergence comprises: determining that the user's gaze and the first ray or cone projection are directed approximately at a common point in space.

Example 22: a method, comprising: under control of a hardware processor of the wearable system: accessing sensor data from a plurality of sensors having different modalities; identifying a convergence event of sensor data from first and second sensors of the plurality of sensors; and selectively applying a filter to sensor data from the first sensor during the convergence event.

Example 23: the method of example 22, wherein selectively applying the filter to the sensor data from the first sensor during the convergence event comprises: the method further includes detecting an initial convergence of the sensor data from the first sensor and the second sensor, and applying a filter to the sensor data from the first sensor based on the detected initial convergence.

Example 24: the method of any of examples 22 or 23, wherein selectively applying a filter to sensor data from a first sensor during a convergence event comprises: the method includes detecting convergence of sensor data from the first sensor and the second sensor, applying a filter to the sensor data from the first sensor based on the detected convergence, detecting divergence of the sensor data of the first sensor from the sensor data of the second sensor, and inhibiting application of the filter to the sensor data from the first sensor based on the detected divergence.

Example 25: the method of any of examples 22-24, wherein the filter comprises a low pass filter having an adaptive cutoff frequency.

Example 26: a wearable system comprising a hardware processor programmed to perform the method of any of examples 8-25.

Example 27: the wearable system of example 26, comprising at least first and second sensors having different modalities.

Example 28: A wearable system, comprising: a head pose sensor configured to determine a head pose of a user of the wearable system; an eye gaze sensor configured to determine an eye gaze direction of the user of the wearable system; a gesture sensor configured to determine a gesture of the user of the wearable system; and a hardware processor in communication with the head pose sensor, the eye gaze sensor, and the gesture sensor, the hardware processor programmed to: determining a first vergence between the eye gaze direction and the head pose of the user relative to an object; executing a first interactive command associated with the object based at least in part on input from the head pose sensor and the eye gaze sensor; determining a second vergence of the gesture of the user with the eye gaze direction and the head pose relative to the object; and executing a second interactive command associated with the object based at least in part on input from the gesture, the head pose sensor, and the eye gaze sensor.

Example 29: the wearable system of example 28, wherein the head pose sensor comprises an Inertial Measurement Unit (IMU), the eye gaze sensor comprises an eye tracking camera, and the gesture sensor comprises an outward facing camera.

Example 30: the wearable system of any of examples 28 or 29, wherein to determine the first vergence, the hardware processor is programmed to determine that an angle between an eye gaze direction and a head pose direction associated with the head pose is less than a first threshold.

Example 31: The wearable system of any of examples 28-30, wherein to determine the second vergence, the hardware processor is programmed to determine that a cross-modal triangle associated with the gesture, the eye gaze direction, and the head pose is less than a second threshold.

Example 32: The wearable system of any of examples 28-31, wherein the first interaction command comprises aiming at the object.

Example 33: The wearable system of any of examples 28-32, wherein the second interaction command comprises selecting the object.

Example 34: the wearable system of any of examples 28-33, wherein the hardware processor is further programmed to determine a divergence of the object from at least one of a gesture, an eye gaze direction, or a head pose.

Example 35: the wearable system of any of examples 28-34, wherein the first interaction command comprises applying a first filter, or the second interaction command comprises applying a second filter.

Example 36: the wearable system of example 35, wherein the first filter is different from the second filter.

Example 37: the wearable system of any of examples 35 or 36, wherein the first filter or the second filter comprises a low pass filter having an adaptive cutoff frequency.

Example 38: The wearable system of example 37, wherein the low pass filter comprises a One Euro filter.

Example 39: the wearable system of any of examples 28-38, wherein to determine the first vergence, the hardware processor is programmed to determine that a dwell time of the eye gaze direction and the head pose toward the object exceeds a first dwell time threshold.

Example 40: The wearable system of any of examples 28-39, wherein to determine the second vergence, the hardware processor is programmed to determine that a dwell time of the eye gaze direction, the head pose, and the gesture relative to the object exceeds a second dwell time threshold.

Example 41: the wearable system of any of examples 28-40, wherein the first or second interaction command comprises providing a stable aiming vector associated with the object.

Example 42: the wearable system of example 41, wherein the hardware processor provides the stable aiming vector to the application.

Example 43: the wearable system of any of examples 28-42, wherein the gesture sensor comprises a handheld user input device.

Example 44: the wearable system of example 43, wherein the hardware processor is programmed to determine a third divergence between the input from the user input device and at least one of the eye gaze direction, the head pose, or the gesture.

Example 45: the wearable system of any of examples 28-44, further comprising a voice sensor, and wherein the hardware processor is programmed to determine a fourth vergence between the input from the voice sensor and at least one of the eye gaze direction, the head pose, or the gesture.

Example 46: A method, comprising: under control of a hardware processor of the wearable system: identifying a current cross-modal state, the current cross-modal state comprising a cross-modal vergence associated with an object; identifying a region of intent (ROI) associated with the cross-modal vergence; identifying a corresponding interaction field based at least in part on the ROI; selecting an input fusion method based at least in part on the cross-modal state; selecting settings for a primary targeting vector; applying conditioning to the primary targeting vector to provide a stable pose vector; and communicating the stabilized pose vector to the application.

Example 47: the method of example 46, wherein the corresponding interaction field includes one or more of: near field, mid field, or far field.

Example 48: The method of any of examples 46 or 47, wherein applying the conditioning comprises: reducing registration error, jitter, or drift of the primary targeting vector.

Example 49: the method of any of examples 46-48, further comprising targeting the object.

Example 50: the method of any of examples 46-49, wherein identifying a current cross-modal state includes determining a gaze or a dwell.

Example 51: The method of example 50, further comprising: determining whether the gaze or dwell exceeds a user focus threshold, and activating micro-gesture manipulation in response to determining that the gaze or dwell exceeds the user focus threshold.

Example 52: the method of any of examples 46-51, wherein identifying a corresponding interaction field includes identifying a field transition event that includes a transition between a first interaction field and a second interaction field.

Example 53: the method of any of examples 46-52, wherein identifying a current cross-modal state includes analyzing convergence between a plurality of input aiming vectors.

Example 54: the method of example 53, wherein analyzing convergence comprises determining an angular distance between a plurality of input aiming vector pairs.

Example 55: The method of example 54, further comprising determining a relative variance between each pair of the plurality of input aiming vectors.

Example 56: The method of example 55, further comprising: determining that the angular distance of a pair of input aiming vectors is below a first threshold and that the relative variance of the pair of input aiming vectors is below a second threshold, and in response to the determination, identifying the current cross-modal state as a bimodal state associated with the pair of input aiming vectors.

Example 57: The method of any of examples 53-56, wherein analyzing convergence comprises: determining that a triplet of input aiming vectors is associated with a cross-modal triangle having an area and three sides.

Example 58: The method of example 57, further comprising: determining that the area of the cross-modal triangle is below a third threshold, that the variance of the area is below a fourth threshold, or that the variance of the side lengths of the cross-modal triangle is below a fifth threshold, and in response to the determination, identifying the current cross-modal state as a tri-modal state associated with the triplet of input aiming vectors.

Example 59: the method of any of examples 46-58, wherein the current cross-modal state comprises a bi-modal state, a tri-modal state, or a quad-modal state.

Example 60: a method, comprising: under control of a hardware processor of the wearable system: identifying a cross-modal fixation point; defining an extended region of interest (ROI) based on a cross-modal gaze time or a predicted dwell time in a vicinity of a cross-modal gaze point; determining that the ROI intersects a rendering element; determining a rendering enhancement compatible with the ROI, cross-modal gaze time, or predicted dwell time; and activating rendering enhancement.

Example 61: the method of example 60, wherein the rendering enhancement includes one or more of: a reflection map, a surface sparkle effect, sub-surface scattering, a gaseous lens or refraction effect, particle count, or advanced High Dynamic Range (HDR) rendering or illumination methods.

Example 62: the method of any of examples 60 or 61, wherein the rendering enhancement is activated only during the predicted dwell time or the cross-modal gaze time.

Example 63: the method of any of examples 60-62, further comprising: detecting divergence of a previously converged input modality, and deactivating the rendering enhancement.
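
The following C# sketch is one way examples 60-63 might fit together: an extended ROI grows with the predicted dwell time around the cross-modal fixation point, rendering enhancements are enabled only for elements that the ROI intersects, and they are cleared again when the inputs diverge. The types, the growth formula, the radii, and the enhancement enumeration are all illustrative assumptions.

using System;
using System.Collections.Generic;
using System.Numerics;

public enum RenderEnhancement { ReflectionMap, SurfaceSparkle, SubsurfaceScattering, Refraction, Hdr }

public sealed class RenderElement
{
    public Vector3 Center;
    public float Radius;                                            // bounding-sphere radius
    public HashSet<RenderEnhancement> Supported = new HashSet<RenderEnhancement>();
    public HashSet<RenderEnhancement> Active    = new HashSet<RenderEnhancement>();
}

public static class FovealRendering
{
    // The ROI radius grows (up to a cap) with the predicted dwell time at the fixation point.
    public static float RoiRadius(float baseRadius, float predictedDwellSeconds)
        => baseRadius * Math.Min(1.0f + predictedDwellSeconds, 3.0f);

    public static void Update(Vector3 fixationPoint, float predictedDwellSeconds,
                              IEnumerable<RenderElement> elements, float baseRadius = 0.05f)
    {
        float roi = RoiRadius(baseRadius, predictedDwellSeconds);
        foreach (var e in elements)
        {
            bool intersects = Vector3.Distance(fixationPoint, e.Center) <= roi + e.Radius;
            e.Active.Clear();                                       // deactivated once the inputs diverge
            if (intersects)
                e.Active.UnionWith(e.Supported);                    // enable only compatible enhancements
        }
    }
}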

Example 64: a wearable system comprising a hardware processor programmed to perform the method of any of examples 46-63.

Example 65: the wearable system of example 64, comprising at least a first sensor and a second sensor having different modalities.

Example 66: the wearable system of example 65, wherein the at least first and second sensors having different modalities comprise: a head pose sensor, an eye gaze sensor, a gesture sensor, a voice sensor, or a handheld user input device.

Example 67: a method, comprising: under control of a hardware processor of the wearable system: receiving sensor data from a plurality of sensors having different modalities; determining that data from a particular subset of the plurality of sensors having different modalities indicates that a user is initiating execution of a particular motor or sensorimotor control strategy from a plurality of predetermined motor and sensorimotor control strategies; selecting a particular sensor data processing scheme corresponding to a particular motor or sensorimotor control strategy from a plurality of different sensor data processing schemes corresponding to respective different ones of a plurality of predetermined motor and sensorimotor control strategies; and processing data received from a particular subset of the plurality of sensors having different modalities according to a particular sensor data processing scheme.
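
As a minimal sketch of example 67 (the strategy names, delegate signature, and class name are assumptions), a wearable system could keep a table mapping each predetermined motor or sensorimotor control strategy to a corresponding sensor data processing scheme and dispatch incoming samples through the scheme for whichever strategy the converged sensor subset indicates:

using System;
using System.Collections.Generic;

public enum ControlStrategy { HeadEyeTargeting, HeadEyeHandTargeting, HandMicroGesture }

// A processing scheme consumes raw samples from the relevant sensor subset and
// produces fused output (e.g. a stabilized aiming vector), here flattened to float[].
public delegate float[] ProcessingScheme(IReadOnlyList<float[]> rawSamples);

public sealed class SchemeSelector
{
    private readonly Dictionary<ControlStrategy, ProcessingScheme> _schemes =
        new Dictionary<ControlStrategy, ProcessingScheme>();

    public void Register(ControlStrategy strategy, ProcessingScheme scheme) => _schemes[strategy] = scheme;

    // Returns null when no strategy is detected, i.e. the inputs have not converged.
    public float[] Process(ControlStrategy? detected, IReadOnlyList<float[]> rawSamples)
        => detected.HasValue && _schemes.TryGetValue(detected.Value, out var scheme)
               ? scheme(rawSamples)
               : null;
}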

Example 68: the method of example 67, wherein determining that data from a particular subset of the plurality of sensors having different modalities indicates that the user is initiating performance of a particular motor or sensorimotor control strategy from among a plurality of predetermined motor and sensorimotor control strategies comprises: determining that data from the particular subset of the plurality of sensors having different modalities is randomly converging.

Example 69: the method of any of examples 67-68, further comprising: while processing data received from the particular subset of the plurality of sensors having different modalities according to the particular sensor data processing scheme: determining that data from the particular subset of the plurality of sensors having different modalities indicates that the user is ending execution of the particular motor or sensorimotor control strategy; and in response to determining that data from the particular subset of the plurality of sensors having different modalities indicates that the user is ending execution of the particular motor or sensorimotor control strategy: refraining from processing data received from the particular subset of the plurality of sensors having different modalities according to the particular sensor data processing scheme.

Example 70: the method of example 69, wherein determining that data from a particular subset of a plurality of sensors having different modalities indicates that a user is ending performance of a particular motor or sensorimotor control strategy comprises: determining that data from the particular subset of the plurality of sensors having different modalities is randomly diverging.

Example 71: the method of any of examples 67-70, wherein processing data received from a particular subset of a plurality of sensors having different modalities according to a particular sensor data processing scheme comprises: data from one or more of a particular subset of the plurality of sensors having different modalities is filtered in a particular manner.

Example 72: the method of any of examples 67-71, wherein processing data received from a particular subset of a plurality of sensors having different modalities according to a particular sensor data processing scheme comprises: data from a particular subset of the plurality of sensors having different modalities is fused in a particular manner.

Example 73: the method of any of examples 67-72, wherein determining that data from a particular subset of the plurality of sensors having different modalities indicates that a user is initiating performance of a particular motor or sensorimotor control strategy from among a plurality of predetermined motor and sensorimotor control strategies comprises: one or more statistical parameters describing one or more relationships between data from a particular subset of the plurality of sensors having different modalities are determined to satisfy one or more thresholds.

Example 74: the method of example 73, wherein the one or more statistical parameters comprise one or more of variance, covariance, or correlation.

Example 75: a method, comprising: under control of a hardware processor of the wearable system: receiving sensor data from a plurality of sensors having different modalities; determining that data from a particular subset of a plurality of sensors having different modalities varies randomly in a particular manner; in response to determining that data from a particular subset of the plurality of sensors having different modalities randomly varies in a particular manner, switching between: processing data received from a particular subset of a plurality of sensors having different modalities according to a first sensor data processing scheme; and processing data received from a particular subset of the plurality of sensors having different modalities according to a second sensor data processing scheme different from the first sensor data processing scheme.

Example 76: the method of example 75, wherein determining that data from a particular subset of a plurality of sensors having different modalities randomly changes in a particular manner comprises: determining that data from a particular subset of a plurality of sensors having different modalities randomly converges.

Example 77: the method of any of examples 75 or 76, wherein determining that data from a particular subset of a plurality of sensors having different modalities randomly changes in a particular manner comprises: determining that data from the particular subset of the plurality of sensors having different modalities randomly diverges.

Example 78: a wearable system comprising a hardware processor programmed to perform the method of any of examples 67-77.

Any of the above examples may be combined with any other example or any other feature described in this application. The examples are not intended to exclude additional elements described herein. All possible combinations and subcombinations of the examples with or without additional features described herein are contemplated and are considered a part of this disclosure.

Appendix A

Copyright notice

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the patent and trademark office patent file or records, but otherwise reserves all copyright rights whatsoever.

The following computer code and description are intended to illustrate various embodiments of the cross-modal input fusion technique, but are not intended to limit the scope of the cross-modal input fusion technique. In various implementations, this computer code may be executed by local processing and data module 260, remote processing module 270, central runtime server 1650, or other processor associated with wearable system 200.

C# script

The script disclosed herein illustrates how context is derived using a 2D context fusion calculation and then used to define dynamic filtering of the primary targeting vector (the hand). Linear and spherical interpolation methods can be used to ramp up the filtering of the primary targeting vector pose.
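
The ramp-up mentioned above might take the following form; this C# sketch is not the disclosed script, and the class name, ramp time, and blend formulation are illustrative assumptions. The blend weight is ramped linearly, the aiming orientation is blended with spherical interpolation (Quaternion.Slerp), and the interaction point with linear interpolation (Vector3.Lerp).

using System;
using System.Numerics;

public static class FilterRamp
{
    // 'ramp' goes from 0 (no filtering) to 1 (full filtering) over rampSeconds.
    public static float UpdateRamp(float ramp, bool fusionActive, float deltaSeconds, float rampSeconds = 0.25f)
    {
        float step = deltaSeconds / rampSeconds;
        return Math.Clamp(ramp + (fusionActive ? step : -step), 0f, 1f);
    }

    public static Quaternion BlendAim(Quaternion rawAim, Quaternion filteredAim, float ramp)
        => Quaternion.Slerp(rawAim, filteredAim, ramp);   // spherical interpolation of the aiming pose

    public static Vector3 BlendPosition(Vector3 raw, Vector3 filtered, float ramp)
        => Vector3.Lerp(raw, filtered, ramp);             // linear interpolation of the interaction point
}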

Calculation list:

Head-hand midpoint current position
Hand-eye midpoint current position
Head-eye midpoint current position
Head-hand midpoint current distance
Hand-eye midpoint current distance
Head-eye midpoint current distance
Head-eye-hand triangle current center position
Head-eye-hand triangle current area

For variance calculation:

Head (H) interaction point (gaze) average position: rolling window of 10 frames
Eye (E) interaction point (vergence) average position: rolling window of 10 frames
Hand (Ha) interaction point (fingertip) average position: rolling window of 10 frames
Head-hand midpoint average position: rolling window of 10 frames
Hand-eye midpoint average position: rolling window of 10 frames
Head-eye midpoint average position: rolling window of 10 frames
Head-eye-hand center point average position: rolling window of 10 frames
Head-eye-hand average triangle area: rolling window of 10 frames
Head interaction point (gaze) position variance: rolling window of 10 frames
Eye interaction point (vergence) position variance: rolling window of 10 frames
Hand interaction point (fingertip) position variance: rolling window of 10 frames
Head-hand midpoint position variance: rolling window of 10 frames
Hand-eye midpoint position variance: rolling window of 10 frames
Head-eye midpoint position variance: rolling window of 10 frames
Head-eye-hand center point position variance: rolling window of 10 frames
Head-eye-hand triangle area variance: rolling window of 10 frames
Head interaction point (gaze) current velocity (xy tangential component)
Eye interaction point (vergence) current velocity (xy tangential component)
Hand interaction point (index fingertip) current velocity (xy tangential component)
Head-hand midpoint current velocity (xy tangential component)
Hand-eye midpoint current velocity (xy tangential component)
Head-eye midpoint current velocity (xy tangential component)
Head-eye-hand center point current velocity (xy tangential component)
Head interaction point (gaze) average velocity: rolling window of 20 frames (xy tangential component)
Eye interaction point (vergence) average velocity: rolling window of 20 frames (xy tangential component)
Hand interaction point (fingertip) average velocity: rolling window of 20 frames (xy tangential component)
Head-hand midpoint average velocity: rolling window of 10 frames (xy tangential component)
Hand-eye midpoint average velocity: rolling window of 10 frames (xy tangential component)
Head-eye midpoint average velocity: rolling window of 10 frames (xy tangential component)
Head-eye-hand center point average velocity: rolling window of 10 frames (xy tangential component)
Head interaction point (gaze) average acceleration: rolling window of 20 frames (xy tangential component)
Eye interaction point (vergence) average acceleration: rolling window of 20 frames (xy tangential component)
Hand interaction point (fingertip) average acceleration: rolling window of 20 frames (xy tangential component)
Head interaction point (gaze) current acceleration (xy tangential component)
Eye interaction point (vergence) current acceleration (xy tangential component)
Hand interaction point (index fingertip) current acceleration (xy tangential component)
Head-hand midpoint current acceleration (xy tangential component)
Hand-eye midpoint current acceleration (xy tangential component)
Head-eye midpoint current acceleration (xy tangential component)
Head-eye-hand center point current acceleration (xy tangential component)
Head-eye-hand average triangle center current acceleration (xy tangential component)
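
The quantities in the calculation list reduce to a few primitives: pairwise midpoints, the head-eye-hand triangle center and area, and rolling-window means and variances over 10 or 20 frames. The following C# sketch shows those primitives; the class names are assumptions, and velocity and acceleration would be obtained as finite differences of successive per-frame values.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics;

// Fixed-size rolling window of scalar samples with mean and population variance.
public sealed class RollingWindow
{
    private readonly Queue<double> _samples = new Queue<double>();
    private readonly int _capacity;

    public RollingWindow(int capacity) { _capacity = capacity; }

    public void Add(double sample)
    {
        _samples.Enqueue(sample);
        if (_samples.Count > _capacity) _samples.Dequeue();
    }

    public double Mean => _samples.Count == 0 ? 0 : _samples.Average();

    public double Variance
    {
        get
        {
            if (_samples.Count == 0) return 0;
            double m = Mean;
            return _samples.Sum(s => (s - m) * (s - m)) / _samples.Count;
        }
    }
}

public static class CrossModalGeometry
{
    public static Vector3 Midpoint(Vector3 a, Vector3 b) => (a + b) * 0.5f;

    public static Vector3 TriangleCenter(Vector3 head, Vector3 eye, Vector3 hand)
        => (head + eye + hand) / 3f;

    public static float TriangleArea(Vector3 head, Vector3 eye, Vector3 hand)
        => 0.5f * Vector3.Cross(eye - head, hand - head).Length();
}

For instance, feeding CrossModalGeometry.TriangleArea into a 10-frame RollingWindow each frame yields the head-eye-hand average triangle area and the triangle area variance listed above.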

Pseudo-code for first start of dynamic filtering:

In some examples, the One Euro filter may fade out (ease out) quickly when the fusion condition breaks, and in some cases may shut off abruptly. In other implementations, as the filtered interaction point (in this case, the hand input aiming vector) degrades from tri-modal fusion to bi-modal fusion, one filter is ramped down while another filter is ramped up. In this regard, multiple filters are run simultaneously to create a smoothly transitioning position function and to avoid a step function (or discontinuity) in the velocity and acceleration of the primary targeting vector (managed cursor).
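
A minimal C# sketch of that behavior follows, assuming the standard One Euro filter formulation and illustrative parameter values (the disclosed pseudo-code is not reproduced here). Two filters, one tuned for the tri-modal case and one for the bi-modal case, run on every sample, and only a cross-fade weight changes, so the output position has no step discontinuity.

using System;

public sealed class OneEuroFilter
{
    private readonly double _minCutoff, _beta, _dCutoff;
    private double _xPrev, _dxPrev;
    private bool _initialized;

    public OneEuroFilter(double minCutoff = 1.0, double beta = 0.01, double dCutoff = 1.0)
    { _minCutoff = minCutoff; _beta = beta; _dCutoff = dCutoff; }

    private static double Alpha(double cutoff, double dt)
    {
        double tau = 1.0 / (2.0 * Math.PI * cutoff);
        return 1.0 / (1.0 + tau / dt);
    }

    public double Filter(double x, double dt)
    {
        if (!_initialized) { _initialized = true; _xPrev = x; _dxPrev = 0; return x; }
        double dx = (x - _xPrev) / dt;
        double dxHat = _dxPrev + Alpha(_dCutoff, dt) * (dx - _dxPrev);
        double cutoff = _minCutoff + _beta * Math.Abs(dxHat);   // adapt the cutoff to signal speed
        double xHat = _xPrev + Alpha(cutoff, dt) * (x - _xPrev);
        _xPrev = xHat; _dxPrev = dxHat;
        return xHat;
    }
}

public sealed class CrossFadedFilter
{
    private readonly OneEuroFilter _trimodal = new OneEuroFilter(minCutoff: 0.5, beta: 0.005);
    private readonly OneEuroFilter _bimodal  = new OneEuroFilter(minCutoff: 2.0, beta: 0.05);
    private double _weight = 1.0;                                // 1 = fully tri-modal filtering

    // Both filters always run; only the blend weight changes, avoiding a step in the output.
    public double Filter(double x, double dt, bool trimodalFusion, double fadeSeconds = 0.3)
    {
        double step = dt / fadeSeconds;
        _weight = Math.Clamp(_weight + (trimodalFusion ? step : -step), 0.0, 1.0);
        double a = _trimodal.Filter(x, dt);
        double b = _bimodal.Filter(x, dt);
        return _weight * a + (1.0 - _weight) * b;
    }
}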

Pseudo-code for first start of multimodal dwell:
