Non-calibration eye movement interaction method and device

Document No.: 19625    Publication date: 2021-09-21

Note: This invention, "A non-calibration eye movement interaction method and device," was designed and created by 闫野, 王小东, 谢良, 张敬, 印二威, 闫慧炯, 张亚坤, 艾勇保 and 罗治国 on 2021-05-27. Its main content is summarized as follows. The invention discloses a non-calibration eye movement interaction method and device, belonging to the technical field of computer vision. The method comprises: acquiring a large number of binocular images, scene images and head movement information, and preprocessing the acquired data; training a deep convolutional network gaze point prediction model with the collected data to obtain a mapping from the binocular images, scene images and head movement information to estimated gaze point coordinates; and inputting existing or real-time data into the trained deep convolutional network gaze point prediction model to obtain the estimated gaze point. Compared with calibration-based eye movement interaction methods, this method trains the convolutional neural network model in advance so that no calibration is needed at use time, which makes head-mounted devices more convenient to use. The invention also integrates the contributions of the eye images, the scene images and the head movement information to gaze point estimation; the three complement one another, giving gaze point estimation better accuracy and stronger robustness.

1. A non-calibrated eye movement interaction method is characterized by comprising the following steps:

S1: acquiring binocular images, scene images and head movement information as first acquired data; acquiring the first acquired data of a large number of different people in different scenes to form a first acquired data set, and preprocessing the data in the first acquired data set;

S2: training a deep convolutional network gaze point prediction model using the first acquired data set to obtain a mapping from the binocular images, scene images and head movement information to estimated gaze point coordinates;

S3: preprocessing first acquired data, whether previously collected or collected in real time, inputting it into the trained deep convolutional network gaze point prediction model, and obtaining the estimated gaze point.

2. The method of claim 1, wherein preprocessing the first acquired data comprises:

acquiring binocular image data of the person wearing the head-mounted device through its near-eye cameras, and preprocessing the binocular image data to obtain normalized binocular image vector data; acquiring scene image data through the head-mounted device, and preprocessing the scene image data to obtain normalized scene image vector data; and collecting head movement data of the person through a gyroscope in the head-mounted device, and preprocessing the head movement data to obtain one-dimensional head movement vector data.

3. The non-calibrated eye movement interaction method according to claim 2, wherein the deep convolutional network gaze point prediction model comprises:

selecting any convolutional neural network pre-trained on a publicly available saliency detection dataset as the first convolutional neural network, and inputting the normalized scene image vector data into the first convolutional neural network to obtain saliency map data; processing the saliency map data with fully connected layers to generate high-dimensional saliency map features;

selecting any open-source base convolutional neural network as the second convolutional neural network, and inputting the normalized binocular image vector data into the second convolutional neural network to obtain high-dimensional binocular image features;

and fusing the high-dimensional saliency map features, the high-dimensional binocular image features and the one-dimensional head movement vector data, and then processing the fused features with three fully connected layers to obtain the estimated gaze point coordinates.

4. The non-calibrated eye movement interaction method according to claim 3, wherein training the deep convolutional network gaze point prediction model using the first acquired data set comprises:

S21: dividing the first acquired data set into a training set and a test set;

S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the model on the test set, calculating the loss function value, and updating and saving the parameters of the model for that round using the Adam optimization algorithm;

S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and thereby determining the final deep convolutional network gaze point prediction model.

5. The method of claim 3 or 4, wherein the first convolutional neural network is U^2-Net.

6. The method of claim 3 or 4, wherein the second convolutional neural network is MobileNetV3.

7. A non-calibrated eye movement interaction device, comprising:

a data acquisition and preprocessing module: acquiring binocular images, scene images and head movement information as first acquired data; acquiring the first acquired data of a large number of different people in different scenes to form a first acquired data set, and preprocessing the data in the first acquired data set;

a deep convolutional network gaze point prediction model: trained with the first acquired data set to obtain a mapping from the binocular images, scene images and head movement information to estimated gaze point coordinates; first acquired data, whether previously collected or collected in real time, is preprocessed and input into the trained model to obtain the estimated gaze point.

8. The non-calibrated eye movement interaction device according to claim 7, wherein preprocessing the first acquired data comprises:

acquiring binocular image data of the person wearing the head-mounted device through its near-eye cameras, and preprocessing the binocular image data to obtain normalized binocular image vector data; acquiring scene image data through the head-mounted device, and preprocessing the scene image data to obtain normalized scene image vector data; and collecting head movement data of the person through a gyroscope in the head-mounted device, and preprocessing the head movement data to obtain one-dimensional head movement vector data.

9. The non-calibrated eye movement interaction device according to claim 8, wherein the deep convolutional network gaze point prediction model comprises:

a saliency map feature extraction module: selecting any convolutional neural network pre-trained on a publicly available saliency detection dataset as the first convolutional neural network, and inputting the normalized scene image vector data into the first convolutional neural network to obtain saliency map data; processing the saliency map data with fully connected layers to generate high-dimensional saliency map features;

an eye feature extraction module: selecting any open-source base convolutional neural network as the second convolutional neural network, and inputting the normalized binocular image vector data into the second convolutional neural network to obtain high-dimensional binocular image features;

an eye movement prediction module: fusing the high-dimensional saliency map features, the high-dimensional binocular image features and the one-dimensional head movement vector data, and then processing the fused features with three fully connected layers to obtain the estimated gaze point coordinates.

10. The non-calibrated eye movement interaction device according to claim 9, wherein training the deep convolutional network gaze point prediction model using the first acquired data set comprises:

S21: dividing the first acquired data set into a training set and a test set;

S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the model on the test set, calculating the loss function value, and updating and saving the parameters of the model for that round using the Adam optimization algorithm;

S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and thereby determining the final deep convolutional network gaze point prediction model.

Technical Field

The invention relates to the technical field of computer vision, in particular to a non-calibration eye movement interaction method and a non-calibration eye movement interaction device.

Background

Head-mounted AR devices, often regarded as a new generation of mobile internet devices, can provide users with rich experiences for daily life and work communication. Because gaze is the most direct way a person observes the world, eye movement interaction is a key interaction modality for head-mounted AR devices. Eye movement interaction requires accurate and robust gaze estimation as support: by detecting the user's visual gaze direction in real time, the user's intention can be understood effectively. Eye movement interaction based on head-mounted AR devices is already deployed in a variety of applications, such as gaze-based target selection, attention monitoring and foveated rendering.

Head-mounted eye trackers usually adopt a calibration-based eye tracking method that depends on image features: the eyes are illuminated with infrared light, the pupil positions are detected to obtain pupil coordinates, the reflected glint positions are detected to obtain glint coordinates, and finally a function is fitted between the known screen gaze point positions and the pupil-glint relationship. The fitted function requires specific, explicit eye features as input, and these features cannot be detected under insufficient light or with blurred images, in which case gaze point estimation cannot be performed.

Calibration-based eye movement interaction has at least two problems. First, it depends on an additional infrared light source under poor lighting, but the additional infrared light source can disturb or even completely block the acquisition of the eye image features; as a result, eye tracking research has so far been largely limited to indoor environments with well-controlled lighting. Second, before use the user must perform a cumbersome eye movement calibration while keeping the head posture fixed, and must recalibrate whenever the wearing position changes; the calibration steps are complicated, increase the complexity of using the device, and place high professional demands on the personnel assisting with calibration. Because of these problems, current head-mounted devices adopting calibration-based eye movement interaction are difficult to use in scenarios such as long-term wearing and high-acceleration motion.

Disclosure of Invention

In view of this, the non-calibration eye movement interaction method and device provided by the invention fuse information related to eye movement, such as head movement information, eye images and scene images, and can thereby effectively improve the accuracy of the eye movement interaction algorithm and its robustness in complex scenes.

According to one aspect of the invention, a non-calibration eye movement interaction method is provided, comprising the following steps. S1: acquiring binocular images, scene images and head movement information as first acquired data; acquiring the first acquired data of a large number of different people in different scenes to form a first acquired data set, and preprocessing the data in the first acquired data set. S2: training a deep convolutional network gaze point prediction model using the first acquired data set to obtain a mapping from the binocular images, scene images and head movement information to estimated gaze point coordinates. S3: preprocessing first acquired data, whether previously collected or collected in real time, inputting it into the trained deep convolutional network gaze point prediction model, and obtaining the estimated gaze point.

As a further improvement of the present invention, preprocessing the first acquired data includes: acquiring binocular image data of the person wearing the head-mounted device through its near-eye cameras, and preprocessing the binocular image data to obtain normalized binocular image vector data; acquiring scene image data through the head-mounted device, and preprocessing the scene image data to obtain normalized scene image vector data; and collecting head movement data of the person through a gyroscope in the head-mounted device, and preprocessing the head movement data to obtain one-dimensional head movement vector data.

As a further improvement of the present invention, the deep convolutional network gaze point prediction model comprises: selecting any convolutional neural network pre-trained on a publicly available saliency detection dataset as the first convolutional neural network, inputting the normalized scene image vector data into the first convolutional neural network to obtain saliency map data, and processing the saliency map data with fully connected layers to generate high-dimensional saliency map features; selecting any open-source base convolutional neural network as the second convolutional neural network, and inputting the normalized binocular image vector data into the second convolutional neural network to obtain the high-dimensional binocular image features; and fusing the high-dimensional saliency map features, the high-dimensional binocular image features and the one-dimensional head movement vector data, and then processing the fused features with three fully connected layers to obtain the estimated gaze point coordinates.

As a further improvement of the present invention, training the deep convolutional network gaze point prediction model using the first acquired data set comprises: S21: dividing the first acquired data set into a training set and a test set; S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the model on the test set, calculating the loss function value, and updating and saving the parameters of the model for that round using the Adam optimization algorithm; S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and thereby determining the final deep convolutional network gaze point prediction model.

As a further improvement of the invention, the first convolutional neural network is U^2-Net.

As a further improvement of the present invention, the second convolutional neural network is MobileNetV3.

According to another aspect of the present invention, a non-calibrated eye movement interaction device is provided, comprising: a data acquisition and preprocessing module: acquiring binocular images, scene images and head movement information as first acquired data; acquiring the first acquired data of a large number of different people in different scenes to form a first acquired data set, and preprocessing the data in the first acquired data set; and a deep convolutional network gaze point prediction model: trained with the first acquired data set to obtain a mapping from the binocular images, scene images and head movement information to estimated gaze point coordinates; first acquired data, whether previously collected or collected in real time, is preprocessed and input into the trained model to obtain the estimated gaze point.

As a further improvement of the present invention, preprocessing the first acquired data includes: acquiring binocular image data of the person wearing the head-mounted device through its near-eye cameras, and preprocessing the binocular image data to obtain normalized binocular image vector data; acquiring scene image data through the head-mounted device, and preprocessing the scene image data to obtain normalized scene image vector data; and collecting head movement data of the person through a gyroscope in the head-mounted device, and preprocessing the head movement data to obtain one-dimensional head movement vector data.

As a further improvement of the present invention, the deep convolutional network gaze point prediction model comprises: a saliency map feature extraction module: selecting any convolutional neural network pre-trained on a publicly available saliency detection dataset as the first convolutional neural network, inputting the normalized scene image vector data into the first convolutional neural network to obtain saliency map data, and processing the saliency map data with fully connected layers to generate high-dimensional saliency map features; an eye feature extraction module: selecting any open-source base convolutional neural network as the second convolutional neural network, and inputting the normalized binocular image vector data into the second convolutional neural network to obtain the high-dimensional binocular image features; and an eye movement prediction module: fusing the high-dimensional saliency map features, the high-dimensional binocular image features and the one-dimensional head movement vector data, and then processing the fused features with three fully connected layers to obtain the estimated gaze point coordinates.

As a further improvement of the present invention, training the deep convolutional network gaze point prediction model using the first acquired data set comprises: S21: dividing the first acquired data set into a training set and a test set; S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the model on the test set, calculating the loss function value, and updating and saving the parameters of the model for that round using the Adam optimization algorithm; S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and thereby determining the final deep convolutional network gaze point prediction model.

Through the above technical solution, the invention has the following beneficial effects:

(1) Compared with the calibration-based eye movement interaction methods commonly used in the prior art, the method trains the deep convolutional network gaze point prediction model in advance so that no calibration is needed at use time, which improves the convenience of using head-mounted devices and allows them to be worn and used for long periods.

(2) Because the training data of the model include binocular images under various lighting conditions (strong light, dim light, or blinking), the trained deep convolutional network gaze point prediction model has good resistance to light interference.

(3) The contributions of the eye images, the scene images and the head movement information to gaze point estimation are integrated; the three complement one another, so that gaze point estimation achieves better accuracy and stronger robustness.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

FIG. 1 is a general flow chart illustrating a non-calibrated eye movement interaction method according to an embodiment of the present invention;

fig. 2 illustrates a flow of extracting features of a saliency map in a non-calibrated eye movement interaction method according to an embodiment of the present invention;

fig. 3 illustrates the process of converting binocular images into high-dimensional eye features in a non-calibration eye movement interaction method according to an embodiment of the present invention;

fig. 4 is a block diagram illustrating a non-calibrated eye movement interaction apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

It will be understood that the description and claims of the present invention and the method and apparatus of the drawings are referred to one another with respect to the features thereof. Furthermore, the terms "first," "second," and the like in the description and in the claims, and in the drawings, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.

First, some terms appearing in the description of the embodiments of the present invention are explained as follows:

deep learning: deep learning is the intrinsic law and expression level of the learning sample data, and the information obtained in the learning process is very helpful for the interpretation of data such as characters, images and sounds. The final aim of the method is to enable the machine to have the analysis and learning capability like a human, and to recognize data such as characters, images and sounds.

Convolutional neural network: a deep learning method. A convolutional neural network mainly consists of two parts: feature extraction (convolution, activation function, pooling) and classification/recognition (fully connected layers). Convolution extracts different features from the input; the activation function introduces nonlinearity by applying a nonlinear mapping to the convolutional layer's output; the pooling layer downsamples the input image and reduces the amount of computation; and the fully connected layers act as the classifier of the whole convolutional neural network, identifying and classifying the result after the convolution, activation and pooling stages.
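For reference only, this two-part structure can be illustrated with a minimal PyTorch sketch; the layer sizes below are arbitrary and not taken from the invention:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: feature extraction (conv, activation, pooling) + classifier (fully connected)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution: extract features
            nn.ReLU(),                                   # activation: introduce nonlinearity
            nn.MaxPool2d(2),                             # pooling: shrink the map, cut computation
        )
        self.classifier = nn.Linear(16 * 16 * 16, num_classes)  # fully connected: classify

    def forward(self, x):                 # x: (B, 3, 32, 32)
        x = self.features(x)              # -> (B, 16, 16, 16)
        return self.classifier(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```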

PyTorch: PyTorch is an open source Python machine learning library, which is based on Torch and used for applications such as natural language processing.

Fully connected layer: each fully connected layer is a flat structure consisting of many neurons; the core operation is the matrix-vector product y = Wx, which in essence linearly transforms one feature space into another.

Base convolutional neural network model (Base CNN Model): base models largely drive the development of the field, and a good base model has a huge influence on downstream visual tasks. Originally used mainly for image classification, such models can also be modified and combined to perform other deep learning tasks such as object detection and semantic segmentation. Common base convolutional neural network models include AlexNet, GoogLeNet, ResNet, MobileNet and EfficientNet.

U^2-Net: a convolutional neural network architecture for image semantic segmentation and image saliency detection.

MobileNetV3: the third-generation MobileNet network proposed by Google, a base convolutional neural network architecture designed to run on mobile devices.

The core technical problem to be solved by the invention is that calibration-based eye movement interaction methods in the prior art involve overly complicated calibration and usage steps and are difficult to adapt to scenarios such as long-term wearing and high-acceleration motion.

To address these technical problems, the invention provides a non-calibration eye movement interaction method and device. By training a deep convolutional network gaze point prediction model in advance, no calibration is needed at use time, which makes head-mounted devices more convenient to use and suitable for long-term wearing. Because the training data of the model include binocular images under various lighting conditions, the trained deep convolutional network gaze point prediction model has good resistance to light interference. By fusing head movement information, eye images, scene images and other information related to eye movement, the accuracy of the eye movement interaction algorithm and its robustness in complex scenes can be effectively improved, and the applicable scenarios are rich and flexible.

Example 1

Referring to fig. 1, a general flowchart of a non-calibration eye movement interaction method according to an embodiment of the present invention is shown.

As shown in fig. 1, the method of the embodiment of the present invention mainly includes the following steps:

s1: acquiring binocular images, scene images and head movement information as first acquired data; the method comprises the steps of collecting first collected data of a large number of different people in different scenes to be a first collected data group, and preprocessing the data in the first collected data group.

Acquiring binocular image data of the person wearing the head-mounted device through its near-eye cameras, and preprocessing the binocular image data to obtain normalized binocular image vector data; acquiring scene image data through the head-mounted device, and preprocessing the scene image data to obtain normalized scene image vector data;

specifically, tenor vectors with the sizes of 3 × 128 and 3 × 196 × 128 are generated for the binocular image and the scene image, and then normalization processing with the mean value and standard deviation of [0.485, 0.456, 0.406], [0.229, 0.224, 0.225] is performed on the vector of each channel, so as to obtain normalized vector data of the binocular image and normalized vector data of the scene image.

Collecting head movement data of the person through a gyroscope in the head-mounted device, and preprocessing the head movement data to obtain one-dimensional head movement vector data;

specifically, head motion information data acquired by a built-in gyroscope is converted into a one-dimensional tensor vector with the length of 9 and the data type of float, and head motion one-dimensional vector data is obtained.

When the method is used on a head-mounted AR device, the scene image is the real scene image captured by the user's front-facing camera; when it is used on a head-mounted VR or mixed reality device, the scene image is the virtual image observed by the user.

S2: training the deep convolutional network gaze point prediction model using the first acquired data set to obtain a mapping from the binocular images, scene images and head movement information to estimated gaze point coordinates.

Specifically, the deep convolutional network gaze point prediction model comprises:

1. Selecting any convolutional neural network pre-trained on a publicly available saliency detection dataset as the first convolutional neural network, and inputting the normalized scene image vector data into the first convolutional neural network to obtain saliency map data; then processing the saliency map data with fully connected layers to generate high-dimensional saliency map features;

specifically, a saliency map of a scene image is generated through a convolutional neural network, and then high-dimensional features are generated through a full-connected layer. The first convolutional neural network used usually needs to be generated by using an additional significance test data set for pre-training, needs a large amount of training data and tedious calibration work, so the embodiment uses the convolutional neural network trained and completed on the international significance detection data set such as DUTS, HKU-IS, and ECSSD to perform significance detection tasks, and freezes the part of parameters in subsequent training and does not need to update. These training models have all been open-source published on github.

Specifically, this embodiment adopts the U^2-Net network, and the fully connected processing consists of two fully connected layers.

As shown in the saliency feature extraction flow in fig. 2: the normalized scene image vector data are input, a 196 × 128 single-channel saliency map is output after passing through the U^2-Net convolutional neural network, and the high-dimensional saliency map feature output after two fully connected layers is a one-dimensional vector of length 128.
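A minimal PyTorch sketch of this saliency branch, assuming a pre-trained saliency network is available as `saliency_net` (for example an open-source U^2-Net whose output here is taken as a single-channel 196 × 128 map); the hidden width of the first fully connected layer is an assumption, since the text only fixes the input size and the length-128 output:

```python
import torch
import torch.nn as nn

class SaliencyBranch(nn.Module):
    def __init__(self, saliency_net, hidden_dim=512):   # hidden_dim is an assumption
        super().__init__()
        self.saliency_net = saliency_net
        for p in self.saliency_net.parameters():         # freeze pre-trained saliency weights
            p.requires_grad = False
        self.fc = nn.Sequential(                          # two fully connected layers, as in the text
            nn.Linear(196 * 128, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 128),
        )

    def forward(self, scene):                             # scene: (B, 3, 196, 128)
        with torch.no_grad():
            sal = self.saliency_net(scene)                # (B, 1, 196, 128) single-channel saliency map
        return self.fc(sal.flatten(1))                    # (B, 128) high-dimensional saliency features
```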

2. Selecting any open-source base convolutional neural network as the second convolutional neural network, and inputting the normalized binocular image vector data into the second convolutional neural network to obtain the high-dimensional binocular image features;

In this embodiment, considering the limited computing resources of the head-mounted device, MobileNetV3, an open-source base convolutional neural network from the prior art with low computational cost, is selected.

As shown in fig. 3, which illustrates the process of converting the binocular images into high-dimensional eye features: the same MobileNetV3 convolutional neural network (sharing the same parameters) is applied to the normalized left-eye and right-eye image vector data to obtain the high-dimensional binocular image features. Specifically, the high-dimensional binocular image features are two one-dimensional vectors of length 128.
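A sketch of this shared-parameter eye branch, assuming torchvision's MobileNetV3 implementation (torchvision ≥ 0.13 API) with its final classifier layer replaced so that each eye image maps to a length-128 vector; the small variant is an arbitrary choice:

```python
import torch
import torch.nn as nn
from torchvision import models

class EyeBranch(nn.Module):
    """Shared-weight (siamese) eye feature extractor: the same MobileNetV3
    backbone is applied to the left and right eye images."""
    def __init__(self):
        super().__init__()
        backbone = models.mobilenet_v3_small(weights=None)       # any MobileNetV3 variant would do
        in_features = backbone.classifier[-1].in_features
        backbone.classifier[-1] = nn.Linear(in_features, 128)    # map to a length-128 feature vector
        self.backbone = backbone

    def forward(self, left_eye, right_eye):   # each: (B, 3, H, W)
        left_feat = self.backbone(left_eye)     # (B, 128)
        right_feat = self.backbone(right_eye)   # (B, 128), same parameters reused
        return left_feat, right_feat
```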

3. Fusing the high-dimensional saliency map features, the high-dimensional binocular image features and the one-dimensional head movement vector data, and then processing the fused vector with three fully connected layers to obtain the estimated gaze point coordinates.

Specifically, the high-dimensional saliency map feature is the one-dimensional vector of length 128 output by the saliency map generation module; the high-dimensional binocular image features are the two one-dimensional vectors of length 128 output by the eye image processing module; and the head movement feature provided by the gyroscope is a one-dimensional vector of length 9 composed of the three-dimensional spatial angle, three-dimensional angular velocity and three-dimensional velocity recorded by the gyroscope. These vectors are concatenated into a one-dimensional vector of length 393, which is then processed by three fully connected layers with a ReLU activation function between consecutive layers. Without activation functions, a neural network of any depth would amount to nothing more than a linear combination. The input vector is linearly transformed by each fully connected layer and nonlinearly mapped by the ReLU activation, and the final output is a one-dimensional vector of length 2, namely the predicted gaze point coordinates.
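A sketch of this fusion head; the 393-dimensional input, the three fully connected layers, the ReLU activations and the length-2 output follow the text, while the two hidden-layer widths are assumptions:

```python
import torch
import torch.nn as nn

class GazeFusionHead(nn.Module):
    """Fuses the 128-d saliency feature, two 128-d eye features and the 9-d head
    movement vector (393 values in total) and regresses the 2-D gaze point."""
    def __init__(self, hidden1=256, hidden2=64):   # hidden widths are assumptions
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(128 + 128 + 128 + 9, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2, 2),                 # (x, y) gaze point coordinates
        )

    def forward(self, sal_feat, left_feat, right_feat, head_vec):
        fused = torch.cat([sal_feat, left_feat, right_feat, head_vec], dim=1)  # (B, 393)
        return self.mlp(fused)                                                  # (B, 2)
```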

Besides the gaze point coordinates, the model can also output the pitch and yaw angles of the eyes in the head coordinate system and the world coordinate system, the three-dimensional unit gaze vectors of the eyes in those coordinate systems, the gaze targets of the eyes in the real world, and so on. All of these output targets can be converted into one another through geometric relationships, so there is no essential difference between them.

Training the deep convolutional network gaze point prediction model using the first acquired data set specifically comprises the following steps:

s21: dividing the first collected data set into a training group and a testing group;

Specifically, a large amount of data, including binocular images, scene images and head movement data, is collected from 50 subjects and divided into a training set and a test set at a ratio of 4:1; both sets are then preprocessed. The scenes in which the binocular images are collected include normal light, strong light, dim light, and the subject blinking. Both the training set and the test set are labeled with the actual screen gaze points.
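A minimal sketch of the 4:1 split, assuming the preprocessed, labeled samples are wrapped in a PyTorch `Dataset` (here called `full_dataset`, a hypothetical name):

```python
import torch
from torch.utils.data import random_split

def split_dataset(full_dataset, seed=0):
    """Split the collected data into training and test sets at a 4:1 ratio."""
    n_total = len(full_dataset)
    n_train = int(round(n_total * 0.8))                 # 4:1 -> 80% training, 20% test
    generator = torch.Generator().manual_seed(seed)     # reproducible split
    return random_split(full_dataset, [n_train, n_total - n_train], generator=generator)
```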

S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the model on the test set, calculating the loss function value, and updating and saving the parameters of the model for that round using the Adam optimization algorithm;

Specifically, deep learning training consists of a forward pass and backpropagation; the neural network parameters are updated through backpropagation so that the predicted gaze point coordinates come as close as possible to the actual screen gaze point. The backpropagation process is generally divided into three steps: computing the loss function, computing the gradients, and updating the weights.

The loss function of this embodiment is the L1 loss, which is robust to outliers; the specific formula is

$L = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - g_i \right|$

where $y_i$ represents the estimated gaze point coordinates and $g_i$ represents the actual screen gaze point coordinates.

Specifically, after the loss function value is computed, the parameters of the proposed deep learning model are updated using the Adam optimization algorithm. For example, the initial learning rate is 1e-3, the learning rate decays to 0.1 of its previous value every 30 rounds, and the model is trained for 120 rounds in total.
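A sketch of the corresponding optimization setup (Adam, initial learning rate 1e-3, decay by a factor of 0.1 every 30 rounds), with the frozen saliency-network parameters excluded from the optimizer as described earlier; the helper name is illustrative:

```python
import torch
import torch.nn as nn

def build_optimization(model, lr=1e-3, step_size=30, gamma=0.1):
    """Adam with the learning-rate schedule described in the text."""
    trainable = [p for p in model.parameters() if p.requires_grad]  # frozen saliency weights excluded
    optimizer = torch.optim.Adam(trainable, lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
    criterion = nn.L1Loss()  # the L1 loss defined above
    return optimizer, scheduler, criterion
```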

S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and thereby determining the final deep convolutional network gaze point prediction model.

During training, one pass over all data in the training set constitutes one round. After each round, the network is evaluated on the test set, the average loss function value on the test set is computed, and the model parameters after that round are saved. For example, after all 120 rounds of training are completed, the model parameters with the smallest loss function value are selected, determining the final deep convolutional network gaze point prediction model.
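A sketch of this training-and-selection loop, reusing `build_optimization` from the previous sketch; the batch layout yielded by the data loaders and the model's four-input signature are assumptions, not taken from the original:

```python
import copy
import torch

def train_gaze_model(model, train_loader, test_loader, device="cpu"):
    """Train for 120 rounds, evaluate the average L1 loss on the test set after
    every round, and keep the parameters with the lowest test loss."""
    optimizer, scheduler, criterion = build_optimization(model)
    best_loss, best_state = float("inf"), None
    model.to(device)
    for epoch in range(120):
        model.train()
        for scene, left, right, head, target in train_loader:
            optimizer.zero_grad()
            pred = model(scene.to(device), left.to(device), right.to(device), head.to(device))
            loss = criterion(pred, target.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for scene, left, right, head, target in test_loader:
                pred = model(scene.to(device), left.to(device), right.to(device), head.to(device))
                total += criterion(pred, target.to(device)).item() * target.size(0)
                count += target.size(0)
        avg_loss = total / count
        if avg_loss < best_loss:                      # keep the round with the minimum test loss
            best_loss, best_state = avg_loss, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model
```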

It should be emphasized that, since the training and test sets include binocular images under various lighting conditions (strong light, dim light, or blinking), the trained deep convolutional network gaze point prediction model has relatively good resistance to light interference, and its output remains comparatively accurate under poor lighting conditions.

S3: preprocessing first acquired data, whether previously collected or collected in real time, inputting it into the trained deep convolutional network gaze point prediction model, and obtaining the estimated gaze point.

After the trained deep convolutional network gaze point prediction model is obtained, it is embedded into the program of the head-mounted eye tracking device. In practice, the user only needs to wear the device; a pre-programmed routine automatically acquires the eye images, the scene image and the gyroscope data, feeds them into the deep convolutional network gaze point prediction model, and outputs the screen coordinates the user is gazing at in real time.
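A sketch of this real-time inference step, reusing the preprocessing helpers (`scene_transform`, `eye_transform`, `head_motion_vector`) from the earlier sketches; camera and gyroscope acquisition are device-specific and not shown:

```python
import torch

@torch.no_grad()
def estimate_gaze_point(model, scene_img, left_eye_img, right_eye_img, head_readings):
    """Preprocess one sample and return the predicted screen coordinates."""
    model.eval()
    scene = scene_transform(scene_img).unsqueeze(0)          # (1, 3, 196, 128)
    left = eye_transform(left_eye_img).unsqueeze(0)
    right = eye_transform(right_eye_img).unsqueeze(0)
    head = head_motion_vector(*head_readings).unsqueeze(0)   # (1, 9)
    x, y = model(scene, left, right, head)[0].tolist()       # predicted gaze point
    return x, y
```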

Example 2

Further, as an implementation of the method shown in the above embodiment, another embodiment of the present invention provides a non-calibration eye movement interaction device. This device embodiment corresponds to the method embodiment; for readability, its details are not repeated one by one, but it should be clear that the device of this embodiment can implement all the content of the method embodiment. Fig. 4 is a schematic diagram of a non-calibration eye movement interaction device according to an embodiment of the present invention. As shown in fig. 4, the device of this embodiment comprises the following modules:

1. A data acquisition and preprocessing module: acquiring binocular images, scene images and head movement information as first acquired data; acquiring the first acquired data of a large number of different people in different scenes to form a first acquired data set, and preprocessing the data in the first acquired data set;

acquiring binocular image data of the person wearing the head-mounted device through its near-eye cameras, and preprocessing the binocular image data to obtain normalized binocular image vector data; acquiring scene image data through the head-mounted device, and preprocessing the scene image data to obtain normalized scene image vector data; and collecting head movement data of the person through a gyroscope in the head-mounted device, and preprocessing the head movement data to obtain one-dimensional head movement vector data;

2. A deep convolutional network gaze point prediction model: trained with the first acquired data set to obtain a mapping from the binocular images, scene images and head movement information to estimated gaze point coordinates; first acquired data, whether previously collected or collected in real time, is preprocessed and input into the trained model to obtain the estimated gaze point.

The deep convolutional network gaze point prediction model comprises a saliency map feature extraction module, an eye feature extraction module and an eye movement prediction module.

2.1 Saliency map feature extraction module: selecting any convolutional neural network pre-trained on a publicly available saliency detection dataset as the first convolutional neural network, and inputting the normalized scene image vector data into the first convolutional neural network to obtain saliency map data; processing the saliency map data with fully connected layers to generate high-dimensional saliency map features;

2.2 Eye feature extraction module: selecting any open-source base convolutional neural network as the second convolutional neural network, and inputting the normalized binocular image vector data into the second convolutional neural network to obtain the high-dimensional binocular image features;

2.3 Eye movement prediction module: fusing the high-dimensional saliency map features, the high-dimensional binocular image features and the one-dimensional head movement vector data, and then processing the fused vector with three fully connected layers to obtain the estimated gaze point coordinates.

The steps of training the deep convolutional network gaze point prediction model using the first acquired data set are described in steps S21-S23 of Embodiment 1 and are not repeated here.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
