Eye movement interaction method and device based on head time sequence signal correction


This technology, "一种基于头部时序信号校正的眼动交互方法及装置" (Eye movement interaction method and device based on head timing signal correction), was designed and created by 张敬, 王小东, 闫野, 印二威, 谢良, 闫慧炯, 罗治国, 艾勇保 and 张亚坤 on 2021-05-27. Its main content is as follows: the invention discloses an eye movement interaction method and device based on head timing signal correction, belonging to the technical field of computer vision. The method comprises: collecting a large number of continuous multi-frame binocular images together with the corresponding head motion timing information and actual screen gaze point coordinates, and preprocessing the collected data; training a deep convolutional network gaze point prediction model with the collected data to obtain a mapping from continuous multi-frame binocular images and head motion timing information to estimated gaze point coordinates; and inputting existing or real-time collected data into the trained deep convolutional network gaze point prediction model to obtain the estimated gaze point. The invention fuses the head timing signal with the eye-image timing signal and uses the powerful feature-processing capability of deep learning to estimate the human gaze point accurately and quickly; because the deep convolutional network gaze point prediction model is trained in advance, a new user needs no calibration and can perform real-time eye movement interaction simply by wearing the head-mounted eye tracker.

1. An eye movement interaction method based on head timing signal correction is characterized by comprising the following steps:

s1: acquiring continuous multi-frame binocular images, the corresponding head motion timing information and the actual screen gaze point coordinates as first acquired data; collecting the first acquired data of a large number of different people in different scenes as a first acquired data group, and preprocessing the data in the first acquired data group;

s2: training a deep convolutional network gaze point prediction model with the first acquired data group to obtain a mapping relation from the continuous multi-frame binocular images and the head motion timing information to estimated gaze point coordinates;

s3: preprocessing continuous multi-frame binocular images and the corresponding head motion timing information, acquired either in advance or in real time, inputting them into the trained deep convolutional network gaze point prediction model, and obtaining the estimated gaze point corresponding to the last frame of binocular images in the continuous multi-frame sequence.

2. The head timing signal correction-based eye movement interaction method according to claim 1, wherein preprocessing the first acquired data comprises:

acquiring the continuous multi-frame binocular image data of a person wearing a head-mounted device through the near-eye cameras of the head-mounted device, and preprocessing these data to obtain standardized vector data for each frame of binocular images; acquiring the head motion timing information of the person through a gyroscope in the head-mounted device, and preprocessing it to obtain a one-dimensional head motion vector for each frame; the actual screen gaze point coordinates are relative coordinates with respect to the upper-left corner of the screen in the relative coordinate system established for the AR screen, with a value range of [0, 1].

3. The head timing signal correction-based eye movement interaction method according to claim 2, wherein the deep convolutional network gaze point prediction model comprises:

selecting a MobileNetV3 network model as a first convolutional neural network, inputting the standardized vector data of each frame of binocular images into the first convolutional neural network for processing to obtain high-dimensional binocular image features, and concatenating these features to obtain a one-dimensional binocular image feature vector;

selecting a GRU network model as a second convolutional neural network, and inputting the high-dimensional binocular image features of n consecutive frames (n being a natural number) into the second convolutional neural network for processing to obtain a two-dimensional binocular image feature vector;

selecting two fully connected layers and a GRU network model to form a third convolutional neural network, and inputting the head motion timing information of the n consecutive frames into the third convolutional neural network for processing to obtain a two-dimensional head motion feature vector; wherein the head motion timing information of the n consecutive frames corresponds to the standardized binocular image vector data of the same n consecutive frames;

selecting two fully connected layers and a GRU network model to form a fourth convolutional neural network, and inputting the two-dimensional binocular image feature vector and the two-dimensional head motion feature vector into the fourth convolutional neural network for processing to obtain the estimated gaze point corresponding to the last of the n consecutive frames.

4. The head timing signal correction-based eye movement interaction method of claim 3, wherein training the deep convolutional network gaze point prediction model using the first acquired data group comprises:

s21: dividing the first acquired data group into a training set and a test set;

s22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the test set data with the model obtained from that round, calculating the loss function value, and updating and saving the parameters of the deep convolutional network gaze point prediction model for that round using the Adam optimization algorithm;

s23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and determining the final deep convolutional network gaze point prediction model.

5. The method according to claim 3, wherein n is 7 in the consecutive n frames.

6. An eye movement interaction device based on head timing signal correction, comprising:

a data acquisition and preprocessing module for: acquiring continuous multi-frame binocular images, the corresponding head motion timing information and the actual screen gaze point coordinates as first acquired data; collecting the first acquired data of a large number of different people in different scenes as a first acquired data group; and preprocessing the data in the first acquired data group;

a deep convolutional network gaze point prediction model, which is trained with the first acquired data group to obtain a mapping relation from the continuous multi-frame binocular images and the head motion timing information to the estimated gaze point coordinates; wherein the first acquired data, acquired in advance or in real time, are preprocessed and input into the trained deep convolutional network gaze point prediction model to obtain the estimated gaze point.

7. The head timing signal correction-based eye movement interaction device according to claim 6, wherein preprocessing the first acquired data comprises:

acquiring the continuous multi-frame binocular image data of a person wearing a head-mounted device through the near-eye cameras of the head-mounted device, and preprocessing these data to obtain standardized vector data for each frame of binocular images; acquiring the head motion timing information of the person through a gyroscope in the head-mounted device, and preprocessing it to obtain a one-dimensional head motion vector for each frame; the actual screen gaze point coordinates are relative coordinates with respect to the upper-left corner of the screen in the relative coordinate system established for the AR screen, with a value range of [0, 1].

8. The head timing signal correction-based eye movement interaction device according to claim 7, wherein the deep convolutional network gaze point prediction model comprises:

an eye feature extraction module: selecting a MobileNetV3 network model as a first convolutional neural network, inputting the standardized vector data of each frame of binocular images into the first convolutional neural network for processing to obtain high-dimensional binocular image features, and concatenating these features to obtain a one-dimensional binocular image feature vector;

an eye timing signal processing module: selecting a GRU network model as a second convolutional neural network, and inputting the high-dimensional binocular image features of n consecutive frames (n being a natural number) into the second convolutional neural network for processing to obtain a two-dimensional binocular image feature vector;

a head timing signal processing module: selecting two fully connected layers and a GRU network model to form a third convolutional neural network, and inputting the head motion timing information of the n consecutive frames into the third convolutional neural network for processing to obtain a two-dimensional head motion feature vector; wherein the head motion timing information of the n consecutive frames corresponds to the standardized binocular image vector data of the same n consecutive frames;

an eye movement prediction module: selecting two fully connected layers and a GRU network model to form a fourth convolutional neural network, and inputting the two-dimensional binocular image feature vector and the two-dimensional head motion feature vector into the fourth convolutional neural network for processing to obtain the estimated gaze point corresponding to the last of the n consecutive frames.

9. The head timing signal correction-based eye movement interaction device of claim 8, wherein training the deep convolutional network gaze point prediction model using the first acquired data group comprises:

s21: dividing the first acquired data group into a training set and a test set;

s22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the test set data with the model obtained from that round, calculating the loss function value, and updating and saving the parameters of the deep convolutional network gaze point prediction model for that round using the Adam optimization algorithm;

s23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and determining the final deep convolutional network gaze point prediction model.

10. The head timing signal correction-based eye movement interaction device according to claim 9, wherein n is 7 in the consecutive n frames.

Technical Field

The invention relates to the technical field of computer vision, in particular to an eye movement interaction method and device based on head timing signal correction.

Background

Eye movement interaction technology tracks a person's gaze direction through modalities such as eye electrical signals, optical signals and image signals, and then uses this information for human-computer interaction; it has great application potential in future human-computer interaction. Traditional human-computer interaction is carried out through external input/output devices and corresponding software, namely keyboards, mice and various pattern recognition devices, but it is mostly operated manually, sometimes in combination with voice, and therefore has certain limitations; for example, it is very inconvenient for disabled people with impaired upper limbs or for pilots in combat. When eye movement technology is applied to human-computer interaction, the movement of the eyes serves as the input. As the user's eyes move, the landing point of the gaze on the head-mounted screen is estimated to realize the interaction between the human and the computer. As a more natural and direct input modality, it can greatly improve the convenience of human-computer interaction.

At present, eye movement interaction methods mainly rely on optical recording of eye video: binocular images are obtained through eye cameras, and the gaze direction or the screen gaze point is then derived from the eye image information. Such methods fall into traditional geometric-model-based methods and appearance-based deep learning methods. However, these methods use only eye image information, which places relatively high demands on ambient brightness and on the degree of opening of the eyes, so they generally suffer from poor robustness in complex environments.

In a natural environment, a person unconsciously produces corresponding head movement when the gaze direction changes, and research has shown that gaze direction and head movement are strongly correlated. In the process of realizing the invention, however, the inventors noticed that the current eye movement interaction field does not incorporate the timing information of head movement. Moreover, eye movement physiology divides human eye movement into smooth pursuit, saccade and fixation modes that follow certain movement patterns, yet this eye movement timing information is also not fully utilized in the field of eye movement interaction.

Disclosure of Invention

In view of the above, the present invention provides an eye movement interaction method and apparatus based on head timing signal correction. By fusing human eye images, eye movement timing information and head movement timing information, it mainly aims to solve the problems of conventional eye movement interaction methods: sensitivity to ambient brightness and to the degree of eye opening, poor robustness in complex environments, and low accuracy.

According to an aspect of the present invention, there is provided an eye movement interaction method based on head timing signal correction, the method comprising the following steps. S1: acquiring continuous multi-frame binocular images, the corresponding head motion timing information and the actual screen gaze point coordinates as first acquired data; collecting the first acquired data of a large number of different people in different scenes as a first acquired data group, and preprocessing the data in the first acquired data group. S2: training a deep convolutional network gaze point prediction model with the first acquired data group to obtain a mapping relation from the continuous multi-frame binocular images and the head motion timing information to estimated gaze point coordinates. S3: preprocessing continuous multi-frame binocular images and the corresponding head motion timing information, acquired either in advance or in real time, inputting them into the trained deep convolutional network gaze point prediction model, and obtaining the estimated gaze point corresponding to the last frame of binocular images in the continuous multi-frame sequence.

As a further improvement of the present invention, preprocessing the first acquired data includes: acquiring the continuous multi-frame binocular image data of a person wearing a head-mounted device through the near-eye cameras of the head-mounted device, and preprocessing these data to obtain standardized vector data for each frame of binocular images; acquiring the head motion timing information of the person through a gyroscope in the head-mounted device, and preprocessing it to obtain a one-dimensional head motion vector for each frame; the actual screen gaze point coordinates are relative coordinates with respect to the upper-left corner of the screen in the relative coordinate system established for the AR screen, with a value range of [0, 1].

As a further improvement of the present invention, the deep convolutional network gaze point prediction model comprises: selecting a MobileNetV3 network model as a first convolutional neural network, inputting the standardized vector data of each frame of binocular images into the first convolutional neural network for processing to obtain high-dimensional binocular image features, and concatenating these features to obtain a one-dimensional binocular image feature vector; selecting a GRU network model as a second convolutional neural network, and inputting the high-dimensional binocular image features of n consecutive frames (n being a natural number) into the second convolutional neural network for processing to obtain a two-dimensional binocular image feature vector; selecting two fully connected layers and a GRU network model to form a third convolutional neural network, and inputting the head motion timing information of the n consecutive frames into the third convolutional neural network for processing to obtain a two-dimensional head motion feature vector, wherein the head motion timing information of the n consecutive frames corresponds to the standardized binocular image vector data of the same n consecutive frames; and selecting two fully connected layers and a GRU network model to form a fourth convolutional neural network, and inputting the two-dimensional binocular image feature vector and the two-dimensional head motion feature vector into the fourth convolutional neural network for processing to obtain the estimated gaze point corresponding to the last of the n consecutive frames.

As a further improvement of the present invention, training the deep convolutional network gaze point prediction model with the first acquired data group comprises: S21: dividing the first acquired data group into a training set and a test set; S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the test set data with the model obtained from that round, calculating the loss function value, and updating and saving the parameters of the deep convolutional network gaze point prediction model for that round using the Adam optimization algorithm; S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and determining the final deep convolutional network gaze point prediction model.

As a further improvement of the present invention, in the consecutive n frames, n is 7.

According to another aspect of the present invention, there is provided an eye movement interaction device based on head timing signal correction, comprising: a data acquisition and preprocessing module for acquiring continuous multi-frame binocular images, the corresponding head motion timing information and the actual screen gaze point coordinates as first acquired data, collecting the first acquired data of a large number of different people in different scenes as a first acquired data group, and preprocessing the data in the first acquired data group; and a deep convolutional network gaze point prediction model, which is trained with the first acquired data group to obtain a mapping relation from the continuous multi-frame binocular images and the head motion timing information to the estimated gaze point coordinates, and into which the preprocessed first acquired data, acquired in advance or in real time, are input to obtain the estimated gaze point.

As a further improvement of the present invention, preprocessing the first acquired data includes: acquiring the continuous multi-frame binocular image data of a person wearing a head-mounted device through the near-eye cameras of the head-mounted device, and preprocessing these data to obtain standardized vector data for each frame of binocular images; acquiring the head motion timing information of the person through a gyroscope in the head-mounted device, and preprocessing it to obtain a one-dimensional head motion vector for each frame; the actual screen gaze point coordinates are relative coordinates with respect to the upper-left corner of the screen in the relative coordinate system established for the AR screen, with a value range of [0, 1].

As a further improvement of the present invention, the deep convolutional network gaze point prediction model comprises: an eye feature extraction module: selecting a MobileNetV3 network model as a first convolutional neural network, inputting the standardized vector data of each frame of binocular images into the first convolutional neural network for processing to obtain high-dimensional binocular image features, and concatenating these features to obtain a one-dimensional binocular image feature vector; an eye timing signal processing module: selecting a GRU network model as a second convolutional neural network, and inputting the high-dimensional binocular image features of n consecutive frames (n being a natural number) into the second convolutional neural network for processing to obtain a two-dimensional binocular image feature vector; a head timing signal processing module: selecting two fully connected layers and a GRU network model to form a third convolutional neural network, and inputting the head motion timing information of the n consecutive frames into the third convolutional neural network for processing to obtain a two-dimensional head motion feature vector, wherein the head motion timing information of the n consecutive frames corresponds to the standardized binocular image vector data of the same n consecutive frames; and an eye movement prediction module: selecting two fully connected layers and a GRU network model to form a fourth convolutional neural network, and inputting the two-dimensional binocular image feature vector and the two-dimensional head motion feature vector into the fourth convolutional neural network for processing to obtain the estimated gaze point corresponding to the last of the n consecutive frames.

As a further improvement of the present invention, training the deep convolutional network gaze point prediction model with the first acquired data group comprises: S21: dividing the first acquired data group into a training set and a test set; S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the test set data with the model obtained from that round, calculating the loss function value, and updating and saving the parameters of the deep convolutional network gaze point prediction model for that round using the Adam optimization algorithm; S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and determining the final deep convolutional network gaze point prediction model.

As a further improvement of the present invention, in the consecutive n frames, n is 7.

Through the above technical scheme, the invention provides the following beneficial effects:

(1) The head motion timing information and the eye image timing information are fused, and by exploiting the powerful feature-processing capability of deep learning, the human gaze point can be estimated accurately and quickly in eye movement interaction applications.

(2) Because the deep convolutional network gaze point prediction model is trained in advance, no calibration is needed once training is complete, and a new user can carry out real-time eye movement interaction simply by wearing the head-mounted eye tracker.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is a general flowchart of an eye movement interaction method based on head timing signal correction according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a neural network structure of an eye movement interaction method based on head timing signal correction according to an embodiment of the present invention;

fig. 3 is a block diagram illustrating an eye movement interaction apparatus based on head timing signal correction according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

First, some of the terms appearing in the description of the embodiments of the present invention are explained as follows:

Deep learning: Deep learning learns the intrinsic laws and representation levels of sample data; the information obtained during learning is very helpful for interpreting data such as text, images and sounds. Its ultimate goal is to give machines human-like analysis and learning abilities so that they can recognize data such as text, images and sounds.

Convolutional neural network: a neural network system built on convolutional layers; it is a deep learning method. A convolutional neural network mainly consists of two parts: feature extraction (convolution, activation function, pooling) and classification/recognition (fully connected layers). Convolution extracts different input features; the activation function introduces nonlinearity by mapping the convolutional layer's output nonlinearly; the pooling layer downsamples the input image and reduces the amount of computation; and the fully connected layers act as the classifier of the whole network, i.e., the results produced by the deep layers (convolution, activation function, pooling) are recognized and classified through the fully connected layers.

PyTorch: PyTorch is an open source Python machine learning library, which is based on Torch and used for applications such as natural language processing.

Fully connected layer: each fully connected layer is a flat structure composed of many neurons; its core operation is the matrix-vector product y = Wx, which in essence linearly transforms one feature space into another.

MobileNetV3: the third generation of the MobileNet series proposed by Google, a lightweight convolutional neural network architecture designed to run on mobile devices.

GRU: the GRU, which is called a gated cyclic unit, is a very effective variant of the LSTM network, and is simpler in structure and more effective than the LSTM network, and thus is also a very popular network at present. The method has good performance in natural language processing tasks and time sequence signal processing tasks.

Adam optimizer: an adaptive learning rate optimization algorithm for machine learning, with advantages such as fast convergence and easy parameter tuning. By adapting the learning rate of the machine learning model, it improves training speed.

The core technical problem to be solved by the invention is that prior-art eye movement interaction methods are sensitive to ambient brightness and to the degree of eye opening, and have poor robustness in complex environments.

To address these problems, the invention provides an eye movement interaction method and device based on head timing signal correction. Using the powerful feature-processing capability of deep learning, it fuses the human eye images, eye movement timing information and head movement timing information, and uses the head movement timing information as auxiliary information to correct the image-based eye movement prediction. It can accurately and quickly estimate the human gaze point in eye movement interaction applications, offers better robustness, and further improves the accuracy of eye movement interaction.

Example 1

Referring to fig. 1, a general flowchart of an eye movement interaction method based on head timing signal correction according to an embodiment of the present invention is shown.

As shown in fig. 1, the method of the embodiment of the present invention mainly includes the following steps:

S1: acquiring continuous multi-frame binocular images, the corresponding head motion timing information and the actual screen gaze point coordinates as first acquired data; collecting the first acquired data of a large number of different people in different scenes to form a first acquired data group, and preprocessing the data in the first acquired data group.

Binocular image data of the wearer are collected through the near-eye cameras of the head-mounted device and preprocessed into 128 × 128 images; for the pair of eye images corresponding to each frame, picture compression and standardization with per-channel means of [0.485, 0.456, 0.406] and standard deviations of [0.229, 0.224, 0.225] generate two tensors of size 3 × 128 × 128.
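A minimal PyTorch sketch of the per-frame image preprocessing described above (resize to 128 × 128, per-channel standardization with the stated means and standard deviations); the use of torchvision transforms and the file-path interface are illustrative assumptions, not the patent's actual implementation.

```python
# Sketch of the per-frame eye-image preprocessing, assuming torchvision is used.
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((128, 128)),                     # compress each eye image to 128x128
    transforms.ToTensor(),                             # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # per-channel standardization
                         std=[0.229, 0.224, 0.225]),
])

def preprocess_frame(left_path: str, right_path: str):
    """Return two 3x128x128 tensors for one frame of binocular images."""
    left = preprocess(Image.open(left_path).convert("RGB"))
    right = preprocess(Image.open(right_path).convert("RGB"))
    return left, right
```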

Head motion information of the wearer is collected through a gyroscope in the head-mounted device; the head motion information provided by the gyroscope can include, but is not limited to, the spatial position, attitude angle, spatial velocity, attitude angular velocity, spatial acceleration and attitude angular acceleration of the device. The motion information used in this embodiment consists of the attitude angle, spatial velocity and attitude angular velocity recorded by the gyroscope, which are concatenated into a one-dimensional vector of length 9.
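A minimal sketch of assembling the length-9 head motion vector from the three gyroscope readings; the function signature and the ordering of the components are illustrative assumptions.

```python
# Concatenate the three 3-component gyroscope readings into one length-9 vector.
import numpy as np

def head_motion_vector(attitude_angle, spatial_velocity, angular_velocity):
    """Build the per-frame length-9 head-motion vector used by the model."""
    vec = np.concatenate([
        np.asarray(attitude_angle, dtype=np.float32),    # e.g. roll, pitch, yaw
        np.asarray(spatial_velocity, dtype=np.float32),  # vx, vy, vz
        np.asarray(angular_velocity, dtype=np.float32),  # wx, wy, wz
    ])
    assert vec.shape == (9,)
    return vec
```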

The actual screen gaze point coordinates are relative coordinates with respect to the upper-left corner of the screen in the relative coordinate system established for the AR screen, with a value range of [0, 1].
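A small illustrative helper, assuming hypothetical screen dimensions in pixels, showing how an absolute fixation point with its origin at the top-left corner would map to the relative [0, 1] coordinates used as labels.

```python
# Convert an absolute screen gaze point (pixels, origin top-left) to relative [0, 1] coords.
def to_relative(x_px: float, y_px: float, screen_w: int, screen_h: int):
    return x_px / screen_w, y_px / screen_h

# Example: point (640, 360) on a hypothetical 1280x720 AR screen -> (0.5, 0.5)
print(to_relative(640, 360, 1280, 720))
```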

S2: training a deep convolutional network gaze point prediction model with the first acquired data group to obtain a mapping relation from the continuous multi-frame binocular images and the head motion timing information to estimated gaze point coordinates.

referring to fig. 2, a diagram of a neural network structure of an eye movement interaction method based on head timing signal correction according to an embodiment of the present invention is shown.

Specifically, the deep convolutional network gaze point prediction model comprises:

1. Selecting a MobileNetV3 network model as a first convolutional neural network, inputting the standardized vector data of each frame of binocular images into the first convolutional neural network for processing to obtain high-dimensional binocular image features, and concatenating these features to obtain a one-dimensional binocular image feature vector.

In this embodiment, the first convolutional neural network is specifically a MobileNetV3 network model. MobileNetV3 is a lightweight deep learning network structure proposed by Google; in this embodiment only the final adaptive average pooling layer parameter is modified to change the final output dimension, and the other network structures and parameters remain unchanged. Specifically, the two eye images corresponding to each frame are subjected to picture compression and normalization with means [0.485, 0.456, 0.406] and standard deviations [0.229, 0.224, 0.225] to generate two tensors of size 3 × 128 × 128; these are fed into two identical MobileNetV3 networks with shared weights, which each output a one-dimensional high-dimensional feature vector of length 64, compressing the eye image information; the two feature vectors are then concatenated into a one-dimensional feature vector of length 128.
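A hedged sketch of the eye feature extraction branch, under the assumptions that torchvision's MobileNetV3-Small backbone is used and that the output dimension is changed by replacing the classifier head with a 64-dimensional linear layer; the patent itself only states that the final layer parameter is modified while the rest of the network stays unchanged.

```python
# Sketch: one MobileNetV3 backbone shared by both eyes, 64-dim feature per eye,
# concatenated into a 128-dim binocular feature vector.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class EyeFeatureExtractor(nn.Module):
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        backbone = mobilenet_v3_small(weights=None)
        # Assumption: swap the classifier head so each eye image yields a feat_dim vector.
        in_features = backbone.classifier[0].in_features
        backbone.classifier = nn.Linear(in_features, feat_dim)
        self.backbone = backbone  # weights shared between the two eyes

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        # left, right: (batch, 3, 128, 128) -> concatenated (batch, 128)
        f_left = self.backbone(left)
        f_right = self.backbone(right)
        return torch.cat([f_left, f_right], dim=1)
```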

2. Selecting a GRU network model as a second convolutional neural network, and inputting the high-dimensional binocular image features of n consecutive frames (n being a natural number) into the second convolutional neural network for processing to obtain a two-dimensional binocular image feature vector.

In this embodiment, the second convolutional neural network is a GRU network commonly used in deep learning. Illustratively, the high-dimensional eye feature vectors corresponding to 7 consecutive frames are used as input and passed through two GRU layers with hidden size 128, finally outputting a 7 × 128 two-dimensional binocular image feature vector, which is a coupled compression of the eye image information and the per-frame timing information.
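A minimal sketch of the eye timing branch, reading "two GRU models with hidden size 128" as a two-layer GRU; this layer interpretation is an assumption.

```python
# Sketch: 2-layer GRU over the 7 per-frame 128-dim eye features, keeping all 7 outputs.
import torch
import torch.nn as nn

class EyeTemporalBranch(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 128, num_layers: int = 2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, num_layers=num_layers, batch_first=True)

    def forward(self, eye_feats: torch.Tensor) -> torch.Tensor:
        # eye_feats: (batch, 7, 128) -> (batch, 7, 128), one output per frame
        out, _ = self.gru(eye_feats)
        return out
```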

3. Selecting two fully connected layers and a GRU network model to form a third convolutional neural network, and inputting the head motion timing information of the n consecutive frames into the third convolutional neural network for processing to obtain a two-dimensional head motion feature vector; wherein the head motion timing information of the n consecutive frames corresponds to the standardized binocular image vector data of the same n consecutive frames.

the motion information used in this embodiment is an attitude angle, a spatial motion velocity, and an attitude angular velocity recorded by a gyroscope, and the attitude angles, the spatial motion velocity, and the attitude angular velocity are spliced into a one-dimensional vector with a length of 9. The module takes motion information of continuous 7 frames as input, and the head time sequence information processing module consists of two full connection layers and a GRU model and respectively carries out dimensionality increase and time sequence signal processing tasks. Specifically, the input vector expands the motion information of each frame into a 128-dimensional vector through two fully-connected layers, and then passes through two GRU models of hidden layers with the length of 128, and finally outputs a 7 × 128 two-dimensional feature vector.

4. Selecting two fully connected layers and a GRU network model to form a fourth convolutional neural network, and inputting the two-dimensional binocular image feature vector and the two-dimensional head motion feature vector into the fourth convolutional neural network for processing to obtain the estimated gaze point corresponding to the last of the n consecutive frames.

In this case, the two-dimensional binocular image feature vector and the two-dimensional head motion feature vector are concatenated into a 7 × 256 two-dimensional high-dimensional gaze point feature vector, which is fed into a GRU model with hidden size 256 followed by two fully connected layers; the output is a one-dimensional vector of length 2, i.e., the predicted relative screen gaze point corresponding to the last of the 7 consecutive frames.
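A hedged sketch of this fusion and prediction head; the width of the first fully connected layer (64) and the sigmoid output used to keep the predicted coordinates within [0, 1] are assumptions.

```python
# Sketch: concatenate the two 7x128 sequences, run a GRU (hidden 256), and map the
# last frame's output to the 2-dim relative screen gaze point.
import torch
import torch.nn as nn

class GazeFusionHead(nn.Module):
    def __init__(self, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 64), nn.ReLU(),   # assumed intermediate width
            nn.Linear(64, 2), nn.Sigmoid(),         # assumed: keep output in [0, 1]
        )

    def forward(self, eye_seq: torch.Tensor, head_seq: torch.Tensor) -> torch.Tensor:
        # eye_seq, head_seq: (batch, 7, 128) each -> gaze point (batch, 2)
        fused = torch.cat([eye_seq, head_seq], dim=2)   # (batch, 7, 256)
        out, _ = self.gru(fused)
        return self.fc(out[:, -1, :])                   # prediction for the last frame
```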

Training the deep convolutional network gaze point prediction model with the first acquired data group specifically comprises the following steps:

S21: dividing the first acquired data group into a training set and a test set.

In this embodiment, 40 experimental subjects were recruited to wear a commercial head-mounted eye tracker, and gaze point calibration was performed first. Specifically, eye images, gyroscope data and AR screen gaze points of the 40 subjects were collected over one continuous minute each, with an eye tracker sampling frequency of 30 Hz; the gyroscope data comprise attitude angles, spatial velocities and attitude angular velocities. The eye tracker and the eye pose are in a direct following relationship, i.e., when the head turns but the eye pose does not change, the screen gaze point does not change. All data were divided into a training set and a test set at a ratio of 4:1. Each input sample of the data set consists of the binocular images and gyroscope data of 7 consecutive frames; the label is the screen gaze point predicted by the calibration algorithm for the last of the 7 frames, expressed as relative coordinates with respect to the upper-left corner of the screen in the relative coordinate system established for the AR screen, with a value range of [0, 1]. The data segments are sampled through a sliding window with a stride of 7 frames.
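A minimal sketch of how the recordings could be cut into 7-frame samples and split 4:1 into training and test sets; the per-frame dictionary keys ("eyes", "head", "gaze") are hypothetical, and shuffling before the split is an assumption.

```python
# Sketch: non-overlapping 7-frame windows; label = gaze point of the last frame.
import random

def make_samples(frames, window=7):
    samples = []
    for start in range(0, len(frames) - window + 1, window):
        chunk = frames[start:start + window]
        samples.append({
            "eyes": [f["eyes"] for f in chunk],   # 7 binocular image pairs
            "head": [f["head"] for f in chunk],   # 7 length-9 head-motion vectors
            "gaze": chunk[-1]["gaze"],            # label: last frame's screen gaze point
        })
    return samples

def split_train_test(samples, ratio=0.8, seed=0):
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * ratio)
    return samples[:cut], samples[cut:]
```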

S22: training the deep convolutional network gaze point prediction model with the training set, where one pass over all data in the training set constitutes one round; after each round of training, evaluating the test set data with the model obtained from that round, calculating the loss function value, and updating and saving the parameters of the deep convolutional network gaze point prediction model for that round using the Adam optimization algorithm.

Specifically, deep learning training consists of a forward pass and backpropagation; the neural network parameters are updated through backpropagation so that the predicted gaze point coordinates are as close as possible to the actual screen gaze point. Backpropagation generally comprises three steps: computing the loss function, computing the gradients, and updating the weights.

In this embodiment, the optimizer used is the Adam optimizer, and the loss function is the L1 loss function, which is robust to outliers. The specific formula is as follows:

$$L_{1} = \frac{1}{N}\sum_{i=1}^{N}\left| y_{i} - g_{i} \right|$$

where $y_{i}$ denotes the model's predicted value and $g_{i}$ denotes the actual ground-truth value.

Specifically, after the loss function value is calculated, the parameters of the deep learning model proposed here are updated with the Adam optimization algorithm. For example, the initial learning rate is 1e-3, the learning rate is decayed to 0.1 of its previous value every 30 rounds, and the total number of training rounds is 120.
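A hedged sketch of this training schedule (Adam, initial learning rate 1e-3, decay by 0.1 every 30 rounds, 120 rounds, L1 loss, keeping the checkpoint with the lowest test loss). The objects `model`, `train_loader` and `test_loader` are assumed to exist, and the batch layout (input tensors followed by the gaze label) is illustrative, not the patent's exact training script.

```python
# Sketch of the training loop described in the text.
import torch
import torch.nn as nn

def train(model, train_loader, test_loader, epochs=120, device="cuda"):
    model.to(device)
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    best_loss = float("inf")

    for epoch in range(epochs):
        model.train()
        for *inputs, gaze in train_loader:                  # assumed batch layout
            pred = model(*(x.to(device) for x in inputs))
            loss = criterion(pred, gaze.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

        # Evaluate on the test set after every round and keep the best parameters.
        model.eval()
        total, count = 0.0, 0
        with torch.no_grad():
            for *inputs, gaze in test_loader:
                pred = model(*(x.to(device) for x in inputs))
                total += criterion(pred, gaze.to(device)).item() * gaze.size(0)
                count += gaze.size(0)
        test_loss = total / max(count, 1)
        if test_loss < best_loss:
            best_loss = test_loss
            torch.save(model.state_dict(), "best_gaze_model.pt")
```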

S23: performing multiple rounds of training, selecting the model parameters with the minimum loss function value, and determining the final deep convolutional network gaze point prediction model.

In the deep learning training process, one pass over all data in the training set constitutes one round; after each round of training, the test set data are evaluated with the network obtained from that round, the average loss function value on the test set is calculated, and the parameters of the deep convolutional network gaze point prediction model for that round are saved. For example, after the 120 rounds of training are completed, the model parameters with the smallest loss function value are selected to determine the final deep convolutional network gaze point prediction model.

It should be emphasized that, because the training and test data include binocular images and timing information under various lighting conditions (strong light, dim light, or blinking), and because the head motion timing information itself is unaffected by lighting, the deep convolutional network gaze point prediction model trained on these data is more resistant to lighting interference and can produce more accurate output under poor lighting conditions.

S3: preprocessing continuous multi-frame binocular images and the corresponding head motion timing information, acquired either in advance or in real time, inputting them into the trained deep convolutional network gaze point prediction model, and obtaining the estimated gaze point corresponding to the last frame of binocular images in the continuous multi-frame sequence.

After the trained deep convolutional network gaze point prediction model is obtained, it is embedded into the program of the head-mounted eye movement device. In practice, the user only needs to wear the device: a pre-programmed computer program automatically acquires the eye images and gyroscope data, feeds them into the deep convolutional network gaze point prediction model, and outputs the screen coordinates the user is gazing at in real time.
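A minimal sketch of such a real-time loop: the latest 7 preprocessed frames are buffered and fed to the trained model to produce the current relative screen gaze point. The `capture_frame()` helper (returning preprocessed eye tensors and a head-motion vector) and the model's input signature are hypothetical, not an actual headset API.

```python
# Sketch of a real-time inference loop over a 7-frame sliding buffer.
from collections import deque
import torch

def run_eye_tracking(model, capture_frame, window=7, device="cuda"):
    model.to(device).eval()
    buffer = deque(maxlen=window)
    while True:
        buffer.append(capture_frame())          # (left, right, head_vec) for one frame
        if len(buffer) < window:
            continue                            # wait until 7 frames are available
        lefts = torch.stack([l for l, _, _ in buffer]).unsqueeze(0)   # (1, 7, 3, 128, 128)
        rights = torch.stack([r for _, r, _ in buffer]).unsqueeze(0)
        head = torch.stack([h for _, _, h in buffer]).unsqueeze(0)    # (1, 7, 9)
        with torch.no_grad():
            gaze = model(lefts.to(device), rights.to(device), head.to(device))
        yield gaze.squeeze(0).cpu().tolist()    # relative screen coords in [0, 1]
```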

Example 2

Further, as an implementation of the method shown in the foregoing embodiment, another embodiment of the present invention further provides an eye movement interaction device based on head timing signal correction. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. Fig. 3 is a block diagram illustrating an eye movement interaction apparatus based on head timing signal correction according to an embodiment of the present invention. As shown in fig. 3, in the apparatus of this embodiment, there are the following modules:

1. A data acquisition and preprocessing module: acquiring continuous multi-frame binocular images, the corresponding head motion timing information and the actual screen gaze point coordinates as first acquired data; collecting the first acquired data of a large number of different people in different scenes as a first acquired data group, and preprocessing the data in the first acquired data group.

The continuous multi-frame binocular image data of a person wearing the head-mounted device are acquired through the near-eye cameras of the head-mounted device and preprocessed to obtain standardized vector data for each frame of binocular images; the head motion timing information of the person is acquired through a gyroscope in the head-mounted device and preprocessed to obtain a one-dimensional head motion vector for each frame; the actual screen gaze point coordinates are relative coordinates with respect to the upper-left corner of the screen in the relative coordinate system established for the AR screen, with a value range of [0, 1].

2. A deep convolutional network gaze point prediction model, which is trained with the first acquired data group to obtain a mapping relation from the continuous multi-frame binocular images and the head motion timing information to the estimated gaze point coordinates; the first acquired data, acquired in advance or in real time, are preprocessed and input into the trained deep convolutional network gaze point prediction model to obtain the estimated gaze point.

The deep convolutional network gaze point prediction model comprises an eye feature extraction module, an eye timing signal processing module, a head timing signal processing module and an eye movement prediction module.

2.1 Eye feature extraction module: selecting a MobileNetV3 network model as a first convolutional neural network, inputting the standardized vector data of each frame of binocular images into the first convolutional neural network for processing to obtain high-dimensional binocular image features, and concatenating these features to obtain a one-dimensional binocular image feature vector.

2.2 Eye timing signal processing module: selecting a GRU network model as a second convolutional neural network, and inputting the high-dimensional binocular image features of n consecutive frames (n being a natural number) into the second convolutional neural network for processing to obtain a two-dimensional binocular image feature vector.

2.3 Head timing signal processing module: selecting two fully connected layers and a GRU network model to form a third convolutional neural network, and inputting the head motion timing information of the n consecutive frames into the third convolutional neural network for processing to obtain a two-dimensional head motion feature vector; wherein the head motion timing information of the n consecutive frames corresponds to the standardized binocular image vector data of the same n consecutive frames.

2.4 Eye movement prediction module: selecting two fully connected layers and a GRU network model to form a fourth convolutional neural network, and inputting the two-dimensional binocular image feature vector and the two-dimensional head motion feature vector into the fourth convolutional neural network for processing to obtain the estimated gaze point corresponding to the last of the n consecutive frames.

The steps of training the deep convolutional network gaze point prediction model with the first acquired data group are described in steps S21-S23 of Example 1 and are not repeated here.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
