Desktop interaction method based on artificial intelligence

Document No.: 1782712    Publication date: 2019-12-06

Reading note: This technology, "Desktop interaction method based on artificial intelligence" (基于人工智能的桌面交互方法), was designed and created by 张哲� on 2019-08-05. Its main content is as follows: the application discloses a desktop interaction method based on artificial intelligence, comprising: when the gesture of the user is correct, detecting the position of the hand; matching the hand position image with a preset image; acquiring first text information corresponding to the successfully matched preset image; identifying second text information around the finger key points; and matching the first text information with the second text information to obtain third text information, and returning the service corresponding to the third text information. The application proposes a new interaction approach that uses artificial intelligence techniques to recognize actions and gestures, and applies this approach to desktop learning and entertainment scenarios. The gesture of the user can be recognized automatically, and the corresponding application program is triggered according to the gesture.

1. A desktop interaction method based on artificial intelligence is characterized by comprising the following steps:

when the gesture of the user is correct, detecting the position of the hand;

matching the hand position image with a preset image;

acquiring first text information corresponding to the successfully matched preset image;

identifying second text information around the finger key points; and

matching the first text information with the second text information to obtain third text information, and returning the service corresponding to the third text information.

2. The artificial intelligence based desktop interaction method of claim 1, wherein detecting the position of the hand when the gesture of the user is correct further comprises:

determining whether the hand is completely placed on the desktop; and

when the hand is completely placed on the desktop, determining whether the gesture is correct.

3. The artificial intelligence based desktop interaction method of claim 2, wherein determining whether the hand is completely placed on the desktop comprises:

a mobile phone, tablet, or other camera device captures continuous images in real time; the images are input into a trained depth perception model, which determines whether the hand is on the desktop; when the hand is completely placed on the desktop, a result of 1 is returned and the method proceeds to the next step; when the hand is not completely placed on the desktop, a result of 0 is returned and the determination of whether the hand is completely placed on the desktop is repeated.

4. The artificial intelligence based desktop interaction method of claim 3, wherein determining whether the gesture is correct comprises:

when the depth perception model determines that the hand is on the desktop, the image of the current frame is input into a trained gesture recognition model, which determines whether the gesture is a correct operation gesture; when the gesture in the image is correct, a result of 1 is returned and the method proceeds to the next step; when the gesture in the image is wrong, a result of 0 is returned and the determination of whether the hand is completely placed on the desktop is repeated.

5. The artificial intelligence based desktop interaction method of claim 4, wherein detecting the position of the hand comprises: inputting the image of the current frame into a trained hand detection and localization model, which identifies the position of the hand and the positions of the corresponding hand key points.

6. The artificial intelligence based desktop interaction method of claim 5, wherein matching the hand position image with a preset image comprises:

after the position of the hand and the positions of the corresponding key points are detected, SIFT features of the current frame image are extracted, the SIFT features are matched with the features of all images in the database, and the ID of the matched image is returned.

7. The artificial intelligence based desktop interaction method of claim 6, wherein acquiring the first text information corresponding to the successfully matched preset image comprises:

acquiring, from a database according to the ID value, the corresponding text information, including but not limited to Chinese and English sentences, words, and questions.

8. The artificial intelligence based desktop interaction method of claim 7, wherein identifying the second text information around the finger key points comprises:

cropping an image area according to the positions of the hand key points, and recognizing the text information in the image area through OCR.

9. The artificial intelligence based desktop interaction method of claim 8, wherein matching the first text information with the second text information, obtaining third text information, and returning a service corresponding to the third text information comprises:

matching the second text information with the first text information to obtain the matched third text information; according to the matched third text information, if it is a Chinese or English word or sentence, the translated result is read aloud and displayed on the screen; if it is question information, the solution process is searched for and displayed on the screen.

10. A computer device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of any one of claims 1-9 when executing the computer program.

Technical Field

The application relates to the field of artificial intelligence, in particular to a desktop interaction method based on artificial intelligence.

Background

Existing desktop interaction methods cannot automatically recognize the gesture of a user, nor can they trigger an interactive program according to the gesture.

Disclosure of Invention

According to one aspect of the application, a desktop interaction method based on artificial intelligence is provided, comprising the following steps:

when the gesture of the user is correct, detecting the position of the hand;

matching the hand position image with a preset image;

acquiring first text information corresponding to the successfully matched preset image;

identifying second text information around the finger key points; and

matching the first text information with the second text information to obtain third text information, and returning the service corresponding to the third text information.

Optionally, when the gesture of the user is correct, detecting the position of the hand further comprises:

determining whether the hand is completely placed on the desktop; and

when the hand is completely placed on the desktop, determining whether the gesture is correct.

Optionally, determining whether the hand is fully on the desktop comprises:

a mobile phone, tablet, or other camera device captures continuous images in real time; the images are input into a trained depth perception model, which determines whether the hand is on the desktop; when the hand is completely placed on the desktop, a result of 1 is returned and the method proceeds to the next step; when the hand is not completely placed on the desktop, a result of 0 is returned and the determination of whether the hand is completely placed on the desktop is repeated.

Optionally, determining whether the gesture is correct comprises:

when the depth perception model determines that the hand is on the desktop, the image of the current frame is input into a trained gesture recognition model, which determines whether the gesture is a correct operation gesture; when the gesture in the image is correct, a result of 1 is returned and the method proceeds to the next step; when the gesture in the image is wrong, a result of 0 is returned and the determination of whether the hand is completely placed on the desktop is repeated.

Optionally, detecting the position of the hand comprises: inputting the image of the current frame into a trained hand detection and localization model, which identifies the position of the hand and the positions of the corresponding hand key points.

Optionally, matching the hand position image with a preset image comprises:

after the position of the hand and the positions of the corresponding key points are detected, SIFT features of the current frame image are extracted, the SIFT features are matched with the features of all images in the database, and the ID of the matched image is returned.

Optionally, acquiring the first text information corresponding to the successfully matched preset image comprises:

acquiring, from a database according to the ID value, the corresponding text information, including but not limited to Chinese and English sentences, words, and questions.

Optionally, identifying the second text information around the finger key point comprises:

cropping an image area according to the positions of the hand key points, and recognizing the text information in the image area through OCR.

Optionally, matching the first text information with the second text information, obtaining third text information, and returning a service corresponding to the third text information includes:

matching the second text information with the first text information to obtain the matched third text information; according to the matched third text information, if it is a Chinese or English word or sentence, the translated result is read aloud and displayed on the screen; if it is question information, the solution process is searched for and displayed on the screen.

To achieve the above object, according to another aspect of the present application, a computer device is provided.

The computer device according to the present application includes: a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements any one of the methods described above when executing the computer program.

The application provides a new interaction method that uses artificial intelligence technology to recognize actions and gestures, and applies this interaction method to desktop learning and entertainment scenarios. The interaction relies on external devices, which fall into two categories. In the first, the hardware stands perpendicular to the desktop and a monocular camera mounted at its top points obliquely downward, so that books, hands, text, pictures, and other information on the desktop can be captured clearly. In the second, a mobile phone or tablet lies on the desktop with an optical reflection system mounted above it; the reflection system forms images by reflection, so that the front camera of the phone or tablet can clearly capture the books, hands, text, pictures, and other information on the desktop.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, serve to provide a further understanding of the application and to enable other features, objects, and advantages of the application to be more apparent. The drawings and their description illustrate the embodiments of the invention and do not limit it. In the drawings:

FIG. 1 is a flow diagram of a method of desktop interaction based on artificial intelligence according to one embodiment of the present application;

FIG. 2 is a detailed block diagram of a depth perception model according to one embodiment of the present application;

FIG. 3 is a detailed block diagram of a gesture recognition model according to one embodiment of the present application;

FIG. 4 is a detailed block diagram of a hand detection and keypoint localization model according to one embodiment of the present application;

FIG. 5 is a flow diagram of extracting SIFT features of a current picture according to one embodiment of the present application; and

FIG. 6 is a schematic diagram of a computer device according to one embodiment of the present application.

Detailed Description

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.

It should be noted that the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments and the accompanying drawings.

As shown in Fig. 1, in an embodiment of the present application, a desktop interaction method based on artificial intelligence is provided, including:

Step S102: when the gesture of the user is correct, detecting the position of the hand;

Step S104: matching the hand position image with a preset image;

Step S106: acquiring first text information corresponding to the successfully matched preset image;

Step S108: identifying second text information around the finger key points; and

Step S110: matching the first text information with the second text information to obtain third text information, and returning the service corresponding to the third text information.

In an embodiment of the present application, when the gesture of the user is correct, detecting the position of the hand further comprises:

Step S100: determining whether the hand is completely placed on the desktop; and

Step S101: when the hand is completely placed on the desktop, determining whether the gesture is correct.

In an embodiment of the present application, determining whether the hand is completely placed on the desktop comprises: a mobile phone, tablet, or other camera device captures continuous images in real time; the images are input into a trained depth perception model, which determines whether the hand is on the desktop; when the hand is completely placed on the desktop, a result of 1 is returned and the method proceeds to the next step; when the hand is not completely placed on the desktop, a result of 0 is returned and the determination of whether the hand is completely placed on the desktop is repeated.

In an embodiment of the present application, determining whether the gesture is correct comprises: when the depth perception model determines that the hand is on the desktop, the image of the current frame is input into a trained gesture recognition model, which determines whether the gesture is a correct operation gesture; when the gesture in the image is correct, a result of 1 is returned and the method proceeds to the next step; when the gesture in the image is wrong, a result of 0 is returned and the determination of whether the hand is completely placed on the desktop is repeated. For illustration, a minimal sketch of this two-stage gating loop is given below.
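
The following sketch shows one way the two checks could be chained, assuming the two models have been exported as Keras models. The model file names, the preprocessing (resizing and normalization), and the use of a local webcam via OpenCV are assumptions made for the example, not details taken from the description.

```python
# Minimal sketch of the two-stage gating loop: hand-on-desktop check, then gesture check.
import cv2
import numpy as np
import tensorflow as tf

desk_model = tf.keras.models.load_model("depth_perception.h5")        # hypothetical file name
gesture_model = tf.keras.models.load_model("gesture_recognition.h5")  # hypothetical file name

def frame_passes_gates(frame_bgr):
    """Return True only when the hand is fully on the desktop AND the gesture is correct."""
    rgb = cv2.cvtColor(cv2.resize(frame_bgr, (128, 128)), cv2.COLOR_BGR2RGB)
    on_desk = np.argmax(desk_model.predict(rgb[None].astype(np.float32) / 255.0))
    if on_desk != 1:      # 0: hand not fully on the desktop -> re-check the next frame
        return False
    gray = cv2.cvtColor(cv2.resize(frame_bgr, (96, 96)), cv2.COLOR_BGR2GRAY)
    gesture_ok = np.argmax(
        gesture_model.predict(gray[None, ..., None].astype(np.float32) / 255.0))
    return gesture_ok == 1  # 1: correct operation gesture

cap = cv2.VideoCapture(0)   # in practice the phone/tablet camera stream
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_passes_gates(frame):
        break               # proceed to hand detection and keypoint localization
cap.release()
```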

In an embodiment of the present application, detecting the position of the hand comprises: inputting the image of the current frame into a trained hand detection and localization model, which identifies the position of the hand and the positions of the corresponding hand key points.

In an embodiment of the present application, matching the hand position image with a preset image comprises: after the position of the hand and the positions of the corresponding key points are detected, SIFT features of the current frame image are extracted, the SIFT features are matched with the features of all images in the database, and the ID of the matched image is returned.

In an embodiment of the present application, acquiring the first text information corresponding to the successfully matched preset image, where the first text information is obtained from a database, comprises: acquiring, from the database according to the ID value, the corresponding text information, including but not limited to Chinese and English sentences, words, and questions.

In an embodiment of the present application, identifying second text information around the finger key points, where the second text information is the text information surrounding the finger key points, comprises: cropping an image area according to the positions of the hand key points, and recognizing the text information in the image area through OCR.

In an embodiment of the present application, matching the first text information with the second text information to obtain third text information, where the third text information is the exactly matched text information, and returning the service corresponding to the third text information comprises:

matching the second text information with the first text information to obtain the matched third text information; according to the matched third text information, if it is a Chinese or English word or sentence, the translated result is read aloud and displayed on the screen; if it is question information, the solution process is searched for and displayed on the screen.

Depth perception model training and inference

Fig. 2 depicts the detailed structure of one example of the depth perception model. The model consists of three convolutional layers, two pooling layers, and one fully connected layer. The input to the model is a 128x128x3 RGB color image, and the output layer uses softmax to output two values representing, respectively, a hand fully on the desktop and a hand not fully on the desktop. The loss function is the softmax cross-entropy between the target value and the inferred value, the optimizer is Adam with an initial learning rate of 0.001, the training set contains one million samples, and training converges within 100 epochs.

The network input is a 128x128x3 image. The first convolutional layer has a 5x5 kernel and 32 channels; the first pooling layer has a stride of 2 with SAME padding. The second convolutional layer has a 5x5 kernel and 64 channels; the second pooling layer has a stride of 2 with SAME padding. The third convolutional layer has a 3x3 kernel and 96 channels. The last layer is a fully connected layer with 2 output units. For illustration, a minimal sketch of this architecture is given below.
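
The following Keras sketch reproduces the layer sizes listed above. The activation functions, the pooling window size, and the convolution padding are not specified in the description and are assumed here (ReLU, 2x2 max pooling, SAME padding).

```python
# Minimal Keras sketch of the depth perception network (hand on desktop vs. not).
import tensorflow as tf

def build_depth_perception_model():
    return tf.keras.Sequential([
        tf.keras.Input(shape=(128, 128, 3)),                                # RGB color image
        tf.keras.layers.Conv2D(32, 5, activation="relu", padding="same"),   # conv1: 5x5, 32 channels
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding="same"),
        tf.keras.layers.Conv2D(64, 5, activation="relu", padding="same"),   # conv2: 5x5, 64 channels
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding="same"),
        tf.keras.layers.Conv2D(96, 3, activation="relu", padding="same"),   # conv3: 3x3, 96 channels
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(2, activation="softmax"),                     # fully on desktop / not
    ])

model = build_depth_perception_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # initial learning rate 0.001
    loss="sparse_categorical_crossentropy",                    # softmax cross-entropy
    metrics=["accuracy"],
)
```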

Gesture discrimination model training and inference

Fig. 3 depicts the detailed structure of one example of the gesture discrimination model. The model consists of four convolutional layers, three pooling layers, and one fully connected layer. The input to the model is a 96x96 grayscale image, and the output layer uses softmax to output two values representing a correct gesture and a wrong gesture, respectively. The loss function is the softmax cross-entropy between the target value and the inferred value, the optimizer is Adam with an initial learning rate of 0.001, the training set contains 400,000 samples, and training converges within 80 epochs.

The network input is a 96x96 image. The first convolutional layer has a 5x5 kernel and 24 channels; the first pooling layer has a stride of 2 with SAME padding. The second convolutional layer has a 5x5 kernel and 24 channels; the second pooling layer has a stride of 2 with SAME padding. The third convolutional layer has a 3x3 kernel and 36 channels; the third pooling layer has a stride of 2 with SAME padding. The fourth convolutional layer has a 3x3 kernel and 36 channels. The last layer is a fully connected layer with 2 output units. A corresponding sketch follows.
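
A corresponding Keras sketch of the gesture discrimination network, under the same assumptions about activations, pooling window size, and convolution padding as above.

```python
# Minimal Keras sketch of the gesture discrimination network (correct vs. wrong gesture).
import tensorflow as tf

gesture_model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 96, 1)),                                   # grayscale image
    tf.keras.layers.Conv2D(24, 5, activation="relu", padding="same"),    # conv1: 5x5, 24 channels
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding="same"),
    tf.keras.layers.Conv2D(24, 5, activation="relu", padding="same"),    # conv2: 5x5, 24 channels
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding="same"),
    tf.keras.layers.Conv2D(36, 3, activation="relu", padding="same"),    # conv3: 3x3, 36 channels
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2, padding="same"),
    tf.keras.layers.Conv2D(36, 3, activation="relu", padding="same"),    # conv4: 3x3, 36 channels
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation="softmax"),                      # correct / wrong gesture
])
gesture_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
)
```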

Hand detection and keypoint localization model training and inference

Fig. 4 depicts the details of one example of the hand detection and keypoint localization model. The model takes the Mask R-CNN architecture as a reference: it outputs the positions of candidate hand bounding boxes with corresponding confidence scores, and each hand box corresponds to five feature heatmaps giving the positions of the thumb, index finger, middle finger, ring finger, and little finger, respectively. The loss function has three parts: the first is the L1 loss on the bounding box coordinates, the second is the softmax loss on whether the box contains the target gesture, and the third is the L1 loss on the hand keypoint positions. The optimizer is RMSProp, the training data consists of one million samples, and training converges within 200 epochs. A sketch of the three-part loss is given below.
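
The sketch below writes out the three-part loss with TensorFlow ops. The tensor shapes and the equal weighting of the three terms are assumptions; the description only names the loss types.

```python
# Sketch of the combined detection loss: L1 box loss + softmax classification loss + L1 keypoint loss.
import tensorflow as tf

def hand_detection_loss(box_pred, box_true,    # (N, 4) bounding-box coordinates
                        cls_logits, cls_true,  # (N, 2) logits, (N,) integer labels (target gesture or not)
                        kpt_pred, kpt_true):   # (N, 5, 2) x/y positions of the five fingertip keypoints
    box_loss = tf.reduce_mean(tf.abs(box_pred - box_true))       # L1 loss on the box position
    cls_loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=cls_true, logits=cls_logits))                  # softmax loss on the class
    kpt_loss = tf.reduce_mean(tf.abs(kpt_pred - kpt_true))       # L1 loss on the keypoint positions
    return box_loss + cls_loss + kpt_loss                        # equal weighting assumed

# optimizer = tf.keras.optimizers.RMSprop()  # the RMSProp optimizer named above
```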

Matching of current book page pictures

Fig. 5 describes the process of extracting SIFT features from the current picture. The picture is first downscaled to a 100x100 grayscale image, which is then divided into 100 equal regions of 10x10 pixels each. For each region a 128-dimensional vector is extracted, so each picture yields 100 vectors. The database stores precomputed vector matrices for one million pictures. The similarity of two vectors is computed as their cosine similarity; when the similarity exceeds a threshold of 0.8, the two feature vectors are considered a match, otherwise they are not. All feature vectors of the current picture are compared in turn against the one million pictures in the database, and the ID with the highest matching degree is determined. A sketch of this matching step follows.
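
The following sketch implements the page matching under the description above: one SIFT descriptor per 10x10 cell of the 100x100 thumbnail, compared region by region with cosine similarity. The keypoint size and the rule used to rank candidate pages (most regions above the threshold) are assumptions; a linear scan over the full database is shown only for clarity.

```python
# Sketch of grid-based SIFT feature extraction and cosine matching against a page database.
import cv2
import numpy as np

sift = cv2.SIFT_create()

def page_features(image_bgr):
    """Return a (100, 128) matrix: one SIFT descriptor per 10x10 grid cell."""
    gray = cv2.cvtColor(cv2.resize(image_bgr, (100, 100)), cv2.COLOR_BGR2GRAY)
    keypoints = [cv2.KeyPoint(x + 5.0, y + 5.0, 10.0)           # one keypoint at each cell centre
                 for y in range(0, 100, 10) for x in range(0, 100, 10)]
    _, descriptors = sift.compute(gray, keypoints)              # 128-dim vector per keypoint
    return descriptors

def match_page(query, database, threshold=0.8):
    """database: {page_id: (100, 128) array}. Return the ID with the most matching regions."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    best_id, best_hits = None, -1
    for page_id, feats in database.items():
        f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
        hits = int(np.sum(np.sum(q * f, axis=1) > threshold))   # cosine similarity per region
        if hits > best_hits:
            best_id, best_hits = page_id, hits
    return best_id
```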

Obtaining the resource file of the current picture according to the matched ID

According to the ID value, the resource file of the current page is obtained from the server database; the resource file comprises the text information, audio information, picture information, formulas, questions, and the like of the current picture. A minimal lookup sketch follows.
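
For illustration only, the lookup can be sketched as a query against a relational table keyed by the matched picture ID. The table and column names below are hypothetical; the original text does not describe the database schema.

```python
# Minimal sketch of fetching the resource file for a matched page ID.
import sqlite3

def fetch_page_resources(page_id, db_path="resources.db"):
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT text_info, audio_path, image_path, formulas, questions "
            "FROM page_resources WHERE page_id = ?", (page_id,)).fetchone()
    finally:
        conn.close()
    return row  # None when the ID has no stored resources
```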

Recognizing text information of surrounding area according to finger position

After the finger position is determined, a rectangular area 50 pixels high and 1000 pixels wide is cropped around it. The cropped area is input into Tesseract, Google's open-source OCR tool, which returns the sentences, words, formulas, and so on, in various languages, contained in the area. A sketch of this step is given below.
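
The sketch below uses pytesseract, a Python wrapper for Tesseract. The placement of the 50x1000-pixel strip relative to the fingertip and the language setting are assumptions made for the example.

```python
# Sketch of cropping a strip near the fingertip and running OCR on it.
import cv2
import pytesseract

def read_text_near_finger(frame_bgr, finger_x, finger_y, width=1000, height=50):
    h, w = frame_bgr.shape[:2]
    x0 = max(0, min(w - width, finger_x - width // 2))  # centre the strip on the fingertip
    y0 = max(0, finger_y - height)                       # assume the text sits just above the finger
    crop = frame_bgr[y0:y0 + height, x0:x0 + width]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray, lang="chi_sim+eng")  # Chinese + English
```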

Searching the recognized text information and returning the corresponding service

The text information recognized in the previous step is matched against the text information acquired from the database. When the similarity exceeds a set threshold, the corresponding resource file is returned and displayed. For example, if an English word is matched, the translation and pronunciation of the word are returned; if a question is matched, the solution process for the current question is returned, and so on. A sketch of this dispatch step follows.
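
The description does not specify the similarity measure, the threshold value, or the service interfaces, so the sketch below uses difflib's SequenceMatcher ratio, a threshold of 0.8, and placeholder action names as stand-ins.

```python
# Sketch of matching the OCR text against the page resources and dispatching a service.
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # assumed value; the text only says "a set threshold"

def dispatch(ocr_text, page_resources):
    """page_resources: list of dicts like {"text": ..., "kind": "word" | "sentence" | "question"}."""
    best = max(page_resources,
               key=lambda r: SequenceMatcher(None, ocr_text, r["text"]).ratio(),
               default=None)
    if best is None or SequenceMatcher(None, ocr_text, best["text"]).ratio() < SIMILARITY_THRESHOLD:
        return None                                                  # nothing matched well enough
    if best["kind"] in ("word", "sentence"):
        return {"action": "translate_and_speak", "payload": best}    # read translation aloud, show on screen
    if best["kind"] == "question":
        return {"action": "show_solution", "payload": best}          # look up and display the solution process
    return {"action": "display", "payload": best}
```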

As shown in Fig. 6, the present application further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable by the processor, wherein the processor implements any one of the methods described above when executing the computer program.

The present application proposes a new interaction method that uses artificial intelligence techniques to recognize actions and gestures, and applies this interaction method to desktop learning and entertainment scenarios. The interaction relies on external devices, which fall into two categories. In the first, the hardware stands perpendicular to the desktop and a monocular camera mounted at its top points obliquely downward, so that books, hands, text, pictures, and other information on the desktop can be captured clearly. In the second, a mobile phone or tablet lies on the desktop with an optical reflection system mounted above it; the reflection system forms images by reflection, so that the front camera of the phone or tablet can clearly capture the books, hands, text, pictures, and other information on the desktop.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented herein.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented as program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module combining multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
