Intelligent point-reading scheme based on photographing and object recognition

Document No. 1379188, published 2020-08-14.

Description: This technique, "Intelligent point-reading scheme based on photographing and object recognition", was designed and created by Wang Peng on 2020-04-08. The invention discloses an intelligent point-reading scheme based on photographing and object recognition, comprising a signal collector, a signal processor, a speech synthesis unit and a human-machine interaction port, characterized in that: the signal collector performs picture acquisition and runs as software on several types of devices; the signal processor completes the analysis and processing of the image signal, including picture localization, segmentation and recognition as well as the generation of text sentences; the speech synthesis unit converts text content into a speech signal; and the human-machine interaction port comprises a user trigger interface or switch and an audio output. The scheme photographs a target area with a handheld mobile device, localizes, segments and recognizes the objects and characters in the photo, and generates a text description in a specified language, which gives it good flexibility. Implementing the technique requires no complex customized equipment, and it is convenient, simple and easy to use.

1. An intelligent point-reading scheme based on photographing and object recognition, comprising a signal collector, a signal processor, a speech synthesis unit and a human-machine interaction port, characterized in that: the signal collector performs picture acquisition and runs as software on several types of devices; the signal processor completes the analysis and processing of the image signal, including picture localization, segmentation and recognition as well as the generation of text sentences; the signal collector and signal processor constitute the front-end system for picture content editing and generation; the speech synthesis unit converts text content into a speech signal; and the human-machine interaction port comprises a user trigger interface or switch and an audio output, and is provided with position-triggered content reading.

2. The intelligent point-reading scheme based on photographing and object recognition as claimed in claim 1, wherein the signal collector runs as software on several types of devices, for example a handheld mobile device (such as a smartphone, tablet computer, camera or video pen) or a wearable device (such as smart glasses), which photographs the target area.

3. The intelligent point-reading scheme as claimed in claim 1 or 2, wherein the image acquisition scenes include, among others, image acquisition of real objects, of book and document contents, and within virtual reality.

4. The intelligent point-reading scheme based on photographing and object recognition as claimed in claim 1, wherein the signal processor completes the analysis and processing of the image signal, analyzing the acquired picture, including the localization, segmentation and recognition of the objects or characters in it; the algorithms adopted include, but are not limited to, end-to-end analysis of the image content by a trained deep neural network model such as R-CNN or Fast R-CNN.

5. The intelligent point-reading scheme based on photographing and object recognition as claimed in claim 1, wherein the generation of text sentences produces a sentence-level text description of the key information from the obtained object labels and character contents in the picture; commonly used models include attention-based models, GANs and reinforcement learning.

6. The intelligent point-reading scheme based on photographing and object recognition as claimed in claim 1, wherein the position-triggered content reading completes the region localization and the recognition and understanding of the objects and character content in the picture; when the user taps the corresponding position, the content is read aloud intelligently in a preset language (such as English). This step is realized with Text-to-Speech (TTS) technology, and different timbres can be customized.

7. The intelligent point-reading scheme as claimed in claim 6, wherein the triggering modes of the position-triggered content reading are further divided into offline triggering and online triggering. Offline triggering means that, after the device has acquired and analyzed the picture, it waits for the user to trigger and wake a corresponding position region, and exchanges information with the user only about the content of that region; online triggering means that the device acquires, analyzes and processes the picture while simultaneously obtaining the user's trigger intention, and exchanges information with the user about the entire content of the picture, as in character acquisition with a video pen or assistive scenarios for visually impaired users with smart glasses.

8. The intelligent point-reading scheme based on photographing and object recognition as claimed in claim 1, wherein the picture editing and generation is association-based: the user inputs a required instruction (such as a keyword) by handwriting or voice, and a new picture is generated by a pretrained model (such as a GAN), or the style and content of the picture are modified automatically.

Technical Field

The invention relates to neural network and AI (artificial intelligence) recognition technology, in particular to an intelligent point-reading scheme based on photographing and object recognition.

Background

With the maturation of AI technology and the advent of 5G, AI has been widely applied in many fields, such as online education and telemedicine. Image localization and recognition based on deep neural networks are now mature, with accuracy exceeding 99%; for example, security inspection systems based on face recognition have reached a practical level, and face-scan payment is becoming popular. AI technology based on image localization and recognition can therefore guarantee robustness and efficiency. However, the point-reading devices on the market are customized around specific picture books, and their flexibility is poor.

Disclosure of Invention

The invention aims to solve the technical problem of providing an intelligent point-reading scheme based on photographing and object recognition.

In order to solve the above technical problem, the technical solution provided by the invention is as follows: an intelligent point-reading scheme based on photographing and object recognition, comprising a signal collector, a signal processor, a speech synthesis unit and a human-machine interaction port, characterized in that: the signal collector performs picture acquisition and runs as software on several types of devices; the signal processor completes the analysis and processing of the image signal, including picture localization, segmentation and recognition as well as the generation of text sentences; the signal collector and signal processor constitute the front-end system for picture content editing and generation; the speech synthesis unit converts text content into a speech signal; and the human-machine interaction port comprises a user trigger interface or switch and an audio output, and is provided with position-triggered content reading.

Compared with the prior art, the invention has the following advantages. The intelligent point-reading scheme based on photographing and object recognition photographs a target area with a handheld mobile device, localizes, segments and recognizes the objects and characters in the photo, and generates a text description in a specified language. By tapping the corresponding object in the photo on the touch screen, the user has the text description read aloud automatically in that language. In addition, corresponding explanations, semantic extensions, content quizzes and fairy-tale search can be obtained, and association-based picture content editing and generation is possible. The scheme thus provides an innovative mode of information interaction, suited to helping users read aloud, receive tutoring and obtain information about unfamiliar objects and characters, and it is highly flexible. The patented technique requires no complex customized equipment and is convenient, simple and easy to use.

As an improvement, the signal collector runs as software on several types of devices, for example a handheld mobile device (such as a smartphone, tablet computer, camera or video pen) or a wearable device (such as smart glasses), which photographs the target area.

As an improvement, the image acquisition scenes may include image acquisition of real objects, of book and document contents, and within virtual reality, among others.

As an improvement, the signal processor completes the analysis and processing of the image signal, analyzing the acquired picture, including the localization, segmentation and recognition of the objects or characters in it.

As an improvement, the algorithms adopted include, but are not limited to, end-to-end analysis of the image content by a trained deep neural network model such as R-CNN or Fast R-CNN.

As an improvement, the generation of text sentences produces a sentence-level text description of the key information from the obtained object labels and character contents in the picture; commonly used models include attention-based models, GANs and reinforcement learning.

As an improvement, the position-triggered content reading completes the region localization and the recognition and understanding of the objects and character content in the picture; when the user taps the corresponding position, the content is read aloud intelligently in a preset language (such as English). This step is realized with Text-to-Speech (TTS) technology, and different timbres can be customized.

As an improvement, the triggering modes are divided into offline triggering and online triggering. Offline triggering means that, after the device has acquired and analyzed the picture, it waits for the user to trigger and wake a corresponding position region, and exchanges information with the user only about the content of that region; online triggering means that the device acquires, analyzes and processes the picture while simultaneously obtaining the user's trigger intention, and exchanges information with the user about the entire content of the picture, as in character acquisition with a video pen or assistive scenarios for visually impaired users with smart glasses.

As an improvement, the picture editing and generation is association-based: the user inputs a required instruction (such as a keyword) by handwriting or voice, and a new picture is generated by a pretrained model (such as a GAN), or the style and content of the picture are modified automatically.

Drawings

Fig. 1 is a flow chart of target localization in the intelligent point-reading scheme based on photographing and object recognition.

Fig. 2 is a flow chart of object segmentation in the intelligent point-reading scheme based on photographing and object recognition.

Fig. 3 is a flow chart of object recognition in the intelligent point-reading scheme based on photographing and object recognition.

Fig. 4 is a flow chart of character recognition in the intelligent point-reading scheme based on photographing and object recognition.

Fig. 5 shows example output of the sentence generation module of the intelligent point-reading scheme based on photographing and object recognition.

Fig. 6 shows example output of the information retrieval and recommendation module of the intelligent point-reading scheme based on photographing and object recognition.

Fig. 7 is a flow chart of the technical scheme of the intelligent point-reading scheme based on photographing and object recognition.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

In a specific implementation, the intelligent point-reading scheme based on photographing and object recognition comprises a signal collector, a signal processor, a speech synthesis unit and a human-machine interaction port, characterized in that: the signal collector performs picture acquisition of the target area and is provided with an image localization module; the signal processor completes the analysis and processing of the image signal, including picture localization, segmentation and recognition as well as the generation of text sentences, and constitutes the recognition and generation module; the speech synthesis unit converts text content into a speech signal and constitutes the speech synthesis and recognition module, which is provided with the sentence generation module and the position-triggered content reading module; the human-machine interaction port comprises a user trigger interface or switch and an audio output, and the information retrieval and recommendation module is located in the human-machine interaction port.

The signal collector runs as software on several types of devices, for example a handheld mobile device (such as a smartphone, tablet computer, camera or video pen) or a wearable device (such as smart glasses), which photographs the target area.

The image acquisition scenes may include image acquisition of real objects, of book and document contents, and within virtual reality, among others.

The signal processor completes the analysis and processing of the image signal, analyzing the acquired picture, including the localization, segmentation and recognition of the objects or characters in it; the algorithms adopted include, but are not limited to, end-to-end analysis of the image content by a trained deep neural network model such as R-CNN or Fast R-CNN.

The localization, segmentation and recognition of objects and characters comprise the following steps (a code sketch follows the list):

Target localization: identify the position of each object or character in the image, marking the coordinates (x, y) of the top-left corner of the target's bounding box together with its width w and height h, and distinguish objects from characters, as shown in Fig. 1;

Object segmentation: perform semantic segmentation on each localized object, separating the object's edges from the background at the pixel level, as shown in Fig. 2;

Object recognition: recognize each segmented object and output the corresponding keyword tag, as shown in Fig. 3;

Character recognition: process the characters found during target localization; as shown in Fig. 4, a convolutional neural network is constructed for the recognition.
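The patent does not disclose its trained model; as a hedged illustration only, the sketch below runs the localization, segmentation and recognition steps with an off-the-shelf Mask R-CNN from torchvision, where the pretrained COCO checkpoint is an assumption standing in for the scheme's own network:

import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Pretrained COCO weights stand in for the scheme's own trained network.
model = maskrcnn_resnet50_fpn(pretrained=True).eval()

def analyze_picture(path, score_threshold=0.8):
    """Return boxes (x, y, w, h), pixel masks and label ids for one photo."""
    image = transforms.ToTensor()(Image.open(path).convert("RGB"))
    with torch.no_grad():
        pred = model([image])[0]
    regions = []
    for box, mask, label, score in zip(
            pred["boxes"], pred["masks"], pred["labels"], pred["scores"]):
        if score < score_threshold:
            continue
        x1, y1, x2, y2 = box.tolist()
        regions.append({
            "box": (x1, y1, x2 - x1, y2 - y1),  # top-left corner plus w, h (Fig. 1)
            "mask": mask[0] > 0.5,              # pixel-level foreground mask (Fig. 2)
            "label_id": int(label),             # keyword tag index (Fig. 3)
            "score": float(score),
        })
    return regions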

The generation of text sentences takes the obtained object labels and character contents in the picture as key information and produces a sentence-level text description; commonly used models include attention-based models, GANs and reinforcement learning. The effect is shown in Fig. 5.
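The patent names attention-based models, GANs and reinforcement learning without fixing an architecture; as one hedged stand-in, a publicly available image-captioning checkpoint loaded through the Hugging Face transformers pipeline can produce the sentence-level description:

from transformers import pipeline

# The ViT-GPT2 checkpoint is a public stand-in, not the inventor's model.
captioner = pipeline("image-to-text",
                     model="nlpconnect/vit-gpt2-image-captioning")

def describe_picture(path):
    """Generate a one-sentence text description of the picture."""
    return captioner(path)[0]["generated_text"]

# Example: describe_picture("photo.jpg") might return
# "a red apple sitting on a wooden table".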

The position-triggered content reading completes the region localization and the recognition and understanding of the objects and character content in the picture; when the user taps the corresponding position, the content is read aloud intelligently in a preset language (such as English). This step is realized with Text-to-Speech (TTS) technology, and different timbres can be customized.
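A minimal sketch of the position trigger, assuming the region list produced by a detector like analyze_picture above and using the pyttsx3 library as a stand-in for the unspecified TTS engine: a tap is hit-tested against the detected boxes, and the matching region's text is spoken.

import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)  # speaking speed; voice/timbre can also be set

def on_tap(x, y, regions):
    """regions: dicts with 'box' = (bx, by, w, h) and 'text' to read aloud."""
    for region in regions:
        bx, by, w, h = region["box"]
        if bx <= x <= bx + w and by <= y <= by + h:
            engine.say(region["text"])  # read aloud in the preset language
            engine.runAndWait()
            return region
    return None  # the tap hit no recognized object or character

# Example: on_tap(120, 80, [{"box": (100, 50, 60, 60), "text": "apple"}])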

The triggering modes are further divided into offline triggering and online triggering. Offline triggering means that, after the device has acquired and analyzed the picture, it waits for the user to trigger and wake a corresponding position region, and exchanges information with the user only about the content of that region; online triggering means that the device acquires, analyzes and processes the picture while simultaneously obtaining the user's trigger intention, and exchanges information with the user about the entire content of the picture, as in character acquisition with a video pen or assistive scenarios for visually impaired users with smart glasses.

The picture editing and generation is association-based: the user inputs a required instruction (such as a keyword) by handwriting or voice, and a new picture is generated by a pretrained model (such as a GAN), or the style and content of the picture are modified automatically. The effect is shown in Fig. 6.
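The patent does not name a specific pretrained GAN; the toy generator below only illustrates the call pattern for keyword-conditioned generation (keyword, embedding, image tensor), with the vocabulary and network sizes chosen arbitrarily for the sketch:

import torch
import torch.nn as nn

VOCAB = {"cat": 0, "dog": 1, "tree": 2}  # hypothetical keyword vocabulary

class ToyConditionalGenerator(nn.Module):
    """Placeholder for whatever pretrained GAN generator the scheme would ship."""
    def __init__(self, n_keywords=len(VOCAB), z_dim=64, img_size=32):
        super().__init__()
        self.embed = nn.Embedding(n_keywords, z_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * z_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * img_size * img_size), nn.Tanh(),
        )
        self.img_size = img_size

    def forward(self, z, keyword_id):
        cond = self.embed(keyword_id)               # condition on the keyword
        x = self.net(torch.cat([z, cond], dim=-1))  # noise + condition -> pixels
        return x.view(-1, 3, self.img_size, self.img_size)

G = ToyConditionalGenerator().eval()

def generate_picture(keyword):
    """Generate one image tensor in [-1, 1] conditioned on a user keyword."""
    z = torch.randn(1, 64)
    with torch.no_grad():
        return G(z, torch.tensor([VOCAB[keyword]]))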

The working principle of the invention is as follows. The patented technique spans the three directions of image processing, speech and natural language processing. Image localization and recognition based on deep neural networks are now mature, with accuracy exceeding 99%; for example, security inspection systems based on face recognition have reached a practical level, and face-scan payment is becoming popular. AI technology based on image localization and recognition can therefore guarantee the robustness and efficiency of this scheme.

Speech processing has likewise made great breakthroughs: recognition accuracy under near-field pickup or in quiet scenes reaches about 97%, and speech synthesis technology can generate voices with various timbres.

More important still is the natural language processing technology, in which the key elements recognized in the picture serve as keywords for generating sentence-level content. Natural language processing tasks based on models such as BERT and XLNet, for example semantic classification and reading comprehension, have reached or exceeded human-level performance.

In the intelligent point-reading technique based on photographing and object recognition, a target area is photographed with a handheld mobile device (such as a smartphone, tablet computer, camera or video pen) or a wearable device (such as smart glasses); the objects and characters in the photo are automatically localized, segmented and recognized, and a text description is generated in a specified language (such as Chinese or English, according to the configured language type). In other words, a natural-language text is generated from the acquired picture content to describe the objects and scene in it. When the user taps the corresponding object in the photo on the touch screen, or triggers it in another way, the text description is read aloud automatically in that language. In addition, corresponding explanations, semantic extensions, content-based question answering and fairy-tale search, as well as association-based picture content editing and generation, can be obtained.

The intelligent point-reading scheme based on photographing and object recognition can be used in educational settings: a user who cannot express an object's name in English can photograph it with a handheld device for point-reading, and can also receive knowledge tutoring. When reading, a video pen can point-read any passage in question at any time, without requiring books pre-printed with special codes, which is convenient and quick. Smart glasses equipped with the patented technique can recognize and explain the objects and scenes in the field of view, bringing new light to visually impaired users.

The technical scheme provides an innovative mode of information acquisition, facilitates the effective acquisition and use of information, and serves online-education and visually impaired users well.

Part of the technical scheme can be outlined in code as follows.
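Every function name below is an illustrative placeholder wired together to show the offline-trigger flow, not the inventor's actual code; a real implementation would plug in the detection, captioning and TTS sketches given earlier.

from dataclasses import dataclass

@dataclass
class Region:
    box: tuple  # (x, y, w, h) in photo coordinates
    text: str   # recognized label or character content

def capture_photo() -> str:
    return "photo.jpg"  # placeholder: camera, video pen or smart glasses

def analyze(photo: str) -> list:
    # placeholder: localization + segmentation + recognition
    return [Region(box=(100, 50, 60, 60), text="apple")]

def speak(text: str) -> None:
    print(f"[TTS] {text}")  # placeholder: speech synthesis unit

def point_read(tap_x: float, tap_y: float) -> None:
    """Offline-trigger flow: analyze the photo once, then answer taps."""
    regions = analyze(capture_photo())
    for r in regions:
        x, y, w, h = r.box
        if x <= tap_x <= x + w and y <= tap_y <= y + h:
            speak(r.text)
            return
    speak("No object or character recognized at this position.")

if __name__ == "__main__":
    point_read(120, 80)  # simulated user tap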

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the invention, "plurality" means two or more unless explicitly defined otherwise.

In the present invention, unless otherwise specifically stated or limited, the terms "mounted", "connected", "fixed" and the like are to be construed broadly and may, for example, denote a fixed connection, a detachable connection or an integral connection; a mechanical or an electrical connection; a direct connection or an indirect connection through an intermediate medium, or internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.

In the present invention, unless otherwise expressly stated or limited, a first feature being "above" or "below" a second feature means that the two features are in direct contact, or in indirect contact through an intermediate feature. Moreover, a first feature being "on", "above" or "over" a second feature includes the first feature being directly above or obliquely above the second feature, or merely indicates that the first feature is at a higher level than the second; a first feature being "under", "below" or "beneath" a second feature includes the first feature being directly below or obliquely below the second feature, or merely indicates that the first feature is at a lower level than the second.

In the description herein, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples" and the like means that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such schematic expressions do not necessarily refer to the same embodiment or example, and the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.
