Target area identification method, device, equipment and readable storage medium

Document No.: 1170457    Publication date: 2020-09-18

Reading note: This technology, "Target area identification method, device, equipment and readable storage medium", was designed and created by Ren Yuqiang, Pan Xingjia, Dong Weiming, Zhu Xudong, Yuan Haolei, Guo Xiaowei and Xu Changsheng on 2020-05-25. Its main content is as follows: The application discloses a target area identification method, device, equipment and readable storage medium, relating to the field of artificial intelligence. The method includes: acquiring an input image, wherein the input image comprises image content to be identified; performing feature processing on image features of the input image in a rotating convolution manner through a target recognition model to obtain target features; determining area data corresponding to the target features, wherein the area data includes a rotation angle; and determining, through the area data, the target area corresponding to the image content in the input image. Because the image features of the input image are processed in a rotating convolution manner through the target recognition model, the target area corresponding to the image content in the input image is identified, and the target area is obtained by first determining the rotation angle of the image content through the rotating convolution and then rotating the area accordingly, which improves the accuracy of identifying the target area corresponding to the image content.

1. A method for identifying a target area, the method comprising:

acquiring an input image, wherein the input image comprises image content to be identified;

predicting a first rotation angle of the image content in the input image;

after a convolution kernel in a target identification model is rotated by the first rotation angle, performing convolution processing on image features of the input image with the rotated convolution kernel to obtain target features;

identifying the target features to obtain area data corresponding to the image content, wherein the area data comprise a rotation angle, and the rotation angle is used for indicating a deflection angle of the image content relative to a default angle in the input image;

and determining the target area corresponding to the image content in the input image through the area data.

2. The method according to claim 1, wherein, after the convolution kernel in the target identification model is rotated by the first rotation angle, the performing convolution processing on the image features of the input image with the rotated convolution kernel to obtain the target features comprises:

rotating at least two convolution kernels by the first rotation angle;

and performing convolution processing on the image features through the at least two convolution kernels to obtain the target features.

3. The method of claim 2, wherein the convolving the image features with the at least two convolution kernels to obtain the target features comprises:

performing feature processing on the image features through the at least two convolution kernels in a rotating convolution manner to obtain at least two rotating convolution features, wherein each convolution kernel corresponds to one rotating convolution feature;

convolving the at least two rotating convolution features through an attention mechanism to generate at least two attention maps, wherein each attention map corresponds to one rotating convolution feature;

and generating the target features of the input image by combining the at least two rotating convolution features and the at least two attention maps.

4. The method according to claim 3, wherein the generating the target features of the input image by combining the at least two rotating convolution features and the at least two attention maps comprises:

normalizing the at least two attention maps to obtain normalized features;

and multiplying the normalized features by the at least two rotating convolution features respectively to obtain a weighted sum, and convolving the weighted sum through the attention mechanism to generate the target features.

5. The method according to any one of claims 1 to 4, wherein the identifying the target feature to obtain the region data corresponding to the image content comprises:

identifying the target features to obtain size data and position data corresponding to the image content; determining the first rotation angle, the size data and the position data as the area data corresponding to the image content;

or,

identifying the target characteristics to obtain a second rotation angle, size data and position data corresponding to the image content; and determining the second rotation angle, the size data and the position data as the area data corresponding to the image content.

6. The method according to claim 5, wherein the position data comprises center point data and an offset value;

the method further comprises:

predicting the center point of the image content through the target features to obtain the center point data;

and predicting the offset generated by the center point during scaling of the image features through the target features to obtain the offset value.

7. An apparatus for identifying a target area, the apparatus comprising:

an acquisition module for acquiring an input image, wherein the input image comprises image content to be identified;

a prediction module for predicting a first rotation angle of the image content in the input image;

the processing module is used for rotating a convolution kernel in a target identification model by the first rotation angle and then performing convolution processing on the image characteristics of the input image by the rotated convolution kernel to obtain target characteristics;

the identification module is used for identifying the target features to obtain area data corresponding to the image content, wherein the area data comprise a rotation angle, and the rotation angle is used for indicating a deflection angle of the image content relative to a default angle in the input image; and determining the target area corresponding to the image content in the input image through the area data.

8. The apparatus of claim 7, wherein the processing module is further configured to rotate at least two convolution kernels by the first rotation angle, and perform convolution processing on the image features through the at least two convolution kernels to obtain the target features.

9. The apparatus according to claim 8, wherein the processing module is further configured to perform feature processing on the image features through the at least two convolution kernels in a rotating convolution manner to obtain at least two rotating convolution features, wherein each convolution kernel corresponds to one rotating convolution feature;

the device, still include:

the generating module is used for convolving the at least two convolution characteristics through an attention mechanism to generate at least two attention diagrams, wherein each attention diagram corresponds to one convolution characteristic; generating the target feature of the input image in conjunction with the at least two deconvolution features and the at least two attention maps.

10. The apparatus according to claim 9, wherein the generating module is further configured to normalize the at least two attention maps to obtain normalized features, multiply the normalized features by the at least two rotating convolution features respectively to obtain a weighted sum, and convolve the weighted sum through the attention mechanism to generate the target features.

11. The apparatus according to any one of claims 7 to 10, wherein the identification module is further configured to identify the target features to obtain size data and position data corresponding to the image content, and determine the first rotation angle, the size data and the position data as the area data corresponding to the image content;

or,

the identification module is further configured to identify the target features to obtain a second rotation angle, size data and position data corresponding to the image content, and determine the second rotation angle, the size data and the position data as the area data corresponding to the image content.

12. The apparatus according to claim 11, wherein the position data comprises center point data and an offset value;

the prediction module is further configured to predict the center point of the image content through the target features to obtain the center point data, and predict the offset generated by the center point during scaling of the image features through the target features to obtain the offset value.

13. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement a target area identification method as claimed in any one of claims 1 to 6.

14. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the target area identification method according to any one of claims 1 to 6.

Technical Field

Embodiments of the present application relate to the field of artificial intelligence, and in particular to a target area identification method, device, equipment and readable storage medium.

Background

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. A neural network model is one way of implementing artificial intelligence.

Disclosure of Invention

Embodiments of the present application provide a target area identification method, device, equipment and readable storage medium, which can improve the accuracy of identifying the area corresponding to image content. The technical solutions are as follows:

in one aspect, a target area identification method is provided, and the method includes:

acquiring an input image, wherein the input image comprises image content to be identified;

predicting a first rotation angle of the image content in the input image;

after the convolution kernel in the target identification model is rotated by the first rotation angle, performing convolution processing on the image characteristics of the input image by the rotated convolution kernel to obtain target characteristics;

identifying the target features to obtain area data corresponding to the image content, wherein the area data comprise a rotation angle, and the rotation angle is used for indicating a deflection angle of the image content relative to a default angle in the input image;

and determining the target area corresponding to the image content in the input image through the area data.

In another aspect, an apparatus for identifying a target area is provided, the apparatus comprising:

an acquisition module for acquiring an input image, wherein the input image comprises image content to be identified;

a prediction module for predicting a first rotation angle of the image content in the input image;

the processing module is used for rotating a convolution kernel in a target identification model by the first rotation angle and then performing convolution processing on the image characteristics of the input image by the rotated convolution kernel to obtain target characteristics;

the identification module is used for identifying the target features to obtain area data corresponding to the image content, wherein the area data comprise a rotation angle, and the rotation angle is used for indicating a deflection angle of the image content relative to a default angle in the input image; and determining the target area corresponding to the image content in the input image through the area data.

In another aspect, a computer device is provided, which includes a processor and a memory, wherein at least one instruction, at least one program, code set, or instruction set is stored in the memory, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the target area identification method according to any one of the embodiments of the present application.

In another aspect, a computer readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by the processor to implement the target area identification method as described in any of the embodiments of the present application.

In another aspect, a computer program product is provided, which when run on a computer causes the computer to perform the target area identification method as described in any of the embodiments of the present application.

The technical solutions provided by the embodiments of the present application bring at least the following beneficial effects:

the image features of the input image are processed in a rotating convolution manner through the target recognition model, so that the target area corresponding to the image content in the input image is identified; because the target area is obtained by first determining the rotation angle of the image content through the rotating convolution and then rotating the area accordingly, the accuracy of identifying the target area corresponding to the image content is improved.

Drawings

To more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.

FIG. 1 is a diagram illustrating the results of target area identification provided by an exemplary embodiment of the present application;

FIG. 2 is a flowchart of a target area identification method provided by an exemplary embodiment of the present application;

FIG. 3 is a schematic diagram of a process for rotating the frame area provided based on the embodiment shown in FIG. 2;

FIG. 4 is a flow chart of a target area identification method provided by another exemplary embodiment of the present application;

FIG. 5 is a schematic diagram of a recognition process of the target recognition model provided based on the embodiment shown in FIG. 4;

FIG. 6 is a schematic diagram of the rotating convolution structure provided based on the embodiment shown in FIG. 4;

FIG. 7 is a flowchart of a target area identification method provided by another exemplary embodiment of the present application;

FIG. 8 is a schematic structural diagram of the overall scheme of the present application provided based on the embodiment shown in FIG. 7;

fig. 9 is a block diagram of a target area recognition apparatus according to an exemplary embodiment of the present application;

fig. 10 is a block diagram of a target area recognition apparatus according to another exemplary embodiment of the present application;

fig. 11 is a block diagram of a server according to an exemplary embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

First, the terms referred to in the embodiments of the present application are briefly described:

Artificial Intelligence (AI): a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

Computer Vision technology (Computer Vision, CV): a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify, track and measure targets and perform other machine vision tasks, and further performs graphics processing so that the result becomes an image more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.

Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.

Rotating convolution: a method of rotating a convolution kernel and performing convolution processing on an image with the rotated convolution kernel. For image content A in an image, the rotation angle of the image content A in the image is first predicted, the convolution kernel is rotated by the predicted rotation angle, convolution processing is then performed on the image features with the rotated convolution kernel to obtain target features, and the area where the image content A is located is identified according to the target features.
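
Illustratively, and only as a non-limiting sketch, the rotating convolution described above can be approximated in a few lines of PyTorch-style Python; the function names and the choice of resampling the kernel weights with bilinear interpolation are assumptions made here for illustration and are not part of this application:

```python
import torch
import torch.nn.functional as F

def rotate_kernel(weight, angle):
    """Rotate a (out_c, in_c, k, k) convolution kernel by 'angle' radians,
    resampling the kernel weights with bilinear interpolation."""
    out_c, in_c, k, _ = weight.shape
    theta = torch.tensor(angle, dtype=weight.dtype)
    cos, sin = torch.cos(theta), torch.sin(theta)
    zero = torch.zeros((), dtype=weight.dtype)
    # 2x3 affine matrix of a pure rotation about the kernel centre
    mat = torch.stack([torch.stack([cos, -sin, zero]),
                       torch.stack([sin,  cos, zero])]).unsqueeze(0).repeat(out_c, 1, 1)
    grid = F.affine_grid(mat, size=(out_c, in_c, k, k), align_corners=False)
    return F.grid_sample(weight, grid, align_corners=False, padding_mode="zeros")

def rotating_conv2d(x, weight, angle, padding=1):
    """Rotating convolution: rotate the kernel by the predicted angle, then
    convolve the image (or feature map) x with the rotated kernel."""
    return F.conv2d(x, rotate_kernel(weight, angle), padding=padding)

# e.g. x = torch.randn(1, 3, 64, 64); w = torch.randn(8, 3, 3, 3)
# y = rotating_conv2d(x, w, angle=0.3)
```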

In conjunction with the above term introductions, application scenarios of the embodiments of the present application are illustrated below:

First, in an unmanned-shelf application scenario, the commodities on a shelf need to be identified in order to determine their arrangement density and arrangement positions. However, the camera disposed above the unmanned shelf captures different commodities at different angles, for example: when the camera shoots commodities on its left side, the commodities are inclined from the upper left corner toward the lower right corner, and when it shoots commodities on its right side, the commodities are inclined from the upper right corner toward the lower left corner;

Referring to FIG. 1, schematically, a product planogram 100 is input into a target recognition model 110. After the image features of the product planogram 100 are subjected to rotating convolution processing by the target recognition model 110, the products in the product planogram 100 and their rotation angles are identified, and the products in the product planogram 100 are labeled according to the rotation angles, as shown by the labeling box 120.

Second, in a content-auditing application scenario, taking the auditing of flag content as an example: whether the flags arranged along a street meet requirements is determined by identifying the flags on the street. However, because the flags are placed at different positions along the street and are fluttering, the angles they present differ when the street is photographed from the camera's viewpoint.

The two application scenarios are only illustrative examples in the present application, and the target area identification method provided in the embodiment of the present application may also be applied to other schemes for determining target content in an image through rotation convolution, which is not limited in the embodiment of the present application.

It should be noted that the target area identification method provided in the embodiments of the present application may be implemented by a terminal, by a server, or by the terminal and the server in cooperation. The terminal includes at least one of a smart phone, a tablet computer, a laptop, a desktop computer, a smart speaker, a smart wearable device, and the like; the server may be a physical server or a cloud server providing cloud computing services, and may be implemented as one server, or as a server cluster or distributed system composed of a plurality of servers. When the terminal and the server cooperatively implement the solution provided by the embodiments of the present application, the terminal and the server may be directly or indirectly connected in a wired or wireless communication manner, which is not limited in the embodiments of the present application.

With reference to the above term introductions and application scenarios, the target area identification method provided in the embodiments of the present application is described below, taking its application to a server as an example. As shown in FIG. 2, the method includes:

step 201, an input image is obtained, and the input image includes image content to be identified.

Optionally, the input image is an image whose image content is to be identified, and the image content is identified by performing frame selection identification on an area where the image content is located. In an optional embodiment, after the area where the image content is located is subjected to frame selection identification, at least one of identification modes such as object identification, person identification, category identification and the like is carried out on the image content from the frame selection area; in an optional embodiment, after the region where the image content is located is subjected to frame selection identification, the frame-selected region is labeled in the image, so that the position of the region of the image content in the image is indicated.

Schematically, this embodiment is described by taking the unmanned-shelf application scenario as an example. The input image is an image acquired by a camera disposed around the unmanned shelves; the camera captures images of the plurality of shelves in turn, and the image acquisition is performed at different angles for different shelves, so that the commodities on different shelves appear at different angles in the acquired images, that is, not all commodities present a front-facing rectangular shape in the image.

In this embodiment, schematically, taking an image-based product search scenario in a shopping application as another example: a user photographs a commodity to be searched in the shopping application and uploads the picture to the server of the shopping application; the server identifies, from the uploaded picture, the image content the user wants to search, selects it with a box, and then searches the commodity library. For example, after the user photographs a pair of trousers to be purchased and uploads the picture to the server, the server identifies and box-selects the trousers in the picture, performs a commodity search on the box-selected area, and feeds the search result and the box-selection result back to the user, who confirms whether the box-selected area is accurate and whether the search result includes the trousers to be purchased.

Step 202, predicting a first rotation angle of the image content in the input image.

Optionally, a first rotation angle of the image content in the input image is predicted by the target recognition model. The first rotation angle may be used directly as the rotation angle in the final area data; alternatively, after the image features are processed by rotating convolution in combination with the first rotation angle, a second rotation angle is obtained from the generated target features and is used as the rotation angle in the area data.

Step 203, after the convolution kernel in the target recognition model is rotated by the first rotation angle, performing convolution processing on the image features of the input image with the rotated convolution kernel to obtain the target features.

Optionally, the target recognition model is a deep learning model, and specifically a neural network model.

Optionally, the target recognition model is a model trained in advance on sample images. Optionally, the sample images are images from a public rotated-target data set, which is used as the training data set of the target recognition model; the images in the training data set are annotated with target boxes, and each target box is a rotated box annotated with a rotation angle. Optionally, for images with a large number of pixels, the images are first cut with overlap according to a development kit to obtain sub-images of a suitable scale, and the target recognition model is trained and tested on the sub-images; in the testing stage, the test results of the sub-images are merged.

Optionally, when the target recognition model is trained, the convolutional-layer template parameters w and the bias parameters b of the neural network model are solved by an Adam-based gradient descent method: in each iteration, the prediction error is calculated and back-propagated into the neural network model, the gradients are calculated, and the parameters of the neural network model are updated.
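
Illustratively, a minimal training-loop sketch of such an Adam-based iteration is given below; the model, the data loader and the compute_loss helper are hypothetical placeholders, and the loss composition is an assumption made here for illustration only:

```python
import torch

def train(model, loader, epochs=12, lr=1e-4):
    # 'model' and 'loader' are hypothetical stand-ins; 'model.compute_loss' is an
    # assumed helper combining the centre-point classification loss and the
    # angle/size/offset regression losses described later in this application.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam-based gradient descent
    for _ in range(epochs):
        for images, targets in loader:
            preds = model(images)                         # forward pass
            loss = model.compute_loss(preds, targets)     # prediction error
            optimizer.zero_grad()
            loss.backward()                               # back-propagate the error
            optimizer.step()                              # update parameters w and b
    return model
```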

Optionally, when the target area corresponding to the image content is identified, feature extraction is first performed on the input image through the target recognition model to obtain image features, and the target features are then obtained after feature processing is performed on the image features with a convolution kernel in a rotating convolution manner.

Optionally, when performing the feature processing, the feature processing is performed on the image feature by at least two convolution kernels in a rotation convolution manner, and convolution results of the at least two convolution kernels are fused to obtain the target feature.

Step 204, identifying the target features to obtain area data corresponding to the image content, wherein the area data includes a rotation angle.

Optionally, the rotation angle is used to indicate a deflection angle of the image content in the input image with respect to a default angle. Illustratively, the default angle in the input image is a direction along the side of the input image, and the rotation angle is a deflection angle of the image content with respect to the side of the input image.

Referring to fig. 3, the image 300 includes image content 310, a frame selection area 320 is an area corresponding to a default angle, and a frame selection area 330 is an area corresponding to the image content 310, which is obtained by rotating the frame selection area 320 according to a rotation angle.

Step 205, determining a target area corresponding to the image content in the input image according to the area data.

Optionally, the area data includes the rotation angle, and further includes size data and position data; the size of the target area corresponding to the image content is determined through the size data, and the position of the image content in the input image is determined through the position data.

And determining a corresponding target area of the image content in the input image by combining the size data, the position data and the rotation angle.

Optionally, the size data is used to indicate the length and width values of the target area corresponding to the image content; the position data is used for indicating a pixel point corresponding to a central point of the image content in the input image, wherein the central point can correspond to one pixel point or a group of pixel points.

In summary, in the target area identification method provided in this embodiment, the image features of the input image are processed in a rotating convolution manner through the target recognition model, so that the target area corresponding to the image content in the input image is identified. Because the target area is obtained by first determining the rotation angle of the image content through the rotating convolution and then rotating the area accordingly, the accuracy of identifying the target area corresponding to the image content is improved.

In an optional embodiment, the image features are processed by performing rotating convolution with at least two convolution kernels. FIG. 4 is a flowchart of a target area identification method provided by another exemplary embodiment of the present application, described by taking its application to a server as an example. As shown in FIG. 4, the method includes:

step 401, an input image is obtained, wherein the input image includes image content to be identified.

Optionally, the input image is an image whose image content is to be identified, and the image content is identified by performing frame selection identification on an area where the image content is located.

Step 402, performing feature extraction on the input image through the target recognition model to obtain image features.

Optionally, an Hourglass network is used as the backbone network to perform feature extraction on the input image to obtain the image features.

Step 403, performing feature processing on the image features through at least two convolution kernels in a rotating convolution manner to obtain target features.

Optionally, the target recognition model is based on a Dynamic Information Aggregation Module (DIAM) for extracting more accurate features with rotation invariance. The target recognition model comprises two main parts: 1. an adaptive rotating convolution operator, used to extract calibrated features according to the predicted rotation angle; 2. an adaptive feature aggregation operator, used to adaptively aggregate features from receptive fields of different shapes and sizes. That is, when the target recognition model recognizes the image content, the corresponding steps include: 1. rotating convolution, which extracts features that better fit the rotated target; 2. multi-channel feature aggregation, which adaptively aggregates features with different receptive fields by means of an attention mechanism to obtain the final semantic features.

Optionally, in this embodiment, when feature processing is performed on the image features in a rotating convolution manner with at least two convolution kernels, the following cases are included: performing feature processing on the image features in a rotating convolution manner through two convolution kernels; performing feature processing on the image features in a rotating convolution manner through three convolution kernels; performing feature processing on the image features in a rotating convolution manner through four convolution kernels; and so on. The above examples use two, three and four convolution kernels for illustration; a larger number of convolution kernels may also be used, which is not limited in the embodiments of the present application.

Optionally, a first rotation angle of the image content in the input image is predicted first, at least two convolution kernels in the target identification model are rotated by the first rotation angle, and the image features are subjected to feature processing by the at least two convolution kernels to obtain the target features.

Optionally, feature processing is performed on the image features through the at least two convolution kernels in a rotating convolution manner to obtain at least two rotating convolution features, wherein each convolution kernel corresponds to one rotating convolution feature.

Optionally, the image features output by the Hourglass network are first channel-compressed through a 1 × 1 convolution to obtain compressed features, and the compressed features are then subjected to feature processing through the at least two convolution kernels in a rotating convolution manner.

In this embodiment, feature processing of the image features using three convolution kernels is taken as an example. Illustratively, the compressed image features (i.e., the compressed features) are convolved by three branches, each branch using a convolution kernel of a different shape, for example: the first branch uses a 3 × 3 convolution kernel, the second branch uses a 1 × 3 convolution kernel, and the third branch uses a 3 × 1 convolution kernel, and the convolution kernels of the three branches perform rotating convolution processing on the image features respectively to obtain three rotating convolution features.

Optionally, after the at least two rotating convolution features are obtained through the at least two convolution kernels, the at least two rotating convolution features are convolved through an attention mechanism to generate at least two attention maps, wherein each attention map corresponds to one rotating convolution feature, and the target features of the input image are generated by combining the at least two rotating convolution features and the at least two attention maps. Optionally, after the at least two attention maps are normalized to obtain normalized features, the normalized features are multiplied by the at least two rotating convolution features respectively to obtain a weighted sum, and convolution is performed on the weighted sum to generate the target features.

Referring to FIG. 5, schematically, feature extraction is performed on the input image to obtain image features 510, and the image features 510 are channel-compressed through a 1 × 1 convolution to obtain compressed features 520. The compressed features 520 are convolved by a first convolution kernel 531 (a 3 × 3 convolution kernel), a second convolution kernel 532 (a 1 × 3 convolution kernel) and a third convolution kernel 533 (a 3 × 1 convolution kernel), wherein the first convolution kernel 531 generates a first rotating convolution feature 541, the second convolution kernel 532 generates a second rotating convolution feature 542, and the third convolution kernel 533 generates a third rotating convolution feature 543. The rotating convolution features are then convolved through an attention mechanism to generate attention maps: the first rotating convolution feature 541 yields a first attention map 551, the second rotating convolution feature 542 yields a second attention map 552, and the third rotating convolution feature 543 yields a third attention map 553. The first attention map 551, the second attention map 552 and the third attention map 553 are normalized to obtain normalized features, the normalized features are multiplied by the corresponding rotating convolution features respectively to obtain a weighted sum, and the weighted sum is convolved to generate the target features 560.
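
Illustratively, the following PyTorch-style sketch shows one possible arrangement of the three branches, the per-branch attention maps and the normalized weighted fusion described above; the class name, channel sizes and layer choices are assumptions made here for illustration, and the rotation of each branch's kernel by the predicted first rotation angle is omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchAggregation(nn.Module):
    """Sketch of the multi-branch aggregation described above: a 1x1 channel
    compression, three branches with 3x3 / 1x3 / 3x1 kernels, one attention map
    per branch, softmax normalisation across branches, and a weighted sum
    followed by a final convolution. Kernel rotation is omitted here."""

    def __init__(self, channels):
        super().__init__()
        c = channels // 2
        self.compress = nn.Conv2d(channels, c, kernel_size=1)     # channel compression
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, kernel_size=3, padding=1),            # 3x3 branch
            nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1)),  # 1x3 branch
            nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0)),  # 3x1 branch
        ])
        self.attn = nn.ModuleList([nn.Conv2d(c, 1, kernel_size=1) for _ in range(3)])
        self.out_conv = nn.Conv2d(c, channels, kernel_size=1)     # final convolution

    def forward(self, x):
        x = self.compress(x)
        feats = [branch(x) for branch in self.branches]           # per-branch features
        maps = torch.cat([a(f) for a, f in zip(self.attn, feats)], dim=1)
        weights = F.softmax(maps, dim=1)                          # normalise the attention maps
        fused = sum(w.unsqueeze(1) * f                            # weighted sum of branch features
                    for w, f in zip(weights.unbind(dim=1), feats))
        return self.out_conv(fused)                               # target features
```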

Alternatively, the fusion of the rotating convolution features of the branches may adopt hard fusion, that is, the attention maps are not normalized; instead, the maximum value is selected pixel by pixel, and the rotating convolution features are selected according to the selection result of the attention maps.

Optionally, each branch of the DIAM module begins with a rotating convolution. Referring to FIG. 6, which shows a schematic diagram of the rotating convolution structure provided by an exemplary embodiment of the present application, an offset coordinate of each sampling position of the convolution kernel is generated at every pixel position by means of a rotation matrix 600 according to the predicted rotation angle θ; a new sampling position is obtained by adding the offset coordinate to the original sampling position, and the convolution operation then proceeds. The offset coordinate describes how the sampling positions in the image shift after the convolution kernel is rotated.

The offset coordinate is calculated as shown in Formula 1 below:

Formula 1: Δp_i = M_r(θ) · p_i − p_i

where Δp_i denotes the offset coordinate, θ denotes the predicted rotation angle, p_i denotes the original sampling position, and M_r(θ) denotes the rotation matrix corresponding to the predicted rotation angle (i.e., the rotation applied to the convolution kernel).
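
Illustratively, Formula 1 can be evaluated for every sampling position of a k × k kernel as in the following sketch (the function and variable names are illustrative only); the resulting offsets are the quantities added to the original sampling positions before the convolution operation continues:

```python
import torch

def rotation_offsets(theta, k=3):
    """Formula 1: delta_p_i = M_r(theta) @ p_i - p_i, evaluated for every
    sampling position p_i of a k x k kernel (coordinates relative to the
    kernel centre). 'theta' is the predicted rotation angle in radians."""
    r = (k - 1) / 2.0
    ys, xs = torch.meshgrid(torch.arange(k) - r, torch.arange(k) - r, indexing="ij")
    p = torch.stack([xs.reshape(-1), ys.reshape(-1)]).float()        # 2 x (k*k) positions
    t = torch.tensor(theta)
    m_r = torch.stack([torch.stack([torch.cos(t), -torch.sin(t)]),
                       torch.stack([torch.sin(t),  torch.cos(t)])])  # rotation matrix M_r(theta)
    return (m_r @ p - p).T    # (k*k) x 2 offsets (dx, dy) for each sampling point

# rotation_offsets(3.14159 / 6) gives the shift of each of the nine sampling
# points of a 3x3 kernel when it is rotated by 30 degrees.
```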

Step 404, identifying the target features to obtain area data corresponding to the image content.

Optionally, the target features are identified to obtain size data and position data corresponding to the image content, and the first rotation angle, the size data and the position data are determined as the area data; or, the target features are identified to obtain a second rotation angle, size data and position data corresponding to the image content, and the second rotation angle, the size data and the position data are determined as the area data. The second rotation angle is an angle predicted from the target features, and the first rotation angle and the second rotation angle may be the same or different.

Step 405, determining a target area corresponding to the image content in the input image according to the area data.

Optionally, the area data includes the rotation angle, and further includes size data and position data; the size of the target area corresponding to the image content is determined through the size data, and the position of the image content in the input image is determined through the position data.

And determining a corresponding target area of the image content in the input image by combining the size data, the position data and the rotation angle.

In summary, in the target area identification method provided in this embodiment, the image features of the input image are processed in a rotating convolution manner through the target recognition model, so that the target area corresponding to the image content in the input image is identified. Because the target area is obtained by first determining the rotation angle of the image content through the rotating convolution and then rotating the area accordingly, the accuracy of identifying the target area corresponding to the image content is improved.

In the method provided by this embodiment, a multi-branch structure is designed in which different branches adopt convolution kernels of different shapes. By means of the rotating convolution, the receptive field is adaptively adjusted according to shape, size and rotation angle, and a feature fusion structure is used, so that neurons in the same layer of the neural network can adaptively adjust their receptive fields and select receptive fields of different angles, shapes and sizes. This makes the recognition of the target recognition model more flexible and the recognition result more accurate.

In an optional embodiment, the position data includes center point data and an offset value. FIG. 7 is a flowchart of a target area identification method provided by another exemplary embodiment of the present application, described by taking its application to a server as an example. As shown in FIG. 7, the method includes:

step 701, an input image is obtained, and the input image includes image content to be identified.

Optionally, the input image is an image whose image content is to be identified, and the image content is identified by performing frame selection identification on an area where the image content is located.

Step 702, performing feature extraction on the input image through the target recognition model to obtain image features.

Optionally, an Hourglass network is used as the backbone network to perform feature extraction on the input image to obtain the image features.

Step 703, predicting a first rotation angle of the image content in the input image.

Optionally, a first rotation angle of the image content in the input image is first predicted through the target recognition model, and the image feature of the input image is subjected to the rotation convolution processing according to the predicted first rotation angle.

Step 704, after rotating the at least two convolution kernels by the first rotation angle, performing convolution processing on the image features with the rotated convolution kernels to obtain target features.

In this embodiment, feature processing of the image features using three convolution kernels is taken as an example. Illustratively, the compressed image features (i.e., the compressed features described above) are convolved by three branches, each branch employing a convolution kernel of a different shape.

Optionally, after the at least two rotating convolution features are obtained through the at least two convolution kernels, the at least two rotating convolution features are convolved through an attention mechanism to generate at least two attention maps, and the target features of the input image are generated by combining the at least two rotating convolution features and the at least two attention maps. Optionally, after the at least two attention maps are normalized to obtain normalized features, the normalized features are multiplied by the at least two rotating convolution features respectively to obtain a weighted sum, and convolution is performed on the weighted sum to generate the target features.

Step 705, generating a second rotation angle, size data, center point data and an offset value corresponding to the image content through the target feature.

Optionally, after performing regression analysis processing on the target feature, size data corresponding to the image content is generated.

Optionally, the second rotation angle, the size data and the position data corresponding to the image content are obtained by performing recognition analysis on the target feature. The second rotation angle may be the same as or different from the first rotation angle.

The size data is used to indicate the width and height of the area corresponding to the image content.

The position data includes the center point data and the offset value, wherein the center point data is used to indicate the position of the pixel point corresponding to the center point of the image content, and the offset value is used to indicate the offset generated by the center point during scaling of the image features.

Optionally, the center point of the image content is predicted through the target features to obtain the center point data; that is, the probability that each pixel point belongs to the center of the image content is output in combination with the target features, and the position of the center point of the image content is determined according to the probability data corresponding to each pixel point. The offset generated by the center point during scaling of the image features is also predicted through the target features to obtain the offset value, which is used to correct the predicted center point data.
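
Illustratively, the decoding of the center point from such predictions may be sketched as follows; the tensor shapes, the down-scaling stride and the function name are assumptions made here for illustration only:

```python
import torch

def decode_centre(heatmap, offset, stride=4):
    """Sketch of centre-point decoding (shapes and names are assumptions):
    'heatmap' (H, W) holds the probability that each pixel is the centre of the
    image content; 'offset' (2, H, W) holds the sub-pixel shift lost when the
    image features were down-scaled by 'stride'."""
    idx = int(torch.argmax(heatmap))            # most probable centre pixel
    cy, cx = divmod(idx, heatmap.shape[1])
    dx, dy = float(offset[0, cy, cx]), float(offset[1, cy, cx])
    # map back to input-image coordinates, corrected by the predicted offset value
    return (cx + dx) * stride, (cy + dy) * stride
```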

Optionally, in determining the area data, the second rotation angle, the size data and the offset value correspond to a regression task, while the center point data corresponds to a classification task; that is, the second rotation angle, the size data and the offset value are identified by regressing onto a corresponding regression curve, and the center point data is identified by classifying each pixel point to determine whether it belongs to the center point.

Optionally, in the process of generating the area data, the identification of the area data is corrected through a dynamic filter, so as to improve the accuracy of the area data identification. Illustratively, the correction through the dynamic filter includes at least the following two cases:

First, for the classification task, feature correction is performed through the dynamic filter.

optionally, performing convolution processing on default features by using the dynamic filter as a convolution kernel to obtain feature correction quantity, wherein the default features are features corresponding to image features (or the target features); after the default features are corrected by the feature correction quantity, the features to be recognized are obtained, and the features to be recognized are classified through the recognition model to obtain classification data, wherein the classification data comprises the following steps: and obtaining the central point data. When the feature correction is performed through the dynamic filter, the target feature can be corrected through the dynamic filter, and the corrected target feature is classified to obtain classification data; the image features can also be modified through a dynamic filter, and after the modified image features are subjected to the rotary convolution processing, target features are generated and classified to obtain classified data.

Optionally, in the feature correction process, a first hyper-parameter for limiting a correction upper limit of the feature correction amount is further used, and the default feature is corrected through the first hyper-parameter and the feature correction amount, so as to obtain the feature to be identified.

Illustratively, the feature correction amount is calculated as shown in Formula 2 below:

Formula 2: F_Δ = F_mid ∗ K_c

where F_Δ denotes the feature correction amount, F_mid denotes the default feature, K_c denotes the dynamic filter used as the convolution kernel, and ∗ denotes the convolution operation. The default feature is a feature corresponding to the image features, for example, a feature obtained after the image features are compressed, or a feature obtained after the image features are enlarged.

Illustratively, the feature correction process is shown in Formula 3 below:

Formula 3: H_c = C((1 + ε · F_Δ / ‖F_Δ‖) × F_mid; Φ)

where H_c denotes the classification result obtained from the corrected feature to be recognized, C denotes the classifier, namely the last convolution layer, ε denotes the first hyper-parameter, F_Δ denotes the feature correction amount, F_mid denotes the default feature, and Φ denotes the parameters of the classifier. Optionally, the value of ε is preset; for example, in this embodiment ε is set to 0.1 to limit the upper bound of the feature correction.

Illustratively, each pixel point is classified using the corrected feature, and a probability is determined for the pixel point belonging to the center point and for it not belonging to the center point. For example, after pixel point A is classified through the recognition model, the probability that pixel point A belongs to the center point is 0.1, and the probability that it does not belong to the center point is 0.9.
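
Illustratively, Formulas 2 and 3 may be combined as in the following sketch; the argument names, tensor shapes and the use of a single global norm for ‖F_Δ‖ are assumptions made here for illustration only:

```python
import torch
import torch.nn.functional as F

def classify_with_dynamic_filter(f_mid, k_c, classifier, eps=0.1):
    """Sketch of Formulas 2 and 3 (all argument names are illustrative).
    f_mid: default feature, shape (N, C, H, W); k_c: dynamic filter used as a
    convolution kernel, shape (C, C, k, k); classifier: the last convolution
    layer C(.; Phi); eps: the first hyper-parameter (0.1 in this description)."""
    f_delta = F.conv2d(f_mid, k_c, padding=k_c.shape[-1] // 2)   # Formula 2: F_delta
    f_delta = f_delta / (f_delta.norm() + 1e-6)                  # F_delta / ||F_delta||
    corrected = (1.0 + eps * f_delta) * f_mid                    # corrected feature to be recognised
    return classifier(corrected)                                 # Formula 3: H_c
```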

Second, for the regression task, the result is corrected through the dynamic filter.

Optionally, convolution processing is performed on a default feature with the dynamic filter as the convolution kernel to obtain a result correction amount, wherein the default feature is a feature corresponding to the image features (or the target features). Regression analysis is performed on the default feature through the recognition model to obtain a regression analysis result, and the regression analysis result is corrected with the result correction amount to obtain the regression data, namely the second rotation angle, the size data and the offset value. When result correction is performed through the dynamic filter, the result correction amount may be generated from the image features through the dynamic filter, or generated from the target features through the dynamic filter.

Optionally, in the result correction process, a second hyper-parameter is used to limit the correction upper limit of the result correction amount, and the regression analysis result is corrected through the second hyper-parameter and the result correction amount to obtain the regression data.

Illustratively, the result correction amount is calculated as shown in Formula 4 below:

Formula 4: H_Δ = F_mid ∗ K_r

where H_Δ denotes the result correction amount, F_mid denotes the default feature, K_r denotes the dynamic filter used as the convolution kernel, and ∗ denotes the convolution operation. The default feature is a feature corresponding to the image features, for example, a feature obtained after the image features are compressed, or a feature obtained after the image features are enlarged.

Illustratively, the result correction process is shown in Formula 5 below:

Formula 5: H_r = (1 + ε · tanh(H_Δ)) × H_b

where H_r denotes the corrected regression data, ε denotes the second hyper-parameter, H_b denotes the regression analysis result, and H_Δ denotes the result correction amount. Optionally, the value of ε is preset.
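
Illustratively, Formulas 4 and 5 may be combined as in the following sketch; the argument names and tensor shapes are assumptions made here for illustration only:

```python
import torch
import torch.nn.functional as F

def correct_regression_with_dynamic_filter(f_mid, k_r, h_b, eps=0.1):
    """Sketch of Formulas 4 and 5 (names are illustrative). f_mid: default
    feature (N, C, H, W); k_r: dynamic filter whose output channels match the
    regression map h_b (e.g. angle, width/height or offset maps); h_b: the raw
    regression analysis result; eps: the second hyper-parameter."""
    h_delta = F.conv2d(f_mid, k_r, padding=k_r.shape[-1] // 2)   # Formula 4: H_delta
    return (1.0 + eps * torch.tanh(h_delta)) * h_b               # Formula 5: H_r
```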

Step 706, determining a target area corresponding to the image content in the input image according to the area data.

Optionally, the area data includes a rotation angle, size data and position data, the size of the target area corresponding to the image content is determined by the size data, the position of the image content in the input image is determined by the position data, and the rotation of the image content in the image is determined by the rotation angle.

Optionally, the position data includes the center point data and the offset value. After the area data is determined for the input image, the target center position is selected according to the center point data and the offset value, an area without a rotation angle is determined according to the predicted size data (that is, the width and height of the target area), and the final target area is obtained by rotating this area according to the predicted rotation angle.
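
Illustratively, constructing the final target area from the decoded center position, size data and rotation angle may be sketched as follows (the function and variable names are illustrative only):

```python
import math

def rotated_box_corners(cx, cy, w, h, theta):
    """Build the target area from decoded area data: an axis-aligned box of
    width w and height h centred at (cx, cy) whose corners are then rotated
    about the centre by the predicted rotation angle theta (in radians)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]:
        corners.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return corners  # the four vertices of the rotated target area

# e.g. rotated_box_corners(100.0, 80.0, 40.0, 20.0, math.pi / 6)
```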

Referring to FIG. 8, features are extracted from an image 800 and subjected to rotating convolution processing to obtain target features 810; the target features 810 are then recognized, and a rotation angle 821, size data 822, an offset value 823 and center point data 824 are output.

Illustratively, the accuracy of the identification results of target area identification methods in the related art and of the target identification of the present application is shown in Table 1 below:

Table 1

mAP is a performance metric in the target detection field of machine learning used to measure target detection algorithms, and represents the average precision over all classes. CP is used to represent a compact CNN-based high-performance simple target detection algorithm. RC1 denotes a method that extracts candidate regions of different sizes and shapes from the input image by selective search, takes a trained deep learning classification model with the output layer cut off, converts each candidate region into the fixed shape required by the network input to obtain a feature map of the candidate region, classifies it through a classifier, and matches it with a position label. RRD is Rotation-Sensitive Regression Detection. RoI Trans refers to a means of feature extraction by the RoI pooling method.

As can be seen from the above table, the target area identification method provided in the embodiments of the present application achieves a high overall average precision in the target detection field and significantly improves the mAP.

In summary, in the target area identification method provided in this embodiment, the image features of the input image are processed in a rotating convolution manner through the target recognition model, so that the target area corresponding to the image content in the input image is identified. Because the target area is obtained by first determining the rotation angle of the image content through the rotating convolution and then rotating the area accordingly, the accuracy of identifying the target area corresponding to the image content is improved.

In the method provided by this embodiment, when the position of the image content is determined, both the center point position of the image content and the offset generated by the center point during scaling of the image features are predicted, and the center point position is corrected with the offset value, which improves the accuracy of the target area identification result corresponding to the image content.

Fig. 9 is a block diagram of a target area recognition apparatus according to an exemplary embodiment of the present application, and as shown in fig. 9, the apparatus includes:

an obtaining module 910, configured to obtain an input image, where the input image includes image content to be identified;

a prediction module 920, configured to predict a first rotation angle of the image content in the input image;

a processing module 930, configured to rotate a convolution kernel in a target identification model by using the first rotation angle, and perform convolution processing on an image feature of the input image by using the rotated convolution kernel to obtain a target feature;

an identifying module 940, configured to identify the target feature to obtain region data corresponding to the image content, where the region data includes a rotation angle, and the rotation angle is used to indicate a deflection angle of the image content relative to a default angle in the input image; and determining the target area corresponding to the image content in the input image through the area data.

In an alternative embodiment, the processing module 930 is further configured to rotate at least two convolution kernels by the first rotation angle; and performing convolution processing on the image features through the at least two convolution kernels to obtain the target features.
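
The kernel rotation itself can be realized in several ways; the following Python sketch, assuming PyTorch and a grid-sampling rotation of the kernel weights, is one hypothetical realization rather than the specific operator used in this application.

```python
import math
import torch
import torch.nn.functional as F

def rotate_kernels(weight, angle_deg):
    # weight: (out_channels, in_channels, k, k) convolution kernels.
    # Grid-sampling rotation of the kernel window is an assumed realization.
    out_ch, in_ch, k, _ = weight.shape
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    # Affine matrix describing an in-plane rotation of the kernel window.
    mat = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0]], dtype=weight.dtype, device=weight.device)
    grid = F.affine_grid(mat.expand(out_ch, 2, 3), (out_ch, in_ch, k, k),
                         align_corners=False)
    return F.grid_sample(weight, grid, align_corners=False)

def rotated_conv(image_feature, weight, angle_deg):
    # Convolve the image feature with kernels rotated by the first rotation angle.
    k = weight.shape[-1]
    return F.conv2d(image_feature, rotate_kernels(weight, angle_deg), padding=k // 2)
```

In this sketch, rotating the kernels by the first rotation angle aligns their receptive field with the predicted orientation of the image content before the convolution processing is applied.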

In an optional embodiment, the processing module 930 is further configured to perform feature processing on the image features through the at least two convolution kernels in the rotation convolution manner to obtain at least two rotation convolution features, where each convolution kernel corresponds to one of the rotation convolution features;

as shown in fig. 10, the apparatus further includes:

a generating module 950, configured to convolve the at least two rotation convolution features by an attention mechanism to generate at least two attention maps, where each attention map corresponds to one rotation convolution feature; and to generate the target feature of the input image in combination with the at least two rotation convolution features and the at least two attention maps.

In an optional embodiment, the generating module 950 is further configured to normalize the at least two attention maps to obtain a normalized feature; and to multiply the normalized feature by the at least two rotation convolution features respectively to obtain a weighted sum, and perform convolution through the attention mechanism to generate the target feature.
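
A minimal Python sketch of this normalization and weighted combination, assuming PyTorch, softmax normalization across the attention maps and hypothetical layer names (attn_convs, fuse_conv), is given below.

```python
import torch

def fuse_rotation_features(rot_feats, attn_convs, fuse_conv):
    # rot_feats: list of K rotation convolution features, each (N, C, H, W).
    # attn_convs: one convolution per feature producing a (N, 1, H, W) attention map.
    # Softmax as the normalization and the layer names are assumptions.
    attn_maps = [conv(feat) for conv, feat in zip(attn_convs, rot_feats)]
    # Normalize the attention maps against each other at every spatial position.
    weights = torch.softmax(torch.cat(attn_maps, dim=1), dim=1)          # (N, K, H, W)
    # Multiply each rotation convolution feature by its normalized map and sum.
    weighted = sum(w.unsqueeze(1) * f
                   for w, f in zip(weights.unbind(dim=1), rot_feats))    # (N, C, H, W)
    # A final convolution of the attention mechanism yields the target feature.
    return fuse_conv(weighted)
```

Here attn_convs could be 1x1 convolutions and fuse_conv a 3x3 convolution; both shapes are assumptions, since the application does not prescribe the layers of the attention mechanism.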

In an optional embodiment, the identification module 940 is further configured to identify the target feature to obtain size data and position data corresponding to the image content; determining the first rotation angle, the size data and the position data as the area data corresponding to the image content;

or, alternatively,

the identification module 940 is further configured to identify the target feature to obtain a second rotation angle, size data and position data corresponding to the image content; and determining the second rotation angle, the size data and the position data as the area data corresponding to the image content.

In an optional embodiment, the location data includes center point data and an offset value;

the predicting module 920 is further configured to predict a central point of the image content through the target feature to obtain the center point data; and to predict the offset generated by the central point in the image feature scaling process through the target feature to obtain the offset value.
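
As a hedged sketch only, one common way to realize such a prediction is to take the peak of a center-point heatmap as the center point data and to read the offset value at that peak; the head names and tensor layout below are hypothetical.

```python
import torch

def predict_center_and_offset(heatmap, offset_map):
    # heatmap: (N, 1, H, W) center-point heatmap predicted from the target feature.
    # offset_map: (N, 2, H, W) predicted offset introduced by image feature scaling.
    # Both heads are hypothetical names for illustration.
    n, _, h, w = heatmap.shape
    idx = heatmap.view(n, -1).argmax(dim=1)                 # peak location per image
    cy = torch.div(idx, w, rounding_mode='floor')
    cx = idx % w
    offset = offset_map[torch.arange(n), :, cy, cx]         # (N, 2) offset value at the peak
    center = torch.stack([cx, cy], dim=1).float()           # (N, 2) center point data
    return center, offset
```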

In summary, the target area recognition apparatus provided in this embodiment processes the image features of the input image in a rotation convolution manner through the target recognition model, so as to recognize and obtain the target area corresponding to the image content in the input image, and the target area is obtained after the rotation angle of the image content is determined in the rotation convolution manner and the corresponding rotation is performed, which improves the recognition accuracy of recognizing the target area corresponding to the image content.

It should be noted that: the target area identifying device provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the target area identification device provided by the above embodiment and the target area identification method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.

Fig. 11 shows a schematic structural diagram of a server according to an exemplary embodiment of the present application. Specifically:

the server 1100 includes a Central Processing Unit (CPU) 1101, a system Memory 1104 including a Random Access Memory (RAM) 1102 and a Read Only Memory (ROM) 1103, and a system bus 1105 connecting the system Memory 1104 and the Central Processing Unit 1101. The server 1100 also includes a basic input/output System (I/O) 1106, which facilitates transfer of information between devices within the computer, and a mass storage device 1107 for storing an operating System 1113, application programs 1114, and other program modules 1115.

The basic input/output system 1106 includes a display 1108 for displaying information and an input device 1109 such as a mouse, keyboard, etc. for user input of information. Wherein the display 1108 and the input device 1109 are connected to the central processing unit 1101 through an input output controller 1110 connected to the system bus 1105. The basic input/output system 1106 may also include an input/output controller 1110 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 1110 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 1107 is connected to the central processing unit 1101 through a mass storage controller (not shown) that is connected to the system bus 1105. The mass storage device 1107 and its associated computer-readable media provide non-volatile storage for the server 1100. That is, the mass storage device 1107 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.

Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1104 and mass storage device 1107 described above may be collectively referred to as memory.

The server 1100 may also operate in accordance with various embodiments of the application through remote computers connected to a network, such as the internet. That is, the server 1100 may connect to the network 1112 through the network interface unit 1111 that is coupled to the system bus 1105, or may connect to other types of networks or remote computer systems (not shown) using the network interface unit 1111.

The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.

Embodiments of the present application further provide a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the target area identification method provided by each of the above method embodiments.

Embodiments of the present application further provide a computer-readable storage medium, on which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the target area identification method provided by the above method embodiments.

Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
