Network construction method, device and equipment for image recognition

Document No.: 1832028 Publication date: 2021-11-12

Reading note: This technology, "Network construction method, device and equipment for image recognition" (一种图像识别的网络构建方法、装置及设备), was designed by 杜吉祥王莹莹曹渝常张章伟杨会生张洪博黄政 on 2021-07-26. Main content: The invention provides a network construction method, device and equipment for image recognition, belonging to the field of machine vision. The method comprises: acquiring an image to be recognized and positioning a target area of the image to be recognized to acquire the target area; destroying the target area to generate an area jigsaw; and acquiring the detail information of the area jigsaw and establishing a network model of the image to be recognized in combination with a network loss function. This solves the problem that secondary salient discriminative regions are ignored because of excessive attention to a single piece of detail information.

1. A network construction method for image recognition is characterized by comprising the following steps:

acquiring an image to be identified, and positioning a target area of the image to be identified to acquire the target area;

destroying the target area to generate an area jigsaw;

and acquiring the detail information of the area jigsaw, and establishing a network model of the image to be identified by combining a network loss function.

2. The method for constructing an image recognition network according to claim 1, wherein the acquiring an image to be recognized, and positioning a target area of the image to be recognized to acquire the target area specifically include:

inputting the image to be recognized into a ResNet50 network for feature extraction;

inputting the extracted features into a residual block and calculating the coordinates of the salient region;

and mapping the coordinates of the salient region onto the image to be identified so as to locate the target region.

3. The method as claimed in claim 1, wherein the step of destroying the target area to generate an area jigsaw specifically comprises:

dividing the target area into N × N sub-regions, where R_{i,j} denotes the position of a sub-region;

and moving the sub-regions within their neighborhood according to the constraint condition to generate an area jigsaw.

4. The method according to claim 3, wherein the constraint condition is:

σ(i, j) = (σ_j^row(i), σ_i^col(j));

where k is a random parameter; σ_j^row denotes the new permutation of the jth row; σ_i^col denotes the new permutation of the ith column; and σ(i, j) is the position at which the sub-region with coordinates (i, j) is placed.

5. The method of claim 2, wherein the loss function is:

L＝α×L_{raw_cls}+β×L_{local_cls}+γ×L_enhance；

L_enhance＝c×L_{enhance_cls}+s×L_ang；

wherein L_{raw_cls} is the loss function of the first sub-network and α is the weight of the loss of the first sub-network; L_{local_cls} is the loss function of the second sub-network and β is the weight of the loss of the second sub-network; L_enhance is the loss function of the third sub-network and γ is the weight of the loss of the third sub-network; p_i denotes the true category label of the ith image; q_i denotes the predicted category label of the ith image; n denotes the sample size; c and s denote the parameters used in constructing the loss function of the third sub-network; L_{enhance_cls} denotes a cross-entropy loss function; and L_ang denotes the angular loss function.

6. An image-recognition network construction apparatus, comprising:

a target area acquisition unit, configured to acquire an image to be recognized and position a target area of the image to be recognized so as to acquire the target area;

the regional jigsaw generating unit is used for destroying the target region and generating a regional jigsaw;

and the network model establishing unit is used for acquiring the detail information of the area jigsaw and establishing the network model of the image to be identified by combining a network loss function.

7. The image-recognition network construction device according to claim 1, wherein the target area obtaining unit is specifically configured to:

inputting the image to be recognized into a ResNet50 network for feature extraction;

inputting the extracted features into a residual block and calculating the coordinates of the salient region;

and mapping the coordinates of the salient region onto the image to be identified so as to locate the target region.

8. The image recognition network construction device according to claim 1, wherein the regional jigsaw generating unit is specifically configured to:

dividing the target area into N × N sub-regions, where R_{i,j} denotes the position of a sub-region;

and moving the sub-region in the neighborhood according to the constraint condition to generate a region jigsaw.

9. The image-recognition network construction device according to claim 3, wherein the constraint condition is:

σ(i,j)＝(σ_j ^row(i),σ_i ^col(j))；

10. An image recognition network construction device, comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, wherein the processor executes the computer program to implement an image recognition network construction method according to any one of claims 1 to 5.

Technical Field

The invention relates to the field of machine vision, in particular to a network construction method, a device and equipment for image recognition.

Background

In recent years, research on fine-grained image recognition has mainly followed attention-based approaches, which not only locate key discriminative regions but also learn the features of those regions to achieve fine-grained classification. Typical methods include the RA-CNN and MA-CNN algorithms. The former predicts key regions by building a recurrent network, but the inputs of its sub-networks are images of the same region at different scales, so the extracted features tend to lack diversity and the recurrent sub-networks carry redundant parameters. The latter obtains several key regions through a multi-attention mechanism and learns the discriminative information of each; this addresses the single-feature problem, but the number of attention regions is limited.

In view of this, the present application is presented.

Disclosure of Invention

The invention discloses a network construction method, device and equipment for image recognition, which solve the problem that secondary salient discriminative regions are ignored due to excessive attention to certain detail information.

A first embodiment of the present invention provides a network construction method for image recognition, including:

acquiring an image to be identified, and positioning a target area of the image to be identified to acquire the target area;

destroying the target area to generate an area jigsaw;

and acquiring the detail information of the area jigsaw, and establishing a network model of the image to be identified by combining a network loss function.

Preferably, the acquiring the image to be recognized, and positioning the target area of the image to be recognized to acquire the target area specifically include:

inputting the image to be recognized into a ResNet50 network for feature extraction;

inputting the extracted features into a residual block and calculating the coordinates of the salient region;

and mapping the coordinates of the salient region onto the image to be identified so as to locate the target region.

Preferably, the step of destroying the target area to generate an area jigsaw specifically includes:

dividing the target area into N × N sub-regions, where R_{i,j} denotes the position of a sub-region;

and moving the sub-region in the neighborhood according to the constraint condition to generate a region jigsaw.

Preferably, the constraint condition is:

σ(i,j)＝(σ_j ^row(i),σ_i ^col(j))；

Preferably, the loss function is:

L＝α×L_{raw_cls}+β×L_{local_cls}+γ×L_enhance；

L_enhance＝c×L_{enhance_cls}+s×L_ang；

A second embodiment of the present invention provides a network construction apparatus for image recognition, including:

the regional jigsaw generating unit is used for destroying the target region and generating a regional jigsaw;

Preferably, the target area acquiring unit is specifically configured to:

inputting the image to be recognized into a ResNet50 network for feature extraction;

inputting the extracted features into a residual block and calculating the coordinates of the salient region;

and mapping the coordinates of the salient region onto the image to be identified so as to locate the target region.

Preferably, the regional jigsaw generating unit is specifically configured to:

dividing the target area into N × N sub-regions, where R_{i,j} denotes the position of a sub-region;

and moving the sub-region in the neighborhood according to the constraint condition to generate a region jigsaw.

Preferably, the constraint condition is:

σ(i,j)＝(σ_j ^row(i),σ_i ^col(j))；

A third embodiment of the present invention provides an image recognition network construction device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor executes the computer program to implement an image recognition network construction method as described in any one of the above.

According to the image recognition network construction method, device and equipment provided by the invention, the image to be recognized is acquired by a camera, and its target area is located, which reduces the influence of background noise. The target area is then destroyed to generate an area jigsaw, which solves the problem that, within the target area, the network pays excessive attention to one piece of detail information and ignores secondary salient discriminative regions. Finally, the detail information of the area jigsaw is acquired, and the network model of the image to be recognized is established in combination with a network loss function.

Drawings

Fig. 1 is a schematic flowchart of a network construction method for image recognition according to a first embodiment of the present invention;

Fig. 2 is a flow chart of the input data for the first and second sub-networks provided by the present invention;

Fig. 3 is a schematic illustration of the region enhancement provided by the present invention;

Fig. 4 is a schematic block diagram of a network construction device for image recognition according to a first embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the preceding and following objects are in an "or" relationship.

The word "if" as used herein may be interpreted as "when", "upon", "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined", "in response to a determination", "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.

In the embodiments, the terms "first/second" merely distinguish similar objects and do not imply a specific ordering of those objects. Where permitted, "first/second" may be interchanged in order or sequence, so that the embodiments described herein may be practiced in sequences other than those illustrated or described herein.

The following detailed description of specific embodiments of the invention refers to the accompanying drawings.

Referring to fig. 1, a first embodiment of the present invention provides an image recognition network construction method, which can be executed by an image recognition network construction device (hereinafter referred to as a network construction device), and in particular by one or more processors in the network construction device, to implement the following steps:

S101, acquiring an image to be identified, and positioning a target area of the image to be identified to acquire the target area;

In this embodiment, the network construction device may be a user terminal (such as a smart phone, a smart printer, or another smart device), which may be equipped with an image capturing device, such as a camera, for capturing the image to be recognized.

It should be noted that human sight is usually drawn first to key information, and the brain classifies based on the information received from that area. In strongly supervised methods, both candidate box labeling and keypoint detection require a large amount of manual annotation. In this embodiment, the key region of the image is located through an attention mechanism, and the convolutional features are also used to predict the salient region, which reduces interference from background noise.

Specifically, in the present embodiment, the image to be recognized is input into a ResNet50 network for feature extraction;

the extracted features are input into a residual block, and the coordinates of the salient region are calculated;

and the coordinates of the salient region are mapped onto the image to be recognized so as to locate the target area.

More specifically, the image to be recognized is first input into the first sub-network, and its features are extracted by part of the layers of a ResNet50 network. After the features enter the residual block Conv5_b, the channel values of the feature map f ∈ R^{H×W×K} are accumulated according to formula (1) to obtain the value F of the activation module. Next, F is compared with a threshold, namely mean(F), the mean of the channel sums over all positions, and the positions whose value exceeds the threshold are kept to obtain the connected region M_{conv_5b} of the activation module, as shown in formula (2). Similarly, the connected region M_{conv_5c} is obtained at the Conv5_c module. Finally, the intersection of the connected regions M_{conv_5b} and M_{conv_5c} is taken to obtain accurate mask region coordinates in the original image.

Here f_i denotes the ith output feature of Conv5_b, K denotes the number of output features f_i of Conv5_b in the ResNet50 network, and (x, y) denotes a specific position in the H × W activation map.

The image features pass through a channel attention mechanism at the Conv5_x (Layer4) layer to obtain the salient region coordinates, which are mapped onto the original image to locate the key target area, as shown in FIG. 2. The key area is cropped from the original image along the located coordinates, resized to a 448 × 448 key-area image by bilinear interpolation, and finally used as the input of the second sub-network, as shown in FIG. 3.
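The localization step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `activation_mask` and `locate_target` are hypothetical, the feature maps are assumed to be (H, W, K) arrays already extracted from Conv5_b and Conv5_c, and the bounding box of all above-threshold positions is used as a simplification in place of an explicit connected-region search.

```python
import numpy as np

def activation_mask(feature_map):
    """Sum an (H, W, K) feature map over its K channels and keep the
    positions whose sum exceeds the mean over all positions."""
    channel_sum = feature_map.sum(axis=-1)      # F(x, y)
    return channel_sum > channel_sum.mean()     # threshold at mean(F)

def locate_target(feat_5b, feat_5c, image_hw):
    """Intersect the two masks and map the bounding box of the result
    back to image coordinates as (top, left, bottom, right)."""
    mask = activation_mask(feat_5b) & activation_mask(feat_5c)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        # Assumption: fall back to the whole image if nothing is activated.
        return (0, 0, image_hw[0], image_hw[1])
    h, w = mask.shape
    scale_y, scale_x = image_hw[0] / h, image_hw[1] / w
    return (int(ys.min() * scale_y), int(xs.min() * scale_x),
            int((ys.max() + 1) * scale_y), int((xs.max() + 1) * scale_x))
```

For example, with 14 × 14 feature maps and a 448 × 448 input image, each feature-map cell corresponds to a 32 × 32 pixel block, so the returned box is expressed in pixels of the original image.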

S102, destroying the target area to generate a regional jigsaw;

Specifically, in this embodiment, the target area is divided into N × N sub-regions, where R_{i,j} denotes the position of a sub-region;

and the sub-regions are moved within their neighborhood according to the constraint condition to generate an area jigsaw.

It should be noted that when solving a jigsaw puzzle, a person cannot take in the global information all at once; only one part can be observed at a time, and the other parts are found by association or trial and error until the whole image is assembled. The fine-grained image (i.e., the target area) is divided equally into N × N parts, and each part is given a number R(i, j). A randomly selected part of the original image is moved to the R(0, 0) position, so the part originally at R(0, 0) must be moved to some other random position, and a new image is finally assembled. Each part may have a size of 3 × (W/N) × (H/N).

In this embodiment, if the relative positions of local regions in the image space are shuffled, the neural network is forced to learn features from the detail information of the located key regions. First, the image is uniformly divided into N × N sub-regions, where R_{i,j} (1 ≤ i, j < N) denotes the position of a sub-region. Second, each part moves within its neighborhood. For the jth row of R, a random vector q_j is generated whose ith element is q_{j,i} = i + r, where r ~ U(-k, k) and 1 ≤ k < N, and this determines the new position of the ith sub-region in that row; the same constraint is applied when shifting positions within each column. Thus, the position of the new region is expressed as follows:

σ(i, j) = (σ_j^row(i), σ_i^col(j)) (4)

where σ_j^row denotes the new permutation of the jth row; σ_i^col denotes the new permutation of the ith column; and σ(i, j) is the position at which the part with coordinates (i, j) is placed. Under this constraint, a part can only move within its neighborhood. After the global structure of the image (Image) is destroyed, a new image (i.e., an area jigsaw) is generated to build the enhanced data set (Image_N, with N × N sub-regions per image).
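The neighborhood-constrained shuffle can be sketched as below. This is a hedged illustration rather than the patented implementation: the function names are hypothetical, the image is assumed to be an (H, W, C) array (the text above uses a channel-first 3 × W × H layout), and each permutation is assumed to be obtained by sorting the perturbed indices q_{j,i} = i + r, which is one common way to realise the constraint.

```python
import numpy as np

def neighborhood_permutation(n, k, rng):
    """Permutation of range(n) obtained by sorting q_i = i + r, r ~ U(-k, k).
    A stable sort keeps ties in their original order."""
    q = np.arange(n) + rng.uniform(-k, k, size=n)
    return np.argsort(q, kind="stable")

def region_confusion(image, n, k, rng):
    """Split an (H, W, C) image into n x n patches, shuffle the patches
    within each row and then within each column, and reassemble."""
    h, w = image.shape[:2]
    ph, pw = h // n, w // n
    # patches[i, j] is the patch in grid row i, grid column j.
    patches = (image[:ph * n, :pw * n]
               .reshape(n, ph, n, pw, -1)
               .transpose(0, 2, 1, 3, 4)
               .copy())
    for row in range(n):
        patches[row] = patches[row][neighborhood_permutation(n, k, rng)]
    for col in range(n):
        patches[:, col] = patches[:, col][neighborhood_permutation(n, k, rng)]
    return patches.transpose(0, 2, 1, 3, 4).reshape(ph * n, pw * n, -1)
```

With this sorting-based construction a patch is displaced by fewer than 2k positions in each pass, so a small k keeps patches close to their original location, while a large k approaches an unconstrained shuffle.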

With continued reference to fig. 3, region enhancement is applied to the input image of the second sub-network to obtain a new image, which is used as the input image of the third sub-network. Both region enhancement modes randomly move the positions of the parts; the difference is that the region confusion mechanism sets a random parameter K (0 < K ≤ N) to limit the range of random movement. When K is small, a part can only move randomly within a relatively close neighborhood, and when K equals N, the two modes are the same. Therefore, the area jigsaw destroys the spatial information of the image more thoroughly than the region confusion mechanism.

S103, acquiring the detail information of the area jigsaw, and establishing a network model of the image to be identified by combining a network loss function.

In this embodiment, the attention module (also referred to as the target area acquisition module) first acquires the area information of the key object to reduce interference from noise such as the background; data enhancement is then performed in the target area to obtain area jigsaws that force the network to focus on detail information; finally, the three sub-network structures are trained with a joint loss function. The overall loss function is expressed as formula (5):

L = α × L_{raw_cls} + β × L_{local_cls} + γ × L_enhance (5)

where α, β and γ denote the weights of the losses of the respective sub-networks; L_{raw_cls} denotes the loss function of the first sub-network; L_{local_cls} denotes the loss function of the second sub-network; and L_enhance denotes the loss function of the third sub-network. The first and second sub-networks use the base network to obtain the key area, and only the network modules are modified to perform classification. In the third sub-network, region enhancement highlights detail information, and A-Softmax Loss is introduced to learn it.

Loss function of the attention mechanism module

First, the original image is used as input; features are obtained in the network through layers such as convolution, pooling and activation, then passed to a fully connected layer and converted into the corresponding class probabilities, from which the classification loss is calculated. The loss function of this sub-network is therefore calculated using cross entropy, as shown in formula (6). The key information area is then used as the input of the next sub-network, whose loss function is calculated in the same way, as shown in formula (7). Here p_i denotes the true category label of the ith image; q_i denotes the predicted category label of the ith image; and n denotes the sample size.

Loss function for region enhancement

Data enhancement is performed in the key area to obtain a new image, which forces the network to focus on the detail information of the image. The new image is used as the input of the third sub-network, whose loss function is as follows:

L_enhance = c × L_{enhance_cls} + s × L_ang; (8)

where c and s denote the parameters used in constructing the loss function of the third sub-network; L_{enhance_cls} denotes a cross-entropy loss function; and L_ang denotes the A-Softmax Loss. A-Softmax Loss was proposed for the face recognition task; in the present invention it is adjusted and applied to the network structure:

The region-enhanced image is used as the input of the third sub-network, which learns its features and classifies the image; the loss function of this sub-network is calculated, and the influence of the loss function and its parameters on the network is discussed in the experimental results.
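How the loss terms combine can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the cross-entropy is assumed to take the standard softmax form averaged over n samples, and the angular term L_ang is passed in as a precomputed number because the adjusted A-Softmax formula is not reproduced in the text.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over n samples.
    logits: (n, classes) array; labels: length-n integer class indices."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def enhance_loss(l_enhance_cls, l_ang, c=1.0, s=1.0):
    """Third sub-network loss: c * L_enhance_cls + s * L_ang."""
    return c * l_enhance_cls + s * l_ang

def joint_loss(l_raw_cls, l_local_cls, l_enhance, alpha=1.0, beta=1.0, gamma=1.0):
    """Overall loss: alpha * L_raw_cls + beta * L_local_cls + gamma * L_enhance."""
    return alpha * l_raw_cls + beta * l_local_cls + gamma * l_enhance
```

Setting a weight to zero removes the corresponding term, for example `s=0` leaves only the cross-entropy term in the third sub-network and `gamma=0` switches the third sub-network off entirely.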

In the verification stage, after the image passes through the network model, only the key area needs to be acquired, and the enhanced image of the key area is not acquired, so that the calculated amount is reduced, and the reasoning time is shortened. The network structure is constructed by the first two sub-networks, and the classification is realized by global space information and local key areas together.

The experimental results of the invention are as follows:

1) Experimental data sets

To verify the validity of the proposed method, experiments were carried out on several benchmark data sets: CUB-200 (CUB), FGVC-Aircraft (AIR), and Stanford Cars (CAR), and the results on these data sets were further analyzed. The algorithm of the invention is also compared with mainstream methods on the same data sets to verify its effectiveness, including B-CNN for fine-grained image feature learning, MA-CNN based on a multi-attention mechanism, NTS based on a self-supervision mechanism, and the DCL algorithm for image destruction and reconstruction.

2) Key region extraction

First, the original image is input to the ResNet50 network, and mask information is generated at the residual blocks Conv5_b and Conv5_c using a channel attention mechanism. The activated regions obtained at the two modules are then intersected to obtain the mask coordinates. Finally, the coordinate information is mapped onto the original image, and the key area is cropped from the original image and enlarged.

That is, the coordinates of the salient region are obtained with the channel attention mechanism in the network, mapped onto the original image, and used to crop out the region. The region is scaled to 448 × 448 by bilinear interpolation and used as the input of the next sub-network.
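The crop-and-enlarge step can be sketched in plain NumPy as follows. This is an illustrative sketch only: the function names are hypothetical, the image is assumed to be an (H, W, C) array, the box is assumed to be (top, left, bottom, right) in pixels, and the corner-aligned sampling grid is one of several bilinear conventions, chosen here for simplicity.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize an (H, W, C) array with bilinear interpolation,
    mapping the output corners onto the input corners."""
    in_h, in_w = img.shape[:2]
    img = img.astype(float)
    ys = np.linspace(0, in_h - 1, out_h)          # source row for each output row
    xs = np.linspace(0, in_w - 1, out_w)          # source column for each output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]                 # vertical interpolation weights
    wx = (xs - x0)[None, :, None]                 # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def crop_and_resize(image, box, size=448):
    """Crop the key area given by box and enlarge it to size x size."""
    top, left, bottom, right = box
    return bilinear_resize(image[top:bottom, left:right], size, size)
```

In practice a library routine would normally be used for this step; the explicit version is shown only to make the interpolation itself visible.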

3) Analysis of region enhancement modes

The targets in the obtained key target areas belong to different subclasses of the same major class and therefore share similar global information. The key to distinguishing fine-grained image categories lies in the parts that differ within a similar overall form, that is, the network must learn discriminative features. The spatial information of the image is first destroyed to obtain a new image; the network then learns the features of its regions and is thereby driven to attend to their detail information, so that local discriminative information is learned.

The module adopts two region enhancement modes: the area jigsaw and the region confusion mechanism. If the new image is generated from parts that are too small, recognition of the image is hampered; if it is generated from parts that are too large, the detail information cannot be attended to effectively. Therefore, to destroy the image to a reasonable degree, the original image is divided into 7 × 7 blocks, which are recombined into a new image.

TABLE 1 Fine-grained image recognition accuracy for different region enhancement modes

Region enhancement mode	Accuracy (%)
Area jigsaw	87.40
Region confusion	87.59

The two region enhancement modes described above were tested separately in the algorithm of the present invention. The experimental results on the CUB 200-2011 data set show that the region confusion mechanism performs better for fine-grained image recognition than the area jigsaw, with a recognition accuracy about 0.19 percentage points higher.

The reasons for the difference in accuracy are analyzed from the characteristics of the two data enhancement modes, as shown in fig. 6. 1) Area jigsaw: the positions of the parts in the image may be shifted at random. 2) Region confusion: compared with the area jigsaw, this mode adds a constraint condition and sets a random parameter K. If K is large, a part can be moved to a farther position; otherwise it is moved to a closer position. Of the new images generated in these two ways, the area jigsaw destroys the spatial information of the image more thoroughly than region confusion. Although destroying spatial information enables the network to learn more discriminative features, global spatial information also assists fine-grained image recognition and cannot be discarded completely. This is also confirmed by a mainstream method: in the DCL algorithm, the recognition accuracy with both image destruction and reconstruction is higher than with image destruction alone.

4) Ablation experiment of loss function

Parameters α, β, γ were set in the loss function of the three subnetworks, discussing the extent of the effect of the parameters on the experiment, as shown in table 2 below:

TABLE 2 influence of different sub-network loss function weights on the algorithm

The experimental results on the CUB 200-2011 data set verified the impact of the third subnetwork on identification, as shown in table 2 above. When the weight of the loss function of the third subnetwork is increased, the recognition accuracy is improved by 0.19%. Therefore, the detail information of the key area of the fine-grained image can effectively improve the algorithm identification precision.

In the third sub-network, the loss function Loss_enhance is composed of Softmax Loss and A-Softmax Loss. The loss function of this sub-network is more complex than those of the other sub-networks, so its role and characteristics are discussed in detail in the experiments. The effect of these two loss functions and their weighting parameters on the algorithm is shown in table 3 below.

Table 3 Recognition accuracy for different construction parameters in Loss_enhance

Loss_enhance	c	s	Accuracy (%)
Softmax	1	0	87.07
Softmax+A-Softmax	1	0.001	87.14
Softmax+A-Softmax	1	0.01	87.28
Softmax+A-Softmax	1	0.1	87.33
Softmax+A-Softmax	1	1	87.59
A-Softmax	0	1	87.40

First, the experimental results show that A-Softmax Loss alone (87.40%) is better suited to classifying fine-grained images than Softmax Loss alone (87.07%), which is consistent with how the two loss functions behave in general image recognition. Second, the effect of constructing the network with both loss functions is analyzed in terms of recognition accuracy, and the parameters are discussed. The results show that the two loss functions together give the best result (87.59%), obtained with the parameters (c, s) set to (1, 1).

5) Ablation experiments on network fabric modules

On the CUB 200-2011 data set, ablation experiments were carried out by adding each module in turn to the ResNet50 base network, as shown in table 4:

TABLE 4 Ablation experiments of the algorithm of the present invention on the CUB 200-2011 data set

Method	Base Model	α	β	γ	Accuracy (%)
(a) ResNet50	ResNet50	1	0	0	85.09
(b) Attention	ResNet50	1	1	0	86.43
(c) Attention+Jigsaw	ResNet50	1	1	1	86.54
(d) Attention+Jigsaw+Attention	ResNet50	1	0	1	86.88
(e) Attention+Jigsaw+Attention	ResNet50	1	1	1	87.33
(f) Adding A-Softmax Loss in the structure (e)	ResNet50	1	1	1	87.40
(g) Attention+RC+Attention	ResNet50	1	1	1	87.59
(h) Loss parameter	ResNet50	1	1	0.5	87.40
As shown in the table above, each module of the network was ablated and its recognition accuracy was analyzed. The experimental results verify the function and rationality of each module in the network. The results and the function realized by each module are explained below, one item per table row (a) to (h):

(a) in the first sub-network, classification is performed by the baseline ResNet50 network; this module provides the baseline classification result;

(b) the attention mechanism in the first and second sub-networks acquires the key region; the channel attention mechanism extracts the key target region so as to eliminate background noise;

(c) after the region-enhanced image module is added to the third sub-network, the recognition accuracy improves noticeably (the three sub-networks act together); the new region-enhanced image forces the network to focus on the detail information of each sub-region while reducing its focus on similar global information;

(d) a channel attention module is added to the third sub-network, and the region enhancement mechanism is applied directly to the key-region image passed from the first sub-network to the third; the network is thus constructed from the first and third sub-networks only, that is, the second sub-network is deleted. In this case region enhancement is introduced directly into a key region for which no robust and stable localized region has been obtained after the locate-and-crop operation. This results in a less pronounced increase in recognition accuracy and also indicates the necessity of the second branch;

(e) the three sub-networks jointly construct the network; each sub-network contributes to the recognition accuracy;

(f) the loss function of the third sub-network in structure (e) is modified by adding an A-Softmax Loss function; comparing the different loss functions, the experimental results show that the A-Softmax Loss function improves the network;

(g) the region enhancement modes are compared; region confusion (RC in Table 4) differs from region jigsaw in that the range of random movement in region confusion is constrained. The experimental results also verify that spatial global information cannot be completely discarded in the network;

(h) the loss parameters are compared in order to obtain suitable loss function parameters, so that the parameter weights can be set reasonably.
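The α, β and γ columns of Table 4 can be read as weights on the losses of the first, second and third sub-networks. That reading is an assumption inferred from the ablation item in which the second sub-network is deleted, which corresponds to the row with β = 0. Under that assumption, a minimal sketch of how the ablation settings would combine the per-branch losses is given below; the name `combined_loss` and the example loss values are hypothetical.

```python
# Weights (alpha, beta, gamma) for rows (a)-(h), copied from Table 4.
ABLATION_WEIGHTS = {
    "a": (1, 0, 0),
    "b": (1, 1, 0),
    "c": (1, 1, 1),
    "d": (1, 0, 1),
    "e": (1, 1, 1),
    "f": (1, 1, 1),
    "g": (1, 1, 1),
    "h": (1, 1, 0.5),
}


def combined_loss(branch_losses, weights):
    """Weighted sum of the three sub-network losses.

    The weighted-sum form is an assumption; the source only lists the
    alpha, beta and gamma values used in each ablation row.
    """
    if len(branch_losses) != 3 or len(weights) != 3:
        raise ValueError("expected three branch losses and three weights")
    return sum(w * loss for w, loss in zip(weights, branch_losses))


# Hypothetical per-branch loss values, for illustration only.
example_losses = (0.5, 0.25, 0.125)
loss_d = combined_loss(example_losses, ABLATION_WEIGHTS["d"])  # second branch removed
loss_h = combined_loss(example_losses, ABLATION_WEIGHTS["h"])  # third branch half weight
```

Setting a weight to 0 removes that branch from the objective, which is how a row with a zero entry would disable the corresponding sub-network in this reading.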

6) Comparison of the recognition performance of the algorithm with mainstream algorithms on multiple data sets

On multiple data sets, the algorithm of the invention is compared with current mainstream algorithms. The experimental results verify that constructing the algorithm with an attention mechanism and region enhancement can effectively improve fine-grained recognition accuracy, as shown in Table 5.

TABLE 5 Accuracy of different algorithms on multiple data sets (%)

Methods	Base Model	CUB	CAR	AIR
ResNet50	-	85.4	92.7	90.3
B-CNN	VGG16	84.1	91.3	84.1
MA-CNN	VGG-19	86.5	92.5	89.9
NTS	ResNet50	87.5	93.9	91.4
DCL(DL)	ResNet50	87.2	94.4	91.6
DCL(CL)	ResNet50	86.7	-	-
DCL(DL+CL)	ResNet50	87.8	94.5	92.2
Algorithm of the invention	ResNet50	87.59	94.14	92.92

The experimental results show that the algorithm achieves better recognition accuracy on the AIR data set, whereas the results on CUB and CAR are not ideal. On the one hand, this reflects that the importance of global information cannot be completely discarded; on the other hand, it reflects the influence of parameter settings. Because variables must be controlled when comparing experimental results, the influence of the region enhancement parameter (N) on the experiment is not discussed, and N is fixed at 7.

In the method, a key region is acquired first, a new region-enhanced image is then generated, and finally a multi-layer sub-network model is constructed. The experimental results verify the model and lead to two conclusions: 1) Generating the key region in the network eliminates noise such as background, and region enhancement makes detail information easier to learn. The algorithm mines the detail information of the key region, which is consistent with the characteristic of fine-grained images: because the objects are similar in overall shape, their differences lie mainly in detail regions. 2) In the region enhancement module, region confusion performs better than region jigsaw; the difference between the two is that the range of movement in region confusion is limited. Comparison with the DCL method shows that spatial global information contributes to image classification and cannot be completely discarded.
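The comparison can be made concrete by computing, for each data set, the gap between the algorithm of the invention and the best of the other methods in Table 5. The sketch below only re-uses the numbers from the table; the helper name `margin_over_best_baseline` is hypothetical, and missing entries ("-") are represented as `None`.

```python
# Accuracy (%) on CUB, CAR and AIR, copied from Table 5.
TABLE_5 = {
    "ResNet50": (85.4, 92.7, 90.3),
    "B-CNN": (84.1, 91.3, 84.1),
    "MA-CNN": (86.5, 92.5, 89.9),
    "NTS": (87.5, 93.9, 91.4),
    "DCL(DL)": (87.2, 94.4, 91.6),
    "DCL(CL)": (86.7, None, None),
    "DCL(DL+CL)": (87.8, 94.5, 92.2),
    "Algorithm of the invention": (87.59, 94.14, 92.92),
}
DATASETS = ("CUB", "CAR", "AIR")


def margin_over_best_baseline(table, ours):
    """Return {dataset: our accuracy minus the best other accuracy}."""
    margins = {}
    for col, name in enumerate(DATASETS):
        best_other = max(
            scores[col]
            for method, scores in table.items()
            if method != ours and scores[col] is not None
        )
        margins[name] = round(table[ours][col] - best_other, 2)
    return margins


margins = margin_over_best_baseline(TABLE_5, "Algorithm of the invention")
# margins == {"CUB": -0.21, "CAR": -0.36, "AIR": 0.72}
```

The negative margins on CUB and CAR and the positive margin on AIR match the statement that the result is better on AIR but not ideal on the other two data sets.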

Compared with the prior art, the invention has several advantages and beneficial effects, embodied in the following aspects:

the channel attention module acquires the key region so as to reduce background noise;

region enhancement mines the detail information of the key region, destroys the spatial information of the key-region image, and increases the diversity of the data;

sub-category images can be recognized more accurately and efficiently.

Referring to Fig. 4, a second embodiment of the present invention provides a network construction apparatus for image recognition, including:

a target region acquiring unit 201, configured to acquire an image to be recognized and locate a target region of the image to be recognized, so as to obtain the target region;

a region jigsaw generating unit 202, configured to destroy the target region and generate a region jigsaw;

and a network model establishing unit 203, configured to acquire the detail information of the region jigsaw and establish a network model of the image to be recognized in combination with a network loss function.

Preferably, the target region acquiring unit is specifically configured to:

input the image to be recognized into a ResNet50 network for feature extraction;

feed the extracted features into a residual block and compute the coordinates of the salient region;

and map the coordinates of the salient region onto the image to be recognized so as to locate the target region.
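The localization steps above can be illustrated with a small sketch. The source does not state how the salient-region coordinates are computed from the features, so the rule used here (sum the feature map over channels, keep positions above the mean activation, take their bounding box, and scale it to image coordinates) is an assumption for illustration only. The function name `locate_target_region` and its parameters are hypothetical, and a random or synthetic array stands in for the ResNet50 features.

```python
import numpy as np


def locate_target_region(features, image_size):
    """Map a salient region of a feature map back to image pixels.

    features:   array of shape (C, h, w), e.g. the output of a backbone.
    image_size: (H, W) of the image to be recognized.
    Returns (top, left, bottom, right) in image pixel coordinates.
    Thresholding at the mean activation is an assumed rule.
    """
    activation = features.sum(axis=0)            # (h, w) activation map
    mask = activation > activation.mean()        # salient positions
    if not mask.any():                           # fall back to the full image
        return (0, 0, image_size[0], image_size[1])

    rows = np.flatnonzero(mask.any(axis=1))      # feature rows with salient cells
    cols = np.flatnonzero(mask.any(axis=0))      # feature columns with salient cells
    scale_y = image_size[0] / activation.shape[0]
    scale_x = image_size[1] / activation.shape[1]

    top = int(np.floor(rows[0] * scale_y))
    bottom = int(np.ceil((rows[-1] + 1) * scale_y))
    left = int(np.floor(cols[0] * scale_x))
    right = int(np.ceil((cols[-1] + 1) * scale_x))
    return (top, left, bottom, right)


# Synthetic 14x14 feature map with one active block, for a 448x448 image.
features = np.zeros((4, 14, 14))
features[:, 3:6, 5:9] = 1.0
box = locate_target_region(features, (448, 448))
```

The returned box can then be used to crop the image to be recognized, giving the target region that is passed on to the region jigsaw generating unit.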

Preferably, the region jigsaw generating unit is specifically configured to:

divide the target region into N × N sub-regions, where R_{i,j} denotes the location of a sub-region;

and move each sub-region within its neighborhood according to the constraint condition to generate the region jigsaw.

Preferably, the constraint condition is:

σ(i, j) = (σ_j^{row}(i), σ_i^{col}(j)).
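The constrained shuffle can be sketched as follows. The text above gives only the composite mapping σ(i, j); the way each row and column permutation is drawn here (add uniform noise in [-k, k) to the indices and sort, in the manner of the region confusion mechanism of the DCL method) is an assumption, as are the names `neighborhood_permutation` and `region_confusion` and the parameter `k`. Applying the row shuffles first and the column shuffles second is also a simplification. N = 7 matches the fixed value used in the experiments.

```python
import numpy as np


def neighborhood_permutation(n, k, rng):
    """Permutation of range(n) in which every index moves fewer than 2k places."""
    noisy = np.arange(n) + rng.uniform(-k, k, size=n)
    return np.argsort(noisy)


def region_confusion(image, n=7, k=2, seed=0):
    """Shuffle the n x n patches of `image` within a local neighborhood.

    Returns the shuffled image and an (n, n) array `ids` where ids[r, c]
    is the original flat index (row * n + col) of the patch now at (r, c).
    """
    rng = np.random.default_rng(seed)
    ph, pw = image.shape[0] // n, image.shape[1] // n
    cropped = image[: ph * n, : pw * n]          # drop pixels that do not fit the grid

    # grid[r, c] is the patch in row r, column c (copy so the input is untouched).
    grid = cropped.reshape(n, ph, n, pw, -1).transpose(0, 2, 1, 3, 4).copy()
    ids = np.arange(n * n).reshape(n, n)

    for r in range(n):                           # shuffle patches within each row
        perm = neighborhood_permutation(n, k, rng)
        grid[r] = grid[r, perm]
        ids[r] = ids[r, perm]
    for c in range(n):                           # then within each column
        perm = neighborhood_permutation(n, k, rng)
        grid[:, c] = grid[perm, c]
        ids[:, c] = ids[perm, c]

    shuffled = grid.transpose(0, 2, 1, 3, 4).reshape(cropped.shape)
    return shuffled, ids


image = np.arange(14 * 14 * 3).reshape(14, 14, 3)
jigsaw, ids = region_confusion(image, n=7, k=2, seed=0)
```

Because each patch moves fewer than 2k positions along either axis, the jigsaw destroys fine spatial layout while keeping the coarse global arrangement, which is the property the experiments attribute to the constrained variant.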

Illustratively, the computer programs described in the third and fourth embodiments of the present invention may be partitioned into one or more modules, which are stored in the memory and executed by the processor to implement the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program in the network construction device for image recognition, for example, the device described in the second embodiment of the present invention.

The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the network construction device for image recognition and connects the various parts of the whole device through various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the network construction method for image recognition by running or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, a text conversion function, etc.), and the like; the data storage area may store data (such as audio data, text message data, etc.) created according to the use of the cellular phone, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid state storage device.

If the implemented module is realized in the form of a software functional unit and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, under legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.

It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description covers only the preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present invention fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
