High-generalization cross-domain road scene semantic segmentation method and system


Reading note: this technology, "High-generalization cross-domain road scene semantic segmentation method and system", was designed and created by Lei Yinjie and Peng Duo on 2021-02-05. Its main content is as follows: the invention discloses a high-generalization cross-domain road scene semantic segmentation method, comprising: generating virtual images and corresponding labels through a game engine; generating global and local texture migration images from the virtual images; sending the virtual images and the global/local texture migration images into a neural network for training; applying a consistency constraint to the global and local texture migration images trained by the neural network; computing loss values between the labels and, respectively, the virtual image trained by the neural network and the global and local texture migration images subjected to the consistency constraint, and training a semantic segmentation model according to these loss values; and performing semantic segmentation with the trained semantic segmentation model. The invention realizes data augmentation through global and local texture migration of the virtual images, attacking the neural network and forcing the model to learn shape information that is invariant across domains; moreover, the method performs network training only on the source domain, achieves a reliable cross-domain segmentation effect, and has strong generalization performance.

1. A high-generalization cross-domain road scene semantic segmentation method is characterized by comprising the following steps:

generating a virtual image and a corresponding label through a game engine;

carrying out global texture migration on the virtual image to generate a global texture migration image;

carrying out local texture migration on the virtual image and the global texture migration image to generate a local texture migration image;

unifying the sizes of the virtual image, the global texture migration image and the local texture migration image, and then sending the images into a neural network for training;

carrying out consistency constraint on the global texture migration image and the local texture migration image which are trained by the neural network;

calculating loss values with respect to the label for the virtual image trained by the neural network and for the global texture migration image and the local texture migration image subjected to the consistency constraint, and training a semantic segmentation model according to the loss values;

and performing semantic segmentation on any real scene by using the trained semantic segmentation model.

2. The method according to claim 1, wherein the global texture migration image X_GTR is obtained by performing global texture migration on the virtual image as follows:

X_GTR = f( σ(F_T) · (F_X − μ(F_X)) / σ(F_X) + μ(F_T) )

where F_X is the feature of the virtual image, F_T is the feature of the texture image, μ is the mean of the feature map, σ is the variance of the feature map, and f is the neural network that decodes the features into the global texture migration image.

3. The method according to claim 2, wherein the performing local texture migration on the virtual image and the global texture migration image to generate a local texture migration image comprises:

constructing a random mask M;

the local texture migration image X_LTR is obtained as follows:

X_LTR = M ⊙ X_GTR + (1 − M) ⊙ X

where ⊙ denotes the element-wise (Hadamard) product.

4. The method according to claim 3, wherein the constructing a random mask M comprises:

performing convolution smoothing on a random noise image to obtain a convolved image, wherein the convolution kernel has a size of 3 × 3 and obeys an e^(log λ) distribution;

and generating the random mask M from the convolved image through an adaptive decision threshold, wherein the adaptive decision threshold is determined by the mean μ and the variance σ of the feature map.

5. The method according to claim 1, wherein sending the virtual image, the global texture migration image and the local texture migration image into the neural network for training after unifying their sizes comprises:

respectively cutting the virtual image, the global texture migration image and the local texture migration image into square images according to a set step length s;

respectively scaling the cut virtual image, the global texture migration image and the local texture migration image into a specified resolution;

and respectively sending the scaled virtual image, the global texture migration image and the local texture migration image into a deep convolution neural network for training.

6. The method for semantic segmentation of the highly-generalized cross-domain road scene according to claim 5, wherein the set step length s is obtained as follows:

s = ⌊(l − w) / 2⌋

where l is the length of the image, w is the width of the image, and ⌊·⌋ denotes rounding down.

7. The method according to claim 1, wherein the consistency constraint of the global texture migration image and the local texture migration image trained by the neural network comprises:

acquiring feature maps of global texture migration images and local texture migration images trained by a neural network;

calculating the L_1 distance between each pair of corresponding feature maps:

L_1 = Σ_{i=1}^{N} | F_GTR(i) − F_LTR(i) |

where F_GTR and F_LTR respectively denote the features of the global texture migration image and the local texture migration image at the layer before the output layer of the neural network, i indexes the i-th feature value in the feature map, and N is the resolution of the feature map;

the constraint is implemented by using the L_1 distance to measure the dissimilarity between the feature maps.

8. A high-generalization cross-domain road scene semantic segmentation system, characterized in that the system comprises:

the image generation module generates a virtual image and a corresponding label through a game engine;

the global texture migration module carries out global texture migration on the virtual image to generate a global texture migration image;

the local texture migration module carries out local texture migration on the virtual image and the global texture migration image to generate a local texture migration image;

the neural network training module is used for unifying the sizes of the virtual image, the global texture migration image and the local texture migration image and then sending the images into the neural network for training;

the consistency constraint module is used for carrying out consistency constraint on the global texture migration image and the local texture migration image which are trained by the neural network;

the semantic segmentation model training module is used for calculating loss values with respect to the label for the virtual image trained by the neural network and for the global texture migration image and the local texture migration image subjected to the consistency constraint, and training a semantic segmentation model according to the loss values;

and the semantic segmentation module is used for performing semantic segmentation on any real scene by using the trained semantic segmentation model.

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a high-generalization cross-domain road scene semantic segmentation method and system.

Background

Image semantic segmentation means that a computer understands an image from its content and then partitions it visually. In recent years, with the continuous development of artificial intelligence, semantic segmentation technology based on deep learning has been applied more and more widely in industrial production, social security, transportation and other fields; among these, semantic segmentation for unmanned driving is a hot research direction and an inevitable trend of development. Semantic segmentation is a core algorithmic technology of unmanned vehicle driving: an image is captured by a vehicle-mounted camera or lidar and fed into a neural network, and a background computer automatically segments and classifies the image so that obstacles such as pedestrians and vehicles can be avoided.

In recent years, the performance of deep-learning semantic segmentation has approached maturity, but researchers have found that models performing well on the training data set do not perform well when applied in other scenarios, because the training images (source domain) and the application images (target domain) lie in two different domains with inconsistent data distributions. In response, researchers have proposed a large number of deep-learning methods to deal with the performance degradation of semantic segmentation across domains. These methods need to acquire images of the target domain in advance to adapt the source-domain distribution to that of the target domain, thereby improving segmentation robustness in the target domain, but two problems remain. First, most current methods must obtain part of the image data of a new target domain in advance to adapt to it, which consumes considerable labor and resources. Second, these methods aim at cross-domain segmentation for a certain known target domain: the resulting fixed model can only be applied to that specific domain, cannot generalize to other domains, and cannot meet the requirements of practical application.

Disclosure of Invention

Aiming at the problems of high target domain data dependency and poor generalization performance in the prior art, the invention provides a cross-domain road scene semantic segmentation method which does not depend on target domain data and has considerable generalization performance.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

a high-generalization cross-domain road scene semantic segmentation method comprises the following steps:

generating a virtual image and a corresponding label through a game engine;

carrying out global texture migration on the virtual image to generate a global texture migration image;

carrying out local texture migration on the virtual image and the global texture migration image to generate a local texture migration image;

unifying the sizes of the virtual image, the global texture migration image and the local texture migration image, and then sending the images into a neural network for training;

carrying out consistency constraint on the global texture migration image and the local texture migration image which are trained by the neural network;

calculating loss values with respect to the label for the virtual image trained by the neural network and for the global texture migration image and the local texture migration image subjected to the consistency constraint, and training a semantic segmentation model according to the loss values;

and performing semantic segmentation on any real scene by using the trained semantic segmentation model.

Further, the global texture migration image X_GTR is generated by performing global texture migration on the virtual image as follows:

X_GTR = f( σ(F_T) · (F_X − μ(F_X)) / σ(F_X) + μ(F_T) )

where F_X is the feature of the virtual image, F_T is the feature of the texture image, μ is the mean of the feature map, σ is the variance of the feature map, and f is the neural network that decodes the features into the global texture migration image.

Further, the performing local texture migration on the virtual image and the global texture migration image to generate a local texture migration image includes:

constructing a random mask M;

the local texture migration image X_LTR is obtained as follows:

X_LTR = M ⊙ X_GTR + (1 − M) ⊙ X

where ⊙ denotes the element-wise (Hadamard) product.

Further defined, the constructing the random mask M includes:

performing convolution smoothing on a random noise image to obtain a convolved image, wherein the convolution kernel has a size of 3 × 3 and obeys an e^(log λ) distribution;

and generating the random mask M from the convolved image through an adaptive decision threshold, wherein the adaptive decision threshold is determined by the mean μ and the variance σ of the feature map.

Further limiting, the sending the virtual image, the global texture migration image and the local texture migration image into the neural network for training after the virtual image, the global texture migration image and the local texture migration image are unified in size includes:

respectively cutting the virtual image, the global texture migration image and the local texture migration image into square images according to a set step length s;

respectively scaling the cut virtual image, the global texture migration image and the local texture migration image into a specified resolution;

and respectively sending the scaled virtual image, the global texture migration image and the local texture migration image into a deep convolution neural network for training.

Further, the set step length s is obtained as follows:

s = ⌊(l − w) / 2⌋

where l is the length of the image, w is the width of the image, and ⌊·⌋ denotes rounding down.

Further defined, the performing consistency constraint on the global texture migration image and the local texture migration image trained by the neural network includes:

acquiring feature maps of global texture migration images and local texture migration images trained by a neural network;

calculating the L_1 distance between each pair of corresponding feature maps:

L_1 = Σ_{i=1}^{N} | F_GTR(i) − F_LTR(i) |

where F_GTR and F_LTR respectively denote the features of the global texture migration image and the local texture migration image at the layer before the output layer of the neural network, i indexes the i-th feature value in the feature map, and N is the resolution of the feature map;

the constraint is implemented by using the L_1 distance to measure the dissimilarity between the feature maps.

Another object of the present invention is to provide a high-generalization cross-domain road scene semantic segmentation system, which specifically includes:

the image generation module generates a virtual image and a corresponding label through a game engine;

the global texture migration module carries out global texture migration on the virtual image to generate a global texture migration image;

the local texture migration module carries out local texture migration on the virtual image and the global texture migration image to generate a local texture migration image;

the neural network training module is used for unifying the sizes of the virtual image, the global texture migration image and the local texture migration image and then sending the images into the neural network for training;

the consistency constraint module is used for carrying out consistency constraint on the global texture migration image and the local texture migration image which are trained by the neural network;

the semantic segmentation model training module is used for calculating loss values with respect to the label for the virtual image trained by the neural network and for the global texture migration image and the local texture migration image subjected to the consistency constraint, and training a semantic segmentation model according to the loss values;

and the semantic segmentation module is used for performing semantic segmentation on any real scene by using the trained semantic segmentation model.

Compared with the prior art, the invention has the following beneficial effects:

1. Data acquisition is convenient and labor is reduced: the basic virtual images for training the network, together with their corresponding labels, are generated directly by a game engine, so no manual annotation is needed, saving a large amount of manpower and material resources.

2. Robustness is good: the model is trained only on single-source virtual images, learning the shapes in the images rather than texture features, and the training process does not need to adapt to any specific target domain, so the method can be applied generally to various real scenes.

3. Without touching any target-domain data, the evaluation-index accuracy of cross-domain semantic segmentation is high.

4. The development prospect is good: the method does not depend on any target-domain data and only requires texture-migration augmentation of the source-domain data. Texture maps of any form can be used for the migration; the current number of texture images is only 15, and the variety of migrated textures can be increased as needed, meeting the demands of deeper networks and higher computing power in the future.

Drawings

FIG. 1 is a flow chart of the cross-domain road scene semantic segmentation method of the present invention;

FIG. 2 is a schematic diagram of global texture migration according to the present invention;

FIG. 3 is a schematic diagram of local texture migration according to the present invention;

FIG. 4 is a schematic structural diagram of a cross-domain road scene semantic segmentation system according to the present invention;

FIG. 5 is a diagram of the cross-domain semantic segmentation effect under three real scenes according to the present invention.

Detailed Description

In order to facilitate understanding of those skilled in the art, the present invention will be further described with reference to the following examples and drawings, which are not intended to limit the present invention.

In current semantic segmentation methods, image data of the target domain is difficult to acquire in advance. Taking automatic driving as an example, an operator cannot collect road images of all target regions beforehand; moreover, existing adaptation methods can only segment a known target domain, so a trained model is only applicable to that specific domain and cannot generalize to other target domains.

FIG. 1 shows a flow chart of a highly generalized cross-domain road scene semantic segmentation method, which includes:

s10: a virtual image and corresponding label are generated by the game engine.

Since the game engine can generate a large number of simulation images and easily acquire corresponding labels, the game engine generates a large number of virtual images and labels to train the model in this embodiment.

S20: and carrying out global texture migration on the virtual image to generate a global texture migration image.

Firstly, AdaIN (adaptive instance normalization) is adopted to perform global texture migration on the virtual image X, and the number of texture images T used for global texture migration is not less than 15. Let F_X be the feature of the virtual image, F_T the feature of the texture image, μ the mean of the feature map, and σ the variance of the feature map; the global texture migration image X_GTR is then obtained as follows:

X_GTR = f( σ(F_T) · (F_X − μ(F_X)) / σ(F_X) + μ(F_T) )

where f is the neural network that decodes the features into the global texture migration image.

Fig. 2 is a schematic diagram of global texture migration, and it can be seen from the diagram that the virtual image can generate a global texture image with texture image features after the features of the texture image are migrated.
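As a concrete illustration, the statistic-matching step of AdaIN above can be sketched in Python. This is a minimal single-channel sketch under stated assumptions: the real method operates on multi-channel encoder feature maps, and the decoder f is omitted.

```python
import math

def adain(f_x, f_t, eps=1e-5):
    """Re-normalize content features f_x so that their mean and standard
    deviation match those of the texture features f_t (adaptive instance
    normalization, single channel). f_x, f_t: flat lists of feature values."""
    mu_x = sum(f_x) / len(f_x)
    mu_t = sum(f_t) / len(f_t)
    sigma_x = math.sqrt(sum((v - mu_x) ** 2 for v in f_x) / len(f_x) + eps)
    sigma_t = math.sqrt(sum((v - mu_t) ** 2 for v in f_t) / len(f_t) + eps)
    # Normalize with the content statistics, then re-scale and shift with the
    # texture statistics; a decoder f would map the result back to an image.
    return [sigma_t * (v - mu_x) / sigma_x + mu_t for v in f_x]
```

After this step the stylized features carry the texture image's channel statistics while preserving the spatial layout of the virtual image, which is what the decoder turns into the global texture migration image.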

S30: and carrying out local texture migration on the virtual image and the global texture migration image to generate a local texture migration image.

In the local texture migration process, the migration objects are virtual images and completed global texture migration images, and the method comprises the following steps:

firstly, a random mask M is constructed according to the following formula:

M = γ( f_{3×3}(G, e^(log λ)) )

where f_{3×3}(G, e^(log λ)) denotes convolution smoothing of the random noise image G with a 3 × 3 kernel obeying an e^(log λ) distribution, yielding a convolved image; the convolved image is then passed through the adaptive decision threshold γ to generate the random mask M.

It can be seen that the generation of the random mask M depends on the adaptive decision threshold γ, which is determined by the mean μ and variance σ of the feature map and is therefore adaptive.

The randomness of the mask comes from λ, which is drawn at random from [4, 16] at each iteration of the neural network.

Finally, the local texture migration image X_LTR is obtained as follows:

X_LTR = M ⊙ X_GTR + (1 − M) ⊙ X

where ⊙ denotes the element-wise (Hadamard) product.

Fig. 3 shows a local texture migration image generated by using a random mask M for the virtual image and the global texture image.
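The mask construction and blending described above can be sketched as follows. This is a hedged sketch: the 3×3 mean filter and the mean-based threshold stand in for the patent's e^(log λ) kernel and adaptive decision threshold γ, whose exact forms are not fully specified here.

```python
def random_mask(noise):
    """Smooth a noise image with a 3x3 mean filter, then binarize it with an
    adaptive threshold computed from the smoothed map's own statistics.
    (Mean filter and mean-based threshold are illustrative stand-ins.)"""
    h, w = len(noise), len(noise[0])
    smooth = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [noise[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            smooth[y][x] = sum(window) / len(window)
    flat = [v for row in smooth for v in row]
    mu = sum(flat) / len(flat)  # adaptive threshold from the map's mean
    return [[1.0 if v > mu else 0.0 for v in row] for row in smooth]

def local_texture_transfer(x, x_gtr, mask):
    """X_LTR = M * X_GTR + (1 - M) * X, applied element-wise."""
    return [[m * g + (1 - m) * v for v, g, m in zip(rx, rg, rm)]
            for rx, rg, rm in zip(x, x_gtr, mask)]
```

Where the mask is 1 the pixel is taken from the global texture migration image, and where it is 0 the pixel is taken from the original virtual image, giving patchwise texture mixing.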

S40: and sending the virtual image, the global texture migration image and the local texture migration image into a neural network for training after the virtual image, the global texture migration image and the local texture migration image are unified in size.

Before sending images to be trained (virtual images, global texture migration images and local texture migration images) into a neural network for training, unifying the sizes of the images to be trained, specifically:

firstly, a step length s is set to cut the image to be trained into a square image, where s is set as follows:

s = ⌊(l − w) / 2⌋

where l is the length of the image to be trained, w is the width of the image to be trained, and ⌊·⌋ is the rounding-down operation.

Taking an image resolution of 640 × 480 as an example, the step length is s = ⌊(640 − 480) / 2⌋ = 80, so cropping 80 pixels from each end of the long side (160 pixels in total from the 640-pixel side) yields a 480 × 480 square image.
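The cropping rule of this step can be sketched as follows, assuming the reconstruction s = ⌊(l − w)/2⌋ with s pixels removed from each end of the long side:

```python
def square_crop(image):
    """Crop an image (a list of rows, each a list of pixels, with the long
    side horizontal) to a centered square by removing s = (l - w) // 2
    pixels from each end of the long side."""
    w = len(image)      # number of rows = short side
    l = len(image[0])   # row length = long side
    s = (l - w) // 2
    return [row[s:s + w] for row in image]
```

For a 640 × 480 input this removes 80 pixels from each side of the 640-pixel dimension, producing the 480 × 480 square described above.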

The image to be trained, cropped to a square, is then scaled to a specified resolution, which in this embodiment is 640 x 640.

Finally, the scaled virtual image, global texture migration image and local texture migration image are respectively sent to the 101-layer deep residual network ResNet-101 for training.

The ResNet-101 network here is an end-to-end encoder-decoder structure. In the decoding process, each module receives the output of the previous module as input and performs nearest-neighbor interpolation, doubling the size of the feature map. During training, the cross-entropy loss and the L_1 distance are adopted to measure the current segmentation quality of the network and to penalize the network weights.

ResNet-101 works better here than other convolutional neural networks (e.g., VGG-16, ResNet-50).

S50: and carrying out consistency constraint on the global texture migration image and the local texture migration image which are trained by the neural network.

Wherein, the consistency constraint process specifically comprises:

acquiring feature maps of global texture migration images and local texture migration images trained by a neural network;

calculating the L_1 distance between each pair of corresponding feature maps:

L_1 = Σ_{i=1}^{N} | F_GTR(i) − F_LTR(i) |

where F_GTR and F_LTR respectively denote the features of the global texture migration image and the local texture migration image at the layer before the output layer of the neural network, i indexes the i-th feature value in the feature map, and N is the resolution of the feature map. The L_1 distance is computed by taking the element-wise distance of each pair of corresponding feature maps and summing.

The constraint is implemented by using the L_1 distance to measure the dissimilarity between the feature maps.
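The consistency term can be sketched directly from the formula above, on flattened feature maps:

```python
def l1_consistency(f_gtr, f_ltr):
    """Sum of element-wise absolute differences between the (flattened)
    feature maps of the global and local texture migration images, as used
    for the consistency constraint."""
    assert len(f_gtr) == len(f_ltr)
    return sum(abs(a - b) for a, b in zip(f_gtr, f_ltr))
```

Minimizing this quantity pushes the network to produce the same features regardless of which texture perturbation was applied, i.e. to rely on shape rather than texture.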

S60: calculating loss values with respect to the label for the virtual image trained by the neural network and for the global texture migration image and the local texture migration image subjected to the consistency constraint, and training a semantic segmentation model according to the loss values.

The loss values may be computed with a cross-entropy loss function.
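A minimal per-pixel cross-entropy sketch (a sketch under stated assumptions: softmax probabilities are taken as already computed, and the shapes are illustrative):

```python
import math

def pixel_cross_entropy(probs, labels):
    """Mean cross-entropy over pixels: -log of the probability assigned to
    each pixel's true class. probs: per-pixel lists of class probabilities;
    labels: per-pixel true class indices."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)
```

Each of the three training branches (virtual, global, local) would contribute such a loss against the shared ground-truth label.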

S70: and performing semantic segmentation on any real scene by using the trained semantic segmentation model.

As shown in fig. 4, the present application further provides a high-generalization cross-domain road scene semantic segmentation system, which specifically includes an image generation module, a global texture migration module, a local texture migration module, a neural network training module, a consistency constraint module, a semantic segmentation model training module, and a semantic segmentation module.

Since the game engine can generate a large number of simulated images whose corresponding labels are easy to obtain, the image generation module generates the virtual images to be trained, together with their labels, through the game engine.

And the global texture migration module performs global texture migration on the virtual image to generate a global texture migration image, wherein the number of texture images in the global texture migration is not less than 15.

The local texture migration module is used for carrying out local texture migration on the virtual image and the global texture migration image to generate a local texture migration image. In the local texture migration, a random noise image is first smoothed by a Gaussian kernel and then passed through an adaptive decision threshold to form the random mask M:

M = γ( f_{3×3}(G, e^(log λ)) )

where the convolution kernel is 3 × 3 in size and obeys an e^(log λ) distribution, and the adaptive decision threshold γ is determined by the mean μ and variance σ of the feature map and is therefore adaptive.

The randomness of the random mask M comes from λ, which is drawn at random from [4, 16] at each iteration of the neural network.

Finally, the local texture migration image X_LTR is obtained using the random mask as follows:

X_LTR = M ⊙ X_GTR + (1 − M) ⊙ X

where ⊙ denotes the element-wise (Hadamard) product.

The neural network training module unifies the sizes of the virtual image, the global texture migration image and the local texture migration image, and then sends the images into the neural network for training, which specifically comprises the following steps:

cutting the image into a square image according to a set step length s;

then scaling the cut virtual image, the global texture migration image and the local texture migration image into a specified resolution;

and finally, respectively sending the scaled virtual image, the global texture migration image and the local texture migration image into a deep convolutional neural network Resnet-101 for training.

And the consistency constraint module is used for carrying out consistency constraint on the global texture migration image and the local texture migration image which are trained by the neural network.

The consistency constraint is as follows: feature maps of the global texture migration image and the local texture migration image trained by the neural network are obtained; the L_1 distance between each pair of corresponding feature maps is computed; and the constraint is implemented by using this distance to measure the dissimilarity between the feature maps, where the L_1 distance is:

L_1 = Σ_{i=1}^{N} | F_GTR(i) − F_LTR(i) |

where F_GTR and F_LTR respectively denote the features of the global texture migration image and the local texture migration image at the layer before the output layer of the neural network, i indexes the i-th feature value in the feature map, and N is the resolution of the feature map.

The semantic segmentation model training module is used for calculating loss values with respect to the label for the virtual image trained by the neural network and for the global texture migration image and the local texture migration image subjected to the consistency constraint, training a semantic segmentation model according to the loss values, and finally saving the semantic segmentation model.

When the method is used, the semantic segmentation module performs semantic segmentation on any real scene by using the trained semantic segmentation model.

According to the method, data augmentation is realized through global and local texture migration of the virtual images, attacking the neural network and forcing the model to learn shape information that is invariant across domains. The method fully takes into account that target-domain data is difficult to obtain in advance in application scenarios; it performs network training only on the source domain, achieves a reliable cross-domain segmentation effect, and has strong generalization performance.

Semantic segmentation evaluation index: mIoU

When evaluating semantic segmentation results, a prediction is commonly divided into four cases: true positive (TP, the part of the predicted label that matches the ground truth), false positive (FP, the part predicted as the label that actually is not), true negative (TN, the part correctly predicted as background), and false negative (FN, the part predicted as background that actually is not). Here, negative refers to the non-object part (which can be understood directly as background), and positive refers to the labeled part.

The IoU (intersection over union) index is the commonly used intersection-over-union ratio, which serves as a standard measure in semantic segmentation. IoU is generally computed per class (it can also be computed per image); the per-class IoU values are accumulated and then averaged to obtain a global evaluation, the mean intersection-over-union mIoU (mean IoU).

The mIoU is computed as:

mIoU = (1 / (k + 1)) Σ_{i=0}^{k} p_ii / ( Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji − p_ii )

where p_ij denotes the number of pixels with true class i predicted as class j, k + 1 is the number of classes (including the empty class), p_ii corresponds to the true positives (TP), p_ij (j ≠ i) to the false negatives (FN), and p_ji (j ≠ i) to the false positives (FP).
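The mIoU computation can be sketched from a confusion matrix:

```python
def miou(conf):
    """Mean IoU from a (k+1) x (k+1) confusion matrix, where conf[i][j] is
    the number of pixels of true class i predicted as class j."""
    n = len(conf)
    ious = []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i][j] for j in range(n)) - tp  # true i, predicted other
        fp = sum(conf[j][i] for j in range(n)) - tp  # predicted i, true other
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

A perfect prediction yields mIoU = 1.0; each misclassified pixel counts once as a false negative for its true class and once as a false positive for its predicted class.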

As shown in fig. 5, the semantic segmentation model trained in this embodiment is evaluated on cross-domain semantic segmentation in three real scenes: the settings GTA5 to Cityscapes, GTA5 to BDDS, and GTA5 to Mapillary reach mIoU values of 43.7, 39.6, and 39.1 respectively. Thus, without touching any target-domain data, the cross-domain road scene semantic segmentation method and system provided by the present application achieve high evaluation-index accuracy.

The method and the system for segmenting the high-generalization cross-domain road scene semantic meaning are introduced in detail above. The description of the specific embodiments is only intended to facilitate an understanding of the methods of the present application and their core concepts. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
