Time sequence shift and multi-branch space-time enhancement network-based stepping footprint image retrieval method

Document number: 1846097    Publication date: 2021-11-16

Reading note: this technique, "a stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network" (基于时序移位和多分支时空增强网络的成趟足迹图像检索方法), was designed and created by 唐俊, 吴正建, 王年, 朱明�, 张艳 and 鲍文霞 on 2021-08-17. The invention relates to a stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network, which, compared with the prior art, overcomes the defects that effective spatio-temporal features among stepping footprint images are few and that the discriminative features of different categories are difficult to aggregate. The invention comprises the following steps: acquiring training data; constructing the stepping footprint image retrieval model; preprocessing the training data; training the stepping footprint image retrieval model; acquiring the stepping footprint images to be retrieved; and retrieving the stepping footprint images. The invention improves the retrieval speed and accuracy of stepping footprint images.

1. A stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network, characterized by comprising the following steps:

11) acquisition of training data: acquiring stepping footprint pressure images as training data;

12) construction of the stepping footprint image retrieval model: establishing a stepping footprint image retrieval model based on time-sequence shift and a multi-branch spatio-temporal enhancement network;

13) preprocessing of the training data: centering the training data;

14) training of the stepping footprint image retrieval model: training the stepping footprint image retrieval model with the centered training data;

15) acquisition of the stepping footprint images to be retrieved: acquiring the stepping footprint images to be retrieved and preprocessing them;

16) retrieval of the stepping footprint images: inputting the preprocessed stepping footprint images to be retrieved into the trained stepping footprint image retrieval model, completing the retrieval of the stepping footprint images.

2. The stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network according to claim 1, characterized in that the construction of the stepping footprint image retrieval model comprises the following steps:

21) setting the first layer of the stepping footprint image retrieval model as a left/right-foot distinguishing module;

22) setting the second layer of the stepping footprint image retrieval model as a multi-branch spatial feature extraction module;

23) setting the third layer of the stepping footprint image retrieval model as a time-sequence shift module;

24) setting the fourth layer of the stepping footprint image retrieval model as a multi-branch temporal feature extraction module.

3. The stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network according to claim 1, characterized in that the acquisition of training data comprises the following steps:

31) acquiring stepping footprint pressure images;

32) denoising a stepping footprint pressure image and segmenting it into single-frame image samples, each containing one and only one footprint, to obtain the sample set d = {d_x | x = 1, 2, 3, …, X} of one pass, where d_x denotes the x-th frame of sample data, 1 ≤ x ≤ X, and X, the number of frames one stepping footprint pressure image is segmented into, is taken as 12-15;

33) repeating the above operations to obtain one collector's sample set f = {f_y | y = 1, 2, 3, …, Y}, where f_y denotes the sample data of the y-th pass and Y, the total number of passes collected for one collector, is taken as 6 or 9;

34) repeating the collection for all collectors to obtain the sample set D = {D_k | k = 1, 2, 3, …, K}, where K is the number of collectors;

35) defining label information, which is used to distinguish the IDs of the different footprint samples in D.

4. The stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network according to claim 1, characterized in that the preprocessing of the training data comprises the following steps:

41) removing all points whose blue-channel pixel value is 255 from the segmented plantar pressure images;

42) scanning a segmented plantar pressure image column by column, marking each column whose mean pixel value is greater than 5 as a valid column of the footprint image, and storing the position values of these columns;

43) taking the first valid column position as the starting value of the foot length in the cropped stepping plantar pressure image and the last valid column position as its ending value, and padding pixel values outward on the left and right sides so that the aspect ratio becomes 1:1; the numbers of pixels padded on the left and right sides are N_L and N_R respectively:

N_L = [250 - (L_2 - L_1)] / 2,

N_R = 250 - (L_2 - L_1) - N_L,

where L_1 is the starting value and L_2 the ending value of the foot length in the cropped plantar pressure image, N_L is the number of pixels padded on the left and N_R the number padded on the right; this completes the normalization of the training data;

44) scanning the normalized stepping plantar pressure image row by row, marking each row whose mean pixel value is greater than 5 as a valid row of the footprint image, and storing the position values of these rows;

45) taking the first valid row position as the starting value of the foot width in the normalized stepping plantar pressure image and the last valid row position as its ending value, and padding pixel values outward on the upper and lower sides so that the aspect ratio remains 1:1; the numbers of pixels padded above and below are N_U and N_D respectively:

N_U = [250 - (L_4 - L_3)] / 2,

N_D = 250 - (L_4 - L_3) - N_U,

where L_3 is the starting value and L_4 the ending value of the foot width in the normalized plantar pressure image, N_U is the number of pixels padded above and N_D the number padded below; this completes the centering of the training data.

5. The stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network according to claim 1, characterized in that the training of the stepping footprint image retrieval model comprises the following steps:

51) inputting the normalized and centered training data into the first layer of the stepping footprint image retrieval model; the input feature dimensions of the model are (B, T, C, H, W), where B is the batch size, T is the number of frames fed per batch, C = 3 denotes the red, green and blue (RGB) channels, and H and W are the spatial resolution; assuming B and T take 16 and 8 respectively, the network input feature dimension is (16, 8, 3, 224, 224);

52) the training data passes through the left/right-foot distinguishing module of the stepping footprint image retrieval model to generate three branches: branch 1 contains both left-foot and right-foot data, branch 2 contains only right-foot data, and branch 3 contains only left-foot data; the feature dimensions of branch 1, branch 2 and branch 3 are (16, 8, 3, 224, 224), (16, 4, 3, 224, 224) and (16, 4, 3, 224, 224) respectively;

53) training the multi-branch spatial feature extraction module on the data of the three branches to obtain refined spatial feature information, each branch consisting of a convolutional neural network with n convolutional layers, each layer comprising, in order, a convolution layer, an activation layer and a pooling layer;

54) training the time-sequence shift module: the output of branch 1 undergoes no time-sequence shifting, while the outputs Z'_2 and Z'_3 of the multi-branch spatial feature extraction module on branches 2 and 3 are used to train the time-sequence shift module;

55) training the multi-branch temporal feature extraction module, which consists of a long short-term memory (LSTM) network and a fully connected layer;

56) the multi-branch temporal feature extraction module outputs the corresponding characterization features H_n; the H_n corresponding to each training image is used to compute the loss function, which is back-propagated through the network to update the parameters of each layer, finally yielding the optimal trained stepping footprint retrieval model.

6. The stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network according to claim 5, characterized in that the training of the multi-branch spatial feature extraction module on the three-branch data comprises the following steps:

61) establishing the spatial feature extraction branch corresponding to branch 1 by the cyclic feature extraction method to obtain the corresponding output Z'_1; the cyclic feature extraction method comprises the following steps:

611) taking each output frame of branch 1 in turn, via a for loop, as the input of the first layer, which consists of a convolution layer with stride 4, zero padding 2 and filter size 11 × 11;

the activation function is Leaky ReLU, whose expression is:

LeakyReLU(x) = x for x ≥ 0, LeakyReLU(x) = αx for x < 0,

where α is a small positive slope coefficient;

max pooling is selected for the pooling layer, with stride 2 and filter size 3 × 3;

the feature dimension of the resulting output Z_11 is computed to be (16, 64, 27, 27), using the formulas:

((M - K + 2P) / S) + 1 for a convolution layer,

((M - K) / S) + 1 for a pooling layer,

where M is the spatial resolution of the input picture (H or W), K is the filter size, P is the amount of zero padding, and S is the stride;

612) taking Z_11 as the input of the second convolution layer, which has stride 1, zero padding 2 and filter size 5 × 5, with Leaky ReLU and max pooling selected for the activation and pooling layers; the resulting output Z_12 has feature dimension (16, 192, 13, 13);

613) taking Z_12 as the input of the third layer, which consists of a convolution layer with stride 3, zero padding 1 and filter size 3 × 3, with Leaky ReLU as the activation function; the resulting output Z_13 has feature dimension (16, 384, 13, 13);

614) taking Z_13 as the input of the fully connected layer, finally obtaining an output Z_1 with feature dimension (16, 2048);

615) adding a time dimension to the output Z_1 of each iteration of the for loop, changing its feature dimension to (16, 1, 2048), and concatenating the 8 tensors along the time dimension as shown below, obtaining an output Z'_1 with feature dimension (16, 8, 2048):

Z'_1 = concat((Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7, Z_8), dim = 1);

62) constructing the spatial feature extraction branches of branch 2 and branch 3 by the same cyclic feature extraction method to obtain the corresponding outputs Z'_2 and Z'_3, both with feature dimension (16, 4, 192, 13, 13).

7. The stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network according to claim 5, characterized in that the training of the time-sequence shift module comprises the following steps:

71) establishing the time-sequence shift module branch corresponding to branch 2 by the time-sequence shift method to obtain the output Z''_2; the time-sequence shift method comprises the following steps:

711) the time-sequence shift module (TSM) is given a parameter div at runtime, with

fold = C // div,

where C is the number of channels, div is the given parameter, and fold indicates how many channels undergo the time-shift operation; C, div and fold are 192, 8 and 24 respectively;

721) an all-zero tensor out with the same dimensions as Z'_2 is created, with feature dimensions (B, T, C, H, W);

722) on the first fold channels, the features of the last T - 1 frames of Z'_2 are assigned to the all-zero tensor out while the other features stay in place, completing the left shift of the features:

out[:, :-1, :fold, :, :] = X[:, 1:, :fold, :, :];

723) on the channels [fold, 2·fold), the features of the first T - 1 frames of Z'_2 are assigned to the all-zero tensor out while the other features stay in place, completing the right shift of the features:

out[:, 1:, fold:2*fold, :, :] = X[:, :-1, fold:2*fold, :, :];

724) on the remaining channels, the features of all T frames of Z'_2 are assigned to the all-zero tensor out, completing all shift operations; the final tensor out is the output of the input X after time-sequence shifting:

out[:, :, 2*fold:, :, :] = X[:, :, 2*fold:, :, :];

725) steps 721) to 724) are repeated to build one convolutional neural network layer, and each frame is fed in turn through the loop, finally obtaining the output Z''_2 with feature dimension (16, 4, 2048);

726) steps 721) to 725) are repeated to establish the time-sequence shift module corresponding to branch 3, obtaining the corresponding output Z''_3 with feature dimension (16, 4, 2048).

8. The stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network according to claim 5, characterized in that the training of the multi-branch temporal feature extraction module comprises the following steps:

81) initializing the LSTM network, with input feature dimension 2048, hidden-layer and memory-cell feature dimension 512, and 1 recurrent layer;

82) feeding the outputs Z'_1, Z''_2 and Z''_3 of the three branches into the LSTM network as input to obtain three outputs, Output, H_n and C_n, which are the outputs of the last output gate, the hidden layer and the memory cell respectively; Output[-1, :, :] is processed and fed into the fully connected layer to obtain the output Out;

83) adding the two outputs obtained from the three branches correspondingly to give the final results as the network output:

Out = Out_1 + Out_2 + Out_3,

H_n = H_n1 + H_n2 + H_n3,

where the network output H_n contains the class labels and features of the training images, and Out is used as a parameter to update the loss function.

Technical Field

The invention relates to the technical field of footprint image processing, and in particular to a stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network.

Background

In recent years, with the rapid development of computer technology, artificial intelligence has made new breakthroughs: pedestrian re-identification can realize cross-camera tracking, and object detection can be applied to autonomous driving. Footprint identification is a novel biometric technology capable of identifying individuals, with great development potential in identity recognition and case investigation. Compared with traditional video surveillance, fingerprint recognition and face recognition, the footprint is unique and hard to disguise.

Although existing footprint examination technology has made certain progress, problems remain. Traditional human footprint examination has always relied on footprint experts extracting features from their long-term experience; different experts extract features in different ways, and there is no uniform standard.

At present there are two deep-learning methods related to stepping footprint image retrieval: a stepping footprint identification method combined with gait features, and a footprint image retrieval method based on spatio-temporal motion and feature fusion. The former can identify footprint images, but it uses too few features, does not maximize the model's identification performance, and has low practicality as a classification algorithm; the latter is a retrieval algorithm, but it can only retrieve a single shoe type, and its retrieval performance degrades once cross-modal retrieval is involved.

Disclosure of Invention

The invention aims to overcome the defects in the prior art that effective spatio-temporal features among stepping footprint images are few and that the discriminative features of different categories are difficult to aggregate, and provides a stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network to solve these problems.

In order to achieve this purpose, the technical scheme of the invention is as follows:

A stepping footprint image retrieval method based on time-sequence shift and a multi-branch spatio-temporal enhancement network comprises the following steps:

11) acquisition of training data: acquiring stepping footprint pressure images as training data;

12) construction of the stepping footprint image retrieval model: establishing a stepping footprint image retrieval model based on time-sequence shift and a multi-branch spatio-temporal enhancement network;

13) preprocessing of the training data: centering the training data;

14) training of the stepping footprint image retrieval model: training the stepping footprint image retrieval model with the centered training data;

15) acquisition of the stepping footprint images to be retrieved: acquiring the stepping footprint images to be retrieved and preprocessing them;

16) retrieval of the stepping footprint images: inputting the preprocessed stepping footprint images to be retrieved into the trained stepping footprint image retrieval model, completing the retrieval of the stepping footprint images.

The construction of the stepping footprint image retrieval model comprises the following steps:

21) setting the first layer of the stepping footprint image retrieval model as a left/right-foot distinguishing module;

22) setting the second layer of the stepping footprint image retrieval model as a multi-branch spatial feature extraction module;

23) setting the third layer of the stepping footprint image retrieval model as a time-sequence shift module;

24) setting the fourth layer of the stepping footprint image retrieval model as a multi-branch temporal feature extraction module.

The acquisition of the training data comprises the following steps:

31) acquiring stepping footprint pressure images;

32) denoising a stepping footprint pressure image and segmenting it into single-frame image samples, each containing one and only one footprint, to obtain the sample set d = {d_x | x = 1, 2, 3, …, X} of one pass, where d_x denotes the x-th frame of sample data, 1 ≤ x ≤ X, and X, the number of frames one stepping footprint pressure image is segmented into, is taken as 12-15;

33) repeating the above operations to obtain one collector's sample set f = {f_y | y = 1, 2, 3, …, Y}, where f_y denotes the sample data of the y-th pass and Y, the total number of passes collected for one collector, is taken as 6 or 9;

34) repeating the collection for all collectors to obtain the sample set D = {D_k | k = 1, 2, 3, …, K}, where K is the number of collectors;

35) defining label information, which is used to distinguish the IDs of the different footprint samples in D.
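The nested sample-set structure d, f, D described above can be sketched with plain Python containers. This is an illustrative sketch only: the frame count, pass count, collector count and the sample naming scheme are assumptions, not values fixed by the invention.

```python
# One pass is segmented into X single-frame samples d = {d_x},
# one footprint per frame; values below are illustrative.
X_FRAMES = 12            # a pass yields 12-15 frames
Y_PASSES = 6             # 6 or 9 passes per collector
K_PEOPLE = 3             # illustrative number of collectors

def make_pass(person: int, trip: int):
    """Sample set d = {d_x | x = 1..X} for one pass of one collector."""
    return [f"person{person}_trip{trip}_frame{x}" for x in range(1, X_FRAMES + 1)]

def make_collector(person: int):
    """Sample set f = {f_y | y = 1..Y} for one collector."""
    return [make_pass(person, y) for y in range(1, Y_PASSES + 1)]

# Sample set D = {D_k | k = 1..K} over all collectors; the person index k
# serves as the label (ID) that distinguishes footprint samples in D.
D = {k: make_collector(k) for k in range(1, K_PEOPLE + 1)}
```

The label information of step 35) then corresponds simply to the key k under which each pass is stored.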

The preprocessing of the training data comprises the following steps:

41) removing all points whose blue-channel pixel value is 255 from the segmented plantar pressure images;

42) scanning a segmented plantar pressure image column by column, marking each column whose mean pixel value is greater than 5 as a valid column of the footprint image, and storing the position values of these columns;

43) taking the first valid column position as the starting value of the foot length in the cropped stepping plantar pressure image and the last valid column position as its ending value, and padding pixel values outward on the left and right sides so that the aspect ratio becomes 1:1; the numbers of pixels padded on the left and right sides are N_L and N_R respectively:

N_L = [250 - (L_2 - L_1)] / 2,

N_R = 250 - (L_2 - L_1) - N_L,

where L_1 is the starting value and L_2 the ending value of the foot length in the cropped plantar pressure image, N_L is the number of pixels padded on the left and N_R the number padded on the right; this completes the normalization of the training data;

44) scanning the normalized stepping plantar pressure image row by row, marking each row whose mean pixel value is greater than 5 as a valid row of the footprint image, and storing the position values of these rows;

45) taking the first valid row position as the starting value of the foot width in the normalized stepping plantar pressure image and the last valid row position as its ending value, and padding pixel values outward on the upper and lower sides so that the aspect ratio remains 1:1; the numbers of pixels padded above and below are N_U and N_D respectively:

N_U = [250 - (L_4 - L_3)] / 2,

N_D = 250 - (L_4 - L_3) - N_U,

where L_3 is the starting value and L_4 the ending value of the foot width in the normalized plantar pressure image, N_U is the number of pixels padded above and N_D the number padded below; this completes the centering of the training data.
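The N_L/N_R and N_U/N_D computations above can be sketched in a few lines. This is a minimal sketch assuming a 250-pixel target side (as in the formulas) and integer division, so the two sides always sum to the exact remaining padding; the example coordinates are illustrative.

```python
def pad_amounts(start: int, end: int, target: int = 250):
    """Split the target - (end - start) padding pixels across two sides:
    the first side gets [target - (end - start)] // 2 and the second side
    gets the remainder, so first + extent + second == target."""
    extent = end - start
    first = (target - extent) // 2          # N_L (or N_U)
    second = target - extent - first        # N_R (or N_D)
    return first, second

# foot length occupies columns 40..200 -> L_1 = 40, L_2 = 200
n_left, n_right = pad_amounts(40, 200)      # (45, 45)
# foot width occupies rows 60..181 -> L_3 = 60, L_4 = 181
n_up, n_down = pad_amounts(60, 181)         # (64, 65)
```

Giving the remainder to the second side is what keeps the padded image exactly 250 pixels even when the footprint extent is odd.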

The training of the stepping footprint image retrieval model comprises the following steps:

51) inputting the normalized and centered training data into the first layer of the stepping footprint image retrieval model; the input feature dimensions of the model are (B, T, C, H, W), where B is the batch size, T is the number of frames fed per batch, C = 3 denotes the red, green and blue (RGB) channels, and H and W are the spatial resolution; assuming B and T take 16 and 8 respectively, the network input feature dimension is (16, 8, 3, 224, 224);

52) the training data passes through the left/right-foot distinguishing module of the stepping footprint image retrieval model to generate three branches: branch 1 contains both left-foot and right-foot data, branch 2 contains only right-foot data, and branch 3 contains only left-foot data; the feature dimensions of branch 1, branch 2 and branch 3 are (16, 8, 3, 224, 224), (16, 4, 3, 224, 224) and (16, 4, 3, 224, 224) respectively;

53) training the multi-branch spatial feature extraction module on the data of the three branches to obtain refined spatial feature information, each branch consisting of a convolutional neural network with n convolutional layers, each layer comprising, in order, a convolution layer, an activation layer and a pooling layer;

54) training the time-sequence shift module: the output of branch 1 undergoes no time-sequence shifting, while the outputs Z'_2 and Z'_3 of the multi-branch spatial feature extraction module on branches 2 and 3 are used to train the time-sequence shift module;

55) training the multi-branch temporal feature extraction module, which consists of a long short-term memory (LSTM) network and a fully connected layer;

56) the multi-branch temporal feature extraction module outputs the corresponding characterization features H_n; the H_n corresponding to each training image is used to compute the loss function, which is back-propagated through the network to update the parameters of each layer, finally yielding the optimal trained stepping footprint retrieval model.

The training of the multi-branch spatial feature extraction module on the three-branch data comprises the following steps:

61) establishing the spatial feature extraction branch corresponding to branch 1 by the cyclic feature extraction method to obtain the corresponding output Z'_1; the cyclic feature extraction method comprises the following steps:

611) taking each output frame of branch 1 in turn, via a for loop, as the input of the first layer, which consists of a convolution layer with stride 4, zero padding 2 and filter size 11 × 11;

the activation function is Leaky ReLU, whose expression is:

LeakyReLU(x) = x for x ≥ 0, LeakyReLU(x) = αx for x < 0,

where α is a small positive slope coefficient;

max pooling is selected for the pooling layer, with stride 2 and filter size 3 × 3;

the feature dimension of the resulting output Z_11 is computed to be (16, 64, 27, 27), using the formulas:

((M - K + 2P) / S) + 1 for a convolution layer,

((M - K) / S) + 1 for a pooling layer,

where M is the spatial resolution of the input picture (H or W), K is the filter size, P is the amount of zero padding, and S is the stride;

612) taking Z_11 as the input of the second convolution layer, which has stride 1, zero padding 2 and filter size 5 × 5, with Leaky ReLU and max pooling selected for the activation and pooling layers; the resulting output Z_12 has feature dimension (16, 192, 13, 13);

613) taking Z_12 as the input of the third layer, which consists of a convolution layer with stride 3, zero padding 1 and filter size 3 × 3, with Leaky ReLU as the activation function; the resulting output Z_13 has feature dimension (16, 384, 13, 13);

614) taking Z_13 as the input of the fully connected layer, finally obtaining an output Z_1 with feature dimension (16, 2048);

615) adding a time dimension to the output Z_1 of each iteration of the for loop, changing its feature dimension to (16, 1, 2048), and concatenating the 8 tensors along the time dimension as shown below, obtaining an output Z'_1 with feature dimension (16, 8, 2048):

Z'_1 = concat((Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7, Z_8), dim = 1);

62) constructing the spatial feature extraction branches of branch 2 and branch 3 by the same cyclic feature extraction method to obtain the corresponding outputs Z'_2 and Z'_3, both with feature dimension (16, 4, 192, 13, 13).
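As a check on the dimension formulas above, the spatial sizes of the first two layers can be reproduced with standard convolution arithmetic. A minimal sketch, assuming floor division as usual for convolution output sizes (the layer hyperparameters are those stated in steps 611) and 612)):

```python
def conv_out(m: int, k: int, p: int, s: int) -> int:
    """Spatial size after a convolution: floor((M - K + 2P) / S) + 1."""
    return (m - k + 2 * p) // s + 1

def pool_out(m: int, k: int, s: int) -> int:
    """Spatial size after a pooling layer: floor((M - K) / S) + 1."""
    return (m - k) // s + 1

# First layer: 224 -> conv(11x11, stride 4, pad 2) -> max pool(3x3, stride 2)
h1 = pool_out(conv_out(224, 11, 2, 4), 3, 2)   # 27, matching (16, 64, 27, 27)
# Second layer: 27 -> conv(5x5, stride 1, pad 2) -> max pool(3x3, stride 2)
h2 = pool_out(conv_out(27, 5, 2, 1), 3, 2)     # 13, matching (16, 192, 13, 13)
print(h1, h2)
```

The same two functions apply to any of the convolution or pooling layers in the module, with M taken as either H or W.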

The training of the time-sequence shift module comprises the following steps:

71) establishing the time-sequence shift module branch corresponding to branch 2 by the time-sequence shift method to obtain the output Z''_2; the time-sequence shift method comprises the following steps:

711) the time-sequence shift module (TSM) is given a parameter div at runtime, with

fold = C // div,

where C is the number of channels, div is the given parameter, and fold indicates how many channels undergo the time-shift operation; C, div and fold are 192, 8 and 24 respectively;

721) an all-zero tensor out with the same dimensions as Z'_2 is created, with feature dimensions (B, T, C, H, W);

722) on the first fold channels, the features of the last T - 1 frames of Z'_2 are assigned to the all-zero tensor out while the other features stay in place, completing the left shift of the features:

out[:, :-1, :fold, :, :] = X[:, 1:, :fold, :, :];

723) on the channels [fold, 2·fold), the features of the first T - 1 frames of Z'_2 are assigned to the all-zero tensor out while the other features stay in place, completing the right shift of the features:

out[:, 1:, fold:2*fold, :, :] = X[:, :-1, fold:2*fold, :, :];

724) on the remaining channels, the features of all T frames of Z'_2 are assigned to the all-zero tensor out, completing all shift operations; the final tensor out is the output of the input X after time-sequence shifting:

out[:, :, 2*fold:, :, :] = X[:, :, 2*fold:, :, :];

725) steps 721) to 724) are repeated to build one convolutional neural network layer, and each frame is fed in turn through the loop, finally obtaining the output Z''_2 with feature dimension (16, 4, 2048);

726) steps 721) to 725) are repeated to establish the time-sequence shift module corresponding to branch 3, obtaining the corresponding output Z''_3 with feature dimension (16, 4, 2048).
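The shift assignments of steps 721)-724) can be sketched end to end. A minimal pure-Python version over a (B, T, C) tensor of scalars (H and W are omitted for brevity, and the toy dimensions are illustrative, not the (16, 4, 192, …) shapes used in the invention):

```python
def temporal_shift(x, div):
    """TSM-style shift: the first `fold` channels are shifted left in time,
    the next `fold` channels are shifted right, and the remaining channels
    are copied unchanged; vacated frames stay zero."""
    B, T, C = len(x), len(x[0]), len(x[0][0])
    fold = C // div
    # all-zero tensor `out` with the same (B, T, C) dimensions as x
    out = [[[0.0] * C for _ in range(T)] for _ in range(B)]
    for b in range(B):
        for t in range(T):
            for c in range(C):
                if c < fold:           # left shift: frame t takes frame t+1
                    if t < T - 1:
                        out[b][t][c] = x[b][t + 1][c]
                elif c < 2 * fold:     # right shift: frame t takes frame t-1
                    if t > 0:
                        out[b][t][c] = x[b][t - 1][c]
                else:                  # remaining channels copied as-is
                    out[b][t][c] = x[b][t][c]
    return out

# toy tensor: B=1, T=3, C=4, div=4 -> fold=1
x = [[[float(10 * t + c) for c in range(4)] for t in range(3)]]
y = temporal_shift(x, 4)
```

Because the shift only reindexes existing values, it mixes adjacent frames on part of the channels without any extra multiplications, which is the property step (5) of the advantages relies on.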

The training of the multi-branch temporal feature extraction module comprises the following steps:

81) initializing the LSTM network, with input feature dimension 2048, hidden-layer and memory-cell feature dimension 512, and 1 recurrent layer;

82) feeding the outputs Z'_1, Z''_2 and Z''_3 of the three branches into the LSTM network as input to obtain three outputs, Output, H_n and C_n, which are the outputs of the last output gate, the hidden layer and the memory cell respectively; Output[-1, :, :] is processed and fed into the fully connected layer to obtain the output Out;

83) adding the two outputs obtained from the three branches correspondingly to give the final results as the network output:

Out = Out_1 + Out_2 + Out_3,

H_n = H_n1 + H_n2 + H_n3,

where the network output H_n contains the class labels and features of the training images, and Out is used as a parameter to update the loss function.
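The branch fusion of step 83) is an element-wise sum over equally shaped outputs. A minimal sketch over plain lists (the two-element vectors are illustrative stand-ins for the real Out and H_n tensors):

```python
def fuse(a, b, c):
    """Element-wise sum of the three branch outputs (applied to both
    the Out vectors and the H_n feature vectors)."""
    return [x + y + z for x, y, z in zip(a, b, c)]

# illustrative branch outputs Out_1, Out_2, Out_3
out1, out2, out3 = [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]
Out = fuse(out1, out2, out3)   # Out = Out_1 + Out_2 + Out_3
```

The same `fuse` call applied to H_n1, H_n2 and H_n3 yields the characterization feature H_n used for retrieval.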

Advantageous effects

Compared with the prior art, the track-forming image retrieval method based on the time sequence shift and the multi-branch space-time enhancement network improves the retrieval speed and accuracy of the track-forming images. Its advantages mainly include the following points:

(1) the invention combines the traditional image processing method, the video depth understanding method and the track-forming image retrieval to form a complete and efficient track-forming image retrieval framework. In terms of pretreatment: resetting the format optimized into the lap footprint sample, thereby making the data set more adaptive to the network; in terms of network structure: the stepping footprint image retrieval model is composed of a left foot judging module, a right foot judging module, a multi-branch space feature extraction module, a time sequence shifting module and a multi-branch time sequence feature extraction module.

(2) The preprocessing module comprises denoising, cutting, regularization and centralization, which helps the network extract more effective space-time features, so that a better network model is trained.

(3) The multi-branch spatial feature extraction module has a shallow network layer number and uses a small convolution kernel, and is beneficial to better extracting more distinctive pressure information and texture contour information in the step-by-step footprint image.

(4) The left and right foot distinguishing module divides the stepping footprints into three branches: both left and right feet together, only the right foot, and only the left foot, which helps the network better aggregate the differing characteristics of different categories.

(5) The time sequence shifting module shifts adjacent image frames on partial channels, and realizes modeling on a time domain on the premise of not increasing calculated amount.

(6) The multi-branch time sequence feature extraction module uses the LSTM, and is matched with the left and right foot distinguishing module and the time sequence shifting module, so that the network can extract more effective space-time features, and more accurate and efficient retrieval is realized.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of a timing shift module according to the present invention;

FIG. 3 is a diagram of a time shift and multi-branch spatio-temporal enhancement network framework according to the present invention.

Detailed Description

In order that the above features of the present invention may be clearly understood, the invention, briefly summarized above, is described in more detail below with reference to specific embodiments, some of which are illustrated in the appended drawings, wherein:

as shown in FIG. 1, the invention relates to a time sequence shift and multi-branch spatio-temporal enhancement network-based stepping footprint image retrieval method, which comprises the following steps:

firstly, acquiring training data: a stepping footprint pressure image is acquired as training data. In the experiments, the data set was obtained through self-collection and processing. The collector walks on a pressure plate in a normal posture, and software processes the data transmitted by the pressure plate to obtain stepping footprint pressure images, which are stored on a computer. Further data for each collector are obtained by repeating the collection while changing the type of shoes (6 types in total) and adjusting the load state (3 types in total). The full data set has 360 classes with about 19400 stepping footprint images, which after processing yield about 233280 single-frame images, each with a category label for the person ID. The training set has about 129600 single-frame images over 200 classes; the test set has about 103680 single-frame images over the other 160 classes, of which the test base library set has about 69120 single-frame images and the test query set about 34560 single-frame images. The method comprises the following specific steps:

(1) acquiring a track-forming footprint pressure image;

(2) denoising the stepping footprint pressure image and segmenting it into single-frame image sample data, each frame containing one and only one footprint, to obtain the sample set d = {dx | x = 1, 2, 3, …, X} of the stepping footprint pressure image, where dx represents the x-th frame of sample data, 1 ≤ x ≤ X, and each stepping footprint pressure image is segmented into 12-15 frames;

(3) repeating the above operations to obtain the collector's sample set f = {fy | y = 1, 2, 3, …, Y}, where fy represents the sample data of the y-th pass and Y represents the total number of passes collected for the collector, taken as 6 or 9;

(4) repeating the collection for all collectors to obtain the sample set D = {fk | k = 1, 2, 3, …, K}, where K is the total number of persons and all the collectors are included;

(5) label information is defined, which is used to distinguish the IDs of the different footprint samples in D.

Secondly, constructing a footprint image retrieval model: and establishing a step-by-step footprint image retrieval model based on the time sequence shift and the multi-branch space-time enhancement network.

As shown in FIG. 3, the whole stepping footprint image retrieval model is composed of four modules arranged in order. The left and right foot distinguishing module divides the stepping footprints into three branches (both feet together, only the right foot, and only the left foot), which helps the network better extract the differing characteristics of different categories; the multi-branch spatial feature extraction module has few network layers and uses small convolution kernels, so the more distinctive pressure information and texture contour information in the stepping footprint image can be better extracted; the time sequence shift module shifts adjacent image frames on part of the channels, realizing modeling in the time domain without increasing the amount of calculation; the multi-branch time sequence feature extraction module uses the LSTM and, in cooperation with the left and right foot distinguishing module and the time sequence shift module, allows the network to extract more effective space-time features, achieving more accurate and efficient retrieval.

The method comprises the following specific steps:

(1) setting a first layer of the footprint image retrieval model as a left foot and right foot distinguishing module;

(2) setting a second layer of the footprint image retrieval model as a multi-branch spatial feature extraction module;

(3) setting the third layer of the footprint image retrieval model as a time sequence shifting module;

(4) the fourth layer of the footprint image retrieval model is set as a multi-branch timing sequence feature extraction module.

Thirdly, preprocessing the training data: and carrying out centralized processing on the training data. The process is actually a process of aligning footprint features, and is more beneficial to extracting effective features by a convolutional neural network. The preprocessing of the training data comprises the following steps:

(1) removing all the points with the pixel value of 255 at the blue channel of the segmented sole pressure image;

(2) scanning the segmented plantar pressure image column by column, setting each column whose pixel average value is larger than 5 as an effective column of the footprint image, searching column by column, and storing the column position values;

(3) taking the first effective column position as the start value of the foot length of the segmented stepping plantar pressure image and the last effective column position as the end value of the foot length, and extending and filling pixel values to the left and right sides at equal widths according to the start and end of the foot length, so that the aspect ratio becomes 1:1; the numbers of pixels filled on the left and right sides are NL and NR respectively, with the expressions:

NL = [250 − (L2 − L1)]/2,

NR = 250 − (L2 − L1) − NL,

where L1 is the start value of the foot length of the segmented plantar pressure image, L2 is the end value of the foot length, NL is the number of pixels filled on the left and NR is the number of pixels filled on the right; this completes the regularization operation on the training data;

(4) scanning the regularized stepping plantar pressure image row by row, setting each row whose pixel average value is larger than 5 as an effective row of the footprint image, searching row by row, and storing the row position values;

(5) taking the first effective row position as the start value of the foot width of the regularized stepping plantar pressure image and the last effective row position as the end value of the foot width, and extending and filling pixel values to the upper and lower sides at equal widths according to the start and end of the foot width, so that the aspect ratio becomes 1:1; the numbers of pixels filled on the upper and lower sides are NU and ND respectively, with the expressions:

NU = [250 − (L4 − L3)]/2,

ND = 250 − (L4 − L3) − NU,

where L3 is the start value of the foot width of the regularized plantar pressure image, L4 is the end value of the foot width, NU is the number of pixels filled above and ND is the number of pixels filled below; this completes the centralization operation on the training data.

Fourthly, training a footprint image retrieval model in a turn: and training the step-by-step footprint image retrieval model by using the training data after the centralization processing.

Cross-shoe retrieval places higher demands on the network model than single-shoe retrieval. Crossing shoes increases the variety of the training data, which increases the intra-class difference of the features obtained through the network and reduces the inter-class difference, greatly increasing the difficulty of cross-shoe retrieval. The training data are unified and regularized, footprint images of poor quality are removed, and the footprint images are resized, which helps the multi-branch spatial feature extraction module extract more distinctive spatial features and reduces the feature loss caused by pooling operations. The flexible combination of the left and right foot distinguishing module and the time sequence shift module refines the extracted features, and the multi-branch time sequence feature extraction module is finally connected to complete the space-time modeling, so that the space-time features of the training data are better extracted and the accuracy of cross-shoe retrieval is greatly improved.

The training of the lap footprint image retrieval model comprises the following steps:

(1) inputting the regularized and centralized training data into the first layer of the stepping footprint image retrieval model; the input characteristic dimensions of the model are (B, T, C, H, W), where B is the batch size, T is the number of frames sent per batch, C = 3 represents the red, green and blue (RGB) channels, and H and W are the spatial resolution; assuming B and T take 16 and 8 respectively, the network input characteristic dimension is (16,8,3,224,224).

(2) The training data passes through the left and right foot distinguishing module of the stepping footprint image retrieval model to generate three branches: branch 1 contains both left and right foot data, branch 2 contains only right foot data, and branch 3 contains only left foot data; the characteristic dimensions of branch 1, branch 2 and branch 3 are (16,8,3,224,224), (16,4,3,224,224) and (16,4,3,224,224) respectively. This block-wise processing of the input makes it easier to aggregate the differing characteristics of different categories, greatly improving the retrieval accuracy.
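The three-branch split can be sketched as below. How the left and right foot distinguishing module decides which frames are left-foot is not reproduced here, so the sketch assumes a hypothetical boolean mask `left_mask` marking the left-foot frames is already available:

```python
import numpy as np

def split_branches(x, left_mask):
    """Split a (B, T, C, H, W) footprint sequence into the three branches:
    branch 1 keeps all frames, branch 2 only right-foot frames,
    branch 3 only left-foot frames."""
    b1 = x                     # branch 1: left and right feet together
    b2 = x[:, ~left_mask]      # branch 2: right-foot frames only
    b3 = x[:, left_mask]       # branch 3: left-foot frames only
    return b1, b2, b3
```

With T = 8 frames and four left-foot steps, the branch shapes reproduce the stated (16,8,…), (16,4,…), (16,4,…) dimensions.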

(3) Training the multi-branch spatial feature extraction module on the data of the three branches to obtain more detailed spatial feature information; each branch is a convolutional neural network with n convolutional layers, each comprising in turn a convolutional layer, an activation layer, a pooling layer and the like.

The multi-branch spatial feature extraction module for the data training of the three branches comprises the following steps:

firstly, establishing the spatial feature extraction module branch corresponding to branch 1 by a cyclic feature extraction method to obtain the corresponding output Z′1; the cyclic feature extraction method comprises the following steps:

A1) each frame of the output of branch 1 is taken in turn, via a for loop, as the input of the first layer, which consists of a convolutional layer with stride 4, zero padding 2 and filter size 11 × 11; the resulting output Z11 has characteristic dimension (16,64,27,27).

The activation function is Leaky ReLU, with expression f(x) = x for x > 0 and f(x) = αx for x ≤ 0, where α is a small positive slope coefficient.

the maximum pooling is selected as the pooling layer, the step length is 2, and the size of the filter is 3 x 3;

the output characteristic dimension of Z11, (16,64,27,27), is obtained by the following calculation formulas:

((M-K+2P)/S)+1,

((M-K)/S)+1,

where M is the spatial resolution of the input picture, H or W, K is the filter size, P is the number of zero padding, and S is the stride.
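These two formulas (the first for convolution, the second for pooling) can be checked directly against the stated dimension chain of branch 1; a small sketch using floor division, as is standard in convolution arithmetic:

```python
def conv_out(M, K, S, P=0):
    """Convolution output size: ((M - K + 2P) / S) + 1, floored."""
    return (M - K + 2 * P) // S + 1

def pool_out(M, K, S):
    """Pooling output size: ((M - K) / S) + 1, floored."""
    return (M - K) // S + 1

# first layer: 11x11 conv, stride 4, padding 2, then 3x3 max pool, stride 2
s = conv_out(224, 11, 4, 2)   # 55
s = pool_out(s, 3, 2)         # 27 -> matches Z11: (16, 64, 27, 27)
# second layer: 5x5 conv, stride 1, padding 2, then 3x3 max pool, stride 2
s = conv_out(s, 5, 1, 2)      # 27
s = pool_out(s, 3, 2)         # 13 -> matches Z12: (16, 192, 13, 13)
```

The same arithmetic confirms that the third 3 × 3 convolution with padding 1 keeps the 13 × 13 resolution only when its stride is 1.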

The Leaky ReLU activation function is selected to accelerate convergence and to avoid the problem of neurons no longer learning once a ReLU enters the negative interval. Max pooling reduces the network parameters while shrinking the feature map; compared with average pooling, it reduces the shift of the estimated mean caused by convolutional layer parameter errors and better preserves the texture information of the footprint image.

A2) Z11 is taken as the input of the second convolutional layer, which has stride 1, zero padding 2 and filter size 5 × 5; Leaky ReLU and max pooling are again selected for the activation and pooling layers; the resulting output Z12 has characteristic dimension (16,192,13,13). Normalization is performed before entering the activation layer, avoiding gradient dispersion and accelerating network training.

A3) Z12 is taken as the input of the third convolutional layer, which consists of a convolution with stride 1, zero padding 1 and filter size 3 × 3, with Leaky ReLU as the activation function; the resulting output Z13 has characteristic dimension (16,384,13,13);

A4) Z13 is taken as the input of the fully connected layer, finally giving the output Z1 with characteristic dimension (16,2048);

A5) the output Z1 of each iteration of the for loop gains a time dimension, its characteristic dimension becoming (16,1,2048); the 8 tensors are spliced along the time dimension as shown in the following formula, and the resulting output Z′1 has characteristic dimension (16,8,2048);

Z′1 = concat((Z1, Z2, Z3, Z4, Z5, Z6, Z7, Z8), dim = 1).
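Steps A1) to A5) can be sketched as a PyTorch module. This is a non-authoritative sketch: the third convolution uses stride 1 (required for the stated 13 × 13 output), the BatchNorm placement follows the normalization remark in A2), and the class name and default output width are illustrative:

```python
import torch
import torch.nn as nn

class SpatialBranch(nn.Module):
    """Sketch of one branch of the multi-branch spatial feature extraction module."""
    def __init__(self, out_dim=2048):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),             # -> (64, 27, 27)
            nn.Conv2d(64, 192, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(192),                               # normalize before activation
            nn.LeakyReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),             # -> (192, 13, 13)
            nn.Conv2d(192, 384, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(inplace=True),                        # -> (384, 13, 13)
        )
        self.fc = nn.Linear(384 * 13 * 13, out_dim)

    def forward(self, x):                  # x: (B, T, 3, 224, 224)
        B, T = x.shape[:2]
        outs = []
        for t in range(T):                 # per-frame for loop, as in A1)
            z = self.features(x[:, t])
            z = self.fc(z.flatten(1))      # (B, out_dim), as in A4)
            outs.append(z.unsqueeze(1))    # add a time dimension, as in A5)
        return torch.cat(outs, dim=1)      # concat on dim=1 -> (B, T, out_dim)
```

With out_dim = 2048 and T = 8 this reproduces the stated Z′1 shape (16,8,2048).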

finally, the spatial feature extraction module branches of branch 2 and branch 3 are constructed by the same cyclic feature extraction method, giving the corresponding outputs Z′2 and Z′3, both with characteristic dimension (16,4,192,13,13).

(4) Training the time sequence shift module: no time sequence shift processing is performed on the output of branch 1; the time sequence shift module is trained on the outputs Z′2 and Z′3 of the multi-branch spatial feature extraction module for branches 2 and 3.

A time sequence shift module TSM (Temporal Shift Module) is constructed in which part of the channels shift along the time dimension, facilitating information exchange between adjacent images and completing the time-domain modeling. The implementation concept of the TSM is as follows:

as shown in fig. 2, consider first a general convolution operation. The input X is an infinite-length one-dimensional vector, the kernel size is 3, and the weights of the convolution are W = (w1, w2, w3). The convolution Y = Conv(W, X) can be written as Y_i = w1·X_(i−1) + w2·X_i + w3·X_(i+1).

The convolution can be decoupled into two steps: shift and multiply-accumulate. The input X is shifted by −1, 0 and +1 and then multiplied by w1, w2 and w3 respectively, the sum being Y. Formally, the shift operation is as follows:

X^(−1)_i = X_(i−1), X^(0)_i = X_i, X^(+1)_i = X_(i+1);

the multiply-accumulate operation is as follows:

Y = w1·X^(−1) + w2·X^(0) + w3·X^(+1).

The first step, the shift, can be performed at no cost because it only requires offsetting an address pointer. Although the second step is more computationally expensive, the time sequence shift module TSM merges the multiply-accumulate into the following 2D convolution. The TSM can therefore achieve a 3D effect with a 2D amount of computation, without increasing the amount of calculation; at the same time, completing the time sequence modeling helps the later multi-branch time sequence feature extraction module better extract effective time sequence features.
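The decoupling can be verified numerically; a small NumPy sketch comparing a direct kernel-3 1D convolution with the shift-then-multiply-accumulate form at the interior points:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(10)        # the 1D input signal
w1, w2, w3 = 0.2, 0.5, 0.3         # convolution weights W = (w1, w2, w3)

# direct convolution: Y_i = w1*X_{i-1} + w2*X_i + w3*X_{i+1}
# (np.convolve flips the kernel, hence the reversed weight order)
Y_direct = np.convolve(X, [w3, w2, w1], mode="valid")

# step 1: shift (free -- just offset index views of X)
X_m1 = X[:-2]    # X^(-1): element i holds X_{i-1}
X_0  = X[1:-1]   # X^(0):  element i holds X_i
X_p1 = X[2:]     # X^(+1): element i holds X_{i+1}

# step 2: multiply-accumulate
Y_shift = w1 * X_m1 + w2 * X_0 + w3 * X_p1
```

The two results agree element-wise, which is exactly the decomposition the TSM exploits.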

The training timing shift module comprises the following steps:

(1) establishing the time sequence shift module branch corresponding to branch 2 by a time sequence shift method to obtain the output Z″2; the time sequence shift method comprises the following steps:

B1) the time sequence shift module TSM is given a parameter div during operation, which determines fold via:

fold = C // div,

where C is the number of channels, div is the given parameter and fold determines how many channels are time-shifted; C, div and fold are 192, 8 and 24 respectively;

an all-zero tensor out is created with the same dimensions as Z′2, with characteristic dimensions (B, T, C, H, W);

B2) on the first fold channels, the features of the last T−1 frames of Z′2 are assigned to the all-zero tensor out while the other features stay in place, completing the left shift of the features, with the formula:

out[:,:-1,:fold,:,:]=X[:,1:,:fold,:,:];

B3) on the channels [fold, 2·fold], the features of the first T−1 frames of Z′2 are assigned to the all-zero tensor out while the other features stay in place, completing the right shift of the features, with the formula:

out[:,1:,fold:2*fold,:,:]=X[:,:-1,fold:2*fold,:,:];

B4) on the remaining channels, the features of all T frames of Z′2 are assigned to the all-zero tensor out, completing all shift operations; the resulting tensor out is the output after time-shifting the input X, with the formula:

out[:,:,2*fold:,:,:]=X[:,:,2*fold:,:,:],

B5) repeating the steps B1) to B4), a convolutional neural network layer is constructed and each frame is input in turn through cyclic processing, finally giving the output Z″2 with characteristic dimension (16,4,2048);

(2) repeating the steps B1) to B5) to establish the time sequence shift module corresponding to branch 3, giving the corresponding output Z″3 with characteristic dimension (16,4,2048).
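The slicing in steps B2) to B4) can be sketched as a stand-alone function (a NumPy sketch of the shift itself; the per-frame convolution and fully connected processing of step B5) are omitted):

```python
import numpy as np

def temporal_shift(x, div=8):
    """TSM-style temporal shift on a (B, T, C, H, W) tensor."""
    B, T, C, H, W = x.shape
    fold = C // div
    out = np.zeros_like(x)                                 # all-zero tensor, same shape
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # B2) left shift
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # B3) right shift
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # B4) remaining channels fixed
    return out
```

Note that the vacated frame positions (the last frame of the left-shifted channels and the first frame of the right-shifted channels) stay zero, as implied by initializing out to all zeros.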

(5) Training the multi-branch time sequence feature extraction module: training the long short-term memory network LSTM and the fully connected layer.

The training multi-branch timing sequence feature extraction module comprises the following steps:

C1) initializing an LSTM network, wherein the input characteristic dimension is 2048, the characteristic dimensions of a hidden layer and a memory layer are 512, and the number of RNN layers is 1;

C2) the outputs Z′1, Z″2 and Z″3 of the three branches are fed into the LSTM network as input, giving three outputs: Output, Hn and Cn, which are the outputs of the last output gate, the hidden layer and the memory layer respectively; Output[-1,:] is processed and sent to a fully connected layer to obtain the output Out;

C3) and correspondingly adding two outputs obtained by the three branches to obtain a final result as the output of the network, wherein the expression is as follows:

Out=Out1+Out2+Out3,

Hn=Hn1+Hn2+Hn3

wherein the network output Hn contains the class label and features of the training image, and Out serves as a parameter to update the loss function.
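Steps C1) to C3) can be sketched in PyTorch. Assumptions in this sketch: batch_first ordering is used, so Output[-1,:] becomes output[:, -1, :]; a single LSTM and fully connected layer are shared by the three branches (the patent does not state whether the branch networks share weights); and the number of classes is illustrative:

```python
import torch
import torch.nn as nn

class MultiBranchTemporalHead(nn.Module):
    """Sketch of the multi-branch time sequence feature extraction module."""
    def __init__(self, in_dim=2048, hidden=512, num_classes=200):
        super().__init__()
        self.lstm = nn.LSTM(input_size=in_dim, hidden_size=hidden,
                            num_layers=1, batch_first=True)   # C1) initialization
        self.fc = nn.Linear(hidden, num_classes)

    def branch(self, z):                      # z: (B, T, 2048)
        output, (hn, cn) = self.lstm(z)       # output gate / hidden / memory outputs
        out = self.fc(output[:, -1, :])       # last time step -> fully connected
        return out, hn[-1]                    # (B, num_classes), (B, hidden)

    def forward(self, z1, z2, z3):
        out1, h1 = self.branch(z1)
        out2, h2 = self.branch(z2)
        out3, h3 = self.branch(z3)
        # C3) the branch outputs are fused by element-wise addition
        return out1 + out2 + out3, h1 + h2 + h3
```

The fused Out drives the loss function, while the fused Hn serves as the characterization feature stored for retrieval.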

(6) The multi-branch time sequence feature extraction module outputs the characterization feature Hn corresponding to each training image; at the same time the loss function is calculated and back-propagated through the network, updating the parameters of each layer, finally giving the optimal trained stepping footprint retrieval model.

Fifthly, acquiring the stepping footprint images to be retrieved: acquiring the stepping footprint images to be retrieved and preprocessing them.

Sixthly, retrieving the stepping footprint images: inputting the preprocessed stepping footprint images to be retrieved into the trained stepping footprint image retrieval model to complete the retrieval of the stepping footprint images.

In the laboratory testing step, the footprint images in the test base library set Gallery are first sent to the network, and the corresponding characterization features and class labels are obtained and stored in a database. Testing then proceeds: the footprint images in the test Query set are sent to the network to obtain the corresponding characterization features, which are compared by Euclidean distance with all the features in the database. The smaller the Euclidean distance, the higher the similarity between the two, and the image in the database with the smallest Euclidean distance is the retrieval result. If the labels of the two are consistent, the retrieval succeeds; otherwise it fails.
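The Euclidean-distance comparison can be sketched as follows; the feature dimensionality and label values are illustrative:

```python
import numpy as np

def retrieve(query_feat, gallery_feats, gallery_labels):
    """Return the gallery label (and distance) closest to the query feature."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)  # Euclidean distances
    idx = int(np.argmin(dists))                                 # smallest distance wins
    return gallery_labels[idx], dists[idx]
```

The returned label is compared with the query's ground-truth label to decide whether the retrieval succeeded.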

Rank1 and MAP values are commonly used to evaluate model performance on retrieval problems. Therefore all images in the query set are used in turn as the image to be retrieved, and the retrieval results are averaged to obtain the Rank1 and MAP values on the test set.
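Rank1 and MAP can be computed from the distance-ranked gallery; a sketch assuming each query has at least one correct gallery match and that average precision is taken over the positions of all correct matches in the ranking:

```python
import numpy as np

def rank1_and_map(query_feats, query_labels, gallery_feats, gallery_labels):
    """Rank-1 accuracy and mean average precision over a query set (sketch)."""
    rank1_hits, aps = [], []
    for qf, ql in zip(query_feats, query_labels):
        d = np.linalg.norm(gallery_feats - qf, axis=1)
        order = np.argsort(d)                         # gallery sorted by distance
        matches = (gallery_labels[order] == ql)       # correct-match flags, ranked
        rank1_hits.append(matches[0])                 # top-1 correct?
        hits = np.flatnonzero(matches)                # ranks of the correct matches
        precisions = (np.arange(len(hits)) + 1) / (hits + 1)
        aps.append(precisions.mean())                 # average precision of this query
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```

Averaging over all queries gives the Rank1 and MAP values reported on the test set.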

TABLE 1 comparison of experimental results of two methods of barefoot data set

Table 1 shows the comparative experimental results of the two algorithms when the training set is the barefoot data set. The comparison method, denoted Base, is a footprint image retrieval method based on space-time motion and feature fusion; the method disclosed in this patent is denoted Shift_8. Train and Test represent the training set and test set respectively. Barefoot, Cloth, Leather and Sports denote the barefoot, cotton shoe, leather shoe and sports shoe data sets respectively. The beneficial effects of this patent shown in Table 1 are:

In single-shoe retrieval, MAP improves by 3.91% and Rank1 improves by 2.16%.

In cross-shoe retrieval, when the test set is the cotton shoe data set, MAP improves by 17.98% and Rank1 by 16.13%; when the test set is the leather shoe data set, MAP improves by 14.86% and Rank1 by 13.80%; when the test set is the sports shoe data set, MAP improves by 14.67% and Rank1 by 16.35%.

The foregoing shows and describes the basic principles, essential features and advantages of the invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.
