Lightweight method for extracting video key frames

Document No. 1878586 · Published 2021-11-23

Reading note: This technique, "A lightweight method for extracting video key frames", was designed and created by Zhang Ye, Chen Weihui, Wang Bo and Yan Fangpeng on 2021-07-05. Abstract: A lightweight method for extracting video key frames comprises the following steps: first, preprocessing such as image enhancement and denoising is applied to each RGB video frame obtained after framing the video; then a lightweight feature extraction model, the LRDN network, is designed to extract the depth features of each frame; an importance prediction value for each frame is then obtained in a recursive manner; and finally a decision threshold is set to judge whether the current frame is a key frame or a redundant frame. The invention uses a lightweight convolutional neural network, the LRDN model, to extract the depth features of the preprocessed video frames; compared with traditional convolutional neural networks, this greatly reduces the number of parameters and the computation of the model and saves computing resources. The importance of each RGB video frame is continuously predicted in an efficient, low-cost recursive manner so as to discard redundant frames and retain useful frames.

1. A lightweight method for extracting video key frames, comprising the following steps:

step one, video preprocessing;

performing framing on the input video, and applying preprocessing such as image enhancement and denoising to each RGB video frame, which then serves as the input of the LRDN model in step two;

step two, feature extraction;

based on the lightweight idea, an LRDN model is provided for extracting the spatial features of a video frame sequence; the network is characterized in that: (1) in order to extract information of different levels in an image, a feature fusion module is adopted in the network, and the features obtained by convolution with low-level and high-level convolution kernels are fused, so that the network can extract more of the latent information in the image and the resulting image classification is more accurate; (2) in order to reduce the number of parameters and the computation of the model and to remove redundant information, a transition module consisting of a feature compression unit and a down-sampling unit is adopted in the network, which improves the operating efficiency of the model; (3) in order to avoid problems such as gradient vanishing or model degradation during training, residual connections and dense connections are adopted in the network; these skip connections strengthen feature reuse, improve the flow of features through the network, and improve the trainability of the network;

the LRDN model takes the preprocessed RGB video frame as input and finishes the feature extraction of the input video frame through a shallow feature extraction module and a deep feature extraction module;

first, shallow feature extraction;

image input in network using FinShowing that shallow features in the image are extracted through 1 layer of standard convolution and 1 layer of improved depth separable convolution; firstly, shrinking input features through a layer of standard convolution, wherein the size of a convolution kernel used by a standard convolution layer is 3 multiplied by 3, the step length is 2, and the number of output feature maps is 4 k; the super parameter k is the network growth rate, the number of characteristic channels of each layer is increased according to the value of k after series operation in the dense connection network, and the k is also the reason of the growth rate; feature F extracted by the layer convolution operation-1Further extracting shallow features for the next convolutional layer; f-1Can be obtained by the formula (1):

F-1=HB(Fin) (1)

wherein HB(. h) represents the standard convolution operation corresponding to the first layer convolution layer;

obtaining a first shallow feature F-1Then, the feature is convolved with 1 layer improved depth separable to obtain a shallow feature F0(ii) a Compared with the standard convolution, the depth separable convolution can greatly reduce the parameter and the calculation amount of the model and reduce the memory required by calculation, but the operation can cause the model to lose a little precision, and therefore, the traditional depth separable convolution is improved to extract the second-layer shallow feature F0(ii) a Among the improvements to conventional deep separable convolution operations are: firstly, adding a layer of 1 × 1 convolution expansion layer before deep convolution, so as to increase the number of channels and obtain more features; because the calculation characteristic of the deep convolution determines that the deep convolution has no capacity of changing the number of channels, the deep convolution can only output how many channels if the number of channels is given to the deep convolution by the previous layer; therefore, if the number of channels given by the previous layer is small, the deep convolution can only be performed in the low-dimensional spaceThe feature is extracted indirectly, so that the feature extraction effect is poor; before each depth convolution, a layer of point-by-point convolution is equipped and is specially used for dimension increasing, and no matter the number of input channels is more or less, after the dimension increasing of the first point-by-point convolution, the depth convolution carries out feature extraction in a relatively higher dimension; secondly, a ReLU activation function is not adopted at last, and direct linear output is adopted, so that the aim of preventing the ReLU from damaging the characteristics is fulfilled; the ReLU function can effectively increase nonlinear features in a high-dimensional space, and can destroy the features in a low-dimensional space, namely the ReLU can cause larger information loss to a tensor with low channel number; the ReLU sets a negative value to zero, and when the number of channels is low, the probability is relatively high, so that tensor values of a certain dimensionality are all 0, which is equivalent to the reduction of the dimensionality of the tensor, and the process cannot be recovered; the reduction of tensor dimension means the reduction of the feature description capacity, so that the effect is not as good as linear transformation; since the main effect of the second point-by-point convolution is dimension reduction, it is known that the ReLU is not suitable to be used after dimension reduction;

the expansion coefficient for the 1 × 1 convolution "expansion" layer is 6, the convolution kernel size for deep convolution is 3 × 3, the step size is 1, the number of output feature maps is 64, and then F is0As an input of the deep feature extraction and fusion stage, the method can be obtained by the following formula (2):

F0=HS(F-1) (2)

wherein HS(. h) represents a depth separable convolution operation corresponding to the second layer of convolution layers;

second, deep feature extraction;

for the LRDN model, the deep feature extraction module is the core part of the network model and consists of LRDBs, each LRDB comprising a dense block, a transition block and a residual connection;

the dense block in the LRDN is formed by densely connecting 24 convolution layers, and the nonlinear composite function of each layer is the combination BN + ReLU + 1 × 1 Conv + BN + ReLU + 3 × 3 Conv; the 1 × 1 Conv is used to fix the number of output channels and reduce the number of output feature maps to 4k, achieving dimensionality reduction and improving computational efficiency; when more than ten 1 × 1 Conv + 3 × 3 Conv composite convolutions are chained in the network, the number of concatenated feature channels grows into the thousands, and without the added 1 × 1 Conv for dimensionality reduction the number of parameters required by the subsequent 3 × 3 Conv would increase sharply; in addition, because the LRDB contains a residual connection, a pre-activation design is adopted in the dense block, i.e., the BN-ReLU-Conv order with the activation function before the convolution layer;

let the output of the first layer of the first LRDB in the LRDN be x_1; the feature processing is then expressed as:

x_1 = W_{3×3}^1 ∗ σ(W_{1×1}^1 ∗ σ([F_0]))    (3)

where W_{1×1}^1 is the convolution kernel of the first layer's 1 × 1 Conv; W_{3×3}^1 is the convolution kernel of the first layer's 3 × 3 Conv; σ denotes the nonlinear activation function ReLU; [·] denotes the feature fusion function; the shallow feature F_0 is the input of the first LRDB;

the output x_2 of the second layer is:

x_2 = W_{3×3}^2 ∗ σ(W_{1×1}^2 ∗ σ([F_0, x_1]))    (4)

where W_{1×1}^2 is the convolution kernel of the second layer's 1 × 1 Conv and W_{3×3}^2 is the convolution kernel of the second layer's 3 × 3 Conv;

by analogy, the output of the last (24th) layer is:

x_24 = W_{3×3}^{24} ∗ σ(W_{1×1}^{24} ∗ σ([F_0, x_1, …, x_23]))    (5)

because the features extracted by each convolution layer in the LRDB belong to different levels, if the features extracted from low-level images are fully utilized in the recognition of video frame images, the information contained in the high-level representation is more comprehensive and the recognition result more accurate; therefore, the features obtained after each composite convolution layer in the LRDB are fused by a concatenation operation, namely:

F_{D_1} = [x_1, …, x_24, F_0]    (6)

where F_{D_1} denotes the output of the first dense block and [x_1, …, x_24, F_0] denotes feature fusion of x_1, …, x_24, F_0 by concatenation; because the number of feature maps output by each composite convolution layer is k and the number of feature maps of the shallow feature F_0 is 64, the number of feature maps of F_{D_1} after fusion is (64 + 24k);

because the (64 + 24k) feature maps contain a certain amount of redundant information, in order to eliminate that redundancy and obtain finer fused features, and in order to reduce the computational complexity of network training, the (64 + 24k) feature maps are converted into 4k feature maps by a transition block whose structure consists of a convolution with 4k kernels of size 1 × 1 and an average pooling of size 2 × 2; the main effect of the transition block is to down-sample the input from the dense block and reduce the feature map size;

let the composite function H_T(·) denote the nonlinear transformation composed of BN, ReLU, 1 × 1 Conv and 2 × 2 average pooling; then:

F_{T_1} = H_T(F_{D_1})    (7)

where F_{T_1} denotes the feature obtained after processing by the transition block connected to the first LRDB;

in order to better learn the mapping relationship between the low-level and high-level image representations and further improve network performance, the LRDB adds a residual connection between the shallow feature F_0 and the deep feature F_{T_1}; the specific processing is:

F_1 = F_{T_1} + H_P(F_0)    (8)

where F_1 denotes the output of the first LRDB and H_P(·) denotes the average pooling function;

to resolve the mismatch in feature map size at the element-wise summation of equation (8), a further down-sampling operation is applied in the skip connection, so H_P(·) is an average pooling function rather than the identity mapping used in traditional residual networks; moreover, H_P(·) does not cause loss of the information describing the original state of the image, so the input of the next LRDB is a down-sampled version that still contains the original data;

the output F_i of the i-th LRDB in the LRDN, i ∈ [2, n−1], is obtained by the same processing; since the LRDN performs global average pooling on the extracted deep feature F_n for the behavior/action classification in the subsequent part of the network model, the last LRDB in the LRDN is adjusted: after the output F_{D_n} of the n-th dense block is obtained, the preceding features are compressed and organized with a 1 × 1 Conv, and a residual connection is added between the feature F_{n−1} and the compressed features; the specific processing is:

F_n = W_{1×1} ∗ F_{D_n} + b + F_{n−1}    (9)

where F_n denotes the fused deep feature of the n-th LRDB, W_{1×1} denotes a 1 × 1 convolution kernel, and b denotes a bias;

after the deep feature F_n is obtained, global average pooling and a fully connected layer are adopted to aggregate the output feature maps, reducing the number of parameters and the amount of computation;

step three, video key frame extraction;

first, using the depth feature τ(x_j) of each video frame image x_j obtained in step one, importance prediction is performed on the corresponding frame in a recursive manner:

A_j = f_imp[ε(X, j−1), τ(x_j)]    (10)

where A_j denotes the importance prediction value of the j-th video frame, and the larger the value, the more important the frame; f_imp(·) denotes the prediction function; ε(X, j−1) denotes the aggregate feature from frame 0 to frame (j−1), i.e., the summation of the features of the previous (j−1) video frames;

then, in order to better predict the importance of each video frame, a loss function shown in the formula (12) is introduced into the key frame extraction module to perform difference comparison between the current frame and the next frame, so that redundant frames are more efficiently eliminated, useful frames are reserved, and the robustness of the model is improved;

L(X, y) = L_CE(X, y) + μ·L_ν(V),  μ ≥ 0    (12)

where L_CE(X, y) denotes the standard cross-entropy loss between the predicted values and the true values; μ is a balance parameter that balances the number of selected video frames and further optimizes the key frame selection; L_ν(V) denotes an entropy-based regularization term over the importance values A_q (A_q ≥ 0), which effectively prevents overfitting;

finally, whether the current frame is a key frame or a redundant frame is judged by setting a decision threshold.

Technical Field

The invention relates to a lightweight method for extracting video key frames.

Background Art

With the rapid development of computers, large amounts of new data such as pictures, audio and video have appeared. Video, which is to some extent a combination of pictures and audio, is a main carrier of information interaction. Visually, a video is essentially a dense stack of pictures, but not every frame is equally important for understanding the video content, so there is large information redundancy between frames. Therefore, when a large volume of video needs to be processed, presenting the multi-frame information of a video in the form of key frame images can remove the influence of redundant data, reduce the computational complexity of the model, improve the accuracy and real-time performance of video content recognition, and greatly shorten the time a user needs to obtain effective information from the video.

Disclosure of Invention

In order to solve the problems, the invention provides a lightweight method for extracting video key frames.

The method first applies preprocessing such as image enhancement and denoising to each RGB video frame obtained after framing the video, then designs a lightweight feature extraction model, the LRDN network, to extract the depth features of each frame, then obtains an importance prediction value for each frame in a recursive manner, and finally judges whether the current frame is a key frame or a redundant frame by setting a decision threshold.

In order to achieve the purpose, the invention adopts the following technical scheme:

a lightweight method for extracting video key frames comprises the following steps:

Step one, video preprocessing;

Framing is performed on the input video, and preprocessing such as image enhancement and denoising is applied to each RGB video frame, which then serves as the input of the LRDN model in step two.
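A minimal sketch of step one using OpenCV is given below. The description only requires "image enhancement, denoising and the like", so the CLAHE contrast enhancement and non-local-means denoising used here are illustrative choices, not the patent's prescribed operators.

```python
# Illustrative sketch of step one: frame the video, then enhance and denoise
# each RGB frame. The specific enhancement/denoising operators are assumptions.
import cv2
import numpy as np

def preprocess_frame(frame_bgr: np.ndarray) -> np.ndarray:
    # Contrast enhancement: CLAHE on the luminance channel (illustrative choice).
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])
    enhanced = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    # Denoising: non-local means on the colour image (illustrative choice).
    denoised = cv2.fastNlMeansDenoisingColored(enhanced, None, 10, 10, 7, 21)
    # Convert BGR -> RGB so the result matches the "RGB video frame" wording.
    return cv2.cvtColor(denoised, cv2.COLOR_BGR2RGB)

def frame_video(path: str):
    """Yield one preprocessed RGB frame per video frame."""
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield preprocess_frame(frame)
    cap.release()
```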

Step two, feature extraction;

The invention provides an LRDN model for extracting the spatial features of a video frame sequence based on the lightweight idea. The main characteristics of the network are: (1) in order to extract information of different levels in the image, a feature fusion module is adopted in the network, and the features obtained by convolution with low-level and high-level convolution kernels are fused, so that the network can extract more of the latent information in the image and the resulting image classification is more accurate; (2) in order to reduce the number of parameters and the computation of the model and to remove redundant information, a transition module consisting of a feature compression unit and a down-sampling unit is adopted in the network, which improves the operating efficiency of the model; (3) in order to avoid problems such as gradient vanishing or model degradation during training, residual connections and dense connections are adopted in the network; these skip connections strengthen feature reuse, improve the flow of features through the network, and improve the trainability of the network.

The LRDN model takes the preprocessed RGB video frame as input and completes the feature extraction of the input video frame mainly through a shallow feature extraction module and a deep feature extraction module.

First, shallow feature extraction;

image input in network using FinIt is shown that shallow features in the image are extracted by 1-layer standard convolution and 1-layer improved depth separable convolution. The input features are first shrunk through one layer of standard convolution, where the convolution kernel size of the standard convolution layer is 3 x 3, the step size is 2, and the number of output feature maps is 4 k. The hyperparameter k is the network growth rate, and the number of characteristic channels of each layer is increased according to the value of k after series operation in the dense connection network, which is also the reason why k is called the growth rate. Feature F extracted by the layer convolution operation-1Will be used for the next convolutional layer to further extract the shallow features. F-1Can be obtained by the formula (1):

F-1=HB(Fin) (1)

wherein HB(. cndot.) represents the standard convolution operation for the first layer convolution layer.

After the first shallow feature F_{-1} is obtained, it is convolved by 1 layer of improved depth-separable convolution to obtain the shallow feature F_0. Compared with standard convolution, depth-separable convolution greatly reduces the parameters and computation of the model and the memory required for calculation, but at the cost of a small loss of precision; therefore, the invention improves the conventional depth-separable convolution for extracting the second shallow feature F_0. The improvements of the invention over the conventional depth-separable convolution operation are as follows. First, a 1 × 1 convolution "expansion" layer is added before the depthwise convolution, in order to increase the number of channels and obtain more features. The computational nature of depthwise convolution means it cannot change the number of channels by itself; it can only output as many channels as the previous layer gives it. Therefore, if the previous layer provides few channels, the depthwise convolution can only extract features in a low-dimensional space, and the feature extraction effect is poor. For this reason, each depthwise convolution is preceded by a point-wise convolution dedicated to raising the dimension, so that no matter how few input channels there are, the depthwise convolution extracts features in a relatively higher-dimensional space after the first point-wise convolution raises the dimension. Second, a ReLU activation function is not used at the end; a direct linear output is used instead, in order to prevent the ReLU from destroying features. The ReLU function effectively adds nonlinear features in a high-dimensional space but destroys features in a low-dimensional space, i.e., the ReLU causes large information loss for tensors with a low channel count. The ReLU sets negative values to zero, and when the number of channels is low there is a relatively high probability that the tensor values of some dimension are all 0, which is equivalent to reducing the dimensionality of the tensor, and this process cannot be recovered. A reduction of tensor dimensionality means a decline in feature description capacity, so the effect is worse than a linear transformation. Since the main effect of the second point-wise convolution is dimensionality reduction, the ReLU is not suitable for use after it.

The expansion coefficient used by the invention for the 1 × 1 convolution "expansion" layer is 6, the convolution kernel size of the depthwise convolution is 3 × 3, the stride is 1, and the number of output feature maps is 64. F_0 then serves as the input of the deep feature extraction and fusion stage and is obtained by equation (2):

F_0 = H_S(F_{-1})    (2)

where H_S(·) denotes the depth-separable convolution operation of the second convolution layer.
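The following PyTorch sketch renders the shallow stage H_B and H_S as described: a 3 × 3 standard convolution with stride 2 and 4k output maps, followed by the improved depth-separable convolution (1 × 1 expansion with factor 6, 3 × 3 depthwise convolution with stride 1, and a linear 1 × 1 projection to 64 maps with no ReLU after it). The exact BN/activation placement inside each block and the value of k are not fixed by the text, so those details are assumptions.

```python
import torch
import torch.nn as nn

class ImprovedDSConv(nn.Module):
    """Improved depth-separable convolution H_S: 1x1 'expansion' (factor 6),
    3x3 depthwise convolution, then a linear 1x1 projection with no ReLU,
    following the two improvements described above."""
    def __init__(self, in_ch: int, out_ch: int, expansion: int = 6):
        super().__init__()
        mid = in_ch * expansion
        self.expand = nn.Sequential(       # point-wise convolution that raises the dimension
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.depthwise = nn.Sequential(    # 3x3 depthwise convolution, stride 1
            nn.Conv2d(mid, mid, 3, stride=1, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.project = nn.Sequential(      # linear output: deliberately no ReLU here
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        return self.project(self.depthwise(self.expand(x)))

class ShallowExtractor(nn.Module):
    """F_in -> F_-1 -> F_0 (equations (1) and (2))."""
    def __init__(self, k: int = 16):       # k (growth rate) is not fixed in the text; 16 is illustrative
        super().__init__()
        self.h_b = nn.Sequential(           # standard 3x3 convolution, stride 2, 4k output maps
            nn.Conv2d(3, 4 * k, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True))
        self.h_s = ImprovedDSConv(4 * k, 64)  # 64 output feature maps, as stated above

    def forward(self, f_in):
        f_minus1 = self.h_b(f_in)
        return self.h_s(f_minus1)
```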

Second, deep feature extraction;

For the LRDN model, the deep feature extraction module is the core part of the network model and consists of LRDBs, each LRDB comprising a dense block, a transition block, and a residual connection.

The dense block in the LRDN is formed by densely connecting 24 convolution layers, and the nonlinear composite function of each layer is the combination BN + ReLU + 1 × 1 Conv + BN + ReLU + 3 × 3 Conv. The 1 × 1 Conv here is used to fix the number of output channels and reduce the number of output feature maps to 4k, achieving dimensionality reduction and improving computational efficiency. When more than ten 1 × 1 Conv + 3 × 3 Conv composite convolutions are chained in the network, the number of concatenated feature channels grows into the thousands; without the added 1 × 1 Conv for dimensionality reduction, the number of parameters required by the subsequent 3 × 3 Conv would increase sharply. In addition, because the LRDB contains a residual connection, a pre-activation design is adopted in the dense block, i.e., the BN-ReLU-Conv order with the activation function before the convolution layer; a sketch of this composite function is given below.
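A minimal PyTorch rendering of one such composite layer in pre-activation order (BN-ReLU-1 × 1 Conv-BN-ReLU-3 × 3 Conv); the growth rate k is left as a parameter because its value is not fixed in the description.

```python
import torch.nn as nn

def dense_layer(in_ch: int, k: int) -> nn.Sequential:
    """One composite convolution of the dense block, in pre-activation order:
    BN-ReLU-1x1 Conv (bottleneck to 4k maps) then BN-ReLU-3x3 Conv (k output maps)."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 4 * k, kernel_size=1, bias=False),         # fix channel count, reduce dimension
        nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
        nn.Conv2d(4 * k, k, kernel_size=3, padding=1, bias=False))  # k new feature maps
```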

Let the output of the first layer of the first LRDB in the LRDN be x_1; the feature processing is then expressed as:

x_1 = W_{3×3}^1 ∗ σ(W_{1×1}^1 ∗ σ([F_0]))    (3)

where W_{1×1}^1 is the convolution kernel of the first layer's 1 × 1 Conv; W_{3×3}^1 is the convolution kernel of the first layer's 3 × 3 Conv; σ denotes the nonlinear activation function ReLU; [·] denotes the feature fusion function; the shallow feature F_0 is the input of the first LRDB.

The output x_2 of the second layer is:

x_2 = W_{3×3}^2 ∗ σ(W_{1×1}^2 ∗ σ([F_0, x_1]))    (4)

where W_{1×1}^2 is the convolution kernel of the second layer's 1 × 1 Conv and W_{3×3}^2 is the convolution kernel of the second layer's 3 × 3 Conv.

By analogy, the output of the last (24th) layer is:

x_24 = W_{3×3}^{24} ∗ σ(W_{1×1}^{24} ∗ σ([F_0, x_1, …, x_23]))    (5)

Because the features extracted by each convolution layer in the LRDB belong to different levels, if the features extracted from low-level images are fully utilized in the recognition of video frame images, the information contained in the high-level representation is more comprehensive and the recognition result more accurate. Therefore, the features obtained after each composite convolution layer in the LRDB are fused by a concatenation operation, namely:

F_{D_1} = [x_1, …, x_24, F_0]    (6)

where F_{D_1} denotes the output of the first dense block and [x_1, …, x_24, F_0] denotes feature fusion of x_1, …, x_24, F_0 by concatenation. Because the number of feature maps output by each composite convolution layer is k and the number of feature maps of the shallow feature F_0 is 64, the number of feature maps of F_{D_1} after fusion is (64 + 24k).
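A sketch of the 24-layer dense connectivity and the concatenation of equation (6): each layer consumes the concatenation of F_0 and all earlier layer outputs, and the block output stacks all of them, giving 64 + 24k maps. The growth rate of 16 is illustrative, not stated in the text.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """24 densely connected composite convolutions; output is
    [x_1, ..., x_24, F_0] with 64 + 24k feature maps (equation (6))."""
    def __init__(self, in_ch: int = 64, k: int = 16, num_layers: int = 24):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
                nn.Conv2d(ch, 4 * k, 1, bias=False),
                nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
                nn.Conv2d(4 * k, k, 3, padding=1, bias=False)))
            ch += k                                   # concatenation grows channels by k per layer

    def forward(self, f0):
        feats = [f0]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        # Put F_0 last to mirror [x_1, ..., x_24, F_0]; the ordering is cosmetic.
        return torch.cat(feats[1:] + feats[:1], dim=1)
```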

Because the (64 + 24k) feature maps contain a certain amount of redundant information, in order to eliminate that redundancy and obtain finer fused features, and in order to reduce the computational complexity of network training, the (64 + 24k) feature maps are converted into 4k feature maps by a transition block whose structure consists of a convolution with 4k kernels of size 1 × 1 and an average pooling of size 2 × 2. The main effect of the transition block is to down-sample the input from the dense block and reduce the feature map size.

Let the composite function H_T(·) denote the nonlinear transformation composed of BN, ReLU, 1 × 1 Conv and 2 × 2 average pooling; then:

F_{T_1} = H_T(F_{D_1})    (7)

where F_{T_1} denotes the feature obtained after processing by the transition block connected to the first LRDB.
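A short sketch of the transition block H_T of equation (7): BN, ReLU, a 1 × 1 convolution with 4k kernels, and 2 × 2 average pooling, compressing 64 + 24k maps to 4k and halving the spatial size.

```python
import torch.nn as nn

def transition_block(in_ch: int, k: int) -> nn.Sequential:
    """H_T: BN -> ReLU -> 1x1 Conv (4k maps) -> 2x2 average pooling."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 4 * k, kernel_size=1, bias=False),  # compress (64 + 24k) -> 4k maps
        nn.AvgPool2d(kernel_size=2, stride=2))                # downsample by a factor of 2
```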

In order to better learn the mapping relationship between the low-level and high-level image representations and further improve network performance, the LRDB adds a residual connection between the shallow feature F_0 and the deep feature F_{T_1}; the specific processing is:

F_1 = F_{T_1} + H_P(F_0)    (8)

where F_1 denotes the output of the first LRDB and H_P(·) denotes the average pooling function.

To resolve the mismatch in feature map size at the element-wise summation of equation (8), the invention applies a further down-sampling operation in the skip connection, so H_P(·) is an average pooling function rather than the identity mapping used in traditional residual networks. Moreover, H_P(·) does not cause loss of the information describing the original state of the image, so the input of the next LRDB is a down-sampled version that still contains the original data.
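A minimal sketch of the residual connection in equation (8): the transition output and an average-pooled copy of the block input are summed element-wise. For the sum to be well defined the two paths must carry the same number of channels (for the first LRDB that would require 4k = 64, i.e., k = 16, which is an inference rather than something the text states); the dense block and transition block are passed in as sub-modules so the same wrapper can serve every LRDB.

```python
import torch.nn as nn
import torch.nn.functional as F

class LRDB(nn.Module):
    """F_i = H_T(dense(F_{i-1})) + H_P(F_{i-1})  (equation (8)).
    H_P is 2x2 average pooling so the skip matches the downsampled main path;
    the channel counts of the two paths must agree for the element-wise sum."""
    def __init__(self, dense_block: nn.Module, transition: nn.Module):
        super().__init__()
        self.dense_block = dense_block
        self.transition = transition

    def forward(self, x):
        main = self.transition(self.dense_block(x))
        skip = F.avg_pool2d(x, kernel_size=2, stride=2)   # H_P(.)
        return main + skip
```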

The output F_i of the i-th LRDB in the LRDN, i ∈ [2, n−1], is obtained by the same processing. Since the LRDN performs global average pooling on the extracted deep feature F_n for the behavior/action classification in the subsequent part of the network model, the invention adjusts the last LRDB in the LRDN: after the output F_{D_n} of the n-th dense block is obtained, the preceding features are compressed and organized with a 1 × 1 Conv, and a residual connection is added between the feature F_{n−1} and the compressed features; the specific processing is:

F_n = W_{1×1} ∗ F_{D_n} + b + F_{n−1}    (9)

where F_n denotes the fused deep feature of the n-th LRDB, W_{1×1} denotes a 1 × 1 convolution kernel, and b denotes a bias.
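A sketch of the adjusted last LRDB of equation (9): the n-th dense-block output is compressed with a biased 1 × 1 convolution and added to F_{n−1}; no extra downsampling appears in this skip because the dense block preserves the spatial size.

```python
import torch.nn as nn

class FinalLRDB(nn.Module):
    """F_n = W_1x1 * F_{D_n} + b + F_{n-1}  (equation (9))."""
    def __init__(self, dense_block: nn.Module, dense_out_ch: int, out_ch: int):
        super().__init__()
        # out_ch must equal the channel count of F_{n-1} for the residual sum.
        self.dense_block = dense_block
        self.compress = nn.Conv2d(dense_out_ch, out_ch, kernel_size=1, bias=True)

    def forward(self, f_prev):
        return self.compress(self.dense_block(f_prev)) + f_prev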

After the deep feature F_n is obtained, global average pooling and a fully connected layer are adopted to aggregate the output feature maps, reducing the number of parameters and the amount of computation.
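The aggregation after F_n, sketched as a small head: global average pooling followed by a fully connected layer, which keeps the classifier parameter count low.

```python
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Global average pooling + fully connected layer over the deep feature F_n."""
    def __init__(self, in_ch: int, feature_dim: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)       # HxW -> 1x1 per channel
        self.fc = nn.Linear(in_ch, feature_dim)

    def forward(self, f_n):
        return self.fc(self.gap(f_n).flatten(1))
```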

Extracting video key frames;

First, using the depth feature τ(x_j) of each video frame image x_j obtained in step one, importance prediction is performed on the corresponding frame in a recursive manner:

A_j = f_imp[ε(X, j−1), τ(x_j)]    (10)

where A_j denotes the importance prediction value of the j-th video frame, and the larger the value, the more important the frame; f_imp(·) denotes the prediction function; ε(X, j−1) denotes the aggregate feature from frame 0 to frame (j−1), i.e., the summation of the features of the previous (j−1) video frames.
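A minimal sketch of the recursion in equation (10). The description fixes only the interface: ε(X, j−1) is the running sum of the previous frames' features and f_imp combines it with τ(x_j) to give A_j. The linear-plus-sigmoid form of f_imp below is an assumption, not the patent's exact predictor.

```python
import torch
import torch.nn as nn

class RecursiveImportance(nn.Module):
    """A_j = f_imp[eps(X, j-1), tau(x_j)]  (equation (10)).
    eps is maintained as a running sum of past frame features, so each new
    frame costs one feature addition plus one small prediction step."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # Assumed form of f_imp: a linear layer over [eps, tau] with sigmoid output.
        self.f_imp = nn.Sequential(nn.Linear(2 * feat_dim, 1), nn.Sigmoid())

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        """frame_feats: (num_frames, feat_dim) depth features tau(x_j)."""
        eps = torch.zeros_like(frame_feats[0])          # eps(X, 0)
        scores = []
        for tau_j in frame_feats:
            a_j = self.f_imp(torch.cat([eps, tau_j]))   # importance of frame j
            scores.append(a_j)
            eps = eps + tau_j                           # update running sum for frame j+1
        return torch.cat(scores)                        # (num_frames,)
```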

Subsequently, in order to better predict the importance of each video frame, a loss function shown in formula (12) is introduced into the key frame extraction module to perform difference comparison between the current frame and the next frame, so that redundant frames are more efficiently eliminated, useful frames are reserved, and the robustness of the model is improved.

L(X, y) = L_CE(X, y) + μ·L_ν(V),  μ ≥ 0    (12)

where L_CE(X, y) denotes the standard cross-entropy loss between the predicted values and the true values; μ is a balance parameter that balances the number of selected video frames and further optimizes the key frame selection; L_ν(V) denotes an entropy-based regularization term over the importance values A_q (A_q ≥ 0), which effectively prevents overfitting.
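A sketch of the loss in equation (12): standard cross entropy plus μ times an entropy-style regularizer over the importance values. The exact form of L_ν(V) is not spelled out in the text, so the normalized-entropy term below is an assumed instantiation.

```python
import torch
import torch.nn.functional as F

def keyframe_loss(logits, labels, importance, mu: float = 0.1):
    """L(X, y) = L_CE(X, y) + mu * L_v(V), mu >= 0  (equation (12)).
    `importance` holds the per-frame scores A_q (A_q >= 0); the entropy-based
    regularizer below is an assumed instantiation of L_v(V)."""
    l_ce = F.cross_entropy(logits, labels)
    p = importance / (importance.sum() + 1e-8)           # normalize A_q to a distribution
    l_v = -(p * torch.log(p + 1e-8)).sum()                # entropy-based regularization term
    return l_ce + mu * l_v
```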

Finally, whether the current frame is a key frame or a redundant frame is judged by setting a decision threshold.
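The threshold test itself is simple: frames whose importance prediction exceeds the threshold are kept as key frames, the rest are discarded as redundant. The threshold is a tunable setting; 0.5 below is only a placeholder.

```python
def select_keyframes(scores, threshold: float = 0.5):
    """Return the indices of frames whose importance A_j exceeds the
    decision threshold; the remaining frames are treated as redundant."""
    return [j for j, a_j in enumerate(scores) if float(a_j) > threshold]
```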

The invention has the advantages that:

The invention provides a lightweight method for extracting video key frames based on the lightweight feature extraction model (the LRDN network) and a recursive prediction scheme. Its outstanding characteristics are as follows. First, the depth features of the preprocessed video frames are extracted with a lightweight convolutional neural network, the LRDN model; compared with traditional convolutional neural networks, this greatly reduces the number of parameters and the computation of the model and saves computing resources. Second, the importance of each RGB video frame is continuously predicted in an efficient, low-cost recursive manner so as to discard redundant frames and retain useful frames.

Drawings

FIG. 1 is a technical roadmap for the present invention;

FIG. 2 is a flow chart of the detailed operation within each layer convolution in the dense block of the present invention;

Detailed Description

In order to verify the feasibility and superiority of the method provided by the invention, the method is further explained below in combination with an application scenario:

a lightweight method for extracting video key frames comprises the following steps:

Step one, video preprocessing;

Framing is performed on the input video, and preprocessing such as image enhancement and denoising is applied to each RGB video frame, which then serves as the input of the LRDN model in step two.

Step two, feature extraction;

The invention provides an LRDN model for extracting the spatial features of a video frame sequence based on the lightweight idea. The main characteristics of the network are: (1) in order to extract information of different levels in the image, a feature fusion module is adopted in the network, and the features obtained by convolution with low-level and high-level convolution kernels are fused, so that the network can extract more of the latent information in the image and the resulting image classification is more accurate; (2) in order to reduce the number of parameters and the computation of the model and to remove redundant information, a transition module consisting of a feature compression unit and a down-sampling unit is adopted in the network, which improves the operating efficiency of the model; (3) in order to avoid problems such as gradient vanishing or model degradation during training, residual connections and dense connections are adopted in the network; these skip connections strengthen feature reuse, improve the flow of features through the network, and improve the trainability of the network.

The LRDN model takes the preprocessed RGB video frame as input and completes the feature extraction of the input video frame mainly through a shallow feature extraction module and a deep feature extraction module.

First, shallow feature extraction;

The image input to the network is denoted F_in, and shallow features in the image are extracted through 1 layer of standard convolution and 1 layer of improved depth-separable convolution. The input features are first shrunk by a layer of standard convolution, where the convolution kernel size of the standard convolution layer is 3 × 3, the stride is 2, and the number of output feature maps is 4k. The hyperparameter k is the network growth rate: after the concatenation operations in the densely connected network, the number of feature channels of each layer grows by k, which is why k is called the growth rate. The feature F_{-1} extracted by this convolution layer is passed to the next convolution layer for further shallow feature extraction. F_{-1} is obtained by equation (1):

F_{-1} = H_B(F_in)    (1)

where H_B(·) denotes the standard convolution operation of the first convolution layer.

After the first shallow feature F_{-1} is obtained, it is convolved by 1 layer of improved depth-separable convolution to obtain the shallow feature F_0. Compared with standard convolution, depth-separable convolution greatly reduces the parameters and computation of the model and the memory required for calculation, but at the cost of a small loss of precision; therefore, the invention improves the conventional depth-separable convolution for extracting the second shallow feature F_0. The improvements of the invention over the conventional depth-separable convolution operation are as follows. First, a 1 × 1 convolution "expansion" layer is added before the depthwise convolution, in order to increase the number of channels and obtain more features. The computational nature of depthwise convolution means it cannot change the number of channels by itself; it can only output as many channels as the previous layer gives it. Therefore, if the previous layer provides few channels, the depthwise convolution can only extract features in a low-dimensional space, and the feature extraction effect is poor. For this reason, each depthwise convolution is preceded by a point-wise convolution dedicated to raising the dimension, so that no matter how few input channels there are, the depthwise convolution extracts features in a relatively higher-dimensional space after the first point-wise convolution raises the dimension. Second, a ReLU activation function is not used at the end; a direct linear output is used instead, in order to prevent the ReLU from destroying features. The ReLU function effectively adds nonlinear features in a high-dimensional space but destroys features in a low-dimensional space, i.e., the ReLU causes large information loss for tensors with a low channel count. The ReLU sets negative values to zero, and when the number of channels is low there is a relatively high probability that the tensor values of some dimension are all 0, which is equivalent to reducing the dimensionality of the tensor, and this process cannot be recovered. A reduction of tensor dimensionality means a decline in feature description capacity, so the effect is worse than a linear transformation. Since the main effect of the second point-wise convolution is dimensionality reduction, the ReLU is not suitable for use after it.

The expansion coefficient used by the invention for the 1 × 1 convolution "expansion" layer is 6, the convolution kernel size of the depthwise convolution is 3 × 3, the stride is 1, and the number of output feature maps is 64. F_0 then serves as the input of the deep feature extraction and fusion stage and is obtained by equation (2):

F_0 = H_S(F_{-1})    (2)

where H_S(·) denotes the depth-separable convolution operation of the second convolution layer.

Second, deep feature extraction;

For the LRDN model, the deep feature extraction module is the core part of the network model and consists of LRDBs, each LRDB comprising a dense block, a transition block, and a residual connection.

The dense block in the LRDN is formed by densely connecting 24 convolution layers, and the nonlinear composite function of each layer is the combination BN + ReLU + 1 × 1 Conv + BN + ReLU + 3 × 3 Conv. The 1 × 1 Conv here is used to fix the number of output channels and reduce the number of output feature maps to 4k, achieving dimensionality reduction and improving computational efficiency. When more than ten 1 × 1 Conv + 3 × 3 Conv composite convolutions are chained in the network, the number of concatenated feature channels grows into the thousands; without the added 1 × 1 Conv for dimensionality reduction, the number of parameters required by the subsequent 3 × 3 Conv would increase sharply. In addition, because the LRDB contains a residual connection, a pre-activation design is adopted in the dense block, i.e., the BN-ReLU-Conv order with the activation function before the convolution layer.

Let the output of the first layer of the first LRDB in the LRDN be x_1; the feature processing is then expressed as:

x_1 = W_{3×3}^1 ∗ σ(W_{1×1}^1 ∗ σ([F_0]))    (3)

where W_{1×1}^1 is the convolution kernel of the first layer's 1 × 1 Conv; W_{3×3}^1 is the convolution kernel of the first layer's 3 × 3 Conv; σ denotes the nonlinear activation function ReLU; [·] denotes the feature fusion function; the shallow feature F_0 is the input of the first LRDB.

The output x_2 of the second layer is:

x_2 = W_{3×3}^2 ∗ σ(W_{1×1}^2 ∗ σ([F_0, x_1]))    (4)

where W_{1×1}^2 is the convolution kernel of the second layer's 1 × 1 Conv and W_{3×3}^2 is the convolution kernel of the second layer's 3 × 3 Conv.

By analogy, the output of the last (24th) layer is:

x_24 = W_{3×3}^{24} ∗ σ(W_{1×1}^{24} ∗ σ([F_0, x_1, …, x_23]))    (5)

Because the features extracted by each convolution layer in the LRDB belong to different levels, if the features extracted from low-level images are fully utilized in the recognition of video frame images, the information contained in the high-level representation is more comprehensive and the recognition result more accurate. Therefore, the features obtained after each composite convolution layer in the LRDB are fused by a concatenation operation, namely:

F_{D_1} = [x_1, …, x_24, F_0]    (6)

where F_{D_1} denotes the output of the first dense block and [x_1, …, x_24, F_0] denotes feature fusion of x_1, …, x_24, F_0 by concatenation. Because the number of feature maps output by each composite convolution layer is k and the number of feature maps of the shallow feature F_0 is 64, the number of feature maps of F_{D_1} after fusion is (64 + 24k).

Because the (64 + 24k) feature maps contain a certain amount of redundant information, in order to eliminate that redundancy and obtain finer fused features, and in order to reduce the computational complexity of network training, the (64 + 24k) feature maps are converted into 4k feature maps by a transition block whose structure consists of a convolution with 4k kernels of size 1 × 1 and an average pooling of size 2 × 2. The main effect of the transition block is to down-sample the input from the dense block and reduce the feature map size.

Let the composite function H_T(·) denote the nonlinear transformation composed of BN, ReLU, 1 × 1 Conv and 2 × 2 average pooling; then:

F_{T_1} = H_T(F_{D_1})    (7)

where F_{T_1} denotes the feature obtained after processing by the transition block connected to the first LRDB.

In order to better learn the mapping relationship between the low-level and high-level image representations and further improve network performance, the LRDB adds a residual connection between the shallow feature F_0 and the deep feature F_{T_1}; the specific processing is:

F_1 = F_{T_1} + H_P(F_0)    (8)

where F_1 denotes the output of the first LRDB and H_P(·) denotes the average pooling function.

To resolve the mismatch in feature map size at the element-wise summation of equation (8), the invention applies a further down-sampling operation in the skip connection, so H_P(·) is an average pooling function rather than the identity mapping used in traditional residual networks. Moreover, H_P(·) does not cause loss of the information describing the original state of the image, so the input of the next LRDB is a down-sampled version that still contains the original data.

The output F_i of the i-th LRDB in the LRDN, i ∈ [2, n−1], is obtained by the same processing. Since the LRDN performs global average pooling on the extracted deep feature F_n for the behavior/action classification in the subsequent part of the network model, the invention adjusts the last LRDB in the LRDN: after the output F_{D_n} of the n-th dense block is obtained, the preceding features are compressed and organized with a 1 × 1 Conv, and a residual connection is added between the feature F_{n−1} and the compressed features; the specific processing is:

F_n = W_{1×1} ∗ F_{D_n} + b + F_{n−1}    (9)

where F_n denotes the fused deep feature of the n-th LRDB, W_{1×1} denotes a 1 × 1 convolution kernel, and b denotes a bias.

After the deep feature F_n is obtained, global average pooling and a fully connected layer are adopted to aggregate the output feature maps, reducing the number of parameters and the amount of computation.

Extracting video key frames;

First, using the depth feature τ(x_j) of each video frame image x_j obtained in step one, importance prediction is performed on the corresponding frame in a recursive manner:

A_j = f_imp[ε(X, j−1), τ(x_j)]    (10)

where A_j denotes the importance prediction value of the j-th video frame, and the larger the value, the more important the frame; f_imp(·) denotes the prediction function; ε(X, j−1) denotes the aggregate feature from frame 0 to frame (j−1), i.e., the summation of the features of the previous (j−1) video frames.

Subsequently, in order to better predict the importance of each video frame, a loss function shown in formula (12) is introduced into the key frame extraction module to perform difference comparison between the current frame and the next frame, so that redundant frames are more efficiently eliminated, useful frames are reserved, and the robustness of the model is improved.

L(X, y) = L_CE(X, y) + μ·L_ν(V),  μ ≥ 0    (12)

where L_CE(X, y) denotes the standard cross-entropy loss between the predicted values and the true values; μ is a balance parameter that balances the number of selected video frames and further optimizes the key frame selection; L_ν(V) denotes an entropy-based regularization term over the importance values A_q (A_q ≥ 0), which effectively prevents overfitting. Finally, whether the current frame is a key frame or a redundant frame is judged by setting a decision threshold.
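To make the application scenario concrete, the self-contained sketch below wires the three steps together on dummy data. ToyExtractor is only a stand-in for the LRDN feature extractor, and the linear-plus-sigmoid f_imp and the 0.5 threshold are assumptions carried over from the earlier sketches, not values fixed by the description.

```python
import torch
import torch.nn as nn

class ToyExtractor(nn.Module):
    """Illustrative stand-in for the LRDN spatial feature extractor."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))

    def forward(self, x):
        return self.body(x)

def extract_keyframes(frames: torch.Tensor, threshold: float = 0.5):
    """frames: (T, 3, H, W) preprocessed RGB frames (step one).
    Returns the indices of frames judged to be key frames."""
    extractor = ToyExtractor()
    f_imp = nn.Sequential(nn.Linear(2 * 64, 1), nn.Sigmoid())  # assumed predictor form
    with torch.no_grad():
        feats = extractor(frames)                 # step two: per-frame depth features
        eps = torch.zeros(64)
        keyframes = []
        for j, tau_j in enumerate(feats):         # step three: recursive importance
            a_j = f_imp(torch.cat([eps, tau_j]))
            if float(a_j) > threshold:            # decision threshold
                keyframes.append(j)
            eps = eps + tau_j
    return keyframes

# Example: 8 random 64x64 frames in place of a real preprocessed video.
print(extract_keyframes(torch.rand(8, 3, 64, 64)))
```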

The embodiments described in this specification are merely illustrative of the inventive concept, and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments, but also covers equivalents thereof that may occur to those skilled in the art upon consideration of the inventive concept.
