Two-stage temporal action detection method, device, equipment and medium

Document No.: 86789    Publication date: 2021-10-08

Note: This technology, "A two-stage temporal action detection method, device, equipment and medium" (一种双阶段的时序动作检测方法、装置、设备和介质), was designed and created by 王田, 李泽贤, 吕金虎, 刘克新 and 张宝昌 on 2021-07-02. The invention discloses a two-stage temporal action detection method, device, equipment and medium. The method comprises: acquiring video information features; finding potential action start and end moments from the video information features; combining the start and end moments into candidate frames; and calibrating the candidate frame boundaries and classifying their content to obtain the action category. The disclosed two-stage temporal action detection method, device, equipment and medium offer high recognition accuracy, good recognition stability and good robustness.

1. A two-stage temporal action detection method, characterized by comprising the following steps:

S1, acquiring video information features;

S2, extracting candidate boundaries from the video information features, and combining the candidate boundaries to obtain candidate frames;

S3, correcting the boundaries of the candidate frames and determining the actions in the video.

2. The two-stage temporal action detection method of claim 1,

in step S2, extracting the candidate boundaries comprises the following sub-steps:

S21, converting the video information features into score curves;

S23, acquiring potential start times and potential end times in the score curves, and combining them to obtain candidate frames;

in step S21, the video information features are converted into score curves through a generator network, wherein a score curve is a curve of the probability of an action state in the video as a function of video time.

3. The two-stage temporal action detection method of claim 2,

the generator network comprises a hole convolution module; the video information features are input into the hole convolution module, and the module output, together with the video information features, is passed sequentially through a first activation function, a linear layer and a second activation function to obtain the score curves,

the hole convolution module is provided with a hole convolution; after the video information features or intermediate data are processed by the hole convolution, the result is passed through a third activation function and then normalized to form the output of the hole convolution module, and preferably, the first activation function is the same as the third activation function.

4. The two-stage temporal action detection method of claim 2,

between the step S21 and the step S23, a step S22 of improving the stability of the score curve is further provided;

the plurality of score curves obtained in each group are fused into one curve, yielding three fused score curves and thereby improving the stability.

5. The two-stage temporal action detection method of claim 2,

in step S23, candidate boundaries are acquired by:

S231, taking the segment times whose score is greater than a threshold, as well as the segment times whose score is a local maximum, as potential start times and potential end times;

S232, combining the potential start times and the potential end times, wherein the video segment information features at a potential start time and a potential end time, together with the video segment information features between them, constitute the obtained candidate frame.

6. The two-stage temporal action detection method of claim 2,

after step S23, there is step S24, pooling time series segments, converting candidate frame features from indefinite length to fixed length.

7. The two-stage temporal action detection method of claim 1,

in step S3, boundary regression correction and action classification are applied to the candidate box features by a candidate box evaluation module and an instance evaluation module; specifically,

the candidate box evaluation module performs a binary classification task and filters out video information features that are clearly not positive samples;

the instance evaluation module performs a multi-class classification task and outputs the specific category of the video information features.

8. A two-stage temporal action detection device, characterized by comprising a video information feature extraction unit, a candidate boundary extraction unit and a video action judgment unit,

the video information feature extraction unit cuts the video into a plurality of segments and extracts video information features;

the candidate boundary extraction unit converts the video information characteristics into a score curve;

the video action judging unit obtains a regression value of the candidate boundary, corrects the candidate boundary according to the regression value, and judges corresponding actions in the candidate frame.

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

10. A computer-readable storage medium having computer instructions stored thereon, the computer instructions being for causing a computer to perform the method of any one of claims 1-7.

Technical Field

The invention relates to a temporal action detection method, and belongs to the technical field of image recognition and detection.

Background

Motion detection in video is an important branch in image understanding.

The existing motion detection methods suffer from drawbacks such as low recognition accuracy, poor accuracy in locating the start and end positions of actions, and special requirements on the length of the video to be detected.

For the above reasons, the present inventors have conducted intensive studies on conventional video motion detection methods and propose a two-stage temporal action detection method.

Disclosure of Invention

In order to overcome the above problems, the present inventors have conducted extensive studies and designed a two-stage temporal action detection method, which includes the following steps:

S1, acquiring video information features;

S2, extracting candidate boundaries from the video information features, and combining the candidate boundaries to obtain candidate frames;

S3, correcting the boundaries of the candidate frames and determining the actions in the video.

Further, in step S2, extracting the candidate boundaries comprises the following sub-steps:

S21, converting the video information features into score curves;

S23, acquiring potential start times and potential end times in the score curves, and combining them to obtain candidate frames;

in step S21, the video information features are converted into score curves through a generator network, wherein a score curve is a curve of the probability of an action state in the video as a function of video time.

Preferably, the generator network comprises a hole convolution module; the video information features are input into the hole convolution module, and the module output, together with the video information features, is passed sequentially through the first activation function, the linear layer and the second activation function to obtain the score curves,

the hole convolution module is provided with a hole convolution; after the video information features or intermediate data are processed by the hole convolution, the result is passed through a third activation function and then normalized to form the output of the hole convolution module, and preferably, the first activation function is the same as the third activation function.

Preferably, between the step S21 and the step S23, a step S22 of improving the stability of the score curve is further provided;

the plurality of score curves obtained in each group are fused into one curve, yielding three fused score curves and thereby improving the stability.

Preferably, in step S23, the candidate boundary is acquired by:

S231, taking the segment times whose score is greater than a threshold, as well as the segment times whose score is a local maximum, as potential start times and potential end times;

S232, combining the potential start times and the potential end times, wherein the video segment information features at a potential start time and a potential end time, together with the video segment information features between them, constitute the obtained candidate frame.

Preferably, after step S23, there is step S24, pooling time series segments, converting the candidate frame feature from an indefinite length to a fixed length.

Preferably, in step S3, the candidate box evaluation module and the instance evaluation module apply boundary regression correction and action classification to the candidate box features; specifically, the candidate box evaluation module performs a binary classification task and filters out video information features that are clearly not positive samples;

the instance evaluation module performs a multi-class classification task and outputs the specific category of the video information features.

On the other hand, the invention also provides a two-stage temporal action detection device, which comprises a video information feature extraction unit, a candidate boundary extraction unit and a video action judgment unit,

the video information feature extraction unit cuts the video into a plurality of segments and extracts video information features;

the candidate boundary extraction unit converts the video information characteristics into a score curve;

the video action judging unit obtains a regression value of the candidate boundary, corrects the candidate boundary according to the regression value, and judges corresponding actions in the candidate frame.

In addition, the present invention also provides an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.

Furthermore, the present invention also provides a computer-readable storage medium storing computer instructions for causing a computer to execute the above method.

The invention has the advantages that:

(1) the recognition precision is far higher than that of the traditional action detection method;

(2) the recognition stability is high, and the robustness is good;

(3) any length of video can be processed.

Drawings

FIG. 1 is a flow diagram of a two-stage temporal action detection method according to a preferred embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating the alternation of background and action segments in a video;

FIG. 3 is a diagram illustrating the score curves in a two-stage temporal action detection method according to a preferred embodiment of the present invention;

FIG. 4 is a schematic diagram of the generator network structure in a two-stage temporal action detection method according to a preferred embodiment of the present invention;

FIG. 5 is a schematic diagram of the structure of the hole convolution module in a two-stage temporal action detection method according to a preferred embodiment of the present invention;

FIG. 6 is a diagram illustrating the temporal segment pooling process in a two-stage temporal action detection method according to a preferred embodiment of the present invention;

FIG. 7 is a schematic structural diagram of the candidate box evaluation module in a two-stage temporal action detection method according to a preferred embodiment of the present invention.

Detailed Description

The invention is explained in more detail below with reference to the figures and examples. The features and advantages of the present invention will become more apparent from the description.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The two-stage temporal action detection method provided by the invention, as shown in FIG. 1, comprises the following steps:

S1, acquiring video information features;

S2, extracting candidate boundaries from the video information features, and combining the candidate boundaries to obtain candidate frames;

S3, correcting the boundaries of the candidate frames and determining the actions in the video.

In step S1, the video is cut into a plurality of segments, and the video information features are extracted by a 3D motion recognition model.

Further, the total number of segments is denoted N, and an individual segment is indexed by n, n ∈ [1, N].

In the present invention, the specific structure of the 3D motion recognition model is not particularly limited, and it may be any model capable of extracting video information features, for example, the I3D model introduced in the paper "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset".

Preferably, the video is cut into N segments with the same length according to the chronological order.

In a preferred embodiment, RGB streams and optical flows of all segments are extracted, the RGB streams and the optical flows are respectively input into a 3D motion recognition model to extract RGB features and optical flow features, and then the RGB features and the optical flow features are fused to obtain features representing the whole video information.

In the present invention, the specific fusion method is a means commonly used by those skilled in the art and is not described in detail herein; for example, those skilled in the art can refer to the paper "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset" for the fusion.
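As a concrete illustration of this step, the sketch below assumes the per-segment RGB and optical-flow features are fused by simple concatenation along the feature dimension; the array shapes and the concatenation operator are illustrative assumptions, since the text leaves the exact fusion method to the practitioner.

import numpy as np

def fuse_two_stream_features(rgb_feats, flow_feats):
    """Fuse per-segment RGB and optical-flow features.
    rgb_feats, flow_feats: arrays of shape [N, D], one row per 16-frame segment.
    Concatenation along the feature axis is an assumed fusion; averaging would work similarly."""
    assert rgb_feats.shape == flow_feats.shape
    return np.concatenate([rgb_feats, flow_feats], axis=1)

# Illustrative shapes: N = 100 segments, 1024-dimensional I3D features per stream.
rgb = np.random.rand(100, 1024).astype(np.float32)
flow = np.random.rand(100, 1024).astype(np.float32)
video_features = fuse_two_stream_features(rgb, flow)   # shape (100, 2048)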

In step S2, the candidate boundaries characterize whether the video information of a segment corresponds to an action start or an action end.

As shown in fig. 2, before detecting an action, the background portion and the action portion need to be distinguished, that is, a candidate boundary is determined, and the action candidate frame can be obtained by combining the candidate boundaries.

Traditional boundary extraction methods are mostly based on preset candidate frames of specific lengths and sliding windows; such methods are limited by the receptive field and the anchor size and cannot extract boundaries well. In the present invention, unlike the traditional extraction of candidate boundaries, the candidate boundary extraction comprises the following sub-steps:

s21, converting the video information characteristics into a score curve;

and S23, acquiring potential starting time and potential ending time in the score curve, and combining to obtain a candidate frame.

In a preferred embodiment, between the step S21 and the step S23, the score curve stability is improved in step S22.

In step S21, the video information features are converted into a score curve through the generator network.

Further, the score curves comprise three groups, namely an action start curve group, an action progress curve group and an action end curve group, which measure the start of an action, the progress of an action and the end of an action, respectively. Each group includes X score curves, denoted S_i[x][n], where i ∈ [1, 2, 3] indexes the action start, action progress and action end curve groups respectively, x ∈ [1, X] indexes the different score curves, and n ∈ [1, N] indexes the different segments.

The score curve is a curve of the probability of the action state in the video along with the change of the video time, the horizontal axis of the curve represents the video time, and the vertical axis of the curve represents the probability of the action state.

Further, the generator network is provided with three channels, each outputting a set of score curves.

Further preferably, the generator network includes a hole convolution module; after the video information features are input into the hole convolution module, the module output, together with the video information features, is passed sequentially through the first activation function, the linear layer and the second activation function to obtain the three groups of score curves, as shown in FIG. 4.

Preferably, the first activation function is a ReLU function, and the second activation function is a Sigmoid function.

The linear layer is a structure commonly used in a neural network, and details are not described in the present application, and a person skilled in the art can design the linear layer according to actual needs.

The number of hole convolution modules can be one or more; when there are several, the input of the first hole convolution module is the video information features, the output of the first hole convolution module serves as the input of the second, and so on.

In a preferred embodiment, the structure of the hole convolution module is as shown in FIG. 5: after the video information features or intermediate data are processed by the hole convolution, the result is passed through the third activation function, normalized, and then output.

Preferably, the first activation function and the third activation function are the same.

Hole convolution (also known as dilated or atrous convolution) is widely applied in semantic segmentation and object detection tasks; compared with the convolution in a traditional CNN (convolutional neural network), it can effectively enlarge the receptive field.

Further, in the present invention, the convolution kernel size of the hole convolution is 3, and the hole (dilation) rate is 2.

The inventors found through a large number of experiments that these parameters are optimal. The larger the convolution kernel and the hole rate, the larger the receptive field; although a larger receptive field is desirable in the invention, an overly large receptive field introduces noise from other actions, while an overly small one cannot cover a complete action. The chosen parameters balance the receptive field against the noise.

Preferably, the third activation function is ReLU, and the Normalization is Batch Normalization, a commonly used Normalization processing method.

Preferably, a Dropout strategy is adopted in the hole convolution module to prevent the model from being over-fitted.

Dropout is a method for preventing overfitting proposed by Hinton et al. in the 2012 article "Improving neural networks by preventing co-adaptation of feature detectors".

After the hole convolution module is trained, it can effectively convert the video information features into score curves.
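The following PyTorch sketch mirrors the generator structure described above. It is only an illustration under stated assumptions: in particular, it assumes the hole convolution output is concatenated with the original video features before the first activation function, and that the three score-curve groups correspond to three sigmoid output channels; layer widths and the dropout rate are placeholders.

import torch
import torch.nn as nn

class HoleConvModule(nn.Module):
    """One hole (dilated) convolution module: dilated Conv1d -> ReLU -> BatchNorm, with Dropout."""
    def __init__(self, channels, kernel_size=3, dilation=2, p_drop=0.1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2  # keeps the temporal length unchanged
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=padding, dilation=dilation)
        self.act = nn.ReLU()          # third activation function
        self.norm = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):             # x: [B, C, N]
        return self.drop(self.norm(self.act(self.conv(x))))

class GeneratorNetwork(nn.Module):
    """Maps per-segment video features to three score curves (start / progress / end)."""
    def __init__(self, feat_dim, num_modules=2):
        super().__init__()
        self.blocks = nn.Sequential(*[HoleConvModule(feat_dim) for _ in range(num_modules)])
        self.act1 = nn.ReLU()                     # first activation function
        self.linear = nn.Linear(2 * feat_dim, 3)  # concatenated features -> 3 output channels
        self.act2 = nn.Sigmoid()                  # second activation function

    def forward(self, feats):                     # feats: [B, N, C]
        x = feats.transpose(1, 2)                 # [B, C, N] for Conv1d
        h = self.blocks(x).transpose(1, 2)        # [B, N, C]
        combined = self.act1(torch.cat([h, feats], dim=-1))   # assumption: concatenation
        return self.act2(self.linear(combined))   # [B, N, 3]: start, progress, end curves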

The inventors have found that, while the accuracy of the score curves produced by the hole convolution module is already high, a single set of score curves often contains much noise, which may adversely affect the final result.

In order to effectively filter noise, in the present invention, between step S21 and step S23, step S22 is further provided to improve score curve stability.

In step S22, the X score curves S_i[x][n] in each group are merged into one curve through a score fusion strategy, yielding three fused score curves and thereby improving stability, as shown in FIG. 3.

In the invention, a group of score curves are fused into a fused score curve, so that the noise output by the neural network can be effectively reduced, and great performance improvement can be obtained by using small calculation amount.

Specifically, in step S22, the input of the score fusion strategy is the scores S_i[x][n] of each group of curves, i ∈ [1, 2, 3], x ∈ [1, X], n ∈ [1, N], and the output of the score fusion strategy is the average score S_ia and the maximum score S_im of each group of curves.

Further, when the different curves of a group are fused to obtain the average score S_ia, the value at each segment time, S_ia[n], is obtained by a process that may be expressed as:

k = Σ ε_n

The maximum score S_im of each group is obtained by fusing the different curves of the group in a similar manner, where the value at each segment time is S_im[n]; here t denotes a score threshold, a predetermined constant generally between 0.5 and 0.7.

Here R denotes the receptive field, which depends on the number of hole convolution modules and their parameters, and can be expressed as:

R = [1 + w(q-1)p] * m - 1

where m is the number of hole convolution modules, w is the number of hole convolutions in each module, q is the convolution kernel size, and p is the hole (dilation) rate.
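Because the fused-score expressions above are only partially reproduced, the sketch below takes the simplest reading as an assumption: the average score is the mean of the X curves of a group, and the maximum score is their elementwise maximum, suppressed where it does not exceed the threshold t. The receptive-field computation, by contrast, follows the formula given above directly.

import numpy as np

def fuse_scores(curves, t=0.6):
    """curves: array of shape [X, N] holding the X score curves of one group.
    Returns (average score S_ia, maximum score S_im), each of shape [N]."""
    s_avg = curves.mean(axis=0)
    s_max = curves.max(axis=0)
    s_max = np.where(s_max > t, s_max, 0.0)   # assumption: values not exceeding t are suppressed
    return s_avg, s_max

def receptive_field(m, w, q, p):
    """R = [1 + w(q-1)p] * m - 1, as given above."""
    return (1 + w * (q - 1) * p) * m - 1

# Two modules, two hole convolutions per module, kernel size 3, hole rate 2:
print(receptive_field(m=2, w=2, q=3, p=2))    # -> 17, the value used in Example 1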

In step S23, a candidate boundary is acquired in the following manner.

S231, taking the segment times whose score is greater than a threshold, as well as the segment times whose score is a local maximum, as the potential start times and potential end times;

Action segments are screened by comparing the score with a threshold, whose specific value can be chosen empirically by one skilled in the art, preferably 0.5-0.7, for example 0.6.

A local-maximum segment is a segment whose score is higher than that of both the preceding segment and the following segment.

S232, combining the potential start times and the potential end times, wherein the video segment information features at a potential start time and a potential end time, together with the video segment information features between them, constitute the obtained candidate frame.

Further, when there are multiple potential start times and potential end times, each potential start time is combined with every potential end time occurring after it, so as to obtain all possible candidate frames.
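A minimal sketch of the boundary selection and combination rule described above, reading the two conditions (score above the threshold, score being a local maximum) as alternatives; the threshold value and array names are illustrative.

import numpy as np

def candidate_times(curve, threshold=0.6):
    """Segment indices whose score exceeds the threshold or is a local maximum."""
    above = curve > threshold
    local_max = np.zeros_like(above)
    local_max[1:-1] = (curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:])
    return np.where(above | local_max)[0]

def combine_boundaries(start_curve, end_curve, threshold=0.6):
    """Pair every potential start time with every later potential end time."""
    starts = candidate_times(start_curve, threshold)
    ends = candidate_times(end_curve, threshold)
    return [(s, e) for s in starts for e in ends if e > s]   # candidate frames as (start, end)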

In a preferred embodiment, after step S23, there is step S24, pooling time series segments, converting the candidate frame feature from an indefinite length to a fixed length.

The inventor finds that the obtained candidate frame is indefinite in length, and the feature of indefinite length is not beneficial to subsequent detection.

In the present invention, fixed-length features are obtained by pooling the features of candidate frames.

Specifically, the following steps are included.

S241, expanding the candidate frame to obtain expanded features;

Preferably, the candidate frame is expanded to twice its length; specifically, it is enlarged by moving the candidate boundaries forward and backward.

More preferably, the candidate frame is enlarged symmetrically to twice its length, i.e., the two candidate boundaries are moved outward by the same distance; for example, if the start and end times of the original candidate frame, i.e., the candidate boundaries, are 10 s and 18 s, the boundaries are moved to the 6 s and 22 s positions, so that the candidate frame is doubled in length.

In the invention, expanding the candidate frames allows the foreground and the surrounding background information of the action in the video to be taken into account.

S242, pooling the expanded features to obtain structured features;

The structured features are assembled by dividing the expanded features into k parts and sampling a feature from each part.

Preferably, the expanded features are equally divided into k parts, k being a positive integer greater than 3.

Preferably, one point is randomly sampled within each of the k parts, and the feature at that point is obtained by linear interpolation between the features of the two adjacent time instants, yielding k features in total, as shown in FIG. 6.

Further, the first feature of the candidate frame is prepended to the obtained k features, and the last feature of the candidate frame is appended after them, giving k+2 features, i.e., a structured feature of length k+2.

In this way, while retaining as much of the data information as possible and keeping the modeling computation time short, a candidate frame feature of arbitrary length is converted into a feature of fixed length.
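The pooling step can be sketched as follows; the twofold symmetric expansion and the linear interpolation from the two neighbouring segment features follow the description above, while the default k and the clipping to the video extent are assumptions.

import numpy as np

def pool_candidate(features, start, end, k=8, rng=None):
    """Convert a variable-length candidate frame into a fixed-length (k + 2) feature.
    features: [N, D] per-segment video features; start, end: segment indices of the frame."""
    rng = rng or np.random.default_rng()
    n, _ = features.shape
    length = end - start
    lo = max(0.0, start - length / 2.0)            # symmetric twofold expansion,
    hi = min(n - 1.0, end + length / 2.0)          # clipped to the video extent (assumption)

    def interp(t):
        """Linearly interpolate the feature at continuous time t from its two neighbours."""
        i0 = int(np.floor(t))
        i1 = min(i0 + 1, n - 1)
        w = t - i0
        return (1.0 - w) * features[i0] + w * features[i1]

    edges = np.linspace(lo, hi, k + 1)             # k equal parts of the expanded interval
    sampled = [interp(rng.uniform(edges[i], edges[i + 1])) for i in range(k)]
    # Prepend the first and append the last feature of the original candidate frame.
    return np.stack([features[start]] + sampled + [features[end]])   # shape [k + 2, D]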

In conventional video action detection, the action decision is usually obtained by directly classifying the features of the candidate frames, but the accuracy of this approach is insufficient.

In step S3, boundary regression correction and action classification are applied to the candidate box features by the candidate box evaluation module and the instance evaluation module.

The candidate box evaluation module performs a binary classification task and filters out video information features that are clearly not positive samples;

a positive sample refers to a sample containing an action.

Specifically, the input of the candidate box evaluation module is a candidate frame feature, and its output includes a binary classification score S_fore; when S_fore is greater than a filtering threshold, the candidate frame corresponding to the score is retained, and when S_fore is less than or equal to the filtering threshold, the candidate frame is deleted.

The instance evaluation module performs a multi-class classification task and outputs the specific class of the video information features among the W classes; the W classes refer to the number and categories of all actions in the training data set.

Specifically, the input of the instance evaluation module is a candidate frame feature, and the output is a multi-class classification score S_multi.

Further, the candidate box evaluation module and the instance evaluation module are connected in parallel, and the average or the product of the outputs of the two modules is used as the regression value of the candidate boundary.

The correction by candidate-boundary regression further involves the composite score of the candidate frame on the score curves; the composite score S_p of the score curves is:

S_p = S_{p,s} · S_{p,e} · S_{p,o}

where S_{p,s} is the score on the start curve at the moment the action starts, S_{p,e} is the score on the end curve at the moment the action ends, and S_{p,o} is the average value of the progress curve while the action is in progress.

Different action types correspond to different composite scores, and the action type corresponding to the candidate frame can be obtained by table lookup according to the composite score.

More preferably, the regression value of the candidate boundary is represented by:

S_final = S_fore · S_multi · S_p

The candidate frame is corrected by the regression value to obtain a more accurate candidate frame. Specifically, the product of the regression values and the length of the candidate frame before correction is added to the candidate frame before correction to obtain the corrected candidate frame; for example, if the regression values are (-0.1, 0.4) and the candidate frame before correction is (10s, 30s), whose length is 20s, the corrected candidate frame is (8s, 38s).
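The numerical example above can be reproduced directly; the score values fed to composite_score and the final combination are illustrative placeholders for the module outputs.

def composite_score(s_start, s_end, s_progress_mean):
    """S_p = S_{p,s} * S_{p,e} * S_{p,o}."""
    return s_start * s_end * s_progress_mean

def refine_frame(start, end, offsets):
    """Shift each boundary by (offset * frame length), as in the 10 s / 30 s example above."""
    length = end - start
    d_start, d_end = offsets
    return start + d_start * length, end + d_end * length

print(refine_frame(10.0, 30.0, (-0.1, 0.4)))   # -> (8.0, 38.0), matching the corrected frame
# Final score combining the three terms: S_final = S_fore * S_multi * S_p
s_final = 0.9 * 0.8 * composite_score(0.7, 0.6, 0.5)   # illustrative values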

According to a preferred embodiment of the present invention, the candidate box evaluation module includes at least three convolutional layers, and its structure is as shown in FIG. 7: two convolutional layers are connected in series and then placed in parallel with another convolutional layer; the combined result is passed through the activation function ReLU to a linear layer, and the final result is output by the linear layer.

Further, the output of each convolutional layer is normalized, and the two convolutional layers in series are connected through the activation function ReLU.

According to the present invention, the output of the candidate box evaluation module is 2 offset values and one binary probability value.

Further preferably, the instance evaluation module has the same network structure as the candidate box evaluation module, differing only in output dimension; the output of the instance evaluation module is 2 offset values and the probability values of the W action classes in the training set.
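A PyTorch sketch of an evaluation head with this layout. The way the parallel branches are combined (here, element-wise addition) and the pooled sequence length are assumptions, since the text specifies only the series/parallel arrangement, the ReLU, the linear layer and the output sizes.

import torch.nn as nn

class EvaluationModule(nn.Module):
    """Evaluation head: two convolutional layers in series, in parallel with a single
    convolutional layer, followed by ReLU and a linear output layer."""
    def __init__(self, feat_dim, seq_len, out_dim):
        super().__init__()
        self.branch_a = nn.Sequential(            # two convolutional layers in series
            nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.BatchNorm1d(feat_dim), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.BatchNorm1d(feat_dim),
        )
        self.branch_b = nn.Sequential(            # single convolutional layer in parallel
            nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.BatchNorm1d(feat_dim),
        )
        self.head = nn.Sequential(nn.ReLU(), nn.Flatten(), nn.Linear(feat_dim * seq_len, out_dim))

    def forward(self, x):                         # x: [B, feat_dim, k + 2]
        return self.head(self.branch_a(x) + self.branch_b(x))   # assumption: branches are summed

feat_dim, seq_len, num_classes = 2048, 10, 20     # illustrative sizes (k = 8 -> k + 2 = 10)
pem = EvaluationModule(feat_dim, seq_len, out_dim=2 + 1)            # 2 offsets + 1 binary value
iem = EvaluationModule(feat_dim, seq_len, out_dim=2 + num_classes)  # 2 offsets + W class scores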

According to the invention, in the process of training the neural network, the candidate box evaluation module and the instance evaluation module are trained separately, so that the two modules do not share parameters.

Further, the loss function L_PEM of the candidate box evaluation module during training is:

L_PEM = L_fore + L_offset1

where L_fore denotes the loss of the foreground classification, preferably a cross-entropy loss function, and L_offset1 denotes the loss of the boundary regression of the candidate box evaluation module, preferably an MSE loss function;

the loss function L_IEM of the instance evaluation module is:

L_IEM = L_multi + L_offset2

where L_multi denotes the multi-class classification loss, preferably a cross-entropy loss function, and L_offset2 denotes the loss of the boundary regression of the instance evaluation module, preferably an MSE loss function.
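A sketch of the two training objectives, assuming the classification terms use (binary) cross-entropy and the offset terms use MSE as stated above; how the raw head outputs are split into offsets and logits is an assumption about tensor layout. The two modules are optimized separately and share no parameters.

import torch.nn as nn

ce = nn.CrossEntropyLoss()   # multi-class classification term (L_multi)
mse = nn.MSELoss()           # boundary regression terms (L_offset1, L_offset2)

def pem_loss(pem_out, fg_labels, offset_targets):
    """L_PEM = L_fore + L_offset1.  pem_out: [B, 3] = 2 offsets followed by a foreground logit."""
    offsets, fg_logit = pem_out[:, :2], pem_out[:, 2]
    l_fore = nn.functional.binary_cross_entropy_with_logits(fg_logit, fg_labels.float())
    return l_fore + mse(offsets, offset_targets)

def iem_loss(iem_out, class_labels, offset_targets):
    """L_IEM = L_multi + L_offset2.  iem_out: [B, 2 + W] = 2 offsets followed by W class logits."""
    offsets, logits = iem_out[:, :2], iem_out[:, 2:]
    return ce(logits, class_labels) + mse(offsets, offset_targets)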

In another aspect, the present invention further provides a two-stage temporal action detection apparatus, which comprises a video information feature extraction unit, a candidate boundary extraction unit and a video action determination unit.

The video information feature extraction unit cuts the video into a plurality of segments and extracts video information features.

Preferably, the video information feature extraction unit is provided with a 3D motion recognition model, and the RGB features and the optical flow features of the video segments are extracted through the 3D motion recognition model, and then the RGB features and the optical flow features are fused to obtain the features representing the whole video information.

The candidate boundary extraction unit comprises a generator network subunit, which is used for converting the video information characteristics into a score curve, wherein the score curve comprises an action starting curve, an action proceeding curve and an action ending curve.

Further, a hole convolution module and a linear layer are provided in the generator network subunit; the hole convolution module is connected to the linear layer through a first activation function, and the output of the linear layer is passed through a second activation function.

The hole convolution module is provided with a hole convolution network; preferably, the convolution kernel size of the hole convolution network is 3, and the hole (dilation) rate is 2.

Preferably, the result of the hole convolution network in the hole convolution module is passed through the third activation function, normalized, and then output.

Preferably, the first activation function is a ReLU function, the second activation function is a Sigmoid function, and the third activation function is a ReLU.

Preferably, the candidate boundary extraction unit further includes a score curve fusion subunit that fuses the X score curves of each group into one fused curve in accordance with the method in step S22.

The candidate boundary extraction unit further comprises a boundary judgment subunit which judges the candidate boundary according to the score curve. Preferably, the candidate boundary is determined according to the method in step S23.

Further, the candidate boundary extraction unit is also capable of combining the candidate boundaries into the candidate frame.

Preferably, the candidate boundary extraction unit further comprises a time-series segment pooling sub-module for converting the candidate frame feature from an indefinite length to a fixed length.

Preferably, the candidate frame is symmetrically expanded to twice its length and then evenly divided into k parts; one point is randomly sampled within each of the k parts, and the feature at that point is obtained by linear interpolation between the features of the two adjacent time instants, yielding k features in total; the first feature of the candidate frame is prepended to the obtained k features and the last feature of the candidate frame is appended after them, giving a candidate frame feature of k+2 features.

The video action determination unit comprises a score curve synthesis module, a candidate box evaluation module and an instance evaluation module.

The composite score S_p of the score curves is:

S_p = S_{p,s} · S_{p,e} · S_{p,o}

According to the composite score, the action type corresponding to the candidate frame is obtained by table lookup.

The input of the candidate box evaluation module is a candidate frame feature, and its output includes a binary classification score S_fore; when S_fore is greater than a filtering threshold, the candidate frame corresponding to the score is retained, and when S_fore is less than or equal to the filtering threshold, the candidate frame is deleted.

The input of the instance evaluation module is a candidate frame feature, and the output is a multi-class classification score S_multi.

The video action judgment unit obtains the regression value of the candidate boundaries according to the results of the score curve synthesis module, the candidate box evaluation module and the instance evaluation module, and corrects the candidate boundaries according to the regression value, wherein the regression value of the candidate boundaries is:

S_final = S_fore · S_multi · S_p

further, the candidate boundary evaluation module comprises at least three convolutional layers, wherein two convolutional layers are connected in series and then connected in parallel with one other convolutional layer, the parallel result is transmitted to one linear layer through an activation function ReLu, the final result is output through the linear layer, the output of each convolutional layer is subjected to normalization processing, and the two convolutional layers connected in series are connected through the activation function ReLU.

Preferably, the network structure of the instance evaluation module and the candidate boundary evaluation module is the same, and only the output scale is different.

Various embodiments of the above-described methods and apparatus of the present invention may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present invention may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the methods and apparatus described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The methods and apparatus described herein may be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service extensibility in a traditional physical host and a VPS service ("Virtual Private Server", or "VPS" for short). The server may also be a server of a distributed system, or a server incorporating a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in this disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed herein can be achieved, and the present disclosure is not limited herein.

Examples

Example 1

Two-stage temporal action detection is carried out on the public data sets THUMOS-14 and ActivityNet-1.3, comprising the following steps:

S1, acquiring video information features;

S2, extracting candidate boundaries from the video information features, and combining the candidate boundaries to obtain candidate frames;

S3, correcting the boundaries of the candidate frames and determining the actions in the video.

In step S1, the video is cut into N segments of the same length in time sequence, RGB streams and optical flows of all the segments are extracted, the RGB streams and the optical flows are respectively input into a 3D motion recognition model to extract RGB features and optical flow features, and then the RGB features and the optical flow features are fused to obtain features representing the entire video information, where each segment of the N segments is 16 frames.

Step S2 includes the following substeps:

S21, converting the video information features into score curves;

S22, improving the stability of the score curves;

S23, acquiring potential start times and potential end times in the score curves, and combining them to obtain candidate frames;

S24, pooling temporal segments, and converting the candidate frame features from indefinite length to fixed length.

In step S21, the generator network includes a hole convolution module, and after the video information features are input into the hole convolution module, the processing result and the video information features are sequentially output through a first activation function, a linear layer, and a second activation function together to obtain a score curve, where the first activation function is a ReLU function, the second activation function is a Sigmoid function, there are two hole convolution modules, a convolution kernel of the hole convolution is 3, a hole proportion is 2, and a Dropout policy is used in the hole convolution module.

In step S22, when the different curves of a group are fused to obtain the average score S_ia, the value at each segment time, S_ia[n], is obtained by a process that may be expressed as:

k = Σ ε_n

The maximum score S_im of each group is obtained by fusing the different curves of the group in a similar manner, where the value at each segment time is S_im[n]; here t denotes the score threshold, set to the constant 0.6, and R denotes the receptive field:

R = [1 + w(q-1)p] * m - 1 = [1 + 2(3-1)*2] * 2 - 1 = 17

in step S24, the candidate frame is evenly expanded by one time and then evenly divided into k parts, each of the k parts is randomly sampled to obtain a point, the feature of the point is obtained by linear interpolation from the features of two adjacent time instants, k features are obtained in total, the first feature of the candidate frame is added before the obtained k features, and the last feature of the candidate frame is added after the obtained k features, so as to obtain the candidate frame features of k +2 features.

In step S3, the candidate box features are subjected to correction of boundary regression and action classification by the candidate box evaluation module and the instance evaluation module.

The composite score S_p of the score curves is:

S_p = S_{p,s} · S_{p,e} · S_{p,o}

the input of the candidate box evaluation module is a candidate box characteristic, and the output comprises a score S of two classificationsforeWhen S isforeIf the score is larger than the filtering threshold value, the candidate boundary corresponding to the score is reserved, and when S is larger than the filtering threshold valueforeAnd when the candidate boundary is smaller than or equal to a filtering threshold value, deleting the candidate boundary, wherein the filtering threshold value is 0.5.

The input of the example evaluation module is a candidate frame characteristic, and the output is a multi-classification score Smulti

The regression values of the candidate boundaries are represented as:

S_final = S_fore · S_multi · S_p

further, the candidate frame evaluation module includes at least three convolutional layers, wherein two convolutional layers are connected in parallel with one another after being connected in series, the parallel result is transmitted to a linear layer through an activation function ReLu, the final result is output through the linear layer, the output of each convolutional layer is subjected to normalization processing, the two convolutional layers connected in series are connected through the activation function ReLu, and the output of the candidate boundary evaluation module is 2 offset values and a binary probability value

The network structure of the example evaluation module is the same as that of the candidate boundary evaluation module, only the output scale is different, and the output of the example evaluation module is 2 offset values and the probability values of W action classes in the training set.

In the process of training the neural network, the candidate box evaluation module and the instance evaluation module are trained separately; during training, the loss function L_PEM of the candidate box evaluation module is:

L_PEM = L_fore + L_offset1

where L_fore is a cross-entropy loss function and L_offset1 is an MSE loss function;

the loss function L_IEM of the instance evaluation module is:

L_IEM = L_multi + L_offset2

where L_multi is a cross-entropy loss function and L_offset2 is an MSE loss function.

Comparative example 1

The same experiment as in Example 1 was carried out, except that the R-C3D network of the paper Huijuan Xu, Abir Das, and Kate Saenko. 2017. "R-C3D: Region Convolutional 3D Network for Temporal Activity Detection." In Proceedings of the IEEE International Conference on Computer Vision, 5783-5792, was used.

Comparative example 2

The same experiment as in Example 1 was carried out, except that the TURN TAP method of the paper Jiyang Gao, Zhenheng Yang, Kan Chen, Chen Sun, and Ram Nevatia. 2017. "TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals." In Proceedings of the IEEE International Conference on Computer Vision, 3628-3636, was used.

Comparative example 3

The same experiment as in Example 1 was carried out, except that the CTAP method of the paper Jiyang Gao, Kan Chen, and Ram Nevatia. 2018. "CTAP: Complementary Temporal Action Proposal Generation." In Proceedings of the European Conference on Computer Vision (ECCV), 68-83, was used.

Comparative example 4

The same experiment as in Example 1 was carried out, except that the BSN network of the paper Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. 2018. "BSN: Boundary Sensitive Network for Temporal Action Proposal Generation." In Proceedings of the European Conference on Computer Vision (ECCV), 3-19, was used.

Comparative example 5

The same experiment as in Example 1 was carried out, except that the BMN network of the paper Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. 2019. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation." In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3889-3898, was used.

Comparative example 6

The same experiment as in Example 1 was carried out, except that the MGG network of the paper Yuan Liu, Lin Ma, Yifeng Zhang, Wei Liu, and Shih-Fu Chang. 2019. "Multi-Granularity Generator for Temporal Action Proposal." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3604-3613, was used.

Comparative example 7

The same experiment as in Example 1 was carried out, except that the DBG network of the paper Chuming Lin, Jian Li, Yabiao Wang, Ying Tai, Donghao Luo, Zhipeng Cui, Chengjie Wang, Jilin Li, Feiyue Huang, and Rongrong Ji. 2020. "Fast Learning of Temporal Action Proposal via Dense Boundary Generator." In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 11499-11506, was used.

Comparative example 8

The same experiment as in Example 1 was carried out, except that the TSA network of the paper Guoqiang Gong, Liangfeng Zheng, and Yadong Mu. 2020. "Scale Matters: Temporal Scale Aggregation Network for Precise Action Localization in Untrimmed Videos." In 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 1-6, was used.

Comparative example 9

The same experiment as in Example 1 was carried out, except that the BC-GNN network of the paper Yueran Bai, Yingying Wang, Yunhai Tong, Yang Yang, Qi Liu, and Junhui Liu. 2020. "Boundary Content Graph Neural Network for Temporal Action Proposal Generation." In European Conference on Computer Vision, Springer, was used.

Comparative example 10

The same experiment as in Example 1 was carried out, except that the method of the paper Zheng Shou, Dongang Wang, and Shih-Fu Chang. 2016. "Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1049-1058, was used.

Comparative example 11

The same experiment as in Example 1 was carried out, except that the SST network of the paper Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. 2017. "SST: Single-Stream Temporal Action Proposals." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2911-2920, was used.

Comparative example 12

The same experiment as in Example 1 was carried out, except that the CDC network of the paper Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. 2017. "CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5734-5743, was used.

Comparative example 13

The same experiment as in Example 1 was carried out, except that the SSAD network of the paper Tianwei Lin, Xu Zhao, and Zheng Shou. 2017. "Single Shot Temporal Action Detection." In Proceedings of the 25th ACM International Conference on Multimedia, 988-996, was used.

Comparative example 14

The same experiment as in Example 1 was carried out, except that the TCN network of the paper Xiyang Dai, Bharat Singh, Guyue Zhang, Larry S. Davis, and Yan Qiu Chen. 2017. "Temporal Context Network for Activity Localization in Videos." In Proceedings of the IEEE International Conference on Computer Vision, 5793-5802, was used.

Comparative example 15

The same experiment as in Example 1 was carried out, except that the R-C3D network of the paper Huijuan Xu, Abir Das, and Kate Saenko. 2017. "R-C3D: Region Convolutional 3D Network for Temporal Activity Detection." In Proceedings of the IEEE International Conference on Computer Vision, 5783-5792, was used.

Comparative example 16

The same experiment as in Example 1 was carried out, except that the SSN network of the paper Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. 2017. "Temporal Action Detection with Structured Segment Networks." In Proceedings of the IEEE International Conference on Computer Vision, 2914-2923, was used.

Comparative example 17

The same experiment as in Example 1 was carried out, except that the method of the paper Jiyang Gao, Zhenheng Yang, and Ram Nevatia. 2017. "Cascaded Boundary Regression for Temporal Action Detection." arXiv preprint arXiv:1705.01180 (2017), was used.

Comparative example 18

The same experiment as in Example 1 was carried out, except that the ETP network of the paper Haonan Qiu, Yingbin Zheng, Hao Ye, Yao Lu, Feng Wang, and Liang He. 2018. "Precise Temporal Action Localization by Evolving Temporal Proposals." In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, 388-396, was used.

Comparative example 19

The same experiment as in Example 1 was carried out, except that the G-TAD method of the paper Mengmeng Xu, Chen Zhao, David S. Rojas, Ali Thabet, and Bernard Ghanem. 2020. "G-TAD: Sub-Graph Localization for Temporal Action Detection." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10156-10165, was used.

Comparative example 20

The same experiment as in Example 1 was carried out, except that the GTAN network of the paper Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. 2019. "Gaussian Temporal Awareness Networks for Action Localization." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 344-353, was used.

Comparative example 21

The same experiment as in Example 1 was carried out, except that the P-GCN network of the paper Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, and Chuang Gan. 2019. "Graph Convolutional Networks for Temporal Action Localization." In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7094-7103, was used.

Comparative example 22

The same experiment as in Example 1 was carried out, except that the PBRNet network of the paper Qinying Liu and Zilei Wang. 2020. "Progressive Boundary Refinement Network for Temporal Action Detection." In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 11612-11619, was used.

Comparative example 23

The same experiment as in Example 1 was carried out, except that the TAL-Net network of the paper Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A. Ross, Jia Deng, and Rahul Sukthankar. 2018. "Rethinking the Faster R-CNN Architecture for Temporal Action Localization." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1130-1139, was used.

Experimental examples

The candidate frame generation results of Example 1 and the various comparative examples on the THUMOS-14 dataset are shown in Table 1.

Table 1

Here @50, @100 and @200 denote the average recall when 50, 100 and 200 candidate frames are generated per video on average; the higher the average recall, the better the performance. As can be seen from Table 1, the recall of Example 1 of the present application is clearly higher than that of the other methods.

The candidate frame generation results of Example 1 and the various comparative examples on the ActivityNet-1.3 dataset are shown in Table 2.

Table 2

Here AR@AN=100 denotes the average recall when 100 candidate frames are generated per video on average; the higher the average recall, the better the performance. AUC is the area under the AR@AN curve; the larger the value, the better the performance. As can be seen from Table 2, the candidate frame generation performance of Example 1 of the present application is high.

The overall action detection results of Example 1 and the various comparative examples on the THUMOS-14 dataset are shown in Table 3.

Table 3

Here tIoU denotes the temporal intersection over union, i.e., the ratio of the temporal intersection length to the temporal union length between a candidate frame and the ground truth; the higher the value, the stricter the requirement on the candidate frame. As can be seen from Table 3, Example 1 of the present application achieves more accurate results under different tIoU thresholds, indicating that the overall detection process of Example 1 of the present application is more accurate.

The present invention has been described above in connection with preferred embodiments, but these embodiments are merely exemplary and merely illustrative. On the basis of the above, the invention can be subjected to various substitutions and modifications, and the substitutions and the modifications are all within the protection scope of the invention.
