Video processing method and device, electronic equipment and readable storage medium

Document No.: 142655  Publication date: 2021-10-22

Note: This technology, "Video processing method and device, electronic equipment and readable storage medium", was designed and created by Che Hongwei on 2021-07-21. Its main content is as follows: The application discloses a video processing method, a video processing device, electronic equipment and a readable storage medium, and belongs to the technical field of video processing. The method comprises the following steps: acquiring a video to be processed, wherein the video to be processed comprises a plurality of video frames to be processed; acquiring a plurality of reference video frames corresponding to a special effect video, and determining a target video frame matched with each reference video frame from the plurality of video frames to be processed; determining a target time period from a plurality of the target video frames, and adding a special effect to the video within the target time period based on the special effect video. By matching each reference video frame with the video frames of the video to be processed, the method and the device can improve the accuracy of video processing to a certain extent.

1. A method of video processing, the method comprising:

acquiring a video to be processed, wherein the video to be processed comprises a plurality of video frames to be processed;

acquiring a plurality of reference video frames corresponding to a special effect video, and determining a target video frame matched with each reference video frame from the plurality of video frames to be processed;

determining a target time period from a plurality of the target video frames, and adding a special effect to the video within the target time period based on the special effect video.

2. The method of claim 1, wherein the plurality of reference video frames comprises a starting reference video frame, and wherein the plurality of target video frames comprises a starting video frame matching the starting reference video frame;

the determining a target time period from a plurality of the target video frames comprises:

taking the time corresponding to the starting video frame as the starting time;

and acquiring the target time period according to the starting time.

3. The method of claim 2, wherein determining a target video frame from the plurality of to-be-processed video frames that matches each of the reference video frames comprises:

performing pose detection on each video frame to be processed to obtain a plurality of candidate pose information;

and matching each piece of candidate pose information with the first reference pose information of the plurality of reference video frames respectively, and taking the video frame to be processed whose candidate pose information has a matching degree greater than a preset matching degree as a target video frame, wherein the video frame to be processed whose candidate pose information has a matching degree with the first reference pose information of the starting reference video frame greater than the preset matching degree is the starting video frame.

4. The method according to claim 3, wherein after the matching each piece of candidate pose information with the first reference pose information of the plurality of reference video frames and taking the video frame to be processed corresponding to the candidate pose information with the matching degree greater than the preset matching degree as the target video frame, the method further comprises:

determining a preset number of candidate video frames based on the starting video frame, and acquiring pose information corresponding to each candidate video frame;

comparing the pose information corresponding to each candidate video frame with second reference pose information respectively, and taking the pose information whose similarity is greater than a preset similarity as target pose information, wherein the reference video frame corresponding to the second reference pose information is adjacent to the reference video frame corresponding to the first reference pose information;

and determining the video frame to be processed corresponding to the target pose information, and updating the starting video frame among the target video frames with the video frame to be processed.

5. The method of claim 4, wherein before the comparing the pose information corresponding to each candidate video frame with the second reference pose information respectively, the method comprises:

decomposing the special effect video into a plurality of state videos, and acquiring second reference pose information corresponding to each state video.

6. The method of claim 5, wherein before the decomposing the special effects video into a plurality of state videos, the method comprises:

determining a data format of the special effect video;

if the data format of the special effect video is not a specified format, converting the data format of the special effect video into the specified format, wherein the specified format comprises at least one of the type of the special effect video, the start and end times of the special effect video, and a filter control frame number, and the filter control frame number is used for debouncing.

7. The method according to claim 5, wherein before the obtaining the matching degree between each piece of candidate pose information and each piece of first reference pose information respectively, the method further comprises:

comparing the second reference pose information corresponding to each pair of adjacent state videos respectively, and if the difference between the second reference pose information of the latter state video and the second reference pose information of the former state video is greater than a first difference value, taking the product of the second reference pose information of the former state video and a first specified value as the first reference pose information corresponding to the former state video, wherein the first specified value is less than 1;

and if the difference between the second reference pose information of the latter state video and the second reference pose information of the former state video is less than a second difference value, taking the product of the second reference pose information of the former state video and a second specified value as the first reference pose information corresponding to the former state video, wherein the second specified value is greater than 1.

8. The method according to any one of claims 3 to 7, wherein the performing pose detection on each of the to-be-processed video frames to obtain a plurality of candidate pose information comprises:

performing human body detection on each video frame to be processed to obtain a human body detection result;

performing key point detection on the video frame to be processed according to the human body detection result to obtain a key point detection result;

and performing data conversion on the key point detection result to obtain a plurality of candidate pose information, wherein the candidate pose information comprises angle information and distance information corresponding to a plurality of human body joint points.

9. A video processing apparatus, wherein the apparatus is applied to a user terminal, the apparatus comprising:

the device comprises a first acquisition module, a second acquisition module and a special effect adding module, wherein the first acquisition module is used for acquiring a video to be processed, and the video to be processed comprises a plurality of video frames to be processed;

the second acquisition module is used for acquiring a plurality of reference video frames corresponding to the special-effect video and determining a target video frame matched with each reference video frame from the plurality of video frames to be processed;

and the special effect adding module is used for determining a target time period from the plurality of target video frames and adding a special effect to the video in the target time period based on the special effect video.

10. An electronic device, comprising:

one or more processors;

a memory;

one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the method of any of claims 1-8.

11. A computer-readable storage medium, having stored thereon program code that can be invoked by a processor to perform the method according to any one of claims 1 to 8.

Technical Field

The present application relates to the field of video processing technologies, and in particular, to a video processing method and apparatus, an electronic device, and a readable storage medium.

Background

With the improvement of people's living standards, terminal application programs have become an indispensable part of daily life. To meet users' needs, the functions of application programs are becoming increasingly powerful, and interactive functions have become one of their common features. Currently, users can shoot or publish videos through applications anytime and anywhere. To make the display effect of a video richer and more distinctive, a user can add special effects to the video. However, the existing process for adding special effects is cumbersome, and a special effect can be added successfully only through multiple user operations. Therefore, how to add special effects to videos more conveniently and accurately is a technical problem that urgently needs to be solved.

Disclosure of Invention

The present application proposes a video processing method, apparatus, electronic device and readable storage medium to address the above drawbacks.

In a first aspect, an embodiment of the present application provides a video processing method, where the method includes: acquiring a video to be processed, wherein the video to be processed comprises a plurality of video frames to be processed; acquiring a plurality of reference video frames corresponding to a special effect video, and determining a target video frame matched with each reference video frame from the plurality of video frames to be processed; determining a target time period from a plurality of the target video frames, and adding a special effect to the video within the target time period based on the special effect video.

In a second aspect, an embodiment of the present application further provides a video processing apparatus, where the apparatus includes a first acquisition module, a second acquisition module and a special effect adding module. The first acquisition module is used for acquiring a video to be processed, where the video to be processed comprises a plurality of video frames to be processed. The second acquisition module is used for acquiring a plurality of reference video frames corresponding to the special effect video and determining a target video frame matched with each reference video frame from the plurality of video frames to be processed. The special effect adding module is used for determining a target time period from the plurality of target video frames and adding a special effect to the video in the target time period based on the special effect video.

In a third aspect, an embodiment of the present application further provides an electronic device, including one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the above-described method.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where a program code is stored in the computer-readable storage medium, and the program code can be called by a processor to execute the above method.

According to the video processing method, apparatus, electronic device and readable storage medium provided herein, the accuracy of video processing can be improved to a certain extent by acquiring target video frames matched with reference video frames. Specifically, a video to be processed is acquired, where the video to be processed may comprise a plurality of video frames to be processed; a plurality of reference video frames corresponding to the special effect video are then acquired, and a target video frame matched with each reference video frame is determined from the plurality of video frames to be processed; finally, a target time period is determined from the plurality of target video frames, and a special effect is added to the video in the target time period based on the special effect video. After the plurality of reference video frames corresponding to the special effect video are obtained, the video frames to be processed are matched with the reference video frames, and an accurate target time period is determined based on the matched target video frames, so that the special effect can be added more accurately.

Additional features and advantages of embodiments of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of embodiments of the present application. The objectives and other advantages of the embodiments of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.

Fig. 1 illustrates a method flow diagram of a video processing method provided by an embodiment of the present application;

FIG. 2 is a diagram illustrating an example of a video frame in a video processing method according to an embodiment of the present application;

fig. 3 is a diagram illustrating a comparison example of different video frames in the same video to be processed in a video processing method according to an embodiment of the present application;

fig. 4 is a diagram illustrating a comparison example of different video frames in the same special effect video in a video processing method according to an embodiment of the present application;

fig. 5 is a diagram illustrating an example of a position change of a center point when a human body performs a rotation action in a video processing method according to an embodiment of the present application;

FIG. 6 is a diagram illustrating an example of a change in a position of a center point when a person takes a jump in a video processing method according to an embodiment of the present application;

fig. 7 is a diagram illustrating an example of a video frame obtained after a special effect is added to a video in a target time period by using a special effect video in a video processing method according to an embodiment of the present application;

FIG. 8 is a flow chart of a method of video processing according to another embodiment of the present application;

fig. 9 is a flowchart illustrating a step of step S220 in a video processing method according to another embodiment of the present application;

fig. 10 is a flowchart illustrating another step of step S220 in a video processing method according to another embodiment of the present application;

FIG. 11 is a flow chart of a method of video processing according to yet another embodiment of the present application;

fig. 12 is a flowchart illustrating a step of S330 in a video processing method according to another embodiment of the present application;

fig. 13 is a diagram illustrating an exemplary result of human body detection on a video frame in a video processing method according to another embodiment of the present application;

fig. 14 is a diagram illustrating an example of obtaining a target detection box in a video processing method according to another embodiment of the present application;

FIG. 15 is a diagram illustrating an example of a human joint in a video processing method according to yet another embodiment of the present application;

fig. 16 is a block diagram illustrating a structure of a video processing apparatus according to an embodiment of the present application;

fig. 17 shows a block diagram of an electronic device provided in an embodiment of the present application;

fig. 18 illustrates a storage unit provided in an embodiment of the present application and used for storing or carrying program codes for implementing a video processing method according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

At present, a user can shoot or publish videos through an application program anytime and anywhere, and to make the display effect of the videos richer and more distinctive, the user can add special effects to the videos. However, the existing process for adding special effects is cumbersome, and a special effect can be added successfully only through multiple user operations. Although some techniques exist for automatically adding action special effects, these techniques typically use only the initial action as the reference standard. For example, when the current motion of the human body is the same as a preset motion, the addition of the special effect is triggered. However, using only the initial action as the reference standard cannot add a special effect to a video accurately: some videos share the same initial action but differ in their subsequent actions, so if only the initial action serves as the reference, the special effect cannot be added accurately.

In view of the above problems, the inventor proposes the video processing method, apparatus, electronic device and storage medium of the embodiments of the present application, which can improve the accuracy of video processing to a certain extent by obtaining target video frames matched with reference video frames. Specifically, a to-be-processed video is obtained, where the to-be-processed video may include a plurality of to-be-processed video frames; a plurality of reference video frames corresponding to a special effect video are obtained, and a target video frame matched with each reference video frame is determined from the plurality of to-be-processed video frames; finally, a target time period is determined from the plurality of target video frames, and a special effect is added to the video in the target time period based on the special effect video. After the plurality of reference video frames corresponding to the special effect video are obtained, the to-be-processed video frames are matched with the reference video frames, and an accurate target time period is determined based on the matched target video frames, so that the special effect can be added more accurately. The specific video processing method is described in detail in the following embodiments.

Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a video processing method according to an embodiment of the present application. In a specific embodiment, the video processing method is applied to the video processing apparatus 400 shown in fig. 16 and the electronic device 500 shown in fig. 17. As will be described in detail with respect to the flow shown in fig. 1, the video processing method may specifically include steps S110 to S130.

Step S110: the method comprises the steps of obtaining a video to be processed, wherein the video to be processed comprises a plurality of video frames to be processed.

The video processing method provided in the embodiment of the present application may be applied to an electronic device, and in the embodiment of the present application, the electronic device may be a cellular phone, a smart speaker, a smart watch, a portable computer, a handheld communication device, a handheld computing device, a satellite radio, a global positioning system, a Personal Digital Assistant (PDA), and/or any other suitable device for communicating on a wireless communication system, and the like, but is not limited thereto.

In some embodiments, the to-be-processed video may be obtained by the electronic device from a server, that is, the to-be-processed video may be downloaded by the electronic device from the server, or may be obtained by the electronic device online from the server. For example, the video to be processed may be a video downloaded by the electronic device through the installed video playing software. The server may be a cloud server.

In other embodiments, the video to be processed may be obtained locally from the electronic device, and the video to be processed may be video data that is downloaded by the electronic device in advance and stored in a local memory, or may be video data that is captured by the electronic device using an image capturing apparatus configured by the electronic device itself. In addition, when the to-be-processed video is acquired by the electronic device from another electronic device, the to-be-processed video may be transmitted to the electronic device by the other electronic device through a wireless communication protocol, for example, through a Wlan protocol, a bluetooth protocol, a ZigBee protocol, a WiFi protocol, or the like, or may be transmitted to the electronic device by the other electronic device through a data network, for example, a 2G network, a 3G network, a 4G network, or a 5G network, and the like, which is not limited herein.

In this embodiment of the present application, a video to be processed may include a plurality of video frames to be processed; each video frame to be processed may include different objects, and the poses of the objects and their corresponding pixel values may also differ. An object included in a video frame to be processed may be a human body, an animal, or another thing. Moreover, the same video frame to be processed may include a plurality of objects, and the types of these objects may be the same or different. For example, the same video frame to be processed may include a plurality of human bodies, and the postures, expressions, genders, positions in the video frame, and the like of these human bodies may differ. As shown in fig. 2, 100 in fig. 2 may be one frame image among a plurality of video frames to be processed; the video frame shown in fig. 2 includes three human bodies 101, 102 and 103, and it can be seen from fig. 2 that the positions, postures and heights of the three human bodies in 100 are all different.

In addition, the actions and postures of the same human body differ across different video frames to be processed. For a clearer understanding, the actions of the same human body in different video frames to be processed are shown in fig. 3. 104, 105 and 106 in fig. 3 are different to-be-processed video frames of the same to-be-processed video; the human bodies contained in them are the same human body, but the actions of the human body differ across frames. For example, in the video frame to be processed 104, the arm and foot of the human body are at a first position; in the video frame to be processed 105, the arm and foot are at a second position; and in the video frame to be processed 106, the arm and foot move to a third position. These video frames to be processed together constitute the video to be processed.

Step S120: obtaining a plurality of reference video frames corresponding to the special effect video, and determining a target video frame matched with each reference video frame from the plurality of video frames to be processed.

As a mode, after a video to be processed is obtained, a standard video, which is a video including a special effect, that is, a special effect video, may be obtained in the embodiment of the present application, where the special effect video is composed of an original video and a special effect action. In order to more clearly understand each special effect video frame in the special effect video, the embodiment of the present application provides an example diagram as shown in fig. 4, where 11, 12, and 13 in fig. 4 are different special effect video frames under the same special effect video, and the human bodies included in the special effect video frames are the same human body, but the actions of the human body in the different special effect video frames are different. For example, in the special effect video frame 11, the arm, the foot, and the like of the human body are at the first position, in the special effect video frame 12, the arm, the foot, and the like of the human body are at the second position, and the position where the arm, the foot, and the like of the human body are located in the special effect video frame 13 becomes the third position, and these special effect video frames together constitute the special effect video. In addition, as can be seen from fig. 4, when the human body movement is different, the special effect movement in the corresponding special effect video frame is different, for example, the special effect 14 in the special effect video frame 12 is larger than the special effect 14 in the special effect video frame 11.

In addition, the electronic device can store a plurality of standard videos, and the original videos and special effect actions contained in the standard videos differ. For example, in standard video 1 the special effect action 1 is added when the human body jumps up, while in standard video 2 the special effect action 2 is added when the human body rotates. Therefore, a standard video may also be referred to as a video containing an action special effect. After acquiring the standard video, the embodiment of the present application may determine the special effect video from the standard video according to the special effect type and the special effect start/stop time input by the user.

In the embodiment of the application, the special effect video may correspond to a plurality of reference video frames. Therefore, after the video to be processed is obtained, the embodiment of the application may first obtain the plurality of reference video frames corresponding to the special effect video. The reference video frames may also be referred to as standard special effect video frames, and may be obtained by the electronic device by analyzing the plurality of special effect video frames in the special effect video. Specifically, the reference video frames may be a plurality of representative frames selected by the electronic device through analysis of the joint point information, motion change information, and the like of the target object in each special effect video. In addition, first reference pose information and second reference pose information can be acquired from the reference video frames, where the first reference pose information may also be called the lowest comparison parameter, and the second reference pose information may be called the trigger comparison parameter. Specifically, the lowest comparison parameter and the trigger comparison parameter may include angle information and distance information corresponding to a plurality of human body joint points.
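The pose information built from joint points, as described above, can be sketched as a vector of joint angles and distances. The following is a minimal illustration only: the joint names, the choice of angles and distances, and the shoulder-width normalization are illustrative assumptions, not values prescribed by this application.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at joint b formed by points a-b-c, each an (x, y) tuple."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def pose_vector(keypoints):
    """Build a candidate pose descriptor (angle and distance information) from
    named 2D keypoints. Normalizing distances by shoulder width makes the
    descriptor scale-independent; this normalization is an assumption."""
    shoulder_w = math.hypot(
        keypoints["l_shoulder"][0] - keypoints["r_shoulder"][0],
        keypoints["l_shoulder"][1] - keypoints["r_shoulder"][1],
    ) or 1.0
    angles = {
        "l_elbow": joint_angle(keypoints["l_shoulder"], keypoints["l_elbow"], keypoints["l_wrist"]),
        "l_knee": joint_angle(keypoints["l_hip"], keypoints["l_knee"], keypoints["l_ankle"]),
    }
    distances = {
        "wrist_to_hip": math.hypot(
            keypoints["l_wrist"][0] - keypoints["l_hip"][0],
            keypoints["l_wrist"][1] - keypoints["l_hip"][1],
        ) / shoulder_w,
    }
    return {"angles": angles, "distances": distances}
```

Such a descriptor can then serve as either the lowest comparison parameter or the trigger comparison parameter, depending on which reference frame it was extracted from.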

In some embodiments, after acquiring the plurality of reference video frames corresponding to the special effect video, the electronic device may determine, from the plurality of to-be-processed video frames, a target video frame matching each reference video frame. Specifically, the electronic device may match the to-be-processed video frames in the to-be-processed video with the reference video frames one by one. For example, the first frame of the video to be processed is taken as the current video frame to be processed and matched against the starting reference video frame of the plurality of reference video frames. If the matching degree of the two is smaller than the preset matching degree, the second frame of the video to be processed is taken as the current video frame to be processed. If the matching degree of the current video frame to be processed and the starting reference video frame is greater than the preset matching degree, the next video frame to be processed is acquired from the video to be processed and matched against the next reference video frame after the starting reference video frame, and so on, until all the video frames to be processed that match the reference video frames are found; these video frames to be processed are taken as the target video frames.
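The sequential scan described above can be sketched as a single pass over the to-be-processed frames with an advancing reference index. This is a sketch under assumptions: `match_degree` stands in for whatever pose comparison is used, and the threshold value is illustrative.

```python
def find_target_frames(frames, references, match_degree, threshold=0.8):
    """Scan to-be-processed frames in order; each time the current frame's
    matching degree with the current reference frame exceeds the threshold,
    record it as a target frame and advance to the next reference frame."""
    targets = []
    ref_idx = 0
    for frame in frames:
        if ref_idx >= len(references):
            break  # every reference frame has been matched
        if match_degree(frame, references[ref_idx]) > threshold:
            targets.append(frame)
            ref_idx += 1
    # Only a complete match against all reference frames yields target frames.
    return targets if ref_idx == len(references) else []
```

Because the reference index only advances on a match, the target frames necessarily appear in the same order as the reference frames, which is what later allows a contiguous target time period to be derived from them.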

In this embodiment of the application, matching a video frame to be processed with a reference video frame may mean determining whether the pose information of the human body in the video frame to be processed matches that in the reference video frame, whether the joint information of the human body in the two frames matches, or whether the pixel values in the two frames match. The specific matching information will be described in detail in the following embodiments and is not repeated here.

In other embodiments, the number of reference video frames may be obtained by analyzing the human body motion in the special effect video. For example, if the motion of the human body in the special effect video is a rotational motion, the number of reference video frames may be 4. In a specific embodiment, the number of reference video frames may be determined according to the position change of the center point of the human body, where the center point of the human body may be the center position of the left and right shoulders, that is, the midpoint between the two shoulders. For example, when the human body performs a rotation action, the position change of the center point is shown in fig. 5; as can be seen from fig. 5, when the human body rotates, the position change of the center point can be divided into four parts, denoted 0, 1, 2 and 3, so the number of reference video frames in this case may be 4. For another example, when the human body jumps, the position change of the center point may be as shown in fig. 6; as can be seen from fig. 6, the jump can also be divided into four parts, denoted 1, 2, 3 and 4, so the number of reference video frames in this case may also be 4.

In some embodiments, after the electronic device obtains the multiple reference video frames, the to-be-processed video frames may be respectively matched with each reference video frame, and a target video frame matched with each reference video frame is determined from the multiple to-be-processed video frames of the to-be-processed video, and then the electronic device may determine a target time period from the multiple target video frames and add a special effect to the video in the target time period based on the special effect video, that is, enter step S130.

Step S130: determining a target time period from a plurality of the target video frames, and adding a special effect to the video within the target time period based on the special effect video.

As one way, after the electronic device determines a target video frame matching each reference video frame from a plurality of to-be-processed video frames, it may determine a target time period from the plurality of target video frames and add a special effect to the video within the target time period based on the special effect video.

In some embodiments, after acquiring the plurality of target video frames, the electronic device may determine a starting video frame and an end point video frame from them. Specifically, the position of each target video frame in the video to be processed is obtained, the earliest target video frame is taken as the starting video frame, and the latest target video frame is taken as the end point video frame; a starting time point corresponding to the starting video frame and an end time point corresponding to the end point video frame are then determined. Finally, the target time period is formed by the starting time point and the end time point, that is, the period between the two time points constitutes the target time period. The target time period contains a plurality of to-be-processed video frames; among them are video frames matching the special effect video frames, and there may also be a few video frames that do not match, but these unmatched frames do not affect the addition of the special effect. Here, matching refers to the action of a first target object in a to-be-processed video frame matching the action of a second target object in a special effect video frame.
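The derivation of the target time period from the earliest and latest target frames can be sketched as follows, assuming frames are identified by their index in the to-be-processed video and a known frame rate; `target_time_period` is a hypothetical helper name, not taken from the embodiment.

```python
def target_time_period(target_indices, fps):
    """Derive the target time period from the positions (frame indices)
    of the matched target video frames in the to-be-processed video:
    the earliest index gives the starting time point, the latest gives
    the end time point."""
    start_idx, end_idx = min(target_indices), max(target_indices)
    return start_idx / fps, end_idx / fps  # (start, end) in seconds
```

For the example of fig. 3, the indices 104, 105 and 106 at 25 fps would yield a period from 4.16 s to 4.24 s.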

In other embodiments, after acquiring the plurality of target video frames, the electronic device may determine only a starting video frame from them. Specifically, the position of each target video frame in the video to be processed is obtained, the earliest target video frame is taken as the starting video frame, and the starting time corresponding to the starting video frame is determined. On this basis, the time length of the special effect video can be determined; since the special effect video is known, its time length is fixed. After obtaining the time length of the special effect video, the electronic device can determine the target time period by combining the time length with the starting time.

In other embodiments, after the target time period is obtained, the electronic device may add a special effect to the video within the target time period using the special effect video. Specifically, the special effect actions in the special effect video may be added correspondingly to the to-be-processed video frames in the target time period. For example, a first special effect action may be added at the starting time within the target time period, a second special effect action may be added at an intermediate time point, and so on. For example, the acquired target video frames include 104, 105 and 106 as shown in fig. 3, that is, the three video frames 104, 105 and 106 lie within the target time period, where the time point corresponding to video frame 104 may be the starting time and the time point corresponding to video frame 106 may be the end time. The final target video obtained by adding the special effect to the video frames in the target time period may be as shown in fig. 7; this final target video may also be referred to as the video after the special effect is added. Comparing fig. 4 and fig. 7, the special effect actions included in the video after the special effect is added are the same as those included in the special effect video, but the human body, objects and scenes contained in the two videos may differ, and the identity, expression, clothing and the like of the human body may also differ.

As can be seen from fig. 7, compared with the target video frame before the special effect is added, the target video frame after the special effect is added carries the special effect 107, and the special effects 107 included in different special effect videos differ. As can also be seen from fig. 7, the size of the special effect 107 changes as the action changes, that is, the special effect 107 added to video frame 105 has a larger area than the special effect 107 added to video frame 104. In other words, the same special effect is added to different target video frames, but its size, shape, area and the like change with the action.

According to the video processing method provided by the embodiment of the application, the accuracy of video processing can be improved to a certain extent by obtaining the target video frames matched with the reference video frames. Specifically, a video to be processed is obtained, which may comprise a plurality of to-be-processed video frames; a plurality of reference video frames corresponding to the special effect video are obtained, and a target video frame matching each reference video frame is determined from the plurality of to-be-processed video frames; finally, a target time period is determined from the plurality of target video frames, and a special effect is added to the video within the target time period based on the special effect video. After the multiple reference video frames corresponding to the special effect video are obtained, the to-be-processed video frames are matched against the reference video frames, and an accurate target time period is determined based on the matched target video frames, so that the special effect can be added more accurately.

Referring to fig. 8, the video processing method according to another embodiment of the present application may include steps S210 to S240.

Step S210: the method comprises the steps of obtaining a video to be processed, wherein the video to be processed comprises a plurality of video frames to be processed.

Step S220: obtaining a plurality of reference video frames corresponding to the special effect video, and determining a target video frame matched with each reference video frame from the plurality of video frames to be processed.

In an embodiment of the present application, the plurality of reference video frames may include a starting reference video frame, and the plurality of target video frames include a starting video frame matching the starting reference video frame. As can be seen from the above description, the multiple reference video frames may include first reference pose information and second reference pose information; in the embodiment of the present application, a target video frame matching each reference video frame may be obtained by considering the first reference pose information and the second reference pose information together, as detailed in fig. 9 and fig. 10. As shown in fig. 9, obtaining the plurality of reference video frames corresponding to the special effect video and determining a target video frame matching each reference video frame from the plurality of to-be-processed video frames may include steps S221 to S223.

Step S221: and acquiring a plurality of reference video frames corresponding to the special effect video.

The above embodiments have described in detail the obtaining of multiple reference video frames corresponding to a special effect video, and are not described herein again.

Step S222: performing pose detection on each to-be-processed video frame to obtain a plurality of candidate pose information.

As one way, after acquiring the plurality of to-be-processed video frames of the video to be processed, the electronic device may perform pose detection on them to obtain a plurality of candidate pose information, where the pose information may be human body pose information. The human body pose may be any of various postures, such as standing, stooping, sitting or half-lying. In addition, the human body pose may include a head pose, a trunk pose, a foot pose and the like, and the overall human body pose may be analyzed by combining these poses.

In the embodiment of the application, the electronic device can perform pose detection on the to-be-processed video frames using an artificial intelligence algorithm, and then acquire the candidate pose information corresponding to each to-be-processed video frame. Specifically, a lightweight pose estimation network (SSN) may be utilized, or a neural network such as a pose detection network (PoseCNN) may be utilized, to perform pose detection on each to-be-processed video frame in the video to be processed, so as to obtain the plurality of candidate pose information.

Step S223: matching each candidate pose information respectively against the first reference pose information of the plurality of reference video frames, and taking the to-be-processed video frame corresponding to candidate pose information whose matching degree is greater than the preset matching degree as a target video frame.

In the embodiment of the application, the to-be-processed video frame corresponding to candidate pose information whose matching degree with the first reference pose information of the starting reference video frame is greater than the preset matching degree is taken as the starting video frame. After the candidate pose information of each to-be-processed video frame is obtained, each candidate pose information can be matched respectively against the first reference pose information of the plurality of reference video frames, and the to-be-processed video frame corresponding to candidate pose information whose matching degree is greater than the preset matching degree is taken as a target video frame.

In this embodiment of the application, the first reference pose information may also be referred to as the lowest contrast parameter, and is mainly used to determine whether the human body in the video to be processed performs the fixed action to which the special effect is to be added; that is, the first reference pose information provides an automation parameter for the automatic special effect adding process. In addition, the first reference pose information serves to add the action effect well even when the action in the video to be processed is not standard enough, or differs somewhat from the standard action, thereby providing a better user experience.

In other words, by using the first reference pose information, the embodiment of the application can not only automatically add a special effect to the video to be processed, but also add the special effect as long as the action of the human body is approximately similar to the standard action, which improves the efficiency of adding special effects. For example, suppose a special effect is added when the user's arm is placed horizontally. In the embodiment of the application, even if the arm is not perfectly horizontal, the special effect can still be added automatically as long as the included angle between the arm and the horizontal direction is within a preset angle range; if the preset angle range is 30 degrees, the special effect is added whenever the included angle between the arm and the horizontal direction is within 30 degrees.
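The tolerance check in the horizontal-arm example can be sketched as a simple angle comparison. The function name `arm_pose_matches` and the signed-angle representation are assumptions for illustration.

```python
def arm_pose_matches(arm_angle_deg, tolerance_deg=30.0):
    """Accept a nominally horizontal arm pose when the included angle
    between the arm and the horizontal direction is within the preset
    angle range (here 30 degrees, per the example in the text)."""
    return abs(arm_angle_deg) <= tolerance_deg
```

An arm tilted by 12 degrees would still trigger the special effect, while one tilted by 45 degrees would not.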

In other embodiments, the format of the candidate pose information and the first reference pose information may be the same. For example, if the candidate posture information includes the bending angle of each joint of the human body, the first reference posture information includes the bending angle of each joint of the human body.

In other embodiments, in order to more accurately implement adding of a special effect, that is, to avoid false adding of the special effect, after the target video frame is acquired by using the plurality of first reference pose information, the embodiment of the present application may also add the special effect to a starting video frame in the plurality of target video frames by using the second reference pose information, please refer to fig. 10 in detail, and it is known from fig. 10 that step S220 may further include step S224 to step S226.

Step S224: and determining a preset number of candidate video frames based on the starting video frame, and acquiring attitude information corresponding to each candidate video frame.

As can be appreciated from the above description, the plurality of reference video frames includes a starting reference video frame, and the plurality of target video frames includes a starting video frame matching the starting reference video frame. After the electronic device matches each candidate pose information in the video to be processed against the first reference pose information of the multiple reference video frames to obtain the multiple target video frames, it can add a special effect based on the starting video frame of the target video frames. Specifically, the electronic device may determine a preset number of candidate video frames based on the starting video frame, and obtain the pose information corresponding to each candidate video frame, where the preset number may be a first preset number.

In some embodiments, the first preset number of candidate video frames may be acquired uniformly before and after the starting video frame, and include the starting video frame itself. In this case the first preset number is odd: for example, the frame immediately before the starting video frame, the frame immediately after it, and the starting video frame itself may all be taken as candidate video frames, giving a first preset number of 3; for another example, the 2 frames before, the 2 frames after, and the starting video frame may all be taken as candidate video frames, giving a first preset number of 5. Other values are also possible and are not enumerated here.

In other embodiments, the first preset number of candidate video frames may also be acquired non-uniformly before and after the starting video frame, again including the starting video frame itself. For example, the 2 frames before the starting video frame, the 1 frame after it, and the starting video frame itself may all be taken as candidate video frames, giving a first preset number of 4; for another example, the 3 frames before, the 1 frame after, and the starting video frame may all be taken as candidate video frames, giving a first preset number of 5.
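Both the uniform and non-uniform window selections above reduce to picking a span of frames around an anchor frame (the starting or end point video frame). A minimal sketch, with the assumed helper name `candidate_window`:

```python
def candidate_window(frames, anchor_idx, before, after):
    """Collect candidate frames around an anchor frame: `before` frames
    preceding it, `after` frames following it, and the anchor itself.
    The window is clipped at the boundaries of the video."""
    lo = max(0, anchor_idx - before)
    hi = min(len(frames), anchor_idx + after + 1)
    return frames[lo:hi]
```

With `before=1, after=1` this yields the uniform 3-frame window; with `before=2, after=1` it yields the non-uniform 4-frame window from the example.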

It should be noted that, when a preset number of candidate video frames are determined based on the starting video frame and the pose information corresponding to each candidate video frame is obtained, the embodiment of the present application may likewise determine a preset number of candidate video frames based on the end point video frame and obtain the pose information corresponding to each of them. The plurality of reference video frames includes a termination reference video frame, and the plurality of target video frames includes an end point video frame matching the termination reference video frame, where the end point video frame is one of the plurality of to-be-processed video frames. After the electronic device matches each candidate pose information in the video to be processed against the first reference pose information of the multiple reference video frames to obtain the multiple target video frames, it can add a special effect based on the starting video frame and also based on the end point video frame of the target video frames. Specifically, the electronic device may determine a second preset number of candidate video frames based on the end point video frame, and obtain the pose information corresponding to each candidate video frame.

In some embodiments, the second preset number of candidate video frames may be acquired uniformly before and after the end point video frame, and include the end point video frame itself. In this case the second preset number is odd: for example, the frame immediately before the end point video frame, the frame immediately after it, and the end point video frame itself may all be taken as candidate video frames, giving a second preset number of 3; for another example, the 2 frames before, the 2 frames after, and the end point video frame may all be taken as candidate video frames, giving a second preset number of 5. Other values are also possible and are not enumerated here.

In other embodiments, the second preset number of candidate video frames may also be acquired non-uniformly before and after the end point video frame, again including the end point video frame itself. For example, the 2 frames before the end point video frame, the 1 frame after it, and the end point video frame itself may all be taken as candidate video frames, giving a second preset number of 4; for another example, the 3 frames before, the 1 frame after, and the end point video frame may all be taken as candidate video frames, giving a second preset number of 5. The first preset number and the second preset number may be the same or different; their specific values are not limited here and may be selected according to the actual situation.

Step S225: comparing the pose information corresponding to each candidate video frame respectively with the second reference pose information, and taking pose information whose similarity is greater than the preset similarity as target pose information.

In the embodiment of the present application, the reference video frame corresponding to the second reference pose information is adjacent to the reference video frame corresponding to the first reference pose information. The second reference pose information may also be called the trigger contrast parameter; it is mainly used to compare the human body action in the video to be processed with the standard action, and finally to determine whether the human body action has completed each state, so as to judge whether the special effect can be added automatically. In other words, more accurate start and end times can be acquired through the second reference pose information, further improving the accuracy of special effect addition.

In addition, the second reference pose information in this embodiment may be reference pose information corresponding to the starting point reference video frame, that is, the electronic device may compare pose information corresponding to the candidate video frame with the second reference pose information corresponding to the starting point reference video frame, and use, as the target pose information, pose information having a similarity greater than a preset similarity, where the candidate video frame and the target pose information both correspond to the starting video frame.

In addition, the second reference pose information in this embodiment may be reference pose information corresponding to an endpoint reference video frame, that is, the electronic device may compare pose information corresponding to the candidate video frame with second reference pose information corresponding to the endpoint reference video frame, and use pose information with a similarity greater than a preset similarity as target pose information, where the candidate video frame and the target pose information both correspond to the endpoint video frame.

In some embodiments, comparing the pose information corresponding to each candidate video frame with the second reference pose information may be done by obtaining the difference between the two and determining whether that difference is smaller than a preset difference. If it is, the similarity between the candidate video frame and the second reference pose information is considered greater than the preset similarity, and the pose information in question is taken as the target pose information.
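The difference-threshold comparison can be sketched as below, treating pose information as a scalar for simplicity; the helper name `is_target_pose` is an assumption, and real pose information would be a vector compared with a suitable distance metric.

```python
def is_target_pose(candidate_pose, second_ref_pose, preset_difference):
    """Pose information is treated as matching the second reference pose
    when their difference is smaller than the preset difference, which
    corresponds to the similarity exceeding the preset similarity."""
    return abs(candidate_pose - second_ref_pose) < preset_difference
```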

As one way, before comparing the pose information corresponding to each candidate video frame with the second reference pose information, the embodiment of the present application may first obtain a plurality of pieces of second reference pose information. Specifically, the special effect video is decomposed into a plurality of state videos, and the second reference pose information corresponding to each state video is obtained. In the embodiment of the present application, the number of state videos may correspond to the number of reference video frames, that is, one state video may correspond to one reference video frame. Thus, decomposing the special effect video into a plurality of state videos may include: acquiring the position change of the human body center point, and decomposing the special effect video into a plurality of state videos based on that position change.

In a specific embodiment, the number of state videos may be determined according to the position change of the center point of the human body, where the center point of the human body may be the center position of the left and right shoulders, that is, the midpoint between the two shoulders. For example, when the human body performs a rotation action, the position change of the center point is shown in fig. 5; as can be seen from fig. 5, when the human body rotates, the position change of the center point can be divided into four parts, denoted 0, 1, 2 and 3, so the number of state videos in this case may be 4. For another example, when the human body jumps, the position change of the center point may be as shown in fig. 6; as can be seen from fig. 6, the jump can also be divided into four parts, denoted 1, 2, 3 and 4, so the number of state videos in this case may also be 4.

In summary, decomposing the special effect video into a plurality of state videos mainly means dividing the special effect video into a preset number of sub-videos, where the preset number may be determined by the position change of the center point of the human body. In addition, after deciding to decompose the special effect video into a plurality of state videos, the embodiment of the present application may divide the special effect video equally into the preset number of state videos. For example, if it is determined to decompose the special effect video into 4 state videos, the number of video frames contained in each state video may be the same. By decomposing the special effect video into a plurality of state videos, actions similar or identical to the start and end states can be effectively excluded, which reduces information interference to a certain extent.

In the embodiment of the present application, after the special effect video is decomposed into the plurality of state videos, the second reference pose information corresponding to each state video may be acquired. As an example, after a special effect video is acquired, it is decomposed into 4 (or 6) state videos; then, at each state-video switching point, the switching frame and one frame before and after it are selected, the pose information of these three video frames is acquired respectively and averaged, and the resulting value is the second reference pose information corresponding to that state video. It should be noted that the special effect video may be decomposed into 4 state videos, 6 state videos, or 8 state videos; the specific number of states is not limited here and may be selected according to the actual situation.
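The three-frame averaging at a state switching point can be sketched as follows, again treating pose information as a scalar per frame; `second_reference_pose` is a hypothetical helper name for illustration.

```python
def second_reference_pose(pose_of, switch_idx):
    """Average the pose information of the frame at a state switching
    point and of the frames immediately before and after it; the mean
    is the second reference pose information for that state video."""
    frames = [switch_idx - 1, switch_idx, switch_idx + 1]
    return sum(pose_of[i] for i in frames) / 3.0
```

Averaging over the three frames smooths out single-frame detection noise around the switching point.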

As another way, before the special effect video is decomposed into a plurality of state videos, the embodiment of the present application may determine the data of the special effect video, then determine whether the data format of the special effect video is a specified format, and if not, convert it into the specified format, where the specified format includes at least one of the type of the special effect video, the start time, the end time, and a filtering control frame number, the filtering control frame number being used for de-jittering.

As can be seen from the above description, the special effect video is contained in the standard video, the standard video being a video to which a special effect has been added, and the start time and end time of the special effect may be its start and end times in the standard video frames. The start time, end time and filtering control frame number of the special effect video can be obtained by manual annotation, that is, the special effect annotation information of each action can be edited into a fixed format by hand. The type of the special effect video is used to distinguish different action special effects; the start and end times represent when the special effect action begins and ends; and the filtering control frame number, which may also be called stability control, is mainly used to avoid the influence of anomalies on video processing. For example, when the filtering control frame number is 3, the electronic device may randomly acquire 3 video frames; if the target object varies greatly across these 3 frames, an anomaly may exist, and to avoid the negative effect of such an anomaly, the embodiment of the present application may average the information of multiple video frames in the special effect video.

As can be understood from the above description, before decomposing the special effect video into a plurality of state videos, the embodiment of the present application may also determine whether a data format of the special effect video is a specified format, where the specified format may include at least one of a type of the special effect video, a start time, an end time, and a filtering control frame number of the special effect video. In other words, when determining data of a special effect video, the embodiment of the present application may determine only a start-stop time, may determine both the start-stop time and a type of the special effect video, or may determine the start-stop time, the type of the special effect video, and a filtering control frame number at the same time. The specific determination of which data is not specifically limited herein, and may be selected according to actual situations.

As another way, before each candidate pose information is matched against the first reference pose information of the multiple reference video frames, the embodiment of the present application may first derive the multiple pieces of first reference pose information from the multiple pieces of second reference pose information. Specifically, the second reference pose information of each pair of adjacent state videos is compared, so as to obtain the first reference pose information corresponding to each state video.

Specifically, if the difference between the second reference pose information of the next state video and that of the previous state video is greater than a first difference value (the parameter increases), the product of the second reference pose information of the previous state video and a first specified value is used as the first reference pose information corresponding to the previous state video, the first specified value being smaller than 1. In addition, if that difference is smaller than a second difference value (the parameter decreases), the product of the second reference pose information of the previous state video and a second specified value is used as the first reference pose information corresponding to the previous state video, the second specified value being larger than 1.

As an example, four state videos are acquired, denoted state1, state2, state3 and state4, whose second reference posture information is a1, a2, a3 and a4, respectively. The parameter change during each state switch is calculated separately, such as the change from parameter a1 in state1 to parameter a2 in state2, which may be an increase or a decrease. The electronic device may determine a parameter value before the state change based on this parameter change, where that value is the first reference posture information (the lowest comparison parameter). For example, from state1 to state2, the second reference posture information a2 increases relative to a1; at this time, the first reference posture information b1 corresponding to a1 is equal to a1 × 0.8, where 0.8 may be a hyper-parameter obtained through experimental tuning. For another example, from state2 to state3, the second reference posture information a3 decreases relative to a2, and the first reference posture information b2 corresponding to a2 is equal to a2 × 1.3, where 1.3 is also a hyper-parameter.
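The derivation rule illustrated above can be sketched as follows, simplifying each piece of second reference posture information to a single scalar (the real information would be a set of angles and distances); the factors 0.8 and 1.3 are the hyper-parameters from the example:

```python
def derive_first_reference(second_refs, dec_factor=0.8, inc_factor=1.3):
    """Derive first reference values from consecutive second reference
    values: scale down before an increasing state switch, scale up
    before a decreasing one."""
    first_refs = []
    for prev, nxt in zip(second_refs, second_refs[1:]):
        if nxt > prev:                              # parameter increases at the switch
            first_refs.append(prev * dec_factor)    # first specified value < 1
        else:                                       # parameter decreases at the switch
            first_refs.append(prev * inc_factor)    # second specified value > 1
    return first_refs
```

For the example above, `derive_first_reference([a1, a2, a3, a4])` would yield `[b1, b2, b3]`, with `b1 = a1 × 0.8` whenever `a2 > a1`.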

As can be seen, if the second reference attitude information of the latter state video increases compared with that of the former state video, the corresponding first reference attitude information decreases; if it decreases, the corresponding first reference attitude information increases. The two vary in opposite directions.

In summary, the first reference attitude information is mainly used to tolerate actions in the video that are not sufficiently standard or differ from the standard actions, so that action effects can still be added and a better use experience can be provided for users. The second reference attitude information is mainly used to compare the actions in the video to be processed with the standard actions and finally determine whether the human actions complete each action state, so as to judge whether the special effect can be added automatically; in this way, the accuracy of adding the special effect can be improved.

Step S226: and determining a video frame to be processed corresponding to the target attitude information, and updating an initial video frame of the target video frame by using the video frame to be processed.

In some embodiments, after the electronic device acquires the target pose information, it may determine a to-be-processed video frame corresponding to the target pose information, and then update a starting video frame of the target video frame by using the to-be-processed video frame, so as to update the starting time.

Step S230: and taking the time corresponding to the starting video frame as the starting time.

Step S240: and acquiring the target time period according to the starting time, and adding a special effect to the video in the target time period based on the special effect video.

In the embodiment of the application, when the starting time is obtained, the electronic device may determine the target time period by combining the starting time with the duration of the special effect video. Since the duration of the special effect video is known from the foregoing introduction, once the starting time of the starting video frame is determined, the electronic device can determine the target time period from the starting time and that duration.

In other embodiments, the electronic device may also obtain the end point video frame when obtaining the start video frame, and then take a time corresponding to the end point video frame as the end time, where the obtaining process of the end point video frame is similar to that of the start video frame, and is not repeated here. In addition, after the starting time and the ending time are acquired, the target time period may be acquired in the embodiment of the present application.
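Both ways of obtaining the target time period — start time plus the known effect duration, or explicit start and end frames — can be sketched with a hypothetical helper:

```python
def target_period(start_time, effect_duration, end_time=None):
    """Return (start, end) of the target time period: use the end-point
    video frame's time when available, otherwise the start time plus
    the known special-effect duration."""
    end = end_time if end_time is not None else start_time + effect_duration
    return (start_time, end)
```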

According to the video processing method provided by the embodiment of the application, obtaining the target video frames matched with the reference video frames can improve the accuracy of video processing to a certain extent. Specifically, the video to be processed is obtained, where the video to be processed may include a plurality of video frames to be processed; based on this, a plurality of reference video frames corresponding to the special effect video are obtained, and the target video frame matched with each reference video frame is determined from the plurality of video frames to be processed; finally, the target time period is determined from the plurality of target video frames, and the special effect is added to the video within the target time period based on the special effect video. After the multiple reference video frames corresponding to the special effect video are obtained, the video frames to be processed are matched with the reference video frames, and an accurate target time period is determined based on the matched target video frames, so that the special effect can be added more accurately. In addition, actions similar or identical to the start-stop states can be effectively excluded, which reduces information interference to a certain extent, and the special effect can be added flexibly by introducing the first reference attitude information and the second reference attitude information.

Referring to fig. 11, the video processing method according to another embodiment of the present application may include steps S310 to S350.

Step S310: the method comprises the steps of obtaining a video to be processed, wherein the video to be processed comprises a plurality of video frames to be processed.

Step S320: and acquiring a plurality of reference video frames corresponding to the special effect video.

Step S330: and carrying out attitude detection on each video frame to be processed to obtain a plurality of candidate attitude information.

Referring to fig. 12, step S330 may include steps S331 to S333.

Step S331: and carrying out human body detection on each video frame to be processed to obtain a human body detection result.

In this embodiment of the application, the human body detection may adopt the SSD (Single Shot MultiBox Detector) detection algorithm, and the specific detection process may include: a frame selection operation, a frame normalization operation, a CNN (Convolutional Neural Network) training and feature extraction operation, a frame regression operation, a classifier classification operation, a data post-processing operation and the like. Finally, a target detection frame can be obtained, where the information corresponding to the target detection frame is the human body detection result.

In some embodiments, when the video frame to be processed includes a plurality of human bodies, performing human body detection on it may yield a plurality of human body detection frames. For example, performing human body detection on the video frame to be processed shown in fig. 2 may generally yield the diagram shown in fig. 13, where the plurality of human body detection frames are 301, 302 and 303, respectively. However, when the video frame to be processed contains a plurality of human bodies, the normal addition of the special effect is affected. Therefore, the target detection frame can be selected from the plurality of human body detection frames through the frame selection operation, so that the smaller human body detection frames can be removed and the accuracy of human body detection can be improved.

By one approach, the selection process of the target detection box may include: detecting whether the video frame to be processed comprises a plurality of human bodies, if so, acquiring a candidate frame corresponding to each human body, determining a first central point coordinate of each candidate frame, and determining a second central point coordinate of the video frame to be processed. On the basis, the distance between each first central point coordinate and each second central point coordinate can be obtained to obtain the first distance, the target human body is determined from the multiple human bodies according to the first distance, and the candidate frame corresponding to the target human body is used as the target detection frame.

As another mode, the selecting process of the target detection box may further include: determining the distance of the diagonal line of the video frame to be processed to obtain a second distance; obtaining a target distance according to the first distance and the second distance, and obtaining the area corresponding to each human body candidate frame; and determining a target human body from a plurality of human bodies based on the target distance and the area of the candidate frame, and taking the candidate frame of the target human body as a target detection frame.

For a clearer understanding of the acquisition process of the target detection frame, a diagram is given as shown in fig. 14. D1, D2 and D3 in fig. 14 may be first center point coordinates, O is the second center point coordinate of the video frame to be processed, L1 is the distance from D1 to O, L2 is the distance from D2 to O, L3 is the distance from D3 to O, and L1, L2 and L3 may be collectively referred to as the first distance. The distance of the diagonal of the video frame to be processed is L, i.e., the second distance. The target distance can be obtained from the first distance and the second distance, and may be equal to one half of the second distance minus the first distance. Finally, the target detection frame is determined from the plurality of human body detection frames based on the target distance and the area of each candidate frame. In fig. 14, 301 corresponds to area S1, 302 corresponds to area S2, and 303 corresponds to area S3; the target distance corresponding to 301 is (L/2-L1), the target distance corresponding to 302 is (L/2-L2), and the target distance corresponding to 303 is (L/2-L3). The values of (L/2-L1)/S1, (L/2-L2)/S2 and (L/2-L3)/S3 are compared and sorted, and the human body detection frame corresponding to the minimum value is taken as the target detection frame.
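The selection rule above can be sketched as follows (a hypothetical helper; candidate boxes are given as center coordinates plus width and height, and the minimum of (L/2 − Li)/Si wins, as in the example):

```python
import math

def select_target_box(boxes, frame_w, frame_h):
    """Pick the target detection frame from candidate boxes
    (cx, cy, w, h): score each box by (L/2 - Li) / Si, where Li is
    the distance from the box center to the frame center, L is the
    frame diagonal and Si is the box area; the minimum score wins."""
    ox, oy = frame_w / 2, frame_h / 2
    half_diag = math.hypot(frame_w, frame_h) / 2          # L / 2
    def score(box):
        cx, cy, w, h = box
        li = math.hypot(cx - ox, cy - oy)                 # first distance Li
        return (half_diag - li) / (w * h)                 # (L/2 - Li) / Si
    return min(boxes, key=score)
```

With a large centered box and a small peripheral box, the large central box is selected, matching the intent of removing the smaller detection frames.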

In some embodiments, after the target detection frame corresponding to each to-be-processed video frame in the to-be-processed video is acquired, coordinate smoothing may also be performed on these target detection frames. The coordinate smoothing mainly addresses sequences of consecutive frames: because the target detection frames of different to-be-processed video frames contain errors, the frame body may jitter and change between frames. To solve this problem, in the embodiment of the present application the coordinates of each target detection frame may be filtered, and the filtered target detection frames may then be smoothed.

In other embodiments, due to the limitation of the smoothing process, if a recognition error occurs during human detection, a large error may be caused. In order to avoid this problem, when smoothing the filtered target detection frame, the embodiment of the present application may first perform error determination processing on the target detection frame. Therefore, after filtering the plurality of target detection frames, the embodiment of the present application may detect whether each target detection frame is abnormal, and if the target detection frame is abnormal, the target detection frame of the previous video frame to be processed is used as the target detection frame of the current video frame to be processed.

Specifically, detecting whether the candidate frame of the target human body is abnormal may include: obtaining a difference value between the target detection frame corresponding to the current video frame to be processed and the target detection frame corresponding to the previous video frame to be processed, and if the difference value is greater than a preset threshold value, determining that the candidate frame of the target human body is abnormal.

As another way, before obtaining the difference value between the target detection frame corresponding to the current video frame to be processed and that corresponding to the previous video frame to be processed, an embodiment of the present application may obtain a specified number of video frames to be processed and the target detection frames corresponding to them, determine the difference value of the target detection frames corresponding to every two adjacent video frames to be processed, sort these difference values, and take the maximum difference value as the preset threshold value.
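The adaptive threshold and the anomaly replacement described above can be sketched as follows (the box difference measure — a sum of absolute coordinate differences — is an assumption, as no concrete measure is fixed here):

```python
def box_diff(b1, b2):
    """Difference between two boxes (x1, y1, x2, y2): sum of absolute
    coordinate differences (illustrative measure)."""
    return sum(abs(a - b) for a, b in zip(b1, b2))

def smooth_boxes(boxes, warmup=5):
    """Derive the preset threshold as the largest adjacent-frame
    difference over the first `warmup` frames, then replace any box
    whose difference from the previously kept box exceeds it."""
    thresh = max(box_diff(a, b)
                 for a, b in zip(boxes[:warmup], boxes[1:warmup]))
    kept = [boxes[0]]
    for box in boxes[1:]:
        # anomalous frame: reuse the previous frame's target detection frame
        kept.append(kept[-1] if box_diff(kept[-1], box) > thresh else box)
    return kept
```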

Step S332: and performing key point detection on the video frame to be processed according to the human body detection result to obtain a key point detection result.

In some embodiments, the key point detection may be used to detect the articulation points of a human body, and may include: performing CNN training, feature extraction, key point regression, key point classification, data post-processing and the like on the target detection frame obtained by human body detection. The CNN training and feature extraction can adopt the simple baseline method proposed by MSRA; relevant scene samples are collected on the basis of this algorithm, and the key point detection model is obtained through labeling and training. In addition, the predictions of the key point detection model usually jitter, and one or more key points may be detected incorrectly. To solve this problem, the embodiment of the present application may smooth the key points and add handling for related key point exceptions, where the exception handling may include obtaining a key point similar to the abnormal key point and replacing the abnormal key point with it. For example, when the key point of a knee is detected on the torso while the key point of the foot lies below the body, the knee point has been detected incorrectly and its confidence is low; in this case, the embodiment of the present application may use the symmetric point on the other leg as a replacement.

In the embodiment of the present application, the number of the key points may be 17, that is, the present application may detect 17 joint points of a human body. In order to more clearly understand the positions of the joint points in the human body, the embodiment of the present application provides a diagram as shown in fig. 15, and it can be seen from fig. 15 that the human body includes 17 joint points. These joints may include the head, neck, hands, feet, etc., as detailed in table 1.

TABLE 1

Each joint point in table 1 is a common joint point of the human body, and the electronic device can better detect the posture of the human body through the joint points.

Step S333: and performing data conversion on the detection result of the key points to obtain a plurality of candidate posture information, wherein the candidate posture information comprises angle information and distance information corresponding to a plurality of human body joint points.

In some embodiments, after obtaining the key point detection result, the electronic device may perform data conversion on it to obtain a plurality of candidate posture information. The candidate posture information may include angle information and distance information corresponding to a plurality of human body joint points, where the angle information may be the angle at a target joint point formed by its two adjacent joint points, and the distance information may be the distances between the target joint point and the two adjacent joint points. As shown in fig. 15, angle 1 corresponding to joint 8 is the angle at joint 8 formed by joint 9 and joint 14, and the distance information corresponding to joint 8 may be the line segments 8-9 and 8-14, where each line segment may include two distances, namely a horizontal distance and a vertical distance. In order to more clearly understand the angle information and distance information of each joint point, a candidate posture information table is given as shown in table 2.

TABLE 2

Table 2 above shows the candidate pose information corresponding to different video frames. As can be seen from table 2, each to-be-processed video frame may include 9 pieces of angle information, and each piece of angle information may correspond to two pieces of distance information; together these may be referred to as candidate pose information. In addition, table 2 is only an example, and the actual situation shall prevail.

In the embodiment of the application, the electronic device can acquire the coordinates of each joint point in a video frame to be processed when the key point detection is performed, and then the angle information and the distance information corresponding to each joint point can be acquired by combining the coordinate information.
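The conversion from joint coordinates to angle and distance information can be sketched as below (a hypothetical helper; points are 2-D pixel coordinates, and the angle at the target joint is computed from the vectors to its two neighbors):

```python
import math

def joint_pose_features(p_target, p_a, p_b):
    """Return the angle (degrees) at the target joint formed by its two
    adjacent joints, and the (horizontal, vertical) distance from the
    target joint to each neighbor."""
    v1 = (p_a[0] - p_target[0], p_a[1] - p_target[1])
    v2 = (p_b[0] - p_target[0], p_b[1] - p_target[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    angle = math.degrees(math.acos(dot / norm))
    dists = [(abs(p[0] - p_target[0]), abs(p[1] - p_target[1]))
             for p in (p_a, p_b)]
    return angle, dists
```

For joint 8 of fig. 15, `p_a` and `p_b` would be the coordinates of joints 9 and 14, and `dists` would be the horizontal/vertical components of line segments 8-9 and 8-14.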

Step S340: and matching each candidate attitude information with the first reference attitude information of the plurality of reference video frames respectively, and taking the video frame to be processed corresponding to the candidate attitude information with the matching degree greater than the preset matching degree as a target video frame.

In the embodiment of the application, the candidate pose information may include angle information and distance information, and the first reference pose information may likewise include first angle information and first distance information. When each candidate pose information is matched with the first reference pose information of the plurality of reference video frames, the embodiment of the application mainly matches the angle information in the candidate pose information with the angle information in the corresponding first reference pose information, and matches the distance information in the candidate pose information with the distance information in the corresponding first reference pose information. For example, angle 2 in the candidate pose information is matched with angle 2 in the first reference pose information.
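A minimal sketch of the per-entry matching, assuming a tolerance-based matching degree (the tolerance and the preset matching degree are illustrative; no concrete values are fixed here):

```python
def matching_degree(cand_angles, ref_angles, tol=15.0):
    """Fraction of corresponding angle entries that agree within
    `tol` degrees (illustrative measure)."""
    hits = sum(abs(c - r) <= tol for c, r in zip(cand_angles, ref_angles))
    return hits / len(ref_angles)

def is_target_frame(cand_angles, ref_angles, preset=0.8):
    """A to-be-processed frame becomes a target frame when its matching
    degree exceeds the preset matching degree."""
    return matching_degree(cand_angles, ref_angles) > preset
```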

Step S350: determining a target time period from a plurality of the target video frames, and adding a special effect to the video within the target time period based on the special effect video.

In the embodiment of the application, after the electronic device matches each candidate pose information with first reference pose information of a plurality of reference video frames respectively and takes a to-be-processed video frame corresponding to the candidate pose information with the matching degree larger than the preset matching degree as a target video frame, the electronic device may update an initial video frame in the target video frame by introducing second reference pose information.

Specifically, a preset number of candidate video frames are determined based on the starting video frame, and the attitude information corresponding to each candidate video frame is obtained; the attitude information corresponding to each candidate video frame is compared with the second reference attitude information, and the attitude information whose similarity is greater than the preset similarity is taken as the target attitude information, where the reference video frame corresponding to the second reference attitude information is adjacent to the reference video frame corresponding to the first reference attitude information; the video frame to be processed corresponding to the target attitude information is then determined, and the initial video frame of the target video frames is updated with this video frame to be processed. The similarity may be a difference between the pieces of attitude information, that is, the difference between the attitude information corresponding to the candidate video frame and the second reference attitude information may be the difference in the angle information and the line segment information; when this difference is smaller than a preset difference, the corresponding attitude information may be taken as the target attitude information. Alternatively, the similarity may be the mean square error of the positions of all the joint points. As a further example, the similarity may be equal to the difference × 0.7 + the mean square error × 0.3.
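The combined score from the last example (difference × 0.7 + mean square error × 0.3) can be sketched as:

```python
def joint_mse(joints_a, joints_b):
    """Mean squared error over corresponding 2-D joint positions."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(joints_a, joints_b)) / len(joints_a)

def pose_similarity(info_diff, mse, w_diff=0.7, w_mse=0.3):
    """Weighted combination of the attitude-information difference and
    the joint-position mean squared error, per the example weights."""
    return info_diff * w_diff + mse * w_mse
```

Note that with this definition both terms are errors, so a smaller value indicates closer agreement, and the comparison against the preset value is inverted accordingly.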

In the embodiment of the application, after the electronic equipment acquires the plurality of reference video frames, the attitude information corresponding to the reference video frames can be stored and analyzed, so that the manual marking of the position information when the special effect is added can be reduced to a certain extent, and the efficiency when the special effect is added can be further improved. In addition, the embodiment of the application can automatically realize the addition of special effects aiming at different human body actions, thereby reducing the complexity of user operation and improving the user experience.

According to the video processing method provided by the embodiment of the application, obtaining the target video frames matched with the reference video frames can improve the accuracy of video processing to a certain extent. Specifically, the video to be processed is obtained, where the video to be processed may include a plurality of video frames to be processed; based on this, a plurality of reference video frames corresponding to the special effect video are obtained, and the target video frame matched with each reference video frame is determined from the plurality of video frames to be processed; finally, the target time period is determined from the plurality of target video frames, and the special effect is added to the video within the target time period based on the special effect video. After the multiple reference video frames corresponding to the special effect video are obtained, the video frames to be processed are matched with the reference video frames, and an accurate target time period is determined based on the matched target video frames, so that the special effect can be added more accurately. In addition, converting the key point detection result into candidate attitude information makes the time point for adding the special effect video more accurate to a certain extent, and by using the first reference attitude information and the second reference attitude information, the corresponding special effect can be automatically and accurately added to the to-be-processed video that needs it.

Referring to fig. 16, an embodiment of the present application provides a video processing apparatus 400. In a specific embodiment, the video processing apparatus 400 includes: a first obtaining module 410, a second obtaining module 420, and a special effects adding module 430.

A first obtaining module 410, configured to obtain a to-be-processed video, where the to-be-processed video includes a plurality of to-be-processed video frames.

The second obtaining module 420 is configured to obtain a plurality of reference video frames corresponding to the special effect video, and determine, from the plurality of to-be-processed video frames, a target video frame matched with each of the reference video frames.

Further, the second obtaining module 420 is further configured to perform pose detection on each to-be-processed video frame to obtain a plurality of candidate pose information; and matching each candidate attitude information with the first reference attitude information of the plurality of reference video frames respectively, and taking the video frame to be processed corresponding to the candidate attitude information with the matching degree greater than the preset matching degree as a target video frame, wherein the video frame to be processed corresponding to the candidate attitude information with the matching degree greater than the preset matching degree of the first reference attitude information of the starting point reference video frame is taken as a starting video frame.

Further, the second obtaining module 420 is further configured to determine a preset number of candidate video frames based on the starting video frame, and obtain pose information corresponding to each candidate video frame; comparing the attitude information corresponding to each candidate video frame with second reference attitude information respectively, and taking the attitude information with the similarity greater than the preset similarity as target attitude information, wherein the reference video frame corresponding to the second reference attitude information is adjacent to the reference video frame corresponding to the first reference attitude information; and determining the video frame to be processed corresponding to the target attitude information, and updating the initial video frame of the target video frame by using the video frame to be processed.

Further, the second obtaining module 420 is further configured to perform human body detection on each to-be-processed video frame to obtain a human body detection result; performing key point detection on the video frame to be processed according to the human body detection result to obtain a key point detection result; and performing data conversion on the detection result of the key points to obtain a plurality of candidate posture information, wherein the candidate posture information comprises angle information and distance information corresponding to a plurality of human body joint points.

A special effect adding module 430, configured to determine a target time period from the multiple target video frames, and add a special effect to the video in the target time period based on the special effect video.

Further, the plurality of reference video frames includes a starting reference video frame, and the plurality of target video frames includes a starting video frame matching the starting reference video frame. The special effect adding module 430 is further configured to use a time corresponding to the starting video frame as a starting time, and obtain the target time period according to the starting time.

Further, the video processing apparatus 400 is further configured to decompose the special effect video into a plurality of state videos, and obtain second reference pose information corresponding to each of the state videos. Wherein decomposing the special effects video into a plurality of state videos may include: determining format data of the special effect video, wherein the format data comprises the type of the special effect video, the starting time and the ending time of the special effect video and a filtering control frame number, and the filtering control frame number is used for de-jittering; decomposing the special effects video into a plurality of the status videos based on format data of the special effects video.

Further, the video processing apparatus 400 is further configured to compare the second reference pose information corresponding to two adjacent state videos, and if a difference between the second reference pose information of a subsequent state video and the second reference pose information of a previous state video is greater than a first difference, take a product of the second reference pose information of the previous state video and a first specified value as the first reference pose information corresponding to the previous state video, where the first specified value is less than 1. In addition, if the difference value between the second reference attitude information of the next state video and the second reference attitude information of the previous state video is smaller than a second difference value, the product of the second reference attitude information of the previous state video and a second specified value is used as the first reference attitude information corresponding to the previous state video, and the second specified value is larger than 1.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

According to the video processing device provided by the embodiment of the application, obtaining the target video frames matched with the reference video frames can improve the accuracy of video processing to a certain extent. Specifically, the video to be processed is obtained, where the video to be processed may include a plurality of video frames to be processed; based on this, a plurality of reference video frames corresponding to the special effect video are obtained, and the target video frame matched with each reference video frame is determined from the plurality of video frames to be processed; finally, the target time period is determined from the plurality of target video frames, and the special effect is added to the video within the target time period based on the special effect video. After the multiple reference video frames corresponding to the special effect video are obtained, the video frames to be processed are matched with the reference video frames, and an accurate target time period is determined based on the matched target video frames, so that the special effect can be added more accurately.

Referring to fig. 17, a block diagram of an electronic device 500 according to an embodiment of the present disclosure is shown. The electronic device 500 may be a smart phone, a tablet computer, an electronic book, or other electronic devices capable of running an application. The electronic device 500 in the present application may include one or more of the following components: a processor 510, a memory 520, and one or more applications, wherein the one or more applications may be stored in the memory 520 and configured to be executed by the one or more processors 510, the one or more programs configured to perform a method as described in the aforementioned method embodiments.

Processor 510 may include one or more processing cores. The processor 510 connects the various components of the electronic device 500 using various interfaces and circuitry, and performs the various functions of the electronic device 500 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 520 and invoking data stored in the memory 520. Alternatively, the processor 510 may be implemented in hardware using at least one of Digital Signal Processing (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 510 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communications. It is understood that the modem may also not be integrated into the processor 510 but instead be implemented by a separate communication chip.

The memory 520 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 520 may be used to store instructions, programs, code sets, or instruction sets. The memory 520 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created during use of the electronic device 500 (e.g., phone books, audio and video data, chat log data), and so forth.

Referring to fig. 18, a block diagram of a computer-readable storage medium 600 according to an embodiment of the present application is shown. The computer-readable storage medium 600 has stored therein program code that can be called by a processor to execute the method described in the above-described method embodiments.

The computer-readable storage medium 600 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 600 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 600 has storage space for program code 610 for performing any of the method steps in the above method embodiments. The program code may be read from or written into one or more computer program products, and the program code 610 may be, for example, compressed in a suitable form.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications and replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
