Video processing method and device, electronic equipment and storage medium

Document No.: 1849994    Publication date: 2021-11-16

Note: This technology, "Video processing method and device, electronic equipment and storage medium", was designed and created by 许通达, 高宸健, 王岩, 袁涛 and 秦红伟 on 2021-08-20. Its main content is as follows: the method determines, in time axis order, a target frame sequence in a video to be processed, the sequence comprising a target frame and a plurality of adjacent reference frames, and performs attention area detection on the target frame sequence to obtain a target detection image that distinguishes the positions of the attention area and the non-attention area in the target frame. Transparency fusion is performed on the background image and the foreground image of the target frame based on the target detection image to obtain a target substitute image that displays the foreground image in the attention area and at least part of the background image in the non-attention area, and the target frame is updated with the target substitute image. By updating the target frame with such a substitute image, the embodiments of the present disclosure further reduce the code rate of the whole video to be processed and avoid generating coding noise in the subsequent encoding process.

1. A method of video processing, the method comprising:

determining a target frame sequence in a video to be processed according to a time axis sequence, wherein the target frame sequence comprises a target frame and at least one reference frame within a preset length range from the target frame;

carrying out attention area detection according to the target frame sequence to obtain a target detection image for representing the positions of an attention area and a non-attention area in the target frame;

determining a corresponding background image and a corresponding foreground image according to the target frame;

performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, wherein an attention area of the target substitute image is the foreground image, and a non-attention area of the target substitute image is at least part of the background image;

updating the target frame with the target substitute image.

2. The method of claim 1, wherein the performing attention area detection based on the sequence of target frames to obtain a target detection image for characterizing attention areas and non-attention areas in the target frame comprises:

performing first image processing on the target frame sequence to obtain a feature tensor, wherein the feature tensor is used for representing image features of a target frame and each reference frame in the target frame sequence;

inputting the feature tensor into a trained neural network for attention area detection, determining an attention area in the target frame by comparing the target frame with each reference frame, and outputting a first detection image for representing the attention area and a non-attention area, wherein the non-attention area is an area except the attention area;

and carrying out second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.

3. The method of claim 2, wherein the first image processing of the sequence of target frames to obtain a feature tensor comprises:

down-sampling each frame in the sequence of target frames by a predetermined multiple;

and determining the feature tensor according to the down-sampled frames.

4. The method of claim 2 or 3, wherein the feature tensor comprises a four-dimensional feature tensor, the four dimensions of which are the timing, channel, height and width of the corresponding frame, respectively.

5. The method according to any one of claims 2-4, wherein the performing a second image processing on the first detected image to obtain a target detected image with a resolution same as that of the target frame comprises:

up-sampling the first detection image to obtain a second detection image with the same resolution as the target frame;

and performing maximum pooling on the second detection image by using a window with a preset size and a step length to obtain a target detection image.

6. The method of any one of claims 2 to 5, wherein the neural network is a MobileNetV3 neural network.

7. The method according to any one of claims 1-6, wherein the determining the corresponding background image and foreground image from the target frame comprises:

carrying out blurring processing on the target frame to obtain a background image;

and sharpening the target frame to obtain a foreground image.

8. The method according to any one of claims 1 to 7, wherein performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image comprises:

determining a transparency channel according to the target detection image;

and performing transparency fusion on the background image and the foreground image according to the transparency channel to obtain a target substitute image which displays the foreground image at the attention area position and displays the background image at the non-attention area position.

9. The method according to any one of claims 1-8, wherein the determining a sequence of target frames in the video to be processed in timeline order comprises:

sequentially adding each frame in the video to be processed into a preset first-in first-out queue in time axis order;

and in response to the fact that all the positions in the queue are occupied, taking the frame at the middle position of the queue as a target frame of the target frame sequence, and taking the frames at other positions as reference frames of the target frame sequence to determine the target frame sequence.

10. The method of claim 9, further comprising:

and in response to the target frame being updated, popping the frame stored at the first position of the queue, and pushing the next frame of the video to be processed into the queue.

11. The method according to any one of claims 1-10, further comprising:

determining an adaptive quantization parameter corresponding to the target detection image;

inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder so as to perform video encoding on the updated target frame based on the corresponding adaptive quantization parameter.

12. The method of claim 11, wherein the determining the adaptive quantization parameter corresponding to the target detection image comprises:

performing histogram statistics on the target detection image to obtain a corresponding histogram mapping table;

mapping the target detection image according to the histogram mapping table to obtain a corresponding preliminary quantization parameter;

and performing down-sampling on the preliminary quantization parameter to obtain an adaptive quantization parameter.

13. The method according to claim 11 or 12, wherein said inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder comprises:

and inputting the updated target frame as an input frame into the video encoder, and inputting the adaptive quantization parameter into an adaptive quantization interface of the video encoder.

14. A video processing apparatus, characterized in that the apparatus comprises:

a sequence determining module, configured to determine a target frame sequence in a video to be processed in time axis order, wherein the target frame sequence comprises a target frame and at least one reference frame within a preset length range from the target frame;

an attention area detection module, configured to perform attention area detection according to the target frame sequence to obtain a target detection image for representing an attention area and a non-attention area in the target frame;

the image determining module is used for determining a corresponding background image and a corresponding foreground image according to the target frame;

the image fusion module is used for performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, wherein an attention area of the target substitute image is the foreground image, and a non-attention area of the target substitute image is at least part of the background image;

and the image updating module is used for updating the target frame through the target substitute image.

15. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 13.

16. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 13.

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.

Background

In the field of video processing, identifying an attention area in a video and adjusting a code rate are important for improving the subjective quality of the video or the visual experience of audiences. However, how to quickly and accurately identify attention areas is a challenge. In addition, in the video encoding process, simply increasing the attention area code rate and reducing the non-attention area code rate may cause encoding noise to the non-attention area at a low code rate.

Disclosure of Invention

The present disclosure provides a video processing method and apparatus, an electronic device, and a storage medium, which aim to quickly and accurately identify an attention area in a video and avoid generating coding noise during video coding.

According to a first aspect of the present disclosure, there is provided a video processing method, the method comprising:

determining a target frame sequence in a video to be processed according to a time axis sequence, wherein the target frame sequence comprises a target frame and at least one reference frame within a preset length range from the target frame;

carrying out attention area detection according to the target frame sequence to obtain a target detection image for representing the positions of an attention area and a non-attention area in the target frame;

determining a corresponding background image and a corresponding foreground image according to the target frame;

performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, wherein an attention area of the target substitute image is the foreground image, and a non-attention area of the target substitute image is at least part of the background image;

updating the target frame with the target substitute image.

In a possible implementation manner, the performing attention area detection according to the sequence of target frames to obtain a target detection image for characterizing attention areas and non-attention areas in the target frame includes:

performing first image processing on the target frame sequence to obtain a feature tensor, wherein the feature tensor is used for representing image features of a target frame and each reference frame in the target frame sequence;

inputting the feature tensor into a trained neural network for attention area detection, determining an attention area in the target frame by comparing the target frame with each reference frame, and outputting a first detection image for representing the attention area and a non-attention area, wherein the non-attention area is an area except the attention area;

and carrying out second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.

In a possible implementation manner, the obtaining of the feature tensor by performing the first image processing on the target frame sequence includes:

down-sampling each frame in the sequence of target frames by a predetermined multiple;

and determining the feature tensor according to the down-sampled frames.

In one possible implementation, the feature tensor comprises a four-dimensional feature tensor whose four dimensions are the timing, channel, height, and width of the corresponding frame, respectively.

In a possible implementation manner, the performing second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame includes:

up-sampling the first detection image to obtain a second detection image with the same resolution as the target frame;

and performing maximum pooling on the second detection image by using a window with a preset size and a step length to obtain a target detection image.

In one possible implementation, the neural network is a MobileNetV3 neural network.

In one possible implementation manner, the determining, according to the target frame, a corresponding background image and a corresponding foreground image includes:

carrying out blurring processing on the target frame to obtain a background image;

and sharpening the target frame to obtain a foreground image.

In a possible implementation manner, the performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image includes:

determining a transparency channel according to the target detection image;

and performing transparency fusion on the background image and the foreground image according to the transparency channel to obtain a target substitute image which displays the foreground image at the attention area position and displays the background image at the non-attention area position.

In one possible implementation, the determining the sequence of target frames in the video to be processed according to the time axis order includes:

sequentially adding each frame in the video to be processed into a preset first-in first-out queue in time axis order;

and in response to the fact that all the positions in the queue are occupied, taking the frame at the middle position of the queue as a target frame of the target frame sequence, and taking the frames at other positions as reference frames of the target frame sequence to determine the target frame sequence.

In one possible implementation, the method further includes:

and in response to the target frame being updated, popping the frame stored at the first position of the queue, and pushing the next frame of the video to be processed into the queue.

In one possible implementation, the method further includes:

determining an adaptive quantization parameter corresponding to the target detection image;

inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder so as to perform video encoding on the updated target frame based on the corresponding adaptive quantization parameter.

In a possible implementation manner, the determining an adaptive quantization parameter corresponding to the target detection image includes:

performing histogram statistics on the target detection image to obtain a corresponding histogram mapping table;

mapping the target detection image according to the histogram mapping table to obtain a corresponding preliminary quantization parameter;

and performing down-sampling on the preliminary quantization parameter to obtain an adaptive quantization parameter.

In one possible implementation, the inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder includes:

and inputting the updated target frame as an input frame into the video encoder, and inputting the adaptive quantization parameter into an adaptive quantization interface of the video encoder.

According to a second aspect of the present disclosure, there is provided a video processing apparatus, the apparatus comprising:

a sequence determining module, configured to determine a target frame sequence in a video to be processed in time axis order, wherein the target frame sequence comprises a target frame and at least one reference frame within a preset length range from the target frame;

an attention area detection module, configured to perform attention area detection according to the target frame sequence to obtain a target detection image for representing an attention area and a non-attention area in the target frame;

the image determining module is used for determining a corresponding background image and a corresponding foreground image according to the target frame;

the image fusion module is used for performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, wherein an attention area of the target substitute image is the foreground image, and a non-attention area of the target substitute image is at least part of the background image;

and the image updating module is used for updating the target frame through the target substitute image.

In one possible implementation, the attention area detection module includes:

the first processing submodule is used for carrying out first image processing on the target frame sequence to obtain a feature tensor, and the feature tensor is used for representing the image features of a target frame and each reference frame in the target frame sequence;

the detection submodule is used for inputting the characteristic tensor into a trained neural network to detect an attention area, determining the attention area in the target frame by comparing the target frame with each reference frame, and outputting a first detection image for representing the attention area and a non-attention area, wherein the non-attention area is an area except the attention area;

and the second processing submodule is used for carrying out second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.

In one possible implementation manner, the first processing sub-module includes:

a down-sampling unit, configured to down-sample each frame in the target frame sequence by a predetermined multiple;

and the characteristic tensor determining unit is used for determining the characteristic tensor according to each down-sampled frame.

In one possible implementation, the feature tensor comprises a four-dimensional feature tensor whose four dimensions are the timing, channel, height, and width of the corresponding frame, respectively.

In one possible implementation, the second processing sub-module includes:

the up-sampling unit is used for up-sampling the first detection image by the preset multiple to obtain a second detection image with the same resolution as the target frame;

and the pooling unit is used for performing maximum pooling on the second detection image by using a window with a preset size and a step length to obtain a target detection image.

In one possible implementation, the neural network is a MobileNetV3 neural network.

In one possible implementation, the image determination module includes:

the background determining submodule is used for carrying out blurring processing on the target frame to obtain a background image;

and the foreground determining submodule is used for carrying out sharpening processing on the target frame to obtain a foreground image.

In one possible implementation, the image fusion module includes:

the channel determining submodule is used for determining a transparency channel according to the target detection image;

and the image fusion submodule is used for performing transparency fusion on the background image and the foreground image according to the transparency channel to obtain a target substitute image which displays the foreground image at the attention area position and displays the background image at the non-attention area position.

In one possible implementation, the sequence determining module includes:

the queue insertion submodule is used for sequentially adding each frame in the video to be processed into a preset first-in first-out queue in time axis order;

and the sequence determining submodule is used for, in response to all positions in the queue being occupied, using the frame at the middle position of the queue as a target frame, and using the frames at other positions as reference frames to determine the target frame sequence.

In one possible implementation, the apparatus further includes:

and the queue updating module is used for, in response to the target frame being updated, popping the frame stored at the first position of the queue and pushing the next frame of the video to be processed into the queue.

In one possible implementation, the apparatus further includes:

the parameter determining module is used for determining the adaptive quantization parameter corresponding to the target detection image;

and the data transmission module is used for inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder so as to carry out video encoding on the target frame based on the corresponding adaptive quantization parameter.

In one possible implementation, the parameter determining module includes:

the histogram statistics submodule is used for carrying out histogram statistics on the target detection image to obtain a corresponding histogram mapping table;

the first parameter determining submodule is used for mapping the target detection image according to the histogram mapping table to obtain a corresponding preliminary quantization parameter;

and the second parameter determination submodule is used for performing down-sampling on the preliminary quantization parameter to obtain an adaptive quantization parameter.

In one possible implementation, the data transmission module includes:

and the data transmission sub-module is used for inputting the updated target frame into the video encoder as an input frame and inputting the adaptive quantization parameter into an adaptive quantization interface of the video encoder.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.

By determining the background image and the foreground image of the target frame and updating the target frame with a target substitute image that displays the foreground image in the attention area and the background image in the non-attention area, the code rate of the whole video to be processed is further reduced, and coding noise is avoided in the subsequent encoding process.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 shows a flow diagram of a video processing method according to an embodiment of the present disclosure;

FIG. 2 illustrates a schematic diagram of determining a sequence of target frames according to an embodiment of the present disclosure;

FIG. 3 shows a flow diagram of an attention area detection process according to an embodiment of the present disclosure;

FIG. 4 shows a schematic diagram of a second image processing procedure according to an embodiment of the disclosure;

FIG. 5 shows a schematic diagram of an attention area detection process according to an embodiment of the present disclosure;

FIG. 6 shows a schematic diagram of a target detection image according to an embodiment of the present disclosure;

FIG. 7 shows a schematic diagram of a process of determining a target substitute image according to an embodiment of the present disclosure;

FIG. 8 shows a schematic diagram of a transparency fusion process according to an embodiment of the present disclosure;

FIG. 9 shows a schematic diagram of a process of determining an adaptive quantization parameter according to an embodiment of the present disclosure;

FIG. 10 shows a schematic diagram of a data transmission process according to an embodiment of the disclosure;

fig. 11 shows a schematic diagram of a video processing apparatus according to an embodiment of the present disclosure;

FIG. 12 is a block diagram of an electronic device according to an exemplary embodiment;

FIG. 13 is a block diagram of an electronic device according to an exemplary embodiment.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

Fig. 1 shows a flow diagram of a video processing method according to an embodiment of the present disclosure. The video processing method may be performed by a terminal device or other processing device, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the video processing method may be implemented by a processor calling computer readable instructions stored in a memory.

In an exemplary application scenario, by performing the video processing method of the embodiments of the present disclosure on a predetermined video to be processed, each frame in the video to be processed may be updated to an image with different definition in the attention area and the non-attention area, an adaptive quantization parameter corresponding to each frame is determined, and video encoding is performed using each frame of the video to be processed together with its corresponding adaptive quantization parameter. Further, the video processing method and the video encoding in the embodiments of the present disclosure may be completed by the same device, or the video processing method may be executed by a terminal device or another device and the result then transmitted to a video encoder for video encoding.

As shown in fig. 1, the video processing method of the embodiment of the present disclosure includes the following steps:

step S10, determining a sequence of target frames in the video to be processed in time axis order.

In a possible implementation manner, the embodiment of the present disclosure may execute the video processing method by processing each frame in the video to be processed separately. That is, each frame in the video to be processed may in turn be taken as a target frame and subjected to image processing to determine a target substitute image for replacing it. After the image processing of the current target frame is finished, an unprocessed frame in the video to be processed is determined as the new target frame, and this continues until all frames in the video to be processed have been processed, at which point the video processing of the video to be processed is finished. Optionally, to ensure video processing efficiency, the processing order of the target frames may be determined sequentially in time axis order.

Further, the video to be processed is essentially a frame sequence composed of a plurality of frames, each frame recording static image information. Therefore, in order to perform video processing on the target frame, it is necessary to acquire a plurality of frames within a preset length range from the target frame as reference frames, perform attention area detection by comparing the image content of the reference frames with that of the target frame, determine the attention area in the target frame, and then carry out further video processing.

That is to say, when the video processing method according to the embodiment of the present disclosure is executed, a target frame sequence is first determined in the video to be processed in time axis order, where the target frame sequence includes a target frame and at least one reference frame within a preset length range from the target frame. The "preset length" may be a preset number of frames. When the number of frames before or after the target frame is smaller than the preset length, that is, when the target frame is among the first few frames or the last few frames of the video, the corresponding frame sequence may not be determined; instead, the attention area identification result of an adjacent frame is used directly as the attention area identification result of the current target frame.

Take a video to be processed comprising frames T1-T10 as an example. In the case where the preset length is 2, the target frame sequences may be sequentially determined in time axis order as (T1, T2, T3, T4, T5), (T2, T3, T4, T5, T6), (T3, T4, T5, T6, T7), (T4, T5, T6, T7, T8), (T5, T6, T7, T8, T9) and (T6, T7, T8, T9, T10). The target frames corresponding to these target frame sequences are T3, T4, T5, T6, T7 and T8 in turn, and the attention area identification result corresponding to each target frame is obtained by performing attention area detection on each target frame sequence in turn. Optionally, the identification result of T3 may be taken as the identification result of T1 and T2, and the identification result of T8 as the identification result of T9 and T10.

In one possible implementation, the process of acquiring a target frame sequence according to an embodiment of the present disclosure may include: sequentially adding the frames of the video to be processed into a preset first-in first-out queue in time axis order, and, in response to all positions in the queue being occupied, taking the frame at the middle position of the queue as the target frame of the target frame sequence and the frames at the other positions as its reference frames. That is, a first-in first-out queue with a fixed length is preset, and frames are added to the queue sequentially, in the order of the frames of the video to be processed on the time axis, each frame occupying one position in the queue. When all positions in the queue are occupied, that is, when each position stores one frame of the video to be processed, the frame at the middle position of the queue is taken as the target frame, the frames at the other positions are taken as reference frames, and the target frame sequence is determined from the target frame and the corresponding reference frames. Here the middle position denotes the middle-most position in the queue or, when there are two middle-most positions, a predetermined one of them. For example, when the queue length is odd, the queue contains exactly one middle position, and the frame at that position is determined to be the target frame. When the queue length is even, the queue contains two middle positions, and the frame stored at the earlier of the two may be determined to be the target frame.

Alternatively, the length of the first-in first-out queue may be the sum of the number of target frames and reference frames, i.e., twice the preset length plus one. For example, when the preset length is 2, the length of the queue is 5.

Further, after the processing of the target frame based on the current target frame sequence is completed, the frame at the first position of the queue is popped, and the next frame of the video to be processed is pushed into the queue.
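By way of a non-limiting illustration, this sliding-window queue behavior can be sketched as follows (a minimal Python sketch; the function name and the use of collections.deque are illustrative choices, not part of the disclosure):

    from collections import deque

    def target_frame_sequences(frames, preset_length=2):
        """Yield (target_frame, reference_frames) pairs by sliding a
        fixed-length first-in first-out queue over the frames of the
        video to be processed in time axis order."""
        queue_len = 2 * preset_length + 1      # e.g. 2 * 2 + 1 = 5
        queue = deque(maxlen=queue_len)        # FIFO queue; maxlen makes the
        for frame in frames:                   # oldest frame pop automatically
            queue.append(frame)                # push the next frame
            if len(queue) == queue_len:        # all positions occupied
                target = queue[preset_length]  # frame at the middle position
                refs = [f for i, f in enumerate(queue) if i != preset_length]
                yield target, refs

    # With frames T1..T10 and preset_length=2, the targets are T3..T8,
    # matching the example above.
    for target, refs in target_frame_sequences([f"T{i}" for i in range(1, 11)]):
        print(target, refs)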

Fig. 2 shows a schematic diagram of determining a sequence of target frames according to an embodiment of the present disclosure. As shown in fig. 2, when determining the target frame sequence, the frames in the video 20 to be processed are sequentially added to a preset first-in first-out queue 22 along the time axis 21. In a possible implementation manner, when the positions of the queue 22 are occupied in sequence by frames T-2, T-1, T, T+1 and T+2, the frame T currently at the middle position of the queue 22 is taken as the target frame, and the frames T-2, T-1, T+1 and T+2 at the other positions are taken as reference frames, so that the target frame sequence (T-2, T-1, T, T+1, T+2) is determined, and a video processing procedure such as attention area identification is further performed on the target frame T based on this sequence. This target frame sequence is formed at the moment frame T-3 is popped out of, and frame T+2 is pushed into, the queue.

Further, after the processing of the target frame T with the current target frame sequence (T-2, T-1, T, T+1, T+2) is completed, the frame T-2 that was first added to the queue 22 is popped from the first position of the queue 22, the frame T+3, which on the time axis follows the frame T+2 at the last position of the queue 22, is pushed into the queue 22, and the frames at the other positions of the queue 22 move forward by one position.

In an alternative implementation, the above manner of sequentially determining target frame sequences ensures the efficiency of the whole video processing process. Meanwhile, since each target frame sequence comprises both the target frame and reference frames, performing attention area detection on the target frame sequence improves the accuracy of the attention area identification result for the target frame.

And step S20, detecting attention areas according to the target frame sequence to obtain target detection images for representing the positions of the attention areas and the non-attention areas in the target frame.

In one possible implementation, the target detection image is determined by performing attention area detection on the target frame sequence, and the target detection image is an image for representing attention areas and non-attention areas in the target frame. Further, the attention area is determined by comparing the image content of the target frame and the reference frames in the target frame sequence. Alternatively, the attention area is the area of the target frame on which human attention focuses; it may be, for example, a motion area against a relatively static background, or the area where a specific contour is located in the image. The motion area against a relatively static background may be the area where the football is located in a football match video, and the area where a specific contour is located may be the area of a face contour in a face recognition scene, and the like. Alternatively, in the case where the background needs to be focused on, the attention area may be the area other than the motion area.

Fig. 3 shows a flow chart of an attention area detection process according to an embodiment of the present disclosure. As shown in fig. 3, in a possible implementation manner, a process of performing attention area detection on a target detection sequence to obtain a target detection image for characterizing an attention area and a non-attention area of a target frame according to an embodiment of the present disclosure may include the following steps:

and step S21, carrying out first image processing on the target frame sequence to obtain a feature tensor.

In a possible implementation manner, the first image processing is performed on the target frame sequence to obtain a feature tensor, where the feature tensor is used to characterize the image features of the target frame and each reference frame in the target frame sequence, and each target frame sequence corresponds to one feature tensor. Optionally, the first image processing is intended to convert each frame in the target frame sequence from a high-resolution image into a low-resolution image, so as to improve the speed and efficiency of the subsequent attention area detection.

Further, the first image processing may include: each frame in the target frame sequence is down-sampled by a preset multiple, and a feature tensor is determined according to each down-sampled frame. That is, a multiple is preset, each frame in the target frame sequence is reduced by a predetermined multiple in a down-sampling manner, and the feature tensor is determined according to each reduced frame. Alternatively, the down-sampling method may adopt any method, such as nearest neighbor interpolation, bilinear interpolation, mean interpolation, median interpolation, and the like, which is not limited herein.

In one possible implementation, when the embodiment of the present disclosure is applied to a video coding scene, in order to ensure the efficiency of the subsequent video coding process, the predetermined multiple may be set according to the macroblock size used in encoding. For example, when the macroblock size is 16 × 16, the predetermined multiple is set to 16, that is, macroblock-level frames are obtained by down-sampling each frame by a factor of 16.

In one possible implementation, the feature tensor determined according to each down-sampled frame is a four-dimensional feature tensor, where four dimensions of the feature tensor are a time sequence, a channel, a height, and a width of the corresponding frame, respectively. Alternatively, the time sequence may be determined sequentially according to the time axis of each frame in the image to be processed, the channels may be determined according to the number of color channels of each frame, and the height and width may be determined according to the resolution size of each frame. The four-dimensional tensor can be applied to a lightweight neural network such as a MobileNetV3 neural network, which is used as input data of the neural network.
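As a hedged illustration of this step, the following Python sketch builds such a (timing, channel, height, width) tensor with OpenCV and NumPy; the bilinear interpolation and the normalization to 0-1 are assumed choices, since the text permits any down-sampling method:

    import cv2
    import numpy as np

    def build_feature_tensor(frame_sequence, multiple=16):
        """Down-sample each frame of the target frame sequence by a
        predetermined multiple and stack the results into a 4-D tensor
        of shape (timing, channel, height, width)."""
        downsampled = []
        for frame in frame_sequence:           # frames as H x W x C uint8 arrays
            h, w = frame.shape[:2]
            small = cv2.resize(frame, (w // multiple, h // multiple),
                               interpolation=cv2.INTER_LINEAR)  # down-sampling
            downsampled.append(small.transpose(2, 0, 1))        # to C x H x W
        return np.stack(downsampled).astype(np.float32) / 255.0  # T x C x H x W

    # e.g. five 1920x1080 RGB frames -> tensor of shape (5, 3, 67, 120)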

And step S22, inputting the feature tensor into the neural network obtained by training for attention area detection.

In one possible implementation manner, the feature tensor is input into a trained neural network for attention area detection, so that the attention area in the target frame is determined by comparing the target frame with each reference frame, and a first detection image representing the attention area and the non-attention area is output. Optionally, the resolution of the first detection image is the same as the resolution of each down-sampled frame. For example, in the case where a motion region is taken as the attention area, a plurality of object regions contained in the target frame and the reference frames may first be determined by object recognition; the positions of the regions in which the same object appears in the target frame and the reference frames are then compared, and the attention area is determined as those object regions in the target frame whose displacement exceeds a preset threshold.

In one possible implementation, the neural network performing attention area detection is a lightweight neural network. Further, the neural network may be a MobileNetV3 neural network, consisting in order of a start portion, a middle portion, and a last portion. The start portion comprises a convolution layer with a 3 × 3 kernel for feature extraction; the middle portion comprises 11 or 15 bneck modules; and the last portion comprises an average pooling layer and a convolution layer with a 1 × 1 kernel. Each bneck module comprises sequentially connected depthwise separable convolutions and a channel attention mechanism, and residual connections are used to avoid information loss during convolution.

Due to the structural characteristics of MobileNetV3, the amount of computation is reduced while the accuracy of the result is preserved; based on this neural network, attention area detection can be performed in real time on low-resolution input, which increases detection speed while ensuring the accuracy of the detection result.
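Purely as a sketch under stated assumptions: the text does not disclose the exact network head or how the four-dimensional tensor is fed in, so the example below folds the time dimension into channels and attaches a 1 × 1 convolution head to a torchvision MobileNetV3-Small backbone; every layer choice beyond what the text states is an assumption:

    import torch
    import torch.nn as nn
    from torchvision.models import mobilenet_v3_small

    class AttentionAreaNet(nn.Module):
        """Sketch of a MobileNetV3-based attention-area detector. The
        (T, C, H, W) feature tensor is folded into (1, T*C, H, W); the
        1x1-conv head and sigmoid output are assumptions."""
        def __init__(self, t=5, c=3):
            super().__init__()
            backbone = mobilenet_v3_small()
            # Replace the stem so it accepts T*C input channels instead of 3.
            backbone.features[0][0] = nn.Conv2d(t * c, 16, kernel_size=3,
                                                stride=2, padding=1, bias=False)
            self.features = backbone.features          # bneck stack (middle part)
            self.head = nn.Conv2d(576, 1, kernel_size=1)  # 576-ch feature output

        def forward(self, x):                  # x: (T, C, H, W)
            x = x.reshape(1, -1, *x.shape[-2:])  # fold timing into channels
            y = torch.sigmoid(self.head(self.features(x)))
            # Restore the down-sampled frame resolution (first detection image).
            return nn.functional.interpolate(y, size=x.shape[-2:],
                                             mode="bilinear", align_corners=False)

    # det = AttentionAreaNet()(torch.rand(5, 3, 67, 120))  # -> (1, 1, 67, 120)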

And step S23, carrying out second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.

In a possible implementation manner, the first detection image is subjected to second image processing to obtain a target detection image with the same resolution as the target frame. The second image processing restores the first detection image to the original size of the target frame, so that the target frame can be further processed and updated based on the obtained target detection image.

In one possible implementation, the second image processing on the first detection image includes: upsampling the first detection image by the predetermined multiple to obtain a second detection image with the same resolution as the target frame, and performing maximum pooling on the second detection image using a window with a preset size and step length to obtain the target detection image.

Further, the first detection image is up-sampled by the same predetermined multiple as was used for down-sampling, so that its resolution is restored to that of the target frame, yielding the second detection image. The up-sampling of the first detection image may use any method, such as nearest neighbor interpolation, bilinear interpolation, mean interpolation or median interpolation, which is not limited herein. Preferably, bicubic interpolation may be chosen for the up-sampling to improve the final image effect.

In one possible implementation, the window size for maximum pooling of the second detected image may be determined according to the upsampling ratio, i.e. the same predetermined multiple as described above. For example, when the predetermined multiple is 16, the window size of the maximum pooling may be determined to be 16 × 16. Meanwhile, in order to ensure that the size of the target detection image obtained after the maximum pooling is not changed, the step length of the maximum pooling process is preset to be 1. When the embodiment of the disclosure is applied to a video coding scene, the determination mode of the window size of the maximum pooling process can ensure the efficiency of the subsequent video coding process.
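The two operations of this step can be illustrated with the following hedged Python sketch; implementing the stride-1 maximum pooling as a flat morphological dilation is an assumed equivalent chosen so the image size is preserved, not a detail taken from the text:

    import cv2
    import numpy as np

    def second_image_processing(first_detection, target_hw, multiple=16):
        """Up-sample the first detection image back to the target frame's
        resolution (bicubic), then apply a 16x16 stride-1 maximum filter,
        implemented here as dilation with a flat structuring element."""
        h, w = target_hw
        second = cv2.resize(first_detection, (w, h),
                            interpolation=cv2.INTER_CUBIC)   # bicubic up-sampling
        kernel = np.ones((multiple, multiple), np.uint8)     # 16x16 window
        return cv2.dilate(second, kernel)                    # local maximum

    # detection = second_image_processing(first_img, (1080, 1920))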

Fig. 4 shows a schematic diagram of a second image processing procedure according to an embodiment of the present disclosure. As shown in fig. 4, after attention area detection is performed on the feature tensor by the neural network to obtain the first detection image 40, the first detection image is up-sampled to the second detection image 41, whose resolution is the same as that of the target frame. The texture features of the second detection image 41 are then further extracted by maximum pooling, so that a target detection image 42 that clearly distinguishes the attention area from the non-attention area is obtained, facilitating subsequent image processing.

Fig. 5 shows a schematic diagram of an attention area detection process according to an embodiment of the present disclosure. As shown in fig. 5, in the attention area detection process of the embodiment of the present disclosure, the target frame sequence 50 corresponding to the target frame is determined, and each frame in the target frame sequence 50 is down-sampled by the first image processing to obtain a feature tensor 51. The low-resolution feature tensor 51 is input into the trained neural network 52, and the low-resolution first detection image 53 is obtained quickly and accurately. The second image processing is performed on the first detection image 53 to obtain a target detection image 54 with clear texture features and high resolution. This attention area detection process improves detection efficiency while ensuring the accuracy of the detection result.

Fig. 6 shows a schematic diagram of a target detection image according to an embodiment of the present disclosure. As shown in fig. 6, the target detection image corresponding to the target frame has the same resolution as the target frame, and the value of each pixel is a number between 0 and 1. Each value represents the probability that the corresponding pixel belongs to the attention area; for example, a pixel with a value of 1 is a pixel of the attention area, and a pixel with a value of 0 is a pixel of the non-attention area.

And step S30, determining a corresponding background image and a foreground image according to the target frame.

In a possible implementation manner, image processing is performed on the target frame through different image processing manners, so as to obtain a background image and a foreground image with different visual effects. The method comprises the steps of blurring a target frame to obtain a background image, and sharpening the target frame to obtain a foreground image.

Optionally, the method for performing the blurring processing on the target frame in the embodiment of the present disclosure may include any image blurring processing method such as gaussian blurring, salt and pepper blurring, motion blurring, and occlusion blurring, which is not limited herein.

Further, the sharpening processing mode for the target frame according to the embodiment of the present disclosure may include any image sharpening processing modes such as sobel operator sharpening, laplacian operator sharpening, prewitt operator sharpening, canny operator sharpening, and the like, which is not limited herein.
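One possible realization of this step is sketched below; Gaussian blur and a Laplacian-based sharpening kernel are assumed choices among the options the text allows:

    import cv2
    import numpy as np

    def background_foreground(target_frame):
        """Derive the background image by blurring the target frame and
        the foreground image by sharpening it."""
        background = cv2.GaussianBlur(target_frame, (15, 15), 0)  # blurred copy
        sharpen_kernel = np.array([[0, -1, 0],
                                   [-1, 5, -1],
                                   [0, -1, 0]], dtype=np.float32)  # sharpening
        foreground = cv2.filter2D(target_frame, -1, sharpen_kernel)
        return background, foreground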

Determining the foreground image and the background image through different processing modes makes it possible to fuse them based on the attention area, enhancing the image contours of the attention area to improve its definition while reducing the definition of the non-attention area, thereby improving the visual experience of the final processed image.

And step S40, performing transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image.

In one possible implementation, the attention area of the target substitute image is a foreground image, and the non-attention area is a background image. The method for performing transparency fusion on the background image and the foreground image according to the target detection image to obtain the target substitute image comprises the following steps: and determining a transparency channel according to the target detection image, performing transparency fusion on the background image and the foreground image according to the transparency channel to obtain a target substitute image which displays the foreground image at the attention area position and displays all or part of the background image at the non-attention area position.

Optionally, each pixel value in the target detection image is remapped to the range 0-1 by normalizing the target detection image, giving the corresponding transparency channel. A region with a pixel value of 1 is the attention area, and regions with pixel values other than 1 are non-attention areas. Further, a pixel value of 1 indicates a position of 0% transparency (fully opaque), a pixel value of 0 indicates a position of 100% transparency, and pixel values between 0 and 1 indicate the corresponding degree of opacity.

Further, the method of performing transparency fusion on the background image and the foreground image according to the transparency channel may include: adjusting the transparency of each pixel in the foreground image according to the probability represented by the corresponding pixel value in the transparency channel, and fusing the adjusted foreground image with the background image to obtain the target substitute image. The target substitute image displays the opaque foreground image in the attention area, where it covers the background image. In the non-attention area, since the transparency of the foreground image lies between 0% and 100%, all or part of the background image can be displayed. Optionally, in non-attention areas with a pixel value of 0, the transparency of the foreground image is 100% and the background image is displayed directly; at positions whose pixel values are neither 0 nor 1, the transparency of the foreground image is adjusted according to the pixel value at that position, so that part of the foreground image and part of the background image are displayed there simultaneously.
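The fusion itself reduces to per-pixel alpha blending, as in the following minimal NumPy sketch (the min-max normalization is an assumed way of obtaining the 0-1 transparency channel):

    import numpy as np

    def transparency_fusion(foreground, background, detection):
        """Alpha-blend foreground over background using the normalized
        target detection image as the transparency channel: alpha = 1
        shows the sharp foreground (attention area), alpha = 0 shows
        the blurred background (non-attention area)."""
        alpha = detection.astype(np.float32)
        alpha = (alpha - alpha.min()) / max(float(np.ptp(alpha)), 1e-8)
        alpha = alpha[..., None]                      # broadcast over channels
        fused = alpha * foreground + (1.0 - alpha) * background
        return fused.astype(foreground.dtype)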

Fig. 7 shows a schematic diagram of a process of determining a target substitute image according to an embodiment of the present disclosure. As shown in fig. 7, a background image 71 and a foreground image 72 are obtained by performing a blurring process and a sharpening process on a target frame 70, respectively. Meanwhile, the transparency channel 74 is obtained by performing normalization processing on the target detection image 73. By transparency fusing the background image 71, the foreground image 72 and the transparency channel 74, an object substitute image 75 for replacing the object frame can be determined.

Fig. 8 shows a schematic diagram of a transparency fusion process according to an embodiment of the present disclosure. As shown in fig. 8, when transparency fusion is performed on the foreground image 80, the background image 81 and the transparency channel 82, the foreground image 80 is superimposed as the top image layer and the background image 81 as the bottom layer. Further, according to the transparency channel, the transparency of the attention area of the foreground image 80 is set to 0% (fully opaque), i.e., the top-layer foreground image 80 is displayed in the attention area of the target substitute image 83, while in non-attention areas with a value of 0 the transparency of the foreground image 80 is set to 100%, i.e., the bottom-layer background image 81 is displayed in the non-attention area of the target substitute image 83. For non-attention areas whose pixel values in the target detection image lie between 0 and 1, the transparency of the foreground image at each pixel position is adjusted according to the corresponding pixel value so that the foreground image 80 and part of the background image are displayed simultaneously; for example, when the pixel value is 0.8, the transparency of the foreground image is set to 20%.

By means of transparency fusion, the embodiment of the present disclosure displays a sharp foreground image in the attention area and a blurred background image in the non-attention area, improving the subjective visual experience of the resulting target substitute image.

Step S50, updating the target frame with the target substitute image.

In an alternative implementation, after a target substitute image that displays a foreground image and a background image in an attention area and a non-attention area respectively is obtained, a target frame in a video to be processed is updated through the target substitute image. Optionally, in a video coding scene, the updated target frame may be further input to a video encoder as an input frame for video coding.

Further, in response to the target frame being updated, the frame stored at the first position of the queue is popped and the next frame of the video to be processed is pushed into the queue. That is, after the target frame in the video to be processed has been updated, the processing of the current target frame is considered complete; by popping the frame at the first position of the queue and pushing in the next frame, the frame following the previous target frame in time axis order becomes the new target frame. Meanwhile, the frames now in the queue are read again to determine the target frame sequence corresponding to the new target frame. At this point, the updated previous target frame becomes a reference frame in the new target frame sequence.

In an alternative implementation manner, the video processing method of the embodiment of the disclosure is applied to a video coding scene. To improve the video encoding process, it is further necessary to determine the adaptive quantization parameter corresponding to the target detection image and input the updated target frame and the corresponding adaptive quantization parameter into the video encoder, so as to video-encode the target frame based on the corresponding adaptive quantization parameter. The process of inputting the updated target frame and the corresponding adaptive quantization parameter into the video encoder may be: inputting the updated target frame as an input frame into the video encoder, and inputting the adaptive quantization parameter into an adaptive quantization interface of the video encoder.
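As a data hand-off sketch only: the text does not name the encoder API, so the object and method names below (encoder, set_adaptive_quantization, encode_frame) are hypothetical placeholders; real encoders expose comparable hooks (x264, for instance, accepts per-macroblock quantizer offsets alongside each input picture).

    def encode_updated_frame(encoder, target_substitute_image, adaptive_qp):
        """Feed the updated target frame and its adaptive quantization
        parameter to the encoder. 'encoder' is a hypothetical object
        standing in for a real adaptive-quantization interface."""
        encoder.set_adaptive_quantization(adaptive_qp)  # per-macroblock QP map
        encoder.encode_frame(target_substitute_image)   # updated frame as input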

Optionally, the process of determining the adaptive quantization parameter corresponding to the target detection image includes performing histogram statistics on the target detection image to obtain a corresponding histogram mapping table. And mapping the target detection image according to the histogram mapping table to obtain a corresponding preliminary quantization parameter. Alternatively, the mapping process may be to initialize a blank image with the same size as the target detection image, determine corresponding values in the histogram mapping table for each pixel value in the target detection image, and store each value in the same position as the corresponding pixel value on the blank image to obtain the corresponding preliminary quantization parameter. Or determining the corresponding numerical value of each pixel value in the target detection image in the histogram mapping table, and replacing the corresponding pixel value in the target detection image according to each numerical value to obtain a preliminary quantization parameter.

Further, the adaptive quantization parameter is obtained by down-sampling the preliminary quantization parameter. The adaptive quantization parameter is used for video coding of the updated target frame in the video coding process. The down-sampling process is used to convert the preliminary quantization parameter to an image size suitable for video encoding.

In an alternative implementation, the down-sampling of the preliminary quantization parameter uses the same procedure and scaling factor as the down-sampling performed on each frame in the target frame sequence, and is not described herein again.
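
A sketch of this down-sampling step is shown below, assuming OpenCV is available and the scaling factor equals the predetermined multiple applied to the frames of the target frame sequence; the factor value and interpolation mode are illustrative assumptions.

```python
import cv2

def adaptive_quantization_parameter(preliminary_qp, factor=4):
    """Down-sample the preliminary quantization parameter by the same
    factor applied to the frames of the target frame sequence."""
    height, width = preliminary_qp.shape[:2]
    # cv2.resize takes (width, height); INTER_AREA suits down-sampling.
    return cv2.resize(preliminary_qp,
                      (width // factor, height // factor),
                      interpolation=cv2.INTER_AREA)
```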

Fig. 9 shows a schematic diagram of a process of determining an adaptive quantization parameter according to an embodiment of the present disclosure. As shown in fig. 9, in an application scenario of video coding, after determining a target detection image 90, the embodiment of the present disclosure may obtain a preliminary quantization parameter 91 corresponding to the target frame by means of histogram mapping. The histogram mapping process includes performing histogram statistics on the target detection image 90 to obtain a corresponding histogram mapping table, and mapping the target detection image through the histogram mapping table to obtain the preliminary quantization parameter 91. Further, the preliminary quantization parameter is down-sampled by the same predetermined multiple as that used in the down-sampling of each frame in the target frame sequence, resulting in an adaptive quantization parameter 92.

Fig. 10 shows a schematic diagram of a data transmission process according to an embodiment of the disclosure. As shown in fig. 10, after the target substitute image 100 has been updated to the target frame position in the video to be processed, the target substitute image 100 is input to the video encoder 102 as an input frame. At the same time, the adaptive quantization parameter 101 determined based on the target detection image is input to the adaptive quantization interface of the video encoder 102 as a parameter for video encoding the target substitute image 100.

In a video coding scene, the embodiments of the present disclosure may determine a corresponding adaptive quantization parameter based on the attention region detection result of a target frame, so as to further perform adaptive quantization adjustment and improve the efficiency of a video coding process.

According to the embodiments of the present disclosure, the background image and the foreground image of the target frame are determined, and the target frame is updated with a target substitute image that displays the foreground image in the attention area and at least part of the background image in the non-attention area, which further reduces the code rate of the whole video to be processed and avoids generating coding noise in the subsequent encoding process.

Furthermore, the embodiment of the present disclosure performs attention area detection after down-sampling each frame in the target frame sequence, thereby improving the efficiency of the attention area detection process and enabling real-time attention area detection.

It can be understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from principle and logic; due to space limitations, these combinations are not described in detail in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order in which the steps are executed should be determined by their functions and possible inherent logic.

In addition, the present disclosure also provides a video processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any video processing method provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.

Fig. 11 shows a schematic diagram of a video processing apparatus according to an embodiment of the present disclosure. As shown in fig. 11, the apparatus includes:

a sequence determining module 110, configured to determine a target frame sequence in a video to be processed according to a time axis sequence, where the target frame sequence includes a target frame and at least one reference frame within a preset length range from the target frame;

an attention area detection module 111, configured to perform attention area detection according to the target frame sequence to obtain a target detection image for representing an attention area and a non-attention area in the target frame;

an image determining module 112, configured to determine a corresponding background image and a corresponding foreground image according to the target frame;

an image fusion module 113, configured to perform transparency fusion on the background image and the foreground image according to the target detection image to obtain a target substitute image, where an attention area of the target substitute image is the foreground image, and a non-attention area of the target substitute image is at least part of the background image;

an image update module 114 for updating the target frame with the target substitute image.

In one possible implementation, the attention area detection module includes:

the first processing submodule is used for performing first image processing on the target frame sequence to obtain a feature tensor, and the feature tensor is used for representing the image features of the target frame and each reference frame in the target frame sequence;

the detection submodule is used for inputting the characteristic tensor into a trained neural network to detect an attention area, determining the attention area in the target frame by comparing the target frame with each reference frame, and outputting a first detection image for representing the attention area and a non-attention area, wherein the non-attention area is an area except the attention area;

and the second processing submodule is used for carrying out second image processing on the first detection image to obtain a target detection image with the same resolution as the target frame.

In one possible implementation manner, the first processing sub-module includes:

a down-sampling unit, configured to down-sample each frame in the target frame sequence by a predetermined multiple;

and the characteristic tensor determining unit is used for determining the characteristic tensor according to each down-sampled frame.

In one possible implementation, the feature tensor comprises a four-dimensional feature tensor whose four dimensions are the timing, channel, height, and width of the corresponding frame, respectively.
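
As an illustration, the feature tensor can be built by stacking the down-sampled frames along a leading time dimension; the frame count and resolution below are arbitrary examples, not values prescribed by the disclosure.

```python
import numpy as np

# Five down-sampled frames in channel-height-width (CHW) layout.
frames = [np.random.rand(3, 135, 240).astype(np.float32) for _ in range(5)]

# Stack along a leading time axis: (timing, channel, height, width).
feature_tensor = np.stack(frames, axis=0)
print(feature_tensor.shape)  # (5, 3, 135, 240)
```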

In one possible implementation, the second processing sub-module includes:

the up-sampling unit is used for up-sampling the first detection image by the preset multiple to obtain a second detection image with the same resolution as the target frame;

and the pooling unit is used for performing maximum pooling on the second detection image by using a window with a preset size and a step length to obtain a target detection image.
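
A sketch of this second image processing using PyTorch is given below, assuming the first detection image arrives as a 1×1×h×w tensor; the bilinear interpolation mode, the window size of 3 and the step length of 1 are illustrative assumptions (a step length of 1 with matching padding preserves the target resolution).

```python
import torch
import torch.nn.functional as F

def second_image_processing(first_detection: torch.Tensor,
                            scale: int) -> torch.Tensor:
    """Up-sample the first detection image back to the target frame
    resolution, then max-pool it with a preset window and step length.

    first_detection: tensor of shape (1, 1, h, w) output by the network.
    scale: the same predetermined multiple used when down-sampling.
    """
    second_detection = F.interpolate(first_detection, scale_factor=scale,
                                     mode='bilinear', align_corners=False)
    # A 3x3 window with step length 1 and padding 1 preserves the
    # resolution while slightly dilating the attention area.
    return F.max_pool2d(second_detection, kernel_size=3, stride=1, padding=1)
```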

In one possible implementation, the neural network is a MobileNetV3 neural network.

In one possible implementation, the image determination module includes:

the background determining submodule is used for blurring the target frame to obtain a background image;

and the foreground determining submodule is used for carrying out sharpening processing on the target frame to obtain a foreground image.
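
A sketch of these two submodules with OpenCV follows, using a Gaussian blur for the background and an unsharp mask for the foreground; the kernel sizes and sharpening amount are illustrative assumptions, as the disclosure does not prescribe specific filters.

```python
import cv2

def background_and_foreground(target_frame, blur_ksize=21, amount=1.0):
    """Blur the target frame to obtain the background image and
    sharpen it (unsharp masking) to obtain the foreground image."""
    # Heavy Gaussian blur produces the low-detail background image.
    background = cv2.GaussianBlur(target_frame, (blur_ksize, blur_ksize), 0)
    # Unsharp mask: original + amount * (original - slightly blurred).
    soft = cv2.GaussianBlur(target_frame, (5, 5), 0)
    foreground = cv2.addWeighted(target_frame, 1.0 + amount,
                                 soft, -amount, 0)
    return background, foreground
```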

In one possible implementation, the image fusion module includes:

the channel determining submodule is used for determining a transparency channel according to the target detection image;

and the image fusion submodule is used for performing transparency fusion on the background image and the foreground image according to the transparency channel to obtain a target substitute image which displays the foreground image at the attention area position and displays the background image at the non-attention area position.

In one possible implementation, the sequence determining module includes:

the queue insertion submodule is used for sequentially adding each frame in the video to be processed into a preset first-in first-out queue according to the time axis order;

and the sequence determining submodule is used for, in response to each position in the queue being occupied, using the frame at the middle position of the queue as the target frame and the frames at the other positions as reference frames to determine the target frame sequence.

In one possible implementation, the apparatus further includes:

and the queue updating module is used for responding to the update of the target frame, popping up the frame stored in the first position in the queue, and pushing the next frame in the video to be processed into the queue.

In one possible implementation, the apparatus further includes:

the parameter determining module is used for determining the self-adaptive quantization parameter corresponding to the target detection image;

and the data transmission module is used for inputting the updated target frame and the corresponding adaptive quantization parameter into a video encoder so as to carry out video encoding on the target frame based on the corresponding adaptive quantization parameter.

In one possible implementation, the parameter determining module includes:

the histogram statistics submodule is used for carrying out histogram statistics on the target detection image to obtain a corresponding histogram mapping table;

the first parameter determining submodule is used for mapping the target detection image according to the histogram mapping table to obtain a corresponding preliminary quantization parameter;

and the second parameter determination submodule is used for performing down-sampling on the preliminary quantization parameter to obtain a self-adaptive quantization parameter.

In one possible implementation, the data transmission module includes:

and the data transmission sub-module is used for inputting the updated target frame into the video encoder as an input frame and inputting the adaptive quantization parameter into an adaptive quantization interface of the video encoder.

In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.

Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.

The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.

The electronic device may be provided as a terminal, server, or other form of device.

FIG. 12 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment. For example, the electronic device 1200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or any other similar terminal.

Referring to fig. 12, electronic device 1200 may include one or more of the following components: processing component 1202, memory 1204, power component 1206, multimedia component 1208, audio component 1210, input/output (I/O) interface 1212, sensor component 1214, and communications component 1216.

The processing component 1202 generally controls overall operation of the electronic device 1200, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1202 may include one or more processors 1220 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1202 can include one or more modules that facilitate interaction between the processing component 1202 and other components. For example, the processing component 1202 can include a multimedia module to facilitate interaction between the multimedia component 1208 and the processing component 1202.

The memory 1204 is configured to store various types of data to support operation at the electronic device 1200. Examples of such data include instructions for any application or method operating on the electronic device 1200, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1204 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 1206 provides power to the various components of the electronic device 1200. The power components 1206 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 1200.

The multimedia component 1208 comprises a screen providing an output interface between the electronic device 1200 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1208 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 1200 is in an operation mode, such as a shooting mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capabilities.

Audio component 1210 is configured to output and/or input audio signals. For example, the audio assembly 1210 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1200 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1204 or transmitted via the communication component 1216. In some embodiments, audio assembly 1210 further includes a speaker for outputting audio signals.

The I/O interface 1212 provides an interface between the processing component 1202 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 1214 includes one or more sensors for providing various aspects of state assessment for the electronic device 1200. For example, the sensor assembly 1214 may detect the open/closed state of the electronic device 1200 and the relative positioning of components, such as the display and keypad of the electronic device 1200; the sensor assembly 1214 may also detect a change in the position of the electronic device 1200 or of a component of the electronic device 1200, the presence or absence of user contact with the electronic device 1200, the orientation or acceleration/deceleration of the electronic device 1200, and a change in the temperature of the electronic device 1200. The sensor assembly 1214 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1214 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communications component 1216 is configured to facilitate communications between the electronic device 1200 and other devices in a wired or wireless manner. The electronic device 1200 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 1216 receives the broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 1216 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 1200 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1204, is also provided, including computer program instructions executable by the processor 1220 of the electronic device 1200 to perform the above-described methods.

FIG. 13 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment. For example, the electronic device 1300 may be provided as a server. Referring to fig. 13, electronic device 1300 includes a processing component 1322, which further includes one or more processors, and memory resources, represented by memory 1332, for storing instructions, such as application programs, that may be executed by processing component 1322. The application programs stored in memory 1332 may include one or more modules that each correspond to a set of instructions. Further, processing component 1322 is configured to execute instructions to perform the methods described above.

The electronic device 1300 may also include a power component 1326 configured to perform power management for the electronic device 1300, a wired or wireless network interface 1350 configured to connect the electronic device 1300 to a network, and an input-output (I/O) interface 1358. The electronic device 1300 may operate based on an operating system stored in the memory 1332, such as the Microsoft server operating system (Windows Server™), the Apple graphical user interface operating system (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1332, is also provided that includes computer program instructions that are executable by the processing component 1322 of the electronic device 1300 to perform the above-described methods.

The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can be personalized by utilizing the state information of the computer-readable program instructions, and this electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
