Video striping method and apparatus

Document No.: 139104  Publication date: 2021-10-22

Reading note: This technology, "Video striping method and apparatus", was designed and created by 陈大友 (Chen Dayou), 金鑫 (Jin Xin), and 涂丹丹 (Tu Dandan) on 2020-04-21. Its main content is as follows: The application provides a video striping method and apparatus that determine video split points according to information from multiple modalities, thereby improving the accuracy of video striping. In a first aspect, a video striping method is provided, which includes: acquiring a plurality of fine-grained striping segments of a video according to shot cut points of the video and speech pause points of the video; extracting features of each fine-grained striping segment of the plurality of fine-grained striping segments, where the features include picture features and audio features; and processing the features of the plurality of fine-grained striping segments according to a first neural network model to obtain a split-point prediction sequence of the video, where the split-point prediction sequence includes a plurality of split points and probabilities corresponding to the plurality of split points.

1. A method for video striping, comprising:

acquiring a plurality of fine-grained striping segments of a video according to shot cut points of the video and speech pause points of the video;

extracting features of each fine-grained striping segment of the plurality of fine-grained striping segments, wherein the features comprise picture features and audio features;

processing the features of the plurality of fine-grained striping segments according to a first neural network model to obtain a split-point prediction sequence of the video, wherein the split-point prediction sequence comprises a plurality of split points and probabilities corresponding to the plurality of split points.

2. The method of claim 1, wherein the features further comprise subtitle features.

3. The method of claim 1 or 2, wherein the first neural network model is a multi-modal fusion striping detector model.

4. The method of any one of claims 1 to 3, wherein before the processing of the features of the plurality of fine-grained striping segments according to the first neural network model, the method further comprises:

aligning the features of the plurality of fine-grained striping segments in a time dimension.

5. A method of training a neural network model, comprising:

acquiring training data, wherein the training data comprises features of a plurality of video segments and a standard split-point sequence corresponding to the plurality of video segments, and the features comprise picture features and audio features;

training an initial neural network model according to the training data;

and obtaining a first neural network model when a difference between a split-point prediction sequence of a video, obtained by the neural network model according to the features of the plurality of video segments, and the standard split-point sequence satisfies a preset condition, wherein the split-point prediction sequence comprises one or more split points and probabilities corresponding to the one or more split points.

6. The method of claim 5, wherein the features further comprise subtitle features.

7. A video striping apparatus, comprising:

a processor, wherein the processor is configured to acquire a plurality of fine-grained striping segments of a video according to shot cut points of the video and speech pause points of the video;

the processor is further configured to extract features of each fine-grained striping segment of the plurality of fine-grained striping segments, wherein the features comprise picture features and audio features;

the processor is further configured to process the features of the plurality of fine-grained striping segments according to a first neural network model to obtain a split-point prediction sequence of the video, wherein the split-point prediction sequence comprises a plurality of split points and probabilities corresponding to the plurality of split points.

8. The apparatus of claim 7, wherein the features further comprise subtitle features.

9. The apparatus of claim 7 or 8, wherein the first neural network model is a multi-modal fusion striping detector model.

10. The apparatus of any one of claims 7 to 9, wherein before processing the features of the plurality of fine-grained striping segments according to the first neural network model, the processor is further configured to:

align the features of the plurality of fine-grained striping segments in a time dimension.

11. An apparatus for training a neural network model, comprising: a transmission interface and a processor;

wherein the transmission interface is configured to acquire training data, the training data comprises features of a plurality of video segments and a standard split-point sequence corresponding to the plurality of video segments, and the features comprise picture features and audio features;

and the processor is configured to perform the following steps:

training an initial neural network model according to the training data;

and obtaining a first neural network model when a difference between a split-point prediction sequence of a video, obtained by the neural network model according to the features of the plurality of video segments, and the standard split-point sequence satisfies a preset condition, wherein the split-point prediction sequence comprises one or more split points and probabilities corresponding to the one or more split points.

12. The apparatus of claim 11, wherein the features further comprise subtitle features.

13. An apparatus, comprising: a memory and a processor, wherein the memory stores code and data, the memory is coupled to the processor, and the processor executes the code in the memory to cause the apparatus to perform the video striping method of any one of claims 1 to 4, or to perform the method of training a neural network model of claim 5 or 6.

14. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed, perform the video striping method of any one of claims 1 to 4, or perform the method of training a neural network model of claim 5 or 6.

15. A computer program product, comprising: instructions which, when run on a computer, cause the computer to perform the video striping method of any one of claims 1 to 4, or to perform the method of training a neural network model of claim 5 or 6.

Technical Field

The present application relates to the field of video processing, and more particularly, to a video striping method and apparatus.

Background

With the development of television broadcasting and internet video technologies, the cost of storage and video capture equipment has fallen and intelligent terminal devices of all kinds have become widespread, so massive amounts of video are being produced while users' demand for diverse video content keeps growing. These massive videos need to be processed a second time to form new media programs, which are finally presented to users. Video striping, which splits a long video into a plurality of segments with different topics, is the primary and most important step in the secondary processing of video. The traditional video striping approach is manual splitting: a worker first browses the video material and splits it after understanding the video. This approach is inefficient, has poor timeliness, and cannot meet the ever-growing demand of the multimedia market.

Existing automatic video striping methods decide split points only according to information from a single modality; for videos with complex structure and diverse content, such methods may produce inaccurate splits.

Disclosure of Invention

The present application provides a video striping method and apparatus that determine video split points according to information from multiple modalities, thereby improving the accuracy of video striping.

In a first aspect, a video striping method is provided, which includes: acquiring a plurality of fine-grained striping segments of a video according to shot cut points of the video and speech pause points of the video; extracting features of each fine-grained striping segment of the plurality of fine-grained striping segments, where the features include picture features and audio features; and processing the features of the plurality of fine-grained striping segments according to a first neural network model to obtain a split-point prediction sequence of the video, where the split-point prediction sequence includes a plurality of split points and probabilities corresponding to the plurality of split points.

For news videos or internet videos with complex structure and diverse content, accurate striping requires systematic understanding of the video content. Video striping must solve two main problems: ensuring that the striping boundaries are clean, that is, picture integrity and audio integrity, and ensuring the accuracy of the split points. In the video striping method provided by the embodiments of this application, a plurality of fine-grained striping segments of the video are first obtained according to the shot cut points and speech pause points of the video, which guarantees clean striping boundaries; then, a split-point prediction sequence of the video is obtained by combining multiple modal features of the video segments, such as picture features and audio features, with a neural network model, which guarantees the accuracy of the split points.

With reference to the first aspect, in certain implementations of the first aspect, the features further include subtitle features.

For videos with subtitles, the video striping method of the embodiments of this application further obtains the split-point prediction sequence of the video according to the subtitle features of the video segments.

With reference to the first aspect, in certain implementations of the first aspect, the first neural network model is a multi-modal fusion striping detector model.

With reference to the first aspect, in certain implementations of the first aspect, before the features of the plurality of fine-grained striping segments are processed according to the first neural network model, the method further includes: aligning the features of the plurality of fine-grained striping segments in the time dimension.

Aligning the features of the fine-grained striping segments in the time dimension ensures that the picture features, audio features, and/or subtitle features within the same time period form one group of features, which improves the accuracy of video striping.

In a second aspect, a method of training a neural network model is provided, the method including: acquiring training data, where the training data includes features of a plurality of video segments and a standard split-point sequence corresponding to the plurality of video segments, and the features include picture features and audio features; training an initial neural network model according to the training data; and obtaining a first neural network model when the difference between a split-point prediction sequence of a video, obtained by the neural network model according to the features of the plurality of video segments, and the standard split-point sequence satisfies a preset condition, where the split-point prediction sequence includes one or more split points and probabilities corresponding to the one or more split points.

The method of the embodiments of this application also includes a method of training the neural network model; the trained neural network model can be used to obtain the split-point prediction sequence according to the features of the fine-grained striping segments, which improves the accuracy of video striping.

With reference to the second aspect, in some implementations of the second aspect, the features further include subtitle features.

In a third aspect, a video striping apparatus is provided, the apparatus including a processor, where the processor is configured to acquire a plurality of fine-grained striping segments of a video according to shot cut points of the video and speech pause points of the video; the processor is further configured to extract features of each fine-grained striping segment of the plurality of fine-grained striping segments, where the features include picture features and audio features; and the processor is further configured to process the features of the plurality of fine-grained striping segments according to a first neural network model to obtain a split-point prediction sequence of the video, where the split-point prediction sequence includes a plurality of split points and probabilities corresponding to the plurality of split points.

With reference to the third aspect, in certain implementations of the third aspect, the features further include subtitle features.

With reference to the third aspect, in certain implementations of the third aspect, the first neural network model is a multi-modal fusion striping detector model.

With reference to the third aspect, in some implementations of the third aspect, before the processor processes the features of the plurality of fine-grained striping segments according to the first neural network model, the processor is further configured to: align the features of the plurality of fine-grained striping segments in the time dimension.

In a fourth aspect, an apparatus for training a neural network model is provided, the apparatus including a transmission interface and a processor, where the transmission interface is configured to acquire training data, the training data includes features of a plurality of video segments and a standard split-point sequence corresponding to the plurality of video segments, and the features include picture features and audio features; and the processor is configured to perform the following steps: training an initial neural network model according to the training data; and obtaining a first neural network model when the difference between a split-point prediction sequence of a video, obtained by the neural network model according to the features of the plurality of video segments, and the standard split-point sequence satisfies a preset condition, where the split-point prediction sequence includes one or more split points and probabilities corresponding to the one or more split points.

With reference to the fourth aspect, in some implementations of the fourth aspect, the features further include subtitle features.

In a fifth aspect, an apparatus is provided, including: a memory and a processor, where the memory stores code and data, the memory is coupled to the processor, and the processor executes the code in the memory to cause the apparatus to perform the method in any one of the implementations of the first aspect or the second aspect.

In a sixth aspect, a computer-readable storage medium is provided, having instructions stored thereon that, when executed, perform the method in any one of the implementations of the first aspect or the second aspect.

In a seventh aspect, a computer program product is provided, including instructions that, when the computer program product is run on a computer, cause the computer to perform the method in any one of the implementations of the first aspect or the second aspect.

In an eighth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to execute the method in any one implementation manner of the first aspect or the second aspect.

Optionally, as an implementation manner, the chip may further include a memory, the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the method in any one of the implementation manners of the first aspect or the second aspect.

Drawings

FIG. 1 is a schematic architecture diagram of a video striping system according to an embodiment of the present application;

FIG. 2 is a schematic block diagram of a video striping method in the prior art;

FIG. 3 is a schematic flow chart diagram of a video striping method according to an embodiment of the present application;

FIG. 4 is a method of training a neural network model according to an embodiment of the present application;

FIG. 5 is a schematic block diagram of a video striping method according to an embodiment of the present application;

FIG. 6 is a schematic block diagram of fine-grained striping of a video according to an embodiment of the present application;

FIG. 7 is a schematic block diagram of a multi-modal fusion precise striping process according to an embodiment of the present application;

FIG. 8 is a neural network model structure of a multi-modal fusion striping detector according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a hardware structure of a video striping apparatus according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a hardware structure of a model training apparatus according to an embodiment of the present application.

Detailed Description

In order to facilitate understanding of the technical solutions of the present application, first, concepts related to the present application are briefly introduced.

Deep learning (DL): a machine learning technique based on deep neural network algorithms, whose main feature is the use of multiple nonlinear transformation structures to process and analyze data. It is mainly applied to perception and decision-making scenarios in the field of artificial intelligence, such as image and speech recognition, natural language translation, and computer game playing.

Since the embodiments of the present application involve the application of neural networks, for ease of understanding, terms and concepts related to neural networks that may be involved in the embodiments of the present application are described below.

(1) Neural network

The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

f( Σ_{s=1}^{n} W_s x_s + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
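As a minimal numerical illustration of the formula above (not part of the original application; the input values, weights, and bias are arbitrary), a single neural unit with a sigmoid activation can be computed as follows:

```python
import numpy as np

def sigmoid(z):
    # sigmoid activation f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

# example inputs x_s, weights W_s, and bias b (arbitrary values)
x = np.array([0.5, -1.2, 0.3])
W = np.array([0.8, 0.1, -0.4])
b = 0.2

# output of the neural unit: f(sum_s W_s * x_s + b)
output = sigmoid(np.dot(W, x) + b)
print(output)  # a scalar in (0, 1)
```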

(2) Deep neural network

A deep neural network (DNN), also called a multi-layer neural network, can be understood as a neural network with multiple hidden layers. The DNN is divided according to the positions of different layers, and its layers can be divided into three categories: input layer, hidden layer, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron of the i-th layer is necessarily connected to any neuron of the (i+1)-th layer.

Although a DNN looks complicated, the work of each layer is actually not complicated; it is simply the following linear relational expression: y = α(W x + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called coefficients), and α( ) is the activation function. Each layer simply performs this operation on the input vector x to obtain the output vector y. Because a DNN has many layers, the number of coefficient matrices W and offset vectors b is also large. These parameters are defined in the DNN as follows. Taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.

In summary, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L is defined as W^L_{jk}.

Note that the input layer has no W parameter. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).

(3) Convolutional neural network

A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network includes a feature extractor consisting of convolutional layers and sub-sampling layers, and the feature extractor can be regarded as a filter. A convolutional layer is a neuron layer that performs convolution processing on an input signal in the convolutional neural network. In a convolutional layer, one neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of several neural units arranged in a rectangle. Neural units of the same feature plane share weights, and the shared weights are the convolution kernel. Sharing weights can be understood as meaning that the way image information is extracted is independent of location. The convolution kernel can be initialized as a matrix of random size, and reasonable weights can be learned during the training of the convolutional neural network. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network while also reducing the risk of overfitting.

(4) A recurrent neural network (RNN) is used to process sequence data. In a traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although this ordinary neural network solves many problems, it is still incapable of handling many others. For example, to predict the next word in a sentence, the previous words are generally needed, because the words in a sentence are not independent of each other. An RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. Concretely, the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. The training of an RNN is the same as that of a conventional CNN or DNN.

Why are recurrent neural networks needed when convolutional neural networks already exist? The reason is simple: a convolutional neural network has a precondition assumption that the elements are independent of each other, and that the inputs and outputs are also independent, such as cats and dogs. However, in the real world many elements are interconnected; for example, stock prices change over time. Another example: a person says, "I like traveling, my favorite place is Yunnan, and I will go to __ when I have the chance." To fill in the blank here, humans all know to fill in "Yunnan", because humans infer from the context; but how can a machine do this? This is why the RNN emerged. The RNN aims to give machines the ability to remember like humans. Therefore, the output of an RNN needs to depend on both the current input information and the historical memory information.

(5) Long short-term memory (LSTM) is a special RNN. On the basis of an ordinary RNN, a memory cell is added to each neural unit of the hidden layer, so that the memory information along the time sequence is controllable. Each time information is passed between the units of the hidden layer, several controllable gates (a forget gate, an input gate, and an output gate) control the degree to which previous information and current information are remembered or forgotten, so that the RNN gains a long-term memory function. The forget gate determines which information is discarded in the cell, the input gate determines which new information is added to the cell, and the output gate determines which part of the cell is output. Owing to its unique design structure, LSTM is suitable for processing and predicting events with relatively long intervals and delays in a time series.
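As a brief illustrative sketch (assuming the PyTorch library, which the application does not mandate), the following shows how an LSTM layer carries hidden and cell states across the time steps of a feature sequence; this is the long-term memory property later relied on by the multi-modal fusion striping detector:

```python
import torch
import torch.nn as nn

# toy feature sequence: a batch of 1 video, 10 segments, 64-dim features per segment
features = torch.randn(1, 10, 64)

# LSTM with gated memory cells; hidden/cell states are carried across time steps
lstm = nn.LSTM(input_size=64, hidden_size=32, batch_first=True)

outputs, (h_n, c_n) = lstm(features)
print(outputs.shape)  # torch.Size([1, 10, 32]): one hidden state per time step
print(h_n.shape)      # torch.Size([1, 1, 32]): final hidden state (long-term summary)
```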

The video striping system is widely applied in internet video analysis and in broadcast-television video editing and production. Fig. 1 shows the architecture of a typical video striping system. As shown in Fig. 1, a captured video is transmitted to the video striping system, the video striping system analyzes and splits the video to obtain a striping result, and the video is then released according to the striping result.

One conventional video striping method targets news videos: based on picture frames, it calculates picture similarity and analyzes key persons, and, combined with prior rules, takes the points where key persons appear as split points.

This method uses only person analysis in the single modality of picture frames and lacks audio and subtitle information, so it can only split topics at points where key persons are switched. This may make the split boundaries inaccurate; topic switches without key persons cannot be split, while a split may still be made when a key person is switched even though the topic has not changed, so the splitting error is large.

Fig. 2 is a schematic block diagram of another prior-art video striping method. This method also targets news videos. First, pictures and audio are decoded from the news video; host detection and subtitle detection are performed on the decoded pictures, and speech pause detection is performed on the decoded audio. Then, visual candidate points (host split points and subtitle split points) and speech candidate points (speech pause points) of news item boundaries are obtained through heuristic rules. Finally, the visual candidate points and the speech candidate points are fused and combined based on heuristic rules to serve as the final video split points. In the end, this multi-modal fusion is merely a simple combination of results at the decision layer.

In this method, split points are obtained by analyzing the information of the three modalities (host, subtitles, and speech) separately and are then combined into final split points based on certain rules. Because the information of the different modalities is not effectively fused and complemented, the resulting strips are prone to fragmentation, and the errors of single-modality splitting accumulate into a larger splitting error.

Another existing automatic striping method for news videos works as follows: broadcast-television data is initialized to automatically obtain the audio waveform and video images of a news program; audio and video features of the news data are extracted, including host detection, subtitle detection and tracking, and speech detection; visual candidate points and speech candidate points of news item boundaries are obtained through heuristic rules; news item boundaries are located through audio-video fusion; and, after the processing results are manually checked, they are entered into a knowledge base as knowledge resources supporting supervision requirements.

This method tracks the host and subtitles based on traditional models, so the single-modality split points it finds have large errors. The information of the three modalities (host, subtitles, and speech) is again combined into final split points at the decision layer based on certain rules, and video splitting for topic switches without key persons still cannot be handled.

For news videos or internet videos with complex structure and diverse content, accurate striping requires systematic understanding of the video content. Video striping must solve two main problems: ensuring that the striping boundaries are clean, that is, picture integrity and audio integrity, and ensuring the accuracy of the split points. Information from a single modality is not enough to accurately judge a change of topic. For example, when the topic being narrated changes while the picture does not switch, picture information alone is insufficient to judge; likewise, when the subtitles are switched while the subtitle topic does not change, text information alone is insufficient to judge. Therefore, accurate video striping requires comprehensive analysis of multiple modalities such as video pictures, subtitles, and speech to accurately determine topic changes.

Fig. 3 shows a schematic flowchart of the video striping method provided in an embodiment of the present application. As shown in Fig. 3, the video striping method of the embodiment of the present application includes steps S301 to S303, which are described in detail below.

S301. Acquire a plurality of fine-grained striping segments of the video according to shot cut points of the video and speech pause points of the video.

Specifically, for a video to be processed, a video frame and audio of the video are first extracted.

Shot cut points are then acquired from the video frames. Specifically, features of the video frames may be extracted by a neural network model, and the feature distance between the current video frame and the previous video frame is compared; if the feature distance is relatively large, it is determined that a shot cut has occurred. The neural network model may be any neural network model that determines whether a shot cut occurs according to the features of video frames, which is not limited in the embodiments of this application.
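The following is a minimal sketch of this shot cut detection step. The frame feature extractor is left abstract and the distance threshold is a hypothetical parameter; neither is specified by the application.

```python
import numpy as np

def detect_shot_cuts(frame_features, threshold=0.5):
    """Return frame indices where the feature distance between consecutive
    frames is large, i.e. candidate shot cut points.

    frame_features: sequence of per-frame feature vectors produced by some
    neural network feature extractor (not specified by the application).
    threshold: hypothetical distance threshold; in practice it would be tuned.
    """
    cut_points = []
    for i in range(1, len(frame_features)):
        # distance between the current frame feature and the previous frame feature
        dist = np.linalg.norm(frame_features[i] - frame_features[i - 1])
        if dist > threshold:
            cut_points.append(i)  # a shot cut is assumed to occur at frame i
    return cut_points
```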

Speech pause points are then obtained from the audio. Specifically, physical waveform analysis may be performed on the audio signal, and points with a relatively small fluctuation amplitude are selected as speech pause points.
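One simple way to realize this waveform analysis is short-time energy thresholding. The sketch below is one possible interpretation and is not prescribed by the application; the window length and energy ratio are assumed parameters.

```python
import numpy as np

def detect_speech_pauses(audio, sample_rate, win_sec=0.02, energy_ratio=0.1):
    """Return sample indices of low-energy windows, used as speech pause points.

    audio: 1-D numpy array of audio samples.
    win_sec / energy_ratio: hypothetical parameters controlling the window
    length and the relative energy threshold.
    """
    win = int(sample_rate * win_sec)
    # short-time energy of consecutive, non-overlapping windows
    energies = np.array([
        np.sum(audio[i:i + win] ** 2) for i in range(0, len(audio) - win, win)
    ])
    threshold = energy_ratio * energies.mean()
    # windows whose energy falls below the threshold are treated as pauses
    pause_windows = np.where(energies < threshold)[0]
    return (pause_windows * win).tolist()  # convert window index to sample index
```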

It should be understood that the process of acquiring shot cut points from the video frames and the process of acquiring speech pause points from the audio may be performed sequentially or simultaneously, which is not limited in the embodiments of this application.

Finally, a plurality of fine-grained striping segments of the video are acquired according to the shot cut points and speech pause points of the video. Specifically, the video is split at positions where a shot cut and a speech pause occur simultaneously. For example, a shot cut point may be selected, and it is then determined whether a speech pause also occurs at that shot cut point; if a speech pause also occurs there, the video is split at that position, and if no speech pause occurs there, the video is not split at that position. In this way, a plurality of fine-grained striping segments of the video are obtained.
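Combining the two kinds of points, the sketch below shows one way the fine-grained segment boundaries could be derived. The tolerance used to decide that a shot cut and a speech pause are "simultaneous" is an assumption for illustration only.

```python
def fine_grained_boundaries(shot_cut_times, pause_times, tolerance=0.2):
    """Keep only shot cut times that also coincide with a speech pause.

    shot_cut_times, pause_times: timestamps in seconds.
    tolerance: hypothetical maximum gap (seconds) for two points to be
    considered simultaneous.
    """
    boundaries = []
    for t in shot_cut_times:
        if any(abs(t - p) <= tolerance for p in pause_times):
            boundaries.append(t)  # split here: shot cut and speech pause coincide
    return boundaries

def split_into_segments(video_duration, boundaries):
    """Turn boundary timestamps into (start, end) fine-grained segments."""
    points = [0.0] + sorted(boundaries) + [video_duration]
    return list(zip(points[:-1], points[1:]))
```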

S302. Extract features of each fine-grained striping segment of the plurality of fine-grained striping segments, where the features include picture features and audio features.

Specifically, for each fine-grained striping segment, the video frames and audio of the fine-grained striping segment are extracted.

The picture features of the fine-grained striping segment are acquired from the video frames. Specifically, the video frames may be input into a picture feature extractor to extract the picture features of the fine-grained striping segment. The picture feature extractor may be any neural network model for extracting picture features, which is not limited in the embodiments of this application.

The audio features of the fine-grained striping segment are acquired from the audio. Specifically, the audio may be input into an audio feature extractor to extract the audio features of the fine-grained striping segment. The audio feature extractor may be any neural network model for extracting audio features, which is not limited in this application.

Optionally, the video striping method of the embodiments of this application further includes acquiring subtitle features of the fine-grained striping segment from the video frames. Specifically, the optical character recognition (OCR) subtitles of the segment may be obtained through optical character recognition, and the OCR subtitles are then input into a subtitle feature extractor to extract the subtitle features of the segment. The subtitle feature extractor may be any neural network model for extracting subtitle features, which is not limited in the embodiments of this application.

Thus, according to the above method, a plurality of picture features, a plurality of audio features, and/or a plurality of subtitle features can be obtained for the plurality of fine-grained striping segments.
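Schematically, the per-segment multi-modal feature collection described in S302 could look like the following sketch; the three extractors and the segment interface (frames, audio, OCR text) are placeholders, since the application does not fix concrete models.

```python
def extract_segment_features(segments, picture_extractor, audio_extractor,
                             subtitle_extractor=None):
    """Collect picture / audio / (optional) subtitle features per segment.

    segments: iterable of objects exposing .frames, .audio and .ocr_text
    (a hypothetical interface used only for illustration).
    """
    all_features = []
    for seg in segments:
        feat = {
            "picture": picture_extractor(seg.frames),
            "audio": audio_extractor(seg.audio),
        }
        if subtitle_extractor is not None:
            # OCR subtitles extracted from the segment's video frames
            feat["subtitle"] = subtitle_extractor(seg.ocr_text)
        all_features.append(feat)
    return all_features
```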

S303. Process the features of the plurality of fine-grained striping segments according to the first neural network model to obtain a split-point prediction sequence of the video, where the split-point prediction sequence includes a plurality of split points and probabilities corresponding to the plurality of split points.

Specifically, the picture features, audio features, and/or subtitle features of the plurality of fine-grained striping segments may be used as the input of the first neural network model, and the output of the first neural network model is the split-point prediction sequence of the video. The split-point prediction sequence includes a plurality of split points and probabilities corresponding to the plurality of split points; each split point represents a position at which the video may be split, and the probability corresponding to each split point represents the likelihood that the video is split at that position.

Optionally, before the features of the plurality of fine-grained striping segments are processed according to the first neural network model, the video striping method of the embodiments of this application further includes aligning the features of the plurality of fine-grained striping segments in the time dimension. Specifically, the picture features of the plurality of fine-grained striping segments are sorted by video playback time, the audio features are sorted by video playback time, and the subtitle features are sorted by video playback time; the picture features, audio features, and/or subtitle features within the same time period are taken as one group of features. In this way, multiple groups of features, that is, the aligned feature sequence, are obtained.
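A sketch of this time-dimension alignment is shown below: the features of the different modalities belonging to the same segment (the same time period) are stacked into one group per time step. Concatenation is an illustrative choice of how to form a group; the application does not prescribe how a group is fused.

```python
import numpy as np

def align_features(all_features):
    """Build the aligned feature sequence [x_1, ..., x_m].

    all_features: per-segment dicts (already in playback order) with keys
    "picture", "audio" and optionally "subtitle", each a 1-D feature vector.
    Each x_i groups the features of the same time period together.
    """
    sequence = []
    for feat in all_features:
        parts = [feat["picture"], feat["audio"]]
        if "subtitle" in feat:
            parts.append(feat["subtitle"])
        sequence.append(np.concatenate(parts))
    return np.stack(sequence)  # shape: (m segments, fused feature dim)
```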

The first neural network model is a neural network model that obtains a split-point prediction sequence according to the features of the fine-grained striping segments. The first neural network model may be trained in advance; for the specific process, refer to the description of Fig. 4 below.

Fig. 4 illustrates a method for training a neural network model according to an embodiment of the present application, which includes steps S401 to S403, described below.

S401. Acquire training data, where the training data includes features of a plurality of video segments and a standard split-point sequence corresponding to the plurality of video segments, and the features include picture features and audio features.

The features of the video segments may be picture features and audio features extracted from existing video segments, and the standard split-point sequence corresponding to the video segments may be manually annotated split points.

Optionally, the features of the video segments further include subtitle features.

S402. Train the initial neural network model according to the training data.

The training of the neural network model according to the training data in the embodiments of this application may use any existing method of training a neural network model, which is not limited in the embodiments of this application.

S403. When the difference between the split-point prediction sequence of the video, obtained by the neural network model according to the features of the plurality of video segments, and the standard split-point sequence satisfies a preset condition, a first neural network model is obtained, where the split-point prediction sequence includes one or more split points and probabilities corresponding to the one or more split points.
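A hedged sketch of such a training procedure is given below, assuming PyTorch, a binary cross-entropy loss over the per-position split probabilities, and a simple loss threshold standing in for the "preset condition"; none of these specifics are fixed by the application.

```python
import torch
import torch.nn as nn

def train_split_detector(model, dataset, epochs=50, lr=1e-3, loss_target=0.05):
    """Train an initial model on (feature_sequence, standard_split_sequence) pairs.

    dataset: list of (features, labels) where features has shape
    (1, m, feature_dim) and labels is a float tensor of shape (1, m)
    with 1 at standard split points and 0 elsewhere.
    loss_target: hypothetical preset condition on the prediction error.
    """
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for features, labels in dataset:
            optimizer.zero_grad()
            probs = model(features)           # predicted split-point probabilities
            loss = criterion(probs, labels)   # difference from the standard sequence
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / len(dataset) < loss_target:
            break  # preset condition met: the trained first neural network model
    return model
```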

The neural network model trained according to the method in Fig. 4 may be used in the method of Fig. 3 to obtain the split-point prediction sequence according to the features of the fine-grained striping segments.

Fig. 5 shows a schematic block diagram of the video striping method provided by an embodiment of the present application. As shown in Fig. 5, the video striping method provided in the embodiment of the present application is divided into two stages: the first stage is fine-grained striping based on shot cuts and speech pauses, and the second stage performs multi-modal fusion of the fine-grained striping segments at the feature layer to obtain an accurate striping result. The details are as follows.

First, the video to be striped is decoded to obtain decoded pictures (video frames) and decoded audio. Since a video is a common carrier of visual signals and audio signals, the basic requirement of video striping is that the video pictures and audio after striping have complete and clean boundaries. Based on this, the shot cut points of the pictures and the audio pause points need to be accurately detected, and the video is then finely divided into a plurality of consecutive segments using the shot cut points and audio pause points as division points. Fig. 6 shows a schematic block diagram of fine-grained striping of a video in an embodiment of the present application. As shown in Fig. 6, shot cut detection is performed on the decoded pictures to obtain one or more shot cut points, and speech pause detection is performed on the decoded audio to obtain one or more speech pause points. A fine-grained striping result is then obtained according to the one or more shot cut points and the one or more speech pause points. For example, if a shot cut point and a speech pause point exist at the same time at a certain position of the video, the video is split at that position; if only a shot cut point or only a speech pause point exists at a certain position of the video, or neither exists, the video is not split at that position.

Specifically, shot cut points and audio pause points may be determined by neural network models. For example, a neural network model is used to extract the features of the video frames, the feature distance between the current frame and the previous frame is then compared, and if the feature distance is relatively large, it is determined that a shot cut point occurs. As another example, a neural network model is used to perform physical waveform analysis on the audio signal, and points with a small fluctuation amplitude are determined as audio pause points.

The video striping method of the embodiments of this application first performs fine-grained striping of the video, which ensures the picture integrity and audio integrity of the fine-grained striping segments. The fine-grained striping segments can also be used as basic units to align the features of different modalities for the feature fusion of the second stage. In addition, extracting features with the fine-grained segments as basic units effectively reduces the dimension of the multi-modal fused feature sequence in the time direction.

Because the pictures, subtitles, audio, and so on in a video form an integral whole, the video can be accurately understood only by combining information from different modalities. Therefore, the multi-modal fusion method of the embodiments of this application fuses the information of the three modalities (subtitles, video frames, and audio) and synthesizes the prediction of each modality on the split points to obtain an accurate video striping result.

FIG. 7 is a schematic block diagram of the multi-modal fusion precise striping process, which includes the following steps:

1. and selecting any fine-grained stripped segment, and extracting the video frame and the audio in the segment.

2. Extract the optical character recognition (OCR) subtitles of the segment from the video frames, and input the OCR subtitles into a subtitle feature extractor to extract the subtitle features of the segment. The subtitle feature extractor may be a neural network model used to implement subtitle feature extraction, which is not specifically limited in the embodiments of this application.

3. Input the video frames into a picture feature extractor to extract the picture features of the segment. The picture feature extractor may be a neural network model used to implement picture feature extraction, which is not specifically limited in the embodiments of this application.

4. Input the audio into an audio feature extractor to extract the audio features of the segment. The audio feature extractor may be a neural network model used to implement audio feature extraction, which is not specifically limited in the embodiments of this application.

5. Repeating steps 1 to 4 yields the multi-modal features of the plurality of fine-grained striping segments. With each segment as a basic unit, the features of the three modalities are aligned in the time dimension. For example, if there are ten fine-grained segments, there are ten subtitle features, ten picture features, and ten audio features. The ten subtitle features, the ten picture features, and the ten audio features are each sorted by video playback time; the first subtitle feature, the first picture feature, and the first audio feature are aligned and recorded as the first group of features. In this way, ten groups of features are obtained, forming a feature sequence, and each group of features includes a subtitle feature, a picture feature, and an audio feature.

6. Input the feature sequence into a multi-modal fusion striping detector (multi-modal breakpoint detector), perform multi-modal fusion at the feature layer, and finally output an accurate striping result.

Specifically, the multi-modal fusion striping detector may be a neural network model. Fig. 8 shows the neural network model structure of the multi-modal fusion striping detector; as shown in Fig. 8, the neural network model may include a one-dimensional convolutional neural network (CNN) and a plurality of long short-term memory (LSTM) models. For example, the input is the feature sequence [x_1, x_2, ..., x_i, ..., x_m] of a plurality of consecutive segments, where x_i represents the aligned superposition of the subtitle feature, picture feature, and audio feature of the i-th fine-grained striping segment, that is, the i-th group of features. The multi-modal fusion striping detector outputs a split-point prediction sequence [y_1, y_2, ..., y_i, ..., y_m], which indicates that m positions of the video are predicted as possible split points, where y_i represents the probability that the i-th position is a split point.
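A minimal sketch of a detector with this structure is shown below, assuming PyTorch; the layer sizes and kernel width are illustrative only, and the application does not prescribe exact hyperparameters.

```python
import torch
import torch.nn as nn

class MultiModalSplitDetector(nn.Module):
    """1-D CNN over the fused feature sequence followed by an LSTM,
    outputting a split-point probability y_i for each position i."""

    def __init__(self, feature_dim, conv_channels=128, hidden_size=64):
        super().__init__()
        self.conv = nn.Conv1d(feature_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, m, feature_dim) -- aligned groups x_1 ... x_m
        h = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, channels, m)
        h, _ = self.lstm(h.transpose(1, 2))            # (batch, m, hidden)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # (batch, m): y_1 ... y_m

# usage sketch: probabilities for 10 segments with 256-dim fused features
probs = MultiModalSplitDetector(feature_dim=256)(torch.randn(1, 10, 256))
```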

Fig. 9 is a schematic diagram of a hardware structure of a video striping apparatus 900 according to an embodiment of the present application. The video striping apparatus 900 shown in Fig. 9 includes a memory 901, a processor 902, a communication interface 903, and a bus 904. The memory 901, the processor 902, and the communication interface 903 are connected to each other through the bus 904.

The memory 901 may be a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 901 may store a program; when the program stored in the memory 901 is executed by the processor 902, the processor 902 and the communication interface 903 are configured to perform the steps of the video striping method of the embodiments of this application.

The processor 902 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement functions required to be executed by units in the video striping apparatus according to the embodiment of the present application, or to execute the video striping method according to the embodiment of the present application.

The processor 902 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the video striping method according to the embodiment of the present application may be implemented by an integrated logic circuit of hardware in the processor 902 or by instructions in the form of software. In particular, the processor 902 is configured to perform the following steps:

The processor is configured to acquire a plurality of fine-grained striping segments of the video according to the shot cut points and speech pause points of the video.

The processor is further configured to extract features of each fine-grained striping segment of the plurality of fine-grained striping segments, where the features include picture features and audio features.

The processor is further configured to process the features of the plurality of fine-grained striping segments according to the first neural network model to obtain a split-point prediction sequence of the video, where the split-point prediction sequence includes a plurality of split points and probabilities corresponding to the plurality of split points.

Optionally, the above features further include a caption feature.

Optionally, the first neural network model is a multi-modal fusion striping detector model.

Optionally, the processor is further configured to align the features of the plurality of fine-grained striped segments in the time dimension.

The processor 902 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a PROM, an EPROM, or a register. The storage medium is located in the memory 901, and the processor 902 reads the information in the memory 901 and, in combination with its hardware, completes the functions to be performed by the units included in the video striping apparatus of the embodiments of this application, or executes the video striping method of the embodiments of this application.

The communication interface 903 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the apparatus 900 and other devices or communication networks. For example, the video to be processed or training data may be acquired through the communication interface 903.

Bus 904 may include a pathway to transfer information between various components of device 900, such as memory 901, processor 902, and communication interface 903.

Fig. 10 is a schematic diagram of a hardware structure of a model training apparatus 1000 according to an embodiment of the present application. Similar to the apparatus 900 described above, the model training apparatus 1000 shown in Fig. 10 includes a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. The memory 1001, the processor 1002, and the communication interface 1003 are communicatively connected to each other via the bus 1004.

The memory 1001 may store a program; when the program stored in the memory 1001 is executed by the processor 1002, the processor 1002 is configured to perform the steps of the method of training a neural network model according to the embodiments of this application.

The processor 1002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU, or one or more integrated circuits, and is configured to execute related programs to implement the method of training a neural network model according to the embodiments of this application.

The processor 1002 may also be an integrated circuit chip having signal processing capabilities. In the training process, the steps of the method of training a neural network model according to the embodiments of this application may be implemented by integrated logic circuits of hardware in the processor 1002 or by instructions in the form of software.

It should be understood that, by training a neural network model with the model training apparatus 1000 shown in Fig. 10, the trained neural network model can be used to execute the video striping method of the embodiments of this application. Specifically, training the neural network model with the apparatus 1000 can produce the multi-modal fusion striping detector model used in the video striping method described above.

Specifically, the apparatus shown in Fig. 10 may obtain the training data and the initial neural network model to be trained from the outside through the communication interface 1003, and the processor then trains the initial neural network model according to the training data.

Optionally, the training data may include features of a plurality of video segments and a standard split-point sequence corresponding to the plurality of video segments, where the features include picture features, audio features, and optionally subtitle features. The standard split-point sequence corresponding to the plurality of video segments of the training data may be manually pre-labeled video segment split points.

It should be noted that although the above-described apparatus 900 and apparatus 1000 show only memories, processors, and communication interfaces, in particular implementations, those skilled in the art will appreciate that the apparatus 900 and apparatus 1000 may also include other devices necessary to achieve normal operation. Also, those skilled in the art will appreciate that apparatus 900 and apparatus 1000 may also include hardware components to implement other additional functions, according to particular needs. Furthermore, those skilled in the art will appreciate that apparatus 900 and apparatus 1000 may also include only those components necessary to implement embodiments of the present application, and need not include all of the components shown in fig. 9 and 10.

It should be understood that the processor in the embodiments of the present application may be a Central Processing Unit (CPU), and the processor may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

It should also be understood that the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of random access memory (RAM) are available, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), and a direct rambus RAM (DR RAM).

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions or computer programs. The procedures or functions according to the embodiments of the present application are generated in whole or in part when a computer instruction or a computer program is loaded or executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another computer readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more collections of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.

It should be understood that the term "and/or" herein describes only an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean the following three cases: only A exists, both A and B exist, or only B exists, where A and B may be singular or plural. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects before and after it, but may also indicate an "and/or" relationship; refer to the context for details.

In the present application, "at least one" means one or more, "a plurality" means two or more. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or multiple.

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
