Video classification method and apparatus, electronic device, and storage medium

Document No.: 1889365  Publication date: 2021-11-26

Reading note: This technology, "Video classification method and apparatus, electronic device and storage medium", was designed and created by 黄剑辉 on 2021-04-07. Its main content is as follows: The application discloses a video classification method and apparatus, an electronic device, and a storage medium. The method comprises the steps of obtaining video content information of a target video in at least one modality; performing semantic extraction on the video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video; fusing the enhanced feature information with the semantic feature information of the target video in at least one modality to obtain target semantic feature information of the target video; and classifying the target video based on the target semantic feature information. According to the embodiments of the application, strengthening the features related to the content of the target video enhances the representational power of the target semantic feature information, which helps improve the accuracy of video classification.

1. A method of video classification, comprising:

acquiring video content information of a target video in at least one modality;

performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality;

performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video;

fusing the enhanced feature information with semantic feature information of the target video in at least one modality to obtain target semantic feature information of the target video;

and classifying the target video based on the target semantic feature information.

2. The method according to claim 1, wherein the performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video includes:

performing logistic regression processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video, wherein feature values of features in the enhanced feature information are in a preset numerical range corresponding to the logistic regression.

3. The method according to claim 1, wherein the performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video includes:

fusing the semantic feature information of the target video in each modality to obtain fused semantic feature information of the target video;

and performing feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video.

4. The method of claim 3, wherein the fused semantic feature information comprises at least one feature;

the performing feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video includes:

when the feature value of the feature in the fused semantic feature information is larger than a first preset value, performing enhancement processing on the feature in the fused semantic feature information;

compressing the features in the fused semantic feature information when the feature value of the features in the fused semantic feature information is smaller than a second preset value;

and obtaining the enhanced feature information of the target video based on the enhanced features and the compressed features.

5. The method of claim 1, wherein the at least one modality comprises an image modality and a text modality;

the acquiring of the video content information of the target video in at least one modality includes:

performing image extraction processing on a target video to obtain an image sequence of the target video, wherein the image sequence is video content information of the target video in an image modality;

and performing text extraction processing on the target video to obtain a text sequence of the target video, wherein the text sequence is video content information of the target video in a text modality.

6. The method according to claim 5, wherein the performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality comprises:

performing semantic extraction on each image in the image sequence to obtain image semantic feature information of each image in the image sequence;

processing the image semantic feature information of each image based on the image semantic feature information of the images preceding and following each image;

fusing the processed image semantic feature information of each image to obtain semantic feature information of the target video in an image modality;

and performing semantic extraction on the text sequence to obtain semantic feature information of the target video in a text modality.

7. The method according to claim 6, wherein the fusing the enhanced feature information with the semantic feature information of the target video in at least one modality to obtain the target semantic feature information of the target video comprises:

fusing the enhanced feature information and the semantic feature information of the target video in an image modality to obtain first feature information;

fusing the enhanced feature information with semantic feature information of the target video in a text modality to obtain second feature information;

and obtaining target semantic feature information of the target video based on the first feature information and the second feature information.

8. The method according to claim 7, wherein the obtaining target semantic feature information of the target video based on the first feature information and the second feature information comprises:

determining weights corresponding to the first feature information and the second feature information;

and performing weighted fusion on the first feature information and the second feature information based on the weights to obtain target semantic feature information of the target video.

9. A video classification apparatus, comprising:

the acquisition unit is used for acquiring video content information of a target video in at least one modality;

the extraction unit is used for performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality;

the enhancement unit is used for carrying out feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video;

the fusion unit is used for fusing the enhanced feature information and the semantic feature information of the target video in at least one modality to obtain the target semantic feature information of the target video;

and the classification unit is used for classifying the target video based on the target semantic feature information.

10. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the operations of the video classification method according to any one of claims 1 to 8.

11. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the video classification method according to any one of claims 1 to 8.

Technical Field

The application relates to the field of computer technology, and in particular to a video classification method and apparatus, an electronic device, and a storage medium.

Background

With the development of computer technology, multimedia applications have become increasingly widespread, the variety of videos has grown richer, and the number of videos has increased sharply. The videos people can watch are more and more diverse, and to help users quickly find the videos they want to watch among massive amounts of content, a video playing platform generally classifies the large number of videos it hosts. Video classification plays an important role in video management and interest-based recommendation. In addition, video classification technology is widely applied in fields such as surveillance, retrieval, and human-computer interaction.

In the related art, video frames of a video to be classified are generally extracted to obtain a plurality of target video images, image feature information of each target video image is extracted through a neural network, the frame-level image feature information is then converted into video-level feature information, and finally the video is classified based on the video feature information. However, the video feature information extracted in this way has weak representational power, so the accuracy of the video classification result is relatively low.

Disclosure of Invention

The embodiments of the application provide a video classification method and apparatus, an electronic device, and a storage medium, which can enhance the representational power of the target semantic feature information by strengthening features related to the target video content, thereby helping to improve the accuracy of video classification.

The embodiment of the application provides a video classification method, which comprises the following steps:

acquiring video content information of a target video in at least one modality;

performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality;

performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video;

fusing the enhanced feature information with semantic feature information of the target video in at least one modality to obtain target semantic feature information of the target video;

and classifying the target video based on the target semantic feature information.

Correspondingly, an embodiment of the present application provides a video classification apparatus, including:

the acquisition unit is used for acquiring video content information of a target video in at least one modality;

the extraction unit is used for performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality;

the enhancement unit is used for carrying out feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video;

the fusion unit is used for fusing the enhanced feature information and the semantic feature information of the target video in at least one modality to obtain the target semantic feature information of the target video;

and the classification unit is used for classifying the target video based on the target semantic feature information.

Optionally, in some embodiments of the application, the enhancement unit may be specifically configured to perform logistic regression processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video, where feature values of features in the enhanced feature information are in a preset numerical range corresponding to logistic regression.

Optionally, in some embodiments of the present application, the enhancement unit may include a fusion subunit and an enhancement subunit, as follows:

the fusion subunit is configured to fuse semantic feature information of the target video in each modality to obtain fused semantic feature information of the target video;

and the enhancement subunit is used for carrying out feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video.

Optionally, in some embodiments of the present application, the fused semantic feature information includes at least one feature;

the enhancement subunit is specifically configured to perform enhancement processing on the features in the fused semantic feature information when a feature value of the features in the fused semantic feature information is greater than a first preset value; compressing the features in the fused semantic feature information when the feature value of the features in the fused semantic feature information is smaller than a second preset value; and obtaining the enhanced feature information of the target video based on the enhanced features and the compressed features.

Optionally, in some embodiments of the present application, the at least one modality includes an image modality and a text modality;

the acquiring unit may include an image acquiring subunit and a text acquiring subunit, as follows:

the image acquisition subunit is configured to perform image extraction processing on a target video to obtain an image sequence of the target video, where the image sequence is video content information of the target video in an image modality;

the text acquisition subunit is configured to perform text extraction processing on a target video to obtain a text sequence of the target video, where the text sequence is video content information of the target video in a text modality.

Optionally, in some embodiments of the present application, the extracting unit may include an image extracting subunit, a processing subunit, an image fusion subunit, and a text extracting subunit, as follows:

the image extraction subunit is configured to perform semantic extraction on each image in the image sequence to obtain image semantic feature information of each image in the image sequence;

the processing subunit is used for processing the image semantic feature information of each image based on the image semantic feature information of the images preceding and following each image;

the image fusion subunit is used for fusing the processed image semantic feature information of each image to obtain semantic feature information of the target video in an image modality;

and the text extraction subunit is used for performing semantic extraction on the text sequence to obtain semantic feature information of the target video in a text modality.

Optionally, in some embodiments of the present application, the fusion unit may include a first fusion subunit, a second fusion subunit, and a determination subunit, as follows:

the first fusion subunit is configured to fuse the enhanced feature information and semantic feature information of the target video in an image modality to obtain first feature information;

the second fusion subunit is configured to fuse the enhanced feature information and semantic feature information of the target video in a text modality to obtain second feature information;

and the determining subunit is used for obtaining the target semantic feature information of the target video based on the first feature information and the second feature information.

Optionally, in some embodiments of the present application, the determining subunit may be specifically configured to determine weights corresponding to the first feature information and the second feature information, and to perform weighted fusion on the first feature information and the second feature information based on the weights to obtain the target semantic feature information of the target video.

The electronic device provided by the embodiment of the application comprises a processor and a memory, wherein the memory stores a plurality of instructions, and the processor loads the instructions to execute the steps in the video classification method provided by the embodiment of the application.

In addition, a storage medium is further provided, on which a computer program is stored, where the computer program is executed by a processor to implement the steps in the video classification method provided in the embodiments of the present application.

The embodiments of the application provide a video classification method and apparatus, an electronic device, and a storage medium, which can acquire video content information of a target video in at least one modality; perform semantic extraction on the video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; perform feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video; fuse the enhanced feature information with the semantic feature information of the target video in at least one modality to obtain target semantic feature information of the target video; and classify the target video based on the target semantic feature information. In the embodiments of the application, the representational power of the target semantic feature information can be enhanced by strengthening the features related to the target video content, which can improve the accuracy of video classification.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1a is a scene schematic diagram of a video classification method provided in an embodiment of the present application;

fig. 1b is a flowchart of a video classification method provided in an embodiment of the present application;

fig. 2a is another flowchart of a video classification method provided in an embodiment of the present application;

fig. 2b is a model framework diagram of a video classification method provided by an embodiment of the present application;

fig. 3a is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;

fig. 3b is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;

fig. 3c is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;

fig. 3d is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;

fig. 3e is a schematic structural diagram of a video classification apparatus according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application provides a video classification method and device, electronic equipment and a storage medium. The video classification apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal or a server.

It is understood that the video classification method of the present embodiment may be executed on a terminal, may also be executed on a server, and may also be executed by both the terminal and the server. The above examples should not be construed as limiting the present application.

As shown in fig. 1a, the video classification method is performed by the terminal and the server together. The video classification system provided by the embodiment of the application comprises a terminal 10, a server 11 and the like; the terminal 10 and the server 11 are connected via a network, for example, a wired or wireless network connection, etc., wherein the video classification apparatus may be integrated in the server.

The server 11 may be configured to: acquiring video content information of a target video in at least one mode; performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; performing feature enhancement processing on the semantic feature information of the target video in at least one mode to obtain enhanced feature information of the target video; fusing the enhanced feature information with semantic feature information of the target video in at least one mode to obtain target semantic feature information of the target video; and classifying the target video based on the target semantic feature information. The server 11 may be a single server, or may be a server cluster or a cloud server composed of a plurality of servers. In the video classification method or apparatus disclosed in the present application, a plurality of servers can be grouped into a blockchain, and the servers are nodes on the blockchain.

The terminal 10 may receive the classification result of the target video transmitted by the server 11, and perform services such as recommendation and search based on the classification result. The terminal 10 may include a mobile phone, a smart television, a tablet computer, a notebook computer, a personal computer (PC), or the like. A client, which may be an application client, a browser client, or the like, may also be provided on the terminal 10.

Alternatively, the steps performed by the server 11 to classify the target video may also be executed by the terminal 10.

The video classification method provided by the embodiment of the application relates to a computer vision technology and natural language processing in the field of artificial intelligence. The embodiment of the application can enhance the representation power of the target semantic feature information and is beneficial to improving the accuracy of video classification.

Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence software technology mainly includes directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.

Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to perform machine vision tasks such as recognition, tracking, and measurement of targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.

The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.

The present embodiment will be described from the perspective of a video classification apparatus, which may be specifically integrated in an electronic device, which may be a server or a terminal, or the like.

The video classification method can be applied to various scenarios in which videos need to be classified, such as video recommendation and video search. For example, when a video platform needs to recommend videos to a user according to the user's preferences, the video classification method provided by this embodiment can be used to rapidly classify massive numbers of videos and recommend videos based on the classification results. Because the method strengthens features related to the target video content and thus enhances the representational power of the target semantic feature information, it helps improve the accuracy of video recommendation.

As shown in fig. 1b, the specific flow of the video classification method may be as follows:

101. video content information of the target video in at least one modality is acquired.

In this embodiment, the target video is a video to be classified, and the video type and the duration of the target video are not limited. For example, it may be a news-like video, a history video or an art video, and the like, and the target video may be a live video.

The video content information is the content contained in the video. A video contains rich visual, auditory, and subtitle information, so it carries information in multiple modalities, such as an image modality, an audio modality (i.e., a voice modality), and a text modality. By analyzing and processing the information in each modality, the video content information of the video in multiple modalities can be extracted.

Optionally, in this embodiment, the at least one modality includes an image modality and a text modality;

the step of obtaining video content information of the target video in at least one modality may include:

performing image extraction processing on a target video to obtain an image sequence of the target video, wherein the image sequence is video content information of the target video in an image modality;

and performing text extraction processing on the target video to obtain a text sequence of the target video, wherein the text sequence is video content information of the target video in a text modality.

The image sequence may be composed of video frames extracted from the target video. In some embodiments, frames may be sampled uniformly; for example, the target video may be divided into video segments of equal duration (e.g., segments of 10 s each), and one frame may be extracted from each segment. In other embodiments, key frames may be extracted from the target video; specifically, key frames may be identified from the differences between video frames and then extracted.
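
As an illustration only (not part of the original disclosure), uniform frame sampling of this kind could be sketched with OpenCV as follows; the 10-second segment length mirrors the example above, and the function name is hypothetical.

    import cv2  # pip install opencv-python

    def sample_frames(video_path, segment_seconds=10):
        # Split the target video into equal-duration segments and keep one frame per segment.
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0      # fall back to 25 fps if metadata is missing
        step = max(1, int(fps * segment_seconds))
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:                    # first frame of each segment
                frames.append(frame)
            index += 1
        cap.release()
        return frames                                # image sequence: image-modality content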

The text sequence of the target video may include a video title, a video description, subtitles, and the like. Subtitles can be divided into soft subtitles and hard subtitles. Soft subtitles are separately stored subtitle files that can be extracted directly from the video data stream; hard subtitles are embedded in the video frames. For hard subtitles, character recognition may be performed on the image sequence obtained from the target video to extract the subtitles contained in it; alternatively, speech recognition may be performed on the audio extracted from the target video, and the recognized speech may be converted into subtitles.

Optionally, in other embodiments, the at least one modality includes an audio modality; the step of "obtaining video content information of the target video in at least one modality" may further include: performing audio data extraction processing on the target video to obtain an audio sequence of the target video, wherein the audio sequence is video content information of the target video in an audio modality.

102. And performing semantic extraction on the video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality.

Optionally, in this embodiment, the step of performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality may include:

performing semantic extraction on each image in the image sequence to obtain image semantic feature information of each image in the image sequence;

processing the image semantic feature information of each image based on the image semantic feature information of the images preceding and following each image;

fusing the processed image semantic feature information of each image to obtain semantic feature information of the target video in an image modality;

and performing semantic extraction on the text sequence to obtain semantic feature information of the target video in a text modality.

The step of performing semantic extraction on each image in the image sequence to obtain image semantic feature information of each image in the image sequence may specifically include:

and carrying out convolution operation and pooling operation on each image in the image sequence through a neural network to obtain image semantic feature information of each image in the image sequence.

The neural network may be a Residual Network (ResNet), a Densely Connected Convolutional Network (DenseNet), or the like; the neural network of the present embodiment is not limited to the above-mentioned types.

In the step of processing the image semantic feature information of each image based on the image semantic feature information of the previous and subsequent images of each image, the image semantic feature information of each image may be specifically processed through a Long Short-Term Memory network (LSTM), a Bidirectional Long Short-Term Memory network (BiLSTM), or the like, which is not limited in this embodiment.
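
As a minimal sketch only (assuming PyTorch with a torchvision ResNet-50 backbone and a bidirectional LSTM; the layer sizes and mean pooling are illustrative choices, not taken from the original), the image-modality branch described above could look like this:

    import torch
    import torch.nn as nn
    from torchvision import models

    class ImageModalityEncoder(nn.Module):
        # Per-frame semantics via a CNN, then each frame is processed together with
        # its preceding and following frames via a bidirectional LSTM.
        def __init__(self, hidden=512):
            super().__init__()
            backbone = models.resnet50(weights=None)   # load pretrained weights in practice
            backbone.fc = nn.Identity()                # keep the 2048-d pooled features
            self.backbone = backbone
            self.bilstm = nn.LSTM(2048, hidden, batch_first=True, bidirectional=True)

        def forward(self, frames):                     # frames: (batch, T, 3, 224, 224)
            b, t = frames.shape[:2]
            feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
            ctx, _ = self.bilstm(feats)                # temporal context per frame
            return ctx.mean(dim=1)                     # fused image-modality semantic features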

In some embodiments, the step of "fusing the image semantic feature information of each processed image to obtain semantic feature information of the target video in the image modality" may include: and performing weighted transformation on the image semantic feature information of each processed image to obtain the semantic feature information of the target video in the image mode.

Semantic extraction of the text sequence may be carried out through a text semantic extraction model to obtain the semantic feature information of the target video in the text modality. The text semantic extraction model may be a Text Convolutional Neural Network (TextCNN), a self-attention model, Bidirectional Encoder Representations from Transformers (BERT), A Lite BERT (ALBERT), and so on. It should be understood, however, that the text semantic extraction model of the present embodiment is not limited to the types listed above.
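
For the text branch, a hedged sketch using the Hugging Face transformers library (the bert-base-chinese checkpoint and the use of the [CLS] vector are assumed choices, not stated in the original) might be:

    import torch
    from transformers import BertModel, BertTokenizer  # pip install transformers

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
    encoder = BertModel.from_pretrained("bert-base-chinese")

    def text_modality_features(text_sequence: str) -> torch.Tensor:
        # Encode the title/description/subtitle text into one semantic vector.
        inputs = tokenizer(text_sequence, return_tensors="pt",
                           truncation=True, max_length=128)
        with torch.no_grad():
            outputs = encoder(**inputs)
        return outputs.last_hidden_state[:, 0]          # [CLS] vector, shape (1, 768)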

In some embodiments, the at least one modality includes an audio modality; the audio sequence of the target video can be extracted, then semantic extraction is performed on each audio frame in the audio sequence to obtain audio content feature information of each audio frame in the audio sequence, and the audio content feature information of each audio frame is fused to obtain semantic feature information of the target video in the audio modality.

103. And performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video.

The semantic feature information of the target video in at least one modality may include features strongly related to the classification of the target video and features weakly related to it. The weakly related features may specifically be features irrelevant to the topic information of the target video, because the topic information of the target video is closely related to its classification and can represent the category to which the target video belongs.

The feature enhancement processing may enhance the features in the semantic feature information that are strongly related to the classification of the target video, that is, the features that are strongly related to the topic information of the target video. In particular, in some embodiments, the feature enhancement processing may further suppress features in the semantic feature information that are less relevant to the topic information of the target video.

Optionally, in this embodiment, the step "performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video" may include:

performing logistic regression processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video, wherein feature values of features in the enhanced feature information are in a preset numerical range corresponding to the logistic regression.

Logistic regression processing, that is, logistic regression analysis, is often used in data mining. It may determine, from the semantic feature information of the target video in at least one modality, features strongly correlated with the classification of the target video and perform enhancement processing on the strongly correlated features; specifically, a larger weight is given to strongly correlated features and a smaller weight is given to weakly correlated features.

The preset value range may be set according to actual conditions, and this embodiment does not limit this. For example, the preset value range may be zero to one, that is, the feature values of the features in the semantic feature information of the target video in at least one modality are mapped to the value range from zero to one, and this mapping process may be non-linear or linear.

The step of performing logistic regression processing on the semantic feature information of the target video in at least one modality to obtain the enhanced feature information of the target video may specifically include:

And performing logistic regression processing on the semantic feature information of the target video in at least one modality through a preset threshold function to obtain the enhanced feature information of the target video.

The preset threshold function (gate function) may be a sigmoid function, or a tanh function, which is not limited in this embodiment.

The sigmoid function, i.e., an S-shaped growth curve, may be used as an activation function in a neural network or in logistic regression processing to map variables to the zero-to-one numerical range. The tanh function, i.e., the hyperbolic tangent, may likewise be used as an activation function in neural networks in the field of deep learning.
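
A minimal sketch of such a threshold (gate) function, assuming a learnable linear projection followed by a sigmoid (the projection layer is an assumption, not stated in the original):

    import torch
    import torch.nn as nn

    class FeatureGate(nn.Module):
        # Maps semantic features to (0, 1); values near 1 mark content-relevant features.
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, semantic_features):
            return torch.sigmoid(self.proj(semantic_features))  # enhanced feature information
            # torch.tanh(...) would map to (-1, 1) instead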

Optionally, in this embodiment, the step "performing feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video" may include:

fusing the semantic feature information of the target video in each modality to obtain fused semantic feature information of the target video;

and performing feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video.

The semantic feature information of the target video in each modality is fused, specifically, the semantic feature information of the target video in each modality can be spliced, and the splicing sequence is not limited.

For example, if the semantic feature information of the target video in at least one modality includes semantic feature information of the target video in an image modality and semantic feature information of the target video in a text modality, the step "fusing the semantic feature information of the target video in each modality to obtain fused semantic feature information of the target video" may include:

and fusing the semantic feature information of the target video in the image modality and the semantic feature information of the target video in the text modality to obtain fused semantic feature information of the target video.

In some embodiments, the semantic feature information of the target video in at least one modality may further include semantic feature information of the target video in an audio modality, and the semantic feature information of the target video in the image modality, the text modality, and the audio modality may be fused to obtain fused semantic feature information of the target video.

Specifically, the semantic feature information of the target video in the image modality, the text modality and the audio modality can be spliced to obtain the fused semantic feature information of the target video. It should be noted that the splicing order is not limited in this embodiment. For example, the semantic feature information in the text modality can be spliced after the semantic feature information in the image modality, and then the semantic feature information in the audio modality can be spliced after the semantic feature information in the text modality; alternatively, the semantic feature information in the audio modality can be spliced after the semantic feature information in the text modality, and the semantic feature information in the image modality can be spliced after the semantic feature information in the audio modality.
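
For illustration only (the feature sizes below are toy values, not taken from the original), splicing the modality features is simply a concatenation whose order merely has to stay fixed:

    import torch

    image_feat = torch.randn(1, 1024)   # semantic features in the image modality
    text_feat = torch.randn(1, 768)     # text modality
    audio_feat = torch.randn(1, 128)    # audio modality

    # Any fixed splicing order works, e.g. image -> text -> audio.
    fused = torch.cat([image_feat, text_feat, audio_feat], dim=-1)   # shape (1, 1920)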

Optionally, in this embodiment, the fused semantic feature information includes at least one feature;

the step of performing feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video may include:

when the feature value of the feature in the fused semantic feature information is larger than a first preset value, performing enhancement processing on the feature in the fused semantic feature information;

compressing the features in the fused semantic feature information when the feature value of the features in the fused semantic feature information is smaller than a second preset value;

and obtaining the enhanced feature information of the target video based on the enhanced features and the compressed features.

The first preset value and the second preset value may be set according to actual conditions, which is not limited in this embodiment, and the second preset value is not greater than the first preset value, and in some embodiments, the second preset value may be equal to the first preset value.

Wherein, the strengthening process may correspond to a larger weight and the compressing process may correspond to a smaller weight. When the feature value of the features in the fused semantic feature information is larger than a first preset value, performing enhancement processing on the features in the fused semantic feature information based on the weight corresponding to the enhancement processing to obtain the enhanced features; and when the feature value of the features in the fused semantic feature information is smaller than a second preset value, compressing the features in the fused semantic feature information based on the weight corresponding to the compression processing to obtain the compressed features.

Optionally, in some embodiments, when a feature value of a feature in the fused semantic feature information is greater than a first preset value, an enhancement amplitude of the feature in the fused semantic feature information may be determined based on the difference between the feature value and the first preset value, and then the feature is enhanced according to the enhancement amplitude. Specifically, the larger the difference, the higher the enhancement amplitude; conversely, the smaller the difference, the lower the enhancement amplitude. Similarly, when the feature value of a feature in the fused semantic feature information is smaller than a second preset value, the compression amplitude of the feature may be determined based on the difference between the feature value and the second preset value, and then the feature is compressed according to the compression amplitude. Specifically, the larger the difference, the higher the compression amplitude; conversely, the smaller the difference, the lower the compression amplitude.
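
One possible reading of this threshold-based rule, sketched with illustrative preset values and a simple linear relation between the difference and the amplitude (both are assumptions), and assuming feature values that are already bounded, e.g. gate outputs in (0, 1):

    import torch

    def enhance_or_compress(fused, first_preset=0.7, second_preset=0.3, scale=0.5):
        # Strengthen features above the first preset value, compress those below the second;
        # the amplitude grows linearly with the distance from the preset value.
        out = fused.clone()
        high = fused > first_preset
        low = fused < second_preset
        out[high] = fused[high] * (1.0 + scale * (fused[high] - first_preset))
        out[low] = fused[low] * (1.0 - scale * (second_preset - fused[low]))
        return out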

104. And fusing the enhanced feature information and the semantic feature information of the target video in at least one modality to obtain the target semantic feature information of the target video.

The enhanced feature information can be fused with the semantic feature information of the target video in each modality respectively, and then the target semantic feature information of the target video is determined based on each fusion result.

Optionally, in this embodiment, the step of "fusing the enhanced feature information with the semantic feature information of the target video in at least one modality to obtain the target semantic feature information of the target video" may include:

fusing the enhanced feature information and the semantic feature information of the target video in an image modality to obtain first feature information;

fusing the enhanced feature information with semantic feature information of the target video in a text modality to obtain second feature information;

and obtaining target semantic feature information of the target video based on the first feature information and the second feature information.

The feature information may be fused in various ways, which is not limited in this embodiment. For example, the fusion method may be a dot product operation or the like.

Specifically, a dot product operation may be performed on the enhanced feature information and the semantic feature information of the target video in the image modality to obtain the first feature information; a dot product operation may be performed on the enhanced feature information and the semantic feature information of the target video in the text modality to obtain the second feature information; and the target semantic feature information of the target video is obtained based on the first feature information and the second feature information.
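
A short sketch of this dot-product (element-wise) fusion, assuming the enhanced feature information and both modality features share the same dimensionality (in practice projection layers would likely be needed; this assumption is not from the original):

    def gated_fusion(enhanced, image_feat, text_feat):
        # Element-wise product of the enhanced feature information with each modality's
        # semantic features (tensors of identical shape).
        first_feature_info = enhanced * image_feat    # fusion with image-modality semantics
        second_feature_info = enhanced * text_feat    # fusion with text-modality semantics
        return first_feature_info, second_feature_info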

The step of obtaining the target semantic feature information of the target video based on the first feature information and the second feature information may include:

and fusing the first characteristic information and the second characteristic information to obtain the target semantic characteristic information of the target video.

The manner of fusing the first feature information and the second feature information is not limited. Specifically, the fusion may be a splicing process, and the splicing order is not limited; for example, the second feature information may be spliced after the first feature information to obtain the target semantic feature information of the target video. Optionally, the fusion may also be a weighting operation or the like.

Optionally, in this embodiment, the step "obtaining the target semantic feature information of the target video based on the first feature information and the second feature information" may include:

determining weights corresponding to the first feature information and the second feature information;

and performing weighted fusion on the first feature information and the second feature information based on the weights to obtain target semantic feature information of the target video (a sketch of one possible weighted fusion is given below).
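
A sketch of such a weighted fusion, with learnable scalar weights normalized by a softmax (the normalization is an assumption; the original only requires weights):

    import torch
    import torch.nn as nn

    class WeightedFusion(nn.Module):
        # Learns one weight per branch and fuses the two pieces of feature information.
        def __init__(self):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(2))

        def forward(self, first_feature_info, second_feature_info):
            w = torch.softmax(self.logits, dim=0)
            return w[0] * first_feature_info + w[1] * second_feature_info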

105. And classifying the target video based on the target semantic feature information.

The probability that the target video belongs to each preset category can be predicted based on the target semantic feature information, and then the target category of the target video is determined. Specifically, the category with the highest probability may be taken as the target category of the target video.

The probability that the target video belongs to each preset category may be predicted by a classifier. The classifier may specifically be a support vector machine, a fully connected deep neural network, or the like; this embodiment is not limited thereto.
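
Taking the fully connected option as an example (a support vector machine would instead be trained separately on the extracted features), a hedged sketch of the classification step:

    import torch
    import torch.nn as nn

    class VideoClassifier(nn.Module):
        # Fully connected classifier over the target semantic feature information.
        def __init__(self, feature_dim, num_classes):
            super().__init__()
            self.fc = nn.Linear(feature_dim, num_classes)

        def forward(self, target_semantic_features):
            probs = torch.softmax(self.fc(target_semantic_features), dim=-1)
            return probs.argmax(dim=-1), probs       # target category and per-class probabilities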

In this embodiment, a video classification model may specifically be adopted to perform semantic extraction on the video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; perform feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video; fuse the enhanced feature information with the semantic feature information of the target video in at least one modality to obtain target semantic feature information of the target video; and classify the target video based on the target semantic feature information.

It should be noted that the video classification model may be trained from a plurality of labeled sample videos. The annotation information may be a target category corresponding to the sample video, or may also be an expected probability that the sample video belongs to each preset category, which is not limited in this embodiment.

The video classification model may be trained by another device and then provided to the video classification apparatus, or it may be trained by the video classification apparatus itself.

If the video classification apparatus performs the training itself, before the step of performing semantic extraction on the video content information of the target video in at least one modality by using the video classification model to obtain semantic feature information of the target video in at least one modality, the method may further include:

acquiring training data, wherein the training data comprises sample videos and expected probabilities that the sample videos belong to various preset categories;

performing semantic extraction on video content information of the sample video in at least one modality through a video classification model to obtain semantic feature information of the sample video in at least one modality; performing feature enhancement processing on the semantic feature information of the sample video in at least one modality to obtain enhanced feature information of the sample video; fusing the enhanced feature information with semantic feature information of the sample video in at least one modality to obtain target semantic feature information of the sample video;

determining the prediction probability of the sample video belonging to each preset category based on the target semantic feature information;

and adjusting parameters in a video classification model based on the corresponding prediction probability and the expected probability of the sample video.

The training process may adjust the parameters of the video classification model based on a back-propagation algorithm, so that the prediction probability of the sample video belonging to each preset category approaches the expected probability; specifically, training may continue until the difference between the prediction probability and the expected probability is smaller than a preset value, which may be set according to the actual situation.
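
A sketch of one training step under these assumptions: the model returns unnormalized class scores (logits), the annotation is the expected probability distribution over the preset categories, and a soft-target cross-entropy loss is used (the specific loss is an assumption, not taken from the original):

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, video_batch, expected_probs):
        # One back-propagation step pushing predicted probabilities toward the expected ones.
        logits = model(video_batch)
        loss = -(expected_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()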

In the embodiment, after the target videos are classified based on the target semantic feature information, the classified target videos can be applied to scenes such as video searching, video recommendation and the like.

As can be seen from the above, the electronic device of this embodiment may obtain video content information of a target video in at least one modality; perform semantic extraction on the video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; perform feature enhancement processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video; fuse the enhanced feature information with the semantic feature information of the target video in at least one modality to obtain target semantic feature information of the target video; and classify the target video based on the target semantic feature information. In this embodiment of the application, the representational power of the target semantic feature information can be enhanced by strengthening the features related to the target video content, which can improve the accuracy of video classification.

The method described in the foregoing embodiment will be described in further detail below by way of example in which the video classification apparatus is specifically integrated in a server.

An embodiment of the present application provides a video classification method, and as shown in fig. 2a, a specific process of the video classification method may be as follows:

201. the server acquires video content information of the target video in at least one modality.

The step of obtaining video content information of the target video in at least one modality may include:

performing image extraction processing on a target video to obtain an image sequence of the target video, wherein the image sequence is video content information of the target video in an image modality;

and performing text extraction processing on the target video to obtain a text sequence of the target video, wherein the text sequence is video content information of the target video in a text modality.

Optionally, in other embodiments, the at least one modality includes an audio modality; the step of "obtaining video content information of the target video in at least one modality" may further include: performing audio data extraction processing on the target video to obtain an audio sequence of the target video, wherein the audio sequence is video content information of the target video in an audio modality.

202. And the server performs semantic extraction on the video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality.

Optionally, in this embodiment, the step of performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality may include:

performing semantic extraction on each image in the image sequence to obtain image semantic feature information of each image in the image sequence;

processing the image semantic feature information of each image based on the image semantic feature information of the images preceding and following each image;

fusing the processed image semantic feature information of each image to obtain semantic feature information of the target video in an image modality;

and performing semantic extraction on the text sequence to obtain semantic feature information of the target video in a text modality.

In some embodiments, the at least one modality includes an audio modality; the audio sequence of the target video can be extracted, then semantic extraction is performed on each audio frame in the audio sequence to obtain audio content feature information of each audio frame in the audio sequence, and the audio content feature information of each audio frame is fused to obtain semantic feature information of the target video in the audio modality.

203. And the server fuses the semantic feature information of the target video in each modality to obtain the fused semantic feature information of the target video.

The semantic feature information of the target video in each modality is fused, specifically, the semantic feature information of the target video in each modality can be spliced, and the splicing sequence is not limited.

For example, if the semantic feature information of the target video in at least one modality includes semantic feature information of the target video in an image modality and semantic feature information of the target video in a text modality, the step "fusing the semantic feature information of the target video in each modality to obtain fused semantic feature information of the target video" may include:

and fusing the semantic feature information of the target video in the image modality and the semantic feature information of the target video in the text modality to obtain fused semantic feature information of the target video.

204. And the server performs feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video.

The feature enhancement processing may be to enhance features in the fused semantic feature information that are strongly related to the classification of the target video, that is, features in the fused semantic feature information that are strongly related to the subject information of the target video. In particular, in some embodiments, the feature enhancement processing may further suppress features of the fused semantic feature information that are less relevant to the topic information of the target video.

Optionally, in this embodiment, the step "the server performs feature enhancement processing on the merged semantic feature information to obtain enhanced feature information of the target video" may include:

and the server performs logistic regression processing on the fused semantic feature information to obtain enhanced feature information of the target video, wherein feature values of features in the enhanced feature information are in a preset numerical range corresponding to logistic regression.

Specifically, the fused semantic feature information may be subjected to logistic regression processing through a sigmoid function or a tanh function to obtain the enhanced feature information of the target video.
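As a small illustration of the preset numerical range mentioned above: a sigmoid function maps each feature value of the fused semantic feature information into (0, 1), while tanh maps it into (-1, 1), so either can serve as the logistic-regression gate.

```python
# Tiny illustration of the value ranges produced by sigmoid and tanh.
import torch

fused = torch.tensor([-3.0, 0.0, 4.0])    # example fused semantic feature values
print(torch.sigmoid(fused))               # tensor([0.0474, 0.5000, 0.9820])
print(torch.tanh(fused))                  # tensor([-0.9951, 0.0000, 0.9993])
```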

The step of performing feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video may include:

when the feature value of the feature in the fused semantic feature information is larger than a first preset value, performing enhancement processing on the feature in the fused semantic feature information;

compressing the features in the fused semantic feature information when the feature value of the features in the fused semantic feature information is smaller than a second preset value;

and obtaining the enhanced feature information of the target video based on the enhanced features and the compressed features, as sketched below.
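A hedged sketch of this threshold-based step follows. The first preset value (hi), the second preset value (lo) and the scale factors are illustrative assumptions; the embodiment fixes neither their values nor the exact enhancement and compression operators.

```python
# Illustrative enhancement/compression under assumed thresholds and scale factors.
import torch


def enhance_and_compress(fused: torch.Tensor,
                         hi: float = 0.5, lo: float = -0.5,
                         up: float = 1.5, down: float = 0.5) -> torch.Tensor:
    out = fused.clone()
    out[fused > hi] *= up      # enhance features whose values exceed the first preset value
    out[fused < lo] *= down    # compress features whose values fall below the second preset value
    return out                 # enhanced feature information of the target video
```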

205. And the server fuses the enhanced feature information and the semantic feature information of the target video in at least one mode to obtain the target semantic feature information of the target video.

Optionally, in this embodiment, the step of "fusing the enhanced feature information with the semantic feature information of the target video in at least one modality to obtain the target semantic feature information of the target video" may include:

fusing the enhanced feature information and the semantic feature information of the target video in an image modality to obtain first feature information;

fusing the enhanced feature information with semantic feature information of the target video in a text mode to obtain second feature information;

and obtaining target semantic feature information of the target video based on the first feature information and the second feature information.

206. And the server classifies the target video based on the target semantic feature information.

The probability that the target video belongs to each preset category can be predicted based on the target semantic feature information, and then the target category of the target video is determined. Specifically, the category with the highest probability may be taken as the target category of the target video.
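For illustration, step 206 can be sketched as a simple classification head that predicts the probability of each preset category and takes the most probable one; the feature dimension (512) and the number of preset categories (30) below are assumptions.

```python
# Minimal classification head: target semantic feature -> per-category probabilities -> argmax.
import torch
import torch.nn as nn

classifier = nn.Linear(512, 30)   # maps the target semantic feature information to category scores


def classify(target_semantic_feature: torch.Tensor) -> int:
    probs = torch.softmax(classifier(target_semantic_feature), dim=-1)  # probability per preset category
    return int(probs.argmax())                                          # category with the highest probability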

In this embodiment, after the target video is classified based on the target semantic feature information, the classified target video can be applied in scenarios such as video search and video recommendation.

In a specific embodiment, as shown in fig. 2b, video frame extraction is performed on a target video to be classified to obtain an image sequence (i.e., video frames) of the target video; features of each image in the image sequence are then extracted through an image semantic construction model (such as a ResNet network) to obtain semantic feature information x1 of the target video in the image modality; meanwhile, semantic extraction is performed on text content of the target video (such as the video title) through a text semantic construction model (such as BERT) to obtain semantic feature information x2 of the target video in the text modality; x1 and x2 are used as the input of a threshold function, and feature enhancement processing is performed on x1 and x2 through the threshold function to obtain enhanced feature information of the target video; the enhanced feature information is fused with x1 and x2 respectively to obtain first feature information and second feature information; finally, the first feature information and the second feature information are spliced to obtain target semantic feature information of the target video, the probability distribution of the target video over the preset categories is predicted based on the target semantic feature information, and the target video is classified accordingly.

The threshold function may be expressed as σ(w^T[x1, x2] + b), where w represents a vector mapping matrix, b represents a bias constant, T represents transposition, σ may adopt a sigmoid function, x1 represents the semantic feature information of the target video in the image modality, x2 represents the semantic feature information of the target video in the text modality, and [x1, x2] represents the input of the threshold function. In this way, the feature information of the image modality and the text modality jointly constructs the enhanced feature information, and the features in x1 and x2 are suppressed or enhanced based on the enhanced feature information.

In the step of using x1 and x2 as the input of the threshold function and performing feature enhancement processing on x1 and x2 through the threshold function to obtain the enhanced feature information of the target video, specifically, x1 and x2 may be fused (for example, spliced) to obtain fused semantic feature information x, and x may be substituted, as the input, into the threshold function σ(w^T x + b) to obtain the enhanced feature information.

Specifically, the step "fusing the enhanced feature information with x1 and x2 respectively to obtain first feature information and second feature information" may be to process x1 and x2 respectively through the full connection layer, perform a dot product operation on the enhanced feature information and the processed x1 to obtain first feature information, and perform a dot product operation on the enhanced feature information and the processed x2 to obtain second feature information.

The video classification method can perform logistic regression processing on the semantic feature information of the target video in at least one mode through the threshold function to obtain the enhanced feature information of the target video. If the threshold function adopts sigmoid, the feature values of the features in the enhanced feature information fall in the interval [0, 1]; the closer a feature value in the enhanced feature information is to 0, the more strongly the corresponding feature is suppressed and the weaker its influence on the classification result; conversely, the closer the feature value is to 1, the more the feature is enhanced and the stronger its influence on the classification result. Therefore, the threshold function suppresses the feature information irrelevant to the topic content of the target video, strengthens the feature information more relevant to the topic content of the target video, and is beneficial to enhancing the semantic interaction of the semantic feature information of the target video in each modality.

As can be seen from the above, in the present embodiment, the server may obtain video content information of the target video in at least one modality; performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; fusing the semantic feature information of the target video in each mode to obtain fused semantic feature information of the target video; performing feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video; fusing the enhanced feature information with semantic feature information of the target video in at least one mode to obtain target semantic feature information of the target video; and classifying the target video based on the target semantic feature information. According to the embodiment of the application, the representation force of the target semantic feature information can be enhanced by strengthening the features related to the target video content, and the accuracy of video classification can be improved.

In order to better implement the above method, an embodiment of the present application further provides a video classification apparatus, as shown in fig. 3a, which may include an obtaining unit 301, an extracting unit 302, a strengthening unit 303, a fusing unit 304, and a classifying unit 305, as follows:

(1) An obtaining unit 301;

an obtaining unit 301, configured to obtain video content information of a target video in at least one modality.

Optionally, in some embodiments of the present application, the at least one modality includes an image modality and a text modality;

the acquisition unit 301 may comprise an image acquisition sub-unit 3011 and a text acquisition sub-unit 3012, see fig. 3b, as follows:

the image obtaining subunit 3011 is configured to perform image extraction processing on a target video to obtain an image sequence of the target video, where the image sequence is video content information of the target video in an image modality;

the text obtaining subunit 3012 is configured to perform text extraction processing on a target video to obtain a text sequence of the target video, where the text sequence is video content information of the target video in a text mode.

(2) An extraction unit 302;

an extracting unit 302, configured to perform semantic extraction on video content information of the target video in at least one modality, to obtain semantic feature information of the target video in at least one modality.

Optionally, in some embodiments of the present application, the extracting unit 302 may include an image extracting sub-unit 3021, a processing sub-unit 3022, an image fusion sub-unit 3023, and a text extracting sub-unit 3024, see fig. 3c, as follows:

the image extraction subunit 3021 is configured to perform semantic extraction on each image in the image sequence to obtain image semantic feature information of each image in the image sequence;

a processing subunit 3022, configured to process the image semantic feature information of each image based on the image semantic feature information of the preceding and following images of each image;

an image fusion subunit 3023, configured to fuse the processed image semantic feature information of each image to obtain semantic feature information of the target video in an image modality;

a text extraction subunit 3024, configured to perform semantic extraction on the text sequence to obtain semantic feature information of the target video in a text modality.

(3) An enhancing unit 303;

the enhancing unit 303 is configured to perform feature enhancement processing on the semantic feature information of the target video in at least one modality, so as to obtain enhanced feature information of the target video.

Optionally, in some embodiments of the application, the enhancing unit 303 may be specifically configured to perform logistic regression processing on the semantic feature information of the target video in at least one modality to obtain enhanced feature information of the target video, where feature values of features in the enhanced feature information are in a preset numerical range corresponding to the logistic regression.

Optionally, in some embodiments of the present application, the enhancing unit 303 may include a fusion subunit 3031 and an enhancing subunit 3032, see fig. 3d, as follows:

the fusion subunit 3031 is configured to fuse semantic feature information of the target video in each modality to obtain fused semantic feature information of the target video;

and the enhancing subunit 3032 is configured to perform feature enhancement processing on the fused semantic feature information to obtain enhanced feature information of the target video.

Optionally, in some embodiments of the present application, the fused semantic feature information includes at least one feature;

the strengthening subunit 3032 may be specifically configured to, when a feature value of a feature in the fused semantic feature information is greater than a first preset value, perform strengthening processing on the feature in the fused semantic feature information; compressing the features in the fused semantic feature information when the feature value of the features in the fused semantic feature information is smaller than a second preset value; and obtaining the enhanced feature information of the target video based on the enhanced features and the compressed features.

(4) A fusion unit 304;

a fusion unit 304, configured to fuse the enhanced feature information and the semantic feature information of the target video in at least one modality to obtain target semantic feature information of the target video.

Optionally, in some embodiments of the present application, the fusion unit 304 may include a first fusion subunit 3041, a second fusion subunit 3042 and a determination subunit 3043, see fig. 3e, as follows:

the first fusion subunit 3041 is configured to fuse the enhanced feature information and the semantic feature information of the target video in an image modality to obtain first feature information;

a second fusion subunit 3042, configured to fuse the enhanced feature information with semantic feature information of the target video in a text modality, to obtain second feature information;

a determining subunit 3043, configured to obtain target semantic feature information of the target video based on the first feature information and the second feature information.

Optionally, in some embodiments of the present application, the determining subunit 3043 may be specifically configured to determine weights corresponding to the first feature information and the second feature information, and perform weighted fusion on the first feature information and the second feature information based on the weights to obtain the target semantic feature information of the target video.
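One possible reading of this weighted fusion is sketched below; the weight values and the weighted-sum form are assumptions, since the embodiment only requires that the two pieces of feature information be fused according to their respective weights.

```python
# Illustrative weighted fusion of the first and second feature information.
import torch


def weighted_fusion(first: torch.Tensor, second: torch.Tensor,
                    w1: float = 0.6, w2: float = 0.4) -> torch.Tensor:
    return w1 * first + w2 * second   # target semantic feature information
```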

(5) A classification unit 305;

a classifying unit 305, configured to classify the target video based on the target semantic feature information.

As can be seen from the above, in the present embodiment, the obtaining unit 301 may obtain video content information of the target video in at least one modality; performing semantic extraction on video content information of the target video in at least one modality through an extraction unit 302 to obtain semantic feature information of the target video in at least one modality; performing feature enhancement processing on the semantic feature information of the target video in at least one mode through an enhancement unit 303 to obtain enhanced feature information of the target video; fusing the enhanced feature information and the semantic feature information of the target video in at least one mode by a fusing unit 304 to obtain target semantic feature information of the target video; the target video is classified based on the target semantic feature information by the classification unit 305. According to the embodiment of the application, the representation force of the target semantic feature information can be enhanced by strengthening the features related to the target video content, and the accuracy of video classification can be improved.

An electronic device according to an embodiment of the present application is further provided, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the present application, where the electronic device may be a terminal or a server, and specifically:

the electronic device may include components such as a processor 401 with one or more processing cores, a memory 402 with one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not constitute a limitation of the electronic device, which may include more or fewer components than those shown, combine some components, or adopt a different arrangement of components. Wherein:

the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.

The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.

The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:

acquiring video content information of a target video in at least one mode; performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; performing feature enhancement processing on the semantic feature information of the target video in at least one mode to obtain enhanced feature information of the target video; fusing the enhanced feature information with semantic feature information of the target video in at least one mode to obtain target semantic feature information of the target video; and classifying the target video based on the target semantic feature information.

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

As can be seen from the above, the present embodiment may acquire video content information of a target video in at least one modality; performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; performing feature enhancement processing on the semantic feature information of the target video in at least one mode to obtain enhanced feature information of the target video; fusing the enhanced feature information with semantic feature information of the target video in at least one mode to obtain target semantic feature information of the target video; and classifying the target video based on the target semantic feature information. According to the embodiment of the application, the representation force of the target semantic feature information can be enhanced by strengthening the features related to the target video content, and the accuracy of video classification can be improved.

It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.

To this end, embodiments of the present application provide a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the video classification methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:

acquiring video content information of a target video in at least one mode; performing semantic extraction on video content information of the target video in at least one modality to obtain semantic feature information of the target video in at least one modality; performing feature enhancement processing on the semantic feature information of the target video in at least one mode to obtain enhanced feature information of the target video; fusing the enhanced feature information with semantic feature information of the target video in at least one mode to obtain target semantic feature information of the target video; and classifying the target video based on the target semantic feature information.

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

Wherein the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

Since the instructions stored in the storage medium can execute the steps in any video classification method provided in the embodiments of the present application, beneficial effects that can be achieved by any video classification method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.

According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the video classification aspect described above.

The foregoing detailed description is directed to a video classification method, apparatus, electronic device and storage medium provided in the embodiments of the present application, and specific examples are applied in the present application to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
