Video classification method and device, electronic equipment and storage medium

Document number: 1904705    Publication date: 2021-11-30

Reading note: This technology, "Video classification method and device, electronic equipment and storage medium", was designed and created by Wang Sicong and Si Jianfeng on 2020-05-25. Its main content is as follows: The application provides a video classification method, a video classification device, electronic equipment and a storage medium, belonging to the technical field of multimedia. The method comprises the following steps: inputting at least one image frame acquired from a target video into an operation model based on a convolutional neural network to obtain at least one image frame vector; determining, from at least two clusters, target clusters respectively corresponding to the at least one image frame vector, wherein one cluster is used for representing one type of image frame vector; acquiring a text to be recognized corresponding to the target video based on the target clusters respectively corresponding to the at least one image frame vector; and inputting the text to be recognized into a natural language processing model, and decoding the text to be recognized through the natural language processing model to obtain the video type to which the target video belongs. With this scheme, when the target video is classified, processing the image frames in the target video is converted into processing the text content in the text to be recognized, which reduces the computational complexity, shortens the processing time and lowers the requirement on the processing capability of the device.

1. A method for video classification, the method comprising:

inputting at least one image frame acquired from a target video into an operation model based on a convolutional neural network to obtain at least one image frame vector;

determining target clusters corresponding to the at least one image frame vector respectively from at least two clusters, wherein one cluster is used for representing one type of image frame vector;

acquiring a text to be recognized corresponding to the target video based on the target clusters respectively corresponding to the at least one image frame vector;

and inputting the text to be recognized into a natural language processing model, and decoding the text to be recognized through the natural language processing model to obtain the video type of the target video.

2. The method according to claim 1, wherein the determining the target clusters respectively corresponding to the at least one image frame vector from at least two clusters comprises:

for any image frame vector in the at least one image frame vector, determining a cluster with the highest similarity between its cluster center vector and the image frame vector from the at least two clusters as a target cluster of the image frame vector.

3. The method of claim 2, wherein the determining, as the target cluster of the image frame vector, the cluster with the highest similarity between the cluster center vector and the image frame vector from the at least two clusters comprises:

determining Euclidean distances between the image frame vector and cluster center vectors of the at least two clusters respectively;

and in response to the Euclidean distance between the cluster center vector of any cluster and the image frame vector being minimum, taking the cluster as the target cluster of the image frame vector.

4. The method according to claim 1, wherein the obtaining the text to be recognized corresponding to the target video based on the target clusters respectively corresponding to the at least one image frame vector comprises:

taking cluster center identifications of the target clusters respectively corresponding to the at least one image frame vector as vocabulary in the text to be recognized corresponding to the target video.

5. The method of claim 4, wherein before the determining of the target clusters respectively corresponding to the at least one image frame vector from the at least two clusters, the method further comprises:

inputting the obtained multiple sample image frames into the operation model based on the convolutional neural network to obtain multiple sample image frame vectors;

clustering the sample image frame vectors to obtain at least two clusters;

and respectively allocating unique cluster center identifications to the at least two clusters, wherein the cluster center identifications are in a digital coding form.

6. The method according to claim 1, wherein the decoding the text to be recognized through the natural language processing model to obtain the video type to which the target video belongs comprises:

decoding the text to be recognized through the natural language processing model to obtain at least one classification probability, wherein the classification probabilities are used for representing probabilities that the text to be recognized belongs to different classification types;

and determining the classification type meeting the first target condition as the video type of the target video according to the at least one classification probability.

7. The method according to claim 1, wherein after the text to be recognized is decoded by the natural language processing model to obtain the video type to which the target video belongs, the method further comprises:

inputting the text to be recognized into a label model, and decoding the text to be recognized through the label model to obtain at least one label probability, wherein the label probability is used for representing the probability that the text to be recognized belongs to different video labels;

and determining the video label meeting a second target condition as the video label of the target video according to the at least one label probability.

8. An apparatus for video classification, the apparatus comprising:

the vector acquisition module is used for inputting at least one image frame acquired from a target video into an operation model based on a convolutional neural network to obtain at least one image frame vector;

a determining module, configured to determine target clusters corresponding to the at least one image frame vector from at least two clusters, where one cluster is used to represent one type of image frame vector;

the text acquisition module is used for acquiring a text to be recognized corresponding to the target video based on the target clusters respectively corresponding to the at least one image frame vector;

and the model processing module is used for inputting the text to be recognized into a natural language processing model, and decoding the text to be recognized through the natural language processing model to obtain the video type of the target video.

9. An electronic device, comprising a processor and a memory, wherein the memory is configured to store at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor to perform the video classification method according to any one of claims 1 to 7.

10. A storage medium for storing at least one program code for performing the video classification method of any one of claims 1 to 7.

Technical Field

The present application relates to the field of multimedia technologies, and in particular, to a video classification method and apparatus, an electronic device, and a storage medium.

Background

With the development of multimedia technology, a wide variety of video websites have emerged, and users prefer to watch videos of interest on these websites in their spare time. For a video website, because the number of videos is huge and different users are interested in different videos, in order to enable users to find videos of interest in a short time, the website usually classifies the videos or adds tags to them to indicate their main content, so that users can conveniently select videos.

In the related art, a deep learning method is usually adopted: the types of the image frames included in the video are determined from the image vectors corresponding to those image frames through models such as a Convolutional Neural Network (CNN) or a Long Short-Term Memory (LSTM) network, so as to classify the video or add tags to it.

Because the dimensionality of the image vector is large, this technical scheme has high computational complexity when processing the image frames, consumes a long time, and places a high requirement on the processing capability of the device.

Disclosure of Invention

The embodiments of the application provide a video classification method, a video classification device, electronic equipment and a storage medium, in which a one-dimensional cluster center identifier is used in place of a high-dimensional image vector, so that the computational complexity is reduced, the processing time is shortened, and the requirement on the processing capability of the device is lowered. The technical scheme is as follows:

in one aspect, a video classification method is provided, and the method includes:

inputting at least one image frame acquired from a target video into an operation model based on a convolutional neural network to obtain at least one image frame vector;

determining target clusters corresponding to the at least one image frame vector respectively from at least two clusters, wherein one cluster is used for representing one type of image frame vector;

acquiring texts to be recognized corresponding to the target videos based on target clusters corresponding to the at least one image frame vector respectively;

and inputting the text to be recognized into a natural language processing model, and decoding the text to be recognized through the natural language processing model to obtain the video type of the target video.

In another aspect, there is provided a video classification apparatus, the apparatus including:

the vector acquisition module is used for inputting at least one image frame acquired from a target video into an operation model based on a convolutional neural network to obtain at least one image frame vector;

a determining module, configured to determine target clusters corresponding to the at least one image frame vector from at least two clusters, where one cluster is used to represent one type of image frame vector;

the text acquisition module is used for acquiring a text to be recognized corresponding to the target video based on the target clusters respectively corresponding to the at least one image frame vector;

and the model processing module is used for inputting the text to be recognized into a natural language processing model, and decoding the text to be recognized through the natural language processing model to obtain the video type of the target video.

In an optional implementation manner, the determining module is further configured to determine, for any image frame vector in the at least one image frame vector, a cluster with a highest similarity between a cluster center vector and the image frame vector from the at least two clusters as a target cluster of the image frame vector.

In an optional implementation manner, the determining module is further configured to determine euclidean distances between the image frame vector and the cluster center vectors of the at least two clusters, respectively; in response to the Euclidean distance between the cluster center vector of any cluster and the image frame vector being minimum, the cluster is taken as a target cluster of the image frame vector.

In an optional implementation manner, the text obtaining module is further configured to use cluster center identifiers of target clusters respectively corresponding to the at least one image frame vector as words in the text to be recognized corresponding to the target video.

In an optional implementation, the apparatus further includes:

the vector acquisition module is also used for inputting the plurality of sample image frames into the operation model based on the convolutional neural network to obtain a plurality of sample image frame vectors;

the clustering module is used for clustering the sample image frame vectors to obtain at least two clusters;

and the distribution module is used for respectively distributing unique cluster center identifications for the at least two clusters, and the cluster center identifications are in a digital coding form.

In an optional implementation manner, the model processing module is configured to decode the text to be recognized through the natural language processing model to obtain at least one classification probability, where the classification probability is used to represent probabilities that the text to be recognized belongs to different classification types; and determining the classification type meeting the first target condition as the video type of the target video according to the at least one classification probability.

In an optional implementation manner, the model processing module is further configured to input the text to be recognized into a tag model, and decode the text to be recognized through the tag model to obtain at least one tag probability, where the tag probability is used to represent probabilities that the text to be recognized belongs to different video tags;

the determining module is further configured to determine, according to the at least one label probability, a video label that meets a second target condition as a video label of the target video.

In another aspect, an electronic device is provided, which includes a processor and a memory, where the memory is used to store at least one program code, and the at least one program code is loaded and executed by the processor to implement the operations performed in the video classification method in the embodiments of the present application.

In another aspect, a storage medium is provided, where at least one program code is stored, where the at least one program code is used to execute the video classification method in the embodiment of the present application.

The technical scheme provided by the embodiment of the application has the following beneficial effects:

in the embodiment of the application, the text to be recognized corresponding to the target video can be obtained by determining the target clusters corresponding to the image frame vectors of the image frames in the target video, so that when the target video is classified, processing the image frames in the target video is converted into processing the text content in the text to be recognized, thereby reducing the computational complexity, shortening the processing time and lowering the requirement on the processing capability of the device.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.

Fig. 1 is a block diagram of a video classification system provided according to an embodiment of the present application;

fig. 2 is a flowchart of a video classification method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a cluster center vector provided in accordance with an embodiment of the present application;

fig. 4 is a schematic diagram of acquiring a target cluster according to an embodiment of the present application;

fig. 5 is a flow chart of another video classification method provided in accordance with an embodiment of the present disclosure;

fig. 6 is a block diagram of a video classification apparatus according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a terminal provided in an embodiment of the present application;

fig. 8 is a schematic structural diagram of a server provided according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.

The following describes possible techniques that may be used in the present application:

clustering: clustering is a process of categorically organizing data members of a data set that are similar in some way. The cluster generated by clustering is a collection of a set of data objects that are similar to objects in the same cluster and distinct from objects in other clusters. Clustering techniques are often referred to as unsupervised learning.

K-means: the K-means clustering algorithm (K-means clustering algorithm) is an iterative solution clustering analysis algorithm, and the steps are that K objects are randomly selected as initial clustering centers, then the distance between each object and each clustering center is calculated, and each object is allocated to the clustering center closest to the object. The cluster centers and the objects assigned to them represent a cluster. The cluster center of a cluster is recalculated for each sample assigned based on the objects existing in the cluster. This process will be repeated until some termination condition is met. The termination condition may be that no (or minimum number) objects are reassigned to different clusters, no (or minimum number) cluster centers are changed again, and the sum of squared errors is locally minimal.

Faiss: faiss is a search library which is open from Facebook AI team and aims at clustering and similarity, provides efficient similarity search and clustering for dense vectors, supports search of billion-level vectors, and is the most mature approximate neighbor search library at present. It contains several algorithms for searching any size vector set (note: the size of the vector set is determined by RAM memory) and supporting code for algorithm evaluation and parameter adjustment.

Inception V3: the inclusion network is an important milestone in the development history of CNN (Convolutional Neural Networks) classifiers. Proposed by google. Versions iterate from V1 to V3, which is currently the preferred open source model for many image domain pre-training picture vectors.

The video classification method provided by the application can be applied to scenarios in which videos on a website are classified. The videos on the website may be videos uploaded by the website operator or by users, such as movies, television shows, micro-movies and short videos. With the video classification method provided by the application, a website operator can classify the videos on the website and display them according to the classification results, and a user browsing the website can view and filter the videos according to the classification.

The following briefly introduces the main steps of the video classification method provided in the present application. First, at least one image frame acquired from a target video is input into an operation model based on a convolutional neural network to obtain at least one image frame vector, and target clusters respectively corresponding to the at least one image frame vector are determined from at least two clusters, where one cluster is used to represent one type of image frame vector. Then, a text to be recognized corresponding to the target video is acquired based on the target clusters respectively corresponding to the at least one image frame vector. Finally, the text to be recognized is input into a natural language processing model, and the text to be recognized is decoded through the natural language processing model to obtain the video type to which the target video belongs. With this method, the text to be recognized corresponding to the target video can be obtained through the target clusters corresponding to the image frame vectors of the target video, so that when the target video is classified, processing the image frames in the target video is converted into processing the text content in the text to be recognized, which reduces the computational complexity, shortens the processing time and lowers the requirement on the processing capability of the device. For example, matrix calculation for image frames needs to be implemented with a GPU (Graphics Processing Unit), whereas matrix calculation for text can be implemented with a CPU (Central Processing Unit), so the requirement on the processing capability of the device is reduced and cost is saved.
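
The overall flow can be summarized in a short skeleton. The sketch below is only an outline; the four callables passed in are hypothetical placeholders for the components described in the embodiments (frame sampling, the CNN-based operation model, the cluster lookup and the natural language processing model), not actual APIs of this application.

```python
def classify_video(video_path, extract_frames, frame_to_vector,
                   nearest_cluster_id, nlp_classify):
    """Skeleton: frames -> image frame vectors -> cluster center IDs -> text -> video type."""
    frames = extract_frames(video_path)                      # at least one image frame
    vectors = [frame_to_vector(f) for f in frames]           # CNN-based operation model
    word_ids = [nearest_cluster_id(v) for v in vectors]      # target cluster -> cluster center ID
    text_to_recognize = " ".join(str(i) for i in word_ids)   # IDs used as vocabulary
    return nlp_classify(text_to_recognize)                   # decoded video type(s)
```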

Fig. 1 is a block diagram of a video classification system 100 according to an embodiment of the present application. The video classification system 100 includes: a terminal 110 and a video classification platform 120.

The terminal 110 is connected to the video classification platform 120 through a wireless network or a wired network. The terminal 110 may be at least one of a smartphone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 player, an MP4 player, and a laptop portable computer. The terminal 110 is installed and operated with an application program supporting video classification. The application may be a social type application, a multimedia playing type application, or the like. Illustratively, the terminal 110 is a terminal used by a user, and an application running in the terminal 110 has a user account logged therein.

The video classification platform 120 includes at least one of a server, a plurality of servers, a cloud computing platform, and a virtualization center. The video classification platform 120 is used to provide background services for applications that support video classification. Optionally, the video classification platform 120 undertakes primary classification work, and the terminal 110 undertakes secondary classification work; or, the video classification platform 120 undertakes the secondary classification work, and the terminal 110 undertakes the primary classification work; alternatively, either the video classification platform 120 or the terminal 110 alone may be responsible for the classification work.

Optionally, the video classification platform 120 comprises: the system comprises an access server, a video classification server and a database. The access server is used for providing the terminal 110 with access service. The video classification server is used for providing background services related to video classification. The video classification server can be one or more. When there are multiple video classification servers, there are at least two video classification servers for providing different services, and/or there are at least two video classification servers for providing the same service, for example, providing the same service in a load balancing manner, which is not limited in the embodiments of the present application.

The terminal 110 may be generally referred to as one of a plurality of terminals, and the embodiment is only illustrated by the terminal 110.

Those skilled in the art will appreciate that the number of terminals described above may be greater or fewer. For example, the number of the terminals may be only one, or several tens or hundreds, or more, in which case the video classification system further includes other terminals. The number of terminals and the type of the device are not limited in the embodiments of the present application.

Fig. 2 is a flowchart of a video classification method according to an embodiment of the present application. The method is executed by an electronic device, which may be implemented as a terminal or a server; in this embodiment, the electronic device being a server is taken as an example. The video classification method comprises the following steps:

201. and the server inputs at least one image frame acquired from the target video into an operation model based on a convolutional neural network to obtain at least one image frame vector.

In this embodiment of the application, the server may be a server for video classification. The target video may be a video uploaded by a user through a client, a video uploaded by a staff member of a video website through a background server, or a video on another video website linked to the video website. After acquiring the target video, the server can acquire at least one image frame from it. An image frame is in fact an image matrix whose elements are the pixel values of the pixels in the image frame, and the server can convert the image matrix into a vector, so that the server obtains at least one image frame vector. Correspondingly, for any image frame in the at least one image frame, the server may input the image frame into an operation model based on a convolutional neural network to obtain the image frame vector corresponding to that image frame. The operation model based on the convolutional neural network may be an Inception V2 model, an Inception V3 model, or a VGG16 (Visual Geometry Group 16) model, all of which are convolutional neural network models.

For example, an image frame is a 32 × 32 image matrix, and the server may calculate the image matrix through the operation model based on the convolutional neural network to obtain a 1024-dimensional vector. The server may also calculate the image matrix through the NumPy (Numerical Python) library in Python (a computer programming language) to obtain a 1024-dimensional vector. During the conversion, the server creates a 1 × 1024 array with the NumPy library, then cyclically reads out the element values of the 32 rows of the image matrix and stores the 32 element values of each row in the array. The embodiment of the present application does not limit the manner of determining the image frame vector.
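
The conversion just described can be sketched in NumPy as follows; the 32 × 32 matrix of random pixel values is a placeholder for a real image frame.

```python
import numpy as np

# Placeholder 32 x 32 image matrix of pixel values, standing in for a real frame.
image_matrix = np.random.randint(0, 256, size=(32, 32))

# Create a 1 x 1024 array and copy the 32 element values of each row into it,
# as in the row-by-row loop described above ...
vector = np.zeros((1, 1024))
for row in range(32):
    vector[0, row * 32:(row + 1) * 32] = image_matrix[row, :]

# ... which is equivalent to a single reshape call.
assert np.array_equal(vector, image_matrix.reshape(1, 1024))
```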

It should be noted that after the server acquires the target video, there are various ways for the server to acquire the at least one image frame: the server can acquire the image frames included in the target video frame by frame to obtain a plurality of image frames; the server can also acquire an image frame from the target video at a fixed frame interval to obtain a plurality of image frames, for example, one image frame every 10 frames; the server can also acquire an image frame from the target video at a fixed time interval to obtain a plurality of image frames, for example, one image frame every 2 seconds; the server may also randomly acquire an image frame from the target video. The server may determine the acquisition manner and the number of image frames to be acquired according to the total duration of the target video, which is not limited in the embodiment of the present application.

For example, suppose the total duration of the target video is 2 hours and each second includes 30 image frames. If the server acquires the image frames frame by frame, 216000 image frames are obtained; if the server acquires one image frame every 10 frames, 21600 image frames are obtained; if the server acquires one image frame every 100 frames, 2160 image frames are obtained; if the server acquires one image frame every 1 second, 7200 image frames are obtained; if the server acquires one image frame every 10 seconds, 720 image frames are obtained; and if the server randomly acquires one image frame, 1 image frame is obtained. The more image frames the server acquires, the more accurate the classification result of the target video is, but the greater the computational load on the server; the fewer image frames the server acquires, the less accurate the classification result is, but the faster the calculation. For a target video with a total duration of 2 hours, the server may acquire 2160 or 720 image frames to classify the target video. Correspondingly, when the total duration of the target video is short, for example when the target video is a short video with a total duration of no more than 120 seconds, or even a very short video with a total duration of no more than 30 seconds, the server can choose to acquire the image frames frame by frame, or acquire one image frame every 3 frames, 5 frames, 10 frames, 1 second or 2 seconds, and so on, so that the accuracy of the classification result is ensured while the calculation remains fast. In addition, the server can randomly acquire one image frame from the target video multiple times, so that the acquired image frames have a certain randomness, which avoids videos being wrongly classified because an uploader exploits a regular acquisition pattern.
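
A minimal sketch of the fixed-interval sampling strategy described above is given below. It assumes OpenCV (cv2) is available, which is an implementation choice not prescribed by the embodiments, and the interval values in the comments are illustrative.

```python
import cv2  # OpenCV; an assumption, the embodiments do not mandate a specific library

def sample_frames(video_path, every_n_frames=10):
    """Keep one image frame every `every_n_frames` frames of the video."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of video
            break
        if index % every_n_frames == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

# e.g. sample_frames("target_video.mp4", every_n_frames=100) for a long video,
# or every_n_frames=3 for a short video, as discussed above.
```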

202. The server determines target clusters corresponding to the at least one image frame vector from at least two clusters, wherein one cluster is used for representing one type of image frame vector.

In this embodiment of the application, the at least two clusters may be obtained by clustering by the server according to the sample image frame vector, or may be obtained directly by the server. After obtaining the at least one image frame vector, the server may determine a target cluster corresponding to each image frame vector from the at least two clusters according to a degree of similarity between each image frame vector and the cluster center vectors of the at least two clusters.

In an optional implementation manner, for any image frame vector in the at least one image frame vector, the server determines a cluster with the highest similarity between the cluster center vector and the image frame vector from at least two clusters as a target cluster of the image frame vector. Wherein, the similarity between the cluster center vector and the image frame vector can be represented by Euclidean distance. Correspondingly, the step of the server determining the target clusters respectively corresponding to the at least one image frame vector from the at least two clusters may be: for any image frame vector in at least one image frame vector, the server may determine a euclidean distance between the image frame vector and the cluster center vectors of the at least two clusters, respectively, and in response to the euclidean distance between the cluster center vector of any cluster and the image frame vector being the smallest, the server may take the cluster as a target cluster of the image frame vector. The server may obtain the target clusters corresponding to the at least one image frame vector in the same manner, so as to obtain at least one target cluster. Of course, the server may also determine the similarity between the cluster center vector and the image frame vector by other means, such as cosine similarity, pearson correlation coefficient, and manhattan distance. Since the cluster corresponding to the cluster center vector with the highest similarity with the image frame vector is selected as the target cluster, the vector in the target cluster has a higher similarity with the image frame vector, and thus the image frame vector can be represented by the cluster center vector of the target cluster.

For example, the server may retrieve, from the at least two clusters, a cluster having the highest similarity between the cluster center vector and the image frame vector, that is, the smallest euclidean distance, through a retrieval function provided by the Faiss search library. The server can also determine the cluster with the highest similarity to the image frame vector, namely the cluster with the minimum Euclidean distance, from the cluster center vectors of the k clusters through a k nearest neighbor (kNN) algorithm. Wherein k is a positive integer. Of course, the server may select another algorithm to determine the cluster with the cluster center vector most similar to the image frame vector, which is not limited in the embodiment of the present application.
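
Equivalently to the Faiss or kNN retrieval mentioned above, the nearest cluster center can be found with a plain Euclidean-distance comparison. The sketch below uses NumPy with random placeholder vectors in place of real cluster center vectors and image frame vectors.

```python
import numpy as np

def nearest_cluster(frame_vector, cluster_center_vectors):
    """Return the index of the cluster whose center vector has the smallest
    Euclidean distance to the given image frame vector."""
    dists = np.linalg.norm(cluster_center_vectors - frame_vector, axis=1)
    return int(dists.argmin())

# Toy usage: three 1024-dimensional cluster center vectors and one frame vector.
centers = np.random.rand(3, 1024).astype('float32')
t1 = np.random.rand(1024).astype('float32')
target_index = nearest_cluster(t1, centers)
```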

It should be noted that, when the at least two clusters are obtained by the server clustering a plurality of sample image frame vectors, the at least two clusters may be obtained as follows: the server obtains a plurality of sample image frames, inputs the obtained sample image frames into the operation model based on the convolutional neural network to obtain a plurality of sample image frame vectors, and clusters the plurality of sample image frame vectors to obtain the at least two clusters. The server can also allocate a unique cluster center identifier to each of the at least two clusters, and the cluster center identifiers can be in a digital coding form. The server may cluster the sample image frame vectors by using a k-means algorithm or a CLARA (Clustering LARge Applications) algorithm, which is not limited in the embodiment of the present application.

For example, referring to fig. 3, fig. 3 is a schematic diagram of a cluster center vector provided according to an embodiment of the present application. The server calculates a large number of sample image frame vectors through the Inception V3 model, generating a 1024-dimensional vector for each sample image frame. The number of sample image frames may be 100,000, 1,000,000, 10,000,000, and so on. Then, the server clusters the sample image frame vectors through the k-means algorithm to obtain a plurality of clusters; the number of clusters can be freely set before clustering, for example 1000, 10000 or 100000. In fig. 3, three clusters C1, C2 and C3 are shown as an example, where the cluster center vector of C1 is c1, the cluster center vector of C2 is c2, the cluster center vector of C3 is c3, and the other points in the figure represent sample image frame vectors. Correspondingly, for any image frame, the acquisition of its target cluster may refer to fig. 4, where fig. 4 is a schematic diagram of acquiring a target cluster according to an embodiment of the present application. On the basis of fig. 3, fig. 4 also shows an image frame vector T1 corresponding to an image frame T. The server calculates the Euclidean distance between T1 and c1, between T1 and c2, and between T1 and c3, respectively; if the Euclidean distance between c2 and T1 is the smallest, C2 is determined as the target cluster corresponding to the image frame vector T1.
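
As a rough sketch of this offline clustering step, scikit-learn's KMeans is used below for brevity; the embodiments equally allow other k-means implementations or CLARA, and the random vectors (with deliberately smaller counts) are placeholders for the InceptionV3 sample image frame vectors.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed implementation choice

# Placeholder sample image frame vectors (e.g. 1024-dimensional CNN outputs);
# counts are kept small here for brevity.
sample_vectors = np.random.rand(10000, 1024).astype('float32')

kmeans = KMeans(n_clusters=1000, random_state=0).fit(sample_vectors)
cluster_center_vectors = kmeans.cluster_centers_               # one center vector per cluster
cluster_center_ids = list(range(len(cluster_center_vectors)))  # unique numeric identifiers 0..999
```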

203. And the server acquires the text to be recognized corresponding to the target video based on the target clusters respectively corresponding to the at least one image frame vector.

In this embodiment, after acquiring the target clusters respectively corresponding to the at least one image frame vector, the server may acquire the cluster center identifier of each target cluster, where one target cluster corresponds to one unique cluster center identifier, and the cluster center identifier may be in character form or digital form. The server may use the cluster center identifiers of the target clusters respectively corresponding to the at least one image frame vector as the words in the text to be recognized corresponding to the target video. In effect, the target video is treated as a text, the at least one image frame vector in the target video is treated as at least one word in that text, and each image frame vector is mapped to a cluster center identifier, so that a mapping from a high-dimensional vector to a low-dimensional identifier is realized, that is, dimensionality reduction of the data.

For example, the server acquires 2160 image frames from the target video and correspondingly obtains 2160 image frame vectors. The server then determines 2160 target clusters from the 2160 image frame vectors and further obtains 2160 cluster center identifiers, among which identical identifiers may appear. The server stores the 2160 cluster center identifiers, in the form of a list, as 2160 words. These 2160 words are equivalent to the words obtained by segmenting the text to be recognized corresponding to the target video.
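
For illustration, the resulting "text" is simply the sequence of numeric cluster center identifiers. A toy sketch, with made-up identifiers standing in for the 2160 from the example:

```python
# Cluster center IDs of the target clusters, one per sampled image frame;
# the values are hypothetical stand-ins, and repeated IDs are allowed.
cluster_center_ids_per_frame = [17, 17, 942, 3, 3, 3, 589]

# Treat each ID as a word of the text to be recognized.
text_to_recognize = " ".join(str(i) for i in cluster_center_ids_per_frame)
print(text_to_recognize)  # -> "17 17 942 3 3 3 589"
```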

In an optional implementation manner, the server may store a correspondence between the cluster center identifier and the video type, and for an image frame vector of any image frame, after determining a target cluster corresponding to the image frame vector, the server may determine, based on the correspondence, at least one classification type corresponding to the cluster center identifier of the target cluster, so as to determine the at least one classification type to which the image frame belongs. After determining at least one classification type to which the at least one image frame belongs, the server may determine a video type to which the target video belongs according to the frequency of occurrence of each classification type. The server may select the category type with the highest frequency of occurrence as the video type of the target video, and may also select a plurality of category types with higher frequency of occurrence as the video type of the target video.

For example, the server determines that the classification types of the at least one image frame in the target video are fun, game, cartoon, movie and teaching, among which fun appears most frequently; the server then determines the video type of the target video as fun. Alternatively, if fun, game and teaching appear frequently while cartoon and movie appear rarely, the server may determine the video types of the target video as fun, game and teaching.
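
A minimal sketch of this frequency-based strategy follows; the per-frame types and the 20% threshold are illustrative assumptions, not values prescribed by the embodiments.

```python
from collections import Counter

# Classification types determined for the sampled frames (illustrative values).
frame_types = ["fun", "fun", "game", "teaching", "fun", "cartoon", "game", "teaching"]
counts = Counter(frame_types)

top_type = counts.most_common(1)[0][0]             # single most frequent type -> "fun"
threshold = 0.2 * len(frame_types)                 # assumed frequency threshold (20% of frames)
frequent_types = [t for t, c in counts.items() if c >= threshold]  # fun, game, teaching
```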

204. And the server inputs the text to be recognized into the natural language processing model, and decodes the text to be recognized through the natural language processing model to obtain the video type of the target video.

In the embodiment of the application, the server can directly input the text to be recognized into the natural language processing model, which is trained with a large number of sample texts and the classification types corresponding to those sample texts. The sample texts are obtained as follows: a large number of sample videos are obtained, and at least one sample image frame is obtained from each sample video; for any sample image frame, the sample image frame is input into the operation model based on the convolutional neural network to obtain a sample image frame vector; the target cluster corresponding to each sample image frame vector is then determined, and the sample image frame corresponding to the sample image frame vector is represented by the cluster center identifier of its target cluster, thereby obtaining the sample text corresponding to the sample video. The server can decode the input text to be recognized with the natural language processing model to obtain at least one classification probability, where the classification probabilities represent the probabilities that the text to be recognized belongs to different classification types. The server then determines the classification type meeting a first target condition as the video type of the target video according to the at least one classification probability. The first target condition may be that the probability is the highest, that the probability is greater than a classification probability threshold, and so on. The classification probability threshold may be 90%, 80%, 75%, etc., which is not limited in this application.

For example, the server obtains a classification probability of 90% that the target video belongs to fun, 80% that it belongs to game, 30% that it belongs to animation, 20% that it belongs to movie, and 73% that it belongs to teaching. If only one type is to be allocated to the target video, the server can take the classification type with the highest probability as the video type of the target video. If multiple types need to be allocated to the target video, the server can take fun, game and teaching, whose classification probabilities exceed 60%, as the video types of the target video.
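
A small sketch of applying the first target condition to the decoded probabilities above; both the single-type and the threshold-based multi-type variants are shown, and the 60% threshold is the one from the example.

```python
# Decoded classification probabilities from the example above.
class_probs = {"fun": 0.90, "game": 0.80, "animation": 0.30, "movie": 0.20, "teaching": 0.73}

single_type = max(class_probs, key=class_probs.get)            # highest probability -> "fun"
multi_types = [t for t, p in class_probs.items() if p > 0.60]  # > 60% -> fun, game, teaching
```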

It should be noted that the video classification method provided by the present application can also be used for identifying the quality of a video, that is, performing a binary classification of the target video in which the only classification types are high quality and low quality. The server can obtain high-quality clusters by acquiring the high-quality sample image frames included in high-quality videos and clustering their vectors, and obtain low-quality clusters by acquiring the low-quality sample image frames included in low-quality videos and clustering their vectors. Then, the natural language processing model is trained with the high-quality texts corresponding to the high-quality videos and the low-quality texts corresponding to the low-quality videos, so that the natural language processing model outputs the probability that the target video is high quality. When classifying a target video, the server can determine the high-quality or low-quality clusters respectively corresponding to the at least one image frame of the target video, and acquire the text to be recognized corresponding to the target video based on those clusters, so that the probability that the target video is high quality is output by the natural language processing model and it is determined whether the target video is a high-quality or a low-quality video. Of course, the server may also skip the natural language processing model and simply count the frequencies of high-quality clusters and low-quality clusters among the obtained target clusters: if high-quality clusters occur more frequently, the target video is a high-quality video, and if low-quality clusters occur more frequently, the target video is a low-quality video. By adopting different training modes, the video classification method provided by the application can realize various classification tasks, which widens the application range of the embodiments of the application.

In an alternative implementation manner, the server may further add a video tag to the target video through the tag model. Correspondingly, the step of adding the video tag to the target video by the server may be: the server can input the text to be recognized into the label model, and the text to be recognized is decoded through the label model to obtain at least one label probability, wherein the label probability is used for expressing the probability that the text to be recognized belongs to different video labels. The server may determine, according to the at least one label probability, a video label satisfying a second target condition as a video label of the target video. Wherein the second target condition may be that the probability is highest, the probability is greater than the tag probability threshold, and the like. The label probability threshold may be 90%, 80%, 75%, etc., which is not limited in this application. Of course, the tag model may be the same model as the natural language processing model, that is, when the natural language model outputs at least one classification probability, at least one tag probability may also be output. The embodiment of the present application does not limit this.

For example, the server specifies the types fun, game, animation, movie and teaching; the video tags corresponding to the fun type include joke, twitter, ghost and the like, the video tags corresponding to the game type include game A, game B and game C, and the video tags corresponding to the animation type include animation D, animation E and animation F. The server may further screen, from these video tags, the video tags suitable for the target video according to the video type of the target video, which is not limited in the embodiment of the present application.

It should be noted that the foregoing steps 201 to 204 are an optional implementation of the video classification method provided in this application; the video classification method may also be implemented in other ways. For example, referring to fig. 5, fig. 5 is a flowchart of another video classification method provided according to an embodiment of the present disclosure. In fig. 5, the video classification method is divided into three steps. In step one, a large number of sample image frames are obtained from a plurality of sample videos, a sample image frame vector is calculated for each sample image frame, and the sample image frame vectors are clustered by the K-means algorithm to obtain the cluster center vectors of at least two clusters. In step two, at least one image frame is acquired from the target video, an image frame vector is calculated for each image frame, the cluster center vector most similar to each image frame vector is determined from the cluster center vectors of the at least two clusters through the kNN algorithm to obtain at least one target cluster, and the cluster center identifier of the at least one target cluster is acquired. In step three, the text to be recognized corresponding to the target video is determined according to the cluster center identifiers of the at least one target cluster, and at least one of classification, video tag addition or video quality labeling of the target video is realized based on the text to be recognized by means of natural language processing. In this video classification method, through the correspondence between the cluster center vector of a target cluster and an image frame vector, the cluster center identifier replaces the image frame as a word in the text to be recognized, so that processing the image frames of the video is converted into processing the words of the text, which effectively reduces the amount of calculation. For example, the method can run in an environment with a Linux system, a 16-core CPU and 32 GB of memory, and the classification of videos can be completed without configuring a GPU.

In the embodiment of the application, the text to be recognized corresponding to the target video can be obtained by determining the target clusters corresponding to the image frame vectors of the image frames in the target video, so that when the target video is classified, processing the image frames in the target video is converted into processing the text content in the text to be recognized, thereby reducing the computational complexity, shortening the processing time and lowering the requirement on the processing capability of the device.

Fig. 6 is a block diagram of a video classification apparatus according to an embodiment of the present application. The apparatus is used for executing the steps when the video classification method is executed, and referring to fig. 6, the apparatus comprises: vector acquisition module 601, determination module 602, text acquisition module 603, and model processing module 604.

The vector acquisition module 601 is configured to input at least one image frame acquired from a target video into an operation model based on a convolutional neural network to obtain at least one image frame vector;

a determining module 602, configured to determine target clusters corresponding to the at least one image frame vector from at least two clusters, where one cluster is used to represent one type of image frame vector;

a text obtaining module 603, configured to obtain a text to be recognized corresponding to the target video based on the target clusters respectively corresponding to the at least one image frame vector;

the model processing module 604 is configured to input the text to be recognized into a natural language processing model, and decode the text to be recognized through the natural language processing model to obtain a video type to which the target video belongs.

In an optional implementation manner, the determining module 602 is further configured to determine, for any image frame vector in the at least one image frame vector, a cluster with the highest similarity between the cluster center vector and the image frame vector from the at least two clusters as a target cluster of the image frame vector.

In an alternative implementation, the determining module 602 is further configured to determine euclidean distances between the image frame vector and the cluster center vectors of the at least two clusters, respectively; and in response to the Euclidean distance between the cluster center vector of any cluster and the image frame vector being minimum, taking the cluster as a target cluster of the image frame vector.

In an optional implementation manner, the text obtaining module 603 is further configured to use cluster center identifiers of target clusters respectively corresponding to the at least one image frame vector as words in the text to be recognized corresponding to the target video.

In an optional implementation, the apparatus further includes:

the vector obtaining module 601 is further configured to input the plurality of sample image frames into the operation model based on the convolutional neural network to obtain a plurality of sample image frame vectors;

the clustering module is used for clustering the sample image frame vectors to obtain at least two clusters;

and the distribution module is used for distributing unique cluster center identification for the at least two clusters respectively, and the cluster center identification is in a digital coding form.

In an optional implementation manner, the model processing module 604 is configured to decode the text to be recognized through the natural language processing model to obtain at least one classification probability, where the classification probability is used to indicate probabilities that the text to be recognized belongs to different classification types; and determining the classification type meeting the first target condition as the video type of the target video according to the at least one classification probability.

In an optional implementation manner, the model processing module 604 is further configured to input the text to be recognized into a tag model, and decode the text to be recognized through the tag model to obtain at least one tag probability, where the tag probability is used to indicate probabilities that the text to be recognized belongs to different video tags;

the determining module 602 is further configured to determine, according to the at least one label probability, a video label that meets a second target condition as a video label of the target video.

In the embodiment of the application, the text to be recognized corresponding to the target video can be obtained by determining the target clusters corresponding to the image frame vectors of the image frames in the target video, so that when the target video is classified, processing the image frames in the target video is converted into processing the text content in the text to be recognized, thereby reducing the computational complexity, shortening the processing time and lowering the requirement on the processing capability of the device.

It should be noted that the video classification apparatus provided in the above embodiment is illustrated only by the division into the above functional modules. In practical applications, the above functions may be distributed to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the video classification apparatus provided in the above embodiment and the video classification method embodiments belong to the same concept; the specific implementation process of the apparatus is described in detail in the method embodiments and is not repeated here.

In the embodiment of the present application, the electronic device may be implemented as a terminal or a server. When implemented as a terminal, the terminal may perform the operations of the video classification method described above; when implemented as a server, the server may perform the operations of the video classification method described above; alternatively, the server and the terminal may interact with each other to perform the operations of the video classification method described above.

The electronic device may be implemented as a terminal. Fig. 7 is a block diagram of a terminal 700 according to an exemplary embodiment of the present application. The terminal 700 may be: a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 700 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so on.

In general, terminal 700 includes: a processor 701 and a memory 702.

The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 702 is used to store at least one instruction, and the at least one instruction is executed by processor 701 to implement the video classification method provided by the method embodiments of the present application.

In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 704, a display screen 705, a camera assembly 706, an audio circuit 707, a positioning component 708, and a power source 709.

The peripheral interface 703 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 701 and the memory 702. In some embodiments, processor 701, memory 702, and peripheral interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 701, the memory 702, and the peripheral interface 703 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio frequency circuit 704 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 704 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 704 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 704 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.

The display screen 705 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 701 as a control signal for processing. At this point, the display screen 705 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 705, disposed on the front panel of the terminal 700; in other embodiments, there may be at least two display screens 705, respectively disposed on different surfaces of the terminal 700 or in a folded design; in still other embodiments, the display screen 705 may be a flexible display screen disposed on a curved surface or a folded surface of the terminal 700. The display screen 705 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The display screen 705 may be made of materials such as an LCD (Liquid Crystal Display) or an OLED (Organic Light-Emitting Diode).

The camera assembly 706 is used to capture images or video. Optionally, the camera assembly 706 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each of which is any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fused shooting functions. In some embodiments, the camera assembly 706 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.

The audio circuit 707 may include a microphone and a speaker. The microphone is used for collecting sound waves of the user and the environment, converting the sound waves into electrical signals, and inputting the electrical signals to the processor 701 for processing or to the radio frequency circuit 704 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 700. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 701 or the radio frequency circuit 704 into sound waves. The speaker may be a conventional membrane speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can not only convert an electrical signal into a sound wave audible to humans, but can also convert an electrical signal into a sound wave inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 707 may also include a headphone jack.

The positioning component 708 is used to locate the current geographic location of the terminal 700 for navigation or LBS (Location Based Service). The positioning component 708 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

Power supply 709 is provided to supply power to various components of terminal 700. The power source 709 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When power source 709 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.

In some embodiments, terminal 700 also includes one or more sensors 710. The one or more sensors 710 include, but are not limited to: acceleration sensor 711, gyro sensor 712, pressure sensor 713, fingerprint sensor 714, optical sensor 715, and proximity sensor 716.

The acceleration sensor 711 can detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the terminal 700. For example, the acceleration sensor 711 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 701 may control the display screen 705 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 711. The acceleration sensor 711 may also be used for acquisition of motion data of a game or a user.
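
Purely as an illustration of the behavior described above (and not as part of the claimed apparatus), the following Python sketch shows how a landscape/portrait decision could be derived from the gravity components reported by the acceleration sensor 711; the axis convention and the comparison rule are assumptions made only for this example.

    def choose_orientation(gravity_x: float, gravity_y: float) -> str:
        """Pick a display orientation from gravity components (in m/s^2)
        measured along the terminal's x axis (short edge) and y axis (long edge).

        Assumption for this sketch: when the terminal is held upright, gravity
        acts mainly along the y axis, so the larger component decides the view.
        """
        return "portrait" if abs(gravity_y) >= abs(gravity_x) else "landscape"

    # Terminal held upright: gravity mostly along the y axis.
    print(choose_orientation(gravity_x=0.8, gravity_y=9.7))  # portrait
    # Terminal rotated onto its side: gravity mostly along the x axis.
    print(choose_orientation(gravity_x=9.6, gravity_y=1.1))  # landscape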

The gyro sensor 712 may detect a body direction and a rotation angle of the terminal 700, and the gyro sensor 712 may cooperate with the acceleration sensor 711 to acquire a 3D motion of the terminal 700 by the user. From the data collected by the gyro sensor 712, the processor 701 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.

The pressure sensor 713 may be disposed on a side frame of the terminal 700 and/or in a lower layer of the display screen 705. When the pressure sensor 713 is disposed on the side frame of the terminal 700, the user's grip signal on the terminal 700 may be detected, and the processor 701 performs left/right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 713. When the pressure sensor 713 is disposed in the lower layer of the display screen 705, the processor 701 controls an operability control on the UI according to the user's pressure operation on the display screen 705. The operability control includes at least one of a button control, a scroll bar control, an icon control, and a menu control. A sketch of both uses, under assumed rules, follows this paragraph.
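
A minimal sketch, under assumed sensor semantics, of the two uses of the pressure sensor 713 described above: guessing the holding hand from the grip signal, and treating a press on the display screen as operating a control. The rule and the threshold below are illustrative assumptions only.

    def holding_hand(left_frame_pressure: float, right_frame_pressure: float) -> str:
        """Guess which hand grips the terminal. Assumed rule for illustration
        only: the side of the frame pressed harder is taken as the holding side."""
        return "left" if left_frame_pressure >= right_frame_pressure else "right"

    def is_control_pressed(touch_pressure: float, threshold: float = 2.0) -> bool:
        """Treat a touch on the display as operating a UI control only when the
        pressure (in arbitrary sensor units) exceeds an assumed threshold."""
        return touch_pressure > threshold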

The fingerprint sensor 714 is used for collecting a fingerprint of a user, and the processor 701 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 714, or the fingerprint sensor 714 identifies the identity of the user according to the collected fingerprint. When the user identity is identified as a trusted identity, the processor 701 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 714 may be disposed on the front, back, or side of the terminal 700. When a physical button or a vendor Logo is provided on the terminal 700, the fingerprint sensor 714 may be integrated with the physical button or the vendor Logo.

The optical sensor 715 is used to collect the ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 based on the ambient light intensity collected by the optical sensor 715. Specifically, when the ambient light intensity is high, the display brightness of the display screen 705 is increased; when the ambient light intensity is low, the display brightness of the display screen 705 is adjusted down. In another embodiment, processor 701 may also dynamically adjust the shooting parameters of camera assembly 706 based on the ambient light intensity collected by optical sensor 715.
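
As a sketch only, one way to express the behavior above as a mapping from measured ambient light to a display brightness level; the lux breakpoints and the linear interpolation are assumptions, not values from this application.

    def display_brightness(ambient_lux: float,
                           dim_lux: float = 50.0,
                           bright_lux: float = 1000.0) -> float:
        """Map ambient light intensity (lux) to a brightness level in [0.1, 1.0]:
        weak light -> lower brightness, strong light -> higher brightness."""
        if ambient_lux <= dim_lux:
            return 0.1
        if ambient_lux >= bright_lux:
            return 1.0
        # Linear interpolation between the two assumed breakpoints.
        return 0.1 + 0.9 * (ambient_lux - dim_lux) / (bright_lux - dim_lux)

    print(round(display_brightness(500.0), 2))  # mid-range brightness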

The proximity sensor 716, also referred to as a distance sensor, is typically disposed on the front panel of the terminal 700. The proximity sensor 716 is used to collect the distance between the user and the front surface of the terminal 700. In one embodiment, when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually decreases, the processor 701 controls the display screen 705 to switch from the screen-on state to the screen-off state; when the proximity sensor 716 detects that the distance between the user and the front surface of the terminal 700 gradually increases, the processor 701 controls the display screen 705 to switch from the screen-off state to the screen-on state.
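
A minimal sketch of the proximity-driven screen switching described above, assuming the proximity sensor 716 delivers successive distance readings; the decision rule mirrors the paragraph (approaching switches the screen off, receding switches it back on) and everything else is illustrative.

    def next_screen_state(previous_distance: float,
                          current_distance: float,
                          screen_on: bool) -> bool:
        """Return whether the display should be on after a new proximity reading
        (distances in centimeters between the user and the front panel)."""
        if current_distance < previous_distance:
            return False  # user approaching the front panel: switch the screen off
        if current_distance > previous_distance:
            return True   # user moving away: switch the screen back on
        return screen_on  # distance unchanged: keep the current state

    print(next_screen_state(8.0, 3.0, screen_on=True))   # False
    print(next_screen_state(3.0, 8.0, screen_on=False))  # True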

Those skilled in the art will appreciate that the configuration shown in fig. 7 is not intended to be limiting of terminal 700 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.

The electronic device may be implemented as a server. Fig. 8 is a schematic structural diagram of a server provided in an embodiment of the present application. The server 800 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 801 and one or more memories 802, where the memories 802 store at least one instruction that is loaded and executed by the processors 801 to implement the methods provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may also include other components for implementing the functions of the device, which are not described herein again.

The embodiments of the present application also provide a computer-readable storage medium applied to an electronic device. The computer-readable storage medium stores at least one program code, and the program code is executed by a processor to implement the operations performed by the electronic device in the video classification method of the embodiments of the present application.
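
For orientation only, a heavily simplified Python sketch of the kind of program such a storage medium could hold: image frames are encoded into vectors, each vector is mapped to its nearest cluster by Euclidean distance, the cluster tokens are joined into a text to be recognized, and a text classifier returns the video type. The helper callables (frame_encoder, text_classifier) and the token vocabulary are assumed placeholders, not the implementation claimed in this application.

    import numpy as np

    def classify_video(frames, frame_encoder, cluster_centers, cluster_tokens, text_classifier):
        """Illustrative end-to-end flow under assumed interfaces:
        frames          -- iterable of image frames (e.g. numpy arrays)
        frame_encoder   -- callable: frame -> 1-D feature vector (placeholder CNN)
        cluster_centers -- array of shape (k, d), one center vector per cluster
        cluster_tokens  -- list of k strings, one token per cluster
        text_classifier -- callable: text -> video type label (placeholder NLP model)
        """
        tokens = []
        for frame in frames:
            vector = np.asarray(frame_encoder(frame))
            # The nearest cluster (smallest Euclidean distance) becomes this frame's "word".
            distances = np.linalg.norm(cluster_centers - vector, axis=1)
            tokens.append(cluster_tokens[int(np.argmin(distances))])
        text_to_recognize = " ".join(tokens)
        return text_classifier(text_to_recognize)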

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
