Label data processing method and device and computer readable storage medium

Document No.: 1101647    Publication date: 2020-09-25

Note: This technology, "Label data processing method and device and computer readable storage medium" (一种标签数据处理方法、装置以及计算机可读存储介质), was designed and created by Chen Xiaoshuai (陈小帅) and Li Weikang (李伟康) on 2020-06-24. Its main content is as follows: The application discloses a label data processing method, a device and a computer readable storage medium, wherein the method comprises the following steps: acquiring target video data and determining a target video type of the target video data; acquiring a mutual information index table, the mutual information index table being created based on mutual information between the existing video word sets of at least two existing label video data and the video labels of the at least two existing label video data; acquiring a target video word set of the target video data, and acquiring a first candidate video tag of the target video data from the mutual information index table according to the target video word set and the target video type; and adding the first candidate video tag to a candidate tag set, and determining a target video tag of the target video data from the candidate tag set. With this method and device, the efficiency of acquiring the target video tag can be improved, and the tag types of the acquired target video tag are enriched.

1. A tag data processing method, comprising:

acquiring target video data and determining a target video type of the target video data;

acquiring a mutual information index table; the mutual information index table is created based on mutual information between an existing video word set of at least two existing label video data and video labels of the at least two existing label video data;

acquiring a target video word set of the target video data, and acquiring a first candidate video tag of the target video data in a mutual information index table according to the target video word set and the target video type; the first candidate video tag is a video tag of existing tag video data with the target video type;

and adding the first candidate video tag to a candidate tag set, and determining a target video tag of the target video data from the candidate tag set according to mutual information between the first candidate video tag and the corresponding existing video word set.

2. The method of claim 1, wherein obtaining the target set of video words for the target video data comprises:

acquiring video title information, video description information and video subtitle information of the target video data;

performing word segmentation on the video title information, the video description information and the video subtitle information respectively to obtain title words in the video title information, description words in the video description information and subtitle words in the video subtitle information;

determining the title words, the description words and the caption words as target video words of the target video data;

combining target video words of the target video data according to the number of the combined words to obtain a target video word set; the number of words of the target video words in a set of target video words is no greater than the number of combined words.

3. The method according to claim 1, wherein the mutual information index table includes a mapping relationship between existing video word sets of the at least two existing tagged video data and video tags of the at least two existing tagged video data, and the mapping relationship further carries video type information of existing tagged video data to which the included video tags belong; the video type information comprises target video type information pointing to the target video type;

the obtaining a first candidate video tag of the target video data in a mutual information index table according to the target video word set and the target video type includes:

determining an existing video word set which is the same as the target video word set in the mutual information index table as a target word set;

determining the mapping relation which carries the target video type information and comprises the target word set in the mutual information index table as a target mapping relation;

and determining the video label included in the target mapping relation as the first candidate video label.

4. The method of claim 3, further comprising:

respectively carrying out word combination on the existing video words of each existing label video data according to the number of the combined words to obtain an existing video word set corresponding to each existing label video data; the word number of the existing video words in an existing video word set is not more than the combined word number;

establishing a mapping relation between each existing video word set and the video tags of the existing tag video data;

and generating the mutual information index table according to the mapping relation between each existing video word set and the corresponding video label.

5. The method according to claim 4, wherein the generating the mutual information index table according to the mapping relationship between each existing video word set and the corresponding video tag comprises:

acquiring a mutual information value between an existing video word set and a video label contained in each mapping relation according to the video quantity of the existing label video data to which the existing video word set and the video label contained in each mapping relation belong together;

determining the mapping relation of which the mutual information value is greater than or equal to the mutual information threshold value as a reserved mapping relation;

adding the video type information to the reserved mapping relation according to the video type of the existing label video data to which the video label contained in the reserved mapping relation belongs;

and generating the mutual information index table according to the reserved mapping relation and the video type information carried by the reserved mapping relation.

6. The method of claim 1, wherein the set of candidate tags further comprises a second candidate video tag;

the method further comprises the following steps:

acquiring a video feature vector of the target video data;

inputting the video feature vector of the target video data into a label generation model; the label generation model is obtained based on the video feature vectors of the at least two existing label video data and the video label training of the at least two existing label video data;

generating at least two video generation labels of the target video data based on the label generation model and the video feature vector of the target video data, and acquiring the generation probability of each video generation label;

and determining the video generation label with the generation probability greater than or equal to a generation probability threshold value in the at least two video generation labels as the second candidate video label.

7. The method of claim 6, wherein the set of candidate tags further comprises a third candidate video tag;

the method further comprises the following steps:

acquiring a first associated label of the first candidate video label, and acquiring a second associated label of the second candidate video label; the first associated tag is determined based on the co-occurrence frequency of the first candidate video tag and the video tag of the first candidate video data in the video tags of the at least two existing tag video data; the first candidate video data is the existing label video data containing the first candidate video label; the second associated tag is determined based on the co-occurrence frequency of the second candidate video tag and the video tag of the second candidate video data in the video tags of the at least two existing tag video data; the second candidate video data is the existing label video data containing the second candidate video label;

and determining the first associated label and the second associated label as the third candidate video label.

8. The method of claim 7, wherein determining the target video tag of the target video data from the candidate tag set according to mutual information between the first candidate video tag and the corresponding existing video word set comprises:

determining the first candidate video tag, the second candidate video tag and the third candidate video tag in the candidate tag set as candidate video tags;

acquiring label credibility between each candidate video label and the target video data according to mutual information between the first candidate video label and the corresponding existing video word set and the generation probability corresponding to the second candidate video label;

and determining the target video label from the candidate label set according to the label credibility between each candidate video label and the target video data.

9. The method of claim 8, wherein the candidate tag set comprises a candidate video tag b_l, where l is a positive integer less than or equal to the total number of candidate video tags in the candidate tag set;

the obtaining of the tag credibility between each candidate video tag and the target video data according to the mutual information between the first candidate video tag and the existing video word set corresponding to the first candidate video tag and the generation probability corresponding to the second candidate video tag includes:

if the candidate video tag b_l belongs to the first candidate video tag and does not belong to the second candidate video tag, determining the tag credibility between the candidate video tag b_l and the target video data according to the mutual information between the candidate video tag b_l and the corresponding existing video word set;

if the candidate video tag b_l belongs to the second candidate video tag and does not belong to the first candidate video tag, determining the generation probability corresponding to the candidate video tag b_l as the tag credibility between the candidate video tag b_l and the target video data;

if the candidate video tag b_l belongs to both the first candidate video tag and the second candidate video tag, acquiring a first tag configuration weight corresponding to the first candidate video tag and acquiring a second tag configuration weight corresponding to the second candidate video tag;

determining the tag credibility between the candidate video tag b_l and the target video data according to the first tag configuration weight, the second tag configuration weight, the mutual information between the candidate video tag b_l and the corresponding existing video word set, and the generation probability corresponding to the candidate video tag b_l.

10. The method of claim 9, wherein the mutual information index table further includes a mutual information value between the candidate video tag b_l and the corresponding existing video word set; the mutual information value between the candidate video tag b_l and the corresponding existing video word set is determined according to the number of existing tagged video data to which both the candidate video tag b_l and the corresponding existing video word set belong;

the determining the tag credibility between the candidate video tag b_l and the target video data according to the mutual information between the candidate video tag b_l and the corresponding existing video word set comprises:

obtaining, from the mutual information index table, the mutual information value between the candidate video tag b_l and the corresponding existing video word set;

obtaining the word number of the words in the existing video word set corresponding to the candidate video tag b_l;

determining the tag credibility of the candidate video tag b_l according to a credibility adjustment parameter, the mutual information value corresponding to the candidate video tag b_l, and the word number.

11. The method of claim 9, wherein the candidate tag set further comprises a candidate video tag b_j, where j is a positive integer less than or equal to the total number of candidate video tags in the candidate tag set;

the method further comprises the following steps:

if the candidate video tag b_j is a first associated tag of the candidate video tag b_l, obtaining a first tag association degree between the candidate video tag b_j and the candidate video tag b_l; the first tag association degree is determined based on the number of co-occurrences of the candidate video tag b_j and the candidate video tag b_l in the video tags of the at least two existing tagged video data;

determining the tag credibility between the candidate video tag b_j and the target video data according to the first tag association degree and the mutual information between the candidate video tag b_l and the corresponding existing video word set;

if the candidate video tag b_j is a second associated tag of the candidate video tag b_l, obtaining a second tag association degree between the candidate video tag b_j and the candidate video tag b_l; the second tag association degree is determined based on the number of co-occurrences of the candidate video tag b_j and the candidate video tag b_l in the video tags of the at least two existing tagged video data;

determining the tag credibility between the candidate video tag b_j and the target video data according to the second tag association degree and the generation probability corresponding to the candidate video tag b_l.

12. The method of claim 8, wherein determining the target video tag from the set of candidate tags according to tag confidence between each candidate video tag and the target video data comprises:

inputting each candidate video label and the video feature vector of the target video data into a reliability determination model; the credibility determination model is obtained by training the video feature vectors of the at least two existing label video data and the video labels of the at least two existing label video data;

based on the credibility determination model and the video feature vector of the target video data, outputting model credibility between each candidate video tag and the target video data respectively;

determining screening label credibility between each candidate video label and the target video data based on model credibility between each candidate video label and the target video data and label credibility between each candidate video label and the target video data;

and determining, from the candidate tag set, the candidate video tags whose screening label credibility with the target video data is greater than or equal to a screening credibility threshold as the target video tags.

13. The method of claim 12, wherein determining a filter tag confidence level between each candidate video tag and the target video data based on the model confidence level between each candidate video tag and the target video data and the tag confidence level between each candidate video tag and the target video data comprises:

acquiring a first credibility configuration weight aiming at the model credibility, and acquiring a second credibility configuration weight aiming at the label credibility;

and determining the screening label reliability between each candidate video label and the target video data according to the first reliability configuration weight, the second reliability configuration weight, the model reliability between each candidate video label and the target video data, and the label reliability between each candidate video label and the target video data.

14. The method of claim 1, wherein the obtaining target video data and determining the target video type of the target video data comprises:

acquiring video image information and video audio information of the target video data, and acquiring video text information of the target video data;

inputting the video image information, the video audio information and the video text information into a video classification model; the video classification model is obtained by training the at least two existing label video data and the video types corresponding to the at least two existing label video data;

outputting the target video type of the target video data based on the video classification model.

15. The method of claim 14, wherein the video image information comprises at least two image frames of the target video data; the video audio information comprises at least two audio frames of audio data of the target video data;

the outputting the target video type of the target video data based on the video classification model comprises:

generating an image feature vector of each image frame of the at least two image frames based on the video classification model, and performing feature vector fusion on the image feature vector of each image frame to obtain an image fusion feature vector;

generating an audio feature vector of each audio frame of the at least two audio frames based on the video classification model, and performing feature vector fusion on the audio feature vector of each audio frame to obtain an audio fusion feature vector;

generating a text feature vector of the video text information based on the video classification model;

performing vector splicing on the image fusion feature vector, the audio fusion feature vector and the text feature vector to obtain a video feature vector of the target video data;

and outputting the target video type of the target video data in the video classification model according to the video feature vector of the target video data.

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a tag data processing method and apparatus, and a computer-readable storage medium.

Background

With the continuous development of computer networks, the amount of video data generated in computer networks keeps increasing. To enable users to quickly identify the video data they want to watch, video tags are usually added to video data, so that users can quickly determine from the video tags whether they want to watch the video data they are browsing.

Disclosure of Invention

The application provides a tag data processing method, a tag data processing device and a computer-readable storage medium, which can improve the acquisition efficiency of a target video tag and enrich the tag types of the acquired target video tag.

One aspect of the present application provides a tag data processing method, including:

acquiring target video data and determining a target video type of the target video data;

acquiring a mutual information index table; the mutual information index table is created based on mutual information between the existing video word set of the at least two existing label video data and the video labels of the at least two existing label video data;

acquiring a target video word set of target video data, and acquiring a first candidate video tag of the target video data in a mutual information index table according to the target video word set and the target video type; the first candidate video label is a video label of existing label video data with a target video type;

and adding the first candidate video tag to a candidate tag set, and determining a target video tag of the target video data from the candidate tag set according to mutual information between the first candidate video tag and the corresponding existing video word set.

The acquiring of the target video data and the determining of the target video type of the target video data include:

acquiring video image information and video audio information of target video data, and acquiring video text information of the target video data;

inputting video image information, video audio information and video text information into a video classification model; the video classification model is obtained by training at least two existing label video data and video types corresponding to the at least two existing label video data;

and outputting the target video type of the target video data based on the video classification model.
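As a rough illustration of the classification step described above, the following sketch shows how per-frame image features, per-frame audio features and a text feature vector could be fused and spliced into a single video feature vector before classification. The mean-pooling fusion, the random linear classifier and all dimensions are assumptions made purely for illustration, not the model actually trained in the application.

```python
import numpy as np

# Illustrative only: mean pooling as the fusion operator and a random linear
# classifier stand in for the trained video classification model.
rng = np.random.default_rng(0)
image_frame_vecs = rng.normal(size=(32, 128))   # 32 image frames, 128-d features each
audio_frame_vecs = rng.normal(size=(50, 64))    # 50 audio frames, 64-d features each
text_vec = rng.normal(size=(256,))              # text feature vector of the video text

image_fused = image_frame_vecs.mean(axis=0)     # image fusion feature vector
audio_fused = audio_frame_vecs.mean(axis=0)     # audio fusion feature vector
video_vec = np.concatenate([image_fused, audio_fused, text_vec])  # vector splicing

video_types = ["television series", "movie", "news", "sports"]    # hypothetical types
W = rng.normal(size=(len(video_types), video_vec.size))           # placeholder weights
target_video_type = video_types[int(np.argmax(W @ video_vec))]
print(target_video_type)
```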

The method for acquiring the video text information of the target video data comprises the following steps:

acquiring video title information, video description information and video subtitle information of target video data;

performing word segmentation on the video subtitle information to obtain subtitle keywords in the video subtitle information;

and splicing the video title information, the video description information and the subtitle keywords to obtain video text information of the target video data.
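A minimal sketch of this text assembly step is shown below. The whitespace tokenizer and the fixed number of subtitle keywords are assumptions made for illustration, since the excerpt does not specify the word segmentation method or how many subtitle keywords are kept.

```python
from collections import Counter

def build_video_text(title, description, subtitles, top_k=5):
    """Splice title, description and subtitle keywords into video text information.

    Tokenization (whitespace split) and keyword selection (top_k most frequent
    subtitle words) are illustrative assumptions only.
    """
    subtitle_words = subtitles.lower().split()
    subtitle_keywords = [w for w, _ in Counter(subtitle_words).most_common(top_k)]
    return " ".join([title, description, *subtitle_keywords])

# Example with toy inputs.
print(build_video_text(
    "xx engineers home-made helicopter",
    "the video is shot at xx venue, and mainly describes xx",
    "the helicopter takes off the helicopter lands safely",
))
```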

Wherein, obtaining target video data comprises:

acquiring target video data sent by a client;

the method further comprises the following steps:

and sending the target video label of the target video data to the client, so that the client outputs the target video data in association with the target video label.

One aspect of the present application provides a tag data processing apparatus, including:

the video acquisition module is used for acquiring target video data and determining the target video type of the target video data;

the index table acquisition module is used for acquiring a mutual information index table; the mutual information index table is created based on mutual information between the existing video word set of the at least two existing label video data and the video labels of the at least two existing label video data;

the candidate tag acquisition module is used for acquiring a target video word set of target video data and acquiring a first candidate video tag of the target video data in the mutual information index table according to the target video word set and the target video type; the first candidate video label is a video label of existing label video data with a target video type;

and the target label determining module is used for adding the first candidate video label to the candidate label set and determining the target video label of the target video data from the candidate label set according to the mutual information between the first candidate video label and the corresponding existing video word set.

Wherein, the candidate label obtaining module comprises:

an information acquisition unit for acquiring video title information, video description information, and video subtitle information of target video data;

the word segmentation unit is used for performing word segmentation on the video title information, the video description information and the video subtitle information respectively to obtain title words in the video title information, description words in the video description information and subtitle words in the video subtitle information;

the word determining unit is used for determining the title words, the description words and the caption words as target video words of the target video data;

the word combination unit is used for combining the target video words of the target video data according to the number of the combined words to obtain a target video word set; the number of words of the target video words in a set of target video words is no greater than the number of combined words.

The mutual information index table comprises a mapping relation between an existing video word set of at least two existing label video data and video labels of the at least two existing label video data, and the mapping relation also carries video type information of the existing label video data to which the contained video labels belong; the video type information includes target video type information pointing to a target video type;

a candidate tag acquisition module comprising:

the target word determining unit is used for determining an existing video word set which is the same as the target video word set in the mutual information index table as the target word set;

the target relation determining unit is used for determining a mapping relation which carries target video type information and comprises a target word set in the mutual information index table as a target mapping relation;

and the candidate label determining unit is used for determining the video label included in the target mapping relation as a first candidate video label.

Wherein, the tag data processing apparatus further comprises:

the word combination module is used for respectively carrying out word combination on the existing video words of each existing label video data according to the number of the combined words to obtain an existing video word set corresponding to each existing label video data; the word number of the existing video words in an existing video word set is not more than the number of the combined words;

the relation establishing module is used for establishing a mapping relation between each existing video word set and the video tags of the existing tag video data;

and the index table generating module is used for generating a mutual information index table according to the mapping relation between each existing video word set and the corresponding video label.

Wherein, the index table generation module includes:

the mutual information value acquisition unit is used for acquiring the mutual information value between the existing video word set and the video label contained in each mapping relation according to the video quantity of the existing label video data to which the existing video word set and the video label contained in each mapping relation belong together;

a reserved relation determining unit, configured to determine a mapping relation in which the mutual information value is greater than or equal to a mutual information threshold as a reserved mapping relation;

the information adding unit is used for adding video type information to the reserved mapping relation according to the video type of the existing label video data to which the video label contained in the reserved mapping relation belongs;

and the index table generating unit is used for generating a mutual information index table according to the reserved mapping relation and the video type information carried by the reserved mapping relation.

Wherein the set of candidate tags further comprises a second candidate video tag;

the tag data processing apparatus further includes:

the vector acquisition module is used for acquiring video characteristic vectors of target video data;

the vector input module is used for inputting the video characteristic vector of the target video data into the label generation model; the label generation model is obtained based on video feature vectors of at least two existing label video data and video label training of at least two existing label video data;

the label generation module is used for generating at least two video generation labels of the target video data based on the label generation model and the video feature vector of the target video data and acquiring the generation probability of each video generation label;

and the first candidate label determining module is used for determining the video generation label of which the generation probability is greater than or equal to the generation probability threshold value in the at least two video generation labels as a second candidate video label.

The candidate label set further comprises a third candidate video label;

the tag data processing apparatus further includes:

the association tag acquisition module is used for acquiring a first association tag of the first candidate video tag and acquiring a second association tag of the second candidate video tag; the first association tag is determined based on the co-occurrence frequency of the first candidate video tag and the video tag of the first candidate video data in the video tags of at least two existing tag video data; the first candidate video data is the existing label video data containing the first candidate video label; the second associated tag is determined based on the co-occurrence frequency of the second candidate video tag and the video tag of the second candidate video data in the video tags of at least two existing tag video data; the second candidate video data is the existing label video data containing the second candidate video label;

and the second candidate label determining module is used for determining the first associated label and the second associated label as a third candidate video label.

Wherein, the target label determination module comprises:

the set label determining unit is used for determining a first candidate video label, a second candidate video label and a third candidate video label in the candidate label set as candidate video labels;

the credibility obtaining unit is used for obtaining the credibility of the label between each candidate video label and the target video data according to the mutual information between the first candidate video label and the corresponding existing video word set and the generation probability corresponding to the second candidate video label;

and the target label acquisition unit is used for determining the target video label from the candidate label set according to the label credibility between each candidate video label and the target video data.

Wherein the candidate label set comprises a candidate video label b_l, where l is a positive integer less than or equal to the total number of candidate video labels in the candidate label set;

a credibility obtaining unit including:

a first credibility determination subunit, configured to, if the candidate video label b_l belongs to the first candidate video label and does not belong to the second candidate video label, determine the label credibility between the candidate video label b_l and the target video data according to the mutual information between the candidate video label b_l and the corresponding existing video word set;

a second credibility determination subunit, configured to, if the candidate video label b_l belongs to the second candidate video label and does not belong to the first candidate video label, determine the generation probability corresponding to the candidate video label b_l as the label credibility between the candidate video label b_l and the target video data;

a label weight obtaining subunit, configured to, if the candidate video label b_l belongs to both the first candidate video label and the second candidate video label, obtain a first label configuration weight corresponding to the first candidate video label and a second label configuration weight corresponding to the second candidate video label;

a third credibility determination subunit, configured to determine the label credibility between the candidate video label b_l and the target video data according to the first label configuration weight, the second label configuration weight, the mutual information between the candidate video label b_l and the corresponding existing video word set, and the generation probability corresponding to the candidate video label b_l.

Wherein the mutual information index table further comprises a mutual information value between the candidate video label b_l and the corresponding existing video word set; the mutual information value between the candidate video label b_l and the corresponding existing video word set is determined according to the number of existing label video data to which both the candidate video label b_l and the corresponding existing video word set belong;

the first credibility determination subunit comprises:

a mutual information value obtaining subunit, configured to obtain, from the mutual information index table, the mutual information value between the candidate video label b_l and the corresponding existing video word set;

a word number obtaining subunit, configured to obtain the word number of the words in the existing video word set corresponding to the candidate video label b_l;

a credibility calculation subunit, configured to determine the label credibility of the candidate video label b_l according to a credibility adjustment parameter, the mutual information value corresponding to the candidate video label b_l, and the word number.

Wherein the candidate label set further comprises a candidate video label b_j, where j is a positive integer less than or equal to the total number of candidate video labels in the candidate label set;

the tag data processing apparatus further includes:

a first association degree obtaining module, configured to, if the candidate video label b_j is a first associated label of the candidate video label b_l, obtain a first label association degree between the candidate video label b_j and the candidate video label b_l; the first label association degree is determined based on the number of co-occurrences of the candidate video label b_j and the candidate video label b_l in the video labels of the at least two existing label video data;

a first credibility determination module, configured to determine the label credibility between the candidate video label b_j and the target video data according to the first label association degree and the mutual information between the candidate video label b_l and the corresponding existing video word set;

a second association degree obtaining module, configured to, if the candidate video label b_j is a second associated label of the candidate video label b_l, obtain a second label association degree between the candidate video label b_j and the candidate video label b_l; the second label association degree is determined based on the number of co-occurrences of the candidate video label b_j and the candidate video label b_l in the video labels of the at least two existing label video data;

a second credibility determination module, configured to determine the label credibility between the candidate video label b_j and the target video data according to the second label association degree and the generation probability corresponding to the candidate video label b_l.

Wherein, the target label acquisition unit includes:

the credibility model input subunit is used for inputting the video characteristic vectors of each candidate video label and the target video data into a credibility determination model; the credibility determination model is obtained by training video feature vectors of at least two existing label video data and video labels of at least two existing label video data;

the model credibility output subunit is used for outputting the model credibility between each candidate video label and the target video data based on the credibility determination model and the video feature vector of the target video data;

the screening reliability determining subunit is used for determining the screening label reliability between each candidate video label and the target video data based on the model reliability between each candidate video label and the target video data and the label reliability between each candidate video label and the target video data;

and the target label determining subunit is used for determining, from the candidate label set, the candidate video labels whose screening label reliability with the target video data is greater than or equal to the screening reliability threshold as the target video labels.

Wherein the screening confidence determination subunit includes:

the credibility weight obtaining subunit is used for obtaining a first credibility configuration weight aiming at the model credibility and obtaining a second credibility configuration weight aiming at the label credibility;

and the screening credibility operator unit is used for determining the screening label credibility between each candidate video label and the target video data according to the first credibility configuration weight, the second credibility configuration weight, the model credibility between each candidate video label and the target video data and the label credibility between each candidate video label and the target video data.

Wherein, the video acquisition module includes:

the text information acquisition unit is used for acquiring video image information and video audio information of the target video data and acquiring video text information of the target video data;

the classification model input unit is used for inputting video image information, video audio information and video text information into a video classification model; the video classification model is obtained by training at least two existing label video data and video types corresponding to the at least two existing label video data;

and the target type output unit is used for outputting the target video type of the target video data based on the video classification model.

Wherein, text information acquisition unit includes:

the video information acquisition subunit is used for acquiring video title information, video description information and video subtitle information of the target video data;

the information word segmentation subunit is used for segmenting the video subtitle information to obtain subtitle keywords in the video subtitle information;

and the splicing subunit is used for splicing the video title information, the video description information and the subtitle keywords to obtain video text information of the target video data.

Wherein the video image information comprises at least two image frames of the target video data; the video audio information comprises at least two audio frames of audio data of the target video data;

a target type output unit including:

the image vector fusion subunit is used for generating an image feature vector of each image frame of the at least two image frames based on the video classification model, and performing feature vector fusion on the image feature vector of each image frame to obtain an image fusion feature vector;

the audio vector fusion subunit is used for generating an audio feature vector of each audio frame in at least two audio frames based on the video classification model, and performing feature vector fusion on the audio feature vector of each audio frame to obtain an audio fusion feature vector;

the text vector generating subunit is used for generating text characteristic vectors of the video text information based on the video classification model;

the vector splicing subunit is used for carrying out vector splicing on the image fusion characteristic vector, the audio fusion characteristic vector and the text characteristic vector to obtain a video characteristic vector of the target video data;

and the target type output subunit is used for outputting the target video type of the target video data in the video classification model according to the video feature vector of the target video data.

Wherein, the video acquisition module is used for:

acquiring target video data sent by a client;

the tag data processing apparatus is further configured to:

and sending the target video label of the target video data to the client, so that the client outputs the target video data in association with the target video label.

An aspect of the application provides a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform a method as in an aspect of the application.

An aspect of the application provides a computer-readable storage medium having stored thereon a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the above-mentioned aspect.

According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternatives of the above aspect and the like.

The method and the device can acquire the target video data and determine the target video type of the target video data; acquire a mutual information index table, where the mutual information index table is created based on mutual information between the existing video word sets of the at least two existing label video data and the video labels of the at least two existing label video data; acquire a target video word set of the target video data, and acquire a first candidate video tag of the target video data from the mutual information index table according to the target video word set and the target video type, where the first candidate video tag is a video tag of existing label video data with the target video type; and add the first candidate video tag to a candidate tag set, and determine a target video tag of the target video data from the candidate tag set according to mutual information between the first candidate video tag and the corresponding existing video word set. Therefore, with the method provided by the application, the first candidate video tag for the target video data can be obtained through the mutual information index table established from the existing label video data, and the target video tag for the target video data can then be obtained through the first candidate video tag, which improves the efficiency of acquiring the target video tag. Moreover, since there can be multiple first candidate video tags of various kinds, the tag types of the target video tags are enriched.

Drawings

In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;

FIG. 2a is a schematic view of a tag data processing scenario provided herein;

fig. 2b is a schematic view of a scene for acquiring a candidate video tag according to the present application;

fig. 2c is a schematic view of a scene for acquiring a target video tag according to the present application;

FIG. 3 is a schematic flow chart of a tag data processing method provided in the present application;

fig. 4 is a schematic flowchart of a video type identification method provided in the present application;

fig. 5 is a schematic flowchart of a video tag obtaining method provided in the present application;

FIG. 6 is a table diagram of tag association probabilities provided herein;

FIG. 7 is a schematic flow chart diagram of a model confidence determination method provided herein;

fig. 8 is a schematic view of a scenario of a tag obtaining method provided in the present application;

fig. 9a is a schematic page diagram of a terminal device provided in the present application;

fig. 9b is a schematic page diagram of a terminal device provided in the present application;

FIG. 10 is a schematic flow chart diagram illustrating a tag acquisition method provided herein;

fig. 11 is a schematic structural diagram of a tag data processing apparatus provided in the present application;

fig. 12 is a schematic structural diagram of a tag data processing apparatus provided in the present application;

fig. 13 is a schematic structural diagram of a computer device provided in the present application.

Detailed Description

The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings in the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The present application relates generally to machine learning in artificial intelligence. Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or realize human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.

The machine learning referred to in this application mainly means that a label generation model, a video classification model and a credibility determination model are obtained through machine learning. The label generation model is used for generating video labels of video data, the video classification model is used for identifying the video type of video data, and the credibility determination model is used for identifying the credibility between a video label and video data. The specific uses of the label generation model, the video classification model and the credibility determination model can be seen in the description of the embodiment corresponding to fig. 3 below.

Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 200 and a terminal device cluster, and the terminal device cluster may include one or more terminal devices, where the number of terminal devices is not limited here. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 100a, a terminal device 101a, terminal devices 102a, …, and a terminal device 103a; as shown in fig. 1, the terminal device 100a, the terminal device 101a, the terminal devices 102a, …, and the terminal device 103a may all be connected to the server 200 through a network, so that each terminal device may perform data interaction with the server 200 through the network connection.

The server 200 shown in fig. 1 may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform. The terminal device may be: the intelligent terminal comprises intelligent terminals such as a smart phone, a tablet computer, a notebook computer, a desktop computer and an intelligent television.

The following takes communication between the terminal device 100a and the server 200 as an example to describe an embodiment of the present application in detail. The embodiment of the present application specifically describes how to obtain a video tag of target video data by using the video tags of existing tagged video data. The existing tagged video data is equivalent to sample video data and refers to video data to which video tags have been added; in other words, the existing tagged video data is historical video data to which video tags have been added. The target video data can be any video data to which a video tag needs to be added. The terminal device 100a may send the target video data to the server 200 through the client, and then the server 200 may generate a corresponding video tag for the target video data, as described below:

referring to fig. 2a, fig. 2a is a schematic view of a scenario of tag data processing provided in the present application. As shown in fig. 2a, the server 200 generates a mutual information index table by using the video words and video tags of the existing tag video data 101b as an example, and the description of the present application is given here. As shown in the area 101b, the video words of the existing tag video data 101b include video word 1, video word 2, and video word 3, and the video tags of the existing tag video data 101b include video tag 1 and video tag 2. The video words of the existing tag video data 101b may include words in the video title information, words in the video description information, and words in the video caption information of the existing tag video data 101 b. It will be appreciated that the video title information of the video data is also the title of the video data, for example, the title of the video data may be "xx engineers home-made helicopter" or the like. The video description information of the video data may be introduction information about the video data, for example, the video description information of the video data may be "the video is shot at xx venue, xx is mainly described", and the like. The video subtitle information of the video data is the subtitle in the video data, and the video subtitle information can be subtitle information extracted from the video data.

Next, the server 200 may combine the video words of the existing tagged video data 101b according to the combined word number to obtain video word sets composed of the video words of the existing tagged video data 101b, where the number of video words included in one video word set is not greater than the combined word number. Assuming that the combined word number is 3, the server 200 combines the video words of the existing tagged video data 101b, and a total of 7 video word sets displayed in the area 104b can be obtained, where the 7 video word sets are: the video word set 105b, the video word set 106b, the video word set 107b, the video word set 108b, the video word set 109b, the video word set 110b and the video word set 111b, and the number of video words in each video word set is not greater than the combined word number 3.

Wherein, the video word set 105b includes video word 1; video word set 106b includes video word 2; video word set 107b includes video word 3; video word set 108b includes video word 1 and video word 2; video word set 109b includes video word 1 and video word 3; video word set 110b includes video word 2 and video word 3; video word set 111b includes video word 1, video word 2, and video word 3.
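The combination step in this example can be reproduced in a few lines; treating every non-empty combination of at most "combined word number" video words as a video word set is an assumption consistent with the 7 sets listed above.

```python
from itertools import combinations

def build_word_sets(video_words, combined_word_number):
    """Return every non-empty combination of video words whose size does not
    exceed the combined word number."""
    word_sets = []
    for size in range(1, combined_word_number + 1):
        word_sets.extend(frozenset(c) for c in combinations(video_words, size))
    return word_sets

# Reproduces the 7 video word sets (105b-111b) of existing tagged video data 101b.
sets_for_101b = build_word_sets(["video word 1", "video word 2", "video word 3"], 3)
print(len(sets_for_101b))  # 7
```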

Next, the server 200 may construct the mapping relationship between each obtained video word set and the video tags of the existing tagged video data 101b, and calculate the mutual information value between each video word set and the video tags of the existing tagged video data 101b, so as to generate the mutual information index table. As shown in table 112b, apart from the header row ("video type information", "video word set", "video tag", "mutual information value"), each row of table 112b contains one mapping relationship, and a mapping relationship includes a video word set, a video tag, a mutual information value, and video type information. The video type information is used to characterize the video type of the existing tagged video data to which the video tag contained in the mapping relationship belongs. For example, the mapping relationships included in table 112b are the mapping relationships between the video word sets of the existing tagged video data 101b and the video tags of the existing tagged video data 101b; as can be seen from the "video type information" column of table 112b, the video type of the existing tagged video data 101b is the television series type. The mutual information value is calculated from the number of existing tagged video data to which the video word set and the video tag jointly belong, and the specific calculation process for the mutual information value may be referred to in the following step S102.

Specifically, in the table 112b, the row 100h includes a mapping relationship between the video word set 105b and the video tag 1, and the mutual information value between the video word set 105b and the video tag 1 is 0.109. The line 101h includes a mapping relationship between the video word set 106b and the video tag 1, and the mutual information value between the video word set 106b and the video tag 1 is 0.762. The line 102h includes the mapping relationship between the video word set 107b and the video tag 1, and the mutual information value between the video word set 107b and the video tag 1 is 0.234. Line 103h includes the mapping relationship between video word set 108b and video tag 1, and the mutual information value between video word set 108b and video tag 1 is 0.325. The row 104h includes the mapping relationship between the video word set 109b and the video tag 1, and the mutual information value between the video word set 109b and the video tag 1 is 0.865. The line 105h includes the mapping relationship between the video word set 110b and the video tag 1, and the mutual information value between the video word set 110b and the video tag 1 is 0.561. The line 106h includes a mapping relationship between the video word set 111b and the video tag 1, and a mutual information value between the video word set 111b and the video tag 1 is 0.686.

More specifically, the row 107h includes the mapping relationship between the video word set 105b and the video tag 2, and the mutual information value between the video word set 105b and the video tag 2 is 0.356. The line 108h includes the mapping relationship between the video word set 106b and the video tag 2, and the mutual information value between the video word set 106b and the video tag 2 is 0.891. Line 109h includes the mapping between video word set 107b and video tag 2, with the mutual information value between video word set 107b and video tag 2 being 0.985. Line 110h includes the mapping relationship between video word set 108b and video tag 2, and the mutual information value between video word set 108b and video tag 2 is 0.997. Line 111h includes the mapping between video word set 109b and video tag 2, with the mutual information value between video word set 109b and video tag 2 being 0.416. The row 112h includes the mapping relationship between the video word set 110b and the video tag 2, and the mutual information value between the video word set 110b and the video tag 2 is 0.632. The row 113h includes the mapping relationship between the video word set 111b and the video tag 2, and the mutual information value between the video word set 111b and the video tag 2 is 0.367.

The server 200 may reserve the mapping relationships in table 112b whose mutual information value is greater than or equal to the mutual information threshold to generate the mutual information index table, and remove the mapping relationships whose mutual information value is less than the mutual information threshold. Assuming that the mutual information threshold is 0.7, table 113b may be regarded as the mutual information index table generated from the video word sets and video tags of the existing tagged video data 101b: it contains the mapping relationships in table 112b whose mutual information value is not less than the threshold 0.7, namely the mapping relationships in row 101h, row 104h, row 108h, row 109h, and row 110h of table 112b.

The above is a process of generating the mutual information index table by the server 200 through the video word set and the video tags of the existing tag video data 101b, and in an actual scene, a plurality of existing tag video data may be involved, and the mutual information index table may be generated through the video word set and the video tags of the plurality of existing tag video data. The video number of the existing tag video data used for generating the mutual information index table may be determined according to the actual application scenario, and is not limited thereto.

Please refer to fig. 2b, which is a scene schematic diagram for obtaining candidate video tags according to the present application. As shown in fig. 2b, assume that the server 200 generates the mutual information index table shown as table 116b by using the video word sets and video tags of a plurality of existing tagged video data. The server 200 may further obtain the video word sets of the target video data, and the process of obtaining the video word sets of the target video data is the same as the process of obtaining the video word sets of the existing tagged video data. The video word sets of the target video data acquired by the server 200 here include the video word set 117b, the video word set 118b, and the video word set 119b in the region 114b. Video word 1 is included in video word set 117b, video word 2 is included in video word set 118b, and both video word 1 and video word 2 are included in video word set 119b.

If the video type of the target video data is the television series type, the server 200 may retrieve, as target mapping relationships, the mapping relationships in table 116b that contain a video word set of the target video data and whose video type information is the television series type. The server 200 may use the video tags included in the target mapping relationships as candidate video tags of the target video data. As shown in table 116b, the mapping relationships in row 100k, row 102k, row 103k, row 104k, and row 105k of table 116b contain both the video type information "television series" and a video word set of the target video data. Thus, video tag 1 in row 100k, video tag 2 in row 102k, video tag 4 in row 103k, video tag 2 in row 104k, and video tag 3 in row 105k may be taken as candidate video tags of the target video data. Therefore, the candidate video tags of the target video data acquired here include video tag 1, video tag 2, video tag 3, and video tag 4 in the region 115b.

Please refer to fig. 2c, fig. 2c is a schematic view of a scene for acquiring a target video tag according to the present application. The server 200 may obtain a target video tag of the target video data from the candidate video tags obtained in fig. 2b, where the target video tag is a video tag that is finally generated by the server 200 for the target video data. First, the server 200 may obtain a tag reliability of each candidate video tag, where the tag reliability may characterize the reliability of the candidate video tag as the target video tag. The server 200 may obtain the tag reliability of each candidate video tag by the mutual information value corresponding to each candidate video tag.

The mutual information value of each candidate video tag is obtained as follows. Since the candidate video tags include video tag 1, and video tag 1 is obtained from row 100k of table 116b, the mutual information value of video tag 1 is 0.762. The candidate video tags also include video tag 2, and video tag 2 is obtained from both row 102k and row 104k of table 116b, so the maximum value, 0.997, of the mutual information values in row 102k and row 104k can be taken as the mutual information value of video tag 2. Similarly, the mutual information value of video tag 3 as a candidate video tag is the mutual information value 0.997 in row 105k, and the mutual information value of video tag 4 as a candidate video tag is the mutual information value 0.985 in row 103k. Then, the server can calculate the tag reliability corresponding to video tag 1, video tag 2, video tag 3, and video tag 4 from their respective mutual information values. As shown in region 100c, the calculated tag reliability corresponding to video tag 1 is tag reliability 1, that corresponding to video tag 2 is tag reliability 2, that corresponding to video tag 3 is tag reliability 3, and that corresponding to video tag 4 is tag reliability 4. The specific process by which the server calculates the tag reliability of each candidate video tag can be found in step S104 below.
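
To make the retrieval and the per-tag aggregation above concrete, the following is a minimal sketch; it assumes the mutual information index table is held in memory as rows of (video type, video word set, video tag, mutual information value), and all names and numbers are illustrative placeholders rather than the exact contents of table 116b.

```python
# A minimal sketch, assuming the mutual information index table is a list of rows
# (video_type, video_word_set, video_tag, mutual_information_value); the rows and
# word sets below are illustrative placeholders.
def first_candidate_tags(index_table, target_word_sets, target_video_type):
    """Collect video tags from rows that match the target video type and one of the
    target video word sets, keeping the maximum mutual information value per tag."""
    best_mi = {}
    for video_type, word_set, video_tag, mi_value in index_table:
        if video_type == target_video_type and word_set in target_word_sets:
            best_mi[video_tag] = max(mi_value, best_mi.get(video_tag, 0.0))
    return best_mi

index_table = [
    ("television series", frozenset({"video word 1"}), "video tag 1", 0.762),
    ("television series", frozenset({"video word 1"}), "video tag 2", 0.891),
    ("television series", frozenset({"video word 1", "video word 2"}), "video tag 2", 0.997),
]
target_word_sets = {
    frozenset({"video word 1"}),
    frozenset({"video word 2"}),
    frozenset({"video word 1", "video word 2"}),
}
print(first_candidate_tags(index_table, target_word_sets, "television series"))
# {'video tag 1': 0.762, 'video tag 2': 0.997} -- video tag 2 keeps its larger value
```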

Next, the server 200 may also input each candidate video tag and the video feature vector of the target video data into the credibility determination model 101c. The credibility determination model 101c is obtained by training on the video feature vectors of existing tagged video data and the video tags of the existing tagged video data, and is used to obtain a model credibility between each input candidate video tag and the target video data; the model credibility likewise represents the credibility of the candidate video tag being a video tag of the target video data. The specific process of acquiring the video feature vector of video data, i.e., a machine-readable representation of the video data, whether the video feature vector of existing tagged video data or that of the target video data, can be found in step S101 below.

Next, the server 200 may output, via the credibility determination model 101c, the model credibility between each candidate video tag and the target video data, as shown in region 102c, including model credibility 1 of video tag 1, model credibility 2 of video tag 2, model credibility 3 of video tag 3, and model credibility 4 of video tag 4. Next, as shown in region 103c, the server 200 may calculate, from the tag reliability and the model credibility corresponding to each candidate video tag, a screening tag reliability for each candidate video tag, where the screening tag reliability represents the credibility of that candidate video tag being the final target video tag of the target video data. Here, the calculated screening tag reliability of video tag 1 is screening tag reliability 1, that of video tag 2 is screening tag reliability 2, that of video tag 3 is screening tag reliability 3, and that of video tag 4 is screening tag reliability 4. The specific process of calculating the screening tag reliability of each candidate video tag can also be found in step S104 below.

The server 200 may use the candidate video tag whose screening tag reliability is greater than or equal to the screening reliability threshold as the target video tag of the target video data. The screening reliability threshold value can be set according to the actual application scene, and is not limited to this. As shown in the area 104c, the target video tags of the target video data finally obtained by the server 200 may include video tag 1 and video tag 3.
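
A hedged sketch of this screening step follows; the exact rule for combining the tag reliability and the model credibility is given in step S104, so the weighted sum used here, together with the weights and the threshold, is only an assumption for illustration.

```python
# Hedged sketch: combine tag reliability and model credibility into a screening tag
# reliability and keep tags above a threshold. The weighted-sum form, the weights,
# and the threshold are assumptions for illustration only.
def screening_tag_reliability(tag_reliability, model_credibility, w_tag=0.5, w_model=0.5):
    return w_tag * tag_reliability + w_model * model_credibility

def select_target_video_tags(candidates, threshold=0.8):
    # candidates: {video_tag: (tag_reliability, model_credibility)}
    return [tag for tag, (tr, mc) in candidates.items()
            if screening_tag_reliability(tr, mc) >= threshold]

print(select_target_video_tags({
    "video tag 1": (0.90, 0.85),   # kept
    "video tag 2": (0.60, 0.70),   # dropped
    "video tag 3": (0.95, 0.80),   # kept
    "video tag 4": (0.40, 0.90),   # dropped
}))
# ['video tag 1', 'video tag 3']
```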

Then, the server 200 may send the acquired target video tags to the terminal device 100a, and the terminal device 100a may output and display the target video data in association with the target video tags for the user to view.

Optionally, the process of obtaining the target video tag of the target video data may also be executed by the terminal device 100a, in other words, the terminal device 100a may independently obtain the target video tag of the target video data, and further perform associated output display on the target video data and the target video tag. Of course, the above-described process of acquiring the target video tag of the target video data may also be performed by the terminal device 100a and the server 200 together. The execution subject for acquiring the target video tag is determined according to a specific application scenario, and is not limited thereto.

By the above method, the mutual information index table can be created from the video word sets and video tags of existing tagged video data, and the target video tags of the target video data can then be obtained through the created mutual information index table, which improves the efficiency of obtaining the target video tags and enriches the tag types of the target video tags.

Referring to fig. 3, fig. 3 is a flowchart illustrating a tag data processing method provided in the present application, where the method may be executed by a terminal device (e.g., the terminal device shown in fig. 1), or may be executed by a server (e.g., the server 200 shown in fig. 1), or may be executed by cooperation between the terminal device and the server. For the sake of understanding, the present embodiment is described as an example in which the method is executed by the above server to describe a specific process of acquiring the target video tag of the target video data. As shown in fig. 3, the method may include:

step S101, acquiring target video data and determining a target video type of the target video data;

specifically, the server may obtain target video data, where the target video data may be any one of video data, and the target video data may carry video title information, video description information, and video subtitle information. For example, the target video data may be sent by a client to a server, and the client may request the server for a video tag for generating the target video data by sending the target video data to the server, where the target video data may be any video imported by a user at the client. Wherein the video title information of the target video data refers to a video title, i.e., a video name, of the target video data. The video description information of the target video data may be introduction information or the like related to the target video data. The video subtitle information of the target video data may refer to a video subtitle in the video data.

Preferably, the server may obtain the video type of the target video data, which may be referred to as the target video type; for example, the target video type may be a television series type, a movie type, a game type, an animation type, a technology type, a politics type, or a lifestyle type. The following describes how to obtain the target video type of the target video data:

First, the server may obtain video image information, video audio information, and video text information of the target video data. For the video image information of the target video data, the server may extract image frames from the target video data, for example, by using the FFmpeg uniform frame extraction method, that is, extracting the image frames of the target video data. FFmpeg is an open-source suite of computer programs for recording, converting, and streaming digital audio and video; converting the target video data into a stream with FFmpeg allows its frames to be extracted quickly. When extracting image frames from the target video data, frames may be extracted at intervals of 20 milliseconds, yielding a plurality of image frames of the target video data. Each image frame may be represented by its pixel values, either as a sequence or as a matrix. In this application, "a plurality" means at least two. The plurality of image frames extracted from the target video data may be used as the video image information of the target video data.
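
As a concrete illustration of the uniform frame extraction described above, the following sketch invokes FFmpeg from Python; a 20-millisecond interval corresponds to 50 frames per second, and the file paths are placeholders.

```python
# Hedged sketch: extract one image frame every 20 ms (i.e., 50 fps) with FFmpeg.
# Paths are placeholders; the frames/ directory must exist beforehand.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "target_video.mp4",
     "-vf", "fps=50",          # one frame every 20 milliseconds
     "frames/%06d.jpg"],
    check=True,
)
```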

For the video audio information of the target video data, the server may separate the audio data of the target video data from the target video data. Then, the server may perform audio framing on the separated audio data, for example, also by using the above-mentioned FFmpeg uniform framing method. By framing the audio data of the target video data, a plurality of audio frames of the target video data can be obtained. An audio frame may be represented as a sequence of its energy values. When extracting the audio frames of the target video data, frames may likewise be extracted at intervals of 20 milliseconds.

For the video text information of the target video data, the server can obtain the video title information, video description information, and video subtitle information of the target video data. The video subtitle information may be obtained by the server through OCR (optical character recognition, a method of extracting text from images) on the video pictures of the target video data. Alternatively, the video subtitle information may be obtained by the server through ASR (automatic speech recognition, a method of converting speech into text) on the audio data of the target video data. Because the video subtitle information of the target video data is usually long, it can be segmented into words, and the subtitle keywords in the video subtitle information can be obtained from the segmentation result. A keyword matching library may be preset: words contained in the keyword matching library are keywords, and words not contained in it are not. Therefore, after word segmentation, the resulting words can be matched against the keyword matching library; the words that exist in the keyword matching library are retained as the subtitle keywords of the target video data, and the words that do not exist in the keyword matching library are discarded. Generally, the video title information and video description information of the target video data are relatively short, so the video title information, the video description information, and the subtitle keywords of the target video data can be directly concatenated to obtain the video text information of the target video data.
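
The keyword matching step can be sketched as follows, assuming the subtitle text has already been segmented into words; the keyword matching library shown is an illustrative stand-in.

```python
# Minimal sketch: keep only segmented subtitle words that appear in the preset
# keyword matching library; all words below are illustrative.
KEYWORD_MATCHING_LIBRARY = {"football", "premiere", "finale"}

def subtitle_keywords(segmented_words):
    return [word for word in segmented_words if word in KEYWORD_MATCHING_LIBRARY]

print(subtitle_keywords(["tonight", "football", "match", "premiere"]))
# ['football', 'premiere']
```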

Optionally, if the video description information of the target video data is also long, the video description information may likewise be segmented into words to obtain the keywords in the video description information, and the server may then concatenate the video title information of the target video data, the keywords in the video description information, and the subtitle keywords to obtain the video text information of the target video data.

After obtaining the video image information, video audio information, and video text information of the target video data, the server may further construct the Mel spectrogram feature of each audio frame in the video audio information of the target video data. Because the Mel spectrogram feature captures the envelope of an audio frame's spectrogram, representing an audio frame by its Mel spectrogram feature better reflects how the frame's energy changes. The server may input the video image information, the Mel spectrogram features of the audio frames in the video audio information, and the video text information into the video classification model.
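
A minimal sketch of constructing the Mel spectrogram features is given below; it assumes the librosa library, and the sampling rate, 20-millisecond hop, and number of Mel bands are illustrative choices rather than values from the source.

```python
# Hedged sketch: log-Mel spectrogram features for the separated audio, assuming librosa.
# Sampling rate, hop length (20 ms), and number of Mel bands are illustrative.
import librosa

audio, sr = librosa.load("target_audio.wav", sr=16000)
mel = librosa.feature.melspectrogram(
    y=audio,
    sr=sr,
    hop_length=int(0.02 * sr),   # one spectrogram column per 20 ms audio frame
    n_mels=64,
)
log_mel = librosa.power_to_db(mel)   # features fed to the video classification model
```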

The video classification model is obtained by training on the video image information, video audio information, video text information, and video types of existing tagged video data. Existing tagged video data is historical video data to which corresponding video tags have already been added. By training on the video image information, video audio information, video text information, and video types of a large amount of existing tagged video data, the video classification model learns which combinations of video image information, video audio information, and video text information correspond to which video types. Therefore, by inputting the video image information, video audio information, and video text information of the target video data into the video classification model, the model can output the corresponding video type of the target video data. The video type of the target video data may be referred to as the target video type.

The specific process of obtaining the target video type of the target video data by the video classification model comprises the following steps:

the video classification model may generate an image feature vector for each image frame (represented as a sequence or matrix) input by the server, where the image feature vector is a feature included in each image frame learned by the video classification model, and each image frame corresponds to one image feature vector. The video classification model may further generate an audio feature vector corresponding to the mel-frequency spectrogram feature of each audio frame input by the server, where the audio feature vector is a feature included in the mel-frequency spectrogram feature of each audio frame learned by the video classification model, and one audio frame corresponds to one audio feature vector. The video classification model can also generate a text feature vector of the video text information input by the server, wherein the text feature vector is a feature contained in the video text information learned by the video classification model.

Then, the video classification model may perform feature vector fusion on all the image feature vectors, for example, through a NetVLAD network. The NetVLAD network is a feature extraction network that can reduce feature dimensionality, for example by fusing a plurality of feature vectors into a single feature vector. Therefore, the video classification model can fuse the image feature vectors corresponding to the image frames into one feature vector through the NetVLAD network, and the feature vector obtained by fusing the image feature vectors may be called the image fusion feature vector.

The video classification model can also perform feature vector fusion on all audio feature vectors, for example, the video classification model can also perform feature vector fusion on each audio feature vector through a NetVLAD network. Therefore, the video classification model can fuse the audio feature vectors corresponding to each audio frame into one feature vector through the NetVLAD network, and the feature vector obtained by fusing each audio feature vector can be called as an audio fusion feature vector.

The video classification model can carry out vector splicing on the image fusion characteristic vector, the audio fusion characteristic vector and the text characteristic vector to obtain a video characteristic vector of the target video data. The video feature vector of the target video data is a multi-modal feature vector, and text features of video text information, audio features of video audio information and image features of video image information of the target video data are fused at the same time, so that the video feature vector of the target video data obtained through the video classification model can comprehensively and accurately represent the video features of the target video data. In other words, the video feature vector of the target video data is the feature of the target video data finally learned by the video classification model.

Because the video classification model has already processed the video text information, video audio information, and video image information of the existing tagged video data through the same process as above, it has learned the characteristics of the existing tagged video data, that is, their video feature vectors, and has learned which video type each such video feature vector should correspond to. Therefore, the video classification model can output, through a fully connected layer, the video type corresponding to the learned video feature vector of the target video data, that is, the target video type.
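
The multi-modal fusion and classification described above can be sketched as follows in PyTorch; mean pooling stands in for the NetVLAD fusion, and all dimensions and the number of video types are illustrative assumptions rather than details from the source.

```python
# Hedged sketch of the classification head: per-frame image and audio features are
# pooled (mean pooling stands in for NetVLAD), concatenated with the text feature,
# and passed through a fully connected layer. Dimensions and class count are assumed.
import torch
import torch.nn as nn

class VideoTypeClassifier(nn.Module):
    def __init__(self, img_dim=1536, aud_dim=128, txt_dim=768, num_video_types=7):
        super().__init__()
        self.fc = nn.Linear(img_dim + aud_dim + txt_dim, num_video_types)

    def forward(self, image_feats, audio_feats, text_feat):
        # image_feats: (num_image_frames, img_dim); audio_feats: (num_audio_frames, aud_dim)
        image_fusion = image_feats.mean(dim=0)      # stand-in for NetVLAD fusion
        audio_fusion = audio_feats.mean(dim=0)      # stand-in for NetVLAD fusion
        video_feature_vector = torch.cat([image_fusion, audio_fusion, text_feat], dim=-1)
        return self.fc(video_feature_vector), video_feature_vector

model = VideoTypeClassifier()
logits, video_vec = model(torch.randn(30, 1536), torch.randn(40, 128), torch.randn(768))
target_video_type = logits.argmax().item()          # index of the predicted video type
```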

Through the process, the identification of the video type of the target video data is completed, and the target video type of the target video data is obtained.

Referring to fig. 4, fig. 4 is a schematic flowchart of a video type identification method provided in the present application. The network structure in fig. 4 is the network structure of the video classification model. S201: first, the server may input the video frame sequence of the target video data, that is, the pixel sequence corresponding to each of the plurality of image frames of the target video data, into the video classification model. S202: the video classification model may construct the video frame representation, that is, obtain the image feature vector corresponding to each image frame; the image feature vector is a vector representation of the image frame. The video classification model may obtain the image feature vector of each image frame through an Inception-ResNet-v2 network (a convolutional network for feature extraction). S203: the server may perform multi-frame feature fusion on the obtained plurality of image feature vectors, that is, fuse the plurality of image feature vectors into one image fusion feature vector.

Subsequently, S204: the server may input an audio frame sequence of the target video data into the video classification model, where the audio frame sequence is a sequence of energy values corresponding to each of the plurality of audio frames of the target video data. S205: the video classification model may construct an audio frame representation, where constructing an audio frame representation is to obtain an audio feature vector corresponding to each audio frame, and the audio feature vector is a vector representation of the audio frame. The video classification model may obtain the audio feature vector of each audio frame through a Vggish network (an audio feature extraction network). S206: the server may perform multi-frame feature fusion on the obtained multiple audio feature vectors, that is, perform fusion on the multiple audio feature vectors to obtain one audio fusion feature vector.

Subsequently, S207: the server can obtain the video text information of the target video data through the video title information, the video description information and the caption keywords of the target video data. S208: the server may input the video text information of the target video data to a video classification model, which may construct a textual representation of the video text information of the target video data via a self-attention mechanism network (a natural language processing network). S209: the video classification model can obtain the text feature vector corresponding to the video text information by constructing the text representation of the video text information of the target video data. The text feature vector is the text feature obtained by the video classification model.

Then, S210: the video classification model can perform vector splicing on the obtained image fusion characteristic vector, audio fusion characteristic vector and text characteristic vector to obtain a video characteristic vector of the target video data. The video feature vector of the target video data is a video multi-modal feature fusion representation of the target video data. S211: the video classification model can give the video feature vector of the target video data to a full-connection layer network, and the video feature vector of the target video data is identified through the full-connection layer network, so that the video type of the target video data can be obtained. S212: the video classification model may output the video type of the resulting target video data, i.e., output the target video type.

Step S102, obtaining a mutual information index table; the mutual information index table is created based on mutual information between the existing video word set of the at least two existing label video data and the video labels of the at least two existing label video data;

specifically, the server may further obtain a mutual information index table, where the mutual information index table is an index table used for obtaining candidate video tags of the target video data, and the candidate video tags of the target video data obtained from the mutual information index table may be collectively referred to as a first candidate video tag. The mutual information index table is obtained through a video word set of existing label video data and a video label of the existing label video data. The video words of the existing label video data can be called existing video words, the video word set of the existing label video data can be called existing video word set, the video words of the target video data can be called target video words, and the video word set of the target video data can be called target video word set. The specific obtaining process of the mutual information index table is as follows:

The server can create the mutual information index table from a plurality of (at least two) existing tagged video data, and the server can acquire the existing video word set of each existing tagged video data. The method for acquiring the existing video word set of existing tagged video data is as follows: the server can obtain the video title information, video description information, and video subtitle information of the existing tagged video data, and perform word segmentation on each of them to obtain the words in the video title information, the words in the video description information, and the words in the video subtitle information. Words in the video title information may be referred to as title words, words in the video description information as description words, and words in the video subtitle information as subtitle words. For example, if the video title information is "dinner party at night today", segmenting the video title information yields the title word "today", the title word "night", and the title word "dinner party".

Because the video subtitle information of video data is usually long and contains many words, only the subtitle keywords in the video subtitle information may be taken as the subtitle words of the video subtitle information. The subtitle keywords in the video subtitle information can be screened out through the keyword matching library, that is, words that exist in the keyword matching library are keywords and words that do not exist in it are not. Words of the video subtitle information that exist in the keyword matching library can be used as the subtitle words of the video subtitle information, and words of the video subtitle information that do not exist in the keyword matching library can be discarded.

The title words, description words, and caption words of the existing tagged video data may be collectively referred to as existing video words of the existing tagged video data. Next, the server may combine the existing video words corresponding to each existing tag video data to obtain an existing video word set corresponding to each existing tag video data. Specifically, the server may combine the existing video words of the existing tagged video data according to the number of the combined words to obtain an existing video word set of the existing tagged video data, and the number of the existing video words included in one existing video word set is not greater than the number of the combined words. It is to be understood that combining existing video words refers to combining between a plurality of existing video words belonging to the same existing tagged video data.

For example, if the existing video words of a certain existing tagged video data include video word 1, video word 2, and video word 3, and the combined word number is 2, combining the existing video words of this existing tagged video data yields a video word set containing video word 1, a video word set containing video word 2, a video word set containing video word 3, a video word set containing video word 1 and video word 2, a video word set containing video word 1 and video word 3, and a video word set containing video word 2 and video word 3. If the combined word number is 3, combining the existing video words additionally yields, besides the above 6 video word sets, a video word set containing video word 1, video word 2, and video word 3.
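A minimal sketch of this word-set combination step follows; it reproduces the counts in the example above (6 sets for a combined word number of 2, 7 sets for 3).

```python
# Minimal sketch: build all video word sets whose size does not exceed the combined
# word number, using itertools.combinations.
from itertools import combinations

def build_word_sets(video_words, combined_word_number):
    word_sets = []
    for size in range(1, combined_word_number + 1):
        word_sets.extend(frozenset(c) for c in combinations(video_words, size))
    return word_sets

words = ["video word 1", "video word 2", "video word 3"]
print(len(build_word_sets(words, 2)))   # 6 word sets
print(len(build_word_sets(words, 3)))   # 7 word sets
```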

The server may create mapping relationships between the existing video word sets of each existing tagged video data and the video tags of that existing tagged video data, one mapping relationship corresponding to one existing video word set and one video tag. For example, suppose there are existing tagged video data 1 and existing tagged video data 2. The video tags of existing tagged video data 1 include video tag b1 and video tag b2, and the existing video word sets of existing tagged video data 1 include existing video word set j1 and existing video word set j2. The video tags of existing tagged video data 2 include video tag b3 and video tag b4, and the existing video word sets of existing tagged video data 2 include existing video word set j1 and existing video word set j3.

The mapping relationships created by the server may include the mapping relationships for existing tagged video data 1: the mapping relationship between existing video word set j1 and video tag b1, between existing video word set j1 and video tag b2, between existing video word set j2 and video tag b1, and between existing video word set j2 and video tag b2; and the mapping relationships for existing tagged video data 2: the mapping relationship between existing video word set j1 and video tag b3, between existing video word set j1 and video tag b4, between existing video word set j3 and video tag b3, and between existing video word set j3 and video tag b4.

The server can also calculate a mutual information value between the video label included in each mapping relation and the existing video word set, wherein the mutual information value represents the relevance between the existing video word set and the video label, in other words, the mutual information value represents the probability of the common occurrence of the existing video word set and the video label. For example, the mutual information value between an existing video word set x and a video tag y refers to the probability that, when an existing video word set of certain existing tag video data includes an existing video word set x, a video tag of the existing tag video data includes the video tag y at the same time.

The calculation of the mutual information value is described below. The mutual information value can be calculated separately for the existing tagged video data of each video type, and one mapping relationship corresponds to one mutual information value. The mutual information value of each mapping relationship is calculated in the same manner; the mutual information value of a mapping relationship x is taken as an example. Assuming that the mapping relationship x includes an existing video word set y and a video tag z, and the video type of the existing tagged video data to which the video tag z belongs is video type h, the server may obtain the occurrence number of the existing video word set y among the existing video word sets of the existing tagged video data having video type h. It can be understood that, provided no existing video word set is repeated among the existing video word sets belonging to the same existing tagged video data, the occurrence number of the existing video word set y is the number of videos of existing tagged video data that contain the existing video word set y.

For example, if existing tagged video data having a video type h includes existing tagged video data 1, existing tagged video data 2, and existing tagged video data 3, where an existing video word set of the existing tagged video data 1 includes an existing video word set y, an existing video word set of the existing tagged video data 2 also includes an existing video word set y, and an existing video word set of the existing tagged video data 3 does not include an existing video word set y, the number of occurrences of the existing video word set y is 2.

The server can also acquire the occurrence frequency of the video tag z in the video tag of the existing tag video data with the video type h in the same manner as the acquisition of the occurrence frequency of the existing video word set y. In addition, the server can also obtain the co-occurrence times of the existing video word set y and the video label z, wherein the co-occurrence times are the video number of the existing label video data to which the existing video word set y and the video label z belong together. For example, if the existing tag video data with the video type h includes existing tag video data 1, existing tag video data 2, and existing tag video data 3, and only the existing video word set of the existing tag video data 1 includes an existing video word set y and the video tag of the existing tag video data 1 includes a video tag z, the number of co-occurrences of the existing video word set y and the video tag z is 1.

After acquiring the occurrence number of the existing video word set y (which may be recorded as c1), the occurrence number of the video tag z (which may be recorded as c2), and the co-occurrence number of the existing video word set y and the video tag z (which may be recorded as c3), the server may take the square of the co-occurrence number of the existing video word set y and the video tag z divided by the product of the occurrence number of the existing video word set y and the occurrence number of the video tag z as the mutual information value between the existing video word set y and the video tag z, that is, as the mutual information value corresponding to the mapping relationship x; in other words, the mutual information value corresponding to the mapping relationship x is equal to c3²/(c1 × c2).
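
The calculation just described can be written compactly as follows; the counts in the example call are illustrative.

```python
# Minimal sketch of the mutual information value: the squared co-occurrence count
# divided by the product of the two individual occurrence counts.
def mutual_information_value(c1, c2, c3):
    # c1: occurrences of the existing video word set, c2: occurrences of the video tag,
    # c3: co-occurrences of the word set and the tag within one video type
    return (c3 * c3) / (c1 * c2) if c1 and c2 else 0.0

print(mutual_information_value(c1=2, c2=1, c3=1))   # 0.5
```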

The server may refer to a mapping relationship in which the mutual information value is greater than or equal to the mutual information threshold as a reserved mapping relationship, where the reserved mapping relationship is a mapping relationship used for generating the mutual information index table, and discard a mapping relationship in which the mutual information value is less than the mutual information threshold. In other words, the server may filter the mapping relationship, remove the mapping relationship with a smaller mutual information value, and reserve the mapping relationship with a larger mutual information value to generate the mutual information index table.

Moreover, the server may add video type information to each reserved mapping relationship according to the video type of the existing tagged video data to which the video tag contained in that reserved mapping relationship belongs, where the video type information indicates the video type of the existing tagged video data to which the video tag contained in the reserved mapping relationship belongs. For example, if the video type of the existing tagged video data to which the video tag contained in a reserved mapping relationship belongs is a television series, the video type information added to that reserved mapping relationship may be the television series type. If the video type of the existing tagged video data to which the video tag contained in a reserved mapping relationship belongs is a movie, the video type information added to that reserved mapping relationship may be the movie type.

The server may generate the mutual information index table according to each reserved mapping relationship and the video type information added to each reserved mapping relationship. In other words, each reserved mapping relationship is included in the mutual information index table, and each reserved mapping relationship in the mutual information index table further carries corresponding video type information.

The mutual information index table may be generated and stored in advance by the server, and the server may directly obtain the generated mutual information index table from the storage area. In other words, the mutual information index table does not need to be generated in real time again every time a target video data is acquired, and the server can directly acquire the mutual information index table. Furthermore, the server may update the generated mutual information index table periodically, for example, by using an existing video word set and video tags of newly acquired existing tagged video data.
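
Putting the steps of this section together, a minimal sketch of generating the mutual information index table is shown below; the data layout (one dictionary per existing tagged video) and the threshold are assumptions made only for illustration.

```python
# Hedged sketch: build the mutual information index table from existing tagged video
# data grouped by video type, then keep only mappings at or above the threshold.
from collections import Counter, defaultdict

def build_mutual_information_index(tagged_videos, mi_threshold=0.7):
    # tagged_videos: iterable of dicts {"video_type": str, "word_sets": set of frozenset, "tags": set}
    by_type = defaultdict(list)
    for video in tagged_videos:
        by_type[video["video_type"]].append(video)

    index_table = []
    for video_type, videos in by_type.items():
        word_set_count, tag_count, pair_count = Counter(), Counter(), Counter()
        for video in videos:
            word_set_count.update(video["word_sets"])
            tag_count.update(video["tags"])
            pair_count.update((ws, t) for ws in video["word_sets"] for t in video["tags"])
        for (word_set, tag), c3 in pair_count.items():
            mi = (c3 * c3) / (word_set_count[word_set] * tag_count[tag])
            if mi >= mi_threshold:                     # reserve this mapping relationship
                index_table.append((video_type, word_set, tag, mi))
    return index_table
```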

Step S103, acquiring a target video word set of target video data, and acquiring a first candidate video tag of the target video data in a mutual information index table according to the target video word set and the target video type; the first candidate video label is a video label of existing label video data with a target video type;

specifically, the server may obtain the target video word set of the target video data in the same manner as the existing video word set of the existing tag video data. That is, the server may obtain video title information, video description information, and video subtitle information of the target video data, and the server may perform word segmentation on the video title information, the video description information, and the video subtitle information of the target video data to obtain title words in the video title information, description words in the video description information, and subtitle words in the video subtitle information of the target video data.

The server may take the title words, description words, and caption words of the target video data as the target video words of the target video data. The server can combine the target video words of the target video data through the number of the combined words to obtain a target video word set of the target video data, wherein the number of the target video words in the target video word set is not more than the number of the combined words. It can be understood that the number of combined words corresponding to the existing tag video data is the same as the number of combined words corresponding to the target video data.

The server may retrieve the target video word sets in the mutual information index table according to the video type of the target video data (i.e., the target video type). Among the video type information added to the mapping relationships in the mutual information index table there is target video type information, which indicates that the video type of the existing tagged video data to which the video tag contained in the mapping relationship belongs is the target video type, that is, the same video type as the target video data. The server can refer to an existing video word set retrieved from the mutual information index table that is identical to a target video word set as a target word set. The server may refer to a mapping relationship in the mutual information index table that both carries the target video type information and contains a target word set as a target mapping relationship.

The server may use the video tag included in the target mapping relationship as a first candidate video tag of the target video data.

Step S104, adding the first candidate video label to a candidate label set, and determining a target video label of target video data from the candidate label set according to mutual information between the first candidate video label and the corresponding existing video word set;

specifically, the server may add the first candidate video tag of the acquired target video data to the candidate tag set. More, the candidate tag set may further include a second candidate video tag, and the obtaining manner of the second candidate video tag may be:

the server may input the video feature vectors of the target video data and the target video type into the tag generation model. The label generation model is obtained by training video feature vectors of a large amount of existing label video data, video labels of the existing label video data and video types of the existing label video data. Alternatively, the server may input video image information, video text information, and video/audio information of the target video data and the target video type into the tag generation model, and generate the video feature vector of the target video data using the tag generation model. By the aid of the video feature vectors of the existing label video data, the video labels and the label generation model obtained by video type training, which video feature vector corresponds to which video label and which video label corresponds to which video data of which video type can be learned.

The tag generation model can generate a plurality of video tags for the target video data according to the obtained video feature vector of the target video data and the target video type. A video tag of the target video data generated by the tag generation model may be referred to as a video generation tag. In addition, when the tag generation model generates the video generation tags, the generation probability of each video generation tag can be obtained. The server may take the video generation tags whose generation probability is greater than or equal to the generation probability threshold as the second candidate video tags. The generation probability threshold may be set according to the actual application scenario, which is not limited here.
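
The thresholding of the generation probabilities can be sketched as follows; the tags, probabilities, and threshold are illustrative.

```python
# Minimal sketch: keep only video generation tags whose generation probability
# reaches the generation probability threshold as second candidate video tags.
def second_candidate_tags(video_generation_tags, probability_threshold=0.5):
    # video_generation_tags: list of (tag, generation_probability) pairs
    return [tag for tag, prob in video_generation_tags if prob >= probability_threshold]

print(second_candidate_tags([("tag a", 0.83), ("tag b", 0.31), ("tag c", 0.56)]))
# ['tag a', 'tag c']
```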

Referring to fig. 5, fig. 5 is a schematic flowchart of a video tag obtaining method provided in the present application. The network structure in fig. 5 is the network structure of the tag generation model. S301: first, the server may input the video frame sequence of the target video data, that is, the pixel sequence corresponding to each of the plurality of image frames of the target video data, into the tag generation model. S302: the tag generation model may construct the video frame representation, that is, obtain the image feature vector corresponding to each image frame; the image feature vector is a vector representation of the image frame. The tag generation model may obtain the image feature vector of each image frame through an Inception-ResNet-v2 network (a convolutional network for feature extraction). S303: the server may perform multi-frame feature fusion on the obtained plurality of image feature vectors, that is, fuse the plurality of image feature vectors into one image fusion feature vector.

Then, S304: the server may input an audio frame sequence of the target video data, that is, a sequence of energy values corresponding to each of the plurality of audio frames of the target video data, to the tag generation model. S305: the label generation model can construct an audio frame representation, wherein the audio frame representation is constructed by obtaining an audio feature vector corresponding to each audio frame, and the audio feature vector is a vector representation of the audio frame. The label generation model may obtain the audio feature vector of each audio frame through a Vggish network (an audio feature extraction network). S306: the server may perform multi-frame feature fusion on the obtained multiple audio feature vectors, that is, perform fusion on the multiple audio feature vectors to obtain one audio fusion feature vector.

Subsequently, S307: the server can obtain the video text information of the target video data through the video title information, the video description information and the caption keywords of the target video data. S308: the server may input the video text information of the target video data to a tag generation model, which may construct a text representation of the video text information of the target video data through a transform Encoder network (a deep learning network). S309: the label generation model can obtain the text characteristic vector corresponding to the video text information by constructing the text representation of the video text information of the target video data. The text feature vector is the text feature obtained by the label generation model.

Subsequently, S310: the label generation model may input the audio fusion feature vector, the image fusion feature vector, and the text feature vector of the obtained target video data to a feature extractor, and the feature extractor may be configured by the transform Encoder network. In addition, the server may further input the target video type of the target video data into the feature extractor, and the feature extractor may extract the video feature vector of the target video data, and then the video classification model may output, through the obtained video feature vector of the target video data and the target video type, a plurality of model generation labels generated for the target video data, where the plurality of model generation labels specifically include label 1, label 2, … …, and label n. The video classification model also outputs a generation probability of each generated model generation label. The server may set, as the second candidate video tag, the video generation tag whose generation probability is greater than or equal to the generation probability threshold.

The candidate tag set may further include third candidate video tags, which may be obtained as follows. The server may obtain the associated tags of the first candidate video tag, which may be referred to as first associated tags. A first associated tag is determined by the number of co-occurrences, among the video tags of all the existing tagged video data, of the first candidate video tag and a video tag of the first candidate video data. The first candidate video data is the existing tagged video data whose video tags contain the first candidate video tag. The server may further obtain the associated tags of the second candidate video tag, which may be referred to as second associated tags. A second associated tag is determined by the number of co-occurrences, among the video tags of all the existing tagged video data, of the second candidate video tag and a video tag of the second candidate video data. The second candidate video data is the existing tagged video data whose video tags contain the second candidate video tag. The first associated tags and the second associated tags may both be regarded as third candidate video tags.

And when the third candidate video tags are obtained, obtaining the third candidate video tags from the video tags of the existing tagged video data with the video type as the target video type. Therefore, the video types of the first candidate video data and the second candidate video data are both target video types. The co-occurrence frequency of the first candidate video tag and the video tag of the first candidate video data in the video tags of all the existing tagged video data refers to the co-occurrence frequency of the first candidate video tag and the video tag of the first candidate video data in the video tags of all the existing tagged video data with the video type being the target video type. The co-occurrence frequency of the second candidate video tag and the video tag of the second candidate video data in the video tags of all the existing tagged video data is also the co-occurrence frequency of the second candidate video tag and the video tag of the second candidate video data in the video tags of all the tagged video data of which the video type is the target video type.

Specifically, the server may count the number of co-occurrences of the first candidate video tag and the video tag of the first candidate video data in the existing tag video data of which all video types are the target video type. For example, if the first candidate video tag includes video tag b1, there are 2 first candidate video data, where the video tag of one first candidate video data includes video tag b1, video tag b2 and video tag b3, and the video tag of the other first candidate video data includes video tag b1 and video tag b 2. Then, the number of co-occurrences of video tag b1 and video tag b2 is 2, and the number of co-occurrences of video tag b1 and video tag b3 is 1.

Then, the server can calculate the tag association probability between the first candidate video tag and a video tag of the first candidate video data from the number of co-occurrences of the first candidate video tag and that video tag in all the existing tagged video data. Continuing the example in the previous paragraph, suppose that, in addition to the 2 first candidate video data, there are 3 more existing tagged video data whose video type is the target video type, and that the video tags of these 3 existing tagged video data do not include video tag b1.

Then, the co-occurrence probability between video tag b1 and video tag b2 is the number of co-occurrences of video tag b1 and video tag b2 divided by the number of videos of all the existing tagged video data (the above 2 first candidate video data plus the 3 additional existing tagged video data), i.e., 2/5. The occurrence probability of video tag b1 among the video tags of all existing tagged video data of the target video type is the number of occurrences of video tag b1 divided by the number of existing tagged video data of the target video type, i.e., 2/5. The tag association probability between video tag b1 and video tag b2 is the co-occurrence probability 2/5 between video tag b1 and video tag b2 divided by the occurrence probability 2/5 of video tag b1, which is 1.

Similarly, the co-occurrence probability between video tag b1 and video tag b3 is the number of co-occurrences of video tag b1 and video tag b3 divided by the number of videos of all the existing tagged video data (the above 2 first candidate video data plus the 3 additional existing tagged video data), i.e., 1/5. The occurrence probability of video tag b1 among the video tags of all existing tagged video data of the target video type is the number of occurrences of video tag b1 divided by the number of existing tagged video data of the target video type, i.e., 2/5. The tag association probability between video tag b1 and video tag b3 is the co-occurrence probability 1/5 between video tag b1 and video tag b3 divided by the occurrence probability 2/5 of video tag b1, which is 1/2.

Through the above process, the server can obtain the tag association probability between each video tag of the first candidate video data and the first candidate video tag. The server may regard, as a first associated tag of the first candidate video tags, a video tag, of the video tags of the first candidate video data, for which a tag association probability with the first candidate video tag is greater than or equal to an association probability threshold. Similarly, the server may obtain the second associated tag of the second candidate video tag in the same manner as the first associated tag of the first candidate video tag is obtained. By the method, the first associated label of the acquired first candidate video label and the second associated label of the second candidate video label can be acquired according to the label association degree between the video labels. The first associated tag and the second associated tag may be collectively referred to as a third candidate video tag. The association probability threshold may also be set according to an actual application scenario.
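
The tag association probability above reduces to a simple ratio of counts; the sketch below reproduces the worked numbers for video tags b1, b2, and b3.

```python
# Minimal sketch: tag association probability = co-occurrence probability divided by
# the occurrence probability of the original (candidate) tag.
def tag_association_probability(co_occurrences, tag_occurrences, total_videos):
    co_occurrence_prob = co_occurrences / total_videos
    occurrence_prob = tag_occurrences / total_videos
    return co_occurrence_prob / occurrence_prob if occurrence_prob else 0.0

print(tag_association_probability(co_occurrences=2, tag_occurrences=2, total_videos=5))  # 1.0 (b1 with b2)
print(tag_association_probability(co_occurrences=1, tag_occurrences=2, total_videos=5))  # 0.5 (b1 with b3)
```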

Referring to fig. 6, fig. 6 is a table diagram illustrating a tag association probability provided in the present application. As shown in fig. 6, it is assumed that the target video type of the target video data is a video type of "movie", and it is assumed that the original tag in the table of fig. 6 is the first candidate video tag described above, and the associated tag is a video tag of the first candidate video data. And, the calculated label association probability between label b1 and label b2 is 0.937, the label association probability between label b3 and label b4 is 0.856, and the label association probability between label b5 and label b6 is 0.717. Assuming that the association probability threshold is 0.8, since both the tag association probability 0.937 between tag b1 and tag b2 and the tag association probability 0.856 between tag b3 and tag b4 are greater than 0.8, tag b2 and tag b4 may be regarded as the first associated tags.

As can be seen from the above, the candidate tag set may include the first candidate video tag, the second candidate video tag, and the third candidate video tag. The first candidate video tag, the second candidate video tag, and the third candidate video tag in the set of candidate tags may be collectively referred to as candidate video tags of the target video data. The server can obtain the tag reliability between each candidate video tag in the candidate tag set and the target video data through the corresponding generation probability or mutual information value of the candidate video tag in the candidate tag set, and further the server can obtain the target video tag of the target video data from the candidate tag set through the tag reliability between each candidate video tag and the target video data.

Specifically, assume that the candidate tag set includes a candidate video tag bl, where l is a positive integer less than or equal to the total number of candidate video tags in the candidate tag set. If candidate video tag bl belongs to the first candidate video tags but not to the second candidate video tags, the tag reliability between candidate video tag bl and the target video data can be obtained from the mutual information between candidate video tag bl and its corresponding existing video word set. Specifically, the server may obtain, from the mutual information index table, the mutual information value between candidate video tag bl and its corresponding existing video word set, obtain the word number of the existing video word set corresponding to candidate video tag bl, and also obtain a reliability adjustment parameter. For example, if the existing video word set corresponding to candidate video tag bl is 'online shopping festival promotion', the word number of that existing video word set is 7; if the existing video word set corresponding to candidate video tag bl is 'football', the word number is 2.

The server can take the product of the credibility adjustment parameter, the word count of the existing video word set corresponding to candidate video tag b_l, and the mutual information value between candidate video tag b_l and that existing video word set as the tag confidence between candidate video tag b_l and the target video data. Thus, the larger the word count of the existing video word set corresponding to candidate video tag b_l, the higher the tag confidence between candidate video tag b_l and the target video data.

The credibility adjustment parameter may be a self-set parameter within a reasonable value range. Because the word count of the existing video word set corresponding to candidate video tag b_l may be large, the tag confidence between candidate video tag b_l and the target video data could become too large; the credibility adjustment parameter is therefore used to adjust that tag confidence back into a normal range, for example to less than 10.
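The three quantities above combine multiplicatively. The sketch below is one possible reading of that computation, with a hypothetical adjustment parameter value and a hard cap of 10; both values are assumptions for illustration, not prescribed by the application.

def confidence_from_mutual_info(mutual_info_value, word_count,
                                adjustment=0.5, cap=10.0):
    # Product of the credibility adjustment parameter, the word count of the
    # matched existing video word set, and the stored mutual information
    # value; the cap keeps long word sets from inflating the confidence.
    return min(adjustment * word_count * mutual_info_value, cap)

# A 7-word set ("online shopping festival promotion") vs. a 2-word set ("football").
print(confidence_from_mutual_info(mutual_info_value=1.2, word_count=7))  # 4.2
print(confidence_from_mutual_info(mutual_info_value=1.2, word_count=2))  # 1.2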

If candidate video tag b_l belongs to the second candidate video tag but not to the first candidate video tag, the tag confidence between candidate video tag b_l and the target video data can be the generation probability corresponding to candidate video tag b_l obtained above.

If candidate video tag b_l belongs to both the first candidate video tag and the second candidate video tag, the server may obtain a first tag configuration weight corresponding to the first candidate video tag and a second tag configuration weight corresponding to the second candidate video tag.

The server can weight, through the first tag configuration weight, the tag confidence that candidate video tag b_l would have if it belonged only to the first candidate video tag, obtaining one weighted value, and can weight, through the second tag configuration weight, the generation probability corresponding to candidate video tag b_l, obtaining another weighted value. The server can sum the two weighted values to obtain the tag confidence between candidate video tag b_l and the target video data.
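A minimal sketch of this weighted combination follows; the weight values are illustrative, since the actual first and second tag configuration weights are set per application scenario.

def fuse_both_sources(mi_confidence, generation_probability,
                      first_tag_weight=0.6, second_tag_weight=0.4):
    # Weighted sum of the index-table confidence and the generation
    # probability for a tag recalled by both channels.
    return (first_tag_weight * mi_confidence
            + second_tag_weight * generation_probability)

print(fuse_both_sources(mi_confidence=3.2, generation_probability=0.85))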

In addition, assume that the candidate tag set further includes a candidate video tag b_j, where j is a positive integer less than or equal to the total number of candidate video tags in the candidate tag set. If candidate video tag b_j is the first associated tag of the above candidate video tag b_l, the server may take the product of the tag association degree between candidate video tag b_j and candidate video tag b_l (which may be referred to as the first tag association degree) and the tag confidence that candidate video tag b_l has when it belongs only to the first candidate video tag as the tag confidence between candidate video tag b_j and the target video data. If candidate video tag b_j is the second associated tag of candidate video tag b_l, the server may take the product of the tag association degree between candidate video tag b_j and candidate video tag b_l (which may be referred to as the second tag association degree) and the generation probability corresponding to candidate video tag b_l as the tag confidence between candidate video tag b_j and the target video data.

The tag association degree between candidate video tag b_j and candidate video tag b_l (the first tag association degree or the second tag association degree) is the tag association probability between candidate video tag b_j and candidate video tag b_l. The above case applies when only one such candidate video tag b_j exists in the candidate tag set. If several candidate video tags b_j exist in the candidate tag set, see the description of fig. 8 below for the process of calculating the tag confidence of candidate video tag b_j.
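For illustration, the associated-tag case reduces to a single product; the numbers below are placeholders, not values from the application.

def associated_tag_confidence(association_degree, source_tag_confidence):
    # Product of the tag association degree and the confidence (or
    # generation probability) of the tag it was recalled from.
    return association_degree * source_tag_confidence

# b_j recalled as the first associated tag of b_l (index-table channel):
print(associated_tag_confidence(0.937, source_tag_confidence=3.2))
# b_j recalled as the second associated tag of b_l (generation-model channel):
print(associated_tag_confidence(0.856, source_tag_confidence=0.85))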

Through the process, the label credibility of each candidate video label in the candidate label set can be obtained. Then, the server can also obtain the model reliability of each candidate video tag, and the server can obtain the screening tag reliability finally corresponding to each candidate video tag through the tag reliability and the model reliability respectively corresponding to each candidate video tag, so that the server can obtain the target video tag of the target video data from the candidate tag set through the screening tag reliability of each candidate video tag.

Specifically, the server may input each candidate video tag in the candidate tag set and the video feature vector of the target video data into the confidence level determination model. The credibility determination model is obtained by training video feature vectors of a large amount of existing label video data and video labels of the existing label video data. The credibility determination model obtained by training the video feature vector of the existing label video data and the video label of the existing label video data can learn which video feature vector is more relevant to which video label. The more the video feature vector is correlated with the video tag, the higher the model confidence between the video feature vector and the corresponding video tag obtained by the confidence determination model. The credibility determination model can also obtain the video feature vectors of the video data, so that the server can also input the video image information, the video audio information and the video text information of the target video data into the credibility determination model, and the credibility determination model obtains the video feature vectors of the target video data according to the video image information, the video audio information and the video text information of the target video data. Then, the reliability determination model may correspondingly output the model reliability between the target video data and each candidate video tag according to the obtained video feature vector of the target video data.

The server may obtain a second confidence configuration weight for the tag confidence and a first confidence configuration weight for the model confidence. The first confidence configuration weight and the second confidence configuration weight may be self-set parameters within a reasonable range; for example, the first confidence configuration weight may be 0.7 and the second confidence configuration weight may be 0.3. The server can weight the model confidence of each candidate video tag through the first confidence configuration weight to obtain one weighted value for each candidate video tag, and weight the tag confidence of each candidate video tag through the second confidence configuration weight to obtain another weighted value for each candidate video tag. The server may sum the two weighted values for each candidate video tag to obtain the screening tag reliability corresponding to that candidate video tag. For example, if the tag confidence of candidate video tag b is x1, the model confidence is x2, the first confidence configuration weight is y1, and the second confidence configuration weight is y2, then the screening tag reliability of candidate video tag b is x1 * y2 + x2 * y1.

Through the above process, the server can obtain the screening tag reliability of each candidate video tag, and can take the candidate video tags in the candidate tag set whose screening tag reliability is greater than or equal to the screening reliability threshold as the target video tags of the target video data. The screening reliability threshold may be set according to the actual application scenario, which is not limited herein. The target video tags obtained here by the server are the video tags finally generated for the target video data.
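Putting the screening step together, the sketch below combines model confidence and tag confidence with the two configuration weights and keeps the tags at or above the threshold; the weights, the threshold, and the sample numbers are assumptions for illustration only.

FIRST_CONF_WEIGHT = 0.7    # applied to the model confidence
SECOND_CONF_WEIGHT = 0.3   # applied to the tag confidence
SCREENING_THRESHOLD = 0.6

def screen_tags(candidates):
    # candidates: {tag: (tag_confidence, model_confidence)}.
    selected = []
    for tag, (tag_conf, model_conf) in candidates.items():
        screening_conf = (FIRST_CONF_WEIGHT * model_conf
                          + SECOND_CONF_WEIGHT * tag_conf)
        if screening_conf >= SCREENING_THRESHOLD:
            selected.append((tag, screening_conf))
    # highest screening reliability first
    return sorted(selected, key=lambda item: item[1], reverse=True)

print(screen_tags({"travel": (0.9, 0.8), "sports": (0.2, 0.3)}))
# travel: 0.7*0.8 + 0.3*0.9 = 0.83 (kept); sports: 0.27 (dropped)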

The server can also send the obtained target video tag to the client, so that the client can display the target video data and the target video tag in association for the user to view.

Referring to fig. 7, fig. 7 is a schematic flow chart of a model confidence determination method provided in the present application. The network structure in fig. 7 is the network structure of the credibility determination model. S401: first, the server may input the video frame sequence of the target video data, that is, the pixel sequence corresponding to each of a plurality of image frames of the target video data, to the credibility determination model. S402: the credibility determination model can construct a video frame representation, that is, obtain an image feature vector corresponding to each image frame, where the image feature vector is the vector representation of the image frame. The credibility determination model can obtain the image feature vector of each image frame through an Inception-ResNet-v2 network (a convolutional network for feature extraction). S403: the server may perform multi-frame feature fusion on the obtained image feature vectors, that is, fuse the plurality of image feature vectors into one image fusion feature vector.

Subsequently, S404: the server may input the audio frame sequence of the target video data, that is, the sequence of energy values corresponding to each of a plurality of audio frames of the target video data, into the credibility determination model. S405: the credibility determination model may construct an audio frame representation, that is, obtain an audio feature vector corresponding to each audio frame, where the audio feature vector is the vector representation of the audio frame. The credibility determination model may obtain the audio feature vector of each audio frame through a VGGish network (an audio feature extraction network). S406: the server may perform multi-frame feature fusion on the obtained audio feature vectors, that is, fuse the plurality of audio feature vectors into one audio fusion feature vector.

Subsequently, S407: the server can obtain the video text information of the target video data through the video title information, the video description information and the caption keywords of the target video data. S408: the server may input the video text information of the target video data to a credibility determination model, which may construct a textual representation of the video text information of the target video data via a self-attention mechanism network (a natural language processing network). S409: the credibility determination model can obtain the text characteristic vector corresponding to the video text information by constructing the text representation of the video text information of the target video data. The text feature vector is the text feature obtained by the credibility determination model.

Then, S410: the credibility determination model can perform vector splicing on the audio fusion characteristic vector, the image fusion characteristic vector and the text characteristic vector of the acquired target video data, so as to obtain the video characteristic vector of the target video data. The video feature vector of the target video data is the video multi-modal feature fusion representation of the target video data.
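A hedged sketch of this multi-modal fusion follows. Mean pooling stands in here as one simple choice of multi-frame fusion, and the feature dimensions are only indicative of typical Inception-ResNet/VGGish outputs; none of these choices are values fixed by the application.

import numpy as np

def fuse_video_features(image_frame_vectors, audio_frame_vectors, text_vector):
    image_fused = np.mean(image_frame_vectors, axis=0)   # multi-frame fusion
    audio_fused = np.mean(audio_frame_vectors, axis=0)
    # vector splicing of the three modalities
    return np.concatenate([image_fused, audio_fused, text_vector])

video_vector = fuse_video_features(
    image_frame_vectors=np.random.rand(30, 1536),  # e.g. 30 image-frame features
    audio_frame_vectors=np.random.rand(30, 128),   # e.g. 30 audio-frame features
    text_vector=np.random.rand(256),
)
print(video_vector.shape)  # (1920,)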

Subsequently, S411: the server may further input all candidate video tags of the target video data (i.e., the candidate video tags in the candidate tag set, input without repetition, and referred to here as tag 1, ..., tag n) into the credibility determination model, which may construct a text representation of each candidate video tag through a self-attention mechanism network (a natural language processing network), i.e., represent each candidate video tag in a machine-readable form. S412: the credibility determination model may obtain a tag representation of each candidate video tag by constructing the text representation of each candidate video tag, where the tag representation may be an identifier or a vector.

Subsequently, S413: the credibility determination model can perform feature interaction recognition between the tag representation of each candidate video tag and the video feature vector of the target video data, that is, identify the correlation between the two; this correlation is the model credibility. S414: the credibility determination model may output the model credibility between each candidate video tag and the target video data.
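The feature interaction step can be pictured as a similarity score squashed into (0, 1); a dot product followed by a sigmoid is used below purely as an illustrative stand-in for the model's actual interaction layer, which is not detailed here.

import numpy as np

def model_confidences(tag_vectors, video_vector):
    scores = tag_vectors @ video_vector        # one interaction score per tag
    return 1.0 / (1.0 + np.exp(-scores))       # squash to (0, 1)

tag_representations = np.random.rand(5, 64)    # 5 candidate tag representations
video_representation = np.random.rand(64)
print(model_confidences(tag_representations, video_representation))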

Please refer to fig. 8, fig. 8 is a scene schematic diagram of a tag obtaining method provided in the present application. As shown in fig. 8, the tag 100e is obtained through the mutual information index table, that is, the tag 100e may be the first candidate video tag. The tag 102e is obtained by the tag generation model, in other words, the tag 102e is the second candidate video tag.

The first associated tag 103e is the associated tag obtained for the tag 100e. The second associated tag 105e is the associated tag obtained for the tag 102e.

The tag 100e and the associated tag 103e may be merged to obtain a tag 110e; the tag 102e and the associated tag 105e may be merged to obtain a tag 112e.

The same tag may exist in both the tag 110e and the tag 112e; for example, both the tag 110e and the tag 112e include the tag b1. In this case, a first tag configuration weight for the tag 110e and a second tag configuration weight for the tag 112e need to be obtained, where the first tag configuration weight for the tag 110e is z1 and the second tag configuration weight for the tag 112e is z2.

If the tag b1 exists only in the tag 100e among the tags 110e, the tag confidence of the tag b1 is the tag confidence calculated from the mutual information value corresponding to the tag b1. If the tag b1 exists only in the first associated tag 103e of the tags 110e and is the associated tag of the tag b2, the tag confidence of the tag b1 is the product of the first tag association degree between the tag b1 and the tag b2 and the tag confidence of the tag b2 when the tag b2 belongs only to the tag 100e.

If the tag b1 exists only in the tag 102e of the tags 112e, the tag confidence of the tag b1 is the generation probability of the tag b1. If the tag b1 exists only in the second associated tag 105e of the tags 112e and is the associated tag of the tag b2, the tag confidence of the tag b1 is the product of the second tag association degree between the tag b1 and the tag b2 and the generation probability of the tag b2.

If the tag b1 exists in both the tag 110e and the tag 112e, the tag confidence of the tag b1 is the tag confidence that the tag b1 has when it exists only in the tag 110e multiplied by z1, plus the tag confidence that the tag b1 has when it exists only in the tag 112e multiplied by z2.

Through the above process, the tag confidence of each tag among the tag 110e, the tag 111e, and the tag 112e can be obtained, and the fused tags 106e can be obtained. The fused tags 106e include every tag in the tag 110e, the tag 111e, and the tag 112e without repetition, and each tag in the fused tags 106e corresponds to one tag confidence. The fused tags 106e correspond to the candidate tag set. It can be understood that a candidate video tag may be recorded repeatedly in the candidate tag set because it is recalled through different acquisition methods, and each acquisition method has its own way of calculating tag confidence; however, the repeated entries ultimately correspond to one and the same tag confidence, so they are in effect a single candidate video tag. Calculating that tag confidence through the acquisition methods corresponding to the repeated entries is equivalent to fusing the repeated candidate video tags.
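The fusion of duplicated candidates can be sketched as follows; the group names mirror the tags 110e and 112e in fig. 8, and the weights z1 and z2 as well as all numeric values are illustrative assumptions.

Z1, Z2 = 0.6, 0.4  # configuration weights for the two groups

def fuse_candidates(group_110e, group_112e):
    # Each group maps tag -> the confidence computed for that channel;
    # a tag present in both groups gets the weighted sum of the two.
    fused = {}
    for tag in set(group_110e) | set(group_112e):
        if tag in group_110e and tag in group_112e:
            fused[tag] = Z1 * group_110e[tag] + Z2 * group_112e[tag]
        elif tag in group_110e:
            fused[tag] = group_110e[tag]
        else:
            fused[tag] = group_112e[tag]
    return fused

print(fuse_candidates({"b1": 3.0, "b2": 2.1}, {"b1": 0.9, "b3": 0.7}))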

The server may input each tag in the fused tags 106e into the credibility determination model 107e, and obtain, through the credibility determination model, the model credibility between each tag in the fused tags 106e and the target video data, that is, the model credibility 108e. Next, the server may obtain the screening tag reliability corresponding to each tag in the fused tags 106e according to the model credibility and the tag confidence corresponding to each tag in the fused tags 106e. Further, the server may take the tags in the fused tags 106e whose screening tag reliability is greater than or equal to the screening reliability threshold as the target video tags 109e.

Referring to fig. 9a, fig. 9a is a schematic page view of a terminal device provided in the present application. The terminal device may respond to a click operation of the user on the control 104f in the terminal page 100f, acquire a video imported by the user to the terminal device, and display the video in the terminal page 101f. As shown in the terminal page 101f, the terminal device has acquired the video imported by the user. The terminal device may display the terminal page 102f in response to the user selecting the "automatically tag video" control 105f and clicking the "confirm upload" control 106f. In this process, since the user selects the control 105f, which indicates that the user wants the system to automatically add tags to the uploaded video, the terminal device may take the video uploaded by the user as the target video data and send it to the server.

After acquiring the target video data sent by the terminal device, the server may generate the corresponding target video tags for the target video data by the method described in the embodiment of fig. 3. After acquiring the target video tags of the target video data, the server may send them to the terminal device. After the terminal device acquires the target video tags, it can jump from the terminal page 102f to the terminal page 103f for display. In the terminal page 103f, the terminal device may display the video uploaded by the user in association with the acquired target video tags. As shown in the terminal page 103f, the target video tags 107f acquired by the terminal device include the tag "make up", the tag "good share", the tag "travel", and the tag "food".

Referring to fig. 9b, fig. 9b is a schematic page diagram of a terminal device provided in the present application. As shown in fig. 9b, a plurality of video data are displayed in the terminal page 100g, specifically including video data 101g, video data 102g, video data 103g, and video data 104g. The video tags corresponding to each video data are displayed below that video data in the terminal page 100g, and the video tags corresponding to each video data may be obtained by the method described in the embodiment corresponding to fig. 3. As shown in the terminal page 100g, the video tag "make a fun", the video tag "good share", the video tag "travel", and the video tag "food" corresponding to the video data 101g are displayed below it. Below the video data 102g are displayed its corresponding video tag "sports", video tag "basketball", and video tag "games". The video tag "clothing", the video tag "wearing and putting on", the video tag "visiting shop", and the video tag "make-up" corresponding to the video data 103g are displayed below it. The video tag "eat and broadcast", the video tag "food", and the video tag "abogwang" corresponding to the video data 104g are displayed below it.

The method and the device can acquire the target video data and determine the target video type of the target video data; acquiring a mutual information index table; the mutual information index table is created based on mutual information between the existing video word set of the at least two existing label video data and the video labels of the at least two existing label video data; acquiring a target video word set of target video data, and acquiring a first candidate video tag of the target video data in a mutual information index table according to the target video word set and the target video type; the first candidate video label is a video label of existing label video data with a target video type; and adding the first candidate video tag to a candidate tag set, and determining a target video tag of the target video data from the candidate tag set according to mutual information between the first candidate video tag and the corresponding existing video word set. Therefore, the method provided by the application can obtain the first candidate video tag aiming at the target video data through the mutual information index table established by the existing tag video data, and further can obtain the target video tag aiming at the target video data through the first candidate video tag, so that the acquisition efficiency aiming at the target video tag is improved. Moreover, the first candidate video tags can be multiple and various, so that the tag types of the target video tags are enriched.

Referring to fig. 10, fig. 10 is a schematic flow chart of a tag obtaining method provided in the present application. As shown in fig. 10, the method includes: step S501: the server may obtain the video to be tagged and identified, i.e., the target video data. Step S502: the server can perform video classification recognition on the video to be identified, that is, identify the video type of the target video data. Step S503: the server may recall (i.e., retrieve) candidate video tags for the target video data through the mutual information index table. Step S504: the server may recall candidate video tags of the target video data through the tag generation model (i.e., the generation model herein). Step S505: the server may recall the associated tags of the candidate video tags acquired in step S503 and step S504, that is, acquire the associated tags (which may include the first associated tag and the second associated tag) of those candidate video tags, and may also take the associated tags as candidate video tags of the target video data.

Subsequently, step S506: the server may perform multi-channel fusion on the candidate video tags obtained in step S503, step S504, and step S505, that is, calculate the tag confidence corresponding to each candidate video tag; since repeated candidate video tags may exist, the process of calculating a common tag confidence for the repeated candidate video tags may be referred to as tag fusion, which can be understood as deduplicating the candidate video tags. Step S507: the model credibility of each candidate video tag can be obtained through the credibility determination model, the screening tag reliability corresponding to each candidate video tag is calculated from the model credibility and the tag confidence of each candidate video tag, and the candidate video tags are sorted by screening tag reliability, that is, sorted by video-tag correlation. Step S508: after the video-tag correlation sorting, the top s candidate video tags can be taken as the target video tags of the target video data; these target video tags are the final video tag result of the target video data. The specific value of s may be set according to the actual application scenario.
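The whole fig. 10 pipeline can be summarized in the runnable skeleton below; every helper is a trivial stub standing in for one of steps S502 to S507, and all names, signatures, dummy values, and the choice s = 5 are assumptions for illustration only.

def classify_video(video):                      # S502: video classification
    return "movie"

def recall_from_index(video, video_type):       # S503: mutual information index recall
    return {"tag_a": 2.5}                       # tag -> tag confidence

def recall_from_generator(video):               # S504: tag generation model recall
    return {"tag_b": 0.9}                       # tag -> generation probability

def recall_associated(candidates):              # S505: associated-tag recall
    return {"tag_c": 0.8 * 2.5}                 # association degree * source confidence

def model_confidence(tag, video):               # S507: credibility determination model
    return 0.7

def tag_video(video, s=5, w_model=0.7, w_tag=0.3, threshold=0.5):
    video_type = classify_video(video)
    candidates = {}
    candidates.update(recall_from_index(video, video_type))
    candidates.update(recall_from_generator(video))
    candidates.update(recall_associated(candidates))           # S506: fusion
    scored = {t: w_model * model_confidence(t, video) + w_tag * c
              for t, c in candidates.items()}
    ranked = sorted((t for t, sc in scored.items() if sc >= threshold),
                    key=lambda t: scored[t], reverse=True)
    return ranked[:s]                                           # S508: top-s tags

print(tag_video("target_video_data"))           # ['tag_a', 'tag_c', 'tag_b']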

Referring to fig. 11, fig. 11 is a schematic structural diagram of a tag data processing apparatus provided in the present application. As shown in fig. 11, the tag data processing apparatus 2 may include: a video acquiring module 21, an index table obtaining module 22, a candidate tag obtaining module 23, and a target tag determining module 24;

the video acquiring module 21 is configured to acquire target video data and determine a target video type of the target video data;

an index table obtaining module 22, configured to obtain a mutual information index table; the mutual information index table is created based on mutual information between the existing video word set of the at least two existing label video data and the video labels of the at least two existing label video data;

the candidate tag obtaining module 23 is configured to obtain a target video word set of the target video data, and obtain a first candidate video tag of the target video data in the mutual information index table according to the target video word set and the target video type; the first candidate video label is a video label of existing label video data with a target video type;

and the target label determining module 24 is configured to add the first candidate video label to the candidate label set, and determine a target video label of the target video data from the candidate label set according to mutual information between the first candidate video label and the existing video word set corresponding to the first candidate video label.

For specific functional implementation manners of the video obtaining module 21, the index table obtaining module 22, the candidate tag obtaining module 23, and the target tag determining module 24, please refer to steps S101 to S104 in the embodiment corresponding to fig. 3, which is not described herein again.

It can be understood that the tag data processing apparatus 2 in this embodiment of the application can perform the description of the tag data processing method in the embodiment corresponding to fig. 3, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.

Referring to fig. 12, fig. 12 is a schematic structural diagram of a tag data processing apparatus provided in the present application. As shown in fig. 12, the tag data processing apparatus 1 may include: a video acquisition module 101, an index table acquisition module 102, a candidate tag acquisition module 103, and a target tag determination module 104;

the video acquisition module 101 has the same function as the video acquisition module 21 in fig. 11, the index table acquisition module 102 has the same function as the index table acquisition module 22 in fig. 11, the candidate tag acquisition module 103 has the same function as the candidate tag acquisition module 23 in fig. 11, and the target tag determination module 104 has the same function as the target tag determination module 24 in fig. 11.

The candidate tag obtaining module 103 includes: an information acquisition unit 1031, a word segmentation unit 1032, a word determination unit 1033, and a word combination unit 1034;

an information obtaining unit 1031 for obtaining video title information, video description information, and video subtitle information of the target video data;

a word segmentation unit 1032, configured to perform word segmentation on the video title information, the video description information, and the video subtitle information, respectively, to obtain a title word in the video title information, a description word in the video description information, and a subtitle word in the video subtitle information;

a word determining unit 1033 configured to determine a title word, a description word, and a subtitle word as a target video word of the target video data;

a word combination unit 1034, configured to combine target video words of the target video data according to the number of the combined words to obtain a target video word set; the number of words of the target video words in a set of target video words is no greater than the number of combined words.

For specific functional implementation manners of the information obtaining unit 1031, the word segmentation unit 1032, the word determination unit 1033, and the word combination unit 1034, please refer to step S103 in the embodiment corresponding to fig. 3, which is not described herein again.
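As an aside on the word combination unit 1034, the sketch below shows one plausible reading of combining segmented words into word sets whose size does not exceed the number of combined words, using contiguous n-grams; the contiguous-n-gram choice is an assumption, not a detail stated by the application.

def combine_words(words, max_combined=3):
    # Combine segmented target video words into word sets whose size does
    # not exceed the number of combined words (contiguous n-grams here).
    word_sets = []
    for n in range(1, max_combined + 1):
        for i in range(len(words) - n + 1):
            word_sets.append(tuple(words[i:i + n]))
    return word_sets

print(combine_words(["world", "cup", "final"], max_combined=2))
# [('world',), ('cup',), ('final',), ('world', 'cup'), ('cup', 'final')]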

The mutual information index table comprises a mapping relation between an existing video word set of at least two existing label video data and video labels of the at least two existing label video data, and the mapping relation also carries video type information of the existing label video data to which the contained video labels belong; the video type information includes target video type information pointing to a target video type;

the candidate tag obtaining module 103 includes: a target word determination unit 1035, a target relation determination unit 1036, and a candidate tag determination unit 1037;

a target word determining unit 1035, configured to determine an existing video word set that is the same as the target video word set in the mutual information index table as the target word set;

a target relation determining unit 1036, configured to determine, as a target mapping relation, a mapping relation that carries target video type information and includes a target word set in the mutual information index table;

a candidate tag determining unit 1037, configured to determine the video tag included in the target mapping relationship as the first candidate video tag.

For specific functional implementation manners of the target word determining unit 1035, the target relation determining unit 1036, and the candidate tag determining unit 1037, please refer to step S103 in the corresponding embodiment of fig. 3, which is not described herein again.
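One way to picture the lookup performed by the units above is sketched below; the flat entry structure (word_set, video_type, tag, mutual_info) is an assumed representation of the mutual information index table, not its prescribed form.

def first_candidate_tags(index_entries, target_word_sets, target_video_type):
    # Keep tags from entries that carry the target video type and whose
    # existing video word set matches one of the target video word sets.
    candidates = {}
    for entry in index_entries:
        if (entry["video_type"] == target_video_type
                and entry["word_set"] in target_word_sets):
            candidates[entry["tag"]] = entry["mutual_info"]
    return candidates

index = [
    {"word_set": ("world", "cup"), "video_type": "sports",
     "tag": "football", "mutual_info": 2.1},
    {"word_set": ("world", "cup"), "video_type": "movie",
     "tag": "documentary", "mutual_info": 1.4},
]
print(first_candidate_tags(index, {("world", "cup")}, "sports"))  # {'football': 2.1}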

Wherein, the tag data processing apparatus 1 further comprises: the system comprises a word combination module 105, a relation establishing module 106 and an index table generating module 107;

the word combination module 105 is configured to perform word combination on existing video words of each existing tag video data according to the number of combined words, so as to obtain an existing video word set corresponding to each existing tag video data; the word number of the existing video words in an existing video word set is not more than the number of the combined words;

the relationship establishing module 106 is configured to establish a mapping relationship between each existing video word set and the video tag of the existing tag video data to which the existing video word set belongs;

and an index table generating module 107, configured to generate a mutual information index table according to a mapping relationship between each existing video word set and the corresponding video tag.

For a specific implementation manner of functions of the word combination module 105, the relationship establishing module 106, and the index table generating module 107, please refer to step S102 in the embodiment corresponding to fig. 3, which is not described herein again.

The index table generating module 107 includes: a mutual information value acquisition unit 1071, a retention relationship determination unit 1072, an information addition unit 1073, and an index table generation unit 1074;

a mutual information value obtaining unit 1071, configured to obtain, for each mapping relationship, the mutual information value between the existing video word set and the video tag included in that mapping relationship according to the number of existing tagged video data to which both the existing video word set and the video tag belong;

a retained relationship determining unit 1072, configured to determine a mapping relationship where the mutual information value is greater than or equal to the mutual information threshold as a retained mapping relationship;

an information adding unit 1073, configured to add video type information to the reserved mapping relationship according to the video type of the existing tagged video data to which the video tag included in the reserved mapping relationship belongs;

the index table generating unit 1074 is configured to generate a mutual information index table according to the reserved mapping relationship and the video type information carried by the reserved mapping relationship.

For specific functional implementation manners of the mutual information value obtaining unit 1071, the retention relation determining unit 1072, the information adding unit 1073, and the index table generating unit 1074, please refer to step S102 in the corresponding embodiment of fig. 3, which is not described herein again.
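The mutual information value itself can be illustrated with a PMI-style score over co-occurrence counts; since the exact formula is not spelled out here, the form below is an assumption for illustration only.

import math

def mutual_information(count_both, count_word_set, count_tag, total_videos):
    # PMI-style score from co-occurrence counts over the existing tagged videos.
    p_both = count_both / total_videos
    p_word_set = count_word_set / total_videos
    p_tag = count_tag / total_videos
    return math.log(p_both / (p_word_set * p_tag))

# Word set and tag co-occur in 80 of 1000 videos; each appears in 100 videos.
print(round(mutual_information(80, 100, 100, 1000), 3))  # 2.079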

Wherein the set of candidate tags further comprises a second candidate video tag;

the tag data processing apparatus 1 further includes: the system comprises a vector acquisition module 108, a vector input module 109, a label generation module 110 and a first candidate label determination module 111;

a vector obtaining module 108, configured to obtain a video feature vector of target video data;

a vector input module 109, configured to input a video feature vector of target video data into a tag generation model; the label generation model is obtained based on video feature vectors of at least two existing label video data and video label training of at least two existing label video data;

the tag generation module 110 is configured to generate at least two video generation tags of the target video data based on the tag generation model and the video feature vector of the target video data, and obtain a generation probability of each video generation tag;

the first candidate tag determining module 111 is configured to determine, as a second candidate video tag, a video generation tag of the at least two video generation tags, where a generation probability is greater than or equal to a generation probability threshold.

For specific functional implementation manners of the vector obtaining module 108, the vector input module 109, the tag generating module 110, and the first candidate tag determining module 111, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.
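A minimal sketch of the generation-probability filtering performed by the first candidate tag determining module 111 follows; the threshold and the sample probabilities are illustrative assumptions.

GENERATION_PROBABILITY_THRESHOLD = 0.5

def second_candidate_tags(generated):
    # generated: {tag: generation probability} emitted by the tag generation model.
    return {tag: p for tag, p in generated.items()
            if p >= GENERATION_PROBABILITY_THRESHOLD}

print(second_candidate_tags({"basketball": 0.82, "cooking": 0.31}))  # {'basketball': 0.82}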

The candidate label set further comprises a third candidate video label;

the tag data processing apparatus 1 further includes: an associated tag obtaining module 112 and a second candidate tag determining module 113;

an associated tag obtaining module 112, configured to obtain a first associated tag of the first candidate video tag, and obtain a second associated tag of the second candidate video tag; the first association tag is determined based on the co-occurrence frequency of the first candidate video tag and the video tag of the first candidate video data in the video tags of at least two existing tag video data; the first candidate video data is the existing label video data containing the first candidate video label; the second associated tag is determined based on the co-occurrence frequency of the second candidate video tag and the video tag of the second candidate video data in the video tags of at least two existing tag video data; the second candidate video data is the existing label video data containing the second candidate video label;

a second candidate tag determining module 113, configured to determine the first associated tag and the second associated tag as a third candidate video tag.

For a specific function implementation manner of the associated tag obtaining module 112 and the second candidate tag determining module 113, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.

The target tag determination module 104 includes: a set tag determination unit 1041, a credibility acquisition unit 1042, and a target tag acquisition unit 1043;

a set tag determining unit 1041, configured to determine a first candidate video tag, a second candidate video tag, and a third candidate video tag in the candidate tag set as candidate video tags;

the reliability obtaining unit 1042 is configured to obtain tag reliability between each candidate video tag and target video data according to mutual information between the first candidate video tag and the corresponding existing video word set and a generation probability corresponding to the second candidate video tag;

and an object tag obtaining unit 1043, configured to determine, according to tag reliability between each candidate video tag and the object video data, an object video tag from the candidate tag set.

For specific functional implementation manners of the set tag determining unit 1041, the reliability obtaining unit 1042, and the target tag obtaining unit 1043, please refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.

The candidate tag set comprises a candidate video tag b_l, and l is a positive integer less than or equal to the total number of candidate video tags in the candidate tag set;

the reliability acquisition unit 1042 includes: a first certainty-determining sub-unit 10421, a second certainty-determining sub-unit 10422, a label weight obtaining sub-unit 10423, and a third certainty-determining sub-unit 10424;

a first confidence determination subunit 10421, configured to, if the candidate video tag b_l belongs to the first candidate video tag and does not belong to the second candidate video tag, determine the tag confidence between the candidate video tag b_l and the target video data according to the mutual information between the candidate video tag b_l and the corresponding existing video word set;

a second confidence determination subunit 10422, configured to, if the candidate video tag b_l belongs to the second candidate video tag and does not belong to the first candidate video tag, determine the generation probability corresponding to the candidate video tag b_l as the tag confidence between the candidate video tag b_l and the target video data;

a label weight obtaining subunit 10423, configured to, if the candidate video tag b_l belongs to both the first candidate video tag and the second candidate video tag, acquire a first tag configuration weight corresponding to the first candidate video tag and a second tag configuration weight corresponding to the second candidate video tag;

a third confidence determination subunit 10424, configured to determine the tag confidence between the candidate video tag b_l and the target video data according to the first tag configuration weight, the second tag configuration weight, the mutual information between the candidate video tag b_l and the corresponding existing video word set, and the generation probability corresponding to the candidate video tag b_l.

For a specific function implementation manner of the first confidence determining subunit 10421, the second confidence determining subunit 10422, the label weight obtaining subunit 10423, and the third confidence determining subunit 10424, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.

The mutual information index table further includes the mutual information value between the candidate video tag b_l and the corresponding existing video word set; the mutual information value between the candidate video tag b_l and the corresponding existing video word set is determined according to the number of existing tagged video data to which both the candidate video tag b_l and the corresponding existing video word set belong;

The first confidence determination subunit 10421 includes: a mutual information value acquisition subunit 104211, a word number acquisition subunit 104212, and a credibility calculation subunit 104213;

a mutual information value obtaining subunit 104211, configured to obtain, from the mutual information index table, the mutual information value between the candidate video tag b_l and the corresponding existing video word set;

a word number obtaining subunit 104212, configured to obtain the word count of the existing video word set corresponding to the candidate video tag b_l;

a credibility calculation subunit 104213, configured to determine the tag confidence of the candidate video tag b_l according to the credibility adjustment parameter and the mutual information value and word count corresponding to the candidate video tag b_l.

For specific functional implementation manners of the mutual information value obtaining subunit 104211, the word number obtaining subunit 104212, and the credibility calculation subunit 104213, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.

The candidate tag set further comprises a candidate video tag b_j, and j is a positive integer less than or equal to the total number of candidate video tags in the candidate tag set;

the tag data processing apparatus 1 further includes: a first association degree obtaining module 114, a first credibility determining module 115, a second association degree obtaining module 116, and a second credibility determining module 117;

a first association degree obtaining module 114, configured to, if the candidate video tag b_j is the first associated tag of the candidate video tag b_l, obtain the first tag association degree between the candidate video tag b_j and the candidate video tag b_l; the first tag association degree is determined based on the number of co-occurrences of the candidate video tag b_j and the candidate video tag b_l in the video tags of the at least two existing tagged video data;

a first credibility determining module 115, configured to determine the tag confidence between the candidate video tag b_j and the target video data according to the first tag association degree and the mutual information between the candidate video tag b_l and the corresponding existing video word set;

a second association degree obtaining module 116, configured to, if the candidate video tag b_j is the second associated tag of the candidate video tag b_l, obtain the second tag association degree between the candidate video tag b_j and the candidate video tag b_l; the second tag association degree is determined based on the number of co-occurrences of the candidate video tag b_j and the candidate video tag b_l in the video tags of the at least two existing tagged video data;

a second credibility determining module 117, configured to determine the tag confidence between the candidate video tag b_j and the target video data according to the second tag association degree and the generation probability corresponding to the candidate video tag b_l.

For specific functional implementation manners of the first association degree obtaining module 114, the first reliability determining module 115, the second association degree obtaining module 116, and the second reliability determining module 117, please refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.

The target tag obtaining unit 1043 includes: a credibility model input subunit 10431, a model credibility output subunit 10432, a screening credibility determination subunit 10433, and a target tag determination subunit 10434;

a credibility model input subunit 10431, configured to input the video feature vectors of each candidate video tag and the target video data into a credibility determination model; the credibility determination model is obtained by training video feature vectors of at least two existing label video data and video labels of at least two existing label video data;

a model reliability output subunit 10432, configured to output the model reliability between each candidate video tag and the target video data based on the credibility determination model and the video feature vector of the target video data;

a screening reliability determining subunit 10433, configured to determine, based on a model reliability between each candidate video tag and the target video data, and a tag reliability between each candidate video tag and the target video data, a screening tag reliability between each candidate video tag and the target video data;

the target tag determining subunit 10434 is configured to determine, as a target video tag, a candidate video tag in the candidate tag set, where the screening tag reliability with respect to the target video data is greater than or equal to the screening reliability threshold.

For specific functional implementation manners of the credibility model input subunit 10431, the model credibility output subunit 10432, the screening credibility determination subunit 10433, and the target label determination subunit 10434, please refer to step S104 in the embodiment corresponding to fig. 3, which is not described herein again.

The screening reliability determination subunit 10433 includes: a credibility weight obtaining subunit 104331 and a screening reliability calculation subunit 104332;

a confidence weight obtaining subunit 104331, configured to obtain a first confidence configuration weight for model confidence, and obtain a second confidence configuration weight for label confidence;

and the screening reliability calculation subunit 104332 is configured to determine the screening tag reliability between each candidate video tag and the target video data according to the first confidence configuration weight, the second confidence configuration weight, the model reliability between each candidate video tag and the target video data, and the tag reliability between each candidate video tag and the target video data.

For a specific function implementation manner of the credibility weight obtaining subunit 104331 and the screening reliability calculation subunit 104332, please refer to step S104 in the corresponding embodiment of fig. 3, which is not described herein again.

The video obtaining module 101 includes: a text information acquisition unit 1011, a classification model input unit 1012, and a target type output unit 1013;

a text information obtaining unit 1011, configured to obtain video image information and video audio information of target video data, and obtain video text information of the target video data;

a classification model input unit 1012 for inputting video image information, video audio information, and video text information into a video classification model; the video classification model is obtained by training at least two existing label video data and video types corresponding to the at least two existing label video data;

a target type output unit 1013 for outputting a target video type of the target video data based on the video classification model.

For a specific implementation manner of functions of the text information obtaining unit 1011, the classification model input unit 1012, and the target type output unit 1013, please refer to step S101 in the corresponding embodiment of fig. 3, which is not described herein again.

The text information obtaining unit 1011 includes: a video information acquisition subunit 10111, an information word segmentation subunit 10112, and a splicing subunit 10113;

a video information acquiring subunit 10111, configured to acquire video title information, video description information, and video subtitle information of the target video data;

an information word segmentation subunit 10112, configured to segment words of the video subtitle information to obtain subtitle keywords in the video subtitle information;

and a splicing subunit 10113, configured to splice the video title information, the video description information, and the subtitle keyword to obtain video text information of the target video data.

For a specific function implementation manner of the video information obtaining subunit 10111, the information word segmentation subunit 10112, and the splicing subunit 10113, please refer to step S101 in the embodiment corresponding to fig. 3, which is not described herein again.

Wherein the video image information comprises at least two image frames of the target video data; the video audio information comprises at least two audio frames of audio data of the target video data;

the target type output unit 1013 includes: an image vector fusion sub-unit 10131, an audio vector fusion sub-unit 10132, a text vector generation sub-unit 10133, a vector splicing sub-unit 10134, and a target type output sub-unit 10135;

an image vector fusion subunit 10131, configured to generate an image feature vector of each image frame of the at least two image frames based on the video classification model, and perform feature vector fusion on the image feature vector of each image frame to obtain an image fusion feature vector;

an audio vector fusion subunit 10132, configured to generate an audio feature vector of each of the at least two audio frames based on the video classification model, and perform feature vector fusion on the audio feature vector of each audio frame to obtain an audio fusion feature vector;

a text vector generation subunit 10133, configured to generate a text feature vector of the video text information based on the video classification model;

the vector splicing subunit 10134 is configured to perform vector splicing on the image fusion feature vector, the audio fusion feature vector, and the text feature vector to obtain a video feature vector of the target video data;

and a target type output subunit 10135, configured to output the target video type of the target video data in the video classification model according to the video feature vector of the target video data.

For a specific function implementation manner of the image vector fusion subunit 10131, the audio vector fusion subunit 10132, the text vector generation subunit 10133, the vector splicing subunit 10134, and the target type output subunit 10135, please refer to step S101 in the embodiment corresponding to fig. 3, which is not described herein again.

The video obtaining module 101 is configured to:

acquiring target video data sent by a client;

the tag data processing apparatus 1 described above is further configured to:

and sending the target video label of the target video data to the client so that the client can perform correlation output on the target video data and the target video label.

The method and the device can acquire the target video data and determine the target video type of the target video data; acquiring a mutual information index table; the mutual information index table is created based on mutual information between the existing video word set of the at least two existing label video data and the video labels of the at least two existing label video data; acquiring a target video word set of target video data, and acquiring a first candidate video tag of the target video data in a mutual information index table according to the target video word set and the target video type; the first candidate video label is a video label of existing label video data with a target video type; and adding the first candidate video tag to a candidate tag set, and determining a target video tag of the target video data from the candidate tag set according to mutual information between the first candidate video tag and the corresponding existing video word set. Therefore, the device can obtain the first candidate video tag aiming at the target video data through the mutual information index table established by the existing tag video data, and further can obtain the target video tag aiming at the target video data through the first candidate video tag, so that the acquisition efficiency of the target video tag is improved. Moreover, the first candidate video tags can be multiple and various, so that the tag types of the target video tags are enriched.

Referring to fig. 13, fig. 13 is a schematic structural diagram of a computer device provided in the present application. As shown in fig. 13, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, for example at least one disk memory. The memory 1005 may optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 13, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.

In the computer device 1000 shown in fig. 13, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be configured to call the device control application stored in the memory 1005 to implement the description of the tag data processing method in the corresponding embodiment of fig. 3. It should be understood that the computer device 1000 described in this application may also perform the description of the tag data processing apparatus 2 in the embodiment corresponding to fig. 11, and may also perform the description of the tag data processing apparatus 1 in the embodiment corresponding to fig. 12, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.

Further, it is to be noted that the present application also provides a computer-readable storage medium, and the computer-readable storage medium stores the computer programs executed by the aforementioned tag data processing apparatus 1 and tag data processing apparatus 2. The computer programs include program instructions, and when the processor executes the program instructions, the description of the tag data processing method in the embodiment corresponding to fig. 3 can be performed, so details will not be described here again. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer storage medium referred to in the present application, reference is made to the description of the method embodiments of the present application.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto but rather by the claims appended hereto.
