Video quick retrieval method and system and video quick recommendation method

Document No.: 1921661    Publication date: 2021-12-03

Description: This technology, "Video quick retrieval method and system and video quick recommendation method", was designed and created by Fan Qing and Tang Darun on 2021-08-11. Abstract: The application relates to a method and a system for quickly retrieving videos and a method for quickly recommending videos, wherein the method comprises the following steps: acquiring videos by using a crawler, taking the preprocessed videos as a training data set, and training a video feature extraction model based on unsupervised contrastive learning; inputting the service-related videos into the video feature extraction model one by one, and outputting corresponding video feature vectors to form a video server retrieval library; obtaining a video to be retrieved, obtaining the corresponding feature vector of the video to be retrieved from the video feature extraction model, comparing the similarity of this feature vector with the feature vectors in the video server retrieval library, and obtaining the service-related videos whose similarity satisfies a preset condition. By obtaining the video feature vectors of the videos and comparing the vectors to produce the retrieval result, the application can effectively improve the ability to represent spatio-temporal video features, and the vector comparison greatly improves retrieval efficiency.

1. A video fast retrieval method is characterized by comprising the following steps:

a video feature extraction model training step, namely acquiring videos by using a crawler, taking the preprocessed videos as a training data set, and training a video feature extraction model based on unsupervised contrastive learning;

a video feature extraction step, namely, inputting the service-related videos into the video feature extraction model one by one, and outputting corresponding video feature vectors to form a video server search library;

and a video retrieval step, namely acquiring a video to be retrieved, acquiring a video feature vector to be retrieved corresponding to the video to be retrieved according to the video feature extraction model, comparing the similarity of the video feature vector to be retrieved with the video feature vector in the video server retrieval library, and acquiring the service-related video with the similarity meeting certain preset conditions.

2. The method for fast video retrieval according to claim 1, wherein a batch of video data randomly loaded from the training data set is input to an unsupervised video contrastive learning framework for iterative training, and the method specifically comprises the following steps:

a sample enhancement step, namely, for each video in the batch of video data, randomly extracting from the video two segments whose length is a preset duration, extracting a preset number of frame images from each of the two segments at a preset frame rate to obtain two image sequences each containing the preset number of frame images, and applying the same processing to the two image sequences to obtain two corresponding enhanced samples;

a feature embedding step, namely inputting the enhanced samples into a feature embedding network and outputting the corresponding encoded feature expressions;

a target loss step, namely calculating a contrastive loss L according to the following formula and updating the network parameters θ of the video feature extraction model:

L = Σ_i (1 - C_ii)² + 0.001 · Σ_i Σ_{j≠i} (C_ij)²

and

C_ij = ( Σ_b z1_{b,i} · z2_{b,j} ) / ( √(Σ_b (z1_{b,i})²) · √(Σ_b (z2_{b,j})²) ),

wherein z1 and z2 denote the feature expressions of the two enhanced samples, C denotes the cross-correlation matrix between z1 and z2, b indexes the sample pairs in the batch, and i and j index the dimensional components of z1 and z2 respectively.

3. The method of claim 2, further comprising a custom service data model fine-tuning step of fine-tuning the video feature extraction model that has undergone the target loss step for a predetermined number of epochs on a custom service data set.

4. The method for fast video retrieval according to claim 1, wherein the video retrieval step further compares the video feature vectors by a hash retrieval method.

5. A video quick recommendation method is characterized by comprising the following steps:

receiving video information returned by a client, retrieving a video corresponding to the video information by using the video fast retrieval method of any one of claims 1 to 4, and returning the retrieval result to the client.

6. A video fast retrieval system, which applies the video fast retrieval method of any one of the above claims 1 to 4, characterized by comprising: a terminal device, a transmission device and a server device; the terminal equipment is connected with the server equipment through the transmission equipment;

the terminal equipment is used for a user to select a trigger video and for receiving the returned retrieved videos;

the server device is used for executing the video quick retrieval method according to any one of claims 1 to 4.

7. The video quick retrieval system of claim 6, wherein the server device comprises:

the video feature extraction model training module is used for acquiring videos by using a crawler, taking the preprocessed videos as a training data set, and training a video feature extraction model based on unsupervised contrastive learning;

the video feature extraction module is used for inputting the service-related videos into the video feature extraction model one by one and outputting corresponding video feature vectors to form a video server search library;

and the video retrieval module is used for receiving video information returned by a client, acquiring a trigger video feature vector corresponding to the video information according to the video feature extraction model, comparing the similarity of the trigger video feature vector with the video feature vector in the video server retrieval library, and returning the service-related video with the similarity meeting a certain preset condition to the client.

8. The video quick retrieval system of claim 6, wherein the video feature extraction model training module further comprises:

a sample enhancement unit, which, for each video in the batch of video data, randomly extracts from the video two segments whose length is a preset duration, extracts a preset number of frame images from each of the two segments at a preset frame rate to obtain two image sequences each containing the preset number of frame images, and applies the same processing to the two image sequences to obtain two corresponding enhanced samples;

the feature embedding unit, which inputs the enhanced samples into a feature embedding network and outputs the corresponding encoded feature expressions;

the target loss unit, which calculates the contrastive loss L according to the following formula and updates the network parameters θ of the video feature extraction model:

L = Σ_i (1 - C_ii)² + 0.001 · Σ_i Σ_{j≠i} (C_ij)²

and

C_ij = ( Σ_b z1_{b,i} · z2_{b,j} ) / ( √(Σ_b (z1_{b,i})²) · √(Σ_b (z2_{b,j})²) ),

wherein z1 and z2 denote the feature expressions of the two enhanced samples, C denotes the cross-correlation matrix between z1 and z2, b indexes the sample pairs in the batch, and i and j index the dimensional components of z1 and z2 respectively.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the video fast retrieval method according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a video fast retrieval method according to any one of claims 1 to 4.

Technical Field

The application relates to the technical field of computers, in particular to a method and a system for quickly retrieving videos and a method for quickly recommending videos.

Background

In recent years, short videos have grown explosively in China, video has gradually become one of people's favorite forms of entertainment and interaction, and video interaction functions have been introduced into a large number of business applications (apps) such as enterprise WeChat and Feishu. However, most current functions are limited to simple video compression, transmission and reception, which to some extent limits the ability of video, as an interactive medium, to raise a company's potential economic value, and the demand for recommending videos to customers in real time and effectively according to their needs or preferences is growing ever stronger.

In the prior art, a feature-based fuzzy retrieval mode is adopted. The technical scheme mainly structures the unstructured video on the basis of video content analysis: the video structure is divided into scenes, shots and key frames, video key frames are extracted by segmenting the video into scenes and shots, and subsequent processing is then performed on the key frames.

However, in the prior art a single feature cannot accurately express video content, and it is difficult to extract an effective spatio-temporal feature representation of a video and to meet the high real-time requirements of business applications.

At present, no effective solution is provided for the technical problem that the representation of the spatiotemporal features of the video is difficult to effectively extract in the prior art.

Disclosure of Invention

The embodiment of the application provides a video quick retrieval method, a video quick retrieval system and a video quick recommendation method, and provides a quick retrieval technology based on video content similarity so as to at least solve the problem that the representation of the spatiotemporal characteristics of videos is difficult to effectively extract in the related technology.

In a first aspect, an embodiment of the present application provides a method for fast video retrieval, which is characterized by including the following steps:

a video feature extraction model training step, namely acquiring videos by using a crawler, taking the preprocessed videos as a training data set, and training a video feature extraction model based on unsupervised contrastive learning;

a video feature extraction step, namely, inputting the service-related videos into a video feature extraction model one by one, and outputting corresponding video feature vectors to form a video server search library;

and a video retrieval step, namely acquiring a video to be retrieved, acquiring a video characteristic vector to be retrieved corresponding to the video to be retrieved according to the video characteristic extraction model, and comparing the similarity of the video characteristic vector to be retrieved with the video characteristic vector in the video server retrieval library to acquire a service-related video with the similarity meeting certain preset conditions.

In some embodiments, a batch of video data is randomly loaded from the training data set and input into an unsupervised video contrastive learning framework for iterative training, which specifically includes the following steps:

a sample enhancement step, namely, for each video in the batch of video data, randomly extracting from the video two segments whose length is a preset duration, extracting a preset number of frame images from each of the two segments at a preset frame rate to obtain two image sequences each containing the preset number of frame images, and applying the same processing to the two image sequences to obtain two corresponding enhanced samples;

a feature embedding step, namely inputting the enhanced samples into a feature embedding network and outputting the corresponding encoded feature expressions;

a target loss step, namely calculating a contrastive loss L according to the following formula and updating the network parameters θ of the video feature extraction model:

L = Σ_i (1 - C_ii)² + 0.001 · Σ_i Σ_{j≠i} (C_ij)²

and

C_ij = ( Σ_b z1_{b,i} · z2_{b,j} ) / ( √(Σ_b (z1_{b,i})²) · √(Σ_b (z2_{b,j})²) ),

wherein z1 and z2 denote the feature expressions of the two enhanced samples, C denotes the cross-correlation matrix between z1 and z2, b indexes the sample pairs in the batch, and i and j index the dimensional components of z1 and z2 respectively.

In some of these embodiments, the processing of the image sequence in the sample enhancement step includes at least random cropping, resizing, random horizontal flipping, random color dithering, graying, and gaussian blur transformation.

In some embodiments, the method further includes a customized service data model fine-tuning step, in which the video feature extraction model obtained after the target loss step is fine-tuned for a predetermined number of epochs on a customized service data set.

In some embodiments, the video retrieval step may further compare the video feature vectors by a hash retrieval method.

In a second aspect, an embodiment of the present application provides a method for quickly recommending a video, including:

receiving video information returned by a client, retrieving a video corresponding to the video information by adopting the video quick retrieval method of the first aspect, and returning a retrieval result to the client.

In a third aspect, an embodiment of the present application provides a video fast retrieval system, where the video fast retrieval method according to the first aspect is applied, and includes: a terminal device, a transmission device and a server device; the terminal equipment is connected with the server equipment through the transmission equipment;

the terminal equipment is used for a user to select a trigger video and for receiving the returned retrieved videos;

the server device is used for executing the video quick retrieval method of any one of the first aspect.

In some of these embodiments, the server device comprises:

the video feature extraction model training module is used for acquiring videos by using a crawler, taking the preprocessed videos as a training data set, and training a video feature extraction model based on unsupervised contrastive learning;

the video characteristic extraction module is used for inputting the service-related videos into the video characteristic extraction model one by one and outputting corresponding video characteristic vectors to form a video server search library;

the video retrieval module receives video information returned by the client, obtains a trigger video feature vector corresponding to the video information according to the video feature extraction model, compares the similarity of the trigger video feature vector with the video feature vectors in the video server retrieval library, and returns the service-related videos whose similarity meets a preset condition to the client.

In some embodiments, the video feature extraction model training module further comprises:

a sample enhancement unit, which, for each video in the batch of video data, randomly extracts from the video two segments whose length is a preset duration, extracts a preset number of frame images from each of the two segments at a preset frame rate to obtain two image sequences each containing the preset number of frame images, and applies the same processing to the two image sequences to obtain two corresponding enhanced samples;

the feature embedding unit, which inputs the enhanced samples into a feature embedding network and outputs the corresponding encoded feature expressions;

the target loss unit, which calculates the contrastive loss L according to the following formula and updates the network parameters θ of the video feature extraction model:

L = Σ_i (1 - C_ii)² + 0.001 · Σ_i Σ_{j≠i} (C_ij)²

and

C_ij = ( Σ_b z1_{b,i} · z2_{b,j} ) / ( √(Σ_b (z1_{b,i})²) · √(Σ_b (z2_{b,j})²) ),

wherein z1 and z2 denote the feature expressions of the two enhanced samples, C denotes the cross-correlation matrix between z1 and z2, b indexes the sample pairs in the batch, and i and j index the dimensional components of z1 and z2 respectively.

In a fourth aspect, the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor executes the computer program to implement the video fast retrieval method according to the first aspect.

In a fifth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the video fast retrieval method according to the first aspect.

Compared with the related art, the video fast retrieval method, the video fast retrieval system and the video quick recommendation method provided by the embodiments of the present application can be applied to the technical fields of deep learning and computer vision. The video feature extraction model is trained based on an unsupervised video representation learning technique, video feature vectors are obtained for the videos, and the feature vectors are compared to produce the retrieval result; this effectively improves the ability to represent spatio-temporal video features, while the vector comparison greatly improves retrieval efficiency.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a flowchart of a video fast retrieval method according to an embodiment of the present application;

FIG. 2 is a flow chart of the training steps of a video feature extraction model according to an embodiment of the present application;

FIG. 3 is a flow chart of a video fast retrieval method according to the preferred embodiment of the present application;

FIG. 4 is a video feature extraction framework based on contrastive learning in an embodiment of the present application;

FIG. 5 is a schematic diagram of a feature embedding encoder network proposed in the embodiment of the present application;

fig. 6 is a block diagram of a video fast retrieval system according to an embodiment of the present application;

FIG. 7 is a flowchart of a video quick recommendation method according to an embodiment of the present application;

fig. 8 is a block diagram of a video quick recommendation system according to an embodiment of the present application;

fig. 9 is a hardware configuration diagram of a computer device according to an embodiment of the present application.

Description of the drawings:

a terminal device 1; a transmission device 2; a server device 3;

a video feature extraction model training module 31; a video feature extraction module 32;

a video retrieval module 33; a sample enhancing unit 311; a feature embedding unit 312;

a target loss unit 313; a custom business data model fine-tuning module 314;

a processor 81; a memory 82; a communication interface 83; a bus 80.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.

It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.

In the prior art, the traditional video search approach is to manually annotate video frame images with text. In this approach, the video data is described according to the content of the frame images in the video, thereby forming keywords that describe the video content. Annotation is mainly done manually; when searching, the user provides keywords according to his or her own interests, and the database returns the query result by matching video tags against the keywords. This search mode is easy to implement, and because the annotation is done manually, the query accuracy is high.

However, its disadvantages are evident. First, since annotation is mainly manual, it is strongly affected by the annotators' subjective factors, and different annotators observe at different levels, which may lead to different descriptions of the same video. Second, a textual description is a fixed abstraction of the video scene content, so a particular tag is only appropriate for a particular query. Third, the amount of video data is large and the effort of manually adding annotations is enormous, especially given today's ever-growing video volume, which makes this approach costly and inefficient.

To remedy the shortcomings of text annotation, feature-based video retrieval has emerged. Unlike the manual annotation method, feature-based video retrieval is a fuzzy query technique: it analyzes the video content and structures the unstructured video. The method mainly divides the video structure into scenes, shots and key frames. Key frames are extracted by segmenting the video into scenes and shots and are then processed further, but a single feature cannot express the image content accurately enough.

Currently, video retrieval based on video representation learning has become a new research focus. Meanwhile, in the prior art, effective video space-time characteristic representation is difficult to extract, and the high real-time requirement of service application is difficult to meet.

Example 1:

Based on this, this embodiment provides a video fast retrieval method. Fig. 1 is a flowchart of a video fast retrieval method according to an embodiment of the present application, and as shown in fig. 1, the flow includes the following steps:

a video feature extraction model training step S1, wherein videos are acquired by using a crawler, the preprocessed videos are taken as a training data set, and a video feature extraction model is trained on the basis of unsupervised contrastive learning;

a video feature extraction step S2, wherein, the video related to the service is input into the video feature extraction model one by one, and corresponding video feature vectors are output to form a video server search library;

and a video retrieval step S3, obtaining a video to be retrieved, obtaining a video feature vector to be retrieved corresponding to the video to be retrieved according to the video feature extraction model, comparing the similarity of the video feature vector to be retrieved with the video feature vector in the video server retrieval library, and obtaining the service-related video with the similarity meeting a certain preset condition.

The preset condition may be adjusted according to the desired search range, and may be set, for example, to the top five of the similarity ranking.

Through the steps, the video feature extraction model is trained on the basis of the unsupervised video representation learning technology, the video feature vectors of the video are obtained, the video feature vectors are compared to obtain the retrieval result, the video space-time feature representation capability can be effectively improved, and meanwhile, the vector comparison is carried out, so that the retrieval efficiency is greatly improved.

A discriminative feature is learned for each video using representation learning, and the similarity of the feature vectors is then used for fast retrieval.

In some embodiments, fig. 2 is a flowchart of the training step of the video feature extraction model according to an embodiment of the present application. As shown in fig. 2, a batch of video data randomly loaded from the training data set is input to an unsupervised video contrastive learning framework for iterative training, which specifically includes the following steps:

a sample enhancement step S11, in which, for each video in the batch of video data, two segments whose length is a preset duration are randomly extracted from the video, a preset number of frame images are extracted from each of the two segments at a preset frame rate to obtain two image sequences each containing the preset number of frame images, and the same processing is applied to the two image sequences to obtain two corresponding enhanced samples;

a feature embedding step S12, in which the enhanced samples are input into a feature embedding network and the corresponding encoded feature expressions are output;

a target loss step S13, in which a contrastive loss L is calculated according to the following formula and the network parameters θ of the video feature extraction model are updated:

L = Σ_i (1 - C_ii)² + 0.001 · Σ_i Σ_{j≠i} (C_ij)²

and

C_ij = ( Σ_b z1_{b,i} · z2_{b,j} ) / ( √(Σ_b (z1_{b,i})²) · √(Σ_b (z2_{b,j})²) ),

wherein z1 and z2 denote the feature expressions of the two enhanced samples, C denotes the cross-correlation matrix between z1 and z2, b indexes the sample pairs in the batch, and i and j index the dimensional components of z1 and z2 respectively.

The processing of the image sequence in the sample enhancement step S11 includes at least random cropping, resizing, random horizontal flipping, random color dithering, graying, and gaussian blur transformation.

In some embodiments, the video fast retrieval method provided in this embodiment of the present application further includes a customized service data model fine-tuning step S14, in which the video feature extraction model obtained after the target loss step is fine-tuned for a predetermined number of epochs on a customized service data set.

In some embodiments, the video feature vectors may also be compared in the video retrieval step S3 by a hash retrieval method.

Based on the above method, the following focuses on the sales service support scenario, and describes and explains an embodiment of the present application.

The method and system realize video interaction prompts based on content similarity and help sales personnel quickly filter and retrieve videos that customers may be interested in, with the aim of improving the sales conversion rate and the user experience. In a typical application scenario, when a salesperson communicates with a customer through enterprise WeChat and selects a video to send to the customer, the system automatically retrieves several of the most similar related videos from the library and prompts the salesperson to choose among them, thereby helping the salesperson carry out marketing.

Fig. 3 is a flowchart of a video fast retrieval method according to a preferred embodiment of the present application.

The system adopts a client/server (C/S) architecture. When a client user manually selects a video, an interactive prompt request is triggered and the currently selected video ID is sent to the server. After receiving the request, the server extracts the video features using the following technical scheme, compares them with the videos in the video library, and returns the paths of the 5 most similar videos to the client.
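As a rough illustration of this request/response flow, the following is a minimal sketch; the patent does not name a web framework or message format, so Flask, the endpoint name, and the two helper functions are assumptions rather than the patented implementation.

```python
# A minimal sketch of the server side of the C/S interaction (illustrative only).
from flask import Flask, jsonify, request

app = Flask(__name__)

def query_feature(video_id):
    """Hypothetical helper: load the selected video by ID and run the trained encoder."""
    raise NotImplementedError

def top5_similar(feat):
    """Hypothetical helper: cosine-compare against the retrieval library (see S302 below)."""
    raise NotImplementedError

@app.route("/similar", methods=["POST"])
def similar():
    feat = query_feature(request.json["video_id"])   # features of the user-selected video
    return jsonify(top5_similar(feat))                # paths of the 5 most similar videos

if __name__ == "__main__":
    app.run()
```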

S301, extraction of spatio-temporally consistent features based on unsupervised contrastive learning

In the spatio-temporally consistent feature extraction stage, a video feature extraction neural network model is trained on the basis of an unsupervised contrastive learning paradigm; after training, a single video is input and a 4096-dimensional feature vector expressing the input video is output. The stage mainly comprises the following steps:

1. A large number of videos (50,000 to 500,000 clips) are crawled from search engines, social networks and video sharing platforms and are simply preprocessed (advertising material at the beginning and end that is irrelevant to the video content is removed) to serve as the training data set.
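The head/tail trimming can be done with standard tooling; the following is a minimal sketch that assumes ffmpeg/ffprobe are installed, and the 5-second trim lengths are purely illustrative (the patent does not state how the advertisement segments are located).

```python
# A minimal sketch of the "remove head/tail advertisement" preprocessing (illustrative only).
import subprocess

def trim_head_tail(src, dst, head_sec=5.0, tail_sec=5.0):
    # Probe the duration, then re-encode only the middle part of the clip.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", src],
        capture_output=True, text=True, check=True)
    duration = float(out.stdout.strip())
    keep = max(duration - head_sec - tail_sec, 0.0)
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(head_sec), "-t", str(keep), "-i", src, dst],
        check=True)
```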

2. Self-supervised contrastive learning training. A batch of video data is randomly loaded from the data set and input into the unsupervised video contrastive learning framework shown in fig. 4 for training (assuming the batch size is n, n is 512 in the present scheme), and the iterative training process is as follows:

For each video v in the batch, spatio-temporally consistent sample enhancement is first carried out to generate two enhanced samples v1 and v2. The concrete steps of sample enhancement are as follows:

Two segments of length 32 seconds are randomly extracted from the video v, and 32 frames of images are extracted from each of the two 32-second segments at a rate of one frame per second (1 fps), giving two image sequences s1 and s2, each containing 32 frames (a sampling sketch is given after formula (1) below).

The same random cropping, resizing to 224x224, random horizontal flipping, random color jittering, graying and Gaussian blur transformations are applied to s1 and s2, yielding the enhanced samples v1 and v2 (an augmentation sketch is also given after formula (1)).

The above process can be formulated as

v1, v2 = aug(v), aug(v)    (1)
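The clip sampling step above can be sketched as follows; decoding details (codec support, exact frame indexing) are simplified, and OpenCV is an assumption rather than the patented implementation.

```python
# A minimal sketch of sampling a 32-second clip at 1 fps with OpenCV (illustrative only).
import random
import cv2
from PIL import Image

def sample_clip(path, clip_sec=32, fps_out=1):
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start = random.randint(0, max(total - int(clip_sec * fps), 0))  # random clip start
    frames = []
    for k in range(clip_sec * fps_out):            # one frame per second, 32 frames
        cap.set(cv2.CAP_PROP_POS_FRAMES, start + int(k * fps / fps_out))
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames                                   # list of PIL images, ideally 32

# s1, s2 = sample_clip("video.mp4"), sample_clip("video.mp4")  # two random segments of v
```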
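The aug(·) operation can likewise be sketched with torchvision; the crop scale, jitter strengths and blur kernel below are illustrative assumptions (the patent only lists the transformation types), and one common reading of "spatio-temporal consistency" is that the randomly drawn parameters are reused for every frame of a clip.

```python
# A minimal sketch of temporally consistent clip augmentation (illustrative only).
import random
import torch
import torchvision.transforms.functional as TF
from torchvision import transforms

def augment_clip(frames):
    # frames: list of PIL images forming one clip. Draw the random parameters once and
    # reuse them for every frame so the whole clip is transformed consistently in time.
    t0 = TF.to_tensor(frames[0])
    i, j, h, w = transforms.RandomResizedCrop.get_params(t0, scale=(0.5, 1.0), ratio=(3/4, 4/3))
    flip = random.random() < 0.5
    b, c, s = (random.uniform(0.6, 1.4) for _ in range(3))
    hue = random.uniform(-0.1, 0.1)
    to_gray = random.random() < 0.2
    sigma = random.uniform(0.1, 2.0) if random.random() < 0.5 else None

    out = []
    for f in frames:
        x = TF.to_tensor(f)
        x = TF.resized_crop(x, i, j, h, w, size=[224, 224])       # random crop + resize
        if flip:
            x = TF.hflip(x)                                       # horizontal flip
        x = TF.adjust_brightness(x, b)                            # color jitter
        x = TF.adjust_contrast(x, c)
        x = TF.adjust_saturation(x, s)
        x = TF.adjust_hue(x, hue)
        if to_gray:
            x = TF.rgb_to_grayscale(x, num_output_channels=3)     # graying
        if sigma is not None:
            x = TF.gaussian_blur(x, kernel_size=23, sigma=sigma)  # Gaussian blur
        out.append(x)
    return torch.stack(out)                                       # (T, 3, 224, 224)

# v1, v2 = augment_clip(s1), augment_clip(s2)   # the two enhanced samples of formula (1)
```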

The n pairs of enhanced samples are fed into the two branches of the network respectively; after passing through the feature embedding network shown in fig. 5, the encoded features are expressed as

z1, z2 = f(v1), f(v2)    (2)
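The patent's actual encoder architecture (fig. 5) is not reproduced in this text, so the following sketch assumes, purely for illustration, an r3d_18 video backbone from torchvision with a projection head producing the 4096-dimensional output mentioned above.

```python
# A minimal sketch of a feature embedding network f (illustrative assumption, not Fig. 5).
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class FeatureEmbedding(nn.Module):
    def __init__(self, out_dim=4096):
        super().__init__()
        self.backbone = r3d_18()                   # 3D-CNN backbone, no pretrained weights
        self.backbone.fc = nn.Identity()           # keep the 512-d pooled features
        self.projector = nn.Sequential(            # project to the 4096-d embedding
            nn.Linear(512, 4096), nn.BatchNorm1d(4096), nn.ReLU(inplace=True),
            nn.Linear(4096, out_dim))

    def forward(self, clip):                       # clip: (B, 3, T, 224, 224)
        return self.projector(self.backbone(clip))

f = FeatureEmbedding()
v1 = torch.randn(4, 3, 32, 224, 224)               # a toy batch of augmented clips
z1 = f(v1)                                          # (4, 4096) feature expressions
```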

Finally, the contrastive loss L is calculated according to the following formula (3) and the network parameters θ are updated, where C denotes the cross-correlation matrix between z1 and z2, b indexes the sample pairs in the batch, and i and j index the dimensional components of z1 and z2 respectively:

L = Σ_i (1 - C_ii)² + 0.001 · Σ_i Σ_{j≠i} (C_ij)²    (3)

and

C_ij = ( Σ_b z1_{b,i} · z2_{b,j} ) / ( √(Σ_b (z1_{b,i})²) · √(Σ_b (z2_{b,j})²) ).
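The loss of formula (3) can be computed directly from the two batches of feature expressions; the following is a minimal sketch consistent with the formula as reconstructed above (the cross-correlation definition is reconstructed from the stated symbol meanings, so treat the sketch as an assumption rather than the exact patented code).

```python
# A minimal sketch of the contrastive loss of formula (3) (illustrative only).
import torch

def contrastive_loss(z1, z2, lam=0.001, eps=1e-8):
    # z1, z2: (n, d) feature expressions of the two augmented views of the batch.
    norm1 = z1.pow(2).sum(dim=0).sqrt()            # sqrt(sum_b z1_{b,i}^2), shape (d,)
    norm2 = z2.pow(2).sum(dim=0).sqrt()
    c = (z1.t() @ z2) / (norm1.unsqueeze(1) * norm2.unsqueeze(0) + eps)   # C_ij
    on_diag = (1.0 - torch.diagonal(c)).pow(2).sum()                 # sum_i (1 - C_ii)^2
    off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()       # sum_{i != j} C_ij^2
    return on_diag + lam * off_diag

# loss = contrastive_loss(z1, z2); loss.backward()  # then update the parameters θ
```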

3. Fine-tuning on the custom business data. To make the trained model generalize better to the custom business data, the model is fine-tuned for 5-10 epochs on the custom business data set.
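A minimal fine-tuning sketch follows; it assumes the same contrastive objective is reused on the custom business data, and the optimizer choice and learning rate are assumptions, not values given in the patent.

```python
# A minimal sketch of the fine-tuning step (illustrative only).
import torch

def finetune(model, business_loader, epochs=5, lr=1e-4, device="cuda"):
    model.train().to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                       # 5-10 epochs per the scheme above
        for v1, v2 in business_loader:            # two augmented views per video
            z1, z2 = model(v1.to(device)), model(v2.to(device))
            loss = contrastive_loss(z1, z2)       # reuse the loss sketched above
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```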

4. Extracting the spatio-temporal features of the video. When extracting spatio-temporal features for the custom business data, the video is input directly to the encoder trained in step 3, and a 4096-dimensional video feature vector is output.

5. Constructing the video server retrieval library. The service-related videos are collected, their features are extracted one by one using the feature extraction method of step 4, and the features are stored in the video feature library.
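Steps 4 and 5 amount to running the trained encoder in inference mode and persisting one vector per video; the sketch below assumes a hypothetical load_clip helper and an .npz file layout for the library, neither of which is specified in the patent.

```python
# A minimal sketch of feature extraction and library construction (illustrative only).
import numpy as np
import torch

@torch.no_grad()
def extract_feature(model, clip, device="cuda"):
    model.eval().to(device)
    z = model(clip.unsqueeze(0).to(device))       # clip: (3, T, 224, 224) -> (1, 4096)
    return z.squeeze(0).cpu().numpy()

def build_library(model, video_paths, load_clip):
    # load_clip(path) is a hypothetical helper returning a preprocessed clip tensor.
    feats = np.stack([extract_feature(model, load_clip(p)) for p in video_paths])
    np.savez("video_library.npz", paths=np.array(video_paths), features=feats)
```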

S302, video retrieval based on feature similarity

Video retrieval and prompting: when the server receives a client request, the video features are extracted using the method of step 4, cosine similarity is computed between these features and the features in the video retrieval library, and the paths of the 5 most similar videos are ranked and returned to the client.
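The ranking step reduces to a normalized dot product over the library; the sketch below assumes the illustrative "video_library.npz" layout introduced above.

```python
# A minimal sketch of cosine-similarity top-k retrieval (illustrative only).
import numpy as np

def retrieve_top_k(query_feat, library_path="video_library.npz", k=5):
    lib = np.load(library_path, allow_pickle=True)
    feats, paths = lib["features"], lib["paths"]
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sims = f @ q                                   # cosine similarity with every entry
    top = np.argsort(-sims)[:k]                    # indices of the k most similar videos
    return [(str(paths[i]), float(sims[i])) for i in top]
```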

With the above steps, because only calculations between vectors are involved, the system can respond within milliseconds. The video retrieval recall is high and the retrieval speed is fast, which further improves the user experience and the sales conversion rate, and accuracy can be improved while maintaining a fast response.

The final retrieval performed via similarity comparison can also be replaced by a hash retrieval method, which likewise achieves fast retrieval, although the accuracy is slightly lower than that of cosine similarity.
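The patent does not specify a particular hashing scheme, so the following is only one possible variant (random-projection hashing with Hamming-distance ranking), given purely to illustrate how the vector comparison could be replaced.

```python
# A minimal sketch of one possible hash retrieval variant (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((4096, 64))             # fixed random projection to 64 bits

def to_code(feat):
    return (feat @ PROJ > 0).astype(np.uint8)      # sign of each projection -> binary code

def hash_retrieve(query_feat, lib_feats, lib_paths, k=5):
    q = to_code(query_feat)
    codes = (lib_feats @ PROJ > 0).astype(np.uint8)
    dists = (codes != q).sum(axis=1)               # Hamming distance to the query code
    top = np.argsort(dists)[:k]
    return [lib_paths[i] for i in top]
```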

This embodiment also provides a video fast retrieval system. The system is used to implement the above embodiments and preferred implementations, and what has already been described will not be repeated. As used hereinafter, the terms "module," "unit," "subunit," and the like may implement a combination of software and/or hardware for a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.

Fig. 6 is a block diagram of a video fast retrieval system according to an embodiment of the present application, and as shown in fig. 6, the video fast retrieval system includes:

a terminal device 1, a transmission device 2, and a server device 3; wherein, the terminal device 1 is connected with the server device 3 through the transmission device 2;

the terminal equipment 1 is used for selecting a trigger video by a user and receiving a returned retrieval video;

the server device 3 is used to execute the video fast retrieval method as described above.

In some of these embodiments, the server device 3 comprises:

the video feature extraction model training module 31 is used for acquiring videos by using crawlers, taking the preprocessed videos as a training data set, and training a video feature extraction model based on unsupervised comparison learning;

the video feature extraction module 32 is used for inputting the service-related videos into the video feature extraction model one by one, and outputting corresponding video feature vectors to form a video server search library;

the video retrieval module 33 receives video information returned by a client, obtains a trigger video feature vector corresponding to the video information according to the video feature extraction model, compares the similarity between the trigger video feature vector and the video feature vector in the video server retrieval library, and returns the service-related video with the similarity satisfying a certain preset condition to the client.

In some of these embodiments, the video feature extraction model training module 31 further includes:

a sample enhancement unit 311, configured to, for each video in the batch of video data, randomly extract from the video two segments whose length is a preset duration, extract a preset number of frame images from each of the two segments at a preset frame rate to obtain two image sequences each containing the preset number of frame images, and apply the same processing to the two image sequences to obtain two corresponding enhanced samples;

a feature embedding unit 312, which inputs the enhanced samples into a feature embedding network and outputs the corresponding encoded feature expressions;

a target loss unit 313, which calculates the contrastive loss L according to the following formula and updates the network parameters θ of the video feature extraction model:

L = Σ_i (1 - C_ii)² + 0.001 · Σ_i Σ_{j≠i} (C_ij)²

and

C_ij = ( Σ_b z1_{b,i} · z2_{b,j} ) / ( √(Σ_b (z1_{b,i})²) · √(Σ_b (z2_{b,j})²) ),

wherein z1 and z2 denote the feature expressions of the two enhanced samples, C denotes the cross-correlation matrix between z1 and z2, b indexes the sample pairs in the batch, and i and j index the dimensional components of z1 and z2 respectively.

The system further includes a custom business data model fine-tuning module 314 for fine-tuning the video feature extraction model that has undergone the target loss step for a predetermined number of epochs on a custom business data set.

Meanwhile, the video retrieval module 33 may also obtain a retrieval result according to the feature vector of the video and a hash retrieval method.

The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

Example 2:

This embodiment provides a video quick recommendation method. Fig. 7 is a flowchart of a video quick recommendation method according to an embodiment of the present application, and as shown in fig. 7, the flow includes the following steps:

S701, receiving video information returned by a client;

S702, retrieving the video corresponding to the video information by using the video fast retrieval method of embodiment 1, and returning the retrieval result to the client.

Through the steps, a video rapid recommendation method is designed, and sales personnel can be helped to rapidly filter and retrieve potentially interesting videos so as to improve the sales conversion rate and improve the user experience.

The embodiment also provides a video quick recommendation system, which is used for implementing the above embodiments and preferred embodiments, and the description of the system is omitted. As used hereinafter, the terms "module," "unit," "subunit," and the like may implement a combination of software and/or hardware for a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 8 is a block diagram of a video quick recommendation system according to an embodiment of the present application, and as shown in fig. 8, the video quick recommendation system includes:

the client 801 is used for receiving and returning relevant information of a video clicked by a user;

the video fast retrieval module 802 extracts video feature vectors according to the related information of the video, and compares the video feature vectors with a plurality of video feature vectors in a video retrieval library to obtain a retrieval result;

and the recommending module 803 returns and displays the search result to the client.

In addition, the video fast retrieval method described in conjunction with fig. 1 in the embodiment of the present application may be implemented by a computer device. Fig. 9 is a hardware configuration diagram of a computer device according to an embodiment of the present application.

The computer device may comprise a processor 81 and a memory 82 in which computer program instructions are stored.

Specifically, the processor 81 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.

Memory 82 may include, among other things, mass storage for data or instructions. By way of example, and not limitation, memory 82 may include a Hard Disk Drive (Hard Disk Drive, abbreviated to HDD), a floppy Disk Drive, a Solid State Drive (SSD), flash memory, an optical Disk, a magneto-optical Disk, tape, or a Universal Serial Bus (USB) Drive or a combination of two or more of these. Memory 82 may include removable or non-removable (or fixed) media, where appropriate. The memory 82 may be internal or external to the data processing apparatus, where appropriate. In a particular embodiment, the memory 82 is a Non-Volatile (Non-Volatile) memory. In particular embodiments, Memory 82 includes Read-Only Memory (ROM) and Random Access Memory (RAM). The ROM may be mask-programmed ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), Electrically rewritable ROM (EAROM), or FLASH Memory (FLASH), or a combination of two or more of these, where appropriate. The RAM may be a Static Random-Access Memory (SRAM) or a Dynamic Random-Access Memory (DRAM), where the DRAM may be a Fast Page Mode Dynamic Random-Access Memory (FPMDRAM), an Extended data output Dynamic Random-Access Memory (EDODRAM), a Synchronous Dynamic Random-Access Memory (SDRAM), and the like.

The memory 82 may be used to store or cache various data files for processing and/or communication use, as well as possible computer program instructions executed by the processor 81.

The processor 81 reads and executes the computer program instructions stored in the memory 82 to implement any one of the video fast retrieval methods in the above embodiments.

In some of these embodiments, the computer device may also include a communication interface 83 and a bus 80. As shown in fig. 9, the processor 81, the memory 82, and the communication interface 83 are connected via the bus 80 to complete communication therebetween.

The communication interface 83 is used for implementing communication between modules, devices, units and/or equipment in the embodiment of the present application. The communication port 83 may also be implemented with other components such as: the data communication is carried out among external equipment, image/data acquisition equipment, a database, external storage, an image/data processing workstation and the like.

Bus 80 includes hardware, software, or both to couple the components of the computer device to each other. Bus 80 includes, but is not limited to, at least one of the following: a Data Bus, an Address Bus, a Control Bus, an Expansion Bus, and a Local Bus. By way of example, and not limitation, Bus 80 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) Bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) Bus, an InfiniBand interconnect, a Low Pin Count (LPC) Bus, a memory bus, a Micro Channel Architecture (MCA) Bus, a Peripheral Component Interconnect (PCI) Bus, a PCI-Express (PCIe) Bus, a Serial Advanced Technology Attachment (SATA) Bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of these. Bus 80 may include one or more buses, where appropriate. Although specific buses are described and shown in the embodiments of the application, any suitable buses or interconnects are contemplated by the application.

The computer device can execute the video retrieval steps in the embodiment of the application based on the acquired video feature vectors, thereby implementing the video fast retrieval method described in conjunction with fig. 1.

In addition, in combination with the video fast retrieval method in the foregoing embodiments, the embodiments of the present application may provide a computer-readable storage medium to implement. The computer readable storage medium having stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the above embodiments of the method for fast video retrieval.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.
