Media data processing method and device, electronic equipment and computer storage medium

Document No.: 1937608  Publication date: 2021-12-07

Note: This technology, "Media data processing method and device, electronic equipment and computer storage medium", was designed and created by 张斌杰, 李昱, 袁春, 徐得景, 赵珣 and 单瀛 on 2021-04-16. Its main content is as follows: The application discloses a media data processing method, apparatus, electronic device and computer storage medium, relating to the fields of artificial intelligence and cloud technology. The method includes: acquiring a text to be processed and a video to be processed; extracting global text features and local text features corresponding to the text to be processed, as well as first video features of the video to be processed, and fusing the global text features into the first video features to obtain second video features; and determining, from the video to be processed, a target segment matching the text to be processed according to the local text features and the second video features. Because features of two granularities, the global text features and the local text features, describe the full information of the text to be processed more comprehensively, the target segment determined from both is a better match for the text to be processed and closer to it semantically.

1. A method for media data processing, comprising:

acquiring a text to be processed and a video to be processed;

extracting global text features and local text features corresponding to the text to be processed and first video features of the video to be processed, wherein the global text features comprise phrase features corresponding to phrases contained in the text to be processed, and the local text features comprise features corresponding to unit texts contained in the text to be processed;

fusing the global text feature into the first video feature to obtain a second video feature;

and determining a target segment matched with the text to be processed from the video to be processed according to the local text feature and the second video feature.

2. The method of claim 1, wherein the first video feature comprises features of a plurality of video segments in the video to be processed; the fusing the global text feature into the first video feature to obtain a second video feature includes:

fusing the global text feature and the first video feature to obtain a fused video feature;

acquiring corresponding position information characteristics of each video clip in the video to be processed;

overlapping the characteristics of each video segment in the fusion video characteristics and the position information characteristics corresponding to each video segment to obtain overlapped video characteristics;

and obtaining the second video characteristic based on the superposed video characteristics.

3. The method according to claim 1, wherein the determining, from the video to be processed, a target segment matching the text to be processed according to the local text feature and the second video feature comprises:

determining associated features of the local textual feature and the second video feature;

adjusting the local text features according to the association features and the second video features to obtain adjusted local features;

adjusting the second video characteristic according to the association characteristic and the local text characteristic to obtain an adjusted video characteristic;

and determining a target segment matched with the text to be processed from the video to be processed based on the adjusted local features and the adjusted video features.

4. The method of claim 3, wherein the first video feature comprises features of a plurality of video segments in the video to be processed; the determining, from the video to be processed, a target segment matching the text to be processed based on the adjusted local features and the adjusted video features includes:

according to the adjusted local features, determining guidance information of the text to be processed on the video to be processed;

adjusting the characteristics of each video segment in the adjusted video characteristics according to the guide information to obtain third video characteristics;

and determining a target segment matched with the text to be processed from the video to be processed according to the third video characteristics.

5. The method according to claim 4, wherein the determining, from the video to be processed, a target segment matching the text to be processed according to the third video feature comprises:

determining the weight corresponding to each video segment according to the association relationship among the characteristics of each video segment contained in the third video characteristics;

weighting the third video features of the video segments based on the weights corresponding to the video segments to obtain fourth video features;

determining position information of a target segment matched with the text to be processed in the video to be processed based on the fourth video feature;

and determining, based on the position information, the target segment in the video to be processed that matches the text to be processed.

6. The method according to any one of claims 1 to 5, wherein the extracting global text features and local text features of the text to be processed comprises:

acquiring unit texts in the text to be processed and a position relation between the unit texts;

determining the characteristics of each unit text based on the position relation between each unit text and each unit text, wherein the local text characteristics comprise the characteristics of each unit text;

determining phrase characteristics of phrases contained in the text to be processed based on characteristics corresponding to the unit texts;

and fusing the phrase characteristics of each phrase to obtain the global text characteristics.

7. The method of claim 2, wherein the first video feature comprises features of a plurality of video segments in the video to be processed; the obtaining the second video feature based on the superimposed video feature includes:

determining the weight corresponding to each video segment based on the association relationship among the video segments in the superimposed video features;

for each video clip, obtaining an enhanced feature corresponding to the video clip based on the weight corresponding to the video clip and the feature corresponding to the video clip in the superimposed video features;

and extracting the second video characteristics based on the enhanced characteristics corresponding to the video clips.

8. The method according to any one of claims 1 to 5, wherein global text features of the text to be processed are extracted, the global text features are fused into the first video features to obtain second video features, and the target segments matched with the text to be processed are determined from the video to be processed based on a neural network model according to the local text features and the second video features;

the neural network model comprises a phrase feature extraction network, a multi-mode feature extraction network and a video clip determination network, and is obtained by training in the following mode:

acquiring training data, wherein the training data comprises a plurality of samples, each sample comprises a sample video and a sample text, each sample carries a position tag, and the position tag represents position information of a target video fragment corresponding to the sample text in the sample video;

for each sample in the training data, extracting global text features and local text features of sample texts in the sample, and video features of the sample video;

for the sample, inputting the characteristics of each unit text in the sample text into a phrase characteristic extraction network to obtain the predicted phrase characteristics of each phrase in the sample text;

determining a first loss value based on the matching degree between the predicted phrase features of the phrases corresponding to each sample, wherein, for one sample, the first loss value represents the semantic difference between the phrases in the sample;

for the sample, inputting the global text features and the local text features of the sample text and the video features of the sample video into the multi-modal feature extraction network to obtain multi-modal video features corresponding to the sample video;

for the sample, inputting the multi-modal video features to the video fragment determination network to obtain weights corresponding to all sample video fragments in the multi-modal video features, and obtaining corresponding predicted position information of predicted video fragments corresponding to the sample text in the sample video based on the multi-modal video features and the weights corresponding to all the sample video fragments;

determining a second loss value based on the predicted position information corresponding to each sample and each position label, wherein the second loss value represents the difference between the predicted position information corresponding to each sample and the position label of each sample;

determining a third loss value based on the weight corresponding to each sample video segment in the multi-modal video features corresponding to each sample and the position label corresponding to each sample, wherein for one sample, the third loss value represents the possibility that each sample video segment in the sample is the target video segment;

determining a value of a training loss function corresponding to the neural network model based on the first loss value, the second loss value and the third loss value;

and if the training loss function converges, taking the model at convergence as the final neural network model; if the training loss function does not converge, adjusting the model parameters of the neural network model and continuing to train the neural network model based on the training data.

9. The method of claim 8, wherein, for one of the samples, determining the second loss value based on the predicted position information corresponding to each sample and each position label comprises:

determining a position deviation value based on the predicted position information corresponding to the sample and the position label;

if the absolute value of the position deviation value is smaller than a threshold parameter, determining the second loss value based on a least square error loss function corresponding to the position deviation value;

and if the absolute value of the position deviation value is not less than the threshold parameter, determining the second loss value based on a loss function corresponding to the position deviation value, wherein the loss function comprises a mean absolute error loss function and the threshold parameter.

10. The method according to any one of claims 1 to 5, wherein the obtaining the text to be processed and the video to be processed comprises:

acquiring a video search request of a user, wherein the video search request comprises a search text;

acquiring a video database corresponding to the video search request, wherein the search text is the text to be processed, and any video in the video database is the video to be processed;

the method further comprises the following steps:

and if the target segment exists in the video to be processed, sending the target segment to the user.

11. The method according to any one of claims 1 to 5, wherein the obtaining the text to be processed and the video to be processed comprises:

acquiring a video to be processed and title information of the video to be processed, wherein the text to be processed is the title information of the video to be processed;

the method further comprises the following steps:

if the target segment exists in the video to be processed, determining that the title information is matched with the video to be processed;

and if the target fragment does not exist in the video to be processed, determining that the title information is not matched with the video to be processed.

12. A media data processing apparatus, comprising:

the data acquisition module is used for acquiring a text to be processed and a video to be processed;

the feature extraction module is configured to extract global text features and local text features corresponding to the to-be-processed text, and first video features of the to-be-processed video, where the global text features include phrase features corresponding to phrases included in the to-be-processed text, and the local text features include features corresponding to unit texts included in the to-be-processed text;

the feature fusion module is used for fusing the global text feature into the first video feature to obtain a second video feature;

and the target fragment determining module is used for determining a target fragment matched with the text to be processed from the video to be processed according to the local text feature and the second video feature.

13. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-11 when executing the computer program.

14. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1-11.

Technical Field

The present application relates to the field of artificial intelligence and cloud technologies, and in particular, to a media data processing method and apparatus, an electronic device, and a computer storage medium.

Background

In the prior art, in order to obtain a video segment matching a text from a video, the video segment is generally determined based on the global text feature of the text and the video features of the video, i.e., a segment whose video features match the global text feature is selected from the video.

However, the global text feature cannot comprehensively represent all information of the text, so a video segment determined based only on the global text feature and the video features of the video is not accurate enough, i.e., the video segment may not conform to the text description.

Disclosure of Invention

The present application aims to solve at least one of the above technical drawbacks, and in particular proposes the following technical solutions to solve the problem that a video segment determined from a video as matching a text is not accurate enough.

According to an aspect of the present application, there is provided a media data processing method, the method including:

acquiring a text to be processed and a video to be processed;

extracting global text features and local text features corresponding to the text to be processed and first video features of the video to be processed, wherein the global text features comprise phrase features corresponding to phrases contained in the text to be processed, and the local text features comprise features corresponding to unit texts contained in the text to be processed;

fusing the global text feature into the first video feature to obtain a second video feature;

and determining a target fragment matched with the text to be processed from the video to be processed according to the local text characteristics and the second video characteristics.

According to another aspect of the present application, there is provided a media data processing apparatus, the apparatus including:

the data acquisition module is used for acquiring a text to be processed and a video to be processed;

the feature extraction module is used for extracting global text features and local text features corresponding to the text to be processed and first video features of the video to be processed, the global text features comprise phrase features corresponding to phrases contained in the text to be processed, and the local text features comprise features corresponding to unit texts contained in the text to be processed;

the feature fusion module is used for fusing the global text feature into the first video feature to obtain a second video feature;

and the target fragment determining module is used for determining a target fragment matched with the text to be processed from the video to be processed according to the local text characteristics and the second video characteristics.

According to yet another aspect of the present application, there is provided an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the media data processing method of the present application when executing the computer program.

According to yet another aspect of the present application, a computer-readable storage medium is provided, having stored thereon a computer program, which, when executed by a processor, implements the media data processing method of the present application.

Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the various alternative implementations of the media data processing method described above.

The beneficial effect that technical scheme that this application provided brought is:

When a target segment matching a text to be processed needs to be determined from a video to be processed, the method, apparatus, electronic device and computer-readable storage medium provided by the application first process the first video feature based on the global text feature of the text to be processed and the first video feature of the video to be processed to obtain a second video feature, and then determine, from the video segments of the video to be processed, a target segment matching the text to be processed based on the local text features of the text to be processed and the second video feature. Because the global text feature and the local text features describe the full information of the text to be processed more comprehensively, from different granularities, the target segment determined using text features of different granularities (the global text feature and the local text features) better matches the text to be processed and is closer to it semantically.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

Fig. 1 is a schematic flowchart of a media data processing method according to an embodiment of the present application;

fig. 2 is a schematic diagram of a feature adjustment process of a text to be processed and a video to be processed according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a network structure of a neural network model according to an embodiment of the present application;

fig. 4 is a schematic diagram illustrating a data processing flow in an encoder and a decoder according to an embodiment of the present application;

fig. 5 is a schematic diagram of an implementation environment of a media data processing method according to an embodiment of the present application;

fig. 6 is a schematic diagram of an implementation environment of yet another media data processing method according to an embodiment of the present application;

fig. 7 is a schematic structural diagram of a media data processing apparatus according to an embodiment of the present application;

fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any elements and all combinations of one or more of the associated listed items.

The embodiments of the application provide a media data processing method for accurately locating, within a video, the video segment consistent with the semantics expressed by a text. The method is applicable to any scenario in which the video segment corresponding to a text to be processed needs to be determined in a video to be processed. The phrase features of the phrases in the text to be processed and the multi-modal features corresponding to the video to be processed and the text to be processed can be obtained through artificial intelligence technology; in particular, the method relates to fields such as machine learning and deep learning within artificial intelligence, and the data processing involved in the method can be implemented through cloud technology.

The scheme provided in the optional embodiments of the present application may be implemented based on cloud technology, and the data processing (including but not limited to data computation) involved in each optional embodiment may be implemented by cloud computing. Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model; it can form a resource pool that is used on demand, and is flexible and convenient. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, image websites and other web portals, require a large amount of computing and storage resources. With the further development and application of the internet industry, each item may have its own identification mark that needs to be transmitted to a background system for logical processing; data at different levels are processed separately, and industrial data of all kinds require strong backend system support, which can only be realized through cloud computing.

Cloud computing (cloud computing) is a computing model that distributes computing tasks over a pool of resources formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as being infinitely expandable and available at any time, available on demand, expandable at any time, and paid for on-demand.

As a basic capability provider of cloud computing, a cloud computing resource pool (generally called an IaaS, Infrastructure as a Service, platform) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to use as needed. The cloud computing resource pool mainly includes computing devices (virtualized machines, including operating systems), storage devices and network devices. Divided by logical function, a PaaS (Platform as a Service) layer may be deployed on the IaaS layer, and a SaaS (Software as a Service) layer may be deployed on the PaaS layer, or the SaaS may be deployed directly on the IaaS. PaaS is a platform on which software runs, such as databases and web containers; SaaS is business software for various websites, such as web portals. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.

Cloud computing, in the narrow sense, refers to a delivery and usage mode of IT infrastructure, namely obtaining the required resources through the network in an on-demand, easily-extensible manner; in the broad sense, cloud computing refers to a delivery and usage mode of services, namely obtaining the required services through the network in an on-demand, easily-extensible manner. Such services may be IT and software services, internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computing and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization and load balancing.

With the development of diversification of internet, real-time data stream and connecting equipment and the promotion of demands of search service, social network, mobile commerce, open collaboration and the like, cloud computing is rapidly developed. Different from the prior parallel distributed computing, the generation of cloud computing can promote the revolutionary change of the whole internet mode and the enterprise management mode in concept.

The media data processing method provided by the application can also be implemented through an artificial intelligence cloud service, generally known as AIaaS (AI as a Service). This is a service mode of an artificial intelligence platform: the AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud.

This service model is similar to opening an AI-themed store: all developers can access one or more of the artificial intelligence services provided by the platform through an API (application programming interface), and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services. In the present application, the media data processing method may be implemented using the AI framework and AI infrastructure provided by such a platform.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing and machine learning/deep learning.

Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and adversarial learning.

With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.

The scheme provided by the embodiments of the application can be executed by any electronic device; it can be executed by a user terminal device or by a server, where the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services. The terminal device may include at least one of: a smart phone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, smart television and smart vehicle-mounted device.

The following describes the technical solutions of the present application and how to solve the above technical problems in detail with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

A possible implementation manner is provided in the embodiments of the present application, and as shown in fig. 1, a flowchart of a media data processing method is provided, where the scheme may be executed by any electronic device, for example, the scheme of the embodiments of the present application may be executed on a terminal device or a server, or executed by the terminal device and the server interactively. For convenience of description, the method provided by the embodiment of the present application will be described below by taking a server as an execution subject. As shown in the flow chart of fig. 1, the method may comprise the steps of:

and step S110, acquiring the text to be processed and the video to be processed.

In the application, the data sources of the video to be processed and the text to be processed are not limited. Optionally, the video to be processed may be an uncut video, and at least one of the text to be processed and the video to be processed may be data sent by a user through a user terminal and received by a server corresponding to a multimedia data distribution platform, or data acquired by that server from a preset storage space.

The text to be processed may be text containing one or more languages, such as chinese, english, etc., and the language type of the text to be processed is not limited in this application.

Step S120, extracting global text features and local text features corresponding to the text to be processed and first video features of the video to be processed, wherein the global text features comprise phrase features corresponding to phrases contained in the text to be processed, and the local text features comprise features corresponding to unit texts contained in the text to be processed.

The global text feature represents the whole information (full text semantic information) of the text to be processed, that is, the semantic to be expressed by the text to be processed is represented by the global text feature, for example, the phrase features are spliced together to serve as the global text feature of the text to be processed, and the semantic to be expressed by the text to be processed can be known based on the global text feature. The local text features are features representing local information (partial semantic information) of the text to be processed. For example, the local text feature may be a feature vector of each unit text included in the text to be processed, and the semantic corresponding to the feature vector of each unit text represents the local feature of the text to be processed, and then the semantic to be expressed by the text to be processed is represented by the feature vector of each unit text. The unit text may be at least one of a word or a word segmentation.

Optionally, if the unit texts include words, the local text features may include the features of the words contained in the text to be processed. If the unit texts include word segments, the local text features may include the segment features of the word segments contained in the text to be processed; since one word segment may be composed of at least one word, the segment features of a word segment can be obtained based on the features of its words. Compared with single words, word segments can express the semantics of the text more accurately, so the local text features may also include the segment features of each word segment contained in the text to be processed.

In the scheme of the application, the way the video to be processed is divided is not limited; for example, a set number of adjacent video frames can be divided into one video segment, or the adjacent video frames within a set duration can be divided into one video segment.
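For illustration only, the following is a minimal sketch of one such division, assuming per-frame features have already been extracted (e.g., by a pretrained visual backbone) and grouping a set number of adjacent frames into one segment by mean pooling; the function name and the frames_per_segment parameter are hypothetical, not part of the application.

```python
import torch

def split_into_segments(frame_features: torch.Tensor, frames_per_segment: int) -> torch.Tensor:
    """Group adjacent per-frame features into segment features by mean pooling.

    frame_features: (num_frames, feat_dim) tensor of per-frame features.
    Returns: (num_segments, feat_dim) tensor, one feature per video segment.
    """
    num_frames, feat_dim = frame_features.shape
    # Drop trailing frames that do not fill a whole segment (one simple convention).
    num_segments = num_frames // frames_per_segment
    usable = frame_features[: num_segments * frames_per_segment]
    segments = usable.view(num_segments, frames_per_segment, feat_dim)
    return segments.mean(dim=1)

# Example: 100 frames of 512-d features grouped into segments of 8 frames each.
first_video_feature = split_into_segments(torch.randn(100, 512), frames_per_segment=8)
print(first_video_feature.shape)  # torch.Size([12, 512])
```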

The method for extracting the global text feature and the local text feature of the text to be processed and the first video feature will be described below, and will not be described herein again.

And step S130, fusing the global text feature into the first video feature to obtain a second video feature.

The purpose of fusing the global text feature and the first video feature is to determine a second video feature which is preliminarily matched with the text to be processed (which can be understood as coarse-grained positioning) from the first video feature based on the global text feature.

And step S140, determining a target segment matched with the text to be processed from the video to be processed according to the local text characteristics and the second video characteristics.

The local text features can reflect local information of the text to be processed from local parts (which can be understood as fine granularity), and based on the local text features and the second video features, the association relationship between the local text features and the second video features can be captured from local parts, that is, the determined target segments can be closer to the semantics expressed by the text to be processed based on the detail features of the text to be processed provided by the local text features as a supplement to the local text features.

In an alternative, the target segment may be determined based on the degree of matching between the local text features and the second video features; the degree of matching may be characterized by feature similarity, where higher similarity indicates a better match and closer semantics.

Wherein, the target segment refers to a segment of video in the video to be processed.
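As a simple illustration of matching by feature similarity, the sketch below pools the unit-text features into one query vector and picks the most similar segment. This is only one possible reading of the matching step; the mean pooling and cosine similarity are assumptions, not the exact scheme detailed in the later embodiments.

```python
import torch
import torch.nn.functional as F

def pick_best_segment(local_text_features: torch.Tensor,
                      second_video_features: torch.Tensor) -> int:
    """Return the index of the video segment most similar to the text.

    local_text_features: (num_units, dim) features of the unit texts.
    second_video_features: (num_segments, dim) fused video features.
    """
    # Pool the unit-text features into a single query vector (simple mean pooling).
    query = local_text_features.mean(dim=0, keepdim=True)            # (1, dim)
    sims = F.cosine_similarity(query, second_video_features, dim=1)  # (num_segments,)
    return int(sims.argmax())
```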

According to the above scheme, when a target segment matching the text to be processed needs to be determined from the video to be processed, the first video feature can first be processed based on the global text feature of the text to be processed and the first video feature of the video to be processed to obtain the second video feature, and then a target segment matching the text to be processed can be determined from the video segments of the video to be processed based on the local text features of the text to be processed and the second video feature. Because the global text feature and the local text features have different granularities, they describe the full information of the text to be processed more comprehensively, and the determined target segment therefore better matches the text to be processed.

In an embodiment of the present application, extracting global text features and local text features of a text to be processed includes:

acquiring unit texts in a text to be processed and a position relation between the unit texts;

determining the characteristics of each unit text based on the position relation between each unit text and each unit text, wherein the local text characteristics comprise the characteristics of each unit text;

determining phrase characteristics of each phrase contained in the text to be processed based on the characteristics corresponding to each unit text;

and fusing the phrase characteristics of the phrases to obtain global text characteristics.

The phrase features of the phrases can represent the global information of the text to be processed at the phrase level, and the semantics of a phrase reflect the global information of the text to be processed more accurately than the semantics of a single word, so in this scheme the global text feature is determined based on the phrase features of the phrases. Since one phrase may be composed of at least one unit text, the phrase feature of each phrase can be determined based on the features of the unit texts.

In this scheme, when the feature of each unit text is extracted, the contextual relationship (positional relationship) among the unit texts is also considered, i.e., the relationship between a unit text and the unit texts before and after it. The feature of each unit text is therefore extracted based on the unit text itself and the positional relationship among the unit texts, which represents the semantic features of the text to be processed more accurately. The feature of each unit text represents local information of the text to be processed at the unit-text level, so these features can serve as the local text features.
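As an illustrative sketch only: here the unit texts are token ids, a bidirectional GRU models the positional relationship among the unit texts, and 1-D convolutions of widths 1 to 3 stand in for the phrase feature extraction; none of these specific layers is mandated by the application.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """One possible way to produce local (unit-level) and global (phrase-level) text features."""

    def __init__(self, vocab_size: int, emb_dim: int = 300, hid_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # A bidirectional GRU injects the positional/contextual relationship between unit texts.
        self.context = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        # 1-D convolutions over the unit features act as phrase (n-gram) feature extractors.
        self.phrase_convs = nn.ModuleList(
            [nn.Conv1d(2 * hid_dim, hid_dim, kernel_size=k, padding=k // 2) for k in (1, 2, 3)]
        )

    def forward(self, token_ids: torch.Tensor):
        emb = self.embedding(token_ids)        # (B, T, emb_dim)
        local, _ = self.context(emb)           # (B, T, 2*hid_dim) -> local text features
        x = local.transpose(1, 2)              # (B, 2*hid_dim, T)
        # Max-pool each n-gram response over time, then fuse the phrase features.
        phrase = [conv(x).max(dim=2).values for conv in self.phrase_convs]
        global_feat = torch.cat(phrase, dim=1)  # (B, 3*hid_dim) -> global text feature
        return local, global_feat

encoder = TextEncoder(vocab_size=10000)
local_feat, global_feat = encoder(torch.randint(0, 10000, (1, 12)))
print(local_feat.shape, global_feat.shape)  # (1, 12, 512) and (1, 768)
```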

In one embodiment of the application, the first video features comprise features of a plurality of video segments in the video to be processed; fusing the global text feature into the first video feature to obtain a second video feature, wherein the method comprises the following steps:

fusing the global text feature and the first video feature to obtain a fused video feature;

acquiring corresponding position information characteristics of each video clip in a video to be processed;

overlapping the characteristics of each video segment in the fused video characteristics and the position information characteristics corresponding to each video segment to obtain overlapped video characteristics;

and obtaining a second video characteristic based on the superposed video characteristics.

The purpose of fusing the global text features and the first video features is to determine the video features which are preliminarily matched with the text to be processed in the first video features based on the global text features.

In consideration of the time sequence of each video segment in the video to be processed, namely the corresponding position information of each video segment in the video to be processed, the feature of each video segment in the fusion video feature and the corresponding position information feature of each video segment are overlapped, so that the second video feature contains the position information of each video segment, namely the overlapped video feature is the video feature containing the position information of each video segment. The position information features are features corresponding to the position information.
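One minimal way to realize this fusion and superposition is sketched below; the element-wise fusion and the externally supplied position encoding are illustrative assumptions (concatenation followed by a linear layer, or learned position embeddings, would be equally valid choices).

```python
import torch

def fuse_and_add_position(first_video_feature: torch.Tensor,
                          global_text_feature: torch.Tensor,
                          position_encoding: torch.Tensor) -> torch.Tensor:
    """Fuse the global text feature into every segment feature, then superimpose position features.

    first_video_feature: (num_segments, dim)
    global_text_feature: (dim,)  assumed already projected to the same dimension
    position_encoding:   (num_segments, dim) position information features
    """
    # One simple fusion: broadcast the global text feature to every segment and
    # combine by element-wise product.
    fused = first_video_feature * global_text_feature.unsqueeze(0)
    # Superimpose (add) the position information feature of each segment.
    return fused + position_encoding
```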

In one embodiment of the application, the first video features comprise features of a plurality of video segments in the video to be processed; obtaining a second video feature based on the superimposed video features, comprising:

determining the weight corresponding to each video clip based on the association relationship among the video clips in the video features after superposition;

for each video clip, obtaining an enhanced feature corresponding to the video clip based on the weight corresponding to the video clip and the feature corresponding to the video clip in the video features after superposition;

and extracting to obtain second video characteristics based on the enhanced characteristics corresponding to the video clips.

For one video clip, the association relationship among the video clips in the video features after superposition includes both the relationship between the video clip and itself and the relationships between the video clip and the other video clips. In the video features after superposition, different video clips have different degrees of importance within the video to be processed; this importance is represented by the weight corresponding to each video clip, and the larger the weight, the more important the clip.

The enhanced features refer to adjusting the importance degree of the video features after superposition corresponding to each video segment based on the weight of each video segment.

The enhanced feature extraction corresponding to each video segment is to obtain a deeper feature expression, i.e. to make the second video feature contain more detailed video features.

Alternatively, the weight corresponding to each video segment may be determined by a self-attention mechanism.
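A minimal sketch of such a self-attention step over the superimposed segment features follows; the layer sizes and the use of torch.nn.MultiheadAttention are illustrative assumptions rather than the application's prescribed architecture.

```python
import torch
import torch.nn as nn

class SegmentSelfAttention(nn.Module):
    """Weight segments by their mutual association via self-attention, then project."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, superimposed: torch.Tensor) -> torch.Tensor:
        # superimposed: (B, num_segments, dim); the attention weights encode the association
        # between every pair of segments (including a segment with itself).
        enhanced, _ = self.attn(superimposed, superimposed, superimposed)
        # Further extraction on the enhanced features yields the second video feature.
        return self.proj(enhanced)

attn_block = SegmentSelfAttention(dim=512)
second_video_feature = attn_block(torch.randn(1, 12, 512))  # (batch, segments, dim)
```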

In an embodiment of the present application, determining a target segment matching a text to be processed from a video to be processed according to a local text feature and a second video feature includes:

determining the associated characteristics of the local text characteristics and the second video characteristics;

adjusting the local text characteristics according to the association characteristics and the second video characteristics to obtain adjusted local characteristics;

adjusting the second video characteristics according to the association characteristics and the local text characteristics to obtain adjusted video characteristics;

and determining a target segment matched with the text to be processed from the video to be processed based on the adjusted local features and the adjusted video features.

For the text to be processed and the video to be processed, the text to be processed can guide the video to be processed to focus on important information in the video, and the video to be processed can also guide important information (such as keywords) in the text to be processed, so that the finally obtained target segment can be more matched with the text to be processed based on bidirectional feature adjustment.

The associated features characterize the association relationship between the second video features and the local text features, from the perspective of the text, the associated features may include features related to the second video features in the local text features, and from the perspective of the video, the associated features may include features related to the local text features in the second video features.

Adjusting the local text features based on the association features and the second video features means using the video to be processed to guide attention to the important information in the text to be processed: from the association features and the second video features, it can be known which information in the local text features is more important (fusing the detail information of the text). Adjusting the second video features based on the association features and the local text features means using the text to be processed to guide attention to the important information in the video to be processed: from the association features and the local text features, it can be known which information in the second video features is more important.

Optionally, adjusting the local text features according to the association features and the second video features to obtain the adjusted local features may specifically include:

acquiring a first weight of each unit text in local text characteristics; and adjusting the first weight of each unit text in the local text features according to the associated features and the second video features, and obtaining the adjusted local features based on the features of each unit text and the adjusted weights corresponding to each unit text.

The adjustment of the local text features can be understood as adjusting the first weight corresponding to each unit text in the local text features, and the larger the adjusted weight is, the more important the corresponding unit text is relative to the text to be processed is. The second video features are obtained by fusing the global text features and the first video features, so that the local text features are adjusted, and the detail (local) information omitted from the second video features can be captured through the adjusted local features.

Similarly, the first video feature includes the features of a plurality of video segments in the video to be processed, and one video segment may be composed of at least one video frame image. Adjusting the second video features according to the association features and the local text features may refer to adjusting, according to the association features and the local text features, a second weight corresponding to each video segment in the video to be processed; the larger the adjusted weight, the more important the corresponding video segment is within the video to be processed.
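The bidirectional adjustment can be sketched as follows, with the association feature taken to be the pairwise similarity matrix between unit texts and segments; the additive update is one simple choice for illustration, not necessarily the exact form used in the application.

```python
import torch
import torch.nn.functional as F

def co_attention_adjust(local_text: torch.Tensor, video: torch.Tensor):
    """Bidirectional adjustment of text and video features via an association matrix.

    local_text: (T, dim) unit-text features; video: (S, dim) second video features.
    Returns the adjusted local features and the adjusted video features.
    """
    # Association feature: pairwise similarity between every unit text and every segment.
    assoc = local_text @ video.t()                    # (T, S)
    # Text-to-video attention: each segment attends over the unit texts.
    t2v = F.softmax(assoc, dim=0)                     # normalize over text units
    adjusted_video = video + t2v.t() @ local_text     # (S, dim)
    # Video-to-text attention: each unit text attends over the segments.
    v2t = F.softmax(assoc, dim=1)                     # normalize over segments
    adjusted_text = local_text + v2t @ video          # (T, dim)
    return adjusted_text, adjusted_video
```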

Referring to the video to be processed and the text to be processed shown in fig. 2 as an example, the text to be processed is: "A person is standing and eating something while he is watching television." The text to be processed contains word segments such as "person", "standing", "eating", "something", "watching" and "television". In the first frames of the video to be processed, a person is standing and watching television while also eating something; in the last frames, the person is only standing and watching television, without eating anything.

In this example, the first weights and the second weights are determined by the self-attention mechanism; the video-to-text attention values shown in fig. 2 are the first weights, and the text-to-video attention values shown in fig. 2 are the second weights. The valid range of both the text-to-video attention values and the video-to-text attention values is 0 to 1.

As can be seen from fig. 2, the attention values (the adjusted first weights) corresponding to the word segments "eating" and "watching" in the text to be processed are relatively large (the deeper the color of a segment's mark, the larger its attention value), i.e., "eating" and "watching" are relatively important information in the text to be processed. The attention values (the adjusted second weights) corresponding to the two frames, image A and image B, in the video to be processed are also relatively large (again, the deeper the color of the mark, the larger the attention value), i.e., image A and image B are relatively important information in the video to be processed. In both image A and image B, the person is standing and watching television while eating.

Thus, based on the word segments "eating" and "watching" in the text to be processed and on image A and image B in the video to be processed, the target segment (the video segment from 0 s to 8.3 s shown in fig. 2) matching the text to be processed can be accurately determined from the video to be processed.

In one embodiment of the application, the first video features comprise features of a plurality of video segments in the video to be processed; determining a target segment matched with the text to be processed from the video to be processed based on the adjusted local features and the adjusted video features, wherein the step of determining the target segment comprises the following steps:

according to the adjusted local features, determining guide information of the text to be processed on the video to be processed;

adjusting the characteristics of each video segment in the adjusted video characteristics according to the guide information to obtain third video characteristics;

and determining a target segment matched with the text to be processed from the video to be processed according to the third video characteristics.

The adjusted local text features supplement the global text features and provide some detailed information in the text to be processed, so that the guidance information determined based on the adjusted local features can fully transmit the semantic information of the text to be processed. And adjusting the characteristics of each video clip according to the guide information, namely filtering out the characteristics irrelevant to text semantic information from the video clips to obtain third video characteristics so that the target clip determined based on the third video characteristics is more matched with the text to be processed.

Optionally, adjusting the features of each video segment in the adjusted video features according to the guidance information may mean adjusting the weights corresponding to the features of the video segments, with the size of the adjusted weight representing the importance of the video feature: video features whose adjusted weights are relatively small are filtered out, while video features whose adjusted weights are relatively large are retained.
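A minimal sketch of such text-guided filtering, assuming the guidance information is a pooled vector of the adjusted local features and the filtering is a learned per-segment gate (both assumptions for illustration):

```python
import torch
import torch.nn as nn

class TextGuidedFilter(nn.Module):
    """Use guidance derived from the adjusted local text features to re-weight
    (filter) each segment of the adjusted video features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, adjusted_text: torch.Tensor, adjusted_video: torch.Tensor) -> torch.Tensor:
        # Guidance information: pool the adjusted local features into one vector.
        guide = adjusted_text.mean(dim=0, keepdim=True)       # (1, dim)
        guide = guide.expand(adjusted_video.size(0), -1)      # (S, dim)
        # A per-segment gate in (0, 1): small values suppress segments irrelevant
        # to the text semantics, large values keep them.
        weights = torch.sigmoid(self.gate(torch.cat([adjusted_video, guide], dim=1)))  # (S, 1)
        third_video_feature = weights * adjusted_video
        return third_video_feature
```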

In an embodiment of the application, determining a target segment matching a text to be processed from a video to be processed according to a third video feature includes:

determining the weight corresponding to each video clip according to the association relationship among the features of each video clip contained in the third video features;

weighting the third video features of the video clips based on the weights corresponding to the video clips to obtain fourth video features;

determining the position information of a target segment matched with the text to be processed in the video to be processed based on the fourth video characteristics;

and determining a target segment which is determined to be matched with the text to be processed in the video to be processed based on the position information.

For the third video feature corresponding to one video clip, the weight corresponding to the third video feature represents the importance degree of the video clip in the video to be processed, and the fourth video feature determined based on the weight corresponding to each video clip can more accurately reflect the video feature matched with the text to be processed in the video to be processed.

The position information includes a start position and an end position corresponding to the video segment, and the position information can be specifically represented by time information, and the start time and the end time of one video segment represent the position information of the video segment.

Optionally, based on the fourth video feature, a pre-trained video segment determination network may be used to determine the position information corresponding to the target segment in the video to be processed and, based on the position information, to determine which video segment is the target segment.
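The following sketch weights the segments by their mutual association, forms the fourth video feature, and regresses a start/end position; the normalized start/end times and the specific layers are illustrative assumptions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentLocalizer(nn.Module):
    """Weight segments by mutual association, form the fourth video feature,
    and regress the start/end position of the target segment."""

    def __init__(self, dim: int):
        super().__init__()
        self.regressor = nn.Linear(dim, 2)  # predicts (start, end), e.g. normalized times

    def forward(self, third_video_feature: torch.Tensor):
        # third_video_feature: (S, dim)
        # Association relationship between every pair of segments -> per-segment weight.
        assoc = third_video_feature @ third_video_feature.t()        # (S, S)
        weights = F.softmax(assoc.mean(dim=1, keepdim=True), dim=0)  # (S, 1)
        fourth_video_feature = weights * third_video_feature         # (S, dim)
        pooled = fourth_video_feature.sum(dim=0)                     # (dim,)
        start_end = torch.sigmoid(self.regressor(pooled))            # normalized (start, end)
        return start_end, weights.squeeze(1)
```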

The input of the video segment determination network is the video features of a video, and its output is the position information corresponding to each video segment in the video. The video segment determination network can be trained in the following way (a code sketch follows the steps):

acquiring training data, wherein the training data comprises a plurality of sample videos carrying position labels, and for one sample video, the position labels represent corresponding position information of each video segment in the sample video;

for a sample video, extracting video features corresponding to the sample video, wherein the video features comprise segment features corresponding to all video segments;

for a sample video, inputting the video characteristics of the sample video into an initial neural network model to obtain the predicted position information corresponding to each video segment in the sample video;

determining a training loss based on the predicted position information corresponding to each sample video and the position information corresponding to each position tag, wherein for one sample video, the value of the training loss represents the difference between the predicted position information corresponding to the sample video and the position information corresponding to the position tag corresponding to the sample video;

and if the training loss meets the training end condition, taking the corresponding model as the video segment determination network; if not, adjusting the model parameters of the initial neural network model and continuing to train it based on the training data.
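A minimal sketch of this training loop, assuming a hypothetical iterable training_data of (video_features, position_label) pairs and a model whose output has the same shape as the label; nn.SmoothL1Loss is one reasonable choice of position-regression loss, consistent in form with the threshold-based loss described in claim 9 (least-squares below the threshold, absolute-error based above it), but the actual loss is not prescribed here.

```python
import torch
import torch.nn as nn

def train_position_network(model: nn.Module, training_data, epochs: int = 10, lr: float = 1e-4):
    """Regress the predicted position information toward the position labels."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.SmoothL1Loss()  # quadratic below a threshold, absolute-error above it
    for _ in range(epochs):
        for video_features, position_label in training_data:
            predicted = model(video_features)             # predicted position information
            loss = loss_fn(predicted, position_label)     # difference from the position label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```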

In one embodiment of the application, global text features of a text to be processed are extracted, the global text features are fused into first video features to obtain second video features, and target segments matched with the text to be processed are determined from the video to be processed according to the local text features and the second video features and are obtained on the basis of a neural network model;

the neural network model comprises a phrase feature extraction network, a multi-mode feature extraction network and a video segment determination network, and is obtained by training in the following way, and specifically comprises the following steps:

step 1, training data is obtained, the training data comprises a plurality of samples, each sample comprises a sample video and a sample text, each sample carries a position label, and for one sample, the position label represents position information of a target video fragment corresponding to the sample text in the sample video.

The position label may be a character, or the like, and the specific expression form of the position label is not limited in this application.

Step 2, for each sample in the training data, extracting the global text feature and the local text feature of the sample text in the sample and the video feature of the sample video.

As in the foregoing method, the global text features and local text features of the sample text are extracted in the same manner as the global text features and local text features of the text to be processed, and are not described here again. The local text features of the sample text comprise the feature of each unit text contained in the sample text, and the global text features of the sample text comprise the phrase feature of each phrase contained in the sample text; the feature of each unit text contained in the sample text may also be extracted in other manners, for example, based on a pre-trained text feature extraction network. The video features of the sample video can likewise be extracted based on a trained network, which can increase the model training speed.

Step 3, for each sample, inputting the features of the unit texts in the sample text into the phrase feature extraction network to obtain the predicted phrase features of the phrases in the sample text.

Step 4, determining a first loss value based on the matching degree between the predicted phrase features of the phrases corresponding to each sample, wherein, for one sample, the first loss value represents the semantic difference between the phrases in the sample.

The matching degree between the predicted phrase features of the phrases can be characterized by feature similarity; the more two phrases match, the closer their semantics are and the smaller the semantic difference is.

Step 5, for each sample, inputting the global text features and local text features of the sample text and the video features of the sample video into the multi-modal feature extraction network to obtain the multi-modal video features corresponding to the sample video.

The process of determining the multi-modal video features based on the global text features and the local text features of the sample text and the video features of the sample video is consistent with the process of determining the third video features based on the global text features and the local text features of the text to be processed and the first video features of the video to be processed, which is described in the foregoing, and is not repeated here.

Specifically, at the decoder side, the video features of the sample video can be adjusted based on the association features between the local text features of the sample text and the video features of the sample video, and meanwhile, the local text features of the sample text can be adjusted based on the association features, so that the determined multi-modal video features are more accurate in consideration of the mutual influence between the text features and the video features during model training.

Optionally, at the decoder side, an association feature between the local text feature of the sample text and the video feature of the sample video may be determined based on a cooperative self-attention mechanism, and the local text feature and the video feature may be alternately adjusted based on the association feature. This part will be described in detail below and will not be described in detail here.

Step 6, for each sample, inputting the multi-modal video features into the video segment determination network to obtain the weight corresponding to each sample video segment in the multi-modal video features, and obtaining the predicted position information, in the sample video, of the predicted video segment corresponding to the sample text based on the multi-modal video features and the weights corresponding to the sample video segments.

The video features of the sample video comprise features of a plurality of sample video segments.

Step 7, determining a second loss value based on the predicted position information corresponding to each sample and each position label, wherein the second loss value represents the difference between the predicted position information corresponding to each sample and the position label of that sample.

For a sample, the second loss value corresponding to the sample represents a difference between the predicted position information corresponding to the sample and the position label corresponding to the sample, that is, a difference between the predicted position information and the position information corresponding to the position label.

Step 8, determining a third loss value based on the weight corresponding to each sample video segment in the multi-modal video features corresponding to each sample and the position label corresponding to each sample, wherein, for one sample, the third loss value represents the possibility that each sample video segment in the sample is the target video segment.

Wherein, for a sample, the greater the weight, the greater the probability that the corresponding sample video segment is the target video segment.

Step 9, determining the value of a training loss function corresponding to the neural network model based on the first loss value, the second loss value and the third loss value; and if the training loss function is converged, taking the corresponding model during convergence as a final neural network model, if the model is not converged, adjusting model parameters of the neural network model, and training the neural network model based on training data.

Alternatively, the second loss value may be represented by an L1 average absolute error loss function, and the second loss value is an average value of the sum of absolute differences between the position information corresponding to the position tag and the predicted position information.

Optionally, the second loss value may instead be characterized by an L2 least square error loss function, that is, the second loss value is the average of the sum of squared differences between the position information corresponding to the position label and the predicted position information.

The inputs of the second loss function are normalized to the range 0-1. When the position deviation is close to 0, that is, the difference between the predicted position information of a sample and the position information of its position label is small, the gradient of the L2 loss is smaller than that of the L1 loss, so using L2 to determine the second loss value gives better training stability. When the deviation is larger (and considering that the inputs are all smaller than 1), the L1 loss penalizes the deviation more strongly than the L2 loss, so using L1 to determine the second loss value gives higher accuracy.

The specific scheme is as follows: in an embodiment of the present application, for a sample, determining the second loss value based on the predicted position information and the position label corresponding to the sample includes:

determining a position deviation value based on the predicted position information and the position label corresponding to the sample;

if the absolute value of the position deviation value is smaller than the threshold parameter, determining a second loss value based on a least square error loss function corresponding to the position deviation value;

and if the absolute value of the position offset value is not less than the threshold parameter, determining a second loss value based on a loss function corresponding to the position offset value, wherein the loss function comprises an average absolute error loss function and the threshold parameter.

For a second loss value corresponding to one sample, see in particular the following formula:

$$f_{reg}(\delta)=\begin{cases}\delta^{2}, & |\delta|<\beta\\ 2\beta|\delta|-\beta^{2}, & |\delta|\ge\beta\end{cases}\qquad L_{reg}=\frac{1}{N}\sum_{i=1}^{N}f_{reg}(\delta_i)$$

where $\delta_i$ is the position deviation value corresponding to the i-th sample in the training data, namely the difference between the predicted position information corresponding to the sample and the position information corresponding to the position label; β is a threshold parameter; |δ| represents the absolute value of the position deviation value; $f_{reg}(\delta)=2\beta|\delta|-\beta^{2}$ is the loss function corresponding to the position deviation value; $f_{reg}(\delta)=\delta^{2}$ is the least square error (L2) loss function corresponding to the position deviation value; $L_{reg}$ is the second loss value corresponding to any one sample in the training data; N represents the number of samples in the training data; and $0\le\delta\le 1$.

When the absolute value of the position deviation value is smaller than the threshold parameter, the second loss value is determined by the L2 loss function $f_{reg}(\delta)=\delta^{2}$ corresponding to the position deviation value.

When $|\delta|\ge\beta$, i.e. the absolute value of the position deviation value is not less than the threshold parameter, the second loss value is determined by the loss function $f_{reg}(\delta)=2\beta|\delta|-\beta^{2}$ corresponding to the position deviation value.
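As a sketch of this piecewise choice between the L2 and L1 style penalties, the hedged example below implements the position regression loss reconstructed above; the function name and tensor shapes are illustrative assumptions.

```python
import torch

def position_regression_loss(pred, target, beta: float = 0.1):
    """Second loss value: L2 penalty for small position deviations, L1-style penalty otherwise.
    pred and target are tensors of normalized positions in [0, 1] with the same shape."""
    delta = pred - target                    # position deviation value
    abs_delta = delta.abs()
    l2 = delta ** 2                          # f_reg(delta) = delta^2                 when |delta| < beta
    l1 = 2 * beta * abs_delta - beta ** 2    # f_reg(delta) = 2*beta*|delta| - beta^2 otherwise
    return torch.where(abs_delta < beta, l2, l1).mean()   # average over the samples
```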

Optionally, the predicted position information includes a predicted starting position and a predicted ending position, and the position information corresponding to the position tag includes a labeled starting position and a labeled ending position.

Then, for a sample, the second loss value corresponding to that sample includes a start loss characterizing the difference between the labeled start position and the predicted start position and an end loss characterizing the difference between the labeled end position and the predicted end position.

The formula corresponding to the second loss value may be:

$$L_{reg}=\frac{1}{N}\sum_{i=1}^{N}\Big(f_{reg}(\delta_i^{s})+f_{reg}(\delta_i^{e})\Big)$$

where $\delta_i^{s}$ denotes the difference (position deviation value) between the labeled start position corresponding to the i-th sample and the predicted start position, $f_{reg}(\delta_i^{s})$ is the start loss corresponding to the i-th sample, $f_{reg}(\delta_i^{e})$ is the end loss corresponding to the i-th sample, their sum is the second loss value corresponding to the i-th sample, $L_{reg}$ is the second loss value, and N represents the number of samples in the training data.

The following describes the training of the neural network model in detail with reference to the schematic diagram of the neural network structure shown in fig. 3:

the neural network model comprises a cascade input coding module, a multi-mode fusion module (a multi-mode feature extraction network) and a time sequence positioning module. Wherein, the input coding module comprises a phrase feature extraction network (SPE shown in the figure).

The method comprises the steps of obtaining training data, wherein the training data comprise a plurality of samples, each sample comprises a sample video and a sample text, each sample carries a position tag, and for one sample, the position tag represents position information of a target video fragment corresponding to the sample text in the sample video.

The following describes the processing flow of each module related to the scheme of the present application, taking a sample as an example:

First, the sample is input to the input encoding module, which includes a pre-trained text feature extraction network, a phrase feature extraction network and a video feature extraction network; in this example, the text feature extraction network may optionally be a bidirectional LSTM (Long Short-Term Memory network, the Bi-LSTM shown in the figure).

For the sample text, taking a word as the unit text in this example, the feature (feature vector) of each word in the sample text may be extracted through GloVe. Specifically, the embedded expression (initial feature) of each word in the sample text may be obtained through GloVe; optionally, a 300-dimensional embedding may be extracted. Then, based on the initial features of the words, the feature of each word containing the context relationship is obtained through the bidirectional LSTM, which may be represented as $Q_{in}=\{q_1,q_2,\ldots,q_L\}$, where L is the number of words in the sample text and $q_i$ represents the feature of the i-th word.

The features of each word (which may be referred to as a word vector) may be obtained by the following formula:

$$q_i=[\overrightarrow{h_i};\overleftarrow{h_i}],\qquad 1\le i\le L$$

where L represents the number of words in the sample text, $\overrightarrow{h_i}$ is the feature containing history information (the words before a word) obtained using the forward LSTM, $\overleftarrow{h_i}$ is the feature containing future information (the words after a word) obtained using the backward LSTM, and $q_i$ is the feature corresponding to the i-th word containing the context information, with $1\le i\le L$.

The local text features of the sample text include features corresponding to each word.
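A minimal sketch of this word-level encoding is given below, assuming the GloVe embeddings have already been looked up into a tensor of initial word features; the hidden size is an illustrative assumption.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Bi-LSTM over GloVe word embeddings: one context-aware feature per word (local text features Q_in)."""
    def __init__(self, embed_dim: int = 300, hidden_dim: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, glove_embeddings):          # (batch, L, 300) initial word features
        # The forward direction carries history (preceding words), the backward direction carries
        # future words; the concatenated hidden states are the context-aware word features.
        q_in, _ = self.bilstm(glove_embeddings)   # (batch, L, 2 * hidden_dim)
        return q_in

words = torch.randn(1, 10, 300)                   # e.g. a 10-word sample text
local_text_features = WordEncoder()(words)
```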

A sample text can comprise a plurality of phrases, and a phrase can be a word or at least two words. For a sample text, all information expressed by the text cannot be comprehensively and accurately summarized through a text feature corresponding to the sample text, so that in the embodiment, each phrase in the sample text can be determined based on each word and the position relation of each word, then the phrase feature of each phrase is extracted through a phrase feature extraction network, and then the phrase feature of each phrase is fused to obtain the global text feature of the sample text.

The global text feature can be recorded as $G=[g^{(1)},g^{(2)},\ldots,g^{(K)}]$, where K is the number of phrases and $g^{(1)}$ represents the phrase feature of the first phrase.

Optionally, for a sample text, the number of phrases in the sample text is 3, and at this time, the performance of the model is the best.

In this example, the phrase features of each phrase may be extracted based on a phrase feature extraction network, which may be trained based on the following training modes:

extracting the characteristics of each character in each sample text in the training data;

for each sample text, inputting the characteristics of each word in the sample text into a phrase characteristic extraction network, so that the phrase characteristic extraction network determines the predicted phrase characteristics of each phrase in the sample text based on the characteristics of each word;

and determining a first loss value based on the matching degree of the predicted phrase characteristics of the phrases corresponding to the samples, wherein the first loss value represents the semantic difference between the phrases in the sample text for one sample text.

Alternatively, for a sample text, phrase features of phrases contained in the sample text (predicted phrase features of phrases) may be determined based on features of words in the sample text and weights of the words.

Wherein the weight of each phrase can be expressed as:

$$A_{spe}=\mathrm{softmax}\big(W_{s2}\tanh(W_{s1}Q_{in})\big)$$

where softmax is a normalization function, $W_{s2}$ and $W_{s1}$ are network parameters, $Q_{in}$ is the feature of each word, and $A_{spe}$ is a K × L matrix, L being the number of words in the sample text and K being the number of phrases in the sample text, with K ≤ L. The elements in each row of $A_{spe}$ characterize the weights of the words in the corresponding phrase; for the weight of one word, the weight characterizes the importance degree of that word of the sample text in the current phrase (the phrase in which the word is located).

Multiplying $A_{spe}$ by $Q_{in}$ yields the global text feature G containing the K different semantic terms, that is, the global text feature corresponding to the sample text.

For a sample, the corresponding first loss value of the sample can be expressed as:

$$L_{sd}=\left\|A_{spe}A_{spe}^{\top}-I\right\|_{F}$$

where $L_{sd}$ is the first loss value corresponding to the sample and $A_{spe}^{\top}$ is the transpose of $A_{spe}$. Each element of $A_{spe}A_{spe}^{\top}$ characterizes the similarity between two phrases: the diagonal elements characterize the similarity (1) of each phrase with itself, and the off-diagonal elements represent the similarity between one phrase and the other phrases. $\|\cdot\|_{F}$ is the Frobenius norm, and I is an identity matrix used to set the diagonal elements of $A_{spe}A_{spe}^{\top}$ to 0; the smaller the off-diagonal elements, the better the diversity between the extracted phrases.

Based on the characteristics of each word in the sample text, the global text characteristics of the sample text can be obtained through the trained phrase characteristic extraction network, and the global text characteristics can be specifically expressed as:

$$G=A_{spe}Q_{in}$$

where $A_{spe}$ is the phrase weight matrix, $Q_{in}$ is the feature of each word, and G represents the global text feature containing the K different semantic terms.
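The phrase feature extraction and the first (diversity) loss can be sketched as follows; the matrix shapes follow the description above (K phrases over L words), while the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PhraseExtractor(nn.Module):
    """Self-attentive phrase extraction: A_spe = softmax(W_s2 tanh(W_s1 Q_in)), G = A_spe Q_in."""
    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256, num_phrases: int = 3):
        super().__init__()
        self.w_s1 = nn.Linear(feat_dim, hidden_dim, bias=False)
        self.w_s2 = nn.Linear(hidden_dim, num_phrases, bias=False)

    def forward(self, q_in):                                    # (batch, L, feat_dim) word features
        scores = self.w_s2(torch.tanh(self.w_s1(q_in)))         # (batch, L, K)
        a_spe = torch.softmax(scores, dim=1).transpose(1, 2)    # (batch, K, L) word weights per phrase
        g = torch.bmm(a_spe, q_in)                              # (batch, K, feat_dim) global text feature G
        return g, a_spe

def diversity_loss(a_spe):
    """First loss value L_sd: penalizes overlap (off-diagonal similarity) between phrases."""
    eye = torch.eye(a_spe.size(1), device=a_spe.device)
    gram = torch.bmm(a_spe, a_spe.transpose(1, 2))              # (batch, K, K) phrase-to-phrase similarity
    return torch.linalg.matrix_norm(gram - eye).mean()          # Frobenius norm, averaged over the batch
```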

For the sample video, video features can be extracted based on a video feature extraction network, which can include a 3D convolutional neural network (denoted as $f_v(\cdot)$) and at least one fully-connected layer: a first video feature is first extracted from the sample video through the 3D convolutional neural network, and the first video feature then passes through the fully-connected layer to obtain the embedded expression of the first video feature of the sample video:

$$V_{in}=\{v_1,v_2,\ldots,v_N\}$$

where $V_{in}$ is the first video feature comprising the features of a plurality of video segments in the sample video and $v_1$ is the feature of the first video segment.

The video feature extraction network may be written as:

$$V_{in}=\mathrm{ReLU}\big(W_v f_v(X)\big)$$

where X is the sample video, $W_v$ is a parameter of the video feature extraction network, and ReLU is the nonlinear activation function of the video feature extraction network.
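A hedged sketch of the video feature extraction described above follows: a 3D convolutional backbone and a fully-connected layer with ReLU. The backbone here is a stand-in placeholder, not the specific pre-trained $f_v(\cdot)$ used in this application.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """V_in = ReLU(W_v * f_v(X)): per-segment 3D-CNN features projected by a fully-connected layer."""
    def __init__(self, cnn_dim: int = 1024, feat_dim: int = 512):
        super().__init__()
        # Placeholder for a 3D CNN f_v(.) that outputs one cnn_dim vector per video segment.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, cnn_dim, kernel_size=(4, 7, 7), stride=(4, 7, 7)),
            nn.AdaptiveAvgPool3d(1))
        self.fc = nn.Linear(cnn_dim, feat_dim)

    def forward(self, clips):                   # (num_segments, 3, T, H, W) raw video segments
        f_v = self.backbone(clips).flatten(1)   # (num_segments, cnn_dim)
        return torch.relu(self.fc(f_v))         # (num_segments, feat_dim) first video feature V_in
```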

For a sample, inputting the local text features and the global text features of the sample text in the sample and the video features of the sample video into a multi-modal feature extraction network (a multi-modal fusion module shown in the figure) to obtain multi-modal video features corresponding to the sample video, wherein the multi-modal video features include multi-modal video features corresponding to each sample video fragment.

In the multi-modal feature extraction network, as shown in fig. 3, the multi-modal fusion module includes an encoder and a decoder. At the encoder end, considering that the attention mechanism cannot express the time sequence relationship between the sample video segments in the sample video, the global text feature (the text shown in fig. 4) and the first video feature (the video feature of the sample video, the video shown in fig. 4) are fused to obtain a fused video feature, and the position information features (the PEs shown in fig. 4) of the plurality of sample video segments in the sample video are then obtained; the corresponding position information feature (denoted as $f_{PE}(\cdot)$) is added to the fused video feature, that is, the feature of each video segment in the fused video feature and the position information feature corresponding to that video segment are superposed to obtain the superposed video feature, which is the video feature containing the position information.

The encoder includes a feature extraction layer (in this example, the feature extraction layer may be a Multi-Layer Perceptron, MLP) and a self-attention module. After the superposed video feature is obtained, the feature corresponding to each sample video segment in the superposed video feature may be input to the self-attention module, so that the self-attention module determines the weight corresponding to each sample video segment based on the association relationship of the sample video segments in the superposed video feature and, for each sample video segment, obtains the enhanced feature corresponding to that video segment based on the weight corresponding to the sample video segment and the feature corresponding to the video segment in the superposed video feature.

In this example, the self-attention module consists of a multi-head self-attention and a forward propagation network FFN, such as the 3 SelfAttn and FFN pairs shown in fig. 3. For a sample video segment, the superposed video feature corresponding to the sample video segment is input into each of the 3 weight-sharing multi-head self-attention modules. Specifically, the query (Q) vector, the key (K) vector and the value (V) vector are all the same input feature, the input features are fused through the standard attention calculation formula, and the addition and normalization operations then alleviate the gradient vanishing problem that easily occurs when training deep networks, so that the enhanced feature corresponding to the input feature (the superposed video feature) is obtained. The enhanced feature corresponding to each sample video segment is input into a feed-forward neural network (the forward propagation network FFN) to further extract the deep fusion feature of each enhanced feature, and this process (the processing corresponding to SelfAttn and FFN) is repeated M times to obtain 3 enhanced video features. The 3 enhanced video features are then fused through a Multi-Layer Perceptron (MLP) to obtain the second video feature. The purpose of using multi-head self-attention is to reduce the feature dimension and increase the nonlinear capability of the model.

Since the second video feature is determined based on the features of the two modalities, it can be called a first multi-modal feature, and the first multi-modal feature can represent the common features between the global text feature and the first video feature.

The working process of the above encoder can be seen in the following formula:

$$F_{in}^{(i)}=f_u\big(V_{in},G\big)+f_{PE}([1,2,\ldots,N])$$

$$\hat{F}^{(i)}=\mathrm{FFN}\Big(\mathrm{SelfAttn}\big(F_{in}^{(i)}\big)\Big),\qquad i=1,2,3$$

$$F_{enc}=\mathrm{MLP}\big(\hat{F}^{(1)},\hat{F}^{(2)},\hat{F}^{(3)}\big)$$

where $V_{in}$ represents the first video feature comprising the features of a plurality of sample video segments in the sample video, G is the global text feature, $f_u(V_{in},G)$ is a fusion function that fuses the first video feature and the global text feature, N is the number of sample video segments (in this example, N may be 128), $f_{PE}([1,2,\ldots,N])$ is the position information feature of the plurality of sample video segments in the sample video, $F_{in}^{(i)}$ is the superposed video feature corresponding to the i-th encoder branch, $\hat{F}^{(i)}$ represents the deep fusion feature (enhanced feature) obtained by further processing the superposed video feature of the i-th encoder branch through the multi-head self-attention module, and $F_{enc}$ represents the second video feature; the encoder has three branches, i.e., one branch for each SelfAttn and FFN, so i = 1, 2, 3 in this example.

In this example, the encoder has three branches, and the global text feature includes phrase features of 3 phrases, so that each branch in the encoder can fuse one phrase feature in the global text feature with the first video feature to obtain a preliminary fusion feature corresponding to each phrase feature, and then further extract and fuse the preliminary fusion feature through the three branches to obtain a second video feature.

In this example, fusing the global text feature into the first video feature to obtain the second video feature may specifically include: dimension alignment is carried out on each phrase feature in the global text feature and the feature of each video segment in a copying mode, and then the global text feature is fused into the first video feature by adopting a Hadamard Product (Hadamard Product) algorithm (element-level multiplication) to obtain a second video feature.

Specifically, see also a schematic processing flow diagram of the global text feature and the first video feature in the encoder shown in fig. 4, which is consistent with the processing procedure in fig. 3 and is not repeated here.
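Putting the pieces together, one possible reading of the encoder can be sketched as below; the number of attention heads, the learned positional encoding and the layer sizes are assumptions, not the exact configuration of fig. 3 and fig. 4.

```python
import torch
import torch.nn as nn

class EncoderBranch(nn.Module):
    """One of the three encoder branches: fuse a phrase feature into the video feature
    (copy across segments + Hadamard product), add positional encoding, then apply
    multi-head self-attention and an FFN M times."""
    def __init__(self, dim: int = 512, heads: int = 8, m_layers: int = 2, num_segments: int = 128):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(num_segments, dim))   # learned stand-in for f_PE
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.m_layers = m_layers

    def forward(self, v_in, phrase):                  # v_in: (B, N, dim), phrase: (B, dim)
        fused = v_in * phrase.unsqueeze(1)            # dimension alignment by copying + Hadamard product
        x = fused + self.pos[: fused.size(1)]         # superposed video feature
        for _ in range(self.m_layers):
            attn_out, _ = self.attn(x, x, x)          # Q = K = V = x (self-attention)
            x = self.norm1(x + attn_out)              # add & normalize mitigates vanishing gradients
            x = self.norm2(x + self.ffn(x))
        return x

class MultiModalEncoder(nn.Module):
    """Three branches (one per phrase of the global text feature) fused by an MLP into F_enc."""
    def __init__(self, dim: int = 512, num_phrases: int = 3):
        super().__init__()
        self.branches = nn.ModuleList(EncoderBranch(dim) for _ in range(num_phrases))
        self.mlp = nn.Sequential(nn.Linear(dim * num_phrases, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, v_in, global_text):             # global_text: (B, K, dim)
        outs = [b(v_in, global_text[:, k]) for k, b in enumerate(self.branches)]
        return self.mlp(torch.cat(outs, dim=-1))      # second video feature F_enc: (B, N, dim)
```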

At the decoder side, a target video segment matching the sample text may be determined from the sample video based on the local text feature (the text corresponding to the decoder side in fig. 4) and the second video feature output at the encoder side; that is, the second video feature is further processed based on the local text feature so as to obtain a video feature that accurately represents the predicted video segment corresponding to the sample text in the sample video.

Specifically, the decoder includes two branches, one is a video branch and one is a text branch, each branch is composed of a self-attention module (SelfAttn shown in the figure) and a cooperative attention module (BiAttn shown in the figure), the self-attention module can be a multi-head self-attention module, and the cooperative attention module can be a multi-head cooperative attention module.

Similarly, considering that the attention mechanism cannot characterize the time sequence relationship between the words, a position information feature corresponding to each word (i.e., a position feature corresponding to the position information of each word in the sample text, i.e., the PE position code shown in fig. 3 and the PE corresponding to the decoder of fig. 4) may be added to the local text feature.

Firstly, depth features are extracted from the second video feature through the self-attention module of the video branch to obtain the depth video feature, and depth features are extracted from the local text feature through the self-attention module of the text branch to obtain the depth text feature.

In the video branch, the inputs of the branch are the second video feature output by the encoder and the local text feature (the feature vector of each word) of the sample text; the function of the branch is to guide the local text feature to pay attention to the important information in the local text feature through the second video feature.

Specifically, the depth video features and the depth text features are input to the cooperative attention module, so that the depth text features are adjusted by the cooperative attention module, specifically: and firstly determining the correlation characteristics between the depth video characteristics and the depth text characteristics, and adjusting the depth text characteristics based on the correlation characteristics and the depth video characteristics to obtain the adjusted local characteristics.

In the text branch, the inputs of the branch are the second video feature output by the encoder and the local text feature of the sample text, and the function of the branch is to guide the second video feature to pay attention to the important information in the second video feature through the local text feature. Specifically, the depth text feature and the depth video feature are input to the cooperative attention module so that the depth video feature is adjusted through the module, that is, the association features between the depth video feature and the depth text feature are determined, and the depth video feature is adjusted based on the association features and the depth text feature to obtain the adjusted video feature.

In this example, the self-attention module in the decoder may be a multi-head self-attention module; for example, the depth video feature may be obtained by performing depth feature extraction on the second video feature through the multi-head self-attention module. The specific implementation of the self-attention module may refer to the processing of the decoder shown in fig. 4, and the detailed process is consistent with the processing of the superposed video feature by the multi-head self-attention module in the foregoing encoder, which is not described again here. It can be understood that the depth feature extraction performed on the local text feature by the self-attention module is handled in the same way and is likewise not described again here.

The specific implementation process of the decoder can be seen in the following formula:

$$Q_{in}=Q_{in}+f_{PE}([1,2,\ldots,L])$$

$$F_{Q1}=\mathrm{SelfAttn}(Q_{in},Q_{in},Q_{in})$$

$$F_{V1}=\mathrm{SelfAttn}(F_{enc},F_{enc},F_{enc})$$

$$F_{Q2}=\mathrm{BiAttn}(F_{Q1},F_{V1},F_{V1})$$

$$F_{V2}=\mathrm{BiAttn}(F_{V1},F_{Q1},F_{Q1})$$

where $f_{PE}([1,2,\ldots,L])$ encodes the position information of each word in the sample text, $Q_{in}$ is the local text feature, SelfAttn is the self-attention module, $F_{Q1}$ is the depth text feature obtained by processing the local text feature with the self-attention module, $F_{enc}$ is the second video feature, $F_{V1}$ is the depth video feature obtained by processing the second video feature with the self-attention module, BiAttn is the cooperative attention module, $F_{Q2}$ is the adjusted local feature, and $F_{V2}$ is the adjusted video feature.
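A hedged sketch of the decoder's two branches follows; BiAttn is approximated here by standard multi-head cross-attention, which is an assumption about its internal form, and the learned word positional encoding is likewise a stand-in for $f_{PE}$.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Two-branch decoder: self-attention on each modality, then cooperative (cross) attention,
    yielding F_Q2 (adjusted local features) and F_V2 (adjusted video features)."""
    def __init__(self, dim: int = 512, heads: int = 8, max_words: int = 25):
        super().__init__()
        self.word_pos = nn.Parameter(torch.zeros(max_words, dim))          # stand-in for f_PE over words
        self.self_q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bi_q = nn.MultiheadAttention(dim, heads, batch_first=True)    # BiAttn for the text side
        self.bi_v = nn.MultiheadAttention(dim, heads, batch_first=True)    # BiAttn for the video side

    def forward(self, q_in, f_enc):                   # q_in: (B, L, dim), f_enc: (B, N, dim)
        q_in = q_in + self.word_pos[: q_in.size(1)]   # add word position information
        f_q1, _ = self.self_q(q_in, q_in, q_in)       # depth text features F_Q1
        f_v1, _ = self.self_v(f_enc, f_enc, f_enc)    # depth video features F_V1
        f_q2, _ = self.bi_q(f_q1, f_v1, f_v1)         # F_Q2 = BiAttn(F_Q1, F_V1, F_V1)
        f_v2, _ = self.bi_v(f_v1, f_q1, f_q1)         # F_V2 = BiAttn(F_V1, F_Q1, F_Q1)
        return f_q2, f_v2
```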

After the outputs of the two branches are obtained, that is, after the adjusted local features and the adjusted video features are obtained, the guidance information determined based on the adjusted local features can be used for filtering out the features irrelevant to the semantics of the adjusted local features in the adjusted video features through an information gate (IG shown in fig. 3), so that the accuracy of the multi-modal video features corresponding to the sample text in the determined sample video is further ensured.

Specifically, the adjusted local features $F_{Q2}$ may first be fused (information aggregation) to obtain the guidance information $s_{global}$, and the feature of each video segment in the adjusted video features is then adjusted based on the guidance information $s_{global}$ to obtain the third video feature.

The fusion of the adjusted local features $F_{Q2}$ into the guidance information $s_{global}$ may be implemented by a BiLSTM.

The specific process can be seen in the following formula:

$$s_{global}=\mathrm{BiLSTM}(F_{Q2})$$

$$I_{V}=\sigma\big(W_{IG}\,s_{global}\big)\odot\big(W_{IV}F_{V2}\big)$$

where $F_{Q2}$ is the adjusted local feature, $s_{global}$ is the guidance information, $W_{IG}$ and $W_{IV}$ are parameters, $F_{V2}$ is the adjusted video feature, $I_{V}$ is the third video feature (the multi-modal video feature corresponding to the sample video), $\odot$ denotes element-wise multiplication, and σ is the Sigmoid activation function.
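The information gate can be sketched as follows; the mean-pooled BiLSTM output and the sigmoid gating form are assumptions consistent with the parameters $W_{IG}$, $W_{IV}$ and the Sigmoid activation mentioned above, not a confirmed implementation.

```python
import torch
import torch.nn as nn

class InformationGate(nn.Module):
    """Aggregates F_Q2 into guidance s_global via a BiLSTM, then filters the adjusted video
    features F_V2 with a sigmoid gate to obtain the third video feature I_V."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.w_ig = nn.Linear(dim, dim)
        self.w_iv = nn.Linear(dim, dim)

    def forward(self, f_q2, f_v2):                    # f_q2: (B, L, dim), f_v2: (B, N, dim)
        h, _ = self.bilstm(f_q2)
        s_global = h.mean(dim=1)                      # guidance information s_global: (B, dim)
        gate = torch.sigmoid(self.w_ig(s_global)).unsqueeze(1)     # (B, 1, dim) information gate
        return gate * self.w_iv(f_v2)                 # third (multi-modal) video feature I_V
```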

After obtaining the multimodal video features, for a sample, inputting the multimodal video features to a video segment determination network (a time sequence positioning module shown in fig. 3), obtaining weights corresponding to each sample video segment in the multimodal video features, and obtaining predicted position information, in the sample video, of a predicted video segment corresponding to a sample text, based on the multimodal video features and the weights corresponding to each sample video segment, that is, a start time and an end time of the predicted video segment corresponding to the sample text, in the sample video.

Specifically, the weight $a_i$ corresponding to each video segment is determined according to the association relationship between the features of the sample video segments contained in the third video feature, the third video features of the sample video segments are weighted based on the corresponding weights to obtain the fourth video feature, and the predicted position information of the predicted video segment in the sample video, including a predicted start time and a predicted end time, is determined based on the fourth video feature.

The predicted position information corresponding to one sample can be specifically determined by the following formula:

$$\bar{I}_{V}=\sum_{i=1}^{N}a_i\,I_{V}^{(i)},\qquad (\tau_s,\tau_e)=\mathrm{MLP}\big(\bar{I}_{V}\big)$$

where $a_i$ is the weight corresponding to the i-th video segment in each sample, N is the number of sample video segments, $1\le i\le N$, $I_{V}^{(i)}$ is the multi-modal video feature (third video feature) corresponding to the i-th video segment in each sample, $\bar{I}_{V}$ is the fourth video feature, MLP is the pre-trained video segment determination network, $(\tau_s,\tau_e)$ is the predicted position information, $\tau_s$ is the predicted start time, and $\tau_e$ is the predicted end time.

Weighting the third video features of the sample video segments to obtain the fourth video feature specifically comprises: carrying out a weighted summation of the third video features of the sample video segments over the feature dimension to obtain the fourth video feature. As an example, if a video includes 128 video segments and each video segment has a 512-dimensional feature, the dimensions of the video can be expressed as 128 × 512; by calculating the weight (attention value) corresponding to each video segment, 128 attention values are obtained, and the 128 × 512-dimensional video features can be aggregated by copying the 128 attention values along the 512 dimensions, so that a 512-dimensional feature is obtained.
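The timing localization step can be sketched as below; the per-segment attention, the weighted aggregation and the MLP regressing $(\tau_s,\tau_e)$ follow the description above, while the layer sizes and the sigmoid output range are assumptions.

```python
import torch
import torch.nn as nn

class TemporalLocalizer(nn.Module):
    """Computes a weight a_i per segment from the third video feature I_V, aggregates the weighted
    features into the fourth video feature, and regresses the predicted (start, end) positions."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, i_v):                           # (B, N, dim), e.g. N = 128 segments of 512 dims
        a = torch.softmax(self.score(i_v), dim=1)     # (B, N, 1) segment weights (attention values)
        fourth = (a * i_v).sum(dim=1)                 # (B, dim) fourth video feature (weighted sum)
        tau = torch.sigmoid(self.mlp(fourth))         # (B, 2) predicted (tau_s, tau_e) in [0, 1]
        return tau, a.squeeze(-1)
```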

After obtaining the predicted position information corresponding to the predicted video segment corresponding to the sample, a second loss value may be determined based on the predicted position information corresponding to each sample and each position tag, where, for a sample, the second loss value represents a difference between the predicted position information corresponding to the sample and the position information corresponding to the position tag corresponding to the sample.

For a second loss value corresponding to one sample, see in particular the following formula:

$$f_{reg}(\delta)=\begin{cases}\delta^{2}, & |\delta|<\beta\\ 2\beta|\delta|-\beta^{2}, & |\delta|\ge\beta\end{cases}\qquad L_{reg}=\frac{1}{N}\sum_{i=1}^{N}\Big(f_{reg}(\delta_i^{s})+f_{reg}(\delta_i^{e})\Big)$$

where $\delta_i$ is the position deviation value corresponding to the i-th sample in the training data, namely the difference between the predicted position information corresponding to the sample and the position information corresponding to the position label; β is a threshold parameter; |δ| represents the absolute value of the position deviation value; $f_{reg}(\delta)=2\beta|\delta|-\beta^{2}$ is the loss function corresponding to the position deviation value; $f_{reg}(\delta)=\delta^{2}$ is the least square error (L2) loss function corresponding to the position deviation value; $f_{reg}(\delta_i^{s})$ is the start loss corresponding to the i-th sample, $f_{reg}(\delta_i^{e})$ is the end loss corresponding to the i-th sample, and their sum is the second loss value corresponding to the i-th sample; $L_{reg}$ is the second loss value, and N represents the number of samples in the training data.

When the absolute value of the position deviation value is smaller than the threshold parameter, the second loss value is determined by the L2 loss function $f_{reg}(\delta)=\delta^{2}$ corresponding to the position deviation value.

When $|\delta|\ge\beta$, i.e. the absolute value of the position deviation value is not less than the threshold parameter, the second loss value is determined by the loss function $f_{reg}(\delta)=2\beta|\delta|-\beta^{2}$ corresponding to the position deviation value.

And determining a third loss value based on the weight corresponding to each sample video clip in the multi-modal video features corresponding to each sample and the position label corresponding to each sample, wherein for one sample, the third loss value represents the possibility that each sample video clip in the sample is the target video clip.

Optionally, the attention mask corresponding to each sample video segment may be used to characterize the weight corresponding to each sample video segment.

Wherein, the third loss value corresponding to one sample can be expressed as:

$$L_{ta}=-\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}_{i}\log a_i$$

where $L_{ta}$ is the third loss value; $a_i$ is the weight (attention mask) corresponding to the multi-modal video feature of the i-th video segment in each sample, which can be represented by a probability, and the higher the probability, the more likely the sample video segment corresponding to the weight is the target video segment; N represents the number of sample video segments; $\mathbb{1}_{i}$ is an indicator function that is 1 if the i-th sample video segment is within the interval corresponding to the position label and 0 otherwise. When $\mathbb{1}_{i}$ is 1, the larger $a_i$, the greater the likelihood that the corresponding sample video segment is the target video segment; when $\mathbb{1}_{i}$ is 0, the corresponding sample video segment is less likely to be the target video segment. Within the interval corresponding to the position label, the larger $a_i$, the better.

As an example, if the target video segment corresponding to the position label spans from the 2nd second to the 4th second of the sample video, the interval corresponding to the position label is the 2nd to the 4th second. For the weight (attention mask) corresponding to each sample video segment: when the corresponding segment is within the interval corresponding to the position label (the interval from the 2nd to the 4th second), the indicator function is 1 and the larger the weight, the more likely the sample video segment corresponding to the weight is the target video segment; otherwise, when the segment is not within the interval corresponding to the position label, the indicator function is 0 and the sample video segment corresponding to the weight is less likely to be the target video segment.
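Under the reconstruction given above, a minimal sketch of the third loss value is shown below: it rewards large attention weights for segments inside the labelled interval. The averaging over all N segments mirrors the reconstructed formula and is an assumption.

```python
import torch

def attention_mask_loss(weights, seg_times, label_start, label_end, eps: float = 1e-8):
    """weights: (B, N) per-segment attention a_i; seg_times: (B, N) segment center times;
    label_start / label_end: (B,) labelled interval. The larger a_i inside the interval, the smaller the loss."""
    inside = ((seg_times >= label_start.unsqueeze(1)) &
              (seg_times <= label_end.unsqueeze(1))).float()       # indicator: 1 inside the interval
    per_sample = -(inside * torch.log(weights + eps)).mean(dim=1)  # -(1/N) * sum_i 1_i * log a_i
    return per_sample.mean()
```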

And determining a value of a training loss function corresponding to the neural network model based on the first loss value, the second loss value and the third loss value, taking the corresponding model during convergence as a final neural network model if the training loss function is converged, adjusting model parameters of the neural network model if the training loss function is not converged, and training the neural network model based on training data.
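Finally, the end-to-end objective can be summarized by the following hedged sketch; the loss-combination weights are illustrative assumptions, and `diversity_loss`, `position_regression_loss` and `attention_mask_loss` refer to the earlier sketches in this section (they are assumed to be in scope).

```python
def training_step(batch, model, optimizer, lambdas=(1.0, 1.0, 1.0)):
    """One optimization step: combines the first (diversity), second (regression) and third
    (attention) loss values into the overall training loss. `model` is assumed to return the
    predicted positions, the per-segment weights and the phrase weight matrix A_spe."""
    words, video, seg_times, label_start, label_end = batch        # labels normalized to [0, 1]
    tau, weights, a_spe = model(words, video)

    l_sd = diversity_loss(a_spe)                                   # first loss value
    l_reg = (position_regression_loss(tau[:, 0], label_start) +    # second loss value (start + end)
             position_regression_loss(tau[:, 1], label_end))
    l_ta = attention_mask_loss(weights, seg_times, label_start, label_end)   # third loss value

    loss = lambdas[0] * l_sd + lambdas[1] * l_reg + lambdas[2] * l_ta
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```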

In this example, after the neural network model is obtained, tests can be performed on three datasets: Charades-STA, ActivityNet Captions and TACoS. For these three datasets, optionally, the number of video segments of a sample video is 128 for the Charades-STA dataset, 128 for the ActivityNet Captions dataset and 200 for the TACoS dataset.

Optionally, the maximum length of the sample text that the neural network model can process at a time is 10 for the Charades-STA dataset, 25 for the ActivityNet Captions dataset and 25 for the TACoS dataset.

Optionally, the threshold parameter β is 0.1 for the Charades-STA dataset, 0.4 for the ActivityNet Captions dataset and 0.2 for the TACoS dataset.

In this example, the number of samples that the neural network model can process at one time is 100.

Optionally, the neural network model may select an Adam optimizer to improve the training speed of the model.

Optionally, the initial learning rate of the neural network model is 1e-3.

The method of the present application achieves the best performance on the three datasets Charades-STA, ActivityNet Captions and TACoS, with R1@m and mIoU used as evaluation indexes, where R1@m represents the accuracy rate that the IoU of the recalled Top1 result exceeds m (the higher the value, the better the performance) and mIoU represents the average IoU value of the recalled Top1 results (the higher the value, the better the performance). The specific experimental results are shown in the following tables:

TABLE 1 Charades-STA data set performance comparison

Among them, Ours represents the method of the present application. As is clear from Table 1, the evaluation indexes obtained by the method of the present application are R1@0.3 = 72.53, R1@0.5 = 59.84, R1@0.7 = 37.74 and mIoU = 51.45, while the LGI [5] method obtains R1@0.3 = 72.96, R1@0.5 = 59.46, R1@0.7 = 35.48 and mIoU = 51.38; compared with the evaluation indexes corresponding to the other methods, the performance of the method of the present application is higher than that of the other methods.

TABLE 2 ActivityNet Captions dataset Performance comparison

Among them, Ours represents the method of the present application. As is clear from Table 2, the evaluation indexes obtained by the method of the present application are R1@0.3 = 60.26, R1@0.5 = 42.46, R1@0.7 = 24.09 and mIoU = 42.51, while the LGI [5] method obtains R1@0.3 = 58.52, R1@0.5 = 41.51, R1@0.7 = 23.07 and mIoU = 41.13; it can be seen that the performance of the method of the present application is higher than that of the other methods.

TABLE 3 TACoS data set performance comparison

Wherein, Ours is the method of the present application. As is clear from Table 3, the evaluation indexes obtained by the method of the present application are R1@0.1 = 60.08, R1@0.3 = 45.81, R1@0.5 = 31.12 and mIoU = 13.87, while the 2D-TAN [1] method obtains R1@0.1 = 47.59, R1@0.3 = 37.29 and R1@0.5 = 25.32; compared with the evaluation indexes corresponding to the other methods, the performance of the method of the present application is higher than that of the other methods.

In an embodiment of the present application, acquiring a text to be processed and a video to be processed includes:

acquiring a video search request of a user, wherein the video search request comprises a search text;

acquiring a video database corresponding to the video search request, wherein the search text is a text to be processed, and any video in the video database is a video to be processed;

the method further comprises the following steps:

and if the target segment exists in the video to be processed, sending the target segment to the user.

The scheme for determining the target segment matched with the text to be processed in the video to be processed can be applied to any scene needing to determine the target segment, for example, a scene of a video segment searched based on the text.

The search text indicates the relevant information of the video clip that the user wants to search for; for example, if the search text is "dunk", it means that the user wants to search for video clips related to dunking. The search text is used as the text to be processed, any video in the video database is used as the video to be processed, and the target segment corresponding to the search text can be determined from the video database in the manner described above.

Wherein the video search request may be initiated by a user based on a terminal device of the user, and the terminal device may include at least one of: smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, smart televisions, and smart car-mounted devices.

The video clip can be displayed through the terminal device of the searcher, where the terminal device can run a client providing a video display function. The specific form of the client is not limited, for example a media player, a browser, etc., and the client may be in the form of an application program or a web page, which is not limited herein.

In an embodiment of the present application, acquiring a text to be processed and a video to be processed includes:

acquiring a video to be processed and title information of the video to be processed, wherein a text to be processed is the title information of the video to be processed;

the method further comprises the following steps:

if the target fragment exists in the video to be processed, determining that the title information is matched with the video to be processed;

and if the target fragment does not exist in the video to be processed, determining that the title information is not matched with the video to be processed.

In another application scenario, for example, the text to be processed is the title information of a first video and the video to be processed is the first video. In order to determine whether the title information matches the video content of the first video, it may also be determined, based on the scheme of the present application, whether a target segment corresponding to the title information exists in the first video; if so, the title information matches the video content of the first video, and if not, the title information does not match the video content of the first video.

Fig. 5 is a schematic diagram of an implementation environment of a media data processing method according to an embodiment of the present application. The implementation environment in this example may include, but is not limited to, a search server 101, a network 102 and a terminal device 103. The terminal device 103 may communicate with the search server 101 via the network 102 and transmit the received video search request to the search server 101, and the search server 101 may transmit the retrieved target segment to the terminal device 103 via the network.

The terminal device 103 includes a human-computer interaction screen 1031, a processor 1032 and a memory 1033. The human-computer interaction screen 1031 is used to display the target segment, and the memory 1033 is used for storing relevant data such as the search text and the target segment. The search server 101 includes a database 1011 and a processing engine 1012; the processing engine 1012 can be used to train the neural network model, and the database 1011 is used for storing the trained neural network model and the video database. The terminal device 103 may upload the video search request to the search server 101 through the network, and the processing engine 1012 in the search server 101 may obtain the video database corresponding to the video search request, determine the video segment corresponding to the search text from the video database according to the global text features of the search text, the local text features corresponding to each unit text contained in the search text, and the video features of the video to be processed, and provide the video segment to the terminal device 103 of the searcher for display.

The processing engine in the search server 101 has two main functions: the first is used for training to obtain the neural network model, and the second is used for processing a video search request based on the neural network model and the video database to obtain the video segment in the video database corresponding to the search text of the video search request (the search function). It is understood that the above two functions can be implemented by two servers; referring to fig. 6, the two servers are a training server 201 and a search server 202 respectively, the training server 201 is used for training to obtain the neural network model, the search server 202 is used for implementing the search function, and the video database is stored in the search server 202.

In practical applications, the two servers may communicate with each other, and after the training server 201 has trained the neural network model, the neural network model may be stored in the training server 201, or sent to the search server 202. Alternatively, when the search server 202 needs to call the neural network model, a model call request is sent to the training server 201, and the training server 201 sends the neural network model to the search server 202 based on the request.

As an example, the terminal device 204 sends a video search request to the search server 202 through the network 203, the search server 202 invokes a neural network model in the training server 201, and after completing a search function, the search server 202 sends a video clip obtained by the search to the terminal device 204 through the network 203 based on the neural network model, so that the terminal device 204 displays the video clip.

According to an alternative scheme of the application, recommendation of the video clips can be performed according to the video clips which are determined in the video and matched with the text. For example, based on the search keyword (text) of the user, at least one video matching the search keyword is searched from the database, and the at least one video is recommended to the user. The application of the video segments determined from the video and matched with the text is wide, and the detailed description is not repeated herein.

Based on the same principle as the method shown in fig. 1, an embodiment of the present application further provides a media data processing apparatus 20, as shown in fig. 7, the media data processing apparatus 20 may include a data acquisition module 210, a feature extraction module 220, a feature fusion module 230, and a target segment determination module 240, where:

a data obtaining module 210, configured to obtain a text to be processed and a video to be processed;

the feature extraction module 220 is configured to extract global text features and local text features corresponding to the text to be processed, and first video features of the video to be processed, where the global text features include phrase features corresponding to phrases included in the text to be processed, and the local text features include features corresponding to unit texts included in the text to be processed;

a feature fusion module 230, configured to fuse the global text feature into the first video feature to obtain a second video feature;

and the target segment determining module 240 is configured to determine a target segment matching the text to be processed from the video to be processed according to the local text feature and the second video feature.

Optionally, the first video feature includes features of a plurality of video segments in the video to be processed; when the global text feature is fused into the first video feature to obtain the second video feature, the feature fusion module 230 is specifically configured to:

fusing the global text feature and the first video feature to obtain a fused video feature;

acquiring corresponding position information characteristics of each video clip in a video to be processed;

overlapping the characteristics of each video segment in the fused video characteristics and the position information characteristics corresponding to each video segment to obtain overlapped video characteristics;

and obtaining a second video characteristic based on the superposed video characteristics.

Optionally, when the target segment determining module determines the target segment matched with the text to be processed from the video to be processed according to the local text feature and the second video feature, the target segment determining module is specifically configured to:

determining the associated characteristics of the local text characteristics and the second video characteristics;

adjusting the local text characteristics according to the association characteristics and the second video characteristics to obtain adjusted local characteristics;

adjusting the second video characteristics according to the association characteristics and the local text characteristics to obtain adjusted video characteristics;

and determining a target segment matched with the text to be processed from the video to be processed based on the adjusted local features and the adjusted video features.

Optionally, the first video feature includes features of a plurality of video segments in the video to be processed; when determining a target segment matched with the text to be processed from the video to be processed based on the adjusted local features and the adjusted video features, the target segment determining module is specifically configured to:

according to the adjusted local features, determining guide information of the text to be processed on the video to be processed;

adjusting the characteristics of each video segment in the adjusted video characteristics according to the guide information to obtain third video characteristics;

and determining a target segment matched with the text to be processed from the video to be processed according to the third video characteristics.

Optionally, when the target segment determining module determines the target segment matched with the text to be processed from the video to be processed according to the third video feature, the target segment determining module is specifically configured to:

determining the weight corresponding to each video clip according to the incidence relation among the features of each video clip contained in the third video features;

weighting the third video features of the video clips based on the weights corresponding to the video clips to obtain fourth video features;

determining the position information of a target segment matched with the text to be processed in the video to be processed based on the fourth video characteristics;

and determining, based on the position information, the target segment matched with the text to be processed in the video to be processed.

Optionally, when extracting the global text feature and the local text feature of the text to be processed, the feature extraction module is specifically configured to:

acquiring unit texts in a text to be processed and a position relation between the unit texts;

determining the characteristics of each unit text based on the position relation between each unit text and each unit text, wherein the local text characteristics comprise the characteristics of each unit text;

determining phrase characteristics of each phrase contained in the text to be processed based on the characteristics corresponding to each unit text;

and fusing the phrase characteristics of the phrases to obtain global text characteristics.

Optionally, the first video feature includes features of a plurality of video segments in the video to be processed; the feature fusion module is specifically configured to, when obtaining the second video feature based on the superimposed video feature:

determining the weight corresponding to each video clip based on the incidence relation of each video clip in the video features after superposition;

for each video clip, obtaining an enhanced feature corresponding to the video clip based on the weight corresponding to the video clip and the feature corresponding to the video clip in the video features after superposition;

and extracting to obtain second video characteristics based on the enhanced characteristics corresponding to the video clips.

Optionally, extracting global text features of the text to be processed, fusing the global text features into the first video features to obtain second video features, and determining that the target segment matched with the text to be processed is obtained from the video to be processed based on a neural network model according to the local text features and the second video features;

the neural network model comprises a phrase feature extraction network, a multi-mode feature extraction network and a video clip determination network, and is obtained through the following model training modules:

the model training module is used for:

acquiring training data, wherein the training data comprises a plurality of samples, each sample comprises a sample video and a sample text, each sample carries a position tag, and the position tag represents position information of a target video fragment corresponding to the sample text in the sample video;

extracting global text features and local text features of sample texts in the samples and video features of sample videos for each sample in the training data;

for a sample, inputting the characteristics of each unit text in the sample text into a phrase characteristic extraction network to obtain the predicted phrase characteristics of each phrase in the sample text;

determining a first loss value based on the matching degree between the predicted phrase features of the phrases corresponding to the samples, wherein for one sample, the first loss value represents semantic difference between the phrases in the sample;

for a sample, inputting the global text features and the local text features of a sample text and the video features of a sample video into the multi-modal feature extraction network to obtain multi-modal video features corresponding to the sample video;

for a sample, inputting the multi-modal video features into the video segment determination network to obtain the weights corresponding to the sample video segments in the multi-modal video features, and obtaining, based on the multi-modal video features and the weights corresponding to the sample video segments, the predicted position information of the predicted video segment corresponding to the sample text in the sample video;

determining a second loss value based on the predicted position information corresponding to each sample and each position label, wherein the second loss value represents the difference between the predicted position information corresponding to each sample and the position label of each sample;

determining a third loss value based on the weight corresponding to each sample video segment in the multi-modal video features corresponding to each sample and the position label corresponding to each sample, wherein, for one sample, the third loss value represents the possibility that each sample video segment in the sample is the target video segment;

determining a value of a training loss function corresponding to the neural network model based on the first loss value, the second loss value and the third loss value;

and if the training loss function converges, taking the model at convergence as the final neural network model; if it does not converge, adjusting the model parameters of the neural network model and continuing to train the neural network model based on the training data.
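
As a rough, non-authoritative illustration of the training procedure above, the sketch below shows (a) one possible form of the first loss value, which penalizes high similarity between the predicted phrase features of a sample so that the phrases remain semantically distinct, and (b) a single training step that combines the three loss values into the training loss. The function names, the cosine-similarity form of the first loss, and the assumption that the third loss is a binary cross-entropy over the per-segment weights are assumptions made for the sketch only.

    import torch
    import torch.nn.functional as F

    def phrase_diversity_loss(phrase_features: torch.Tensor) -> torch.Tensor:
        # Assumed form of the first loss: average off-diagonal cosine similarity
        # between the predicted phrase features of one sample, shape (num_phrases, dim).
        normalized = F.normalize(phrase_features, dim=-1)
        similarity = normalized @ normalized.t()
        n = phrase_features.size(0)
        off_diagonal = similarity - torch.eye(n)          # ignore self-similarity
        return off_diagonal.clamp(min=0).sum() / max(n * (n - 1), 1)

    def train_step(model, optimizer, sample, position_loss_fn):
        # One hypothetical training step combining the three loss values.
        weights, predicted_position, phrase_features = model(sample["text"], sample["video"])
        first_loss = phrase_diversity_loss(phrase_features)
        second_loss = position_loss_fn(predicted_position, sample["position_label"])
        # Assumed third loss: how likely each sample video segment is the target one.
        third_loss = F.binary_cross_entropy(weights, sample["clip_labels"].float())
        training_loss = first_loss + second_loss + third_loss
        optimizer.zero_grad()
        training_loss.backward()
        optimizer.step()
        return training_loss.item()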

Optionally, for one sample, when determining the second loss value based on the predicted position information and the position label corresponding to each sample, the model training module is specifically configured to:

determining a position deviation value based on the predicted position information and the position label corresponding to the sample;

if the absolute value of the position deviation value is smaller than the threshold parameter, determining a second loss value based on a least square error loss function corresponding to the position deviation value;

and if the absolute value of the position deviation value is not less than the threshold parameter, determining the second loss value based on a loss function corresponding to the position deviation value, wherein the loss function comprises a mean absolute error loss function and the threshold parameter.
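
The branching just described, a squared-error term when the deviation is below the threshold parameter and an absolute-error-based term otherwise, reads like a smooth-L1 (Huber-style) loss. A minimal sketch under that assumption is given below; the exact loss used in the embodiments may differ.

    import torch

    def second_loss(predicted: torch.Tensor, label: torch.Tensor, threshold: float = 1.0) -> torch.Tensor:
        # Hypothetical smooth-L1-style form of the described second loss.
        deviation = predicted - label                       # position deviation value
        absolute = deviation.abs()
        squared_branch = 0.5 * deviation ** 2 / threshold   # used when |deviation| < threshold
        linear_branch = absolute - 0.5 * threshold          # used when |deviation| >= threshold
        return torch.where(absolute < threshold, squared_branch, linear_branch).mean()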

Optionally, when acquiring the text to be processed and the video to be processed, the data acquisition module is specifically configured to:

acquiring a video search request of a user, wherein the video search request comprises a search text;

acquiring a video database corresponding to the video search request, wherein the search text is a text to be processed, and any video in the video database is a video to be processed;

the device also includes:

the first video processing module is used for sending the target segment to the user when the target segment exists in the video to be processed.

Optionally, when acquiring the text to be processed and the video to be processed, the data acquisition module is specifically configured to:

acquiring a video to be processed and title information of the video to be processed, wherein the text to be processed is the title information of the video to be processed;

the device also includes:

the second video processing module is used for determining that the title information matches the video to be processed when the target segment exists in the video to be processed, and determining that the title information does not match the video to be processed when the target segment does not exist in the video to be processed.
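
For illustration, the sketch below shows how the two application scenarios above could be wired around a matching function; find_target_segment is a hypothetical callable standing in for the media data processing method and is assumed to return the matched segment or None.

    def handle_search_request(search_text, video_database, find_target_segment):
        # Video-search scenario: the search text is the text to be processed and
        # each video in the database is a video to be processed.
        matched_segments = []
        for video in video_database:
            segment = find_target_segment(search_text, video)   # None if no target segment exists
            if segment is not None:
                matched_segments.append(segment)                 # these segments are sent to the user
        return matched_segments

    def title_matches_video(title, video, find_target_segment):
        # Title-check scenario: the title matches the video only if a target
        # segment for the title exists in the video.
        return find_target_segment(title, video) is not None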

The media data processing device of the embodiments of the present application can execute the media data processing method provided by the embodiments of the present application, and the implementation principles are similar. The actions executed by the modules and units of the media data processing device correspond to the steps of the media data processing method of the embodiments of the present application; for a detailed functional description of each module of the media data processing device, reference may be made to the description of the corresponding media data processing method shown in the foregoing, and the details are not repeated here.

The media data processing device may be a computer program (comprising program code) running on a computer device, for example application software; the device may be used to perform the corresponding steps in the methods provided by the embodiments of the present application.

In some embodiments, the media data processing device provided by the embodiments of the present application may be implemented by combining hardware and software. By way of example, the media data processing device may be a processor in the form of a hardware decoding processor, programmed to execute the media data processing method provided by the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may employ one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

In other embodiments, the media data processing device provided by the embodiments of the present application may be implemented in software. Fig. 7 illustrates the media data processing device stored in the memory, which may be software in the form of programs and plug-ins and which comprises a series of modules, including a data acquisition module 210, a feature extraction module 220, a feature fusion module 230 and a target segment determination module 240, for implementing the media data processing method provided by the embodiments of the present application.

The modules described in the embodiments of the present application may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.

Based on the same principle as the method shown in the embodiments of the present application, there is also provided in the embodiments of the present application an electronic device, which may include but is not limited to: a processor and a memory; a memory for storing a computer program; and the processor is used for executing the media data processing method shown in any embodiment of the application by calling the computer program.

According to the media data processing method described above, when a target segment matching the text to be processed needs to be determined from the video to be processed, the first video feature can first be processed based on the global text feature of the text to be processed to obtain the second video feature, and the target segment matching the text to be processed is then determined from the video segments of the video to be processed based on the local text feature of the text to be processed and the second video feature. Since the global text feature and the local text feature describe the text to be processed at different granularities, the information of the text to be processed can be captured more comprehensively, so that the determined target segment matches the text to be processed more closely.

In an alternative embodiment, an electronic device is provided. As shown in fig. 8, the electronic device 4000 comprises a processor 4001 and a memory 4003, the processor 4001 being coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmission and/or reception of data. In addition, the number of transceivers 4004 in practical applications is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.

The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.

Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.

The Memory 4003 may be a ROM (Read-Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.

The memory 4003 is used for storing the application program code (computer program) for executing the solution of the present application, and execution is controlled by the processor 4001. The processor 4001 is configured to execute the application program code stored in the memory 4003 to implement what is shown in the foregoing method embodiments.

The electronic device may also be a terminal device, and the electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the application scope of the embodiments of the present application.

The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.

According to another aspect of the application, there is also provided a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device executes the media data processing method provided by the various embodiments described above.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It should be understood that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer readable storage medium provided by the embodiments of the present application may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.
