Method and apparatus for training caption model, computer device and storage medium

Document No.: 956377    Publication date: 2020-10-30

Reading note: This technology, "Method and apparatus for training caption model, computer device and storage medium", was created by Gong Boqing (宫博庆) on 2020-03-19. Abstract: An embodiment of the present application provides a method of training a caption model for performing automatic video captioning on an input video, the method comprising: initializing a plurality of Long Short Term Memory (LSTM) units included in the caption model using cross-entropy loss; training the LSTM units using reinforcement learning; training the LSTM units and a plurality of Convolutional Neural Networks (CNNs) included in the caption model using multi-task training; and generating a video caption corresponding to the input video using the caption model.

1. A method of training a caption model for performing automatic video captioning on an input video, the method comprising:

initializing a plurality of Long Short Term Memory (LSTM) units included in the caption model using cross entropy loss;

training the LSTM units using reinforcement learning;

training the LSTM units and a plurality of Convolutional Neural Networks (CNNs) included in the caption model using multi-task training; and

generating a video caption corresponding to the input video using the caption model.

2. The method of claim 1, wherein weights of the CNNs are frozen during the initializing and the reinforcement learning of the LSTM units.

3. The method of claim 2, wherein the weights of the CNNs are released during the multi-task training.

4. The method of claim 1, wherein the generating the video caption comprises:

converting the input video into a plurality of feature representations using the plurality of CNNs;

encoding the plurality of feature representations using the plurality of LSTM units; and

decoding the encoded plurality of feature representations using the plurality of LSTM units to provide a sentence describing the content of the input video.

5. The method of claim 1, wherein the initializing comprises:

receiving an input frame i_t at a time step t;

encoding the input frame i_t using the plurality of CNNs;

embedding the encoded input frame i_t with a projection matrix W_I; and

computing, using the plurality of LSTM units, a hidden state h_t and a cell state c_t corresponding to a feature representation x_t of the embedded and encoded input frame i_t.

6. The method of claim 5, wherein the hidden state h_t and the cell state c_t are calculated as follows:

i_t = σ(W_ix x_t + W_ih h_{t-1} + b_i)

f_t = σ(W_fx x_t + W_fh h_{t-1} + b_f)

o_t = σ(W_ox x_t + W_oh h_{t-1} + b_o)

g_t = tanh(W_gx x_t + W_gh h_{t-1} + b_g)

c_t = i_t ⊙ g_t + f_t ⊙ c_{t-1}

h_t = o_t ⊙ tanh(c_t)

where σ denotes the sigmoid function, tanh denotes the hyperbolic tangent function, and ⊙ denotes element-wise multiplication.

7. The method of claim 1, wherein the reinforcement learning comprises:

receiving visual features of the input video, at least one annotated word (ground-truth word) provided by the caption model in a previous step, and a reward associated with the at least one annotated word;

providing a new annotated word;

receiving a new reward associated with the new annotated word; and

changing at least one weight of the plurality of LSTM units based on the new reward.

8. The method of claim 1, wherein a cost function L_r(θ) for performing the reinforcement learning is represented as follows:

wherein p represents the caption model, θ represents a parameter of the caption model, M represents the number of a plurality of trajectories, m represents an index over the plurality of trajectories, s_m represents one trajectory of the plurality of trajectories, and r(s_m) represents the reward assigned to the trajectory s_m.

9. The method of claim 1, further comprising:

receiving an output of the caption model;

mining attributes of the output using an attribute prediction branch of the caption model; and

training the plurality of CNNs based on the mined attributes,

wherein the attributes comprise at least one of a noun, a verb, or an adjective included in the output.

10. The method of claim 9, wherein a binary cross-entropy loss function L_a(θ) used by the attribute prediction branch is represented as follows:

where θ represents a parameter of the caption model, N is the number of the attributes, i represents an index over the attributes, y_i indicates the presence of the i-th attribute within the input video, and q_θ(i) represents an output of the caption model for the i-th attribute.

11. An apparatus for training a caption model for performing automatic video captioning on an input video, the apparatus comprising:

an initialization module for initializing a plurality of Long Short Term Memory (LSTM) units included in the caption model using cross entropy loss;

a first training module for training the LSTM units using reinforcement learning;

a second training module for training the LSTM units and a plurality of Convolutional Neural Networks (CNNs) included in the caption model using multi-task training; and

a generation module for generating a video caption corresponding to the input video using the caption model.

12. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer device, cause the computer device to perform the method of training a caption model of any one of claims 1-10.

13. A computer device, comprising one or more processors and one or more memories having stored therein at least one instruction, the at least one instruction being loaded and executed by the one or more processors to implement the method of training a caption model according to any one of claims 1-10.

Technical Field

The present application relates to video captioning technology. In particular, the present application relates to a method and apparatus for training a caption model, a computer device, and a storage medium.

Background

Video captioning is crucial for many downstream applications, such as video retrieval, indexing, and browsing. Existing video captioning methods are trained component by component, so the quality of the overall system is limited by the performance of each individual component.

Disclosure of Invention

Embodiments of the present application provide a method and apparatus for training a caption model, a computer device, and a storage medium, which aim to address the problems that existing methods for training caption models consume substantial memory and data, are difficult to train, and yield low training quality.

In one embodiment, a method of training a caption model for performing automatic video captioning on input video is provided, the method comprising: initializing, by at least one processor, a plurality of Long Short Term Memory (LSTM) units included in the caption model using cross entropy loss; the at least one processor training the LSTM unit using reinforcement learning; the at least one processor training the LSTM units and a plurality of Convolutional Neural Networks (CNNs) included in the caption model using multitask training; and the at least one processor generating video subtitles corresponding to the input video using the subtitle model.

In one embodiment, an apparatus for training a caption model for performing automatic video captioning on an input video is provided that includes: an initialization module to initialize a plurality of Long Short Term Memory (LSTM) units included in the caption model using cross-entropy loss; a first training module for training the LSTM unit using reinforcement learning; a second training module to train the LSTM units and a plurality of Convolutional Neural Networks (CNNs) included in the caption model using multi-tasking training; and a generation module for generating a video subtitle corresponding to the input video using the subtitle model.

In one embodiment, a non-transitory computer-readable storage medium is provided for storing instructions that, when executed by a computer device, cause the computer device to perform the method of training a caption model.

In one embodiment, a computer device is provided that includes one or more processors and one or more memories having at least one instruction stored therein, the at least one instruction being loaded and executed by the one or more processors to implement the method of training a caption model.

In an embodiment of the application, a caption model is trained using reinforcement learning and multi-task learning, and a video caption corresponding to an input video is generated using the trained caption model. This training method reduces memory and data consumption, simplifies the training process, and improves training quality.

Drawings

Fig. 1 is a schematic diagram of input video for video subtitles.

Fig. 2 is a schematic diagram of an environment in which methods, apparatus, and systems described herein may be implemented, according to an embodiment.

FIG. 3 is a schematic diagram of example components of one or more of the devices of FIG. 2.

Fig. 4 is a schematic diagram illustrating a video caption model for performing automatic video captioning on an input video, according to one embodiment.

Fig. 5 is a flow diagram of a method of training a caption model for performing automatic video captioning on input video according to one embodiment.

Detailed Description

Great progress has recently been made in image and video captioning, due in large part to advances in machine translation. For example, the encoder-decoder framework and the attention mechanism were first introduced in machine translation and then extended to captioning. Both image captioning and video captioning methods follow this pipeline and apply an attention mechanism during caption generation. In contrast to image captions, video captions describe dynamic scenes rather than static scenes.

As can be seen from fig. 1, video captioning is much more difficult because of large changes in appearance. Some related art proposes boundary-aware long short-term memory (LSTM) units to automatically detect temporal video segments. Some related techniques integrate natural language knowledge into their networks by training language LSTM models on large external text datasets. Some related techniques extend gated recurrent units (GRUs) to multi-rate GRUs to handle different video frame rates. Some related art proposes deep compositional captioners to describe novel objects by training vocabulary classifiers on external image description datasets. Recently, maximum likelihood estimation has been used in video captioning to maximize the probability of the current word given the previous ground-truth words. However, all of these methods have two major problems.

The first is exposure bias, i.e., the input mismatch between training and inference. In training, the decoder's inputs are the ground-truth words rather than the model's own predictions, whereas in inference the decoder only has access to its predictions. Some related techniques use scheduled sampling to alleviate this gap between training and inference by drawing more from the annotated data (ground truth) at the beginning and more from the model predictions toward the end. However, this technique still optimizes at the word level.

The other problem is the mismatch of objectives between training and inference. In training, the loss is optimized at the word level, while in inference, discrete metrics such as BLEU-4, METEOR, CIDEr, and ROUGE-L are used for evaluation. Several image captioning works have been proposed to address this problem and show excellent performance with the help of reinforcement learning. Some related art introduces an actor-critic method to image captioning and also proposes a new lookahead inference algorithm that performs better than beam search. Some related techniques employ a policy gradient method to optimize the SPIDEr score. Some related technologies combine a conditional generative adversarial network with a policy gradient to generate natural and diverse sentences. However, there are far fewer works using reinforcement learning in video captioning.

Many video caption models could, in principle, be deployed in an E2E manner. Some related art proposes a stack of two Long Short Term Memory (LSTM) networks. Some related art proposes a transfer unit that feeds semantic concepts to the LSTM. Some related art develops advanced word detectors and a semantic attention mechanism that combines concepts with the caption decoder. However, these methods in fact treat the Convolutional Neural Network (CNN) as a fixed feature extractor; the CNN part of their frameworks is neither trained on its own nor trained jointly with the other parts.

Multi-task learning is a machine learning technique in which multiple tasks are solved simultaneously using a shared representation; it is particularly useful when the amount of raw data is limited. It is used not only in computer vision but also in natural language processing. However, few related techniques use multi-task learning in video captioning.

Embodiments of the present application may relate to an E2E training method for video captioning, with multi-task learning and reinforcement learning as enabling factors, which may be the first video captioning system trained in an E2E manner. Embodiments of the present application may provide state-of-the-art results on several benchmark datasets.

Embodiments of the present application relate to an end-to-end (E2E) training method for jointly optimizing the different components of a video captioning system used to perform automatic video captioning on input video. As described herein, video captioning may involve automatically generating descriptive sentences or phrases for short video clips. In an embodiment, a video clip may be, for example, 10-20 seconds long.

According to an embodiment, a multi-task reinforcement learning method may be used to train an E2E caption model (e.g., a video caption model). Multi-task learning may be used because the model capacity may exceed the existing datasets, for example, when all the weights of the model are updated from the raw video input to the caption output. The multi-task reinforcement learning method can mine and construct effective auxiliary tasks, such as attributes, rewards, and captions, from human-captioned videos, so that these tasks jointly adjust the search space of the E2E neural network and an E2E video caption model can be found that generalizes well to the test stage. An embodiment may be trained end-to-end from raw video input to caption output. Related art methods of video captioning may train the components of the system separately, whereas embodiments of the present application optimize the entire system end-to-end. Experimental results show that such models perform much better on several benchmark video captioning datasets than related art models.

Embodiments of the E2E-trained video caption model described herein may include a deepened version of the S2VT model. Despite its conceptual simplicity, training the entire model to achieve good generalization on the test set is very challenging. Experimental results show that, without an effective training method, the gain obtained by jointly training the Convolutional Neural Network (CNN) and the long short-term memory network (LSTM) may be insignificant compared with fixing the CNN as a feature extractor. Thus, the embodiments described herein may be useful when combined for training the E2E video caption model.

According to an embodiment, the E2E method of training a video captioning pipeline may include developing auxiliary tasks to help train the main task, i.e., video captioning. The method may further include automatically mining attributes from the video captions to construct an auxiliary attribute prediction task, and using the evaluation metric as a reward function to define an auxiliary reinforcement learning task.

FIG. 2 is a schematic diagram of an example environment 200 in which systems and/or methods described herein may be implemented. As shown in fig. 2, environment 200 may include user device 210, platform 220, and network 230. The devices of environment 200 may be interconnected by wired connections, wireless connections, or a combination of wired and wireless connections.

User device 210 includes one or more devices capable of receiving, generating, storing, processing, and/or providing information related to platform 220. For example, the user device 210 may include a computing device (e.g., a desktop computer, a laptop computer, a tablet computer, a handheld computer, a smart speaker, a server, etc.), a mobile phone (e.g., a smartphone, a wireless phone, etc.), a wearable device (e.g., smart glasses or a smart watch), or similar device. In some implementations, the user device 210 can receive information from the platform 220 and/or send information to the platform 220.

The platform 220 includes one or more devices capable of training a caption model and performing automatic video captioning on input video, as described elsewhere herein. In some implementations, the platform 220 can include a cloud server or a group of cloud servers. In some embodiments, the platform 220 may be designed to be modular such that certain software components may be swapped in and out according to particular needs. In this way, platform 220 may be easily and/or quickly reconfigured for different uses.

In some implementations, as shown, the platform 220 may be hosted in a cloud computing environment 222. Notably, although the embodiments described herein describe the platform 220 as being hosted in the cloud computing environment 222, in some embodiments, the platform 220 is not cloud-based (i.e., may be implemented outside of the cloud computing environment) or may be partially cloud-based. Cloud computing environment 222 includes an environment that hosts the platform 220. The cloud computing environment 222 may provide computing, software, data access, storage, etc. services that do not require an end user (e.g., user device 210) to know the physical location and configuration of the systems and/or devices of the hosting platform 220. As shown, the cloud computing environment 222 may include a set of computing resources 224 (collectively referred to as "computing resources 224" and individually referred to as "computing resource 224").

Computing resources 224 include one or more personal computers, workstation computers, server devices, or other types of computing and/or communication devices. In some implementations, the computing resources 224 may host the platform 220. Cloud resources may include computing instances executing in computing resources 224, storage devices provided in computing resources 224, data transfer devices provided by computing resources 224, and so forth. In some implementations, the computing resources 224 may communicate with other computing resources 224 through wired connections, wireless connections, or a combination of wired and wireless connections.

As further shown in FIG. 2, the computing resources 224 include a set of cloud resources, such as one or more application programs ("APP") 224-1, one or more virtual machines ("VM") 224-2, virtualized storage ("VS") 224-3, one or more hypervisors ("HYP") 224-4, and so forth.

The application 224-1 comprises one or more software applications that may be provided to or accessed by the user device 210 and/or the platform 220. The application 224-1 may eliminate the need to install and execute software applications on the user device 210. For example, the application 224-1 may include software related to the platform 220 and/or any other software capable of being provided through the cloud computing environment 222. In some embodiments, one application 224-1 may send/receive information to/from one or more other applications 224-1 through the virtual machine 224-2.

The virtual machine 224-2 comprises a software implementation of a machine (e.g., a computer) that executes programs, similar to a physical machine. The virtual machine 224-2 may be a system virtual machine or a process virtual machine depending on the use and correspondence of the virtual machine 224-2 to any real machine. The system virtual machine may provide a complete system platform that supports execution of a complete operating system ("OS"). The process virtual machine may execute a single program and may support a single process. In some implementations, the virtual machine 224-2 can execute on behalf of a user (e.g., the user device 210) and can manage the infrastructure of the cloud computing environment 222, such as data management, synchronization, or long-term data transfer.

Virtualized storage 224-3 includes one or more storage systems and/or one or more devices that use virtualization techniques within the storage systems or devices of computing resources 224. In some embodiments, within the context of a storage system, the types of virtualization may include block virtualization and file virtualization. Block virtualization may refer to the abstraction (or separation) of logical storage from physical storage so that a storage system may be accessed without regard to physical storage or heterogeneous structure. The separation may allow an administrator of the storage system to flexibly manage end-user storage. File virtualization may eliminate dependencies between data accessed at the file level and the location where the file is physically stored. This may optimize performance of storage usage, server consolidation, and/or uninterrupted file migration.

Hypervisor (Hypervisor)224-4 may provide hardware virtualization techniques that allow multiple operating systems (e.g., "guest operating systems") to execute concurrently on a host computer, such as computing resources 224. Hypervisor 224-4 may provide a virtual operating platform to the guest operating systems and may manage the execution of the guest operating systems. Multiple instances of various operating systems may share virtualized hardware resources.

The network 230 includes one or more wired and/or wireless networks. For example, network 230 may include a cellular network (e.g., a fifth generation (5G) network, a Long Term Evolution (LTE) network, a third generation (3G) network, a Code Division Multiple Access (CDMA) network, etc.), a Public Land Mobile Network (PLMN), a Local Area Network (LAN), a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a telephone network (e.g., the Public Switched Telephone Network (PSTN)), a private network, an ad hoc network, an intranet, the internet, a fiber-based network, etc., and/or a combination of these or other types of networks.

The number and arrangement of devices and networks shown in fig. 2 are provided as examples. In practice, there may be more devices and/or networks, fewer devices and/or networks, different devices and/or networks, or a different arrangement of devices and/or networks than those shown in FIG. 2. Further, two or more of the devices shown in fig. 2 may be implemented within a single device, or a single device shown in fig. 2 may be implemented as multiple distributed devices. Additionally or alternatively, a set of devices (e.g., one or more devices) of environment 200 may perform one or more functions described as being performed by another set of devices of environment 200.

Fig. 3 is a schematic diagram of example components of a device 300. The device 300 may correspond to the user device 210 and/or the platform 220. As shown in fig. 3, device 300 may include a bus 310, a processor 320, a memory 330, a storage component 340, an input component 350, an output component 360, and a communication interface 370.

Bus 310 includes components that allow communication among the components of device 300. Processor 320 is implemented in hardware, firmware, or a combination of hardware and software. Processor 320 is a Central Processing Unit (CPU), Graphics Processing Unit (GPU), Accelerated Processing Unit (APU), microprocessor, microcontroller, Digital Signal Processor (DSP), Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), or another type of processing component. In some implementations, processor 320 includes one or more processors that can be programmed to perform functions. Memory 330 includes a Random Access Memory (RAM), a Read Only Memory (ROM), and/or another type of dynamic or static storage device (e.g., flash memory, magnetic memory, and/or optical memory) that stores information and/or instructions for use by processor 320.

The storage component 340 stores information and/or software related to the operation and use of the device 300. For example, storage component 340 may include a hard disk (e.g., a magnetic disk, an optical disk, a magneto-optical disk, and/or a solid state disk), a Compact Disc (CD), a Digital Versatile Disc (DVD), a floppy disk, a tape cartridge, a magnetic tape, and/or another type of non-volatile computer-readable medium, and a corresponding drive.

Input components 350 include components that allow device 300 to receive information, such as through user input, for example, a touch screen display, a keyboard, a keypad, a mouse, buttons, switches, and/or a microphone. Additionally or alternatively, input component 350 may include sensors for sensing information (e.g., Global Positioning System (GPS) components, accelerometers, gyroscopes, and/or actuators). Output components 360 include components that provide output information from device 300, such as a display, a speaker, and/or one or more Light Emitting Diodes (LEDs).

Communication interface 370 includes transceiver-like components (e.g., a transceiver and/or a separate receiver and transmitter) that enable device 300 to communicate with other devices, e.g., over a wired connection, a wireless connection, or a combination of wired and wireless connections. Communication interface 370 may allow device 300 to receive information from and/or provide information to another device. For example, communication interface 370 may include an ethernet interface, an optical interface, a coaxial interface, an infrared interface, a Radio Frequency (RF) interface, a Universal Serial Bus (USB) interface, a Wi-Fi interface, a cellular network interface, and/or the like.

Device 300 may perform one or more processes described herein. Device 300 may perform these processes in response to processor 320 executing software instructions stored by a non-transitory computer-readable medium, such as memory 330 and/or storage component 340. A computer-readable medium is defined herein as a non-volatile memory device. The memory device includes storage space within a single physical storage device or storage space distributed across multiple physical storage devices.

The software instructions may be read into memory 330 and/or storage component 340 from another computer-readable medium or from another device via communication interface 370. When executed, software instructions stored in memory 330 and/or storage component 340 may cause processor 320 to perform one or more processes described herein. Additionally or alternatively, hardwired circuitry may be used in place of or in combination with software instructions to implement one or more processes described herein. Thus, implementations described herein are not limited to any specific combination of hardware circuitry and software.

The number and arrangement of components shown in fig. 3 are provided as examples. In practice, device 300 may include more components, fewer components, different components, or a different arrangement of components than those shown in FIG. 3. Additionally or alternatively, a set of components (e.g., one or more components) of device 300 may perform one or more functions described as being performed by another set of components of device 300.

FIG. 4 shows a schematic diagram of a model architecture 400 that may include three major components. According to an embodiment, the model architecture 400 may correspond to the video caption model described herein.

At the top of the model architecture 400, the original video frames can be converted into high-level feature representations using five copies of the same Inception-ResNet-v2 CNN. The last classification layer of Inception-ResNet-v2 may be replaced by a fully-connected layer whose output dimension is, for example, 500. The LSTM at the bottom of model architecture 400 may first encode the feature representations of the video frames and then decode a sentence describing the content of the video. At the bottom left of model architecture 400, one branch includes a temporal average-pooling layer and an attribute prediction layer. Model architecture 400 can extract multiple attributes, e.g., up to 400 attributes. Thus, the output dimension of the attribute prediction layer may be 400, and its activation function may be the sigmoid. This branch is introduced to assign relevant attributes to the input video. In an embodiment, the branch may not be used in the testing phase of video captioning, but during the training phase it can generate informative gradients for updating the weights of the CNNs, in addition to those propagated from the LSTM. The LSTM design, e.g., the number of hidden units and how the input gates are calculated, may be borrowed from S2VT.
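Purely as an illustration of the general shape of this architecture, the following is a minimal PyTorch-style sketch. The class name VideoCaptionModel, the parameter names, and the use of a small ResNet backbone in place of the Inception-ResNet-v2 copies are assumptions for the example, and the two-layer S2VT encoder-decoder is simplified to a single encoder LSTM and a single decoder LSTM; this is a sketch under those assumptions, not the exact model of the embodiment.

```python
import torch
import torch.nn as nn
from torchvision import models


class VideoCaptionModel(nn.Module):
    """Sketch: CNN frame encoder + LSTM encoder-decoder + attribute branch."""

    def __init__(self, vocab_size, feat_dim=500, hidden_dim=512, num_attributes=400):
        super().__init__()
        # Frame encoder: a generic CNN backbone stands in for Inception-ResNet-v2;
        # its classification layer is replaced by a 500-d fully-connected layer.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.cnn = backbone

        # Encoder/decoder LSTMs (the stacked S2VT layout is simplified here).
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.word_embed = nn.Embedding(vocab_size, feat_dim)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

        # Attribute prediction branch: temporal average pooling + linear layer
        # (the sigmoid is applied by the binary cross-entropy loss at train time).
        self.attr_head = nn.Linear(feat_dim, num_attributes)

    def encode_frames(self, frames):
        # frames: (batch, time, 3, H, W) -> per-frame features (batch, time, feat_dim)
        b, t = frames.shape[:2]
        return self.cnn(frames.flatten(0, 1)).view(b, t, -1)

    def forward(self, frames, captions):
        feats = self.encode_frames(frames)
        attr_logits = self.attr_head(feats.mean(dim=1))   # attribute branch
        _, state = self.encoder(feats)                    # encode the frame features
        dec_in = self.word_embed(captions[:, :-1])        # teacher forcing
        dec_out, _ = self.decoder(dec_in, state)
        word_logits = self.word_out(dec_out)              # (batch, len - 1, vocab)
        return word_logits, attr_logits
```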

Fig. 5 is a flow diagram of a method 500 of training a video caption model for performing automatic video caption on input video, according to an embodiment. In some embodiments, one or more of the process blocks of fig. 5 may be performed by the platform 220. In some embodiments, one or more of the process blocks of fig. 5 may be performed by another device or group of devices separate from or including platform 220, such as user device 210.

Although fig. 5 shows an example block diagram of the method 500, in some implementations, the method 500 may include more blocks, fewer blocks, different blocks, or a different arrangement of blocks than those depicted in fig. 5. Additionally or alternatively, two or more blocks of method 500 may be performed in parallel.

In an embodiment, the video caption model may be trained in three steps. The first two steps are used to find a good initialization of the LSTM, and of the fully-connected layer connecting the CNN and the LSTM, so that the last step, the E2E training of the entire model, has a good starting point. The weights of the CNN may be frozen until the third step. In an embodiment, the first two steps may correspond to, for example, operations 510 and 520 of the method 500 of fig. 5, and the third step may correspond to operation 540 of fig. 5. As shown in fig. 5, in operation 510, the method 500 may include: the at least one processor initializing a plurality of Long Short Term Memory (LSTM) units included in the caption model using cross-entropy loss.
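As a rough illustration of this staged schedule, and assuming the hypothetical VideoCaptionModel sketched above, the freezing and releasing of the CNN weights could be organized as follows. The run_epoch callable, the loss labels, and the epoch counts are illustrative placeholders, not values from the patent.

```python
def train_caption_model(model, loader, run_epoch, xe_epochs=20, rl_epochs=20, e2e_epochs=10):
    """Hypothetical three-step schedule: cross-entropy init, REINFORCE+, joint E2E.

    `run_epoch(model, loader, loss)` is a caller-supplied training loop; the
    epoch counts are illustrative placeholders.
    """
    # Steps 1-2: freeze the CNN so only the LSTM/projection weights are updated.
    for p in model.cnn.parameters():
        p.requires_grad = False

    for _ in range(xe_epochs):     # Step 1 (operation 510): cross-entropy initialization
        run_epoch(model, loader, loss="cross_entropy")
    for _ in range(rl_epochs):     # Step 2 (operation 520): self-critical REINFORCE+
        run_epoch(model, loader, loss="reinforce")

    # Step 3 (operation 540): release the CNN weights and train the whole model
    # with the multi-task objective (REINFORCE+ combined with attribute prediction).
    for p in model.cnn.parameters():
        p.requires_grad = True
    for _ in range(e2e_epochs):
        run_epoch(model, loader, loss="multitask")
```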

As an example, operation 510 may include an S2VT-based training method using cross-entropy loss. In an embodiment, an input frame i_t at time step t may be encoded with the deep CNN and embedded with the projection matrix W_I. Then, for the projected feature representation x_t, the LSTM computes the hidden state h_t and the cell state c_t. The computation of the hidden state and the cell state is shown in Equation 1:

i_t = σ(W_ix x_t + W_ih h_{t-1} + b_i)
f_t = σ(W_fx x_t + W_fh h_{t-1} + b_f)
o_t = σ(W_ox x_t + W_oh h_{t-1} + b_o)
g_t = tanh(W_gx x_t + W_gh h_{t-1} + b_g)
c_t = i_t ⊙ g_t + f_t ⊙ c_{t-1}
h_t = o_t ⊙ tanh(c_t)          (Equation 1)

where σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ is element-wise multiplication. The second LSTM layer may be similar to the first LSTM layer, except that its input may be a combination of the first LSTM layer's output and the word embedding.

Given a ground-truth sentence s = {w_1, w_2, ..., w_T} that describes the input video, the cross-entropy loss can be minimized as shown in Equation 2:

where θ represents the model parameters.
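For reference, a standard sequence cross-entropy consistent with the surrounding description (each ground-truth word conditioned on its predecessors and on the video features, written here as x, which is an assumed symbol) would take a form such as:

L_X(θ) = -Σ_{t=1}^{T} log p_θ( w_t | w_1, ..., w_{t-1}, x )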

As further shown in fig. 5, the method 500 may include: at operation 520, the at least one processor trains the LSTM units using reinforcement learning.

In an embodiment, operation 520 may include REINFORCE+ training of the LSTM. After operation 510, a self-critical REINFORCE algorithm may be introduced to video captioning to seek better weights for the LSTM in terms of its generalization performance on the validation and test sets. The cross-entropy loss exposes the recurrent LSTM to different data distributions in the training and testing phases, because it feeds the model annotated (ground-truth) words that are only available in training. Furthermore, the loss function may not be a good surrogate for the evaluation metrics. To address these challenges, the caption model can be directly optimized through REINFORCE learning.

In reinforcement learning, an agent may be trained to complete a task by performing a series of actions in an environment. As an example, in the video captioning context, the goal of the caption model may be to generate an appropriate sentence when viewing the video input. The caption model may correspond to the agent, and an action may be to predict the next word at each time step. The input video with its caption annotations may correspond to the environment. The reward for the agent's actions may be defined as the actual evaluation metric used during the testing phase; in particular, the CIDEr score may be used as the reward. In an embodiment, a reinforcement learning pipeline for video captioning may operate as follows: the agent receives observations about the environment, including the visual features and the annotated words up to the current step, as well as a reward from the environment, such as a CIDEr score; the agent then performs an action to predict a word; and the environment provides another state, such as revealing another annotated word, and a reward in response to the agent's action.

The objective function of reinforcement learning can be shown in equation 3:

L_r(θ) = -E[ r(w^s) ]          (Equation 3)

where w^s = (w_1^s, w_2^s, ..., w_T^s) is a sentence sampled from the network, and r is the reward function.

To solve the above problem, the REINFORCE algorithm may be used again. A general update of the parameter θ can be written as equation 4:

∇_θ L_r(θ) = -E_{w^s ~ p}[ r(w^s) ∇_θ log p(w^s) ]          (Equation 4)

where p(w^s) is essentially determined by the video caption model p_θ(w^s) (see Equation 2). In practice, the expected value is approximated by a sample mean, which results in a high-variance gradient. To reduce the variance, the reward r is typically calibrated by a baseline b, as shown in Equation 5:

∇_θ L_r(θ) = -E_{w^s ~ p}[ ( r(w^s) - b ) ∇_θ log p(w^s) ]          (Equation 5)

Since the baseline b does not depend on the sampled words w^s, the gradient remains unchanged. How the baseline b is chosen affects the performance of the REINFORCE algorithm; here, the reward of the greedily decoded sentence ŵ is chosen as the baseline, i.e., b = r(ŵ). A one-sample approximation of Equation 5 may then be represented as Equation 6:

∇_θ L_r(θ) ≈ -( r(w^s) - r(ŵ) ) ∇_θ log p_θ(w^s)          (Equation 6)

which can be further viewed as a cost function as follows.

At the beginning of each iteration, up to M trajectories (e.g., sentences) may be sampled from the current model. Denoting them by s_1, ..., s_M, the cost function that generates the gradient for this iteration can be expressed as Equation 7:

L_r(θ) ≈ -(1/M) Σ_{m=1}^{M} ( r(s_m) - b ) log p_θ(s_m)          (Equation 7)

where r(s_m) is the reward assigned to trajectory s_m. In the following, this algorithm is referred to as REINFORCE+ or RFC+.

Equation 7 can be regarded as a running loss during the entire training process. It varies across iterations, because it is evaluated on sampled trajectories; this is in contrast to the cross-entropy loss L_X, whose ground-truth captions remain constant across iterations. Moreover, the reward compensated by the baseline dynamically balances the contribution of each trajectory to the gradient. Collectively, these properties may further advance the model trained in operation 510 to the point that it generalizes better to unseen data.
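As an illustration of Equation 7 and the greedy baseline described above, the following is a minimal sketch of a self-critical REINFORCE+ loss. The callables sample_fn, greedy_fn, and reward_fn (for sampling M sentences with their log-probabilities, greedy decoding, and CIDEr scoring) and all argument names are assumptions for the example, not the patent's implementation.

```python
import torch


def reinforce_plus_loss(model, video_feats, refs, sample_fn, greedy_fn, reward_fn, m=5):
    """Sketch of the self-critical loss L_r(θ) ≈ -(1/M) Σ_m (r(s_m) - b) log p_θ(s_m).

    `sample_fn`, `greedy_fn`, and `reward_fn` are caller-supplied helpers for
    sampling, greedy inference, and CIDEr scoring; all names are illustrative.
    """
    # Sample M trajectories (sentences) and keep their summed log-probabilities.
    sentences, log_probs = sample_fn(model, video_feats, m)   # log_probs: shape (M,)

    # Baseline b: reward of the greedily decoded sentence (no gradient needed).
    with torch.no_grad():
        baseline = reward_fn(greedy_fn(model, video_feats), refs)

    rewards = torch.tensor([reward_fn(s, refs) for s in sentences])
    advantages = rewards - baseline           # reward calibrated by the baseline b
    return -(advantages * log_probs).mean()   # average over the M sampled trajectories
```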

In an embodiment, the reinforcement learning may include: receiving visual features of the input video, at least one annotated word provided by the caption model in a previous step, and a reward associated with the at least one annotated word; providing a new annotated word; receiving a new reward associated with the new annotated word; and changing at least one weight of the plurality of LSTM units based on the new reward.

As further shown in fig. 5, the method 500 may include: at operation 530, it is determined whether training of the LSTM is complete.

As further shown in fig. 5, at operation 540, the method 500 may include: the at least one processor trains the LSTM units and a plurality of Convolutional Neural Networks (CNNs) included in the caption model using multi-task training.

In an embodiment, the weights of the CNN are frozen during the initialization and reinforcement learning of the LSTM units, and are released during the multi-task training.

In an embodiment, operation 540 may include jointly tuning the entire model with the weights of the CNN released. As a starting point, it may seem natural for E2E optimization to simply repeat the above operations (e.g., operation 510 and operation 520). However, experimental results indicate that this yields only a marginal gain over freezing the CNN weights. Such rapid saturation of accuracy is common for very deep neural networks and can be mitigated by skip connections between different layers of a feed-forward network. However, the LSTM and CNN form a heterogeneous mixture, and it is not clear how to apply skip connections to them.

In contrast, in embodiments, additional informative gradients may be provided directly to the CNN to supplement the gradients that reach the CNN indirectly through the LSTM. This direct gradient is provided by the attribute prediction branch shown in FIG. 4.

In an embodiment, the attribute prediction branch may mine attributes from the video captions, following previous practice for image captions. Among the words in the sentences of the training set, the most frequently occurring nouns, verbs, and adjectives may be extracted as attributes. Thus, the attribute prediction branch may be equipped with a sigmoid function to predict the presence or absence (y_i) of each attribute in the input video. The binary cross-entropy loss of this network branch can be represented by Equation 8:

where N is the total number of attributes and q_θ(i) is the network output for the i-th attribute.
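For reference, a standard per-attribute binary cross-entropy consistent with the definitions of N, y_i, and q_θ(i) above would, up to the choice of normalization over N, take a form such as:

L_a(θ) = -(1/N) Σ_{i=1}^{N} [ y_i log q_θ(i) + (1 - y_i) log( 1 - q_θ(i) ) ]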

In an embodiment, the total cost function used in operation 540 may be a convex combination of the attribute loss and the REINFORCE loss, as shown in Equation 9:

L(θ) = α L_r(θ) + (1 - α) L_a(θ)          (Equation 9)

where α = 0.95 is selected using the validation set.

Thus, in an embodiment, the method 500 may further comprise: the at least one processor receiving an output of the caption model; the at least one processor mining attributes of the output using the attribute prediction branch of the video caption model; and the at least one processor training the plurality of CNNs based on the mined attributes, wherein the attributes include at least one of a noun, a verb, or an adjective included in the output.

As further shown in fig. 5, at operation 550, the method 500 may include: the at least one processor generates a video caption corresponding to the input video using the caption model.

In an embodiment, generating the video caption may include: the at least one processor converting the input video into a plurality of feature representations using the plurality of CNNs; the at least one processor encoding the plurality of feature representations using the plurality of LSTM units; and the at least one processor decoding the encoded plurality of feature representations using the plurality of LSTM units to provide a sentence describing the content of the input video.
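For completeness, greedy decoding with the hypothetical VideoCaptionModel sketched earlier could look roughly as follows; the BOS/EOS token ids and the maximum length are illustrative choices, not values from the patent.

```python
import torch


@torch.no_grad()
def generate_caption(model, frames, bos_id=1, eos_id=2, max_len=20):
    """Greedy decoding sketch: CNN features -> LSTM encoding -> word-by-word decoding."""
    feats = model.encode_frames(frames)      # CNNs: video frames -> feature representations
    _, state = model.encoder(feats)          # LSTM: encode the feature sequence
    word = torch.full((frames.size(0), 1), bos_id, dtype=torch.long)
    tokens = []
    for _ in range(max_len):                 # LSTM: decode one word per time step
        out, state = model.decoder(model.word_embed(word), state)
        word = model.word_out(out[:, -1]).argmax(dim=-1, keepdim=True)
        tokens.append(word)
        if (word == eos_id).all():           # stop once every sentence has ended
            break
    return torch.cat(tokens, dim=1)          # token ids of the decoded sentence(s)
```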

An embodiment of the present application further provides an apparatus for training a caption model, where the caption model is used to perform automatic video captioning on an input video, and the apparatus includes:

an initialization module to initialize a plurality of Long Short Term Memory (LSTM) units included in the caption model using cross-entropy loss;

a first training module for training the LSTM units using reinforcement learning;

a second training module to train the LSTM units and a plurality of Convolutional Neural Networks (CNNs) included in the caption model using multi-tasking training; and

a generation module for generating a video caption corresponding to the input video using the caption model.

In the embodiment of the present application, the weights of the CNN are frozen during the initialization of the LSTM units and the training of the LSTM units using reinforcement learning, and the weights of the CNN are released during the multi-task training.

In an embodiment of the present application, the generating module includes:

a conversion unit for converting the input video into a plurality of feature representations using the plurality of CNNs;

a first encoding unit for encoding the plurality of feature representations using the plurality of LSTM units; and

a decoding unit for decoding the encoded plurality of feature representations using the plurality of LSTM units to provide a sentence describing the content of the input video.

In an embodiment of the present application, the initialization module includes:

a receiving unit for receiving an input frame i_t at a time step t;

a second encoding unit for encoding the input frame i_t using the plurality of CNNs;

an embedding unit for embedding the encoded input frame i_t with a projection matrix W_I; and

a calculation unit for computing, using the plurality of LSTM units, a hidden state h_t and a cell state c_t corresponding to a feature representation x_t of the embedded and encoded input frame i_t.

In an embodiment of the present application, the first training module includes:

a first receiving unit for receiving visual features of the input video, at least one annotated word provided by the caption model in a previous step, and a reward associated with the at least one annotated word;

a providing unit for providing a new annotated word;

a second receiving unit for receiving a new reward associated with the new annotated word;

a changing unit for changing at least one weight of the plurality of LSTM units based on the new reward.

In the embodiment of the present application, the cost function L_r(θ) for performing the reinforcement learning is represented as follows:

wherein p represents the caption model, θ represents a parameter of the caption model, M represents the number of a plurality of trajectories, m represents an index over the plurality of trajectories, s_m represents one trajectory of the plurality of trajectories, and r(s_m) represents the reward assigned to the trajectory s_m.

In an embodiment of the present application, the apparatus further comprises:

a receiving module for receiving an output of the caption model;

a mining module for mining attributes of the output using the attribute prediction branch of the caption model; and

a third training module for training the plurality of CNNs based on the mined attributes,

wherein the attributes comprise at least one of a noun, a verb, or an adjective included in the output.

In the embodiment of the application, the binary cross-entropy loss function L_a(θ) used in the attribute prediction branch is represented as follows:

where θ represents a parameter of the caption model, N is the number of the attributes, i represents an index over the attributes, y_i indicates the presence of the i-th attribute within the input video, and q_θ(i) represents the output of the caption model for the i-th attribute.

For the specific functions and implementation of the modules in this embodiment, reference may be made to the corresponding processes of the method for training the caption model described in the above embodiments.

Embodiments of the present application also provide a computer device, which includes one or more processors and one or more memories, where at least one instruction is stored in the one or more memories, and the at least one instruction is loaded and executed by the one or more processors to implement the method for training a subtitle model as described above.

Embodiments of the present application also provide a non-transitory computer-readable storage medium storing instructions that, when executed by a computer device, cause the computer device to perform the method of training a caption model as described above.

The foregoing examples provide illustration and description, but are not intended to be exhaustive or to limit the implementations to the precise form disclosed. Modifications and variations are possible in light of the above description of the embodiments or may be acquired from practice of the embodiments. The term "component" as used herein is intended to be broadly interpreted as hardware, firmware, or a combination of hardware and software.

It is apparent that the systems and/or methods described herein may be implemented in various forms of hardware, firmware, or a combination of hardware and software. The actual specialized control hardware or software code used to implement these systems and/or methods does not limit the embodiments. Thus, the operation and behavior of the systems and/or methods were described herein without reference to the specific software code-it being understood that software and hardware may be designed to implement the systems and/or methods based on the description herein. Even if specific combinations of features are recited in the claims and/or disclosed in the description, these combinations are not intended to limit the disclosure of possible embodiments. Indeed, many of these features may be combined in ways not specifically recited in the claims and/or disclosed in the specification. Although each dependent claim listed below may directly depend on only one claim, the disclosure of possible embodiments includes a combination of each dependent claim in the set of claims with each other claim.

No element, act, or instruction used herein should be construed as critical or essential unless explicitly described as such. In addition, as used herein, the articles "a" and "an" are intended to include one or more items and may be used interchangeably with "one or more". Further, as used herein, the term "set" is intended to include one or more items (e.g., related items, unrelated items, combinations of related items and unrelated items, etc.), and may be used interchangeably with "one or more". Where only one item is intended, the term "one" or similar language is used. Further, as used herein, the terms "having", and the like are intended to be open-ended terms. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise.
