Multitask learning as question and answer

Publication No.: 1102614    Publication date: 2020-09-25

Note: This technology, Multitask learning as question and answer, was created by N. S. Keskar, B. McCann, C. Xiong, and R. Socher on 2019-01-30. Its main content is as follows. Approaches for multitask learning as question answering include a method for training that includes: receiving a plurality of training samples including training samples from a plurality of task types, presenting the training samples to a neural model to generate answers, determining an error between the generated answer and a natural language ground truth answer for each training sample presented, and adjusting parameters of the neural model based on the errors. Each of the training samples includes a natural language context, a question, and a ground truth answer. The order in which the training samples are presented to the neural model includes first selecting training samples according to a first training strategy and then switching to selecting training samples according to a second training strategy. In some embodiments, the first training strategy is a sequential training strategy and the second training strategy is a joint training strategy.

1. A method for training a question-answering system, the method comprising:

receiving a plurality of training samples, each of the training samples comprising a natural language context, a natural language question, and a natural language ground truth answer, the training samples comprising training samples from a plurality of task types;

presenting the training samples to a neural model to generate an answer;

for each training sample presented, determining an error between the generated answer and the natural language ground truth answer; and

adjusting parameters of the neural model based on the error;

wherein the order in which the training samples are presented to the neural model comprises:

first selecting the training samples according to a first training strategy for controlling an order in which training samples from each of the plurality of task types are presented to the neural model according to a first ordering; and

switching to selecting the training samples according to a second training strategy for controlling an order in which training samples from each of the plurality of task types are presented to the neural model according to a second ordering.

2. The method of claim 1, wherein each of the plurality of task types is a language translation task type, a classification task type, or a question and answer task type.

3. The method of claim 1 or 2, wherein the first training strategy is a sequential training strategy, wherein each of the training samples for a first task type is selected before a training sample of a second task type is selected.

4. The method of claim 3, wherein the sequential training strategy comprises reselecting training samples for the first task type after selecting training samples for each of the plurality of task types.

5. The method of any of claims 1-4, wherein the second training strategy is a joint training strategy, wherein each of the training samples is selected such that successively selected training samples are selected from different ones of the plurality of task types.

6. The method of any of claims 1-4, wherein the second training strategy is a joint training strategy in which each of the training samples is selected such that a successively selected training sample subgroup is selected from different task types of the plurality of task types.

7. The method of claim 1 or 2, wherein the first training strategy is a modified sequential training strategy, wherein the training samples are selected according to a sequential training strategy having periodic intervals in which the training samples are selected according to a joint training strategy.

8. The method of any of claims 1-7, further comprising switching to selecting the training samples using the second training strategy after presenting each of the training samples for each of the plurality of task types to the neural model a predetermined number of times.

9. The method of any of claims 1-7, further comprising switching to selecting the training samples using the second training strategy based on monitoring performance indicators associated with each of the plurality of task types.

10. The method of any one of claims 1 to 9, wherein the neural model comprises:

an input layer for encoding a first word from the context and a second word from the question;

a self-attention-based transformer including an encoder and a decoder for receiving an output from the input layer and a portion of the answer;

a bidirectional long short-term memory (biLSTM) to further encode an output of the encoder;

a long short-term memory (LSTM) for generating context-adjusted hidden states from the output of the decoder and the hidden states;

an attention network to generate attention weights based on an output of the biLSTM and the context-adjusted hidden states;

a vocabulary layer for generating a distribution over a third word in the vocabulary based on the attention weight;

a context layer to generate a distribution over a first word from the context based on the attention weight; and

a switch for:

generating a weighting between a distribution over the third words from the vocabulary and a distribution over the first words from the context;

generating a composite distribution based on the weighting of the distribution over the third word from the vocabulary and the distribution over the first word from the context; and

selecting words for inclusion in the answer using the composite distribution.

11. A non-transitory machine-readable medium comprising a plurality of machine-readable instructions which, when executed by one or more processors associated with a computing device, are adapted to cause the one or more processors to perform a method comprising:

receiving a plurality of training samples, each of the training samples comprising a natural language context, a natural language question, and a natural language ground truth answer, the training samples comprising training samples from a plurality of task types;

presenting the training samples to a neural model to generate an answer;

for each training sample presented, determining an error between the generated answer and the natural language ground truth answer; and

adjusting parameters of the neural model based on the error;

wherein the order in which the training samples are presented to the neural model comprises:

first selecting the training samples according to a first training strategy for controlling an order in which training samples from each of the plurality of task types are presented to the neural model according to a first ordering; and

switching to selecting the training samples according to a second training strategy for controlling an order in which training samples from each of the plurality of task types are presented to the neural model according to a second ordering.

12. The non-transitory machine-readable medium of claim 11, wherein the first training strategy is a sequential training strategy in which each of the training samples for a first task type is selected before training samples of a second task type are selected.

13. The non-transitory machine-readable medium of claim 11 or 12, wherein the second training strategy is a joint training strategy in which each of the training samples is selected such that successively selected training samples are selected from different ones of the plurality of task types.

14. The non-transitory machine-readable medium of claim 11 or 12, wherein the second training strategy is a joint training strategy in which each of the training samples is selected such that a successively selected training sample subgroup is selected from different ones of the plurality of task types.

15. The non-transitory machine-readable medium of any of claims 11 to 14, further comprising switching to selecting the training samples using the second training strategy after presenting each of the training samples for each of the plurality of task types to the neural model a predetermined number of times.

16. A system for deep learning, the system comprising:

a multi-layer neural network;

wherein the system is configured to:

receiving a plurality of training samples, each of the training samples comprising a natural language context, a natural language question, and a natural language ground truth answer, the training samples comprising training samples from a plurality of task types;

presenting the training samples to a neural model to generate an answer;

for each training sample presented, determining an error between the generated answer and the natural language ground truth answer; and

adjusting parameters of the neural model based on the error;

wherein the order in which the training samples are presented to the neural model comprises:

first selecting the training samples according to a first training strategy for controlling an order in which training samples from each of the plurality of task types are presented to the neural model according to a first ordering; and

switching to selecting the training samples according to a second training strategy for controlling an order in which training samples from each of the plurality of task types are presented to the neural model according to a second ordering.

17. The system of claim 16, wherein the first training strategy is a sequential training strategy in which each of the training samples for a first task type is selected before training samples for a second task type are selected.

18. The system of claim 16 or 17, wherein the second training strategy is a joint training strategy, wherein each of the training samples is selected such that successively selected training samples are selected from different ones of the plurality of task types.

19. The system of claim 16 or 17, wherein the second training strategy is a joint training strategy, wherein each of the training samples is selected such that a successively selected training sample subset is selected from different task types of the plurality of task types.

20. The system of any of claims 16-19, wherein the system is further configured to select the training samples using the second training strategy after presenting each of the training samples for each of the plurality of task types to the neural model a predetermined number of times.

Technical Field

The present disclosure relates generally to natural language processing and, more particularly, to answering natural language questions about a natural language context (context).

Background

The ability of natural language processing systems to answer natural language questions about the content of natural language samples is a benchmark for context-specific reasoning about information provided in the form of natural language. This can be a complex task, as many different types of natural language questions can be asked, and answering them may require different types of reasoning and/or different types of analysis.

It would therefore be advantageous to have a unified system and method for simultaneously being able to answer different kinds of natural language questions.

Drawings

Fig. 1 is a simplified diagram of natural language processing tasks according to some embodiments.

Fig. 2 is a simplified diagram of a computing device, according to some embodiments.

Fig. 3 is a simplified diagram of a system for multitask question answering according to some embodiments.

Fig. 4 is a simplified diagram of an attention network according to some embodiments.

Fig. 5 is a simplified diagram of layers of an attention-based transformer network, according to some embodiments.

Fig. 6 is a simplified diagram of a word generator, according to some embodiments.

Fig. 7 is a simplified diagram of a method of multitask learning according to some embodiments.

Fig. 8 and 9A-9C are simplified diagrams of training performance according to some embodiments.

Fig. 10A and 10B are simplified diagrams of training performance based on training order according to some embodiments.

In the drawings, elements having the same reference number have the same or similar functions.

Detailed Description

Context-specific reasoning, including context-specific reasoning about the content of natural language information, is an important issue in machine intelligence and learning applications. Context-specific reasoning can provide valuable information for use in interpreting natural language text, and can include different tasks, such as answering questions about the content of the natural language text, language translation, semantic context analysis, and the like. However, each of these different types of natural language processing tasks often involves different types of analysis and/or different types of expected responses.

Multitask learning in natural language processing has made progress when the task types are similar. However, when dealing with different types of tasks, such as language translation, question answering, and classification, parameter sharing is often limited to word vectors or subsets of parameters. The resulting architectures are typically highly optimized and engineered for each task type, which limits their ability to generalize across task types.

However, many of these task types can be handled by the same architecture and model when they are recast as a single task type. For example, many, if not all, natural language processing tasks may be treated as question-answering tasks. For example, the classification, language translation, and question-answering task types may all be structured as question-answering tasks. An example of each of these three task types in question-answering form is shown in FIG. 1.
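By way of a non-limiting illustration, the following Python sketch shows how training samples from different task types can be represented as (context, question, answer) triples in question-answering form. The specific contexts, questions, and answers are hypothetical examples chosen for illustration and are not taken from FIG. 1.

    # Illustrative (context, question, answer) triples showing how different
    # task types can be cast as question answering.  The texts below are
    # hypothetical examples, not the contents of FIG. 1.
    examples = [
        {   # question-answering task type
            "context": "Super Bowl 50 was played at Levi's Stadium in Santa Clara.",
            "question": "Where was Super Bowl 50 played?",
            "answer": "Levi's Stadium",
        },
        {   # classification (sentiment) task type
            "context": "The film is a charming, beautifully acted surprise.",
            "question": "Is this review negative or positive?",
            "answer": "positive",
        },
        {   # language translation task type
            "context": "The house is small.",
            "question": "What is the translation from English to German?",
            "answer": "Das Haus ist klein.",
        },
    ]

    for sample in examples:
        print(sample["question"], "->", sample["answer"])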

Fig. 2 is a simplified diagram of a computing device 200 according to some embodiments. As shown in fig. 2, computing device 200 includes a processor 210 coupled to a memory 220. The operation of computing device 200 is controlled by processor 210. Moreover, although computing device 200 is shown with only one processor 210, it should be understood that processor 210 in computing device 200 may represent: one or more central processing units, one or more multi-core processors, one or more microprocessors, one or more microcontrollers, one or more digital signal processors, one or more Field Programmable Gate Arrays (FPGAs), one or more Application Specific Integrated Circuits (ASICs), one or more Graphics Processing Units (GPUs), and the like. Computing device 200 may be implemented as a standalone subsystem, a board (board) added to a computing device, and/or a virtual machine.

Memory 220 may be used to store software executed by computing device 200 and/or one or more data structures used during operation of computing device 200. Memory 220 may include one or more types of machine-readable media. Some common forms of machine-readable media may include a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.

Processor 210 and/or memory 220 may be arranged in any suitable physical arrangement. In some embodiments, the processor 210 and/or the memory 220 may be implemented on the same board, in the same package (e.g., system-in-package), on the same chip (e.g., system-on-chip), and so on. In some embodiments, processor 210 and/or memory 220 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 210 and/or memory 220 may be located in one or more data centers and/or cloud computing facilities.

As shown, memory 220 includes a question-answering module 230, which can be used to implement and/or emulate the question-answering systems and models described further herein and/or to implement any of the methods described further herein. In some embodiments, the question-answering module 230 can be used to answer natural language questions about a natural language context. In some embodiments, the question-answering module 230 may also handle the iterative training and/or evaluation of a question-answering system or model used to answer natural language questions about a natural language context. In some embodiments, memory 220 may include a non-transitory, tangible machine-readable medium comprising executable code that, when executed by one or more processors (e.g., processor 210), may cause the one or more processors to perform the methods described in further detail herein. In some embodiments, the question-answering module 230 may be implemented using hardware, software, and/or a combination of hardware and software. As shown, the computing device 200 receives a natural language context 240 and a natural language question 250 related to the natural language context 240, which are provided to the question-answering module 230; the question-answering module 230 then generates a natural language answer 260 to the natural language question 250 based on the content of the natural language context 240.

Fig. 3 is a simplified diagram of a system 300 for multitask question answering according to some embodiments. The system 300 receives a natural language context c and a natural language question q. Each of context c and question q is encoded as a vector for processing by system 300. In some embodiments, each word in context c and question q is encoded using a word encoding. In some embodiments, the encoding of each word is based on its GloVe encoding, where each word is encoded as a fixed-length vector. In some embodiments, the encoding of each word is based on its character n-gram encoding, where each word is likewise encoded as a fixed-length vector. In some embodiments, the encoding of each word is based on a concatenation of its GloVe and character n-gram encodings. In some embodiments, when there is no GloVe and/or character n-gram encoding for a word (e.g., the word is not in English), a random encoding is selected from a normal distribution having the same mean and standard deviation as the GloVe encodings (e.g., a mean of zero and a standard deviation of 0.4), and the same random encoding is used consistently for each occurrence of the respective word.
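The following is a minimal Python sketch of the word-encoding step described above, under the assumption that pretrained GloVe and character n-gram vectors are available as dictionaries. The dictionary names and the vector dimensions are illustrative assumptions; the zero-mean, 0.4-standard-deviation fallback and the reuse of the same random encoding for each occurrence of a word follow the description above.

    import numpy as np

    # Each word is encoded as the concatenation of its GloVe vector and its
    # character n-gram vector.  Words without a pretrained vector receive a
    # random vector drawn once from a normal distribution (mean 0.0, standard
    # deviation 0.4) and the same random vector is reused for every later
    # occurrence of that word.
    GLOVE_DIM, CHAR_DIM = 300, 100     # illustrative dimensions
    glove = {}                         # word -> np.ndarray of shape (GLOVE_DIM,)
    char_ngram = {}                    # word -> np.ndarray of shape (CHAR_DIM,)
    _random_cache = {}

    def _fallback(word, dim):
        if (word, dim) not in _random_cache:
            _random_cache[(word, dim)] = np.random.normal(0.0, 0.4, size=dim)
        return _random_cache[(word, dim)]

    def encode_word(word):
        g = glove.get(word, _fallback(word, GLOVE_DIM))
        c = char_ngram.get(word, _fallback(word, CHAR_DIM))
        return np.concatenate([g, c])          # shape: (GLOVE_DIM + CHAR_DIM,)

    def encode_sequence(words):
        return np.stack([encode_word(w) for w in words])

    print(encode_sequence("what is the translation ?".split()).shape)   # (5, 400)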

The encoding of context c is then passed to a linear layer 310, and the encoding of question q is passed to a linear layer 315. Each of the linear layers 310 and 315 implements a respective transfer function consistent with equation 1, where W and b are the weight and bias of the respective linear layer 310 or 315, a is the output of the respective linear layer 310 or 315, x is the input to the respective linear layer 310 or 315, and f is the linear transfer function of the respective linear layer 310 or 315, such as a pure linear function, a saturating linear function, and the like. In some embodiments, the linear layers 310 and 315 reduce the dimensionality of the encodings for context c and question q.

    a = f(Wx + b)        (Equation 1)

The encodings output by linear layers 310 and 315 are further encoded by single-layer bidirectional long short-term memory networks (biLSTMs) 320 and 325, respectively, to form the encoded representations of the context and the question. In some embodiments, biLSTM 320 and/or biLSTM 325 may further reduce the dimensionality of the encodings for context c and question q. Each of biLSTMs 320 and 325 generates an output h_i at each time step i according to equation 2, where h_i is the concatenation of the outputs of a forward and a backward pass, x is the input to the respective biLSTM, and LSTM corresponds to a long short-term memory network. In some embodiments, biLSTM 320 and/or biLSTM 325 have a hidden size of 200 and further reduce the dimensionality of the encoded representations.

    h_i = [LSTM(x_i, h_(i-1)); LSTM(x_i, h_(i+1))]        (Equation 2)
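The following Python (PyTorch) sketch illustrates the independent encoding path for the context c and the question q, i.e., the linear projection of equation 1 followed by a single-layer biLSTM consistent with equation 2. Only the hidden size of 200 comes from the description above; the other dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class SequenceEncoder(nn.Module):
        def __init__(self, in_dim=400, proj_dim=300, hidden=200):
            super().__init__()
            self.proj = nn.Linear(in_dim, proj_dim)          # a = f(Wx + b)
            self.bilstm = nn.LSTM(proj_dim, hidden, num_layers=1,
                                  batch_first=True, bidirectional=True)

        def forward(self, x):                 # x: (batch, seq_len, in_dim)
            a = self.proj(x)                  # linear transfer function
            h, _ = self.bilstm(a)             # h_i = [forward LSTM; backward LSTM]
            return h                          # (batch, seq_len, 2 * hidden)

    enc = SequenceEncoder()
    context = torch.randn(1, 12, 400)
    print(enc(context).shape)                 # torch.Size([1, 12, 400])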

The encoded context and question output by biLSTMs 320 and 325 are then passed to a coattention layer 330. The coattention layer 330 first prepends a context sentinel vector to the encoded context and prepends a question sentinel vector to the encoded question. The sentinel vectors allow the coattention mechanism of the coattention layer 330 to avoid aligning all of the tokens between the two sequences. The coattention layer 330 then stacks the context vectors and the question vectors along the time dimension to obtain a stacked context representation and a stacked question representation, respectively, and generates an affinity matrix A between the two stacked representations according to equation 3.

The coattention layer 330 then generates attention weights A_c and A_q over each sequence using equation 4, where softmax(X) normalizes over the columns of X.

    A_c = softmax(A)
    A_q = softmax(A^T)        (Equation 4)

Then, using equation 5, the coattention layer 330 uses the attention weights A_c and A_q to generate weighted sums of the context and of the question, respectively.
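The following Python sketch illustrates one standard way, in the style of a dynamic coattention network, to realize a coattention computation of the kind described above, with the sentinel vectors omitted for brevity. The exact orientation of the affinity matrix and the composition of the weighted sums in equations 3 and 5 are assumptions of this sketch and may differ from the embodiments above.

    import torch
    import torch.nn.functional as F

    def coattention(C, Q):
        # C: (lc, d) encoded context, Q: (lq, d) encoded question
        A = C @ Q.T                      # affinity scores, shape (lc, lq)
        A_c = F.softmax(A, dim=1)        # per context position: weights over question
        A_q = F.softmax(A.T, dim=1)      # per question position: weights over context
        Q_sum = A_c @ Q                  # question summary per context position, (lc, d)
        C_sum = A_q @ C                  # context summary per question position, (lq, d)
        # carry the context summaries back to the context positions and concatenate
        S = torch.cat([Q_sum, A_c @ C_sum], dim=-1)   # (lc, 2d)
        return S

    C = torch.randn(6, 200)
    Q = torch.randn(4, 200)
    print(coattention(C, Q).shape)       # torch.Size([6, 400])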

The coattention layer 330 then generates a coattention summary S as the concatenation of the two weighted sums. The coattention summary S includes a sequence of vectors s, and the first vector, corresponding to the sentinel position, may be removed from S. S is then passed to a biLSTM 340. BiLSTM 340 generates an output to which positional encodings are added.

The resulting output is then passed to a self-attention-based multi-layer transformer, which generates an encoding for each of its layers i. As shown in fig. 3, the self-attention-based multi-layer transformer includes transformer layers 351 and 352. And although the self-attention-based multi-layer transformer is shown with two layers, in some embodiments the self-attention-based multi-layer transformer may include only a single layer or three or more layers. Transformer layers 351 and 352 each include a multi-head self-attention mechanism followed by a position-wise fully connected feed-forward network, along with residual connections and layer normalization, as described in further detail below with respect to figs. 4 and 5.

Fig. 4 is a simplified diagram of an attention network 400 according to some embodiments. As shown in FIG. 4, the attention network 400 receives a query q, a key k, and a value v. Each of q, k, and v is subject to a respective weight W^Q 410, W^K 420, and W^V 430 according to equations 6-8. The weights W^Q 410, W^K 420, and W^V 430 are altered during training using back propagation.

    Q = q W^Q        (Equation 6)
    K = k W^K        (Equation 7)
    V = v W^V        (Equation 8)

The resulting Q, K, and V vectors are passed through an attention transfer function 440, which generates a dot product of Q and K and then applies it to V according to equation 9.

    Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V        (Equation 9)

An addition and normalization module 450 is then used to combine the query q with the output of the attention transfer function, providing a residual connection that improves the rate of learning of attention network 400. The addition and normalization module 450 implements equation 10, where μ and σ are, respectively, the mean and standard deviation of the input vector and g_i is a gain parameter used to scale the layer normalization. The output of the addition and normalization module 450 is the output of the attention network 400.

    LayerNorm(Attention(Q, K, V) + q)        (Equation 10)

The attention network 400 is often used in two variations. A first variation is a multi-head attention layer, in which multiple attention networks consistent with attention network 400 are implemented in parallel, each "head" of the multi-head attention network having its own weights W^Q 410, W^K 420, and W^V 430, which are initialized to different values and thus trained to learn different encodings. The outputs of the heads are then concatenated together to form the output of the multi-head attention layer. A second variation is a self-attention layer, which is a multi-head attention layer in which the q, k, and v inputs are the same for each head of the attention network.
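The following Python (PyTorch) sketch illustrates an attention network consistent with the description of FIG. 4: learned projections for q, k, and v (equations 6-8), scaled dot-product attention (equation 9), and a residual addition followed by layer normalization (equation 10), assembled into a multi-head self-attention variant. The output projection that maps the concatenated heads back to the model dimension, as well as the specific sizes, are assumptions of this sketch.

    import math
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionHead(nn.Module):
        def __init__(self, d_model=200, d_head=128):
            super().__init__()
            self.w_q = nn.Linear(d_model, d_head, bias=False)   # W_Q (equation 6)
            self.w_k = nn.Linear(d_model, d_head, bias=False)   # W_K (equation 7)
            self.w_v = nn.Linear(d_model, d_head, bias=False)   # W_V (equation 8)

        def forward(self, q, k, v):
            Q, K, V = self.w_q(q), self.w_k(k), self.w_v(v)
            scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
            return F.softmax(scores, dim=-1) @ V                 # equation 9

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, d_model=200, d_head=128, n_heads=3):
            super().__init__()
            self.heads = nn.ModuleList(
                [AttentionHead(d_model, d_head) for _ in range(n_heads)])
            self.out = nn.Linear(n_heads * d_head, d_model)      # assumed projection
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x):
            # self-attention: q, k, and v are all the same input sequence
            heads = torch.cat([h(x, x, x) for h in self.heads], dim=-1)
            return self.norm(self.out(heads) + x)                # equation 10

    x = torch.randn(1, 12, 200)
    print(MultiHeadSelfAttention()(x).shape)                     # torch.Size([1, 12, 200])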

Self-attention-based layers are described in further detail in Vaswani et al., "Attention Is All You Need," arXiv preprint arXiv:1706.03762, submitted June 12, 2017, the entire contents of which are incorporated herein by reference.

Fig. 5 is a simplified diagram of a layer 500 for an attention-based transformer network, according to some embodiments. According to some embodiments, each transformer layer 351 and/or 352 of system 300 is consistent with layer 500. As shown in fig. 5, layer 500 includes an encoding layer 510 and a decoding layer 520.

Encoding layer 510 receives a layer input (e.g., from an input network for the first layer in an encoding stack, or from the layer output of the next lowest layer for all other layers of the encoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention layer 511, so that multi-head attention layer 511 is configured as a self-attention network. Each head of multi-head attention layer 511 is consistent with attention network 400. In some embodiments, multi-head attention layer 511 includes three heads; however, other numbers of heads, such as two or more than three, are also possible. In some embodiments, each attention layer has a size of 200 and a hidden size of 128. The output of multi-head attention layer 511 is provided to a feed-forward network 512, and both the input and the output of feed-forward network 512 are provided to an addition and normalization module 513, which generates the layer output for encoding layer 510. In some embodiments, feed-forward network 512 is a two-layer perceptron network implementing equation 11, where γ is the input to feed-forward network 512 and M_i and b_i are, respectively, the weights and biases of each layer of the perceptron network. In some embodiments, the addition and normalization module 513 is substantially similar to the addition and normalization module 450.

    FF(γ) = max(0, γ M_1 + b_1) M_2 + b_2        (Equation 11)

Decoding layer 520 receives a layer input (e.g., from an input network for the first layer in a decoding stack, or from the layer output of the next lowest layer for all other layers of the decoding stack) and provides it to all three (q, k, and v) inputs of a multi-head attention layer 521, so that multi-head attention layer 521 is configured as a self-attention network. Each head of multi-head attention layer 521 is consistent with attention network 400. In some embodiments, multi-head attention layer 521 includes three heads; however, other numbers of heads, such as two or more than three, are also possible. The output of multi-head attention layer 521 is provided as the q input to another multi-head attention layer 522, and the k and v inputs of multi-head attention layer 522 are provided with the encodings output by the corresponding encoding layer. Each head of multi-head attention layer 522 is consistent with attention network 400. In some embodiments, multi-head attention layer 522 includes three heads; however, other numbers of heads, such as two or more than three, are also possible. In some embodiments, each attention layer has a size of 200 and a hidden size of 128. The output of multi-head attention layer 522 is provided to a feed-forward network 523, and both the input and the output of feed-forward network 523 are provided to an addition and normalization module 524, which generates the layer output for decoding layer 520. In some embodiments, the feed-forward network 523 and the addition and normalization module 524 are substantially similar to the feed-forward network 512 and the addition and normalization module 513, respectively.
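The following Python (PyTorch) sketch illustrates an encoding layer of the general form of encoding layer 510: a multi-head self-attention sublayer followed by the two-layer position-wise feed-forward network of equation 11, each wrapped with a residual connection and layer normalization. It uses torch.nn.MultiheadAttention for brevity, and the sizes are illustrative assumptions (four heads are used here only because that module requires the model dimension to be divisible by the number of heads, unlike the three heads described above).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EncodingLayer(nn.Module):
        def __init__(self, d_model=200, n_heads=4, d_ff=512):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff1 = nn.Linear(d_model, d_ff)     # M_1, b_1
            self.ff2 = nn.Linear(d_ff, d_model)     # M_2, b_2
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)        # q = k = v = layer input
            x = self.norm1(x + attn_out)            # add and normalize
            ff = self.ff2(F.relu(self.ff1(x)))      # FF(γ) = max(0, γM_1 + b_1)M_2 + b_2
            return self.norm2(x + ff)               # add and normalize

    x = torch.randn(1, 12, 200)
    print(EncodingLayer()(x).shape)                 # torch.Size([1, 12, 200])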

Referring again to FIG. 3, the output of the encoding side of the self-attention-based multi-layer transformer is passed to a biLSTM 360, which generates the final encoded sequence h. The final encoded sequence h is then passed to a word generator 370, as described in further detail below with respect to fig. 6. In some embodiments, biLSTM 360 has a hidden size of 200.

The output of the decoding side of the self-attention-based multi-layer transformer is a sequence of vectors z. The sequence of vectors z is also passed to the word generator 370, and as each word of the answer p is generated, the generated word is passed back to the first layer of the decoding side of the self-attention-based multi-layer transformer.

Fig. 6 is a simplified diagram of word generator 370, according to some embodiments. Word generator 370 treats z as a sequence of input vectors and h as its context for attention. The word generator operates iteratively to generate the answer p of system 300. The answer p is first initialized with a sentinel entry, which is removed after the complete answer p has been generated. At each iteration t (as indicated by the subscripts in FIG. 6), the next word of the answer p is generated as p_t, as described further below.

At time step t, a single-layer unidirectional LSTM 610 generates a context-adjusted hidden state using equation 12, based on a concatenation of the previous input z_(t-1) from the decoder side of the self-attention-based multi-layer transformer and the previous hidden state from the previous time step, together with the previous context-adjusted hidden state.

An attention layer 620 then generates a vector of attention weights α_t, representing the relevance of each encoding time step to the current decoder state, based on the final encoded sequence h and the context-adjusted hidden state using equation 13, where H is the elements of h stacked along the time dimension and W_1 and b_1 are trainable weights and biases of attention layer 620.

A vocabulary layer, comprising a tanh layer 630 and a softmax layer 640, then generates a distribution p_vocab(w_t) over the words in the vocabulary as candidates for the next word p_t of the answer p. The tanh layer 630 generates a hidden state for the current time step using equation 14, based on the attention weights α_t, the final encoded sequence h, and the context-adjusted hidden state; in equation 14, H is the elements of h stacked along the time dimension, and W_2 and b_2 are trainable weights and biases of the tanh layer 630.

The softmax layer 640 generates the distribution p_vocab(w_t) over the words in the vocabulary, as candidates for the next word p_t of the answer p, based on the hidden state using equation 15, where W_out and b_out are trainable weights and biases of the softmax layer 640.

A context layer 650 generates a distribution p_copy(w_t) over the words in the context c, as candidates for the next word p_t of the answer p, based on the attention weights α_t using equation 16.

Switch 660 determines how to weight the p_vocab(w_t) and p_copy(w_t) distributions relative to each other. Switch 660 first determines a weighting factor γ based on the hidden state, the context-adjusted hidden state, and the previous input z_(t-1) from the decoder side of the self-attention-based multi-layer transformer using equation 17, where σ denotes a sigmoid transfer function (e.g., log-sigmoid, hyperbolic tangent sigmoid, and the like) and W_switch is a trainable weight of the weighting-factor layer. In some embodiments, the weighting factor γ may further be determined using a trainable bias b_switch.

Switch 660 then uses the weighting factor γ to generate a final output distribution over the union of the words in the vocabulary and the words in the context using equation 18. The next word p_t of the answer p can then be determined based on p(w_t).

    p(w_t) = γ p_vocab(w_t) + (1 - γ) p_copy(w_t)        (Equation 18)
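The following Python sketch illustrates the weighting of equations 17 and 18: a scalar factor γ blends the vocabulary distribution with the copy-from-context distribution, and the next word is selected from the composite distribution. The way the copy weights are scattered onto vocabulary indices, and all tensor shapes, are assumptions of this sketch.

    import torch

    def mix_distributions(gamma, p_vocab, copy_weights, context_token_ids, vocab_size):
        # gamma: (batch, 1) in [0, 1]; p_vocab: (batch, vocab_size)
        # copy_weights: (batch, context_len) attention-derived copy distribution
        # context_token_ids: (batch, context_len) vocabulary ids of the context words
        p_copy = torch.zeros(p_vocab.size(0), vocab_size)
        p_copy.scatter_add_(1, context_token_ids, copy_weights)
        return gamma * p_vocab + (1.0 - gamma) * p_copy       # equation 18

    batch, vocab_size, context_len = 1, 10, 4
    p_vocab = torch.softmax(torch.randn(batch, vocab_size), dim=-1)
    alpha = torch.softmax(torch.randn(batch, context_len), dim=-1)
    ids = torch.randint(0, vocab_size, (batch, context_len))
    gamma = torch.sigmoid(torch.randn(batch, 1))              # equation 17 (sigmoid)
    p = mix_distributions(gamma, p_vocab, alpha, ids, vocab_size)
    next_word = p.argmax(dim=-1)       # select the next word p_t from the composite distribution
    print(p.sum(), next_word)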

As discussed above and further emphasized here, fig. 3 is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. According to some embodiments, one or more layers in system 300 are optional and may be omitted. In some embodiments, linear layers 310 and/or 315 are optional and may be omitted, in which case the encodings for context c and question q are passed directly to biLSTMs 320 and 325, respectively. In some embodiments, biLSTMs 320 and/or 325 are optional and may be omitted, in which case the outputs of linear layers 310 and 315 are passed directly to the coattention layer 330. In some embodiments, linear layers 310 and 315 and biLSTMs 320 and 325 are optional and may be omitted, in which case the encodings for context c and question q are passed directly to the coattention layer 330.

Because system 300 is used for multiple task types (e.g., classification such as sentiment analysis, language translation, and question answering) and its parameters are shared across the various layers for all task types, it may suffer from catastrophic forgetting if it is not carefully trained. To address this issue, in some embodiments, system 300 may be trained according to a joint training strategy, in which the ordering of the training samples presented to system 300 is chosen so that system 300 is trained simultaneously on a balanced mix of each of the task types. That is, the order in which the training samples are presented to system 300 selects consecutive training samples, or consecutive small groups (e.g., of 2 to 10 or so) of training samples, from different task types. In some embodiments, the joint training strategy includes selecting a training sample (context c, question q, and ground truth answer) from a different one of the task types with each iteration of training. The goal of the joint training strategy is to train on each task type simultaneously rather than over-emphasizing one task type. In practice, however, although system 300 learns each of the task types, it does not learn any of the task types particularly well. Joint training strategies are described in more detail in Collobert et al., "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning," International Conference on Machine Learning, 2008, pages 160-167, and Hashimoto et al., "A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks," Conference on Empirical Methods in Natural Language Processing, 2017, pages 1923-1933, each of which is incorporated herein by reference in its entirety.

In some embodiments, system 300 may be trained according to a sequential training strategy, in which the ordering of the training samples presented to system 300 is chosen so that system 300 is trained on each task type individually. That is, the ordering in which training samples are presented to system 300 for training is to present each of the training samples for a first task type, then each of the training samples for a second task type, and so on, and then to present each of the training samples for the first task type again, and so on. In the sequential training strategy, when training on one of the task types finishes and training switches to a second task type, catastrophic forgetting of the first task type begins to occur. However, after the training samples for each task type have been presented in sequence multiple times, system 300 begins to recover its training on each previously trained task type more quickly and accumulates dormant knowledge. In some embodiments, because of the catastrophic forgetting that occurs when training switches between task types, system 300 generally exhibits strong learning only of the last trained task type. Sequential training strategies are described in more detail in Kirkpatrick et al., "Overcoming Catastrophic Forgetting in Neural Networks," Proceedings of the National Academy of Sciences, 2017, pages 3521-3526, the entire contents of which are incorporated herein by reference.

In some embodiments, attempts to address the limitations of the joint and sequential training strategies include generating Fisher information, which is computationally expensive, and using task-specific modifications (e.g., packing strategies and/or adaptation strategies), which run counter to the goal of a unified system for all task types.

In some embodiments, system 300 may be trained according to a hybrid training strategy. In the hybrid training strategy, system 300 is initially trained using the sequential training strategy. This allows system 300 to accumulate dormant knowledge of each of the task types. After multiple passes through the training samples for each task type, system 300 is then trained using the joint training strategy. Because of the dormant knowledge from the initial sequential training, the subsequent joint training is able to learn each of the task types more effectively than joint training alone without the initial sequential training, even while multitasking. By allowing system 300 to let previously trained task types fall dormant during the initial sequential training, the hybrid training strategy gives system 300 more time to focus on specializing on each of the task types. In some embodiments, the hybrid training strategy decouples the goal of learning each task type from the goal of learning all of the task types together. System 300 is thus well prepared to learn each of the task types when training switches to the joint training strategy.
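The following Python sketch illustrates the sample orderings produced by the sequential, joint, and hybrid training strategies discussed above. The use of uniform random sampling to approximate the balanced mix of the joint strategy, and the toy data sets, are illustrative assumptions.

    import random

    def sequential_order(datasets, passes=1):
        # all samples of task 1, then all samples of task 2, ..., repeated
        for _ in range(passes):
            for task, samples in datasets.items():
                for sample in samples:
                    yield task, sample

    def joint_order(datasets, iterations):
        # successive samples are drawn from different (here: randomly chosen) task types
        tasks = list(datasets)
        for _ in range(iterations):
            task = random.choice(tasks)
            yield task, random.choice(datasets[task])

    def hybrid_order(datasets, sequential_passes, joint_iterations):
        # hybrid strategy: sequential passes first, then switch to joint training
        yield from sequential_order(datasets, passes=sequential_passes)
        yield from joint_order(datasets, joint_iterations)

    datasets = {
        "translation": ["t%d" % i for i in range(5)],
        "classification": ["c%d" % i for i in range(5)],
        "question_answering": ["q%d" % i for i in range(5)],
    }
    for task, sample in hybrid_order(datasets, sequential_passes=2, joint_iterations=6):
        pass  # present the (context, question, answer) of `sample` to the model here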

In some embodiments, system 300 is trained according to a composite training strategy, which is a variation of the hybrid training strategy. In the composite training strategy, system 300 is initially trained using the sequential training strategy, but at fixed intervals and for a fixed number of iterations during the sequential training, the training switches to a joint training strategy over each of the task types that have previously been trained and then returns to the sequential training strategy. By temporarily switching to a joint training strategy over the previously learned task types, system 300 is reminded of the older task types more often and is also forced to synthesize old knowledge with new knowledge.

Fig. 7 is a simplified diagram of a method 700 of multitask learning according to some embodiments. One or more of the processes 710-780 of method 700 may be implemented, at least in part, in the form of executable code stored on a non-transitory, tangible machine-readable medium that, when executed by one or more processors, may cause the one or more processors to perform one or more of the processes 710-780. In some embodiments, method 700 may be used as a hybrid training strategy for training system 300; however, method 700 may also be used to train multitasking systems other than system 300. In some embodiments, the task types trained by method 700 may include any of a variety of natural language processing tasks, such as language translation, classification (e.g., sentiment analysis), question answering, and the like.

At process 710, training samples are selected according to a first training strategy. In some embodiments, the first training strategy is a sequential training strategy, in which training samples are selected from the training samples for a first task type until each of the training samples for the first task type has been selected, and training samples are then selected from a second task type different from the first task type until each of the training samples for the second task type has been selected. Training samples are then selected from each additional task type (if any) in turn, with the switch to the next task type occurring after each of the training samples for the current task type has been selected. In some embodiments, each selected training sample includes a natural language context, a natural language question, and a ground truth natural language answer corresponding to the context and the question.

At process 720, the selected training sample is presented to the system. In some embodiments, the system is system 300. When the training sample is applied to the system, it is fed forward through the various layers of the system according to the currently trained parameters (e.g., weights and biases), and the system generates an answer. In some embodiments, the answer is a natural language phrase.

At process 730, the system is adjusted based on error. The answer generated by the system during process 720 is compared to the ground truth answer of the selected training sample, and an error for the selected training sample is determined. The error can then be fed back through system 300 using back propagation to update the various parameters (e.g., weights and biases) of the layers. In some embodiments, the back propagation may be performed using a stochastic gradient descent (SGD) training algorithm, an adaptive moment estimation (ADAM) training algorithm, and the like. In some embodiments, the gradients used for back propagation may be clipped to 1.0. In some embodiments, the learning rate decay may be the same as that used by Vaswani et al., "Attention Is All You Need," arXiv preprint arXiv:1706.03762, submitted June 12, 2017.
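The following Python (PyTorch) sketch illustrates a single parameter-adjustment step of the kind performed at process 730, using an ADAM optimizer and gradient clipping to 1.0. The placeholder model, the use of a cross-entropy loss, and the interpretation of clipping as gradient-norm clipping are assumptions of this sketch.

    import torch
    import torch.nn as nn

    model = nn.Linear(16, 100)                       # placeholder for system 300
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def training_step(features, answer_token_ids):
        optimizer.zero_grad()
        logits = model(features)                     # (answer_len, vocab_size)
        loss = loss_fn(logits, answer_token_ids)     # error vs. ground truth answer
        loss.backward()                              # back propagation
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip gradients to 1.0
        optimizer.step()                             # adjust weights and biases
        return loss.item()

    features = torch.randn(7, 16)                    # 7 answer positions, toy features
    targets = torch.randint(0, 100, (7,))
    print(training_step(features, targets))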

At process 740, it is determined whether to switch from the first training strategy to a second training strategy. In some embodiments, the decision to switch to the second training strategy occurs after each of the training samples for each of the task types has been selected a predetermined number of times. In some embodiments, the predetermined number of times may be five, although any other number, such as three, four, or six or more, may also be used. In some embodiments, one or more other factors may be used to decide when to switch to the second training strategy. In some embodiments, the one or more other factors may include monitoring changes in the performance indicators for each of the task types with each pass through the training samples and deciding to switch when the improvement in each of the performance indicators after each pass is below a threshold amount. When it is determined not to switch to the second training strategy, method 700 returns to process 710, where training samples continue to be selected according to the first training strategy. When it is determined to switch to the second training strategy, selection of the training samples using the second training strategy begins with process 750.
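The following Python sketch illustrates one possible form of the switching test described above, in which the switch occurs once the improvement of every task type's performance indicator over the previous pass falls below a threshold. The metric names and the threshold value are illustrative assumptions.

    def should_switch(previous_metrics, current_metrics, threshold=0.1):
        # switch when no task type improved by at least `threshold` since the last pass
        return all(current_metrics[task] - previous_metrics[task] < threshold
                   for task in current_metrics)

    print(should_switch({"translation": 21.0, "classification": 85.0},
                        {"translation": 21.05, "classification": 85.02}))   # True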

At process 750, training samples are selected according to a second training strategy. In some embodiments, the second training strategy is a joint training strategy, where training samples are equally selected from the training samples for each task type.

At process 760, the selected training samples are presented to the system using substantially the same process as process 720.

At process 770, the system is adjusted based on the error using substantially the same process as process 730.

At process 780, it is determined whether the training is complete. In some embodiments, training is complete after the training samples for each of the task types have been presented to the system a predetermined number of times. In some embodiments, the predetermined number of times may be eight, although any other number, such as two to seven or nine or more, may also be used. In some embodiments, one or more other factors may be used to decide when training is complete. In some embodiments, the one or more other factors may include monitoring changes in the performance indicators for each of the task types with each pass through the training samples and noting that training is complete when the improvement in each of the performance indicators after each pass is below a threshold amount. When it is determined that training is not complete, method 700 returns to process 750, where training samples continue to be selected according to the second training strategy. When it is determined that training is complete, method 700 ends and the trained system may now be used for any of the task types for which it was trained.

After training is complete, the trained system can be used for any of the task types using a process substantially similar to processes 720 and/or 760, in which a context c and a question q are presented to the system and fed forward through the various layers of the system according to the parameters (e.g., weights and biases) trained according to method 700. The generated answer is then the answer to question q for the presented context c.

As discussed above and further emphasized here, fig. 7 is merely an example, which should not unduly limit the scope of the claims. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. In some embodiments, method 700 is adapted to use the composite training strategy. In the composite training strategy, the first training strategy is a variation of the sequential training strategy, and the second training strategy may be the joint training strategy. The variation of the sequential training strategy generally includes selecting training samples according to the sequential training strategy, except during intervals in which training samples are selected according to the joint training strategy. In some embodiments, the location and placement of the joint training intervals may be based on the number of training iterations for each task type (e.g., the number of training samples presented to the system). As a non-limiting example, the selection of training samples may include selecting 10,000 training samples for a first task type, jointly selecting 1,000 training samples from each task type, selecting another 10,000 training samples for the first task type, jointly selecting another 1,000 training samples from each task type, and so on, repeating until each of the training samples for the first task type has been selected, and then selecting 10,000 training samples for a second task type, and so on. In some embodiments, the number of training samples selected before alternating between sequential-type selection and joint-type selection may be based on a percentage of the number of training samples for each task type (e.g., alternating after every 10% to 25% of the training samples for the respective task type).
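The following Python sketch illustrates the composite ordering described in the non-limiting example above, in which blocks of sequential selection for the current task type are interleaved with joint intervals over the task types trained so far. The block sizes default to the 10,000/1,000 values of the example; the sampling details are assumptions of this sketch.

    import random

    def composite_order(datasets, seq_block=10000, joint_block=1000):
        # datasets: {task_type: list_of_samples}
        seen = []
        for task, samples in datasets.items():
            seen.append(task)
            for start in range(0, len(samples), seq_block):
                # sequential interval: the next block of samples for the current task type
                for sample in samples[start:start + seq_block]:
                    yield task, sample
                # joint interval: a fixed number of samples from each task type seen so far
                for t in seen:
                    pool = datasets[t]
                    for _ in range(min(joint_block, len(pool))):
                        yield t, random.choice(pool)

    data = {"task_a": list(range(25)), "task_b": list(range(25))}
    order = list(composite_order(data, seq_block=10, joint_block=3))
    print(len(order))    # 77 (task_a, task_b, and interleaved joint intervals)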

Fig. 8 is a simplified diagram of training performance according to some embodiments. More specifically, FIG. 8 shows the results of training system 300 according to four task types: english to German (EN-DE) language translation, English to French (EN-FR) language translation, question and answer, and sentiment classification.

The training samples for the English-to-German and English-to-French translation task types are based on the International Workshop on Spoken Language Translation English-to-German (IWSLT EN->DE) and English-to-French (IWSLT EN->FR) training sets, which contain approximately 210,000 sentence pairs transcribed from TED talks. The performance indicator used for both language translation task types is the BLEU score.

The training samples for the question-answering task type are based on the Stanford Question Answering Dataset (SQuAD), which contains 10,570 training samples based on questions related to paragraphs taken from Wikipedia articles. The performance indicator used for the question-answering task type is the F1 score.

The training samples for the sentiment classification task type are based on the Stanford Sentiment Treebank (SST), with neutral examples removed. The SST contains approximately 56,400 training samples based on movie reviews and their sentiment. The performance indicator used for the sentiment classification task type is the percentage of exact matches.

FIG. 8 further illustrates the learning results for each task type according to the performance indicators noted above. Three results are shown for each task type. The single column indicates the respective performance indicator when system 300 is trained using only the training samples for the indicated task type. The joint column indicates the same performance indicator when system 300 is trained using the joint training strategy. The hybrid column indicates the same performance indicator when system 300 is trained using the hybrid training strategy of method 700. As expected, the single task type training results have the highest performance indicators because each such version of system 300 is allowed to specialize in a single task. The joint column shows that using the joint training strategy leads to notably worse results, and the hybrid column shows that the hybrid training strategy of method 700 improves upon the joint training strategy. Moreover, with the exception of the sentiment classification task type, the hybrid training strategy of method 700 yields performance results that are significantly superior to those of the joint training strategy.

Fig. 9A through 9C are simplified diagrams of training performance according to some embodiments. FIG. 9A tracks the respective performance indicators over the training iterations when system 300 is trained separately for each task type (e.g., as compared to the single column of fig. 8); fig. 9A thus shows results for four separately trained versions of system 300. FIG. 9B tracks the respective performance indicators when system 300 is trained according to the joint training strategy. As indicated by the performance indicators of fig. 9B, the version of system 300 trained using the joint training strategy does not learn any of the task types particularly well, with the exception of the SST classification task type. FIG. 9C tracks the respective performance indicators when system 300 is trained according to the hybrid training strategy of method 700. The effect of catastrophic forgetting as the training samples switch from one task type to another during the initial sequential training is clearly visible in fig. 9C. After the training samples from each task type have been presented five times using the sequential training strategy and the training switches to the joint training strategy (at approximately 250,000 iterations), the performance indicators quickly improve to values better than those of the joint-training-only approach of fig. 9B and come closer to the performance indicators of the separately trained versions of system 300 in fig. 9A.

Fig. 10A and 10B are simplified diagrams of training performance based on training order according to some embodiments. FIGS. 10A and 10B demonstrate the effect of changing the order in which the training samples for the various task types are presented to system 300 during the initial sequential training of the hybrid training strategy. As shown in fig. 10A, when system 300 is first trained using training samples from the English-to-German (IWSLT EN->DE) language translation task type, before system 300 is trained using training samples from the sentiment classification (SST) task type, system 300 is able to quickly recover its English-to-German translation knowledge when training samples are again drawn from the English-to-German translation task type. In contrast, fig. 10B shows that when system 300 is first trained on the sentiment classification task type, before system 300 is trained on the English-to-German translation task type, system 300 is not able to learn the English-to-German translation task type well. Presumably, this is because initial training on the English-to-German translation task type, with its more complex and richer training samples, results in better initial encoding knowledge.

Some embodiments of a computing device, such as computing device 200, may include a non-transitory, tangible machine-readable medium including executable code that, when executed by one or more processors (e.g., processor 210), may cause the one or more processors to perform the processes of method 700. Some common forms of machine-readable media that may include the processes of method 700 are, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.

The description and drawings showing aspects, embodiments, implementations or applications of the invention should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of the description and claims. In other instances, well-known methods, structures or techniques have not been shown or described in detail to avoid obscuring the understanding of this description. Like reference symbols in two or more drawings indicate like or similar elements.

In the description, specific details are set forth describing some embodiments consistent with the present disclosure. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art, that the embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are intended to be illustrative rather than restrictive. Although not specifically described herein, those skilled in the art will recognize other elements that are within the scope and spirit of the disclosure. Furthermore, to avoid unnecessary repetition, one or more features shown and described in connection with one embodiment may be incorporated into other embodiments unless specifically stated otherwise or if the feature or features render the embodiment inoperative.

While illustrative embodiments have been shown and described, a wide range of modifications, changes, and substitutions is contemplated in the foregoing disclosure and, in some instances, some features of the embodiments may be employed without a corresponding use of the other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Accordingly, the scope of the invention should be limited only by the attached claims, and the claims are to be construed broadly and as appropriate in a manner consistent with the scope of the embodiments disclosed herein.
