Minimization of computational demand in model-agnostic cross-language transfers with neural task representations as a weak supervisor

Document No. 590123, published 2021-05-25

This technology, "Minimization of computational demand in model-agnostic cross-language transfers with neural task representations as a weak supervisor", was designed and created by S. K. Jauhar, M. Gamon, and P. Pantel on 2019-10-11. Abstract: A task-agnostic framework for transferring a neural model from a first language to a second language can minimize computational and monetary costs by accurately forming predictions in a model of the second language while relying only on a labeled data set in the first language, a parallel data set between the two languages, a labeled loss function, and an unlabeled loss function. The models may be trained jointly, or in a two-stage process.

1. A system for transferring a cross-language neural model, comprising:

a processor and a memory, wherein a first neural model and a second neural model are stored in the memory, wherein a language or dialect of the first neural model is different from a language or dialect of the second neural model; and

an operating environment executing commands using the processor to

train the first neural model on annotated data based on a labeled loss function to define and update parameters for each of a plurality of layers of the first neural model; and

train the first neural model and the second neural model on parallel data between the first and second languages or dialects based on an unlabeled loss function to update the parameters for each of the plurality of layers of the first neural model and to define and update parameters for each of a plurality of layers of the second neural model,

wherein all layers except a lowest layer of the first neural model are copied to the second neural model.

2. The system of claim 1, wherein the first neural model comprises:

a first embedding layer that converts language units of the first language or dialect to a vector representation;

a first task-appropriate model architecture having a predetermined network configuration comprising one or more layers; and

a first prediction layer,

wherein one of the layers included in the first task-appropriate model architecture is a first task representation layer, and

wherein the first task representation layer immediately precedes the first prediction layer.

3. The system of claim 1 or 2, wherein the second neural model comprises:

a second embedding layer that converts language units of the second language or dialect to a vector representation;

a second task-appropriate model architecture having a predetermined network configuration comprising one or more layers; and

a second prediction layer.

4. The system of one of claims 1-3, wherein the task of the task-appropriate model architecture comprises one of: sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition.

5. The system of one of claims 1-4, wherein the second neural model is trained without annotated data of the second language or dialect.

6. The system of one of claims 1-5, wherein the second neural model is trained without a translation system, a lexicon, or a pivot dictionary.

7. The system of one of claims 1-6, wherein training resources include annotated data in the first language or dialect and unannotated parallel data in both the first language or dialect and the second language or dialect.

8. A computer-implemented method for cross-language neural model transfer, comprising:

supplying annotated data in a first language to a first neural model in the first language;

training the first neural model in the first language on the annotated data based on a labeled loss function to define and update parameters of the first neural model in the first language;

supplying unannotated parallel data between the first language and a second language to the first neural model in the first language and a second neural model in the second language;

training the first neural model in the first language and the second neural model in the second language on the parallel data to update the parameters of the first neural model in the first language and to define and update parameters of the second neural model in the second language; and

merging a portion of the parameters of the first neural model in the first language into the second neural model in the second language.

9. The method of claim 8, wherein the task of the neural model comprises one of: sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition.

10. The method of claim 8 or claim 9, wherein the second neural model in the second language is trained without annotated data in the second language, a translation system, a dictionary, or a pivot dictionary.

11. The method of one of claims 8 to 10, wherein training resources comprise annotated data in the first language and unannotated parallel data in both the first language and the second language.

12. The method of one of claims 8 to 11, wherein training the first neural model in the first language on the annotated data to define and update parameters of the first neural model comprises optimizing a labeled loss function of the first neural model in the first language.

13. The method of one of claims 8 to 12, wherein training the first neural model in the first language and the second neural model in the second language on the parallel data to update the parameters of the first neural model in the first language and to define and update parameters of the second neural model in the second language comprises: optimizing a loss function between task representations produced on the parallel data by the first neural model in the first language and the second neural model in the second language.

14. A computer-implemented method for cross-language neural model transfer, comprising:

supplying annotated data in a first language to a first neural model in the first language;

training the first neural model in the first language on the annotated data based on a labeled loss function to define and update parameters of the first neural model in the first language;

freezing the parameters of the first neural model in the first language;

supplying unannotated parallel data between the first language and a second language to the first neural model in the first language and a second neural model in the second language;

training the second neural model in the second language on the unannotated parallel data to define and update parameters of the second neural model in the second language; and

merging a portion of the parameters of the first neural model in the first language into the second neural model in the second language.

15. The method of claim 14, wherein the task of the neural model comprises one of: sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition.

Technical Field

The subject technology relates generally to transferring a neural model from one language to a second language. More particularly, the subject technology relates to transferring a neural model from one language to a second language using a representation projection as weak supervision.

Background

At present, natural language processing is largely centered on English, and the demand for models that work in languages other than English is greater than ever. However, transferring a model from one language to another can be expensive in terms of annotation cost, engineering time, and effort.

Current research in Natural Language Processing (NLP) and deep learning has resulted in systems that can achieve human parity in several key areas of research, such as speech recognition and machine translation; that is, these systems perform at the same level as, or a higher level than, humans. However, much of this research has been conducted around English-centric models, methods, and data sets.

It is estimated that only about 350 million people are native English speakers, while another 500 million to 1 billion people speak English as a second language. Together, this accounts for at most about 20% of the world population. As language technology enters people's digital lives, there is a need for NLP applications that can understand the other 80% of the world. However, building such systems from scratch can be expensive, time consuming, and technically challenging.

Disclosure of Invention

In accordance with one aspect of the present technology, a method for cross-language neural model transfer may include: training a first neural model in a first language having a plurality of layers on annotated data in the first language based on a labeled loss function, wherein the training of the first neural model comprises defining and updating parameters for each of the layers of the first neural model; and training a second neural model in a second language having a plurality of layers on parallel data between the first language and the second language based on the unlabeled loss function, wherein the training of the second neural model includes replicating all layers of the first neural model except a lowest layer, and defining and updating parameters of the lowest layer of the second neural model.

The training may be a two-stage process, in which the first model is fully trained before the second model is trained, or alternatively a joint process, in which the first model and the second model are co-trained after an initial training of the first model.

The following description and the annexed drawings set forth in detail certain illustrative aspects of the claimed subject matter. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the claimed subject matter is intended to include all such aspects and their equivalents. Other advantages and novel features of the claimed subject matter will become apparent from the following detailed description of the invention when considered in conjunction with the drawings.

Drawings

Non-limiting and non-exhaustive examples are described with reference to the following figures.

FIG. 1 illustrates a framework for cross-language neural model transfer, according to an embodiment;

FIG. 2 illustrates a neural model architecture according to an embodiment;

FIG. 3 shows a flow diagram depicting a method for cross-language neural model transfer, in accordance with an embodiment;

FIG. 4 shows a flow diagram depicting a method for cross-language neural model transfer, in accordance with another embodiment;

FIG. 5 illustrates an exemplary block diagram of a computer system in which embodiments may be implemented.

Detailed Description

In the following detailed description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments or examples. These aspects may be combined, other aspects may be utilized, and structural changes may be made without departing from the disclosure. Embodiments may be practiced as methods, systems or devices. Accordingly, embodiments may take the form of a hardware implementation, an entirely software implementation, or an implementation combining software and hardware aspects. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims and their equivalents.

One reason NLP systems are expensive, time consuming, and technically challenging to build from scratch is that high-performance NLP models typically rely on large amounts of high-quality annotated data, which costs annotator time, effort, and money. Annotated data is a language artifact (e.g., arbitrary text) that has been annotated with an additional artifact. For example, text may be examined against a criterion, and tags or annotations may be added to the text based on that criterion; the criterion may be sentiment, for instance, and the tags or annotations may indicate positive or negative sentiment.

Other exemplary criteria include: style classification, where the label may indicate whether an artifact is formal or informal; intent understanding, where the tag may include a prediction of the intent of an artifact, selected from a plurality of predetermined intents (e.g., scheduling an event, requesting information, or providing an update); message routing, where the label may include a prediction of the primary recipient among a plurality of recipients; duration prediction, where the tag may include a prediction of an event's duration; and structured content recognition, where the tag may include a prediction of an artifact's category (e.g., categorizing an email as a flight itinerary, a transportation notice, or a hotel reservation).

In view of the enormous cost of building systems from scratch, much of the research community's effort to build tools for other languages relies on transferring existing English models to those languages.

Previous efforts to transfer English models to other languages have relied on Machine Translation (MT) to translate training or test data from English to the target language. Other efforts have also considered using bilingual dictionaries to transfer features directly.

Constructing the most advanced MT systems requires expertise and a large amount of training data, which is expensive. Also, constructing a bilingual dictionary can be equally expensive if done manually, and can contain significant noise if introduced automatically.

Other studies have examined the transferability of neural network components in the context of image recognition. These studies illustrate a technical problem of conventional approaches: higher layers of the network tend to be more specialized and domain specific, and therefore generalize less well.

However, the technical solution according to embodiments of the present disclosure includes a framework that takes the opposite approach to cross-language transfer: higher layers of the network are shared between models of different languages, while separate language-specific embeddings (i.e., the parameters of the lower layers of the network) are maintained. By sharing the higher layers of the network, accurate models can be generated in multiple languages without relying on MT, bilingual dictionaries, or data annotated in the target language.

Sharing information across domains is also relevant to multitask learning. The work in this area can be broadly divided into two approaches: hard parameter sharing and soft parameter sharing. In hard parameter sharing, the model shares a common architecture with some task-specific layers, whereas in soft parameter sharing, the tasks have their own sets of parameters, constrained by some sharing cost.

Previous research on label projection, feature projection, and weak supervision differs from embodiments of the present disclosure, which are drawn to a neural framework that integrates task representation, model learning, and cross-language transfer in a joint approach while remaining flexible enough to accommodate a variety of target applications.

In addressing the technical problems faced by conventional techniques, embodiments of the general framework of the present disclosure can easily and efficiently transfer neural models from one language to another. The framework relies on the task representation as a form of weak supervision and is model and task agnostic. Typically, a neural network includes a series of nodes arranged in layers, including an input layer and a prediction layer. The portion of the neural network between the input layer and the prediction layer may include one or more layers that transform the input into a representation. Each layer after the input layer builds on the output of the previous layer, so the feature complexity and abstraction increase from layer to layer. The task representation captures an abstract description of the prediction problem and is embodied as the layer immediately before the prediction layer in the neural network model. By utilizing the disclosed framework, many existing neural architectures can be ported to other languages with minimal effort.

The only requirements for transferring a neural model according to embodiments of the present disclosure are parallel data and a loss defined on the task representation.

A framework according to embodiments of the present disclosure may reduce monetary and computational costs by abandoning any dependency on machine translation or bilingual dictionaries while still accurately capturing semantically rich and meaningful representations across languages. By eliminating any dependency on, or interaction with, a translation component, the framework can reduce the number of instructions processed by the processor, thereby increasing system speed, saving memory, and reducing power consumption.

With respect to these and other general considerations, embodiments of the present disclosure are described below. Additionally, while relatively specific problems have been discussed, it should be understood that embodiments should not be limited to solving the specific problems identified above.

In the following, a framework according to embodiments is described that can transfer an existing neural model of a first language to a second language with minimal cost and effort.

Specifically, the framework: (i) is model and task agnostic, and can therefore be applied to a variety of new and existing neural architectures; (ii) requires only a parallel corpus, and needs no target-language training data, translation system, or bilingual dictionary; and (iii) has a single modeling requirement, defining a loss on the task representation, thereby greatly reducing the engineering effort, monetary cost, and computational cost involved in transferring a model from one language to another.

Embodiments are particularly useful when a high-quality MT system is not available for the target language or the application-specific domain. Traditionally, an MT system, a bilingual dictionary, or a pivot dictionary (pivot lexicon) is required to transfer a model from one language to another; according to embodiments, however, none of these is needed to accurately predict results at a rate comparable to, or even exceeding, conventional solutions.

A framework for transferring a neural model from a first language to a second language according to an embodiment is described in more detail. For purposes of example, embodiments are shown and described in which the first language is english and the second language is french. Of course, the present technology is not so limited, and it should be understood that the only limitation of the first language and the second language is that they are not the same dialect of the same language.

FIG. 1 illustrates an exemplary framework 100 for transferring an English neural model 200 to a French neural model 300. As shown in FIG. 1, the framework includes a training portion or module 101 and a testing portion or module 102. FIG. 1 depicts the implementation of both joint training and two-stage training, which will be discussed in detail below.

Training portion 101 depicts the English neural model 200 and the French neural model 300. As shown in FIG. 1, the training portion 101 depicts how the English neural model 200 is trained and how the English neural model 200 is transferred to the French neural model 300. Training portion 101 of framework 100 utilizes labeled English data $D_L$ and unlabeled parallel data $D_P$, where $D_P$ includes English parallel data $P_E$ and French parallel data $P_F$.

Labeled data, which may also be referred to as annotated data, is data that has typically been supplemented directly by a human with contextual information.

As long as the parallel data is aligned between languages, it may be aligned at any level, including character level, word level, sentence level, paragraph level, or other level.

According to the exemplary embodiment shown in FIGS. 1 and 2, the labeled English data $D_L$ is supplied to the English neural model 200.

English neural model 200 may be a neural NLP model and may include three different components: an embedding layer 201, a task-appropriate model architecture 202, and a prediction layer 203.

In more detail, the English neural NLP model 200 includes a first layer, the embedding layer 201, which translates language units w (characters, words, sentences, paragraphs, pseudo-paragraphs, etc.) into a mathematical representation of the language units w. The mathematical representation may preferably be a dense vector representation comprising mainly non-zero values, or alternatively a sparse vector representation comprising many zero values.

The third layer is a prediction layer 203, which is used to generate probability distributions over the space of output labels. According to an example embodiment, the prediction layer 203 may include a softmax function.

Between the prediction layer 203 and the embedding layer 201 is a task-appropriate model architecture 202.

Since the framework 100 is model and task agnostic, the structure of the task-appropriate model architecture 202 may include any number of layers and any number of parameters. That is, the task-appropriate model architecture 202 is the portion that adapts the model to a particular task or application, and the configuration and number of layers of this network do not affect the generic framework.

Thus, for simplicity, the task-appropriate model architecture 202 is depicted as including an x-layer network 202a (where x is a non-zero integer number of layers) and a task representation layer 202b, which is the layer immediately preceding the prediction layer 203.
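To make this three-component structure concrete, the following PyTorch-style sketch (all class, attribute, and dimension names are illustrative assumptions, not taken from the patent) separates the language-specific embedding layer from the shared task-appropriate layers and the prediction layer, and exposes the task representation explicitly:

```python
import torch.nn as nn

class TaskModel(nn.Module):
    """Illustrative sketch: language-specific embedding + shared task-appropriate layers."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_labels):
        super().__init__()
        # Language-specific lowest layer: maps language units to vectors (U or V).
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Task-appropriate model architecture: any x-layer stack whose final layer
        # is the task representation layer (a simple feed-forward stack here).
        self.task_layers = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        # Prediction layer: scores over the K output labels.
        self.prediction = nn.Linear(hidden_dim, num_labels)

    def task_representation(self, token_ids):
        # token_ids: (batch, sequence_length); crude mean pooling for illustration only.
        pooled = self.embedding(token_ids).mean(dim=1)
        return self.task_layers(pooled)

    def forward(self, token_ids):
        return self.prediction(self.task_representation(token_ids))
```

In this sketch only the embedding differs between the English and French instances; the task_layers and prediction modules are the parts that are shared during transfer.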

As shown in FIG. 1, the test portion 102 includes the French model 300, with a French embedding layer 301, a task-appropriate model architecture 302, and a prediction layer 303. According to an embodiment of framework 100, the test portion 102 represents the use of the French model 300 to classify unlabeled French data $D_F$.

FIG. 2 illustrates an example of a model architecture for neural models 200 and 300, according to one embodiment. The neural models 200 and 300 may be configured as a hierarchical recurrent neural network (RNN) 400, but it should be understood that this is exemplary only and the architecture is not limited thereto.

As shown in FIG. 2, a sequence of language units $w_{11}, \ldots, w_{nm}$ from a data set is embedded by an embedding layer 401. The embedded sequence of language units is converted into sentence representations 403 by a sentence RNN 402, and the sequence of sentence representations 403 is converted into a task representation 405 by a review RNN 404. The task representation 405 is then passed to a prediction layer 406, which is used to generate a probability distribution over the space of the output labels 407. The number of output labels 407 is equal to the number of possible outcomes of the prediction task.

According to one embodiment, the RNNs may comprise, for example, gated recurrent units (GRUs). However, it should be understood that the present disclosure is not so limited, and the RNNs may also be long short-term memory (LSTM) networks or other networks.
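A rough sketch of the hierarchical architecture of FIG. 2, under the same illustrative naming assumptions as the TaskModel sketch above (a word-level GRU producing sentence representations, a sentence-level GRU producing the review-level task representation, and a prediction layer), might look like this:

```python
import torch.nn as nn

class HierarchicalRNN(nn.Module):
    """Illustrative sketch of the hierarchical RNN of FIG. 2; shapes are assumptions."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_labels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.sentence_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.review_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.prediction = nn.Linear(hidden_dim, num_labels)

    def task_representation(self, token_ids):
        # token_ids: (batch, num_sentences, num_words)
        b, n, m = token_ids.shape
        words = self.embedding(token_ids.view(b * n, m))     # embed each word
        _, sent = self.sentence_rnn(words)                   # per-sentence final states
        sent = sent.squeeze(0).view(b, n, -1)                # sentence representations
        _, review = self.review_rnn(sent)                    # review-level final state
        return review.squeeze(0)                             # task representation

    def forward(self, token_ids):
        return self.prediction(self.task_representation(token_ids))
```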

Model transfer according to an embodiment relies on two features. First, the task-appropriate architecture and prediction layers are shared across languages. Second, all the information needed to make a successful prediction is contained in the task representation layer.

As shown in FIG. 1, when the English model 200 is transferred to the French model 300, the only difference between the two models is the language-specific embedding included in their respective embedding layers 201 and 301, as shown by the contrasting hatching of the embeddings. In addition, the task representation layers 204, 304 of the English model 200 and the French model 300 contain all the information needed to make a successful prediction.

An indication of a successful model transfer is that the French and English models predict the same thing when given parallel data. That is, what is predicted matters less than the two models making identical predictions on parallel inputs. Under a label projection scheme, the predicted content may be the actual label. Alternatively, where the goal is to produce the same task representation in both languages, a representation projection may be utilized. The representation projection is a softer form of weak supervision than supervision based on label projection, and is the preferred projection according to embodiments.

To better illustrate the framework according to an embodiment, consider a task $T$ and labeled data $D_L = \{(x_i, y_i) \mid 0 \le i \le N\}$, where $x_i$ is an English input and $y_i$ is an output taking one of $K$ possible values, such that each $x_i$ is annotated with a value $y_i$, and $N$ is the number of language units included in the labeled data $D_L$. Without loss of generality, assume the input $x_i = \{e_{i1}, \ldots, e_{il}\}$ is a sequence of English words. Furthermore, consider a parallel data set $D_P = \{(e_j, f_j) \mid 0 \le j \le M\}$, where $e_j = \{e_{j1}, \ldots, e_{jm}\}$ and $f_j = \{f_{j1}, \ldots, f_{jn}\}$ are parallel English and French language units, respectively, and $M$ is the number of pairs of language units included in the parallel data $D_P$.

The English embedding included in the English embedding layer 201 may be expressed as $U$, such that each word in the English vocabulary $V_E$ has a vector $u_i$; the English vocabulary includes all words found in the inputs $x_i$. The French embedding included in the French embedding layer 301 may be expressed as $V$, such that each word in the French vocabulary $V_F$ has a vector $v_i$.

In the case of a shared model architecture, the vectors $u_i$ and $v_i$ must have the same dimensionality. The mapping of an English sequence $e_j = \{e_{j1}, \ldots, e_{jm}\}$ to a sequence of vectors is denoted $U(e_j)$, and the mapping of a French sequence $f_j = \{f_{j1}, \ldots, f_{jn}\}$ to a sequence of vectors is denoted $V(f_j)$. The x-layer model is denoted $\mu$, with parameters $\theta_\mu$; it takes the embedded sequence as input and generates the task representation. Specifically, for an English input $x_i$, the task representation is:

$$R(x_i) = \mu\big(U(x_i); \theta_\mu\big)$$

Finally, the prediction layer 203 is denoted $\pi$, with parameters $\theta_\pi$; it produces a probability distribution over the $K$ output values:

$$P(y_i = k \mid x_i) = \pi_k\big(R(x_i); \theta_\pi\big)$$

where $\pi_k$ is the $k$-th neuron of this layer, and the abbreviation $\pi(x_i)$ is used to denote the full output $\pi\big(R(x_i); \theta_\pi\big)$. The framework according to one embodiment then optimizes two losses.

Loss of label: suppose that the model contains tagged English data DLAs input, the following penalties are then optimized for the combined network:

wherein Δ L is inAnd variable yiA loss function defined in (c). E.g. in the binary case, ΔLCross entropy loss is possible, although it should be understood that this is exemplary only, and the framework is not so limited.

Loss of unlabeled: the model-generated english task representation is used as a weak supervision for parallel data on the french side. Specifically, the method comprises the following steps:

where Δ P is a loss function between task representations produced on the parallel inputs. Since the task representations are vectors, the mean square error between them may be a suitable penalty, for example, although the framework is not so limited.

Finally, the overall objective to optimize is given by

$$\mathcal{L} = \mathcal{L}_L + \alpha \mathcal{L}_P$$

where $\alpha$ is a hyper-parameter that controls the mixing strength between the two loss components.
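A hedged sketch of these two losses and their weighted combination, reusing the illustrative TaskModel above and taking $\Delta_L$ as cross-entropy and $\Delta_P$ as mean squared error (as the text suggests for the binary and vector cases, respectively):

```python
import torch.nn.functional as F

def labeled_loss(english_model, x_batch, y_batch):
    # L_L: supervised loss on labeled English data D_L (Delta_L = cross-entropy here).
    return F.cross_entropy(english_model(x_batch), y_batch)

def unlabeled_loss(english_model, french_model, e_batch, f_batch, freeze_english=True):
    # L_P: the English task representation weakly supervises the French side of the
    # parallel data (Delta_P = mean squared error here). In two-stage training the
    # English side is frozen; in joint training it also receives gradients.
    english_rep = english_model.task_representation(e_batch)
    if freeze_english:
        english_rep = english_rep.detach()
    french_rep = french_model.task_representation(f_batch)
    return F.mse_loss(french_rep, english_rep)

def total_loss(english_model, french_model, labeled_batch, parallel_batch, alpha=1.0):
    # L = L_L + alpha * L_P, with alpha controlling the mix of the two components.
    x, y = labeled_batch
    e, f = parallel_batch
    return labeled_loss(english_model, x, y) + alpha * unlabeled_loss(
        english_model, french_model, e, f, freeze_english=False)
```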

In contrast to conventional frameworks, the framework according to embodiments has no requirement for MT, since neither training nor test data needs to be translated, nor for any other resource such as a pivot dictionary or bilingual dictionary. The only requirements are the parallel data and the definition of the loss function $\Delta_P$. The model architecture $\mu$ and the labeled loss $\Delta_L$ are attributes defined for the English-only model.

With well-defined loss functions $\Delta_L$ and $\Delta_P$, training consists of back-propagating errors through the network and updating the parameters of the model.

Fig. 3 and 4 illustrate two methods for transferring a neural model from a first language to a second language, according to an embodiment. In detail, fig. 3 illustrates a two-stage training method, and fig. 4 illustrates a joint training method.

As shown in fig. 3, in the two-stage training, a model architecture is defined in step S301. Since the framework is model agnostic, the model may be defined as shown in FIG. 2, although it should be understood that the framework is not limited in this manner.

The labeled loss $\Delta_L$ is defined in step S302, and in step S303 the first model 200 is trained on the labeled data $D_L$ in the first language by finding the parameter values $U^*$, $\theta_\mu^*$, and $\theta_\pi^*$ that optimize $\mathcal{L}_L$. In this context, "*" denotes an optimized value resulting from the optimization in step S303.

After the first model 200 is trained, the embedding $U$ and the shared model parameters $\theta_\mu$ and $\theta_\pi$ of the first model are frozen in step S304.

The unlabeled loss $\Delta_P$ is defined in step S305, and in step S306 the unlabeled loss is trained on the parallel data $D_P$ by optimizing $\mathcal{L}_P$ with respect to the second embedding $V$. That is, in the second stage of the two-stage training, only the second embedding $V$ of the second model is updated on the parallel data.

In step S307, the first embedding $U$ of the embedding layer 201 of the first model 200 is replaced with the second embedding $V$ of the embedding layer 301 of the second model 300. This combined model is the updated second model 300. Thus, the updated second model 300 comprises the parameters $V^*$, $\theta_\mu^*$, and $\theta_\pi^*$.
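Under the illustrative TaskModel and loss sketches above (and with hypothetical data loaders and hyper-parameters), the two-stage procedure of FIG. 3 could be sketched roughly as follows:

```python
import torch

def two_stage_transfer(english_model, french_model, labeled_loader, parallel_loader,
                       epochs=5, lr=1e-3):
    # Stage 1 (S302-S303): train the English model on the labeled data D_L.
    opt = torch.optim.Adam(english_model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in labeled_loader:
            opt.zero_grad()
            labeled_loss(english_model, x, y).backward()  # from the loss sketch above
            opt.step()

    # S304: freeze the embedding U and the shared parameters theta_mu, theta_pi.
    for p in english_model.parameters():
        p.requires_grad_(False)

    # All layers except the lowest (embedding) layer are shared with the French model.
    french_model.task_layers = english_model.task_layers
    french_model.prediction = english_model.prediction

    # Stage 2 (S305-S306): update only the French embedding V on the parallel data D_P.
    opt = torch.optim.Adam(french_model.embedding.parameters(), lr=lr)
    for _ in range(epochs):
        for e, f in parallel_loader:
            opt.zero_grad()
            unlabeled_loss(english_model, french_model, e, f, freeze_english=True).backward()
            opt.step()

    # S307: the combined model is the French embedding V plus the shared, frozen layers.
    return french_model
```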

As shown in fig. 4, in the joint training, a model architecture is defined in step S401. Since the framework is model agnostic, the model may be defined as shown in FIG. 2, although it should be understood that the framework is not limited in this manner.

The labeled loss $\Delta_L$ is defined in step S402, and in step S403 the labeled loss is trained on the labeled data $D_L$ by finding parameter values that optimize $\mathcal{L}_L$.

The unlabeled loss $\Delta_P$ is defined in step S404, and in step S405 the unlabeled loss is trained on the parallel data $D_P$ by optimizing the joint objective $\mathcal{L}$, a weighted combination of the labeled and unlabeled losses given by $\mathcal{L} = \mathcal{L}_L + \alpha \mathcal{L}_P$, where $\alpha$ is the hyper-parameter that controls the mixing strength between the two loss components.

In joint training, when the parallel data $D_P$ is processed, the parameters of both the first model 200 and the second model 300 are updated in step S405.

In step S406, the first embedding $U$ of the embedding layer 201 of the first model 200 is replaced with the second embedding $V$ of the embedding layer 301 of the second model 300. This combined model is the updated second model 300. Thus, the updated second model 300 comprises the parameters $V^*$, $\theta_\mu^*$, and $\theta_\pi^*$.
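A corresponding sketch of the joint procedure of FIG. 4, again reusing the illustrative TaskModel, labeled_loss, and unlabeled_loss sketches above, with hypothetical loaders and hyper-parameters:

```python
import torch

def joint_transfer(english_model, french_model, labeled_loader, parallel_loader,
                   epochs=5, alpha=1.0, lr=1e-3):
    # Share all layers except the lowest (embedding) layer.
    french_model.task_layers = english_model.task_layers
    french_model.prediction = english_model.prediction
    params = list(english_model.parameters()) + list(french_model.embedding.parameters())
    opt = torch.optim.Adam(params, lr=lr)

    # S403: initial training of the labeled loss on D_L.
    for x, y in labeled_loader:
        opt.zero_grad()
        labeled_loss(english_model, x, y).backward()
        opt.step()

    # S405: optimize the joint objective L = L_L + alpha * L_P; parameters of both
    # models (including the English side) are updated when D_P is processed.
    for _ in range(epochs):
        for (x, y), (e, f) in zip(labeled_loader, parallel_loader):
            opt.zero_grad()
            loss = labeled_loss(english_model, x, y) + alpha * unlabeled_loss(
                english_model, french_model, e, f, freeze_english=False)
            loss.backward()
            opt.step()

    # S406: the transferred model is the French embedding V plus the shared layers.
    return french_model
```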

Example model transfer: emotion classification

To better illustrate the general framework according to embodiments, in the following illustrative example a sentiment classifier is transferred from one language to another.

In this example, the sentiment classifier predicts whether a language artifact is positive or negative. According to an embodiment, the only necessary steps are to define the model architecture $\mu$ and the two loss functions $\Delta_L$ and $\Delta_P$.

given the binary nature of the prediction task, the prediction layer can be given as an S-shaped layer with one output neuron that computes the probability of a positive label:the loss of the marker may be a cross-entropy loss:

in the parallel aspect, the unmarked penalty may be a mean square error penalty:

wherein d isTIs a task representation RTDimension of (A), RT(i) Representing its ith dimension.
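For this binary sentiment example, the prediction layer and the two losses could be concretized as in the minimal sketch below; the module and argument names are illustrative assumptions rather than part of the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmoidPredictionLayer(nn.Module):
    """Sigmoid layer with one output neuron computing the probability of a positive label."""
    def __init__(self, rep_dim):
        super().__init__()
        self.linear = nn.Linear(rep_dim, 1)

    def forward(self, task_rep):
        return torch.sigmoid(self.linear(task_rep)).squeeze(-1)

def delta_l(positive_prob, y):
    # Labeled loss: binary cross-entropy against the probability of a positive label.
    return F.binary_cross_entropy(positive_prob, y.float())

def delta_p(english_rep, french_rep):
    # Unlabeled loss: mean squared error over the d_T dimensions of the task representations.
    return F.mse_loss(french_rep, english_rep)
```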

Although the above example defines a loss function for a binary system, it should be understood that other loss functions may be defined for other systems and that a system may have any number of possible outputs.

Cross-language word association

To demonstrate the effect of the task representation as weak supervision, Table 1 shows several sentiment-bearing English words in the joint model according to one embodiment, together with their nearest French neighbors (by cosine distance between their respective embedding vectors).

TABLE 1

As can be seen from Table 1, positive (or negative) sentiment terms in English have positive (or negative) terms in French as their nearest neighbors. Although the nearest French neighbors are not necessarily direct translations, or even synonyms, the sentiment prediction task does not require translation; it is sufficient to identify words that evoke the same sentiment. Thus, the framework for model transfer according to embodiments is able to identify sentiment similarities across languages without direct supervision, using only the weak, fuzzy signal from the representation projection.
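The nearest-neighbor comparison itself is straightforward to reproduce from the two learned embedding matrices; a sketch (the vocabulary dictionaries and the matrix names U and V are assumptions about how the embeddings are stored) might be:

```python
import torch
import torch.nn.functional as F

def nearest_french_neighbors(english_word, en_vocab, fr_vocab, U, V, k=5):
    # en_vocab / fr_vocab: dicts mapping word -> row index into U / V.
    # U: (|V_E|, d) English embedding matrix; V: (|V_F|, d) French embedding matrix.
    query = U[en_vocab[english_word]].unsqueeze(0)   # (1, d)
    sims = F.cosine_similarity(query, V)             # cosine similarity to every French word
    top = torch.topk(sims, k).indices.tolist()
    index_to_word = {i: w for w, i in fr_vocab.items()}
    return [index_to_word[i] for i in top]
```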

Machine translation utilization

While the framework does not require an MT, according to one embodiment, an MT may be utilized.

For example, training-time translation (TrnT) may be utilized, in which the training data is translated from the first language into the other language and a sentiment model is then trained in that language. Alternatively, test-time translation (TstT) may be utilized, in which a sentiment model is trained in the first language and the trained model is used to classify language artifacts translated into the first language at test time.

Thus, a framework in accordance with an embodiment, although it does not require a translation engine, may optionally be used in conjunction with one.

Multimodal model transfer

The framework can also be applied to multimodal (rather than multilingual) transfers. That is, the model may be transferred between different modalities, including language, images, video, audio clips, and the like. For example, sentiment understanding can be transferred to images without explicit image annotation. In such a multimodal transfer, the annotated data may include labeled sentiment data in the first language, and the parallel data may include images with captions in the first language. Once the framework is trained on the annotated data and the parallel data, it can predict the sentiment of an image without a caption.
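As a purely illustrative sketch of such a multimodal transfer (the image encoder below is a placeholder assumption, not part of the patent), the lowest layer can be swapped for a modality-specific encoder that feeds the shared task-appropriate layers and prediction layer of a trained text model:

```python
import torch.nn as nn

class ImageInputModel(nn.Module):
    """Sketch: replace the language-specific lowest layer with an image encoder
    while reusing the shared layers of a trained TaskModel. Names are illustrative."""
    def __init__(self, text_model, embed_dim):
        super().__init__()
        # Modality-specific lowest layer: a small CNN mapping an image into the
        # same vector space that the text embedding layer feeds into.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Shared layers copied from the trained text model (frozen or fine-tuned).
        self.task_layers = text_model.task_layers
        self.prediction = text_model.prediction

    def forward(self, images):
        return self.prediction(self.task_layers(self.image_encoder(images)))
```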

FIG. 5 illustrates a schematic diagram of an exemplary computer or processing system that may implement any of the systems, methods, and computer program products described in embodiments of the present disclosure herein, such as the English neural model 200 and the French neural model 300. The computer system is only one example of a suitable processing system and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the methods described herein. The processing system as shown is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the processing system shown in FIG. 5 may include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems or devices, and the like.

The computer system may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer systems may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

The components of the computer system may include, but are not limited to, a server 500, one or more processors or processing units 510, and a system memory 520. Processor 510 may include software modules that perform the methods described herein. The module may be programmed into an integrated circuit of the processor 510 or may be loaded from the memory 520 or a network (not shown), or a combination thereof.

The computer system may include a variety of computer system readable media. Such media may be any available media that is accessible by a computer system and may include both volatile and nonvolatile media, removable and non-removable media.

Volatile memory can include Random Access Memory (RAM) and/or cache memory or other memory. Other removable/non-removable, volatile/nonvolatile computer system storage media may include a magnetic disk drive for reading from and writing to a removable nonvolatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive that may provide a removable nonvolatile optical disk for reading from or writing to a CD-ROM, DVD-ROM, or other optical media.

As will be appreciated by one skilled in the art, aspects of the framework may be embodied as a system, method or computer program product. Accordingly, aspects of the disclosed technology may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Furthermore, aspects of the disclosed technology may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied therein.

Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the disclosed technology may be written in any combination of one or more programming languages, including: object oriented programming languages such as Java, Smalltalk, C++, and the like; and conventional procedural programming languages, such as the "C" programming language or similar programming languages; scripting languages such as Perl, VBS, or similar languages; and/or functional languages such as Lisp and ML; and logic-oriented languages such as Prolog. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

Aspects of the disclosed technology are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The computer program product may comprise all the respective features enabling the implementation of the methods described herein and capable of performing said methods when loaded in a computer system. In the present context, a computer program, software program, or software, refers to any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) replication in a different material form.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the disclosed technology has been presented for purposes of illustration and description, but is not intended to be exhaustive or limiting. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The embodiment was chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.

Various aspects of the disclosure may be embodied as programs, software, or computer instructions embodied in a computer-or machine-usable or readable medium that, when executed on a computer, processor, and/or machine, cause the computer or machine to perform the steps of the method. A program storage device readable by a machine, tangibly embodying a program of instructions executable by the machine to perform the various functions and methods described in this disclosure, is also provided.

The systems and methods of the present disclosure may be implemented and executed on a general purpose computer or a special purpose computer system. The terms "computer system" and "computer network" as may be used in this application may include various combinations of fixed and/or portable computer hardware, software, peripheral devices, and storage devices. The computer system may include a plurality of separate components networked or otherwise linked to execute in cooperation, or may include one or more separate components. The hardware and software components of the computer system of the present application may include and may be included in fixed and portable devices such as desktop, laptop, and/or server. A module may be a component of a device, software, program, or system that implements certain "functions," which may be embodied as software, hardware, firmware, electronic circuitry, or the like.

Although specific embodiments have been described, those skilled in the art will appreciate that there are other embodiments equivalent to the described embodiments. It is understood, therefore, that this disclosure is not limited to the particular embodiments shown, but is only limited by the scope of the appended claims.

Concepts

Concept 1: a system for transferring a cross-language neural model, comprising: a processor and memory, wherein a first neural model and a second neural model are stored in the memory, wherein a language or dialect of the first neural model is different from a language or dialect of the second neural model; and an operating environment to execute, using a processor, commands to train a first neural model on the annotated data based on the labeled loss function to define and update parameters for each of a plurality of layers of the first neural model; and training the first and second neural models on parallel data between the first and second languages or dialects based on the unlabeled loss function to update each of the plurality of layers of the first neural model and define and update parameters for each of the plurality of layers of the second neural model, wherein all layers except a lowest layer of the first neural model are copied to the second neural model.

Concept 2. A system according to any previous or subsequent concept(s), wherein the first neural model comprises: a first embedding layer that converts language units of the first language or dialect into a vector representation; a first task-appropriate model architecture having a predetermined network configuration comprising one or more layers; and a first prediction layer, wherein one of the layers included in the first task-appropriate model architecture is a first task representation layer, and wherein the first task representation layer immediately precedes the first prediction layer.

Concept 3. A system according to any previous or subsequent concept(s), wherein the second neural model comprises: a second embedding layer that converts language units of the second language or dialect into a vector representation; a second task-appropriate model architecture having a predetermined network configuration comprising one or more layers; and a second prediction layer.

Concept 4. A system according to any previous or subsequent concept(s), wherein the task of the task-appropriate model architecture comprises one of: sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition.

Concept 5. a system according to any previous or subsequent concept(s), wherein the second neural model is trained without annotated data of the second language or dialect.

Concept 6. a system according to any previous or subsequent concept(s), wherein the second neural model is trained without a translation system, lexicon, or pivot dictionary.

Concept 7. a system according to any previous or subsequent concept(s), wherein the training resources comprise annotated data in the first language or dialect and unannotated parallel data in both the first language or dialect and the second language or dialect.

Concept 8. a computer-implemented method for cross-language neural model transfer, comprising: supplying the annotated data in the first language to a first neural model in the first language; training the first neural model in the first language on annotated data based on a labeled loss function to define and update parameters of the first neural model in the first language; supplying unannotated parallel data between the first language and the second language to the first neural model of the first language and a second neural model of a second language; training the first neural model in the first language and the second neural model in the second language on the parallel data to update the parameters of the first neural model in the first language and to define and update parameters of the second neural model in the second language; and merging a portion of the parameters of the first neural model in the first language to the second neural model in the second language.

Concept 9. A system according to any previous or subsequent concept(s), wherein the task of the neural model comprises one of: sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition.

Concept 10. A system according to any previous or subsequent concept(s), wherein the second neural model in the second language is trained without annotated data in the second language, a translation system, a dictionary, or a pivot dictionary.

Concept 11. a system according to any previous or subsequent concept(s), wherein the training resources comprise annotated data in the first language and unannotated parallel data in both the first language and the second language.

Concept 12. a system according to any previous or subsequent concept(s), wherein training the first neural model in the first language on the annotated data to define and update parameters of the first neural model comprises optimizing a loss function of the labels of the first neural model in the first language.

Concept 13. a system according to any previous or subsequent concept(s), wherein the first neural model in the first language and the second neural model in the second language are trained on the parallel data to update parameters of the first neural model in the first language, and defining and updating parameters of the second neural model in the second language comprises: optimizing a loss function between task representations produced on the parallel data by the first neural model in the first language and the second neural model in the second language.

Concept 14: a computer-implemented method for cross-language neural model transfer, comprising: supplying annotated data in a first language to a first neural model in the first language, training the first neural model in the first language on the annotated data based on a labeled loss function to define and update parameters of the first neural model in the first language; freezing the parameters of the first neural model of the first language; supplying unannotated parallel data between the first language and the second language to the first neural model of the first language and the second neural model of the second language; training the second neural model in the second language on the unannotated parallel data to define and update parameters of the second neural model in the second language; and merging a portion of the parameters of the first neural model in the first language to the second neural model in the second language.

Concept 15. A system according to any previous or subsequent concept(s), wherein the task of the neural model comprises one of: sentiment classification, style classification, intent understanding, message routing, duration prediction, or structured content recognition.

Concept 16. a system according to any previous or subsequent concept(s), wherein a second neural model in a second language is trained without annotated data in the second language.

Concept 17. A system according to any previous or subsequent concept(s), wherein the second neural model in the second language is trained without a translation system, a lexicon, or a pivot dictionary.

Concept 18. A system according to any previous or subsequent concept(s), wherein the training resources comprise annotated data in the first language and unannotated parallel data in both the first language and the second language.

Concept 19. A system according to any previous or subsequent concept(s), wherein training the first neural model in the first language on the annotated data to define and update parameters of the first neural model in the first language comprises: optimizing a labeled loss function of the first neural model in the first language.

Concept 20. A system according to any previous or subsequent concept(s), wherein training the second neural model in the second language on the unannotated parallel data to define and update parameters of the second neural model in the second language comprises: optimizing an unlabeled loss function between task representations produced on the unannotated parallel data by the first neural model in the first language and the second neural model in the second language.
