System and method for modeling probability distributions

Document No.: 1047885 · Publication date: 2020-10-09

Note: This technology, "System and method for modeling probability distributions," was created by C. K. Fisher, A. M. Smith, and J. R. Walsh on 2019-01-16. Abstract: Systems and methods for modeling complex probability distributions are described. One embodiment includes a method for training a Restricted Boltzmann Machine (RBM), wherein the method includes generating a set of hidden values in a hidden layer of the RBM from a first set of visible values and generating a second set of visible values in a visible layer of the RBM based on the generated set of hidden values. The method includes calculating a set of likelihood gradients based on the first set of visible values and the generated set of visible values, calculating a set of adversarial gradients using an adversarial model based on at least one of the set of hidden values and the set of visible values, calculating a set of composite gradients based on the set of likelihood gradients and the set of adversarial gradients, and updating the RBM based on the set of composite gradients.

1. A method for training a Restricted Boltzmann Machine (RBM), wherein the method comprises:

generating a set of hidden values in a hidden layer of the RBM from a first set of visible values;

generating a second set of visible values in a visible layer of the RBM based on the generated set of hidden values;

calculating a set of likelihood gradients based on at least one of the first set of visible values and the generated second set of visible values;

calculating a set of adversarial gradients using an adversarial model based on at least one of the set of hidden values and the set of visible values;

calculating a set of composite gradients based on the set of likelihood gradients and the set of adversarial gradients; and

updating the RBM based on the set of composite gradients.

2. The method of claim 1, wherein a visible layer of the RBM comprises a composite layer consisting of multiple sub-layers for different data types.

3. The method of claim 2, wherein the plurality of sub-layers comprises at least one of a Bernoulli layer, an Ising layer, a one-hot layer, a von Mises-Fisher layer, a Gaussian layer, a ReLU layer, a clipped ReLU layer, a Student's t layer, an ordinal layer, an exponential layer, and a composite layer.

4. The method of claim 1, wherein the RBM is a Deep Boltzmann Machine (DBM), wherein the hidden layer is one of a plurality of hidden layers.

5. The method of claim 4, wherein the RBM is a first RBM and the hidden layer is a first hidden layer of a plurality of hidden layers, wherein the method further comprises:

sampling a hidden layer from the first RBM;

stacking visible layers and hidden layers from the first RBM into a vector;

training a second RBM, wherein the vector is a visible layer of the second RBM; and

generating the DBM by copying weights from the first RBM and the second RBM to the DBM.

6. The method of claim 1, further comprising:

receiving a phenotype vector for a patient;

generating a temporal progression of a disease using the RBM; and

treating the patient based on the generated time progression.

7. The method of claim 1, wherein the visible layer and hidden layer are for a first time instance, wherein the hidden layer is further connected to a second hidden layer incorporating data from a different second time instance.

8. The method of claim 1, wherein the visible layer is a composite layer comprising data for a plurality of different time instances.

9. The method of claim 1, wherein computing the set of likelihood gradients comprises performing Gibbs sampling.

10. The method of claim 1, wherein the set of composite gradients is a weighted average of the set of likelihood gradients and the set of adversarial gradients.

11. The method of claim 1, further comprising training the adversarial model by:

drawing data samples from real data;

drawing fantasy samples from the RBM; and

training the adversarial model based on the adversarial model's ability to distinguish between the data samples and the fantasy samples.

12. The method of claim 1, wherein training the adversarial model comprises measuring a probability that a particular sample was drawn from the real data or from the RBM.

13. The method of claim 1, wherein the adversarial model is one of a fully connected classifier, a logistic regression model, a nearest neighbor classifier, and a random forest.

14. The method of claim 1, further comprising generating a set of samples of a target population using the RBM.

15. The method of claim 1, wherein computing a set of likelihood gradients comprises computing a convex combination of a Monte Carlo estimate and a mean field estimate.

16. The method of claim 1, wherein computing a set of likelihood gradients comprises:

initializing a plurality of samples;

initializing an inverse temperature for each sample of the plurality of samples;

for each sample of the plurality of samples:

updating the inverse temperature by sampling from an autocorrelated Gamma distribution; and

updating the sample using Gibbs sampling.

17. A non-transitory machine readable medium containing processor instructions for training a Restricted Boltzmann Machine (RBM), wherein execution of the instructions by a processor causes the processor to perform a process comprising:

generating a set of hidden values in a hidden layer of the RBM from a first set of visible values;

generating a second set of visible values in a visible layer of the RBM based on the generated set of hidden values;

calculating a set of likelihood gradients based on at least one of the first set of visible values and the generated second set of visible values;

calculating a set of adversarial gradients using an adversarial model based on at least one of the set of hidden values and the set of visible values;

calculating a set of composite gradients based on the set of likelihood gradients and the set of adversarial gradients; and

updating the RBM based on the set of composite gradients.

18. The non-transitory machine readable medium of claim 17, wherein the visible layer of the RBM comprises a composite layer consisting of a plurality of sub-layers for different data types.

19. The non-transitory machine readable medium of claim 17, wherein the RBM is a Deep Boltzmann Machine (DBM), wherein the hidden layer is one of a plurality of hidden layers.

20. The non-transitory machine readable medium of claim 19, wherein the RBM is a first RBM and the hidden layer is a first hidden layer of a plurality of hidden layers, wherein the process further comprises:

sampling a hidden layer from the first RBM;

stacking visible layers and hidden layers from the first RBM into a vector;

training a second RBM, wherein the vector is a visible layer of the second RBM; and

generating the DBM by copying weights from the first RBM and the second RBM to the DBM.

Technical Field

The present invention relates generally to modeling probability distributions, and more particularly to training and implementing Boltzmann (Boltzmann) machines to accurately model complex probability distributions.

Background

In a world full of uncertainty, it is difficult to properly model probability distributions across multiple dimensions based on diverse and heterogeneous sets of data. For example, in the health industry, individual health outcomes are invariably uncertain. The condition of one patient with a disease may deteriorate rapidly, while another patient recovers quickly. The inherent randomness of individual health outcomes means that health informatics must target the prediction of health risks rather than deterministic outcomes. The ability to quantify and predict health risks has important implications for business models that depend on the health of a population.

Disclosure of Invention

Systems and methods for modeling complex probability distributions in accordance with embodiments of the present invention are shown. One embodiment includes a method for training a Restricted Boltzmann Machine (RBM), wherein the method includes generating a set of hidden values in a hidden layer of the RBM from a first set of visible values, and generating a second set of visible values in a visible layer of the RBM based on the generated set of hidden values. The method also includes calculating a set of likelihood gradients based on at least one of the first set of visible values and the generated set of visible values, calculating a set of adversarial gradients using an adversarial model based on at least one of the set of hidden values and the set of visible values, and calculating a set of composite gradients based on the set of likelihood gradients and the set of adversarial gradients. The method includes updating the RBM based on the set of composite gradients.
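As an informal sketch of this training step (not the patent's actual implementation: the Bernoulli-Bernoulli parameterization, the CD-1 likelihood estimator, the placeholder adversarial gradient, and all names are assumptions made for illustration), one composite-gradient update might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def composite_update(W, v_data, adv_grad, lr=0.01, adv_weight=0.1):
    """One composite-gradient step for a Bernoulli-Bernoulli RBM (biases omitted)."""
    # Generate a set of hidden values from the first set of visible values.
    p_h = sigmoid(v_data @ W)
    h = (p_h > rng.random(p_h.shape)).astype(float)
    # Generate a second set of visible values from the hidden values.
    p_v = sigmoid(h @ W.T)
    v_model = (p_v > rng.random(p_v.shape)).astype(float)
    # Likelihood gradient (CD-1 style): data moment minus model moment.
    like_grad = (v_data.T @ p_h - v_model.T @ sigmoid(v_model @ W)) / len(v_data)
    # Composite gradient: weighted average of likelihood and adversarial gradients.
    comp_grad = (1.0 - adv_weight) * like_grad + adv_weight * adv_grad
    return W + lr * comp_grad

# Usage with toy data; the adversarial gradient is a zero placeholder here.
W = 0.01 * rng.standard_normal((6, 4))
v = rng.integers(0, 2, (32, 6)).astype(float)
W_new = composite_update(W, v, adv_grad=np.zeros((6, 4)))
```

The weighted average in the last step corresponds to the composite gradient described above; the weight balancing the two terms is a free hyperparameter in this sketch.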

In another embodiment, the visible layer of the RBM includes a composite layer composed of multiple sub-layers for different data types.

In yet another embodiment, the plurality of sub-layers includes at least one of a Bernoulli layer, an Ising layer, a one-hot layer, a von Mises-Fisher layer, a Gaussian layer, a ReLU layer, a clipped ReLU layer, a Student's t layer, an ordinal layer, an exponential layer, and a composite layer.

In yet another embodiment, the RBM is a Deep Boltzmann Machine (DBM), wherein the hidden layer is one of a plurality of hidden layers.

In yet another embodiment, the RBM is a first RBM and the hidden layer is a first hidden layer of a plurality of hidden layers. The method also includes sampling hidden layers from the first RBM, stacking visible layers and hidden layers from the first RBM into a vector, training a second RBM, and generating the DBM by copying weights from the first and second RBMs to the DBM. The vector is a visible layer of the second RBM.
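A rough sketch of the stacking step described above, with placeholder weights and dimensions (the training of the second RBM itself is elided; everything here is illustrative rather than the patent's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weights of an already-trained first RBM (random placeholder here).
W1 = 0.1 * rng.standard_normal((6, 4))
v = rng.integers(0, 2, (100, 6)).astype(float)

# Sample the hidden layer of the first RBM.
p_h1 = sigmoid(v @ W1)
h1 = (p_h1 > rng.random(p_h1.shape)).astype(float)

# Stack the visible and hidden layers into one vector per example; this
# stacked vector serves as the visible layer when training the second RBM.
stacked = np.concatenate([v, h1], axis=1)

# After training a second RBM on `stacked`, the DBM would be assembled by
# copying the weights of both RBMs into the corresponding layers of the DBM.
```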

In yet another embodiment, the method further comprises the steps of: receiving a phenotype (phenotype) vector for the patient, generating a temporal progression of the disease using the RBM, and treating the patient based on the generated temporal progression.

In another additional embodiment, the visible layer and the hidden layer are for a first time instance, wherein the hidden layer is further connected to a second hidden layer incorporating data from a different second time instance.

In another additional embodiment, the visible layer is a composite layer that includes data for a plurality of different time instances.

In yet another embodiment, calculating the set of likelihood gradients includes performing Gibbs sampling.

In yet another embodiment, the set of composite gradients is a weighted average of the set of likelihood gradients and the set of adversarial gradients.

In yet another embodiment, the method further comprises the step of training the adversarial model by: drawing data samples from real data, drawing fantasy samples from the RBM, and training the adversarial model based on the adversarial model's ability to distinguish between the data samples and the fantasy samples.
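Since logistic regression is one of the adversarial models contemplated above, a minimal sketch of training such a critic to distinguish data samples (label 1) from fantasy samples (label 0) might look like the following; the toy clouds standing in for "real" and "fantasy" samples, and all hyperparameters, are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_adversary(data_samples, fantasy_samples, lr=0.5, epochs=500):
    """Fit a logistic-regression adversary: label 1 = data sample, 0 = fantasy sample."""
    X = np.vstack([data_samples, fantasy_samples])
    y = np.concatenate([np.ones(len(data_samples)), np.zeros(len(fantasy_samples))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # adversary's probability a sample is real
        w += lr * X.T @ (y - p) / len(y)  # gradient ascent on the log-likelihood
        b += lr * float(np.mean(y - p))
    return w, b

# Toy example: well-separated "data" and "fantasy" clouds.
data = rng.normal(1.0, 0.5, (50, 2))
fantasy = rng.normal(-1.0, 0.5, (50, 2))
w, b = train_adversary(data, fantasy)
```

The trained adversary's output probability is exactly the "ability to distinguish" quantity that the adversarial gradient would then be derived from.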

In yet another embodiment, training the adversarial model includes measuring the probability that a particular sample was drawn from the real data or from the RBM.

In yet an additional embodiment, the adversarial model is one of a fully connected classifier, a logistic regression model, a nearest neighbor classifier, and a random forest.

In yet an additional embodiment, the method further comprises the step of generating a set of samples of a target population using the RBM.

In yet another embodiment, computing the set of likelihood gradients includes computing a convex combination of a Monte Carlo estimate and a mean field estimate.
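One way to picture this convex combination (a sketch under assumed names and a Bernoulli parameterization, not the patent's code) is to estimate the visible-hidden moment once from sampled hidden units (Monte Carlo) and once from their means (mean field), then blend the two:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vh_moment(W, v, alpha=0.5):
    """Estimate the E[v h^T] moment as a convex combination of a Monte Carlo
    estimate (sampled hidden units) and a mean-field estimate (their means)."""
    p_h = sigmoid(v @ W)                             # mean-field hidden activities
    h = (p_h > rng.random(p_h.shape)).astype(float)  # Monte Carlo hidden sample
    mc = v.T @ h / len(v)
    mf = v.T @ p_h / len(v)
    return alpha * mc + (1.0 - alpha) * mf           # alpha in [0, 1]

W = 0.1 * rng.standard_normal((6, 4))
v = rng.integers(0, 2, (200, 6)).astype(float)
moment = vh_moment(W, v, alpha=0.5)
```

Setting alpha to 1 recovers the pure Monte Carlo estimate; 0 recovers the pure mean-field estimate.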

In yet another embodiment, calculating the set of likelihood gradients includes initializing a plurality of samples, and initializing an inverse temperature for each sample of the plurality of samples. For each sample of the plurality of samples, calculating the set of likelihood gradients further includes updating the inverse temperature by sampling from an autocorrelated Gamma distribution, and updating the sample using Gibbs sampling.
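A sketch of one such temperature-driven sampling step. The Beta-thinning construction used here is one standard way to build an autocorrelated Gamma process with stationary mean 1 (it is an assumption for this sketch; the patent does not specify the construction), and the Gibbs update simply scales the energy by the inverse temperature:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_inverse_temperature(beta, k=10.0, phi=0.9):
    """One step of an autocorrelated Gamma process (Beta-thinning construction).

    The stationary distribution is Gamma(shape=k, scale=1/k), so the inverse
    temperature has mean 1; phi controls the autocorrelation between steps.
    """
    thin = rng.beta(k * phi, k * (1.0 - phi))        # thins beta to Gamma(k*phi, 1/k)
    innovation = rng.gamma(k * (1.0 - phi), 1.0 / k)
    return thin * beta + innovation

def tds_gibbs_step(W, v, beta):
    """Update one sample's inverse temperature, then Gibbs-update the sample
    with the energy scaled by that inverse temperature."""
    beta = update_inverse_temperature(beta)
    h = (sigmoid(beta * (v @ W)) > rng.random(W.shape[1])).astype(float)
    v = (sigmoid(beta * (h @ W.T)) > rng.random(W.shape[0])).astype(float)
    return v, beta
```

Inverse temperatures below 1 flatten the energy landscape, letting the chain hop between modes; the Gamma process keeps the average temperature at the target value.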

Additional embodiments and features are set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the specification or may be learned by practice of the invention. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the drawings, which form a part of this disclosure.

Drawings

The specification and claims will be more fully understood with reference to the following figures and data diagrams, which are presented as exemplary embodiments of the invention and are not to be construed as a complete description of the scope of the invention.

FIG. 1 illustrates a system that provides for the collection and distribution of data used to model probability distributions, according to some embodiments of the invention.

FIG. 2 illustrates data processing elements for training and utilizing stochastic models.

FIG. 3 illustrates a data processing application for training and utilizing stochastic models.

FIG. 4 conceptually illustrates a process for preparing data for analysis.

FIG. 5 illustrates a data structure for implementing a generalized Boltzmann machine, in accordance with certain embodiments of the present invention.

FIG. 6 illustrates a bimodal distribution and a smoothed, diffuse distribution learned by an RBM, according to several embodiments of the present invention.

FIG. 7 illustrates an architecture of a generalized restricted Boltzmann machine in accordance with some embodiments of the invention.

FIG. 8 illustrates a schema for implementing a generalized Boltzmann machine, in accordance with certain embodiments of the present invention.

FIG. 9 illustrates the architecture of a generalized deep Boltzmann machine, in accordance with certain embodiments of the present invention.

FIG. 10 conceptually illustrates a process for reverse layer-wise training, in accordance with an embodiment of the present invention.

FIG. 11 illustrates the architecture of a generalized deep temporal Boltzmann machine in accordance with many embodiments of the present invention.

FIG. 12 conceptually illustrates a process for training a Boltzmann Encoded Adversarial Machine, in accordance with some embodiments of the present invention.

FIG. 13 illustrates samples drawn from RBMs trained to maximize the log-likelihood and from RBMs trained as BEAMs.

FIG. 14 illustrates results of training BEAMs on a 2D mixture of Gaussians, in accordance with multiple embodiments of the present invention.

FIG. 15 illustrates an architecture for implementing a Boltzmann Encoded Adversarial Machine in accordance with various embodiments of the present invention.

FIG. 16 illustrates a comparison between samples drawn from a Boltzmann machine using conventional Gibbs sampling and samples drawn using temperature-driven sampling.

FIG. 17 illustrates a comparison between fantasy particles generated by a GRBM trained on the MNIST dataset using conventional Gibbs sampling and those generated using TDS.

Detailed Description

Machine learning is one potential method for modeling complex probability distributions. In the following description, many examples are described with reference to medical applications, but those skilled in the art will recognize that the techniques described herein may be readily applied in a variety of different fields including (but not limited to) health informatics, image/audio processing, marketing, sociology, and laboratory research. One of the most pressing problems is that there is usually little or no labeled data directed to the particular problem of interest. Consider the task of predicting how a patient will respond to an investigational treatment in a clinical trial. In a supervised learning setting, many patients would be given the treatment and the response of each would be observed. This data would then be used to build a model that predicts how a new patient will respond to the treatment. For example, a nearest neighbor classifier would search a pool of previously treated patients for the patient most similar to the new patient, and then predict the new patient's response based on that patient's response. However, supervised learning requires a large amount of labeled data, and unsupervised learning is crucial for successful application of machine learning, especially in cases where the sample size is small or labeled data is not readily available.

Many machine learning applications, such as computer vision, require homogeneous inputs (e.g., images having the same shape and resolution) that must be preprocessed or otherwise manipulated to normalize the input and training data. However, in many applications, it is desirable to combine various types of data from many sources (e.g., images, numbers, categories, ranges, text samples, etc.). For example, medical data may include a variety of different types of information from a variety of different sources, including (but not limited to) demographic information (e.g., a patient's age, ethnicity, etc.), diagnoses (e.g., binary codes describing whether a patient has a particular disease), laboratory values (e.g., results from laboratory tests, such as blood tests), physician notes (e.g., handwritten notes recorded by a physician or entered into a medical record system), images (e.g., X-rays, CT scans, MRIs, etc.), and omics data (e.g., data from DNA sequencing studies describing a patient's genetic background, the expression of his/her genes, etc.). Some of these data are binary, some are continuous, and some are categorical. Integrating all of these different types and sources of data is crucial, but processing multiple data types using traditional machine learning methods is very challenging. Typically, a large amount of preprocessing must be performed on the data so that all of the features used for machine learning are of the same type. This data preprocessing step can take up a significant portion of an analyst's time when training and implementing machine learning models.

In addition to encompassing many different types of data, the data used for analysis is often incomplete or irregular. In the example of medical data, physicians typically do not run the same set of tests on every patient (although clinical trials are an important exception). Instead, a doctor schedules tests based on his/her particular concerns about the patient. As a result, a medical record contains many fields that lack an observed value. Moreover, these observations are not missing at random. Handling these missing observations is an important part of any application of machine learning in healthcare.

Missing data has two implications for machine learning in healthcare. First, any algorithm needs to be able to learn from training data with missing observations. Second, the algorithm needs to be able to make predictions even when only a subset of the input observations is presented to it. That is, it is desirable to be able to express any conditional relationship derived from the joint probability distribution.

One approach that has recently gained widespread popularity is the generative adversarial network (GAN). In its traditional formulation, a GAN uses a generator that transforms random Gaussian noise into a visible vector through a feedforward neural network. A model with this formulation can be trained using standard backpropagation. However, GAN training is often unstable, requiring a careful balance between training the discriminator and training the generator. Furthermore, it is not possible to generate samples from arbitrary conditional distributions using a GAN, and applying GANs to problems involving heterogeneous data sets with different data types and missing observations can be very difficult.

Many embodiments of the present invention provide novel and innovative systems and methods for training and implementing stochastic, unsupervised machine learning models of complex probability distributions using heterogeneous, irregular, and unlabeled data.

System for modeling probability distributions

Turning now to the drawings, a system that provides for the collection and distribution of data used to model probability distributions in accordance with some embodiments of the present invention is illustrated in FIG. 1. Network 100 includes a communication network 160. The communication network 160 is a network, such as the internet, that allows devices connected to the network 160 to communicate with other connected devices. Server systems 110, 140, and 170 are connected to network 160. Each of the server systems 110, 140, and 170 is a set of one or more servers that are communicatively connected to each other via an internal network, and executes processes that provide cloud services to users over the network 160. For purposes of this discussion, a cloud service is one or more applications executed by one or more server systems to provide data and/or executable applications to devices over a network. Server systems 110, 140, and 170 are each shown with three servers in the internal network. However, server systems 110, 140, and 170 may include any number of servers, and any additional number of server systems may be connected to network 160 to provide cloud services. In accordance with various embodiments of the invention, systems and methods for modeling complex probability distributions may be provided by a process (or a set of processes) executing on a single server system and/or a group of server systems communicating over network 160.

A user may use personal devices 180 and 120 connected to network 160 to perform processes for providing and/or interacting with a network using systems and methods for modeling complex probability distributions according to various embodiments of the present invention. In the illustrated embodiment, personal device 180 is shown as a desktop computer connected to network 160 via a conventional "wired" connection. However, personal device 180 may be a desktop computer, laptop computer, smart television, entertainment game console, or any other device connected to network 160 via a "wired" connection. The mobile device 120 connects to the network 160 using a wireless connection. A wireless connection is a connection to the network 160 using Radio Frequency (RF) signals, infrared signals, or any other form of wireless signals. In fig. 1, the mobile device 120 is a mobile phone. However, the mobile device 120 may be a mobile handset, a Personal Digital Assistant (PDA), a tablet, a smart phone, or any other type of device that connects to the network 160 via a wireless connection without departing from the invention.

A data processing element for training and utilizing stochastic models in accordance with various embodiments is illustrated in fig. 2. In various embodiments, data processing element 200 is one or more of a server system and/or a personal device within a networked system similar to that described with reference to FIG. 1. Data processing element 200 includes a processor (or set of processors) 210, a network interface 225, and a memory 230. The network interface 225 is capable of sending and receiving data across a network via a network connection. In various embodiments, the network interface 225 is in communication with the memory 230. In various embodiments, memory 230 is any form of storage device configured to store a variety of data, including but not limited to data processing applications 232, data files 234, and model parameters 236. The data processing application 232 according to some embodiments of the present invention directs the processor 210 to perform a variety of processes, such as (but not limited to) updating model parameters 236 using data from data files 234 to model complex probability distributions.

A data processing application according to various embodiments of the present invention is illustrated in FIG. 3. In this example, data processing element 300 includes a data collection engine 310, a database 320, a model trainer 330, a generative model 340, a discriminator model 350, and a simulator engine 345. Model trainer 330 includes a schema processor 332 and a sampling engine 334. Data processing applications according to many embodiments of the present invention process data to train stochastic models that can be used to model complex probability distributions.

Data collection engines according to many embodiments of the present invention collect data in a variety of formats from a variety of sources. Data collected in accordance with many embodiments of the present invention includes data that may be heterogeneous (e.g., data having various types, ranges, and constraints) and/or incomplete. Those skilled in the art will recognize that various types and amounts of data may be utilized as appropriate to the requirements of a particular application in accordance with embodiments of the invention. In some embodiments, the data collection engine is also used to preprocess the data to facilitate training of the model. However, unlike the preprocessing performed in other methods, preprocessing according to some embodiments of the present invention is performed automatically based on the data type and/or schema associated with each data input. For example, in some embodiments, bodies of unstructured text (e.g., typed medical notes, diagnoses, free-form questionnaire responses, etc.) are processed in a number of ways, such as (but not limited to) vectorization (e.g., using word2vec), summarization, sentiment analysis, and/or keyword analysis. Other preprocessing steps may include, but are not limited to, normalization, smoothing, filtering, and aggregation. In some embodiments, the preprocessing is performed using various machine learning techniques including, but not limited to, restricted Boltzmann machines, support vector machines, recurrent neural networks, and convolutional neural networks.

Databases according to various embodiments of the present invention store data for use by data processing applications, including (but not limited to) input data, preprocessed data, model parameters, schemas, output data, and simulation data. In some embodiments, the database is located on a machine separate from the data processing application (e.g., in cloud storage, a server farm, a networked database, etc.).

A model trainer according to various embodiments of the invention is used to train a generative model and/or a discriminator model. In many embodiments, the model trainer utilizes a schema processor to build the generative model and/or the discriminator model based on schemas defined over the various data available to the system. A schema processor according to some embodiments of the present invention builds a composite layer for a generative model (e.g., a restricted Boltzmann machine) that is composed of several different sub-layers for processing different types of data in different ways. In some embodiments, the model trainer trains the generative model and the discriminator model by optimizing a composite objective function based on a log-likelihood and an adversarial objective. Training a generative model according to some embodiments of the present invention utilizes a sampling engine to draw samples from the model in order to measure the probability distribution of the data and/or model. Various methods for sampling from such models, both to train them and to draw generated samples from them, are described in more detail below.

In many embodiments, the generative model is trained to model complex probability distributions, which can be used to generate predictions/simulations of various probability distributions. The discriminator model discriminates between the data-based samples and the model-generated samples based on the visible and/or hidden states.

Simulator engines according to several embodiments of the present invention are used to generate simulations of complex probability distributions. In some embodiments, the simulator engine is used to simulate patient populations, disease progression, and/or predicted responses to various treatments. Simulator engines according to several embodiments of the present invention use a sampling engine for extracting samples from a generative model of a probability distribution of simulated data.

As described above, data according to several embodiments of the present invention is preprocessed as part of the data collection process to simplify the data. Unlike other preprocessing, which is typically highly manual and specific to the data, this preprocessing can be performed automatically based on the data types, without additional manual input.

A process for preparing data for analysis according to some embodiments of the invention is conceptually illustrated in FIG. 4. Process 400 processes (405) unstructured data. Unstructured data according to many embodiments of the invention may include various types of data that may be preprocessed to speed up processing and/or reduce memory requirements for storing relevant data. Examples of such data may include, but are not limited to, bodies of text, signal processing data, audio data, and image data. Processing unstructured data according to many embodiments of the invention may include, but is not limited to, feature recognition, profiling, keyword detection, sentiment analysis, and signal analysis.

Process 400 reorders the data based on the schema (410). In some embodiments, the process reorders the data by grouping similar data types, as defined in the schema, to allow the data types to be processed efficiently. The process 400 according to some embodiments of the invention rescales (415) the data based purely on the scale of the measurements, to prevent over-representation of certain data elements. The process 400 then routes (420) the preprocessed data to sub-layers of a Boltzmann machine that are constructed based on the data types identified in the schema. Examples of Boltzmann machine structures and architectures are described in more detail below. In some embodiments, the data is preprocessed into a chronological data structure for input to a deep temporal Boltzmann machine, described in further detail below.
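The reorder-rescale-route steps above can be sketched as follows. The schema contents, field names, and the standard-deviation rescaling rule are all assumptions made for this illustration, not details specified by the text:

```python
import numpy as np

# Hypothetical schema mapping each field to a data type; the data type
# determines which sub-layer of the Boltzmann machine receives the field.
schema = {"diagnosis": "bernoulli", "age": "gaussian", "race": "onehot",
          "lab_value": "gaussian"}

def group_by_type(schema):
    """Reorder fields so those sharing a data type (and thus a sub-layer)
    are contiguous and can be processed together."""
    grouped = {}
    for field in sorted(schema, key=schema.get):
        grouped.setdefault(schema[field], []).append(field)
    return grouped

def rescale(column):
    """Rescale a continuous column based purely on its measurement scale
    (here, its standard deviation), preventing over-representation."""
    scale = np.std(column)
    return column / scale if scale > 0 else column
```

Each group returned by `group_by_type` would then be routed to the matching sub-layer (Bernoulli, Gaussian, one-hot, etc.) of the composite visible layer.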

A temporal data structure for input to a Boltzmann machine in accordance with various embodiments of the present invention is illustrated in FIG. 5. The example of FIG. 5 shows three data structures 510, 520, and 530. Each of the data structures represents a set of data values captured at a particular time (i.e., times t0, t1, and tn). In this example, certain characteristics (e.g., gender, race, birth date, etc.) generally do not change over time, while other characteristics (e.g., test results, medical scans, etc.) change over time. This example further illustrates that, for some individuals, some fields may be missing data at certain times. In this example, each individual is assigned a separate identification number in order to protect patient privacy.
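A minimal sketch of such a per-time-slice record (the class and field names are hypothetical; `None` stands in for a missing observation, matching the description above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PatientSnapshot:
    """One time slice of a patient record; None marks a missing observation."""
    patient_id: str                     # anonymized identifier, preserving privacy
    time: int
    gender: Optional[str] = None        # static field, repeated at every time
    lab_result: Optional[float] = None  # time-varying field, often missing

# A record is the sequence of snapshots for one individual (times t0, t1, ...).
record = [
    PatientSnapshot("id-001", 0, gender="F", lab_result=4.2),
    PatientSnapshot("id-001", 1, gender="F", lab_result=None),  # test not run at t1
]
```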

Boltzmann Encoded Adversarial Machines

Models trained to minimize the forward KL divergence D_KL(p_data || p_θ) tend to spread the model distribution out to cover the support of the data distribution. An example of such a spread-out distribution is illustrated in fig. 6. Specifically, fig. 6 illustrates a bimodal distribution 610 and the fairly good, but overly smooth, spread-out distribution 620 learned by an RBM. While RBMs can produce such reasonable approximations, they can encounter difficulties when faced with finer, more complex distributions.

To overcome the problems of conventional Boltzmann machines, several embodiments of the present invention implement a framework for adversarially training Boltzmann machines, referred to herein as Boltzmann Encoded Adversarial Machines (BEAMs). A BEAM minimizes a loss function that is a combination of the negative log-likelihood and an adversarial loss. The adversarial component ensures that BEAM training simultaneously minimizes both the forward and reverse KL divergences, which prevents the over-smoothing problem observed with conventional RBMs.

Boltzmann machine architecture

Many conventional machine learning techniques use supervised learning, in which a model is trained on a large set of labeled data for prediction and classification. However, in many cases it is not feasible or possible to collect such large samples of labeled data. In many cases, the data is not easily labeled, or there are simply not enough samples of events to meaningfully train a supervised learning model. For example, clinical trials often face difficulties in collecting such labeled data. Clinical trials typically go through three main stages. In phase I, a treatment is given to healthy volunteers to assess its safety. In phase II, the treatment is given to approximately 100 patients to obtain a preliminary estimate of safety and efficacy. Finally, in phase III, the treatment is given to hundreds to thousands of patients to rigorously study the efficacy of the drug. Prior to phase II, there is no human data on the effect of the study drug on the desired indication, making supervised learning impossible. After phase II there is some human data on the effect of the study drug, but the sample size is rather limited, rendering supervised learning techniques inefficient. For comparison, a phase II clinical trial may have on the order of 100 patients. As in many data-limited settings, the lack of large sets of labeled data for many important problems means that health informatics must rely heavily on methods of unsupervised learning.

Restricted Boltzmann Machine (RBM)

One machine learning model (or method) that uses unsupervised learning is the Restricted Boltzmann Machine (RBM). An RBM is a bidirectional neural network in which the neurons (also called units) are divided into two layers, a visible layer and a hidden layer. The visible layer v describes the observed data. The hidden layer h consists of a set of unobserved latent variables that capture interactions between the visible units. The model describes the joint probability distribution of v and h using an exponential form,

p(v, h) = Z^(-1) exp(−E(v, h)). (1)

Here, E(v, h) is referred to as the energy function, and Z = ∫ dv dh exp(−E(v, h)) is called the partition function. In many embodiments, the integral operator ∫ dx denotes either a standard integral or a sum over all elements of a discrete set.

In a conventional RBM, both the visible and hidden units are binary. Each can take on only the values 0 or 1. The energy function can be written as

E(v, h) = −∑_i a_i v_i − ∑_μ b_μ h_μ − ∑_{iμ} v_i W_{iμ} h_μ,

or, in vector notation, E(v, h) = −a^T v − b^T h − v^T W h. Note that the visible units interact with the hidden units through the weights W. However, there are no visible-visible or hidden-hidden interactions.
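The vector form of the energy function translates directly into code. The following is a minimal sketch; the toy parameter values are chosen only to make the arithmetic easy to check by hand.

```python
import numpy as np

# Energy of a binary RBM in vector notation: E(v,h) = -a^T v - b^T h - v^T W h.
def rbm_energy(v, h, a, b, W):
    return float(-(a @ v) - (b @ h) - v @ W @ h)

a = np.array([0.5, -0.5])      # visible biases
b = np.array([1.0])            # hidden bias
W = np.array([[2.0], [0.0]])   # visible-hidden weights
v = np.array([1.0, 1.0])
h = np.array([1.0])
# -a.v = -(0.5 - 0.5) = 0; -b.h = -1; -v W h = -2, so E = -3.
energy = rbm_energy(v, h, a, b, W)
```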

A key feature of RBMs is the ease of computing conditional probabilities,

p(h_μ = 1 | v) = σ(b_μ + ∑_i v_i W_{iμ}),

and

p(v_i = 1 | h) = σ(a_i + ∑_μ W_{iμ} h_μ),

where σ(x) = 1/(1 + exp(−x)) is the logistic function.

Likewise, it is easy to calculate the conditional moments,

⟨h_μ⟩_{p(h|v)} = σ(b_μ + ∑_i v_i W_{iμ}) and ⟨v_i⟩_{p(v|h)} = σ(a_i + ∑_μ W_{iμ} h_μ).

However, it is generally very difficult to compute statistics of the joint distribution. Therefore, a stochastic sampling process such as Markov Chain Monte Carlo (MCMC) must be used to estimate statistics of the joint distribution.
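The closed-form conditionals above can be verified against brute-force enumeration of the joint distribution on a tiny RBM. This is an illustrative sketch; the dimensions and random parameters are arbitrary.

```python
import itertools
import numpy as np

# Sanity-check p(h_mu = 1 | v) = sigmoid(b_mu + sum_i v_i W_{i,mu}) for a tiny
# binary RBM by enumerating all hidden configurations.
rng = np.random.default_rng(1)
nv, nh = 3, 2
a, b = rng.normal(size=nv), rng.normal(size=nh)
W = rng.normal(size=(nv, nh))

def energy(v, h):
    return -(a @ v) - (b @ h) - v @ W @ h

states_h = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=nh)]
v = np.array([1.0, 0.0, 1.0])

# p(h | v) by enumeration: proportional to exp(-E(v, h)) at fixed v.
weights = np.array([np.exp(-energy(v, h)) for h in states_h])
p_h_given_v = weights / weights.sum()
p_h0_on = sum(p for p, h in zip(p_h_given_v, states_h) if h[0] == 1)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
closed_form = sigmoid(b[0] + v @ W[:, 0])
```

The enumerated marginal `p_h0_on` and the logistic closed form agree to numerical precision, because at fixed v the hidden units are conditionally independent.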

The RBM can be trained by maximizing the log-likelihood

ℒ = ⟨log p(v)⟩_data,

where ⟨·⟩_data denotes an average over all observed samples. The derivative of the log-likelihood with respect to a parameter θ of the model is:

∂_θ ℒ = −⟨∂_θ E(v, h)⟩_data + ⟨∂_θ E(v, h)⟩_model.

In the standard formulation of an RBM, there are three parameters: a, b, and W. The derivatives are:

∂ℒ/∂a = ⟨v⟩_data − ⟨v⟩_model,
∂ℒ/∂b = ⟨h⟩_data − ⟨h⟩_model,
∂ℒ/∂W = ⟨v h^T⟩_data − ⟨v h^T⟩_model.

Computing expectations with respect to the joint distribution is generally computationally intractable. Therefore, the derivatives must be estimated using samples drawn from the model with an MCMC process. Alternating Gibbs sampling can be used to draw samples from an RBM:

Input: an initial configuration (v, h);
the number of Monte Carlo steps, k;
the RBM.

Output: a new configuration (v′, h′).

Set v_0 = v, h_0 = h;
For t = 1, …, k: sample h_t ~ p(h | v_{t−1}), then sample v_t ~ p(v | h_t);
Return (v_k, h_k).

In theory, Gibbs sampling produces uncorrelated random samples from p(v, h) in the limit n → ∞. Of course, infinity is a very long time. Therefore, the derivative of the log-likelihood of an RBM is typically approximated using one of two procedures: Contrastive Divergence (CD) or Persistent Contrastive Divergence (PCD). K-step CD is very simple: grab a batch of data; compute an approximate batch of samples from the model by running k steps of Gibbs sampling starting from the data; then compute the gradient of the log-likelihood and update the model parameters. Importantly, for each gradient update, the samples from the model are reinitialized with a batch of observed data. K-step PCD is similar: first, the samples from the model are initialized with a batch of data; the samples are then updated for k steps, the gradient is computed, and the parameters are updated. In contrast to CD, the samples from the model are never reinitialized. Many architectures of Boltzmann machines according to several embodiments of the present invention utilize sampling to calculate the derivatives used to train the Boltzmann machine. Various methods for sampling according to several embodiments of the present invention are described in more detail below.
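The CD-k procedure described above can be sketched compactly for a binary RBM. This is a minimal illustrative implementation, not the embodiments' production code; the toy data, learning rate, and dimensions are assumptions.

```python
import numpy as np

# Minimal CD-k update for a binary RBM. Gradients follow
# dL/dW = <v h^T>_data - <v h^T>_model, and similarly for the biases.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h(v, b, W):
    p = sigmoid(b + v @ W)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_v(h, a, W):
    p = sigmoid(a + h @ W.T)
    return p, (rng.random(p.shape) < p).astype(float)

def cd_k_step(batch, a, b, W, k=1, lr=0.1):
    ph_data, h = sample_h(batch, b, W)   # positive phase, clamped to the data
    v = batch
    for _ in range(k):                   # k steps of alternating Gibbs sampling
        _, v = sample_v(h, a, W)
        ph_model, h = sample_h(v, b, W)  # negative phase from the model
    n = len(batch)
    W += lr * (batch.T @ ph_data / n - v.T @ ph_model / n)
    a += lr * (batch.mean(0) - v.mean(0))
    b += lr * (ph_data.mean(0) - ph_model.mean(0))
    return a, b, W

nv, nh = 6, 4
a, b, W = np.zeros(nv), np.zeros(nh), 0.01 * rng.normal(size=(nv, nh))
batch = (rng.random((32, nv)) < 0.8).astype(float)   # toy data, mostly ones
for _ in range(200):
    a, b, W = cd_k_step(batch, a, b, W)
```

After training on data whose units are mostly on, the visible biases move positive so that the model's marginals match the data means.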

Generalized RBM

One challenge that arises with conventional Boltzmann machines is that many RBMs use binary units, while the data to be processed may come in a variety of different forms. To overcome this limitation, some embodiments of the present invention use a generalized RBM. A generalized RBM in accordance with various embodiments of the present invention is illustrated in fig. 7. The example of fig. 7 shows a generalized RBM 700 having a visible layer 710 and a hidden layer 720. Visible layer 710 is a composite layer consisting of several nodes of various types (i.e., continuous, categorical, and binary). The nodes of the visible layer 710 are connected to the nodes of the hidden layer 720. The hidden layer of a generalized RBM according to several embodiments of the present invention operates as a low-dimensional representation of an individual (e.g., a patient in a clinical trial) based on the compiled input to the composite visible layer.

The generalized RBM according to various embodiments of the present invention is trained using an energy function of the form

E(v, h) = −∑_i a_i(v_i) − ∑_μ b_μ(h_μ) − ∑_{iμ} (v_i/σ_i) W_{iμ} (h_μ/ε_μ),

where a_i(·) and b_μ(·) are arbitrary functions, and σ_i > 0 and ε_μ > 0 are scale parameters of the visible and hidden layers, respectively. Different functions (called layer types) are used to represent different types of data. Examples of layer types for modeling various types of data are described below.

Bernoulli layer: A Bernoulli layer is used to represent binary data v_i ∈ {0, 1}. The bias function is a(v) = a^T v, and the scale parameters are set to σ_i = 1.

Ising layer: An Ising layer is a symmetric Bernoulli layer with visible units v_i ∈ {−1, +1}. The bias function is a(v) = a^T v, and the scale parameters are set to σ_i = 1.

One-hot layer: A one-hot layer represents data where v_i ∈ {0, 1} and ∑_i v_i = 1. That is, one of the units is on and all of the other units are off. One-hot layers are typically used to represent categorical variables. The bias function is a(v) = a^T v, and the scale parameters are set to σ_i = 1.

von Mises-Fisher layer: A von Mises-Fisher layer represents data where v_i ∈ [0, 1] and ∑_i v_i² = 1. That is, the units are constrained to the surface of an n-dimensional sphere. This layer is particularly useful for modeling fractional data x_i ∈ [0, 1] with ∑_i x_i = 1, because v_i = √x_i satisfies the spherical constraint. The bias function is a(v) = a^T v, and the scale parameters are set to σ_i = 1.

Gaussian layer: A Gaussian layer represents real-valued data v_i ∈ ℝ. The bias function is

a(v) = ∑_i (v_i − ν_i)² / (2σ_i²).

The location parameters ν_i and scale parameters σ_i of the layer are generally trainable. In practice, it helps to parameterize the model in terms of log σ_i to ensure that the scale parameters remain positive.

ReLU layer: A rectified linear unit (ReLU) layer represents data where v_i ∈ [ν_i^low, +∞). In the context of Boltzmann machines, a ReLU layer is essentially a one-sided truncated Gaussian layer. The bias function on this domain is

a(v) = ∑_i (v_i − ν_i)² / (2σ_i²).

The location parameters ν_i and scale parameters σ_i of the layer are generally trainable, while the lower bounds ν_i^low are typically specified prior to training. In practice, it helps to parameterize the model in terms of log σ_i to ensure that the scale parameters remain positive.

Clipped ReLU layer: A clipped rectified linear unit (ReLU) layer represents data where v_i ∈ [ν_i^low, ν_i^high]. In the context of Boltzmann machines, a clipped ReLU layer is essentially a two-sided truncated Gaussian layer. The bias function on this domain is

a(v) = ∑_i (v_i − ν_i)² / (2σ_i²).

The location parameters ν_i and scale parameters σ_i of the layer are generally trainable, while the bounds ν_i^low and ν_i^high are typically specified prior to training. In practice, it helps to parameterize the model in terms of log σ_i to ensure that the scale parameters remain positive.

Student-t layer: The Student-t distribution is similar to the Gaussian distribution, but with heavier tails. In various embodiments, the Student-t layer is implemented implicitly. The layer has three parameters: a location parameter ν_i that controls the mean, a scale parameter that controls the variance, and a degrees-of-freedom parameter d_i that controls the thickness of the tails. The layer is defined implicitly by first drawing a variance from an inverse-gamma distribution and then computing the energy as for a Gaussian layer with the sampled variance.

Ordinal layer: An ordinal layer is a generalization of the Bernoulli layer used to represent integer-valued data v_i ∈ {0, …, N_i}. The bias function is a(v) = a^T v, and the scale parameters are set to σ_i = 1. The upper bounds N_i are specified in advance.

Gaussian-ordinal layer: A Gaussian-ordinal layer is a generalization of the ordinal layer used to represent integer-valued data v_i ∈ {0, …, N_i} with a more flexible distribution. The bias function is the same as that of a Gaussian layer. The upper bounds N_i are specified in advance.

Exponential layer: An exponential layer represents non-negative data v_i ∈ [0, +∞). The bias function is a(v) = a^T v, and the scale parameters are set to σ_i = 1. Note that exponential layers carry some constraints, because a_i + ∑_μ W_{iμ} h_μ > 0 must hold for all values of the connected hidden units. Typically, this limits the types of layers that can be connected to an exponential layer and requires that all of the weights be positive.

Composite layer: A composite layer is not itself a mathematical object in the way the previously described layer types are. Rather, a composite layer is a software implementation for combining multiple sub-layers of different types to create a meta-layer that can model heterogeneous data.

The above describes a specific example of a layer for modeling data according to an embodiment of the present invention; however, those skilled in the art will recognize that any number of processes may be utilized as appropriate to the requirements of a particular application in accordance with embodiments of the present invention.

Schemas

A schema according to several embodiments of the invention is conceptually illustrated in fig. 8. Fig. 8 illustrates a schema describing the different layers of a generalized RBM. Schemas allow a model to be adapted to process particular types of data without requiring cumbersome manual pre-processing. The different layers accommodate different types of heterogeneous data that may be incomplete and/or irregular.

The above describes a specific example of a mode for constructing a model according to an embodiment of the present invention; however, those skilled in the art will recognize that any number of processes may be utilized as appropriate to the requirements of a particular application in accordance with embodiments of the present invention.

Generalized Deep Boltzmann Machine (DBM)

Deep learning refers to methods of machine learning in which a model processes data through a series of transformations. The goal is to enable the model to learn to construct appropriate features using prior knowledge, rather than requiring a researcher to engineer the features.

A generalized Deep Boltzmann Machine (DBM) is essentially a stack of RBMs. A generalized DBM according to some embodiments of the present invention is illustrated in fig. 9. Generalized DBM 900 shows a visible layer 910 connected to a hidden layer 920. The hidden layer 920 is further connected to another hidden layer 930. The visible layer 910 is encoded into the hidden layer 920, which then operates like a visible layer for the next hidden layer 930.

Consider a DBM with L hidden layers h_l (l = 1, …, L). The energy function of the DBM is:

E(v, h_1, …, h_L) = −a^T v − v^T W_1 h_1 − ∑_{l=1}^{L−1} (b_l^T h_l + h_l^T W_{l+1} h_{l+1}) − b_L^T h_L.

In principle, a DBM can be trained in the same manner as an RBM. However, in practice, DBMs are often trained using a greedy layer-by-layer process. An example of greedy layer-by-layer processing is described in Artificial Intelligence and Statistics (2009), pages 448-455. Essentially, forward layer-by-layer training of a DBM is performed by training a series of RBMs with energy functions

E_l(h_{l−1}, h_l) = −b_{l−1}^T h_{l−1} − h_{l−1}^T W_l h_l − b_l^T h_l,

where the output of the previous RBM is used as the input to the next RBM. When a DBM is trained in such a forward, layer-by-layer manner, it may be difficult for information from the data distribution to propagate into the deep layers of the model. As a result, it is generally difficult to train a DBM with more than a couple of hidden layers.

To overcome the limitations of forward layer-by-layer training of DBMs, methods according to many embodiments of the invention train DBMs in reverse, starting from the deepest hidden layer h_L and working backwards towards v. This ensures that the deepest hidden layer must contain as much information as possible about the visible layer. The reverse layer-by-layer training process takes advantage of the fact that a DBM with connections v-h_1-h_2 is identical to one with connections [v, h_2]-h_1, which allows RBMs with composite layers to be stitched, downward and backward, into the joint distribution of the DBM.

A process for reverse layer-by-layer training according to an embodiment of the present invention is conceptually illustrated in fig. 10. Process 1000 trains (1005) a first RBM with connections v-h_L. Process 1000 samples (1010) h_L ~ p(h_L | v) from the trained RBM. The process then stacks (1015) v and h_L into the vector [v, h_L] and trains (1020) a second RBM with connections [v, h_L]-h_{L−1}. Process 1000 then determines (1025) whether [v, h_2]-h_1 has been reached. When it has not yet been reached, process 1000 returns to step 1005. When process 1000 determines that [v, h_2]-h_1 has been reached, the process copies (1030) the weights from each of these intermediate RBMs into their respective locations in the DBM. In some embodiments, the DBM may then be fine-tuned through conventional end-to-end training.
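The control flow of the reverse layer-by-layer procedure can be sketched with stand-in trainers. `train_rbm` and `sample_hidden` below are hypothetical placeholders for real RBM routines; here they merely record the connectivity so that the loop structure of fig. 10 is explicit.

```python
# Structural sketch of reverse layer-wise DBM training (fig. 10).
def reverse_layerwise_train(v, num_hidden_layers, train_rbm, sample_hidden):
    """Train RBMs from the deepest hidden layer h_L back toward h_1."""
    rbms = []
    context = list(v)                        # first RBM sees the visible layer alone
    for depth in range(num_hidden_layers, 0, -1):
        rbm = train_rbm(context, depth)      # v - h_L, then [v, h_L] - h_{L-1}, ...
        rbms.append(rbm)
        if depth > 1:
            h = sample_hidden(rbm, context)  # h ~ p(h_depth | context)
            context = list(v) + h            # composite layer [v, h_depth]
    return rbms                              # weights are then copied into the DBM

rbms = reverse_layerwise_train(
    v=["v"],
    num_hidden_layers=3,
    train_rbm=lambda ctx, depth: (tuple(ctx), depth),
    sample_hidden=lambda rbm, ctx: ["h%d" % rbm[1]],
)
```

For L = 3 the recorded stages are v-h_3, then [v, h_3]-h_2, then [v, h_2]-h_1, matching the description above.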

Boltzmann machine for time series

Many problems (e.g., modeling patient trajectories) require the ability to generate time series, i.e., to generate a series of states v(0), v(1), …, v(τ). Two methods according to a number of embodiments of the present invention are described below.

An autoregressive deep Boltzmann machine (ADBM) is a DBM in which the hidden layer has undirected edges connecting adjacent points in time. Thus, an ADBM couples each node to its value at previous points in time. A generalized ADBM according to some embodiments of the invention is illustrated in fig. 11. Generalized ADBM 1100 shows a visible layer 1110 at time t connected to a hidden layer 1120 at time t. The hidden layer 1120 is also connected to another hidden layer 1130, which incorporates data offset by τ from time t.

Thus, an ADBM is a model of the joint probability distribution p(v(0), …, v(τ)) of an entire sequence. Specifically, let x(t) = [v(t), h_1(t), …, h_L(t)] denote the state of all layers at time t, and let E_DBM(x(t)) be the energy of the DBM at time t. The energy function of the ADBM is:

E({x(t)}) = ∑_t E_DBM(x(t)) − ∑_t h_L(t)^T U h_L(t−1),

where U is a matrix of weights coupling the deepest hidden layer to its value at the previous time point.

for simplicity, this is illustrated with a single autoregressive connection connecting the last hidden layer with its previous value. However, those skilled in the art will recognize that this model may be extended to include multiple time delays or cross-time connections between layers.

As described in the previous section, ADBMs are able to capture correlations through time, but they are generally unable to represent non-stationary distributions or distributions with drift. For example, most patients with a degenerative disease will tend to worsen over time, an effect that an ADBM cannot capture. To capture such effects, many embodiments of the present invention implement a Generalized Conditional Boltzmann Machine (GCBM). Consider a time series of visible units v(0), …, v(τ). The joint probability distribution can be decomposed into the product

p(v(0), …, v(τ)) = p(v(0)) ∏_{t=1}^{τ} p(v(t) | v(t−1)).

In several embodiments, this model may be constructed from two DBMs. First, a non-time-dependent DBM, p_0, can be trained on all of the data. Next, a time-dependent DBM can be trained on a composite layer created by combining all adjacent time points [v(t), v(t−1)]. In this example, the second DBM describes the joint distribution p(v(t), v(t−1)), which enables the computation of both p(v(t) | v(t−1)) and p(v(t−1) | v(t)), allowing both forward and backward predictions.
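The point that one joint model over adjacent time points yields both conditionals can be illustrated on a toy discrete case. This sketch uses a made-up 2-state joint table in place of a trained DBM; only the normalization arithmetic is being demonstrated.

```python
import numpy as np

# Toy illustration: given a model of p(v(t), v(t-1)) over binary states,
# both the forward and backward conditionals follow by normalization.
joint = np.array([[0.4, 0.1],      # rows index v(t-1), columns index v(t)
                  [0.2, 0.3]])
p_prev = joint.sum(axis=1)               # p(v(t-1))
p_curr = joint.sum(axis=0)               # p(v(t))
forward = joint / p_prev[:, None]        # p(v(t) | v(t-1)): forward prediction
backward = (joint / p_curr[None, :]).T   # p(v(t-1) | v(t)): backward prediction
```

Each row of `forward` and of `backward` sums to one, as a conditional distribution must.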

Although this example is described using a single time lag, those skilled in the art will recognize that processes according to many embodiments of the present invention may be adjusted to account for longer and/or multiple time lags. For example, the composite layer on which the second DBM is trained may easily be expanded to include multiple time lags, e.g., [v(t), v(t−1), …, v(t−n)].

Training RBM

There are several approaches for improving the performance of RBMs, including new regularization methods, novel optimization algorithms, alternative objective functions, and improved gradient estimators. Systems and methods according to several embodiments of the present invention implement alternative objective functions and improved gradient estimators.

Adversarial objectives for RBMs

A machine learning model is generative if it learns to draw new samples from an unknown probability distribution. Generative models can be used to learn useful representations of data and/or to enable simulation of systems with unknown or very complex mechanistic laws. A generative model defined by model parameters θ describes the probability of observing a variable v. Training a generative model therefore involves minimizing the distance between the distribution of the data, p_d(v), and the distribution defined by the model, p_θ(v). Traditional methods for training Boltzmann machines maximize the log-likelihood, which is equivalent to minimizing the forward Kullback-Leibler (KL) divergence:

D_KL(p_d || p_θ) = ∫ dv p_d(v) log(p_d(v)/p_θ(v)).

The forward KL divergence D_KL(p_d || p_θ) accumulates differences between the model distribution and the data weighted by the probability under the data distribution. The reverse KL divergence D_KL(p_θ || p_d) accumulates differences between the model distribution and the data weighted by the probability under the model distribution. Thus, the forward KL divergence strongly penalizes models that underestimate the probability of the data, while the reverse KL divergence strongly penalizes models that overestimate the probability of the data.
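The asymmetry can be checked numerically on a small discrete space. The two distributions below are made up for illustration: the model puts substantial mass on a state that is rare under the data, which the reverse KL penalizes far more heavily than the forward KL.

```python
import numpy as np

# Forward vs. reverse KL on a 3-state space.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p_data  = np.array([0.495, 0.495, 0.010])
p_model = np.array([0.250, 0.250, 0.500])   # overestimates the rare third state

forward = kl(p_data, p_model)   # D_KL(p_d || p_theta)
reverse = kl(p_model, p_data)   # D_KL(p_theta || p_d)
```

Here the reverse KL is substantially larger than the forward KL, reflecting the heavy penalty for placing model mass where the data has almost none.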

There are multiple sources of stochasticity in the training of an RBM. This stochasticity means that if the log-likelihoods of different models differ by less than the error in estimating them, those models are statistically indistinguishable. This creates an entropic force, because there are many more models with a small D_KL(p_d || p_θ) than there are models with both a small D_KL(p_d || p_θ) and a small D_KL(p_θ || p_d). As a result, training an RBM with a standard method such as PCD decreases D_KL(p_d || p_θ) (as it should), but tends to increase D_KL(p_θ || p_d). This produces distributions with spurious modes and/or distributions that are overly smooth.

It is conceivable to overcome the limitations of maximum likelihood training of RBMs by minimizing a combination of the forward and reverse KL divergences. Unfortunately, computing the reverse KL divergence requires knowledge of p_d, which is unknown. In many embodiments, a novel type of f-divergence, referred to herein as the discriminator divergence, may be used in place of the reverse KL divergence to train the RBM:

D_D(p_d || p_θ) := −log 2 + ∫ dv p_θ(v) log(1 + p_θ(v)/p_d(v)).

Note that the best discriminator between p_d and p_θ will assign to a sample v the posterior probability that it was drawn from the data distribution:

p(data | v) = p_d(v) / (p_d(v) + p_θ(v)). (15)

Thus, the discriminator divergence can be written as

D_D(p_d || p_θ) = −log 2 − ∫ dv p_θ(v) log p(data | v), (16)

indicating that it measures the probability that the best discriminator will incorrectly classify a sample drawn from the model distribution as coming from the data distribution.

The discriminator divergence belongs to the class of f-divergences, defined as D_f(p || q) := ∫ dx q(x) f(p(x)/q(x)). The defining function of the discriminator divergence,

f(t) = log((1 + t)/(2t)),

is convex with f(1) = 0, as required. It can be shown that the discriminator divergence tracks the reverse KL divergence to within an additive constant:

D_KL(p_θ || p_d) − log 2 ≤ D_D(p_d || p_θ) ≤ D_KL(p_θ || p_d),

so minimizing the discriminator divergence effectively minimizes the reverse KL divergence.
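The discriminator divergence of equation 16 can be evaluated in closed form on a toy discrete space using the optimal discriminator of equation 15. The two distributions below are made up for illustration.

```python
import numpy as np

# Discriminator divergence vs. reverse KL divergence on a 2-state space,
# using the optimal discriminator p(data|v) = p_d(v) / (p_d(v) + p_theta(v)).
p_d     = np.array([0.9, 0.1])
p_theta = np.array([0.1, 0.9])

p_data_given_v = p_d / (p_d + p_theta)
dd  = float(-np.log(2) - np.sum(p_theta * np.log(p_data_given_v)))
rkl = float(np.sum(p_theta * np.log(p_theta / p_d)))
```

For these distributions the discriminator divergence is positive (the distributions differ) and sits within log 2 below the reverse KL divergence.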

It is generally not possible to directly access p_d(v) or to compute the reverse KL divergence. However, methods according to a number of embodiments of the present invention may train a discriminator to approximate equation 15, and thereby approximate the discriminator divergence.

A generator that is able to fool the discriminator into assigning p(data | v) ≈ 1 to all samples drawn from p_θ will have a low discriminator divergence. The discriminator divergence closely mirrors the reverse KL divergence, and strongly penalizes models that overestimate the probability of the data.

Methods according to a number of embodiments of the present invention implement a Boltzmann Encoded Adversarial Machine (BEAM) for adversarially training RBMs. A BEAM according to various embodiments of the present invention minimizes a loss function that is a combination of the negative log-likelihood and an adversarial loss. The adversarial component ensures that BEAM training simultaneously minimizes both the forward and reverse KL divergences, which prevents the over-smoothing problem observed with conventional RBMs.

A method for training a BEAM according to many embodiments of the present invention is outlined below:

Input:
N, the number of epochs;
M, the number of fantasy particles;
k, the number of Gibbs sampling steps;
γ, the relative weight of the likelihood and adversarial gradients.

Initialization:
draw samples F ~ p_θ(v) using k steps of Gibbs sampling.

For each epoch: update the fantasy particles with k steps of Gibbs sampling; train the discriminator on the fantasy particles and a batch of data; compute the likelihood and adversarial gradients; combine them into a composite gradient using the weight γ; and update the model parameters.

A process for training an adversarial model according to some embodiments of the invention is conceptually illustrated in fig. 12. Process 1200 draws (1205) samples from a model such as (but not limited to) a Boltzmann machine such as those described above. Samples may be drawn from the model using a variety of methods, including (but not limited to) k-step Gibbs sampling and TDS. Process 1200 then calculates (1210) a gradient based on the drawn samples. Process 1200 trains (1215) a discriminator based on the drawn samples and calculates an adversarial gradient based on the classification of the samples as drawn from the model or from the data. In many embodiments, process 1200 then calculates (1220) the full composite gradient and updates (1225) the model parameters using the full gradient.
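The composite-gradient update of fig. 12 can be sketched as a simple blend of the two gradients. This is an illustrative sketch: `gamma` is an assumed name for the likelihood-versus-adversary weight, and the stand-in gradient vectors are made up.

```python
import numpy as np

# Blend the likelihood and adversarial gradients into one composite update.
def composite_update(theta, grad_likelihood, grad_adversarial, gamma=0.9, lr=0.01):
    grad = gamma * grad_likelihood + (1.0 - gamma) * grad_adversarial
    return theta + lr * grad          # ascend the blended objective

theta = np.zeros(3)
gl = np.array([1.0, 0.0, -1.0])       # stand-in likelihood gradient
ga = np.array([0.0, 2.0, 0.0])        # stand-in adversarial gradient
theta = composite_update(theta, gl, ga)
```

With gamma = 0.9 and lr = 0.01 the update moves theta by 0.009, 0.002, and -0.009 in the three coordinates.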

Fig. 13 presents some comparisons between Boltzmann machines trained to maximize the log-likelihood and those trained as BEAMs. The example of this figure illustrates three multimodal data distributions: a bimodal mixture of Gaussians in 1 dimension (1310), a mixture of 8 Gaussians arranged in a circle in 2 dimensions (1320), and a mixture of 25 Gaussians arranged in a grid in 2 dimensions (1330). Problems similar to the 2-dimensional Gaussian mixture examples are commonly used to test GANs. In each case, the conventional Boltzmann machine learns a model with a reasonably good likelihood by spreading probability across the support of the data distribution. In contrast, Boltzmann machines trained as BEAMs reproduce the data distribution very accurately.

An example of the results of training a BEAM on a 2-dimensional mixture of Gaussians is illustrated in fig. 14. A first panel 1405 illustrates estimates of the forward KL divergence D_KL(p_d || p_θ) and the reverse KL divergence D_KL(p_θ || p_d) for each training epoch, showing that training the RBM as a BEAM decreases both the forward and reverse KL divergences. A second panel 1410 illustrates the distribution of fantasy particles at various points during training. In the early stages of training, the BEAM fantasy particles spread out across the support of the data distribution, capturing the modes near the edges of the grid. These early stages resemble the distributions obtained with GANs, which also concentrate density in the modes near the edges of the grid. As training progresses, the BEAM gradually learns to capture the modes near the center of the grid.

An architecture of a Boltzmann Encoded Adversarial Machine (BEAM) according to some embodiments of the present invention is illustrated in fig. 15. The illustrated example shows the two stages of the BEAM architecture. In the first stage 1510, the generator (e.g., an RBM) has a visible layer (circles) and a hidden layer (diamonds). A generator according to embodiments of the invention is trained to encode input data by passing it through the visible layer into a set of nodes of the hidden layer. Generators according to several embodiments of the invention are trained with the goal of generating realistic samples from complex distributions. In many embodiments, the objective function used to train the generator may include contributions from an adversarial loss produced by the discriminator (or critic).

In the second stage 1520, the hidden layer of the generator is fed into a discriminator (or critic) that evaluates the hidden unit activations, using the tied weights learned by the generator, to distinguish between samples drawn from the data and samples drawn from the model. The adversary is constructed by encoding the visible units with a single forward pass through the layers of the generator and then applying a classifier (e.g., logistic regression, a nearest neighbor classifier, or a random forest) trained to discriminate between samples from the data and samples from the model. By refining the discriminator, processes according to many embodiments of the present invention allow improved modeling of complex probability distributions. Although shown as separate stages, BEAMs according to many embodiments of the present invention are trained with a composite objective that trains both the critic and the generator. In some embodiments, the discriminator is a simple classifier that requires little training.

An objective function according to various embodiments of the present invention is

C(θ) = γ ℒ(θ) + (1 − γ) 𝒜(θ),

which is maximized with respect to θ, where ℒ is the log-likelihood, 𝒜 is the adversarial term, and γ ∈ [0, 1] controls their relative weight. The objective function thus includes a contribution from the adversarial term produced by the critic. The adversarial term according to various embodiments of the invention may be defined as

𝒜(θ) = ⟨T(v, h)⟩_{p_θ},

where T(v, h) is the critic function. In some embodiments, the adversary uses the same architecture and weights as the RBM, encoding the visible units into hidden unit activations. These hidden unit activations, computed both for the data and for the fantasy particles sampled from the RBM, are used by the critic to estimate the distance between the data distribution and the model distribution.

To compute the derivatives used for training the generator, methods according to some embodiments of the invention use the stochastic derivative trick:

∂_θ ⟨T(v, h)⟩_{p_θ} = ⟨T(v, h) ∂_θ log p_θ(v, h)⟩_{p_θ} = Cov_{p_θ}[T(v, h), −∂_θ E(v, h)],

where the second equality holds for an RBM because ∂_θ log p_θ = −∂_θ E + ⟨∂_θ E⟩_{p_θ}.

In principle, the critic may be any function of the visible and hidden units. However, motivated by the discriminator divergence, methods according to several embodiments of the invention use a critic that is monotonically related to p(data | v). While the discriminator divergence suggests using log p(data | v), methods according to some embodiments of the invention use the linear function T(v) = 2 p(data | v) − 1. In general, the optimal critic can be approximated as a function of the hidden unit activations,

T(v) ≈ g(⟨h⟩_{p(h|v)}).

The function g(·) can be implemented by a neural network, as in most GANs, or by simpler algorithms such as a random forest or a nearest neighbor classifier. In many embodiments, a simple approximation to the optimal discriminator is sufficient, because the classifier operates on the hidden unit activations rather than on the visible units of the RBM generator. The optimal critic can therefore be approximated using a nearest neighbor method.

Let X = {x_1, …, x_N} be identically and independently distributed samples from an unknown probability distribution with probability density function p(x). In various embodiments, p(x) is estimated at an arbitrary point x using a k-nearest-neighbor estimate. Specifically, methods according to some embodiments of the invention fix some positive integer k and compute the k nearest neighbors of x in X. Then, d_k is defined as the distance between x and the farthest of these nearest neighbors, and the density p(x) is estimated by distributing the mass of the k neighbors uniformly over a ball of radius d_k. That is,

p̂(x) = k / (N V_n d_k^n),

where V_n is the volume of the unit ball in n dimensions.

Now let p_θ(v) and p_d(v) denote the unknown probability density functions of the model distribution and the data distribution, respectively, and define the distance between two vectors v and v′ as the Euclidean distance between their hidden unit activations. This distance may no longer satisfy all of the properties of a proper metric. Let X = {v_1, …, v_{2N}} be a set of samples, exactly half of which are drawn from p_θ and half of which are drawn from p_d. Fix some k and compute the k nearest neighbors of v in X, with d_k denoting the distance to the farthest of them. The denominator is then estimated as described above. Let j be the number of these k nearest neighbors that were drawn from p_d rather than p_θ. The numerator can then be estimated by distributing a j/k fraction of the same mass uniformly over the same ball, allowing the nearest neighbor critic to be defined as T_NN(v) := j/k. In many embodiments, a cached sample of fantasy particles from the model may be combined with a small batch of samples from the training data set to compute the nearest neighbors.
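The nearest neighbor critic T_NN(v) = j/k can be sketched in a few lines. This is an illustrative sketch; the clustered toy activations stand in for real hidden unit activations.

```python
import numpy as np

# Nearest-neighbor critic: the fraction of the k nearest neighbors (among
# pooled data and model samples, in hidden-activation space) that came from
# the data.
def nearest_neighbor_critic(h_query, h_data, h_model, k=5):
    pool = np.vstack([h_data, h_model])
    labels = np.array([1] * len(h_data) + [0] * len(h_model))   # 1 = data
    dists = np.linalg.norm(pool - h_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return labels[nearest].mean()       # j / k

rng = np.random.default_rng(0)
h_data = rng.normal(loc=+2.0, size=(50, 2))    # data activations cluster at +2
h_model = rng.normal(loc=-2.0, size=(50, 2))   # model activations cluster at -2
t_data = nearest_neighbor_critic(np.array([2.0, 2.0]), h_data, h_model)
t_model = nearest_neighbor_critic(np.array([-2.0, -2.0]), h_data, h_model)
```

A query deep inside the data cluster is scored 1 (all neighbors are data), while one deep inside the model cluster is scored 0.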

The distance-weighted nearest neighbor evaluator is a generalization that adds some continuity to the nearest neighbor evaluator by applying inverse distance weighting to the ratio count. Specifically, let {d_0, ..., d_k} be the k-nearest neighbor distances, where {d_0, ..., d_j} are the distances to neighbors drawn from the data samples and {d_{j+1}, ..., d_k} are the distances to neighbors drawn from the model samples. In many embodiments, the distance-weighted nearest neighbor evaluator may be defined as:

T_DNN(v) := [Σ_{i=0}^{j} (ε + d_i)^{-1}] / [Σ_{i=0}^{k} (ε + d_i)^{-1}],

where ε is a small parameter that regularizes the inverse distances.
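A minimal sketch of the distance-weighted variant, under the same assumptions as the plain evaluator (Euclidean distance stands in for distance between hidden unit activations, and the 1/(ε + d) weighting follows the inverse-distance description in the text):

```python
import numpy as np

def dnn_evaluator(v, data_samples, model_samples, k=5, eps=1e-3):
    """Distance-weighted nearest-neighbor evaluator: each of the k
    nearest neighbors votes with weight 1 / (eps + d), so closer
    neighbors count more, making the evaluator vary continuously."""
    pooled = np.vstack([data_samples, model_samples])
    labels = np.concatenate([np.ones(len(data_samples)),
                             np.zeros(len(model_samples))])
    dists = np.linalg.norm(pooled - v, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (eps + dists[nearest])
    return np.sum(weights * labels[nearest]) / np.sum(weights)
```

As ε → ∞ the weights become uniform and the estimate reduces to the unweighted j/k count.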

Whereas most formulations of GANs use feedforward neural networks for both the generator and the discriminator, BEAM uses the RBM as both the generator and the feature extractor for the adversary. In various embodiments, this dual use allows a single set of fantasy particles to be reused for multiple steps of the training algorithm. Specifically, a single set of M persistent fantasy particles is updated k times per gradient evaluation. In many embodiments, the same set of fantasy particles is used to compute both the log-likelihood derivative and the adversarial derivative. These fantasy particles can then replace the fantasy particles from the previous gradient evaluation in the nearest neighbor estimate of the evaluator value. Reusing the fantasy particles in this way means that the computational cost of BEAM training is approximately the same as that of training an RBM with persistent contrastive divergence (PCD).

Improved gradient estimation

The gradients of both the adversarial term and the log-likelihood involve expected values with respect to the model distribution. Unfortunately, these expected values cannot be computed exactly, so they may be approximated using Monte Carlo or other methods. The accuracy of these approximate gradients can have a significant impact on the utility of the resulting model. Various methods for improving the accuracy of the approximate gradients according to certain embodiments of the present invention are described below.

Mean field approximation and shrinkage estimation

Monte Carlo estimates of the gradients have the advantage of being unbiased. That is, as N → ∞,

(1/N) Σ_{i=1}^{N} f(x_i) → E[f(x)].

However, when N is small, the estimates may have high variance. On the other hand, mean field estimates, such as those derived from the Thouless-Anderson-Palmer (TAP) expansion, are analytic and have zero variance, but have a bias that may be difficult to control. Let f(ω) = ω f_MC + (1 − ω) f_MF be a convex combination of the Monte Carlo estimate f_MC and the mean field estimate f_MF. It is easy to show that Bias²[f] = (1 − ω)² Bias²[f_MF] and Var[f] = ω² Var[f_MC], so that the mean squared error of f is MSE[f] = Bias²[f] + Var[f] = (1 − ω)² Bias²[f_MF] + ω² Var[f_MC]. Therefore, the value of ω can generally be chosen to minimize the mean squared error of the combined estimator.
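Setting the derivative of MSE[f] to zero, −2(1 − ω)Bias²[f_MF] + 2ω Var[f_MC] = 0, gives the minimizer ω* = Bias²[f_MF] / (Bias²[f_MF] + Var[f_MC]). A small sketch follows; in practice the bias and variance would themselves have to be estimated, which this sketch assumes is done elsewhere.

```python
def optimal_weight(bias_mf_sq, var_mc):
    """Weight minimizing MSE(w) = (1-w)^2 * Bias^2[f_MF] + w^2 * Var[f_MC].

    Setting d(MSE)/dw = 0 yields w* = Bias^2 / (Bias^2 + Var)."""
    return bias_mf_sq / (bias_mf_sq + var_mc)

def shrinkage_estimate(f_mc, f_mf, bias_mf_sq, var_mc):
    """Convex combination f = w * f_MC + (1 - w) * f_MF at the optimal w."""
    w = optimal_weight(bias_mf_sq, var_mc)
    return w * f_mc + (1 - w) * f_mf
```

When the mean field bias dominates, ω* → 1 and the estimator trusts Monte Carlo; when the Monte Carlo variance dominates, ω* → 0 and it trusts the mean field value.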

Annealed sampling

According to many embodiments of the present invention, drawing samples from probability distributions is an important component of many processes for training models. For many one-dimensional distributions this can typically be done with a simple function call. However, random sampling from Boltzmann machines is much more complex.

Sampling from a Boltzmann machine is typically performed using Gibbs sampling. Gibbs sampling is a local sampling process, which means that consecutive samples are correlated. Drawing uncorrelated samples requires many Gibbs sampling steps between successive samples, so it can take a long time to draw a batch of uncorrelated random samples from a Boltzmann machine. A batch of random samples is required for each gradient update; if generating each batch is slow, training the Boltzmann machine can become impractical. Thus, a method that reduces the correlation between successive samples from a Boltzmann machine can greatly accelerate the learning process.

Many methods for accelerating sampling from Boltzmann machines rely on an analogy to temperature from statistical physics. To this end, methods according to various embodiments of the present invention introduce a fictitious inverse temperature β into the Boltzmann machine by defining the probability distribution as follows:

p_β(v, h) = e^{−β E(v, h)} / Z(β), where Z(β) = Σ_{v,h} e^{−β E(v, h)}.

The original distribution of the Boltzmann machine is recovered by setting β = 1.

The fictitious temperature is useful because increasing the temperature (i.e., decreasing β) decreases the autocorrelation between samples. If ΔE_max denotes the largest energy barrier separating two configurations, then the time to travel from (v, h) to (v', h') scales approximately as:

t ∼ exp(β ΔE_max).

Thus, reducing β reduces the number of Gibbs sampling steps required to move between distant configurations.

Although raising the temperature reduces the mixing time, it also changes the resulting probability distribution. Therefore, simply sampling from a model with β < 1 during training will not allow the model to learn correctly. Processes according to certain embodiments of the present invention use a method called parallel tempering (in the machine learning and statistics literature) or replica exchange (in the physics community). In parallel tempering according to various embodiments of the present invention, multiple Gibbs sampling chains are run in parallel, each at a different temperature. Periodically, an attempt is made to exchange the configurations of two chains. In several embodiments, the exchange may be accepted or rejected based on a criterion (e.g., the Metropolis criterion) that ensures the entire system remains in equilibrium. Over time, a configuration starting at β = 1 will travel to chains at higher temperature (where it can more easily cross energy barriers) and return to the chain running at β = 1. This ensures that the chain running at β = 1 has a faster mixing time while still sampling from the correct probability distribution. However, there is a computational cost, because many Gibbs sampling chains must be run in parallel.
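The exchange step can be sketched as follows. This sketch assumes the standard replica exchange acceptance probability min(1, exp((β_i − β_j)(E_i − E_j))), which is one common form of the Metropolis criterion; it is an illustration, not code from the source.

```python
import numpy as np

def try_swap(energy_i, energy_j, beta_i, beta_j, rng):
    """Decide whether to exchange the configurations of two tempered
    Gibbs chains. Accepting with probability
    min(1, exp((beta_i - beta_j) * (energy_i - energy_j)))
    leaves the joint equilibrium distribution of the extended system
    of chains invariant (the replica exchange criterion)."""
    log_alpha = (beta_i - beta_j) * (energy_i - energy_j)
    return np.log(rng.uniform()) < log_alpha
```

A high-energy configuration on the cold chain (large β) is always handed to a hotter chain holding a lower-energy configuration, since log_alpha > 0 in that case.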

In some embodiments of the present invention, the process uses Temperature Driven Sampling (TDS), which greatly improves the ability to train Boltzmann machines without incurring significant additional computational cost. TDS is a variant of a sequential Monte Carlo sampler. A set of m samples is evolved independently using Gibbs sampling updates from the model. Note that this is different from running multiple chains in a parallel tempering process, because each of the m samples in the sequential Monte Carlo sampler is used to compute statistics, rather than only the samples from the β = 1 chain as in parallel tempering. Each of these samples has an inverse temperature drawn from a distribution with mean ⟨β⟩ = 1 and variance Var[β] < 1. In several embodiments, the inverse temperature of each sample may be updated independently once per Gibbs sampling iteration of the model. In various embodiments, the updates are autocorrelated across time so that the inverse temperatures change slowly. Thus, the set of samples is drawn from a distribution that is close to the model distribution but with fatter tails. This allows mixing to proceed much faster while ensuring that model averages (computed over the set of m samples) remain very similar to averages computed from the model with β = 1. An example process for sampling from an autocorrelated Gamma distribution is described below.

Input:

The autocorrelation coefficient 0 ≤ φ < 1.

The variance of the distribution Var[β] < 1.

The current value of β.

Set: ν = 1/Var[β] and c = (1 − φ) Var[β].

Draw z ∼ Poisson(β φ / c).

Draw β′ ∼ Gamma(ν + z, c).

Return β′.
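The update above can be sketched in Python, assuming NumPy's (shape, scale) parameterization of the Gamma distribution. By construction the stationary distribution of this process has mean 1 and variance Var[β], with φ controlling how slowly β drifts.

```python
import numpy as np

def update_beta(beta, phi, var_beta, rng):
    """One step of the autocorrelated Gamma process for the inverse
    temperature. Stationary distribution: Gamma with mean 1 and
    variance var_beta; phi sets the autocorrelation across steps."""
    nu = 1.0 / var_beta
    c = (1.0 - phi) * var_beta
    z = rng.poisson(beta * phi / c)
    return rng.gamma(shape=nu + z, scale=c)
```

Iterating the update from any starting point, the chain's long-run average settles near ⟨β⟩ = 1.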

TDS includes the standard Gibbs sampler as the limiting case Var[β] → 0. The samples drawn with TDS are not samples from the equilibrium distribution of the Boltzmann machine. In some embodiments, the drawn samples are re-weighted to correct for the bias due to the temperature variation.

Input:

The number of samples m.

The number of update steps k.

The autocorrelation coefficient for the inverse temperature 0 ≤ φ < 1.

The variance of the inverse temperature Var[β] < 1.

Initialization:

Randomly initialize m samples.

Randomly initialize m inverse temperatures β_i ∼ Gamma(1/Var[β], Var[β]).

Temperature Driven Sampling (TDS) improves sampling from Boltzmann machines. A direct comparison between samples drawn from a Boltzmann machine using conventional Gibbs sampling and samples drawn using TDS is illustrated in FIG. 16. GMM (grey) refers to samples from a Gaussian mixture model. GRBM (blue) refers to samples from an equivalent Boltzmann machine drawn using 10 steps of Gibbs sampling. TDS (red) refers to samples from an equivalent Boltzmann machine drawn using TDS with 10 steps of Gibbs sampling. This example uses a Gaussian mixture model with three modes at (−1, 0, +1) with various standard deviations, and a simple construction to create an equivalent Boltzmann machine with a Gaussian visible layer and a one-hot hidden layer with three hidden units. The autocorrelation coefficient and standard deviation of the inverse temperature were set to 0.9 and 0.95, respectively. All samples are initialized from the middle mode. Starting from the middle mode, conventional Gibbs sampling cannot reach the neighboring modes after 10 steps when the modes are well separated; in contrast, TDS, with its fatter tails, samples the neighboring modes much better.

Using TDS during training can have a fairly significant impact on the resulting model. In FIG. 17, two identical Gaussian-Bernoulli RBMs are trained on grayscale images of handwritten digits from the MNIST dataset. The images come from models of the same architecture trained with the same hyperparameters, except that one is trained with conventional Gibbs sampling (1710), i.e., with Var[β] = 0, and the other with TDS (1720), with Var[β] = 0.9. Both models are Gaussian-Bernoulli RBMs with 256 hidden units, trained with persistent contrastive divergence for 100 epochs using an ADAM optimizer with a learning rate of 0.0005 and a batch size of 100. Temperature Driven Sampling (TDS) improves learning of a model of the MNIST handwritten digits (grayscale). Both models achieve low reconstruction errors (data not shown), but the GRBM trained with the conventional Gibbs sampler cannot generate realistic fantasy particles. In contrast, the GRBM trained with TDS generates fantasy particles that look like realistic handwritten digits.

The specific process for extracting samples from a probability distribution according to an embodiment of the present invention is described above; however, those skilled in the art will recognize that any number of processes may be utilized as appropriate to the requirements of a particular application in accordance with embodiments of the present invention.

Applications

That is, even though the health outcome of a single patient may be predictable only as a likelihood, this capability enables the number of patients in a large population with that health outcome to be accurately predicted. For example, predicting health risks enables accurate estimation of the cost of insuring a group. Similarly, predicting the likelihood that a patient will respond to a particular treatment enables the probability of a positive outcome in a clinical trial to be estimated.

Simulating patient trajectories

The ability to develop accurate predictions of patient prognosis is a necessary step towards precision medicine. A patient may be represented as a collection of information describing their symptoms, their genetic information, the results of diagnostic tests, any medical treatments they are receiving, and other information that may be relevant for characterizing their health. A vector containing such information about a patient is sometimes referred to as a phenotype vector. Methods for prognosis prediction according to many embodiments of the invention use past and current health information about a patient to predict future health outcomes.

A patient trajectory refers to a time series that describes the detailed health of a patient (e.g., a phenotype vector of the patient) at various points in time. In several embodiments, the prognostic prediction takes a patient's trajectory (i.e., their past and current health information) and makes a prediction about a particular future health outcome (e.g., the likelihood that they will have a heart attack within the next 2 years). In contrast, predicting a patient's future trajectory involves predicting all information that characterizes their health state at all times in the future.

To formalize this mathematically, let v(t) be a phenotype vector containing all of the information that characterizes the patient's health at time t. The patient trajectory is then the set

V = {v(t_0), v(t_1), ..., v(t_T)}.

Many examples are described using discrete time steps (e.g., one month), but those skilled in the art will recognize that this is not required and that various other time steps may be employed in accordance with various embodiments of the present invention. In some embodiments of the invention, the model of the patient trajectory uses discrete time steps (e.g., one month). The length of the time step according to various embodiments of the present invention is chosen to approximately match the frequency of treatment. A model for a patient trajectory according to many embodiments of the present invention describes the joint probability distribution p(v_0, ..., v_T) over all points along the trajectory. Such a model can be used for prediction by sampling from the conditional probability distribution p(v_τ, ..., v_T | v_0, ..., v_{τ−1}). In many embodiments, the models are Boltzmann machines, because they facilitate expressing conditional distributions and can be adapted to heterogeneous datasets, but those skilled in the art will recognize that many of the processes described herein may also be applied to other architectures.
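The overall pattern of simulating conditional trajectories and reading off outcome risks can be sketched generically. Here `sample_future` stands in for drawing one trajectory from the model's conditional distribution p(v_τ, ..., v_T | v_0, ..., v_{τ−1}); both function names and signatures are hypothetical, not APIs from the source.

```python
def simulate_trajectories(sample_future, history, horizon, n_sims):
    """Monte Carlo simulation of future patient trajectories.

    sample_future(history, horizon) is a hypothetical sampler that
    draws one future trajectory (v_tau, ..., v_T) conditioned on the
    observed history (v_0, ..., v_{tau-1})."""
    return [sample_future(history, horizon) for _ in range(n_sims)]

def outcome_risk(trajectories, outcome_indicator):
    """Estimate the probability of an outcome as the fraction of
    simulated trajectories in which the indicator function fires."""
    hits = sum(1 for t in trajectories if outcome_indicator(t))
    return hits / len(trajectories)
```

Comparing `outcome_risk` across simulations conditioned on different treatment variables gives the treatment-comparison use case described below.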

Clinical decision support system

Clinical decision support systems provide information to patients, physicians, or other caregivers to help guide choices regarding patient care. Simulated patient trajectories provide insight into the future health of a patient, which may inform the choice of care. For example, consider a patient with mild cognitive impairment. It is beneficial for the physician or caregiver to understand the risk that the patient's condition develops into Alzheimer's disease or that he or she begins to exhibit other cognitive or psychological symptoms. In certain embodiments, a system based on simulated patient trajectories can predict these risks to guide care selection. Aggregating such predictions over a patient population may also help estimate population-level risk, enabling long-term planning by organizations (such as elderly care facilities) acting as caregivers for large groups of patients.

In some embodiments, a set of patient trajectories is collected from electronic medical records (also known as real-world data), natural history databases, or clinical trials. Patient trajectories according to many embodiments of the present invention may be normalized and used to train a time-dependent Boltzmann machine. To use the model, the patient's medical history up to the current time t_0 may be entered, and trajectories simulated by sampling from the conditional probability distribution of the Boltzmann machine. These simulated trajectories can then be analyzed to understand the risk associated with a particular outcome (e.g., a diagnosis of Alzheimer's disease) at various future times. In some cases, a model trained on data with treatment information will contain variables describing treatment selection. Such a model can be used to assess how different treatment options change the patient's future risk by comparing simulated outcome risks conditioned on different treatments. In many embodiments, a caregiver or physician can treat a patient based on the treatment selection and/or the simulated trajectories.

Control groups simulated for clinical trials

Randomized Clinical Trials (RCTs) are the gold standard for assessing the efficacy of a therapy. In an RCT, each patient is randomly assigned to one of two study groups: a treatment group, in which patients are treated with the experimental therapy, and a placebo group, in which patients receive a sham treatment and/or the current standard of care. At the end of the trial, a statistical analysis is performed to determine whether patients in the treatment group were more likely to respond positively to the new therapy than patients in the placebo group.

In order to have sufficient statistical power to accurately assess the efficacy of an experimental therapy, an RCT needs to include a large number of patients; for phase III clinical trials, it is not uncommon to include thousands of patients. Recruiting the large number of patients required to achieve sufficient power is challenging, and many clinical trials never reach their recruitment targets. Although, by definition, there is little data on an experimental treatment, there may be a great deal of data on the efficacy of the current standard of care. Thus, one way to reduce the number of patients required for a clinical trial is to replace the control group with a synthetic control group containing virtual patients simulated from a Boltzmann machine trained to model the current standard of care.

Methods according to several embodiments of the present invention create a synthetic or virtual control group for a clinical trial by training a Boltzmann machine using data from the control groups of previous clinical trials. In many embodiments, the data set may be constructed by aggregating data from the control groups of multiple clinical trials for a selected disease. The Boltzmann machine can then be trained to simulate patients with the disease under the current standard of care. The model can then be used to simulate a patient population with specific characteristics (e.g., age, race, medical history) to create a simulated patient cohort that matches the inclusion criteria of a new trial. In some embodiments, each patient in the experimental group can be matched to a simulated patient with the same baseline measurements by simulating from an appropriate conditional distribution of the Boltzmann machine. This provides a counterfactual (i.e., what would have happened if the patient had been given the placebo rather than the experimental therapy). In either case, data from the simulated patients may be used to supplement or replace data from a concurrent placebo group using standard statistical methods according to many embodiments of the invention.

Simulation head-to-head clinical trial

Traditionally, healthcare in the United States has been provided on a fee-for-service basis. However, there is currently a shift toward value-based care. In the context of medications, value-based care means that the cost of a medication will be based on its effectiveness, rather than simply a fee per dose. Thus, governments and other payers need to be able to compare the effectiveness of alternative therapies.

Consider two drugs, A and B, with the same indication. There are two standard methods to compare the efficacy of A and B. First, electronic health records and insurance claims data can be used to observe the performance of the drugs in real-world clinical practice. Alternatively, an RCT can be run to perform a head-to-head comparison of the drugs. Both methods take years of additional observation and/or experimentation to reach conclusions about the comparative effectiveness of A and B.

Simulations according to many embodiments of the present invention provide an alternative method for performing head-to-head comparisons. In some embodiments, detailed individual-level data from clinical trials of each drug may be included in the training data for the Boltzmann machine. In some embodiments, samples generated with a Boltzmann machine (such as one trained with BEAM) may be used to simulate a head-to-head clinical trial between A and B. However, individual-level data for the experimental groups used in clinical trials are typically not published. Without these data, aggregate-level data from the experimental groups according to various embodiments of the present invention may be used to adjust models trained on control group data.

Learning unsupervised genomic features

The human genome encodes over 20,000 genes involved in an extremely complex network of interactions. This network of gene interactions is so complex that it is difficult to develop a mechanistic model relating genotype to phenotype. Therefore, studies aimed at predicting phenotypes from genomic information must use machine learning methods.

A common goal of genomic studies in a clinical setting is to predict whether a patient will respond to a given therapy. For example, data describing gene expression (e.g., from messenger RNA sequencing experiments) can be collected at the beginning of a phase II clinical trial. The response of each patient to treatment is recorded at the end of the trial, and a mathematical model (e.g., linear or logistic regression) is trained to predict each patient's response from their baseline gene expression data. Successful prediction of patient response would enable the sponsor of a clinical trial to use genomic testing to narrow the study population to the subset of patients for whom the drug is most likely to be successful. This improves the likelihood of success in a subsequent phase III trial, while also improving patient outcomes through precision medicine.

Unfortunately, phase II clinical trials tend to be small (on the order of 200 people). Moreover, sequencing experiments for measuring gene expression are still rather expensive, so even non-clinical gene expression studies are limited in scale. Thus, the standard task involves training a regression model with up to 20,000 features (i.e., the expression of each gene) using fewer than 200 measurements. In general, if the number of features is greater than the number of measurements, a linear regression model is underdetermined. While there are techniques to alleviate this problem, most omics studies are so unbalanced that standard methods fail.

In many embodiments, the raw gene expression values are combined into a smaller number of synthetic features. For example, individual genes interact as parts of biochemical pathways, and one approach is to use known biochemical information to derive a score describing pathway activation. Pathway activation scores can then be used as features rather than the raw expression values. However, due to the complexity of biochemical networks, it may not be clear at the outset how pathway activation scores should be constructed.

In certain embodiments, a Deep Boltzmann Machine (DBM) is implemented as a tool for unsupervised feature learning that may be useful for omics studies. Let v be a vector of gene expression values determined from a sequencing experiment. A DBM describes the distribution of gene expression vectors using a probability distribution p(v) = ∫ dh_1 … dh_L p(v, h_1, ..., h_L), in which the layers of hidden units h_l describe progressive transformations of the gene expression values into higher-level features. Models according to many embodiments of the present invention may be trained without labels; thus, in some embodiments, large data sets may be compiled by combining many different studies. In various embodiments, a pre-trained DBM may be used to transform a vector of raw gene expression values into a lower-dimensional vector of features by computing ⟨h_L⟩_v = ∫ dh_1 … dh_L h_L p(h_1, ..., h_L | v). These lower-dimensional features according to certain embodiments of the present invention can then be used as input to simpler supervised learning algorithms to construct predictors of drug response for a given therapy.

Predicting transcriptome responses

Predicting the effect that changes in the activity or expression of a gene will have in humans is important for both drug design and drug development. For example, if the effect of a compound in a human can be predicted, then a high-throughput computational screen can be performed for drug discovery. Similarly, if the effect of a study drug on different types of patients can be predicted, patient selection can be optimized for phase II clinical trials even if there is no direct data on the effect of the drug in humans.

There is no obvious way to develop predictors of transcriptome responses using supervised learning methods. In many embodiments, generative models of gene expression are used to predict transcriptome responses. Let v be a vector of raw gene expression values, and let p_θ(v) be a model of the distribution of gene expression values parameterized by θ. Further, assume that the model is parameterized such that θ_i and v_i are related in the sense that increasing (or decreasing) θ_i results in an increase (or decrease) in ⟨v_i⟩. In many embodiments, the effect of a drug that reduces the activity of gene i can be simulated by decreasing θ_i and computing ⟨v⟩. In various embodiments, when the change is small, this involves computing the derivative ∂⟨v⟩/∂θ_i.

The utility of generative models according to several embodiments of the present invention relies on the ability of the model to implicitly learn the interactions between gene expression values. That is, the model must learn that reducing the activity of gene i using a therapy will, via a complex network of interactions, result in a reduction in the expression of some other gene j. In many embodiments, DBMs as described in the previous section of the present application are used as a generative model to learn the interactions between genes implicitly (i.e., without attempting to construct a mechanistic understanding of biochemical pathways or other models of direct gene interaction).

In many embodiments, a DBM trained in a completely unsupervised manner on gene expression data lacks the concept of an individual patient. Alternatively, the observation vector v may be divided into two parts: a vector of gene expression values x and a vector of metadata y. Metadata according to some embodiments of the present invention may describe characteristics of the sample such as, but not limited to, the tissue from which the sample came, the health status of the patient, or other information. Then, in various embodiments, predictions may be made from the conditional distribution

p(x | y) = p(x, y) / ∫ dx p(x, y).

Finally, predictions for individual patients according to several embodiments of the present invention may use a notion of locality in the gene expression space. Let E(x | y) = −log p(x | y) define an energy of x given y; in a DBM, computing it also involves integrating over all of the hidden layers. In some embodiments, a local measure of gene interaction may be computed based on the derivatives

∂²E(x | y) / ∂x_i ∂x_j

evaluated at x.

Although the present invention has been described in certain specific aspects, many additional modifications and variations will be apparent to those of ordinary skill in the art. It is, therefore, to be understood that the invention may be practiced otherwise than as specifically described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
