Adversarial Probability Regularization

Document No.: 1102615 · Published: 2020-09-25 · Views: 7 · Original language: Chinese

Reading note: This technique, Adversarial Probability Regularization (对抗性概率正则化), was devised by X. Sun, M. Shah, U. Kurup and J. Sun on 2019-02-21. Its main content is as follows: a method of training a supervised neural network to solve an optimization problem is presented, the optimization problem involving minimizing an error function E(w), where w is a vector of independent and identically distributed (i.i.d.) samples of a target distribution P. The method comprises generating an adversarial probability regularization term (APR) R(w; θ) using the discriminator of a generative adversarial network, the discriminator receiving as input samples drawn from the coordinates of w and samples drawn from a regularization distribution. Then, for each training iteration of the supervised neural network, the APR R(w; θ) is added to the error function E(w).

1. A method of training a supervised neural network to solve an optimization problem, the optimization problem involving minimizing an error function E(w), where w is a vector of independent and identically distributed (i.i.d.) samples of a target distribution P, the method comprising:

generating an adversarial probability regularization term (APR) R(w; θ) using the discriminator of a generative adversarial network, the discriminator receiving as input samples drawn from the coordinates of w and samples drawn from a regularization distribution; and

for each training iteration of the supervised neural network, adding the APR R(w; θ) to the error function E(w).

2. The method of claim 1, wherein the target distribution P is a discrete distribution.

3. The method of claim 1, wherein the optimization problem is given by

min_w E(w) + λR(w)

wherein λ is a scaling coefficient.

4. The method of claim 3, wherein the APR R(w; θ) is given by

R(w; θ) = (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)]

wherein f_θ is the discriminator, implemented as a network parameterized by θ, and wherein, after the APR R(w; θ) is substituted into the optimization problem, the optimization problem is given by

min_w max_{θ: Lip(f_θ) ≤ 1} E(w) + λ[ (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)] ]

5. The method of claim 4, wherein the error function is given by

E(w) = (1/N) Σ_{j=1}^{N} ℓ(g_w(x_j), y_j)

wherein {(x_j, y_j)}_{j=1}^{N} are data-label pairs, ℓ is a loss function, and g_w is a deep neural network parameterized by w, and wherein, after the error function E(w) is substituted into the optimization problem, the optimization problem is given by

min_w max_{θ: Lip(f_θ) ≤ 1} (1/N) Σ_{j=1}^{N} ℓ(g_w(x_j), y_j) + λ[ (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)] ]

6. The method of claim 2, wherein the discrete distribution is a binary distribution.

7. The method of claim 6, wherein the target distribution is set to P = (1/2)δ_{+1} + (1/2)δ_{−1}.

8. The method of claim 2, wherein the discrete distribution is a ternary distribution.

9. The method of claim 8, wherein the target distribution is set to P = (1 − ε)δ_0 + (ε/2)δ_{+1} + (ε/2)δ_{−1} for a small ε.

10. A neural network training system, comprising:

a non-transitory computer-readable storage medium storing programming instructions; and

a processor configured to execute the programming instructions,

wherein the programming instructions comprise instructions that, when executed by the processor, cause the processor to perform a method of training a supervised neural network to solve an optimization problem, the optimization problem involving minimizing an error function E(w), where w is a vector of independent and identically distributed (i.i.d.) samples of a target distribution P, the method comprising:

generating an adversarial probability regularization term (APR) R(w; θ) using the discriminator of a generative adversarial network, the discriminator receiving as input samples drawn from the coordinates of w and samples drawn from a regularization distribution; and

for each training iteration of the supervised neural network, adding the APR R(w; θ) to the error function E(w).

11. The system of claim 10, wherein the target distribution P is a discrete distribution.

12. The system of claim 10, wherein the optimization problem is given by

min_w E(w) + λR(w)

wherein λ is a scaling coefficient.

13. The system of claim 12, wherein the APR R(w; θ) is given by

R(w; θ) = (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)]

wherein f_θ is the discriminator, implemented as a network parameterized by θ, and wherein, after the APR R(w; θ) is substituted into the optimization problem, the optimization problem is given by

min_w max_{θ: Lip(f_θ) ≤ 1} E(w) + λ[ (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)] ]

14. The system of claim 13, wherein the error function is given by

E(w) = (1/N) Σ_{j=1}^{N} ℓ(g_w(x_j), y_j)

wherein {(x_j, y_j)}_{j=1}^{N} are data-label pairs, ℓ is a loss function, and g_w is a deep neural network parameterized by w, and wherein, after the error function E(w) is substituted into the optimization problem, the optimization problem is given by

min_w max_{θ: Lip(f_θ) ≤ 1} (1/N) Σ_{j=1}^{N} ℓ(g_w(x_j), y_j) + λ[ (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)] ]

15. The system of claim 11, wherein the discrete distribution is a binary distribution.

16. The system of claim 15, wherein the target distribution is set to P = (1/2)δ_{+1} + (1/2)δ_{−1}.

17. The system of claim 11, wherein the discrete distribution is a ternary distribution.

18. The system of claim 17, wherein the target distribution is set to P = (1 − ε)δ_0 + (ε/2)δ_{+1} + (ε/2)δ_{−1} for a small ε.

Technical Field

The present disclosure relates generally to neural networks, and in particular to training neural networks.

Background

Many problems in machine learning involve solving optimization problems of the form

min_w E(w)  subject to w = (w_1, …, w_n), w_i i.i.d. ~ P        (1)

Here, P is the target distribution. Two examples involving this optimization problem are sparse regression and supervised neural networks. For sparse regression, E(w) is the data-fitting error (error function), and P is a distribution that favors sparsity or compressibility (e.g., Bernoulli-sub-Gaussian or Laplacian). For a supervised neural network, E(w) is the training (i.e., data-fitting) error, and P promotes certain structure in the network weights w. For example, P may be Gaussian to ensure that the weight distribution is "democratic". More interesting in practice is the case in which P is a discrete distribution, such as binary {+1, −1} or ternary {+1, 0, −1}: these distributions yield compact (i.e., quantized and sparse) networks that are computationally efficient, desirable for hardware implementation, and also robust to adversarial examples.

The present disclosure focuses primarily on training a compact supervised neural network for solving the problem of form (1) above. To translate form (1) into a specific computational problem, consider a regularized version of form (1):

min_w E(w) + λR(w)        (2)

Here, the coordinates of w are regarded as i.i.d. (independent and identically distributed) samples of the target distribution P, and a small value of R(w) corresponds to the empirical distribution of the coordinates of w being close to P. For the purposes of this disclosure, R(w) is referred to as a probabilistic regularization term. The adjustable parameter λ controls the strength of the regularization term relative to E(w).

Given P, a natural choice for R is some monotonic function of the probability density function (PDF) of P, similar to how priors are encoded in Bayesian inference. Two challenges are prominent: (i) a general probability distribution may not have a density function, and even when it does, the density may not have any closed form; (ii) the density may be discontinuous, and the discrete distributions of particular interest here have discrete support. To optimize (2) at scale using derivative-based or other scalable methods, a significant amount of analysis and design effort is required to address both challenges.

Another natural choice is to measure the difference between the empirical moments of the coordinate distribution of w and the moments of the target P, i.e., to define R under the umbrella of moment-matching methods. This approach tends to incur a large computational burden due to the computation of moments, and it is also unsuitable for distributions with unbounded moments (e.g., heavy-tailed distributions).

Disclosure of Invention

According to one embodiment of the present disclosure, a method of training a supervised neural network to solve an optimization problem involving minimizing an error function E(w) is presented, wherein w is a vector of independent and identically distributed (i.i.d.) samples of a target distribution P. The method includes generating an adversarial probability regularization term (APR) R(w; θ) using the discriminator of a generative adversarial network. The discriminator receives as input samples drawn from the coordinates of w and samples drawn from a regularization distribution. Then, for each training iteration of the supervised neural network, the APR R(w; θ) is added to the error function E(w).

In accordance with another embodiment of the present disclosure, a neural network training system is provided that includes a memory for storing programming instructions and a processor configured to execute the programming instructions. The programming instructions include instructions that, when executed by the processor, cause the processor to perform a method of training a supervised neural network to solve an optimization problem involving minimizing an error function E(w), wherein w is a vector of independent and identically distributed (i.i.d.) samples of a target distribution P. The method includes generating an adversarial probability regularization term (APR) R(w; θ) using the discriminator of a generative adversarial network. The discriminator receives as input samples drawn from the coordinates of w and samples drawn from a regularization distribution. Then, for each training iteration of the supervised neural network, the APR R(w; θ) is added to the error function E(w).

Drawings

FIG. 1 is a schematic illustration of a neural network training system according to the present disclosure;

FIG. 2 depicts an algorithm for generating an adversarial probability regularization term (APR);

FIG. 3 shows a table comparing APR and GMM regularization networks;

FIG. 4 shows a histogram of the weights for each layer of LeNet-5;

FIG. 5 depicts the evolution of the weight distribution at the end of epochs 1, 10, 50, 100 and 400 in which ResNet-44 is trained on CIFAR-10;

FIG. 6 illustrates a classification error table for a binary network and a ternary network;

FIG. 7 shows a learning curve for training ResNet-20 with ternary weights;

FIG. 8 is a schematic illustration of a computing device for implementing the framework described herein.

Detailed Description

For the purposes of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiments illustrated in the drawings and described in the following written specification. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended. It is further understood that the present disclosure includes any alterations and modifications to the illustrated embodiments and includes further applications of the principles of the disclosure as would normally occur to one of ordinary skill in the art to which the disclosure relates.

The present disclosure is directed to systems and methods for training a supervised neural network whose weights are constrained to follow a target distribution P, the constraint being enforced by minimizing a probabilistic regularization term R(w). This approach is motivated by the recent empirical success of generative adversarial networks (GANs) in learning the distribution of natural images or language. The central idea of the approach described herein is that the distribution-matching problem is reformulated in the GAN framework as a distribution-learning problem, which yields a natural parameterized regularization term that is learned from the data.

GAN was originally proposed for generating images that look natural, and has subsequently been extended to a variety of other applications, including semi-supervised learning, image super-resolution, and text generation.

A GAN implements a competitive game between a generator G and a discriminator D, both of which are functions: given a target distribution P and a noise (i.e., non-informative) distribution Q, G learns to generate samples of the form G(z), z ~ Q, that spoof D, while D learns to discriminate true samples x ~ P from fake samples G(z). Ideally, at equilibrium, G learns the true distribution P, so that G(z) ~ P. Mathematically, D learns to assign high values to true samples and low values to fake samples, and the game can be implemented as the following saddle-point optimization problem:

min_G max_D E_{x~P}[log D(x)] + E_{z~Q}[log(1 − D(G(z)))]

This formulation cannot learn degenerate distributions, such as discrete distributions or distributions supported on low-dimensional manifolds, because it implicitly chooses a strong distance measure between distributions. The Wasserstein GAN (WGAN) was proposed to alleviate some of these problems by using a weaker metric, the earth-mover distance, also known as the Wasserstein-1 (W-1) distance. For two distributions P and Q, the distance is computed as

W_1(P, Q) = sup_{Lip(f) ≤ 1} E_{x~P}[f(x)] − E_{y~Q}[f(y)]        (3)

where Lip(f) denotes the Lipschitz constant of f. Minimizing the W-1 distance between the generator distribution and the target distribution then yields the minimax problem

min_G max_{f: Lip(f) ≤ 1} E_{x~P}[f(x)] − E_{z~Q}[f(G(z))]

This simple change of metric has resulted in improved learning performance on several tasks.

In the context of the present disclosure, discrete distributions are of interest, and therefore the W-1 distance as used in WGAN is a reasonable metric. This suggests the following choice of probabilistic regularization term:

R(w) = max_{f: Lip(f) ≤ 1} (1/n) Σ_{i=1}^{n} f(w_i) − E_{z~P}[f(z)]        (4)

Since only a finite-dimensional w is considered, the term (1/n) Σ_{i=1}^{n} f(w_i) directly replaces the expectation under the empirical distribution of the coordinates of w.

As is standard in the GAN literature, f is implemented as a deep network with weight vector θ, and the notation f_θ is used to make this dependency explicit. Combining this with (2), the central optimization problem of the present disclosure is obtained as:

min_w max_{θ: Lip(f_θ) ≤ 1} E(w) + λ[ (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)] ]        (5)

One significant feature of this approach, inherent to the GAN framework, is that only samples from the target distribution P are required, as specified by the term E_{z~P}[f_θ(z)]. This is advantageous compared to methods that rely on the existence of a PDF with reasonable regularity (e.g., closed form and possibly also differentiability): samples can be obtained easily even when no such density exists, which is exactly the situation when learning a discrete distribution.

Fig. 1 depicts a conceptual illustration of a neural network training system 10 according to the present disclosure; the neural network training system 10 uses the discriminator network from a GAN to generate an adversarial probability regularization term (APR). As depicted in FIG. 1, the system comprises a raw learner (error function) 12 and a discriminator network 14 parameterized by θ. The raw learner 12 attempts to find w that makes E(w) small and whose coordinate-wise empirical distribution fools the discriminator. The discriminator 14 attempts to find θ such that it can distinguish true samples drawn from the target distribution P from "fake" samples drawn from the coordinates of w. The discriminator 14 outputs the APR R(w; θ), which is added to the error function E(w) at the addition node 16. The output of the addition node 16 corresponds to the objective of (5).

The framework described herein admits the same generator-discriminator game interpretation as a GAN (FIG. 1), but with two important differences from a classical GAN. First, there is no generator: the framework works directly with the empirical samples, of which there are only a finite number, namely the coordinates of the finite-dimensional vector w. In contrast, a classical GAN is expected to learn a valid generator that can (hopefully) always generate samples distributed according to P. Second, when shaping the empirical samples (i.e., all coordinates of w) to match/spoof the discriminator network, there is an additional term E(w) to be minimized.

In order to adapt the method to learning compact neural networks, the model optimization problem (5) is modified into a supervised learning problem based on deep neural networks (DNNs). Given data-label pairs {(x_j, y_j)}_{j=1}^{N}, the following error function is defined:

E(w) = (1/N) Σ_{j=1}^{N} ℓ(g_w(x_j), y_j)

where ℓ is a loss function and g_w is a DNN parameterized by w. Substituting this into the optimization problem (5) yields a saddle-point optimization problem of the form:

min_w max_{θ: Lip(f_θ) ≤ 1} (1/N) Σ_{j=1}^{N} ℓ(g_w(x_j), y_j) + λ[ (1/n) Σ_{i=1}^{n} f_θ(w_i) − E_{z~P}[f_θ(z)] ]        (6)

Because of the practical advantages of quantized and sparse weights in both training and inference, the target distribution P may be set so as to learn a suitably compact network. For example, one may set

P = (1/2)δ_{+1} + (1/2)δ_{−1}

to learn quantized binary networks, or, for a small ε, set

P = (1 − ε)δ_0 + (ε/2)δ_{+1} + (ε/2)δ_{−1}

to learn sparse and quantized networks. The optimization algorithm used is the same as that of a classical GAN, i.e., alternating (stochastic) gradient descent and ascent, which is summarized in the algorithm depicted in FIG. 2. At convergence, a single round of simple coordinate-wise rounding is applied to w.
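The alternating scheme can be sketched numerically as follows. This is a toy sketch under explicit assumptions, not the disclosed implementation: the error function is a simple quadratic pull toward a reference vector w0, the deep discriminator f_θ is replaced by a single hand-crafted feature θ·(|x| − 1)², which vanishes exactly on the binary support {+1, −1}, and the clipping of θ is a crude Lipschitz control.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
w0 = rng.normal(0.0, 1.0, n)        # toy data term pulls w toward w0
w = w0.copy()
theta = 0.0                         # critic parameter: f_theta(x) = theta*(|x|-1)^2
lam, lr_w, lr_t = 2.0, 0.01, 0.1    # lam absorbs the 1/n factor of the APR

def phi(x):                         # hand-crafted critic feature (an assumption)
    return (np.abs(x) - 1.0) ** 2

def dphi(x):                        # its derivative w.r.t. x
    return 2.0 * (np.abs(x) - 1.0) * np.sign(x)

def gap(x):                         # mean distance of coordinates from {+1, -1}
    return float(np.mean(np.abs(np.abs(x) - 1.0)))

gap0 = gap(w)
for _ in range(2000):
    z = rng.choice([-1.0, 1.0], size=n)            # samples from the target P
    # ascent on theta: maximize mean f_theta(w_i) - mean f_theta(z_i)
    theta += lr_t * (np.mean(phi(w)) - np.mean(phi(z)))
    theta = float(np.clip(theta, -1.0, 1.0))       # projection onto [-1, 1]
    # descent on w: gradient of E(w) + lam * mean f_theta(w_i)
    w -= lr_w * (2.0 * (w - w0) + lam * theta * dphi(w))

print(round(gap0, 3), round(gap(w), 3))   # the gap to {+1, -1} shrinks
```

Even with this one-parameter critic, the adversarial term pulls every coordinate of w toward the binary support while the data term keeps w close to w0, mirroring the trade-off controlled by λ in (2).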

There are two main families of prior methods for network quantization and sparsification against which this method can be compared and contrasted, partitioned according to whether quantization and sparsification intervene in the training process. Many existing methods operate on trained networks, without any proactive control over the potential loss of prediction accuracy due to quantization and sparsification. In contrast, other recent approaches perform training and quantization (and/or sparsification) simultaneously. The present method belongs to the second category.

Direct training subject to quantization and sparsification constraints requires difficult discrete optimization. Existing approaches differ in how the constraints are enforced. One possibility is to heuristically interleave gradient descent with quantization (and possibly also sparsification) steps.

The on-the-fly quantization step tends to save significantly on forward- and backward-propagation costs. However, these methods are not principled from an optimization point of view. Another possibility is to embed the entire learning problem into a Bayesian framework, so that quantization and sparsity can be promoted by placing appropriate Bayesian priors on the network weights. Bayesian frameworks have been shown to exhibit an automatic regularization effect that is advantageous for network compression. Furthermore, in theory any desired structural prior on the weights can be enforced. However, discrete distributions are not amenable to practical Bayesian inference via numerical optimization; analytical techniques such as reparameterization or continuous relaxation are required to find surrogates for discrete distributions so that efficient computation can be performed.

In contrast to the above possibilities, quantization and sparsification are here encoded via an adversarial network that is fed directly with samples from the desired discrete distribution. The discreteness prior is thus enforced in a principled way. The (sometimes substantial) analytical work needed in a Bayesian framework to derive benign surrogates for discrete distributions is avoided, since only samples from the discrete target distribution are required, and these are typically readily available.

The following is a description of three techniques that may be used in an implementation. These techniques are not necessary, but may be beneficial.

The first technique is θ clipping. Note that the optimizations in (5) and (6) are subject to the constraint that f_θ be 1-Lipschitz, where the constant 1 can be relaxed to any bounded K: it suffices that f_θ be Lipschitz. Since f_θ is implemented as a neural network, f_θ is Lipschitz only when θ is bounded. This can be approximated by following each update of θ with a projection of each of its components onto [−1, 1].

Another technique is weighted sampling of the coordinates of w. The coordinates of w are assumed to be i.i.d. However, when training a deep network, different layers may have very different numbers of nodes, resulting in imbalanced weight counts; this is particularly true for the first and last layers, which typically have few weights compared to the other layers. This imbalance makes it difficult to quantize the first and last layers, because in a stochastic optimization setting the layers with many weights tend to be sampled more frequently, so their weights tend to converge quickly to the target distribution. In the APR framework, the problem can easily be addressed by re-weighting the samples: let n_i be the number of weights in the i-th layer; the probability of sampling a weight in the i-th layer is then scaled by a factor inversely proportional to n_i.

The third technique is homotopy continuation on the regularization distribution. For a discrete target distribution P, the ideal discriminator f_θ would be discretely supported, which may take the neural network a significant amount of time to learn to approximate. A homotopy continuation technique can be used that moves the regularization distribution gradually from a "good" auxiliary distribution Q toward the target distribution P, e.g.,

P_t = (1 − t/T) Q + (t/T) P

where t/T is a time factor and T is the total number of training epochs. Q can conveniently be chosen as a continuous uniform distribution covering the range of P. This can be thought of as a coarse, hierarchical smoothing process for the discrete distribution, controlled by mixing the input samples, which is a distinguishing feature of the present method; it may be contrasted with fine analytical smoothing or reparameterization techniques for discrete distributions. This homotopy continuation empirically improves the convergence speed, but is not necessary for convergence.
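The continuation can be sketched as sampling from a mixture that shifts linearly from the auxiliary distribution to the target over training; the linear schedule, the Uniform[−1, 1] auxiliary, and the ternary target with ε = 0.5 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_regularizer(n, t, T, eps=0.5):
    """Draw n samples from (1 - t/T)*Q + (t/T)*P, with Q = Uniform[-1, 1]
    and P the ternary target with mass 1-eps at 0 and eps/2 at each of +/-1."""
    alpha = t / T                              # time factor in [0, 1]
    from_target = rng.random(n) < alpha        # choose mixture component
    q = rng.uniform(-1.0, 1.0, n)              # auxiliary samples
    p = rng.choice([-1.0, 0.0, 1.0], size=n,
                   p=[eps / 2, 1.0 - eps, eps / 2])
    return np.where(from_target, p, q)

early = sample_regularizer(100_000, t=0, T=400)    # start of training
late = sample_regularizer(100_000, t=400, T=400)   # end of training
on_support = lambda x: float(np.mean(np.isin(x, (-1.0, 0.0, 1.0))))
print(on_support(early), on_support(late))   # continuous early, discrete late
```

Early in training the discriminator sees a continuous, easy-to-fit regularization distribution; by the end it sees only the discrete target, matching the gradual schedule described above.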

The present disclosure focuses on solving problems of form (1), particularly in the context where P is a discrete distribution, for learning quantized and sparse neural networks. Previous methods solve the resulting mixed continuous-discrete optimization problem either by projected-gradient heuristics (i.e., gradient descent interleaved with quantization and/or sparsification), or by embedding the problem into a Bayesian framework, which in turn requires solving analytical and computational problems around discrete distributions. In contrast, the present disclosure proposes an adversarial probability regularization (APR) framework for this problem with the following characteristics:

(1) The deep-network implementation of the regularization term is differentiable almost everywhere (a.e.). Therefore, if E(w) is a.e. differentiable, which is true in particular when it is also based on a deep network, the combined minimax objective in (5) is amenable to gradient-based optimization methods. The Lipschitz constraint in (5) can be implemented as a convex constraint on θ, so the resulting optimization problem tends to be better behaved, from an optimization point of view, than those derived from mixed continuous-discrete approaches.

(2) The regularization requires only samples from P, not P itself (e.g., its density). This allows considerable generality in the choice of P, as long as samples are readily available; sampling is particularly simple when P is a discrete distribution. This avoids many of the analytical and computational obstacles surrounding Bayesian methods.

The simple approach presented herein is advantageous over prior-art approaches to network quantization and sparsification. In the method as set forth herein, the coordinates of w are assumed to be i.i.d., which may be limiting for some applications. The Bayesian framework is not so limited in theory, but, as discussed above, its analytical and computational tractability can be a problem. For a sufficiently large deep network, the present framework can be generalized to prior distributions over short segments of w.

For network quantization and sparsification, methods that perform on-the-fly quantization and sparsification at each optimization iteration tend to save a large amount of forward- and backward-propagation computation. Although, as indicated above, this is less principled from an optimization point of view, the present method can easily be modified to perform such on-the-fly operations.

Several approaches, including the present approach, have reported that the performance of a quantized network is comparable to that of a real-valued network. In theory, the capacity of quantized networks is still not well understood; for example, it is not clear whether a universal approximation theorem holds for quantized networks.

Experiments are performed on sparse recovery and image classification tasks to study the behavior and verify the effectiveness of APR. Image classification is evaluated on two datasets, MNIST and CIFAR-10. The comparison methods include generative moment matching (GMM), BinaryConnect, trained ternary quantization (TTQ), and variational network quantization (VNQ).

GMM is the prior approach most closely related to GAN-based methods. To our knowledge, GMM has not previously been developed or adopted for regularization purposes; here, GMM is adapted for probabilistic regularization and compared with APR. More specifically, given a sample set {z_j}_{j=1}^{m} drawn from the regularization distribution and the weight set {w_i}_{i=1}^{n}, the distributional distance between the two sample sets is measured by the maximum mean discrepancy (MMD):

MMD²({w_i}, {z_j}) = (1/n²) Σ_{i,i′} k(w_i, w_{i′}) − (2/(nm)) Σ_{i,j} k(w_i, z_j) + (1/m²) Σ_{j,j′} k(z_j, z_{j′})        (8)

where k(·, ·) is a kernel whose bandwidth σ is chosen to match higher-order moments. To train a deep network whose weights are constrained by an arbitrary prior using GMM, the empirical loss function (2) is minimized, with the regularization term R defined by (8). To achieve better performance, the following heuristics are applied to (8): the square root of the MMD is used as the regularization term, and a mixture of Gaussian kernels is employed as the kernel function.

For network binarization, the method is compared with BinaryConnect on a VGG-like deep network. For network ternarization, the method is compared with the TTQ baseline on residual networks with 20, 32, 44, and 56 layers, having 0.27M, 0.46M, 0.66M, and 0.85M learnable parameters, respectively. The method is also compared with a recently proposed continuous-relaxation-based method for network ternarization, namely variational network quantization (VNQ); under a consistent experimental setup, this comparison is carried out on DenseNet-121.

Adam is used to train the quantized network, with default hyperparameter settings for the primary network and separately chosen hyperparameters for the regularizing network. The baseline models are also trained with Adam for fair comparison. The sample batch size for evaluation is 256. The weight learning rate is scaled by the weight-initialization coefficient. Throughout the experiments, the weights are realized with binary or ternary values; for ternary networks, priors with various sparsity levels are evaluated. For each dataset, conventional image pre-processing and augmentation are followed. The regularizing network is a multi-layer perceptron (MLP) with three hidden layers and ReLU activation functions.

First, network binarization and ternarization are performed on the MNIST dataset for digit classification. In this experiment, a modified LeNet-5 is employed, which contains four weight layers with 1.26M learnable parameters. The quantized network is trained from a pre-trained full-precision model with a baseline error of 0.76%. The learning rate starts at 0.001 and decays linearly to zero over 200 epochs. The performance of the APR network and the GMM-regularized network are compared in this experiment, with the same learning schedule for both methods. The bandwidths σ for the Gaussian mixture kernel are set to {0.001, 0.005, 0.01, 0.05, 0.1}. The regularization parameters for GMM and for APR are set separately.

The following is a comparison of the APR- and GMM-regularized networks. Referring to the table depicted in FIG. 3, APR (shown in the table as APR-T, with T denoting ternary weights) achieves a competitive error of 0.83%, outperforming GMM (shown as GMM-T) by 0.6%. Both methods drive the weights into a clear ternary pattern. However, regularizing deep networks with GMM suffers from scalability issues, even for small networks such as LeNet-5: because of the kernel evaluations in (8), the computational cost of the GMM regularization term grows quadratically with the number of weights. For LeNet-5, even when only 1% of the weights are randomly selected and regularized at each step, about 10^7 kernel evaluations are still required per step. In contrast, given a regularizing network of fixed size, the computational cost of APR grows linearly with the number of weights.

The first and last layers of a deep network are more difficult to quantize because of the unbalanced sizes of the different layers. The problem is particularly severe for LeNet-5: the four layers of the network contain 500, 0.25M, 1.2M, and 5K weights, respectively, so the empirical weight distribution is dominated by the third layer. As set forth above, the problem can readily be addressed by the weighted sampling technique. The weight histograms for each layer of LeNet-5 are illustrated in FIG. 4, showing, for each layer, the weights trained with uniform sampling and with the weighted sampling technique described above. In both cases, the weights of the third layer converge to a ternary pattern, where the two histograms overlap each other. However, without weighted sampling, the weights of the first layer fail to fit the regularization prior; with weighted sampling, the weights of all four layers exhibit a strong ternary pattern.

The classification performance of the APR-regularized network was evaluated on the CIFAR-10 dataset, which consists of 50,000 training and 10,000 test RGB images of size 32x32. The standard data preparation strategy was used on CIFAR-10: both the training and test images are preprocessed by subtracting the per-pixel mean. The training set is augmented by padding 4 pixels on each edge of an image and randomly cropping a 32x32 region. The mini-batch size for training the primary network is 128. The method was evaluated on VGG-9 and ResNet-20, -32, and -44.
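The data preparation described above can be sketched in numpy as follows; the function name and the single up-front crop are simplifications (real pipelines re-crop every epoch), and the arrays here are random stand-ins for CIFAR-10:

```python
import numpy as np

def prepare_cifar(train, test, pad=4, crop=32, rng=None):
    """CIFAR-10 preparation as described: subtract the per-pixel mean
    (computed on the training set) from both splits, then pad the training
    images by 4 pixels per edge and randomly crop a 32x32 region."""
    rng = rng or np.random.default_rng(0)
    mean = train.mean(axis=0)                      # per-pixel mean, shape (32, 32, 3)
    train = train - mean
    test = test - mean
    # Zero-pad 4 pixels on each spatial edge, then crop back to 32x32.
    padded = np.pad(train, ((0, 0), (pad, pad), (pad, pad), (0, 0)))
    out = np.empty_like(train)
    for i, img in enumerate(padded):
        y, x = rng.integers(0, 2 * pad + 1, size=2)
        out[i] = img[y:y + crop, x:x + crop]
    return out, test

train = np.random.default_rng(1).random((8, 32, 32, 3)).astype(np.float32)
test = np.random.default_rng(2).random((4, 32, 32, 3)).astype(np.float32)
aug, test_p = prepare_cifar(train, test)
```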

In this experiment, the weights are constrained to binary or ternary values. For fair comparison, the same quantization protocol is followed: the first convolutional layer and the fully-connected layer are not quantized, since together they contain less than 0.4% of the total weights. The deep neural network was trained for a total of 400 epochs with an initial learning rate of 0.01. At the end of epochs 80, 120, and 150, the learning rate decays by a factor of 10. Weight decay is not used, because the APR is already a strong regularizer of the weights. To facilitate convergence of the network, homotopy continuation with an auxiliary uniform distribution is employed. Since the APR does not enforce exactly discrete values, rounding noise is added to the weights after 350 epochs.
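The stated training protocol can be sketched with two small helpers; the clipping range and ternary rounding levels are assumptions, since the text does not spell out the rounding rule:

```python
def learning_rate(epoch, base=0.01, milestones=(80, 120, 150)):
    """Step schedule from the experiment: start at 0.01 and decay by a
    factor of 10 at the end of epochs 80, 120, and 150."""
    lr = base
    for m in milestones:
        if epoch >= m:
            lr /= 10.0
    return lr

def maybe_round(weight, epoch, start=350):
    """After epoch 350, apply rounding: snap each weight to its nearest
    discrete level, since APR alone does not enforce exact discreteness.
    Clipping to [-1, 1] and the ternary levels {-1, 0, 1} are assumptions."""
    if epoch < start:
        return weight
    return float(round(max(-1.0, min(1.0, weight))))
```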

FIG. 5 shows the evolution of the weight distribution at the end of epochs 1, 10, 50, 100, and 400, with ResNet-44 trained on CIFAR-10. The top row shows binary weights and the bottom row ternary weights. The solid line corresponds to the distribution learned according to the regularization function, scaled to [0, 1] for display purposes. The dotted line shows the regularization distribution; the discrete distribution is smoothed for display purposes. Shaded regions show the empirical distribution of the weights. As can be seen, the distribution learned according to the regularization function is close to the discrete prior.

FIG. 7 shows the learning curve for training ResNet-20 with ternary weights over the first 200 epochs. Given strong regularization, training of the primary network stalls without homotopy continuation (black line). In contrast, when homotopy continuation is used (red line), the network resumes convergence while achieving weights with a ternary pattern. By selecting a small value, the discrete prior implemented by the regularizing network is implicitly relaxed, and the loss also drops rapidly.
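The homotopy continuation described above can be sketched as annealing the regularization distribution from the auxiliary uniform distribution toward the discrete prior; the mixture form, the annealing parameter t, and the ternary levels are assumptions for illustration:

```python
import random

def homotopy_prior_sample(t, rng, levels=(-1.0, 0.0, 1.0)):
    """Homotopy continuation sketch: the regularization distribution is a
    mixture that moves from an auxiliary uniform distribution (t = 0) to
    the discrete ternary prior (t = 1) as training progresses."""
    if rng.random() < t:
        return rng.choice(levels)      # discrete ternary prior
    return rng.uniform(-1.0, 1.0)      # auxiliary uniform distribution

rng = random.Random(0)
uniform_draws = [homotopy_prior_sample(0.0, rng) for _ in range(50)]
discrete_draws = [homotopy_prior_sample(1.0, rng) for _ in range(50)]
```

Early in training the discriminator then sees an easy, smooth target; as t grows, the target sharpens into the discrete prior the weights must ultimately match.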

Fig. 6 shows classification error tables for the binary and ternary networks. The method is compared to a full-precision baseline model, BinaryConnect (BC), and Trained Ternary Quantization (TTQ). Although the method is able to train a discrete network from scratch, for fair comparison the networks are trained from a pre-trained full-precision model. APR-B refers to APR regularization with binary weights, and APR-T refers to APR regularization with ternary weights. Models fine-tuned from a pre-trained full-precision network are labeled in the tables. The method achieves state-of-the-art ternary performance on VGG-9, ResNet-20, and ResNet-32. Deep networks using ternary APR introduce less performance degradation than the full-precision network on ResNet-44, and exceed the full-precision networks on VGG-9, ResNet-20, and ResNet-32. APR-B achieves 7.82% error on VGG-9, outperforming BC by 2.5%. The ternary network further reduces the error to 7.47%.

FIG. 8 depicts an embodiment of a computer system 100 that may be used to implement the framework described herein. In particular, the computer system includes at least one processor 102, such as a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) device, or a microcontroller. The processor 102 is configured to execute programming instructions stored in the memory 104. The memory 104 may be any suitable type of memory, including solid state memory, magnetic memory, or optical memory, to name a few, and may be implemented in a single device or distributed across multiple devices. The programming instructions stored in the memory 104 include instructions for implementing the various functions of the system described herein, including training the supervised neural network and generating the adversarial probability regularization term. The computing system may include one or more network interface devices 106 for transmitting and receiving data and communications via a network.

While the disclosure has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character. It is understood that only the preferred embodiment has been presented and that all changes, modifications, and further applications that come within the spirit of the disclosure are desired to be protected.
