Method, device and storage medium for updating parameters

Document No.: 1905617    Publication date: 2021-11-30

Reading note: This technology, "Method, device and storage medium for updating parameters" (一种更新参数的方法、装置及存储介质), was designed and created by 王紫东, 陈梦云, 于璠, and 陈雷 on 2020-05-26. Its main content is as follows: the application discloses a method for updating parameters, applied to the field of artificial intelligence. The method updates the parameters of a neural network model multiple times through multiple iterations, the multiple iterations including a first iteration range and a second iteration range. The method comprises: within the first iteration range, updating an inverse matrix of an additional matrix of the neural network model once every number of iterations indicated by a first update step; and within the second iteration range, updating the inverse matrix of the additional matrix of the neural network model once every number of iterations indicated by a second update step, where, in iteration order, the first iteration of the second iteration range follows the last iteration of the first iteration range, and the second update step is larger than the first update step. In this way, the update step grows as the number of iterations increases, which reduces the number of inverse-matrix updates, shortens the training time of the neural network model, and increases its training speed.

1. A method of updating parameters, wherein the method of updating parameters is used for updating parameters of a neural network model a plurality of times through a plurality of iterations, the plurality of iterations comprising a first iteration range and a second iteration range, the method comprising:

updating an inverse matrix of an additional matrix of the neural network model once every number of iterations indicated by a first update step within the first iteration range, the first iteration range comprising at least two iterations;

updating the inverse matrix of the additional matrix of the neural network model once every number of iterations indicated by a second update step within the second iteration range, the second iteration range comprising at least two iterations, a first iteration of the second iteration range being after a last iteration of the first iteration range in iteration order, and the second update step being larger than the first update step.

2. The method of claim 1, wherein the plurality of iterations comprises a third iteration range, the third iteration range being any one iteration range of the plurality of iterations, and the method further comprises:

if an Nth iteration of the plurality of iterations is located in the third iteration range and is an iteration, indicated by a third update step, at which the inverse matrix needs to be updated, updating the inverse matrix of the additional matrix of the neural network model, and updating the parameters in the neural network model using the updated inverse matrix of the additional matrix and a first-order gradient of the Nth iteration, wherein the third update step is the update step of the third iteration range, N is an integer, and N is greater than 1.

3. The method of claim 2, wherein the updating the inverse matrix of the additional matrix of the neural network model and updating the parameters in the neural network model using the updated inverse matrix of the additional matrix and the first-order gradient of the Nth iteration comprises:

updating inverse matrices of additional matrices of P blocks, wherein the P blocks are some or all of Q blocks of the neural network model, P and Q are integers, Q ≥ P, Q ≥ 2, and P ≥ 1;

updating the parameters of the corresponding blocks among the P blocks using the updated inverse matrices of the additional matrices of the P blocks and the first-order gradients of the P blocks at the Nth iteration;

if Q > P, updating the parameters of the corresponding blocks among the (Q-P) blocks other than the P blocks using the inverse matrices of the additional matrices used by the (Q-P) blocks at the (N-1)th iteration and the first-order gradients of the (Q-P) blocks at the Nth iteration.

4. The method of claim 3, further comprising:

obtaining the P blocks from M blocks based on information of additional matrices of the M blocks in the neural network model, wherein the information of an additional matrix comprises a trace of the additional matrix or a two-norm of the additional matrix, the M blocks are the blocks, among the Q blocks, whose additional matrices need to be updated at the Nth iteration, M is an integer, and Q ≥ M ≥ P.

5. The method of claim 4, wherein the obtaining the P blocks from the M blocks based on the information of the additional matrices of the M blocks in the neural network model comprises:

obtaining the P blocks from the M blocks according to traces of the additional matrices of the M blocks at the Nth iteration and traces of the additional matrices of the M blocks at the (N-1)th iteration.

6. The method of claim 5, wherein the obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and the traces of the additional matrices of the M blocks at the (N-1)th iteration comprises:

obtaining, from the M blocks, P blocks whose first ratio is larger than a first threshold, wherein the first ratio is a ratio of a first difference to the trace of the additional matrix at the (N-1)th iteration, and the first difference is a difference between the trace of the additional matrix at the Nth iteration and the trace of the additional matrix at the (N-1)th iteration.

7. The method of claim 3, further comprising:

obtaining the P blocks from a plurality of blocks in the neural network model based on sampling probabilities of the plurality of blocks, wherein the sampling probability of a block indicates a probability that an inverse matrix of an additional matrix of the block is updated at the Nth iteration.

8. The method according to any one of claims 2-7, further comprising:

updating the inverse matrix if a second difference of the Nth iteration is equal to an update start value, wherein the second difference is a difference between N and a total length of a preceding iteration range, the preceding iteration range is located before the third iteration range in execution order, and the update start value indicates the iteration of the third iteration range at which the inverse matrix is updated for the first time.

9. The method according to any one of claims 2-7, further comprising:

updating the inverse matrix if a first remainder of the Nth iteration is 0, wherein the first remainder is a remainder of a third difference divided by the third update step, the third difference is a difference between (N - update start value) and a total length of a preceding iteration range, the preceding iteration range is located before the third iteration range in execution order, and the update start value indicates the iteration of the third iteration range at which the inverse matrix is updated for the first time.

10. A method of updating parameters, wherein the method is used for updating parameters of a neural network model a plurality of times through a plurality of iterations, and for an Nth iteration of the plurality of iterations, N being an integer greater than 1, the method comprises:

updating inverse matrices of additional matrices of P blocks, wherein the P blocks are some or all of Q blocks of the neural network model, P and Q are integers, Q ≥ P, Q ≥ 2, and P ≥ 1;

updating the parameters of the corresponding blocks among the P blocks using the updated inverse matrices of the additional matrices of the P blocks and the first-order gradients of the P blocks at the Nth iteration;

if Q > P, updating the parameters of the corresponding blocks among the (Q-P) blocks other than the P blocks using the inverse matrices of the additional matrices used by the (Q-P) blocks at the (N-1)th iteration and the first-order gradients of the (Q-P) blocks at the Nth iteration.

11. The method of claim 10, further comprising:

obtaining the P blocks from M blocks based on information of additional matrices of the M blocks in the neural network model, wherein the information of an additional matrix comprises a trace of the additional matrix or a two-norm of the additional matrix, the M blocks are the blocks, among the Q blocks, whose additional matrices need to be updated at the Nth iteration, M is an integer, and Q ≥ M ≥ P.

12. The method of claim 11, wherein the obtaining the P blocks from the M blocks based on the information of the additional matrices of the M blocks in the neural network model comprises:

obtaining the P blocks from the M blocks according to traces of the additional matrices of the M blocks at the Nth iteration and traces of the additional matrices of the M blocks at the (N-1)th iteration.

13. The method of claim 12, wherein the obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and the traces of the additional matrices of the M blocks at the (N-1)th iteration comprises:

obtaining, from the M blocks, P blocks whose first ratio is larger than a first threshold, wherein the first ratio is a ratio of a first difference to the trace of the additional matrix at the (N-1)th iteration, and the first difference is a difference between the trace of the additional matrix at the Nth iteration and the trace of the additional matrix at the (N-1)th iteration.

14. An apparatus for updating parameters, the apparatus for updating parameters being configured to update parameters of a neural network model a plurality of times through a plurality of iterations, the plurality of iterations including a first iteration range and a second iteration range, the apparatus comprising:

a first processing unit, configured to update an inverse matrix of an additional matrix of the neural network model once every number of iterations indicated by a first update step within the first iteration range, wherein the first iteration range comprises at least two iterations;

a second processing unit, configured to update the inverse matrix of the additional matrix of the neural network model once every number of iterations indicated by a second update step within the second iteration range, wherein the second iteration range comprises at least two iterations, a first iteration of the second iteration range follows a last iteration of the first iteration range in iteration order, and the second update step is larger than the first update step.

15. The apparatus of claim 14, wherein the plurality of iterations comprises a third iteration range, the third iteration range being any one iteration range of the plurality of iterations, and the apparatus further comprises:

a third processing unit, configured to: if an Nth iteration of the plurality of iterations is within the third iteration range and is an iteration, indicated by a third update step, at which the inverse matrix needs to be updated, update the inverse matrix of the additional matrix of the neural network model, and update the parameters in the neural network model using the updated inverse matrix of the additional matrix and a first-order gradient of the Nth iteration, wherein the third update step is the update step of the third iteration range, N is an integer, and N is greater than 1.

16. The apparatus of claim 15,

the third processing unit is configured to:

update inverse matrices of additional matrices of P blocks, wherein the P blocks are some or all of Q blocks of the neural network model, P and Q are integers, Q ≥ P, Q ≥ 2, and P ≥ 1;

update the parameters of the corresponding blocks among the P blocks using the updated inverse matrices of the additional matrices of the P blocks and the first-order gradients of the P blocks at the Nth iteration;

if Q > P, update the parameters of the corresponding blocks among the (Q-P) blocks other than the P blocks using the inverse matrices of the additional matrices used by the (Q-P) blocks at the (N-1)th iteration and the first-order gradients of the (Q-P) blocks at the Nth iteration.

17. The apparatus of claim 16,

the third processing unit is further configured to obtain the P blocks from M blocks based on information of additional matrices of the M blocks in the neural network model, wherein the information of an additional matrix comprises a trace of the additional matrix or a two-norm of the additional matrix, the M blocks are the blocks, among the Q blocks, whose additional matrices need to be updated at the Nth iteration, M is an integer, and Q ≥ M ≥ P.

18. The apparatus of claim 17,

the third processing unit is configured to obtain the P blocks from the M blocks according to traces of the additional matrices of the M blocks at the Nth iteration and traces of the additional matrices of the M blocks at the (N-1)th iteration.

19. The apparatus of claim 18,

the third processing unit is configured to obtain, from the M blocks, P blocks whose first ratio is greater than a first threshold, wherein the first ratio is a ratio of a first difference to the trace of the additional matrix at the (N-1)th iteration, and the first difference is a difference between the trace of the additional matrix at the Nth iteration and the trace of the additional matrix at the (N-1)th iteration.

20. The apparatus of claim 16,

the third processing unit is further configured to obtain the P blocks from a plurality of blocks in the neural network model based on sampling probabilities of the plurality of blocks, wherein the sampling probability of a block indicates a probability that an inverse matrix of an additional matrix of the block is updated at the Nth iteration.

21. An apparatus for updating parameters, wherein the apparatus is configured to update parameters of a neural network model a plurality of times through a plurality of iterations, and for an Nth iteration of the plurality of iterations, N being an integer greater than 1, the apparatus comprises:

a first processing unit, configured to update inverse matrices of additional matrices of P blocks, wherein the P blocks are some or all of Q blocks of the neural network model, P and Q are integers, Q ≥ P, Q ≥ 2, and P ≥ 1;

a second processing unit, configured to update the parameters of the corresponding blocks among the P blocks using the updated inverse matrices of the additional matrices of the P blocks and the first-order gradients of the P blocks at the Nth iteration, and, if Q > P, update the parameters of the corresponding blocks among the (Q-P) blocks other than the P blocks using the inverse matrices of the additional matrices used by the (Q-P) blocks at the (N-1)th iteration and the first-order gradients of the (Q-P) blocks at the Nth iteration.

22. The apparatus of claim 21, further comprising:

a third processing unit, configured to obtain the P blocks from M blocks based on information of additional matrices of the M blocks in the neural network model, wherein the information of an additional matrix comprises a trace of the additional matrix or a two-norm of the additional matrix, the M blocks are the blocks, among the Q blocks, whose additional matrices need to be updated at the Nth iteration, M is an integer, and Q ≥ M ≥ P.

23. The apparatus of claim 22,

the third processing unit is configured to obtain the P blocks from the M blocks according to traces of the additional matrices of the M blocks at the Nth iteration and traces of the additional matrices of the M blocks at the (N-1)th iteration.

24. The apparatus of claim 23,

the third processing unit is configured to obtain, from the M blocks, P blocks whose first ratio is greater than a first threshold, wherein the first ratio is a ratio of a first difference to the trace of the additional matrix at the (N-1)th iteration, and the first difference is a difference between the trace of the additional matrix at the Nth iteration and the trace of the additional matrix at the (N-1)th iteration.

25. A computing device, comprising a processor and a computer-readable storage medium storing a computer program;

wherein the processor is coupled to the computer-readable storage medium, and the computer program, when executed by the processor, implements the method of any one of claims 1-9 or the method of any one of claims 10-13.

26. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method of any one of claims 1-9 or the method of any one of claims 10-13.

27. A chip system, comprising a processor, wherein the processor is configured to perform the method of any one of claims 1 to 9, or the method of any one of claims 10 to 13.

Technical Field

The present application relates to the field of Artificial Intelligence (AI), and in particular, to a method, an apparatus, and a storage medium for updating parameters.

Background

Machine learning exhibits excellent performance in many application fields and is widely applied in fields such as image recognition, object detection, and natural language processing. In whichever application field, a neural network model is first trained with sample data of the corresponding field, and the trained neural network model is then applied in that field.

The neural network model goes through multiple iterations during training, and in each iteration the parameters of the neural network model are updated once using a first-order optimization algorithm combined with a second-order optimization algorithm. In the first-order optimization, a stochastic gradient descent (SGD) algorithm is usually used to take the first-order derivative of the loss function of the neural network model to obtain the first-order gradient of the parameters. Second-order optimization is then performed with a second-order optimization algorithm on the basis of the first-order gradient to obtain the second-order gradient of the parameters.
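
For orientation, the following worked form of such a second-order update is a generic sketch of the natural-gradient scheme (one of the schemes named later in this application), not a formula quoted from it:

θ_(t+1) = θ_t - η · F⁻¹ · ∇L(θ_t)

where θ_t are the parameters at iteration t, η is the learning rate, ∇L(θ_t) is the first-order gradient of the loss function, and F is the additional matrix (for example, the Fisher information matrix). Computing F⁻¹ is the expensive step discussed next.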

Second-order optimization involves computing an inverse matrix of an additional matrix of the neural network model, and the computational complexity of this inversion is very high, which slows down the training of the neural network model.

Disclosure of Invention

The embodiment of the application provides a method for updating parameters, which is used for reducing the time for training a neural network model. The embodiment of the application also provides a corresponding device and a storage medium.

A first aspect of the present application provides a method of updating parameters, for updating parameters of a neural network model a plurality of times through a plurality of iterations, the plurality of iterations including a first iteration range and a second iteration range, the method comprising: updating an inverse matrix of an additional matrix of the neural network model once every number of iterations indicated by a first update step within the first iteration range, wherein the first iteration range comprises at least two iterations; and updating the inverse matrix of the additional matrix of the neural network model once every number of iterations indicated by a second update step within the second iteration range, wherein the second iteration range comprises at least two iterations, the first iteration of the second iteration range follows the last iteration of the first iteration range in iteration order, and the second update step is greater than the first update step.

In the first aspect, a time-division updating concept is provided, in which the training process of the whole neural network model is divided into a plurality of iteration ranges, and within each iteration range the inverse matrix of the additional matrix of the neural network model is updated once every number of iterations indicated by the update step of that range. The neural network model includes a deep neural network (DNN) model or a convolutional neural network (CNN) model. During model training, the parameters of the neural network model may be the weights of the neurons in the neural network model. Training a neural network model with sample data usually requires many iterations to obtain the target neural network model, and each iteration may be referred to as a step. During training, the input data are the sample data, and the output data are the weights of the neurons in the neural network model. The sample data may be image data, voice data, or text data, and the type of the sample data depends on the field to which the neural network model is applied. For example, when the neural network model is used in the field of automatic driving, the sample data may be various image data in traffic scenes, such as images of buildings around the autonomous vehicle, images of pedestrians, images of surrounding vehicles, images of ground signs, and images of traffic lights. When the neural network model is used for intelligent security or safe cities, the sample data may be various image data of a city, such as images of individual city blocks. When the neural network model is used in other service scenarios, the sample data are image, audio, or text data of the corresponding service scenario. All the steps of the whole process of training the target neural network model, from the first iteration to convergence, can be divided into at least two iteration ranges (periods). For example, if ten thousand iterations are needed to train the neural network model, they can be divided into 10 iteration ranges, arranged in the order in which they are used during iteration, from period1 to period10. The iteration ranges may have the same length, for example 1000 steps each, or different lengths, for example some containing hundreds of steps and others containing thousands. If the convergence condition for training the neural network model is not a preset number of iterations, more iteration ranges can be set, so that some of the set iteration ranges remain unused when the neural network model converges; an iteration range may also be one epoch. The first iteration range and the second iteration range may be any two of all the iteration ranges, as long as the second iteration range follows the first iteration range in execution order. Each iteration range corresponds to an update step (update stride). The update step represents the update interval: the inverse matrix of the additional matrix of the neural network model is updated once per update step, which can also be described as being updated once every (update step - 1) iterations are skipped; the update step may also be referred to as an update interval. The update step may be an integer value greater than or equal to 1.
The update step may change such that, as the number of iterations increases, the update step of the corresponding iteration range grows larger and larger; alternatively, the update steps of some iteration ranges may be equal, with the update step of a given iteration range larger than that of the preceding iteration range. The update step may be set as the square of the sequence number of the iteration range, or according to a cosine curve, an exponential curve, a multiplicative increase, piecewise constants, and so on. The additional matrix is a matrix for preprocessing the first-order gradient; it may be a second-order information matrix in a second-order optimization algorithm, such as the Fisher information matrix (FIM) in the natural gradient method. The additional matrix may also be another matrix, such as the second moment of the gradient, which is the product of the first-order gradient and the transpose of the first-order gradient. According to the first aspect, with this time-division updating approach, the inverse matrix of the additional matrix of the neural network model is updated once every number of iterations indicated by the update step, and the inverse matrix does not need to be updated at every iteration, so the time overhead of updating the inverse matrix of the additional matrix can be reduced, the training time of the neural network model can be shortened, and the training speed of the neural network model can be improved.
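
As a minimal sketch of this time-division schedule: the range lengths and the squared-growth stride rule below are illustrative assumptions, not values prescribed by this application, which also mentions cosine, exponential, multiplicative, and piecewise-constant rules.

```python
def inverse_update_steps(range_lengths, update_strides):
    """Return the 1-based iteration indices at which the inverse matrix of
    the additional matrix is recomputed."""
    steps = []
    start = 1
    for length, stride in zip(range_lengths, update_strides):
        # within one iteration range, update at the first iteration of the
        # range and then once every `stride` iterations
        steps.extend(range(start, start + length, stride))
        start += length
    return steps

# Example: 10,000 iterations split into 10 ranges of 1,000 steps, the stride
# of each range being the square of its sequence number (1, 4, 9, ..., 100).
schedule = inverse_update_steps([1000] * 10, [(i + 1) ** 2 for i in range(10)])
print(len(schedule))  # far fewer than 10,000 inverse-matrix updates
```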

In a possible implementation manner of the first aspect, the plurality of iterations includes a third iteration range, the third iteration range being any one iteration range of the plurality of iterations, and the method further includes: if an Nth iteration of the plurality of iterations is in the third iteration range and is an iteration, indicated by a third update step, at which the inverse matrix needs to be updated, updating the inverse matrix of the additional matrix of the neural network model, and updating the parameters in the neural network model using the updated inverse matrix of the additional matrix and the first-order gradient of the Nth iteration, wherein the third update step is the update step of the third iteration range, N is an integer, and N is greater than 1.

In this possible implementation manner, the third iteration range may be the first iteration range, the second iteration range, or any other iteration range. The Nth iteration may be any iteration from the second iteration to the end of training of the neural network model. In fact, the inverse matrix may also be updated at the first iteration, that is, when N is 1, except that the update at the first iteration does not need to be indicated by the third update step; the first iteration may be designated to update the inverse matrix by presetting an update start position. According to this possible implementation, the inverse matrix is updated at the steps that require it, and the parameters are updated using the updated inverse matrix, so that the neural network model converges.

In a possible implementation manner of the first aspect, the step of updating the inverse matrix of the additional matrix of the neural network model and updating the parameters in the neural network model using the updated inverse matrix of the additional matrix and the first-order gradient of the Nth iteration includes: updating inverse matrices of additional matrices of P blocks, where the P blocks are some or all of Q blocks of the neural network model, P and Q are integers, Q ≥ P, Q ≥ 2, and P ≥ 1; updating the parameters of the corresponding blocks among the P blocks using the updated inverse matrices of the additional matrices of the P blocks and the first-order gradients of the P blocks at the Nth iteration; and, if Q > P, updating the parameters of the corresponding blocks among the (Q-P) blocks other than the P blocks using the inverse matrices of the additional matrices used by the (Q-P) blocks at the (N-1)th iteration and the first-order gradients of the (Q-P) blocks at the Nth iteration.

In this possible implementation, a concept of block-wise updating is proposed: the neurons in the neural network model may be divided into at least two blocks, and the inverse matrix of the additional matrix of each block is then updated block by block. A "block" may be the set of vector relationships of the neurons between two adjacent layers in the neural network model, and may also be referred to as a "layer". The division into blocks is not limited to layers; it may also be done at the level of neurons in the neural network model, in which case 1.5 layers, two layers, or more layers may be divided into one block. When the inverse matrices are updated, all blocks may be updated, or only a part of them. Typically, all blocks are updated at the beginning of model training, and as the number of iterations increases, the number of blocks whose inverse matrices need to be updated decreases. For example, if Q is 8 and P is 3, the inverse matrices of the additional matrices of three blocks are updated, and the updated inverse matrices are used to update the parameters of those three blocks, while the other five blocks do not update their inverse matrices and instead update their parameters using the inverse matrices used in the previous iteration. According to this possible implementation, by adopting block-wise updating, the inverse matrices of the additional matrices of all blocks or only some blocks can be updated as needed, so the time overhead of updating the inverse matrices can be reduced, the training time of the neural network model can be shortened, and the training speed of the neural network model can be improved.
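
The following sketch illustrates this block-wise rule, assuming each block caches the inverse matrix it used at iteration N-1. The dict layout, damping term, and learning rate are illustrative assumptions, not details taken from this application.

```python
import numpy as np

def blockwise_update(blocks, selected, lr=0.1, damping=1e-3):
    """`blocks` is a list of Q dicts with keys "fisher", "inv", "grad",
    "params"; `selected` holds the indices of the P blocks whose inverse
    matrices are recomputed at the Nth iteration."""
    for i, blk in enumerate(blocks):
        if i in selected:
            d = blk["fisher"].shape[0]
            # recompute the inverse of this block's (damped) additional matrix
            blk["inv"] = np.linalg.inv(blk["fisher"] + damping * np.eye(d))
        # every block refreshes its parameters with its Nth-iteration
        # first-order gradient; the (Q - P) unselected blocks reuse the
        # inverse matrix cached from iteration N-1
        blk["params"] -= lr * blk["inv"] @ blk["grad"]
```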

In a possible implementation manner of the first aspect, the method further includes: p blocks are obtained from the M blocks based on the information of the additional matrix of the M blocks in the neural network model, the information of the additional matrix comprises the trace of the additional matrix or the two norms of the additional matrix, the M blocks are blocks needing to update the additional matrix in the Q blocks of the Nth iteration, M is an integer, and Q is more than or equal to M and more than or equal to P.

In this possible implementation, the trace of the additional matrix is the sum of the values on the diagonal of the additional matrix, and the two-norm of the additional matrix is the square root of the maximum eigenvalue of the product of the transpose of the additional matrix and the additional matrix. The additional matrix is a square matrix, with equal numbers of rows and columns, and may also be a positive definite matrix. If the additional matrix is an 8-row by 8-column matrix, it contains 64 values, and the sum of the 8 values on its diagonal may be referred to as the trace of the additional matrix. The additional matrices of the M blocks among the Q blocks need to be updated, while the (Q-M) blocks other than the M blocks essentially no longer change at the Nth iteration; these blocks need neither their inverse matrices nor their additional matrices updated. Therefore, when selecting the P blocks whose inverse matrices need to be updated, the (Q-M) blocks whose additional matrices essentially no longer change can be excluded directly, and the selection is made directly from the M blocks whose additional matrices need to be updated, which further saves time in training the neural network model.
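
A sketch of the two block statistics named above, for a square additional matrix F held as a NumPy array:

```python
import numpy as np

def additional_matrix_trace(f):
    # trace: the sum of the values on the diagonal
    return np.trace(f)

def additional_matrix_two_norm(f):
    # two-norm: the square root of the largest eigenvalue of F^T F
    return np.sqrt(np.max(np.linalg.eigvalsh(f.T @ f)))

f = np.eye(8)                            # an 8x8 additional matrix (64 values)
print(additional_matrix_trace(f))        # 8.0: sum of the 8 diagonal values
print(additional_matrix_two_norm(f))     # 1.0
```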

In a possible implementation manner of the first aspect, the step of obtaining the P blocks from the M blocks based on the information of the additional matrices of the M blocks in the neural network model includes: obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and the traces of the additional matrices of the M blocks at the (N-1)th iteration.

In this possible implementation, the P blocks whose inverse matrices are to be updated are obtained from the traces of the additional matrices of the M blocks in two successive iterations, which can improve the accuracy of block selection.

In a possible implementation manner of the first aspect, the step of obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and the traces of the additional matrices of the M blocks at the (N-1)th iteration includes: obtaining, from the M blocks, P blocks whose first ratio is larger than a first threshold, where the first ratio is the ratio of a first difference to the trace of the additional matrix at the (N-1)th iteration, and the first difference is the difference between the trace of the additional matrix at the Nth iteration and the trace of the additional matrix at the (N-1)th iteration.

In this possible implementation manner, the relationship between the first ratio and the first threshold may be expressed as (tr(F_N) - tr(F_(N-1))) / tr(F_(N-1)) > T1, where F_N denotes the additional matrix of the Nth iteration, F_(N-1) denotes the additional matrix of the (N-1)th iteration, tr(F_N) denotes the trace of the matrix F_N, tr(F_(N-1)) denotes the trace of the matrix F_(N-1), tr(F_N) - tr(F_(N-1)) denotes the first difference, the quotient denotes the first ratio, and T1 denotes the first threshold. The value of T1 may be set to 0.01: if the first ratio of the additional matrix of a block is greater than 0.01, the inverse matrix of the block needs to be updated; if the first ratio of the additional matrix of a block is less than 0.01, the inverse matrix of the block does not need to be updated. According to this possible implementation, whether a block needs its inverse matrix updated can be determined from the trace of its additional matrix during the iteration process, which improves the accuracy of selecting the blocks whose inverse matrices need to be updated.
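
A sketch of this trace-ratio test: `curr_traces[i]` is tr(F_N) and `prev_traces[i]` is tr(F_(N-1)) for the i-th of the M candidate blocks; t1 = 0.01 is the example threshold given in the text.

```python
def select_blocks_by_trace(curr_traces, prev_traces, t1=0.01):
    selected = []
    for i, (tr_n, tr_prev) in enumerate(zip(curr_traces, prev_traces)):
        first_diff = tr_n - tr_prev
        # first ratio = first difference / trace at iteration N-1
        if first_diff / tr_prev > t1:
            selected.append(i)
    return selected

print(select_blocks_by_trace([10.5, 8.0], [10.0, 8.0]))  # [0]: only block 0
```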

In a possible implementation manner of the first aspect, the method further includes: p blocks are derived from the plurality of blocks based on sampling probabilities of the plurality of blocks in the neural network model, wherein the sampling probability of a block is used to indicate a probability that the block is updated with an inverse of the additional matrix at the nth iteration.

In this possible implementation, the blocks whose inverse matrices need to be updated are selected according to the sampling probabilities of the blocks, which can increase the speed of block selection.

In a possible implementation manner of the first aspect, the sampling probability of one of the blocks is related to the parameter amount in the block and the total parameter amount in the neural network model, or the sampling probabilities of the blocks are pre-configured.

In this possible implementation, each block has a different influence on the training process, so the sampling probability of each block also differs, and the more parameters a block has, the greater its influence on training. The sampling probability of each block may be determined as p_i = w_i / Σ_j w_j, where w_i denotes the parameter quantity of the i-th block and Σ_j w_j denotes the total parameter quantity of the neural network model. According to this possible implementation, determining the sampling probability from the parameter quantity of a block increases the probability of selecting the blocks that have a large influence on the neural network model.
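
A sketch of sampling-based selection with p_i = w_i / Σ_j w_j as described above; the per-block parameter counts and the use of NumPy's weighted choice are illustrative assumptions.

```python
import numpy as np

def sample_p_blocks(param_counts, p, seed=None):
    rng = np.random.default_rng(seed)
    probs = np.asarray(param_counts, dtype=float)
    probs /= probs.sum()  # p_i = w_i / sum_j w_j
    # blocks with more parameters are more likely to have the inverse
    # matrices of their additional matrices updated at the Nth iteration
    return rng.choice(len(probs), size=p, replace=False, p=probs)

print(sample_p_blocks([1000, 200, 50], p=2, seed=0))
```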

In a possible implementation manner of the first aspect, the method further includes: updating the inverse matrix if a second difference of the Nth iteration is equal to an update start value, wherein the second difference is the difference between N and the total length of the preceding iteration range, the preceding iteration range is located before the third iteration range in execution order, and the update start value indicates the iteration at which the inverse matrix is updated for the first time in the third iteration range.

In this possible implementation, an initial update iteration of the inverse matrix may be set for each iteration range, for example: the inverse matrix is updated at the first iteration of each iteration range and then once every number of iterations indicated by the update step of that range. If the third iteration range is the first of all the iteration ranges, the inverse matrix needs to be updated whenever N equals the update start value. If the third iteration range is the second iteration range, there is one preceding iteration range. For example: period1 spans step1 to step200 and period2 spans step201 to step500; if N is 201 and the update start value is 1, the second difference is 201 - 200 = 1, which equals the update start value, so it can be determined that the 201st iteration is the first iteration in period2 at which the inverse matrix needs to be updated. Of course, the update start value is not limited to 1; it may be 2 or another value, and it is usually less than or equal to the smallest of the update steps. According to this possible implementation, whether the Nth iteration is the initial update iteration of the third iteration range can be determined quickly through a specific mathematical relation, which increases the speed of training the neural network model.
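
A sketch of this initial-update test: the Nth iteration is the first inverse-matrix update of its iteration range when N minus the total length of the preceding iteration ranges equals the update start value.

```python
def is_initial_update(n, preceding_total_len, update_start=1):
    second_diff = n - preceding_total_len
    return second_diff == update_start

# Example from the text: period1 = step1..step200, period2 = step201..step500,
# update start value 1 -> the 201st iteration starts the updates of period2.
assert is_initial_update(201, 200)
```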

In a possible implementation manner of the first aspect, the method further includes: updating the inverse matrix under the condition that a first remainder of the Nth iteration is 0, wherein the first remainder is the remainder of a third difference divided by the third update step, the third difference is the difference between (N - update start value) and the total length of the preceding iteration range, the preceding iteration range is located before the third iteration range in execution order, and the update start value indicates the iteration at which the inverse matrix is updated for the first time in the third iteration range.

In this possible implementation manner, at the Nth iteration, whether the inverse matrix needs to be updated can be determined from N, the information of the iteration ranges, and the update step. If the third iteration range is the first of all the iteration ranges, whether the Nth iteration updates the inverse matrix can be determined by evaluating (N - update start value) % third update step, where "%" denotes the remainder operation. For example: period1 spans step1 to step200; if N is 5, the update start value is 1, and the third update step is 1, then (5 - 1) % 1 = 0, indicating that the 5th iteration requires updating the inverse matrix. If the update start value is 1, the third update step is 2, and N is 6, then (6 - 1) % 2 = 1, which means that the inverse matrix does not need to be updated at the 6th iteration. If there are other iteration ranges before the third iteration range, all iteration ranges executed before the third iteration range are referred to as preceding iteration ranges. For example: period1 spans step1 to step200 and period2 spans step201 to step500; if N is 205, the third iteration range is period2 and period1 is the preceding iteration range, whose total length is 200. For the case where the Nth iteration is not located in the first iteration range, whether the Nth iteration updates the inverse matrix can be determined by evaluating third difference % third update step, where the third difference is (N - update start value - total length of the preceding iteration ranges). If the update start value is 1, the update step of period2 is 2, and N is 205, then (205 - 1 - 200) % 2 = 0, which means that the first remainder equals 0 and the inverse matrix needs to be updated at the 205th iteration. According to this possible implementation, whether the inverse matrix of the additional matrix needs to be updated can be determined quickly through a specific mathematical relation, which increases the speed of training the neural network model.
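
A sketch of this remainder test for an Nth iteration inside the third iteration range, reproducing the worked examples from the text:

```python
def needs_inverse_update(n, stride, preceding_total_len=0, update_start=1):
    # update when (N - update start value - total length of preceding
    # iteration ranges) % third update step == 0
    third_diff = (n - update_start) - preceding_total_len
    return third_diff % stride == 0

assert needs_inverse_update(5, stride=1)                             # (5-1) % 1 == 0
assert not needs_inverse_update(6, stride=2)                         # (6-1) % 2 == 1
assert needs_inverse_update(205, stride=2, preceding_total_len=200)  # (205-1-200) % 2 == 0
```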

A second aspect of the present application provides a method of updating parameters, for updating parameters of a neural network model a plurality of times through a plurality of iterations, where for an Nth iteration of the plurality of iterations, N being an integer greater than 1, the method includes: updating inverse matrices of additional matrices of P blocks, where the P blocks are some or all of Q blocks of the neural network model, P and Q are integers, Q ≥ P, Q ≥ 2, and P ≥ 1; updating the parameters of the corresponding blocks among the P blocks using the updated inverse matrices of the additional matrices of the P blocks and the first-order gradients of the P blocks at the Nth iteration; and, if Q > P, updating the parameters of the corresponding blocks among the (Q-P) blocks other than the P blocks using the inverse matrices of the additional matrices used by the (Q-P) blocks at the (N-1)th iteration and the first-order gradients of the (Q-P) blocks at the Nth iteration.

In the second aspect, a concept of block-wise updating is proposed: the neurons in the neural network model may be divided into at least two blocks, and the inverse matrix of the additional matrix of each block is then updated block by block. A "block" may be the set of vector relationships of the neurons between two adjacent layers in the neural network model, and may also be referred to as a "layer". The division into blocks is not limited to layers; it may also be done at the level of neurons in the neural network model, in which case 1.5 layers, two layers, or more layers may be divided into one block. When the inverse matrices are updated, all blocks may be updated, or only a part of them. Typically, all blocks are updated at the beginning of model training, and as the number of iterations increases, the number of blocks whose inverse matrices need to be updated decreases. For example, if Q is 8 and P is 3, the inverse matrices of the additional matrices of three blocks are updated, and the updated inverse matrices are used to update the parameters of those three blocks, while the other five blocks do not update their inverse matrices and instead update their parameters using the inverse matrices used in the previous iteration. According to this possible implementation, by adopting block-wise updating, only the inverse matrices of the additional matrices of some of the blocks are updated, so the time overhead of updating the inverse matrices of the additional matrices can be reduced, the training time of the neural network model can be shortened, and the training speed of the neural network model can be improved.

In one possible implementation manner of the second aspect, the method further includes: p blocks are obtained from the M blocks based on the information of the additional matrix of the M blocks in the neural network model, the information of the additional matrix comprises the trace of the additional matrix or the two norms of the additional matrix, the M blocks are blocks needing to update the additional matrix in the Q blocks of the Nth iteration, M is an integer, and Q is more than or equal to M and more than or equal to P.

In this possible implementation, the trace of the additional matrix is the sum of the values on the diagonal of the additional matrix, and the two-norm of the additional matrix is the square root of the maximum eigenvalue of the product of the transpose of the additional matrix and the additional matrix. The additional matrix is a square matrix, with equal numbers of rows and columns, and may also be a positive definite matrix. If the additional matrix is an 8-row by 8-column matrix, it contains 64 values, and the sum of the 8 values on its diagonal may be referred to as the trace of the additional matrix. The additional matrices of the M blocks among the Q blocks need to be updated, while the (Q-M) blocks other than the M blocks essentially no longer change at the Nth iteration; these blocks need neither their inverse matrices nor their additional matrices updated. Therefore, when selecting the P blocks whose inverse matrices need to be updated, the (Q-M) blocks whose additional matrices essentially no longer change can be excluded directly, and the selection is made directly from the M blocks whose additional matrices need to be updated, which further saves time in training the neural network model.

In a possible implementation manner of the second aspect, the step of obtaining the P blocks from the M blocks based on the information of the additional matrices of the M blocks in the neural network model includes: obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and the traces of the additional matrices of the M blocks at the (N-1)th iteration.

In this possible implementation, the P blocks whose inverse matrices are to be updated are obtained from the traces of the additional matrices of the M blocks in two successive iterations, which can improve the accuracy of block selection.

In a possible implementation manner of the second aspect, the step of obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and the traces of the additional matrices of the M blocks at the (N-1)th iteration includes: obtaining, from the M blocks, P blocks whose first ratio is larger than a first threshold, where the first ratio is the ratio of a first difference to the trace of the additional matrix at the (N-1)th iteration, and the first difference is the difference between the trace of the additional matrix at the Nth iteration and the trace of the additional matrix at the (N-1)th iteration.

In this possible implementation manner, the relationship between the first ratio and the first threshold may be expressed as (tr(F_N) - tr(F_(N-1))) / tr(F_(N-1)) > T1, where F_N denotes the additional matrix of the Nth iteration, F_(N-1) denotes the additional matrix of the (N-1)th iteration, tr(F_N) denotes the trace of the matrix F_N, tr(F_(N-1)) denotes the trace of the matrix F_(N-1), tr(F_N) - tr(F_(N-1)) denotes the first difference, the quotient denotes the first ratio, and T1 denotes the first threshold. The value of T1 may be set to 0.01: if the first ratio of the additional matrix of a block is greater than 0.01, the inverse matrix of the block needs to be updated; if the first ratio of the additional matrix of a block is less than 0.01, the inverse matrix of the block does not need to be updated. According to this possible implementation, whether a block needs its inverse matrix updated can be determined from the trace of its additional matrix during the iteration process, which improves the accuracy of selecting the blocks whose inverse matrices need to be updated.

In one possible implementation manner of the second aspect, the method further includes: p blocks are derived from the plurality of blocks based on sampling probabilities of the plurality of blocks in the neural network model, wherein the sampling probability of a block is used to indicate a probability that the block is updated with an inverse of the additional matrix at the nth iteration.

In this possible implementation, the blocks whose inverse matrices need to be updated are selected according to the sampling probabilities of the blocks, which can increase the speed of block selection.

In a possible implementation manner of the second aspect, the sampling probability of one of the blocks is related to the parameter quantity in the block and the total parameter quantity in the neural network model, or the sampling probabilities of the blocks are pre-configured.

In this possible implementation, each block has a different influence on the training process, so the sampling probability of each block also differs, and the more parameters a block has, the greater its influence on training. The sampling probability of each block may be determined as p_i = w_i / Σ_j w_j, where w_i denotes the parameter quantity of the i-th block and Σ_j w_j denotes the total parameter quantity of the neural network model. According to this possible implementation, determining the sampling probability from the parameter quantity of a block increases the probability of selecting the blocks that have a large influence on the neural network model.

A third aspect of the present application provides an apparatus for updating parameters, where the apparatus has the function of implementing the method of the first aspect or any one of the possible implementation manners of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function; for example, the functions of the first processing unit, the second processing unit, and the third processing unit may be implemented by one processing unit, or by two or three processing units.

A fourth aspect of the present application provides an apparatus for updating parameters, where the apparatus has the function of implementing the method of the second aspect or any one of the possible implementation manners of the second aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function; for example, the functions of the first processing unit, the second processing unit, and the third processing unit may be implemented by one processing unit, or by two or three processing units.

A fifth aspect of the present application provides a computer device comprising at least one processor, a memory, an input/output (I/O) interface, and computer executable instructions stored in the memory and executable on the processor, wherein when the computer executable instructions are executed by the processor, the processor performs the method according to the first aspect or any one of the possible implementation manners of the first aspect.

A sixth aspect of the present application provides a computer device comprising at least one processor, a memory, an input/output (I/O) interface, and computer executable instructions stored in the memory and executable on the processor, wherein when the computer executable instructions are executed by the processor, the processor performs the method according to any one of the possible implementations of the second aspect or the second aspect.

A seventh aspect of the present application provides a computer-readable storage medium storing one or more computer-executable instructions that, when executed by a processor, perform a method according to the first aspect or any one of the possible implementations of the first aspect.

An eighth aspect of the present application provides a computer-readable storage medium storing one or more computer-executable instructions that, when executed by a processor, perform a method as described in any one of the possible implementations of the second aspect or the second aspect.

A ninth aspect of the present application provides a computer program product storing one or more computer-executable instructions that, when executed by a processor, perform the method of the first aspect or any one of the possible implementations of the first aspect.

A tenth aspect of the present application provides a computer program product storing one or more computer-executable instructions that, when executed by a processor, perform the method of the second aspect or any one of the possible implementations of the second aspect.

An eleventh aspect of the present application provides a chip system, where the chip system includes a processor configured to support an apparatus for updating parameters in implementing the functions recited in the first aspect or any one of the possible implementations of the first aspect. In one possible design, the chip system may further include a memory for storing program instructions and data necessary for the apparatus for updating parameters. The chip system may consist of a chip, or may include a chip and other discrete devices.

A twelfth aspect of the present application provides a chip system, where the chip system includes a processor configured to support an apparatus for updating parameters in implementing the functions recited in the second aspect or any one of the possible implementations of the second aspect. In one possible design, the chip system may further include a memory for storing program instructions and data necessary for the apparatus for updating parameters. The chip system may consist of a chip, or may include a chip and other discrete devices.

For the technical effects brought by the third aspect, the fifth aspect, the seventh aspect, the ninth aspect, and the eleventh aspect, or any one of their possible implementation manners, refer to the technical effects brought by the first aspect or its different possible implementation manners; details are not described here again.

For the technical effects brought by the fourth aspect, the sixth aspect, the eighth aspect, the tenth aspect, and the twelfth aspect, or any one of their possible implementation manners, refer to the technical effects brought by the second aspect or its different possible implementation manners; details are not described here again.

According to the embodiments of the application, a time-division updating approach is adopted: the inverse matrix of the additional matrix of the neural network model is updated once every number of iterations indicated by the update step, and the inverse matrix does not need to be updated at every iteration, so the time overhead of updating the inverse matrix of the additional matrix can be reduced, the training time of the neural network model can be shortened, and the training speed of the neural network model can be improved.

In addition, the embodiments of the application adopt block-wise updating, in which the inverse matrices of the additional matrices of all blocks or only some blocks are updated as needed, so the time overhead of updating the inverse matrices can be reduced, the training time of the neural network model can be shortened, and the training speed of the neural network model can be improved.

Drawings

FIG. 1 is a schematic diagram of an artificial intelligence body framework;

FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a convolutional neural network;

FIG. 4 is a schematic diagram of another structure of a convolutional neural network;

FIG. 5A is a schematic diagram of an example of a block provided in an embodiment of the present application;

FIG. 5B is a schematic diagram of another example of a block provided in an embodiment of the present application;

FIG. 6 is a schematic diagram of a system architecture for training a neural network model according to an embodiment of the present application;

FIG. 7A is a schematic diagram of an example of a method for updating parameters according to an embodiment of the present application;

FIG. 7B is a schematic diagram of another example of a method for updating parameters provided by an embodiment of the present application;

FIG. 8A is a schematic diagram of an embodiment of a method for updating parameters provided by an embodiment of the present application;

FIG. 8B is a schematic diagram illustrating an example of a variation curve of an update step according to an embodiment of the present application;

fig. 9A is a schematic diagram of another embodiment of a method for updating parameters provided in an embodiment of the present application;

FIG. 9B is a diagram illustrating an example of traces representing an additional matrix provided by an embodiment of the present application;

FIG. 9C is a diagram illustrating an example of block sampling provided by an embodiment of the present application;

FIG. 10 is a schematic diagram of another embodiment of a method for updating parameters, provided by an embodiment of the present application;

FIG. 11 is a schematic diagram of an embodiment of an apparatus for updating parameters according to an embodiment of the present disclosure;

FIG. 12 is a schematic structural diagram of a computer device provided in an embodiment of the present application;

fig. 13 is another schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

Embodiments of the present application will now be described with reference to the accompanying drawings. It is to be understood that the described embodiments are merely some, but not all, of the embodiments of the present application. As those skilled in the art will appreciate, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The embodiment of the application provides a method for updating parameters, which is used for shortening the time for updating the parameters in a neural network model. The embodiment of the application also provides a corresponding device and a storage medium. The following are detailed below.

Artificial Intelligence (AI) is a comprehensive technique in computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

FIG. 1 is a schematic diagram of an artificial intelligence main framework that describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence field requirements.

The artificial intelligence main framework is set forth below in terms of two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).

The "smart information chain" reflects a list of processes processed from the acquisition of data. For example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making and intelligent execution and output can be realized. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process.

The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the technology for providing and processing information) up to the industrial ecology of the system.

(1) Infrastructure:

the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through a base platform. The infrastructure communicates with the outside through sensors; computing power is provided by intelligent chips (central processing units (CPU), neural network processors (NPU), graphics processing units (GPU), application specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) and other hardware acceleration chips); the base platform includes distributed computing frameworks, networks and other related platform guarantees and support, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data is provided to intelligent chips in a distributed computing system provided by the base platform for computation.

(2) Data

Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.

(3) Data processing

Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.

The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.

Inference means a process of simulating an intelligent human inference mode in a computer or an intelligent system, using formalized information to think about and solve a problem by a machine according to an inference control strategy, and a typical function is searching and matching.

The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.

(4) General capabilities

After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.

(5) Intelligent product and industrial application

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they are the encapsulation of the overall artificial intelligence solution, the commercialization of intelligent information decision-making, and the realization of landed applications. The application fields mainly include: intelligent manufacturing, intelligent transportation, smart home, intelligent medical care, intelligent security, automatic driving, safe city, intelligent terminal, intelligent marketing, intelligent customer service, and the like.

In whatever application of artificial intelligence, neural network models are involved, such as: a Deep Neural Network (DNN) model or a Convolutional Neural Network (CNN) model. And training the initial neural network model by using sample data of different fields or service scenes to obtain a target neural network model suitable for the service scene. The sample data may be image data, voice data, text data, or the like, and the type of the sample data is determined according to a field to which the neural network model is applied. For example: when the neural network model is used in the field of automatic driving, the sample data may be various image data in a traffic scene, such as: images of buildings around the autonomous vehicle, images of pedestrians, images of surrounding vehicles, images of ground signs, images of traffic lights, and the like. When the neural network model is used for intelligent security or safe cities, the sample data can be various image data of the cities, such as: images of each block of a city. When the neural network model is used in other service scenes, the sample data is image, audio or text data of the corresponding service scene. The training process for the neural network model may be performed in the system architecture 200 shown in fig. 2.

Referring to fig. 2, a system architecture 200 is provided in accordance with an embodiment of the present application. The data acquisition device 260 is used to acquire sample data for neural network model training and store the sample data in the database 230, and the sample data can be understood by referring to the description of the sample data in the previous paragraph, and will not be described repeatedly here. The training device 220 generates a target neural network model/rule 201 based on sample data maintained in the database 230. How the training device 220 derives the target neural network model/rule 201 based on the sample data will be described in more detail below, the target neural network model/rule 201 being capable of, for example, directing an autonomous vehicle to travel automatically or automatically identifying unsafe factors, etc.

The operation of each layer in the deep neural network model can be described by the mathematical expression y = a(W·x + b). Here W is a weight vector, and each value in the vector represents the weight value of one neuron in that layer of the neural network. The vector W determines the spatial transformation from input space to output space, i.e., the weight W of each layer controls how the space is transformed. The purpose of training the deep neural network model is to finally obtain the weight matrices of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network model is essentially a way of learning the control of spatial transformation, and more specifically, of learning the weight matrix.
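To make the per-layer operation concrete, the following is a minimal NumPy sketch of y = a(W·x + b) for a single layer; the sigmoid activation and all concrete numbers are illustrative assumptions, not values from this application.

```python
import numpy as np

def layer_forward(W, x, b):
    """One layer of a deep neural network: y = a(W @ x + b).
    A sigmoid is assumed for the activation function a."""
    z = W @ x + b                    # linear transformation by the weight matrix W
    return 1.0 / (1.0 + np.exp(-z))  # element-wise sigmoid activation

# hypothetical layer with 3 inputs and 2 neurons
W = np.array([[0.1, -0.2, 0.3],
              [0.4,  0.5, -0.6]])
b = np.array([0.01, -0.02])
x = np.array([1.0, 2.0, 3.0])
print(layer_forward(W, x, b))
```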

"the difference between the predicted value and the target value of the neural network model", which is a loss function (loss function) or an objective function (objective function).

The target neural network model/rules obtained by the training device 220 may be applied in different systems or devices. In FIG. 2, the execution device 210 is configured with an I/O interface 212 to interact with data from an external device, and a "user" may input data to the I/O interface 212 via a client device 240.

The execution device 210 may call data, code, etc. from the data storage system 250 and may store data, instructions, etc. in the data storage system 250.

The calculation module 211 processes the input data using the target neural network model/rule 201, for example: in the field of autonomous driving, the target neural network model/rule 201 identifies obstacles and the like during autonomous driving from image data of a traffic scene.

Finally, the I/O interface 212 returns the results of the processing to the client device 240 for presentation to the user.

Further, the training device 220 may generate corresponding target neural network models/rules 201 for different targets based on sample data of different business scenarios to provide better results to the user.

It should be noted that fig. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in fig. 2 does not constitute any limitation, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.

The convolutional neural network model may also be referred to as a convolutional neural network for short, is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to learning at multiple levels at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto.

As shown in fig. 3, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130.

Convolutional layer/pooling layer 120:

Convolutional layer:

as shown in FIG. 3, convolutional layer/pooling layer 120 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.

Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is to act as a filter to extract specific information from the input image matrix, and the convolution operator may be essentially a weight matrix, which is usually predefined.

The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network model 100 to make correct prediction.

Pooling layer:

Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. That is, in the layers 121-126 illustrated by 120 in FIG. 3, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image.

The neural network layer 130:

after processing by convolutional layer/pooling layer 120, convolutional neural network model 100 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (class information or other relevant information as needed), the convolutional neural network model 100 requires the use of the neural network layer 130 to generate one or a set of outputs of the number of classes as needed. Accordingly, a plurality of hidden layers (such as 131, 132, to 13n shown in fig. 3) and an output layer 140 may be included in the neural network layer 130, and parameters included in the plurality of hidden layers may be obtained by pre-training according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.

The last layer of the entire convolutional neural network model 100, i.e., the layer after the multiple hidden layers in the neural network layer 130, is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network model 100 is completed (i.e., the propagation from 110 to 140 in FIG. 3 is forward propagation), backward propagation (i.e., the propagation from 140 to 110 in FIG. 3 is backward propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network model 100 and the error between the result output by the model through the output layer and the ideal result.

It should be noted that the convolutional neural network model 100 shown in fig. 3 is only an example of a convolutional neural network model, and in a specific application, the convolutional neural network model may also exist in the form of other network models, for example, as shown in fig. 4, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the whole neural network layer 130 for processing.

The convolutional neural network model-based algorithm shown in fig. 3 and 4 described above may be implemented in an NPU chip.

From the above description, it can be seen that both the deep neural network model and the convolutional neural network model include weights. In fact, the training process of the neural network model is a process of continuously updating the weights in the neural network model through multiple iterations, and the weights are parameters to be updated in the training stage of the neural network model. In the model training process, each iteration uses sample data to calculate a loss function of the iteration, then first-order optimization is performed on the loss function to obtain a first-order gradient, and further additional optimization is performed on the basis of the first-order gradient, for example: and further performing second-order optimization on the basis of the first-order gradient to obtain the update weight of the iteration, then performing model update by using the update weight of the iteration, and performing next iteration on the basis of the model after the weight is updated by the iteration until the whole training process of the neural network model is completed.

The first-order optimization adopts a first-order optimization algorithm, and the additional optimization adopts an additional optimization algorithm.

The first-order optimization algorithm generally adopts the following rule for updating the parameters:

$\theta_2 = \theta_1 - \eta \nabla_{\theta} \mathcal{L}$

The additional optimization algorithm first multiplies the first-order gradient by the inverse matrix $G^{-1}$ of an additional matrix $G$, yielding the following update rule:

$\theta_2 = \theta_1 - \eta G^{-1} \nabla_{\theta} \mathcal{L}$

where $\theta_1$ is the parameter before the update (i.e., the weight before the update), $\theta_2$ is the parameter after the update (i.e., the weight after the update), $\eta$ is the learning rate, which may be pre-configured, and $\nabla_{\theta} \mathcal{L}$ is the first-order gradient of the parameter obtained by first-order derivation of the loss function. The additional matrix is a matrix that preprocesses the first-order gradient; it may be a second-order information matrix in a second-order optimization algorithm, such as the Fisher information matrix (FIM) in the natural gradient method. The additional matrix may also be another additional matrix, such as the second moment of the gradient, which is the product of the first-order gradient and the transpose of the first-order gradient. In the examples listed here, $G^{-1}$ is the inverse matrix of the additional matrix, but the representation is not limited to this; other modified equations based on the idea of the present application are also applicable, for example: $\theta_2 = \theta_1 - G^{-1}(\eta \nabla_{\theta} \mathcal{L})$, where $G^{-1}(\eta \nabla_{\theta} \mathcal{L})$ is equivalent to $\eta G^{-1} \nabla_{\theta} \mathcal{L}$.
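As a minimal sketch of the two update rules above, assuming NumPy and a small dense additional matrix; the function names, the damping-free setup, and all concrete numbers are assumptions of this sketch, not part of this application.

```python
import numpy as np

def first_order_update(theta, grad, lr):
    """First-order rule: theta2 = theta1 - lr * grad."""
    return theta - lr * grad

def additional_update(theta, G, grad, lr):
    """Additional-optimization rule: theta2 = theta1 - lr * G^{-1} @ grad.
    np.linalg.solve(G, grad) applies G^{-1} to grad without forming G^{-1}."""
    return theta - lr * np.linalg.solve(G, grad)

theta = np.array([0.5, -0.3])
grad = np.array([0.2, 0.1])
G = np.array([[2.0, 0.1],
              [0.1, 1.0]])  # a positive definite additional matrix
print(first_order_update(theta, grad, lr=0.1))
print(additional_update(theta, G, grad, lr=0.1))
```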

Considering that in each iteration the computation of the inverse matrix of the additional matrix has high complexity and takes a long time, and in order to save the time for training the neural network model, the embodiment of the present application proposes the idea of "time-sharing update". Time-sharing update means: in the process of training the neural network model through multiple iterations, only sampled iterations need to update the inverse matrix of the additional matrix, and non-sampled iterations do not need to update the inverse matrix of the additional matrix.

In addition, as can be seen from the above descriptions of FIG. 3 and FIG. 4, a neural network model generally includes an input layer, hidden layers, and an output layer, and the number of hidden layers may be large. In order to save the time for updating parameters during training of the neural network model, the embodiment of the present application further provides the idea of "block update": a sampled block needs to update the inverse matrix of its additional matrix, and a non-sampled block does not need to update the inverse matrix of its additional matrix.

For ease of understanding, the concept of "block" in the block update will be described first with reference to FIG. 5A and FIG. 5B.

As shown in fig. 5A, for a more complex neural network model, the concept of "block" refers to the vector relationship between neurons between two layers, and as in fig. 5A, the set of vector relationships represented by arrows between hidden layer 1 and hidden layer 2 may be referred to as a "block", and in some descriptions, this set of vector relationships may also be referred to as a "layer". The set of vector relationships represented by arrows between the hidden layer 2 and the hidden layer 3 may be referred to as a "block", and in some descriptions, this set of vector relationships may also be referred to as a "layer". Of course, fig. 5A only illustrates hidden layer 1, hidden layer 2, and hidden layer 3 as examples, and actually, more hidden layers and input and output layers may be included, and a set of vector relationships between each two adjacent layers including neurons may be referred to as a "block" or a "layer" regardless of whether the hidden layers, the input layers, or the output layers.

The division into blocks is not limited to the layer-wise division above; blocks may also be divided by neurons in the neural network model. In such a division, taking FIG. 5A as an example, one and a half layers, two layers, or more layers may be divided into one block; how many layers of neurons are divided into one block is not limited in this application. As shown in FIG. 5B, a "block" in this case refers to a matrix block divided according to a combination of neurons in the neural network model; FIG. 5B includes 4 matrix blocks 601 of size 3 × 3 and 2 matrix blocks 602 of size 4 × 4.

Based on the above ideas of "time-sharing update" and "block update", the embodiment of the present application provides a system architecture for training a neural network model.

Fig. 6 is a schematic diagram of a system architecture for training a neural network model according to an embodiment of the present application.

As shown in fig. 6, a system architecture 700 for training a neural network model provided in an embodiment of the present application includes a hardware layer 710, an Operating System (OS) 720, and a training architecture layer 730, where the training architecture layer 730 is configured to update weights in the neural network model using training data. The training architecture layer 730 includes a sample data obtaining module 731, a loss function calculating module 732, a first-order gradient calculating module 733, a time-sharing/block-dividing updating decision module 734, a preprocessing calculating module 735, and a weight updating module 736, where the sample data obtaining module 731 to the weight updating module 736 may be functional modules implemented by software.

The sample data obtaining module 731 is configured to obtain sample data.

The loss function computation module 732 is used to compute a loss function using the sample data. The loss function is defined in the above description of the deep neural network model in fig. 2, and will not be described again here.

The first order gradient calculation module 733 is used for performing first-order derivation on the loss function to calculate the first-order gradient $\nabla_{\theta} \mathcal{L}$.

The time-sharing/block-updating decision module 734 has the functions of time-sharing updating decision, block-updating decision, and updating decision by time-sharing and then block-updating decision.

The time-sharing updating decision means that an iteration process for which the decision is to update needs to update the inverse matrix of the additional matrix, and an iteration process for which the decision is not to update does not need to update the inverse matrix of the additional matrix.

The block updating decision means that the block which is updated by the decision updates the inverse matrix of the additional matrix, and the block which is not updated by the decision does not need to update the inverse matrix of the additional matrix.

The time-sharing-then-block updating decision means that only the blocks within an iteration selected for update by the time-sharing decision have the opportunity to be selected for update by the block decision.

The "block" in the block can be understood by referring to the corresponding description of the above-mentioned fig. 5A and 5B, and the description thereof is not repeated.

The preprocessing calculation module 735 updates the inverse of the additional matrix of the corresponding block only during an iteration selected for update, or for a block selected for update, by the time-sharing/block-updating decision module 734, and then multiplies the updated inverse $G^{-1}$ of the block's additional matrix with the first-order gradient $\nabla_{\theta} \mathcal{L}$ to calculate the preprocessing result $G^{-1} \nabla_{\theta} \mathcal{L}$ of the block. In an iteration selected for update, a block that is not selected for update directly uses the inverse of its additional matrix from the previous iteration to calculate the preprocessing result. In an iteration not selected for update, the inverse of the additional matrix of no block is updated, and the inverse of each block's additional matrix from the previous iteration is used directly to calculate the preprocessing result.

Updating the inverse matrix $G^{-1}$ of $G$ may be based on Cholesky decomposition. The specific process of Cholesky decomposition is not described in detail in this application.
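A minimal sketch of computing $G^{-1}$ via Cholesky decomposition with SciPy follows; the damping term that keeps the matrix well conditioned is an assumption of this sketch, not something specified in this application.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def inverse_via_cholesky(G, damping=1e-4):
    """Invert a positive definite additional matrix G using its Cholesky factor.
    A small damping multiple of the identity is added for numerical stability."""
    n = G.shape[0]
    factor = cho_factor(G + damping * np.eye(n))
    return cho_solve(factor, np.eye(n))  # solves (G + damping*I) X = I

G = np.array([[4.0, 1.0],
              [1.0, 3.0]])
G_inv = inverse_via_cholesky(G)
print(np.allclose(G @ G_inv, np.eye(2), atol=1e-3))  # close to the identity
```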

The weight update module 736 is configured to update the weights using the update rule of the second-order optimization algorithm: $\theta_2 = \theta_1 - \eta G^{-1} \nabla_{\theta} \mathcal{L}$, so that the weights are updated from $\theta_1$ to $\theta_2$. In the examples listed here, $G^{-1}$ is the inverse matrix of the additional matrix, but the representation is not limited to this; other modified equations based on the idea of the present application are also applicable, for example: $\theta_2 = \theta_1 - G^{-1}(\eta \nabla_{\theta} \mathcal{L})$, where $G^{-1}(\eta \nabla_{\theta} \mathcal{L})$ is equivalent to $\eta G^{-1} \nabla_{\theta} \mathcal{L}$.

It should be noted that, in the embodiment of the present application, each block corresponds to one additional matrix and one weight, where the weight may be in the form of a matrix. The weight of a block whose inverse of the additional matrix is updated is also updated; in this case θ2 ≠ θ1. The weight of a block whose inverse of the additional matrix is not updated is also not updated; in this case θ2 = θ1.

After the weight updating is finished, the updating weight of the iteration is used for updating the model, the next iteration is carried out on the basis of the model after the weight is updated by the iteration, and the iteration process of each iteration can be understood by referring to the working processes of the functional modules.

The system architecture 700 may be deployed on computer devices such as servers, virtual machines, and terminal devices. For example, the terminal device may be a mobile phone (mobile phone), a tablet (pad), a computer with a wireless transceiving function, a Virtual Reality (VR) terminal, an Augmented Reality (AR) terminal, a wireless terminal in industrial control (industrial control), a wireless terminal in self driving (self driving), a wireless terminal in remote medical (remote medical), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in home (smart home), and the like.

Based on the introduction of the functional modules in the system architecture 700, the method for updating parameters provided in the embodiment of the present application includes: the first scheme is as follows: updating in a time-sharing manner; scheme II: updating in blocks; the third scheme is as follows: time-sharing update + block update. Three schemes are described below.

The first scheme is as follows: updating in a time-sharing manner.

According to the time-sharing updating scheme, from the whole process of training the neural network model, the parameter updating method is used for updating the parameters of the neural network model for multiple times through multiple iterations, the multiple iterations comprise a first iteration range and a second iteration range, and one embodiment of the parameter updating method comprises the following steps: updating an inverse matrix of an additional matrix of the neural network model once every iteration number indicated by the first updating step within a first iteration range, wherein the first iteration range comprises at least two iterations; updating an inverse matrix of an additional matrix of the neural network model once every iteration number indicated by a second update step in a second iteration range, the second iteration range comprising at least two iterations, the first iteration of the second iteration range being after the last iteration of the first iteration range in the iteration order, the second update step being greater than the first update step.

The process of training a neural network model using sample data typically requires many iterations to obtain the target neural network model, and each iteration may be referred to as a step. All steps of the whole training process, from the first iteration until the target neural network model is obtained, can be divided into at least two iteration ranges (periods). For example: if ten thousand iterations are needed to train the neural network model, the ten thousand iterations can be divided into 10 iteration ranges, arranged in the order used in the iteration process as period1 to period10, where each iteration range comprises 1000 iterations. Of course, this division is only an example, and the lengths of the iteration ranges may differ, for example: period1 is from step1 to step200, period2 is from step201 to step500, period3 is from step501 to step1000, period4 is from step1001 to step1700, period5 is from step1701 to step2600, period6 is from step2601 to step3600, period7 is from step3601 to step4800, period8 is from step4801 to step6000, period9 is from step6001 to step7500, and period10 is from step7501 to step10000. If the convergence condition of training the neural network model is not a preset number of iterations, more iteration ranges can be set, and the set iteration ranges need not be used up by the time the neural network model converges. An iteration range may also be an epoch.
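Locating the iteration range that contains a given step amounts to a simple lookup; the following sketch is illustrative, reusing the period1-period3 split quoted above, and the helper name is an assumption.

```python
def find_period(step, period_bounds):
    """Return the 1-based index of the iteration range containing `step`.
    period_bounds is a list of (first_step, last_step) pairs."""
    for idx, (first, last) in enumerate(period_bounds, start=1):
        if first <= step <= last:
            return idx
    raise ValueError("step lies outside all iteration ranges")

bounds = [(1, 200), (201, 500), (501, 1000)]  # period1..period3 from the text
assert find_period(503, bounds) == 3          # step 503 falls in period3
```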

The first iteration range and the second iteration range may be any two of all the iteration ranges as long as the second iteration range follows the first iteration range in execution order. Each iteration range corresponds to an update step (update stride), the update step represents the update interval, and represents that the inverse matrix of the additional matrix of the neural network model is updated once per update step, which can also be described as being updated once every other (update step-1). This update step size may also be referred to as an update interval (update interval). The update step size may be an integer value greater than or equal to 1. The variation trend of the update step may be that the update step of the corresponding iteration range is larger and larger as the number of iterations increases, or the update step of the iteration range after the execution sequence is only larger than the update step of the iteration range or some iteration ranges before the execution sequence, and the update steps of some iteration ranges may be equal. For example: the case that some update steps are equal, that is, the update step of the first iteration range is 1, the update step of the second iteration range is 2, the update step of the third iteration range is 2, and the update step of the fourth iteration range is 3, may also be applicable to the scheme of the time-sharing update of the present application.

The first iteration range, the second iteration range, the first update step size, and the second update step size can be understood with reference to fig. 7A and 7B.

Fig. 7A and 7B are two exemplary diagrams of a method of updating parameters.

In fig. 7A, the first iteration range and the second iteration range are two adjacent iteration ranges in the execution order, and in fig. 7B, the first iteration range and the second iteration range are two non-adjacent iteration ranges. Whether or not the first iteration range and the second iteration range are adjacent, as long as the second iteration range is located after the first iteration range in the execution order.

In fig. 7A, the first update step size is 2, i.e., the inverse of the additional matrix of the neural network model is updated every 2 iterations, and the second update step size is 3, i.e., the inverse of the additional matrix of the neural network model is updated every 3 iterations. In fig. 7B, the first update step size is 2, and the value of the second update step size may be an integer equal to or greater than 3. Actually, the values of the first update step and the second update step in fig. 7A and 7B are only examples, as long as the second update step is greater than the first update step, and the specific values are not limited in this embodiment of the application.

The above is described from the overall process of neural network model training, and for the nth iteration, it may be: and if the Nth iteration in the multiple iterations is in the third iteration range and is the iteration which is indicated by the third updating step and needs to update the inverse matrix, updating the inverse matrix of the additional matrix of the neural network model, using the updated inverse matrix of the additional matrix and the first-order gradient of the Nth iteration to update the parameters in the neural network model, wherein the third updating step is the updating step of the third iteration range, N is an integer, and N is greater than 1.

The third iteration range may be the first iteration range, the second iteration range, or any other one of the iteration ranges. The nth iteration may be any iteration from the second iteration of the training to the end of the training of the neural network model, and actually, the first iteration, that is, when N is 1, the inverse matrix may also be updated, except that the update of the first iteration does not need a third update step to indicate, and the first iteration may be indicated to update the inverse matrix in a manner of presetting an update start position. According to the possible implementation mode, aiming at the step needing to be updated, the inverse matrix is updated, and the parameters are updated by using the updated inverse matrix, so that the neural network model is converged.

Optionally, for any one iteration process of multiple iterations of the neural network model, the process of updating the parameters provided in the embodiment of the present application may be understood with reference to the following embodiments.

Fig. 8A is a schematic diagram of an embodiment of a method for updating parameters according to an embodiment of the present application.

As shown in fig. 8A, an embodiment of a method for updating a parameter provided in the embodiment of the present application includes:

801. Acquire information of the iteration ranges involved when the neural network model performs the Nth iteration.

The information of the involved iteration range includes information of a third iteration range in which the nth iteration is located. If the third iteration range is not the first iteration range in the execution order, the involved iteration range also includes information of the previous iteration range in which the execution order precedes the third iteration range.

The range of iterations involved refers to the range of iterations that are traversed from the first iteration to the nth iteration. In the process of training the neural network model, starting from the first iteration, when the iteration is performed to the Nth iteration, if the Nth iteration is within the second iteration range, the involved iteration range comprises the first iteration range and the second iteration range. If the nth iteration is within the first iteration range, the involved iteration range only includes the first iteration range. The information of the iteration range may include the numerical range of the involved iteration range, such as: period1 goes from step1 to step200 and period2 goes from step201 to step 500.

Each iteration range corresponds to an updating step length, and with the increase of the iteration times, the additional matrix of the neural network model and the corresponding inverse matrix tend to be more and more stable and do not change or change less and less with the iteration, so that the updating step length corresponding to the iteration range used later in the whole iteration process can be set to be larger.

The setting mode of the update step length can be square of the sequence number of the iteration range, cosine curve, exponential curve, multiple increase or piecewise constant, etc.

The implementation of determining the update step size by using the square of the sequence number of the iteration range is as follows:

$F(x) = x^2, \quad x = 1, 2, 3, \ldots$

where $F(x)$ is the update step in the x-th iteration range.

For large neural networks, for example ResNet50, with 256 periods, one period comprising, for example, 5004 steps (i.e., 1 period = 5004 steps), the variation curve of the update step can be understood with reference to the portion of the exponential curve illustrated in FIG. 8B.

The update step may also increase multiplicatively. Taking period1 to period10 as an example again, the update steps corresponding to the 10 iteration ranges from period1 to period10 become progressively larger, doubling each time: the update step of the first iteration range is 1 (that is, the update step of period1 is set to 1), the update step of period2 is 2, the update step of period3 is 4, and so on. The correspondence from period1 to period10 can be understood with reference to Table 1 below.

Table 1: correspondence table of iteration range and updating step length

Certainly, the values of the update step in Table 1 are only examples, and the value of the update step may be set as required, for example: set according to an increasing trend.
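The two schedules described above can be sketched as follows; the doubling variant assumes the values reconstructed in Table 1, and the function names are assumptions of this sketch.

```python
def update_step_square(x):
    """Square schedule: F(x) = x**2 for the x-th iteration range."""
    return x * x

def update_step_doubling(x):
    """Doubling schedule consistent with Table 1: 1, 2, 4, 8, ..."""
    return 2 ** (x - 1)

print([update_step_square(x) for x in range(1, 6)])    # [1, 4, 9, 16, 25]
print([update_step_doubling(x) for x in range(1, 6)])  # [1, 2, 4, 8, 16]
```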

802. Determine whether the Nth iteration is the Xth iteration in the third iteration range according to a first relation between N and the information of the involved iteration ranges; if so, execute step 803, and if not, execute step 804.

In this embodiment of the present application, X is used to represent an update start value in a third iteration range, where the update start value is used to indicate an iteration in the third iteration range, where the inverse matrix is updated for the first time. The xth iteration within the third range of iterations is the first iteration to update the inverse matrix within the third range of iterations.

If the Nth iteration is within the first iteration range and N = X, it means that the Nth iteration is the iteration in which the inverse matrix is updated for the first time within the first iteration range.

If the information of the involved iteration ranges includes the third iteration range and preceding iteration ranges whose execution order precedes the third iteration range, the first relation may be expressed as: second difference = N - (total length of the preceding iteration ranges).

If the second difference is X, it may be determined that the nth iteration is the iteration in the third iteration range in which the inverse matrix is first updated.

The value of X is usually equal to 1, but is not limited to be equal to 1, and X is usually smaller than or equal to the minimum update step size. For example: if the minimum update step is 3, X may be equal to 1, or may also be equal to 2, or of course, may also be equal to 3, and this is merely an example, and the value of X may be set according to a requirement.

For example: if period1 is from step1 to step200, period2 is from step201 to step500, N = 201, and X = 1, then the third iteration range can be determined to be period2 from N = 201, with the preceding iteration range being period1. From the first relation, the second difference 201 - 200 = 1 = X, so it can be determined that the Nth iteration is the iteration in period2 in which the inverse matrix is updated for the first time.

If period1 is from step1 to step200, period2 is from step201 to step500, period3 is from step501 to step1000, N = 503, and X = 1, then the third iteration range can be determined to be period3 from N = 503, with period1 and period2 the preceding iteration ranges. From the first relation, the second difference 503 - 500 = 3 ≠ 1, that is, the 503rd iteration is not the iteration that updates the inverse matrix for the first time in period3.

803. If the second difference indicates that the Nth iteration is the Xth iteration within the third iteration range, update the inverse matrix of the additional matrix of the neural network model, and update the parameters in the neural network model using the updated inverse matrix of the additional matrix and the first-order gradient of the Nth iteration.

804. If the second difference indicates that the Nth iteration is not the Xth iteration within the third iteration range, obtain the third update step of the third iteration range.

If N is 503, the iteration range used by N is period3, and it can be known from table 1 that the update step size of period3 is equal to 4.

805. Determine whether the inverse matrix of the additional matrix of the neural network model needs to be updated according to a second relation among N, the information of the involved iteration ranges, and the third update step; if so, execute step 806, and if not, execute step 807.

Optionally, if the third iteration range is the first of all the iteration ranges, the second relation may be expressed as: (N - X) % third update step = first remainder, where "%" denotes the remainder operation.

X is the same as in step 802 and indicates the update start value; its value and physical meaning can be understood with reference to the corresponding explanation in step 802 above.

For example: period1 is from step1 to step200. If X = 1, the update step of period1 is 1, and N = 5, then (5 - 1) % 1 = 0, which means the first remainder equals 0 and the 5th iteration needs to update the inverse matrix. If X = 1, the update step is 2, and N = 6, then (6 - 1) % 2 = 1, which means the first remainder is not 0 and the 6th iteration does not need to update the inverse matrix.

If there are preceding iteration ranges before the third iteration range, the second relation may be expressed as: (N - X - third difference) % third update step = first remainder, where third difference = total length of the preceding iteration ranges.

For example: period1 is from step1 to step200, and period2 is from step201 to step500. If N = 205, the third iteration range is period2 and period1 is the preceding iteration range, whose total length is 200. If X = 1 and the update step of period2 is 2, then (205 - 1 - 200) % 2 = 0, which means the first remainder equals 0 and the inverse matrix needs to be updated in the 205th iteration.

If period1 is from step1 to step200, period2 is from step201 to step500, and period3 is from step501 to step1000, then N = 506 means the third iteration range is period3, with period1 and period2 the preceding iteration ranges of total length 500. If X = 1 and the update step of period3 is 4, then (506 - 1 - 500) % 4 = 1, which means the first remainder is not 0 and the 506th iteration does not need to update the inverse matrix.
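Putting steps 802 and 805 together, the decision can be sketched as follows; the function and argument names are assumptions of this sketch, and the assertions reproduce the worked examples above.

```python
def needs_inverse_update(n, x, period_first_step, update_step):
    """Decide whether iteration n must update the inverse matrix.

    period_first_step: first step of the third iteration range, so the total
    length of the preceding iteration ranges (the third difference) is
    period_first_step - 1. x is the update start value of the range.
    """
    third_difference = period_first_step - 1
    second_difference = n - third_difference
    if second_difference == x:                  # steps 802/803: X-th iteration
        return True
    first_remainder = (n - x - third_difference) % update_step  # step 805
    return first_remainder == 0

# examples from the text: period2 = steps 201..500, X = 1, update step = 2
assert needs_inverse_update(201, 1, 201, 2)      # first update in period2
assert needs_inverse_update(205, 1, 201, 2)      # (205-1-200) % 2 == 0
assert not needs_inverse_update(506, 1, 501, 4)  # (506-1-500) % 4 == 1
```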

806. If the first remainder indicates that the inverse matrix of the additional matrix of the neural network model needs to be updated, update the inverse matrix of the additional matrix of the neural network model, and update the parameters in the neural network model using the updated inverse matrix of the additional matrix and the first-order gradient of the Nth iteration.

807. If the first remainder indicates that the inverse matrix of the additional matrix of the neural network model does not need to be updated, update the parameters in the neural network model using the inverse matrix of the additional matrix used in the (N-1)th iteration and the first-order gradient of the Nth iteration.

According to the description related to the first scheme, the inverse matrix of the additional matrix of the neural network model does not need to be updated every iteration in a time-sharing updating mode, so that the time overhead of updating the inverse matrix of the additional matrix can be reduced, the training time of the neural network model can be reduced, and the training speed of the neural network model can be improved.

Scheme II: updating in blocks.

Fig. 9A is a schematic diagram of another embodiment of a method for updating parameters provided in an embodiment of the present application.

As shown in fig. 9A, another embodiment of the method for updating parameters provided in the embodiment of the present application includes:

901. Update the inverse matrices of the additional matrices of P blocks, where the P blocks are some or all of the Q blocks of the neural network model.

At each update of the inverse matrices, all the blocks may be updated, or only some of the blocks. Typically, all blocks are updated at the beginning of model training, and as the number of iterations increases, the number of blocks whose inverse matrix needs to be updated decreases.

P and Q are integers, Q ≥ P, Q ≥ 2, and P ≥ 1.

The block concept can be understood by referring to the above-mentioned fig. 5A and 5B, and the description is not repeated here.

Each block has an additional matrix G, and the corresponding inverse matrix $G^{-1}$ can be calculated from G. Before the inverse matrix of the additional matrix has been updated in the Nth iteration, the additional matrix G and the inverse matrix $G^{-1}$ of each of the Q blocks (block 1, block 2, ..., block Q) in the deep neural network can be understood with reference to Table 2.

Table 2: additional matrix G and inverse matrix G of block used at (N-1) th time-1Corresponding relation table of

If Q = 8 and P = 3, and the 3 blocks determined at the Nth iteration are block 1, block 4, and block 7, the inverse matrices of these 3 blocks need to be updated: the updated inverse matrix of block 1 is $(G_1^{N})^{-1}$, the updated inverse matrix of block 4 is $(G_4^{N})^{-1}$, and the updated inverse matrix of block 7 is $(G_7^{N})^{-1}$. The inverse matrices of the other 5 blocks are unchanged and the same as those used in the (N-1)th iteration. The additional matrix G and the inverse matrix $G^{-1}$ of each block after the Nth update can be understood with reference to Table 3.

Table 3: additional matrix G and inverse matrix G of block updated by Nth iteration-1Corresponding relation table of

As can be seen from Table 3, at the Nth iteration the additional matrix G of each block may be updated, but only the inverse matrices of the determined P blocks are updated. In fact, the additional matrices need not all be updated either; only the additional matrix of a block whose inverse matrix is to be updated needs to be updated.

902. Update the parameters of the corresponding blocks among the P blocks using the updated inverse matrices of the additional matrices of the P blocks and the first-order gradients of the Nth iteration of the P blocks; if Q > P, update the parameters of the corresponding blocks among the (Q-P) blocks using the inverse matrices of the additional matrices used by the (Q-P) blocks in the (N-1)th iteration and the first-order gradients of the Nth iteration of the (Q-P) blocks.

When updating the parameters of a block, the update rule $\theta_2 = \theta_1 - \eta G^{-1} \nabla_{\theta} \mathcal{L}$ may be employed. To facilitate distinguishing the parameters of each step, the update rule may be rewritten as

$\theta_N = \theta_{(N-1)} - \eta (G^{N})^{-1} \nabla_{\theta} \mathcal{L}_N$

where $\theta_N$ represents the parameter updated by the Nth iteration, $\theta_{(N-1)}$ represents the parameter obtained in the (N-1)th iteration, $(G^{N})^{-1}$ represents the inverse matrix of the Nth iteration, and $\nabla_{\theta} \mathcal{L}_N$ represents the first-order gradient of the Nth iteration.

Again using the example of Q = 8 and P = 3 in step 901 above, the inverse matrices of block 1, block 4, and block 7 are updated, and then the updated parameters of block 1, block 4, and block 7 are:

the updated parameter of block 1: $\theta_{1N} = \theta_{1(N-1)} - \eta (G_1^{N})^{-1} \nabla_{\theta} \mathcal{L}_{1N}$

the updated parameter of block 4: $\theta_{4N} = \theta_{4(N-1)} - \eta (G_4^{N})^{-1} \nabla_{\theta} \mathcal{L}_{4N}$

the updated parameter of block 7: $\theta_{7N} = \theta_{7(N-1)} - \eta (G_7^{N})^{-1} \nabla_{\theta} \mathcal{L}_{7N}$

$\theta_{1N}$, $\theta_{4N}$ and $\theta_{7N}$ are the updated parameters of block 1, block 4 and block 7.

Besides block 1, block 4, and block 7, five blocks remain among the 8 blocks: block 2, block 3, block 5, block 6, and block 8. The inverse matrices of these five blocks are the same as those used at the (N-1)th iteration; the inverse matrices $(G_i^{N-1})^{-1}$ of the five blocks in Table 3 above are used to obtain $\theta_{2N}$, $\theta_{3N}$, $\theta_{5N}$, $\theta_{6N}$ and $\theta_{8N}$. The calculation process of these parameters can be expressed as:

the parameter obtained by the Nth iteration of block 2: $\theta_{2N} = \theta_{2(N-1)} - \eta (G_2^{N-1})^{-1} \nabla_{\theta} \mathcal{L}_{2N}$

the parameter obtained by the Nth iteration of block 3: $\theta_{3N} = \theta_{3(N-1)} - \eta (G_3^{N-1})^{-1} \nabla_{\theta} \mathcal{L}_{3N}$

the parameter obtained by the Nth iteration of block 5: $\theta_{5N} = \theta_{5(N-1)} - \eta (G_5^{N-1})^{-1} \nabla_{\theta} \mathcal{L}_{5N}$

the parameter obtained by the Nth iteration of block 6: $\theta_{6N} = \theta_{6(N-1)} - \eta (G_6^{N-1})^{-1} \nabla_{\theta} \mathcal{L}_{6N}$

the parameter obtained by the Nth iteration of block 8: $\theta_{8N} = \theta_{8(N-1)} - \eta (G_8^{N-1})^{-1} \nabla_{\theta} \mathcal{L}_{8N}$
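One iteration of this block-wise update can be sketched as follows, using dictionaries keyed by block index; all names are assumptions of this sketch, and np.linalg.inv stands in for whatever inversion routine (e.g. Cholesky-based) is actually used.

```python
import numpy as np

def block_update_step(params, grads, additional, inverses, updated_ids, lr):
    """One iteration of scheme two over Q blocks.

    Blocks listed in updated_ids (the P blocks) get a freshly recomputed
    inverse of their additional matrix; every other block reuses the inverse
    kept from the previous iteration, as in Table 3.
    """
    for i in updated_ids:
        inverses[i] = np.linalg.inv(additional[i])  # refresh (G_i^N)^{-1}
    # every block's parameters move by -lr * G^{-1} @ grad
    return {i: params[i] - lr * inverses[i] @ grads[i] for i in params}
```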

as can be seen from the description of this embodiment, the inverse matrix of the additional matrix of only part of the blocks is updated in a block updating manner, so that the time overhead of updating the inverse matrix of the additional matrix can be reduced, the time for training the neural network model can be reduced, and the training speed of the neural network model can be increased.

The P blocks in step 901 above can be obtained in the following two ways.

Implementation mode one: P blocks are obtained from M blocks based on information of the additional matrices of the M blocks in the neural network model; the information of an additional matrix includes the trace of the additional matrix or the two-norm of the additional matrix; the M blocks are the blocks among the Q blocks that need to update their additional matrices at the Nth iteration; M is an integer, and Q ≥ M ≥ P.

Wherein the trace of the additional matrix is the sum of the values on the diagonal of the additional matrix. The additional matrix in the embodiment of the present application is a matrix with equal rows and equal columns, and may also be referred to as a positive definite matrix. The concept of traces can be understood with reference to FIG. 9B. As shown in fig. 9B, an additional matrix of 8 rows by 8 columns is shown, which contains 64 values. The sum of the 8 values on the diagonal 910 of the matrix can be referred to as the trace of the 8 rows by 8 columns of the additional matrix, i.e., the trace of the 8 rows by 8 columns of the additional matrix is b11+ b22+ b33+ b44+ b55+ b66+ b77+ b 88.

The two-norm of the additional matrix is the square root of the largest eigenvalue of the product of the transpose of the additional matrix and the additional matrix.
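Both quantities are cheap to evaluate; the following small NumPy sketch is purely illustrative, with an assumed helper name.

```python
import numpy as np

def additional_matrix_info(G):
    """Trace = sum of the diagonal entries; two-norm = square root of the
    largest eigenvalue of G^T @ G."""
    trace = np.trace(G)
    two_norm = np.sqrt(np.max(np.linalg.eigvalsh(G.T @ G)))
    return trace, two_norm

G = np.diag([1.0, 2.0, 3.0])
print(additional_matrix_info(G))  # (6.0, 3.0)
```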

The step of obtaining the P blocks from the M blocks based on the information of the additional matrices of the M blocks in the neural network model includes: obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and the traces of the additional matrices of the M blocks at the (N-1)th iteration.

It should be noted that, in this embodiment of the application, the trace of the additional matrix of the M blocks at the (N-1)th iteration is not limited to a trace updated exactly at the (N-1)th iteration. If the additional matrix was not updated at the (N-1)th iteration, the trace of the additional matrix of the M blocks at the (N-1)th iteration may be the trace from the update closest to the Nth iteration; the trace obtained after that update is stored in a memory or a cache and can be fetched from there for use.

Only the M blocks among the Q blocks need to update their additional matrices; the (Q-M) blocks other than the M blocks are blocks that essentially no longer change at the Nth iteration, and these blocks need neither an inverse-matrix update nor an additional-matrix update. Therefore, when selecting the P blocks whose inverse matrices need to be updated, the (Q-M) blocks whose additional matrices essentially no longer change can be excluded directly, and the selection is made directly from the M blocks that need to update their additional matrices, which further saves the time for training the neural network model.

The step of obtaining the P blocks from the M blocks according to the traces of the additional matrices of the M blocks at the Nth iteration and at the (N-1)th iteration includes: obtaining, from the M blocks, the P blocks whose first ratio is greater than a first threshold, where the first ratio is the ratio of a first difference to the trace of the additional matrix at the (N-1)th iteration, and the first difference is the difference between the trace of the additional matrix at the Nth iteration and the trace of the additional matrix at the (N-1)th iteration.

This process may be determined according to the following relationship:

$\dfrac{tr(F_N) - tr(F_{(N-1)})}{tr(F_{(N-1)})} > T1$

where $F_N$ represents the additional matrix of the Nth iteration, $F_{(N-1)}$ represents the additional matrix of the (N-1)th iteration, $tr(F_N)$ represents the trace of the matrix $F_N$, $tr(F_{(N-1)})$ represents the trace of the matrix $F_{(N-1)}$, $tr(F_N) - tr(F_{(N-1)})$ represents the first difference, $\frac{tr(F_N) - tr(F_{(N-1)})}{tr(F_{(N-1)})}$ represents the first ratio, and $T1$ represents the first threshold.

If the first ratio of a block is greater than T1, it may be determined that the inverse matrix for that block needs to be updated. If the first ratio of the additional matrix of a block is smaller than T1, it indicates that the inverse matrix of the block does not need to be updated.

In this implementation, there may be another relation: $\frac{tr(F_N) - tr(F_{(N-1)})}{tr(F_{(N-1)})} < T2$, where $T2$ is the second threshold and $T2 < T1$. If this ratio of a block is less than $T2$, the additional matrix of the block does not need to be updated in the next iterations.

For example, the value of T1 may be set to 0.01 and the value of T2 to 0.001. If the first ratio of the additional matrix of a block is greater than 0.01, the inverse matrix of the block needs to be updated; if the first ratio is less than 0.01, the inverse matrix of the block does not need to be updated; and if the first ratio is less than 0.001, the additional matrix of the block does not need to be updated in subsequent iterations. Of course, the values of T1 and T2 can be set as required; 0.01 and 0.001 are only examples.
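A minimal sketch of this trace-based selection, assuming each block caches the trace recorded at its most recent update (the names `blocks`, `prev_trace` and `frozen` are illustrative, not part of the claimed method):

```python
import numpy as np

T1, T2 = 0.01, 0.001   # example thresholds from the text

def select_blocks(blocks):
    """Return the indices of blocks whose inverse matrix should be updated.

    Each block is a dict holding its current additional matrix 'F', the
    trace cached at its last update 'prev_trace', and a 'frozen' flag set
    once the block essentially stops changing.
    """
    selected = []
    for i, blk in enumerate(blocks):
        if blk["frozen"]:                 # one of the (Q-M) excluded blocks
            continue
        tr_now = np.trace(blk["F"])
        ratio = (tr_now - blk["prev_trace"]) / blk["prev_trace"]  # first ratio
        if ratio > T1:                    # inverse matrix must be refreshed
            selected.append(i)
            blk["prev_trace"] = tr_now    # cache the trace of this update
        elif ratio < T2:                  # block no longer changes: freeze it
            blk["frozen"] = True
    return selected
```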

This implementation may also be referred to as a computation process based on an online model. The online model is suitable for users who are unfamiliar with the structure of the neural network model. With implementation one, whether a block needs to be updated is determined from the traces during the iteration process, which improves the accuracy of block selection.

In contrast with the online model, the second implementation may be referred to as a computation process based on an offline model. When an offline model is used, the user can manually adjust the sampling probability of each block, which suits users who are familiar with the structure of the neural network model. With an offline model, the sampling probability of each block can be set using prior information.

Implementation two: the P blocks are derived from a plurality of blocks based on the sampling probabilities of the plurality of blocks in the neural network model, where the sampling probability of a block indicates the probability that the inverse of the block's additional matrix is updated at the Nth iteration.

The sampling probability of a block is related to the quantity of parameters in the block and the total quantity of parameters in the neural network model, or alternatively the sampling probabilities of the blocks are pre-configured.

Each block has a different influence on the training process, so the sampling probability of each block differs; a block with a larger number of parameters has a greater influence on training. The sampling probability of each block can be determined as $p_i = \frac{w_i}{\sum_j w_j}$, where $w_i$ denotes the quantity of parameters of the ith block and $\sum_j w_j$ denotes the total quantity of parameters of the neural network model. Determining the sampling probability from the quantity of parameters in a block raises the probability of selecting the blocks that have a large influence on the neural network model.

After the sampling probabilities are calculated, sampling can be performed according to them, and the index values of the blocks whose inverses of additional matrices are to be updated are then output. For example, if there are 10 blocks in the neural network model and the output index values are 1, 4 and 7, the inverses of the additional matrices of block 1, block 4 and block 7 need to be updated.
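A minimal sketch of this sampling step, assuming a block is described only by its parameter count and that the number of blocks drawn per iteration is fixed (all names are illustrative):

```python
import numpy as np

def sample_blocks(param_counts, num_to_update, rng=np.random.default_rng()):
    """Sample block indices to update, weighted by parameter count.

    param_counts: list of w_i, the quantity of parameters of each block.
    Returns the index values of the blocks whose inverses of the
    additional matrices should be updated at this iteration.
    """
    w = np.asarray(param_counts, dtype=float)
    probs = w / w.sum()                       # p_i = w_i / sum_j w_j
    return sorted(rng.choice(len(w), size=num_to_update,
                             replace=False, p=probs))

# Example: 10 blocks; larger blocks are more likely to be selected.
counts = [9408, 36864, 73728, 147456, 294912, 589824,
          1179648, 2359296, 4718592, 2048000]
print(sample_blocks(counts, num_to_update=3))  # e.g. [4, 7, 9]; varies per run
```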

This process can also be understood with reference to FIG. 9C. As shown in FIG. 9C, a picture is used to train a neural network model that contains a plurality of blocks, such as convolution (conv) layers conv0, conv1, conv2, conv3, …, and a fully connected (fc) layer. Block sampling is performed according to the sampling probabilities, and it is determined that the three blocks conv1, conv3 and fc need the inverses of their additional matrices updated in this iteration.

With the second scheme, block updating means that only some of the blocks may be updated, which reduces the dimensions of the inverse matrices of the additional matrices that must be computed, shortens the time for training the neural network model, and increases the training speed.

The third scheme is as follows: time-sharing update + block update.

Scheme three is equivalent to a combination of scheme one and scheme two. After it is determined that the inverse matrices need to be updated at the Nth iteration, block updating is performed at the Nth iteration; if the inverse matrices do not need to be updated at the Nth iteration, block updating is not performed.

This process can be understood with reference to fig. 10. Fig. 10 is a schematic diagram of another embodiment of a method for updating parameters provided in an embodiment of the present application.

As shown in fig. 10, another embodiment of the method for updating parameters provided in the embodiment of the present application includes:

1001. After the first-order gradient has been computed at the Nth iteration, make the time-sharing update decision.

The time-sharing updating decision process can be understood by referring to the corresponding content in the first scheme, and the details are not repeated here.

If the decision in step 1001 is to update, step 1002 is executed; if the decision in step 1001 is not to update, step 1007 is executed using the inverse of the additional matrix used at the (N-1)th iteration and the first-order gradient of the Nth iteration, i.e., the parameters in the neural network model are updated.

1002. Determine whether to use the online model for block sampling. If yes, go to step 1003; otherwise, go to step 1004.

1003. Block sampling is done using an online model.

This step 1003 can be understood by referring to the content of the first implementation in the second scheme.

1004. Block sampling is performed using an offline model.

This step 1004 can be understood by referring to the content of the second implementation in the second scheme.

1005. Determine, according to the block indexes, whether the current block is to be updated. If yes, execute step 1006; if not, execute step 1007 using the inverse of the additional matrix of the block used at the (N-1)th iteration and the first-order gradient of the block at the Nth iteration, i.e., update the parameters.

1006. The inverse of the additional matrix for the block is updated.

After step 1006, step 1007 is performed using the inverse of the updated additional matrix and the first-order gradient of the block at the Nth iteration, i.e., a parameter update is performed.

1007. Update the parameters.

If the inverse matrix of a block has been updated, the parameters in the neural network model are updated using the updated inverse matrix and the first-order gradient of the Nth iteration.

If the inverse matrix of a block has not been updated, the parameters in the neural network model are updated using the inverse of the additional matrix used at the (N-1)th iteration and the first-order gradient of the Nth iteration.

According to the third scheme, block updating is performed on the basis of time-sharing updating, so that the time for training the neural network model can be further saved, and the training speed of the neural network model is increased.
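A compact sketch of the flow of FIG. 10, combining the time-sharing decision with block sampling; `should_update_inverse`, `online_select`, `offline_sample` and the block fields are illustrative stand-ins for the procedures described above, not the claimed implementation:

```python
import numpy as np

def train_step(n, blocks, gradients, lr, use_online_model,
               should_update_inverse, online_select, offline_sample):
    """One iteration of scheme three: time-sharing update + block update.

    n:      number of the current (Nth) iteration.
    blocks: per-block state; each block keeps its additional matrix 'F',
            its cached inverse 'F_inv', and its parameters 'params'.
    """
    # Step 1001: time-sharing decision, taken after the first-order
    # gradients of the Nth iteration have been computed.
    if should_update_inverse(n):
        # Steps 1002-1004: pick the sampling strategy and draw blocks.
        selected = (online_select(blocks) if use_online_model
                    else offline_sample(blocks))
        # Steps 1005-1006: refresh the inverse only for the selected blocks.
        for i in selected:
            blocks[i]["F_inv"] = np.linalg.inv(blocks[i]["F"])
    # Step 1007: every block updates its parameters with whatever inverse
    # it currently holds (just refreshed, or cached from iteration N-1).
    for blk, g in zip(blocks, gradients):
        blk["params"] -= lr * blk["F_inv"] @ g
```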

To illustrate the effect of the scheme of the present application, the following tests were performed in the same hardware environment and software environment using three different algorithms, yielding the test data in Table 4.

The sample data set of the tests is the full ImageNet data set, the neural network model is the large neural network model ResNet50, the processor is a GPU V100, and the deep learning framework of the software environment is PyTorch.

The optimizer of the scheme of the present application is based on the natural gradient method: it first partitions the matrices of the neural network model into blocks according to the network structure, and then runs the test using the time-sharing updating and block updating schemes described in the embodiments above, yielding the data in the time-sharing and block updating column of Table 4 below.

In addition, a test using the stochastic gradient descent (SGD) algorithm with the Momentum algorithm yields the data in the SGD + Momentum column of Table 4.

Furthermore, a test using the original Kronecker-factored approximate curvature (KFAC) algorithm yields the data in the original KFAC column of Table 4.

Table 4: experimental data comparison table

As can be seen from Table 4, compared with the second-order optimizer original KFAC, the time-sharing and block updating scheme of the present application improves the overall training time by a factor of 20 and the computation time of all additional matrices/inverse matrices by a factor of 50, and a single iteration is also much faster than with original KFAC. Relative to the first-order optimizer SGD + Momentum, the number of iterations to convergence (75% top-1) is improved by more than a factor of two, so the convergence speed is far higher than that of the first-order optimizer; the overall training time is improved by about 30%, which is distinctly faster than the first-order optimizer.

The method for updating parameters in the embodiment of the present application is described above, and the corresponding apparatus is described below with reference to the accompanying drawings.

Fig. 11 is a schematic diagram of an embodiment of an apparatus for updating parameters according to an embodiment of the present disclosure.

As shown in FIG. 11, an embodiment of the present application provides an apparatus 110 for updating parameters. The apparatus is configured to update parameters of a neural network model a plurality of times through a plurality of iterations, the plurality of iterations including a first iteration range and a second iteration range. The apparatus 110 includes:

a first processing unit 1101 configured to update an inverse matrix of an additional matrix of the neural network model once every iteration number indicated by a first update step in a first iteration range, the first iteration range including at least two iterations.

A second processing unit 1102, configured to update the inverse matrix of the additional matrix of the neural network model once every iteration number indicated by a second update step size within a second iteration range, where the second iteration range includes at least two iterations, and a first iteration of the second iteration range is after a last iteration of the first iteration range in the iteration order, and the second update step size is larger than the first update step size.

According to the scheme provided by the embodiment of the application, the inverse matrix of the additional matrix of the neural network model is not required to be updated every iteration in a time-sharing updating mode, so that the time overhead of updating the inverse matrix of the additional matrix can be reduced, and the training speed of the neural network model can be further improved.
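As an illustration of this time-sharing idea only (the range lengths and step sizes below are invented for the example, not prescribed by the application), the iterations at which the inverse matrices are refreshed can be enumerated as follows:

```python
def inverse_update_iterations(ranges):
    """Enumerate the iterations that refresh the inverse matrices.

    ranges: list of (length, step) pairs, one per iteration range, in
            execution order; each range updates the inverse once every
            'step' iterations, starting at its first iteration.
    """
    updates, offset = [], 0
    for length, step in ranges:
        updates.extend(range(offset + 1, offset + length + 1, step))
        offset += length
    return updates

# First range: 10 iterations with step 1; second range: 20 iterations
# with the larger step 5, so updates become sparser as training proceeds.
print(inverse_update_iterations([(10, 1), (20, 5)]))
# -> updates at every early iteration, then only at 11, 16, 21, 26
```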

Optionally, the multiple iterations include a third iteration range, where the third iteration range is any one iteration range in the multiple iterations, and the apparatus 110 further includes:

the third processing unit 1103 is configured to, if an nth iteration of the multiple iterations is located in a third iteration range and is an iteration, indicated by a third update step length, of which the inverse matrix needs to be updated, update the inverse matrix of the additional matrix of the neural network model, and update parameters in the neural network model using the updated inverse matrix of the additional matrix and a first-order gradient of the nth iteration, where the third update step length is an update step length of the third iteration range, N is an integer, and N is greater than or equal to 1.

Optionally, the third processing unit 1103 is configured to: updating an inverse matrix of an additional matrix of P blocks, wherein the P blocks are partial blocks or all blocks in Q blocks of the neural network model, P and Q are integers, Q is more than or equal to P, Q is more than or equal to 2, and P is more than or equal to 1; updating the parameters of the corresponding blocks in the P blocks by adopting the inverse matrix of the additional matrix after the updating of the P blocks and the first-order gradient of the Nth iteration of the P blocks; if N > 1 and Q > P, then (Q-P) blocks other than P blocks update the parameters of the corresponding block in the (Q-P) block using the inverse of the addition matrix used by the (Q-P) block at the (N-1) th iteration and the first order gradient of the (Q-P) block Nth iteration.

Optionally, the third processing unit 1103 is further configured to obtain the P blocks from M blocks based on information of the additional matrices of the M blocks in the neural network model, where the information of an additional matrix includes the trace of the additional matrix or the two-norm of the additional matrix, the M blocks are the blocks among the Q blocks whose additional matrices need to be updated at the Nth iteration, M is an integer, Q is greater than or equal to M, and M is greater than or equal to P.

Optionally, the third processing unit 1103 is configured to obtain P blocks from the M blocks according to the trace of the additional matrix of the M blocks of the nth iteration and the trace of the additional matrix of the M blocks of the (N-1) th iteration.

Optionally, the third processing unit 1103 is configured to obtain P blocks from the M blocks, where a first ratio is greater than a first threshold, where the first ratio is a ratio of a first difference to a trace of the additional matrix of the (N-1) th iteration, and the first difference is a difference between a trace of the additional matrix of the nth iteration and a trace of the additional matrix of the (N-1) th iteration.

Optionally, the third processing unit 1103 is further configured to obtain the P blocks from a plurality of blocks based on the sampling probabilities of the plurality of blocks in the neural network model, where the sampling probability of a block indicates the probability that the inverse of the block's additional matrix is updated at the Nth iteration.

Optionally, the third processing unit 1103 is further configured to update the inverse matrix if a second difference of the Nth iteration is equal to an update start value, where the second difference is the difference between N and the total length of the preceding iteration ranges, the preceding iteration ranges being the iteration ranges located before the third iteration range in execution order, and the update start value indicates the iteration at which the inverse matrix is updated for the first time within the third iteration range.

Optionally, the third processing unit 1103 is configured to update the inverse matrix when a first remainder of the Nth iteration is 0, where the first remainder is the remainder of dividing a third difference by the third update step, the third difference is the difference between (N - update start value) and the total length of the preceding iteration ranges, the preceding iteration ranges being the iteration ranges located before the third iteration range in execution order, and the update start value indicates the iteration at which the inverse matrix is updated for the first time within the third iteration range.
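A minimal sketch of these two conditions taken together, assuming the current range's update start value and update step are known (all names are illustrative):

```python
def needs_inverse_update(n, prev_ranges_len, start, step):
    """Decide whether the inverse matrices are updated at iteration n.

    prev_ranges_len: total length of the iteration ranges preceding the
                     current (third) iteration range in execution order.
    start:           update start value, i.e. the offset within the
                     current range of its first inverse-matrix update.
    step:            update step of the current range.
    """
    second_diff = n - prev_ranges_len
    if second_diff == start:              # first update within this range
        return True
    third_diff = (n - start) - prev_ranges_len
    return third_diff > 0 and third_diff % step == 0
```

For example, with prev_ranges_len = 100, start = 1 and step = 5, the inverse matrices would be refreshed at iterations 101, 106, 111, and so on.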

The above describes the scheme of time-sharing update of the device for updating parameters and the scheme of time-sharing update plus block update, and the contents of this part can be understood by referring to the corresponding contents in the foregoing embodiments, and will not be described repeatedly here.

In addition, the apparatus 110 for updating parameters provided in this embodiment of the present application may also perform a process of block updating alone, in this case, the apparatus 110 for updating parameters is configured to update parameters of the neural network model multiple times through multiple iterations, and for an nth iteration in the multiple iterations, N is an integer greater than 1. The apparatus 110 comprises:

the first processing unit 1101 is configured to update an inverse matrix of an additional matrix of P blocks, where the P blocks are some or all of Q blocks of the neural network model, P and Q are integers, Q is greater than or equal to P, Q is greater than or equal to 2, and P is greater than or equal to 1.

The second processing unit 1102 is configured to update the parameters of the corresponding blocks among the P blocks using the updated inverses of the additional matrices of the P blocks and the first-order gradients of the P blocks at the Nth iteration; and, if Q > P, to update the parameters of the corresponding blocks among the (Q-P) blocks other than the P blocks using the inverses of the additional matrices used by the (Q-P) blocks at the (N-1)th iteration and the first-order gradients of the (Q-P) blocks at the Nth iteration.

The application embodiment adopts a block updating mode, and can update the inverse matrix of the additional matrix of all blocks or partial blocks according to requirements, so that the time overhead of updating the inverse matrix can be reduced, the training time of the neural network model can be reduced, and the training speed of the neural network model can be improved.

Optionally, the third processing unit 1103 is configured to obtain the P blocks from M blocks based on information of the additional matrices of the M blocks in the neural network model, where the information of an additional matrix includes the trace of the additional matrix or the two-norm of the additional matrix, the M blocks are the blocks whose additional matrices need to be updated at the Nth iteration, M is an integer, Q is greater than or equal to M, and M is greater than or equal to P.

Optionally, the third processing unit 1103 is configured to obtain P blocks from the M blocks according to the trace of the additional matrix of the M blocks of the nth iteration and the trace of the additional matrix of the M blocks of the (N-1) th iteration.

Optionally, the third processing unit 1103 is configured to obtain P blocks from the M blocks, where a first ratio is greater than a first threshold, the first ratio is a ratio of a first difference to a trace of the additional matrix of the (N-1) th iteration, and the first difference is a difference between a trace of the additional matrix of the nth iteration and a trace of the additional matrix of the (N-1) th iteration.

Optionally, the third processing unit 1103 is further configured to obtain the P blocks from a plurality of blocks based on the sampling probabilities of the plurality of blocks in the neural network model, where the sampling probability of a block indicates the probability that the inverse of the block's additional matrix is updated at the Nth iteration.

The above describes a scheme of block update of a device for updating parameters, and the contents of this portion can be understood by referring to the corresponding contents in the foregoing embodiments, and are not repeated here.

It should be noted that the functions of the first processing unit 1101, the second processing unit 1102 and the third processing unit 1103 may be implemented by a single processing unit, or by two or three processing units.

The apparatus 110 for updating parameters can be understood with reference to the embodiments of the method for updating parameters, and is not described in further detail here.

FIG. 12 is a schematic diagram illustrating a possible logical structure of a computer device 120 according to an embodiment of the present application. The computer device 120 includes: a processor 1201, a communication interface 1202, a memory 1203 and a bus 1204. The processor 1201 may include a CPU, or a CPU together with at least one of a GPU and an NPU, or other types of processors. The processor 1201, the communication interface 1202 and the memory 1203 are connected to one another by the bus 1204. In this embodiment of the present application, the processor 1201 is configured to control and manage the actions of the computer device 120; for example, the processor 1201 is configured to update the inverse of the additional matrix of the neural network model once per number of iterations indicated by the first update step within a first iteration range, and to update the inverse of the additional matrix of the neural network model once per number of iterations indicated by the second update step within a second iteration range. Alternatively, the processor 1201 is configured to perform steps 801 through 803 in FIG. 8A, steps 901 through 902 in FIG. 9A, and steps 1001 through 1007 in FIG. 10, and/or other processes of the techniques described herein. The communication interface 1202 is used to support communication of the computer device 120, and the memory 1203 is used to store the program code and data of the computer device 120.

The processor 1201 may be, for example, a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a digital signal processor and a microprocessor, or the like. The bus 1204 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 12, but this is not intended to represent only one bus or type of bus.

FIG. 13 is a schematic diagram of a possible logical structure of a computer device 130 according to an embodiment of the present disclosure. The computer device 130 includes: a hardware layer 1301 and a virtual machine (VM) layer 1302, where the VM layer may include one or more VMs. The hardware layer 1301 provides hardware resources for the VMs and supports their running; the functions of the VMs and the processes related to the present application can be understood with reference to the corresponding descriptions in FIG. 6 to FIG. 10. The hardware layer 1301 includes hardware resources such as a processor, a communication interface and a memory, and the processor may include a CPU, or a CPU together with at least one of a GPU and an NPU.

In another embodiment of the present application, a computer-readable storage medium is further provided, in which computer-executable instructions are stored, and when the at least one processor of the device executes the computer-executable instructions, the device performs the method for updating parameters described in the above-mentioned embodiments in fig. 6 to fig. 10.

In another embodiment of the present application, there is also provided a computer program product comprising computer executable instructions stored in a computer readable storage medium; the computer executable instructions may be read by at least one processor of the device from a computer readable storage medium, and execution of the computer executable instructions by the at least one processor causes the device to perform the method for updating parameters described in the embodiments of fig. 6-10 above.

In another embodiment of the present application, a chip system is further provided. The chip system includes a processor configured to support the apparatus for updating parameters in implementing the method for updating parameters described in the foregoing embodiments of FIG. 6 to FIG. 10. In one possible design, the chip system may further include a memory for storing the program instructions and data necessary for the apparatus for updating parameters. The chip system may consist of a chip, or may include a chip and other discrete devices.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, which essentially or partly contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only a specific implementation of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.
