Deep learning training job resource placement system and method based on reinforcement learning

Document No.: 135165  Publication date: 2021-10-22

Note: This technique, "Deep learning training job resource placement system and method based on reinforcement learning", was designed and created by 周悦媛, 杨康, 章家维, 邵恩 and 谭光明 on 2021-07-30. The invention relates to the technical field of computing resource scheduling and discloses a reinforcement-learning-based deep learning training job resource placement system and method. The method comprises the following steps: randomly initialize the parameters of a DRL neural network model; generate a state vector for a batch of jobs; feed the state vector into the DRL neural network model to infer placement position information for the batch, place the jobs accordingly, and record the maximum completion time of the batch as T_RL; randomly generate several placements, place the jobs according to each of them, and record the smallest of the resulting maximum completion times as T_Random; calculate a reward based on T_RL and T_Random; update the parameters of the DRL neural network model by backward gradient propagation. With this technical scheme, DLT jobs can be placed adaptively in scenarios where resources fail.

1. A reinforcement-learning-based deep learning training job resource placement method, characterized by comprising the following steps:

an initialization step: randomly initializing parameters of a DRL neural network model;

a state vector generation step: generating a state vector for a batch of jobs;

an inference step: feeding the state vector into the DRL neural network model to infer placement position information for the batch of jobs, placing the jobs according to the placement position information, and recording the maximum completion time of the batch as T_RL;

a random generation step: randomly generating several pieces of placement position information, placing the jobs according to each randomly generated placement to obtain several maximum completion times for the batch, and recording the smallest of these maximum completion times as T_Random;

a reward calculation step: calculating a reward based on the maximum completion times T_RL and T_Random;

a parameter updating step: updating the parameters of the DRL neural network model by backward gradient propagation.

2. The reinforcement-learning-based deep learning training job resource placement method according to claim 1, wherein the method further comprises an experience replay step: sampling the quadruple samples generated during training of the DRL neural network model for experience replay.

3. The reinforcement-learning-based deep learning training job resource placement method according to claim 1, wherein in the state vector generation step a state vector is generated based on the DLT job information and the cluster information and is recorded as s = (N, T, S), wherein N is the number of computing units required by the current job, T is the estimated running time of the current job under error-free conditions, and S is the usage state of each computing unit in the current cluster.

4. The reinforcement-learning-based deep learning training job resource placement method according to claim 3, wherein the inference step specifically comprises:

A1: inputting the state vector into the value network of the DRL neural network model to obtain a long-term metric V;

A2: inputting the state vector into the policy network of the DRL neural network model to obtain selection probabilities P_i for the N computing units, wherein i = 1, 2, ..., N;

A3: setting to zero the probabilities P_j corresponding to occupied computing units and faulty computing units, to obtain P'_i;

A4: selecting the k-th computing unit as one of the computing units on which the job is to be placed, wherein P_k = max(P'_i);

A5: if the number of computing units selected for the job equals the number of computing units required by the job, the inference of this job's placement position information is complete and the placement position information of the next job is inferred; otherwise, jumping back to step A1.

5. The reinforcement-learning-based deep learning training job resource placement method according to claim 2, wherein the experience replay step specifically comprises:

B1: creating a replay buffer stack;

B2: pushing the quadruple samples generated during training into the replay buffer stack;

B3: if the replay buffer stack is full, evicting the earliest-pushed quadruple sample;

B4: selecting X quadruple samples as a batch for the next training round, wherein X is the total number of quadruple samples currently in the replay buffer stack.

6. The reinforcement-learning-based deep learning training job resource placement method according to claim 1, wherein in the reward calculation step the reward is calculated by the following formula:

7. The reinforcement-learning-based deep learning training job resource placement method according to claim 3, wherein the method further comprises a training judgment step: judging whether training of the DRL neural network model is complete; if not, returning to the state vector generation step; otherwise, ending the training.

8. The reinforcement-learning-based deep learning training job resource placement method according to claim 7, wherein the method further comprises a using step: inferring the placement position of each job in the batch using the trained DRL neural network model.

9. The reinforcement-learning-based deep learning training job resource placement method according to claim 8, wherein the using step specifically comprises:

C1: acquiring the job information of the batch of jobs and the cluster information;

C2: generating a state vector based on the information collected in step C1;

C3: inputting the state vector from step C2 into the policy network of the DRL neural network model to obtain the placement position information output by the policy network;

C4: if the number of computing units inferred so far is less than the number of computing units required by the current job, repeating step C3; otherwise, going to step C5;

C5: placing the corresponding job in accordance with the placement position information inferred in step C3.

10. A reinforcement-learning-based deep learning training job resource placement system, comprising a DRL neural network model and a job scheduling module; characterized in that the job scheduling module trains the DRL neural network model using the steps of the method of any one of claims 1 to 8, obtains placement position information from the trained DRL neural network model, and places the corresponding jobs according to the placement position information.

Technical Field

The invention relates to the technical field of computing resource scheduling, and in particular to a reinforcement-learning-based deep learning training job resource placement system and method.

Background

Deep Learning Training (DLT) jobs are typically compute-intensive tasks that require powerful and expensive computing resources such as GPU devices. To process training data of ever-increasing scale, most mainstream IT companies and enterprises currently run DLT jobs on clusters of GPU servers and perform Distributed Deep Learning (DDL) training so as to use multiple GPUs in parallel, thereby reducing the load on any single GPU and speeding up model training.

Multi-machine, multi-card training is a defining characteristic of large-scale distributed DLT jobs, and the probability of job errors grows with system complexity. Moreover, DLT jobs generally train for a long time, which further increases the probability of errors. In addition, frequent submissions in multi-tenant, multi-job scenarios also typically raise the error probability. DLT job errors are one of the important causes of reduced system resource utilization: the time overhead they incur is not negligible, and the more errors occur, the greater the job restart and resource recovery overhead, and the lower the resource utilization.

To place DLT (deep learning training) jobs more reasonably under cluster resource error scenarios, the prior art provides cluster-capacity-aware methods and load-interference-aware methods. The capacity-aware approach, however, does not account for the error characteristics of GPU devices: for example, when GPU devices with low error probability stay in a relatively high-load state for a long time, the scheduling policy is likely to repeatedly place large multi-card DLT jobs on GPU devices with high error probability, causing repeated job restarts and reduced resource utilization. Although the load-interference-aware approach largely avoids the training performance degradation and resource utilization loss caused by interference among DLT jobs, it still ignores the error characteristics of individual GPU devices in the cluster. For example, if the GPUs with high error probability are scattered across the cluster, distributed multi-card DLT jobs with a high degree of mutual interference are likely, when placed separately, to land on those error-prone GPUs, leading to frequent job restarts and an even more serious loss of training performance and resource utilization.

Reinforcement Learning (RL), like conventional deep learning, is a self-learning method, but deep learning predicts unknown data by learning features in existing data and is a static learning algorithm, whereas RL builds a decision model and learns an optimal policy through continuous exploration of an unknown environment, making it a dynamic learning algorithm. Therefore, to some extent, RL is more consistent with human thinking and learning processes; in particular, RL combined with deep learning techniques, i.e. Deep Reinforcement Learning (DRL), is regarded as a paradigm that comes closest to genuine artificial intelligence.

Therefore, how to apply the DRL algorithm to resource scheduling, namely the decision problem of job placement, so that DLT jobs are placed reasonably under cluster resource error scenarios, resource utilization is maximized as far as possible and user quality of service is improved, is the problem to be solved.

Disclosure of Invention

One of the objectives of the present invention is to provide a method for placing deep learning training job resources based on reinforcement learning, which can adaptively place DLT jobs in a resource error scenario.

In order to solve the technical problem, the present application provides the following technical solutions:

the reinforcement-learning-based deep learning training job resource placement method comprises the following steps:

an initialization step: randomly initializing parameters of a DRL neural network model;

a state vector generation step: generating a state vector for a batch of jobs;

an inference step: feeding the state vector into the DRL neural network model to infer placement position information for the batch of jobs, placing the jobs according to the placement position information, and recording the maximum completion time of the batch as T_RL;

a random generation step: randomly generating several pieces of placement position information, placing the jobs according to each randomly generated placement to obtain several maximum completion times for the batch, and recording the smallest of these maximum completion times as T_Random;

a reward calculation step: calculating a reward based on the maximum completion times T_RL and T_Random;

a parameter updating step: updating the parameters of the DRL neural network model by backward gradient propagation.

The principle and beneficial effects of this basic scheme are as follows:

In this scheme, the DRL neural network model is trained to infer job placement positions. Compared with traditional heuristic algorithms, the DRL neural network model can automatically analyze and extract more effective and accurate features of cluster faults and DLT jobs, without manually selecting certain parameters as features, thereby reducing the impact of badly chosen hand-picked features.

The reward of the DRL neural network model is calculated with T_Random, the minimum completion time of the batch over multiple random schedules, as a reference; this use of randomness yields a larger reward range and can improve the learning capability of the DRL neural network model.

The training process of this scheme can use a simulator for pre-training or full training to save time and monetary cost; it can also use historical data of a real cluster system to obtain a scheduling policy better suited to that system, or be carried out online directly on a prototype system to obtain a more accurate scheduling policy.

In conclusion, for the decision problem of DLT job placement under cluster error conditions, the method trains a DRL neural network model to place DLT jobs adaptively in resource error scenarios, reducing the maximum completion time of batches of large-scale distributed DLT jobs and improving resource utilization.

Further, the method also comprises an experience replay step: sampling the quadruple samples generated during training of the DRL neural network model for experience replay.

Experience replay, on the one hand, removes the correlation among samples so as to meet a basic requirement of neural network training; on the other hand, dynamic experience replay maximizes the replay range and ensures the effectiveness of replay.

Further, in the state vector generation step, a state vector is generated based on the DLT job information and the cluster information and is recorded as s = (N, T, S), wherein N is the number of computing units required by the current job, T is the estimated running time of the current job under error-free conditions, and S is the usage state of each computing unit in the current cluster.

In this preferred scheme, DLT job information and cluster information are obtained, and the state vectors generated from them are fed as features into the DRL neural network model for training. The maximum completion times of the batch under the best random scheduling scheme and under the scheme inferred by the current DRL neural network model are combined as the evaluation criterion, guiding the neural network to make DLT job placement decisions adaptively, reducing the maximum completion time of batches of large-scale distributed DLT jobs and improving resource utilization.

Further, the inference step specifically includes:

A1: inputting the state vector into the value network of the DRL neural network model to obtain a long-term metric V;

A2: inputting the state vector into the policy network of the DRL neural network model to obtain selection probabilities P_i for the N computing units, wherein i = 1, 2, ..., N;

A3: setting to zero the probabilities P_j corresponding to occupied computing units and faulty computing units, to obtain P'_i;

A4: selecting the k-th computing unit as one of the computing units on which the job is to be placed, wherein P_k = max(P'_i);

A5: if the number of computing units selected for the job equals the number of computing units required by the job, the inference of this job's placement position information is complete and the placement position information of the next job is inferred; otherwise, jumping back to step A1.

Further, the experience replay step specifically includes:

B1: creating a replay buffer stack;

B2: pushing the quadruple samples generated during training into the replay buffer stack;

B3: if the replay buffer stack is full, evicting the earliest-pushed quadruple sample;

B4: selecting X quadruple samples as a batch for the next training round, wherein X is the total number of quadruple samples currently in the replay buffer stack.

In the DRL neural network model, a series of inference actions produces a number of quadruple samples. These samples are strongly correlated and do not satisfy the deep neural network's requirement that training samples be independent and identically distributed; moreover, the sample sequence generated within a single interval cannot represent global experience, and the forgetting behaviour of neural networks makes training prone to falling into local optima. This preferred scheme therefore adopts experience replay. Considering that resource error times are highly uncertain and the number of samples generated in different scheduling intervals may differ greatly, dynamic batch sampling is adopted: each time the DRL neural network model is trained, the random sampling batch size equals the number of samples obtained by inference in the current scheduling period, and this batch is fed into the DRL neural network model for training. This fully exploits the effect of experience replay and greatly reduces the correlation among samples.

Further, in the reward calculation step, the calculation formula of the reward is as follows:

The smaller T_RL is, the better; but T_RL is relative, i.e. it depends on the actual job running time, and when jobs run long T_RL cannot be made arbitrarily small. Therefore, in this embodiment, T_Random is used as a relative reference against which T_RL is compared.

Further, the method also comprises a training judgment step: judging whether training of the DRL neural network model is complete; if not, returning to the state vector generation step; otherwise, ending the training.

Further, the method also comprises a using step: inferring the placement position of each job in the batch using the trained DRL neural network model.

Further, the using step specifically comprises:

C1: acquiring the job information of the batch of jobs and the cluster information;

C2: generating a state vector based on the information collected in step C1;

C3: inputting the state vector from step C2 into the policy network of the DRL neural network model to obtain the placement position information output by the policy network;

C4: if the number of computing units inferred so far is less than the number of computing units required by the current job, repeating step C3; otherwise, going to step C5;

C5: placing the corresponding job in accordance with the placement position information inferred in step C3.

Another aim of the invention is to provide a reinforcement-learning-based deep learning training job resource placement system, which comprises a DRL neural network model and a job scheduling module; the job scheduling module trains the DRL neural network model using the steps of the above method, obtains placement position information from the trained DRL neural network model, and places the corresponding jobs according to the placement position information.

In this scheme, a DRL neural network model is used to schedule computing units. Cluster information and the information of currently submitted jobs are obtained periodically and, after processing, fed into the DRL neural network model for training. The maximum completion times of the batch under the best random scheduling scheme and under the scheme inferred by the current DRL neural network model are combined as the evaluation criterion, guiding the DRL neural network model to make DLT job placement decisions adaptively, reducing the maximum completion time of batches of large distributed DLT jobs and thereby improving resource utilization.

Drawings

FIG. 1 is a schematic view of a cluster job lifecycle;

FIG. 2 is a flow chart of DRL neural network model training;

FIG. 3 is a schematic diagram of DRL neural network model structure design;

FIG. 4 is a schematic illustration of experience replay;

FIG. 5 is a schematic diagram of the inference flow of the DRL neural network model.

Detailed Description

The following is further detailed by way of specific embodiments:

Embodiment

As shown in FIG. 1, the method of the present embodiment is applied to the job scheduling process of a cluster and aims to decide which nodes and which computing resources in the cluster a job should be placed on. Taking the GPU, a common computing unit, as an example, the present embodiment introduces a reinforcement-learning-based deep learning training job resource placement method comprising the following steps:

Training the neural network model parameters by reinforcement learning, as shown in FIG. 2, specifically includes:

an initialization step: parameters of the DRL neural network model are randomly initialized.

A state vector generation step: a state vector for the workload (batch of jobs) is generated based on the DLT job information and the cluster information. The state vector is denoted s = (N, T, S), whose components are as follows:

N: the number of GPUs required by the current job.

T: the estimated running time of the current job under normal, error-free conditions.

S: the usage status of each GPU in the current cluster. For example, if the current cluster has 4 GPU devices, the first two GPUs are available, and the last two are unavailable because of errors or occupation, then S = (0, 0, 1, 1).

In this embodiment, DLT job information and cluster information are periodically acquired.
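For illustration only, the following Python sketch shows how such a state vector might be assembled from job and cluster information; the data structures and names (`DLTJob`, `num_gpus_required`, `est_runtime`, `gpu_busy`, `gpu_faulty`) are hypothetical and do not appear in the original description.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DLTJob:
    num_gpus_required: int   # N: GPUs needed by the job
    est_runtime: float       # T: estimated error-free running time

def build_state_vector(job: DLTJob,
                       gpu_busy: List[bool],
                       gpu_faulty: List[bool]) -> List[float]:
    """Assemble the state s = (N, T, S) described in the embodiment.

    S marks each GPU as 0 (available) or 1 (occupied or faulty), e.g. two
    free GPUs followed by two unavailable ones gives S = (0, 0, 1, 1).
    """
    usage = [1.0 if (busy or faulty) else 0.0
             for busy, faulty in zip(gpu_busy, gpu_faulty)]
    return [float(job.num_gpus_required), job.est_runtime] + usage

# Example: a 2-GPU job on a 4-GPU cluster whose last two GPUs are unavailable.
state = build_state_vector(DLTJob(2, 3600.0),
                           gpu_busy=[False, False, True, False],
                           gpu_faulty=[False, False, False, True])
print(state)  # [2.0, 3600.0, 0.0, 0.0, 1.0, 1.0]
```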

The inference step: the state vector s is sent to the DRL neural network model to infer the placement position information of the workload, the jobs are placed according to this placement position information, and the maximum completion time of the workload is obtained and recorded as T_RL.

The DRL neural network model used in the inference step is shown in FIG. 3; the inference specifically includes the following steps (an illustrative code sketch follows step A5):

a1: state vectorInputting a Value Network (Value Network) of a DRL neural Network model, and obtaining a long-term measurement index V through a full connection layer with 5 layers of neuron numbers of 256, 196, 128 and 1 respectively.

A2: state vectorInputting a Policy Network (Policy Network) of the DRL neural Network model, obtaining selection probabilities P of N GPUs through a full connection layer with 5 layers of neurons respectively with 256, 196, 128 and N and a softmax layeri,i=1,2,...,N。

A3: the probability P corresponding to the occupied GPU and the failed GPUjSetting to zero to obtain P'i

A4: selecting the kth GPU as one of the GPUs to be placed in the operation, wherein Pk=max(P′i)。

A5: and if the number of GPUs to be placed by the operation is equal to the number of GPUs required by the operation, completing the placement position information reasoning of the operation, and in turn reasoning the position information of the next operation, otherwise, jumping to the step A1.

A random generation step: a series of placement position information is randomly generated, the jobs are placed according to each of them to obtain a series of maximum completion times of the workload, and the smallest of these maximum completion times is recorded as T_Random.
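For illustration only, a minimal sketch of the random baseline is given below; `simulate_makespan` is a hypothetical stand-in for whatever simulator or real run measures the batch's maximum completion time, the `num_gpus_required` field reuses the hypothetical job structure from the earlier sketch, and the sketch assumes the free GPU pool can host the whole batch at once.

```python
import random
from typing import Callable, List, Sequence

def random_baseline_makespan(jobs: Sequence, free_gpus: List[int],
                             simulate_makespan: Callable,
                             trials: int = 10) -> float:
    """Generate several random placements and keep the smallest makespan (T_Random)."""
    best = float("inf")
    for _ in range(trials):
        pool = free_gpus.copy()
        random.shuffle(pool)
        placement, cursor = [], 0
        for job in jobs:
            # Assumption: the shuffled pool is large enough for every job in the batch.
            placement.append(pool[cursor:cursor + job.num_gpus_required])
            cursor += job.num_gpus_required
        best = min(best, simulate_makespan(jobs, placement))
    return best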

A reward calculation step: the reward is calculated based on the maximum completion times T_RL and T_Random. The calculation formula is as follows:
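The formula image from the original publication is not reproduced here. One plausible form, consistent with the surrounding discussion that T_Random serves as a relative baseline and that a smaller T_RL should earn a larger reward, would be the following; this is an assumption, not the formula claimed in the patent:

$$\mathrm{reward} = \frac{T_{\mathrm{Random}} - T_{\mathrm{RL}}}{T_{\mathrm{Random}}}$$

Under this form the reward is positive when the placement inferred by the DRL model completes the batch faster than the best random placement, and negative otherwise.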

An experience replay step: the quadruple samples (s, a, r, s') generated during DRL training are sampled for experience replay. In a quadruple sample (s, a, r, s'), s is the environment state, a is the action selected based on the current policy, s' is the next environment state reached after action a is executed in state s, and r is the reward fed back by the environment.

The experience replay step is shown in FIG. 4 and specifically includes the following steps (an illustrative code sketch follows step B4):

b1: a playback Buffer stack is created.

B2: the quadruplet samples (s, a, r, s') generated by the training process are pushed into the return visit buffer pool stack.

B3: if the buffer pool stack is full, the earliest pushed data is overflowed.

B4: selecting x quadruplet samples as a batch and waiting for next training. Wherein the number of X is the total number of quadruple samples in the current return visit buffer pool stack.

A parameter updating step: the parameters of the DRL neural network model are updated by backward gradient propagation. In this embodiment, the backward gradient update is driven by the reward; this is prior art and is not described further here.
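The description does not spell out the update rule beyond a reward-driven backward gradient update, so the following sketch shows one conventional actor-critic style update (policy gradient with the value network as a baseline) as an assumption about what such an update could look like; the policy and value networks are those of the earlier sketch, the batch is a list of the `Transition` samples above, and regressing V directly toward the reward (without bootstrapping) is a simplification.

```python
import torch
import torch.nn.functional as F

def update_parameters(policy_net, value_net, policy_opt, value_opt, batch):
    """One hedged actor-critic style update over a batch of (s, a, r, s') samples."""
    # Assumes every state in the batch has the same length.
    states = torch.stack([torch.as_tensor(t.state, dtype=torch.float32) for t in batch])
    actions = torch.tensor([t.action for t in batch], dtype=torch.long)
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)

    values = value_net(states).squeeze(-1)      # V(s): long-term metric
    advantage = rewards - values.detach()       # how much better than expected

    probs = policy_net(states)                  # P_i for each GPU
    log_p = torch.log(probs.gather(1, actions.unsqueeze(1)).squeeze(1) + 1e-8)

    policy_loss = -(log_p * advantage).mean()   # policy gradient step
    value_loss = F.mse_loss(values, rewards)    # regress V toward the observed reward

    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()
```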

A training judgment step: if training is not complete, execution continues by returning to step S12 (the state vector generation step); otherwise training ends. In this embodiment, a planned number of training iterations is preset, and training is considered complete when the actual number of iterations reaches the planned number.

The using step: the placement position of each job in the workload is obtained by inference with the trained DRL neural network model, as shown in FIG. 5, and specifically includes the following steps (an illustrative code sketch follows step C5):

c1: and acquiring workloads operation information and cluster GPU information.

C2: using the information collected in step C1, a state vector is generated

C3: and inputting the state vector in the step C2 into the policy network to obtain the placement policy, i.e. the placement position information, output by the policy network.

C4: if the current inferred number of GPUs is less than the number of GPUs required for the current job, step C3 is repeated.

C5: the corresponding job is set in accordance with the set position information inferred in step C3.

In this embodiment, the GPU is the minimum scheduling unit, that is, one GPU cannot be allocated to multiple jobs. In other embodiments, the computing unit may also be a TPU (tensor processing unit), an MLU (machine learning processor), or the like.

Based on the above reinforcement-learning-based deep learning training job resource placement method, this embodiment also provides a reinforcement-learning-based deep learning training job resource placement system comprising a DRL neural network model and a job scheduling module. The job scheduling module trains the DRL neural network model using the steps of the method, obtains placement position information from the trained DRL neural network model, and places the corresponding jobs according to the placement position information.

The above are merely embodiments of the present invention; common general knowledge such as well-known specific structures and characteristics of the scheme is not described here in detail. A person skilled in the art, who knows the common technical knowledge in this field before the application date or the priority date, is aware of the prior art in the field, and is able to apply conventional experimental means, could perfect and implement this scheme in light of the teaching provided in this application, and some typical known structures or known methods should not become obstacles to the implementation of the invention by such a person. It should be noted that a person skilled in the art could make several variations and modifications without departing from the structure of the present invention; these shall also be regarded as falling within the protection scope of the present invention and will not affect the effect of implementing the invention or the practicability of the patent. The scope of protection claimed by this application shall be determined by the content of the claims, and the detailed description of the embodiments in the specification may be used to interpret the content of the claims.
