Intelligent operation batching method and device based on deep reinforcement learning and electronic equipment

Document No.: 191181  Publication date: 2021-11-02

Reading note: this technology, "Intelligent job batching method and device based on deep reinforcement learning, and electronic device", was designed and created by 刘亮, 郑霄龙, 马华东, 江呈羚 and 罗梓珲 on 2021-08-23. Its main content is as follows:

The invention discloses an intelligent job batching method and device based on deep reinforcement learning, and an electronic device, relating to the technical field of the industrial internet. An embodiment of the invention comprises the following steps: acquiring the static features and dynamic features of each job, wherein the static features of a job comprise its delivery date, specification and process requirements, and the dynamic features comprise its receiving time; and inputting the static and dynamic features of each job into a job batching module, which uses a Markov decision process to combine jobs with similar features in the set of jobs to be batched into the same batch, so that the total number of batches finally formed is as small as possible and the difference in job features within each batch is as small as possible. The invention can make full use of the large amount of unlabeled data in the industrial internet to learn a stable batching strategy, can handle input data with multi-dimensional features, gives a stable and efficient job batching solution, and is suitable for application scenarios with a large number of jobs.

1. An intelligent job batching method based on deep reinforcement learning, characterized by comprising the following steps:

S1, acquiring the static features and dynamic features of each job, wherein the static features of a job comprise its delivery date, specification and process requirements, and the dynamic features of a job comprise its receiving time;

S2, inputting the static features and dynamic features of each job into a job batching module, wherein the job batching module uses a Markov decision process to combine jobs with similar features in the set of jobs to be batched into the same batch, so that the total number of batches finally formed is as small as possible and the difference in job features within each batch is as small as possible;

wherein the Markov decision process is as follows: at each time step, the job batching module observes the state of the current environment, wherein the state of a job at time t comprises the static features of the job, the demand size of the job at time t, and the remaining available capacity of the current batch n at time t, and the state of the environment at time t is the set of the states of all jobs at time t; a corresponding action is then taken according to the state of the current environment, and the effect of the action is measured by a positive or negative reward value, the reward value being the opposite number of the objective function value; the environment then, subject to this action, transitions from the previous state to a new state.

2. The intelligent job batching method based on deep reinforcement learning as claimed in claim 1, wherein the action taken according to the state in step S2 is: a virtual node and the job nodes together serve as the input sequence of the model, and at each decision time point t the job batching module selects one node from the input sequence as the output node; by default, the first output of the job batching module is the virtual node, which represents the start of batching; whenever the job batching module selects the virtual node as the output node, the division of the current batch ends; when all jobs have been combined into their corresponding batches, an output sequence is obtained from the decisions of the job batching module, and this output sequence is the batching result of the job set.

3. The intelligent job batching method based on deep reinforcement learning as claimed in claim 1, wherein the job batching module of step S2 comprises an encoder and a decoder; the encoder uses a one-dimensional convolutional layer as an embedding layer to map the static features of each job in the input sequence into an output matrix; the decoder mainly comprises a long short-term memory (LSTM) network, a pointer network and a Mask vector, and works as follows: at each decision time point t, the LSTM network reads its hidden state and the output node of the previous decision time point, and outputs the hidden state at time t; the pointer network, combined with the Mask vector, calculates the probability distribution over the output nodes from the output matrix of the encoder, the hidden state of the LSTM network at time t, the dynamic feature vectors of all input nodes at time t, and the remaining capacity of the current batch n at time t; the length of the Mask vector equals that of the input sequence and its bits correspond to the input nodes one to one, the value of each bit is 0 or 1, and the bit corresponding to the virtual node is always 1; finally, the node with the maximum probability value is selected as the output node at time t; after the decision at time t is finished, the Mask vector, the dynamic feature vectors of the input sequence and the remaining capacity of the current batch n are immediately updated according to the decision result and serve as the input of the model at the next decision time point.

4. The intelligent job batching method based on deep reinforcement learning as claimed in claim 3, wherein the pointer network works as follows: at each decoding time step t, the weights of the input sequence at time t are obtained using an attention mechanism, and the weights are normalized by a Softmax function to obtain the probability distribution over the input sequence.

5. The intelligent job batching method based on deep reinforcement learning as claimed in claim 1, wherein the job batching module is trained with an actor-critic algorithm, which consists of an actor network and a critic network; the actor network predicts the probability of each node in the input sequence at each decision time point and selects the node with the highest probability as the output node; the critic network calculates an estimate of the reward obtained by the input sequence.

6. The intelligent job batching method based on deep reinforcement learning as claimed in claim 5, wherein the actor-critic algorithm comprises the following steps: randomly initializing the parameters of the actor network and the critic network; at each iteration step (epoch), randomly drawing J instances from the training set, sequentially determining the output sequence of each instance until all jobs in the instance have been combined into their corresponding batches, and calculating the reward value obtained by the current output sequence; after the batching tasks of the J instances are completed, the gradients of the actor network and the critic network are calculated and their parameters updated respectively.

7. An intelligent job batching device based on deep reinforcement learning, the device comprising:

a feature acquisition module for acquiring the static features and dynamic features of each job to be batched, wherein the static features of a job comprise its delivery date, specification and process requirements, and the dynamic features of a job comprise its receiving time;

a job batching module for receiving the static features and dynamic features of each job and combining jobs with similar features in the set of jobs to be batched into the same batch using a Markov decision process, so that the total number of batches finally formed is as small as possible and the difference in job features within each batch is as small as possible;

wherein the Markov decision process of the job batching module is: at each time step, the job batching module observes the state of the current environment, wherein the state of a job at time t comprises the static features of the job, the demand size of the job at time t, and the remaining available capacity of the current batch n at time t, and the state of the environment at time t is the set of the states of all jobs at time t; a corresponding action is then taken according to the state of the current environment, and the effect of the action is measured by a positive or negative reward value, the reward value being the opposite number of the objective function value; the environment then, subject to this action, transitions from the previous state to a new state.

8. The device of claim 7, wherein the action is taken as follows: a virtual node and the job nodes together serve as the input sequence of the model, and at each decision time point t the job batching module selects one node from the input sequence as the output node; by default, the first output of the job batching module is the virtual node, which represents the start of batching; whenever the job batching module selects the virtual node as the output node, the division of the current batch ends; when all jobs have been combined into their corresponding batches, an output sequence is obtained from the decisions of the job batching module, and this output sequence is the batching result of the job set.

9. The device of claim 8, wherein the job batching module comprises an encoder and a decoder; the encoder uses a one-dimensional convolutional layer as an embedding layer to map the static features of each job in the input sequence into an output matrix; the decoder mainly comprises a long short-term memory (LSTM) network, a pointer network and a Mask vector, and works as follows: at each decision time point t, the LSTM network reads its hidden state and the output node of the previous decision time point, and outputs the hidden state at time t; the pointer network, combined with the Mask vector, calculates the probability distribution over the output nodes from the output matrix of the encoder, the hidden state of the LSTM network at time t, the dynamic feature vectors of all input nodes at time t, and the remaining capacity of the current batch n at time t; the length of the Mask vector equals that of the input sequence and its bits correspond to the input nodes one to one, the value of each bit is 0 or 1, and the bit corresponding to the virtual node is always 1; finally, the node with the maximum probability value is selected as the output node at time t; after the decision at time t is finished, the Mask vector, the dynamic feature vectors of the input sequence and the remaining capacity of the current batch n are immediately updated according to the decision result and serve as the input of the model at the next decision time point.

10. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1 to 6 when executing a program stored in the memory.

Technical Field

The invention relates to the technical field of the industrial internet of things, and in particular to an intelligent job batching method and device based on deep reinforcement learning, and an electronic device.

Background

With the explosive development of the Industrial Internet of Things (IIoT), traditional industry is upgrading to smart manufacturing. Flexible, small-batch, multi-variety production is an important component of intelligent manufacturing. To improve equipment utilization and production efficiency, manufacturers often group jobs with similar characteristics into a batch and then produce in units of batches. The job batching problem exists widely in fields such as the chemical industry, textiles, semiconductors, medical care and steelmaking. Taking steelmaking as an example, each job has a number of characteristics, such as steel grade, thickness, width and weight. Jobs in steelmaking usually have different feature values, since different customers require different steel products.

To improve production efficiency, jobs with similar characteristics, such as similar steel grades and similar thicknesses, are combined into one batch for production, subject to the capacity constraints of the production equipment. In actual production, this job batching process is usually performed manually, which has the following problems: (1) because the total number of jobs to be batched is large, the total number of batches is unknown, and there are multiple job features and multiple batch constraints, the arrangement and combination of job batches is complex, and a technician cannot enumerate all feasible solutions in a short time; (2) it is difficult for a technician to select a reasonable solution from a large number of feasible solutions in a short time.

Fig. 1 illustrates an ideal intelligent factory in the industrial internet. A fully automatic, intelligent and unmanned production process can be realized through the comprehensive sensing of production data by intelligent equipment, the real-time transmission of production data over wireless communication, and the rapid computation of the job batching module in the cloud. Obviously, the efficiency of the entire production process is directly affected by the quality of the job batching module in the cloud. To realize this vision of the industrial internet, an efficient job batching module is needed.

Existing research on job batching mainly focuses on clustering algorithms and meta-heuristic algorithms, and neither approach uses massive data to learn prior knowledge. Clustering algorithms need to know in advance the total number of batches that will eventually be formed, which is unknown in practical application scenarios. Meta-heuristic algorithms depend heavily on the experience of technicians, and their results are unstable and unsuitable for actual production. In addition, as the number of jobs increases, the inference time of clustering and meta-heuristic algorithms grows explosively. Therefore, designing an efficient, intelligent job batching method for industrial internet of things scenarios is both urgent and of great practical significance.

Reinforcement Learning (RL) is an important branch of machine learning that mainly studies how an agent takes actions in an environment to obtain the maximum cumulative return. Although reinforcement learning training may take a relatively long time, once the job batching module is trained it can quickly respond correctly to new problems it encounters. Currently, reinforcement learning has been successfully applied to various scenarios, including robot control, manufacturing and games. Deep Learning (DL) has strong perception capability but lacks decision-making capability, while reinforcement learning has decision-making capability but is weak at perception; combining the two makes up for both shortcomings. In recent years, Deep Reinforcement Learning (DRL), which combines the decision-making capability of reinforcement learning with the perception capability of deep learning to handle multi-dimensional features, has successfully solved various practical problems. Therefore, the invention innovatively adopts a deep reinforcement learning method to solve the job batching problem in the industrial internet.

Disclosure of Invention

The invention aims to provide an intelligent job batching method and device based on deep reinforcement learning, and an electronic device.

In order to achieve the above purpose, the invention provides the following technical scheme:

In a first aspect, the invention provides an intelligent job batching method based on deep reinforcement learning, comprising the following steps:

S1, acquiring the static features and dynamic features of each job, wherein the static features of a job comprise its delivery date, specification and process requirements, and the dynamic features of a job comprise its receiving time;

S2, inputting the static features and dynamic features of each job into a job batching module, wherein the job batching module uses a Markov decision process to combine jobs with similar features in the set of jobs to be batched into the same batch, so that the total number of batches finally formed is as small as possible and the difference in job features within each batch is as small as possible;

wherein the Markov decision process is as follows: at each time step, the job batching module observes the state of the current environment, wherein the state of a job at time t comprises the static features of the job, the demand size of the job at time t, and the remaining available capacity of the current batch n at time t, and the state of the environment at time t is the set of the states of all jobs at time t; a corresponding action is then taken according to the state of the current environment, and the effect of the action is measured by a positive or negative reward value, the reward value being the opposite number of the objective function value; the environment then, subject to this action, transitions from the previous state to a new state.
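The state and reward structure described above can be sketched in code. This is a minimal illustration under stated assumptions, not the patented implementation; the names (`JobState`, `env_state`, `reward`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class JobState:
    static_features: Tuple[float, ...]  # e.g. (delivery_date, spec, process_req)
    demand: float                       # demand size at time t (0 once batched)

def env_state(jobs: List[JobState], remaining_capacity: float):
    """Environment state at time t: the set of all job states, each combined
    with the remaining available capacity of the current batch."""
    return [(j.static_features, j.demand, remaining_capacity) for j in jobs]

def reward(objective_value: float) -> float:
    """The reward is the opposite number (negative) of the objective value."""
    return -objective_value
```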

Further, the process of taking an action according to the state in step S2 is: a virtual node and the job nodes together serve as the input sequence of the model, and at each decision time point t the job batching module selects one node from the input sequence as the output node; by default, the first output of the job batching module is the virtual node, which represents the start of batching; whenever the job batching module selects the virtual node as the output node, the division of the current batch ends; when all jobs have been combined into their corresponding batches, an output sequence is obtained from the decisions of the job batching module, and this output sequence is the batching result of the job set.
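The virtual-node mechanism can be illustrated with a short decoding loop: index 0 is the virtual node and closes the current batch; decoding stops once every job is assigned. The selection policy here is a stand-in for the trained model, and all names are illustrative.

```python
def batch_jobs(num_jobs, select):
    """select(unassigned) -> node index; 0 is the virtual node, 1..num_jobs are jobs."""
    unassigned = set(range(1, num_jobs + 1))
    output_sequence = [0]          # first output defaults to the virtual node
    batches, current = [], []
    while unassigned:
        node = select(unassigned)
        output_sequence.append(node)
        if node == 0:              # virtual node selected: close the current batch
            if current:
                batches.append(current)
                current = []
        else:
            current.append(node)
            unassigned.discard(node)
    if current:                    # close the last open batch
        batches.append(current)
    return output_sequence, batches
```

For example, a policy that always picks the lowest-numbered remaining job yields a single batch containing all jobs.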

Further, the job batching module of step S2 comprises an encoder and a decoder; the encoder uses a one-dimensional convolutional layer as an embedding layer to map the static features of each job in the input sequence into an output matrix; the decoder mainly comprises a long short-term memory (LSTM) network, a pointer network and a Mask vector, and works as follows: at each decision time point t, the LSTM network reads its hidden state and the output node of the previous decision time point, and outputs the hidden state at time t; the pointer network, combined with the Mask vector, calculates the probability distribution over the output nodes from the output matrix of the encoder, the hidden state of the LSTM network at time t, the dynamic feature vectors of all input nodes at time t, and the remaining capacity of the current batch n at time t; the length of the Mask vector equals that of the input sequence and its bits correspond to the input nodes one to one, the value of each bit is 0 or 1, and the bit corresponding to the virtual node is always 1; finally, the node with the maximum probability value is selected as the output node at time t; after the decision at time t is finished, the Mask vector, the dynamic feature vectors of the input sequence and the remaining capacity of the current batch n are immediately updated according to the decision result and serve as the input of the model at the next decision time point.
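The Mask-vector bookkeeping described above reduces to a simple rule, sketched below under the assumption that bit 0 belongs to the virtual node; the function name `update_mask` is illustrative.

```python
def update_mask(mask, chosen_node):
    """Clear the bit of a job once it is assigned to a batch. Bit 0 (the
    virtual node) always stays 1, so the current batch can be closed at
    any decision time point."""
    mask = list(mask)              # copy so earlier states remain intact
    if chosen_node != 0:
        mask[chosen_node] = 0      # a batched job can no longer be selected
    return mask
```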

Further, the pointer network works as follows: at each decoding time step t, the weights of the input sequence at time t are obtained using an attention mechanism, and the weights are normalized by a Softmax function to obtain the probability distribution over the input sequence.
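The normalization step can be combined with the Mask vector in one masked Softmax, sketched below; the attention scores are placeholders for the learned weights of the pointer network.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over attention scores; nodes whose mask bit is 0 are excluded
    and receive probability 0, so only selectable nodes compete."""
    exps = [math.exp(s) if m == 1 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```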

Further, the job batching module is trained with an actor-critic algorithm, which consists of an actor network and a critic network; the actor network predicts the probability of each node in the input sequence at each decision time point and selects the node with the highest probability as the output node; the critic network calculates an estimate of the reward obtained by the input sequence.

Further, the actor-critic algorithm comprises the following steps: randomly initializing the parameters of the actor network and the critic network; at each iteration step (epoch), randomly drawing J instances from the training set, sequentially determining the output sequence of each instance until all jobs in the instance have been combined into their corresponding batches, and calculating the reward value obtained by the current output sequence; after the batching tasks of the J instances are completed, the gradients of the actor network and the critic network are calculated and their parameters updated respectively.
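The training loop above can be sketched at a high level with plain-Python stand-ins for the networks and gradient steps; `rollout_reward`, `critic_estimate` and `apply_update` are hypothetical callables, not the patented networks.

```python
import random

def train(instances, num_epochs, J, rollout_reward, critic_estimate, apply_update):
    """Each epoch: draw J instances, roll out the actor on each until all jobs
    are batched, and use reward minus the critic's baseline (the advantage)
    to drive both network updates."""
    last_advantages = []
    for _ in range(num_epochs):
        batch = random.sample(instances, J)
        rewards = [rollout_reward(inst) for inst in batch]
        baselines = [critic_estimate(inst) for inst in batch]
        last_advantages = [r - b for r, b in zip(rewards, baselines)]
        apply_update(last_advantages)   # stand-in for the two gradient steps
    return last_advantages
```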

In a second aspect, the invention provides an intelligent job batching device based on deep reinforcement learning, the device comprising:

a feature acquisition module for acquiring the static features and dynamic features of each job to be batched, wherein the static features of a job comprise its delivery date, specification and process requirements, and the dynamic features of a job comprise its receiving time;

a job batching module for receiving the static features and dynamic features of each job and combining jobs with similar features in the set of jobs to be batched into the same batch using a Markov decision process, so that the total number of batches finally formed is as small as possible and the difference in job features within each batch is as small as possible;

wherein the Markov decision process of the job batching module is: at each time step, the job batching module observes the state of the current environment, wherein the state of a job at time t comprises the static features of the job, the demand size of the job at time t, and the remaining available capacity of the current batch n at time t, and the state of the environment at time t is the set of the states of all jobs at time t; a corresponding action is then taken according to the state of the current environment, and the effect of the action is measured by a positive or negative reward value, the reward value being the opposite number of the objective function value; the environment then, subject to this action, transitions from the previous state to a new state.

Further, the action is taken as follows: a virtual node and the job nodes together serve as the input sequence of the model, and at each decision time point t the job batching module selects one node from the input sequence as the output node; by default, the first output of the job batching module is the virtual node, which represents the start of batching; whenever the job batching module selects the virtual node as the output node, the division of the current batch ends; when all jobs have been combined into their corresponding batches, an output sequence is obtained from the decisions of the job batching module, and this output sequence is the batching result of the job set.

Furthermore, the job batching module comprises an encoder and a decoder; the encoder uses a one-dimensional convolutional layer as an embedding layer to map the static features of each job in the input sequence into an output matrix; the decoder mainly comprises a long short-term memory (LSTM) network, a pointer network and a Mask vector, and works as follows: at each decision time point t, the LSTM network reads its hidden state and the output node of the previous decision time point, and outputs the hidden state at time t; the pointer network, combined with the Mask vector, calculates the probability distribution over the output nodes from the output matrix of the encoder, the hidden state of the LSTM network at time t, the dynamic feature vectors of all input nodes at time t, and the remaining capacity of the current batch n at time t; the length of the Mask vector equals that of the input sequence and its bits correspond to the input nodes one to one, the value of each bit is 0 or 1, and the bit corresponding to the virtual node is always 1; finally, the node with the maximum probability value is selected as the output node at time t; after the decision at time t is finished, the Mask vector, the dynamic feature vectors of the input sequence and the remaining capacity of the current batch n are immediately updated according to the decision result and serve as the input of the model at the next decision time point.

Further, the pointer network works as follows: at each decoding time step t, the weights of the input sequence at time t are obtained using an attention mechanism, and the weights are normalized by a Softmax function to obtain the probability distribution over the input sequence.

Further, the job batching module is trained with an actor-critic algorithm, which consists of an actor network and a critic network; the actor network predicts the probability of each node in the input sequence at each decision time point and selects the node with the highest probability as the output node; the critic network calculates an estimate of the reward obtained by the input sequence.

Further, the actor-critic algorithm comprises the following steps: randomly initializing the parameters of the actor network and the critic network; at each iteration step (epoch), randomly drawing J instances from the training set, sequentially determining the output sequence of each instance until all jobs in the instance have been combined into their corresponding batches, and calculating the reward value obtained by the current output sequence; after the batching tasks of the J instances are completed, the gradients of the actor network and the critic network are calculated and their parameters updated respectively.

In a third aspect, the invention provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

and the processor is configured, when executing the program stored in the memory, to implement the steps of any of the above intelligent job batching methods based on deep reinforcement learning.

In a fourth aspect, the invention further provides a computer-readable storage medium in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above intelligent job batching methods based on deep reinforcement learning.

In a fifth aspect, the invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of any of the above intelligent job batching methods based on deep reinforcement learning.

Compared with the prior art, the invention has the beneficial effects that:

The intelligent job batching method and device based on deep reinforcement learning and the electronic device describe the job batching problem as a Markov decision process and solve it with a method based on deep reinforcement learning. Meanwhile, the invention regards the job batching process as a mapping from one sequence to another and provides a pointer-network-based job batching module, which aims to minimize both the total number of job batches and the feature differences of the jobs within each batch under the batch capacity constraint.

The intelligent job batching method and device based on deep reinforcement learning and the electronic device provided by the invention can make full use of the large amount of unlabeled data in the industrial internet to learn a stable batching strategy, can handle input data with multi-dimensional features, and give a stable and efficient job batching solution. In particular, even in practical application scenarios with a large number of jobs, the method can quickly generate a corresponding solution.

Of course, practicing any product or method of the invention does not necessarily require achieving all of the advantages described above at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present application and the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the invention, and other drawings can be obtained by those skilled in the art from these drawings.

Fig. 1 is a schematic diagram of an ideal intelligent factory in the industrial internet.

Fig. 2 is a schematic diagram of an input sequence according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of an output sequence according to an embodiment of the present invention.

FIG. 4 is a schematic diagram of an encoder for a job batch module according to an embodiment of the present invention.

FIG. 5 is a block diagram of a job batch module according to an embodiment of the present invention.

FIG. 6 is a schematic structural diagram of an intelligent job batch device based on deep reinforcement learning according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For a better understanding of the present solution, the method of the present invention is described in detail below with reference to the accompanying drawings. The following table is a description of the meaning of symbols referred to in the examples of the present application.

Table 1

The invention provides an intelligent job batching method based on deep reinforcement learning, which comprises the following steps:

S1, acquiring the static features and dynamic features of each job, wherein the static features of a job comprise the job delivery date, job specification and process requirements, and the dynamic features of a job comprise the receiving time;

and S2, inputting the static features and dynamic features of each job into a job batching module, wherein the job batching module uses a Markov decision process to combine jobs with similar features in the set of jobs to be batched into the same batch, so that the total number of batches finally formed is as small as possible and the feature-difference value of the jobs in each batch is as small as possible.

In the present invention, we mainly consider a typical job batching problem that must be faced in the industrial Internet. Specifically, given a set of jobs to be batched X = {Xi | i = 1, 2, …, M}, each job Xi can be defined as Xi = {fi, di}, where fi represents the features of job Xi (e.g., the delivery date, specification and process requirements of the job, as defined by the specific application scenario) and can be written as a tuple fi = {fik | k = 1, 2, …, K}, and di represents the demand quantity of job Xi.

Given that the maximum capacity of a batch is C, the purpose of job batching is to combine jobs with similar characteristics in a job set to be batched into the same batch under the constraint of satisfying the batch capacity, so that the total number N of finally composed batches is as small as possible, and the difference value D of job characteristics is as small as possible.

The mathematical model of the problem is as follows:

min(αD+βN) (1)

where

α+β=1 (2)

where

equation (1) is an objective function, where D represents the sum of the difference values of the operation characteristics in all batches. Formula (1) includes 2 sub-targets: the total number of operation batches which are finally formed is as small as possible; and the other is that the difference in characteristics between jobs in each lot is as small as possible (i.e., jobs with similar characteristics are divided into a lot).

Equation (2) represents the importance constraint on the two sub-targets in equation (1).

Equation (3) represents the constraint on the degree of influence of each attribute feature of a job on the batching result.

Equation (4) indicates that, depending on enterprise production requirements, the total demand of the jobs in each batch cannot exceed the upper capacity limit C of a batch (i.e., the sum of di over the jobs in any batch n is at most C).

Equation (5) indicates that one job can be combined into at most one batch.
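To make the model in equations (1)–(5) concrete, it can be sketched in Python. The exact feature-difference measure D is not fixed by the text, so the per-feature range within a batch (summed over features and batches) is used here as one plausible choice; the helper name and data layout are illustrative:

```python
def batch_objective(batches, features, demands, capacity, alpha=0.5, beta=0.5):
    """Evaluate the objective alpha*D + beta*N of equation (1) for a candidate
    batching, checking the constraints of equations (2), (4) and (5).

    batches  : list of lists of job indices (one inner list per batch)
    features : dict job index -> tuple of K numeric features
    demands  : dict job index -> demand quantity d_i
    capacity : maximum total demand C per batch
    """
    assert abs(alpha + beta - 1.0) < 1e-9            # equation (2)
    seen = set()
    D = 0.0
    for batch in batches:
        # equation (4): total demand of a batch must not exceed C
        assert sum(demands[i] for i in batch) <= capacity
        for i in batch:
            assert i not in seen                     # equation (5): one batch per job
            seen.add(i)
        # feature-difference of the batch: per-feature range, summed
        # (the text leaves the exact difference measure open; this is one choice)
        K = len(next(iter(features.values())))
        D += sum(max(features[i][k] for i in batch) -
                 min(features[i][k] for i in batch) for k in range(K))
    N = len(batches)                                 # total number of batches
    return alpha * D + beta * N
```

For example, batching three jobs with one-dimensional features (1.0,), (2.0,), (10.0,) and unit demands into [[1, 2], [3]] gives D = 1 and N = 2.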

Markov decision process:

The job batching process may be viewed as a process in which the job batching module continuously interacts with the environment through sequential decisions to combine individual jobs into batches; this can be represented by a Markov decision process (MDP).

Specifically, at each time step the job batching module obtains the current environment state (state) and, based on that state, performs a corresponding action (action), whose effect is measured by a positive or negative reward value (reward). The environment then, subject to the previous action, transitions from the previous state to the next new state. Through such continuous cycles the job batching module gradually learns better decisions and completes better job batching.

(1) State (state): in the present invention, when a job Xi is combined into a batch n, the demand di of the job becomes 0, indicating that the job has been successfully assigned to a batch. At the same time, the remaining available capacity Vn of the current batch n changes from the original C to C − di. Thus, as jobs are combined into batches, the current demand di of a job and the remaining available capacity Vn of the current batch n are variables related to the time t.

Thus, each job Xi can be redefined as Xi = {fi, di^t}, where fi and di^t respectively represent the static features of job Xi and its dynamic feature at time t. In the decoding phase (batching phase) of the model, the static features of job Xi (such as the delivery date of the job, product length, width, etc.) remain unchanged, while the dynamic elements of the job change according to the output at each stage.

To sum up, the state of job Xi at time t may be represented by a triplet si^t = (fi, di^t, Vn^t), whose elements respectively represent the static features of job Xi, the demand of job Xi at time t, and the remaining available capacity of the current batch n at time t.

In summary, the state of the current environment at time t is the set of the states of all jobs Xi at time t: St = {si^t | i = 1, 2, …, M}.
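The dynamic part of this state transition (di becoming 0, Vn shrinking by di) can be sketched as follows; the dictionary-based state is a hypothetical illustration, not the patent's exact data structure:

```python
def combine_job(state, job):
    """Transition the dynamic state when `job` is combined into the current
    batch n: the job's demand d_i becomes 0 and the remaining capacity V_n
    shrinks by d_i. The static features are untouched."""
    d = state['demand'][job]
    assert 0 < d <= state['remaining']   # the job must fit in the current batch
    state['remaining'] -= d              # V_n becomes V_n - d_i
    state['demand'][job] = 0             # d_i becomes 0: the job is batched
    return state

def close_batch(state, capacity):
    """Ending the current batch makes the full capacity C available again
    for the next batch."""
    state['remaining'] = capacity
    return state
```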

(2) Action (action): to help the job batching module better complete batching tasks, a virtual node X0 = {f0, d0} is defined and used as a batch-segmentation node. The virtual node X0 has the same feature dimensions as a job Xi, except that every feature value f0 and the demand d0 of the virtual node are 0 at all times. The virtual node X0, together with the other job nodes Xi (i = 1, 2, …, M), forms the input sequence of the model. At each decision time t, the job batching module selects one node from the input sequence as the output node. When the job batching module selects the virtual node X0 as the output node, the division of the current batch ends. By default, the first output of the job batching module is the virtual node X0, marking the start of the batching task. When the termination condition is satisfied (i.e., all jobs have been combined into corresponding batches), an output sequence is obtained from the decisions of the job batching module; this output sequence is the batching result of the job set X.

For example (as shown in fig. 2), for the job set X = {Xi | i = 1, 2, …, 7}, the final output sequence of the job batching module (as shown in fig. 3) is {X0, X5, X1, X0, X3, X4, X6, X0, X7, X2, X0}, which indicates that the job batching module divides the job set X into 3 batches, namely U1 = {X5, X1}, U2 = {X3, X4, X6}, and U3 = {X7, X2}.
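The worked example above can be reproduced mechanically: splitting the output sequence on the virtual node X0 (represented here by the index 0) recovers the batches. A minimal sketch:

```python
def split_batches(output_sequence):
    """Turn a decoder output sequence into batches by splitting on the
    virtual node (index 0), mirroring the worked example in the text."""
    batches, current = [], []
    for node in output_sequence:
        if node == 0:                 # virtual node X0: close the current batch
            if current:
                batches.append(current)
                current = []
        else:
            current.append(node)      # job node: joins the current batch
    if current:                       # flush a trailing batch, just in case
        batches.append(current)
    return batches
```

Applied to the sequence {X0, X5, X1, X0, X3, X4, X6, X0, X7, X2, X0}, this yields the three batches U1, U2, U3 given above.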

In summary, the action yt taken by the job batching module at time t falls into two categories: selecting a job node Xi (i = 1, 2, …, M), which adds job Xi to the current batch; or selecting the virtual node X0, which ends the division of the current batch.

(3) Reward (reward): the reward function directly reflects the quality of the action taken by the job batching module in the current environment state. The goal of the job batching module is, without exceeding the batch capacity limit, to minimize the total number of batches in the final batching result and to minimize the feature-difference value of the jobs in each batch, as shown in equation (1). The smaller the objective function value, the larger the reward value given to the job batching module should be, meaning that the current batching is better.

Thus, the reward function is expressed as follows:

R=-(αD+βN) (7)

where D is as in equation (1), its value being the sum of the feature-difference values of the jobs, and N is the total number of batches after batching.

The overall structure of the job batching module 602 provided by the invention is shown in fig. 4 and fig. 5. The model is implemented based on a pointer network and is divided into an encoder (Encoder) 603 and a decoder (Decoder) 604.

(1) Encoder section

The encoding layer of the original pointer network structure is implemented with a recurrent neural network (RNN). However, an RNN is meaningful only when the order of the input sequence conveys information (e.g., in text translation, the order of adjacent words conveys related information). Since the input of our model is a set of unordered job features, any random permutation of the input sequence contains the same information as the original input, i.e., the order of the input sequence has no meaning.

Therefore, in our model we omit the RNN in the encoder and directly use a one-dimensional convolutional layer as the embedding layer (Embedding Layer), which maps the static features fi (i = 0, 1, …, M) of each node in the input sequence (the virtual node and the job set) into an output matrix.
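Since a one-dimensional convolution with kernel size 1 over the node axis is simply the same linear map applied to every node, the embedding layer can be sketched with NumPy. The embedding size H and the weight names are assumptions for illustration, not the patent's notation:

```python
import numpy as np

def embed_static_features(F, W, b):
    """Embedding-layer sketch: a 1-D convolution with kernel size 1 over the
    node axis applies one shared linear map to every node, so the encoder
    needs no RNN -- the result is permutation-equivariant, matching the
    order-free input set described in the text.

    F : (M+1, K) static features of the virtual node and the M jobs
    W : (K, H)   shared projection weights (H = embedding size, assumed)
    b : (H,)     shared bias
    Returns the encoder output matrix of shape (M+1, H).
    """
    return F @ W + b   # identical map per node
```

Permuting the input rows permutes the output rows identically, which is why the order of the input sequence carries no information here.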

(2) Decoder part

The decoder mainly consists of a long short-term memory network (LSTM), a pointer network and a Mask vector. Its working process is as follows: at each decision time t, the LSTM reads its hidden state h_{t-1} from the previous decision time and the output node y_{t-1} chosen by the job batching module at the previous decision time, and outputs the hidden state ht at time t. The pointer network then calculates the probability distribution over the output nodes by combining the Mask vector with the output matrix of the encoder, the LSTM hidden state ht at time t, the dynamic feature vector dt of all input sequences at time t, and the remaining capacity Vn^t of the current batch n at time t. The Mask vector has the same length as the input sequence, its bits corresponding one-to-one to the input nodes; each bit takes the value 0 or 1, and the bit corresponding to the virtual node is always 1. Finally, the node with the maximum probability value is selected as the output node yt at time t. After the job batching module completes the decision at time t, the Mask vector, the dynamic feature vector d_{t+1} of the input sequence and the remaining capacity Vn^{t+1} of the current batch n are immediately updated according to the decision result; these dynamic variables serve as the model input at the next decision time.

The pointer network mechanism can be described as follows: at each decoding time step t, an attention mechanism is used to obtain a weight for each element of the input sequence at time t, and normalizing these weights with the softmax function yields the probability distribution at over the input sequence. at is calculated as follows (va and ωa are training parameters):
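The exact form of the attention computation (equation (8)) does not survive in this copy of the text. The following NumPy sketch shows one common additive-attention form that is consistent with the named training parameters va and ωa (here `v_a` and `W_a`); the concatenation layout and dimensions are assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attention_weights(E, h_t, v_a, W_a):
    """One plausible additive-attention form for the attention step:
    u_t[i] = v_a . tanh(W_a @ [E[i]; h_t]),  a_t = softmax(u_t).

    E   : (M+1, H) encoder output matrix
    h_t : (H,)     LSTM hidden state at time t
    v_a : (A,)     training parameter (assumed shape)
    W_a : (A, 2H)  training parameter (assumed shape)
    Returns the probability distribution a_t over the input sequence.
    """
    n = E.shape[0]
    concat = np.hstack([E, np.tile(h_t, (n, 1))])   # [e_i ; h_t] per node
    u = np.tanh(concat @ W_a.T) @ v_a               # one score per node
    return softmax(u)                               # distribution a_t
```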

in order to ensure the legality of the output sequence of the job batch module, a Mask vector is introduced into the decision process of the job batch module to add constraints. The length of Mask vector is equal to the input sequence, and is respectively equal to the input sequence Xi(i-0, 2, …, M) in one-to-one correspondence. The Mask vector takes a value of 0 or 1 for each bit. Virtual node X0The value of the corresponding Mask vector bit is always 1, which indicates that the job batch module can end the division of the current batch at any time.

The Mask vector bit corresponding to Xi (i = 1, 2, …, M) takes the value 0 in the following cases:

a) at time t, job Xi has already been selected by the job batching module as an output node, i.e., job Xi has been combined into a batch;

b) at time t, the demand di of job Xi is greater than the remaining available capacity Vn^t of the current batch n;

c) when t = 0, the job batching module can only select the virtual node X0 as the output node, marking the start of the job batching task.
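The three masking rules a)–c) above can be collected into a small helper. This is a sketch with illustrative names; jobs are numbered 1…M, position 0 is the virtual node X0, and bit value 1 means the node may be selected:

```python
def build_mask(demands, remaining, already_batched, t):
    """Mask vector over the input sequence [X0, X1, ..., XM].

    demands         : list of current demands d_i for jobs 1..M
    remaining       : remaining capacity V_n^t of the current batch
    already_batched : set of job indices already combined into a batch
    t               : current decision time

    Bit 0 (the virtual node X0) is always 1: the batch can be closed anytime.
    """
    M = len(demands)
    mask = [1] * (M + 1)
    for i in range(1, M + 1):
        if t == 0:
            mask[i] = 0                  # case c): only X0 may start the task
        elif i in already_batched:
            mask[i] = 0                  # case a): already in a batch
        elif demands[i - 1] > remaining:
            mask[i] = 0                  # case b): does not fit in batch n
    return mask
```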

Combining the Mask vector, the probability values output by the pointer network at time t are calculated as follows (vb is a training parameter):

P(yt | Yt-1, St) = softmax(vb·at + ln(Mask))   (9)

As can be seen from equation (9), when the Mask vector bit corresponding to job Xi is 0, the probability of job Xi being the output node is also 0. At each decoding time step t, equation (9) is computed and the node with the maximum probability value is selected as the output node yt.
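A numerical sketch of this masked output distribution: with the convention that a Mask bit of 1 means selectable, the masking term must enter the softmax as + ln(Mask), so that ln(1) = 0 leaves allowed nodes untouched while ln(0) = −∞ drives the probability of forbidden nodes to exactly 0, as the text requires:

```python
import numpy as np

def output_distribution(a_t, v_b, mask):
    """Masked output distribution in the spirit of equation (9):
    P(y_t) = softmax(v_b * a_t + ln(Mask)).

    a_t  : attention distribution over the input sequence
    v_b  : scalar training parameter
    mask : 0/1 Mask vector (1 = node may be selected)
    """
    mask = np.asarray(mask, dtype=float)
    with np.errstate(divide='ignore'):           # ln(0) -> -inf is intended
        scores = v_b * np.asarray(a_t) + np.log(mask)
    scores = scores - scores[mask > 0].max()     # numerical stability
    e = np.where(mask > 0, np.exp(scores), 0.0)  # masked nodes get exactly 0
    return e / e.sum()
```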

The invention trains the model with the actor-critic algorithm. The actor-critic algorithm generally consists of two networks: an actor network and a critic network.

The actor network is used to predict the probability of each node in the input sequence at each decision time t and to select the node with the highest probability as the output node. Assuming its parameters are θ, the gradient with respect to the actor network parameters is:

∇θ = (1/J) Σ_{j=1}^{J} (Rj − V(Sj^0; φ)) ∇θ ln P(Yj | Sj^0)   (10)

the critic network is used to calculate an estimate of the prize earned by the input sequence. Assume that its parameters areThe gradient for the critic network parameters is then:

the specific algorithm steps are as follows:

first, randomly initializing the parameters theta and theta of the operator network and the critic network

At each iteration step (epoch), we randomly draw J instances from the training set (each instance is a set containing M jobs) and denote the jth instance with the subscript j;

for each instance, determining its output sequence (i.e., making batch decisions) in turn according to equation (9) using the modified pointer network until the termination condition is satisfied (all jobs in this instance are combined into the corresponding batch);

At this time, the reward value Rj obtained by the output sequence of the current job batching module is calculated according to equation (7);

After the batching tasks of the J instances are completed, the gradients of the actor network and the critic network are calculated and the networks are updated according to equations (10) and (11), respectively.

In equations (10) and (11), Sj^0 denotes the state of the input sequence of the jth instance at time t = 0, Yj is the output sequence finally decided by the actor for Sj^0, Rj is the reward value actually obtained by the actor for the final decision sequence of the jth instance, P(Yj | Sj^0) is the probability the actor assigns to the output sequence of the jth instance, and V(Sj^0; φ) is the critic's estimate of the reward obtainable for Sj^0 of the jth instance.
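A toy version of one update following equations (10) and (11) can be written with NumPy. The per-instance gradients are passed in precomputed (in the real model they come from backpropagation through the pointer network and the critic network); all names, shapes and the learning rate are illustrative, not the patent's exact networks:

```python
import numpy as np

def actor_critic_step(rewards, log_prob_grads, baseline_grads, baselines,
                      theta, phi, lr=0.01):
    """One actor-critic update: REINFORCE with a learned baseline.

    rewards[j]        : R_j, reward of the j-th instance's output sequence
    log_prob_grads[j] : grad_theta ln P(Y_j | S_j^0)
    baseline_grads[j] : grad_phi V(S_j^0; phi)
    baselines[j]      : V(S_j^0; phi), the critic's reward estimate
    """
    J = len(rewards)
    g_theta = np.zeros_like(theta)
    g_phi = np.zeros_like(phi)
    for j in range(J):
        adv = rewards[j] - baselines[j]          # advantage term in (10)
        g_theta += adv * log_prob_grads[j]       # policy gradient, eq. (10)
        g_phi += 2.0 * (baselines[j] - rewards[j]) * baseline_grads[j]  # eq. (11)
    theta = theta + lr * g_theta / J             # ascend: increase expected reward
    phi = phi - lr * g_phi / J                   # descend: shrink critic error
    return theta, phi
```

The actor parameters move in the direction that makes better-than-estimated sequences more likely, while the critic parameters move to reduce its squared estimation error.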

The present invention describes the job-batch problem as a Markov decision process and employs a deep reinforcement learning based approach to solve the problem. The method can process multidimensional input data and does not need label data to train the model.

The present invention treats job-batch processes as a sequence-to-sequence mapping process and proposes a pointer-network based job-batch module that aims to minimize the total number of job batches and the differences in characteristics of jobs within a batch, within the constraints of batch capacity.

In industrial Internet scenarios, the job batching problem is ubiquitous, and the quality of the job batching method directly influences the efficiency of the whole production flow. Aiming at the job batching problem, the invention establishes a job batching module based on a pointer network. Meanwhile, the invention provides an intelligent job batching method based on deep reinforcement learning, which can make full use of the large amount of unlabeled data in the industrial Internet to learn a stable batching strategy, can process input data with multi-dimensional features, and provides a stable and efficient job batching solution. In particular, even in practical application scenarios with a large number of jobs, our method can quickly generate a corresponding solution. Therefore, the present invention can be applied to practical production.

Corresponding to the method embodiments, the invention also provides an intelligent job batching device based on deep reinforcement learning to implement the above method. Referring to fig. 6, the device comprises: a feature acquisition module 601 and a job batching module 602;

the system comprises a characteristic acquisition module 601, a data processing module and a data processing module, wherein the characteristic acquisition module 601 is used for acquiring static characteristics and dynamic characteristics of each to-be-batched operation, the static characteristics of the operation comprise an operation delivery date, an operation specification and a process requirement, and the dynamic characteristics of the operation comprise a receiving moment;

The job batching module 602 is configured to receive the static features and dynamic features of the jobs and to combine jobs with similar features in the set of jobs to be batched into the same batch using a Markov decision process, so that the total number of batches finally formed is as small as possible and the feature-difference value of the jobs in each batch is as small as possible.

Wherein the Markov decision process of the job batch module is: at each time step, the job batching module obtains the state of the current environment, wherein the state of the job at the time t comprises the static characteristics of the job, the demand size of the job at the time t and the residual available capacity of the current batch n at the time t, and the state of the current environment at the time t is the set of the states of all jobs at the time t; then, corresponding action is made according to the state of the current environment, the effect of the action is measured by a positive or negative reward value, and the reward value is the opposite number of the objective function value; the environment is then subject to the previous action to transition from the previous state to the next new state.

The action making process comprises the following steps: the virtual nodes and other operation nodes are taken as input sequences of the model together, and at each decision time point t, the operation batching module selects one of all the input sequences in sequence as an output node; the first output of the default job batching module is a virtual node representing the start of batching; when the operation batch module selects the virtual node as an output node, the current batch division is finished; when all the jobs are combined into the corresponding batch, an output sequence is obtained according to the decision of the job batch module, and the output sequence is the batch result of the job set.

The job batching model comprises an encoder and a decoder. The encoder uses a one-dimensional convolutional layer as an embedding layer and maps the static features of each job in the input sequence into an output matrix. The decoder mainly comprises a long short-term memory network, a pointer network and a Mask vector, and works as follows: at each decision time t, the long short-term memory network reads its hidden state from the previous decision time and the output node of the previous decision time, and outputs the hidden state at time t; the pointer network calculates the probability distribution of each output node by combining the Mask vector with the output matrix of the encoder, the hidden state of the long short-term memory network at time t, the dynamic feature vectors of all input sequences at time t and the remaining capacity of the current batch n at time t; the Mask vector has the same length as the input sequence, its bits corresponding one-to-one to the input nodes, each bit taking the value 0 or 1, with the bit corresponding to the virtual node always 1; finally, the node with the maximum probability value is selected as the output node at time t; after the decision at time t is finished, the Mask vector, the dynamic feature vector of the input sequence and the remaining capacity of the current batch n are immediately updated according to the decision result and serve as the input of the model at the next decision time.

The working process of the pointer network comprises the following steps: and at each decoding time step t, obtaining the weight of the input sequence at the time t by using an attention mechanism, and normalizing the weight by a Softmax function to obtain the probability distribution of the input sequence.

The training method of the operation batch model uses an actor-critic algorithm, and the actor-critic algorithm consists of an actor network and a critic network; the actor network is used for predicting the probability of each node in the input sequence at each decision time point and selecting the node with the highest probability as an output node; the critic network is used to calculate an estimate of the prize earned by the input sequence.

The actor-critic algorithm comprises the following steps: randomly initialize the parameters of the actor network and the critic network; at each iteration step (epoch), randomly draw J instances from the training set, sequentially determine the output sequence of each instance until all jobs in the instance are combined into corresponding batches, and calculate the reward value obtained by the current output sequence; after the batching tasks of the J instances are completed, calculate the gradients of the actor network and the critic network and update the two networks respectively.

The invention also provides an electronic device, as shown in fig. 7, comprising a processor 701, a communication interface 702, a memory 703 and a communication bus 704, wherein the processor 701, the communication interface 702 and the memory 703 complete mutual communication through the communication bus 704,

a memory 703 for storing a computer program;

the processor 701 is configured to implement the method steps in the above-described method embodiments when executing the program stored in the memory 703.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

The present invention also provides a computer readable storage medium having a computer program stored therein, which when executed by a processor implements the steps of any of the above-mentioned intelligent deep reinforcement learning-based job batching methods.

The present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above embodiments of the deep reinforcement learning based intelligent job batching method.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the invention to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their technical features; such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
