Operation device, method, medium and computing equipment

Reading note: this technology, "An operation device, method, medium and computing device" (Operation device, method, medium and computing equipment), was designed and created by 杨龚轶凡, 闯小明, 郑瀚寻 and 王润哲 on 2021-07-12. Abstract: The invention discloses an operation device, method, medium and computing device. The device comprises a compiler, a memory, a register and a controller. The memory is used for compiling the calculation information of the Transformer model into large-scale data under the action of the compiler and the register and storing the large-scale data, the large-scale data at least comprising operation instructions and operation data compiled and recombined according to time and space; the register is used for uninterruptedly reading the operation instructions from the memory and storing the relevant information of the operation instructions; the controller is used for constructing a control flow according to the operation instructions read by the register and their related information, and controlling the memory to select the corresponding operation data for operation in a stream mode. The device has the advantages of low operation overhead and high running speed.

1. An arithmetic device for executing a Transformer model, comprising a compiler, a memory, a register, and a controller,

the memory is used for compiling the calculation information of the Transformer model into large-scale data under the action of the compiler and the register and storing the large-scale data, wherein the large-scale data at least comprises operation instructions and operation data which are compiled and recombined according to time and space;

the register is used for uninterruptedly reading the operation instruction from the memory and storing the relevant information of the operation instruction;

the controller is used for constructing a control flow according to the operation instruction read by the register and the related information of the operation instruction, and controlling the memory to select the corresponding operation data for operation in a stream mode.

2. The arithmetic device according to claim 1, wherein the calculation information includes layer information, and the compiler divides all operations in each layer into a plurality of operation chains in a time dimension according to a dependency relationship, parallelizes the plurality of operation chains according to the physical operation space on the memory, and compiles the parallelized plurality of operation chains into the operation instruction.

3. The arithmetic device of claim 2, wherein the compiler comprises an operator fusion module and a decoding module;

the operator fusion module is used for generating a computation logic diagram according to the layer information;

the decoding module is used for dividing all operations in each layer into a plurality of operation chains in a time dimension according to the computational logic diagram and the dependency relationship, performing parallelization processing on the operation chains according to the physical space of the memory for operation, and compiling the operation chains subjected to the parallelization processing into an operation instruction set.

4. The arithmetic device of claim 3, wherein the compiler is further configured to determine an implementation form of the Transformer model and obtain the calculation information according to the implementation form of the Transformer model.

5. The arithmetic device of claim 1, wherein the memory comprises a parallel storage module, a data selection module, and a heterogeneous parallel computation module;

the parallel storage module is used for storing the large-scale data;

the data selection module is used for selecting a specific group of operation data from the parallel storage module in a stream mode according to a control flow signal of the controller;

the heterogeneous parallel computing module is used for operating a specific group of operation data selected by the data selection module in a stream mode.

6. An operation method for operating a Transformer model, comprising:

acquiring calculation information of the Transformer model;

compiling the calculation information into large-scale data, wherein the large-scale data at least comprises operation instructions and operation data which are compiled and recombined according to time and space;

constructing a control flow according to the operation instructions and the related information of each operation instruction;

and, in response to the control flow, uninterruptedly selecting the corresponding operation data in a stream mode and completing the corresponding operations.

7. The operation method according to claim 6, wherein obtaining the calculation information of the Transformer model comprises:

judging the implementation form of the Transformer model;

and acquiring the calculation information of the Transformer model according to the implementation form of the Transformer model.

8. The operation method of claim 6, wherein compiling the calculation information into large-scale data, the large-scale data at least comprising operation instructions and operation data compiled and recombined according to time and space, comprises:

extracting layer information, parameter storage information and operation data from the calculation information;

disassembling the operation of each layer;

constructing the operation after disassembling each layer into a plurality of dependency chains according to the dependency relationship;

constructing a plurality of dependency chains of each layer into a plurality of operation chains in a time dimension;

and carrying out parallelization processing on the multiple operation chains of each layer in the spatial dimension.

9. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 6-8.

10. A computing device, comprising a processor configured to implement the method of any one of claims 6-8 when executing a computer program stored in a memory.

Technical Field

The invention relates to the field of artificial intelligence deep learning, and in particular to an arithmetic device, an arithmetic method, a medium and a computing device.

Background

The Transformer is a machine learning model. At present, the Transformer model is usually run on a conventional general-purpose processor (CPU) by executing one operator after another in sequence; this method has low execution efficiency and does not reflect the large-scale parallelism at the heart of the Transformer's design. Alternatively, the Transformer model can be run on a general-purpose graphics processor (GPU), where operations of the same type can be parallelized by executing generic SIMD instructions against a general-purpose register file and general-purpose stream processing units. In both approaches, different kinds of operations can only run one after another in time, and large-scale parallel execution of the same kind of computation requires a correspondingly large number of instructions, so data transfer and instruction count become the bottleneck of the computation.

Therefore, neither the general-purpose processor (CPU) nor the general-purpose graphics processor (GPU) of the prior art can run the Transformer model efficiently and with high performance.

The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.

Disclosure of Invention

The present invention is directed to an arithmetic device, a method, a medium, and a computing apparatus, which are used to solve the problems mentioned in the background art.

To achieve the above object, the present invention provides an arithmetic device comprising: a compiler, a memory, a register, and a controller, wherein,

the memory is used for compiling the calculation information of the Transformer model into large-scale data under the action of the compiler and the register and storing the large-scale data, wherein the large-scale data at least comprises operation instructions and operation data which are compiled and recombined according to time and space;

the register is used for uninterruptedly reading the operation instruction from the memory and storing the relevant information of the operation instruction;

the controller is used for constructing a control flow according to the operation instruction read by the register and the related information of the operation instruction, and controlling the memory to select the corresponding operation data for operation in a stream mode.

Optionally, the calculation information includes layer information, and the compiler divides all operations in each layer into a plurality of operation chains in a time dimension according to a dependency relationship, parallelizes the plurality of operation chains according to the physical operation space on the memory, and compiles the parallelized plurality of operation chains into the operation instruction.

Optionally, the compiler includes an operator fusion module and a decoding module;

the operator fusion module is used for generating a computation logic diagram according to the layer information;

the decoding module is used for dividing all operations in each layer into a plurality of operation chains in a time dimension according to the computational logic diagram and the dependency relationship, performing parallelization processing on the operation chains according to the physical space of the memory for operation, and compiling the operation chains subjected to the parallelization processing into an operation instruction set.

Optionally, the compiler is further configured to determine an implementation form of the Transformer model, and obtain the calculation information according to the implementation form of the Transformer model.

Optionally, the memory includes a parallel storage module, a data selection module, and a heterogeneous parallel computation module;

the parallel storage module is used for storing the large-scale data;

the data selection module is used for selecting a specific group of operation data from the parallel storage module in a stream mode according to a control flow signal of the controller;

the heterogeneous parallel computing module is used for operating a specific group of operation data selected by the data selection module in a stream mode.

Optionally, selecting a specific set of operation data from the parallel storage module in a stream mode includes: selecting operation data of a fixed length from the parallel storage module each time.

Optionally, the register includes an instruction control module and an instruction description module;

the instruction control module is used for recording an operation instruction read by the register from the memory;

the instruction description module is used for storing the relevant information of the operation instruction.

Optionally, the instruction description module includes a multidimensional register file, the register stores information related to the operation instruction in different register files according to functions, and all the register files share the same data storage logic and data reading logic, so that the register uninterruptedly provides the operation instruction and the information related to the operation instruction to the controller.

The invention also provides an operation method, which comprises the following steps:

acquiring calculation information of the Transformer model;

compiling the calculation information into large-scale data, wherein the large-scale data comprises operation instructions and operation data which are compiled and recombined according to time and space;

constructing a control flow according to the operation instructions and the related information of each operation instruction;

and, in response to the control flow, uninterruptedly selecting the corresponding operation data in a stream mode and completing the corresponding operations.

Optionally, obtaining calculation information of the Transformer model includes:

judging the implementation form of the Transformer model;

and acquiring the calculation information of the Transformer model according to the implementation form of the Transformer model.

Optionally, compiling the calculation information into large-scale data, where the large-scale data includes operation instructions and operation data compiled and recombined according to time and space, includes:

extracting layer information, parameter storage information and operation data from the calculation information;

disassembling the operation of each layer;

constructing the operation after disassembling each layer into a plurality of dependency chains according to the dependency relationship;

constructing a plurality of dependency chains of each layer into a plurality of operation chains in a time dimension;

and carrying out parallelization processing on the multiple operation chains of each layer in the spatial dimension.

The invention also proposes a storage medium on which a computer program is stored, which, when executed by a processor, carries out the above-described method.

The invention also proposes a computing device comprising a processor for implementing the above-mentioned method when executing a computer program stored in a memory.

According to the above description, the operation device provided by the invention recombines the operations of each layer of the Transformer model in time and space to form a control flow, and controls each operation to run uninterruptedly in the form of the control flow, thereby streamlining the complete execution of the Transformer model. In addition, large-scale parallel computation is controlled by the same instruction and logic of the controller, reducing the time and energy overhead caused by additional control logic.

Drawings

FIG. 1 is a schematic diagram of an embodiment of an arithmetic device according to the present invention;

FIG. 2 is a diagram illustrating an embodiment of a memory of the arithmetic device of the present invention;

FIG. 3 is a diagram of a Transformer model in an embodiment of the arithmetic device of the present invention;

FIG. 4 is a diagram of the steps of the method of operation of the present invention;

FIG. 5 is a flow chart of a method of operation of the present invention;

FIG. 6 is a schematic structural diagram of a storage medium according to the present invention;

FIG. 7 is a schematic diagram of a computing device according to the present invention.

Detailed Description

The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, device, method or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

According to an embodiment of the invention, an arithmetic device, a method, a medium and a computing device are provided.

In this document, it is to be understood that any number of elements in the figures are provided by way of illustration and not limitation, and any nomenclature is used for differentiation only and not in any limiting sense.

The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.

Exemplary devices

The present exemplary embodiment proposes an arithmetic device, as shown in fig. 1, comprising a compiler 100, a memory 300, a register 200 and a controller 400. The memory 300 is used for compiling the calculation information of the Transformer model into large-scale data under the action of the compiler 100 and the register 200 and storing the large-scale data, the large-scale data at least comprising operation instructions and operation data compiled and recombined according to time and space; the register 200 is used for uninterruptedly reading the operation instructions from the memory 300 and storing the relevant information of the operation instructions; the controller 400 is configured to construct a control flow according to the operation instructions read by the register 200 and their related information, and to control the memory 300 to select the corresponding operation data for operation in a stream mode.

Specifically, the compiler 100 may be a program for decoding a Transformer model. For example, in some embodiments the compiler 100 may be implemented in C++, identify whether the target program of the Transformer model runs on the TensorFlow platform or the PyTorch platform by means of a program interpretation function, and then obtain the model's computation information through the specific functions corresponding to the different platforms. The computation information of the Transformer model may include layer information, parameter storage information and operation data. The obtained operation data and parameter storage information are stored in the memory 300, while the operation instructions of the Transformer model are compiled and recombined in time and space according to the layer information, yielding the large-scale data. After the operation starts, the register 200 continuously reads the compiled and recombined operation instructions and their related information from the memory 300; the controller 400 then constructs a control flow from these instructions and their related information and controls the memory 300 to select an operation data stream from the large-scale data for operation in the stream mode. In this embodiment the selection of the operation data stream and the completion of the corresponding operations proceed under the control of the controller 400 with the aid of the register 200, and the whole operation process is completed on the memory 300, which improves the efficiency of the operation.
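For illustration only, the following minimal Python sketch shows how such implementation-form identification and information extraction might look; the function names and the module-name heuristic are assumptions of this sketch, not the embodiment's own interface (which implements the compiler 100 in C++).

```python
# Hypothetical sketch of implementation-form detection; names and the
# module-name heuristic are illustrative, not the embodiment's own API.
def identify_framework(model):
    """Guess whether a model object comes from PyTorch or TensorFlow/Keras."""
    module = type(model).__module__          # e.g. "torch.nn.modules.module"
    if module.startswith("torch"):
        return "pytorch"
    if module.startswith(("tensorflow", "keras")):
        return "tensorflow"
    raise ValueError("unsupported implementation form")

def extract_computation_info(model):
    """Dispatch to framework-specific calls to collect the computation info."""
    framework = identify_framework(model)
    if framework == "pytorch":
        layers = dict(model.named_modules())      # layer information
        params = dict(model.named_parameters())   # parameter storage information
    else:
        layers = {layer.name: layer for layer in model.layers}
        params = {w.name: w for w in model.weights}
    return {"framework": framework, "layers": layers, "parameters": params}
```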

Next, the compilation and recombination of operation instructions in time and space is described in detail with reference to this embodiment. Here the compiler 100 comprises an operator fusion module and a decoding module: the operator fusion module generates a computational logic diagram from the layer information, and the decoding module divides all operations in each layer into a plurality of operation chains in the time dimension according to the computational logic diagram, parallelizes the operation chains according to the physical operation space of the memory 300, and compiles the parallelized operation chains into an operation instruction set. Specifically, after the layer information is obtained, the operator fusion module generates the computational logic diagram; the decoding module splits the operations in each layer according to this diagram and then groups all operations into operation chains in the time dimension according to their dependency relationships. For example, suppose a certain layer contains ten operations, a first operation through a tenth operation, with the following dependencies: the second operation depends on the result of the first; the fourth on the third; the sixth on the fifth; the seventh on the results of the second, fourth and sixth; and the tenth on the results of the seventh, eighth and ninth. The first, third, fifth, eighth and ninth operations can then be grouped into one operation chain and processed together at the first time node; the second, fourth and sixth operations form a second chain processed at the second time node; the seventh operation forms a third chain run at the third time node; and the tenth operation forms a fourth chain run at the fourth time node. The original ten operations on ten time nodes are thus regrouped, according to their dependencies, into four operation chains on four time nodes. After this division each operation chain may still contain several operations, so each chain is further parallelized according to the physical operation space of the memory 300, so that the chain at each time node fits the available operation space; at this point the recombination of the operation instructions is complete. As is clear from the above, by recombining the operation instructions in the time dimension, operations that could run in parallel but sat on different time nodes are merged into the same time node, saving operation time; and the parallelization in the space dimension maximizes the use of the operation space and improves operation efficiency.
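The grouping just described is, in effect, an as-soon-as-possible levelling of a dependency graph. The short Python sketch below reproduces the ten-operation example; it illustrates the time-dimension division only and is not the embodiment's compiler.

```python
from collections import defaultdict

def build_operation_chains(num_ops, deps):
    """Group operations into time-dimension chains (ASAP levels).

    deps maps an operation to the operations whose results it consumes;
    operations that land in the same chain have no mutual dependencies
    and can be processed together at one time node.
    """
    level = {}

    def depth(op):
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps.get(op, ())), default=0)
        return level[op]

    chains = defaultdict(list)
    for op in range(1, num_ops + 1):
        chains[depth(op)].append(op)
    return [chains[t] for t in sorted(chains)]

# The ten-operation example from the text.
deps = {2: [1], 4: [3], 6: [5], 7: [2, 4, 6], 10: [7, 8, 9]}
print(build_operation_chains(10, deps))
# -> [[1, 3, 5, 8, 9], [2, 4, 6], [7], [10]]
```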

As shown in fig. 2, the memory 300, which stores the large-scale data, is now described in detail. In this embodiment the memory 300 comprises a parallel storage module 410, a data selection module 420 and a heterogeneous parallel computation module 430. The parallel storage module 410 stores the large-scale data; the data selection module 420 selects a specific group of operation data from the parallel storage module 410 in a stream mode according to the control flow signal of the controller 400; and the heterogeneous parallel computation module 430 operates on the selected group of operation data in the stream mode. Specifically, after the compiler 100 acquires the layer information, parameter storage information and operation data of the Transformer model, these are sent to the parallel storage module 410 for storage; once the operation instructions have been fully recombined and the memory 300 receives the control signal of the controller 400, the data selection module 420 selects a specific group of operation data from the parallel storage module 410, which the heterogeneous parallel computation module 430 then operates on. The parallel storage module 410 may be a nonvolatile parallel store for large-scale data, the heterogeneous parallel computation module 430 can complete the operations of the different network layers and different modes of the Transformer model, and the data selection module 420 can select from the large-scale data to complete the corresponding operations, so a single instruction can drive multi-level concurrent computation.
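As a software analogy only, the next sketch models the data path of fig. 2: a parallel store, a selector driven by control signals, and a compute stage. The class and its methods are assumptions made for illustration; the actual modules are hardware.

```python
import numpy as np

class ParallelMemory:
    """Toy software analogy of the memory 300 in fig. 2 (illustrative only)."""

    def __init__(self, scale_data):
        self.store = scale_data              # parallel storage module 410

    def select(self, offset, length):
        """Data selection module 420: pick one fixed-length group per signal."""
        return self.store[offset:offset + length]

    def compute(self, group, op):
        """Heterogeneous parallel computation module 430: one wide operation."""
        return op(group)

memory = ParallelMemory(np.arange(1024, dtype=np.float32))
group = memory.select(offset=0, length=128)   # stream mode: fixed length each time
result = memory.compute(group, np.tanh)       # one instruction drives parallel data
```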

Next, the register 200, which continuously reads the operation instructions from the memory 300 and stores their related information, is described in detail. In one embodiment the register 200 comprises an instruction control module and an instruction description module: the instruction control module records the operation instructions read by the register 200 from the memory 300, and the instruction description module stores the information related to the operation instructions. Specifically, after the compiler 100, with the aid of the register 200, splits and recombines the operations in each layer and compiles them into operation instructions, the instructions are sent to the memory 300. Once the Transformer model starts running, the instruction control module continuously reads and stores the operation instructions, and the instruction description module continuously reads and stores the related information from the memory 300: for example, the instruction description module records the calculation length, data source and calculation type of each operation instruction, while the instruction control module records the next calculation time, the next instruction position and similar sequencing information.
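The fields named in this paragraph suggest a per-instruction record roughly like the following dataclass; the field names and types are hypothetical, chosen only to mirror the text.

```python
from dataclasses import dataclass

@dataclass
class InstructionDescriptor:
    """Hypothetical record of what the register 200 keeps per instruction."""
    compute_length: int   # description module: how much data the step consumes
    data_source: int      # description module: offset into parallel storage
    compute_type: str     # description module: e.g. "matmul", "softmax"
    next_time: int        # control module: time node of the next instruction
    next_position: int    # control module: address of the next instruction
```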

In another embodiment, the instruction description module comprises a multidimensional register file. The register 200 stores the information related to the operation instructions in different register files according to their functions, and all register files share the same data storage logic and data reading logic, so that the register 200 can provide the operation instructions and their related information to the controller 400 without interruption.

After the compiler 100 has recombined the operation instruction set, the controller 400 constructs a control flow according to the information read and stored by the register 200. The controller 400 first decodes each instruction and then, according to the control information carried by the decoded instruction, controls the memory 300 to select the corresponding operation data for operation. Because the controller 400 drives the memory 300 in the form of a control flow, the memory 300 can run every operation of every layer of the Transformer model without interruption: the controller 400 issues control instructions to the memory 300 uninterruptedly, the data selection module 420 uninterruptedly selects specific groups of operation data from the parallel storage module 410 in the stream mode according to the control flow signal, and the heterogeneous parallel computation module 430 performs the operations uninterruptedly in the stream mode. The controller 400 also obtains the specifics of each step from the register 200, such as the current calculation length, data source, calculation type, next calculation time and next instruction position, which guarantees functional correctness while obtaining the maximum performance improvement.
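A minimal sketch of this decode-select-compute loop is given below, reusing the hypothetical ParallelMemory and InstructionDescriptor sketches above; register.fetch() is an assumed interface yielding decoded descriptors without interruption, and the operation table is illustrative.

```python
import numpy as np

# Illustrative mapping from a calculation type to an executable operation.
OPS = {"tanh": np.tanh, "exp": np.exp}

def run_control_flow(register, memory):
    """Sketch of the controller 400's stream-mode loop (assumed interfaces)."""
    for instr in register.fetch():                       # uninterrupted instruction stream
        group = memory.select(instr.data_source,         # control signal selects data
                              instr.compute_length)
        memory.compute(group, OPS[instr.compute_type])   # stream-mode operation
```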

In another embodiment, selecting a specific group of operation data from the parallel storage module 410 in the stream mode by the data selection module 420 includes: selecting operation data of a fixed length from the parallel storage module 410 each time.

Another embodiment of the present device is described with reference to the specific architecture of the Transformer model shown in fig. 3. Part A of fig. 3 is the specific architecture of the Transformer model in this embodiment, and part B is the computational logic diagram generated by the operator fusion module from the Transformer's layer information. From the generated computational logic diagram, the decoding module can recombine the operations of each layer of the Transformer model in time and space, after which the Transformer model is run according to the principles explained above.

The operation device provided by the invention recombines the operations of each layer of the Transformer model in time and space to form a control flow, and controls each operation to run uninterruptedly in the form of that control flow, thereby streamlining the complete execution of the Transformer model; because the parameter storage information and the operation data participating in the operations are confined to the memory 300, all operations complete within it, improving the operation speed and reducing the operation energy consumption. In addition, large-scale parallel computation is controlled by the same instruction and logic of the controller 400, reducing the extra time and energy overhead of control logic.

Exemplary method

The exemplary embodiment proposes an arithmetic method, as shown in fig. 4, the method including the steps of:

step S100: and acquiring the calculation information of the Transformer model.

Step S200: and compiling the calculation information into large-scale data, wherein the large-scale data at least comprises operation instructions and operation data which are compiled and recombined according to time and space.

Step S300: and constructing a control flow according to the operation instructions and the related information of each operation instruction.

Step S400: and responding to the control flow, continuously selecting corresponding operation data according to a flow mode, and completing corresponding operation.

Referring to fig. 5, the above steps are now described. For step S100, before the calculation information of the Transformer model can be obtained, the implementation form of the Transformer model must first be judged, that is, the implementation form of the model algorithm must be identified; since the Transformer model can be implemented on different platforms, model form identification may have to be attempted several times until the correct implementation form is found. The corresponding function is then selected to obtain the calculation information of the Transformer model.
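These repeated identification attempts can be pictured as trying a list of framework probes until one matches, as in the short sketch below; the probe interface is an assumption of this sketch, not the embodiment's own.

```python
def identify_implementation_form(model, probes):
    """Try each framework probe in turn until one recognizes the model."""
    for name, predicate in probes:
        if predicate(model):
            return name
    raise ValueError("no known implementation form matched")

# Illustrative probes based on module names; real detection may differ.
probes = [
    ("pytorch", lambda m: type(m).__module__.startswith("torch")),
    ("tensorflow", lambda m: type(m).__module__.startswith(("tensorflow", "keras"))),
]
```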

After the calculation information of the Transformer model is acquired, step S200 is performed, which includes the following steps:

step S210: and extracting layer information, parameter storage information and operation data from the calculation information.

Step S220: and disassembling the operation of each layer.

Step S230: and constructing the operation after the disassembly of each layer into a plurality of dependency chains according to the dependency relationship.

Step S240: and constructing a plurality of dependency chains of each layer into a plurality of operation chains in a time dimension.

Step S250: and carrying out parallelization processing on the multiple operation chains of each layer in the spatial dimension.

The specific method of each step from step S210 to step S250 may refer to the specific method of the apparatus in the exemplary embodiment, which is not described herein again.

After the operation instructions are recomposed and compiled in step S200, step S300 is performed to construct a control flow according to the operation instructions and the related information of each operation instruction.

Then step S400 is performed: in response to the control flow, the corresponding operation data is selected and the corresponding operations are completed uninterruptedly in the stream mode. The control flow constructed in step S300 issues each control instruction in turn; in response to each received control instruction, the operation data is selected and imported according to that instruction and large-scale parallelized data operations are performed, and when the control flow ends the model has completed its run. The control flow contains all the operation instructions of the Transformer model from beginning to end, including but not limited to the current calculation length, the calculation data source, the calculation type, the next calculation time and the next instruction position. Because the control flow is an uninterrupted, stream-like sequence of control instructions, the operation of the Transformer model can be completed in an uninterrupted streaming fashion in response to it: under the control instructions of the control flow, each operation of the Transformer model forms a data stream, and all operations are completed from beginning to end, improving computational efficiency.
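To make the link between steps S300 and S400 concrete, the sketch below assembles a control flow from the time-ordered operation chains using the hypothetical InstructionDescriptor above; all field values are placeholders chosen for illustration.

```python
def build_control_flow(chains, length=128):
    """Sketch of step S300: link per-time-node operation chains into one flow."""
    flow = []
    for t, chain in enumerate(chains):
        for op in chain:
            flow.append(InstructionDescriptor(
                compute_length=length,        # placeholder fixed-length group
                data_source=op * length,      # placeholder offset per operation
                compute_type="tanh",          # placeholder calculation type
                next_time=t + 1,              # ops in one chain share a time node
                next_position=len(flow) + 1,  # next instruction in the stream
            ))
    return flow
```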

Exemplary Medium

Having described the devices and methods of the exemplary embodiments of the present invention, a computer-readable storage medium of an exemplary embodiment is next described with reference to FIG. 6, which shows a computer-readable storage medium in the form of an optical disc 70 having a computer program (i.e., a program product) stored thereon. When executed by a processor, the computer program implements the steps described in the above method embodiments, for example: obtaining the calculation information of the Transformer model; compiling the calculation information into large-scale data, the large-scale data at least comprising operation instructions and operation data compiled and recombined according to time and space; constructing a control flow according to the operation instructions and the related information of each operation instruction; and, in response to the control flow, uninterruptedly selecting the corresponding operation data in a stream mode and completing the corresponding operations. The specific implementation of each step is not repeated here.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory, or other optical and magnetic storage media, which are not described in detail herein.

Exemplary computing device

Having described the method, system, and media of exemplary embodiments of the present invention, a computing device of an exemplary embodiment of the present invention is next described with reference to FIG. 7.

FIG. 7 illustrates a block diagram of an exemplary computing device 80 suitable for implementing embodiments of the present invention; the computing device 80 may be a computer system or a server. The computing device 80 shown in FIG. 7 is only an example and should not impose any limitation on the functionality or scope of use of embodiments of the present invention.

As shown in fig. 7, components of computing device 80 may include, but are not limited to: one or more processors or processing units 801, a system memory 802, and a bus 803 that couples various system components including the system memory 802 and the processing unit 801.

Computing device 80 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computing device 80 and includes both volatile and nonvolatile media, removable and non-removable media.

System memory 802 may include computer system readable media in the form of volatile storage, such as Random Access Memory (RAM) 8021 and/or cache storage 8022. Computing device 80 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, ROM 8023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, and typically referred to as a "hard disk drive"). Although not shown in FIG. 7, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 803 by one or more data media interfaces. The system memory 802 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.

Program/utility 8025, having a set (at least one) of program modules 8024, can be stored in, for example, system memory 802, and such program modules 8024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment. Program modules 8024 generally perform the functions and/or methodologies of embodiments of the present invention as described herein.

Computing device 80 may also communicate with one or more external devices 804 (e.g., keyboard, pointing device, display, etc.). Such communication may be through an input/output (I/O) interface. Moreover, computing device 80 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via network adapter 806. As shown in FIG. 7, the network adapter 806 communicates with other modules of the computing device 80, such as the processing unit 801, over the bus 803. It should be appreciated that although not shown in FIG. 7, other hardware and/or software modules may be used in conjunction with computing device 80.

The processing unit 801 executes various functional applications and data processing by running programs stored in the system memory 802, for example: acquiring the calculation information of the Transformer model; compiling the calculation information into large-scale data, the large-scale data at least comprising operation instructions and operation data compiled and recombined according to time and space; constructing a control flow according to the operation instructions and the related information of each operation instruction; and, in response to the control flow, uninterruptedly selecting the corresponding operation data in a stream mode and completing the corresponding operations. The specific implementation of each step is not repeated here. It should be noted that although several units/modules or sub-units/sub-modules of the arithmetic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided among a plurality of units/modules.

In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.

In the embodiments provided by the present invention, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection of systems or units through some communication interfaces, and may be electrical, mechanical or of another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-transitory computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computing device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, used to illustrate the technical solutions of the present invention and not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, easily conceive of changes to them, or make equivalent substitutions for some of their technical features within the technical scope of the present disclosure; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall all be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

Through the above description, the embodiments of the present invention provide the following technical solutions, but are not limited thereto:

1. an arithmetic device for executing a Transformer model, comprising a compiler, a memory, a register, and a controller,

the memory is used for compiling the calculation information of the Transformer model into large-scale data under the action of the compiler and the register and storing the large-scale data, wherein the large-scale data at least comprises operation instructions and operation data which are compiled and recombined according to time and space;

the register is used for uninterruptedly reading the operation instruction from the memory and storing the relevant information of the operation instruction;

the controller is used for constructing a control flow according to the operation instruction read by the register and the related information of the operation instruction, and controlling the memory to select the corresponding operation data for operation in a stream mode.

2. The arithmetic device according to claim 1, wherein the calculation information includes layer information, and the compiler divides all operations in each layer into a plurality of operation chains in a time dimension according to a dependency relationship, parallelizes the plurality of operation chains according to the physical operation space on the memory, and compiles the parallelized plurality of operation chains into the operation instruction.

3. The arithmetic device according to claim 1 or 2, wherein the compiler includes an operator fusion module and a decoding module;

the operator fusion module is used for generating a computation logic diagram according to the layer information;

the decoding module is used for dividing all operations in each layer into a plurality of operation chains in a time dimension according to the computational logic diagram and the dependency relationship, performing parallelization processing on the operation chains according to the physical space of the memory for operation, and compiling the operation chains subjected to the parallelization processing into an operation instruction set.

4. The arithmetic device according to any one of claims 1 to 3, wherein the compiler is further configured to determine an implementation form of the Transformer model, and acquire the calculation information according to the implementation form of the Transformer model.

5. The arithmetic device according to any one of claims 1 to 4, wherein the memory includes a parallel storage module, a data selection module, and a heterogeneous parallel computation module;

the parallel storage module is used for storing the large-scale data;

the data selection module is used for selecting a specific group of operation data from the parallel storage module in a stream mode according to a control flow signal of the controller;

the heterogeneous parallel computing module is used for operating a specific group of operation data selected by the data selection module in a stream mode.

6. The arithmetic device according to any one of claims 1 to 5, wherein selecting a specific set of operation data from the parallel storage module in a stream mode includes: selecting operation data of a fixed length from the parallel storage module each time.

7. The arithmetic device according to any one of claims 1 to 6, wherein the register includes an instruction control module and an instruction description module;

the instruction control module is used for recording an operation instruction read by the register from the memory;

the instruction description module is used for storing the relevant information of the operation instruction.

8. The arithmetic device according to any one of claims 1 to 7, wherein the instruction description module includes a multidimensional register file, the register stores information related to the arithmetic instruction in different register files according to functions, and all the register files share the same data storage logic and data reading logic, so that the register provides the arithmetic instruction and information related to the arithmetic instruction to the controller without interruption.

9. An operation method for operating a Transformer model, comprising:

acquiring calculation information of the Transformer model;

compiling the calculation information into large-scale data, wherein the large-scale data at least comprises operation instructions and operation data which are compiled and recombined according to time and space;

constructing a control flow according to the operation instructions and the related information of each operation instruction;

and, in response to the control flow, uninterruptedly selecting the corresponding operation data in a stream mode and completing the corresponding operations.

10. The operation method according to claim 9, wherein obtaining the calculation information of the Transformer model includes:

judging the implementation form of the Transformer model;

and acquiring the calculation information of the Transformer model according to the implementation form of the Transformer model.

11. The operation method according to claim 9 or 10, wherein compiling the calculation information into large-scale data, the large-scale data at least comprising operation instructions and operation data compiled and recombined according to time and space, includes:

extracting layer information, parameter storage information and operation data from the calculation information;

disassembling the operation of each layer;

constructing the operation after disassembling each layer into a plurality of dependency chains according to the dependency relationship;

constructing a plurality of dependency chains of each layer into a plurality of operation chains in a time dimension;

and carrying out parallelization processing on the multiple operation chains of each layer in the spatial dimension.

12. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any of claims 9-11.

13. A computing device, comprising a processor configured to implement the method according to any of claims 9-11 when executing a computer program stored in a memory.
