Optimizing hardware FIFO instructions

Document No.: 74757  Publication date: 2021-10-01

Reading note: This technique, "Optimizing hardware FIFO instructions", was devised by 徐元仲, 詹姆斯·M·施蒂希诺特, and 戴维·亚历山大·迈内默 on 2019-11-26. Abstract: Methods, systems, and apparatus are described for scheduling first-in-first-out (FIFO) instructions. In one aspect, a method includes receiving data representing code of a program to be executed by a processing unit having a hardware processor. For each of one or more hardware processors, an order of the independent FIFO instruction groups for execution by the hardware processor is identified in the data representing the code. For each independent FIFO instruction group for execution by the hardware processor, a path length metric is determined that indicates how long it takes to reach the end of the program from the independent FIFO instruction group. A new order of the independent FIFO instruction groups for execution by the hardware processor is generated based at least on the path length metric for each independent FIFO instruction group for execution by the hardware processor.

1. A method performed by one or more data processing apparatus, the method comprising:

receiving data representing code of a program to be executed by a processing unit including a hardware processor;

for each of one or more hardware processors:

identifying, in the data representing the code of the program, an order of independent first-in-first-out (FIFO) instruction groups for execution by the hardware processor;

for each independent FIFO instruction group for execution by the hardware processor, determining a path length metric that represents how long it takes to reach the end of the program from the independent FIFO instruction group; and

generating a new order of the independent FIFO instruction groups for execution by the hardware processor based at least on the path length metric for each independent FIFO instruction group;

generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors;

generating output code for execution by the hardware processor based on the update code for the program; and

providing the output code to the hardware processor.

2. The method of claim 1, wherein the data representing the code of the program comprises a compiler-intermediate representation of a source code of the program.

3. The method of claim 1 or 2, wherein the output code comprises compiled machine code executable by the one or more hardware processors.

4. The method of any preceding claim, further comprising generating a graph representing a sequence of instructions for the program based on the data representing the code of the program, wherein generating the new order of the independent FIFO instruction groups for a given hardware processor comprises reordering the independent FIFO instruction groups in the graph to generate an updated graph.

5. The method of claim 4, wherein generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors comprises: generating an updated instruction sequence based on the updated graph.

6. The method of any preceding claim, wherein generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors comprises:

identifying instructions in the updated graph that are not identified as FIFO instructions;

for one or more of the identified instructions:

selecting an independent FIFO instruction group based at least on a path length between the independent FIFO instruction group and an identified instruction in the updated graph; and

scheduling the identified instructions and the selected independent FIFO instruction groups in the update code.

7. The method of claim 6, wherein a plurality of hardware processors execute independent FIFO instruction groups, the method further comprising: determining, for at least one of the plurality of hardware processors, an order of the independent FIFO instruction groups to be executed by the at least one hardware processor based on whether the independent FIFO instruction groups are on a critical path of the program.

8. The method of claim 7, further comprising:

determining that a first hardware processor includes a plurality of independent FIFO instruction groups; and

determining that an independent FIFO instruction group of a second hardware processor is an instruction to receive data output by one of the plurality of independent FIFO instruction groups of the first hardware processor,

wherein generating the new order for the independent FIFO instruction groups for execution by the first hardware processor comprises: in response to determining that the independent FIFO instruction group of the second hardware processor is an instruction that receives data output by one of the plurality of independent FIFO instruction groups of the first hardware processor, moving the one independent FIFO instruction group of the first hardware processor to execute before one or more other independent FIFO instruction groups of the first hardware processor.

9. The method of any preceding claim, wherein each independent FIFO instruction group comprises one or more input instructions for providing input data to the hardware processor, and one or more output instructions for outputting data from the hardware processor based on the input data.

10. The method of any preceding claim, wherein a particular hardware processor of the processing unit is a configurable hardware processor that executes instructions using a plurality of different configurations, the method further comprising:

identifying, in the data representing the code of the program, a set of FIFO instructions to be executed by the particular hardware processor using a same configuration of the particular hardware processor; and

determining, in the update code for the program, a new order for the set of FIFO instructions such that the set of FIFO instructions is executed in sequence without any intervening FIFO instructions requiring a different configuration of the particular hardware processor.

11. The method of any preceding claim, wherein the one or more hardware processors comprise a plurality of processors, the method further comprising:

for each of the plurality of hardware processors, determining a performance metric representing how long the hardware processor takes to complete all FIFO instructions for the hardware processor;

determining an order of evaluation of the plurality of hardware processors based on the performance metrics for each of the plurality of hardware processors; and

determining a new order of the independent FIFO instruction groups for each of the plurality of hardware processors, one at a time, in accordance with the determined evaluation order.

12. A system, comprising:

a data processing device; and

a memory storage device in data communication with the data processing device, the memory storage device storing instructions executable by the data processing device and which, when executed, cause the data processing device to perform operations comprising:

receiving data representing code of a program to be executed by a processing unit including a hardware processor;

for each of one or more hardware processors:

identifying, in the data representing the code of the program, an order of independent first-in-first-out (FIFO) instruction groups for execution by the hardware processor;

for each independent FIFO instruction group for execution by the hardware processor, determining a path length metric that represents how long it takes to reach the end of the program from the independent FIFO instruction group; and

generating a new order of the independent FIFO instruction groups for execution by the hardware processor based at least on the path length metric for each independent FIFO instruction group;

generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors;

generating output code for execution by the hardware processor based on the update code for the program; and

providing the output code to the hardware processor.

13. The system of claim 12, wherein the data representing the code of the program comprises a compiler-intermediate representation of source code of the program.

14. The system of claim 12 or 13, wherein the output code comprises compiled machine code executable by the one or more hardware processors.

15. The system of any of claims 12 to 14, wherein the operations comprise generating a graph representing a sequence of instructions for the program based on the data representing the code of the program, wherein generating the new order of the independent FIFO instruction groups for a given hardware processor comprises reordering the independent FIFO instruction groups in the graph to generate an updated graph.

16. The system of claim 15, wherein generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors comprises: generating an updated instruction sequence based on the updated graph.

17. The system of any of claims 12 to 16, wherein generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors comprises:

identifying instructions in the updated graph that are not identified as FIFO instructions;

for one or more of the identified instructions:

selecting an independent FIFO instruction group based at least on a path length between the independent FIFO instruction group and an instruction already identified in the updated graph; and

scheduling the identified instructions and the selected independent FIFO instruction groups in the update code.

18. The system of claim 17, wherein a plurality of hardware processors execute independent FIFO instruction groups, the operations further comprising: determining, for at least one of the plurality of hardware processors, an order of the independent FIFO instruction groups to be executed by the at least one hardware processor based on whether the independent FIFO instruction groups are on a critical path of the program.

19. The system of claim 18, wherein the operations comprise:

determining that a first hardware processor includes a plurality of independent FIFO instruction groups; and

determining that an independent FIFO instruction group of a second hardware processor is an instruction to receive data output by one of the plurality of independent FIFO instruction groups of the first hardware processor,

wherein generating the new order for the independent FIFO instruction groups for execution by the first hardware processor comprises: in response to determining that the independent FIFO instruction group of the second hardware processor is an instruction that receives data output by one of the plurality of independent FIFO instruction groups of the first hardware processor, moving the one independent FIFO instruction group of the first hardware processor to execute before one or more other independent FIFO instruction groups of the first hardware processor.

20. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that, when executed by one or more data processing apparatus, cause the one or more data processing apparatus to perform operations comprising:

receiving data representing code of a program to be executed by a processing unit including a hardware processor;

for each of one or more hardware processors:

identifying, in the data representing the code of the program, an order of independent first-in-first-out (FIFO) instruction groups for execution by the hardware processor;

for each independent FIFO instruction group for execution by the hardware processor, determining a path length metric that represents how long it takes to reach the end of the program from the independent FIFO instruction group; and

generating a new order of the independent FIFO instruction groups for execution by the hardware processor based at least on the path length metric for each independent FIFO instruction group;

generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors;

generating output code for execution by the hardware processor based on the update code for the program; and

providing the output code to the hardware processor.

Background

Some processors (e.g., domain-specific processors) may include long-latency, high-throughput hardware units. For example, some processors may include systolic arrays to efficiently compute complex numerical workloads. The natural programming interface for such hardware units is a first-in-first-out (FIFO) queue defined in the Instruction Set Architecture (ISA) of the processor, since there are instructions responsible for pushing data to and popping data from these units. However, hardware FIFOs impose ordering constraints on instructions and may maintain additional configuration state, making it difficult for a compiler to efficiently optimize and schedule instructions.

Disclosure of Invention

This specification describes techniques relating to improving the order of FIFO instructions, e.g., at compile time, to reduce latency in performing machine learning and other computations.

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include receiving data representing code of a program to be executed by a processing unit that includes a hardware processor. The method includes, for each of one or more hardware processors: identifying, in the data representing the code of the program, an order of independent first-in-first-out (FIFO) instruction groups for execution by the hardware processor; for each independent FIFO instruction group for execution by the hardware processor, determining a path length metric, the path length metric representing how long it will take to reach the end of the program from the independent FIFO instruction group; and generating a new order for the independent FIFO instruction groups for execution by the hardware processor based at least on the path length metric for each independent FIFO instruction group for execution by the hardware processor. Update code for the program is generated based at least on each new order of independent FIFO instruction groups for the one or more hardware processors. Based on the update code for the program, output code is generated for execution by the hardware processor. The output code is provided to the hardware processor. Other embodiments of this and other aspects include corresponding systems, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices. A system of one or more computers can be so configured by virtue of software, firmware, hardware, or a combination thereof installed on the system that, in operation, causes the system to perform the actions. One or more computer programs can be so configured by means of instructions which, when executed by data processing apparatus, cause the apparatus to perform the actions.

These and other embodiments may each optionally include one or more of the following features. In some aspects, the data representing the code of the program comprises a compiler-intermediate representation of the source code of the program. In some aspects, the output code comprises compiled machine code executable by one or more hardware processors.

Some aspects include generating a graph representing a sequence of instructions for the program based on the data representing the code of the program. Generating a new order of independent FIFO instruction groups for a given hardware processor may include reordering the independent FIFO instruction groups in the graph to generate an updated graph. Generating update code for the program based at least on each new order of independent FIFO instruction groups for the one or more hardware processors may include generating an updated instruction sequence based on the updated graph.

In some aspects, generating update code for the program based at least on each new order of independent FIFO instruction groups for one or more hardware processors may include: identifying instructions in the updated graph that are not identified as FIFO instructions; for one or more of the identified instructions: selecting an independent FIFO instruction group based at least on a path length between the independent FIFO instruction group and an identified instruction in the updated graph; and scheduling the identified instructions and the selected independent FIFO instruction groups in the update code.

In some aspects, multiple hardware processors execute independent FIFO instruction groups. Some aspects may include determining, for at least one of the plurality of hardware processors, an order of the independent FIFO instruction groups to be executed by the at least one hardware processor based on whether the independent FIFO instruction groups are on a critical path of the program. Some aspects may include determining that a first hardware processor includes a plurality of independent FIFO instruction groups; and determining that an independent FIFO instruction group of a second hardware processor is an instruction that receives data output by one of the plurality of independent FIFO instruction groups of the first hardware processor. In some aspects, generating the new order of independent FIFO instruction groups for execution by the first hardware processor may include moving one independent FIFO instruction group of the first hardware processor to execute before one or more other independent FIFO instruction groups of the first hardware processor, in response to determining that an independent FIFO instruction group of the second hardware processor is an instruction that receives data output by one of the plurality of independent FIFO instruction groups of the first hardware processor.

In some aspects, each independent FIFO instruction group includes one or more input instructions for providing input data to the hardware processor and one or more output instructions for outputting data from the hardware processor based on the input data. In some aspects, a particular hardware processor of a processing unit is a configurable hardware processor that executes instructions using a plurality of different configurations. Some aspects may include identifying, in the data representing the code of the program, a set of FIFO instructions to be executed by a particular hardware processor using a same configuration of the particular hardware processor; and determining a new order for the set of FIFO instructions in the update code for the program such that the set is executed in sequence without any intervening FIFO instructions requiring a different configuration of the particular hardware processor.

In some aspects, the one or more hardware processors comprise a plurality of processors. Some aspects may include determining, for each of a plurality of hardware processors, a performance metric representing how long the hardware processor will take to complete all FIFO instructions for the hardware processor; determining an evaluation order for the plurality of hardware processors based on the performance metrics for each of the plurality of hardware processors; and determining a new order of the independent FIFO instruction groups for each of the plurality of hardware processors one at a time in accordance with the determined evaluation order.

The subject matter described in this specification can be implemented in particular embodiments to realize one or more of the following advantages. By generating an improved order of the FIFO instructions used by the hardware processor to input and output data, the latency of executing the program may be reduced. Non-FIFO instructions may be grouped with FIFO instructions (e.g., with the closest FIFO instruction) to further reduce the latency of executing programs. For example, by grouping an instruction that provides the data a FIFO instruction loads into a hardware processor together with that FIFO instruction, the two can be scheduled together so that the amount of time the FIFO instruction must wait for the data is reduced. By reducing the latency of executing a program, complex workloads, such as deep neural network workloads, may be completed faster and more efficiently. Thus, by applying the described techniques at compile time, a given program may execute at runtime with reduced latency relative to code produced by a conventional compiler for the same hardware processor. Furthermore, by taking the characteristics of a particular hardware processor into account while generating the improved order of FIFO instructions, execution of the FIFO instructions can be tailored to the particular hardware processor present in the system that executes them.

The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

Drawings

FIG. 1 is a block diagram of an exemplary environment in which a program processing system modifies a program and performs computations using the modified program.

FIG. 2 depicts a directed graph of instructions of a program and an updated graph of instructions of the program.

FIG. 3 is a flow chart illustrating an exemplary process for modifying a program.

Like reference numbers and designations in the various drawings indicate like elements.

Detailed Description

In general, the subject matter described in this specification relates to scheduling hardware FIFO instructions for a hardware processor to reduce the overall latency of a program executed by the processor. The processing unit may include multiple hardware processors, including long-latency, high-throughput hardware processors that efficiently perform complex computations. For example, a long-latency hardware processor may be configured to perform matrix multiplication or other complex calculations of machine learning models (or of digital signal processing, video encoding/decoding, etc.) using dedicated hardware, and such calculations may take multiple processor cycles to produce an output.

These long-latency hardware processors may be programmed via push instructions that send input data to the hardware processor and pop instructions that extract result data from the hardware processor to provide to another hardware processor or another FIFO. Pushes and pops typically occur in FIFO order, so the hardware processor can be modeled as a FIFO queue. In some cases, the hardware processor may also maintain additional state representing its configuration, which may be changed via additional instructions. For example, a hardware processor may have multiple configurations, and the hardware processor may maintain state information specifying the current configuration.
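For illustration, such a unit can be modeled as a queue paired with a configuration state. The `HardwareFifo` class below is a hypothetical sketch, not part of the described system:

```python
from collections import deque

class HardwareFifo:
    """Toy model of a long-latency unit programmed through a FIFO:
    pushes supply inputs, pops retrieve results in FIFO order, and a
    separate instruction changes the unit's configuration state."""

    def __init__(self, config="default"):
        self.queue = deque()
        self.config = config

    def push(self, value):
        # Each input is tagged with the configuration active when pushed.
        self.queue.append((self.config, value))

    def pop(self):
        # Results come out strictly in FIFO order.
        return self.queue.popleft()

    def configure(self, config):
        # Changes state for subsequent pushes only.
        self.config = config
```

A pop always returns the oldest outstanding push, regardless of configuration changes made in between.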

Some hardware processors generate results based on a single input. In this case, one pop (e.g., output) from the FIFO corresponds to one push (e.g., input) to the FIFO. Other hardware processors require multiple inputs to generate one or more results. In this case, a set of one or more pops from the FIFO corresponds to a set of pushes to the FIFO. For example, a hardware processor performing a matrix multiplication operation may involve multiple pushes providing the values of a matrix to the hardware processor and one or more pops that output the computed values to another hardware processor or a FIFO. Such a group of related pushes, together with the corresponding pops based on those pushes, represents an indivisible computational unit executed by the hardware processor. Each group of one or more pops from a FIFO corresponding to one or more pushes to the same FIFO (e.g., each group of one or more pops that output one or more results based on one or more pushes to the same FIFO) is an independent FIFO instruction group. Different independent FIFO instruction groups may be reordered to reduce latency. In some embodiments, FIFO instructions within an independent group are not reordered.
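A minimal sketch of this grouping, assuming a simplified instruction encoding (the `FifoInstr` representation is hypothetical): for one FIFO's instruction stream, a group is a maximal run of pushes followed by the pops that consume their results.

```python
from dataclasses import dataclass

@dataclass
class FifoInstr:
    kind: str   # "push" or "pop"
    fifo: str   # which hardware FIFO this instruction targets

def group_fifo_instructions(instrs):
    """Split one FIFO's instruction stream into independent groups.

    A new group starts at the first push that follows a pop, since the
    preceding pushes and pops form an indivisible computational unit."""
    groups = []
    current = []
    for ins in instrs:
        if ins.kind == "push" and current and current[-1].kind == "pop":
            groups.append(current)   # the previous group is complete
            current = []
        current.append(ins)
    if current:
        groups.append(current)
    return groups
```

For a stream push, push, pop, push, pop this yields two groups: the first with two pushes and one pop, the second with one push and one pop.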

Dependencies between instructions can be modeled in a directed graph. Instructions are vertices or nodes in the graph, and dependencies are edges between vertices or nodes. The graph may include a directed edge from a preceding instruction to a subsequent instruction if the subsequent instruction must wait for the preceding instruction to complete before it begins. The length (weight) of an edge may represent the latency between the instructions. The graph may encode all types of dependencies among the instructions, including producer-to-consumer dependencies, write-before-read memory dependencies, and FIFO ordering constraints.
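Such a graph can be sketched as an adjacency map from each instruction to its (successor, latency) pairs. The instruction names below are purely illustrative:

```python
def add_edge(graph, pred, succ, latency):
    """Record that `succ` may start only `latency` cycles after `pred`
    completes; `graph` maps each node to a list of (successor, latency)."""
    graph.setdefault(pred, []).append((succ, latency))
    graph.setdefault(succ, [])   # ensure the successor appears as a node

graph = {}
# Producer-to-consumer data dependency: pop_a's result feeds push_b.
add_edge(graph, "pop_a", "push_b", 1)
# FIFO ordering constraint: a later push waits for the earlier push.
add_edge(graph, "push_b", "push_c", 2)
```

Both kinds of dependency share the same edge form; only the latency and the reason for the edge differ.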

There are various types of FIFO ordering dependencies, including control dependencies and data dependencies. Control dependencies are imposed by the sequential FIFO processing of the hardware processor. A control dependency edge from a first instruction to a second instruction indicates that the second instruction executes after the first instruction and must wait for the first instruction to complete before it starts. A data dependency from a first instruction to a second instruction indicates that the second instruction uses data generated by the first instruction. For example, a data dependency may indicate that a pop of an operation's result must wait for another instruction, e.g., a push of the operation's input.

An example of a control dependency is a subsequent push (or pop) of a FIFO queue, which must wait for a previous push or pop of the same FIFO queue. In this example, the latency between instructions is based on the throughput of the hardware processor. Another example of a control dependency is when the FIFO queue does not have enough space for a new push. The new push must wait for the pop, which will make enough room in the FIFO queue for the new push. An example of data dependency is that in a separate FIFO instruction group, a result pop must wait for an input push. In this example, the hardware processor performs a calculation using data received by the input push to determine data output by the output pop. The latency in this example is determined by the delay of the hardware processor. Each of these dependencies limits how the order of the FIFO instructions is determined or modified.

The FIFO control dependencies may prevent optimal instruction scheduling for the hardware processor if the original order of the FIFO instructions is not optimal. For example, consider two hardware FIFOs (e.g., hardware processors modeled as FIFO queues), FIFO1 and FIFO2. Suppose also that there are 10 independent pushes to FIFO1 and that the final result from FIFO1 is used as an input to the first push to FIFO2. FIFO control dependencies then require that all 10 FIFO1 operations complete before the first FIFO2 push can be scheduled. To address this problem, a new order of FIFO instructions may be determined, for example, before the final control dependencies are fixed. In this example, the new order may move the FIFO1 instruction group that produces the result needed by FIFO2 so that it produces the first result of FIFO1, reducing the time FIFO2 waits for this result. The techniques described herein use performance metrics, critical-path heuristics, and other considerations to schedule independent FIFO instruction groups and other related instructions so that the execution latency of the program is reduced.
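The path length metric described above can be sketched as the longest-latency path from a group's first instruction to the end of the program, with groups then reordered so that longer paths are scheduled first. This is a heuristic sketch; the function names and group representation are hypothetical:

```python
import functools

def make_path_length(graph):
    """Return a function giving the longest-latency path from a node to
    the end of the program (a node with no successors has length 0)."""
    @functools.lru_cache(maxsize=None)
    def path_length(node):
        succs = graph.get(node, [])
        if not succs:
            return 0
        return max(latency + path_length(succ) for succ, latency in succs)
    return path_length

def reorder_groups(groups, entry, path_length):
    """Schedule independent FIFO instruction groups with the longest
    remaining path first, so critical-path groups execute earlier.
    `entry` maps each group to its first instruction node in the graph."""
    return sorted(groups, key=lambda g: path_length(entry[g]), reverse=True)
```

In the FIFO1/FIFO2 example, the group feeding FIFO2 has the longest remaining path (it gates all of FIFO2's work), so this ordering moves it to the front.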

FIG. 1 is a block diagram of an exemplary environment 100 in which a program processing system 130 modifies a program and performs computations using the modified program. The program processing system 130 may receive the program 112 from the user device 110 over a data communication network 120 (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, a mobile network, or a combination thereof).

The programs 112 may include workloads, such as a machine learning workload, a digital signal processing workload, a video encoding/decoding workload, or other suitable workloads. For example, each program 112 may define or be in the form of a directed computation graph that specifies operators for workloads. Each operator represents an operation, such as a numerical operation, to be performed by the processing unit 160 of the program processing system 130. For example, the operators may represent mathematical operations (e.g., addition, multiplication, maximum, minimum, etc.), tensor operations (e.g., shaping, transposing, etc.), data replication operations, and/or other suitable operations.

Program processing system 130 includes a compiler 140 that compiles program 112 into machine instructions for a processing unit 160. Processing unit 160 may include one or more hardware processors that execute machine instructions to perform computations defined by program 112 and/or a modified version of the program, as described below. In this example, processing unit 160 includes systolic array 162 of hardware processors 164. Other suitable arrangements of processors may also be used. For example, processing unit 160 may be a Graphics Processing Unit (GPU) or an application specific integrated circuit, e.g., a machine learning accelerator application specific integrated circuit, such as a Tensor Processing Unit (TPU).

Compiler 140 may also modify program 112 to improve the performance of processing unit 160 in executing machine instructions. For example, compiler 140 may modify program 112 based on characteristics of processing unit 160 that will perform computations of program 112. Compiler 140 may also adjust or determine the scheduling of instructions to improve the performance of processing unit 160 in executing a program, e.g., to reduce the latency of executing a program at runtime.

In particular, compiler 140 includes a FIFO scheduler 142 that may determine whether one or more FIFO instructions may be moved in the instruction schedule of processor 164. If so, the FIFO scheduler 142 may determine a new order of FIFO instructions. The FIFO scheduler 142 may also determine new orders for other non-FIFO instructions, for example, based on the new orders for the FIFO instructions.

In some implementations, the compiler 140 generates or receives data that defines a directed graph of the instructions of the program 112. For example, compiler 140 may receive program 112 and convert it into a compiler intermediate representation. The compiler intermediate representation may be in static single assignment (SSA) form, e.g., prior to register allocation. The compiler 140 may generate the graph based on the instruction set of the program in the compiler intermediate representation. As described above, the instructions are vertices or nodes in the graph, and the dependencies are edges between the vertices or nodes. The graph may include nodes for FIFO instructions and for non-FIFO instructions, such as arithmetic operations, transfers of data for hardware processors that are not modeled as FIFO queues, stores of data, and so on.

FIFO scheduler 142 may analyze the graph and determine the order of the independent FIFO instruction groups. As described above, a separate FIFO instruction group is a group of one or more pops from a FIFO that correspond to one or more pushes to the same FIFO. Since processing unit 160 may include multiple hardware FIFOs, FIFO scheduler 142 may determine the order of FIFO instructions for each hardware FIFO, e.g., one at a time.

FIFO scheduler 142 may remove control dependencies between independent FIFO instruction groups in the graph. The FIFO scheduler 142 may leave control dependencies within each independent FIFO instruction group intact, as those instructions may not be reordered by the FIFO scheduler 142. FIFO scheduler 142 may then determine the order of the independent FIFO instruction groups for each hardware FIFO.

In some embodiments, the FIFO scheduler 142 evaluates the instructions and determines the order of the separate FIFO instruction groups for each hardware FIFO, one at a time (e.g., in a particular order). For example, the FIFO scheduler 142 may sequentially determine the order of the separate FIFO instruction groups for each hardware FIFO based on the degree of performance advantage of the hardware FIFO in the program 112.

To determine the order in which to evaluate the hardware FIFOs, the FIFO scheduler 142 may determine a performance metric for each hardware FIFO in the program 112. The performance metric of a hardware FIFO may represent a degree of performance advantage of operations performed by the hardware FIFO in the program 112. In some embodiments, the performance metric of the hardware FIFO is based on (e.g., proportional to) how long it takes to perform all pushes and pops of the hardware FIFO. For example, the performance metric of the FIFO may be based on (e.g., proportional to) an estimated or actual duration of performing all pushes and pops, which may be based on the throughput of the hardware FIFO, the computations performed by the hardware FIFO on the data, and/or the amount of data on which the hardware FIFO performs computations. Since a pop outputs data of a computation performed by the hardware FIFO using data input by the corresponding push, the pop must wait until the computation is complete. Thus, the performance metric of the hardware FIFO may be based on how long it takes to perform the computation, e.g., based on the throughput and computation of the hardware FIFO.
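One way the per-FIFO performance metric could be computed is sketched below. This is a hedged illustration only: the function name, the throughput and computation-cost parameters, and the example numbers are assumptions, chosen to make the metric proportional to the time to complete all pushes and pops as described above.

```python
# Hypothetical sketch of a per-FIFO performance metric: proportional to the
# estimated time to perform all pushes and pops of one hardware FIFO.
def fifo_performance_metric(ops, throughput_elems_per_cycle, cycles_per_computation):
    """ops: list of ("push"|"pop", num_elements) for one hardware FIFO."""
    total = 0.0
    for kind, n in ops:
        # moving n elements through the FIFO at the given throughput
        total += n / throughput_elems_per_cycle
        if kind == "pop":
            # a pop must also wait for the computation on the pushed data
            total += cycles_per_computation
    return total

metrics = {
    "FIFO1": fifo_performance_metric([("push", 64), ("pop", 64)], 8.0, 100),
    "FIFO2": fifo_performance_metric([("push", 16), ("pop", 16)], 8.0, 10),
}
# Evaluate FIFOs from most performance-dominant to least
order = sorted(metrics, key=metrics.get, reverse=True)
```

With these illustrative numbers, FIFO1 dominates, so it would be evaluated first and its ordering would constrain FIFO2.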

FIFO scheduler 142 may evaluate each hardware FIFO in order, based on the performance metric, from the hardware FIFO with the most dominant performance (e.g., the longest time to complete its pushes and pops) to the hardware FIFO with the least dominant performance (e.g., the shortest time to complete its pushes and pops). The order of the independent FIFO instruction groups of each hardware FIFO becomes a constraint on the other hardware FIFOs. Since a hardware FIFO with more dominant performance may have the greatest impact on the overall performance of program 112, the constraints placed on the hardware FIFO with the most dominant performance should be fewer than the constraints placed on the hardware FIFO with the least dominant performance. Determining the order of the independent FIFO instruction groups for each hardware FIFO in that order based on the performance metric may result in better instruction scheduling, with reduced latency for program 112, relative to other evaluation orders.

To determine the order of the independent FIFO instruction groups for a given hardware FIFO, FIFO scheduler 142 may determine a path length metric for each independent FIFO instruction group of the hardware FIFO. The path length metric of an independent FIFO instruction group may indicate how long it will take to reach the end of program 112 from the independent FIFO instruction group. For example, the path length metric may be based on the length of the longest path from the independent FIFO instruction group to the end of the program 112 (e.g., to the end of the graph representing the instructions of the program 112). The length of each path may be based on the lengths of the edges along the path, as the edges represent the latency between instructions. The length of each path may also be based on the instructions along the path, the latency of the hardware processor(s) that will execute the instructions, and/or the throughput of the hardware processor(s), as these parameters affect how long it takes to traverse the path to the end of program 112. For example, the length of each path may be based on (e.g., proportional to) the aggregate length of the edges along the path and how long (e.g., an estimated duration) each instruction on the path will take to execute based on these parameters.
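A longest-path computation over a DAG, as the path length metric calls for, can be sketched as below. The graph shape, node costs, and edge latencies are hypothetical, and real costs would come from the hardware parameters described above.

```python
import functools

# Sketch of the path length metric: the longest path from a node to the end
# of the program DAG, where each edge carries a latency and each node an
# estimated execution duration (both in hypothetical units).
def longest_path_to_end(succs, node_cost, start):
    """succs: node -> list of (successor, edge_latency)."""
    @functools.lru_cache(maxsize=None)
    def dist(n):
        tails = [lat + dist(m) for m, lat in succs.get(n, [])]
        return node_cost[n] + (max(tails) if tails else 0)
    return dist(start)

succs = {"A": [("B", 2), ("C", 5)], "B": [("D", 1)], "C": [("D", 1)], "D": []}
node_cost = {"A": 1, "B": 3, "C": 1, "D": 2}
# A -> C -> D costs 1 + 5 + 1 + 1 + 2 = 10; A -> B -> D costs 1 + 2 + 3 + 1 + 2 = 9,
# so the metric for A is 10.
```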

FIFO scheduler 142 may sort the independent FIFO instruction groups of the hardware FIFO based on the path length determined for each independent FIFO instruction group. FIFO scheduler 142 may determine an order of the independent FIFO instruction groups of the hardware FIFO based on the determined path lengths. For example, FIFO scheduler 142 may order the independent FIFO instruction groups of the hardware FIFO from the longest path length to the shortest path length. That is, the FIFO scheduler 142 may determine the order such that the independent FIFO instruction group having the longest determined path length (e.g., for which the longest path to the end of the program is the longest among the independent groups) is first for the hardware FIFO, the independent group having the second longest path length is second, and so on.

However, such ordering must respect the control dependencies of the hardware FIFO and any data dependencies between the independent FIFO instruction groups of the hardware FIFO. For example, if the independent FIFO instruction group having the longest path length pushes, into the hardware FIFO, data output by a different independent FIFO instruction group, the different independent FIFO instruction group may be scheduled before the independent FIFO instruction group having the longest path length.
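The longest-first ordering subject to inter-group data dependencies can be sketched as a greedy topological sort. The group names, metrics, and the dependency are all hypothetical; this is one plausible way to combine the two constraints, not necessarily the scheduler's exact algorithm.

```python
# Hedged sketch: order independent FIFO instruction groups by path length
# metric (longest first) while ensuring that a producer group is scheduled
# before any consumer group that pushes its output.
def order_groups(path_len, deps):
    """path_len: group -> metric; deps: group -> set of producer groups."""
    ordered, placed = [], set()
    while len(ordered) < len(path_len):
        ready = [g for g in path_len
                 if g not in placed and deps.get(g, set()) <= placed]
        # among ready groups, the longest path length goes first
        nxt = max(ready, key=lambda g: path_len[g])
        ordered.append(nxt)
        placed.add(nxt)
    return ordered

path_len = {"G1": 9.0, "G2": 7.0, "G3": 4.0}
# G1 consumes data produced by G3, so G3 must precede G1 despite G1's
# longer path length
deps = {"G1": {"G3"}}
```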

After determining the order of the independent FIFO instruction groups for a hardware FIFO, the FIFO scheduler 142 may add control dependencies to the graph based on the determined order. The FIFO scheduler 142 may then move to the hardware FIFO with the next most dominant performance and determine the order of the FIFO instructions for that hardware FIFO. The FIFO scheduler 142 may continue the process until all hardware FIFOs have been evaluated and, if appropriate, a new order of independent FIFO instruction groups has been determined for each hardware FIFO.

In some embodiments, FIFO scheduler 142 may also use scheduling heuristics in determining the order of independent FIFO instruction groups, such as the critical path on the graph and the interdependencies between hardware FIFOs. For example, FIFO scheduler 142 may order the independent FIFO instruction groups in a manner that shortens the critical path of the graph, which reduces the overall latency of program 112.

In some embodiments, FIFO scheduler 142 evaluates the graph to identify independent FIFO instruction groups on the critical path of the graph. The critical path of the graph may be the longest path from the beginning of the graph to the end of the graph. The length of the path used to determine the critical path may be based on how long it takes to traverse the path, e.g., using the length of the edge, the delay of the hardware FIFOs on the path, the throughput of the hardware FIFOs on the path, and/or other suitable data, as described above.

FIFO scheduler 142 may evaluate each independent FIFO instruction group on the critical path to determine whether reordering the other independent FIFO instruction groups will shorten the critical path. If so, the independent FIFO instruction groups that are not on the critical path may be reordered to shorten the critical path.

For example, a first hardware FIFO (FIFO1) may have multiple independent FIFO instruction groups, each group including one or more pushes and one or more pops. A second hardware FIFO (FIFO2) may have an independent FIFO instruction group that is on the critical path of program 112 and that includes a push of a result output by a pop of FIFO1. If that pop of FIFO1 is not first in the order of the FIFO instructions of FIFO1, the FIFO scheduler 142 may determine a new order of the FIFO instructions of FIFO1 such that the pop of the data required by FIFO2 is first in the new order. This reduces the amount of time that FIFO2, on the critical path, must wait for the required data to perform its computations, which reduces the overall latency of the program 112.
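The reordering in this example amounts to promoting the group that feeds the critical path to the front. A minimal sketch, with hypothetical group names:

```python
# Illustrative sketch: if a pop from FIFO1 feeds a push into FIFO2 that
# lies on the critical path, move the independent group containing that
# pop to the front of FIFO1's order.
def prioritize_feeder(fifo1_order, feeder_group):
    if feeder_group in fifo1_order and fifo1_order[0] != feeder_group:
        fifo1_order = [feeder_group] + [g for g in fifo1_order if g != feeder_group]
    return fifo1_order

# g_c contains the pop whose output FIFO2's critical-path push requires
order = prioritize_feeder(["g_a", "g_b", "g_c"], "g_c")
```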

If other independent FIFO instruction groups of FIFO1 output data that is input to other FIFOs on the critical path of the program, the FIFO scheduler 142 may determine the order of the instructions based on when the other FIFOs require the data and/or the degree of performance advantage of the other FIFOs in the program 112. For example, if another FIFO is more performance-advantageous than FIFO2 and requires the output of a different independent FIFO instruction group of FIFO1, then that different independent group may be first in the order for FIFO1.

In some embodiments, FIFO scheduler 142 may omit or combine some FIFO operations on hardware processor 164 to improve efficiency. For example, when hardware processor 164 is configurable, additional instructions are required to change the configuration of hardware processor 164. The execution of these additional instructions may be expensive in terms of processing requirements and/or delay. In some cases, the configuration of hardware processor 164 may be periodically switched between configurations based on operations to be performed by hardware processor 164 and/or data input to hardware processor 164 using push instructions.

If multiple FIFO operations sharing the same configuration can be co-located in the graph, or in the schedule generated from the graph, the frequency of switching the configuration of the hardware processor 164 can be reduced, resulting in faster and more efficient computations. FIFO scheduler 142 may identify FIFO instructions for a hardware processor 164 that will change the configuration of the hardware processor 164 during execution of the program.

For each such hardware processor 164, the FIFO scheduler 142 may identify candidate FIFO instructions that may be grouped together. For example, for each configuration, FIFO scheduler 142 may identify each FIFO instruction of the hardware processor 164 that requires the hardware processor 164 to have that configuration. The FIFO scheduler 142 may then determine whether grouping the candidate FIFO instructions having the same configuration (or at least a portion of the FIFO instructions having the same configuration) into groups will result in reduced latency for the program 112. For example, FIFO scheduler 142 may determine whether adjusting the position of one or more FIFO instructions in the graph to group FIFO instructions having the same configuration would negatively impact the critical path of the graph, e.g., whether the grouping would extend the length of the critical path. If so, the FIFO scheduler 142 may determine not to group the FIFO instructions. If not, the FIFO scheduler 142 may group the FIFO instructions in the graph together so that all FIFO instructions in the group are executed without having to switch the configuration of the hardware processor 164.
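The benefit of this grouping can be illustrated by counting configuration switches before and after. The instruction and configuration names are hypothetical, and a real scheduler would only apply such a grouping after checking that it does not lengthen the critical path.

```python
# Hedged sketch: count how often the hardware processor's configuration
# must change as a schedule is executed.
def config_switches(schedule):
    """schedule: list of (instr, config). Count configuration changes."""
    switches, current = 0, None
    for _, cfg in schedule:
        if cfg != current:
            switches += 1
            current = cfg
    return switches

original = [("i1", "cfgA"), ("i2", "cfgB"), ("i3", "cfgA"), ("i4", "cfgB")]
# Co-locating instructions with the same configuration (stable sort keeps
# the original relative order within each configuration)
grouped = sorted(original, key=lambda ic: ic[1])
```

Here the ungrouped schedule switches configuration on every instruction, while the grouped schedule switches only once per configuration.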

Another example of co-locating FIFO instructions involves a hardware processor that processes data of a larger data size than the data size of its input. For example, the hardware processor 164 may be configured to process 32-bit data, but the input data may be only 16-bit data. Two inputs of 16-bit data may be combined, and the hardware processor 164 may process the combined data. However, combining the data in this example may also negatively impact the critical path of the graph.

FIFO scheduler 142 may identify a FIFO instruction for a hardware processor 164 that processes data having a larger data size than the data input to the hardware processor 164 by the FIFO instruction. The FIFO scheduler 142 may then determine, for each such hardware processor 164, whether adding instructions to combine the data, and/or having FIFO instructions wait for the other data with which to be combined, would negatively impact the critical path of the graph (e.g., by extending the length of the critical path). If so, FIFO scheduler 142 may determine not to combine the data for the hardware processor 164. If not, the FIFO scheduler 142 may adjust the graph to combine the inputs to the hardware processor 164, for example, by adding an instruction that combines the data and edges from the input sources to that combining instruction.
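The 16-bit-into-32-bit case mentioned above can be shown concretely with simple bit packing. The function names are illustrative; the point is only that two narrow inputs fit in one wide word for a single push.

```python
# Illustrative sketch of combining two 16-bit inputs into one 32-bit word
# so a 32-bit hardware processor handles both in a single operation (done
# only when it does not lengthen the critical path).
def pack16x2(lo, hi):
    assert 0 <= lo < 1 << 16 and 0 <= hi < 1 << 16
    return (hi << 16) | lo

def unpack16x2(word):
    return word & 0xFFFF, (word >> 16) & 0xFFFF

w = pack16x2(0x1234, 0xABCD)   # one 32-bit push instead of two 16-bit pushes
```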

When determining whether an independent FIFO instruction group should be co-located or grouped with other independent FIFO instruction groups, the FIFO scheduler 142 may search for potential candidates (e.g., candidate FIFO instructions that require the same configuration of the hardware processor, or that process smaller data sizes). The FIFO scheduler 142 may then determine locations for the co-location or combination such that the critical path from the independent FIFO instruction group to the end of the graph still respects the control dependency edges around the independent FIFO instruction group. That is, the improvement is made while the program 112 remains constrained by the control dependencies of the FIFO hardware processors.

After determining the order of the FIFO instructions, FIFO scheduler 142 may generate a schedule, e.g., a linear schedule, of instructions to be compiled by compiler 140 and executed by the hardware processors 164. As described above, the graph may include non-FIFO instructions in addition to FIFO instructions. Non-FIFO instructions typically have shorter latency than FIFO instructions and typically do not have similar ordering constraints. Therefore, critical path heuristics are less useful for scheduling non-FIFO instructions.

The FIFO scheduler 142 may cluster non-FIFO instructions with FIFO instructions. In some embodiments, FIFO scheduler 142 clusters each non-FIFO instruction with the FIFO instruction that is closest to the non-FIFO instruction in the graph. As described above, the lengths of the edges in the graph may represent the latency between instructions. The FIFO scheduler 142 may evaluate the length of the edges between the non-FIFO instruction and one or more FIFO instructions (e.g., one or more FIFO instructions connected to the non-FIFO instruction) to determine the FIFO instruction closest to the non-FIFO instruction. For example, if there are intervening instructions between the non-FIFO instruction and a FIFO instruction, the graph may include multiple edges between the non-FIFO instruction and the FIFO instruction. In this case, the FIFO scheduler 142 may aggregate the lengths of the edges (or the latencies represented by the edges) to determine the total length between the non-FIFO instruction and the FIFO instruction. The FIFO scheduler 142 may then cluster the non-FIFO instruction with the independent FIFO instruction group that includes the FIFO instruction having the shortest length from the non-FIFO instruction. The FIFO scheduler 142 may do so for each non-FIFO instruction until all non-FIFO instructions are in clusters with FIFO instructions.
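The nearest-FIFO clustering above is essentially a shortest-path search over edge latencies. A hedged sketch, using Dijkstra's algorithm over an adjacency list (the graph, node names, and latencies are hypothetical):

```python
import heapq

# Sketch: find the FIFO instruction nearest to a non-FIFO instruction,
# where distance is the aggregate edge latency along the shortest
# connecting path in the instruction graph.
def nearest_fifo(adj, fifo_nodes, start):
    """adj: node -> list of (neighbor, edge_latency)."""
    dist, heap = {start: 0}, [(0, start)]
    while heap:
        d, n = heapq.heappop(heap)
        if n in fifo_nodes:
            return n
        if d > dist.get(n, float("inf")):
            continue
        for m, lat in adj.get(n, []):
            if d + lat < dist.get(m, float("inf")):
                dist[m] = d + lat
                heapq.heappush(heap, (d + lat, m))
    return None

adj = {
    "store": [("pop1", 3), ("add", 1)],
    "add": [("store", 1), ("pop2", 1)],
    "pop1": [("store", 3)], "pop2": [("add", 1)],
}
# "store" reaches pop2 via "add" with total latency 2, versus 3 to pop1,
# so "store" would be clustered with pop2's independent group.
cluster = nearest_fifo(adj, {"pop1", "pop2"}, "store")
```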

FIFO scheduler 142 may use critical path heuristics to sort the clusters of instructions. For example, for each cluster of instructions, the FIFO scheduler 142 may determine the length of the critical path from the FIFO instructions of the cluster to the end of the program. The critical path may represent the longest path from the FIFO instructions of the cluster to the end of the program. FIFO scheduler 142 may then schedule the clusters of instructions based on the lengths of the critical paths. For example, FIFO scheduler 142 may schedule the clusters of instructions in order from the longest length to the shortest length.

When a program involves multiple hardware FIFOs and there are data dependencies between them, this approach can result in longer live ranges for registers if the utilization of the different FIFOs differs. For example, if only one out of every three results popped from FIFO1 is pushed into FIFO2, the critical path from the push into FIFO2 is much shorter than the critical path from the pop from FIFO1. Therefore, after sorting based on critical path length, the live range of the popped value may be very long. In this case, the FIFO scheduler 142 may move data-consuming FIFO instructions closer to the corresponding data-producing FIFO instructions when doing so does not increase the critical path of the graph. For example, the FIFO scheduler 142 may move the data-consuming push to the earliest position in the graph to which it can be moved without increasing the critical path length of the graph.

FIFO scheduler 142 may also schedule instructions within each cluster using one or more scheduling techniques. For example, the FIFO scheduler 142 may use a list scheduling technique having priorities determined based on the graph topology and characteristics of the hardware architecture of the hardware processors 164.

Compiler 140 may complete the compilation based on the schedule of instructions and provide compiled program 150 to processing unit 160. The hardware processors 164 of the processing unit 160 may execute the compiled program 150. For example, the hardware processors 164 may execute the FIFO instructions and non-FIFO instructions according to the scheduled order as defined by the compiled program 150.

FIG. 2 depicts a directed graph 205 of instructions of a program and an updated graph 235 of the instructions of the program. Directed graph 205 may be a portion of a graph representing dependencies between instructions of a program. In this example, graph 205 includes independent FIFO instruction groups A1-A4 of a first hardware FIFO (FIFO1). Independent FIFO instruction groups A1-A4 are in a first order 210 based on a first version of the program (e.g., the version of the program received from the user device). The graph 205 also includes independent FIFO instruction groups B1-B4 of a second hardware FIFO (FIFO2). Based on the first version of the program, the independent FIFO instruction groups B1-B4 are also in a first order 220.

The FIFO scheduler 142 may determine a new order of FIFO instructions for each FIFO, e.g., one at a time, based on the degree of performance advantage of FIFO1 and FIFO2 in the program. In this example, it is assumed that FIFO2 has a performance advantage over FIFO1.

The independent FIFO instruction groups of each FIFO may be ordered based on the length of the longest path from each independent FIFO instruction group to the end of the program. In this example, each of the independent FIFO instruction groups has an associated estimated duration based on the longest path of the independent FIFO instruction group. For example, independent FIFO instruction group A1 has an estimated duration of 2.1 milliseconds (ms). As described above, the duration may be an estimated duration of traversing the path.

FIFO scheduler 142 may determine a new order 225 of the independent FIFO instruction groups B1-B4 based on the path lengths or estimated durations. In this example, the independent FIFO instruction groups B1-B4 are ordered from the longest duration to the shortest duration. Because independent FIFO instruction group B4 has the longest duration (4.9 ms), independent FIFO instruction group B4 is first in the new order 225 of the updated graph 235.

FIFO scheduler 142 may also determine a new order 215 of the independent FIFO instruction groups A1-A4 based on the path lengths or estimated durations. For example, the FIFO scheduler 142 may determine the new order 215 based on the estimated durations, similar to the new order 225 of the independent FIFO instruction groups B1-B4. FIFO scheduler 142 may also take into account critical path heuristics and the interdependencies between the FIFOs. In this example, graph 205 includes a dependency between independent FIFO instruction groups A4 and B1, as shown by edge 212. Thus, B1 has to wait for the data output by the pop of A4 in order to push that data into its hardware processor. To reduce this latency, the FIFO scheduler 142 may place A4 first in the new order and sort the other independent FIFO instruction groups A1-A3 based on their estimated durations. In this example, the new order 215 has A4 first, followed by A3, A1, and A2.
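The FIG. 2 ordering can be reproduced with a short sketch. Only A1 = 2.1 ms and B4 = 4.9 ms are stated in the text; the other durations below are hypothetical values chosen to be consistent with the resulting orders 215 and 225.

```python
# Illustrative reconstruction of FIG. 2's reordering. Durations other than
# A1 (2.1 ms) and B4 (4.9 ms) are hypothetical, chosen to match the orders
# described in the text.
a_durations = {"A1": 2.1, "A2": 1.5, "A3": 3.0, "A4": 2.8}
b_durations = {"B1": 3.2, "B2": 2.0, "B3": 1.1, "B4": 4.9}

# Order 225: B groups sorted from longest to shortest duration
order_225 = sorted(b_durations, key=b_durations.get, reverse=True)

# Order 215: B1 depends on data popped by A4 (edge 212), so A4 is placed
# first and the remaining A groups are sorted by duration.
rest = sorted((g for g in a_durations if g != "A4"),
              key=a_durations.get, reverse=True)
order_215 = ["A4"] + rest
```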

FIG. 3 is a flow diagram illustrating an exemplary process 300 for modifying a program. The operations of process 300 may be performed by a system including one or more data processing apparatus. For example, the operations of process 300 may be performed by program processing system 130 of FIG. 1.

The system receives a program (310) to be executed by a processing unit comprising hardware processors. The program may include one or more workloads, such as a machine learning workload, a digital signal processing workload, a video encoding/decoding workload, or another suitable workload. For example, the program may define, or be in the form of, a directed computational graph that specifies the operators of the workload.

The processing unit may include one or more long-latency, high-throughput hardware processors that may be modeled as FIFO queues. The processing unit may also include other types of lower latency hardware processors that are not modeled as FIFO queues.

The system determines, for each of the one or more hardware processors, a new order of the independent FIFO instruction groups to be executed by the hardware processor (320). For example, the system may receive or determine a compiler intermediate representation of the source code of the program based on the hardware processors of the processing unit. This compiler intermediate representation of the program may include a directed graph of instructions to be executed by the hardware processors. The graph may also indicate, for each instruction, the hardware processor that will execute the instruction.

The system may identify each hardware processor that is modeled as a FIFO queue in the graph as the one or more hardware processors. These processors are also referred to as hardware FIFOs in this document. The graph may include one or more independent FIFO instruction groups for each hardware FIFO. For a hardware FIFO that includes multiple independent groups, the system may evaluate the independent instruction groups of the hardware FIFO to determine whether the latency of the program would be reduced by a new order of the independent FIFO instruction groups of the hardware FIFO. If so, the system may determine a new order for the hardware FIFO. The system may use the constituent operations 321-324 to determine a new order of FIFO instructions for each of the one or more hardware FIFOs. If the one or more hardware processors include multiple hardware processors, the system may sequentially evaluate each hardware processor based on the degree of performance advantage of the hardware processor in the program, as described above.

The system identifies an order of the independent FIFO instruction groups for execution by the hardware processor in data representing code of the program (321). The system may then determine a path length metric for each identified group of independent FIFO instructions (322).

The path length metric of an independent FIFO instruction group may indicate how long it will take to reach the end of the program from the independent FIFO instruction group. For example, the path length metric may be based on the length of the longest path from the independent FIFO instruction group to the end of the program. As described above, the length of each path may be based on the lengths of the edges along the path, the instructions along each path, the latency of the hardware processor that will execute the instructions, and/or the throughput of the hardware processor. For example, the length of each path may be based on (e.g., proportional to) the aggregate length of the edges along the path and how long (e.g., an estimated duration) each instruction on the path will take to execute based on these parameters.

The system generates a new order of the independent FIFO instruction groups based on the path length metric (323). For example, the system may generate the new order such that the independent FIFO instruction groups are ordered from the independent group having the path length metric representing the longest path length to the independent group having the path length metric representing the shortest path length. As described above, the system may also consider other critical path heuristics and interdependencies between hardware FIFOs when determining the order of the independent FIFO instruction groups of the hardware FIFOs. The system may update the graph based on the determined order and add control dependencies between the FIFO instructions based on the new order.

The system determines whether there is an additional hardware processor for which to evaluate the order of the independent FIFO instruction groups (324). If so, the system returns to operation 321 to evaluate another hardware processor. If not, the system generates a schedule of instructions. The instruction schedule may include a linear instruction schedule including FIFO instructions and non-FIFO instructions.

The system may order the independent FIFO instruction groups in the schedule based on the length of the critical path from each independent FIFO instruction group to the end of the updated graph. For example, the order of the independent FIFO instruction groups in the schedule may be from the independent group having the longest critical path to the independent group having the shortest critical path. For non-FIFO instructions, the system may group each non-FIFO instruction with the closest independent FIFO instruction group in the graph, as described above.

The system generates update code for the program (340). The system may generate update code based on the schedule of instructions. The system may then generate output code for execution by the hardware processor based on the updated code (350). For example, the system may perform the compilation of code into machine instructions for each hardware processor.

The system may then provide the output code to each hardware processor (360). The hardware processor may execute instructions while the processing unit is executing a program.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware comprising the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access storage device, or a combination of one or more of them.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), or a GPGPU (general purpose graphics processing unit).

By way of example, a computer suitable for executing a computer program comprises a central processing unit that may be based on a general purpose microprocessor or a special purpose microprocessor, or both, or any other type. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for executing or carrying out instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Furthermore, the computer can be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game player, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a Universal Serial Bus (USB) flash drive, to name a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated in a single software product or packaged into multiple software products.

Specific embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may be advantageous.
