Automated compiler data flow optimization for implementing circular pipelining with local storage requirements


Note: This technology, "Automated compiler data flow optimization for implementing loop pipelining with local storage requirements," was created by R. Deodhar, S. Dmitriev, D. Woodworth, R. Krishnaiyer, K. Glossop, and A. Sudarsanan on 2020-12-18. Abstract: Automated compiler data flow optimization for implementing loop pipelining with local storage requirements is described herein. Systems, devices, and methods may provide a technique for detecting one or more local variables in source code, wherein the local variable(s) lack dependencies between iterations of a loop in the source code, automatically generating pipelined execution code for the local variable(s), and merging the pipelined execution code into an output of a compiler. In one example, the pipelined execution code includes an initialization of a pool of buffer storage for the local variable(s).

1. A computing system, comprising:

a network controller;

a processor coupled to the network controller; and

a memory coupled to the processor, the memory comprising a set of executable program instructions that, when executed by the processor, cause the processor to:

detect one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code,

automatically generate pipelined execution code for the one or more local variables, and

merge the pipelined execution code into an output of a compiler.

2. The computing system of claim 1, wherein the pipelined execution code includes initialization of a buffer pool for the one or more local variables.

3. The computing system of claim 2, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

4. The computing system of claim 2, wherein the pipelined execution code further comprises a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

5. The computing system of claim 1, wherein the pipelined execution code includes a pipeline depth definition.

6. The computing system of any of claims 1 to 5, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in a dataflow graph.

7. A semiconductor device, comprising:

one or more substrates; and

logic coupled to the one or more substrates, wherein the logic is implemented at least in part in one or more of configurable logic or fixed function hardware logic, the logic coupled to the one or more substrates to:

detect one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code;

automatically generate pipelined execution code for the one or more local variables; and

merge the pipelined execution code into an output of a compiler.

8. The semiconductor device of claim 7, wherein the pipelined execution code includes initialization of a buffer pool for the one or more local variables.

9. The semiconductor device of claim 8, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

10. The semiconductor device of claim 8, wherein the pipelined execution code further comprises a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

11. The semiconductor device of claim 7, wherein the pipelined execution code includes a pipeline depth definition.

12. The semiconductor device of any of claims 7 to 11, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in a dataflow graph.

13. The semiconductor device of any of claims 7-11, wherein the logic coupled to the one or more substrates comprises a transistor channel region within the one or more substrates.

14. A method, comprising:

detecting one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code;

automatically generating pipelined execution code for the one or more local variables; and

merging the pipelined execution code into an output of a compiler.

15. The method of claim 14, wherein the pipelined execution code includes initialization of a buffer pool for the one or more local variables.

16. The method of claim 15, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

17. The method of claim 15, wherein the pipelined execution code further comprises a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

18. The method of claim 14, wherein the pipelined execution code includes a pipeline depth definition.

19. The method of any of claims 14 to 18, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in a dataflow graph.

20. A semiconductor device, comprising:

means for detecting one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code;

means for automatically generating pipelined execution code for the one or more local variables; and

means for merging the pipelined execution code into an output of a compiler.

21. The semiconductor device of claim 20, wherein the pipelined execution code includes initialization of a buffer pool for the one or more local variables.

22. The semiconductor device of claim 21, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

23. The semiconductor device of claim 21, wherein the pipelined execution code further comprises a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

24. The semiconductor device of claim 20, wherein the pipelined execution code includes a pipeline depth definition.

25. The semiconductor device of any of claims 20 to 24, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in a dataflow graph.

Technical Field

Embodiments are generally related to compilers. More particularly, embodiments relate to automated compiler data flow optimization to enable pipelining of loops with local storage requirements.

Background

A dataflow graph can be used to model computer source code according to dependencies between the various operations performed by the code. A compiler may convert the source code into a dataflow graph that is then executed by accelerator hardware such as a Field Programmable Gate Array (FPGA), a Configurable Spatial Accelerator (CSA), or another dataflow architecture. Although accelerator hardware may be useful in processing High Performance Computing (HPC) and/or data center applications that operate on relatively large data arrays and structures, there is still significant room for improvement. For example, if the operation of the source code involves executing a loop that internally declares a "private" variable for a large data array, the ability to save (e.g., "register") the underlying data in the internal channels of the accelerator (e.g., communication arcs, buffers, latency-insensitive channels/LICs, etc.) may be limited. As a result, the private variable may be treated as a purely memory-based variable, which may result in a performance loss.

Drawings

Various advantages of the embodiments will become apparent to those skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:

FIG. 1 is a block diagram of an example of a compiler output, according to an embodiment;

FIG. 2A is a source code listing of an example of a loop with fixed-size local storage, according to an embodiment;

FIG. 2B is a source code listing of an example of a loop with local storage that varies at run time, according to an embodiment;

FIG. 2C is a source code listing of an example of a loop with explicitly specified private variables, according to an embodiment;

FIG. 2D is a source code listing of an example of a loop with dynamically allocated local variables, according to an embodiment;

FIG. 3 is a block diagram of an example of a communication arc in a dataflow graph according to an embodiment;

FIG. 4 is a flow diagram of an example of a method of operating a compiler, according to an embodiment;

FIG. 5 is a block diagram of an example of a compiler, according to an embodiment;

FIG. 6 is a block diagram of an example of a performance enhanced computing system according to an embodiment;

FIG. 7 is an illustration of an example of a semiconductor device according to an embodiment;

FIG. 8 is a block diagram of an example of a processor according to an embodiment; and

FIG. 9 is a block diagram of an example of a multiprocessor-based computing system, according to an embodiment.

Detailed Description

Turning now to FIG. 1, a compiler 20 is shown, wherein the compiler 20 automatically converts source code 22 into an output 24, which output 24 may be executed by a dataflow architecture (such as, for example, an FPGA, CSA, etc.). In an embodiment, the source code 22 is written in a high-level language such as, for example, C, C++, or Fortran, enhanced with parallel annotations (e.g., OpenMP parallel directives) to express parallelism to be exploited at run time on the dataflow architecture. The source code 22 may generally use loops to perform various operations. In practice, the runtime performance of an application may be dictated by the time it takes to execute the loops that perform its tasks. On dataflow architectures such as the CSA, the performance of parallel loops can be improved by: a) creating multiple copies of the loop body (e.g., "workers"), b) executing the workers in parallel, and c) pipelining the execution of the workers.

In the illustrated example, the source code 22 contains one or more local variables 26 (e.g., private variables) that lack dependencies between iterations of a loop in the source code 22. As will be discussed in more detail, such variables may occur naturally when they are declared within a loop. In an embodiment, the local variable(s) 26 are sometimes used for relatively large data arrays. In that case, to improve the throughput of the loop containing the local variable(s) 26, the illustrated compiler 20 generates pipelined execution code 28 for the local variable(s) 26 and merges the pipelined execution code 28 into the output 24 of the compiler 20. The illustrated local variables are therefore allocated in such a way that each loop iteration obtains its own copy, which allows for pipelined execution. As previously noted, pipelining the execution of the workers can significantly improve performance.

FIG. 2A shows source code 30 of a loop (e.g., "for (int i = 0; i < n; i++)") that declares a variable "b", where the variable "b" can be considered a local variable because it lacks dependencies between iterations of the loop.
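The listing of FIG. 2A itself is not reproduced in this text; a minimal sketch of such a loop might be as follows, where the array size and the surrounding arrays "a" and "out" are illustrative assumptions:

for (int i = 0; i < n; i++) {
    int b[100];              // declared inside the loop: each iteration gets a fresh b
    for (int j = 0; j < 100; j++)
        b[j] = a[i] * j;     // writes only this iteration's copy of b
    out[i] = b[99];          // b does not flow into the next iteration
}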

FIG. 2B shows source code 32 of a loop (e.g., "for (int i = ibegin; i < end; i++)") that declares a variable "spa", which also lacks dependencies between iterations of the loop and is therefore considered a local variable. In the illustrated example, the size of the variable varies and is known only at run time. The illustrated source code 32 may readily replace the source code 22 (FIG. 1) already discussed. Accordingly, pipelined execution code for the illustrated local variable may be automatically generated.
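A hedged sketch of such run-time-sized local storage follows, using a C variable-length array; the bound arrays "row_begin" and "row_end" are illustrative assumptions, not the patent's figure:

for (int i = ibegin; i < end; i++) {
    int len = row_end[i] - row_begin[i];   // size known only at run time
    double spa[len];                       // per-iteration local storage of varying size
    for (int k = 0; k < len; k++)
        spa[k] = 0.0;                      // e.g., clear a sparse accumulator
    // ... loop body reads and writes only this iteration's spa ...
}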

FIG. 2C shows source code 34 containing a loop (e.g., "for (int j = x; j < y; j++)") that uses a variable "b", where the variable "b" is explicitly specified as a private variable (e.g., via a "private" clause). Other explicit clauses (such as "firstprivate", "lastprivate", "reduction", etc.) may also be used. In the illustrated example, the variable has a fixed size (e.g., an array of 100 integers). Thus, the local storage requirement of the variable b is fixed and statically known to the compiler. The illustrated source code 34 may readily replace the source code 22 (FIG. 1) already discussed. Accordingly, pipelined execution code for the illustrated local variable may be automatically generated.
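A hedged sketch of an explicitly specified private variable follows; the OpenMP "private" clause is standard, while the loop bounds and body are illustrative:

int b[100];                          // fixed size: 100 integers, statically known
#pragma omp parallel for private(b)
for (int j = x; j < y; j++) {
    b[0] = j;                        // each iteration writes its own private copy of b
    // ... the rest of the body uses only this iteration's b ...
}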

FIG. 2D shows source code 36 containing a loop (e.g., "for (int i = 0; i < n; i++)") that dynamically allocates memory for a variable "b" from within the loop. In the illustrated example, the variable is a local variable that lacks dependencies between iterations of the loop, and the size of the variable may remain constant or vary. The illustrated source code 36 may readily replace the source code 22 (FIG. 1) already discussed. Accordingly, pipelined execution code for the illustrated local variable may be automatically generated.
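A hedged sketch of such dynamically allocated local storage might be (the size of 100 integers is illustrative):

for (int i = 0; i < n; i++) {
    int *b = (int *)malloc(100 * sizeof(int));   // allocated from within the loop
    if (b) {
        // ... loop body uses b; no cross-iteration dependencies ...
        free(b);                                 // released before the iteration ends
    }
}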

Turning now to FIG. 3, a communication arc 40 (e.g., LIC) is shown between a first functional unit 42 (e.g., node) in a dataflow graph and a second functional unit 44 in the dataflow graph. In the illustrated example, the functional units 42 and 44 perform operations on data associated with a local variable in a loop. In an embodiment, the communication arc 40 includes buffer storage (not shown) such as, for example, one or more line buffers, FIFO (first in, first out) buffers, or the like, which can hold the values used to assign the data associated with the local variable to different loop iterations.

FIG. 4 illustrates a method 50 of operating a compiler. The method 50 may generally be implemented in a compiler such as, for example, the compiler 20 (FIG. 1) already discussed. More specifically, the method 50 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., in configurable logic such as, for example, programmable logic arrays (PLAs), FPGAs, or complex programmable logic devices (CPLDs), in fixed-functionality logic hardware using circuit technology such as, for example, application specific integrated circuit (ASIC), complementary metal oxide semiconductor (CMOS), or transistor-transistor logic (TTL) technology, or in any combination thereof.

For example, computer program code to carry out operations shown in the method 50 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++ or the like, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. Additionally, the logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine-dependent instructions, microcode, state-setting data, configuration data for integrated circuits, state information that personalizes electronic circuitry, and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).

The illustrated processing block 52 provides for detecting one or more local variables in source code, where the local variable(s) lack dependencies between iterations of a loop in the source code. The source code may be associated with a communication channel such as the communication arc 40 (FIG. 3) in a dataflow graph. In embodiments, block 52 includes automatically parsing and/or searching the source code for loops having fixed-size local storage (e.g., as shown in FIG. 2A), local storage that varies at run time (e.g., as shown in FIG. 2B), explicitly specified private variables (e.g., as shown in FIG. 2C), dynamically allocated local variables (e.g., as shown in FIG. 2D), and so forth. Further, block 52 may be performed after registration of the source code.

Block 54 automatically generates pipelined execution code for the local variable(s) (e.g., in response to detecting the one or more local variables). As will be discussed in more detail, block 54 may include generating executable instructions that initialize a buffer pool for the local variable(s), define a pipeline depth, and define a plurality of tokenized slots in the initialized buffer pool. In one example, the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop. Further, each tokenized slot may correspond to a pipelined iteration of the loop. The illustrated block 56 merges the pipelined execution code into the output of the compiler. The method 50 therefore improves performance by enabling the pipelining of loops containing private data, which improves throughput. In practice, the total number of cycles required to execute a loop may be much less than the product of the static cycle count (e.g., the number of cycles required to execute one iteration of the loop) and the number of loop iterations.
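As a simple illustration (the numbers here are hypothetical rather than taken from this disclosure): if one iteration of a loop takes L = 30 cycles and the loop runs N = 1000 iterations, serial execution needs roughly L x N = 30000 cycles. If pipelining with a depth of three lets a new iteration start every II = L / 3 = 10 cycles, the loop completes in roughly L + (N - 1) x II = 10020 cycles, approximately a 3x throughput improvement.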

FIG. 5 illustrates a compiler 60 that may implement one or more aspects of the method 50 (FIG. 4) already discussed. Additionally, the compiler 60 may readily substitute for the compiler 20 (FIG. 1) already discussed. In general, the compiler 60 enables pipelined execution of loops containing local variables, and may be understood with reference to a piece of sample source code and compiler-generated pseudo code. For further reference, the end of the present disclosure includes actual intermediate representation (IR) results from an LLVM-based compiler for a similar sample, before and after the main compiler transformation described herein.

Taking the example of dynamically allocated local storage in a loop, a constant array size of 100 is chosen for simplicity, and it may be assumed that the compiler 60 selects two workers for the loop and a pipeline depth of three for each worker loop.

OpenMP language extensions may also be implemented to allow explicit control over worker creation and pipeline depth. Such extensions may be considered optional.

The OpenMP language is extended as follows:

#pragma omp ...
    dataflow([num_workers[(<n>)],]
             [static[(<block size>)],]
             [pipeline[(<depth>)]])
for loop

The pipeline(<depth>) clause specifies how many loop iterations are allowed to execute in parallel. The num_workers and static clauses specify how many workers to create and the way loop iterations are distributed among the workers. Other parallel annotation languages and/or APIs (application programming interfaces) may also be used, such as OpenACC, OpenCL, SYCL, etc.
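For concreteness, a hedged usage sketch follows. The clause spellings echo the reconstruction above and are an assumption based on this description; the chunk size and loop body are illustrative:

// request 2 workers, a static distribution of iterations, and a pipeline depth of 3
#pragma omp dataflow(num_workers(2), static(1), pipeline(3))
for (int i = 0; i < n; i++) {
    int b[100];   // per-iteration private storage
    // ... loop body that reads and writes only this iteration's b ...
}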

Properly handling private variables in pipelined loops may span multiple passes in the compiler 60. As shown in FIG. 5, the transformation occurs in three places.

The worker creation stage 62 may be used when the local storage comes from OpenMP clauses. In an embodiment, the worker creation stage 62 replaces the OpenMP directives (with the extensions) by creating multiple workers. The worker creation stage 62 may also represent the local storage using dynamic allocation. Pseudo code for the worker creation stage 62 is provided below.

Loop:
  b = alloca ...
  // the loop body references the local variable b
  <inner j loop>
End-loop:

the local store expansion stage 64 processes a relatively large portion of the conversion described herein. In one example, the local storage expansion stage 64 handles the allocation and referencing of the remaining private variables. The illustrated path of stage 64 is performed relatively late to allow other compiler optimizations to register local variables as much as possible. Accordingly, variables that otherwise cannot be registered are processed in stage 64. If the loop has a set of S private variables, then stage 64 creates an array of type S with a dimension of pipeline depth, the dynamic count of iterations in progress.

The dataflow operation conversion stage 66 may handle the management of the individual slots in the private variable array created for each loop.

Worker creation

The worker creation stage 62 may create multiple workers as directed by the OpenMP directives. For non-OpenMP loops, the worker creation stage 62 may automatically decide the number of workers to generate. Similarly, the OpenMP directive may specify a pipeline depth, or the compiler 60 may select the degree of pipelining to generate. For purposes of discussion, assume that two workers are created and a pipeline depth of three is selected.

A pair of LLVM IR intrinsic functions may be introduced to support loop local storage:

r = pipeline.limited.entry(int id, int depth)
pipeline.limited.exit(r)

these intrinsic functions contain loops that require local storage. The parameter of the "entry" call specifies the pipeline depth and marks where the allocation of the closed loop occurs. "Exit" marks the release point. This representation ensures that a single allocation/release is performed for a loop, independent of the number of workers generated.

Pseudo code for the original single loop after the worker creation stage 62 is shown below. In the illustrated example, the original loop has been replicated to form two workers, and the local variable in the original loop becomes an individual local variable in each new loop. Pipelining has not yet been taken into account; that is completed later in the local store expansion stage 64. The pseudo code after processing by the worker creation stage 62 may be:
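The following is a reconstructed sketch rather than the original listing, assuming the intrinsics described above and the two-worker configuration; the iteration distribution shown in the comments is illustrative:

r = pipeline.limited.entry(0, 3)   // single allocation point for the enclosed loops
Worker0-Loop:                      // e.g., iterations i = 0, 2, 4, ...
  b0 = alloca ...                  // worker 0's own copy of the local variable b
  <inner j loop>
End-loop
Worker1-Loop:                      // e.g., iterations i = 1, 3, 5, ...
  b1 = alloca ...                  // worker 1's own copy of the local variable b
  <inner j loop>
End-loop
pipeline.limited.exit(r)           // single release point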

Local store expansion

In an embodiment, the local store expansion stage 64 performs a transformation that accounts for pipelining. The concept of tokens is used to enforce the pipeline depth of three, and a pool of three token values is created for each worker. In one example, an iteration may begin only when a token can be obtained from the pool. The take is modeled by a call to "token.take", which can only complete when a local storage slot becomes available. When the iteration completes, the token is returned to the pool. The return is modeled by a call to "token.return". In one example, only three iterations can execute in parallel in each worker, since there are only three distinct token values.

The pseudo code after the local store expansion stage 64 may be:
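The following is a reconstructed sketch of one worker loop rather than the original listing, assuming the token model just described (operands elided where the original provides no detail):

r = pipeline.limited.entry(0, 3)   // marks the allocation point (depth 3)
Worker0-Loop:
  b = token.take(...)              // completes only when one of the 3 slots is free
  <inner j loop>                   // the body uses b as this iteration's storage
  token.return(...)                // the slot becomes available to a later iteration
End-loop
pipeline.limited.exit(r)           // marks the release point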

Dataflow operation conversion

The final stage in which loop local storage is implemented is the dataflow operation conversion stage 66, which converts the IR code into dataflow operations. Take and return may be abstract representations of a mechanism that distributes a fixed number of tokens. In an embodiment, the physical implementation of this mechanism uses a CSA LIC. A basic property of a CSA LIC is that it holds multiple values: a read consumes a value from one end of the LIC, and a write deposits a value at the other end. This property may be used to allow only a fixed number of values to circulate through the body of the loop. In one example, the depth of the LIC is selected as the user-specified pipeline depth. Additionally, the values in the LIC may be the offsets of the respective slots allocated for the private variables of the loop. When a new iteration of the loop begins, a value is read from the LIC and added to the base address to generate the slot address for the current iteration of the loop. When the iteration completes, the offset is written back to the LIC. Since the LIC only holds "depth" values, only "depth" iterations can execute in parallel, each using a separate local storage slot. Example dataflow operations implementing this scheme are shown below.

In a dataflow machine, instructions execute when their input dependencies are satisfied. In what follows, "inord" is an input ordinal (e.g., a signal that satisfies an input dependency), and when an instruction completes execution, it generates an "outord" (output ordinal) to indicate that its result is now available. The gate64, add64, and mov0 instructions are explained first, followed by their use in implementing token.take and token.return.

gate64 result, inord, value

The instruction does not execute until "inord" is available. Then, "value" becomes available as "result".

add64 result, input1, input2

The instruction does not execute until "input1" and "input2" are available. Then, "result" becomes available as the sum of "input1" and "input2".

mov0 result, inord, value

This instruction does not execute until "inord" is available. Then, "value" becomes available as "result".

The following pseudo code is an example output of the dataflow operation conversion stage 66 for a CSA implementation.

// each loop iteration requires 400 bytes of local storage
// create 2 workers for the original loop
// implement a pipeline depth of 3
// total local storage = 400 x 3 x 2 bytes = 2400 bytes
// worker 0 uses a pool ranging from byte 0 to byte 1199
// worker 1 uses a pool ranging from byte 1200 to byte 2399
// in each worker's pool, 3 slots are at offsets 0, 400, 800
// initialize a LIC of depth 3 with the offset values:
// (offset of slot 0), (offset of slot 1), (offset of slot 2)

// declare %slot_offset as a LIC of depth 3

%slot_offset: ci64 = init64 0
%slot_offset: ci64 = init64 400
%slot_offset: ci64 = init64 800

// implementing token.take on CSA
// dynamic memory allocation outside the loop produces the pool address
pool = ...    // corresponds to CSAMemAlloc(2400) (CSA memory allocation)
// in the loop, when token.take is ready to execute,
// make the pool address available to the add64 instruction
gate64 pool_gated, token_take_inord, pool
// compute the address of the local storage slot
// assigned to this iteration
add64 slot_addr, slot_offsets, pool_gated
// implementing token.return on CSA
// in the loop, when token.return is ready to execute,
// write the slot offset back at the tail of the LIC
gate64 slot_offsets, token_return_inord, slot_offsets
// signal the completion of token.return with this mov0
mov0 token_return_outord, token_return_inord

In this way, the dataflow properties of the CSA LIC are leveraged to enable pipelining of parallel loops while ensuring that sufficient local storage is available for the in-flight loop iterations. The compiler 60 can perform this transformation automatically, and a prototype OpenMP language extension has been implemented to demonstrate the advantages of this solution.

Turning now to FIG. 6, a performance-enhanced computing system 151 is shown. The system 151 may generally be part of an electronic device/platform having computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communications functionality (e.g., smartphone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), vehicular functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), Internet of Things (IoT) functionality, etc., or any combination thereof. In the illustrated example, the system 151 includes a host processor 153 (e.g., central processing unit/CPU) with an integrated memory controller (IMC) 155 that is coupled to a system memory 157.

The illustrated system 151 also includes an Input Output (IO) module 159, the IO module 159 being implemented as a system on chip (SoC) on semiconductor die 163 with host processor 153 and graphics processor 161 (e.g., a graphics processing unit/GPU). The illustrated IO module 159 communicates with, for example, a display 165 (e.g., touchscreen, liquid crystal display/LCD, light emitting diode/LED display), a network controller 167 (e.g., wired and/or wireless), and mass storage 169 (e.g., hard disk drive/HDD, optical disk, solid state drive/SSD, flash memory).

In an embodiment, the host processor 153, the graphics processor 161, and/or the IO module 159 execute instructions 171 retrieved from the system memory 157 and/or the mass storage 169 to perform one or more aspects of the method 50 (FIG. 4) already discussed. Thus, execution of the illustrated instructions 171 may cause the computing system 151 to: detect one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code; automatically generate pipelined execution code for the one or more local variables; and merge the pipelined execution code into an output of a compiler.

In an embodiment, the pipelined execution code includes an initialization of a pool of buffer storage for the one or more local variables. In such a case, the initialized buffer pool may be larger than (e.g., a multiple of) the local storage corresponding to a single iteration of the loop. Additionally, the pipelined execution code may further include a definition of a plurality of tokenized slots in the initialized buffer pool, where each tokenized slot corresponds to a pipelined iteration of the loop. In an embodiment, the pipelined execution code further includes a pipeline depth definition. In one example, the local variable(s) are detected after registration of the source code, and the source code is associated with a communication channel in a dataflow graph. Further, the automatic generation of the pipelined execution code may be performed in response to detection of the local variable(s).

The illustrated system 151 therefore improves performance at least to the extent that pipelining loops containing private data improves throughput. In practice, the total number of cycles required to execute a loop may be much less than the product of the static cycle count and the number of loop iterations.

FIG. 7 shows a semiconductor package device 173. The illustrated device 173 includes one or more substrates 175 (e.g., silicon, sapphire, gallium arsenide) and logic 177 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) 175. The logic 177 may be implemented at least partly in configurable logic or fixed-functionality logic hardware. In one example, the logic 177 implements one or more aspects of the method 50 (FIG. 4) already discussed. Thus, the logic 177 may detect one or more local variables in source code, wherein the local variable(s) lack dependencies between iterations of a loop in the source code, automatically generate pipelined execution code for the local variable(s), and merge the pipelined execution code into an output of a compiler. The illustrated device 173 therefore improves performance at least to the extent that pipelining loops containing private data improves throughput. In practice, the total number of cycles required to execute a loop may be much less than the product of the static cycle count and the number of loop iterations.

In one example, logic 177 includes transistor channel regions positioned (e.g., embedded) within substrate(s) 175. Thus, the interface between logic 177 and substrate(s) 175 may not be an abrupt junction. Logic 177 can also be considered to comprise epitaxial layers grown on initial wafers of substrate(s) 175.

FIG. 8 illustrates a processor core 200 according to one embodiment. The processor core 200 may be the core for any type of processor, such as a microprocessor, an embedded processor, a digital signal processor (DSP), a network processor, or another device to execute code. Although only one processor core 200 is illustrated in FIG. 8, a processing element may alternatively include more than one of the processor core 200 illustrated in FIG. 8. The processor core 200 may be a single-threaded core or, for at least one embodiment, may be multithreaded in that it may include more than one hardware thread context (or "logical processor") per core.

FIG. 8 also illustrates a memory 270 coupled to the processor core 200. The memory 270 may be any of a wide variety of memories (including various layers of a memory hierarchy) as are known or otherwise available to those of skill in the art. The memory 270 may include one or more instructions of code 213 to be executed by the processor core 200, wherein the code 213 may implement one or more aspects of the method 50 (FIG. 4) already discussed. The processor core 200 follows a program sequence of instructions indicated by the code 213. Each instruction may enter a front-end portion 210 and be processed by one or more decoders 220. The decoder 220 may generate as its output a micro-operation such as a fixed-width micro-operation in a predefined format, or may generate other instructions, microinstructions, or control signals that reflect the original code instruction. The illustrated front-end portion 210 also includes register renaming logic 225 and scheduling logic 230, which generally allocate resources and queue the operation corresponding to the converted instruction for execution.

Processor core 200 is shown to include execution logic 250 having a set of execution units 255-1 through 255-N. Some embodiments may include a large number of execution units dedicated to a specified function or group of functions. Other embodiments may include only one execution unit or one execution unit that may perform a particular function. The illustrated execution logic 250 performs the operations specified by the code instructions.

After completing execution of the operations specified by the code instructions, back-end logic 260 retires the instructions of the code 213. In one embodiment, the processor core 200 allows out-of-order execution but requires in-order retirement of instructions. Retirement logic 265 may take a variety of forms known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 200 is transformed during execution of the code 213, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 225, and any registers (not shown) modified by the execution logic 250.

Although not illustrated in FIG. 8, a processing element may include other elements on chip with the processor core 200. For example, a processing element may include memory control logic along with the processor core 200. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches.

Referring now to FIG. 9, shown is a block diagram of an embodiment of a computing system 1000 in accordance with an embodiment. Shown in FIG. 9 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. While two processing elements 1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element.

The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG. 9 may be implemented as a multi-drop bus rather than point-to-point interconnects.

As shown in FIG. 9, each of the processing elements 1070 and 1080 may be multicore processors, including first and second processor cores (i.e., processor cores 1074a and 1074b and processor cores 1084a and 1084b). Such cores 1074a, 1074b, 1084a, 1084b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG. 8.

Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b. The shared caches 1896a, 1896b may store data (e.g., instructions) that is utilized by one or more components of the processor, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared caches 1896a, 1896b may locally cache data stored in memories 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared caches 1896a, 1896b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.

While shown with only two processing elements 1070, 1080, understand the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of the processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, the additional processing element(s) may include additional processor(s) that are the same as first processor 1070, additional processor(s) that are heterogeneous or asymmetric to first processor 1070, accelerators (such as, for example, graphics accelerators or Digital Signal Processing (DSP) units), field programmable gate arrays, or any other processing element. There may be various differences between the processing elements 1070, 1080 in terms of a range of quality metrics including architectural, microarchitectural, thermal, power consumption characteristics, and so forth. These differences may manifest themselves effectively as asymmetries and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the processing elements 1070, 1080 may reside in the same die package.

The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include an MC 1082 and P-P interfaces 1086 and 1088. As shown in FIG. 9, the MCs 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors. While the MC 1072 and the MC 1082 are illustrated as integrated into the processing elements 1070, 1080, for alternative embodiments the MC logic may be discrete logic outside the processing elements 1070, 1080 rather than integrated therein.

First processing element 1070 and second processing element 1080 may be coupled to I/O subsystem 1090 via P-P interconnects 1076, 1086, respectively. As shown in FIG. 9, I/O subsystem 1090 includes P-P interfaces 1094 and 1098. In addition, the I/O subsystem 1090 includes an interface 1092 that couples the I/O subsystem 1090 with a high performance graphics engine 1038. In one embodiment, bus 1049 may be used to couple graphics engine 1038 to I/O subsystem 1090. Alternatively, a point-to-point interconnect may couple these components.

In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express (PCI Express) bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.

As shown in FIG. 9, various I/O devices 1014 (e.g., biometric scanners, speakers, cameras, sensors) may be coupled to the first bus 1016, along with a bus bridge 1018 which may couple the first bus 1016 to a second bus 1020. In one embodiment, the second bus 1020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, communication device(s) 1026, and a data storage unit 1019 such as a disk drive or other mass storage device which may include code 1030, in one embodiment. The illustrated code 1030 may implement one or more aspects of the method 50 (FIG. 4) already discussed. Further, an audio I/O 1024 may be coupled to the second bus 1020, and a battery 1010 may supply power to the computing system 1000.

Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG. 9, a system may implement a multi-drop bus or another such communication topology. Also, the elements of FIG. 9 may alternatively be partitioned using more or fewer integrated chips than shown in FIG. 9.

Additional description and examples:

example 1 includes a performance enhanced computing system, comprising: a network controller; a processor coupled to the network controller; and a memory coupled to the processor, the memory including a set of executable program instructions that, when executed by the processor, cause the processor to: detecting one or more local variables in the source code, wherein the one or more local variables lack dependencies between iterations of the loop in the source code; pipelined execution code that automatically generates one or more local variables; and merging the pipelined execution code into an output of the compiler.

Example 2 includes the computing system of example 1, wherein the pipelined execution code includes initialization of a buffer pool for the one or more local variables.

Example 3 includes the computing system of example 2, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

Example 4 includes the computing system of example 2, wherein the pipelined execution code further includes a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

Example 5 includes the computing system of example 1, wherein the pipelined execution code includes a pipeline depth definition.

Example 6 includes the computing system of any of examples 1 to 5, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in the dataflow graph.

Example 7 includes a semiconductor device, comprising: one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented at least in part in one or more of configurable logic or fixed function hardware logic, the logic coupled to the one or more substrates to: detect one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code; automatically generate pipelined execution code for the one or more local variables; and merge the pipelined execution code into an output of a compiler.

Example 8 includes the semiconductor device of example 7, wherein the pipelined execution code includes initialization of a buffer pool for the one or more local variables.

Example 9 includes the semiconductor device of example 8, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

Example 10 includes the semiconductor device of example 8, wherein the pipelined execution code further includes a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

Example 11 includes the semiconductor device of example 7, wherein the pipelined execution code includes a pipeline depth definition.

Example 12 includes the semiconductor device of any one of examples 7 to 11, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in the dataflow graph.

Example 13 includes the semiconductor device of any one of examples 7 to 12, wherein the logic coupled to the one or more substrates includes a transistor channel region within the one or more substrates.

Example 14 includes at least one computer readable storage medium comprising a set of instructions that, when executed by a computing system, cause the computing system to: detect one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code; automatically generate pipelined execution code for the one or more local variables; and merge the pipelined execution code into an output of a compiler.

Example 15 includes the at least one computer readable storage medium of example 14, wherein the pipelined execution code includes initialization of a buffer pool for the one or more local variables.

Example 16 includes the at least one computer readable storage medium of example 15, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

Example 17 includes the at least one computer readable storage medium of example 15, wherein the pipelined execution code further includes a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

Example 18 includes the at least one computer readable storage medium of example 14, wherein the pipelined execution code includes a pipeline depth definition.

Example 19 includes the at least one computer readable storage medium of any one of examples 14 to 18, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in a dataflow graph.

Example 20 includes a method of operating a compiler, the method comprising: detecting one or more local variables in source code, wherein the one or more local variables lack dependencies between iterations of a loop in the source code; automatically generating pipelined execution code for the one or more local variables; and merging the pipelined execution code into an output of the compiler.

Example 21 includes the method of example 20, wherein the pipelined execution code includes initialization of a buffer pool of one or more local variables.

Example 22 includes the method of example 21, wherein the initialized buffer pool is larger than the local storage corresponding to a single iteration of the loop.

Example 23 includes the method of example 21, wherein the pipelined execution code further includes a definition of a plurality of tokenized slots in the initialized buffer pool, and wherein each tokenized slot corresponds to a pipelined iteration of the loop.

Example 24 includes the method of example 20, wherein the pipelined execution code includes a pipeline depth definition.

Example 25 includes the method of any one of examples 20 to 24, wherein the one or more local variables are detected after registration of the source code, the automatic generation of the pipelined execution code is responsive to the detection of the one or more local variables, and the source code is associated with a communication channel in the dataflow graph.

Example 26 includes means for performing the method of any of examples 20-25.

Thus, the techniques described herein may include an automatic compiler transformation that takes as input a loop with some form of loop local storage and dynamically pipelines the loop using one or more workers on a dataflow architecture (such as a CSA). The compiler may detect the local storage remaining in the loop after registration and allocate enough memory to hold the private variables for: a) each worker, and b) the concurrent executions within each worker. As a worker begins executing an iteration, the worker may be assigned a unique slot in the allocated private storage. When the worker completes execution of the iteration, the local storage slot associated with the iteration may be automatically reclaimed for use by a subsequent iteration.

Several applications/benchmarks, such as, for example, SpGEMM (sparse matrix-matrix multiplication) and the Apriori benchmark, may benefit from the transformation techniques described herein.

Embodiments are applicable for use with all types of semiconductor integrated circuit ("IC") chips. Examples of such IC chips include, but are not limited to, processors, controllers, chipset components, Programmable Logic Arrays (PLAs), memory chips, network chips, system on chip (SoC), SSD/NAND controller ASICs, and the like. Additionally, in some of the figures, signal conductors are represented by lines. Some lines may be different to indicate more constituent signal paths, may have a number label to indicate the number of constituent signal paths, and/or may have arrows at one or more ends to indicate primary information flow direction. However, this should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and be implemented with any suitable type of signal scheme, such as digital or analog lines implemented with differential pairs, fiber optic lines, and/or single-ended lines.

Example sizes/models/values/ranges may have been given, but embodiments are not limited thereto. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the FIGS, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Moreover, various arrangements may be shown in block diagram form in order to avoid obscuring the embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements is highly dependent upon the platform within which the embodiments are to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that the various embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.

The term "coupled" may be used herein to refer to any type of direct or indirect relationship between the components in question, and may apply to electrical, mechanical, fluidic, optical, electromagnetic, electromechanical or other connections. In addition, the terms "first," "second," and the like may be used herein only for ease of discussion, and do not have a specific temporal or chronological meaning unless otherwise stated.

As used in this application and in the claims, a list of items joined by the term "one or more of" may mean any combination of the listed items. For example, the phrase "one or more of A, B or C" may mean A; B; C; A and B; A and C; B and C; or A, B and C.

Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, the specification and the following claims.
