Dynamically replacing calls in a software library with accelerator calls
Reader's note: this technique, "Dynamically replacing calls in a software library with accelerator calls," was created by L.G. Thompson, P. Schardt, J. C-T. Chen, and J. Carey on 2018-12-19. Abstract: A computer program includes calls to a software library. A virtual function table is built that includes the calls to the software library in the computer program. A programmable device includes one or more currently implemented accelerators. The currently implemented accelerators that are available are determined, as are the calls in the software library that correspond to the currently implemented accelerators. One or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to the corresponding currently implemented accelerators. When a call in the software library can be implemented in a new accelerator, an accelerator image for the new accelerator is dynamically generated. The accelerator image is then deployed to create the new accelerator, and one or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to the new accelerator.
1. An apparatus, comprising:
at least one processor;
a memory coupled to the at least one processor;
a programmable device coupled to the at least one processor, the programmable device including a currently implemented accelerator;
a computer program resident in the memory and executed by the at least one processor;
a software library residing in the memory, the software library comprising a plurality of functions invoked by the computer program; and
an accelerator deployment tool residing in the memory and coupled to the at least one processor, the accelerator deployment tool determining a plurality of calls to the software library in the computer program, building a virtual function table comprising the plurality of calls to the software library in the computer program, determining that the currently implemented accelerator is available in the programmable device, determining a first one of the plurality of calls in the software library corresponding to the currently implemented accelerator, and replacing the first call to the software library in the virtual function table with a call to the currently implemented accelerator.
2. The apparatus of claim 1, wherein the programmable device includes an open coherent accelerator processor interface (OpenCAPI) coupled to the at least one processor.
3. The apparatus of claim 1, wherein the accelerator deployment tool determines that a second call in the software library can be implemented in a new accelerator, dynamically generates an accelerator image for the new accelerator, deploys the accelerator image to the programmable device to create the new accelerator, and replaces the second call to the software library in the virtual function table with a call to the new accelerator.
4. The apparatus of claim 3, wherein the new accelerator is not the currently implemented accelerator in the programmable device.
5. The apparatus of claim 3, wherein the accelerator deployment tool dynamically generates the accelerator image by converting a code portion in the computer program to a hardware description language representation and then processing the hardware description language representation to generate the accelerator image therefrom.
6. The apparatus of claim 1, wherein the programmable device comprises a Field Programmable Gate Array (FPGA).
7. An apparatus, comprising:
at least one processor;
a memory coupled to the at least one processor;
a Field Programmable Gate Array (FPGA) coupled to the at least one processor, including an open coherent accelerator processor interface (OpenCAPI) coupled to the at least one processor, wherein the FPGA includes a currently implemented accelerator;
a computer program resident in the memory and executed by the at least one processor;
a software library residing in the memory, the software library comprising a plurality of functions invoked by the computer program; and
an accelerator deployment tool residing in the memory and coupled to the at least one processor, the accelerator deployment tool determining a plurality of calls to the software library in the computer program, building a virtual function table comprising the plurality of calls to the software library in the computer program, determining that the currently implemented accelerator in the FPGA is available, determining a first one of the plurality of calls in the software library corresponding to the currently implemented accelerator, replacing the first call to the software library in the virtual function table with a call to the currently implemented accelerator, determining that a second one of the plurality of calls in the software library can be implemented in a new accelerator other than the currently implemented accelerator, dynamically generating an accelerator image by converting a code portion in the computer program into a hardware description language representation and then processing the hardware description language representation to generate the accelerator image, deploying the accelerator image to the FPGA to create the new accelerator, and replacing the second call to the software library in the virtual function table with a call to the new accelerator.
8. A method for improving runtime performance of a computer program, the method comprising:
providing a currently implemented accelerator in a programmable device;
providing a software library comprising a plurality of functions invoked by the computer program;
determining a plurality of calls to the software library in the computer program;
establishing a virtual function table, the virtual function table including the plurality of calls to the software library in the computer program;
determining that the currently implemented accelerator is available in the programmable device;
determining a first one of the plurality of calls in the software library corresponding to the currently implemented accelerator; and
replacing the first call to the software library in the virtual function table with a call to the currently implemented accelerator.
9. The method of claim 8, wherein the programmable device comprises an open coherent accelerator processor interface (OpenCAPI) coupled to at least one processor.
10. The method of claim 8, further comprising:
determining that a second call in the software library can be implemented in a new accelerator;
dynamically generating an accelerator image for the new accelerator;
deploying the accelerator image to the programmable device to create the new accelerator; and
replacing the second call to the software library in the virtual function table with a call to the new accelerator.
11. The method of claim 10, wherein the new accelerator is not the currently implemented accelerator in the programmable device.
12. The method of claim 10, wherein dynamically generating the accelerator image for the new accelerator comprises:
converting a code portion in the computer program into a hardware description language representation; and
processing the hardware description language representation to generate the accelerator image therefrom.
13. The method of claim 8, wherein the programmable device comprises a Field Programmable Gate Array (FPGA).
Technical Field
The present disclosure relates generally to computer systems, and more particularly to hardware accelerators in computer systems.
Background
The open coherent accelerator processor interface (OpenCAPI) is a specification developed by a consortium of industry leaders. The OpenCAPI specification defines an interface that allows any processor to attach to coherent user-level accelerators and I/O devices. OpenCAPI provides a high-bandwidth, low-latency open interface design specification that aims to reduce the complexity of high-performance accelerator design. OpenCAPI is capable of transferring 25 gigabits per second (Gbps) per channel, which exceeds the current peripheral component interconnect express (PCIe) specification, which provides a maximum transfer rate of 16 Gbps per channel. OpenCAPI provides a data-centric approach that puts compute power closer to the data and removes inefficiencies in traditional system architectures, helping to eliminate system performance bottlenecks and improve system performance. A significant advantage of OpenCAPI is that the virtual addresses of a processor can be shared and used in an OpenCAPI device, such as an accelerator, in the same way the processor uses them. With the development of OpenCAPI, hardware accelerators that include an OpenCAPI architected interface may now be developed.
Disclosure of Invention
Aspects of the present invention provide a computer program that includes calls to a software library. A virtual function table is established that includes the calls to the software library in the computer program. A programmable device includes one or more currently implemented accelerators. The currently implemented accelerators that are available are determined, as are the calls in the software library that correspond to the currently implemented accelerators. One or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to the corresponding currently implemented accelerators. When a call in the software library can be implemented in a new accelerator, an accelerator image for the new accelerator is dynamically generated. The accelerator image is then deployed to create the new accelerator. One or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to the new accelerator.
Viewed from a first aspect, the present invention provides an apparatus comprising: at least one processor; a memory coupled to the at least one processor; a programmable device coupled to the at least one processor, the programmable device including a currently implemented accelerator; a computer program residing in the memory and executed by the at least one processor; a software library residing in the memory, the software library comprising a plurality of functions invoked by the computer program; and an accelerator deployment tool residing in the memory and coupled to the at least one processor, the accelerator deployment tool determining a plurality of calls to the software library in the computer program, establishing a virtual function table comprising the plurality of calls to the software library in the computer program, determining that the currently implemented accelerator in the programmable device is available, determining that a first one of the plurality of calls in the software library corresponds to the currently implemented accelerator, and replacing the first call to the software library in the virtual function table with a call to the currently implemented accelerator.
Preferably, the present invention provides an apparatus wherein the programmable device comprises an open coherent accelerator processor interface (OpenCAPI) coupled to the at least one processor.
Preferably, the present invention provides an apparatus wherein the accelerator deployment tool determines that a second call in the software library can be implemented in a new accelerator, dynamically generates an accelerator image for the new accelerator, deploys the accelerator image to the programmable device to create the new accelerator, and replaces the second call to the software library in the virtual function table with a call to the new accelerator.
Preferably, the present invention provides an apparatus wherein said new accelerator is not said currently implemented accelerator in said programmable device.
Preferably, the present invention provides an apparatus wherein the accelerator deployment tool dynamically generates the accelerator image by converting code portions in the computer program into a hardware description language representation and then processing the hardware description language representation to generate the accelerator image therefrom.
Preferably, the present invention provides an apparatus wherein the programmable device comprises a Field Programmable Gate Array (FPGA).
Viewed from a second aspect, the present invention provides an apparatus comprising: at least one processor; a memory coupled to the at least one processor; a Field Programmable Gate Array (FPGA) coupled to the at least one processor, including an open coherent accelerator processor interface (OpenCAPI) coupled to the at least one processor, wherein the FPGA includes a currently implemented accelerator; a computer program residing in the memory and executed by the at least one processor; a software library residing in the memory, the software library comprising a plurality of functions invoked by the computer program; and an accelerator deployment tool residing in the memory and coupled to the at least one processor, the accelerator deployment tool determining a plurality of calls to the software library in the computer program, establishing a virtual function table comprising the plurality of calls to the software library in the computer program, determining that the currently implemented accelerator in the FPGA is available, determining a first one of the plurality of calls in the software library corresponding to the currently implemented accelerator, replacing the first call to the software library in the virtual function table with a call to the currently implemented accelerator, determining that a second one of the plurality of calls in the software library can be implemented in a new accelerator that is not the currently implemented accelerator, dynamically generating an accelerator image for the new accelerator by converting a code portion in the computer program to a hardware description language representation and then processing the hardware description language representation to generate the accelerator image, deploying the accelerator image to the FPGA to create the new accelerator, and replacing the second call to the software library in the virtual function table with a call to the new accelerator.
Viewed from a third aspect, the present invention provides a method for improving runtime performance of a computer program, said method comprising: providing a currently implemented accelerator in a programmable device; providing a software library comprising a plurality of functions invoked by the computer program; determining a plurality of calls to the software library in the computer program; establishing a virtual function table, wherein the virtual function table comprises the plurality of calls to the software library in the computer program; determining that the currently implemented accelerator is available in the programmable device; determining a first one of the plurality of calls in the software library corresponding to the currently implemented accelerator; and replacing the first call to the software library in the virtual function table with a call to the currently implemented accelerator.
Preferably, the present invention provides a method wherein the programmable device comprises an open coherent accelerator processor interface (OpenCAPI) coupled to at least one processor.
Preferably, the present invention provides a method, further comprising: determining that a second call in the software library can be implemented in a new accelerator; dynamically generating an accelerator image for the new accelerator; deploying the accelerator image to the programmable device to create the new accelerator; replacing a second call to the software library in the virtual function table with a call to the new accelerator.
Preferably, the present invention provides a method wherein said new accelerator is not said currently implemented accelerator in said programmable device.
Preferably, the present invention provides a method wherein dynamically generating the accelerator image for the new accelerator comprises: converting code portions in the computer program into hardware description language representations; processing the hardware description language representation to generate the accelerator image therefrom.
Preferably, the present invention provides a method wherein the programmable device comprises a Field Programmable Gate Array (FPGA).
The foregoing and other features and advantages will be apparent from the following more particular description, as illustrated in the accompanying drawings.
Drawings
The present disclosure will be described with reference to the accompanying drawings, wherein like reference numerals refer to like elements, and:
FIG. 1 is a block diagram of an example system showing how an open coherent accelerator processor interface (OpenCAPI) may be used;
FIG. 2 is a block diagram of a programmable device with an OpenCAPI interface, which may include one or more hardware accelerators;
FIG. 3 is a block diagram of a computer system including tools for dynamically generating and deploying accelerators for portions of code in a computer program;
FIG. 4 is a flow diagram illustrating a particular implementation of how the accelerator image generator in FIG. 3 generates an accelerator image from a portion of code;
FIG. 5 is a block diagram of a particular embodiment of the code analyzer of FIG. 3 that analyzes a computer program and selects portions of code;
FIG. 6 is a flow diagram of a method for identifying a portion of code in a computer program, dynamically generating and deploying an accelerator corresponding to the portion of code, and then modifying the computer program to replace the portion of code with a call to the deployed accelerator;
FIG. 7 is a block diagram illustrating a first example computer program having different code portions;
FIG. 8 is a block diagram showing how a code portion is converted to HDL and then to an accelerator image, which may be deployed to a programmable device to provide an accelerator;
FIG. 9 is a block diagram representing the computer program of FIG. 7 after code portion B has been replaced with a call to the accelerator for code portion B;
FIG. 10 is a block diagram illustrating a sample accelerator catalog;
FIG. 11 is a flow diagram of a method for deploying an accelerator for a portion of code when a catalog of previously generated accelerators is maintained;
FIG. 12 is a block diagram illustrating a second example computer program having different code portions;
FIG. 13 is a block diagram representing two portions of code in the computer program of FIG. 12 that would benefit from accelerators;
FIG. 14 is a block diagram illustrating a sample accelerator catalog that includes an accelerator corresponding to code portion Q;
FIG. 15 is a block diagram showing deployment of the accelerator image for code portion Q identified in the catalog in FIG. 14 to a programmable device;
FIG. 16 is a block diagram showing the computer program of FIG. 12 after code portion Q has been replaced by a call to the accelerator for code portion Q;
FIG. 17 is a block diagram illustrating generation of an accelerator image from code portion R in the computer program shown in FIGS. 12 and 16;
FIG. 18 is a block diagram illustrating the deployment of the newly generated accelerator image for code portion R to a programmable device;
FIG. 19 is a block diagram representing the computer program of FIG. 16 after code portion R has been replaced with a call to the accelerator for code portion R;
FIG. 20 is a block diagram of the accelerator catalog 1400 shown in FIG. 14 after creating an entry for the accelerator for code portion R;
FIG. 21 is a block diagram of an example computer program;
FIG. 22 is a block diagram of a programmable device with an OpenCAPI interface that includes accelerators for the loop portion, the branching tree portion, and the long serial portion of the computer program of FIG. 21;
FIG. 23 is a block diagram of the computer program of FIG. 21 after the code portions have been replaced with calls to the corresponding accelerators;
FIG. 24 is a block diagram of a prior art computer program that calls functions in a software library;
FIG. 25 is a flow diagram of a method for replacing calls to the software library with corresponding calls to one or more currently implemented accelerators;
FIG. 26 illustrates a virtual function table that creates a level of indirection for calls from the computer program to the software library;
FIG. 27 is a block diagram of the computer program of FIG. 24 after the calls to the software library have been replaced with calls to the virtual function table;
FIG. 28 is a block diagram of an accelerator correlation table showing currently implemented accelerators corresponding to functions in the software library;
FIG. 29 is a block diagram of a programmable device showing the three currently implemented accelerators listed in the table of FIG. 28;
FIG. 30 illustrates the virtual function table of FIG. 26 after calls to the software library have been replaced with calls to the corresponding accelerators;
FIG. 31 is a flow diagram of a method for generating a new accelerator and replacing one or more calls to the software library with one or more corresponding calls to the new accelerator;
FIG. 32 is a block diagram of a programmable device showing the three previously generated accelerators and the one new accelerator generated in FIG. 31; and
FIG. 33 shows the virtual function table of FIGS. 26 and 30 after replacing calls to the software library with corresponding calls to the new accelerator.
Detailed Description
As discussed in the background section above, the open coherent accelerator processor interface (OpenCAPI) is a specification that defines an interface that allows any processor to connect to a coherent user-level accelerator and I/O device. Referring to FIG. 1, an example computer system 100 is shown to illustrate some concepts related to the OpenCAPI interface 150. The processor 110 is coupled to standard memory 140 or a memory hierarchy as is known in the art. The processor is coupled to one or more PCIe devices 130 through a PCIe interface 120. The processor 110 is also coupled through an OpenCAPI interface 150 to one or more associated devices, such as an accelerator 160, an associated
The deployment of accelerators to programmable devices is well known in the art. Referring to FIG. 2, programmable device 200 represents any suitable programmable device. The programmable device 200 could be, for example, an FPGA or an ASIC. An OpenCAPI interface 210 can be implemented within the programmable device. In addition, one or more accelerators can be implemented in the programmable device 200. FIG. 2 shows, by way of example, accelerator 1 220A and accelerator 2 220B. In the prior art, a human designer would determine what type of accelerator is needed based on the function that needs to be accelerated by being implemented in hardware. The accelerator function could be represented, for example, in a hardware description language (HDL). Using known tools, the designer could then generate an accelerator image corresponding to the HDL. The accelerator image, once loaded into a programmable device such as 200 in FIG. 2, creates an accelerator in the programmable device that may be called as needed by one or more computer programs to provide one or more hardware accelerators.
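As a rough software illustration of this deploy-then-call flow, the following Python sketch models a programmable device that can be loaded with accelerator images. All names here (ProgrammableDevice, deploy, accel_B) are illustrative assumptions for this sketch, not part of the specification or of any real FPGA toolchain.

```python
class ProgrammableDevice:
    """Toy model of a programmable device (e.g., an FPGA) that can be
    loaded with accelerator images. Purely illustrative: a real image is
    a bitstream, modeled here as a dict with a name and a callable."""

    def __init__(self):
        self.accelerators = {}

    def deploy(self, image):
        # Loading the image creates a callable accelerator on the device
        # that programs may then invoke as needed.
        self.accelerators[image["name"]] = image["function"]
        return self.accelerators[image["name"]]


device = ProgrammableDevice()
# Hypothetical accelerator image: a stand-in for a synthesized bitstream.
image = {"name": "accel_B", "function": lambda x: x + 1}
accel = device.deploy(image)

assert accel(41) == 42                    # the deployed accelerator is callable
assert "accel_B" in device.accelerators   # and now resides on the device
```

The dict-based "image" is of course a placeholder; the point is only that deployment turns a passive image into an invokable accelerator on the device.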
The computer program includes calls to a software library. A virtual function table is created that includes the calls to the software library in the computer program. A programmable device includes one or more currently implemented accelerators. The currently implemented accelerators that are available are determined, as are the calls in the software library that correspond to the currently implemented accelerators. One or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to the corresponding currently implemented accelerators. When a call in the software library can be implemented in a new accelerator, an accelerator image for the new accelerator is dynamically generated. The accelerator image is then deployed to create the new accelerator. One or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to the new accelerator.
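The level of indirection just described can be sketched in ordinary software. The following Python fragment is a toy model of the virtual function table mechanism, under the assumption that both software-library functions and accelerator calls can be modeled as plain callables; every name in it (soft_fft, accel_fft, VirtualFunctionTable) is hypothetical and not taken from the specification.

```python
def soft_fft(data):
    """Stand-in for a software-library function."""
    return [x * 2 for x in data]          # placeholder computation

def accel_fft(data):
    """Stand-in for a call to a currently implemented accelerator.
    Semantically identical to the library call, faster in hardware."""
    return [x * 2 for x in data]

class VirtualFunctionTable:
    """One level of indirection between the program and the library."""

    def __init__(self):
        self._table = {}

    def register(self, name, func):
        # Initially, every entry routes to the software library.
        self._table[name] = func

    def replace(self, name, accel_func):
        # Swap a software-library call for a corresponding accelerator call.
        self._table[name] = accel_func

    def call(self, name, *args):
        return self._table[name](*args)


vft = VirtualFunctionTable()
vft.register("fft", soft_fft)             # program calls go through the table
assert vft.call("fft", [1, 2]) == [2, 4]

vft.replace("fft", accel_fft)             # accelerator becomes available
assert vft.call("fft", [1, 2]) == [2, 4]  # same semantics, new target
```

Because the program only ever calls through the table, entries can be retargeted at runtime without modifying the program itself, which is the point of the indirection.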
Referring to FIG. 3,
The
In a first implementation, the
Although
The present invention may be a system, method, and/or computer program product at any possible level of integration of technical details. The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to perform aspects of the invention.
The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, integrated circuit configuration data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
FIG. 4 illustrates details of one suitable implementation of the
The HDL for the code portions 420 is fed into one or more processes, which may include synthesis and simulation. The synthesis process 430 is shown in steps 432, 434, 436, 438 and 440 in the middle portion of FIG. 4. The simulation process 450 is shown in the lower steps 452, 454 and 460 of FIG. 4. The HDL for the code portions 420 may be fed into the synthesis module 432, which determines which hardware elements are needed. The place and route module 434 determines where on the programmable device to place the hardware elements and how to route the interconnections between them. Timing analysis 436 analyzes the performance of the accelerator after the hardware elements have been placed and the interconnections have been routed in block 434. The test module 438 tests the resulting accelerator image to determine whether timing and performance parameters are met. When the design of the accelerator still needs improvement, the test module 438 feeds back to the debug module 440. This process may be repeated several times.
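The iterative synthesize/place-and-route/analyze/test/debug loop described above can be sketched as follows. This is a minimal illustration only: real synthesis, place and route, and timing analysis are performed by vendor EDA tools, and all function and field names here are assumptions, not an actual tool API.

```cpp
#include <cassert>
#include <string>

// Stand-in for the accelerator image being refined by the flow.
struct AcceleratorImage {
    std::string hdl;
    int revision = 0;       // number of debug iterations applied
    bool timingMet = false;
};

static void synthesize(AcceleratorImage&) {}     // synthesis module 432 (stub)
static void placeAndRoute(AcceleratorImage&) {}  // place and route module 434 (stub)

static bool timingAnalysis(const AcceleratorImage& img) {  // timing analysis 436
    return img.revision >= 2;  // pretend timing closes after two refinements
}

static void debugAndRefine(AcceleratorImage& img) {  // debug module 440
    ++img.revision;
}

// Repeat the loop until the test module 438 reports that timing and
// performance parameters are met.
AcceleratorImage generateImage(const std::string& hdl) {
    AcceleratorImage img{hdl};
    for (;;) {
        synthesize(img);
        placeAndRoute(img);
        if (timingAnalysis(img)) {  // test passed: image is ready
            img.timingMet = true;
            break;
        }
        debugAndRefine(img);        // feed back to the debug module
    }
    return img;
}
```

The loop terminates only when the timing check passes, mirroring the "repeated several times" feedback path in FIG. 4.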
The simulation process 450 takes the HDL for the code portions 420 and performs a computer simulation to determine its functionality. A simulation test module 454 determines whether the simulated design functions as desired. When the design of the accelerator still needs improvement, the simulation test module 454 feeds back to the debug module 460.
The
Some details of one possible implementation of the
The
Referring to fig. 6, a
Some examples are now provided to illustrate the concepts discussed above. FIG. 7 shows an example computer program 700 that includes multiple code portions, shown in FIG. 7 as code portion A 710, code portion B 720, and code portion C 730. We assume code portion B 720 is identified as a code portion whose performance will be improved by implementing it in a hardware accelerator. Code portion B 720 is then converted to a corresponding HDL representation 810, as shown in FIG. 8. The HDL for code portion B 810 is then used to generate an accelerator image for code portion B 820, using, for example, the method shown in FIG. 4 or any other suitable method. Once the accelerator image for code portion B 820 has been generated, the accelerator image is loaded into a programmable device 830 to generate the accelerator for code portion B 850. Programmable device 830 is one suitable implementation of
Once the accelerator is deployed in the programmable device 830, code portion B in the computer program is deleted and replaced by a call to the accelerator for code portion B 910, as shown in FIG. 9. In the most preferred implementation, the accelerator for code portion B includes a return to the calling code once the processing of code portion B in the accelerator is complete. In this manner, the
In the first embodiment, accelerators may be dynamically generated to improve the performance of a computer program, as shown in FIGS. 4-9 and described above. In the second embodiment, once an accelerator has been dynamically generated, it can be stored in an accelerator directory so that it can be reused when needed. FIG. 10 shows a sample accelerator directory 1000, which is one suitable implementation of the
The latency field preferably specifies the average latency of the accelerator. For the example shown in FIG. 10, the latency of Acc1 is 1.0 microsecond, while the latency of accelerator AccN is 500 nanoseconds. The latency may represent, for example, the time needed for the accelerator to perform its intended function. The other characteristics field may include any other suitable information or data that describes or otherwise identifies the accelerator, its characteristics and attributes, and the code portion that corresponds to the accelerator. For the two example entries in FIG. 10, the other characteristics field indicates that Acc1 includes a network connection, and that AccN has an affinity to Acc5, which means AccN should be placed in close physical proximity to Acc5, if possible. The fields in FIG. 10 are shown by way of example, and an accelerator directory could include any suitable information or data within the scope of the disclosure and claims herein.
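A directory entry of the kind described above can be sketched as a simple record. Only the latency and "other characteristics" fields come directly from the text; the remaining names here are illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Minimal sketch of an accelerator directory entry (FIG. 10).
struct DirectoryEntry {
    std::string name;                  // accelerator name, e.g. "Acc1"
    double latencyMicroseconds;        // average latency of the accelerator
    std::string otherCharacteristics;  // e.g. "network connection"
};

// The directory maps an accelerator name to its entry.
using AcceleratorDirectory = std::map<std::string, DirectoryEntry>;

// Convenience lookup mirroring a directory query for average latency.
double latencyOf(const AcceleratorDirectory& dir, const std::string& name) {
    return dir.at(name).latencyMicroseconds;
}
```

A real directory would carry additional fields (for example, the location of the stored accelerator image), as suggested by the discussion of step 1176 below.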
Referring to FIG. 11, a method 1100 in accordance with the second embodiment begins by running the computer program (step 1110). The run-time performance of the computer program is analyzed (step 1120). One or more code portions in the computer program that will be improved by using a hardware accelerator are identified (step 1130). One of the identified code portions is selected (step 1140). When there is a previously generated accelerator for the selected code portion in the accelerator directory (yes at step 1150), the previously generated accelerator image is deployed to the programmable device (step 1160) to provide the accelerator. The computer program is then modified to replace the selected code portion with a call to the accelerator (step 1162). When there is no previously generated accelerator for the selected code portion in the directory (no at step 1150), an accelerator image for the selected code portion is dynamically generated (step 1170), the accelerator image is deployed to a programmable device (step 1172), the computer program is modified to replace the code portion with a call to the newly deployed accelerator (step 1174), and the accelerator is stored in the accelerator directory (step 1176). When the accelerator image is stored within the directory entry, step 1176 writes the accelerator image into the directory. When the accelerator image is stored in memory external to the directory, step 1176 stores the accelerator image to the external memory and writes an entry into the accelerator directory that includes the path to the accelerator image.
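The directory check in steps 1150-1176 can be sketched as a cache-style lookup: reuse a catalogued image when one exists, otherwise generate, deploy, and catalogue a new one. All names here are illustrative stand-ins, not the patent's actual implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

// Maps a code portion identifier to its stored accelerator image.
using Catalog = std::map<std::string, std::string>;

std::string deployAccelerator(Catalog& catalog, const std::string& codePortion) {
    auto it = catalog.find(codePortion);
    if (it != catalog.end())
        return it->second;                  // yes at step 1150: deploy cached image (1160)
    std::string image = "image(" + codePortion + ")";  // step 1170: generate
    catalog[codePortion] = image;           // step 1176: store in the directory
    return image;                           // step 1172: deploy the new image
}
```

The first call for a given code portion pays the generation cost; subsequent calls reuse the stored image, which is the point of the second embodiment.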
When there are more identified code portions (yes at step 1180), the method 1100 loops back to step 1140 and continues. When there are no more identified code portions (no at step 1180), the method 1100 loops back to step 1120 and continues. This means that the method 1100 most preferably continuously monitors the computer program and dynamically generates and/or deploys accelerators as needed to improve the runtime performance of the computer program.
An example is now provided to illustrate the concepts in FIG. 11 that relate to the second preferred embodiment. FIG. 12 shows an example computer program 1200 that includes a number of code portions, represented in FIG. 12 as code portion P 1210, code portion Q 1220, and code portion R 1230. We assume steps 1110, 1120, and 1130 in FIG. 11 are performed. In step 1130, we assume code portion Q 1220 and code portion R 1230 are identified as code portions that will be improved by implementing these code portions in accelerators, as shown in table 1300 in FIG. 13. We further assume accelerator directory 1400, which is one suitable implementation of the
The computer program is then modified to replace the selected code portion Q 1220 with a call to the accelerator for code portion Q (step 1162). FIG. 16 shows the computer program 1200 in FIG. 12 after code portion Q has been replaced with the call to the accelerator for code portion Q, shown at 1610 in FIG. 16. Thus, instead of executing code portion Q, the computer program 1600 calls the accelerator for code portion Q 1540 in
There is still an identified code portion remaining (yes at step 1180), namely code portion R shown in FIG. 13, so method 1100 in FIG. 11 loops back to step 1140, where code portion R 1230 is selected (step 1140). There is no previously generated accelerator for code portion R in the directory 1400 shown in FIG. 14 (no at step 1150), so an accelerator image is dynamically generated for code portion R (step 1170). This is represented in FIG. 17, where code portion R 1230 is used to generate the HDL for code portion R 1710, which is used to generate the accelerator image for code portion R 1720. The accelerator image for code portion R 1720 is thus newly dynamically generated, and the accelerator image is then deployed to the programmable device (step 1172). This is shown in FIG. 18, where the
A more specific example is shown in FIGS. 21 and 22. For this example, we assume a computer program named example1 2100 includes three different code portions of interest, namely a
The computer program example1 2100 also includes a
The computer program example1 2100 also includes a long
We assume that the code portions in FIG. 21 are identified from the performance data 520 generated by the code evaluator 510 in FIG. 5. The criteria used by the code selection tool 530 to select the
FIG. 22 shows a
FIG. 23 shows the example computer program example1 2100 after the code portions shown in FIG. 21 have been replaced with the calls to the hardware accelerators shown in FIG. 22, so the
FIG. 24 shows a prior art computer program 2400 that includes calls to functions in a software library 2410. Software libraries are well known in the art, and provide common functions that programmers can use without having to code them. For example, functions that perform compression, graphics operations, and XML parsing could be included in a software library. Computer program 2400 includes code portion D 2420, code portion E 2422, code portion F 2424, and possibly other code portions through code portion L 2428. The software library 2410 includes functions L1 2430, L2 2432, L3 2434, L4 2436, and possibly other functions through function LN 2450. Code portion D 2420 in computer program 2400 includes a call to function L1 2430 in the software library 2410. Code portion F 2424 includes a call to function L4 2436 in the software library 2410. Code portion L 2428 includes a call to function L2 2432 in the software library 2410.
Referring to FIG. 25, a method 2500 is preferably performed by the
One specific implementation of a virtual function table is shown at 2600 in FIG. 26. The virtual function table 2600 lists calls that were previously made directly from the computer program to the software library, and creates a level of indirection so that those calls can use accelerators when possible. As shown in computer program 2700 in FIG. 27, the calls in computer program 2400 in FIG. 24 have been replaced by calls to functions in the virtual function table 2600. Thus, the call to L1 is replaced by a call to F1; the call to L4 is replaced by a call to F4; and the call to L2 is replaced by a call to F2. The virtual function table 2600 indicates which function to call for each call from the computer program. When the virtual function table is initially built, each call from the computer program is mapped to a corresponding call to the software library. The modified computer program 2700 and the virtual function table 2600 thus provide similar functionality as shown in FIG. 24, but with a level of indirection. Thus, code portion D 2720 calls function F1 in the virtual function table 2600, which generates a call to L1 in the software library. Code portion F 2724 calls function F4 in the virtual function table 2600, which generates a call to L4 in the software library. Code portion L 2728 calls function F2 in the virtual function table 2600, which generates a call to L2 in the software library. We see from this simple example that when the virtual function table is initially built, it provides similar functionality as shown in FIG. 24, namely, each call to the virtual function table results in a corresponding call to the software library.
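The level of indirection described above can be sketched with a table of function pointers: each program-facing entry initially targets the library function, and a caller is unaffected when the entry is later repointed. The library and accelerator bodies here are illustrative stand-ins, not the patent's actual implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

using Fn = int (*)(int);

static int library_L1(int x) { return x + 1; }          // software library function
static int accelerator_for_L1(int x) { return x + 1; }  // same function, in an accelerator

// Sketch of the virtual function table 2600: entry name -> current target.
struct VirtualFunctionTable {
    std::map<std::string, Fn> entries;
    int call(const std::string& name, int arg) const {
        return entries.at(name)(arg);  // the one level of indirection
    }
};
```

Because the program calls F1 rather than L1 directly, repointing F1 at the accelerator requires no change to the program itself.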
FIG. 28 shows an accelerator correlation table 2800. For this example, we assume three accelerators Acc1, Acc2 and Acc3 have been deployed, and that these accelerators correspond to three functions in the software library. Thus, Acc1 corresponds to library function L4; Acc2 corresponds to library function L1; and Acc3 corresponds to library function L2, as shown in FIG. 28. The correlation between accelerators and library functions could be determined in any suitable way, including a user manually generating entries in the accelerator correlation table, or the accelerator deployment tool automatically determining the correlation between accelerators and library functions. For an accelerator manually generated by a user, the user could use the same library name and function name, thereby allowing a code linker to automatically detect the accelerator and create calls to the accelerator instead of to the software library. Similarly, an automatically generated accelerator could use the same library name and function name, allowing the code linker to function in similar fashion to automatically detect the accelerator and create calls to the accelerator instead of to the software library. In a different implementation, an accelerator could include data that characterizes its functionality, thereby allowing the accelerator to be queried to determine the functionality it supports, and this information could then be used to replace calls to the software library with corresponding calls to the accelerator.
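Applying a correlation table of this kind to the virtual function table can be sketched as a single pass: every entry whose target library function has a deployed accelerator is repointed at that accelerator. The string-based tables here are illustrative stand-ins.

```cpp
#include <cassert>
#include <map>
#include <string>

// vtable: virtual-function-table entry -> current target (library fn or accelerator)
// correlation: library function -> deployed accelerator (FIG. 28)
std::map<std::string, std::string> applyCorrelations(
    std::map<std::string, std::string> vtable,
    const std::map<std::string, std::string>& correlation) {
    for (auto& [entry, target] : vtable) {
        auto it = correlation.find(target);
        if (it != correlation.end())
            target = it->second;  // replace the library call with the accelerator call
    }
    return vtable;
}
```

Entries with no corresponding accelerator are left untouched and continue to call the software library.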
FIG. 29 shows a
We now consider method 2500 in FIG. 25 with reference to the specific example in FIGS. 26-29. Steps 2510 and 2520 build the virtual function table 2600 in FIG. 26. Step 2530 determines that Acc1 2910, Acc2 2920 and Acc3 2930 are currently implemented and available for use in
In an alternative embodiment, not only can currently implemented accelerators be used to replace calls to software library functions, but new accelerators can also be dynamically generated to replace calls to software library functions. Referring to FIG. 31, when a call to the software library cannot be implemented in a new accelerator (no at step 3110),
We continue with the same example shown in FIGS. 26-30 in discussing
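The alternative embodiment of FIG. 31 can be sketched as follows: when a library call has no currently implemented accelerator but can be implemented in a new one, an accelerator image is dynamically generated and deployed, and the virtual function table entry is repointed at the new accelerator. All names here, including the feasibility check, are illustrative assumptions.

```cpp
#include <cassert>
#include <map>
#include <string>

// Assumed feasibility check (step 3110): can this library function
// be implemented in a new accelerator? Always true in this sketch.
static bool canImplementInAccelerator(const std::string&) { return true; }

void replaceWithNewAccelerator(std::map<std::string, std::string>& vtable,
                               const std::string& entry) {
    const std::string libraryFn = vtable.at(entry);
    if (!canImplementInAccelerator(libraryFn))
        return;                                      // no at step 3110: leave as-is
    std::string image = "image(" + libraryFn + ")";  // dynamically generate the image
    std::string accel = "deployed(" + image + ")";   // deploy it to create the accelerator
    vtable[entry] = accel;                           // replace the software library call
}
```

As with the currently implemented accelerators, the calling program is unchanged; only the table entry is rewritten.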
The accelerators shown in FIGS. 8, 15 and 22 may include an OpenCAPI interface. Note, however, that an OpenCAPI interface is not strictly necessary to dynamically generate and deploy accelerators as disclosed and claimed herein. Deploying accelerators to a programmable device that includes an OpenCAPI interface is very useful because the OpenCAPI specification is open, allowing anyone to develop to the specification and interoperate in a cloud environment. In addition, the OpenCAPI interface provides lower latency, reducing the "distance" between an accelerator and the data it may consume or produce. Furthermore, OpenCAPI provides higher bandwidth, increasing the amount of data an accelerator can consume or produce in a given time. These advantages of OpenCAPI combine to provide a good environment for implementing code portions of a computer program in an accelerator, and lower the threshold for determining which code portions can be better implemented in an accelerator than in the computer program. However, the disclosure and claims herein apply equally to accelerators that do not include or have access to an OpenCAPI interface.
A computer program includes calls to a software library. A virtual function table is built that includes the calls to the software library in the computer program. A programmable device includes one or more currently implemented accelerators. The available currently implemented accelerators are determined. Calls in the software library that correspond to currently implemented accelerators are determined. One or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to a corresponding currently implemented accelerator. When a call in the software library can be implemented in a new accelerator, an accelerator image for the new accelerator is dynamically generated. The accelerator image is then deployed to create the new accelerator. One or more calls to the software library in the virtual function table are replaced with one or more corresponding calls to the new accelerator.
Those skilled in the art will understand that many variations are possible within the scope of the claims. Thus, while the present disclosure has been particularly shown and described above, it will be understood by those skilled in the art that these and other changes in form and details may be made therein without departing from the scope of the claims.