Shared accelerator memory system and method
Reading note: This technology, "Shared accelerator memory system and method", was created by S.库马尔, D.库法蒂, P.兰茨, P.马罗利亚, R.桑卡兰, and K.科宁 on 2020-02-28. Its main content is as follows: The present disclosure relates to systems and methods of sharing memory circuitry between a processor memory circuit and accelerator memory circuits in each of a plurality of peer-to-peer connected accelerator units. Each accelerator unit includes physical-to-virtual address translation circuitry and migration circuitry. The physical-to-virtual address translation circuitry in each accelerator unit includes a page for each of at least some of the plurality of accelerator units. Migration circuitry transfers data between the processor memory circuitry and the accelerator memory circuitry in each of the plurality of accelerator circuits. Migration circuitry migrates data to and evicts data from the accelerator memory circuitry based on statistical information associated with accesses to at least one of: a processor memory circuit or an accelerator memory circuit of one or more peer accelerator circuits. Thus, processor memory circuitry and accelerator memory circuitry may be dynamically allocated to advantageously minimize system latency attributable to data access operations.
1. An accelerator system, comprising:
at least one accelerator circuit;
accelerator memory circuitry communicatively coupled to the at least one accelerator circuit, the accelerator memory circuitry to locally store a plurality of pages transferred from the processor memory circuitry;
accelerator virtual-to-physical address translation circuitry to translate a virtual address received from the at least one accelerator circuit to a physical memory address in one of: a processor memory circuit or an accelerator memory circuit;
accelerator page migration circuitry to:
collect usage data for the virtual-to-physical address translations performed by the accelerator virtual-to-physical address translation circuitry; and
cause bidirectional page transfers between the processor memory circuit and the accelerator memory circuit based, at least in part, on the collected usage data.
2. The accelerator system of claim 1, wherein the accelerator virtual-to-physical address translation circuitry comprises:
accelerator translation lookaside buffer (A-TLB) circuitry communicatively coupled to the at least one accelerator circuit, the A-TLB circuitry to store data representing a plurality of virtual memory address to physical memory address translations; and
accelerator memory management unit (A-MMU) circuitry communicatively coupled between the A-TLB circuitry and accelerator memory circuitry, the A-MMU circuitry to store a plurality of pages to service virtual memory address to physical memory address translations associated with virtual memory addresses not found in the A-TLB circuitry.
3. The accelerator system of claim 2, the A-TLB circuitry further to:
transmit a virtual memory address to physical memory address translation service request to a processor memory management unit (P-MMU) circuit, the virtual memory address to physical memory address translation service request associated with a virtual memory address not found in the A-TLB circuit or one of a plurality of page tables included in the A-MMU circuit.
4. The accelerator system of claim 2, the A-MMU circuitry further to:
perform one or more data transformation operations on the data transferred to the accelerator memory circuit.
5. The accelerator system of claim 2, the A-MMU circuitry further to:
perform one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuit, the data transformation operations comprising at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
6. The accelerator system of claim 1, wherein the accelerator page migration circuitry comprises:
page migration prediction circuitry to initiate a page transfer between the processor memory circuitry and the accelerator memory circuitry; and
page data mover circuitry to bi-directionally migrate pages between the processor memory circuitry and accelerator memory circuitry in response to receipt of a request by the page migration prediction circuitry.
7. The accelerator system of claim 6, wherein the page migration prediction circuitry is to initiate a page transfer between the processor memory circuitry and the accelerator memory circuitry based on a virtual memory address associated with a memory access operation.
8. The accelerator system of claim 1, the accelerator memory circuitry to locally store a plurality of pages transferred from the processor memory circuitry for use by at least one of: the at least one accelerator circuit or at least one accelerator circuit disposed in a second accelerator system communicatively coupled to the accelerator system.
9. A method of processor acceleration, comprising:
performing, by accelerator virtual address to physical address translation circuitry disposed in an accelerator unit, translation of one or more virtual memory addresses to one or more physical memory addresses;
collecting, by accelerator page migration circuitry, data associated with a plurality of virtual memory address to physical memory address translations performed by accelerator virtual address to physical address translation circuitry; and
causing, by the accelerator page migration circuitry, migration of the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
10. The processor acceleration method of claim 9, further comprising:
storing, by accelerator translation lookaside buffer (A-TLB) circuitry communicatively coupled to the at least one accelerator circuit, data representing a plurality of virtual memory address to physical memory address translations; and
storing, by an accelerator memory management unit (A-MMU) circuit communicatively coupled between the A-TLB circuit and the accelerator memory circuit, a plurality of pages for servicing a virtual memory address to physical memory address translation associated with a virtual memory address not found in the A-TLB circuit.
11. The processor acceleration method of claim 10, further comprising:
communicating, by the A-TLB circuit to a processor memory management unit (P-MMU) circuit, a virtual memory address to physical memory address translation service request associated with a virtual memory address not found in the A-TLB circuit or one of a plurality of tables included in the A-MMU circuit.
12. The processor acceleration method of claim 10, further comprising:
performing, by the A-MMU circuitry, one or more data transformation operations on the data transferred to the accelerator memory circuitry.
13. The processor acceleration method of claim 10, further comprising:
performing, by the A-MMU circuitry, one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuitry, the data transformation operations comprising at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
14. The processor acceleration method of any one of claims 9 to 13:
wherein collecting data associated with the plurality of virtual memory address to physical memory address translations performed by the accelerator virtual address to physical address translation circuitry comprises:
collecting, by page migration prediction circuitry, data associated with the plurality of virtual memory address to physical memory address translations performed by accelerator virtual address to physical address translation circuitry; and
wherein causing migration of the one or more pages from the processor memory circuit to the accelerator memory circuit based at least in part on the collected virtual memory address to physical memory address translation data comprises:
causing, by the page migration prediction circuit, the page data mover circuit to migrate the one or more pages from the processor memory circuit to the accelerator memory circuit based, at least in part, on the collected virtual memory address to physical memory address translation data.
15. The processor acceleration method of any one of claims 9 to 13, further comprising:
sharing at least a portion of the plurality of pages stored in the accelerator memory circuit with at least one other accelerator system included in a plurality of accelerator systems via an inter-accelerator fabric communicatively coupling each of the plurality of accelerator systems.
16. A processor acceleration system comprising:
means for performing a translation of one or more virtual memory addresses to one or more physical memory addresses;
means for collecting data associated with a plurality of virtual memory address to physical memory address translations performed by accelerator virtual address to physical address translation circuitry; and
means for migrating one or more pages between the processor memory circuitry and the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
17. The processor acceleration system of claim 16, further comprising:
means for storing data representing a plurality of virtual memory address to physical memory address translations.
18. The processor acceleration system of claim 17, further comprising:
means for transmitting a virtual memory address to physical memory address translation service request associated with a virtual memory address not found in the stored data representing the plurality of virtual memory address to physical memory address translations.
19. The processor acceleration system of claim 17, further comprising:
means for performing one or more data transformation operations on data transferred to the accelerator memory circuit.
20. The processor acceleration system of claim 17, further comprising:
means for performing one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuit, the data transformation operations comprising at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
21. The processor acceleration system of any of claims 16 to 20:
wherein means for collecting data associated with the plurality of virtual memory address to physical memory address translations performed by the accelerator virtual address to physical address translation circuitry comprises:
means for collecting data associated with the plurality of virtual memory address to physical memory address translations; and
wherein means for causing migration of the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data comprises:
means for migrating the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
22. The processor acceleration system of any of claims 16 to 20, further comprising:
means for sharing at least a portion of the plurality of pages stored in the accelerator memory circuit with at least one other accelerator system included in a plurality of accelerator systems.
23. A chipset arranged to perform the method of any of claims 9 to 15.
24. A non-transitory machine-readable medium comprising a plurality of instructions that in response to being executed on a computing device, cause the computing device to carry out the method according to any one of claims 9 to 15.
Technical Field
The present disclosure relates to shared data storage, and more particularly to peer-to-peer shared accelerator memory.
Background
Many modern heterogeneous computing platforms include one or more host processors coupled with a plurality of compute accelerators (general-purpose graphics processing units (GPGPUs), field-programmable gate arrays (FPGAs), tensor processing units (TPUs)). High-speed memory (e.g., DRAM) may be present within the processor, and high-speed/high-bandwidth memory (e.g., HBM) may be present within each of the plurality of compute accelerators. Such high-speed memory provides significantly reduced access times compared to host system memory accessed via one or more host buses (e.g., PCIe buses). Thus, there may be significant advantages where frequently accessed data is held in memory near the point of use (e.g., the processor or compute accelerator) rather than simply in system memory accessed only via the relatively slow host system bus. For example, a modern programmable parallel accelerator such as a GPGPU may provide approximately 900 gigabytes per second (GB/s) of bandwidth to GPGPU (or device) memory, compared to approximately 30 GB/s (PCIe Gen 3 x16) to host or system memory. To achieve peak performance of the compute accelerator, the accelerator memory should be used to service most data accesses performed by the accelerator.
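By way of a rough, non-limiting illustration only, the effect of the local hit rate on sustainable bandwidth can be estimated with a simple harmonic-mean model (the hit rates below are assumed solely for illustration; the 900 GB/s and 30 GB/s figures are those quoted above):

#include <stdio.h>

/* Harmonic-mean estimate of sustained bandwidth when a fraction `hit` of
 * accesses is served from local accelerator (HBM) memory and the rest
 * crosses the host link; bandwidths in GB/s, taken from the text above. */
static double effective_bw(double hit, double local_bw, double link_bw)
{
    return 1.0 / (hit / local_bw + (1.0 - hit) / link_bw);
}

int main(void)
{
    const double local_bw = 900.0; /* accelerator (device) memory          */
    const double link_bw  = 30.0;  /* host memory over PCIe Gen 3 x16      */
    /* Even a modest miss rate collapses throughput toward the link limit. */
    printf("99%% local: %.0f GB/s\n", effective_bw(0.99, local_bw, link_bw)); /* ~698 */
    printf("90%% local: %.0f GB/s\n", effective_bw(0.90, local_bw, link_bw)); /* ~231 */
    printf("50%% local: %.0f GB/s\n", effective_bw(0.50, local_bw, link_bw)); /* ~58  */
    return 0;
}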
Drawings
Features and advantages of various embodiments of the claimed subject matter will become apparent as the following detailed description proceeds, and upon reference to the drawings, in which like numerals depict like parts, and in which:
FIG. 1 is a block diagram of an illustrative system in accordance with at least one embodiment described herein, the illustrative system comprising: a processor circuit; a processor memory management unit (P-MMU) circuit; a processor memory circuit; a plurality of acceleration units, each acceleration unit having a respective accelerator memory circuit; and inter-accelerator fabric circuitry communicatively coupled to the acceleration units;
FIG. 2 is a block diagram of an illustrative system for performing memory access operations by an accelerator unit in accordance with at least one embodiment described herein;
FIG. 3 is a block diagram of an illustrative system for performing a page access operation by a processor circuit when a page resides in the processor memory circuit compared to when the page is migrated to an accelerator memory circuit in accordance with at least one embodiment described herein;
FIG. 4 is a block diagram of an illustrative system in which pages are migrated from processor memory circuitry to accelerator memory circuitry in accordance with at least one embodiment described herein;
FIG. 5 is a block diagram of an illustrative system in which pages are migrated from an accelerator memory circuit to a processor memory circuit in accordance with at least one embodiment described herein;
FIG. 6 is a schematic diagram of an illustrative processor-based electronic device including a Central Processing Unit (CPU) or multi-chip module (MCM) including processor circuitry and one or more accelerator units in accordance with at least one embodiment described herein;
FIG. 7 is a high level flow chart of an illustrative method of causing a page to be transferred between an accelerator memory circuit and a processor memory circuit in a system that includes a plurality of peer accelerator units coupled to the processor circuit in accordance with at least one embodiment described herein;
FIGS. 8A and 8B are block diagrams illustrating the generic vector friendly instruction format and its instruction templates according to embodiments of the invention;
FIGS. 9A, 9B, 9C, and 9D are block diagrams illustrating exemplary specific vector friendly instruction formats according to embodiments of the invention;
FIG. 10 is a block diagram of a register architecture according to one embodiment of the invention;
FIG. 11A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming out-of-order issue/execution pipeline, according to embodiments of the invention;
FIG. 11B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention;
FIGS. 12A and 12B illustrate block diagrams of a more specific exemplary in-order core architecture, which would be one of several logic blocks in a chip (including other cores of the same type and/or different types);
FIG. 13 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have an integrated graphics device, according to an embodiment of the invention;
FIGS. 14, 15, 16, and 17 are block diagrams of exemplary computer architectures; and
FIG. 18 is a block diagram comparing conversion of binary instructions in a source instruction set to binary instructions in a target instruction set using a software instruction converter, according to an embodiment of the invention.
Although the following detailed description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art.
Detailed Description
In systems with multiple accelerators, platform-level scaling or in-package scaling is typically used to scale accelerator performance. With platform-level scaling, multiple accelerators disposed in a single server node are communicatively coupled using an inter-accelerator fabric. Each accelerator presents itself to the processor circuitry as a separate device with local high-bandwidth memory. Each accelerator is able to access a peer's memory via the inter-accelerator fabric. With platform-level scaling, accessing the accelerator's local memory provides the highest bandwidth and lowest latency compared to accessing a peer accelerator's memory.
In the case of in-package scaling, given the relatively high cost and manufacturing complexity of large silicon dies, the trend is toward accelerators constructed from smaller chiplets that are communicatively coupled using silicon interposers or silicon bridges, such as embedded multi-die interconnect bridges (EMIBs). For example, a chiplet can include a processor circuit and a memory controller circuit. Accessing a chiplet's local memory provides the highest bandwidth and lowest latency compared to accessing the memory of a peer chiplet.
To optimize the performance of such heterogeneous platforms, programmers perform non-uniform memory access optimization at three levels: first, between the processor circuit and the accelerator circuit; second, between the accelerator circuit and a remote memory circuit when the working set cannot be accommodated in the local memory circuit; and third, between chiplets included in the accelerator circuit.
In the systems and methods disclosed herein, accelerator circuits (GPGPU, FPGA, TPU, etc.) are communicatively coupled to processor circuits using input/output (I/O) links (e.g., PCIe) or coherency links (e.g., IAL, CCIX, NVLink, etc.). The accelerator circuit includes a local high-bandwidth memory, hereinafter referred to as an "accelerator memory circuit". If the accelerator circuits are coherently attached, the accelerator memory circuits are accessible to the processor circuits and exposed to the operating system as system memory. The accelerator circuit is a device that supports Shared Virtual Memory (SVM) and can perform memory accesses using application virtual addresses. In the systems and methods disclosed herein, the accelerator circuit implements an address translation cache using the Address Translation Services (ATS) mechanism defined by the PCIe specification. In the systems and methods disclosed herein, the accelerator circuitry also supports page faults. Page fault handling is implementation specific, with some implementations blocking on the fault and others switching out the faulting accelerator thread.
By allowing applications to transparently use accelerator memory circuits, the systems and methods disclosed herein provide benefits and advantages over conventional accelerator circuits. More specifically, the systems and methods disclosed herein present several beneficial enhancements. First, the systems and methods disclosed herein allow for overbooking of accelerator memory circuits by allowing programmers to allocate more memory than is physically available in the accelerator circuits. The accelerator memory manager may spill over to a processor-attached memory circuit (e.g., CPU DRAM). Second, the systems and methods disclosed herein advantageously allow for automatic non-uniform memory access (NUMA) optimization such that accelerator circuitry provides information to drivers that use the information to implement heuristics for causing memory pages to migrate between accelerator memory circuitry and processor memory circuitry based on access frequency. Advantageously, intelligent migration of data physically close to the point of use improves application performance. Third, the systems and methods disclosed herein advantageously provide for scalable address translation. The accelerator circuit supports Shared Virtual Memory (SVM), and therefore memory accesses generated by the accelerator circuit require translation. A miss in an accelerator translation lookaside buffer circuit (A-TLB) generates an Address Translation Services (ATS) translation request to an I/O memory management unit (IOMMU) over a PCIe link. As accelerator bandwidth to accelerator memory circuits increases and accelerator memory circuit capacity increases, the A-TLB miss path will become a bottleneck unless the A-TLB maintains a relatively high hit rate. Fourth, the systems and methods disclosed herein advantageously allow for accelerator memory circuit virtualization to provide isolation and high-performance direct access to accelerator memory circuits by a virtual machine implemented in the processor circuits. Fifth, the systems and methods disclosed herein allow for peer-to-peer mapping, providing accelerator circuitry with the ability to map a peer's accelerator memory circuitry in the A-TLB. Sixth, the systems and methods disclosed herein provide peer-to-peer migration capability such that accelerator circuitry is able to migrate pages from memory circuitry in another accelerator through the inter-accelerator fabric, thereby facilitating high-speed page migration and minimizing involvement of the processor circuitry.
The systems and methods disclosed herein advantageously allow an accelerator memory management unit circuit (A-MMU) to perform address translations in conjunction with the system IOMMU to increase the likelihood of obtaining successful translations locally, thereby improving system performance. In addition, the accelerator circuit provides usage statistics to the accelerator driver circuit. The accelerator driver circuit uses the usage statistics to determine which pages to migrate from the processor memory circuit to the accelerator memory circuit and which pages to evict from the accelerator memory circuit to the processor memory circuit, either to create space in the accelerator memory circuit for incoming pages or to reduce the processor circuit's access time to the evicted pages. The A-MMU supports efficient virtualization of accelerator memory circuitry, direct access to migration hints from guest drivers, and direct access to A-MMU invalidation queues from guest drivers without significant Virtual Machine Manager (VMM) overhead (nested first- and second-level translations). In addition, the accelerator page table format includes metadata that is used to configure data transformation operations (compression/decompression, encoding/decoding, etc.) in addition to configuring translation and access permissions.
The systems and methods disclosed herein advantageously provide a plurality of peer accelerator circuits, each having memory circuits for storing memory pages for use by the accelerator circuits. Each accelerator circuit also includes a migration predictor circuit and a data mover circuit for transferring data between the processor memory circuit and the accelerator memory circuit based on statistical usage information gathered by the migration predictor circuit. The accelerator circuit includes a translation lookaside buffer circuit (A-TLB) and a memory management unit circuit (A-MMU). In operation, the A-TLB circuitry performs translation of one or more virtual memory addresses to one or more physical memory addresses in the accelerator memory circuitry. The migration prediction circuitry collects data related to memory accesses to physical memory performed by the A-TLB circuitry. Based at least in part on the gathered statistics, the migration predictor circuit causes the data mover circuit to transfer data between the accelerator memory circuit and the processor memory circuit. Advantageously, pages may be transferred between accelerator memory circuits and processor memory circuits based on the circuit generating the memory access (e.g., accelerator circuit or processor circuit), thereby improving the performance of the system by reducing latency attributable to actual physical memory accesses.
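By way of a non-limiting illustration, the statistics-driven migration decision described above may be pictured as the following software sketch (the structure, the helper names such as dma_migrate_to_accelerator()/dma_evict_to_processor(), the thresholds, and the sampling interval are illustrative assumptions, not the claimed circuitry):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-page record kept by the migration prediction circuitry. */
struct page_stats {
    uint64_t vpn;              /* virtual page number                          */
    uint32_t remote_accesses;  /* accesses that resolved to processor memory   */
    uint32_t local_accesses;   /* accesses that resolved to accelerator memory */
    bool     resident_local;   /* page currently in accelerator memory?        */
};

enum { MIGRATE_THRESHOLD = 64, EVICT_THRESHOLD = 4 };  /* assumed values */

/* Stand-ins for the page data mover; not real driver APIs. */
extern void dma_migrate_to_accelerator(uint64_t vpn);
extern void dma_evict_to_processor(uint64_t vpn);

/* Invoked for every translation performed on the A-TLB/A-MMU path. */
void record_access(struct page_stats *s, bool resolved_locally)
{
    if (resolved_locally)
        s->local_accesses++;
    else
        s->remote_accesses++;
}

/* Invoked periodically: migrate frequently used remote pages in, evict
 * rarely used local pages out, then start a new sampling interval. */
void make_migration_decision(struct page_stats *s)
{
    if (!s->resident_local && s->remote_accesses > MIGRATE_THRESHOLD) {
        dma_migrate_to_accelerator(s->vpn);
        s->resident_local = true;
    } else if (s->resident_local && s->local_accesses < EVICT_THRESHOLD) {
        dma_evict_to_processor(s->vpn);
        s->resident_local = false;
    }
    s->remote_accesses = 0;
    s->local_accesses = 0;
}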
An accelerator system is provided. The system may include: at least one accelerator circuit; accelerator memory circuitry communicatively coupled to the at least one accelerator circuit, the accelerator memory circuitry to locally store a plurality of pages transferred from the processor memory circuitry; accelerator virtual-to-physical address translation circuitry to translate a virtual address received from the at least one accelerator circuit to a physical memory address in one of: a processor memory circuit or an accelerator memory circuit; and an accelerator migration circuit to: collect usage data for memory accesses to physical memory performed by the A-TLB circuitry; and cause bidirectional data transfers between the processor memory circuit and the accelerator memory circuit based at least in part on the collected usage data.
An electronic device is provided. The electronic device may include: a processor circuit; a processor memory management unit (MMU) circuit; a processor memory circuit; and at least one accelerator system comprising: at least one accelerator circuit; accelerator memory circuitry communicatively coupled to the at least one accelerator circuit, the accelerator memory circuitry to locally store a plurality of pages transferred from the processor memory circuitry; accelerator virtual-to-physical address translation circuitry to translate a virtual address received from the at least one accelerator circuit to a physical memory address in one of: a processor memory circuit or an accelerator memory circuit; and an accelerator migration circuit to: collect usage data for virtual-to-physical address translations performed by the accelerator virtual-to-physical address translation circuitry; and cause bidirectional data transfers between the processor memory circuit and the accelerator memory circuit based at least in part on the collected usage data.
A processor acceleration method is provided. The method may include: performing, by accelerator virtual address to physical address translation circuitry disposed in the accelerator unit, translation of the one or more virtual memory addresses to the one or more physical memory addresses; collecting, by accelerator migration circuitry, data related to a plurality of virtual memory address to physical memory address translations performed by accelerator virtual address to physical address translation circuitry; and causing, by the accelerator migration circuitry, migration of the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
A processor acceleration system is provided. The system may include: means for performing a translation of one or more virtual memory addresses to one or more physical memory addresses; means for collecting data related to a plurality of virtual memory address to physical memory address translations performed by accelerator virtual address to physical address translation circuitry; and means for migrating one or more pages between the processor memory circuitry and the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
A non-transitory storage device is provided. The non-transitory storage device may include instructions that, when executed by the accelerator circuit, cause the accelerator circuit to: causing accelerator virtual address to physical address translation circuitry disposed in an accelerator unit to translate one or more virtual memory addresses to one or more physical memory addresses; causing accelerator migration circuitry to collect data related to a plurality of virtual memory address to physical memory address translations performed by accelerator virtual address to physical address translation circuitry; and cause the accelerator migration circuitry to migrate the one or more pages from the processor memory circuitry to the accelerator memory circuitry based, at least in part, on the collected virtual memory address to physical memory address translation data.
Fig. 1 is a block diagram of an
In operation, when
In operation, when performing memory operations, the A-TLB circuitry 172 may provide virtual-to-physical address translations not only for physical addresses within the
P-
Each of the plurality of
Migration circuitry 160 includes any number and/or combination of electrical components, semiconductor devices, and/or logic elements capable of providing the migration prediction circuit 162 and the page data mover circuit 164. In an embodiment, the migration prediction circuit 162 receives, collects, polls, or otherwise obtains statistical information indicative of memory transactions involving the
The migration prediction circuit 162 includes performance counter circuitry and/or hardware monitoring circuitry that, when combined with the predicted instruction set executed by the migration prediction circuit 162, provides hints to identify candidate pages for migration from the
The page data mover circuit 164 includes Direct Memory Access (DMA) circuitry to migrate pages between the local
The physical-to-virtual
A-TLB circuitry 172 includes any number and/or combination of the following: currently available or future developed electronic components, semiconductor devices and/or logic elements capable of providing virtual to physical address translation. If a virtual address is not available within the A-TLB circuitry 172, it may be advantageous to first service the translation request using the A-MMU circuitry 174, rather than using the P-
The A-MMU circuitry 174 includes any number and/or combination of the following: currently available or future developed electronic components, semiconductor devices, and/or logic elements capable of using accelerator page tables (A-PT) to provide virtual to physical address translations for pages that reside in the local
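By way of a non-limiting illustration, the translation ordering described above (the A-TLB first, then the A-MMU's accelerator page tables, and then an ATS request to the P-MMU) may be sketched in software as follows (all function names and return conventions are assumptions made for illustration):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t vaddr_t;
typedef uint64_t paddr_t;

/* Assumed stand-ins for the hardware lookups; each returns true on a hit. */
extern bool atlb_lookup(vaddr_t va, paddr_t *pa);       /* A-TLB circuitry 172   */
extern bool ammu_walk_apt(vaddr_t va, paddr_t *pa);     /* A-MMU 174 / A-PT walk */
extern bool pmmu_ats_request(vaddr_t va, paddr_t *pa);  /* ATS request to P-MMU  */
extern void atlb_fill(vaddr_t va, paddr_t pa);
extern void raise_page_fault(vaddr_t va);

/* Translation order for an accelerator-issued virtual address: the local
 * A-TLB first, then the accelerator page tables, and only then a request
 * across the link to the processor-side MMU. */
bool translate(vaddr_t va, paddr_t *pa)
{
    if (atlb_lookup(va, pa))
        return true;                  /* fastest path, stays on-device        */
    if (ammu_walk_apt(va, pa) || pmmu_ats_request(va, pa)) {
        atlb_fill(va, *pa);           /* cache the translation locally        */
        return true;
    }
    raise_page_fault(va);             /* serviced by the OS/VMM fault handler */
    return false;
}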
The
Fig. 2 is a block diagram of an illustrative system 200 for performing memory access operations by an
If the virtual-to-physical address translation in the physical-to-virtual
In an embodiment, if the
The OS/VMM page fault handler services page faults by: ensuring that the page resides in the
Fig. 3 is a block diagram of an
A CPU page fault is generated in response to a failure to detect a virtual address in P-
If a page fault occurs when
Fig. 4 is a block diagram of an illustrative system 400 in which pages are migrated from
The accelerator unit driver 420 communicates a request to the O/S to update the processor page table circuitry 430. In response, the O/S may synchronize the P-
The accelerator unit driver 420 then updates the virtual address mapping in the accelerator page table circuitry to the page in the
In response to the page being migrated to the
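By way of a non-limiting illustration, the host-to-accelerator migration sequence outlined above may be summarized as the following driver-level sketch (the helper names and the exact ordering of steps elided in the text are assumptions):

#include <stdint.h>

typedef uint64_t vpn_t;

/* Assumed stand-ins for the OS, driver, and page data mover services
 * referred to above. */
extern void os_unmap_host_page(vpn_t vpn);        /* mark host PTE not-present     */
extern void os_flush_processor_tlb(vpn_t vpn);    /* synchronize the P-TLB         */
extern void accel_dma_copy_to_device(vpn_t vpn);  /* page data mover, host->device */
extern void apt_map(vpn_t vpn, unsigned perms);   /* install mapping in the A-PT   */
extern void atlb_invalidate(vpn_t vpn);

void migrate_page_to_accelerator(vpn_t vpn, unsigned perms)
{
    /* 1. Take the page away from the processor so no stale writes land in
     *    processor memory while the copy is in flight. */
    os_unmap_host_page(vpn);
    os_flush_processor_tlb(vpn);

    /* 2. Move the page contents using the page data mover (DMA). */
    accel_dma_copy_to_device(vpn);

    /* 3. Publish the new location: map the virtual address to accelerator
     *    memory with the same permissions, and drop any stale A-TLB entry. */
    apt_map(vpn, perms);
    atlb_invalidate(vpn);
}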
Fig. 5 is a block diagram of an illustrative system 500 in which pages are migrated from
The accelerator unit driver 520 then removes the link to the page in the
The accelerator unit driver 520 then calls the accelerator DMA circuit 540 within the page data mover circuit 164 to copy the page from the
The accelerator unit driver 520 then requests the O/S to update the virtual address mapping in the processor page table with the same permissions associated with the virtual address and processor page in the A-PT.
After the page migration is complete, the accelerator unit driver 520 clears the "FORCE FAULT" flag in the A-PT (if it was set in response to the page being in the read/write state, see above). Resetting the "FORCE FAULT" flag allows the
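The reverse path, eviction from accelerator memory back to processor memory as described above, may similarly be sketched as follows (again, the helper names are illustrative assumptions, and the FORCE FAULT handling is shown unconditionally for simplicity):

#include <stdint.h>

typedef uint64_t vpn_t;

/* Assumed stand-ins for the driver/OS services referred to above. */
extern void apt_set_force_fault(vpn_t vpn);     /* make racing accesses fault     */
extern void apt_clear_force_fault(vpn_t vpn);
extern void apt_unmap(vpn_t vpn);               /* remove the A-PT mapping        */
extern void atlb_invalidate(vpn_t vpn);
extern void accel_dma_copy_to_host(vpn_t vpn);  /* page data mover, device->host  */
extern void os_map_host_page(vpn_t vpn, unsigned perms);

void evict_page_to_processor(vpn_t vpn, unsigned perms)
{
    /* 1. Detach the accelerator-side mapping; the FORCE FAULT marker keeps
     *    in-flight accelerator accesses from touching the page mid-copy. */
    apt_set_force_fault(vpn);
    atlb_invalidate(vpn);

    /* 2. Copy the page back to processor memory with the DMA engine in the
     *    page data mover. */
    accel_dma_copy_to_host(vpn);

    /* 3. Re-establish the processor mapping with the permissions recorded
     *    in the A-PT, remove the A-PT mapping, and clear the marker so new
     *    accesses are serviced from processor memory. */
    os_map_host_page(vpn, perms);
    apt_unmap(vpn);
    apt_clear_force_fault(vpn);
}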
Fig. 6 is a schematic diagram of an illustrative processor-based electronic device 600, the processor-based electronic device 600 including a Central Processing Unit (CPU) or multi-chip module (MCM) 610 and one or
One skilled in the relevant art will appreciate that the illustrated embodiments, as well as other embodiments, may be practiced with other processor-based device configurations, including portable electronic or handheld electronic devices, such as smart phones, portable computers, wearable computers, consumer electronics, personal computers ("PCs"), network PCs, minicomputers, blade servers, mainframe computers, and the like.
The processor-based device 600 includes a bus or similar communication link 616 that communicatively couples various system components and facilitates the exchange of information and/or data between the various system components, including the CPU 610/
The
The system memory 640 may include read only memory ("ROM") 642 and random access memory ("RAM") 646. A portion of the ROM 642 may be used to store or otherwise retain a basic input/output system ("BIOS"). The BIOS 644 provides basic functionality to the processor-based device 600, for example, by causing the
The processor-based device 600 may include at least one wireless input/output (I/O) interface 620. The at least one wireless I/O interface 620 may be communicatively coupled to one or more physical output devices 622 (haptic devices, video displays, audio output devices, hardcopy output devices, etc.). The at least one wireless I/O interface 620 may be communicatively coupled to one or more physical input devices 624 (pointing device, touch screen, keyboard, haptic device, etc.). The at least one wireless I/O interface 620 may comprise any currently available or future developed wireless I/O interface. Example wireless I/O interfaces include, but are not limited to: Bluetooth®, Near Field Communication (NFC), etc.
The processor-based device 600 may include one or more wired input/output (I/O) interfaces 630. The at least one wired I/O interface 630 may be communicatively coupled to one or more physical output devices 622 (haptic devices, video displays, audio output devices, hardcopy output devices, etc.). The at least one wired I/O interface 630 may be communicatively coupled to one or more physical input devices 624 (pointing device, touch screen, keyboard, haptic device, etc.). The wired I/O interface 630 may include any currently available or future developed I/O interface. Example wired I/O interfaces include, but are not limited to: universal Serial Bus (USB), IEEE 1394 ("FireWire"), etc.
The processor-based device 600 may include one or more communicatively coupled non-transitory data storage devices 660. The data storage 660 may include one or more Hard Disk Drives (HDDs) and/or one or more Solid State Storage Devices (SSDs). The one or more data storage devices 660 may include any currently available or future developed storage appliance, network storage device, and/or system. Non-limiting examples of such data storage 660 may include, but are not limited to, any current or future developed non-transitory storage appliance or device, such as one or more magnetic storage devices, one or more optical storage devices, one or more resistive storage devices, one or more molecular storage devices, one or more quantum storage devices, or various combinations thereof. In some implementations, the one or more data storage devices 660 can include one or more removable storage devices, such as one or more flash drives, flash memory units, or similar appliances or devices capable of being communicatively coupled to and decoupled from the processor-based device 600.
One or more data storage devices 660 may include an interface or controller (not shown) to communicatively couple the respective storage device or system to bus 616. One or more data storage devices 660 may store, retain, or otherwise contain a set of machine-readable instructions, data structures, program modules, data stores, databases, logic structures, and/or other data useful to
The set of machine-readable instructions 614 and other programs, applications, logic sets, and/or modules may be stored in whole or in part in the system memory 640. Such an instruction set 614 may be transferred in whole or in part from the one or more data storage devices 660. The set of instructions 614 may be loaded, stored, or otherwise retained in whole or in part in the system memory 640 during execution by the
The processor-based device 600 may include power management circuitry 650, the power management circuitry 650 controlling one or more operational aspects of an energy storage device 652. In an embodiment, energy storage device 652 may comprise one or more primary (i.e., non-rechargeable) or secondary (i.e., rechargeable) batteries or similar energy storage devices. In an embodiment, energy storage device 652 may include one or more ultracapacitors or ultracapacitors. In an embodiment, the power management circuitry 650 may alter, regulate, or control the flow of energy from the external power source 654 to the energy storage device 652 and/or to the processor-based device 600. The power source 654 may include, but is not limited to, a solar energy system, a commercial power grid, a portable generator, an external energy storage device, or any combination thereof.
For convenience, the CPU 610/
FIG. 7 is a high level flow chart of an illustrative method 700 for causing a page to be transferred between an
At 704, in response to receiving a memory access at the virtual address, accelerator virtual-to-physical
At 706, the page migration circuitry 160 collects information and/or data associated with the virtual-to-physical memory address lookup performed at 704. In an embodiment, migration circuitry 160 collects information and/or data associated with the miss (i.e., the virtual address is not found in virtual-to-physical address translation circuitry 170). In an embodiment, migration circuitry 160 collects information and/or data associated with the hit (i.e., the virtual address is found in virtual-to-physical address translation circuitry 170). Such information may be used to (i) track "hot" pages by determining the frequency of accelerator memory accesses to remote memory, such as processor memory circuitry and/or accelerator memory circuitry in one or more peer accelerator circuits; and/or (ii) tracking "cold" pages by determining the frequency of accelerator memory accesses accessing data in the local accelerator memory circuit.
At 708, migration circuitry 160 causes the page to be transferred from
At 710, the migration circuitry 160 causes the page to be evicted from the
The following figures detail exemplary architectures and systems for implementing the above-described embodiments. In some embodiments, one or more hardware components and/or instructions as described above are emulated as described in detail below, or implemented as software modules.
Embodiments of the instruction(s) detailed above may be embodied in the "generic vector friendly instruction format" detailed below. In other embodiments, this format is not utilized and another instruction format is used; however, the descriptions below of the writemask registers, various data transformations (swizzle, broadcast, etc.), addressing, and so forth are generally applicable to the description of embodiments of the instruction(s) above. Additionally, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) described above may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.
The instruction set may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, location of bits) to specify, among other things, an operation to be performed (e.g., an opcode) and the operand(s) and/or other data field(s) (e.g., a mask) on which that operation is to be performed. Some instruction formats are further subdivided through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of the fields of the instruction format (the included fields typically have the same order, but at least some fields have different bit positions because fewer fields are included) and/or to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, a given one of the instruction templates of that instruction format) and includes fields for specifying the operation and the operands. For example, an exemplary ADD instruction has a particular opcode and an instruction format that includes an opcode field to specify the opcode and operand fields (source 1/destination and source 2) to select operands; and an occurrence of this ADD instruction in the instruction stream will have particular contents in the operand fields that select particular operands. A set of SIMD extensions referred to as Advanced Vector Extensions (AVX) (AVX1 and AVX2) and using the Vector Extension (VEX) encoding scheme has been released and/or published (see, e.g., Intel® 64 and IA-32 Architectures Software Developer's Manual, September 2014; and see Intel® Advanced Vector Extensions Programming Reference, October 2014).
Exemplary instruction Format
Embodiments of the instruction(s) described herein may be embodied in different formats. Additionally, exemplary systems, architectures, and pipelines are detailed below. Embodiments of the instruction(s) may be executed on such systems, architectures, and pipelines, but are not limited to those detailed.
Universal vector friendly instruction format
The vector friendly instruction format is an instruction format that is appropriate for vector instructions (e.g., there are certain fields that are specific to vector operations). Although embodiments are described that support both vector and scalar operations with the vector friendly instruction format, alternative embodiments use only vector operations with the vector friendly instruction format.
FIGS. 8A-8B are block diagrams illustrating a generic vector friendly instruction format and its instruction templates according to embodiments of the invention. FIG. 8A is a block diagram illustrating the generic vector friendly instruction format and class A instruction templates thereof according to embodiments of the invention; and FIG. 8B is a block diagram illustrating the generic vector friendly instruction format and class B instruction templates thereof according to embodiments of the invention. In particular, class A and class B instruction templates are defined for the generic vector friendly instruction format 800, neither of which includes a
Although embodiments of the invention will be described in which the vector friendly instruction format supports the following: a 64 byte vector operand length (or size) having a 32 bit (4 byte) or 64 bit (8 byte) data element width (or size) (and thus, a 64 byte vector consists of 16 doubleword-sized elements or alternatively 8 quadword-sized elements); a 64 byte vector operand length (or size) with a 16 bit (2 bytes) or 8 bit (1 byte) data element width (or size); a 32-byte vector operand length (or size) having a 32-bit (4-byte), 64-bit (8-byte), 16-bit (2-byte), or 8-bit (1-byte) data element width (or size); and a 16-byte vector operand length (or size) having a 32-bit (4 bytes), 64-bit (8 bytes), 16-bit (2 bytes), or 8-bit (1 byte) data element width (or size); alternative embodiments may support more, fewer, and/or different vector operand sizes (e.g., 256 byte vector operands) with more, fewer, or different data element widths (e.g., 128 bit (16 byte) data element widths).
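For reference, the element counts stated for the first operand size above follow directly from the arithmetic (the constant names below are illustrative only):

/* Element counts for a 64-byte vector operand, per the widths above. */
enum { VEC_BYTES = 64 };
enum { DWORDS_PER_VEC = VEC_BYTES / 4,    /* 16 doubleword (32-bit) elements */
       QWORDS_PER_VEC = VEC_BYTES / 8 };  /*  8 quadword  (64-bit) elements  */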
The class A instruction templates in FIG. 8A include: 1) showing within the
The generic vector friendly instruction format 800 includes the following fields listed below in the order shown in fig. 8A and 8B.
Format field 840 — a specific value in this field (an instruction format identifier value) uniquely identifies the vector friendly instruction format and, thus, the occurrence of instructions in the vector friendly instruction format in the instruction stream. This field is thus optional in the sense that it is not required by an instruction set having only the generic vector friendly instruction format.
Modifier field 846-whose contents distinguish the presence of instructions in the generic vector instruction format that specify a memory access from those that do not specify a memory access; that is, a distinction is made between
Augmentation operation field 850-its content distinguishes which of a number of different operations to perform in addition to the basic operation. This field is context specific. In one embodiment of the invention, this field is divided into a class field 868, an
Scaling field 860-its content allows the content of the index field to be scaled for memory address generation (e.g., for address generation that uses 2^scale × index + base).
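For example, with an assumed scale of 3, an index of 5, and a base of 0x1000, the generated address would be 0x1000 + 5 × 8 = 0x1028, as the following sketch illustrates:

#include <stdint.h>

/* Effective address = 2^scale * index + base, per the scaling field above;
 * the operand values in the comment are arbitrary examples. */
static uint64_t effective_address(unsigned scale, uint64_t index, uint64_t base)
{
    return (index << scale) + base;
}
/* e.g. effective_address(3, 5, 0x1000) == 0x1000 + 5*8 == 0x1028 */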
Data element width field 864-its content distinguishes which of a plurality of data element widths to use (in some embodiments, for all instructions; in other embodiments, for only some instructions). This field is optional in the sense that it is not needed if only one data element width is supported and/or data element widths are supported using some aspect of the opcodes.
Immediate field 872-its contents allow for the specification of an immediate. This field is optional in the sense that it is not present in implementations of the generic vector friendly format that do not support an immediate and is not present in instructions that do not use an immediate.
Category field 868-its contents distinguish between different classes of instructions. Referring to FIGS. 8A and 8B, the contents of this field select between class A and class B instructions. In FIGS. 8A and 8B, rounded squares are used to indicate that a particular value is present in the field (e.g., class A 868A and class B 868B for category field 868 in FIGS. 8A and 8B, respectively).
Class A instruction template
In the case of a class a
Non-memory access instruction template-full round control type operation
In the non-memory access full round control type operation 810 instruction template, the beta field 854 is interpreted as a round control field 854A, whose content provides static rounding. Although in the described embodiments of the invention the round control field 854A includes a suppress all floating point exceptions (SAE) field 856 and a round operation control field 858, alternative embodiments may encode both of these concepts in the same field or may have only one or the other of these concepts/fields (e.g., may have only the round operation control field 858).
SAE field 856-its content distinguishes whether exception reporting is disabled; when the contents of the SAE field 856 indicate that suppression is enabled, a given instruction will not report any kind of floating point exception flag and will not raise any floating point exception handler.
The round operation control field 858, whose contents distinguish which of a set of rounding operations is to be performed (e.g., round up, round down, round to zero, and round to nearest). Thus, the rounding operation control field 858 allows the rounding mode to be changed on a per instruction basis. In one embodiment of the present invention where the processor includes a control register for specifying the rounding mode, the contents of the rounding
Non-memory access instruction template-data transformation type operation
In the non-memory access data transform type operation 815 instruction template, the beta field 854 is interpreted as a data transform field 854B whose content distinguishes which of a plurality of data transforms is to be performed (e.g., no data transform, obfuscation, broadcast).
In the case of a class A memory access 820 instruction template, the
Vector memory instructions perform vector loads from memory and vector stores to memory with translation support. As with conventional vector instructions, vector memory instructions transfer data from/to memory in a data element-by-data element manner, with the actual transferred elements being specified by the contents of the vector mask selected as the write mask.
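By way of a non-limiting illustration, the element-by-element, mask-controlled transfer described above may be modeled in scalar code as follows (the element type, vector length, and merge-masking behavior are assumptions made for illustration):

#include <stddef.h>
#include <stdint.h>

/* Model of a masked vector load: only elements whose write mask bit is set
 * are read from memory and written to the destination; masked-off elements
 * of the destination keep their prior values (merge masking). */
void masked_vector_load(uint32_t *dst, const uint32_t *src,
                        uint16_t writemask, size_t num_elems)
{
    for (size_t i = 0; i < num_elems; i++) {
        if (writemask & ((uint16_t)1 << i))
            dst[i] = src[i];   /* element transferred from memory */
        /* else: dst[i] is left unchanged                          */
    }
}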
Memory access instruction template-temporal
Temporal data is data that is likely to be reused soon enough to benefit from caching. However, this is a hint, and different processors may implement it in different ways, including ignoring the hint altogether.
Memory access instruction templates-non-temporal
Non-temporal data is data that is unlikely to be reused soon enough to benefit from caching in the first-level cache and should be given priority for eviction. However, this is a hint, and different processors may implement it in different ways, including ignoring the hint altogether.
Class B instruction templates
In the case of a class B instruction template, the
In the case of a class B
In the non-memory access, write mask control, partial round control type operation 810 instruction template, the remainder of the β field 854 is interpreted as the
The rounding
In the non-memory access, write mask control, VSIZE type operation 817 instruction template, the remainder of the beta field 854 is interpreted as a
In the case of a class B memory access 820 instruction template, a portion of the beta field 854 is interpreted as a
With respect to the generic vector friendly instruction format 800, a full opcode field 874 is shown, the full opcode field 874 including the format field 840, the
The
The combination of the writemask field and the data element width field creates typed instructions because they allow masks to be applied based on different data element widths.
The various instruction templates found within class A and class B are beneficial in different situations. In some embodiments of the invention, different processors or different cores within a processor may support only class A, only class B, or both. For example, a high performance generic out-of-order core intended for general purpose computing may support only class B, a core intended primarily for graphics and/or scientific (throughput) computing may support only class A, and a core intended for both may support both class A and class B (of course, a core having some mix of templates and instructions from both classes, but not all templates and instructions from both classes, is within the scope of the invention). Also, a single processor may include multiple cores, all of which support the same class, or in which different cores support different classes. For example, in a processor having separate graphics and general purpose cores, one of the graphics cores intended primarily for graphics and/or scientific computing may support only class A, while one or more of the general purpose cores may be high performance general purpose cores with out-of-order execution and register renaming, intended for general purpose computing, that support only class B. Another processor without a separate graphics core may include one or more general purpose in-order or out-of-order cores that support both class A and class B. Of course, features from one class may also be implemented in the other class in different embodiments of the invention. Programs written in high-level languages will be adapted (e.g., just-in-time compiled or statically compiled) into a variety of different executable forms, including: 1) a form having only instructions of the class(es) supported by the target processor for execution; or 2) a form having alternative routines written using different combinations of the instructions of all classes and having control flow code that selects the routines to execute based on the instructions supported by the processor that is currently executing the code.
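The second form described above, control flow code that selects a routine based on the instructions supported by the executing processor, may be pictured as a simple runtime dispatch (the feature-query helpers and kernel names below are placeholders, not part of any described instruction set):

#include <stdbool.h>

/* Placeholder feature queries; in practice these would be derived from
 * CPUID-style feature enumeration on the executing processor. */
extern bool cpu_supports_class_a(void);
extern bool cpu_supports_class_b(void);

/* Two builds of the same routine, each emitted using only one class of
 * instruction templates, plus a scalar fallback (names are illustrative). */
extern void kernel_class_a(float *out, const float *in, int n);
extern void kernel_class_b(float *out, const float *in, int n);
extern void kernel_scalar (float *out, const float *in, int n);

void run_kernel(float *out, const float *in, int n)
{
    /* Select the routine compiled for an instruction class the current
     * processor actually supports. */
    if (cpu_supports_class_b())
        kernel_class_b(out, in, n);
    else if (cpu_supports_class_a())
        kernel_class_a(out, in, n);
    else
        kernel_scalar(out, in, n);
}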
Exemplary specific vector friendly instruction Format
FIG. 9A is a block diagram illustrating an exemplary specific vector friendly instruction format according to an embodiment of the invention. FIG. 9A illustrates a specific vector friendly instruction format 900, the specific vector friendly instruction format 900 being specific in the sense that it specifies the location, size, interpretation, and order of the fields, as well as the values for some of those fields. The specific vector friendly instruction format 900 may be used to extend the x86 instruction set, and thus some fields are similar or identical to those used in the existing x86 instruction set and its extensions (e.g., AVX). This format remains consistent with the prefix encoding field, the real opcode byte field, the MOD R/M field, the SIB field, the displacement field, and the immediate field of the existing x86 instruction set with extensions. The fields from FIGS. 8A and 8B are illustrated, and the fields from FIG. 9A map into the fields of FIGS. 8A and 8B.
It should be understood that although embodiments of the present invention are described with reference to the specific vector friendly instruction format 900 in the context of the generic vector friendly instruction format 800 for illustrative purposes, the present invention is not limited to the specific vector friendly instruction format 900 unless stated otherwise. For example, the generic vector friendly instruction format 800 takes into account various possible sizes of the various fields, while the specific vector friendly instruction format 900 is shown as having fields of a specific size. As a particular example, while the data
The generic vector friendly instruction format 800 includes the following fields listed in the order shown in FIG. 9A.
EVEX prefix (bytes 0-3) 902-is encoded in four bytes.
Format field 840 (
The second through fourth bytes (EVEX bytes 1-3) include a plurality of bit fields that provide specific capabilities.
REX field 905 (EVEX byte 1, bits [7-5])-consists of an EVEX.R bit field (EVEX byte 1, bit [7]-R), an EVEX.X bit field (EVEX byte 1, bit [6]-X), and an EVEX.B bit field (EVEX byte 1, bit [5]-B). The EVEX.R, EVEX.X, and EVEX.B bit fields provide the same functionality as the corresponding VEX bit fields and are encoded using 1s complement form, i.e., ZMM0 is encoded as 1111B and ZMM15 is encoded as 0000B. As is known in the art, other fields of the instruction encode the lower three bits of the register indexes (rrr, xxx, and bbb), so that Rrrr, Xxxx, and Bbbb may be formed by adding EVEX.R, EVEX.X, and EVEX.B.
REX' field 810-this is the first part of the REX' field 810 and is the EVEX.R' bit field (EVEX byte 1, bit [4]-R') used to encode either the upper 16 or lower 16 of the extended 32 register set. In one embodiment of the invention, this bit, along with other bits indicated below, is stored in bit-inverted format to distinguish (in the well-known x86 32-bit mode) from the BOUND instruction, whose real opcode byte is 62, but which does not accept the value of 11 in the MOD field of the MOD R/M field (described below); alternative embodiments of the invention do not store this bit and the other bits indicated below in the inverted format. The value 1 is used to encode the lower 16 registers. In other words, R'Rrrr is formed by combining EVEX.R', EVEX.R, and the other RRR from other fields.
Opcode map field 915 (EVEX byte 1, bits [ 3: 0] -mmmm) -whose content encodes the implied preamble opcode byte (0F, 0F 38, or 0F 3).
The data element width field 864 (EVEX byte 2, bits [7] -W) -is represented by the symbol EVEX.W. Evex.w is used to define the granularity (size) of the data type (32-bit data element or 64-bit data element).
EVEX.vvvv 920 (EVEX byte 2, bits [6:3] - vvvv) - the role of EVEX.vvvv may include the following: 1) EVEX.vvvv encodes the first source register operand, specified in inverted (1s complement) form, and is valid for instructions with 2 or more source operands; 2) EVEX.vvvv encodes the destination register operand, specified in 1s complement form, for certain vector shifts; or 3) EVEX.vvvv does not encode any operand, in which case the field is reserved and should contain 1111b. Thus, EVEX.vvvv field 920 encodes the 4 low-order bits of the first source register specifier, which are stored in inverted (1s complement) form. Depending on the instruction, an extra different EVEX bit field is used to extend the specifier size to 32 registers.
EVEX.U 868 class field (EVEX byte 2, bit [2] - U) - if EVEX.U = 0, it indicates class A or EVEX.U0; if EVEX.U = 1, it indicates class B or EVEX.U1.
Prefix encoding field 925 (EVEX byte 2, bits [ 1: 0] -pp) -provides additional bits for the basic operation field. In addition to providing support for legacy SSE instructions that employ the EVEX prefix format, this also has the advantage of compacting the SIMD prefix (the EVEX prefix requires only 2 bits instead of one byte to express the SIMD prefix). In one embodiment, to support legacy SSE instructions that use SIMD prefixes (66H, F2H, F3H) in a legacy format and in an EVEX prefix format, these legacy SIMD prefixes are encoded into a SIMD prefix encoding field; and is extended at run-time to a legacy SIMD prefix before being provided to the decoder's PLA (so the PLA can execute these legacy instructions in both legacy and EVEX formats without modification). While newer instructions may use the contents of the EVEX prefix encoding field directly as an opcode extension, certain embodiments extend in a similar manner to maintain consistency, but allow these legacy SIMD prefixes to specify different meanings. Alternate embodiments may redesign the PLA to support 2-bit SIMD prefix encoding and therefore do not require expansion.
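As a sketch of the two-bit compaction discussed above, the expansion of the prefix encoding field into a legacy SIMD prefix byte could be modeled as shown below; the conventional pp-to-prefix mapping (01 → 66H, 10 → F3H, 11 → F2H) is assumed here purely for illustration, and the PLA-facing expansion described above would of course be performed in hardware:

```cpp
#include <cstdint>

// Expand the 2-bit EVEX prefix encoding field (pp) into the legacy SIMD prefix
// byte it stands for; 0x00 means "no SIMD prefix".
uint8_t expand_simd_prefix(uint8_t pp) {
    switch (pp & 0x3) {
        case 0x1: return 0x66;  // operand-size style legacy prefix
        case 0x2: return 0xF3;  // REP/REPE prefix
        case 0x3: return 0xF2;  // REPNE prefix
        default:  return 0x00;  // no legacy SIMD prefix
    }
}
```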
Alpha field 852 (EVEX byte 3, bit [7] - EH; also known as EVEX.EH, EVEX.rs, EVEX.RL, EVEX.write mask control, and EVEX.N; also illustrated with α) - as previously described, this field is context specific.
Beta field 854 (EVEX byte 3, bits [6:4] - SSS, also known as EVEX.s2-0, EVEX.r2-0, EVEX.rr1, EVEX.LL0, EVEX.LLB; also illustrated with βββ) - as previously described, this field is context specific.
REX' field 810 - this is the remainder of the REX' field and is the EVEX.V' bit field (EVEX byte 3, bit [3] - V') that may be used to encode either the upper 16 or the lower 16 of the extended 32 register set. This bit is stored in bit-inverted format. A value of 1 is used to encode the lower 16 registers. In other words, V'VVVV is formed by combining EVEX.V' and EVEX.vvvv.
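A companion sketch for the source specifier: V'VVVV can be recovered by complementing the inverted EVEX.V' and EVEX.vvvv bits, under the same illustrative assumptions as the R'Rrrr example above:

```cpp
#include <cstdint>

// Recover the 5-bit first source register specifier V'VVVV from the raw,
// still-inverted EVEX.V' and EVEX.vvvv prefix bits.
uint8_t evex_source_specifier(uint8_t evex_v_prime, uint8_t evex_vvvv) {
    uint8_t v_hi = static_cast<uint8_t>(~evex_v_prime) & 0x1;   // stored inverted
    uint8_t vvvv = static_cast<uint8_t>(~evex_vvvv) & 0xF;      // stored inverted (1s complement)
    return static_cast<uint8_t>((v_hi << 4) | vvvv);
}
```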
Write mask field 870 (EVEX byte 3, bits [2:0] - kkk) - its content specifies the index of a register in the write mask registers, as previously described. In one embodiment of the invention, the specific value EVEX.kkk = 000 has special behavior implying that no write mask is used for the particular instruction (this may be implemented in a variety of ways, including the use of a write mask hardwired to all ones or hardware that bypasses the masking hardware).
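The merge-masking behavior governed by the kkk field can be modeled in software as follows; the mask_regs array and the treatment of kkk = 000 as an all-ones mask are assumptions made only for this sketch:

```cpp
#include <cstddef>
#include <cstdint>

// Merge-masking model: element i of dst is updated from src only when bit i of
// the selected write mask is set; otherwise the old dst value is preserved.
// kkk == 0 is treated as "no write mask", i.e. an all-ones mask.
void masked_copy(float* dst, const float* src, std::size_t n,
                 const uint64_t mask_regs[8], unsigned kkk) {
    uint64_t mask = (kkk == 0) ? ~0ull : mask_regs[kkk & 0x7];
    for (std::size_t i = 0; i < n && i < 64; ++i) {
        if ((mask >> i) & 1ull)
            dst[i] = src[i];
    }
}
```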
The true opcode field 930 (byte 4) is also referred to as the opcode byte. A portion of the opcode is specified in this field.
MOD R/M field 940 (byte 5) includes MOD field 942,
Scale, index, base (SIB) byte (byte 6) — as previously described, the contents of the
Full opcode field
Figure 9B is a block diagram illustrating the fields of the specific vector friendly instruction format 900 that make up the full opcode field 874, according to one embodiment of the invention. In particular, the full opcode field 874 includes the format field 840, the
Register index field
FIG. 9C is a block diagram illustrating the fields of the specific vector friendly instruction format 900 that make up the
Extended operation field
FIG. 9D is a block diagram illustrating the fields of the specific vector friendly instruction format 900 that make up the
When U = 1, the α field 852 (EVEX byte 3, bit [7] - EH) is interpreted as the write mask control (Z) field 852C. When U = 1 and the MOD field 942 contains 11 (representing a non-memory access operation), part of the β field 854 (EVEX byte 3, bit [4] - S0) is interpreted as the RL field 857A; when it contains a 1 (round 857A.1), the rest of the β field 854 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as a rounding
Exemplary register architecture
FIG. 10 is a block diagram of a register architecture 1000 according to one embodiment of the invention. In the illustrated embodiment, there are 32 vector registers 1010 that are 512 bits wide; these registers are referred to as zmm0 through zmm31. The low order 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The low order 128 bits of the lower 16 zmm registers (the low order 128 bits of the ymm registers) are overlaid on registers xmm0-15. The specific vector friendly instruction format 900 operates on these overlaid register files as shown in the table below.
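The overlaying of the xmm and ymm views on the low-order bits of a zmm register can be pictured with the following software model; it is a sketch of the layout only, not a description of the hardware register file:

```cpp
#include <cstdint>

// Software model of one vector register: the 128-bit xmm view and the 256-bit
// ymm view alias the low-order bytes of the full 512-bit zmm register.
union VectorReg {
    uint8_t zmm[64];  // full 512-bit register
    uint8_t ymm[32];  // low-order 256 bits
    uint8_t xmm[16];  // low-order 128 bits
};

static_assert(sizeof(VectorReg) == 64, "the zmm view defines the register size");
```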
In other words, the
Writemask register 1015 — in the embodiment shown, there are 8 writemask registers (
General purpose registers 1025-in the embodiment shown, there are sixteen 64-bit general purpose registers that are used to address memory operands along with the existing x86 addressing scheme. These register names are RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
Scalar floating point stack register file (x87 stack) 1045, on which the MMX packed integer flat register file 1050 is aliased - in the illustrated embodiment, the x87 stack is an eight-element stack used to perform scalar floating point operations on 32/64/80-bit floating point data using the x87 instruction set extension; while the MMX registers are used to perform operations on 64-bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.
Alternative embodiments of the present invention may use wider or narrower registers. In addition, alternative embodiments of the invention may use more, fewer, or different register files and registers.
Exemplary core architecture, processor, and computer architecture
Processor cores may be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores may include: 1) a general purpose in-order core intended for general purpose computing; 2) a high performance general purpose out-of-order core intended for general purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general purpose computing and/or one or more general purpose out-of-order cores intended for general purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or science (throughput). Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as dedicated logic, such as integrated graphics and/or scientific (throughput) logic, or as a dedicated core); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the coprocessor(s) described above, and additional functionality. An exemplary core architecture is described next, followed by exemplary processor and computer architectures.
Exemplary core architecture
Ordered and out-of-order core block diagrams
FIG. 11A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming out-of-order issue/execution pipeline according to embodiments of the invention. FIG. 11B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention. The solid boxes in FIGS. 11A and 11B illustrate the in-order pipeline and in-order core, while the optionally added dashed boxes illustrate the register renaming out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.
In FIG. 11A, processor pipeline 1100 includes a fetch
Fig. 11B shows processor core 1190 including a front end unit 1130 coupled to an execution engine unit 1150, and both front end unit 1130 and execution engine unit 1150 are coupled to memory unit 1170. Core 1190 may be a Reduced Instruction Set Computing (RISC) core, a Complex Instruction Set Computing (CISC) core, a Very Long Instruction Word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 1190 may be a dedicated core, such as, for example, a network or communication core, compression engine, coprocessor core, general purpose computing graphics processing unit (GPGPU) core, graphics core, or the like.
The front end unit 1130 includes a branch prediction unit 1132, the branch prediction unit 1132 is coupled to an instruction cache unit 1134, the instruction cache unit 1134 is coupled to an instruction translation lookaside buffer (TLB) 1136, the TLB 1136 is coupled to an instruction fetch unit 1138, and the instruction fetch unit 1138 is coupled to a decode unit 1140. The decode unit 1140 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or which otherwise reflect, or are derived from, the original instructions. The decode unit 1140 may be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), and the like. In one embodiment, core 1190 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in decode unit 1140 or otherwise within front end unit 1130). The decode unit 1140 is coupled to a rename/allocator unit 1152 in the execution engine unit 1150.
The execution engine unit 1150 includes a rename/allocator unit 1152, the rename/allocator unit 1152 coupled to a retirement unit 1154 and a set of one or more scheduler units 1156. Scheduler unit(s) 1156 represents any number of different schedulers, including reservation stations, central instruction windows, and so forth. The scheduler unit(s) 1156 is coupled to the physical register file unit(s) 1158. Each physical register file unit 1158 represents one or more physical register files that store one or more different data types, such as scalar integers, scalar floating points, packed integers, packed floating points, vector integers, vector floating points, states (e.g., an instruction pointer that is the address of the next instruction to be executed), and so forth. In one embodiment, physical register file unit(s) 1158 include vector register units, write mask register units, and scalar register units. These register units may provide architectural vector registers, vector mask registers, and general purpose registers. Physical register file unit(s) 1158 is/are overlaid by retirement unit 1154 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using reorder buffer(s) and retirement register file(s); using future file(s), history buffer(s), and retirement register file(s); using register maps and register pools; etc.). Retirement unit 1154 and physical register file unit(s) 1158 are coupled to execution cluster(s) 1160. Execution cluster(s) 1160 includes a set of one or more execution units 1162 and a set of one or more memory access units 1164. The execution units 1162 may perform various operations (e.g., shifts, additions, subtractions, multiplications) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include multiple execution units dedicated to a particular function or groups of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. Scheduler unit(s) 1156, physical register file unit(s) 1158, and execution cluster(s) 1160 are shown as being possibly plural in that certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own scheduler unit, physical register file unit(s), and/or execution cluster-and in the case of a separate memory access pipeline, implement certain embodiments in which only the execution cluster of that pipeline has memory access unit(s) 1164). It should also be understood that where separate pipelines are used, one or more of these pipelines may be issued/executed out-of-order, while the rest are in-order.
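As a minimal, purely illustrative model of the rename step performed by a rename/allocator unit, architectural registers can be mapped to physical registers drawn from a free list; all structure and naming below are assumptions for this sketch, and stall handling and reclamation at retirement are omitted:

```cpp
#include <cstdint>
#include <vector>

// Minimal renaming model: each architectural register maps to a physical
// register; every destination write allocates a fresh physical register from a
// free list so that independent writes can proceed out of order.
struct RenameTable {
    std::vector<uint32_t> map;        // architectural -> physical
    std::vector<uint32_t> free_list;  // currently unused physical registers

    RenameTable(uint32_t arch_regs, uint32_t phys_regs) : map(arch_regs) {
        for (uint32_t a = 0; a < arch_regs; ++a) map[a] = a;
        for (uint32_t p = phys_regs; p-- > arch_regs;) free_list.push_back(p);
    }

    // Rename one instruction "dst <- op(src)", reporting the physical registers
    // it will actually read and write.
    void rename(uint32_t src, uint32_t dst, uint32_t& phys_src, uint32_t& phys_dst) {
        phys_src = map[src];
        phys_dst = free_list.back();
        free_list.pop_back();
        map[dst] = phys_dst;
    }
};
```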
The set of memory access units 1164 is coupled to memory units 1170, and the memory units 1170 include a data TLB unit 1172 coupled to a data cache unit 1174, the data cache unit 1174 coupled to a level two (L2) cache unit 1176. In one exemplary embodiment, the memory access unit 1164 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1172 in the memory unit 1170. The instruction cache unit 1134 is also coupled to a level 2 (L2) cache unit 1176 of the memory units 1170. The L2 cache unit 1176 is coupled to one or more other levels of cache, and ultimately to main memory.
By way of example, the exemplary register renaming out-of-order issue/execution core architecture may implement the pipeline 1100 as follows: 1) instruction fetch 1138 performs fetch and length decode
Core 1190 may support one or more instruction sets, including the instruction(s) described herein (e.g., the x86 instruction set (with some extensions that have been added to newer versions), the MIPS instruction set of MIPS Technologies, Inc. of Sunnyvale, California, and the ARM instruction set of ARM Holdings, Inc. of Sunnyvale, California (with optional additional extensions such as NEON)). In one embodiment, core 1190 includes logic that supports a packed data instruction set extension (e.g., AVX1, AVX2), allowing the operations used by many multimedia applications to be performed using packed data.
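For illustration only, a packed data extension lets one instruction operate on several data elements at once; the sketch below uses the AVX intrinsics from <immintrin.h> (compiled with AVX enabled, e.g. -mavx) and assumes the element count is a multiple of eight:

```cpp
#include <immintrin.h>
#include <cstddef>

// Add two float arrays eight elements per iteration using 256-bit packed
// operations; for brevity n is assumed to be a multiple of 8.
void packed_add(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               // load 8 packed floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb)); // packed add and store
    }
}
```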
It should be appreciated that a core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding and simultaneous multithreading thereafter, such as in the Intel® Hyper-Threading technology).
Although register renaming is described in the context of out-of-order execution, it should be understood that register renaming may be used in an in-order architecture. Although the processor embodiment shown also includes a separate instruction and data cache unit 1134/1174 and a shared L2 cache unit 1176, alternative embodiments may have a single internal cache for both instructions and data, such as, for example, a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, a system may include a combination of internal caches and external caches that are external to the core and/or processor. Alternatively, all caches may be external to the core and/or processor.
Specific exemplary ordered core architecture
Fig. 12A and 12B illustrate block diagrams of more specific exemplary in-order core architectures that would be one of several logic blocks in a chip (including other cores of the same type and/or different types). Depending on the application, the logic blocks communicate with some fixed function logic, memory I/O interfaces, and other necessary I/O logic over a high bandwidth interconnection network (e.g., a ring network).
Figure 12A is a block diagram of a single processor core and its connection to the on-
The local subset of the
FIG. 12B is an enlarged view of a portion of the processor core of FIG. 12A, according to an embodiment of the invention. FIG. 12B includes the L1 data cache 1206A portion of the
FIG. 13 is a block diagram of a processor 1300 that may have more than one core, may have an integrated memory controller, and may have an integrated graphics device, according to embodiments of the invention. The solid lined boxes in FIG. 13 illustrate a processor 1300 having a single core 1302A, a system agent 1310, and a set of one or more bus controller units 1316, while the optionally added dashed boxes illustrate an alternative processor 1300 having multiple cores 1302A-N, a set of one or more integrated memory controller units 1314 in the system agent unit 1310, and dedicated logic 1308.
Thus, different implementations of processor 1300 may include: 1) a CPU, where dedicated logic 1308 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and cores 1302A-N are one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor, where cores 1302A-N are a large number of dedicated cores intended primarily for graphics and/or science (throughput); and 3) a coprocessor, where cores 1302A-N are a large number of general purpose in-order cores. Thus, the processor 1300 may be a general-purpose processor, coprocessor, or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1300 may be part of, and/or may be implemented on, one or more substrates using any of a variety of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within each core, a set of one or more shared cache units 1306, and external memory (not shown) coupled to the set of integrated memory controller units 1314. The set of shared cache units 1306 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. Although in one embodiment the ring-based interconnect unit 1312 interconnects the integrated graphics logic 1308, the set of shared cache units 1306, and the system agent unit 1310/integrated memory controller unit(s) 1314, alternative embodiments may use any number of well-known techniques for interconnecting these units. In one embodiment, coherency is maintained between the one or more shared cache units 1306 and cores 1302A-N.
In some embodiments, one or more of cores 1302A-N are capable of multithreading. System agent 1310 includes those components that coordinate and operate cores 1302A-N. The system agent unit 1310 may include, for example, a Power Control Unit (PCU) and a display unit. The PCU may be or include the logic and components necessary to adjust the power states of cores 1302A-N and integrated graphics logic 1308. The display unit is used to drive one or more externally connected displays.
In terms of the architectural instruction set, the cores 1302A-N may be homogeneous or heterogeneous. That is, two or more of cores 1302A-N may be capable of executing the same instruction set, while other cores may be capable of executing only a subset of the instruction set or a different instruction set.
Example computer architecture
Fig. 14, 15, 16, and 17 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptop computers, desktop computers, handheld PCs, personal digital assistants, engineering workstations, servers, network appliances, network hubs, switches, embedded processors, Digital Signal Processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and a variety of other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating the processors and/or other execution logic disclosed herein are generally suitable.
Referring now to FIG. 14, shown is a block diagram of a system 1400 in accordance with one embodiment of the present invention. The system 1400 may include one or more processors 1410, 1415, which are coupled to a controller hub 1420. In one embodiment, the controller hub 1420 includes a graphics memory controller hub (GMCH) 1490 and an input/output hub (IOH) 1450 (which may be on separate chips); the GMCH 1490 includes memory and graphics controllers to which the memory 1440 and the coprocessor 1445 are coupled; the IOH 1450 couples input/output (I/O) devices 1460 to the GMCH 1490. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), with the memory 1440 and the coprocessor 1445 coupled directly to the processor 1410, and the controller hub 1420 in a single chip with the IOH 1450.
The optional nature of the additional processor 1415 is indicated in FIG. 14 by dashed lines. Each processor 1410, 1415 may include one or more processing cores described herein and may be some version of the processor 1300.
Memory 1440 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of both. For at least one embodiment, controller hub 1420 communicates with processor(s) 1410, 1415 via a multi-drop bus, such as a front-side bus (FSB), a point-to-point interface, such as a QuickPath interconnect (QPI), or similar connection 1495.
In one embodiment, the coprocessor 1445 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, the controller hub 1420 may include an integrated graphics accelerator.
There can be a variety of differences between the physical resources 1410, 1415 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, power consumption characteristics, and the like.
In one embodiment, processor 1410 executes instructions that control data processing operations of a general type. Embedded in the instructions may be coprocessor instructions. The processor 1410 identifies these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1445. Accordingly, the processor 1410 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect to coprocessor 1445. Coprocessor(s) 1445 accepts and executes received coprocessor instructions.
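A simplified software model of this identification and hand-off is sketched below; the Insn layout, the is_coprocessor_type predicate, and the CoprocessorLink type are assumptions made only for this illustration, not a description of the actual bus protocol:

```cpp
#include <cstdint>
#include <vector>

// Toy model of the hand-off described above: instructions identified as
// coprocessor-type are issued on a modeled coprocessor link rather than being
// executed locally.
struct Insn { uint32_t opcode; bool is_coprocessor_type; };

struct CoprocessorLink {
    std::vector<Insn> issued;                 // stands in for the coprocessor bus
    void issue(const Insn& i) { issued.push_back(i); }
};

void dispatch(const std::vector<Insn>& stream, CoprocessorLink& link) {
    for (const Insn& i : stream) {
        if (i.is_coprocessor_type)
            link.issue(i);                    // forward to the attached coprocessor
        // else: execute locally (omitted from this sketch)
    }
}
```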
Referring now to FIG. 15, shown is a block diagram of a first more specific
A shared cache (not shown) may be included in either processor or outside of both processors, but connected to the processors via a P-P interconnect, such that if a processor is placed in a low power mode, local cache information for either or both processors may be stored in the shared cache.
As shown in fig. 15, various I/
Referring now to fig. 16, shown is a block diagram of a second more specific
Fig. 16 illustrates that the
Referring now to fig. 17, shown is a block diagram of a
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code, such as
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code can also be implemented in assembly or machine language, as desired. Indeed, the scope of the mechanisms described herein is not limited to any particular programming language. In any case, the language may be a compiled or interpreted language.
One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium which represent various logic within a processor, which when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible, machine-readable medium and provided to various customers or manufacturing plants for loading into the fabrication machines that actually manufacture the logic or processor.
Such machine-readable storage media may include, but are not limited to, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memory (ROM), random access memory (RAM) (such as dynamic random access memory (DRAM) and static random access memory (SRAM)), erasable programmable read-only memory (EPROM), flash memory, electrically erasable programmable read-only memory (EEPROM), and phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.
Accordingly, embodiments of the present invention also include non-transitory tangible machine-readable media containing instructions or containing design data, such as Hardware Description Language (HDL), that define the structures, circuits, devices, processors, and/or system features described herein. Such embodiments may also be referred to as program products.
Emulation (including binary translation, code morphing, etc.)
In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set. For example, the instruction converter may convert (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise translate the instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partially on and partially off the processor.
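A software instruction converter of the kind described here can be pictured as a loop that decodes one source-instruction-set instruction at a time and emits one or more target-instruction-set instructions; the toy instruction representations below are deliberately simplified assumptions for illustration, not a description of any particular converter:

```cpp
#include <cstdint>
#include <vector>

// Toy source/target instruction forms for a translation sketch; a real
// converter works on encoded bytes and must handle memory operands, flags,
// traps, self-modifying code, and much more.
struct SrcInsn { enum Op { ADD, MOV, HALT } op; int dst, src; };
struct TgtInsn { enum Op { TADD, TMOV, TRET } op; int dst, src; };

// Translate one block of source instructions, stopping at HALT.  A dynamic
// translator would cache the result keyed by the block's source address.
std::vector<TgtInsn> translate_block(const std::vector<SrcInsn>& block) {
    std::vector<TgtInsn> out;
    for (const SrcInsn& s : block) {
        switch (s.op) {
            case SrcInsn::ADD:  out.push_back({TgtInsn::TADD, s.dst, s.src}); break;
            case SrcInsn::MOV:  out.push_back({TgtInsn::TMOV, s.dst, s.src}); break;
            case SrcInsn::HALT: out.push_back({TgtInsn::TRET, 0, 0}); return out;
        }
    }
    return out;
}
```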
FIG. 18 is a block diagram comparing conversion of binary instructions in a source instruction set to binary instructions in a target instruction set using a software instruction converter, according to an embodiment of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, but alternatively, the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. Fig. 18 shows that a program in a high-level language 1802 can be compiled using an x86 compiler 1804 to generate x86 binary code 1806, and the x86 binary code 1806 can be executed natively by a processor having at least one x86
While fig. 7 illustrates various operations according to one or more embodiments, it is to be understood that not all of the operations depicted in fig. 7 are necessary for other embodiments. Indeed, it is fully contemplated herein that in other embodiments of the present disclosure, the operations depicted in fig. 7 and/or other operations described herein may be combined in a manner not specifically shown in any of the figures but still fully consistent with the present disclosure. Accordingly, claims directed to features and/or operations not specifically illustrated in one of the figures are considered to be within the scope and content of the present disclosure.
As used in this application and the claims, a list of items joined by the term "and/or" can mean any combination of the listed items. For example, the phrase "A, B and/or C" may represent a; b; c; a and B; a and C; b and C; or A, B and C. As used in this application and in the claims, a list of items joined by the term "at least one" can mean any combination of the listed terms. For example, the phrase "at least one of a, B, or C" may denote a; b; c; a and B; a and C; b and C; or A, B and C.
As used in any embodiment herein, the term "system" or "module" may refer to, for example, software, firmware, and/or circuitry configured to perform any of the foregoing operations. The software may be embodied as a software package, code, instructions, instruction sets, and/or data recorded on a non-transitory computer-readable storage medium. Firmware may be embodied as code, instructions or instruction sets and/or data that are hard-coded (e.g., nonvolatile) in a memory device.
As used in any embodiment herein, the term "circuitry" may include, for example, alone or in any combination: a hard-wired circuit; programmable circuitry, such as a computer processor including one or more individual instruction processing cores; a state machine circuit; and/or firmware that stores instructions for execution by programmable circuitry or future computational paradigms including, for example, massively parallel, analog, or quantum computing, hardware embodiments of accelerators such as neural network processors, and non-silicon implementations of the foregoing. The circuitry may be collectively or individually embodied as circuitry forming part of a larger system, such as an Integrated Circuit (IC), a system on a chip (SoC), a desktop computer, a laptop computer, a tablet computer, a server, a smartphone, or the like.
Any of the operations described herein may be implemented in a system comprising one or more media (e.g., non-transitory storage media) having instructions stored therein, individually or in combination, that when executed by one or more processors perform the described methods. Here, the processor may include, for example, a server CPU, a mobile device CPU, and/or other programmable circuitry. Moreover, it is intended that operations described herein may be distributed among multiple physical devices, such as processing structures at more than one different physical location. The storage medium may include any type of tangible medium, such as: any type of disk including hard disks, floppy disks, optical disks, compact read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memory (ROM), Random Access Memory (RAM) such as dynamic and static RAM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, Solid State Disks (SSD); an embedded multimedia card (eMMC); secure digital input/output (SDIO) cards; magnetic or optical cards; or any type of media suitable for storing electronic instructions. Other embodiments may be implemented as software executed by a programmable control device.
Accordingly, the present disclosure is directed to systems and methods of sharing memory circuitry between a processor memory circuit and accelerator memory circuits in each of a plurality of peer-to-peer connected accelerator units. Each accelerator unit includes physical-to-virtual address translation circuitry and migration circuitry. The physical-to-virtual address translation circuitry in each accelerator unit includes a page for each of at least some of the plurality of accelerator units. Migration circuitry transfers the page between the processor memory circuitry and the accelerator memory circuitry in each of the plurality of accelerator circuits. Migration circuitry migrates the page to and evicts the page from the accelerator memory circuitry based on statistics collected by the physical-to-virtual address translation circuitry. Thus, processor memory circuitry and accelerator memory circuitry may be dynamically allocated to advantageously minimize system latency attributable to data access operations.
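Purely as an illustrative sketch of a statistics-driven policy of the kind described above (thresholds, data structures, and names are assumptions, not the claimed implementation), the migration circuitry could track per-page access counts and migrate hot pages into accelerator memory while evicting pages that have gone cold:

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative counter-based policy: pages that the accelerator touches often
// while they reside in processor (or peer) memory become candidates for
// migration in; resident pages that have gone cold become eviction candidates.
class MigrationPredictor {
public:
    MigrationPredictor(uint32_t hot, uint32_t cold)
        : hot_threshold_(hot), cold_threshold_(cold) {}

    // Called on each access to a page not currently in accelerator memory;
    // returns true once the page looks hot enough to migrate in.
    bool record_remote_access(uint64_t page) {
        return ++remote_hits_[page] >= hot_threshold_;
    }

    // Called on each access to a page already resident in accelerator memory.
    void record_local_access(uint64_t page) { ++local_hits_[page]; }

    // Polled periodically for resident pages; true means "evict back to
    // processor memory".  The per-epoch counter is reset after each poll.
    bool should_evict(uint64_t page) {
        auto it = local_hits_.find(page);
        bool cold = (it == local_hits_.end()) || (it->second < cold_threshold_);
        local_hits_.erase(page);
        return cold;
    }

private:
    uint32_t hot_threshold_;
    uint32_t cold_threshold_;
    std::unordered_map<uint64_t, uint32_t> remote_hits_;
    std::unordered_map<uint64_t, uint32_t> local_hits_;
};
```

In the terms used above, such counters would be fed by the address translation path, and the transfers themselves would be carried out by the page data mover circuitry.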
The following examples relate to other embodiments. The following examples of the disclosure may include subject matter such as at least one device; a method; at least one machine readable medium for storing instructions that, when executed, cause a machine to perform acts in accordance with the method; means for performing acts based on the method; and/or a system for migrating and evicting pages to and from accelerator memory circuitry based on statistics associated with virtual memory address to physical memory address translations performed by physical to virtual address translation circuitry. By maintaining frequently used data near the point of use (e.g., data used by the accelerator circuits is maintained in the accelerator memory circuits), the systems and methods disclosed herein advantageously reduce system latency due to data transfers to/from the local accelerator unit memory circuits and the peer accelerator unit memory circuits.
According to example 1, an accelerator system is provided. The system may include: at least one accelerator circuit; accelerator memory circuitry communicatively coupled to the at least one accelerator circuitry, the accelerator memory circuitry to locally store a plurality of pages transferred from the processor memory circuitry; accelerator virtual-to-physical address translation circuitry to translate a virtual address received from the at least one accelerator circuitry to a physical memory address in one of: a processor memory circuit or an accelerator memory circuit; an accelerator migration circuit to: collecting usage data associated with accesses to at least one of the processor memory circuit or an accelerator memory circuit of the one or more peer accelerator circuits; and causing a bidirectional data transfer between the processor memory circuit and the accelerator memory circuit based at least in part on the collected usage data.
Example 2 may include the elements of example 1, wherein the accelerator virtual-to-physical address translation circuitry may comprise: accelerator translation lookaside buffer (A-TLB) circuitry communicatively coupled to the at least one accelerator circuit, the A-TLB circuitry to store data representing a plurality of virtual memory address to physical memory address translations; and accelerator memory management unit (A-MMU) circuitry communicatively coupled between the A-TLB circuitry and accelerator memory circuitry, the A-MMU circuitry to store a plurality of pages to service virtual memory address to physical memory address translations associated with virtual memory addresses not found in the A-TLB circuitry.
Example 3 may include the elements of example 1 or 2, wherein the a-TLB circuitry is further to: transmitting a virtual memory address to physical memory address translation service request to a processor memory management unit circuit (MMU), the virtual memory address to physical memory address translation service request associated with a virtual memory address not found in an A-TLB circuit or one of a plurality of page tables included in the A-MMU circuit.
Example 4 may include the elements of any one of examples 1 to 3, wherein the a-MMU circuitry is further to: performing one or more data transformation operations on the data transferred to the accelerator memory circuit.
Example 5 may include the elements of any one of examples 1 to 4, wherein the a-MMU circuitry is further to: performing one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuit, the data transformation operations comprising at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
Example 6 may include the elements of any one of examples 1 to 5, wherein the accelerator migration circuitry may include: migration prediction circuitry to initiate a page transfer between the processor memory circuitry and the accelerator memory circuitry; and page data mover circuitry to bidirectionally migrate pages between the processor memory circuitry and the accelerator memory circuitry responsive to receipt of a request from the migration prediction circuitry.
Example 7 may include the elements of any one of examples 1 to 6, wherein the migration prediction circuitry is to initiate a page transfer between the processor memory circuitry and the accelerator memory circuitry based on a virtual memory address associated with the memory access operation.
Example 8 may include the elements of any one of examples 1 to 7, wherein the accelerator memory circuitry is to locally store a plurality of pages transferred from the processor memory circuitry for use by at least one of: the at least one accelerator circuit or at least one accelerator circuit disposed in a second accelerator system communicatively coupled to the accelerator system.
According to example 9, an electronic device is provided. The electronic device may include: a processor circuit; a processor Memory Management Unit (MMU) circuit; a processor memory circuit; at least one accelerator system comprising: at least one accelerator circuit; accelerator memory circuitry communicatively coupled to the at least one accelerator circuitry, the accelerator memory circuitry to locally store a plurality of pages transferred from the processor memory circuitry; accelerator virtual-to-physical address translation circuitry to translate a virtual address received from the at least one accelerator circuitry to a physical memory address in one of: a processor memory circuit or an accelerator memory circuit; and an accelerator migration circuit to: collecting usage data associated with access to at least one of: a processor memory circuit or an accelerator memory circuit of the one or more peer accelerator circuits; and causing a bidirectional data transfer between the processor memory circuit and the accelerator memory circuit based at least in part on the collected usage data.
Example 10 may include the elements of example 9, wherein the accelerator virtual to physical address translation circuitry may comprise: accelerator translation lookaside buffer (A-TLB) circuitry communicatively coupled to the at least one accelerator circuit, the A-TLB circuitry to store data representing a plurality of virtual memory address to physical memory address translations; and accelerator memory management unit (A-MMU) circuitry communicatively coupled between the A-TLB circuitry and accelerator memory circuitry, the A-MMU circuitry to store a plurality of pages to service virtual memory address to physical memory address translations associated with virtual memory addresses not found in the A-TLB circuitry.
Example 11 may include the elements of any of examples 9 or 10, wherein the a-TLB circuitry is further to: transmitting a virtual memory address to physical memory address translation service request to a processor memory management unit circuit (MMU), the virtual memory address to physical memory address translation service request associated with a virtual memory address not found in an A-TLB circuit or one of a plurality of page tables included in the A-MMU circuit.
Example 12 may include the elements of any of examples 9 to 11, wherein the a-MMU circuitry is further to: performing one or more data transformation operations on the data transferred to the accelerator memory circuit.
Example 13 may include the elements of any one of examples 9 to 12, wherein the a-MMU circuitry is further to: performing one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuit, the data transformation operations comprising at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
Example 14 may include the elements of any one of examples 9 to 13, wherein the accelerator migration circuitry may include: migration prediction circuitry to initiate a page transfer between the processor memory circuitry and the accelerator memory circuitry; and page data mover circuitry to bidirectionally migrate pages between the processor memory circuitry and the accelerator memory circuitry responsive to receipt of a request from the migration prediction circuitry.
Example 15 may include the elements of any one of examples 9 to 14, wherein the migration prediction circuitry is to initiate a page transfer between the processor memory circuitry and the accelerator memory circuitry based on a virtual memory address associated with the memory access operation.
Example 16 may include the elements of any one of examples 9 to 15, wherein the accelerator memory circuitry is to locally store a plurality of pages transferred from the processor memory circuitry for use by at least one of: the at least one accelerator circuit or at least one accelerator circuit disposed in a second accelerator system communicatively coupled to the accelerator system.
According to example 17, a processor acceleration method is provided. The method may include: performing, by accelerator virtual address to physical address translation circuitry disposed in the accelerator unit, translation of the one or more virtual memory addresses to the one or more physical memory addresses; collecting, by accelerator migration circuitry, usage data associated with accesses to at least one of: a processor memory circuit or an accelerator memory circuit of the one or more peer accelerator circuits; and causing, by the accelerator migration circuitry, migration of the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
Example 18 may include the elements of example 17, and the method may further comprise: storing, by accelerator translation lookaside buffer (A-TLB) circuitry communicatively coupled to the at least one accelerator circuitry, data representing a plurality of virtual memory address to physical memory address translations; and storing, by an accelerator memory management unit (a-MMU) circuit communicatively coupled between the a-TLB circuit and the accelerator memory circuit, a plurality of pages for servicing a virtual memory address to physical memory address translation associated with a virtual memory address not found in the a-TLB circuit.
Example 19 may include the elements of any one of examples 17 or 18, and the method may further include: communicating, by the A-TLB circuit to a processor memory management unit (P-MMU) circuit, a virtual memory address to physical memory address translation service request associated with a virtual memory address not found in the A-TLB circuit or one of a plurality of tables included in the A-MMU circuit.
Example 20 may include the elements of any one of examples 17 to 19, and the method may further include: performing, by the A-MMU circuitry, one or more data transformation operations on the data transferred to the accelerator memory circuitry.
Example 21 may include the elements of any one of examples 17 to 20, and the method may further comprise: performing, by the A-MMU circuitry, one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuitry, the data transformation operations comprising at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
Example 22 may include the elements of any one of examples 17 to 21, wherein collecting data associated with the plurality of virtual memory address to physical memory address translations performed by the accelerator virtual address to physical address translation circuitry comprises: collecting, by migration prediction circuitry, data associated with the plurality of virtual memory address to physical memory address translations performed by accelerator virtual address to physical address translation circuitry; and wherein causing migration of one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data comprises: causing, by the migration prediction circuit, the page data mover circuit to migrate the one or more pages from the processor memory circuit to the accelerator memory circuit based at least in part on the collected virtual memory address to physical memory address translation data.
Example 23 may include the elements of any one of examples 17 to 22, and the method may further include: at least a portion of the plurality of pages stored in the accelerator memory circuit is shared with at least one other accelerator system included in the plurality of accelerator systems via an inter-accelerator structure communicatively coupling each of the plurality of accelerator systems.
According to example 24, a processor acceleration system is provided. The system may include: means for performing a translation of one or more virtual memory addresses to one or more physical memory addresses; means for collecting usage data associated with access to at least one of: a processor memory circuit or an accelerator memory circuit of the one or more peer accelerator circuits; and means for migrating one or more pages between the processor memory circuitry and the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
Example 25 may include the elements of example 24, and the system may further include: means for storing data representing a plurality of virtual memory address to physical memory address translations.
Example 26 may include the elements of any one of examples 24 or 25, and the system may further include: means for transmitting a virtual memory address to physical memory address translation service request associated with a virtual memory address not found in the stored data representing the plurality of virtual memory address to physical memory address translations.
Example 27 may include the elements of any one of examples 24 to 26, and the system may further include: means for performing one or more data transformation operations on data transferred to the accelerator memory circuit.
Example 28 may include the elements of any of examples 24 to 27, and the system may further include: means for performing one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuit, the data transformation operations comprising at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
Example 29 may include the elements of any one of examples 24 to 28, wherein means for collecting data associated with the plurality of virtual memory address to physical memory address translations performed by the accelerator virtual address to physical address translation circuitry comprises: means for collecting data associated with the plurality of virtual memory address to physical memory address translations; and wherein the means for causing migration of the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data comprises: means for migrating the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data.
Example 30 may include the elements of any one of examples 24 to 29, and the system may further include: means for sharing at least a portion of the plurality of pages stored in the accelerator memory circuit with at least one other accelerator system included in the plurality of accelerator systems.
According to example 31, a non-transitory storage device is provided. The non-transitory storage device may include instructions that, when executed by the accelerator circuit, cause the accelerator circuit to: causing accelerator virtual address to physical address translation circuitry disposed in an accelerator unit to translate one or more virtual memory addresses to one or more physical memory addresses; causing the accelerator migration circuitry to collect usage data associated with access to at least one of: a processor memory circuit or an accelerator memory circuit of the one or more peer accelerator circuits; and cause the accelerator migration circuitry to migrate the one or more pages from the processor memory circuitry to the accelerator memory circuitry based, at least in part, on the collected virtual memory address to physical memory address translation data.
Example 32 may include the elements of example 31, wherein the instructions are further to cause the accelerator circuitry to: causing accelerator translation lookaside buffer (A-TLB) circuitry communicatively coupled to the at least one accelerator circuitry to store data representing a plurality of virtual memory address to physical memory address translations; and causing accelerator memory management unit (a-MMU) circuitry communicatively coupled between the a-TLB circuitry and the accelerator memory circuitry to store a plurality of pages to service virtual memory address to physical memory address translations associated with virtual memory addresses not found in the a-TLB circuitry.
Example 33 may include the elements of any one of examples 31 or 32, wherein the instructions are further to cause the accelerator circuitry to: causing an A-TLB circuit to transmit a virtual memory address to physical memory address translation service request to a processor memory management unit (P-MMU) circuit, the virtual memory address to physical memory address translation service request associated with a virtual memory address not found in the A-TLB circuit or one of a plurality of tables included in the A-MMU circuit.
Example 34 may include the elements of any one of examples 31 to 33, wherein the instructions are further to cause the accelerator circuitry to: causing the A-MMU circuitry to perform one or more data transformation operations on the data transferred to the accelerator memory circuitry.
Example 35 may include the elements of any one of examples 31 to 34, wherein the instructions are further to cause the accelerator circuitry to: causing the A-MMU circuitry to perform one or more data transformation operations on at least a portion of the data transferred to the accelerator memory circuitry, the data transformation operations including at least one of: a compression operation, a decompression operation, an encoding operation, a decoding operation, an encryption operation, or a decryption operation.
Example 36 may include the elements of any one of examples 31 to 35, wherein the instructions that cause the accelerator migration circuitry to collect data associated with the plurality of virtual memory address to physical memory address translations performed by the accelerator virtual address to physical address translation circuitry further cause the accelerator circuitry to: cause the migration prediction circuit to collect usage data associated with access to at least one of: a processor memory circuit or an accelerator memory circuit of the one or more peer accelerator circuits; and wherein the instructions that cause the accelerator migration circuitry to migrate the one or more pages from the processor memory circuitry to the accelerator memory circuitry based at least in part on the collected virtual memory address to physical memory address translation data further cause the accelerator circuitry to: causing the page data mover circuit to migrate the one or more pages from the processor memory circuit to the accelerator memory circuit based at least in part on the collected virtual memory address to physical memory address translation data.
Example 37 may include the elements of any one of examples 31 to 36, wherein the instructions are further to cause the accelerator circuitry to: at least a portion of the plurality of pages stored in the accelerator memory circuit is shared with at least one other accelerator system included in the plurality of accelerator systems via an inter-accelerator structure communicatively coupling each of the plurality of accelerator systems.
According to example 38, there is provided an acceleration system arranged to perform the method of any one of examples 17 to 23.
According to example 39, there is provided a chipset arranged to perform the method of any of examples 17 to 23.
According to example 40, there is provided at least one machine readable storage device comprising a plurality of instructions that in response to being executed on a computing device, cause the computing device to carry out the method according to any one of examples 17 to 23.
According to example 41, there is provided an apparatus comprising a plurality of peer acceleration units coupled to a processor circuit, the apparatus being arranged to perform the method of any of examples 17 to 23.
The terms and expressions which have been employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications are possible within the scope of the claims. Therefore, the claims are intended to cover all such equivalents. Various features, aspects, and embodiments have been described herein. As will be appreciated by those skilled in the art, these features, aspects and embodiments are susceptible to combination with one another and to variation and modification. Accordingly, the present disclosure is to be considered as including such combinations, variations and modifications.
As described herein, the various embodiments may be implemented using hardware elements, software elements, or any combination thereof. Examples of hardware elements may include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASICs), programmable logic devices (PLDs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), logic gates, registers, semiconductor devices, chips, microchips, chip sets, and so forth.
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.