Memory node controller

Document No.: 1432167    Publication date: 2020-03-17

Note: This technology, "Memory node controller," was designed and created by Jonathan Curtis Beard, Roxana Rusitoru, and Curtis Glen Dunham on 2018-07-05. Its main content is summarized as follows: A memory node controller for a node of a data processing network, the network comprising at least one computing device and at least one data resource, each data resource being addressed by a physical address. The node is configured to couple the at least one computing device with the at least one data resource. The elements of the data processing network are addressed via a system address space. The memory node controller includes: a first interface to the at least one data resource; a second interface to the at least one computing device; and a system address to physical address translator cache configured to translate a system address in the system address space to a physical address in a physical address space of the at least one data resource.

1. A memory node controller for a node of a data processing network, the data processing network having at least one computing device and at least one data resource, the node being configured to couple the at least one computing device with the at least one data resource, and elements of the data processing network being addressable via a system address space, the memory node controller comprising:

a first interface to the at least one data resource, wherein each of the at least one data resource is addressed via a physical address space;

a second interface to the at least one computing device; and

a system address to physical address translator cache configured to translate a system address in the system address space to a physical address in the physical address space of a data resource of the at least one data resource.

2. The memory node controller of claim 1, further comprising:

a Physical Device Configuration Settings (PDCS) memory that stores information indicating a mapping of elements of the data processing network into the system address space.

3. The memory node controller of claim 1, wherein the at least one data resource comprises a remote network, and wherein the first interface comprises a network interface card.

4. The memory node controller of claim 1, wherein the first interface comprises an interface to another memory node controller.

5. The memory node controller of claim 4, wherein the system address space comprises a plurality of address partitions, and wherein the memory node controller is associated with a first address partition of the plurality of address partitions and the another memory node controller is associated with a second address partition of the plurality of address partitions.

6. The memory node controller of claim 1, wherein the second interface comprises an interface to a processor core.

7. The memory node controller of claim 1, wherein the second interface comprises an interface to a hardware accelerator.

8. The memory node controller of claim 1, wherein the first interface comprises an interface to a memory device or a storage device.

9. A non-transitory computer readable medium having instructions of a hardware description language representing the memory node controller of claim 1.

10. A non-transitory computer readable medium having a netlist representing the memory node controller of claim 1.

11. A data processing network comprising:

a first memory node controller;

a first plurality of addressable units addressed by a system address space and comprising a first plurality of data resources, each of the first plurality of data resources coupled to the first memory node controller via a channel and addressed by a physical address space; and

a first plurality of computing devices each coupled to the first memory node controller and configured to access the first plurality of addressable units via the first memory node controller,

wherein the first memory node controller comprises a system address to physical address translator cache configured to translate system addresses received from computing devices of the first plurality of computing devices to physical addresses in an address space of data resources of the first plurality of data resources.

12. The data processing network of claim 11, further comprising:

one or more second memory node controllers coupled to the first memory node controller;

wherein the first memory node controller is assigned a first partition of system addresses in the system address space,

wherein each of the one or more second memory node controllers is assigned a second partition of system addresses in the system address space, and

wherein a computing device of the first plurality of computing devices includes a range table associating the first memory node controller with system addresses in the first partition of the system address space and associating each of the one or more second memory node controllers with system addresses in the corresponding second partition of the system address space, the computing device being configured to send a request to access memory at a system address to the one of the first and second memory node controllers associated with that system address.

13. The data processing network of claim 12, further comprising:

a second plurality of data resources each coupled to a second memory node controller of the one or more second memory node controllers via a channel and having a physical address space; and

a second plurality of computing devices each coupled to a second memory node controller of the one or more second memory node controllers and configured to access the data processing network via the system address space,

wherein the one or more second memory node controllers are configured to couple the second plurality of computing devices with the second plurality of data resources.

14. The data processing network of claim 11, wherein the first plurality of addressable units further comprises a hardware accelerator.

15. The data processing network of claim 11, wherein the first plurality of addressable units further comprises a network interface card.

16. A method for accessing one or more data resources by one or more computing devices in a data processing network, the method comprising:

mapping elements of the data processing network to a system address space;

assigning a first partition of the system address space to a first memory node controller of the data processing network, wherein the one or more computing devices and the one or more data resources are coupled to the first memory node controller;

receiving, at the first memory node controller, a request to access an element of the data processing network at a system address in the system address space; and

servicing, by the first memory node controller, the request when the system address is in the first partition of the system address space.

17. The method of claim 16, further comprising:

assigning a second partition of the system address space to a second memory node controller of the data processing network; and

forwarding the request to the second memory node controller when the system address is in the second partition of the system address space.

18. The method of claim 16, further comprising:

assigning a second partition of the system address space to a second memory node controller of the data processing network; and

servicing, by the first memory node controller, the request when the system address is in the second partition of the system address space and the system address is dynamically shared with the first memory node controller.

19. The method of claim 16, wherein each of the one or more data resources is coupled to the first memory node controller via a channel, and wherein servicing the request by the first memory node controller comprises:

identifying a channel to a data resource of the one or more data resources corresponding to the system address;

translating the system address to a physical address in the data resource; and

accessing the data resource at the physical address via the identified channel.

20. The method of claim 16, wherein the first partition of the system address space comprises a first plurality of pages, the method further comprising:

assigning a second partition of the system address space to a second memory node controller of the data processing network, wherein the second partition of the system address space comprises a second plurality of pages;

monitoring access to the second plurality of pages by the one or more computing devices coupled to the first memory node controller; and

migrating pages of the second plurality of pages from the second memory node controller to the first memory node controller in accordance with the monitored accesses.

21. The method of claim 20, further comprising:

recording a coherency state of the migrated pages.

22. The method of claim 16, wherein the first partition of the system address space comprises a plurality of lines, and wherein the data processing network further comprises a data transfer cache, the method further comprising:

monitoring system memory requests to the plurality of lines by the one or more computing devices coupled to the first memory node controller;

servicing, by the first memory node controller, a system memory request when a requested line of the plurality of lines is not present in the data transfer cache;

pushing the requested line from the first memory node controller to the data transfer cache of the data processing network in accordance with the monitored system memory requests; and

servicing, by the data transfer cache, the system memory request when the requested line is present in the data transfer cache.

23. The method of claim 16, further comprising:

assigning a second partition of the system address space to a second memory node controller of the data processing network, wherein one or more additional data resources are coupled to the second memory node controller; and

by the first memory node controller:

allocating memory within an address range of the second partition of the system address space;

entering the allocated address range in a system address translation table of the first memory node controller; and

directing memory requests for addresses within the allocated address range to the second memory node controller.

Technical Field

The present disclosure relates to control of physical device memory in a data processing network.

Background

A data processing system may include multiple computing devices of various types and multiple memory resources of different types. For example, a system may include Dynamic Random Access Memory (DRAM), block devices, Remote Direct Memory Access (RDMA) devices, memory located on hardware accelerators, and other types of volatile and non-volatile memory. Memory and other resources within a data processing system are addressed by a system address space, while each memory device is addressed by a physical address space.

The mapping between system addresses and corresponding physical addresses may be performed statically, either by software calls to the operating system, or by hardware caching of software-mediated translation processes. Such an approach does not provide optimal use of memory, particularly when memory resources are shared among multiple processing cores or multiple processes and when the memory resources have different characteristics.

Drawings

FIG. 1 illustrates a data processing network consistent with certain embodiments of the present disclosure.

Fig. 2 is another block diagram of a data processing network consistent with embodiments of the present disclosure.

FIG. 3 is a block diagram of a simplified network incorporating a memory node controller consistent with embodiments of the present disclosure.

FIG. 4 illustrates a state diagram of a data coherency protocol consistent with embodiments of the present disclosure.

Fig. 5 is a block diagram of a data processing network consistent with the present disclosure.

FIG. 6 is a flow diagram of a method for routing memory access requests consistent with an embodiment of the present disclosure.

Detailed Description

While this invention may be embodied in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an exemplification of the principles of the invention and is not intended to limit the invention to the specific embodiments illustrated and described. In the description below, like reference numerals may be used to describe the same, similar or corresponding parts in the several views of the drawings.

In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," "includes," "including," "has," "having," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by "comprises a" does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Reference throughout this document to "one embodiment," "certain embodiments," "an embodiment," "implementation(s)," "aspect(s)," or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term "or" as used herein is to be interpreted as inclusive or meaning any one or any combination. Thus, "A, B or C" means "any of the following: a; b; c; a and B; a and C; b and C; A. b and C ". An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive. Additionally, unless stated otherwise or clear from context, grammatical conjunctions are intended to express any and all disjunctive (disjunctive) and conjunctive (conjunctive) combinations that incorporate clauses, sentences, words, and the like. Thus, the term "or" should generally be understood to mean "and/or" and the like.

All documents mentioned herein are hereby incorporated by reference in their entirety. Reference to a singular item should be understood to include a plural item and vice versa unless explicitly stated otherwise or clear from the text.

The words "about," "approximately," "substantially," and the like, when accompanied by numerical values, are to be construed as indicating a deviation as would be appreciated by one of ordinary skill in the art to operate satisfactorily for the intended purpose. Values and/or ranges of values are provided herein as examples only and are not limiting upon the scope of the described embodiments. The use of any and all examples, or exemplary language ("e.g.," such as "etc.), provided herein is intended merely to better illuminate the embodiments and does not pose a limitation on the scope of the embodiments. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the embodiments.

For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. Embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the described embodiments. The description should not be considered as limiting the scope of the embodiments described herein.

In the following description, it is to be understood that such terms as "first," "second," "top," "bottom," "upper," "lower," "above," "below," and the like are words of convenience and are not to be construed as limiting terms. Additionally, the terms apparatus and device may be used interchangeably herein.

System

Fig. 1 is a schematic illustration of a data processing network consistent with embodiments of the present disclosure. Referring to FIG. 1, a data processing network includes a plurality of processor cores 100, 110. In the example of fig. 1, two such processor cores are illustrated, but more processor cores may be provided (as schematically illustrated by the dashed box 112). Alternatively, the system may include only one processor core.

The processor cores are arranged to process data according to virtual memory addresses. For example, each of the processor cores may process data according to virtual memory addresses in a respective virtual memory address space, for example under the control of an operating system or a so-called hypervisor which allocates virtual memory address spaces to processes being executed by the different processor cores. This is done in part as a technique for avoiding a process associated with one processor core accidentally or maliciously accessing data appropriate to a process being executed by another of the processor cores.

First tier memory address translation

Elements of the data processing network, such as memory and other resources, are addressable by system addresses in a system address space. Memory address translation means are provided for translating between virtual memory addresses in the virtual memory address space and system addresses in the system address space. This system address space may be accessed via indirect means or via a processing device that accesses this system address space as an anonymous physical space (i.e., the physical memory seen by the processing device is virtualized). The system address is the "output" memory address of the first layer. The system address may represent a physical address that can be used to physically address a physical memory device or other addressable unit. Alternatively, the system address may represent an address that requires another stage of address translation before being used to access a physical memory device or other addressable unit. These options are equivalent from the point of view of the address translation technique. That is, the address translation technique starts with a virtual memory address and generates a system memory address. Another stage of address translation, for system addresses, is provided by memory node controllers 160 and 170. In this example, the address passed to the memory node controller is within a range. However, in an alternative embodiment, the address passed to the memory node controller is provided by a conventional page-based virtual memory system located on top of the memory node controller (MNC) layer. In that implementation, the page-based virtual memory layer treats the addresses as physical device addresses, but the MNC treats them as system addresses.

In fig. 1, address translation is performed by so-called Range Table Buffers (RTBs) 105, 115. These perform address translation between virtual memory addresses in the virtual memory address space and system memory addresses in the system (output) address space. Each of the processor cores has a respective range table buffer. The operation of the range table buffer is described in detail in co-pending patent application No. 15/649,930, which is hereby incorporated by reference in its entirety. However, other address translation techniques may be used, such as, for example, a Translation Lookaside Buffer (TLB).
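
By way of illustration only, the sketch below shows a range-based first-layer translation in the spirit of the range table buffer described above. The entry fields, the linear search, and the class names are assumptions made for this example; they are not the design of the co-pending application.

```cpp
// Illustrative sketch only: a minimal range-based virtual-to-system address
// translation in the spirit of the RTB described above. The field names and
// the linear lookup are assumptions, not the patented design.
#include <cstdint>
#include <optional>
#include <vector>

struct RangeEntry {
    uint64_t virt_base;  // start of the virtual address range
    uint64_t length;     // size of the range in bytes
    int64_t  offset;     // added to a virtual address to form a system address
};

class RangeTableBuffer {
public:
    void add(const RangeEntry& e) { entries_.push_back(e); }

    // Translate a virtual address to a system address, if some range covers it.
    std::optional<uint64_t> translate(uint64_t vaddr) const {
        for (const auto& e : entries_) {
            if (vaddr >= e.virt_base && vaddr < e.virt_base + e.length) {
                return vaddr + static_cast<uint64_t>(e.offset);
            }
        }
        return std::nullopt;  // fault: no range maps this virtual address
    }

private:
    std::vector<RangeEntry> entries_;
};
```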

Bypass logic 108, 118 is provided to selectively bypass the RTBs 105, 115 when the address translation is such that the virtual memory address is equivalent to the corresponding system memory address. The bypass circuit or logic may be controlled by control signals 109, 119 as will be discussed below. When the bypass logic is enabled, the RTBs 105, 115 do not perform any translations, or the translations performed by the RTBs 105, 115 are ignored and the virtual memory addresses are provided by the bypass ways 104, 114 for use as output memory addresses.

The processor cores 100, 110 are implemented or fabricated on an integrated circuit substrate in this example, and may both (or all, in the case of more than two) be disposed on the same integrated circuit substrate. These devices are referred to in fig. 1 as "on-chip".

Also provided on-chip is a cache and/or system cache memory 130 that provides temporary scratchpad storage for a subset of the data (such as a recently accessed subset and/or a speculatively fetched subset) held by the memory system discussed below. As shown in fig. 1, processor cores 100 and 110 share a common cache/system cache 130. However, in other examples more than one cache may be provided, and another cache 140 is shown with dashed lines to illustrate such an arrangement. In one embodiment, a virtual index and a system address tag may be utilized to access the cache in parallel with the cache/system cache 130.

In one embodiment, the cache/system cache 130 (140) operates according to system (output) memory addresses generated by the RTBs 105, 115. In another embodiment, the first level cache is accessed in parallel with the RTBs 105, 115.

Dashed line 150 indicates a logical boundary between on-chip and off-chip devices, but it should be appreciated that this is merely an example and that whether the modules shown in FIG. 1 are implemented on the same integrated circuit or as different circuits is a matter for the system designer. Thus, FIG. 1 shows an illustrative example of how separation of on-chip and off-chip components may be achieved.

Memory node controller

One or more off-chip memory node controllers 160, 170 are provided, which in turn access one or more respective physical devices 180, 190, such as dynamic random access memory (DRAM), Phase Change Memory (PCM), and High Bandwidth Memory (HBM). Because the physical devices 180, 190 operate in a physical address space, the functions of the memory node controllers 160, 170 may include: (a) translating system (output) memory addresses to physical device memory addresses if another stage of translation is required, and (b) managing which of the physical devices 180, 190 needs to be accessed in order to implement the required memory access operation.

The translation operation (a) mentioned above is a second layer address translation and may be performed using the techniques to be discussed below or by known memory address translation techniques. The management operation (b) for managing which of the physical devices 180, 190 should be accessed may be performed using, for example, a table or directory stored at one or more of the memory node controllers 160, 170 to indicate the partitioning of the physical address space between the physical devices and, in turn, the memory node controllers.

In the example of FIG. 1, two memory node controllers are provided. If one of the memory node controllers, such as memory node controller 160, receives a request for a memory access transaction related to an address processed by another memory node controller, such as memory node controller 170, first memory node controller 160 may communicate with the other memory node controller 170 via data connection 165, passing the system address related to the transaction and requesting that the transaction be performed by second memory node controller 170.

The present disclosure relates to Memory Node Controllers (MNCs). According to some embodiments, the MNC provides a dedicated hardware mechanism to collect and use metadata, including performance statistics such as reuse distance. The metadata is used to provide better placement of the memory pages among the available technologies.

As described above, the MNC maps at least one partition of the system address space of the data processing network to the physical device memory space. The MNC provides mapping functionality from the system address space to a physical space in a resource, such as, for example, a DRAM device, a block device, a Remote Direct Memory Access (RDMA) appliance, or memory located on a hardware accelerator. An RDMA appliance may be any memory or storage device used for remote memory access. The MNC provides functionality for performing: writes to system addresses, maintaining page-level ownership across memory structures, optimally placing pages into memory via metadata tracking, and feeding data forward to fast on-chip memory. The MNC is implemented in hardware and may be part of an integrated circuit with additional functionality. For example, an MNC may be synthesized from a netlist or a Hardware Description Language (HDL) representation.

According to some embodiments, the MNC provides a single abstraction of resources, such as storage, memory, or Network Interface Controllers (NICs), into a single system address space.

According to some embodiments, the MNC provides a means for treating memory at the MNC page level as "shareable" between multiple MNCs.

According to some embodiments, the MNC provides a second layer of Copy-on-Write devices.

According to certain embodiments, the MNC provides an efficient means for performing a single copy operation. This may be provided, for example, to all levels of the accelerator device and may be provided via the NIC.

According to some embodiments, the MNC is part of a memory fabric configured in compliance with a memory server model, where the memory fabric services memory requests from various local or remote computing devices of the data processing network.

Fig. 2 is a block diagram of a data processing network 200 consistent with embodiments of the present disclosure. The data processing network 200 includes a first memory node controller (MNC 0) 160 and optionally one or more second memory node controllers, such as MNC 1 (170). The MNC 160 includes control logic 204 coupled to a number of other functional blocks.

The device interface 206 enables the MNC 160 to interact with multiple computing devices 208 and 210 and other devices such as hardware accelerators. The computing devices may be on-chip or off-chip. The system 200 is configured as a network in which the MNCs 160 and 170 provide network nodes that interconnect computing devices (such as 208 and 210) with memory resources. In one embodiment, device interface 206 conforms to the Peripheral Component Interconnect Express (PCIe) standard. PCIe is a non-proprietary standard for interconnecting internal devices in a computer. For example, the device interface 206 may include a PCIe root complex that enables connection with multiple computer elements or virtualized PCIe devices. When multiple MNC nodes are used, the nodes form a 'multi-root' system in the PCIe standard. However, the MNC may also support single-root virtualization for guest operating systems to facilitate backward compatibility. Devices on the MNC network are identified via a virtualized hardware-mediated PCI device tree.

IO controller

The MNC 160 also includes an I/O controller (IOC) 212. The IOC 212 is a general purpose interconnect or bus that enables connection to off-chip components. This may be a custom physical layer with a protocol that enables responding to or transmitting commands; the commands may be, for example, Instruction Set Architecture (ISA) level memory instructions. The IOC 212 may operate under a standard architecture such as the PCIe-v4 architecture, for example. Other physical layers may be used depending on the characteristics required by the MNC and the overall system. The IOC 212 enables communication with other MNC devices.

Memory controller

The MNC 160 also includes memory/storage components, such as a memory controller 214 (for example, one or more DRAM controllers) for coupling with memory devices 216, and a block device controller 218 (such as, for example, a SATA controller) for coupling with block devices 220. Each device is mapped to a physical range of memory that can be further virtualized by the operating system.

The additional IO controller 222 enables connection to devices such as a network interface controller 224.

In the embodiment shown in fig. 2, the MNC 160 also includes an accelerator physical memory 226 and a fast local memory 232 (such as, for example, combined DRAM/NV 3-D integrated memory) that serves as a local buffer for data reordering, in-memory computation, and NIC and RDMA transactions.

Physical device configuration settings (PDCS) memory

The configuration data is stored in a Physical Device Configuration Settings (PDCS) memory 228. For example, a mapping of accelerator physical memory 226 by the system address space may be stored in PDCS memory 228. The physical device configuration information is provided to the MNC via a firmware interface, on-board non-volatile memory (such as SPI flash or equivalent), small persistent removable storage (such as microSD cards), or any other method that can convey information for physical mapping configurations and other configurations such as channel and interrupt assignments. Physical devices (such as SATA, PCIe, SSD, DRAM, and other devices) are mapped from a virtual system address range to a physical device range.

In one embodiment, the PDCS memory stores information that tells the MNC about the devices present on the network. This enables the devices, in effect, to be "mapped" from the specified memory space into the operating systems and file systems of the computing devices.

The information about storage devices is slightly different: it tells the MNC which devices are attached, their characteristics, and which bus lanes or channels to assign to them. Further, for a PCIe accelerator, it may be necessary to provide other configuration data as well as a system address mapping for the accelerator device so that it can be mapped into the system address space of the host operating system for virtualization.

In summary, the configuration information in the PDCS memory provides the MNC with the information needed to operate the attached external devices and to map devices such as accelerators, computing devices, and network interface controllers into the system address space to enable virtualization. This information may supplement or replace traditional device configuration within the kernel.

Other memory node controllers can be easily discovered at startup through handshaking, but they can also be specified directly in the configuration information.

System-to-physical translation (STP) cache structure

A system-to-physical translation (STP) cache structure 230 holds a translation structure that maps system addresses to physical pages on various devices. An example implementation of this structure is a sparse hash map, in which the N most frequently used translations reside in Static Random Access Memory (SRAM), while overflow entries move through the STP cache to DRAM allocated by the MNC memory controller 214. In some embodiments, the SRAM in the STP cache may be large enough to store most of the address translations. The translation from system addresses to physical addresses may be performed via a hash map and, in some hash map implementations, may be performed on a single clock cycle edge.
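
The following is a hedged sketch, not the disclosed hardware, of the STP structure described above: a small map standing in for the SRAM-resident translations with overflow into a larger backing map standing in for MNC-attached DRAM. The capacities, the insertion policy, and the use of std::unordered_map are assumptions made only for illustration.

```cpp
// Hedged model of the STP cache: a bounded "SRAM" map for hot translations,
// with overflow into a larger "DRAM" map. Capacities and policy are invented.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>

class StpCache {
public:
    explicit StpCache(std::size_t sram_capacity) : sram_capacity_(sram_capacity) {}

    void insert(uint64_t system_page, uint64_t physical_page) {
        if (sram_.size() < sram_capacity_) {
            sram_[system_page] = physical_page;      // frequently used entries stay "on chip"
        } else {
            overflow_[system_page] = physical_page;  // remainder spills to MNC-attached DRAM
        }
    }

    std::optional<uint64_t> lookup(uint64_t system_page) const {
        if (auto it = sram_.find(system_page); it != sram_.end()) return it->second;
        if (auto it = overflow_.find(system_page); it != overflow_.end()) return it->second;
        return std::nullopt;  // no translation yet: the page has no physical backing
    }

private:
    std::size_t sram_capacity_;
    std::unordered_map<uint64_t, uint64_t> sram_;      // most frequently used N translations
    std::unordered_map<uint64_t, uint64_t> overflow_;  // spill area
};
```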

Caching

The caches 130 and 140 may be near-network-edge data caches used to facilitate data transfer to and from the MNC 160. These are called data delivery caches (DDCs) and may be located on-chip. Within the MNC 160, memory is divided into pages representing a specific amount of memory (such as a 4K region), which may be placed on any type of physical device. Dividing a larger range into smaller pages enables improved placement across memory technologies and provides an upper limit for metadata collection, as metadata only needs to be kept at page granularity. The metadata enables the memory node controller 160 to place physical pages on the best memory/storage technology (based on some cost function) and, in addition, to use a "feed-forward" approach to placing pages in on-chip caches. In the feed-forward approach, the MNC feeds the data pages that are optimal for the core and application (i.e., based on bandwidth requirements, reuse distance, etc.), at some granularity, to high-bandwidth/low-latency memory exactly where the core needs them.
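
As a purely illustrative sketch of how per-page metadata might drive a cost-based placement decision of the kind described above, consider the following; the counters, tiers, and thresholds are invented for this example and are not taken from the disclosure.

```cpp
// Invented example of metadata-driven page placement: counters kept per page
// and a simple cost function choosing a memory technology tier.
#include <cstdint>

enum class Tier { kFastOnPackage, kDram, kNonVolatile };

struct PageMetadata {
    uint64_t access_count = 0;    // accesses observed since the last placement decision
    uint64_t reuse_distance = 0;  // approximate reuse distance tracked by the MNC
};

// Choose a technology for a page from its observed behaviour (thresholds are arbitrary).
Tier choose_tier(const PageMetadata& md) {
    if (md.access_count > 1000 && md.reuse_distance < 64) return Tier::kFastOnPackage;
    if (md.access_count > 10) return Tier::kDram;
    return Tier::kNonVolatile;  // cold page: densest, cheapest technology
}
```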

Network configuration

FIG. 3 is a block diagram of a simplified network 300 incorporating a memory node controller consistent with the present disclosure. Simplified network 300 includes nodes 302, 304, and 306. Each node includes a Memory Node Controller (MNC). Memory/storage elements 308 and 310 are within the home range of MNC-0 at node 302, just as is remote access device 312 (which may be accessed, for example, via a NIC). Access to these elements is provided by MNC-0 (302). Network 300 is arranged as a clustered network, with MNC-0 at node 302 being configured to provide memory/storage resources to a cluster of computing devices 314, 316, and 318. These may be processor cores or other computing devices. Similarly, the MNCs at nodes 304 and 306 of the network will each serve a cluster of computing devices and control access to dedicated memory/storage elements. The MNC is coupled to computing devices 314, 316, and 318 via channel 324 and to memory/storage elements 308 and 310 via channel 326.

The function of MNC-0 at node 302 in the network 300 is to provide memory pages to local devices 314, 316, and 318 or to devices coupled via other nodes 304 and 306. MNC-0 at node 302 is coupled to MNC-1 at node 304 via link 320 and to MNC-2 at node 306 via link 322. It will be apparent to those skilled in the art that different network topologies may be used, and that any number of nodes, computer elements, and memory/storage devices may be used.

In operation, a request sent from the computing device to the MNC references a system address. The request sent from the MNC to the memory/storage resource references a physical (or network) address. The MNC is configured to perform a translation from a system address to a physical address.

Routing of memory access requests in a network

One function of the memory node controller is to translate system addresses to physical addresses. When a request to access memory at a particular address is sent in a data processing network having memory node controllers, the request is routed to the appropriate MNC. Various routing techniques may be used. Embodiments may use, for example, a clustered memory node controller scheme as depicted in FIG. 3. Typically, in a clustering scheme, there may be a maximum of N cores or computing devices for each of the K MNCs. The N computer elements are clustered such that the best route is to the local memory node. Each memory request originating from these cores goes directly to the nearest MNC. If the request is for a page that is statically assigned to that MNC or for a page that is dynamically shared from another MNC, the request may be serviced immediately and the result returned to the core. However, if another MNC owns the memory (e.g., as determined by a coherence protocol such as, for example, the coherence protocol shown in fig. 4 and discussed below), there is one additional network hop before the request is satisfied. When operation of the system is initiated, each MNC is assigned a region or partition of the overall system addressable space that can be utilized by the system (while providing for repartitioning to facilitate hot-plugging). The partition assigned to each memory node is then divided into pages. An advantage of this arrangement is that locality is implicit in the MNC at which the memory request originates: the accessing computing device (or at least its cluster) is known without additional data. Using this information, the MNC can migrate data pages within the memory network or check them out from the owner of the partition if the compute pattern warrants it.
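
A minimal sketch of this routing decision is given below, assuming contiguous partitions, a power-of-two page size, and a hash set of checked-out pages; the interfaces and data structures are illustrative assumptions rather than the disclosed hardware.

```cpp
// Assumed-interface sketch of the clustered routing decision: service locally if
// the address is in this MNC's partition or in a checked-out page, else forward.
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

struct Partition { uint64_t base; uint64_t size; int owner_mnc; };

class MncRouter {
public:
    MncRouter(int self, std::vector<Partition> parts, uint64_t page_size)
        : self_(self), parts_(std::move(parts)), page_size_(page_size) {}

    // Record a page borrowed ("checked out") from its owning MNC; page_size_ is a power of two.
    void check_out(uint64_t system_addr) {
        checked_out_pages_.insert(system_addr & ~(page_size_ - 1));
    }

    // Return the id of the MNC that should service a request for this system address.
    int route(uint64_t system_addr) const {
        if (checked_out_pages_.count(system_addr & ~(page_size_ - 1))) return self_;
        for (const auto& p : parts_) {
            if (system_addr >= p.base && system_addr < p.base + p.size) return p.owner_mnc;
        }
        return self_;  // unmapped address: fall back to the local node (fault path)
    }

private:
    int self_;
    std::vector<Partition> parts_;
    uint64_t page_size_;
    std::unordered_set<uint64_t> checked_out_pages_;
};
```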

Consistent with some embodiments, the retrieval of pages is facilitated by a restricted directory structure. The restricted directory structure may be stored in memory local to the MNC (such as local memory 232 in fig. 2), so that it is implicitly memory-backed and hardware-managed. In another embodiment, the restricted directory structure is stored in SRAM or other fast memory so that, when pages are checked out, they can be entered into the directory structure quickly.

Referring again to fig. 3, when a request from a core or computing device to access memory at a system address arrives at an MNC of the memory network 300 from that computing element, a routing calculation is performed using the system address. Since the MNC that receives the request addresses only a portion of the global address space, the request may need to be routed to another MNC in the network. If the current MNC does not control the memory partition containing the address, the request is forwarded to the appropriate MNC. For example, a request from computer element 314 to access an address in the main partition of MNC-1 at node 304 is received by MNC-0 at node 302 and forwarded to MNC-1 at node 304 via link 320.

When an address arrives at the MNC in a request from a core, a routing calculation is performed for that address. If the address is outside the partition of the current memory node, a range lookup for routing may be performed in parallel with consulting the directory to determine whether the page has been checked out from the home node of its external system address partition.

In parallel with routing the system address, one of two operations can be performed, depending on implementation requirements. First, the hash can be consulted to see whether the page (assuming the requested address is outside the current memory node's partition of the system address space) has been checked out from its home node and currently resides in the current node (the node performing the address calculation). Another option is to use a directory-like methodology that sends the request packet to the home node of the system address partition, which then determines whether the page has been checked out to a closer node. In this method, the originating MNC (i.e., the first node that receives the request from the computing device) is encoded in the packet. This approach may require an additional network hop if the page is checked out locally, but has the benefit of reduced overall data movement while retaining the benefit of data interleaving with the requesting socket.

Within the MNC, a number of data structures can be used in hardware to store paging information. For example, in one embodiment, a sparse hash map structure is used, which may be implemented as a tree structure. In a write operation to a page without physical backing, backing is created in a class of memory chosen by the optimization function (most likely first in DRAM, for example); however, new dirty pages may just as easily be created in non-volatile media. A similar thing happens in a read operation. The operations may be performed on a page-by-page basis, where a page is a subset of a range at a certain granularity (e.g., 4K). In this way, range translation is provided and pages are striped/placed on the most efficient memory technology. This structure is described further below. Each page can be placed anywhere in the memory network by the MNC without the core making any changes or taking any action.

Since data can be shared between computer devices, a coherency protocol is used to prevent access to memory pages containing stale data. To this end, a restricted directory structure may be used to store the state of the data page.

Fig. 4 illustrates an example state diagram 400 for a modified MESI protocol consistent with embodiments of the present disclosure. The data can be identified as being in one of four different states: "modified" (M) 402, "exclusive" (E) 404, "shared" (S) 406, and "invalid" (I) 408. MESI diagram 400 illustrates transitions between the different states. The state may be indicated by status bits in the metadata for each page, in addition to other data such as page utilization statistics, performance counters, and the like. State diagram 400 illustrates how the modified MESI protocol for MNC page sharing is managed within the MNC network.

For example, a checked-out page that is not from the partition of the current MNC is called a "foreign" page, and its checked-out state is recorded in a directory of the MNC, which may be stored in local memory. For pages in the partition of the current MNC, i.e. pages for which the current MNC is the main partition MNC, the checked-out state is indicated in the page metadata.

For example, when a page is allocated, it may initially be checked out from the main partition MNC in the "exclusive" state (E) 404. After a write has occurred, the state changes to "modified" (M) 402. After the page has been synchronized back to the main partition MNC, the state returns to "shared" (S) 406.

If a page is deallocated while it is in either the exclusive (E) state 404 or the shared (S) state 406, the state is changed to invalid (I) 408. Alternatively, the page may be returned to the shared (S) state 406 by a revalidate instruction.
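
A compact sketch of these page states and the transitions just described is given below. It models only the transitions mentioned in the text and is therefore a simplification of the protocol of FIG. 4, with the event names being assumptions of this example.

```cpp
// Simplified model of the page states of FIG. 4 and the transitions described
// above; only the transitions mentioned in the text are represented.
#include <stdexcept>

enum class PageState { kModified, kExclusive, kShared, kInvalid };

enum class Event { kCheckOut, kWrite, kSyncToHome, kDeallocate, kRevalidate };

PageState next_state(PageState /*current*/, Event e) {
    switch (e) {
        case Event::kCheckOut:   return PageState::kExclusive;  // initial check-out from the main partition MNC
        case Event::kWrite:      return PageState::kModified;   // first write dirties the page
        case Event::kSyncToHome: return PageState::kShared;     // synchronized back to the main partition MNC
        case Event::kDeallocate: return PageState::kInvalid;
        case Event::kRevalidate: return PageState::kShared;
    }
    throw std::logic_error("unhandled event");
}
```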

Once space is needed in the directory structure, or if a page is requested by another node in the shared state, checked-out pages will eventually be migrated back to the main partition MNC. Moving a page back to the main partition MNC is similar to writing data back to memory from a standard cache. However, the MNC may keep the page in persistent or volatile memory, and this is indicated in the metadata for the checked-out page.

Each MNC is assigned a block or partition of the full system address space at start-up or during the renegotiation process. An example system address range may be: {base_address + 0} → {base_address + n}. This partition is further subdivided into physical addresses (or network addresses in the case of a NIC) behind the MNC. The MNC controls access to all resources behind it that can store data and maps the system address space to physical or network addresses in those resources. The file system and networking functionality may also be mapped into this address space. An accelerator with on-board memory is also mapped into this address space and can be accessed from its virtual address space through the interface without knowledge of the system address space.

Starting on the processor core, a Range Translation Buffer (RTB) or a Translation Lookaside Buffer (TLB) is used to translate the virtual memory address in the request to an address in the system address space. The request is then sent to the memory node controller.

In a first embodiment, the memory space is divided between the MNCs so that there is a fixed static mapping after boot. This approach has the advantage of being fast for route calculation and always going to the correct node. However, this approach may not be optimal for on-chip routing or for optimizing memory placement. One reason the fixed computation is suboptimal is that it increases on-chip memory traffic that could otherwise be used for core-to-core (or thread-to-thread) communication. Another reason is that the overhead required to support on-chip cluster locality with N cores is log2(N) bits for each memory request: each request must be tagged to indicate its origin before being passed to the MNC. This approach may be used, for example, when simplicity is more important than overhead.
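
As a worked example of this overhead (with an assumed core count that is not taken from the disclosure):

```latex
% Worked example; N = 64 cores per node is an assumed figure.
\text{tag bits per request} = \lceil \log_2 N \rceil,
\qquad \text{e.g.}\; N = 64 \;\Rightarrow\; \lceil \log_2 64 \rceil = 6 \text{ bits per memory request.}
```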

In another embodiment, it is assumed that there will be N cores per K MNCs, and that these N cores will be clustered, as opposed to a fully connected grid. In this approach, the routing path is from the cluster through one or more caches (including any DDCs) and then to the off-chip interconnect controller associated with the cluster. For example, this interconnect may utilize PCIe or another physical layer. The MNCs are each assigned a static system address partition. This may be done after boot configuration or system reset to facilitate hot add/remove/plug of storage and memory. At the MNC, the system address range is further subdivided into pages that are either zero-allocated (for initial allocation) or point to physical memory on some device. The advantage of this routing path is that the source of a request is implicit in the traffic source: which core cluster created the traffic is known implicitly, without further information or metadata. Capturing and recording the originating core would otherwise require at least enough bits to encode the number of clusters within a node per memory request. Using the locality information, the MNC network may migrate virtual pages within the network or check them out from the owner of the system address partition if the compute pattern warrants it.

Efficient use of this infrastructure is facilitated through software awareness. Given that the system address space is split between N MNCs, and these MNCs are connected to the computing device based on, for example, physical locality, tasks may be scheduled such that they are executed on computing devices connected to MNCs that control already allocated system and physical memory, or at least to nearby MNCs. This ensures low latency communication.

Fig. 5 is a block diagram of a data processing network 500 consistent with the present disclosure. The data processing network 500 includes a core or computing device 502 operating in a virtual memory space 504. In the illustrated embodiment, the system address space is divided into three partitions, R1, R2, and R3. System 500 also includes MNC 506 associated with system memory partition 508 (R1), MNC 510 associated with system memory partition 512 (R2), and MNC 514 associated with system memory partition 516 (R3). The MNC 506 controls access to memory resources 518, which in this example store pages 520 having physical addresses P1, P2, P3, and P4. MNC 506 translates system addresses in partition R1 to physical addresses in memory resource 518. Similarly, the MNC 510 controls access to memory resources 522, which in this example store pages 524 with physical addresses P5, P6, P7, and P8. MNC 510 translates system addresses in partition R2 to physical addresses in memory resource 522. The MNC 514 controls access to memory resources 526, which in this example store pages 528 having physical addresses P9, P10, P11, and P12. MNC 514 translates system addresses in partition R3 to physical addresses in memory resource 526. In this manner, the 48K address space is split into three 16K partitions, and each 16K partition is split into four 4K pages allocated among the physical resources.
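
The following numeric sketch reproduces the arithmetic of the FIG. 5 example under the assumption, made purely for illustration, that the three partitions are contiguous and start at address zero.

```cpp
// Numeric sketch of the FIG. 5 example: 48K of system address space, three 16K
// partitions (R1..R3), four 4K pages per partition. A base address of zero and a
// contiguous layout are assumptions made only for this illustration.
#include <cstdint>
#include <cstdio>

constexpr uint64_t kPageSize = 4 * 1024;
constexpr uint64_t kPartitionSize = 16 * 1024;

struct Location { int partition; int page_in_partition; };

Location locate(uint64_t system_addr) {
    return { static_cast<int>(system_addr / kPartitionSize),                  // which MNC partition
             static_cast<int>((system_addr % kPartitionSize) / kPageSize) };  // which 4K page within it
}

int main() {
    Location loc = locate(0x9000);  // 36K into the assumed layout
    std::printf("partition R%d, page %d\n", loc.partition + 1, loc.page_in_partition);
    // prints: partition R3, page 1
    return 0;
}
```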

When core 502 issues a request to access data at a virtual address in space 504, the virtual address is translated to a system address in range table buffer 530. If the address is not found in the local cache, the system address is passed to the MNC coupled to the core 502. In this example, the system address is passed to MNC-1 (510). MNC-1(510) determines whether the system address is in partition R2. If the system address is in partition R2, then the corresponding memory resource of 522 is identified along with the physical address in the memory resource and the data at the physical address is accessed. Memory write access may be acknowledged over link 532. Data read from the memory resource is returned over link 532.

If the system address is not in partition R2, the MNC controlling the partition containing the address is identified and the request is forwarded to that MNC for servicing. Any response to the request is returned to core 502 via link 532.

In some embodiments, the translation between the system address and the physical address is made within the MNC using a data structure stored in the system-to-physical translation (STP) cache (e.g., 230 in fig. 2). The data structure may be a table that is hashed using a mask of the page address. For example, the page number may be calculated with a logical AND operation between the system address and a mask derived from the page size, which is a power of two. An example page entry in the STP cache may contain the information shown in Table 1.
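
A hedged sketch of this masking and lookup follows, assuming a 4K power-of-two page size; the flat hash container and the simplified entry stand in for the STP table, whose full entry layout (Table 1) is not reproduced in this text.

```cpp
// Sketch of the page-number masking and table lookup, assuming a 4K
// power-of-two page size; the map and entry are simplified stand-ins.
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kPageSize = 4 * 1024;         // must be a power of two
constexpr uint64_t kPageMask = ~(kPageSize - 1);

inline uint64_t page_base(uint64_t system_addr) {
    return system_addr & kPageMask;              // logical AND with the page mask
}

// system page base -> physical page base (full Table 1 entry omitted)
std::unordered_map<uint64_t, uint64_t> stp_table;

uint64_t to_physical(uint64_t system_addr) {
    uint64_t phys_page = stp_table.at(page_base(system_addr));  // throws if unmapped
    return phys_page | (system_addr & (kPageSize - 1));         // re-attach the page offset
}
```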

Table 1. [Example STP cache page entry; the table contents appear only as an image in the original publication.]

In one embodiment, copy-on-write is supported using three pointers, one to an entry that is the current clean physical copy (head), one to the parent and one to the child. This enables the update process to be optimized. Other variations will be apparent to those skilled in the art.
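
The sketch below shows one possible shape for such an entry and its copy-on-write update, assuming heap-allocated entries and the three pointers named above; the field names and the helper function are hypothetical.

```cpp
// Hypothetical entry layout with the three copy-on-write pointers named above.
#include <cstdint>

struct PageEntry {
    uint64_t   system_page;    // system address of the page
    uint64_t   physical_page;  // current physical backing
    PageEntry* head;           // entry holding the current clean physical copy
    PageEntry* parent;         // entry this copy was cloned from
    PageEntry* child;          // most recent copy-on-write clone of this entry
};

// On a write to a shared page, give the writer private backing and link the clone.
PageEntry* copy_on_write(PageEntry* shared, uint64_t new_physical_page) {
    PageEntry* clean = shared->head ? shared->head : shared;  // current clean copy
    PageEntry* clone = new PageEntry{shared->system_page, new_physical_page,
                                     clean, shared, nullptr};
    shared->child = clone;  // parent keeps a link to its most recent clone
    return clone;
}
```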

Memory allocation may be handled efficiently by using a buddy memory allocation scheme or another scheme that can be represented by a sparse tree. Compaction (reordering) of the system address ranges may be accomplished, for example, by signaling the OS to find the processing threads that hold a system address range and then changing that range. This process can be time consuming. However, for a large system address space (such as 64 bits), this is unlikely to be needed unless systems become very much larger than they are today.

The page information is stored in memory. In the simplest hardware implementation, one entry is used for each page. For example, if a single MNC is assigned a 100TB address partition and the page size is selected to be 2MB, the table will fit into a small 64MB SRAM structure even if the device is completely full. Additional space is required if other metadata is to be stored. However, in one embodiment, the size of the table is reduced by compressing empty pages into a zero-page range. In another embodiment, the translation data may be persistent or may have a second copy to ensure persistence.

FIG. 6 is a flow diagram 600 of a method for accessing one or more data resources by one or more computing devices in a data processing network, consistent with embodiments of the present disclosure. Following start block 602, elements of the data processing network are mapped to a system address space at block 604. At block 606, a range or region of the system address space is allocated to each memory node controller of the data processing network. For example, a first partition may be assigned to a first memory node controller MNC-1 and a second partition may be assigned to a second memory node controller MNC-2. At block 608, a request is received at MNC-1 to access an element of the data processing network at a system address in the system address space. When the system address is in the partition of MNC-1, as depicted by the positive branch from decision block 610, MNC-1 services the request at block 612. When the system address is not in the partition of MNC-1, as depicted by the negative branch from decision block 610, flow continues to decision block 614. If the system address is dynamically shared with the first memory node controller, as depicted by the positive branch from decision block 614, the request is again serviced using MNC-1 at block 612. If the system address is not dynamically shared with the first memory node controller, as depicted by the negative branch from decision block 614, the request is forwarded at block 616 to the MNC to which the system address is assigned, and that MNC services the request. The process then terminates at block 618. MNC-1 services a request by identifying the appropriate data resource at block 620 and translating the system address to a physical address in the data resource at block 622. This may be done, for example, by accessing a table or directory. Servicing of the request is completed at block 624 and the process terminates at block 618.
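
A minimal sketch of this decision flow, under assumed interfaces, is given below; the virtual methods stand in for the MNC behaviour at the numbered blocks and are not the disclosed implementation.

```cpp
// Assumed-interface sketch of the FIG. 6 decision flow: service locally when the
// address is owned (block 610) or dynamically shared (block 614), else forward (block 616).
#include <cstdint>

class Mnc {
public:
    virtual ~Mnc() = default;
    virtual bool owns(uint64_t system_addr) const = 0;               // decision block 610
    virtual bool is_shared_with_me(uint64_t system_addr) const = 0;  // decision block 614
    virtual void service_locally(uint64_t system_addr) = 0;          // blocks 620, 622, 624
    virtual Mnc& owner_of(uint64_t system_addr) = 0;
};

void handle_request(Mnc& mnc, uint64_t system_addr) {
    if (mnc.owns(system_addr) || mnc.is_shared_with_me(system_addr)) {
        mnc.service_locally(system_addr);                        // identify channel, translate, access
    } else {
        mnc.owner_of(system_addr).service_locally(system_addr);  // forward to the owning MNC
    }
}
```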

Some embodiments relate to a method for routing memory access requests consistent with embodiments of the present disclosure. The method is applicable to a clustered memory node controller scheme as described above with reference to FIG. 3. In general, in a clustering scheme, there may be a maximum of N cores or computing devices for each of the K MNCs. The N computer elements are clustered such that the best route is to the local memory node. A memory request to access a resource at a system address is received by an MNC. Each memory request originating from these cores goes directly to the nearest MNC, so the request comes from a core in that MNC's local cluster. A channel to the appropriate data resource holding the requested page is determined at block 620. If the request is for a page that is statically assigned to the MNC or for a page that is dynamically shared from another MNC, the request may be serviced immediately and the result returned to the core. However, if another MNC owns the memory (e.g., as determined by the coherence protocol), there is one additional network hop for the request before it is satisfied. When operation of the system is initiated, each MNC is assigned a region of the overall system addressable space that can be utilized by the system (while providing for re-partitioning to facilitate hot-plugging). The partition assigned to each memory node is then divided into pages. The advantage of this system is that locality is implicit in the MNC at which the memory request originates. The accessing computing device (or at least its cluster) is known without additional data. Using this information, the MNC can migrate data pages within the memory network or check them out from the owner of the partition if the compute pattern warrants it.

The MNC may allocate memory from the system address space of another memory node controller and use it within its own system address translation table to redirect one system address to another. For example, for a defragmentation operation, a first MNC may allocate memory in the system address space partition of a second MNC; the first MNC will then show these pages as checked out from the second MNC. The first MNC keeps the physical memory backing of the pages as it was. Once the address range allocated from the second MNC is entered in the appropriate table, the offset within the range table entry may be changed to point to the new system address range. The previously used system address range then becomes free. The new system address range from the second MNC, and the pages making up that address range, are now free to migrate independently according to a coherency protocol, metadata, or scheduling algorithm.

A system may be equipped with a system cache structure referred to as a data delivery cache (DDC). In this embodiment, system memory requests are sent to the DDC and the MNC simultaneously. If the MNC has registered the page as present in the DDC, the line is served from the DDC and the request is ignored by the MNC. If the line is present only in the MNC, the request is served by the MNC. It should be apparent that the synchronization between MNCs can take the form of a directory or filter mechanism. An example embodiment implements a send/acknowledge system in which the DDC does not begin serving a page until the MNC has received an acknowledgement from the DDC that the page has been installed. Along with the acknowledgement, the DDC receives any memory requests that were outstanding from the time the MNC decided to push the page from the MNC to the DDC.

The various embodiments and examples of the present disclosure presented herein are to be understood as illustrative of the present disclosure and not as limiting its scope.

Further specific and preferred aspects of the present disclosure are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with features of the independent claims as appropriate and in combinations other than those explicitly set out in the claims.

One or more memory node controllers may be implemented in an integrated circuit. For example, a circuit may be defined as a set of instructions in a Hardware Description Language (HDL), which may be stored in a non-transitory computer readable medium. The instructions may be distributed via a computer readable medium or via other means, such as a wired or wireless network. The instructions may be used to control the fabrication or design of an integrated circuit, and may be combined with other instructions.

Although illustrative embodiments of the present invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

It will be appreciated that the above-described apparatus, systems, and methods are set forth by way of example, and not by way of limitation. Absent explicit indication to the contrary, the disclosed steps may be modified, supplemented, omitted, and/or reordered without departing from the scope of the present disclosure. Many variations, additions, omissions, and other modifications will become apparent to those of ordinary skill in the art. Additionally, the order or presentation of method steps in the above description and the figures is not intended to require that the recited steps be performed in that order, unless a particular order is explicitly required or otherwise clear from the context.

Unless a different meaning is explicitly provided or otherwise clear from the context, the method steps of the implementations described herein are intended to include any suitable method that causes such method steps to be performed, consistent with the patentability of the following claims.

It will be appreciated that the above-described methods and systems are set forth by way of example, and not by way of limitation. Many variations, additions, omissions, and other modifications will become apparent to those of ordinary skill in the art. Additionally, the order or presentation of the method steps in the above description and the figures is not intended to require that the recited steps be performed in that order, unless a specific order is explicitly claimed or otherwise clear from the context. Thus, while particular embodiments have been shown and described, it will be obvious to those skilled in the art that various changes and modifications in form and detail may be made without departing from the scope of the disclosure, and it is intended that such changes and modifications form part of the disclosure as defined by the following claims, which are to be interpreted in the broadest sense permitted by law.

Various representative embodiments that have been described in detail herein have been presented by way of example and not by way of limitation. It is understood that various changes in form and details of the described embodiments may be made to produce equivalent embodiments that remain within the scope of the appended claims.

Thus, some features of the disclosed embodiments are set forth in the following numbered items:

1. a memory node controller for a node of a data processing network having at least one computing device and at least one data resource, the node being configured to couple the at least one computing device with the at least one data resource and an element of the data processing network addressable via a system address space, the memory node controller comprising: a first interface to the at least one data resource, wherein each of the at least one data resource is addressed via a physical address space; a second interface to the at least one computing device; and a system address to physical address translator cache configured to translate a system address in the system address space to a physical address in a physical address space of a data resource of the at least one data resource.

2. The memory node controller of item 1, further comprising: a Physical Device Configuration Settings (PDCS) memory that stores information indicating a mapping of elements of the data processing network into the system address space.

3. The memory node controller of item 1, wherein the at least one data resource comprises a remote network, and wherein the first interface comprises a network interface card.

4. The memory node controller of item 1, wherein the first interface comprises an interface to another memory node controller.

5. The memory node controller of item 4, wherein the system address space comprises a plurality of address partitions, and wherein the memory node controller is associated with a first address partition of the plurality of address partitions and the other memory node controller is associated with a second address partition of the plurality of address partitions.

6. The memory node controller of item 1, wherein the second interface comprises an interface to a processor core.

7. The memory node controller of item 1, wherein the second interface comprises an interface to a hardware accelerator.

8. The memory node controller of item 1, wherein the first interface comprises an interface to a memory device or a storage device.

9. A non-transitory computer readable medium having instructions of a hardware description language representing the memory node controller of item 1.

10. A non-transitory computer readable medium having a netlist representing a memory node controller according to item 1.

11. A data processing network, the data processing network comprising: a first memory node controller; a first plurality of addressable units addressed by a system address space and comprising a first plurality of data resources, each of the first plurality of data resources being coupled to the first memory node controller via a channel and addressed by a physical address space; and a first plurality of computing devices each coupled to the first memory node controller and configured to access the first plurality of addressable units via the first memory node controller, wherein the first memory node controller includes a system-to-physical address translator cache configured to translate a system address received from a computing device of the first plurality of computing devices to a physical address in an address space of a data resource of the first plurality of data resources.

12. The data processing network of item 11, further comprising: one or more second memory node controllers coupled to the first memory node controller; wherein the first memory node controller is assigned a first partition of system addresses in the system address space, wherein each of the one or more second memory node controllers is assigned a second partition of system addresses in the system address space, and wherein a computing device of the first plurality of computing devices comprises a range table that associates the first memory node controller with system addresses in the first partition of system addresses and associates each of the one or more second memory node controllers with system addresses in the corresponding second partition of system addresses, and is configured to send a request to access memory at a system address to the memory node controller, of the first and second memory node controllers, that is associated with the system address.

13. The data processing network of item 12, further comprising: a second plurality of data resources each coupled to a second memory node controller of the one or more second memory node controllers via a channel and having a physical address space; and a second plurality of computing devices each coupled to a second memory node controller of the one or more second memory node controllers and configured to access the data processing network via the system address space, wherein the one or more second memory node controllers are configured to couple the second plurality of computing devices with the second plurality of data resources.

14. The data processing network of item 11, wherein the first plurality of addressable units further comprises a hardware accelerator.

15. The data processing network of item 11, wherein the first plurality of addressable units further comprises a network interface card.

16. A method for accessing one or more data resources by one or more computing devices in a data processing network, the method comprising: mapping elements of the data processing network to a system address space; assigning a first partition of the system address space to a first memory node controller of the data processing network, wherein the one or more computing devices and the one or more data resources are coupled to the first memory node controller; receiving, at the first memory node controller, a request to access an element of the data processing network at a system address in the system address space; and servicing, by the first memory node controller, the request when the system address is in a first partition of the system address space.

17. The method of item 16, further comprising: assigning a second partition of the system address space to a second memory node controller of the data processing network; and forwarding the request to the second memory node controller when the system address is in a second partition of the system address space.

18. The method of item 16, further comprising: assigning a second partition of the system address space to a second memory node controller of the data processing network; and servicing, by the first memory node controller, the request when the system address is in a second partition of the system address space and the system address is dynamically shared with the first memory node controller.

19. The method of item 16, wherein each of the one or more data resources is coupled to the first memory node controller via a channel, and wherein servicing the request by the first memory node controller comprises: identifying a channel to a data resource of the one or more data resources corresponding to the system address; translating the system address to a physical address in the data resource; and accessing the data resource at the physical address via the identified channel.

20. The method of item 16, wherein the first partition of the system address space comprises a first plurality of pages, the method further comprising: assigning a second partition of the system address space to a second memory node controller of the data processing network, wherein the second partition of the system address space comprises a second plurality of pages; monitoring access to the second plurality of pages by the one or more computing devices coupled to the first memory node controller; and migrating a page of the second plurality of pages from the second memory node controller to the first memory node controller based on the monitored access.

21. The method of item 20, further comprising: recording a coherency state of the migrated page.

22. The method of item 16, wherein the first partition of the system address space comprises a plurality of lines, and wherein the data processing network further comprises a data delivery cache, the method further comprising: monitoring system memory requests to the plurality of lines by the one or more computing devices coupled to the first memory node controller; servicing, by the first memory node controller, a system memory request when a requested line of the plurality of lines is not present in the data delivery cache; pushing the requested line from the first memory node controller to the data delivery cache of the data processing network in accordance with the monitored system memory requests; and servicing, by the data delivery cache, the system memory request when the requested line is present in the data delivery cache.

23. The method of item 16, further comprising: assigning a second partition of the system address space to a second memory node controller of the data processing network, wherein one or more additional data resources are coupled to the second memory node controller; and the first memory node controller: allocating memory within an address range of the second partition of the system address space; entering the allocated address range in a system address translation table of the first memory node controller; and directing memory requests for addresses within the allocated address range to the second memory node controller.
