System, method and apparatus for resource allocation

Document No.: 152970    Publication date: 2021-10-26

Note: This technology, "System, method and apparatus for resource allocation", was created on 2021-04-06 by Krishna T. Malladi, Andrew Chang, Ehsan Najafabadi, and Yasser A. Zaghloul. Its main content is as follows: Systems, methods, and apparatuses for resource allocation are provided. The method may comprise: determining a first value of a parameter associated with at least one first device in a first cluster; determining a threshold based on the first value of the parameter; receiving a request to process a workload at the first device; determining that a second value of the parameter associated with at least one second device in a second cluster satisfies the threshold; and routing at least a portion of the workload to the second device in response to the threshold being satisfied.

1. A method for resource allocation, comprising:

determining a first value of a parameter associated with at least one first device in a first cluster;

determining a threshold based on the first value of the parameter;

receiving a request to process a workload at the first device;

determining that a second value of the parameter associated with at least one second device in a second cluster satisfies the threshold; and

routing at least a portion of the workload to the second device in response to the threshold being satisfied.

2. The method of claim 1, wherein the method further comprises:

determining that the second value of the parameter associated with the at least one second device in the second cluster exceeds the threshold; and

maintaining at least a portion of the workload at the first device in response to the threshold being exceeded.

3. The method of claim 1, wherein the first cluster or the second cluster comprises at least one of a direct attached memory architecture, a pooled memory architecture, a distributed memory architecture, and a disaggregated memory architecture.

4. The method of claim 3, wherein the direct attached memory architecture comprises at least one of a storage class memory (SCM) device, a dynamic random access memory (DRAM) device, and a DRAM-based vertical NAND device.

5. The method of claim 3, wherein the pooled memory architecture comprises a cache coherence accelerator device.

6. The method of claim 3, wherein the distributed memory architecture comprises a cache coherent device connected with a PCIe interconnect.

7. The method of claim 3, wherein the disaggregated memory architecture comprises physical cluster memory and accelerator extensions in a chassis.

8. The method of any of claims 1 to 7, wherein the method further comprises:

calculating a score based on a projected memory usage of the workload, the first value, and the second value; and

routing at least a portion of the workload to the second device based on the score.

9. The method of any of claims 1 to 7, wherein the step of routing at least a portion of the workload to the second device comprises performing the routing using a cache coherency protocol, wherein the cache coherency protocol comprises at least one of a CXL protocol and a GenZ protocol, and the first cluster and the second cluster are connected via a PCIe fabric.

10. The method of any of claims 1 to 7, wherein the parameter is associated with at least one of a memory resource and a computing resource.

11. The method of any of claims 1 to 7, wherein the parameter comprises at least one of a power characteristic, a total cost of ownership, a remote memory capacity, and a direct memory capacity.

12. An apparatus for resource allocation, comprising:

at least one memory device storing computer-executable instructions; and

at least one processor configured to access the at least one memory device, wherein the at least one processor is configured to execute the computer-executable instructions to:

determine a first value of a parameter associated with at least one first device in a first cluster;

determine a threshold based on the first value of the parameter;

receive a request to process a workload at the first device;

determine that a second value of the parameter associated with at least one second device in a second cluster satisfies the threshold; and

route at least a portion of the workload to the second device in response to the threshold being satisfied.

13. The apparatus of claim 12, wherein the at least one processor is further configured to execute the computer-executable instructions to:

determine that the second value of the parameter associated with the at least one second device in the second cluster exceeds the threshold; and

maintain at least a portion of the workload at the first device in response to the threshold being exceeded.

14. The apparatus of claim 12, wherein the first cluster or the second cluster comprises at least one of a direct attached memory architecture, a pooled memory architecture, a distributed memory architecture, and a disaggregated memory architecture.

15. The apparatus of claim 14, wherein the direct attached memory architecture comprises at least one of a storage class memory (SCM) device, a dynamic random access memory (DRAM) device, and a DRAM-based vertical NAND device.

16. The apparatus of any of claims 12 to 15, wherein the apparatus is further configured to present at least the second device to a host.

17. A system for resource allocation, comprising:

at least one memory device storing computer-executable instructions; and

at least one processor configured to access the at least one memory device, wherein the at least one processor is configured to execute the computer-executable instructions to:

determine a first value of a parameter associated with at least one first device in a first cluster;

determine a threshold based on the first value of the parameter;

receive a request to process a workload at the first device;

determine that a second value of the parameter associated with at least one second device in a second cluster satisfies the threshold; and

route at least a portion of the workload to the second device in response to the threshold being satisfied.

18. The system of claim 17, wherein the at least one processor is further configured to execute computer-executable instructions to:

determine that the second value of the parameter associated with the at least one second device in the second cluster exceeds the threshold; and

maintain at least a portion of the workload at the first device in response to the threshold being exceeded.

19. The system of claim 17, wherein the first cluster or the second cluster comprises at least one of a direct attached memory architecture, a pooled memory architecture, a distributed memory architecture, and a disaggregated memory architecture.

20. The system of claim 19, wherein the direct attached memory architecture comprises at least one of a storage class memory (SCM) device, a dynamic random access memory (DRAM) device, and a DRAM-based vertical NAND device.

Technical Field

The present disclosure relates generally to cache coherency and, more particularly, to systems, methods and apparatus for resource allocation.

Background

Some server systems may include a collection of servers connected by a network protocol. Each server in such a system may include processing resources (e.g., processors) and memory resources (e.g., system memory). In some cases, it may be advantageous for the processing resources of one server to access the memory resources of another server, and it may be advantageous to do so while placing only a minimal burden on the processing resources of either server.

Accordingly, there is a need for an improved system and method for managing memory resources in a system that includes one or more servers.

The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not form the prior art.

Disclosure of Invention

In various embodiments, described herein are systems, methods, and apparatuses for resource allocation. In some embodiments, a method for resource allocation is described. The method may comprise: determining a first value of a parameter associated with at least one first device in a first cluster; determining a threshold based on the first value of the parameter; receiving a request to process a workload at the first device; determining that a second value of the parameter associated with at least one second device in a second cluster satisfies the threshold; and routing at least a portion of the workload to the second device in response to the threshold being satisfied.
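
By way of a non-limiting, purely illustrative sketch (the disclosure does not prescribe any particular implementation; the parameter choice, units, and the headroom rule below are assumptions), the threshold determination and routing decision described above could be expressed in Python as follows:

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    free_memory_gb: float  # example parameter: available direct memory capacity

def derive_threshold(first_value: float, headroom: float = 0.2) -> float:
    # Hypothetical rule: the second cluster must offer at least the first
    # device's parameter value plus a headroom margin before work is moved.
    return first_value * (1.0 + headroom)

def route(first: Device, second: Device) -> Device:
    threshold = derive_threshold(first.free_memory_gb)
    if second.free_memory_gb >= threshold:
        return second   # threshold satisfied: route (part of) the workload remotely
    return first        # otherwise keep the workload at the first device

local = Device("clusterA/dev0", free_memory_gb=64.0)
remote = Device("clusterB/dev3", free_memory_gb=96.0)
print(route(local, remote).name)  # -> clusterB/dev3
```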

In various embodiments, the method may further comprise: determining that the second value of the parameter associated with the at least one second device in the second cluster exceeds the threshold, and in response to the threshold being exceeded, maintaining at least a portion of the workload at the first device. In another embodiment, the first cluster or the second cluster includes at least one of a direct attached memory architecture, a pooled memory architecture, a distributed memory architecture, or a disaggregated memory architecture. In some embodiments, the direct attached memory architecture includes at least one of a Storage Class Memory (SCM) device, a Dynamic Random Access Memory (DRAM) device, and a DRAM-based vertical NAND device. In another embodiment, the pooled memory architecture includes a cache coherent accelerator device. In another embodiment, the distributed memory architecture includes a cache coherent device coupled to a PCIe interconnect. In some embodiments, the disaggregated memory architecture includes memory and accelerator extensions for a physical cluster in a chassis.

In various embodiments, the method may further comprise: calculating a score based on a projected memory usage of the workload, the first value, and the second value; and routing at least a portion of the workload to the second device based on the score. In another embodiment, the routing is performed using a cache coherency protocol that comprises at least one of a CXL protocol or a GenZ protocol, and the first cluster and the second cluster are connected via a PCIe fabric. In one embodiment, the parameter is associated with at least one of a memory resource or a computing resource. In another embodiment, the parameter includes at least one of a power characteristic, a performance per unit energy characteristic, a remote memory capacity, and a direct memory capacity. In some embodiments, the method may include presenting at least the second device to a host.
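
Again as a hypothetical sketch only (the scoring rule, weight, and units are assumptions rather than anything specified by the disclosure), a score combining the projected memory usage of the workload with the first and second parameter values might be computed as follows:

```python
def placement_score(projected_usage_gb: float,
                    first_value: float,
                    second_value: float,
                    weight: float = 1.0) -> float:
    # Favour the second cluster when its spare capacity covers the projected
    # usage better than the first cluster's does (illustrative heuristic).
    local_headroom = first_value - projected_usage_gb
    remote_headroom = second_value - projected_usage_gb
    return weight * (remote_headroom - local_headroom)

def choose_target(projected_usage_gb: float, first_value: float, second_value: float) -> str:
    score = placement_score(projected_usage_gb, first_value, second_value)
    return "second cluster" if score > 0 else "first cluster"

print(choose_target(projected_usage_gb=24.0, first_value=16.0, second_value=80.0))
# -> second cluster
```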

Similarly, apparatus and systems for performing substantially the same or similar operations as described above are further disclosed.

Accordingly, particular embodiments of the subject matter described herein can be implemented such that one or more of the following advantages are achieved. Network latency can be reduced, and network stability and operational data transfer rates can be improved, thereby improving user experience. Costs associated with routing network traffic, network maintenance, network upgrades, and the like can be reduced. Further, in some aspects, the disclosed systems may be used to reduce power consumption and/or bandwidth of devices on a network, and may be used to increase the speed and/or efficiency of communication between devices.

Drawings

The foregoing and other aspects of the present technology will be better understood when the present application is read in view of the following drawings, in which like reference numerals indicate similar or identical elements:

FIG. 1A is a block diagram of a system for attaching memory resources to computing resources using cache coherent connections, according to an embodiment of the present disclosure;

FIG. 1B is a block diagram of a system for attaching memory resources to computing resources using cache coherent connections employing an expansion slot adapter in accordance with an embodiment of the present disclosure;

FIG. 1C is a block diagram of a system for aggregating memory with ethernet top-of-rack (ToR) switches, according to an embodiment of the present disclosure;

FIG. 1D is a block diagram of a system for aggregating memory employing an ethernet ToR switch and expansion slot adapters in accordance with an embodiment of the present disclosure;

FIG. 1E is a block diagram of a system for aggregating memory in accordance with an embodiment of the present disclosure;

FIG. 1F is a block diagram of a system for aggregating memory employing expansion slot adapters in accordance with an embodiment of the present disclosure;

FIG. 1G is a block diagram of a system for a disaggregated server according to an embodiment of the present disclosure;

FIG. 2 depicts a diagram of a representative system architecture in accordance with an example disclosed embodiment, wherein aspects of the disclosed embodiment are operable in connection with a managing computing entity capable of communicating and configuring the various servers described in connection with FIGS. 1A-1G.

FIG. 3A depicts a first diagram of a representative system architecture in accordance with an example embodiment of the disclosure, wherein aspects of the disclosed embodiment are operable in connection with a managing computing entity capable of communicating and configuring the various servers described in connection with FIGS. 1A-1G.

FIG. 3B depicts a second diagram of a representative system architecture in accordance with an example embodiment of the disclosure, wherein aspects of the disclosed embodiment are operable in connection with a managing computing entity capable of communicating and configuring the various servers described in connection with FIGS. 1A-1G.

FIG. 3C depicts a third diagram of a representative system architecture in accordance with an example embodiment of the disclosure, wherein aspects of the disclosed embodiment are operable in connection with a managing computing entity capable of communicating and configuring the various servers described in connection with FIGS. 1A-1G.

FIG. 3D depicts a fourth diagram of a representative system architecture in which aspects of the disclosed embodiments are operable in connection with a managing computing entity capable of communicating and configuring the various servers described in connection with FIGS. 1A-1G, according to an example embodiment of the disclosure.

FIG. 4 depicts a diagram of a representative parameter table that may characterize aspects of the servers described in conjunction with FIGS. 1A-1G, wherein a managing computing entity configures the various servers based on the parameter table, according to a disclosed example embodiment.

FIG. 5 depicts a diagram of a representative network architecture in which aspects of the disclosed embodiments are operable, including embodiments that manage servers in computing entity configurable core, edge, and mobile edge data centers, according to a disclosed example embodiment.

FIG. 6 depicts another diagram of a representative network architecture in which aspects of the disclosed embodiments are operable, including embodiments that manage servers in computing entity configurable core, edge, and mobile edge data centers, according to a disclosed example embodiment.

FIG. 7 depicts another diagram of a representative network architecture in which aspects of the disclosed embodiments are operable, including embodiments that manage servers in computing entity configurable core, edge, and mobile edge data centers, according to a disclosed example embodiment.

FIG. 8 depicts a diagram of a supervised machine learning approach for determining the distribution of workloads across different servers using a managing computing entity in accordance with a disclosed example embodiment.

FIG. 9 depicts a diagram of an unsupervised machine learning method for determining a distribution of workloads across different servers using a managing computing entity, according to an example embodiment of the disclosure.

FIG. 10 illustrates an example schematic of a system that can be used to practice embodiments of the present disclosure.

FIG. 11 illustrates an example schematic diagram of a managing computing entity, according to an example disclosed embodiment.

FIG. 12 shows an example schematic diagram of a user device according to an example embodiment of the disclosure.

FIG. 13 is an illustration of an example method 1300 of operating the disclosed system to determine workload distribution across one or more network clusters, in accordance with an example embodiment of the disclosure.

FIG. 14 is an illustration of an exemplary method 1400 of operating the disclosed system to determine additional workload distribution across one or more network clusters, in accordance with an example embodiment of the disclosure.

FIG. 15 is an illustration of an exemplary method 1500 of operating the disclosed system to determine a distribution of workload across one or more clusters of network architectures in accordance with a disclosed example embodiment.

FIG. 16A is an illustration of an exemplary method 1600 of the disclosed system for routing workloads through a network fabric to one or more clusters of a core data center and one or more edge data centers, according to an exemplary disclosed embodiment.

FIG. 16B is an illustration of another exemplary method 1601 of the disclosed system for routing workloads through a network fabric to a core data center and one or more clusters of one or more edge data centers, according to an example embodiment of the disclosure.

While the technology is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described. The drawings may not be to scale. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the technology to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present technology as defined by the appended claims.

Detailed Description

The details of one or more embodiments of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

Various embodiments of the present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments are shown. Indeed, the disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. The term "or" is used herein in both the alternative and conjunctive sense unless otherwise indicated. The terms "illustrative" and "example" are used to refer to examples and do not indicate a level of quality. Like numbers refer to like elements throughout. The arrows in each figure depict bi-directional data flow and/or bi-directional data flow capabilities. The terms "path," "via," and "route" are used interchangeably herein.

Embodiments of the present disclosure may be implemented in various ways, including as computer program products comprising articles of manufacture. The computer program product may include a non-transitory computer-readable storage medium that stores applications, programs, program components, scripts, source code, program code, object code, bytecode, compiled code, interpreted code, machine code, executable instructions, etc. (also referred to herein as executable instructions, instructions for execution, a computer program product, program code, and/or similar terms used interchangeably herein). Such non-transitory computer-readable storage media include all computer-readable media (including both volatile and non-volatile media).

In one embodiment, the non-volatile computer-readable storage medium may include a floppy disk, a flexible disk, a hard disk, a solid state storage device (SSS) (e.g., a solid state drive (SSD)), a solid state card (SSC), a solid state module (SSM), an enterprise-level flash drive, a magnetic tape, any other non-transitory magnetic medium, or the like. Non-volatile computer-readable storage media may also include punch cards, paper tape, optical mark sheets (or any other physical medium with patterns of holes or other optically recognizable marks), compact disc read-only memory (CD-ROM), compact disc rewritable (CD-RW), digital versatile disc (DVD), Blu-ray disc (BD), any other non-transitory optical medium, and so forth. Such non-volatile computer-readable storage media may also include read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory (e.g., serial, NAND, NOR, etc.), multimedia memory cards (MMC), secure digital (SD) memory cards, SmartMedia cards, CompactFlash (CF) cards, memory sticks, and the like. In addition, the non-volatile computer-readable storage medium may further include conductive bridging random access memory (CBRAM), phase change random access memory (PRAM), ferroelectric random access memory (FeRAM), non-volatile random access memory (NVRAM), magnetoresistive random access memory (MRAM), resistive random access memory (RRAM), silicon-oxide-nitride-oxide-silicon memory (SONOS), floating junction gate random access memory (FJG RAM), Millipede memory, racetrack memory, and the like.

In one embodiment, a volatile computer-readable storage medium may include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), fast page mode dynamic random access memory (FPM DRAM), extended data output dynamic random access memory (EDO DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), double data rate type two synchronous dynamic random access memory (DDR2 SDRAM), double data rate type three synchronous dynamic random access memory (DDR3 SDRAM), Rambus dynamic random access memory (RDRAM), twin transistor RAM (TTRAM), thyristor RAM (T-RAM), zero-capacitor RAM (Z-RAM), Rambus in-line memory module (RIMM), dual in-line memory module (DIMM), single in-line memory module (SIMM), video random access memory (VRAM), cache memory (including various levels), flash memory, register memory, and the like. It will be appreciated that where embodiments are described as using computer-readable storage media, other types of computer-readable storage media may be used in place of or in addition to the computer-readable storage media described above.

It should be understood that various embodiments of the present disclosure may also be implemented as methods, apparatus, systems, computing devices, computing entities, and so forth. Accordingly, embodiments of the present disclosure may take the form of an apparatus, system, computing device, computing entity, or the like executing instructions stored on a computer-readable storage medium to perform particular steps or operations. Accordingly, embodiments of the present disclosure may also take the form of an entirely hardware embodiment, an entirely computer program product embodiment, and/or an embodiment including a combination of hardware and computer program products for performing certain steps or operations.

Embodiments of the present disclosure are described below with reference to block diagrams and flowchart illustrations. It should therefore be understood that each block of the block diagrams and flowchart illustrations may be implemented in the form of a computer program product, an entirely hardware embodiment, a combination of hardware and computer program products, and/or an apparatus, system, computing apparatus, computing entity, or the like that executes instructions, operations, steps, and similar words used interchangeably (e.g., executable instructions, instructions for execution, program code, and so on) stored on a computer-readable storage medium. For example, the retrieving, loading, and executing of code may be performed sequentially, such that one instruction is retrieved, loaded, and executed at a time. In some example embodiments, retrieving, loading, and/or executing may be performed in parallel, such that multiple instructions are retrieved, loaded, and/or executed together. Accordingly, such embodiments may result in a specially configured machine that performs the steps or operations specified in the block diagrams and flowchart illustrations. Accordingly, the block diagrams and flowchart illustrations support various combinations of embodiments for performing the specified instructions, operations, or steps.

In some aspects, as data requirements increase, networked computing and storage may face problems. In particular, very large scale workload requirements are becoming more demanding because, in addition to having high bandwidth allocation requirements, workloads can exhibit a diversity of memory and input/output (IO) latencies. Furthermore, some existing systems may have reduced resource flexibility without reconfiguration of the hardware rack system, which may result in inefficiencies that hinder data processing and storage requirements. Furthermore, computing and memory resources are increasingly tightly coupled, and increased demand for one may impact demand for the other. Furthermore, the entire industry faces a shortage of viable distributed shared memory and large address space systems. In some aspects, fixed resources may increase the cost of ownership (e.g., for a data center-based environment) and may also limit peak performance of the subsystems. In some aspects, hardware used in such environments may have different replacement periods and associated timelines, which may further complicate the updating of such systems. Accordingly, there is a need to improve resource sharing and the matching of resources to workloads in networked computing systems.

In some embodiments, a cache coherency protocol, such as Compute Express Link (CXL), may implement memory expansion and coherent accelerators. In various embodiments, the disclosed system may use a cache coherency protocol (such as CXL) to implement a class of memory systems and accelerators while accommodating the unique configurations required by different workloads. Thus, the disclosed system may enable composable cache coherent (e.g., CXL) memory and accelerator resources by utilizing a fabric and architecture that presents a system view to each workload running across racks, for example, in one or more clusters of a data center. In some aspects, the disclosed system may be used to extend cache coherency beyond a single server, provide management of heterogeneous chassis based on workload requirements, and provide composability of resources. Further, in some examples, CXL over a PCIe fabric may be used as a counterpart of another protocol, such as Non-Volatile Memory Express over Fabrics (NVMe-oF), which may be used for composability of remote I/O devices. As used herein, composable may refer to the attribute by which a given device (e.g., a cache coherency-enabled device in a particular cluster) may request and/or obtain resources (e.g., memory, computing, and/or network resources) (e.g., for executing at least a portion of a workload) from a different portion of a network (e.g., from at least one other cache coherency-enabled device in a second cluster). In some embodiments, composability, as used herein, may include combining fluid pools of physical and virtual computing, storage, and fabric resources into any suitable configuration to run any application or workload.

In various embodiments, the disclosed systems may include one or more architectural components, including a cache coherent CXL module having one or more processors (e.g., RISC-V processors) that may be configured to perform various operations associated with a control plane. Furthermore, the disclosed system may be capable of using one or more homogeneous pools of cache-coherent CXL resources, which will be discussed further below. In particular, the disclosed system may feature a management computing device for exposing and utilizing the performance, capacity, and acceleration characteristics of cache coherency resources for use by various network devices. In particular, the management computing device may determine one or more parameters associated with a system operated by the management computing device and route workloads to different clusters based on the parameters.

In various embodiments, the disclosed system may be capable of using multiple homogeneous resource pools, each pool dedicated to a particular cache coherency architecture. In particular, the disclosed system may use a Type A cluster (which may represent a set of servers with direct memory expansion devices (SCM, DRAM-ZNAND hybrid)), a Type B cluster (which may represent a set of CXL 2.0 composable coherent accelerators), a Type C cluster (which may include CXL devices connected in a distributed memory system architecture with a back-gated PCIe interconnect, whereby processes share the same address space), and a Type D cluster (which includes physical cluster memory and accelerator extensions in the same fabric (e.g., chassis)).
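
The following is a minimal, hypothetical annotation of that cluster taxonomy (the enum values are descriptive labels for illustration only, not definitions taken from the disclosure):

```python
from enum import Enum

class ClusterType(Enum):
    # Labels follow the Type A-D taxonomy described above.
    TYPE_A = "direct-attached memory expansion devices (SCM, DRAM-ZNAND hybrid)"
    TYPE_B = "pooled, composable CXL 2.0 coherent accelerators"
    TYPE_C = "distributed memory: CXL devices sharing an address space over PCIe"
    TYPE_D = "disaggregated physical cluster memory/accelerator extensions in one chassis"

for cluster_type in ClusterType:
    print(f"{cluster_type.name}: {cluster_type.value}")
```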

In various embodiments, the disclosed system, including the managing computing device, may feature a smart device architecture. In particular, the disclosed system may feature devices that plug into a cache coherency interface (e.g., a CXL/PCIe5 interface) and may implement various cache and memory protocols based on type 2 devices (e.g., the cxl.cache and cxl.memory protocols). Further, in some examples, the device may include a programmable controller or processor (e.g., a RISC-V processor) that may be configured to present a remote coherent device as part of the local system, negotiated using a cache coherency protocol (e.g., the cxl.io protocol).

In various embodiments, the disclosed system may implement cluster-level performance-based control and management capabilities whereby workloads may be automatically routed (e.g., via algorithmic methods and/or machine learning-based methods) based on remote architecture configuration, device performance, power characteristics, and the like. In some examples, the disclosed systems may be programmed, at least in part, via ASIC circuitry, FPGA cells, or the like. Further, such devices may implement AI-based techniques (e.g., machine learning-based methods) to route workloads as shown and described herein. Further, the disclosed system may use a management computing entity to perform discovery and/or workload partitioning and/or resource binding based on predetermined criteria (e.g., best performance per unit of currency or power). In addition, the managing computing entity may perform such operations based on various parameters of the system, including but not limited to: a cache coherency protocol (e.g., CXL) based round trip time, a determination of whether a device is in host bias or device bias, a cache coherency protocol (e.g., CXL) based switch hierarchy and/or host upstream port to device downstream port binding, a cache coherency protocol (e.g., CXL) based switch fabric manager configuration, cache coherency protocol (e.g., CXL) based protocol or physical media packets (e.g., cxl.io or PCIe intervening bulk 4KB packets), network latency, a cache coherency protocol (e.g., CXL) based memory technology (e.g., memory type), combinations thereof, and the like.
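
As one hedged illustration of how such parameters could drive an automatic routing choice toward a predetermined criterion such as best performance per watt (the parameter names, units, latency budget, and policy below are assumptions, not the algorithm of the disclosure):

```python
from dataclasses import dataclass

@dataclass
class ClusterParams:
    name: str
    cxl_round_trip_us: float   # cache coherency protocol based round trip time
    network_latency_us: float
    perf_per_watt: float
    device_bias: bool          # True = device bias, False = host bias

def rank_clusters(clusters, max_latency_us: float = 50.0):
    """Keep clusters meeting a latency budget, then order by performance per watt."""
    eligible = [c for c in clusters
                if c.cxl_round_trip_us + c.network_latency_us <= max_latency_us]
    return sorted(eligible, key=lambda c: c.perf_per_watt, reverse=True)

clusters = [
    ClusterParams("A", 2.0, 10.0, 5.0, False),
    ClusterParams("B", 3.5, 30.0, 8.0, True),
    ClusterParams("C", 4.0, 60.0, 9.0, True),   # excluded: exceeds the latency budget
]
print([c.name for c in rank_clusters(clusters)])  # -> ['B', 'A']
```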

In various embodiments, the managing computing entity may operate at a rack and/or cluster level and/or may operate at least partially within a given device (e.g., a cache coherency enabled device) that is part of a given cluster architecture (e.g., a Type A, B, C, and/or D cluster). In various embodiments, a device within a given cluster architecture may perform a first portion of the operations of the managing computing entity, while another portion of the operations of the managing computing entity may be implemented on a rack and/or at the cluster level. In some embodiments, both portions of an operation may be performed in a coordinated manner (e.g., devices in a cluster sending coordination messages to, and receiving coordination messages from, a managing computing entity implemented on a rack and/or at a cluster level). In some embodiments, the first portion of operations associated with devices in the cluster may include, but is not limited to, operations for determining current or future resource requirements of the device or cluster, advertising current or future resource availability of the device or cluster, synchronizing particular parameters associated with algorithms running at the device or cluster level, training one or more machine learning modules associated with the operation of the device or rack/cluster, logging corresponding data associated with routing workloads, combinations thereof, and the like.
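
A toy sketch of that split follows (message kinds, field names, and the selection policy are hypothetical): a device-resident portion advertises availability to a rack/cluster-level portion, which answers resource requirements with a routing decision.

```python
from dataclasses import dataclass

@dataclass
class CoordinationMessage:
    sender: str   # e.g., "clusterB/dev3" or "rack-manager"
    kind: str     # "availability", "requirement", or "route"
    payload: dict

class RackLevelManager:
    def __init__(self):
        self.availability = {}

    def receive(self, msg: CoordinationMessage):
        if msg.kind == "availability":
            self.availability[msg.sender] = msg.payload
        elif msg.kind == "requirement":
            # Toy policy: pick the advertiser with the most free memory.
            target = max(self.availability,
                         key=lambda d: self.availability[d]["free_gb"],
                         default=None)
            return CoordinationMessage("rack-manager", "route", {"target": target})

mgr = RackLevelManager()
mgr.receive(CoordinationMessage("clusterB/dev3", "availability", {"free_gb": 96}))
mgr.receive(CoordinationMessage("clusterA/dev0", "availability", {"free_gb": 16}))
print(mgr.receive(CoordinationMessage("clusterA/dev0", "requirement", {"need_gb": 32})))
```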

Peripheral Component Interconnect Express (PCIe) may represent a computer interface that may have a relatively high and variable latency, which may limit the usefulness of the interface when connecting to memory. CXL is an open industry standard for communicating over PCIe 5.0 that can provide a fixed, relatively short packet size and, as a result, relatively high bandwidth and relatively low, fixed latency. As such, CXL can support cache coherency, and CXL is well suited for connecting to memory. CXL can also be used to provide a connection between a host and accelerators, memory devices, and network interface circuits (or "network interface controllers" or network interface cards (NICs)) in servers.

Cache coherency protocols such as CXL can also be used for heterogeneous processing (e.g., in scalar, vector, and buffer memory systems). CXL can be used to provide a cache coherent interface by leveraging the channels, retimers, PHY layer, logical aspects of the interface, and protocols from PCIe 5.0. The CXL transaction layer may include three multiplexed sub-protocols running simultaneously on a single link, referred to as cxl.io, cxl.cache, and cxl.memory. The cxl.io sub-protocol may include I/O semantics that may be similar to PCIe. The cxl.cache sub-protocol may include caching semantics, and the cxl.memory sub-protocol may include memory semantics; both the caching semantics and the memory semantics may be optional. As with PCIe, CXL can support (i) native widths of x16, x8, and x4, which may be partitionable, (ii) a data rate of 32 GT/s, which can be degraded to 8 GT/s or 16 GT/s, with 128b/130b encoding, (iii) 300 W (75 W in an x16 connector), and (iv) plug and play. To support plug and play, a PCIe or CXL device link may begin training in PCIe Gen1, negotiate CXL, complete Gen1-5 training, and then begin CXL transactions.
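
The plug-and-play bring-up order mentioned at the end of the preceding paragraph can be traced with a toy state machine (the state names and the fallback behavior are illustrative assumptions, not a normative description of the CXL specification):

```python
BRING_UP_SEQUENCE = [
    "pcie_gen1_training",     # link first trains as ordinary PCIe at Gen1
    "cxl_negotiation",        # alternate-protocol negotiation selects CXL
    "gen1_to_gen5_training",  # link completes training up through the supported rates
    "cxl_transactions",       # cxl.io / cxl.cache / cxl.memory traffic begins
]

def bring_up(supports_cxl: bool) -> str:
    for state in BRING_UP_SEQUENCE:
        if state == "cxl_negotiation" and not supports_cxl:
            return "pcie_only_link"   # assumed fallback to plain PCIe operation
        print("entering:", state)
    return "cxl_link_up"

print(bring_up(supports_cxl=True))  # -> cxl_link_up
```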

In some embodiments, as discussed in further detail below, using CXL connections for aggregation or "pooling" of memory (e.g., multiple memories, including multiple memory units connected together) may provide various advantages in a system including multiple servers connected together over a network. For example, a CXL switch (referred to herein as an "enhanced-capability CXL switch") having other capabilities in addition to providing packet-switched functionality for CXL packets may be used to connect the aggregation of memory to one or more Central Processing Units (CPUs) (or "central processing circuits") and one or more network interface circuits (which may have enhanced capabilities). Such a configuration may enable (i) an aggregation of memories to include various types of memories having different characteristics, (ii) an enhanced CXL switch to virtualize the aggregation of memories and store data of different characteristics (e.g., access frequency) in the appropriate types of memories, and (iii) an enhanced CXL switch to support Remote Direct Memory Access (RDMA) so that RDMA may be performed with little or no involvement of processing circuitry from the server. As used herein, "virtualizing" memory means performing memory address translation between processing circuitry and memory.

The CXL switch can (i) support memory and accelerator disaggregation through single-level switching, (ii) enable resources to be taken offline and brought online between domains, which can enable cross-domain time-division multiplexing based on demand, and (iii) support virtualization of downstream ports. CXL may be used to implement an aggregated memory that enables one-to-many and many-to-one switching (e.g., it may be capable of (i) connecting multiple root ports to one endpoint, (ii) connecting one root port to multiple endpoints, or (iii) connecting multiple root ports to multiple endpoints), where, in some embodiments, the aggregated device is divided into multiple logical devices, each having a respective LD-ID (logical device identifier). In such embodiments, a physical device may be divided into a plurality of logical devices, each of which is visible to a respective initiator. A device may have one physical function (PF) and multiple (e.g., 16) isolated logical devices. In some embodiments, the number of logical devices (e.g., the number of partitions) may be limited (e.g., to 16), and there may also be one control partition (which may be a physical function used for controlling the device).
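
The logical-device partitioning described above can be pictured with a small, hypothetical model (the class, method names, and capacity bookkeeping are assumptions used only to illustrate the LD-ID idea):

```python
from dataclasses import dataclass, field

MAX_LOGICAL_DEVICES = 16  # per the example limit given above

@dataclass
class PooledMemoryDevice:
    """Toy model of one physical device carved into isolated logical devices."""
    capacity_gb: int
    partitions: dict = field(default_factory=dict)  # ld_id -> (initiator, size_gb)

    def bind(self, ld_id: int, initiator: str, size_gb: int) -> None:
        if not 0 <= ld_id < MAX_LOGICAL_DEVICES:
            raise ValueError("LD-ID out of range")
        allocated = sum(size for _, size in self.partitions.values())
        if allocated + size_gb > self.capacity_gb:
            raise ValueError("not enough capacity left on the physical device")
        # Each logical device is visible only to its respective initiator.
        self.partitions[ld_id] = (initiator, size_gb)

device = PooledMemoryDevice(capacity_gb=512)
device.bind(0, "host-0", 128)
device.bind(1, "host-1", 256)
print(device.partitions)  # {0: ('host-0', 128), 1: ('host-1', 256)}
```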

In some embodiments, a fabric manager may be employed for (i) performing device discovery and virtual CXL software creation and (ii) binding virtual ports to physical ports. Such a fabric manager may operate over an SMBus sideband connection. The fabric manager may be implemented in hardware, or software, or firmware, or a combination thereof, and it may reside, for example, in the host, in one of the memory modules 135, or in the enhanced capability cache coherency switch 130, or elsewhere in the network. In some embodiments, the cache coherent switch may be the CXL switch 130. The fabric manager may issue commands (including commands issued over a sideband bus or over a PCIe tree).
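
The two fabric-manager duties named above, device discovery and virtual-to-physical port binding, can be sketched as follows (the data model and function names are hypothetical; an actual fabric manager would use the protocol's own management commands):

```python
def discover_devices(physical_ports: dict) -> dict:
    """Return only the downstream ports that have a device behind them."""
    return {port: dev for port, dev in physical_ports.items() if dev is not None}

def bind_ports(virtual_ports: list, discovered: dict) -> dict:
    """Bind each virtual (host-visible) port to a free physical downstream port."""
    binding, free = {}, list(discovered)
    for vport in virtual_ports:
        if not free:
            break
        binding[vport] = free.pop(0)
    return binding

physical = {"dsp0": "memory-module-0", "dsp1": None, "dsp2": "memory-module-1"}
print(bind_ports(["vppb0", "vppb1"], discover_devices(physical)))
# -> {'vppb0': 'dsp0', 'vppb1': 'dsp2'}
```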

Referring to fig. 1A, in some embodiments, a server system includes a plurality of servers 105 connected together by a top-of-rack (ToR) ethernet switch 110. Although the switch is described as using the ethernet protocol, any other suitable network protocol may be used. Each server includes one or more processing circuits 115, each processing circuit 115 connected to (i) a system memory 120 (e.g., a double data rate (version 4) (DDR4) memory or any other suitable memory), (ii) one or more network interface circuits 125, and (iii) one or more CXL memory modules 135. Each of the processing circuits 115 may be a stored-program processing circuit (e.g., a central processing unit (CPU) (e.g., an x86 CPU), a graphics processing unit (GPU), or an ARM processor). In some embodiments, the network interface circuit 125 may be embedded on one of the memory modules 135 (e.g., on the same semiconductor chip as one of the memory modules 135, or in the same module as one of the memory modules 135), or the network interface circuit 125 may be packaged separately from the memory modules 135.

In various embodiments, the management computing entity 102 (to be described in detail below) may be configured to include a processing element (e.g., a processor, an FPGA, an ASIC, a controller, etc.) that may monitor one or more parameters associated with any portion of the network (e.g., ethernet traffic, data center parameters, ToR ethernet switch 110 parameters, parameters associated with the server 105, parameters associated with the network interface circuit (NIC) 125, parameters associated with one or more CXL memory modules 135, combinations thereof, etc.) to route workloads and/or portions of workloads to different portions of the network (including any suitable elements of fig. 1A-1G described herein). Further, as described above, in various embodiments, the disclosed system may implement cluster-level performance-based control and management capabilities, whereby workloads may be automatically routed (e.g., via algorithmic methods and/or machine learning-based methods) based on remote architecture configuration, device performance, power characteristics, and the like. In some examples, the disclosed systems may be programmed, at least in part, via ASIC circuitry, FPGA cells, or the like. Further, such devices may implement AI-based techniques (e.g., machine learning-based methods) to route workloads as shown and described herein. Further, the disclosed system may use a management computing entity to perform discovery and/or workload partitioning and/or resource binding based on predetermined criteria (e.g., best performance per unit of currency or power). Further, the managing computing entity may perform such operations based on various parameters of the system, including but not limited to: a cache coherency protocol (e.g., CXL) based round trip time, a determination of whether a device is in host bias or device bias, a cache coherency protocol (e.g., CXL) based switch hierarchy and/or host upstream port to device downstream port bindings, a cache coherency protocol (e.g., CXL) based switch fabric manager configuration, a cache coherency protocol (e.g., CXL) based protocol packet or physical media packet (e.g., cxl.io or PCIe intervening bulk 4KB packets), a network latency, a cache coherency protocol (e.g., CXL) based memory technology (e.g., memory type), combinations thereof, and the like.

As used herein, a "memory module" is a package (e.g., a package including a printed circuit board and components connected thereto, or an encapsulation including a printed circuit board) that includes one or more memory dies, each memory die including a plurality of memory cells. Each memory die or each of a set of groups of memory dies may be in an enclosure (e.g., an epoxy molding compound (EM C) enclosure) soldered to a printed circuit board of the memory module (or connected to the printed circuit board of the memory module by a connector). Each memory module 135 may have a CXL interface and may include a controller 137 (e.g., FPGA, ASIC, processor, etc.) for converting between CXL data packets and the memory interface of the memory die (e.g., signals suitable for the memory technology of the memory in the memory module 135). As used herein, a "memory interface" of a memory die is an interface that is inherent to the technology of the memory die (e.g., in the case of a DRAM, the memory interface may be a word line and a bit line). The memory module may also include a controller 137 that may provide enhanced capabilities as described in further detail below. The controller 137 of each memory module 135 may be connected to the processing circuitry 115 through a cache coherency interface (e.g., through a CXL interface). The controller 137 may also facilitate data transfers (e.g., RDMA requests) between different servers 105, thereby bypassing the processing circuit 115. The ToR ethernet switch 110 and the network interface circuit 125 may include RDMA interfaces to facilitate RDMA requests between CXL memory devices on different servers (e.g., the ToR ethernet switch 110 and the network interface circuit 125 may provide hardware offload or hardware acceleration by fusing RDMA (roce), Infiniband, and iWARP packets of ethernet).

The CXL interconnects in the system may conform to a cache coherency protocol, such as the CXL 1.1 standard or, in some embodiments, the CXL 2.0 standard, future versions of CXL, or any other suitable protocol (e.g., cache coherency protocol). The memory modules 135 may be directly connected to the processing circuits 115 as shown, and the top-of-rack ethernet switch 110 may be used to scale the system to a larger size (e.g., with a larger number of servers 105).

In some embodiments, as shown in fig. 1A, each server may be populated with multiple directly attached CXL memory modules 135. Each memory module 135 may expose a set of Base Address Registers (BARs) to the host's basic input/output system (BIOS) as a memory range. One or more of the memory modules 135 may include firmware to transparently manage its memory space behind the host OS map. Each memory module 135 may include one or a combination of memory technologies including, for example, but not limited to, dynamic random access memory (DRAM), NAND flash, high bandwidth memory (HBM), and low-power double data rate synchronous dynamic random access memory (LPDDR SDRAM) technologies, and may also include a cache controller or separate controllers for memory devices of different technologies (in the case of a memory module 135 that combines several memory devices of different technologies). Each memory module 135 may have a different interface width (x4-x16) and may be constructed in accordance with any of a variety of relevant form factors (e.g., U.2, M.2, half height half length (HHHL), full height half length (FHHL), E1.S, E1.L, E3.S, and E3.H).

In some embodiments, as described above, the enhanced CXL switch 130 includes an FPGA (or ASIC) controller 137 and provides additional features in addition to the switching of CXL packets. The controller 137 of the enhanced CXL switch 130 can also serve as a management device for the memory module 135 and assist in host control plane processing, and it can implement rich control semantics and statistics. The controller 137 may include an additional "back-gate" (e.g., 100 gigabit ethernet (GbE)) network interface circuit 125. In some embodiments, the controller 137 is presented to the processing circuitry 115 as a type 2 CXL device, enabling a cache invalidation instruction to be issued to the processing circuitry 115 upon receipt of a remote write request. In some embodiments, the DDIO technique is enabled, and remote data is first pulled to the last level cache (LLC) of the processing circuit and later written (from the cache) to the memory module 135. As used herein, a "type 2" CXL device is a CXL device that can initiate transactions and implement an optional coherent cache and host-managed device memory, and whose applicable transaction types include all cxl.cache and all cxl.memory transactions.

As described above, one or more of the memory modules 135 may include persistent memory or "persistent storage" (i.e., storage devices that do not lose data when external power is disconnected). If the memory module 135 is presented as a persistent device, the controller 137 of the memory module 135 may manage the persistent domain (e.g., it may store data in persistent storage that is identified by the processing circuitry 115 as requiring persistent storage (e.g., as a result of an application making calls to corresponding operating system functions)). In such embodiments, the software API may flush the cache and data to persistent storage.

In some embodiments, direct memory transfers from the network interface circuit 125 to the memory modules 135 are enabled. Such a transfer may be a one-way transfer to a remote storage for fast communication in a distributed system. In such embodiments, the memory modules 135 may expose hardware details to the network interface circuits 125 in the system to enable faster RDMA transfers. In such a system, two scenarios may occur depending on whether Data Direct I/O (DDIO) of the processing circuit 115 is enabled or disabled. DDIO may enable direct communication between an ethernet controller or ethernet adapter and the cache of the processing circuit 115. If the DDIO of the processing circuit 115 is enabled, the target of the transfer may be the last level cache of the processing circuit, from which the data may then be automatically flushed to the memory module 135. If the DDIO of the processing circuit 115 is disabled, the memory module 135 may operate in a device bias mode to force the destination memory module 135 (without DDIO) to receive accesses directly. Such RDMA transfers may be implemented using RDMA-capable network interface circuits 125 with host channel adapters (HCAs), buffers, and other processing, which may bypass the target memory buffer transfers that may exist in other RDMA transfer modes. For example, in such embodiments, the use of a bounce buffer (e.g., a buffer in the remote server used when the final destination in memory is in an address range not supported by the RDMA protocol) may be avoided. In some embodiments, RDMA uses a physical medium other than ethernet (e.g., for use with switches configured to handle other network protocols). Examples of RDMA-enabled inter-server connections include, but are not limited to, Infiniband, RDMA over converged ethernet (RoCE) (using the ethernet User Datagram Protocol (UDP)), and iWARP (using the transmission control protocol/internet protocol (TCP/IP)).
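
The two DDIO scenarios above can be summarized in a short, purely illustrative sketch (the object model is hypothetical): with DDIO enabled, the payload of an inbound write lands in the last level cache and is flushed to the module later; with DDIO disabled, the module in device bias mode receives the access directly.

```python
def handle_inbound_rdma_write(address, payload, ddio_enabled, llc, module_memory):
    """Toy model of where an inbound RDMA write lands in the two DDIO scenarios."""
    if ddio_enabled:
        llc[address] = payload            # target is the processor's last level cache
        module_memory[address] = payload  # later flushed from the cache to the module
        return "via LLC"
    module_memory[address] = payload      # device bias mode: module receives the access directly
    return "direct to memory module"

llc, module = {}, {}
print(handle_inbound_rdma_write(0x1000, b"data", True, llc, module))   # via LLC
print(handle_inbound_rdma_write(0x2000, b"data", False, llc, module))  # direct to memory module
```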

FIG. 1B shows a system similar to that of FIG. 1A, in which the processing circuitry 115 is connected to the network interface circuitry 125 through a memory module 135. The memory module 135 and the network interface circuit 125 are located on the expansion slot adapter 140. Each expansion slot adapter 140 may be plugged into an expansion slot 145 (e.g., an m.2 connector) on the motherboard of server 105. As such, the server may be any suitable (e.g., industry standard) server that is modified by installing expansion slot adapter 140 in expansion slot 145. In such embodiments, (i) each network interface circuit 125 may be integrated into a respective one of the memory modules 135, or (ii) each network interface circuit 125 may have a PCIe interface (the network interface circuits 125 may be PCIe endpoints (i.e., PCIe slaves)) such that the processing circuit 115 to which the network interface circuit 125 is connected (which may operate as a PCIe master or "root port") may communicate with the network interface circuit 125 over a root port-to-endpoint PCIe connection, and the controller 137 of the memory module 135 may communicate with the network interface circuit 125 over a point-to-point PCIe connection.

According to an embodiment of the present invention, there is provided a system including: a first server including a stored program processing circuit, a first network interface circuit, and a first memory module, wherein the first memory module includes a first memory die and a controller, the controller is coupled to the first memory die through a memory interface, and the controller is coupled to the stored program processing circuit through a cache coherency interface and is coupled to the first network interface circuit. In some embodiments, the first memory module also includes a second memory die, the first memory die including volatile memory and the second memory die including persistent memory. In some embodiments, the persistent memory comprises NAND flash memory. In some embodiments, the controller is configured to provide a flash translation layer for the persistent memory. In some embodiments, the cache coherency interface comprises a Compute Express Link (CXL) interface. In some embodiments, the first server includes an expansion slot adapter connected to an expansion slot of the first server, the expansion slot adapter including the first memory module and the first network interface circuit. In some embodiments, the controller of the first memory module is connected to the stored program processing circuit through the expansion slot. In some embodiments, the expansion slot comprises an m.2 slot. In some embodiments, the controller of the first memory module is connected to the first network interface circuit by a point-to-point peripheral component interconnect express (PCIe) connection. In some embodiments, the system further comprises a second server and a network switch connected to the first server and the second server. In some embodiments, the network switch comprises a top-of-rack (ToR) ethernet switch. In some embodiments, the controller of the first memory module is configured to receive a Remote Direct Memory Access (RDMA) request and to send an RDMA response. In some embodiments, the controller of the first memory module is configured to receive a Remote Direct Memory Access (RDMA) request through the network switch and through the first network interface circuit, and to send an RDMA response through the network switch and through the first network interface circuit. In some embodiments, the controller of the first memory module is configured to: receive data from the second server; store the data in the first memory module; and send a command to the stored program processing circuitry to invalidate a cache line. In some embodiments, the controller of the first memory module comprises a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC).

According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including a first server and a second server, the first server including a stored program processing circuit, a network interface circuit, and a first memory module including a controller, the method comprising: receiving, by the controller of the first memory module, a Remote Direct Memory Access (RDMA) request; and sending, by the controller of the first memory module, an RDMA response. In some embodiments, the computing system further comprises an ethernet switch connected to the first server and the second server, and the step of receiving the RDMA request comprises receiving the RDMA request through the ethernet switch.
In some embodiments, the method further comprises: the method includes receiving, by a controller of a first memory module, a read command for a first memory address from a stored program processing circuit, converting, by the controller of the first memory module, the first memory address to a second memory address, and retrieving, by the controller of the first memory module, data from the first memory module at the second memory address. In some embodiments, the method further comprises: the method includes receiving data by a controller of a first memory module, storing the data in the first memory module by the controller of the first memory module, and sending a command to invalidate a cache line to a stored program processing circuit by the controller of the first memory module. According to an embodiment of the present invention, there is provided a system including: a first server including stored program processing circuitry, first network interface circuitry, and a first memory module, wherein the first memory module includes a first memory die and a controller device coupled to the first memory die through a memory interface, the controller device coupled to the stored program processing circuitry through a cache coherency interface, and coupled to the first network interface circuitry.

Referring to fig. 1C, in some embodiments, the server system includes a plurality of servers 105 connected together by a top-of-rack (ToR) ethernet switch 110. Each server includes one or more processing circuits 115, each processing circuit 115 connected to (i) a system memory 120 (e.g., DDR4 memory), (ii) one or more network interface circuits 125, and (iii) an enhanced capability CXL switch 130. The enhanced CXL switch 130 may be connected to a plurality of memory modules (MEM) 135. That is, the system of FIG. 1C includes a first server 105, the first server 105 including a stored program processing circuit 115, a network interface circuit 125, a cache coherent switch 130, and a first memory module 135. In the system of FIG. 1C, the first memory module 135 is connected to a cache coherent switch 130, the cache coherent switch 130 is connected to the network interface circuitry 125, and the stored program processing circuitry 115 is connected to the cache coherent switch 130.

The memory modules 135 may be grouped by type, form factor, or technology type (e.g., DDR4, DRAM, LPDDR, high bandwidth memory (HBM), or NAND flash or other persistent storage (e.g., solid state drives containing NAND flash)). Each memory module may have a CXL interface and include interface circuitry for converting between CXL data packets and signals appropriate for the memory in the memory module 135. In some embodiments, these interface circuits are instead in the enhanced CXL switch 130, and each of the memory modules 135 has an interface that is a local interface to the memory in the memory module 135. In some embodiments, the enhanced CXL switch 130 is integrated into the memory module 135 (e.g., in an m.2 form factor package with other components of the memory module 135, or in a single integrated circuit with other components of the memory module 135).

ToR ethernet switch 110 may include interface hardware to facilitate RDMA requests between aggregated memory devices on different servers. The enhanced capabilities CXL switch 130 can include one or more circuits (e.g., can include FPGAs or ASICs) to (i) route data to different memory types based on workload, (ii) virtualize host addresses to device addresses, and/or (iii) facilitate RDMA requests between different servers, bypassing the processing circuits 115.

The memory modules 135 may be in an expansion box (e.g., in the same rack as the enclosure that houses the server's motherboard) that may include a predetermined number (e.g., more than 20 or more than 100) of memory modules 135, each memory module 135 plugged into an appropriate connector. The module may have an M.2 form factor and the connector may be an M.2 connector. In some embodiments, the connection between the servers is through a different network than ethernet (e.g., they may be wireless connections such as WiFi or 5G connections). Each processing circuit may be an x86 processor or another processor (e.g., an ARM processor or GPU). The PCIe link that instantiates the CXL link can be PCIe 5.0 or another version (e.g., an earlier version or a later (e.g., future) version (e.g., PCIe 6.0)). In some embodiments, a different cache coherency protocol is used in place of, or in addition to, CXL in the system, and a different cache coherency switch can be used in place of, or in addition to, the enhanced-capability CXL switch 130. Such a cache coherency protocol may be another standard protocol or a cache coherency variant of a standard protocol (in a similar manner to the way CXL is a variant of PCIe 5.0). Examples of standard protocols include, but are not limited to, non-volatile dual in-line memory module (version P) (NVDIMM-P), cache coherent interconnect for accelerators (CCIX), and open coherent accelerator processor interface (OpenCAPI).

The system memory 120 may include, for example, DDR4 memory, DRAM, HBM, or LPDDR memory. The memory modules 135 may be partitioned or include a cache controller to handle a variety of memory types. The memory modules 135 may have different form factors, examples of which include, but are not limited to, HHHL, FHHL, M.2, U.2, mezzanine card, daughter card, E1.S, E1.L, E3.L, and E3.S.

In some embodiments, the system implements an aggregation architecture comprising a plurality of servers, wherein each server is aggregated with a plurality of CXL-attached memory modules 135. Each of the memory modules 135 may include multiple partitions that may be individually exposed as memory devices to multiple processing circuits 115. Each input port of the enhanced CXL switch 130 can independently access a plurality of output ports of the enhanced CXL switch 130 and the memory modules 135 connected thereto. As used herein, an "input port" or "upstream port" of an enhanced-capability CXL switch 130 is a port connected to (or adapted to be connected to) a PCIe root port, and an "output port" or "downstream port" of an enhanced-capability CXL switch 130 is a port connected to (or adapted to be connected to) a PCIe endpoint. As with the embodiment of fig. 1A, each memory module 135 may expose a set of Base Address Registers (BARs) to the host BIOS as a memory range. One or more of the memory modules 135 may include firmware to transparently manage its memory space behind the host OS map.

In some embodiments, as described above, the enhanced CXL switch 130 includes an FPGA (or ASIC) controller 137 and provides additional features in addition to the switching of CXL packets. For example, it may (as described above) virtualize the memory module 135 (i.e., operate as a translation layer) to translate between processing circuit-side addresses (or "processor-side" addresses, i.e., addresses included in memory read and write commands issued by the processing circuit 115) and memory-side addresses (i.e., addresses used by the enhanced CXL switch 130 to address storage locations in the memory module 135), thereby masking the physical addresses of the memory module 135 and presenting a virtual aggregation of memory. The controller 137 of the enhanced CXL switch 130 can also serve as a management device for the memory modules 135 and facilitate host control plane processing. The controller 137 may transparently move data without the involvement of the processing circuitry 115 and update the memory map (or "address translation table") accordingly so that subsequent accesses work as intended. The controller 137 may include a switch management device that (i) may bind and unbind upstream and downstream connections as appropriate during runtime and (ii) may implement rich control semantics and statistics associated with data transfers to and from the memory module 135. Controller 137 may include additional "back-gate" 100GbE or other network interface circuitry 125 (in addition to the network interface for connecting to hosts) for connecting to other servers 105 or other networked devices. In some embodiments, the controller 137 is presented to the processing circuit 115 as a type 2 device, which enables a cache invalidation instruction to be issued to the processing circuit 115 upon receipt of a remote write request. In some embodiments, the DDIO technique is enabled and remote data is first pulled to the Last Level Cache (LLC) of the processing circuit 115 and later written (from the cache) to the memory module 135.
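As an illustrative, hypothetical sketch (the Python class and names below are assumptions and are not taken from the embodiments described above), the address virtualization performed by the controller 137 can be pictured as a page-granular translation table that is updated when data is moved, so that subsequent processor-side accesses still resolve correctly:

    # Hypothetical model of the controller's address translation role:
    # processor-side addresses are remapped to (module, offset) locations
    # at page granularity, and the map is updated when data is migrated.
    PAGE = 4096

    class AddressTranslator:
        def __init__(self):
            self.table = {}  # processor-side page number -> (module_id, memory-side page number)

        def map(self, proc_addr, module_id, mem_addr):
            self.table[proc_addr // PAGE] = (module_id, mem_addr // PAGE)

        def translate(self, proc_addr):
            module_id, mem_page = self.table[proc_addr // PAGE]
            return module_id, mem_page * PAGE + proc_addr % PAGE

        def move(self, proc_addr, new_module_id, new_mem_addr):
            # Transparent migration: only the mapping changes; the address
            # seen by the processing circuit stays the same.
            self.map(proc_addr, new_module_id, new_mem_addr)

    # Usage: map a page to module 0, then transparently move it to module 3.
    xlat = AddressTranslator()
    xlat.map(0x1000, module_id=0, mem_addr=0x40000)
    xlat.move(0x1000, new_module_id=3, new_mem_addr=0x0)
    assert xlat.translate(0x1234) == (3, 0x234)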

As described above, one or more of the memory modules 135 may include persistent storage. If the memory module 135 is presented as a persistent device, the controller 137 of the enhanced CXL switch 130 can manage the persistent domain (e.g., it can store, in the persistent storage, data that is identified by the processing circuitry 115 (e.g., by using the corresponding operating system function) as requiring persistent storage). In such embodiments, the software API may flush the cache and data to persistent storage.

In some embodiments, the direct memory transfer to the memory module 135 may be performed in a manner similar to that described above for the embodiments of fig. 1A and 1B, where operations performed by the controller of the memory module 135 are performed by the controller 137 of the enhanced capability CXL switch 130.

As described above, in some embodiments, the memory modules 135 are organized into groups (e.g., one group that is memory dense, another group that is HBM-heavy, another group that has limited density and performance, and another group that has dense capacity). Such groups may have different form factors or be based on different technologies. The controller 137 of the enhanced CXL switch 130 can intelligently route data and commands based on, for example, workload, label, or quality of service (QoS). For read requests, there may be no routing based on such factors, since a read must be served from the module where the data already resides.

The controller 137 of the enhanced CXL switch 130 can also virtualize the processing circuit-side and memory-side addresses (as described above) so that the controller 137 of the enhanced CXL switch 130 can determine where data is to be stored. The controller 137 of the enhanced CXL switch 130 can make such a determination based on information or instructions that can be received from the processing circuit 115. For example, the operating system may provide a memory allocation feature so that an application may specify that low latency storage or high bandwidth storage or persistent storage is to be allocated, and such requests, initiated by the application, may then be considered by the controller 137 of the enhanced CXL switch 130 in determining where (e.g., in which memory module 135) to allocate memory. For example, storage requested for high bandwidth by an application may be allocated in the memory module 135 containing HBM, storage requested for data persistence by an application may be allocated in the memory module 135 containing NAND flash, and other storage (which is not requested by an application) may be stored on the memory module 135 containing relatively inexpensive DRAM. In some embodiments, the controller 137 of the enhanced CXL switch 130 can determine where to store particular data based on network usage patterns. For example, the controller 137 of the enhanced CXL switch 130 can determine that data in a particular range of physical addresses is accessed more frequently than other data by monitoring usage patterns, and then the controller 137 of the enhanced CXL switch 130 can copy these data to the memory module 135 containing the HBM and modify its address translation table so that the data in the new location is stored in the same range of virtual addresses. In some embodiments, one or more of the memory modules 135 include flash memory (e.g., NAND flash memory) and the controller 137 of the enhanced CXL switch 130 implements a flash translation layer for the flash memory. The flash translation layer can support overwriting of processor-side memory locations (by moving data to a different location and marking a previous location of the data as invalid), and it can perform garbage collection (e.g., erasing a block after moving any valid data in the block to another block when the portion of the data marked as invalid in the block exceeds a threshold).
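A minimal, hypothetical sketch of the two policies just described (allocation according to the characteristic requested by the application, and promotion of frequently accessed ranges to HBM) might look like the following; the function names, request fields, and threshold are assumptions rather than details of the embodiments above:

    # Hypothetical allocation and promotion policies.
    def choose_group(request):
        # Pick a memory module group based on what the application asked for.
        if request.get("high_bandwidth"):
            return "hbm"
        if request.get("persistent"):
            return "nand"
        return "dram"          # default: relatively inexpensive DRAM

    def maybe_promote(range_id, access_counts, placement, threshold=10_000):
        # Copy a hot range to the HBM group; the address translation table
        # would be updated so the virtual address range stays the same.
        if access_counts.get(range_id, 0) > threshold and placement[range_id] != "hbm":
            placement[range_id] = "hbm"   # the data copy itself is not modeled here
        return placement[range_id]

    # Usage:
    placement = {"r1": "dram"}
    print(choose_group({"persistent": True}))                  # -> nand
    print(maybe_promote("r1", {"r1": 25_000}, placement))      # -> hbm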

In some embodiments, the controller 137 of the enhanced CXL switch 130 can facilitate Physical Function (PF) to PF transfers. For example, if one of the processing circuits 115 needs to move data from one physical address to another (which may have the same virtual address; this fact need not affect the operation of the processing circuit 115), or if the processing circuit 115 needs to move data between two virtual addresses (which the processing circuit 115 would need to have), the controller 137 of the enhanced CXL switch 130 can supervise the transfer without involving the processing circuit 115. For example, the processing circuit 115 may send a CXL request, and data may then be sent from one memory module 135 to another memory module 135 behind the enhanced-capability CXL switch 130 (e.g., data may be copied from one memory module 135 to another memory module 135) without passing through the processing circuit 115. In this case, because the processing circuit 115 initiates the CXL request, the processing circuit 115 may need to flush its cache to ensure coherency. If instead the type 2 memory device (e.g., one of the memory modules 135, or an accelerator that may also be connected to the CXL switch) initiates a CXL request and the switch is not virtualized, the type 2 memory device may send a message to the processing circuit 115 to invalidate the cache.
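The coherency bookkeeping around such a switch-supervised copy can be sketched, hypothetically, as follows (the class and function names are assumptions, and the actual data path through the switch is not modeled):

    # Hypothetical sketch: who cleans up caches after a PF-to-PF copy.
    class HostCache:
        def flush(self):
            print("host flushes its cache lines for the moved range")

    class Type2Device:
        def send_invalidate_to_host(self):
            print("device sends a cache-line invalidate message to the processing circuit")

    def pf_to_pf_copy(src, dst, initiated_by_host, host_cache, device):
        dst["data"] = src.get("data")         # copy supervised by the switch, no host data path
        if initiated_by_host:
            host_cache.flush()                # host issued the CXL request, so it flushes its cache
        else:
            device.send_invalidate_to_host()  # type 2 device issued the request instead

    pf_to_pf_copy({"data": b"payload"}, {}, True, HostCache(), Type2Device())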

In some embodiments, the controller 137 of the enhanced CXL switch 130 can facilitate RDMA requests between servers. The remote server 105 may initiate such RDMA requests, and the requests may be sent through the ToR ethernet switch 110 and to an enhanced capabilities CXL switch 130 ("local server") in the server 105 that responds to the RDMA requests. The enhanced CXL switch 130 may be configured to receive such RDMA requests, and it may treat a group of memory modules 135 in the receiving server 105 (i.e., the server receiving the RDMA request) as its own memory space. In the local server, the enhanced CXL switch 130 may receive the RDMA request as a direct RDMA request (i.e., an RDMA request that is not routed through the processing circuit 115 in the local server) and it may send a direct response to the RDMA request (i.e., it may send a response without being routed through the processing circuit 115 in the local server). In the remote server, the response (e.g., data sent by the local server) may be received by the remote server's enhanced CXL switch 130 and stored in the remote server's memory module 135 without being routed through the processing circuit 115 in the remote server.
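A hypothetical control-flow sketch of such a direct RDMA read, answered out of the aggregated memory modules without routing through the local processing circuit 115, might look like this (the request fields below are assumptions):

    # Hypothetical sketch: the local enhanced-capability CXL switch serves an
    # RDMA read from its aggregated memory modules; the response returns via
    # the network interface circuit and ToR switch, bypassing the host CPU.
    def handle_rdma_read(request, modules):
        module = modules[request["module_id"]]
        data = module[request["offset"]: request["offset"] + request["length"]]
        return {"status": "ok", "data": bytes(data)}

    modules = {0: bytearray(b"aggregated memory contents")}
    print(handle_rdma_read({"module_id": 0, "offset": 11, "length": 6}, modules))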

Fig. 1D shows a system similar to that of fig. 1C, in which the processing circuitry 115 is connected to the network interface circuitry 125 through an enhanced-capability CXL switch 130. The enhanced CXL switch 130, memory module 135 and network interface circuitry 125 are located on an expansion slot adapter 140. Expansion slot adapter 140 may be a circuit board or module that is inserted into an expansion slot (e.g., PCIe connector 145) on the motherboard of server 105. Thus, the server may be any suitable server that is modified only by installing expansion slot adapter 140 in PCIe connector 145. The memory module 135 may be mounted in a connector (e.g., an M.2 connector) on the expansion slot adapter 140. In such embodiments, (i) the network interface circuits 125 may be integrated into the enhanced capability CXL switch 130, or (ii) each network interface circuit 125 may have a PCIe interface (the network interface circuits 125 may be PCIe endpoints) such that the processing circuit 115 to which it is connected may communicate with the network interface circuit 125 over a root-port-to-endpoint PCIe connection. The controller 137 of the enhanced CXL switch 130 (which may have PCIe input ports connected to the processing circuitry 115 and to the network interface circuitry 125) may communicate with the network interface circuitry 125 over point-to-point PCIe connections.

According to an embodiment of the present invention, there is provided a system including: a first server comprising stored program processing circuitry, network interface circuitry, a cache coherent switch, and a first memory module, wherein the first memory module is connected to the cache coherent switch, the cache coherent switch is connected to the network interface circuitry, and the stored program processing circuitry is connected to the cache coherent switch. In some embodiments, the system further comprises a second memory module connected to the cache coherent switch, wherein the first memory module comprises volatile memory and the second memory module comprises persistent memory. In some embodiments, the cache coherent switch is configured to virtualize the first memory module and the second memory module. In some embodiments, the first memory module includes flash memory and the cache coherent switch is configured to provide a flash translation layer for the flash memory. In some embodiments, the cache coherent switch is configured to: monitor an access frequency of a first memory location in the first memory module; determine that the access frequency exceeds a first threshold; and copy the contents of the first memory location into a second memory location, the second memory location being in the second memory module. In some embodiments, the second memory module includes a High Bandwidth Memory (HBM). In some embodiments, the cache coherent switch is configured to maintain a table for mapping processor-side addresses to memory-side addresses. In some embodiments, the system further comprises a second server and a network switch connected to the first server and the second server. In some embodiments, the network switch comprises a top-of-rack (ToR) ethernet switch. In some embodiments, the cache coherent switch is configured to receive a Remote Direct Memory Access (RDMA) request and to send an RDMA response. In some embodiments, the cache coherent switch is configured to receive a Remote Direct Memory Access (RDMA) request through the ToR ethernet switch and through the network interface circuitry, and to send an RDMA response through the ToR ethernet switch and through the network interface circuitry. In some embodiments, the cache coherent switch is configured to support a Compute Express Link (CXL) protocol. In some embodiments, the first server includes an expansion slot adapter connected to an expansion slot of the first server, the expansion slot adapter including a cache coherent switch and a memory module slot, the first memory module connected to the cache coherent switch through the memory module slot. In some embodiments, the memory module slots comprise M.2 slots. In some embodiments, the network interface circuitry is on an expansion slot adapter. According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including a first server and a second server, the first server including stored program processing circuitry, network interface circuitry, a cache coherent switch, and a first memory module, the method including receiving, by the cache coherent switch, a Remote Direct Memory Access (RDMA) request and sending, by the cache coherent switch, an RDMA response. In some embodiments: the computing system further includes an ethernet switch, and the step of receiving an RDMA request includes receiving the RDMA request through the ethernet switch.
In some embodiments, the method further comprises: a read command for a first memory address is received by the cache coherent switch from the stored program processing circuitry, the first memory address is translated to a second memory address by the cache coherent switch, and data is retrieved from the first memory module at the second memory address by the cache coherent switch. In some embodiments, the method further comprises: receiving data by the cache coherent switch, storing the data in the first memory module by the cache coherent switch, and sending a command to invalidate a cache line to the stored program processing circuitry by the cache coherent switch. According to an embodiment of the present invention, there is provided a system including: a first server comprising stored program processing circuitry, network interface circuitry, cache coherency switching means and a first memory module, wherein the first memory module is coupled to the cache coherency switching means, the cache coherency switching means is coupled to the network interface circuitry, and the stored program processing circuitry is coupled to the cache coherency switching means.

Fig. 1E illustrates an embodiment in which each of the plurality of servers 105 is connected to a ToR server link switch 112; as shown, the ToR server link switch 112 may be a PCIe 5.0 CXL switch with PCIe capability (also referred to as a ToR PCIe5 switch). Server link switch 112 may comprise an FPGA or an ASIC and may provide better performance (in terms of throughput and latency) than an ethernet switch. Each of the servers 105 may include a plurality of memory modules 135 connected to the server link switch 112 through an enhanced CXL switch 130 and through a plurality of PCIe connectors. As shown, each of the servers 105 may also include one or more processing circuits 115 and a system memory 120. As discussed in further detail below, the server link switch 112 can operate as a master and each of the enhanced-capability CXL switches 130 can operate as a slave.

In the embodiment of FIG. 1E, the server link switch 112 may group or batch multiple cache requests received from different servers 105, and it may group data packets, thereby reducing control overhead. The enhanced capability CXL switch 130 can include a slave controller (e.g., implemented in an FPGA or an ASIC) to (i) route data to different memory types based on workload, (ii) virtualize processor-side addresses as memory-side addresses, and (iii) facilitate coherency requests between different servers 105, bypassing the processing circuitry 115. The system shown in fig. 1E may be based on CXL 2.0, it may include distributed shared memory within a rack, and it may use the ToR server link switch 112 to make local connections with remote nodes.

The ToR server link switch 112 may have additional network connections (e.g., an ethernet connection as shown or another type of connection (e.g., a wireless connection such as a WiFi connection or a 5G connection)) for connecting to other servers or clients. The server link switch 112 and the enhanced CXL switch 130 can each include a controller, which can be or include a processing circuit such as an ARM processor. The PCIe interface may conform to the PCIe 5.0 standard or to an earlier or future version of the PCIe standard, or an interface conforming to a different standard (e.g., NVDIMM-P, CCIX, or OpenCAPI) may be employed in place of the PCIe interface. The memory module 135 may include various memory types including DDR4 DRAM, HBM, LPDDR, NAND flash, or Solid State Drive (SSD). The memory modules 135 may be partitioned or include a cache controller to handle multiple memory types, and they may have different form factors (such as HHHL, FHHL, M.2, U.2, mezzanine card, daughter card, E1.S, E1.L, E3.L, or E3.S).

In the embodiment of fig. 1E, the enhanced CXL switch 130 can implement one-to-many and many-to-one switching, and it can implement a fine-grained load-store interface at the flit (64 byte) level. Each server may have an aggregated memory device, each device being divided into a plurality of logical devices each having a respective LD-ID. The ToR switch 112 (which may be referred to as a "server link switch") implements the one-to-many function, and the enhanced-capability CXL switch 130 in the server 105 implements the many-to-one function. The server link switch 112 may be a PCIe switch or a CXL switch or both. In such a system, the requesting party may be the processing circuits 115 of multiple servers 105 and the responding party may be a number of aggregated memory modules 135. The hierarchy of the two switches (as described above, the master switch is the server link switch 112 and the slave switch is the enhanced capability CXL switch 130) enables any-to-any communication. Each memory module 135 may have one Physical Function (PF) and up to 16 isolated logical devices. In some embodiments, the number of logical devices (e.g., the number of partitions) may be limited (e.g., to 16), and there may also be one control partition (which may be a physical function for controlling the devices). Each of the memory modules 135 may be a type 2 device with CXL.cache, CXL.mem, and CXL.io and Address Translation Service (ATS) implementations to process copies of cache lines that the processing circuitry 115 may hold. The enhanced CXL switch 130 and fabric manager can control the discovery of the memory module 135 and (i) perform device discovery and virtual CXL software creation, and (ii) bind virtual ports to physical ports. As with the embodiments of fig. 1A-1D, the fabric manager may operate through an SMBus sideband connection. Configurability may be achieved by an interface to the memory module 135, which may be an Intelligent Platform Management Interface (IPMI) or an interface that conforms to the Redfish standard (and may also provide additional features not required by the standard).
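A hypothetical sketch of the multiple-logical-device partitioning described above (one physical function per module, up to 16 isolated logical devices identified by LD-ID) is shown below; the class name and capacity figures are assumptions:

    # Hypothetical model of a memory module exposing isolated logical devices.
    MAX_LOGICAL_DEVICES = 16

    class MemoryModule:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.partitions = {}            # ld_id -> (start_offset, size)
            self.next_offset = 0

        def create_logical_device(self, ld_id, size):
            if len(self.partitions) >= MAX_LOGICAL_DEVICES:
                raise RuntimeError("logical-device limit reached")
            if self.next_offset + size > self.capacity:
                raise RuntimeError("module capacity exceeded")
            self.partitions[ld_id] = (self.next_offset, size)
            self.next_offset += size

    # Usage: expose two isolated partitions of one module, e.g. to two hosts.
    mod = MemoryModule(capacity_bytes=1 << 36)      # 64 GiB (illustrative)
    mod.create_logical_device(ld_id=0, size=1 << 34)
    mod.create_logical_device(ld_id=1, size=1 << 34)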

As described above, some embodiments implement a hierarchical structure in which a master controller (which may be implemented in an FPGA or ASIC) is part of the server link switch 112 and a slave controller is part of the enhanced capability CXL switch 130 to provide a load-store interface (i.e., an interface that has cache line (e.g., 64 byte) granularity and operates within a coherency domain without software driver involvement). Such a load-store interface may extend the coherency domain beyond a single server or CPU or host, and may involve electrical or optical physical media (e.g., an optical connection with electro-optical transceivers on both ends). In operation, the master controller (in the server link switch 112) starts (or "reboots") and configures all of the servers 105 in the rack. The master controller may have visibility into all hosts, and it may (i) discover each server and discover how many servers 105 and memory modules 135 are present in the server cluster, (ii) configure each server 105 independently, (iii) enable or disable some memory blocks on different servers based on, for example, the configuration of the rack (e.g., enable or disable any memory modules in memory modules 135), (iv) control access (e.g., which server may control which other server), (v) implement flow control (e.g., since all host and device requests are through the master, data may be sent from one server to another and flow control performed on the data), (vi) group or batch requests or packets (e.g., the master receives multiple cache requests from different servers 105), and (vii) receive remote software updates, broadcast communications, etc. In batch mode, the server link switch 112 may receive multiple data packets destined for the same server (e.g., for a first server) and send them together (i.e., without a pause between them) to the first server. For example, the server link switch 112 may receive a first data packet from a second server and a second data packet from a third server and send the first data packet and the second data packet together to the first server. Each of the servers 105 may expose (i) an IPMI network interface, (ii) a System Event Log (SEL), and (iii) a Baseboard Management Controller (BMC) to the master controller, enabling the master controller to measure performance, measure reliability in operation, and reconfigure the servers 105.
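The batch mode described above can be sketched, hypothetically, as follows; packets destined for the same server are grouped and handed off together, reducing per-packet control overhead (the packet fields below are assumptions):

    # Hypothetical sketch of batching at the server link switch.
    from collections import defaultdict

    def batch_by_destination(packets):
        batches = defaultdict(list)
        for pkt in packets:
            batches[pkt["dst"]].append(pkt)
        return batches        # each group is forwarded to its server without pauses in between

    packets = [
        {"dst": "server1", "src": "server2", "payload": b"a"},
        {"dst": "server1", "src": "server3", "payload": b"b"},
        {"dst": "server4", "src": "server2", "payload": b"c"},
    ]
    for dst, group in batch_by_destination(packets).items():
        print(dst, len(group))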

In some embodiments, a software architecture that facilitates a high-availability load-store interface is used. Such software architectures may provide reliability, replication, consistency, system consistency, hashing, caching, and durability. The software architecture may provide reliability (in a system with a large number of servers) by performing periodic hardware checks of the CXL device components via IPMI. For example, the server link switch 112 may query the status of the memory server 150 (e.g., query the power status (whether the power to the memory server 150 is operating properly), the network status (whether the interface to the server link switch 112 is operating properly), and the error check status (whether an error condition exists in any subsystem of the memory server 150)) via the IPMI interface of the memory server 150. The software architecture may provide for replication because the master controller may replicate data stored in the memory module 135 and maintain data consistency across the replicas.

The software architecture may provide consistency because the master controller may be configured with different levels of consistency and the server link switch 112 may adjust the packet format according to the level of consistency to be maintained. For example, if eventual consistency is being maintained, the server link switch 112 may reorder the requests, while to maintain strict consistency, the server link switch 112 may maintain a scoreboard of all requests with precise timestamps at the switch. Because multiple processing circuits 115 may read from or write to the same memory address, the software architecture may provide system coherency, and to maintain coherency, the master controller may be responsible for reaching the home node of the address (using a directory lookup) or broadcasting the request on the common bus.

The software architecture can provide hashing because the server link switch 112 and the enhanced CXL switch can maintain a virtual mapping of addresses that can use consistent hashing with multiple hash functions to map data evenly to all CXL devices across all nodes at startup (or adjust the mapping when a server goes down or comes online). Because the master controller may designate a particular memory partition (e.g., in a memory module 135 that includes HBM or a similarly capable technology) to be used as a cache (e.g., employing write-through caching or write-back caching), the software architecture may provide caching. Because the master and slave controllers can manage persistence domains and flushes, the software architecture can provide persistence.
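A hypothetical sketch of such consistent-hashing placement is shown below: addresses are mapped onto a ring of CXL devices so that data spreads evenly and only a small fraction of mappings changes when a device is added or removed (the device names and virtual-node count are assumptions):

    # Hypothetical consistent-hash ring over CXL devices.
    import hashlib
    from bisect import bisect_right

    def _hash(value, salt):
        return int(hashlib.sha256(f"{salt}:{value}".encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, devices, vnodes=8):
            # Place each device at several points on the ring (virtual nodes).
            self.ring = sorted((_hash(dev, i), dev) for dev in devices for i in range(vnodes))

        def device_for(self, address):
            keys = [k for k, _ in self.ring]
            idx = bisect_right(keys, _hash(address, 0)) % len(self.ring)
            return self.ring[idx][1]

    ring = ConsistentHashRing(["cxl-dev-0", "cxl-dev-1", "cxl-dev-2"])
    print(ring.device_for(0x7f000000))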

In some embodiments, the CXL switch's capabilities are integrated into the controller of the memory module 135. In such embodiments, the server link switch 112 may still function as a master and have enhanced features as discussed elsewhere herein. The server link switch 112 may also manage other storage devices in the system and it may have an ethernet connection (e.g., a 100GbE connection) for connecting to, for example, client machines that are not part of the PCIe network formed by the server link switch 112.

In some embodiments, the server link switch 112 has enhanced capabilities and also includes an integrated CXL controller. In other embodiments, the server link switches 112 are merely physical routing devices, and each server 105 includes a master CXL controller. In such embodiments, master devices across different servers may negotiate a master-slave architecture. The intelligent functionality of (i) the enhanced-capability CXL switch 130 and (ii) the server link switch 112 can be implemented in one or more FPGAs, one or more ASICs, one or more ARM processors, or in one or more SSD devices with computing capabilities. The server link switch 112 may perform flow control, for example, by reordering individual requests. In some embodiments, RDMA is optional because the interface is load-store, but there may be intermediate RDMA requests using the PCIe physical medium (instead of 100 GbE). In such embodiments, the remote host may initiate an RDMA request that may be sent to the enhanced CXL switch 130 through the server link switch 112. The server link switch 112 and the enhanced capabilities CXL switch 130 can prioritize RDMA 4KB requests or flit (64 byte) requests for CXL.

As with the embodiments of fig. 1C and 1D, the enhanced CXL switch 130 may be configured to receive such RDMA requests, and it may treat a group of memory modules 135 in the receiving server 105 (i.e., the server receiving the RDMA request) as its own memory space. Further, the enhanced CXL switch 130 can virtualize across the processing circuits 115 and initiate RDMA requests on the remote enhanced CXL switch 130 to move data back and forth between the servers 105 without involving the processing circuits 115.

Fig. 1F shows a system similar to fig. 1E, in which the processing circuit 115 is connected to the network interface circuit 125 (e.g., PCIe5 connector) through an enhanced capability CXL switch 130. As with the embodiment of fig. 1D, in fig. 1F, the enhanced CXL switch 130, the memory module 135 and the network interface circuitry 125 are located on an expansion slot adapter 140. Expansion slot adapter 140 may be a circuit board or module that is inserted into an expansion slot (e.g., PCIe connector 145) on the motherboard of server 105. Thus, the server may be any suitable server that is modified only by installing expansion slot adapter 140 in PCIe connector 145. The memory module 135 may be mounted in a connector (e.g., an M.2 connector) on the expansion slot adapter 140. In such embodiments, (i) the network interface circuits 125 may be integrated into the enhanced-capability CXL switch 130, or (ii) each network interface circuit 125 may have a PCIe interface (the network interface circuits 125 may be PCIe endpoints) such that the processing circuit 115 to which it is connected may communicate with the network interface circuits 125 over a root-port-to-endpoint PCIe connection, and the controller 137 of the enhanced-capability CXL switch 130 (which may have input ports connected to the processing circuit 115 and the network interface circuits 125) may communicate with the network interface circuits 125 over a point-to-point PCIe connection.

According to an embodiment of the present invention, there is provided a system including: a first server comprising stored program processing circuitry, a cache coherent switch, and a first memory module; a second server; and a server link switch connected to the first server and the second server, wherein the first memory module is connected to the cache coherent switch, the cache coherent switch is connected to the server link switch, and the stored program processing circuitry is connected to the cache coherent switch. In some embodiments, the server link switch comprises a peripheral component interconnect express (PCIe) switch. In some embodiments, the server link switch comprises a Compute Express Link (CXL) switch. In some embodiments, the server link switch comprises a top-of-rack (ToR) CXL switch. In some embodiments, the server link switch is configured to discover the first server. In some embodiments, the server link switch is configured to restart the first server. In some embodiments, the server link switch is configured to cause the cache coherent switch to disable the first memory module. In some embodiments, the server link switch is configured to send data from the second server to the first server and perform flow control on the data. In some embodiments, the system further comprises a third server connected to the server link switch, wherein the server link switch is configured to: receive a first data packet from the second server, receive a second data packet from the third server, and transmit the first data packet and the second data packet to the first server. In some embodiments, the system further comprises a second memory module connected to the cache coherent switch, wherein the first memory module comprises volatile memory and the second memory module comprises persistent memory. In some embodiments, the cache coherent switch is configured to virtualize the first memory module and the second memory module. In some embodiments, the first memory module includes flash memory and the cache coherent switch is configured to provide a flash translation layer for the flash memory. In some embodiments, the first server includes an expansion slot adapter connected to an expansion slot of the first server, the expansion slot adapter including a cache coherent switch and a memory module slot, the first memory module connected to the cache coherent switch through the memory module slot. In some embodiments, the memory module slots comprise M.2 slots. In some embodiments, the cache coherent switch is connected to the server link switch through a connector, and the connector is on an expansion slot adapter. According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including a first server, a second server, a third server, and a server link switch connected to the first server, the second server, and the third server, the first server including stored program processing circuitry, a cache coherent switch, and a first memory module, the method comprising: receiving, by the server link switch, a first data packet from the second server; receiving, by the server link switch, a second data packet from the third server; and sending the first data packet and the second data packet to the first server.
In some embodiments, the method further comprises: receiving, by the cache coherent switch, a Remote Direct Memory Access (RDMA) request and sending, by the cache coherent switch, an RDMA response. In some embodiments, the step of receiving an RDMA request comprises receiving the RDMA request through the server link switch. In some embodiments, the method further comprises: receiving, by the cache coherent switch, a read command for a first memory address from the stored program processing circuitry, converting, by the cache coherent switch, the first memory address to a second memory address, and retrieving, by the cache coherent switch, data from the first memory module at the second memory address. According to an embodiment of the present invention, there is provided a system including: a first server comprising stored program processing circuitry, a cache coherent switching device, and a first memory module; a second server; and a server link switch coupled to the first server and the second server, wherein the first memory module is coupled to the cache coherent switching device, the cache coherent switching device is coupled to the server link switch, and the stored program processing circuitry is coupled to the cache coherent switching device.

Fig. 1G illustrates an embodiment in which each of the plurality of memory servers 150 is connected to a ToR server link switch 112; as shown, the ToR server link switch 112 may be a PCIe 5.0 CXL switch (also referred to as a ToR PCIe5 CXL switch). As with the embodiments of fig. 1E and 1F, server link switch 112 may comprise an FPGA or an ASIC and may provide performance (in terms of throughput and latency) that is superior to ethernet switches. As with the embodiments of fig. 1E and 1F, the memory server 150 may include a plurality of memory modules 135 connected to the server link switch 112 by a plurality of PCIe connectors. In the embodiment of fig. 1G, processing circuitry 115 and system memory 120 may not be present, and the primary purpose of memory server 150 may be to provide memory for use by other servers 105 having computing resources.

In the embodiment of FIG. 1G, the server link switch 112 may group or batch multiple cache requests received from different memory servers 150, and it may group data packets, thereby reducing control overhead. The enhanced CXL switch 130 can include composable hardware building blocks to (i) route data to different memory types based on workload and (ii) virtualize processor-side addresses (convert such addresses to memory-side addresses). The system shown in fig. 1G may be based on CXL 2.0, it may include composable and disaggregated shared memory within a chassis, and it may use the ToR server link switch 112 to provide pooled (i.e., aggregated) memory to remote devices.

The ToR server link switch 112 may have additional network connections (e.g., an ethernet connection as shown, or another type of connection (e.g., a wireless connection such as a WiFi connection or a 5G connection)) for connecting to other servers or clients. The server link switch 112 and the enhanced CXL switch 130 can each include a controller, which can be or include a processing circuit such as an ARM processor. The PCIe interface may conform to the PCIe 5.0 standard or to an earlier or future version of the PCIe standard, or a different standard (e.g., NVDIMM-P, CCIX, or OpenCAPI) may be employed in place of PCIe. The memory module 135 may include various memory types including DDR4 DRAM, HBM, LPDDR, NAND flash, and Solid State Drives (SSDs). The memory modules 135 may be partitioned or include a cache controller to handle multiple memory types, and they may have different form factors (such as HHHL, FHHL, M.2, U.2, mezzanine card, daughter card, E1.S, E1.L, E3.L, or E3.S).

In the embodiment of fig. 1G, the enhanced CXL switch 130 can implement one-to-many and many-to-one switching, and it can implement a fine-grained load-store interface at the flit (64 byte) level. Each memory server 150 may have an aggregated memory device, each device being divided into a plurality of logical devices each having a corresponding LD-ID. The enhanced capabilities CXL switch 130 can include a controller 137 (e.g., an ASIC or FPGA) and circuitry for device discovery, enumeration, partitioning, and presentation of physical address ranges (which can be separate from or part of such an ASIC or FPGA). Each memory module 135 may have one Physical Function (PF) and up to 16 isolated logical devices. In some embodiments, the number of logical devices (e.g., the number of partitions) may be limited (e.g., to 16), and there may also be one control partition (which may be a physical function for controlling the devices). Each of the memory modules 135 may be a type 2 device with CXL.cache, CXL.mem, and CXL.io and Address Translation Service (ATS) implementations to process copies of cache lines that the processing circuitry 115 may hold.

The enhanced CXL switch 130 and fabric manager can control the discovery of the memory module 135 and (i) perform device discovery and virtual CXL software creation and (ii) bind virtual ports to physical ports. As with the embodiments of fig. 1A-1D, the fabric manager may operate over a connection on an SMBus sideband. Configurability may be achieved by an interface to the memory module 135, which may be an Intelligent Platform Management Interface (IPMI) or an interface that conforms to the Redfish standard (and may also provide additional features not required by the standard).

For the embodiment of fig. 1G, the building blocks may include (as described above) the CXL controller 137 implemented on an FPGA or ASIC, switching to implement the aggregation of memory devices (e.g., of the memory modules 135), SSDs, accelerators (GPUs, NICs), CXL and PCIe5 connectors, and firmware exposing device details to the operating system's Advanced Configuration and Power Interface (ACPI) tables, such as the Heterogeneous Memory Attribute Table (HMAT) or the Static Resource Affinity Table (SRAT).

In some embodiments, the system provides composability. The system may provide the ability to bring CXL devices and other accelerators online or offline based on software configuration, and it may be able to group accelerator, memory, and storage resources and allocate them to each memory server 150 in the rack. The system can hide the physical address space and provide a transparent cache using faster devices such as HBM and SRAM.

In the embodiment of fig. 1G, the controller 137 of the enhanced CXL switch 130 can (i) manage the memory modules 135, (ii) integrate and control heterogeneous devices (such as NICs, SSDs, GPUs, DRAMs), and (iii) enable dynamic reconfiguration of storage of the memory devices through power gating. For example, the ToR server link switch 112 may disable power to one of the memory modules 135 (i.e., turn off or reduce power) by instructing the enhanced CXL switch 130 to disable power to the memory module 135. The enhanced CXL switch 130 can then disable power to the memory module 135 when instructed by the server link switch 112 to disable power to the memory module. Such disabling may save power and may improve performance (e.g., throughput and latency) of other memory modules 135 in the memory server 150. Each remote server 105 may see a different logical view of the memory modules 135 and their connections based on the negotiation. The controller 137 of the enhanced CXL switch 130 can maintain state such that each remote server maintains allocated resources and connections and can perform compression or deduplication of memory to save memory capacity (using configurable chunk sizes). The disaggregated chassis of FIG. 1G may have its own BMC. It may also expose the IPMI network interface and System Event Log (SEL) to remote devices, enabling the master (e.g., a remote server using storage provided by the memory server 150) to measure performance and reliability on the fly, and reconfigure the disaggregated chassis. The disaggregated chassis of FIG. 1G may provide reliability, replication, consistency, system consistency, hashing, caching, and persistence in a manner similar to that described herein for the embodiment of FIG. 1E, where, for example, multiple remote servers reading from or writing to the same memory address are provided with consistency, and each remote server is configured with a different level of consistency. In some embodiments, the server link switch maintains eventual consistency between data stored on the first memory server and data stored on the second memory server. The server link switch 112 may maintain different levels of consistency for different pairs of servers; for example, it may maintain a level of consistency that is strictly consistent, sequentially consistent, causally consistent, or processor consistent between data stored on a first memory server and data stored on a third memory server. The system may employ communication in both the "local zone" (the server link switch 112) and "global zone" (the disaggregated server) domains. Writes may be flushed to the "global zone" to be visible to new reads from other servers. The controller 137 of the enhanced CXL switch 130 can manage persistent domains and flushes separately for each remote server. For example, a cache coherence switch may monitor the fullness of a first region of memory (volatile memory, operating as a cache), and when the fullness level exceeds a threshold, the cache coherence switch may move data from the first region of memory to a second region of memory, the second region of memory being in persistent memory. Flow control may be handled by establishing priorities among remote servers at the controller 137 of the enhanced CXL switch 130, presenting different perceived latencies and bandwidths.
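A hypothetical sketch of the fullness-based flush just described (the threshold, capacity, and data structures are assumptions) might look like the following:

    # Hypothetical sketch: when a volatile region acting as a cache exceeds a
    # fullness threshold, its contents are moved to a persistent region.
    def maybe_flush(volatile_region, persistent_region, threshold=0.8, capacity=1024):
        fullness = sum(len(v) for v in volatile_region.values()) / capacity
        if fullness > threshold:
            persistent_region.update(volatile_region)   # move data to the persistent region
            volatile_region.clear()
        return fullness

    volatile = {"blk0": b"x" * 600, "blk1": b"y" * 300}
    persistent = {}
    print(maybe_flush(volatile, persistent))   # ~0.88: threshold exceeded, data migrated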

According to an embodiment of the present invention, there is provided a system including: a first memory server comprising a cache coherent switch and a first memory module; a second memory server; and a server link switch connected to the first memory server and the second memory server, wherein the first memory module is connected to the cache coherent switch and the cache coherent switch is connected to the server link switch. In some embodiments, the server link switch is configured to disable power to the first memory module. In some embodiments, the server link switch is configured to disable power to the first memory module by instructing the cache coherent switch to disable power to the first memory module, and the cache coherent switch is configured to disable power to the first memory module when the disabling of power to the first memory module is instructed by the server link switch. In some embodiments, the cache coherent switch is configured to perform deduplication within the first memory module. In some embodiments, the cache coherent switch is configured to compress data and store the compressed data in the first memory module. In some embodiments, the server link switch is configured to query a state of the first memory server. In some embodiments, the server link switch is configured to query a state of the first memory server via an Intelligent Platform Management Interface (IPMI). In some embodiments, querying the state comprises querying a status selected from the group consisting of a power status, a network status, and an error check status. In some embodiments, the server link switch is configured to batch cache requests directed to the first memory server. In some embodiments, the system further comprises a third memory server connected to the server link switch, wherein the server link switch is configured to maintain a consistency level selected from the group consisting of strict consistency, sequential consistency, causal consistency, and processor consistency between data stored on the first memory server and data stored on the third memory server. In some embodiments, the cache coherent switch is configured to: monitor a fullness of a first region of memory and move data from the first region of memory to a second region of memory, wherein the first region of memory is in volatile memory and the second region of memory is in persistent memory. In some embodiments, the server link switch comprises a peripheral component interconnect express (PCIe) switch. In some embodiments, the server link switch comprises a Compute Express Link (CXL) switch. In some embodiments, the server link switch comprises a top-of-rack (ToR) CXL switch. In some embodiments, the server link switch is configured to send data from the second memory server to the first memory server and perform flow control on the data. In some embodiments, the system further comprises a third memory server connected to the server link switch, wherein the server link switch is configured to: receive a first data packet from the second memory server, receive a second data packet from the third memory server, and transmit the first data packet and the second data packet to the first memory server.
According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including a first memory server, a first server, and a second server, the first memory server including a cache coherent switch and a first memory module, the second server including stored program processing circuitry, and a server link switch connected to the first memory server, the first server, and the second server, the method comprising: receiving, by the server link switch, a first data packet from the first server; receiving, by the server link switch, a second data packet from the second server; and sending the first data packet and the second data packet to the first memory server. In some embodiments, the method further comprises: compressing data by the cache coherent switch and storing the compressed data in the first memory module. In some embodiments, the method further comprises: querying, by the server link switch, a state of the first memory server. According to an embodiment of the present invention, there is provided a system including: a first memory server comprising a cache coherent switch and a first memory module; a second memory server; and a server link switch coupled to the first memory server and the second memory server, wherein the first memory module is coupled to the cache coherent switch and the cache coherent switch is coupled to the server link switch.

FIG. 2 depicts a diagram 200 of a representative system architecture in accordance with an example embodiment of the disclosure, in which aspects of the disclosed embodiments are operable in connection with a managing computing entity capable of communicating with and configuring the various servers described in connection with FIGS. 1A-1G. In some embodiments, the disclosed system may include a management computing entity 202 that may be configured to operate in conjunction with multiple clusters. As shown, the clusters may include a type A pool cluster 204, a type B pool cluster 206, a type C pool cluster 208, and a type D pool cluster 210. In one embodiment, the type A pool cluster 204 can include direct-attached memory (e.g., CXL memory), the type B pool cluster 206 can include accelerators (e.g., CXL accelerators), the type C pool cluster 208 can include pooled/distributed memory (e.g., CXL memory), and the type D pool cluster 210 can include disaggregated memory (e.g., CXL memory). Further, each cluster may include, but is not limited to, a plug-in module 212 that may include a computing element 214, such as a processor (e.g., a RISC-V based processor) and/or a programmable controller (e.g., an FPGA-based controller), and a corresponding medium 216.

In various embodiments, the management computing entity 202 may be configured to direct I/O and memory storage and retrieval operations to various clusters based on one or more predetermined parameters (e.g., parameters associated with respective workloads processed by hosts or devices on a network in communication with the management computing entity 202).

In various embodiments, the managing computing entity 202 may operate at the rack and/or cluster level, or may operate at least partially within a given device (e.g., a cache coherency enabled device) that is part of a given cluster architecture (e.g., type a pool cluster 204, type B pool cluster 206, type C pool cluster 208, and type D pool cluster 210). In various embodiments, a device within a given cluster architecture may perform a first portion of the operations of a managing computing entity, while another portion of the operations of the managing computing entity may be implemented on a rack and/or at the cluster level. In some embodiments, both portions of an operation may be performed in a coordinated manner (e.g., devices in a cluster sending and receiving coordination messages to and from a managing computing entity implemented on a rack and/or at a cluster level). In some embodiments, the first portion of operations associated with devices in the cluster may include, but is not limited to, operations for determining current or future resource requirements of the device or cluster, advertising current or future resource availability of the device or cluster, synchronizing particular parameters associated with algorithms running at the device or cluster level, training one or more machine learning modules associated with the operation of the device or rack/cluster, logging corresponding data associated with routing workloads, combinations thereof, and the like.

FIG. 3A depicts another diagram 300 of a representative system architecture in accordance with an example embodiment of the disclosure, in which aspects of the disclosed embodiments are operable in connection with a managing computing entity capable of communicating with and configuring the various servers described in connection with FIGS. 1A-1G. In some embodiments, the management computing entity 302 may be similar to, but not necessarily identical to, the management computing entity 202 shown and described above in connection with fig. 2. Further, the management computing entity 302 may communicate with the type A pool cluster 312. In various embodiments, the type A pool cluster 312 may include several servers. Further, type A pool cluster 312 may feature direct-connect cache coherency (e.g., CXL) devices that may be configured to operate using RCiEP, for example. In another embodiment, type A pool cluster 312 may feature a cache coherency protocol based memory (such as CXL memory) to reduce any limitations of CPU pins. In one embodiment, the type A pool cluster 312 may include direct-connected devices with various form factor options (e.g., E1, E3 form factors that are capable of conforming to the enterprise and data center SSD form factor (EDSFF) standards and/or add-on card (AIC) form factors). In another embodiment, the disclosed system can include a switch 304, such as a cache coherence (e.g., CXL) based switch and/or a photonic-on-silicon based switch. In one embodiment, switch 304 may feature a top-of-rack (ToR) ethernet based switch that can be used to scale the system to the rack level.

In various embodiments, as shown in FIG. 3B, the type B pool cluster 314 may also include several servers. Furthermore, type B pool cluster 314 may use cache coherence based (e.g., CXL 2.0 based) switches and accelerators (which may be referred to simply as ACCs) that can be pooled within one of the servers. Furthermore, type B pool cluster 314 can feature virtual cache coherency protocol (e.g., CXL protocol) based switch (VCS) hierarchy capabilities according to workload. In particular, a VCS may be identified as a portion of a switch and a connection component behind a particular root port (e.g., PCIe root port). In another embodiment, the disclosed system can include a switch 306, such as a cache coherence (e.g., CXL) based switch and/or a photonic-on-silicon based switch.

In various embodiments, as shown in FIG. 3C, the type C pool cluster 316 may also include several servers. Further, type C pool cluster 316 may use a CXL 2.0 switch within one of the servers. Further, type C pool cluster 316 can scale cache coherent memory across servers using PCIe-based architectures and/or Gen-Z based systems. Furthermore, type C pool cluster 316 may introduce at least three coherent memory pools in the cluster: local DRAM, local CXL memory, and remote memory. In another embodiment, the disclosed system can include a switch 308, such as a cache coherency (e.g., CXL) based switch and/or a photonic-on-silicon based switch.

In various embodiments, as shown in FIG. 3D, the type D pool cluster 318 may also include several servers. In one embodiment, the type D pool cluster 318 may include physically disaggregated CXL memory. Further, each server may be assigned a partition such that there may be limited or no sharing across servers. In some embodiments, the type D pool cluster 318 may be initially limited to a predetermined number (e.g., 16) of Multiple Logical Device (MLD) partitions and hosts. In particular, a type 3 cache coherency protocol (e.g., CXL) based memory device can be partitioned to look like multiple devices, where each device presents a unique logical device ID. Further, the type D pool cluster 318 may scale cache coherent memory across servers using a PCIe-based architecture and/or a Gen-Z based system. In another embodiment, the disclosed system can include a switch 310, such as a cache coherence (e.g., CXL) based switch and/or a photonic-on-silicon based switch.

FIG. 4 depicts a diagram of a representative parameter table that may characterize aspects of the servers described in conjunction with FIGS. 1A-1G, wherein a managing computing entity configures the various servers based on the parameter table, according to a disclosed example embodiment. In particular, table 400 illustrates various example parameters that may be considered by the disclosed system, and in particular by a managing computing entity described in various ways herein, to route portions of the workload to different clusters based on a comparison of the values of these parameters (or similar parameters) across the different pooled cluster types described above. In particular, table 400 shows parameters 402 corresponding to different cluster types shown in the columns, namely, a direct 406 memory cluster (similar to the type A pool cluster), a pooled 408 memory cluster (similar to the type B pool cluster), a distributed 410 memory cluster (similar to the type C pool cluster), and a disaggregated 412 memory cluster (similar to the type D pool cluster). Non-limiting examples of such parameters 402 include direct memory capacity, far memory capacity (e.g., for cache coherency protocols such as CXL), remote memory capacity (e.g., per server), remote memory performance, Total Cost of Ownership (TCO), total power (amortization), and total area (e.g., with an E1 form factor). In various embodiments, as described further below, the disclosed system may use a machine learning algorithm associated with the managing computing entity to determine whether to route at least a portion of the workload to a different cluster. Although fig. 4 illustrates some example parameters, the disclosed system may be configured to monitor any suitable parameters to route workloads or portions of workloads to different devices associated with a cluster. In addition, the managing computing entity may perform such operations based on various parameters of the system; these parameters include, but are not limited to, a cache coherency protocol (e.g., CXL based) based round trip time, a determination of whether a device is in host bias or device bias, a cache coherency protocol (e.g., CXL based) based switch hierarchy and/or host upstream port to device downstream port binding, cache coherency protocol (e.g., CXL based) based switch fabric manager configuration, cache coherency protocol (e.g., CXL based) based protocol or physical media packets (e.g., CXL.io or PCIe intervening bulk 4KB packets), network latency, cache coherency protocol (e.g., CXL based) based memory technology (e.g., memory type), combinations thereof, and the like.
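Purely as a hypothetical sketch of such parameter-based routing (the numeric values below are placeholders, not figures taken from table 400), each candidate cluster type can be scored from a few parameters weighted by the needs of the workload, and the workload, or a portion of it, routed to the best-scoring cluster:

    # Hypothetical scoring of cluster types from placeholder parameter values.
    CLUSTERS = {
        "direct":        {"direct_capacity": 1.0, "remote_capacity": 0.0, "tco": 0.6, "power": 0.5},
        "pooled":        {"direct_capacity": 0.6, "remote_capacity": 0.4, "tco": 0.5, "power": 0.6},
        "distributed":   {"direct_capacity": 0.4, "remote_capacity": 0.8, "tco": 0.4, "power": 0.7},
        "disaggregated": {"direct_capacity": 0.2, "remote_capacity": 1.0, "tco": 0.3, "power": 0.8},
    }

    def score(cluster, weights):
        # Higher capacity helps the score; higher TCO and power reduce it.
        return (weights["capacity"] * (cluster["direct_capacity"] + cluster["remote_capacity"])
                - weights["cost"] * cluster["tco"]
                - weights["power"] * cluster["power"])

    def route(workload_weights):
        return max(CLUSTERS, key=lambda name: score(CLUSTERS[name], workload_weights))

    print(route({"capacity": 1.0, "cost": 0.2, "power": 0.1}))   # a capacity-hungry workload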

FIG. 5 depicts a diagram of a representative network architecture in which aspects of the disclosed embodiments are operable in conjunction with a first topology, according to an example embodiment of the disclosure. Specifically, diagram 500 shows a network 502, a first data transmission 503, a host 504, a second data transmission 505, a device 506, a management computing entity 508, a core data center 510, devices 513, 514, and 516, an edge data center 512, devices 514, 516, and 518, an edge data center 520, devices 522, 524, and 526, a mobile edge data center 530, and devices 532, 534, and 536, described further below. In various embodiments, the clusters (e.g., A, B, C and the type D pool clusters shown and described above) may be part of one or more of the core data center 510, the edge data center 512, the edge data center 520, and/or the mobile edge data center 530. Further, devices (e.g., devices 506, 513, 514, 516, devices 522, 524, and 526, and devices 532, 534, and 536) may include devices (e.g., memory, accelerators, or the like) within or associated with a given cluster (e.g., any of A, B, C and the type D pool cluster shown and described above).

As used herein, edge computing may refer to a distributed computing paradigm that brings computation and data storage physically closer to where such resources are needed, for example, to improve response time and save bandwidth. Edge computing may be used, for example, to move certain aspects of cloud computing, network control, and storage to network edge platforms (e.g., edge data centers and/or devices) that may be physically closer to resource-constrained end devices, in order to support compute-dense and latency-critical applications. Thus, edge computing can result in reduced latency and increased bandwidth on network architectures that include edge and core data centers. In some aspects, to provide low-latency services, the edge computing paradigm may optimize the design of the edge computing platform, aspects of which are described herein.

In some embodiments, diagram 500 shows that host 504 may initiate a workload request to network 502 via first data transfer 503. The managing computing entity 508 may monitor parameters associated with the network architecture, for example, any suitable parameters (such as those shown and described above in connection with FIG. 4) in addition to data transfer rates, network portion utilization, combinations thereof, and the like, including, but not limited to, network parameters associated with the core data center 510 and various edge data centers (such as edge data center 520 and edge data center 512, and/or any clusters thereof). Based on the results of the monitoring, the managing computing entity 508 can determine to route at least a portion of the workload to one or more clusters of the core data center 510. In some examples, the managing computing entity 508 can also route different portions of the workload to one or more clusters of edge data center 512 or edge data center 520. To determine where to route workloads, the managing computing entity 508 can run a model of the network architecture and/or portions of the network (e.g., clusters associated with edge data centers, core data centers, various devices, etc.) to determine parameters (such as latency and/or energy usage) associated with different portions of the network architecture. As described above, the managing computing entity 508 can use the parameters as inputs to a machine learning component (further shown and described below in conjunction with FIGS. 8 and 9) to determine an optimal route between one or more clusters of core and edge data centers for the computation of the workload.
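
For illustration, a hedged sketch of how a managing computing entity might weigh modeled latency and energy estimates when choosing between a core cluster and an edge cluster is shown below; the estimate() model, the weights, and the cluster entries are assumptions and are not the disclosed model:

def estimate(cluster, workload_size_gb):
    # Toy model: latency grows with data moved; energy grows with workload size.
    return {
        "latency_ms": cluster["base_latency_ms"] + workload_size_gb * cluster["ms_per_gb"],
        "energy_j": workload_size_gb * cluster["j_per_gb"],
    }

def choose_route(clusters, workload_size_gb, latency_weight=0.7, energy_weight=0.3):
    # Lower combined cost wins; the weights reflect how strongly latency is preferred.
    def cost(c):
        est = estimate(c, workload_size_gb)
        return latency_weight * est["latency_ms"] + energy_weight * est["energy_j"]
    return min(clusters, key=cost)

clusters = [
    {"name": "core_510", "base_latency_ms": 40.0, "ms_per_gb": 0.5, "j_per_gb": 2.0},
    {"name": "edge_520", "base_latency_ms": 5.0,  "ms_per_gb": 1.5, "j_per_gb": 3.5},
]
target = choose_route(clusters, workload_size_gb=8)  # picks "edge_520" with these numbers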

Turning now to the various components illustrated in diagram 500, a more detailed description of the various components will be provided below. In some embodiments, network 502 may include, but is not limited to, the internet or a public network such as a wide area network (WAN). In some examples, host 504 may comprise a network host (e.g., a computer or other device connected to a computer network). The host may operate as a server that provides information resources, services, and applications to users or other hosts on network 502. In some examples, a host may be assigned at least one network address. In other examples, computers participating in a network (such as the internet) may be referred to as internet hosts. Such internet hosts may include one or more IP addresses assigned to their respective network interfaces.

In some examples, the device 506 may comprise a device directly connected to the network 502, e.g., via a wired or wireless link. In some aspects, the device 506 may initiate a workload (e.g., a video stream request). The workload may then be processed by the relevant portions of the network architecture in accordance with the embodiments disclosed herein. An example of a device that may be used as device 506 is further shown and described below in connection with fig. 12.

In various embodiments, the managing computing entity 508 may perform routing of traffic and/or workloads to one or more clusters of core data centers 510 and/or one or more clusters of one or more edge data centers 520. Further, the management computing entity 508 can run models/machine learning techniques to determine parameters (e.g., delay, energy usage, etc.) associated with one or more clusters of different portions of the network, for example, based on monitored network traffic data. As described above, in some embodiments, the management computing entity 508 may run a machine learning model to determine how to route workload data. Examples of machine learning models are shown and described below in connection with fig. 8 and 9.

In some embodiments, core data center 510 may comprise a dedicated entity capable of housing computer systems and associated components (such as telecommunications and storage systems and/or components). Further, core data center 510 may include a cluster having various servers (such as the cluster shown and described above in connection with fig. 1A-2) with computing, network, and storage resources for executing workloads, storing associated data, communicating data with network 502, edge data centers (e.g., edge data center 520, mobile edge data center 530), and/or other portions of the network architecture (not shown). In some embodiments, core data center 510 may be connected to various devices (e.g., devices 513, 514, and 516). For example, the connection may be a wired connection (e.g., ethernet-based) or a wireless connection (e.g., Wi-Fi, 5G, and/or cellular-based). In another embodiment, the core data center 510 may receive workload requests from various devices (e.g., devices 513, 514, and 516) directly connected to the core data center 510 and may perform at least a portion of a given workload request (discussed further below). In some examples, core data center 510 may send the results of a given workload to various devices connected directly or indirectly to the core data center.

In some embodiments, edge data center 512 may represent a dedicated entity capable of housing computer systems and associated components (such as telecommunications and storage systems) and may have many of the same or similar capabilities as the core data center; however, the edge data center 512 may generally have a smaller physical footprint than the core data center. Further, as previously described, the edge data center 512 may be physically closer to end users and, thus, may provide reduced latency for certain workloads and applications. In some embodiments, edge data center 512 may be connected to a core data center or other edge data centers (e.g., mobile edge data center 530 or edge data center 520). Further, one or more clusters of edge data center 512 may receive workload requests from various devices (e.g., devices 522, 524, and 526) directly connected to edge data center 512 and may perform at least a portion of a given workload request (discussed further below). In another embodiment, one or more clusters of edge data center 512 may send a portion of the workload to another edge data center (e.g., edge data center 520) or to a cluster of a core data center (e.g., core data center 510), e.g., using a cache coherency protocol (e.g., the CXL protocol). Further, the edge data center 512 may send the results of a given workload to various devices connected directly or indirectly to the edge data center.

FIG. 6 depicts another diagram of the representative network architecture of FIG. 5, in which aspects of the disclosed embodiments are operable in conjunction with a second topology, according to a disclosed example embodiment. In particular, diagram 600 depicts many of the same elements as described above for FIG. 5. However, diagram 600 illustrates a management computing entity 608 that may be connected to one or more clusters of the core data center 510 in the second topology, rather than to the network 502 as in FIG. 5. This is intended to illustrate the possibility that the managing computing entity may reside at different locations on the network architecture (e.g., at one or more clusters of a core data center versus at the network).

In some embodiments, diagram 600 also illustrates an example where network 502 may initiate a workload request to one or more clusters of core data centers 510 via a first data transmission 601. For example, a device (e.g., device 506) or host (e.g., host 504) connected to network 502 may generate a workload that may be processed by network 502, and a workload request may be initiated via first data transmission 601. Again, the management computing entity 608 may monitor parameters associated with the network architecture (e.g., parameters shown and described above in connection with fig. 4, in addition to data transfer rates, network portion utilization, combinations thereof, and the like) (e.g., network parameters including, but not limited to, network parameters associated with the core data center 510 and one or more clusters of various edge data centers, such as edge data center 520 and edge data center 512).

Based on the results of the monitoring, management computing entity 608 may determine to maintain at least a portion of the workload at one or more clusters of core data centers 510. In some examples, the management computing entity 608 may also route different portions of the workload to one or more clusters of edge data centers 512, edge data centers 520, or even mobile edge data centers 530 (e.g., edge data centers that may change location, e.g., via a wireless connection). As previously described, to determine where to route workloads, the management computing entity 608 can run machine learning techniques that incorporate aspects of the network architecture and portions of the network to determine various parameters (e.g., latency, energy usage, etc.) associated with different portions of the network architecture. The management computing entity 608 may use the parameters as inputs to a machine learning component (further shown and described below in conjunction with fig. 8 and 9) to determine an optimal route between one or more clusters of core data centers and edge data centers for the computation of the workload.

FIG. 7 depicts another diagram of the representative network architecture of FIG. 5, in which aspects of the disclosed embodiments are operable in conjunction with a third topology, according to a disclosed example embodiment. In particular, diagram 700 depicts many of the same elements as described above for FIG. 5. However, diagram 700 illustrates a management computing entity 708 that may connect to one or more clusters of an example edge data center (such as mobile edge data center 530) in the third topology, rather than to the network 502 as in FIG. 5 or to one or more clusters of the core data center 510 as in FIG. 6. Again, the topology reflects the possibility that the managing computing entity may reside at different locations on the network architecture (e.g., at one or more clusters of edge data centers versus at one or more clusters of core data centers and/or at the network).

In some embodiments, diagram 700 also shows that network 502 may initiate a workload request to one or more clusters of core data centers 510 via a first data transmission 701 and/or to mobile edge data centers 530 via a second data transmission 703. For example, a device (e.g., device 506) or host (e.g., host 504) connected to one or more clusters of network 502 may generate a workload that may be handled by the one or more clusters of network 502 and initiate a workload request via data transmission 701. Again, the managing computing entity 708 may monitor parameters associated with the network architecture (e.g., parameters shown and described in connection with fig. 4, parameters and/or data transfer rates related to cache coherency protocols, network portion utilization, combinations thereof, etc.) (e.g., including, but not limited to, parameters associated with one or more clusters of core data centers 510 and one or more clusters of various edge data centers, such as mobile edge data center 530, edge data center 520, and/or edge data center 512).

Based on the results of the monitoring and/or the determination of parameters and associated thresholds, management computing entity 708 may determine to maintain at least a portion of the workload at one or more clusters of mobile edge data center 530. In some examples, management computing entity 708 may also route different portions of the workload to one or more clusters of core data center 510, edge data center 512, and/or edge data center 520. As previously described, to determine where to route workloads, the management computing entity 708 may use the parameters as inputs to a machine learning component (further shown and described below in conjunction with FIGS. 8 and 9) to determine the best route between the core and edge data centers for the computation of the workload.

Fig. 8 depicts a diagram of a supervised machine learning approach for determining the distribution of workload across one or more clusters of different portions of a network architecture, according to an example embodiment of the disclosure. In particular, diagram 800 illustrates a supervised machine learning approach for determining a distribution of a given workload to one or more clusters of core data centers and one or more edge data centers based on a parameter. More specifically, diagram 800 illustrates a training component 801 of a machine learning method, the training component 801 including a network 802, parameters 804, labels 806, feature vectors 808, a management computing entity 810, a machine learning component 812, a processor 814, and a memory 816, which will be described below. Further, the diagram 800 shows an inference component 803 of a machine learning method, also described below, the inference component 803 including parameters 820, feature vectors 822, a predictive model 824, and an expected distribution 826.

Turning now to the various components illustrated in diagram 800, a more detailed description is provided below. In particular, the network 802 may be similar to the network 502 shown and described above in connection with FIG. 5. In some examples, the network 802 is communicatively connected to a management computing entity 810. In some embodiments, parameters 804 may include the parameters shown and described above in connection with FIG. 4 and/or raw data transmitted over various portions of the network architecture between various entities (such as the entities shown and described in connection with FIG. 5). In some examples, the raw data may include, but is not limited to, workload, data transmission, delay, and/or data transmission rate on the portion of the network. As described above, the disclosed system may be configured to monitor any suitable parameter to route a workload or portion of a workload to different devices associated with a cluster. In addition, the managing computing entity may perform such operations based on various parameters of the system. These parameters include, but are not limited to, a cache coherency protocol (e.g., CXL) based round trip time, a determination of whether a device operates in host bias or device bias, a cache coherency protocol (e.g., CXL) based switch hierarchy and/or host upstream port to device downstream port binding, a cache coherency protocol (e.g., CXL) based switch fabric manager configuration, cache coherency protocol or physical media packets (e.g., CXL.io or PCIe bulk 4 KB packets), network latency, a cache coherency protocol (e.g., CXL) based memory technology (e.g., memory type), combinations thereof, and the like.

In some embodiments, the label 806 may represent the optimal distribution of a given workload across one or more clusters of core data centers and one or more edge data centers in an example network architecture having a particular configuration. In some embodiments, the results of the model may be used to determine the label 806. In various aspects, for example, the labels 806 can thus be utilized to train the machine learning component 812 to predict an expected distribution 826 of a given future workload across one or more clusters of core data centers and one or more edge data centers during the inference component 803.

In some embodiments, the feature vector 808 may represent various parameters of interest (e.g., parameters, delays, and/or data transmission rates, combinations thereof, etc., shown and described in connection with fig. 4), which in some examples may be extracted from the raw data and/or may be part of the parameters 804. In some examples, the feature vectors 808 may represent various measurable attributes or characteristics of transmissions observed by the managing computing entity over the network fabric.

In other embodiments, the management computing entity 810 is communicatively connected to the network 802 and may include a machine learning component 812, a processor 814, and a memory 816. In particular, the machine learning component 812 can generate a predictive model 824 of the expected distribution 826 for processing a given workload across one or more clusters of the core data center and one or more edge data centers using any suitable machine learning technique. Non-limiting machine learning techniques may include, but are not limited to, supervised learning techniques (shown and described in connection with fig. 8), unsupervised learning techniques (shown and described in connection with fig. 9), reinforcement learning techniques, self-learning techniques, feature learning techniques, association rule techniques, combinations thereof, and the like. Additional non-limiting machine learning techniques may include, but are not limited to, particular implementations (such as artificial neural networks, decision trees, support vector machines, regression analysis techniques, bayesian network techniques, genetic algorithm techniques, combinations thereof, and the like).

As previously described, diagram 800 includes inference component 803. In particular, the inference component 803 can be similar to the training component 801 in that parameters 820 are received, feature vectors 822 are extracted (e.g., by the managing computing entity 810), and a machine learning component 812 executing a predictive model 824 is utilized to determine an expected distribution 826 for processing a given workload across one or more clusters of core data centers and one or more edge data centers. One difference between the inference component 803 and the training component 801 is that the inference component can use the trained machine learning component to determine the distribution without receiving a label (e.g., label 806). Thus, in the inference component 803 mode of operation, the managing computing entity 810 can determine an expected distribution 826 of a given workload in real time. Subsequently, if the error rate (e.g., defined based on the total latency reduction for a given workload) is below a predetermined threshold, the machine learning component 812 can be retrained using the training component 801 (e.g., with a different label 806 associated with different or similar network parameters 804). The inference component 803 can then be run to increase the error rate above the predetermined threshold.
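
By way of a non-authoritative example, the training and inference flow of FIG. 8 can be sketched with a simple linear model standing in for the (unspecified) machine learning component 812; the feature choices, label definition, and training data below are assumptions:

import numpy as np

# Each row: [core latency (ms), edge latency (ms), edge utilization (0-1)].
features = np.array([
    [40.0,  5.0, 0.2],
    [40.0,  6.0, 0.9],
    [45.0, 20.0, 0.5],
    [35.0,  4.0, 0.1],
])
# Label 806 analogue: fraction of the workload routed to the edge cluster.
labels = np.array([0.9, 0.3, 0.5, 0.95])

# "Training component 801": fit weights with ordinary least squares.
X = np.hstack([features, np.ones((len(features), 1))])  # add a bias column
weights, *_ = np.linalg.lstsq(X, labels, rcond=None)

# "Inference component 803": predict the expected distribution 826 for new parameters.
def expected_edge_fraction(core_ms, edge_ms, edge_util):
    x = np.array([core_ms, edge_ms, edge_util, 1.0])
    return float(np.clip(x @ weights, 0.0, 1.0))

print(expected_edge_fraction(42.0, 7.0, 0.4))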

Fig. 9 depicts a diagram of an unsupervised machine learning method for determining the distribution of workload across different portions of a network architecture, according to a disclosed example embodiment. In particular, diagram 900 shows a network 902 connected to a managing computing entity 910. Further, diagram 900 includes a training component 901 of a machine learning method, the training component 901 including parameters 904, feature vectors 908, and a management computing entity 910 having a machine learning component 912, a processor 914, and a memory 916. Further, diagram 900 includes an inference component 903 of the machine learning method, the inference component 903 including parameters 920, feature vectors 922, a model 924, and an expected distribution 926 of workload across one or more clusters of core and edge data centers.

Turning now to the various components shown in diagram 900, a more detailed description is provided. In particular, network 902 may be similar to network 502 shown and described above in connection with fig. 5. In some examples, network 902 is communicatively connected to a managing computing entity 910. In some embodiments, the network parameters 904 may include raw data transmitted over various portions of a network architecture such as that shown and described in connection with fig. 5. In some examples, the raw data may include, but is not limited to, workload, data transfer, delay, and/or data transfer rate on the portion of the network, combinations thereof, and the like.

In some embodiments, in contrast to the labels 806 shown and described above in connection with FIG. 8, the training component 901 may not have labels that represent the optimal distribution of a given workload across one or more clusters of core data centers and one or more edge data centers. Instead, the managing computing entity 910 can train the machine learning component 912 without any labels (e.g., and use the inference component 903 to predict an expected distribution 926 of a given future workload across one or more clusters of the core data center and one or more edge data centers).

In some embodiments, feature vector 908 may represent various parameters of interest (e.g., delay and/or data transfer rate) that may be extracted from raw data that may be part of parameters 904. In some examples, the feature vectors 908 may represent various measurable attributes or characteristics of transmissions observed by the managing computing entity through the network fabric.

In other embodiments, the managing computing entity 910 is communicatively connected to the network 902 and may include a machine learning component 912, a processor 914, and a memory 916. In particular, the machine learning component 912 can generate a model 924 of the expected distribution 926 for processing a given workload across one or more clusters of core data centers and one or more edge data centers using any suitable machine learning technique.

As previously described, the diagram 900 includes an inference component 903. In particular, the inference component 903 can be similar to the training component 901 in that parameters 920 are received, feature vectors 922 are extracted (e.g., by the managing computing entity 910), and a machine learning component 912 executing a model 924 is used to determine an expected distribution 926 of processing a given workload across one or more clusters of a core data center and one or more edge data centers. Thus, in the inference component 903 mode of operation, the managing computing entity 910 can determine an expected distribution 926 of a given workload in real-time. Subsequently, if the error rate (e.g., defined based on the total latency reduction for a given workload) is below a predetermined threshold, machine learning component 912 can be retrained using training component 901. The inference component 903 may then be run to increase the error rate above a predetermined threshold.
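
As an illustrative sketch only of the unsupervised flow of FIG. 9, observed parameter vectors can be grouped without labels (here with k-means), and new observations can then be assigned to the lowest-latency group; the features, data, and grouping choice are assumptions:

import numpy as np
from sklearn.cluster import KMeans

# Each row: [observed latency (ms), data-transfer rate (Gb/s)] for a network portion.
observations = np.array([
    [5.0, 40.0], [6.5, 38.0], [7.0, 35.0],    # edge-like behavior
    [42.0, 90.0], [45.0, 88.0], [40.0, 95.0], # core-like behavior
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(observations)
low_latency_group = int(np.argmin(km.cluster_centers_[:, 0]))

def prefer_low_latency(sample):
    """Return True if a new observation falls into the low-latency group."""
    return int(km.predict(np.asarray(sample, dtype=float).reshape(1, -1))[0]) == low_latency_group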

In addition to and/or in conjunction with the various parameters described above, the disclosed system may additionally consider parameters for dynamically routing I/O from one cluster to another using machine learning and/or any other suitable AI-based technique. Such parameters may include, but are not limited to, the energy cost/usage of each cluster/rack/server/device, the peak load of each cluster/rack/server/device within a given time interval, the thermal efficiency of each cluster/rack/server/device (e.g., heat generated per cycle, in British Thermal Units (BTUs)), the type of processors available in a given cluster/rack/server/device (e.g., x86-based processors), the number of processors available, combinations thereof, and symmetry from a cache coherency point of view. Further, the disclosed system may consider the constituent memory resources of the clusters, e.g., the type of memory technology (e.g., DRAM, triple-level cell (TLC), quad-level cell (QLC), etc.) used by each cluster/rack/server/device.
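
A hedged sketch of a composite routing score over these additional parameters might look like the following; the weights, normalization, and the memory-technology preference table are illustrative assumptions rather than values from the disclosure:

MEMORY_TYPE_WEIGHT = {"DRAM": 1.0, "TLC": 0.7, "QLC": 0.5}

def routing_score(entry):
    # Lower energy cost, peak load, and thermal output raise the score;
    # more processors and a faster memory technology also raise it.
    return (
        -0.4 * entry["energy_cost_per_hour"]
        - 0.2 * entry["peak_load_fraction"]
        - 0.2 * entry["btu_per_cycle"]
        + 0.1 * (entry["processor_count"] / 64.0)
        + 0.1 * MEMORY_TYPE_WEIGHT[entry["memory_type"]]
    )

candidates = [
    {"name": "rack_a", "energy_cost_per_hour": 1.2, "peak_load_fraction": 0.6,
     "btu_per_cycle": 0.9, "processor_count": 64, "memory_type": "DRAM"},
    {"name": "rack_b", "energy_cost_per_hour": 0.8, "peak_load_fraction": 0.9,
     "btu_per_cycle": 1.1, "processor_count": 32, "memory_type": "QLC"},
]
target = max(candidates, key=routing_score)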

In various embodiments, the disclosed system may determine additional criteria for routing a given workload to one or more clusters. For example, the disclosed system may determine one or more of a data rate, the physical medium of the network connections, and a signal loss budget to determine a maximum distance over which a signal may be transmitted on a given network (e.g., a PCIe Gen 5 based network) with a given error rate associated with data transmission.
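
For example, a back-of-the-envelope reach estimate under an assumed loss budget and per-meter channel loss (the numbers are illustrative, not specification values) might be sketched as:

def max_reach_m(loss_budget_db: float, loss_db_per_m: float, connector_loss_db: float = 0.0) -> float:
    # Usable budget after fixed connector/package losses, divided by per-meter channel loss.
    usable = max(loss_budget_db - connector_loss_db, 0.0)
    return usable / loss_db_per_m

print(max_reach_m(loss_budget_db=36.0, loss_db_per_m=2.5, connector_loss_db=3.0))  # ~13.2 m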

As another example, the disclosed system may determine whether retimers are needed (and, if so, how many and where), and use that determination to estimate how much the added retimers increase the total delay.
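
A minimal sketch of that estimate, assuming a fixed per-retimer latency and a maximum unretimed channel length (both values illustrative), is:

import math

def delay_with_retimers(base_delay_ns: float, distance_m: float,
                        retimer_spacing_m: float, per_retimer_ns: float) -> float:
    """Estimate total link delay after inserting the retimers needed to span the distance."""
    retimers_needed = max(math.ceil(distance_m / retimer_spacing_m) - 1, 0)
    return base_delay_ns + retimers_needed * per_retimer_ns

print(delay_with_retimers(base_delay_ns=100.0, distance_m=30.0,
                          retimer_spacing_m=12.0, per_retimer_ns=60.0))  # 2 retimers -> 220 ns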

In various embodiments, for asymmetric data flows with asymmetric consistency, the disclosed system may determine which data path to use for which cluster/rack/server/device. Further, the disclosed system can determine the decomposition of a given workload into sub-functions and the expected delay associated with each sub-function, and then route the data so that the most critical segments are accelerated on the lowest-delay accelerator using CXL. For example, for an object detection workload, the disclosed system may route data based on the above-described techniques for the image segmentation phase rather than the object database retrieval phase, and vice versa.
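
By way of illustration, stage-level routing of that kind can be sketched as follows; the stage names, accelerator entries, and latencies are assumptions:

stages = [
    {"name": "image_segmentation",        "latency_critical": True},
    {"name": "object_database_retrieval", "latency_critical": False},
]
accelerators = [
    {"name": "local_cxl_accel",  "latency_us": 20},
    {"name": "pooled_cxl_accel", "latency_us": 80},
]

def assign_stage(stage):
    ranked = sorted(accelerators, key=lambda a: a["latency_us"])
    # The most latency-critical stage gets the fastest accelerator; others take a slower tier.
    return ranked[0] if stage["latency_critical"] else ranked[-1]

plan = {s["name"]: assign_stage(s)["name"] for s in stages}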

As previously described, in some aspects, the management computing entity 910 may use Artificial Intelligence (AI) (e.g., the machine learning components shown and described above in connection with FIGS. 8 and 9) to determine the routing of workloads between portions of a network architecture (e.g., by monitoring data flow (e.g., historical data) over time on different portions of the network for enhanced workload routing). Accordingly, embodiments of the apparatus, management computing entity, and/or related components described herein may employ AI to facilitate automation of one or more features described herein. The components may employ various AI-based schemes to perform the various embodiments/examples disclosed herein. To provide or facilitate many of the determinations described herein (e.g., determining, ascertaining, inferring, computing, predicting, estimating, deriving, forecasting, detecting, calculating), components described herein can examine the entirety or a subset of the data to which they are granted access and can provide reasoning about, or determine, a state of a system, environment, etc. from a set of observations as captured via events and/or data. For example, a determination can be employed to identify a particular context or action, or a probability distribution over states can be generated. The determination may be probabilistic; that is, the computation of a probability distribution over states of interest based on a consideration of data and events. Determining may also refer to techniques employed to compose higher-level events from a set of events and/or data.

Such a determination may result in the construction of a new event or action from a set of observed events and/or stored event data, whether the events are related in close temporal proximity, and whether the events and data come from one or several event and data sources. The components disclosed herein can employ various classification (explicitly trained (e.g., via training data) as well as implicitly trained (e.g., via observing behavior, preferences, historical information, receiving extrinsic information, etc.)) schemes and/or systems (e.g., support vector machines, neural networks, expert systems, bayesian belief networks, fuzzy logic, data fusion engines, etc.) in connection with performing automated and/or determined actions related to the claimed subject matter. Thus, a classification scheme and/or system may be used to automatically learn and perform a number of functions, actions, and/or determinations. In some aspects, the neural network may include, but is not limited to, at least one of a Long Short Term Memory (LSTM) neural network, a recurrent neural network, a time-delay neural network, or a feed-forward neural network.

The classifier may map an input attribute vector z = (z1, z2, z3, z4, …, zn) to a confidence that the input belongs to a class, such as by f(z) = confidence(class). Such classification may employ a probabilistic and/or statistical-based analysis to determine an action to be automatically performed. A Support Vector Machine (SVM) is one example of a classifier that may be employed. The SVM operates by finding a hypersurface in the space of possible inputs, where the hypersurface attempts to split the triggering criteria from the non-triggering events. Intuitively, this makes the classification correct for test data that is close to, but not identical to, the training data. Other directed and undirected model classification approaches (e.g., including naïve Bayes, Bayesian networks, decision trees, neural networks, fuzzy logic models, and/or probabilistic classification models providing different patterns of independence) can be employed. Classification as used herein also includes statistical regression that is utilized to develop models of priority.
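
A small concrete instance of such a classifier f(z) = confidence(class), sketched here with a linear SVM; the feature meanings (e.g., z = [observed latency, utilization]) and the training points are illustrative assumptions:

import numpy as np
from sklearn.svm import SVC

z_train = np.array([[5, 0.2], [7, 0.3], [40, 0.8], [45, 0.9]])
y_train = np.array([1, 1, 0, 0])   # 1 = "route to edge", 0 = "keep at core"

clf = SVC(kernel="linear").fit(z_train, y_train)
confidence = clf.decision_function([[10, 0.4]])  # signed distance from the separating hypersurface
predicted_class = clf.predict([[10, 0.4]])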

FIG. 10 illustrates an example schematic of a system that can be used to practice embodiments of the present disclosure. As shown in fig. 10, particular embodiments may include one or more management computing entities 1000, one or more networks 1005, and one or more user devices 1010. Each of these components, entities, devices, systems, and similar words used interchangeably herein may communicate directly or indirectly with each other through the same or different wired or wireless networks (e.g., network 502 shown and described in connection with fig. 5, including but not limited to edge data centers and/or core data centers and associated clusters). Further, while fig. 10 shows the various system entities as separate, independent entities, the various embodiments are not limited to this particular architecture. Further, the management computing entity 1000 can include a machine learning component as described herein. As noted above, the communication may be performed using any suitable protocol (e.g., 5G network protocol, cache coherency protocol) as further described herein.

FIG. 11 illustrates an example schematic diagram of a managing computing entity, according to an example disclosed embodiment. Further, the management computing entity 1000 may include a content component, a processing component, and a sending component (not shown). In particular, the content component can be utilized to determine signals indicative of data (e.g., video, audio, text, data, combinations thereof, etc.) to be transmitted over the network architecture described herein. In another embodiment, the determination of the signal for transmission may be based on, for example, user input to the device, a predetermined schedule of data transmissions on the network, a change in network conditions, or the like. In one embodiment, the signal may include data encapsulated in data frames (e.g., 5G data frames and/or cache coherence protocol data frames) configured to be transmitted from the device to one or more devices on the network.

In another embodiment, the processing element 1105 may be used to determine various parameters associated with data transmitted over a network (e.g., the network 1005 shown and described above in connection with FIG. 10) and/or parameters associated with a cluster of a portion of the network. For example, the processing element 1105 may be used to run models against network data, run machine learning techniques against network data, determine a distribution of workloads to be processed by various portions of the network architecture, combinations thereof, and so forth. As another example, the processing element 1105 may be used to run models against network data, run machine learning techniques against parameters associated with the different performance capabilities of the clusters of the network, determine a distribution of workloads to be processed by various clusters of portions of the network architecture, combinations thereof, and so forth.

In one embodiment, a sending component (not shown) may be used to send a signal from one device to another device on the network (e.g., from a first device on a first cluster to a second device on a second cluster, e.g., using a cache coherency protocol). For example, the transmit component can be used to prepare a transmitter (e.g., transmitter 1204 of fig. 12, below) to transmit a signal over a network. For example, the sending component may queue data in one or more caches, may determine that the sending device and associated transmitter are functioning properly and have sufficient power to send signals over the network, and may adjust one or more parameters associated with the data transmission (e.g., modulation type, signal amplification, signal power level, noise suppression, combinations thereof, etc.).

In general, the terms computing entity, computer, entity, device, system, and/or similar terms used interchangeably herein may refer to, for example, one or more computers, computing entities, desktop computers, mobile phones, tablet computers, tablet phones, notebook computers, laptop computers, distributed systems, gaming consoles (e.g., Xbox, PlayStation, Wii), watches, glasses, iBeacons, proximity beacons, key fobs, Radio Frequency Identification (RFID) tags, headphones, a scanner, a television, a dongle, a camera, a wristband, a wearable item/device, an automated service terminal, an input terminal, a server or server network, a blade, a gateway, a switch, a processing device, a processing entity, a set-top box, a repeater, a router, a network access point, a base station, etc., and/or any combination of devices or entities suitable for performing the functions, operations, and/or processing described herein. Such functions, operations, and/or processes may include, for example, transmitting, receiving, operating, processing, displaying, storing, determining, creating/generating, monitoring, evaluating, comparing, and/or similar terms used interchangeably herein. In one embodiment, these functions, operations, and/or processes may be performed on data, content, information, and/or similar terms used interchangeably herein.

As indicated, in one embodiment, the managing computing entity 1000 may also include one or more communication interfaces 1120 for communicating with various computing entities (such as by communicating data, content, information, and/or similar terms used interchangeably herein) that may be sent, received, manipulated, processed, displayed, stored, and the like. For example, the management computing entity 1000 may communicate with the user device 1010 and/or various other computing entities.

As shown in fig. 11, in one embodiment, the management computing entity 1000 may include or communicate with one or more processing elements 1105 (also referred to as processors, processing circuits, and/or similar terms used interchangeably herein), the one or more processing elements 1105 communicating with other elements within the management computing entity 1000, such as via a bus. As will be appreciated, the processing element 1105 may be implemented in a number of different ways. For example, the processing element 1105 may be implemented as one or more Complex Programmable Logic Devices (CPLDs), microprocessors, multi-core processors, co-processing entities, application specific instruction set processors (ASIPs), microcontrollers, and/or controllers. Further, the processing element 1105 may be implemented as one or more other processing devices or circuits. The term circuit may denote an entirely hardware embodiment or a combination of hardware and a computer program product. Thus, the processing element 1105 may be implemented as an integrated circuit, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a hardware accelerator, other circuitry, and so forth. Thus, as will be appreciated, the processing element 1105 may be configured for a particular use or configured to execute instructions stored in a volatile or non-volatile medium or otherwise accessible to the processing element 1105. Thus, whether configured by hardware or a computer program product, or by a combination thereof, the processing element 1105 may be capable of performing steps or operations in accordance with embodiments of the present disclosure when configured accordingly.

In one embodiment, the managing computing entity 1000 may also include or communicate with non-volatile media (also referred to as non-volatile storage, memory storage, memory circuitry, and/or like terms used interchangeably herein). In one embodiment, the non-volatile storage device or memory may include one or more non-volatile storage devices or storage media 1110, the non-volatile storage devices or storage media 1110 including, but not limited to, a hard disk, ROM, PROM, EPROM, EEPROM, flash, MMC, SD memory card, memory stick, CBRAM, PRAM, FeRAM, NVRAM, MRAM, RRAM, SONOS, FJG RAM, Millipede memory, racetrack memory, and the like. As will be appreciated, the non-volatile storage or memory medium may store a database, a database instance, a database management system, data, an application, a program component, a script, source code, object code, bytecode, compiled code, interpreted code, machine code, executable instructions, and the like. The terms database, database instance, database management system, and/or similar terms used interchangeably herein may represent a collection of records or data stored in a computer-readable storage medium using one or more database models (such as hierarchical database models, network models, relational models, entity-relational models, object models, document models, semantic models, graph models, and the like).

In one embodiment, the managing computing entity 1000 may also include or communicate with volatile media (also referred to as volatile storage, memory storage, memory circuitry, and/or like terms used interchangeably herein). In one embodiment, the volatile storage or memory may also include one or more volatile storage or storage media 1115, the volatile storage or storage media 1115 including, but not limited to, RAM, DRAM, SRAM, FPM DRAM, EDO DRAM, SDRAM, DDR2 SDRAM, DDR3 SDRAM, RDRAM, TTRAM, T-RAM, Z-RAM, RIMM, DIMM, SIMM, VRAM, cache memory, register memory, or the like. As will be appreciated, a volatile storage device or storage medium may be used to store, for example, at least a portion of a database, database instance, database management system, data, application, program component, script, source code, object code, bytecode, compiled code, interpreted code, machine code, executable instructions, and/or the like, that are executed by the processing element 1105. Thus, databases, database instances, database management systems, data, applications, programs, program components, scripts, source code, object code, bytecode, compiled code, interpreted code, machine code, executable instructions, and the like, may be used to control certain aspects of the operation of the managed computing entity 1000 with the aid of the processing element 1105 and operating system.

As indicated, in one embodiment, the managing computing entity 1000 may also include one or more communication interfaces 1120 for communicating with various computing entities, such as by communicating data, content, information, and/or similar terms used interchangeably herein that may be sent, received, manipulated, processed, displayed, stored, and the like. Such communication may be performed using a wired data transmission protocol, such as peripheral component interconnect express (PCIe), Fiber Distributed Data Interface (FDDI), Digital Subscriber Line (DSL), Ethernet, Asynchronous Transfer Mode (ATM), frame relay, Data Over Cable Service Interface Specification (DOCSIS), or any other wired transmission protocol. Similarly, the managing computing entity 1000 may be configured to communicate via a wireless external communication network using any of a variety of protocols, such as General Packet Radio Service (GPRS), Universal Mobile Telecommunications System (UMTS), Code Division Multiple Access 2000 (CDMA2000), CDMA2000 1X (1xRTT), Wideband Code Division Multiple Access (WCDMA), Time Division-Synchronous Code Division Multiple Access (TD-SCDMA), Long Term Evolution (LTE), Evolved Universal Terrestrial Radio Access Network (E-UTRAN), Evolution-Data Optimized (EVDO), High Speed Packet Access (HSPA), High Speed Downlink Packet Access (HSDPA), IEEE 802.11 (Wi-Fi), Wi-Fi Direct, 802.16 (WiMAX), Ultra-Wideband (UWB), Infrared (IR) protocols, Near Field Communication (NFC) protocols, ZigBee, Bluetooth protocols, 5G protocols, wireless Universal Serial Bus (USB) protocols, and/or any other wireless protocol.

Although not shown, the managing computing entity 1000 may include or communicate with one or more input elements (such as keyboard inputs, mouse inputs, touch screen/display inputs, motion inputs, movement inputs, audio inputs, pointing device inputs, joystick inputs, keypad inputs, etc.). The management computing entity 1000 may also include or be in communication with one or more output elements (not shown), such as audio output, video output, screen/display output, motion output, movement output, etc.

As will be appreciated, one or more of the components of the management computing entity 1000 may be located remotely from other management computing entity 1000 components (such as in a distributed system). Further, one or more components may be combined, and additional components performing the functions described herein may be included in the managing computing entity 1000. Thus, the management computing entity 1000 may be adapted to accommodate various needs and circumstances. As will be appreciated, these architectures and descriptions are provided for purposes of example only and are not limited to the various embodiments.

A user may be an individual, a family, a company, an organization, an entity, a department within an organization, a representative of an organization and/or an individual, and so forth. In one example, the user may be an employee, resident, customer, or the like. For example, a user may operate user device 1010, user device 1010 including one or more components similar in functionality to that of managing computing entity 1000.

In various aspects, the processing components, sending components, and/or receiving components (not shown) may be configured to operate on one or more user devices 1010, which may include aspects of managing the functionality of the computing entity 1000 as shown and described herein in connection with fig. 10 and 11. In particular, the processing component, the sending component, and/or the receiving component can be configured to communicate with the one or more processing elements 1105, the memory 1110, the volatile memory 1115, and can include a communication interface 1120 (e.g., to facilitate communication between devices).

Fig. 12 shows an example schematic diagram of a user device according to an example embodiment of the disclosure. Fig. 12 provides an illustrative schematic diagram representing a user device 1010 (shown in connection with fig. 10) that may be used in connection with embodiments of the present disclosure. In general, the terms device, system, computing entity, and/or similar terms used interchangeably herein may represent, for example, one or more computers, computing entities, desktops, mobile phones, tablets, phablets, notebook computers, laptop computers, distributed systems, game consoles (e.g., Xbox, PlayStation, Wii), watches, glasses, key fobs, Radio Frequency Identification (RFID) tags, headsets, scanners, cameras, wristbands, automated service terminals, input terminals, servers or server networks, blades, gateways, switches, processing devices, processing entities, set top boxes, repeaters, routers, network access points, base stations, and/or the like, and/or any combination of devices or entities suitable for performing the functions, operations, and/or processes, and the like, described herein. User device 1010 may be operated by various parties. As shown in fig. 12, user device 1010 may include an antenna 1212, a transmitter 1204 (e.g., a radio), a receiver 1206 (e.g., a radio), and a processing element (also referred to as a processing device) 1208 (e.g., a CPLD, an FPGA, a microprocessor, a multi-core processor, a co-processing entity, an ASIP, a microcontroller, and/or a controller) that provides signals to and receives signals from transmitter 1204 and receiver 1206, respectively.

The signals provided to and received from the transmitter 1204 and receiver 1206, respectively, may include signaling information in accordance with the air interface standard of the applicable wireless system. In this regard, the user device 1010 can be capable of operating with one or more air interface standards, communication protocols, modulation types, and access types. More specifically, the user device 1010 may operate in accordance with any one of a number of wireless communication standards and protocols, such as those described above with respect to the management computing entity 1000 of fig. 10. In a particular embodiment, the user device 1010 may operate according to a plurality of wireless communication standards and protocols (such as the published IoT DOCSIS protocol, UMTS, CDMA2000, 1xRTT, WCDMA, TD-SCDMA, LTE, E-UTRAN, EVDO, HSPA, HSDPA, 5G, Wi-Fi, Wi-Fi direct, WiMAX, UWB, IR, NFC, bluetooth, USB, and so forth). Similarly, the user device 1010 may operate in accordance with a plurality of wired communication standards and protocols (such as those described above with respect to the management computing entity 1000) via the network interface 1220.

Via these communication standards and protocols, user device 1010 may communicate with various other entities using concepts such as Unstructured Supplementary Service Data (USSD), Short Message Service (SMS), Multimedia Message Service (MMS), dual tone multi-frequency signaling (DTMF), and/or subscriber identity component dialer (SIM dialer). User device 1010 may also download changes, add-ons, and updates to, for example, its firmware, software (e.g., including executable instructions, applications, program components), and operating system.

According to one embodiment, user device 1010 may include location determination aspects, devices, components, functions, and/or similar terms used interchangeably herein. The location determination aspect may be used to inform one or more of the models used by the managing computing entity and the models and/or machine learning techniques described herein. For example, the user device 1010 may include outdoor positioning aspects, such as location components adapted to obtain, for example, latitude, longitude, altitude, geocode, course, direction, heading, speed, Universal Time (UTC), date, and/or various other information/data. In one embodiment, the location component may acquire data, sometimes referred to as ephemeris data, by identifying the number of satellites in view and the relative positions of those satellites. The satellites may be a variety of different satellites, including Low Earth Orbit (LEO) satellite systems, the United States Department of Defense (DOD) satellite system, the European Union Galileo positioning system, the Chinese Compass navigation system, the Indian regional navigation satellite system, and the like. Alternatively, the location information may be determined by triangulating the position of the user device 1010 in conjunction with various other systems, including cell towers, Wi-Fi access points, and the like. Similarly, the user device 1010 may include indoor positioning aspects (such as location components adapted to obtain, for example, latitude, longitude, altitude, geocode, route, direction, heading, speed, time, date, and/or various other information/data). Some indoor systems may use various location or positioning technologies, including RFID tags, indoor beacons or transmitters, Wi-Fi access points, cellular towers, nearby computing devices (e.g., smart phones, laptops), and the like. For example, such technologies may include iBeacon, Gimbal proximity beacons, Bluetooth Low Energy (BLE) transmitters, NFC transmitters, and the like. These indoor positioning aspects can be used in various settings to locate someone or something to within inches or centimeters.

User device 1010 may also comprise a user interface (which may comprise a display 1216 coupled to processing element 1208) and/or a user input interface (coupled to processing element 1208). For example, as described herein, the user interface may be a user application, browser, user interface, and/or similar terms used interchangeably herein executing on user device 1010 and/or accessible via user device 1010 for interacting with management computing entity 1000 and/or displaying information from management computing entity 1000. The user input interface may include any of a number of devices or interfaces that allow the user device 1010 to receive data, such as a keypad 1218 (hard or soft), a touch display, a sound/voice or motion interface, or other input device. In embodiments that include a keypad 1218, the keypad 1218 may include (or be capable of displaying) conventional numbers (0-9) and associated keys (#, *), as well as other keys for operating user device 1010, and may include a complete set of alphanumeric keys or a set of keys that may be activated to provide a complete set of alphanumeric keys. In addition to providing input, the user input interface may be used, for example, to activate or deactivate certain functions (such as a screen saver and/or sleep mode).

The user device 1010 can also include volatile storage or memory 1222 and/or non-volatile storage or memory 1224, which can be embedded and/or may be removable. For example, the non-volatile memory may be ROM, PROM, EPROM, EEPROM, flash memory, MMC, SD memory card, memory stick, CBRAM, PRAM, FeRAM, NVRAM, MRAM, RRAM, SONOS, FJG RAM, Millipede memory, racetrack memory, and the like. Volatile memory can be RAM, DRAM, SRAM, FPM DRAM, EDO DRAM, SDRAM, DDR2 SDRAM, DDR3 SDRAM, RDRAM, TTRAM, T-RAM, Z-RAM, RIMM, DIMM, SIMM, VRAM, cache, register memory, etc. The volatile and non-volatile storage devices or memories may store databases, database instances, database management systems, data, applications, programs, program components, scripts, source code, object code, bytecode, compiled code, interpreted code, machine code, executable instructions, etc., to implement the functions of the user device 1010. As indicated, this may include user applications that exist on the entity or that are accessible through a browser or other user interface for communicating with the managing computing entity 1000 and/or various other computing entities.

In another embodiment, as described in more detail above, the user device 1010 may include one or more components or functions that are the same as or similar to the components or functions of the managing computing entity 1000. As will be appreciated, these architectures and descriptions are provided for purposes of example only and are not limited to the various embodiments.

Fig. 13 is an illustration of an example method 1300 of operating the disclosed system to determine a workload distribution across one or more clusters of a network, in accordance with an example embodiment of the disclosure. At block 1302, the disclosed system may determine a first value of a parameter associated with at least one first apparatus in a first cluster. At block 1304, the disclosed system may determine a threshold based on the first value of the parameter. At block 1306, the disclosed system may receive a request to process a workload at a first device. At block 1308, the disclosed system may determine that a second value of the parameter associated with at least one second apparatus in the second cluster satisfies a threshold. At block 1310, the disclosed system may route at least a portion of the workload to the second device in response to the threshold being met.
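
A minimal sketch of method 1300 is shown below, under the stated assumptions that the monitored parameter is available memory capacity and that "satisfies the threshold" means offering at least a fixed fraction of the first cluster's capacity; both assumptions are illustrative:

def route_workload(first_value: float, second_value: float, workload):
    threshold = 0.8 * first_value          # block 1304: threshold derived from the first value
    if second_value >= threshold:          # block 1308: second cluster satisfies the threshold
        return {"target": "second_cluster", "portion": workload}   # block 1310: route to second device
    return {"target": "first_cluster", "portion": workload}        # otherwise keep at first device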

Fig. 14 is an illustration of another exemplary method 1400 of operating the disclosed system to determine workload distribution across one or more clusters of a network, in accordance with an example embodiment of the disclosure. At block 1402, the disclosed system may determine performance parameters for a cluster implementing a direct memory architecture, a pooled memory architecture, a distributed memory architecture, and a disaggregated memory architecture. At block 1404, the disclosed system may determine projected memory usage requirements and acceptable performance parameter thresholds for the workload. At block 1406, the disclosed system may calculate a score for each cluster based on the expected memory usage requirements and the corresponding performance parameters of the workload. At block 1408, the disclosed system may route the workload to the memory cluster with the highest score.
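
An illustrative sketch of method 1400 with an assumed scoring rule follows: clusters that meet the workload's projected memory need and latency threshold score higher when they offer more headroom at lower latency; the fields, values, and scoring rule are assumptions:

def score(cluster, needed_gb, max_latency_ns):
    if cluster["capacity_gb"] < needed_gb or cluster["latency_ns"] > max_latency_ns:
        return float("-inf")               # fails block 1404's acceptable thresholds
    headroom = cluster["capacity_gb"] - needed_gb
    return headroom / cluster["latency_ns"]

def route(clusters, needed_gb, max_latency_ns):
    # Block 1408: route to the cluster with the highest score.
    return max(clusters, key=lambda c: score(c, needed_gb, max_latency_ns))

clusters = [
    {"name": "direct",        "capacity_gb": 512,  "latency_ns": 90},
    {"name": "pooled",        "capacity_gb": 1280, "latency_ns": 180},
    {"name": "disaggregated", "capacity_gb": 8192, "latency_ns": 600},
]
best = route(clusters, needed_gb=700, max_latency_ns=400)   # "pooled" wins with these numbers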

Fig. 15 is an illustration of an exemplary method 1500 of operating the disclosed system to determine a distribution of workloads on a network architecture comprising clusters as described herein, in accordance with a disclosed example embodiment. At block 1502, the disclosed system may receive a workload from a host communicatively connected to a network. In some embodiments, the host may comprise a host on the internet. In some examples, the workload may originate from a device connected to the host, such as a user device (e.g., a mobile phone) requesting a particular service (e.g., a video streaming request, a search request, a combination thereof, etc.). In some aspects, receiving a workload from a host may be similar, but not necessarily identical, to the process shown and described above in connection with fig. 5.

At block 1504, the disclosed system may receive a workload from the edge data center. Similar to block 1502, the workload may originate from a device connected to the edge data center, such as a user device (e.g., a mobile phone) requesting a particular service (e.g., a video stream request, a search request, a combination thereof, etc.). In some aspects, receiving workloads from edge data centers may be similar to, but not necessarily identical to, the processing shown and described above in connection with fig. 7.

At block 1506, the disclosed system may receive a workload from the core data center. Similar to blocks 1502 and 1504, the workload may originate from a device connected to an edge data center or the core data center, such as a user device (e.g., a mobile phone) requesting a particular service (e.g., a video stream request, a search request, etc.). In some aspects, receiving workloads from the core data center may be similar to, but not necessarily identical to, the processing shown and described above in connection with fig. 6.

In some examples, the disclosed system may receive a portion of a workload from any combination of a host, an edge data center, and/or a core data center, e.g., in a disaggregated manner. For example, more than one device requesting service may be connected in a point-to-point (P2P) connection and may initiate a composite workload that may be received at different portions of the network architecture (e.g., a host, an edge data center, and/or a core data center). Further, the disclosed system may aggregate partial workload requests (which may themselves be partially or fully executed at any suitable location of the network architecture) at the managing computing entity for further processing according to the operations described below.

At block 1508, the disclosed system may receive parameters associated with clusters in the core data center and the edge data center. In particular, the disclosed system may employ a management computing entity as shown and described in various ways herein to monitor network architecture to determine parameters. In some embodiments, for example, as further shown and described above in connection with fig. 4, the disclosed system may intercept or otherwise access raw data transmitted at various portions of a network architecture and determine specific parameters from the raw data including, but not limited to, data rate, machine utilization, memory capacity, remote memory capacity, and the like.
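
For instance, the management computing entity might derive such parameters from raw monitoring samples in the manner sketched below; the sample format, the 60-second window, and the derived fields are assumptions for this example only.

    def derive_parameters(samples, window_s=60.0):
        # Each sample is a dict of raw counters observed on a device during the window.
        bytes_moved = sum(s["bytes"] for s in samples)
        busy_time = sum(s["busy_s"] for s in samples)
        return {
            "data_rate_mbps": (bytes_moved * 8) / (window_s * 1e6),
            "machine_utilization": min(busy_time / window_s, 1.0),
            "memory_capacity_gb": samples[-1]["free_mem_gb"] if samples else 0.0,
        }

    samples = [{"bytes": 5e8, "busy_s": 12.0, "free_mem_gb": 48.0},
               {"bytes": 7e8, "busy_s": 20.0, "free_mem_gb": 40.0}]
    print(derive_parameters(samples))  # e.g. 160 Mbps, ~0.53 utilization, 40 GB free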

At block 1510, the disclosed system may determine, based on the parameters, the expected delays or energy usage associated with executing the workload on the clusters of the core and edge data centers. In particular, the disclosed system may use a model as further shown and described in connection with figs. 8-9 to determine the delay associated with the workload. Non-limiting examples of delays include service time delays, which comprise processing and communication delays. In some embodiments, the disclosed system may determine a latency specific to a particular network architecture implementing a particular protocol (e.g., a 5G network protocol). Further, non-limiting examples of energy usage include the performance per watt or the performance per unit currency (e.g., per dollar) of executing a particular workload on a given cluster of the core data center or of an edge data center.
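
As one simplified example, the service time of a portion of the workload may be modeled as a processing delay plus a communication delay, and energy usage may be expressed as performance per watt, as in the sketch below; the compute rates, link rates, and power figure are assumptions.

    def service_time_s(work_flops, cluster_flops_per_s, payload_bits, link_bps, rtt_s):
        processing = work_flops / cluster_flops_per_s       # processing delay
        communication = payload_bits / link_bps + rtt_s     # communication delay
        return processing + communication

    def perf_per_watt(work_flops, elapsed_s, avg_power_w):
        return work_flops / (elapsed_s * avg_power_w)       # useful work per joule

    t = service_time_s(work_flops=1e12, cluster_flops_per_s=5e12,
                       payload_bits=8e9, link_bps=10e9, rtt_s=0.005)
    print(t, perf_per_watt(1e12, t, avg_power_w=300.0))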

At block 1512, the disclosed system may optionally execute a model to determine routes to the clusters of the edge data centers or the core data centers. In particular, the disclosed system may implement machine learning techniques to determine the best route to an edge data center or a core data center. For example, the disclosed system may implement a supervised machine learning technique as further shown and described in conjunction with fig. 8, or an unsupervised machine learning technique as further shown and described in conjunction with fig. 9, to determine an expected distribution for routing workloads to clusters associated with edge data centers or core data centers. In other examples, the disclosed system may implement predetermined rules (e.g., user-specified policies), instead of or in combination with machine learning methods, for routing workloads to clusters of edge data centers or core data centers.
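
Purely as a stand-in for the learned model (the supervised and unsupervised techniques themselves are described in connection with figs. 8 and 9), the sketch below predicts the fraction of a workload to route to an edge cluster by a nearest-neighbour lookup over previously observed routings; the feature set and history values are hypothetical.

    # Each history entry maps (core utilization, payload size in GB) to the fraction of the
    # workload that was routed to the edge cluster in a previously observed case.
    history = [
        ((0.2, 1.0), 0.9),
        ((0.9, 1.0), 0.2),
        ((0.5, 8.0), 0.4),
    ]

    def predict_edge_fraction(utilization, payload_gb):
        def distance(features):
            u, p = features
            return (u - utilization) ** 2 + ((p - payload_gb) / 10.0) ** 2
        _, fraction = min(history, key=lambda entry: distance(entry[0]))
        return fraction

    print(predict_edge_fraction(0.3, 2.0))  # closest observation suggests routing ~0.9 to the edge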

At block 1514, the disclosed system may determine a distribution of the workload to the clusters of the core data center or the edge data centers based at least in part on the results of the model. In particular, as characterized by the determined distribution, the disclosed system may determine to send a first portion of the workload to a cluster of the core data center and a second portion of the workload to a cluster of an edge data center. In some embodiments, the disclosed system may determine the distribution so as to improve certain parameters of the network architecture, such as reducing the overall delay (e.g., service latency). In other aspects, the disclosed system may further determine the distribution to reduce other factors associated with the network architecture (including, but not limited to, bandwidth usage of the network, power usage of the network or portions of the network, combinations thereof, and the like).
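
One way to characterize such a distribution is to evaluate candidate splits and keep the one with the lowest estimated overall service latency, as sketched below; all compute rates, link rates, and round-trip times are illustrative assumptions.

    def portion_delay_s(fraction, flops_per_s, link_bps, rtt_s,
                        work_flops=1e12, payload_bits=8e9):
        if fraction == 0.0:
            return 0.0
        processing = (work_flops * fraction) / flops_per_s
        communication = (payload_bits * fraction) / link_bps + rtt_s
        return processing + communication

    def best_edge_fraction():
        candidates = [i / 10 for i in range(11)]    # fraction of the workload sent to the edge
        def overall(f):
            edge = portion_delay_s(f, flops_per_s=1e12, link_bps=1e9, rtt_s=0.001)
            core = portion_delay_s(1 - f, flops_per_s=10e12, link_bps=10e9, rtt_s=0.010)
            return max(edge, core)                  # the two portions are processed in parallel
        return min(candidates, key=overall)

    print(best_edge_fraction())  # -> 0.1 with these illustrative numbers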

Fig. 16A is an illustration of an example method 1600 of the disclosed system for routing workloads through a network fabric to one or more clusters of a core data center and one or more clusters of one or more edge data centers, according to an example disclosed embodiment. At block 1602, the disclosed system may receive a workload and a distribution of the workload. In some embodiments, a management computing entity residing on the core network may receive the workload and the distribution. As described above, the workload may originate from a device connected to a host on the internet or to the core data center, such as a user device (e.g., a mobile phone) requesting a particular service (e.g., a video streaming request, a search request, a combination thereof, etc.). Further, the distribution of the workload may be determined from the results of the machine learning techniques described above in connection with figs. 8 and 9 and throughout the disclosure. In one example, the distribution may be determined based at least in part on a difference, exceeding a predetermined threshold, between a first programmatically expected delay associated with at least one device in a cluster associated with the core data center and a second programmatically expected delay associated with a device in the edge data center.

At block 1604, the disclosed system may route a portion of the workload, and data associated with that portion, to one or more clusters of one or more edge data centers based on the distribution. In particular, the disclosed system may break down discrete components of the workload into modular tasks, generate a series of data packets associated with the discrete components, and transmit the data packets over the network fabric, when appropriate, to designated portions of the network (e.g., various clusters associated with one or more edge data centers). Further, the disclosed system may encapsulate the discrete components with any suitable header for transmission over any underlying network medium. For example, the disclosed system may encapsulate some discrete components of the workload with first metadata associated with a first network protocol (e.g., a 5G protocol) and other discrete components with second metadata associated with a second network protocol (e.g., an Ethernet protocol) for transmission to a cluster associated with a first edge data center and another cluster associated with a second edge data center, respectively.
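
A minimal sketch of this packetization is shown below; the header fields, protocol labels, and destination names are hypothetical and do not reflect any particular 5G or Ethernet frame format.

    def packetize(tasks, destination, protocol):
        # Wrap each modular task with metadata naming the chosen protocol and destination cluster.
        packets = []
        for sequence, task in enumerate(tasks):
            packets.append({
                "header": {"protocol": protocol, "destination": destination, "sequence": sequence},
                "payload": task,
            })
        return packets

    workload = ["decode frames", "run inference", "render thumbnails"]
    to_edge_1 = packetize(workload[:2], destination="edge-cluster-1", protocol="5G")
    to_edge_2 = packetize(workload[2:], destination="edge-cluster-2", protocol="ethernet")
    print(len(to_edge_1), len(to_edge_2))  # 2 packets for the first cluster, 1 for the second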

At block 1606, the disclosed system may process another portion of the workload, and data associated with that portion, at one or more clusters of the core data center. In particular, the disclosed system may reserve a portion of the workload for processing at one or more clusters associated with the core data center. For example, the portion processed at the one or more clusters associated with the core data center may require a relatively high level of computing resources, which may be feasible at the one or more clusters associated with the core data center relative to the one or more clusters associated with the edge data center(s). In some embodiments, the disclosed system may process the portion of the workload according to any suitable Service Level Agreement (SLA).
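
For example, the reservation might be implemented by partitioning the modular tasks on an estimated compute requirement, as in the sketch below; the "gflops" field and the cut-off value are assumptions made for illustration.

    def partition(tasks, core_cutoff_gflops=100.0):
        # Compute-heavy tasks stay at the core cluster; the rest are candidates for edge offload.
        core_portion = [t for t in tasks if t["gflops"] >= core_cutoff_gflops]
        edge_portion = [t for t in tasks if t["gflops"] < core_cutoff_gflops]
        return core_portion, edge_portion

    tasks = [{"name": "train ranking model", "gflops": 5000.0},
             {"name": "resize thumbnail", "gflops": 2.0}]
    core_portion, edge_portion = partition(tasks)
    print([t["name"] for t in core_portion], [t["name"] for t in edge_portion])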

At block 1608, the disclosed system may aggregate the processed portions of the workload from the clusters of the core data center and the edge data centers. In some examples, the disclosed system may include tags for the different portions of the workload that reflect which portions of the network handled the respective portions of the workload (e.g., one or more clusters associated with the core data center or an edge data center). For example, the tags may be included in metadata associated with the portions of the workload (e.g., metadata associated with packets representing the portions of the workload). Accordingly, the disclosed system may use the tags to classify, filter, and/or aggregate the processed portions. In particular, the disclosed system may receive a first completed workload associated with a first portion, carrying a first tag, from a given cluster of the core data center and a second completed workload associated with a second portion, carrying a second tag, from a cluster of an edge data center, and may classify, filter, or aggregate the first completed workload or the second completed workload using the first tag or the second tag.
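
The tag-based aggregation might look like the following sketch; the tag values, portion indices, and result strings are hypothetical.

    completed = [
        {"tag": "core-cluster-0", "portion": 1, "result": "ranked results"},
        {"tag": "edge-cluster-1", "portion": 2, "result": "transcoded segments"},
    ]

    def aggregate(parts):
        # Reassemble the reply in the original portion order while keeping each part's tag,
        # so the processed portions can still be classified or filtered by their origin.
        ordered = sorted(parts, key=lambda p: p["portion"])
        return [(p["tag"], p["result"]) for p in ordered]

    print(aggregate(completed))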

At block 1610, the disclosed system may send the aggregated and processed portion of the workload to at least one device. In some embodiments, the disclosed system may send the aggregated and processed portions to devices located on a network portion similar to or different from the device that initiated the workload request.

Fig. 16B is an illustration of another exemplary method 1601 of the disclosed system for routing workloads through a network fabric to one or more clusters associated with a core data center and one or more clusters associated with one or more edge data centers, according to an exemplary disclosed embodiment. At block 1612, the disclosed system may receive the workload and the distribution of the workload. In some embodiments, a management computing entity residing on an edge network may receive the workload and the distribution. As described above, the workload may originate from a device connected to a host on the internet or to the core data center, such as a user device (e.g., a mobile phone) requesting a particular service (e.g., a video streaming request, a search request, etc.). Further, the distribution of the workload may be determined from the results of the machine learning techniques described above and throughout the disclosure.

At block 1614, the disclosed system may route a portion of the workload and data associated with the portion of the workload to one or more clusters of core data centers based on the distribution. As previously described, the disclosed system may break down discrete components of a workload into modular tasks, generate a series of data packets associated with the discrete components of the workload, and transmit the data packets to a designated portion (e.g., one or more clusters of core data centers) over a network fabric when appropriate. Further, the disclosed system may utilize any suitable header to encapsulate the discrete components for transmission over any underlying network medium. For example, the disclosed system may encapsulate the discrete components of the workload with first metadata associated with a first network protocol (e.g., a 5G-based network protocol) and may encapsulate the discrete components of the workload with second metadata associated with a second network protocol (e.g., an ethernet-based network protocol) for transmission to one or more clusters of the first core data center and one or more clusters of the second core data center, respectively.

At block 1616, the disclosed system may process another portion of the workload, and data associated with that portion, at one or more clusters of one or more edge data centers. In particular, the disclosed system may reserve a portion of the workload for processing at one or more clusters of the edge data centers. For example, the portion processed at the one or more clusters of an edge data center may require a relatively lower level of computing resources but a reduced latency, which may be achievable at the one or more clusters of the edge data center relative to the one or more clusters of the core data center. In some embodiments, the disclosed system may process the portion of the workload according to any suitable SLA.

At block 1618, the disclosed system may aggregate portions of the processed workload from the one or more clusters of core data centers and edge data centers. In some examples, as previously described, the disclosed system may include tags for different portions of the workload, the tags reflecting portions of the network (e.g., one or more clusters of core data centers or edge data centers) that handle the respective portions of the workload. For example, the tags may be included in metadata associated with portions of the workload (e.g., metadata associated with packets representing portions of the workload). Accordingly, the disclosed system may use tags to classify, filter, and/or aggregate processed portions.

At block 1620, the disclosed system may send the aggregated and processed portion of the workload to at least one device. In some embodiments, the disclosed system may send the aggregated and processed portions to devices located on a network portion similar to or different from the device that initiated the workload request.

Particular embodiments may be implemented in one or a combination of hardware, firmware, and software. Other embodiments may also be implemented as instructions stored on a computer-readable storage device, which may be read and executed by at least one processor to perform the operations described herein. A computer-readable storage device may include any non-transitory memory mechanism for storing information in a form readable by a machine (e.g., a computer). For example, a computer-readable storage device may include Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, and other storage devices and media.

The word "exemplary" is used herein to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The terms "computing device," "user device," "communication station," "handset," "mobile device," "wireless device," and "user equipment" (UE) are used herein to refer to a wireless communication device, such as a cellular telephone, smartphone, tablet, netbook, wireless terminal, laptop computer, home base station, High Data Rate (HDR) subscriber station, access point, printer, point-of-sale device, access terminal, or other Personal Communication System (PCS) device. The device may be mobile or stationary.

As used within this document, the term "communication" is intended to include transmitting, or receiving, or both. This may be particularly useful in the claims when describing the organization of data that is transmitted by one device and received by another device, but where only the functionality of one of those devices is required to infringe the claim. Similarly, a bidirectional exchange of data between two devices (both devices transmitting and receiving during the exchange) may be described as "communicating" when only the functionality of one of those devices is being claimed. The term "communicate" as used herein with respect to wireless communication signals includes transmitting wireless communication signals and/or receiving wireless communication signals. For example, a wireless communication unit capable of communicating wireless communication signals may comprise a wireless transmitter for transmitting wireless communication signals to at least one other wireless communication unit, and/or a wireless receiver for receiving wireless communication signals from at least one other wireless communication unit.

Some embodiments may be used with various devices and systems (e.g., Personal Computers (PCs), desktop computers, mobile computers, laptop computers, notebook computers, tablet computers, server computers, handheld devices, Personal Digital Assistant (PDA) devices, handheld PDA devices, in-vehicle devices, off-vehicle devices, hybrid devices, mobile or portable devices, consumer devices, non-mobile or non-portable devices, wireless communication stations, wireless communication devices, wireless Access Points (APs), wired or wireless routers, wired or wireless modems, video devices, audio devices, audio-video (A/V) devices, video cameras, wired or wireless networks, Wireless Video Area Networks (WVANs), Local Area Networks (LANs), Wireless LANs (WLANs), Personal Area Networks (PANs), Wireless PANs (WPANs), etc.).

Some embodiments may be used in conjunction with one-way and/or two-way radio communication systems, cellular radiotelephone communication systems, mobile telephones, cellular telephones, radiotelephones, Personal Communication Systems (PCS) devices, PDA devices that include wireless communication devices, mobile or portable Global Positioning System (GPS) devices, devices that include GPS receivers or transceivers or chips, devices that include RFID elements or chips, Multiple-Input Multiple-Output (MIMO) transceivers or devices, Single-Input Multiple-Output (SIMO) transceivers or devices, Multiple-Input Single-Output (MISO) transceivers or devices, devices having one or more internal and/or external antennas, Digital Video Broadcasting (DVB) devices or systems, multi-standard radio devices or systems, wired or wireless handheld devices (e.g., smartphones), Wireless Application Protocol (WAP) devices, and the like.

Some embodiments may be compatible with one or more wireless communication protocols (e.g., Radio Frequency (RF), Infrared (IR), Frequency Division Multiplexing (FDM), Orthogonal FDM (OFDM), Time Division Multiplexing (TDM), Time Division Multiple Access (TDMA), Extended TDMA (E-TDMA), General Packet Radio Service (GPRS), extended GPRS, Code Division Multiple Access (CDMA), Wideband CDMA (WCDMA), CDMA 2000, single-carrier CDMA, Multi-carrier Modulation (MDM), Discrete Multi-Tone (DMT), Bluetooth™, Global Positioning System (GPS), Wi-Fi, Wi-Max, ZigBee™, Ultra-Wideband (UWB), Global System for Mobile communications (GSM), 2G, 2.5G, 3G, 3.5G, 4G, fifth generation (5G) mobile networks, 3GPP, Long Term Evolution (LTE), LTE-Advanced, Enhanced Data rates for GSM Evolution (EDGE), etc.). Other embodiments may be used in various other devices, systems, and/or networks.

Although an example processing system has been described above, embodiments of the subject matter and the functional operations described herein may be implemented in other types of digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.

Embodiments of the subject matter and the operations described herein may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described herein may be implemented as one or more computer programs (i.e., one or more components of computer program instructions encoded on a computer storage medium for execution by, or to control the operation of, information/data processing apparatus). Alternatively or in addition, the program instructions may be encoded on an artificially generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information/data for transmission to suitable receiver apparatus for execution by the information/data processing apparatus. The computer storage medium may be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Further, although the computer storage medium is not a propagated signal, the computer storage medium can be a source or destination of computer program instructions encoded in an artificially generated propagated signal. The computer storage medium may also be or be included in one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).

The operations described herein may be implemented as operations performed by an information/data processing apparatus on information/data stored on one or more computer-readable storage devices or received from other sources.

The term "data processing apparatus" includes all types of apparatus, devices, and machines for processing data (including by way of example a programmable processor, a computer, a system on a chip, or multiple ones or combinations of the foregoing). An apparatus can comprise special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). In addition to hardware, the apparatus can include code that creates an execution environment for the computer program in question (e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them). The devices and execution environments may implement a variety of different computing model infrastructures (such as web services, distributed computing, and grid computing infrastructures).

A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or information/data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more components, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described herein can be performed by one or more programmable processors executing one or more computer programs to perform actions by operating on input information/data and generating output. Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and information/data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing actions in accordance with instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive information/data from or transfer information/data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such a device. Suitable means for storing computer program instructions and information/data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information/data to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other types of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input. Further, the computer may interact with the user by sending and receiving documents to and from the device used by the user (e.g., by sending a web page to a web browser on the user's client device in response to a request received from the web browser).

Embodiments of the subject matter described herein can be implemented in a computing system that includes a back-end component (e.g., as an information/data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an embodiment of the subject matter described herein), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital information/data communication (e.g., a communication network). Examples of communication networks include local area networks ("LANs") and wide area networks ("WANs"), an inter-network (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks).

The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, the server sends information/data (e.g., HTML pages) to the client device (e.g., for the purpose of displaying information/data to and receiving user input from a user interacting with the client device). Information/data generated at the client device (e.g., the results of the user interaction) may be received at the server from the client device.

While this specification contains many specific embodiment details, these should not be construed as limitations on the scope of any embodiment or of what may be claimed, but rather as descriptions of features specific to particular embodiments. Certain features that are described herein in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain situations, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In particular embodiments, multitasking and parallel processing may be advantageous.

Many modifications and other embodiments of the disclosure set forth herein will come to mind to one skilled in the art to which these embodiments pertain having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the embodiments are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
