Techniques for flexible I/O endpoint acceleration

Document No.: 1042651 Publication date: 2020-10-09

Reading note: This technology, Techniques for flexible I/O endpoint acceleration, was created by Matthew J. Adiletta, Brad Burres, Duane Galbi, Amit Kumar, Yadong Li, and Salma Mirza on 2020-02-26. Its main content is as follows: The present disclosure relates to techniques for flexible I/O endpoint acceleration. Techniques for flexible I/O endpoint acceleration include a computing device having a root complex, a soft endpoint coupled to the root complex, and a load transfer complex coupled to the soft endpoint. The soft endpoint establishes an emulated endpoint hierarchy based on endpoint firmware. The computing device may program the endpoint firmware. The soft endpoint receives an I/O transaction originating from the root complex and determines whether to process the I/O transaction. The soft endpoint may process the I/O transaction or forward the I/O transaction to the load transfer complex. The soft endpoint may encapsulate the I/O transaction with metadata and forward the encapsulated transaction to the load transfer complex. The soft endpoint may store a response from the load transfer complex in a history buffer and retrieve the response in response to a retried I/O transaction. The I/O transaction may be a PCI Express transaction layer packet. Other embodiments are described and claimed.

1. A soft endpoint for I/O communications, the soft endpoint comprising:

a flexible endpoint for establishing a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint;

a transaction layer to receive I/O transactions originating from a root port of a computing device in response to establishment of the simulated endpoint hierarchy;

an intelligent filter to (i) determine whether the I/O transaction is to be processed, and (ii) process the I/O transaction in response to determining that the I/O transaction is to be processed; and

an endpoint interface to forward the I/O transaction to a load transfer complex of the computing device in response to determining not to process the I/O transaction.

2. The soft endpoint of claim 1, wherein the flexible endpoint is further to program the endpoint firmware of the soft endpoint, wherein establishing the endpoint hierarchy comprises establishing the endpoint hierarchy in response to programming the endpoint firmware.

3. The soft endpoint of claim 1, wherein establishing the endpoint hierarchy comprises: emulating a PCI express switch, a PCI express bridge, or a PCI express endpoint.

4. The soft endpoint of claim 1, wherein the I/O transaction comprises a PCI express transaction layer packet.

5. The soft endpoint of claim 4, wherein determining whether to process the I/O transaction comprises: determining whether to process the I/O transaction based on the type of the transaction layer packet.

6. The soft endpoint of claim 4, wherein determining whether to process the I/O transaction comprises: determining whether to process the I/O transaction based on an address of the transaction layer packet.

7. The soft endpoint of claim 1, wherein processing the I/O transaction comprises: copying data associated with the I/O transaction to a memory of the soft endpoint.

8. The soft endpoint of claim 1, wherein processing the I/O transaction comprises: copying data associated with the I/O transaction to a memory of the load transfer complex.

9. The soft endpoint of claim 1, wherein forwarding the I/O transaction to the load transfer complex comprises: encapsulating the I/O transaction with metadata.

10. The soft endpoint of claim 1, further comprising a retry manager, wherein:

the endpoint interface is further to receive a response from the load transfer complex in response to forwarding the I/O transaction to the load transfer complex; and

the retry manager is to store the response in a history buffer of the soft endpoint.

11. The soft endpoint of claim 10, wherein:

the transaction layer is further to receive a retry I/O transaction originating from the root port, wherein the retry I/O transaction is associated with the I/O transaction; and

the retry manager is to retrieve the response from the history buffer in response to receipt of the retry I/O transaction.

12. A method for I/O communication, the method comprising:

establishing, by a soft endpoint of a computing device, a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint, wherein the soft endpoint is coupled to a root port of the computing device, and wherein the soft endpoint is further coupled to a load transfer complex of the computing device;

receiving, by the soft endpoint, an I/O transaction originating from the root port in response to establishing the simulated endpoint hierarchy;

determining, by the soft endpoint, whether to process the I/O transaction;

processing, by the soft endpoint, the I/O transaction in response to determining to process the I/O transaction; and

forwarding, by the soft endpoint, the I/O transaction to the load transfer complex in response to determining not to process the I/O transaction.

13. The method of claim 12, further comprising: programming, by the computing device, endpoint firmware of the soft endpoint, wherein establishing the endpoint hierarchy comprises establishing the endpoint hierarchy in response to programming the endpoint firmware.

14. The method of claim 12, wherein establishing the endpoint hierarchy comprises emulating a PCI express switch, a PCI express bridge, or a PCI express endpoint.

15. The method of claim 12, wherein:

receiving the I/O transaction comprises receiving a PCI Express transaction layer packet; and

determining whether to process the I/O transaction comprises determining whether to process the I/O transaction based on a type of the transaction layer packet.

16. The method of claim 12, wherein:

receiving the I/O transaction comprises receiving a PCI Express transaction layer packet; and

determining whether to process the I/O transaction comprises determining whether to process the I/O transaction based on an address of the transaction layer packet.

17. The method of claim 12, wherein processing the I/O transaction comprises: copying data associated with the I/O transaction to a memory of the soft endpoint.

18. The method of claim 12, wherein processing the I/O transaction comprises: copying data associated with the I/O transaction to a memory of the load transfer complex.

19. One or more computer-readable storage media comprising a plurality of instructions stored thereon that, in response to execution, cause a computing device to:

establish, by a soft endpoint of the computing device, a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint, wherein the soft endpoint is coupled to a root port of the computing device, and wherein the soft endpoint is further coupled to a load transfer complex of the computing device;

receive, by the soft endpoint, an I/O transaction originating from the root port in response to establishing the simulated endpoint hierarchy;

determine, by the soft endpoint, whether to process the I/O transaction;

process, by the soft endpoint, the I/O transaction in response to determining to process the I/O transaction; and

forward, by the soft endpoint, the I/O transaction to the load transfer complex in response to determining not to process the I/O transaction.

20. The one or more computer-readable storage media of claim 19, further comprising a plurality of instructions stored thereon that, in response to being executed, cause the computing device to program the endpoint firmware of the soft endpoint, wherein establishing the endpoint hierarchy comprises establishing the endpoint hierarchy in response to programming the endpoint firmware.

21. The one or more computer-readable storage media of claim 19, wherein establishing the endpoint hierarchy comprises: emulating a PCI express switch, a PCI express bridge, or a PCI express endpoint.

22. The one or more computer-readable storage media of claim 19, wherein:

the I/O transaction comprises a PCI Express transaction layer packet; and

determining whether to process the I/O transaction comprises determining whether to process the I/O transaction based on a type of the transaction layer packet.

23. The one or more computer-readable storage media of claim 19, wherein:

the I/O transaction comprises a PCI Express transaction layer packet; and

determining whether to process the I/O transaction comprises determining whether to process the I/O transaction based on an address of the transaction layer packet.

24. The one or more computer-readable storage media of claim 19, wherein processing the I/O transaction comprises: copying data associated with the I/O transaction to a memory of the soft endpoint.

25. The one or more computer-readable storage media of claim 19, wherein processing the I/O transaction comprises: copying data associated with the I/O transaction to a memory of the load transfer complex.
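As an illustration of the retry handling recited in claims 10 and 11, the following is a minimal sketch and not the claimed implementation. It assumes that a retried transaction is matched to the original by a (requester ID, tag) pair and that the history buffer is bounded and evicts its oldest entry; the names Transaction, RetryManager, handle, and forward_to_complex are hypothetical and do not appear in the disclosure.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Transaction:
    """Hypothetical stand-in for an I/O transaction from the root port."""
    requester_id: int
    tag: int
    address: int


class RetryManager:
    """Keeps responses from the load transfer complex in a bounded history buffer."""

    def __init__(self, capacity: int = 4) -> None:
        self._capacity = capacity
        # Maps (requester_id, tag) -> response bytes, oldest entry first.
        self._history = OrderedDict()

    @staticmethod
    def _key(txn: Transaction) -> tuple:
        # Assumption: a retry reuses the requester ID and tag of the original.
        return (txn.requester_id, txn.tag)

    def store(self, txn: Transaction, response: bytes) -> None:
        key = self._key(txn)
        self._history[key] = response
        self._history.move_to_end(key)
        while len(self._history) > self._capacity:
            self._history.popitem(last=False)  # evict the oldest entry

    def retrieve(self, txn: Transaction) -> Optional[bytes]:
        return self._history.get(self._key(txn))


def handle(txn: Transaction, retry_mgr: RetryManager,
           forward: Callable[[Transaction], bytes]) -> bytes:
    """Answer a retry from the history buffer; otherwise forward and record."""
    cached = retry_mgr.retrieve(txn)
    if cached is not None:
        return cached
    response = forward(txn)
    retry_mgr.store(txn, response)
    return response


calls = []


def forward_to_complex(txn: Transaction) -> bytes:
    # Hypothetical load transfer complex that records each forwarded transaction.
    calls.append(txn)
    return b"completion"


mgr = RetryManager()
first = handle(Transaction(0x0100, 7, 0x2000), mgr, forward_to_complex)
retry = handle(Transaction(0x0100, 7, 0x2000), mgr, forward_to_complex)
```

In this sketch the second call is served from the history buffer, so the load transfer complex sees the transaction only once.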

Technical Field

The present disclosure relates to techniques for flexible I/O endpoint acceleration.

Background

Typical PCI Express (PCIe) I/O devices include fixed function endpoints. Fixed function endpoints typically include an endpoint transaction layer, a hardware base configuration space, and an endpoint interface to other parts of the I/O device. The endpoint may also include fixed function protocol conversion. PCIe devices may also include hardware switches, bridges, or other components that establish a fixed PCI hierarchy.

Current computing systems may utilize software virtualization, which is typically performed by a virtual machine monitor, hypervisor, or virtualized guest software executed by a host processor, to share computing resources such as disk drives or other storage devices among multiple tenants. Some computing systems may support bare metal virtualization by offloading certain virtualization tasks to a load transfer complex (offload complex).

Disclosure of Invention

According to an aspect of the present disclosure, there is provided a soft endpoint for I/O communications, the soft endpoint comprising: a flexible endpoint for establishing a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint; a transaction layer to receive I/O transactions originating from a root port of a computing device in response to establishment of the simulated endpoint hierarchy; an intelligent filter to (i) determine whether the I/O transaction is to be processed, and (ii) process the I/O transaction in response to determining that the I/O transaction is to be processed; and an endpoint interface to forward the I/O transaction to a load transfer complex of the computing device in response to determining not to process the I/O transaction.

According to an aspect of the present disclosure, there is provided a method for I/O communication, the method comprising: establishing, by a soft endpoint of a computing device, a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint, wherein the soft endpoint is coupled to a root port of the computing device, and wherein the soft endpoint is further coupled to a load transfer complex of the computing device; receiving, by the soft endpoint, an I/O transaction originating from the root port in response to establishing the simulated endpoint hierarchy; determining, by the soft endpoint, whether to process the I/O transaction; processing, by the soft endpoint, the I/O transaction in response to determining to process the I/O transaction; and forwarding, by the soft endpoint, the I/O transaction to the load transfer complex in response to determining not to process the I/O transaction.

According to an aspect of the present disclosure, there is provided one or more computer-readable storage media comprising a plurality of instructions stored thereon that, in response to execution, cause a computing device to: establish, by a soft endpoint of the computing device, a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint, wherein the soft endpoint is coupled to a root port of the computing device, and wherein the soft endpoint is further coupled to a load transfer complex of the computing device; receive, by the soft endpoint, an I/O transaction originating from the root port in response to establishing the simulated endpoint hierarchy; determine, by the soft endpoint, whether to process the I/O transaction; process, by the soft endpoint, the I/O transaction in response to determining to process the I/O transaction; and forward, by the soft endpoint, the I/O transaction to the load transfer complex in response to determining not to process the I/O transaction.
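The process-or-forward decision summarized above can be sketched as follows. This is a hedged illustration only: the policy (handle memory writes that fall inside a local address window, forward everything else), the metadata fields, and the names TlpType, Tlp, and SoftEndpoint are assumptions made for the example and are not taken from the disclosure. The sketch decides based on the packet type and address, copies data to soft endpoint memory when processing, and wraps the packet with metadata when forwarding.

```python
from dataclasses import dataclass, field
from enum import Enum


class TlpType(Enum):
    """A few PCI Express transaction layer packet types."""
    MEM_READ = "MRd"
    MEM_WRITE = "MWr"
    CFG_READ = "CfgRd"
    CFG_WRITE = "CfgWr"


@dataclass(frozen=True)
class Tlp:
    tlp_type: TlpType
    address: int
    payload: bytes = b""


@dataclass
class SoftEndpoint:
    # Hypothetical policy: packet types and address window handled locally.
    local_window: range
    local_types: frozenset = frozenset({TlpType.MEM_WRITE})
    memory: dict = field(default_factory=dict)     # soft endpoint memory
    forwarded: list = field(default_factory=list)  # queue to the load transfer complex

    def should_process(self, tlp: Tlp) -> bool:
        """Filter on both the type and the address of the packet."""
        return tlp.tlp_type in self.local_types and tlp.address in self.local_window

    def handle(self, tlp: Tlp) -> str:
        if self.should_process(tlp):
            # Processing here means copying the data to soft endpoint memory.
            self.memory[tlp.address] = tlp.payload
            return "processed"
        # Otherwise encapsulate the packet with metadata and forward it.
        envelope = {
            "metadata": {
                "type": tlp.tlp_type.value,
                "address": tlp.address,
                "length": len(tlp.payload),
            },
            "tlp": tlp,
        }
        self.forwarded.append(envelope)
        return "forwarded"


endpoint = SoftEndpoint(local_window=range(0x1000, 0x2000))
results = [
    endpoint.handle(Tlp(TlpType.MEM_WRITE, 0x1004, b"\x01")),  # in window, local type
    endpoint.handle(Tlp(TlpType.CFG_READ, 0x1004)),            # type not handled locally
    endpoint.handle(Tlp(TlpType.MEM_WRITE, 0x3000, b"\x02")),  # outside the window
]
```

Keeping the policy in two plain fields mirrors the idea that the filter behavior is configurable rather than fixed in hardware, although how a real soft endpoint expresses that policy is not specified here.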

Drawings

The concepts described herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. For simplicity and clarity of illustration, elements illustrated in the figures have not necessarily been drawn to scale. Where considered appropriate, reference numerals have been repeated among the figures to indicate corresponding or analogous elements.

FIG. 1 is a simplified diagram of at least one embodiment of a data center for executing workloads with disaggregated resources;

FIG. 2 is a simplified diagram of at least one embodiment of a pod that may be included in the data center of FIG. 1;

FIG. 3 is a perspective view of at least one embodiment of a rack that may be included in the pod of FIG. 2;

FIG. 4 is a side elevation view of the rack of FIG. 3;

FIG. 5 is a perspective view of the rack of FIG. 3 with a sled mounted therein;

FIG. 6 is a simplified block diagram of at least one embodiment of a top side of the sled of FIG. 5;

FIG. 7 is a simplified block diagram of at least one embodiment of a bottom side of the sled of FIG. 6;

FIG. 8 is a simplified block diagram of at least one embodiment of a compute sled usable in the data center of FIG. 1;

FIG. 9 is a top perspective view of at least one embodiment of the compute sled of FIG. 8;

FIG. 10 is a simplified block diagram of at least one embodiment of an accelerator sled usable in the data center of FIG. 1;

FIG. 11 is a top perspective view of at least one embodiment of the accelerator sled of FIG. 10;

FIG. 12 is a simplified block diagram of at least one embodiment of a storage sled usable in the data center of FIG. 1;

FIG. 13 is a top perspective view of at least one embodiment of the storage sled of FIG. 12;

FIG. 14 is a simplified block diagram of at least one embodiment of a memory sled usable in the data center of FIG. 1;

FIG. 15 is a simplified block diagram of a system that may be established within the data center of FIG. 1 to execute workloads with managed nodes composed of disaggregated resources;

FIG. 16 is a simplified block diagram of at least one embodiment of a system for flexible endpoint acceleration for bare metal virtualization;

FIG. 17 is a simplified block diagram of at least one embodiment of a soft endpoint and load transfer complex of the computing device of FIG. 16;

FIG. 18 is a simplified block diagram of at least one embodiment of an environment that may be established by the computing devices of FIGS. 16-17;

FIGS. 19-20 are simplified flow diagrams of at least one embodiment of a method for flexible endpoint acceleration for bare metal virtualization that a soft endpoint of the computing device of FIGS. 16-18 may perform;

FIG. 21 is a simplified flow diagram of at least one embodiment of a method for transaction retry executable by the computing device of FIGS. 16-18; and

FIG. 22 is a schematic diagram of a soft endpoint hierarchy that may be established by the computing device of FIGS. 16-18.

Detailed Description

While the concepts of the present disclosure are susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the concepts of the present disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure and appended claims.

References in the specification to "one embodiment," "an illustrative embodiment," etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may or may not necessarily include that particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described. Further, it should be understood that items included in a list in the form of "at least one of A, B, and C" can refer to (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C). Similarly, items listed in the form of "at least one of A, B, or C" can refer to (A); (B); (C); (A and B); (A and C); (B and C); or (A, B, and C).

The disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on a transitory or non-transitory machine-readable (e.g., computer-readable) storage medium, which may be read and executed by one or more processors. A machine-readable storage medium may be implemented as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).

In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments; in some embodiments, the feature may be omitted or may be combined with other features.

Referring now to FIG. 1, a data center 100 in which disaggregated resources may cooperatively execute one or more workloads (e.g., applications on behalf of customers) includes multiple pods 110, 120, 130, 140, each of which includes one or more rows of racks. Of course, although the data center 100 is shown with multiple pods, in some embodiments the data center 100 may be implemented as a single pod. As described in more detail herein, each rack houses multiple sleds, each of which may be primarily equipped with a particular type of resource (e.g., memory devices, data storage devices, accelerator devices, general purpose processors), i.e., resources that can be logically coupled to form a composed node, which can act as, for example, a server. In the illustrative embodiment, the sleds in each pod 110, 120, 130, 140 are connected to multiple pod switches (e.g., switches that route data communications to and from sleds within the pod). The pod switches, in turn, connect to the spine switch 150, which switches communications among pods (e.g., the pods 110, 120, 130, 140) in the data center 100. In some embodiments, the sleds may be connected to a fabric using Intel Omni-Path technology. In other embodiments, the sleds may be connected to other fabrics, such as InfiniBand or Ethernet. As described in more detail herein, resources within sleds in the data center 100 may be allocated to a group (referred to herein as a "managed node") containing resources from one or more sleds to be collectively utilized in the execution of a workload. The workload can execute as if the resources belonging to the managed node were located on the same sled. The resources in a managed node may belong to sleds in different racks, and even in different pods 110, 120, 130, 140. As such, some resources of a single sled may be allocated to one managed node while other resources of the same sled are allocated to a different managed node (e.g., one processor is assigned to one managed node and another processor of the same sled is assigned to a different managed node).
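The grouping of resources into managed nodes can be illustrated with a small sketch. It assumes a simple first-fit selection from a pool of free resources; the disclosure does not specify an allocation algorithm, and the names Resource, compose_managed_node, and the sled, rack, and resource labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str   # e.g. "processor", "memory", "accelerator", "storage"
    sled: str
    rack: str
    pod: int


def compose_managed_node(free: list, request: dict) -> list:
    """First-fit sketch: take `count` free resources of each requested kind.

    Nothing is removed from `free` unless the whole request can be met.
    """
    chosen = []
    for kind, count in request.items():
        candidates = [r for r in free if r.kind == kind]
        if len(candidates) < count:
            raise ValueError(f"not enough free {kind} resources")
        chosen.extend(candidates[:count])
    for resource in chosen:
        free.remove(resource)
    return chosen


free_pool = [
    Resource("cpu0", "processor", "compute-sled-0", "rack-0", 110),
    Resource("cpu1", "processor", "compute-sled-0", "rack-0", 110),
    Resource("mem0", "memory", "memory-sled-0", "rack-3", 120),
]

# Node A spans two sleds in two different pods.
node_a = compose_managed_node(free_pool, {"processor": 1, "memory": 1})
# Node B receives the other processor of the same compute sled.
node_b = compose_managed_node(free_pool, {"processor": 1})
```

Here the two processors of one compute sled end up in different managed nodes, and node A combines resources from pods 110 and 120, matching the two cases described in the paragraph above.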

Data centers that include disaggregated resources, such as the data center 100, may be used in a wide variety of contexts, such as enterprises, governments, cloud service providers, and communications service providers (e.g., telecommunications companies), and in a wide variety of sizes, from cloud service provider mega-data centers that occupy over 100,000 square feet to single-rack or multi-rack installations for use in base stations.

Disaggregating resources into sleds consisting primarily of a single type of resource (e.g., compute sleds consisting primarily of compute resources, memory sleds consisting primarily of memory resources), and selectively allocating and deallocating the disaggregated resources to form managed nodes assigned to execute workloads, improves the operation and resource usage of the data center 100 relative to typical data centers made up of hyperconverged servers that contain compute, memory, storage, and possibly additional resources in a single chassis. For example, because a sled primarily contains resources of a particular type, resources of a given type can be upgraded independently of other resources. Furthermore, because different resource types (processors, memory, accelerators, etc.) typically have different refresh rates, higher resource utilization and a reduced total cost of ownership may be achieved. For example, a data center operator can upgrade the processors throughout its facility simply by swapping out the compute sleds. In that case, accelerator and storage resources need not be upgraded at the same time and may instead continue operating until they are scheduled for their own refresh. Resource utilization may also increase. For example, if managed nodes are composed based on the requirements of the workloads that will run on them, the resources within a node are more likely to be fully utilized. Such utilization may allow more managed nodes to run in a data center with a given set of resources, or may allow a data center expected to run a given set of workloads to be built with fewer resources.

Referring now to FIG. 2, the pod 110, in the illustrative embodiment, includes a set of rows 200, 210, 220, 230 of racks 240. Each rack 240 may house multiple sleds (e.g., sixteen sleds) and provide power and data connections to the housed sleds, as described in more detail herein. In the illustrative embodiment, the racks in each row 200, 210, 220, 230 are connected to multiple pod switches 250, 260. The pod switch 250 includes a set of ports 252 to which the sleds of the racks of the pod 110 are connected and another set of ports 254 that connect the pod 110 to the spine switch 150 to provide connectivity to other pods in the data center 100. Similarly, the pod switch 260 includes a set of ports 262 to which the sleds of the racks of the pod 110 are connected and a set of ports 264 that connect the pod 110 to the spine switch 150. As such, the use of the pair of switches 250, 260 provides an amount of redundancy to the pod 110. For example, if either of the switches 250, 260 fails, the sleds in the pod 110 may still maintain data communication with the remainder of the data center 100 (e.g., sleds of other pods) through the other switch 250, 260. Further, in the illustrative embodiment, the switches 150, 250, 260 may be implemented as dual-mode optical switches capable of routing both Ethernet protocol communications carrying Internet Protocol (IP) packets and communications according to a second, high-performance link-layer protocol (e.g., Intel's Omni-Path Architecture, InfiniBand, PCI Express) via the optical signaling media of an optical fabric.
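The redundancy provided by the pair of pod switches can be sketched as a simple selection rule. This is a minimal illustration under the assumption that a sled prefers one switch and falls back to the other when the preferred switch is reported down; the function name select_pod_switch and the health map are hypothetical, and the numbers 250 and 260 are used only as labels for the two pod switches.

```python
from typing import Optional


def select_pod_switch(preferred_order: list, is_up: dict) -> Optional[int]:
    """Return the first pod switch in preference order that is reported up."""
    for switch in preferred_order:
        if is_up.get(switch, False):
            return switch
    return None  # no path to the rest of the data center


# Both switches healthy: the preferred switch is used.
normal = select_pod_switch([250, 260], {250: True, 260: True})
# Switch 250 has failed: traffic moves to switch 260.
failover = select_pod_switch([250, 260], {250: False, 260: True})
```

With more than two pod switches per pod, as other embodiments allow, the same rule extends by lengthening the preference list.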

It should be appreciated that each of the other pods 120, 130, 140 (and any additional pods of the data center 100) may be structured similarly to, and have components similar to, the pod 110 shown in and described with reference to FIG. 2 (e.g., each pod may have rows of racks housing multiple sleds as described above). Further, while two pod switches 250, 260 are shown, it should be understood that in other embodiments each pod 110, 120, 130, 140 may be connected to a different number of pod switches, providing even more failover capacity. Of course, in other embodiments, the pods may be arranged differently than the rows-of-racks configuration shown in FIGS. 1-2. For example, a pod may be implemented as multiple sets of racks in which each set of racks is arranged radially, i.e., the racks are equidistant from a center switch.

Referring now to FIGS. 3-5, each illustrative rack 240 of the data center 100 includes two elongated support posts 302, 304, which are arranged vertically. For example, the elongated support posts 302, 304 may extend upward from a floor of the data center 100 when deployed. The rack 240 also includes one or more horizontal pairs 310 of elongated support arms 312 (identified in FIG. 3 by dashed ellipses) configured to support a sled of the data center 100 as described below. One elongated support arm 312 of each pair extends outward from the elongated support post 302 and the other elongated support arm 312 extends outward from the elongated support post 304.

In the illustrative embodiment, each sled of the data center 100 is implemented as a chassis-less sled. That is, each sled has a chassis-less circuit board substrate on which physical resources (e.g., processors, memory, accelerators, storage devices, etc.) are mounted, as discussed in more detail below. As such, the rack 240 is configured to receive the chassis-less sleds. For example, each pair 310 of elongated support arms 312 defines a sled slot 320 of the rack 240, which is configured to receive a corresponding chassis-less sled. To this end, each illustrative elongated support arm 312 includes a circuit board guide 330 configured to receive the chassis-less circuit board substrate of the sled. Each circuit board guide 330 is secured to, or otherwise mounted to, a top side 332 of the corresponding elongated support arm 312. For example, in the illustrative embodiment, each circuit board guide 330 is mounted at a distal end of the corresponding elongated support arm 312 relative to the corresponding elongated support post 302, 304. For clarity of the figures, not every circuit board guide 330 is referenced in each figure.

Each circuit board guide 330 includes an inner wall that defines a circuit board slot 380 configured to receive the chassis-less circuit board substrate of a sled 400 when the sled 400 is received in the corresponding sled slot 320 of the rack 240. To do so, as shown in FIG. 4, a user (or robot) aligns the chassis-less circuit board substrate of an illustrative chassis-less sled 400 to a sled slot 320. The user, or robot, may then slide the chassis-less circuit board substrate forward into the sled slot 320 such that each side edge 414 of the chassis-less circuit board substrate is received in a corresponding circuit board slot 380 of the circuit board guides 330 of the pair 310 of elongated support arms 312 that define the corresponding sled slot 320, as shown in FIG. 4. By having robotically accessible and robotically manipulable sleds that contain disaggregated resources, each type of resource can be upgraded independently of the others and at its own optimized refresh rate. Furthermore, the sleds are configured to blindly mate with the power and data communication cables in each rack 240, enhancing their ability to be quickly removed, upgraded, reinstalled, and/or replaced. As such, in some embodiments, the data center 100 may operate (e.g., execute workloads, undergo maintenance and/or upgrades, etc.) without human involvement on the data center floor. In other embodiments, a human may facilitate one or more maintenance or upgrade operations in the data center 100.

It should be appreciated that each circuit board guide 330 is dual sided. That is, each circuit board guide 330 includes an inner wall that defines a circuit board slot 380 on each side of the circuit board guide 330. In this way, each circuit board guide 330 can support a chassis-less circuit board substrate on either side. As such, a single additional elongated support post may be added to the rack 240 to turn the rack 240 into a two-rack solution that can hold twice as many sled slots 320 as shown in FIG. 3. The illustrative rack 240 includes seven pairs 310 of elongated support arms 312 that define seven corresponding sled slots 320, each configured to receive and support a corresponding sled 400 as discussed above. Of course, in other embodiments, the rack 240 may include additional or fewer pairs 310 of elongated support arms 312 (i.e., additional or fewer sled slots 320). It should be appreciated that because the sled 400 is chassis-less, the sled 400 may have an overall height that is different from that of typical servers. As such, in some embodiments, the height of each sled slot 320 may be shorter than the height of a typical server (e.g., shorter than a single rack unit, "1U"). That is, the vertical distance between each pair 310 of elongated support arms 312 may be less than a standard rack unit "1U." Additionally, due to the relatively reduced height of the sled slots 320, the overall height of the rack 240 in some embodiments may be shorter than the height of traditional rack enclosures. For example, in some embodiments, each of the elongated support posts 302, 304 may have a length of six feet or less. In other embodiments, the rack 240 may have different dimensions. For example, in some embodiments, the vertical distance between each pair 310 of elongated support arms 312 may be greater than a standard rack unit "1U." In such embodiments, the increased vertical distance between the sleds allows larger heat sinks to be attached to the physical resources and larger fans to be used (e.g., in the fan array 370 described below) to cool each sled, which in turn can allow the physical resources to operate at increased power levels. Further, it should be appreciated that the rack 240 does not include any walls, enclosures, or the like. Rather, the rack 240 is an enclosure-less rack that is open to the local environment. Of course, in some cases, an end plate may be attached to one of the elongated support posts 302, 304 where the rack 240 forms an end-of-row rack in the data center 100.

In some embodiments, various interconnects may be routed upward or downward through the elongated support posts 302, 304. To facilitate such routing, each elongated support post 302, 304 includes an inner wall that defines an inner chamber in which the interconnects may be located. The interconnects routed through the elongated support posts 302, 304 may be implemented as any type of interconnect including, but not limited to, data or communication interconnects that provide communication connections to each sled slot 320, power interconnects that provide power to each sled slot 320, and/or other types of interconnects.

The rack 240, in the illustrative embodiment, includes a support platform on which a corresponding optical data connector (not shown) is mounted. Each optical data connector is associated with a corresponding carrier slot 320 and is configured to mate with the optical data connector of a corresponding carrier 400 when the carrier 400 is received in that carrier slot 320. In some embodiments, the optical connections between components (e.g., racks and switches) in the data center 100 are made using blind-mate optical connections. For example, a door on each cable may prevent dust from contaminating the optical fibers inside the cable. During connection to a blind-mate optical connector mechanism, the door is pushed open as the end of the cable approaches or enters the connector mechanism. The optical fibers inside the cable may then enter a gel within the connector mechanism, where the optical fibers of one cable come into contact with the optical fibers of another cable.

The illustrative rack 240 also includes a fan array 370 coupled to the cross-support arms of the rack 240. The fan array 370 includes one or more rows of cooling fans 372, which are aligned in a horizontal line between the elongated struts 302, 304. In the illustrative embodiment, the fan array 370 includes a row of cooling fans 372 for each carrier slot 320 of the rack 240. As discussed above, each carrier 400 does not include any on-board cooling system in the illustrative embodiment and, as such, the fan array 370 provides cooling for each carrier 400 received in the rack 240. Each rack 240, in the illustrative embodiment, also includes a power supply associated with each carrier slot 320. Each power supply is secured to one of the elongated support arms 312 of the pair 310 of elongated support arms 312 that define the corresponding carrier slot 320. For example, the rack 240 may include a power supply coupled or secured to each elongated support arm 312 extending from the elongated strut 302. Each power supply includes a power connector configured to mate with a power connector of the carrier 400 when the carrier 400 is received in the corresponding carrier slot 320. In the illustrative embodiment, the carrier 400 does not include any on-board power supply and, as such, the power supplies provided in the rack 240 supply power to the corresponding carriers 400 when they are mounted to the rack 240. Each power supply is configured to satisfy the power requirements of its associated carrier, which may vary from carrier to carrier. Additionally, the power supplies provided in the rack 240 may operate independently of each other. That is, within a single rack, a first power supply providing power to a computing carrier may provide a different power level than a second power supply providing power to an accelerator carrier. The power supplies may be controllable at the carrier level or the rack level, and may be controlled locally by components on the associated carrier or remotely, for example by another carrier or a coordinator.

Referring now to FIG. 6, the carrier 400, in the illustrative embodiment, is configured to be mounted in a corresponding rack 240 of the data center 100 as discussed above. In some embodiments, each carrier 400 may be optimized or otherwise configured for performing particular tasks, such as computing tasks, acceleration tasks, data storage tasks, and so forth. For example, the carrier 400 may be implemented as a computing carrier 800 as discussed below with reference to FIGS. 8-9, as an accelerator carrier 1000 as discussed below with reference to FIGS. 10-11, as a storage carrier 1200 as discussed below with reference to FIGS. 12-13, or as a carrier optimized or otherwise configured to perform other specialized tasks, such as a memory carrier 1400 as discussed below with reference to FIG. 14.

As discussed above, the illustrative carrier 400 includes a chassis-less circuit board substrate 602, which supports various physical resources (e.g., electrical components) mounted thereon. It should be appreciated that the circuit board substrate 602 is "chassis-less" in that the carrier 400 does not include a housing or enclosure. Rather, the chassis-less circuit board substrate 602 is open to the local environment. The chassis-less circuit board substrate 602 may be formed from any material capable of supporting the various electrical components mounted thereon. For example, in the illustrative embodiment, the chassis-less circuit board substrate 602 is formed from an FR-4 glass-reinforced epoxy laminate material. Of course, other materials may be used to form the chassis-less circuit board substrate 602 in other embodiments.

As discussed in more detail below, the chassis-less circuit board substrate 602 includes multiple features that improve the thermal cooling characteristics of the various electrical components mounted on it. As discussed, the chassis-less circuit board substrate 602 does not include a housing or enclosure, which may improve the airflow over the electrical components of the carrier 400 by reducing those structures that may inhibit airflow. For example, because the chassis-less circuit board substrate 602 is not positioned in an individual housing or enclosure, there is no vertically arranged backplane (e.g., a backplate of a chassis) attached to the chassis-less circuit board substrate 602 that could inhibit airflow across the electrical components. Additionally, the chassis-less circuit board substrate 602 has a geometric shape configured to reduce the length of the airflow path across the electrical components mounted to it. For example, the illustrative chassis-less circuit board substrate 602 has a width 604 that is greater than its depth 606. In one particular embodiment, for example, the chassis-less circuit board substrate 602 has a width of about 21 inches and a depth of about 9 inches, compared to a typical server that has a width of about 17 inches and a depth of about 39 inches. As such, the airflow path 608 that extends from the front edge 610 toward the rear edge 612 of the chassis-less circuit board substrate 602 is shorter than that of a typical server, which may improve the thermal cooling characteristics of the carrier 400. Furthermore, although not illustrated in FIG. 6, the various physical resources mounted to the chassis-less circuit board substrate 602 are mounted in corresponding locations such that no two substantively heat-producing electrical components shadow each other, as discussed in more detail below. That is, no two electrical components that produce appreciable heat during operation (i.e., greater than a nominal heat sufficient to adversely affect the cooling of another electrical component) are mounted to the chassis-less circuit board substrate 602 in line with each other along the direction of the airflow path 608 (i.e., along a direction extending from the front edge 610 toward the rear edge 612 of the chassis-less circuit board substrate 602).
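The placement rule described above can be sketched as a simple layout check. The following Python snippet is only an illustration under stated assumptions: the `Component` record, the `shadowing_pairs` helper, the coordinates, and the 5 W "nominal heat" threshold are all hypothetical and do not come from the disclosure. It treats the airflow path 608 as running from the front edge to the rear edge, so two components are in line with each other when their extents along the board width overlap.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Component:
    """Hypothetical layout record for one component on the substrate."""
    name: str
    x_min: float   # left extent along the board width, in inches
    x_max: float   # right extent along the board width, in inches
    heat_w: float  # heat produced during operation, in watts


def shadowing_pairs(components, nominal_heat_w=5.0):
    """Return pairs of heat-producing components that are in line along the airflow.

    Airflow runs front-to-rear (along the depth), so two components shadow
    each other when their extents along the width overlap. Components at or
    below the assumed nominal heat threshold are ignored.
    """
    hot = [c for c in components if c.heat_w > nominal_heat_w]
    return [
        (a.name, b.name)
        for a, b in combinations(hot, 2)
        if a.x_min < b.x_max and b.x_min < a.x_max
    ]


# Illustrative layout on a board about 21 inches wide (values are made up).
layout = [
    Component("processor_a", 1.0, 5.0, 250.0),
    Component("processor_b", 8.0, 12.0, 250.0),
    Component("nic", 15.0, 18.0, 25.0),
    # In line with the NIC, but produces only negligible heat.
    Component("optical_connector", 15.5, 17.5, 0.5),
]

print(shadowing_pairs(layout))
```

With this layout the check reports no shadowing pairs, because the only in-line pair includes the optical connector, whose heat output is below the assumed threshold.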

As discussed above, the illustrative carrier 400 includes one or more physical resources 620 mounted to the top side 650 of the chassis-less circuit board substrate 602. Although two physical resources 620 are shown in FIG. 6, it should be appreciated that the carrier 400 may include one, two, or more physical resources 620 in other embodiments. The physical resources 620 may be implemented as any type of processor, controller, or other computing circuit capable of performing various tasks, such as computing functions and/or controlling the functions of the carrier 400, depending on, for example, the type or intended functionality of the carrier 400. For example, as discussed in more detail below, the physical resources 620 may be implemented as high-performance processors in embodiments in which the carrier 400 is implemented as a computing carrier, as accelerator co-processors or circuits in embodiments in which the carrier 400 is implemented as an accelerator carrier, as storage controllers in embodiments in which the carrier 400 is implemented as a storage carrier, or as a set of memory devices in embodiments in which the carrier 400 is implemented as a memory carrier.

The carrier 400 also includes one or more additional physical resources 630 mounted to the top side 650 of the chassis-less circuit board substrate 602. In the illustrative embodiment, the additional physical resources 630 include a Network Interface Controller (NIC), as discussed in more detail below. Of course, depending on the type and functionality of the carrier 400, the physical resources 630 may include additional or other electrical components, circuits, and/or devices in other embodiments.

The physical resources 620 are communicatively coupled to the physical resources 630 via an input/output (I/O) subsystem 622. The I/O subsystem 622 may be implemented as circuitry and/or components that facilitate input/output operations with the physical resources 620, the physical resources 630, and/or other components of the carrier 400. For example, the I/O subsystem 622 may be implemented as, or otherwise include, memory controller hubs, input/output control hubs, integrated sensor hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, waveguides, light guides, printed circuit board traces, etc.), and/or other components and subsystems that facilitate input/output operations. In the illustrative embodiment, the I/O subsystem 622 is implemented as, or otherwise includes, a double data rate 4 (DDR4) data bus or a DDR5 data bus.

In some embodiments, the carrier 400 may also include a resource-to-resource interconnect 624. The resource-to-resource interconnect 624 may be implemented as any type of communication interconnect capable of facilitating resource-to-resource communications. In the illustrative embodiment, the resource-to-resource interconnect 624 is implemented as a high-speed point-to-point interconnect (e.g., faster than the I/O subsystem 622). For example, the resource-to-resource interconnect 624 may be implemented as a QuickPath Interconnect (QPI), an UltraPath Interconnect (UPI), or another high-speed point-to-point interconnect dedicated to resource-to-resource communications.

The carrier 400 also includes a power connector 640 configured to mate with a corresponding power connector of the rack 240 when the carrier 400 is mounted in the corresponding rack 240. The carrier 400 receives power from a power supply of the rack 240 via the power connector 640 to supply power to the various electrical components of the carrier 400. That is, the carrier 400 does not include any local power supply (i.e., an on-board power supply) to provide power to its electrical components. The absence of a local or on-board power supply helps reduce the overall footprint of the chassis-less circuit board substrate 602, which, as discussed above, may improve the thermal cooling characteristics of the various electrical components mounted on it. In some embodiments, voltage regulators are placed on the bottom side 750 (see FIG. 7) of the chassis-less circuit board substrate 602, directly opposite the processors 820 (see FIG. 8), and power is routed from the voltage regulators to the processors 820 by vias extending through the circuit board substrate 602. Such a configuration provides an increased thermal budget, additional current and/or voltage, and better voltage control relative to typical printed circuit boards, in which processor power is delivered from a voltage regulator in part by printed circuit traces.

In some embodiments, the carrier 400 may also include mounting features 642 configured to mate with a mounting arm or other structure of a robot to facilitate placement of the carrier 400 in a rack 240 by the robot. The mounting features 642 may be implemented as any type of physical structure that allows the robot to grasp the carrier 400 without damaging the chassis-less circuit board substrate 602 or the electrical components mounted to it. For example, in some embodiments, the mounting features 642 may be implemented as non-conductive pads attached to the chassis-less circuit board substrate 602. In other embodiments, the mounting features 642 may be implemented as brackets, braces, or other similar structures attached to the chassis-less circuit board substrate 602. The particular number, shape, size, and/or make-up of the mounting features 642 may depend on the design of the robot configured to manage the carrier 400.

Referring now to FIG. 7, in addition to the physical resources 630 mounted to the top side 650 of the chassis-less circuit board substrate 602, the carrier 400 also includes one or more memory devices 720 mounted to a bottom side 750 of the chassis-less circuit board substrate 602. That is, the chassis-less circuit board substrate 602 is implemented as a double-sided circuit board. The physical resources 620 are communicatively coupled to the memory devices 720 via the I/O subsystem 622. For example, the physical resources 620 and the memory devices 720 may be communicatively coupled by one or more vias extending through the chassis-less circuit board substrate 602. In some embodiments, each physical resource 620 is communicatively coupled to a different set of one or more memory devices 720. Alternatively, in other embodiments, each physical resource 620 is communicatively coupled to every memory device 720.
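The two coupling arrangements described above (a different set of memory devices per physical resource, or every resource coupled to every memory device) can be pictured as two simple mappings. This Python sketch is illustrative only: the function names, the identifier strings, and the round-robin split used to form the disjoint sets are assumptions, since the disclosure does not say how devices are divided among resources.

```python
def dedicated_coupling(resources, memory_devices):
    """Couple each resource to its own disjoint set of memory devices.

    The round-robin assignment is an assumption made for illustration.
    """
    mapping = {r: [] for r in resources}
    for i, device in enumerate(memory_devices):
        mapping[resources[i % len(resources)]].append(device)
    return mapping


def shared_coupling(resources, memory_devices):
    """Couple every resource to every memory device."""
    return {r: list(memory_devices) for r in resources}


resources = ["resource_0", "resource_1"]          # hypothetical identifiers
memory_devices = ["mem_0", "mem_1", "mem_2", "mem_3"]

print(dedicated_coupling(resources, memory_devices))
print(shared_coupling(resources, memory_devices))
```

In the dedicated arrangement no memory device appears under more than one resource, while in the shared arrangement each resource lists all four devices.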

The memory device 720 may be implemented as any type of memory device capable of storing data for the physical resources 620 during operation of the carrier 400, such as any type of volatile (e.g., Dynamic Random Access Memory (DRAM), etc.) or non-volatile memory. Volatile memory may be a storage medium that requires power to maintain the state of data stored by the medium. Non-limiting examples of volatile memory may include various types of Random Access Memory (RAM), such as Dynamic Random Access Memory (DRAM) or Static Random Access Memory (SRAM). One particular type of DRAM that may be used in a memory module is Synchronous Dynamic Random Access Memory (SDRAM). In a particular embodiment, the DRAM of the memory component may comply with standards promulgated by JEDEC, such as JESD79F for DDR SDRAM, JESD79-2F for DDR2 SDRAM, JESD79-3F for DDR3 SDRAM, JESD79-4A for DDR4 SDRAM, JESD209 for Low Power DDR (LPDDR), JESD209-2 for LPDDR2, JESD209-3 for LPDDR3, and JESD209-4 for LPDDR 4. Such a standard (and similar standards) may be referred to as a DDR-based standard, and a communication interface of a memory device implementing such a standard may be referred to as a DDR-based interface.

In one embodiment, the memory device is a block addressable memory device, such as those based on NAND or NOR technologies. The memory device may also include next-generation non-volatile devices, such as Intel 3D XPoint™ memory or other byte-addressable write-in-place non-volatile memory devices. In one embodiment, the memory device may be or may include a chalcogenide glass-based memory device, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level Phase Change Memory (PCM), a resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), anti-ferroelectric memory, magnetoresistive random access memory (MRAM) that incorporates memristor technology, resistive memory including metal oxide-based, oxygen vacancy-based, and conductive bridge random access memory (CB-RAM), a spin transfer torque (STT) based device, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, a combination of any of the above, or other memory. The memory device may refer to the die itself and/or to a packaged memory product. In some embodiments, the memory device may include a transistor-less stackable cross-point architecture in which memory cells sit at the intersections of word lines and bit lines and are individually addressable, and in which bit storage is based on a change in bulk resistance.

Referring now to FIG. 8, in some embodiments, the carrier 400 may be implemented as a computing carrier 800. The computing carrier 800 is optimized, or otherwise configured, to perform computing tasks. Of course, as discussed above, the computing carrier 800 may rely on other carriers, such as accelerator carriers and/or storage carriers, to perform such computing tasks. The computing carrier 800 includes various physical resources (e.g., electrical components) similar to the physical resources of the carrier 400, which are identified in FIG. 8 using the same reference numerals. The description of such components provided above with reference to FIGS. 6 and 7 applies to the corresponding components of the computing carrier 800 and is not repeated here for clarity of the description of the computing carrier 800.

In the illustrative computing carrier 800, the physical resources 620 are implemented as processors 820. Although only two processors 820 are shown in FIG. 8, it should be appreciated that the computing carrier 800 may include additional processors 820 in other embodiments. Illustratively, the processors 820 are implemented as high-performance processors 820 and may be configured to operate at a relatively high power rating. Although the processors 820 generate additional heat when operating at power ratings greater than those of typical processors (which operate at around 155-230 W), the enhanced thermal cooling characteristics of the chassis-less circuit board substrate 602 discussed above facilitate the higher-power operation. For example, in the illustrative embodiment, the processors 820 are configured to operate at a power rating of at least 250 W. In some embodiments, the processors 820 may be configured to operate at a power rating of at least 350 W.

In some embodiments, the computing carrier 800 may also include a processor-to-processor interconnect 842. Similar to the resource-to-resource interconnect 624 of the carrier 400 discussed above, the processor-to-processor interconnect 842 may be implemented as any type of communication interconnect capable of facilitating processor-to-processor communications. In the illustrative embodiment, the processor-to-processor interconnect 842 is implemented as a high-speed point-to-point interconnect (e.g., faster than the I/O subsystem 622). For example, the processor-to-processor interconnect 842 may be implemented as a QuickPath Interconnect (QPI), an UltraPath Interconnect (UPI), or another high-speed point-to-point interconnect dedicated to processor-to-processor communications.

The computing carrier 800 also includes a communication circuit 830. The illustrative communication circuit 830 includes a Network Interface Controller (NIC) 832, which may also be referred to as a Host Fabric Interface (HFI). The NIC 832 may be implemented as, or otherwise include, any type of integrated circuit, discrete circuit, controller chip, chipset, add-in board, daughter card, network interface card, or other device that may be used by the computing carrier 800 to connect with another computing device (e.g., with other carriers 400). In some embodiments, the NIC 832 may be implemented as part of a system-on-a-chip (SoC) that includes one or more processors, or may be included on a multi-chip package that also contains one or more processors. In some embodiments, the NIC 832 may include a local processor (not shown) and/or a local memory (not shown) that are both local to the NIC 832. In such embodiments, the local processor of the NIC 832 may be capable of performing one or more of the functions of the processors 820. Additionally or alternatively, in such embodiments, the local memory of the NIC 832 may be integrated into one or more components of the computing carrier at the board level, socket level, chip level, and/or other levels.

The communication circuit 830 is communicatively coupled to an optical data connector 834. The optical data connector 834 is configured to mate with a corresponding optical data connector of the rack 240 when the computing carrier 800 is mounted in the rack 240. Illustratively, the optical data connector 834 includes a plurality of optical fibers that lead from a mating surface of the optical data connector 834 to an optical transceiver 836. The optical transceiver 836 is configured to convert incoming optical signals from the rack-side optical data connector into electrical signals and to convert electrical signals into outgoing optical signals to the rack-side optical data connector. Although shown as forming part of the optical data connector 834 in the illustrative embodiment, the optical transceiver 836 may form part of the communication circuit 830 in other embodiments.

In some embodiments, the computing carrier 800 may also include an expansion connector 840. In such embodiments, the expansion connector 840 is configured to mate with a corresponding connector of an expansion chassis-less circuit board substrate to provide additional physical resources to the computing carrier 800. The additional physical resources may be used, for example, by the processors 820 during operation of the computing carrier 800. The expansion chassis-less circuit board substrate may be substantially similar to the chassis-less circuit board substrate 602 discussed above and may include various electrical components mounted to it. The particular electrical components mounted to the expansion chassis-less circuit board substrate may depend on its intended functionality. For example, the expansion chassis-less circuit board substrate may provide additional computing resources, memory resources, and/or storage resources. As such, the additional physical resources of the expansion chassis-less circuit board substrate may include, but are not limited to, processors, memory devices, storage devices, and/or accelerator circuits, including, for example, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), security co-processors, graphics processing units (GPUs), machine learning circuits, or other specialized processors, controllers, devices, and/or circuits.

Referring now to FIG. 9, an illustrative embodiment of the computing carrier 800 is shown. As shown, the processors 820, the communication circuit 830, and the optical data connector 834 are mounted to the top side 650 of the chassis-less circuit board substrate 602. Any suitable attachment or mounting technology may be used to mount the physical resources of the computing carrier 800 to the chassis-less circuit board substrate 602. For example, the various physical resources may be mounted in corresponding sockets (e.g., a processor socket), holders, or brackets. In some cases, some of the electrical components may be mounted directly to the chassis-less circuit board substrate 602 via soldering or similar techniques.

As discussed above, the individual processors 820 and the communication circuit 830 are mounted to the top side 650 of the chassis-less circuit board substrate 602 such that no two heat-producing electrical components shadow each other. In the illustrative embodiment, the processors 820 and the communication circuit 830 are mounted in corresponding locations on the top side 650 of the chassis-less circuit board substrate 602 such that no two of those physical resources are in line with each other along the direction of the airflow path 608. It should be appreciated that, although the optical data connector 834 is in line with the communication circuit 830, the optical data connector 834 produces no heat, or only nominal heat, during operation.

The memory devices 720 of the computing carrier 800 are mounted to the bottom side 750 of the chassis-less circuit board substrate 602 as discussed above with regard to the carrier 400. Although mounted to the bottom side 750, the memory devices 720 are communicatively coupled to the processors 820 located on the top side 650 via the I/O subsystem 622. Because the chassis-less circuit board substrate 602 is implemented as a double-sided circuit board, the memory devices 720 and the processors 820 may be communicatively coupled by one or more vias, connectors, or other mechanisms extending through the chassis-less circuit board substrate 602. Of course, each processor 820 may be communicatively coupled to a different set of one or more memory devices 720 in some embodiments. Alternatively, in other embodiments, each processor 820 may be communicatively coupled to every memory device 720. In some embodiments, the memory devices 720 may be mounted to one or more memory mezzanines on the bottom side of the chassis-less circuit board substrate 602 and may interconnect with a corresponding processor 820 through a ball grid array.

Each processor 820 includes a heat sink 850 secured to it. Because the memory devices 720 are mounted to the bottom side 750 of the chassis-less circuit board substrate 602 (and because of the vertical spacing of the carriers 400 in the corresponding rack 240), the top side 650 of the chassis-less circuit board substrate 602 includes additional "free" area or space, which facilitates the use of heat sinks 850 that are larger than the conventional heat sinks used in typical servers. Additionally, due to the improved thermal cooling characteristics of the chassis-less circuit board substrate 602, none of the processor heat sinks 850 includes a cooling fan attached to it. That is, each heat sink 850 is implemented as a fanless heat sink. In some embodiments, because of its increased size, a heat sink 850 mounted atop a processor 820 may overlap with a heat sink attached to the communication circuit 830 in the direction of the airflow path 608, as illustratively suggested by FIG. 9.

Referring now to FIG. 10, in some embodiments, the carrier 400 may be implemented as an accelerator carrier 1000. The accelerator carrier 1000 is configured to perform specialized computing tasks, such as machine learning, encryption, hashing, or other computationally intensive tasks. In some embodiments, for example, a computing carrier 800 may offload tasks to the accelerator carrier 1000 during operation. The accelerator carrier 1000 includes various components similar to those of the carrier 400 and/or the computing carrier 800, which are identified in FIG. 10 using the same reference numerals. The description of such components provided above with reference to FIGS. 6, 7, and 8 applies to the corresponding components of the accelerator carrier 1000 and is not repeated here for clarity of the description of the accelerator carrier 1000.

In the illustrative accelerator carrier 1000, the physical resources 620 are implemented as accelerator circuits 1020. Although only two accelerator circuits 1020 are shown in FIG. 10, it should be appreciated that the accelerator carrier 1000 may include additional accelerator circuits 1020 in other embodiments. For example, as shown in FIG. 11, the accelerator carrier 1000 may include four accelerator circuits 1020 in some embodiments. The accelerator circuits 1020 may be implemented as any type of processor, co-processor, computing circuit, or other device capable of performing computing or processing operations. For example, the accelerator circuits 1020 may be implemented as field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), security co-processors, graphics processing units (GPUs), neuromorphic processor units, quantum computers, machine learning circuits, or other specialized processors, controllers, devices, and/or circuits.

In some embodiments, the accelerator carrier 1000 may also include an accelerator-to-accelerator interconnect 1042. Similar to the resource-to-resource interconnect 624 of the carrier 400 discussed above, the accelerator-to-accelerator interconnect 1042 may be implemented as any type of communication interconnect capable of facilitating accelerator-to-accelerator communications. In the illustrative embodiment, the accelerator-to-accelerator interconnect 1042 is implemented as a high-speed point-to-point interconnect (e.g., faster than the I/O subsystem 622). For example, the accelerator-to-accelerator interconnect 1042 may be implemented as a QuickPath Interconnect (QPI), an UltraPath Interconnect (UPI), or another high-speed point-to-point interconnect dedicated to processor-to-processor communications. In some embodiments, the accelerator circuits 1020 may be daisy-chained, with a primary accelerator circuit 1020 connected to the NIC 832 and the memory devices 720 through the I/O subsystem 622 and a secondary accelerator circuit 1020 connected to the NIC 832 and the memory devices 720 through the primary accelerator circuit 1020.
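The daisy-chained arrangement can be illustrated with a small path-resolution sketch. The Python below is a hypothetical model, not something taken from the disclosure: the `upstream` table, the node names, and the `path_to_io` helper are assumptions used only to show that a secondary accelerator circuit reaches the I/O subsystem (and thus the NIC and memory devices) by way of the primary accelerator circuit.

```python
IO_SUBSYSTEM = "io_subsystem"  # hypothetical name standing in for I/O subsystem 622


def path_to_io(accelerator, upstream):
    """Return the hops from an accelerator circuit to the I/O subsystem.

    `upstream` maps each accelerator to the element it attaches to: the
    I/O subsystem for a primary accelerator, or another accelerator for
    a secondary one.
    """
    path = [accelerator]
    seen = {accelerator}
    while path[-1] != IO_SUBSYSTEM:
        nxt = upstream[path[-1]]
        if nxt in seen:
            raise ValueError(f"cycle detected at {nxt!r}")
        seen.add(nxt)
        path.append(nxt)
    return path


# Primary attaches directly to the I/O subsystem; secondary attaches to the primary.
upstream = {
    "accel_primary": IO_SUBSYSTEM,
    "accel_secondary": "accel_primary",
}

print(path_to_io("accel_primary", upstream))
print(path_to_io("accel_secondary", upstream))
```

The secondary accelerator's path has one more hop than the primary's, which is the cost of chaining it behind the primary rather than attaching it to the I/O subsystem directly.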

Referring now to FIG. 11, an illustrative embodiment of the accelerator carrier 1000 is shown. As discussed above, the accelerator circuits 1020, the communication circuit 830, and the optical data connector 834 are mounted to the top side 650 of the chassis-less circuit board substrate 602. Again, the individual accelerator circuits 1020 and the communication circuit 830 are mounted to the top side 650 of the chassis-less circuit board substrate 602 such that no two heat-producing electrical components shadow each other, as discussed above. The memory devices 720 of the accelerator carrier 1000 are mounted to the bottom side 750 of the chassis-less circuit board substrate 602 as discussed above with regard to the carrier 400. Although mounted to the bottom side 750, the memory devices 720 are communicatively coupled to the accelerator circuits 1020 located on the top side 650 via the I/O subsystem 622 (e.g., through vias). Further, each accelerator circuit 1020 may include a heat sink 1070 that is larger than the conventional heat sinks used in servers. As discussed above with reference to the heat sinks 870, the heat sinks 1070 may be larger than conventional heat sinks because of the "free" area provided by the memory devices 720 being located on the bottom side 750 of the chassis-less circuit board substrate 602 rather than on the top side 650.

Referring now to FIG. 12, in some embodiments, the carrier 400 may be implemented as a storage carrier 1200. The storage carrier 1200 is configured to store data in a data storage 1250 local to the storage carrier 1200. For example, during operation, a computing carrier 800 or an accelerator carrier 1000 may store data in, and retrieve data from, the data storage 1250 of the storage carrier 1200. The storage carrier 1200 includes various components similar to those of the carrier 400 and/or the computing carrier 800, which are identified in FIG. 12 using the same reference numerals. The description of such components provided above with reference to FIGS. 6, 7, and 8 applies to the corresponding components of the storage carrier 1200 and is not repeated here for clarity of the description of the storage carrier 1200.

In the illustrative storage carrier 1200, the physical resources 620 are implemented as storage controllers 1220. Although only two storage controllers 1220 are shown in FIG. 12, it should be appreciated that the storage carrier 1200 may include additional storage controllers 1220 in other embodiments. The storage controllers 1220 may be implemented as any type of processor, controller, or control circuit capable of controlling the storage of data in, and the retrieval of data from, the data storage 1250 based on requests received via the communication circuit 830. In the illustrative embodiment, the storage controllers 1220 are implemented as relatively low-power processors or controllers. For example, in some embodiments, the storage controllers 1220 may be configured to operate at a power rating of about 75 W.

In some embodiments, the storage carrier 1200 may also include a controller-to-controller interconnect 1242. Similar to the resource-to-resource interconnect 624 of the carrier 400 discussed above, the controller-to-controller interconnect 1242 may be implemented as any type of communication interconnect capable of facilitating controller-to-controller communications. In the illustrative embodiment, the controller-to-controller interconnect 1242 is implemented as a high-speed point-to-point interconnect (e.g., faster than the I/O subsystem 622). For example, the controller-to-controller interconnect 1242 may be implemented as a QuickPath Interconnect (QPI), an UltraPath Interconnect (UPI), or another high-speed point-to-point interconnect dedicated to processor-to-processor communications.

Referring now to FIG. 13, an illustrative embodiment of the storage carriage 1200 is shown. In the illustrative embodiment, the data storage device 1250 is implemented as or otherwise includes a storage cage 1252 configured to house one or more solid state drives (SSDs) 1254. To this end, the storage cage 1252 includes several mounting slots 1256, each configured to receive a respective solid state drive 1254. Each mounting slot 1256 includes a number of drive rails 1258 that cooperate to define an access opening 1260 of the respective mounting slot 1256. The storage cage 1252 is secured to the chassis-less circuit board substrate 602 such that the access opening faces away from the chassis-less circuit board substrate 602 (i.e., toward the front of the chassis-less circuit board substrate 602). As such, the solid state drives 1254 are accessible while the storage carriage 1200 is seated in the respective rack 240. For example, a solid state drive 1254 may be swapped out of the rack 240 (e.g., via a robot) while the storage carriage 1200 remains seated in the respective rack 240.

The storage cage 1252 illustratively includes sixteen mounting slots 1256 and is capable of mounting and storing sixteen solid state drives 1254. Of course, the storage cage 1252 may be configured to store more or fewer solid state drives 1254 in other embodiments. Further, in the illustrative embodiment, the solid state drives are vertically disposed in the storage cage 1252, but may be disposed in different orientations in the storage cage 1252 in other embodiments. Each solid state drive 1254 may be implemented as any type of data storage device capable of storing long-term data. To this end, the solid state drive 1254 may include the volatile and non-volatile memory devices discussed above.

As shown in FIG. 13, the storage controllers 1220, the communication circuit 830, and the optical data connector 834 are illustratively mounted to the top side 650 of the chassis-less circuit board substrate 602. Also, as described above, any suitable attachment or mounting technique may be used to mount the electrical components of the storage carriage 1200 to the chassis-less circuit board substrate 602, including, for example, sockets (e.g., processor sockets), supports, brackets, solder connections, and/or other mounting or fastening techniques.

As described above, the individual storage controllers 1220 and the communication circuit 830 are mounted to the top side 650 of the chassis-less circuit board substrate 602 such that no two heat-generating electrical components shadow each other. For example, the storage controllers 1220 and the communication circuit 830 are placed in respective positions on the top side 650 of the chassis-less circuit board substrate 602 such that no two of the electrical components line up with each other along the direction of the airflow path 608.

The memory devices 720 of the storage carriage 1200 are mounted to the bottom side 750 of the chassis-less circuit board substrate 602 as described above for the carriage 400. Although mounted to the bottom side 750, the memory devices 720 are communicatively coupled to the storage controllers 1220 located on the top side 650 via I/O subsystem 622. Also, because the chassis-less circuit board substrate 602 is implemented as a double-sided circuit board, the memory devices 720 and the storage controllers 1220 may be communicatively coupled by one or more vias, connectors, or other mechanisms that extend through the chassis-less circuit board substrate 602. Each storage controller 1220 includes a heat sink 1270 secured to it. As described above, none of the heat sinks 1270 includes a cooling fan attached thereto due to the improved thermal cooling characteristics of the chassis-less circuit board substrate 602 of the storage carriage 1200. That is, each heat sink 1270 is implemented as a fanless heat sink.

Referring now to FIG. 14, in some embodiments, the carriage 400 may be implemented as a memory carriage 1400. The memory carriage 1400 is optimized or otherwise configured to provide other carriages 400 (e.g., computing carriages 800, accelerator carriages 1000, etc.) access to a pool of memory local to the memory carriage 1400 (e.g., in two or more memory banks 1430, 1432 of memory devices 720). For example, during operation, the computing carriage 800 or the accelerator carriage 1000 may remotely write to and/or read from one or more of the memory banks 1430, 1432 of the memory carriage 1400 using a logical address space that maps to physical addresses in the memory banks 1430, 1432. The memory carriage 1400 includes various components similar to those of the carriage 400 and/or the computing carriage 800, which are identified in FIG. 14 using the same reference numerals. The description of such components provided above with reference to FIGS. 6, 7, and 8 applies to the corresponding components of the memory carriage 1400 and is not repeated here for clarity of the description of the memory carriage 1400.

In the illustrative memory carriage 1400, the physical resources 620 are implemented as memory controllers 1420. Although only two memory controllers 1420 are shown in FIG. 14, it should be understood that the memory carriage 1400 may include additional memory controllers 1420 in other embodiments. Memory controller 1420 may be implemented as any type of processor, controller, or control circuit capable of controlling the writing of data to, and the reading of data from, memory banks 1430, 1432 based on requests received via communication circuit 830. In the illustrative embodiment, each memory controller 1420 is connected to a respective memory bank 1430, 1432 to write to and read from memory devices 720 within the respective memory bank 1430, 1432 and to enforce any permissions (e.g., read, write, etc.) associated with the carriage 400 that sent a request to the memory carriage 1400 to perform a memory access operation (e.g., read or write).

In some embodiments, the memory carriage 1400 may also include a controller-to-controller interconnect 1442. Similar to the resource-to-resource interconnect 624 of the carriage 400 discussed above, the controller-to-controller interconnect 1442 may be implemented as any type of communication interconnect capable of facilitating controller-to-controller communication. In the illustrative embodiment, controller-to-controller interconnect 1442 is implemented as a high-speed point-to-point interconnect (e.g., faster than I/O subsystem 622). For example, controller-to-controller interconnect 1442 may be implemented as a QuickPath Interconnect (QPI), UltraPath Interconnect (UPI), or other high-speed point-to-point interconnect dedicated to processor-to-processor communications. As such, in some embodiments, memory controller 1420 may access memory within memory bank 1432 associated with another memory controller 1420 via controller-to-controller interconnect 1442. In some embodiments, a scalable memory controller is made up of a plurality of smaller memory controllers on a memory carriage (e.g., memory carriage 1400), which are referred to herein as "chiplets". Chiplets can be interconnected (e.g., using an Embedded Multi-die Interconnect Bridge (EMIB)). The combined chiplet memory controller can be scaled to a relatively large number of memory controllers and I/O ports (e.g., up to 16 memory channels). In some embodiments, memory controller 1420 may implement memory interleaving (e.g., one memory address is mapped to memory bank 1430, the next memory address is mapped to memory bank 1432, the third address is mapped to memory bank 1430, etc.). Interleaving may be managed within memory controller 1420, or from a CPU socket (e.g., of computing carriage 800) through a network link to memory banks 1430, 1432, and may reduce the latency of memory access operations compared with accessing consecutive memory addresses from the same memory device.
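The interleaving described above can be sketched as a round-robin mapping from addresses to banks. This is a minimal illustration only: the function name `bank_for_address`, the `granule` parameter, and the `BANK_IDS` table are assumptions introduced here and do not come from the disclosure.

```python
# Hypothetical table tying bank indices to the reference numerals in the text.
BANK_IDS = {0: 1430, 1: 1432}


def bank_for_address(address: int, num_banks: int = 2, granule: int = 1) -> int:
    """Return the index of the bank serving `address` under round-robin interleaving.

    `granule` is the number of consecutive addresses kept in one bank before
    moving to the next bank (an assumed parameter; 1 matches the example text).
    """
    return (address // granule) % num_banks


# Consecutive addresses alternate between the two banks.
mapped = [BANK_IDS[bank_for_address(a)] for a in range(4)]
print(mapped)
```

With a granule of 1, consecutive addresses alternate between bank 1430 and bank 1432, which is the pattern the paragraph describes.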

Additionally, in some embodiments, the memory carriage 1400 may be connected to one or more other carriages 400 (e.g., in the same rack 240 or an adjacent rack 240) through waveguides using a waveguide connector 1480. In the illustrative embodiment, the waveguide is a 64 millimeter waveguide providing 16 Rx (i.e., receive) channels and 16 Tx (i.e., transmit) channels. In the illustrative embodiment, each channel is 16 GHz or 32 GHz. In other embodiments, the frequency may be different. The use of waveguides may provide high throughput access to memory pools (e.g., memory banks 1430, 1432) to another carriage (e.g., a carriage 400 in the same rack 240 as the memory carriage 1400 or in an adjacent rack 240) without increasing the load on the optical data connector 834.

Referring now to FIG. 15, a system for executing one or more workloads (e.g., applications) can be implemented in accordance with the data center 100. In an illustrative embodiment, system 1510 includes a coordinator server 1520, which may be implemented as a managed node that includes a computing device (e.g., processor 820 on computing tray 800) executing management software (e.g., a cloud operating environment such as OpenStack) communicatively coupled to a plurality of trays 400, the trays 400 including a number of computing trays 1530 (e.g., each similar to computing tray 800), memory trays 1540 (e.g., each similar to memory tray 1400), accelerator trays 1550 (e.g., each similar to accelerator tray 1000), and storage trays 1560 (e.g., each similar to storage tray 1200). One or more of the shelves 1530, 1540, 1550, 1560 may be grouped into a managed node 1570, such as by the coordinator server 1520, to collectively execute a workload (e.g., an application 1532 executing in a virtual machine or in a container). Managed node 1570 may be implemented as a collection of physical resources 620 (e.g., processors 820, memory resources 720, accelerator circuits 1020, or data storage 1250) from the same or different carrier 400. In addition, managed nodes may be established, defined, or "spun-up" by the coordinator server 1520 when workloads are to be assigned to the managed nodes, or at any other time, and may exist regardless of whether any workloads are currently assigned to the managed nodes. In the illustrative embodiment, the coordinator server 1520 may selectively allocate and/or deallocate physical resources 620 from the carriers 400 and/or add or remove one or more carriers 400 from the managed node 1570 according to quality of service (QoS) objectives associated with a service level agreement for a workload (e.g., application 1532) (e.g., performance objectives associated with throughput, latency, instructions per second, etc.). 
During this time, the coordinator server 1520 may receive telemetry data indicative of performance conditions (e.g., throughput, latency, instructions per second, etc.) in each rack 400 of the managed node 1570 and compare the telemetry data to a quality of service objective to determine whether the quality of service objective is met. The coordinator server 1520 may also determine whether one or more physical resources may be deallocated from the managed node 1570 while still meeting the QoS goals, freeing up those physical resources for use in another managed node (e.g., to perform a different workload). Alternatively, if the QoS target is not currently met, the coordinator server 1520 may determine to dynamically allocate additional physical resources to assist in the execution of the workload (e.g., application 1532) while the workload is executing. Similarly, if the coordinator server 1520 determines that the result of deallocating physical resources will be that the QoS target is still met, the coordinator server 1520 may determine to dynamically deallocate physical resources from the managed node.
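The allocation decision described above can be sketched as a comparison of telemetry against a QoS target. The class names, the two metrics chosen, and the idea of passing in a predicted post-deallocation telemetry sample are assumptions made for illustration; the disclosure does not specify how the prediction is obtained.

```python
from dataclasses import dataclass


@dataclass
class QosTarget:
    min_throughput: float  # e.g., operations per second
    max_latency_ms: float


@dataclass
class Telemetry:
    throughput: float
    latency_ms: float


def meets(target: QosTarget, sample: Telemetry) -> bool:
    """True when the sample satisfies both the throughput and latency targets."""
    return (sample.throughput >= target.min_throughput
            and sample.latency_ms <= target.max_latency_ms)


def decide(target: QosTarget, current: Telemetry, predicted_after_dealloc: Telemetry) -> str:
    """Return 'allocate', 'deallocate', or 'hold' for a managed node."""
    if not meets(target, current):
        return "allocate"      # target missed: add physical resources
    if meets(target, predicted_after_dealloc):
        return "deallocate"    # target still met with fewer resources
    return "hold"
```

For example, `decide(QosTarget(1000, 5), Telemetry(800, 4), Telemetry(600, 6))` returns `"allocate"` because the current throughput is below the target.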

Further, in some embodiments, the coordinator server 1520 may identify trends in resource utilization of the workload (e.g., the application 1532), for example, by identifying phases of execution of the workload (e.g., the application 1532) (e.g., periods of time to perform different operations, each operation having different resource utilization characteristics) and preemptively identifying available resources in the data center 100 and allocating them to the managed node 1570 (e.g., within a predetermined period of time after the associated phases begin). In some embodiments, the coordinator server 1520 may model performance based on various latency and distribution schemes to place workloads between compute racks and other resources (e.g., accelerator racks, memory racks, storage racks) in the data center 100. For example, the coordinator server 1520 may utilize a model that: the model takes into account the performance of the resource on the shelf 400 (e.g., FPGA performance, memory access latency, etc.) and the performance of the path through the network to the resource (e.g., FPGA) (e.g., congestion, latency, bandwidth). As such, the coordinator server 1520 may determine which resource(s) should be used for which workloads based on the total latency associated with each potential resource available in the data center 100 (e.g., the latency associated with the performance of the resource itself, and the latency associated with the path through the network between the computing rack executing the workload and the rack 400 in which the resource is located).
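As a rough sketch of the total-latency model described above, a coordinator could sum the latency of each candidate resource with the latency of the network path to it and pick the smallest total. The dictionary keys and the candidate names below are hypothetical.

```python
def total_latency_us(candidate: dict) -> float:
    """Latency of the resource itself plus latency of the network path to it."""
    return candidate["resource_latency_us"] + candidate["path_latency_us"]


def pick_resource(candidates: list) -> dict:
    """Choose the candidate with the lowest total latency."""
    return min(candidates, key=total_latency_us)


# Hypothetical candidates: a fast but distant FPGA and a slower nearby one.
candidates = [
    {"name": "fpga-far", "resource_latency_us": 10.0, "path_latency_us": 40.0},
    {"name": "fpga-near", "resource_latency_us": 25.0, "path_latency_us": 5.0},
]
best = pick_resource(candidates)
print(best["name"])
```

Here the nearby resource wins even though it is slower on its own, because the path latency dominates the total.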

In some embodiments, the coordinator server 1520 may generate a map of heat generation in the data center 100 using the telemetry data (e.g., temperatures, fan speeds, etc.) reported from the racks 400 and allocate resources to managed nodes according to the map of heat generation and predicted heat generation associated with different workloads to maintain a target temperature and heat distribution in the data center 100. Additionally or alternatively, in some embodiments, the coordinator server 1520 may organize the received telemetry data into a hierarchical model that indicates relationships between managed nodes (e.g., spatial relationships, such as physical locations of resources of the managed nodes within the data center 100, and/or functional relationships, such as groupings of managed nodes by customers for which the managed nodes provide services, types of functions the managed nodes typically perform, managed nodes that typically share or exchange workload between each other, etc.). Based on differences in physical location and resources in the managed nodes, a given workload may exhibit different resource utilization (e.g., cause different internal temperatures, use different percentages of processor or memory capacity) among the resources of different managed nodes. Coordinator server 1520 may determine these differences based on telemetry data stored in the hierarchical model and factor these differences into a prediction of future resource utilization of the workload if the workload is reassigned from one managed node to another managed node to accurately balance resource utilization in data center 100.

To reduce the computational load on the coordinator server 1520 and the data transfer load on the network, in some embodiments, the coordinator server 1520 may send self-test information to the carriages 400 to enable each carriage 400 to determine locally (e.g., on the carriage 400) whether the telemetry data generated by that carriage 400 meets one or more conditions (e.g., meets a predetermined threshold of available capacity, meets a predetermined threshold of temperature, etc.). Each carriage 400 may then report back a reduced result (e.g., yes or no) to the coordinator server 1520, which the coordinator server 1520 may use to determine resource allocation to the managed nodes.
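The local self-test can be sketched as a function that evaluates telemetry against the received conditions and returns only a yes/no result. The condition encoding (`"min"` or `"max"` paired with a threshold) and the metric names are assumptions for illustration.

```python
def local_self_test(telemetry: dict, conditions: dict) -> bool:
    """Evaluate telemetry locally and reduce it to a single yes/no answer.

    `conditions` maps a metric name to a (kind, threshold) pair, where kind is
    "min" (value must be at least the threshold) or "max" (value must not
    exceed the threshold). This encoding is hypothetical.
    """
    for metric, (kind, threshold) in conditions.items():
        value = telemetry[metric]
        if kind == "min" and value < threshold:
            return False
        if kind == "max" and value > threshold:
            return False
    return True


# Hypothetical self-test information sent by the coordinator.
conditions = {"available_capacity": ("min", 0.25), "temperature_c": ("max", 80.0)}
print(local_self_test({"available_capacity": 0.4, "temperature_c": 71.5}, conditions))
```

Only the boolean travels back over the network, rather than the full telemetry record.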

Referring now to FIG. 16, an illustrative system 1600 for flexible endpoint acceleration for bare metal virtualization includes a computing device 1602 and a plurality of remote devices 1604 communicating over a network 1606. Each of the devices 1602, 1604 may be implemented as one or more carriages 400 in a data center (e.g., a computing carriage 800 and a plurality of storage carriages 1200, or another configuration). In use, as described further below, the computing device 1602 issues I/O transactions to a soft endpoint that emulates an endpoint hierarchy (e.g., a PCI Express hierarchy). The computing device 1602 may utilize standard drivers and/or operating systems to issue I/O transactions. The soft endpoint and its simulated hierarchy are programmable by the computing device 1602. Thus, the system 1600 can transparently support a flexible, programmable device hierarchy. Further, the soft endpoint may selectively process or otherwise accelerate I/O transactions and/or pass them to a load transfer complex, such as a smart NIC. The load transfer complex may perform tasks such as bare metal virtualization. Thus, system 1600 supports load transfer of bare metal virtualization tasks from the host processor of computing device 1602 to the load transfer complex. The system 1600 also supports load transfer of I/O transaction processing tasks from the load transfer complex to the soft endpoint. Thus, system 1600 may free processing cycles of the load transfer complex and/or may allow the use of less expensive load transfer complexes.

The computing device 1602 may be implemented as any type of device capable of performing the functions described herein. For example, the computing device 1602 may be implemented as, but is not limited to, a carriage, a computing carriage, an accelerator carriage, a storage carriage, a computer, a server, a distributed computing device, a disaggregated computing device, a laptop computer, a tablet computer, a notebook computer, a mobile computing device, a smartphone, a wearable computing device, a multi-processor system, a workstation, and/or a consumer electronics device. As shown in FIG. 16, the illustrative computing device 1602 includes a processor 1620, an I/O subsystem 1622, memory 1626, a data storage device 1628, and a communication subsystem 1630. Further, in some embodiments, one or more of the illustrative components may be included in, or otherwise form part of, another component. For example, in some embodiments, the memory 1626, or portions thereof, may be included in the processor 1620.

Processor 1620 may be implemented as any type of processor capable of performing the functions described herein. For example, processor 1620 may be implemented as a single or multi-core processor(s), digital signal processor, microcontroller, or other processor or processing/control circuit. Similarly, memory 1626 may be implemented as any type of volatile or non-volatile memory or data storage device capable of performing the functions described herein. In operation, the memory 1626 may store various data and software used during operation of the computing device 1602, such as operating systems, applications, programs, libraries, and drivers.

Illustratively, the memory 1626 is communicatively coupled to the processor 1620 via the I/O subsystem 1622, which may be implemented as circuitry and/or components to facilitate input/output operations with the processor 1620, the memory 1626, and other components of the computing device 1602. For example, I/O subsystem 1622 may be implemented as or otherwise include a memory controller hub, an input/output control hub, a sensor hub, a host controller, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.), and/or other components and subsystems to facilitate input/output operations. As shown, the I/O subsystem 1622 illustratively includes a PCI Express (PCIe) root complex (RC) 1624. RC 1624 may include one or more root ports, PCIe links, PCIe switches, and/or other components that may be used to transfer I/O data between a host system (e.g., processor 1620 and/or memory 1626) of computing device 1602 and one or more I/O devices. In some embodiments, the memory 1626 may be directly coupled to the processor 1620, such as via an integrated memory controller hub or data port. Further, in some embodiments, I/O subsystem 1622 may form part of a system on a chip (SoC) and be included on a single integrated circuit chip with processor 1620, memory 1626, and other components of computing device 1602.

Data storage device 1628 may be implemented as one or more devices of any type configured for short-term or long-term storage of data, such as memory devices and circuits, memory cards, hard drives, solid-state drives, non-volatile flash memory, or other data storage devices. Computing device 1602 may also include a communication subsystem 1630, which may be implemented as any network interface controller (NIC), communication circuit, device, or collection thereof that enables communication between computing device 1602 and other remote devices via a computer network (not shown). Communication subsystem 1630 may be configured to use any one or more communication technologies (e.g., wired or wireless communication) and associated protocols (e.g., Ethernet, WiMAX, 3G, 4G LTE, etc.) to enable such communication.

As shown, computing device 1602 also includes soft endpoint 1632 and load transfer complex 1634. As described further below, soft endpoint 1632 is coupled to RC1624 and to load transfer complex 1634, e.g., via one or more PCIe lanes. Soft endpoint 1632 receives I/O transactions from RC1624 and may process the I/O transactions and/or provide the I/O transactions to load transfer complex 1634. Load transfer complex 1634 performs further processing of the I/O transaction, for example, by performing bare metal virtualization (e.g., virtualizing multiple storage devices, network devices, or other devices). Load transfer complex 1634 is also coupled to communication subsystem 1630 and may thus communicate with one or more remote devices 1604. One possible embodiment of soft endpoint 1632 and load transfer complex 1634 is described below in connection with FIG. 17.

Similarly, each remote device 1604 may be implemented as any type of device capable of performing the functions described herein. For example, each remote device 1604 may be implemented as, but is not limited to, a carriage, a computing carriage, an accelerator carriage, a storage carriage, a computer, a server, a distributed computing device, a disaggregated computing device, a laptop computer, a tablet computer, a notebook computer, a mobile computing device, a smartphone, a wearable computing device, a multi-processor system, a workstation, and/or a consumer electronics device. As such, each remote device 1604 may include components and features similar to those of the computing device 1602, such as a processor, I/O subsystem, memory, data storage device, communication subsystem, or other components of a storage carriage. As shown, each remote device 1604 may include remote storage 1640 that is accessible, for example, by load transfer complex 1634 for bare metal virtualization.

As discussed in more detail below, the computing device 1602 and the remote device 1604 may be configured to send and receive data with each other and/or with other devices of the system 1600 over the network 1606. Network 1606 may be implemented as any number of various wired and/or wireless networks. For example, network 1606 may be implemented as or otherwise include a wired or wireless Local Area Network (LAN), and/or a wired or wireless Wide Area Network (WAN). As such, network 1606 may include any number of additional devices, such as additional computers, routers, and switches, to facilitate communication between devices of system 1600.

Referring now to FIG. 17, diagram 1700 illustrates one possible embodiment of soft endpoint 1632 and load transfer complex 1634. As shown, soft endpoint 1632 includes endpoint interface 1702, one or more soft cores 1704, DMA engine 1706, memory 1708, and endpoint interface 1710. Endpoint interface 1702 is coupled to the root complex (RC) 1624 and may be implemented as any communication circuit or other component for communicating with RC 1624 over a PCIe link. For example, the endpoint interface 1702 may be implemented as or otherwise include a PCIe physical layer, a PCIe data link layer, and a PCIe transaction layer.

Each soft core 1704 may be implemented as a programmable element, such as a state machine, microcontroller, microprocessor, or other computing resource. As described further below, the soft core 1704 may be configured to emulate a PCIe endpoint hierarchy, process PCIe transactions, and perform other tasks. DMA engine 1706 may be implemented as a DMA controller or other component capable of performing DMA transactions (e.g., reads and/or writes) to transfer data between memory 1626 and soft endpoint 1632 and/or between memory 1626 and load transfer complex 1634. The memory 1708 may be implemented as any volatile or non-volatile memory or data storage device capable of performing the functions described herein. In operation, memory 1708 may store various data and software used during operation of soft endpoint 1632, such as firmware and data processed by soft core 1704.

Endpoint interface 1710 is coupled to load transfer complex 1634 and may be implemented as any communication circuitry or other component for communicating with load transfer complex 1634. As shown, endpoint interface 1710 is coupled to root complex 1712 of load transfer complex 1634. Thus, endpoint interface 1710 may communicate with load transfer complex 1634 over PCIe and thus may be implemented as or otherwise include a PCIe physical layer, a PCIe data link layer, and a PCIe transaction layer.

As shown, load transfer complex 1634 includes root complex 1712, plurality of processor cores 1714, memory 1716, and root complex 1718. As described above, root complex 1712 is coupled to endpoint interface 1710 of soft endpoint 1632. Similar to RC1624, RC 1712 may include one or more root ports, PCIe links, PCIe switches, and/or other components that may be used to transfer I/O data between load transfer complex 1634 and soft endpoint 1632.

Each processor core 1714 may be implemented as any type of processor core capable of performing the functions described herein, such as a single or multi-core processor(s), digital signal processor, microcontroller, or other processor or processing/control circuitry. Processor core 1714 may execute instructions from the same instruction set architecture (ISA) as processor 1620 or from a different ISA. For example, in some embodiments, processor core 1714 may be implemented as a [...] core. In other embodiments, processor core 1714 may be implemented as an ARM core. Memory 1716 may be implemented as any type of volatile or non-volatile memory or data storage device capable of performing the functions described herein. In operation, memory 1716 may store various data and software used during operation of load transfer complex 1634, such as operating systems, applications, programs, libraries, and drivers.

Root complex 1718 is coupled to a communication subsystem 1630 (e.g., to a network interface controller) and may be implemented as any communication circuitry or other component for communicating with communication subsystem 1630. For example, root complex 1718 may include one or more root ports, PCIe links, PCIe switches, and/or other components that may be used to transfer I/O data between load transfer complex 1634 and communication subsystem 1630.

While shown as separate components in FIG. 17, it should be understood that in some embodiments soft endpoint 1632 and load transfer complex 1634 may be included in the same component and/or with other components. For example, in some embodiments, soft endpoint 1632 and load transfer complex 1634 may be implemented as separate dies included in the same computer chip. In these embodiments, the chips comprising soft endpoint 1632 and load transfer complex 1634 may be included in a multi-chip package with a NIC (e.g., communication subsystem 1630), FPGA, or other component.

Referring now to FIG. 18, in an illustrative embodiment, the computing device 1602 establishes an environment 1800 during operation. The illustrative environment 1800 includes an application 1802, a driver 1804, a transaction layer 1806, a flexible endpoint 1808, an intelligent filter 1810, a retry manager 1812, and an endpoint interface 1814. The various components of environment 1800 may be implemented as hardware, firmware, software, or combinations thereof. As such, in some embodiments, one or more components of environment 1800 may be implemented as a circuit or collection of electrical devices (e.g., application circuit 1802, driver circuit 1804, transaction layer circuit 1806, flexible endpoint circuit 1808, intelligent filter circuit 1810, retry manager circuit 1812, and/or endpoint interface circuit 1814). It should be appreciated that in such embodiments, one or more of application circuit 1802, driver circuit 1804, transaction layer circuit 1806, flexible endpoint circuit 1808, intelligent filter circuit 1810, retry manager circuit 1812, and/or endpoint interface circuit 1814 may form part of processor 1620, I/O subsystem 1622, soft endpoint 1632, and/or other components of computing device 1602. Further, in some embodiments, one or more of the illustrative components may form a portion of another component, and/or one or more of the illustrative components may be independent of each other.

The application 1802 can be implemented as any user application, system application, or other application executed by the computing device 1602. Similarly, the driver 1804 may be implemented as any device driver, operating system, virtual machine monitor, and/or hypervisor that controls or otherwise communicates I/O data with the soft endpoint 1632. The driver 1804 may communicate with the soft endpoint 1632 using one or more standardized device protocols (e.g., NVM Express (NVMe), VirtIO, Intel Adaptive Virtual Function (AVF), or other protocols). For example, the driver 1804 may be implemented as a local storage driver, a network device driver, or another device driver. The application 1802 may access services provided by soft endpoint 1632 and/or load transfer complex 1634 via the driver 1804.

Flexible endpoint 1808 is configured to program endpoint firmware of soft endpoint 1632. Flexible endpoint 1808 is also configured to establish a simulated endpoint hierarchy based on endpoint firmware in response to programming the endpoint firmware. Establishing an endpoint hierarchy may include emulating one or more PCI express switches, PCI express bridges, and/or PCI express endpoints (devices/functions).

The transaction layer 1806 is configured to receive an I/O transaction originating from the root complex 1624 of the computing device 1602 in response to the simulated endpoint hierarchy being established. The I/O transaction may be implemented as a PCI Express transaction layer packet (TLP). Transaction layer 1806 may also be configured to receive a retry I/O transaction originating from root complex 1624. The retry I/O transaction is associated with a previously received I/O transaction.

Intelligent filter 1810 is configured to determine whether to process a received I/O transaction. Intelligent filter 1810 may make this determination based on the type of the TLP or the address of the TLP. Intelligent filter 1810 is also configured to process the I/O transaction if it determines that the I/O transaction is to be processed. Processing the I/O transaction may include copying data associated with the I/O transaction to memory of soft endpoint 1632 or copying data associated with the I/O transaction to memory of load transfer complex 1634.
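The filter decision can be sketched as a predicate over the TLP type and address. The disclosure does not say which types or addresses are handled locally, so the policy below (configuration TLPs and one address window handled by the soft endpoint, everything else forwarded) is purely an assumed example, as are all names in the code.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TlpType(Enum):
    MEM_READ = auto()
    MEM_WRITE = auto()
    CFG_READ = auto()
    CFG_WRITE = auto()


@dataclass(frozen=True)
class Tlp:
    type: TlpType
    address: int
    payload: bytes = b""


# Assumed policy: configuration TLPs are always handled by the soft endpoint.
LOCAL_TYPES = {TlpType.CFG_READ, TlpType.CFG_WRITE}
# Assumed policy: half-open [start, end) address windows handled locally.
LOCAL_WINDOWS = [(0x1000, 0x2000)]


def should_process_locally(tlp: Tlp) -> bool:
    """Decide from the TLP type or address whether the soft endpoint handles it."""
    if tlp.type in LOCAL_TYPES:
        return True
    return any(start <= tlp.address < end for start, end in LOCAL_WINDOWS)
```

A TLP for which `should_process_locally` returns `False` would be forwarded to the load transfer complex.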

Endpoint interface 1814 is configured to forward the I/O transaction to load transfer complex 1634 of the computing device in response to determining not to process the I/O transaction. Forwarding the I/O transaction may include encapsulating the I/O transaction with metadata, such as a size of the I/O transaction, an original time of the I/O transaction, a start of the I/O transaction, an end of the I/O transaction, or an indication of a last transaction in the queue. Endpoint interface 1814 is also configured to receive a response from load transfer complex 1634 in response to forwarding the I/O transaction to load transfer complex 1634.
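Encapsulation with metadata can be sketched as prepending a small fixed header to the raw transaction bytes. The header layout (32-bit size, 64-bit timestamp, 8-bit flags), the flag values, and the function names are assumptions for illustration; the disclosure lists the kinds of metadata but not a wire format.

```python
import struct

# Assumed header layout: size (uint32), timestamp in ns (uint64), flags (uint8),
# little-endian with no padding.
HEADER = struct.Struct("<IQB")

# Assumed flag bits for start, end, and last transaction in the queue.
FLAG_START = 0x1
FLAG_END = 0x2
FLAG_LAST_IN_QUEUE = 0x4


def encapsulate(tlp_bytes: bytes, timestamp_ns: int, flags: int) -> bytes:
    """Prefix the raw transaction with its metadata header."""
    return HEADER.pack(len(tlp_bytes), timestamp_ns, flags) + tlp_bytes


def decapsulate(frame: bytes):
    """Split a frame back into (transaction bytes, timestamp, flags)."""
    size, timestamp_ns, flags = HEADER.unpack_from(frame)
    body = frame[HEADER.size:HEADER.size + size]
    return body, timestamp_ns, flags


frame = encapsulate(b"\xde\xad\xbe\xef", 123456789, FLAG_START | FLAG_END)
print(decapsulate(frame))
```

The receiver on the load transfer complex side would use the size field to recover the original transaction and the flags to detect queue boundaries.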

Retry manager 1812 is configured to store responses received from load transfer complex 1634 in a history buffer at soft endpoint 1632. The retry manager 1812 is also configured to retrieve a response from the history buffer in response to receiving a retry I/O transaction.

Referring now to FIGS. 19 and 20, in use, a computing device 1602 may perform a method 1900 for flexible endpoint acceleration. It is to be appreciated that in some embodiments, the operations of the method 1900 may be performed by one or more components of the environment 1800 of the computing device 1602 (e.g., the soft endpoint 1632) as shown in FIG. 18. Method 1900 begins in block 1902, in which soft endpoint 1632 programs the endpoint firmware with a soft endpoint configuration. Endpoint firmware may be received from the host using any suitable technique. For example, the firmware may be provisioned by the processor 1620 via the root complex (RC) 1624, or the firmware may be provisioned out-of-band. The endpoint firmware may be implemented as stored instructions or other data that are processed by the soft core 1704 or other programmable elements of the soft endpoint 1632. Endpoint firmware may be stored, for example, in memory 1708 or other volatile or non-volatile storage of soft endpoint 1632.

In block 1904, soft endpoint 1632 establishes an emulated endpoint hierarchy for the soft endpoint. The endpoint hierarchy may include any number, type, and/or arrangement of emulated I/O devices. To establish the emulated endpoint hierarchy, soft endpoint 1632 may emulate configuration spaces, I/O links, devices, functions, virtual functions, and/or other components of the endpoint hierarchy. One example of an emulated endpoint hierarchy is illustrated in FIG. 22 and described further below. In some embodiments, soft endpoint 1632 emulates one or more PCIe switches in block 1906. Each emulated switch includes an upstream port and one or more downstream ports, and may include a plurality of virtual PCI-PCI bridges. In some embodiments, soft endpoint 1632 emulates one or more PCIe bridges in block 1908. A PCIe bridge connects the PCIe hierarchy to a different I/O interconnect, such as PCI or PCI-X. In some embodiments, in block 1910, the soft endpoint 1632 emulates one or more PCIe endpoints. Each endpoint may include a PCIe device and one or more functions (i.e., physical functions and/or virtual functions). Soft endpoint 1632 may emulate various types of endpoints, including NVMe endpoints, VirtIO endpoints, Adaptive Virtual Function (AVF) endpoints, or other endpoints.
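As an illustration of configuration space emulation, the following is a minimal Python sketch, not the disclosed firmware. It models one emulated function as a 256-byte conventional PCI configuration space with the vendor ID at offset 0x00 and the device ID at offset 0x02, both little-endian. The class name `EmulatedFunction`, its methods, and the ID values in the usage note are hypothetical, and read-only register semantics are deliberately ignored.

```python
import struct


class EmulatedFunction:
    """Hypothetical model of one emulated PCIe function's configuration space."""

    CONFIG_SIZE = 256  # conventional PCI configuration space, in bytes

    def __init__(self, vendor_id, device_id):
        self.config = bytearray(self.CONFIG_SIZE)
        # Vendor ID lives at offset 0x00 and device ID at 0x02, little-endian.
        struct.pack_into("<HH", self.config, 0x00, vendor_id, device_id)

    def _check(self, offset, length):
        if offset < 0 or length < 1 or offset + length > self.CONFIG_SIZE:
            raise ValueError("access outside emulated configuration space")

    def config_read(self, offset, length=4):
        """Return `length` bytes at `offset` as a little-endian integer."""
        self._check(offset, length)
        return int.from_bytes(self.config[offset:offset + length], "little")

    def config_write(self, offset, value, length=4):
        """Store `value` at `offset`; this sketch treats every byte as writable."""
        self._check(offset, length)
        self.config[offset:offset + length] = value.to_bytes(length, "little")
```

For example, `EmulatedFunction(0x1234, 0x5678).config_read(0)` returns `0x56781234`, the device ID in the upper half and the vendor ID in the lower half of the first doubleword.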

In block 1912, the soft endpoint 1632 receives a transaction from the host at the transaction layer of the I/O interconnect. The transaction may originate, for example, at root complex 1624 of computing device 1602. In some embodiments, in block 1914, soft endpoint 1632 may receive a PCI express Transaction Layer Packet (TLP) from the host.

In block 1916, the soft endpoint 1632 determines whether the received transaction is a retried transaction from the host. As described further below, in some embodiments, the host may retry the transaction after experiencing a timeout or other error. Soft endpoint 1632 may use any suitable technique to identify the retried transaction. For example, soft endpoint 1632 may store the outstanding transactions in a history buffer or other data structure, and may determine whether the transaction matches an outstanding transaction. As another example, a retried transaction may reference a parallel address space. If the transaction is a retried transaction, method 1900 proceeds to block 1918, described below. If the transaction is not a retried transaction, method 1900 branches to block 1920.
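The two retry-identification techniques mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the window base addresses, the window size, the `Transaction` fields, and the `RetryDetector` class are all hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass

# Hypothetical address map: a retry window that mirrors the primary window.
PRIMARY_BASE = 0x1000_0000
RETRY_BASE = 0x2000_0000
WINDOW_SIZE = 0x0010_0000


@dataclass(frozen=True)
class Transaction:
    kind: str      # e.g. "MemRd" or "MemWr" (illustrative labels)
    address: int


class RetryDetector:
    def __init__(self):
        # Keys of transactions that were received and not yet completed.
        self.outstanding = set()

    def is_retry(self, txn):
        # Technique 2: anything addressed to the parallel window is a retry.
        if RETRY_BASE <= txn.address < RETRY_BASE + WINDOW_SIZE:
            return True
        # Technique 1: a transaction matching an outstanding one is a retry.
        key = (txn.kind, txn.address)
        if key in self.outstanding:
            return True
        self.outstanding.add(key)
        return False

    def complete(self, txn):
        """Forget a transaction once its response has reached the host."""
        self.outstanding.discard((txn.kind, txn.address))
```

Matching on kind and address alone is an assumption of this sketch; a real design might also compare tags or requester IDs.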

In block 1920, soft endpoint 1632 determines whether it is to process the transaction itself; that is, whether the transaction is to be processed at the soft endpoint 1632. Soft endpoint 1632 may use any filtering policy or other criteria to determine whether to process the transaction. In some embodiments, in block 1922, the soft endpoint 1632 may filter the transaction based on its transaction type. The transaction types may include PCIe transaction types such as configuration read, configuration write, memory read, memory write, I/O read, I/O write, and/or message. For example, in some embodiments, soft endpoint 1632 may handle configuration reads and writes and/or memory reads and writes, and may not handle other types of transactions. As another example, soft endpoint 1632 may handle all types of transactions. As another example, soft endpoint 1632 may not process any type of transaction. In some embodiments, in block 1924, soft endpoint 1632 may filter the transaction based on the target address of the transaction. For example, soft endpoint 1632 may process transactions with addresses in the lower 64 bytes of the configuration space and may not process transactions with higher addresses. Of course, as described above, the soft endpoint 1632 may evaluate or otherwise consider multiple filtering criteria in combination.
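The filtering decision of blocks 1920 to 1924 can be sketched in Python as below. The policy shown (handle only configuration reads and writes locally, and only when they target the lower 64 bytes of configuration space) is one combination of the examples given above, chosen for illustration. The names `TlpType`, `Tlp`, `LOCAL_TYPES`, `LOCAL_CFG_LIMIT`, and `handle_locally` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TlpType(Enum):
    CFG_READ = auto()
    CFG_WRITE = auto()
    MEM_READ = auto()
    MEM_WRITE = auto()
    IO_READ = auto()
    IO_WRITE = auto()
    MESSAGE = auto()


@dataclass(frozen=True)
class Tlp:
    type: TlpType
    address: int


# Assumed policy: only configuration accesses are candidates for local handling.
LOCAL_TYPES = frozenset({TlpType.CFG_READ, TlpType.CFG_WRITE})
# Assumed policy: only the lower 64 bytes of configuration space stay local.
LOCAL_CFG_LIMIT = 64
_CFG_TYPES = (TlpType.CFG_READ, TlpType.CFG_WRITE)


def handle_locally(tlp, local_types=LOCAL_TYPES, cfg_limit=LOCAL_CFG_LIMIT):
    """Return True to process at the soft endpoint, False to forward."""
    if tlp.type not in local_types:       # block 1922: filter by type
        return False
    if tlp.type in _CFG_TYPES:            # block 1924: filter by address
        return 0 <= tlp.address < cfg_limit
    return True
```

Passing an empty `local_types` set forwards every transaction, and passing every `TlpType` member handles all non-configuration types locally, which mirrors the "no transactions" and "all transactions" policies described above.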

In block 1926, soft endpoint 1632 checks whether the transaction should be handled by soft endpoint 1632 itself. If so, the method 1900 branches to block 1928, which is shown in FIG. 20. If soft endpoint 1632 determines not to process the transaction, method 1900 branches to block 1934, which is also shown in FIG. 20.

In block 1928, shown in FIG. 20, soft endpoint 1632 processes the transaction using the resources of soft endpoint 1632. Soft endpoint 1632 may perform computational tasks, Direct Memory Access (DMA) transfers, or other operations that are offloaded from load transfer complex 1634. The soft endpoint 1632 may perform one or more endpoint emulation, configuration space emulation, or other processing tasks, for example, using the soft core 1704 of the soft endpoint 1632. Soft core 1704 may be programmed with soft endpoint firmware. In some embodiments, in block 1930, soft endpoint 1632 may copy data associated with the transaction to soft endpoint 1632. For example, soft endpoint 1632 may use DMA engine 1706 of soft endpoint 1632 to copy data identified by the transaction from memory 1626 to memory 1708 of soft endpoint 1632. Managing DMA transfers at soft endpoint 1632 may offload DMA processing from load transfer complex 1634. In some embodiments, in block 1932, soft endpoint 1632 may copy data associated with the transaction to load transfer complex 1634. For example, soft endpoint 1632 may use DMA engine 1706 of soft endpoint 1632 to copy data identified by the transaction from memory 1626 to memory 1716 of load transfer complex 1634. Managing DMA transfers at soft endpoint 1632 may offload DMA processing from load transfer complex 1634. After processing the transaction, method 1900 proceeds to block 1944, described below.

Referring back to block 1926 shown in FIG. 19, if soft endpoint 1632 determines not to process the transaction, method 1900 branches to block 1934, which is shown in FIG. 20. In block 1934, soft endpoint 1632 passes the transaction to load transfer complex 1634 via endpoint interface 1710. The soft endpoint 1632 may pass the transaction to the load transfer complex 1634 as a PCIe TLP, for example. In some embodiments, in block 1936, soft endpoint 1632 may encapsulate the transaction with metadata. For example, soft endpoint 1632 may add one or more headers to the PCIe TLP as it is transferred to load transfer complex 1634. The metadata may include processing hints or other additional information that may be used by load transfer complex 1634 to accelerate processing. The metadata may include, for example, processing hints about: the size of the transaction, the original time of the transaction, the end of the transaction, the start of a new transaction, and whether the transaction is the last in the queue.
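One way to picture the encapsulation of block 1936 is a fixed header prepended to the raw TLP bytes. The sketch below is an assumption made for illustration only: the disclosure does not define a header layout, so the 13-byte format (32-bit size, 64-bit origin time in nanoseconds, 8-bit flags), the flag values, and the function names are all hypothetical.

```python
import struct

# Hypothetical header: size (u32), origin time in ns (u64), flags (u8).
# "<" selects little-endian with no padding, so the header is 13 bytes.
_HEADER = struct.Struct("<IQB")

FLAG_START = 0x01          # start of a new transaction
FLAG_END = 0x02            # end of the transaction
FLAG_LAST_IN_QUEUE = 0x04  # last transaction in the queue


def encapsulate(tlp, origin_ns, flags=0):
    """Prepend the metadata header to the raw TLP bytes."""
    return _HEADER.pack(len(tlp), origin_ns, flags) + tlp


def decapsulate(frame):
    """Split a frame into (metadata dict, TLP bytes)."""
    size, origin_ns, flags = _HEADER.unpack_from(frame, 0)
    payload = frame[_HEADER.size:_HEADER.size + size]
    if len(payload) != size:
        raise ValueError("frame is shorter than its declared size")
    return {"size": size, "origin_ns": origin_ns, "flags": flags}, payload
```

The same `decapsulate` helper could strip a header from a response travelling in the other direction, provided both sides agree on the layout.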

Load transfer complex 1634 may process the transaction as it is received. For example, load transfer complex 1634 may execute a storage stack that executes one or more NVMe commands or other storage commands based on the transaction. Load transfer complex 1634 may access remote storage 1640 on one or more remote devices 1604 to process transactions. In block 1938, soft endpoint 1632 may receive a response from load transfer complex 1634 after load transfer complex 1634 completes processing the transaction. The response may be implemented as a PCIe completion, TLP, message, interrupt or other data received by soft endpoint 1632 from load transfer complex 1634. In some embodiments, soft endpoint 1632 may decapsulate the metadata from the response in block 1940. For example, soft endpoint 1632 may strip one or more headers from a PCIe TLP or completion received from load transfer complex 1634. In some embodiments, soft endpoint 1632 may store the response in a history buffer in block 1942. As described further below, in some embodiments, load transfer complex 1634 may return a response that is valid but not received until after a timeout set by the host. In these embodiments, rather than resubmit the transaction to load transfer complex 1634, soft endpoint 1632 may cache the response in a history buffer and retrieve the response locally. After receiving the response from load transfer complex 1634, method 1900 proceeds to block 1944.
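The history buffer of block 1942 can be modeled as a bounded cache of responses. The following Python sketch assumes a fixed capacity with oldest-first eviction and a caller-supplied transaction key; neither detail is specified in the disclosure, and the `HistoryBuffer` class is hypothetical.

```python
from collections import OrderedDict


class HistoryBuffer:
    """Bounded cache of responses, evicting the oldest entry when full."""

    def __init__(self, capacity=64):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._entries = OrderedDict()

    def store(self, key, response):
        """Record the response for a transaction key (block 1942)."""
        self._entries[key] = response
        self._entries.move_to_end(key)          # newest entry goes last
        while len(self._entries) > self._capacity:
            self._entries.popitem(last=False)   # drop the oldest entry

    def retrieve(self, key):
        """Return the cached response, or None if it is not present."""
        return self._entries.get(key)
```

With such a buffer, a response that arrives after the host's timeout is still available locally when the host retries, so the transaction need not be resubmitted to load transfer complex 1634.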

In block 1944, the soft endpoint 1632 forwards the response to the host. The response may be, for example, issued from an upstream port of endpoint interface 1702 of soft endpoint 1632 and may be addressed to root complex 1624, memory 1626, or another component of computing device 1602. The response may be generated by soft endpoint 1632 processing the transaction itself, as described above in connection with block 1928, and/or the response may be received from load transfer complex 1634, as described above in connection with block 1938. After forwarding the response, the method 1900 loops back to block 1912 shown in FIG. 19 to continue processing transactions.

Referring back to block 1916 of FIG. 19, if soft endpoint 1632 determines that the transaction was retried, method 1900 proceeds to block 1918, where soft endpoint 1632 returns to the host the response from the history buffer that corresponds to the retried transaction. As described above, the responses stored in the history buffer were previously received from load transfer complex 1634. By forwarding the response from the local history buffer of soft endpoint 1632 instead of resubmitting the transaction to load transfer complex 1634, soft endpoint 1632 may improve performance for retried transactions. After forwarding the response, the method 1900 loops back to block 1912 to continue processing transactions.
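Taken together, the branches of method 1900 can be summarized as a single dispatch function. The Python sketch below is a simplification under stated assumptions: a plain dictionary stands in for the history buffer, a retry is recognized only by its key already being present there, and the callables `is_local`, `process_local`, and `forward` are hypothetical stand-ins for the filter, the local processing path, and the load transfer complex.

```python
def handle_transaction(txn, history, is_local, process_local, forward):
    """Handle one host transaction and return the response for the host.

    txn must be hashable; history is a dict acting as the history buffer.
    """
    if txn in history:                 # blocks 1916 and 1918: retried
        return history[txn]
    if is_local(txn):                  # blocks 1920 and 1926: filter
        return process_local(txn)      # block 1928: process locally
    response = forward(txn)            # blocks 1934 to 1938: forward
    history[txn] = response            # block 1942: remember the response
    return response                    # block 1944: return to the host
```

Calling this function twice with the same forwarded transaction invokes `forward` only once; the second call is served from `history`.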

Referring now to FIG. 21, in use, a computing device 1602 can execute a method 2100 for transaction retry. It is to be appreciated that in some embodiments, the operations of the method 2100 can be performed by one or more components of the environment 1800 of the computing device 1602 as shown in FIG. 18 (e.g., the driver 1804 or a hardware or software intermediary layer between the driver 1804 and the soft endpoint 1632). The method 2100 begins in block 2102, in which the computing device 1602 submits an I/O transaction to the soft endpoint 1632. As described above, the transaction may originate at the root complex 1624 of the computing device 1602, and the transaction may be implemented as a PCI express Transaction Layer Packet (TLP).

In block 2104, the computing device 1602 determines whether a timeout associated with the transaction has occurred. For example, the computing device 1602 may determine whether a response associated with the transaction was received from the soft endpoint 1632 before expiration of a predetermined time limit. If a timeout has not occurred (i.e., if a response is received or the transaction is otherwise successfully completed), the method 2100 loops back to block 2102 to submit additional transactions to the soft endpoint 1632. If a timeout occurs, the method 2100 proceeds to block 2106.

In block 2106, computing device 1602 submits the retry transaction to soft endpoint 1632. Similar to the original transaction, the retry transaction may be a PCIe TLP originating from the root complex 1624. In some embodiments, the retry transaction may have the same address or otherwise be the same as the original transaction. In these embodiments, soft endpoint 1632 may identify the retry transaction, for example, by maintaining a list or other data structure of outstanding transactions. In some embodiments, in block 2108, the computing device 1602 can submit the retry transaction in an address space parallel to that of the original transaction. For example, a retry transaction may use a different base address but use the same offset as the original transaction. In these embodiments, soft endpoint 1632 may identify retry transactions as those addressed to the parallel address space. After submitting the retry transaction, method 2100 loops back to block 2102 to submit additional transactions to soft endpoint 1632.
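The host-side flow of method 2100 with a parallel address space can be sketched as follows. The base addresses, the window size, the convention that the `submit` callable returns `None` on timeout, and the `max_retries` parameter are all assumptions made for this illustration.

```python
# Hypothetical address map: a retry window that mirrors the primary window.
PRIMARY_BASE = 0x1000_0000
RETRY_BASE = 0x2000_0000
WINDOW_SIZE = 0x0010_0000


def retry_address(address):
    """Map a primary-window address to the same offset in the retry window."""
    offset = address - PRIMARY_BASE
    if not 0 <= offset < WINDOW_SIZE:
        raise ValueError("address is outside the primary window")
    return RETRY_BASE + offset


def submit_with_retry(submit, address, max_retries=1):
    """Submit a transaction (block 2102) and retry on timeout (block 2106).

    submit(address) is assumed to return the response, or None on timeout.
    """
    response = submit(address)
    attempts = 0
    while response is None and attempts < max_retries:
        # Block 2108: same offset, different base address.
        response = submit(retry_address(address))
        attempts += 1
    return response
```

Because the retry keeps the original offset, the soft endpoint can recover the original address by subtracting the retry base and adding the primary base.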

Referring now to FIG. 22, diagram 2200 illustrates one possible embodiment of an emulated device hierarchy that may be established by soft endpoint 1632. As shown, the emulated device hierarchy may include a scalable I/O virtualization (SIOV) NVMe endpoint 2202 and PCIe switches 2204, 2206 coupled to a root complex 1624. The SIOV NVMe endpoint 2202 includes a number of physical functions. Switch 2204 is coupled to a single-root I/O virtualization (SRIOV) NVMe endpoint 2208 that includes multiple physical functions and multiple virtual functions, and to a SIOV VirtIO endpoint 2210 that includes multiple physical functions. Switch 2206 is coupled to SRIOV VirtIO endpoint 2212, which includes a plurality of physical functions and a plurality of virtual functions. All elements 2202, 2204, 2206, 2208, 2210, 2212 are emulated by soft endpoint 1632 and thus may be programmed in firmware or otherwise flexibly configured by computing device 1602.
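The hierarchy of FIG. 22 can be written down as a small tree, which is roughly the kind of description endpoint firmware might carry. The following Python sketch uses the reference numerals from the description as node identifiers; the `Node` class and the helper functions are hypothetical, and the per-endpoint physical and virtual functions are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    ref: int                 # reference numeral from the description
    kind: str
    children: list = field(default_factory=list)


def fig22_hierarchy():
    """Build the example hierarchy rooted at root complex 1624."""
    return Node(1624, "root complex", [
        Node(2202, "SIOV NVMe endpoint"),
        Node(2204, "PCIe switch", [
            Node(2208, "SRIOV NVMe endpoint"),
            Node(2210, "SIOV VirtIO endpoint"),
        ]),
        Node(2206, "PCIe switch", [
            Node(2212, "SRIOV VirtIO endpoint"),
        ]),
    ])


def emulated_refs(node):
    """List, depth first, the numerals of every element below `node`."""
    refs = []
    for child in node.children:
        refs.append(child.ref)
        refs.extend(emulated_refs(child))
    return refs
```

Here `emulated_refs(fig22_hierarchy())` lists the six emulated elements 2202 through 2212; the root complex itself is not emulated and is therefore excluded.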

Examples of the invention

Illustrative examples of the techniques disclosed herein are provided below. Embodiments of the technology may include any one or more of the examples described below, as well as any combination thereof.

Example 1 includes a soft endpoint for I/O communications, the soft endpoint comprising: a flexible endpoint for establishing a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint; a transaction layer to receive I/O transactions originating from a root port of a computing device in response to establishment of the simulated endpoint hierarchy; an intelligent filter to (i) determine whether the I/O transaction is to be processed, and (ii) process the I/O transaction in response to determining that the I/O transaction is to be processed; and an endpoint interface to forward the I/O transaction to a load transfer complex of the computing device in response to determining not to process the I/O transaction.

Example 2 includes the subject matter of example 1, and wherein the flexible endpoint is further to: programming endpoint firmware for the soft endpoint, wherein establishing the endpoint hierarchy comprises establishing the endpoint hierarchy in response to programming the endpoint firmware.

Example 3 includes the subject matter of any of examples 1 and 2, and wherein establishing the endpoint hierarchy comprises: emulating a PCI express switch, a PCI express bridge, or a PCI express endpoint.

Example 4 includes the subject matter of any of examples 1-3, and wherein the I/O transaction comprises a PCI express transaction layer packet.

Example 5 includes the subject matter of any of examples 1-4, and wherein determining whether to process the I/O transaction comprises: determining whether to process the I/O transaction based on the type of the transaction layer packet.

Example 6 includes the subject matter of any of examples 1-5, and wherein determining whether to process the I/O transaction comprises: determining whether to process the I/O transaction based on an address of the transaction layer packet.

Example 7 includes the subject matter of any of examples 1-6, and wherein processing the I/O transaction includes: copying data associated with the I/O transaction to a memory of the soft endpoint.

Example 8 includes the subject matter of any of examples 1-7, and wherein processing the I/O transaction includes: copying data associated with the I/O transaction to a memory of the load transfer complex.

Example 9 includes the subject matter of any of examples 1-8, and wherein forwarding the I/O transaction to the load transfer complex comprises: encapsulating the I/O transaction with metadata.

Example 10 includes the subject matter of any of examples 1-9, and wherein the metadata includes a size of the I/O transaction, an original time of the I/O transaction, a start of the I/O transaction, an end of the I/O transaction, or an indication of a last transaction in a queue.

Example 11 includes the subject matter of any of examples 1-10, and further comprising a retry manager, wherein the endpoint interface is further to receive a response from the load transfer complex in response to forwarding the I/O transaction to the load transfer complex; and the retry manager is to store the response in a history buffer of the soft endpoint.

Example 12 includes the subject matter of any of examples 1-11, and wherein the transaction layer is to receive a retry I/O transaction originating from the root port, wherein the retry I/O transaction is associated with the I/O transaction; and the retry manager is to retrieve the response from the history buffer in response to receipt of the retry I/O transaction.

Example 13 includes a method for I/O communication, the method comprising: establishing, by a soft endpoint of a computing device, a simulated endpoint hierarchy based on endpoint firmware of the soft endpoint, wherein the soft endpoint is coupled to a root port of the computing device, and wherein the soft endpoint is further coupled to a load transfer complex of the computing device; receiving, by the soft endpoint, an I/O transaction originating from the root port in response to establishing the simulation endpoint hierarchy; determining, by the soft endpoint, whether to process the I/O transaction; processing, by the soft endpoint, the I/O transaction in response to determining to process the I/O transaction; and forwarding, by the soft endpoint, the I/O transaction to the load transfer complex in response to determining not to process the I/O transaction.

Example 14 includes the subject matter of example 13, and further comprising: programming, by the computing device, endpoint firmware of the soft endpoint, wherein establishing the endpoint hierarchy comprises establishing the endpoint hierarchy in response to programming the endpoint firmware.

Example 15 includes the subject matter of any of examples 13 and 14, and wherein establishing the endpoint hierarchy includes: emulating a PCI express switch, a PCI express bridge, or a PCI express endpoint.

Example 16 includes the subject matter of any of examples 13-15, and wherein receiving the I/O transaction includes: receiving a PCI express transaction layer packet.

Example 17 includes the subject matter of any of examples 13-16, and wherein determining whether to process the I/O transaction includes: determining whether to process the I/O transaction based on the type of the transaction layer packet.

Example 18 includes the subject matter of any of examples 13-17, and wherein determining whether to process the I/O transaction includes: determining whether to process the I/O transaction based on an address of the transaction layer packet.

Example 19 includes the subject matter of any of examples 13-18, and wherein processing the I/O transaction includes: copying data associated with the I/O transaction to a memory of the soft endpoint.

Example 20 includes the subject matter of any of examples 13-19, and wherein processing the I/O transaction includes: copying data associated with the I/O transaction to a memory of the load transfer complex.

Example 21 includes the subject matter of any of examples 13-20, and wherein forwarding the I/O transaction to the load transfer complex comprises: encapsulating the I/O transaction with metadata.

Example 22 includes the subject matter of any of examples 13-21, and wherein the metadata includes a size of the I/O transaction, an original time of the I/O transaction, a start of the I/O transaction, an end of the I/O transaction, or an indication of a last transaction in a queue.

Example 23 includes the subject matter of any of examples 13-22, and further comprising: receiving, by the soft endpoint, a response from the load transfer complex in response to forwarding the I/O transaction to the load transfer complex; and storing, by the soft endpoint, the response in a history buffer of the soft endpoint.

Example 24 includes the subject matter of any of examples 13-23, and further comprising: receiving, by the soft endpoint, a retry I/O transaction originating from the root port, wherein the retry I/O transaction is associated with the I/O transaction; and retrieving, by the soft endpoint, the response from the history buffer in response to receiving the retry I/O transaction.

Example 25 includes a computing device comprising: a processor; and a memory having stored therein a plurality of instructions that, when executed by the processor, cause the computing device to perform the method of any of examples 13-24.

Example 26 includes one or more non-transitory computer-readable storage media comprising a plurality of instructions stored thereon that, in response to being executed, cause a computing device to perform the method of any of examples 13-24.

Example 27 includes a computing device comprising means for performing the method of any of examples 13-24.
