Cache monitoring
Reader's note: this patent, "Cache monitoring" (高速缓存监视), was designed and created by 王任, 李彬, A.J.赫尔里奇 (Herdrich), 戴宗元, and R.胡加哈利 (Huggahalli) on 2019-05-28. Abstract: Disclosed in one example is a computing apparatus, comprising: a processor; a multi-level cache comprising a plurality of cache levels; a peripheral device configured to write data directly to a direct-writable cache; and cache monitoring circuitry comprising a cache counter La that is incremented when a cache line is allocated into the direct-writable cache, a counter Lp that is incremented when a cache line is processed by the processor and deallocated from the direct-writable cache, and a counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory, wherein the cache monitoring circuitry determines a direct write policy based on the cache counters.
1. A computing device, comprising:
a processor;
a multi-level cache comprising a plurality of cache levels;
a peripheral device configured to write data directly to a direct-writable cache; and
cache monitor circuitry comprising a cache counter La that is incremented when a cache line is allocated into the direct-writable cache, a counter Lp that is incremented when the processor processes a cache line and deallocates the cache line from the direct-writable cache, and a counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory, wherein the cache monitor circuitry determines a direct write policy based on the cache counters.
2. The computing device of claim 1, wherein Le is incremented only when the cache line is evicted without being retrieved by the processor.
3. The computing apparatus of claim 1 wherein the directly writable cache is a last-level cache.
4. The computing device of claim 1, wherein the multi-level cache is a non-inclusive cache.
5. The computing device of claim 1, wherein the cache counter is a sample counter configured to sample a set of less than all cache sets.
6. The computing device of claim 5, wherein the cache counter is configured to sample approximately 2% of all cache sets.
7. The computing device of claim 6, wherein the cache monitor circuitry is further configured to reset La, Lp, and Le after a policy window.
8. The computing apparatus of claim 6, wherein the cache monitor circuitry is to allocate all transactions on the sampled set to the direct-writable cache.
9. The computing device of claim 7, wherein the cache monitor circuitry is to allocate gap cache transactions according to a policy.
10. The computing device of claim 1, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and that Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy to write incoming data to the direct-writable cache.
11. The computing device of claim 1, wherein determining the direct write policy comprises determining that Lp is substantially less than La and that Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy to write incoming data directly to the memory.
12. The computing apparatus of claim 1, wherein the peripheral device is an ethernet network interface.
13. The computing apparatus of claim 1, wherein the peripheral device is a cache coherent core-to-core network interface.
14. The computing device of claim 13, wherein the cache coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
15. The computing apparatus of claim 1, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
16. A shared resource monitor comprising:
an interface communicatively coupled to the core;
an interface communicatively coupled to the cache;
an interface communicatively coupled to a peripheral device;
a cache counter La that increments when a cache line is allocated into the cache;
a cache counter Lp that increments when a core processes a cache line and deallocates the cache line from the cache;
a cache counter Le that increments when a cache line is evicted from the direct-writable cache to memory; and
a policy circuit to determine a direct write policy based on the cache counter.
17. The shared resource monitor of claim 16, wherein the cache is a direct writable cache of a multi-level cache.
18. The shared resource monitor of claim 17, wherein the directly writable cache is a last-level cache.
19. The shared resource monitor of claim 17, wherein the multi-level cache is a non-inclusive cache.
20. The shared resource monitor of claim 16, wherein Le is incremented only when a cache line is evicted without being retrieved by a core.
21. The shared resource monitor of claim 16, wherein the cache counter is a sample counter configured to sample less than all of the cache transactions.
22. The shared resource monitor of claim 21, wherein the cache counter is configured to sample approximately 2% of all cache transactions.
23. The shared resource monitor of claim 21, wherein the policy circuit resets La, Lp, and Le after a policy window.
24. The shared resource monitor of claim 21, wherein the cache monitor circuitry allocates all sampled transactions directly to the cache.
25. The shared resource monitor of claim 24, wherein the cache monitor circuitry allocates the gap cache transactions according to a policy.
Technical Field
The present disclosure relates generally to the field of enterprise computing, and more particularly, but not exclusively, to systems and methods for providing cache monitoring.
Background
In some modern data centers, the functionality of a device may not be tied to a particular fixed hardware configuration. Rather, processing, memory, storage, and accelerator functions may in some cases be aggregated from different locations to form virtual "compound nodes." Contemporary networks may include data centers that host a large number of general-purpose hardware server devices contained, for example, in server racks and controlled by supervisory programs. Each hardware device may run one or more instances of a virtual device, such as a workload server or virtual desktop.
Drawings
The present disclosure is best understood from the following detailed description when read with the accompanying drawing figures. It is emphasized that, in accordance with the standard practice in the industry, various features are not necessarily drawn to scale and are used for illustrative purposes only. Where a scale is shown, either explicitly or implicitly, it provides only one illustrative example. In other embodiments, the dimensions of the various features may be arbitrarily increased or decreased for clarity of discussion.
Fig. 1 is a block diagram of a Central Processing Unit (CPU) with shared resource monitoring capabilities according to one or more examples of the present specification.
Fig. 2 is a block diagram illustrating a system of memory transactions without device direct input/output (DDIO) according to one or more examples of this specification.
Fig. 3 is a block diagram of a memory transaction to support and enable DDIOs according to one or more examples of the present specification.
FIGS. 4 and 5 are signal flow diagrams illustrating placement (put) operations that may be used, for example, in a high performance computing fabric such as Intel Omni-Path.
Fig. 6 is a flow diagram of a method of monitoring a cache in accordance with one or more examples of the present description.
FIG. 7 is a block diagram of selected elements of a CPU according to one or more examples of the present specification.
FIG. 8 is a block diagram of a general approach and an I/O approach in a Last Level Cache (LLC) according to one or more examples of the present specification.
Fig. 9 is a block diagram of selected components of a datacenter having connectivity to a network of Cloud Service Providers (CSPs) according to one or more examples of the present application.
FIG. 10 is a block diagram of selected components of an end-user computing device, according to one or more examples of the present specification.
FIG. 11 is a block diagram of components of a computing platform according to one or more examples of the present specification.
Fig. 12 is a block diagram of a CPU according to one or more examples of the present specification.
Fig. 13 is a block diagram of a rack scale design in accordance with one or more examples of the present specification.
Fig. 14 is a block diagram of a Software Defined Infrastructure (SDI) data center according to one or more examples of the present description.
Fig. 15 is a block diagram of a data center according to one or more examples of the present description.
Detailed Description
The following disclosure provides many different embodiments, or examples, for implementing different features of the disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these are merely examples and are not intended to be limiting. Further, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Different embodiments may have different advantages, and no particular advantage is necessarily required of any embodiment.
Modern enterprise computing may include servers, appliances, virtual machines, containers, and other distributed or non-distributed computing systems, among others. In such computing environments, operators may need visibility and control over shared resources, such as Last Level Cache (LLC) and memory bandwidth. To provide those operators with the required visibility into those shared resources, the enterprise CPU may provide Shared Resource Monitors (SRMs), such as Intel Resource Director Technology (RDT). In embodiments of the present description, a CPU cache is provided as a form of computer data storage that allows for later efficient access to recently used information. By way of non-limiting example, a cache may contain frequently used instructions, predicted instructions, recently used data, or predicted data, for example.
Data Direct I/O (DDIO) is an I/O protocol that enables devices such as peripherals (e.g., Network Interface Cards (NICs)) to deliver data directly into a cache level such as the LLC without first accessing main memory, as would occur in a Direct Memory Access (DMA) model. With DDIO, the processor cache (rather than the system's main memory) becomes the primary source and destination of I/O data. This improves the CPU's I/O data processing efficiency for data delivery to and data consumption from the I/O device.
As used in this specification, a NIC is a computer hardware component that enables a computer to connect communicatively with a network. NICs may be used in both wired and wireless computing embodiments, and may be provided as add-in cards that fit into expansion slots of a computer motherboard. By way of non-limiting example, a NIC is also known as a network interface controller card, network adapter card, expansion card, LAN card, or circuit board.
In an example, a device such as an ethernet NIC may send data directly to the processor LLC using DDIO, allocating cache lines as "I/O ways," which are a number of special LLC ways reserved specifically for I/O traffic. This helps to reduce access latency and improve performance and power.
In another embodiment, other network fabric technologies may be used. For example, embodiments may provide a Host Fabric Interface (HFI) for a High Performance Computing (HPC) fabric such as Intel Omni-Path. Such HPC fabrics may provide cache-coherent memory operations between different cores or processors. Like a NIC, an HFI can also write data directly to the LLC using DDIO. In the case of an HFI, the cache operation may be a "write update": if the line is found in the cache, the line is updated in the LLC; but if the line is not found in the LLC, the line is written to system memory. Other embodiments of the HPC fabric may provide an HFI over a PCIe interface. Current versions of PCIe are not cache coherent, so these fabrics may behave like Ethernet NICs (e.g., always writing to a dedicated I/O way).
Both the Ethernet and HPC approaches can suffer problems if, for example, the cache is oversubscribed or undersubscribed. In the case of an Ethernet NIC, when the I/O ingress rate is faster than the CPU consumption rate, data may be evicted from the cache to memory before the CPU consumes it. This not only negates the benefits of DDIO; it actually reduces CPU performance due to unnecessary LLC cache evictions, whereas without DDIO the data might simply have been written directly to memory via DMA. These unnecessary cache evictions may also increase the power consumption of the hardware platform due to the additional block moves required to restore data after it has been evicted from the cache.
In the case of HPC fabrics, the HFI may perform only write updates, which means that for newly arriving data not found in the cache, the data is written back to system memory and subsequently brought into the cache by the CPU, even if the cache is undersubscribed. This can cause unnecessary memory trips, with negative performance and energy impact. In the case of an HFI using a PCIe interface, the conditions are similar to those faced in the Ethernet case.
Problems can arise especially in the case of non-inclusive caches. A strictly inclusive cache is one in which all cache levels are inclusive of one another, such that data found in, for example, the L1 cache must always be found in the L2 and L3 caches. A non-inclusive cache is one in which data in a lower cache level may or may not be found in an upper level. For example, data may be brought into lower cache levels so that it can be consumed by the CPU, and removed from upper cache levels that no longer need it. But in the case of a non-inclusive cache, a write-update DDIO peripheral sees only that the cache line is not in the LLC. It does not know whether the data was evicted by the LLC (e.g., because the processor is bogged down and cannot keep up with I/O operations) or consumed by the core (e.g., because the processor is running far ahead of the I/O operations).
Better performance can be achieved by monitoring LLC utilization so that data for all I/O devices is allocated into the LLC when there is space (e.g., when the processor is staying ahead of I/O operations) and written to memory when there is insufficient space (e.g., when the CPU cannot keep up with incoming I/O). Embodiments of the present specification provide SRMs, such as Intel RDT, comprising a cache monitor having counters La, Lp, and Le. While these three counters do not give the DDIO peripheral visibility into why a cache line is no longer in the LLC (or some other cache level, such as L2 or L1), in the aggregate they can inform a meaningful direct write policy, enabling the DDIO device to know whether the CPU is keeping up (so that data should be DDIO'd into the LLC) or falling behind (so that data should be DMA'd into main memory).
The cache monitor of the present specification monitors the use of I/O ways in the LLC, including in non-inclusive caches. This may advantageously provide a unified architecture for adaptive DDIO for ethernet, HPC fabric, other peripherals, and even software threads.
Some existing cache monitors include a single LLC occupancy counter per thread, with each counter associated with a Resource Monitoring ID (RMID). The RMID enables independent and simultaneous monitoring of multiple concurrently running threads on a multicore processor. A per-thread architectural Model-Specific Register (MSR) allows each hardware thread to be associated with an RMID. The MSR may include, for example, a Class of Service (CLOS) field along with the RMID field for controlling resource allocation. An RMID may represent a software thread or an I/O thread. While the examples of this specification specifically discuss the use of RMIDs for I/O threads, it should be noted that the cache monitoring capabilities disclosed herein can be used in all cases, including software-thread RMIDs, to give a better indication of cache utilization to better serve applications.
The LLC occupancy counters available in some existing cache monitors are configured to monitor the LLC behavior of a small sample set of all available LLC sets. Empirically, it has been found that the behavior of the cache is sufficiently random that sampling a small set is sufficient to create a representative sample of all cache transactions. For example, the cache may include hundreds or thousands of individual sets, and one or more occupancy counters may sample some small number of individual sets, such as 2%.
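The set-sampling idea above can be sketched in software. This is an illustrative model only, not the hardware implementation: the set count, the ~2% ratio, and the stride-based selection are assumptions for demonstration.

```python
# Hypothetical sketch: choosing a ~2% sample of cache sets whose
# transactions update the monitoring counters. NUM_SETS, the ratio,
# and the stride-based selection are illustrative assumptions.

NUM_SETS = 2048          # total cache sets (illustrative)
SAMPLE_RATIO = 0.02      # sample approximately 2% of all sets

# Deterministically pick every 50th set (~2%) as a sampled set.
SAMPLE_STRIDE = int(1 / SAMPLE_RATIO)

def is_sampled(set_index: int) -> bool:
    """Return True if transactions on this set update the counters."""
    return set_index % SAMPLE_STRIDE == 0

sampled = [s for s in range(NUM_SETS) if is_sampled(s)]
print(len(sampled) / NUM_SETS)  # approximately 0.02
```

Because cache indexing is effectively random with respect to the workload, such a small deterministic sample is taken as representative of all cache transactions.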
The counter is incremented (e.g., by 1) when a line is filled into the LLC, and decremented (e.g., by 1) when a line disappears from the LLC. However, this does not account for where the line goes. For example, a line may be pulled to a mid-level cache (MLC) or L1 cache so that the core may consume the data, or the line may be evicted to system memory. In an inclusive cache, there is no ambiguity: even if the CPU fetches a line, it is still present in the LLC and the counter does not change. The counter is decremented only when a line is evicted to memory.
However, in a non-inclusive cache, a single LLC occupancy counter may not provide a complete picture of cache occupancy. This may be particularly true where adaptive DDIO is provided, in which an intelligent policy indicates whether data is written directly to the cache or via DMA to main memory. In the case of DDIO, if many lines are evicted from the LLC before they are processed by the core, the LLC is congested and new lines should not be allocated into the LLC. On the other hand, if a core pulls many lines to the MLC/L2 cache for processing, the LLC is freed and new lines should be allocated into the LLC. But a single LLC occupancy counter may not be able to distinguish between these two cases, which in fact represent opposite phenomena: in the former, data is evicted because the LLC is oversubscribed, while in the latter, data is consumed by the CPU and the LLC is undersubscribed.
To provide a more accurate picture of cache utilization, the cache monitor of the present specification provides three separate counters per RMID: La, Lp, and Le. These three counters capture LLC utilization better, particularly for DDIO transactions, but more generally for any data transaction. As with existing LLC counters, the counters disclosed herein can operate on a small subset of all cache lines, represented as a sample set. For example, it has been empirically determined that a small sample, such as 2% of all sets in the cache, is sufficient to represent overall cache behavior.
Counter La indicates allocated cache lines. Whenever a cache line is allocated into the LLC, La is incremented.
Counter Lp indicates cache lines processed by the CPU. Whenever a cache line is processed by the CPU and deallocated from the LLC, Lp is incremented; this condition indicates that the data was not evicted from the LLC, but was consumed and is no longer needed.
Counter Le indicates cache lines evicted from the LLC. Whenever a cache line is evicted from the LLC to system memory without being touched by the CPU, Le is incremented; this indicates a situation in which the CPU cannot keep up with I/O operations.
Over a specific time window i, which may be a small window, these three counters may be used to describe the cache dynamics. If the CPU is fast enough to process all incoming I/O packets during window i, then Lp should be approximately equal to La, and Le should be very small or close to zero. On the other hand, if I/O arrives so quickly that it overwhelms the CPU's processing capacity, then Le will be relatively significant, and Lp will be less than La.
Because these counters may apply only to a small sample of transactions (i.e., those occurring on the selected sets), in some embodiments the sampled cache lines are always allocated into the LLC. The samples may then be used to determine a direct write policy for the RMID. Cache operations that hit other cache sets are handled according to the direct write policy for window i. For example, during window i-1 it may be determined that the LLC is undersubscribed; the direct write policy for window i is therefore set to DDIO new cache lines into the cache. But if samples taken during window i indicate that the LLC is now oversubscribed, a new direct write policy is set for window i+1 to DMA new data to main memory. This allows the CPU to catch up with the I/O. After each window, La, Lp, and Le may be reset for the next policy window.
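The per-window policy decision described above can be sketched as follows. This is a software model of the idea only, not the patented circuit; the 0.9 and 0.1 thresholds for "approximately equal" and "near zero" are assumptions chosen for illustration.

```python
# Illustrative sketch: deciding the next window's direct write policy
# from the three counters. Thresholds are assumed values.

def next_policy(la: int, lp: int, le: int) -> str:
    """Return 'DDIO' (allocate into LLC) or 'DMA' (write to memory)."""
    if la == 0:
        return "DDIO"          # no sampled traffic; default to the cache
    processed = lp / la        # fraction of lines consumed by the CPU
    evicted = le / la          # fraction of lines evicted untouched
    # Lp ~= La and Le near zero: cache not over-utilized, keep DDIO on.
    if processed >= 0.9 and evicted <= 0.1:
        return "DDIO"
    # Le substantially non-zero and Lp well below La: over-utilized.
    if evicted > 0.1:
        return "DMA"
    return "DDIO"

print(next_policy(la=100, lp=98, le=1))   # CPU keeping up
print(next_policy(la=100, lp=40, le=55))  # LLC oversubscribed
```

In hardware, the same comparison would be made at the end of each policy window, after which the counters are reset.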
The window i is a configurable parameter. For example, 100 samples on the selected sets may constitute a measurement window, with all counters reset at the beginning of the next window. For each window i, the processing rate and eviction rate for the window may be calculated:

R_P(i) = L_p(i) / L_a(i)
R_E(i) = L_e(i) / L_a(i)
the processing rate and eviction rate reflect the instantaneous cache utilization (window)iCache, I/O and CPU interactions). Height ofRPIndicating that a core can efficiently handle most of the incoming trafficThis is a desirable property. Height ofREIndicating that most of the incoming packets are evicted before being processed by the core, which is generally considered an undesirable behavior. In thatREIn the case of being high, the new incoming packet should be written directly to system memory to avoid cache line bouncing (bounding), which causes degraded performance and consumes unnecessary energy. Embodiments of the present description employ a low-cost Exponentially Weighted Moving Average (EWMA) filter on rate samples (rate samples) to smooth oscillations and noise and obtain cache behavior trends. For example, EWMA may operate as follows:
R_P(new) = (1 - a) * R_P(old) + a * R_P(i)
R_E(new) = (1 - b) * R_E(old) + b * R_E(i)
The weight parameters a and b indicate how quickly the current rate responds to the latest sample from window i. The greater the weight given to a new sample, the more quickly the weighted average responds to changes, but the more susceptible it is to noise. These parameters may be tuned experimentally in various embodiments depending on implementation details.
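The EWMA update above is a one-line recurrence; a minimal sketch (with an assumed weight a = 0.2) shows how the smoothed processing rate tracks a sudden drop in CPU consumption without reacting to it all at once:

```python
# Sketch of the EWMA filter described above. The weight value 0.2 is
# an assumption for illustration; a and b are tunable in practice.

def ewma(old: float, sample: float, weight: float) -> float:
    """R(new) = (1 - w) * R(old) + w * R(i)."""
    return (1.0 - weight) * old + weight * sample

a = 0.2   # weight for the processing-rate filter (assumed)
rp = 1.0  # smoothed R_P: CPU has been consuming every line
for sample in [0.5, 0.5, 0.5]:  # CPU suddenly processes only half
    rp = ewma(rp, sample, a)
print(round(rp, 3))  # 0.756: drifting from 1.0 toward 0.5
```

A small weight filters out transient noise in the per-window rates; a larger weight lets the policy react faster to genuine shifts in I/O pressure.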
Systems and methods for cache monitoring will now be described with more particular reference to the accompanying drawings. It should be noted that throughout the figures, certain reference numerals may be repeated to indicate that a particular device or block is fully or substantially consistent across the figures. However, this is not intended to imply any particular relationship between the various embodiments disclosed. In some examples, a class (genus) element may be referenced by a particular reference numeral ("widget 10"), while individual classes or examples of the class may be referenced by hyphenated numbers ("first particular widget 10-1" and "second particular widget 10-2").
Contemporary computing platforms, such as hardware platforms provided by Intel® or similar platforms, may include the capability to monitor device performance and make decisions regarding resource provisioning. For example, in a large data center such as may be provided by a Cloud Service Provider (CSP), the hardware platform may include rack-mounted servers with computing resources such as processors, memory pools, accelerators, and other similar resources. As used herein, "cloud computing" includes network-connected computing resources and technologies that enable ubiquitous (often global) access to data, resources, and/or technologies. Cloud resources are typically characterized by great flexibility to dynamically allocate resources according to current workload and demand. This may be accomplished, for example, through virtualization, wherein resources such as hardware, storage, and networks are provided to Virtual Machines (VMs) via a software abstraction layer, and/or containerization, wherein instances of network functionality are provided separately from each other in "containers" while sharing underlying operating system, memory, and driver resources.
As used herein, a processor includes any programmable logic device having an instruction set. A processor may be real or virtual, local or remote, or in any other configuration. As a non-limiting example, a processor may include an Intel® processor (e.g., Xeon®, Core™, Pentium®, Atom®, Celeron®, x86, or others). A processor may also include a competing processor, such as an AMD processor (e.g., a Kx-series x86 work-alike, or an Athlon, Opteron, or Epyc series work-alike of Xeon), an ARM processor, or an IBM PowerPC or Power ISA processor, to name a few.
In embodiments of the present disclosure, a VM is a sequestered partition within a computing device that allows operating systems and other applications to be used independently of other programs on the device in which it is contained. VMs, containers, and the like may be collectively referred to as "guest" systems.
Fig. 1 is a block diagram of a central processing unit (CPU) with shared resource monitoring capabilities, according to one or more examples of the present specification.
Core 104 accesses CHA140 via an in-die interconnect (IDI) 116 or other similar interconnect technology. In accordance with the teachings of this specification, the IDI 116 extracts the RMID from the access request and may provide the RMID to the
In the case of HPC link 108 and
In the case of
FIG. 1 illustrates that cache monitoring may be used to monitor LLC utilization for I/O ways. In some existing systems, a single RMID is assigned to I/O, and that RMID may be used to monitor LLC utilization for the I/O. As described above, some existing systems use a single LLC occupancy counter for each RMID, which monitors LLC behavior for a small set of samples. The counter is incremented when a line with the associated RMID is filled into the LLC, and decremented when a line is removed from the LLC.
This single counter is suitable for a conventional inclusive cache hierarchy, where the LLC includes the MLC and L1. But for a non-inclusive cache, when a line disappears from the LLC, the counter does not directly distinguish whether the line was fetched by the CPU into the MLC or evicted to system memory. In the former case, the line is still present in the cache; in the latter case, it is not. This is a limitation of a single LLC occupancy counter that limits its effectiveness in the DDIO case. While the LLC occupancy counter may be used to provide a direct write policy in which DDIO is adaptively enabled (sending all packets to the LLC) or disabled (sending all packets to memory) under certain conditions, this approach would be more useful if it could differentiate between CPU consumption and cache eviction.
This is because existing LLC occupancy counters monitor only cache occupancy in the LLC, not cache occupancy across the entire cache hierarchy. For an inclusive cache, monitoring LLC occupancy alone is sufficient, since LLC occupancy correctly reflects occupancy throughout the cache hierarchy. But for a non-inclusive cache hierarchy, a single counter has limited utility. The CPU may bring lines into the MLC for processing, which is in fact the desired behavior for DDIO, meaning the LLC may actually be undersubscribed. But the LLC occupancy counter gives the impression that the lines have been evicted and that the LLC is oversubscribed, and therefore lacks capacity for incoming I/O data. In a system based on a single LLC occupancy counter, DDIO can be turned on or off only in a coarse-grained manner, without a fine-grained behavioral model of the overall non-inclusive cache.
To make more efficient use of DDIO, all lines may be installed into the LLC, if there is enough space, for faster processing. In the best case, all lines installed into the LLC I/O partition are processed by the CPU before being evicted from the cache. Some existing I/O systems, such as HPC fabrics over PCIe and Ethernet, always install data into the LLC through DDIO. In other cases, such as non-PCIe HPC fabrics, DDIO performs only write updates.
Any of these scenarios may be suboptimal when the cache is underutilized or over-utilized. Advantageously, the shared resource monitor of the present specification, including a cache monitor with three LLC occupancy counters, provides a unified DDIO architecture for all I/O devices, intelligently directing incoming data to the LLC when there is space and writing data to memory when the LLC is oversubscribed. This avoids evictions prior to processing.
To achieve this result, the Shared Resource Monitor (SRM) of the present specification extends existing LLC monitoring capability to distinguish the case in which lines are evicted to memory from the case in which lines are processed by the CPU. Note that in a non-inclusive cache hierarchy, lines disappear from the LLC in both cases, even though the two essentially represent opposite results (i.e., the LLC is oversubscribed in the former case and undersubscribed in the latter). The SRM of the present specification provides a unified I/O architecture that achieves high performance and significantly reduces memory bandwidth and energy consumption, thereby reducing latency in computing operations.
Fig. 2 is a block diagram illustrating a system of memory transactions without DDIOs in accordance with one or more examples of the present specification.
In the example of FIG. 2, processor socket 202 includes eight cores, namely core 204-0, core 204-1, core 204-2, core 204-3, core 204-4, core 204-5, core 204-6, and core 204-7. Cores 204
In this case, the NIC 212 is communicatively coupled to the processor socket 202, and the system further includes a memory 220.
In this example, NIC 212 may not be able to write data directly to
At
In operation 4, data is fetched from
Fig. 3 is a block diagram of a memory transaction to support and enable DDIOs according to one or more examples of the present specification.
In the example of FIG. 3, processor socket 302 includes eight cores, namely core 304-0, core 304-1, core 304-2, core 304-3, core 304-4, core 304-5, core 304-6, and core 304-7. Cores 304
In this example, at
Thus, writing data directly to the LLC via DDIO is advantageous when the data is not evicted from the LLC before the core can process it. In many cases, however, the
FIGS. 4 and 5 are signal flow diagrams illustrating placement operations that may be used, for example, in a high performance computing fabric such as Intel Omni-Path. As described above, these types of fabrics may use write updates for DDIO. In a write-update process, when new data is written to a cache block, the new data is broadcast to all caches containing the same cache line. This ensures that all caches are updated with the new value of the data. Conceptually, in a configuration such as the one shown in FIGS. 4 and 5, when a put message is received, the CHA simply checks whether the current line is present in the LLC. If it is a hit, the line is written to the LLC. If it is a miss, the line is DMA'd directly to main memory, bypassing the LLC. However, as mentioned above, when the LLC is undersubscribed this is not really the desired behavior, since the data would be better allocated into the LLC, as in the case of the Ethernet NIC.
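The write-update check the CHA performs on a put message can be sketched as a simple software model. This is illustrative only; the addresses and the set-based model of the LLC and memory are assumptions, and the real check happens in hardware.

```python
# Illustrative model of write-update DDIO handling of a put message:
# update the LLC on a hit, bypass to main memory on a miss.

def handle_put(line_addr: int, llc: set, memory: set) -> str:
    """Return where the incoming line was placed ('LLC' or 'memory')."""
    if line_addr in llc:
        return "LLC"        # hit: the line is updated in place in the LLC
    memory.add(line_addr)   # miss: the line is DMA'd to main memory
    return "memory"

llc = {0x100, 0x140}  # lines currently resident in the LLC (assumed)
mem = set()
print(handle_put(0x100, llc, mem))  # hit
print(handle_put(0x200, llc, mem))  # miss
```

Note that on a miss this model never allocates into the LLC, which is exactly the behavior the specification identifies as suboptimal when the LLC is undersubscribed.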
Turning to fig. 4, remote HFI 404 first issues an Omni-Path put message to local HFI 408. The local HFI 408 then performs a memory push write operation to the
In parallel, local HFI 408 may return an Omni-Path acknowledgement to remote HFI 404, and remote HFI 404 may respond with an Omni-Path end-to-end (E2E) acknowledgement.
Turning to fig. 5, remote HFI 404 issues an Omni-Path put message to local HFI 408. The local HFI 408 issues memory push writes of n bytes of data to the
As previously described, the local HFI 408 acknowledges the response with Omni-Path and the remote HFI 404 acknowledges the response with Omni-Path E2E.
Fig. 6 is a flow diagram of a method 600 of monitoring a cache according to one or more examples of the present description. In the example of fig. 6, all counters are initialized to 0 at the beginning.
At block 604, the row arrives at the CHA from the I/O device or from some other source, as described throughout the specification.
In decision block 608, the CHA checks whether this line maps to a set that is part of the sample set. For example, the CHA may sample approximately 2% of the available sets in the cache. Those 2% of all sets are considered sufficiently representative of the entire cache to support policy decisions for the complete cache.
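One way to pick such a sample set is sketched below. The every-Nth-set scheme is an assumption for illustration; real hardware may select sample sets differently:

```python
def in_sample_set(set_index, sample_ratio=0.02):
    """Return True if this cache set belongs to the sample set.

    Hypothetical scheme: every Nth set is sampled, with N chosen so
    that roughly `sample_ratio` of all sets are covered.
    """
    stride = round(1 / sample_ratio)   # e.g., every 50th set for 2%
    return set_index % stride == 0
```

For a 2048-set cache this marks about 2% of the sets, matching the sampling fraction mentioned above.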
If the transaction hits one of the sample sets, then in block 612, the data is allocated directly into the LLC according to the description herein. In response to the transaction, counters such as La, Lp, and Le may be incremented as described herein.
If the transaction does not hit one of the sample sets, then in decision block 620, the CHA determines whether the cache is currently over-utilized (as determined by the current cache policy). This may be based on a policy set from the sampled cache transactions in the previous window. If the current policy indicates that the cache (or, in particular, the LLC) is not over-utilized, then in block 616 the memory controller installs the block directly into the LLC according to the policy.
Returning to decision block 620, if the cache is currently over-utilized, the device writes the data directly to memory. In block 624, the memory controller writes the data to memory according to the policy.
In block 698, the method is complete.
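The flow of method 600 (blocks 604 through 624) can be sketched as follows. The data structures and the `sampled_fn` predicate are assumptions for illustration; real hardware operates on cache sets and lines, not Python lists:

```python
def handle_incoming_line(set_index, sampled_fn, cache_over_utilized,
                         counters, llc, memory, line):
    """Route one incoming I/O line per the current sampling policy."""
    if sampled_fn(set_index):        # decision block 608
        llc.append(line)             # block 612: allocate into the LLC
        counters["La"] += 1          # counters updated for sample sets
        return "LLC"
    if not cache_over_utilized:      # decision block 620
        llc.append(line)             # block 616: install into the LLC
        return "LLC"
    memory.append(line)              # block 624: write directly to memory
    return "memory"
```

The key property is that lines mapping to sample sets are always allocated into the LLC, so the counters keep measuring cache dynamics even while the policy diverts other traffic to memory.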
Fig. 7 is a block diagram of selected elements of a
In the example of fig. 7,
By way of non-limiting example, embodiments of the
By way of non-limiting example, the
As a non-limiting example, the
By way of non-limiting example,
As shown in FIG. 7,
Counter La 712 is used to indicate that a cache line has been allocated. Whenever a cache line is allocated into the LLC, La 712 is incremented.
Counter Lp 716 is used to indicate that the CPU has processed a cache line. Whenever a cache line is processed by the CPU and deallocated from the LLC, Lp 716 is incremented, indicating a case in which the data has not been evicted from the LLC, but has been consumed and is therefore no longer needed.
Counter Le 720 is used to count cache lines evicted from the LLC. Whenever a cache line that the CPU has not touched is evicted from the LLC to system memory, Le 720 is incremented. During a particular time window, which may be a small time window, these three counters may be used to describe the cache dynamics. If the CPU is fast enough to process all incoming I/O packets, then Lp should be approximately equal to La, and Le should be very small or close to 0. On the other hand, if I/O arrives so quickly that it overwhelms the CPU's processing capacity, Le will be relatively significant, and Lp will be less than La.
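The window heuristic described above can be sketched as follows. This is an illustrative policy only: the 10% eviction threshold and the 90% processing ratio are assumed values, not taken from the source.

```python
def direct_write_policy(La, Lp, Le, eviction_threshold=0.1):
    """Decide the DDIO write policy from one window's counters.

    If untouched-line evictions (Le) are a small fraction of
    allocations (La) and processing (Lp) keeps pace with allocation,
    the CPU is keeping up and I/O may continue writing into the LLC.
    Otherwise, steer direct writes to memory instead.
    """
    if La == 0:
        return "write-to-llc"          # no I/O traffic in this window
    cpu_keeps_up = (Le / La) <= eviction_threshold and Lp >= 0.9 * La
    return "write-to-llc" if cpu_keeps_up else "write-to-memory"
```

At the end of each window the counters would be reset so that each policy decision reflects only recent cache dynamics.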
FIG. 8 is a block diagram of general-purpose ways and I/O ways in an LLC according to one or more examples of the present specification. The general-purpose ways 804 may be allocated for general core usage, while the I/O ways 808 may be allocated exclusively for I/O usage. These allocations may reduce latency and improve overall system performance and energy.
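Such a partition can be sketched with capacity bitmasks, in the spirit of the way allocation above. The 12-way size and the exact mask layout are assumptions for illustration (Intel Cache Allocation Technology uses similar bitmasks, but this is not its implementation):

```python
NUM_WAYS = 12
GENERAL_WAYS = 0b111111110000   # ways 4-11: general core usage (804)
IO_WAYS      = 0b000000001111   # ways 0-3: reserved for I/O (808)

def allowed_ways(is_io_fill, mask_general=GENERAL_WAYS, mask_io=IO_WAYS):
    """Return the way indices in which this fill may be placed."""
    mask = mask_io if is_io_fill else mask_general
    return [w for w in range(NUM_WAYS) if mask & (1 << w)]
```

Because the two masks are disjoint, I/O fills can never evict core-owned lines, which is the latency benefit described above.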
Fig. 9 is a block diagram of selected components of a
As non-limiting examples, CSP 902 may be a traditional enterprise data center, an enterprise "private cloud," or a "public cloud" providing services such as infrastructure as a service (IaaS), platform as a service (PaaS), or software as a service (SaaS). In some cases, CSP 902 may provide an HPC platform or service instead of, or in addition to, cloud services. Indeed, although not expressly identical, an HPC cluster ("supercomputer") may be structurally similar to a cloud data center, and the teachings of this specification may be applied to either, except where expressly noted.
The CSP 902 may supply some number of workload clusters 918, which may be clusters of individual servers, blade servers, rack-mounted servers, or any other suitable server topology. In this illustrative example, two workload clusters 918-1 and 918-2 are shown, each providing a rack-mounted
In this illustration, workload cluster 918 is shown as a modular workload cluster conforming to the rack unit ("U") standard, where a 19 inch wide standard rack can be constructed to accommodate 42 units (42U), each 1.75 inches high and about 36 inches deep. In this case, computing resources such as processors, memory, storage, accelerators, and switches may fit into some multiple rack units from one to 42.
However, other embodiments are also contemplated. For example, FIG. 13 shows a Rack Scale Design (RSD). In an RSD, the rack may be built to standard 42U dimensions, but may be configured to accommodate line-replaceable units (LRUs) in the form of compute "sleds". In an RSD, a sled need not be built to standard rack-unit dimensions. For example, a sled housing a processor may fit a standard 3U configuration, or it may be of any size according to preferred design considerations. In an RSD, an entire pre-populated rack of resources may be provided as a unit, with the rack hosting a plurality of resource sleds, including elements such as compute (e.g., processors), memory, storage, accelerators, power supplies, networking, and others. Each sled chassis may house multiple LRUs, each embodied in a separate sled. If a resource fails, the LRU hosting that resource may be pulled and a new LRU modularly inserted. The failed LRU may then be repaired or discarded, depending on the nature of the failure. RSD is particularly, though not exclusively, beneficial in the case of a software-defined infrastructure (SDI), in which composite nodes may be built from disaggregated resources. A large pool of resources may be provided, and an SDI orchestrator may allocate them to composite nodes as needed.
Each
Various devices in a data center may be connected to one another via a
In addition, as the capabilities of each
To accommodate the large volume of traffic in a data center, a high
The fabric itself may be provided by any suitable interconnect. For example, each
Interconnect technologies may be provided by a single interconnect or a hybrid interconnect, such as PCIe to provide on-chip communications, 1Gb or 10Gb copper ethernet to provide a relatively short connection to ToR switch 920, and fiber optic cable to provide a relatively long connection to
Note that while a high-end fabric such as an OPA is provided by way of illustration herein, more generally,
In certain embodiments,
Fig. 10 is a block diagram of an end-
As described above, the
In this example, a
As shown,
Node 0 1008 is a processing node that includes
On-board communication between
Node 0 1008 is connected to
Because the
Note that in a "the network is the device" approach, node 0 1008 may provide limited or no on-board memory or storage. Rather, node 0 1008 may rely primarily on distributed services, such as memory servers and networked storage servers. On board, node 0 1008 may provide only enough memory and storage to boot the device and communicate it with
In this example,
Similarly, a storage server node 2 1010 may be provided instead of providing node 0 1008 with an onboard hard disk. The
Thus, in performing its designated functions, node 0 1008 may access memory from
By way of further illustration, node 3 1006 is also depicted. Node 3 1006 also includes
In other embodiments, the accelerator may also be provided as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), co-processor, Graphics Processing Unit (GPU), Digital Signal Processor (DSP), or other processing entity, which may optionally be tuned or configured to provide accelerator functionality.
The basic building blocks of the various components disclosed herein may be referred to as "logic elements." Logic elements may include hardware (including, for example, a software-programmable processor, ASIC, or FPGA), external hardware (digital, analog, or mixed-signal), software, reciprocating software, services, drivers, interfaces, components, modules, algorithms, sensors, firmware, microcode, programmable logic, or objects that can coordinate to achieve a logical operation. Additionally, some logic elements are provided by a tangible, non-transitory computer-readable medium having stored thereon executable instructions that instruct a processor to perform a task. Such a non-transitory medium may include, for example, a hard disk, solid-state memory or disk, read-only memory (ROM), persistent fast memory (PFM) (e.g., Intel 3D Crosspoint™), external storage, redundant array of independent disks (RAID), redundant array of independent nodes (RAIN), network-attached storage (NAS), optical storage, tape drive, backup system, cloud storage, or any combination of the foregoing, by way of non-limiting example. Such a medium may also include instructions programmed into a field-programmable gate array (FPGA), or instructions encoded in hardware on an ASIC or processor. In embodiments of the present disclosure, an FPGA is a configurable logic circuit in which gates may be dynamically arranged to implement logic that operates at speeds similar to logic implemented in an integrated circuit.
FIG. 11 is a block diagram of components of computing platform 1102A, according to one or more examples of the present description. Embodiments of the computing platform 1102A disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of the present specification.
In the depicted embodiment, platforms 1102A, 1102B, and 1102C, as well as
In various embodiments of the present disclosure, the cumulative stress and/or rate of cumulative stress of a plurality of hardware resources (e.g., cores and uncores) is monitored, and an entity of computer platform 1102A (e.g.,
Each of CPUs 1112 may include any suitable number of processor cores and support logic (e.g., uncore). The cores may be coupled to each other, to memory 1114, to at least one chipset 1116, and/or to
The memory 1114 may include any form of volatile or non-volatile memory including, but not limited to, magnetic media (e.g., one or more tape drives), optical media, Random Access Memory (RAM), ROM, flash memory, removable media, or any other suitable local or remote memory component or components. The memory 1114 may be used for short-term, medium-term, and/or long-term storage of the platform 1102A. Memory 1114 may store any suitable data or information used by platform logic 1110, including software embedded in a computer-readable medium and/or encoded logic incorporated in hardware or otherwise stored (e.g., firmware). Memory 1114 may store data used by the cores of CPU 1112. In some embodiments, memory 1114 may also include storage for instructions that may be executed by cores or other processing elements of CPU1112 (e.g., logic resident on chipset 1116) to provide functionality associated with manageability engine 1126 or other components of platform logic 1110. Platform 1102 may also include one or more chipsets 1116 including any suitable logic to support the operation of CPU 1112. In various embodiments, chipset 1116 may reside on the same die or package as CPU1112 or on one or more different die or packages. Each chipset may support any suitable number of CPUs 1112. Chipset 1116 may also include one or more controllers to couple other components of platform logic 1110 (e.g.,
In various embodiments, manageability engine 1126 operates as an out-of-band asynchronous computing agent that is capable of interfacing with various elements of platform logic 1110 to collect telemetry data with no or minimal disruption to running processes on CPU 1112. Manageability engine 1126 may, for example, comprise a dedicated processing element (e.g., a processor, controller, or other logic) on chipset 1116 that provides the functionality of manageability engine 1126 (e.g., by executing software instructions), thus saving the processing cycles of CPU1112 for operations associated with the workload executed by platform logic 1110. Further, dedicated logic for manageability engine 1126 may operate asynchronously with respect to CPU1112 and may collect at least some telemetry data without increasing the load on the CPU.
Manageability engine 1126 may process the telemetry data it collects (specific examples of the processing of stress information are provided herein). In various embodiments, manageability engine 1126 reports the data it collects and/or the results of its processing to other elements in the computer system, such as one or more hypervisors 1120 or other operating systems and/or system management software (which may run on any suitable logic, such as system management platform 1106). In particular embodiments, a critical event, such as a core that has accumulated excessive stress, may be reported prior to the normal interval for reporting telemetry data (e.g., a notification may be sent immediately upon detection).
In addition, manageability engine 1126 may include configurable programmable code to set which CPU(s) 1112 a particular chipset 1116 manages and/or which telemetry data may be collected.
Each of the chipsets 1116 also includes a communication interface 1128. Communication interface 1128 may provide for the communication of signaling and/or data between chipset 1116 and one or more I/O devices, one or
In particular embodiments, communication interface 1128 may allow communication of data associated with the management and monitoring functions performed by manageability engine 1126 (e.g., between manageability engine 1126 and data center management platform 1106). In various embodiments, manageability engine 1126 may utilize elements of communication interface 1128 (e.g., one or more NICs) to report telemetry data (e.g., to system management platform 1106) in order to reserve the use of the NIC of
Switch 1130 may be coupled to various ports of communication interface 1128 (e.g., provided by a NIC) and may exchange data between those ports and various components of chipset 1116, such as one or more peripheral component interconnect express (PCIe) lanes coupled to CPU 1112. Switch 1130 may be a physical or virtual (i.e., software) switch.
Platform logic 1110 may include additional communication interfaces 1118. Similar to communication interface 1128,
The platform logic 1110 may receive and execute any suitable type of workload. The workload may include any request to utilize one or more resources of the platform logic 1110, such as one or more cores or associated logic, for example, of the platform logic 1110. For example, the workload may include a request to instantiate a software component, such as an I/O device driver 1124 or a client system 1122, a request to process a network packet received from a
SFCs 1136 are groups of
Hypervisor 1120 (also referred to as a virtual machine monitor) may include logic to create and run guest systems 1122. Hypervisor 1120 may present guest operating systems run by virtual machines with virtual operating platforms (i.e., it shows to the virtual machines that they run on separate physical nodes when they are actually merged onto a single hardware platform) and manage the execution of the guest operating systems through platform logic 1110. The services of hypervisor 1120 may be provided through virtualization in software or through hardware-assisted resources that require minimal software intervention, or both. Multiple instances of various guest operating systems may be managed by hypervisor 1120. Each platform 1102 may have a separate instance of a hypervisor 1120.
Hypervisor 1120 may be a native or bare-metal hypervisor that runs directly on platform logic 1110 to control the platform logic and manage the guest operating systems. Alternatively, hypervisor 1120 may be a hosted hypervisor running on a host operating system and abstracting the guest operating system from the host operating system. Hypervisor 1120 may include a virtual switch 1138, which may provide virtual switching and/or routing functionality to virtual machines of guest system 1122. Virtual switch 1138 can include a logical switching fabric that couples the vnics of
Virtual switch 1138 may include software elements that are executed using components of platform logic 1110. In various embodiments, hypervisor 1120 may communicate with any suitable entity (e.g., an SDN controller), which may cause hypervisor 1120 to reconfigure parameters of virtual switch 1138 in response to changing conditions in platform 1102 (e.g., addition or deletion of
The hypervisor 1120 may also include
Any suitable logic may make one or more of these optimization decisions. For example, the
The elements of platform logic 1110 may be coupled together in any suitable manner. For example, a bus may couple any of the components together. A bus may comprise any known interconnect, such as a multi-drop bus, a mesh interconnect, a ring interconnect, a point-to-point interconnect, a serial interconnect, a parallel bus, a coherent (e.g., cache-coherent) bus, a layered protocol architecture, a differential bus, or a Gunning Transceiver Logic (GTL) bus.
In embodiments of the present description, cache coherency is a memory architecture that provides uniform sharing and mapping between multiple caches. For example, the caches may map to the same address space. If two different caches cache the same address in a shared address space, the coherence agent provides logic (hardware and/or software) to ensure compatibility and coherence of the shared resource. For example, if two caches have cached the same address, the coherency agent ensures that changes are propagated to one cache when the value stored in the address is updated in the other cache. For example, coherency may be maintained via "snooping," where each cache monitors the address lines of each other cache and detects updates. Cache coherency may also be maintained via a directory-based system, where shared data is placed in a shared directory where coherency is maintained. Some distributed shared memory architectures may also provide coherency, for example by emulating the aforementioned mechanisms.
Coherency may be based on "snooping" or directory. In a snooping protocol, coherency may be maintained by a write invalidate, where a first cache snooping a write to the same address in a second cache invalidates its own copy. If the program attempts to read the value from the first cache, this forces a read from memory. Alternatively, in a write update, the first cache snoops writes to the second cache, and the cache controller (which may include a coherence agent) copies out the data and updates the copy in the first cache.
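The two snooping policies can be contrasted with a toy model. This is illustrative only: real coherence operates on cache lines through a coherence agent, not on Python dictionaries, and `None` here simply marks an invalidated copy:

```python
def write_invalidate(writer, others, addr, value):
    """Writer updates its copy; snooping caches invalidate theirs."""
    writer[addr] = value
    for cache in others:
        if addr in cache:
            cache[addr] = None     # forces a later re-read from memory

def write_update(writer, others, addr, value):
    """Writer updates its copy; snooping caches copy in the new data."""
    writer[addr] = value
    for cache in others:
        if addr in cache:
            cache[addr] = value
```

Write invalidate trades extra read misses for less broadcast traffic; write update keeps all copies current at the cost of broadcasting every write, which is the behavior the Omni-Path put flow above relies on.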
By way of non-limiting example, current cache coherency models include MSI (modified, shared, invalid), MESI (modified, exclusive, shared, invalid), MOSI (modified, owned, shared, invalid), MOESI (modified, owned, exclusive, shared, invalid), MERSI (modified, exclusive, read-only or recent, shared, invalid), MESIF (modified, exclusive, shared, invalid, forward), write-once, Synapse, Berkeley, Firefly, and Dragon protocols. Further, an ARM processor may use the Advanced Microcontroller Bus Architecture (AMBA) 4 ACE to provide cache coherency in a system on a chip (SoC) or elsewhere.
The elements of computer platform 1102A may be coupled together in any suitable manner, such as through one or
Fig. 12 illustrates a block diagram of a CPU1212, according to one or more examples of this specification. Embodiments of the CPU1212 disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of this specification. In particular, the CPU1212 may include an SRM, as shown in fig. 7.
Although the CPU1212 depicts a particular configuration, the cores and other components of the CPU1212 may be arranged in any suitable manner. CPU1212 may include any processor or processing device, such as a microprocessor, embedded processor, DSP, network processor, application processor, co-processor, SoC, or other device executing code. In the depicted embodiment, CPU1212 includes four processing elements (cores 1230 in the depicted embodiment), which may include asymmetric processing elements or symmetric processing elements. However, CPU1212 may include any number of processing elements that may be symmetric or asymmetric.
Examples of hardware processing elements include: a thread unit, thread slot, thread, processing unit, context unit, logical processor, hardware thread, core, and/or any other element capable of maintaining a state of a processor, such as an execution state or an architectural state. In other words, in one embodiment, a processing element refers to any hardware capable of being independently associated with code, such as a software thread, operating system, application, or other code. A physical processor (or processor socket) generally refers to an integrated circuit, which potentially includes any number of other processing elements, such as cores or hardware threads.
A core may refer to logic located on an integrated circuit capable of maintaining an independent architectural state, where each independently maintained architectural state is associated with at least some dedicated execution resources. A hardware thread may refer to any logic located on an integrated circuit capable of maintaining an independent architectural state, wherein the independently maintained architectural states share access to execution resources. The physical CPU may include any suitable number of cores. In various embodiments, the cores may include one or more out-of-order processor cores or one or more in-order processor cores. However, the cores may be individually selected from any type of core, such as a native core, a software management core, a core adapted to execute a native Instruction Set Architecture (ISA), a core adapted to execute a translated ISA, a co-designed (co-designed) core, or other known core. In a heterogeneous core environment (i.e., asymmetric cores), some form of translation, such as binary translation, may be utilized to schedule or execute code on one or both cores.
In the depicted embodiment,
Front end 1270 may include a decode module coupled to fetch logic to decode fetched elements. In one embodiment, the fetch logic includes a separate sequencer associated with a thread slot of core 1230. Generally, core 1230 is associated with a first ISA that defines/specifies instructions executable on core 1230. Machine code instructions, which are typically part of the first ISA, include a portion of an instruction (referred to as an opcode) that references/specifies an instruction or operation to be performed. The decode module may include circuitry to recognize these instructions from their opcodes and to pass the decoded instructions in the pipeline for processing as defined by the first ISA. In one embodiment, the decoders of core 1230 recognize the same ISA (or a subset thereof). Alternatively, in a heterogeneous core environment, decoders of one or more cores (e.g.,
In the depicted embodiment, the out-of-order engine includes a dispatch unit 1282 to receive decoded instructions from front-end unit 1270, which may be one or more microinstructions or micro-operations (uops), and to allocate them to appropriate resources such as registers. Next, the instructions are provided to a
In the depicted embodiment, both the front end unit 1270 and the out-of-
In particular embodiments,
CPU1212 may also include a Power Control Unit (PCU) 1240. In various embodiments, PCU1240 may control the supply voltage and operating frequency applied to each of the cores (on a per core basis) and to the uncore. PCU1240 may also instruct a core or an uncore to enter an idle state (where no voltage and clock are provided) when no workload is executing.
In various embodiments, PCU 1240 may detect one or more stress characteristics of a hardware resource, such as a core or an uncore. A stress characteristic may comprise an indication of an amount of stress being placed on the hardware resource. As examples, a stress characteristic may be a voltage or frequency applied to the hardware resource; a power level, current level, or voltage level sensed at the hardware resource; a temperature sensed at the hardware resource; or another suitable measurement. In various embodiments, multiple measurements of a particular stress characteristic may be performed (e.g., at different locations) when the stress characteristic is sensed at a particular time. In various embodiments, PCU 1240 may detect stress characteristics at any suitable interval.
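One way to turn interval readings into the cumulative-stress figure mentioned earlier can be sketched as follows. The accumulation scheme and threshold are hypothetical, invented for illustration only:

```python
def accumulate_stress(samples, threshold):
    """Integrate per-interval stress readings (e.g., temperature or
    current) into a cumulative figure, flagging the resource as soon
    as the total crosses a reporting threshold (a critical event)."""
    total = 0.0
    for reading in samples:
        total += reading
        if total > threshold:
            return total, True     # report immediately, out of band
    return total, False
```

A real PCU would also weight readings by interval length and decay old stress over time; those refinements are omitted here.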
In various embodiments, PCU 1240 is a component separate from core 1230. In particular embodiments, PCU 1240 operates at a clock frequency different from that used by core 1230. In some embodiments, PCU 1240 is a microcontroller that executes instructions according to an ISA different from the ISA used by core 1230.
In various embodiments, the CPU1212 may also include a
As shown in FIG. 12, the CPU1212 may further include an
Fig. 13 is a block diagram of a Rack Scale Design (RSD) 1300 according to one or more examples of the present description. Embodiments of the RSD 1300 disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of the present specification.
In this example, the RSD 1300 includes a single rack 1304 to illustrate certain principles of RSD. It should be understood that the RSD 1300 may include many such racks, and the racks need not be identical to one another. In some cases, a multi-purpose rack such as rack 1304 may be provided, while in other examples single-purpose racks may be provided. For example, rack 1304 may be considered a highly inclusive rack that includes resources that may be used to allocate a large number of composite nodes. On the other hand, other examples may include racks dedicated solely to compute sleds, storage sleds, memory sleds, and other resource types, which together may be integrated into composite nodes. Thus, rack 1304 of FIG. 13 should be understood as a non-limiting example of a rack that may be used in an RSD 1300.
In the example of fig. 13, the rack 1304 may be a standard rack having an outer width of about 23.6 inches and a height of 78.74 inches. In common use, this is referred to as a "42U rack". However, the rack 1304 need not conform to the "rack unit" standard. Rather, the chassis 1304 may include multiple chassis optimized for its purpose.
The rack 1304 may be marketed and sold as a monolithic unit, with multiple LRUs within each chassis. In this case, the LRUs may be sleds, and thus may be easily swapped out when replacement is needed.
In this example, the chassis 1304 includes a
The storage chassis 1316 includes a plurality of
Over time, the various LRUs within the rack 1304 may become damaged, outdated, or may experience functional errors. When this occurs, the LRU may be pulled and replaced with a compatible LRU, thus allowing the rack to continue full scale operation.
Fig. 14 is a block diagram of a Software Defined Infrastructure (SDI) data center 1400 in accordance with one or more examples of the present description. Embodiments of the SDI data center 1400 disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of the present specification.
Some applications hosted within SDI data center 1400 may use a set of resources to achieve their intended purpose, such as processing database queries, serving web pages, or providing computer intelligence.
Some applications tend to be sensitive to a particular subset of resources. For example, SAP HANA is an in-memory, column-oriented relational database system. An SAP HANA database may use processors, memory, disk, and fabric, while being most sensitive to memory and processors. In one embodiment, composite node 1402 includes one or more cores 1410 that perform the processing function. Node 1402 may also include a caching agent 1406 that provides access to a cache. One or more applications 1414 run on node 1402 and communicate with the SDI fabric via FA 1418. Dynamically provisioning node 1402 with resources may include selecting a set of resources and ensuring that the quantity and quality provisioned meet required performance metrics, such as service-level agreements (SLAs) and quality of service (QoS). Resource selection and allocation for applications 1414 may be performed by a resource manager, which may be implemented in the orchestration and system software stack 1422. By way of non-limiting example, throughout this specification the resource manager may be treated as though it is implemented separately or by an orchestrator. Note that many different configurations are possible.
In SDI data centers, applications may be executed by composite nodes, such as node 1402, that are dynamically allocated by SDI manager 1480. Such nodes are referred to as composite nodes because their resources are not necessarily co-located. Rather, they may comprise resources distributed in different parts of the data center, dynamically allocated, and virtualized to the specific application 1414.
In this example, memory resources from three memory sleds of the memory rack 1430 are allocated to the node 1402, memory resources from four memory sleds of the memory rack 1434 are allocated, and additional resources from five resource sleds of the resource rack 1436 are allocated to the application 1414 running on the composite node 1402. All of these resources may be associated and aggregated with a particular computing sled to create a composite node. Once the compound node is created, the operating system can be booted in node 1402, and applications can begin running using the aggregated resources as if they were physically collocated resources. As described above, FA 1418 may provide certain interfaces that enable this operation to occur seamlessly with respect to node 1402.
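The aggregation described above can be sketched as follows. The sled counts match the example (three storage sleds, four memory sleds, five resource sleds), but the data model and function names are assumptions for illustration, not the SDI manager's actual interface:

```python
def compose_node(sled_pools, request):
    """Claim `request[kind]` sleds of each resource kind for one node."""
    node = {}
    for kind, count in request.items():
        available = sled_pools[kind]
        if len(available) < count:
            raise ValueError(f"not enough {kind} sleds")
        node[kind] = [available.pop() for _ in range(count)]
    return node
```

Once the sleds are claimed, the operating system boots on the compute sled and the FA presents the remote resources as though they were local, as described above.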
As a general proposition, the more memory and compute resources that are added to a database processor, the better throughput it can achieve. However, this does not necessarily apply to disk or fabric. Adding more disk and fabric bandwidth may not necessarily increase the performance of the SAP HANA database above a certain threshold.
SDI data center 1400 may address scaling of resources by mapping an appropriate amount of off-board (offboard) resources to an application based on application requirements provided by a user or network administrator, or directly by the application itself. This may include allocating resources from various resource racks, such as the storage rack 1430, the storage rack 1434, and the resource rack 1436.
In an example, SDI controller 1480 also includes a resource protection engine (RPE) 1482, which is configured to assign permissions for various target resources to the decomposed compute resources (DRCs) that are allowed to access them. In this example, the permissions are enforced by the FAs servicing the target resources.
In certain embodiments, elements of the SDI data center 1400 may be adapted or configured to operate with the disaggregated telemetry model of the present description.
FIG. 15 is a block diagram of a data center according to one or more examples of the present specification.
In the example of FIG. 15, two composite nodes are defined: composite node 1534-1 and composite node 1534-2.
According to the teachings of this specification,
Note that in various embodiments, the computing sled 1526-1 may also provide local memory, storage, accelerators, or other resources for the processor 1530-1. However, in accordance with the SDI teachings of the present specification, certain resources allocated to the composite node 1534 may also be disaggregated, or physically remote from the processor 1530. In this example, each composite node 1534 has assigned to it one or
It should be noted that for simplicity and clarity of illustration, only selected components are disclosed in this illustration. However, other disaggregated resources may also be provided. For example, the
In this example, the composite node 1534-1 includes a processor 1530-1 on the computing sled 1526-1, which runs an application 1532-1 and accesses the fabric 1570 via the FA 1518-3. The composite node 1534-1 also includes an FPGA 1512-1 on the FPGA sled 1504-1, which runs FPGA core 1506-1 and has access to FPGA resources 1508-1. The FPGA sled 1504-1 can access the fabric 1570 via the FA 1518-1. Note that in this example, multiple FPGAs on FPGA sled 1504-1 may be connected to one another via a passive backplane, and a single FA 1518-1 may be provided for the entire sled. The composite node 1534-1 may also have access to a memory block 1524-1 on the
The composite node 1534-2 includes a processor 1530-2 running an application 1532-2 on a computing sled 1526-2. The computing sled 1526-2 is connected to the fabric 1570 via the FA 1518-4. Note that the computing sled 1526 may also include multiple processors, memories, and other local resources that may be communicatively coupled to each other via a passive backplane and share a common FA 1518. The composite node 1534-2 also includes a core 1506-3 running on the shared FPGA 1512-2 and has access to the shared resources 1508-2. The composite node 1534-2 may store data on the memory block 1524-2.
The foregoing has outlined features of one or more embodiments of the subject matter disclosed herein. These embodiments are provided so that a person having ordinary skill in the art (PHOSITA) can better understand various aspects of the disclosure. Certain well-understood terminology, as well as underlying technologies and/or standards, may be referred to without detailed description. It is expected that the PHOSITA will have, or have access to, sufficient background knowledge of those technologies and standards to implement the teachings of this specification.
The PHOSITA will appreciate that they can readily use the present disclosure as a basis for designing or modifying other processes, structures, or variations for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. The PHOSITA will also recognize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from that spirit and scope.
In the foregoing description, certain aspects of some or all embodiments have been described in greater detail than is strictly necessary to practice the following claims. These details are provided by way of non-limiting example only, for the purpose of providing context and description for the disclosed embodiments. Such details are not to be understood as required, and are not to be read into the claims as limitations. This specification may refer to "an embodiment" or "embodiments." These phrases, and any other references to embodiments, should be construed broadly to refer to any combination of one or more embodiments. Furthermore, the features disclosed in a particular "embodiment" may also be spread across multiple embodiments. For example, if
This specification may provide illustrations in block diagram form, with some features being disclosed in separate blocks. These should be construed broadly to disclose how the various features interoperate, but are not intended to imply that those features must be necessarily embodied in separate hardware or software. Furthermore, where a single block discloses more than one feature in the same block, those features need not necessarily be embodied in the same hardware and/or software. For example, computer "memory" may in some cases be distributed or mapped between multiple levels of cache or local memory, main memory, battery-backed volatile memory, and various forms of persistent storage, such as hard disks, storage servers, optical disks, memory, tape drives, or the like. In some embodiments, some components may be omitted or combined. In a general sense, the arrangements depicted in the figures may be more logical in their representations, while a physical architecture may include various permutations, combinations, and/or hybrids of these elements. The operational goals outlined herein may be achieved using a myriad of possible design configurations. Thus, the associated infrastructure has myriad alternative arrangements, design choices, device possibilities, hardware configurations, software implementations, and device options.
Reference may be made herein to computer-readable media, which may be tangible and non-transitory computer-readable media. As used in this specification and throughout the claims, a "computer-readable medium" should be taken to include one or more computer-readable media of the same or different types. By way of non-limiting example, the computer-readable medium may comprise an optical drive (e.g., CD/DVD/Blu-ray), a hard disk drive, a solid state drive, flash memory, or other non-volatile media. The computer-readable medium may also include media such as ROM, FPGA or ASIC configured to execute desired instructions, stored instructions for programming the FPGA or ASIC to execute desired instructions, IP blocks that may be integrated into other circuits in hardware, or instructions encoded directly into hardware or into microcode on a processor such as a microprocessor, DSP, microcontroller, or any other suitable component, device, element or object as appropriate and based on particular needs. Non-transitory storage media herein are expressly intended to include any non-transitory dedicated or programmable hardware configured to provide the disclosed operations or cause a processor to perform the disclosed operations.
Throughout this specification and claims, various elements may be "communicatively", "electrically", "mechanically", or otherwise "coupled" to one another. Such coupling may be a direct point-to-point coupling or may include intermediate devices. For example, two devices may be communicatively coupled to each other via a controller that facilitates communication. The devices may be electrically coupled to each other via intermediate devices such as signal boosters, voltage dividers, or buffers. Mechanically coupled devices may be indirectly mechanically coupled.
Any "module" or "engine" disclosed herein may refer to or include software, a software stack, hardware, firmware, and/or a combination thereof, circuitry configured to perform the functions of the engine or module, or any of the computer-readable media disclosed above. Such modules or engines may, where appropriate, be provided on or in conjunction with a hardware platform, which may include hardware computing resources such as processors, memory, storage, interconnects, networks and network interfaces, accelerators, or other suitable hardware. Such a hardware platform may be provided as a single monolithic device (e.g., in a PC form factor), or with some or all of its functions distributed (e.g., a "composite node" in a high-end data center, where compute, memory, storage, and other resources may be dynamically allocated and need not be local to each other).
There may be flow charts, signal flow diagrams, or other illustrations showing operations disclosed herein performed in a particular order. Unless explicitly stated otherwise, or unless required in a particular context, the order should be understood as a non-limiting example only. Further, where one operation is shown following another, other intervening operations may also occur, which may or may not be related. Some operations may also be performed concurrently or in parallel. Where an operation is said to be "based on" or "in accordance with" another item or operation, this should be understood to imply that the operation is at least partially based on or in accordance with the other item or operation. This should not be construed as implying that the operation is based on, or in accordance with, only that item or operation.
All or part of any of the hardware elements disclosed herein may be readily provided in a SoC, including a CPU package. SoC represents an Integrated Circuit (IC) that integrates components of a computer or other electronic system into a single chip. Thus, for example, a client device or a server device may be provided in whole or in part in a SoC. The SoC may contain digital, analog, mixed signal, and radio frequency functions, all of which may be provided on a single chip substrate. Other embodiments may include a multi-chip module (MCM) in which multiple chips are located within a single electronic package and are configured to interact closely with each other through the electronic package.
In a general sense, any suitably configured circuit or processor may execute any type of instructions associated with data to implement the operations detailed herein. Any processor disclosed herein can transform an element or article (e.g., data) from one state or thing to another state or thing. Further, information tracked, sent, received, or stored in a processor may be provided in any database, register, table, cache, queue, control list, or storage construct, all of which may be referenced at any suitable time frame, based on particular needs and implementations. Any memory or storage element disclosed herein should be construed as being properly encompassed within the broad terms "memory" and "storage."
Computer program logic implementing all or part of the functionality described herein may be embodied in various forms, including, but not limited to, source code forms, computer-executable forms, machine instructions or microcode, programmable hardware, and various intermediate forms (e.g., tables generated by an assembler, compiler, linker, or locator). In an example, the source code includes a series of computer program instructions implemented in various programming languages, such as object code, assembly language, or a high-level language such as OpenCL, FORTRAN, C, C++, JAVA, or HTML, for use with various operating systems or operating environments, or in hardware description languages such as Spice, Verilog, and VHDL. The source code may define and use various data structures and communication messages. The source code may be in computer-executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into computer-executable form, or into an intermediate form such as bytecode. Any of the foregoing may be used, where appropriate, to construct or describe appropriate discrete or integrated circuits, whether sequential, combinatorial, state machine, or otherwise.
In an example embodiment, any number of the circuits of the figures may be implemented on a board of an associated electronic device. The board may be a generic circuit board that may hold various components of the internal electronic system of the electronic device and also provide connectors for other peripheral devices. Any suitable processor and memory may be suitably coupled to the board based on particular configuration requirements, processing needs, and computing design. Note that with the numerous examples provided herein, interaction may be described in terms of two, three, four, or more electronic components. However, this is done for clarity and example purposes only. It should be appreciated that the systems may be consolidated or reconfigured in any suitable manner. Along similar design alternatives, any of the illustrated components, modules, and elements of the drawings may be combined in various possible configurations, all of which are within the broad scope of this specification.
Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained by one skilled in the art, and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims. To assist the United States Patent and Trademark Office (USPTO), and additionally any reader of any patent issued on this application, in interpreting the claims appended hereto, applicant wishes to note that the applicant: (a) does not intend any of the appended claims to invoke paragraph six (6) of 35 U.S.C. section 112 (pre-AIA) or paragraph (f) of the same section (post-AIA), as it exists on the date of filing hereof, unless the words "means for" or "steps for" are specifically used in the particular claims; and (b) does not intend, by any statement in the specification, to limit this disclosure in any manner that is not otherwise expressly reflected in the appended claims.
Example implementation
The following examples are provided by way of illustration.
Example 1 includes a computing apparatus, comprising: a processor; a multi-level cache comprising a plurality of cache levels; a peripheral device configured to write data directly to a direct-writable cache; and cache monitoring circuitry comprising a cache counter La that is incremented when a cache line is allocated into the direct-writable cache, a cache counter Lp that is incremented when the processor processes a cache line and deallocates it from the direct-writable cache, and a cache counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory, wherein the cache monitoring circuitry determines a direct write policy according to the cache counters.
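The counter scheme of Example 1 can be modeled in software. The sketch below is a hypothetical illustration only: the counter names La, Lp, and Le come from the text, while the class and method names are assumptions, not the patent's hardware implementation.

```python
# Hypothetical software model of the cache monitoring circuitry of Example 1.
# Counter names (La, Lp, Le) come from the text; class/method names are assumed.

class CacheMonitor:
    def __init__(self):
        self.La = 0  # cache lines allocated into the direct-writable cache
        self.Lp = 0  # lines processed by the processor and deallocated
        self.Le = 0  # lines evicted to memory without being processed

    def on_allocate(self):
        self.La += 1

    def on_process_and_deallocate(self):
        self.Lp += 1

    def on_evict_unprocessed(self):
        # Per Example 2, Le counts only lines evicted without ever
        # being retrieved by the processor.
        self.Le += 1

    def reset(self):
        # Per Example 7, the counters reset after each policy window.
        self.La = self.Lp = self.Le = 0
```

A balanced workload would show Lp tracking La with Le near zero; a thrashing workload would show Le growing while Lp lags.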
Example 2 includes the computing apparatus of example 1, wherein Le is incremented only when the cache line is evicted without being retrieved by the processor.
Example 3 includes the computing apparatus of example 1, wherein the directly writable cache is a last-level cache.
Example 4 includes the computing apparatus of example 1, wherein the multi-level cache is a non-inclusive cache.
Example 5 includes the computing apparatus of example 1, wherein the cache counter is a sampling counter configured to sample fewer than all cache sets.
Example 6 includes the computing apparatus of example 5, wherein the cache counter is configured to sample approximately 2% of all cache sets.
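Sampling roughly 2% of the cache sets (Example 6) keeps the counter hardware small while remaining statistically representative. Below is a minimal sketch of one way sampled sets might be chosen; the stride-based selection and the set count are illustrative assumptions, not from the text.

```python
# Illustrative set-sampling sketch: monitor ~2% of cache sets rather than all.
# NUM_SETS and SAMPLE_STRIDE are assumed values for illustration.

NUM_SETS = 2048        # assumed number of sets in the direct-writable cache
SAMPLE_STRIDE = 64     # one of every 64 sets -> 1/64 = 1.5625%, near 2%

def is_sampled(set_index: int) -> bool:
    # A fixed stride gives a deterministic, evenly spread sample.
    return set_index % SAMPLE_STRIDE == 0

sampled = [s for s in range(NUM_SETS) if is_sampled(s)]
```

Only transactions that hit a sampled set would update the La, Lp, and Le counters; the resulting ratios are then treated as representative of the whole cache.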
Example 7 includes the computing apparatus of example 6, wherein the cache monitor circuitry is further to reset La, Lp, and Le after a policy window.
Example 8 includes the computing apparatus of example 6, wherein the cache monitor circuitry is to allocate all transactions on the sampled sets to the direct-writable cache.
Example 9 includes the computing apparatus of example 7, wherein the cache monitor circuitry is to allocate interim cache transactions according to the policy.
Example 10 includes the computing apparatus of example 1, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 11 includes the computing apparatus of example 1, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
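The decision rules of Examples 10 and 11 can be sketched as a simple threshold function. The numeric thresholds below are illustrative assumptions; the text says only "approximately equal", "at or near zero", and "substantially less".

```python
# Hypothetical sketch of the policy rule described in Examples 10 and 11.
# The `near` and `small` thresholds are assumed values, not from the text.

def direct_write_policy(La, Lp, Le, near=0.9, small=0.05):
    if La == 0:
        return "write_to_cache"  # no traffic observed; assumed default
    if Lp >= near * La and Le <= small * La:
        # Cache not over-utilized: lines are consumed before eviction,
        # so keep writing incoming data into the direct-writable cache.
        return "write_to_cache"
    if Lp < near * La and Le > small * La:
        # Cache over-utilized: unconsumed lines are being evicted,
        # so steer incoming data directly to memory instead.
        return "write_to_memory"
    return "write_to_cache"  # ambiguous region: keep the assumed default
```

For instance, (La, Lp, Le) = (100, 98, 1) keeps the direct-to-cache policy, while (100, 40, 55) switches incoming writes to memory.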
Example 12 includes the computing apparatus of example 1, wherein the peripheral device is an Ethernet network interface.
Example 13 includes the computing apparatus of example 1, wherein the peripheral is a cache-coherent core-to-core network interface.
Example 14 includes the computing apparatus of example 13, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 15 includes the computing apparatus of example 1, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 16 includes a shared resource monitor, comprising: an interface communicatively coupled to a core; an interface communicatively coupled to a cache; an interface communicatively coupled to a peripheral device; a cache counter La that is incremented when a cache line is allocated into the cache; a cache counter Lp that is incremented when the core processes a cache line and deallocates it from the cache; a cache counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory; and policy circuitry that determines a direct write policy according to the cache counters.
Example 17 includes the shared resource monitor of example 16, wherein the cache is a direct writable cache of a multi-level cache.
Example 18 includes the shared resource monitor of example 17, wherein the directly writable cache is a last-level cache.
Example 19 includes the shared resource monitor of example 17, wherein the multi-level cache is a non-inclusive cache.
Example 20 includes the shared resource monitor of example 16, wherein Le is incremented only when the cache line is evicted without being retrieved by the core.
Example 21 includes the shared resource monitor of example 16, wherein the cache counter is a sampling counter configured to sample fewer than all cache transactions.
Example 22 includes the shared resource monitor of example 21, wherein the cache counter is configured to sample approximately 2% of all cache transactions.
Example 23 includes the shared resource monitor of example 21, wherein the policy circuitry resets La, Lp, and Le after a policy window.
Example 24 includes the shared resource monitor of example 21, wherein the cache monitor circuitry is to allocate all sampled transactions directly to the cache.
Example 25 includes the shared resource monitor of example 24, wherein the cache monitor circuitry is to allocate interim cache transactions according to the policy.
Example 26 includes the shared resource monitor of example 16, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 27 includes the shared resource monitor of example 16, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
Example 28 includes the shared resource monitor of example 16, wherein the peripheral device is an Ethernet network interface.
Example 29 includes the shared resource monitor of example 16, wherein the peripheral is a cache-coherent core-to-core network interface.
Example 30 includes the shared resource monitor of example 29, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 31 includes the shared resource monitor of example 16, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 32 includes an Intellectual Property (IP) block comprising the shared resource monitor of any of examples 1-31.
Example 33 includes a Field Programmable Gate Array (FPGA) including the shared resource monitor of any of examples 1-31.
Example 34 includes an Integrated Circuit (IC) comprising the shared resource monitor of any of examples 1-31.
Example 35 includes a microprocessor including the IC of example 34.
Example 36 includes a system on a chip (SoC) including the microprocessor of example 35.
Example 37 includes a computer-implemented method of providing device direct input/output (DDIO), comprising: communicatively coupling to a shared last-level cache (LLC), the LLC being shared by a plurality of cores; communicatively coupling to a peripheral device; incrementing a cache counter La when a cache line is allocated into the LLC; incrementing a cache counter Lp when a core processes a cache line and deallocates it from the cache; incrementing a cache counter Le when a cache line is evicted from the direct-writable cache to memory; and setting a direct write policy for the LLC according to the cache counters.
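The method of Example 37 can be exercised end to end over one policy window: count allocation, processing, and eviction events, then set the policy. The event encoding and thresholds below are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical end-to-end sketch of the method of Example 37 over one
# policy window. Event names and thresholds are assumed for illustration.

def run_policy_window(events, near=0.9, small=0.05):
    La = Lp = Le = 0
    for ev in events:
        if ev == "alloc":      # cache line allocated into the LLC
            La += 1
        elif ev == "process":  # line processed by a core and deallocated
            Lp += 1
        elif ev == "evict":    # line evicted to memory unprocessed
            Le += 1
    over_utilized = La > 0 and Lp < near * La and Le > small * La
    # Counters would be reset here for the next window (per Example 42).
    return "write_to_memory" if over_utilized else "write_to_cache"

# A window where the core keeps up: every allocated line is processed.
healthy = ["alloc", "process"] * 50
# A window where peripheral writes outrun the core: many unprocessed evictions.
thrashing = ["alloc"] * 100 + ["process"] * 30 + ["evict"] * 60
```

Running the healthy window keeps DDIO writing into the LLC; the thrashing window flips the policy to writing incoming data directly to memory.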
Example 38 includes the method of example 37, wherein the LLC is a member of a non-inclusive cache.
Example 39 includes the method of example 37, further comprising incrementing Le only when a cache line is evicted without being retrieved by a core.
Example 40 includes the method of example 37, further comprising incrementing the cache counter according to a sampling algorithm, comprising sampling fewer than all cache transactions.
Example 41 includes the method of example 40, further comprising sampling approximately 2% of all cache transactions.
Example 42 includes the method of example 37, further comprising resetting La, Lp, and Le after the policy window.
Example 43 includes the method of example 37, wherein the cache monitor circuitry is to allocate all sampled transactions directly to the cache.
Example 44 includes the method of example 43, further comprising allocating interim cache transactions according to the policy.
Example 45 includes the method of example 37, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 46 includes the method of example 37, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
Example 47 includes the method of example 37, wherein the peripheral device is an Ethernet network interface.
Example 48 includes the method of example 37, wherein the peripheral device is a cache-coherent core-to-core network interface.
Example 49 includes the method of example 48, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 50 includes the method of example 37, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 51 includes an apparatus comprising means for performing the method of any of examples 37-50.
Example 52 includes the apparatus of example 51, wherein the apparatus comprises a computing system.
Example 53 includes the apparatus of example 52, wherein the computing system comprises a Central Processing Unit (CPU) comprising a plurality of cores and a shared resource monitor.
Example 54 includes a shared resource monitor comprising circuitry to perform the method of any of examples 37-50.
Example 55 includes an Intellectual Property (IP) block comprising the shared resource monitor of example 54.
Example 56 includes a Field Programmable Gate Array (FPGA) comprising the shared resource monitor of example 54.
Example 57 includes an Integrated Circuit (IC) comprising the shared resource monitor of example 54.
Example 58 includes a microprocessor including the IC of example 57.
Example 59 includes a system on a chip (SoC) including the microprocessor of example 58.
Example 60 includes one or more tangible, non-transitory computer-readable media having instructions stored thereon for causing an apparatus to provide a shared resource monitor, the shared resource monitor comprising: an interface communicatively coupled to a core; an interface communicatively coupled to a cache; an interface communicatively coupled to a peripheral device; a cache counter La that is incremented when a cache line is allocated into the cache; a cache counter Lp that is incremented when the core processes a cache line and deallocates it from the cache; a cache counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory; and policy circuitry that determines a direct write policy according to the cache counters.
Example 61 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the cache is a direct-writable cache of a multi-level cache.
Example 62 includes the one or more tangible, non-transitory computer-readable media of example 61, wherein the direct-writable cache is a last-level cache.
Example 63 includes the one or more tangible, non-transitory computer-readable media of example 61, wherein the multi-level cache is a non-inclusive cache.
Example 64 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein Le is incremented only when the cache line is evicted without being retrieved by the core.
Example 65 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the cache counter is a sampling counter configured to sample fewer than all cache sets.
Example 66 includes the one or more tangible, non-transitory computer-readable media of example 65, wherein the cache counter is configured to sample approximately 2% of all cache sets.
Example 67 includes the one or more tangible, non-transitory computer-readable media of example 65, wherein the policy circuitry resets La, Lp, and Le after a policy window.
Example 68 includes the one or more tangible, non-transitory computer-readable media of example 65, wherein the cache monitor circuitry is to allocate all sampled transactions directly to the cache.
Example 69 includes the one or more tangible, non-transitory computer-readable media of example 68, wherein the cache monitoring circuitry is to allocate interim cache transactions according to the policy.
Example 70 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 71 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
Example 72 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the peripheral device is an Ethernet network interface.
Example 73 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the peripheral device is a cache-coherent core-to-core network interface.
Example 74 includes the one or more tangible, non-transitory computer-readable media of example 73, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 75 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 76 includes the one or more tangible, non-transitory computer-readable media of any of examples 60-75, wherein the instructions comprise instructions to program an Intellectual Property (IP) block.
Example 77 includes the one or more tangible, non-transitory computer-readable media of any one of examples 60-75, wherein the instructions comprise instructions to program a Field Programmable Gate Array (FPGA).
Example 78 includes the one or more tangible, non-transitory computer-readable media of any one of examples 60-75, wherein the instructions comprise instructions to program an Integrated Circuit (IC).
Example 79 includes the one or more tangible, non-transitory computer-readable media of example 78, wherein the IC includes a processor.
Example 80 includes the one or more tangible, non-transitory computer-readable media of any of examples 60-75, wherein the instructions comprise instructions to program a system on a chip (SoC).