Cache monitoring
Reader's note: this patent, "Cache monitoring" (高速缓存监视), was designed and created by 王任, 李彬, A.J.赫尔里奇 (Herdrich), 戴宗元, and R.胡加哈利 (Huggahalli) on 2019-05-28. Abstract: Disclosed in one example is a computing apparatus, comprising: a processor; a multi-level cache comprising a plurality of cache levels; a peripheral device configured to write data directly to a direct-writable cache; and cache monitoring circuitry comprising a cache counter La that is incremented when a cache line is allocated into the direct-writable cache, a counter Lp that is incremented when a cache line is processed by the processor and deallocated from the direct-writable cache, and a counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory, wherein the cache monitoring circuitry determines a direct write policy based on the cache counters.
1. A computing device, comprising:
a processor;
a multi-level cache comprising a plurality of cache levels;
a peripheral device configured to write data directly to a direct-writable cache; and
cache monitor circuitry comprising a cache counter La that is incremented when a cache line is allocated into the direct-writable cache, a counter Lp that is incremented when the processor processes a cache line and deallocates the cache line from the direct-writable cache, and a counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory, wherein the cache monitor circuitry determines a direct write policy based on the cache counters.
2. The computing device of claim 1, wherein Le is incremented only when the cache line is evicted without being retrieved by the processor.
3. The computing apparatus of claim 1 wherein the directly writable cache is a last-level cache.
4. The computing device of claim 1, wherein the multi-level cache is a non-inclusive cache.
5. The computing device of claim 1, wherein the cache counter is a sample counter configured to sample a set of less than all cache sets.
6. The computing device of claim 5, wherein the cache counter is configured to sample approximately 2% of all cache sets.
7. The computing device of claim 6, wherein the cache monitor circuitry is further configured to reset La, Lp, and Le after a policy window.
8. The computing apparatus of claim 6, wherein the cache monitor circuitry is to allocate all transactions on the sampled set to the direct-writable cache.
9. The computing device of claim 7, wherein the cache monitor circuitry is to allocate gap cache transactions according to a policy.
10. The computing device of claim 1, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and that Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy to write incoming data to the direct-writable cache.
11. The computing device of claim 1, wherein determining the direct write policy comprises determining that Lp is substantially less than La and that Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy to write incoming data directly to the memory.
12. The computing apparatus of claim 1, wherein the peripheral device is an ethernet network interface.
13. The computing apparatus of claim 1, wherein the peripheral device is a cache coherent core-to-core network interface.
14. The computing device of claim 13, wherein the cache coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
15. The computing apparatus of claim 1, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
16. A shared resource monitor comprising:
an interface communicatively coupled to the core;
an interface communicatively coupled to the cache;
an interface communicatively coupled to a peripheral device;
a cache counter La that increments when a cache line is allocated into the cache;
a cache counter Lp that increments when a core processes a cache line and deallocates the cache line from the cache;
a cache counter Le that increments when a cache line is evicted from the direct-writable cache to memory; and
a policy circuit to determine a direct write policy based on the cache counter.
17. The shared resource monitor of claim 16, wherein the cache is a direct writable cache of a multi-level cache.
18. The shared resource monitor of claim 17, wherein the directly writable cache is a last-level cache.
19. The shared resource monitor of claim 17, wherein the multi-level cache is a non-inclusive cache.
20. The shared resource monitor of claim 16, wherein Le is incremented only when a cache line is evicted without being retrieved by a core.
21. The shared resource monitor of claim 16, wherein the cache counter is a sample counter configured to sample less than all of the cache transactions.
22. The shared resource monitor of claim 21, wherein the cache counter is configured to sample approximately 2% of all cache transactions.
23. The shared resource monitor of claim 21, wherein the policy circuit resets La, Lp, and Le after a policy window.
24. The shared resource monitor of claim 21, wherein the cache monitor circuitry allocates all sampled transactions directly to the cache.
25. The shared resource monitor of claim 24, wherein the cache monitor circuitry allocates the gap cache transactions according to a policy.
Technical Field
The present disclosure relates generally to the field of enterprise computing, and more particularly, but not exclusively, to systems and methods for providing cache monitoring.
Background
In some modern data centers, the functionality of a device may not be tied to a particular fixed hardware configuration. Rather, processing, memory, storage, and accelerator functions may in some cases be aggregated from different locations to form virtual "compound nodes." Contemporary networks may include data centers that host a large number of general-purpose hardware server devices contained, for example, in server racks and controlled by supervisory programs. Each hardware device may run one or more instances of a virtual device, such as a workload server or virtual desktop.
Drawings
The present disclosure is best understood from the following detailed description when read with the accompanying drawing figures. It is emphasized that, in accordance with the standard practice in the industry, various features are not necessarily drawn to scale and are used for illustrative purposes only. Where a scale is shown, either explicitly or implicitly, it provides only one illustrative example. In other embodiments, the dimensions of the various features may be arbitrarily increased or decreased for clarity of discussion.
Fig. 1 is a block diagram of a Central Processing Unit (CPU) with shared resource monitoring capabilities according to one or more examples of the present specification.
Fig. 2 is a block diagram illustrating a system of memory transactions without device direct input/output (DDIO) according to one or more examples of this specification.
Fig. 3 is a block diagram of a memory transaction to support and enable DDIOs according to one or more examples of the present specification.
FIGS. 4 and 5 are signal flow diagrams illustrating placement (put) operations that may be used, for example, in a high performance computing fabric such as Intel Omni-Path.
Fig. 6 is a flow diagram of a method of monitoring a cache in accordance with one or more examples of the present description.
FIG. 7 is a block diagram of selected elements of a CPU according to one or more examples of the present specification.
FIG. 8 is a block diagram of a general approach and an I/O approach in a Last Level Cache (LLC) according to one or more examples of the present specification.
Fig. 9 is a block diagram of selected components of a datacenter having connectivity to a network of Cloud Service Providers (CSPs) according to one or more examples of the present application.
FIG. 10 is a block diagram of selected components of an end-user computing device, according to one or more examples of the present specification.
FIG. 11 is a block diagram of components of a computing platform according to one or more examples of the present specification.
Fig. 12 is a block diagram of a CPU according to one or more examples of the present specification.
Fig. 13 is a block diagram of a rack scale design in accordance with one or more examples of the present specification.
Fig. 14 is a block diagram of a Software Defined Infrastructure (SDI) data center according to one or more examples of the present description.
Fig. 15 is a block diagram of a data center according to one or more examples of the present description.
Detailed Description
The following disclosure provides many different embodiments, or examples, for implementing different features of the disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these are merely examples and are not intended to be limiting. Further, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Different embodiments may have different advantages, and no particular advantage is necessarily required of any embodiment.
Modern enterprise computing may include servers, appliances, virtual machines, containers, and other distributed or non-distributed computing systems, among others. In such computing environments, operators may need visibility and control over shared resources, such as Last Level Cache (LLC) and memory bandwidth. To provide those operators with the required visibility into those shared resources, the enterprise CPU may provide Shared Resource Monitors (SRMs), such as Intel Resource Director Technology (RDT). In embodiments of the present description, a CPU cache is provided as a form of computer data storage that allows for later efficient access to recently used information. By way of non-limiting example, a cache may contain frequently used instructions, predicted instructions, recently used data, or predicted data, for example.
Data Direct I/O (DDIO) is an I/O protocol that enables devices such as peripherals (e.g., Network Interface Cards (NICs)) to deliver data directly into a cache level such as the LLC without first accessing main memory, as would occur in a Direct Memory Access (DMA) model. With DDIO, the processor cache (rather than the system's main memory) becomes the primary source and destination of I/O data. This improves the CPU's I/O data processing efficiency for data delivery to and data consumption from the I/O device.
As used in this specification, a NIC is a computer hardware component that enables a computer to connect communicatively with a network. NICs may be used in both wired and wireless computing embodiments, and may be provided as add-in cards that fit into expansion slots of a computer motherboard. By way of non-limiting example, a NIC is also known as a network interface controller card, network adapter card, expansion card, LAN card, or circuit board.
In an example, a device such as an ethernet NIC may send data directly to the processor LLC using DDIO, allocating cache lines as "I/O ways," which are a number of special LLC ways reserved specifically for I/O traffic. This helps to reduce access latency and improve performance and power.
In another embodiment, other network fabric technologies may be used. For example, embodiments may provide a Host Fabric Interface (HFI) for a High Performance Computing (HPC) fabric such as Intel Omni-Path. Such HPC fabrics may provide cache-coherent memory operations between different cores or processors. Like a NIC, an HFI can also write data directly to the LLC using DDIO. In the case of an HFI, the cache operation may be a "write update": if the line is found in the cache, the line is updated in the LLC; but if the line is not found in the LLC, the line is written to system memory. Other embodiments of the HPC fabric may provide an HFI over a PCIe interface. Current versions of PCIe are not cache coherent, so these fabrics may behave like Ethernet NICs (e.g., always writing to a dedicated I/O way).
Both the Ethernet and HPC approaches can suffer problems if, for example, the cache is oversubscribed or undersubscribed. In the case of an Ethernet NIC, when the I/O ingress rate is faster than the CPU consumption rate, data may be evicted from the cache to memory before the CPU consumes it. This not only negates the benefits of DDIO; it actually reduces CPU performance due to unnecessary LLC cache evictions, whereas without DDIO the data might simply have been written directly to memory via DMA. These unnecessary cache evictions may also increase the power consumption of the hardware platform due to the additional block moves required to restore data after it has been evicted from the cache.
In the case of HPC fabrics, the HFI may perform only write updates, which means that for newly arriving data not found in the cache, the data is written back to system memory and subsequently brought into the cache by the CPU, even if the cache is undersubscribed. This can cause unnecessary memory trips, with negative performance and energy impact. In the case of an HFI using a PCIe interface, the conditions are similar to those faced in the Ethernet case.
Problems can arise especially in the case of non-inclusive caches. A strictly inclusive cache is one in which all cache levels are inclusive of one another, such that data found in, for example, the L1 cache must always be found in the L2 and L3 caches. A non-inclusive cache is one in which data in a lower cache level may or may not be found in an upper level. For example, data may be brought into lower cache levels so that it can be consumed by the CPU, and removed from upper cache levels that no longer need it. But in the case of a non-inclusive cache, a write-update DDIO peripheral sees only that the cache line is not in the LLC. It does not know whether the data was evicted by the LLC (e.g., because the processor is bogged down and cannot keep up with I/O operations) or consumed by the core (e.g., because the processor is running far ahead of the I/O operations).
Better performance can be achieved by monitoring LLC utilization so that data for all I/O devices is allocated into the LLC when there is space (e.g., when the processor is staying ahead of I/O operations) and written to memory when there is insufficient space (e.g., when the CPU cannot keep up with incoming I/O). Embodiments of the present specification provide SRMs, such as Intel RDT, comprising a cache monitor having counters La, Lp, and Le. While these three counters do not give the DDIO peripheral visibility into why a cache line is no longer in the LLC (or some other cache level, such as L2 or L1), in the aggregate they can inform a meaningful direct write policy, enabling the DDIO device to know whether the CPU is keeping up (so that data should be DDIO'd into the LLC) or falling behind (so that data should be DMA'd into main memory).
The cache monitor of the present specification monitors the use of I/O ways in the LLC, including in non-inclusive caches. This may advantageously provide a unified architecture for adaptive DDIO for ethernet, HPC fabric, other peripherals, and even software threads.
Some existing cache monitors include a single LLC occupancy counter per thread, with each counter associated with a Resource Monitoring ID (RMID). The RMID enables independent and simultaneous monitoring of multiple concurrently running threads on a multicore processor. A per-thread architectural Model-Specific Register (MSR) allows each hardware thread to be associated with an RMID. The MSR may include, for example, a Class of Service (CLOS) field along with the RMID field for controlling resource allocation. An RMID may represent a software thread or an I/O thread. While the examples of this specification specifically discuss the use of RMIDs for I/O threads, it should be noted that the cache monitoring capabilities disclosed herein can be used in all cases, including software-thread RMIDs, to give a better indication of cache utilization to better serve applications.
The LLC occupancy counters available in some existing cache monitors are configured to monitor the LLC behavior of a small sample set of all available LLC sets. Empirically, it has been found that the behavior of the cache is sufficiently random that sampling a small set is sufficient to create a representative sample of all cache transactions. For example, the cache may include hundreds or thousands of individual sets, and one or more occupancy counters may sample some small number of individual sets, such as 2%.
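The set-sampling idea above can be sketched in software. This is an illustrative model only, not the hardware implementation: the set count, the ~2% ratio, and the stride-based selection are assumptions for demonstration.

```python
# Hypothetical sketch: choosing a ~2% sample of cache sets whose
# transactions update the monitoring counters. NUM_SETS, the ratio,
# and the stride-based selection are illustrative assumptions.

NUM_SETS = 2048          # total cache sets (illustrative)
SAMPLE_RATIO = 0.02      # sample approximately 2% of all sets

# Deterministically pick every 50th set (~2%) as a sampled set.
SAMPLE_STRIDE = int(1 / SAMPLE_RATIO)

def is_sampled(set_index: int) -> bool:
    """Return True if transactions on this set update the counters."""
    return set_index % SAMPLE_STRIDE == 0

sampled = [s for s in range(NUM_SETS) if is_sampled(s)]
print(len(sampled) / NUM_SETS)  # approximately 0.02
```

Because cache indexing is effectively random with respect to the workload, such a small deterministic sample is taken as representative of all cache transactions.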
The counter is incremented (e.g., by 1) when a line is filled into the LLC, and decremented (e.g., by 1) when a line disappears from the LLC. However, this does not account for where the line goes. For example, a line may be pulled to a mid-level cache (MLC) or L1 cache so that the core may consume the data, or the line may be evicted to system memory. In an inclusive cache, there is no ambiguity: even if the CPU fetches a line, it is still present in the LLC and the counter does not change. The counter is decremented only when a line is evicted to memory.
However, in a non-inclusive cache, a single LLC occupancy counter may not provide a complete picture of cache occupancy. This may be particularly true where adaptive DDIO is provided, in which an intelligent policy indicates whether data is written directly to the cache or via DMA to main memory. In the case of DDIO, if many lines are evicted from the LLC before they are processed by the core, the LLC is congested and new lines should not be allocated into the LLC. On the other hand, if a core pulls many lines to the MLC/L2 cache for processing, the LLC is freed and new lines should be allocated into the LLC. But a single LLC occupancy counter may not be able to distinguish between these two cases, which in fact represent opposite phenomena: in the former, data is evicted because the LLC is oversubscribed, while in the latter, data is consumed by the CPU and the LLC is undersubscribed.
To provide a more accurate picture of cache utilization, the cache monitor of the present specification provides three separate counters per RMID: La, Lp, and Le. These three counters capture LLC utilization better, particularly for DDIO transactions, but more generally for any data transaction. As with existing LLC counters, the counters disclosed herein can operate on a small subset of all cache lines, represented as a sample set. For example, it has been empirically determined that a small sample, such as 2% of all sets in the cache, is sufficient to represent overall cache behavior.
Counter La indicates allocated cache lines. Whenever a cache line is allocated into the LLC, La is incremented.
Counter Lp indicates cache lines processed by the CPU. Whenever a cache line is processed by the CPU and deallocated from the LLC, Lp is incremented; this condition indicates that the data was not evicted from the LLC, but was consumed and is no longer needed.
Counter Le indicates cache lines evicted from the LLC. Whenever a cache line is evicted from the LLC to system memory without being touched by the CPU, Le is incremented; this indicates a situation in which the CPU cannot keep up with I/O operations.
Over a specific time window i, which may be a small window, these three counters may be used to describe the cache dynamics. If the CPU is fast enough to process all incoming I/O packets during window i, then Lp should be approximately equal to La, and Le should be very small or close to zero. On the other hand, if I/O arrives so quickly that it overwhelms the CPU's processing capacity, then Le will be relatively significant, and Lp will be less than La.
Because these counters may apply only to a small sample of transactions (i.e., those occurring on the selected sets), in some embodiments the sampled cache lines are always allocated into the LLC. The samples may then be used to determine a direct write policy for the RMID. Cache operations that hit other cache sets are handled according to the direct write policy for window i. For example, during window i-1 it may be determined that the LLC is undersubscribed; the direct write policy for window i is therefore set to DDIO new cache lines into the cache. But if samples taken during window i indicate that the LLC is now oversubscribed, a new direct write policy is set for window i+1 to DMA new data to main memory. This allows the CPU to catch up with the I/O. After each window, La, Lp, and Le may be reset for the next policy window.
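The per-window policy decision described above can be sketched as follows. This is a software model of the idea only, not the patented circuit; the 0.9 and 0.1 thresholds for "approximately equal" and "near zero" are assumptions chosen for illustration.

```python
# Illustrative sketch: deciding the next window's direct write policy
# from the three counters. Thresholds are assumed values.

def next_policy(la: int, lp: int, le: int) -> str:
    """Return 'DDIO' (allocate into LLC) or 'DMA' (write to memory)."""
    if la == 0:
        return "DDIO"          # no sampled traffic; default to the cache
    processed = lp / la        # fraction of lines consumed by the CPU
    evicted = le / la          # fraction of lines evicted untouched
    # Lp ~= La and Le near zero: cache not over-utilized, keep DDIO on.
    if processed >= 0.9 and evicted <= 0.1:
        return "DDIO"
    # Le substantially non-zero and Lp well below La: over-utilized.
    if evicted > 0.1:
        return "DMA"
    return "DDIO"

print(next_policy(la=100, lp=98, le=1))   # CPU keeping up
print(next_policy(la=100, lp=40, le=55))  # LLC oversubscribed
```

In hardware, the same comparison would be made at the end of each policy window, after which the counters are reset.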
The window i is a configurable parameter. For example, 100 samples on the selected sets may constitute a measurement window, with all counters reset at the beginning of the next window. For each window i, the processing rate and eviction rate for the window may be calculated:

R_P(i) = L_p(i) / L_a(i)
R_E(i) = L_e(i) / L_a(i)
the processing rate and eviction rate reflect the instantaneous cache utilization (window)iCache, I/O and CPU interactions). Height ofRPIndicating that a core can efficiently handle most of the incoming trafficThis is a desirable property. Height ofREIndicating that most of the incoming packets are evicted before being processed by the core, which is generally considered an undesirable behavior. In thatREIn the case of being high, the new incoming packet should be written directly to system memory to avoid cache line bouncing (bounding), which causes degraded performance and consumes unnecessary energy. Embodiments of the present description employ a low-cost Exponentially Weighted Moving Average (EWMA) filter on rate samples (rate samples) to smooth oscillations and noise and obtain cache behavior trends. For example, EWMA may operate as follows:
R_P(new) = (1 - a) * R_P(old) + a * R_P(i)
R_E(new) = (1 - b) * R_E(old) + b * R_E(i)
The weight parameters a and b indicate how quickly the current rate responds to the latest sample from window i. The greater the weight given to a new sample, the more quickly the weighted average responds to changes, but the more susceptible it is to noise. These parameters may be tuned experimentally in various embodiments depending on implementation details.
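The EWMA update above is a one-line recurrence; a minimal sketch (with an assumed weight a = 0.2) shows how the smoothed processing rate tracks a sudden drop in CPU consumption without reacting to it all at once:

```python
# Sketch of the EWMA filter described above. The weight value 0.2 is
# an assumption for illustration; a and b are tunable in practice.

def ewma(old: float, sample: float, weight: float) -> float:
    """R(new) = (1 - w) * R(old) + w * R(i)."""
    return (1.0 - weight) * old + weight * sample

a = 0.2   # weight for the processing-rate filter (assumed)
rp = 1.0  # smoothed R_P: CPU has been consuming every line
for sample in [0.5, 0.5, 0.5]:  # CPU suddenly processes only half
    rp = ewma(rp, sample, a)
print(round(rp, 3))  # 0.756: drifting from 1.0 toward 0.5
```

A small weight filters out transient noise in the per-window rates; a larger weight lets the policy react faster to genuine shifts in I/O pressure.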
Systems and methods for cache monitoring will now be described with more particular reference to the accompanying drawings. It should be noted that throughout the figures, certain reference numerals may be repeated to indicate that a particular device or block is fully or substantially consistent across the figures. However, this is not intended to imply any particular relationship between the various embodiments disclosed. In some examples, a class (genus) element may be referenced by a particular reference numeral ("widget 10"), while individual classes or examples of the class may be referenced by hyphenated numbers ("first particular widget 10-1" and "second particular widget 10-2").
Contemporary computing platforms, such as hardware platforms provided by Intel® or similar platforms, may include the capability to monitor device performance and make decisions regarding resource provisioning. For example, in a large data center such as may be provided by a Cloud Service Provider (CSP), the hardware platform may include rack-mounted servers with computing resources such as processors, memory pools, accelerators, and other similar resources. As used herein, "cloud computing" includes network-connected computing resources and technologies that enable ubiquitous (often global) access to data, resources, and/or technologies. Cloud resources are typically characterized by great flexibility to dynamically allocate resources according to current workload and demand. This may be accomplished, for example, through virtualization, wherein resources such as hardware, storage, and networks are provided to Virtual Machines (VMs) via a software abstraction layer, and/or containerization, wherein instances of network functionality are provided separately from each other in "containers" while sharing underlying operating system, memory, and driver resources.
As used herein, a processor includes any programmable logic device having an instruction set. A processor may be real or virtual, local or remote, or in any other configuration. As a non-limiting example, a processor may include an Intel® processor (e.g., Xeon®, Core™, Pentium®, Atom®, Celeron®, x86, or others). A processor may also include a competing processor, such as an AMD processor (e.g., a Kx-series x86 work-alike, or an Athlon, Opteron, or Epyc series work-alike of Xeon), an ARM processor, or an IBM PowerPC or Power ISA processor, to name a few.
In embodiments of the present disclosure, a VM is a sequestered partition within a computing device that allows operating systems and other applications to be used independently of other programs on the device in which it is contained. VMs, containers, and the like may be collectively referred to as "guest" systems.
Fig. 1 is a block diagram of a central processing unit (CPU) with shared resource monitoring capabilities, according to one or more examples of the present specification.
Core 104 accesses CHA140 via an in-die interconnect (IDI) 116 or other similar interconnect technology. In accordance with the teachings of this specification, the IDI 116 extracts the RMID from the access request and may provide the RMID to the
In the case of HPC link 108 and
In the case of
FIG. 1 illustrates that cache monitoring may be used to monitor LLC utilization for I/O ways. In some existing systems, a single RMID is assigned to I/O, and that RMID may be used to monitor LLC utilization for the I/O. As described above, some existing systems use a single LLC occupancy counter for each RMID, which monitors LLC behavior for a small set of samples. The counter is incremented when a line with the associated RMID is filled into the LLC, and decremented when a line is removed from the LLC.
This single counter is suitable for a conventional inclusive cache hierarchy, where the LLC includes the MLC and L1. But for a non-inclusive cache, when a line disappears from the LLC, the counter does not directly distinguish whether the line was fetched by the CPU into the MLC or evicted to system memory. In the former case, the line is still present in the cache; in the latter case, it is not. This is a limitation of a single LLC occupancy counter that limits its effectiveness in the DDIO case. While the LLC occupancy counter may be used to provide a direct write policy in which DDIO is adaptively enabled (sending all packets to the LLC) or disabled (sending all packets to memory) under certain conditions, this approach would be more useful if it could differentiate between CPU consumption and cache eviction.
This is because existing LLC occupancy counters monitor only cache occupancy in the LLC, not cache occupancy across the entire cache hierarchy. For an inclusive cache, monitoring LLC occupancy alone is sufficient, since LLC occupancy correctly reflects occupancy throughout the cache hierarchy. But for a non-inclusive cache hierarchy, a single counter has limited utility. The CPU may bring lines into the MLC for processing, which is in fact the desired behavior for DDIO, meaning the LLC may actually be undersubscribed. But the LLC occupancy counter gives the impression that the lines have been evicted and that the LLC is oversubscribed, and therefore lacks capacity for incoming I/O data. In a system based on a single LLC occupancy counter, DDIO can be turned on or off only in a coarse-grained manner, without a fine-grained behavioral model of the overall non-inclusive cache.
To make more efficient use of DDIO, all lines may be installed into the LLC, if there is enough space, for faster processing. In the best case, all lines installed into the LLC I/O partition are processed by the CPU before being evicted from the cache. Some existing I/O systems, such as HPC fabrics over PCIe and Ethernet, always install data into the LLC through DDIO. In other cases, such as non-PCIe HPC fabrics, DDIO performs only write updates.
Any of these scenarios may be suboptimal when the cache is underutilized or over-utilized. Advantageously, the shared resource monitor of the present specification, including a cache monitor with three LLC occupancy counters, provides a unified DDIO architecture for all I/O devices, intelligently directing incoming data to the LLC when there is space and writing data to memory when the LLC is oversubscribed. This avoids evictions prior to processing.
To achieve this result, the Shared Resource Monitor (SRM) of the present specification extends existing LLC monitoring capability to distinguish the case in which lines are evicted to memory from the case in which lines are processed by the CPU. Note that in a non-inclusive cache hierarchy, lines disappear from the LLC in both cases, even though the two essentially represent opposite results (i.e., the LLC is oversubscribed in the former case and undersubscribed in the latter). The SRM of the present specification provides a unified I/O architecture that achieves high performance and significantly reduces memory bandwidth and energy consumption, thereby reducing latency in computing operations.
Fig. 2 is a block diagram illustrating a system of memory transactions without DDIOs in accordance with one or more examples of the present specification.
In the example of FIG. 2, processor socket 202 includes eight cores, namely core 204-0, core 204-1, core 204-2, core 204-3, core 204-4, core 204-5, core 204-6, and core 204-7. Cores 204
In this case, the NIC 212 is communicatively coupled to the processor socket 202, and the system further includes a memory 220.
In this example, NIC 212 may not be able to write data directly to
At
In operation 4, data is fetched from
Fig. 3 is a block diagram of a memory transaction to support and enable DDIOs according to one or more examples of the present specification.
In the example of FIG. 3, processor socket 302 includes eight cores, namely core 304-0, core 304-1, core 304-2, core 304-3, core 304-4, core 304-5, core 304-6, and core 304-7. Cores 304
In this example, at
Thus, writing data directly to the LLC via DDIO is advantageous when the data is not evicted from the LLC before the core can process it. In many cases, however, the
FIGS. 4 and 5 are signal flow diagrams illustrating placement operations that may be used, for example, in a high performance computing fabric such as Intel Omni-Path. As described above, these types of fabrics may use write updates for DDIO. In a write-update process, when new data is written to a cache block, the new data is broadcast to all caches containing the same cache line. This ensures that all caches are updated with the new value of the data. Conceptually, in a configuration such as the one shown in FIGS. 4 and 5, when a put message is received, the CHA simply checks whether the current line is present in the LLC. If it is a hit, the line is written to the LLC. If it is a miss, the line is DMA'd directly to main memory, bypassing the LLC. However, as mentioned above, when the LLC is undersubscribed this is not really the desired behavior, since the data would be better allocated into the LLC, as in the case of the Ethernet NIC.
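The write-update check the CHA performs on a put message can be sketched as a simple software model. This is illustrative only; the addresses and the set-based model of the LLC and memory are assumptions, and the real check happens in hardware.

```python
# Illustrative model of write-update DDIO handling of a put message:
# update the LLC on a hit, bypass to main memory on a miss.

def handle_put(line_addr: int, llc: set, memory: set) -> str:
    """Return where the incoming line was placed ('LLC' or 'memory')."""
    if line_addr in llc:
        return "LLC"        # hit: the line is updated in place in the LLC
    memory.add(line_addr)   # miss: the line is DMA'd to main memory
    return "memory"

llc = {0x100, 0x140}  # lines currently resident in the LLC (assumed)
mem = set()
print(handle_put(0x100, llc, mem))  # hit
print(handle_put(0x200, llc, mem))  # miss
```

Note that on a miss this model never allocates into the LLC, which is exactly the behavior the specification identifies as suboptimal when the LLC is undersubscribed.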
Turning to fig. 4, remote HFI 404 first issues an Omni-Path put message to local HFI 408. The local HFI 408 then performs a memory push write operation to the
In parallel, local HFI 408 may return an Omni-Path acknowledgement to remote HFI 404, and remote HFI 404 may respond with an Omni-Path end-to-end (E2E) acknowledgement.
Turning to fig. 5, remote HFI 404 issues an Omni-Path put message to local HFI 408. The local HFI 408 issues memory push writes of n bytes of data to the
As previously described, the local HFI 408 acknowledges the response with Omni-Path and the remote HFI 404 acknowledges the response with Omni-Path E2E.
Fig. 6 is a flow diagram of a method 600 of monitoring a cache according to one or more examples of the present description. In the example of fig. 6, all counters are initialized to 0 at the beginning.
At block 604, the row arrives at the CHA from the I/O device or from some other source, as described throughout the specification.
In decision block 608, the CHA checks whether this line maps to a set that is part of the sample set. For example, the CHA may sample approximately 2% of the available sets in the cache. Those 2% of all sets are considered sufficiently representative of the entire cache to support policy decisions for the complete cache.
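One way to pick such a sample set is sketched below. The every-Nth-set scheme is an assumption for illustration; real hardware may select sample sets differently:

```python
def in_sample_set(set_index, sample_ratio=0.02):
    """Return True if this cache set belongs to the sample set.

    Hypothetical scheme: every Nth set is sampled, with N chosen so
    that roughly `sample_ratio` of all sets are covered.
    """
    stride = round(1 / sample_ratio)   # e.g., every 50th set for 2%
    return set_index % stride == 0
```

For a 2048-set cache this marks about 2% of the sets, matching the sampling fraction mentioned above.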
If the transaction hits one of the sample sets, then in block 612, the data is allocated directly into the LLC according to the description herein. In response to the transaction, counters such as La, Lp, and Le may be incremented as described herein.
If the transaction does not hit one of the sample sets, then in decision block 620, the CHA determines whether the cache is currently over-utilized (as determined by the current cache policy). This may be based on a policy set from the sampled cache transactions in the previous window. If the current policy indicates that the cache (or, in particular, the LLC) is not over-utilized, then in block 616 the memory controller installs the block directly into the LLC according to the policy.
Returning to decision block 620, if the cache is currently over-utilized, the device writes the data directly to memory. In block 624, the memory controller writes the data to memory according to the policy.
In block 698, the method is complete.
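The flow of method 600 (blocks 604 through 624) can be sketched as follows. The data structures and the `sampled_fn` predicate are assumptions for illustration; real hardware operates on cache sets and lines, not Python lists:

```python
def handle_incoming_line(set_index, sampled_fn, cache_over_utilized,
                         counters, llc, memory, line):
    """Route one incoming I/O line per the current sampling policy."""
    if sampled_fn(set_index):        # decision block 608
        llc.append(line)             # block 612: allocate into the LLC
        counters["La"] += 1          # counters updated for sample sets
        return "LLC"
    if not cache_over_utilized:      # decision block 620
        llc.append(line)             # block 616: install into the LLC
        return "LLC"
    memory.append(line)              # block 624: write directly to memory
    return "memory"
```

The key property is that lines mapping to sample sets are always allocated into the LLC, so the counters keep measuring cache dynamics even while the policy diverts other traffic to memory.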
Fig. 7 is a block diagram of selected elements of a
In the example of fig. 7,
By way of non-limiting example, embodiments of the
By way of non-limiting example, the
As a non-limiting example, the
By way of non-limiting example,
As shown in FIG. 7,
Counter La 712 is used to indicate that a cache line has been allocated. Whenever a cache line is allocated into the LLC, La 712 is incremented.
Counter Lp 716 is used to indicate that the CPU has processed a cache line. Whenever a cache line is processed by the CPU and deallocated from the LLC, Lp 716 is incremented, indicating a case in which the data has not been evicted from the LLC, but has been consumed and is therefore no longer needed.
Counter Le 720 is used to count cache lines evicted from the LLC. Whenever a cache line that the CPU has not touched is evicted from the LLC to system memory, Le 720 is incremented. During a particular time window, which may be a small time window, these three counters may be used to describe the cache dynamics. If the CPU is fast enough to process all incoming I/O packets, then Lp should be approximately equal to La, and Le should be very small or close to 0. On the other hand, if I/O arrives so quickly that it overwhelms the CPU's processing capacity, Le will be relatively significant, and Lp will be less than La.
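The window heuristic described above can be sketched as follows. This is an illustrative policy only: the 10% eviction threshold and the 90% processing ratio are assumed values, not taken from the source.

```python
def direct_write_policy(La, Lp, Le, eviction_threshold=0.1):
    """Decide the DDIO write policy from one window's counters.

    If untouched-line evictions (Le) are a small fraction of
    allocations (La) and processing (Lp) keeps pace with allocation,
    the CPU is keeping up and I/O may continue writing into the LLC.
    Otherwise, steer direct writes to memory instead.
    """
    if La == 0:
        return "write-to-llc"          # no I/O traffic in this window
    cpu_keeps_up = (Le / La) <= eviction_threshold and Lp >= 0.9 * La
    return "write-to-llc" if cpu_keeps_up else "write-to-memory"
```

At the end of each window the counters would be reset so that each policy decision reflects only recent cache dynamics.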
FIG. 8 is a block diagram of general-purpose ways and I/O ways in an LLC according to one or more examples of the present specification. The general-purpose ways 804 may be allocated for general core usage, while the I/O ways 808 may be allocated exclusively for I/O usage. These allocations may reduce latency and improve overall system performance and energy.
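Such a partition can be sketched with capacity bitmasks, in the spirit of the way allocation above. The 12-way size and the exact mask layout are assumptions for illustration (Intel Cache Allocation Technology uses similar bitmasks, but this is not its implementation):

```python
NUM_WAYS = 12
GENERAL_WAYS = 0b111111110000   # ways 4-11: general core usage (804)
IO_WAYS      = 0b000000001111   # ways 0-3: reserved for I/O (808)

def allowed_ways(is_io_fill, mask_general=GENERAL_WAYS, mask_io=IO_WAYS):
    """Return the way indices in which this fill may be placed."""
    mask = mask_io if is_io_fill else mask_general
    return [w for w in range(NUM_WAYS) if mask & (1 << w)]
```

Because the two masks are disjoint, I/O fills can never evict core-owned lines, which is the latency benefit described above.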
Fig. 9 is a block diagram of selected components of a
As non-limiting examples, CSP 902 may be a traditional enterprise data center, an enterprise "private cloud," or a "public cloud" providing services such as infrastructure as a service (IaaS), platform as a service (PaaS), or software as a service (SaaS). In some cases, CSP 902 may provide an HPC platform or service instead of, or in addition to, cloud services. Indeed, although not expressly identical, an HPC cluster ("supercomputer") may be structurally similar to a cloud data center, and the teachings of this specification may be applied to either, except where expressly noted.
The CSP 902 may supply some number of workload clusters 918, which may be clusters of individual servers, blade servers, rack-mounted servers, or any other suitable server topology. In this illustrative example, two workload clusters 918-1 and 918-2 are shown, each providing a rack-mounted
In this illustration, workload cluster 918 is shown as a modular workload cluster conforming to the rack unit ("U") standard, where a 19 inch wide standard rack can be constructed to accommodate 42 units (42U), each 1.75 inches high and about 36 inches deep. In this case, computing resources such as processors, memory, storage, accelerators, and switches may fit into some multiple rack units from one to 42.
However, other embodiments are also contemplated. For example, FIG. 13 shows a Rack Scale Design (RSD). In an RSD, the rack may be built to standard 42U dimensions, but may be configured to accommodate line-replaceable units (LRUs) in the form of compute "sleds". In an RSD, a sled need not be built to standard rack-unit dimensions. For example, a sled housing a processor may fit a standard 3U configuration, or it may be of any size according to preferred design considerations. In an RSD, an entire pre-populated rack of resources may be provided as a unit, with the rack hosting a plurality of resource sleds, including elements such as compute (e.g., processors), memory, storage, accelerators, power supplies, networking, and others. Each sled chassis may house multiple LRUs, each embodied in a separate sled. If a resource fails, the LRU hosting that resource may be pulled and a new LRU modularly inserted. The failed LRU may then be repaired or discarded, depending on the nature of the failure. RSD is particularly, though not exclusively, beneficial in the case of a software-defined infrastructure (SDI), in which composite nodes may be built from disaggregated resources. A large pool of resources may be provided, and an SDI orchestrator may allocate them to composite nodes as needed.
Each
Various devices in a data center may be connected to one another via a
In addition, as the capabilities of each
To accommodate the large volume of traffic in a data center, a high
The fabric itself may be provided by any suitable interconnect. For example, each
Interconnect technologies may be provided by a single interconnect or a hybrid interconnect, such as PCIe to provide on-chip communications, 1Gb or 10Gb copper ethernet to provide a relatively short connection to ToR switch 920, and fiber optic cable to provide a relatively long connection to
Note that while a high-end fabric such as an OPA is provided by way of illustration herein, more generally,
In certain embodiments,
Fig. 10 is a block diagram of an end-
As described above, the
In this example, a
As shown,
Node 0 1008 is a processing node that includes
On-board communication between
Node 0 1008 is connected to
Because the
Note that in a "the network is the device" approach, node 0 1008 may provide limited or no on-board memory or storage. Rather, node 0 1008 may rely primarily on distributed services, such as memory servers and networked storage servers. On board, node 0 1008 may provide only enough memory and storage to boot the device and communicate it with
In this example,
Similarly, a storage server node 2 1010 may be provided instead of providing node 0 1008 with an onboard hard disk. The
Thus, in performing its designated functions, node 0 1008 may access memory from
By way of further illustration, node 3 1006 is also depicted. Node 3 1006 also includes
In other embodiments, the accelerator may also be provided as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), co-processor, Graphics Processing Unit (GPU), Digital Signal Processor (DSP), or other processing entity, which may optionally be tuned or configured to provide accelerator functionality.
The basic building blocks of the various components disclosed herein may be referred to as "logic elements." Logic elements may include hardware (including, for example, a software-programmable processor, ASIC, or FPGA), external hardware (digital, analog, or mixed-signal), software, reciprocating software, services, drivers, interfaces, components, modules, algorithms, sensors, firmware, microcode, programmable logic, or objects that can coordinate to achieve a logical operation. Additionally, some logic elements are provided by a tangible, non-transitory computer-readable medium having stored thereon executable instructions that instruct a processor to perform a task. Such a non-transitory medium may include, for example, a hard disk, solid-state memory or disk, read-only memory (ROM), persistent fast memory (PFM) (e.g., Intel 3D Crosspoint™), external storage, redundant array of independent disks (RAID), redundant array of independent nodes (RAIN), network-attached storage (NAS), optical storage, tape drive, backup system, cloud storage, or any combination of the foregoing, by way of non-limiting example. Such a medium may also include instructions programmed into a field-programmable gate array (FPGA), or instructions encoded in hardware on an ASIC or processor. In embodiments of the present disclosure, an FPGA is a configurable logic circuit in which gates may be dynamically arranged to implement logic that operates at speeds similar to logic implemented in an integrated circuit.
FIG. 11 is a block diagram of components of computing platform 1102A, according to one or more examples of the present description. Embodiments of the computing platform 1102A disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of the present specification.
In the depicted embodiment, platforms 1102A, 1102B, and 1102C, as well as
In various embodiments of the present disclosure, the cumulative stress and/or rate of cumulative stress of a plurality of hardware resources (e.g., cores and uncores) is monitored, and an entity of computer platform 1102A (e.g.,
Each of CPUs 1112 may include any suitable number of processor cores and support logic (e.g., uncore). The cores may be coupled to each other, to memory 1114, to at least one chipset 1116, and/or to
The memory 1114 may include any form of volatile or non-volatile memory including, but not limited to, magnetic media (e.g., one or more tape drives), optical media, Random Access Memory (RAM), ROM, flash memory, removable media, or any other suitable local or remote memory component or components. The memory 1114 may be used for short-term, medium-term, and/or long-term storage of the platform 1102A. Memory 1114 may store any suitable data or information used by platform logic 1110, including software embedded in a computer-readable medium and/or encoded logic incorporated in hardware or otherwise stored (e.g., firmware). Memory 1114 may store data used by the cores of CPU 1112. In some embodiments, memory 1114 may also include storage for instructions that may be executed by cores or other processing elements of CPU1112 (e.g., logic resident on chipset 1116) to provide functionality associated with manageability engine 1126 or other components of platform logic 1110. Platform 1102 may also include one or more chipsets 1116 including any suitable logic to support the operation of CPU 1112. In various embodiments, chipset 1116 may reside on the same die or package as CPU1112 or on one or more different die or packages. Each chipset may support any suitable number of CPUs 1112. Chipset 1116 may also include one or more controllers to couple other components of platform logic 1110 (e.g.,
In various embodiments, manageability engine 1126 operates as an out-of-band asynchronous computing agent that is capable of interfacing with various elements of platform logic 1110 to collect telemetry data with no or minimal disruption to running processes on CPU 1112. Manageability engine 1126 may, for example, comprise a dedicated processing element (e.g., a processor, controller, or other logic) on chipset 1116 that provides the functionality of manageability engine 1126 (e.g., by executing software instructions), thus saving the processing cycles of CPU1112 for operations associated with the workload executed by platform logic 1110. Further, dedicated logic for manageability engine 1126 may operate asynchronously with respect to CPU1112 and may collect at least some telemetry data without increasing the load on the CPU.
Manageability engine 1126 may process the telemetry data it collects (specific examples of the processing of stress information are provided herein). In various embodiments, manageability engine 1126 reports the data it collects and/or the results of its processing to other elements in the computer system, such as one or more hypervisors 1120 or other operating systems and/or system management software (which may run on any suitable logic, such as system management platform 1106). In particular embodiments, a critical event, such as a core that has accumulated excessive stress, may be reported prior to the normal interval for reporting telemetry data (e.g., a notification may be sent immediately upon detection).
In addition, manageability engine 1126 may include configurable programmable code to set which CPU(s) 1112 a particular chipset 1116 manages and/or which telemetry data may be collected.
Each of the chipsets 1116 also includes a communication interface 1128. Communication interface 1128 may provide for the communication of signaling and/or data between chipset 1116 and one or more I/O devices, one or
In particular embodiments, communication interface 1128 may allow communication of data associated with the management and monitoring functions performed by manageability engine 1126 (e.g., between manageability engine 1126 and data center management platform 1106). In various embodiments, manageability engine 1126 may utilize elements of communication interface 1128 (e.g., one or more NICs) to report telemetry data (e.g., to system management platform 1106) in order to reserve the use of the NIC of
Switch 1130 may be coupled to various ports of communication interface 1128 (e.g., provided by a NIC) and may exchange data between those ports and various components of chipset 1116, such as one or more peripheral component interconnect express (PCIe) lanes coupled to CPU 1112. Switch 1130 may be a physical or virtual (i.e., software) switch.
Platform logic 1110 may include additional communication interfaces 1118. Similar to communication interface 1128,
The platform logic 1110 may receive and execute any suitable type of workload. The workload may include any request to utilize one or more resources of the platform logic 1110, such as one or more cores or associated logic, for example, of the platform logic 1110. For example, the workload may include a request to instantiate a software component, such as an I/O device driver 1124 or a client system 1122, a request to process a network packet received from a
SFCs 1136 are groups of
Hypervisor 1120 (also referred to as a virtual machine monitor) may include logic to create and run guest systems 1122. Hypervisor 1120 may present guest operating systems run by virtual machines with virtual operating platforms (i.e., it shows to the virtual machines that they run on separate physical nodes when they are actually merged onto a single hardware platform) and manage the execution of the guest operating systems through platform logic 1110. The services of hypervisor 1120 may be provided through virtualization in software or through hardware-assisted resources that require minimal software intervention, or both. Multiple instances of various guest operating systems may be managed by hypervisor 1120. Each platform 1102 may have a separate instance of a hypervisor 1120.
Hypervisor 1120 may be a native or bare-metal hypervisor that runs directly on platform logic 1110 to control the platform logic and manage the guest operating systems. Alternatively, hypervisor 1120 may be a hosted hypervisor running on a host operating system and abstracting the guest operating system from the host operating system. Hypervisor 1120 may include a virtual switch 1138, which may provide virtual switching and/or routing functionality to virtual machines of guest system 1122. Virtual switch 1138 can include a logical switching fabric that couples the vnics of
Virtual switch 1138 may include software elements that are executed using components of platform logic 1110. In various embodiments, hypervisor 1120 may communicate with any suitable entity (e.g., an SDN controller), which may cause hypervisor 1120 to reconfigure parameters of virtual switch 1138 in response to changing conditions in platform 1102 (e.g., addition or deletion of
The hypervisor 1120 may also include
Any suitable logic may make one or more of these optimization decisions. For example, the
The elements of platform logic 1110 may be coupled together in any suitable manner. For example, a bus may couple any of the components together. A bus may comprise any known interconnect, such as a multi-drop bus, a mesh interconnect, a ring interconnect, a point-to-point interconnect, a serial interconnect, a parallel bus, a coherent (e.g., cache-coherent) bus, a layered protocol architecture, a differential bus, or a Gunning Transceiver Logic (GTL) bus.
In embodiments of the present description, cache coherency is a memory architecture that provides uniform sharing and mapping between multiple caches. For example, the caches may map to the same address space. If two different caches cache the same address in a shared address space, the coherence agent provides logic (hardware and/or software) to ensure compatibility and coherence of the shared resource. For example, if two caches have cached the same address, the coherency agent ensures that changes are propagated to one cache when the value stored in the address is updated in the other cache. For example, coherency may be maintained via "snooping," where each cache monitors the address lines of each other cache and detects updates. Cache coherency may also be maintained via a directory-based system, where shared data is placed in a shared directory where coherency is maintained. Some distributed shared memory architectures may also provide coherency, for example by emulating the aforementioned mechanisms.
Coherency may be based on "snooping" or directory. In a snooping protocol, coherency may be maintained by a write invalidate, where a first cache snooping a write to the same address in a second cache invalidates its own copy. If the program attempts to read the value from the first cache, this forces a read from memory. Alternatively, in a write update, the first cache snoops writes to the second cache, and the cache controller (which may include a coherence agent) copies out the data and updates the copy in the first cache.
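The two snooping policies can be contrasted with a toy model. This is illustrative only: real coherence operates on cache lines through a coherence agent, not on Python dictionaries, and `None` here simply marks an invalidated copy:

```python
def write_invalidate(writer, others, addr, value):
    """Writer updates its copy; snooping caches invalidate theirs."""
    writer[addr] = value
    for cache in others:
        if addr in cache:
            cache[addr] = None     # forces a later re-read from memory

def write_update(writer, others, addr, value):
    """Writer updates its copy; snooping caches copy in the new data."""
    writer[addr] = value
    for cache in others:
        if addr in cache:
            cache[addr] = value
```

Write invalidate trades extra read misses for less broadcast traffic; write update keeps all copies current at the cost of broadcasting every write, which is the behavior the Omni-Path put flow above relies on.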
By way of non-limiting example, current cache coherency models include MSI (modified, shared, invalid), MESI (modified, exclusive, shared, invalid), MOSI (modified, owned, shared, invalid), MOESI (modified, owned, exclusive, shared, invalid), MERSI (modified, exclusive, read-only or recent, shared, invalid), MESIF (modified, exclusive, shared, invalid, forward), write-once, Synapse, Berkeley, Firefly, and Dragon protocols. Further, an ARM processor may use the Advanced Microcontroller Bus Architecture (AMBA) 4 ACE to provide cache coherency in a system on a chip (SoC) or elsewhere.
The elements of computer platform 1102A may be coupled together in any suitable manner, such as through one or
Fig. 12 illustrates a block diagram of a CPU1212, according to one or more examples of this specification. Embodiments of the CPU1212 disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of this specification. In particular, the CPU1212 may include an SRM, as shown in fig. 7.
Although the CPU1212 depicts a particular configuration, the cores and other components of the CPU1212 may be arranged in any suitable manner. CPU1212 may include any processor or processing device, such as a microprocessor, embedded processor, DSP, network processor, application processor, co-processor, SoC, or other device executing code. In the depicted embodiment, CPU1212 includes four processing elements (cores 1230 in the depicted embodiment), which may include asymmetric processing elements or symmetric processing elements. However, CPU1212 may include any number of processing elements that may be symmetric or asymmetric.
Examples of hardware processing elements include: a thread unit, thread slot, thread, processing unit, context unit, logical processor, hardware thread, core, and/or any other element capable of maintaining a state of a processor, such as an execution state or an architectural state. In other words, in one embodiment, a processing element refers to any hardware capable of being independently associated with code, such as a software thread, operating system, application, or other code. A physical processor (or processor socket) generally refers to an integrated circuit, which potentially includes any number of other processing elements, such as cores or hardware threads.
A core may refer to logic located on an integrated circuit capable of maintaining an independent architectural state, where each independently maintained architectural state is associated with at least some dedicated execution resources. A hardware thread may refer to any logic located on an integrated circuit capable of maintaining an independent architectural state, wherein the independently maintained architectural states share access to execution resources. The physical CPU may include any suitable number of cores. In various embodiments, the cores may include one or more out-of-order processor cores or one or more in-order processor cores. However, the cores may be individually selected from any type of core, such as a native core, a software management core, a core adapted to execute a native Instruction Set Architecture (ISA), a core adapted to execute a translated ISA, a co-designed (co-designed) core, or other known core. In a heterogeneous core environment (i.e., asymmetric cores), some form of translation, such as binary translation, may be utilized to schedule or execute code on one or both cores.
In the depicted embodiment,
Front end 1270 may include a decode module coupled to fetch logic to decode fetched elements. In one embodiment, the fetch logic includes a separate sequencer associated with a thread slot of core 1230. Generally, core 1230 is associated with a first ISA that defines/specifies instructions executable on core 1230. Machine code instructions, which are typically part of the first ISA, include a portion of an instruction (referred to as an opcode) that references/specifies an instruction or operation to be performed. The decode module may include circuitry to recognize these instructions from their opcodes and to pass the decoded instructions in the pipeline for processing as defined by the first ISA. In one embodiment, the decoders of core 1230 recognize the same ISA (or a subset thereof). Alternatively, in a heterogeneous core environment, decoders of one or more cores (e.g.,
In the depicted embodiment, the out-of-order engine includes a dispatch unit 1282 to receive decoded instructions from front-end unit 1270, which may be one or more microinstructions or micro-operations (uops), and to allocate them to appropriate resources such as registers. Next, the instructions are provided to a
In the depicted embodiment, both the front end unit 1270 and the out-of-
In particular embodiments,
CPU1212 may also include a Power Control Unit (PCU) 1240. In various embodiments, PCU1240 may control the supply voltage and operating frequency applied to each of the cores (on a per core basis) and to the uncore. PCU1240 may also instruct a core or an uncore to enter an idle state (where no voltage and clock are provided) when no workload is executing.
In various embodiments, PCU 1240 may detect one or more stress characteristics of a hardware resource, such as a core or an uncore. A stress characteristic may comprise an indication of an amount of stress being placed on the hardware resource. As examples, a stress characteristic may be a voltage or frequency applied to the hardware resource; a power level, current level, or voltage level sensed at the hardware resource; a temperature sensed at the hardware resource; or another suitable measurement. In various embodiments, multiple measurements of a particular stress characteristic may be performed (e.g., at different locations) when the stress characteristic is sensed at a particular time. In various embodiments, PCU 1240 may detect stress characteristics at any suitable interval.
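One way to turn interval readings into the cumulative-stress figure mentioned earlier can be sketched as follows. The accumulation scheme and threshold are hypothetical, invented for illustration only:

```python
def accumulate_stress(samples, threshold):
    """Integrate per-interval stress readings (e.g., temperature or
    current) into a cumulative figure, flagging the resource as soon
    as the total crosses a reporting threshold (a critical event)."""
    total = 0.0
    for reading in samples:
        total += reading
        if total > threshold:
            return total, True     # report immediately, out of band
    return total, False
```

A real PCU would also weight readings by interval length and decay old stress over time; those refinements are omitted here.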
In various embodiments, PCU 1240 is a component separate from core 1230. In particular embodiments, PCU 1240 operates at a clock frequency different from that used by core 1230. In some embodiments, PCU 1240 is a microcontroller that executes instructions according to an ISA different from the ISA used by core 1230.
In various embodiments, the CPU1212 may also include a
As shown in FIG. 12, the CPU1212 may further include an
Fig. 13 is a block diagram of a Rack Scale Design (RSD) 1300 according to one or more examples of the present description. Embodiments of the RSD 1300 disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of the present specification.
In this example, the RSD 1300 includes a single rack 1304 to illustrate certain principles of RSD. It should be understood that the RSD 1300 may include many such racks, and the racks need not be identical to one another. In some cases, a multi-purpose rack such as rack 1304 may be provided, while in other examples single-purpose racks may be provided. For example, rack 1304 may be considered a highly inclusive rack that includes resources that may be used to allocate a large number of composite nodes. On the other hand, other examples may include racks dedicated solely to compute sleds, storage sleds, memory sleds, and other resource types, which together may be integrated into composite nodes. Thus, rack 1304 of FIG. 13 should be understood as a non-limiting example of a rack that may be used in an RSD 1300.
In the example of fig. 13, the rack 1304 may be a standard rack having an outer width of about 23.6 inches and a height of 78.74 inches. In common use, this is referred to as a "42U rack". However, the rack 1304 need not conform to the "rack unit" standard. Rather, the chassis 1304 may include multiple chassis optimized for its purpose.
The rack 1304 may be marketed and sold as a monolithic unit, with multiple LRUs within each chassis. In this case, the LRUs may be sleds, and thus may be easily swapped out when replacement is needed.
In this example, the chassis 1304 includes a
The storage chassis 1316 includes a plurality of
Over time, the various LRUs within the rack 1304 may become damaged, outdated, or may experience functional errors. When this occurs, the LRU may be pulled and replaced with a compatible LRU, thus allowing the rack to continue full scale operation.
Fig. 14 is a block diagram of a Software Defined Infrastructure (SDI) data center 1400 in accordance with one or more examples of the present description. Embodiments of the SDI data center 1400 disclosed herein may be adapted or configured to provide a method of cache monitoring according to the teachings of the present specification.
Some applications hosted within SDI data center 1400 may use a set of resources to achieve their intended purpose, such as processing database queries, serving web pages, or providing computer intelligence.
Some applications tend to be sensitive to a particular subset of resources. For example, SAP HANA is an in-memory, column-oriented relational database system. An SAP HANA database may use processors, memory, disk, and fabric, while being most sensitive to memory and processors. In one embodiment, composite node 1402 includes one or more cores 1410 that perform the processing function. Node 1402 may also include a caching agent 1406 that provides access to a cache. One or more applications 1414 run on node 1402 and communicate with the SDI fabric via FA 1418. Dynamically provisioning node 1402 with resources may include selecting a set of resources and ensuring that the quantity and quality provisioned meet required performance metrics, such as service-level agreements (SLAs) and quality of service (QoS). Resource selection and allocation for applications 1414 may be performed by a resource manager, which may be implemented in the orchestration and system software stack 1422. By way of non-limiting example, throughout this specification the resource manager may be treated as though it is implemented separately or by an orchestrator. Note that many different configurations are possible.
In SDI data centers, applications may be executed by composite nodes, such as node 1402, that are dynamically allocated by SDI manager 1480. Such nodes are referred to as composite nodes because their resources are not necessarily co-located. Rather, they may comprise resources distributed in different parts of the data center, dynamically allocated, and virtualized to the specific application 1414.
In this example, memory resources from three memory sleds of the memory rack 1430 are allocated to the node 1402, memory resources from four memory sleds of the memory rack 1434 are allocated, and additional resources from five resource sleds of the resource rack 1436 are allocated to the application 1414 running on the composite node 1402. All of these resources may be associated and aggregated with a particular computing sled to create a composite node. Once the compound node is created, the operating system can be booted in node 1402, and applications can begin running using the aggregated resources as if they were physically collocated resources. As described above, FA 1418 may provide certain interfaces that enable this operation to occur seamlessly with respect to node 1402.
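The aggregation described above can be sketched as follows. The sled counts match the example (three storage sleds, four memory sleds, five resource sleds), but the data model and function names are assumptions for illustration, not the SDI manager's actual interface:

```python
def compose_node(sled_pools, request):
    """Claim `request[kind]` sleds of each resource kind for one node."""
    node = {}
    for kind, count in request.items():
        available = sled_pools[kind]
        if len(available) < count:
            raise ValueError(f"not enough {kind} sleds")
        node[kind] = [available.pop() for _ in range(count)]
    return node
```

Once the sleds are claimed, the operating system boots on the compute sled and the FA presents the remote resources as though they were local, as described above.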
As a general proposition, the more memory and compute resources that are added to a database processor, the better throughput it can achieve. However, this does not necessarily apply to disk or fabric. Adding more disk and fabric bandwidth may not necessarily increase the performance of the SAP HANA database above a certain threshold.
SDI data center 1400 may address scaling of resources by mapping an appropriate amount of off-board (offboard) resources to an application based on application requirements provided by a user or network administrator, or directly by the application itself. This may include allocating resources from various resource racks, such as the storage rack 1430, the storage rack 1434, and the resource rack 1436.
In an example, SDI controller 1480 also includes a resource protection engine (RPE) 1482, which is configured to assign permissions for various target resources to the decomposed compute resources (DRCs) that are allowed to access them. In this example, the permissions are enforced by the FAs servicing the target resources.
In certain embodiments, elements of the SDI data center 1400 may be adapted or configured to operate with the disaggregated telemetry model of the present description.
FIG. 15 is a block diagram of a data center according to one or more examples of the present specification.
In the example of FIG. 15, two composite nodes are defined: composite node 1534-1 and composite node 1534-2.
According to the teachings of this specification,
Note that in various embodiments, the computing sled 1526-1 may also provide local memory, storage, accelerators, or other resources for the processor 1530-1. However, in accordance with the SDI teachings of the present specification, certain resources allocated to the composite node 1534 may also be disaggregated, or physically remote from the processor 1530. In this example, each composite node 1534 has assigned to it one or
It should be noted that for simplicity and clarity of illustration, only selected components are disclosed in this illustration. However, other disaggregated resources may also be provided. For example, the
In this example, the composite node 1534-1 includes a processor 1530-1 on the computing sled 1526-1, which runs an application 1532-1 and accesses the fabric 1570 via the FA 1518-3. The composite node 1534-1 also includes an FPGA 1512-1 on the FPGA sled 1504-1, which runs FPGA core 1506-1 and has access to FPGA resources 1508-1. The FPGA sled 1504-1 can access the fabric 1570 via the FA 1518-1. Note that in this example, multiple FPGAs on FPGA sled 1504-1 may be connected to one another via a passive backplane, and a single FA 1518-1 may be provided for the entire sled. The composite node 1534-1 may also have access to a memory block 1524-1 on the
The composite node 1534-2 includes a processor 1530-2 running an application 1532-2 on a computing sled 1526-2. The computing sled 1526-2 is connected to the fabric 1570 via the FA 1518-4. Note that the computing sled 1526 may also include multiple processors, memories, and other local resources that may be communicatively coupled to each other via a passive backplane and share a common FA 1518. The composite node 1534-2 also includes a core 1506-3 running on the shared FPGA 1512-2 and has access to the shared resources 1508-2. The composite node 1534-2 may store data on the memory block 1524-2.
The foregoing has outlined features of one or more embodiments of the subject matter disclosed herein. These embodiments are provided so that a person having ordinary skill in the art (PHOSITA) can better understand various aspects of the disclosure. Certain well-understood terminology, as well as underlying technologies and/or standards, may be referred to without detailed description. It is expected that the PHOSITA will have, or have access to, sufficient background knowledge of those technologies and standards to implement the teachings of this specification.
The PHOSITA will appreciate that they can readily use the present disclosure as a basis for designing or modifying other processes, structures, or variations for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. The PHOSITA will also recognize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from that spirit and scope.
In the foregoing description, certain aspects of some or all embodiments have been described in greater detail than is strictly necessary to practice the following claims. These details are provided by way of non-limiting example only, for the purpose of providing context and description for the disclosed embodiments. Such details are not to be understood as required, and are not to be read into the claims as limitations. This specification may refer to "an embodiment" or "embodiments." These phrases, and any other references to embodiments, should be construed broadly to refer to any combination of one or more embodiments. Furthermore, the features disclosed in a particular "embodiment" may also be spread across multiple embodiments. For example, if
This specification may provide illustrations in block diagram form, with some features being disclosed in separate blocks. These should be construed broadly to disclose how the various features interoperate, but are not intended to imply that those features must be necessarily embodied in separate hardware or software. Furthermore, where a single block discloses more than one feature in the same block, those features need not necessarily be embodied in the same hardware and/or software. For example, computer "memory" may in some cases be distributed or mapped between multiple levels of cache or local memory, main memory, battery-backed volatile memory, and various forms of persistent storage, such as hard disks, storage servers, optical disks, memory, tape drives, or the like. In some embodiments, some components may be omitted or combined. In a general sense, the arrangements depicted in the figures may be more logical in their representations, while a physical architecture may include various permutations, combinations, and/or hybrids of these elements. The operational goals outlined herein may be achieved using a myriad of possible design configurations. Thus, the associated infrastructure has myriad alternative arrangements, design choices, device possibilities, hardware configurations, software implementations, and device options.
Reference may be made herein to computer-readable media, which may be tangible and non-transitory computer-readable media. As used in this specification and throughout the claims, a "computer-readable medium" should be taken to include one or more computer-readable media of the same or different types. By way of non-limiting example, the computer-readable medium may comprise an optical drive (e.g., CD/DVD/Blu-ray), a hard disk drive, a solid state drive, flash memory, or other non-volatile media. The computer-readable medium may also include media such as ROM, FPGA or ASIC configured to execute desired instructions, stored instructions for programming the FPGA or ASIC to execute desired instructions, IP blocks that may be integrated into other circuits in hardware, or instructions encoded directly into hardware or into microcode on a processor such as a microprocessor, DSP, microcontroller, or any other suitable component, device, element or object as appropriate and based on particular needs. Non-transitory storage media herein are expressly intended to include any non-transitory dedicated or programmable hardware configured to provide the disclosed operations or cause a processor to perform the disclosed operations.
Throughout this specification and claims, various elements may be "communicatively", "electrically", "mechanically", or otherwise "coupled" to one another. Such coupling may be a direct point-to-point coupling or may include intermediate devices. For example, two devices may be communicatively coupled to each other via a controller that facilitates communication. The devices may be electrically coupled to each other via intermediate devices such as signal boosters, voltage dividers, or buffers. Mechanically coupled devices may be indirectly mechanically coupled.
Any "module" or "engine" disclosed herein may refer to or include software, a software stack, hardware, firmware, and/or a combination thereof, circuitry configured to perform the functions of the engine or module, or any of the computer-readable media disclosed above. Such modules or engines may, where appropriate, be provided on or in conjunction with a hardware platform, which may include hardware computing resources such as processors, memory, storage, interconnects, networks and network interfaces, accelerators, or other suitable hardware. Such a hardware platform may be provided as a single monolithic device (e.g., in a PC form factor), or with some or all of its functions distributed (e.g., a "composite node" in a high-end data center, where compute, memory, storage, and other resources may be dynamically allocated and need not be local to each other).
There may be flow charts, signal flow diagrams, or other illustrations showing operations disclosed herein performed in a particular order. Unless explicitly stated otherwise, or unless required in a particular context, the order should be understood as a non-limiting example only. Further, where one operation is shown following another, other intervening operations may also occur, which may or may not be related. Some operations may also be performed concurrently or in parallel. Where an operation is said to be "based on" or "in accordance with" another item or operation, this should be understood to imply that the operation is at least partially based on or in accordance with the other item or operation. This should not be construed as implying that the operation is based on, or in accordance with, only that item or operation.
All or part of any of the hardware elements disclosed herein may be readily provided in a SoC, including a CPU package. SoC represents an Integrated Circuit (IC) that integrates components of a computer or other electronic system into a single chip. Thus, for example, a client device or a server device may be provided in whole or in part in a SoC. The SoC may contain digital, analog, mixed signal, and radio frequency functions, all of which may be provided on a single chip substrate. Other embodiments may include a multi-chip module (MCM) in which multiple chips are located within a single electronic package and are configured to interact closely with each other through the electronic package.
In a general sense, any suitably configured circuit or processor may execute any type of instructions associated with data to implement the operations detailed herein. Any processor disclosed herein can transform an element or article (e.g., data) from one state or thing to another state or thing. Further, information tracked, sent, received, or stored in a processor may be provided in any database, register, table, cache, queue, control list, or storage construct, all of which may be referenced at any suitable time frame, based on particular needs and implementations. Any memory or storage element disclosed herein should be construed as being properly encompassed within the broad terms "memory" and "storage."
Computer program logic implementing all or part of the functionality described herein may be embodied in various forms, including, but not limited to, source code forms, computer-executable forms, machine instructions or microcode, programmable hardware, and various intermediate forms (e.g., tables generated by an assembler, compiler, linker, or locator). In an example, the source code includes a series of computer program instructions implemented in various programming languages, such as object code, assembly language, or a high-level language such as OpenCL, FORTRAN, C, C++, JAVA, or HTML, for use with various operating systems or operating environments, or in hardware description languages such as Spice, Verilog, and VHDL. The source code may define and use various data structures and communication messages. The source code may be in computer-executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into computer-executable form, or into an intermediate form such as bytecode. Any of the foregoing may be used, where appropriate, to construct or describe appropriate discrete or integrated circuits, whether sequential, combinatorial, state machine, or otherwise.
In an example embodiment, any number of the circuits of the figures may be implemented on a board of an associated electronic device. The board may be a generic circuit board that may hold various components of the internal electronic system of the electronic device and also provide connectors for other peripheral devices. Any suitable processor and memory may be suitably coupled to the board based on particular configuration requirements, processing needs, and computing design. Note that with the numerous examples provided herein, interaction may be described in terms of two, three, four, or more electronic components. However, this is done for clarity and example purposes only. It should be appreciated that the systems may be consolidated or reconfigured in any suitable manner. Along similar design alternatives, any of the illustrated components, modules, and elements of the drawings may be combined in various possible configurations, all of which are within the broad scope of this specification.
Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained by one skilled in the art, and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims. To assist the United States Patent and Trademark Office (USPTO), and additionally any reader of any patent issued on this application, in interpreting the claims appended hereto, applicant wishes to note that the applicant: (a) does not intend any of the appended claims to invoke paragraph six (6) of 35 U.S.C. section 112 (pre-AIA) or paragraph (f) of the same section (post-AIA), as it exists on the date of filing hereof, unless the words "means for" or "steps for" are specifically used in the particular claims; and (b) does not intend, by any statement in the specification, to limit this disclosure in any manner that is not otherwise expressly reflected in the appended claims.
Example implementation
The following examples are provided by way of illustration.
Example 1 includes a computing apparatus, comprising: a processor; a multi-level cache comprising a plurality of cache levels; a peripheral device configured to write data directly to a direct-writable cache; and cache monitoring circuitry comprising a cache counter La that is incremented when a cache line is allocated into the direct-writable cache, a cache counter Lp that is incremented when the processor processes a cache line and deallocates it from the direct-writable cache, and a cache counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory, wherein the cache monitoring circuitry determines a direct write policy according to the cache counters.
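The counter scheme of Example 1 can be modeled in software. The sketch below is a hypothetical illustration only: the counter names La, Lp, and Le come from the text, while the class and method names are assumptions, not the patent's hardware implementation.

```python
# Hypothetical software model of the cache monitoring circuitry of Example 1.
# Counter names (La, Lp, Le) come from the text; class/method names are assumed.

class CacheMonitor:
    def __init__(self):
        self.La = 0  # cache lines allocated into the direct-writable cache
        self.Lp = 0  # lines processed by the processor and deallocated
        self.Le = 0  # lines evicted to memory without being processed

    def on_allocate(self):
        self.La += 1

    def on_process_and_deallocate(self):
        self.Lp += 1

    def on_evict_unprocessed(self):
        # Per Example 2, Le counts only lines evicted without ever
        # being retrieved by the processor.
        self.Le += 1

    def reset(self):
        # Per Example 7, the counters reset after each policy window.
        self.La = self.Lp = self.Le = 0
```

A balanced workload would show Lp tracking La with Le near zero; a thrashing workload would show Le growing while Lp lags.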
Example 2 includes the computing apparatus of example 1, wherein Le is incremented only when the cache line is evicted without being retrieved by the processor.
Example 3 includes the computing apparatus of example 1, wherein the directly writable cache is a last-level cache.
Example 4 includes the computing apparatus of example 1, wherein the multi-level cache is a non-inclusive cache.
Example 5 includes the computing apparatus of example 1, wherein the cache counter is a sampling counter configured to sample fewer than all cache sets.
Example 6 includes the computing apparatus of example 5, wherein the cache counter is configured to sample approximately 2% of all cache sets.
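Sampling roughly 2% of the cache sets (Example 6) keeps the counter hardware small while remaining statistically representative. Below is a minimal sketch of one way sampled sets might be chosen; the stride-based selection and the set count are illustrative assumptions, not from the text.

```python
# Illustrative set-sampling sketch: monitor ~2% of cache sets rather than all.
# NUM_SETS and SAMPLE_STRIDE are assumed values for illustration.

NUM_SETS = 2048        # assumed number of sets in the direct-writable cache
SAMPLE_STRIDE = 64     # one of every 64 sets -> 1/64 = 1.5625%, near 2%

def is_sampled(set_index: int) -> bool:
    # A fixed stride gives a deterministic, evenly spread sample.
    return set_index % SAMPLE_STRIDE == 0

sampled = [s for s in range(NUM_SETS) if is_sampled(s)]
```

Only transactions that hit a sampled set would update the La, Lp, and Le counters; the resulting ratios are then treated as representative of the whole cache.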
Example 7 includes the computing apparatus of example 6, wherein the cache monitor circuitry is further to reset La, Lp, and Le after a policy window.
Example 8 includes the computing apparatus of example 6, wherein the cache monitor circuitry is to allocate all transactions on the sampled sets to the direct-writable cache.
Example 9 includes the computing apparatus of example 7, wherein the cache monitor circuitry is to allocate interim cache transactions according to the policy.
Example 10 includes the computing apparatus of example 1, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 11 includes the computing apparatus of example 1, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
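The decision rules of Examples 10 and 11 can be sketched as a simple threshold function. The numeric thresholds below are illustrative assumptions; the text says only "approximately equal", "at or near zero", and "substantially less".

```python
# Hypothetical sketch of the policy rule described in Examples 10 and 11.
# The `near` and `small` thresholds are assumed values, not from the text.

def direct_write_policy(La, Lp, Le, near=0.9, small=0.05):
    if La == 0:
        return "write_to_cache"  # no traffic observed; assumed default
    if Lp >= near * La and Le <= small * La:
        # Cache not over-utilized: lines are consumed before eviction,
        # so keep writing incoming data into the direct-writable cache.
        return "write_to_cache"
    if Lp < near * La and Le > small * La:
        # Cache over-utilized: unconsumed lines are being evicted,
        # so steer incoming data directly to memory instead.
        return "write_to_memory"
    return "write_to_cache"  # ambiguous region: keep the assumed default
```

For instance, (La, Lp, Le) = (100, 98, 1) keeps the direct-to-cache policy, while (100, 40, 55) switches incoming writes to memory.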
Example 12 includes the computing apparatus of example 1, wherein the peripheral device is an Ethernet network interface.
Example 13 includes the computing apparatus of example 1, wherein the peripheral is a cache-coherent core-to-core network interface.
Example 14 includes the computing apparatus of example 13, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 15 includes the computing apparatus of example 1, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 16 includes a shared resource monitor, comprising: an interface communicatively coupled to a core; an interface communicatively coupled to a cache; an interface communicatively coupled to a peripheral device; a cache counter La that is incremented when a cache line is allocated into the cache; a cache counter Lp that is incremented when the core processes a cache line and deallocates it from the cache; a cache counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory; and policy circuitry that determines a direct write policy according to the cache counters.
Example 17 includes the shared resource monitor of example 16, wherein the cache is a direct writable cache of a multi-level cache.
Example 18 includes the shared resource monitor of example 17, wherein the directly writable cache is a last-level cache.
Example 19 includes the shared resource monitor of example 17, wherein the multi-level cache is a non-inclusive cache.
Example 20 includes the shared resource monitor of example 16, wherein Le is incremented only when the cache line is evicted without being retrieved by the core.
Example 21 includes the shared resource monitor of example 16, wherein the cache counter is a sampling counter configured to sample fewer than all cache transactions.
Example 22 includes the shared resource monitor of example 21, wherein the cache counter is configured to sample approximately 2% of all cache transactions.
Example 23 includes the shared resource monitor of example 21, wherein the policy circuitry resets La, Lp, and Le after a policy window.
Example 24 includes the shared resource monitor of example 21, wherein the cache monitor circuitry is to allocate all sampled transactions directly to the cache.
Example 25 includes the shared resource monitor of example 24, wherein the cache monitor circuitry is to allocate interim cache transactions according to the policy.
Example 26 includes the shared resource monitor of example 16, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 27 includes the shared resource monitor of example 16, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
Example 28 includes the shared resource monitor of example 16, wherein the peripheral device is an Ethernet network interface.
Example 29 includes the shared resource monitor of example 16, wherein the peripheral is a cache-coherent core-to-core network interface.
Example 30 includes the shared resource monitor of example 29, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 31 includes the shared resource monitor of example 16, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 32 includes an Intellectual Property (IP) block comprising the shared resource monitor of any of examples 1-31.
Example 33 includes a Field Programmable Gate Array (FPGA) including the shared resource monitor of any of examples 1-31.
Example 34 includes an Integrated Circuit (IC) comprising the shared resource monitor of any of examples 1-31.
Example 35 includes a microprocessor including the IC of example 34.
Example 36 includes a system on a chip (SoC) including the microprocessor of example 35.
Example 37 includes a computer-implemented method of providing device direct input/output (DDIO), comprising: communicatively coupling to a shared last-level cache (LLC), the LLC being shared by a plurality of cores; communicatively coupling to a peripheral device; incrementing a cache counter La when a cache line is allocated into the LLC; incrementing a cache counter Lp when a core processes a cache line and deallocates it from the cache; incrementing a cache counter Le when a cache line is evicted from the direct-writable cache to memory; and setting a direct write policy for the LLC according to the cache counters.
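The method of Example 37 can be exercised end to end over one policy window: count allocation, processing, and eviction events, then set the policy. The event encoding and thresholds below are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical end-to-end sketch of the method of Example 37 over one
# policy window. Event names and thresholds are assumed for illustration.

def run_policy_window(events, near=0.9, small=0.05):
    La = Lp = Le = 0
    for ev in events:
        if ev == "alloc":      # cache line allocated into the LLC
            La += 1
        elif ev == "process":  # line processed by a core and deallocated
            Lp += 1
        elif ev == "evict":    # line evicted to memory unprocessed
            Le += 1
    over_utilized = La > 0 and Lp < near * La and Le > small * La
    # Counters would be reset here for the next window (per Example 42).
    return "write_to_memory" if over_utilized else "write_to_cache"

# A window where the core keeps up: every allocated line is processed.
healthy = ["alloc", "process"] * 50
# A window where peripheral writes outrun the core: many unprocessed evictions.
thrashing = ["alloc"] * 100 + ["process"] * 30 + ["evict"] * 60
```

Running the healthy window keeps DDIO writing into the LLC; the thrashing window flips the policy to writing incoming data directly to memory.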
Example 38 includes the method of example 37, wherein the LLC is a member of a non-inclusive cache.
Example 39 includes the method of example 37, further comprising incrementing Le only when a cache line is evicted without being retrieved by a core.
Example 40 includes the method of example 37, further comprising incrementing the cache counter according to a sampling algorithm, comprising sampling fewer than all cache transactions.
Example 41 includes the method of example 40, further comprising sampling approximately 2% of all cache transactions.
Example 42 includes the method of example 37, further comprising resetting La, Lp, and Le after the policy window.
Example 43 includes the method of example 37, wherein the cache monitor circuitry is to allocate all sampled transactions directly to the cache.
Example 44 includes the method of example 43, further comprising allocating interim cache transactions according to the policy.
Example 45 includes the method of example 37, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 46 includes the method of example 37, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
Example 47 includes the method of example 37, wherein the peripheral device is an Ethernet network interface.
Example 48 includes the method of example 37, wherein the peripheral device is a cache-coherent core-to-core network interface.
Example 49 includes the method of example 48, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 50 includes the method of example 37, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 51 includes an apparatus comprising means for performing the method of any of examples 37-50.
Example 52 includes the apparatus of example 51, wherein the apparatus comprises a computing system.
Example 53 includes the apparatus of example 52, wherein the computing system comprises a Central Processing Unit (CPU) comprising a plurality of cores and a shared resource monitor.
Example 54 includes a shared resource monitor comprising circuitry to perform the method of any of examples 37-50.
Example 55 includes an Intellectual Property (IP) block comprising the shared resource monitor of example 54.
Example 56 includes a Field Programmable Gate Array (FPGA) comprising the shared resource monitor of example 54.
Example 57 includes an Integrated Circuit (IC) comprising the shared resource monitor of example 54.
Example 58 includes a microprocessor including the IC of example 57.
Example 59 includes a system on a chip (SoC) including the microprocessor of example 58.
Example 60 includes one or more tangible, non-transitory computer-readable media having instructions stored thereon for causing an apparatus to provide a shared resource monitor, the shared resource monitor comprising: an interface communicatively coupled to a core; an interface communicatively coupled to a cache; an interface communicatively coupled to a peripheral device; a cache counter La that is incremented when a cache line is allocated into the cache; a cache counter Lp that is incremented when the core processes a cache line and deallocates it from the cache; a cache counter Le that is incremented when a cache line is evicted from the direct-writable cache to memory; and policy circuitry that determines a direct write policy according to the cache counters.
Example 61 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the cache is a direct-writable cache of a multi-level cache.
Example 62 includes the one or more tangible, non-transitory computer-readable media of example 61, wherein the direct-writable cache is a last-level cache.
Example 63 includes the one or more tangible, non-transitory computer-readable media of example 61, wherein the multi-level cache is a non-inclusive cache.
Example 64 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein Le is incremented only when the cache line is evicted without being retrieved by the core.
Example 65 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the cache counter is a sampling counter configured to sample fewer than all cache sets.
Example 66 includes the one or more tangible, non-transitory computer-readable media of example 65, wherein the cache counter is configured to sample approximately 2% of all cache sets.
Example 67 includes the one or more tangible, non-transitory computer-readable media of example 65, wherein the policy circuitry resets La, Lp, and Le after a policy window.
Example 68 includes the one or more tangible, non-transitory computer-readable media of example 65, wherein the cache monitor circuitry is to allocate all sampled transactions directly to the cache.
Example 69 includes the one or more tangible, non-transitory computer-readable media of example 68, wherein the cache monitoring circuitry is to allocate interim cache transactions according to the policy.
Example 70 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein determining the direct write policy comprises determining that Lp is approximately equal to La and Le is at or near zero, determining that the cache is not over-utilized, and determining a direct write policy of writing incoming data to the direct-writable cache.
Example 71 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein determining the direct write policy comprises determining that Lp is substantially less than La and Le is substantially non-zero, determining that the cache is over-utilized, and determining a direct write policy of writing incoming data directly to memory.
Example 72 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the peripheral device is an Ethernet network interface.
Example 73 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the peripheral device is a cache-coherent core-to-core network interface.
Example 74 includes the one or more tangible, non-transitory computer-readable media of example 73, wherein the cache-coherent core-to-core network interface is Omni-Path or a derivative of Omni-Path.
Example 75 includes the one or more tangible, non-transitory computer-readable media of example 60, wherein the peripheral device is a cache-coherent core-to-core network interface on a peripheral component interconnect express (PCIe) interconnect.
Example 76 includes the one or more tangible, non-transitory computer-readable media of any of examples 60-75, wherein the instructions comprise instructions to program an Intellectual Property (IP) block.
Example 77 includes the one or more tangible, non-transitory computer-readable media of any one of examples 60-75, wherein the instructions comprise instructions to program a Field Programmable Gate Array (FPGA).
Example 78 includes the one or more tangible, non-transitory computer-readable media of any one of examples 60-75, wherein the instructions comprise instructions to program an Integrated Circuit (IC).
Example 79 includes the one or more tangible, non-transitory computer-readable media of example 78, wherein the IC includes a processor.
Example 80 includes the one or more tangible, non-transitory computer-readable media of any of examples 60-75, wherein the instructions comprise instructions to program a system on a chip (SoC).