Region-based directory scheme adapted to large cache sizes
Reader's note: This technique, "Region-based directory scheme adapted to large cache sizes," was created on 2018-09-19 by Vydhyanathan Kalyanasundharam, Kevin M. Lepak, Amit P. Apte, Ganesh Balakrishnan, et al. Its main content: Systems, devices, and methods for maintaining a region-based cache directory are disclosed. A system includes a plurality of processing nodes, where each processing node includes a cache subsystem. The system also includes a cache directory to help manage cache coherency between the different cache subsystems of the system. To reduce the number of entries in the cache directory, the cache directory tracks coherency on the basis of regions rather than cache lines, where a region includes a plurality of cache lines. Thus, the system includes a region-based cache directory to track regions having at least one cache line cached in any cache subsystem of the system. The cache directory includes a reference count in each entry to track the total number of cache lines cached per region. If the reference count of a given entry reaches zero, the cache directory reclaims the given entry.
1. A system, comprising:
a plurality of processing nodes, wherein each processing node comprises one or more processors and a cache subsystem;
a memory; and
a cache directory configured to:
maintain an entry for each region of memory having at least one cache line cached in any cache subsystem of the system, wherein a region comprises a plurality of cache lines;
maintain a reference count for each entry of the cache directory, wherein the reference count tracks a number of cache lines of a corresponding region cached in any cache subsystem of the system; and
reclaim a given entry in response to the reference count of the given entry reaching zero.
2. The system of claim 1, wherein the cache directory is further configured to:
allocate a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of the system; and
initialize a reference count of the first entry to one.
3. The system of claim 2, wherein the cache directory is further configured to increment the reference count of the first entry in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the system.
4. The system of claim 1, wherein the cache directory is further configured to:
track whether a region is private or shared;
interpret a first field of a first entry as a central processing unit valid field in response to determining that a first region is private, wherein the first entry corresponds to the first region; and
interpret the first field of the first entry as a cluster valid field in response to detecting that the first region has changed to a shared state.
5. The system of claim 1, wherein the cache directory is further configured to:
maintain a first array to track portions of memory at a first granularity;
maintain a second array to track portions of memory at a second granularity, wherein the second granularity is finer than the first granularity; and
perform lookups of the first array and the second array in parallel in response to receiving a given memory request.
6. The system of claim 5, wherein the cache directory is further configured to generate probes based only on data extracted from a matching entry of the second array in response to a lookup hitting in both the first array and the second array.
7. The system of claim 1, wherein the cache directory is further configured to:
monitor a reference count field of entries in a first array;
begin tracking a given region at a finer granularity in response to detecting that a reference count for the given region has exceeded a threshold; and
create an entry in a second array in response to receiving a memory request targeting the given region after detecting that the reference count has exceeded the threshold.
8. A method, comprising:
maintaining an entry in a cache directory for each region of memory having at least one cache line cached in any cache subsystem of the system, wherein a region comprises a plurality of cache lines;
maintaining a reference count for each entry of the cache directory, wherein the reference count tracks a number of cache lines of a corresponding region cached in any cache subsystem of the system; and
reclaiming a given entry in response to the reference count of the given entry reaching zero.
9. The method of claim 8, further comprising:
allocating a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of the system; and
initializing a reference count of the first entry to one.
10. The method of claim 9, further comprising: incrementing the reference count of the first entry in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the system.
11. The method of claim 8, further comprising:
tracking whether a region is private or shared;
interpreting a first field of a first entry as a central processing unit valid field in response to determining that a first region is private, wherein the first entry corresponds to the first region; and
interpreting the first field of the first entry as a cluster valid field in response to detecting that the first region has changed to a shared state.
12. The method of claim 8, further comprising:
maintaining a first array to track a portion of memory at a first granularity;
maintaining a second array to track portions of memory at a second granularity, wherein the second granularity is finer than the first granularity; and
performing lookups to the first array and the second array in parallel in response to receiving a given memory request.
13. The method of claim 12, further comprising: generating probes based only on data extracted from matching entries of the second array in response to the lookup hitting in both the first array and the second array.
14. The method of claim 8, further comprising:
monitoring a reference count field of an entry in a first array;
beginning to track a given region at a finer granularity in response to detecting that a reference count for the given region has exceeded a threshold; and
creating an entry in a second array in response to receiving a memory request targeting the given region after detecting that the reference count has exceeded the threshold.
15. An apparatus, comprising:
a cache directory configured to:
maintain an entry for each region of memory having at least one cache line cached in any of a plurality of cache subsystems, wherein a region comprises a plurality of cache lines;
maintain a reference count for each entry of the cache directory, wherein the reference count tracks a number of cache lines of a corresponding region cached in any cache subsystem; and
reclaim a given entry in response to the reference count of the given entry reaching zero.
16. The apparatus of claim 15, wherein the cache directory is further configured to:
allocate a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of the plurality of cache subsystems; and
initialize a reference count of the first entry to one.
17. The apparatus of claim 16, wherein the cache directory is configured to increment the reference count of the first entry in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the plurality of cache subsystems.
18. The apparatus of claim 15, wherein the cache directory is configured to:
track whether a region is private or shared;
interpret a first field of a first entry as a central processing unit valid field in response to determining that a first region is private, wherein the first entry corresponds to the first region; and
interpret the first field of the first entry as a cluster valid field in response to detecting that the first region has changed to a shared state.
19. The apparatus of claim 15, wherein the cache directory is further configured to:
maintain a first array to track portions of memory at a first granularity;
maintain a second array to track portions of memory at a second granularity, wherein the second granularity is finer than the first granularity; and
perform lookups of the first array and the second array in parallel in response to receiving a given memory request.
20. The apparatus of claim 19, wherein the cache directory is further configured to generate probes based only on data extracted from a matching entry of the second array in response to a lookup hitting in both the first array and the second array.
Background
Drawings
The advantages of the methods and mechanisms described herein may be better understood by reference to the following description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a block diagram of one embodiment of a computing system.
FIG. 2 is a block diagram of one embodiment of a core complex.
FIG. 3 is a block diagram of one embodiment of a multi-CPU system.
FIG. 4 is a block diagram of one embodiment of a cache directory.
FIG. 5 is a block diagram of another embodiment of a cache directory.
FIG. 6 is a block diagram of another embodiment of a cache directory.
FIG. 7 is a diagram of one embodiment of repurposing a cluster valid field in a cache directory entry.
FIG. 8 is a generalized flow diagram illustrating one embodiment of a method for maintaining a region-based cache directory.
FIG. 9 is a generalized flow diagram illustrating one embodiment of a method for a coherency slave to process an invalidation indication from a CPU.
FIG. 10 is a generalized flow diagram illustrating one embodiment of a method for a cache directory to track accesses to a first region of memory.
FIG. 11 is a generalized flow diagram illustrating one embodiment of a method for maintaining cache directory arrays at different granularity levels.
FIG. 12 is a generalized flow diagram illustrating one embodiment of a method for operating a multi-array cache directory.
FIG. 13 is a generalized flow diagram illustrating one embodiment of a method for a cache directory to maintain a data structure with entries for recently accessed private regions.
FIG. 14 is a generalized flow diagram illustrating one embodiment of a method for maintaining cache directory arrays at different granularity levels.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the methods and mechanisms presented herein. However, it will be recognized by one of ordinary skill in the art that various embodiments may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the methods described herein. It will be appreciated that for simplicity and clarity of illustration, elements illustrated in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.
Various systems, devices, methods, and computer-readable media are disclosed herein for implementing a region-based cache directory scheme that accommodates large cache sizes. In one embodiment, a system includes at least a plurality of processing nodes, a memory, and a cache directory. Each processing node includes one or more processors and a cache subsystem. In one embodiment, the cache directory is configured to track portions of the memory based on regions, where each region includes a plurality of cache lines. The cache directory maintains an entry for each region of memory having at least one cache line cached in any cache subsystem of the system. The cache directory also maintains a reference count for each entry, where the reference count is the total number of cache lines of the region cached across all cache subsystems of the system. The cache directory evicts a given entry in response to the reference count of the given entry reaching zero.
In one embodiment, a cache directory allocates a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of a system. The cache directory also initializes a reference count of the newly allocated first entry to one. The cache directory increments a reference count of the first entry to two in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the system, wherein the second cache line is not necessarily physically distinct from the first cache line. The cache directory increments the reference count for each subsequent cache line of the first region cached by the cache subsystem.
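The allocate/increment/decrement/reclaim behavior described in the two paragraphs above can be sketched in software. This is an illustrative model, not the patented hardware implementation: the region size, the dictionary-based storage, and the method names are assumptions for the sketch.

```python
# Minimal sketch of a region-based cache directory with per-entry
# reference counts. Region size and method names are illustrative.

REGION_SIZE = 4096  # assumed: a region spans 64 cache lines of 64 bytes

class RegionDirectory:
    def __init__(self):
        self.entries = {}  # region base address -> reference count

    def region_of(self, line_addr):
        return line_addr - (line_addr % REGION_SIZE)

    def line_cached(self, line_addr):
        """A cache subsystem cached a line: allocate or bump the region entry."""
        region = self.region_of(line_addr)
        if region not in self.entries:
            self.entries[region] = 1   # first cached line: initialize count to one
        else:
            self.entries[region] += 1  # each further cached line increments

    def line_evicted(self, line_addr):
        """A cached line was evicted: decrement; reclaim the entry at zero."""
        region = self.region_of(line_addr)
        self.entries[region] -= 1
        if self.entries[region] == 0:
            del self.entries[region]   # reference count reached zero: reclaim

d = RegionDirectory()
d.line_cached(0x1000)   # first line of the region: count becomes 1
d.line_cached(0x1040)   # second line of the same region: count becomes 2
d.line_evicted(0x1000)
d.line_evicted(0x1040)  # count reaches 0, entry reclaimed
```

Note that, matching the text above, the count tracks how many cached lines of the region exist system-wide, regardless of which cache subsystem holds them.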
In one embodiment, the cache directory tracks whether a region is private or shared. For example, the cache directory stores a first entry to track a first region. When the first region is private, a first field of the first entry is interpreted as having a first meaning. When the first region is shared, the first field of the first entry is interpreted as having a second meaning different from the first meaning. In one embodiment, the first field is interpreted as a CPU valid field when the first region is private and as a cluster valid field when the first region is shared, where a cluster refers to a specific group of CPUs.
In one embodiment, a cache directory maintains a first array to track portions of memory at a first granularity. The cache directory also maintains a second array to track portions of the memory at a second granularity, where the second granularity is finer than the first granularity. In this embodiment, the cache directory performs lookups of the first array and the second array in parallel in response to receiving a given memory request. The cache directory generates one or more probes based only on data extracted from a matching entry of the second array in response to a lookup hitting in both the first array and the second array.
In one embodiment, the cache directory monitors a reference count field of an entry in the first array. In one embodiment, the cache directory begins tracking a given region with finer granularity in response to detecting that the region has transitioned from private to shared. In this implementation, the cache directory creates an entry in the second array in response to receiving a memory request targeting a given region after detecting that the region has transitioned from private to shared. In one embodiment, the cache directory begins tracking the given region based on a finer granularity in response to detecting that the reference count of the given region has exceeded a programmable threshold. In this embodiment, the cache directory creates an entry in the second array in response to receiving a memory request targeting the given region after detecting that the reference count has exceeded the threshold. In one embodiment, the first array is an area-based array and the second array is a line-based array.
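The two-granularity scheme above can be sketched as follows. The sketch is an assumption-laden software model: the array sizes, the threshold value, and the "owner" payload are invented for illustration, and the "parallel" lookup is modeled as two dictionary probes with the finer result preferred.

```python
# Sketch of a coarse region-based array plus a finer line-based array.
# When both arrays hit, probes are generated from the finer (line) entry.

REGION_SIZE = 4096
LINE_SIZE = 64
THRESHOLD = 8  # assumed value for the programmable reference-count threshold

class TwoLevelDirectory:
    def __init__(self):
        self.region_array = {}  # region base -> reference count
        self.line_array = {}    # line base -> owner info (assumed payload)

    def record_access(self, line_addr, owner):
        region = line_addr - line_addr % REGION_SIZE
        line = line_addr - line_addr % LINE_SIZE
        count = self.region_array.get(region, 0) + 1
        self.region_array[region] = count
        if count > THRESHOLD:
            # region is heavily cached: begin tracking at line granularity
            self.line_array[line] = owner

    def lookup(self, line_addr):
        """Both arrays are probed 'in parallel'; a line-array hit wins."""
        region = line_addr - line_addr % REGION_SIZE
        line = line_addr - line_addr % LINE_SIZE
        line_hit = self.line_array.get(line)
        region_hit = self.region_array.get(region)
        if line_hit is not None:
            return ("line", line_hit)    # generate probes from the finer entry
        if region_hit is not None:
            return ("region", region_hit)
        return ("miss", None)
```

Accesses made before the threshold is crossed stay region-tracked only; lines touched afterward gain individual entries, matching the promotion behavior described above.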
Referring now to FIG. 1, a block diagram of one embodiment of a computing system is shown.
The one or
In various embodiments, the
Turning now to FIG. 2, a block diagram of one embodiment of a core complex 200 is shown. In one embodiment, the core complex 200 includes four processor cores 210A-210D. In other embodiments, core complex 200 may include other numbers of processor cores. It should be noted that a "core complex" may also be referred to herein as a "processing node" or "CPU". In one embodiment, the components of core complex 200 are included within
Each processor core 210A-210D includes a cache subsystem for storing data and instructions retrieved from a memory subsystem (not shown). For example, in one embodiment, each core 210A-210D includes a corresponding first-level (L1) cache 215A-215D. Each processor core 210A-210D may include or be coupled to a corresponding second level (L2) cache 220A-220D. Additionally, in one embodiment, the core complex 200 includes a third level (L3) cache 230 shared by the processor cores 210A-210D. The L3 cache 230 is coupled to a coherency master to access the fabric and the memory subsystem. It should be noted that in other embodiments, core complex 200 may include other types of cache subsystems having other numbers of caches and/or other configurations with different cache levels.
Referring now to FIG. 3, a block diagram of one embodiment of a multi-CPU system is shown.
In one embodiment, each CPU 305A-305N is connected to a corresponding coherency master 315A-315N. As used herein, a "coherency master" is defined as an agent that processes traffic flowing over the interconnect fabric (e.g., bus/fabric 318) and manages coherency for a connected CPU. To manage coherency, a coherency master receives and processes coherency-related messages and probes, and generates coherency-related requests and probes. It is noted that a "coherency master" may also be referred to herein as a "coherent master".
In one embodiment, each CPU 305A-305N is coupled to a pair of coherent slaves through a corresponding coherent master 315A-315N and bus/fabric 318.
In a configuration similar to that of CPU 305A,
When a coherent slave device receives a memory request targeting its corresponding memory controller, the coherent slave device performs a lookup of its corresponding cache directory to determine whether the request targets a region having at least one cache line cached in any cache subsystem. In one embodiment, each cache directory in
It should be noted that although not shown in FIG. 3, in other embodiments there may be other connections from bus/fabric 318.
Turning now to FIG. 4, a block diagram of one embodiment of a cache directory is shown.
The
The sector
The organization of the subgroups and the number of bits in the sector
The cluster
Referring now to FIG. 5, a block diagram of another embodiment of a cache directory 500 is shown. In one embodiment, cache directory 500 includes at least a control unit 505, the control unit 505 coupled to a region-based array 510 and an auxiliary line-based array 515. Region-based array 510 includes entries for tracking region-based cached data. In one embodiment, each entry of region-based array 510 includes a reference count that counts a number of accesses to a cache line cached by a cache subsystem of a computing system (e.g.,
Thus, in one embodiment, when the number of cache lines cached for a given region reaches a threshold, the auxiliary line-based array 515 will begin tracking accesses to individual lines of the given region. Each time a new cache line is accessed from a given region, a new entry is created for the cache line in auxiliary line-based array 515. In one embodiment, the lookup of the region-based array 510 and the line-based array 515 is performed in parallel.
In one embodiment, only shared regions with reference counts greater than a threshold are tracked at cache line granularity by auxiliary line-based array 515. A shared region refers to a region having cache lines stored in the cache subsystems of at least two different CPUs. A private region refers to a region having cache lines cached by only a single CPU. Thus, in one embodiment, a shared region with a reference count greater than the threshold will have one or more entries in line-based array 515, while a private region will not have any entries in line-based array 515.
Turning now to FIG. 6, a block diagram of another embodiment of a cache directory is shown.
In one embodiment,
In one embodiment, I/O transactions that do not modify the sector valid bit or cluster valid bit may benefit from the
Thus,
Referring now to FIG. 7, a diagram of one embodiment of repurposing a cluster valid field in a cache directory entry is shown. In one embodiment, entry 705A includes various fields associated with the shared region being tracked by the cache directory. The
If the cluster valid field 720A remained unchanged even for a private region, a probe would need to be sent to all CPUs in each cluster identified as caching at least one cache line of that region. In contrast, in one embodiment, if the region is private (i.e., accessed by only a single cluster), the cluster valid field is repurposed as an owner valid field or a CPU valid field. This allows the cache directory to probe one particular CPU for the private region.
Thus, if the region being tracked by entry 705A transitions from a shared region to a private region, entry 705B shows how the fields of entry 705A change for this now-private region. As shown in entry 705B, state 710B now includes the private 715B encoding to represent the private state of the region. Since state 710B has changed to private 715B, the previous cluster valid field 720A now becomes CPU valid field 720B. Each
By changing the cluster valid field 720A to the CPU valid field 720B, a directed probe can be issued that targets only the CPU holding cache lines from the region. This helps reduce the number of unnecessary probes generated by the cache directory. In one embodiment, if a request targeting the private region (corresponding to entry 705B) is received from a different cluster, the private region becomes a shared region. When this occurs, the cluster valid field 720A is restored to its normal operation because the region is now shared.
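The field repurposing of FIG. 7 can be sketched as follows. The bit widths, cluster count, and CPUs-per-cluster values are assumptions; the point is that the same bit vector is decoded as a cluster valid field while the region is shared and as a CPU valid field (scoped to the owning cluster) while it is private, which shrinks the probe target set.

```python
# Sketch: decode one valid-bit field two ways depending on region state.
# CPU ids are modeled as (cluster, cpu_within_cluster) pairs.

def probe_targets(state, owner_cluster, valid_bits,
                  num_clusters=8, cpus_per_cluster=8):
    """Return the set of CPUs that must be probed for a directory entry."""
    targets = set()
    if state == "private":
        # CPU valid field: bit i selects CPU i within the single owning cluster
        for i in range(cpus_per_cluster):
            if valid_bits >> i & 1:
                targets.add((owner_cluster, i))
    else:
        # cluster valid field: bit c means cluster c holds at least one line,
        # so every CPU in that cluster must be probed
        for c in range(num_clusters):
            if valid_bits >> c & 1:
                targets.update((c, i) for i in range(cpus_per_cluster))
    return targets
```

With the same 8-bit value, a shared entry can fan out probes to whole clusters, while a private entry pinpoints a single CPU, illustrating why the repurposing reduces unnecessary probes.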
Turning now to FIG. 8, one embodiment of a method for maintaining a region-based cache directory is shown.
The cache directory tracks regions of memory having at least one cache line cached in any cache subsystem of the multi-CPU system, wherein each region includes a plurality of cache lines (block 805). The cache directory maintains an entry for each region of memory having at least one cache line cached in any cache subsystem of the system (block 810). The cache directory maintains a reference count for each entry of the cache directory, where the reference count is the total number of cache lines of the region cached across all cache subsystems of the system (block 815). In response to receiving a memory request targeting a cache line within a given region, the cache directory sends a probe to each CPU having any cache lines of the given region (block 820). In response to a cache line of the corresponding region being cached by a CPU, the cache directory increments the reference count for the given entry (block 825). In response to a cache line of the corresponding region being evicted from the cache subsystem of a CPU, the cache directory decrements the reference count for the given entry (block 830). After block 830, the method ends.
Referring now to FIG. 9, one embodiment of a method for a coherency slave to process an invalidation indication from a CPU is shown.
The coherence slave then decrements the reference count in the located cache directory entry for the identified region (block 920). If the reference count is now equal to zero (
Turning now to FIG. 10, one embodiment of a method 1000 of a cache directory tracking accesses to a first region of memory is illustrated. The cache directory receives a first memory request for a first cache line within a first region of memory (block 1005). In response to determining that the first region has no entry in the cache directory, a new entry is allocated in the cache directory for the first region and a reference count for the new entry is initialized to one (block 1010). If the cache directory is full, the cache directory evicts the existing entry to allocate a new entry. The cache directory may utilize any suitable eviction algorithm to determine which entry to evict. Next, the cache directory receives a second memory request for a second cache line within the first region of memory, where the second cache line is not necessarily physically distinct from the first cache line (block 1015). In response to receiving the second memory request for the first region of memory, a reference count of an entry corresponding to the first region of memory is incremented to two (block 1020). After block 1020, method 1000 ends. It should be noted that the reference count of the entry corresponding to the first region of memory is incremented for each subsequent memory request to any cache line within the first region. Similarly, each time a cache line from the first region is evicted from the cache subsystem, the reference count is decremented.
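Method 1000's allocation path, including the evict-to-allocate case when the directory is full, can be sketched as below. The capacity, region size, and LRU replacement are assumptions; the text only requires "any suitable eviction algorithm," and LRU is used here as a stand-in.

```python
# Sketch: entry allocation with eviction when the directory is full.
# An OrderedDict provides insertion/recency order for an assumed LRU policy.
from collections import OrderedDict

CAPACITY = 2        # deliberately tiny directory for illustration
REGION_SIZE = 4096

class SmallDirectory:
    def __init__(self):
        self.entries = OrderedDict()  # region base -> reference count

    def access(self, line_addr):
        region = line_addr - line_addr % REGION_SIZE
        if region in self.entries:
            self.entries[region] += 1            # subsequent access: increment
            self.entries.move_to_end(region)     # refresh recency
        else:
            if len(self.entries) >= CAPACITY:
                self.entries.popitem(last=False)  # full: evict LRU entry
            self.entries[region] = 1             # new entry, count initialized to one
        return self.entries[region]
```

A first access to a region allocates its entry with a count of one, a second access to the same region raises the count to two, and allocating beyond capacity silently evicts the least recently used region, mirroring blocks 1010-1020.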
Referring now to FIG. 11, one embodiment of a method for maintaining cache directory arrays at different granularity levels is shown.
The cache directory maintains a reference count for each entry in the first array to track the number of cache lines of the region cached somewhere in the system (block 1110). If the reference count of a given entry exceeds a programmable threshold (
For each new cache line from the corresponding address range that is cached in the system, the cache directory allocates a new entry in the second array to track this portion of the region at the second granularity (block 1130). If the cache directory is full, the cache directory evicts an entry from the second array to allocate the new entry. The cache directory may utilize any suitable eviction algorithm to determine which entry to evict. After block 1130, the method ends.
Turning now to FIG. 12, one embodiment of a method for operating a multi-array cache directory is shown.
Referring now to FIG. 13, one embodiment of a method 1300 of a cache directory maintaining a data structure with entries for recently accessed private regions is shown. The cache directory maintains a data structure to cache entries for the private regions of the last N accesses (block 1305). The value of N may vary according to the embodiment. The cache directory also maintains one or more arrays having entries for tracking regions that have cache lines cached in any cache subsystem of the multi-CPU system (block 1310).
Next, a memory request is received at the cache directory (block 1315). In response to receiving the memory request, the cache directory performs a lookup of the data structure holding entries for the private regions of the last N accesses (block 1320). If the lookup results in a hit (conditional block 1325, "yes" leg), the cache directory generates probes for the one or more cache subsystems identified by the matching entry (block 1330). In addition, the cache directory cancels the lookup of the one or more primary cache directory arrays (block 1335). If the lookup results in a miss (conditional block 1325, "no" leg), the cache directory performs one or more lookups of the one or more primary cache directory arrays for the request (block 1340). The cache directory may utilize any suitable procedure for processing a lookup of the primary cache directory arrays. After blocks 1335 and 1340, method 1300 ends.
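The recent-private-region structure of method 1300 behaves like a small recency-ordered filter in front of the main arrays. The sketch below is an assumed model: the value of N, the LRU-style replacement, and the (region, owner) payload are all illustrative choices not fixed by the text.

```python
# Sketch: a structure caching the private regions of the last N accesses.
# A hit lets the directory probe the owner directly and skip the main arrays.
from collections import OrderedDict

N = 4  # assumed capacity of the recent-private-region structure

class RecentPrivateFilter:
    def __init__(self):
        self.recent = OrderedDict()  # region -> owning CPU, recency ordered

    def record(self, region, owner_cpu):
        """Remember a private-region access, keeping only the last N regions."""
        self.recent[region] = owner_cpu
        self.recent.move_to_end(region)
        while len(self.recent) > N:
            self.recent.popitem(last=False)

    def lookup(self, region):
        if region in self.recent:
            # hit: probe the owning CPU and cancel the main-array lookups
            return ("probe", self.recent[region])
        # miss: fall through to the primary cache directory arrays
        return ("main_lookup", None)
```

Because the filter is consulted first, requests to recently seen private regions avoid the wider main-array lookup entirely, matching the hit path of blocks 1330-1335.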
Referring now to FIG. 14, one embodiment of a method for maintaining cache directory arrays at different granularity levels is shown.
The cache directory determines whether any of the private regions have transitioned to the shared state (conditional block 1410). If any of the private regions has transitioned to the shared state (
For each new cache line from the corresponding address range that is cached in the system, the cache directory allocates a new entry in the second array to track this portion of the region at the second granularity (block 1425). If the cache directory is full, the cache directory evicts an entry from the second array to allocate the new entry. The cache directory may utilize any suitable eviction algorithm to determine which entry to evict. After block 1425, the method ends.
In various embodiments, the methods and/or mechanisms described herein are implemented using program instructions of a software application. For example, program instructions that are executable by a general-purpose processor or a special-purpose processor are contemplated. In various implementations, such program instructions may be represented by a high-level programming language. In other embodiments, the program instructions may be compiled from a high-level programming language into binary, intermediate, or other forms. Alternatively, program instructions describing the behavior or design of the hardware may be written. Such program instructions may be represented by a high-level programming language such as C. Alternatively, a Hardware Design Language (HDL) such as Verilog may be used. In various embodiments, the program instructions are stored on any of a variety of non-transitory computer-readable storage media. The storage medium is accessible by the computing system during use to provide program instructions to the computing system for program execution. Generally, such computing systems include at least one or more memories and one or more processors configured to execute program instructions.
It should be emphasized that the above-described embodiments are merely non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.