Region-based directory scheme adapted to large cache sizes

Document No.: 1174204    Publication date: 2020-09-18

Reading note: This technology, "Region-based directory scheme adapted to large cache sizes," was designed and created on 2018-09-19 by Vydhyanathan Kalyanasundharam, Kevin M. Lepak, Amit P. Apte, Ganesh Balakrishnan, et al. Its main content is as follows: Systems, devices, and methods for maintaining a region-based cache directory are disclosed. A system includes a plurality of processing nodes, where each processing node includes a cache subsystem. The system also includes a cache directory to help manage cache coherency between the different cache subsystems of the system. To reduce the number of entries in the cache directory, the cache directory tracks coherency on the basis of regions rather than cache lines, where a region includes a plurality of cache lines. Thus, the system includes a region-based cache directory to track regions having at least one cache line cached in any cache subsystem of the system. The cache directory includes a reference count in each entry to track the total number of cache lines cached per region. If the reference count of a given entry reaches zero, the cache directory reclaims the given entry.

1. A system, comprising:

a plurality of processing nodes, wherein each processing node comprises one or more processors and a cache subsystem;

a memory; and

a cache directory configured to:

maintaining an entry for each region of memory having at least one cache line cached in any cache subsystem of the system, wherein a region comprises a plurality of cache lines;

maintaining a reference count for each entry of the cache directory, wherein the reference count tracks a number of cache lines of a corresponding region cached in any cache subsystem of the system; and

reclaiming a given entry in response to its reference count reaching zero.

2. The system of claim 1, wherein the cache directory is further configured to:

allocating a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of the system; and

initializing a reference count of the first entry to one.

3. The system of claim 2, wherein the cache directory is further configured to: incrementing the reference count of the first entry in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the system.

4. The system of claim 1, wherein the cache directory is further configured to:

tracking whether a region is private or shared;

interpreting a first field of a first entry as a central processing unit valid field in response to determining that a first region is private, wherein the first entry corresponds to the first region;

interpreting the first field of the first entry as a cluster valid field in response to detecting that the first region has changed to a shared state.

5. The system of claim 1, wherein the cache directory is further configured to:

maintaining a first array to track a portion of memory at a first granularity;

maintaining a second array to track portions of memory at a second granularity, wherein the second granularity is finer than the first granularity; and

performing lookups of the first array and the second array in parallel in response to receiving a given memory request.

6. The system of claim 5, wherein the cache directory is further configured to: generating probes based only on data extracted from matching entries of the second array in response to the lookup hitting in both the first array and the second array.

7. The system of claim 1, wherein the cache directory is further configured to:

monitoring a reference count field of an entry in a first array;

beginning to track a given region at a finer granularity in response to detecting that a reference count for the given region has exceeded a threshold; and

creating an entry in a second array in response to receiving a memory request targeting the given region after detecting that the reference count has exceeded the threshold.

8. A method, comprising:

maintaining an entry in a cache directory for each region of memory having at least one cache line cached in any cache subsystem of the system, wherein a region comprises a plurality of cache lines;

maintaining a reference count for each entry of the cache directory, wherein the reference count tracks a number of cache lines of a corresponding region cached in any cache subsystem of the system; and

reclaiming a given entry in response to its reference count reaching zero.

9. The method of claim 8, further comprising:

allocating a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of the system; and

initializing a reference count of the first entry to one.

10. The method of claim 9, further comprising: incrementing the reference count of the first entry in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the system.

11. The method of claim 8, further comprising:

tracking whether a region is private or shared;

interpreting a first field of a first entry as a central processing unit valid field in response to determining that a first region is private, wherein the first entry corresponds to the first region;

interpreting the first field of the first entry as a cluster valid field in response to detecting that the first region has changed to a shared state.

12. The method of claim 8, further comprising:

maintaining a first array to track a portion of memory at a first granularity;

maintaining a second array to track portions of memory at a second granularity, wherein the second granularity is finer than the first granularity; and

performing lookups of the first array and the second array in parallel in response to receiving a given memory request.

13. The method of claim 12, further comprising: generating probes based only on data extracted from matching entries of the second array in response to the lookup hitting in both the first array and the second array.

14. The method of claim 8, further comprising:

monitoring a reference count field of an entry in a first array;

beginning to track a given region at a finer granularity in response to detecting that a reference count for the given region has exceeded a threshold; and

creating an entry in a second array in response to receiving a memory request targeting the given region after detecting that the reference count has exceeded the threshold.

15. An apparatus, comprising:

a cache directory configured to:

maintaining an entry for each region of memory having at least one cache line cached in any of a plurality of cache subsystems, wherein a region comprises a plurality of cache lines;

maintaining a reference count for each entry of the cache directory, wherein the reference count tracks a number of cache lines of a corresponding region cached in any cache subsystem; and

reclaiming a given entry in response to its reference count reaching zero.

16. The apparatus of claim 15, wherein the cache directory is further configured to:

allocating a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of a system; and

initializing a reference count of the first entry to one.

17. The apparatus of claim 16, wherein the cache directory is configured to: incrementing the reference count of the first entry in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the system.

18. The apparatus of claim 15, wherein the cache directory is configured to:

tracking whether a region is private or shared;

interpreting a first field of a first entry as a central processing unit valid field in response to determining that a first region is private, wherein the first entry corresponds to the first region;

interpreting the first field of the first entry as a cluster valid field in response to detecting that the first region has changed to a shared state.

19. The apparatus of claim 15, wherein the cache directory is further configured to:

maintaining a first array to track a portion of memory at a first granularity;

maintaining a second array to track portions of memory at a second granularity, wherein the second granularity is finer than the first granularity; and

performing lookups of the first array and the second array in parallel in response to receiving a given memory request.

20. The apparatus of claim 19, wherein the cache directory is further configured to: generating probes based only on data extracted from matching entries of the second array in response to the lookup hitting in both the first array and the second array.

Background

Drawings

The advantages of the methods and mechanisms described herein may be better understood by reference to the following description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a block diagram of one embodiment of a computing system.

FIG. 2 is a block diagram of one embodiment of a core complex.

FIG. 3 is a block diagram of one embodiment of a multi-CPU system.

FIG. 4 is a block diagram of one embodiment of a cache directory.

FIG. 5 is a block diagram of another embodiment of a cache directory.

FIG. 6 is a block diagram of another embodiment of a cache directory.

FIG. 7 is a diagram of one embodiment of repurposing a cluster valid field in a cache directory entry.

FIG. 8 is a generalized flow diagram illustrating one embodiment of a method for maintaining a region-based cache directory.

FIG. 9 is a generalized flow diagram illustrating one embodiment of a method for a coherency slave to process an invalidation indication from a CPU.

FIG. 10 is a generalized flow diagram illustrating one embodiment of a method for a cache directory to track accesses to a first region of memory.

FIG. 11 is a generalized flow diagram illustrating one embodiment of a method for maintaining an array of cache directories at different granularity levels.

FIG. 12 is a generalized flow diagram illustrating one embodiment of a method for operating a multi-array cache directory.

FIG. 13 is a generalized flow diagram illustrating one embodiment of a method for a cache directory to maintain a data structure with entries for recently accessed private areas.

FIG. 14 is a generalized flow diagram illustrating one embodiment of a method for maintaining an array of cache directories at different granularity levels.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the methods and mechanisms presented herein. However, it will be recognized by one of ordinary skill in the art that various embodiments may be practiced without these specific details. In some instances, well-known structures, components, signals, computer program instructions, and techniques have not been shown in detail to avoid obscuring the methods described herein. It will be appreciated that for simplicity and clarity of illustration, elements illustrated in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements.

Various systems, devices, methods, and computer-readable media are disclosed herein for implementing a region-based cache directory scheme that accommodates large cache sizes. In one embodiment, a system includes at least a plurality of processing nodes, a memory, and a cache directory. Each processing node includes one or more processors and a cache subsystem. In one embodiment, the cache directory is configured to track portions of the memory based on regions, where each region includes a plurality of cache lines. The cache directory maintains an entry for each region of memory having at least one cache line cached in any cache subsystem of the system. The cache directory also maintains a reference count for each entry, where the reference count is the sum of the number of cache lines of the region cached in all cache subsystems of the system. The cache directory evicts a given entry in response to the reference count of the given entry reaching zero.

In one embodiment, a cache directory allocates a first entry for a first region in response to receiving an indication that a first cache line of the first region is being cached by a first cache subsystem of a system. The cache directory also initializes a reference count of the newly allocated first entry to one. The cache directory increments a reference count of the first entry to two in response to receiving an indication that a second cache line of the first region is being cached by a second cache subsystem of the system, wherein the second cache line is not necessarily physically distinct from the first cache line. The cache directory increments the reference count for each subsequent cache line of the first region cached by the cache subsystem.
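
The allocation, increment, and reclamation behavior described in the two paragraphs above can be sketched as follows. This is an illustrative model only, not the claimed hardware: the 2 KB region size, 64-byte line size, and all names are assumptions.

```python
# Sketch of a region-based cache directory entry lifecycle: the first
# cached line of a region allocates an entry with reference count one,
# later cached lines increment the count, evictions decrement it, and
# the entry is reclaimed when the count reaches zero.

LINE_SIZE = 64          # bytes per cache line (assumed)
LINES_PER_REGION = 32   # so a region covers 32 * 64 B = 2 KB (assumed)
REGION_SIZE = LINE_SIZE * LINES_PER_REGION

class RegionDirectory:
    def __init__(self):
        self.entries = {}   # region index -> reference count

    def region_of(self, addr):
        return addr // REGION_SIZE

    def line_cached(self, addr):
        """A cache subsystem cached a line: allocate or bump the entry."""
        region = self.region_of(addr)
        self.entries[region] = self.entries.get(region, 0) + 1

    def line_evicted(self, addr):
        """A cache subsystem evicted a line: decrement, reclaim at zero."""
        region = self.region_of(addr)
        self.entries[region] -= 1
        if self.entries[region] == 0:
            del self.entries[region]    # entry reclaimed for reuse

d = RegionDirectory()
d.line_cached(0x1000)        # first access allocates the entry, count = 1
d.line_cached(0x1040)        # second line of the same 2 KB region, count = 2
d.line_evicted(0x1000)
d.line_evicted(0x1040)       # count reaches zero -> entry reclaimed
print(len(d.entries))        # -> 0
```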

In one embodiment, the cache directory tracks whether a region is private or shared. For example, the cache directory stores a first entry to track a first region. When the first region is private, a first field of the first entry is interpreted as having a first meaning. When the first region is shared, the first field of the first entry is interpreted as having a second meaning different from the first meaning. In one embodiment, the first field is interpreted as a CPU valid field when the first region is private and as a cluster valid field when the first region is shared, where a cluster refers to a specific group of CPUs.
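
The dual interpretation of the first field can be sketched as follows. This is an illustrative model only, not the patented encoding: the 8-bit field width and the `cpu`/`cluster` naming are assumptions.

```python
# Sketch of the repurposed "valid" field: the same stored bits name
# individual CPUs while the region is private, and whole CPU clusters
# once the region is shared. Field width and names are assumptions.

PRIVATE, SHARED = "private", "shared"

def interpret_valid_field(state, field):
    """Return which agents the 8-bit valid field marks as present."""
    kind = "cpu" if state == PRIVATE else "cluster"
    return [f"{kind}{i}" for i in range(8) if field & (1 << i)]

# Identical bits, two meanings depending on the region's state:
bits = 0b0000_0101
print(interpret_valid_field(PRIVATE, bits))   # ['cpu0', 'cpu2']
print(interpret_valid_field(SHARED, bits))    # ['cluster0', 'cluster2']
```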

In one embodiment, a cache directory maintains a first array to track portions of memory at a first granularity. The cache directory also maintains a second array to track portions of the memory at a second granularity, where the second granularity is finer than the first granularity. In this embodiment, the cache directory performs lookups of the first array and the second array in parallel in response to receiving a given memory request. The cache directory generates one or more probes based only on data extracted from matching entries of the second array in response to the lookups hitting in both the first array and the second array.
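
The parallel two-array lookup, with the finer array taking precedence on a double hit, can be sketched as follows. The array contents, sizes, and owner sets are made-up illustration data, not the claimed implementation.

```python
# Sketch of the two-array lookup: a coarse region-based array and a
# finer line-based array are probed for the same request; when both
# hit, probe targets come only from the line-based (finer) entry.

REGION_SIZE = 2048   # assumed region size in bytes
LINE_SIZE = 64       # assumed line size in bytes

region_array = {0x1000 // REGION_SIZE: {"owners": {"cpu0", "cpu1"}}}
line_array = {0x1040 // LINE_SIZE: {"owners": {"cpu1"}}}

def lookup(addr):
    # Both lookups happen in parallel in hardware; sequential here.
    region_hit = region_array.get(addr // REGION_SIZE)
    line_hit = line_array.get(addr // LINE_SIZE)
    if line_hit is not None:
        return line_hit["owners"]        # finer entry wins on a double hit
    if region_hit is not None:
        return region_hit["owners"]
    return set()                         # miss: no cached copies tracked

print(lookup(0x1040))   # both arrays hit -> line entry only: {'cpu1'}
print(lookup(0x1080))   # only the region array hits: {'cpu0', 'cpu1'}
```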

In one embodiment, the cache directory monitors a reference count field of an entry in the first array. In one embodiment, the cache directory begins tracking a given region at a finer granularity in response to detecting that the region has transitioned from private to shared. In this embodiment, the cache directory creates an entry in the second array in response to receiving a memory request targeting the given region after detecting that the region has transitioned from private to shared. In another embodiment, the cache directory begins tracking the given region at a finer granularity in response to detecting that the reference count of the given region has exceeded a programmable threshold. In this embodiment, the cache directory creates an entry in the second array in response to receiving a memory request targeting the given region after detecting that the reference count has exceeded the threshold. In one embodiment, the first array is a region-based array and the second array is a line-based array.
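
The threshold-triggered promotion from region-granularity to line-granularity tracking can be sketched as follows. The threshold value, sizes, and structure names are assumptions for illustration.

```python
# Sketch of the promotion rule: once a region's reference count exceeds
# a (programmable) threshold, subsequent memory requests to that region
# allocate line-granularity entries in the second array.

THRESHOLD = 4            # assumed programmable threshold
LINE_SIZE = 64
REGION_SIZE = 2048

ref_counts = {}          # region -> reference count (first array)
line_entries = {}        # line index -> entry (second array)
promoted = set()         # regions now tracked at line granularity

def record_cached_line(addr):
    region = addr // REGION_SIZE
    ref_counts[region] = ref_counts.get(region, 0) + 1
    if ref_counts[region] > THRESHOLD:
        promoted.add(region)             # start finer-grained tracking

def memory_request(addr):
    region = addr // REGION_SIZE
    if region in promoted:
        # Requests arriving after promotion create second-array entries.
        line_entries.setdefault(addr // LINE_SIZE, {"state": "shared"})

for i in range(5):                       # five cached lines: 5 > 4
    record_cached_line(i * LINE_SIZE)
memory_request(0x80)                     # targets the promoted region
print(len(line_entries))                 # -> 1
```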

Referring now to FIG. 1, a block diagram of one embodiment of a computing system 100 is shown. In one embodiment, computing system 100 includes at least core complexes 105A-105N, input/output (I/O) interfaces 120, a bus 125, one or more memory controllers 130, and a network interface 135. In other embodiments, computing system 100 may include other components and/or computing system 100 may be arranged differently. In one embodiment, each core complex 105A-105N includes one or more general purpose processors, such as Central Processing Units (CPUs). It should be noted that a "core complex" may also be referred to herein as a "processing node" or "CPU". In some embodiments, one or more of the core complexes 105A-105N may include a data parallel processor having a highly parallel architecture. Examples of data parallel processors include Graphics Processing Units (GPUs), Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), and the like. Each processor core within core complexes 105A-105N includes a cache subsystem having one or more cache levels. In one embodiment, each core complex 105A-105N includes a cache (e.g., a third level (L3) cache) shared among multiple processor cores.

The one or more memory controllers 130 represent any number and type of memory controllers accessible by the core complexes 105A-105N. The one or more memory controllers 130 are coupled to any number and type of memory devices (not shown). For example, the types of memory in the one or more memory devices coupled to the one or more memory controllers 130 may include Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), NAND flash memory, NOR flash memory, ferroelectric random access memory (FeRAM), and the like. I/O interface 120 represents any number and type of I/O interfaces (e.g., Peripheral Component Interconnect (PCI) bus, PCI Extended (PCI-X), PCI Express (PCIe) bus, Gigabit Ethernet (GBE) bus, Universal Serial Bus (USB)). Various types of peripheral devices may be coupled to I/O interface 120. Such peripheral devices include, but are not limited to, displays, keyboards, mice, printers, scanners, joysticks or other types of game controllers, media recording devices, external storage devices, network interface cards, and the like.

In various embodiments, the computing system 100 may be a server, a computer, a laptop, a mobile device, a gaming console, a streaming device, a wearable device, or any of various other types of computing systems or devices. It should be noted that the number of components of computing system 100 may vary from implementation to implementation. There may be more or fewer of each component than shown in FIG. 1. It should also be noted that computing system 100 may include other components not shown in FIG. 1. Additionally, in other embodiments, the computing system 100 may be structured differently than shown in FIG. 1.

Turning now to FIG. 2, a block diagram of one embodiment of a core complex 200 is shown. In one embodiment, the core complex 200 includes four processor cores 210A-210D. In other embodiments, core complex 200 may include other numbers of processor cores. It should be noted that a "core complex" may also be referred to herein as a "processing node" or "CPU". In one embodiment, the components of core complex 200 are included within core complexes 105A-105N (of FIG. 1).

Each processor core 210A-210D includes a cache subsystem for storing data and instructions retrieved from a memory subsystem (not shown). For example, in one embodiment, each core 210A-210D includes a corresponding first-level (L1) cache 215A-215D. Each processor core 210A-210D may include or be coupled to a corresponding second level (L2) cache 220A-220D. Additionally, in one embodiment, the core complex 200 includes a third level (L3) cache 230 shared by the processor cores 210A-210D. The L3 cache 230 is coupled to a coherency master to access the fabric and the memory subsystem. It should be noted that in other embodiments, core complex 200 may include other types of cache subsystems having other numbers of caches and/or other configurations with different cache levels.

Referring now to FIG. 3, a block diagram of one embodiment of a multi-CPU system 300 is shown. In one embodiment, the system includes a plurality of CPUs 305A-305N. The number of CPUs per system may vary from implementation to implementation. Each CPU 305A-305N may include any number of cores 308A-308N, respectively, where the number of cores varies depending on the implementation. Each CPU 305A-305N also includes a corresponding cache subsystem 310A-310N. Each cache subsystem 310A-310N may include any number of levels of cache and any type of cache hierarchy.

In one embodiment, each CPU 305A-305N is connected to a corresponding coherency master 315A-315N. As used herein, a "coherency master" is defined as an agent that handles traffic flowing through the interconnect fabric (e.g., bus/fabric 318) and manages the coherency of the connected CPU. To manage coherency, a coherency master receives and processes coherency-related messages and probes, and the coherency master generates coherency-related requests and probes. It should be noted that a "coherency master" may also be referred to herein as a "coherency master unit".

In one embodiment, each CPU 305A-305N is coupled to a pair of coherency slaves through a corresponding coherency master 315A-315N and bus/fabric 318. For example, CPU 305A is coupled to coherency slaves 320A-320B through coherency master 315A and bus/fabric 318. Coherency slave (CS) 320A is coupled to Memory Controller (MC) 330A, and coherency slave 320B is coupled to memory controller 330B. Coherency slave 320A is coupled to a Cache Directory (CD) 325A, where cache directory 325A includes entries for regions of the memory accessible through memory controller 330A that have cache lines cached in system 300. It should be noted that cache directory 325A, as well as each of the other cache directories, may also be referred to as a "probe filter". Similarly, coherency slave 320B is coupled to cache directory 325B, where cache directory 325B includes entries for regions of the memory accessible through memory controller 330B that have cache lines cached in system 300. It should be noted that the example of two memory controllers per CPU is merely indicative of one implementation. It should be appreciated that in other embodiments, each of the CPUs 305A-305N may be connected to a number of memory controllers other than two.

In a configuration similar to that of CPU 305A, CPU 305B is coupled to coherency slaves 335A-335B through coherency master 315B and bus/fabric 318. Coherency slave 335A is coupled to memory through memory controller 350A, and coherency slave 335A is also coupled to cache directory 345A to manage the coherency of cache lines corresponding to the memory accessible through memory controller 350A. Coherency slave 335B is coupled to cache directory 345B, and coherency slave 335B is coupled to memory through memory controller 350B. Similarly, CPU 305N is coupled to coherency slaves 355A-355B through coherency master 315N and bus/fabric 318. Coherency slaves 355A-355B are coupled to cache directories 360A-360B, respectively, and coherency slaves 355A-355B are coupled to memory through memory controllers 365A-365B, respectively. As used herein, a "coherency slave" is defined as an agent that manages coherency by processing received requests and probes targeting a corresponding memory controller. It should be noted that a "coherency slave" may also be referred to herein as a "coherency slave unit". Additionally, as used herein, a "probe" is defined as a message passed from a coherency point to one or more caches in the computer system to determine whether the caches have a copy of a block of data, and optionally to indicate the state into which the cache should place the block of data.

When a coherent slave device receives a memory request targeting its corresponding memory controller, the coherent slave device performs a lookup of its corresponding cache directory to determine whether the request targets a region having at least one cache line cached in any cache subsystem. In one embodiment, each cache directory in system 300 tracks a region of memory, where a region includes a plurality of cache lines. The size of the tracked area may vary from implementation to implementation. By tracking at the granularity of regions rather than at a finer granularity of cache lines, the size of each cache directory may be reduced. It should be noted that "region" may also be referred to herein as a "page". When a coherence slave receives a request, the coherence slave determines the region targeted by the request. Then, a lookup of the cache directory is performed for this region. If the lookup results in a hit, the coherency slave sends a probe to the one or more CPUs identified in the hit entry. The type of probe generated by the coherency slave depends on the coherency state specified by the hit entry.
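
The coherency-slave lookup flow described above can be sketched as follows. The directory contents, the region size, and the mapping from coherency state to probe type are illustrative assumptions, not the claimed behavior.

```python
# Sketch of a coherency slave handling a memory request: map the
# address to its region, look the region up in the cache directory,
# and on a hit send probes to the CPUs named by the entry, with the
# probe type chosen from the entry's coherency state (assumed mapping).

REGION_SIZE = 2048                        # assumed region size

directory = {                             # region index -> entry
    0x3000 // REGION_SIZE: {"state": "dirty", "owners": ["cpu1"]},
}

def handle_request(addr):
    entry = directory.get(addr // REGION_SIZE)
    if entry is None:
        return []                         # no cached copies: no probes
    # Assumed mapping: a dirty region needs an invalidating probe, a
    # shared one only a non-invalidating (sharing) probe.
    kind = "invalidate" if entry["state"] == "dirty" else "share"
    return [(cpu, kind) for cpu in entry["owners"]]

print(handle_request(0x3040))             # -> [('cpu1', 'invalidate')]
print(handle_request(0x9000))             # -> []
```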

It should be noted that although not shown in fig. 3, in other embodiments, there may be other connections from bus/fabric 318 to other components not shown to avoid obscuring the drawing. For example, in another embodiment, bus/fabric 318 includes connections for one or more I/O interfaces and one or more I/O devices.

Turning now to FIG. 4, a block diagram of one embodiment of a cache directory 400 is shown. In one embodiment, cache directory 400 includes a control unit 405 and an array 410. Array 410 may include any number of entries, where the number of entries varies according to the implementation. In one embodiment, each entry of array 410 includes a status field 415, a sector valid field 420, a cluster valid field 425, a reference count field 430, and a tag field 435. In other embodiments, the entries of array 410 may include other fields and/or may be arranged in other suitable manners.

The status field 415 includes a status bit that specifies the aggregation status of the region. The aggregate state reflects the most restrictive cache line state for this particular region. For example, the state of a given region is stored as "dirty" even if only one cache line in the entire given region is dirty. Furthermore, the state of a given region is stored as "shared" even if only one cache line is shared throughout the given region.
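
The aggregate-state rule can be sketched as follows. The restrictiveness ordering used here (invalid < clean < shared < dirty) is an assumption for illustration; the document only states that dirty and shared each dominate when present.

```python
# Sketch of the aggregate state stored in status field 415: the entry
# records the most restrictive state held by any line in the region.

RESTRICTIVENESS = {"invalid": 0, "clean": 1, "shared": 2, "dirty": 3}

def aggregate_state(line_states):
    """Most restrictive state across all lines of the region."""
    return max(line_states, key=RESTRICTIVENESS.get, default="invalid")

# One dirty line is enough to mark the whole region dirty:
print(aggregate_state(["clean", "clean", "dirty", "clean"]))  # dirty
print(aggregate_state(["clean", "shared", "clean"]))          # shared
```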

The sector valid field 420 stores a bit vector corresponding to subsets, or sectors, of lines within a region to provide fine-grained tracking. By tracking subsets of lines within a region, the number of unwanted regular coherency probes and individual line probes generated when deploying a region invalidation probe can be reduced. As used herein, a "region invalidation probe" is defined as a probe generated by the cache directory in response to a region entry being evicted from the cache directory. When a coherency master receives a region invalidation probe, the coherency master invalidates each cache line of the region cached by the local CPU. In addition, the region invalidation probe includes the tracked sector valid bits to mitigate probe amplification at the CPU caches.

The organization of the subgroups and the number of bits in the sector valid field 420 may vary depending on the implementation. In one embodiment, two lines are tracked within a particular region entry using the sector valid field 420. In another embodiment, other numbers of lines may be tracked within each zone entry. In this implementation, the sector valid field 420 may be used to indicate the number of partitions being tracked separately within the region. In addition, the partition may be identified using an offset stored in the sector valid field 420. Each offset identifies a location of a given partition within a given region. The sector valid field 420 or another field of the entry may also indicate a separate owner and a separate status for each partition within a given area.
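
The sector valid bookkeeping can be sketched as follows. The split of a 32-line region into four sectors is an assumption; as the paragraph above notes, the organization and bit count vary by implementation.

```python
# Sketch of the sector valid field: a small bit vector marking which
# sectors (subgroups of lines) of a region actually hold cached lines,
# so a region invalidation only needs to probe those sectors.

NUM_SECTORS = 4                 # assumed sectors per region
LINES_PER_REGION = 32           # assumed lines per region
LINES_PER_SECTOR = LINES_PER_REGION // NUM_SECTORS

def mark_line(sector_valid, line_index):
    """Set the valid bit of the sector containing this line."""
    return sector_valid | (1 << (line_index // LINES_PER_SECTOR))

def sectors_to_probe(sector_valid):
    """Only sectors with their bit set need line probes."""
    return [s for s in range(NUM_SECTORS) if sector_valid & (1 << s)]

sv = 0
sv = mark_line(sv, 3)           # line 3 lives in sector 0
sv = mark_line(sv, 20)          # line 20 lives in sector 2
print(sectors_to_probe(sv))     # -> [0, 2]
```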

The cluster valid field 425 includes a bit vector for tracking the presence of regions across various CPU cache clusters. For example, in one embodiment, CPUs are grouped together into CPU clusters. The bit vector stored in the cluster valid field 425 is used to reduce the probe destinations of the regular consistency probes and the area invalidation probes.

Reference count field 430 is used to track the number of cache lines of the region that are cached somewhere in the system. On the first access to a region, an entry is placed in array 410 and reference count field 430 is set to one. Over time, each time a cache caches another line from this region, the reference count is incremented. When a cache line from this region is evicted by a cache, the reference count is decremented. If the reference count eventually reaches zero, the entry is marked as invalid and can be reused for another region. By utilizing reference count field 430, the incidence of region invalidation probes may be reduced. Reference count field 430 allows directory entries to be reclaimed when they are associated with regions that have no active subscribers. In one embodiment, the reference count field 430 may saturate once the reference count crosses a threshold. The threshold may be set to a value large enough to handle private access patterns, while sacrificing some accuracy in handling the widely shared access patterns of communicated data. The tag field 435 includes tag bits used to identify the entry associated with a particular region.
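
The saturating behavior of the reference count can be sketched as follows. The 8-bit ceiling, and the choice to stop decrementing once saturated (since the exact count is then unknown), are assumptions layered on the paragraph above.

```python
# Sketch of a saturating reference count: increments stop at a ceiling,
# and a saturated count is no longer decremented, trading accuracy on
# widely shared regions for a small counter field.

SATURATION = 255          # assumed 8-bit counter ceiling

class SaturatingRefCount:
    def __init__(self):
        self.value = 1    # entry is created on the first cached line

    def increment(self):
        if self.value < SATURATION:
            self.value += 1

    def decrement(self):
        # Once saturated, the true count is unknown; never decrement
        # past the ceiling, so the entry is reclaimed by other means.
        if 0 < self.value < SATURATION:
            self.value -= 1

c = SaturatingRefCount()
for _ in range(300):      # more increments than the counter can hold
    c.increment()
print(c.value)            # -> 255 (saturated)
```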

Referring now to FIG. 5, a block diagram of another embodiment of a cache directory 500 is shown. In one embodiment, cache directory 500 includes at least a control unit 505 coupled to a region-based array 510 and an auxiliary line-based array 515. Region-based array 510 includes entries for tracking cached data at region granularity. In one embodiment, each entry of region-based array 510 includes a reference count that tracks the number of the region's cache lines cached by the cache subsystems of a computing system (e.g., system 300 of FIG. 3). In one embodiment, when the reference count for a given region reaches a threshold, the auxiliary line-based array 515 begins tracking the given region on a line basis.

Thus, in one embodiment, when the number of cache lines cached for a given region reaches a threshold, the auxiliary line-based array 515 will begin tracking accesses to individual lines of the given region. Each time a new cache line is accessed from a given region, a new entry is created for the cache line in auxiliary line-based array 515. In one embodiment, the lookup of the region-based array 510 and the line-based array 515 is performed in parallel.

In one embodiment, only shared regions with reference counts greater than a threshold are tracked by auxiliary line-based array 515 at cache line granularity. A shared region refers to a region having cache lines stored in the cache subsystems of at least two different CPUs. A private region refers to a region having cache lines cached by only a single CPU. Thus, in this embodiment, a shared region with a reference count greater than the threshold will have one or more entries in line-based array 515, while a private region will not have any entries in line-based array 515.

Turning now to FIG. 6, a block diagram of another embodiment of a cache directory 600 is shown. In one embodiment, cache directory 600 includes a control unit 605, a region-based array 610, a secondary line-based array 615, and a structure 620 for caching the N most recently accessed private pages. It should be noted that N is a positive integer that may vary depending on the implementation.

In one embodiment, structure 620 includes storage locations for temporarily caching entries for the N most recently accessed private pages. When control unit 605 receives a memory request or an invalidation request that matches an entry in structure 620, control unit 605 is configured to increment or decrement the reference count, modify the cluster valid field and/or the sector valid field, and so on, without accessing arrays 610 and 615. Thus, rather than reading and writing entries in arrays 610 and 615 on every access, accesses to the recently accessed private pages tracked in structure 620 can bypass arrays 610 and 615. Using structure 620 can help speed up updates to cache directory 600 for these private pages.

In one embodiment, I/O transactions that do not modify the sector valid bits or cluster valid bits may benefit from structure 620 for caching the N most recently accessed private pages. Typically, an I/O transaction will only modify the reference count of a given entry, and structure 620 may be updated rather than performing reads and writes to array 610 or 615 each time.

Thus, structure 620 enables efficient access to cache directory 600. In one implementation, an incoming request performs a lookup of structure 620 before performing a lookup of arrays 610 and 615. In one embodiment, control unit 605 determines whether there is a hit or miss in structure 620 when an incoming request is allocated in an input queue of a coherency slave (e.g., coherency slave 320A of FIG. 3). Later, when the request reaches the head of the queue, control unit 605 already knows whether the request hits in structure 620. If the request is a hit in structure 620, then the lookups of arrays 610 and 615 may be avoided.
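The bypass behavior of structure 620 can be sketched as follows (an illustrative Python model; the LRU replacement policy, class names, and dictionary layout are assumptions, not the patented implementation):

```python
from collections import OrderedDict

class RecentPrivatePages:
    """Illustrative model of structure 620: a small cache of entries for
    the N most recently accessed private pages."""
    def __init__(self, n):
        self.n = n
        self.entries = OrderedDict()  # page address -> reference count

    def lookup(self, page):
        if page in self.entries:
            self.entries.move_to_end(page)  # refresh recency on a hit
            return True
        return False

    def insert(self, page, ref_count):
        self.entries[page] = ref_count
        self.entries.move_to_end(page)
        if len(self.entries) > self.n:
            self.entries.popitem(last=False)  # drop least recently accessed

def handle_request(page, recent, main_arrays):
    # A hit in the recent-private-page structure updates the reference
    # count there and skips the main directory arrays entirely.
    if recent.lookup(page):
        recent.entries[page] += 1
        return "bypassed"
    return main_arrays(page)  # fall back to lookups of arrays 610/615

recent = RecentPrivatePages(n=2)
recent.insert(0x1000, ref_count=3)
assert handle_request(0x1000, recent, lambda p: "main-array lookup") == "bypassed"
assert recent.entries[0x1000] == 4
assert handle_request(0x2000, recent, lambda p: "main-array lookup") == "main-array lookup"
```

In hardware the hit/miss determination would overlap with queueing, as described above, so the bypass costs no extra latency on the request path.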

Referring now to FIG. 7, a diagram of one embodiment of repurposing a cluster valid field in a cache directory entry is shown. In one embodiment, entry 705A includes various fields associated with a shared region being tracked by the cache directory. The status field 710A stores a shared encoding 715A to indicate that the corresponding region is shared. As used herein, a "shared" region refers to a region having cache lines cached by multiple different CPU clusters. When the status field 710A stores the shared encoding 715A, the cluster valid field 720A stores a bit vector indicating which CPU clusters 725A-725N are caching cache lines of the corresponding region. In this example, the cluster valid field 720A groups CPUs together into clusters. In one embodiment, if a cluster bit 725A-725N is set to one, this indicates that the corresponding CPU cluster stores at least one cache line from the region. In one embodiment, if a cluster bit 725A-725N is set to zero, this indicates that none of the CPUs in that cluster are storing a cache line from the region. Entry 705A may also include any number of other fields not shown to avoid obscuring the figure.

If the cluster valid field 720A were kept unchanged even for a private region, then a probe would need to be sent to all CPUs in any cluster identified as caching at least one cache line of that region. In contrast, in one embodiment, if the region is private (i.e., accessed by only a single cluster), the cluster valid field is repurposed as an owner valid field or a CPU valid field. This allows the cache directory to probe one particular CPU for the private region.

Thus, if the region being tracked by entry 705A transitions from being a shared region to being a private region, entry 705B illustrates how the fields change as compared to entry 705A for this private region. As shown in entry 705B, status field 710B now stores a private encoding 715B to represent the private state of the region. Since state 710B has now changed to private 715B, the previous cluster valid field 720A now becomes CPU valid field 720B. Each bit 735A-735N of the bit vector stored in CPU valid field 720B represents a single CPU in the original cluster. A particular CPU bit 735A-735N is set to one if the corresponding CPU in this cluster caches at least one cache line of the region. Otherwise, if the corresponding CPU does not cache any cache lines from the region, the CPU bit 735A-735N is set to zero.

By changing the cluster valid field 720A to the CPU valid field 720B, directed probes may be issued that target only the CPUs caching lines from the region. This helps reduce the number of unnecessary probes generated by the cache directory. In one embodiment, if a request targeting the private region (corresponding to entry 705B) is received from a different cluster, then the private region becomes a shared region. When this occurs, the cluster valid field 720A resumes its normal operation because the region is now shared.
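Decoding the same bit vector under the two states can be sketched as follows (an illustrative Python helper; the field widths, four CPUs per cluster, and naming are assumptions for the example, not values taken from the patent):

```python
def decode_valid_field(state, valid_bits, cpus_per_cluster=4):
    """Return the probe targets implied by a valid-field bit vector: whole
    clusters when the region is shared (field 720A), or individual CPUs of
    the owning cluster when the region is private (field 720B)."""
    targets = []
    for i, bit in enumerate(valid_bits):
        if bit:
            if state == "shared":
                # Cluster bit i set: every CPU in cluster i must be probed.
                targets.extend(f"cluster{i}.cpu{j}"
                               for j in range(cpus_per_cluster))
            else:
                # Private: bit i names one CPU, enabling a directed probe.
                targets.append(f"cpu{i}")
    return targets

# Shared region: one set cluster bit expands to all CPUs of that cluster.
assert decode_valid_field("shared", [0, 1]) == [
    "cluster1.cpu0", "cluster1.cpu1", "cluster1.cpu2", "cluster1.cpu3"]
# Private region: the same vector width targets exactly one CPU.
assert decode_valid_field("private", [0, 1]) == ["cpu1"]
```

The example makes the benefit concrete: the private interpretation cuts a four-CPU broadcast down to a single directed probe without widening the entry.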

Turning now to FIG. 8, one embodiment of a method 800 for maintaining a region-based cache directory is illustrated. For discussion purposes, the steps in this embodiment and those of FIGS. 9-13 are shown in sequential order. It should be noted, however, that in various embodiments of the described methods, one or more of the described elements are performed concurrently, in a different order than illustrated, or omitted entirely. Other additional elements may also be performed as desired. Any of the various systems or devices described herein may be configured to implement method 800.

The cache directory tracks regions of memory having at least one cache line cached in any cache subsystem of the multi-CPU system, wherein each region includes a plurality of cache lines (block 805). The cache directory maintains an entry for each region of memory having at least one cache line cached in any cache subsystem of the system (block 810). The cache directory maintains a reference count for each entry of the cache directory, where the reference count is the total number of cache lines of the region cached across all cache subsystems of the system (block 815). In response to receiving a memory request targeting a cache line within a given region, the cache directory sends a probe to each CPU caching any cache lines of the given region (block 820). In response to a cache line of the corresponding region being cached by a CPU, the cache directory increments the reference count of the given entry (block 825). In response to a cache line of the corresponding region being evicted from the cache subsystem of a CPU, the cache directory decrements the reference count of the given entry (block 830). After block 830, the method 800 ends.
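The bookkeeping of method 800 can be sketched as follows (a functional Python model; the region size, class structure, and names are illustrative assumptions, not the patented hardware):

```python
class RegionDirectory:
    """Illustrative model of method 800: one entry per region, each with a
    reference count and the set of CPUs caching lines of that region."""
    LINES_PER_REGION = 32  # assumed region size in cache lines

    def __init__(self):
        self.entries = {}  # region number -> {"ref": int, "cpus": set}

    def region_of(self, line_addr):
        return line_addr // self.LINES_PER_REGION

    def on_fill(self, cpu, line_addr):
        # Block 825: a line of the region was cached by a CPU.
        e = self.entries.setdefault(self.region_of(line_addr),
                                    {"ref": 0, "cpus": set()})
        e["ref"] += 1
        e["cpus"].add(cpu)

    def on_evict(self, line_addr):
        # Block 830: a line of the region was evicted from some cache.
        self.entries[self.region_of(line_addr)]["ref"] -= 1

    def probe_targets(self, line_addr):
        # Block 820: probe each CPU holding any line of the target region.
        e = self.entries.get(self.region_of(line_addr))
        return sorted(e["cpus"]) if e else []

d = RegionDirectory()
d.on_fill("cpu0", 5)
d.on_fill("cpu2", 7)                       # lines 5 and 7 share region 0
assert d.probe_targets(12) == ["cpu0", "cpu2"]
assert d.entries[0]["ref"] == 2
```

One entry here covers 32 lines, which is the source of the directory-size reduction relative to a line-based directory.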

Referring now to FIG. 9, one embodiment of a method 900 for a coherency slave to process an invalidation indication from a CPU is shown. The coherency slave receives an invalidation indication from a given CPU indicating that the given CPU is no longer caching a particular cache line (block 905). In response to receiving the invalidation indication, the coherency slave determines the region of the particular cache line (block 910). Next, the coherency slave locates the cache directory entry for the identified region (block 915).

The coherency slave then decrements the reference count in the located cache directory entry for the identified region (block 920). If the reference count is now equal to zero (conditional block 925, "yes" leg), the coherency slave invalidates the entry (block 930). The entry may then be reused when a new entry is allocated upon receiving a memory request targeting a region without a cache directory entry. If the reference count is still greater than zero (conditional block 925, "no" leg), the coherency slave maintains the valid state of the entry (block 935). After blocks 930 and 935, the method 900 ends.
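The decrement-and-reclaim step of method 900 can be sketched as follows (a minimal Python helper; the dict-based directory and return values are illustrative assumptions):

```python
def handle_invalidation(directory, region):
    """Model of blocks 920-935: directory maps a region identifier to its
    reference count; an entry whose count reaches zero is reclaimed."""
    directory[region] -= 1            # block 920: one fewer line cached
    if directory[region] == 0:        # conditional block 925, "yes" leg
        del directory[region]         # block 930: entry can be reused
        return "reclaimed"
    return "valid"                    # block 935: entry stays valid

d = {0x1000: 2}
assert handle_invalidation(d, 0x1000) == "valid"
assert handle_invalidation(d, 0x1000) == "reclaimed"
assert 0x1000 not in d
```

Reclaiming on a zero count is what lets the directory cover a large cache with a bounded number of region entries.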

Turning now to FIG. 10, one embodiment of a method 1000 of a cache directory tracking accesses to a first region of memory is illustrated. The cache directory receives a first memory request for a first cache line within a first region of memory (block 1005). In response to determining that the first region has no entry in the cache directory, a new entry is allocated in the cache directory for the first region and the reference count of the new entry is initialized to one (block 1010). If the cache directory is full, the cache directory evicts an existing entry to make room for the new entry. The cache directory may utilize any suitable eviction algorithm to determine which entry to evict. Next, the cache directory receives a second memory request for a second cache line within the first region of memory, where the second cache line is different from the first cache line (block 1015). In response to receiving the second memory request targeting the first region of memory, the reference count of the entry corresponding to the first region of memory is incremented to two (block 1020). After block 1020, method 1000 ends. It should be noted that the reference count of the entry corresponding to the first region of memory is incremented for each subsequent memory request that causes a cache line of the first region to be cached. Similarly, each time a cache line from the first region is evicted from a cache subsystem, the reference count is decremented.

Referring now to FIG. 11, one embodiment of a method 1100 for maintaining cache directory arrays at different granularity levels is illustrated. The cache directory maintains entries in the first array to track address ranges of the memory at a first granularity (block 1105). In one embodiment, the first granularity is a page of memory. In one embodiment, the page is 4 kilobytes (KB). In other embodiments, the first granularity may correspond to other address range sizes.

The cache directory maintains a reference count for each entry in the first array to track the number of cache lines of the corresponding region that are cached anywhere in the system (block 1110). If the reference count of a given entry exceeds a programmable threshold (conditional block 1115, "yes" branch), the cache directory begins tracking the corresponding address range at a second granularity, where the second granularity is finer than the first granularity (block 1120). In one embodiment, the second granularity corresponds to a cache line. In one embodiment, a cache line includes 64 bytes. In other embodiments, other cache line sizes may be utilized. If the reference count of the given entry is less than or equal to the programmable threshold (conditional block 1115, "no" branch), the cache directory continues to track the corresponding address range at the first granularity (block 1125). After block 1125, the method 1100 returns to block 1110.

For each new cache line from the corresponding address range that is cached in the system, the cache directory allocates a new entry in the second array to track this portion of the region at the second granularity (block 1130). If the second array is full, the cache directory evicts an entry from the second array to make room for the new entry. The cache directory may utilize any suitable eviction algorithm to determine which entry to evict. After block 1130, the method 1100 returns to block 1110.
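The two-granularity tracking of method 1100 can be sketched as follows (dict-based arrays and the specific threshold, page, and line sizes are illustrative assumptions; the 4 KB page and 64-byte line match the embodiments named above):

```python
THRESHOLD = 16   # programmable threshold (value assumed for illustration)
PAGE = 4096      # first granularity: a 4 KB page
LINE = 64        # second granularity: a 64-byte cache line

first_array = {}   # page base address -> reference count
second_array = {}  # line address -> owner information

def record_fill(line_addr, owner):
    # Block 1110: every fill bumps the page's reference count. Once the
    # count exceeds the threshold (conditional block 1115), newly cached
    # lines of that page are also tracked in the second array, at line
    # granularity (blocks 1120 and 1130).
    page = line_addr - line_addr % PAGE
    first_array[page] = first_array.get(page, 0) + 1
    if first_array[page] > THRESHOLD:
        second_array[line_addr - line_addr % LINE] = owner

for i in range(17):
    record_fill(0x1000 + i * LINE, owner="cpu0")
assert first_array[0x1000] == 17
assert len(second_array) == 1  # only the fill past the threshold is line-tracked
```

Cold pages thus cost one region entry each, and only hot pages pay for per-line entries.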

Turning now to FIG. 12, one embodiment of a method 1200 for operating a multi-array cache directory is shown. A memory request is received at a multi-array cache directory of a multi-CPU system (block 1205). The cache directory performs parallel lookups of the first array and the second array for the memory request (block 1210). In one embodiment, the first array tracks address ranges at a first granularity and the second array tracks address ranges at a second granularity, wherein the second granularity is finer than the first granularity. If the lookups result in hits to both the first array and the second array (conditional block 1215, "yes" branch), the cache directory generates a probe from the matching entry in the second array (block 1220). If the lookups result in a hit to only the first array (conditional block 1225, "yes" branch), the cache directory generates a probe from the matching entry in the first array (block 1230). If the lookups result in misses to both the first array and the second array (conditional block 1225, "no" branch), the cache directory allocates a new entry in the first array for the address range of the memory request at the first granularity (block 1235). If the first array is full, the cache directory may evict an existing entry from the first array. The cache directory may utilize any suitable technique to determine which existing entry to evict. If the cache directory evicts an existing entry from the first array, the cache directory sends invalidation probes to the one or more CPUs caching lines of the corresponding address range. After blocks 1220, 1230, and 1235, the method 1200 ends.
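The precedence rule of method 1200 can be sketched as follows (an illustrative Python function; the arrays are modeled as dicts and looked up sequentially rather than in parallel, and the 4 KB page and 64-byte line masks are assumed granularities):

```python
def directory_lookup(addr, first_array, second_array):
    """Model of blocks 1215-1235: the finer-grained second-array match
    wins when both arrays hit; a double miss allocates a region entry."""
    page, line = addr & ~0xFFF, addr & ~0x3F
    if line in second_array:            # hit in both or second -> block 1220
        return ("probe", second_array[line])
    if page in first_array:             # hit in first only -> block 1230
        return ("probe", first_array[page])
    first_array[page] = "new region entry"  # double miss -> block 1235
    return ("allocate", page)

first = {0x2000: "cluster bits"}
second = {0x2040: "cpu bits"}
assert directory_lookup(0x2044, first, second) == ("probe", "cpu bits")
assert directory_lookup(0x2004, first, second) == ("probe", "cluster bits")
assert directory_lookup(0x5008, first, second) == ("allocate", 0x5000)
```

Preferring the second array is what lets a hot page's probes stay directed at individual lines even though the page also has a coarse entry.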

Referring now to FIG. 13, one embodiment of a method 1300 of a cache directory maintaining a data structure with entries for recently accessed private regions is shown. The cache directory maintains a data structure to cache entries for the N most recently accessed private regions (block 1305). The value of N may vary depending on the embodiment. The cache directory also maintains one or more arrays having entries for tracking regions that have cache lines cached in any cache subsystem of the multi-CPU system (block 1310).

Next, a memory request is received at the cache directory (block 1315). In response to receiving the memory request, the cache directory performs a lookup of the data structure having entries for the N most recently accessed private regions (block 1320). If the lookup results in a hit (conditional block 1325, "yes" leg), the cache directory generates probes for one or more cache subsystems based on the matching entry (block 1330). In addition, the cache directory cancels the lookup of the one or more primary cache directory arrays (block 1335). If the lookup results in a miss (conditional block 1325, "no" leg), the cache directory performs one or more lookups of the one or more primary cache directory arrays for the request (block 1340). The cache directory may utilize any suitable procedure for processing a lookup of the primary cache directory arrays. After blocks 1335 and 1340, the method 1300 ends.

Referring now to FIG. 14, one embodiment of a method 1400 for maintaining cache directory arrays at different granularity levels is illustrated. The cache directory maintains entries in the first array to track address ranges of private regions of memory at a first granularity (block 1405). In one embodiment, the first granularity is a page of memory. In one embodiment, the page is 4 KB. In other embodiments, the first granularity may correspond to other address range sizes.

The cache directory determines whether any of the private regions has transitioned to the shared state (conditional block 1410). If a private region has transitioned to the shared state (conditional block 1410, "yes" branch), the cache directory begins tracking the corresponding address range at a second granularity, where the second granularity is finer than the first granularity (block 1415). In one embodiment, the second granularity corresponds to a cache line. In one embodiment, a cache line includes 64 bytes. In other embodiments, other cache line sizes may be utilized. If no private region has transitioned to the shared state (conditional block 1410, "no" branch), the cache directory continues to track the corresponding address ranges at the first granularity (block 1420). After block 1420, the method 1400 returns to block 1410.

For each new cache line from the corresponding address range that is cached in the system, the cache directory allocates a new entry in the second array to track this portion of the region at the second granularity (block 1425). If the second array is full, the cache directory evicts an entry from the second array to make room for the new entry. The cache directory may utilize any suitable eviction algorithm to determine which entry to evict. After block 1425, the method 1400 returns to block 1410.
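The trigger condition of method 1400 can be sketched as follows (an illustrative Python model; the entry layout and cluster names are assumptions for the example):

```python
def on_access(entry, cluster):
    """Model of conditional block 1410 / block 1415: a private region
    touched by a second cluster becomes shared and is thereafter tracked
    at the finer, cache-line granularity."""
    entry["clusters"].add(cluster)
    if len(entry["clusters"]) >= 2:     # accessed by multiple clusters
        entry["state"] = "shared"
        entry["granularity"] = "line"   # switch to the second granularity
    return entry

e = {"clusters": {"cluster0"}, "state": "private", "granularity": "page"}
on_access(e, "cluster0")  # repeated access by the owner changes nothing
assert e["state"] == "private" and e["granularity"] == "page"
on_access(e, "cluster1")  # a second cluster makes the region shared
assert e["state"] == "shared" and e["granularity"] == "line"
```

Unlike method 1100, the promotion here keys on sharing rather than on the reference count, so a region stays page-granular no matter how hot it is while a single cluster owns it.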

In various embodiments, the methods and/or mechanisms described herein are implemented using program instructions of a software application. For example, program instructions executable by a general-purpose processor or a special-purpose processor are contemplated. In various embodiments, such program instructions may be represented by a high-level programming language. In other embodiments, the program instructions may be compiled from a high-level programming language into binary, intermediate, or other forms. Alternatively, program instructions describing the behavior or design of hardware may be written. Such program instructions may be represented by a high-level programming language such as C. Alternatively, a hardware design language (HDL) such as Verilog may be used. In various embodiments, the program instructions are stored on any of a variety of non-transitory computer-readable storage media. The storage medium is accessible by a computing system during use to provide the program instructions to the computing system for execution. Generally, such a computing system includes at least one or more memories and one or more processors configured to execute program instructions.

It should be emphasized that the above-described embodiments are merely non-limiting examples of implementations. Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.
