Coherent directory caching based on memory structure

Document No.: 1316033, publication date: 2020-07-10

This technology, "Coherent directory caching based on memory structure" (memory fabric-based coherency directory cache), was created by F·R·德罗普斯 and T·麦基 on 2019-01-30. Its main content includes the following: In some examples, with respect to a memory fabric-based coherency directory cache implementation, a hardware sequencer may include hardware to identify adjacent cache lines for a coherency directory cache that includes information related to a plurality of cache lines. A state associated with each of the adjacent cache lines may be determined. The adjacent cache lines may be grouped based on determining that the state associated with one of the adjacent cache lines is the same as the state associated with the remaining active adjacent cache lines. For the coherency directory cache, the hardware sequencer may utilize an entry in a memory structure to identify the grouped cache lines. The data associated with the entry in the memory structure may include more than two possible memory states.

1. An apparatus, the apparatus comprising:

a hardware sequencer comprising hardware to:

identify, for a coherent directory cache comprising information associated with a plurality of cache lines, adjacent cache lines;

determine a state associated with each of the adjacent cache lines;

group the adjacent cache lines based on determining that the state associated with one of the adjacent cache lines is the same as the state associated with the remaining active adjacent cache lines; and

utilize, for the coherent directory cache, an entry in a memory structure to identify information related to the grouped cache lines, wherein data associated with the entry in the memory structure comprises more than two possible memory states.

2. The apparatus of claim 1, wherein the memory structure comprises a Ternary Content Addressable Memory (TCAM).

3. The apparatus of claim 1, wherein the entry comprises an address that uniquely identifies the entry in the memory structure.

4. The apparatus of claim 3, wherein the hardware further causes the hardware sequencer to:

write a specified number of lower bits of the address as "X" bits, wherein the data associated with the entry in the memory structure includes a possible memory state of "0", "1", or "X", and wherein the "X" memory state represents "0" or "1"; and

specify, based on the written lower bits, coherency directory cache tracking as valid for each of the grouped cache lines.

5. The apparatus of claim 1, wherein the entry comprises a single entry in the memory structure for identifying information related to the grouped cache lines.

6. The apparatus of claim 1, wherein a number of the grouped cache lines is equal to four adjacent cache lines.

7. The apparatus of claim 1, wherein the hardware further causes the hardware sequencer to:

utilize the entry to specify:

zero stored cache lines, wherein the entry is not a valid entry associated with a cache line, or

a specified number of adjacent cache lines, wherein the specified number is greater than 1.

8. The apparatus of claim 1, wherein the hardware further causes the hardware sequencer to:

designate the entry as a new entry;

determine whether the memory structure includes a previous entry corresponding to the new entry; and

based on determining that the memory structure does not include the previous entry corresponding to the new entry, add the new entry into an unused entry location of the memory structure.

9. The apparatus of claim 1, wherein the hardware further causes the hardware sequencer to:

designate the entry as a new entry;

determine whether the memory structure includes a previous entry corresponding to the new entry;

determine whether all entry locations in the memory structure are used based on determining that the memory structure does not include the previous entry corresponding to the new entry;

evict a least recently used entry of the memory structure based on determining that all entry locations in the memory structure are used; and

add the new entry into an entry location corresponding to the evicted least recently used entry of the memory structure.

10. The apparatus of claim 1, wherein the hardware further causes the hardware sequencer to:

designate the entry as a new entry;

determine whether the memory structure includes a previous entry corresponding to the new entry;

based on determining that the memory structure includes the previous entry corresponding to the new entry, determine, for the previous entry, whether a designated bit corresponding to the new entry is set; and

designate the new entry as a cache hit based on determining that the designated bit is set.

11. The apparatus of claim 10, wherein the hardware further causes the hardware sequencer to:

determine whether a state associated with the new entry is the same as a state associated with the previous entry based on determining that the designated bit is not set; and

based on determining that the state associated with the new entry is the same as the state associated with the previous entry, set the designated bit to add the new entry to the coherent directory cache.

12. The apparatus of claim 11, wherein the hardware further causes the hardware sequencer to:

based on determining that the state associated with the new entry and the state associated with the previous entry are not the same, add the new entry to the coherent directory cache as a different entry than the previous entry.

13. A computer-implemented method, the method comprising:

identifying, for a coherent directory cache including information associated with a plurality of cache lines, adjacent cache lines;

determining a state associated with each of the adjacent cache lines;

grouping the adjacent cache lines based on determining that the state associated with one of the adjacent cache lines is the same as the state associated with the remaining active adjacent cache lines; and

utilizing, for the coherent directory cache, a single entry in a Ternary Content Addressable Memory (TCAM) to identify information associated with the grouped cache lines.

14. The method of claim 13, further comprising:

determining whether the state associated with the one of the adjacent cache lines has changed;

based on determining that the state associated with the one of the adjacent cache lines has changed, designating the one of the adjacent cache lines whose state has changed as a new entry;

determining whether the TCAM includes another entry corresponding to the new entry; and

based on determining that the TCAM does not include another entry corresponding to the new entry, adding the new entry into an unused entry location of the TCAM.

15. The method of claim 13, further comprising:

determining whether the state associated with the one of the adjacent cache lines has changed;

based on determining that the state associated with the one of the adjacent cache lines has changed, designating the one of the adjacent cache lines whose state has changed as a new entry;

determining whether the TCAM includes another entry corresponding to the new entry;

determining whether all entry locations in the TCAM are used based on determining that the TCAM does not include another entry corresponding to the new entry;

evicting a least recently used entry of the TCAM based on determining that all entry locations in the TCAM are used; and

adding the new entry to an entry location corresponding to the evicted least recently used entry of the TCAM.

16. The method of claim 13, further comprising:

determining whether a state associated with the one of the adjacent cache lines has changed; and

based on determining that the state associated with the one of the adjacent cache lines has changed, clearing programming associated with the one of the adjacent cache lines whose state has changed to remove the one of the adjacent cache lines whose state has changed from the grouped cache lines.

17. A non-transitory computer readable medium having stored thereon machine readable instructions that, when executed, cause a processor to:

identify, upon receiving a request or upon completing a previously received request, a set of a specified number of adjacent cache lines for a coherency directory cache that includes information associated with a plurality of cache lines;

determine a state and ownership associated with each of the adjacent cache lines; and

utilize, for the coherency directory cache, an entry in a memory structure to identify the information related to the set of the specified number of adjacent cache lines based on determining that the state and ownership associated with one of the adjacent cache lines are the same as the state and ownership, respectively, associated with the remaining active adjacent cache lines, wherein data associated with the entry in the memory structure includes more than two possible memory states.

18. The non-transitory computer-readable medium of claim 17, wherein the specified number of adjacent cache lines is equal to four adjacent cache lines.

19. The non-transitory computer-readable medium of claim 17, wherein the memory structure comprises a Ternary Content Addressable Memory (TCAM).

20. The non-transitory computer readable medium of claim 17, wherein the entry includes an address that uniquely identifies the entry in the memory structure, and wherein the machine readable instructions, when executed, further cause the processor to:

write a specified number of lower bits of the address as "X" bits, wherein the data associated with the entry in the memory structure includes a possible memory state of "0", "1", or "X", and wherein the "X" memory state represents "0" or "1"; and

designate, based on the written lower bits, coherency directory cache tracking as valid for each cache line of the set of the specified number of adjacent cache lines.

Background

With respect to cache coherence, directory-based coherence may be implemented for non-uniform memory access (NUMA) and other such memory access types. In this regard, the coherency directory may include entry information to track the state and ownership of memory blocks that may be shared between processors in the multi-processor shared memory system. A coherent directory cache may be described as a component that stores a subset of coherent directory entries that provide faster access and increased data bandwidth. For directory-based coherency, the node controller may use a coherent directory cache to manage communications between different nodes of a computer system or different computer systems. In this regard, a coherent directory cache may track the state of cache blocks (or cache lines) of a computer system or of different computer systems. For example, a coherent directory cache may track which nodes of a computer system or different computer systems are sharing a cache block.

Drawings

Features of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawing(s), in which like references indicate similar elements, and in which:

FIG. 1 illustrates an example layout of a memory fabric-based coherent directory cache implementation and related components;

FIG. 2 illustrates a processing flow of a processing state machine to illustrate the operation of the memory fabric-based coherent directory cache implementation apparatus of FIG. 1;

FIG. 3 illustrates a scrubber flow of a background scrubbing state machine to illustrate the operation of the memory fabric-based coherent directory cache implementation apparatus of FIG. 1;

FIG. 4 illustrates an example block diagram of a memory fabric-based coherent directory cache implementation;

FIG. 5 illustrates an example flow diagram of a method for a memory fabric-based coherent directory cache implementation; and

FIG. 6 illustrates another example block diagram of a memory fabric-based coherent directory cache implementation.

Detailed Description

For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It may be evident, however, that the disclosure can be practiced without limitation to these specific details. In other instances, methods and structures have not been described in detail so as not to unnecessarily obscure the present invention.

In this disclosure, the terms "a" and "an" are intended to mean at least one of the particular elements. As used herein, the term "including" means including but not limited to, and the term "comprising" means comprising but not limited to. The term "based on" means based at least in part on.

Disclosed herein are memory fabric-based coherent directory cache implementations, methods for operating memory fabric-based coherent directory caches, and non-transitory computer-readable media having stored thereon machine-readable instructions for providing memory fabric-based coherent directory caches. Apparatus, methods, and non-transitory computer-readable media disclosed herein provide for utilizing Ternary Content Addressable Memory (TCAM) to implement a coherent directory cache.

The coherent directory cache may include information associated with a plurality of memory blocks. The size of these memory blocks may be defined to be the same as the system cache lines of the computer system for ease of implementation. For clarity of discussion, these cache line sized memory blocks may be referred to as cache lines. The cache line information can identify a processor (or another device) that stores the cache line in a computer system (or a different computer system). The coherency directory and coherency directory cache may include coherency state and ownership information associated with each of the system memory cache lines. As the number of cache lines increases, the size of the coherency directory and the coherency directory cache may similarly increase. Increasing the size of a coherent directory cache for performance reasons may result in a corresponding increase in the die area associated with the coherent directory cache, and a similar increase in the power usage associated with the coherent directory cache. In this regard, it is technically challenging to implement a coherent directory cache with less use of die area associated with the coherent directory cache, and with less power usage associated with the coherent directory cache.

To address at least the above technical challenges, the apparatus, methods, and non-transitory computer-readable media disclosed herein reduce the die size affected by increased directory size and/or reduce the utilization of system power by utilizing a coherent directory cache that maintains coherent directory information for a subset of system cache lines. In addition, or in other examples, additional die area and power may be used to provide a larger coherent directory cache to thereby improve system performance. In this regard, the coherent directory cache may be implemented by utilizing a TCAM. The capabilities of a TCAM include the ability to specify "don't care" (or "wildcard") bits (e.g., "X" bits). Don't care bits may be used so that a single TCAM entry represents information related to multiple adjacent cache lines. In this regard, adjacent cache lines may be grouped according to the same ownership and state information.

For example, for a memory fabric-based coherency directory cache implementation, adjacent cache lines may be identified for a coherency directory cache that includes information related to a plurality of cache lines. The state and ownership associated with each of the adjacent cache lines may be determined. Based on determining that the state and ownership associated with one of the adjacent cache lines are the same as the state and ownership associated with the remaining active adjacent cache lines, respectively, the adjacent cache lines may be grouped. In addition, a single entry in the TCAM can be used for the coherency directory cache to identify information related to the grouped cache lines, as illustrated by the sketch below.
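
As a minimal illustration of this grouping (a software sketch only, with hypothetical names, not the hardware design itself), a ternary tag whose low bits are "X" can match every cache line address in an aligned group:

```python
# A TCAM-style tag with "X" (don't care) bits lets one coherency directory cache
# entry cover a group of adjacent cache lines.

def ternary_match(tag: str, address: str) -> bool:
    """Return True if every non-X tag bit equals the corresponding address bit."""
    return len(tag) == len(address) and all(
        t == "X" or t == a for t, a in zip(tag, address)
    )

# One entry whose two least significant tag bits are "don't care" covers the four
# adjacent cache lines 1100, 1101, 1110, and 1111 when they share state/ownership.
entry_tag = "11XX"
group = ["1100", "1101", "1110", "1111"]

assert all(ternary_match(entry_tag, line) for line in group)
assert not ternary_match(entry_tag, "1011")  # a non-adjacent line misses
```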

For the apparatus, methods, and non-transitory computer-readable media disclosed herein, the elements (e.g., components) of the apparatus, methods, and non-transitory computer-readable media disclosed herein may be any combination of hardware and programming to implement the functions of the elements. In some examples described herein, a combination of hardware and programming may be implemented in a number of different ways. For example, the programming of an element may be processor-executable instructions stored on a non-transitory machine-readable storage medium, and the hardware of the element may include processing resources to execute the instructions. In these examples, a computing device implementing these elements may include a machine-readable storage medium storing the instructions and a processing resource to execute the instructions, or the machine-readable storage medium may be separate from but accessible to the computing device and the processing resource. In some examples, some or all of the elements may be implemented in hardware circuitry.

FIG. 1 illustrates an example layout of a memory fabric-based coherent directory cache implementation apparatus (hereinafter also referred to as "apparatus 100").

Referring to FIG. 1, apparatus 100 may include a multiplexer 102 to receive requests such as processor snoop requests or node controller requests. A processor snoop request may be described as an operation initiated by a local processor to query the state and ownership of a memory block or cache line. A node controller request may be described as an operation initiated by a remote processor or remote node controller that is sent to the local node controller comprising the apparatus 100. Requests may be directed to the coherency directory tag 104 to determine whether state information exists for a particular memory block (i.e., cache line). The coherency directory tag 104 can include information related to a plurality of memory blocks. That is, the coherency directory tag 104 may comprise a set of upper addresses that correspond to the system memory blocks or cache lines whose state and ownership information is cached in the coherency directory cache. For example, the upper addresses may include upper address A, upper address B, …, upper address N, and the like. Each upper address may have a corresponding row number (e.g., row number 1, 2, …, N) associated with each entry. Depending on the location, each upper address may include from 0 to N don't care bits. As disclosed herein, the size of these memory blocks may be defined to be the same as the system cache line of the computer system (or a different computer system) for ease of implementation. For clarity of discussion, these cache line sized memory blocks may be referred to as cache lines.

Ownership may be described as identifying which node or processor holds ownership of a tracked system memory block or cache line. In the shared state, ownership may include the nodes or processors sharing a system memory block or cache line.

The request may be processed by the TCAM 106. For the TCAM 106, each cache entry may include a TCAM entry holding a high-order address for comparison against the request. This high-order address may be referred to as a tag. With respect to high-order addresses, the processor system may include a byte address or a word address that allows the data bits to be accessed to be defined. When multiple bytes or words are grouped together into larger blocks (such as cache lines), the higher-order address bits may be used to uniquely locate each block or cache line of system memory, and the lower-order address bits may be used to uniquely locate each byte or word within a block or cache line of system memory.
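
The address split described above can be illustrated with a short sketch (the cache line size and field widths are assumptions for the example, not values specified by the disclosure):

```python
# High-order bits uniquely locate a cache line in system memory and form the tag,
# while low-order bits locate a byte within the line.

CACHE_LINE_BYTES = 64          # assumed cache line size
OFFSET_BITS = 6                # log2(64)

def split_address(byte_address: int) -> tuple[int, int]:
    """Return (tag, byte_offset) for a byte address."""
    tag = byte_address >> OFFSET_BITS                # high-order bits: which cache line
    offset = byte_address & (CACHE_LINE_BYTES - 1)   # low-order bits: byte within the line
    return tag, offset

tag, offset = split_address(0x1234)
assert tag == 0x48 and offset == 0x34
```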

The tag may be described as a link descriptor for identifying the upper address. The directory tag may be described as a link descriptor used in the directory portion of the cache memory. Coherent directory tag 104 may include all tags of a coherent directory cache and may be described as a link descriptor for use in a directory portion of a coherent directory cache memory. The coherency directory tag 104 may include high order address bits that define the block of system memory being tracked.

The directory tag may represent a portion of a coherent directory cache address that uniquely identifies a directory entry. The directory tag may be used to detect whether a directory cache line exists within the coherent directory tag 104, and if so, the matching entry may identify the location of the cache information in the directory state store. A coherency directory cache entry may represent the coherency state and ownership of a single system cache line of memory.

At match encoder 108, the request processed by the TCAM 106 can be processed to determine a binary representation of the relevant row (e.g., address) of the coherency directory tag 104. For the TCAM 106, each row or entry of the TCAM 106 can include a match line that is activated when the entry matches the input search value. For example, if the TCAM 106 has 1024 entries, it will output 1024 match lines. The 1024 match lines may be encoded into a binary value that may be used, for example, to address a memory storing state and ownership information. For example, if match line 255 is active, the encoded output from the match encoder 108 is 0xFF.
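
The encoding step can be illustrated as follows (a software sketch under the assumption of a single active match line; the function name is hypothetical):

```python
# The TCAM raises one match line per matching entry, and the encoder converts the
# active match line into a binary row address used to index the state/ownership memory.

def encode_match_lines(match_lines):
    """Return the row address of the first active match line, or None on a miss."""
    for row, active in enumerate(match_lines):
        if active:
            return row
    return None

# Example: a 1024-entry TCAM where only match line 255 is active.
lines = [False] * 1024
lines[255] = True
row = encode_match_lines(lines)
assert row == 255 and hex(row) == "0xff"
```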

The state information 110 block may include a current representation of the state and ownership of the requested memory block (i.e., cache line) handled by the TCAM 106. For example, the state information 110 may include: a "valids" column containing a set of valid bits (e.g., 1111, 0000, 0011, 0010); a "state info" column including information such as shared, invalid, or exclusive; and a "sharing vector/ownership" column that includes sharing information for the shared state and ownership for the exclusive state. According to an example, a row of the state information 110 may correspond to a row of the coherency directory tag 104. Alternatively, a single row of the coherency directory tag 104 may correspond to multiple rows of the state information 110. With respect to the coherency directory tag 104 and the state information 110, assuming that upper address A covers four cache lines that are all valid, the four cache lines may include the same state information and sharing vector/ownership. The length of the valid-bits field may correspond to the number of decodes of the don't care bits. The coherency directory cache output information associated with the memory block state and ownership information may also include a directory cache hit indicator status (e.g., a coherency directory tag 104 hit) or a directory cache miss indicator status in response to a request received by the multiplexer 102. Ownership may include an indication of the node(s) of the computer system or of a different computer system that are sharing the memory block. In this regard, the actual stored information may depend on the implementation and the coherency protocol used. For example, if the protocol being used includes a shared state, the ownership information may include a list of nodes or processors that share the block. The state and ownership may be retrieved from the state information 110 memory store based on the associated match line from the TCAM 106 encoded into a memory address by the match encoder 108.

For example, the replacement policy may use least recently used (LRU) tracking circuitry 112. If the associated cache is full and a new entry is to be added, the LRU tracking circuitry 112 may evict the least recently used cache entry.

The tag data associated with an entry in the TCAM 106 can include a possible memory state of "0", "1", or "X", where the "X" memory state can represent a "0" or "1" and can be designated as a "don't care" memory state. The least significant bits of the cache line address in the TCAM 106 may define the address of the cache line within a group of cache lines. These least significant bits may be represented by the "X" memory state. Thus, one coherency directory cache entry may represent the state of several (e.g., 2, 4, 8, 16, etc.) system cache lines of memory. These memory blocks or system cache lines may be grouped by powers of 2 and by powers other than 2. For powers other than 2, the addresses may be compared against a range. For example, if the address is between A and C, then a memory block or system cache line may be grouped. Thus, each TCAM entry can represent any number of system cache lines of memory. A plurality of cache lines may be grouped based on determining that the plurality of cache lines are adjacent, and further based on determining that the plurality of cache lines include the same state and ownership, so that they may share the TCAM entry. In this regard, adjacent cache lines may include cache lines within the boundaries of a defined group. Thus, adjacent cache lines may include cache lines that are adjacent, proximate, or meet a group addressing specification.
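
The grouping rule can be sketched as follows (an illustrative software model, not the hardware implementation; the data model and names are hypothetical, and only the power-of-2 case is shown):

```python
# Adjacent cache lines may share a single directory cache entry only when their
# state and ownership match; n trailing "X" bits then cover 2**n lines.

from dataclasses import dataclass

@dataclass(frozen=True)
class LineInfo:
    address: int      # cache-line address (low bits select the line within the group)
    state: str        # e.g. "shared", "exclusive", "invalid"
    ownership: int    # sharing vector or owner identifier

def can_group(lines, dont_care_bits=2):
    """True if the lines fall in one aligned 2**n group and agree on state/ownership."""
    group_size = 1 << dont_care_bits
    base = lines[0].address & ~(group_size - 1)          # aligned group base address
    addresses = {line.address for line in lines}
    same_group = addresses <= {base + i for i in range(group_size)}
    same_info = len({(l.state, l.ownership) for l in lines}) == 1
    return same_group and same_info

lines = [LineInfo(0b1100 + i, "shared", 0b0011) for i in range(4)]
assert can_group(lines)                                              # one entry, tag "11XX"
assert not can_group(lines + [LineInfo(0b1000, "shared", 0b0011)])   # outside the group
```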

Processing state machine 114 may analyze the state and ownership information of the relevant cache lines based on requests, such as processor snoop requests and/or node controller requests, to identify cache lines that may be merged with respect to TCAM 106.

The background scrubbing state machine 116 may also analyze state and ownership information associated with neighboring cache lines to identify cache lines that may be merged with respect to the TCAM 106. Thus, with regard to cache line merging, the processing state machine 114 may perform the merge function when a new entry is added, and the background scrubbing state machine 116 may perform the merge function as a background operation when the coherency directory cache is not busy processing other requests. The state and ownership information may change over time with respect to the background operations performed by the background scrubbing state machine 116. When information about a given block is initially written and cannot be grouped because the state or ownership information does not match the information of other blocks in the combined set, that information for the given block may correspond to a separate coherency directory cache entry. Grouping may occur at this point if some of the state or ownership information changes at a later time. Thus, the background scrubbing state machine 116 may operate when processing of requests, such as processor snoop requests and/or node controller requests, is not in progress. In this regard, the background scrubbing state machine 116 can find a matching entry and rewrite the TCAM entry to perform the grouping of memory blocks represented by a single entry as disclosed herein.

The functions of the processing state machine 114 and the background scrubbing state machine 116 with respect to grouping adjacent cache lines containing the same state and ownership may be performed by the hardware sequencer 118 and the hardware sequencer 120, respectively, or by other circuitry contained in the processing state machine 114 and the background scrubbing state machine 116. Some of the functions performed by both hardware sequencer 118 and hardware sequencer 120 are described below.

According to an example, hardware sequencers 118 and 120 may include hardware to identify adjacent cache lines for coherency directory tag 104 that includes information related to a plurality of cache lines. In an example, the hardware sequencer 118 and the hardware sequencer 120 may be hardware state machines or may be part of a larger state machine. Optionally, the apparatus 100 may include a processor (e.g., the processor 604 of fig. 6) to implement some or all of the steps of the hardware sequencer 118 and the hardware sequencer 120 (the steps may be implemented as instructions by the processor).

For an implementation of apparatus 100 that includes hardware sequencer 118 and hardware sequencer 120, hardware sequencer 118 and hardware sequencer 120 may also include hardware for determining the state and ownership associated with each of the adjacent cache lines.

Hardware sequencer 118 and hardware sequencer 120 may also include hardware (or processor-implemented instructions) to group adjacent cache lines based on determining that the state and ownership associated with one of the adjacent cache lines are the same as the state and ownership associated with the remaining active adjacent cache lines, respectively. Grouping adjacent cache lines may include setting a "don't care" bit if needed to include the cache line to be added, and setting the corresponding valid bit of the validity field. In this regard, an equality comparison may be used to determine whether the two pieces of state and ownership information are the same. The remaining active cache lines may be described as the cache lines currently represented in the group in the coherent directory cache (e.g., the remaining active cache lines may have a valid bit set in the state information).

Hardware sequencer 118 and hardware sequencer 120 may also include hardware (or processor-implemented instructions) to utilize an entry in a memory structure for the coherency directory tag 104 to identify information (e.g., address bits) associated with the grouped cache lines. In this regard, the data associated with a "don't care" entry in the memory structure may include more than two possible memory states. According to an example, an entry may include an address that uniquely identifies the entry in the memory structure. For example, such an entry may include an address without any "don't care" bits. According to an example, an entry may include a single entry in the memory structure to identify information related to the grouped cache lines. For example, such an entry may include an address with one or more of the least significant bits set as "don't care" bits. According to an example, the number of grouped cache lines can be equal to four adjacent cache lines. For example, such an entry may include an address with the two least significant bits set as "don't care" bits.

According to an example, the memory structure may include the TCAM 106 shown in FIG. 1. For the TCAM 106, the hardware sequencer 118 and the hardware sequencer 120 may also include hardware (or processor-implemented instructions) to write a specified number of lower bits of the address as "X" bits. In this regard, the data associated with the entry in the TCAM 106 can include a possible memory state of "0", "1", or "X", where the "X" memory state (e.g., the "don't care" memory state) can represent a "0" or a "1". For example, when an entry is written into the coherent directory tag 104, the lower two bits of the upper address (tag) may be programmed as "don't care" in the TCAM. This example shows a configuration in which a single coherency cache entry covers a group of up to four system cache lines. The state information may include a 4-bit valid field. An implementation with a 4-bit valid field corresponds to allowing the two least significant upper address bits to be "don't care" bits. In this regard, for other implementations, the number of bits in the validity field may vary. For example, for an implementation with up to 3 "don't care" bits, the valid field would be 8 bits long, since there are 2^3 = 8 (or, generally, 2^n, where n represents the number of "don't care" bits) unique decodes for the three lower address bits. With respect to state information including a 4-bit valid field, each of these 4 bits may correspond to a decode of the lower two bits of the upper address, thereby allowing each bit to be associated with one of the four cache lines within the four-cache-line group. These 4 bits may be considered valid bits for each of the four system memory cache lines. Each TCAM entry can now represent state and ownership information for anywhere from 0 (not a valid entry) to four cache lines of system memory. In addition, hardware sequencer 118 and hardware sequencer 120 may also include hardware (or processor-implemented instructions) to designate coherency directory cache tracking as valid for each of the grouped cache lines based on the written low-order bits. Coherency directory cache tracking may be described as whether the state of the coherency directory cache monitor bit is active or inactive.
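
The relationship between the number of "don't care" bits and the valid field can be illustrated with a short sketch (field sizes are assumptions matching the four-line example above; all names are hypothetical):

```python
# With n "don't care" low tag bits, the state entry carries a 2**n-bit valid field,
# and the decode of a line's low address bits selects the valid bit for that line.

DONT_CARE_BITS = 2                      # lower two tag bits programmed as "X"
VALID_FIELD_SIZE = 1 << DONT_CARE_BITS  # 4 valid bits, one per line in the group

def set_line_valid(valid_field: int, line_address: int) -> int:
    """Set the valid bit selected by decoding the line's low address bits."""
    index = line_address & (VALID_FIELD_SIZE - 1)
    return valid_field | (1 << index)

def line_is_valid(valid_field: int, line_address: int) -> bool:
    index = line_address & (VALID_FIELD_SIZE - 1)
    return bool(valid_field & (1 << index))

valids = 0b0000
valids = set_line_valid(valids, 0b1101)   # line "..01" in the group
valids = set_line_valid(valids, 0b1111)   # line "..11" in the group
assert valids == 0b1010
assert line_is_valid(valids, 0b1101) and not line_is_valid(valids, 0b1100)
```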

Hardware sequencer 118 and hardware sequencer 120 may also include hardware (or processor-implemented instructions) to utilize the entry to designate zero cache lines (i.e., not a valid entry associated with a cache line) or a prescribed number of adjacent cache lines, where the prescribed number is greater than one.

A search of the TCAM 106 may be performed to determine whether a new entry is to be added. The search of the TCAM 106 may be performed by using the upper address bits of the cache line corresponding to the received request. If there is a TCAM miss, the tag may be written to an unused entry. In this regard, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to designate an entry as a new entry and determine whether the coherent directory cache memory structure includes a previous entry corresponding to the same group as the new entry. In this regard, the new entry may be added to an unused entry location of the coherent directory cache memory structure based on a determination that the coherent directory cache memory structure does not include a previous entry corresponding to the same group as the new entry.

When a new entry is to be added, a search of the TCAM 106 can be performed. If all cache entries are used, the least recently used entry will be evicted and a new tag will be written to the TCAM entry. In this regard, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to designate an entry as a new entry and determine whether the memory structure includes a previous entry corresponding to the same group as the new entry. Based on determining that the memory structure does not include a previous entry corresponding to the new entry, the hardware sequencer 118 may further include hardware (or processor-implemented instructions) to determine whether all entry locations in the memory structure are used. Based on determining that all entry locations in the memory structure are used, hardware sequencer 118 may further include hardware (or processor-implemented instructions) to evict the least recently used entry of the memory structure. In addition, new entries may be added to the entry locations corresponding to the evicted least recently used entries of the memory structure.
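
The add-with-eviction flow can be sketched as follows (an illustrative software model in which an ordered dictionary stands in for the TCAM and the LRU tracking circuitry; all names are hypothetical):

```python
from collections import OrderedDict

class DirectoryCacheModel:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()            # tag -> state/ownership info

    def add(self, tag: str, info: dict) -> None:
        if tag in self.entries:                 # previous entry for the same group
            self.entries.move_to_end(tag)       # mark as most recently used
            self.entries[tag].update(info)
            return
        if len(self.entries) >= self.capacity:  # all entry locations are used:
            self.entries.popitem(last=False)    # evict the least recently used entry
        self.entries[tag] = info                # write the new entry in its place

cache = DirectoryCacheModel(capacity=2)
cache.add("10XX", {"state": "shared", "valids": 0b1111})
cache.add("01XX", {"state": "exclusive", "valids": 0b0001})
cache.add("11XX", {"state": "shared", "valids": 0b0011})   # evicts "10XX"
assert list(cache.entries) == ["01XX", "11XX"]
```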

If there is a match between the new upper address bits and the tag entry within the TCAM during the TCAM search, the 4-bit field discussed above can be checked. If the corresponding bit in the 4-bit field selected by the decoding of the lower two bits of the upper address is set, a cache hit may be indicated and processing may continue. In this regard, if a cache hit is not determined, the hardware sequencer 118 may further include hardware (or processor-implemented instructions) to designate the entry as a new entry and determine whether the memory structure includes a previous entry corresponding to the new entry. Based on determining that the memory structure includes a previous entry corresponding to the new entry, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to determine whether to set the designated bit corresponding to the new entry for the previous entry. Additionally, based on determining that the designated bit is set, the hardware sequencer 118 may further include hardware (or processor-implemented instructions) to designate the new entry as a cache hit.

If the corresponding bit in the 4-bit field is not set, the state and ownership information may be compared. If the state and ownership information is the same for the new system memory cache line and the cached values of the state and ownership information, then the corresponding bit in the 4-bit field may be set to add the new system memory cache line to the coherency directory tag 104. The state and ownership fields may apply to all cache lines that match the address field and have a corresponding valid bit in the 4-bit validity field. Thus, if the state and ownership of the cache line being evaluated match the state and ownership fields, the corresponding bit of the validity field may be set. With respect to the state and ownership information, based on determining that the specified bit is not set, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to determine whether the state and ownership associated with the new entry are the same as the state and ownership associated with the previous entry, respectively. Further, based on determining that the state and ownership associated with the new entry are the same as the state and ownership associated with the previous entry, respectively, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to set the designated bit to add the new entry to the coherent directory cache. In this regard, setting the designated bit may refer to setting a valid bit associated with a particular system memory block or cache line.

The state and ownership information may be compared if the corresponding bit in the 4-bit field is not set. If the state and ownership information read from the state information 110 is different from the state and ownership information associated with the new tag, the new tag may be added to the TCAM 106. In this regard, based on determining that the state and ownership associated with the new entry differ from the state and ownership associated with the previous entry, respectively, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to add the new entry to the coherency directory tag 104 as a different entry than the previous entry.
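
Taken together, the hit check, the group join, and the separate-entry case can be sketched as follows (an illustrative software model; the outcome labels and data layout are hypothetical, and eviction is not shown):

```python
# On a tag match, the decoded valid bit decides hit vs. miss; on a miss with
# matching state/ownership, the line joins the existing group; otherwise it is
# cached separately.

def lookup_or_insert(entries: dict, tag: str, line_index: int, state: str) -> str:
    """entries maps a ternary tag to {'valids': int, 'state': str}."""
    entry = entries.get(tag)
    if entry is None:
        # No previous entry for this group: allocate a new one.
        entries[tag] = {"valids": 1 << line_index, "state": state}
        return "miss: new entry added"
    if entry["valids"] & (1 << line_index):
        return "cache hit"                       # designated (valid) bit already set
    if entry["state"] == state:
        entry["valids"] |= 1 << line_index       # same state/ownership: join the group
        return "miss: added to existing group"
    # Different state/ownership: track the line as a separate entry.
    entries[tag + f"/line{line_index}"] = {"valids": 1 << line_index, "state": state}
    return "miss: added as separate entry"

entries = {"11XX": {"valids": 0b0011, "state": "shared"}}
assert lookup_or_insert(entries, "11XX", 1, "shared") == "cache hit"
assert lookup_or_insert(entries, "11XX", 2, "shared") == "miss: added to existing group"
assert lookup_or_insert(entries, "11XX", 3, "exclusive") == "miss: added as separate entry"
```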

The hardware sequencer 118 may also include hardware (or processor-implemented instructions) to determine whether a state or ownership associated with one of the adjacent cache lines has changed. Based on determining that the state or ownership associated with one of the adjacent cache lines has changed, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to designate the one of the adjacent cache lines for which the state or ownership has changed as the new entry. The hardware sequencer 118 may also include hardware (or processor-implemented instructions) to determine whether the TCAM 106 includes another entry corresponding to the new entry, for example, by searching the TCAM 106 for a matching entry. Based on determining that the TCAM 106 does not include another entry corresponding to the new entry, the hardware sequencer 118 may also include hardware (or processor-implemented instructions) to add the new entry to an unused entry location of the TCAM 106.

It may also be necessary to update the current TCAM entry just matched to clear the "don't care" programming of one or more of the lower tag bits. This update may be required so that the entry will not match the next time the current tag is used to search the TCAM 106.

Based on determining that the TCAM 106 does not include other entries corresponding to the new entry, the hardware sequencer 118 may further include hardware (or processor-implemented instructions) to determine whether all entry locations in the TCAM 106 are used. Based on determining that all entry locations in the TCAM 106 are used, the hardware sequencer 118 may further include hardware (or processor-implemented instructions) to evict the least recently used entry of the TCAM 106. The hardware sequencer 118 may also include hardware (or processor-implemented instructions) to add the new entry to the entry location corresponding to the evicted least recently used entry of the TCAM 106.

Based on determining that the state or ownership associated with one of the adjacent cache lines has changed, the hardware sequencer 118 may further include hardware (or processor-implemented instructions) to clear programming associated with the one of the adjacent cache lines whose state or ownership has changed to remove the one of the adjacent cache lines whose state or ownership has changed from the grouped cache line.

According to an example, assuming that the coherency directory tag 104 includes an entry of 10X, a validity field 0011, and a state/ownership SO, and a snoop request is received for a cache line address 110 having the state/ownership SO, the entry of 10X may be updated to address 1XX, the validity field may be set to 0111, and the SO may be returned in response to the snoop.
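
The example above can be recomputed with a short sketch (illustrative only; the widening helper is hypothetical):

```python
# Entry "10X" with validity 0011 covers lines 100 and 101; a snoop for line 110
# with the same state/ownership widens the tag to "1XX" and sets validity 0111.

def widen_and_set(tag: str, valids: int, new_line: str) -> tuple[str, int]:
    """Widen trailing tag bits to 'X' until the new line matches, then set its valid bit."""
    while any(t != "X" and t != a for t, a in zip(tag, new_line)):
        i = len(tag.rstrip("X")) - 1          # rightmost bit that is not yet "X"
        tag = tag[:i] + "X" + tag[i + 1:]
    index = int(new_line[-2:], 2)             # decode of the two low address bits
    return tag, valids | (1 << index)

tag, valids = widen_and_set("10X", 0b0011, "110")
assert tag == "1XX" and valids == 0b0111
```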

Part of the information in the processor snoop request and node controller request may be used to determine how to drive the selection on multiplexer 102. If there is a processor snoop request and no node controller request, then the processing state machine 114 may drive the select line to the multiplexer 102 to select the processor snoop request.

In the example implementation of FIG. 1, the processing state machine 114 may control the multiplexer 102. The processing state machine 114 may receive a portion of the request information associated with the different requests in order to drive the selection.

With respect to the information sent from the match encoder 108 to the processing state machine 114 and the LRU tracking circuit 112, the processing state machine 114 and the LRU tracking circuit 112 can receive both the match/no-match indicator and the TCAM row address of the matching entry from the match encoder 108.

The directory state output shown in FIG. 1 may contain the state and ownership information of the match request. The directory state output may be sent to a node controller or other circuitry within a processor Application Specific Integrated Circuit (ASIC) in which the apparatus 100 is located. Other circuitry may include circuitry to send the initial request to a coherent directory cache.

The cache hit/miss status output shown in FIG. 1 may represent an indication of whether the request matches an entry within the coherent directory cache. The cache hit/miss status output may be sent to a node controller or other circuitry within the processor ASIC in which the apparatus 100 resides. Other circuitry may include circuitry to send the initial request to a coherent directory cache.

FIG. 2 illustrates a processing flow to illustrate the operation of the apparatus 100. The processing flow may be performed by the processing state machine 114. Various operations of the processing state machine 114 may be performed by the hardware sequencer 118.

Referring to FIG. 2, at block 200, the processing flow may begin with respect to the operation of the processing state machine 114.

At block 202, the processing state machine 114 may determine whether a request (e.g., a processor snoop request, a node controller request, etc.) has been received.

Based on the determination at block 202 that a request (e.g., a processor snoop request, a node controller request, etc.) has been received, at block 204, the processing state machine 114 may trigger the TCAM 106 to search for the coherent directory tag 104. In this regard, the address associated with the cache line contained in the received request may be used to search for a matching tag value. As disclosed herein, for a coherent directory tag 104 implemented by a TCAM 106, each cache entry may include a TCAM entry to hold the upper address to be compared. This high order address may be referred to as a tag. The directory tag may represent a portion of a directory address that uniquely identifies the directory tag. The tag may be used to detect whether a directory cache line exists within the device 100, and if so, the matching entry may identify the location of the cache information in the directory state information 110 store.

At block 206, the processing state machine 114 may determine whether a match for the request is detected in the TCAM 106. According to an example, assuming a request for address 1110 is received, a match may be determined as follows with respect to TCAM entries for address 1111, address 111X, and address 11XX (e.g., entries having up to two least significant "don't care" bits). The 0 bit of the received address does not match the corresponding 1 bit of the TCAM address 1111 and therefore results in a miss. Conversely, the 0 bit of the received address is not compared against the corresponding X bit of TCAM addresses 111X and 11XX, resulting in a match.

Based on the determination at block 206 that a match is detected, at block 208, the processing state machine 114 may obtain a TCAM row address associated with the match at block 206.

At block 210, it may be determined whether the request at block 202 is a state change request. Based on determining at block 210 that the request at block 202 is a state change request, the processing state machine 114 may proceed to block 212. At block 212, the processing state machine 114 may check the stored state information to determine whether multiple valid bits are set.

Based on the determination at block 212 that the plurality of valid bits are not set, at block 214, the status information may be updated.

Based on the determination at block 212 that the plurality of valid bits are set, the processing state machine 114 may calculate and update new don't care bits for the current TCAM entry at block 216. For example, for a single TCAM entry representing four memory blocks, the most significant don't care bit can be changed from "don't care" to match on a one (or a zero).

At block 218, the processing state machine 114 may update the state information and adjust the valid bits. For example, for a match on one as described above, associated state information valid bits of 1111 may become 1100.

At block 220, the processing state machine 114 may add a new TCAM entry associated with the state change request. In this regard, the processing state machine 114 may write an entry into the TCAM and write the relevant state information that matches the address associated with the state change request.

Based on determining at block 210 that the request at block 202 is not a state change request, the processing state machine 114 may proceed to block 222. At block 222, the processing state machine 114 may update the least recently used tracking circuit 112 for a match to move the TCAM row address to the top of the TCAM row address list to indicate that the TCAM row address is used as the most recently used TCAM row address.

At block 224, the processing state machine 114 may obtain state information about the match from the state information 110. State information 110 may be described as a memory or storage element that may be written to and read from. In the example implementation of fig. 1, state information 110 may be stored in Static Random Access Memory (SRAM) or another type of memory.

At block 226, the processing state machine 114 may decode the memory block valid bits. The system memory block valid bits or cache line valid bits may be located within the state information 110 store. In this regard, if the TCAM row address represents an entry representing more than one cache line, the processing state machine 114 can decode the associated block valid bits to identify the valid bit associated with the requested system memory block. According to an example, if a TCAM row address of seven represents an entry representing more than one cache line, the processing state machine 114 may decode the associated block valid bits of binary 1101 to identify the valid bit of 1 associated with the requested block of system memory.

At block 228, the processing state machine 114 may determine whether the current block is valid. For example, the processing state machine 114 may determine whether the associated block valid bit is active or inactive (i.e., here, active/inactive may be used to describe the state of the valid bit without defining whether a 1 or "0" state represents valid or invalid). In this regard, an implementation may define whether 1 is valid or invalid. However, other implementations may define the opposite mapping.

Based on the determination at block 228 that the current block is valid, the processing state machine 114 may output a cache hit/miss status at block 230. The cache hit/miss status may be output to the node controller/processor requestor and other portions of the ASIC that may include the requestor.

At block 232, the processing state machine 114 may output directory state information in response to the request received at block 202.

Based on determining that the current block is invalid at block 228, the processing state machine 114 may determine whether the current state of the current request being processed is equal to the stored state at block 234. The current state may be determined from a lookup into a coherence directory. The stored state may be described in information stored in state information 110. The stored state may include state and ownership information for the cache line(s) maintained in the coherent directory cache. In this regard, the processing state machine 114 may determine whether the state between blocks associated with the request received at block 202 is the same as the stored state. The stored state information may represent information associated with the current coherency cache entry. Such a configuration may utilize additional information from the full coherency directory (e.g., by reading the current state).

Based on determining that the current state is equal to the stored state at block 234, the processing state machine 114 may update the block valid bit associated with the new memory block at block 236. In this regard, the valid bit of the new block may be set.

Based on the determination at block 234 that the current state is not the same as the stored state, the processing state machine 114 may update the matching TCAM entry to remove "don't care" bits at block 238. In this regard, because the TCAM entry can no longer be shared and a separate TCAM entry is now needed, the "don't care" programming can be removed. That is, the don't care bits may be changed or removed within the TCAM entry so that the entry now produces a more exact match with any new incoming request. If the state or ownership of one of the four system cache lines, as described above, requires updating and the other cache lines sharing the TCAM entry are not updated, a new tag may be added to the TCAM 106 as described above. The current TCAM entry (the entry just matched) may also need to be updated to clear the "don't care" programming of one or more of the low tag bits. This update may be needed so that the next time the current tag is used to search the TCAM 106, this entry will not match, because the state and ownership information are no longer the same and the cache lines may no longer share the TCAM entry. According to an example, assume that the TCAM includes entry 00XX, with lines 0000, 0001, and 0010 valid and line 0011 invalid, and a request for 0011 is received where 0011 has a state/ownership different from the rest (e.g., from 0000, 0001, and 0010). The TCAM entry may then become 000X and a new entry for 0011 may be added at blocks 238 and 240. With regard to 0010, which no longer falls under the narrowed 000X entry, two new entries may be added overall (e.g., one for 0010 and one for 0011).
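
The split in the example above can be sketched as follows (an illustrative software model; tags are shown as bit strings and the helper names are hypothetical):

```python
# Entry "00XX" covers valid lines 0000, 0001, and 0010; line 0011 arrives with a
# different state, so the don't care programming is narrowed to "000X" (keeping
# 0000 and 0001) and separate exact entries are added for 0010 and 0011.

def matches(tag: str, line: str) -> bool:
    return all(t in ("X", b) for t, b in zip(tag, line))

def split_on_state_change(entries: dict, old_tag: str, new_tag: str,
                          new_line: str, new_state: str) -> None:
    old = entries.pop(old_tag)
    entries[new_tag] = {"state": old["state"],
                        "lines": [l for l in old["lines"] if matches(new_tag, l)]}
    # Lines displaced by the narrowed tag get their own exact entries.
    for line in old["lines"]:
        if not matches(new_tag, line):
            entries[line] = {"state": old["state"], "lines": [line]}
    entries[new_line] = {"state": new_state, "lines": [new_line]}

entries = {"00XX": {"state": "shared", "lines": ["0000", "0001", "0010"]}}
split_on_state_change(entries, "00XX", "000X", "0011", "exclusive")
assert entries == {
    "000X": {"state": "shared", "lines": ["0000", "0001"]},
    "0010": {"state": "shared", "lines": ["0010"]},
    "0011": {"state": "exclusive", "lines": ["0011"]},
}
```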

At block 240, the processing state machine 114 may determine a TCAM tag for the new TCAM entry and update the state information accordingly. With respect to block 240, "don't care" bits may not be used because the state information associated with the new request does not match the state or ownership information stored in the coherent directory cache. That is, the TCAM entry may need to be more exact and may not represent a set of system memory blocks or cache lines.

Based on the determination at block 206 that a match is not detected, at block 242, the processing state machine 114 may determine that a TCAM tag having "don't care" is associated with the memory block group represented by the address of the request block. For block 242, the path from block 206 to block 242 allows the TCAM entry to represent a set of system memory blocks or cache lines, since this is the first request in a set of system memory blocks or cache lines, and, since it is the first in the cache, there is no need to compare with any stored state or ownership information that may be stored in state information 110.

At block 244, the processing state machine 114 may select a TCAM entry by using the least recently used tracking circuit 112. That is, processing state machine 114 may select a line/location for the new TCAM entry and select the TCAM entry for eviction. For the example implementation of FIG. 1, the unused entries may also represent the least recently used entries.

At block 246, the processing state machine 114 may determine whether the selected TCAM entry from block 244 is active. The TCAM may include a never match state to identify the entry as invalid. A TCAM entry may change from active to inactive if the TCAM entry may not be used, the background scrubbing operation disclosed herein with respect to fig. 3 has combined multiple TCAM entries into a single entry, or the TCAM entry is evicted.

Based on the determination at block 246 that the selected TCAM entry from block 244 is active, at block 248, the processing state machine 114 may write state information to the coherency directory in which the cache is operating. Further, at block 250, the processing state machine 114 may update the state information.

Based on the determination at block 246 that the selected TCAM entry from block 244 is inactive, at block 250, the processing state machine 114 may update the state information entry related to the TCAM entry, for example, by writing the new TCAM entry to the location of the previous TCAM entry.

At block 252, the processing state machine 114 may update the TCAM 106 with the tag determined at block 242.

At block 254, the processing state machine 114 may output the cache miss status to the original requesting circuitry or to a node controller or other portion of the processor containing the coherency directory cache.

With respect to FIG. 2, when a received cache line request modifies a current don't care bit, the new TCAM entry may be written to cover a new group of system memory blocks, but only the memory block to which the cache line request belongs may be marked with a valid bit.

FIG. 3 shows a scrubber flow to illustrate the operation of the apparatus 100. The scrubber flow may be performed by the background scrubbing state machine 116. Various operations of the background scrubbing state machine 116 may be performed by the hardware sequencer 120. The operations performed by the background scrubbing state machine 116 could be performed when the state information of an entry is updated, but these operations utilize additional TCAM search and write operations, and the processing state machine 114 may be busy processing the next request and unable to perform them. Thus, the background scrubbing state machine 116 may execute without interfering with the operation of the processing state machine 114.

Referring to FIG. 3, at block 300, the scrubber flow may begin with respect to the operation of the background scrubbing state machine 116.

At block 302, the background scrubbing state machine 116 may set the count value to zero. The count value may be set to zero to effectively analyze all contents of the TCAM 106.

At block 304, the background scrubbing state machine 116 may determine whether a request (e.g., a processor snoop request, a node controller request, etc.) has been received.

Based on the determination at block 304 that a request (e.g., a processor snoop request, a node controller request, etc.) has been received, processing may remain at block 304 until the processing state machine 114 processes the request.

Based on the determination at block 304 that no request (e.g., processor snoop request, node controller request, etc.) has been received, at block 306, the background scrubbing state machine 116 may read the TCAM entry selected by the count set at block 302. This count may be used as the row number of the TCAM entry being analyzed, where the row number may represent the address of the TCAM entry.

At block 308, the scrubbing state machine 116 may read the current state information of the TCAM entry read at block 306.

At block 310, the scrubbing state machine 116 may determine whether the associated entry (e.g., from block 306) is fully expanded (i.e., all possible memory blocks are already represented by the single entry) or unused. When reading a TCAM entry, the low-order bits of the tag may be checked. If the state of the low-order tag bits matches the values associated with all possible "don't care" positions, the associated entry is fully expanded. The status information 110 may also be read to check the valid bits.
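The block 310 check may be sketched as follows, assuming a grouped entry can cover four adjacent cache lines (two low "don't care" tag bits) and that the status information 110 carries a per-line valid vector; the entry layout, field names, and skip_scrub() helper are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

#define GROUP_BITS 2u   /* assumed group size: four adjacent cache lines */

typedef struct {
    uint64_t tag;
    uint64_t x_mask;     /* bit set = "don't care" tag position            */
    bool     valid;
    uint8_t  valid_vec;  /* per-line valid bits from status information 110 */
} entry_t;

/* Block 310: an entry needs no further scrubbing if it is unused, or if
 * its low-order tag bits are already all "don't care" and every block it
 * can represent is tracked as valid (fully expanded). */
bool skip_scrub(const entry_t *e)
{
    uint64_t group_mask = (1u << GROUP_BITS) - 1u;                    /* 0x3 */
    uint8_t  full_vec   = (uint8_t)((1u << (1u << GROUP_BITS)) - 1u); /* 0xF */

    bool unused         = !e->valid;
    bool fully_expanded = ((e->x_mask & group_mask) == group_mask) &&
                          (e->valid_vec == full_vec);

    return unused || fully_expanded;
}
```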

Based on determining at block 310 that the associated entry has been used and is not fully expanded, at block 312, the scrubbing state machine 116 may search the TCAM for adjacent memory blocks. In the disclosed example, the TCAM 106 may include a global "don't care" bitmask that allows bits to be excluded from a search operation. The lower bits of the search key may be set to "don't care" and a TCAM search may be performed. In this regard, the hardware sequencer 120 may include hardware (or processor-implemented instructions) to identify adjacent cache lines for the coherency directory tag 104 that includes information related to a plurality of cache lines. In this example, the global "don't care" bitmask may be applied to the lower address bits of the coherency directory tag 104.
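The masked search of block 312 may be modeled in software as shown below. The parallel TCAM lookup is approximated by a linear scan; the entry layout and the search_adjacent() name are illustrative assumptions, and the global "don't care" bitmask is applied to the low address bits as described above.

```c
#include <stdint.h>
#include <stdbool.h>

#define TCAM_ENTRIES 1024
#define GROUP_BITS   2u

typedef struct {
    uint64_t tag;
    uint64_t x_mask;  /* bit set = "don't care" tag position */
    bool     valid;
} tcam_entry_t;

/* Software model of the block 312 search: the low address bits are made
 * globally "don't care", so any entry tracking an adjacent cache line of
 * the same block group matches. The entry at row `count` is excluded so
 * that the entry being scrubbed cannot match itself. */
int search_adjacent(const tcam_entry_t *tcam, unsigned count, uint64_t key)
{
    uint64_t global_x = (1u << GROUP_BITS) - 1u;   /* low bits excluded */

    for (unsigned row = 0; row < TCAM_ENTRIES; row++) {
        if (row == count || !tcam[row].valid)
            continue;

        uint64_t ignore = global_x | tcam[row].x_mask;
        if ((tcam[row].tag & ~ignore) == (key & ~ignore))
            return (int)row;                        /* block 314: match */
    }
    return -1;                                      /* no adjacent entry */
}
```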

At block 314, the scrubbing state machine 116 may determine whether a TCAM match is detected. The scrubbing state machine 116 may further determine the state and ownership associated with each of the detected adjacent cache lines.

Based on the determination at block 314 that a match is detected, at block 316, the scrubbing state machine 116 may obtain the new state information associated with the newly matched entry. In this regard, the entry addressed by the count value may be excluded from the search and from consideration to prevent an erroneous self-match. In addition, TCAM entries having row addresses greater than the count value may be searched and considered.

At block 318, the scrubbing state machine 116 may determine whether the new state information is the same as the current state information associated with the TCAM entry read based on the count value.

Based on determining at block 318 that the new state information is the same as the current state information, at block 320, the scrubbing state machine 116 may update the read state information.

At block 322, the scrubbing state machine 116 may update the TCAM entry read based on the count value to include a "don't care" bit. The TCAM entry may be rewritten such that some of the low tag bits are set to the "don't care" value. This is to allow the TCAM entry to represent multiple blocks or cachelines of system memory.

At block 324, the scrubbing state machine 116 may invalidate the matching TCAM entry obtained by searching the TCAM.
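The combining performed in blocks 320 through 324 may be sketched as follows, again assuming a software model of a grouped entry with two low "don't care" tag bits and a per-line valid vector; the merge_entries() helper is illustrative only.

```c
#include <stdint.h>
#include <stdbool.h>

#define GROUP_BITS 2u

typedef struct {
    uint64_t tag;
    uint64_t x_mask;     /* bit set = "don't care" tag position */
    bool     valid;
    uint8_t  valid_vec;  /* per-line valid bits                  */
} entry_t;

/* Blocks 320-324: when the matched entry carries the same state, fold it
 * into the entry selected by the count and free the matched row. */
void merge_entries(entry_t *scanned, entry_t *matched)
{
    uint64_t group_mask = (1u << GROUP_BITS) - 1u;

    /* Block 322: rewrite the scanned entry with "don't care" low tag bits
     * so that a single entry represents multiple adjacent cache lines. */
    scanned->x_mask |= group_mask;
    scanned->tag    &= ~group_mask;

    /* Block 320: track every line the matched entry covered as valid. */
    scanned->valid_vec |= matched->valid_vec;

    /* Block 324: the matched entry is no longer needed; set it to the
     * never-match (invalid) state. */
    matched->valid = false;
}
```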

At block 326, the scrubbing state machine 116 may update the least recently used tracking circuit 112.

At block 328, the scrubbing state machine 116 may increment the count by 1.

At block 330, the scrubbing state machine 116 may determine whether the count is greater than the count associated with the largest TCAM entry.

Based on the determination at block 330 that the count is not greater than the maximum TCAM entry, further processing may return to block 304.

Based on the determination at block 330 that the count is greater than the maximum TCAM entry, at block 332, the scrubbing state machine 116 may implement a time delay before restarting. The time delay may be omitted. However, there is little benefit in re-scrubbing the coherency directory cache entries until the entries have been updated, and the time delay allows a time window during which such updates may occur. In this regard, a scrubbing-type operation could be performed after each entry update. However, for performance reasons, scrubbing-type operations may be performed in the background so that requests are processed at a higher priority than scrubbing operations.
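Putting the pieces together, the overall scrubber flow of FIG. 3 may be approximated by the loop below. The helper functions are hypothetical hooks standing in for the corresponding blocks of the flow; their names and signatures are assumptions and not part of the disclosure.

```c
#include <stdbool.h>

#define TCAM_ENTRIES 1024

/* Hypothetical hooks standing in for the hardware described in FIG. 3. */
extern bool request_pending(void);                    /* block 304        */
extern bool entry_needs_scrub(unsigned row);          /* blocks 306-310   */
extern int  find_adjacent(unsigned row);              /* blocks 312-314   */
extern bool states_match(unsigned a, unsigned b);     /* blocks 316-318   */
extern void merge_rows(unsigned keep, unsigned drop); /* blocks 320-324   */
extern void update_lru(unsigned row);                 /* block 326        */
extern void delay_before_restart(void);               /* block 332        */

void background_scrub(void)
{
    for (;;) {
        for (unsigned count = 0; count < TCAM_ENTRIES; ) { /* blocks 302, 328-330 */
            if (request_pending())    /* block 304: yield so the processing */
                continue;             /* state machine can handle requests  */

            if (entry_needs_scrub(count)) {
                int match = find_adjacent(count);
                if (match >= 0 && states_match(count, (unsigned)match)) {
                    merge_rows(count, (unsigned)match);
                    update_lru(count);                 /* block 326 */
                }
            }
            count++;                                   /* block 328 */
        }
        delay_before_restart();                        /* block 332 */
    }
}
```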

FIGS. 4-6 illustrate an example block diagram 400, an example flowchart of a method 500, and another example block diagram 600, respectively, for a memory fabric-based coherency directory cache implementation. The block diagram 400, the method 500, and the block diagram 600 may be implemented on the apparatus 100 described above with reference to FIG. 1 by way of example and not of limitation. The block diagram 400, the method 500, and the block diagram 600 may be implemented in other apparatus. In addition to showing the block diagram 400, FIG. 4 shows hardware of the apparatus 100 that may perform the steps of the block diagram 400. The hardware may include a hardware sequencer 118 (and a hardware sequencer 120) that includes hardware to perform the steps of the block diagram 400. Alternatively, the hardware may include a processor (not shown) and a memory (not shown), such as a non-transitory computer-readable medium storing machine-readable instructions that, when executed by the processor, cause the processor to perform the steps of the block diagram 400. The memory may represent a non-transitory computer-readable medium. FIG. 5 may represent a method for a memory fabric-based coherency directory cache implementation, and the steps of the method. FIG. 6 may represent a non-transitory computer-readable medium 602 having stored thereon machine-readable instructions that provide a memory fabric-based coherency directory cache implementation. The machine-readable instructions, when executed, cause the processor 604 to perform the steps of the block diagram 600, which is also shown in FIG. 6.

The processor (not shown) of FIG. 4 and/or the processor 604 of FIG. 6 may include a single or multiple processors or other hardware processing circuitry to perform the methods, functions, and other processes described herein. These methods, functions, and other processes may be embodied as machine-readable instructions stored on a computer-readable medium, which may be non-transitory (e.g., the non-transitory computer-readable medium 602 of FIG. 6), such as a hardware storage device (e.g., RAM (random access memory), ROM (read only memory), EPROM (erasable programmable ROM), EEPROM (electrically erasable programmable ROM), hard disk, and flash memory). The memory (not shown) of FIG. 4 may include RAM, where machine-readable instructions and data for the processor may reside at runtime.

Referring to FIGS. 1-4, and in particular to the block diagram 400 shown in FIG. 4, the hardware sequencer 118 (and the hardware sequencer 120) may include hardware that identifies (e.g., at 402) adjacent cache lines for the coherency directory tag 104 containing information related to a plurality of cache lines.

Hardware sequencer 118 (and hardware sequencer 120) may include hardware that determines (e.g., at 404) a state associated with each of the adjacent cache lines.

Based on determining that the state associated with one of the adjacent cache lines is the same as the state associated with the remaining active adjacent cache lines, hardware sequencer 118 (and hardware sequencer 120) may include hardware that groups the adjacent cache lines (e.g., at 406).

The hardware sequencer 118 (and the hardware sequencer 120) may include hardware that utilizes (e.g., at 408) an entry in a memory structure for the coherency directory cache to identify information associated with the grouped cache lines. In this regard, the data associated with the entry in the memory structure may include more than two possible memory states.
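On the lookup side, the effect of the more-than-two-state entry data may be sketched as follows: because the grouped low tag bits are recorded as "X", a request for any of the grouped adjacent cache lines matches the same single entry. The structure layout and the entry_matches() helper are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* One grouped entry: each tag bit can be 0, 1, or X ("don't care").
 * The X positions are recorded in x_mask, giving each bit more than
 * two possible states. Field names are illustrative. */
typedef struct {
    uint64_t tag;
    uint64_t x_mask;   /* bit set = X: matches either 0 or 1 */
    bool     valid;
} grouped_entry_t;

/* A request for any of the adjacent cache lines covered by the X bits
 * hits the same single entry (blocks 402-408). */
bool entry_matches(const grouped_entry_t *e, uint64_t req_tag)
{
    return e->valid && ((e->tag & ~e->x_mask) == (req_tag & ~e->x_mask));
}
```

For example, with x_mask equal to 0x3, the four tags that differ only in their two low bits all match the same entry.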

Referring to FIGS. 1-3 and 5, and in particular to FIG. 5, for the method 500, at block 502, the method may include identifying adjacent cache lines for a coherency directory tag 104 that includes information related to a plurality of cache lines.

At block 504, the method may include determining a state associated with each of the adjacent cache lines.

Based on determining that the state associated with one of the adjacent cache lines is the same as the state associated with the remaining active adjacent cache lines, at block 506, the method may include grouping the adjacent cache lines.

At block 508, the method may include utilizing a single entry in the TCAM 108 for the coherency directory tag 104 to identify information associated with the grouped cache lines.

Referring to FIGS. 1-3 and 6, and in particular to FIG. 6, for the block diagram 600, the non-transitory computer-readable medium 602 may include instructions 606 to identify a group of a specified number of adjacent cache lines upon receiving a request (e.g., as disclosed herein with respect to FIGS. 1 and 2), or upon completing a previously received request (e.g., as disclosed herein with respect to FIGS. 1 and 3), related to a coherency directory tag 104 that includes information related to a plurality of cache lines.

Processor 604 may fetch, decode, and execute instructions 608 to determine the state and ownership associated with each of the adjacent cache lines.

Based on determining that the state and ownership associated with one of the adjacent cache lines is the same as the state and ownership associated with the remaining active adjacent cache lines, respectively, processor 604 may fetch, decode, and execute instructions 610 to utilize entries in the memory structure for coherency directory tag 104 to identify information related to a set of a specified number of adjacent cache lines. The data associated with an entry in the memory structure may include more than two possible memory states.

Described and illustrated herein are examples and some of their variations. The terms, descriptions, and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims and their equivalents, in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
