Apparatus and method for processing write operation

Document No.: 1327714  Publication date: 2020-07-14

Abstract: This technology, "Apparatus and method for processing write operation", was devised by Andrew Christopher Rose, Richard Roy Grisenthwaite and Ali Ghassan Saidi on 2018-11-28. Its main content is as follows: An apparatus and method for processing write operations are provided. The apparatus has a first processing device to execute a sequence of instructions, wherein the sequence includes at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region, and at least one write instruction, executed after the ownership rights have been established, to perform one or more write operations to output write data for storage in at least one memory location in the first memory region. A write-back cache associated with the first processing device stores the write data output during the one or more write operations. A coherency circuit is coupled to the write-back cache and to at least one further cache associated with at least one further processing device. The first processing device initiates a flush operation in response to a triggering event to cause the write data to be written from the write-back cache to memory. Further, in response to the flush operation, the coherency circuit interacts with the at least one further cache to implement a hardware protocol to make the write data visible to the at least one further processing device. This can provide a very efficient and cost-effective mechanism for achieving cache coherency in certain systems.

1. An apparatus, comprising:

a first processing device to execute a sequence of instructions, the sequence including at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region and at least one write instruction to execute after establishing the ownership rights to perform one or more write operations to output write data for storage in at least one memory location in the first memory region;

a write-back cache associated with the first processing device and arranged to store the write data output during the one or more write operations; and

a coherency circuit coupled to the write-back cache and to at least one further cache associated with at least one further processing device;

wherein:

the first processing device is arranged to initiate a flush operation in response to a triggering event to cause the write data to be written from the write-back cache to memory; and

the coherency circuit is arranged, in response to the flush operation, to interact with the at least one further cache to implement a hardware protocol to make the write data visible to the at least one further processing device.

2. The apparatus of claim 1, wherein the apparatus implements a cache coherency protocol for the first memory region through a combination of the software protocol triggered by the first processing device and the hardware protocol implemented by the coherency circuit.

3. The apparatus of claim 1 or claim 2, wherein the coherency circuitry is arranged, in response to the flush operation, to cause the at least one further cache to invalidate any cache entry whose contents are associated with the at least one memory location of the write data.

4. The apparatus of any preceding claim, wherein:

the at least one further cache comprises a plurality of caches;

the coherency circuit maintains a record that provides, for each cache of the plurality of caches, an indication of the memory locations whose data is cached by that cache; and

the coherency circuit is arranged to reference the record when determining which of the plurality of caches to interact with in order to make the write data visible to the at least one further processing device.

5. The apparatus of claim 4, wherein the coherency circuit is a snoop circuit.

6. The apparatus of any preceding claim, further comprising:

a lock storage component to maintain lock indications for a plurality of memory regions; and

the software protocol is arranged, when activated by the first processing device, to determine whether the lock indication for the first memory region is clear and, in the event that the lock indication is clear, to set the lock indication for the first memory region and to identify to the first processing device that it has the ownership rights for the first memory region.

7. The apparatus of any preceding claim, wherein the first processing device is arranged to release the ownership rights after notification from the coherency circuit that the write data has been made visible to the at least one further processing device.

8. The apparatus of any preceding claim, wherein the triggering event comprises one of:

executing, by the first processing device, a flush instruction in the sequence that initiates the flush operation; and

an eviction event triggered for an entry in the write-back cache that includes the write data.

9. The apparatus of claim 8, wherein the flush instruction identifies a particular entry of the write-back cache whose data is to be flushed.

10. The apparatus of claim 8, wherein the flush instruction triggers a flush of a plurality of entries of the write-back cache, and further comprising a storage unit to store information that enables the plurality of entries to be identified.

11. The apparatus of claim 10, wherein:

the storage unit is arranged to store an indication of the write operation performed by the first processing device;

the flush instruction is an end instruction in the sequence of instructions, and the first processing device is responsive to execution of the end instruction by:

causing the write data that is the subject of each write operation whose indication is stored in the storage unit to be flushed from the write-back cache to the memory; and

clearing the indications of the write operations stored in the storage unit.

12. The apparatus of claim 11, wherein the memory is a non-volatile memory that forms a persistent point, and executing the end instruction causes the write data to be propagated to the persistent point.

13. The apparatus of claim 11 or claim 12, wherein the first processing device is further arranged such that an indication of a write operation performed by the first processing device is stored in the storage unit when the write instruction giving rise to that write operation is located after a start instruction in the sequence of instructions.

14. The apparatus of claim 13, wherein the first processing device is arranged to perform a write operation to the first memory region only when the ownership rights to write data to the first memory region have been established and a pending write instruction specifying a memory location in the first memory region is located after the start instruction and before the end instruction.

15. The apparatus of any preceding claim when dependent on claim 2, wherein the cache coherency protocol is implemented only by the coherency circuit for at least one further memory region, and the first processing device is arranged to obtain exclusive write access to the at least one further memory region by means of steps performed by the coherency circuit in respect of the at least one further cache.

16. A method of processing a write operation in an apparatus having a first processing device, a write-back cache associated with the first processing device, and a coherency circuit coupled to the write-back cache and at least one further cache associated with at least one further processing device, the method comprising:

executing on the first processing device at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region and at least one write instruction to execute after establishing the ownership rights to perform one or more write operations to output write data for storage in at least one memory location in the first memory region;

storing the write data output during the one or more write operations in the write-back cache;

in response to a triggering event, causing the first processing device to initiate a flush operation to cause the write data to be written from the write-back cache to memory; and

in response to the flush operation, causing the coherency circuit to interact with the at least one further cache to implement a hardware protocol to make the write data visible to the at least one further processing device.

17. An apparatus, comprising:

a first processing component to execute a sequence of instructions, the sequence including at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region and at least one write instruction to execute after establishing the ownership rights to perform one or more write operations to output write data for storage in at least one memory location in the first memory region;

a write-back cache component associated with the first processing component to store the write data output during the one or more write operations; and

a coherency component to couple to the write-back cache component and to at least one further cache component associated with at least one further processing component;

wherein:

the first processing component is to initiate a flush operation in response to a triggering event to cause the write data to be written from the write-back cache component to memory; and

the coherency component is to interact, in response to the flush operation, with the at least one further cache component to implement a hardware protocol to make the write data visible to the at least one further processing component.

Technical Field

The present technology relates to an apparatus and method for processing write operations.

Background

A data processing system typically comprises a plurality of processing devices that may perform data processing operations on data, and these various processing devices may be arranged to be able to access shared data in a memory. Typically, a processing device will include one or more levels of local cache in which the data it manipulates may be cached, allowing the data to be accessed more quickly than if it had to be fetched from memory each time it is required.

However, the presence of local caches can give rise to coherency problems, in that one processing device may access an out-of-date copy of data whose latest version is cached in the local cache of another processing device.

While hardware cache coherency mechanisms have been developed for this situation, these mechanisms can be relatively expensive to implement in terms of hardware area and/or execution time.

Disclosure of Invention

In a first example configuration, there is provided an apparatus comprising: a first processing device to execute a sequence of instructions, the sequence including at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region and at least one write instruction to execute after establishing the ownership rights to perform one or more write operations to output the write data for storage in at least one memory location in the first memory region; a write-back cache associated with the first processing device and arranged to store write data output during one or more write operations; and coherency circuitry coupled to the write-back cache and to at least one further cache associated with at least one further processing device; wherein: the first processing device is to initiate a flush operation in response to a triggering event to cause the write data to be written from the write-back cache to the memory; and the coherency circuitry is to interact, in response to the flush operation, with the at least one further cache to implement a hardware protocol to make the write data visible to the at least one further processing device.

In another example configuration, a method of processing write operations is provided in an apparatus having a first processing device, a write-back cache associated with the first processing device, and a coherency circuit coupled to the write-back cache and to at least one further cache associated with at least one further processing device, the method comprising: executing, on the first processing device, at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region, and at least one write instruction, executed after establishing the ownership rights, to perform one or more write operations to output write data for storage in at least one memory location in the first memory region; storing, in the write-back cache, the write data output during the one or more write operations; in response to a triggering event, causing the first processing device to initiate a flush operation to cause the write data to be written from the write-back cache to memory; and in response to the flush operation, causing the coherency circuit to interact with the at least one further cache to implement a hardware protocol so as to make the write data visible to the at least one further processing device.

In yet another example configuration, an apparatus is provided that includes: a first processing component to execute a sequence of instructions, the sequence including at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region and at least one write instruction to execute after establishing the ownership rights to perform one or more write operations to output write data for storage in at least one memory location in the first memory region; a write-back cache component associated with the first processing component for storing the write data output during the one or more write operations; and a coherency component to couple to the write-back cache component and to at least one further cache component associated with at least one further processing component; wherein: the first processing component is to initiate a flush operation in response to a triggering event to cause the write data to be written from the write-back cache component to memory; and the coherency component is to interact, in response to the flush operation, with the at least one further cache component to implement a hardware protocol to make the write data visible to the at least one further processing component.

Drawings

The present technology will be further described, by way of illustration only, with reference to examples of the present technology as shown in the accompanying drawings, in which:

FIG. 1 is a block diagram of an example system in which techniques described herein may be implemented;

FIG. 2 illustrates one example of a directory that may be maintained by the coherency circuit of FIG. 1;

FIG. 3 illustrates one example of a lock storage component that may be provided in the memory of the system of FIG. 1;

FIGS. 4A and 4B illustrate a flow diagram of a process that may be performed in a processing device when performing one or more write operations;

FIG. 5 is a flowchart showing steps performed by the coherency circuit in response to a flush operation initiated by a processing device that has performed one or more write operations, in one example arrangement;

FIG. 6 is a flow chart showing how the coherency mechanism employed by the system of FIG. 1 is made dependent on the region of memory on which a write operation is to be performed;

FIGS. 7A-7D illustrate a sequence of example code that may be executed by one of the processing devices in an example arrangement;

FIG. 8 illustrates a FIFO (first in, first out) storage component that may be provided in one or more processing devices in an example arrangement to track write operations that need to be flushed;

FIG. 9 schematically illustrates a write-back cache in one example arrangement; and

FIG. 10 shows an example sequence of instructions that a data processing apparatus may execute in one example.

Detailed Description

In systems comprising multiple processing devices with their own local cache structures, a cache coherency protocol is typically implemented to ensure that each processing device has the same view of data shared between the processing devices. Such a cache coherency protocol needs to ensure that different processing devices cannot update data associated with a memory location in a conflicting manner, which may occur if two different processing devices are allowed to perform write operations on the same data simultaneously (which may be referred to as a "contended write" condition). Thus, when a write operation is required for data at a particular memory address, then according to known hardware cache coherency schemes, the processing device wishing to perform the write operation notifies the coherency circuitry so that the coherency circuitry then performs any necessary steps with respect to any cached copies of the data that may be held by other processing devices in order to ensure that the requesting processing device can then perform the write operation without the risk of any of the other processing devices performing a conflicting write for that memory address. The coherency circuit may also ensure that whenever any processing device subsequently requests access to a memory address, it will obtain the latest version of the data.

However, in some systems where data is shared between multiple processing devices, the inventors have recognized that a software scheme may already be in place that can effectively ensure that only one processing device has the ability to write to a particular memory location at any point in time. By way of example only, in a rack-level computing system (where a large database may be shared among various processing devices), software techniques may be used to control which processing device is able to write to any particular memory location at any particular point in time. Thus, the inventors have recognized that in such systems, some of the functions typically associated with cache coherency mechanisms no longer need to be performed by coherency circuitry.

In particular, the inventors have observed that the functions performed by a hardware coherency mechanism essentially comprise steps for establishing the authority of a processing device to write data, and further steps, taken after the write operation has been performed, for making the updated write data visible to other processing devices. In a system that already provides software mechanisms to effectively control which processing device can write to any particular memory location, the inventors have recognized that cache coherency can be maintained without requiring the cache coherency protocol to be implemented entirely in hardware. In particular, the steps required to establish the right to write data may be performed in software, with a hardware mechanism then used to ensure that the updated write data is made visible to the other processing devices. This can result in a significant reduction in the cost (in terms of hardware area and/or execution time) associated with providing cache coherency support within the system.

In one example arrangement, an apparatus may be provided that includes a first processing device configured to execute a sequence of instructions, where the sequence includes at least one instruction to activate a software protocol to establish ownership rights for writing data to a first memory region and at least one write instruction to execute after establishing the ownership rights to perform one or more write operations to output the write data for storage in at least one memory location in the first memory region. A first processing device has a write-back cache associated with the first processing device in which write data output during one or more write operations is stored. Furthermore, a coherency circuit coupled to the write-back cache and to at least one further cache associated with at least one further processing device is provided.

By using a software protocol, it may be ensured that when a first processing device is given ownership rights with respect to a first memory region, no other processing device will be able to perform a conflicting write operation to that first memory region. However, it is still necessary to ensure that the write data generated by the first processing device is made visible to the other processing devices.

According to the described example, this is achieved by the first processing device initiating a flush operation in response to a triggering event, to cause the write data to be written from the write-back cache to memory. "Flushing" a write data item from the write-back cache to memory should be understood herein to mean that the modified copy of the data item (i.e., the write data referred to herein) stored in the write-back cache is used to update the original data item stored in memory. In addition, in response to the flush operation, the coherency circuit is arranged to interact with the at least one further cache so as to implement a hardware protocol to make the write data visible to the at least one further processing device.

According to this approach, a software programmer writing instructions to be executed on a processing device needs to know the software protocol required to establish ownership rights, and in particular needs to include appropriate instructions in the sequences executed by the various processing devices, such that when a write operation is to be performed for the first memory region, the relevant processing device first seeks to establish ownership rights using the software protocol. This means that the hardware coherency circuit need not consider the problem of contended writes, since software is responsible for avoiding them. The hardware protocol implemented by the coherency circuit then handles making the write data visible to the at least one further processing device. Because the coherency circuit takes the required steps directly in response to notification of the flush operation, software executing on the processing device need not take any steps to clean or invalidate copies of the data held by other processing devices, either before performing a write operation or after the write data has been produced.
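By way of illustration, the following minimal C sketch models the shape of such an instruction sequence. The helper names (acquire_region_lock, flush_line, release_region_lock) are hypothetical stand-ins for the software protocol and the flush trigger, not part of any real instruction set; note that no instruction at all corresponds to the invalidations performed by the coherency circuit.

```c
#include <stdint.h>

/* Hypothetical stand-ins: empty bodies model primitives that the
 * software protocol and the hardware would actually provide. */
static void acquire_region_lock(uintptr_t addr) { (void)addr; }
static void release_region_lock(uintptr_t addr) { (void)addr; }
static void flush_line(volatile void *p)        { (void)p;   }

static volatile uint64_t shared_value;  /* lives in the first memory region */

void update_shared(uint64_t new_value)
{
    acquire_region_lock((uintptr_t)&shared_value); /* software protocol:
                                                      ownership rights     */
    shared_value = new_value;                      /* lands in the local
                                                      write-back cache     */
    flush_line(&shared_value);                     /* trigger event: the
                                                      coherency circuit's
                                                      hardware protocol
                                                      runs from here       */
    release_region_lock((uintptr_t)&shared_value); /* after the write data
                                                      is visible to others */
}
```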

As a result, the approach taken is far less cumbersome from a software perspective than if the entire cache coherency protocol were implemented in software. Furthermore, the implementation cost (in terms of hardware area and/or execution time) is much lower than for a system implementing a full cache coherency protocol in hardware, since the required hardware protocol only needs to be concerned with making the write data visible to other processing devices, while responsibility for establishing write permissions is transferred to software. The hardware protocol implemented by the coherency circuit can be made completely transparent to the software programmer, because no instructions need to be added to the code executing on the processing devices in order to implement this functionality; rather, the coherency circuit implements the required functionality entirely in hardware, triggered by the flush operation.

Thus, in accordance with the above technique, it will be appreciated that the apparatus may be arranged to implement a cache coherency protocol for the first memory region by a combination of a software protocol triggered by the first processing device and a hardware protocol implemented by the coherency circuit. This may provide a very efficient mechanism for achieving cache coherency, especially in systems that already make use of appropriate software mechanisms to ensure that only one processing device can write to any particular memory location at a time.

The manner in which the coherency circuit responds to the flush operation may vary depending on the implementation. In one example, the coherency circuitry is arranged, in response to the flush operation, to cause the at least one further cache to invalidate any cache entry whose contents are associated with the at least one memory location of the write data. It is sufficient simply to invalidate those cache entries, since it is now known that the most recent data for the memory location(s) in question is the data generated by the first processing device during performance of the write operation(s). Furthermore, the flush operation will cause the most recent version of the data to be output to the memory, and thus when any of the at least one further processing device requests the data for the at least one memory location, a miss will occur in its local cache and the correct data will be retrieved from the memory.

While invalidating entries in the manner described above is the simplest approach, if desired a cache entry whose contents are associated with the at least one memory location of the write data may instead be updated to store the most recent write data generated by the first processing device, rather than being invalidated. This avoids having to re-request the data from the memory if it is needed at some future point in time.

The coherency circuit may be arranged to operate in various ways. For example, the coherency circuitry may be arranged to respond to the flush operation simply by broadcasting an invalidation or update request to all caches associated with the further processing devices, irrespective of whether those caches are caching copies of the data affected by the write operation. However, in alternative implementations, the coherency circuit may be arranged to perform more targeted operations. In particular, in one example, the at least one further cache comprises a plurality of caches, and the coherency circuit maintains a record that provides, for each cache of the plurality of caches, an indication of the memory locations whose data is cached by that cache. The coherency circuitry is then arranged to refer to this record when determining which of the plurality of caches to interact with in order to make the write data visible to the at least one further processing device.

In one such example where the record is maintained by the coherency circuit, the coherency circuit may take the form of a snoop circuit. Snoop circuitry is a known type of cache coherency circuitry that maintains an indication of which memory addresses are cached by the various caches in the system, and uses this information to direct snoop requests to particular caches in order to implement a cache coherency protocol. However, according to the above technique, the snoop circuit is not required to implement a full cache coherency protocol, at least for the first memory region, since the task of establishing the right to write data is instead transferred to the software protocol.

There are a number of ways in which the software protocol may be implemented in the apparatus. In one example, the apparatus further comprises a lock storage component to maintain lock indications for a plurality of memory regions, and the software protocol is arranged, when activated by the first processing device, to determine whether the lock indication for the first memory region is clear. In the event that the lock indication is clear, the software protocol then causes the lock indication to be set for the first memory region and identifies to the first processing device that it has the ownership rights for the first memory region.

Thus, the lock indication for each memory region may take the form of a flag held as a separate variable that is set to indicate when the associated memory region is in the process of being updated.

Depending on the implementation, each memory region tracked by the lock storage component may comprise a single memory address or a range of memory addresses. Each memory location is identified by a memory address, and thus there may be one or more memory locations in each memory region tracked by the lock storage component.
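A minimal C sketch of such a lock storage component follows, assuming for illustration one lock indication per fixed-size region and a simple test-and-set software protocol; the region granularity and function names are assumptions, not part of the described apparatus.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_REGIONS  64
#define REGION_SHIFT 12         /* assume 4 KiB regions, purely for example */

/* Zero-initialized at static storage: all indications start clear on
 * typical implementations. */
static atomic_flag region_lock[NUM_REGIONS];

static unsigned region_of(uintptr_t addr)
{
    return (unsigned)((addr >> REGION_SHIFT) % NUM_REGIONS);
}

/* Software protocol: if the lock indication is clear, set it and return,
 * establishing ownership rights; otherwise stall until it becomes clear. */
void acquire_region_lock(uintptr_t addr)
{
    while (atomic_flag_test_and_set_explicit(&region_lock[region_of(addr)],
                                             memory_order_acquire))
        ;  /* indication already set: another device owns the region */
}

/* Release ownership rights by returning the indication to the clear state. */
void release_region_lock(uintptr_t addr)
{
    atomic_flag_clear_explicit(&region_lock[region_of(addr)],
                               memory_order_release);
}
```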

In one example arrangement, the first processing device is arranged to release the ownership rights after notification from the coherency circuit that the write data has been made visible to the at least one further processing device. Thus, once the first processing device is granted the ownership rights, it retains them until it can be guaranteed that the write data it has generated is visible to any of the other processing devices. It should be noted that in some cases this does not require the write data to have actually been written to main memory, provided it has reached a point of coherency within the system (e.g., a buffer storage component that temporarily holds the data before it is written to memory), since it can then be guaranteed that when any of the other processing devices requests the data at the affected memory locations, it will obtain the most recent write data from the point of coherency (if that data has not already been written to main memory).

The triggering event for initiating a flush operation by the first processing device may take a variety of forms. For example, the triggering event may occur as a result of the first processing device executing a flush instruction in the sequence of instructions, the execution of which causes the flush operation to be performed. As another example, the triggering event may occur if an eviction event is triggered for an entry in the write-back cache containing the write data. This may occur, for example, due to capacity constraints in the write-back cache, where a victim entry needs to be selected in order to make room for new data being allocated into the cache. Such an eviction event may, for example, select an entry that has been written by one of the write operations as the victim entry before the first processing device executes a flush instruction to actively flush that entry.

The flush instruction may take a variety of forms. For example, in a simple case, a flush instruction may be associated with each write instruction, such that once the write data has been written to the desired memory location, a subsequent flush instruction identifies that memory location.

However, in alternative arrangements, a single flush instruction may be arranged to trigger flushing of multiple entries of the write-back cache, and in these cases the apparatus may further comprise a storage unit for storing information enabling those entries to be identified. For example, the techniques described in co-owned co-pending U.S. patent application No. 15/501,278 may be utilized, where the flush instruction may take the form of an end instruction as defined in that application, which causes a number of identified entries in the write-back cache that have been subject to write operations to be flushed to main memory. In the example discussed in that co-pending application, the memory is a non-volatile memory forming a persistent point in the system, and executing the end instruction causes the write data from the identified entries to be propagated to the persistent point. In the present technique, however, the end instruction may be used regardless of the form of the memory, and when combined with the above mechanism provides a very efficient technique for maintaining cache coherency for write data resulting from a sequence of write operations: the software protocol ensures that no other processing device can perform a conflicting write operation in the memory region, and the hardware coherency circuitry then makes all of the write data visible to the at least one further processing device in response to the flush operation.

In one example arrangement, the storage unit may be arranged to store an indication of each write operation performed by the first processing device, the flush instruction may be an end instruction in the sequence of instructions, and the first processing device may respond to execution of the end instruction by causing the write data that is the subject of each write operation whose indication is stored in the storage unit to be flushed from the write-back cache to memory, and by then clearing the indications of the write operations stored in the storage unit.

By using the above-described end instruction technique, where a programmer of the data processing apparatus wishes to ensure that write operations (initiated by execution of write instructions) reach the memory in a particular order, this can be achieved by adding an end instruction to the sequence of instructions written by the programmer to define the data processing operations the apparatus should perform.

In some example arrangements, the first processing device is further arranged such that an indication of a write operation performed by the first processing device is stored in the storage unit when the corresponding write instruction is located after a start instruction in the sequence of instructions. Thus, an additional instruction may be added to the instruction set to which the processing device responds, enabling the system programmer to encapsulate a "write set" of write operations (preceded by the start instruction and terminated by the end instruction). In this way it can be ensured that the ordering desired by the programmer (of the write set relative to other write operations) is observed in the contents of main memory in terms of the write-back cache operations, rather than being disturbed by the first processing device re-ordering the execution of certain instructions for scheduling reasons.
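The following C sketch models this behaviour, under the assumption that the storage unit can be represented as a small FIFO of tracked addresses. start_write_set, track_write and end_write_set are illustrative names for the effects of the start instruction, the tracked writes and the end instruction, and flush_line is a printf stand-in for flushing a single cache entry.

```c
#include <stddef.h>
#include <stdio.h>

#define WRITE_SET_MAX 32

static const volatile void *write_set[WRITE_SET_MAX]; /* storage unit (FIFO) */
static size_t write_set_count;

static void flush_line(const volatile void *p)   /* stand-in for the flush */
{
    printf("flush cache line containing %p\n", (void *)p);
}

static void start_write_set(void)       /* effect of the start instruction */
{
    write_set_count = 0;
}

static void track_write(const volatile void *p) /* indication of one write */
{
    if (write_set_count < WRITE_SET_MAX)
        write_set[write_set_count++] = p;
}

static void end_write_set(void)         /* effect of the end instruction */
{
    for (size_t i = 0; i < write_set_count; i++)
        flush_line(write_set[i]);       /* flush every tracked entry...     */
    write_set_count = 0;                /* ...then clear stored indications */
}
```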

In one example arrangement, the first processing device may be arranged to perform a write operation to the first memory region only if ownership rights for writing data to the first memory region have been established and a pending write instruction specifying a memory location in the first memory region is preceded by a start instruction and followed by an end instruction.

While the above techniques may be employed for all memory regions in memory, in other examples they may be limited to particular memory regions, with a full hardware cache coherency scheme implemented for other memory regions if desired. Thus, in one such arrangement, the cache coherency protocol is implemented solely by the coherency circuitry for at least one further memory region, and the first processing device is arranged to obtain exclusive write access to the at least one further memory region by means of steps performed by the coherency circuitry in respect of the at least one further cache. Hence, for the at least one further memory region, rather than using the software protocol, steps taken directly by the coherency circuitry with respect to the various other caches in the system are used to ensure that the first processing device obtains exclusive write access to that memory region before performing a write operation.

Specific examples will now be described with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system in which the above-described techniques may be implemented. A plurality of processing devices 10, 20, 30 are arranged to have access to the memory 70. Each processing device may be provided with one or more levels of local cache 15, 25, 35 in which local copies of the data being processed by that processing device are temporarily stored. Each data value has an associated memory address identifying a memory location in the memory 70, and, in accordance with standard practice, a copy of a data value is typically stored in a cache along with an indication of that memory address. If the data in a local cache is updated (e.g., as a result of one or more write operations performed by the processing device on the data), the updated data may be written back to the memory 70. The caches may be arranged in various ways. For example, if a cache is arranged as a write-through cache, any update made to data stored in a cache line is also propagated via the interconnect 40 to the memory 70, so that the update is reflected in the memory without delay. However, at least one level of local cache is typically arranged as a write-back cache, in which updated versions of data held in the cache are not immediately propagated to the memory. Instead, a dirty flag may be associated with each cache line containing data that is more up-to-date than memory, so that when the contents of such a cache line are subsequently evicted from the cache, the necessary update to the memory can be made. The techniques described herein may be applied in respect of any processing device having at least one level of local cache arranged as a write-back cache. In FIG. 1, cache 15 is identified as a write-back cache, but in one example arrangement each of the local caches 15, 25, 35 may be a write-back cache.
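As a point of reference, the following toy C model shows the write-back behaviour just described: a dirty flag marks a line as more up-to-date than memory, and eviction writes the line back before reuse. The direct-mapped layout and field names are purely illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

struct cache_line {
    bool    valid;
    bool    dirty;             /* set: contents newer than memory */
    size_t  addr;              /* memory offset of the cached line */
    uint8_t data[LINE_BYTES];
};

/* A write hits only the cache; the dirty flag defers the memory update. */
static void write_byte(struct cache_line *line, size_t offset, uint8_t v)
{
    line->data[offset % LINE_BYTES] = v;
    line->dirty = true;
}

/* On eviction, a dirty line's contents must first be written back. */
static void evict(struct cache_line *line, uint8_t *memory)
{
    if (line->valid && line->dirty)
        memcpy(memory + line->addr, line->data, LINE_BYTES);
    line->valid = false;
    line->dirty = false;
}
```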

Since the respective processing devices are able to cache copies of data locally in their respective local caches 15, 25, 35, a cache coherency protocol needs to be implemented to ensure that conflicting write operations are not performed on the same memory location by different processing devices, and to ensure that each processing device has the same view of the shared data. In particular, while some of the data stored in memory may be specific to a particular processing device, it is often the case that the processing devices will share some of the data stored in memory. As an example, the memory may include a shared database 75, and each of the processing devices may be assigned processing tasks relating to information held in the shared database 75. It is important to ensure that data in the shared database is not subject to conflicting updates by multiple processing devices, and that each of the processing devices is provided with a consistent view of the data held in the shared database 75.

In one example arrangement, coherency circuitry 50 may be provided in interconnect 40 to implement at least part of a cache coherency protocol. When implementing a cache coherency protocol, two tasks basically need to be undertaken. In particular, when a processing device wishes to perform a write operation with respect to a particular memory location, it must be ensured that the processing device wishing to perform the write operation has established the right to perform the write operation with respect to that memory location. This may be used to avoid the possibility of conflicting writes to the same memory location. Furthermore, once the data for a particular memory location has been updated by such a write operation, it is important to ensure that whenever any other processing device requests access to that memory location, it can see the updated data. Thus, as part of the cache coherency protocol, it must be ensured that the results of any write operation are made visible to all other processing devices.

While coherency circuitry could be arranged in the interconnect to implement the cache coherency protocol entirely in hardware, taking account of all of the required functionality described above, according to the techniques described herein the implementation of the cache coherency protocol is split between software and hardware for at least some regions of memory. In particular, coherency circuit 50 may be used to ensure that updated data generated by one processing device as a result of a write operation is made visible to the other processing devices. However, obtaining permission to write data to a memory location is transferred to a software protocol, so that coherency circuit 50 need not consider this aspect. As will be discussed in more detail later, the software protocol may use a lock-based mechanism, employing a lock storage component 80 maintained in memory 70, to ensure that a processing device obtains ownership rights for an associated memory region before it performs a write operation to a memory address in any memory region whose cache coherency is controlled by the combined software/hardware mechanism. In this way, the software has the task of avoiding contended write conditions for such memory regions, and the coherency circuit then implements a hardware protocol to make any updated write data visible to the other processing devices.

In accordance with the techniques described herein, once a write operation (or, in some cases, a series of write operations) has been performed to produce updated write data in one or more memory locations of a specified memory region, a flush operation is triggered to propagate the updated write data in the local write-back cache 15 to the appropriate memory locations in the memory 70. In one example, the memory is non-volatile memory and thus represents a persistent point in the system. In particular, once the data has been committed to memory 70, the data will remain even if power is removed from the system. The flush operation may thus cause the data to propagate to a persistent point in the system. In another example, however, the memory need not be a non-volatile memory.

In accordance with the described techniques, the coherency circuit then takes a number of coherency actions in response to initiation of the flush operation, in order to make the write data visible to the other processing devices.

For example, consider the following case: the processing device 10 executes an instruction to activate the software protocol to establish its ownership rights for writing data to the first memory region, and then performs one or more write operations to at least one memory location in the first memory region, such that updated write data is stored in the local write-back cache 15. Processing device 10 may then execute a flush instruction to initiate a flush operation, so as to cause all of the affected write data to be propagated to memory 70 for storage at the appropriate memory location(s). The coherency circuit may then respond to the flush operation by taking steps in respect of the local caches 25, 35 of the other processing devices 20, 30, to ensure that if those processing devices need to access the updated data, they will see it. In one example, the coherency circuitry may be arranged simply to propagate invalidate requests to each of the local caches 25, 35, specifying the affected memory location(s). Each local cache will then perform a lookup for each specified memory location to determine whether an entry exists in the cache for that memory location. If so, the current contents are invalidated in the local cache. This ensures that if the processing device subsequently requests the data at that memory address, a miss will be detected in the local cache, causing the data to be retrieved from memory, with the result that the processing device 20, 30 will receive the correct data.

Although in a simple arrangement the coherency circuit may simply broadcast an invalidation request to all of the other local caches, in an alternative implementation the coherency circuit may take the form of a snoop circuit comprising a snoop controller 55 and an associated directory 60. The directory is maintained by the snoop controller to provide an indication of the memory locations whose data is cached by each of the local caches 15, 25, 35. Thus, for the one or more memory locations affected by the flush operation, snoop controller 55 may perform a lookup within the directory to identify which of the local caches may hold a locally cached copy of the affected data, and may then direct the invalidation request to the relevant caches, rather than simply broadcasting it to every cache.

In this way, it can be seen that the coherency circuit can respond to the flush operation so that, by the time the updated write data is written to memory, all of the processing devices are guaranteed a consistent view of the data.

During the period before the flush operation is performed, while processing device 10 is executing one or more write instructions to generate updated write data for one or more memory locations, the software protocol may control access to those memory locations by the other processing devices. In some cases, the intended software use of the data may be such that the software protocol allows another processing device to read the old value of the data from memory, while in other cases such a read operation may be prevented during that period.

FIG. 2 is a diagram schematically illustrating one example arrangement of the directory 60 of FIG. 1. In this example, the directory has a plurality of entries 100, 105, each identifying a memory address or range of memory addresses, and then identifying which caches may contain copies of the data at that memory address or range of memory addresses. This may be used by the snoop controller to determine which local caches need to receive invalidate requests in response to the previously mentioned flush operation. In the example of FIG. 2, a tick is used to indicate that the associated cache may hold a copy of the data, and a cross is used to indicate that the associated cache does not hold a copy. It will be appreciated that the tick and the cross may be encoded with any suitable values; for example, a logic one value may indicate a tick and a logic zero value may indicate a cross.
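A C sketch of such a directory follows, encoding each row of FIG. 2 as an address range plus one presence bit per local cache (logic one for a tick, logic zero for a cross). The bit-vector encoding and the lookup helper are assumptions made for illustration, not a description of the actual hardware.

```c
#include <stdint.h>

#define NUM_CACHES 3                /* e.g. caches 15, 25 and 35 of FIG. 1 */

struct directory_entry {
    uintptr_t base;                 /* first address of the tracked range */
    uintptr_t limit;                /* last address of the tracked range  */
    uint8_t   presence;             /* bit n set: cache n may hold a copy */
};

/* Return the set of caches the snoop controller should send invalidate
 * (or update) requests to for a flushed address, excluding the cache of
 * the device that performed the writes. */
uint8_t snoop_targets(const struct directory_entry *dir, int num_entries,
                      uintptr_t addr, int requester)
{
    for (int i = 0; i < num_entries; i++)
        if (addr >= dir[i].base && addr <= dir[i].limit)
            return dir[i].presence & (uint8_t)~(1u << requester);
    return 0;                       /* no other cache may hold a copy */
}
```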

It should be understood that FIG. 2 shows only one example arrangement of directory 60, and that any other suitable arrangement may be used to provide the snoop controller with information regarding which caches need to be invalidated in response to a received flush operation.

FIG. 3 shows an example arrangement of the lock storage component 80 that may be used. The lock storage component may contain a plurality of entries 110, 115, 120, each associated with a particular memory region. A lock indication may be maintained for each memory region to identify whether at least one memory address in the region is in the process of having its data value updated. If the lock indication is set, this indicates that data for at least one memory address in the region is being updated; if the lock indication is clear, no address in the region contains data that is in the process of being updated.

When a processing device wishes to perform a write operation to a memory address in one of the regions managed by the lock storage component, the processing device will execute an instruction seeking to have the associated lock indication set. If the associated lock indication is currently clear, it is set and an acknowledgement is returned to the processing device, which in effect identifies to the processing device that ownership rights have now been established for the associated memory region, and that write operations can therefore be freely performed for that memory region. However, if the lock indication is already set, the processing device will be prevented from progressing beyond the instruction seeking to set the lock; in particular, the processing device will not at this stage be allowed to execute any write instruction seeking to write to a memory address in the relevant region. In one example, execution will stall until the lock indication for the relevant region returns to the clear state, at which point the lock indication can be set again and an acknowledgement issued to the processing device identifying that it now has the ownership rights for the memory region, after which the write instruction(s) can be executed. In this way, ownership rights for multiple memory regions can be maintained using a software protocol, ensuring that at any particular point in time only one processing device has the right to perform write operations to the memory locations in each such region. This simplifies the operations required of coherency circuit 50, since it need not take any steps to ensure exclusive ownership for a particular processing device, but instead only needs to respond to the subsequent flush operation to ensure that the processing devices have a consistent view of the data after it has been updated.

FIGS. 4A and 4B provide a flowchart illustrating the steps performed in a processing device when that processing device wishes to update data in a particular memory region for which, in one example arrangement, the combined software/hardware mechanism is used to achieve cache coherency.

At step 200, an acquire lock instruction is executed by the processing circuitry for the desired memory region. This results in the lock storage component 80 in memory being accessed, and in particular the lock indication for the relevant region being checked. If the lock indication is already set, the process may stall at step 205 until the lock indication is determined to be clear, at which point the lock indication can be set to indicate that the processing device executing the acquire lock instruction has acquired the lock. Once the lock has been acquired, this effectively means that the requesting processing device has established ownership rights for writing data to locations in the desired memory region, as shown at step 210. At this stage, the software protocol prevents the other processing devices from gaining write access to the memory region in question, because if any of those other processing devices were to execute an acquire lock instruction, it would be determined that the lock indication is set, and they would therefore be unable to obtain the lock for that memory region. Thus, steps 200, 205 and 210 may be viewed as implementing the software protocol to establish the right to write data.

After step 210, the processing device may then perform one or more write operations to generate write data for memory locations in the locked memory region, storing the generated write data in its local write-back cache at step 215. The process may continue until a trigger for a flush operation is detected at step 220.

In one example arrangement, the trigger for the flush operation may occur for various reasons. For example, in a simple case, a separate flush instruction may be associated with each write instruction, such that the process performed is for the processing device to obtain the lock, perform a single write operation to the desired memory location, and then execute the flush instruction to trigger a flush operation for that write data. In another example arrangement, however, a series of write operations to memory addresses in the locked memory region may be grouped together, with a single flush operation triggered after those write operations have been performed. One suitable mechanism for achieving such functionality is that described in co-owned co-pending U.S. patent application No. 15/501,278, which is hereby incorporated by reference in its entirety. In particular, the end instruction defined therein may be used as the flush instruction to trigger the flush operation and, if desired, the associated start instruction may be used to identify the start of the sequence of write instructions whose write data will need to be flushed when the end instruction is encountered. A separate storage structure may be maintained to track the memory locations that need to be flushed in response to the end instruction.

As yet another example of a trigger that may occur at step 220, this may be an eviction event targeting a cache entry containing updated write data resulting from the write operation(s) performed at step 215. Eviction events may occur, for example, due to capacity constraints in the local cache, where a victim entry needs to be selected to make room for new data being allocated into the cache. When the victim entry is an entry containing the write data generated at step 215, this may trigger the flush operation at step 220. If the previously described arrangement is employed in which a series of write operations is associated with a start instruction and an end instruction, an eviction of one of the affected cache entries may be used to trigger a flush operation for at least that entry or, if desired, may result in all entries for which write data has been generated between the start instruction and the end instruction being flushed when the eviction trigger occurs.

When a triggering event occurs at step 220, the process proceeds to step 225, where a flush operation is performed on the write data that has been generated for the locked memory region, so as to cause the memory contents to be updated to reflect the updated data generated by the write operation(s).

As previously described, coherency circuit 50 will respond to the flush operation by performing an invalidation operation on any relevant entries in the local caches of the other processing devices, in order to ensure that the write data being written to memory is visible to all processing devices. At step 230, the processing device that obtained the lock on the memory region waits for confirmation from the coherency circuit that the write data is visible to the other processing devices. Where multiple cache entries are flushed during the flush operation, step 230 awaits confirmation that all of the relevant write data has been made visible to the other processing devices, i.e., that all required invalidation operations have been performed.

Once such an acknowledgement is received at step 230, the flush operation is complete and it is determined at step 235 whether there are any further write operations or flush operations to be performed. This may be the case, for example, if there are one or more further write or flush instructions in the instruction sequence before the instruction that will cause the lock to be released. If there is at least one further write operation or flush operation to perform, the process proceeds to step 220, where it is determined whether there is a trigger for another flush operation; if not, the process returns to step 215 to perform one or more further write operations.

Once it is determined at step 235 that there are no more write operations or flush operations to be performed, the processing circuitry may then execute a release lock instruction at step 240 to release the lock on the memory region. This is achieved by accessing the appropriate entry in the lock storage component 80 to reset the lock indication for the region to the clear state.
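Gathering the steps of FIGS. 4A and 4B together, the following C sketch models the complete writer-side flow. The atomics stand in for the lock storage component, and coherency_ack models the confirmation from the coherency circuit at step 230; all names are illustrative assumptions, and a real implementation would use the instructions and protocol described above rather than these stand-ins.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static atomic_flag region_lock = ATOMIC_FLAG_INIT; /* lock indication      */
static atomic_bool coherency_ack;                  /* set by the coherency
                                                      circuit (step 230)   */

static void flush_line(volatile void *p)           /* step 225 stand-in    */
{
    (void)p;
    atomic_store(&coherency_ack, true);  /* model: invalidations complete  */
}

void write_region(uint64_t *loc, const uint64_t *src, size_t n)
{
    while (atomic_flag_test_and_set(&region_lock))
        ;                                 /* steps 200-205: stall on lock  */
    /* step 210: ownership rights for the region are now established       */

    for (size_t i = 0; i < n; i++)
        loc[i] = src[i];                  /* step 215: write data lands in
                                             the local write-back cache    */

    atomic_store(&coherency_ack, false);
    for (size_t i = 0; i < n; i++)
        flush_line(&loc[i]);              /* steps 220-225: flush          */

    while (!atomic_load(&coherency_ack))
        ;                                 /* step 230: wait for visibility */

    atomic_flag_clear(&region_lock);      /* step 240: release the lock    */
}
```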

FIG. 5 is a flowchart illustrating the operation of coherency circuit 50 in one example. At step 300, coherency circuit 50 waits for notification of a flush operation from a processing device performing the steps previously discussed with reference to FIGS. 4A and 4B. Upon notification of the flush operation, the coherency circuitry causes the associated write data to be propagated to the memory 70 at step 305, in order to cause the memory contents to be updated. In parallel, the process proceeds to step 310, where the directory 60 is referenced to identify which other caches (if any) may store data for the memory address associated with the flush operation. Then, at step 315, snoop controller 55 sends an invalidate request to each cache indicated by the directory information. The invalidate request identifies the memory address in question and causes the local cache receiving it to perform a lookup operation to determine whether there is in fact an entry caching a copy of the data at that memory address; if so, that entry is invalidated. This ensures that if the associated processing device subsequently requests the data at that memory address, it will no longer hit in its local cache, but will instead retrieve the correct data from the memory 70.

Although in the depicted example snoop controller 55 sends an invalidate request to the relevant local cache at step 315, in an alternative approach an update request may be sent with the new write data in order to cause the local cache contents to be updated to correctly reflect the required data.

At step 320, snoop controller 55 waits for an acknowledgement that all required invalidations have occurred. Upon receiving that acknowledgement, the snoop controller may, at step 325, notify the processing device that initiated the flush operation that the data is now visible to the other processing devices. It should be noted that if at step 310 the directory indicates that no other caches may be storing data for the memory address in question, the process may proceed directly from step 310 to step 325.

Where a flush operation applies to multiple memory addresses, such as when the previously mentioned start and end instructions are used, an invalidate request may be sent to the required caches for each updated cache line's contents, and at step 320 the snoop controller will wait for confirmation that all required invalidations have occurred for all of the updated data values before proceeding to step 325.

It should be appreciated that steps 310 to 325 are initiated entirely by coherency circuit 50, require no visibility to the software programmer and, in particular, require no explicit instruction to be executed by any processing device. Thus, it can be seen that steps 310 to 325 implement a hardware protocol for making the write data visible to all processing devices.
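
The flow of steps 305 to 325 can be summarised in the same C-style sketch form. The helpers modelling directory 60 and snoop controller 55 (directory_lookup(), send_invalidate(), and so on) are hypothetical names for illustration only; in practice these steps are performed in hardware rather than software.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_CACHES 16

    typedef struct {
        int count;
        int cache_ids[MAX_CACHES];
    } sharer_list_t;

    extern void          write_to_memory(uint64_t addr, const void *data, size_t len);
    extern sharer_list_t directory_lookup(uint64_t addr);                  /* step 310 */
    extern void          send_invalidate(int cache_id, uint64_t addr);     /* step 315 */
    extern void          wait_for_invalidate_acks(const sharer_list_t *s); /* step 320 */
    extern void          notify_requesting_device(void);                   /* step 325 */

    void handle_flush(uint64_t addr, const void *data, size_t len)
    {
        write_to_memory(addr, data, len);   /* step 305: may complete many cycles later */

        sharer_list_t sharers = directory_lookup(addr);
        if (sharers.count == 0) {
            notify_requesting_device();     /* no other cached copies: 310 direct to 325 */
            return;
        }
        for (int i = 0; i < sharers.count; i++)
            send_invalidate(sharers.cache_ids[i], addr);

        wait_for_invalidate_acks(&sharers); /* step 320 */
        notify_requesting_device();         /* step 325: write data now visible */
    }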

While it may take many clock cycles for the data output at step 305 to actually be written to memory, the write data will typically reach a point of coherency (e.g., one or more write buffers associated with the interface to memory) shortly after it propagates from the cache to the interconnect 40. In FIG. 5, it is assumed that the write data has at least reached that point of coherency by the time the check at step 310 is performed, and thus upon reaching step 325 it is understood that the write data will be available from the point of coherency even though it may not yet have been written back to main memory.

While the above-described mechanism (which uses a combination of software and hardware protocols to implement the required cache coherency protocol) may be employed for all memory regions of memory 70 that may be shared between processing devices, in alternative examples this approach may be used for some regions whilst full hardware cache coherency is implemented for other regions, if desired. This is schematically illustrated by the flow chart of FIG. 6. Specifically, it is determined at step 350 whether the region in question requires full hardware coherency. If so, the process proceeds to step 355, where the coherency circuit is used to implement the cache coherency protocol entirely in hardware. In particular, an existing snoop mechanism may be used that not only implements the above-described steps in response to a flush operation (to ensure that the updated data is visible to all processing devices), but also performs coherency actions in respect of the local caches before any processing device performs a write operation, to ensure that the processing device wishing to perform the write operation has a cached copy that is marked as exclusive to that processing device. Thus, this approach enables the establishment of the right to write data to be embedded within the hardware protocol performed by coherency circuit 50. Any known hardware cache coherency scheme may be used at step 355.

However, if the region does not require full hardware coherency, the process proceeds to step 360, where the write rights to the region are managed using the previously described software protocol techniques, with reference to the lock storage component 80 in memory 70. Then, at step 365, the steps required to make the updated write data visible to all processing devices are managed using the hardware protocol previously discussed and implemented by coherency circuit 50.
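
A minimal sketch of this per-region dispatch (FIG. 6), assuming a hypothetical region descriptor and helper functions that the document does not itself define:

    #include <stdint.h>

    typedef struct {
        uint64_t base, size;
        int      full_hw_coherency;   /* outcome of the step 350 determination */
    } region_t;

    extern void run_full_hw_coherency(const region_t *r);           /* step 355 */
    extern void manage_write_rights_in_software(const region_t *r); /* step 360 */
    extern void make_writes_visible_in_hardware(const region_t *r); /* step 365 */

    void handle_region(const region_t *r)
    {
        if (r->full_hw_coherency) {
            run_full_hw_coherency(r);    /* coherency handled entirely by hardware */
        } else {
            manage_write_rights_in_software(r);  /* software protocol, lock storage 80 */
            make_writes_visible_in_hardware(r);  /* hardware protocol, coherency circuit 50 */
        }
    }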

FIGS. 7A-7D illustrate example instruction sequences that may be executed by a processing device having a local write-back cache, where a software protocol and a hardware protocol are used in combination to achieve cache coherency. As shown in FIG. 7A, a single write instruction and an associated flush instruction may be executed between the acquire lock instruction and the release lock instruction. In an alternative arrangement, as shown in FIG. 7B, a set of write instructions may be identified using the previously mentioned start and end instructions, where the end instruction is used to initiate a flush operation for all write data resulting from the series of write instructions. FIGS. 7C and 7D show further alternative arrangements that are variations of the arrangement shown in FIG. 7A, in which multiple write instructions and associated flush instructions are executed between an acquire lock instruction and a release lock instruction. As FIG. 7D illustrates, a flush instruction associated with a particular write instruction need not immediately follow that write instruction in the sequence.
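
By way of illustration, the FIG. 7B arrangement might look as follows in C-style pseudocode, reusing the hypothetical lock helpers from the earlier sketch; nvm_write_set_begin() and nvm_write_set_end() are likewise stand-ins for the start and end instructions, not names taken from this document.

    #include <stdint.h>

    typedef struct region_lock region_lock_t;   /* opaque; see the earlier sketch */
    extern void acquire_region_lock(region_lock_t *l);
    extern void release_region_lock(region_lock_t *l);
    extern void nvm_write_set_begin(void);   /* hypothetical start instruction */
    extern void nvm_write_set_end(void);     /* hypothetical end instruction */

    void fig_7b_pattern(region_lock_t *l, uint64_t *region)
    {
        acquire_region_lock(l);    /* acquire lock instruction */
        nvm_write_set_begin();     /* start instruction opens the write set */
        region[0] = 1;             /* write instruction */
        region[1] = 2;             /* write instruction */
        region[2] = 3;             /* write instruction */
        nvm_write_set_end();       /* end instruction triggers one flush operation
                                      covering all write data in the set */
        release_region_lock(l);    /* release lock instruction */
    }

The FIG. 7A and FIGS. 7C/7D variants differ only in issuing an explicit flush after each write, or after groups of writes, instead of relying on the end instruction.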

As previously mentioned, in an example arrangement using the start and end instructions described above, a storage unit may be provided in association with a processing device (which executes a sequence of instructions including such start and end instructions) to maintain an indication of the write operations performed by the processing device between the start and end instructions. FIG. 8 shows an example arrangement of such a storage unit, where it takes the form of a FIFO storage circuit 450.

When a write instruction occurring between a start instruction and an end instruction is encountered, the processing device indicates to FIFO control 448 and to the associated write-back cache that an indication of the write operation performed in response to that write instruction should be added to an entry of FIFO storage component 446. FIG. 8 shows an example configuration in which four entries of FIFO storage component 446 are currently filled with indications of write operations A to D. As shown in the figure, these indications may comprise the target memory address of the write operation (or indeed only a partial memory address, if this is sufficient to identify the corresponding entry in the write-back cache), or may instead be a cache block indicator showing where in the write-back cache the corresponding entry is to be found. FIFO control 448 maintains overall control of FIFO storage component 446, managing the addition of entries and the deletion of entries (either individual contents of the FIFO or the entire contents of the FIFO) as required. Where FIFO storage circuit 450 is required to respond to a query from the write-back cache as to whether a particular data item in the write-back cache has a corresponding entry in FIFO storage component 446, FIFO storage circuit 450 may be configured in a content addressable manner so as to provide this functionality. One example of when such an individual query may be received is following the eviction of a data item from the write-back cache, in response to which FIFO control 448 may be configured to cause the corresponding entry in FIFO storage component 446 to be removed. Such a content query might also be received in response to a snoop query received by the write-back cache as part of its participation in the cache coherency mechanism; however, as will be described below with reference to FIG. 9, the write-back cache itself may be configured to store an indication of whether a corresponding entry is currently held in FIFO storage component 446, in order to avoid FIFO storage component 446 having to be provided with this content addressable memory configuration, and thus to facilitate providing FIFO storage component 446 as a relatively small addition to the processing device. Finally, as shown in FIG. 8, FIFO control 448 is also configured to monitor whether FIFO storage component 446 has become completely filled (and thus cannot accept more entries). In that event, FIFO control 448 signals this fact to the write-back cache, so that the corresponding entries in the write-back cache can promptly be written back to memory 70 and the contents of FIFO storage component 446 cleared.
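
A simplified, zero-initialised C model of FIFO storage circuit 450 may make this behaviour concrete. The entry format, depth and helper names here are illustrative assumptions rather than details taken from FIG. 8.

    #include <stdint.h>

    #define FIFO_DEPTH 16

    typedef struct {
        uint64_t addr_or_block;   /* (partial) memory address, or cache block indicator */
        int      valid;
    } fifo_entry_t;

    typedef struct {
        fifo_entry_t entries[FIFO_DEPTH];   /* FIFO storage component 446 */
        int          fill;                  /* next free slot; reset when drained */
    } write_fifo_t;

    extern void signal_fifo_full_to_cache(void);   /* hypothetical: prompts the cache to
                                                      write the tracked entries back */

    /* FIFO control 448: record a write performed between start and end instructions */
    void fifo_add(write_fifo_t *f, uint64_t addr_or_block)
    {
        if (f->fill == FIFO_DEPTH) {        /* completely filled: no more entries */
            signal_fifo_full_to_cache();    /* entries written back to memory 70 ... */
            for (int i = 0; i < FIFO_DEPTH; i++)
                f->entries[i].valid = 0;    /* ... and the FIFO contents cleared */
            f->fill = 0;
        }
        f->entries[f->fill++] = (fifo_entry_t){ addr_or_block, 1 };
    }

    /* content-addressable lookup, e.g. after the write-back cache evicts a data item */
    int fifo_remove_matching(write_fifo_t *f, uint64_t addr_or_block)
    {
        for (int i = 0; i < f->fill; i++) {
            if (f->entries[i].valid && f->entries[i].addr_or_block == addr_or_block) {
                f->entries[i].valid = 0;    /* simplified: slot reclaimed only on drain */
                return 1;
            }
        }
        return 0;
    }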

FIG. 9 schematically illustrates a write-back cache 432, shown with a cache controller 460 and a plurality of cache lines 462. Cache controller 460 monitors data items received from the associated processing device and control signals received from both the pipeline of the processing device and FIFO control 448. Cache lines 462 are configured to store data items used by the processing device in its data processing operations, each cache line storing both the data item itself and associated status flags (dirty/valid, etc., in the usual manner). Note, however, that a further bit of each cache line is defined herein as a dedicated tag 464, which can be considered a clean-on-snoop marker: when tag 464 is set for a cache line, that line is flushed to memory if a snoop received from another cache/CPU hits it. A snoop is just one example of such a triggering event; eviction of a cache line is another, for which tag 464 may instead be considered a clean-on-evict marker. In the configuration shown, cache controller 460 is configured to monitor snoops received from other caches in the system, and when a snoop is received from another cache relating to a data item stored in a cache line 462 of cache 432 for which tag 464 is set, cache controller 460 is configured to cause that cache line to be evicted immediately. This event is also signaled to FIFO control 448 so that any corresponding contents in FIFO storage component 446 may be cleared.
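
The following sketch models this clean-on-snoop behaviour; the field and helper names are illustrative assumptions.

    #include <stdint.h>

    typedef struct {
        uint64_t tag_addr;
        int      valid, dirty;
        int      clean_on_snoop;   /* the dedicated tag 464 */
        uint8_t  data[64];
    } cache_line_t;

    extern void write_line_back_to_memory(cache_line_t *line);
    extern void fifo_clear_entry_for(uint64_t addr);   /* via FIFO control 448 */

    /* cache controller 460: a snoop from another cache has hit this line */
    void on_snoop_hit(cache_line_t *line)
    {
        if (line->clean_on_snoop) {
            if (line->dirty)
                write_line_back_to_memory(line);   /* flush the line to memory */
            line->valid = 0;                       /* evict the line immediately */
            fifo_clear_entry_for(line->tag_addr);  /* clear matching FIFO 446 content */
        }
    }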

FIG. 10 illustrates an example sequence of instructions that may be executed by a processing device in one example. In this case, it is assumed that a start instruction and an end instruction are used to identify a series of writes to be performed to non-volatile memory (which forms a point of persistence within the system). In particular, the non-volatile memory may be non-volatile byte-addressable (NVBA) memory. The first part of the sequence, labeled A, has a number of read, write and data processing instructions, where these instructions access data items not stored in the non-volatile byte-addressable memory. This is followed by an nvba_begin instruction (an example of the aforementioned start instruction), which defines the beginning of a first write set in which the read, data processing and write instructions relate (at least in part) to data items stored in the non-volatile byte-addressable memory. This first write set (labeled B) ends with an nvba_end instruction (an example of the aforementioned end instruction). Thereafter follows a second set of instructions (A) that is independent of the data items stored in the non-volatile byte-addressable memory. Finally, a second write set (B) of read, data processing and write instructions relating (at least in part) to data items stored in the non-volatile byte-addressable memory is delimited by a second nvba_begin instruction and a second nvba_end instruction. The provision of the nvba_begin and nvba_end instructions enables the programmer to define write sets whose order (relative to each other) the present technique is able to maintain, even though the accessed data items are stored in non-volatile memory and accesses to them are cached by the write-back cache.

According to the techniques described herein, each block identified by a start instruction and an end instruction is preceded by an acquire lock instruction and followed by a release lock instruction, such that the previously mentioned software protocol mechanism can be used to ensure that the processing device obtains the right to write before performing any write operations. As a result, the hardware protocol implemented by the coherency circuit need only ensure that the write data is made visible to the other processing devices upon occurrence of the flush operation triggered by the end instruction.
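
Expressed in the same C-style pseudocode (hypothetical intrinsics again; only the nvba_begin/nvba_end names come from FIG. 10, and the lock helpers are those of the first sketch), the resulting pattern for one write set is:

    #include <stdint.h>

    typedef struct region_lock region_lock_t;   /* opaque; see the first sketch */
    extern void acquire_region_lock(region_lock_t *l);
    extern void release_region_lock(region_lock_t *l);
    extern void nvba_begin(void);   /* start instruction for an NVBA write set */
    extern void nvba_end(void);     /* end instruction: triggers the flush operation */

    void nvba_write_set(region_lock_t *l, uint64_t *nvba_region)
    {
        /* block A: reads, writes and data processing unrelated to NVBA memory ... */

        acquire_region_lock(l);          /* software protocol establishes write rights */
        nvba_begin();
        nvba_region[0] += 1;             /* block B: accesses to NVBA-backed data items */
        nvba_region[1]  = nvba_region[0] * 2;
        nvba_end();                      /* hardware protocol makes the write data visible */
        release_region_lock(l);
    }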

In this application, the word "configured to …" is used to indicate that an element of an apparatus has a configuration capable of performing the defined operation. In this context, "configuration" means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware providing the defined operations, or a processor or other processing device that can be programmed to perform the functions. "configured to" does not mean that the device elements need to be changed in any way in order to provide the defined operation.

Although illustrative embodiments of the present invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes, additions and modifications may be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims. For example, various combinations of the features of the dependent claims and the features of the independent claims may be made without departing from the scope of the invention.
