Write streaming in a processor

Publication No. 425864; published 2021-12-21. "Write streaming in a processor" was created by A. A. Chachad, T. D. Anderson, and D. M. Thompson on 2020-05-26.

Abstract: In described examples, a processor system (100) includes a processor core (102) that generates memory write requests, a cache memory (116), and a memory controller (114). The memory controller (114) has a memory pipeline. The memory controller (114) is coupled to control the cache memory (116) and communicatively coupled to the processor core (102). The memory controller (114) is configured to: receive the memory write request from the processor core (102); schedule the memory write request on the memory pipeline; and send a write acknowledgement to the processor core (102) while scheduling a respective one of the memory write requests on the memory pipeline, the write acknowledgement confirming that a write of a data payload of the respective memory write request to the cache memory (116) is complete.

1. A processor system, comprising:

a processor core configured to generate a memory write request;

a cache memory; and

a memory controller having a memory pipeline, the memory controller coupled to control the cache memory and communicatively coupled to the processor core, the memory controller configured to:

receive the memory write request from the processor core;

schedule the memory write request on the memory pipeline; and

send a write acknowledgement to the processor core while scheduling a respective one of the memory write requests on the memory pipeline, the write acknowledgement confirming that a write of a data payload of the respective memory write request to the cache memory is complete.

2. The processor system of claim 1, wherein the memory controller is configured to condition the scheduling of the memory write request and the sending of the write acknowledgement on satisfaction of a condition including at least: if scheduled on the memory pipeline, the respective memory write request will meet ordering and coherency requirements and will complete in a limited amount of time.

3. The processor system of claim 2, wherein the condition is satisfied if the memory pipeline is inherently in-order, such that once a memory transaction for the cache memory enters the memory pipeline, it will read or write the cache memory in the order in which it was scheduled.

4. The processor system of claim 2, wherein the condition is satisfied if the memory pipeline includes one or more pipeline banks, and each pipeline bank is independent of other pipeline banks, such that:

write transactions on one of the pipeline banks do not affect ordering or coherency requirements with respect to write transactions on others of the pipeline banks, and

within each of the pipeline banks, write and read memory access requests to the cache memory proceed in the order in which they were scheduled.

5. The processor system of claim 2, further comprising a system clock, wherein the scheduling of the memory write request and the sending of the write acknowledgement to the processor core are performed simultaneously, meaning that both actions are performed on the same cycle of the system clock.

6. The processor system of claim 2, wherein the processor core is configured to, after receiving a write acknowledgement corresponding to the respective memory write request, unpause a process that depends on completed processing of the respective memory write request.

7. The processor system of claim 2, wherein if the condition cannot be guaranteed to be met, the memory controller is configured to suspend the memory write request.

8. The processor system of claim 2, further comprising a clock, wherein the processor core is configured to generate a memory write request at each cycle of the clock.

9. The processor system of claim 2, wherein the cache memory is a lower level cache memory and the memory controller is a lower level memory controller, the processor system further comprising:

a higher level cache memory; and

a higher level memory controller coupled to control the higher level cache memory to process memory transactions and communicatively coupled to the processor core and the lower level memory controller;

wherein the lower level memory controller at least one of receives the memory write request via the higher level memory controller or sends the write acknowledgement to the processor core via the higher level memory controller.

10. The processor system of claim 9, wherein the higher level cache memory is more local to, or has a lower response time to, memory transactions generated by the processor core than the lower level cache memory.

11. The processor system of claim 9, wherein the higher level cache memory is a level 1 cache (L1 cache) and the lower level cache memory is a level 2 cache (L2 cache).

12. A method of operating a processor system, the method comprising:

generating a memory write request using a processor core;

receiving, using a memory controller, the memory write request from the processor core;

scheduling the memory write request on a memory pipeline of the memory controller; and

sending, while the scheduling schedules a respective one of the memory write requests on the memory pipeline, a write acknowledgement to the processor core, the write acknowledgement confirming that a write of a data payload of the respective memory write request to a cache memory controlled by the memory controller is complete.

13. The method of claim 12, further comprising, prior to the scheduling and the sending, determining whether a condition is satisfied, the condition including at least: if scheduled on the memory pipeline, the respective memory write request will meet ordering and coherency requirements and will complete in a limited amount of time; and

adjusting the scheduling and the sending with respect to the respective memory write request in accordance with whether the condition is determined to be satisfied.

14. The method of claim 13, wherein the condition is satisfied if the memory pipeline is inherently in-order, such that once a memory transaction for the cache memory enters the memory pipeline, it will read or write to the cache memory in the order in which it was scheduled.

15. The method of claim 13, wherein the condition is satisfied if the memory pipeline includes one or more pipeline banks, and each pipeline bank is independent of other pipeline banks, such that:

write transactions on one of the pipeline banks do not affect ordering or coherency requirements with respect to write transactions on others of the pipeline banks, and

within each of the pipeline banks, write and read memory access requests to the cache memory proceed in the order in which they were scheduled.

16. The method of claim 13, wherein the sending is performed concurrently with the scheduling, meaning that the sending and the scheduling are performed on the same cycle of a clock of the processor system.

17. The method of claim 13, further comprising:

after the processor core receives a write acknowledgement corresponding to the respective memory write request, unpausing processes in the processor core that depend on completion of the respective memory write request.

18. The method of claim 13, further comprising:

if the condition cannot be guaranteed to be met, suspending the respective memory write request at the memory controller.

19. The method of claim 13, wherein the generating generates a memory write request on each cycle of a clock of the processor system.

20. The method of claim 13, wherein:

the processor system further includes a higher level memory controller controlling a higher level cache memory;

the cache memory is a lower level cache memory and the memory controller is a lower level memory controller; and

the lower level memory controller at least one of receives the memory write request via the higher level memory controller or sends the write acknowledgement to the processor core via the higher level memory controller.

21. The method of claim 20, wherein the higher level cache memory is a level 1 cache (L1 cache) and the lower level cache memory is a level 2 cache (L2 cache).

22. The method of claim 20, wherein the higher level cache memory is more local to, or has a lower response time to, memory transactions generated by the processor core than the lower level cache memory.

23. The method of claim 20, wherein the lower level memory controller performs the sending step by sending the write acknowledgement to the higher level memory controller.

Technical Field

This description relates generally to processing devices, such as systems on a chip (SoC), that may be formed as part of an integrated circuit. More particularly, the present description relates to such systems with improved management of write operations.

Background

An SoC is an integrated circuit on a single die having multiple functional blocks, such as one or more processor cores, memory, and inputs and outputs.

Hierarchical memories move data and instructions between memory blocks having different read/write response times for respective processor cores, such as a Central Processing Unit (CPU) or a Digital Signal Processor (DSP). For example, memory that is more local to the respective processor core typically has a lower response time. Hierarchical memories include cache memory systems having multiple levels (e.g., L1 and L2), where different levels describe different degrees of locality or different average response times of the cache memory to respective processor cores. Herein, a more local or lower response time cache memory (e.g., an L1 cache) is referred to as a higher level cache memory than a less local or higher response time cache memory (e.g., an L2 cache or an L3 cache). Cache associativity refers to how the cache memory is partitioned: a set-associative cache is divided into memory sets, each of which stores a number of blocks (ways), while a fully associative cache is not constrained by sets. Thus, for an integer N, each location in main memory (system memory) may reside in any one of N possible locations in an N-way set-associative cache.
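
As a hedged illustration of N-way set associativity (not part of the described system), a minimal sketch in C of how an address might be split into a tag and a set index; the cache geometry below is an assumption for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 32 KiB cache, 64-byte lines, 4 ways. */
#define LINE_SIZE  64u
#define NUM_WAYS   4u
#define CACHE_SIZE (32u * 1024u)
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * NUM_WAYS)) /* 128 sets */

/* Split an address into set index and tag. A line may reside in any
 * of the NUM_WAYS slots of its set; in a fully associative cache
 * NUM_SETS would be 1, so any line could reside in any slot. */
static void decode(uint32_t addr, uint32_t *set, uint32_t *tag)
{
    *set = (addr / LINE_SIZE) % NUM_SETS;
    *tag = addr / (LINE_SIZE * NUM_SETS);
}

int main(void)
{
    uint32_t set, tag;
    decode(0x8001A40u, &set, &tag);
    printf("set %u, tag 0x%x\n", (unsigned)set, (unsigned)tag);
    return 0;
}
```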

The "victim cache" memory caches data (e.g., cache lines) evicted from a cache memory (e.g., the L1 cache). If the L1 cache read results in a miss (data corresponding to a portion of main memory is not stored in the L1 cache), then a lookup occurs in the victim cache. If the victim cache lookup results in a hit (data corresponding to the requested memory address is present in the victim cache), then the contents of the victim cache location that produced the hit are swapped with the contents of the corresponding location in the respective cache (the L1 cache in this example). Some example victim caches are fully associative. Data corresponding to any location in main memory may be mapped to (stored in) any location in the fully associative cache.

Disclosure of Invention

In described examples, a processor system includes a processor core that generates memory write requests, a cache memory, and a memory controller. The memory controller has a memory pipeline. The memory controller is coupled to control the cache memory and communicatively coupled to the processor core. The memory controller is configured to: receive the memory write request from the processor core; schedule the memory write request on the memory pipeline; and send a write acknowledgement to the processor core while scheduling a respective one of the memory write requests on the memory pipeline, the write acknowledgement confirming that a write of a data payload of the respective memory write request to the cache memory is complete.

Drawings

FIG. 1 is a block diagram of an example processor that is part of an SoC.

FIG. 2 is a block diagram of an example memory pipeline for the SoC of FIG. 1.

FIG. 3 is an example of a process for a memory write operation for the SoC of FIG. 1.

FIG. 4 is a block diagram of an example memory pipeline, including a bypass path, for the SoC of FIG. 1.

FIG. 5 is an example of a process for a memory write operation for the SoC of FIG. 1.

Detailed Description

FIG. 1 is a block diagram of an example processor 100 that is part of an SoC 10. The SoC 10 includes a processor core 102, such as a CPU or DSP, that generates new data. Processor 100 may include a clock 103; clock 103 may be part of processor core 102 or may be separate from processor core 102 (a separate clock is not shown). The processor core 102 also generates memory read requests that request reads, and memory write requests that request writes, via a data memory controller 104 (DMC) and a stream engine 106. In some embodiments, processor core 102 generates one read request or write request per processor core clock cycle. A memory controller returns a write acknowledgement to processor core 102 to confirm that a requested memory write was performed. Processor core 102 is also coupled to receive instructions from a program memory controller 108 (PMC). The stream engine 106 facilitates processor core 102 sending certain memory transactions and other memory-related messages that bypass the DMC 104 and PMC 108.

The SoC 10 has a hierarchical memory system. Each cache at each level may be unified or divided into separate data and program caches. For example, the DMC 104 may be coupled to a level 1 data cache 110 (L1D cache) to control data writes to the L1D cache 110 and data reads from the L1D cache 110. Similarly, the PMC 108 may be coupled to a level 1 program cache 112 (L1P cache) to read instructions from the L1P cache 112 for execution by processor core 102. (In this example, processor core 102 does not generate writes to the L1P cache 112.) The L1D cache 110 may have an L1D victim cache 113. A unified memory controller 114 (UMC) for a level 2 cache (L2 cache 116, such as L2 SRAM) is communicatively coupled to receive read and write memory access requests from the DMC 104, and to receive read requests from the stream engine 106, the PMC 108, and a memory management unit 117 (MMU). (The example L2 controller UMC 114 is referred to as a "unified" memory controller in the example system because UMC 114 can store both instructions and data in the L2 cache 116.) UMC 114 is communicatively coupled to pass read data and write acknowledgements (from beyond the level 1 caches) to the DMC 104, the stream engine 106, and the PMC 108, and onward to processor core 102. UMC 114 is also coupled to control writes to and reads from the L2 cache 116, and to pass memory access requests from the L2 cache 116 to a level 3 cache controller 118 (L3 controller). UMC 114 is coupled to receive write acknowledgements and read data from the L2 cache 116 and (via the L3 controller 118) from the L3 cache 119. UMC 114 is configured to control the pipelining of memory transactions for program content and data content (read and write requests for instructions and data, data transmission, and write acknowledgements). The L3 controller 118 is coupled to control writes to and reads from the L3 cache 119, and to mediate transactions with external functions 120 that are external to the processor 100, such as other processor cores, peripheral functions of the SoC 10, and/or other SoCs (and also to control snoop transactions). Thus, the L3 controller 118 is a shared memory controller of the SoC 10, and the L3 cache 119 is a shared cache memory of the SoC 10. Memory transactions between the processor 100 and external functions 120 pass through the L3 controller 118.

Memory transactions are generated by processor core 102 and communicated toward lower level cache memory, or generated by external functions 120 and communicated toward higher level cache memory. Memory transactions may also be initiated by a cache controller; for example, a victim write transaction may be initiated by UMC 114 in response to a read transaction from processor core 102 that generates a miss in the L2 cache 116.

The MMU 117 provides address translation and memory attribute information to the processor core 102. It does this by looking up information in a table stored in memory (the connection between MMU 117 and UMC 114 enables MMU 117 to access the memory containing the table using a read request).

FIG. 2 is a block diagram of an example memory pipeline 200 included within or associated with UMC 114 of FIG. 1; for illustration, FIG. 2 also repeats the various blocks from FIG. 1 that communicate with UMC 114. The memory pipeline 200 includes an initial scheduling block 202 coupled to an integer number M of pipeline banks 206. Each pipeline bank 206 includes an integer number P of stages 208 and is illustrated as a vertical column below the initial scheduling block 202. The DMC 104 is coupled to the initial scheduling block 202 by a bus 204-1 that is N1 lines wide, enabling the DMC 104 to provide read or write requests transferring N1 bits of data at a time. The stream engine 106 is coupled to the initial scheduling block 202 by a bus 204-2 that is N2 lines wide, enabling the stream engine 106 to provide read requests transferring N2 bits of data at a time. The PMC 108 is coupled to the initial scheduling block 202 by a bus 204-3 that is N3 lines wide, enabling the PMC 108 to provide read requests transferring N3 bits of data at a time. The L3 controller 118 is coupled to the initial scheduling block 202 by a bus 204-4 that is N4 lines wide, enabling the L3 controller 118 to provide read or write requests transferring N4 bits of data at a time. The MMU 117 is coupled to the initial scheduling block 202 by a bus 204-5 that is N5 lines wide, enabling the MMU 117 to provide read requests transferring N5 bits of data at a time.
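
For illustration only, the organization this paragraph describes (one scheduler feeding M in-order banks of P stages) might be modeled as below; the dimensions are hypothetical, since the text leaves M, P, and the bus widths N1-N5 as unspecified integers:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical dimensions. */
#define M_BANKS  4
#define P_STAGES 6

struct mem_request {
    bool     is_write;
    uint64_t addr;
    /* data payload, requestor id, etc. omitted */
};

/* Each bank is an in-order column of P stages: stage 0 is filled by
 * the initial scheduling block, and a request advances one stage per
 * cycle until it commits at the final stage. */
struct pipeline_bank {
    struct mem_request stage[P_STAGES];
    bool               valid[P_STAGES];
};

/* The memory pipeline: one scheduler feeding M independent banks. */
struct memory_pipeline {
    struct pipeline_bank bank[M_BANKS];
};
```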

When a memory controller of the processor 100 (e.g., DMC 104, stream engine 106, PMC 108, MMU 117, or L3 controller 118) communicates a request to UMC 114 to read from or write to a memory mediated by UMC 114 (e.g., memory in the L2 cache 116, the L3 cache 119, or external functions 120), the initial scheduling block 202 schedules the request to be handled by the pipeline bank 206 appropriate for that particular request. Thus, the initial scheduling block 202 performs arbitration of read and write requests. Arbitration determines which pipeline bank 206 will receive which of the memory transactions queued at the initial scheduling block 202, and in what order. Typically, a read or write request is scheduled into a corresponding one of the pipeline banks 206 depending on, for example, the memory address of the data being written or requested, the request load of the pipeline banks 206, or a pseudorandom function. The initial scheduling block 202 schedules read and write requests received from the DMC 104, the stream engine 106, the PMC 108, and the L3 controller 118 by selection into a first stage of a pipeline bank 206. Memory transactions requesting execution by the L3 cache 119 (or external functions 120) are arbitrated by, and scheduled into the L3 cache pipeline by, an L3 cache scheduling block 404 in the L3 controller 118 (see FIG. 4) after passing through the portion of the memory pipeline 200 corresponding to the L2 cache 116 (the pipeline banks 206 and possibly bus snoop-related stages, which are not shown).
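
A hedged sketch in C of one possible bank-selection policy; the text mentions address-based, load-based, or pseudorandom scheduling, and this sketch assumes the address-based variant (the shift amount and bank count are hypothetical):

```c
#include <stdint.h>

#define M_BANKS    4   /* hypothetical */
#define LINE_SHIFT 6   /* hypothetical 64-byte lines */

/* Map a request to a pipeline bank using address bits above the line
 * offset, so that all accesses to a given line land in the same bank
 * and therefore proceed in their scheduled order within that bank. */
static unsigned pick_bank(uint64_t addr)
{
    return (unsigned)((addr >> LINE_SHIFT) % M_BANKS);
}
```

Mapping by address keeps all transactions to a given line in one in-order bank, which is one way the per-bank ordering property described below can be preserved.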

Request scheduling prevents conflicts between read or write requests that will be handled by the same pipeline bank 206 and maintains memory coherency (described further below). For example, request scheduling maintains an order between memory transactions placed into a memory transaction queue (memory access request queue) of the initial scheduling block 202 by different memory controllers of the processor 100 or different buses of the same memory controller.

Further, a pipelined memory transaction (read or write request) sent by the DMC 104 or PMC 108 reaches the memory pipeline 200 because the memory transaction has passed through the corresponding level 1 cache pipeline (in the DMC 104 for the L1D cache 110, and in the PMC 108 for the L1P cache 112) and is directed to a lower level cache or memory endpoint (or external function 120), or has generated a miss in the corresponding level 1 cache, or has bypassed the L1D cache 110 because the corresponding data payload of the write request cannot be cached by the L1D cache 110. Typically, a memory transaction directed to the DMC 104 or PMC 108 that generates a level 1 cache hit results in a write acknowledgement from the L1D cache 110, or a response with data or instructions read from the L1D cache 110 or L1P cache 112, respectively. Thus, memory transactions that generate level 1 cache hits typically do not require access to the pipeline banks 206 shown in FIG. 2, which control or mediate memory accesses to the L2 cache 116, the L3 cache 119, and external functions 120 (see FIG. 1).

The pipeline banks 206 shown in FIG. 2 are part of the UMC 114. The L1D cache 110 may hold data generated by processor core 102. For example, external functions 120 may access data in the L1D cache 110 after the data is written to the L2 cache 116 or the L3 cache 119, or may read or evict the data from the L1D cache 110 using snoop transactions controlled by the L3 controller 118 and managed by UMC 114 (the L2 controller, acting as an agent).

Memory coherency means that the contents of memory in the system (or at least the contents considered or indicated as valid) are the same as what one or more processors in the system expect based on an ordered stream of read and write requests. Writes affecting particular data or particular memory locations are prevented from bypassing earlier issued writes or reads affecting the same data or the same memory locations. In addition, certain types of transactions have priority, such as victim cache transactions and snoop transactions.

Bus snooping is a scheme by which a coherency controller (snooper) in a cache monitors or snoops bus transactions to maintain memory coherency in a distributed shared memory system, such as the SoC 10. If a transaction modifying a shared cache block appears on the bus, the snoopers check whether their respective caches have a copy of the shared block. If a cache has a copy of the shared block, the corresponding snooper performs an action to ensure memory coherency in that cache. For example, depending on the transaction detected on the bus, this action may be to flush, invalidate, or update the shared block.
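
As a hedged illustration of the snooper actions just described, a minimal sketch in C; the transaction types and the flush/invalidate choices follow a generic invalidation-based protocol, an assumption rather than the SoC 10's actual protocol:

```c
#include <stdbool.h>
#include <stdint.h>

enum bus_op { BUS_READ, BUS_WRITE };

struct cache_line {
    bool     valid;
    bool     dirty;
    uint32_t tag;
};

/* Called by a snooper for every bus transaction it observes.
 * If this cache holds a copy of the block, take an action that
 * keeps copies coherent: write back (flush) a dirty copy so the
 * reader sees current data, and invalidate on a remote write. */
static void snoop(struct cache_line *line, uint32_t tag, enum bus_op op)
{
    if (!line->valid || line->tag != tag)
        return;                  /* no copy: nothing to do */

    if (line->dirty) {
        /* flush: write the dirty block back to shared memory */
        line->dirty = false;
    }
    if (op == BUS_WRITE)
        line->valid = false;     /* invalidate our stale copy */
}
```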

A "write stream" refers to a stream of write requests issued by a device (e.g., processor core 102), such as one write request per cycle, without suspension. For example, the write stream may be interrupted by a pause due to a full buffer or an insufficient number of write request identifiers. The ability to cause write requests to be drawn from the memory transaction queue as quickly as possible facilitates write streaming.

In order for processor core 102 to know that a write has completed, it must receive a write acknowledgement. To maintain coherency, processor core 102 may limit itself to a given number of outstanding write requests, stalling write requests that would exceed the limit until write acknowledgements for outstanding write requests are received. Thus, processor core 102 and L1D cache 110 may wait for a write acknowledgement (or "handshake") before proceeding, halting the corresponding write streaming process within processor core 102. An interruption of the write stream may also be caused by the processor core 102 or DMC 104 waiting for a write acknowledgement from a previous write request. Processor core 102 may also be configured to pause while waiting for write acknowledgements with respect to certain operations, such as fence operations. Write completions in lower level caches (e.g., the L2 cache 116 or L3 cache 119) are detected by the DMC 104 (the level 1 cache controller) using write acknowledgements (handshakes) forwarded by the UMC 114 (the level 2 cache controller). However, due to various pipeline requirements including arbitration, ordering, and coherency, a write may require many cycles to complete.
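
A hedged sketch in C of the self-limiting behavior described above: the core stalls once a fixed number of writes is outstanding, and each acknowledgement frees a slot. The limit and the counter mechanism are illustrative assumptions:

```c
#include <stdbool.h>

#define MAX_OUTSTANDING_WRITES 8   /* hypothetical limit */

static unsigned outstanding_writes;

/* Issue side: a new write may leave the core only if it would not
 * exceed the outstanding-write limit; otherwise the write stream
 * stalls until an acknowledgement frees a slot. */
static bool try_issue_write(void)
{
    if (outstanding_writes >= MAX_OUTSTANDING_WRITES)
        return false;              /* core pauses here */
    outstanding_writes++;
    return true;
}

/* Acknowledgement side: each write acknowledgement retires one
 * outstanding write, potentially unpausing the core. */
static void on_write_ack(void)
{
    if (outstanding_writes > 0)
        outstanding_writes--;
}
```

The described system shortens the stall window by sending the acknowledgement when the write is scheduled rather than when it commits, so on_write_ack runs sooner.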

At the first level of arbitration, performed by the initial scheduling block 202, UMC 114 (the L2 cache controller, which includes the initial scheduling block 202) determines whether to allow a memory transaction to proceed in the memory pipeline 200, and in which of the pipeline banks 206. Writes to the L2 cache 116 typically have few operations between (1) initial arbitration and scheduling and (2) write completion. The remaining operations of a scheduled write request may include, for example, checking for errors (such as firewall, addressing, and out-of-range errors), read-modify-write actions (updating an error check code of the data payload of the write request), and committing the data payload of the write request to memory. Typically, each pipeline bank 206 is independent, such that write transactions on a pipeline bank 206 (e.g., data writes from the L1D cache 110 to the L2 cache 116) have no ordering or coherency requirements with respect to write transactions on other pipeline banks 206. Within each of the pipeline banks, writes to the L2 cache 116 occur in their scheduled order. Relative ordering is maintained with respect to partial writes that cause read-modify-write transactions. If a memory transaction would cause an addressing hazard or violate ordering requirements, the transaction is suspended and not issued to a pipeline bank 206. (A partial write is a write request with a data payload that is smaller than the minimum write length of the destination cache. A partial write triggers a read-modify-write transaction, in which data is read from the destination cache to fill the data payload of the write request out to the minimum write length of the destination cache, and an updated Error Correction Code (ECC) is generated from, and appended to, the resulting filled data payload. The filled data payload with the updated ECC is the content written to the destination cache.)
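
As a hedged illustration of the read-modify-write handling of a partial write, a sketch in C; the 16-byte minimum write length and the toy ECC function are assumptions (the text does not specify the code):

```c
#include <stdint.h>
#include <string.h>

#define MIN_WRITE_BYTES 16   /* hypothetical minimum write length */

/* Toy ECC stand-in; the real code width and polynomial are not given. */
static uint8_t ecc_of(const uint8_t *buf)
{
    uint8_t e = 0;
    for (int i = 0; i < MIN_WRITE_BYTES; i++)
        e ^= buf[i];
    return e;
}

/* A partial write (len < MIN_WRITE_BYTES, offset + len within the
 * block) cannot be committed directly: read the full block from the
 * cache, merge the new bytes, regenerate the ECC over the filled
 * payload, and write block plus ECC back. */
static void read_modify_write(uint8_t *cache_block, uint8_t *cache_ecc,
                              unsigned offset, const uint8_t *data,
                              unsigned len)
{
    uint8_t block[MIN_WRITE_BYTES];

    memcpy(block, cache_block, MIN_WRITE_BYTES);   /* read   */
    memcpy(block + offset, data, len);             /* modify */
    *cache_ecc = ecc_of(block);                    /* new ECC */
    memcpy(cache_block, block, MIN_WRITE_BYTES);   /* write  */
}
```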

Because of these characteristics of the memory pipeline 200, once a write is scheduled within a pipeline bank 206 (e.g., a write of data from the L1D cache 110 to the L2 cache 116), the write is guaranteed to follow all ordering requirements and not violate coherency (thus satisfying the conditions that must be met to avoid violating ordering and coherency). A (variable) number of cycles may be required to commit the write to memory, but a read issued after this write is issued will "see" the write. Thus, if the requested data or memory location modified by the write is read, the read will retrieve the version of the data or the contents of the memory location specified by the write, rather than a previous version. Write-write ordering is also maintained. L3 cache 119 write requests may also be scheduled by the memory pipeline 200 (by the UMC 114) such that in-order completion of the L3 cache 119 write requests is guaranteed. These guarantees mean that write requests dispatched by the memory pipeline 200 (the L2 cache pipeline) into the pipeline banks 206 can be guaranteed to meet ordering and coherency requirements and to complete in a limited amount of time. Put another way, the guarantee is that a write transaction to a particular address, currently being scheduled onto a pipeline bank 206, will have its value "committed" to memory after previously scheduled write transactions to the same address and before later scheduled write transactions to the same address (the write will complete and store the corresponding data payload in memory). This guarantee may be based on the pipeline being inherently "in order," so that once a command enters the pipeline, it will write to memory (commit) in the order in which it was scheduled. In other words, there is no bypass path within the pipeline. (The bypass path described below is handled so that it does not violate ordering guarantees, e.g., with respect to older transactions targeting the same memory address.)

"simultaneously with … …" is defined herein to mean simultaneously or immediately thereafter. Thus, a first event occurring "simultaneously" with a second event may mean that the two events occur on the same cycle of the system clock.

UMC 114 (the L2 controller) sends to DMC 104 (the L1 controller) a write acknowledgement for a write of data (e.g., data from the L1D cache 110) to the L2 cache 116 or a higher level cache, simultaneously with the initial scheduling block 202 (the first level of arbitration) scheduling the corresponding write request. Thus, the write acknowledgement indicating write completion is sent concurrently with the write request being scheduled, rather than after the memory pipeline 200 completes processing of the write request. This accelerated acknowledgement is enabled by the guarantee that scheduled write requests will complete in order and meet coherency requirements. UMC 114 creates the illusion that a write request completes on the cycle on which it is scheduled, rather than the cycle on which the corresponding data is committed (written) to memory. From the perspective of the processor core 102 or DMC 104, the L2 cache 116 appears to complete the write request immediately upon scheduling the write request. This enables the DMC 104 to unpause the processor core 102 more quickly (or prevent the processor core 102 from stalling), and to draw write requests from the queue with less delay (faster), thereby improving overall performance. The queues are transaction queues in UMC 114 into which transactions are sent by respective "masters" (functional blocks that can send memory transactions to UMC 114 for queuing), such as the DMC 104, the stream engine 106, the PMC 108, the MMU 117, and the L3 controller 118. A queue may be implemented as a holding stage in which memory transactions reside while waiting to be arbitrated by the initial scheduling block 202 and scheduled to a pipeline bank 206.
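
A hedged sketch in C of the early acknowledgement described above: the acknowledgement is sent on the same cycle the write is scheduled into a pipeline bank, not when the data later commits. All function names here are hypothetical stand-ins:

```c
#include <stdbool.h>
#include <stdint.h>

struct write_req {
    uint64_t addr;
    /* data payload, requestor id, etc. omitted */
};

/* Hypothetical helpers. safe_to_schedule() is true when the request,
 * once scheduled, is guaranteed to meet ordering and coherency
 * requirements and to complete in a bounded number of cycles (the
 * in-order pipeline bank property described above). */
extern bool safe_to_schedule(const struct write_req *w);
extern void schedule_into_bank(const struct write_req *w);
extern void send_write_ack(const struct write_req *w);

static bool arbitrate_write(const struct write_req *w)
{
    if (!safe_to_schedule(w))
        return false;          /* hold the request in the queue */

    schedule_into_bank(w);     /* the commit happens cycles later... */
    send_write_ack(w);         /* ...but the ack goes out now, on the
                                  same cycle as the scheduling */
    return true;
}
```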

Processor core 102 is typically configured to read data from memory in order to process the data. The same is true of the processor cores of other processors of the SoC 10 (processors similar to processor 100), with respect to memory accessible by those other processors. However, for other processors of the SoC 10 to be able to access data generated by processor core 102, that data must be made available outside the generating processor 100. This means the generated data passes through the L3 controller 118, to be accessible either within the shared memory (L3 cache 119) or externally by transmission to external functions 120.

FIG. 3 is an example of a process 300 for a memory write operation for the SoC 10 of FIG. 1. In step 302, the initial scheduling block 202 schedules a write request directed to the L2 cache 116 onto a pipeline bank 206. In step 304, UMC 114 (the L2 cache controller) sends a write acknowledgement directly to processor core 102 as the write request is dispatched to the pipeline bank. In step 306, the processor core 102 is unpaused in response to the write acknowledgement.

FIG. 4 is a block diagram of an example memory pipeline 400 for the SoC 10 of FIG. 1. The memory pipeline 400 as shown in FIG. 4 is similar to the memory pipeline 200 as shown in FIG. 2, but also includes a bypass path 402. The bypass path 402 couples the initial scheduling block 202 to an L3 cache scheduling block 404, thus skipping the pipeline stages corresponding to at least one level of cache. For example, the bypass path 402 enables bypass writes to skip the portion of the memory pipeline 400 associated with the L2 cache 116, thereby reducing the overall processing time of the bypass writes. Thus, writes to certain memory regions may be written to the L3 cache 119, rather than to the L2 cache 116 (or another lower level cache). Such writes (bypass writes) may be safely performed without accessing the memory pipeline stages associated with the L2 cache 116, which include the pipeline banks 206 and associated bus snooping (not shown). This simplifies the coherency requirements of bypass writes.

To allow bypass writes to use the bypass path 402, memory coherency imposes ordering requirements. For example, writes may not bypass other writes (write-write ordering); writes may not bypass reads (read-write ordering); writes may not bypass victim cache transactions (e.g., write requests from the L1D victim cache 113 related to the L1D cache 110, which may be handled by UMC 114 when caused by a cache miss of a victim cache coherent write request); and writes may not bypass snoop responses (e.g., a snoop response corresponding to the controller requesting a write from the victim cache to the L1D cache 110 when the request is not caused by a cache miss). Victim cache transactions and snoop responses have high priority because they constitute memory synchronization events. Furthermore, L1D cache 110 victims traverse the full pipeline because, for example, they involve updating internal state in UMC 114. However, a bypass write directly following a victim cache transaction may also be prioritized, ensuring that the bypass write will not be blocked or suspended (similar to slipstreaming). This prioritized state applies only to a single bypass write immediately following a victim cache transaction, and ends when the bypass write is sent to the L3 controller 118 for further processing.

The initial scheduling block 202 may have a designated bypass state (or "bypass mode") in which bypass writes may be scheduled onto the bypass path 402, rather than into the full memory pipeline 400 (including the pipeline banks 206). When the initial scheduling block 202 is in the bypass mode, a bypass write skips the entire pipeline of the intermediate level cache, including the associated internal arbitration. When the initial scheduling block 202 is not in the bypass mode, bypass writes proceed through the full memory pipeline 400.

FIG. 5 is an example of a process 500 for a memory write operation of the SoC 10 of FIG. 1. FIG. 5 depicts conditions for the initial scheduling block 202 to enter and remain in bypass mode, such that bypass writes are allowed to proceed along the bypass path 402 of FIG. 4. In step 502, the initial scheduling block 202 determines whether the next memory transaction in the queue is a bypass write, i.e., carries data or is associated with a memory location guaranteed not to be written to the L2 cache 116 (e.g., because the data payload is too large to fit in the L2 cache 116, or contains a data type not stored by the L2 cache 116). If not, the next memory transaction is processed using normal (non-bypass) handling (through the full memory pipeline 400) according to step 504, and the process returns to step 502. In step 506, the initial scheduling block 202 determines whether a write to or from the L1D victim cache is in the pipeline (in a pipeline bank 206). If so, the victim cache write is prioritized as a memory synchronization event, and in step 504 the bypass write is not allowed to proceed along the bypass path 402; instead, the bypass write receives normal (non-bypass), though prioritized, handling, and the process 500 returns to step 502. In step 508, the initial scheduling block 202 determines whether an L1D write (a write from the DMC 104; in some embodiments, any write request generated outside the memory pipeline 200) is in the pipeline bank 206 to which the bypass write would be scheduled. If so, then in step 510 the bypass write is delayed until the L1D writes (all L1D writes in the corresponding pipeline bank 206; in some embodiments, all write requests generated outside the memory pipeline 200) are flushed from the corresponding pipeline bank 206 (e.g., the corresponding writes commit to memory, or reach the L3 cache scheduling block 404 for scheduling into the corresponding memory pipeline of the L3 controller 118). Otherwise, in step 512, the initial scheduling block 202 enters bypass mode 514, and the bypass write is sent along the bypass path 402 to the L3 controller 118 (and from there to the L3 cache 119). Next, in step 516, the initial scheduling block 202 checks whether the next memory transaction in its queue is a bypass write. If so, the process 500 remains in bypass mode by returning to step 512, and the next memory transaction in the queue (a bypass write) is sent to the L3 cache scheduling block 404 along the bypass path 402. Thus, after a bypass write has met the conditions for using the bypass path 402, sequentially queued bypass writes may also be sent to the L3 cache scheduling block 404 via the bypass path 402 without re-checking the conditions addressed in steps 506 and 508. If the next memory transaction in the queue of the initial scheduling block 202 is not a bypass write, then in step 518 the initial scheduling block 202 exits bypass mode, the next memory transaction in the queue receives normal (non-bypass) handling, and the process 500 returns to step 502.
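
As a hedged illustration, the decision flow of process 500 expressed as a C sketch; the predicate and action names are hypothetical stand-ins for the checks and actions in steps 502-518:

```c
#include <stdbool.h>

/* Hypothetical predicates and actions standing in for FIG. 5. */
extern bool next_txn_is_bypass_write(void);   /* steps 502, 516 */
extern bool victim_write_in_pipeline(void);   /* step 506 */
extern bool l1d_write_in_target_bank(void);   /* step 508 */
extern void process_normally(void);           /* step 504 */
extern void wait_for_bank_to_drain(void);     /* step 510 */
extern void send_on_bypass_path(void);        /* step 512 */

static void schedule_next(void)
{
    if (!next_txn_is_bypass_write()) {        /* 502 */
        process_normally();                   /* 504 */
        return;
    }
    if (victim_write_in_pipeline()) {         /* 506 */
        process_normally();  /* prioritized, but not bypassed (504) */
        return;
    }
    while (l1d_write_in_target_bank())        /* 508 */
        wait_for_bank_to_drain();             /* 510 */

    /* Bypass mode (514): consecutive bypass writes stream down the
     * bypass path without re-checking steps 506 and 508. */
    do {
        send_on_bypass_path();                /* 512 */
    } while (next_txn_is_bypass_write());     /* 516 */
    /* 518: exit bypass mode; next transaction handled normally. */
}
```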

In some embodiments, the stream engine passes, and returns responses to, both read and write requests.

In some embodiments, the processor may include multiple processor cores (embodiments with multiple processor cores are not shown) with functional couplings to the DMC, the stream engine, and the PMC similar to those shown in and described with respect to fig. 1.

In some embodiments, a bus implementing parallel read or write requests may correspond to different types of read or write requests, such as for different memory blocks or for different purposes.

In some embodiments, the stream engine enables the processor core to communicate directly with the lower level cache (e.g., the L2 cache), skipping the higher level cache (e.g., the L1 cache), to avoid data synchronization issues. This can help maintain memory coherency. In some such embodiments, the stream engine may be configured to transmit only read requests, rather than both read and write requests.

In some embodiments, the L3 cache or other lower level memory may schedule write requests so that write acknowledgements may be sent to the DMC (or processor core or another lower level memory controller) at the same time the write requests are scheduled to the corresponding memory pipeline.

In some embodiments, different banks of memory access pipelines may have different numbers of stages.

In some embodiments, a processor in an external function may access data stored in the L2 cache; in some such embodiments, coherency is not guaranteed between the content stored in the L2 cache and the content cached in other processors in the external functions.

In some embodiments, a write may be a bypass write if the data contained is too large for a lower level cache (e.g., an L1D cache or an L2 cache).

In some embodiments, a write may be a bypass write if the page attributes mark the write as corresponding to a device type memory region that is not cached by a UMC (L2 cache controller).

In some embodiments, the L1D cache (or other lower level cache) may cache the data payload of a bypass write.

In some embodiments, the memory coherency rules of the processor prohibit a bypass write from bypassing L1D victim cache writes and L1D writes, while permitting it to bypass other memory transactions (memory read requests or memory write requests).

In some embodiments, the bypass path takes the bypass write directly to the final arbitration stage before it is scheduled into a memory bank of the L3 cache (not shown).

In some embodiments, the guarantee that memory writes will never be directed to the L2 cache (corresponding to a bypass write) includes a configuration selection that is never changed, or a guarantee that such writes to the L2 cache are not possible.

In some embodiments, the guarantee that memory writes will never be directed to the L2 cache includes the L2 cache not holding a copy or hash of the corresponding data.

In some embodiments, the guarantee that a memory write will never be directed to the L2 cache may change (e.g., the guarantee may be newly instituted where it was not previously in force). In such embodiments, newly making this guarantee while a corresponding line is cached in the L2 cache (e.g., changing a corresponding mode register to make the guarantee) may require flushing the corresponding L2 cache contents. In this case, the cache flush avoids a cached copy of a data payload, now guaranteed not to be written to the L2 cache, remaining in the L2 cache after the guarantee is made.

In some embodiments, the cache controller initiates memory transactions only in response to transactions initiated by the processor core 102 or the external functions 120.

In some embodiments, a read or write request can be scheduled into only a specific corresponding one of the pipeline banks 206, depending on, for example, the memory address of the data being written or requested, the request load of the pipeline bank 206, or a pseudorandom function.

Modifications are possible in the described embodiments, and other embodiments are possible within the scope of the claims.
