Adaptive load merging


Reading note: this technique, "Adaptive load merging" (适应性加载合并), was created by 玛布·埃尤勒 and 米希尔·威廉·万·托尔 on 2021-04-28. Its main content is as follows: The present disclosure relates to adaptive load coalescing. Apparatus and methods for processing load requests are disclosed. In response to a load request specifying a data item to be retrieved from memory, a series of data items including the data item identified by the load request is retrieved. The load request is buffered before it is executed. Merge circuitry determines whether an address proximity condition is true for the load request and a set of one or more other load requests buffered in pending load buffer circuitry. The address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items. When the address proximity condition is true, the set of one or more other load requests is suppressed. Merge prediction circuitry generates a merge prediction for each load request based on previous processing of the load request by the merge circuitry.

1. An apparatus, comprising:

load processing circuitry responsive to a load request specifying a data item to retrieve from memory a series of data items including the data item identified by the load request;

pending load buffer circuitry to buffer load requests to retrieve from memory the data items specified by the load requests before they are executed by said load processing circuitry;

merge circuitry to determine whether an address proximity condition is true for the load request and a set of one or more other load requests buffered in the pending load buffer circuitry, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items,

wherein the merge circuitry inhibits processing of the set of one or more other load requests by the load processing circuitry in response to the address proximity condition being true; and

merge prediction circuitry to generate a merge prediction for a load request based on previous processing of the load request by the merge circuitry.

2. Apparatus as claimed in claim 1, wherein processing of said load request by said merge circuitry is suppressed in response to a no-merge prediction.

3. An apparatus as claimed in claim 1 or claim 2, further comprising a feedback path from the merge circuit to the merge prediction circuit, wherein the merge circuit is responsive to validity of the address proximity condition to notify the merge prediction circuit via the feedback path of a merge result that is dependent on validity of the address proximity condition for the load request.

4. Apparatus as claimed in any preceding claim, wherein said merge prediction circuitry comprises a merge history store to hold content dependent on previous processing of load requests by said merge circuitry, and said merge prediction circuitry is arranged to generate a merge prediction for a load request based on the content of said merge history store.

5. An apparatus as claimed in claim 4, wherein the content held by the merge history store comprises a probabilistic data structure.

6. An apparatus as claimed in claim 5, wherein the merge history store comprises a bloom filter.

7. An apparatus as claimed in claim 5 or claim 6, further comprising a hash generation unit to generate contents of the probabilistic data structure, wherein the hash generation unit comprises a hash sequencer to generate the contents of the probabilistic data structure in a plurality of hash stages.

8. An apparatus as claimed in any one of claims 4-7, further comprising a reset circuit to reset contents of the merge history store in response to at least one of:

elapse of a predetermined period of time;

receipt of a context switch indication; and/or

a predetermined fullness of the merge history store being reached.

9. An apparatus as claimed in any one of claims 1 to 8, further comprising a bypass path to said load processing circuitry which bypasses said merge circuitry, wherein said merge prediction circuitry is responsive to a no-merge prediction to cause said load request to be provided to said load processing circuitry via said bypass path.

10. An apparatus as claimed in any one of claims 1-8, further comprising:

second load processing circuitry responsive to the load request specifying the data item to retrieve from the memory the series of data items including the data item identified by the load request; and

load process selection circuitry responsive to a merge prediction generated by the merge prediction circuitry to direct the load request to the load processing circuitry, or to the second load processing circuitry, in dependence upon the merge prediction.

11. An apparatus as claimed in claim 10, further comprising:

a first load request issue queue to feed the pending load buffer circuitry; and

a second load request issue queue to feed the second load processing circuitry,

wherein the load process selection circuitry is responsive to a merge prediction generated by the merge prediction circuitry to direct the load request to either the first load request issue queue or the second load request issue queue in dependence upon the merge prediction.

12. Apparatus as claimed in any preceding claim, wherein said load request originates from a load instruction, and said apparatus further comprises an instruction cache store to store an entry for a frequently used instruction, wherein said instruction cache store is arranged to store said merge prediction in association with said load instruction.

13. Apparatus as claimed in claim 12, wherein said instruction cache store is a micro-operation cache.

14. An apparatus as claimed in any preceding claim, further comprising a data dependency determination circuit to determine data dependencies between instructions, wherein the merge prediction circuit is arranged to generate the merge prediction in dependence on at least one data dependency determined by the data dependency determination circuit.

15. An apparatus as claimed in claim 14, wherein said data dependency determination circuit comprises a register renaming circuit.

16. Apparatus as claimed in any preceding claim, wherein said merge prediction circuitry is arranged, by default, to generate a merge prediction indicating merging for said load request.

17. A method of data processing, comprising:

buffering the load request to retrieve from memory the data item specified by the load request before the load request is executed, wherein executing the load request specifying the data item includes retrieving from memory a series of data items including the data item identified by the load request;

determining whether an address proximity condition is true for the load request and a set of one or more other buffered load requests, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items;

suppressing execution of the set of one or more other load requests when the address proximity condition is true; and

generating a merge prediction for the load request based on previous processing of the load request.

18. An apparatus, comprising:

means for buffering the load request to retrieve from memory the data item specified by the load request before the load request is executed, wherein executing the load request specifying the data item comprises retrieving from memory a series of data items including the data item identified by the load request;

means for determining whether an address proximity condition is true for the load request and a buffered set of one or more other load requests, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items;

means for suppressing execution of the set of one or more other load requests when the address proximity condition is true; and

means for generating a merge prediction for a load request based on previous processing of the load request.

Technical Field

The present technology relates to data processing. More particularly, the present technology relates to the processing of load requests in a data processing apparatus.

Background

In performing load operations, a data processing apparatus may be provided with the ability to merge together certain loads that have been identified as having a suitable degree of spatial proximity, so that they can be serviced simultaneously. More specifically, it may be identified that the spatial proximity of two load requests is sufficiently close that only one memory access need be performed, which will return the data requested by each of the spatially-adjacent load requests.

Disclosure of Invention

At least some examples provide an apparatus comprising: load processing circuitry responsive to a load request specifying a data item to retrieve from memory a series of data items including the data item identified by the load request; pending load buffer circuitry to buffer load requests to retrieve from memory the data items specified by the load requests before they are executed by said load processing circuitry; merge circuitry to determine whether an address proximity condition is true for the load request and a set of one or more other load requests buffered in the pending load buffer circuitry, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items, wherein the merge circuitry is responsive to the address proximity condition being true to inhibit processing by the load processing circuitry of the set of one or more other load requests; and merge prediction circuitry to generate a merge prediction for a load request based on previous processing of the load request by the merge circuitry.

At least some examples provide a method of data processing, comprising: buffering the load request to retrieve from memory the data item specified by the load request before the load request is executed, wherein executing the load request specifying the data item includes retrieving from memory a series of data items including the data item identified by the load request; determining whether an address proximity condition is true for the load request and a set of one or more other buffered load requests, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items; suppressing execution of the set of one or more other load requests when the address proximity condition is true; and generating a merge prediction for the load request based on previous processing of the load request.

At least some examples provide an apparatus comprising: means for buffering the load request to retrieve from memory the data item specified by the load request before the load request is executed, wherein executing the load request specifying the data item comprises retrieving from memory a series of data items including the data item identified by the load request; means for determining whether an address proximity condition is true for the load request and a buffered set of one or more other load requests, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items; means for suppressing execution of the set of one or more other load requests when the address proximity condition is true; and means for generating a merge prediction for a load request based on previous processing of the load request.

Drawings

The present technology will be further described, by way of example only, with reference to the embodiments of the present technology illustrated in the accompanying drawings, which are to be read in conjunction with the following description:

FIG. 1 schematically illustrates an apparatus including load processing circuitry having the capability to merge some pending loads, according to some example embodiments;

FIG. 2A schematically illustrates an apparatus according to some example embodiments, including load processing circuitry having the capability to merge some pending loads;

FIG. 2B schematically illustrates the interaction of pending load buffers and merge circuitry, in accordance with some example embodiments;

FIG. 3 schematically illustrates an apparatus including load processing circuitry having the capability to merge some pending loads, according to some example embodiments;

FIG. 4A schematically illustrates a merge prediction circuit, according to some example embodiments;

FIG. 4B schematically illustrates a merge prediction circuit, according to some example embodiments;

FIG. 5A schematically illustrates hash generation circuitry and bloom filter circuitry in some example embodiments;

FIG. 5B schematically illustrates an apparatus according to some example embodiments; and

FIG. 6 is a flowchart illustrating a sequence of steps taken according to some example embodiments.

Detailed Description

In one example herein, there is an apparatus comprising: load processing circuitry responsive to a load request specifying a data item to retrieve from memory a series of data items including the data item identified by the load request; pending load buffer circuitry to buffer load requests to retrieve from memory the data items specified by the load requests before they are executed by said load processing circuitry; merge circuitry to determine whether an address proximity condition is true for the load request and a set of one or more other load requests buffered in the pending load buffer circuitry, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items, wherein the merge circuitry is responsive to the address proximity condition being true to inhibit processing by the load processing circuitry of the set of one or more other load requests; and merge prediction circuitry to generate a merge prediction for a load request based on previous processing of the load request by the merge circuitry.

Merging load requests is based on identifying that the relevant load requests are sufficiently spatially adjacent (in terms of the memory locations they reference) that they can be serviced by one "merged" memory access. This spatial proximity is referred to herein as the "address proximity condition" being true for the relevant load requests. The opportunity to merge load requests in this manner arises in systems in which, in response to a load request specifying a data item to be retrieved from memory, the load processing circuitry typically retrieves more than just that data item as part of the triggered memory access. This is because efficiency is gained by constraining memory accesses to a particular data size, e.g. aligned with the cache line size of the system (although the present technique is not limited to this particular choice; other granularities are equally applicable). In general, using a merge circuit to determine whether pending load requests may be merged has a cost, but there may be situations in which the increase in latency incurred by processing in the merge circuit is offset by an improvement in memory access bandwidth substantial enough to yield an overall performance gain, e.g. where a significant proportion of the load instructions can be merged. Nonetheless, the inventors of the present technique have recognized that when only a relatively small number of load instructions can successfully be fused (merged), the increase in latency due to the additional merging process may outweigh the benefit obtained. The present technique therefore proposes a merge prediction circuit arranged to generate a prediction of whether a given load request will be merged with other load requests, based on previous processing of load requests by the merge circuit. This merge prediction may then be made use of by the apparatus in a number of ways, for example to modify its behavior with respect to a particular load request.
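To make the address proximity condition concrete, the following minimal Python sketch models the check under the assumption (made purely for illustration) that each memory access retrieves one aligned 64-byte cache line; the function names and the line size are not taken from any disclosed embodiment.

```python
# Minimal model of the address proximity condition, assuming one memory
# access retrieves a single aligned 64-byte cache line (illustrative only).
CACHE_LINE_BYTES = 64

def line_base(address: int) -> int:
    """Base address of the cache line containing `address`."""
    return address & ~(CACHE_LINE_BYTES - 1)

def address_proximity_true(load_addr: int, other_addrs: list[int]) -> bool:
    """True when every data item identified by the other load requests lies
    within the series of data items retrieved for load_addr."""
    base = line_base(load_addr)
    return all(line_base(a) == base for a in other_addrs)

# One cache-line fetch services all three loads, so they can be merged:
assert address_proximity_true(0x1008, [0x1000, 0x1030])
# A load touching the next cache line cannot be merged with this one:
assert not address_proximity_true(0x1008, [0x1040])
```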

In some embodiments, processing of the load request by the merge circuit is suppressed in response to a no-merge prediction. Thus, as a possible response to a generated prediction that a load request will not be merged with other load requests, the merge circuit may avoid processing that load request, and processing may thus be saved when the load request is expected not to merge.

The gathering of information regarding previous processing of load requests by the merge circuit may be provided for in a number of ways, but in some embodiments the apparatus further comprises a feedback path from the merge circuit to the merge prediction circuit, wherein the merge circuit is responsive to the validity of the address proximity condition to notify the merge prediction circuit, via the feedback path, of a merge result that depends on the validity of the address proximity condition for the load request. Thus, based on its determination of the address proximity condition for the load request, the merge circuit may notify the merge prediction circuit of the merge outcome. Depending on the configuration, this may be framed positively or negatively, i.e. the merge circuit may be arranged to specifically indicate those load requests that have not been merged, or it may instead inform the merge prediction circuit of those load requests that have been merged. The merge prediction circuit may then gather this information as appropriate, as the basis on which it makes future predictions.

The merge prediction circuitry may be variously configured, but in some embodiments the merge prediction circuitry comprises a merge history store to hold content dependent on previous processing of load requests by the merge circuitry, and is arranged to generate a merge prediction for a load request based on the content of the merge history store.

The content held by the merge history store may take a variety of forms, but in some embodiments it comprises a probabilistic data structure. In some embodiments this content may explicitly correspond to previous merge history results, but in other embodiments the content may instead be captured in a reduced, probabilistic manner. Probabilistic history storage reduces the required storage capacity.

In some example embodiments, the merge history store includes a bloom filter.

In some embodiments, the apparatus further comprises a hash generation unit to generate the contents of the probabilistic data structure, wherein the hash generation unit comprises a hash sequencer to generate the contents of the probabilistic data structure in a plurality of hash stages. Arranging the hash sequencer to generate the contents of the probabilistic data structure in multiple hash stages may make the hardware implementation simpler.

Where the probabilistic data structure (e.g. a bloom filter) relies on an evolving data structure that captures the merge history asymmetrically, i.e. the bloom filter may be arranged to indicate with certainty that a particular load has not previously failed to merge, or to indicate that it is likely (but not certain) to be non-merged, the evolving nature of the data structure means that the false positive rate (e.g. merged loads being marked as non-merged) will generally increase over time. Thus, in some embodiments, the apparatus includes a reset circuit to reset the contents of the merge history store in response to at least one of: elapse of a predetermined period of time; receipt of a context switch indication; and/or a predetermined fullness of the merge history store being reached.
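As a rough model of such a probabilistic merge history store, the following Python sketch implements a bloom filter that records non-merged loads, together with the three reset triggers just described; the filter size, the hash construction and the reset policy values are all assumptions made for illustration.

```python
import time

class MergeHistoryBloomFilter:
    """Probabilistic record of loads whose merges failed: a lookup answers
    'possibly non-merged' or 'definitely not recorded' (a sketch only)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3,
                 reset_period_s: float = 1.0, max_fullness: float = 0.5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # the filter's bit array
        self.set_count = 0                 # how many bits are set ("fullness")
        self.reset_period_s = reset_period_s
        self.max_fullness = max_fullness
        self.last_reset = time.monotonic()

    def _positions(self, load_pc: int):
        # Illustrative multi-stage XOR/shift hashing of the load's address.
        h = load_pc & 0xFFFFFFFF
        for _ in range(self.num_hashes):
            h ^= (h << 13) & 0xFFFFFFFF
            h ^= h >> 17
            h ^= (h << 5) & 0xFFFFFFFF
            yield h % self.num_bits

    def record_non_merged(self, load_pc: int) -> None:
        self.maybe_reset()
        for pos in self._positions(load_pc):
            if not (self.bits >> pos) & 1:
                self.bits |= 1 << pos
                self.set_count += 1

    def predict_non_merged(self, load_pc: int) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(load_pc))

    def maybe_reset(self, context_switch: bool = False) -> None:
        # The three reset triggers described in the text.
        expired = time.monotonic() - self.last_reset > self.reset_period_s
        too_full = self.set_count / self.num_bits > self.max_fullness
        if context_switch or expired or too_full:
            self.bits, self.set_count = 0, 0
            self.last_reset = time.monotonic()
```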

As described above, the apparatus may be arranged to make use of the merge prediction generated by the merge prediction circuitry in a number of ways, but in some embodiments the apparatus further comprises a bypass path to the load processing circuitry which bypasses the merge circuitry, wherein the merge prediction circuitry is responsive to a no-merge prediction to cause the load request to be provided to the load processing circuitry via the bypass path. The provision of a bypass path thus enables load requests that have been identified as non-merging to proceed directly to the load processing circuitry without being processed by the merge circuitry.

In some embodiments, the apparatus further comprises: second load processing circuitry responsive to a load request specifying a data item to retrieve from memory a series of data items including the data item identified by the load request; and load processing selection circuitry responsive to the merge prediction generated by the merge prediction circuitry to direct load requests to either the load processing circuitry or to the second load processing circuitry in dependence on the merge prediction. Thus, according to such an embodiment, the load processing circuitry of the apparatus is "doubled" (by providing a second load processing circuitry), and the load processing selection circuitry may then use the merge prediction to either provide the load request to the load processing circuitry via pending load buffer circuitry (and thus subject to potential merging via processing by the merge circuitry), or directly to the second load processing circuitry (so as not to even attempt merging).

In some such embodiments, the apparatus further comprises: a first load request issue queue to feed pending load buffer circuitry; and a second load request issue queue to feed the second load processing circuitry, wherein the load process selection circuitry is responsive to the merge prediction generated by the merge prediction circuitry to direct the load request to either the first load request issue queue or the second load request issue queue in dependence upon the merge prediction. This arrangement thus enables a further level of decoupling of the paths to the (first) and second load processing circuits by having a dedicated issue queue before each of them, namely a "merged" issue queue for the (first) load processing circuit path and a "non-merged" issue queue to the second load processing circuit.
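A behavioral sketch of this two-queue arrangement is given below; it assumes a predictor object exposing the predict_non_merged query of the bloom-filter sketch above, and all other names are hypothetical.

```python
from collections import deque

class LoadProcessSelector:
    """Routes each load request to the merge or no-merge issue queue,
    in dependence upon the merge prediction (a sketch, not disclosed RTL)."""

    def __init__(self, predictor):
        self.predictor = predictor            # e.g. MergeHistoryBloomFilter
        self.merge_issue_queue = deque()      # feeds the pending load buffer
        self.no_merge_issue_queue = deque()   # feeds the second load unit

    def dispatch(self, load_pc: int, load_addr: int) -> None:
        request = (load_pc, load_addr)
        if self.predictor.predict_non_merged(load_pc):
            self.no_merge_issue_queue.append(request)
        else:
            # The default behavior is to attempt merging at least once.
            self.merge_issue_queue.append(request)
```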

The generated merge prediction may also be stored at various locations in the apparatus. In some embodiments, in which the load request originates from a load instruction, the apparatus further comprises an instruction cache store to store entries for frequently used instructions, wherein the instruction cache store is arranged to store the merge prediction in association with the load instruction. This in turn enables efficient use of the merge prediction, with reduced latency, when the load instruction is encountered again.

The instruction cache store may be, for example, a micro-operation cache.

In some embodiments, the apparatus further comprises a data dependency determination circuit to determine data dependencies between instructions, wherein the merge prediction circuit is arranged to generate the merge prediction in dependence on at least one data dependency determined by the data dependency determination circuit. It has been found that, in some cases, information about data dependencies between instructions is useful for determining the likelihood of merging. In particular, some load instructions provide input values directly (or with a very low degree of indirection) to other nearby load instructions, and merging often does not occur for such loads that access a pointer. Thus, the information generated by the data dependency determination circuit may provide a basis for classifying instructions as merging or non-merging, based on the dependencies of the instructions and/or on the instructions that depend on them.

The data dependency determination circuitry may, for example, comprise register renaming circuitry.

In some embodiments the merge prediction circuitry is arranged to generate a merge prediction indicating merging for a load request by default. This corresponds to the default behavior of the apparatus being to attempt merging, since a load must pass through the merge hardware at least once in order for its likelihood of merging with other loads in flight to be determined.

In one example herein, there is a method of data processing, comprising: buffering the load request to retrieve from memory the data item specified by the load request before the load request is executed, wherein executing the load request specifying the data item includes retrieving from memory a series of data items including the data item identified by the load request; determining whether an address proximity condition is true for the load request and a set of one or more other buffered load requests, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items; suppressing execution of the set of one or more other load requests when the address proximity condition is true; and generating a merge prediction for the load request based on previous processing of the load request.

In one example herein, there is an apparatus comprising: means for buffering the load request to retrieve from memory the data item specified by the load request before the load request is executed, wherein executing the load request specifying the data item comprises retrieving from memory a series of data items including the data item identified by the load request; means for determining whether an address proximity condition is true for the load request and a buffered set of one or more other load requests, wherein the address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items; means for suppressing execution of the set of one or more other load requests when the address proximity condition is true; and means for generating a merge prediction for a load request based on previous processing of the load request.

Some specific embodiments are now described with reference to the drawings.

Fig. 1 schematically illustrates an apparatus 10 arranged to receive load requests (e.g. generated by data processing circuitry in response to load instructions identified in program code being executed) and to generate memory accesses to service those load requests. The apparatus 10 comprises a load unit 12, which generates memory accesses in dependence on the received load requests. The apparatus 10 also includes a pending load buffer 14, in which received load requests are held prior to processing by the load unit 12. The apparatus 10 further comprises a merger (merging circuit) 16, which receives content from the pending load buffer 14 and, where possible, merges two or more load requests such that they are processed by the load unit 12 as a single item, in particular generating only a single memory access request. The merger 16 is arranged to determine whether two (or more) load requests can be merged on the basis of an address proximity condition. The address proximity condition is defined by the arrangement of the load unit 12, i.e. in response to a given (individual) load request, the load unit generates a memory access which causes a series of data items to be retrieved from memory. For example, in some embodiments the series of data items may correspond to the cache line size of the wider data processing system of which the apparatus 10 forms part. That is, data is retrieved from memory in cache-line-sized blocks. Thus, when a second (or further) load request relates to a data item that will in any case be retrieved from memory in servicing another (first) pending request, an opportunity arises to merge the pending load requests, due to the fact that the first load request will cause retrieval of a series of data items which includes the data item(s) specified by the second (or further) load request(s). In the example of cache-line-sized data retrieval, the second (or further) load requests refer to data items within the same cache line as the first load request. Note that the labels "first" and "second" are used here only to distinguish the load requests, and do not impose any necessary ordering on them when processed by the merger 16.

The apparatus 10 further comprises a merge predictor arranged to receive an indication of an incoming load request (e.g. at least a portion of the load instruction address) and to generate a merge prediction for the load request, based on information relating to previous encounters with that load request. The apparatus 10 further comprises path selection circuitry 20, arranged to pass an incoming load request either to the pending load buffer 14 or directly to the load unit 12. This path selection is made based on the output of the merge predictor circuit. Thus, based on the manner in which a given load request was previously processed by the apparatus 10, the path selection circuitry 20 selects the route that the load request will take through the apparatus. In particular, load requests may be passed directly to the load unit 12 when they are expected not to be merged, so that the processing associated with the path through the pending load buffer 14 and the merger 16 may be avoided.

Fig. 2A schematically illustrates an apparatus 50 in some example embodiments. In a manner that will be familiar to those of ordinary skill in the art, the apparatus 50 is arranged as a pipeline to process instructions, and thus comprises a series of stages, namely a fetch stage 52, a decode stage 54, a micro-operation cache 56, a rename stage 58 and an instruction queue 60. Those of ordinary skill in the art will appreciate that many other components and stages are possible, but the selection shown in FIG. 2A is merely representative for the purpose of understanding the present techniques. According to the illustrated arrangement, the fetch stage 52 causes instructions from a sequence of instructions defined in program code to be retrieved from memory and passed to the decode stage 54 for decoding. Decoded instructions may be stored in the micro-op cache 56 and are then processed by the rename stage 58 (the apparatus 50 being an out-of-order data processing apparatus). Instructions queued in the instruction queue 60 may then be passed to several destinations depending on their type, but the focus of the present disclosure is the handling of load instructions that generate load requests, and such instructions pass from the instruction queue 60 to the register read stage 62 and then to the address generation stage 64. It can thus be seen that the arrangement of the apparatus 50 includes "early" address resolution (with respect to the operation of the load unit 72), which is useful in the context of the present technique (i.e. when seeking to merge some pending loads), as it provides the apparatus with sufficient (simultaneous) visibility of the pending loads in flight (whilst held in the pending load buffer 68) in order to make timely decisions as to which loads can be grouped together (merged). An indication of the load instruction (which will give rise to the load request) is also passed from the decode stage 54 to the merge predictor 66, which, based on this indication (e.g. the load instruction address), is arranged to generate a merge prediction, i.e. an indication of whether this load instruction is expected to result in a load request that will be merged with at least one other co-pending load request. This is done based on the previous processing of load requests by the merge circuit (i.e. the merger 70). The merge prediction is communicated to the address generation circuitry 64, which selects a forward path for each load request based on the received merge prediction. These forward paths lead to the load unit 72. The first path is via the pending load buffer 68 and the merger 70, and the second path leads directly from the address generation circuitry 64 to the load unit 72. Thus, in general, when a load request processed by the address generation circuitry 64 is expected to merge, it is passed via the first path (through the pending load buffer 68 and the merger 70, in order to attempt merging with other pending loads held in the pending load buffer 68). Alternatively, when a pending load is expected not to merge, it is passed directly to the load unit 72. Since it is necessary to send a pending load through the merge hardware at least once in order to determine its probability of merging with other pending loads in flight, the default behavior of the apparatus 50 is a merge attempt, i.e. initially all pending loads are treated by default as merging loads and pass through the pending load buffer 68 and the merger 70.
However, once a pending load reaches the merger 70 and it is determined that its merge has failed, the merger 70 notifies the merge predictor 66 of the failure. The merge predictor 66 thereby builds up a record of those loads that have failed to merge in the past and, on this basis, when they are encountered again, signalling from the merge predictor 66 to the address generation circuitry 64 can cause these non-merging load requests to bypass the merger 70 and proceed directly to the load unit 72. Reducing the number of pending loads passing through the pending load buffer also allows the structure of the pending load buffer 68, and the search logic which compares the various permutations of pending loads, to be simpler.

Note also that the output of the merge predictor 66 is shown in FIG. 2A as being passed to the decode circuitry 54. Thus, the decode circuitry may on the one hand annotate decoded load instructions with an indication of their merge predictions, and on the other hand it may also maintain a record of the merge predictions for at least some load instructions that are frequently decoded, so that these may be decoded faster. The output of the merge predictor 66 may also be passed to any other pipeline stage (dashed lines in FIG. 2A), either for storage in association with certain load instructions (e.g. in the micro-op cache), or for annotating decoded load instructions with an indication of their merge prediction as they pass through the pipeline (which, where opportune, may in some cases preferably be performed downstream of the decode stage 54). Various configurations of the merge predictor 66, in particular the manner in which it stores information about how previous load requests were processed, are contemplated and will be discussed in more detail below with reference to FIGS. 4A, 4B, 5A, and 5B.

A pending load arriving at the load unit 72 then causes a data fetch. The apparatus includes an active load store 74, in which a record is maintained of in-flight loads that have initiated, but not yet completed, access to the memory system. Processing of a load request begins with the load unit 72 passing the request to the Translation Lookaside Buffer (TLB) lookup and fault check circuitry 76, to perform the required lookup (for translation from virtual to physical addressing) and to respond appropriately to any faults. Note that various levels of caching (not shown) may additionally be provided; these caches may be virtually indexed or physically indexed, and accesses to the TLB may be modified accordingly. When the relevant series of data items is retrieved, for example after a data cache access, either as a result of a cache hit or by access to a further level of the memory system (not shown), the data read and multiplexing circuitry 78 processes the resulting data. The data item corresponding to the load request itself is extracted and passed to the selector 86. The series of data items resulting from the memory access is passed via a data buffer 80 to a de-merge and format circuit 82. The de-merge and format circuit 82 receives an indication of the fused (merged) load request from the merger 70, where this indication shows which elements of the series of data items are additionally to be extracted (in order to provide the data items corresponding to the one or more other load requests that were suppressed as a result of merging with the nominally-processed load request). The de-merge and format circuit 82 thus extracts the multiple data items required for the merged load request. The result is then passed to another input of the selector 86 via a data buffer 84. The selector 86 thus receives one input from the data buffer 84 and one input from the data read and multiplexing circuitry 78. The selection between the two is controlled by an input provided by the load unit 72 indicating whether this is a merged load. The resulting data is then stored in a result cache 88, which may also provide input to the register read stage 62 as part of the processing performed to determine the addresses of the load requests it receives.
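The extraction step performed by the de-merge and format circuit 82 can be illustrated with a short sketch; representing the retrieved series of data items as a byte string and the merged requests as (address, size) pairs is an assumption made purely for illustration.

```python
def de_merge(line_data: bytes, line_base: int,
             requests: list[tuple[int, int]]) -> list[bytes]:
    """Extract the data item for each merged (address, size) request from a
    single retrieved cache line (a behavioral sketch of the de-merge step)."""
    items = []
    for addr, size in requests:
        offset = addr - line_base          # position within the line
        items.append(line_data[offset:offset + size])
    return items

# One 64-byte line services a 4-byte load at 0x1008 and an 8-byte load at 0x1030:
line = bytes(range(64))
print(de_merge(line, 0x1000, [(0x1008, 4), (0x1030, 8)]))
```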

FIG. 2B schematically illustrates the interaction of the pending load buffer and the merge circuitry, according to some example embodiments. The pending load buffer 90 comprises a number of entries, each of which may hold a pending load (i.e. information about a load that has been indicated but not yet executed). In the example of FIG. 2B the pending load buffer 90 has eight entries, but in other embodiments the pending load buffer may have more or fewer entries than this; the present technique is not limited to any particular number of entries in the pending load buffer. Each entry is shown as comprising a main portion 91 and a merge indication 92. The main portion 91 holds the above-mentioned information about each load (in particular the address to be accessed by the load), and the merge indication 92 is used to mark those loads that have been identified as merging with other loads in the pending load buffer. In the example of FIG. 2B, the oldest pending load in the buffer provides one input of the address comparison, while the other input is provided by the address of each of the other (seven) entries in the buffer. The address comparison circuitry 93 compares the address of the oldest pending load with the address of each other pending load in the buffer. This comparison may examine only a portion of each address. For example, where the load unit is configured to initiate loads from memory on a cache line basis, the upper portions of the addresses may be compared to determine whether two load addresses can be serviced by retrieving a single cache line covering both addresses. The result of the address comparison performed by the address comparison circuitry 93 is notified to the merge control circuitry 94. When there is no (sufficiently close) match between the address of the oldest pending load and the other pending loads, the merge control 94 passes the oldest pending load to the load unit and it is removed from the pending load buffer (in practice the entry may simply be allowed to be overwritten). The merge control 94 also notifies the merge predictor that merging of this load failed. However, when at least one (sufficiently close) match is found between the address of the oldest pending load and the other pending loads, the merge control 94 updates the merge indication 92 for the matching one or more other pending loads. These other pending loads are thus marked as having been merged with an older load ahead of them in the pending load buffer. When such a pending load becomes the oldest pending load, its merge indication 92 is received by the merge control circuitry 94, which may then notify the load unit that this is a merged load (as described above). Nevertheless, further address comparisons between the already-merged load and the contents of the pending load buffer are still performed, since one or more further pending loads that could also be merged may have arrived in the interim.
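A simplified behavioral model of the FIG. 2B interaction follows; it marks matching entries and removes them immediately rather than draining them later when they become oldest, and every name is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class PendingLoad:
    pc: int                             # address of the load instruction
    addr: int                           # address the load will access
    merged_with_older: bool = False     # models the merge indication 92

class PendingLoadBuffer:
    """Sketch of the oldest-entry address comparison of FIG. 2B."""

    def __init__(self, line_bytes: int = 64):
        self.entries: list[PendingLoad] = []   # index 0 holds the oldest load
        self.line_bytes = line_bytes

    def _tag(self, addr: int) -> int:
        # Compare only the upper address bits, i.e. the cache-line tag.
        return addr // self.line_bytes

    def drain_oldest(self, predictor):
        """Remove the oldest load; return it with any loads merged into it."""
        oldest = self.entries.pop(0)
        merged = [e for e in self.entries
                  if self._tag(e.addr) == self._tag(oldest.addr)]
        for e in merged:
            e.merged_with_older = True    # suppressed: no separate access
            self.entries.remove(e)
        if not merged:
            predictor.record_non_merged(oldest.pc)   # the feedback path
        return oldest, merged
```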

Fig. 3 schematically illustrates an apparatus 100 in some example embodiments. In contrast to the schematic illustration of FIG. 2A, the example apparatus 100 of FIG. 3 does not show the pipeline stages preceding the instruction queue 102 (this is merely for clarity and brevity of the discussion here). An indication of a load instruction in the instruction queue 102 (e.g. its instruction address) is directed to the merge predictor 104, which generates a merge prediction on this basis. As in the example of FIG. 2A, the merge predictor may accomplish this in various ways, and it may store its information in various ways, which are discussed in more detail below with reference to FIGS. 4A, 4B, 5A, and 5B. The generated merge prediction is notified to the path selector 106, which then causes the load request to follow one of two paths through the apparatus 100. According to the first path, i.e. that resulting from a prediction of merging, the pending load request is passed to the merge issue queue (CIQ) 108. Alternatively, a no-merge prediction causes the path selector 106 to pass the load request to the non-merge issue queue (NCIQ) 110. Load requests from both the CIQ 108 and the NCIQ 110 make use of the register read stage 112. Thereafter, load requests which came via the CIQ 108 pass out of the register read stage 112 to the address generation stage 114. They then enter a pending load buffer 116, in which several pending load requests are held simultaneously. The merger circuitry 118 references the contents of the pending load buffer 116 in determining whether a given load request can be merged. When merging fails, this fact is communicated to the merge predictor 104, which builds up a record of those load requests that have previously failed to merge, in order to generate its merge predictions. Pending load requests that are not suppressed by the merger 118 (through having been merged into another load request) are passed to the load unit 120, and the apparatus maintains a record of the active loads in its active load store 122. As in the example embodiment described above with respect to FIG. 2A, the load unit 120 then passes the pending load to the TLB lookup and fault check circuitry 124. Data read and multiplexing circuitry 126 extracts the data item corresponding to the load request itself and passes it to the selector 134. The series of data items resulting from the memory access is passed via a data buffer 128 to a de-merge and format circuit 130. The de-merge and format circuit 130 also receives input from the merger 118 indicating the nature of the merge performed (i.e. which data items should be extracted), so that it can then extract the required data items from the retrieved series of data items (e.g. a cache line). These de-merged data items are then passed to the selector 134 via a data buffer 132.

Pending load requests predicted to be non-merging are passed directly from the NCIQ 110, via the register read stage 112, to the load unit 140. Since these load requests are predicted not to merge, no early address generation is required, and the address generation stage 142 is therefore placed after the load unit 140, so that the memory accesses required to service these load requests can be initiated sooner, avoiding the latency associated with the early address generation of the merge path. The load unit 140 also maintains a record of its active loads in an active load store 144. The results of the address generation 142 are passed to the TLB lookup and fault check circuitry 124 and then to the data read and multiplexing circuitry 126. The results of non-merged requests are passed directly from the data read and multiplexing circuitry 126 to the selector 134. Under the control of the load unit 140, the selector then passes one of its inputs to the result cache 136, the contents of which may be made use of by the register read stage 112 in its address resolution role.

Fig. 4A schematically illustrates the configuration of a merge predictor 150 in more detail, according to some example embodiments. The merge predictor 150 is arranged to receive an indication of a load instruction, which in the illustrated embodiment comprises a subset of its address bits (bits [17:2]). This information is received by the compare circuitry 152, which receives another input from the merge history store 154. The compare circuitry 152 compares the load instruction address information with the contents of the merge history store 154 (i.e. the compare circuitry 152 has visibility of all history entries in the merge history store 154 and compares them with the incoming load instruction address) and on this basis generates a prediction (merge or no-merge) for this load instruction. Previous merge results cause updates to the contents of the merge history store 154. These may in principle be explicit records of the merge results of previously encountered load instructions, or may be restricted to a particular outcome, such as non-merge results. In other embodiments, however, it is equally possible to maintain a record of merge (rather than non-merge) results in the merge history store. In general, the merge history may be an explicit record of all merge results, a record of non-merge results, a record of merge results, or a probabilistic representation of any of these.

Fig. 4B schematically illustrates further components of a merge predictor 160; in fact the merge predictor 160 may be the same merge predictor as the merge predictor 150 of FIG. 4A, but with different functionality in focus in FIG. 4B. In particular, the merge predictor 160 is shown to include, in addition to the merge history stores 162a-d, an update control 164. The update control may receive information from one or more possible inputs, shown in FIG. 4B as a context switch signal, a timer, and merge results (including information about the load instruction address). On this basis, the update control 164 modifies the contents of the merge history stores 162a-d. In general, merge results cause updates to the merge history store as described above (i.e. this may be an explicit record of all merge results, a record of non-merge results, a record of merge results, or a probabilistic representation of any of these). The update control also maintains the contents of the merge history stores 162a-d. There are various ways in which the merge history store may be caused to be reset. One example is the elapse of a predetermined period of time (indicated by the timer) because, the merge history store necessarily being of limited capacity, the growing "fullness" of the merge history stores 162a-d may slowly degrade their usefulness over time. To mitigate this, the update control 164 may cause the content to be reset after a certain period of time has elapsed. Alternatively or in addition, when the wider data processing system undergoes a context switch, it may be appropriate for the merge history store to be reset, and the update control 164 may therefore trigger this reset upon receiving a context switch indication. The update control 164 also directly monitors the merge history stores 162a-d and may reset their contents when a certain "fullness" of the stores is reached. The merge history store is shown in FIG. 4B as comprising four parts: 162a, 162b, 162c and 162d. These different portions of the merge history store are used to store different merge history data sets, which may for example be used to hold merge history information relating to different process contexts. The update control 164 may instruct the merge history store to switch between these different history sets on a process context switch.

FIG. 5A schematically illustrates in more detail the operation of a merge predictor, according to some example embodiments. In such embodiments a hash unit 170 is provided, arranged to generate a hash of the instruction address information it receives. The hash unit 170 receives load instruction address information from two sources, namely as an indication of an incoming load instruction in the instruction queue, and as part of an indication of a merge result from the merge unit. The hash unit 170 generates its hash values using a hash key store and sequencer circuitry 172. The provision of the sequencer 172 allows the hash unit to generate hash values in multiple hash stages, reducing the overall complexity of the hardware required to support the hash function. A non-cryptographic hash of this type may be generated quickly, for example with an efficient hardware configuration using exclusive-OR and/or shift operations. The output of the hash unit 170 is passed to a path selector, which determines the forward path of the hashed information based on whether a check is being performed on the merge predictor store (i.e. to generate a merge prediction) or whether the stored contents are being modified using a merge result. In the example of FIG. 5A, this store is a bloom filter store 174. The contents of the bloom filter store 174 are thus updated on the basis of the hashes generated for merge results. The bloom filter shown is accordingly able to indicate either that a particular load request can be merged, or that there is a possibility that it will not merge. Subsequently, when the load instruction address (or a portion thereof) is received by the hash unit 170, the hashed information is passed to the compare circuitry 176, which compares it with the contents of the bloom filter store 174 and generates a merge prediction on that basis.
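The staged hashing might be modeled as follows, using only XOR and shift operations per stage as suggested above; the per-stage keys, the 32-bit width and the stage count are invented for illustration.

```python
class HashSequencer:
    """Generates a bloom-filter index over several 'stages', reusing one
    small XOR/shift mixing step rather than one deep combinational hash."""

    STAGE_KEYS = (0x9E3779B9, 0x85EBCA6B, 0xC2B2AE35)   # assumed stage keys

    def __init__(self, num_bits: int = 1024):
        self.num_bits = num_bits

    @staticmethod
    def stage(state: int, key: int) -> int:
        # One hardware-friendly stage: XOR in a key, then mix the state
        # with shifted copies of itself (no multipliers required).
        state = (state ^ key) & 0xFFFFFFFF
        state ^= (state << 7) & 0xFFFFFFFF
        state ^= state >> 9
        return state

    def index(self, load_pc: int) -> int:
        state = load_pc & 0xFFFFFFFF
        for key in self.STAGE_KEYS:        # one iteration per hash stage
            state = self.stage(state, key)
        return state % self.num_bits
```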

FIG. 5B schematically illustrates components of a data processing apparatus 180 comprising, in some embodiments, a sequence of pipeline stages as described above. These components are shown as a fetch stage 182, a decode stage 184, a micro-operation cache stage 186, a rename stage 188, and an instruction queue 190. In the same manner as in the example embodiments described above, a load instruction decoded by the decode stage 184 is also notified to the merge predictor circuitry 192, so that it generates a merge prediction. This merge prediction is passed to the instruction queue 190, from where it may inform the forward path that the load request takes through the remainder of the apparatus (not shown, but in accordance with any of the embodiments described above). It is important to note with reference to FIG. 5B that the merge predictor also receives an input from the rename stage 188. More generally, the merge predictor 192 may be provided with an input from any data dependency determination circuit (of which the rename stage 188 is one example). This is done because it has been found that the data dependencies of load instructions carry useful information for determining whether a particular load request will merge. In particular, it has been found that when some load instructions provide input values directly (or with a very low degree of indirection) to other co-pending load instructions, such pointer-accessing loads tend not to merge. Data dependency information may therefore be taken into account in the determination made when the merge predictor generates its merge prediction. FIG. 5B also illustrates an example in which the merge predictor is arranged to provide (dashed arrows in the figure) merge predictions for certain load instructions (as described above with reference to FIG. 2A) to either or both of the decode stage 184 and the micro-operation cache 186. In this arrangement, when the rename stage detects a load-load dependency chain (which, as described above, has been found to significantly reduce the likelihood of merging), not only will the rename stage 188 inform the merge predictor 192, so that it updates itself to indicate that the first load should not be merged, but where the merge predictor 192 has already provided a prediction earlier in the pipeline (e.g. in the decode stage or in the micro-operation cache), the rename stage 188 may decide to override that prediction. More specifically, the rename stage 188 may be arranged to determine whether the merge predictor 192 holds an actual prediction for this load, or whether the merge predictor 192 is assuming the default "mergeable" prediction. For this purpose the merge predictor may indicate an "actual prediction" or a "default prediction" for the load, e.g. using a single bit of the merge prediction it provides. In the latter case, the rename stage may override the prediction of the first load if a load-load dependency chain is detected.
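A sketch of how such a rename-stage override might combine with the default prediction is given below, reusing the bloom-filter sketch above; treating a filter hit as the "actual" non-merge prediction and a miss as the "default mergeable" one is an interpretation made for illustration.

```python
def final_merge_prediction(predictor, load_pc: int,
                           feeds_another_load: bool) -> bool:
    """Return True when the load should take the merge path (a sketch)."""
    if predictor.predict_non_merged(load_pc):
        return False   # an "actual" prediction: this load failed to merge before
    # Otherwise the prediction is the default "mergeable" one; the rename
    # stage may override it when it detects a load-load dependency chain,
    # since pointer-chasing loads have been found rarely to merge.
    if feeds_another_load:
        predictor.record_non_merged(load_pc)   # also teach the predictor
        return False
    return True
```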

FIG. 6 is a flowchart illustrating a sequence of steps taken by a method according to some embodiments. The flow can be considered to begin at step 200, when the next load request is received. It is then determined at step 202 whether a no-merge prediction has previously been stored for this load request (i.e. for this load instruction). If so, flow proceeds directly to step 212, where the load request is executed individually (i.e. without any attempt to merge it with other load requests). If, however, it is found at step 202 that no no-merge prediction is stored for this load request, flow proceeds to step 204, where the load request is added to the pending load buffer. It is then determined at step 206 whether the load request can be merged with the other pending load buffer contents. If not, flow proceeds via step 207, where it is checked whether the pending load buffer holds sufficient contents to allow a meaningful "no merge" determination to be made. This is done with reference to an occupancy threshold for the pending load buffer. Below this threshold it is defined that no "no merge" determination should be made, because there are not enough other loads held in the buffer against which to compare the current load request. Thus, when the current occupancy is not above the threshold, flow proceeds to step 212 and the load is processed individually. The flow then returns to step 200. If, however, the current occupancy is above the threshold, flow proceeds via step 208, where a no-merge prediction is stored for this load request, which is then processed, i.e. processed individually, at step 212. Returning to step 206: if it is determined there that the load request can be merged with other pending load buffer contents, the other "overlapping" load requests are suppressed at step 210. Depending on the configuration employed and, for example, the order in which load requests travel through the pending load buffer, the suppressed requests may be the other pending load buffer contents mentioned above, or may include the load request itself. For example, it may be preferable, though it is not essential, for the oldest load request in a sequential pending load buffer to be acted upon (to reduce latency), while those later requests merged with it are suppressed.
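The flow of FIG. 6 can be rendered behaviorally as below, reusing the PendingLoad and bloom-filter sketches above; the occupancy threshold value of 4 is an assumption, since the description leaves the threshold unspecified.

```python
def handle_load_request(pc: int, addr: int, predictor, buffer,
                        occupancy_threshold: int = 4) -> str:
    """Steps 200-212 of FIG. 6 as straight-line code (a sketch only)."""
    if predictor.predict_non_merged(pc):                    # step 202
        return "executed individually (merger bypassed)"    # step 212
    buffer.entries.append(PendingLoad(pc, addr))            # step 204
    others = buffer.entries[:-1]
    overlapping = [e for e in others
                   if buffer._tag(e.addr) == buffer._tag(addr)]
    if overlapping:                                         # step 206
        for e in overlapping:                               # step 210
            e.merged_with_older = True   # suppress overlapping requests
        return "merged"
    if len(others) > occupancy_threshold:                   # step 207
        predictor.record_non_merged(pc)                     # step 208
    return "executed individually"                          # step 212
```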

Briefly summarized, an apparatus and method for processing load requests are disclosed. In response to a load request specifying a data item to be retrieved from memory, a series of data items including the data item identified by the load request are retrieved. The load request is buffered before the load request is executed. The merge circuit determines whether the address proximity condition is true for the load request and a set of one or more other load requests buffered in pending load buffer circuitry. The address proximity condition is true when all data items identified by the set of one or more other load requests are included within the series of data items. When the address proximity condition is true, the set of one or more other load requests is suppressed. The merge prediction circuit generates a merge prediction for each load request based on previous processing of the load request by the merge circuit.

In this application, the word "configured to" is used to mean that an element of an apparatus has a configuration capable of performing a defined operation. In this context, "configuration" refers to an arrangement or interconnection of hardware or software. For example, the device may have dedicated hardware that provides the defined operations, or a processor or other processing device may be programmed to perform the functions. "configured to" does not mean that the device element needs to be changed in any way in order to provide the defined operation.

Although illustrative embodiments have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes, additions and modifications may be effected therein by one skilled in the art without departing from the scope of the invention as defined in the appended claims. For example, various combinations of the features of the dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.
